Funding: Supported by the National Science and Technology Major Project (Grant No. 2021ZD0112902), the National Natural Science Foundation of China (Grant Nos. 623B2057 and 62220106003), the Tsinghua University Initiative Scientific Research Program, and the Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology.
Abstract: Masked autoencoders (MAEs) have recently achieved great success in computer vision. They can automatically extract representations from unlabeled data and improve the performance of various downstream tasks. However, training an MAE model requires substantial resources, which limits its accessibility: many academic institutions, particularly university laboratories, lack the necessary resources. This issue significantly hinders the development of the field. In this paper, we propose FastMAE, an efficient MAE approach. Inspired by the idea of offline tokenizers in natural language processing, FastMAE presents a novel way to build an offline vision tokenizer, which provides high-level semantics efficiently. Benefiting from the offline tokenizer, FastMAE becomes an efficient vision learner. Our experiments demonstrate that FastMAE achieves 83.6% accuracy with ViT-B in only 18.8 h on 8 NVIDIA Tesla V100 GPUs, 31.3× faster than the original MAE, providing a resource-friendly baseline for the computer vision community. Moreover, it achieves performance comparable to state-of-the-art methods. We hope our research will attract more people to MAE-related research so that we can advance its development together.
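To make the offline-tokenizer idea concrete, the following is a minimal hypothetical sketch, not the paper's actual design: patch tokens for the whole dataset are computed once, before training, so the tokenizer never runs inside the training loop. The nearest-codebook "tokenizer", the patch size, and all names here are illustrative assumptions.

```python
# Illustrative sketch (assumed, not FastMAE's real pipeline): tokenize
# every image offline, then train by predicting the cached tokens of
# masked patches. A k-means-style codebook stands in for a learned
# vision tokenizer.
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, patch=4):
    """Split an HxW image into flattened patch vectors."""
    h, w = img.shape
    p = img.reshape(h // patch, patch, w // patch, patch)
    return p.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def offline_tokenize(patches, codebook):
    """Assign each patch its nearest codebook entry (a discrete token)."""
    d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

# "Offline" phase: run the tokenizer once over the whole dataset.
images = rng.normal(size=(16, 16, 16))          # 16 toy 16x16 images
codebook = rng.normal(size=(32, 16))            # 32 tokens of dim 4*4
token_cache = [offline_tokenize(patchify(im), codebook) for im in images]

# "Training" phase: mask 75% of patches; target tokens come straight
# from the cache, so no tokenizer forward pass is needed per step.
def masked_targets(tokens, mask_ratio=0.75):
    n = len(tokens)
    masked = rng.choice(n, size=int(n * mask_ratio), replace=False)
    return masked, tokens[masked]

masked_idx, targets = masked_targets(token_cache[0])
print(len(token_cache[0]), len(masked_idx))     # → 16 12
```

The design point this illustrates is the cost split: the per-step work is a cache lookup and a mask draw, while whatever the tokenizer costs is paid once, amortized over all training epochs.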