Funding: Supported in part by the National Natural Science Foundation of China (No. 62176041) and in part by the Excellent Science and Technique Talent Foundation of Dalian (No. 2022RY21).
Abstract: Significant advancements have been witnessed in visual tracking applications in recent years, mainly due to the formidable modeling capabilities of the Vision Transformer (ViT). However, the strong performance of such trackers relies heavily on ViT models pretrained for long periods, limiting more flexible model designs for tracking tasks. To address this issue, we propose TrackMAE, an efficient unsupervised ViT pretraining method for tracking based on masked autoencoders. During pretraining, we employ two shared-parameter ViTs, serving as the appearance encoder and the motion encoder, respectively. The appearance encoder encodes randomly masked image data, while the motion encoder encodes randomly masked pairs of video frames. An appearance decoder and a motion decoder then separately reconstruct the original image data and video-frame data at the pixel level. In this way, the ViT learns to understand both the appearance of images and the motion between video frames simultaneously. Experimental results demonstrate that ViT-Base and ViT-Large models pretrained with TrackMAE and combined with a simple tracking head achieve state-of-the-art (SOTA) performance without additional design. Moreover, compared to the currently popular MAE pretraining methods, TrackMAE consumes only 1/5 of the training time, which facilitates the customization of diverse models for tracking. For instance, we additionally customize a lightweight ViT-XS, which achieves SOTA efficient-tracking performance.
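The pretraining scheme above (a shared-parameter encoder feeding separate appearance and motion decoders that reconstruct masked pixels) can be sketched in a few lines. This is a minimal NumPy illustration, not the TrackMAE implementation: the ViT is stood in for by a single tanh layer, the decoder's attention by mean pooling, and all sizes are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16                                     # toy patch dimension
W_enc = 0.1 * rng.normal(size=(D, D))      # shared-parameter "ViT" (one layer)
W_dec_app = 0.1 * rng.normal(size=(D, D))  # appearance decoder
W_dec_mot = 0.1 * rng.normal(size=(D, D))  # motion decoder

def recon_loss(patches, W_dec, mask_ratio=0.75):
    """MAE-style loss: encode the visible patches, reconstruct the masked ones."""
    n = patches.shape[0]
    order = rng.permutation(n)
    n_keep = int(n * (1 - mask_ratio))
    keep, masked = order[:n_keep], order[n_keep:]
    latent = np.tanh(patches[keep] @ W_enc)   # shared encoder on visible patches
    context = latent.mean(axis=0)             # crude stand-in for decoder attention
    recon = np.tile(context @ W_dec, (len(masked), 1))  # predict masked pixels
    return float(np.mean((recon - patches[masked]) ** 2))

image_patches = rng.normal(size=(64, D))        # one randomly masked image
frame_pair_patches = rng.normal(size=(128, D))  # two concatenated video frames

# Joint objective: both branches share the encoder, each has its own decoder.
loss = recon_loss(image_patches, W_dec_app) + recon_loss(frame_pair_patches, W_dec_mot)
print(loss > 0)
```

The key property the sketch preserves is that only `W_enc` is shared between the appearance and motion branches, so one encoder is forced to serve both reconstruction tasks.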
Funding: The Project of Introducing Urgently Needed Talents in Key Supporting Regions of Shandong Province, China (No. SDJQP20221805).
Abstract: Deep convolutional neural networks (DCNNs) are widely used in content-based image retrieval (CBIR) because of their advantages in image feature extraction. However, training deep neural networks requires a large amount of labeled data, which limits their application. Self-supervised learning is a more general approach in unlabeled scenarios. A method for fine-tuning feature extraction networks based on masked learning is proposed: a masked autoencoder (MAE) is used to fine-tune the vision transformer (ViT) model. In addition, the scheme for extracting image descriptors is discussed. The encoder of the MAE uses the ViT to extract global features and performs self-supervised fine-tuning by reconstructing the pixels of masked areas. The method works well on category-level image retrieval datasets, with marked improvements on instance-level datasets. For the instance-level datasets Oxford5k and Paris6k, the retrieval accuracy of the base model is improved by 7% and 17%, respectively, compared to that of the original model.
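The descriptor-extraction step can be illustrated independently of the MAE fine-tuning. Below is a hypothetical sketch, not the paper's exact scheme: patch features from an encoder are mean-pooled into a single L2-normalised global descriptor, and retrieval is nearest-neighbour search by cosine similarity. The feature shapes and the pooling choice are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def global_descriptor(patch_feats):
    """Mean-pool per-patch features into one global vector, then L2-normalise."""
    v = patch_feats.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-12)

# Toy database: 5 images, each represented by 49 patch features of dim 32.
db_feats = [rng.normal(size=(49, 32)) for _ in range(5)]
db = np.stack([global_descriptor(f) for f in db_feats])

# Query: a lightly perturbed copy of image 3 (e.g. a re-photographed instance).
query = global_descriptor(db_feats[3] + 0.01 * rng.normal(size=(49, 32)))
scores = db @ query                  # cosine similarity (vectors are unit-norm)
print(int(np.argmax(scores)))        # → 3: the matching instance ranks first
```

Because the descriptors are unit-normalised, a plain dot product is the cosine similarity, which keeps the retrieval step a single matrix-vector multiply over the whole database.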
Abstract: Objective To develop and validate an artificial intelligence (AI) diagnostic model for coronary artery disease based on facial photos. Methods This was a cross-sectional study. Patients scheduled to undergo coronary angiography (CAG) at Beijing Anzhen Hospital and Beijing Daxing Hospital from August 2022 to November 2023 were included consecutively. Before CAG, facial photos were collected from four angles: frontal view, left and right 60° profile, and top of the head. The photo dataset was randomly divided into training and validation sets (70%) and a testing set (30%). The model was constructed using Masked Autoencoder (MAE) and Vision Transformer (ViT) architectures. First, the model base was pretrained using 2 million facial photos from the publicly available VGGFace dataset and fine-tuned on the training and validation sets; the model was then validated on the test set. In addition, the ResNet architecture was used to process the dataset, and its outputs were compared with those of the models based on MAE and ViT. In the test set, the area under the receiver operating characteristic curve (AUC) of the AI model was calculated using CAG results as the gold standard. Results A total of 5974 participants aged 61 (54, 67) years were included, including 4179 males (70.0%), with a total of 84964 facial photos. There were 79140 facial photos in the training and validation sets, with 3822 patients with coronary artery disease; there were 5824 facial photos in the test set, with 239 patients with coronary artery disease. The AUC values of the MAE and ViT models initialized with pretrained weights were 0.841 and 0.824, respectively. The AUC of the ResNet model initialized with random weights was 0.810, while the AUC of the ResNet model initialized with pretrained weights was 0.816. Conclusion The AI model based on facial photos shows good diagnostic performance for coronary artery disease and holds promise for further application in early diagnosis.
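The evaluation metric used above, the area under the ROC curve with CAG as the gold standard, can be computed directly from the pairwise rank statistic: the fraction of (diseased, healthy) pairs that the model scores in the correct order. A small self-contained sketch with made-up labels and scores, not the study's data:

```python
import numpy as np

def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) formula: fraction of correctly
    ordered positive/negative pairs, with ties counted as half."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = [1, 1, 0, 0, 1, 0]                  # hypothetical CAG gold-standard labels
s = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]      # hypothetical model probabilities
print(round(float(auc(y, s)), 3))       # → 0.889 (8 of 9 pairs ordered correctly)
```

An AUC of 0.841, as reported for the MAE model, means a randomly chosen diseased patient outranks a randomly chosen healthy one about 84% of the time.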
Abstract: Unlike existing fully-supervised approaches, we rethink colorectal polyp segmentation from an out-of-distribution perspective with a simple but effective self-supervised learning approach. We leverage the ability of masked autoencoders (self-supervised vision transformers trained on a reconstruction task) to learn in-distribution representations, here the distribution of healthy colon images. We then perform out-of-distribution reconstruction and inference, with feature-space standardisation to align the latent distribution of the diverse abnormal samples with the statistics of the healthy samples. We generate per-pixel anomaly scores for each image by calculating the difference between the input and reconstructed images, and use this signal for out-of-distribution (i.e., polyp) segmentation. Experimental results on six benchmarks show that our model has excellent segmentation performance and generalises across datasets. Our code is publicly available at https://github.com/GewelsJI/Polyp-OOD.
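The per-pixel scoring step can be sketched as follows: the absolute difference between an input and its reconstruction is the anomaly map, and thresholding that map yields the segmentation mask. The toy image, the idealised reconstruction, and the threshold value below are all assumptions for illustration; the paper's feature-space standardisation is omitted.

```python
import numpy as np

rng = np.random.default_rng(2)

def anomaly_map(image, reconstruction):
    """Per-pixel anomaly score: absolute reconstruction error."""
    return np.abs(image - reconstruction)

# Toy input: mostly "healthy" tissue near 0.5, with one bright blob (the polyp).
image = rng.normal(0.5, 0.01, size=(32, 32))
image[10:16, 10:16] += 0.8

# A model trained only on healthy images reconstructs in-distribution content,
# so the blob is missing from the reconstruction and shows up as high error.
recon = np.full((32, 32), 0.5)

scores = anomaly_map(image, recon)
mask = scores > 0.4                  # threshold the scores into a binary mask
print(bool(mask[12, 12]), bool(mask[0, 0]))   # → True False
```

The out-of-distribution region is exactly where the healthy-only model fails to reconstruct, which is what turns a reconstruction objective into a segmentation signal.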