Abstract: In the field of intelligent air combat, real-time and accurate recognition of within-visual-range (WVR) maneuver actions serves as the foundational cornerstone for constructing autonomous decision-making systems. However, existing methods face two major challenges: traditional feature engineering suffers from insufficient effective dimensionality in the feature space due to kinematic coupling, making it difficult to distinguish essential differences between maneuvers, while end-to-end deep learning models lack controllability in implicit feature learning and fail to model high-order long-range temporal dependencies. This paper proposes a trajectory feature pre-extraction method based on a Long-range Masked Autoencoder (LMAE), incorporating three key innovations: (1) Random Fragment High-ratio Masking (RFH-Mask), which forces the model to learn long-range temporal correlations by masking 80% of the trajectory data while retaining continuous fragments; (2) a Kalman Filter-Guided Objective Function (KFG-OF), which integrates trajectory continuity constraints to align the feature space with kinematic principles; and (3) a two-stage decoupled architecture, enabling efficient and controllable feature learning through unsupervised pre-training and frozen-feature transfer. Experimental results demonstrate that LMAE significantly improves the average recognition accuracy for 20-class maneuvers compared to traditional end-to-end models, while markedly accelerating convergence. The contributions of this work lie in introducing high-masking-rate autoencoders into low-information-density trajectory analysis, proposing a feature engineering framework with enhanced controllability and efficiency, and providing a novel technical pathway for intelligent air combat decision-making systems.
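As a rough illustration of the RFH-Mask idea, the Python/NumPy sketch below keeps a few contiguous fragments of a trajectory visible and masks the remaining ~80% of timesteps. The fragment length, the fragment-sampling rule, and all names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rfh_mask(traj, mask_ratio=0.8, fragment_len=8, rng=None):
    """Random Fragment High-ratio Masking (sketch).

    Keeps a few contiguous fragments of the trajectory visible and masks
    the remaining ~`mask_ratio` of timesteps, so the encoder must bridge
    long gaps between the visible fragments.
    traj: (T, D) array of state vectors (e.g., position, velocity, attitude).
    Returns a boolean visibility mask of shape (T,): True = visible.
    """
    rng = rng or np.random.default_rng()
    T = traj.shape[0]
    n_visible = max(fragment_len, int(round(T * (1.0 - mask_ratio))))
    n_fragments = max(1, n_visible // fragment_len)
    visible = np.zeros(T, dtype=bool)
    # Fragments may overlap, so the visible fraction is at most the target.
    starts = rng.choice(T - fragment_len + 1, size=n_fragments, replace=False)
    for s in starts:
        visible[s:s + fragment_len] = True
    return visible

# Example: a 200-step trajectory with 6-D states; ~20% stays visible.
traj = np.random.randn(200, 6)
vis = rfh_mask(traj, mask_ratio=0.8)
print(vis.mean())  # fraction of visible timesteps, roughly 0.2
```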
Funding: Supported in part by the National Natural Science Foundation of China (No. 62176041) and in part by the Excellent Science and Technique Talent Foundation of Dalian (No. 2022RY21).
Abstract: Significant advancements have been witnessed in visual tracking applications leveraging the Vision Transformer (ViT) in recent years, mainly due to its formidable modeling capabilities. However, the strong performance of such trackers relies heavily on ViT models pretrained for long periods, limiting more flexible model designs for tracking tasks. To address this issue, we propose an efficient unsupervised ViT pretraining method for the tracking task based on masked autoencoders, called TrackMAE. During pretraining, we employ two shared-parameter ViTs, serving as the appearance encoder and the motion encoder, respectively. The appearance encoder encodes randomly masked image data, while the motion encoder encodes randomly masked pairs of video frames. Subsequently, an appearance decoder and a motion decoder separately reconstruct the original image data and video frame data at the pixel level. In this way, the ViT learns to understand both the appearance of images and the motion between video frames simultaneously. Experimental results demonstrate that ViT-Base and ViT-Large models, pretrained with TrackMAE and combined with a simple tracking head, achieve state-of-the-art (SOTA) performance without additional design. Moreover, compared to the currently popular MAE pretraining methods, TrackMAE consumes only 1/5 of the training time, which facilitates the customization of diverse models for tracking. For instance, we additionally customize a lightweight ViT-XS, which achieves SOTA efficient tracking performance.
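The minimal sketch below shows the shared-parameter arrangement TrackMAE describes: a single encoder object serves as both the appearance and motion encoder, with separate pixel-reconstruction decoders, so both reconstruction losses update the same weights. The tiny transformer, token shapes, and random targets are placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for the ViT encoder; a real setup would use ViT blocks."""
    def __init__(self, dim=128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):            # tokens: (B, N_visible, dim)
        return self.blocks(tokens)

encoder = TinyEncoder()                   # one set of weights, used twice:
appearance_encoder = encoder              # ... for masked single images
motion_encoder = encoder                  # ... for masked video-frame pairs
appearance_decoder = nn.Linear(128, 16 * 16 * 3)   # per-patch pixel head
motion_decoder = nn.Linear(128, 16 * 16 * 3)

img_tokens = torch.randn(2, 49, 128)      # visible patches of a masked image
pair_tokens = torch.randn(2, 98, 128)     # visible patches of a masked frame pair
img_pixels = torch.randn(2, 49, 768)      # ground-truth patch pixels (random here)
pair_pixels = torch.randn(2, 98, 768)

loss = (F.mse_loss(appearance_decoder(appearance_encoder(img_tokens)), img_pixels)
        + F.mse_loss(motion_decoder(motion_encoder(pair_tokens)), pair_pixels))
loss.backward()                           # both losses update the shared encoder
```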
Funding: Supported by the Project of Introducing Urgently Needed Talents in Key Supporting Regions of Shandong Province, China (No. SDJQP20221805).
Abstract: Deep convolutional neural networks (DCNNs) are widely used in content-based image retrieval (CBIR) because of their advantages in image feature extraction. However, training deep neural networks requires a large amount of labeled data, which limits their application. Self-supervised learning is a more general approach for unlabeled scenarios. A method of fine-tuning feature extraction networks based on masked learning is proposed, in which masked autoencoders (MAE) are used to fine-tune the vision transformer (ViT) model. In addition, the scheme for extracting image descriptors is discussed. The encoder of the MAE uses the ViT to extract global features and performs self-supervised fine-tuning by reconstructing the pixels of masked areas. The method works well on category-level image retrieval datasets, with marked improvements on instance-level datasets. For the instance-level datasets Oxford5k and Paris6k, the retrieval accuracy of the base model is improved by 7% and 17%, respectively, compared with the original model.
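One common way to realise the descriptor scheme discussed above is to take the class token of the fine-tuned ViT encoder as a global feature and rank database images by cosine similarity. The sketch below is written under that assumption and is not necessarily the exact scheme the paper settles on.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_descriptor(encoder, images):
    """Global descriptor from a fine-tuned ViT encoder (assumed to return
    per-token features of shape (B, N, D) with the class token first)."""
    tokens = encoder(images)
    return F.normalize(tokens[:, 0], dim=-1)    # L2-normalised class token

def retrieve(query_desc, db_descs, k=5):
    """Rank database images by cosine similarity to the query descriptor."""
    sims = db_descs @ query_desc                # (M,) cosine similarities
    return torch.topk(sims, k).indices

# Toy usage with random, already-normalised descriptors:
db = F.normalize(torch.randn(1000, 768), dim=-1)
q = F.normalize(torch.randn(768), dim=-1)
print(retrieve(q, db))
```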
Abstract: Objective To develop and validate an artificial intelligence (AI) diagnostic model for coronary artery disease based on facial photos. Methods This was a cross-sectional study. Patients who were scheduled to undergo coronary angiography (CAG) at Beijing Anzhen Hospital and Beijing Daxing Hospital from August 2022 to November 2023 were included consecutively. Before CAG, facial photos were collected from four angles: frontal view, left and right 60° profile, and top of the head. The photo dataset was randomly divided into a training set and a validation set (70%), and a testing set (30%). The model was constructed using Masked Autoencoder (MAE) and Vision Transformer (ViT) architectures. First, the model base was pre-trained on 2 million facial photos from the publicly available VGGFace dataset and fine-tuned on the training and validation sets; the model was then evaluated on the test set. In addition, the ResNet architecture was used to process the dataset, and its outputs were compared with those of the models based on MAE and ViT. In the test set, the area under the receiver operating characteristic curve (AUC) of the AI model was calculated using the CAG results as the gold standard. Results A total of 5974 participants aged 61 (54, 67) years were included, including 4179 males (70.0%), with a total of 84964 facial photos. There were 79140 facial photos in the training and validation sets, with 3822 patients with coronary artery disease; there were 5824 facial photos in the test set, with 239 patients with coronary artery disease. The AUC values of the MAE and ViT models initialized with pre-trained weights were 0.841 and 0.824, respectively. The AUC of the ResNet model initialized with random weights was 0.810, while the AUC of the ResNet model initialized with pre-trained weights was 0.816. Conclusion The AI model based on facial photos shows good diagnostic performance for coronary artery disease and holds promise for further application in early diagnosis.
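Evaluating such a model against the CAG gold standard reduces to a standard ROC-AUC computation. The sketch below uses scikit-learn with made-up labels and scores purely to show the calculation, not the study's data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical example: y_true from CAG (1 = coronary artery disease),
# y_score = predicted disease probability from the facial-photo model.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.91, 0.20, 0.35, 0.66, 0.42, 0.80, 0.15, 0.55, 0.73, 0.30])
print(f"AUC = {roc_auc_score(y_true, y_score):.3f}")
```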
Funding: Supported in part by the National Key R&D Program of China (No. 2020AAA0108501).
Abstract: Knowledge graph (KG) representation learning aims to map entities and relations into a low-dimensional representation space and shows significant potential in many tasks. Existing approaches fall into two categories: (1) graph-based approaches encode KG elements into vectors using structural score functions; (2) text-based approaches embed text descriptions of entities and relations via pre-trained language models (PLMs), further fine-tuned with triples. We argue that graph-based approaches struggle with sparse data, while text-based approaches face challenges with complex relations. To address these limitations, we propose a unified Text-Augmented Attention-based Recurrent Network, bridging the gap between graphs and natural language. Specifically, we employ a graph attention network based on local influence weights to model local structural information, and utilize PLM-based prompt learning to learn textual information, enhanced by a mask-reconstruction strategy based on global influence weights and by textual contrastive learning for improved robustness and generalizability. In addition, to effectively model multi-hop relations, we propose a novel semantic-depth-guided path extraction algorithm and integrate cross-attention layers into recurrent neural networks, which facilitates learning long-term relation dependencies and offers an adaptive attention mechanism for varied-length information. Extensive experiments demonstrate that our model outperforms existing models on KG completion and question-answering tasks.
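To make the idea of attention-weighted neighbour aggregation concrete, here is a generic additive-attention layer over an entity's (relation-conditioned) neighbour features. The scoring function and shapes are stand-in assumptions for illustration, not the authors' exact local-influence-weight formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeighborAttention(nn.Module):
    """Minimal attention over an entity's neighbour features (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)   # additive attention scorer

    def forward(self, entity, neighbors):
        # entity: (D,); neighbors: (K, D) relation-conditioned features
        pairs = torch.cat([entity.expand_as(neighbors), neighbors], dim=-1)
        alpha = F.softmax(self.score(pairs).squeeze(-1), dim=0)  # influence weights
        return alpha @ neighbors             # attention-weighted aggregation

att = NeighborAttention(dim=64)
out = att(torch.randn(64), torch.randn(5, 64))   # aggregated (64,) feature
```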
Abstract: Unlike existing fully-supervised approaches, we rethink colorectal polyp segmentation from an out-of-distribution perspective with a simple but effective self-supervised learning approach. We leverage the ability of masked autoencoders (self-supervised vision transformers trained on a reconstruction task) to learn in-distribution representations, here the distribution of healthy colon images. We then perform out-of-distribution reconstruction and inference, with feature-space standardisation to align the latent distribution of the diverse abnormal samples with the statistics of the healthy samples. We generate per-pixel anomaly scores for each image by calculating the difference between the input and reconstructed images, and use this signal for out-of-distribution (i.e., polyp) segmentation. Experimental results on six benchmarks show that our model achieves excellent segmentation performance and generalises across datasets. Our code is publicly available at https://github.com/GewelsJI/Polyp-OOD.
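The anomaly-scoring step is straightforward to sketch: reconstruct the image with the healthy-image MAE, take the per-pixel reconstruction error, and threshold it into a polyp mask. The per-image normalisation and fixed threshold below are illustrative choices, not the paper's exact post-processing.

```python
import torch

@torch.no_grad()
def anomaly_map(mae, image, threshold=0.5):
    """Per-pixel anomaly score from reconstruction error (sketch).

    `mae` is assumed to be a masked autoencoder trained only on healthy
    colon images that returns a full-image reconstruction (B, C, H, W).
    """
    recon = mae(image)
    score = (image - recon).abs().mean(dim=1)   # (B, H, W) per-pixel error
    # Normalise per image so scores are comparable across inputs.
    s_min = score.amin(dim=(1, 2), keepdim=True)
    s_max = score.amax(dim=(1, 2), keepdim=True)
    score = (score - s_min) / (s_max - s_min + 1e-8)
    return score, score > threshold             # scores and binary polyp mask
```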