Funding: This work was supported by the National Natural Science Foundation of China (62276092, 62303167); the Postdoctoral Fellowship Program (Grade C) of the China Postdoctoral Science Foundation (GZC20230707); the Key Science and Technology Program of Henan Province, China (242102211051, 242102211042, 212102310084); the Key Scientific Research Projects of Colleges and Universities in Henan Province, China (25A520009); the China Postdoctoral Science Foundation (2024M760808); and the Henan Province Medical Science and Technology Research Plan Joint Construction Project (LHGJ2024069).
Abstract: Feature fusion is an important technique in medical image classification that can improve diagnostic accuracy by integrating complementary information from multiple sources. Recently, Deep Learning (DL) has been widely used in pulmonary disease diagnosis, such as for pneumonia and tuberculosis. However, traditional feature fusion methods often suffer from feature disparity, information loss, redundancy, and increased complexity, hindering the further extension of DL algorithms. To solve this problem, we propose a Graph-Convolution Fusion Network with Self-Supervised Feature Alignment (Self-FAGCFN), which addresses these limitations in deep-learning-based medical image classification for respiratory diseases such as pneumonia and tuberculosis. The network integrates Convolutional Neural Networks (CNNs) for robust feature extraction from two-dimensional grid structures and Graph Convolutional Networks (GCNs) within a Graph Neural Network branch to capture graph-structured features, focusing on significant node representations. Additionally, an Attention-Embedding Ensemble Block is included to capture critical features from the GCN outputs. To ensure effective feature alignment between the pre- and post-fusion stages, we introduce a feature alignment loss that minimizes their disparity. Moreover, to address two remaining limitations, namely inappropriate centroid discrepancies during feature alignment and class imbalance in the dataset, we develop a Feature-Centroid Fusion (FCF) strategy and a Multi-Level Feature-Centroid Update (MLFCU) algorithm, respectively. Extensive experiments on the public LungVision and Chest-Xray datasets demonstrate that Self-FAGCFN significantly outperforms existing methods in diagnosing pneumonia and tuberculosis, highlighting its potential for practical medical applications.
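As one concrete reading of the alignment and centroid ideas in this abstract, the sketch below combines an L2 discrepancy between pre- and post-fusion features with exponential-moving-average class centroids. The specific loss form, the momentum update, and the names (`CentroidAlignmentLoss`, `update_centroids`) are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of a pre-/post-fusion feature alignment loss with
# EMA class centroids; the L2 terms and momentum update are assumptions.
import torch
import torch.nn.functional as F

class CentroidAlignmentLoss(torch.nn.Module):
    def __init__(self, num_classes: int, feat_dim: int, momentum: float = 0.9):
        super().__init__()
        self.momentum = momentum
        # One running centroid per class, updated from post-fusion features.
        self.register_buffer("centroids", torch.zeros(num_classes, feat_dim))

    @torch.no_grad()
    def update_centroids(self, feats: torch.Tensor, labels: torch.Tensor):
        # EMA update of each class centroid from the current mini-batch.
        for c in labels.unique():
            batch_mean = feats[labels == c].mean(dim=0)
            self.centroids[c] = (self.momentum * self.centroids[c]
                                 + (1.0 - self.momentum) * batch_mean)

    def forward(self, pre_fusion: torch.Tensor, post_fusion: torch.Tensor,
                labels: torch.Tensor) -> torch.Tensor:
        # Term 1: keep pre- and post-fusion features close (alignment).
        align = F.mse_loss(post_fusion, pre_fusion)
        # Term 2: pull post-fusion features toward their class centroid.
        centroid_pull = F.mse_loss(post_fusion, self.centroids[labels])
        self.update_centroids(post_fusion.detach(), labels)
        return align + centroid_pull
```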
Funding: This work was supported by the National Natural Science Foundation of China (Grant No. 62062001) and the Ningxia Youth Top Talent Project (2021).
Abstract: In the realm of data privacy protection, federated learning aims to collaboratively train a global model. However, heterogeneous data across clients presents challenges, often resulting in slow convergence and inadequate accuracy of the global model. Utilizing shared feature representations alongside customized classifiers for individual clients has emerged as a promising personalized solution. Nonetheless, previous research has frequently neglected both the integration of global knowledge into local representation learning and the synergy between global and local classifiers, thereby limiting model performance. To tackle these issues, this study proposes a hierarchical optimization method for federated learning with feature alignment and fusion of classification decisions (FedFCD). FedFCD regularizes the relationship between global and local feature representations to achieve alignment, and it incorporates decision information from the global classifier, enabling late fusion of the decision outputs of the global and local classifiers. Additionally, FedFCD employs a hierarchical optimization strategy to flexibly optimize model parameters. Through experiments on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets, we demonstrate the effectiveness and superiority of FedFCD. For instance, on the CIFAR-100 dataset, FedFCD improved average test accuracy by 6.83% over four leading personalized federated learning approaches. Furthermore, extended experiments confirm the robustness of FedFCD across various hyperparameter values.
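The two mechanisms the abstract names, feature alignment against the global representation and late fusion of classifier decisions, can be sketched as a single local training loss. The L2 regularizer, the fixed fusion weight `alpha`, and the module names below are illustrative assumptions rather than FedFCD's exact design.

```python
# A minimal sketch of (i) aligning local features to the frozen global
# representation and (ii) late-fusing global and local classifier outputs.
import torch
import torch.nn.functional as F

def local_training_loss(x, y, local_encoder, local_head, global_encoder,
                        global_head, lam: float = 0.1, alpha: float = 0.5):
    z_local = local_encoder(x)                      # personalized features
    with torch.no_grad():                           # global model is fixed
        z_global = global_encoder(x)                # shared features
        logits_global = global_head(z_global)       # global decision
    logits_local = local_head(z_local)              # local decision
    # Late fusion of the two classifiers' decisions.
    logits_fused = alpha * logits_global + (1 - alpha) * logits_local
    # Feature alignment regularizer between local and global representations.
    align = F.mse_loss(z_local, z_global)
    return F.cross_entropy(logits_fused, y) + lam * align
```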
Funding: This work was supported by the Fundamental Research Funds for the Central Universities, China (No. 2232021A-10); the National Natural Science Foundation of China (No. 61903078); the Shanghai Sailing Program, China (No. 22YF1401300); and the Natural Science Foundation of Shanghai, China (No. 20ZR1400400).
Abstract: Video classification is an important task in video understanding and plays a pivotal role in intelligent monitoring of information content. Most existing methods do not consider the multimodal nature of video, and their modality fusion tends to be overly simple, often neglecting modality alignment before fusion. This research introduces a novel dual-stream multimodal alignment and fusion network, named DMAFNet, for classifying short videos. The network uses two unimodal encoder modules to extract features within each modality and exploits a multimodal encoder module to learn interactions between modalities. To solve the modality alignment problem, contrastive learning is introduced between the two unimodal encoder modules. Additionally, masked language modeling (MLM) and video-text matching (VTM) auxiliary tasks are introduced to improve the interaction between video frames and text through backpropagation of their loss functions. Diverse experiments prove the efficiency of DMAFNet on multimodal video classification tasks. Compared with two other mainstream baselines, DMAFNet achieves the best results on the 2022 WeChat Big Data Challenge dataset.
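Contrastive alignment between two unimodal encoders is commonly implemented as a symmetric InfoNCE loss over a batch of matched pairs; the sketch below shows that standard form. The temperature value and the absence of projection heads are simplifying assumptions, since the abstract does not specify DMAFNet's exact design.

```python
# A minimal symmetric InfoNCE sketch for aligning video and text
# embeddings before fusion; matched pairs sit on the diagonal.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    # Normalize so dot products are cosine similarities.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Contrast in both directions: video-to-text and text-to-video.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```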
Funding: This work was supported by the National Key Research and Development Program of China in the 14th Five-Year Plan period (Nos. 2021YFF0602103 and 2021YFF0602102).
Abstract: RGB-Infrared person re-IDentification (re-ID) aims to match RGB and infrared (IR) images of the same person. However, the modality discrepancy between RGB and IR images poses a significant challenge for re-ID. To address this issue, this paper proposes a Proxy-based Embedding Alignment (PEA) method that aligns the RGB and IR modalities in the embedding space. PEA introduces modality-specific identity proxies and leverages sample-to-proxy relations to learn the model. Specifically, PEA focuses on three types of alignment: intra-modality alignment, inter-modality alignment, and cycle alignment. Intra-modality alignment aligns sample features and proxies of the same identity within a modality. Inter-modality alignment aligns sample features and proxies of the same identity across different modalities. Cycle alignment requires that a proxy be aligned with itself after tracing it along a cross-modality cycle (e.g., IR→RGB→IR). By integrating these alignments into the training process, PEA effectively mitigates the impact of modality discrepancy and learns discriminative features across modalities. We conduct extensive experiments on several RGB-IR re-ID datasets, and the results show that PEA outperforms current state-of-the-art methods. Notably, on the SYSU-MM01 dataset, PEA achieves 71.0% mAP under the multi-shot setting of the indoor-search protocol, surpassing the best-performing method by 7.2%.
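The intra- and inter-modality terms can be sketched with learnable modality-specific proxies and a proxy-NCA-style classification loss over them. The cross-entropy form, the scale factor, and the class names below are illustrative assumptions; the paper's exact losses and its cycle alignment term are not reproduced here.

```python
# A minimal sketch of sample-to-proxy alignment with modality-specific
# identity proxies (one per identity for RGB and for IR).
import torch
import torch.nn.functional as F

class ModalityProxies(torch.nn.Module):
    def __init__(self, num_ids: int, feat_dim: int):
        super().__init__()
        # Learnable proxies: one per identity in each modality.
        self.rgb = torch.nn.Parameter(torch.randn(num_ids, feat_dim))
        self.ir = torch.nn.Parameter(torch.randn(num_ids, feat_dim))

    def forward(self, feats: torch.Tensor, ids: torch.Tensor,
                modality: str, scale: float = 16.0) -> torch.Tensor:
        own = self.rgb if modality == "rgb" else self.ir
        other = self.ir if modality == "rgb" else self.rgb
        f = F.normalize(feats, dim=-1)
        # Intra-modality: classify samples against same-modality proxies.
        intra = F.cross_entropy(scale * f @ F.normalize(own, dim=-1).T, ids)
        # Inter-modality: also pull samples toward the same identity's
        # proxy in the other modality.
        inter = F.cross_entropy(scale * f @ F.normalize(other, dim=-1).T, ids)
        return intra + inter
```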
Funding: This work was supported by the National Key Research and Development Program of China (Nos. 2021YFC2009200 and 2023YFC3606100) and the Special Project of Technological Innovation and Application Development of Chongqing, China (No. cstc2019jscx-msxmX0167).
Abstract: Due to factors such as motion blur, video defocus, and occlusion, multi-frame human pose estimation is a challenging task. Exploiting temporal consistency between consecutive frames is an effective approach to this problem. Currently, most methods explore temporal consistency by refining the final heatmaps. The heatmaps carry the semantic information of keypoints and can improve detection quality to a certain extent. However, heatmaps are generated from features, and feature-level refinement is rarely considered. In this paper, we propose a human pose estimation framework with refinement at both the feature and semantics levels. At the feature level, we align auxiliary features with the features of the current frame to reduce the loss caused by differing feature distributions, and an attention mechanism then fuses the auxiliary features with the current features. At the semantics level, we use the difference information between adjacent heatmaps as auxiliary features to refine the current heatmaps. The method is validated on the large-scale benchmark datasets PoseTrack2017 and PoseTrack2018, and the results demonstrate its effectiveness.
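The align-then-attend pattern at the feature level can be illustrated with a very small module: project auxiliary (neighboring-frame) features toward the current frame's feature distribution, then blend the two with a learned per-pixel gate. Real feature alignment in such frameworks typically involves more than a 1x1 convolution (e.g., spatial warping), so the module below is only a structural sketch with assumed names.

```python
# A minimal sketch of fusing auxiliary-frame features into the current
# frame's features via a learned gate after a 1x1 "alignment" projection.
import torch

class AlignAttendFuse(torch.nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Project auxiliary features toward the current feature distribution.
        self.align = torch.nn.Conv2d(channels, channels, kernel_size=1)
        # Predict a per-pixel gate from the concatenated features.
        self.gate = torch.nn.Sequential(
            torch.nn.Conv2d(2 * channels, 1, kernel_size=1),
            torch.nn.Sigmoid(),
        )

    def forward(self, current: torch.Tensor, auxiliary: torch.Tensor):
        aux = self.align(auxiliary)
        g = self.gate(torch.cat([current, aux], dim=1))
        # Gated blend: g -> trust auxiliary, (1 - g) -> keep current.
        return (1 - g) * current + g * aux
```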
Funding: This work was supported by the National Program on Key Basic Research Project (No. 2014CB744903); the National Natural Science Foundation of China (Nos. 61673270 and 61973212); and the Key Technology Research Program of the Sichuan Provincial Department of Science and Technology (No. 2020YFSY0027).
Abstract: Multiple object tracking (MOT) in unmanned aerial vehicle (UAV) videos has attracted increasing attention. Because of the UAV's observation perspective, object scale changes dramatically and objects are relatively small. Moreover, most MOT algorithms for UAV videos cannot run in real time because of the tracking-by-detection paradigm. We propose a feature-aligned attention network (FAANet), which mainly consists of a channel and spatial attention module and a feature-aligned aggregation module. We also improve real-time performance using the joint-detection-embedding paradigm and a structural re-parameterization technique. We validate the effectiveness of FAANet with extensive experiments on a UAV detection and tracking benchmark, achieving new state-of-the-art results of 44.0 MOTA and 64.6 IDF1 at 38.24 frames per second on a single 1080Ti graphics processing unit.
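A "channel and spatial attention module" is often realized in the CBAM style: reweight channels from pooled descriptors, then reweight spatial positions from channel-wise statistics. The sketch below shows that common pattern; the reduction ratio, kernel size, and class name are assumptions, since the abstract does not specify FAANet's exact module.

```python
# A minimal CBAM-style channel-and-spatial attention block, sketched as
# one plausible form of FAANet's attention module.
import torch

class ChannelSpatialAttention(torch.nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP (as 1x1 convs) for avg- and max-pooled descriptors.
        self.channel_mlp = torch.nn.Sequential(
            torch.nn.Conv2d(channels, channels // reduction, 1),
            torch.nn.ReLU(inplace=True),
            torch.nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial_conv = torch.nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention from globally pooled descriptors.
        avg = x.mean(dim=(2, 3), keepdim=True)
        mx = x.amax(dim=(2, 3), keepdim=True)
        x = x * torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        # Spatial attention from channel-wise mean and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(s))
```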