With the rapid expansion of social media,analyzing emotions and their causes in texts has gained significant importance.Emotion-cause pair extraction enables the identification of causal relationships between emotions...With the rapid expansion of social media,analyzing emotions and their causes in texts has gained significant importance.Emotion-cause pair extraction enables the identification of causal relationships between emotions and their triggers within a text,facilitating a deeper understanding of expressed sentiments and their underlying reasons.This comprehension is crucial for making informed strategic decisions in various business and societal contexts.However,recent research approaches employing multi-task learning frameworks for modeling often face challenges such as the inability to simultaneouslymodel extracted features and their interactions,or inconsistencies in label prediction between emotion-cause pair extraction and independent assistant tasks like emotion and cause extraction.To address these issues,this study proposes an emotion-cause pair extraction methodology that incorporates joint feature encoding and task alignment mechanisms.The model consists of two primary components:First,joint feature encoding simultaneously generates features for emotion-cause pairs and clauses,enhancing feature interactions between emotion clauses,cause clauses,and emotion-cause pairs.Second,the task alignment technique is applied to reduce the labeling distance between emotion-cause pair extraction and the two assistant tasks,capturing deep semantic information interactions among tasks.The proposed method is evaluated on a Chinese benchmark corpus using 10-fold cross-validation,assessing key performance metrics such as precision,recall,and F1 score.Experimental results demonstrate that the model achieves an F1 score of 76.05%,surpassing the state-of-the-art by 1.03%.The proposed model exhibits significant improvements in emotion-cause pair extraction(ECPE)and cause extraction(CE)compared to existing methods,validating its effectiveness.This research introduces a novel approach based on joint feature encoding and task alignment mechanisms,contributing to advancements in emotion-cause pair extraction.However,the study’s limitation lies in the data sources,potentially restricting the generalizability of the findings.展开更多
Feature fusion is an important technique in medical image classification that can improve diagnostic accuracy by integrating complementary information from multiple sources.Recently,Deep Learning(DL)has been widely us...Feature fusion is an important technique in medical image classification that can improve diagnostic accuracy by integrating complementary information from multiple sources.Recently,Deep Learning(DL)has been widely used in pulmonary disease diagnosis,such as pneumonia and tuberculosis.However,traditional feature fusion methods often suffer from feature disparity,information loss,redundancy,and increased complexity,hindering the further extension of DL algorithms.To solve this problem,we propose a Graph-Convolution Fusion Network with Self-Supervised Feature Alignment(Self-FAGCFN)to address the limitations of traditional feature fusion methods in deep learning-based medical image classification for respiratory diseases such as pneumonia and tuberculosis.The network integrates Convolutional Neural Networks(CNNs)for robust feature extraction from two-dimensional grid structures and Graph Convolutional Networks(GCNs)within a Graph Neural Network branch to capture features based on graph structure,focusing on significant node representations.Additionally,an Attention-Embedding Ensemble Block is included to capture critical features from GCN outputs.To ensure effective feature alignment between pre-and post-fusion stages,we introduce a feature alignment loss that minimizes disparities.Moreover,to address the limitations of proposed methods,such as inappropriate centroid discrepancies during feature alignment and class imbalance in the dataset,we develop a Feature-Centroid Fusion(FCF)strategy and a Multi-Level Feature-Centroid Update(MLFCU)algorithm,respectively.Extensive experiments on public datasets LungVision and Chest-Xray demonstrate that the Self-FAGCFN model significantly outperforms existing methods in diagnosing pneumonia and tuberculosis,highlighting its potential for practical medical applications.展开更多
With the rapid growth of socialmedia,the spread of fake news has become a growing problem,misleading the public and causing significant harm.As social media content is often composed of both images and text,the use of...With the rapid growth of socialmedia,the spread of fake news has become a growing problem,misleading the public and causing significant harm.As social media content is often composed of both images and text,the use of multimodal approaches for fake news detection has gained significant attention.To solve the problems existing in previous multi-modal fake news detection algorithms,such as insufficient feature extraction and insufficient use of semantic relations between modes,this paper proposes the MFFFND-Co(Multimodal Feature Fusion Fake News Detection with Co-Attention Block)model.First,the model deeply explores the textual content,image content,and frequency domain features.Then,it employs a Co-Attention mechanism for cross-modal fusion.Additionally,a semantic consistency detectionmodule is designed to quantify semantic deviations,thereby enhancing the performance of fake news detection.Experimentally verified on two commonly used datasets,Twitter and Weibo,the model achieved F1 scores of 90.0% and 94.0%,respectively,significantly outperforming the pre-modified MFFFND(Multimodal Feature Fusion Fake News Detection with Attention Block)model and surpassing other baseline models.This improves the accuracy of detecting fake information in artificial intelligence detection and engineering software detection.展开更多
In the realm of data privacy protection,federated learning aims to collaboratively train a global model.However,heterogeneous data between clients presents challenges,often resulting in slow convergence and inadequate...In the realm of data privacy protection,federated learning aims to collaboratively train a global model.However,heterogeneous data between clients presents challenges,often resulting in slow convergence and inadequate accuracy of the global model.Utilizing shared feature representations alongside customized classifiers for individual clients emerges as a promising personalized solution.Nonetheless,previous research has frequently neglected the integration of global knowledge into local representation learning and the synergy between global and local classifiers,thereby limiting model performance.To tackle these issues,this study proposes a hierarchical optimization method for federated learning with feature alignment and the fusion of classification decisions(FedFCD).FedFCD regularizes the relationship between global and local feature representations to achieve alignment and incorporates decision information from the global classifier,facilitating the late fusion of decision outputs from both global and local classifiers.Additionally,FedFCD employs a hierarchical optimization strategy to flexibly optimize model parameters.Through experiments on the Fashion-MNIST,CIFAR-10 and CIFAR-100 datasets,we demonstrate the effectiveness and superiority of FedFCD.For instance,on the CIFAR-100 dataset,FedFCD exhibited a significant improvement in average test accuracy by 6.83%compared to four outstanding personalized federated learning approaches.Furthermore,extended experiments confirm the robustness of FedFCD across various hyperparameter values.展开更多
Background Cross-modal retrieval has attracted widespread attention in many cross-media similarity search applications,particularly image-text retrieval in the fields of computer vision and natural language processing...Background Cross-modal retrieval has attracted widespread attention in many cross-media similarity search applications,particularly image-text retrieval in the fields of computer vision and natural language processing.Recently,visual and semantic embedding(VSE)learning has shown promising improvements in image text retrieval tasks.Most existing VSE models employ two unrelated encoders to extract features and then use complex methods to contextualize and aggregate these features into holistic embeddings.Despite recent advances,existing approaches still suffer from two limitations:(1)without considering intermediate interactions and adequate alignment between different modalities,these models cannot guarantee the discriminative ability of representations;and(2)existing feature aggregators are susceptible to certain noisy regions,which may lead to unreasonable pooling coefficients and affect the quality of the final aggregated features.Methods To address these challenges,we propose a novel cross-modal retrieval model containing a well-designed alignment module and a novel multimodal fusion encoder that aims to learn the adequate alignment and interaction of aggregated features to effectively bridge the modality gap.Results Experiments on the Microsoft COCO and Flickr30k datasets demonstrated the superiority of our model over state-of-the-art methods.展开更多
Audio-visual scene classification(AVSC)poses a formidable challenge owing to the intricate spatial-temporal relationships exhibited by audio-visual signals,coupled with the complex spatial patterns of objects and text...Audio-visual scene classification(AVSC)poses a formidable challenge owing to the intricate spatial-temporal relationships exhibited by audio-visual signals,coupled with the complex spatial patterns of objects and textures found in visual images.The focus of recent studies has predominantly revolved around extracting features from diverse neural network structures,inadvertently neglecting the acquisition of semantically meaningful regions and crucial components within audio-visual data.The authors present a feature pyramid attention network(FPANet)for audio-visual scene understanding,which extracts semantically significant characteristics from audio-visual data.The authors’approach builds multi-scale hierarchical features of sound spectrograms and visual images using a feature pyramid representation and localises the semantically relevant regions with a feature pyramid attention module(FPAM).A dimension alignment(DA)strategy is employed to align feature maps from multiple layers,a pyramid spatial attention(PSA)to spatially locate essential regions,and a pyramid channel attention(PCA)to pinpoint significant temporal frames.Experiments on visual scene classification(VSC),audio scene classification(ASC),and AVSC tasks demonstrate that FPANet achieves performance on par with state-of-the-art(SOTA)approaches,with a 95.9 F1-score on the ADVANCE dataset and a relative improvement of 28.8%.Visualisation results show that FPANet can prioritise semantically meaningful areas in audio-visual signals.展开更多
To address the challenge of missing modal information in entity alignment and to mitigate information loss or bias arising frommodal heterogeneity during fusion,while also capturing shared information acrossmodalities...To address the challenge of missing modal information in entity alignment and to mitigate information loss or bias arising frommodal heterogeneity during fusion,while also capturing shared information acrossmodalities,this paper proposes a Multi-modal Pre-synergistic Entity Alignmentmodel based on Cross-modalMutual Information Strategy Optimization(MPSEA).The model first employs independent encoders to process multi-modal features,including text,images,and numerical values.Next,a multi-modal pre-synergistic fusion mechanism integrates graph structural and visual modal features into the textual modality as preparatory information.This pre-fusion strategy enables unified perception of heterogeneous modalities at the model’s initial stage,reducing discrepancies during the fusion process.Finally,using cross-modal deep perception reinforcement learning,the model achieves adaptive multilevel feature fusion between modalities,supporting learningmore effective alignment strategies.Extensive experiments on multiple public datasets show that the MPSEA method achieves gains of up to 7% in Hits@1 and 8.2% in MRR on the FBDB15K dataset,and up to 9.1% in Hits@1 and 7.7% in MRR on the FBYG15K dataset,compared to existing state-of-the-art methods.These results confirm the effectiveness of the proposed model.展开更多
Remote sensing cross-modal image-text retrieval(RSCIR)can flexibly and subjectively retrieve remote sensing images utilizing query text,which has received more researchers’attention recently.However,with the increasi...Remote sensing cross-modal image-text retrieval(RSCIR)can flexibly and subjectively retrieve remote sensing images utilizing query text,which has received more researchers’attention recently.However,with the increasing volume of visual-language pre-training model parameters,direct transfer learning consumes a substantial amount of computational and storage resources.Moreover,recently proposed parameter-efficient transfer learning methods mainly focus on the reconstruction of channel features,ignoring the spatial features which are vital for modeling key entity relationships.To address these issues,we design an efficient transfer learning framework for RSCIR,which is based on spatial feature efficient reconstruction(SPER).A concise and efficient spatial adapter is introduced to enhance the extraction of spatial relationships.The spatial adapter is able to spatially reconstruct the features in the backbone with few parameters while incorporating the prior information from the channel dimension.We conduct quantitative and qualitative experiments on two different commonly used RSCIR datasets.Compared with traditional methods,our approach achieves an improvement of 3%-11% in sumR metric.Compared with methods finetuning all parameters,our proposed method only trains less than 1% of the parameters,while maintaining an overall performance of about 96%.展开更多
Multimodal Aspect-Based Sentiment Analysis(MABSA)aims to detect sentiment polarity toward specific aspects by leveraging both textual and visual inputs.However,existing models suffer from weak aspectimage alignment,mo...Multimodal Aspect-Based Sentiment Analysis(MABSA)aims to detect sentiment polarity toward specific aspects by leveraging both textual and visual inputs.However,existing models suffer from weak aspectimage alignment,modality imbalance dominated by textual signals,and limited reasoning for implicit or ambiguous sentiments requiring external knowledge.To address these issues,we propose a unified framework named Gated-Linear Aspect-Aware Multimodal Sentiment Network(GLAMSNet).First of all,an input encoding module is employed to construct modality-specific and aspect-aware representations.Subsequently,we introduce an image–aspect correlation matching module to provide hierarchical supervision for visual-textual alignment.Building upon these components,we further design a Gated-Linear Aspect-Aware Fusion(GLAF)module to enhance aspect-aware representation learning by adaptively filtering irrelevant textual information and refining semantic alignment under aspect guidance.Additionally,an External Language Model Knowledge-Guided mechanism is integrated to incorporate sentimentaware prior knowledge from GPT-4o,enabling robust semantic reasoning especially under noisy or ambiguous inputs.Experimental studies conducted based on Twitter-15 and Twitter-17 datasets demonstrate that the proposed model outperforms most state-of-the-art methods,achieving 79.36%accuracy and 74.72%F1-score,and 74.31%accuracy and 72.01%F1-score,respectively.展开更多
Video classification is an important task in video understanding and plays a pivotal role in intelligent monitoring of information content.Most existing methods do not consider the multimodal nature of the video,and t...Video classification is an important task in video understanding and plays a pivotal role in intelligent monitoring of information content.Most existing methods do not consider the multimodal nature of the video,and the modality fusion approach tends to be too simple,often neglecting modality alignment before fusion.This research introduces a novel dual stream multimodal alignment and fusion network named DMAFNet for classifying short videos.The network uses two unimodal encoder modules to extract features within modalities and exploits a multimodal encoder module to learn interaction between modalities.To solve the modality alignment problem,contrastive learning is introduced between two unimodal encoder modules.Additionally,masked language modeling(MLM)and video text matching(VTM)auxiliary tasks are introduced to improve the interaction between video frames and text modalities through backpropagation of loss functions.Diverse experiments prove the efficiency of DMAFNet in multimodal video classification tasks.Compared with other two mainstream baselines,DMAFNet achieves the best results on the 2022 WeChat Big Data Challenge dataset.展开更多
The feature space extracted from vibration signals with various faults is often nonlinear and of high dimension.Currently,nonlinear dimensionality reduction methods are available for extracting low-dimensional embeddi...The feature space extracted from vibration signals with various faults is often nonlinear and of high dimension.Currently,nonlinear dimensionality reduction methods are available for extracting low-dimensional embeddings,such as manifold learning.However,these methods are all based on manual intervention,which have some shortages in stability,and suppressing the disturbance noise.To extract features automatically,a manifold learning method with self-organization mapping is introduced for the first time.Under the non-uniform sample distribution reconstructed by the phase space,the expectation maximization(EM) iteration algorithm is used to divide the local neighborhoods adaptively without manual intervention.After that,the local tangent space alignment(LTSA) algorithm is adopted to compress the high-dimensional phase space into a more truthful low-dimensional representation.Finally,the signal is reconstructed by the kernel regression.Several typical states include the Lorenz system,engine fault with piston pin defect,and bearing fault with outer-race defect are analyzed.Compared with the LTSA and continuous wavelet transform,the results show that the background noise can be fully restrained and the entire periodic repetition of impact components is well separated and identified.A new way to automatically and precisely extract the impulsive components from mechanical signals is proposed.展开更多
In this paper,we study the problem of domain adaptation,which is a crucial ingredient in transfer learning with two domains,that is,the source domain with labeled data and the target domain with none or few labels.Dom...In this paper,we study the problem of domain adaptation,which is a crucial ingredient in transfer learning with two domains,that is,the source domain with labeled data and the target domain with none or few labels.Domain adaptation aims to extract knowledge from the source domain to improve the performance of the learning task in the target domain.A popular approach to handle this problem is via adversarial training,which is explained by the H△H-distance theory.However,traditional adversarial network architectures just align the marginal feature distribution in the feature space.The alignment of class condition distribution is not guaranteed.Therefore,we proposed a novel method based on pseudo labels and the cluster assumption to avoid the incorrect class alignment in the feature space.The experiments demonstrate that our framework improves the accuracy on typical transfer learning tasks.展开更多
Multimodal lung tumor medical images can provide anatomical and functional information for the same lesion.Such as Positron Emission Computed Tomography(PET),Computed Tomography(CT),and PET-CT.How to utilize the lesio...Multimodal lung tumor medical images can provide anatomical and functional information for the same lesion.Such as Positron Emission Computed Tomography(PET),Computed Tomography(CT),and PET-CT.How to utilize the lesion anatomical and functional information effectively and improve the network segmentation performance are key questions.To solve the problem,the Saliency Feature-Guided Interactive Feature Enhancement Lung Tumor Segmentation Network(Guide-YNet)is proposed in this paper.Firstly,a double-encoder single-decoder U-Net is used as the backbone in this model,a single-coder single-decoder U-Net is used to generate the saliency guided feature using PET image and transmit it into the skip connection of the backbone,and the high sensitivity of PET images to tumors is used to guide the network to accurately locate lesions.Secondly,a Cross Scale Feature Enhancement Module(CSFEM)is designed to extract multi-scale fusion features after downsampling.Thirdly,a Cross-Layer Interactive Feature Enhancement Module(CIFEM)is designed in the encoder to enhance the spatial position information and semantic information.Finally,a Cross-Dimension Cross-Layer Feature Enhancement Module(CCFEM)is proposed in the decoder,which effectively extractsmultimodal image features through global attention and multi-dimension local attention.The proposed method is verified on the lung multimodal medical image datasets,and the results showthat theMean Intersection overUnion(MIoU),Accuracy(Acc),Dice Similarity Coefficient(Dice),Volumetric overlap error(Voe),Relative volume difference(Rvd)of the proposed method on lung lesion segmentation are 87.27%,93.08%,97.77%,95.92%,89.28%,and 88.68%,respectively.It is of great significance for computer-aided diagnosis.展开更多
To address the problem that traditional keypoint detection methods are susceptible to complex backgrounds and local similarity of images resulting in inaccurate descriptor matching and bias in visual localization, key...To address the problem that traditional keypoint detection methods are susceptible to complex backgrounds and local similarity of images resulting in inaccurate descriptor matching and bias in visual localization, keypoints and descriptors based on cross-modality fusion are proposed and applied to the study of camera motion estimation. A convolutional neural network is used to detect the positions of keypoints and generate the corresponding descriptors, and the pyramid convolution is used to extract multi-scale features in the network. The problem of local similarity of images is solved by capturing local and global feature information and fusing the geometric position information of keypoints to generate descriptors. According to our experiments, the repeatability of our method is improved by 3.7%, and the homography estimation is improved by 1.6%. To demonstrate the practicability of the method, the visual odometry part of simultaneous localization and mapping is constructed and our method is 35% higher positioning accuracy than the traditional method.展开更多
In order to solve the problem that the existing cross-modal entity resolution methods easily ignore the high-level semantic informational correlations between cross-modal data,we propose a novel cross-modal entity res...In order to solve the problem that the existing cross-modal entity resolution methods easily ignore the high-level semantic informational correlations between cross-modal data,we propose a novel cross-modal entity resolution for image and text integrating global and fine-grained joint attention mechanism method.First,we map the cross-modal data to a common embedding space utilizing a feature extraction network.Then,we integrate global joint attention mechanism and fine-grained joint attention mechanism,making the model have the ability to learn the global semantic characteristics and the local fine-grained semantic characteristics of the cross-modal data,which is used to fully exploit the cross-modal semantic correlation and boost the performance of cross-modal entity resolution.Moreover,experiments on Flickr-30K and MS-COCO datasets show that the overall performance of R@sum outperforms by 4.30%and 4.54%compared with 5 state-of-the-art methods,respectively,which can fully demonstrate the superiority of our proposed method.展开更多
With the popularisation of intelligent power,power devices have different shapes,numbers and specifications.This means that the power data has distributional variability,the model learning process cannot achieve suffi...With the popularisation of intelligent power,power devices have different shapes,numbers and specifications.This means that the power data has distributional variability,the model learning process cannot achieve sufficient extraction of data features,which seriously affects the accuracy and performance of anomaly detection.Therefore,this paper proposes a deep learning-based anomaly detection model for power data,which integrates a data alignment enhancement technique based on random sampling and an adaptive feature fusion method leveraging dimension reduction.Aiming at the distribution variability of power data,this paper developed a sliding window-based data adjustment method for this model,which solves the problem of high-dimensional feature noise and low-dimensional missing data.To address the problem of insufficient feature fusion,an adaptive feature fusion method based on feature dimension reduction and dictionary learning is proposed to improve the anomaly data detection accuracy of the model.In order to verify the effectiveness of the proposed method,we conducted effectiveness comparisons through elimination experiments.The experimental results show that compared with the traditional anomaly detection methods,the method proposed in this paper not only has an advantage in model accuracy,but also reduces the amount of parameter calculation of the model in the process of feature matching and improves the detection speed.展开更多
In practical orchards,the challenges posed by fruit overlapping,branch and leaf occlusion,significantly impede the successful implementation of automated picking,particularly for bagging pears.To address this issue,th...In practical orchards,the challenges posed by fruit overlapping,branch and leaf occlusion,significantly impede the successful implementation of automated picking,particularly for bagging pears.To address this issue,this paper introduces the multi-scale cross-modal feature fusion and cost-sensitive classification loss function network(MCCNet),specifically designed to accurately detect bagging pears with various occlusion categories.The network designs a dual-stream convolutional neural network as its backbone,enabling the parallel extraction of multi-modal features.Meanwhile,we propose a novel lightweight cross-modal feature fusion method,inspired by enhancing shared features between modalities while extracting specific features from RGB and depth modalities.The cross-modal method enhances the perceptual capabilities of the model by facilitating the fusion of complementary information from multimodal bagging pear image pairs.Furthermore,we optimize the classification loss function by transforming it into a cost-sensitive loss function,aiming to improve detection classification efficiency and reduce instances of missing and false detections during the picking process.Experimental results on a bagging pear dataset demonstrate that our MCCNet achieves mAP0.5 and mAP0.5:0.95 values of 97.3%and 80.3%,respectively,representing improvements of 3.6%and 6.3%over the classical YOLOv10m model.When benchmarked against several state-of-the-art detection models,our MCCNet network has only 19.5 million parameters while maintaining superior inference speed.展开更多
文摘With the rapid expansion of social media,analyzing emotions and their causes in texts has gained significant importance.Emotion-cause pair extraction enables the identification of causal relationships between emotions and their triggers within a text,facilitating a deeper understanding of expressed sentiments and their underlying reasons.This comprehension is crucial for making informed strategic decisions in various business and societal contexts.However,recent research approaches employing multi-task learning frameworks for modeling often face challenges such as the inability to simultaneouslymodel extracted features and their interactions,or inconsistencies in label prediction between emotion-cause pair extraction and independent assistant tasks like emotion and cause extraction.To address these issues,this study proposes an emotion-cause pair extraction methodology that incorporates joint feature encoding and task alignment mechanisms.The model consists of two primary components:First,joint feature encoding simultaneously generates features for emotion-cause pairs and clauses,enhancing feature interactions between emotion clauses,cause clauses,and emotion-cause pairs.Second,the task alignment technique is applied to reduce the labeling distance between emotion-cause pair extraction and the two assistant tasks,capturing deep semantic information interactions among tasks.The proposed method is evaluated on a Chinese benchmark corpus using 10-fold cross-validation,assessing key performance metrics such as precision,recall,and F1 score.Experimental results demonstrate that the model achieves an F1 score of 76.05%,surpassing the state-of-the-art by 1.03%.The proposed model exhibits significant improvements in emotion-cause pair extraction(ECPE)and cause extraction(CE)compared to existing methods,validating its effectiveness.This research introduces a novel approach based on joint feature encoding and task alignment mechanisms,contributing to advancements in emotion-cause pair extraction.However,the study’s limitation lies in the data sources,potentially restricting the generalizability of the findings.
基金supported by the National Natural Science Foundation of China(62276092,62303167)the Postdoctoral Fellowship Program(Grade C)of China Postdoctoral Science Foundation(GZC20230707)+3 种基金the Key Science and Technology Program of Henan Province,China(242102211051,242102211042,212102310084)Key Scientiffc Research Projects of Colleges and Universities in Henan Province,China(25A520009)the China Postdoctoral Science Foundation(2024M760808)the Henan Province medical science and technology research plan joint construction project(LHGJ2024069).
文摘Feature fusion is an important technique in medical image classification that can improve diagnostic accuracy by integrating complementary information from multiple sources.Recently,Deep Learning(DL)has been widely used in pulmonary disease diagnosis,such as pneumonia and tuberculosis.However,traditional feature fusion methods often suffer from feature disparity,information loss,redundancy,and increased complexity,hindering the further extension of DL algorithms.To solve this problem,we propose a Graph-Convolution Fusion Network with Self-Supervised Feature Alignment(Self-FAGCFN)to address the limitations of traditional feature fusion methods in deep learning-based medical image classification for respiratory diseases such as pneumonia and tuberculosis.The network integrates Convolutional Neural Networks(CNNs)for robust feature extraction from two-dimensional grid structures and Graph Convolutional Networks(GCNs)within a Graph Neural Network branch to capture features based on graph structure,focusing on significant node representations.Additionally,an Attention-Embedding Ensemble Block is included to capture critical features from GCN outputs.To ensure effective feature alignment between pre-and post-fusion stages,we introduce a feature alignment loss that minimizes disparities.Moreover,to address the limitations of proposed methods,such as inappropriate centroid discrepancies during feature alignment and class imbalance in the dataset,we develop a Feature-Centroid Fusion(FCF)strategy and a Multi-Level Feature-Centroid Update(MLFCU)algorithm,respectively.Extensive experiments on public datasets LungVision and Chest-Xray demonstrate that the Self-FAGCFN model significantly outperforms existing methods in diagnosing pneumonia and tuberculosis,highlighting its potential for practical medical applications.
基金supported by Communication University of China(HG23035)partly supported by the Fundamental Research Funds for the Central Universities(CUC230A013).
文摘With the rapid growth of socialmedia,the spread of fake news has become a growing problem,misleading the public and causing significant harm.As social media content is often composed of both images and text,the use of multimodal approaches for fake news detection has gained significant attention.To solve the problems existing in previous multi-modal fake news detection algorithms,such as insufficient feature extraction and insufficient use of semantic relations between modes,this paper proposes the MFFFND-Co(Multimodal Feature Fusion Fake News Detection with Co-Attention Block)model.First,the model deeply explores the textual content,image content,and frequency domain features.Then,it employs a Co-Attention mechanism for cross-modal fusion.Additionally,a semantic consistency detectionmodule is designed to quantify semantic deviations,thereby enhancing the performance of fake news detection.Experimentally verified on two commonly used datasets,Twitter and Weibo,the model achieved F1 scores of 90.0% and 94.0%,respectively,significantly outperforming the pre-modified MFFFND(Multimodal Feature Fusion Fake News Detection with Attention Block)model and surpassing other baseline models.This improves the accuracy of detecting fake information in artificial intelligence detection and engineering software detection.
基金the National Natural Science Foundation of China(Grant No.62062001)Ningxia Youth Top Talent Project(2021).
文摘In the realm of data privacy protection,federated learning aims to collaboratively train a global model.However,heterogeneous data between clients presents challenges,often resulting in slow convergence and inadequate accuracy of the global model.Utilizing shared feature representations alongside customized classifiers for individual clients emerges as a promising personalized solution.Nonetheless,previous research has frequently neglected the integration of global knowledge into local representation learning and the synergy between global and local classifiers,thereby limiting model performance.To tackle these issues,this study proposes a hierarchical optimization method for federated learning with feature alignment and the fusion of classification decisions(FedFCD).FedFCD regularizes the relationship between global and local feature representations to achieve alignment and incorporates decision information from the global classifier,facilitating the late fusion of decision outputs from both global and local classifiers.Additionally,FedFCD employs a hierarchical optimization strategy to flexibly optimize model parameters.Through experiments on the Fashion-MNIST,CIFAR-10 and CIFAR-100 datasets,we demonstrate the effectiveness and superiority of FedFCD.For instance,on the CIFAR-100 dataset,FedFCD exhibited a significant improvement in average test accuracy by 6.83%compared to four outstanding personalized federated learning approaches.Furthermore,extended experiments confirm the robustness of FedFCD across various hyperparameter values.
基金Supported by the National Natural Science Foundation of China (62172109,62072118)the National Science Foundation of Guangdong Province (2022A1515010322)+1 种基金the Guangdong Basic and Applied Basic Research Foundation (2021B1515120010)the Huangpu International Sci&Tech Cooperation foundation of Guangzhou (2021GH12)。
文摘Background Cross-modal retrieval has attracted widespread attention in many cross-media similarity search applications,particularly image-text retrieval in the fields of computer vision and natural language processing.Recently,visual and semantic embedding(VSE)learning has shown promising improvements in image text retrieval tasks.Most existing VSE models employ two unrelated encoders to extract features and then use complex methods to contextualize and aggregate these features into holistic embeddings.Despite recent advances,existing approaches still suffer from two limitations:(1)without considering intermediate interactions and adequate alignment between different modalities,these models cannot guarantee the discriminative ability of representations;and(2)existing feature aggregators are susceptible to certain noisy regions,which may lead to unreasonable pooling coefficients and affect the quality of the final aggregated features.Methods To address these challenges,we propose a novel cross-modal retrieval model containing a well-designed alignment module and a novel multimodal fusion encoder that aims to learn the adequate alignment and interaction of aggregated features to effectively bridge the modality gap.Results Experiments on the Microsoft COCO and Flickr30k datasets demonstrated the superiority of our model over state-of-the-art methods.
基金Shenzhen Institute of Artificial Intelligence and Robotics for Society,Grant/Award Number:AC01202201003-02GuangDong Basic and Applied Basic Research Foundation,Grant/Award Number:2024A1515010252Longgang District Shenzhen's“Ten Action Plan”for Supporting Innovation Projects,Grant/Award Number:LGKCSDPT2024002。
文摘Audio-visual scene classification(AVSC)poses a formidable challenge owing to the intricate spatial-temporal relationships exhibited by audio-visual signals,coupled with the complex spatial patterns of objects and textures found in visual images.The focus of recent studies has predominantly revolved around extracting features from diverse neural network structures,inadvertently neglecting the acquisition of semantically meaningful regions and crucial components within audio-visual data.The authors present a feature pyramid attention network(FPANet)for audio-visual scene understanding,which extracts semantically significant characteristics from audio-visual data.The authors’approach builds multi-scale hierarchical features of sound spectrograms and visual images using a feature pyramid representation and localises the semantically relevant regions with a feature pyramid attention module(FPAM).A dimension alignment(DA)strategy is employed to align feature maps from multiple layers,a pyramid spatial attention(PSA)to spatially locate essential regions,and a pyramid channel attention(PCA)to pinpoint significant temporal frames.Experiments on visual scene classification(VSC),audio scene classification(ASC),and AVSC tasks demonstrate that FPANet achieves performance on par with state-of-the-art(SOTA)approaches,with a 95.9 F1-score on the ADVANCE dataset and a relative improvement of 28.8%.Visualisation results show that FPANet can prioritise semantically meaningful areas in audio-visual signals.
基金partially supported by the National Natural Science Foundation of China under Grants 62471493 and 62402257(for conceptualization and investigation)partially supported by the Natural Science Foundation of Shandong Province,China under Grants ZR2023LZH017,ZR2024MF066,and 2023QF025(for formal analysis and validation)+1 种基金partially supported by the Open Foundation of Key Laboratory of Computing Power Network and Information Security,Ministry of Education,Qilu University of Technology(Shandong Academy of Sciences)under Grant 2023ZD010(for methodology and model design)partially supported by the Russian Science Foundation(RSF)Project under Grant 22-71-10095-P(for validation and results verification).
文摘To address the challenge of missing modal information in entity alignment and to mitigate information loss or bias arising frommodal heterogeneity during fusion,while also capturing shared information acrossmodalities,this paper proposes a Multi-modal Pre-synergistic Entity Alignmentmodel based on Cross-modalMutual Information Strategy Optimization(MPSEA).The model first employs independent encoders to process multi-modal features,including text,images,and numerical values.Next,a multi-modal pre-synergistic fusion mechanism integrates graph structural and visual modal features into the textual modality as preparatory information.This pre-fusion strategy enables unified perception of heterogeneous modalities at the model’s initial stage,reducing discrepancies during the fusion process.Finally,using cross-modal deep perception reinforcement learning,the model achieves adaptive multilevel feature fusion between modalities,supporting learningmore effective alignment strategies.Extensive experiments on multiple public datasets show that the MPSEA method achieves gains of up to 7% in Hits@1 and 8.2% in MRR on the FBDB15K dataset,and up to 9.1% in Hits@1 and 7.7% in MRR on the FBYG15K dataset,compared to existing state-of-the-art methods.These results confirm the effectiveness of the proposed model.
基金supported by the National Key R&D Program of China(No.2022ZD0118402)。
文摘Remote sensing cross-modal image-text retrieval(RSCIR)can flexibly and subjectively retrieve remote sensing images utilizing query text,which has received more researchers’attention recently.However,with the increasing volume of visual-language pre-training model parameters,direct transfer learning consumes a substantial amount of computational and storage resources.Moreover,recently proposed parameter-efficient transfer learning methods mainly focus on the reconstruction of channel features,ignoring the spatial features which are vital for modeling key entity relationships.To address these issues,we design an efficient transfer learning framework for RSCIR,which is based on spatial feature efficient reconstruction(SPER).A concise and efficient spatial adapter is introduced to enhance the extraction of spatial relationships.The spatial adapter is able to spatially reconstruct the features in the backbone with few parameters while incorporating the prior information from the channel dimension.We conduct quantitative and qualitative experiments on two different commonly used RSCIR datasets.Compared with traditional methods,our approach achieves an improvement of 3%-11% in sumR metric.Compared with methods finetuning all parameters,our proposed method only trains less than 1% of the parameters,while maintaining an overall performance of about 96%.
基金supported in part by the National Nature Science Foundation of China under Grants 62476216 and 62273272in part by the Key Research and Development Program of Shaanxi Province under Grant 2024GX-YBXM-146+1 种基金in part by the Scientific Research Program Funded by Education Department of Shaanxi Provincial Government under Grant 23JP091the Youth Innovation Team of Shaanxi Universities.
文摘Multimodal Aspect-Based Sentiment Analysis(MABSA)aims to detect sentiment polarity toward specific aspects by leveraging both textual and visual inputs.However,existing models suffer from weak aspectimage alignment,modality imbalance dominated by textual signals,and limited reasoning for implicit or ambiguous sentiments requiring external knowledge.To address these issues,we propose a unified framework named Gated-Linear Aspect-Aware Multimodal Sentiment Network(GLAMSNet).First of all,an input encoding module is employed to construct modality-specific and aspect-aware representations.Subsequently,we introduce an image–aspect correlation matching module to provide hierarchical supervision for visual-textual alignment.Building upon these components,we further design a Gated-Linear Aspect-Aware Fusion(GLAF)module to enhance aspect-aware representation learning by adaptively filtering irrelevant textual information and refining semantic alignment under aspect guidance.Additionally,an External Language Model Knowledge-Guided mechanism is integrated to incorporate sentimentaware prior knowledge from GPT-4o,enabling robust semantic reasoning especially under noisy or ambiguous inputs.Experimental studies conducted based on Twitter-15 and Twitter-17 datasets demonstrate that the proposed model outperforms most state-of-the-art methods,achieving 79.36%accuracy and 74.72%F1-score,and 74.31%accuracy and 72.01%F1-score,respectively.
基金Fundamental Research Funds for the Central Universities,China(No.2232021A-10)National Natural Science Foundation of China(No.61903078)+1 种基金Shanghai Sailing Program,China(No.22YF1401300)Natural Science Foundation of Shanghai,China(No.20ZR1400400)。
文摘Video classification is an important task in video understanding and plays a pivotal role in intelligent monitoring of information content.Most existing methods do not consider the multimodal nature of the video,and the modality fusion approach tends to be too simple,often neglecting modality alignment before fusion.This research introduces a novel dual stream multimodal alignment and fusion network named DMAFNet for classifying short videos.The network uses two unimodal encoder modules to extract features within modalities and exploits a multimodal encoder module to learn interaction between modalities.To solve the modality alignment problem,contrastive learning is introduced between two unimodal encoder modules.Additionally,masked language modeling(MLM)and video text matching(VTM)auxiliary tasks are introduced to improve the interaction between video frames and text modalities through backpropagation of loss functions.Diverse experiments prove the efficiency of DMAFNet in multimodal video classification tasks.Compared with other two mainstream baselines,DMAFNet achieves the best results on the 2022 WeChat Big Data Challenge dataset.
基金supported by National Natural Science Foundation of China(Grant No.51075323)
文摘The feature space extracted from vibration signals with various faults is often nonlinear and of high dimension.Currently,nonlinear dimensionality reduction methods are available for extracting low-dimensional embeddings,such as manifold learning.However,these methods are all based on manual intervention,which have some shortages in stability,and suppressing the disturbance noise.To extract features automatically,a manifold learning method with self-organization mapping is introduced for the first time.Under the non-uniform sample distribution reconstructed by the phase space,the expectation maximization(EM) iteration algorithm is used to divide the local neighborhoods adaptively without manual intervention.After that,the local tangent space alignment(LTSA) algorithm is adopted to compress the high-dimensional phase space into a more truthful low-dimensional representation.Finally,the signal is reconstructed by the kernel regression.Several typical states include the Lorenz system,engine fault with piston pin defect,and bearing fault with outer-race defect are analyzed.Compared with the LTSA and continuous wavelet transform,the results show that the background noise can be fully restrained and the entire periodic repetition of impact components is well separated and identified.A new way to automatically and precisely extract the impulsive components from mechanical signals is proposed.
基金supported by the National Key Research and Development Program of China(No.2016YFB0901902)the National Natural Science Foundation of China(No.61733018).
文摘In this paper,we study the problem of domain adaptation,which is a crucial ingredient in transfer learning with two domains,that is,the source domain with labeled data and the target domain with none or few labels.Domain adaptation aims to extract knowledge from the source domain to improve the performance of the learning task in the target domain.A popular approach to handle this problem is via adversarial training,which is explained by the H△H-distance theory.However,traditional adversarial network architectures just align the marginal feature distribution in the feature space.The alignment of class condition distribution is not guaranteed.Therefore,we proposed a novel method based on pseudo labels and the cluster assumption to avoid the incorrect class alignment in the feature space.The experiments demonstrate that our framework improves the accuracy on typical transfer learning tasks.
基金supported in part by the National Natural Science Foundation of China(Grant No.62062003)Natural Science Foundation of Ningxia(Grant No.2023AAC03293).
文摘Multimodal lung tumor medical images can provide anatomical and functional information for the same lesion.Such as Positron Emission Computed Tomography(PET),Computed Tomography(CT),and PET-CT.How to utilize the lesion anatomical and functional information effectively and improve the network segmentation performance are key questions.To solve the problem,the Saliency Feature-Guided Interactive Feature Enhancement Lung Tumor Segmentation Network(Guide-YNet)is proposed in this paper.Firstly,a double-encoder single-decoder U-Net is used as the backbone in this model,a single-coder single-decoder U-Net is used to generate the saliency guided feature using PET image and transmit it into the skip connection of the backbone,and the high sensitivity of PET images to tumors is used to guide the network to accurately locate lesions.Secondly,a Cross Scale Feature Enhancement Module(CSFEM)is designed to extract multi-scale fusion features after downsampling.Thirdly,a Cross-Layer Interactive Feature Enhancement Module(CIFEM)is designed in the encoder to enhance the spatial position information and semantic information.Finally,a Cross-Dimension Cross-Layer Feature Enhancement Module(CCFEM)is proposed in the decoder,which effectively extractsmultimodal image features through global attention and multi-dimension local attention.The proposed method is verified on the lung multimodal medical image datasets,and the results showthat theMean Intersection overUnion(MIoU),Accuracy(Acc),Dice Similarity Coefficient(Dice),Volumetric overlap error(Voe),Relative volume difference(Rvd)of the proposed method on lung lesion segmentation are 87.27%,93.08%,97.77%,95.92%,89.28%,and 88.68%,respectively.It is of great significance for computer-aided diagnosis.
基金Supported by the National Natural Science Foundation of China (61802253)。
文摘To address the problem that traditional keypoint detection methods are susceptible to complex backgrounds and local similarity of images resulting in inaccurate descriptor matching and bias in visual localization, keypoints and descriptors based on cross-modality fusion are proposed and applied to the study of camera motion estimation. A convolutional neural network is used to detect the positions of keypoints and generate the corresponding descriptors, and the pyramid convolution is used to extract multi-scale features in the network. The problem of local similarity of images is solved by capturing local and global feature information and fusing the geometric position information of keypoints to generate descriptors. According to our experiments, the repeatability of our method is improved by 3.7%, and the homography estimation is improved by 1.6%. To demonstrate the practicability of the method, the visual odometry part of simultaneous localization and mapping is constructed and our method is 35% higher positioning accuracy than the traditional method.
基金the Special Research Fund for the China Postdoctoral Science Foundation(No.2015M582832)the Major National Science and Technology Program(No.2015ZX01040201)the National Natural Science Foundation of China(No.61371196)。
文摘In order to solve the problem that the existing cross-modal entity resolution methods easily ignore the high-level semantic informational correlations between cross-modal data,we propose a novel cross-modal entity resolution for image and text integrating global and fine-grained joint attention mechanism method.First,we map the cross-modal data to a common embedding space utilizing a feature extraction network.Then,we integrate global joint attention mechanism and fine-grained joint attention mechanism,making the model have the ability to learn the global semantic characteristics and the local fine-grained semantic characteristics of the cross-modal data,which is used to fully exploit the cross-modal semantic correlation and boost the performance of cross-modal entity resolution.Moreover,experiments on Flickr-30K and MS-COCO datasets show that the overall performance of R@sum outperforms by 4.30%and 4.54%compared with 5 state-of-the-art methods,respectively,which can fully demonstrate the superiority of our proposed method.
文摘With the popularisation of intelligent power,power devices have different shapes,numbers and specifications.This means that the power data has distributional variability,the model learning process cannot achieve sufficient extraction of data features,which seriously affects the accuracy and performance of anomaly detection.Therefore,this paper proposes a deep learning-based anomaly detection model for power data,which integrates a data alignment enhancement technique based on random sampling and an adaptive feature fusion method leveraging dimension reduction.Aiming at the distribution variability of power data,this paper developed a sliding window-based data adjustment method for this model,which solves the problem of high-dimensional feature noise and low-dimensional missing data.To address the problem of insufficient feature fusion,an adaptive feature fusion method based on feature dimension reduction and dictionary learning is proposed to improve the anomaly data detection accuracy of the model.In order to verify the effectiveness of the proposed method,we conducted effectiveness comparisons through elimination experiments.The experimental results show that compared with the traditional anomaly detection methods,the method proposed in this paper not only has an advantage in model accuracy,but also reduces the amount of parameter calculation of the model in the process of feature matching and improves the detection speed.
基金sponsored by the National Natural Science Foundation of China(No.32371993)the Natural Science Research Key Project of Anhui Provincial University(No.2022AH040125,2023AH040135,2024AH050471&2024AH050462)the Key Research and Development Plan of Anhui Province(No.202204c06020022&2023n06020057).
文摘In practical orchards,the challenges posed by fruit overlapping,branch and leaf occlusion,significantly impede the successful implementation of automated picking,particularly for bagging pears.To address this issue,this paper introduces the multi-scale cross-modal feature fusion and cost-sensitive classification loss function network(MCCNet),specifically designed to accurately detect bagging pears with various occlusion categories.The network designs a dual-stream convolutional neural network as its backbone,enabling the parallel extraction of multi-modal features.Meanwhile,we propose a novel lightweight cross-modal feature fusion method,inspired by enhancing shared features between modalities while extracting specific features from RGB and depth modalities.The cross-modal method enhances the perceptual capabilities of the model by facilitating the fusion of complementary information from multimodal bagging pear image pairs.Furthermore,we optimize the classification loss function by transforming it into a cost-sensitive loss function,aiming to improve detection classification efficiency and reduce instances of missing and false detections during the picking process.Experimental results on a bagging pear dataset demonstrate that our MCCNet achieves mAP0.5 and mAP0.5:0.95 values of 97.3%and 80.3%,respectively,representing improvements of 3.6%and 6.3%over the classical YOLOv10m model.When benchmarked against several state-of-the-art detection models,our MCCNet network has only 19.5 million parameters while maintaining superior inference speed.