Abstract: Multimodal Sentiment Analysis (MSA) seeks to predict a speaker's sentiment orientation by comprehensively utilizing modalities such as text, vision, and audio. As deep learning and cross-modal fusion technologies evolve, key challenges include alleviating heterogeneity across modality feature spaces, avoiding the bias introduced by fixed main-modality fusion strategies, and enhancing model adaptability to sample-to-sample variation in modality contribution. To address these issues, this paper proposes a multimodal sentiment analysis framework based on adaptive modality selection and contrastive-learning alignment, named the Adaptive Modality Selection and Guided Fusion Network (AMSGFN). The framework first employs a cross-modal contrastive learning alignment mechanism to map text, vision, and audio features into a shared semantic space, mitigating semantic discrepancies among heterogeneous modalities. A lightweight modality scoring module then evaluates the discriminability and reliability of each modality for the current sample and adaptively identifies the dominant modality. Building on this, a dominant-modality-guided fusion mechanism selectively integrates supplementary information from the auxiliary modalities around the dominant one, highlighting key emotional semantics while suppressing noise and redundancy. Experimental results on multiple public datasets demonstrate that the proposed method outperforms existing approaches, confirming the effectiveness and robustness of the framework for multimodal sentiment analysis.
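To make the alignment mechanism concrete, the sketch below shows what a symmetric InfoNCE-style contrastive objective between two modalities projected into a shared space could look like. The abstract does not give the paper's exact formulation; the function name, temperature value, and loss form here are illustrative assumptions, not AMSGFN's actual loss.

```python
# Hypothetical sketch of cross-modal contrastive alignment (PyTorch).
# All names and hyperparameters are illustrative, not from the paper.
import torch
import torch.nn.functional as F

def contrastive_align(z_a: torch.Tensor, z_b: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: paired samples (i, i) from two modalities
    are pulled together in the shared space; mismatched pairs are pushed apart."""
    z_a = F.normalize(z_a, dim=-1)            # (batch, dim), unit-norm embeddings
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature      # pairwise cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Cross-entropy in both directions treats the diagonal as positives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

With three modalities, such a loss would presumably be applied pairwise (text-vision, text-audio, vision-audio) or anchored on one modality; the abstract does not say which.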
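Likewise, the modality scoring and dominant-modality-guided fusion steps admit a compact reading: a per-sample scorer ranks the modalities, the top-scoring one serves as the attention query, and all modalities contribute as keys and values. The module below is one hypothetical realization of that description under those assumptions, not the published architecture; in particular, the hard argmax selection is shown for clarity, and a differentiable soft selection (e.g., Gumbel-softmax) would likely be needed during training.

```python
# Hypothetical sketch of per-sample modality scoring and
# dominant-modality-guided fusion (PyTorch); names are illustrative.
import torch
import torch.nn as nn

class ScoredGuidedFusion(nn.Module):
    """Score each modality per sample, pick the dominant one, and fuse
    auxiliary information around it via cross-attention."""
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        # Lightweight scorer: one scalar score per modality embedding.
        self.scorer = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(),
                                    nn.Linear(dim // 2, 1))
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, feats: torch.Tensor):
        # feats: (batch, n_modalities, dim) -- aligned modality embeddings.
        scores = self.scorer(feats).squeeze(-1)      # (batch, n_modalities)
        weights = scores.softmax(dim=-1)             # per-sample reliability
        dominant = weights.argmax(dim=-1)            # index of dominant modality
        batch = torch.arange(feats.size(0), device=feats.device)
        query = feats[batch, dominant].unsqueeze(1)  # (batch, 1, dim)
        # The dominant modality queries the full modality set, so auxiliary
        # modalities contribute only what complements it, attenuating noise.
        fused, _ = self.cross_attn(query, feats, feats)
        return fused.squeeze(1), weights             # (batch, dim), (batch, n_mod)
```

The design point this sketch illustrates is that the fusion anchor is chosen per sample rather than fixed in advance, which is what distinguishes the described approach from fixed main-modality fusion strategies.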