Multimodal sentiment analysis aims to understand emotions from text,speech,and video data.However,current methods often overlook the dominant role of text and suffer from feature loss during integration.Given the vary...Multimodal sentiment analysis aims to understand emotions from text,speech,and video data.However,current methods often overlook the dominant role of text and suffer from feature loss during integration.Given the varying importance of each modality across different contexts,a central and pressing challenge in multimodal sentiment analysis lies in maximizing the use of rich intra-modal features while minimizing information loss during the fusion process.In response to these critical limitations,we propose a novel framework that integrates spatial position encoding and fusion embedding modules to address these issues.In our model,text is treated as the core modality,while speech and video features are selectively incorporated through a unique position-aware fusion process.The spatial position encoding strategy preserves the internal structural information of speech and visual modalities,enabling the model to capture localized intra-modal dependencies that are often overlooked.This design enhances the richness and discriminative power of the fused representation,enabling more accurate and context-aware sentiment prediction.Finally,we conduct comprehensive evaluations on two widely recognized standard datasets in the field—CMU-MOSI and CMU-MOSEI to validate the performance of the proposed model.The experimental results demonstrate that our model exhibits good performance and effectiveness for sentiment analysis tasks.展开更多
Purpose:Nowadays,public opinions during public emergencies involve not only textual contents but also contain images.However,the existing works mainly focus on textual contents and they do not provide a satisfactory a...Purpose:Nowadays,public opinions during public emergencies involve not only textual contents but also contain images.However,the existing works mainly focus on textual contents and they do not provide a satisfactory accuracy of sentiment analysis,lacking the combination of multimodal contents.In this paper,we propose to combine texts and images generated in the social media to perform sentiment analysis.Design/methodology/approach:We propose a Deep Multimodal Fusion Model(DMFM),which combines textual and visual sentiment analysis.We first train word2vec model on a large-scale public emergency corpus to obtain semantic-rich word vectors as the input of textual sentiment analysis.BiLSTM is employed to generate encoded textual embeddings.To fully excavate visual information from images,a modified pretrained VGG16-based sentiment analysis network is used with the best-performed fine-tuning strategy.A multimodal fusion method is implemented to fuse textual and visual embeddings completely,producing predicted labels.Findings:We performed extensive experiments on Weibo and Twitter public emergency datasets,to evaluate the performance of our proposed model.Experimental results demonstrate that the DMFM provides higher accuracy compared with baseline models.The introduction of images can boost the performance of sentiment analysis during public emergencies.Research limitations:In the future,we will test our model in a wider dataset.We will also consider a better way to learn the multimodal fusion information.Practical implications:We build an efficient multimodal sentiment analysis model for the social media contents during public emergencies.Originality/value:We consider the images posted by online users during public emergencies on social platforms.The proposed method can present a novel scope for sentiment analysis during public emergencies and provide the decision support for the government when formulating policies in public emergencies.展开更多
Multimodal sentiment analysis utilizes multimodal data such as text,facial expressions and voice to detect people’s attitudes.With the advent of distributed data collection and annotation,we can easily obtain and sha...Multimodal sentiment analysis utilizes multimodal data such as text,facial expressions and voice to detect people’s attitudes.With the advent of distributed data collection and annotation,we can easily obtain and share such multimodal data.However,due to professional discrepancies among annotators and lax quality control,noisy labels might be introduced.Recent research suggests that deep neural networks(DNNs)will overfit noisy labels,leading to the poor performance of the DNNs.To address this challenging problem,we present a Multimodal Robust Meta Learning framework(MRML)for multimodal sentiment analysis to resist noisy labels and correlate distinct modalities simultaneously.Specifically,we propose a two-layer fusion net to deeply fuse different modalities and improve the quality of the multimodal data features for label correction and network training.Besides,a multiple meta-learner(label corrector)strategy is proposed to enhance the label correction approach and prevent models from overfitting to noisy labels.We conducted experiments on three popular multimodal datasets to verify the superiority of ourmethod by comparing it with four baselines.展开更多
Multimodal sentiment analysis is an essential area of research in artificial intelligence that combines multiple modes,such as text and image,to accurately assess sentiment.However,conventional approaches that rely on...Multimodal sentiment analysis is an essential area of research in artificial intelligence that combines multiple modes,such as text and image,to accurately assess sentiment.However,conventional approaches that rely on unimodal pre-trained models for feature extraction from each modality often overlook the intrinsic connections of semantic information between modalities.This limitation is attributed to their training on unimodal data,and necessitates the use of complex fusion mechanisms for sentiment analysis.In this study,we present a novel approach that combines a vision-language pre-trained model with a proposed multimodal contrastive learning method.Our approach harnesses the power of transfer learning by utilizing a vision-language pre-trained model to extract both visual and textual representations in a unified framework.We employ a Transformer architecture to integrate these representations,thereby enabling the capture of rich semantic infor-mation in image-text pairs.To further enhance the representation learning of these pairs,we introduce our proposed multimodal contrastive learning method,which leads to improved performance in sentiment analysis tasks.Our approach is evaluated through extensive experiments on two publicly accessible datasets,where we demonstrate its effectiveness.We achieve a significant improvement in sentiment analysis accuracy,indicating the supe-riority of our approach over existing techniques.These results highlight the potential of multimodal sentiment analysis and underscore the importance of considering the intrinsic semantic connections between modalities for accurate sentiment assessment.展开更多
With the growing demand formore comprehensive and nuanced sentiment understanding,Multimodal Sentiment Analysis(MSA)has gained significant traction in recent years and continues to attract widespread attention in the ...With the growing demand formore comprehensive and nuanced sentiment understanding,Multimodal Sentiment Analysis(MSA)has gained significant traction in recent years and continues to attract widespread attention in the academic community.Despite notable advances,existing approaches still face critical challenges in both information modeling and modality fusion.On one hand,many current methods rely heavily on encoders to extract global features from each modality,which limits their ability to capture latent fine-grained emotional cues within modalities.On the other hand,prevailing fusion strategies often lack mechanisms to model semantic discrepancies across modalities and to adaptively regulate modality interactions.To address these limitations,we propose a novel framework for MSA,termed Multi-Granularity Guided Fusion(MGGF).The proposed framework consists of three core components:(i)Multi-Granularity Feature Extraction Module,which simultaneously captures both global and local emotional features within each modality,and integrates them to construct richer intra-modal representations;(ii)Cross-ModalGuidance Learning Module(CMGL),which introduces a cross-modal scoring mechanism to quantify the divergence and complementarity betweenmodalities.These scores are then used as guiding signals to enable the fusion strategy to adaptively respond to scenarios of modality agreement or conflict;(iii)Cross-Modal Fusion Module(CMF),which learns the semantic dependencies among modalities and facilitates deep-level emotional feature interaction,thereby enhancing sentiment prediction with complementary information.We evaluate MGGF on two benchmark datasets:MVSA-Single and MVSA-Multiple.Experimental results demonstrate that MGGF outperforms the current state-of-the-art model CLMLF on MVSA-Single by achieving a 2.32% improvement in F1 score.On MVSA-Multiple,it surpasses MGNNS with a 0.26% increase in accuracy.These results substantiate the effectiveness ofMGGFin addressing two major limitations of existing methods—insufficient intra-modal fine-grained sentiment modeling and inadequate cross-modal semantic fusion.展开更多
Joint Multimodal Aspect-based Sentiment Analysis(JMASA)is a significant task in the research of multimodal fine-grained sentiment analysis,which combines two subtasks:Multimodal Aspect Term Extraction(MATE)and Multimo...Joint Multimodal Aspect-based Sentiment Analysis(JMASA)is a significant task in the research of multimodal fine-grained sentiment analysis,which combines two subtasks:Multimodal Aspect Term Extraction(MATE)and Multimodal Aspect-oriented Sentiment Classification(MASC).Currently,most existing models for JMASA only perform text and image feature encoding from a basic level,but often neglect the in-depth analysis of unimodal intrinsic features,which may lead to the low accuracy of aspect term extraction and the poor ability of sentiment prediction due to the insufficient learning of intra-modal features.Given this problem,we propose a Text-Image Feature Fine-grained Learning(TIFFL)model for JMASA.First,we construct an enhanced adjacency matrix of word dependencies and adopt graph convolutional network to learn the syntactic structure features for text,which addresses the context interference problem of identifying different aspect terms.Then,the adjective-noun pairs extracted from image are introduced to enable the semantic representation of visual features more intuitive,which addresses the ambiguous semantic extraction problem during image feature learning.Thereby,the model performance of aspect term extraction and sentiment polarity prediction can be further optimized and enhanced.Experiments on two Twitter benchmark datasets demonstrate that TIFFL achieves competitive results for JMASA,MATE and MASC,thus validating the effectiveness of our proposed methods.展开更多
Multimodal sentiment analysis,which integrates text,speech,and image modalities,has emerged as a prominent research direction in artificial intelligence for precise emotion assessment.However,current techniques experi...Multimodal sentiment analysis,which integrates text,speech,and image modalities,has emerged as a prominent research direction in artificial intelligence for precise emotion assessment.However,current techniques experience difficulties in efficiently managing redundancy and inconsistency across features from different modalities,compromising sentiment analysis accuracy.Additionally,while the analysis of intraclass emotional features has garnered substantial attention,studies of interclass relationships have been neglected.To address these challenges,a multimodal sentiment analysis method based on contrastive learning and cross-modal guided fusion(CLCGF)is proposed.This method encodes text and images to derive latent representations and employs a cross-modal guided module with sparse attention mechanisms to effectively integrate textual and visual features,thereby mitigating redundancy issues within each modality's features.In addition to the sentiment classification task,a supervised contrastive learning task is incorporated to aid the model in learning effective features from multimodal data related to emotions.To assess the efficacy of the CLCGF method,experiments were conducted on three public datasets:MVSA-Single,MVSA-Multiple and HFM.The experimental results indicate that CLCGF significantly improves sentiment analysis accuracy compared with traditional methods.展开更多
Multimodal Sentiment analysis refers to analyzing emotions in infor-mation carriers containing multiple modalities.To better analyze the features within and between modalities and solve the problem of incomplete multi...Multimodal Sentiment analysis refers to analyzing emotions in infor-mation carriers containing multiple modalities.To better analyze the features within and between modalities and solve the problem of incomplete multimodal feature fusion,this paper proposes a multimodal sentiment analysis model MIF(Modal Interactive Feature Encoder For Multimodal Sentiment Analysis).First,the global features of three modalities are obtained through unimodal feature extraction networks.Second,the inter-modal interactive feature encoder and the intra-modal interactive feature encoder extract similarity features between modal-ities and intra-modal special features separately.Finally,unimodal special features and the interaction information between modalities are decoded to get the fusion features and predict sentimental polarity results.We conduct extensive experi-ments on three public multimodal datasets,including one in Chinese and two in English.The results show that the performance of our approach is significantly improved compared with benchmark models.展开更多
Multimodal sentiment analysis(MSA)is an evolving field that integrates information from multiple modalities such as text,audio,and visual data to analyze and interpret human emotions and sentiments.This review provide...Multimodal sentiment analysis(MSA)is an evolving field that integrates information from multiple modalities such as text,audio,and visual data to analyze and interpret human emotions and sentiments.This review provides an extensive survey of the current state of multimodal sentiment analysis,highlighting fundamental concepts,popular datasets,techniques,models,challenges,applications,and future trends.By examining existing research and methodologies,this paper aims to present a cohesive understanding of MSA,Multimodal sentiment analysis(MSA)integrates data from text,audio,and visual sources,each contributing unique insights that enhance the overall understanding of sentiment.Textual data provides explicit content and context,audio data captures the emotional tone through speech characteristics,and visual data offers cues from facial expressions and body language.Despite these strengths,MSA faces limitations such as data integration challenges,computational complexity,and the scarcity of annotated multimodal datasets.Future directions include the development of advanced fusion techniques,real-time processing capabilities,and explainable AI models.These advancements will enable more accurate and robust sentiment analysis,improve user experiences,and enhance applications in human-computer interaction,healthcare,and social media analysis.By addressing these challenges and leveraging diverse data sources,MSA has the potential to revolutionize sentiment analysis and drive positive outcomes across various domains.展开更多
Multimodal Aspect-Based Sentiment Analysis(MABSA)aims to detect sentiment polarity toward specific aspects by leveraging both textual and visual inputs.However,existing models suffer from weak aspectimage alignment,mo...Multimodal Aspect-Based Sentiment Analysis(MABSA)aims to detect sentiment polarity toward specific aspects by leveraging both textual and visual inputs.However,existing models suffer from weak aspectimage alignment,modality imbalance dominated by textual signals,and limited reasoning for implicit or ambiguous sentiments requiring external knowledge.To address these issues,we propose a unified framework named Gated-Linear Aspect-Aware Multimodal Sentiment Network(GLAMSNet).First of all,an input encoding module is employed to construct modality-specific and aspect-aware representations.Subsequently,we introduce an image–aspect correlation matching module to provide hierarchical supervision for visual-textual alignment.Building upon these components,we further design a Gated-Linear Aspect-Aware Fusion(GLAF)module to enhance aspect-aware representation learning by adaptively filtering irrelevant textual information and refining semantic alignment under aspect guidance.Additionally,an External Language Model Knowledge-Guided mechanism is integrated to incorporate sentimentaware prior knowledge from GPT-4o,enabling robust semantic reasoning especially under noisy or ambiguous inputs.Experimental studies conducted based on Twitter-15 and Twitter-17 datasets demonstrate that the proposed model outperforms most state-of-the-art methods,achieving 79.36%accuracy and 74.72%F1-score,and 74.31%accuracy and 72.01%F1-score,respectively.展开更多
Multimodal Sentiment Classification(MSC)uses multimodal data,such as images and texts,to identify the users'sentiment polarities from the information posted by users on the Internet.MSC has attracted considerable ...Multimodal Sentiment Classification(MSC)uses multimodal data,such as images and texts,to identify the users'sentiment polarities from the information posted by users on the Internet.MSC has attracted considerable attention because of its wide applications in social computing and opinion mining.However,improper correlation strategies can cause erroneous fusion as the texts and the images that are unrelated to each other may integrate.Moreover,simply concatenating them modal by modal,even with true correlation,cannot fully capture the features within and between modals.To solve these problems,this paper proposes a Cross-Modal Complementary Network(CMCN)with hierarchical fusion for MSC.The CMCN is designed as a hierarchical structure with three key modules,namely,the feature extraction module to extract features from texts and images,the feature attention module to learn both text and image attention features generated by an image-text correlation generator,and the cross-modal hierarchical fusion module to fuse features within and between modals.Such a CMCN provides a hierarchical fusion framework that can fully integrate different modal features and helps reduce the risk of integrating unrelated modal features.Extensive experimental results on three public datasets show that the proposed approach significantly outperforms the state-of-the-art methods.展开更多
基金supported by the Collaborative Tackling Project of the Yangtze River Delta SciTech Innovation Community(Nos.2024CSJGG01503,2024CSJGG01500)Guangxi Key Research and Development Program(No.AB24010317)Jiangxi Provincial Key Laboratory of Electronic Data Control and Forensics(Jiangxi Police College)(No.2025JXJYKFJJ002).
文摘Multimodal sentiment analysis aims to understand emotions from text,speech,and video data.However,current methods often overlook the dominant role of text and suffer from feature loss during integration.Given the varying importance of each modality across different contexts,a central and pressing challenge in multimodal sentiment analysis lies in maximizing the use of rich intra-modal features while minimizing information loss during the fusion process.In response to these critical limitations,we propose a novel framework that integrates spatial position encoding and fusion embedding modules to address these issues.In our model,text is treated as the core modality,while speech and video features are selectively incorporated through a unique position-aware fusion process.The spatial position encoding strategy preserves the internal structural information of speech and visual modalities,enabling the model to capture localized intra-modal dependencies that are often overlooked.This design enhances the richness and discriminative power of the fused representation,enabling more accurate and context-aware sentiment prediction.Finally,we conduct comprehensive evaluations on two widely recognized standard datasets in the field—CMU-MOSI and CMU-MOSEI to validate the performance of the proposed model.The experimental results demonstrate that our model exhibits good performance and effectiveness for sentiment analysis tasks.
基金This paper is supported by the National Natural Science Foundation of China under contract No.71774084,72274096the National Social Science Fund of China under contract No.16ZDA224,17ZDA291.
文摘Purpose:Nowadays,public opinions during public emergencies involve not only textual contents but also contain images.However,the existing works mainly focus on textual contents and they do not provide a satisfactory accuracy of sentiment analysis,lacking the combination of multimodal contents.In this paper,we propose to combine texts and images generated in the social media to perform sentiment analysis.Design/methodology/approach:We propose a Deep Multimodal Fusion Model(DMFM),which combines textual and visual sentiment analysis.We first train word2vec model on a large-scale public emergency corpus to obtain semantic-rich word vectors as the input of textual sentiment analysis.BiLSTM is employed to generate encoded textual embeddings.To fully excavate visual information from images,a modified pretrained VGG16-based sentiment analysis network is used with the best-performed fine-tuning strategy.A multimodal fusion method is implemented to fuse textual and visual embeddings completely,producing predicted labels.Findings:We performed extensive experiments on Weibo and Twitter public emergency datasets,to evaluate the performance of our proposed model.Experimental results demonstrate that the DMFM provides higher accuracy compared with baseline models.The introduction of images can boost the performance of sentiment analysis during public emergencies.Research limitations:In the future,we will test our model in a wider dataset.We will also consider a better way to learn the multimodal fusion information.Practical implications:We build an efficient multimodal sentiment analysis model for the social media contents during public emergencies.Originality/value:We consider the images posted by online users during public emergencies on social platforms.The proposed method can present a novel scope for sentiment analysis during public emergencies and provide the decision support for the government when formulating policies in public emergencies.
基金supported by STI 2030-Major Projects 2021ZD0200400National Natural Science Foundation of China(62276233 and 62072405)Key Research Project of Zhejiang Province(2023C01048).
文摘Multimodal sentiment analysis utilizes multimodal data such as text,facial expressions and voice to detect people’s attitudes.With the advent of distributed data collection and annotation,we can easily obtain and share such multimodal data.However,due to professional discrepancies among annotators and lax quality control,noisy labels might be introduced.Recent research suggests that deep neural networks(DNNs)will overfit noisy labels,leading to the poor performance of the DNNs.To address this challenging problem,we present a Multimodal Robust Meta Learning framework(MRML)for multimodal sentiment analysis to resist noisy labels and correlate distinct modalities simultaneously.Specifically,we propose a two-layer fusion net to deeply fuse different modalities and improve the quality of the multimodal data features for label correction and network training.Besides,a multiple meta-learner(label corrector)strategy is proposed to enhance the label correction approach and prevent models from overfitting to noisy labels.We conducted experiments on three popular multimodal datasets to verify the superiority of ourmethod by comparing it with four baselines.
基金supported by Science and Technology Research Project of Jiangxi Education Department.Project Grant No.GJJ2203306.
文摘Multimodal sentiment analysis is an essential area of research in artificial intelligence that combines multiple modes,such as text and image,to accurately assess sentiment.However,conventional approaches that rely on unimodal pre-trained models for feature extraction from each modality often overlook the intrinsic connections of semantic information between modalities.This limitation is attributed to their training on unimodal data,and necessitates the use of complex fusion mechanisms for sentiment analysis.In this study,we present a novel approach that combines a vision-language pre-trained model with a proposed multimodal contrastive learning method.Our approach harnesses the power of transfer learning by utilizing a vision-language pre-trained model to extract both visual and textual representations in a unified framework.We employ a Transformer architecture to integrate these representations,thereby enabling the capture of rich semantic infor-mation in image-text pairs.To further enhance the representation learning of these pairs,we introduce our proposed multimodal contrastive learning method,which leads to improved performance in sentiment analysis tasks.Our approach is evaluated through extensive experiments on two publicly accessible datasets,where we demonstrate its effectiveness.We achieve a significant improvement in sentiment analysis accuracy,indicating the supe-riority of our approach over existing techniques.These results highlight the potential of multimodal sentiment analysis and underscore the importance of considering the intrinsic semantic connections between modalities for accurate sentiment assessment.
基金supported in part by the National Key Research and Development Program of China under Grant 2022YFB3102904in part by the National Natural Science Foundation of China under Grant No.U23A20305 and No.62472440.
文摘With the growing demand formore comprehensive and nuanced sentiment understanding,Multimodal Sentiment Analysis(MSA)has gained significant traction in recent years and continues to attract widespread attention in the academic community.Despite notable advances,existing approaches still face critical challenges in both information modeling and modality fusion.On one hand,many current methods rely heavily on encoders to extract global features from each modality,which limits their ability to capture latent fine-grained emotional cues within modalities.On the other hand,prevailing fusion strategies often lack mechanisms to model semantic discrepancies across modalities and to adaptively regulate modality interactions.To address these limitations,we propose a novel framework for MSA,termed Multi-Granularity Guided Fusion(MGGF).The proposed framework consists of three core components:(i)Multi-Granularity Feature Extraction Module,which simultaneously captures both global and local emotional features within each modality,and integrates them to construct richer intra-modal representations;(ii)Cross-ModalGuidance Learning Module(CMGL),which introduces a cross-modal scoring mechanism to quantify the divergence and complementarity betweenmodalities.These scores are then used as guiding signals to enable the fusion strategy to adaptively respond to scenarios of modality agreement or conflict;(iii)Cross-Modal Fusion Module(CMF),which learns the semantic dependencies among modalities and facilitates deep-level emotional feature interaction,thereby enhancing sentiment prediction with complementary information.We evaluate MGGF on two benchmark datasets:MVSA-Single and MVSA-Multiple.Experimental results demonstrate that MGGF outperforms the current state-of-the-art model CLMLF on MVSA-Single by achieving a 2.32% improvement in F1 score.On MVSA-Multiple,it surpasses MGNNS with a 0.26% increase in accuracy.These results substantiate the effectiveness ofMGGFin addressing two major limitations of existing methods—insufficient intra-modal fine-grained sentiment modeling and inadequate cross-modal semantic fusion.
基金supported by the Science and Technology Project of Henan Province(No.222102210081).
文摘Joint Multimodal Aspect-based Sentiment Analysis(JMASA)is a significant task in the research of multimodal fine-grained sentiment analysis,which combines two subtasks:Multimodal Aspect Term Extraction(MATE)and Multimodal Aspect-oriented Sentiment Classification(MASC).Currently,most existing models for JMASA only perform text and image feature encoding from a basic level,but often neglect the in-depth analysis of unimodal intrinsic features,which may lead to the low accuracy of aspect term extraction and the poor ability of sentiment prediction due to the insufficient learning of intra-modal features.Given this problem,we propose a Text-Image Feature Fine-grained Learning(TIFFL)model for JMASA.First,we construct an enhanced adjacency matrix of word dependencies and adopt graph convolutional network to learn the syntactic structure features for text,which addresses the context interference problem of identifying different aspect terms.Then,the adjective-noun pairs extracted from image are introduced to enable the semantic representation of visual features more intuitive,which addresses the ambiguous semantic extraction problem during image feature learning.Thereby,the model performance of aspect term extraction and sentiment polarity prediction can be further optimized and enhanced.Experiments on two Twitter benchmark datasets demonstrate that TIFFL achieves competitive results for JMASA,MATE and MASC,thus validating the effectiveness of our proposed methods.
基金supported by the Science and Technology Project in Xi'an(22GXFW0123)。
文摘Multimodal sentiment analysis,which integrates text,speech,and image modalities,has emerged as a prominent research direction in artificial intelligence for precise emotion assessment.However,current techniques experience difficulties in efficiently managing redundancy and inconsistency across features from different modalities,compromising sentiment analysis accuracy.Additionally,while the analysis of intraclass emotional features has garnered substantial attention,studies of interclass relationships have been neglected.To address these challenges,a multimodal sentiment analysis method based on contrastive learning and cross-modal guided fusion(CLCGF)is proposed.This method encodes text and images to derive latent representations and employs a cross-modal guided module with sparse attention mechanisms to effectively integrate textual and visual features,thereby mitigating redundancy issues within each modality's features.In addition to the sentiment classification task,a supervised contrastive learning task is incorporated to aid the model in learning effective features from multimodal data related to emotions.To assess the efficacy of the CLCGF method,experiments were conducted on three public datasets:MVSA-Single,MVSA-Multiple and HFM.The experimental results indicate that CLCGF significantly improves sentiment analysis accuracy compared with traditional methods.
文摘Multimodal Sentiment analysis refers to analyzing emotions in infor-mation carriers containing multiple modalities.To better analyze the features within and between modalities and solve the problem of incomplete multimodal feature fusion,this paper proposes a multimodal sentiment analysis model MIF(Modal Interactive Feature Encoder For Multimodal Sentiment Analysis).First,the global features of three modalities are obtained through unimodal feature extraction networks.Second,the inter-modal interactive feature encoder and the intra-modal interactive feature encoder extract similarity features between modal-ities and intra-modal special features separately.Finally,unimodal special features and the interaction information between modalities are decoded to get the fusion features and predict sentimental polarity results.We conduct extensive experi-ments on three public multimodal datasets,including one in Chinese and two in English.The results show that the performance of our approach is significantly improved compared with benchmark models.
文摘Multimodal sentiment analysis(MSA)is an evolving field that integrates information from multiple modalities such as text,audio,and visual data to analyze and interpret human emotions and sentiments.This review provides an extensive survey of the current state of multimodal sentiment analysis,highlighting fundamental concepts,popular datasets,techniques,models,challenges,applications,and future trends.By examining existing research and methodologies,this paper aims to present a cohesive understanding of MSA,Multimodal sentiment analysis(MSA)integrates data from text,audio,and visual sources,each contributing unique insights that enhance the overall understanding of sentiment.Textual data provides explicit content and context,audio data captures the emotional tone through speech characteristics,and visual data offers cues from facial expressions and body language.Despite these strengths,MSA faces limitations such as data integration challenges,computational complexity,and the scarcity of annotated multimodal datasets.Future directions include the development of advanced fusion techniques,real-time processing capabilities,and explainable AI models.These advancements will enable more accurate and robust sentiment analysis,improve user experiences,and enhance applications in human-computer interaction,healthcare,and social media analysis.By addressing these challenges and leveraging diverse data sources,MSA has the potential to revolutionize sentiment analysis and drive positive outcomes across various domains.
基金supported in part by the National Nature Science Foundation of China under Grants 62476216 and 62273272in part by the Key Research and Development Program of Shaanxi Province under Grant 2024GX-YBXM-146+1 种基金in part by the Scientific Research Program Funded by Education Department of Shaanxi Provincial Government under Grant 23JP091the Youth Innovation Team of Shaanxi Universities.
文摘Multimodal Aspect-Based Sentiment Analysis(MABSA)aims to detect sentiment polarity toward specific aspects by leveraging both textual and visual inputs.However,existing models suffer from weak aspectimage alignment,modality imbalance dominated by textual signals,and limited reasoning for implicit or ambiguous sentiments requiring external knowledge.To address these issues,we propose a unified framework named Gated-Linear Aspect-Aware Multimodal Sentiment Network(GLAMSNet).First of all,an input encoding module is employed to construct modality-specific and aspect-aware representations.Subsequently,we introduce an image–aspect correlation matching module to provide hierarchical supervision for visual-textual alignment.Building upon these components,we further design a Gated-Linear Aspect-Aware Fusion(GLAF)module to enhance aspect-aware representation learning by adaptively filtering irrelevant textual information and refining semantic alignment under aspect guidance.Additionally,an External Language Model Knowledge-Guided mechanism is integrated to incorporate sentimentaware prior knowledge from GPT-4o,enabling robust semantic reasoning especially under noisy or ambiguous inputs.Experimental studies conducted based on Twitter-15 and Twitter-17 datasets demonstrate that the proposed model outperforms most state-of-the-art methods,achieving 79.36%accuracy and 74.72%F1-score,and 74.31%accuracy and 72.01%F1-score,respectively.
基金supported by the National Key Research and Development Program of China(No.2020AAA0104903)。
文摘Multimodal Sentiment Classification(MSC)uses multimodal data,such as images and texts,to identify the users'sentiment polarities from the information posted by users on the Internet.MSC has attracted considerable attention because of its wide applications in social computing and opinion mining.However,improper correlation strategies can cause erroneous fusion as the texts and the images that are unrelated to each other may integrate.Moreover,simply concatenating them modal by modal,even with true correlation,cannot fully capture the features within and between modals.To solve these problems,this paper proposes a Cross-Modal Complementary Network(CMCN)with hierarchical fusion for MSC.The CMCN is designed as a hierarchical structure with three key modules,namely,the feature extraction module to extract features from texts and images,the feature attention module to learn both text and image attention features generated by an image-text correlation generator,and the cross-modal hierarchical fusion module to fuse features within and between modals.Such a CMCN provides a hierarchical fusion framework that can fully integrate different modal features and helps reduce the risk of integrating unrelated modal features.Extensive experimental results on three public datasets show that the proposed approach significantly outperforms the state-of-the-art methods.