Journal Articles
73 articles found
Keypoints and Descriptors Based on Cross-Modality Information Fusion for Camera Localization
1
Authors: MA Shuo, GAO Yongbin, TIAN Fangzheng, LU Junxin, HUANG Bo, GU Jia, ZHOU Yilong 《Wuhan University Journal of Natural Sciences》 CAS CSCD, 2021, Issue 2, pp. 128-136 (9 pages)
To address the problem that traditional keypoint detection methods are susceptible to complex backgrounds and local image similarity, which results in inaccurate descriptor matching and bias in visual localization, keypoints and descriptors based on cross-modality fusion are proposed and applied to camera motion estimation. A convolutional neural network is used to detect keypoint positions and generate the corresponding descriptors, and pyramid convolution is used to extract multi-scale features in the network. The problem of local image similarity is addressed by capturing local and global feature information and fusing the geometric position information of keypoints when generating descriptors. According to our experiments, the repeatability of our method is improved by 3.7%, and homography estimation is improved by 1.6%. To demonstrate the practicability of the method, the visual odometry part of simultaneous localization and mapping is constructed, and our method achieves 35% higher positioning accuracy than the traditional method.
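The abstract describes pyramid convolution for multi-scale feature extraction in the keypoint/descriptor network. The minimal PyTorch sketch below shows the general idea of a pyramid-convolution block (parallel convolutions at several kernel sizes, concatenated channel-wise); the module name, channel sizes, and kernel choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PyramidConvBlock(nn.Module):
    """Parallel convolutions with different kernel sizes capture multi-scale
    context; branch outputs are concatenated channel-wise."""
    def __init__(self, in_ch=64, out_ch=64, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        branch_ch = out_ch // len(kernel_sizes)
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, k, padding=k // 2)
            for k in kernel_sizes
        ])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Each branch sees the same input at a different receptive field.
        feats = [branch(x) for branch in self.branches]
        return self.relu(torch.cat(feats, dim=1))

features = torch.randn(1, 64, 120, 160)       # a toy 64-channel feature map
multi_scale = PyramidConvBlock()(features)    # -> torch.Size([1, 64, 120, 160])
```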
Keywords: keypoints, descriptors, cross-modality information, global feature, visual odometry
Review of Visible-Infrared Cross-Modality Person Re-Identification
2
Authors: Yinyin Zhang 《Journal of New Media》 2023, Issue 1, pp. 23-31 (9 pages)
Person re-identification (ReID) is a sub-problem of image retrieval. It is a technology that uses computer vision to identify a specific pedestrian in a collection of images or videos, where the images to be matched are captured by different surveillance devices. At present, most ReID methods deal with matching between visible-light images, but as security monitoring systems keep improving, more and more infrared cameras are used for surveillance at night or in dim light. Because infrared and RGB cameras produce very different images, there is a large visual gap between cross-modality images, so traditional ReID methods are difficult to apply in this scenario. In view of this, studying pedestrian matching between the visible and infrared modalities is particularly important. Visible-infrared person re-identification (VI-ReID) was first proposed in 2017 and has since attracted increasing attention, with many advanced methods emerging.
Keywords: person re-identification, cross-modality
EdgeST-Fusion: A Cross-Modal Federated Learning and Graph Transformer Framework for Multimodal Spatiotemporal Data Analytics in Smart City Consumer Electronics
3
Authors: Mohammed M. Alenazi 《Computers, Materials & Continua》 2026, Issue 5, pp. 1376-1408 (33 pages)
Multimodal spatiotemporal data from smart city consumer electronics present critical challenges including cross-modal temporal misalignment, unreliable data quality, limited joint modeling of spatial and temporal dependencies, and weak resilience to adversarial updates. To address these limitations, EdgeST-Fusion is introduced as a cross-modal federated graph transformer framework for context-aware smart city analytics. The architecture integrates cross-modal embedding networks for modality alignment, graph transformer encoders for spatial dependency modeling, temporal self-attention for dynamic pattern learning, and adaptive anomaly detection to ensure data quality and security during aggregation. A privacy-preserving federated learning protocol with differential privacy guarantees enables collaborative model training without centralizing sensitive data. The framework employs data-quality-aware weighted aggregation to enhance robustness against noisy and malicious client updates. Experimental evaluation on the GeoLife, PeMS-Bay, and SmartHome+ datasets demonstrates that EdgeST-Fusion achieves a 21.8% improvement in prediction accuracy, a 35.7% reduction in communication overhead, and a 29.4% enhancement in security resilience compared to recent baselines. Real-world deployment across three smart city testbeds validates practical viability with 90.0% average accuracy and sub-250 ms inference latency. The proposed framework remains feasible for deployment on heterogeneous and resource-constrained consumer electronics devices while maintaining strong privacy guarantees and scalability for large-scale urban environments.
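The abstract mentions data-quality-aware weighted aggregation of client updates. The sketch below illustrates the general idea, assuming each client reports a model state dict and a scalar quality score; the weighting rule is an assumption for illustration, not the paper's exact protocol.

```python
import torch

def quality_weighted_aggregate(client_states, quality_scores):
    """Federated averaging in which each client's contribution is scaled by a
    normalized data-quality score (illustrative, not the paper's rule)."""
    scores = torch.tensor(quality_scores, dtype=torch.float32)
    weights = scores / scores.sum()            # normalize so the weights sum to 1
    global_state = {}
    for name in client_states[0]:
        stacked = torch.stack([s[name].float() for s in client_states])
        # Weighted average over the client dimension, broadcasting the weights.
        w = weights.view(-1, *[1] * (stacked.dim() - 1))
        global_state[name] = (w * stacked).sum(0)
    return global_state

# Example with two toy "clients" sharing a single linear layer.
layer = torch.nn.Linear(4, 2)
states = [layer.state_dict(), {k: v + 0.1 for k, v in layer.state_dict().items()}]
agg = quality_weighted_aggregate(states, quality_scores=[0.9, 0.3])
```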
Keywords: federated learning, graph transformer, spatiotemporal analytics, consumer electronics, smart cities, cross-modal fusion, edge computing, privacy preservation
Cross-modality transformations in biological microscopy enabled by deep learning (Cited: 1)
4
Authors: Dana Hassan, Jesús Domínguez, Benjamin Midtvedt, Henrik Klein Moberg, Jesús Pineda, Christoph Langhammer, Giovanni Volpe, Antoni Homs Corber, Caroline B. Adiels 《Advanced Photonics》 CSCD, 2024, Issue 6, pp. 13-33 (21 pages)
Recent advancements in deep learning (DL) have propelled the virtual transformation of microscopy images across optical modalities, enabling multimodal imaging analysis that was hitherto impossible. Despite these strides, the integration of such algorithms into scientists' daily routines and clinical trials remains limited, largely due to a lack of recognition within their respective fields and the plethora of available transformation methods. To address this, we present a structured overview of cross-modality transformations, encompassing applications, data sets, and implementations, aimed at unifying this evolving field. Our review focuses on DL solutions for two key applications: contrast enhancement of targeted features within images and resolution enhancement. We recognize cross-modality transformations as a valuable resource for biologists seeking a deeper understanding of the field, as well as for technology developers aiming to better grasp sample limitations and potential applications. Notably, they enable high-contrast, high-specificity imaging akin to fluorescence microscopy without the need for laborious, costly, and disruptive physical-staining procedures. In addition, they facilitate imaging with properties that would typically require costly or complex physical modifications, such as superresolution capabilities. By consolidating the current state of research in this review, we aim to catalyze further investigation and development, ultimately bringing the potential of cross-modality transformations into the hands of researchers and clinicians alike.
Keywords: cross-modality transformations, virtual staining, superresolution, deep learning, fluorescence, bright-field, phase contrast
Cross-Modal Simplex Center Learning for Speech-Face Association
5
Authors: Qiming Ma, Fanliang Bu, Rong Wang, Lingbin Bu, Yifan Wang, Zhiyuan Li 《Computers, Materials & Continua》 2025, Issue 3, pp. 5169-5184 (16 pages)
Speech-face association aims to achieve identity matching between facial images and voice segments by aligning cross-modal features. Existing research primarily focuses on learning shared-space representations and computing one-to-one similarities between cross-modal sample pairs to establish their correlation. However, these approaches do not fully account for intra-class variations between the modalities or the many-to-many relationships among cross-modal samples, which are crucial for robust association modeling. To address these challenges, we propose a novel framework that leverages global information to align voice and face embeddings while effectively correlating identity information embedded in both modalities. First, we jointly pre-train face recognition and speaker recognition networks to encode discriminative features from facial images and voice segments. This shared pre-training step ensures the extraction of complementary identity information across modalities. Subsequently, we introduce a cross-modal simplex center loss, which aligns samples with identity centers located at the vertices of a regular simplex inscribed on a hypersphere. This design enforces an equidistant and balanced distribution of identity embeddings, reducing intra-class variations. Furthermore, we employ an improved triplet center loss that emphasizes hard sample mining and optimizes inter-class separability, enhancing the model's ability to generalize across challenging scenarios. Extensive experiments validate the effectiveness of our framework, demonstrating superior performance across various speech-face association tasks, including matching, verification, and retrieval. Notably, in the challenging gender-constrained matching task, our method achieves a remarkable accuracy of 79.22%, significantly outperforming existing approaches. These results highlight the potential of the proposed framework to advance the state of the art in cross-modal identity association.
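The cross-modal simplex center loss places identity centers at the vertices of a regular simplex inscribed in a hypersphere. A common construction of such equidistant unit-norm centers, and a simplified squared-distance pull toward the assigned center, are sketched below; this is an illustration under those assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def simplex_centers(num_classes):
    """Vertices of a regular (C-1)-simplex embedded in R^C, centered at the
    origin and normalized to unit length, so all centers are equidistant."""
    eye = torch.eye(num_classes)
    centered = eye - eye.mean(dim=0, keepdim=True)   # subtract the centroid
    return F.normalize(centered, dim=1)              # project onto the unit hypersphere

def simplex_center_loss(embeddings, labels, centers):
    """Pull each unit-normalized embedding toward its identity center."""
    emb = F.normalize(embeddings, dim=1)
    return ((emb - centers[labels]) ** 2).sum(dim=1).mean()

centers = simplex_centers(num_classes=5)   # 5 equidistant identity centers in R^5
voice_emb = torch.randn(8, 5)              # toy voice embeddings (dim matches centers)
face_emb = torch.randn(8, 5)               # toy face embeddings
labels = torch.randint(0, 5, (8,))
loss = simplex_center_loss(voice_emb, labels, centers) + \
       simplex_center_loss(face_emb, labels, centers)
```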
Keywords: speech-face association, cross-modal learning, cross-modal matching, cross-modal retrieval
MSCM-Net: Rail Surface Defect Detection Based on a Multi-Scale Cross-Modal Network
6
Authors: Xin Wen, Xiao Zheng, Yu He 《Computers, Materials & Continua》 2025, Issue 3, pp. 4371-4388 (18 pages)
Detecting surface defects on unused rails is crucial for evaluating rail quality and durability to ensure the safety of rail transportation. However, existing detection methods often struggle with challenges such as complex defect morphology, texture similarity, and fuzzy edges, leading to poor accuracy and missed detections. In order to resolve these problems, we propose MSCM-Net (Multi-Scale Cross-Modal Network), a multiscale cross-modal framework focused on detecting rail surface defects. MSCM-Net introduces an attention mechanism to dynamically weight the fusion of RGB and depth maps, effectively capturing and enhancing features at different scales for each modality. To further enrich feature representation and improve edge detection in blurred areas, we propose a multi-scale void fusion module that integrates multi-scale feature information. To improve cross-modal feature fusion, we develop a cross-enhanced fusion module that transfers fused features between layers to incorporate interlayer information. We also introduce a multimodal feature integration module, which merges modality-specific features from separate decoders into a shared decoder, enhancing detection by leveraging richer complementary information. Finally, we validate MSCM-Net on the NEU RSDDS-AUG RGB-depth dataset, comparing it against 12 leading methods, and the results show that MSCM-Net achieves superior performance on all metrics.
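The abstract describes dynamically weighting the fusion of RGB and depth features with an attention mechanism. A minimal gated-fusion sketch is given below; the module name, channel sizes, and the channel-attention form are assumptions for illustration rather than MSCM-Net's actual design.

```python
import torch
import torch.nn as nn

class GatedRGBDFusion(nn.Module):
    """Predicts a per-channel gate from pooled RGB and depth features and
    mixes the two modalities accordingly (illustrative sketch)."""
    def __init__(self, channels=64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, depth_feat):
        g = self.gate(torch.cat([rgb_feat, depth_feat], dim=1))  # (B, C, 1, 1)
        return g * rgb_feat + (1.0 - g) * depth_feat

rgb = torch.randn(2, 64, 56, 56)
depth = torch.randn(2, 64, 56, 56)
fused = GatedRGBDFusion(64)(rgb, depth)   # same shape as the inputs
```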
Keywords: surface defect detection, multiscale framework, cross-modal fusion, edge detection
Robust Audio-Visual Fusion for Emotion Recognition Based on Cross-Modal Learning under Noisy Conditions
7
Authors: A-Seong Moon, Seungyeon Jeong, Donghee Kim, Mohd Asyraf Zulkifley, Bong-Soo Sohn, Jaesung Lee 《Computers, Materials & Continua》 2025, Issue 11, pp. 2851-2872 (22 pages)
Emotion recognition under uncontrolled and noisy environments presents persistent challenges in the design of emotionally responsive systems. The current study introduces an audio-visual recognition framework designed to address performance degradation caused by environmental interference, such as background noise, overlapping speech, and visual obstructions. The proposed framework employs a structured fusion approach, combining early-stage feature-level integration with decision-level coordination guided by temporal attention mechanisms. Audio data are transformed into mel-spectrogram representations, and visual data are represented as raw frame sequences. Spatial and temporal features are extracted through convolutional and transformer-based encoders, allowing the framework to capture complementary and hierarchical information from both sources. A cross-modal attention module enables selective emphasis on relevant signals while suppressing modality-specific noise. Performance is validated on a modified version of the AFEW dataset, in which controlled noise is introduced to emulate realistic conditions. The framework achieves higher classification accuracy than comparative baselines, confirming increased robustness under conditions of cross-modal disruption. This result demonstrates the suitability of the proposed method for deployment in practical emotion-aware technologies operating outside controlled environments. The study also contributes a systematic approach to fusion design and supports further exploration in the direction of resilient multimodal emotion analysis frameworks. The source code is publicly available at https://github.com/asmoon002/AVER (accessed on 18 August 2025).
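The framework above lets one modality emphasize relevant signals in the other through cross-modal attention. The snippet below is a generic cross-modal attention step built on PyTorch's nn.MultiheadAttention; the tensor shapes and the choice of audio as the query are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Audio tokens (e.g., mel-spectrogram frames) query the visual token sequence,
# so each audio step gathers the most relevant visual context.
embed_dim, num_heads = 256, 4
cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

audio_tokens = torch.randn(2, 100, embed_dim)   # (batch, audio steps, dim)
visual_tokens = torch.randn(2, 30, embed_dim)   # (batch, video frames, dim)

attended, weights = cross_attn(query=audio_tokens,
                               key=visual_tokens,
                               value=visual_tokens)
# "attended" holds one visually informed vector per audio step: (2, 100, 256).
```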
Keywords: multimodal learning, emotion recognition, cross-modal attention, robust representation learning
Fake News Detection Based on Cross-Modal Ambiguity Computation and Multi-Scale Feature Fusion
8
Authors: Jianxiang Cao, Jinyang Wu, Wenqian Shang, Chunhua Wang, Kang Song, Tong Yi, Jiajun Cai, Haibin Zhu 《Computers, Materials & Continua》 2025, Issue 5, pp. 2659-2675 (17 pages)
With the rapid growth of social media, the spread of fake news has become a growing problem, misleading the public and causing significant harm. As social media content is often composed of both images and text, multimodal approaches to fake news detection have gained significant attention. To address the shortcomings of previous multimodal fake news detection algorithms, such as insufficient feature extraction and insufficient use of the semantic relations between modalities, this paper proposes the MFFFND-Co (Multimodal Feature Fusion Fake News Detection with Co-Attention Block) model. First, the model deeply explores textual content, image content, and frequency-domain features. Then, it employs a co-attention mechanism for cross-modal fusion. Additionally, a semantic consistency detection module is designed to quantify semantic deviations, thereby enhancing the performance of fake news detection. Experimentally verified on two commonly used datasets, Twitter and Weibo, the model achieved F1 scores of 90.0% and 94.0%, respectively, significantly outperforming the pre-modified MFFFND (Multimodal Feature Fusion Fake News Detection with Attention Block) model and surpassing other baseline models. This improves the accuracy of detecting fake information in artificial intelligence detection and engineering software detection.
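The semantic consistency detection module quantifies the deviation between text and image semantics. One plausible, simplified realization is a cosine-similarity score over pooled text and image embeddings, as sketched below; the embedding sources and the threshold are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def semantic_consistency(text_emb, image_emb):
    """Cosine similarity in [-1, 1]; low values flag a possible
    text-image semantic mismatch, a common cue for fake news."""
    return F.cosine_similarity(text_emb, image_emb, dim=-1)

text_emb = torch.randn(4, 512)    # pooled text features (e.g., from a text encoder)
image_emb = torch.randn(4, 512)   # pooled image features (e.g., from a vision encoder)
scores = semantic_consistency(text_emb, image_emb)
suspicious = scores < 0.2         # the threshold here is an arbitrary illustration
```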
Keywords: fake news detection, multimodal, cross-modal ambiguity computation, multi-scale feature fusion
Research on a Digital Virtual Human Lip Synchronization Optimization Algorithm
9
Authors: FAN Jia-li, ZHAO Si-jia, SI Zhan-jun 《印刷与数字媒体技术研究》 (PKU Core), 2026, Issue 1, pp. 226-235, 250 (11 pages)
Lip synchronization serves as a core technology for enabling natural interactions in digital virtual humans. However, it faces challenges such as insufficient dynamic correspondence between speech and lip movements and inadequate modeling of image details. To address these limitations, a comprehensively optimized lip synchronization framework extending the Wav2Lip architecture was proposed in this study. Firstly, based on the Wav2Lip model, a facial region extraction strategy using facial keypoints was designed, which effectively enhances the robustness of facial alignment during lip synchronization for digital virtual humans. Then, a cross-modal attention fusion module between visual and speech features was introduced to improve cross-modal information fusion, and a dynamic receptive field convolution module was developed in the generation branch to enhance the modeling performance of the lip region. Finally, experiments were conducted on the VFHQ dataset. The proposed method was compared with the Wav2Lip, VideoRetalking, and DI-Net models, and its performance was evaluated using three metrics: LSE-C, CSIM, and FID. Experimental results showed that the proposed method achieves significant improvements in synchronization accuracy and image fidelity, providing an efficient and feasible solution for lip-synthesis tasks of digital virtual humans.
Keywords: lip synchronization, digital human, cross-modal attention, audio-visual synthesis
A Survey on Multimodal Emotion Recognition: Methods, Datasets, and Future Directions
10
Authors: A-Seong Moon, Haesung Kim, Ye-Chan Park, Jaesung Lee 《Computers, Materials & Continua》 2026, Issue 5, pp. 1-42 (42 pages)
Multimodal emotion recognition has emerged as a key research area for enabling human-centered artificial intelligence, supported by the rapid progress in vision, audio, language, and physiological modeling. Existing approaches integrate heterogeneous affective cues through diverse embedding strategies and fusion mechanisms, yet the field remains fragmented due to differences in feature alignment, temporal synchronization, modality reliability, and robustness to noise or missing inputs. This survey provides a comprehensive analysis of MER research from 2021 to 2025, consolidating advances in modality-specific representation learning, cross-modal feature construction, and early, late, and hybrid fusion paradigms. We systematically review visual, acoustic, textual, and sensor-based embeddings, highlighting how pre-trained encoders, self-supervised learning, and large language models have reshaped the representational foundations of MER. We further categorize fusion strategies by interaction depth and architectural design, examining how attention mechanisms, cross-modal transformers, adaptive gating, and multimodal large language models redefine the integration of affective signals. Finally, we summarize major benchmark datasets and evaluation metrics and discuss emerging challenges related to scalability, generalization, and interpretability. This survey aims to provide a unified perspective on multimodal fusion for emotion recognition and to guide future research toward more coherent and generalizable multimodal affective intelligence.
Keywords: multimodal emotion recognition, multimodal learning, cross-modal learning, fusion strategies, representation learning
A Multimodal Sentiment Analysis Method Based on Multi-Granularity Guided Fusion
11
Authors: Zilin Zhang, Yan Liu, Jia Liu, Senbao Hou, Yuping Zhang, Chenyuan Wang 《Computers, Materials & Continua》 2026, Issue 2, pp. 1228-1241 (14 pages)
With the growing demand for more comprehensive and nuanced sentiment understanding, Multimodal Sentiment Analysis (MSA) has gained significant traction in recent years and continues to attract widespread attention in the academic community. Despite notable advances, existing approaches still face critical challenges in both information modeling and modality fusion. On one hand, many current methods rely heavily on encoders to extract global features from each modality, which limits their ability to capture latent fine-grained emotional cues within modalities. On the other hand, prevailing fusion strategies often lack mechanisms to model semantic discrepancies across modalities and to adaptively regulate modality interactions. To address these limitations, we propose a novel framework for MSA, termed Multi-Granularity Guided Fusion (MGGF). The proposed framework consists of three core components: (i) a Multi-Granularity Feature Extraction Module, which simultaneously captures both global and local emotional features within each modality and integrates them to construct richer intra-modal representations; (ii) a Cross-Modal Guidance Learning Module (CMGL), which introduces a cross-modal scoring mechanism to quantify the divergence and complementarity between modalities; these scores are then used as guiding signals that enable the fusion strategy to adaptively respond to scenarios of modality agreement or conflict; and (iii) a Cross-Modal Fusion Module (CMF), which learns the semantic dependencies among modalities and facilitates deep-level emotional feature interaction, thereby enhancing sentiment prediction with complementary information. We evaluate MGGF on two benchmark datasets: MVSA-Single and MVSA-Multiple. Experimental results demonstrate that MGGF outperforms the current state-of-the-art model CLMLF on MVSA-Single by achieving a 2.32% improvement in F1 score. On MVSA-Multiple, it surpasses MGNNS with a 0.26% increase in accuracy. These results substantiate the effectiveness of MGGF in addressing two major limitations of existing methods: insufficient intra-modal fine-grained sentiment modeling and inadequate cross-modal semantic fusion.
Keywords: multimodal sentiment analysis, cross-modal fusion, cross-modal guided learning
MultiAgent-CoT: A Multi-Agent Chain-of-Thought Reasoning Model for Robust Multimodal Dialogue Understanding
12
Authors: Ans D. Alghamdi 《Computers, Materials & Continua》 2026, Issue 2, pp. 1395-1429 (35 pages)
Multimodal dialogue systems often fail to maintain coherent reasoning over extended conversations and suffer from hallucination due to limited context modeling capabilities. Current approaches struggle with cross-modal alignment, temporal consistency, and robust handling of noisy or incomplete inputs across multiple modalities. We propose MultiAgent-CoT (Multi-Agent Chain-of-Thought), a novel multi-agent chain-of-thought reasoning framework in which specialized agents for the text, vision, and speech modalities collaboratively construct shared reasoning traces through inter-agent message passing and consensus voting mechanisms. Our architecture incorporates self-reflection modules, conflict resolution protocols, and dynamic rationale alignment to enhance consistency, factual accuracy, and user engagement. The framework employs a hierarchical attention mechanism with cross-modal fusion and implements adaptive reasoning depth based on dialogue complexity. Comprehensive evaluations on Situated Interactive Multi-Modal Conversations (SIMMC) 2.0, VisDial v1.0, and newly introduced challenging scenarios demonstrate statistically significant improvements in grounding accuracy (p < 0.01), chain-of-thought interpretability, and robustness to adversarial inputs compared to state-of-the-art monolithic transformer baselines and existing multi-agent approaches.
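The framework aggregates modality-specific agents through message passing and consensus voting. The fragment below illustrates only the consensus-voting idea with confidence-weighted votes; the agent interface and the weighting rule are assumptions, not the paper's protocol.

```python
from collections import defaultdict

def consensus_vote(agent_outputs):
    """Each agent reports (answer, confidence); the answer with the highest
    total confidence wins (a simple stand-in for consensus voting)."""
    tally = defaultdict(float)
    for answer, confidence in agent_outputs:
        tally[answer] += confidence
    return max(tally, key=tally.get)

votes = [("blue jacket", 0.8),   # text agent
         ("blue jacket", 0.6),   # vision agent
         ("black jacket", 0.7)]  # speech agent
print(consensus_vote(votes))     # -> "blue jacket"
```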
Keywords: multi-agent systems, chain-of-thought reasoning, multimodal dialogue, conversational artificial intelligence (AI), cross-modal fusion, reasoning, interpretability
MDGET-MER: Multi-Level Dynamic Gating and Emotion Transfer for Multi-Modal Emotion Recognition
13
Authors: Musheng Chen, Qiang Wen, Xiaohong Qiu, Junhua Wu, Wenqing Fu 《Computers, Materials & Continua》 2026, Issue 3, pp. 872-893 (22 pages)
In multi-modal emotion recognition, excessive reliance on historical context often impedes the detection of emotional shifts, while modality heterogeneity and unimodal noise limit recognition performance. Existing methods struggle to dynamically adjust cross-modal complementary strength to optimize fusion quality and lack effective mechanisms to model the dynamic evolution of emotions. To address these issues, we propose a multi-level dynamic gating and emotion transfer framework for multi-modal emotion recognition. A dynamic gating mechanism is applied across unimodal encoding, cross-modal alignment, and emotion transfer modeling, substantially improving noise robustness and feature alignment. First, we construct a unimodal encoder based on gated recurrent units and feature-selection gating to suppress intra-modal noise and enhance contextual representation. Second, we design a gated-attention cross-modal encoder that dynamically calibrates the complementary contributions of the visual and audio modalities to the dominant textual features and eliminates redundant information. Finally, we introduce a gated enhanced emotion transfer module that explicitly models the temporal dependence of emotional evolution in dialogues via transfer gating and optimizes continuity modeling with a contrastive learning loss. Experimental results demonstrate that the proposed method outperforms state-of-the-art models on the public MELD and IEMOCAP datasets.
Keywords: multi-modal emotion recognition, dynamic gating, emotion transfer module, cross-modal dynamic alignment, noise robustness
Q-ALIGNer: A Quantum Entanglement-Driven Multimodal Framework for Robust Fake News Detection
14
Authors: Sara Tehsin, Inzamam Mashood Nasir, Wiem Abdelbaki, Fadwa Alrowais, Reham Abualhamayel, Abdulsamad Ebrahim Yahya, Radwa Marzouk 《Computers, Materials & Continua》 2026, Issue 5, pp. 1670-1700 (31 pages)
The rapid proliferation of multimodal misinformation on social media demands detection frameworks that are not only accurate but also robust to noise, adversarial manipulation, and semantic inconsistency between modalities. Existing multimodal fake news detection approaches often rely on deterministic fusion strategies, which limits their ability to model uncertainty and complex cross-modal dependencies. To address these challenges, we propose Q-ALIGNer, a quantum-inspired multimodal framework that integrates classical feature extraction with quantum state encoding, learnable cross-modal entanglement, and robustness-aware training objectives. The proposed framework adopts quantum formalism as a representational abstraction, enabling probabilistic modeling of multimodal alignment while remaining fully executable on classical hardware. Q-ALIGNer is evaluated on four widely used benchmark datasets (FakeNewsNet, Fakeddit, Weibo, and MediaEval VMU) covering diverse platforms, languages, and content characteristics. Experimental results demonstrate consistent performance improvements over strong text-only, vision-only, multimodal, and quantum-inspired baselines, including BERT, RoBERTa, XLNet, ResNet, EfficientNet, ViT, Multimodal-BERT, ViLBERT, and QEMF. Q-ALIGNer achieves accuracies of 91.2%, 92.9%, 91.7%, and 92.1% on FakeNewsNet, Fakeddit, Weibo, and MediaEval VMU, respectively, with F1-score gains of 3-4 percentage points over QEMF. Robustness evaluation shows a reduced adversarial accuracy gap of 2.6%, compared to 7%-9% for baseline models, while calibration analysis indicates improved reliability with an expected calibration error of 0.031. In addition, computational analysis shows that Q-ALIGNer reduces training time to 19.6 h compared to 48.2 h for QEMF at a comparable parameter scale. These results indicate that quantum-inspired alignment and entanglement can enhance robustness, uncertainty awareness, and efficiency in multimodal fake news detection, positioning Q-ALIGNer as a principled and practical content-centric framework for misinformation analysis.
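Q-ALIGNer encodes classical features as quantum states while remaining executable on classical hardware. The simplest classical simulation of amplitude encoding, normalizing a feature vector so its squared entries sum to one, is sketched below; this is a generic illustration of quantum-state encoding, not the paper's encoder.

```python
import numpy as np

def amplitude_encode(features, eps=1e-12):
    """Map a real feature vector to a valid state vector: amplitudes whose
    squared magnitudes (probabilities) sum to 1."""
    v = np.asarray(features, dtype=np.float64)
    norm = np.linalg.norm(v)
    return v / (norm + eps)

state = amplitude_encode([0.3, -1.2, 0.5, 0.0])
print(np.sum(state ** 2))   # ~1.0, as required of a quantum state vector
```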
Keywords: machine learning, fake news detection, multimodal learning, quantum natural language processing, cross-modal entanglement, adversarial robustness, uncertainty calibration
Cross-Modal Consistency with Aesthetic Similarity for Multimodal False Information Detection (Cited: 1)
15
Authors: Weijian Fan, Ziwei Shi 《Computers, Materials & Continua》 SCIE EI, 2024, Issue 5, pp. 2723-2741 (19 pages)
With the explosive growth of false information on social media platforms, the automatic detection of multimodal false information has received increasing attention. Recent research has significantly contributed to multimodal information exchange and fusion, with many methods attempting to integrate unimodal features to generate multimodal news representations. However, they still need to fully explore the hierarchical and complex semantic correlations between different modal contents, which severely limits their performance in detecting multimodal false information. This work proposes a two-stage detection framework for multimodal false information detection, called ASMFD, which is based on image aesthetic similarity and explores the consistency and inconsistency features of images and texts. Specifically, we first use the Contrastive Language-Image Pre-training (CLIP) model to learn the relationship between text and images through label awareness and train an image aesthetic attribute scorer using an aesthetic attribute dataset. Then, we calculate the aesthetic similarity between the image and related images and use this similarity as a threshold to divide the multimodal correlation matrix into consistency and inconsistency matrices. Finally, a fusion module is designed to identify essential features for detecting multimodal false information. In extensive experiments on four datasets, the performance of ASMFD is superior to state-of-the-art baseline methods.
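ASMFD uses the aesthetic similarity as a threshold to split the multimodal correlation matrix into consistency and inconsistency matrices. One way to read that description is an element-wise masking of the correlation matrix, sketched below; the masking rule and tensor shapes are interpretations, not the paper's exact procedure.

```python
import torch

def split_correlation(corr, aesthetic_similarity):
    """Use the aesthetic similarity as an element-wise threshold: entries at or
    above it form the consistency matrix, the rest the inconsistency matrix."""
    mask = (corr >= aesthetic_similarity).float()
    return corr * mask, corr * (1.0 - mask)

corr = torch.randn(16, 16)        # toy text-image correlation matrix
consistent, inconsistent = split_correlation(corr, aesthetic_similarity=0.6)
# consistent + inconsistent reconstructs the original correlation matrix.
```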
Keywords: social media, false information detection, image aesthetic assessment, cross-modal consistency
A Multi-Level Circulant Cross-Modal Transformer for Multimodal Speech Emotion Recognition (Cited: 1)
16
Authors: Peizhu Gong, Jin Liu, Zhongdai Wu, Bing Han, Y Ken Wang, Huihua He 《Computers, Materials & Continua》 SCIE EI, 2023, Issue 2, pp. 4203-4220 (18 pages)
Speech emotion recognition, as an important component of human-computer interaction technology, has received increasing attention. Recent studies have treated emotion recognition of speech signals as a multimodal task, due to its inclusion of the semantic features of two different modalities, i.e., audio and text. However, existing methods often fail to represent features effectively and capture correlations. This paper presents a multi-level circulant cross-modal Transformer (MLCCT) for multimodal speech emotion recognition. The proposed model can be divided into three steps: feature extraction, interaction, and fusion. Self-supervised embedding models are introduced for feature extraction, which give a more powerful representation of the original data than spectrograms or audio features such as Mel-frequency cepstral coefficients (MFCCs) and low-level descriptors (LLDs). In particular, MLCCT contains two types of feature interaction processes: a bidirectional Long Short-Term Memory (Bi-LSTM) network with a circulant interaction mechanism is proposed for low-level features, while a two-stream residual cross-modal Transformer block is applied when high-level features are involved. Finally, we choose self-attention blocks for fusion and a fully connected layer to make predictions. To evaluate the performance of our proposed model, comprehensive experiments are conducted on three widely used benchmark datasets, including IEMOCAP, MELD, and CMU-MOSEI. The competitive results verify the effectiveness of our approach.
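MLCCT's low-level interaction relies on a circulant mechanism between modalities. One common way to realize circulant interaction, building a circulant matrix from one modality's feature vector and applying it to the other, is sketched below; this is a generic illustration of the operation, not the paper's exact block.

```python
import torch

def circulant(v):
    """Circulant matrix whose rows are successive cyclic rolls of v."""
    n = v.shape[0]
    return torch.stack([torch.roll(v, shifts=i, dims=0) for i in range(n)])

def circulant_interaction(audio_vec, text_vec):
    """Every rotated view of the audio feature interacts with the text feature,
    giving a richer cross-modal product than a single dot product."""
    return circulant(audio_vec) @ text_vec

audio_vec = torch.randn(8)   # low-level audio feature (toy size)
text_vec = torch.randn(8)    # low-level text feature
interaction = circulant_interaction(audio_vec, text_vec)   # shape (8,)
```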
Keywords: speech emotion recognition, self-supervised embedding model, cross-modal transformer, self-attention
Mechanism of Cross-modal Information Influencing Taste (Cited: 1)
17
Authors: Pei LIANG, Jia-yu JIANG, Qiang LIU, Su-lin ZHANG, Hua-jing YANG 《Current Medical Science》 SCIE CAS, 2020, Issue 3, pp. 474-479 (6 pages)
Studies on the integration of cross-modal information with taste perception have been mostly limited to the uni-modal level. Cross-modal sensory interaction and the neural networks of information processing and its control have not been fully explored, and the mechanisms remain poorly understood. This mini-review examines the impact of uni-modal and multi-modal information on taste perception from the perspective of cognitive status, such as emotion, expectation, and attention, and discusses the hypothesis that cognitive status is the key step by which the visual sense exerts its influence on taste. This work may help researchers better understand the mechanism of cross-modal information processing and further develop neurally based artificial intelligence (AI) systems.
Keywords: cross-modal information integration, cognitive status, taste perception
Multimodal Sentiment Analysis Based on a Cross-Modal Multihead Attention Mechanism (Cited: 1)
18
Authors: Lujuan Deng, Boyi Liu, Zuhe Li 《Computers, Materials & Continua》 SCIE EI, 2024, Issue 1, pp. 1157-1170 (14 pages)
Multimodal sentiment analysis aims to understand people's emotions and opinions from diverse data. Concatenating or multiplying the various modalities is a traditional multimodal sentiment analysis fusion method, but it does not utilize the correlation information between modalities. To solve this problem, this paper proposes a model based on a multi-head attention mechanism. First, the original data are preprocessed. Then, the feature representation is converted into a sequence of word vectors, and positional encoding is introduced to better capture the semantic and sequential information in the input sequence. Next, the encoded input sequence is fed into the transformer model for further processing and learning. At the transformer layer, a cross-modal attention block consisting of a pair of multi-head attention modules is employed to reflect the correlation between modalities. Finally, the processed results are fed into a feedforward neural network to obtain the emotional output through the classification layer. Through the above processing flow, the model can capture semantic information and contextual relationships and achieve good results in various natural language processing tasks. Our model was tested on the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) and Multimodal EmotionLines Dataset (MELD) datasets, achieving an accuracy of 82.04% and an F1 score of 80.59% on the former dataset.
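The pipeline above adds positional encoding to the word-vector sequence before the transformer layers. The standard sinusoidal positional encoding, which is one common choice, is sketched below; the abstract does not specify which variant the authors actually use, so this is illustrative only.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, dim):
    """Classic sine/cosine positional encoding added to token embeddings
    so the transformer can exploit order information."""
    position = torch.arange(seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

tokens = torch.randn(1, 50, 128)                       # (batch, sequence, dim)
tokens = tokens + sinusoidal_positional_encoding(50, 128)
```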
Keywords: emotion analysis, deep learning, cross-modal attention mechanism
ViT2CMH: Vision Transformer Cross-Modal Hashing for Fine-Grained Vision-Text Retrieval (Cited: 1)
19
Authors: Mingyong Li, Qiqi Li, Zheng Jiang, Yan Ma 《Computer Systems Science & Engineering》 SCIE EI, 2023, Issue 8, pp. 1401-1414 (14 pages)
In recent years, the development of deep learning has further improved hash retrieval technology. Most existing hashing methods currently use Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to process image and text information, respectively. This subjects images or texts to local constraints, and inherent label matching cannot capture fine-grained information, often leading to suboptimal results. Driven by the development of the transformer model, we propose a framework called ViT2CMH, based mainly on the Vision Transformer, to handle deep cross-modal hashing tasks rather than relying on CNNs or RNNs. Specifically, we use a BERT network to extract text features and use the Vision Transformer as the image network of the model. Finally, the features are transformed into hash codes for efficient and fast retrieval. We conduct extensive experiments on Microsoft COCO (MS-COCO) and Flickr30K, comparing with baselines from hashing methods and image-text matching methods, showing that our method achieves better performance.
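ViT2CMH maps Vision Transformer image features and BERT text features into hash codes for fast retrieval. A common way to binarize such features is a sign function over a learned projection, as in the hedged sketch below; the layer sizes, the tanh relaxation during training, and the Hamming-style ranking are common practice, not details taken from the paper.

```python
import torch
import torch.nn as nn

class HashHead(nn.Module):
    """Projects a modality feature to a k-bit code: tanh keeps the output
    differentiable during training, sign() gives binary codes at retrieval time."""
    def __init__(self, feat_dim=768, code_bits=64):
        super().__init__()
        self.proj = nn.Linear(feat_dim, code_bits)

    def forward(self, feats, binarize=False):
        h = torch.tanh(self.proj(feats))
        return torch.sign(h) if binarize else h

image_feats = torch.randn(4, 768)   # e.g., pooled Vision Transformer features
text_feats = torch.randn(4, 768)    # e.g., pooled BERT features
head = HashHead()
img_codes = head(image_feats, binarize=True)
txt_codes = head(text_feats, binarize=True)
# Hamming-style distance between codes: smaller means more similar.
dist = (img_codes.unsqueeze(1) != txt_codes.unsqueeze(0)).sum(-1)
```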
Keywords: hash learning, cross-modal retrieval, fine-grained matching, transformer
CSMCCVA: Framework of cross-modal semantic mapping based on cognitive computing of visual and auditory sensations (Cited: 1)
20
Authors: 刘扬 (Liu Yang), Zheng Fengbin, Zuo Xianyu 《High Technology Letters》 EI CAS, 2016, Issue 1, pp. 90-98 (9 pages)
Cross-modal semantic mapping and cross-media retrieval are key problems for multimedia search engines. This study analyzes the hierarchy, functionality, and structure of the visual and auditory sensations of the cognitive system, and establishes a brain-like cross-modal semantic mapping framework based on cognitive computing of visual and auditory sensations. The mechanisms of visual-auditory multisensory integration, selective attention in the thalamo-cortical system, emotional control in the limbic system, and memory enhancement in the hippocampus are considered in the framework. The algorithms for cross-modal semantic mapping are then given. Experimental results show that the framework can be effectively applied to cross-modal semantic mapping and also has important significance for brain-like computing with non-von Neumann architectures.
Keywords: multimedia neural cognitive computing (MNCC), brain-like computing, cross-modal semantic mapping (CSM), selective attention, limbic system, multisensory integration, memory-enhancing mechanism