Artificial intelligence has achieved remarkable success in materials science, accelerating novel material design. However, real-world material systems exhibit multiscale complexity spanning composition, processing, structure, and properties, which poses significant challenges for modeling. While some approaches fuse multiscale features to improve prediction, important modalities such as microstructure are often missing due to high acquisition costs. Existing methods struggle with incomplete data and lack a framework to bridge multiscale material knowledge. To address this, we propose MatMCL, a structure-guided multimodal learning framework that jointly analyzes multiscale material information and enables robust property prediction with incomplete modalities. Using a self-constructed multimodal dataset of electrospun nanofibers, we demonstrate that MatMCL improves mechanical property prediction without structural information, generates microstructures from processing parameters, and enables cross-modal retrieval. We further extend it via multi-stage learning and apply it to nanofiber-reinforced composite design. MatMCL uncovers processing-structure-property relationships, suggesting its promise as a generalizable approach for AI-driven material design.
The problem of fake news detection (FND) is becoming increasingly important in the field of natural language processing (NLP) because of the rapid dissemination of misleading information on the web. Large language models (LLMs) such as GPT-4.Zero excel in natural language understanding tasks but can still struggle to distinguish between fact and fiction, particularly when applied in the wild. Moreover, a key limitation of existing FND methods is that they consider only unimodal data (e.g., images), while richer multimodal data (e.g., user behaviour, temporal dynamics), which is crucial for full-context understanding, is neglected. To overcome these limitations, we introduce M3-FND (Multimodal Misinformation Mitigation for False News Detection), a novel methodological framework that integrates LLMs with multimodal data sources to perform context-aware veracity assessments. Our method is a hybrid system that combines image-text alignment, user credibility profiling, and temporal pattern recognition, and it is strengthened by a natural feedback loop that provides real-time feedback for correcting downstream errors. We use contextual reinforcement learning to schedule prompt updating and to update the classifier threshold based on the latest multimodal input, which enables the model to better adapt to changing misinformation attack strategies. M3-FND is tested on three diverse datasets, FakeNewsNet, Twitter15, and Weibo, which contain both text and visual social media content. Experiments show that M3-FND significantly outperforms conventional and LLM-based baselines in terms of accuracy, F1-score, and AUC on all benchmarks. Our results indicate the importance of employing multimodal cues and adaptive learning for effective and timely detection of fake news.
High-throughput transcriptomics has evolved from bulk RNA-seq to single-cell and spatial profiling, yet its clinical translation still depends on effective integration across diverse omics and data modalities. Emerging foundation models and multimodal learning frameworks are enabling scalable and transferable representations of cellular states, while advances in interpretability and real-world data integration are bridging the gap between discovery and clinical application. This paper outlines a concise roadmap for AI-driven, transcriptome-centered multi-omics integration in precision medicine (Figure 1).
The rapid proliferation of multimodal misinformation on social media demands detection frameworks that are not only accurate but also robust to noise, adversarial manipulation, and semantic inconsistency between modalities. Existing multimodal fake news detection approaches often rely on deterministic fusion strategies, which limits their ability to model uncertainty and complex cross-modal dependencies. To address these challenges, we propose Q-ALIGNer, a quantum-inspired multimodal framework that integrates classical feature extraction with quantum-state encoding, learnable cross-modal entanglement, and robustness-aware training objectives. The proposed framework adopts quantum formalism as a representational abstraction, enabling probabilistic modeling of multimodal alignment while remaining fully executable on classical hardware. Q-ALIGNer is evaluated on four widely used benchmark datasets (FakeNewsNet, Fakeddit, Weibo, and MediaEval VMU) covering diverse platforms, languages, and content characteristics. Experimental results demonstrate consistent performance improvements over strong text-only, vision-only, multimodal, and quantum-inspired baselines, including BERT, RoBERTa, XLNet, ResNet, EfficientNet, ViT, Multimodal-BERT, ViLBERT, and QEMF. Q-ALIGNer achieves accuracies of 91.2%, 92.9%, 91.7%, and 92.1% on FakeNewsNet, Fakeddit, Weibo, and MediaEval VMU, respectively, with F1-score gains of 3–4 percentage points over QEMF. Robustness evaluation shows a reduced adversarial accuracy gap of 2.6%, compared to 7%–9% for baseline models, while calibration analysis indicates improved reliability with an expected calibration error of 0.031. In addition, computational analysis shows that Q-ALIGNer reduces training time to 19.6 h compared to 48.2 h for QEMF at a comparable parameter scale. These results indicate that quantum-inspired alignment and entanglement can enhance robustness, uncertainty awareness, and efficiency in multimodal fake news detection, positioning Q-ALIGNer as a principled and practical content-centric framework for misinformation analysis.
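The Q-ALIGNer abstract above reports an expected calibration error (ECE) without saying how it was computed. As a rough illustration only, the following sketch shows one common way to compute ECE with equal-width confidence bins. The function name, the choice of 10 bins, and the binning convention are assumptions made for this example and are not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE with equal-width bins over [0, 1].

    confidences: predicted probability of the chosen class, one per sample.
    correct:     truthy if the prediction was right, one per sample.
    n_bins:      number of bins (10 is an assumption, not the paper's setting).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one sample is required")

    conf_sum = [0.0] * n_bins   # sum of confidences per bin
    hit_sum = [0.0] * n_bins    # number of correct predictions per bin
    count = [0] * n_bins        # number of samples per bin

    for c, ok in zip(confidences, correct):
        # Bins are [lo, hi); a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += 1.0 if ok else 0.0
        count[idx] += 1

    ece = 0.0
    for cs, hs, k in zip(conf_sum, hit_sum, count):
        if k == 0:
            continue
        # Weight each bin's |accuracy - mean confidence| by its share of samples.
        ece += (k / n) * abs(hs / k - cs / k)
    return ece
```

For instance, four predictions made at confidence 0.9 of which three are correct give a gap of |0.75 - 0.9| in a single bin, so the ECE is about 0.15; lower values indicate that stated confidence tracks actual accuracy more closely.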
Multimodal emotion recognition (MER) has emerged as a key research area for enabling human-centered artificial intelligence, supported by rapid progress in vision, audio, language, and physiological modeling. Existing approaches integrate heterogeneous affective cues through diverse embedding strategies and fusion mechanisms, yet the field remains fragmented due to differences in feature alignment, temporal synchronization, modality reliability, and robustness to noise or missing inputs. This survey provides a comprehensive analysis of MER research from 2021 to 2025, consolidating advances in modality-specific representation learning, cross-modal feature construction, and early, late, and hybrid fusion paradigms. We systematically review visual, acoustic, textual, and sensor-based embeddings, highlighting how pre-trained encoders, self-supervised learning, and large language models have reshaped the representational foundations of MER. We further categorize fusion strategies by interaction depth and architectural design, examining how attention mechanisms, cross-modal transformers, adaptive gating, and multimodal large language models redefine the integration of affective signals. Finally, we summarize major benchmark datasets and evaluation metrics and discuss emerging challenges related to scalability, generalization, and interpretability. This survey aims to provide a unified perspective on multimodal fusion for emotion recognition and to guide future research toward more coherent and generalizable multimodal affective intelligence.
Multimodal deep learning has emerged as a key paradigm in contemporary medical diagnostics, advancing precision medicine by enabling integration and learning from diverse data sources. The exponential growth of high-dimensional healthcare data, encompassing genomic, transcriptomic, and other omics profiles as well as radiological imaging and histopathological slides, makes this approach increasingly important because, when examined separately, these data sources offer only a fragmented picture of intricate disease processes. Multimodal deep learning leverages the complementary properties of multiple data modalities to enable more accurate prognostic modeling, more robust disease characterization, and improved treatment decision-making. This review provides a comprehensive overview of the current state of multimodal deep learning approaches in medical diagnosis. We classify and examine important application domains: (1) radiology, where automated report generation and lesion detection are facilitated by image-text integration; (2) histopathology, where fusion models improve tumor classification and grading; and (3) multi-omics, where molecular subtypes and latent biomarkers are revealed through cross-modal learning. For each domain, we provide an overview of representative research, methodological advancements, and clinical consequences. Additionally, we critically analyze the fundamental issues preventing wider adoption, including computational complexity (particularly in training scalable, multi-branch networks), data heterogeneity (resulting from modality-specific noise, resolution variations, and inconsistent annotations), and the challenge of maintaining significant cross-modal correlations during fusion. Beyond limiting performance and generalizability, these problems impede interpretability, which is crucial for clinical trust and use. Lastly, we outline important areas for future research, including the development of standardized protocols for harmonizing data, the creation of lightweight and interpretable fusion architectures, the integration of real-time clinical decision support systems, and the promotion of cooperation for federated multimodal learning. Our goal is to provide researchers and clinicians with a concise overview of the field’s present state, enduring constraints, and promising directions for further research.
Visual question answering (VQA) is a multimodal task that involves a deep understanding of the image scene and the question’s meaning, and capturing the relevant correlations between both modalities to infer the appropriate answer. In this paper, we propose a VQA system intended to answer yes/no questions about real-world images in Arabic. To support a robust VQA system, we work in two directions: (1) using deep neural networks, namely ResNet-152 and Gated Recurrent Units (GRU), to semantically represent the given image and question in a fine-grained manner; and (2) studying the role of the multimodal bilinear pooling fusion technique in the trade-off between model complexity and overall model performance. Some fusion techniques can significantly increase model complexity, which seriously limits their applicability for VQA models. So far, there is no evidence of how efficient these multimodal bilinear pooling fusion techniques are for VQA systems dedicated to yes/no questions. Hence, a comparative analysis is conducted between eight bilinear pooling fusion techniques in terms of their ability to reduce model complexity and improve model performance for this class of VQA systems. Experiments indicate that these multimodal bilinear pooling fusion techniques improve the VQA model’s performance, reaching a best performance of 89.25%. Further, experiments show that the number of answers in the developed VQA system is a critical factor that affects the effectiveness of these multimodal bilinear pooling techniques in achieving their main objective of reducing model complexity. The Multimodal Local Perception Bilinear Pooling (MLPB) technique shows the best balance between model complexity and performance for VQA systems designed to answer yes/no questions.
Automated waste sorting can dramatically increase waste sorting efficiency and reduce its regulation cost. Most current methods use only a single modality, such as image data or acoustic data, for waste classification, which makes it difficult to classify mixed and confusable wastes. In these complex situations, using multiple modalities becomes necessary to achieve high classification accuracy. Traditionally, the fusion of multiple modalities has been limited by fixed handcrafted features. In this study, a deep-learning approach was applied to multimodal fusion at the feature level for municipal solid-waste sorting. More specifically, the pre-trained VGG16 and one-dimensional convolutional neural networks (1D CNNs) were utilized to extract features from visual data and acoustic data, respectively. These deeply learned features were then fused in the fully connected layers for classification. The results of comparative experiments showed that the proposed method was superior to the single-modality methods. Additionally, the feature-based fusion strategy performed better than the decision-based strategy with deeply learned features.
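The waste-sorting abstract above contrasts feature-level fusion with decision-level fusion. The following minimal sketch illustrates the difference between the two strategies on toy vectors. It is not the authors' network: the helper names, the single linear layer standing in for the fully connected layers, and the equal weighting in the decision-level case are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """One dense layer: weights is a list of rows, one row per output class."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def feature_level_fusion(visual_feat, acoustic_feat, weights, bias):
    """Concatenate the per-modality features, then classify the joint vector."""
    fused = list(visual_feat) + list(acoustic_feat)
    return softmax(linear(weights, bias, fused))


def decision_level_fusion(visual_probs, acoustic_probs, visual_weight=0.5):
    """Combine class probabilities produced separately by each modality."""
    return [
        visual_weight * v + (1.0 - visual_weight) * a
        for v, a in zip(visual_probs, acoustic_probs)
    ]
```

In the feature-level case the classifier sees both modalities at once and can learn interactions between them, whereas in the decision-level case each modality is classified on its own and only the resulting probabilities are mixed.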
Modern computational models have leveraged biological advances in human brain research. This study addresses the problem of multimodal learning with the help of brain-inspired models. Specifically, a unified multimodal learning architecture is proposed based on deep neural networks, which are inspired by the biology of the visual cortex of the human brain. This unified framework is validated by two practical multimodal learning tasks: image captioning, involving visual and natural language signals, and visual-haptic fusion, involving haptic and visual signals. Extensive experiments are conducted under the framework, and competitive results are achieved.
This mixed-method empirical study investigated the role of learning strategies and motivation in predicting L2 Chinese learning outcomes in an online multimodal learning environment. Both quantitative and qualitative approaches were also used to examine the learners' perspectives on online multimodal Chinese learning. The participants were fifteen pre-intermediate adult Chinese learners aged 18-26. They were originally from different countries (Spain, Italy, Argentina, Colombia, and Mexico) and lived in Barcelona. They were multilingual, speaking more than two European languages, without exposure to any Asian language other than Chinese. The instruments comprised the Strategy Inventory for Language Learning (SILL), a motivation questionnaire, a learner perception questionnaire, and a focus group interview. The trial period lasted three months; after the experiment, the data were analyzed using the Spearman correlation coefficient. The statistical analysis showed that strategy use was highly correlated with online multimodal Chinese learning outcomes, indicating that strategy use played a vital role in online multimodal Chinese learning. Motivation was also found to have a significant effect. The perception questionnaire revealed that the students were satisfied overall with, and favored, the online multimodal learning experience design. Detailed insights from the participants are presented in the analysis of the transcribed focus group interviews.
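The study above relies on the Spearman correlation coefficient, which measures how well the relationship between two variables can be described by a monotonic function. As an illustration only, here is a small pure-Python sketch that ranks both variables (averaging ranks for ties) and then computes the Pearson correlation of the ranks. The function names are hypothetical, and the study's own analysis software is not stated in the abstract.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Because only ranks matter, any strictly increasing relationship between, say, strategy-use scores and outcome scores yields a coefficient of 1, which makes the measure suitable for small samples and ordinal questionnaire data such as those described here.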
Electronic noses and thermal imaging are effective means of detecting the presence of gases in real time. Multimodal fusion of these modalities can result in the development of highly accurate diagnostic systems. Low-cost thermal imaging software produces low-resolution thermal images in grayscale format, necessitating methods for improving the resolution and colorizing the images. The objective of this paper is to develop and train a super-resolution generative adversarial network for improving the resolution of the thermal images, followed by a sparse autoencoder for colorization of thermal images and a multimodal convolutional neural network for gas detection using electronic nose and thermal images. The dataset used comprises 6400 thermal images and electronic nose measurements for four classes. A multimodal Convolutional Neural Network (CNN) comprising an EfficientNetB2 pre-trained model was developed using both early and late feature fusion. A sparse autoencoder was trained on the grayscale and colorized thermal images. The Super Resolution Generative Adversarial Network (SRGAN) model was developed and trained on low- and high-resolution thermal images, achieving a Structural Similarity Index (SSIM) of 90.28, a Peak Signal-to-Noise Ratio (PSNR) of 68.74, and a Mean Absolute Error (MAE) of 0.066. The autoencoder model produced an MAE of 0.035, a Mean Squared Error (MSE) of 0.006, and a Root Mean Squared Error (RMSE) of 0.0705. The multimodal CNN, trained on these images and electronic nose measurements using both early and late fusion techniques, achieved accuracies of 97.89% and 98.55%, respectively. Hence, the proposed framework can be of great aid when integrated with low-cost software to generate high-quality thermal camera images and to detect gases accurately in real time.
Electrocardiogram (ECG) analysis is critical for detecting arrhythmias, but traditional methods struggle with large-scale ECG data and rare arrhythmia events in imbalanced datasets. These methods neither perform multi-perspective learning of temporal signals and ECG images nor fully extract the latent information within the data, falling short of the accuracy required by clinicians. Therefore, this paper proposes a hybrid multimodal spatiotemporal neural network to address these challenges. The model employs a multimodal data augmentation framework integrating visual and signal-based features to enhance the classification performance of rare arrhythmias in imbalanced datasets. Additionally, the spatiotemporal fusion module incorporates a spatiotemporal graph convolutional network to jointly model temporal and spatial features, uncovering complex dependencies within the ECG data and improving the model’s ability to represent complex patterns. In experiments conducted on the MIT-BIH arrhythmia dataset, the model achieved 99.95% accuracy, 99.80% recall, and a 99.78% F1 score. The model was further validated for generalization using the clinical INCART arrhythmia dataset, and the results demonstrated its effectiveness in terms of both generalization and robustness.
Emotion recognition under uncontrolled and noisy environments presents persistent challenges in the design of emotionally responsive systems. The current study introduces an audio-visual recognition framework designed to address performance degradation caused by environmental interference, such as background noise, overlapping speech, and visual obstructions. The proposed framework employs a structured fusion approach, combining early-stage feature-level integration with decision-level coordination guided by temporal attention mechanisms. Audio data are transformed into mel-spectrogram representations, and visual data are represented as raw frame sequences. Spatial and temporal features are extracted through convolutional and transformer-based encoders, allowing the framework to capture complementary and hierarchical information from both sources. A cross-modal attention module enables selective emphasis on relevant signals while suppressing modality-specific noise. Performance is validated on a modified version of the AFEW dataset, in which controlled noise is introduced to emulate realistic conditions. The framework achieves higher classification accuracy than comparative baselines, confirming increased robustness under conditions of cross-modal disruption. This result demonstrates the suitability of the proposed method for deployment in practical emotion-aware technologies operating outside controlled environments. The study also contributes a systematic approach to fusion design and supports further exploration in the direction of resilient multimodal emotion analysis frameworks. The source code is publicly available at https://github.com/asmoon002/AVER (accessed on 18 August 2025).
With the rise of encrypted traffic, traditional network analysis methods have become less effective, leading to a shift towards deep learning-based approaches. Among these, multimodal learning-based classification methods have gained attention due to their ability to leverage diverse feature sets from encrypted traffic, improving classification accuracy. However, existing research predominantly relies on late fusion techniques, which hinder the full utilization of deep features within the data. To address this limitation, we propose a novel multimodal encrypted traffic classification model that synchronizes modality fusion with multiscale feature extraction. Specifically, our approach performs real-time fusion of modalities at each stage of feature extraction, enhancing feature representation at each level and preserving inter-level correlations for more effective learning. This continuous fusion strategy improves the model’s ability to detect subtle variations in encrypted traffic while boosting its robustness and adaptability to evolving network conditions. Experimental results on two real-world encrypted traffic datasets demonstrate that our method achieves classification accuracies of 98.23% and 97.63%, respectively, outperforming existing multimodal learning-based methods.
The potential for reducing greenhouse gas (GHG) emissions and energy consumption in wastewater treatment can be realized through intelligent control, with machine learning (ML) and multimodality emerging as a promising solution. Here, we introduce an ML technique based on multimodal strategies, focusing specifically on intelligent aeration control in wastewater treatment plants (WWTPs). The generalization of the multimodal strategy is demonstrated on eight ML models. The results demonstrate that this multimodal strategy significantly enhances model indicators for ML in environmental science and the efficiency of aeration control, exhibiting exceptional performance and interpretability. Integrating random forest with visual models achieves the highest accuracy in forecasting aeration quantity among the multimodal models, with a mean absolute percentage error of 4.4% and a coefficient of determination of 0.948. Practical testing in a full-scale plant reveals that the multimodal model can reduce operation costs by 19.8% compared to traditional fuzzy control methods. The potential application of these strategies in critical water science domains is discussed. To foster accessibility and promote widespread adoption, the multimodal ML models are freely available on GitHub, thereby eliminating technical barriers and encouraging the application of artificial intelligence in urban wastewater treatment.
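The aeration-control abstract above reports a mean absolute percentage error (MAPE) and a coefficient of determination (R²). For readers unfamiliar with these metrics, the sketch below computes both from their textbook definitions. The function names and the sample values in the usage note are illustrative assumptions, not data from the study.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the true values are constant")
    return 1.0 - ss_res / ss_tot
```

With hypothetical aeration quantities of 100, 200, and 400 and forecasts of 110, 190, and 400, the relative errors are 10%, 5%, and 0%, so `mape` returns about 5.0, while `r_squared` is close to 1 because the residuals are small relative to the spread of the true values.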
This paper presents an end-to-end deep learning method to solve geometry problems via feature learning and contrastive learning of multimodal data. A key challenge in solving geometry problems using deep learning is to adapt automatically to the task of understanding both single-modal and multimodal problems. Existing methods focus on either single-modal or multimodal problems, and neither kind carries over to the other. A general geometry problem solver should be able to process problems of various modalities at the same time. In this paper, a shared feature-learning model of multimodal data is adopted to learn a unified feature representation of text and image, which addresses the heterogeneity between multimodal geometry problems. A contrastive learning model of multimodal data enhances the semantic relevance between multimodal features and maps them into a unified semantic space, which can effectively adapt to both single-modal and multimodal downstream tasks. Based on the feature extraction and fusion of multimodal data, the proposed geometry problem solver uses relation extraction, theorem reasoning, and problem solving to present solutions in a readable way. Experimental results show the effectiveness of the method.
The contribution of this work is twofold: (1) A multimodality prediction method for chaotic time series based on the Gaussian process mixture (GPM) model is proposed, which employs a divide-and-conquer strategy. It automatically divides the chaotic time series into multiple modalities with different extrinsic patterns and intrinsic characteristics, and thus can fit the chaotic time series more precisely. (2) An effective sparse hard-cut expectation maximization (SHC-EM) learning algorithm for the GPM model is proposed to improve the prediction performance. SHC-EM replaces a large learning sample set with fewer pseudo inputs, accelerating model learning based on these pseudo inputs. Experiments on Lorenz and Chua time series demonstrate that the proposed method yields not only accurate multimodality prediction but also the prediction confidence interval. SHC-EM outperforms traditional variational learning in terms of both prediction accuracy and speed. In addition, SHC-EM is more robust and less susceptible to noise than variational learning.
Foundation models are reshaping artificial intelligence, yet their deployment in specialised domains such as agricultural question answering (AQA) still faces challenges including data scarcity and barriers to domain-specific knowledge. To systematically review recent progress in this area, this paper adopts a task–paradigm perspective and examines applications across three major AQA task families. For text-based QA, we analyse the strengths and limitations of retrieval-based, generative, and hybrid approaches built on large language models, revealing a clear trend toward hybrid paradigms that balance precision and flexibility. For visual diagnosis, we discuss techniques such as cross-modal alignment and prompt-driven generation, which are pushing systems beyond simple pest and disease recognition toward deeper causal reasoning. For multimodal reasoning, we show how the fusion of heterogeneous data, including text, images, speech, and sensor streams, enables comprehensive decision-making for diagnosis, monitoring, and yield prediction. To address the lack of unified benchmarks, we further propose a standardised evaluation protocol and a diagnostic taxonomy specifically designed to characterise agriculture-specific errors. Finally, we outline a concrete AQA roadmap that emphasises safety alignment, hallucination control, and lightweight deployment, aiming to guide future systems toward greater efficiency, trustworthiness, and sustainability.
Humanoid robots hold significant promise for social interaction and emotional companionship. However, their effectiveness hinges on the ability to convey nuanced and authentic emotions. Here, we present a universal humanoid robot head with a facial kinematics model. Using a reinforcement learning framework guided by symmetry assessment, emotion decoupling, and MLLM authenticity evaluation, our system autonomously learns to generate adaptive facial expressions through dynamic landmark adjustments. By transferring the simulation training results to real-world environments, the robot can produce natural and expressive facial expressions. Another novel feature is the independent regulation of emotion intensity and expression magnitude across emotional categories, which significantly enhances the ability to achieve culturally adaptive and socially resonant robotic expressions. This research advances adaptive humanoid interaction, offering an easier and more efficient pathway toward culturally resonant and psychologically plausible robotic expressions.
Cross-lingual image description, the task of generating image captions in a target language from images and descriptions in a source language, is addressed in this study through a novel approach that combines neural network models and semantic matching techniques. Experiments conducted on the Flickr8k and AraImg2k benchmark datasets, featuring images and descriptions in English and Arabic, show performance improvements over state-of-the-art methods. Our model, equipped with the Image & Cross-Language Semantic Matching module and the Target Language Domain Evaluation module, significantly enhances the semantic relevance of generated image descriptions. For English-to-Arabic and Arabic-to-English cross-language image descriptions, our approach achieves CIDEr scores for English and Arabic of 87.9% and 81.7%, respectively. Comparative analyses with previous works further affirm the superior performance of our approach, and visual results show that our model generates image captions that are both semantically accurate and stylistically consistent with the target language. In summary, this study advances the field of cross-lingual image description, offering an effective solution for generating image captions across languages, with the potential to impact multilingual communication and accessibility. Future research directions include expanding to more languages and incorporating diverse visual and textual data sources.
Funding: Supported by the National Key Research and Development Program of China (2022YFB3807300), the Zhejiang Provincial Natural Science Foundation of China (LR25E030001), the Key Research and Development Project of Zhejiang Province (2024C03073), the State Key Laboratory of Transvascular Implantation Devices (012024019), and the Transvascular Implantation Devices Research Institute China (TIDRIC) (KY012024007, KY012024009).
Abstract: Artificial intelligence has achieved remarkable success in materials science, accelerating novel material design. However, real-world material systems exhibit multiscale complexity spanning composition, processing, structure, and properties, which poses significant challenges for modeling. While some approaches fuse multiscale features to improve prediction, important modalities such as microstructure are often missing due to high acquisition costs. Existing methods struggle with incomplete data and lack a framework to bridge multiscale material knowledge. To address this, we propose MatMCL, a structure-guided multimodal learning framework that jointly analyzes multiscale material information and enables robust property prediction with incomplete modalities. Using a self-constructed multimodal dataset of electrospun nanofibers, we demonstrate that MatMCL improves mechanical property prediction without structural information, generates microstructures from processing parameters, and enables cross-modal retrieval. We further extend it via multi-stage learning and apply it to nanofiber-reinforced composite design. MatMCL uncovers processing-structure-property relationships, suggesting its promise as a generalizable approach for AI-driven material design.
Abstract: The problem of fake news detection (FND) is becoming increasingly important in the field of natural language processing (NLP) because of the rapid dissemination of misleading information on the web. Large language models (LLMs) such as GPT-4.Zero excel in natural language understanding tasks but can still struggle to distinguish between fact and fiction, particularly when applied in the wild. Moreover, a key limitation of existing FND methods is that they only consider unimodal data (e.g., images), while more detailed multimodal data (e.g., user behaviour, temporal dynamics) is neglected, even though the latter is crucial for full-context understanding. To overcome these limitations, we introduce M3-FND (Multimodal Misinformation Mitigation for False News Detection), a novel methodological framework that integrates LLMs with multimodal data sources to perform context-aware veracity assessments. Our method is a hybrid system that combines image-text alignment, user credibility profiling, and temporal pattern recognition, and is strengthened by a natural feedback loop that provides real-time feedback for correcting downstream errors. We use contextual reinforcement learning to schedule prompt updates and to update the classifier threshold based on the latest multimodal input, which enables the model to better adapt to changing misinformation attack strategies. M3-FND is tested on three diverse datasets, FakeNewsNet, Twitter15, and Weibo, which contain both text and visual social media content. Experiments show that M3-FND significantly outperforms conventional and LLM-based baselines in terms of accuracy, F1-score, and AUC on all benchmarks. Our results indicate the importance of employing multimodal cues and adaptive learning for effective and timely detection of fake news.
Abstract: High-throughput transcriptomics has evolved from bulk RNA-seq to single-cell and spatial profiling, yet its clinical translation still depends on effective integration across diverse omics and data modalities. Emerging foundation models and multimodal learning frameworks are enabling scalable and transferable representations of cellular states, while advances in interpretability and real-world data integration are bridging the gap between discovery and clinical application. This paper outlines a concise roadmap for AI-driven, transcriptome-centered multi-omics integration in precision medicine (Figure 1).
Funding: Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R77), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia; and the Deanship of Scientific Research at Northern Border University, Arar, Saudi Arabia, through project number NBU-FFR-2026-2248-02.
Abstract: The rapid proliferation of multimodal misinformation on social media demands detection frameworks that are not only accurate but also robust to noise, adversarial manipulation, and semantic inconsistency between modalities. Existing multimodal fake news detection approaches often rely on deterministic fusion strategies, which limits their ability to model uncertainty and complex cross-modal dependencies. To address these challenges, we propose Q-ALIGNer, a quantum-inspired multimodal framework that integrates classical feature extraction with quantum-state encoding, learnable cross-modal entanglement, and robustness-aware training objectives. The proposed framework adopts quantum formalism as a representational abstraction, enabling probabilistic modeling of multimodal alignment while remaining fully executable on classical hardware. Q-ALIGNer is evaluated on four widely used benchmark datasets (FakeNewsNet, Fakeddit, Weibo, and MediaEval VMU) covering diverse platforms, languages, and content characteristics. Experimental results demonstrate consistent performance improvements over strong text-only, vision-only, multimodal, and quantum-inspired baselines, including BERT, RoBERTa, XLNet, ResNet, EfficientNet, ViT, Multimodal-BERT, ViLBERT, and QEMF. Q-ALIGNer achieves accuracies of 91.2%, 92.9%, 91.7%, and 92.1% on FakeNewsNet, Fakeddit, Weibo, and MediaEval VMU, respectively, with F1-score gains of 3 to 4 percentage points over QEMF. Robustness evaluation shows a reduced adversarial accuracy gap of 2.6%, compared to 7% to 9% for baseline models, while calibration analysis indicates improved reliability with an expected calibration error of 0.031. In addition, computational analysis shows that Q-ALIGNer reduces training time to 19.6 h compared to 48.2 h for QEMF at a comparable parameter scale. These results indicate that quantum-inspired alignment and entanglement can enhance robustness, uncertainty awareness, and efficiency in multimodal fake news detection, positioning Q-ALIGNer as a principled and practical content-centric framework for misinformation analysis.
Funding: Supported by the Institute of Information & Communications Technology Planning & Evaluation grant funded by the Korea government (MSIT) (No. RS-2021-II211341, AI Graduate School Support Program, Chung-Ang University), and in part by the Institute of Information and Communications Technology Planning and Evaluation grant funded by the Korea government (MSIT) (Development of Integrated Development Framework that Supports Automatic Neural Network Generation and Deployment Optimized for Runtime Environment, Grant No. 2021-0-00766).
Abstract: Multimodal emotion recognition (MER) has emerged as a key research area for enabling human-centered artificial intelligence, supported by the rapid progress in vision, audio, language, and physiological modeling. Existing approaches integrate heterogeneous affective cues through diverse embedding strategies and fusion mechanisms, yet the field remains fragmented due to differences in feature alignment, temporal synchronization, modality reliability, and robustness to noise or missing inputs. This survey provides a comprehensive analysis of MER research from 2021 to 2025, consolidating advances in modality-specific representation learning, cross-modal feature construction, and early, late, and hybrid fusion paradigms. We systematically review visual, acoustic, textual, and sensor-based embeddings, highlighting how pre-trained encoders, self-supervised learning, and large language models have reshaped the representational foundations of MER. We further categorize fusion strategies by interaction depth and architectural design, examining how attention mechanisms, cross-modal transformers, adaptive gating, and multimodal large language models redefine the integration of affective signals. Finally, we summarize major benchmark datasets and evaluation metrics and discuss emerging challenges related to scalability, generalization, and interpretability. This survey aims to provide a unified perspective on multimodal fusion for emotion recognition and to guide future research toward more coherent and generalizable multimodal affective intelligence.
Abstract: Multimodal deep learning has emerged as a key paradigm in contemporary medical diagnostics, advancing precision medicine by enabling integration and learning from diverse data sources. The exponential growth of high-dimensional healthcare data, encompassing genomic, transcriptomic, and other omics profiles, as well as radiological imaging and histopathological slides, makes this approach increasingly important because, when examined separately, these data sources only offer a fragmented picture of intricate disease processes. Multimodal deep learning leverages the complementary properties of multiple data modalities to enable more accurate prognostic modeling, more robust disease characterization, and improved treatment decision-making. This review provides a comprehensive overview of the current state of multimodal deep learning approaches in medical diagnosis. We classify and examine important application domains, such as (1) radiology, where automated report generation and lesion detection are facilitated by image-text integration; (2) histopathology, where fusion models improve tumor classification and grading; and (3) multi-omics, where molecular subtypes and latent biomarkers are revealed through cross-modal learning. We provide an overview of representative research, methodological advancements, and clinical consequences for each domain. Additionally, we critically analyze the fundamental issues preventing wider adoption, including computational complexity (particularly in training scalable, multi-branch networks), data heterogeneity (resulting from modality-specific noise, resolution variations, and inconsistent annotations), and the challenge of maintaining significant cross-modal correlations during fusion. In addition to limiting performance and generalizability, these problems impede interpretability, which is crucial for clinical trust and use. Lastly, we outline important areas for future research, including the development of standardized protocols for harmonizing data, the creation of lightweight and interpretable fusion architectures, the integration of real-time clinical decision support systems, and the promotion of cooperation for federated multimodal learning. Our goal is to provide researchers and clinicians with a concise overview of the field's present state, enduring constraints, and exciting directions for further research through this review.
Abstract: Visual question answering (VQA) is a multimodal task that involves a deep understanding of the image scene and the question's meaning, and capturing the relevant correlations between both modalities to infer the appropriate answer. In this paper, we propose a VQA system intended to answer yes/no questions about real-world images, in Arabic. To support a robust VQA system, we work in two directions: (1) using deep neural networks, namely ResNet-152 and Gated Recurrent Units (GRU), to semantically represent the given image and question in a fine-grained manner; (2) studying the role of the utilized multimodal bilinear pooling fusion technique in the trade-off between the model complexity and the overall model performance. Some fusion techniques could significantly increase the model complexity, which seriously limits their applicability for VQA models. So far, there is no evidence of how efficient these multimodal bilinear pooling fusion techniques are for VQA systems dedicated to yes/no questions. Hence, a comparative analysis is conducted between eight bilinear pooling fusion techniques, in terms of their ability to reduce the model complexity and improve the model performance in this case of VQA systems. Experiments indicate that these multimodal bilinear pooling fusion techniques improve the VQA model's performance, reaching a best performance of 89.25%. Further, experiments show that the number of answers in the developed VQA system is a critical factor that affects the effectiveness of these multimodal bilinear pooling techniques in achieving their main objective of reducing the model complexity. The Multimodal Local Perception Bilinear Pooling (MLPB) technique has shown the best balance between the model complexity and its performance for VQA systems designed to answer yes/no questions.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 51875507, 52005439) and the Key Research and Development Program of Zhejiang Province (Grant No. 2021C01018).
Abstract: Automated waste sorting can dramatically increase waste sorting efficiency and reduce its regulation cost. Most current methods use only a single modality, such as image data or acoustic data, for waste classification, which makes it difficult to classify mixed and confusable wastes. In these complex situations, using multiple modalities becomes necessary to achieve a high classification accuracy. Traditionally, the fusion of multiple modalities has been limited by fixed handcrafted features. In this study, a deep-learning approach was applied to multimodal fusion at the feature level for municipal solid-waste sorting. More specifically, the pre-trained VGG16 and one-dimensional convolutional neural networks (1D CNNs) were utilized to extract features from visual data and acoustic data, respectively. These deeply learned features were then fused in the fully connected layers for classification. The results of comparative experiments showed that the proposed method was superior to the single-modality methods. Additionally, the feature-based fusion strategy performed better than the decision-based strategy with deeply learned features.
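The difference between the feature-based and decision-based fusion strategies compared in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy feature vectors, and the weighted-average rule for decision fusion are all assumptions made for illustration, and plain Python lists stand in for the VGG16 and 1D CNN outputs.

```python
from typing import List, Optional, Sequence


def feature_level_fusion(visual: Sequence[float], acoustic: Sequence[float]) -> List[float]:
    """Feature-level fusion: join the per-modality feature vectors into one
    vector that a single shared classifier would then consume."""
    return list(visual) + list(acoustic)


def decision_level_fusion(
    class_probs: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Decision-level fusion: weighted average of the class-probability
    vectors produced by separate per-modality classifiers.
    (Averaging is one common rule, assumed here; the paper may use another.)"""
    if weights is None:
        weights = [1.0] * len(class_probs)
    total = sum(weights)
    n_classes = len(class_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, class_probs)) / total
        for c in range(n_classes)
    ]


# Toy values, invented for illustration only.
visual_features = [0.2, 0.7]   # hypothetical image-branch features
acoustic_features = [0.9]      # hypothetical audio-branch features
fused_features = feature_level_fusion(visual_features, acoustic_features)

visual_probs = [0.8, 0.2]      # hypothetical image-branch class probabilities
acoustic_probs = [0.4, 0.6]    # hypothetical audio-branch class probabilities
fused_probs = decision_level_fusion([visual_probs, acoustic_probs])
```

In the feature-level case the classifier sees both modalities at once and can learn interactions between them; in the decision-level case each modality is classified independently and only the outputs are combined.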
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 61621136008, 61327809, 61210013, 91420302, and 91520201).
Abstract: Modern computational models have leveraged biological advances in human brain research. This study addresses the problem of multimodal learning with the help of brain-inspired models. Specifically, a unified multimodal learning architecture is proposed based on deep neural networks, which are inspired by the biology of the visual cortex of the human brain. This unified framework is validated on two practical multimodal learning tasks: image captioning, involving visual and natural language signals, and visual-haptic fusion, involving haptic and visual signals. Extensive experiments are conducted under the framework, and competitive results are achieved.
Abstract: This mixed-method empirical study investigated the role of learning strategies and motivation in predicting L2 Chinese learning outcomes in an online multimodal learning environment. Quantitative and qualitative approaches were also used to examine the learners' perspectives on online multimodal Chinese learning. The participants in this study were fifteen pre-intermediate adult Chinese learners aged 18-26. They were originally from different countries (Spain, Italy, Argentina, Colombia, and Mexico) and lived in Barcelona. They were multilingual, speaking more than two European languages, without exposure to any Asian language other than Chinese. The instruments comprised the Strategy Inventory for Language Learning (SILL), a motivation questionnaire, a learner perception questionnaire, and a focus group interview. The trial period lasted three months; after the experiment, the data were analyzed using the Spearman correlation coefficient. The statistical analysis showed that strategy use was highly correlated with online multimodal Chinese learning outcomes, indicating that strategy use played a vital role in online multimodal Chinese learning. Motivation was also found to have a significant effect. The perception questionnaire revealed that the students were satisfied overall with the online multimodal learning experience design and viewed it favorably. Detailed insights from the participants are presented in the analysis of the focus group interview transcripts.
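For readers unfamiliar with the statistic named in this abstract, the Spearman correlation coefficient is the Pearson correlation computed on ranks, with tied values receiving their average rank. The sketch below uses only the standard library; the function names and the sample scores are hypothetical and are not taken from the study.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical strategy-use scores and test outcomes for five learners.
strategy_scores = [2.1, 3.4, 3.9, 4.2, 4.8]
outcomes = [55, 61, 70, 82, 95]
rho = spearman(strategy_scores, outcomes)
```

Because it depends only on ranks, the coefficient suits small samples and ordinal questionnaire data such as the fifteen-participant design described above.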
Funding: Funded by the Centre for Advanced Modelling and Geospatial Information Systems (CAMGIS), Faculty of Engineering and IT, University of Technology Sydney, and supported by the Researchers Supporting Project, King Saud University, Riyadh, Saudi Arabia, under Project RSP2025 R14.
Abstract: Electronic noses and thermal images are effective means of detecting the presence of gases in real time. Multimodal fusion of these modalities can result in the development of highly accurate diagnostic systems. Low-cost thermal imaging software produces low-resolution thermal images in grayscale format, hence necessitating methods for improving the resolution and colorizing the images. The objective of this paper is to develop and train a super-resolution generative adversarial network for improving the resolution of the thermal images, followed by a sparse autoencoder for colorization of thermal images and a multimodal convolutional neural network for gas detection using electronic nose and thermal images. The dataset used comprises 6400 thermal images and electronic nose measurements for four classes. A multimodal Convolutional Neural Network (CNN) comprising an EfficientNetB2 pre-trained model was developed using both early and late feature fusion. The Super Resolution Generative Adversarial Network (SRGAN) model was developed and trained on low- and high-resolution thermal images, and a sparse autoencoder was trained on the grayscale and colorized thermal images. The SRGAN achieved a Structural Similarity Index (SSIM) of 90.28, a Peak Signal-to-Noise Ratio (PSNR) of 68.74, and a Mean Absolute Error (MAE) of 0.066. The autoencoder model produced an MAE of 0.035, a Mean Squared Error (MSE) of 0.006, and a Root Mean Squared Error (RMSE) of 0.0705. The multimodal CNN, trained on these images and electronic nose measurements using both early and late fusion techniques, achieved accuracies of 97.89% and 98.55%, respectively. Hence, the proposed framework can be of great aid for integration with low-cost software to generate high-quality thermal camera images and to detect gases in real time with high accuracy.
Funding: Supported by the Henan Province Science and Technology Research Project (242102211046), the Key Scientific Research Project of Higher Education Institutions in Henan Province (25A520039), the Natural Science Foundation project of Zhongyuan Institute of Technology (K2025YB011), and the Zhongyuan University of Technology Graduate Education and Teaching Reform Research Project (JG202424).
Abstract: Electrocardiogram (ECG) analysis is critical for detecting arrhythmias, but traditional methods struggle with large-scale ECG data and rare arrhythmia events in imbalanced datasets. These methods neither perform multi-perspective learning of temporal signals and ECG images nor fully extract the latent information within the data, falling short of the accuracy required by clinicians. Therefore, this paper proposes a hybrid multimodal spatiotemporal neural network to address these challenges. The model employs a multimodal data augmentation framework integrating visual and signal-based features to enhance the classification performance of rare arrhythmias in imbalanced datasets. Additionally, the spatiotemporal fusion module incorporates a spatiotemporal graph convolutional network to jointly model temporal and spatial features, uncovering complex dependencies within the ECG data and improving the model's ability to represent complex patterns. In experiments conducted on the MIT-BIH arrhythmia dataset, the model achieved 99.95% accuracy, 99.80% recall, and a 99.78% F1 score. The model was further validated for generalization using the clinical INCART arrhythmia dataset, and the results demonstrated its effectiveness in terms of both generalization and robustness.
Funding: Funded by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT), grant number 2021-0-01341.
Abstract: Emotion recognition under uncontrolled and noisy environments presents persistent challenges in the design of emotionally responsive systems. The current study introduces an audio-visual recognition framework designed to address performance degradation caused by environmental interference, such as background noise, overlapping speech, and visual obstructions. The proposed framework employs a structured fusion approach, combining early-stage feature-level integration with decision-level coordination guided by temporal attention mechanisms. Audio data are transformed into mel-spectrogram representations, and visual data are represented as raw frame sequences. Spatial and temporal features are extracted through convolutional and transformer-based encoders, allowing the framework to capture complementary and hierarchical information from both sources. A cross-modal attention module enables selective emphasis on relevant signals while suppressing modality-specific noise. Performance is validated on a modified version of the AFEW dataset, in which controlled noise is introduced to emulate realistic conditions. The framework achieves higher classification accuracy than comparative baselines, confirming increased robustness under conditions of cross-modal disruption. This result demonstrates the suitability of the proposed method for deployment in practical emotion-aware technologies operating outside controlled environments. The study also contributes a systematic approach to fusion design and supports further exploration in the direction of resilient multimodal emotion analysis frameworks. The source code is publicly available at https://github.com/asmoon002/AVER (accessed on 18 August 2025).
Funding: Supported by the National Key Research and Development Program of China (No. 2023YFB2705000).
Abstract: With the rise of encrypted traffic, traditional network analysis methods have become less effective, leading to a shift towards deep learning-based approaches. Among these, multimodal learning-based classification methods have gained attention due to their ability to leverage diverse feature sets from encrypted traffic, improving classification accuracy. However, existing research predominantly relies on late fusion techniques, which hinder the full utilization of deep features within the data. To address this limitation, we propose a novel multimodal encrypted traffic classification model that synchronizes modality fusion with multiscale feature extraction. Specifically, our approach fuses the modalities at each stage of feature extraction, enhancing feature representation at each level and preserving inter-level correlations for more effective learning. This continuous fusion strategy improves the model's ability to detect subtle variations in encrypted traffic, while boosting its robustness and adaptability to evolving network conditions. Experimental results on two real-world encrypted traffic datasets demonstrate that our method achieves classification accuracies of 98.23% and 97.63%, outperforming existing multimodal learning-based methods.
Funding: Financial support from the National Natural Science Foundation of China (52230004 and 52293445), the Key Research and Development Project of Shandong Province (2020CXGC011202-005), and the Shenzhen Science and Technology Program (KCXFZ20211020163404007 and KQTD20190929172630447).
Abstract: The potential for reducing greenhouse gas (GHG) emissions and energy consumption in wastewater treatment can be realized through intelligent control, with machine learning (ML) and multimodality emerging as a promising solution. Here, we introduce an ML technique based on multimodal strategies, focusing specifically on intelligent aeration control in wastewater treatment plants (WWTPs). The generalization of the multimodal strategy is demonstrated on eight ML models. The results demonstrate that this multimodal strategy significantly enhances model indicators for ML in environmental science and the efficiency of aeration control, exhibiting exceptional performance and interpretability. Among the multimodal models, integrating random forest with visual models achieves the highest accuracy in forecasting aeration quantity, with a mean absolute percentage error of 4.4% and a coefficient of determination of 0.948. Practical testing in a full-scale plant reveals that the multimodal model can reduce operation costs by 19.8% compared to traditional fuzzy control methods. The potential application of these strategies in critical water science domains is discussed. To foster accessibility and promote widespread adoption, the multimodal ML models are freely available on GitHub, thereby eliminating technical barriers and encouraging the application of artificial intelligence in urban wastewater treatment.
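The two accuracy figures quoted in this abstract, mean absolute percentage error (MAPE) and the coefficient of determination (R²), follow standard definitions that can be sketched in a few lines. The function names and the aeration values below are hypothetical, chosen only to make the arithmetic easy to follow; they are not data from the study.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent.
    Assumes no actual value is zero."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical aeration quantities (arbitrary units) and model forecasts.
measured = [100.0, 200.0, 300.0]
forecast = [110.0, 190.0, 300.0]

error_pct = mape(measured, forecast)   # (10% + 5% + 0%) / 3
fit = r_squared(measured, forecast)    # 1 - 200 / 20000
```

MAPE expresses the typical forecast error relative to the measured value, while R² expresses the share of variance in the measurements that the forecasts account for, so the two are usually reported together.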
Funding: Supported by the National Natural Science Foundation of China (No. 62107014, Jian P.; 62177025, He B.), the Key R&D and Promotion Projects of Henan Province (No. 212102210147, Jian P.), and the Innovative Education Program for Graduate Students at North China University of Water Resources and Electric Power, China (No. YK-2021-99, Guo F.).
Abstract: This paper presents an end-to-end deep learning method to solve geometry problems via feature learning and contrastive learning of multimodal data. A key challenge in solving geometry problems using deep learning is to automatically adapt to the task of understanding single-modal and multimodal problems. Existing methods focus on either single-modal or multimodal problems and do not transfer between the two. A general geometry problem solver should be able to process problems of various modalities at the same time. In this paper, a shared feature-learning model of multimodal data is adopted to learn the unified feature representation of text and image, which can solve the heterogeneity issue between multimodal geometry problems. A contrastive learning model of multimodal data enhances the semantic relevance between multimodal features and maps them into a unified semantic space, which can effectively adapt to both single-modal and multimodal downstream tasks. Based on the feature extraction and fusion of multimodal data, the proposed geometry problem solver uses relation extraction, theorem reasoning, and problem solving to present solutions in a readable way. Experimental results show the effectiveness of the method.
Funding: Supported by the National Natural Science Foundation of China under Grant No 60972106, the China Postdoctoral Science Foundation under Grant No 2014M561053, the Humanity and Social Science Foundation of Ministry of Education of China under Grant No 15YJA630108, and the Hebei Province Natural Science Foundation under Grant No E2016202341.
Abstract: The contribution of this work is twofold: (1) A multimodality prediction method of chaotic time series with the Gaussian process mixture (GPM) model is proposed, which employs a divide-and-conquer strategy. It automatically divides the chaotic time series into multiple modalities with different extrinsic patterns and intrinsic characteristics, and thus can more precisely fit the chaotic time series. (2) An effective sparse hard-cut expectation maximization (SHC-EM) learning algorithm for the GPM model is proposed to improve the prediction performance. SHC-EM replaces a large learning sample set with fewer pseudo inputs, accelerating model learning based on these pseudo inputs. Experiments on Lorenz and Chua time series demonstrate that the proposed method yields not only accurate multimodality prediction, but also the prediction confidence interval. SHC-EM outperforms traditional variational learning in terms of both prediction accuracy and speed. In addition, SHC-EM is more robust and less susceptible to noise than variational learning.
Funding: Supported by the Ningxia Natural Science Foundation (2025AAC050001), the Scientific Research Startup Project for Full-Time Introduced High-Level Talents in Ningxia (2024BEH04130), the National Natural Science Foundation of China (32460444), the Ningxia Hui Autonomous Region Key Research and Development Program (2024BBF0101302, 2023BDE02001), and the Special Fund for Basic Research Business of Central Universities of North Minzu University (2025BG234, 2023ZRLG12).
Abstract: Foundation models are reshaping artificial intelligence, yet their deployment in specialised domains such as agricultural question answering (AQA) still faces challenges including data scarcity and barriers to domain-specific knowledge. To systematically review recent progress in this area, this paper adopts a task-paradigm perspective and examines applications across three major AQA task families. For text-based QA, we analyse the strengths and limitations of retrieval-based, generative, and hybrid approaches built on large language models, revealing a clear trend toward hybrid paradigms that balance precision and flexibility. For visual diagnosis, we discuss techniques such as cross-modal alignment and prompt-driven generation, which are pushing systems beyond simple pest and disease recognition toward deeper causal reasoning. For multimodal reasoning, we show how the fusion of heterogeneous data, including text, images, speech, and sensor streams, enables comprehensive decision-making for diagnosis, monitoring, and yield prediction. To address the lack of unified benchmarks, we further propose a standardised evaluation protocol and a diagnostic taxonomy specifically designed to characterise agriculture-specific errors. Finally, we outline a concrete AQA roadmap that emphasises safety alignment, hallucination control, and lightweight deployment, aiming to guide future systems toward greater efficiency, trustworthiness, and sustainability.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 52405041), the Major Program of the Zhejiang Provincial Natural Science Foundation of China (Grant No. LD25E050001), and the Key R&D Program of Zhejiang Province (Grant No. 2025C01186).
Abstract: Humanoid robots hold significant promise for social interaction and emotional companionship. However, their effectiveness hinges on the ability to convey nuanced and authentic emotions. Here, we present a universal humanoid robot head with a facial kinematics model. Using a reinforcement learning framework guided by symmetry assessment, emotion decoupling, and MLLM authenticity evaluation, our system autonomously learns to generate adaptive facial expressions through dynamic landmark adjustments. By transferring the simulation training results to real-world environments, the robot can perform natural and expressive facial expressions. Another novel feature is the independent regulation of emotion intensity and expression magnitude across emotional categories, which significantly enhances the ability to achieve culturally adaptive and socially resonant robotic expressions. This research advances adaptive humanoid interaction, offering an easier and more efficient pathway toward culturally resonant and psychologically plausible robotic expressions.
Abstract: Cross-lingual image description, the task of generating image captions in a target language from images and descriptions in a source language, is addressed in this study through a novel approach that combines neural network models and semantic matching techniques. Experiments conducted on the Flickr8k and AraImg2k benchmark datasets, featuring images and descriptions in English and Arabic, showcase remarkable performance improvements over state-of-the-art methods. Our model, equipped with the Image & Cross-Language Semantic Matching module and the Target Language Domain Evaluation module, significantly enhances the semantic relevance of generated image descriptions. For English-to-Arabic and Arabic-to-English cross-language image descriptions, our approach achieves a CIDEr score for English and Arabic of 87.9% and 81.7%, respectively, emphasizing the substantial contributions of our methodology. Comparative analyses with previous works further affirm the superior performance of our approach, and visual results underscore that our model generates image captions that are both semantically accurate and stylistically consistent with the target language. In summary, this study advances the field of cross-lingual image description, offering an effective solution for generating image captions across languages, with the potential to impact multilingual communication and accessibility. Future research directions include expanding to more languages and incorporating diverse visual and textual data sources.