Visual question answering (VQA) is a multimodal task, involving a deep understanding of the image scene and the question's meaning and capturing the relevant correlations between both modalities to infer the appropriate answer. In this paper, we propose a VQA system intended to answer yes/no questions about real-world images, in Arabic. To support a robust VQA system, we work in two directions: (1) using deep neural networks to semantically represent the given image and question in a fine-grained manner, namely ResNet-152 and Gated Recurrent Units (GRU); (2) studying the role of the utilized multimodal bilinear pooling fusion technique in the trade-off between the model complexity and the overall model performance. Some fusion techniques could significantly increase the model complexity, which seriously limits their applicability for VQA models. So far, there is no evidence of how efficient these multimodal bilinear pooling fusion techniques are for VQA systems dedicated to yes/no questions. Hence, a comparative analysis is conducted between eight bilinear pooling fusion techniques, in terms of their ability to reduce the model complexity and improve the model performance in this case of VQA systems. Experiments indicate that these multimodal bilinear pooling fusion techniques have improved the VQA model's performance, reaching a best performance of 89.25%. Further, experiments have proven that the number of answers in the developed VQA system is a critical factor that affects the effectiveness of these multimodal bilinear pooling techniques in achieving their main objective of reducing the model complexity. The Multimodal Local Perception Bilinear Pooling (MLPB) technique has shown the best balance between the model complexity and its performance, for VQA systems designed to answer yes/no questions.
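The abstract above compares bilinear pooling fusion operators for yes/no VQA. As a rough illustration of the family of techniques involved (not the paper's MLPB implementation; the dimensions, random weights, and rank below are made up), here is a minimal low-rank bilinear (Hadamard-product) fusion of an image vector and a question vector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not the paper's): a 2048-d ResNet-152 image
# vector, a 512-d GRU question vector, rank-64 factors, 2 answers (yes/no).
d_img, d_q, rank, n_ans = 2048, 512, 64, 2

U = rng.standard_normal((d_img, rank)) * 0.01  # image projection
V = rng.standard_normal((d_q, rank)) * 0.01    # question projection
P = rng.standard_normal((rank, n_ans)) * 0.01  # answer classifier

def bilinear_fuse(img_feat, q_feat):
    """Low-rank bilinear pooling: project both modalities into a shared
    rank-dimensional space, fuse with an element-wise product, classify."""
    fused = np.tanh(img_feat @ U) * np.tanh(q_feat @ V)  # Hadamard fusion
    logits = fused @ P
    e = np.exp(logits - logits.max())
    return e / e.sum()  # softmax over {yes, no}

probs = bilinear_fuse(rng.standard_normal(d_img), rng.standard_normal(d_q))
print(probs.shape, float(probs.sum()))
```

The complexity trade-off the paper studies is visible in the parameter counts: a full bilinear form needs d_img x d_q x n_ans weights, while this low-rank factorization needs only (d_img + d_q + n_ans) x rank.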
Accurate prediction of drug responses in cancer cell lines (CCLs) and transferable prediction of clinical drug responses using CCLs are two major tasks in personalized medicine. Despite the rapid advancements in existing computational methods for preclinical and clinical cancer drug response (CDR) prediction, challenges remain regarding generalization to new drugs that are unseen in the training set. Herein, we propose a multimodal fusion deep learning (DL) model called drug-target and single-cell language based CDR (DTLCDR) to predict preclinical and clinical CDRs. The model integrates chemical descriptors, molecular graph representations, predicted protein target profiles of drugs, and cell line expression profiles with general knowledge from single cells. Among these features, a well-trained drug-target interaction (DTI) prediction model is used to generate target profiles of drugs, and a pretrained single-cell language model is integrated to provide general genomic knowledge. Comparison experiments on the cell line drug sensitivity dataset demonstrated that DTLCDR exhibited improved generalizability and robustness in predicting unseen drugs compared with previous state-of-the-art baseline methods. Further ablation studies verified the effectiveness of each component of our model, highlighting the significant contribution of target information to generalizability. Subsequently, the ability of DTLCDR to predict novel molecules was validated through in vitro cell experiments, demonstrating its potential for real-world applications. Moreover, DTLCDR was transferred to clinical datasets, demonstrating satisfactory performance on clinical data, regardless of whether the drugs were included in the cell line dataset. Overall, our results suggest that DTLCDR is a promising tool for personalized drug discovery.
Purpose: This study aims to integrate large language models (LLMs) with interpretable machine learning methods to develop a multimodal data-driven framework for predicting corporate financial fraud, addressing the limitations of traditional approaches in long-text semantic parsing, model interpretability, and multisource data fusion, thereby providing regulatory agencies with intelligent auditing tools. Design/methodology/approach: Analyzing 5,304 Chinese listed firms' annual reports (2015-2020) from the CSMAD database, this study leverages the Doubao LLMs to generate chunked summaries and 256-dimensional semantic vectors, developing textual semantic features. It integrates 19 financial indicators, 11 governance metrics, and linguistic characteristics (tone, readability) with fraud prediction models optimized through a group of Gradient Boosted Decision Tree (GBDT) algorithms. SHAP value analysis of the final model reveals the risk transmission mechanism by quantifying the marginal impacts of financial, governance, and textual features on fraud likelihood. Findings: The study found that LLMs effectively distill lengthy annual reports into semantic summaries, while GBDT algorithms (AUC > 0.850) outperform the traditional logistic regression model in fraud detection. Multimodal fusion improved performance by 7.4%, with financial, governance, and textual features providing complementary signals. SHAP analysis revealed financial distress, governance conflicts, and narrative patterns (e.g., tone anchoring, semantic thresholds) as key fraud indicators, highlighting managerial intent in report language. Research limitations: This study identifies three key limitations: 1) lack of interpretability for semantic features, 2) absence of granular fraud-type differentiation, and 3) unexplored comparative validation with other deep learning methods. Future research will address these gaps to enhance fraud detection precision and model transparency. Practical implications: The developed semantic-enhanced evaluation model provides a quantitative tool for assessing listed companies' information disclosure quality and enables practical implementation through its derivative real-time monitoring system. This advancement significantly strengthens capital market risk early warning capabilities, offering actionable insights for securities regulation. Originality/value: This study presents three key innovations: 1) a novel "chunking-summarization-embedding" framework for efficient semantic compression of lengthy annual reports (30,000 words); 2) demonstration of LLMs' superior performance in financial text analysis, outperforming traditional methods by 19.3%; 3) a novel "language-psychology-behavior" triad model for analyzing managerial fraud motives.
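As a sketch of the boosting side of the pipeline described above (not the authors' actual models, features, or data; every array below is a synthetic stand-in, and the stump learner is deliberately minimal), the following fits gradient-boosted decision stumps on early-fused "financial + governance + text" features:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for the three feature blocks (the real model uses 19
# financial indicators, 11 governance metrics, and LLM text embeddings).
n = 200
financial = rng.standard_normal((n, 3))
governance = rng.standard_normal((n, 2))
text_vec = rng.standard_normal((n, 4))
X = np.hstack([financial, governance, text_vec])            # early fusion
y = (financial[:, 0] + 0.5 * text_vec[:, 0] > 0).astype(float)  # synthetic label

def fit_stump(X, r):
    """Best single-feature threshold split minimizing squared error on residual r."""
    best = None
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], [0.25, 0.5, 0.75]):
            left = X[:, j] <= t
            pl, pr = r[left].mean(), r[~left].mean()
            err = ((r - np.where(left, pl, pr)) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, t, pl, pr)
    return best[1:]

# Gradient boosting with squared loss: each stump fits the current residual.
pred = np.full(n, y.mean())
stumps, lr = [], 0.3
for _ in range(20):
    j, t, pl, pr = fit_stump(X, y - pred)
    pred += lr * np.where(X[:, j] <= t, pl, pr)
    stumps.append((j, t, pl, pr))

acc = ((pred > 0.5) == (y > 0.5)).mean()
print(round(acc, 3))
```

Production GBDT libraries grow full trees with regularization and use log loss for classification; the residual-fitting loop above is the core idea they share.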
The contribution of this work is twofold: (1) a multimodality prediction method for chaotic time series based on the Gaussian process mixture (GPM) model is proposed, which employs a divide-and-conquer strategy. It automatically divides the chaotic time series into multiple modalities with different extrinsic patterns and intrinsic characteristics, and thus can more precisely fit the chaotic time series. (2) An effective sparse hard-cut expectation maximization (SHC-EM) learning algorithm for the GPM model is proposed to improve the prediction performance. SHC-EM replaces a large learning sample set with fewer pseudo inputs, accelerating model learning based on these pseudo inputs. Experiments on Lorenz and Chua time series demonstrate that the proposed method yields not only accurate multimodality prediction, but also the prediction confidence interval. SHC-EM outperforms traditional variational learning in terms of both prediction accuracy and speed. In addition, SHC-EM is more robust and less susceptible to noise than variational learning.
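The prediction confidence interval mentioned above comes from the Gaussian process posterior. A minimal single-GP regression sketch (not the paper's GPM mixture or its SHC-EM pseudo-input learning, and on a toy sine series rather than Lorenz/Chua data) shows where the predictive mean and interval come from:

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf(a, b, ls=1.0):
    """Squared-exponential kernel matrix between 1-D point sets a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / ls**2)

# Toy 1-D series standing in for a time-series regression target.
x_train = np.linspace(0, 6, 30)
y_train = np.sin(x_train) + 0.05 * rng.standard_normal(30)
x_test = np.array([1.5, 3.0, 4.5])

noise = 0.05**2
K = rbf(x_train, x_train) + noise * np.eye(30)
Ks = rbf(x_test, x_train)
Kss = rbf(x_test, x_test)

# Standard GP posterior: mean and covariance of f(x_test) given the data.
alpha = np.linalg.solve(K, y_train)
mean = Ks @ alpha
cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
std = np.sqrt(np.clip(np.diag(cov), 0, None))

lo, hi = mean - 1.96 * std, mean + 1.96 * std  # 95% confidence interval
print(np.round(mean, 2))
```

The SHC-EM idea in the abstract replaces the full 30-point training set inside K with a handful of learned pseudo inputs, shrinking the cubic-cost linear solves above.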
Multimodal sentiment analysis is an essential area of research in artificial intelligence that combines multiple modes, such as text and image, to accurately assess sentiment. However, conventional approaches that rely on unimodal pre-trained models for feature extraction from each modality often overlook the intrinsic connections of semantic information between modalities. This limitation is attributed to their training on unimodal data, and necessitates the use of complex fusion mechanisms for sentiment analysis. In this study, we present a novel approach that combines a vision-language pre-trained model with a proposed multimodal contrastive learning method. Our approach harnesses the power of transfer learning by utilizing a vision-language pre-trained model to extract both visual and textual representations in a unified framework. We employ a Transformer architecture to integrate these representations, thereby enabling the capture of rich semantic information in image-text pairs. To further enhance the representation learning of these pairs, we introduce our proposed multimodal contrastive learning method, which leads to improved performance in sentiment analysis tasks. Our approach is evaluated through extensive experiments on two publicly accessible datasets, where we demonstrate its effectiveness. We achieve a significant improvement in sentiment analysis accuracy, indicating the superiority of our approach over existing techniques. These results highlight the potential of multimodal sentiment analysis and underscore the importance of considering the intrinsic semantic connections between modalities for accurate sentiment assessment.
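The multimodal contrastive learning idea above can be illustrated with a symmetric InfoNCE loss over matched image-text embedding pairs. This is a generic CLIP-style sketch with toy random embeddings, not the authors' proposed method:

```python
import numpy as np

rng = np.random.default_rng(3)

def info_nce(img, txt, temp=0.07):
    """Symmetric InfoNCE over a batch of matched image/text embeddings:
    the i-th image should score highest against the i-th text."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temp                 # scaled cosine similarities
    labels = np.arange(len(img))
    log_sm = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_i2t = -log_sm[labels, labels].mean()   # image -> text direction
    log_sm_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_t2i = -log_sm_t[labels, labels].mean()  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Aligned pairs (text = noisy copy of image embedding) vs mismatched pairs.
img = rng.standard_normal((8, 16))
txt = img + 0.1 * rng.standard_normal((8, 16))
aligned = info_nce(img, txt)
shuffled = info_nce(img, txt[::-1])
print(aligned < shuffled)
```

Minimizing this loss pulls matched image-text pairs together and pushes mismatched pairs apart in the shared embedding space, which is the effect the abstract attributes to its contrastive method.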
The problem of fake news detection (FND) is becoming increasingly important in the field of natural language processing (NLP) because of the rapid dissemination of misleading information on the web. Large language models (LLMs) such as GPT-4 excel in natural language understanding tasks but can still struggle to distinguish between fact and fiction, particularly when applied in the wild. However, a key challenge of existing FND methods is that they only consider unimodal data (e.g., images), while more detailed multimodal data (e.g., user behaviour, temporal dynamics) is neglected, and the latter is crucial for full-context understanding. To overcome these limitations, we introduce M3-FND (Multimodal Misinformation Mitigation for False News Detection), a novel methodological framework that integrates LLMs with multimodal data sources to perform context-aware veracity assessments. Our method proposes a hybrid system that combines image-text alignment, user credibility profiling, and temporal pattern recognition, which is also strengthened through a natural feedback loop that provides real-time feedback for correcting downstream errors. We use contextual reinforcement learning to schedule prompt updating and update the classifier threshold based on the latest multimodal input, which enables the model to better adapt to changing misinformation attack strategies. M3-FND is tested on three diverse datasets, FakeNewsNet, Twitter15, and Weibo, which contain both text and visual social media content. Experiments show that M3-FND significantly outperforms conventional and LLM-based baselines in terms of accuracy, F1-score, and AUC on all benchmarks. Our results indicate the importance of employing multimodal cues and adaptive learning for effective and timely detection of fake news.
High-throughput transcriptomics has evolved from bulk RNA-seq to single-cell and spatial profiling, yet its clinical translation still depends on effective integration across diverse omics and data modalities. Emerging foundation models and multimodal learning frameworks are enabling scalable and transferable representations of cellular states, while advances in interpretability and real-world data integration are bridging the gap between discovery and clinical application. This paper outlines a concise roadmap for AI-driven, transcriptome-centered multi-omics integration in precision medicine (Figure 1).
In recent years, Vision-Language Models (VLMs) have emerged as a significant breakthrough in multimodal learning, demonstrating remarkable progress in tasks such as image-text alignment, image generation, and semantic reasoning. This paper systematically reviews current VLM pretraining methodologies, including contrastive learning and generative paradigms, while providing an in-depth analysis of efficient transfer learning strategies such as prompt tuning, LoRA, and adapter modules. Through representative models like CLIP, BLIP, and GIT, we examine their practical applications in visual grounding, image-text retrieval, visual question answering, affective computing, and embodied AI. Furthermore, we identify persistent challenges in fine-grained semantic modeling, cross-modal reasoning, and cross-lingual transfer. Finally, we envision future trends in unified architectures, multimodal reinforcement learning, and domain adaptation, aiming to provide a systematic reference and technical insights for subsequent research.
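Among the transfer strategies surveyed above, LoRA is easy to sketch: the pretrained weight is frozen and only a low-rank additive update is trained. The dimensions and rank below are illustrative, not taken from any particular model:

```python
import numpy as np

rng = np.random.default_rng(4)

d, r = 64, 4  # hidden size and LoRA rank (r << d)

W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-init

def lora_forward(x, scale=1.0):
    """y = W x + scale * B (A x): the frozen path plus a low-rank update.
    With B zero-initialized, the adapted model starts identical to the base."""
    return W @ x + scale * (B @ (A @ x))

x = rng.standard_normal(d)
assert np.allclose(lora_forward(x), W @ x)  # identical before any training

# Trainable parameters: 2*d*r for LoRA versus d*d for a full fine-tune.
print(2 * d * r, d * d)
```

In training, only A and B receive gradients, which is why LoRA cuts the trainable parameter count from d*d to 2*d*r per adapted matrix while leaving the base weights untouched.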
Electronic nose and thermal images are effective ways to diagnose the presence of gases in real time. Multimodal fusion of these modalities can result in the development of highly accurate diagnostic systems. Low-cost thermal imaging software produces low-resolution thermal images in grayscale format, hence necessitating methods for improving the resolution and colorizing the images. The objective of this paper is to develop and train a super-resolution generative adversarial network for improving the resolution of the thermal images, followed by a sparse autoencoder for colorization of thermal images and a multimodal convolutional neural network for gas detection using electronic nose and thermal images. The dataset used comprises 6400 thermal images and electronic nose measurements for four classes. A multimodal Convolutional Neural Network (CNN) comprising an EfficientNetB2 pre-trained model was developed using both early and late feature fusion. The Super Resolution Generative Adversarial Network (SRGAN) model was developed and trained on low- and high-resolution thermal images, achieving a Structural Similarity Index (SSIM) of 90.28, a Peak Signal-to-Noise Ratio (PSNR) of 68.74, and a Mean Absolute Error (MAE) of 0.066. A sparse autoencoder was trained on the grayscale and colorized thermal images, producing an MAE of 0.035, a Mean Squared Error (MSE) of 0.006, and a Root Mean Squared Error (RMSE) of 0.0705. The multimodal CNN, trained on these images and electronic nose measurements using both early and late fusion techniques, achieved accuracies of 97.89% and 98.55%, respectively. Hence, the proposed framework can be of great aid for integration with low-cost software to generate high-quality thermal camera images and highly accurate detection of gases in real time.
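For reference, the reconstruction metrics reported above (MAE, MSE, PSNR) can be computed as follows. SSIM is omitted here since it requires windowed mean/variance statistics, and the [0, 1] pixel range is an assumption of this sketch rather than a detail from the paper:

```python
import numpy as np

def mae(a, b):
    """Mean absolute error between two images."""
    return np.abs(a - b).mean()

def mse(a, b):
    """Mean squared error between two images."""
    return ((a - b) ** 2).mean()

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, peak]."""
    m = mse(a, b)
    return float("inf") if m == 0 else 10 * np.log10(peak**2 / m)

rng = np.random.default_rng(5)
ref = rng.random((32, 32))                             # reference image
recon = np.clip(ref + 0.01 * rng.standard_normal((32, 32)), 0, 1)  # noisy reconstruction

print(round(mae(ref, recon), 4), round(psnr(ref, recon), 1))
```

Note that PSNR depends on the assumed peak value: the same reconstruction scores 20*log10(255) dB higher when pixels are in [0, 255], which is worth checking when comparing reported numbers across papers.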
Background: Retinal vein occlusion (RVO) is a leading cause of visual impairment on a global scale. Its pathological mechanisms involve a complex interplay of vascular obstruction, ischemia, and secondary inflammatory responses. Recent interdisciplinary advances, underpinned by the integration of multimodal data, have established a new paradigm for unraveling the pathophysiological mechanisms of RVO, enabling early diagnosis and personalized treatment strategies. Main text: This review critically synthesizes recent progress at the intersection of machine learning, bioinformatics, and clinical medicine, focusing on developing predictive models and deep analysis, exploring molecular mechanisms, and identifying markers associated with RVO. By bridging technological innovation with clinical needs, this review underscores the potential of data-driven strategies to advance RVO research and optimize patient care. Conclusions: Machine learning-bioinformatics integration has revolutionised RVO research through predictive modelling and mechanistic insights, particularly via deep learning-enhanced retinal imaging and multi-omics networks. Despite progress, clinical translation requires resolving data standardisation inconsistencies and model generalizability limitations. Establishing multicentre validation frameworks and interpretable AI tools, coupled with patient-focused data platforms through cross-disciplinary collaboration, could enable precision interventions to optimally preserve vision.
In recent years, multi-label zero-shot learning (ML-ZSL) has garnered increasing attention because of its wide range of potential applications, such as image annotation, text classification, and bioinformatics. The central challenge in ML-ZSL lies in predicting multiple labels for unseen classes without requiring any labeled training data, which contrasts with conventional supervised learning paradigms. However, existing methods face several significant challenges. These include the substantial semantic gap between different modalities, which impedes effective knowledge transfer, and the intricate and typically complex relationships among multiple labels, making it difficult to model them in a meaningful and accurate manner. To overcome these challenges, we propose a graph-augmented multimodal chain-of-thought (GMCoT) reasoning approach. The proposed method combines the strengths of multimodal large language models with graph-based structures, significantly enhancing the reasoning process involved in multi-label prediction. First, a novel multimodal chain-of-thought reasoning framework is presented, which imitates human-like step-by-step reasoning to produce multi-label predictions. Second, a technique is presented for integrating label graphs into the reasoning process. This technique enables the capture of complex semantic relationships among labels, thereby improving the accuracy and consistency of multi-label generation. Comprehensive experiments on benchmark datasets demonstrate that the proposed GMCoT approach outperforms state-of-the-art methods in ML-ZSL.
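One way to make the "label graph" ingredient above concrete is score propagation over a label co-occurrence graph. The labels, edge weights, and propagation rule below are hypothetical and only numerically approximate what the paper does inside an MLLM reasoning chain:

```python
import numpy as np

# Hypothetical labels and a toy co-occurrence graph (edge weight = how
# often two labels appear together); not the paper's actual graph.
labels = ["dog", "leash", "car", "road"]
A = np.array([[0.0, 0.9, 0.1, 0.1],
              [0.9, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 0.8],
              [0.1, 0.1, 0.8, 0.0]])

def propagate(scores, A, alpha=0.5, steps=2):
    """Blend each label's score with its graph neighbours' scores, so that
    strongly co-occurring labels reinforce each other."""
    W = A / A.sum(axis=1, keepdims=True)  # row-normalised adjacency
    for _ in range(steps):
        scores = (1 - alpha) * scores + alpha * (W @ scores)
    return scores

raw = np.array([0.9, 0.2, 0.1, 0.1])      # initial per-label confidences
smoothed = propagate(raw, A)
print(np.round(smoothed, 3))
```

After propagation, "leash" is boosted by its strong edge to the confidently predicted "dog", which is the consistency effect a label graph is meant to provide for multi-label generation.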
We used interpretable machine learning to combine information from multiple heterogeneous spectra: X-ray absorption near-edge spectra (XANES) and atomic pair distribution functions (PDFs), to extract local structural and chemical environments of transition metal cations in oxides. Random forest models were trained on simulated XANES, PDF, and both combined to extract oxidation state, coordination number, and mean nearest-neighbor bond length. XANES-only models generally outperformed PDF-only models, even for structural tasks, although using the metal's differential PDFs (dPDFs) instead of total PDFs narrowed this gap. When combined with PDFs, information from XANES often dominates the prediction. Our results demonstrate that XANES contains rich structural information and highlight the utility of species-specificity. This interpretable, multimodal approach is quick to implement with suitable databases and offers valuable insights into the relative strengths of different modalities, guiding researchers in experiment design and identifying when combining complementary techniques adds meaningful information to a scientific investigation.
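A dependency-free way to ask "which modality dominates the prediction", in the spirit of the interpretability analysis above, is block-wise permutation importance. This sketch swaps the paper's random forests for a plain least-squares model and uses synthetic stand-ins for the spectra:

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic stand-ins: a 40-point "XANES" and a 40-point "PDF" per sample,
# with the target depending mostly on the XANES block by construction.
n = 300
xanes = rng.standard_normal((n, 40))
pdf = rng.standard_normal((n, 40))
y = 2.0 * xanes[:, 5] + 0.3 * pdf[:, 10] + 0.05 * rng.standard_normal(n)

X = np.hstack([xanes, pdf])                # early fusion of the two modalities
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # simple linear "model"

def block_importance(X, y, w, cols):
    """Error increase when one modality's columns are shuffled across samples:
    a basic permutation-importance estimate of how much that block matters."""
    base = ((X @ w - y) ** 2).mean()
    Xp = X.copy()
    Xp[:, cols] = Xp[rng.permutation(len(X))][:, cols]
    return ((Xp @ w - y) ** 2).mean() - base

imp_xanes = block_importance(X, y, w, slice(0, 40))
imp_pdf = block_importance(X, y, w, slice(40, 80))
print(imp_xanes > imp_pdf)
```

The same block-shuffling diagnostic applies unchanged to a fitted random forest: replace `X @ w` with the forest's predict call and compare the per-modality error increases.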
The rate of soybean canopy establishment largely determines photoperiodic sensitivity, subsequently influencing yield potential. However, assessing the rate of soybean canopy development in large-scale field breeding trials is both laborious and time-consuming. High-throughput phenotyping methods based on unmanned aerial vehicle (UAV) systems can be used to monitor and quantitatively describe the development of soybean canopies for different genotypes. In this study, high-resolution and time-series raw data from field soybean populations were collected using UAVs.
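One common way to quantify canopy development from UAV RGB imagery (an illustrative choice, not necessarily the method used in this study) is to threshold the excess-green index and report the vegetated pixel fraction per flight date:

```python
import numpy as np

def canopy_fraction(rgb, thresh=0.1):
    """Fraction of pixels classified as vegetation via the excess-green
    index ExG = 2g - r - b computed on sum-normalised chromatic coordinates."""
    s = rgb.sum(axis=2, keepdims=True)
    s[s == 0] = 1.0                    # guard against all-black pixels
    r, g, b = np.moveaxis(rgb / s, 2, 0)
    exg = 2 * g - r - b
    return float((exg > thresh).mean())

# Tiny synthetic frame: left half green "canopy", right half grey soil.
img = np.zeros((4, 8, 3))
img[:, :4] = [0.2, 0.6, 0.2]   # green canopy pixels
img[:, 4:] = [0.4, 0.4, 0.4]   # grey soil pixels
print(canopy_fraction(img))    # → 0.5
```

Tracking this fraction across a time series of flights yields a per-genotype canopy establishment curve whose slope can be compared across breeding plots.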
We propose an image-text sentiment analysis method that combines ensemble learning with multimodal large language models (MLLMs). To address key challenges in image-text sentiment analysis, such as class imbalance and cross-modal sentiment inconsistency, we design the EMSAN (ensemble multimodal sentiment analysis network) framework. The framework adopts a primary-auxiliary model structure, combining a primary model trained on the full dataset with auxiliary models optimized on balanced subsets, enabling accurate recognition of each sentiment class. For feature learning, EMSAN uses a two-stage strategy to enhance sentiment features: multimodal large language models generate high-quality image descriptions to narrow the semantic gap between the visual and textual modalities, and a consistency contrastive learning mechanism is introduced that contrasts textual and visual features to reinforce consistent cross-modal sentiment expression and obtain finer-grained features. By learning from both balanced and imbalanced datasets, EMSAN preserves the natural data distribution while effectively mitigating class imbalance. Experimental results on several public benchmark datasets show that the proposed method achieves significant performance improvements.
Nutrition informatics is moving from traditional rule-based and conventional machine learning paradigms to a new stage centered on large language models (LLMs) and multimodal large language models (MLLMs). This paper systematically reviews research progress on nutrition-oriented large models from 2019 to 2025, summarizing key architectures and training techniques such as vision-language alignment, domain knowledge injection, retrieval-augmented generation (RAG), and interpretable reasoning. On this basis, the paper surveys current applications in typical scenarios including personalized dietary recommendation, nutritional status assessment, disease-related nutrition management, and automated dietary logging. It also reviews the evolution of core datasets and evaluation benchmarks such as Nutrition5k and NutriBench. Finally, addressing challenges in model trustworthiness, data privacy, cross-cultural generalization, and clinical evidence support, the paper argues that future research should deeply integrate clinical evidence, build high-quality multimodal data systems, and advance human-AI collaborative precision nutrition services to improve clinical translation value.
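The retrieval-augmented generation (RAG) component mentioned in the review above reduces, at its core, to nearest-neighbour search over snippet embeddings. The snippets and vectors below are toy placeholders for a real embedding model and knowledge base:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical nutrition knowledge snippets with toy embedding vectors;
# a real system would embed them with a trained text encoder.
docs = ["RDA of vitamin D for adults",
        "protein needs in CKD patients",
        "glycemic index of common grains",
        "sodium limits in hypertension"]
doc_emb = rng.standard_normal((4, 32))
doc_emb /= np.linalg.norm(doc_emb, axis=1, keepdims=True)

def retrieve(query_emb, k=2):
    """Retrieval step of RAG: return the k snippets most cosine-similar to
    the query embedding, to be prepended to the LLM prompt as context."""
    q = query_emb / np.linalg.norm(query_emb)
    idx = np.argsort(doc_emb @ q)[::-1][:k]
    return [docs[i] for i in idx]

# A query embedded near the second snippet should retrieve it first.
query = doc_emb[1] + 0.05 * rng.standard_normal(32)
print(retrieve(query))
```

Grounding the model's answer in the retrieved snippets, rather than in its parameters alone, is what lets a nutrition assistant cite checkable dietary guidance, which bears directly on the trustworthiness challenge the review raises.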
Funding (DTLCDR drug response study): Supported by the National Key Research and Development Program of China (Grant No. 2023YFC2605002); the National Key R&D Program of China (Grant No. 2022YFF1203003); the Beijing AI Health Cultivation Project, China (Grant No. Z221100003522022); the National Natural Science Foundation of China (Grant No. 82273772); and the Beijing Natural Science Foundation, China (Grant No. 7212152).
Funding (corporate financial fraud prediction study): Supported by the 2021 Guangdong Province (China) Science and Technology Plan Project "Research and Application of Key Technologies for Multi-level Knowledge Retrieval Based on Big Data Intelligence" (Project No. 2021B0101420004), and the 2022 commissioned project "Cross-border E-commerce Taxation and Related Research" from the State Taxation Administration Guangdong Provincial Taxation Bureau, China.
Funding (chaotic time series prediction study): Supported by the National Natural Science Foundation of China under Grant No. 60972106; the China Postdoctoral Science Foundation under Grant No. 2014M561053; the Humanity and Social Science Foundation of the Ministry of Education of China under Grant No. 15YJA630108; and the Hebei Province Natural Science Foundation under Grant No. E2016202341.
Funding (multimodal sentiment analysis study): Supported by the Science and Technology Research Project of the Jiangxi Education Department, Project Grant No. GJJ2203306.
文摘Multimodal sentiment analysis is an essential area of research in artificial intelligence that combines multiple modes,such as text and image,to accurately assess sentiment.However,conventional approaches that rely on unimodal pre-trained models for feature extraction from each modality often overlook the intrinsic connections of semantic information between modalities.This limitation is attributed to their training on unimodal data,and necessitates the use of complex fusion mechanisms for sentiment analysis.In this study,we present a novel approach that combines a vision-language pre-trained model with a proposed multimodal contrastive learning method.Our approach harnesses the power of transfer learning by utilizing a vision-language pre-trained model to extract both visual and textual representations in a unified framework.We employ a Transformer architecture to integrate these representations,thereby enabling the capture of rich semantic infor-mation in image-text pairs.To further enhance the representation learning of these pairs,we introduce our proposed multimodal contrastive learning method,which leads to improved performance in sentiment analysis tasks.Our approach is evaluated through extensive experiments on two publicly accessible datasets,where we demonstrate its effectiveness.We achieve a significant improvement in sentiment analysis accuracy,indicating the supe-riority of our approach over existing techniques.These results highlight the potential of multimodal sentiment analysis and underscore the importance of considering the intrinsic semantic connections between modalities for accurate sentiment assessment.
Abstract: The problem of fake news detection (FND) is becoming increasingly important in the field of natural language processing (NLP) because of the rapid dissemination of misleading information on the web. Large language models (LLMs) such as GPT-4.0 excel in natural language understanding tasks but can still struggle to distinguish between fact and fiction, particularly when applied in the wild. However, a key challenge of existing FND methods is that they consider only unimodal data (e.g., images), while more detailed multimodal data (e.g., user behaviour, temporal dynamics) is neglected, even though the latter is crucial for full-context understanding. To overcome these limitations, we introduce M3-FND (Multimodal Misinformation Mitigation for False News Detection), a novel methodological framework that integrates LLMs with multimodal data sources to perform context-aware veracity assessments. Our method proposes a hybrid system that combines image-text alignment, user credibility profiling, and temporal pattern recognition, strengthened by a natural feedback loop that provides real-time feedback for correcting downstream errors. We use contextual reinforcement learning to schedule prompt updates and adjust the classifier threshold based on the latest multimodal input, which enables the model to better adapt to changing misinformation attack strategies. M3-FND is tested on three diverse datasets, FakeNewsNet, Twitter15, and Weibo, which contain both textual and visual social media content. Experiments show that M3-FND significantly outperforms conventional and LLM-based baselines in terms of accuracy, F1-score, and AUC on all benchmarks. Our results indicate the importance of employing multimodal cues and adaptive learning for effective and timely detection of fake news.
Abstract: High-throughput transcriptomics has evolved from bulk RNA-seq to single-cell and spatial profiling, yet its clinical translation still depends on effective integration across diverse omics and data modalities. Emerging foundation models and multimodal learning frameworks are enabling scalable and transferable representations of cellular states, while advances in interpretability and real-world data integration are bridging the gap between discovery and clinical application. This paper outlines a concise roadmap for AI-driven, transcriptome-centered multi-omics integration in precision medicine (Figure 1).
Abstract: In recent years, Vision-Language Models (VLMs) have emerged as a significant breakthrough in multimodal learning, demonstrating remarkable progress in tasks such as image-text alignment, image generation, and semantic reasoning. This paper systematically reviews current VLM pretraining methodologies, including contrastive learning and generative paradigms, while providing an in-depth analysis of efficient transfer learning strategies such as prompt tuning, LoRA, and adapter modules. Through representative models like CLIP, BLIP, and GIT, we examine their practical applications in visual grounding, image-text retrieval, visual question answering, affective computing, and embodied AI. Furthermore, we identify persistent challenges in fine-grained semantic modeling, cross-modal reasoning, and cross-lingual transfer. Finally, we envision future trends in unified architectures, multimodal reinforcement learning, and domain adaptation, aiming to provide a systematic reference and technical insights for subsequent research.
Funding: Funded by the Centre for Advanced Modelling and Geospatial Information Systems (CAMGIS), Faculty of Engineering and IT, University of Technology Sydney; also supported by the Researchers Supporting Project, King Saud University, Riyadh, Saudi Arabia, under Project RSP2025 R14.
Abstract: Electronic nose measurements and thermal images are effective ways to diagnose the presence of gases in real time. Multimodal fusion of these modalities can result in highly accurate diagnostic systems. Low-cost thermal imaging software produces low-resolution thermal images in grayscale format, necessitating methods for improving resolution and colorizing the images. The objective of this paper is to develop and train a super-resolution generative adversarial network for improving the resolution of thermal images, followed by a sparse autoencoder for colorization of thermal images and a multimodal convolutional neural network for gas detection using electronic nose measurements and thermal images. The dataset used comprises 6400 thermal images and electronic nose measurements for four classes. A multimodal Convolutional Neural Network (CNN) based on a pre-trained EfficientNetB2 model was developed using both early and late feature fusion. The Super-Resolution Generative Adversarial Network (SRGAN) model was developed and trained on low- and high-resolution thermal images, achieving a Structural Similarity Index (SSIM) of 90.28, a Peak Signal-to-Noise Ratio (PSNR) of 68.74, and a Mean Absolute Error (MAE) of 0.066. A sparse autoencoder was trained on the grayscale and colorized thermal images, producing an MAE of 0.035, a Mean Squared Error (MSE) of 0.006, and a Root Mean Squared Error (RMSE) of 0.0705. The multimodal CNN, trained on these images and electronic nose measurements using both early and late fusion techniques, achieved accuracies of 97.89% and 98.55%, respectively. Hence, the proposed framework can be of great aid when integrated with low-cost software to generate high-quality thermal camera images and to detect gases in real time with high accuracy.
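The early versus late fusion distinction used in this paper can be sketched with toy feature vectors standing in for the EfficientNetB2 image branch and the electronic-nose branch. The linear classifiers and dimensions below are illustrative, not the paper's trained networks:

```python
import numpy as np

rng = np.random.default_rng(1)
thermal_feats = rng.normal(size=(6, 128))   # toy image-branch features
nose_feats = rng.normal(size=(6, 16))       # toy electronic-nose features

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Early fusion: concatenate modality features, then classify them jointly.
W_early = 0.1 * rng.normal(size=(128 + 16, 4))
early_probs = softmax(np.concatenate([thermal_feats, nose_feats], axis=1) @ W_early)

# Late fusion: classify each modality separately, then average the scores.
W_img = 0.1 * rng.normal(size=(128, 4))
W_nose = 0.1 * rng.normal(size=(16, 4))
late_probs = (softmax(thermal_feats @ W_img) + softmax(nose_feats @ W_nose)) / 2

print(early_probs.shape, late_probs.shape)  # (6, 4) each: 4 gas classes
```

Early fusion lets the classifier learn cross-modal interactions directly, while late fusion keeps the branches independent and only combines their decisions; the paper's 97.89% versus 98.55% accuracies compare exactly these two strategies.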
Funding: Supported by the National Natural Science Foundation of China (82271094 to J.Z.).
Abstract: Background: Retinal vein occlusion (RVO) is a leading cause of visual impairment on a global scale. Its pathological mechanisms involve a complex interplay of vascular obstruction, ischemia, and secondary inflammatory responses. Recent interdisciplinary advances, underpinned by the integration of multimodal data, have established a new paradigm for unraveling the pathophysiological mechanisms of RVO, enabling early diagnosis and personalized treatment strategies. Main text: This review critically synthesizes recent progress at the intersection of machine learning, bioinformatics, and clinical medicine, focusing on developing predictive models and deep analyses, exploring molecular mechanisms, and identifying markers associated with RVO. By bridging technological innovation with clinical needs, this review underscores the potential of data-driven strategies to advance RVO research and optimize patient care. Conclusions: The integration of machine learning and bioinformatics has revolutionised RVO research through predictive modelling and mechanistic insights, particularly via deep learning-enhanced retinal imaging and multi-omics networks. Despite this progress, clinical translation requires resolving data standardisation inconsistencies and model generalizability limitations. Establishing multicentre validation frameworks and interpretable AI tools, coupled with patient-focused data platforms built through cross-disciplinary collaboration, could enable precision interventions that optimally preserve vision.
Funding: Supported by the Key R&D Program of Zhejiang Province (No. 2024C01021), the National Regional Innovation and Development Joint Fund of China (No. U24A20254), and the Leading Talents of Technological Innovation Program of Zhejiang Province (No. 2023R5214).
Abstract: In recent years, multi-label zero-shot learning (ML-ZSL) has garnered increasing attention because of its wide range of potential applications, such as image annotation, text classification, and bioinformatics. The central challenge in ML-ZSL lies in predicting multiple labels for unseen classes without requiring any labeled training data, which contrasts with conventional supervised learning paradigms. However, existing methods face several significant challenges, including the substantial semantic gap between different modalities, which impedes effective knowledge transfer, and the intricate, typically complex relationships among multiple labels, which are difficult to model in a meaningful and accurate manner. To overcome these challenges, we propose a graph-augmented multimodal chain-of-thought (GMCoT) reasoning approach. The proposed method combines the strengths of multimodal large language models with graph-based structures, significantly enhancing the reasoning process involved in multi-label prediction. First, a novel multimodal chain-of-thought reasoning framework is presented, which imitates human-like step-by-step reasoning to produce multi-label predictions. Second, a technique is presented for integrating label graphs into the reasoning process, enabling the capture of complex semantic relationships among labels and thereby improving the accuracy and consistency of multi-label generation. Comprehensive experiments on benchmark datasets demonstrate that the proposed GMCoT approach outperforms state-of-the-art methods in ML-ZSL.
Funding: Funded by the Toyota Research Institute, grant number PO-002332.
Abstract: We used interpretable machine learning to combine information from multiple heterogeneous spectra, X-ray absorption near-edge spectra (XANES) and atomic pair distribution functions (PDFs), to extract the local structural and chemical environments of transition metal cations in oxides. Random forest models were trained on simulated XANES, PDFs, and both combined to extract oxidation state, coordination number, and mean nearest-neighbor bond length. XANES-only models generally outperformed PDF-only models, even for structural tasks, although using the metal's differential PDFs (dPDFs) instead of total PDFs narrowed this gap. When combined with PDFs, information from XANES often dominated the prediction. Our results demonstrate that XANES contains rich structural information and highlight the utility of species specificity. This interpretable, multimodal approach is quick to implement with suitable databases and offers valuable insights into the relative strengths of different modalities, guiding researchers in experiment design and identifying when combining complementary techniques adds meaningful information to a scientific investigation.
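The experimental protocol — train the same regressor on each modality alone and on their concatenation — can be sketched with synthetic spectra. The feature construction below is invented for illustration and deliberately gives the XANES-like modality a stronger imprint of the target, mirroring the paper's finding; it is not the simulated data used in the study:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n = 300
oxidation_state = rng.integers(2, 5, size=n).astype(float)  # target property

# Toy "spectra": each modality carries a noisy imprint of the target,
# with the XANES-like channel given a cleaner signal by construction.
xanes = oxidation_state[:, None] * np.linspace(1, 2, 50) + rng.normal(0, 0.5, (n, 50))
pdf = np.sqrt(oxidation_state)[:, None] * np.linspace(0, 1, 40) + rng.normal(0, 2.0, (n, 40))

def score(features):
    """Train on the first 200 samples, report test MAE on the rest."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(features[:200], oxidation_state[:200])
    pred = model.predict(features[200:])
    return float(np.mean(np.abs(pred - oxidation_state[200:])))

mae_xanes = score(xanes)
mae_pdf = score(pdf)
mae_both = score(np.concatenate([xanes, pdf], axis=1))  # multimodal input
print(mae_xanes, mae_pdf, mae_both)
```

Comparing the three errors reproduces the shape of the paper's analysis: the modality with the cleaner imprint dominates, and concatenation shows whether the second modality adds information beyond it.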
Funding: Supported by the National Natural Science Foundation of China (grant no. U21A20215), Zhejiang Lab (grant no. 2021PE0AC04), Hainan Yazhou Bay Seed Laboratory (B21HJ0101), and the Natural Science Foundation of Jilin Province (20220101277JC).
Abstract: The rate of soybean canopy establishment largely determines photoperiodic sensitivity, subsequently influencing yield potential. However, assessing the rate of soybean canopy development in large-scale field breeding trials is both laborious and time-consuming. High-throughput phenotyping methods based on unmanned aerial vehicle (UAV) systems can be used to monitor and quantitatively describe canopy development across soybean genotypes. In this study, high-resolution, time-series raw data from field soybean populations were collected using UAVs.
Abstract: This paper proposes an image-text sentiment analysis method that integrates ensemble learning with multimodal large language models (MLLMs). To address key challenges in image-text sentiment analysis, such as class imbalance and cross-modal sentiment inconsistency, the EMSAN (ensemble multimodal sentiment analysis network) framework is designed. The framework adopts a primary-auxiliary model structure, combining a primary model trained on the full dataset with auxiliary models optimized on balanced subsets, to achieve accurate recognition of each sentiment class. For feature learning, EMSAN adopts a two-stage strategy to enhance sentiment features: a multimodal large language model generates high-quality image descriptions to narrow the semantic gap between the visual and textual modalities, and a consistency contrastive learning mechanism is introduced, which contrasts the differences between textual and visual features to reinforce consistent cross-modal sentiment expression and obtain finer-grained features. By learning on both balanced and imbalanced data, EMSAN effectively mitigates the class-imbalance problem while preserving the natural distribution of the data. Experimental results on multiple public benchmark datasets show that the proposed method achieves significant performance improvements.
Abstract: Nutrition informatics is moving from a traditional paradigm of rule-based systems and conventional machine learning to a new stage centered on large language models (LLMs) and multimodal large language models (MLLMs). This paper systematically reviews research progress on large nutrition models from 2019 to 2025, summarizing key architectures and training techniques, including vision-language alignment, domain knowledge injection, retrieval-augmented generation (RAG), and interpretable reasoning. On this basis, the paper details the current state of model applications in typical scenarios such as personalized dietary recommendation, nutritional status assessment, disease-related nutrition management, and automated dietary logging. It also summarizes the evolution of core datasets and evaluation benchmarks such as Nutrition5k and NutriBench. Finally, regarding challenges in model trustworthiness, data privacy, cross-cultural generalization, and clinical evidence support, the paper proposes that future research should deeply integrate clinical evidence, build high-quality multimodal data systems, and advance human-AI collaborative precision nutrition services to enhance clinical translational value.
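Retrieval-augmented generation, one of the techniques surveyed above, can be sketched as cosine-similarity retrieval over a small document store followed by prompt assembly. The toy nutrition entries and bag-of-words embedding below are purely illustrative, not any surveyed model's actual retriever:

```python
import numpy as np

# Toy nutrition knowledge base (hypothetical entries, for illustration only).
docs = [
    "Spinach is high in iron and vitamin K.",
    "Salmon provides omega-3 fatty acids and vitamin D.",
    "White rice is mostly carbohydrate with little fibre.",
]

def tokens(text):
    """Lowercase word tokens with trailing punctuation stripped."""
    return [w.strip(".,?").lower() for w in text.split()]

vocab = sorted({w for d in docs for w in tokens(d)})

def embed(texts):
    """Bag-of-words count vectors over the shared vocabulary."""
    return np.array([[tokens(t).count(w) for w in vocab] for t in texts], float)

doc_vecs = embed(docs)

def retrieve(query, k=1):
    """Top-k documents by cosine similarity: the 'R' in RAG."""
    q = embed([query])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

question = "Which food provides omega-3 fatty acids?"
context = retrieve(question)
# The retrieved passage is prepended to the model prompt for grounding.
prompt = f"Context: {context[0]}\nQuestion: {question}\nAnswer:"
print(context[0])
```

Production systems replace the bag-of-words vectors with learned dense embeddings and a vector index, but the pipeline shape — embed, retrieve, prepend as context — is the same one the reviewed nutrition models use to ground their answers.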