Journal Articles
181 articles found
1. Performance vs. Complexity Comparative Analysis of Multimodal Bilinear Pooling Fusion Approaches for Deep Learning-Based Visual Arabic-Question Answering Systems
Authors: Sarah M. Kamel, Mai A. Fadel, Lamiaa Elrefaei, Shimaa I. Hassan. Computer Modeling in Engineering & Sciences, 2025, Issue 4, pp. 373-411 (39 pages)
Visual question answering (VQA) is a multimodal task, involving a deep understanding of the image scene and the question's meaning and capturing the relevant correlations between both modalities to infer the appropriate answer. In this paper, we propose a VQA system intended to answer yes/no questions about real-world images, in Arabic. To support a robust VQA system, we work in two directions: (1) using deep neural networks, namely ResNet-152 and Gated Recurrent Units (GRU), to semantically represent the given image and question in a fine-grained manner; (2) studying the role of the utilized multimodal bilinear pooling fusion technique in the trade-off between the model complexity and the overall model performance. Some fusion techniques could significantly increase the model complexity, which seriously limits their applicability for VQA models. So far, there is no evidence of how efficient these multimodal bilinear pooling fusion techniques are for VQA systems dedicated to yes/no questions. Hence, a comparative analysis is conducted between eight bilinear pooling fusion techniques, in terms of their ability to reduce the model complexity and improve the model performance in this case of VQA systems. Experiments indicate that these multimodal bilinear pooling fusion techniques have improved the VQA model's performance, reaching a best performance of 89.25%. Further, experiments have proven that the number of answers in the developed VQA system is a critical factor that affects the effectiveness of these multimodal bilinear pooling techniques in achieving their main objective of reducing the model complexity. The Multimodal Local Perception Bilinear Pooling (MLPB) technique has shown the best balance between model complexity and performance for VQA systems designed to answer yes/no questions.
Keywords: Arabic-VQA, deep learning-based VQA, deep multimodal information fusion, multimodal representation learning, VQA of yes/no questions, VQA model complexity, VQA model performance, performance-complexity trade-off
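As context for the bilinear pooling family compared in this entry, below is a minimal numpy sketch of low-rank (Hadamard-product) bilinear fusion of an image feature and a question feature. The dimensions, weight matrices, and function name are illustrative choices of mine, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)

d_img, d_q, d_joint = 8, 6, 4            # toy dimensions, not the paper's
W_img = rng.normal(size=(d_img, d_joint))
W_q = rng.normal(size=(d_q, d_joint))

def mlb_fuse(v_img, v_q):
    """Low-rank bilinear fusion: project each modality into a shared
    joint space, then take the element-wise (Hadamard) product."""
    return np.tanh(v_img @ W_img) * np.tanh(v_q @ W_q)

v_img = rng.normal(size=d_img)           # stand-in for an image embedding
v_q = rng.normal(size=d_q)               # stand-in for a question embedding
z = mlb_fuse(v_img, v_q)
print(z.shape)                           # joint representation, shape (4,)
```

The appeal of this family is that the joint dimension `d_joint` controls the parameter count, which is exactly the complexity-performance trade-off the paper studies.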
2. DTLCDR: A target-based multimodal fusion deep learning framework for cancer drug response prediction
Authors: Jie Yu, Cheng Shi, Yiran Zhou, Ningfeng Liu, Xiaolin Zong, Zhenming Liu, Liangren Zhang. Journal of Pharmaceutical Analysis, 2025, Issue 8, pp. 1825-1836 (12 pages)
Accurate prediction of drug responses in cancer cell lines (CCLs) and transferable prediction of clinical drug responses using CCLs are two major tasks in personalized medicine. Despite rapid advancements in existing computational methods for preclinical and clinical cancer drug response (CDR) prediction, challenges remain regarding generalization to new drugs unseen in the training set. Herein, we propose a multimodal fusion deep learning (DL) model called drug-target and single-cell language based CDR (DTLCDR) to predict preclinical and clinical CDRs. The model integrates chemical descriptors, molecular graph representations, predicted protein target profiles of drugs, and cell line expression profiles with general knowledge from single cells. Among these features, a well-trained drug-target interaction (DTI) prediction model is used to generate target profiles of drugs, and a pretrained single-cell language model is integrated to provide general genomic knowledge. Comparison experiments on the cell line drug sensitivity dataset demonstrated that DTLCDR exhibits improved generalizability and robustness in predicting unseen drugs compared with previous state-of-the-art baseline methods. Further ablation studies verified the effectiveness of each component of the model, highlighting the significant contribution of target information to generalizability. Subsequently, the ability of DTLCDR to predict novel molecules was validated through in vitro cell experiments, demonstrating its potential for real-world applications. Moreover, DTLCDR was transferred to clinical datasets with satisfactory performance, regardless of whether the drugs were included in the cell line dataset. Overall, our results suggest that DTLCDR is a promising tool for personalized drug discovery.
Keywords: Personalized medicine, Cancer drug response, Multimodal fusion, Deep learning, Drug-target interaction, Single-cell language model
3. Multimodal detection framework for financial fraud integrating LLMs and interpretable machine learning
Authors: Hui Nie, Zhao-hui Long, Ze-jun Fang, Lu-qiong Gao. Journal of Data and Information Science, 2025, Issue 4, pp. 291-315 (25 pages)
Purpose: This study integrates large language models (LLMs) with interpretable machine learning methods to develop a multimodal data-driven framework for predicting corporate financial fraud, addressing the limitations of traditional approaches in long-text semantic parsing, model interpretability, and multisource data fusion, thereby providing regulatory agencies with intelligent auditing tools. Design/methodology/approach: Analyzing 5,304 Chinese listed firms' annual reports (2015-2020) from the CSMAD database, the study leverages the Doubao LLMs to generate chunked summaries and 256-dimensional semantic vectors, developing textual semantic features. It integrates 19 financial indicators, 11 governance metrics, and linguistic characteristics (tone, readability) with fraud prediction models optimized through a group of Gradient Boosted Decision Tree (GBDT) algorithms. SHAP value analysis of the final model reveals the risk transmission mechanism by quantifying the marginal impacts of financial, governance, and textual features on fraud likelihood. Findings: LLMs effectively distill lengthy annual reports into semantic summaries, while GBDT algorithms (AUC > 0.850) outperform the traditional logistic regression model in fraud detection. Multimodal fusion improved performance by 7.4%, with financial, governance, and textual features providing complementary signals. SHAP analysis revealed financial distress, governance conflicts, and narrative patterns (e.g., tone anchoring, semantic thresholds) as key fraud indicators, highlighting managerial intent in report language. Research limitations: The study identifies three key limitations: 1) lack of interpretability for semantic features; 2) absence of granular fraud-type differentiation; 3) unexplored comparative validation with other deep learning methods. Future research will address these gaps to enhance fraud detection precision and model transparency. Practical implications: The developed semantic-enhanced evaluation model provides a quantitative tool for assessing listed companies' information disclosure quality and enables practical implementation through its derivative real-time monitoring system, strengthening capital market risk early-warning capabilities and offering actionable insights for securities regulation. Originality/value: This study presents three key innovations: 1) a novel “chunking-summarization-embedding” framework for efficient semantic compression of lengthy annual reports (30,000 words); 2) demonstration of LLMs' superior performance in financial text analysis, outperforming traditional methods by 19.3%; 3) a novel “language-psychology-behavior” triad model for analyzing managerial fraud motives.
Keywords: Financial fraud detection, Large language models, Multimodal data fusion, Interpretable machine learning, Annual report
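The AUC figure reported in this entry (AUC > 0.850) needs no ML library to compute; a minimal sketch of the rank-based AUC on made-up fraud scores (not the study's data):

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen fraud case
    scores higher than a randomly chosen non-fraud case (ties count half)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

# hypothetical classifier scores: higher = more suspicious
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1]
print(round(auc(scores, labels), 3))    # 11 of 12 pos/neg pairs ranked correctly
```

This pairwise definition is equivalent to the area under the ROC curve and is what makes AUC robust to the class imbalance the paper highlights.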
4. Multimodality Prediction of Chaotic Time Series with Sparse Hard-Cut EM Learning of the Gaussian Process Mixture Model (cited 1 time)
Authors: 周亚同, 樊煜, 陈子一, 孙建成. Chinese Physics Letters (SCIE, CAS, CSCD), 2017, Issue 5, pp. 22-26 (5 pages)
The contribution of this work is twofold: (1) a multimodality prediction method of chaotic time series with the Gaussian process mixture (GPM) model is proposed, which employs a divide-and-conquer strategy. It automatically divides the chaotic time series into multiple modalities with different extrinsic patterns and intrinsic characteristics, and thus can more precisely fit the chaotic time series. (2) An effective sparse hard-cut expectation maximization (SHC-EM) learning algorithm for the GPM model is proposed to improve the prediction performance. SHC-EM replaces a large learning sample set with fewer pseudo inputs, accelerating model learning based on these pseudo inputs. Experiments on Lorenz and Chua time series demonstrate that the proposed method yields not only accurate multimodality prediction, but also the prediction confidence interval. SHC-EM outperforms traditional variational learning in terms of both prediction accuracy and speed. In addition, SHC-EM is more robust and less susceptible to noise than variational learning.
Keywords: GPM, multimodality prediction, chaotic time series, sparse hard-cut EM, Gaussian process mixture model
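The Lorenz benchmark used in this entry's experiments is easy to reproduce; a sketch with the standard parameters (σ = 10, ρ = 28, β = 8/3) and a simple Euler integrator, where the step size and initial condition are my own choices:

```python
import numpy as np

def lorenz_series(n, dt=0.01, s=10.0, r=28.0, b=8.0 / 3.0):
    """Integrate the Lorenz system with Euler steps and return the
    x-coordinate as a 1-D chaotic time series of length n."""
    x, y, z = 1.0, 1.0, 1.0
    out = np.empty(n)
    for i in range(n):
        dx = s * (y - x)
        dy = x * (r - z) - y
        dz = x * y - b * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        out[i] = x
    return out

series = lorenz_series(2000)
print(series.shape)
```

A prediction model such as the GPM would then be trained on delay-embedded windows of `series`; the embedding itself is not shown here.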
5. Leveraging Vision-Language Pre-Trained Model and Contrastive Learning for Enhanced Multimodal Sentiment Analysis
Authors: Jieyu An, Wan Mohd Nazmee Wan Zainon, Binfen Ding. Intelligent Automation & Soft Computing (SCIE), 2023, Issue 8, pp. 1673-1689 (17 pages)
Multimodal sentiment analysis is an essential area of research in artificial intelligence that combines multiple modalities, such as text and image, to accurately assess sentiment. However, conventional approaches that rely on unimodal pre-trained models for feature extraction from each modality often overlook the intrinsic connections of semantic information between modalities. This limitation is attributed to their training on unimodal data, and necessitates the use of complex fusion mechanisms for sentiment analysis. In this study, we present a novel approach that combines a vision-language pre-trained model with a proposed multimodal contrastive learning method. Our approach harnesses the power of transfer learning by utilizing a vision-language pre-trained model to extract both visual and textual representations in a unified framework. We employ a Transformer architecture to integrate these representations, thereby enabling the capture of rich semantic information in image-text pairs. To further enhance the representation learning of these pairs, we introduce our proposed multimodal contrastive learning method, which leads to improved performance in sentiment analysis tasks. Our approach is evaluated through extensive experiments on two publicly accessible datasets, where we demonstrate its effectiveness. We achieve a significant improvement in sentiment analysis accuracy, indicating the superiority of our approach over existing techniques. These results highlight the potential of multimodal sentiment analysis and underscore the importance of considering the intrinsic semantic connections between modalities for accurate sentiment assessment.
Keywords: multimodal sentiment analysis, vision–language pre-trained model, contrastive learning, sentiment classification
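The entry does not specify the exact form of its contrastive objective; a generic symmetric InfoNCE-style image-text contrastive loss, the common starting point for such methods, can be sketched in numpy as follows (batch size, embedding dimension, and temperature are illustrative assumptions):

```python
import numpy as np

def info_nce(img, txt, tau=0.07):
    """Symmetric InfoNCE: matched image-text pairs (row i with row i)
    should score higher than all mismatched pairs in the batch."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau                      # cosine similarity matrix
    # cross-entropy with the diagonal as the correct class, both directions
    lse_rows = np.log(np.exp(logits).sum(axis=1))
    lse_cols = np.log(np.exp(logits).sum(axis=0))
    loss_i2t = (lse_rows - np.diag(logits)).mean()
    loss_t2i = (lse_cols - np.diag(logits)).mean()
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(1)
emb = rng.normal(size=(4, 8))
perfect = info_nce(emb, emb)            # aligned pairs -> low loss
shuffled = info_nce(emb, emb[::-1])     # misaligned pairs -> higher loss
print(perfect < shuffled)
```

Minimizing this loss pulls matched image-text embeddings together and pushes mismatched ones apart, which is the mechanism behind the "intrinsic semantic connections" the abstract emphasizes.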
6. LLM-Powered Multimodal Reasoning for Fake News Detection
Authors: Md. Ahsan Habib, Md. Anwar Hussen Wadud, M. F. Mridha, Md. Jakir Hossen. Computers, Materials & Continua, 2026, Issue 4, pp. 1821-1864 (44 pages)
The problem of fake news detection (FND) is becoming increasingly important in the field of natural language processing (NLP) because of the rapid dissemination of misleading information on the web. Large language models (LLMs) such as GPT-4.0 excel in natural language understanding tasks but can still struggle to distinguish between fact and fiction, particularly when applied in the wild. However, a key challenge of existing FND methods is that they only consider unimodal data (e.g., images), while more detailed multimodal data (e.g., user behaviour, temporal dynamics) is neglected, and the latter is crucial for full-context understanding. To overcome these limitations, we introduce M3-FND (Multimodal Misinformation Mitigation for False News Detection), a novel methodological framework that integrates LLMs with multimodal data sources to perform context-aware veracity assessments. Our method proposes a hybrid system that combines image-text alignment, user credibility profiling, and temporal pattern recognition, strengthened through a natural feedback loop that provides real-time feedback for correcting downstream errors. We use contextual reinforcement learning to schedule prompt updating and update the classifier threshold based on the latest multimodal input, which enables the model to better adapt to changing misinformation attack strategies. M3-FND is tested on three diverse datasets, FakeNewsNet, Twitter15, and Weibo, which contain both text and visual social media content. Experiments show that M3-FND significantly outperforms conventional and LLM-based baselines in terms of accuracy, F1-score, and AUC on all benchmarks. Our results indicate the importance of employing multimodal cues and adaptive learning for effective and timely detection of fake news.
Keywords: Fake news detection, multimodal learning, large language models, prompt engineering, instruction tuning, reinforcement learning, misinformation mitigation
7. AI-driven integration of multi-omics and multimodal data for precision medicine
Author: Heng-Rui Liu. Medical Data Mining, 2026, Issue 1, pp. 1-2 (2 pages)
High-throughput transcriptomics has evolved from bulk RNA-seq to single-cell and spatial profiling, yet its clinical translation still depends on effective integration across diverse omics and data modalities. Emerging foundation models and multimodal learning frameworks are enabling scalable and transferable representations of cellular states, while advances in interpretability and real-world data integration are bridging the gap between discovery and clinical application. This paper outlines a concise roadmap for AI-driven, transcriptome-centered multi-omics integration in precision medicine (Figure 1).
Keywords: high-throughput transcriptomics, multi-omics, single cell, multimodal learning frameworks, foundation models, omics data modalities, AI-driven precision medicine
8. A survey on pre-training and transfer learning for multimodal Vision-Language Models
Author: Zhongren Liang. Advances in Engineering Innovation, 2025, Issue 7, pp. 135-139 (5 pages)
In recent years, Vision-Language Models (VLMs) have emerged as a significant breakthrough in multimodal learning, demonstrating remarkable progress in tasks such as image-text alignment, image generation, and semantic reasoning. This paper systematically reviews current VLM pretraining methodologies, including contrastive learning and generative paradigms, while providing an in-depth analysis of efficient transfer learning strategies such as prompt tuning, LoRA, and adapter modules. Through representative models like CLIP, BLIP, and GIT, we examine their practical applications in visual grounding, image-text retrieval, visual question answering, affective computing, and embodied AI. Furthermore, we identify persistent challenges in fine-grained semantic modeling, cross-modal reasoning, and cross-lingual transfer. Finally, we envision future trends in unified architectures, multimodal reinforcement learning, and domain adaptation, aiming to provide systematic reference and technical insights for subsequent research.
Keywords: Vision-Language models, multimodal learning, pre-training, transfer learning, contrastive learning
9. Multimodal Gas Detection Using E-Nose and Thermal Images: An Approach Utilizing SRGAN and Sparse Autoencoder
Authors: Pratik Jadhav, Vuppala Adithya Sairam, Niranjan Bhojane, Abhyuday Singh, Shilpa Gite, Biswajeet Pradhan, Mrinal Bachute, Abdullah Alamri. Computers, Materials & Continua, 2025, Issue 5, pp. 3493-3517 (25 pages)
Electronic nose and thermal images are effective ways to diagnose the presence of gases in real time. Multimodal fusion of these modalities can result in the development of highly accurate diagnostic systems. Low-cost thermal imaging software produces low-resolution thermal images in grayscale format, necessitating methods for improving the resolution and colorizing the images. The objective of this paper is to develop and train a super-resolution generative adversarial network for improving the resolution of thermal images, followed by a sparse autoencoder for colorization of thermal images and a multimodal convolutional neural network for gas detection using electronic nose and thermal images. The dataset used comprises 6400 thermal images and electronic nose measurements for four classes. A multimodal Convolutional Neural Network (CNN) comprising an EfficientNetB2 pre-trained model was developed using both early and late feature fusion. The Super Resolution Generative Adversarial Network (SRGAN) model was trained on low- and high-resolution thermal images, achieving a Structural Similarity Index (SSIM) of 90.28, a Peak Signal-to-Noise Ratio (PSNR) of 68.74, and a Mean Absolute Error (MAE) of 0.066. A sparse autoencoder trained on the grayscale and colorized thermal images produced an MAE of 0.035, a Mean Squared Error (MSE) of 0.006, and a Root Mean Squared Error (RMSE) of 0.0705. The multimodal CNN, trained on these images and electronic nose measurements using early and late fusion techniques, achieved accuracies of 97.89% and 98.55%, respectively. Hence, the proposed framework can be of great aid when integrated with low-cost software to generate high-quality thermal camera images and highly accurate real-time detection of gases.
Keywords: Thermal imaging, gas detection, multimodal learning, generative models, autoencoders
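The PSNR and MAE figures reported for the image-restoration stages follow standard definitions; a small numpy sketch on a toy image pair (the data and noise level are my own, purely for illustration):

```python
import numpy as np

def mae(a, b):
    """Mean absolute error between two arrays."""
    return np.abs(a - b).mean()

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, peak]."""
    mse = ((a - b) ** 2).mean()
    return float("inf") if mse == 0 else 10 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(2)
hi_res = rng.random((32, 32))            # stand-in ground-truth thermal image
recon = np.clip(hi_res + rng.normal(0, 0.01, hi_res.shape), 0, 1)
print(round(mae(hi_res, recon), 4), round(psnr(hi_res, recon), 1))
```

Note that PSNR depends on the assumed peak value, so reported numbers are only comparable when images share the same scaling convention.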
10. Multimodal data-driven approaches in retinal vein occlusion: A narrative review integrating machine learning and bioinformatics
Authors: Chunlan Liang, Lian Liu, Jingxiang Zhong. Advances in Ophthalmology Practice and Research, 2025, Issue 4, pp. 235-244 (10 pages)
Background: Retinal vein occlusion (RVO) is a leading cause of visual impairment on a global scale. Its pathological mechanisms involve a complex interplay of vascular obstruction, ischemia, and secondary inflammatory responses. Recent interdisciplinary advances, underpinned by the integration of multimodal data, have established a new paradigm for unraveling the pathophysiological mechanisms of RVO, enabling early diagnosis and personalized treatment strategies. Main text: This review critically synthesizes recent progress at the intersection of machine learning, bioinformatics, and clinical medicine, focusing on developing predictive models and deep analysis, exploring molecular mechanisms, and identifying markers associated with RVO. By bridging technological innovation with clinical needs, the review underscores the potential of data-driven strategies to advance RVO research and optimize patient care. Conclusions: Machine learning-bioinformatics integration has revolutionised RVO research through predictive modelling and mechanistic insights, particularly via deep learning-enhanced retinal imaging and multi-omics networks. Despite progress, clinical translation requires resolving data standardisation inconsistencies and model generalizability limitations. Establishing multicentre validation frameworks and interpretable AI tools, coupled with patient-focused data platforms through cross-disciplinary collaboration, could enable precision interventions to optimally preserve vision.
Keywords: Bioinformatics, Clinical prediction models, Deep learning, Markers, Multimodal data, Machine learning, Retinal vein occlusion
11. GMCoT: a graph-augmented multimodal chain-of-thought reasoning framework for multi-label zero-shot learning
Authors: Xiang WEN, Haobo WANG, Ke CHEN, Tianlei HU, Gang CHEN. Frontiers of Information Technology & Electronic Engineering, 2025, Issue 12, pp. 2623-2637 (15 pages)
In recent years, multi-label zero-shot learning (ML-ZSL) has garnered increasing attention because of its wide range of potential applications, such as image annotation, text classification, and bioinformatics. The central challenge in ML-ZSL lies in predicting multiple labels for unseen classes without requiring any labeled training data, which contrasts with conventional supervised learning paradigms. However, existing methods face several significant challenges. These include the substantial semantic gap between different modalities, which impedes effective knowledge transfer, and the intricate and typically complex relationships among multiple labels, making it difficult to model them in a meaningful and accurate manner. To overcome these challenges, we propose a graph-augmented multimodal chain-of-thought (GMCoT) reasoning approach. The proposed method combines the strengths of multimodal large language models with graph-based structures, significantly enhancing the reasoning process involved in multi-label prediction. First, a novel multimodal chain-of-thought reasoning framework is presented which imitates human-like step-by-step reasoning to produce multi-label predictions. Second, a technique is presented for integrating label graphs into the reasoning process. This technique enables the capture of complex semantic relationships among labels, thereby improving the accuracy and consistency of multi-label generation. Comprehensive experiments on benchmark datasets demonstrate that the proposed GMCoT approach outperforms state-of-the-art methods in ML-ZSL.
Keywords: Chain-of-thought, Multi-label zero-shot learning, multimodal reasoning, Large language model
12. Interpretable multimodal machine learning analysis of X-ray absorption near-edge spectra and pair distribution functions
Authors: Tanaporn Na Narong, Zoe N. Zachko, Steven B. Torrisi, Simon J. L. Billinge. npj Computational Materials, 2025, Issue 1, pp. 1071-1082 (12 pages)
We used interpretable machine learning to combine information from multiple heterogeneous spectra: X-ray absorption near-edge spectra (XANES) and atomic pair distribution functions (PDFs), to extract local structural and chemical environments of transition metal cations in oxides. Random forest models were trained on simulated XANES, PDF, and both combined, to extract oxidation state, coordination number, and mean nearest-neighbor bond length. XANES-only models generally outperformed PDF-only models, even for structural tasks, although using the metal's differential PDFs (dPDFs) instead of total PDFs narrowed this gap. When combined with PDFs, information from XANES often dominates the prediction. Our results demonstrate that XANES contains rich structural information and highlight the utility of species-specificity. This interpretable, multimodal approach is quick to implement with suitable databases and offers valuable insights into the relative strengths of different modalities, guiding researchers in experiment design and identifying when combining complementary techniques adds meaningful information to a scientific investigation.
Keywords: multimodal analysis, oxides, random forest models, transition metal cations, atomic pair distribution functions (PDFs), heterogeneous spectra, X-ray absorption, local structural and chemical environments, interpretable machine learning
13. A Fine-Grained Recognition Model for Bark Beetle Pests Incorporating a Multimodal Detection Head
Authors: 李巨虎, 路佳, 徐玉立, 李世豪, 蔡祥. 《农业工程学报》 (PKU Core), 2026, Issue 1, pp. 273-283 (11 pages)
To address the difficulty of identifying bark beetle species (Dendroctonus spp.), caused by high species diversity, morphological similarity among closely related species, and frequent sympatric distribution, this study proposes FGRS-Net (fine-grained recognition for Scolytidae network), a model for fine-grained identification of bark beetle species. First, to mitigate recognition bias caused by insufficient samples, a detection head module based on multimodal embeddings was designed. Second, the attention-convolution hybrid module ACmix (attention convolution mixer) was used to capture fused cross-scale discriminative features; the omni-dimensional dynamic convolution module ODConv was introduced to further extract fine-grained insect features while reducing parameters; and the model was made lightweight through pruning and knowledge distillation. To comprehensively evaluate reliability in practical use, systematic robustness tests were conducted under low illumination, blur, and complex background occlusion, and deployment was verified on edge devices with different computing architectures. Experimental results show that FGRS-Net achieves a mean average precision of 89.3% and a recall of 98%, reduces floating-point operations by 16%, and reaches 289 frames/s when deployed on an NVIDIA RTX 5090 GPU, with 11 and 27 frames/s on two development-board platforms. FGRS-Net is accurate and lightweight, competitive with mainstream models, and can serve as a reference for subsequent fine-grained bark beetle recognition.
Keywords: bark beetle detection, fine-grained classification, multimodal learning, lightweight model, dynamic convolution
14. The Effect of Collaborative Argumentation Scaffolds Based on Multimodal Large Models on Knowledge Construction
Authors: 朱珂, 吴雅欣, 夏静怡. 《电化教育研究》 (PKU Core), 2026, Issue 2, pp. 52-60 (9 pages)
In digital learning contexts oriented toward knowledge construction, traditional scaffolds face rigid structures, homogeneous forms, and one-way interaction. As general artificial intelligence evolves from single-modality to multimodality, intelligent scaffolds can support collaborative argumentation through adaptive feedback, cross-modal processing, and multi-agent interaction, helping to deepen knowledge construction. Combining the SMCKI collaborative argumentation model with multimodal large model technology, this study builds a collaborative argumentation scaffold based on multimodal large models. A teaching experiment was conducted with 68 undergraduates; drawing on learners' generative interaction data, content analysis, multilevel logistic regression, and path analysis were used to verify the scaffold's effect from a stage-development perspective. Results show that the scaffold elicited rich discourse-transfer behaviors in collaborative argumentation and higher-order knowledge-construction ideas; summarizing-and-elevating discourse transfer was closely related to key ideas, and key ideas were the dominant factor promoting the leap from lower-order to higher-order knowledge construction. Finally, strategy recommendations are proposed concerning multimodal open systems, differentiated support strategies, and synergy mechanisms.
Keywords: multimodal large models, knowledge construction, collaborative argumentation scaffold, SMCKI model, human-machine collaborative learning
15. Time-Series Field Phenotyping of Soybean Growth Analysis by Combining Multimodal Deep Learning and Dynamic Modeling
Authors: Hui Yu, Lin Weng, Songquan Wu, Jingjing He, Yilin Yuan, Jun Wang, Xiaogang Xu, Xianzhong Feng. Plant Phenomics (SCIE, EI, CSCD), 2024, Issue 2, pp. 323-334 (12 pages)
The rate of soybean canopy establishment largely determines photoperiodic sensitivity, subsequently influencing yield potential. However, assessing the rate of soybean canopy development in large-scale field breeding trials is both laborious and time-consuming. High-throughput phenotyping methods based on unmanned aerial vehicle (UAV) systems can be used to monitor and quantitatively describe the development of soybean canopies for different genotypes. In this study, high-resolution and time-series raw data from field soybean populations were collected using UAVs.
Keywords: time-series field phenotyping, soybean growth analysis, multimodal deep learning, dynamic modeling
16. A Multidimensional Feature Fusion Model for Osteoporosis Risk Assessment (cited 1 time)
Authors: 王朝亚, 孟超. 《生物化学与生物物理进展》 (PKU Core), 2026, Issue 1, pp. 238-248 (11 pages)
Objective: To build and validate a deep-learning-based multidimensional feature fusion risk assessment model to improve early identification of osteoporosis and prediction of fracture risk. Methods: Data from 12,856 subjects across multiple databases were included. A multimodal deep learning framework integrated bone mineral density measurements, bone microstructure parameters, bone turnover markers, clinical risk factors, genetic markers, and somatosensory data to build the risk assessment model, and performance was evaluated on an independent external dataset. Results: On the test set, the model predicted osteoporosis with an accuracy of 89.7%, a sensitivity of 87.5%, and a specificity of 91.2%; the area under the receiver operating characteristic curve (AUC) was 0.936 (95% confidence interval (CI): 0.927-0.945), and the simplified model's AUC was 0.917 (95% CI: 0.905-0.931), outperforming the traditional FRAX® model (AUC = 0.842, 95% CI: 0.829-0.855). On the independent external validation set the AUC was 0.918 (95% CI: 0.905-0.931). Feature importance analysis showed that bone mineral density, trabecular separation, type I collagen C-terminal telopeptide, balance parameters, and specific gene polymorphisms were important predictors. Subgroup analyses showed good performance across sexes, age groups, and ethnicities. Conclusion: The multidimensional feature fusion model significantly improves the accuracy and individualization of osteoporosis risk assessment, generalizes well, and holds promise as a more precise decision-support tool for clinical practice.
Keywords: osteoporosis, machine learning, multimodal feature fusion, fracture risk, prediction model
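The accuracy, sensitivity, and specificity figures reported in this entry follow the usual confusion-matrix definitions; a minimal sketch with made-up screening outcomes (not the study's data):

```python
import numpy as np

def diag_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall on positives), and specificity
    (recall on negatives) from binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = ((y_true == 1) & (y_pred == 1)).sum()
    tn = ((y_true == 0) & (y_pred == 0)).sum()
    fp = ((y_true == 0) & (y_pred == 1)).sum()
    fn = ((y_true == 1) & (y_pred == 0)).sum()
    return (tp + tn) / len(y_true), tp / (tp + fn), tn / (tn + fp)

# hypothetical outcomes: 1 = osteoporosis present / predicted
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
acc, sens, spec = diag_metrics(y_true, y_pred)
print(acc, sens, spec)
```

Reporting sensitivity and specificity alongside AUC, as the study does, matters because accuracy alone can look strong on imbalanced screening data.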
17. Document Layout Analysis Combining Keyword Extraction and Graph Contrastive Learning
Authors: 马晓松, 刘杰, 李晓辉, 郭颖. 《小型微型计算机系统》 (PKU Core), 2026, Issue 1, pp. 150-156 (7 pages)
Document layout analysis is an important task and a necessary prerequisite in information retrieval and document understanding. Traditional layout-analysis methods often ignore the deep association between textual content and document structure. This paper proposes a method based on graph neural networks combined with large language models and graph contrastive learning to improve the accuracy of document layout analysis. First, a large language model automatically extracts keywords that are fused into graph nodes, enhancing the graph neural network's understanding of document content and structure. Second, graph contrastive learning optimizes node representations through an inter-view contrastive loss, enabling the model to distinguish document layout patterns more effectively. Experimental results on the DocLayNet dataset show that the method significantly improves layout-analysis accuracy and outperforms existing baseline methods. The approach offers a new technical path for document understanding and information extraction, with promise for broad practical application.
Keywords: graph neural networks, large models, multimodality, graph contrastive learning, document layout analysis
18. An Image-Text Sentiment Analysis Method Based on Ensemble Learning and Multimodal Large Language Models
Authors: 王宁, 武芳宇, 赵宇轩, 张百灵, 庞超逸. 《计算机工程与应用》 (PKU Core), 2026, Issue 3, pp. 153-162 (10 pages)
This paper proposes an image-text sentiment analysis method that combines ensemble learning with multimodal large language models (MLLMs). To address key challenges in image-text sentiment analysis such as class imbalance and cross-modal sentiment inconsistency, the EMSAN (ensemble multimodal sentiment analysis network) framework is designed. EMSAN adopts a primary-auxiliary model structure, combining a primary model trained on the full dataset with auxiliary models optimized on balanced subsets to recognize each sentiment class precisely. For feature learning, EMSAN uses a two-stage strategy to enhance sentiment features: a multimodal large language model generates high-quality image captions to narrow the semantic gap between the visual and textual modalities, and a consistency contrastive learning mechanism contrasts textual and visual features to reinforce consistent cross-modal sentiment expression and obtain finer-grained features. By learning from both balanced and imbalanced data, EMSAN effectively alleviates class imbalance while preserving the natural data distribution. Experimental results on several public benchmark datasets show that the proposed method achieves significant performance gains.
Keywords: ensemble learning, multimodal large language models, image-text sentiment analysis
19. Short-Term Natural Gas Consumption Forecasting Based on Large Models and Time-Series Knowledge Enhancement
Authors: 赵周丙, 吴冕, 吴柯莹, 虞维超, 宋尚飞, 史博会, 宫敬. 《油气储运》 (PKU Core), 2026, Issue 1, pp. 109-119 (11 pages)
[Objective] As China's natural gas pipeline network grows more intelligent, accurate forecasting of gas consumption has become key to optimal network scheduling. Existing forecasting methods over-rely on complex multidimensional factors, cover a limited range of users, and lack the ability to integrate time-series knowledge; advances in large language models (large models) offer an effective way to address these problems. However, existing large models understand the relevant industry domain poorly, which lowers forecast accuracy, and large-model adaptation for gas-consumption forecasting has not yet been studied in depth. [Methods] A gas-consumption forecasting method based on large models and time-series knowledge enhancement is proposed. First, a time-series knowledge base of gas consumption is built to extract regional consumption features; its construction incorporates a K-means clustering algorithm enhanced with dynamic time warping and centering to overcome the failure of Euclidean distance. Second, to help a pretrained large model with partially frozen parameters understand the input sequence more effectively, prior knowledge, including data decomposition, similar segments retrieved from the time-series knowledge base, and statistical features, is injected into the prompt template. Finally, a patch reprogramming layer is built to adapt the input to the large model, aligning time-series data with the text modality through multi-head cross-attention. [Results] Case studies show: (1) the knowledge-base retrieval mechanism and prompt template effectively improve forecasting accuracy, with less lag, stronger fitting of strong trends and periodicity, and higher precision than traditional methods; the reprogramming patch embedding layer effectively improves the large model's fitting and forecasting of highly volatile data. (2) Across four datasets the method clearly outperforms other models: averaged over key accuracy metrics, the RMSE is 23635.6, the MAE 10915.1, the sMAPE 1.9%, and the MAPE 1.9%, with a mean coefficient of determination (R²) of 0.96, fitting the observations well and verifying the method's generalizability. (3) In ultra-long-term load forecasting, by incorporating rich multimodal domain priors, the method achieves the highest accuracy, reducing RMSE, MAE, sMAPE, and MAPE by 13.62%, 21.49%, 22.21%, and 22.91% on average relative to other models, with a mean R² of 0.95. [Conclusion] The method not only outperforms existing generative-forecasting baselines for gas consumption but also offers a new technical path toward multimodal intelligent decision systems, advancing forecasting from single-scenario toward cross-modal collaboration.
Keywords: natural gas, consumption forecasting, dynamic time warping, large language models, retrieval augmentation, prompt learning, multimodal alignment
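The K-means variant described above replaces Euclidean distance with dynamic time warping (DTW) so that phase-shifted consumption curves can still cluster together; a minimal numpy sketch of the classic DTW recurrence (my own implementation, not the authors' code):

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping distance between two 1-D series: the
    cumulative cost of the best monotonic alignment of their points."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# a phase-shifted sine: pointwise (Euclidean-style) distance is inflated
# by the shift, while DTW can warp the time axis to re-align the curves
x = np.sin(np.linspace(0, 2 * np.pi, 50))
y = np.sin(np.linspace(0, 2 * np.pi, 50) + 0.3)
print(dtw(x, x), round(dtw(x, y), 3))
```

The diagonal-only alignment recovers the plain pointwise distance, so DTW is never larger than it; this is exactly why Euclidean distance "fails" for shifted load profiles while DTW does not.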
20. Nutrition Large Models: Technical Architecture, Application Progress, and Future Challenges
Authors: 张成东, 孔浩楠, 杨元, 闫媛媛, 童天朗, 王慧. 《生命科学》, 2026, Issue 1, pp. 1-17 (17 pages)
Nutrition informatics is moving from traditional rule-based and conventional machine learning paradigms to a new stage centered on large language models (LLMs) and multimodal large language models (MLLMs). This paper systematically reviews research progress on nutrition large models from 2019 to 2025, summarizing key architectures and training techniques such as vision-language alignment, domain knowledge injection, retrieval-augmented generation (RAG), and interpretable reasoning. On this basis, it details current applications in typical scenarios including personalized dietary recommendation, nutritional status assessment, disease nutrition management, and automated dietary logging. It also traces the evolution of core datasets and evaluation benchmarks such as Nutrition5k and NutriBench. Finally, addressing challenges in model trustworthiness, data privacy, cross-cultural generalization, and clinical evidence support, the paper argues that future research should deeply integrate clinical evidence, build high-quality multimodal data systems, and advance human-machine collaborative precision nutrition services to improve clinical translational value.
Keywords: nutrition large models, multimodal learning, large language models, personalized nutrition, retrieval-augmented generation, evaluation benchmarks