Journal Articles
154 articles found
1. Performance vs. Complexity Comparative Analysis of Multimodal Bilinear Pooling Fusion Approaches for Deep Learning-Based Visual Arabic-Question Answering Systems
Authors: Sarah M. Kamel, Mai A. Fadel, Lamiaa Elrefaei, Shimaa I. Hassan. Computer Modeling in Engineering & Sciences, 2025, No. 4, pp. 373-411 (39 pages)
Visual question answering (VQA) is a multimodal task, involving a deep understanding of the image scene and the question's meaning and capturing the relevant correlations between both modalities to infer the appropriate answer. In this paper, we propose a VQA system intended to answer yes/no questions about real-world images, in Arabic. To support a robust VQA system, we work in two directions: (1) Using deep neural networks to semantically represent the given image and question in a fine-grained manner, namely ResNet-152 and Gated Recurrent Units (GRU). (2) Studying the role of the utilized multimodal bilinear pooling fusion technique in the trade-off between the model complexity and the overall model performance. Some fusion techniques could significantly increase the model complexity, which seriously limits their applicability for VQA models. So far, there is no evidence of how efficient these multimodal bilinear pooling fusion techniques are for VQA systems dedicated to yes/no questions. Hence, a comparative analysis is conducted between eight bilinear pooling fusion techniques, in terms of their ability to reduce the model complexity and improve the model performance in this case of VQA systems. Experiments indicate that these multimodal bilinear pooling fusion techniques have improved the VQA model's performance, until reaching the best performance of 89.25%. Further, experiments have proven that the number of answers in the developed VQA system is a critical factor that affects the effectiveness of these multimodal bilinear pooling techniques in achieving their main objective of reducing the model complexity. The Multimodal Local Perception Bilinear Pooling (MLPB) technique has shown the best balance between the model complexity and its performance, for VQA systems designed to answer yes/no questions.
Keywords: Arabic-VQA; deep learning-based VQA; deep multimodal information fusion; multimodal representation learning; VQA of yes/no questions; VQA model complexity; VQA model performance; performance-complexity trade-off
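To make the fusion idea concrete, here is a minimal PyTorch sketch of one low-rank bilinear pooling variant (MLB-style Hadamard fusion) for yes/no VQA; the dimensions, layer names, and binary answer head are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class MLBFusion(nn.Module):
    """Low-rank bilinear pooling: project both modalities into a shared
    latent space, fuse with an element-wise product, then classify."""
    def __init__(self, img_dim=2048, q_dim=512, latent_dim=1000, n_answers=2):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, latent_dim)   # ResNet-152 features (assumed 2048-d)
        self.q_proj = nn.Linear(q_dim, latent_dim)       # GRU question state (assumed 512-d)
        self.classifier = nn.Linear(latent_dim, n_answers)

    def forward(self, img_feat, q_feat):
        # The Hadamard product approximates the full bilinear interaction
        # at a fraction of the parameter cost (latent_dim vs img_dim*q_dim).
        fused = torch.tanh(self.img_proj(img_feat)) * torch.tanh(self.q_proj(q_feat))
        return self.classifier(fused)  # logits over {yes, no}

# logits = MLBFusion()(torch.randn(8, 2048), torch.randn(8, 512))  # -> (8, 2)
```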
2. Multimodal detection framework for financial fraud integrating LLMs and interpretable machine learning
Authors: Hui Nie, Zhao-hui Long, Ze-jun Fang, Lu-qiong Gao. Journal of Data and Information Science, 2025, No. 4, pp. 291-315 (25 pages)
Purpose: This study aims to integrate large language models (LLMs) with interpretable machine learning methods to develop a multimodal data-driven framework for predicting corporate financial fraud, addressing the limitations of traditional approaches in long-text semantic parsing, model interpretability, and multisource data fusion, thereby providing regulatory agencies with intelligent auditing tools.
Design/methodology/approach: Analyzing 5,304 Chinese listed firms' annual reports (2015-2020) from the CSMAD database, this study leverages the Doubao LLMs to generate chunked summaries and 256-dimensional semantic vectors, developing textual semantic features. It integrates 19 financial indicators, 11 governance metrics, and linguistic characteristics (tone, readability) with fraud prediction models optimized through a group of Gradient Boosted Decision Tree (GBDT) algorithms. SHAP value analysis in the final model reveals the risk transmission mechanism by quantifying the marginal impacts of financial, governance, and textual features on fraud likelihood.
Findings: The study found that LLMs effectively distill lengthy annual reports into semantic summaries, while GBDT algorithms (AUC > 0.850) outperform the traditional logistic regression model in fraud detection. Multimodal fusion improved performance by 7.4%, with financial, governance, and textual features providing complementary signals. SHAP analysis revealed financial distress, governance conflicts, and narrative patterns (e.g., tone anchoring, semantic thresholds) as key fraud indicators, highlighting managerial intent in report language.
Research limitations: This study identifies three key limitations: 1) lack of interpretability for semantic features, 2) absence of granular fraud-type differentiation, and 3) unexplored comparative validation with other deep learning methods. Future research will address these gaps to enhance fraud detection precision and model transparency.
Practical implications: The developed semantic-enhanced evaluation model provides a quantitative tool for assessing listed companies' information disclosure quality and enables practical implementation through its derivative real-time monitoring system. This advancement significantly strengthens capital market risk early-warning capabilities, offering actionable insights for securities regulation.
Originality/value: This study presents three key innovations: 1) a novel "chunking-summarization-embedding" framework for efficient semantic compression of lengthy annual reports (30,000 words); 2) demonstration of LLMs' superior performance in financial text analysis, outperforming traditional methods by 19.3%; 3) a novel "language-psychology-behavior" triad model for analyzing managerial fraud motives.
Keywords: financial fraud detection; large language models; multimodal data fusion; interpretable machine learning; annual report
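As a rough illustration of the fusion-plus-interpretation pipeline described above, the sketch below concatenates stand-in LLM text embeddings with tabular indicators, fits a gradient-boosted classifier, and runs SHAP; the shapes, random placeholder data, and plain scikit-learn GBDT are assumptions, not the authors' exact setup:

```python
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative shapes: 256-d LLM summary embeddings plus 19 financial
# and 11 governance indicators, fused by simple concatenation.
n_firms = 1000
text_vec = np.random.randn(n_firms, 256)   # stand-in for Doubao semantic vectors
financial = np.random.randn(n_firms, 19)
governance = np.random.randn(n_firms, 11)
X = np.hstack([text_vec, financial, governance])
y = np.random.randint(0, 2, n_firms)        # 1 = fraud label (placeholder)

model = GradientBoostingClassifier().fit(X, y)

# TreeExplainer quantifies each feature's marginal impact on fraud odds,
# which is how risk transmission can be read off the fitted model.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:100])
```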
3. Multimodality Prediction of Chaotic Time Series with Sparse Hard-Cut EM Learning of the Gaussian Process Mixture Model (Cited: 1)
Authors: Zhou Yatong, Fan Yu, Chen Ziyi, Sun Jiancheng. Chinese Physics Letters (SCIE, CAS, CSCD), 2017, No. 5, pp. 22-26 (5 pages)
The contribution of this work is twofold: (1) A multimodality prediction method for chaotic time series with the Gaussian process mixture (GPM) model is proposed, which employs a divide-and-conquer strategy. It automatically divides the chaotic time series into multiple modalities with different extrinsic patterns and intrinsic characteristics, and thus can more precisely fit the chaotic time series. (2) An effective sparse hard-cut expectation maximization (SHC-EM) learning algorithm for the GPM model is proposed to improve the prediction performance. SHC-EM replaces a large learning sample set with fewer pseudo inputs, accelerating model learning based on these pseudo inputs. Experiments on Lorenz and Chua time series demonstrate that the proposed method yields not only accurate multimodality prediction, but also the prediction confidence interval. SHC-EM outperforms traditional variational learning in terms of both prediction accuracy and speed. In addition, SHC-EM is more robust and less susceptible to noise than variational learning.
Keywords: Gaussian process mixture (GPM); multimodality prediction; chaotic time series; sparse hard-cut EM (SHC-EM)
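For readers unfamiliar with GPM notation, the predictive form such a mixture takes is sketched below; this is the generic formulation under common GPM assumptions, not necessarily the paper's exact equations:

```latex
% Predictive density of a K-component Gaussian process mixture:
% a gate \pi_k softly assigns the query x_* to components, each of
% which contributes a Gaussian with GP posterior mean and variance.
p(y_* \mid x_*, \mathcal{D}) = \sum_{k=1}^{K} \pi_k(x_*)\,
  \mathcal{N}\!\bigl(y_* \mid \mu_k(x_*),\, \sigma_k^2(x_*)\bigr)
% The sparse hard-cut EM variant assigns each training point to exactly
% one component in the E-step and conditions each GP on M << N pseudo
% inputs, reducing training cost from O(N^3) toward O(N M^2).
```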
4. Leveraging Vision-Language Pre-Trained Model and Contrastive Learning for Enhanced Multimodal Sentiment Analysis
Authors: Jieyu An, Wan Mohd Nazmee Wan Zainon, Binfen Ding. Intelligent Automation & Soft Computing (SCIE), 2023, No. 8, pp. 1673-1689 (17 pages)
Multimodal sentiment analysis is an essential area of research in artificial intelligence that combines multiple modes, such as text and image, to accurately assess sentiment. However, conventional approaches that rely on unimodal pre-trained models for feature extraction from each modality often overlook the intrinsic connections of semantic information between modalities. This limitation is attributed to their training on unimodal data, and necessitates the use of complex fusion mechanisms for sentiment analysis. In this study, we present a novel approach that combines a vision-language pre-trained model with a proposed multimodal contrastive learning method. Our approach harnesses the power of transfer learning by utilizing a vision-language pre-trained model to extract both visual and textual representations in a unified framework. We employ a Transformer architecture to integrate these representations, thereby enabling the capture of rich semantic information in image-text pairs. To further enhance the representation learning of these pairs, we introduce our proposed multimodal contrastive learning method, which leads to improved performance in sentiment analysis tasks. Our approach is evaluated through extensive experiments on two publicly accessible datasets, where we demonstrate its effectiveness. We achieve a significant improvement in sentiment analysis accuracy, indicating the superiority of our approach over existing techniques. These results highlight the potential of multimodal sentiment analysis and underscore the importance of considering the intrinsic semantic connections between modalities for accurate sentiment assessment.
Keywords: multimodal sentiment analysis; vision-language pre-trained model; contrastive learning; sentiment classification
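A minimal sketch of the symmetric image-text contrastive (InfoNCE) objective that this line of work builds on, assuming batch-aligned embedding matrices; the temperature value and exact form are generic, not the authors' precise loss:

```python
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs are pulled together,
    all other pairs in the batch act as negatives."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions (image->text and text->image).
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```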
5. AI-driven integration of multi-omics and multimodal data for precision medicine
Authors: Heng-Rui Liu. Medical Data Mining, 2026, No. 1, pp. 1-2 (2 pages)
High-throughput transcriptomics has evolved from bulk RNA-seq to single-cell and spatial profiling, yet its clinical translation still depends on effective integration across diverse omics and data modalities. Emerging foundation models and multimodal learning frameworks are enabling scalable and transferable representations of cellular states, while advances in interpretability and real-world data integration are bridging the gap between discovery and clinical application. This paper outlines a concise roadmap for AI-driven, transcriptome-centered multi-omics integration in precision medicine (Figure 1).
Keywords: high-throughput transcriptomics; multi-omics; single cell; multimodal learning frameworks; foundation models; omics data modalities; AI-driven precision medicine
6. Multimodal Gas Detection Using E-Nose and Thermal Images: An Approach Utilizing SRGAN and Sparse Autoencoder
Authors: Pratik Jadhav, Vuppala Adithya Sairam, Niranjan Bhojane, Abhyuday Singh, Shilpa Gite, Biswajeet Pradhan, Mrinal Bachute, Abdullah Alamri. Computers, Materials & Continua, 2025, No. 5, pp. 3493-3517 (25 pages)
Electronic nose and thermal images are effective ways to diagnose the presence of gases in real time. Multimodal fusion of these modalities can result in the development of highly accurate diagnostic systems. Low-cost thermal imaging software produces low-resolution thermal images in grayscale format, hence necessitating methods for improving the resolution and colorizing the images. The objective of this paper is to develop and train a super-resolution generative adversarial network for improving the resolution of the thermal images, followed by a sparse autoencoder for colorization of thermal images and a multimodal convolutional neural network for gas detection using electronic nose and thermal images. The dataset used comprises 6400 thermal images and electronic nose measurements for four classes. A multimodal convolutional neural network (CNN) comprising an EfficientNetB2 pre-trained model was developed using both early and late feature fusion. The Super-Resolution Generative Adversarial Network (SRGAN) model was developed and trained on low- and high-resolution thermal images, achieving a Structural Similarity Index (SSIM) of 90.28, a Peak Signal-to-Noise Ratio (PSNR) of 68.74, and a Mean Absolute Error (MAE) of 0.066. A sparse autoencoder was trained on the grayscale and colorized thermal images, producing an MAE of 0.035, a Mean Squared Error (MSE) of 0.006, and a Root Mean Squared Error (RMSE) of 0.0705. The multimodal CNN, trained on these images and electronic nose measurements using both early and late fusion techniques, achieved accuracies of 97.89% and 98.55%, respectively. Hence, the proposed framework can be of great aid for integration with low-cost software to generate high-quality thermal camera images and highly accurate detection of gases in real time.
Keywords: thermal imaging; gas detection; multimodal learning; generative models; autoencoders
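To illustrate the late-fusion design described above, here is a hedged PyTorch sketch pairing a torchvision EfficientNet-B2 image branch with a small e-nose MLP; the sensor channel count, hidden sizes, and head dimensions are assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b2

class LateFusionGasNet(nn.Module):
    """Late fusion: image and e-nose branches are encoded separately
    and their feature vectors concatenated before the final classifier."""
    def __init__(self, n_sensors=7, n_classes=4):
        super().__init__()
        self.img_branch = efficientnet_b2(weights=None)
        self.img_branch.classifier = nn.Identity()   # expose 1408-d pooled features
        self.nose_branch = nn.Sequential(
            nn.Linear(n_sensors, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.head = nn.Linear(1408 + 64, n_classes)  # four gas classes per the dataset

    def forward(self, thermal_img, enose):
        # thermal_img: (B, 3, H, W) colorized thermal image; enose: (B, n_sensors)
        f_img = self.img_branch(thermal_img)   # (B, 1408)
        f_nose = self.nose_branch(enose)       # (B, 64)
        return self.head(torch.cat([f_img, f_nose], dim=1))
```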
7. DTLCDR: A target-based multimodal fusion deep learning framework for cancer drug response prediction
Authors: Jie Yu, Cheng Shi, Yiran Zhou, Ningfeng Liu, Xiaolin Zong, Zhenming Liu, Liangren Zhang. Journal of Pharmaceutical Analysis, 2025, No. 8, pp. 1825-1836 (12 pages)
Accurate prediction of drug responses in cancer cell lines (CCLs) and transferable prediction of clinical drug responses using CCLs are two major tasks in personalized medicine. Despite the rapid advancements in existing computational methods for preclinical and clinical cancer drug response (CDR) prediction, challenges remain regarding generalization to new drugs that are unseen in the training set. Herein, we propose a multimodal fusion deep learning (DL) model called drug-target and single-cell language based CDR (DTLCDR) to predict preclinical and clinical CDRs. The model integrates chemical descriptors, molecular graph representations, predicted protein target profiles of drugs, and cell line expression profiles with general knowledge from single cells. Among these features, a well-trained drug-target interaction (DTI) prediction model is used to generate target profiles of drugs, and a pretrained single-cell language model is integrated to provide general genomic knowledge. Comparison experiments on the cell line drug sensitivity dataset demonstrated that DTLCDR exhibited improved generalizability and robustness in predicting unseen drugs compared with previous state-of-the-art baseline methods. Further ablation studies verified the effectiveness of each component of our model, highlighting the significant contribution of target information to generalizability. Subsequently, the ability of DTLCDR to predict novel molecules was validated through in vitro cell experiments, demonstrating its potential for real-world applications. Moreover, DTLCDR was transferred to clinical datasets, demonstrating satisfactory performance regardless of whether the drugs were included in the cell line dataset. Overall, our results suggest that DTLCDR is a promising tool for personalized drug discovery.
Keywords: personalized medicine; cancer drug response; multimodal fusion; deep learning; drug-target interaction; single-cell language model
8. Multimodal data-driven approaches in retinal vein occlusion: A narrative review integrating machine learning and bioinformatics
Authors: Chunlan Liang, Lian Liu, Jingxiang Zhong. Advances in Ophthalmology Practice and Research, 2025, No. 4, pp. 235-244 (10 pages)
Background: Retinal vein occlusion (RVO) is a leading cause of visual impairment on a global scale. Its pathological mechanisms involve a complex interplay of vascular obstruction, ischemia, and secondary inflammatory responses. Recent interdisciplinary advances, underpinned by the integration of multimodal data, have established a new paradigm for unraveling the pathophysiological mechanisms of RVO, enabling early diagnosis and personalized treatment strategies.
Main text: This review critically synthesizes recent progress at the intersection of machine learning, bioinformatics, and clinical medicine, focusing on developing predictive models and deep analysis, exploring molecular mechanisms, and identifying markers associated with RVO. By bridging technological innovation with clinical needs, this review underscores the potential of data-driven strategies to advance RVO research and optimize patient care.
Conclusions: Machine learning-bioinformatics integration has revolutionised RVO research through predictive modelling and mechanistic insights, particularly via deep learning-enhanced retinal imaging and multi-omics networks. Despite progress, clinical translation requires resolving data standardisation inconsistencies and model generalizability limitations. Establishing multicentre validation frameworks and interpretable AI tools, coupled with patient-focused data platforms through cross-disciplinary collaboration, could enable precision interventions to optimally preserve vision.
Keywords: bioinformatics; clinical prediction models; deep learning; markers; multimodal data; machine learning; retinal vein occlusion
9. Time-Series Field Phenotyping of Soybean Growth Analysis by Combining Multimodal Deep Learning and Dynamic Modeling
Authors: Hui Yu, Lin Weng, Songquan Wu, Jingjing He, Yilin Yuan, Jun Wang, Xiaogang Xu, Xianzhong Feng. Plant Phenomics (SCIE, EI, CSCD), 2024, No. 2, pp. 323-334 (12 pages)
The rate of soybean canopy establishment largely determines photoperiodic sensitivity, subsequently influencing yield potential. However, assessing the rate of soybean canopy development in large-scale field breeding trials is both laborious and time-consuming. High-throughput phenotyping methods based on unmanned aerial vehicle (UAV) systems can be used to monitor and quantitatively describe the development of soybean canopies for different genotypes. In this study, high-resolution, time-series raw data from field soybean populations were collected using UAVs.
Keywords: field phenotyping; multimodal deep learning; dynamic modeling; soybean growth analysis
10. Move to See More: Approaching Object With Partial Occlusion Using Large Multimodal Model and Active Object Detection
Authors: Aoqi Wang, Guohui Tian, Yuhao Wang, Zhongyang Li. IET Cyber-Systems and Robotics, 2025, No. 1, pp. 43-55 (13 pages)
Active object detection (AOD) is a crucial task in the field of robotics. A key challenge for AOD in household environments is that the target object is often undetectable due to partial occlusion, which leads to the failure of traditional methods. To address the occlusion problem, this paper first proposes a novel occlusion handling method based on the large multimodal model (LMM). The method utilises an LMM to detect and analyse input RGB images and generates adjustment actions to progressively eliminate occlusion. After the occlusion is handled, an improved AOD method based on a deep Q-learning network (DQN) is used to complete the task. We introduce an attention mechanism to process image features, enabling the model to focus on critical regions of the input images. Additionally, a new reward function is proposed that comprehensively considers the bounding box of the target object and the robot's distance to the object, along with the actions performed by the robot. Experiments on the dataset and in real-world scenarios validate the effectiveness of the proposed method in performing AOD tasks under partial occlusion.
Keywords: active object detection; large multimodal model; reinforcement learning; robots
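Since the exact reward is not given in the abstract, the snippet below is only a plausible shaping of the stated ingredients (bounding-box coverage, distance to the object, action cost); the weights and thresholds are invented for illustration:

```python
def aod_reward(bbox_area_ratio, distance, action_cost,
               target_ratio=0.15, max_distance=5.0):
    """Illustrative reward for active object detection: encourage the
    detected bounding box to grow toward a target share of the frame,
    penalize remaining distance to the object and costly actions."""
    r_bbox = min(bbox_area_ratio / target_ratio, 1.0)   # in [0, 1]
    r_dist = 1.0 - min(distance / max_distance, 1.0)    # closer is better
    return 0.6 * r_bbox + 0.4 * r_dist - action_cost
```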
11. Multimodal large-language model empowering next-generation autonomous driving systems
Authors: Zhiqiang Hu, Mingxing Xu, Qixiu Cheng. Journal of Intelligent and Connected Vehicles, 2025, No. 2, pp. 1-3 (3 pages)
Autonomous driving technology has made significant advancements in recent years. The evolution of autonomous driving systems from traditional modular designs to end-to-end learning paradigms has led to comprehensive improvements in driving capabilities. In modular designs, driving tasks are segmented into independent modules, such as perception, decision-making, planning, and control.
Keywords: autonomous driving systems; end-to-end learning; multimodal large language model; next generation; traditional modular designs; driving capabilities
12. From Geographic Information Systems to Geographic Agents (Cited: 9)
Authors: Luo Bin, Liu Wenhao, Wu Jin, Han Jiafu, Wu Wenzhou, Li Hongsheng. Journal of Geo-information Science, Peking University Core, 2025, No. 1, pp. 83-99 (17 pages)
[Objective] A geographic system is an integrated system encompassing the natural and human phenomena of the Earth's surface and their interrelations. Existing geographic information systems (GIS) can process these geographic elements digitally, but they lack bidirectional interaction between physical and information spaces, and their models typically rely on preset rules and historical data, making them ill-suited to rapidly changing geographic contexts with complex three-dimensional structures. This paper therefore proposes the "geographic agent" as an advanced form of GIS that integrates embodied intelligence, self-supervised learning, and multimodal language models to improve environmental perception, spatial understanding, and autonomous decision-making. [Methods] The geographic-agent architecture designed here comprises multimodal perception, intelligent hub, and action-control modules, which respectively acquire comprehensive environmental information through sensor networks, perform complex contextual reasoning using knowledge graphs and generative models, and realize real-time regulation of the physical environment together with multi-level planning. In addition, geographic agents are tested on Earth-simulator and proving-ground platforms to adapt to the differences between virtual and real environments, giving them stronger autonomous response capabilities in complex, dynamic geographic contexts. [Results] Taking the virtual digital human "Diqiutong" (地球通) as an example, this paper provides a preliminary demonstration of how geographic agents can be realized in spatial-intelligence applications. [Conclusions] As a prototype geographic agent, "Diqiutong" integrates modules such as a spatiotemporal knowledge graph (GeoKG) and a large cognitive-map generation model (GeoGPT), helping users quickly obtain intelligent spatial decision support in fields such as emergency management, urban planning, and ecological monitoring, and fully embodying the evolution of GIS from an information-processing tool into an autonomous spatial agent.
Keywords: intelligent geographic systems; geographic agents; embodied intelligence; self-supervised learning; multimodal perception; knowledge graphs; large models; spatial intelligence
13. A Survey of Deep Learning-Based Anomaly Detection Methods for Surveillance Video (Cited: 2)
Authors: Wang Yang, Zhou Jiaogen, Yan Jun, Guan Jihong. Journal of Image and Graphics, Peking University Core, 2025, No. 3, pp. 615-640 (26 pages)
Monitoring anomalies through surveillance video plays a vital role in social governance, making video anomaly detection a long-standing, challenging topic of great interest in computer vision. From a deep learning perspective, this paper categorizes and reviews the key current video anomaly detection methods. First, it comprehensively introduces the definition of video anomalies, including how anomalies are delimited and classified by type. It then analyzes recent progress in fully supervised, weakly supervised, and unsupervised deep learning methods for video anomaly detection, discusses their respective strengths and weaknesses, and pays particular attention to the latest work combining large models. Next, it describes common and recent datasets in detail, comparing their characteristics with illustrative screenshots. Finally, it introduces various anomaly criteria and performance evaluation standards and compares the performance of the algorithms. Based on this, the paper looks ahead to possible directions for future datasets, evaluation standards, and methods, highlighting in particular the new opportunities that large models bring to video anomaly detection. Overall, this survey aims to deepen readers' understanding of the field and guide future research directions.
Keywords: video anomaly detection; deep learning; datasets; large models; supervised learning; weakly supervised learning; unsupervised learning; multimodality
14. A Survey of Vision-Language-Action Models: From Prehistory to the Frontier (Cited: 4)
Authors: Zhang Hui, Liang Shutong, Li Mingxuan, Tian Yonglin, Ge Jingwei, Yu Hui, Li Lingxi, Wang Feiyue. Acta Automatica Sinica, Peking University Core, 2025, No. 9, pp. 1922-1950 (29 pages)
As a core direction in the development of embodied intelligence, vision-language-action (VLA) models aim to build a unified multimodal representation and an integrated perception-decision-execution architecture, overcoming the bottlenecks of traditional modular systems in functional fragmentation, insufficient semantic alignment, and limited generalization. This paper systematically reviews the technical groundwork of the pre-VLA era; surveys the three mainstream modeling paradigms of modular, end-to-end, and hybrid approaches; and analyzes their structural characteristics, strengths, and key challenges. On this basis, it summarizes the architectures, training mechanisms, multimodal fusion strategies, and application results of current representative VLA models, and classifies and compares typical datasets and evaluation benchmarks. Finally, it looks ahead to future trends and research directions for VLA models with respect to cross-modal collaboration, knowledge injection, long-horizon planning, and generalization to real-world environments.
Keywords: embodied intelligence; vision-language-action models; multimodal fusion; end-to-end learning; task generalization
15. Construction and Application of the "Huaxi Hongyi" (华西黉医) Large Model (Cited: 2)
Authors: Shi Rui, Zheng Bing, Yao Xun, Yang Hao, Yang Xuchen, Zhang Siyuan, Wang Zhenwu, Liu Dongfeng, Dong Jing, Xie Jiaxi, Ma Hu, He Zhiyang, Jiang Cheng, Qiao Feng, Luo Fengming, Huang Jin. Chinese Journal of Clinical Thoracic and Cardiovascular Surgery, Peking University Core, 2025, No. 5, pp. 587-593 (7 pages)
Objective: To construct the "Huaxi Hongyi" large model and explore its effectiveness in assisting medical record generation. Methods: Following a full-pipeline medical large-model paradigm of data annotation, model training, and application incubation, a 72-billion-parameter medical large model, "Huaxi Hongyi", was built through multimodal data fusion, domain-adaptive training, and adaptation to domestically produced hardware. On this basis, an assisted medical record generation system was developed by combining speech recognition, knowledge graphs, and reinforcement learning. Results: Taking assisted discharge-summary generation as an example, after the pilot departments adopted the system, the average writing time per record was reduced from 21 min to 5 min, a 3.2-fold efficiency gain, with a system output accuracy of 92.4%. Conclusion: It is feasible for medical institutions to build autonomously controllable medical large models and incubate application systems from them, providing a reference path for artificial intelligence development at similar institutions.
Keywords: medical large models; data annotation; multimodal learning; medical record generation; artificial intelligence
16. IQAGPT: computed tomography image quality assessment with vision-language and ChatGPT models
Authors: Zhihao Chen, Bin Hu, Chuang Niu, Tao Chen, Yuxin Li, Hongming Shan, Ge Wang. Visual Computing for Industry, Biomedicine, and Art, 2024, No. 1, pp. 165-181 (17 pages)
Large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities in various tasks and attracted increasing interest as a natural language interface across many domains. Recently, large vision-language models (VLMs) that learn rich vision-language correlation from image-text pairs, like BLIP-2 and GPT-4, have been intensively investigated. However, despite these developments, the application of LLMs and VLMs in image quality assessment (IQA), particularly in medical imaging, remains unexplored. This is valuable for objective performance evaluation and potential supplement or even replacement of radiologists' opinions. To this end, this study introduces IQAGPT, an innovative computed tomography (CT) IQA system that integrates an image-quality captioning VLM with ChatGPT to generate quality scores and textual reports. First, a CT-IQA dataset comprising 1,000 CT slices with diverse quality levels is professionally annotated and compiled for training and evaluation. To better leverage the capabilities of LLMs, the annotated quality scores are converted into semantically rich text descriptions using a prompt template. Second, the image-quality captioning VLM is fine-tuned on the CT-IQA dataset to generate quality descriptions. The captioning model fuses image and text features through cross-modal attention. Third, based on the quality descriptions, users verbally request ChatGPT to rate image-quality scores or produce radiological quality reports. Results demonstrate the feasibility of assessing image quality using LLMs. The proposed IQAGPT outperformed GPT-4 and CLIP-IQA, as well as multitask classification and regression models that rely solely on images.
Keywords: deep learning; medical imaging; image captioning; multimodality; large language model; vision-language model; GPT-4; subjective evaluation
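The score-to-text step can be pictured with a small hypothetical template like the one below; the level wording and 0-4 scale are assumptions in the spirit of the paper, not its actual prompt:

```python
# Hypothetical score-to-description template: a numeric quality score is
# mapped to a descriptive sentence that a captioning VLM can be
# fine-tuned against (level texts and scale are illustrative).
QUALITY_LEVELS = {
    0: "severe artifacts and noise; diagnostically unusable",
    1: "heavy noise; major structures barely distinguishable",
    2: "moderate noise; acceptable for routine reading",
    3: "mild noise; good delineation of anatomical detail",
    4: "excellent quality; sharp boundaries and low noise",
}

def score_to_description(score: int) -> str:
    return (f"This CT slice is rated {score} out of 4: "
            f"{QUALITY_LEVELS[score]}.")

print(score_to_description(3))
```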
17. Research Progress on Multimodal Continual Learning Methods (Cited: 1)
Authors: Zhang Wei, Qian Longyue, Zhang Lin, Li Teng. Journal of Data Acquisition and Processing, Peking University Core, 2025, No. 5, pp. 1122-1138 (17 pages)
Multimodal continual learning (MMCL), an important research direction in machine learning and artificial intelligence, aims to accumulate knowledge continually and adapt to new tasks by fusing data from multiple modalities (such as images, text, and speech). Compared with traditional unimodal learning methods, MMCL can process heterogeneous multi-source data in parallel while effectively retaining existing knowledge as it adapts to new task requirements, showing great application potential in intelligent systems. This paper provides a systematic review of multimodal continual learning. First, it presents MMCL's theoretical foundations from three perspectives: basic concepts, evaluation frameworks, and classic unimodal continual learning methods. It then analyzes MMCL's advantages and challenges in practical applications: despite its notable strengths in multimodal information fusion, it still faces key challenges such as modality imbalance and heterogeneous fusion, which both constrain the performance of current methods and point the way for future research. On this basis, the paper comprehensively surveys the state of the art and latest progress across four main families of MMCL methods: replay-based, regularization-based, parameter-isolation-based, and large-model-based. Finally, it offers a forward-looking outlook on future development trends in MMCL.
Keywords: multimodal continual learning; modality alignment; catastrophic forgetting; pre-trained models; task adaptability
18. Digital Modeling and Application of Online Learning Metacognitive Ability Based on Multimodal Data (Cited: 1)
Authors: Wang Hongjiang, Zhang Yifu, Lun Hao, Chen Peiyu, Zhang Shaoying. e-Education Research, Peking University Core, 2025, No. 8, pp. 81-89 (9 pages)
Online learning metacognitive ability has an important influence on online learning outcomes, and modeling it helps learners adjust their learning strategies and processes. Current modeling approaches, however, suffer from inconsistent theoretical guidance and a bias toward unimodal indicators. This paper therefore reviews the theory of online learning metacognition, constructs a digital model of online learning metacognitive ability based on multimodal data, and applies it in regular online courses to validate model-based assessment of this ability. The results show that: (1) the multimodal-data-based digital model provides a foundation for accurately assessing online learning metacognitive ability; (2) analyzing each modality with deep learning models and combining them through multimodal decision-level fusion yields assessment results that are both comprehensive and interpretable.
Keywords: online learning metacognitive ability; digital model; multimodal assessment; regular teaching; deep learning
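Decision-level fusion itself is simple to sketch: each modality's model votes with class probabilities, and the decisions are averaged. The modality names, class count, and weights below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def decision_level_fusion(modal_probs, weights=None):
    """Decision-level fusion: each modality's model outputs class
    probabilities; fuse by a (weighted) average of those decisions."""
    probs = np.stack(modal_probs)            # (n_modalities, n_classes)
    if weights is None:
        weights = np.ones(len(modal_probs)) / len(modal_probs)
    fused = np.average(probs, axis=0, weights=weights)
    return fused.argmax(), fused

# e.g., three hypothetical per-modality models rating metacognition level:
pred, fused = decision_level_fusion([
    np.array([0.2, 0.5, 0.3]),   # behavioral log model
    np.array([0.1, 0.7, 0.2]),   # facial video model
    np.array([0.3, 0.4, 0.3]),   # discussion text model
])
```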
19. A Survey on the Evolution and Applications of AI Knowledge Distillation (Cited: 1)
Authors: Mao Kebiao, Dai Wang, Guo Zhonghua, Sun Xuehong, Xiao Liurui. Journal of Agricultural Big Data, 2025, No. 2, pp. 144-154 (11 pages)
Knowledge distillation (KD) in artificial intelligence (AI) achieves model lightweighting by constructing a teacher-student framework and has become a key technique for resolving the performance-efficiency bottleneck of deep learning. From the perspective of algorithmic evolution, this paper systematically analyzes the theoretical framework of knowledge distillation, categorizing knowledge transfer paths into four paradigms: response-based, feature-based, relation-based, and structure-based. It also builds a comparative evaluation framework for dynamic versus static distillation methods. We examine in depth innovative mechanisms such as cross-modal feature alignment, adaptive distillation architectures, and multi-teacher collaborative verification, and analyze hybrid strategies such as progressive knowledge transfer and adversarial distillation. Through empirical analyses in computer vision and natural language processing, we evaluate the technique's practicality in scenarios such as image classification, semantic segmentation, and text generation. In particular, we highlight KD's potential in agriculture and the geosciences, for example enabling efficient deployment for precision agriculture and geospatial analysis in resource-constrained environments. We find that current models commonly suffer from ambiguous knowledge-selection mechanisms and insufficient theoretical interpretability. Accordingly, we discuss the feasibility of frontier directions such as automated distillation systems and multimodal knowledge fusion, offering new technical paths for edge intelligence deployment and privacy-preserving computation, especially for agricultural intelligence and geoscience research.
Keywords: knowledge distillation; model compression; knowledge transfer; dynamic optimization; multimodal learning
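A minimal example of the first (response-based) paradigm is the classic temperature-scaled distillation loss, sketched below in PyTorch; the temperature and mixing weight are common defaults, not values from the survey:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T=4.0, alpha=0.7):
    """Response-based KD: match softened teacher/student distributions
    (KL term, scaled by T^2) while keeping the ordinary supervised loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```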
20. Exploring Innovative Paths for Multimodal Learning Technology in Smart Library Services (Cited: 3)
Authors: Sang Yuanyuan. Journal of Library and Information Science in Agriculture, 2025, No. 3, pp. 42-52 (11 pages)
[Purpose/Significance] As smart libraries enter a new era, multimodal learning technology, which integrates information modalities such as speech, images, and video, has brought revolutionary change to information service systems and greatly improved user interaction experience. By examining the current applications and prospects of multimodal technology in smart libraries, this paper aims to provide theoretical support and practical guidance for their innovative transformation toward a more intelligent future. [Methods/Process] The study reviews the theoretical origins and interdisciplinary development of multimodal learning technology and analyzes its key application scenarios in smart libraries, including intelligent guided tours, intelligent question-answering systems, intelligent support for user education, and immersive reading. Drawing on relevant cases, it details how multimodal interaction technology improves library service effectiveness and meets personalized needs, and discusses the bottlenecks and challenges in current applications. [Results/Conclusions] Multimodal technology has markedly improved the precision and interactivity of smart library services and optimized user experience. However, its adoption still faces practical problems such as data privacy protection, high technology costs, and uneven user acceptance. The paper proposes a set of development strategies, including improving the technical framework, optimizing user experience, strengthening human-machine collaboration, and attending to ethical considerations, to support the full application of multimodal technology in smart libraries and their intelligent transformation.
Keywords: multimodal learning; smart libraries; smart services; path innovation; multimodal large models; future learning centers