Journal Articles
15 articles found
1. Medical multimodal large language models: A systematic review [Cited by: 1]
Authors: Yuan Hu, Chenhan Xu, Bo Lin, Weibin Yang, Yuan Yan Tang. Intelligent Oncology, 2025, Issue 4, pp. 308-325.
The rapid advancement of artificial intelligence (AI) has ushered in a new era of medical multimodal large language models (MLLMs), which integrate diverse data modalities such as text, imaging, physiological signals, and genomics to enhance clinical decision-making. This systematic review explores the core methodologies and applied research frontiers of medical MLLMs, focusing on their architecture, training methods, evaluation techniques, and applications. We highlight the transformative potential of MLLMs in achieving cross-modal semantic alignment, medical knowledge integration, and robust clinical reasoning. Despite their promise, challenges such as data heterogeneity, hallucination, and computational efficiency persist. By reviewing state-of-the-art solutions and future directions, this paper provides a comprehensive technical guide for developing reliable and interpretable medical MLLMs, ultimately aiming to bridge the gap between AI and clinical practice.
Keywords: multimodal large language model; hallucination; medical multimodal dataset; clinical evaluation
2. DeepGut: A collaborative multimodal large language model framework for digestive disease assisted diagnosis and treatment
Authors: Xiao-Han Wan, Mei-Xia Liu, Yan Zhang, Guan-Jun Kou, Lei-Qi Xu, Han Liu, Xiao-Yun Yang, Xiu-Li Zuo, Yan-Qing Li. World Journal of Gastroenterology, 2025, Issue 31, pp. 92-100.
BACKGROUND: Gastrointestinal diseases have complex etiologies and clinical presentations. An accurate diagnosis requires physicians to integrate diverse information, including medical history, laboratory test results, and imaging findings. Existing artificial intelligence-assisted diagnostic tools are limited to single-modality information, resulting in recommendations that are often incomplete and may be associated with clinical or legal risks. AIM: To develop and evaluate a collaborative multimodal large language model (LLM) framework for clinical decision-making in digestive diseases. METHODS: In this observational study, DeepGut, a multimodal LLM collaborative diagnostic framework, was developed to integrate four distinct large models into a four-tiered structure. The framework sequentially accomplishes multimodal information extraction, logical "chain" construction, diagnostic and treatment suggestion generation, and risk analysis. The model was evaluated using objective metrics, which assess the reliability and comprehensiveness of model-generated results, and subjective expert opinions, which examine the effectiveness of the framework in assisting physicians. RESULTS: The diagnostic and treatment recommendations generated by the DeepGut framework achieved exceptional performance, with a diagnostic accuracy of 97.8%, diagnostic completeness of 93.9%, treatment plan accuracy of 95.2%, and treatment plan completeness of 98.0%, significantly surpassing the capabilities of single-modal LLM-based diagnostic tools. Experts evaluating the framework commended the completeness, relevance, and logical coherence of its outputs. However, the collaborative multimodal LLM approach resulted in increased input and output token counts, leading to higher computational costs and extended diagnostic times. CONCLUSION: The framework achieves successful integration of multimodal diagnostic data, demonstrating enhanced performance enabled by multimodal LLM collaboration, which opens new horizons for the clinical application of artificial intelligence-assisted technology.
Keywords: gastrointestinal diseases; artificial intelligence-assisted diagnosis and treatment; multimodal large language model; multiple large language model collaboration; DeepGut
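The four-tiered, sequential flow the abstract describes (extraction → logical "chain" construction → suggestion generation → risk analysis) can be sketched as a simple staged pipeline. This is an illustrative assumption, not DeepGut's implementation: all class and function names are hypothetical, and each tier stands in for what would be a call to a separate large model.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a four-tier collaborative pipeline in the spirit of
# DeepGut: each tier stands in for a separate model call, run sequentially.

@dataclass
class CaseRecord:
    history: str
    labs: str
    imaging: str
    extracted: dict = field(default_factory=dict)
    chain: list = field(default_factory=list)
    suggestion: str = ""
    risks: list = field(default_factory=list)

def tier1_extract(case: CaseRecord) -> CaseRecord:
    # A multimodal model would parse free text and images; here we just bucket inputs.
    case.extracted = {"history": case.history, "labs": case.labs, "imaging": case.imaging}
    return case

def tier2_build_chain(case: CaseRecord) -> CaseRecord:
    # Link extracted findings into an explicit reasoning "chain".
    case.chain = [f"{k}: {v}" for k, v in case.extracted.items() if v]
    return case

def tier3_suggest(case: CaseRecord) -> CaseRecord:
    case.suggestion = "Suggested work-up based on: " + "; ".join(case.chain)
    return case

def tier4_risk(case: CaseRecord) -> CaseRecord:
    # Final tier flags output for review rather than asserting safety.
    case.risks = ["recommendation requires physician confirmation"]
    return case

def run_pipeline(case: CaseRecord, tiers=None) -> CaseRecord:
    for tier in tiers or [tier1_extract, tier2_build_chain, tier3_suggest, tier4_risk]:
        case = tier(case)
    return case
```

The sequential design mirrors the paper's observation that collaboration raises token counts: each tier's output becomes part of the next tier's input.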
3. Challenges and Optimization of Multimodal Large Language Models for Tree Falling Scenarios
Authors: Lei Feng, Yicheng Huang, Chunjie Sheng, Yuxing Shi, Jianhong Jin, Yun Xu, Yuzhou Du, Sihao Miao. 国际计算机前沿大会会议论文集, 2025, Issue 1, pp. 533-543.
This study proposed an optimization method for multimodal large language model (MLLM) reasoning based on structured chain of thought, aiming to enhance visual decision-making capability in tree falling scenarios. The research first analyzed challenges faced by existing MLLMs when processing complicated visual scenes, including insufficient reasoning performance and low integration efficiency with other systems. To address these issues, an innovative structured chain of thought approach was introduced, which significantly improved the reasoning accuracy of the model in handling complex visual scenarios. To validate the proposed method, a specialized dataset focusing on tree falling scenarios in social governance was constructed, and a practical agent workflow was designed based on this dataset. Experimental results demonstrated that the proposed approach achieved better performance in real-world applications. The findings provide a reliable and efficient technical solution to visual decision-making in social governance.
Keywords: multimodal large language model; MLLM; social governance; AI agent
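A structured chain-of-thought approach like the one described typically fixes an ordered set of reasoning steps that the model must answer before deciding. The sketch below shows one way such a prompt could be assembled; the step wording and schema are assumptions for illustration, not the paper's actual prompt.

```python
# Illustrative structured chain-of-thought prompt for a tree-falling visual
# decision task. Step texts and formatting are hypothetical.

COT_STEPS = [
    "Describe the scene (trees, people, vehicles, infrastructure).",
    "Identify whether any tree is fallen, leaning, or damaged.",
    "Assess risk to nearby people or property.",
    "Recommend an action (no action / monitor / dispatch crew).",
]

def build_structured_cot_prompt(scene_description: str) -> str:
    steps = "\n".join(f"Step {i}: {s}" for i, s in enumerate(COT_STEPS, start=1))
    return (
        "You are analyzing an image for a social-governance workflow.\n"
        f"Image context: {scene_description}\n"
        "Answer each step in order before giving a final decision.\n"
        f"{steps}\nFinal decision:"
    )
```

Fixing the step order is what makes the output machine-parseable, which also eases the integration-with-other-systems problem the abstract mentions.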
4. SHIELD: An evaluation benchmark for face spoofing and forgery detection with multimodal large language models
Authors: Yichen Shi, Yuhao Gao, Yingxin Lai, Hongyang Wang, Jun Feng, Lei He, Jun Wan, Changsheng Chen, Zitong Yu, Xiaochun Cao. Visual Intelligence, 2025, Issue 1, pp. 113-137.
Multimodal large language models (MLLMs) have demonstrated strong capabilities in vision-related tasks, capitalizing on their visual semantic comprehension and reasoning capabilities. However, their ability to detect subtle visual spoofing and forgery clues in face attack detection tasks remains underexplored. In this paper, we introduce a benchmark, SHIELD, to evaluate MLLMs for face spoofing and forgery detection. Specifically, we design true/false and multiple-choice questions to assess MLLM performance on multimodal face data across two tasks. For the face anti-spoofing task, we evaluate three modalities (i.e., RGB, infrared, and depth) under six attack types. For the face forgery detection task, we evaluate GAN-based and diffusion-based data, incorporating visual and acoustic modalities. We conduct zero-shot and few-shot evaluations in standard and chain of thought (COT) settings. Additionally, we propose a novel multi-attribute chain of thought (MA-COT) paradigm for describing and judging various task-specific and task-irrelevant attributes of face images. The findings of this study demonstrate that MLLMs exhibit strong potential for addressing the challenges associated with the security of facial recognition technology applications.
Keywords: face anti-spoofing (FAS); face forgery detection; multimodal large language models (MLLMs); multi-attribute chain of thought (MA-COT)
5. Evaluating Sex and Age Biases in Multimodal Large Language Models for Skin Disease Identification from Dermatoscopic Images
Authors: Zhiyu Wan, Yuhang Guo, Shunxing Bao, Qian Wang, Bradley A. Malin. Health Data Science, 2025, Issue 1, pp. 225-238.
Background: Multimodal large language models (LLMs) have shown potential in various health-related fields. However, many healthcare studies have raised concerns about the reliability and biases of LLMs in healthcare applications. Methods: To explore the practical application of multimodal LLMs in skin disease identification, and to evaluate sex and age biases, we tested the performance of 2 popular multimodal LLMs, ChatGPT-4 and LLaVA-1.6, across diverse sex and age groups using a subset of a large dermatoscopic dataset containing around 10,000 images and 3 skin diseases (melanoma, melanocytic nevi, and benign keratosis-like lesions). Results: In comparison to 3 deep learning models based on convolutional neural networks (CNNs) (VGG16, ResNet50, and Model Derm) and one vision transformer model (Swin-B), we found that ChatGPT-4 and LLaVA-1.6 demonstrated overall accuracies that were 3% and 23% higher (and F1-scores that were 4% and 34% higher), respectively, than the best-performing CNN-based baseline, while maintaining accuracies that were 38% and 26% lower (and F1-scores that were 38% and 19% lower), respectively, than Swin-B. Meanwhile, ChatGPT-4 is generally unbiased in identifying these skin diseases across sex and age groups, while LLaVA-1.6 is generally unbiased across age groups, in contrast to Swin-B, which is biased in identifying melanocytic nevi. Conclusions: This study suggests the usefulness and fairness of LLMs in dermatological applications, aiding physicians and practitioners with diagnostic recommendations and patient screening. To further verify and evaluate the reliability and fairness of LLMs in healthcare, experiments using larger and more diverse datasets need to be performed in the future.
Keywords: dermatoscopic dataset; healthcare applications; skin disease identification; multimodal large language models; sex biases; age biases; large language models (LLMs)
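The core of a bias evaluation like this one is comparing accuracy within each demographic subgroup. A minimal sketch, with toy data that is not from the study:

```python
from collections import defaultdict

# Minimal subgroup-fairness check: per-group accuracy plus the largest gap
# between groups. The records below are toy examples, not study data.

def subgroup_accuracy(records):
    """records: iterable of (group, y_true, y_pred) triples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        hits[group] += int(y_true == y_pred)
    return {g: hits[g] / totals[g] for g in totals}

def max_accuracy_gap(acc_by_group):
    values = list(acc_by_group.values())
    return max(values) - min(values)

records = [
    ("male", "melanoma", "melanoma"),
    ("male", "nevus", "melanoma"),
    ("female", "melanoma", "melanoma"),
    ("female", "nevus", "nevus"),
]
acc = subgroup_accuracy(records)  # {"male": 0.5, "female": 1.0}
```

A model would be called "generally unbiased" when this gap stays small (and statistically insignificant) across both sex and age groupings.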
6. Foundation models: Insights and implications for gastrointestinal cancer
Authors: Lei Shi, Rui Huang, Li-Ling Zhao, An-Jie Guo. World Journal of Gastroenterology, 2025, Issue 47, pp. 7-34.
Gastrointestinal (GI) cancers represent a major global health concern due to their high incidence and mortality rates. Foundation models (FMs), also referred to as large models, represent a novel class of artificial intelligence technologies that have demonstrated considerable potential in addressing these challenges. These models encompass large language models (LLMs), vision FMs (VFMs), and multimodal LLMs (MLLMs), all of which utilize transformer architectures and self-supervised pre-training on extensive unlabeled datasets to achieve robust cross-domain generalization. This review delineates the principal applications of these models: LLMs facilitate the structuring of clinical narratives, extraction of insights from medical records, and enhancement of physician-patient communication; VFMs are employed in the analysis of endoscopic, radiological, and pathological images for lesion detection and staging; MLLMs integrate heterogeneous data modalities, including imaging, textual information, and genomic data, to support diagnostic processes, treatment prediction, and prognostic evaluation. Despite these promising developments, several challenges remain, such as the need for data standardization, limited diversity within training datasets, substantial computational resource requirements, and ethical-legal concerns. In conclusion, FMs exhibit significant potential to advance research and clinical management of GI cancers. Future research efforts should prioritize the refinement of these models, promote international collaborations, and adopt interdisciplinary approaches. Such a comprehensive strategy is essential to fully harness the capabilities of FMs, driving substantial progress in the fight against GI malignancies.
Keywords: foundation models; gastrointestinal cancers; large language models; vision foundation models; multimodal large language models
7. Video-Bench: A comprehensive benchmark and toolkit for evaluating video-based large language models
Authors: Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li Yuan. Computational Visual Media, 2026, Issue 1, pp. 71-84.
Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries. In pursuit of the ultimate goal of achieving artificial general intelligence, a truly intelligent Video-LLM model should not only see and understand the surroundings, but also possess human-level commonsense, and make well-informed decisions for users. To guide the development of such a model, the establishment of a robust and comprehensive evaluation system becomes crucial. To this end, this paper proposes Video-Bench, a new comprehensive benchmark along with a toolkit specifically designed for evaluating Video-LLMs. The benchmark comprises 10 meticulously crafted tasks, evaluating the capabilities of Video-LLMs across three distinct levels: video-exclusive understanding, prior knowledge-based question-answering, and comprehension and decision-making. In addition, we introduce an automatic toolkit tailored to process model outputs for various tasks, facilitating the calculation of metrics and conveniently generating final scores. We evaluate 9 representative Video-LLMs using Video-Bench. The findings reveal that current Video-LLMs still fall considerably short of achieving human-like comprehension and analysis of real-world video, and offer valuable insights for future research directions. The benchmark and toolkit are available at https://github.com/PKU-YuanGroup/Video-Bench.
Keywords: multimodal large language model; vision question answering; benchmark; video processing
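An "automatic toolkit to process model outputs" for multiple-choice tasks usually boils down to mapping a model's free-form reply onto an option key, then scoring. The matching heuristics below are an illustrative assumption, not Video-Bench's actual rules:

```python
import re

# Hedged sketch of answer matching for multiple-choice evaluation: try an
# explicit option letter first, then fall back to option-text substring match.

def match_choice(reply: str, choices: dict):
    """choices: e.g. {"A": "a dog", "B": "a cat"}. Returns the matched key or None."""
    up = reply.strip().upper()
    m = re.search(r"\(([A-D])\)", up) or re.match(r"([A-D])\b", up)
    if m and m.group(1) in choices:
        return m.group(1)
    lowered = reply.lower()
    for key, text in choices.items():
        if text.lower() in lowered:
            return key
    return None

def accuracy(items):
    """items: list of (reply, choices, gold_key) triples."""
    correct = sum(match_choice(r, c) == g for r, c, g in items)
    return correct / len(items)
```

Centralizing this matching is what makes scores comparable across models whose output formats differ.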
8. CAPGen: An MLLM-Based Framework Integrated with Iterative Optimization Mechanism for Cultural Artifacts Poster Generation
Authors: Qianqian Hu, Chuhan Li, Mohan Zhang, Fang Liu. Computers, Materials & Continua, 2026, Issue 1, pp. 494-510.
Due to the digital transformation trend among cultural institutions and the substantial influence of social media platforms, the demand for visual communication keeps increasing for promoting traditional cultural artifacts online. As an effective medium, posters serve to attract public attention and facilitate broader engagement with cultural artifacts. However, existing poster generation methods mainly rely on fixed templates and manual design, which limits their scalability and adaptability to the diverse visual and semantic features of the artifacts. Therefore, we propose CAPGen, an automated aesthetic Cultural Artifacts Poster Generation framework built on a Multimodal Large Language Model (MLLM) with integrated iterative optimization. During our research, we collaborated with designers to define principles of graphic design for cultural artifact posters, which guide the MLLM in generating layout parameters. We then rendered these parameters into posters. Finally, we refined the posters using an MLLM integrated with a multi-round iterative optimization mechanism. Qualitative results show that CAPGen consistently outperforms baseline methods in both visual quality and aesthetic performance. Furthermore, ablation studies indicate that the prompt, iterative optimization mechanism, and design principles significantly enhance the effectiveness of poster generation.
Keywords: aesthetic poster generation; prompt engineering; multimodal large language models; iterative optimization; design principles
9. Current Trends and Future Prospects of Large-Scale Foundation Model in K-12 Education [Cited by: 1]
Authors: Qiannan Zhu, Mei Wang, Ting Zhang, Hua Huang. Frontiers of Digital Education, 2025, Issue 2, pp. 9-31.
The rapid advancement of artificial intelligence has significantly impacted education, with large-scale foundation models (LFMs) emerging as transformative tools. While LFMs have demonstrated exceptional performance across diverse domains, their integration into K-12 education remains in its early stages, requiring alignment with pedagogical principles, cognitive development, and curriculum standards. This paper provides a comprehensive technological review of LFM applications in K-12 education, examining current workflows, challenges, and future opportunities. We explore how LFMs facilitate personalized learning, teacher-student collaboration, and automated assessment while highlighting critical issues such as motivation, engagement, and age-appropriate instructional strategies. By analyzing global developments, this study offers valuable insights for educators seeking to optimize AI-driven teaching methods and for students leveraging AI for self-directed learning. Our findings aim to inform future research and drive innovation in educational AI, ensuring the effective and ethical integration of LFMs into the evolving K-12 educational landscape.
Keywords: large-scale foundation models (LFMs); multimodal large language model; large language model; K-12 education
10. Large investment model
Authors: Jian Guo, Heung-Yeung Shum. Frontiers of Information Technology & Electronic Engineering, 2025, Issue 10, pp. 1771-1792.
Traditional quantitative investment research is encountering diminishing returns alongside rising labor and time costs. To overcome these challenges, we introduce the large investment model (LIM), a novel research paradigm designed to enhance both performance and efficiency at scale. LIM employs end-to-end learning and universal modeling to create an upstream foundation model, which is capable of autonomously learning comprehensive signal patterns from diverse financial data spanning multiple exchanges, instruments, and frequencies. These "global patterns" are subsequently transferred to downstream strategy modeling, optimizing performance for specific tasks. We detail the system architecture design of LIM, address the technical challenges inherent in this approach, and outline potential directions for future research.
Keywords: artificial general intelligence; end-to-end; large investment model; quantitative investment; foundation model; multimodal large language model
11. Federated Services: A Smart Service Ecology With Federated Security for Aligned Data Supply and Scenario-Oriented Demands
Authors: Xiaofeng Jia, Juanjuan Li, Shouwen Wang, Hongwei Qi, Fei-Yue Wang, Rui Qin, Min Zhang, Xiaolong Liang. IEEE/CAA Journal of Automatica Sinica, 2025, Issue 5, pp. 925-936.
This paper introduces federated services as a smart service ecology with federated security to align distributed data supply with diversified service demands spanning digital and societal contexts. It presents comprehensive research on the theoretical foundation and technical system of federated services, aiming to advance our understanding and implementation of this novel service paradigm. First, a thorough examination of the characteristics of federated security within federated services is conducted. Then, a five-layer technical framework is formulated under a decentralized intelligent architecture, ensuring secure, agile, and adaptable service provision. On this basis, the operational mechanisms underlying data federation and service confederation are analyzed, with emphasis on the smart supply-demand matching model. Furthermore, a scenario-oriented taxonomy of federated services accompanied by illustrative examples is proposed. Our work offers actionable insights and a roadmap for realizing and advancing federated services, contributing to the refinement and wider adoption of this transformative service paradigm in the digital era.
Keywords: decentralized autonomous organizations and operations; decentralized physical infrastructure networks; federated security; federated services; multimodal large language models; smart contracts
12. CLIP4Video-Sampling: Global Semantics-Guided Multi-Granularity Frame Sampling for Video-Text Retrieval
Authors: Tao Zhang, Yu Zhang. Journal of Computer and Communications, 2024, Issue 11, pp. 26-36.
Video-text retrieval (VTR) is an essential task in multimodal learning, aiming to bridge the semantic gap between visual and textual data. Effective video frame sampling plays a crucial role in improving retrieval performance, as it determines the quality of the visual content representation. Traditional sampling methods, such as uniform sampling and optical flow-based techniques, often fail to capture the full semantic range of videos, leading to redundancy and inefficiencies. In this work, we propose CLIP4Video-Sampling, a global semantics-guided multi-granularity frame sampling strategy designed to optimize both computational efficiency and retrieval accuracy. By integrating multi-scale global and local temporal sampling and leveraging the CLIP (Contrastive Language-Image Pre-training) model's powerful feature extraction capabilities, our method significantly outperforms existing approaches in both zero-shot and fine-tuned video-text retrieval tasks on popular datasets. CLIP4Video-Sampling reduces redundancy, ensures keyframe coverage, and serves as an adaptable pre-processing module for multimodal models.
Keywords: video sampling; multimodal large language model; text-video retrieval; CLIP model
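The redundancy-reduction idea (coarse global sampling followed by dropping near-duplicate frames) can be sketched with toy feature vectors. In the real method, frame features would come from a CLIP image encoder; here a frame "feature" is just a list of floats, and the stride and similarity threshold are illustrative assumptions:

```python
import math

# Toy sketch of global + local frame sampling: pass 1 takes a uniform (global)
# stride; pass 2 drops frames too similar to the last kept frame (local dedup).

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def sample_frames(features, global_stride=2, sim_threshold=0.98):
    """features: list of per-frame feature vectors. Returns kept frame indices."""
    candidates = list(range(0, len(features), global_stride))
    kept = []
    for idx in candidates:
        if not kept or cosine(features[idx], features[kept[-1]]) < sim_threshold:
            kept.append(idx)
    return kept
```

This keeps the first frame of each visually distinct segment while bounding the total frame count, which is the efficiency/coverage trade-off the abstract targets.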
13. Frontiers in Intelligent Colonoscopy
Authors: Ge-Peng Ji, Jingyi Liu, Peng Xu, Nick Barnes, Fahad Shahbaz Khan, Salman Khan, Deng-Ping Fan. Machine Intelligence Research, 2026, Issue 1, pp. 70-114.
Colonoscopy is currently one of the most sensitive screening methods for colorectal cancer. This study investigates the frontiers of intelligent colonoscopy techniques and their prospective implications for multimodal medical applications. With this goal, we begin by assessing the current data-centric and model-centric landscapes through four tasks for colonoscopic scene perception, including classification, detection, segmentation, and vision-language understanding. Our assessment reveals domain-specific challenges and underscores the need for further multimodal research in colonoscopy. To address these gaps, we establish three foundational initiatives: a large-scale multimodal instruction tuning dataset, ColonINST; a colonoscopy-designed multimodal language model, ColonGPT; and a multimodal benchmark. To facilitate continuous advancements in this rapidly evolving field, we provide a public website for the latest updates: https://github.com/ai4colonoscopy/IntelliScope.
Keywords: colonoscopy survey; polyp segmentation; multimodal large language model; multimodal benchmark; healthcare AI
14. When SAM2 meets video camouflaged object segmentation: a comprehensive evaluation and adaptation
Authors: Yuli Zhou, Guolei Sun, Yawei Li, Guo-Sen Xie, Luca Benini, Ender Konukoglu. Visual Intelligence, 2025, Issue 1, pp. 138-151.
This study investigates the application and performance of the Segment Anything Model 2 (SAM2) in the challenging task of video camouflaged object segmentation (VCOS). VCOS involves detecting objects that blend seamlessly into their surroundings in videos due to similar colors and textures and poor lighting conditions. Compared to objects in normal scenes, camouflaged objects are much more difficult to detect. SAM2, a video foundation model, has shown potential in various tasks. However, its effectiveness in dynamic camouflaged scenarios remains under-explored. We present a comprehensive study on SAM2's ability in VCOS. First, we assess SAM2's performance on camouflaged video datasets using different models and prompts (click, box, and mask). Second, we explore the integration of SAM2 with existing multimodal large language models (MLLMs) and VCOS methods. Third, we specifically adapt SAM2 by fine-tuning it on the video camouflaged dataset. Our comprehensive experiments demonstrate that SAM2 has excellent zero-shot ability to detect camouflaged objects in videos. We also show that this ability can be further improved by specifically adjusting SAM2's parameters for VCOS.
Keywords: multimodal large language model; prompt engineering; SAM2; video camouflaged object segmentation
15. TimeJudge: empowering video-LLMs as zero-shot judges for temporal consistency in video captions
Authors: Yangliu Hu, Zikai Song, Junqing Yu, Yiping Phoebe Chen, Wei Yang. Frontiers of Information Technology & Electronic Engineering, 2025, Issue 11, pp. 2204-2214.
Video large language models (video-LLMs) have demonstrated impressive capabilities in multimodal understanding, but their potential as zero-shot evaluators for temporal consistency in video captions remains underexplored. Existing methods notably underperform in detecting critical temporal errors, such as missing, hallucinated, or misordered actions. To address this gap, we introduce two key contributions. (1) TimeJudge: a novel zero-shot framework that recasts temporal error detection as answering calibrated binary question pairs. It incorporates modality-sensitive confidence calibration and uses consistency-weighted voting for robust prediction aggregation. (2) TEDBench: a rigorously constructed benchmark featuring videos across four distinct complexity levels, specifically designed with fine-grained temporal error annotations to evaluate video-LLM performance on this task. Through a comprehensive evaluation of multiple state-of-the-art video-LLMs on TEDBench, we demonstrate that TimeJudge consistently yields substantial gains in terms of recall and F1-score without requiring any task-specific fine-tuning. Our approach provides a generalizable, scalable, and training-free solution for enhancing the temporal error detection capabilities of video-LLMs.
Keywords: video large language model (video-LLM); multimodal large language model (MLLM); MLLM-as-a-Judge; video caption; benchmark
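The aggregation step ("calibrated binary question pairs" combined by "consistency-weighted voting") can be sketched as follows. Here each pair asks the same temporal fact positively and negatively, so a well-calibrated pair should give opposite answers; the specific weighting scheme below is an assumption for illustration, not TimeJudge's actual formula:

```python
# Sketch of consistency-weighted voting over binary question pairs. Each pair
# is (pos_answer, neg_answer): the positive question affirms the caption's
# temporal claim, the negative one denies it. Consistent pairs (answers that
# disagree, as they should) get full weight; inconsistent pairs are down-weighted.

def vote(pairs, inconsistent_weight=0.25):
    """pairs: list of (pos_answer, neg_answer) booleans from the judge model.
    Returns True if the caption is judged temporally consistent."""
    score, total = 0.0, 0.0
    for pos, neg in pairs:
        consistent = pos != neg  # a calibrated pair should give opposite answers
        weight = 1.0 if consistent else inconsistent_weight
        total += weight
        score += weight * (1.0 if pos else 0.0)
    return total > 0 and score / total >= 0.5
```

Down-weighting self-contradictory pairs is what makes the aggregate robust to individual noisy judgments, which is the stated motivation for the voting step.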