Journal Articles
33 articles found
1. From Algorithm to Expert: RLHF-Guided Vision-Language Model for 3D-EEM Fluorescence Spectroscopy Matching
Authors: Chenglong Lu, Jiehui Li, Tonglin Chen, Changhua Zhou, Yixin Fan, Xinlin Ren, Ziyi Ju, Wei Wang. 《Computers, Materials & Continua》, 2026, Issue 5, pp. 1883-1900 (18 pages)
Existing methods for tracing water pollution sources typically integrate three-dimensional excitation-emission matrix (3D-EEM) fluorescence spectroscopy with similarity-based matching algorithms. However, these approaches exhibit high error rates in borderline cases and necessitate expert manual review, which limits scalability and introduces inconsistencies between algorithmic outputs and expert judgment. To address these limitations, we propose a large vision-language model (VLM) designed as an "expert agent" to automatically refine similarity scores, ensuring alignment with expert decisions and overcoming key application bottlenecks. The model consists of two core components: (1) a rule-based similarity calculation module that generates initial spectral similarity scores, and (2) a pre-trained large vision-language model fine-tuned via supervised learning and reinforcement learning with human feedback (RLHF) to emulate expert assessments. To facilitate training and evaluation, we introduce two expert-annotated datasets, Spec1k and SpecReason, which capture both quantitative corrections and qualitative reasoning patterns, allowing the model to emulate expert decision-making processes. Experimental results demonstrate that our method achieves 81.45% source attribution accuracy, 38.24% higher than rule-based and machine learning baselines. Real-world deployment further validates its effectiveness.
Keywords: vision-language model; reinforcement learning with human feedback; pollution source tracing; 3D fluorescence spectroscopy
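As a rough illustration of the kind of rule-based similarity module the abstract describes, the sketch below scores one sample EEM against a small reference library with plain cosine similarity. The array shapes, the `eem_similarity` helper, and the source names are hypothetical, not taken from the paper.

```python
# Illustrative sketch (not the authors' code): a minimal rule-based similarity
# score between two 3D-EEM fluorescence spectra, the kind of initial score a
# rule-based module might produce before VLM refinement.
import numpy as np

def eem_similarity(eem_a: np.ndarray, eem_b: np.ndarray) -> float:
    """Cosine similarity between two excitation-emission matrices of the same shape."""
    a = eem_a.ravel().astype(float)
    b = eem_b.ravel().astype(float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical usage: compare a water sample against a library of known sources.
sample = np.random.rand(40, 60)                      # placeholder EEM of the sample
library = {"textile_plant": np.random.rand(40, 60),
           "domestic_sewage": np.random.rand(40, 60)}
scores = {name: eem_similarity(sample, ref) for name, ref in library.items()}
print(max(scores, key=scores.get), scores)
```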
2. The Synergy of Seeing and Saying: Revolutionary Advances in Multi-modality Medical Vision-Language Large Models
Authors: Xiang LI, Yu SUN, Jia LIN, Like LI, Ting FENG, Shen YIN. 《Artificial Intelligence Science and Engineering》, 2025, Issue 2, pp. 79-97 (19 pages)
The application of visual-language large models in the field of medical health has gradually become a research focus. These models combine image understanding with natural language processing and can simultaneously process multi-modality data such as medical images and medical reports. They can not only recognize images but also understand the semantic relationship between images and texts, effectively integrating medical information and providing strong support for clinical decision-making and disease diagnosis. Visual-language large models perform well on specific medical tasks and also show strong potential and high intelligence as general task models. This paper provides a comprehensive review of visual-language large models in the field of medical health. Specifically, it first introduces the theoretical foundations and technical principles. It then presents specific application scenarios in medical health, including modality fusion, semi-supervised learning, weakly supervised learning, unsupervised learning, cross-domain models, and general models. Finally, challenges including insufficient data, interpretability, and practical deployment are discussed, and four potential future development directions are given in light of these challenges.
Keywords: large language models; vision-language models; medical health; multimodality models
3. VOTI: Jailbreaking Vision-Language Models via Visual Obfuscation and Task Induction
Authors: ZHU Yifan, CHU Zhixuan, REN Kui. 《ZTE Communications》, 2025, Issue 3, pp. 15-26 (12 pages)
In recent years, large vision-language models (VLMs) have achieved significant breakthroughs in cross-modal understanding and generation. However, the safety issues arising from their multimodal interactions have become prominent. VLMs are vulnerable to jailbreak attacks, in which attackers craft carefully designed prompts to bypass safety mechanisms and induce the models to generate harmful content. To address this, we investigate the alignment between visual inputs and task execution, uncovering locality defects and attention biases in VLMs. Based on these findings, we propose VOTI, a novel jailbreak framework leveraging visual obfuscation and task induction. VOTI subtly embeds malicious keywords within neutral image layouts to evade detection and breaks down harmful queries into a sequence of subtasks. This approach disperses malicious intent across modalities, exploiting VLMs' over-reliance on local visual cues and their fragility in multi-step reasoning to bypass global safety mechanisms. Implemented as an automated framework, VOTI integrates large language models as red-team assistants to generate and iteratively optimize jailbreak strategies. Extensive experiments across seven mainstream VLMs demonstrate VOTI's effectiveness, achieving a 73.46% attack success rate on GPT-4o-mini. These results reveal critical vulnerabilities in VLMs, highlighting the urgent need for more robust defenses and improved multimodal alignment.
Keywords: large vision-language models; jailbreak attacks; red teaming; security of large models; safety alignment
4. A Review on Vision-Language-Based Approaches: Challenges and Applications
Authors: Huu-Tuong Ho, Luong Vuong Nguyen, Minh-Tien Pham, Quang-Huy Pham, Quang-Duong Tran, Duong Nguyen Minh Huy, Tri-Hai Nguyen. 《Computers, Materials & Continua》, 2025, Issue 2, pp. 1733-1756 (24 pages)
In multimodal learning, Vision-Language Models (VLMs) have become a critical research focus, enabling the integration of textual and visual data. These models have shown significant promise across various natural language processing tasks, such as visual question answering, and computer vision applications, including image captioning and image-text retrieval, highlighting their adaptability for complex, multimodal datasets. In this work, we review the landscape of Bootstrapping Language-Image Pre-training (BLIP) and other VLM techniques. A comparative analysis is conducted to assess VLMs' strengths, limitations, and applicability across tasks while examining challenges such as scalability, data quality, and fine-tuning complexities. The work concludes by outlining potential future directions in VLM research, focusing on enhancing model interpretability, addressing ethical implications, and advancing multimodal integration in real-world applications.
Keywords: bootstrapping language-image pre-training (BLIP); multimodal learning; vision-language model (VLM); vision-language pre-training (VLP)
5. IQAGPT: computed tomography image quality assessment with vision-language and ChatGPT models
Authors: Zhihao Chen, Bin Hu, Chuang Niu, Tao Chen, Yuxin Li, Hongming Shan, Ge Wang. 《Visual Computing for Industry, Biomedicine, and Art》, 2024, Issue 1, pp. 165-181 (17 pages)
Large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities in various tasks and attracted increasing interest as a natural language interface across many domains. Recently, large vision-language models (VLMs) that learn rich vision-language correlation from image-text pairs, like BLIP-2 and GPT-4, have been intensively investigated. However, despite these developments, the application of LLMs and VLMs in image quality assessment (IQA), particularly in medical imaging, remains unexplored. This is valuable for objective performance evaluation and as a potential supplement to, or even replacement of, radiologists' opinions. To this end, this study introduces IQAGPT, an innovative computed tomography (CT) IQA system that integrates an image-quality captioning VLM with ChatGPT to generate quality scores and textual reports. First, a CT-IQA dataset comprising 1,000 CT slices with diverse quality levels is professionally annotated and compiled for training and evaluation. To better leverage the capabilities of LLMs, the annotated quality scores are converted into semantically rich text descriptions using a prompt template. Second, the image-quality captioning VLM is fine-tuned on the CT-IQA dataset to generate quality descriptions. The captioning model fuses image and text features through cross-modal attention. Third, based on the quality descriptions, users verbally request ChatGPT to rate image-quality scores or produce radiological quality reports. Results demonstrate the feasibility of assessing image quality using LLMs. The proposed IQAGPT outperformed GPT-4 and CLIP-IQA, as well as multitask classification and regression models that rely solely on images.
Keywords: deep learning; medical imaging; image captioning; multimodality; large language model; vision-language model; GPT-4; subjective evaluation
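The abstract mentions converting annotated quality scores into semantically rich text descriptions via a prompt template; the minimal sketch below shows one way such a conversion could look. The score scale, level wordings, and the `score_to_description` helper are assumptions for illustration, not the IQAGPT template itself.

```python
# Illustrative sketch (assumed details, not the IQAGPT implementation): mapping an
# annotated CT image-quality score to an LLM-friendly textual description.
QUALITY_LEVELS = {
    0: "severe artifacts and noise; diagnostic use is not recommended",
    1: "noticeable noise or artifacts that may obscure subtle findings",
    2: "acceptable quality with minor noise",
    3: "good quality with clear anatomical structures",
    4: "excellent quality suitable for confident diagnosis",
}

def score_to_description(score: int, slice_id: str) -> str:
    """Turn a numeric quality score for one CT slice into a textual caption."""
    level = QUALITY_LEVELS.get(score, "unrated quality")
    return f"CT slice {slice_id}: the image shows {level} (quality score {score} of 4)."

print(score_to_description(3, "case012_slice045"))
```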
6. Vision-Language Model-Driven Human-Vehicle Interaction for Autonomous Driving: Status, Challenge, and Innovation
Authors: Rongfeng Zhao, Aimin Du, Mobing Cai, Zhongpan Zhu, Bin He. 《Big Data Mining and Analytics》, 2026, Issue 2, pp. 425-447 (23 pages)
This paper investigates the potential of Vision-Language Models (VLMs) to enhance Human-Vehicle Interaction (HVI) in Autonomous Driving (AD) scenarios, particularly in interactions between vehicles and other traffic participants, with a focus on rationality and safety in external HVI. Leveraging recent advancements in large language models, VLMs demonstrate remarkable capabilities in understanding real-world contexts and have generated significant interest in HVI applications. This paper provides an overview of AD, HVI, and VLMs, along with the historical context of large language model applications in HVI. The HVI discussed herein involves dynamic game processes encompassing perception and decision-making between vehicles and traffic participants, such as pedestrians. Furthermore, we examine the perceptual challenges associated with applying VLMs to HVI and compile relevant datasets. This research fills a gap in the existing literature by systematically analyzing the current status, challenges, and future opportunities of VLM applications in HVI. To advance VLM integration in AD, various implementation strategies are discussed. The findings highlight the potential of VLMs to transform HVI in AD, improving both passenger experience and driving safety. Overall, this study contributes to a comprehensive understanding of VLM applications in HVI and provides insights to guide future research and development.
Keywords: human-vehicle interaction (HVI); large language model (LLM); vision-language large model (VLM); autonomous driving (AD); perception technology
7. A systematic review of vision and vision-language foundation models in ophthalmology
Authors: Kai Jin, Tao Yu, Gui-shuang Ying, Zongyuan Ge, Kelvin Zhenghao Li, Yukun Zhou, Danli Shi, Meng Wang, Polat Goktas, Andrzej Grzybowski. 《Advances in Ophthalmology Practice and Research》, 2026, Issue 1, pp. 8-19 (12 pages)
Background: Vision and vision-language foundation models, a subset of advanced artificial intelligence (AI) frameworks, have shown transformative potential in various medical fields. In ophthalmology, these models, particularly large language models and vision-based models, have demonstrated great potential to improve diagnostic accuracy, enhance treatment planning, and streamline clinical workflows. However, their deployment in ophthalmology has faced several challenges, particularly regarding generalizability and integration into clinical practice. This systematic review aims to summarize the current evidence on the use of vision and vision-language foundation models in ophthalmology, identifying key applications, outcomes, and challenges. Main text: A comprehensive search of PubMed, Web of Science, Scopus, and Google Scholar was conducted to identify studies published between January 2020 and July 2025. Studies were included if they developed or applied foundation models, such as vision-based models and large language models, to clinically relevant ophthalmic applications. A total of 10 studies met the inclusion criteria, covering areas such as retinal diseases, glaucoma, and ocular surface tumors. The primary outcome measures were model performance metrics, integration into clinical workflows, and the clinical utility of the models. Additionally, the review explored the limitations of foundation models, such as reliance on large datasets, computational resources, and interpretability challenges. The majority of studies demonstrated that foundation models could achieve high diagnostic accuracy, with several reports indicating excellent performance comparable to or exceeding that of experienced clinicians. Foundation models achieved accuracy rates of up to 95% for diagnosing retinal diseases, with similar performance for detecting glaucoma progression. Despite promising results, concerns about algorithmic bias, overfitting, and the need for diverse training data were common. High computational demands, EHR compatibility, and the need for clinician validation also posed challenges. Additionally, model interpretability issues hindered clinician trust and adoption. Conclusions: Vision and vision-language foundation models in ophthalmology show significant potential for advancing diagnostic accuracy and treatment strategies, particularly in retinal diseases, glaucoma, and ocular oncology. However, challenges such as data quality, transparency, and ethical considerations must be addressed. Future research should focus on refining model performance, improving interpretability and generalizability, and exploring strategies for integrating these models into routine clinical practice to maximize their impact in clinical ophthalmology.
Keywords: ophthalmology; vision foundation models; vision-language models; artificial intelligence; clinical integration
8. Prediction and analysis of pure loss of stability of ships in following seas considering hydrodynamic lift effects
Authors: 张晋雅, 马宁, 段建文, 史琪琪. 《中国舰船研究》 (PKU Core Journal), 2026, Issue 1, pp. 12-22 (11 pages)
[Objective] This study investigates the influence of hydrodynamic lift effects arising from the combined action of heel and forward speed in following seas, in order to improve the accuracy of numerical prediction of pure loss of stability. [Methods] First, a numerical method based on a unified-theory, six-degree-of-freedom weakly nonlinear motion model is developed; it couples the dynamics of seakeeping and maneuvering and introduces a lift term via the vortex lattice method (VLM) to represent the transverse hydrodynamic forces induced by changes in heel angle and ship speed. Second, because the VLM may overestimate the lift term, computational fluid dynamics (CFD) is used to quantitatively analyze vortex shedding during self-propulsion and to correct the VLM-computed lift term. Finally, predictions of the corrected six-degree-of-freedom motion model are validated against published model test data. [Results] The results show that lift significantly amplifies the roll response as ship speed increases, and that the CFD-corrected lift term effectively reduces the overestimation of the VLM and improves the accuracy with which the six-degree-of-freedom model predicts nonlinear roll motion in following seas. [Conclusions] The study clarifies how lift effects influence nonlinear roll motion in following seas and verifies the effectiveness of the VLM-based, CFD-corrected six-degree-of-freedom weakly nonlinear motion model for roll prediction, providing technical support for ship stability assessment and the formulation of navigation strategies.
Keywords: ship stability; roll motion; weakly nonlinear motion model; six degrees of freedom; vortex lattice method; hydrodynamic lift; computational fluid dynamics
9. Vision-language model-driven object counting
Authors: 曹锋, 张孝文, 岳子杰, 李莉, 史淼晶. 《中国图象图形学报》 (PKU Core Journal), 2026, Issue 1, pp. 289-302 (14 pages)
Objective: Advances in large vision-language models offer new approaches to text-prompt-based object counting. However, existing methods still face two major challenges: category semantic misalignment and limitations of the decoder architecture. The former causes models to misdetect similar backgrounds or irrelevant categories as targets; the latter relies on local feature extraction from a single convolutional neural network (CNN) architecture, which can split global semantics from local details and severely limits counting robustness in complex scenes. To address these problems, we propose the cross-branch cooperative alignment network (CANet). Method: Its core components are: 1) a dual-branch decoder architecture, in which a parallel Transformer branch (modeling global contextual dependencies) and a CNN branch (extracting fine-grained local features) are combined with an information inter-feedback module to achieve cross-branch feature interaction and density map prediction; and 2) a vision-text category alignment loss, which constrains cross-modal alignment between image and text features, forcing the model to distinguish targets from distracting semantics and to detect categories accurately. Results: CANet is compared with four state-of-the-art text-based counting methods on five benchmark datasets. On the FSC-147 (few-shot counting-147) dataset, CANet reduces the mean absolute error (MAE) and root mean squared error (RMSE) on the test set by 1.22 and 8.45, respectively, relative to the second-best model; in cross-validation experiments on the CARPK (car parking lot) and PUCPR+ (Pontifical Catholic University of Parana+) datasets, MAE is reduced by 0.08 and 3.58; and in cross-validation experiments on the SHA (ShanghaiTech part-A) and SHB (ShanghaiTech part-B) datasets, MAE is reduced by 47.0 and 9.8. Extensive ablation experiments on FSC-147 confirm that the proposed components effectively address the two identified problems. Conclusion: The proposed method resolves the two problems faced by existing approaches and yields more accurate counts. It achieves state-of-the-art performance in cross-validation on four datasets, demonstrating the strong generalization ability of CANet in zero-shot object counting.
Keywords: object counting; vision-language model (VLM); text prompt; dual-branch decoder; information inter-feedback
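To make the dual-branch decoder idea concrete, here is a minimal PyTorch sketch that fuses a Transformer branch (global context) with a convolutional branch (local detail) into a density map. The channel sizes, layer choices, and the `DualBranchDensityHead` name are assumptions; this is not the CANet implementation.

```python
# Illustrative sketch (assumed sizes, not CANet): a dual-branch density head that
# combines global Transformer features with local CNN features for counting.
import torch
import torch.nn as nn

class DualBranchDensityHead(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.global_branch = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                                        batch_first=True)
        self.local_branch = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(2 * dim, 1, kernel_size=1)   # 1-channel density map

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        """feat: (batch, dim, H, W) backbone features -> (batch, 1, H, W) density."""
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)            # (b, H*W, dim)
        g = self.global_branch(tokens).transpose(1, 2).reshape(b, c, h, w)
        l = torch.relu(self.local_branch(feat))
        return torch.relu(self.fuse(torch.cat([g, l], dim=1)))

density = DualBranchDensityHead()(torch.randn(2, 256, 32, 32))
print(density.sum(dim=(1, 2, 3)))  # predicted object count per image
```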
10. Multimodal large models for Earth observation: architecture, key technologies, and future prospects
Authors: 许文嘉, 于睿卿, 薛铭浩, 汪雪怡, 张源奔, 魏智威, 张柘, 彭木根, 吴一戎. 《雷达学报(中英文)》 (PKU Core Journal), 2026, Issue 1, pp. 361-386 (26 pages)
In recent years, the combination of artificial intelligence and Earth observation has become a research frontier, and the rapid development of multimodal large language models (MLLMs) has brought new opportunities and challenges for intelligent interpretation. Multimodal Earth-observation large models build a bridging mechanism between large language models and vision models and adopt joint training to deeply fuse multimodal information such as optical imagery, synthetic aperture radar imagery, and text, effectively advancing the intelligent interpretation of Earth observation from shallow semantic matching to high-level world-knowledge understanding. This paper systematically reviews research on multimodal Earth-observation large models to provide a basis for new research directions. Specifically, it first defines the concept of the multimodal Earth-observation large model (EO-MLLM) and traces its development. It then details model architectures, training methods, applicable tasks and their benchmark datasets, and introduces Earth-observation agents. Finally, it discusses the current state of research and future directions for multimodal Earth-observation large models.
Keywords: large language models; multimodal large language models; multimodal Earth-observation large models; vision-language models; Earth-observation agents
11. A Novel Unified Framework for Automated Generation and Multimodal Validation of UML Diagrams
Authors: Van-Viet Nguyen, Huu-Khanh Nguyen, Kim-Son Nguyen, Thi Minh-Hue Luong, Duc-Quang Vu, Trung-Nghia Phung, The-Vinh Nguyen. 《Computer Modeling in Engineering & Sciences》, 2026, Issue 1, pp. 1023-1050 (28 pages)
It remains difficult to automate the creation and validation of Unified Modeling Language (UML) diagrams due to unstructured requirements, limited automated pipelines, and the lack of reliable evaluation methods. This study introduces a cohesive architecture that amalgamates requirement development, UML synthesis, and multimodal validation. First, LLaMA-3.2-1B-Instruct was utilized to generate user-focused requirements. Then, DeepSeek-R1-Distill-Qwen-32B applies its reasoning skills to transform these requirements into PlantUML code. Using this dual-LLM pipeline, we constructed a synthetic dataset of 11,997 UML diagrams spanning six major diagram families. Rendering analysis showed that 89.5% of the generated diagrams compile correctly, while invalid cases were detected automatically. To assess quality, we employed a multimodal scoring method that combines Qwen2.5-VL-3B, LLaMA-3.2-11B-Vision-Instruct, and Aya-Vision-8B, with weights based on MMMU performance. A study with 94 experts revealed strong alignment between automatic and manual evaluations, yielding a Pearson correlation of r = 0.82 and a Fleiss' Kappa of 0.78. This indicates a high degree of concordance between automated metrics and human judgment. Overall, the results demonstrate that our scoring system is effective and that the proposed generation pipeline produces UML diagrams that are both syntactically correct and semantically coherent. More broadly, the system provides a scalable and reproducible foundation for future work in AI-driven software modeling and multimodal verification.
Keywords: automated dataset generation; vision-language models; multimodal validation; software engineering automation; UML code
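Below is a small sketch of the benchmark-weighted multimodal scoring described in the abstract: each vision-language judge's score is weighted by its benchmark (e.g., MMMU) accuracy. The judge names, scores, and weights are placeholders for illustration, not the paper's actual numbers.

```python
# Illustrative sketch (assumed weights and scores, not the paper's implementation):
# combine per-judge diagram ratings into one score, weighted by benchmark accuracy.
def weighted_diagram_score(judge_scores: dict[str, float],
                           benchmark_acc: dict[str, float]) -> float:
    """Weight each judge's 0-10 rating by its normalized benchmark accuracy."""
    total = sum(benchmark_acc[m] for m in judge_scores)
    return sum(judge_scores[m] * benchmark_acc[m] / total for m in judge_scores)

# Hypothetical judges and numbers for illustration only.
scores = {"qwen2.5-vl-3b": 7.5, "llama-3.2-11b-vision": 8.0, "aya-vision-8b": 6.5}
mmmu = {"qwen2.5-vl-3b": 0.47, "llama-3.2-11b-vision": 0.50, "aya-vision-8b": 0.40}
print(round(weighted_diagram_score(scores, mmmu), 2))
```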
12. Research on Automated Game QA Reporting Based on Natural Language Captions
Authors: Jun Myeong Kim, Jang Young Jeong, Shin Jin Kang, Beomjoo Seo. 《Computers, Materials & Continua》, 2026, Issue 2, pp. 1690-1705 (16 pages)
Game Quality Assurance (QA) currently relies heavily on manual testing, a process that is both costly and time-consuming. Traditional script- and log-based automation tools are limited in their ability to detect unpredictable visual bugs, especially those that are context-dependent or graphical in nature. As a result, many issues go unnoticed during manual QA, which reduces overall game quality, degrades the user experience, and creates inefficiencies throughout the development cycle. This study proposes two approaches to address these challenges. The first leverages a Large Language Model (LLM) to directly analyze gameplay videos, detect visual bugs, and automatically generate QA reports in natural language. The second approach introduces a pipeline method: first generating textual descriptions of visual bugs in game videos using the ClipCap model, then using those descriptions as input for the LLM to synthesize QA reports. Through these two multi-faceted approaches, this study evaluates the feasibility of automated game QA systems. To implement this system, we constructed a visual bug database derived from real-world game cases and fine-tuned the ClipCap model for the game video domain. Our proposed approach aims to enhance both efficiency and quality in game development by reducing the burden of manual QA while improving the accuracy of visual bug detection and ensuring consistent, reliable report generation.
Keywords: game quality assurance; vision-language model; automated report system
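The second, pipeline-style approach (captions first, then an LLM report) could be wired up roughly as follows. The `caption_frame` and `llm_complete` names are hypothetical stand-ins for the fine-tuned ClipCap captioner and whatever chat-completion client is used; this is a sketch of the idea, not the paper's pipeline.

```python
# Illustrative sketch (hypothetical helper names): caption gameplay frames, then
# ask an LLM to turn the captions into a QA report listing suspected visual bugs.
from typing import Callable, Iterable

def build_qa_report(frames: Iterable[object],
                    caption_frame: Callable[[object], str],
                    llm_complete: Callable[[str], str]) -> str:
    captions = [caption_frame(f) for f in frames]
    prompt = (
        "You are a game QA assistant. Based on the frame descriptions below, "
        "list suspected visual bugs with frame indices and severity.\n\n"
        + "\n".join(f"[frame {i}] {c}" for i, c in enumerate(captions))
    )
    return llm_complete(prompt)

# Hypothetical usage with stub components.
report = build_qa_report(
    frames=["frame0.png", "frame1.png"],
    caption_frame=lambda f: f"placeholder caption for {f}",
    llm_complete=lambda p: "1. Flickering terrain texture at frame 1 (medium).",
)
print(report)
```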
13. Vision-language model-based human-robot collaboration for smart manufacturing: A state-of-the-art survey (Cited: 1)
Authors: Junming FAN, Yue YIN, Tian WANG, Wenhang DONG, Pai ZHENG, Lihui WANG. 《Frontiers of Engineering Management》, 2025, Issue 1, pp. 177-200 (24 pages)
Human-robot collaboration (HRC) is set to transform the manufacturing paradigm by leveraging the strengths of human flexibility and robot precision. The recent breakthrough of Large Language Models (LLMs) and Vision-Language Models (VLMs) has motivated preliminary explorations and adoption of these models in the smart manufacturing field. However, despite the considerable amount of effort, existing research has mainly focused on individual components without a comprehensive perspective addressing the full potential of VLMs, especially for HRC in smart manufacturing scenarios. To fill this gap, this work offers a systematic review of the latest advancements and applications of VLMs in HRC for smart manufacturing, covering the fundamental architectures and pretraining methodologies of LLMs and VLMs, their applications in robotic task planning, navigation, and manipulation, and their role in enhancing human-robot skill transfer through multimodal data integration. Lastly, the paper discusses current limitations and future research directions in VLM-based HRC, highlighting the trend toward fully realizing the potential of these technologies for smart manufacturing.
Keywords: vision-language models; large language models; human-robot collaboration; smart manufacturing
14. A survey on pre-training and transfer learning for multimodal Vision-Language Models
Authors: Zhongren Liang. 《Advances in Engineering Innovation》, 2025, Issue 7, pp. 135-139 (5 pages)
In recent years, Vision-Language Models (VLMs) have emerged as a significant breakthrough in multimodal learning, demonstrating remarkable progress in tasks such as image-text alignment, image generation, and semantic reasoning. This paper systematically reviews current VLM pre-training methodologies, including contrastive learning and generative paradigms, while providing an in-depth analysis of efficient transfer learning strategies such as prompt tuning, LoRA, and adapter modules. Through representative models like CLIP, BLIP, and GIT, we examine their practical applications in visual grounding, image-text retrieval, visual question answering, affective computing, and embodied AI. Furthermore, we identify persistent challenges in fine-grained semantic modeling, cross-modal reasoning, and cross-lingual transfer. Finally, we envision future trends in unified architectures, multimodal reinforcement learning, and domain adaptation, aiming to provide a systematic reference and technical insights for subsequent research.
Keywords: vision-language models; multimodal learning; pre-training; transfer learning; contrastive learning
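For readers unfamiliar with the contrastive pre-training objective this survey covers, here is a generic CLIP-style InfoNCE loss in PyTorch. It illustrates the idea only and is not code from any surveyed model; the embedding dimension and temperature are arbitrary.

```python
# Illustrative sketch of a CLIP-style contrastive (InfoNCE) objective: matched
# image-text pairs are pulled together, mismatched pairs in the batch pushed apart.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """img_emb, txt_emb: (batch, dim) embeddings of matched image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(img_emb.size(0))        # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```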
15. Multi-granularity prompt-driven wildlife recognition (Cited: 1)
Authors: 李鹏飞, 邵一飞, 裴生雷, 祁清, 贾国庆, 余炼. 《闽南师范大学学报(自然科学版)》, 2025, Issue 2, pp. 35-48 (14 pages)
Existing wildlife recognition methods rely mainly on static datasets and struggle to adapt to dynamic species migration and the recognition of newly added categories, which lowers monitoring efficiency. To address this, we propose multi-granularity prompt-driven wildlife recognition (MGP-WILD). A cloud-side large language model generates hierarchical semantic descriptions (coarse-grained biological taxonomy plus fine-grained morphological features), while edge nodes collaboratively maintain a dynamic knowledge table. Specifically, MGP-WILD uses a large language model to generate multi-granularity text prompts; compared with traditional single-granularity prompting, this work deeply fuses coarse- and fine-grained features through multi-granularity semantic description generation and, combined with the cross-modal alignment capability of vision-language models, achieves accurate zero-shot recognition. Experimental results show substantial improvements on multiple datasets, with particularly strong adaptability in open-set recognition. The system has been deployed for wildlife habitat protection in Qinghai, where a real-scene animal image dataset was built, providing an innovative technical paradigm for biodiversity protection in ecologically fragile regions. The code and part of the dataset will be released on GitHub.
Keywords: wildlife recognition; cloud-edge collaboration; large language model (LLM); vision-language model (VLM); multi-granularity prompts
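A minimal zero-shot sketch in the spirit of the described approach, using an off-the-shelf CLIP checkpoint from Hugging Face: each candidate species gets a coarse taxonomic prompt and a fine morphological prompt, and their scores are averaged. The species list, prompt wording, and image path are made up; this is not the MGP-WILD system.

```python
# Illustrative sketch (not MGP-WILD): zero-shot wildlife recognition with CLIP,
# averaging a coarse (taxonomic) and a fine (morphological) prompt per species.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

species_prompts = {
    "Tibetan antelope": ["a photo of a bovid mammal on a high plateau",
                         "a slender antelope with long thin horns and pale fur"],
    "snow leopard": ["a photo of a large wild cat in a rocky mountain habitat",
                     "a big cat with thick smoky-grey fur and dark rosettes"],
}

image = Image.open("camera_trap.jpg")  # hypothetical camera-trap image
texts = [p for prompts in species_prompts.values() for p in prompts]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    sims = model(**inputs).logits_per_image[0]      # one score per prompt

# Average the coarse + fine prompt scores per species and pick the best match.
scores = {name: sims[i * 2:(i + 1) * 2].mean().item()
          for i, name in enumerate(species_prompts)}
print(max(scores, key=scores.get), scores)
```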
16. A text-to-person retrieval method based on multi-granularity shared semantic center association (Cited: 1)
Authors: 康斌, 陈斌, 王俊杰, 李昱林, 赵军智, 咸伟志. 《计算机应用》 (PKU Core Journal), 2025, Issue 3, pp. 808-814 (7 pages)
Text-based person retrieval aims to identify a specific person using a textual description as the query. Existing state-of-the-art methods typically design multiple alignment mechanisms to establish global and local correspondences between cross-modal data, but ignore the mutual influence among these mechanisms. We therefore propose a multi-granularity shared semantic center association mechanism that explores the reinforcing and inhibiting effects between global and local alignment. First, a multi-granularity cross-alignment module strengthens the interaction between images and sentences and between local regions and word tokens, achieving multi-level alignment of cross-modal data in a joint embedding space. Second, a shared semantic center serves as a learnable semantic hub that, through associations between global and local features, enhances semantic consistency across the different alignment mechanisms and promotes the synergy of global and local features. Within the shared semantic center, local and global cross-modal similarity relations between image and text features are computed, providing a complementary measure from global and local perspectives and maximizing the positive effects among the multiple alignment mechanisms. Finally, experiments on the CUHK-PEDES dataset show that the proposed method improves Rank-1 accuracy by 8.69 percentage points and mean average precision (mAP) by 6.85 percentage points over the baseline on the test set. The method also achieves excellent performance on the ICFG-PEDES and RSTPReid datasets, clearly surpassing all compared methods.
Keywords: vision-language model; person retrieval; global alignment; local alignment; shared semantic center
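To illustrate how global and local cross-modal similarities can complement each other, as the abstract describes for the shared semantic center, here is a toy fusion of an image-sentence cosine similarity with region-token matches. The tensor shapes, the weighting `alpha`, and the function name are assumptions rather than the paper's formulation.

```python
# Illustrative sketch (assumed shapes and weighting, not the paper's model): fuse a
# global image-sentence similarity with an averaged best region-token similarity.
import torch
import torch.nn.functional as F

def retrieval_similarity(img_global: torch.Tensor,   # (dim,)
                         txt_global: torch.Tensor,   # (dim,)
                         img_regions: torch.Tensor,  # (num_regions, dim)
                         txt_tokens: torch.Tensor,   # (num_tokens, dim)
                         alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of global cosine similarity and best region-token matches."""
    g = F.cosine_similarity(img_global, txt_global, dim=0)
    local = F.normalize(img_regions, dim=-1) @ F.normalize(txt_tokens, dim=-1).t()
    l = local.max(dim=0).values.mean()   # best-matching region for each word token
    return alpha * g + (1 - alpha) * l

score = retrieval_similarity(torch.randn(256), torch.randn(256),
                             torch.randn(6, 256), torch.randn(12, 256))
print(score.item())
```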
17. Prompt injection attacks and defenses for vision-language models in military command systems
Authors: 姜碧怡, 宋振波, 陆建峰, 陆辰. 《指挥信息系统与技术》, 2025, Issue 6, pp. 23-29 (7 pages)
Vision-language models (VLMs), drawing on military-domain expertise, play a key role in military command systems and can improve the efficiency of target recognition and battlefield analysis. Because existing VLM safety research focuses mostly on general domains and offers little dedicated exploration of military scenarios, this paper presents a quantitative study of four advanced VLMs (GPT-4o, Claude-3.5-Sonnet, Claude-Sonnet-4, and Qwen2.5-VL), simulating three types of prompt injection attacks: textual, visual, and delayed visual. The experimental results show that all models are vulnerable and that sub-visual prompts significantly increase the probability of harmful outputs. The proposed defense strategies fall into three categories: ethical constraints, a supervisory model, and a hybrid strategy. All three effectively mitigate the attacks, and the hybrid strategy, which reduces the missed-detection rate of threats by 30%-40% for most models, generalizes best.
Keywords: vision-language model (VLM); military command system; prompt injection attack; prompt injection defense
18. CLIP-IML: A novel approach for CLIP-based image manipulation localization
Authors: Xue-Yang Hou, Yilihamu Yaermaimaiti, Shuo-Qi Cheng. 《Journal of Electronic Science and Technology》, 2025, Issue 3, pp. 56-70 (15 pages)
Existing image manipulation localization (IML) techniques require large, densely annotated sets of forged images. This requirement greatly increases labeling costs and limits a model's ability to handle manipulation types that are novel or absent from the training data. To address these issues, we present CLIP-IML, an IML framework that leverages contrastive language-image pre-training (CLIP). A lightweight feature-reconstruction module transforms CLIP token sequences into spatial tensors, after which a compact feature-pyramid network and a multi-scale fusion decoder work together to capture information from fine to coarse levels. We evaluated CLIP-IML on ten public datasets that cover copy-move, splicing, removal, and artificial intelligence (AI)-generated forgeries. The framework raises the average F1-score by 7.85% relative to the strongest recent baselines and secures either first- or second-place performance on every dataset. Ablation studies show that CLIP pre-training, higher-resolution inputs, and the multi-scale decoder each make complementary contributions. Under six common post-processing perturbations, as well as the compression pipelines used by Facebook, Weibo, and WeChat, the performance decline never exceeds 2.2%, confirming strong practical robustness. Moreover, CLIP-IML requires only a few thousand annotated images for training, which markedly reduces data-collection and labeling effort compared with previous methods. All of these results indicate that CLIP-IML generalizes well to image tampering localization across a wide range of tampering scenarios.
Keywords: image manipulation localization; multi-scale feature; pre-trained model; vision-language model; Vision Transformer
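The feature-reconstruction step (turning a CLIP token sequence back into a spatial tensor so convolutional decoder stages can consume it) can be pictured with this short PyTorch sketch. The token count and width correspond to a generic ViT-B/16 example and are assumptions, not details confirmed by the paper.

```python
# Illustrative sketch (assumed dimensions, not the CLIP-IML module): reshape a
# ViT/CLIP patch-token sequence into a spatial feature map for an FPN-style decoder.
import math
import torch

def tokens_to_spatial(tokens: torch.Tensor, drop_cls: bool = True) -> torch.Tensor:
    """tokens: (batch, num_tokens, dim) from a ViT encoder -> (batch, dim, H, W)."""
    if drop_cls:
        tokens = tokens[:, 1:, :]                 # remove the [CLS] token
    b, n, d = tokens.shape
    side = int(math.isqrt(n))
    assert side * side == n, "patch tokens must form a square grid"
    return tokens.transpose(1, 2).reshape(b, d, side, side)

# e.g., a ViT-B/16 on a 224x224 input yields 1 + 14*14 tokens of width 768.
feat = tokens_to_spatial(torch.randn(2, 197, 768))
print(feat.shape)  # torch.Size([2, 768, 14, 14])
```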
19. Artificial intelligence assisted ultrasound report generation
Authors: Jia-Hui Zeng, Kai-Kai Zhao, Ning-Bo Zhao. 《Artificial Intelligence in Medical Imaging》, 2025, Issue 1, pp. 13-20 (8 pages)
Artificial intelligence (AI) assisted ultrasound report generation is a technology that leverages artificial intelligence to convert ultrasound imaging analysis results into structured diagnostic reports. By integrating image recognition and natural language generation models, AI systems can automatically detect and analyze lesions or abnormalities in ultrasound images, generating textual descriptions of diagnostic conclusions (e.g., fatty liver, liver fibrosis, automated BI-RADS grading of breast lesions), imaging findings, and clinical recommendations to form comprehensive reports. This technology enhances the efficiency and accuracy of imaging diagnosis, reduces physicians' workloads, ensures report standardization and consistency, and provides robust support for clinical decision-making. Current state-of-the-art algorithms for automated ultrasound report generation rely primarily on vision-language models, which harness the generalization capabilities of large language models and large vision models through multimodal (language + vision) feature alignment. However, existing approaches inadequately address challenges such as numerical measurement generation, effective utilization of report templates, incorporation of historical reports, learning text-image correlations, and overfitting under limited data conditions. This paper introduces the current state of research on ultrasound report generation and its open issues, and offers some thoughts for future research.
Keywords: artificial intelligence; ultrasound report generation; vision-language models; natural language generation; large language model
20. Effectiveness assessment of recent large vision-language models (Cited: 2)
Authors: Yao Jiang, Xinyu Yan, Ge-Peng Ji, Keren Fu, Meijun Sun, Huan Xiong, Deng-Ping Fan, Fahad Shahbaz Khan. 《Visual Intelligence》, 2024, Issue 1, pp. 197-213 (17 pages)
The advent of large vision-language models (LVLMs) represents a remarkable advance in the quest for artificial general intelligence. However, the models' effectiveness in both specialized and general tasks warrants further investigation. This paper endeavors to evaluate the competency of popular LVLMs in specialized and general tasks, respectively, aiming to offer a comprehensive understanding of these novel models. To gauge their effectiveness in specialized tasks, we employ six challenging tasks in three different application scenarios: natural, healthcare, and industrial. These six tasks include salient/camouflaged/transparent object detection, as well as polyp detection, skin lesion detection, and industrial anomaly detection. We examine the performance of three recent open-source LVLMs, including MiniGPT-v2, LLaVA-1.5, and Shikra, on both visual recognition and localization in these tasks. Moreover, we conduct empirical investigations utilizing the aforementioned LVLMs together with GPT-4V, assessing their multi-modal understanding capabilities in general tasks including object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning. Our investigations reveal that these LVLMs demonstrate limited proficiency not only in specialized tasks but also in general tasks. We delve deep into this inadequacy and uncover several potential factors, including limited cognition in specialized tasks, object hallucination, text-to-image interference, and decreased robustness in complex problems. We hope that this study can provide useful insights for the future development of LVLMs, helping researchers improve LVLMs for both general and specialized applications.
Keywords: large vision-language models (LVLMs); recognition; localization; multi-modal understanding