Journal Articles
13 articles found
1. A Review on Vision-Language-Based Approaches: Challenges and Applications
Authors: Huu-Tuong Ho, Luong Vuong Nguyen, Minh-Tien Pham, Quang-Huy Pham, Quang-Duong Tran, Duong Nguyen Minh Huy, Tri-Hai Nguyen. Computers, Materials & Continua, 2025, Issue 2, pp. 1733-1756 (24 pages).
In multimodal learning, Vision-Language Models (VLMs) have become a critical research focus, enabling the integration of textual and visual data. These models have shown significant promise across various natural language processing tasks, such as visual question answering, and computer vision applications, including image captioning and image-text retrieval, highlighting their adaptability for complex, multimodal datasets. In this work, we review the landscape of Bootstrapping Language-Image Pre-training (BLIP) and other VLM techniques. A comparative analysis is conducted to assess VLMs' strengths, limitations, and applicability across tasks while examining challenges such as scalability, data quality, and fine-tuning complexities. The work concludes by outlining potential future directions in VLM research, focusing on enhancing model interpretability, addressing ethical implications, and advancing multimodal integration in real-world applications.
Keywords: Bootstrapping language-image pre-training (BLIP); multimodal learning; vision-language model (VLM); vision-language pre-training (VLP)
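The captioning and retrieval capabilities surveyed in this review are available through public BLIP checkpoints. As a hedged illustration only (not code from the paper), the sketch below assumes the Hugging Face transformers library and the publicly released Salesforce/blip-image-captioning-base checkpoint; exact class names and defaults may vary across library versions.

```python
# Hedged sketch: caption an image with a public BLIP checkpoint via Hugging Face
# transformers. The checkpoint name and generation settings are assumptions for
# illustration; this is not code from the surveyed paper.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")          # any local image
inputs = processor(images=image, return_tensors="pt")     # pixel preprocessing
output_ids = model.generate(**inputs, max_new_tokens=30)  # caption decoding
print(processor.decode(output_ids[0], skip_special_tokens=True))
```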
2. VOTI: Jailbreaking Vision-Language Models via Visual Obfuscation and Task Induction
Authors: ZHU Yifan, CHU Zhixuan, REN Kui. ZTE Communications, 2025, Issue 3, pp. 15-26 (12 pages).
In recent years, large vision-language models (VLMs) have achieved significant breakthroughs in cross-modal understanding and generation. However, the safety issues arising from their multimodal interactions have become prominent. VLMs are vulnerable to jailbreak attacks, where attackers craft carefully designed prompts to bypass safety mechanisms, leading them to generate harmful content. To address this, we investigate the alignment between visual inputs and task execution, uncovering locality defects and attention biases in VLMs. Based on these findings, we propose VOTI, a novel jailbreak framework leveraging visual obfuscation and task induction. VOTI subtly embeds malicious keywords within neutral image layouts to evade detection and breaks down harmful queries into a sequence of subtasks. This approach disperses malicious intent across modalities, exploiting VLMs' over-reliance on local visual cues and their fragility in multi-step reasoning to bypass global safety mechanisms. Implemented as an automated framework, VOTI integrates large language models as red-team assistants to generate and iteratively optimize jailbreak strategies. Extensive experiments across seven mainstream VLMs demonstrate VOTI's effectiveness, achieving a 73.46% attack success rate on GPT-4o-mini. These results reveal critical vulnerabilities in VLMs, highlighting the urgent need for stronger defenses and improved multimodal alignment.
Keywords: large vision-language models; jailbreak attacks; red teaming; security of large models; safety alignment
3. The Synergy of Seeing and Saying: Revolutionary Advances in Multi-modality Medical Vision-Language Large Models
Authors: Xiang LI, Yu SUN, Jia LIN, Like LI, Ting FENG, Shen YIN. Artificial Intelligence Science and Engineering, 2025, Issue 2, pp. 79-97 (19 pages).
The application of visual-language large models in the field of medical health has gradually become a research focus. These models combine image understanding with natural language processing and can simultaneously process multi-modality data such as medical images and medical reports. They can not only recognize images but also understand the semantic relationship between images and texts, effectively integrating medical information and providing strong support for clinical decision-making and disease diagnosis. Visual-language large models perform well on specific medical tasks and also show strong potential and a high degree of intelligence as general task models. This paper provides a comprehensive review of visual-language large models in the field of medical health. Specifically, it first introduces the basic theoretical foundations and technical principles. It then presents specific application scenarios in medical health, including modality fusion, semi-supervised learning, weakly supervised learning, unsupervised learning, cross-domain models, and general models. Finally, challenges including insufficient data, interpretability, and practical deployment are discussed, and four potential future development directions are given in light of these challenges.
Keywords: large language models; vision-language models; medical health; multimodality models
4. IQAGPT: computed tomography image quality assessment with vision-language and ChatGPT models
Authors: Zhihao Chen, Bin Hu, Chuang Niu, Tao Chen, Yuxin Li, Hongming Shan, Ge Wang. Visual Computing for Industry, Biomedicine, and Art, 2024, Issue 1, pp. 165-181 (17 pages).
Large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities in various tasks and attracted increasing interest as a natural language interface across many domains. Recently, large vision-language models (VLMs) that learn rich vision-language correlation from image-text pairs, like BLIP-2 and GPT-4, have been intensively investigated. However, despite these developments, the application of LLMs and VLMs to image quality assessment (IQA), particularly in medical imaging, remains unexplored. Such assessment is valuable for objective performance evaluation and could supplement or even replace radiologists' opinions. To this end, this study introduces IQAGPT, an innovative computed tomography (CT) IQA system that integrates an image-quality captioning VLM with ChatGPT to generate quality scores and textual reports. First, a CT-IQA dataset comprising 1,000 CT slices with diverse quality levels is professionally annotated and compiled for training and evaluation. To better leverage the capabilities of LLMs, the annotated quality scores are converted into semantically rich text descriptions using a prompt template. Second, the image-quality captioning VLM is fine-tuned on the CT-IQA dataset to generate quality descriptions. The captioning model fuses image and text features through cross-modal attention. Third, based on the quality descriptions, users verbally request ChatGPT to rate image-quality scores or produce radiological quality reports. Results demonstrate the feasibility of assessing image quality using LLMs. The proposed IQAGPT outperformed GPT-4 and CLIP-IQA, as well as multitask classification and regression models that rely solely on images.
Keywords: Deep learning; Medical imaging; Image captioning; Multimodality; Large language model; vision-language model; GPT-4; Subjective evaluation
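The pipeline described above hinges on converting annotated quality scores into semantically rich text via a prompt template before fine-tuning the captioning VLM. The sketch below is a minimal illustration of that step under assumed details: the five-level scale, the wording, and the function name are invented for illustration and are not the authors' template.

```python
# Hedged sketch: map an annotated CT image-quality score to a rich text description
# through a fixed prompt template, in the spirit of the IQAGPT pipeline above.
# The 1-5 scale and phrasing are illustrative assumptions only.

QUALITY_LEVELS = {
    1: "severe noise and artifacts; diagnostic value is very limited",
    2: "heavy noise; only large structures are assessable",
    3: "moderate noise; most structures are assessable",
    4: "mild noise; overall image quality is good",
    5: "minimal noise; overall image quality is excellent",
}

def score_to_description(score: int, slice_id: str) -> str:
    """Convert a numeric quality score (assumed 1-5) into a training caption."""
    return (f"CT slice {slice_id}: quality score {score} out of 5. "
            f"The image shows {QUALITY_LEVELS[score]}.")

print(score_to_description(4, "case012_slice37"))
```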
5. VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning (Cited by 3)
Authors: WEI Tingting, YUAN Weilin, LUO Junren, ZHANG Wanpeng, LU Lina. Journal of Systems Engineering and Electronics (SCIE, EI, CSCD), 2023, Issue 1, pp. 9-18 (10 pages).
In the field of satellite imagery, remote sensing image captioning (RSIC) is a hot topic that faces the challenges of overfitting and of aligning image and text. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC that jointly represents vision and language. First, a new RSIC dataset, DIOR-Captions, is built by augmenting the object detection in optical remote sensing images (DIOR) dataset with manually annotated Chinese and English captions. Second, a Vision-Language aligning model with Cross-modal Attention (VLCA) is presented to generate accurate and rich bilingual descriptions for remote sensing images. Third, a cross-modal learning network is introduced to address the problem of visual-lingual alignment. Notably, VLCA is also applied to end-to-end Chinese caption generation by using a Chinese pre-trained language model. Experiments are carried out with various baselines to validate VLCA on the proposed dataset. The results demonstrate that the proposed algorithm produces more descriptive and informative captions than existing algorithms.
Keywords: remote sensing image captioning (RSIC); vision-language representation; remote sensing image caption dataset; attention mechanism
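VLCA's central mechanism is cross-modal attention between textual and visual features. The PyTorch sketch below shows only the generic operation (caption tokens as queries attending over image features as keys and values); the dimensions, single layer, and residual design are assumptions, not the paper's architecture.

```python
# Hedged sketch: generic cross-modal attention in PyTorch. Text tokens (queries)
# attend over image patch/region features (keys and values). All sizes are
# illustrative assumptions, not VLCA's actual configuration.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (B, T, dim) caption token embeddings
        # image_feats: (B, N, dim) visual features from an image encoder
        attended, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        return self.norm(text_feats + attended)  # residual connection, then LayerNorm

fused = CrossModalAttention()(torch.randn(2, 20, 512), torch.randn(2, 49, 512))
print(fused.shape)  # torch.Size([2, 20, 512])
```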
在线阅读 下载PDF
6. Vision-language model-based human-robot collaboration for smart manufacturing: A state-of-the-art survey (Cited by 1)
Authors: Junming FAN, Yue YIN, Tian WANG, Wenhang DONG, Pai ZHENG, Lihui WANG. Frontiers of Engineering Management, 2025, Issue 1, pp. 177-200 (24 pages).
Human-robot collaboration (HRC) is set to transform the manufacturing paradigm by leveraging the strengths of human flexibility and robot precision. The recent breakthroughs of Large Language Models (LLMs) and Vision-Language Models (VLMs) have motivated preliminary explorations and adoptions of these models in the smart manufacturing field. However, despite the considerable amount of effort, existing research has mainly focused on individual components without a comprehensive perspective that addresses the full potential of VLMs, especially for HRC in smart manufacturing scenarios. To fill the gap, this work offers a systematic review of the latest advancements and applications of VLMs in HRC for smart manufacturing. It covers the fundamental architectures and pretraining methodologies of LLMs and VLMs, their applications in robotic task planning, navigation, and manipulation, and their role in enhancing human-robot skill transfer through multimodal data integration. Lastly, the paper discusses current limitations and future research directions in VLM-based HRC, highlighting the trend toward fully realizing the potential of these technologies for smart manufacturing.
Keywords: vision-language models; large language models; human-robot collaboration; smart manufacturing
7. Masked Vision-language Transformer in Fashion
Authors: Ge-Peng Ji, Mingchen Zhuge, Dehong Gao, Deng-Ping Fan, Christos Sakaridis, Luc Van Gool. Machine Intelligence Research (EI, CSCD), 2023, Issue 3, pp. 421-434 (14 pages).
We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply utilize the vision transformer architecture to replace the bidirectional encoder representations from Transformers (BERT) in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multimodal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
Keywords: vision-language; masked image reconstruction; transformer; fashion; e-commercial
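The masked image reconstruction (MIR) objective mentioned above regresses the pixel content of masked patches. The sketch below shows a generic version of that recipe (replace a random subset of patch tokens with a mask token, predict per-patch pixels, and apply an L2 loss on the masked positions only); the masking ratio, patch size, and the linear prediction head are assumptions, not MVLT's configuration.

```python
# Hedged sketch: a generic masked-image-reconstruction loss. A random subset of
# patch tokens is replaced by a learnable mask token, a small head predicts the
# flattened pixels of every patch, and the L2 loss is taken on masked patches only.
# All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

def mir_loss(patch_tokens, patch_pixels, head, mask_token, mask_ratio=0.3):
    """patch_tokens: (B, N, D) encoder features; patch_pixels: (B, N, P) flattened targets."""
    B, N, D = patch_tokens.shape
    mask = torch.rand(B, N, device=patch_tokens.device) < mask_ratio          # True = masked
    corrupted = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), patch_tokens)
    pred = head(corrupted)                                                    # (B, N, P)
    return ((pred - patch_pixels) ** 2)[mask].mean()                          # loss on masked patches

head = nn.Linear(768, 16 * 16 * 3)                 # per-patch pixel head (assumption)
mask_token = nn.Parameter(torch.zeros(1, 1, 768))  # learnable mask embedding
tokens = torch.randn(2, 196, 768)                  # e.g., 14x14 patches from a ViT encoder
pixels = torch.randn(2, 196, 16 * 16 * 3)          # flattened ground-truth patches
print(mir_loss(tokens, pixels, head, mask_token).item())
```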
8. Effectiveness assessment of recent large vision-language models (Cited by 1)
Authors: Yao Jiang, Xinyu Yan, Ge-Peng Ji, Keren Fu, Meijun Sun, Huan Xiong, Deng-Ping Fan, Fahad Shahbaz Khan. Visual Intelligence, 2024, Issue 1, pp. 197-213 (17 pages).
The advent of large vision-language models (LVLMs) represents a remarkable advance in the quest for artificial general intelligence. However, the models' effectiveness in both specialized and general tasks warrants further investigation. This paper endeavors to evaluate the competency of popular LVLMs in specialized and general tasks, respectively, aiming to offer a comprehensive understanding of these novel models. To gauge their effectiveness in specialized tasks, we employ six challenging tasks in three different application scenarios: natural, healthcare, and industrial. These six tasks include salient/camouflaged/transparent object detection, as well as polyp detection, skin lesion detection, and industrial anomaly detection. We examine the performance of three recent open-source LVLMs, including MiniGPT-v2, LLaVA-1.5, and Shikra, on both visual recognition and localization in these tasks. Moreover, we conduct empirical investigations utilizing the aforementioned LVLMs together with GPT-4V, assessing their multi-modal understanding capabilities in general tasks including object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning. Our investigations reveal that these LVLMs demonstrate limited proficiency not only in specialized tasks but also in general tasks. We delve deep into this inadequacy and uncover several potential factors, including limited cognition in specialized tasks, object hallucination, text-to-image interference, and decreased robustness in complex problems. We hope that this study can provide useful insights for the future development of LVLMs, helping researchers improve LVLMs for both general and specialized applications.
Keywords: large vision-language models (LVLMs); recognition; localization; multi-modal understanding
9. CLIP-IML: A novel approach for CLIP-based image manipulation localization
Authors: Xue-Yang Hou, Yilihamu Yaermaimaiti, Shuo-Qi Cheng. Journal of Electronic Science and Technology, 2025, Issue 3, pp. 56-70 (15 pages).
Existing image manipulation localization (IML) techniques require large, densely annotated sets of forged images. This requirement greatly increases labeling costs and limits a model's ability to handle manipulation types that are novel or absent from the training data. To address these issues, we present CLIP-IML, an IML framework that leverages contrastive language-image pre-training (CLIP). A lightweight feature-reconstruction module transforms CLIP token sequences into spatial tensors, after which a compact feature-pyramid network and a multi-scale fusion decoder work together to capture information from fine to coarse levels. We evaluated CLIP-IML on ten public datasets that cover copy-move, splicing, removal, and artificial intelligence (AI)-generated forgeries. The framework raises the average F1-score by 7.85% relative to the strongest recent baselines and secures either first- or second-place performance on every dataset. Ablation studies show that CLIP pre-training, higher-resolution inputs, and the multi-scale decoder each make complementary contributions. Under six common post-processing perturbations, as well as the compression pipelines used by Facebook, Weibo, and WeChat, the performance decline never exceeds 2.2%, confirming strong practical robustness. Moreover, CLIP-IML requires only a few thousand annotated images for training, which markedly reduces data-collection and labeling effort compared with previous methods. All of these results indicate that CLIP-IML is highly generalizable for image tampering localization across a wide range of tampering scenarios.
Keywords: image manipulation localization; multi-scale feature; pre-trained model; vision-language model; Vision Transformer
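The feature-reconstruction module described above turns CLIP's ViT token sequence back into a spatial tensor so that pyramid-style decoding can localize manipulated regions. The sketch below shows only that reshaping step under assumed dimensions (a leading CLS token followed by a square grid of patch tokens); it is not the paper's module.

```python
# Hedged sketch: reshape a CLIP ViT token sequence (CLS token + patch tokens) into a
# spatial feature map that a pyramid/segmentation-style decoder could consume.
# The grid size and embedding dimension are illustrative assumptions.
import math
import torch

def tokens_to_spatial(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (B, 1 + H*W, D) with a leading CLS token -> (B, D, H, W)."""
    patch_tokens = tokens[:, 1:, :]              # drop the CLS token
    B, N, D = patch_tokens.shape
    side = math.isqrt(N)                         # assume a square patch grid
    assert side * side == N, "patch tokens must form a square grid"
    return patch_tokens.transpose(1, 2).reshape(B, D, side, side)

feats = tokens_to_spatial(torch.randn(2, 1 + 14 * 14, 768))
print(feats.shape)  # torch.Size([2, 768, 14, 14])
```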
10. Artificial intelligence assisted ultrasound report generation
Authors: Jia-Hui Zeng, Kai-Kai Zhao, Ning-Bo Zhao. Artificial Intelligence in Medical Imaging, 2025, Issue 1, pp. 13-20 (8 pages).
Artificial intelligence (AI)-assisted ultrasound report generation is a technology that leverages AI to convert ultrasound imaging analysis results into structured diagnostic reports. By integrating image recognition and natural language generation models, AI systems can automatically detect and analyze lesions or abnormalities in ultrasound images and generate textual descriptions of diagnostic conclusions (e.g., fatty liver, liver fibrosis, automated BI-RADS grading of breast lesions), imaging findings, and clinical recommendations to form comprehensive reports. This technology enhances the efficiency and accuracy of imaging diagnosis, reduces physicians' workloads, ensures report standardization and consistency, and provides robust support for clinical decision-making. Current state-of-the-art algorithms for automated ultrasound report generation rely primarily on vision-language models, which harness the generalization capabilities of large language models and large vision models through multimodal (language + vision) feature alignment. However, existing approaches inadequately address challenges such as numerical measurement generation, effective utilization of report templates, incorporation of historical reports, learning text-image correlations, and overfitting under limited-data conditions. This paper introduces the current state of research on ultrasound report generation, discusses existing issues, and offers some thoughts for future research.
Keywords: artificial intelligence; ultrasound report generation; vision-language models; natural language generation; large language model
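Two of the open challenges listed above are numerical measurement generation and making effective use of report templates. The sketch below illustrates one simple way a structured template with explicit measurement slots could be represented and rendered to text; the fields, units, and wording are assumptions for illustration, not a clinical standard or the approach of any surveyed system.

```python
# Hedged sketch: a structured ultrasound report template with explicit numeric
# measurement slots, rendered to narrative text. Field names, units, and phrasing
# are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class LiverUltrasoundReport:
    liver_span_cm: float      # numeric measurement kept as a structured field
    echotexture: str          # e.g., "homogeneous" or "coarsened"
    steatosis_grade: str      # e.g., "none", "mild", "moderate", "severe"

    def to_text(self) -> str:
        impression = "fatty liver" if self.steatosis_grade != "none" else "no diffuse abnormality"
        return (f"Findings: liver span {self.liver_span_cm:.1f} cm with {self.echotexture} "
                f"echotexture and {self.steatosis_grade} steatosis. Impression: {impression}.")

# Filling the template from hypothetical model outputs:
print(LiverUltrasoundReport(15.2, "coarsened", "mild").to_text())
```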
11. Causal reasoning in typical computer vision tasks (Cited by 2)
Authors: ZHANG KeXuan, SUN QiYu, ZHAO ChaoQiang, TANG Yang. Science China Technological Sciences (SCIE, EI, CAS, CSCD), 2024, Issue 1, pp. 105-120 (16 pages).
Deep learning has revolutionized the field of artificial intelligence. Based on the statistical correlations uncovered by deep learning-based methods, computer vision tasks such as autonomous driving and robotics are growing rapidly. Despite being the basis of deep learning, such correlation strongly depends on the distribution of the original data and is susceptible to uncontrolled factors. Without the guidance of prior knowledge, statistical correlations alone cannot correctly reflect the essential causal relations and may even introduce spurious correlations. As a result, researchers are now trying to enhance deep learning-based methods with causal theory. Causal theory can model the intrinsic causal structure unaffected by data bias and effectively avoid spurious correlations. This paper comprehensively reviews the existing causal methods in typical vision and vision-language tasks such as semantic segmentation, object detection, and image captioning. The advantages of causality and the approaches for building causal paradigms are summarized. Future roadmaps are also proposed, including facilitating the development of causal theory and its application in other complex scenarios and systems.
Keywords: causal reasoning; computer vision tasks; vision-language tasks; semantic segmentation; object detection
12. Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance
Authors: Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai, Wenhai Wang. Visual Intelligence, 2024, Issue 1, pp. 392-408 (17 pages).
Multi-modal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a wide range of domains. However, the large model scale and associated high computational cost pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, thereby hindering their widespread application. In this work, we introduce Mini-InternVL, a series of MLLMs with parameters ranging from 1 billion to 4 billion, which achieves 90% of the performance with only 5% of the parameters. This significant improvement in efficiency and effectiveness makes our models more accessible and applicable in various real-world scenarios. To further promote the adoption of our models, we are developing a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks, including autonomous driving, medical image processing, and remote sensing. We believe that our models can provide valuable insights and resources to advance the development of efficient and effective MLLMs.
Keywords: lightweight multi-modal large language model; vision-language model; knowledge distillation; visual instruction tuning
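The keywords list knowledge distillation among the techniques behind Mini-InternVL. The sketch below is the generic temperature-scaled distillation loss between a large teacher's logits and a small student's logits; it is an assumption-level illustration of the general technique, not the paper's actual training recipe.

```python
# Hedged sketch: generic temperature-scaled knowledge-distillation loss between
# teacher and student next-token logits. Temperature and scaling are illustrative
# assumptions, not Mini-InternVL's exact setup.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence from the softened teacher distribution to the student's."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Multiply by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

student = torch.randn(4, 32000)  # (batch, vocab) logits from the small student model
teacher = torch.randn(4, 32000)  # logits from the larger teacher model
print(distillation_loss(student, teacher).item())
```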
13. A Computational Model of Concept Generalization in Cross-Modal Reference (Cited by 1)
Authors: Patrick McCrae, Wolfgang Menzel, Maosong SUN. Tsinghua Science and Technology (SCIE, EI, CAS), 2011, Issue 2, pp. 113-120 (8 pages).
Cross-modal interactions between visual understanding and linguistic processing substantially contribute to the remarkable robustness of human language processing. We argue that the formation of cross-modal referential links is a prerequisite for the occurrence of cross-modal interactions between vision and language. In this paper we examine a computational model of cross-modal reference formation with respect to its robustness against conceptual underspecification in the visual modality. This investigation is motivated by the fact that natural systems are well capable of establishing cross-modal reference between modalities with different degrees of conceptual specification. In the investigated model, conceptually underspecified context information continues to drive the syntactic disambiguation of verb-centered syntactic ambiguities as long as the visual context contains the situation arity information of the visual scene.
Keywords: vision-language interaction; cross-modal reference; syntactic disambiguation