Existing methods for tracing water pollution sources typically integrate three-dimensional excitation-emission matrix (3D-EEM) fluorescence spectroscopy with similarity-based matching algorithms. However, these approaches exhibit high error rates in borderline cases and necessitate expert manual review, which limits scalability and introduces inconsistencies between algorithmic outputs and expert judgment. To address these limitations, we propose a large vision-language model (VLM) designed as an “expert agent” to automatically refine similarity scores, ensuring alignment with expert decisions and overcoming key application bottlenecks. The model consists of two core components: (1) a rule-based similarity calculation module that generates initial spectral similarity scores, and (2) a pre-trained large vision-language model fine-tuned via supervised learning and reinforcement learning from human feedback (RLHF) to emulate expert assessments. To facilitate training and evaluation, we introduce two expert-annotated datasets, Spec1k and SpecReason, which capture both quantitative corrections and qualitative reasoning patterns, allowing the model to emulate expert decision-making processes. Experimental results demonstrate that our method achieves 81.45% source attribution accuracy, 38.24% higher than rule-based and machine learning baselines. Real-world deployment further validates its effectiveness.
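As a rough illustration of the two-stage scheme described above, the sketch below pairs a cosine-similarity stand-in for the rule-based module with a stub for the expert-agent VLM. The thresholds, the `vlm_refine` stub, and `attribute_source` are hypothetical names and values, not the authors' implementation.

```python
# Minimal sketch of the two-stage scoring scheme: a rule-based similarity
# pass, with borderline cases deferred to a fine-tuned "expert agent" VLM.
import numpy as np

def eem_similarity(eem_a: np.ndarray, eem_b: np.ndarray) -> float:
    """Rule-based similarity between two 3D-EEM spectra (cosine over flattened matrices)."""
    a, b = eem_a.ravel(), eem_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def vlm_refine(eem_a, eem_b, raw_score: float) -> float:
    """Hypothetical stub for the fine-tuned VLM that emulates expert review."""
    raise NotImplementedError  # would render both spectra and query the model

def attribute_source(sample, candidates, low=0.6, high=0.9):
    """Score a sample against candidate sources; defer borderline cases to the VLM.
    The (low, high) borderline band is an illustrative assumption."""
    results = []
    for name, ref in candidates.items():
        score = eem_similarity(sample, ref)
        if low < score < high:          # borderline: refine with the expert agent
            score = vlm_refine(sample, ref, score)
        results.append((name, score))
    return max(results, key=lambda r: r[1])
```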
In multimodal learning, Vision-Language Models (VLMs) have become a critical research focus, enabling the integration of textual and visual data. These models have shown significant promise across various natural language processing tasks, such as visual question answering, and computer vision applications, including image captioning and image-text retrieval, highlighting their adaptability to complex, multimodal datasets. In this work, we review the landscape of Bootstrapping Language-Image Pre-training (BLIP) and other VLM techniques. A comparative analysis is conducted to assess VLMs’ strengths, limitations, and applicability across tasks, while examining challenges such as scalability, data quality, and fine-tuning complexities. The work concludes by outlining potential future directions in VLM research, focusing on enhancing model interpretability, addressing ethical implications, and advancing multimodal integration in real-world applications.
In recent years, large vision-language models (VLMs) have achieved significant breakthroughs in cross-modal understanding and generation. However, the safety issues arising from their multimodal interactions have become prominent. VLMs are vulnerable to jailbreak attacks, in which attackers craft carefully designed prompts to bypass safety mechanisms, leading the models to generate harmful content. To address this, we investigate the alignment between visual inputs and task execution, uncovering locality defects and attention biases in VLMs. Based on these findings, we propose VOTI, a novel jailbreak framework leveraging visual obfuscation and task induction. VOTI subtly embeds malicious keywords within neutral image layouts to evade detection and breaks down harmful queries into a sequence of subtasks. This approach disperses malicious intent across modalities, exploiting VLMs’ over-reliance on local visual cues and their fragility in multi-step reasoning to bypass global safety mechanisms. Implemented as an automated framework, VOTI integrates large language models as red-team assistants to generate and iteratively optimize jailbreak strategies. Extensive experiments across seven mainstream VLMs demonstrate VOTI’s effectiveness, achieving a 73.46% attack success rate on GPT-4o-mini. These results reveal critical vulnerabilities in VLMs, highlighting the urgent need for robust defenses and improved multimodal alignment.
The application of visual-language large models in the field of medical health has gradually become a research focus. These models combine image understanding with natural language processing and can simultaneously process multi-modality data such as medical images and medical reports. They can not only recognize images but also understand the semantic relationship between images and texts, effectively integrating medical information and providing strong support for clinical decision-making and disease diagnosis. Visual-language large models perform well on specific medical tasks and also show strong potential and high intelligence as general task models. This paper provides a comprehensive review of visual-language large models in the field of medical health. Specifically, it first introduces the basic theoretical foundations and technical principles. It then introduces specific application scenarios in the field of medical health, including modality fusion, semi-supervised learning, weakly supervised learning, unsupervised learning, cross-domain models, and general models. Finally, challenges including insufficient data, interpretability, and practical deployment are discussed, and four potential future development directions are given in light of these challenges.
In the field of satellite imagery, remote sensing image captioning (RSIC) is a hot topic that faces the challenges of overfitting and of aligning image and text. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC that jointly represents vision and language. First, a new RSIC dataset, DIOR-Captions, is built by augmenting the object detection in optical remote sensing images (DIOR) dataset with manually annotated Chinese and English captions. Second, a Vision-Language aligning model with Cross-modal Attention (VLCA) is presented to generate accurate and abundant bilingual descriptions for remote sensing images. Third, a cross-modal learning network is introduced to address the problem of visual-lingual alignment. Notably, VLCA is also applied to end-to-end Chinese caption generation by using a Chinese pre-trained language model. Experiments with various baselines validate VLCA on the proposed dataset. The results demonstrate that the proposed algorithm produces more descriptive and informative captions than existing algorithms.
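For readers unfamiliar with the mechanism, a minimal PyTorch sketch of cross-modal attention in the spirit of VLCA follows: caption tokens (queries) attend over visual region features (keys and values). The single-layer design, dimensions, and class name are illustrative assumptions rather than the authors' architecture.

```python
# Cross-modal attention sketch: text queries attend over image features.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_feats: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (B, T, d) caption tokens; img_feats: (B, R, d) visual regions
        fused, _ = self.attn(query=text_feats, key=img_feats, value=img_feats)
        return self.norm(text_feats + fused)   # residual connection + layer norm

fused = CrossModalAttention()(torch.randn(2, 20, 512), torch.randn(2, 49, 512))
```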
Large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities in various tasks and attracted increasing interest as a natural language interface across many domains. Recently, large vision-language models (VLMs) that learn rich vision-language correlation from image-text pairs, like BLIP-2 and GPT-4, have been intensively investigated. However, despite these developments, the application of LLMs and VLMs to image quality assessment (IQA), particularly in medical imaging, remains unexplored. Such application is valuable for objective performance evaluation and could potentially supplement or even replace radiologists’ opinions. To this end, this study introduces IQAGPT, an innovative computed tomography (CT) IQA system that integrates an image-quality captioning VLM with ChatGPT to generate quality scores and textual reports. First, a CT-IQA dataset comprising 1,000 CT slices with diverse quality levels is professionally annotated and compiled for training and evaluation. To better leverage the capabilities of LLMs, the annotated quality scores are converted into semantically rich text descriptions using a prompt template. Second, the image-quality captioning VLM is fine-tuned on the CT-IQA dataset to generate quality descriptions. The captioning model fuses image and text features through cross-modal attention. Third, based on the quality descriptions, users verbally request ChatGPT to rate image-quality scores or produce radiological quality reports. Results demonstrate the feasibility of assessing image quality with LLMs. The proposed IQAGPT outperformed GPT-4 and CLIP-IQA, as well as multitask classification and regression models that rely solely on images.
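The score-to-text conversion step lends itself to a small illustration. The sketch below shows one plausible prompt-template mapping from an annotated quality score to a richer description; the five-level wording and score range are assumptions, not the paper's template.

```python
# Illustrative conversion of an annotated quality score into a text
# description via a template, in the spirit of the IQAGPT pipeline.
LEVELS = ["unacceptable", "poor", "acceptable", "good", "excellent"]  # assumed scale

def score_to_description(score: float, max_score: float = 4.0) -> str:
    level = LEVELS[min(int(score), len(LEVELS) - 1)]
    return (f"This CT slice exhibits {level} image quality "
            f"(score {score:.1f} of {max_score:.0f}); noise and artifact levels "
            f"are consistent with this rating.")

print(score_to_description(3.0))  # -> "... good image quality (score 3.0 of 4) ..."
```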
Background: Vision and vision-language foundation models, a subset of advanced artificial intelligence (AI) frameworks, have shown transformative potential in various medical fields. In ophthalmology, these models, particularly large language models and vision-based models, have demonstrated great potential to improve diagnostic accuracy, enhance treatment planning, and streamline clinical workflows. However, their deployment in ophthalmology has faced several challenges, particularly regarding generalizability and integration into clinical practice. This systematic review aims to summarize the current evidence on the use of vision and vision-language foundation models in ophthalmology, identifying key applications, outcomes, and challenges. Main text: A comprehensive search of PubMed, Web of Science, Scopus, and Google Scholar was conducted to identify studies published between January 2020 and July 2025. Studies were included if they developed or applied foundation models, such as vision-based models and large language models, to clinically relevant ophthalmic applications. A total of 10 studies met the inclusion criteria, covering areas such as retinal diseases, glaucoma, and ocular surface tumors. The primary outcome measures were model performance metrics, integration into clinical workflows, and the clinical utility of the models. Additionally, the review explored the limitations of foundation models, such as their reliance on large datasets, computational resources, and interpretability challenges. The majority of studies demonstrated that foundation models could achieve high diagnostic accuracy, with several reports indicating excellent performance comparable to or exceeding that of experienced clinicians. Foundation models achieved accuracy rates of up to 95% for diagnosing retinal diseases, and similar performance for detecting glaucoma progression. Despite promising results, concerns about algorithmic bias, overfitting, and the need for diverse training data were common. High computational demands, EHR compatibility, and the need for clinician validation also posed challenges. Additionally, model interpretability issues hindered clinician trust and adoption. Conclusions: Vision and vision-language foundation models in ophthalmology show significant potential for advancing diagnostic accuracy and treatment strategies, particularly in retinal diseases, glaucoma, and ocular oncology. However, challenges such as data quality, transparency, and ethical considerations must be addressed. Future research should focus on refining model performance, improving interpretability and generalizability, and exploring strategies for integrating these models into routine clinical practice to maximize their impact in clinical ophthalmology.
This paper investigates the potential of Vision-Language Models (VLMs) to enhance Human-Vehicle Interaction (HVI) in Autonomous Driving (AD) scenarios, particularly in interactions between vehicles and other traffic participants, with a focus on rationality and safety in external HVI. Leveraging recent advancements in large language models, VLMs demonstrate remarkable capabilities in understanding real-world contexts and have generated significant interest in HVI applications. This paper provides an overview of AD, HVI, and VLMs, along with the historical context of large language model applications in HVI. The HVI discussed herein involves dynamic game processes encompassing perception and decision-making between vehicles and traffic participants, such as pedestrians. Furthermore, we examine the perceptual challenges associated with applying VLMs to HVI and compile relevant datasets. This research fills a gap in the existing literature by systematically analyzing the current status, challenges, and future opportunities of VLM applications in HVI. To advance VLM integration in AD, various implementation strategies are discussed. The findings highlight the potential of VLMs to transform HVI in AD, improving both passenger experience and driving safety. Overall, this study contributes to a comprehensive understanding of VLM applications in HVI and provides insights to guide future research and development.
Human-robot collaboration (HRC) is set to transform the manufacturing paradigm by leveraging the strengths of human flexibility and robot precision. The recent breakthroughs of Large Language Models (LLMs) and Vision-Language Models (VLMs) have motivated preliminary explorations and adoptions of these models in the smart manufacturing field. However, despite considerable effort, existing research has mainly focused on individual components without a comprehensive perspective on the full potential of VLMs, especially for HRC in smart manufacturing scenarios. To fill this gap, this work offers a systematic review of the latest advancements and applications of VLMs in HRC for smart manufacturing. It covers the fundamental architectures and pretraining methodologies of LLMs and VLMs; their applications in robotic task planning, navigation, and manipulation; and their role in enhancing human-robot skill transfer through multimodal data integration. Lastly, the paper discusses current limitations and future research directions in VLM-based HRC, highlighting the path toward fully realizing the potential of these technologies for smart manufacturing.
In recent years, Vision-Language Models (VLMs) have emerged as a significant breakthrough in multimodal learning, demonstrating remarkable progress in tasks such as image-text alignment, image generation, and semantic reasoning. This paper systematically reviews current VLM pretraining methodologies, including contrastive learning and generative paradigms, while providing an in-depth analysis of efficient transfer learning strategies such as prompt tuning, LoRA, and adapter modules. Through representative models like CLIP, BLIP, and GIT, we examine their practical applications in visual grounding, image-text retrieval, visual question answering, affective computing, and embodied AI. Furthermore, we identify persistent challenges in fine-grained semantic modeling, cross-modal reasoning, and cross-lingual transfer. Finally, we envision future trends in unified architectures, multimodal reinforcement learning, and domain adaptation, aiming to provide a systematic reference and technical insights for subsequent research.
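Since the survey highlights LoRA among the transfer strategies, a minimal sketch may help: a frozen linear layer is augmented with a trainable low-rank update B A scaled by alpha/r. The hyperparameter defaults below are illustrative, not values from any surveyed model.

```python
# Minimal LoRA layer: frozen base weight plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # freeze the pre-trained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))
```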
We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply utilize the vision transformer architecture to replace the bidirectional encoder representations from Transformers (BERT) in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multimodal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show obvious improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
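A toy sketch of the masked image reconstruction (MIR) objective may clarify the idea: randomly mask flattened image patches and regress the missing content from the model's outputs. The mask ratio, L1 loss, and identity `predict_fn` in the usage line are assumptions for demonstration, not MVLT's exact formulation.

```python
# Sketch of a masked-image-reconstruction loss over flattened patches.
import torch

def mir_loss(patches: torch.Tensor, predict_fn, mask_ratio: float = 0.5):
    # patches: (B, N, P) flattened image patches
    B, N, _ = patches.shape
    mask = torch.rand(B, N) < mask_ratio                 # True = masked position
    corrupted = patches.clone()
    corrupted[mask] = 0.0                                # zero out masked patches
    recon = predict_fn(corrupted)                        # (B, N, P) reconstruction
    return torch.nn.functional.l1_loss(recon[mask], patches[mask])

loss = mir_loss(torch.randn(2, 196, 768), predict_fn=lambda x: x)
```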
The advent of large vision-language models (LVLMs) represents a remarkable advance in the quest for artificial general intelligence. However, the models’ effectiveness in both specialized and general tasks warrants further investigation. This paper evaluates the competency of popular LVLMs in specialized and general tasks, respectively, aiming to offer a comprehensive understanding of these novel models. To gauge their effectiveness in specialized tasks, we employ six challenging tasks in three different application scenarios: natural, healthcare, and industrial. These six tasks include salient/camouflaged/transparent object detection, as well as polyp detection, skin lesion detection, and industrial anomaly detection. We examine the performance of three recent open-source LVLMs, including MiniGPT-v2, LLaVA-1.5, and Shikra, on both visual recognition and localization in these tasks. Moreover, we conduct empirical investigations utilizing the aforementioned LVLMs together with GPT-4V, assessing their multi-modal understanding capabilities in general tasks including object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning. Our investigations reveal that these LVLMs demonstrate limited proficiency not only in specialized tasks but also in general tasks. We delve into this inadequacy and uncover several potential factors, including limited cognition in specialized tasks, object hallucination, text-to-image interference, and decreased robustness in complex problems. We hope that this study can provide useful insights for the future development of LVLMs, helping researchers improve LVLMs for both general and specialized applications.
It remains difficult to automate the creation and validation of Unified Modeling Language (UML) diagrams due to unstructured requirements, limited automated pipelines, and the lack of reliable evaluation methods. This study introduces a cohesive architecture that combines requirement development, UML synthesis, and multimodal validation. First, LLaMA-3.2-1B-Instruct is utilized to generate user-focused requirements. Then, DeepSeek-R1-Distill-Qwen-32B applies its reasoning skills to transform these requirements into PlantUML code. Using this dual-LLM pipeline, we constructed a synthetic dataset of 11,997 UML diagrams spanning six major diagram families. Rendering analysis showed that 89.5% of the generated diagrams compile correctly, while invalid cases were detected automatically. To assess quality, we employed a multimodal scoring method that combines Qwen2.5-VL-3B, LLaMA-3.2-11B-Vision-Instruct, and Aya-Vision-8B, with weights based on MMMU performance. A study with 94 experts revealed strong alignment between automatic and manual evaluations, yielding a Pearson correlation of r = 0.82 and a Fleiss’ Kappa of 0.78, indicating a high degree of concordance between automated metrics and human judgment. Overall, the results demonstrate that our scoring system is effective and that the proposed generation pipeline produces UML diagrams that are both syntactically correct and semantically coherent. More broadly, the system provides a scalable and reproducible foundation for future work in AI-driven software modeling and multimodal verification.
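The MMMU-weighted scoring step can be illustrated compactly. In the sketch below, each vision-language judge scores a rendered diagram and the scores are averaged with weights proportional to the judges' MMMU accuracy; the benchmark numbers shown are placeholders, not the paper's values.

```python
# Weighted multimodal scoring: combine judge scores with MMMU-based weights.
mmmu_scores = {"Qwen2.5-VL-3B": 0.47,
               "LLaMA-3.2-11B-Vision-Instruct": 0.50,
               "Aya-Vision-8B": 0.40}   # placeholder benchmark accuracies

def ensemble_score(judge_scores: dict) -> float:
    """Weighted average of per-judge scores, weights proportional to MMMU accuracy."""
    total = sum(mmmu_scores[m] for m in judge_scores)
    return sum(judge_scores[m] * mmmu_scores[m] / total for m in judge_scores)

print(ensemble_score({"Qwen2.5-VL-3B": 0.8,
                      "LLaMA-3.2-11B-Vision-Instruct": 0.7,
                      "Aya-Vision-8B": 0.9}))
```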
Game Quality Assurance (QA) currently relies heavily on manual testing, a process that is both costly and time-consuming. Traditional script- and log-based automation tools are limited in their ability to detect unpredictable visual bugs, especially those that are context-dependent or graphical in nature. As a result, many issues go unnoticed during manual QA, which reduces overall game quality, degrades the user experience, and creates inefficiencies throughout the development cycle. This study proposes two approaches to address these challenges. The first leverages a Large Language Model (LLM) to directly analyze gameplay videos, detect visual bugs, and automatically generate QA reports in natural language. The second introduces a pipeline method: first generating textual descriptions of visual bugs in game videos using the ClipCap model, then using those descriptions as input for the LLM to synthesize QA reports. Through these two multi-faceted approaches, this study evaluates the feasibility of automated game QA systems. To implement this system, we constructed a visual bug database derived from real-world game cases and fine-tuned the ClipCap model for the game video domain. Our proposed approach aims to enhance both efficiency and quality in game development by reducing the burden of manual QA while improving the accuracy of visual bug detection and ensuring consistent, reliable report generation.
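As a rough sketch of the second (pipeline) approach, the snippet below aggregates per-frame captions and hands them to an LLM for report drafting. `clipcap_caption` and `llm_generate` are hypothetical stubs standing in for the fine-tuned ClipCap model and the report-writing LLM; the prompt wording is an assumption.

```python
# Two-stage pipeline sketch: frame captions -> LLM-written QA report.
def clipcap_caption(frame) -> str:
    raise NotImplementedError   # fine-tuned ClipCap inference would go here

def llm_generate(prompt: str) -> str:
    raise NotImplementedError   # call to the report-writing LLM would go here

def qa_report(frames) -> str:
    descriptions = [f"t={i}: {clipcap_caption(f)}" for i, f in enumerate(frames)]
    prompt = ("You are a game QA analyst. From these frame descriptions, list any "
              "visual bugs with timestamps and severity:\n" + "\n".join(descriptions))
    return llm_generate(prompt)
```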
Flood disasters triggered by excessive rainfall cause severe damage to infrastructure and pose significant risks to human life. In disaster management, accurately identifying affected structures and providing interpretable analytical results are of critical importance. This study proposes a new disaster analysis framework that integrates the Multi-Atrous Self-Attention (MASA) mechanism, designed to capture multi-scale spatial features effectively, with vision-language models for explainable flood assessment. The proposed approach consists of two main components. The first performs segmentation to detect and quantify flood-affected structures, while the second employs a fine-tuned vision-language model to generate natural language descriptions of the disaster scene. The MASA module processes image-mask pairs from the FloodNet dataset to segment disaster-related structures, whereas the LoRA (Low-Rank Adaptation) enhanced BLIP-2 (Bootstrapped Language-Image Pre-training) model learns image-text pairs from the LADI-v2 dataset to produce textual disaster descriptions. Through this dual-stage structure, the system provides both quantitative and linguistic outputs, enabling interpretable flood impact assessment. Experimental results demonstrate that the proposed MASA-based segmentation model achieves a mean Intersection over Union (mIoU) of 73.78% on FloodNet, outperforming state-of-the-art segmentation models. Furthermore, the LoRA-fine-tuned BLIP-2 model achieves a BLEU score of 80.77% on the LADI-v2 dataset, indicating fluent, contextually relevant, and semantically coherent textual outputs. The proposed system contributes to disaster analysis by enhancing explainability and interpretability in flood damage assessment.
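A minimal sketch of a multi-atrous block in the spirit of MASA follows: parallel dilated convolutions gather multi-scale context before a self-attention step over spatial positions. The dilation rates, channel width, and fusion layer are illustrative assumptions, not the paper's exact module.

```python
# Multi-atrous block sketch: parallel dilated convs, then spatial self-attention.
import torch
import torch.nn as nn

class MultiAtrousBlock(nn.Module):
    def __init__(self, channels: int = 64, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in dilations
        )
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi = torch.cat([b(x) for b in self.branches], dim=1)  # multi-scale context
        y = self.fuse(multi)                               # (B, C, H, W)
        B, C, H, W = y.shape
        seq = y.flatten(2).transpose(1, 2)                 # (B, H*W, C)
        out, _ = self.attn(seq, seq, seq)                  # self-attention over positions
        return out.transpose(1, 2).reshape(B, C, H, W)

feat = MultiAtrousBlock()(torch.randn(1, 64, 32, 32))
```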
Existing image manipulation localization (IML) techniques require large, densely annotated sets of forged images. This requirement greatly increases labeling costs and limits a model’s ability to handle manipulation types that are novel or absent from the training data. To address these issues, we present CLIP-IML, an IML framework that leverages contrastive language-image pre-training (CLIP). A lightweight feature-reconstruction module transforms CLIP token sequences into spatial tensors, after which a compact feature-pyramid network and a multi-scale fusion decoder work together to capture information from fine to coarse levels. We evaluated CLIP-IML on ten public datasets that cover copy-move, splicing, removal, and artificial intelligence (AI)-generated forgeries. The framework raises the average F1-score by 7.85% relative to the strongest recent baselines and achieves either first- or second-place performance on every dataset. Ablation studies show that CLIP pre-training, higher-resolution inputs, and the multi-scale decoder each make complementary contributions. Under six common post-processing perturbations, as well as the compression pipelines used by Facebook, Weibo, and WeChat, the performance decline never exceeds 2.2%, confirming strong practical robustness. Moreover, CLIP-IML requires only a few thousand annotated images for training, which markedly reduces data-collection and labeling effort compared with previous methods. All of these results indicate that CLIP-IML generalizes well to image tampering localization across a wide range of tampering scenarios.
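The feature-reconstruction step admits a compact illustration: drop the [CLS] token and reshape the remaining CLIP patch tokens into a spatial map for the decoder. The ViT-B/16 geometry assumed below (a 14 x 14 patch grid with 768-dimensional tokens) is for demonstration only.

```python
# Reshape a CLIP token sequence into a spatial feature map for a decoder.
import torch

def tokens_to_spatial(tokens: torch.Tensor, grid: int = 14) -> torch.Tensor:
    # tokens: (B, 1 + grid*grid, D) CLIP output with a leading [CLS] token
    patch_tokens = tokens[:, 1:, :]                        # drop [CLS]: (B, grid*grid, D)
    B, N, D = patch_tokens.shape
    assert N == grid * grid, "token count must match the patch grid"
    return patch_tokens.transpose(1, 2).reshape(B, D, grid, grid)

fmap = tokens_to_spatial(torch.randn(2, 197, 768))         # -> (2, 768, 14, 14)
```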
Artificial intelligence (AI) assisted ultrasound report generation is a technology that leverages AI to convert ultrasound imaging analysis results into structured diagnostic reports. By integrating image recognition and natural language generation models, AI systems can automatically detect and analyze lesions or abnormalities in ultrasound images, generating textual descriptions of diagnostic conclusions (e.g., fatty liver, liver fibrosis, automated BI-RADS grading of breast lesions), imaging findings, and clinical recommendations to form comprehensive reports. This technology enhances the efficiency and accuracy of imaging diagnosis, reduces physicians’ workloads, ensures report standardization and consistency, and provides robust support for clinical decision-making. Current state-of-the-art algorithms for automated ultrasound report generation rely primarily on vision-language models, which harness the generalization capabilities of large language models and large vision models through multimodal (language + vision) feature alignment. However, existing approaches inadequately address challenges such as numerical measurement generation, effective utilization of report templates, incorporation of historical reports, learning of text-image correlations, and overfitting under limited data conditions. This paper introduces the current state of research on ultrasound report generation and the open issues, and offers some thoughts for future research.
Open-vocabulary semantic segmentation (OV-Seg) aims to segment novel categories without dense annotations, facilitating scalable perception in open-world scenarios. While recent advances in foundation models, such as CLIP for semantic grounding and the segment anything model (SAM) for spatial localization, have boosted OV-Seg performance, most existing methods remain limited to object-level segmentation and neglect the compositional and hierarchical structure of real-world entities. In this paper, we propose a hierarchical open-vocabulary segmentation framework that integrates CLIP and SAM with structured external knowledge in the form of object-part hierarchies. Specifically, we construct a category-level knowledge graph to encode part-whole relationships and guide the generation of enriched prompts, thereby aligning visual features with both object-level and part-level semantics. Extensive experiments on PartImageNet and PASCAL-Part demonstrate that our method consistently outperforms state-of-the-art baselines, especially in part-level segmentation and novel-category generalization. These results confirm the effectiveness of incorporating structured priors to enhance compositional and fine-grained visual understanding in open-vocabulary settings.
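The prompt-enrichment idea can be sketched with a toy part-whole graph: each category name is expanded into prompts that also mention its parts before CLIP text encoding. The graph entries and prompt wording below are invented examples, not the paper's knowledge graph.

```python
# Toy object-part hierarchy driving prompt enrichment for text encoding.
PART_GRAPH = {
    "dog": ["head", "torso", "leg", "tail"],
    "car": ["wheel", "door", "hood", "window"],
}

def enriched_prompts(category: str) -> list:
    """Object-level prompt plus one part-level prompt per known part."""
    prompts = [f"a photo of a {category}"]
    prompts += [f"the {part} of a {category}" for part in PART_GRAPH.get(category, [])]
    return prompts

print(enriched_prompts("dog"))
# ['a photo of a dog', 'the head of a dog', 'the torso of a dog', ...]
```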
Medical visual question answering (Med-VQA) is a task that aims to answer clinical questions given a medical image. Existing literature generally treats it as a classic classification task based on interaction features of the image and question. However, such a paradigm ignores the valuable semantics of candidate answers as well as their relations. From a real-world dataset, we observe that: 1) the text of candidate answers has a strong intrinsic correlation with medical images; and 2) subtle differences among multiple candidate answers are crucial for identifying the correct one. Therefore, we propose an answer semantics enhanced (ASE) method to integrate the semantics of answers and capture their subtle differences. Specifically, we enhance the semantic correlation of image-question-answer triplets by aligning images and question-answer tuples within the feature fusion module. Then, we devise a contrastive learning loss to highlight the semantic differences between the correct answer and the other answers. Finally, extensive experiments demonstrate the effectiveness of our method.
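A hedged sketch of the answer-contrast idea: treat the fused image-question embedding and the correct answer's embedding as a positive pair and the remaining candidates as negatives in an InfoNCE-style loss. The temperature and embedding sizes are assumptions for illustration, not ASE's exact loss.

```python
# Contrastive loss over candidate answers: softmax over answer similarities.
import torch
import torch.nn.functional as F

def answer_contrastive_loss(fusion: torch.Tensor, answers: torch.Tensor,
                            correct_idx: int, tau: float = 0.07) -> torch.Tensor:
    # fusion: (D,) fused image-question feature; answers: (K, D) candidate embeddings
    sims = F.cosine_similarity(fusion.unsqueeze(0), answers) / tau   # (K,) scaled sims
    return F.cross_entropy(sims.unsqueeze(0),
                           torch.tensor([correct_idx]))   # softmax over candidates

loss = answer_contrastive_loss(torch.randn(256), torch.randn(4, 256), correct_idx=2)
```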
Deep learning has revolutionized the field of artificial intelligence. Based on the statistical correlations uncovered by deep learning-based methods, computer vision applications, such as autonomous driving and robotics, are growing rapidly. Despite being the basis of deep learning, such correlation strongly depends on the distribution of the original data and is susceptible to uncontrolled factors. Without the guidance of prior knowledge, statistical correlations alone cannot correctly reflect the essential causal relations and may even introduce spurious correlations. As a result, researchers are now trying to enhance deep learning-based methods with causal theory. Causal theory can model the intrinsic causal structure unaffected by data bias and effectively avoids spurious correlations. This paper comprehensively reviews the existing causal methods in typical vision and vision-language tasks such as semantic segmentation, object detection, and image captioning. The advantages of causality and the approaches for building causal paradigms are summarized. Future roadmaps are also proposed, including facilitating the development of causal theory and its application in other complex scenarios and systems.
Funding (review of visual-language large models in medical health): The Natural Science Foundation of Hebei Province (F2024501044).
Funding (VLCA, remote sensing image captioning): Supported by the National Natural Science Foundation of China (61702528, 61806212).
Funding (IQAGPT): Supported in part by the National Natural Science Foundation of China (No. 62101136), the Shanghai Sailing Program (No. 21YF1402800), and the National Institutes of Health (Nos. R01CA237267, R01HL151561, R01EB031102, and R01EB032716).
Funding (ophthalmology foundation models review): Supported by the Natural Science Foundation of China (grant number 82201195).
Funding (VLMs for human-vehicle interaction in autonomous driving): Supported by the Shanghai Municipal Science and Technology Major Project (No. 2021SHZDZX0100), the National Natural Science Foundation of China (No. 62088101), the Fundamental Research Funds for the Central Universities (No. 22120220642), and the Opening Project of the State Key Laboratory of Autonomous Intelligent Unmanned Systems (No. ZZKF2025-2-3).
Funding (VLMs in HRC for smart manufacturing): Research Institute for Advanced Manufacturing (RIAM) of The Hong Kong Polytechnic University (1-CDJT); Intra-Faculty Interdisciplinary Project 2023/24 (1-WZ4N); Research Committee of The Hong Kong Polytechnic University; State Key Laboratory of Intelligent Manufacturing Equipment and Technology, Huazhong University of Science and Technology (IMETKF2024010); Guangdong-Hong Kong Technology Cooperation Funding Scheme (GHX/075/22GD); Innovation and Technology Commission (ITC); COMAC International Collaborative Research Project (COMAC-SFGS-2023-3148); General Research Fund from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project Nos. PolyU15210222 and PolyU15206723). Open access funding provided by The Hong Kong Polytechnic University.
Funding (LVLM evaluation in specialized and general tasks): Supported by the National Natural Science Foundation of China (No. 62176169) and the Fundamental Research Funds for the Central Universities (Nankai University, 070-63243150).
Funding (UML generation and validation pipeline): Supported by the DH2025-TN07-07 project conducted at the Thai Nguyen University of Information and Communication Technology, Thai Nguyen, Vietnam, with additional support from the AI in Software Engineering Lab.
Funding (automated game QA): Supported by a grant from the Korea Creative Content Agency, funded by the Ministry of Culture, Sports and Tourism of the Republic of Korea in 2025, for the project “Development of AI-based large-scale automatic game verification technology to improve game production verification efficiency for small and medium-sized game companies” (RS 2024-00393500).
Funding (MASA flood analysis framework): Supported by The Scientific and Technological Research Council of Turkey (TUBITAK) under project number 123E669.
Funding (CLIP-IML): Supported by the Natural Science Foundation of Xinjiang Uygur Autonomous Region under Grant No. 2023D01C21 and the National Natural Science Foundation of China under Grant No. 62362063.
Funding (ASE for Med-VQA): Supported by the National Natural Science Foundation of China (Nos. 62032013 and 62102074) and the Science and Technology Projects in Liaoning Province, China (No. 2023JH3/10200005).
Funding (causal methods review): Supported by the National Natural Science Foundation of China (Grant Nos. 62233005 and 62293502), the Programme of Introducing Talents of Discipline to Universities (the 111 Project, Grant No. B17017), the Fundamental Research Funds for the Central Universities (Grant No. 222202317006), and Shanghai AI Lab.