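Abstract: Existing wildlife recognition methods rely mainly on static datasets and struggle to adapt to dynamic species migration and to the recognition of newly added categories, which leads to low monitoring efficiency. To address this problem, a multi-granularity prompt-driven wildlife recognition method (multi-granularity prompt-driven for wildlife recognition, MGP-WILD) is proposed. A cloud-hosted large language model generates hierarchical semantic descriptions (coarse-grained biological taxonomy plus fine-grained morphological features), and edge nodes collaboratively maintain a dynamic knowledge table. Specifically, MGP-WILD uses the large language model to generate multi-granularity text prompts. Compared with traditional single-granularity prompting, generating multi-granularity semantic descriptions enables deep fusion of coarse- and fine-grained features, and combining this with the cross-modal alignment capability of vision-language models yields accurate zero-shot recognition. Experimental results show that the method achieves substantial improvements on multiple datasets and is particularly adaptable in open-set recognition tasks. The system has been deployed for wildlife habitat protection in Qinghai, where an animal image dataset based on real-world scenes was constructed, providing a new technical paradigm for biodiversity conservation in ecologically fragile regions. The code and part of the dataset will be released on GitHub.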
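The abstract does not say how coarse- and fine-grained prompts are fused, so the following is only a minimal sketch under stated assumptions: each class has one coarse and one fine text embedding, the image score is a weighted sum of the two cosine similarities, and a softmax turns scores into probabilities. The function names, the `coarse_weight` parameter, the class names, and the toy 3-dimensional vectors are all hypothetical; a real system would obtain the embeddings from a vision-language model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def zero_shot_scores(image_emb, prompt_embs, coarse_weight=0.5):
    """Fuse coarse and fine prompt similarities per class (assumed fusion rule).

    prompt_embs maps a class name to a (coarse_embedding, fine_embedding) pair.
    Returns a dict of class name -> softmax probability.
    """
    fused = {}
    for name, (coarse, fine) in prompt_embs.items():
        fused[name] = (coarse_weight * cosine(image_emb, coarse)
                       + (1.0 - coarse_weight) * cosine(image_emb, fine))
    top = max(fused.values())  # subtract the max for numerical stability
    exps = {name: math.exp(score - top) for name, score in fused.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}

# Toy embeddings standing in for real text/image encoder outputs.
image = [0.9, 0.1, 0.4]
prompts = {
    "snow leopard": ([1.0, 0.0, 0.2], [0.8, 0.1, 0.6]),
    "Tibetan antelope": ([0.1, 1.0, 0.2], [0.0, 0.9, 0.5]),
}
probs = zero_shot_scores(image, prompts)
best = max(probs, key=probs.get)
```

Because new classes only require new prompt embeddings, adding a species in this sketch means adding one dictionary entry rather than retraining, which is the property a zero-shot setup relies on.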
Abstract: In multimodal learning, Vision-Language Models (VLMs) have become a critical research focus, enabling the integration of textual and visual data. These models have shown significant promise across natural language processing tasks, such as visual question answering, and computer vision applications, including image captioning and image-text retrieval, highlighting their adaptability to complex, multimodal datasets. In this work, we review the landscape of Bootstrapping Language-Image Pre-training (BLIP) and other VLM techniques. A comparative analysis assesses VLMs' strengths, limitations, and applicability across tasks while examining challenges such as scalability, data quality, and fine-tuning complexity. The work concludes by outlining potential future directions in VLM research, focusing on enhancing model interpretability, addressing ethical implications, and advancing multimodal integration in real-world applications.
Funding: Supported by the Shanghai Municipal Science and Technology Major Project (No. 2021SHZDZX0100), the National Natural Science Foundation of China (No. 62088101), the Fundamental Research Funds for the Central Universities (No. 22120220642), and the Opening Project of the State Key Laboratory of Autonomous Intelligent Unmanned Systems (No. ZZKF2025-2-3).
Abstract: This paper investigates the potential of Vision-Language Models (VLMs) to enhance Human–Vehicle Interaction (HVI) in Autonomous Driving (AD) scenarios, particularly in interactions between vehicles and other traffic participants, with a focus on rationality and safety in external HVI. Leveraging recent advancements in large language models, VLMs demonstrate remarkable capabilities in understanding real-world contexts and have generated significant interest in HVI applications. This paper provides an overview of AD, HVI, and VLMs, along with the historical context of large language model applications in HVI. The HVI discussed herein involves dynamic game processes encompassing perception and decision-making between vehicles and traffic participants, such as pedestrians. Furthermore, we examine the perceptual challenges associated with applying VLMs to HVI and compile relevant datasets. This research fills a gap in the existing literature by systematically analyzing the current status, challenges, and future opportunities of VLM applications in HVI. To advance VLM integration in AD, various implementation strategies are discussed. The findings highlight the potential of VLMs to transform HVI in AD, improving both passenger experience and driving safety. Overall, this study contributes to a comprehensive understanding of VLM applications in HVI and provides insights to guide future research and development.
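Abstract: In intelligent-driving tasks, the main technical challenge in using a vision large language model (Vision Large Language Model, VLM) for trajectory planning is how to perceive the surrounding world and handle complex tasks based on that information. Existing open-source vision large language models lack spatial priors for driving scenes in the pre-training stage, so their understanding of spatial information is markedly insufficient and they cannot be applied directly to trajectory planning. To address this, an end-to-end trajectory planning framework with a dual enhancement, "spatial question-answering fine-tuning + bird's-eye-view perception input", is proposed. The first enhancement builds a driving-scene spatial question-answering fine-tuning dataset from the dataset's available annotations, giving the 2B-parameter Qwen2-VL explicit spatial priors for obstacle category identification and for relative distance and scale estimation. The second enhancement uses the surround-view cameras to generate a dynamic bird's-eye view (Bird Eye View, BEV) in real time, completing a lightweight spatial reconstruction. Finally, the BEV image, the original surround-view frames, and the text instruction are fed together into the LoRA-fine-tuned vision large language model, which directly outputs a normalized trajectory in question-answer form. The effectiveness of the proposed method is validated on the nuScenes and NAVSIM datasets. The results show that the method has excellent trajectory planning capability in the real world, conforms more closely to real human driving habits, and generalizes across a variety of scenarios.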
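Abstract: The nanohardness of Hastelloy N alloy irradiated with xenon ions at room temperature was measured using the continuous stiffness measurement mode of a nanoindenter. The results show that the nanohardness of every irradiated sample exceeds that of the unirradiated sample, and that the nanohardness of the irradiated samples is saturated for irradiation doses in the range 0.5 to 3.0 dpa. Based on the Nix-Gao model, the indentation size effects of the unirradiated and irradiated samples were separated, and the measured nanohardness was simulated with the VLM (volume law of mixture) model. Because the plastically affected zone comes to contain both the irradiation-damaged layer and the substrate as the indentation depth increases, an "interface parameter" (χ) was introduced into the VLM model to correct the deformation of the substrate; the improved model simulates the nanoindentation results better.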
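For orientation, the two models named in this abstract are commonly written as below. This is a sketch of the standard textbook forms, not the paper's own equations: the symbols $H_0$, $h^*$, $V_{\mathrm{irr}}$, and $V_{\mathrm{sub}}$ are notation assumed here, and the abstract does not state how the interface parameter χ enters the mixture rule, so it is deliberately left out.

```latex
% Nix-Gao indentation size effect: measured hardness H at indentation depth h,
% with H_0 the hardness at infinite depth and h^* a characteristic length.
H(h) = H_0 \sqrt{1 + \frac{h^{*}}{h}}

% Volume law of mixture for a damaged layer on a substrate: the composite
% hardness is the volume-weighted average over the plastically affected zone.
H_{\mathrm{c}} = \frac{V_{\mathrm{irr}}}{V}\, H_{\mathrm{irr}}
               + \frac{V_{\mathrm{sub}}}{V}\, H_{\mathrm{sub}},
\qquad V = V_{\mathrm{irr}} + V_{\mathrm{sub}}
```

In this form, plotting $H^2$ against $1/h$ gives a straight line with intercept $H_0^2$, which is the usual way the size effect is separated from the bulk hardness.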
Abstract: We present a novel framework, CLIP-SP, and a novel adaptive prompt method to leverage pre-trained knowledge from CLIP for scene parsing. Our approach addresses the limitations of DenseCLIP, which demonstrates the superior image segmentation provided by CLIP pre-trained models over ImageNet pre-trained models, but struggles with rough pixel-text score maps for complex scene parsing. We argue that, as they contain all textual information in a dataset, the pixel-text score maps, i.e., dense prompts, are inevitably mixed with noise. To overcome this challenge, we propose a two-step method. First, we extract visual and language features and perform multi-label classification to identify the most likely categories in the input images. Second, based on the top-k categories and confidence scores, our method generates scene tokens, which can be treated as adaptive prompts for implicit modeling of scenes, and incorporates them into the visual features fed into the decoder for segmentation. Our method imposes a constraint on prompts and suppresses the probability of irrelevant categories appearing in the scene parsing results. Our method achieves competitive performance, limited by the available visual-language pre-trained models. Our CLIP-SP performs 1.14% better (in terms of mIoU) than DenseCLIP on ADE20K, using a ResNet-50 backbone.
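The first step of the two-step method, picking the top-k categories from image-level confidence scores, can be sketched as follows. The second function is a deliberately simplified stand-in: the abstract describes learned scene tokens that act as implicit prompts, whereas this sketch uses a hard mask only to illustrate the stated effect of suppressing irrelevant categories. All names, scores, and category labels are hypothetical.

```python
def top_k_categories(scores, k):
    """Return the k (category, confidence) pairs with the highest confidence."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

def suppress_irrelevant(pixel_scores, kept, floor=float("-inf")):
    """Hard-mask per-pixel category scores so only kept categories can win.

    This is an illustrative simplification, not the scene-token mechanism.
    """
    return {c: (s if c in kept else floor) for c, s in pixel_scores.items()}

# Image-level multi-label confidences (toy values).
image_scores = {"wall": 0.91, "floor": 0.84, "sky": 0.05, "bed": 0.62}
top = top_k_categories(image_scores, k=2)
kept = {category for category, _ in top}

# Noisy pixel-text scores for one pixel: "sky" wins without the constraint.
pixel = {"wall": 1.2, "floor": 0.7, "sky": 1.5, "bed": 0.9}
masked = suppress_irrelevant(pixel, kept)
prediction = max(masked, key=masked.get)
```

Without the mask the pixel would be labeled "sky" even though the image-level classifier considers that category very unlikely, which is the kind of noise the abstract attributes to dense prompts.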
Funding: Supported by the National Natural Science Foundation of China under Grant No. 62402490 and the Guangdong Basic and Applied Basic Research Foundation of China under Grant No. 2025A1515010101.
Abstract: Vision-language models (VLMs) have shown strong open-vocabulary learning abilities in various video understanding tasks. However, existing open-vocabulary temporal action detection (OV-TAD) methods often generalize poorly to unseen action categories because they rely on visual features. In this paper, we propose a novel framework, Concept-Guided Semantic Projection (CSP), to enhance the generalization ability of OV-TAD methods. By projecting video features into a unified action concept space, CSP enables the use of abstracted action concepts for action detection, rather than solely relying on visual details. To further improve feature consistency across action categories, we introduce a mutual contrastive loss (MCL), ensuring semantic coherence and better feature discrimination. Extensive experiments on the ActivityNet and THUMOS14 benchmarks demonstrate that our method outperforms state-of-the-art OV-TAD methods. Code and data are available at Concept-Guided-OV-TAD.
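The abstract does not define the mutual contrastive loss, so the sketch below is not MCL itself. It shows a generic symmetric InfoNCE-style loss over a video-to-concept similarity matrix, which is one common way to pull matched pairs together in both directions. The function names, the `temperature` value, and the toy matrices are assumptions made for illustration.

```python
import math

def _log_softmax_at(row, idx):
    """Log-softmax of row evaluated at position idx (numerically stable)."""
    top = max(row)
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in row))
    return row[idx] - log_sum_exp

def symmetric_contrastive_loss(sim, temperature=0.1):
    """Average of video-to-concept and concept-to-video cross-entropy.

    sim[i][j] is the similarity between video feature i and concept j;
    matched pairs are assumed to lie on the diagonal.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    columns = [list(col) for col in zip(*scaled)]
    video_to_concept = -sum(_log_softmax_at(scaled[i], i) for i in range(n)) / n
    concept_to_video = -sum(_log_softmax_at(columns[i], i) for i in range(n)) / n
    return 0.5 * (video_to_concept + concept_to_video)

aligned = [[0.9, 0.1], [0.2, 0.8]]   # matched pairs clearly dominate
confused = [[0.5, 0.5], [0.5, 0.5]]  # features cannot tell concepts apart
loss_aligned = symmetric_contrastive_loss(aligned)
loss_confused = symmetric_contrastive_loss(confused)
```

With uninformative similarities the loss equals ln 2 for two classes, and it falls toward zero as matched pairs separate from mismatched ones.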