Vision-language Model Guided Adaptive Balanced Optimization Framework for Roadside Multi-modal Perception


Abstract: With the growing demand for comprehensive perception in vehicle-infrastructure cooperative systems, roadside multi-modal perception has become a key approach to overcoming the limitations of onboard sensing. This paper proposes an adaptive balanced optimization framework for multi-modal perception, guided by the semantics of a vision-language model (VLM), to improve the performance of roadside sensing systems. The framework introduces a dynamic weight allocation module that achieves spatially adaptive multi-modal fusion through cross-modal attention and frame-level residual modeling. To address the convergence imbalance among modalities, a gradient-sensitive asynchronous optimizer finely regulates the learning rate of each modality. In addition, a lightweight gated scheduling mechanism dynamically triggers VLM calibration based on modality states and scene semantic entropy, thereby reducing computational overhead. Experimental results show that the proposed method achieves 3D object detection mAPs of 79.20% and 80.16% on the DAIR-V2X-I and RCooper datasets, respectively, outperforming the compared methods of the same type by an average of 3.9% (up to 7.51%). Meanwhile, the gated scheduling mechanism reduces the average VLM invocation frequency by 41.2%, effectively cutting redundant computation, while the overall GPU memory usage increases by only about 4.0% compared with the baseline model. This work provides a novel, efficient, and scalable solution for advancing intelligent perception in vehicle-infrastructure cooperative systems.

Authors: ZHANG Guo-yu (张国宇); CHEN Qian (陈前); SUN Jian (孙剑); HANG Peng (杭鹏) (Key Laboratory of Road and Traffic Engineering of Ministry of Education, Tongji University, Shanghai 201804, China)

Affiliation: Key Laboratory of Road and Traffic Engineering of Ministry of Education, Tongji University

Source: China Journal of Highway and Transport (《中国公路学报》), 2026, No. 3, pp. 88-100 (13 pages)

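Funding: Open Fund of the State Key Laboratory of Vehicle-Road Integrated Intelligent Transportation (2024-A002); National Science Fund for Distinguished Young Scholars of the National Natural Science Foundation of China (52302502); Shanghai 2023 "Science and Technology Innovation Action Plan" Social Development Science and Technology Research Project (23DZ1203400).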

Keywords: traffic engineering; roadside multi-modal perception; multi-modal fusion; vision-language model (VLM); dynamic optimization; gating scheduling

Classification number: U495 [Transportation Engineering]
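
Illustrative note: the abstract says the gated scheduling mechanism triggers VLM calibration based on modality states and scene semantic entropy, but this record gives no formulas. The sketch below is only one plausible reading, not the authors' method. It assumes the scene semantics are available as a probability distribution over scene labels and that each modality's state can be summarized by a single confidence score. The function names `semantic_entropy` and `should_call_vlm`, the two thresholds, and the OR-combination of the two conditions are all hypothetical.

```python
import math


def semantic_entropy(probs):
    """Shannon entropy (in nats) of a distribution over scene labels.

    The input is normalized first, so unnormalized scores are accepted.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0:  # skip zero-probability labels, which contribute nothing
            entropy -= q * math.log(q)
    return entropy


def should_call_vlm(scene_probs, modality_confidences,
                    entropy_threshold=1.0, confidence_threshold=0.5):
    """Hypothetical gate: return True when VLM calibration should run.

    Assumption (not from the paper): calibration is triggered when the
    scene is semantically ambiguous (high entropy) OR when the weakest
    modality falls below a confidence threshold.
    """
    ambiguous = semantic_entropy(scene_probs) > entropy_threshold
    degraded = min(modality_confidences.values()) < confidence_threshold
    return ambiguous or degraded


# A clear scene with healthy sensors skips the VLM call.
clear = should_call_vlm([0.97, 0.01, 0.01, 0.01],
                        {"camera": 0.9, "lidar": 0.8})

# An ambiguous scene (uniform label distribution) triggers it.
ambiguous = should_call_vlm([0.25, 0.25, 0.25, 0.25],
                            {"camera": 0.9, "lidar": 0.8})
```

Under these assumptions the VLM is skipped on frames that are both unambiguous and well sensed, which is the kind of saving the reported 41.2% drop in invocation frequency points to. The thresholds here are placeholders and would need tuning on real data.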
