Vision-language Model Guided Adaptive Balanced Optimization Framework for Roadside Multi-modal Perception
Abstract: With the growing demand for comprehensive perception in vehicle-infrastructure cooperative systems, roadside multi-modal perception has become a key approach to overcoming the limitations of onboard sensing. This paper proposes an adaptive balanced optimization framework for multi-modal perception, guided by the semantics of a vision-language model (VLM), to improve the performance of roadside sensing systems. The framework introduces a dynamic weight allocation module that achieves spatially adaptive multi-modal fusion through cross-modal attention and frame-level residual modeling. To address the convergence imbalance among modalities, a gradient-sensitive asynchronous optimizer finely regulates modality-specific learning rates. In addition, a lightweight gated scheduling mechanism dynamically triggers VLM calibration based on modality states and scene semantic entropy, thereby reducing computational overhead. Experimental results show that the proposed method achieves 3D object detection mAPs of 79.20% and 80.16% on the DAIR-V2X-I and RCooper datasets, respectively, outperforming the comparable methods evaluated by an average of 3.9% (up to 7.51%). Meanwhile, the gated scheduling mechanism reduces the average VLM invocation frequency by 41.2%, effectively cutting redundant computation, while overall GPU memory usage increases by only about 4.0% compared with the baseline model. This work provides a novel, efficient, and scalable solution for advancing intelligent perception in vehicle-infrastructure cooperative systems.
Authors: ZHANG Guo-yu; CHEN Qian; SUN Jian; HANG Peng (Key Laboratory of Road and Traffic Engineering of Ministry of Education, Tongji University, Shanghai 201804, China)
Source: China Journal of Highway and Transport (《中国公路学报》), 2026, Issue 3, pp. 88-100 (13 pages)
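Funding: Open Fund Project of the State Key Laboratory of Vehicle-Road Integrated Intelligent Transportation (2024-A002); National Natural Science Foundation of China, Science Fund for Distinguished Young Scholars (52302502); Shanghai 2023 "Science and Technology Innovation Action Plan" Social Development Science and Technology Research Project (23DZ1203400).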
Keywords: traffic engineering; roadside multi-modal perception; multi-modal fusion; vision-language model (VLM); dynamic optimization; gating scheduling
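
Illustrative sketch (not from the paper): The abstract names two mechanisms without giving their formulas: a gate that triggers VLM calibration based on modality states and scene semantic entropy, and an optimizer that adjusts learning rates per modality. The Python below is a minimal sketch of how such ideas could look, not the authors' implementation. Every function name, threshold, and the inverse-gradient-norm scaling rule is an assumption introduced here for illustration only.

```python
import math


def semantic_entropy(probs):
    """Shannon entropy (in nats) of a scene-level class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_call_vlm(probs, modality_confidences,
                    entropy_threshold=1.0, confidence_floor=0.5):
    """Hypothetical gate: request VLM calibration only when the scene is
    semantically ambiguous or some modality looks unreliable.

    The thresholds are placeholder values, not taken from the paper.
    """
    ambiguous = semantic_entropy(probs) > entropy_threshold
    degraded = min(modality_confidences.values()) < confidence_floor
    return ambiguous or degraded


def scaled_learning_rates(base_lr, grad_norms, eps=1e-8):
    """Hypothetical per-modality rule: modalities whose gradient norm is
    below the mean get a larger step, those above the mean a smaller one.

    This is one simple way to counter uneven convergence; the paper's
    gradient-sensitive asynchronous optimizer may work differently.
    """
    mean_norm = sum(grad_norms.values()) / len(grad_norms)
    return {name: base_lr * mean_norm / (norm + eps)
            for name, norm in grad_norms.items()}


if __name__ == "__main__":
    uniform_scene = [0.25, 0.25, 0.25, 0.25]
    healthy = {"camera": 0.9, "lidar": 0.8}
    print(should_call_vlm(uniform_scene, healthy))
    print(scaled_learning_rates(0.01, {"camera": 2.0, "lidar": 1.0}))
```

Under these assumptions, the gate skips the VLM for confident, low-entropy frames, which is the kind of behavior that would lower the invocation frequency the abstract reports. In a real training loop the per-modality rates would typically be applied through separate optimizer parameter groups.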
