Progressive dual-stage modality interaction for single-domain generalized object detection


Abstract: Existing vision-language-based single-domain generalization models rely on fixed, unidirectional text guidance for local visual alignment, which limits their ability to model local-global context. To address this problem, a Progressive Dual-stage Modality Interaction (PDMI) framework was proposed. In PDMI, global domain-invariant features were extracted hierarchically within each modality, and the complementary semantics of the visual and textual modalities were fully exploited to capture fine-grained semantic knowledge. Firstly, fixed domain-agnostic prompts were combined with learnable Adaptive Domain Prompts (ADP) to guide samples toward semantic awareness of specific domains. At the same time, on top of the ResNet-101 visual backbone, a Multi-level Intra-Modality Interaction (MIMI) module was designed, in which Intra-Modality Mamba Interaction (IMMI) was performed on source-domain images under the guidance of adaptive visual prompts to extract global domain-invariant features, thereby improving the distribution of visual representations. Then, a Cross-Modality Bidirectional Interaction and Fusion (CMBIF) mechanism was adopted to extract and align fine-grained cross-modality features, realizing fine-grained inter-modality interaction through bidirectional visual and textual guidance. Finally, a Cross-Modality Adaptive Fusion (CMAF) module was employed to search automatically for the optimal combination of inter-modality information, further reducing redundant features in the interaction between modalities. Experiments were conducted on three challenging domain-shift datasets: Diverse Weather, Virtual-to-Reality, and UAV-OD. The results show that the mean Precision on the Target domain (mPT) of PDMI is higher than that of C-Gap, SRCD (Semantic Reasoning with Compound Domains), and FDD (Frequency Domain Disentanglement) by an average of 2.0, 4.0, and 4.2 percentage points, respectively. PDMI can therefore extract global-local domain-invariant features effectively and improve generalization to unseen target domains, which is essential in scenarios with a significant distribution shift between the source and target domains and limited target-domain data.
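The bidirectional interaction and fusion steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no equations, so the sketch assumes plain scaled dot-product cross-attention without learned projections for the bidirectional guidance, and a single scalar sigmoid gate as a stand-in for the adaptive fusion step. All function names and the `gate_logit` parameter are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, keys_values):
    """Replace each query vector with an attention-weighted mix of the
    other modality's vectors (scaled dot-product, no learned weights)."""
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        mixed = [
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(len(q))
        ]
        attended.append(mixed)
    return attended


def gated_fusion(original, attended, gate_logit):
    """Blend original and cross-attended features with a sigmoid gate.
    Assumption: one scalar gate; a trained model would learn it."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [
        [g * a + (1.0 - g) * o for o, a in zip(o_row, a_row)]
        for o_row, a_row in zip(original, attended)
    ]


def bidirectional_interaction(visual, text, gate_logit=0.0):
    """Text-guided visual features and visual-guided text features."""
    text_guided_visual = cross_attend(visual, text)   # visual queries, text keys
    visual_guided_text = cross_attend(text, visual)   # text queries, visual keys
    return (
        gated_fusion(visual, text_guided_visual, gate_logit),
        gated_fusion(text, visual_guided_text, gate_logit),
    )


# Toy input: 4 visual tokens and 3 text tokens, 8 dimensions each.
random.seed(0)
visual = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
text = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]

fused_visual, fused_text = bidirectional_interaction(visual, text)
print(len(fused_visual), len(fused_visual[0]))  # 4 8
print(len(fused_text), len(fused_text[0]))      # 3 8
```

Each modality keeps its own token count after the interaction, and only the content of each token changes. A strongly negative `gate_logit` leaves the original features almost untouched, while a strongly positive one replaces them with the cross-attended features.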

Authors: ZHANG Yongbing; YAN Lirong; TANG Xiaofen (School of Information Engineering, Ningxia University, Yinchuan, Ningxia 750021, China; Ningxia Key Laboratory of Artificial Intelligence and Information Security for Channeling Computing Resources from the East to the West (Ningxia University), Yinchuan, Ningxia 750021, China)

Affiliation: School of Information Engineering, Ningxia University

Source: Journal of Computer Applications (《计算机应用》), 2026, No. 4, pp. 1264-1274 (11 pages)

Funding: National Natural Science Foundation of China (61966029).

Keywords: single-domain generalized object detection; Vision-Language Model (VLM); prompt learning; multimodal fusion

Classification number: TP391.41 [Automation and Computer Technology]
