Large-model driven test-time adaptation for multi-modal point cloud semantic segmentation

Abstract

Objective  Point cloud semantic segmentation assigns a semantic label to every point in 3D space. However, segmentation performance often deteriorates when models are applied to unseen domains because of domain shift, that is, differences in data distribution between the source and target domains. Test-time adaptation (TTA) mitigates this problem by fine-tuning a source-trained model online on unlabeled target-domain data. Unlike traditional unsupervised domain adaptation (UDA), TTA requires no access to labeled source-domain data, which makes it a privacy-preserving and practical solution. Conventional TTA methods nevertheless struggle to precisely handle the spatial continuity and local structural constraints of point clouds, which limits their adaptation performance. Recent methods therefore pair point clouds with 2D images to exploit cross-modal information, enriching feature representations and the statistics used for pseudo-label generation. Under domain shift, alignment deviations between the modalities can mislead the model into incorrect predictions, and modality-specific domain gaps can further cause inconsistent understanding and fragmented semantic predictions, degrading point cloud semantic segmentation. Pre-trained foundation models such as contrastive language-image pre-training (CLIP) and the Segment Anything Model (SAM), with their rich prior knowledge, have shown promise for improving semantic segmentation under domain adaptation and motivate the method proposed here.

Method  This study proposes a foundation-model-driven TTA method that incorporates knowledge from pre-trained models to improve multi-modal domain adaptation. The approach has two main components. The first is a CLIP-based text-embedded segmentation layer: text embeddings generated by the CLIP text encoder for the target-domain categories replace the original segmentation layer, injecting vision-text prior knowledge into the per-point prediction and enhancing the semantic generalization of the model without any additional labeled data. Because CLIP aligns text and image feature spaces through contrastive learning, it transfers semantic knowledge learned from large-scale data to the domain-specific task (a minimal sketch follows this paragraph). The second is a SAM-mask-based feature consistency module: SAM generates region masks that are semantically consistent across the point cloud and image modalities, and enforcing feature consistency within each mask alleviates the feature discontinuity and semantic fragmentation caused by alignment errors, keeping the two modalities coherent throughout the adaptation process and improving segmentation under complex domain shifts (a sketch of this constraint appears after the abstract).
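
To make the first component concrete, below is a minimal sketch of a CLIP text-embedded segmentation head. It assumes the OpenAI `clip` package, a generic prompt template, and a hypothetical target-domain class list; the paper's actual prompts, backbone, and the projection of point features into CLIP's embedding space are not specified in this abstract.

```python
import torch
import clip  # OpenAI CLIP (https://github.com/openai/CLIP); assumed dependency

# Hypothetical target-domain categories; the paper's exact class list is not given here.
CLASS_NAMES = ["car", "truck", "bicycle", "person", "road", "sidewalk",
               "building", "vegetation", "terrain", "other"]


@torch.no_grad()
def build_text_classifier(class_names, device="cpu"):
    """Encode one prompt per class with CLIP's text encoder and L2-normalize.

    The resulting (num_classes, 512) matrix acts as the fixed weights of the
    segmentation layer that replaces the model's original classification head.
    """
    model, _ = clip.load("ViT-B/32", device=device)
    tokens = clip.tokenize([f"a photo of a {name}" for name in class_names]).to(device)
    text_emb = model.encode_text(tokens).float()
    return text_emb / text_emb.norm(dim=-1, keepdim=True)


def text_embedded_logits(point_feats, text_emb, temperature=0.07):
    """Per-point class logits as scaled cosine similarity.

    point_feats: (N, 512) per-point (or per-pixel) features, already projected
                 into CLIP's embedding dimension by the segmentation backbone
                 (that projection layer is assumed, not shown here).
    """
    feats = point_feats / point_feats.norm(dim=-1, keepdim=True)
    return feats @ text_emb.t() / temperature  # (N, num_classes)
```

At test time, predictions and pseudo-labels come from the argmax over these logits; because the class weights are text embeddings rather than weights learned on the source domain, the head generalizes more readily to the target-domain categories.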
Result  The method is evaluated on three multi-modal domain adaptation scenarios built from the nuScenes, A2D2, and SemanticKITTI datasets, covering cross-dataset, cross-location, and cross-time (lighting) domain shifts, and is compared with state-of-the-art TTA and UDA approaches. The proposed method is effective across all scenarios, with the largest gains in the dataset-to-dataset setting, and it also outperforms current state-of-the-art models in the location-to-location and time-to-time settings. In the A2D2-to-SemanticKITTI setting, it achieves a mean intersection over union (mIoU) of 48.3% for 2D adaptation, surpassing the state-of-the-art method by 4.6%, and an mIoU of 42.9% for 3D adaptation, showing that the two proposed modules improve both 2D and 3D semantic segmentation. Compared with UDA methods, which require both source- and target-domain data, the proposed method remains competitive, and even surpasses some UDA methods, while using only unlabeled target-domain data, demonstrating its practicality in real-world scenarios where source data are inaccessible for privacy reasons. Qualitative results further show substantially more accurate and coherent segmentation in complex environments. Ablation studies validate both modules: the text-embedded segmentation layer mainly benefits 2D performance, raising the mIoU from 45.8% to 48.1%, whereas the mask-based feature consistency module has a larger effect on 3D performance, raising the mIoU from 40.2% to 41.8%. Combining the two modules yields the best performance, with mIoUs of 48.3% for 2D and 42.9% for 3D, confirming that their combination is crucial for optimal multi-modal domain adaptation.

Conclusion  This study proposes a TTA method for point cloud semantic segmentation guided by the knowledge of vision foundation models. By integrating vision-text information with a local feature consistency constraint, it substantially improves generalization across diverse scenarios, and extensive experiments on three real-world adaptation scenarios demonstrate its superior performance.
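
As referenced in the Method section, the SAM-mask-based feature consistency constraint can be sketched as follows. The sketch assumes that SAM region masks (e.g., from `SamAutomaticMaskGenerator` in the `segment_anything` package) have already been generated on the paired image and that each projected point has been assigned the integer id of the region it falls in. The exact form of the paper's constraint (distance metric, stop-gradient choice, weighting) is not given in the abstract, so a cosine-similarity pull toward each region's mean feature is used purely for illustration.

```python
import torch
import torch.nn.functional as F


def mask_consistency_loss(feats, region_ids):
    """Encourage features that fall inside the same SAM region to agree.

    feats:      (N, D) point or pixel features for points that project into
                some SAM mask.
    region_ids: (N,) integer id of the SAM region each feature belongs to;
                points outside every mask should be filtered out beforehand.
    """
    loss = feats.new_tensor(0.0)
    regions = region_ids.unique()
    for rid in regions:
        sel = region_ids == rid
        if sel.sum() < 2:          # a single point carries no consistency signal
            continue
        region_feats = feats[sel]
        center = region_feats.mean(dim=0, keepdim=True).detach()  # region prototype
        # Pull every member toward the (detached) prototype of its region.
        loss = loss + (1.0 - F.cosine_similarity(region_feats, center, dim=-1)).mean()
    return loss / max(regions.numel(), 1)
```

During test-time adaptation, one plausible recipe is to minimize this term together with a cross-entropy loss on pseudo-labels from the text-embedded head, updating only lightweight parameters such as normalization layers; the abstract does not spell out the optimization schedule, so this combination should be read as an assumption rather than the authors' exact procedure.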
Authors  刘雪帆 (Liu Xuefan), 刘砚 (Liu Yan), 李浩然 (Li Haoran), 张晔 (Zhang Ye), 郭裕兰 (Guo Yulan) (School of Electronic and Communication Engineering, Sun Yat-sen University, Shenzhen 518107, China)
Source  Journal of Image and Graphics (中国图象图形学报), 2025, No. 11, pp. 3651-3664 (14 pages)
Funding  National Natural Science Foundation of China (U20A20185, 62372491); Guangdong Basic and Applied Basic Research Foundation (2022B1515020103, 2023B1515120087); Shenzhen Science and Technology Program (RCYX20200714114641140)
Keywords  point cloud; semantic segmentation; test-time adaptation (TTA); vision foundation model; multi-modality
