
TMamba: a visual state-space model for efficient object tracking
Abstract

Objective  The emergence of Transformer models has revolutionized the field of object tracking, significantly enhancing the accuracy and robustness of these models. Transformers, with their self-attention mechanisms, have been demonstrated to capture long-range dependencies and complex relationships within data, making them a powerful tool for various computer vision tasks, including object tracking. However, a critical drawback of Transformer-based object tracking models is their computational complexity, which scales quadratically with the length of the input sequence. This characteristic imposes a substantial computational burden, particularly in practical scenarios where efficiency is paramount. Real-world applications require models that not only perform well but also operate with minimal computational cost, fewer parameters, and fast response times. However, the high computational demands and parameter counts of Transformer-based models render them less suitable for these applications. Moreover, Transformer-based object tracking models typically exhibit high memory consumption, which poses an additional challenge to video-level object tracking tasks. High memory usage restricts the number of video frames that can be processed simultaneously, limiting the ability to capture sufficient temporal information required for effective tracking. This limitation hinders the development of video-level tracking models, because the inability to sample sufficient frames can lead to suboptimal performance and reduced tracking accuracy. To address these challenges, this study introduces a novel object tracking model based on visual state-space models.

Method  Building upon the visual Mamba framework, we propose the TMamba algorithm, which leverages the strengths of state-space models for object tracking. The TMamba model offers a promising alternative to Transformer-based tracking models by achieving superior performance with significantly reduced computational load and memory usage. This reduction is crucial for enabling the deployment of object tracking models in resource-constrained environments, such as edge devices and real-time systems. The core component of TMamba is the feature fusion module, which is designed to integrate information from different feature hierarchies within the network. In particular, the feature fusion module combines the rich semantic information from deep features with the detailed, high-resolution information from shallow features. By fusing these features, the module produces a multilevel representation that provides the prediction head with more accurate and comprehensive information, leading to improved prediction accuracy. A key innovation of TMamba is the introduction of a dual image scanning strategy, which addresses the unique challenges of adapting visual state-space models to the tracking domain. In visual state-space models, the approach for scanning images is crucial, because it directly affects the model's ability to process and interpret visual data. In contrast with classification and detection tasks, in which a single image is fed into the network, object tracking requires the simultaneous processing of multiple images, typically a template and a search region. How these images are scanned and fed into the network is a critical factor that determines the model's performance. Our proposed dual image scanning strategy involves jointly scanning the template and search region images, allowing the visual state-space model to better accommodate the specific requirements of object tracking. This strategy enhances the model's ability to learn spatial and temporal dependencies across frames, leading to more accurate and reliable tracking.

Result  To evaluate the effectiveness of the proposed TMamba algorithm, we developed a series of object tracking models based on state-space models and conducted extensive experiments on seven benchmark datasets. These datasets include LaSOT, TrackingNet, and GOT-10k, which are widely used in the object tracking community for performance evaluation. The results demonstrate that TMamba consistently achieves outstanding performance across all the datasets, with significant reductions in computational cost and parameter count compared with Transformer-based models. For example, TMamba-B, one of the configurations of our model, achieves a 66% area under the curve (AUC) score on the LaSOT dataset, an 82.3% AUC score on TrackingNet, and a 72% AUC score on GOT-10k. These results not only surpass those of many Transformer-based models but also highlight the efficiency of TMamba in terms of computational resources. TMamba-B contains only 50.7 million parameters and requires only 14.2 GFLOPs for processing, making it one of the most efficient models in its class. This efficiency is achieved without compromising accuracy, demonstrating the potential of state-space models in high-performance object tracking. Further analysis of the experimental results reveals several key insights. First, the feature fusion module plays a crucial role in enhancing the model's performance by effectively combining information from different feature levels. This fusion allows TMamba to leverage the strengths of deep and shallow features, resulting in a more robust representation that is well suited for tracking diverse objects under various conditions. Second, the dual image scanning strategy proves to be highly effective in bridging the gap between visual state-space models and the tracking domain. By jointly scanning the template and search region images, this strategy enables TMamba to better capture spatial and temporal relationships, which are essential for accurate tracking.

Conclusion  This study introduces the TMamba algorithm and investigates the feasibility of employing state-space models in the domain of object tracking. The results demonstrate that TMamba not only matches but in some cases even surpasses the performance of Transformer-based object tracking models across multiple datasets. Moreover, TMamba achieves these results with a significantly reduced parameter count and lower computational complexity, making it a more practical choice for real-world applications. The characteristics of TMamba, namely its low parameter count, minimal computational demands, and reduced memory usage, suggest that it exhibits considerable potential to advance the practical application of object tracking models. By addressing the limitations of existing Transformer-based approaches, TMamba paves the way for the development of more efficient and scalable video-level object tracking solutions.
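The two mechanisms described in the abstract can be illustrated with a toy numpy sketch. Everything here is an assumption for illustration only: the function names (ssm_scan, dual_image_scan, fuse_features), the diagonal recurrence, the bidirectional forward/backward pass, and the nearest-neighbor fusion are simple stand-ins, not the paper's actual architecture. The sketch only conveys two ideas: an SSM scan costs time linear in sequence length (versus attention's quadratic cost), and jointly scanning template and search-region tokens lets the search tokens condition on template state.

```python
import numpy as np

rng = np.random.default_rng(0)

def ssm_scan(x, A, B, C):
    """Toy linear-time state-space scan: h_t = A * h_{t-1} + B x_t, y_t = C h_t.
    x: (L, D) token sequence; A: (N,) diagonal decay; B: (N, D); C: (D, N).
    One pass over L tokens -> O(L), unlike O(L^2) self-attention."""
    L, D = x.shape
    h = np.zeros(A.shape[0])
    y = np.empty((L, D))
    for t in range(L):
        h = A * h + B @ x[t]
        y[t] = C @ h
    return y

def dual_image_scan(template_tokens, search_tokens, A, B, C):
    """Stand-in for the dual image scanning idea: template and search-region
    tokens are concatenated and scanned as ONE sequence (forward and backward),
    so search tokens are processed with template information in the state."""
    seq = np.concatenate([template_tokens, search_tokens], axis=0)
    fwd = ssm_scan(seq, A, B, C)
    bwd = ssm_scan(seq[::-1], A, B, C)[::-1]
    return fwd + bwd

def fuse_features(shallow, deep):
    """Stand-in for the feature fusion module: upsample the low-resolution
    deep (semantic) map to the shallow (detail) resolution and sum them."""
    Hs, Ws, _ = shallow.shape
    Hd, Wd, _ = deep.shape
    up = deep.repeat(Hs // Hd, axis=0).repeat(Ws // Wd, axis=1)
    return shallow + up

# Illustrative sizes: 16 template patches, 64 search-region patches, dim 8.
D, N = 8, 4
A = rng.uniform(0.5, 0.9, N)
B = rng.standard_normal((N, D)) * 0.1
C = rng.standard_normal((D, N)) * 0.1
template = rng.standard_normal((16, D))
search = rng.standard_normal((64, D))
out = dual_image_scan(template, search, A, B, C)
print(out.shape)  # (80, 8): one output per input token, linear in length

fused = fuse_features(rng.standard_normal((16, 16, 8)),
                      rng.standard_normal((4, 4, 8)))
print(fused.shape)  # (16, 16, 8): detail resolution, semantics mixed in
```

The joint sequence is what distinguishes this from single-image scanning: a per-image scan would reset the hidden state between template and search region, discarding exactly the cross-image dependency that tracking needs.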
Authors  Kang Ben; Chen Xin; Zhao Jie; Wang Dong (School of Information and Communication Engineering, Dalian University of Technology, Dalian 116024, China)
Source  Journal of Image and Graphics (中国图象图形学报), Peking University Core Journal, 2025, No. 10, pp. 3199-3214 (16 pages)
Funding  National Natural Science Foundation of China (U23A20384, 62402084); China Postdoctoral Science Foundation (2024M750319)
Keywords  single-object tracking; state space model (SSM); multi-scale feature fusion; sequential training; memory-efficient model