
TMamba: a visual state-space model for efficient object tracking
Abstract

Objective  The emergence of Transformer models has revolutionized the field of object tracking, significantly enhancing the accuracy and robustness of these models. Transformers, with their self-attention mechanisms, have been demonstrated to capture long-range dependencies and complex relationships within data, making them a powerful tool for various computer vision tasks, including object tracking. However, a critical drawback of Transformer-based object tracking models is their computational complexity, which scales quadratically with the length of the input sequence. This characteristic imposes a substantial computational burden, particularly in practical scenarios where efficiency is paramount. Real-world applications require models that not only perform well but also operate with minimal computational cost, fewer parameters, and fast response times. However, the high computational demands and parameter counts of Transformer-based models render them less suitable for these applications. Moreover, Transformer-based object tracking models typically exhibit high memory consumption, which poses an additional challenge to video-level object tracking tasks. High memory usage restricts the number of video frames that can be processed simultaneously, limiting the ability to capture sufficient temporal information required for effective tracking. This limitation hinders the development of video-level tracking models, because the inability to sample sufficient frames can lead to suboptimal performance and reduced tracking accuracy. To address these challenges, this study introduces a novel object tracking model based on visual state-space models.

Method  Building upon the visual Mamba framework, we propose the TMamba algorithm, which leverages the strengths of state-space models for object tracking. The TMamba model offers a promising alternative to Transformer-based tracking models by achieving superior performance with significantly reduced computational load and memory usage. This reduction is crucial for enabling the deployment of object tracking models in resource-constrained environments, such as edge devices and real-time systems. The core component of TMamba is the feature fusion module, which is designed to integrate information from different feature hierarchies within the network. In particular, the feature fusion module combines the rich semantic information from deep features with the detailed, high-resolution information from shallow features. By fusing these features, the module produces a multilevel representation that provides the prediction head with more accurate and comprehensive information, leading to improved prediction accuracy. A key innovation of TMamba is the introduction of a dual image scanning strategy, which addresses the unique challenges of adapting visual state-space models to the tracking domain. In visual state-space models, the approach for scanning images is crucial, because it directly affects the model's ability to process and interpret visual data. In contrast with classification and detection tasks, in which a single image is fed into the network, object tracking requires the simultaneous processing of multiple images, typically a template and a search region. How these images are scanned and fed into the network is a critical factor that determines the model's performance. Our proposed dual image scanning strategy involves jointly scanning the template and search region images, allowing the visual state-space model to better accommodate the specific requirements of object tracking. This strategy enhances the model's ability to learn spatial and temporal dependencies across frames, leading to more accurate and reliable tracking.

Result  To evaluate the effectiveness of the proposed TMamba algorithm, we developed a series of object tracking models based on state-space models and conducted extensive experiments on seven benchmark datasets. These datasets include LaSOT, TrackingNet, and GOT-10k, which are widely used in the object tracking community for performance evaluation. The results demonstrate that TMamba consistently achieves outstanding performance across all the datasets, with significant reductions in computational cost and parameter count compared with Transformer-based models. For example, TMamba-B, one of the configurations of our model, achieves a 66% area under the curve (AUC) score on the LaSOT dataset, an 82.3% AUC score on TrackingNet, and a 72% AUC score on GOT-10k. These results not only surpass those of many Transformer-based models but also highlight the efficiency of TMamba in terms of computational resources. TMamba-B contains only 50.7 million parameters and requires only 14.2 GFLOPs for processing, making it one of the most efficient models in its class. This efficiency is achieved without compromising accuracy, demonstrating the potential of state-space models in high-performance object tracking. Further analysis of the experimental results reveals several key insights. First, the feature fusion module plays a crucial role in enhancing the model's performance by effectively combining information from different feature levels. This fusion allows TMamba to leverage the strengths of deep and shallow features, resulting in a more robust representation that is well suited for tracking diverse objects under various conditions. Second, the dual image scanning strategy proves to be highly effective in bridging the gap between visual state-space models and the tracking domain. By jointly scanning the template and search region images, this strategy enables TMamba to better capture spatial and temporal relationships, which are essential for accurate tracking.

Conclusion  This study introduces the TMamba algorithm and investigates the feasibility of employing state-space models in the domain of object tracking. The results demonstrate that TMamba not only matches but in some cases even surpasses the performance of Transformer-based object tracking models across multiple datasets. Moreover, TMamba achieves these results with a significantly reduced parameter count and lower computational complexity, making it a more practical choice for real-world applications. The characteristics of TMamba, namely its low parameter count, minimal computational demands, and reduced memory usage, suggest that it exhibits considerable potential to advance the practical application of object tracking models. By addressing the limitations of existing Transformer-based approaches, TMamba paves the way for the development of more efficient and scalable video-level object tracking solutions.
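The two mechanisms described in the abstract can be illustrated with a toy numpy sketch. Everything here is an assumption for illustration only: the function names (ssm_scan, dual_image_scan, fuse_features), the diagonal recurrence, the bidirectional forward/backward pass, and the nearest-neighbor fusion are simple stand-ins, not the paper's actual architecture. The sketch only conveys two ideas: an SSM scan costs time linear in sequence length (versus attention's quadratic cost), and jointly scanning template and search-region tokens lets the search tokens condition on template state.

```python
import numpy as np

rng = np.random.default_rng(0)

def ssm_scan(x, A, B, C):
    """Toy linear-time state-space scan: h_t = A * h_{t-1} + B x_t, y_t = C h_t.
    x: (L, D) token sequence; A: (N,) diagonal decay; B: (N, D); C: (D, N).
    One pass over L tokens -> O(L), unlike O(L^2) self-attention."""
    L, D = x.shape
    h = np.zeros(A.shape[0])
    y = np.empty((L, D))
    for t in range(L):
        h = A * h + B @ x[t]
        y[t] = C @ h
    return y

def dual_image_scan(template_tokens, search_tokens, A, B, C):
    """Stand-in for the dual image scanning idea: template and search-region
    tokens are concatenated and scanned as ONE sequence (forward and backward),
    so search tokens are processed with template information in the state."""
    seq = np.concatenate([template_tokens, search_tokens], axis=0)
    fwd = ssm_scan(seq, A, B, C)
    bwd = ssm_scan(seq[::-1], A, B, C)[::-1]
    return fwd + bwd

def fuse_features(shallow, deep):
    """Stand-in for the feature fusion module: upsample the low-resolution
    deep (semantic) map to the shallow (detail) resolution and sum them."""
    Hs, Ws, _ = shallow.shape
    Hd, Wd, _ = deep.shape
    up = deep.repeat(Hs // Hd, axis=0).repeat(Ws // Wd, axis=1)
    return shallow + up

# Illustrative sizes: 16 template patches, 64 search-region patches, dim 8.
D, N = 8, 4
A = rng.uniform(0.5, 0.9, N)
B = rng.standard_normal((N, D)) * 0.1
C = rng.standard_normal((D, N)) * 0.1
template = rng.standard_normal((16, D))
search = rng.standard_normal((64, D))
out = dual_image_scan(template, search, A, B, C)
print(out.shape)  # (80, 8): one output per input token, linear in length

fused = fuse_features(rng.standard_normal((16, 16, 8)),
                      rng.standard_normal((4, 4, 8)))
print(fused.shape)  # (16, 16, 8): detail resolution, semantics mixed in
```

The joint sequence is what distinguishes this from single-image scanning: a per-image scan would reset the hidden state between template and search region, discarding exactly the cross-image dependency that tracking needs.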
Authors  Kang Ben; Chen Xin; Zhao Jie; Wang Dong (School of Information and Communication Engineering, Dalian University of Technology, Dalian 116024, China)
Source  Journal of Image and Graphics (中国图象图形学报), Peking University Core Journal, 2025, No. 10, pp. 3199-3214 (16 pages)
Funding  National Natural Science Foundation of China (U23A20384, 62402084); China Postdoctoral Science Foundation (2024M750319)
Keywords  single-object tracking; state space model (SSM); multi-scale feature fusion; sequential training; memory-efficient model