
Spatiotemporal Enhancement of Video Captioning Integrating a State Space Model and Transformer

Cited by: 3
Abstract  Video captioning aims to describe the content of videos using natural language, offering extensive applications in areas such as human-computer interaction, assistance for visually impaired individuals, and sports commentary. However, the complex spatiotemporal variations within videos make it challenging to generate accurate captions. Previous methods have attempted to improve caption quality by extracting spatiotemporal features and leveraging prior information. Despite these efforts, they often struggle with joint spatiotemporal modeling, which can lead to inadequate visual information extraction and degrade the generated captions. To address this challenge, we propose a novel SpatioTemporal-enhanced State space model and Transformer (ST2) model, which strengthens joint spatiotemporal modeling by incorporating Mamba, a recently popular state space model (SSM) with a global receptive field and linear computational complexity. First, by combining Mamba with the Transformer in parallel, we introduce a Spatially enHanced SSM and Transformer module (SH-ST) that overcomes the receptive field limitations of convolutional approaches while reducing computational complexity, thereby improving the model's ability to extract spatial information. Then, to strengthen temporal modeling, we exploit Mamba's temporal scanning characteristics together with the global modeling capability of the Transformer, yielding a Temporally enHanced SSM and Transformer module (TH-ST). Specifically, the features produced by SH-ST are reordered so that Mamba can enhance the temporal relationships of the reordered features through cross-scanning, after which a Transformer further strengthens temporal modeling. Experimental results validate the effectiveness of the SH-ST and TH-ST designs within our ST2 model, which achieves competitive results on the widely used video captioning datasets MSVD and MSR-VTT. Notably, our method surpasses state-of-the-art results by 6.9% and 2.6% in absolute CIDEr score on MSVD and MSR-VTT, respectively, and exceeds the baseline by 4.9% in absolute CIDEr on MSVD.
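The abstract names two structural ideas: a parallel Mamba/Transformer combination in SH-ST, and a feature reordering that lets Mamba cross-scan temporal relations in TH-ST. The following is a minimal, hypothetical pure-Python sketch of those two mechanisms only; the fusion rule (elementwise addition) and the forward/backward scan orders are assumptions, since the abstract does not specify them:

```python
# Hypothetical sketch of the two mechanisms described in the abstract.
# Scalars stand in for per-frame feature vectors; the fusion rule and
# scan orders are assumptions, not the paper's confirmed design.

def fuse_parallel(transformer_out, mamba_out):
    """SH-ST-style parallel combination of a Transformer branch and a
    Mamba (SSM) branch; elementwise addition is assumed as the fusion."""
    return [t + m for t, m in zip(transformer_out, mamba_out)]

def cross_scan_orders(frame_features):
    """TH-ST-style reordering: produce the two temporal scan orders a
    cross-scanning SSM would consume. Mamba processes each order causally;
    combining both passes gives each frame context from both directions."""
    forward = list(frame_features)
    backward = list(reversed(frame_features))
    return forward, backward

# Toy usage: fuse two branch outputs, then build both scan orders.
fused = fuse_parallel([1, 2, 3], [10, 20, 30])      # [11, 22, 33]
fwd, bwd = cross_scan_orders(["f0", "f1", "f2", "f3"])
```

In a real implementation both branches would be learned modules operating on tensors, and the reordered sequences would feed a selective SSM rather than plain lists; this sketch only illustrates the data flow the abstract describes.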
Authors  SUN Haoying; LI Shuyi; XI Zeyu; WU Lifang (School of Information Science and Technology, Beijing University of Technology, Beijing 100124, China)
Source  Journal of Signal Processing (《信号处理》), Peking University Core Journal, 2025, No. 2, pp. 279-289 (11 pages)
Funding  National Natural Science Foundation of China (62236010, 62306021).
Keywords  video captioning; video understanding; state space model; Transformer