
Spatiotemporal Enhancement of Video Captioning Integrating a State Space Model and Transformer

Cited by: 3
Abstract  Video captioning aims to describe the content of videos using natural language, offering extensive applications in areas such as human-computer interaction, assistance for visually impaired individuals, and sports commentary. However, the complex spatiotemporal variations within videos make it challenging to generate accurate captions. Previous methods have attempted to improve caption quality by extracting spatiotemporal features and leveraging prior information. Despite these efforts, they often struggle with joint spatiotemporal modeling, which can lead to inadequate visual information extraction and degrade the generated captions. To address this challenge, we propose a novel SpatioTemporal-enhanced State space model and Transformer (ST2) model, which strengthens joint spatiotemporal modeling by incorporating Mamba, a recently popular state space model (SSM) with a global receptive field and linear computational complexity. First, by combining Mamba with the Transformer in parallel, we introduce a Spatially enHanced SSM and Transformer module (SH-ST) that overcomes the receptive field limitations of convolutional approaches while reducing computational complexity, thereby improving the model's ability to extract spatial information. Then, to strengthen temporal modeling, we exploit Mamba's temporal scanning characteristics together with the global modeling capability of the Transformer, yielding a Temporally enHanced SSM and Transformer module (TH-ST). Specifically, the features produced by SH-ST are reordered so that Mamba can enhance the temporal relationships of the reordered features through cross-scanning, after which a Transformer further strengthens temporal modeling. Experimental results validate the effectiveness of the SH-ST and TH-ST designs within our ST2 model, which achieves competitive results on the widely used video captioning datasets MSVD and MSR-VTT. Notably, our method surpasses state-of-the-art results by 6.9% and 2.6% in absolute CIDEr score on MSVD and MSR-VTT, respectively, and exceeds the baseline by 4.9% in absolute CIDEr on MSVD.
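The abstract names two structural ideas: a parallel Mamba/Transformer combination in SH-ST, and a feature reordering that lets Mamba cross-scan temporal relations in TH-ST. The following is a minimal, hypothetical pure-Python sketch of those two mechanisms only; the fusion rule (elementwise addition) and the forward/backward scan orders are assumptions, since the abstract does not specify them:

```python
# Hypothetical sketch of the two mechanisms described in the abstract.
# Scalars stand in for per-frame feature vectors; the fusion rule and
# scan orders are assumptions, not the paper's confirmed design.

def fuse_parallel(transformer_out, mamba_out):
    """SH-ST-style parallel combination of a Transformer branch and a
    Mamba (SSM) branch; elementwise addition is assumed as the fusion."""
    return [t + m for t, m in zip(transformer_out, mamba_out)]

def cross_scan_orders(frame_features):
    """TH-ST-style reordering: produce the two temporal scan orders a
    cross-scanning SSM would consume. Mamba processes each order causally;
    combining both passes gives each frame context from both directions."""
    forward = list(frame_features)
    backward = list(reversed(frame_features))
    return forward, backward

# Toy usage: fuse two branch outputs, then build both scan orders.
fused = fuse_parallel([1, 2, 3], [10, 20, 30])      # [11, 22, 33]
fwd, bwd = cross_scan_orders(["f0", "f1", "f2", "f3"])
```

In a real implementation both branches would be learned modules operating on tensors, and the reordered sequences would feed a selective SSM rather than plain lists; this sketch only illustrates the data flow the abstract describes.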
Authors  SUN Haoying; LI Shuyi; XI Zeyu; WU Lifang (School of Information Science and Technology, Beijing University of Technology, Beijing 100124, China)
Source  Journal of Signal Processing (《信号处理》), Peking University Core Journal, 2025, No. 2, pp. 279-289 (11 pages)
Funding  National Natural Science Foundation of China (62236010, 62306021).
Keywords  video captioning; video understanding; state space model; Transformer