Abstract
Short videos on the internet today take diverse forms and carry flexible content, and traditional video recognition methods are largely limited in classifying their tags. Because video content exhibits pronounced multimodal characteristics, fusing multiple modalities for video recognition has become one of the hot topics in this field. To address this, a multimodal fusion video tag classification model based on a BiLSTM-Attention network is proposed. Exploiting the fact that a video's visual and audio features are temporally alignable sequence features, the model uses a BiLSTM to align and process the extracted visual and audio feature data, and incorporates text features into the temporal Attention over the visual and audio streams, thereby fusing the three modalities. The model is then trained and tested on a short-video tag dataset. The results show that with three modalities the accuracy on the first-level and second-level precision metrics reaches 72% and 84%, respectively, a clear improvement over using two modalities; the gain is most pronounced on the more demanding first-level metric, where accuracy improves by 9%. This indicates that introducing multiple modalities can, to a certain degree, improve the precision and accuracy of short-video classification.
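The abstract describes fusing a text embedding into the temporal Attention over BiLSTM-aligned visual and audio sequences. As a minimal illustrative sketch only (not the authors' implementation), the text-guided attention-pooling step might look like the following in NumPy; the function name, the projection matrices `W_q`/`W_k`, and all dimensions are hypothetical assumptions, and the BiLSTM outputs are assumed to be already temporally aligned.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_attention_fusion(visual_seq, audio_seq, text_vec, W_q, W_k):
    """Fuse aligned visual/audio sequences under a text-derived query.

    visual_seq, audio_seq: (T, d) BiLSTM output sequences (assumed aligned)
    text_vec: (d_t,) text embedding used as the attention query
    W_q: (d_t, d_a), W_k: (2d, d_a) hypothetical projection matrices
    """
    seq = np.concatenate([visual_seq, audio_seq], axis=-1)   # (T, 2d)
    query = text_vec @ W_q                                   # (d_a,)
    keys = seq @ W_k                                         # (T, d_a)
    scores = keys @ query / np.sqrt(keys.shape[-1])          # (T,)
    alpha = softmax(scores)                                  # temporal weights
    fused = alpha @ seq                                      # (2d,) pooled vector
    return fused, alpha

# toy usage with random features
rng = np.random.default_rng(0)
T, d, d_t, d_a = 8, 16, 12, 16
visual = rng.standard_normal((T, d))
audio = rng.standard_normal((T, d))
text = rng.standard_normal(d_t)
W_q = rng.standard_normal((d_t, d_a)) * 0.1
W_k = rng.standard_normal((2 * d, d_a)) * 0.1
fused, alpha = text_guided_attention_fusion(visual, audio, text, W_q, W_k)
```

The fused vector would then feed a classifier head for the tag labels; the sketch only shows how text can steer which time steps of the visual/audio streams dominate the pooled representation.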
Authors
HE Fan; LIU Meiling; YU Haiquan; FAN Binyuan; ZHAO Keqiao (College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China)
Source
Intelligent Computer and Applications (《智能计算机与应用》), 2025, No. 5, pp. 194-198 (5 pages)
Funding
Heilongjiang Provincial College Students' Innovation and Entrepreneurship Training Program (S202210225334).