
An Audio-visual Generalized Zero-Shot Learning Method Based on Multimodal Fusion Transformer (Cited by: 1)
Abstract: Audio-visual zero-shot learning requires understanding the relationship between audio and visual information in order to reason about unseen classes. Although the field has made many efforts and achieved significant progress, existing work tends to focus on learning powerful representations while overlooking the dependencies between audio and video and the mismatch between the output distribution and the target distribution. This paper therefore proposes a Transformer-based audio-visual generalized zero-shot learning method. Specifically, an attention mechanism is used to learn the internal information of the data and to enhance the interaction between modalities, so as to capture the semantic consistency between audio and visual data; to measure the difference between probability distributions and the consistency between classes, Kullback-Leibler (KL) divergence and a cosine similarity loss are introduced. The proposed method is evaluated on three benchmark datasets: VGGSound-GZSL^cls, UCF-GZSL^cls, and ActivityNet-GZSL^cls. Extensive experimental results show that it achieves state-of-the-art performance on all three datasets.

Objective: Audio-visual Generalized Zero-Shot Learning (GZSL) integrates audio and visual signals in videos to enable the classification of known classes and the effective recognition of unseen classes. Most existing approaches prioritize the alignment of audio-visual and textual label embeddings but overlook the interdependence between audio and video and the mismatch between model outputs and target distributions. This study proposes an audio-visual GZSL method based on a Multimodal Fusion Transformer (MFT) to address these limitations.

Methods: The MFT employs a transformer-based multi-head attention mechanism to enable effective cross-modal interaction between visual and audio features. To optimize the output probability distribution, the Kullback-Leibler (KL) divergence between the predicted and target distributions is minimized, thereby aligning predictions more closely with the true distribution. This optimization also reduces overfitting and improves generalization to unseen classes. In addition, a cosine similarity loss is applied to measure the similarity of learned representations within the same class, promoting feature consistency and improving discriminability.

Results and Discussions: The experiments include both GZSL and Zero-Shot Learning (ZSL) tasks. The ZSL task requires classification of unseen classes only, whereas the GZSL task addresses both unseen and seen classes to mitigate catastrophic forgetting. The proposed method is evaluated on three benchmark datasets: VGGSound-GZSL^cls, UCF-GZSL^cls, and ActivityNet-GZSL^cls (Table 1). MFT is quantitatively compared with five ZSL methods and nine GZSL methods (Table 2). The results show that the proposed method achieves state-of-the-art performance on all three datasets; for example, on ActivityNet-GZSL^cls, MFT exceeds the previous best method, ClipClap-GZSL, by 14.6%. This confirms the effectiveness of MFT in modeling cross-modal dependencies, aligning predicted and target distributions, and achieving semantic consistency between audio and visual features. Ablation studies (Tables 3-5) further support the contribution of each module in the proposed framework.

Conclusions: This study proposes a transformer-based audio-visual GZSL method that uses a multi-head self-attention mechanism to extract intrinsic information from audio and video data and to enhance cross-modal interaction. This design enables more accurate capture of semantic consistency between modalities, improving the quality of cross-modal feature representations. To align the predicted and target distributions and reinforce intra-class consistency, KL divergence and cosine similarity losses are incorporated during training: the KL divergence improves the match between predicted and true distributions, while the cosine similarity loss enhances discriminability within each class. Extensive experiments demonstrate the effectiveness of the proposed method.
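The pipeline described in the abstract (multi-head attention over joint audio-visual tokens, followed by a KL-divergence term on the output distribution and a cosine-similarity term for intra-class consistency) can be sketched as follows. This is a minimal PyTorch illustration, not the authors' implementation: the module structure, feature dimensions, pooling, and the unweighted sum of the two losses are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Sketch: fuse audio and visual token sequences with multi-head attention."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # Concatenating the two token sequences lets self-attention model
        # within-modality and cross-modality dependencies jointly.
        tokens = torch.cat([audio, visual], dim=1)   # (B, Ta+Tv, D)
        fused, _ = self.attn(tokens, tokens, tokens)
        fused = self.norm(tokens + fused)            # residual + layer norm
        return fused.mean(dim=1)                     # pooled joint embedding

def gzsl_losses(logits, targets, embeddings, class_protos):
    """KL term aligns predicted and target distributions; the cosine term
    pulls each sample toward its class prototype (intra-class consistency)."""
    log_probs = F.log_softmax(logits, dim=-1)
    target_dist = F.one_hot(targets, logits.size(-1)).float()
    kl = F.kl_div(log_probs, target_dist, reduction="batchmean")
    cos = 1 - F.cosine_similarity(embeddings, class_protos[targets]).mean()
    return kl + cos  # equal weighting is an assumption

# Toy usage with random features: batch of 8, 10 audio / 12 visual tokens.
fusion = CrossModalFusion(dim=256, heads=4)
a = torch.randn(8, 10, 256)
v = torch.randn(8, 12, 256)
emb = fusion(a, v)                                   # (8, 256)
logits = emb @ torch.randn(256, 5)                   # hypothetical 5-class head
loss = gzsl_losses(logits, torch.randint(0, 5, (8,)), emb, torch.randn(5, 256))
```

In practice the class prototypes would come from textual label embeddings (as the paper aligns audio-visual features with label embeddings), and the two loss terms would likely carry tuned weights; both details are omitted here for brevity.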
Authors: 杨静 (YANG Jing), 李小勇 (LI Xiaoyong), 阮小利 (RUAN Xiaoli), 李少波 (LI Shaobo), 唐向红 (TANG Xianghong), 徐计 (XU Ji) — State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China; Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 201100, China
Source: Journal of Electronics & Information Technology (Peking University Core Journal), 2025, No. 7, pp. 2375-2384 (10 pages)
Funding: National Natural Science Foundation of China (62441608, 62166005); Guizhou Provincial Science and Technology Project (QKHZC[2023]368); Guiyang Science and Technology Talent Training Program (ZKHT[2023]48-8); Guizhou University Basic Research Fund ([2024]08); Open Project of the State Key Laboratory of Public Big Data (PBD2023-16)
Keywords: Audio-visual zero-shot learning; Video classification; Attention mechanisms; Kullback-Leibler (KL) divergence