
An Audio-visual Generalized Zero-Shot Learning Method Based on Multimodal Fusion Transformer (Cited by: 1)
Abstract: Audio-visual zero-shot learning requires understanding the relationship between audio and visual information in order to reason about unseen classes. Although the field has made many efforts and achieved significant progress, existing work tends to focus on learning powerful representations while overlooking the dependencies between audio and video and the mismatch between the output distribution and the target distribution. This paper therefore proposes a Transformer-based audio-visual generalized zero-shot learning method. Specifically, an attention mechanism is used to learn the internal information of the data and to enhance the interaction between modalities, so as to capture the semantic consistency between audio and visual data; to measure the difference between probability distributions and the consistency between classes, Kullback-Leibler (KL) divergence and a cosine similarity loss are introduced. The proposed method is evaluated on three benchmark datasets: VGGSound-GZSL^cls, UCF-GZSL^cls, and ActivityNet-GZSL^cls. Extensive experimental results show that it achieves state-of-the-art performance on all three datasets.

Objective: Audio-visual Generalized Zero-Shot Learning (GZSL) integrates audio and visual signals in videos to enable the classification of known classes and the effective recognition of unseen classes. Most existing approaches prioritize the alignment of audio-visual and textual label embeddings but overlook the interdependence between audio and video and the mismatch between model outputs and target distributions. This study proposes an audio-visual GZSL method based on a Multimodal Fusion Transformer (MFT) to address these limitations.

Methods: The MFT employs a transformer-based multi-head attention mechanism to enable effective cross-modal interaction between visual and audio features. To optimize the output probability distribution, the Kullback-Leibler (KL) divergence between the predicted and target distributions is minimized, thereby aligning predictions more closely with the true distribution. This optimization also reduces overfitting and improves generalization to unseen classes. In addition, a cosine similarity loss is applied to measure the similarity of learned representations within the same class, promoting feature consistency and improving discriminability.

Results and Discussions: The experiments include both GZSL and Zero-Shot Learning (ZSL) tasks. The ZSL task requires classification of unseen classes only, whereas the GZSL task addresses both unseen and seen classes to mitigate catastrophic forgetting. The proposed method is evaluated on three benchmark datasets: VGGSound-GZSL^cls, UCF-GZSL^cls, and ActivityNet-GZSL^cls (Table 1). MFT is quantitatively compared with five ZSL methods and nine GZSL methods (Table 2). The results show that the proposed method achieves state-of-the-art performance on all three datasets; for example, on ActivityNet-GZSL^cls, MFT exceeds the previous best method, ClipClap-GZSL, by 14.6%. This confirms the effectiveness of MFT in modeling cross-modal dependencies, aligning predicted and target distributions, and achieving semantic consistency between audio and visual features. Ablation studies (Tables 3-5) further support the contribution of each module in the proposed framework.

Conclusions: This study proposes a transformer-based audio-visual GZSL method that uses a multi-head self-attention mechanism to extract intrinsic information from audio and video data and to enhance cross-modal interaction. This design enables more accurate capture of semantic consistency between modalities, improving the quality of cross-modal feature representations. To align the predicted and target distributions and reinforce intra-class consistency, KL divergence and cosine similarity losses are incorporated during training: the KL divergence improves the match between predicted and true distributions, while the cosine similarity loss enhances discriminability within each class. Extensive experiments demonstrate the effectiveness of the proposed method.
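The pipeline described in the abstract (multi-head attention over joint audio-visual tokens, followed by a KL-divergence term on the output distribution and a cosine-similarity term for intra-class consistency) can be sketched as follows. This is a minimal PyTorch illustration, not the authors' implementation: the module structure, feature dimensions, pooling, and the unweighted sum of the two losses are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Sketch: fuse audio and visual token sequences with multi-head attention."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # Concatenating the two token sequences lets self-attention model
        # within-modality and cross-modality dependencies jointly.
        tokens = torch.cat([audio, visual], dim=1)   # (B, Ta+Tv, D)
        fused, _ = self.attn(tokens, tokens, tokens)
        fused = self.norm(tokens + fused)            # residual + layer norm
        return fused.mean(dim=1)                     # pooled joint embedding

def gzsl_losses(logits, targets, embeddings, class_protos):
    """KL term aligns predicted and target distributions; the cosine term
    pulls each sample toward its class prototype (intra-class consistency)."""
    log_probs = F.log_softmax(logits, dim=-1)
    target_dist = F.one_hot(targets, logits.size(-1)).float()
    kl = F.kl_div(log_probs, target_dist, reduction="batchmean")
    cos = 1 - F.cosine_similarity(embeddings, class_protos[targets]).mean()
    return kl + cos  # equal weighting is an assumption

# Toy usage with random features: batch of 8, 10 audio / 12 visual tokens.
fusion = CrossModalFusion(dim=256, heads=4)
a = torch.randn(8, 10, 256)
v = torch.randn(8, 12, 256)
emb = fusion(a, v)                                   # (8, 256)
logits = emb @ torch.randn(256, 5)                   # hypothetical 5-class head
loss = gzsl_losses(logits, torch.randint(0, 5, (8,)), emb, torch.randn(5, 256))
```

In practice the class prototypes would come from textual label embeddings (as the paper aligns audio-visual features with label embeddings), and the two loss terms would likely carry tuned weights; both details are omitted here for brevity.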
Authors: 杨静 (YANG Jing), 李小勇 (LI Xiaoyong), 阮小利 (RUAN Xiaoli), 李少波 (LI Shaobo), 唐向红 (TANG Xianghong), 徐计 (XU Ji) — State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China; Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 201100, China
Source: Journal of Electronics & Information Technology (Peking University Core Journal), 2025, No. 7, pp. 2375-2384 (10 pages)
Funding: National Natural Science Foundation of China (62441608, 62166005); Guizhou Provincial Science and Technology Project (QKHZC[2023]368); Guiyang Science and Technology Talent Training Program (ZKHT[2023]48-8); Guizhou University Basic Research Fund ([2024]08); Open Project of the State Key Laboratory of Public Big Data (PBD2023-16)
Keywords: Audio-visual zero-shot learning; Video classification; Attention mechanisms; Kullback-Leibler (KL) divergence