AVCLNet: Multimodal Multispeaker Tracking Network Using Audio-Visual Contrastive Learning
Authors: Yihan Li, Yidi Li, Zhenhuan Xu, Hao Guo, Mengyuan Liu, Weiwei Wan. CAAI Transactions on Intelligence Technology, 2026, Issue 1, pp. 238-255 (18 pages)
Audio-visual speaker tracking aims to determine the locations of multiple speakers in a scene by leveraging signals captured from multisensor platforms. Multimodal fusion methods can improve both the accuracy and robustness of speaker tracking. However, in complex multispeaker tracking scenarios, critical challenges such as cross-modal feature discrepancy, weak sound source localisation ambiguity and frequent identity switch errors remain unresolved. These challenges severely hinder the modelling of speaker identity consistency and consequently lead to degraded tracking accuracy and unstable tracking trajectories. To this end, this paper proposes a multimodal multispeaker tracking network using audio-visual contrastive learning (AVCLNet). AVCLNet integrates heterogeneous modal representations into a unified space through audio-visual contrastive learning, which facilitates cross-modal feature alignment, mitigates cross-modal feature bias and enhances identity-consistent representations. In the audio-visual measurement stage, we design a vision-guided weak sound source weighted enhancement method, which leverages visual cues to establish cross-modal mappings and employs a spatiotemporal dynamic weighting mechanism to improve the detectability of weak sound sources. Furthermore, in the data association phase, a dual geometric constraint strategy combines 2D and 3D spatial geometric information to reduce frequent identity switch errors. Experiments on the AV16.3 and CAV3D datasets show that AVCLNet outperforms state-of-the-art methods, demonstrating superior robustness in multispeaker scenarios.
Keywords: computer vision; machine perception; multimodal approaches; pattern recognition; video signal processing