Audio-visual speaker tracking aims to determine the locations of multiple speakers in a scene by leveraging signals captured from multisensor platforms. Multimodal fusion methods can improve both the accuracy and robustness of speaker tracking. However, in complex multispeaker tracking scenarios, critical challenges such as cross-modal feature discrepancy, weak sound source localisation ambiguity and frequent identity switch errors remain unresolved; these severely hinder the modelling of speaker identity consistency and consequently lead to degraded tracking accuracy and unstable tracking trajectories. To this end, this paper proposes a multimodal multispeaker tracking network using audio-visual contrastive learning (AVCLNet). AVCLNet integrates heterogeneous modal representations into a unified space through audio-visual contrastive learning, which facilitates cross-modal feature alignment, mitigates cross-modal feature bias and enhances identity-consistent representations. In the audio-visual measurement stage, we design a vision-guided weighted enhancement method for weak sound sources, which leverages visual cues to establish cross-modal mappings and employs a spatiotemporal dynamic weighting mechanism to improve the detectability of weak sound sources. Furthermore, in the data association phase, a dual geometric constraint strategy is introduced that combines 2D and 3D spatial geometric information, reducing frequent identity switch errors. Experiments on the AV16.3 and CAV3D datasets show that AVCLNet outperforms state-of-the-art methods, demonstrating superior robustness in multispeaker scenarios.
Funding: This work was supported by the National Natural Science Foundation of China (62403345), the Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology (2024B1212010006), and the Shanxi Provincial Department of Science and Technology Basic Research Project (202403021212174, 202403021221074).
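The abstract does not spell out the contrastive objective. Below is a minimal sketch of a symmetric InfoNCE-style audio-visual contrastive loss of the kind commonly used for this sort of cross-modal alignment; the function name, batch layout and temperature value are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a symmetric audio-visual contrastive (InfoNCE-style)
# objective. The paper's exact loss is not given in the abstract; the
# names `audio_feat` / `visual_feat` and the temperature are assumptions.
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(audio_feat, visual_feat, temperature=0.07):
    """audio_feat, visual_feat: (N, D) embeddings of the same N speakers.

    Matching audio/visual pairs (row i with row i) are pulled together in
    the shared space; mismatched pairs are pushed apart.
    """
    # Project both modalities onto the unit sphere so cosine similarity
    # reduces to a dot product.
    a = F.normalize(audio_feat, dim=-1)
    v = F.normalize(visual_feat, dim=-1)

    # (N, N) similarity matrix; entry (i, j) compares audio i with visual j.
    logits = a @ v.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)

    # Symmetric InfoNCE: audio-to-visual and visual-to-audio directions.
    loss_a2v = F.cross_entropy(logits, targets)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2v + loss_v2a)
```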
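Similarly, the vision-guided weighted enhancement could plausibly be read as re-weighting an acoustic localisation map around visually detected speaker directions. The sketch below assumes an SRP-PHAT-style azimuth map and uses an exponential moving average as a stand-in for the spatiotemporal dynamic weighting; every parameter name and value here is hypothetical.

```python
# Hypothetical sketch of vision-guided weighting for weak sound sources:
# an acoustic localisation map (e.g. SRP-PHAT power over azimuth bins) is
# boosted around directions where a speaker is visually detected, with
# temporal smoothing standing in for the paper's spatiotemporal scheme.
import numpy as np

def vision_guided_weighting(acoustic_map, visual_azimuths, azimuth_grid,
                            prev_weights=None, sigma=10.0, gain=2.0, alpha=0.7):
    """acoustic_map: (B,) acoustic power per azimuth bin.
    visual_azimuths: azimuths (degrees) of visually detected speakers.
    azimuth_grid: (B,) azimuth (degrees) of each bin."""
    weights = np.ones_like(acoustic_map)
    for az in visual_azimuths:
        # Wrap-around angular distance, then a Gaussian bump centred on
        # the visually detected direction.
        d = np.abs((azimuth_grid - az + 180.0) % 360.0 - 180.0)
        weights += gain * np.exp(-0.5 * (d / sigma) ** 2)
    # Temporal smoothing: a briefly occluded face still contributes some
    # boost through the previous frame's weights.
    if prev_weights is not None:
        weights = alpha * weights + (1.0 - alpha) * prev_weights
    return acoustic_map * weights, weights
```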
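Finally, a dual geometric constraint for data association can be illustrated as a combined 2D/3D assignment cost. The sketch below mixes image-plane IoU with a gated 3D Euclidean distance and solves the assignment with the Hungarian algorithm; the weighting, gating radius and rejection threshold are assumptions for illustration, not the paper's values.

```python
# Illustrative dual geometric constraint for data association: a cost
# matrix mixing 2D image-plane IoU with normalised 3D distance, solved
# with the Hungarian algorithm (scipy.optimize.linear_sum_assignment).
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_2d(box_a, box_b):
    """Boxes as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(track_boxes, det_boxes, track_pos3d, det_pos3d,
              w2d=0.5, max_dist3d=1.0):
    """track_pos3d, det_pos3d: lists of (x, y, z) numpy arrays (metres).
    Returns matched (track_idx, det_idx) pairs from the combined cost."""
    n_t, n_d = len(track_boxes), len(det_boxes)
    cost = np.zeros((n_t, n_d))
    for i in range(n_t):
        for j in range(n_d):
            c2d = 1.0 - iou_2d(track_boxes[i], det_boxes[j])
            # Normalise 3D distance by a gating radius and clip to [0, 1].
            c3d = min(np.linalg.norm(track_pos3d[i] - det_pos3d[j])
                      / max_dist3d, 1.0)
            cost[i, j] = w2d * c2d + (1.0 - w2d) * c3d
    rows, cols = linear_sum_assignment(cost)
    # Reject high-cost matches, which would otherwise cause ID switches.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 0.8]
```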