Funding: Project supported by the National Natural Science Foundation of China (No. 62572423).
Abstract: Recently, audio-visual speech recognition (AVSR) has attracted increasing attention. However, most existing works simplify the complex challenges of real-world applications and focus only on scenarios with two speakers and perfectly aligned audio-video clips. In this work, we study the effect of speaker number and modal misalignment on the AVSR task and propose an end-to-end AVSR framework for more realistic conditions. Specifically, we propose a speaker-number-aware mixture-of-experts (SA-MoE) mechanism to explicitly model the characteristic differences between scenarios with different numbers of speakers, and a cross-modal realignment (CMR) module for robust handling of asynchronous inputs. We also exploit the underlying difference in difficulty and introduce a new training strategy named challenge-based curriculum learning (CBCL), which forces the model to focus on difficult, challenging data instead of simple data to improve efficiency.
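The abstract does not say how SA-MoE gates its experts, so the following is only a minimal illustrative sketch of the general idea of mixing expert outputs with weights that depend on the number of speakers. Everything in it is an assumption made for illustration: the function names (`softmax`, `speaker_aware_moe`), the lookup-table gate keyed by speaker count, and the toy experts that stand in for learned sub-networks. It is not the authors' implementation.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def speaker_aware_moe(
    features: Sequence[float],
    num_speakers: int,
    experts: Sequence[Expert],
    gate_logits: Dict[int, Sequence[float]],
) -> Vector:
    """Mix expert outputs using gate weights selected by speaker count.

    Assumption: the gate is a plain lookup table from speaker count to
    per-expert logits. The abstract does not specify the real gate.
    """
    weights = softmax(gate_logits[num_speakers])
    outputs = [expert(features) for expert in experts]
    mixed = [0.0] * len(outputs[0])
    for weight, output in zip(weights, outputs):
        for i, value in enumerate(output):
            mixed[i] += weight * value
    return mixed


# Toy experts (hypothetical stand-ins for learned sub-networks).
def identity_expert(x: Sequence[float]) -> Vector:
    return list(x)


def doubling_expert(x: Sequence[float]) -> Vector:
    return [2.0 * v for v in x]


experts = [identity_expert, doubling_expert]
# Hypothetical gate table: two speakers weight both experts equally,
# four speakers lean almost entirely on the second expert.
gate_logits = {2: [0.0, 0.0], 4: [-50.0, 50.0]}

two_speaker_out = speaker_aware_moe([1.0, 2.0], 2, experts, gate_logits)
four_speaker_out = speaker_aware_moe([1.0, 2.0], 4, experts, gate_logits)
```

In this toy setup the same input features produce different mixtures depending on the speaker count, which is the behavior a speaker-number-aware gate is meant to provide. A learned gate would replace the fixed table.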