Abstract
To address the limitations of the CAM++ model in feature extraction and recognition performance under complex acoustic conditions, this paper proposes TF-DCAM, a speaker verification model that integrates dilated convolution with a temporal-frequency multi-scale attention mechanism. The model first uses dilated residual convolution and a time-frequency adaptive refocusing unit to strengthen feature extraction and suppress redundant information. It then introduces a temporal-frequency multi-scale attention module, which improves sensitivity to key information through channel attention and cross-dimensional interaction. An adaptive masking temporal convolution module is further incorporated to strengthen the modeling of long-term dependencies. Finally, contrastive loss functions are applied to jointly optimize the structure of the speaker embedding space. Experiments on the CN-Celeb dataset show that TF-DCAM reduces EER and minDCF by 14.98% and 10.98%, respectively, compared with the baseline model. The model also demonstrates good cross-lingual generalization on the VoxCeleb1 dataset. The results indicate that the proposed method significantly improves speaker verification performance and robustness while keeping the model lightweight.
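As an illustration of the dilated convolution that the abstract names, the following minimal sketch shows how dilation widens the temporal context of a 1-D filter without adding weights. It is not the TF-DCAM implementation: the function names `dilated_conv1d` and `receptive_field`, the kernel values, and the dilation rates are all assumptions chosen for the example.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (unpadded) 1-D dilated convolution in cross-correlation form."""
    # Number of input frames covered by one output value.
    span = (len(kernel) - 1) * dilation + 1
    if len(signal) < span:
        return []
    return [
        sum(kernel[k] * signal[t + k * dilation] for k in range(len(kernel)))
        for t in range(len(signal) - span + 1)
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 dilated convolution layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical frame-level feature values (illustrative only).
frames = [1, 2, 3, 4, 5, 6, 7, 8, 9]

dense = dilated_conv1d(frames, [1, 1, 1], dilation=1)   # covers 3 adjacent frames
sparse = dilated_conv1d(frames, [1, 1, 1], dilation=2)  # covers 5 frames, same 3 weights

print(dense)
print(sparse)
print(receptive_field(3, [1, 2, 4]))
```

With the same three weights, the dilated filter spans five frames instead of three, and stacking layers with growing dilation rates enlarges the receptive field further, which is the general reason dilated convolutions are used to capture longer temporal context at low parameter cost.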
Authors
李嘉麒
郑展恒
曾庆宁
王健
Li Jiaqi; Zheng Zhanheng; Zeng Qingning; Wang Jian (School of Information and Communication, Guilin University of Electronic Technology, Guilin 541004, China; Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education, Guilin University of Electronic Technology, Guilin 541004, China)
Source
《电子测量技术》
2025, No. 22, pp. 119-128 (10 pages)
Electronic Measurement Technology
Funding
Supported by the Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education (Project No. CRKL230103).
Keywords
deep learning
speaker verification
temporal-frequency multi-scale attention
dilated convolution
contrastive loss