Abstract: The Wav2Lip model is computationally heavy and slow at inference, which makes it difficult to meet expectations in application scenarios that demand high real-time performance or run on limited compute. To address this, the paper proposes a global channel pruning method, applies it to Wav2Lip at three different pruning ratios, and compares the results. The experiments show that the proposed global channel pruning scheme 1) improves inference speed, 2) reduces model size, and 3) maintains or improves the quality of the generated images. The scheme achieves efficient and stable inference while lowering computational cost.
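The abstract gives no implementation details, but the core idea of global channel pruning can be sketched in PyTorch as below: score every convolutional output channel across the entire network, derive a single global threshold from the chosen pruning ratio, and mask the lowest-scoring channels. The L1-norm scoring rule and the mask-based realization are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

def global_channel_prune(model: nn.Module, ratio: float) -> None:
    """Mask the fraction `ratio` of conv output channels with the smallest
    L1 filter norms, ranked globally across all Conv2d layers.
    (Assumed criterion; the paper's actual scoring rule is unspecified.)"""
    # Collect one L1 score per output channel, over every conv layer.
    scores = [
        m.weight.detach().abs().sum(dim=(1, 2, 3))
        for m in model.modules() if isinstance(m, nn.Conv2d)
    ]
    all_scores = torch.cat(scores)

    # A single global threshold: channels at or below it are pruned.
    k = max(int(ratio * all_scores.numel()), 1)
    threshold = torch.kthvalue(all_scores, k).values

    with torch.no_grad():
        for m in model.modules():
            if isinstance(m, nn.Conv2d):
                channel_l1 = m.weight.abs().sum(dim=(1, 2, 3))
                keep = (channel_l1 > threshold).float()
                m.weight.mul_(keep.view(-1, 1, 1, 1))  # zero pruned filters
                if m.bias is not None:
                    m.bias.mul_(keep)
```

Masking only simulates pruning: to realize the speed and size gains the paper reports, the masked channels must be physically removed and the adjacent layers rebuilt, for example with a dedicated structured-pruning tool.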
Funding: Supported by the Chongqing University of Posts and Telecommunications Ph.D. Innovative Talents Project (Grant No. BYJS202106) and the Chongqing Postgraduate Research Innovation Project (Grant No. CYB21203).
Abstract: Speech Emotion Recognition (SER) has received widespread attention as a crucial way of understanding human emotional states. However, irrelevant information in speech signals and data sparsity limit the development of SER systems. To address these issues, this paper proposes a framework that combines an Attentive Mask Residual Network (AM-ResNet) and the self-supervised learning model Wav2vec 2.0 to obtain AM-ResNet features and Wav2vec 2.0 features respectively, together with a cross-attention module that lets the two feature sets interact and fuse. The AM-ResNet branch consists mainly of maximum amplitude difference detection, a mask residual block, and an attention mechanism. The maximum amplitude difference detection and the mask residual block act on the pre-processing and on the network, respectively, to reduce the impact of silent frames, while the attention mechanism assigns different weights to unvoiced and voiced speech to suppress the redundant emotional information carried by unvoiced speech. In the Wav2vec 2.0 branch, the model serves as a feature extractor: pre-trained on a large amount of unlabeled speech data, it yields general speech features (Wav2vec 2.0 features) that assist the SER task and mitigate data sparsity. In the cross-attention module, the AM-ResNet features and Wav2vec 2.0 features interact and are fused into cross-fused features, which are used to predict the final emotion. Furthermore, multi-label learning is used to include utterances with ambiguous emotion, coping with data limitations. Experimental results demonstrate the usefulness of the proposed framework and its superiority over existing state-of-the-art approaches.
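For background on the Wav2vec 2.0 branch, the sketch below shows how a pre-trained Wav2vec 2.0 model can serve as a frozen feature extractor over raw audio. The Hugging Face checkpoint name and the 16 kHz input are illustrative assumptions; the abstract does not specify which pre-trained weights or toolkit the authors used.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# The checkpoint is an assumption; the paper does not name one.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def wav2vec_features(waveform: torch.Tensor, sr: int = 16_000) -> torch.Tensor:
    """Map a mono waveform to frame-level Wav2vec 2.0 features of shape
    (frames, hidden_size), usable as the branch's general speech features."""
    inputs = extractor(waveform.numpy(), sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():  # frozen feature extractor
        hidden = wav2vec(inputs.input_values).last_hidden_state  # (1, T, 768)
    return hidden.squeeze(0)
```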
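The cross-attention module that lets the two feature streams interact and fuse could be realized along the following lines, using bidirectional multi-head cross-attention. The feature dimension, head count, mean pooling, and emotion-class count are illustrative guesses, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Bidirectional cross-attention between two feature streams.

    Each stream attends to the other; the attended outputs are pooled and
    concatenated into cross-fused features that feed an emotion classifier.
    Both streams are assumed to be projected to a shared width `dim` first,
    since AM-ResNet and Wav2vec 2.0 features generally differ in width.
    """

    def __init__(self, dim: int = 256, heads: int = 4, n_emotions: int = 4):
        super().__init__()
        self.a_queries_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_queries_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_emotions)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a: (batch, Ta, dim), e.g. projected AM-ResNet features
        # feat_b: (batch, Tb, dim), e.g. projected Wav2vec 2.0 features
        a_att, _ = self.a_queries_b(feat_a, feat_b, feat_b)  # A attends to B
        b_att, _ = self.b_queries_a(feat_b, feat_a, feat_a)  # B attends to A
        fused = torch.cat([a_att.mean(dim=1), b_att.mean(dim=1)], dim=-1)
        return self.classifier(fused)  # emotion logits
```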