Abstract: To address the problems of incomplete information and susceptibility to noise in emotion recognition, a multimodal emotion recognition network that fuses textual, visual, and auditory information (Bidirectional Encoder Representations from Transformers and Residual Neural Network and Connectionist Temporal Classification and Transformer, BRCTN) is constructed on the basis of the Transformer architecture. Character feature information is introduced to assist emotion recognition and to strengthen the model's ability to extract key features; the output vectors of the single-modality emotion recognizers are reorganized into a unified format through modality alignment; and the three modalities together with the character features are mapped into a high-dimensional global vector space, where the latent relationships among the features of different modalities are learned. The model is validated on the IEMOCAP dataset. The results show that, compared with other methods, BRCTN achieves the best recognition performance with an accuracy of 87%.
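The alignment-and-fusion stage described in this abstract can be illustrated with a minimal sketch: each single-modality output vector and the character-feature vector are projected to a shared width, stacked as a short token sequence, and passed through a Transformer encoder so that self-attention can model cross-modal relationships. The sketch below assumes PyTorch; the dimensions, depth, and head count are illustrative assumptions, not the BRCTN configuration reported in the paper.

```python
import torch
import torch.nn as nn


class MultimodalFusion(nn.Module):
    """Aligns text, audio, visual, and character vectors and fuses them with a Transformer."""

    def __init__(self, text_dim=768, audio_dim=512, video_dim=512, person_dim=128,
                 model_dim=256, num_classes=4):
        super().__init__()
        # Modality alignment: project every per-modality output to one shared width.
        self.align = nn.ModuleDict({
            "text": nn.Linear(text_dim, model_dim),
            "audio": nn.Linear(audio_dim, model_dim),
            "video": nn.Linear(video_dim, model_dim),
            "person": nn.Linear(person_dim, model_dim),
        })
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(model_dim, num_classes)

    def forward(self, text, audio, video, person):
        # Stack the aligned vectors as a length-4 "token" sequence so that
        # self-attention can learn interactions between modalities.
        tokens = torch.stack([
            self.align["text"](text),
            self.align["audio"](audio),
            self.align["video"](video),
            self.align["person"](person),
        ], dim=1)                                   # (B, 4, model_dim)
        fused = self.fusion(tokens).mean(dim=1)     # (B, model_dim) global representation
        return self.classifier(fused)               # (B, num_classes) emotion logits


# Hypothetical usage with random feature vectors for a batch of two utterances.
model = MultimodalFusion()
logits = model(torch.randn(2, 768), torch.randn(2, 512),
               torch.randn(2, 512), torch.randn(2, 128))
```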
Funding: Supported by the Chung-Ang University Graduate Research Scholarship in 2021.
Abstract: Speech emotion recognition is essential for frictionless human-machine interaction, where machines respond to human instructions with context-aware actions. The properties of individuals' voices vary with culture, language, gender, and personality. These variations in speaker-specific properties may hamper the performance of standard representations in downstream tasks such as speech emotion recognition (SER). This study demonstrates the significance of speaker-specific speech characteristics and how they can be leveraged to improve the performance of SER models. In the proposed approach, two wav2vec-based modules (a speaker-identification network and an emotion classification network) are trained with the ArcFace loss. The speaker-identification network has a single attention block to encode an input audio waveform into a speaker-specific representation. The emotion classification network uses a wav2vec 2.0 backbone together with four attention blocks to encode the same input audio waveform into an emotion representation. These two representations are then fused into a single vector containing both emotion and speaker-specific information. Experimental results showed that the use of speaker-specific characteristics improves SER performance. Additionally, combining these characteristics with an angular margin loss such as the ArcFace loss improves intra-class compactness while increasing inter-class separability, as demonstrated by plots of t-distributed stochastic neighbor embeddings (t-SNE). The proposed approach outperforms previous methods that use similar training strategies, achieving a weighted accuracy (WA) of 72.14% and an unweighted accuracy (UA) of 72.97% on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. This demonstrates its effectiveness and its potential to enhance human-machine interaction through more accurate emotion recognition in speech.
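The two-branch design and the ArcFace objective described in this abstract can be sketched as follows. The code assumes PyTorch and the Hugging Face transformers library; the layer counts follow the abstract (one attention block for the speaker branch, four for the emotion branch), but the shared backbone, pooling, and hyperparameters (scale, margin, embedding size) are simplifying assumptions rather than the authors' exact configuration, since the paper trains two separate wav2vec-based modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Wav2Vec2Model


class ArcFaceHead(nn.Module):
    """Additive angular margin (ArcFace-style) classification loss."""

    def __init__(self, embed_dim: int, num_classes: int,
                 scale: float = 30.0, margin: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.scale, self.margin = scale, margin

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalised embeddings and class centres.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin only to the logit of the true class.
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, labels)


class SpeakerAwareSER(nn.Module):
    """Fuses a speaker embedding and an emotion embedding into one vector."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # A single shared backbone is used here for brevity (assumption).
        self.backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.speaker_blocks = nn.TransformerEncoder(layer, num_layers=1)  # speaker branch
        self.emotion_blocks = nn.TransformerEncoder(layer, num_layers=4)  # emotion branch

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        frames = self.backbone(waveform).last_hidden_state        # (B, T, D) frame features
        speaker_vec = self.speaker_blocks(frames).mean(dim=1)     # (B, D) speaker embedding
        emotion_vec = self.emotion_blocks(frames).mean(dim=1)     # (B, D) emotion embedding
        return torch.cat([speaker_vec, emotion_vec], dim=-1)      # fused (B, 2D) vector
```

In this sketch, the fused vector would feed a downstream classifier whose emotion logits are trained with the ArcFaceHead loss, which is what pushes same-class embeddings together and different-class embeddings apart in the t-SNE plots referred to above.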