Journal Articles
3 articles found
1. Research on a Speech Emotion Recognition Method Based on the Fusion of Local Attention and CTC (cited by: 2)
Authors: 孟令源, 孙哲, 刘扬, 赵振, 李永伟. 《计算机应用与软件》 (Computer Applications and Software), Peking University Core Journal, 2024, Issue 10, pp. 197-201 (5 pages).
To address the difficulty that time-series-based speech emotion recognition methods have in measuring how much emotional information each frame carries, a speech emotion recognition model (LAM-CTC) is proposed that fuses a local attention mechanism (LAM) with connectionist temporal classification (CTC). VGFCC emotion features are extracted as input to a shared encoder; a CTC layer minimizes the cost loss and predicts the emotion category, while the LAM layer uses local attention to compute context vectors; the context vectors are decoded by a decoder; and the decoded outputs are fused by averaging to obtain the final emotion prediction. Experimental results show that the proposed model reaches a UAR of 68.1% and a WAR of 68.3% on the IEMOCAP dataset.
Keywords: speech emotion recognition; attention mechanism; CTC; VGFCC; IEMOCAP
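The abstract above outlines a two-branch design: a shared encoder feeds both a CTC head and a local-attention (LAM) head, and the two branch outputs are fused by averaging. Below is a minimal PyTorch sketch of that structure, not the authors' implementation: the BiLSTM encoder, the 39-dimensional input standing in for VGFCC features, and the fixed local window are all assumptions.

```python
# Hedged sketch of a two-branch LAM-CTC style model (assumptions throughout).
import torch
import torch.nn as nn

class LAMCTCSketch(nn.Module):
    def __init__(self, feat_dim=39, hidden=128, num_emotions=4, window=8):
        super().__init__()
        # Shared encoder over VGFCC-style frame features (feat_dim is assumed).
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Branch 1: CTC head (num_emotions + 1 outputs, last index is the CTC blank).
        self.ctc_head = nn.Linear(2 * hidden, num_emotions + 1)
        self.ctc_loss = nn.CTCLoss(blank=num_emotions)  # would supervise training
        # Branch 2: local attention scores; "window" restricts attention to
        # nearby frames as a crude stand-in for a sliding local context.
        self.att_score = nn.Linear(2 * hidden, 1)
        self.window = window
        self.decoder = nn.Linear(2 * hidden, num_emotions)

    def forward(self, x):                         # x: (batch, time, feat_dim)
        h, _ = self.encoder(x)                    # h: (batch, time, 2*hidden)
        ctc_logits = self.ctc_head(h).log_softmax(-1)      # per-frame posteriors
        local = h[:, -self.window:, :]                     # local context frames
        w = torch.softmax(self.att_score(local), dim=1)    # (batch, window, 1)
        context = (w * local).sum(dim=1)                   # context vector
        lam_logits = self.decoder(context)
        # Fuse by averaging utterance-level posteriors from the two branches;
        # the CTC branch is pooled over time with the blank score dropped.
        ctc_utt = ctc_logits[..., :-1].mean(dim=1)
        return (ctc_utt.softmax(-1) + lam_logits.softmax(-1)) / 2

model = LAMCTCSketch()
probs = model(torch.randn(2, 120, 39))   # two utterances, 120 frames each
print(probs.shape)                       # torch.Size([2, 4])
```

At training time, `ctc_loss` would supervise the per-frame `ctc_logits` against label sequences; the forward pass here shows only the fused inference path described in the abstract.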
2. Multimodal Emotion Recognition Fusing Text, Speech, and Facial Expressions
Authors: 谢星宇, 丁彩琴, 王宪伦, 潘东杰. 《青岛大学学报(工程技术版)》 (Journal of Qingdao University, Engineering and Technology Edition), CAS, 2024, Issue 3, pp. 20-30 (11 pages).
To address problems in emotion recognition such as incomplete information and susceptibility to noise, a Transformer-based multimodal emotion recognition network model fusing textual, visual, and auditory information (Bidirectional Encoder Representations from Transformers and Residual Neural Network and Connectionist Temporal Classification and Transformer, BRCTN) is constructed. Person-specific feature information is introduced to assist emotion recognition and strengthen the model's ability to extract key features; the output vectors of the unimodal emotion recognizers are reorganized into a unified format through modality alignment; and the three modalities together with the person features are mapped into a high-dimensional global vector space to learn the latent relationships among features of different modalities. The model is validated on the IEMOCAP dataset; results show that BRCTN achieves 87% accuracy, the best recognition performance among the compared methods.
Keywords: Transformer; IEMOCAP; multimodal fusion; emotion recognition
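As a rough illustration of the fusion strategy described above (aligning unimodal output vectors into a unified format, then mapping them into a shared space where cross-modal relations are learned), here is a hedged PyTorch sketch. The BERT, ResNet, and CTC unimodal backbones implied by the BRCTN acronym are replaced with plain linear projections, and all dimensions are assumptions.

```python
# Hedged sketch of a BRCTN-style fusion step, not the authors' architecture.
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    def __init__(self, text_dim=768, audio_dim=512, video_dim=2048,
                 person_dim=64, d_model=256, num_emotions=6):
        super().__init__()
        # Modality alignment: project each unimodal output vector to d_model,
        # i.e. reorganize the heterogeneous outputs into a unified format.
        self.align = nn.ModuleDict({
            "text":   nn.Linear(text_dim, d_model),
            "audio":  nn.Linear(audio_dim, d_model),
            "video":  nn.Linear(video_dim, d_model),
            "person": nn.Linear(person_dim, d_model),
        })
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.mixer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_emotions)

    def forward(self, text, audio, video, person):
        # Stack the four aligned vectors as a 4-token sequence so self-attention
        # can model latent links between modalities and person features.
        tokens = torch.stack([self.align["text"](text),
                              self.align["audio"](audio),
                              self.align["video"](video),
                              self.align["person"](person)], dim=1)
        mixed = self.mixer(tokens)              # (batch, 4, d_model)
        return self.head(mixed.mean(dim=1))     # pool tokens -> emotion logits

model = FusionSketch()
logits = model(torch.randn(2, 768), torch.randn(2, 512),
               torch.randn(2, 2048), torch.randn(2, 64))
print(logits.shape)                             # torch.Size([2, 6])
```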
3. Using Speaker-Specific Emotion Representations in Wav2vec 2.0-Based Modules for Speech Emotion Recognition
Authors: Somin Park, Mpabulungi Mark, Bogyung Park, Hyunki Hong. Computers, Materials & Continua (SCIE, EI), 2023, Issue 10, pp. 1009-1030 (22 pages).
Speech emotion recognition is essential for frictionless human-machine interaction, where machines respond to human instructions with context-aware actions. The properties of individuals' voices vary with culture, language, gender, and personality. These variations in speaker-specific properties may hamper the performance of standard representations in downstream tasks such as speech emotion recognition (SER). This study demonstrates the significance of speaker-specific speech characteristics and how considering them can be leveraged to improve the performance of SER models. In the proposed approach, two wav2vec-based modules (a speaker-identification network and an emotion classification network) are trained with the Arcface loss. The speaker-identification network has a single attention block to encode an input audio waveform into a speaker-specific representation. The emotion classification network uses a wav2vec 2.0 backbone as well as four attention blocks to encode the same input audio waveform into an emotion representation. These two representations are then fused into a single vector representation containing emotion and speaker-specific information. Experimental results showed that the use of speaker-specific characteristics improves SER performance. Additionally, combining these with an angular margin loss such as the Arcface loss improves intra-class compactness while increasing inter-class separability, as demonstrated by plots of t-distributed stochastic neighbor embeddings (t-SNE). The proposed approach outperforms previous methods using similar training strategies, with a weighted accuracy (WA) of 72.14% and an unweighted accuracy (UA) of 72.97% on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. This demonstrates its effectiveness and potential to enhance human-machine interaction through more accurate emotion recognition in speech.
Keywords: attention block; IEMOCAP dataset; speaker-specific representation; speech emotion recognition; wav2vec 2.0
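The approach above trains two wav2vec 2.0-based branches with the Arcface loss and concatenates their embeddings into a single fused representation. The sketch below illustrates that training idea under stated assumptions: the wav2vec 2.0 backbones and attention blocks are stubbed with small convolutional encoders, and the angular-margin head is a generic ArcFace-style implementation, not code from the paper.

```python
# Hedged sketch of two-branch training with an ArcFace-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    def __init__(self, dim, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, dim))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalised embeddings and class centres.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight)).clamp(-1, 1)
        theta = torch.acos(cos)
        # Add the angular margin m only on the target class, then rescale by s;
        # this tightens intra-class compactness and widens inter-class gaps.
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = self.s * torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(logits, labels)

class TwoBranchSketch(nn.Module):
    def __init__(self, dim=192, speakers=10, emotions=4):
        super().__init__()
        # Stand-ins for the paper's two wav2vec 2.0-based encoders.
        self.spk_enc = nn.Sequential(nn.Conv1d(1, dim, 400, stride=320),
                                     nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.emo_enc = nn.Sequential(nn.Conv1d(1, dim, 400, stride=320),
                                     nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.spk_head = ArcFaceHead(dim, speakers)
        self.emo_head = ArcFaceHead(2 * dim, emotions)

    def forward(self, wav, spk_labels, emo_labels):  # wav: (batch, 1, samples)
        spk = self.spk_enc(wav)                      # speaker-specific embedding
        emo = self.emo_enc(wav)                      # emotion embedding
        fused = torch.cat([emo, spk], dim=-1)        # fused single-vector repr.
        return self.spk_head(spk, spk_labels) + self.emo_head(fused, emo_labels)

model = TwoBranchSketch()
loss = model(torch.randn(2, 1, 16000), torch.tensor([0, 1]), torch.tensor([2, 3]))
print(loss.item())
```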