With the increasing popularity of mobile internet devices, speech emotion recognition has become a convenient and valuable means of human-computer interaction. The performance of speech emotion recognition depends on discriminative, emotion-related utterance-level representations extracted from speech. Moreover, sufficient data are required to model the relationship between emotional states and speech. Mainstream emotion recognition methods cannot avoid the influence of silence periods in speech, and environmental noise significantly affects recognition performance. This study supplements the silence periods with the removed speech information and applies segment-wise multilayer perceptrons to enhance utterance-level representation aggregation. In addition, improved semisupervised learning is employed to overcome the problem of data scarcity. Experiments evaluating the proposed method on the IEMOCAP corpus show that it achieves 68.0% weighted accuracy and 68.8% unweighted accuracy on four-class emotion classification. The experimental results demonstrate that the proposed method aggregates utterance-level representations more effectively and that semisupervised learning enhances its performance.
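The segment-wise aggregation mentioned above can be illustrated with a minimal sketch; this is not the authors' implementation, and the feature dimension, segment length, and MLP sizes below are assumptions chosen only for illustration:

```python
import torch
import torch.nn as nn

class SegmentwiseAggregator(nn.Module):
    """Illustrative sketch: apply a shared MLP to each fixed-length segment of
    frame-level features, then average the segment embeddings into a single
    utterance-level representation. All dimensions are assumed, not the paper's."""

    def __init__(self, feat_dim=40, seg_len=25, hidden_dim=256, out_dim=128):
        super().__init__()
        self.seg_len = seg_len
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim * seg_len, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, feat_dim); trim so frames split evenly into segments
        b, t, d = frames.shape
        t = (t // self.seg_len) * self.seg_len
        segs = frames[:, :t, :].reshape(b, -1, self.seg_len * d)  # (batch, num_segs, seg_len*feat_dim)
        seg_emb = self.mlp(segs)                                  # (batch, num_segs, out_dim)
        return seg_emb.mean(dim=1)                                # utterance-level representation

# usage example with random features standing in for frame-level acoustic features
utt = SegmentwiseAggregator()(torch.randn(2, 300, 40))
print(utt.shape)  # torch.Size([2, 128])
```

The mean over segment embeddings is one simple way to pool segment-level outputs into an utterance-level vector; the classifier that maps this vector to emotion labels is omitted here.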