Emotion mismatch between training and testing is one of the important factors causing performance degradation in speaker recognition systems. In our previous work, a bi-model emotion speaker recognition (BESR) method based on synthesizing virtual HD (High Different from neutral, with large pitch offset) speech was proposed to deal with this problem. It improved system performance under mismatched emotion states on MASC, but still suffered from the system risk introduced by fusing the scores of the unreliable VHD model and the neutral model with equal weight. In this paper, we propose a new BESR method based on score reliability fusion. Two strategies, using the identification rate and the average relative loss difference of the scores, are presented to estimate the weights for the two score groups. Results on both MASC and EPST show that with the weights generated by the two strategies, the BESR method achieves better performance than with equal weights, and the better of the two even achieves a result comparable to that obtained with the best weights selected by exhaustive search.
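The abstract does not give the exact fusion formula, but the idea of weighting two score groups by an estimate of each model's reliability can be sketched as follows. The function names and the identification-rate-based weighting rule are illustrative assumptions, not the paper's actual definitions.

```python
import numpy as np

def weights_from_id_rate(id_rate_neutral, id_rate_vhd):
    """Hypothetical reliability weighting: make each model's fusion
    weight proportional to its identification rate, normalized to sum to 1."""
    total = id_rate_neutral + id_rate_vhd
    return id_rate_neutral / total, id_rate_vhd / total

def fuse_scores(neutral_scores, vhd_scores, w_neutral, w_vhd):
    """Weighted linear fusion of the two score groups (equal-weight
    fusion is the special case w_neutral = w_vhd = 0.5)."""
    return w_neutral * np.asarray(neutral_scores) + w_vhd * np.asarray(vhd_scores)

# Example: the neutral model identifies more reliably, so it gets more weight.
w_n, w_v = weights_from_id_rate(0.85, 0.60)
fused = fuse_scores([1.2, 0.4], [0.9, 0.7], w_n, w_v)
```

Under this sketch, equal-weight fusion is recovered whenever both models report the same identification rate, which is exactly the situation the reliability-based strategies are designed to move away from.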
A transformation matrix linear interpolation (TMLI) approach to speaker adaptation is proposed. TMLI uses the transformation matrices produced by MLLR for selected training speakers and the test speaker. With only 3 adaptation sentences, it yields a 12.12% word error rate reduction. As the number of adaptation sentences increases, the performance saturates quickly. To improve the behavior of TMLI with large amounts of adaptation data, the TMLI+MAP method, which combines TMLI with the MAP technique, is proposed. Experimental results show that TMLI+MAP achieves better recognition accuracy than MAP and MLLR+MAP for both small and large amounts of adaptation data.
Key words: speech recognition; speaker adaptation; MLLR; MAP; maximum likelihood model interpolation (MLMI)
CLC number: TN 912.34
Foundation item: Supported by the Science and Technology Committee of Shanghai (01JC14033)
Biography: XU Xiang-hua (1977-), female, Ph.D. candidate; research direction: large-vocabulary continuous Mandarin speech recognition and speaker adaptation
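The core TMLI operation, a weighted linear combination of per-speaker MLLR transformation matrices, can be sketched as below. How the interpolation weights are actually estimated is not specified in the abstract, so the weights here are simply given inputs; the 2x2 matrices are toy stand-ins for real MLLR transforms.

```python
import numpy as np

def tmli(matrices, weights):
    """Linearly interpolate MLLR transformation matrices from selected
    reference speakers. Weights are normalized to sum to 1 so the result
    stays a convex combination of the input transforms."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    # Contract the weight vector against the stacked (K, d, d) matrices.
    return np.tensordot(w, np.stack(matrices), axes=1)

# Toy example: two reference-speaker transforms, equal weights.
W1 = np.array([[1.0, 0.0], [0.0, 1.0]])
W2 = np.array([[3.0, 0.0], [0.0, 3.0]])
W_interp = tmli([W1, W2], [0.5, 0.5])
```

With equal weights the interpolated transform is the element-wise mean of the reference transforms; skewing the weights pulls the result toward the most similar reference speaker.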
This paper presents a speaker-adaptable very low bit rate speech coder based on HMMs (Hidden Markov Models) that include dynamic features, i.e., the delta and delta-delta parameters of speech. The coder's performance is improved by using the dynamic features produced by an algorithm for speech parameter generation from HMMs, because the generated speech parameter vectors reflect not only the means of the static and dynamic feature vectors but also their covariances. The encoder is equivalent to an HMM-based phoneme recognizer and transmits phoneme indexes, state durations, pitch information, and speaker-characteristic adaptation vectors to the decoder. The decoder receives those messages and concatenates the phoneme HMM sequence according to the phoneme indexes, then generates a sequence of mel-cepstral coefficient vectors using the HMM-based speech parameter generation technique. Finally, the decoder synthesizes speech by directly exciting the MLSA (Mel Log Spectrum Approximation) filter with the generated mel-cepstral coefficient vectors, according to the pitch information.
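The transmitted message and the decoder's first step (concatenating phoneme HMMs by index) can be sketched structurally as follows. The field names and table layout are assumptions for illustration; the real coder's bitstream format, parameter generation, and MLSA synthesis are far more involved and are not reproduced here.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CoderFrame:
    """One hypothetical transmitted unit of the HMM-based coder."""
    phoneme_index: int          # which phoneme HMM to use
    state_durations: List[int]  # frames spent in each HMM state
    pitch: float                # F0 used to excite the MLSA filter

def decode(frames, hmm_table):
    """Concatenate the phoneme HMM sequence by index. A real decoder
    would then run ML parameter generation over the concatenated HMMs
    and drive the MLSA filter with the resulting mel-cepstra."""
    return [hmm_table[f.phoneme_index] for f in frames]

# Toy phoneme table: indexes stand in for trained phoneme HMMs.
hmm_table = {0: "sil", 1: "a", 2: "n"}
frames = [CoderFrame(1, [3, 4, 2], 180.0), CoderFrame(2, [2, 2, 3], 175.0)]
models = decode(frames, hmm_table)
```

The very low bit rate follows from this design: only indexes, durations, and pitch cross the channel, while the spectral detail is regenerated from the shared HMMs on the decoder side.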
In emotional speaker recognition, an emotion mismatch between training and testing causes a sharp decline in system performance. One important approach to this problem is emotion normalization of the test speech. This method starts from an analysis of the differences between each kind of emotional speech and neutral speech, and takes the pitch (F0) mismatch caused by emotional changes as its main line. It gives corresponding algorithms for four technical points: emotional expansion, emotional shielding, emotional normalization, and score compensation. Compared with the traditional GMM-UBM method, the recognition rate on the MASC and EPST corpora is increased by 3.80% and 8.81%, respectively.
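One simple form the pitch-normalization step could take is mapping the test utterance's F0 statistics onto the speaker's neutral F0 statistics. This z-score-style mapping is an illustrative assumption, not the paper's actual algorithm, and the statistics here are supplied directly rather than estimated from speech.

```python
def normalize_f0(f0_contour, test_mean, test_std, neutral_mean, neutral_std):
    """Map an emotional test utterance's F0 contour onto the speaker's
    neutral F0 distribution: standardize against the test statistics,
    then rescale to the neutral statistics."""
    return [(f - test_mean) / test_std * neutral_std + neutral_mean
            for f in f0_contour]

# Example: excited speech (mean 230 Hz) mapped toward neutral (mean 120 Hz).
norm = normalize_f0([260.0, 230.0], test_mean=230.0, test_std=30.0,
                    neutral_mean=120.0, neutral_std=20.0)
```

After such a mapping, the test contour's mean and spread match the neutral training condition, which is the intent behind treating the F0 mismatch as the method's main line.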