Abstract: The traditional FastSpeech2 offers high generation efficiency and natural-sounding speech, but it remains limited in prosody modeling, particularly because it lacks an effective link between semantics and prosody. To improve the prosodic expressiveness of synthesized speech, this study proposes ProsodySpeech, a speech synthesis system that incorporates the pre-trained BERT language model. By introducing a Pre-trained Language Model Adapter (PLM Adapter) and a Semantic-Prosody Mapping Network (SPMN), and by exploiting the deep semantic information extracted by BERT, the system gains finer control over prosodic features such as pitch, energy, and duration. The model aligns and maps semantic information to prosody parameters through a shared semantic processing layer, a global self-attention mechanism, and a specially designed prosody mapping branch. Experimental results show that the proposed model outperforms VITS and StyleTTS2 in Mean Opinion Score (MOS), with a clearer advantage in prosodic naturalness and expressive richness. These results confirm the effectiveness of the proposed model in improving prosodic expression and indicate that its synthesized speech is closer to natural human speech.
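The data flow described in the abstract (BERT features, then an adapter, then a shared layer, global self-attention, and prosody outputs) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class names `PLMAdapter` and `SemanticProsodyMapper`, the layer widths (768 and 256), the single attention head, the three separate output heads, and the assumption that BERT features are already aligned to phoneme positions are all hypothetical choices made only to show the shapes involved. Weights are random and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Random weight and zero bias (untrained; for shape illustration only)."""
    return rng.normal(0.0, in_dim ** -0.5, (in_dim, out_dim)), np.zeros(out_dim)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class PLMAdapter:
    """Hypothetical adapter: projects BERT features to the acoustic model width."""
    def __init__(self, bert_dim=768, model_dim=256):
        self.w, self.b = linear(bert_dim, model_dim)

    def __call__(self, bert_feats):
        return np.tanh(bert_feats @ self.w + self.b)

class SemanticProsodyMapper:
    """Hypothetical SPMN: shared layer, global self-attention, prosody heads."""
    def __init__(self, model_dim=256):
        self.dim = model_dim
        self.shared = linear(model_dim, model_dim)
        self.q = linear(model_dim, model_dim)
        self.k = linear(model_dim, model_dim)
        self.v = linear(model_dim, model_dim)
        # Assumed: one linear head per prosody parameter named in the abstract.
        self.heads = {name: linear(model_dim, 1)
                      for name in ("pitch", "energy", "log_duration")}

    def __call__(self, h):
        # Shared semantic processing layer (linear + ReLU).
        w, b = self.shared
        h = np.maximum(h @ w + b, 0.0)
        # Global self-attention: every position attends to the whole utterance.
        q = h @ self.q[0] + self.q[1]
        k = h @ self.k[0] + self.k[1]
        v = h @ self.v[0] + self.v[1]
        attn = softmax(q @ k.T / np.sqrt(self.dim))
        h = h + attn @ v  # residual connection
        # Prosody mapping: one scalar per position for each parameter.
        prosody = {name: (h @ w + b)[:, 0] for name, (w, b) in self.heads.items()}
        return prosody, attn

# Usage: 12 phoneme-aligned BERT vectors stand in for real model output.
bert_feats = rng.normal(size=(12, 768))
adapter = PLMAdapter()
spmn = SemanticProsodyMapper()
prosody, attn = spmn(adapter(bert_feats))
```

In a FastSpeech2-style system, per-position predictions of this kind would typically feed or supplement the variance adaptor; how ProsodySpeech actually integrates them is not stated in the abstract.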