
Minimum Generation Error Training Based on Perceptually Weighted Line Spectral Pair Distance for Statistical Parametric Speech Synthesis
Abstract: A Minimum Generation Error (MGE) training method based on perceptually weighted Line Spectral Pair (LSP) distance is proposed to improve the performance of Hidden Markov Model (HMM) based parametric speech synthesis systems. When the speech spectrum is represented by LSP parameters, the Euclidean-distance generation error used in traditional MGE training cannot accurately measure the true gap between the generated and natural spectra. Defining the generation error by Log Spectral Distortion (LSD), which is independent of the spectral parameterization, alleviates this problem, but the subjective improvement is small and the computational complexity is high. This paper first proposes an MGE training criterion based on weighted LSP distance, then compares different weighting methods, both subjectively and objectively, against LSD-based MGE training. A perceptual weighting method is obtained that not only achieves the best subjective performance but also incurs almost no extra computational complexity compared with traditional MGE training.
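The abstract describes replacing the Euclidean generation error of standard MGE training with a weighted LSP distance. The sketch below illustrates the idea under an assumption: it uses inverse-neighbour-distance weighting, a common perceptual weighting for LSPs (closely spaced LSPs mark spectral peaks, so errors there are weighted more heavily). The function names `lsp_weights` and `weighted_lsp_error` are illustrative, not from the paper, and the weighting is not necessarily the paper's exact scheme.

```python
import numpy as np

def lsp_weights(lsp):
    """Perceptual weights for one LSP frame (frequencies in (0, pi), ascending).
    Assumed weighting: each LSP is weighted by the inverse distances to its
    neighbours, so tightly clustered LSPs (spectral peaks) count more."""
    padded = np.concatenate(([0.0], lsp, [np.pi]))
    d_prev = padded[1:-1] - padded[:-2]   # gap to previous LSP (or to 0)
    d_next = padded[2:] - padded[1:-1]    # gap to next LSP (or to pi)
    return 1.0 / d_prev + 1.0 / d_next

def weighted_lsp_error(generated, natural):
    """Weighted squared LSP distance between a generated and a natural frame,
    standing in for the plain Euclidean generation error of standard MGE."""
    w = lsp_weights(natural)              # weights derived from natural spectrum
    return float(np.sum(w * (generated - natural) ** 2))

# Toy example: one 4th-order LSP frame.
nat = np.array([0.3, 0.8, 1.7, 2.5])
gen = np.array([0.32, 0.85, 1.65, 2.55])
err = weighted_lsp_error(gen, nat)
```

Because the weights are a fixed function of each frame's LSP values, this error costs essentially the same to evaluate as the Euclidean one, which is consistent with the paper's claim of almost no added complexity.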
Source: Pattern Recognition and Artificial Intelligence (EI, CSCD, Peking University Core Journal), 2010, No. 4, pp. 572-579 (8 pages).
Keywords: Speech Synthesis, Hidden Markov Model (HMM), Minimum Generation Error (MGE), Perceptual Weighting, Line Spectral Pair Parameter

References (16)

  • 1 Masuko T, Tokuda K, Kobayashi T, et al. Speech Synthesis Using HMMs with Dynamic Features // Proc of the IEEE International Conference on Acoustics, Speech and Signal Processing. Atlanta, USA, 1996, I: 389-392.
  • 2 Yoshimura T, Tokuda K, Masuko T, et al. Simultaneous Modeling of Spectrum, Pitch and Duration in HMM-Based Speech Synthesis // Proc of the IEEE International Conference on Acoustics, Speech and Signal Processing. Phoenix, USA, 1999, V: 2347-2350.
  • 3 Tokuda K, Kobayashi T, Imai S. Speech Parameter Generation from HMM Using Dynamic Features // Proc of the IEEE International Conference on Acoustics, Speech and Signal Processing. Detroit, USA, 1995, I: 660-663.
  • 4 Ling Zhenhua, Qin Long, Lu Heng, et al. The USTC and iFlytek Speech Synthesis Systems for Blizzard Challenge 2007 // Proc of the Blizzard Challenge Workshop. Bonn, Germany, 2007: 17-21.
  • 5 Zen H, Toda T. An Overview of Nitech HMM-Based Speech Synthesis System for Blizzard Challenge 2005 // Proc of the 9th European Conference on Speech Communication and Technology. Lisbon, Portugal, 2005: 93-96.
  • 6 Wu Yijian, Wang Renhua. Minimum Generation Error Training for HMM-Based Speech Synthesis // Proc of the IEEE International Conference on Acoustics, Speech and Signal Processing. Toulouse, France, 2006, I: 889-892.
  • 7 Wu Yijian, Guo Wu, Wang Renhua. Minimum Generation Error Criterion for Tree-Based Clustering of Context Dependent HMMs // Proc of the 9th International Conference on Spoken Language Processing. Pittsburgh, USA, 2006: 2046-2049.
  • 8 Qin Long, Wu Yijian, Ling Zhenhua, et al. Minimum Generation Error Linear Regression Based Model Adaptation for HMM-Based Speech Synthesis // Proc of the IEEE International Conference on Acoustics, Speech and Signal Processing. Las Vegas, USA, 2008: 3953-3956.
  • 9 McLoughlin I V. Line Spectral Pairs. Signal Processing, 2008, 88(3): 448-467.
  • 10 Wu Yijian, Wang Renhua. HMM-Based Trainable Chinese Speech Synthesis. Journal of Chinese Information Processing, 2006, 20(4): 75-81.
