ITU-T G. 729 is the primarily recommended speech codec by H. 323 standard. This paper describes how to implement G. 729 codec in IP telephony gateway, and goes deep into the programming skills on TMS320C6201 DSP and o...ITU-T G. 729 is the primarily recommended speech codec by H. 323 standard. This paper describes how to implement G. 729 codec in IP telephony gateway, and goes deep into the programming skills on TMS320C6201 DSP and optimizing methods of program code to reduce the speech processing delay time of G. 729 codec. Due to adopting these optimizing methods and programming skills, we have implemented a high-speed speech codec that can process concurrently 20 voice channels with single TMS320C6201 chip in IP telephony gateway. Finally, the paper analyzes the performance results of ITU-T G. 729 codec based on TMS320C6201.展开更多
At medium or long distance (〉 10 kin) underwater acoustic speech communication, information transfer rate is constrained by the complicated, time varying channel and limited bandwidth. The bit rate of speech coding...At medium or long distance (〉 10 kin) underwater acoustic speech communication, information transfer rate is constrained by the complicated, time varying channel and limited bandwidth. The bit rate of speech coding is required to be as low as possible. The time delay of underwater acoustic wave propagation can be used for low bit rate speech coding. After investigating the Mixed Excitation Linear Prediction (MELP) standard and taking account of the auditory perceptual features, a variable and adjustable bit rate speech codec algorithm has been proposed, whose average bit rate is about 600 bps. The average Perceptual Evaluation of Speech Quality Mean Opinion Score (PESQ MOS) of synthesized speeches is about 2.8. It has been proved by the computer simulation and sea trial that the performance of the proposed algorithm is well and robust when bit error rate is no more than 10-3. The synthesized speech is vivid and intelligible, and keeps main individual characteristics of speaker.展开更多
借助离散神经音频编解码器的能力,大型语言模型(Large language model,LLM)已被广泛认为是一种零样本语音合成(Text-to-Speech,TTS)的潜在方法。然而,基于采样的解码策略虽然能够为语音生成带来丰富的多样性,但同时也引入了诸如拼写错...借助离散神经音频编解码器的能力,大型语言模型(Large language model,LLM)已被广泛认为是一种零样本语音合成(Text-to-Speech,TTS)的潜在方法。然而,基于采样的解码策略虽然能够为语音生成带来丰富的多样性,但同时也引入了诸如拼写错误、遗漏和重复等鲁棒性问题。为了解决上述问题,我们提出了VALL-E R,一个鲁棒且高效的零样本TTS系统,并以VALL-E为基础进行构建。具体而言,我们引入了一种音素单调对齐策略,通过约束声学标记与其对应的音素严格匹配,增强了音素与声学序列之间的映射关系,从而确保更精确的对齐。此外,我们采用编解码器合并的方法,在浅层量化层对离散码进行降采样,以减少解码计算量,同时保持语音输出的高质量。受益于这些策略,VALL-E R在音素可控性方面取得了显著提升,并通过逼近真实语音的词错误率展现了卓越的鲁棒性。此外,该系统仅需较少的自回归推理步骤,推理时间降低超过60%,极大提升了推理效率。展开更多
QIM(Quantization Index Modulation,量化索引调制)隐写在标量或矢量量化时嵌入机密信息,可在语音压缩编码过程中进行高隐蔽性的信息隐藏,文中试图对该种隐写进行检测.文中发现该种隐写将导致压缩语音流中的音素分布特性发生改变,提出...QIM(Quantization Index Modulation,量化索引调制)隐写在标量或矢量量化时嵌入机密信息,可在语音压缩编码过程中进行高隐蔽性的信息隐藏,文中试图对该种隐写进行检测.文中发现该种隐写将导致压缩语音流中的音素分布特性发生改变,提出了音素向量空间模型和音素状态转移模型对音素分布特性进行了量化表示.基于所得量化特征并结合SVM(Support Vector Machine,支持向量机)构建了隐写检测器.针对典型的低速率语音编码标准G.729以及G.723.1的实验表明,文中方法性能远优于现有检测方法,实现了对QIM隐写的快速准确检测.展开更多
基金Supported by the National Natural Science Foundation of China under grant!69773046
文摘ITU-T G. 729 is the primarily recommended speech codec by H. 323 standard. This paper describes how to implement G. 729 codec in IP telephony gateway, and goes deep into the programming skills on TMS320C6201 DSP and optimizing methods of program code to reduce the speech processing delay time of G. 729 codec. Due to adopting these optimizing methods and programming skills, we have implemented a high-speed speech codec that can process concurrently 20 voice channels with single TMS320C6201 chip in IP telephony gateway. Finally, the paper analyzes the performance results of ITU-T G. 729 codec based on TMS320C6201.
基金supported by the National Natural Science Foundation of China(61102152)
文摘At medium or long distance (〉 10 kin) underwater acoustic speech communication, information transfer rate is constrained by the complicated, time varying channel and limited bandwidth. The bit rate of speech coding is required to be as low as possible. The time delay of underwater acoustic wave propagation can be used for low bit rate speech coding. After investigating the Mixed Excitation Linear Prediction (MELP) standard and taking account of the auditory perceptual features, a variable and adjustable bit rate speech codec algorithm has been proposed, whose average bit rate is about 600 bps. The average Perceptual Evaluation of Speech Quality Mean Opinion Score (PESQ MOS) of synthesized speeches is about 2.8. It has been proved by the computer simulation and sea trial that the performance of the proposed algorithm is well and robust when bit error rate is no more than 10-3. The synthesized speech is vivid and intelligible, and keeps main individual characteristics of speaker.
文摘借助离散神经音频编解码器的能力,大型语言模型(Large language model,LLM)已被广泛认为是一种零样本语音合成(Text-to-Speech,TTS)的潜在方法。然而,基于采样的解码策略虽然能够为语音生成带来丰富的多样性,但同时也引入了诸如拼写错误、遗漏和重复等鲁棒性问题。为了解决上述问题,我们提出了VALL-E R,一个鲁棒且高效的零样本TTS系统,并以VALL-E为基础进行构建。具体而言,我们引入了一种音素单调对齐策略,通过约束声学标记与其对应的音素严格匹配,增强了音素与声学序列之间的映射关系,从而确保更精确的对齐。此外,我们采用编解码器合并的方法,在浅层量化层对离散码进行降采样,以减少解码计算量,同时保持语音输出的高质量。受益于这些策略,VALL-E R在音素可控性方面取得了显著提升,并通过逼近真实语音的词错误率展现了卓越的鲁棒性。此外,该系统仅需较少的自回归推理步骤,推理时间降低超过60%,极大提升了推理效率。
文摘QIM(Quantization Index Modulation,量化索引调制)隐写在标量或矢量量化时嵌入机密信息,可在语音压缩编码过程中进行高隐蔽性的信息隐藏,文中试图对该种隐写进行检测.文中发现该种隐写将导致压缩语音流中的音素分布特性发生改变,提出了音素向量空间模型和音素状态转移模型对音素分布特性进行了量化表示.基于所得量化特征并结合SVM(Support Vector Machine,支持向量机)构建了隐写检测器.针对典型的低速率语音编码标准G.729以及G.723.1的实验表明,文中方法性能远优于现有检测方法,实现了对QIM隐写的快速准确检测.