Abstract
Word-based Chinese N-gram models suffer from sparse statistical data, and a change of application domain degrades the recognition performance of a previously trained model. To address both problems, this paper proposes an N-gram smoothing algorithm that can adapt to the application domain. Forward and backward 0-gram to 3-gram statistics were collected from corpora of two application domains. Drawing on the successful experience of hidden Markov models (HMMs) in speech recognition, the Baum-Welch algorithm is used to obtain optimized weights, each of which represents the statistical reliability of the corresponding model. Combining the forward and backward 3-gram models yields a smoothing algorithm with 5-gram grammatical constraints, which compensates for the sparseness of the statistical matrices. Statistics from the People's Daily (《人民日报》) corpus serve as the prior, and the specialized PC World (《计算机世界》) corpus serves as the target-domain data for subsequent training, yielding a domain-adapted 3-gram model. Experimental results show that the 5-gram smoothing obtained from the forward- and backward-constrained 3-gram models achieves stronger grammatical constraints at a small storage cost and greatly reduces the perplexity of the statistical model.
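The weight estimation described in the abstract can be illustrated with a small sketch. For a linear mixture of fixed component models, Baum-Welch reduces to a simple EM update on the mixture weights. The code below assumes one global weight per component (the paper's weights may well depend on context), and the function names and toy probabilities are hypothetical, not taken from the paper.

```python
def em_interpolation_weights(component_probs, iterations=50):
    """Estimate mixture weights for interpolated language models by EM.

    component_probs: list of rows, one row per held-out word; row[i] is the
    probability that component model i (e.g. 0-gram, 1-gram, 2-gram, 3-gram)
    assigns to that word. Returns one weight per component, summing to 1.
    """
    k = len(component_probs[0])
    weights = [1.0 / k] * k  # start from uniform weights
    for _ in range(iterations):
        expected = [0.0] * k
        for row in component_probs:
            mix = sum(w * p for w, p in zip(weights, row))
            if mix == 0.0:
                continue  # no component explains this word; skip it
            # E-step: posterior share of each component for this word
            for i in range(k):
                expected[i] += weights[i] * row[i] / mix
        total = sum(expected)
        if total == 0.0:
            break
        # M-step: renormalize the expected counts into new weights
        weights = [e / total for e in expected]
    return weights


def interpolate(weights, probs):
    """Smoothed probability of one word as a weighted sum of components."""
    return sum(w * p for w, p in zip(weights, probs))


# Toy held-out data (hypothetical): columns are two component models.
held_out = [[0.1, 0.5], [0.2, 0.6]]
weights = em_interpolation_weights(held_out)
smoothed = interpolate(weights, held_out[0])
```

A component that assigns consistently higher probability to held-out text receives a larger weight, which matches the abstract's reading of each weight as the statistical reliability of its model.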
Source
《清华大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
1999, No. 9, pp. 99-102 (4 pages)
Journal of Tsinghua University (Science and Technology)
Funding
National Natural Science Foundation of China
Ministry of Education Key Postdoctoral Research Fund