Abstract
To improve the adaptability of language models, a user-oriented language model is proposed. Structurally, it consists of a general-purpose language model, obtained by training on a large-scale balanced corpus and with its original parameters kept unchanged, and a user model, obtained through online learning and with its parameters updated dynamically on a first-in, first-out (FIFO) basis. For data storage, the general-purpose model uses a multi-level index structure to address the data sparseness problem, while the user model is stored as a linear structure and looked up by binary search. A machine learning method suited to Chinese N-gram models is proposed, following two principles: correct as many of the language model's conversion errors as possible, and avoid unbalancing the language model. Experimental results indicate that this learning method has a "reinforcement" character and that, together with the "progressive learning" mode, it gives application systems more flexible choices.
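The user-model storage described in the abstract (a linear structure, looked up by binary search, with first-in, first-out updates) can be illustrated with a minimal sketch. The class name `UserModel`, its methods, the fixed `capacity`, and the use of raw counts as parameters are all hypothetical choices made for illustration; the abstract does not specify the paper's actual layout, parameterization, or eviction granularity.

```python
from bisect import bisect_left
from collections import deque


class UserModel:
    """Hypothetical user model: a sorted linear table with FIFO eviction.

    Assumption: parameters are plain n-gram counts and eviction is per
    distinct n-gram. The source abstract does not state either detail.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity  # maximum number of distinct n-grams kept
        self.keys = []            # n-grams, kept sorted for binary search
        self.counts = []          # counts[i] belongs to keys[i]
        self.order = deque()      # n-grams in order of first insertion

    def _find(self, ngram):
        """Binary search: return (index, found) for ngram in the sorted table."""
        i = bisect_left(self.keys, ngram)
        found = i < len(self.keys) and self.keys[i] == ngram
        return i, found

    def count(self, ngram):
        """Return the stored count for ngram, or 0 if it is not in the table."""
        i, found = self._find(ngram)
        return self.counts[i] if found else 0

    def observe(self, ngram):
        """Record one occurrence, evicting the oldest entry when the table is full."""
        i, found = self._find(ngram)
        if found:
            self.counts[i] += 1
            return
        if len(self.keys) >= self.capacity:
            oldest = self.order.popleft()   # first in, first out
            j, _ = self._find(oldest)
            del self.keys[j]
            del self.counts[j]
            i, _ = self._find(ngram)        # insertion point may have shifted
        self.keys.insert(i, ngram)
        self.counts.insert(i, 1)
        self.order.append(ngram)


model = UserModel(capacity=2)
model.observe(("你", "好"))
model.observe(("你", "好"))
model.observe(("世", "界"))
model.observe(("语", "言"))  # table is full, so the oldest n-gram is evicted
```

In this sketch a lookup costs O(log n) comparisons, while inserting into the sorted list costs O(n); that trade-off is plausible for a small, frequently queried user table, but whether the paper makes the same trade-off is not stated in the abstract.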
Source
《哈尔滨工业大学学报》
EI
CAS
CSCD
PKU Core (北大核心)
2004, No. 2, pp. 150-153 (4 pages)
Journal of Harbin Institute of Technology
Funding
National Natural Science Foundation of China (69973015)
National High Technology Research and Development Program of China (2001AA114041)