Abstract
We propose a generative statistical model for Mongolian lexical analysis. The model represents the lexical analysis of a Mongolian sentence as a directed graph, in which nodes represent stems, affixes, and their tags, and edges represent transition or generation relationships between nodes. Specifically, we model three kinds of probabilities: (a) stem-to-stem transition, affix-to-affix transition, and stem-to-affix generation probabilities; (b) the corresponding transition or generation probabilities between their tags; and (c) the mutual generation probabilities between stems or affixes and their tags. Trained on a manually built, three-level annotated corpus of about 200,000 words developed by Inner Mongolia University, the model achieves a word-level segmentation accuracy of 95.1% and a word-level joint segmentation and tagging accuracy of 93%.
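The probabilities listed in the abstract can be illustrated with a minimal sketch that scores one candidate analysis. The abstract does not state how the probabilities are combined, so this sketch assumes a simple product (a sum of log-probabilities) along the graph. The table names, the `FLOOR` constant, the `<s>` start symbol, and the toy tokens (`s1`, `a1`, `N`, `CASE`) are hypothetical and not taken from the paper.

```python
import math

FLOOR = 1e-6  # assumed probability for events missing from a table


def logp(model, kind, given, event):
    """Log-probability of `event` given `given` in the table named `kind`."""
    return math.log(model.get(kind, {}).get((given, event), FLOOR))


def score(analysis, model):
    """Score an analysis: a list of (stem, stem_tag, [(affix, affix_tag), ...])."""
    total = 0.0
    prev_stem, prev_stem_tag = "<s>", "<s>"
    for stem, stem_tag, affixes in analysis:
        # (a) stem-to-stem transition and (b) its tag-level counterpart
        total += logp(model, "stem_stem", prev_stem, stem)
        total += logp(model, "tag_stem_stem", prev_stem_tag, stem_tag)
        # (c) generation between a stem and its tag (assumed form: P(tag | stem))
        total += logp(model, "emit", stem, stem_tag)
        prev_m, prev_t, kind = stem, stem_tag, "stem_affix"
        for affix, affix_tag in affixes:
            # first affix: stem-to-affix generation; later affixes: affix-to-affix transition
            total += logp(model, kind, prev_m, affix)
            total += logp(model, "tag_" + kind, prev_t, affix_tag)
            total += logp(model, "emit", affix, affix_tag)
            prev_m, prev_t, kind = affix, affix_tag, "affix_affix"
        prev_stem, prev_stem_tag = stem, stem_tag
    return total


# Toy tables with made-up values; a real system would estimate them from the corpus.
toy_model = {
    "stem_stem": {("<s>", "s1"): 0.5},
    "tag_stem_stem": {("<s>", "N"): 0.4},
    "stem_affix": {("s1", "a1"): 0.2},
    "tag_stem_affix": {("N", "CASE"): 0.3},
    "emit": {("s1", "N"): 0.9, ("a1", "CASE"): 0.8},
}
good = [("s1", "N", [("a1", "CASE")])]  # word segmented into stem + affix
bad = [("s1a1", "N", [])]               # same word left unsegmented
best = max([good, bad], key=lambda a: score(a, toy_model))
```

Under these toy tables the segmented candidate scores higher, because the unsegmented stem is unseen and falls back to `FLOOR`. A full analyzer would search over all candidate paths rather than compare two by hand.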
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
PKU Core Journal (北大核心)
2011, Issue 5, pp. 94-100 (7 pages)
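Funding
National Natural Science Foundation of China (Contract 60736014)
National High-Tech R&D Program of China (863 Program), Key Project (2006AA010108)
Ministry of Education and State Language Commission Project for the Standardization and Informatization of Minority Languages and Scripts (MZ115-038)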