Abstract
Combinational ambiguity has long been a difficult problem in Chinese word segmentation, because resolving it depends on contextual information. This paper collects statistics on the context preceding and following combinationally ambiguous strings and builds a context model based on the log likelihood ratio. A weighting formula is designed that accounts for the window size, position, and frequency of the contextual information. On this basis, two disambiguation methods are investigated: (1) using the maximum log likelihood ratio found in the context; and (2) summing the log likelihood ratios separately for the combined and separated readings and choosing the reading with the larger sum. In experiments on 14 high-frequency combinationally ambiguous strings, the first method achieves an average accuracy of 84.93% and the second 95.60%. The results show that summing contextual information is effective for resolving combinational segmentation ambiguity.
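The two decision rules can be illustrated with a minimal Python sketch. The abstract does not give the paper's scoring or weighting formulas, so everything concrete here is an assumption made for illustration: the score is a simple add-one-smoothed log ratio of a context word's probability under the combined versus the separated reading, the position weight is 1/distance, and the counts, the ambiguous string, and all function names are hypothetical.

```python
import math
from collections import Counter

# Toy co-occurrence counts (invented): how often each context word was seen
# near the ambiguous string when it was one word ("combined") or two ("separated").
comb_counts = Counter({"发挥": 8, "有": 5, "的": 3})
sep_counts = Counter({"只有": 9, "这样": 6, "的": 3})
vocab_size = len(set(comb_counts) | set(sep_counts))


def log_ratio(word):
    """Add-one-smoothed log ratio (assumed form); > 0 favors the combined reading."""
    p_comb = (comb_counts[word] + 1) / (sum(comb_counts.values()) + vocab_size)
    p_sep = (sep_counts[word] + 1) / (sum(sep_counts.values()) + vocab_size)
    return math.log(p_comb / p_sep)


def weighted_scores(context):
    """context: (word, distance) pairs inside the window, with distance >= 1.

    The 1/distance weight is an assumption, not the paper's formula.
    """
    return [log_ratio(word) / distance for word, distance in context]


def decide_by_max(scores):
    """Rule 1: let the single strongest piece of evidence decide."""
    strongest = max(scores, key=abs)
    return "combined" if strongest > 0 else "separated"


def decide_by_sum(scores):
    """Rule 2: sum the evidence for each reading and take the larger sum."""
    comb_total = sum(s for s in scores if s > 0)
    sep_total = sum(-s for s in scores if s < 0)
    return "combined" if comb_total > sep_total else "separated"


# Hypothetical window around one occurrence of the ambiguous string.
context = [("只有", 1), ("发挥", 2), ("有", 1)]
scores = weighted_scores(context)
print(decide_by_max(scores), decide_by_sum(scores))
```

On this toy window the two rules disagree: one strong cue for the separated reading wins under the maximum rule, while two weaker cues for the combined reading together outweigh it under the sum rule.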
Source
《中文信息学报》
CSCD
PKU Core Journal
2007, No. 6, pp. 13-16, 42 (5 pages in total)
Journal of Chinese Information Processing
Funding
Supported by the Xinzhou Teachers University Fund, Shanxi Province (200307)
Keywords
computer application
Chinese information processing
natural language processing
Chinese word segmentation
combinational ambiguity
log likelihood ratio
contextual information