Abstract
This paper proposes a new method that uses two-character Chinese words (character pairs that should form a word) to compute two statistics for adjacent characters within a Chinese sentence: mutual information [1] and the t-information difference. The two statistics are then applied to the automatic segmentation of ambiguous strings in Chinese word segmentation. Experimental results show that the method segments ambiguous strings with an accuracy of 90%, improving segmentation precision over other word segmentation methods. In particular, compared with the method of reference [1], it reduces the system's time complexity on corpora containing large amounts of Chinese text.
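For orientation, the mutual information of two adjacent characters x and y is conventionally defined as log2( P(xy) / (P(x) P(y)) ). The sketch below estimates it from raw character and character-pair counts. It illustrates only this standard definition. It is not the paper's word-based estimation, and it does not cover the t-information difference, whose formula the abstract does not give. The function name adjacent_char_mi and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def adjacent_char_mi(corpus):
    """Build mi(x, y): pointwise mutual information of adjacent characters.

    `corpus` is an iterable of sentences (strings). Probabilities are plain
    relative frequencies; no smoothing is applied (an assumption of this sketch).
    """
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)                     # single-character counts
        pair_counts.update(zip(sentence, sentence[1:]))  # adjacent-pair counts
    n_chars = sum(char_counts.values())
    n_pairs = sum(pair_counts.values())

    def mi(x, y):
        # A pair never seen in the corpus gets -inf (log of zero probability).
        if pair_counts[(x, y)] == 0:
            return float("-inf")
        p_xy = pair_counts[(x, y)] / n_pairs
        p_x = char_counts[x] / n_chars
        p_y = char_counts[y] / n_chars
        return math.log2(p_xy / (p_x * p_y))

    return mi


# Toy corpus standing in for segmented-or-unsegmented Chinese text.
mi = adjacent_char_mi(["abab", "abcd"])
score_ab = mi("a", "b")   # frequent pair of frequent characters
score_cd = mi("c", "d")   # rare characters that always occur together
```

A high value suggests the two characters tend to bind into one word, while a low or negative value suggests a word boundary between them; how the paper combines this with the t-information difference to resolve ambiguity is not specified in the abstract.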
Source
《计算机工程与应用》
CSCD
PKU Core Journal
2003, No. 1, pp. 17-18, 26 (3 pages)
Computer Engineering and Applications
Funding
National High Technology Research and Development Program of China (863 Program), grant no. 2001AA114101
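Keywords
two-character Chinese words
ambiguous-string segmentation method
Chinese information processing
t-information difference
automatic word segmentation
Chinese text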
mutual information, difference of t-information, two Chinese characters used as a word, automatic word segmentation, ambiguous word