Abstract
Automatic word segmentation of Chinese remains difficult when applied to large-scale real-world texts. The two crucial issues are the identification of unknown (out-of-vocabulary) words and the resolution of segmentation ambiguities. This paper describes a multi-step processing strategy designed to reduce the difficulty and improve the accuracy of segmentation. The procedure consists of seven parts: elimination of pseudo-ambiguities, full segmentation of a sentence, deterministic segmentation of certain words, processing of numeral strings, processing of reduplicated words, statistical identification of unknown words, and final resolution of segmentation ambiguities using part-of-speech information integrated into the tagger. In open tests, the procedure achieves a segmentation precision above 98%.
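To make the "full segmentation of a sentence" step concrete, the following is a minimal illustrative sketch, not the paper's implementation: it enumerates every dictionary-consistent split of a sentence by backtracking over a lexicon. The toy lexicon and example sentence are hypothetical; the paper's actual lexicon and the subsequent disambiguation steps are not reproduced.

```python
def full_segmentation(sentence, lexicon):
    """Return every segmentation of `sentence` whose words all appear in `lexicon`."""
    results = []

    def backtrack(start, path):
        # A complete path covers the whole sentence.
        if start == len(sentence):
            results.append(list(path))
            return
        # Try every candidate word beginning at `start`.
        for end in range(start + 1, len(sentence) + 1):
            word = sentence[start:end]
            if word in lexicon:
                path.append(word)
                backtrack(end, path)
                path.pop()

    backtrack(0, [])
    return results

# Hypothetical toy lexicon exhibiting a classic overlap ambiguity.
lexicon = {"研究", "研究生", "生命", "命", "的", "起源"}
for seg in full_segmentation("研究生命的起源", lexicon):
    print("/".join(seg))
```

For this sentence the search yields two candidate segmentations (研究/生命/的/起源 and 研究生/命/的/起源), which is exactly the kind of ambiguity the later statistical and part-of-speech steps are meant to resolve.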
Source
《中文信息学报》
CSCD
Peking University Core Journal
2001, No. 1, pp. 13-18 (6 pages)
Journal of Chinese Information Processing
Funding
National Natural Science Foundation of China (69775017)