Abstract
Identifying the boundaries of unlisted (out-of-vocabulary) words is a problem peculiar to automatic Chinese text analysis; the sheer variety and number of unlisted words is a serious obstacle to processing large-scale real-world text. This paper analyzes the existing approaches to the unlisted-word problem and proposes a package solution: segmenting the text in two passes and, within the resulting "segmentation fragments", computing the probability of a single character standing alone as a word versus the probability of its belonging to an unlisted word. A preliminary and encouraging open-test result is reported.
Abstract (English) Identifying unlisted words is a problem peculiar to Chinese segmentation. The variety and vast number of unlisted words has become a bottleneck in processing huge corpora. After discussing various existing methods, the paper proposes a new package scheme: segmenting twice and calculating, within fragments, the probability of a Chinese character standing alone as a word versus the probability of an unlisted word. The result of a preliminary open test is quite inspiring.
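The core idea of the proposed scheme can be illustrated with a minimal sketch: after a first dictionary-based segmentation pass, runs of leftover single characters ("segmentation fragments") are rescanned, and adjacent characters whose probability of standing alone as a word is low are merged into unlisted-word candidates. All names, probability values, and the threshold below are illustrative assumptions, not the paper's actual implementation.

```python
# Assumed per-character probability of occurring as a standalone word,
# as would be estimated from a segmented training corpus (values invented).
P_SINGLE = {"张": 0.05, "三": 0.10, "是": 0.95, "人": 0.80}

def group_fragment(chars, threshold=0.5):
    """Merge adjacent low-probability characters into unlisted-word candidates."""
    words, buffer = [], []
    for ch in chars:
        if P_SINGLE.get(ch, 0.0) < threshold:
            buffer.append(ch)            # likely part of an unlisted word
        else:
            if buffer:                   # flush any pending unlisted-word candidate
                words.append("".join(buffer))
                buffer = []
            words.append(ch)             # likely a standalone single-character word
    if buffer:
        words.append("".join(buffer))
    return words

print(group_fragment(list("张三是人")))  # → ['张三', '是', '人']
```

Here the name 张三, absent from the dictionary, is recovered because both of its characters rarely occur as words on their own; the paper's actual scheme additionally weighs the probability of the candidate being an unlisted word rather than relying on a fixed threshold.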
Source
《语言文字应用》
CSSCI
Peking University Core Journal (北大核心)
1999, No. 3, pp. 103-109 (7 pages)
Applied Linguistics