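Abstract
At present, many of the newest terms and proper nouns first appear in Chinese in the form of lettered words, and their use is growing steadily. Most lettered words are unknown (out-of-vocabulary) words for automatic Chinese word segmentation, so identifying them correctly helps improve the quality of applications such as Chinese word segmentation, information retrieval, search engines, and machine translation. Building on a preliminary survey of lettered words, this paper analyzes the complexity of their composition and the difficulties of identifying them automatically. Combining various statistical features of lettered words with their distinctive property, the letter string as an "anchor", it proposes an algorithm for automatic extraction that expands outward from the center to both sides, using rules assisted by statistics. The bilingual co-occurrence of lettered words is also handled. The algorithm is simple but effective: recall is 100% and precision is above 80%.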
Lettered words are increasingly used in Chinese texts, and most of them are new terms or proper nouns. Lettered words are usually unknown words or phrases for automatic Chinese word segmentation, so identifying them correctly will improve the quality of Chinese segmentation, information retrieval, search technology, machine translation, and similar applications. Based on an examination of lettered words in a Chinese corpus, this paper analyzes the complex features of Chinese lettered words and discusses the difficulties in extracting them. It presents an algorithm for the automatic identification of Chinese lettered words that uses a letter string as the anchor and searches its left and right contexts for the boundaries of the lettered word. The algorithm is simple but effective. An experiment on the People's Daily corpus (2002) shows a precision of 80% and a recall of 100%.
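The anchor-and-expand idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the boundary character set `STOP_CHARS`, the expansion limit `MAX_EXPAND`, and the function name `extract_lettered_words` are all hypothetical stand-ins for the rules and statistics the authors use, which the abstract does not specify.

```python
import re

# Hypothetical boundary characters: common function characters and
# punctuation at which expansion stops. The paper's real rules and
# statistical features are not given in the abstract.
STOP_CHARS = set("的了是在和与对用把被为从到，。、；：？！（）“”《》 ")

# An anchor is a run of Latin letters, optionally followed by digits or hyphens.
ANCHOR = re.compile(r"[A-Za-z][A-Za-z0-9-]*")

# Hypothetical limit on how many Chinese characters to absorb on each side.
MAX_EXPAND = 2


def is_cjk(ch):
    """Return True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def extract_lettered_words(text, max_expand=MAX_EXPAND):
    """Find letter-string anchors and expand each one left and right."""
    results = []
    for match in ANCHOR.finditer(text):
        left, right = match.start(), match.end()

        # Expand to the left while the neighbor is a non-boundary Chinese character.
        steps = 0
        while (left > 0 and steps < max_expand
               and is_cjk(text[left - 1]) and text[left - 1] not in STOP_CHARS):
            left -= 1
            steps += 1

        # Expand to the right under the same conditions.
        steps = 0
        while (right < len(text) and steps < max_expand
               and is_cjk(text[right]) and text[right] not in STOP_CHARS):
            right += 1
            steps += 1

        results.append(text[left:right])
    return results
```

With these assumptions, `extract_lettered_words("他做了B超。")` returns `["B超"]` and `extract_lettered_words("我的IP地址是多少")` returns `["IP地址"]`. A purely rule-based expansion like this one over-extends whenever no boundary character is adjacent to the anchor, which is presumably where the statistical features mentioned in the abstract come in.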
Source
《中文信息学报》
CSCD
Peking University Core Journals
2005, Issue 2, pp. 78-85 (8 pages)
Journal of Chinese Information Processing
Funding
Supported by a project of the National Language Resources Monitoring and Research Center (04L2004-01-01-03)
Keywords
artificial intelligence (人工智能)
natural language processing (自然语言处理)
lettered words (字母词语)
automatic extraction (自动提取)