摘要
随着社会的飞速发展 ,新词语不断地在日常生活中涌现出来。搜集和整理这些新词语 ,是中文信息处理中的一个重要研究课题。本文提出了一种自动检测新词语的方法 ,通过大规模地分析从Internet上采集而来的网页 ,建立巨大的词和字串的集合 ,从中自动检测新词语 ,而后再根据构词规则对自动检测的结果进行进一步的过滤 ,最终抽取出采集语料中存在的新词语。根据该方法实现的系统 ,可以寻找不限长度和不限领域的新词语 ,目前正应用于《现代汉语新词语信息 (电子 )词典》的编纂 ,在实用中大大的减轻了人工查找新词语的负担。
With the fast development of the society,more and more new words come out in our life. It is one of the important topics in Chinese natural language processing to collect those new words. A method is presented for detecting these new words automaitcally in this paper. Through analysing webpages grabbed from the Internet, a large word and string set is built, which new words are detected from and filtered by rules. At last new words which exist in the webpages grabbed are extracted. The system built in this way can find new words in any length and in any field.Now it is applying to the compilation of Modern Chinese New Word Information Dictionary. It reduced human labor a lot in practise.
出处
《中文信息学报》
CSCD
北大核心
2004年第6期1-9,共9页
Journal of Chinese Information Processing
关键词
计算机应用
中文信息处理
新词语
自动检测
computer application
Chinese language processing
new word
automatic detection