Abstract
Co-occurrence word extraction plays an important role in information mining and natural language processing. Traditional extraction methods, however, rely on a single statistic, so their results are imprecise and require further manual collation. This paper presents a co-occurrence word extraction algorithm based on the lexical attraction and repulsion model, and improves its results by combining several commonly used statistics. In an open test, the user-interest rate of the extracted co-occurrence words is 60.87%. The algorithm has been applied to a Web-based co-occurrence word retrieval system, where it performs well in both speed and precision.
Source
《中文信息学报》
CSCD
Peking University Core Journal
2004, Issue 6, pp. 16-22 (7 pages)
Journal of Chinese Information Processing
Funding
Natural Science Foundation of Fujian Province (A0310009)
Fujian Provincial Key Science and Technology Project (2001J005)
Keywords
computer application
Chinese information processing
co-occurrence words
lexical attraction and repulsion model
co-occurrence distance