Abstract
This paper presents a method for extracting the subject of Chinese text based on word clustering. The method analyzes word co-occurrence using a relatedness measure to establish semantic associations between words, and generates word clusters, each of which is represented by seed words and stands for one subject concept. For a given document, feature words are extracted first. The word clusters are then used to generate the document's subject genes (主题因子, subject factors). Finally, the subject genes are output in descending order of weight as the subject of the text. Experimental results show that the method achieves relatively high extraction precision.
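The pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the relatedness formula, the clustering rule, or the weighting scheme, so the sketch assumes a chi-square statistic over document-level co-occurrence (chi-square appears among the keywords), a simple threshold for attaching words to a seed word, and term-frequency sums as subject weights. All function names, the threshold, and the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 co-occurrence table.

    n11: documents containing both words, n10/n01: only one of them,
    n00: neither.
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def pair_scores(docs):
    """Score each co-occurring word pair (assumed relatedness measure)."""
    n = len(docs)
    sets = [set(d) for d in docs]
    df = Counter(w for s in sets for w in s)  # document frequency
    co = Counter(frozenset(p) for s in sets
                 for p in combinations(sorted(s), 2))
    scores = {}
    for pair, n11 in co.items():
        a, b = tuple(pair)
        n10 = df[a] - n11
        n01 = df[b] - n11
        n00 = n - n11 - n10 - n01
        # Keep only positively associated pairs.
        if n11 * n00 > n10 * n01:
            scores[pair] = chi_square(n11, n10, n01, n00)
    return scores


def build_clusters(seeds, scores, threshold):
    """Attach to each seed word the words strongly related to it."""
    clusters = {s: {s} for s in seeds}
    for pair, score in scores.items():
        if score < threshold:
            continue
        for s in seeds:
            if s in pair:
                clusters[s] |= pair
    return clusters


def extract_subjects(doc, clusters, top_k=1):
    """Rank seed words by the summed frequency of their cluster members."""
    tf = Counter(doc)
    weights = {seed: sum(tf[w] for w in members)
               for seed, members in clusters.items()}
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [seed for seed, w in ranked[:top_k] if w > 0]


# Toy corpus of already-segmented documents (hypothetical data).
corpus = [
    ["football", "goal", "match"],
    ["football", "goal", "team"],
    ["stock", "market", "price"],
    ["stock", "market", "bank"],
]
clusters = build_clusters(["football", "stock"], pair_scores(corpus), 2.0)
subjects = extract_subjects(["goal", "goal", "match", "market"], clusters)
```

In this sketch a document is a list of segmented words, so Chinese word segmentation and feature-word selection would have to happen before `extract_subjects` is called.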
Source
Journal of Computer Applications (《计算机应用》)
CSCD
PKU Core Journal (北大核心)
2005, No. 4, pp. 754-756 (3 pages)
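Funding
National Natural Science Foundation of China (60475022)
Natural Science Foundation of Shanxi Province (20041041)
Shanxi Province Foundation for Returned Overseas Scholars (2002004)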
Keywords
subject extraction
word clustering
seed words
subject gene
information theory
word co-occurrence
chi-square (CHI) statistic