Abstract
Keywords play an important role in text clustering and classification, automatic summarization, and information retrieval. However, many news pages on the Internet do not provide keywords, manual keyword annotation is expensive and time-consuming, and most existing automatic keyword extraction algorithms depend on manually labeled training sets, which makes them hard to apply in practice. Because keywords can be regarded as the set of words in a document that are both important and tightly clustered around its topic, this paper proposes a keyword extraction method for Chinese news web pages based on density clustering. After word segmentation, the words of a page are clustered according to their co-occurrence information, and keywords that reflect the topic of the news item are extracted by analyzing the degree of association between words. Experiments on a large number of randomly selected news pages show that, compared with the plain TF/IDF method (the product of term frequency and inverse document frequency), the algorithm improves recall by 7.15% and precision by 7.075% on average.
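The abstract gives only the outline of the method, so the following is a minimal sketch of the general idea rather than the authors' algorithm. It assumes the page has already been segmented into sentences of words (Chinese text would first need a word segmenter), counts sentence-level co-occurrence, and groups words with a simplified DBSCAN-style rule. The function names and the `min_cooc` and `min_neighbors` thresholds are hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(sentences):
    """Count how often each unordered word pair appears in the same sentence."""
    pairs = Counter()
    for words in sentences:
        for a, b in combinations(sorted(set(words)), 2):
            pairs[(a, b)] += 1
    return pairs


def density_clusters(sentences, min_cooc=2, min_neighbors=2):
    """Group words in a DBSCAN-like way (illustrative thresholds, not from the paper).

    Two words are neighbors if they co-occur at least min_cooc times.
    A word is a core word if it has at least min_neighbors neighbors.
    Clusters grow outward from core words only.
    """
    neighbors = {}
    for (a, b), count in cooccurrence(sentences).items():
        if count >= min_cooc:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)

    core = {w for w, ns in neighbors.items() if len(ns) >= min_neighbors}
    clusters, visited = [], set()
    for word in sorted(core):
        if word in visited:
            continue
        cluster, stack = set(), [word]
        while stack:
            current = stack.pop()
            if current in cluster:
                continue
            cluster.add(current)
            if current in core:  # only core words expand the cluster
                stack.extend(neighbors[current] - cluster)
        visited |= cluster
        clusters.append(cluster)
    return clusters


# Toy input: each inner list stands for one segmented sentence.
sentences = [
    ["bank", "rate", "loan"],
    ["bank", "rate", "loan"],
    ["bank", "rate", "weather"],
    ["team", "match"],
]
print([sorted(c) for c in density_clusters(sentences)])
# prints [['bank', 'loan', 'rate']]
```

In this toy input, "weather", "team", and "match" never reach the co-occurrence threshold, so they are treated as noise and only the densely connected group remains as keyword candidates. A real system would still need a ranking step, for example by word frequency within each cluster.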
Source
《广西师范大学学报(自然科学版)》
CAS
Peking University Core Journals (北大核心)
2009, Issue 1, pp. 201-204 (4 pages)
Journal of Guangxi Normal University: Natural Science Edition
Funding
Supported by a National Foundation Overseas Young Scholars (海青) project (60828005)
Keywords
keyword extraction
word co-occurrence
clustering
natural language processing