Abstract
This paper describes a clustering method for scientific literature based on an improved TF-IDF feature-word weighting algorithm. First, feature words are extracted from the literature. Next, each feature word is weighted by its term frequency, its position in the document, and its part of speech, and a vector space model (VSM) of the literature is built. A density-based clustering algorithm is then applied to the VSM data. Finally, the resulting clusters are labeled using principal component analysis and evaluated with the F-measure. Experiments show that the proposed method can identify hot research areas in the retrieved literature and can recognize research directions of an interdisciplinary nature.
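The weighting and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the position and part-of-speech multipliers (`POSITION_WEIGHT`, `POS_WEIGHT`), the smoothed IDF, the L2 normalization, and the class-weighted form of the F-measure are all assumptions chosen for illustration. The density-based clustering and PCA labeling steps are omitted.

```python
import math
from collections import Counter

# Hypothetical multipliers: the abstract does not state the actual values.
POSITION_WEIGHT = {"title": 3.0, "keywords": 2.0, "abstract": 1.5, "body": 1.0}
POS_WEIGHT = {"noun": 1.0, "verb": 0.5, "other": 0.2}


def weighted_tfidf(docs):
    """Build one L2-normalized {term: weight} vector per document.

    docs: list of documents, each a list of (term, position, pos) tuples.
    Each occurrence contributes position_weight * pos_weight instead of 1,
    so the "term frequency" reflects where and how the term is used.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update({term for term, _, _ in doc})

    vectors = []
    for doc in docs:
        weighted_tf = Counter()
        for term, position, pos in doc:
            weighted_tf[term] += (
                POSITION_WEIGHT.get(position, 1.0) * POS_WEIGHT.get(pos, 0.2)
            )
        # Smoothed IDF (an assumption) keeps weights positive for common terms.
        vec = {
            t: tf * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, tf in weighted_tf.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def clustering_f_measure(true_labels, cluster_labels):
    """Class-weighted F-measure: for each true class take the best F1 over
    all clusters, then average weighted by class size (one common variant)."""
    n = len(true_labels)
    classes = Counter(true_labels)
    clusters = Counter(cluster_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    total = 0.0
    for c, n_c in classes.items():
        best = 0.0
        for k, n_k in clusters.items():
            n_ck = joint[(c, k)]
            if n_ck == 0:
                continue
            precision = n_ck / n_k
            recall = n_ck / n_c
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_c / n) * best
    return total


# Toy input: two tiny "documents" with hand-assigned positions and POS tags.
docs = [
    [("clustering", "title", "noun"), ("clustering", "body", "noun"),
     ("improve", "body", "verb")],
    [("ontology", "title", "noun"), ("improve", "body", "verb")],
]
vectors = weighted_tfidf(docs)
```

In this toy input, "clustering" appears in the title and as a noun, so it receives a larger weight in the first vector than the verb "improve", which also occurs in both documents. The resulting vectors could then be passed to any density-based clustering routine.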
Source
Library and Information Service (《图书情报工作》)
CSSCI
PKU Core Journal (北大核心)
2012, No. 4, pp. 6-11 (6 pages)
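Funding
A research output of the Ministry of Education Humanities and Social Sciences project "Research on Constructing Knowledge Maps of University Experts" (Project No. 10YJC870022)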
Keywords
scientific literature
text mining
clustering