Abstract
To improve the quality and efficiency of text clustering, this paper analyzes existing hierarchical clustering and K-means algorithms and, in view of the large volume and strict real-time requirements of Internet information processing, designs and implements a text clustering algorithm for high-dimensional sparse similarity matrices. The algorithm combines the ideas of hierarchical clustering and K-means clustering: a threshold controls both the choice of clustering algorithm and the creation of new clusters, and clustering is carried out through text feature extraction and computation of a document similarity matrix. Experimental results show that the algorithm achieves higher recall and accuracy.
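The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of threshold-controlled cluster creation, not the authors' method. It assumes a single pass over the documents, whitespace tokenization, raw term-frequency vectors, and cosine similarity. The function names (`tf_vector`, `cosine`, `threshold_cluster`) and the default threshold of 0.3 are hypothetical, and the sketch compares each document to cluster centroids rather than building the full document similarity matrix the abstract mentions.

```python
import math
from collections import Counter


def tf_vector(text):
    """Sparse term-frequency vector; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def threshold_cluster(docs, threshold=0.3):
    """Assign each document to its most similar cluster, or open a new one.

    Hypothetical rule: if the best centroid similarity reaches `threshold`,
    the document joins that cluster and the centroid is updated (a
    K-means-style step); otherwise a new cluster is created.
    """
    centroids = []  # one summed term vector per cluster
    labels = []
    for doc in docs:
        vec = tf_vector(doc)
        best, best_sim = -1, 0.0
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            # Cosine is scale-invariant, so a summed vector serves as centroid.
            centroids[best].update(vec)
            labels.append(best)
        else:
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
    return labels


docs = [
    "apple banana fruit",
    "banana apple juice",
    "car engine wheel",
    "engine car road",
]
labels = threshold_cluster(docs, threshold=0.3)
```

In this sketch a lower threshold merges documents more aggressively, while a threshold of 1.0 puts each of these four documents in its own cluster.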
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
PKU Core Journals (北大核心)
2010, No. 9, pp. 2013-2015, 2019 (4 pages in total)
Keywords
Chinese texts
text classification
clustering algorithm
hierarchical clustering
K-means