Abstract
Text clustering data sets are sparse, and common clustering methods rely on distance-based dissimilarity. To strengthen the features that discriminate between documents, this paper proposes an asymmetric-similarity approach for measuring the association between document objects. An asymmetric similarity measure between text objects is defined. Exploiting the sparsity of the resulting asymmetric similarity matrix, text objects are clustered by partitioning them into strongly connected components, and the procedure is iterated to build a concept hierarchy over the clustering results. Experiments on text data sets show that asymmetric similarity yields higher accuracy and shorter run time than distance-based dissimilarity; when the number of resulting clusters is small, accuracy improves by about 20%.
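The pipeline outlined in the abstract (asymmetric similarity, a sparse directed graph, strongly connected components, repeated passes to obtain a hierarchy) can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual similarity formula or iteration scheme, so everything specific here is an assumption: `asym_sim` is a hypothetical measure (the share of one document's terms found in the other), the thresholds are arbitrary, and the hierarchy is obtained simply by lowering the threshold. Kosaraju's algorithm is used for the strongly connected components as a standard choice, not necessarily the authors'.

```python
from typing import Dict, List, Set, Tuple


def asym_sim(a: Set[str], b: Set[str]) -> float:
    """Hypothetical asymmetric similarity: fraction of a's terms that occur in b."""
    return len(a & b) / len(a) if a else 0.0


def build_graph(docs: List[Set[str]], threshold: float) -> Dict[int, List[int]]:
    """Add a directed edge i -> j whenever asym_sim(docs[i], docs[j]) >= threshold."""
    n = len(docs)
    return {
        i: [j for j in range(n)
            if j != i and asym_sim(docs[i], docs[j]) >= threshold]
        for i in range(n)
    }


def strong_components(graph: Dict[int, List[int]]) -> List[List[int]]:
    """Strongly connected components via Kosaraju's algorithm (iterative)."""
    # Pass 1: record nodes in order of DFS completion.
    order: List[int] = []
    seen: Set[int] = set()
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:  # all neighbours explored: node is finished
                order.append(node)
                stack.pop()

    # Pass 2: traverse the reversed graph in decreasing finish order.
    reverse: Dict[int, List[int]] = {v: [] for v in graph}
    for v, outs in graph.items():
        for w in outs:
            reverse[w].append(v)

    components: List[List[int]] = []
    assigned: Set[int] = set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component: List[int] = []
        todo = [start]
        while todo:
            node = todo.pop()
            component.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(sorted(component))
    return sorted(components)


def hierarchy(docs: List[Set[str]],
              thresholds: List[float]) -> List[Tuple[float, List[List[int]]]]:
    """Assumed iteration scheme: lower thresholds add edges, so clusters only merge."""
    return [(t, strong_components(build_graph(docs, t)))
            for t in sorted(thresholds, reverse=True)]


# Toy documents represented as term sets (illustrative data only).
docs = [
    {"text", "cluster", "sparse"},
    {"text", "cluster", "graph"},
    {"text", "cluster", "sparse", "graph", "matrix"},
    {"protein", "gene"},
    {"gene", "protein", "cell"},
]

for level, clusters in hierarchy(docs, [0.7, 0.65, 0.5]):
    print(level, clusters)
```

Two documents land in the same cluster only when each is reachable from the other, which is what makes the direction of the similarity matter: in the toy data, document 0 is fully contained in document 2, but document 2 is only 60% contained in document 0, so they merge only once the threshold drops far enough.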
Source
《清华大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core Journal
2006, No. 7, pp. 1325-1328 (4 pages)
Journal of Tsinghua University (Science and Technology)
Funding
National High-Tech Research and Development Program of China ("863" Program) (2002AA444120)
Keywords
machine learning
text information processing
text clustering