
Text clustering based on asymmetric similarity (cited by: 7)
Abstract: Text clustering data are sparse, and common clustering methods rely on distance-based dissimilarity. To strengthen the features that discriminate between documents, this paper proposes a method based on asymmetric similarity to measure the association between document objects. An asymmetric similarity measure between text objects is defined. Exploiting the sparsity of the resulting asymmetric similarity matrix, the text objects are clustered by partitioning them into strongly connected components, and a concept hierarchy of the clustering results is built iteratively. Experiments on text data sets show that asymmetric similarity gives higher accuracy and shorter run time than distance-based dissimilarity; when the number of resulting clusters is small, accuracy improves by about 20%.
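The abstract describes the pipeline only in outline, so the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a hypothetical asymmetric measure (the fraction of one document's terms that also occur in the other), a hypothetical fixed threshold for turning the similarity matrix into a sparse directed graph, and Tarjan's algorithm for the strongly connected components. The function names and the toy documents are illustrative.

```python
from typing import Dict, List, Set


def asymmetric_similarity(a: Set[str], b: Set[str]) -> float:
    """Hypothetical measure: share of a's terms that also occur in b.

    In general sim(a, b) != sim(b, a), which is what makes it asymmetric.
    """
    if not a:
        return 0.0
    return len(a & b) / len(a)


def build_graph(docs: List[Set[str]], threshold: float) -> Dict[int, List[int]]:
    """Keep a directed edge i -> j only when sim(i, j) >= threshold (assumed rule)."""
    graph: Dict[int, List[int]] = {i: [] for i in range(len(docs))}
    for i, a in enumerate(docs):
        for j, b in enumerate(docs):
            if i != j and asymmetric_similarity(a, b) >= threshold:
                graph[i].append(j)
    return graph


def strongly_connected_components(graph: Dict[int, List[int]]) -> List[List[int]]:
    """Tarjan's algorithm (recursive, so suited to small graphs only)."""
    index: Dict[int, int] = {}
    low: Dict[int, int] = {}
    stack: List[int] = []
    on_stack: Set[int] = set()
    components: List[List[int]] = []
    counter = 0

    def visit(v: int) -> None:
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a component: pop the stack down to v.
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            components.append(sorted(component))

    for v in graph:
        if v not in index:
            visit(v)
    return sorted(components)


# Toy documents represented as term sets (illustrative data only).
docs = [
    {"text", "cluster", "sparse"},
    {"text", "cluster", "sparse", "matrix"},
    {"graph", "component"},
    {"graph", "component", "strong"},
    {"text", "cluster"},
]

coarse = strongly_connected_components(build_graph(docs, threshold=0.6))
fine = strongly_connected_components(build_graph(docs, threshold=0.7))
print(coarse)
print(fine)
```

In this sketch each strongly connected component is treated as one cluster, and a stricter threshold splits clusters into smaller ones. How the paper iterates to build its concept hierarchy is not stated in the abstract, so varying the threshold is only one possible reading.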
Source: Journal of Tsinghua University (Science and Technology), indexed in EI, CAS, CSCD, and the Peking University Core Journals list, 2006, No. 7, pp. 1325-1328 (4 pages)
Funding: National "863" High-Tech Program of China (2002AA444120)
Keywords: machine learning; text information processing; text clustering

References (6)

  • 1 Han J, Kamber M. Data Mining: Concepts and Techniques [M]. San Francisco: Morgan Kaufmann Publishers, 2001.
  • 2 Salton G, Wong A, Yang C S. A vector space model for automatic indexing [J]. Communications of the ACM, 1975, 18(11): 613-620.
  • 3 Modha D, Spangler S. Feature weighting in k-means clustering [J]. Machine Learning, 2003, 52(3): 217-237.
  • 4 Beil F, Ester M, Xu X. Frequent term-based text clustering [C] // Proc 8th Int Conf on Knowledge Discovery and Data Mining. New York: ACM Press, 2002: 436-442.
  • 5 Pissanetzky S. Sparse Matrix Technology [M]. London: Academic Press, 1984.
  • 6 Lewis D, Yang Y, Rose T, et al. RCV1: a new benchmark collection for text categorization research [J]. Journal of Machine Learning Research, 2004, 5(Apr): 361-397.

Co-cited literature: 57

Citing literature: 7

Second-level citing literature: 29
