
Adaptive Text Clustering Based on Statistical Learning (cited by: 2)

Research of Adaptive Text Clustering Based on the Statistics of the Datasets
Abstract: The high dimensionality and sparseness of text data cause traditional clustering algorithms to perform unsatisfactorily on text clustering. To address this, the proposed method computes a document similarity matrix and, during clustering, dynamically learns statistics about the partitioned and unpartitioned document sets. It detects the largest dense region in the remaining unpartitioned data that has low coverage with the already partitioned clusters, and in this way gradually generates a predetermined number of initial cluster centroid sets. The remaining documents are then assigned to the most similar initial centroid set, which completes the clustering and effectively reduces the sensitivity of partitional clustering to the initial cluster centers. The threshold parameters used in the algorithm are obtained by statistical learning on the dataset during clustering, which avoids the blindness of setting thresholds by experience or experiment, as most clustering algorithms do, and makes the method more robust across datasets. Experiments on several Chinese and English datasets show that the algorithm performs well across datasets and outperforms the clustering algorithms in the CLUTO toolkit.
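Source: Journal of Sichuan University (Engineering Science Edition) (indexed in EI, CAS, CSCD, and the Peking University Core Journals list), 2012, No. 1, pp. 106-111, 117 (7 pages)
Funding: National Key Technology R&D Program of China (2007BAH08802); Shaanxi Province "13115" Science and Technology Innovation Project, Major Special Project (2007ZDKG-57)
Keywords: clustering; vector space model (VSM); similarity; partition; threshold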
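
The procedure outlined in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which statistics produce the thresholds, so the mean pairwise cosine similarity stands in for the data-derived similarity threshold, and `max_coverage` is a hypothetical fixed constant where the paper learns its value from the data. The function names `cosine` and `adaptive_cluster` and the whitespace tokenization are likewise illustrative.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def adaptive_cluster(texts, k, max_coverage=0.5):
    """Return one cluster label per text (hypothetical sketch)."""
    # Vector space model: raw term frequencies over whitespace tokens.
    vecs = [Counter(t.lower().split()) for t in texts]
    n = len(vecs)
    sim = [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]

    # Assumption: the similarity threshold is the mean pairwise similarity,
    # so it is computed from the dataset rather than set by hand.
    pairs = [sim[i][j] for i in range(n) for j in range(i + 1, n)]
    threshold = sum(pairs) / len(pairs) if pairs else 0.0

    label = [-1] * n  # -1 marks a document that is not yet partitioned
    seeds = []        # initial cluster center sets, one list per cluster
    for c in range(k):
        best_region = []
        for i in range(n):
            if label[i] != -1:
                continue
            # Neighborhood of i: all documents at least threshold-similar.
            hood = [j for j in range(n) if j == i or sim[i][j] >= threshold]
            free = [j for j in hood if label[j] == -1]
            # Coverage: share of the neighborhood already partitioned.
            coverage = 1.0 - len(free) / len(hood)
            if coverage <= max_coverage and len(free) > len(best_region):
                best_region = free
        if not best_region:
            break  # no sufficiently uncovered dense region remains
        for j in best_region:
            label[j] = c
        seeds.append(best_region)

    # Assign each remaining document to the most similar center set,
    # measured by average similarity to the members of that set.
    for i in range(n):
        if label[i] == -1 and seeds:
            label[i] = max(
                range(len(seeds)),
                key=lambda c: sum(sim[i][j] for j in seeds[c]) / len(seeds[c]),
            )
    return label


texts = [
    "apple banana fruit juice",
    "banana fruit salad apple",
    "fruit juice apple drink",
    "python code software bug",
    "software bug code compiler",
    "compiler python code test",
]
labels = adaptive_cluster(texts, k=2)
print(labels)
```

On this toy input the two topical groups share no terms, so the first dense region picks up the three fruit documents and the second picks up the three software documents. Real corpora would normally use TF-IDF weighting and a proper tokenizer, especially for Chinese text.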
