
Fast hybrid clustering for Web documents (快速混合Web文档聚类)

Cited by: 3
Abstract: This paper proposes STK-means, a fast hybrid clustering method for Web documents that uses the suffix tree clustering (STC) algorithm to optimize the initial center values of K-means. First, a suffix tree model of the document set is built, and the STC algorithm is used to identify initial clusters, from which the initial center values for K-means are extracted. Second, each internal node of the suffix tree is mapped to a feature term in an M-dimensional vector space model (VSM), and the phrase-based feature weights of each document vector are computed with a TF-IDF scheme extended to phrases. Finally, the K-means algorithm produces the clustering result. Evaluation experiments indicate that the hybrid method clusters documents more effectively than either the ordinary K-means algorithm or the STC algorithm, while remaining as fast as both.
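The three steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions made for brevity: instead of building a suffix tree, it enumerates word phrases of up to three words that occur in at least two documents (a stand-in for the tree's internal nodes); it ranks base clusters by document count times phrase length, a simplification of the STC score; and it uses a smoothed IDF, `log(1 + N/df)`. All function names are hypothetical.

```python
import math
from collections import defaultdict

def shared_phrases(docs, max_len=3):
    """Map each word phrase (up to max_len words) to the set of documents containing it.

    Assumption: phrases shared by >= 2 documents stand in for suffix tree internal nodes.
    """
    index = defaultdict(set)
    for d, text in enumerate(docs):
        words = text.lower().split()
        for i in range(len(words)):
            for n in range(1, max_len + 1):
                if i + n <= len(words):
                    index[tuple(words[i:i + n])].add(d)
    return {p: ds for p, ds in index.items() if len(ds) >= 2}

def tfidf_vectors(docs, phrases):
    """Length-normalized phrase-based TF-IDF vectors, one dimension per shared phrase."""
    terms = sorted(phrases)
    n_docs = len(docs)
    vectors = []
    for text in docs:
        words = text.lower().split()
        vec = []
        for p in terms:
            tf = sum(1 for i in range(len(words) - len(p) + 1)
                     if tuple(words[i:i + len(p)]) == p)
            idf = math.log(1 + n_docs / len(phrases[p]))  # smoothed IDF (assumption)
            vec.append(tf * idf)
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors

def mean(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def initial_centers(phrases, vectors, k):
    """Average the member vectors of the k best-scoring, non-overlapping base clusters.

    May return fewer than k centers if too few disjoint base clusters exist.
    """
    ranked = sorted(phrases.items(),
                    key=lambda item: (-len(item[1]) * len(item[0]), item[0]))
    centers, used = [], []
    for _, members in ranked:
        if any(members & u for u in used):  # skip clusters sharing documents
            continue
        used.append(members)
        centers.append(mean([vectors[d] for d in members]))
        if len(centers) == k:
            break
    return centers

def kmeans(vectors, centers, max_iter=20):
    """Plain Lloyd iterations with squared Euclidean distance; returns cluster labels."""
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(v, centers[c])))
                      for v in vectors]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centers[c] = mean(members)
    return labels

def stk_means_sketch(docs, k):
    """Phrase discovery -> phrase TF-IDF vectors -> seeded K-means."""
    phrases = shared_phrases(docs)
    vectors = tfidf_vectors(docs, phrases)
    return kmeans(vectors, initial_centers(phrases, vectors, k))
```

Called as `stk_means_sketch(docs, 2)` on a list of strings, the function returns one cluster label per document. Seeding K-means from phrase-sharing groups rather than random points is the idea the abstract credits for the improved results; the scoring and overlap handling here are deliberately cruder than a real STC implementation.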
Source: 《计算机工程与应用》 (Computer Engineering and Applications), CSCD, 北大核心 (Peking University Core Journal), 2010, No. 22, pp. 12-15 (4 pages)
Funding: National Key Technology R&D Program of China (No. 2007BAH08B04); Chongqing Science and Technology Support Program (No. 2008AC20084)
Keywords: clustering algorithms; K-means algorithm; suffix tree; Web document clustering; phrase-based similarity

References (16)

  • 1. Manning C D, Raghavan P, Schütze H. An introduction to information retrieval[M]. Cambridge, England: Cambridge University Press, 2009: 349-400.
  • 2. Liu Yuanchao, Wang Xiaolong, Liu Bingquan. An improved initial-value selection algorithm for k-means document clustering[J]. 高技术通讯, 2006, 16(1): 11-15. Cited by: 23
  • 3. Wang Zhong, Liu Guiquan, Chen Enhong. A K-means algorithm with optimized initial center points[J]. 模式识别与人工智能, 2009, 22(2): 299-304. Cited by: 142
  • 4. Wu Wenli, Liu Yushu, Zhao Jihai. A new hybrid clustering algorithm[J]. 系统仿真学报, 2007, 19(1): 16-18. Cited by: 18
  • 5. Huang J Z, Ng M K, Rong H, et al. Automated variable weighting in K-means type clustering[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(5): 657-668.
  • 6. Chim H, Deng Xiao-tie. Efficient phrase-based document similarity for clustering[J]. IEEE Transactions on Knowledge and Data Engineering, 2008, 20(9): 1217-1229.
  • 7. Zamir O, Etzioni O, Madani O, et al. Fast and intuitive clustering of Web documents[C]//Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 1997: 287-290.
  • 8. Zamir O, Etzioni O. Web document clustering: A feasibility demonstration[C]//Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998: 46-54.
  • 9. Ukkonen E. On-line construction of suffix trees[J]. Algorithmica, 1995, 14(3): 249-260.
  • 10. Wang Jian-hua, Li Rui-xu. A new cluster merging algorithm of suffix tree clustering[J]. Intelligent Information Processing III, 2007: 197-203.

