
A Dynamic Density-Based Clustering Algorithm Appropriate to Large-Scale Text Processing

Cited by: 11
Abstract: Traditional density-based clustering algorithms require complicated parameter input and have high time complexity when processing massive data. To address these problems, a new definition of density is given, and on that basis a clustering algorithm is proposed that needs only one simple input parameter and can dynamically identify clusters of non-uniform density. The algorithm is then extended into a two-stage dynamic density-based clustering algorithm that can process massive data. Experiments on synthetic datasets, large-scale datasets from UCI, and English and Chinese text corpora show that the proposed algorithm (TSDDBCA) is simple to parameterize and clusters efficiently, and that it can be applied to clustering large-scale text data.
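The difficulty with non-uniform density that the abstract refers to can be illustrated with a minimal sketch of classic DBSCAN, a traditional density-based method. This is not the algorithm proposed in the paper, whose density definition is not given on this page. The function name `dbscan_1d`, the toy 1-D data, and the parameter values are assumptions chosen for illustration only.

```python
from typing import List, Optional


def dbscan_1d(points: List[float], eps: float, min_pts: int) -> List[int]:
    """Classic DBSCAN on 1-D points (illustrative sketch, not the paper's method).

    Returns one label per point: 0, 1, ... for clusters, -1 for noise.
    A point counts itself among its neighbors.
    """
    n = len(points)
    labels: List[Optional[int]] = [None] * n  # None means "not visited yet"

    def neighbors(i: int) -> List[int]:
        return [j for j in range(n) if abs(points[i] - points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        stack = [j for j in nb if j != i]
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue  # already assigned; do not expand again
            labels[j] = cluster
            nb_j = neighbors(j)
            if len(nb_j) >= min_pts:  # only core points expand the cluster
                stack.extend(nb_j)
    return [label if label is not None else -1 for label in labels]


# Toy data (assumed): two dense groups with spacing 1, one sparse group with spacing 20.
points = [0, 1, 2, 3, 4, 10, 11, 12, 13, 14, 100, 120, 140, 160, 180]

small_eps = dbscan_1d(points, eps=1, min_pts=3)   # sparse group is labeled noise
large_eps = dbscan_1d(points, eps=20, min_pts=3)  # the two dense groups merge
print(small_eps)
print(large_eps)
```

With a small radius the sparse group is discarded as noise, and with a radius large enough to recover it the two dense groups merge, so no single global radius yields all three groups on this data. The user must also choose two interacting parameters (`eps` and `min_pts`), which is the kind of parameter burden the abstract describes.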
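Source: Acta Scientiarum Naturalium Universitatis Pekinensis (《北京大学学报(自然科学版)》), 2013, No. 1, pp. 133-139 (7 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: Supported by the National Natural Science Foundation of China (61070061), the National Social Science Foundation of China (12BYY045), the Ministry of Education Humanities and Social Sciences Youth Fund (11YJCZH086, 12YJCZH281), and the Guangdong Province High-Level Talent Project (粤教师函[2010]79号).
Keywords: text mining; clustering; large-scale data; dynamic density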



