
LCHS: A New Linear-Time Clustering Method Based on Sampling with a Hash Function (cited by: 2)
Abstract: As a classical algorithm in data mining, the k-medoids algorithm suffers from serious shortcomings, such as low efficiency and poor adaptability to large data sets. To address this, the paper proposes and implements LCHS (Linear Clustering Based on Hash Sampling), a hash-based stratification model. The main contributions are: (1) the m-dimensional hypercube is partitioned into buckets along equal-probability intervals, so that each stratum (that is, each hash bucket) holds a similar number of data items, which achieves the effect of stratified sampling at low computational cost; (2) the new algorithm ensures that the sample is sufficiently representative, in the statistical sense, of the full data set; (3) the complexity of the new algorithm is proved to be O(N); (4) comparative experiments show that when the data set contains close to 10,000 items, the new algorithm is two orders of magnitude more efficient than the traditional algorithm, and when it contains close to 8,000 items, the clustering quality is 55% higher than that of the CLARA algorithm.
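The bucketing idea in contribution (1) can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how the equal-probability intervals are obtained, so this sketch assumes each dimension is modelled as an independent normal distribution fitted in one pass, and the names `hash_bucket`, `stratified_sample`, `bins`, and `per_bucket` are hypothetical.

```python
import random
from collections import defaultdict
from statistics import NormalDist, fmean, pstdev


def hash_bucket(point, dists, bins):
    """Map a point to a bucket key: one interval index per dimension."""
    key = []
    for x, dist in zip(point, dists):
        # cdf(x) lies in [0, 1]; scaling by `bins` yields intervals that
        # carry equal probability mass under the assumed distribution.
        key.append(min(int(dist.cdf(x) * bins), bins - 1))
    return tuple(key)


def stratified_sample(points, bins=4, per_bucket=2, seed=0):
    """Draw up to `per_bucket` points from every non-empty hash bucket."""
    rng = random.Random(seed)
    columns = list(zip(*points))
    # ASSUMPTION (not from the paper): independent normal model per dimension.
    # A zero standard deviation is replaced by 1.0 so that cdf() is defined.
    dists = [NormalDist(fmean(col), pstdev(col) or 1.0) for col in columns]

    # One pass over the data: each point is hashed to its bucket.
    buckets = defaultdict(list)
    for p in points:
        buckets[hash_bucket(p, dists, bins)].append(p)

    # Each bucket acts as one stratum of the stratified sample.
    sample = []
    for members in buckets.values():
        sample.extend(rng.sample(members, min(per_bucket, len(members))))
    return sample, buckets


if __name__ == "__main__":
    gen = random.Random(1)
    data = [(gen.gauss(0, 1), gen.gauss(5, 2)) for _ in range(1000)]
    sample, buckets = stratified_sample(data, bins=4, per_bucket=3)
    print(len(buckets), "buckets,", len(sample), "sampled points")
```

Fitting the distributions and hashing the points each take a single pass over the data, which is in keeping with the O(N) claim in contribution (3). A k-medoids run on the small sample, followed by assigning every point to its nearest medoid, would complete the clustering. How well the buckets balance depends on how closely the data follow the assumed distribution.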
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2005, No. 8, pp. 1364-1368 (5 pages)
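Funding: Supported by the National Natural Science Foundation of China (60473071), the National "973" Program (2002CB111504), the Specialized Research Fund for the Doctoral Program of Higher Education, SRFDP (20020610007), and the Guangxi Natural Science Foundation (桂科自0339039).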
Keywords: k-medoids algorithm; cluster analysis; linear time; hash function; sampling

