Abstract
As a classical algorithm in data mining, the k-medoids algorithm suffers from serious drawbacks such as low efficiency and poor adaptability to large data sets. To address these problems, this paper proposes and implements a hash-based stratification model named LCHS (Linear Clustering Based Hash Sampling). The main contributions are: (1) the m-dimensional hypercube is partitioned into buckets by equal-probability regions, so that each layer (i.e., hash bucket) contains a similar number of data items, achieving the effect of stratified sampling at small computational cost; (2) the new algorithm guarantees that the sample is statistically representative of the whole data set; (3) the complexity of the new algorithm is proved to be O(N); (4) comparative experiments show that when the data set size approaches 10,000, the new algorithm is two orders of magnitude more efficient than the traditional algorithm, and when the data set size approaches 8,000, its clustering quality is 55% better than that of the CLARA algorithm.
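The abstract describes LCHS only at a high level. The core idea of contribution (1), cutting each dimension of the data space at equal-probability boundaries so every bucket holds a similar number of points, then sampling from each bucket (layer), can be illustrated with the following Python sketch. This is an assumption-laden illustration, not the paper's algorithm: here the equal-probability cuts are taken at per-dimension empirical quantiles, and the names `equal_probability_buckets`, `stratified_sample`, and `splits_per_dim` are hypothetical.

```python
import random
from bisect import bisect_left
from collections import defaultdict

def equal_probability_buckets(data, splits_per_dim):
    """Assign each m-dimensional point to a hash bucket.

    Each dimension is cut at empirical quantiles, so every
    per-dimension slice holds roughly the same number of points:
    an equal-probability partition of the data hypercube
    (an illustrative stand-in for the paper's hash partition).
    """
    m = len(data[0])
    n = len(data)
    # Sorted coordinate values per dimension, for rank lookup.
    sorted_dims = [sorted(p[d] for p in data) for d in range(m)]

    def bucket_key(point):
        key = []
        for d in range(m):
            # Rank of this coordinate among all values in dimension d
            # determines which quantile slice (bucket index) it falls in.
            rank = bisect_left(sorted_dims[d], point[d])
            key.append(min(rank * splits_per_dim // n, splits_per_dim - 1))
        return tuple(key)

    buckets = defaultdict(list)
    for p in data:
        buckets[bucket_key(p)].append(p)
    return buckets

def stratified_sample(buckets, fraction, rng=random):
    """Draw roughly `fraction` of the points from every layer (bucket)."""
    sample = []
    for layer in buckets.values():
        k = max(1, round(len(layer) * fraction))
        sample.extend(rng.sample(layer, min(k, len(layer))))
    return sample
```

Because every layer contributes in proportion to its size, the sample preserves the population's distribution across layers, which is the statistical-representativeness property that contribution (2) claims for LCHS; the bucketing and sampling passes above are each a single scan over the data, consistent with an O(N) cost.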
Source
《小型微型计算机系统》
CSCD
PKU Core Journal (北大核心)
2005, No. 8, pp. 1364-1368 (5 pages)
Journal of Chinese Computer Systems
Funding
Supported by the National Natural Science Foundation of China (60473071).
Supported by the National "973" Program of China (2002CB111504).
Supported by the Specialized Research Fund for the Doctoral Program of Higher Education, SRFDP (20020610007).
Supported by the Natural Science Foundation of Guangxi (桂科自0339039).