Abstract
Cluster analysis is an important research topic in data mining. As data volumes grow rapidly, clustering large data sets has become a difficult problem. Although the k-Means algorithm is easy to implement and its complexity is linear in the size of the data set, it remains inefficient when applied to large data sets. Distributed clustering is an effective way to address this problem. Building on the existing distributed k-Means algorithm k-DMeans, this paper uses vector inner product inequalities to optimize the algorithm and proposes the distributed clustering algorithm k-DCBIP. Theoretical analysis and experimental results show that k-DCBIP outperforms k-DMeans and can cluster large data sets effectively and efficiently.
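The abstract does not spell out which inner product inequality k-DCBIP relies on. As a minimal sketch of the general idea, assume the Cauchy-Schwarz inequality |x·c| <= ||x|| ||c||, which implies the distance bound | ||x|| - ||c|| | <= ||x - c||. Under that assumption, a centroid whose norm differs from a point's norm by at least the best distance found so far cannot be closer, so its full distance computation can be skipped. The function names below are hypothetical, the sketch covers only the serial assignment step, and it is not the authors' implementation.

```python
import math


def norm(v):
    """Euclidean norm (modulus) of a vector."""
    return math.sqrt(sum(x * x for x in v))


def assign_brute_force(points, centroids):
    """Reference assignment: compute every point-to-centroid distance."""
    labels = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def assign_with_norm_pruning(points, centroids):
    """Assign each point to its nearest centroid, skipping centroids that
    the bound | ||p|| - ||c|| | <= ||p - c|| already rules out.

    Returns (labels, skipped), where skipped counts the avoided
    distance computations.
    """
    centroid_norms = [norm(c) for c in centroids]  # computed once per pass
    labels = []
    skipped = 0
    for p in points:
        p_norm = norm(p)
        best_idx, best_dist = -1, math.inf
        for j, c in enumerate(centroids):
            # The lower bound on the true distance is already no better
            # than the current best, so this centroid cannot win.
            if abs(p_norm - centroid_norms[j]) >= best_dist:
                skipped += 1
                continue
            d = math.dist(p, c)
            if d < best_dist:
                best_idx, best_dist = j, d
        labels.append(best_idx)
    return labels, skipped


if __name__ == "__main__":
    pts = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.2)]
    cts = [(0.0, 0.0), (10.0, 10.0)]
    print(assign_with_norm_pruning(pts, cts))
```

Because the bound needs only precomputed norms, each skipped centroid saves one full d-dimensional distance computation. In a distributed setting, each node could apply the same check to its local partition of the data, although the abstract does not describe how k-DCBIP organizes this.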
Source
《计算机研究与发展》
EI
CSCD
PKU Core (Peking University Core Journals)
2005, Issue 9, pp. 1493-1497 (5 pages)
Journal of Computer Research and Development
Funding
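National Natural Science Foundation of China (70371015)
Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education of China (20040286009)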
Keywords
distributed clustering
norm of a data point
vector inner product
vector inner product inequality