Abstract
Each time the K-means algorithm updates the cluster centers, it recomputes the distance between every point in the dataset and the latest centers in order to obtain each point's new cluster. This global iterative computation makes the traditional K-means algorithm time-inefficient. As the dataset grows, its time efficiency and clustering performance degrade too quickly, so the traditional K-means algorithm is not suitable for clustering in big data environments. To address time efficiency and performance optimization in big data scenarios, an optimized K-means algorithm based on Spark that updates safe intervals is proposed. After each update of the cluster centers, the algorithm updates the safe interval labels and, according to whether a label is greater than 0, determines in one step the cluster of all data falling within that interval. This avoids computing the distance between every point and the centers and reduces the time and computing resource overhead caused by global iteration. The algorithm builds on the point vector model of the Spark MLlib component to optimize model performance. The optimized K-means and the traditional K-means were compared experimentally on two metrics: the average error criterion and running time. The results show that the proposed algorithm outperforms the traditional K-means clustering algorithm on both metrics and is suitable for data clustering in big data environments.
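The abstract does not spell out how the safe interval label is computed, so the following is only a minimal single-machine sketch of the general idea, not the authors' Spark implementation. It assumes a triangle-inequality margin: for each point, half the gap between its distances to the second-nearest and nearest centers, reduced after every update by the largest center movement. While the margin stays greater than 0 the point cannot change cluster, so its distances are not recomputed. The names `nearest_with_margin`, `kmeans_safe_margin`, and `full_checks` are hypothetical, and the sketch assumes at least two centers.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_with_margin(p, centers):
    """Return (index of nearest center, safety margin) for point p.

    Assumed margin: half the gap between the second-nearest and nearest
    distances. Requires at least two centers.
    """
    ranked = sorted((dist(p, c), j) for j, c in enumerate(centers))
    (d1, j1), (d2, _) = ranked[0], ranked[1]
    return j1, (d2 - d1) / 2.0


def kmeans_safe_margin(points, centers, max_iter=100):
    """K-means that skips distance computation for points with margin > 0."""
    centers = [list(c) for c in centers]
    assign = [0] * len(points)
    margin = [-1.0] * len(points)  # non-positive forces a first full pass
    full_checks = 0                # number of full nearest-center searches

    for _ in range(max_iter):
        # Only points whose margin is used up are compared to all centers.
        for i, p in enumerate(points):
            if margin[i] <= 0:
                assign[i], margin[i] = nearest_with_margin(p, centers)
                full_checks += 1

        # Recompute each center as the mean of its members.
        new_centers = []
        for j, c in enumerate(centers):
            members = [points[i] for i in range(len(points)) if assign[i] == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(c)  # keep an empty cluster's center in place

        # Every center moved by at most max_shift, so each margin shrinks by it.
        max_shift = max(dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if max_shift == 0:
            break
        margin = [m - max_shift for m in margin]

    return centers, assign, full_checks
```

Calling `kmeans_safe_margin(points, initial_centers)` returns the final centers, the cluster index of each point, and the number of full nearest-center searches performed; comparing that count with `len(points)` times the number of iterations gives a rough sense of how much work the margin test avoids on a given dataset.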
Source
Computer Technology and Development (《计算机技术与发展》)
2017, Issue 8, pp. 1-6 (6 pages)
Funding
Jiangsu Agricultural Science and Technology Independent Innovation Fund Project (CX(16)1006)