Abstract
The traditional K-means clustering algorithm is sensitive to its initial cluster centers, and existing methods for optimizing those centers lack objectivity. To address both problems, this paper proposes a K-means algorithm whose initial centers are optimized using the spatial distribution density of the samples. The algorithm defines the density of each data object from the spatial distribution of the samples, and defines the neighborhood of each data object from the spatial information of the whole data set. On this basis, it selects data objects that lie in dense regions of the data set and are far apart from one another as the initial centers, and then runs K-means clustering. Experiments on data sets from the UCI machine learning repository and on randomly generated synthetic data sets with different proportions of noise, evaluated with several measures, show that the algorithm achieves good clustering results with a short run time and is robust to noisy data. It outperforms the traditional K-means clustering algorithm and existing algorithms for optimizing the initial centers of K-means.
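The seeding idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions of density or neighborhood, so the choices below are assumptions made only for illustration: the neighborhood radius is taken as the mean pairwise distance of the data set, a point's density is the number of other points within that radius, and only points with at least mean density are eligible as seeds. The function names `density_seeds` and `kmeans` are hypothetical and do not come from the paper.

```python
import math


def density_seeds(points, k):
    """Pick k initial centers that are dense and far from each other.

    Assumptions (not taken from the paper): radius = mean pairwise
    distance; density = number of other points within that radius.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Mean over the n*(n-1) ordered pairs; the zero diagonal adds nothing.
    radius = sum(map(sum, dist)) / (n * (n - 1))
    density = [sum(1 for j in range(n) if j != i and dist[i][j] <= radius)
               for i in range(n)]
    # Restricting seeds to points of at least mean density keeps
    # isolated noise points out of the candidate set.
    mean_density = sum(density) / n
    candidates = [i for i in range(n) if density[i] >= mean_density]
    if k > len(candidates):
        raise ValueError("not enough dense points for k seeds")
    # First seed: a densest candidate.
    seeds = [max(candidates, key=lambda i: density[i])]
    while len(seeds) < k:
        rest = [i for i in candidates if i not in seeds]
        # Next seed: the candidate whose nearest chosen seed is farthest away.
        seeds.append(max(rest, key=lambda i: min(dist[i][s] for s in seeds)))
    return [points[i] for i in seeds]


def kmeans(points, centers, max_iter=100):
    """Standard Lloyd iterations starting from the given centers."""
    centers = list(centers)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(len(centers)),
                      key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

Calling `kmeans(points, density_seeds(points, k))` on a list of coordinate tuples runs the clustering from the density-based seeds. Because the seeding is deterministic, repeated runs on the same data give the same result, unlike random initialization.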
Source
《计算机应用研究》
CSCD
PKU Core Journal (北大核心)
2012, No. 3, pp. 888-892 (5 pages)
Application Research of Computers
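Funding
Fundamental Research Funds for the Central Universities, Key Project (GK200901006)
Natural Science Basic Research Plan of Shaanxi Province (2010JM3004)
Fundamental Research Funds for the Central Universities (GK201001003)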
Keywords
clustering
K-means clustering
initial centers
neighborhood
density of pattern distribution