Abstract
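Apriori, the classic algorithm for association rule mining, must scan the database many times while generating frequent itemsets, which is costly in both space and time. Many improved versions of Apriori have since been proposed, but most approach the problem from the perspective of data storage structures, and few studies consider the properties of the data set itself. To address this, a transaction-sampling association rule mining algorithm based on clustering is proposed: transactions are clustered to obtain a transaction subset that reflects the characteristics of the original transaction data, and mining and analysis are then carried out on that subset. The method was tested on 8 synthetic data sets of different sizes and 1 real data set. On a smaller synthetic data set it saved 0.03 s compared with the original method; the larger the data set, the more time it saved, and on a data set of size 15 000 and dimension 30 it saved 70 s. On the real data set, under different parameter settings, it took only 50% of the time of the original method. The experiments show that the method is more efficient than the traditional Apriori algorithm, and that the improvement is more pronounced on large data. The idea behind the algorithm can also be extended to other improved Apriori algorithms.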
Association rule mining is an important research branch of data mining. Its typical algorithm, Apriori, has a serious problem: it needs to scan the data set many times and consumes much time and memory. When both the data size and the dimension are very large, this cost may be intolerable. With the arrival of the big data era, finding frequent itemsets is becoming more and more difficult. To solve this problem, the authors proposed a new method based on clustering and the typical Apriori algorithm. It first found a representative subset of the raw data set with a clustering algorithm, and then mined and analyzed that subset. Experiments were carried out on 8 toy data sets of different sizes and a real data set of game property transactions. On the toy data, the method reduced running time by 0.03 seconds and 70 seconds on data sets of size 200 × 10 and 15 000 × 20, respectively. On the real data set, the method took only half the time of the original method. The experimental results demonstrate the effectiveness of the method.
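The cluster-then-mine pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not name the clustering algorithm, the number of clusters, or the sampling rate, so plain k-means on 0/1 item vectors and proportional per-cluster sampling are assumptions here, and the names `kmeans_binary`, `cluster_sample`, and `apriori` are hypothetical.

```python
import random
from itertools import combinations


def kmeans_binary(vectors, k, iters=10, seed=0):
    """Plain k-means on 0/1 item vectors (an assumed choice of clustering)."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign each transaction to its nearest center (squared Euclidean).
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])),
            )
        # Move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_sample(transactions, items, k, fraction, seed=0):
    """Draw the same fraction (0 < fraction <= 1) from every cluster."""
    vectors = [[1 if it in t else 0 for it in items] for t in transactions]
    labels = kmeans_binary(vectors, k, seed=seed)
    rng = random.Random(seed)
    sample = []
    for c in range(k):
        members = [t for t, lab in zip(transactions, labels) if lab == c]
        if members:
            n = max(1, round(fraction * len(members)))
            sample.extend(rng.sample(members, n))
    return sample


def apriori(transactions, min_support):
    """Textbook Apriori; returns {frequent itemset: relative support}."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    singles = {frozenset([i]) for t in transactions for i in t}
    level = {s: support(s) for s in singles}
    level = {s: sup for s, sup in level.items() if sup >= min_support}
    frequent = dict(level)
    size = 2
    while level:
        prev = list(level)
        # Join step: merge frequent (size-1)-itemsets into size-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == size}
        # Prune step: every (size-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
        level = {c: support(c) for c in candidates}
        level = {c: sup for c, sup in level.items() if sup >= min_support}
        frequent.update(level)
        size += 1
    return frequent
```

To use the sketch, one would call `cluster_sample` on the full transaction list (each transaction a `frozenset` of items) and pass the resulting subset to `apriori` in place of the full data. Each support count then scans only the sample, which is where a time saving would come from; the supports are estimates of the full-data values.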
Source
《计算机应用》
CSCD
PKU Core Journal (北大核心)
2015, Issue A02, pp. 77-79, 84 (4 pages in total)
Journal of Computer Applications
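Funding
Science and Technology Program of Shandong Provincial Higher Education Institutions (J15LN55)
Shandong Provincial Vocational and Adult Education Research Planning Project (2014zcj015)
Shandong Provincial Teaching Reform Project (YCXY-X2014011)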