摘要
文章针对频繁项集挖掘中传统串行Eclat算法面对海量数据时挖掘效率不高的问题,提出一种海量数据下的并行频繁项集挖掘算法,即I-SPEclat算法。首先,对Eclat算法存在的缺陷进行改进,引入图的邻接矩阵作为数据的存储结构,避免了大量的交集运算;其次,利用先验性质对候选项集进行预剪枝和后剪枝,减少无用候选项集的数量,节约存储空间;再次,根据项集的前缀对数据进行划分,平衡每个计算节点的工作负载;最后,将改进的Eclat算法在Spark分布式计算框架上实现并行化。实验结果表明,I-SPEclat算法较已有的改进Eclat算法在时间消耗和内存消耗方面均有减少,且面对不同规模的数据集也有着良好的扩展性。
Aiming at the problem that the traditional serial Eclat algorithm in frequent itemset mining is not efficient when faced with mass data,this paper proposes a parallel frequent itemset mining algorithm under massive data,that is,I-SPEclat algorithm.The algorithm first improves the defects of Eclat algorithm,and introduces the adjacency matrix of graph as the storage structure of data,which avoids a large number of intersection operations.Then,the paper uses a priori nature to pre-cut and post-cut the candidate set,reduces the number of useless candidate sets and saves storage space.After that,this paper divides the data according to the prefix of the itemset,and balances the workload of each computing node.Finally,the paper parallelizes the improved Eclat algorithm on the Spark distributed computing framework.The experimental results show that the I-SPEclat algorithm is less time-consuming and memory-consuming than the existing improved Eclat algorithm,and also very scalable in the face of data sets with different sizes.
作者
敖孟飞
石鸿雁
Ao Mengfei;Shi Hongyan(School of Science,Shenyang University of Technology,Shenyang 110870,China)
出处
《统计与决策》
CSSCI
北大核心
2022年第18期48-53,共6页
Statistics & Decision
基金
国家自然科学基金资助项目(61074005)。