Abstract
To perform association rule mining accurately in a big data environment, this paper improves the association rule mining algorithm Apriori on top of the distributed computing framework Spark. The improvement addresses the single-machine memory limits and the performance shortcomings encountered when Apriori processes large-scale data, while preserving the accuracy of the results. The algorithm's effectiveness is evaluated on an open-source dataset and a massive trajectory dataset. The experimental results show that, compared with the traditional Apriori, the improved Apriori produces results of the same accuracy, and that the volume of data to be mined can be scaled flexibly by increasing the number of processing nodes, so that association rule mining is no longer limited by data scale.
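The abstract does not describe the internals of the improved algorithm, so the following is only a minimal sketch of the general idea it relies on: support counting in Apriori can be split across data partitions and the partial counts merged, which gives the same frequent itemsets as a single-machine run. It is written in plain Python rather than against the Spark API, and every name in it (`count_partition`, `generate_candidates`, `apriori_partitioned`) is hypothetical and not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def count_partition(partition, candidates):
    """Map step: count, within one partition, the transactions containing each candidate."""
    counts = Counter()
    for transaction in partition:
        for candidate in candidates:
            if candidate <= transaction:  # candidate is a subset of the transaction
                counts[candidate] += 1
    return counts


def generate_candidates(frequent, k):
    """Join frequent (k-1)-itemsets into k-itemsets and prune by the Apriori property."""
    frequent = set(frequent)
    candidates = set()
    for a in frequent:
        for b in frequent:
            union = a | b
            # Keep a k-itemset only if all of its (k-1)-subsets are frequent.
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori_partitioned(partitions, min_support):
    """Return {itemset: support count} for itemsets with count >= min_support.

    Assumption: `partitions` is a list of lists of transactions, standing in
    for the data blocks that a distributed framework would hand to its workers.
    """
    partitions = [[frozenset(t) for t in part] for part in partitions]
    candidates = {frozenset([item]) for part in partitions for t in part for item in t}
    result = {}
    k = 1
    while candidates:
        # Reduce step: merge per-partition counts into global counts.
        total = Counter()
        for part in partitions:
            total.update(count_partition(part, candidates))
        frequent = {c: n for c, n in total.items() if n >= min_support}
        result.update(frequent)
        k += 1
        candidates = generate_candidates(frequent.keys(), k)
    return result


if __name__ == "__main__":
    data = [
        [{"a", "b", "c"}, {"a", "b"}],
        [{"a", "c"}, {"b", "c"}, {"a", "b", "c"}],
    ]
    for itemset, count in sorted(
        apriori_partitioned(data, 3).items(), key=lambda kv: sorted(kv[0])
    ):
        print(sorted(itemset), count)
```

Because the global count of an itemset is the sum of its per-partition counts, splitting the data changes where the counting happens but not the result, which matches the abstract's claim of equal accuracy. In an actual Spark job the per-partition loop would run on the workers and the merge would be a reduce operation.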
Source
Information Technology (《信息技术》)
2018, Issue 2, pp. 153-158 (6 pages)