Abstract
To address the memory bottlenecks and mining failures that the serial FP-Growth algorithm encounters when mining massive data sets, this paper optimizes the Spark-based FP-Growth algorithm in two respects: the data grouping strategy and the item header table structure. First, an S-shaped grouping method is proposed that balances load weights across groups. Second, a new item header table structure is designed that includes a hash lookup table, which effectively reduces the time complexity of lookups. Experiments show that the optimized Spark-based FP-Growth algorithm (the OptFP-Spark algorithm) achieves a higher parallel speedup, better parallel mining performance, and higher computational efficiency.
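The two ideas named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual procedure, so everything below is an assumption made for illustration: the load weight of an item is taken to be a plain number (for example its support count), "S-shaped" is read as a serpentine assignment over items sorted by descending weight, and the names `s_shaped_groups` and `HeaderTable` are hypothetical. It is written in plain Python rather than against the Spark API.

```python
def s_shaped_groups(weights, num_groups):
    """Assign items to groups in serpentine (S-shaped) order.

    `weights` maps item -> assumed load weight. Items are sorted by
    descending weight, then dealt left-to-right on even rounds and
    right-to-left on odd rounds, so heavy and light items pair up.
    """
    ordered = sorted(weights, key=lambda item: (-weights[item], item))
    groups = [[] for _ in range(num_groups)]
    for idx, item in enumerate(ordered):
        round_no, pos = divmod(idx, num_groups)
        # Reverse direction on every other round to trace the "S".
        g = pos if round_no % 2 == 0 else num_groups - 1 - pos
        groups[g].append(item)
    return groups


class HeaderTable:
    """Item header table with a hash index (hypothetical layout).

    Entries stay in descending-weight order, as in a classic FP-tree
    header table; the dict gives average O(1) lookup by item instead
    of a linear scan over the entries.
    """

    def __init__(self, weights):
        ordered = sorted(weights, key=lambda item: (-weights[item], item))
        self.entries = [
            {"item": it, "count": weights[it], "nodes": []} for it in ordered
        ]
        # Hash lookup table: item -> position of its entry.
        self.index = {e["item"]: pos for pos, e in enumerate(self.entries)}

    def lookup(self, item):
        pos = self.index.get(item)
        return None if pos is None else self.entries[pos]


weights = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4, "f": 3}
groups = s_shaped_groups(weights, 3)
group_loads = [sum(weights[i] for i in g) for g in groups]
table = HeaderTable(weights)
```

With these sample weights, the serpentine pass pairs the heaviest item with the lightest, so all three groups carry the same total load; a simple round-robin pass over the same sorted list would instead give loads of 13, 11, and 9.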
Author
黄婕
HUANG Jie (Hunan Provincial Engineering Research Center for Aircraft Maintenance, Changsha 410124, China; Department of Aviation Electronic Equipment Maintenance, Changsha Aeronautical Vocational and Technical College, Changsha 410124, China; School of Software, Central South University, Changsha 410075, China)
Source
《湖南工业大学学报》
2020, No. 1, pp. 77-84 (8 pages)
Journal of Hunan University of Technology
Funding
Supported by the Scientific Research Fund of Hunan Provincial Education Department (17C0009)