Abstract
Data skew is a common scenario in the analysis and processing of massive data sets. Under data skew, traditional MapReduce join query algorithms cannot take full advantage of the parallel programming model of the Hadoop platform. This paper studies MapReduce join query algorithms under data skew. Because traditional multi-table join query algorithms cannot effectively resolve the performance bottleneck caused by data skew, we design and implement a count skew polling repartition join query optimization algorithm. The algorithm uses HDFS as its data storage layer and distributes data across the compute nodes of a Hadoop cluster through a count-skew and polling-repartition strategy. Experiments show that the proposed algorithm achieves effective load balancing under different data skew rates, makes full use of MapReduce parallelism, and has delivered good performance improvements in practical application scenarios.
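The abstract names the strategy but gives no algorithmic detail, so the following is only a minimal single-process sketch of one common way to combine key-frequency counting with round-robin (polling) partitioning in a repartition join. It is not the authors' implementation. All names (`find_skewed_keys`, `partition`, `join_partition`, `threshold`) are hypothetical, and the choice to replicate the matching records of the other table to every partition is an assumption made so that the join result stays correct.

```python
from collections import Counter, defaultdict
import zlib


def stable_hash(key):
    # Deterministic hash, so partition assignment is reproducible across runs.
    return zlib.crc32(str(key).encode("utf-8"))


def find_skewed_keys(left, threshold):
    # "Count skew" step (assumed): a key is skewed if it occurs more than
    # `threshold` times in the left table.
    counts = Counter(key for key, _ in left)
    return {key for key, n in counts.items() if n > threshold}


def partition(left, right, num_partitions, threshold):
    skewed = find_skewed_keys(left, threshold)
    parts = [{"left": [], "right": []} for _ in range(num_partitions)]
    next_slot = 0
    for key, value in left:
        if key in skewed:
            # "Polling" step (assumed): spread skewed records round-robin.
            target = next_slot
            next_slot = (next_slot + 1) % num_partitions
        else:
            target = stable_hash(key) % num_partitions
        parts[target]["left"].append((key, value))
    for key, value in right:
        if key in skewed:
            # Replicate so every partition can match its share of skewed rows.
            for part in parts:
                part["right"].append((key, value))
        else:
            parts[stable_hash(key) % num_partitions]["right"].append((key, value))
    return parts


def join_partition(part):
    # Plain hash join inside one partition (stands in for one reduce task).
    index = defaultdict(list)
    for key, value in part["right"]:
        index[key].append(value)
    return [(key, lv, rv) for key, lv in part["left"] for rv in index.get(key, ())]


# Hypothetical data: the key "hot" dominates the left table.
left = [("hot", i) for i in range(8)] + [("a", 100), ("b", 200)]
right = [("hot", "H"), ("a", "A"), ("b", "B")]
parts = partition(left, right, num_partitions=4, threshold=3)
joined = [row for part in parts for row in join_partition(part)]
```

With plain hash partitioning, all eight "hot" records would land in a single partition; in this sketch they are spread evenly, at the cost of copying the matching right-table record to every partition.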
Source
《小型微型计算机系统》
CSCD
PKU Core Journal (北大核心)
2018, Issue 2, pp. 367-371 (5 pages)
Journal of Chinese Computer Systems
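Funding
Supported by the National Natural Science Foundation of China (61662015)
Supported by the Key Science and Technology Development Project of the Guangxi Science and Technology Department (桂科攻1598019-3)
Supported by the NSFC-Guangdong Joint Fund Key Project (U1501252)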