CSPRJ: MapReduce Join Query Algorithm Based on Data Skew (cited by: 2)
Abstract: Data skew is a common scenario in massive data analysis and processing. Under data skew, traditional MapReduce join query algorithms cannot take full advantage of the parallel programming model of the Hadoop platform. In this paper, we study MapReduce join query algorithms under data skew. Because traditional multi-table join query algorithms cannot effectively resolve the performance bottleneck caused by data skew, we design and implement a count skew polling repartition join query optimization algorithm. The algorithm uses HDFS as the data storage layer and distributes data across the compute nodes of the Hadoop cluster through a count skew and polling repartition strategy. Experimental results show that the proposed algorithm achieves effective load balancing under different data skew rates, makes full use of MapReduce parallel computing, and has delivered good performance improvements in practical application scenarios.
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2018, No. 2, pp. 367-371 (5 pages)
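Funding: National Natural Science Foundation of China (61662015); Key Science and Technology Development Project of the Guangxi Science and Technology Department (桂科攻1598019-3); NSFC-Guangdong Joint Fund Key Project (U1501252)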
Keywords: data skew; MapReduce; Hadoop; join query; query optimization; load balancing
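
The abstract names two ingredients, counting skew and polling (round-robin) repartitioning, without giving the details of the algorithm. The Python sketch below simulates one common way such a strategy can work, purely as an illustration and not as the authors' implementation. It assumes that a key counts as skewed when its frequency in the larger table exceeds a threshold, that skewed records are dealt to reducers in round-robin order, and that matching records of the smaller table are replicated to every reducer. All function names and parameters (`find_skewed_keys`, `partition`, `reduce_join`, `threshold`, `num_reducers`) are hypothetical.

```python
from collections import Counter, defaultdict
import zlib


def stable_hash(key):
    # Deterministic hash; Python's built-in hash() of str varies per process.
    return zlib.crc32(str(key).encode("utf-8"))


def find_skewed_keys(left, threshold):
    # "Count skew" step (assumed): keys seen more than `threshold` times.
    freq = Counter(key for key, _ in left)
    return {key for key, n in freq.items() if n > threshold}


def partition(left, right, num_reducers, threshold):
    # Assign records of both tables to simulated reducer buckets.
    skewed = find_skewed_keys(left, threshold)
    buckets = [{"left": [], "right": []} for _ in range(num_reducers)]
    next_slot = 0
    for key, value in left:
        if key in skewed:
            # "Polling" step (assumed): deal skewed records out round-robin.
            target = next_slot % num_reducers
            next_slot += 1
        else:
            # Ordinary hash partitioning for non-skewed keys.
            target = stable_hash(key) % num_reducers
        buckets[target]["left"].append((key, value))
    for key, value in right:
        if key in skewed:
            # Replicate so every reducer can join its share of skewed records.
            for bucket in buckets:
                bucket["right"].append((key, value))
        else:
            target = stable_hash(key) % num_reducers
            buckets[target]["right"].append((key, value))
    return buckets


def reduce_join(bucket):
    # Per-reducer hash join of the records that landed in one bucket.
    index = defaultdict(list)
    for key, value in bucket["right"]:
        index[key].append(value)
    return [(key, lv, rv)
            for key, lv in bucket["left"]
            for rv in index.get(key, [])]


def skew_aware_join(left, right, num_reducers=4, threshold=2):
    # Returns the joined rows and the number of left records per reducer.
    buckets = partition(left, right, num_reducers, threshold)
    rows = [row for bucket in buckets for row in reduce_join(bucket)]
    loads = [len(bucket["left"]) for bucket in buckets]
    return rows, loads


if __name__ == "__main__":
    left = [("hot", i) for i in range(8)] + [("a", 100), ("b", 200)]
    right = [("hot", "H"), ("a", "A"), ("b", "B")]
    rows, loads = skew_aware_join(left, right)
    print(len(rows), loads)
```

With plain hash partitioning, all eight `hot` records would land on a single reducer. Here they are spread evenly, at the cost of replicating the one matching `right` record to every reducer. On a real Hadoop cluster, logic of this kind would typically live in a custom partitioner rather than in an in-memory simulation.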
