Abstract
To address the current lack of head-to-head comparisons among large-scale data analysis frameworks, Hadoop, Spark and Flink are compared and evaluated in terms of performance and scalability using representative big data workloads. In addition, the behavior of these frameworks is characterized by varying key workload parameters such as HDFS block size, input data size, interconnection network and thread configuration. The experimental results show that, for non-sorting benchmarks, replacing Hadoop with Spark or Flink reduces average execution time by 77% and 70%, respectively. Overall, Spark delivers the best performance, while Flink greatly improves the performance of iterative algorithms through its explicit iteration support.
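The abstract attributes Flink's advantage on iterative algorithms to its explicit (native) iteration support, in which the entire loop runs inside a single job rather than relaunching a job per pass, as a Hadoop MapReduce loop would. Below is a minimal sketch of a Flink bulk iteration using the DataSet API in Scala, adapted from the standard pi-estimation pattern; the iteration count and the sampling logic are illustrative and are not taken from the paper.

```scala
import org.apache.flink.api.scala._

object FlinkBulkIteration {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Initial data set: a single running hit counter.
    val initial = env.fromElements(0)

    // Bulk iteration: the step function is applied 10000 times inside
    // one Flink job, avoiding per-iteration job scheduling and the
    // intermediate HDFS writes that a MapReduce-based loop would incur.
    val count = initial.iterate(10000) { iterationInput: DataSet[Int] =>
      iterationInput.map { i =>
        val x = Math.random()
        val y = Math.random()
        i + (if (x * x + y * y < 1) 1 else 0) // Monte Carlo hit test
      }
    }

    // Convert the hit count into a pi estimate and print it.
    val result = count.map { c => c / 10000.0 * 4 }
    result.print()
  }
}
```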
Authors
DAI Ming-zhu (代明竹); GAO Song-feng (高嵩峰) (Beijing University of Civil Engineering and Architecture, Beijing 100044, China)
Source
Journal of China Academy of Electronics and Information Technology (《中国电子科学研究院学报》)
Peking University Core Journal (北大核心)
2018, Issue 2, pp. 149-155 (7 pages)
Keywords
Big data
Analytical framework
Benchmark program
Model