Extracting and mining social network information from massive Web data is of both theoretical and practical significance. However, a defining feature of this task is large-scale data processing, which remains a major challenge. MapReduce is a distributed programming model: by implementing just two functions, map and reduce, a distributed task can be carried out. Nevertheless, the model does not directly support processing heterogeneous datasets, which are common on the Web. This article proposes a framework that extends the original MapReduce framework into a new one called Map-Reduce-Merge. It adds a merge phase that can efficiently handle heterogeneous data processing. In addition, some optimizations and improvements are made based on the characteristics of Web data.
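To make the idea of a merge phase concrete, here is a minimal single-process sketch, not the framework described in the abstract. It assumes two hypothetical heterogeneous datasets (`links` and `profiles`), runs an ordinary map/reduce pass over each, and then joins the two reduced outputs on their shared keys. The helper names `map_reduce` and `merge` are illustrative only.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one in-memory map/reduce pass and return {key: reduced_value}."""
    groups = defaultdict(list)
    for record in records:
        # The mapper emits zero or more (key, value) pairs per record.
        for key, value in mapper(record):
            groups[key].append(value)
    # The reducer collapses all values that share a key.
    return {key: reducer(key, values) for key, values in groups.items()}


def merge(left, right, merger):
    """Merge phase (assumed here to be an inner join on shared keys)."""
    return {key: merger(key, left[key], right[key])
            for key in sorted(left.keys() & right.keys())}


# Hypothetical heterogeneous Web data: follower links and user profiles.
links = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]
profiles = [{"user": "alice", "city": "Paris"},
            {"user": "bob", "city": "Oslo"}]

# Each dataset gets its own map and reduce functions.
out_degree = map_reduce(links,
                        lambda link: [(link[0], 1)],
                        lambda key, values: sum(values))
city = map_reduce(profiles,
                  lambda p: [(p["user"], p["city"])],
                  lambda key, values: values[0])

# The merge step combines the two differently shaped outputs.
merged = merge(out_degree, city,
               lambda key, deg, c: {"out_degree": deg, "city": c})
```

The point of the sketch is that the two inputs have different schemas and different map/reduce logic, so a single map/reduce pass cannot combine them directly; the extra merge step does the combination.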
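MapReduce is a popular computing model for processing large-scale data in parallel. To improve MapReduce performance in heterogeneous environments, this paper proposes DTHE (Dynamic Map Reduce scheduling based on the Time-aware of node jobs in Heterogeneous Environments), a dynamic MapReduce scheduling strategy based on awareness of node job time. Before a job executes, DTHE first marks some tasks as node sample tasks and processes them with priority. While the other tasks execute, it analyzes the sample tasks, predicts node performance and data distribution characteristics, and dynamically adopts a corresponding scheduling strategy. While the job runs, it monitors node task status in real time and prefetches the data for each node's next task into local memory. Experimental results show that, in heterogeneous environments, DTHE shortens job execution time by 5.1% and reduces disk I/O, effectively improving MapReduce performance.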
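The abstract does not give the scheduling algorithm itself, so the following is only a rough sketch of the general idea of time-aware scheduling from sample tasks. It assumes that each node's speed can be summarized as seconds per megabyte measured on its sample tasks, and it substitutes a simple greedy earliest-predicted-finish rule for the actual DTHE policy. The node names, timings, and the functions `estimate_speed` and `schedule` are hypothetical; data prefetching is not modeled.

```python
def estimate_speed(sample_timings):
    """Return {node: seconds per MB} from (megabytes, seconds) sample runs."""
    return {node: sum(sec for _, sec in runs) / sum(mb for mb, _ in runs)
            for node, runs in sample_timings.items()}


def schedule(task_sizes_mb, sec_per_mb):
    """Assign each task to the node with the earliest predicted finish time.

    This greedy rule is an assumption made for illustration, not the
    policy used by DTHE.
    """
    busy_until = {node: 0.0 for node in sec_per_mb}
    plan = {node: [] for node in sec_per_mb}
    for task_id, size in enumerate(task_sizes_mb):
        # Predicted finish = time the node becomes free + estimated run time.
        node = min(sorted(busy_until),
                   key=lambda n: busy_until[n] + size * sec_per_mb[n])
        busy_until[node] += size * sec_per_mb[node]
        plan[node].append(task_id)
    return plan, busy_until


# Hypothetical sample-task measurements on two heterogeneous nodes.
sample_timings = {"fast": [(64, 32.0)], "slow": [(64, 80.0)]}
speeds = estimate_speed(sample_timings)

# Five remaining tasks of 64 MB each.
plan, busy_until = schedule([64] * 5, speeds)
```

With these made-up timings the faster node receives four of the five tasks, which is the behavior one would expect from a scheduler that accounts for measured node speed rather than splitting work evenly.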