Abstract: We have witnessed the fast-growing deployment of Hadoop, an open-source implementation of the MapReduce programming model, for data-intensive computing in the cloud. However, Hadoop was not originally designed to run transient jobs in which users need to move data back and forth between storage and computing facilities. As a result, Hadoop is inefficient and wastes resources when operating in the cloud. This paper discusses the inefficiency of MapReduce in the cloud. We study the causes of this inefficiency and propose a solution. Inefficiency mainly occurs during data movement. Transferring large data to computing nodes is very time-consuming and also violates the rationale of Hadoop, which is to move computation to the data. To address this issue, we developed a distributed cache system and a virtual machine scheduler. We show that our prototype can improve performance significantly when running different applications.
Abstract: In this paper, a prefetching technique is proposed to solve the performance problem caused by remote data access delay. The technique first predicts which map tasks will incur the delay and then preloads the input data of these tasks before they are scheduled. During execution, the input data can be read from local nodes, so the delay is hidden. The technique has been implemented in Hadoop-0.20.1. The experimental results show that the technique reduces the number of map tasks that incur the delay and improves the performance of Hadoop MapReduce by 20%.
Funding: Supported by the National Natural Science Foundation of China (11501279, 11501171, 11671188, and 11401604), the Young Backbone Teachers of Luoyang Normal University (2018XJGGJS-10), and Henan Colleges (2015GGJS-193).
Abstract: This paper studies online scheduling of jobs with kind release times on a single machine. Here "kind release time" means that, in the online setting, no job can be released while the machine is busy. Each job J has a kind release time r(J) ≥ 0, a processing time p(J) > 0, and a deadline d(J) > 0. The goal is to determine a schedule that maximizes the total processing time (∑ p(J)E(J)) or the total number (∑ E(J)) of the accepted jobs. For the first objective function ∑ p(J)E(J), we first present a lower bound of √2 and then provide an online algorithm LEJ with a competitive ratio of 3. This is the first deterministic algorithm for the problem with a constant competitive ratio. When p(J) ∈ {1, k}, where k > 1 is a real number, we first present a lower bound of min{(1 + k)/k, 2k/(1 + k)}, and then we show that LEJ has a competitive ratio of 1 + k/k. In particular, when all the jobs of length k have tight deadlines, we first present lower bounds of max{4/(2 + k), 1} (for ∑ p(J)E(J)) and 4/3 (for ∑ E(J)). Then we prove that LEJ is k/k-competitive for ∑ p(J)E(J), and we provide an online algorithm H with a competitive ratio of 2k/(k + 1) for the second objective function ∑ E(J).
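The two closed-form lower bounds quoted in this abstract can be evaluated numerically to see how they behave as k grows. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and only the expressions min{(1 + k)/k, 2k/(1 + k)} and max{4/(2 + k), 1} are taken from the abstract.

```python
import math


def lower_bound_two_lengths(k: float) -> float:
    """Lower bound min{(1 + k)/k, 2k/(1 + k)} for p(J) in {1, k}, k > 1."""
    if k <= 1:
        raise ValueError("the abstract assumes k > 1")
    return min((1 + k) / k, 2 * k / (1 + k))


def lower_bound_tight_deadlines(k: float) -> float:
    """Lower bound max{4/(2 + k), 1} when all length-k jobs have tight deadlines."""
    if k <= 1:
        raise ValueError("the abstract assumes k > 1")
    return max(4 / (2 + k), 1.0)


if __name__ == "__main__":
    # (1 + k)/k decreases in k and 2k/(1 + k) increases in k,
    # so their minimum peaks where the two terms meet, at k = 1 + sqrt(2).
    for k in (1.5, 2.0, 1 + math.sqrt(2), 4.0):
        print(f"k={k:.4f}  two-lengths={lower_bound_two_lengths(k):.4f}  "
              f"tight={lower_bound_tight_deadlines(k):.4f}")
```

The first bound reaches its maximum at k = 1 + √2, where both terms equal √2. That is consistent with reading the general lower bound, which the source text prints as "2(1/2)", as 2^(1/2).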
Funding: This work was supported by the National Basic Research Program of China (973 Program) (2012CB316201), the National Natural Science Foundation of China (Grant No. 61033007), the National Research Foundation for the Doctoral Program of Higher Education of China (20120042110028), and the MOE-Intel Special Fund of Information Technology (MOE-INTEL-2012-06).
Abstract: MapReduce is a popular parallel data-processing system, and task scheduling is one of its core techniques. In many applications, users require that their MapReduce jobs be completed before specific deadlines. Hence, in this paper, a novel scheduling algorithm based on the most effective sequence (SAMES) is proposed for deadline-constrained jobs in MapReduce. First, according to the characteristics of MapReduce, we propose a novel sequence-based execution strategy for MapReduce jobs and a new concept, the effective sequence (ES). Then, we design efficient approaches for finding ESes and choose the most effective sequence (MES) for job execution. We also propose methods for MES updates and exception handling. Finally, we verify the effectiveness of SAMES through experiments. The experimental results show that SAMES is an efficient scheduling algorithm for deadline-constrained jobs in MapReduce.
Funding: The Natural Science Foundation of Jiangsu Province (No. BK2005408).
Abstract: To fulfill the requirements of hybrid real-time system scheduling, a long-release-interval-first (LRIF) real-time scheduling algorithm is proposed. The algorithm uses both fixed and dynamic priorities to assign priorities to tasks. By assigning higher priorities to the aperiodic soft real-time jobs with longer release intervals, it guarantees the execution of periodic hard real-time tasks and further probabilistically guarantees the execution of aperiodic soft real-time tasks. A schedulability test for the LRIF algorithm is presented, and implementation issues are also discussed. Simulation results show that LRIF obtains better schedulability than the maximum urgency first (MUF) algorithm, the earliest deadline first (EDF) algorithm, and EDF for hybrid tasks. LRIF is well suited to scheduling both periodic hard real-time and aperiodic soft real-time tasks.
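Abstract: MapReduce is currently one of the most popular parallel systems for big-data analytics. Many enterprises have built their own MapReduce clusters to provide computing services to a broad base of users. Users can submit MapReduce jobs with completion deadlines to a cluster, and if a job is completed on time, the enterprise earns a certain benefit. For this application scenario, this paper is the first to pose the maximum-benefit problem in MapReduce clusters. To solve the problem effectively, the paper first proposes a sequence-based task scheduling strategy (the SEQ strategy) and proves that the SEQ strategy has advantages when handling jobs with deadline constraints. Based on the SEQ strategy, the paper proposes a scheduling algorithm for maximum benefit (Scheduling Algorithm for Maximum Benefit, abbreviated as the AMB algorithm), which can quickly determine which jobs can be accepted and give an effective execution plan so as to maximize benefit. In addition, for certain exceptional situations in practice (such as node failures), the paper designs an effective timeout-handling strategy, which further increases the practicality of the algorithm. Finally, extensive experiments verify the effectiveness of the proposed algorithm.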
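Abstract: Data skew has long been one of the key problems affecting MapReduce performance. To mitigate data skew, this paper proposes MR-LSP (MapReduce on-line Load balancing mechanism based on Sample Partition), an online load-balancing mechanism for MapReduce based on sample partitioning. Before a job executes, MR-LSP samples and analyzes the source data to predict its distribution characteristics and dynamically adopts a corresponding load-balancing data partitioning strategy. While the job is running, it monitors node load in real time and further dynamically optimizes the data partitioning strategy. Experimental results show that MR-LSP improves system load balance by 3.2% and reduces job execution time by 4.3%, effectively mitigating the data skew problem in MapReduce.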
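The general idea of choosing partition boundaries from a sample before the job runs can be illustrated with a small example. This is a minimal sketch of generic sample-based range partitioning, not the MR-LSP mechanism itself. The abstract does not give MR-LSP's sampling method, partitioning strategy, or runtime monitoring, so every function name and parameter below is a hypothetical stand-in.

```python
import bisect
import random
from collections import Counter


def split_points(sample, num_partitions):
    """Choose num_partitions - 1 boundaries at evenly spaced quantiles of the sample."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_of(key, boundaries):
    """Map a key to a partition index by binary search over the boundaries."""
    return bisect.bisect_right(boundaries, key)


def partition_loads(keys, boundaries, num_partitions):
    """Count how many keys land in each partition."""
    counts = Counter(partition_of(key, boundaries) for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Skewed synthetic keys: most values are small, with a long tail.
    keys = [int(rng.expovariate(1 / 50)) for _ in range(10_000)]
    sample = rng.sample(keys, 500)
    boundaries = split_points(sample, 4)
    print("boundaries:", boundaries)
    print("loads:", partition_loads(keys, boundaries, 4))
```

Quantile boundaries even out the load only as far as the sample reflects the real key distribution. All records with the same key still go to one partition, so a single very frequent key can still overload it.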