Abstract
Spark is a widely used big data computing framework for processing and analyzing explosively growing data. The cloud provides on-demand, pay-as-you-go computing resources to satisfy user requests. Many organizations now deploy big data computing clusters in the cloud, and these clusters must handle Spark job scheduling efficiently to meet the quality of service (QoS) requirements of different users, such as reducing the cost of resource usage and shortening job response time. However, most existing studies do not consider the requirements of multiple users jointly and overlook the characteristics of Spark cluster environments and workloads, which leads to wasted resources and unmet QoS requirements. To address these problems, the job scheduling problem of Spark clusters deployed in the cloud was modeled, and a new Spark job scheduler based on deep reinforcement learning (DRL) was designed to satisfy multiple QoS requirements. A DRL cluster simulation environment was built to train the DRL agent at the core of the scheduler. Within this scheduling environment, two training methods were implemented: one based on the absolute deep Q-network, and one combining proximal policy optimization with generalized advantage estimation. These methods enable the DRL agent to adaptively learn the characteristics of different job types and of dynamic, bursty cluster environments, and to schedule Spark jobs sensibly so as to reduce the total usage cost of the cluster and shorten the average job response time. Test results on a benchmark suite show that, compared with existing Spark job scheduling solutions, the proposed DRL agent scheduler has significant advantages in total cluster usage cost, average job response time, and QoS achievement rate, which confirms the feasibility and effectiveness of the scheduler.
Authors
HE Yulin (何玉林)
MO Peiheng (莫沛恒)
Philippe Fournier-Viger
HUANG Zhexue (黄哲学)
Affiliations: Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen 518107, China; College of Computer Science & Software Engineering, Shenzhen University, Shenzhen 518060, China
Source
Big Data Research (《大数据》)
2025, No. 4, pp. 154-177 (24 pages)
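Funding
Natural Science Foundation of Guangdong Province (No. 2023A1515011667)
Guangdong Basic and Applied Basic Research Foundation (No. 2023B1515120020)
Shenzhen Science and Technology Major Project (No. KJZD20230923114809020)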
Keywords
big data computing
quality of service
Spark job scheduler
cloud environment
deep reinforcement learning