With the development of deep learning, hardware accelerators represented by GPUs have been widely used to accelerate the execution of deep learning applications. A key problem in GPU clusters is how to schedule various deep learning applications, including training applications and latency-critical inference applications, to achieve optimal system performance. In cloud datacenters, inference applications often require fewer resources, so granting one inference application exclusive use of a GPU can waste significant GPU resources. Existing work mainly focuses on co-locating multiple inference applications in datacenters using MPS (Multi-Process Service). This execution pattern has several problems: datacenters may remain in a low-workload state for long periods due to the diurnal pattern of inference applications, MPS-based data sharing can lead to interaction errors between contexts, and resource contention may cause Quality of Service (QoS) violations. To solve these problems, we propose ArkGPU, a runtime system that dynamically allocates resources. ArkGPU improves cluster resource utilization while guaranteeing the QoS of inference applications. ArkGPU comprises a performance predictor, a scheduler, a resource limiter, and an adjustment unit. We conduct extensive experiments on the NVIDIA V100 GPU to verify the effectiveness of ArkGPU. ArkGPU achieves high goodput for latency-critical applications, whose average throughput increases by 584.27% compared to MPS. When multiple applications are deployed simultaneously on ArkGPU, goodput improves by 94.98% compared to k8s-native and by 38.65% compared to MPS.
Funding: supported by the National Key Research and Development Program (Grant No. 2022YFB4501404), the Beijing Natural Science Foundation (4232036), and the CAS Project for Youth Innovation Promotion Association.