Funding: Sponsored in part by NKRDP (2021YFB0300800), in part by NSFC (62102396), the Beijing Nova Program (Z211100002121143, 20220484217), the Youth Innovation Promotion Association of the Chinese Academy of Sciences (2021099), and the Pilot for Major Scientific Research Facility of Jiangsu Province of China (No. BM2021800).
Abstract: Despite advancements in computer hardware, the performance of GROMACS simulations has not improved significantly, primarily due to inefficient utilization of substantial hardware resources. Resource utilization in GROMACS simulations can be improved through effective resource scheduling when running multiple simulations concurrently on a single computing node, which particularly benefits the frequently used small-scale system simulations. Previous research focused on co-running multiple GROMACS simulations using time-slicing technology. However, this approach introduces notable context-switching overhead and concentrates predominantly on optimizing GPU resource utilization, while neglecting the collaborative scheduling of heterogeneous CPU and GPU devices. Various GPU vendors have now introduced hardware partitioning technologies for spatial resource allocation, complementing traditional time-sharing techniques. Moreover, GROMACS is a heterogeneous computing application that alternates computation between the CPU and GPU; notably, GPU utilization can drop as low as 35%. Consequently, coordinated scheduling of both the GPU and CPU is imperative. To leverage hardware partitioning technologies in alignment with GROMACS' runtime characteristics, we propose FILL: a resource scheduling system designed for co-running multiple GROMACS jobs. FILL employs space partitioning to allocate hardware resources and collaboratively schedules CPU and GPU resources, ensuring precise and deterministic allocation of resources to GROMACS jobs. The scheduling aims to improve system throughput while also considering the turnaround time of simulations. Implemented on servers equipped with NVIDIA and AMD GPUs, FILL improved system throughput by up to 167% over the baseline approach and by 27,928% over state-of-the-art alternatives on NVIDIA GPU servers; on AMD GPU servers, it achieved improvements of 459% and 24% over the baseline and state-of-the-art methods, respectively. These results validate the effectiveness of FILL in improving system throughput for multiple GROMACS simulations.
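FILL's implementation is not shown in this abstract; as a rough sketch of the core idea it describes, the following hypothetical Python fragment admits a job onto a node only if disjoint CPU cores and a GPU partition (e.g., a MIG-style slice) are both available, so co-running jobs never time-share the same resources. All class and field names here are illustrative assumptions, not FILL's actual code.

```python
# Hypothetical sketch of space-partitioned CPU/GPU co-scheduling in the
# spirit of FILL: each job receives a disjoint set of CPU cores and a GPU
# partition, checked jointly (the "collaborative" admission decision).
from dataclasses import dataclass, field

@dataclass
class Node:
    cpu_cores: int           # total CPU cores on the node
    gpu_slices: int          # hardware GPU partitions (e.g., MIG instances)
    allocations: dict = field(default_factory=dict)

    def free_cores(self):
        return self.cpu_cores - sum(c for c, _ in self.allocations.values())

    def free_slices(self):
        return self.gpu_slices - sum(g for _, g in self.allocations.values())

    def schedule(self, job_id, cores, slices):
        """Admit a job only if both its CPU and GPU shares fit."""
        if cores <= self.free_cores() and slices <= self.free_slices():
            self.allocations[job_id] = (cores, slices)
            return True
        return False

node = Node(cpu_cores=32, gpu_slices=4)
assert node.schedule("sim-a", cores=8, slices=1)
assert node.schedule("sim-b", cores=8, slices=1)
assert not node.schedule("sim-c", cores=24, slices=1)  # CPU share does not fit
```

Because each admitted job holds an exclusive slice of both devices, the allocation is deterministic and avoids the context-switching overhead of time-slicing that the abstract criticizes.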
Funding: We thank the anonymous referees for their constructive comments. This material was based upon work supported by a DOE Early Career Award, the National Science Foundation (NSF) (1455404 and 1525609), and an NSF CAREER Award. This work is also supported in part by the NSF (CNS-1217372, CNS-1239423, CCF-1255729, CNS-1319353, and CNS-1319417) and the National Natural Science Foundation of China (NSFC) (Grant Nos. 61272143, 61272144, and 61472431). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DOE, NSF, or NSFC.
Abstract: Recent years have witnessed a processor development trend that integrates the central processing unit (CPU) and graphics processing unit (GPU) into a single chip. The integration helps to save some of the host-device data copying that a discrete GPU usually requires, but it also introduces deep resource sharing and possible interference between the CPU and GPU. This work investigates the performance implications of independently co-running CPU and GPU programs on these platforms. First, we perform a comprehensive measurement that covers a wide variety of factors, including processor architectures, operating systems, benchmarks, timing mechanisms, inputs, and power management schemes. These measurements reveal a number of surprising observations. We analyze these observations and produce a list of novel insights, including the important roles of operating system (OS) context switching and power management in determining program performance, and the subtle effect of CPU-GPU data copying. Finally, we confirm those insights through case studies and point out some promising directions for mitigating anomalous performance degradation on integrated heterogeneous processors.
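The paper's measurement harness is not reproduced here; as a minimal sketch of how co-run interference is commonly quantified, the metric below normalizes each program's co-run time to its solo-run time. The function name and sample timings are illustrative assumptions.

```python
# Minimal sketch of a co-run interference metric: slowdown of each program
# when co-running, normalized to its solo run time (1.0 = no interference).
def corun_degradation(solo_times, corun_times):
    """Return per-program slowdown: co-run time divided by solo time."""
    return {prog: corun_times[prog] / solo_times[prog] for prog in solo_times}

solo = {"cpu_prog": 10.0, "gpu_prog": 8.0}    # seconds, each run alone
corun = {"cpu_prog": 15.0, "gpu_prog": 10.0}  # seconds, run together
print(corun_degradation(solo, corun))  # {'cpu_prog': 1.5, 'gpu_prog': 1.25}
```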
Funding: Supported by the National High Technology Research and Development 863 Program of China under Grant No. 2012AA010902, the National Basic Research 973 Program of China under Grant No. 2011CB302504, and the National Natural Science Foundation of China under Grant Nos. 61202055, 60925009, 60921002, and 61100011.
Abstract: The efficiency of batch processing is becoming increasingly important for many modern commercial service centers, e.g., clusters and cloud computing datacenters. However, periodic resource contention has become the major performance obstacle for concurrently running applications on mainstream CMP servers. I/O contention is one such obstacle; it can seriously impede both the co-running performance of batch jobs and overall system throughput. In this paper, a dynamic I/O-aware scheduling algorithm is proposed to mitigate the impact of I/O contention and to enhance co-running performance in batch processing. We set up our environment on an 8-socket, 64-core server in the Dawning Linux Cluster. Fifteen workloads ranging from 8 to 256 jobs are evaluated. Our experimental results show significant improvements in workload throughput, ranging from 7% to 431%. Meanwhile, noticeable improvements in workload slowdown and in the average runtime of each job are achieved. These results show that a well-tuned dynamic I/O-aware scheduler is beneficial for batch-mode services and can also enhance resource utilization via throughput improvement on modern service platforms.
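The paper's algorithm is not given in this abstract; as a rough illustration of the I/O-aware idea, the sketch below packs jobs into co-run batches while capping the number of I/O-heavy jobs per batch, so their I/O phases are less likely to contend. The threshold, batch parameters, and job representation are all assumptions for this example.

```python
# Illustrative sketch (not the paper's algorithm): an I/O-aware batch packer
# that limits how many I/O-heavy jobs co-run at once.
def make_batches(jobs, batch_size, max_io_heavy):
    """Greedily pack jobs into co-run batches, capping I/O-heavy jobs per batch.

    jobs: list of (job_id, io_intensity), io_intensity in [0, 1].
    A job with io_intensity >= 0.5 is treated as I/O-heavy.
    """
    io_heavy = [j for j in jobs if j[1] >= 0.5]
    compute = [j for j in jobs if j[1] < 0.5]
    batches = []
    while io_heavy or compute:
        batch = []
        # take at most max_io_heavy I/O-heavy jobs for this batch
        while io_heavy and len(batch) < max_io_heavy:
            batch.append(io_heavy.pop(0))
        # fill the rest of the batch with compute-heavy jobs
        while compute and len(batch) < batch_size:
            batch.append(compute.pop(0))
        # if compute-heavy jobs run out, top up with remaining I/O-heavy ones
        while io_heavy and len(batch) < batch_size:
            batch.append(io_heavy.pop(0))
        batches.append(batch)
    return batches

jobs = [("j1", 0.9), ("j2", 0.8), ("j3", 0.1),
        ("j4", 0.2), ("j5", 0.7), ("j6", 0.3)]
for b in make_batches(jobs, batch_size=3, max_io_heavy=1):
    print([j for j, _ in b])  # ['j1', 'j3', 'j4'] then ['j2', 'j6', 'j5']
```

A dynamic scheduler in the paper's sense would additionally observe I/O behavior at runtime and reclassify jobs between batches; this static sketch only shows the contention-limiting placement step.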