Funding: Supported by the National Key Research and Development Program of China under Grant No. 2023YFB4502304 and the National Natural Science Foundation of China under Grant Nos. 62272190 and 62302190.
Abstract: The reliability of high-performance computing (HPC) is essential for program execution stability. However, as the hardware fault rate constantly increases, fault-tolerance techniques such as Checkpoint/Restart (C/R) introduce significant system overhead. This paper proposes the Program Error Resilience-Aware Checkpointing Mechanism (ResCheckpointer) to mitigate the overhead of the C/R mechanism. The primary motivation is the observation that crash proneness (i.e., the probability of the program crashing after a fault occurs) varies significantly both across HPC programs and within a single program, which prompts us to adjust checkpoint intervals flexibly for further C/R overhead optimization. Specifically, we first construct graph neural network (GNN) based learning paradigms to uncover the complex error propagation and effect mechanisms hidden within the HPC program's execution flow, and propose Crash-Predictor for efficiently predicting programs' crash proneness. Based on this, we build ResCheckpointer, which equips HPC programs with an intelligent checkpoint interval setting strategy, i.e., denser checkpoints for crash-prone stages and sparser ones for error-resilient stages. Experimental results show that ResCheckpointer can achieve up to a 55.37% C/R cost reduction compared with the baseline C/R mechanism.
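The interval policy in this abstract (denser checkpoints where crash proneness is high, sparser where the program is resilient) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the interval bounds, and the linear mapping are all assumptions made for illustration, and the crash proneness values are taken as given rather than predicted by a GNN.

```python
def checkpoint_interval(crash_proneness, min_interval=60.0, max_interval=600.0):
    """Map a crash proneness in [0, 1] to a checkpoint interval in seconds.

    Hypothetical policy: linear interpolation between two assumed bounds,
    so a more crash-prone stage gets a shorter (denser) interval.
    """
    if not 0.0 <= crash_proneness <= 1.0:
        raise ValueError("crash_proneness must be in [0, 1]")
    # proneness 1.0 -> min_interval, proneness 0.0 -> max_interval
    return max_interval - crash_proneness * (max_interval - min_interval)


def schedule(stage_proneness):
    """Return one checkpoint interval per program stage."""
    return [checkpoint_interval(p) for p in stage_proneness]
```

For example, `schedule([0.0, 0.5, 1.0])` yields intervals that shrink as the assumed crash proneness grows. Any real policy would also have to weigh checkpoint cost against expected recomputation, which this sketch ignores.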
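Abstract: For HPC users, computing cost is one of the key factors in deciding whether to migrate to the cloud. The preemptible instances offered on Alibaba Cloud are a type of on-demand instance intended to reduce the cost of using public cloud computing resources. Their market price fluctuates and is usually far below that of regular on-demand instances, sometimes as low as one tenth of the regular price. A preemptible instance is generally guaranteed to the user for a minimum period after creation, after which it may be released, so such instances are generally suited to stateless application scenarios. This paper proposes an auto-scaling strategy on the public cloud that targets general-purpose HPC cluster schedulers and, based on the user's application software type, job submission patterns, and requirements on performance and cost, automatically deploys and scales out computing resources on the cloud while keeping cost under control. For users, this makes it possible to "only pay for what you want and what you use". Drawing on the rich set of resource specifications and purchasing options on the public cloud, a low-cost HPC auto-scaling solution can be configured using the auto-scaling service, preemptible instances, and checkpoint-based job resumption: when submitting a job, the user can specify a cost cap; the auto-scaling service then automatically finds and scales out preemptible computing resources priced below that cap, while the checkpoint-resume capability ensures that the job can continue computing when the underlying resources are switched. Finally, two high-performance applications, LAMMPS and GROMACS, are used to verify the feasibility and effectiveness of the strategy.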
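The cost-capped selection step described in this abstract might look roughly like the sketch below. It is an assumption-laden illustration, not the paper's scheduler and not an Alibaba Cloud API: `SpotOffer`, `pick_offer`, the per-hour cost cap, and the example prices are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpotOffer:
    instance_type: str      # hypothetical label, not a real instance family
    cores: int              # cores per node
    price_per_hour: float   # current (fluctuating) market price per node


def pick_offer(offers, cores_needed, max_hourly_cost):
    """Return (offer, node_count) for the cheapest fit under the cap, or None.

    Assumes the user's cost cap is expressed as a total hourly budget.
    """
    best = None
    for offer in offers:
        nodes = -(-cores_needed // offer.cores)  # ceiling division
        total = nodes * offer.price_per_hour
        if total <= max_hourly_cost and (best is None or total < best[2]):
            best = (offer, nodes, total)
    return None if best is None else (best[0], best[1])
```

A caller would re-run `pick_offer` whenever prices change or an instance is released, and rely on the job's own checkpoint files to resume on the newly selected nodes.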
Funding: Supported by the National High Technology Research and Development 863 Program of China under Grant No. 2007AA01Z117 and the National Basic Research 973 Program of China under Grant No. 2007CB310900.
Abstract: A high performance computer (HPC) is a large, complex system whose architecture design faces increasing difficulties and risks. Traditional methods, such as theoretical analysis, component-level simulation and sequential simulation, are not applicable to system-level simulation of HPC systems. Even parallel simulation on large-scale parallel machines has many difficulties in scalability, reliability, generality and efficiency. To meet the current needs of HPC architecture design, this paper proposes a system-level parallel simulation platform, ArchSim. We first introduce the architecture of the ArchSim simulation platform, which is composed of a global server (GS), local server agents (LSA) and entities. Secondly, we describe some key techniques of ArchSim, including the synchronization protocol, the communication mechanism and the distributed checkpointing/restart mechanism. We then carry out a comprehensive test of the main performance indices of ArchSim with the phold benchmark and analyze the extra overhead generated by ArchSim. Finally, based on ArchSim, we construct a parallel event-driven interconnection network simulator and a system-level simulator for a small-scale HPC system with 256 processors. The results of the performance test and HPC system simulations demonstrate that ArchSim achieves a high speedup ratio and high scalability on a parallel host machine and supports system-level simulation for the architecture design of HPC systems.