Based on job characteristics and resource availability, this paper proposes an optimistic checkpoint mechanism for dynamic grids (OCM4G). OCM4G determines whether to checkpoint a given job running on a given resource node, and it establishes optimal aperiodic checkpoint intervals by applying knowledge of job characteristics and resource availability. We evaluate OCM4G on a real grid environment (ChitlaGrid), and the results show that it outperforms both periodic checkpointing and the analytical method of calculating aperiodic checkpoint intervals.
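Fault tolerance has long been both a hot topic and a hard problem in high-performance computing (HPC). Checkpointing is a common fault-tolerance technique that dumps the state of a running process to files and later restores it. Containers offer strong resource isolation and can give checkpointing a more suitable runtime environment and carrier, preventing a migrated task from failing because the environment and resources change when the node changes. Combining containers with checkpointing therefore better supports the study and implementation of task migration. This paper focuses on the design and optimization of a checkpoint scheme for Singularity containers based on CRIU (Checkpoint/Restore In Userspace). Drawing on the characteristics of checkpointing in HPC container applications, it presents effective solutions for the safe use of CRIU, migration performance optimization, and preservation of network state. On that basis it extends the checkpoint functionality of Singularity containers and implements a prototype tool, Migrator, to verify container migration performance. The work is expected to provide effective support for future HPC task migration.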
As deep neural networks (DNNs) have been successfully adopted in various domains, training these large-scale models has become increasingly difficult and is often deployed on compute clusters composed of many devices such as GPUs. However, as the cluster grows, so does the likelihood of failures during training. Currently, faults are mainly handled by recording checkpoints and recovering from them, but this approach incurs large overhead and reduces training efficiency even when no error occurs. A low checkpointing frequency leads to a large loss of training time on failure, while a high frequency slows down training itself. To resolve this tension, we propose BAFT, a bubble-aware fault-tolerant framework for hybrid parallel distributed training. BAFT automatically analyzes parallel strategies, profiles runtime information, and schedules checkpointing tasks at the granularity of pipeline stages according to the bubble distribution in training. It achieves higher checkpoint efficiency and introduces less than 1% time overhead, which allows checkpoints to be recorded at high frequency, thereby reducing the time lost in error recovery and avoiding the impact of fault tolerance on training.
The reliability of high-performance computing (HPC) is essential for stable program execution. However, as hardware fault rates keep increasing, fault-tolerance techniques such as Checkpoint/Restart (C/R) introduce significant system overhead. This paper proposes the Program Error Resilience-Aware Checkpointing Mechanism (ResCheckpointer) to mitigate the overhead of C/R. The primary motivation is our observation that crash proneness (i.e., the probability that a program crashes after a fault occurs) varies significantly both across HPC programs and within a single program, which prompts us to adjust checkpoint intervals flexibly to further reduce C/R overhead. Specifically, we first construct graph neural network (GNN) based learning paradigms to uncover the complex error propagation and effect mechanisms hidden within an HPC program's execution flow, and we propose Crash-Predictor to predict a program's crash proneness efficiently. Based on this, we build ResCheckpointer, which provides an intelligent checkpoint interval setting strategy for HPC programs: denser checkpoints in crash-prone stages and sparser checkpoints in error-resilient stages. Experimental results show that ResCheckpointer achieves up to a 55.37% reduction in C/R cost compared with the baseline C/R mechanism.
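The abstracts above all turn on the same trade-off: checkpointing too rarely loses more work per failure, while checkpointing too often adds overhead. A minimal Python sketch of that trade-off follows. It uses the classic first-order approximation usually attributed to Young, interval ≈ sqrt(2 × checkpoint cost × MTBF), which is a textbook baseline and not the method of OCM4G, BAFT, or ResCheckpointer. The function names, the `crash_proneness` parameter, and the assumption that only a fraction of faults lead to a crash are all hypothetical, introduced here for illustration only.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order optimal time between checkpoints (Young's approximation)."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def waste_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate fraction of run time lost.

    First term: time spent writing checkpoints.
    Second term: expected rework, about half an interval per failure.
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


def crash_aware_interval(checkpoint_cost_s: float, mtbf_s: float,
                         crash_proneness: float) -> float:
    """Hypothetical variant: assume only a fraction `crash_proneness` of faults
    actually crash the program, so the effective MTBF between crashes is
    mtbf_s / crash_proneness. Lower proneness gives a longer interval."""
    if not 0.0 < crash_proneness <= 1.0:
        raise ValueError("crash_proneness must be in (0, 1]")
    return young_interval(checkpoint_cost_s, mtbf_s / crash_proneness)


if __name__ == "__main__":
    cost, mtbf = 60.0, 24 * 3600.0  # assumed: 60 s per checkpoint, one fault per day
    base = young_interval(cost, mtbf)
    for label, interval in [("half", base / 2), ("optimal", base), ("double", base * 2)]:
        print(f"{label:8s} interval={interval:9.1f} s  "
              f"waste={waste_fraction(interval, cost, mtbf):.4f}")
    print(f"resilient stage (p=0.25): {crash_aware_interval(cost, mtbf, 0.25):.1f} s")
```

Under these assumptions the interval scales with the inverse square root of the crash proneness, so a stage in which only a quarter of faults cause a crash would checkpoint half as often as a fully crash-prone one.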
Funding (OCM4G): Supported by the National Natural Science Foundation of China (90412010, 60603058, and 60673174), the Ministry of Education of China, and the Program for New Century Excellent Talents in University (NCET-07-0334).
Funding (BAFT): Supported by the National Key R&D Program of China (2021ZD0110104) and the National Natural Science Foundation of China (Grant Nos. 62222210, U21B2017, 61832006, and 62072297).
Funding (ResCheckpointer): Supported by the National Key Research and Development Program of China under Grant No. 2023YFB4502304 and the National Natural Science Foundation of China under Grant Nos. 62272190 and 62302190.
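Funding: National Natural Science Foundation of China (60921062, 61003087); National High Technology Research and Development Program of China ("863" Program) (2009AA01Z102); supported by the National Natural Science Foundation of China, project #60621003.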