BAFT: bubble-aware fault-tolerant framework for distributed DNN training with hybrid parallelism
Authors: Runzhe CHEN, Guandong LU, Yakai WANG, Rui ZHANG, Zheng HU, Yanming MIAO, Zhifang CAI, Jingwen LENG, Minyi GUO. Frontiers of Computer Science, 2025, Issue 1, pp. 29-39 (11 pages)
As deep neural networks (DNNs) have been successfully adopted in various domains, training these large-scale models has become increasingly difficult and is often deployed on compute clusters composed of many devices such as GPUs. However, as the size of the cluster increases, so does the probability of failures during training. Currently, faults are mainly handled by recording checkpoints and recovering from them, but this approach incurs large overhead and reduces training efficiency even when no error occurs. A low checkpointing frequency leads to a large loss of training time when a failure occurs, while a high checkpointing frequency reduces training efficiency. To resolve this trade-off, we propose BAFT, a bubble-aware fault-tolerant framework for hybrid parallel distributed training. BAFT can automatically analyze parallel strategies, profile runtime information, and schedule checkpointing tasks at the granularity of pipeline stages according to the bubble distribution in training. It achieves higher checkpoint efficiency and introduces less than 1% time overhead, which allows checkpoints to be recorded at high frequency, thereby reducing the time lost in error recovery and avoiding the impact of fault tolerance on training.
Keywords: distributed training; fault tolerance; checkpoint; pipeline parallelism; error recovery