Funding: This work was supported in part by the National Science and Technology Major Project under Grant 2022ZD0115200, in part by the Frontier Technique Collaboration Project under Grant QYJS-2023-2801-B, in part by NSFC under Grant 62125403 and Grant 92164301, in part by the Beijing S&T Project under Grant Z221100007722023, in part by the Shanghai Municipal Science and Technology Major Project, in part by the 2022 Special Project on Industrial Foundation Reconstruction and High Quality Development of Manufacturing Industry under Grant CEIEC-2022-ZM02-0245, in part by the Beijing National Research Center for Information Science and Technology, and in part by the Beijing Advanced Innovation Center for Integrated Circuits.
Abstract: Transformer-based large language models (LLMs) have made significant strides in the field of artificial intelligence (AI). However, training these LLMs imposes immense demands on the computational power and bandwidth of hardware systems. Wafer-scale chips (WSCs) offer a promising solution, yet they struggle with limited on-chip memory and complex tensor partitioning. To fully harness the high-bandwidth, low-latency on-chip interconnect of WSCs and to alleviate their on-chip memory limitations, a specialized mapping and architecture co-exploration method is essential. Despite existing efforts in memory optimization and mapping, current approaches fall short in WSC scenarios. To bridge this gap, we introduce TMAC, an architecture-mapping co-exploration framework that integrates recomputation into the design space, fully exploiting optimization opportunities overlooked by existing works. Further, TMAC takes advantage of the superior on-chip interconnect performance of WSCs by incorporating a more flexible tensor partitioning scheme. TMAC then introduces a novel operator-centric encoding scheme (OCES) designed to comprehensively describe the mapping space for training LLMs. Unlike previous studies that focus solely on communication volume analysis based on the mapping, TMAC explores the design space by evaluating the combined impact of mapping and architecture on training performance. However, fully accounting for these untapped optimization opportunities increases the complexity of the design space. To address this, we streamline the simulation process, reducing the time needed for exploration. Compared with AccPar, DeepSpeed, and Megatron, TMAC delivers 3.1×, 2.9×, and 1.6× performance gains, respectively. In terms of memory usage, TMAC requires 3.6× and 3.1× less memory than AccPar and DeepSpeed, respectively, and is comparable to Megatron's full recomputation method.
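The abstract does not spell out OCES or TMAC's cost models, but its core idea — treating recomputation and tensor partitioning as joint coordinates of one search space, and scoring each point against both a per-die memory budget and a performance estimate — can be sketched as follows. This is a minimal illustration: all names, constants, and cost formulas below are invented assumptions, not TMAC's actual implementation.

```python
# Hypothetical sketch of a recomputation-aware mapping search in the spirit of
# the abstract; the cost models are illustrative placeholders, not TMAC's.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Mapping:
    tensor_parallel: int   # ways an operator's tensors are partitioned
    pipeline_stages: int   # ways layers are split across the wafer
    recompute: bool        # discard activations, recompute in the backward pass

def activation_memory_gb(m: Mapping, layers=96, act_per_layer_gb=1.5):
    """Per-die activation footprint; recomputation keeps only boundary checkpoints."""
    per_die = layers * act_per_layer_gb / (m.tensor_parallel * m.pipeline_stages)
    return per_die * (0.1 if m.recompute else 1.0)  # assume ~10% kept as checkpoints

def step_time(m: Mapping, compute=1.0, interconnect_bw=10.0):
    """Relative step time: compute shrinks with parallelism; communication and
    recomputation add overhead. WSC on-chip links make the comm term cheap."""
    t_compute = compute / (m.tensor_parallel * m.pipeline_stages)
    t_comm = 0.02 * m.tensor_parallel / interconnect_bw    # partition sync cost
    t_recompute = 0.3 * t_compute if m.recompute else 0.0  # extra forward work
    return t_compute + t_comm + t_recompute

def explore(mem_budget_gb=4.0):
    """Enumerate the joint space and return the fastest memory-feasible point."""
    candidates = [Mapping(tp, pp, rc)
                  for tp, pp, rc in product([1, 2, 4, 8], [1, 2, 4], [False, True])]
    feasible = [m for m in candidates if activation_memory_gb(m) <= mem_budget_gb]
    return min(feasible, key=step_time)

print(explore())
```

Note how the search only pays the recomputation penalty when the memory budget forces it; with recomputation outside the design space, many high-parallelism points would simply be infeasible, which is the optimization opportunity the abstract says prior work overlooks.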
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 60921062, 61003087, 61120106005, and 61170049.
Abstract: GPGPUs are increasingly being used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world's fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerance mechanisms to offer the reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based, compiler-directed partial recomputing method for achieving efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce our PartialRC method, which recovers from errors detected in a code region by partially re-computing the region, describe a checkpoint-based fault-tolerance framework developed on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC significantly reduces the fault recovery overheads incurred by FullRC, on average by 73.5% when errors occur earlier during execution and by 74.6% when they occur later. In addition, PartialRC also reduces the error detection overheads incurred by FullRC during fault recovery, while incurring negligible performance overhead when no fault occurs.
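As a minimal illustration of why partial recomputation beats full rollback, consider the toy host-side Python sketch below. The actual PartialRC is compiler-directed and operates on CUDA kernels, so the region functions, fault injection, and checkpointing here are simplified stand-ins invented for illustration.

```python
# Toy contrast of PartialRC-style per-region recovery vs. FullRC-style restart.
# Region boundaries and error detection are hypothetical stand-ins for the
# compiler-inserted checkpoints and GPU-side detection described in the paper.
import copy
import random

def run_region(region_fn, state):
    """Run one code region; a detected transient error raises RuntimeError."""
    if random.random() < 0.2:               # injected transient soft error
        raise RuntimeError("soft error detected")
    return region_fn(state)

def partial_rc(regions, state):
    """PartialRC-style: checkpoint before each region, retry only that region."""
    for fn in regions:
        checkpoint = copy.deepcopy(state)    # lightweight per-region checkpoint
        while True:
            try:
                state = run_region(fn, checkpoint)
                break                        # region finished cleanly
            except RuntimeError:
                pass                         # roll back this region only
    return state

def full_rc(regions, state):
    """FullRC-style: any error rolls the whole computation back to the start."""
    start = copy.deepcopy(state)
    while True:
        try:
            s = start
            for fn in regions:
                s = run_region(fn, s)
            return s
        except RuntimeError:
            continue                         # restart from the global checkpoint

regions = [lambda s: s + 1, lambda s: s * 2, lambda s: s - 3]
print(partial_rc(regions, 10), full_rc(regions, 10))  # both compute 19
```

When a fault strikes late in the run, partial_rc replays only the faulting region, whereas full_rc pays for every preceding region again — which mirrors why the reported savings over FullRC are largest for errors occurring later in execution.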