Abstract: Transformer-based large language models (LLMs) have made significant strides in the field of artificial intelligence (AI). However, training these LLMs imposes immense demands on computational power and bandwidth for hardware systems. Wafer-scale chips (WSCs) offer a promising solution, yet they struggle with limited on-chip memory and complex tensor partitioning. To fully harness the high-bandwidth, low-latency on-chip interconnect of WSCs and to alleviate their on-chip memory limitations, a specialized mapping and architecture co-exploration method is essential. Despite existing efforts in memory optimization and mapping, current approaches fall short in WSC scenarios. To bridge this gap, we introduce TMAC, an architecture-mapping co-exploration framework that integrates recomputation into the design space, fully exploiting optimization opportunities overlooked by existing works. Further, TMAC takes advantage of the superior on-chip interconnect performance of WSCs by incorporating a more flexible tensor partition scheme. TMAC then introduces a novel operator-centric encoding scheme (OCES) designed to comprehensively describe the mapping space for training LLMs. Unlike previous studies that focus solely on communication volume analysis based on mapping, TMAC explores the design space by evaluating the combined impact of mapping and architecture on training performance. However, fully accounting for these untapped optimization opportunities enlarges the design space. To address this, we streamline the simulation process, reducing the time needed for exploration. Compared to AccPar, DeepSpeed, and Megatron, TMAC delivers 3.1×, 2.9×, and 1.6× performance gains, respectively. In terms of memory usage, TMAC requires 3.6× and 3.1× less memory than AccPar and DeepSpeed, respectively, and is comparable to Megatron's full recomputation method.
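The abstract's central trade-off, trading recomputation (extra forward passes) for reduced activation memory, can be illustrated with a minimal sketch. This is not TMAC's actual cost model; it is a generic segment-wise checkpointing estimate in the style of Chen et al.'s sqrt(L) strategy, with hypothetical parameters `layers`, `act_per_layer`, and `segment`:

```python
import math

def recompute_tradeoff(layers: int, act_per_layer: float, segment: int):
    """Estimate peak activation memory and extra forward work when
    activations are stored only at segment boundaries and the rest are
    recomputed during the backward pass.

    Returns (peak_activation_memory, extra_forward_layers).
    """
    n_checkpoints = math.ceil(layers / segment)
    # Peak: all stored checkpoints, plus the live activations of the one
    # segment currently being recomputed for its backward step.
    peak_mem = (n_checkpoints + segment) * act_per_layer
    # Each non-checkpoint layer is forwarded a second time.
    extra_fwd = layers - n_checkpoints
    return peak_mem, extra_fwd

# Storing everything for a 64-layer model costs 64 units of activation
# memory; checkpointing every 8 layers cuts the peak to 16 units at the
# price of 56 re-executed forward layers.
peak, extra = recompute_tradeoff(64, 1.0, 8)
print(peak, extra)  # → 16.0 56
```

A co-exploration framework has to weigh exactly this kind of memory-versus-compute curve jointly with the tensor partition choice, since both compete for the same on-chip memory budget.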
Funding: This work was supported in part by the National Science and Technology Major Project under Grant 2022ZD0115200; in part by the Frontier Technique Collaboration Project under Grant QYJS-2023-2801-B; in part by NSFC under Grant 62125403 and Grant 92164301; in part by the Beijing S&T Project under Grant Z221100007722023; in part by the Shanghai Municipal Science and Technology Major Project; in part by the 2022 Special Project on Industrial Foundation Reconstruction and High Quality Development of Manufacturing Industry under Grant CEIEC-2022-ZM02-0245; in part by the Beijing National Research Center for Information Science and Technology; and in part by the Beijing Advanced Innovation Center for Integrated Circuits.