Abstract: Based on the recommendations of the ICTD'09 TPC members, this Special Issue of the Journal of Electronic Science & Technology of China (JESTC) contains 22 high-quality papers selected from the Proceedings of the 2009 IEEE Circuits and Systems International Conference on Testing and Diagnosis (ICTD'09), which was fully sponsored by the IEEE Circuits and Systems Society (CASS), technically co-sponsored by the University of Electronic Science and Technology of China (UESTC), the Chinese Institute of Electronics (CIE), and the China Instrument & Control Society (CIS), and organized by UESTC.
Funding: Supported by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDB44000000.
Abstract: Recently, large Transformer models have achieved impressive results in various natural language processing tasks, but they require enormous numbers of parameters and intensive computation, necessitating deployment on multi-device systems. Current solutions introduce complicated topologies with dedicated high-bandwidth interconnects to reduce communication overhead. To address the complexity of the system architecture and reduce the overhead of inter-device communication, this paper proposes SALTM, a multi-device system based on a unidirectional ring topology and a 2-D model partitioning method that takes quantization and pruning into account. First, a 1-D model partitioning method is proposed to reduce the amount of communication. Then, the block distributed on each device is further partitioned in the orthogonal direction, introducing a task-level pipeline that overlaps communication and computation. To further explore SALTM's performance on a real large model such as GPT-3, we develop an analytical model to evaluate performance and communication overhead. Our simulation shows that a BERT model with 110 million parameters, implemented by SALTM on four FPGAs, can achieve 9.65× and 1.12× speedups compared to a CPU and a GPU, respectively. The simulation also shows that the execution time of 4-FPGA SALTM is 1.52× that of an ideal system with infinite inter-device bandwidth. For GPT-3 with 175 billion parameters, our analytical model predicts that SALTM comprising 16 VC1502 FPGAs or 16 A30 GPUs can achieve inference latencies of 287 ms and 164 ms, respectively.
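The abstract gives no implementation details, so the following is only a generic sketch of how a 1-D partitioned matrix product can be computed on a unidirectional ring. It is not SALTM's actual partitioning scheme. The names `ring_matmul`, `split_cols`, and `split_rows` are hypothetical, devices are simulated as list entries in one process, and the assumption is that each device holds one column block of the weight matrix and one slice of the input, with the input slices circulating around the ring.

```python
# Illustrative sketch only: simulates a ring of devices in a single process.
# The partitioning below is an assumed, generic scheme, not the one from the paper.

def matmul(a, b):
    """Plain dense matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def split_cols(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]

def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]

def ring_matmul(x, w, devices):
    """Compute x @ w with w split column-wise over `devices` ring nodes.

    Assumes the inner dimension and the output width are divisible by `devices`.
    """
    x_slices = split_cols(x, devices)      # device p starts with input slice p
    w_blocks = split_cols(w, devices)      # device p owns output columns p
    # w_tiles[p][k]: rows of device p's block that multiply input slice k
    w_tiles = [split_rows(block, devices) for block in w_blocks]

    n, out_width = len(x), len(w[0]) // devices
    acc = [[[0] * out_width for _ in range(n)] for _ in range(devices)]
    held = list(range(devices))            # held[p]: slice index currently on device p

    for _ in range(devices):
        for p in range(devices):
            k = held[p]
            # Local partial product with the slice that is currently resident.
            acc[p] = add(acc[p], matmul(x_slices[k], w_tiles[p][k]))
        # Unidirectional ring: device p forwards its slice to device p + 1,
        # so device p receives the slice previously held by device p - 1.
        held = [held[(p - 1) % devices] for p in range(devices)]

    # Concatenate the per-device output column blocks.
    return [sum((acc[p][i] for p in range(devices)), []) for i in range(n)]

x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 0], [1, 2, 1, 1]]
print(ring_matmul(x, w, 2) == matmul(x, w))
```

In this sketch each device only ever talks to its ring successor, and after as many steps as there are devices every device holds its complete output block. On real hardware the local product and the forwarding step could run concurrently, which is the kind of communication/computation overlap the abstract attributes to its task-level pipeline.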