Supercomputers and data centers continue to grow in scale and capability to power scientific and intelligent applications. Accelerators such as GPUs, the de facto standard for dense computation, have been widely deployed, which inevitably raises heterogeneous programming and usability issues. To address these issues, SYCL was proposed to let programs run on platforms built from different accelerators and vendors. However, SYCL offers only limited functionality for communication between devices, so SYCL programs resort to MPI or vendor-specific communication libraries, neither of which delivers both portability and performance. To resolve this dilemma, we propose SuCL, a communication library and framework that provides an abstraction layer atop various programming models. SuCL provides unified communication APIs to SYCL programs and leverages vendor-optimized communication libraries to improve performance. To ensure program functionality, SuCL introduces a selection mechanism that chooses suitable communication libraries for SYCL programs at runtime. SuCL also uses additional SYCL features to improve performance and ease of programming. Experiments on different platforms show that SuCL significantly outperforms MPI in micro-benchmarks, and in application evaluations SuCL achieves speedups of up to 60% and 30% on NVIDIA and AMD platforms, respectively.
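The runtime selection idea in this abstract can be illustrated with a small sketch. It is not SuCL's actual API: the `Backend` record, the `BACKENDS` table, the priorities, and the choice of NCCL and RCCL as the vendor libraries are all assumptions made for illustration. The sketch prefers a vendor-optimized library when every participating device belongs to that vendor and falls back to MPI otherwise.

```python
from dataclasses import dataclass

# Hypothetical names throughout; this is not SuCL's real interface.


@dataclass(frozen=True)
class Backend:
    name: str           # communication library, e.g. "nccl", "rccl", "mpi"
    vendors: frozenset  # device vendors this library can serve
    priority: int       # higher is preferred when several backends match


# Assumed table: vendor libraries are fast but vendor-specific, MPI is portable.
BACKENDS = [
    Backend("nccl", frozenset({"nvidia"}), priority=10),
    Backend("rccl", frozenset({"amd"}), priority=10),
    Backend("mpi", frozenset({"nvidia", "amd", "intel", "cpu"}), priority=1),
]


def select_backend(device_vendors, available):
    """Return the highest-priority installed backend that supports every device."""
    vendors = set(device_vendors)
    candidates = [
        b for b in BACKENDS
        if b.name in available and vendors <= b.vendors
    ]
    if not candidates:
        raise RuntimeError(f"no communication backend for {sorted(vendors)}")
    return max(candidates, key=lambda b: b.priority).name
```

With this table, a job running only on NVIDIA devices with NCCL installed would be routed to `"nccl"`, while a job mixing NVIDIA and AMD devices would fall back to `"mpi"`, the only entry that covers both vendors.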
With the rapid advancement of artificial intelligence and high-performance computing, heterogeneous computing platforms have evolved to encompass increasingly diverse architectures. Although SYCL, an open standard for heterogeneous programming, has gained widespread adoption, its mainstream implementations (such as DPC++ and AdaptiveCpp) primarily target SIMT-architecture devices like GPUs. This makes it difficult to adapt SYCL to specialized accelerators such as the Cambricon MLU, which employs a fundamentally different SIMD execution model. Extending SYCL across programming models raises two critical challenges: (1) bridging the programming abstraction gap between SIMT's thread-level parallelism and SIMD's data-level parallelism; and (2) harmonizing SYCL's unified memory model with device-specific memory architectures. This paper proposes a novel cross-programming-model SYCL extension methodology to achieve full SYCL support for SIMD architectures, demonstrated through a comprehensive implementation for the Cambricon MLU platform. Our approach introduces MLU-specific vector programming interfaces while maintaining compatibility with the SYCL standard, enabling seamless integration of SIMD-based accelerators into the SYCL ecosystem. To validate our methodology, we integrated the extended SYCL-MLU implementation into PaddlePaddle's CINN compiler, achieving a geometric mean performance improvement of 9.14% across representative neural networks, including ResNet, YOLOv3, and BERT. This research significantly broadens the application scope of SYCL in heterogeneous programming and provides a systematic methodology for extending SYCL to other SIMD-based hardware platforms.
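The first challenge named in this abstract, the gap between SIMT thread-level parallelism and SIMD data-level parallelism, can be made concrete with a toy example. The sketch below is plain Python and does not use SYCL or any Cambricon interface; the function names and the vector `width` of 4 are hypothetical. It computes the same `a * x + y` operation twice: once as a scalar kernel run by one logical work-item per element, and once as a single instruction stream stepping over fixed-width chunks.

```python
def saxpy_simt(a, x, y):
    """SIMT view: one logical work-item per element, all running the same scalar kernel."""
    def kernel(i):
        # Body executed by work-item i; it only sees its own element.
        return a * x[i] + y[i]

    return [kernel(i) for i in range(len(x))]


def saxpy_simd(a, x, y, width=4):
    """SIMD view: one instruction stream that processes `width` elements per step."""
    out = []
    for start in range(0, len(x), width):
        xs = x[start:start + width]  # stands in for a vector load; the tail chunk may be shorter
        ys = y[start:start + width]
        # Stands in for one vector multiply-add over the whole chunk.
        out.extend(a * xv + yv for xv, yv in zip(xs, ys))
    return out
```

Both functions return the same values; what differs is where the loop over elements lives. In the SIMT form it is implicit in the launch of work-items, while in the SIMD form the program must chunk the data and handle the tail itself, which is the kind of mapping a SYCL extension for a SIMD device has to provide.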
SYCL is a modern, royalty-free heterogeneous programming specification maintained by the Khronos Group. It has recently become more prevalent and mature, prompting various assessments of its performance, portability, and programmability. While previous evaluations have mainly focused on x86 CPUs, NVIDIA GPUs, and AMD GPUs, how well SYCL performs on ARM multi-core CPUs is still unknown. In this paper, we evaluate three SYCL implementations (DPCPP, AdaptiveCPP, and MLIR-SYCL) on ARM multi-core CPUs to uncover performance traps and offer optimization techniques. We use the SYCL-Bench benchmark suite to assess the performance of DPCPP, AdaptiveCPP, and MLIR-SYCL against their OpenMP counterparts. We also measure compiler and runtime overhead to evaluate the usability and productivity of the SYCL implementations. Our empirical results demonstrate that these SYCL implementations can achieve satisfactory performance on ARM multi-core processors. Additionally, we highlight several key optimizations, such as NUMA management, that must be carefully addressed to enhance performance.
Funding (SuCL work): Supported by the National Key R&D Program of China (Grant No. 2023YFB3002202).
Funding (SYCL extension for Cambricon MLU): Supported by the Beijing Science and Technology Planning Project (Grant No. Z231100010323007).