Funding: Supported by the Beijing Science and Technology Planning Project (Grant No. Z231100010323007).
Abstract: With the rapid advancement of artificial intelligence and high-performance computing, heterogeneous computing platforms have evolved to encompass increasingly diverse architectures. While SYCL, an open standard for heterogeneous programming, has gained widespread adoption, its mainstream implementations (such as DPC++ and AdaptiveCpp) primarily target SIMT-architecture devices like GPUs, presenting substantial challenges when adapting to specialized accelerators such as the Cambricon MLU, which employs a fundamentally different SIMD execution model. This cross-programming-model extension encounters two critical challenges: (1) bridging the programming abstraction gap between SIMT's thread-level parallelism and SIMD's data-level parallelism; and (2) harmonizing SYCL's unified memory model with device-specific memory architectures. This paper proposes a novel cross-programming-model SYCL extension methodology to achieve full SYCL support for SIMD architectures, demonstrated through a comprehensive implementation for the Cambricon MLU platform. Our approach introduces MLU-specific vector programming interfaces while maintaining compatibility with the SYCL standard, enabling seamless integration of SIMD-based accelerators into the SYCL ecosystem. To validate our methodology, we integrated the extended SYCL-MLU implementation into PaddlePaddle's CINN compiler, achieving a geometric mean performance improvement of 9.14% across representative neural networks, including ResNet, YOLOv3, and BERT. This research significantly broadens the application scope of SYCL in heterogeneous programming and provides a systematic methodology for extending SYCL to other SIMD-based hardware platforms.
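The SIMT-to-SIMD abstraction gap described above can be illustrated with a minimal sketch. The code below is not the paper's MLU interface; it is a hypothetical plain-C++ analogue showing the same computation expressed two ways: as one logical work-item per element (the shape a SYCL `parallel_for` presents on a GPU) versus re-blocked into fixed-width vector chunks with a scalar tail (the form a compiler must lower work-items into on a vector core). The function names and the vector width `W` are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// SIMT view: conceptually one "thread" per element, as a SYCL parallel_for
// over an N-element range would express it on a GPU.
void saxpy_simt(float a, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); ++i)  // each iteration = one work-item
        y[i] = a * x[i] + y[i];
}

// SIMD view: the same index space re-blocked into fixed-width vector chunks
// plus a scalar tail -- the structure data-level parallelism requires on a
// vector architecture.
void saxpy_simd(float a, const std::vector<float>& x, std::vector<float>& y) {
    constexpr std::size_t W = 8;                // hypothetical vector width
    std::size_t n = x.size(), i = 0;
    for (; i + W <= n; i += W)
        for (std::size_t l = 0; l < W; ++l)     // one hardware vector op
            y[i + l] = a * x[i + l] + y[i + l];
    for (; i < n; ++i)                          // scalar remainder
        y[i] = a * x[i] + y[i];
}
```

Both functions compute the same result over the same index space; bridging the two models amounts to generating the second loop structure (plus the tail handling) from the first, per-work-item formulation.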
Funding: Supported by the National Natural Science Foundation of China (Grant No. 50305035).
Abstract: As the gap between computing demand and achievable performance in the embedded computing domain continues to widen, heterogeneous computing architectures, which deliver better performance as well as lower power within a limited size, are gaining more and more attention. First, the heterogeneous computing model is presented, and the different tightly coupled single-chip heterogeneous architectures and their application domains are introduced. Then, task partitioning methods are described, and several programming model technologies are analyzed and discussed. Finally, the main challenges and future perspectives of High-Performance Embedded Computing (HPEC) are summarized.
Funding: Funded by the China Scholarship Council (2020091-10135).
Abstract: Real-time rendering applications leverage heterogeneous computing to optimize performance. However, software development across multiple devices presents challenges, including data layout inconsistencies, synchronization issues, resource management complexities, and architectural disparities. Additionally, the creation of such systems requires verbose and unsafe programming models. Recent developments in domain-specific and unified shading languages aim to mitigate these issues. Yet current programming models primarily address data layout consistency, neglecting other persistent challenges. In this paper, we introduce RenderKernel, a programming model designed to simplify the development of real-time rendering systems. Recognizing the need for a high-level approach, RenderKernel addresses the specific challenges of real-time rendering, enabling development on heterogeneous systems as if they were homogeneous. This model allows errors due to system heterogeneity to be detected and prevented at compile time. Furthermore, RenderKernel enables the use of common programming patterns from homogeneous environments, freeing developers from the complexities of the underlying heterogeneous systems. Developers can focus on coding unique application features, thereby enhancing productivity and reducing the cognitive load associated with real-time rendering system development.
Funding: This work was partially supported by the National High-tech R&D Program of China (863 Program) (2012AA01A301) and the National Natural Science Foundation of China (Grant No. 61120106005). The MilkyWay-2 project is a great team effort and benefits from the cooperation of many individuals at NUDT. We thank all the people who have contributed to the system in a variety of ways.
Abstract: On June 17, 2013, the MilkyWay-2 (Tianhe-2) supercomputer was crowned the fastest supercomputer in the world on the 41st TOP500 list. This paper provides an overview of the MilkyWay-2 project and describes the design of its hardware and software systems. The key architectural features of MilkyWay-2 are highlighted, including neo-heterogeneous compute nodes integrating commodity off-the-shelf processors and accelerators that share a similar instruction set architecture, powerful networks that employ proprietary interconnection chips to support massively parallel message-passing communications, a proprietary 16-core processor designed for scientific computing, efficient software stacks that provide a high-performance file system, an emerging programming model for heterogeneous systems, and intelligent system administration. We perform extensive evaluation with wide-ranging applications, from the LINPACK and Graph500 benchmarks to massively parallel software deployed in the system.
Funding: Supported by the National Natural Science Foundation of China (Nos. 61033008, 61272145, 60903041, and 61103080), the Research Fund for the Doctoral Program of Higher Education of China (No. 20104307110002), the Hunan Provincial Innovation Foundation for Postgraduate (No. CX2010B028), and the Fund of Innovation in Graduate School of NUDT (Nos. B100603 and B120605), China.
Abstract: OpenCL programming provides full code portability between different hardware platforms and can serve as a good programming candidate for heterogeneous systems, which typically consist of a host processor and several accelerators. However, to make full use of the computing capacity of such a system, programmers are required to manage diverse OpenCL-enabled devices explicitly, including distributing the workload between different devices and managing data transfer between multiple devices. All these tedious jobs pose a huge challenge for programmers. In this paper, a distributed shared OpenCL memory (DSOM) is presented, which relieves users of having to manage data transfer explicitly by supporting shared buffers across devices. DSOM allocates shared buffers in the system memory and treats the on-device memory as a software-managed virtual cache buffer. To support fine-grained shared buffer management, we designed a kernel parser in DSOM for buffer access range analysis. A basic modified-shared-invalid cache coherency protocol is implemented for DSOM to maintain coherency for cache buffers. In addition, we propose a novel strategy to minimize communication cost between devices by launching each necessary data transfer as early as possible. This strategy enables overlap of data transfer with kernel execution. Our experimental results show that the applicability of our method for buffer access range analysis is good and the efficiency of DSOM is high.
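The "launch each transfer as early as possible" strategy in the abstract above is, at its core, a double-buffered pipeline: while chunk i is being computed, chunk i+1 is already in flight. DSOM implements this with OpenCL command queues and events; the sketch below is a hypothetical host-side analogue using `std::async`, with `transfer_chunk` and `compute_chunk` as stand-ins for a device copy and a kernel launch. All names here are illustrative assumptions, not DSOM's API.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Stand-in for a host-to-device transfer of one chunk (hypothetical).
std::vector<int> transfer_chunk(const std::vector<int>& host,
                                std::size_t off, std::size_t len) {
    return std::vector<int>(host.begin() + off, host.begin() + off + len);
}

// Stand-in for a kernel that consumes the transferred chunk (hypothetical).
long compute_chunk(const std::vector<int>& dev) {
    return std::accumulate(dev.begin(), dev.end(), 0L);
}

// Double-buffered pipeline: the next chunk's transfer is launched before
// the current chunk's compute, so transfer overlaps kernel execution.
long pipelined_sum(const std::vector<int>& host, std::size_t chunk) {
    long total = 0;
    auto pending = std::async(std::launch::async, transfer_chunk,
                              std::cref(host), std::size_t{0},
                              std::min(chunk, host.size()));
    for (std::size_t off = 0; off < host.size(); off += chunk) {
        std::vector<int> dev = pending.get();   // wait for this chunk's copy
        std::size_t next = off + chunk;
        if (next < host.size())                 // eagerly start the next copy
            pending = std::async(std::launch::async, transfer_chunk,
                                 std::cref(host), next,
                                 std::min(chunk, host.size() - next));
        total += compute_chunk(dev);            // compute overlaps that copy
    }
    return total;
}
```

In an actual OpenCL implementation the same shape is obtained with non-blocking `clEnqueueWriteBuffer` calls and event dependencies rather than host threads; the scheduling idea is identical.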
Funding: Supported in part by the Key Research and Development Program of Guangdong, China (2021B0101310002), the Natural Science Foundation of China (62172239), and Intel Corporation.
Abstract: The explosive growth of graph data sets has led to an increase in the computing power and storage resources required for graph computing. To handle large-scale graph processing, heterogeneous platforms have become necessary to provide sufficient computing power and storage. The most popular scheme for this is the CPU-GPU architecture. However, the steep learning curve and complex concurrency control of heterogeneous platforms pose a challenge for developers. Additionally, GPUs from different vendors have varying software stacks, making cross-platform porting and verification challenging. Recently, Intel proposed a unified programming model, named oneAPI, to manage multiple heterogeneous devices at the same time. It provides a friendlier programming model for C++ developers and a convenient concurrency control scheme, allowing devices from different vendors to be managed simultaneously. Hence, there is an opportunity to utilize oneAPI to design a general cross-architecture framework for large-scale graph computing. In this paper, we propose a large-scale graph computing framework for multiple types of accelerators built with Intel oneAPI, which we name OneGraph. Our approach significantly reduces data transfer between GPU and CPU and masks the latency by asynchronous transfer, which significantly improves performance. We conducted rigorous performance tests on the framework using four classical graph algorithms. The experimental results show that our approach achieves an average speedup of 3.3x over state-of-the-art partitioning-based approaches. Moreover, thanks to the cross-architecture model of Intel oneAPI, the framework can be deployed on different GPU platforms without code modification, and our evaluation shows that OneGraph incurs less than 1% performance loss compared to the dedicated programming model on GPUs in large-scale graph computing.
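The "classical graph algorithms" such frameworks offload typically follow a level-synchronous, frontier-based pattern: each iteration is one data-parallel kernel over the current frontier, which is the shape a oneAPI/SYCL `parallel_for` dispatches to an accelerator. The sketch below is a hypothetical sequential BFS in that style over a CSR graph; it is not OneGraph's code, and the names are illustrative.

```cpp
#include <vector>

// Level-synchronous BFS over a CSR graph (row_ptr/col_idx). Each pass over
// `frontier` corresponds to one data-parallel kernel launch in a GPU graph
// framework; here it runs sequentially for clarity.
std::vector<int> bfs_levels(const std::vector<int>& row_ptr,
                            const std::vector<int>& col_idx, int src) {
    int n = static_cast<int>(row_ptr.size()) - 1;
    std::vector<int> level(n, -1);          // -1 = unvisited
    std::vector<int> frontier{src};
    level[src] = 0;
    for (int depth = 1; !frontier.empty(); ++depth) {
        std::vector<int> next;
        for (int u : frontier)              // the data-parallel region
            for (int e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
                int v = col_idx[e];
                if (level[v] == -1) {       // first discovery wins
                    level[v] = depth;
                    next.push_back(v);
                }
            }
        frontier.swap(next);                // next level becomes the frontier
    }
    return level;
}
```

Keeping `level` and the frontiers resident on the device across iterations, and transferring only the frontier sizes asynchronously, is the kind of CPU-GPU traffic reduction the abstract credits for OneGraph's speedup.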