Journal Articles
6 articles found
1. SYCL-MLU: unifying SIMT and SIMD in heterogeneous programming
Authors: Runyu Zhou, Yijin Li, Jiacheng Zhao, Ziyang Wang, En Shao, Ziyan Xie, Huimin Cui
CCF Transactions on High Performance Computing, 2026, Issue 1, pp. 94-106 (13 pages)
With the rapid advancement of artificial intelligence and high-performance computing, heterogeneous computing platforms have evolved to encompass increasingly diverse architectures. While SYCL, an open standard for heterogeneous programming, has gained widespread adoption, its mainstream implementations (such as DPC++ and AdaptiveCpp) primarily target SIMT-architecture devices like GPUs, presenting substantial challenges when adapting to specialized accelerators such as the Cambricon MLU, which employs a fundamentally different SIMD execution model. This cross-programming-model extension encounters two critical challenges: (1) bridging the programming abstraction gap between SIMT's thread-level parallelism and SIMD's data-level parallelism; and (2) harmonizing SYCL's unified memory model with device-specific memory architectures. This paper proposes a novel cross-programming-model SYCL extension methodology to achieve full SYCL support for SIMD architectures, demonstrated through a comprehensive implementation for the Cambricon MLU platform. Our approach introduces MLU-specific vector programming interfaces while maintaining compatibility with the SYCL standard, enabling seamless integration of SIMD-based accelerators into the SYCL ecosystem. To validate our methodology, we integrated the extended SYCL-MLU implementation into PaddlePaddle's CINN compiler, achieving a geometric mean performance improvement of 9.14% across representative neural networks, including ResNet, YOLOv3, and BERT. This research significantly broadens the application scope of SYCL in heterogeneous programming and provides a systematic methodology for extending SYCL to other SIMD-based hardware platforms.
Keywords: High performance computing, heterogeneous programming, SYCL, MLU, CINN, PaddlePaddle
2. High performance heterogeneous embedded computing: a review (cited 5 times)
Authors: HE Yongfu, WANG Shaojun, PENG Yu
Instrumentation, 2014, Issue 2, pp. 1-12 (12 pages)
As the gap between computing demand and performance in the embedded computing domain widens, heterogeneous computing architectures, which deliver better performance as well as lower power within a limited size, are gaining more and more attention. First, the heterogeneous computing model is presented, and the different tightly coupled single-chip heterogeneous architectures and their application domains are introduced. Then, task partitioning methods are described, and several programming model technologies are analyzed and discussed. Finally, the main challenges and future perspectives of High Performance Embedded Computing (HPEC) are summarized.
Keywords: HPEC, heterogeneous SoCs, hardware/software partition, heterogeneous programming
3. RenderKernel: High-level programming for real-time rendering systems
Authors: Jinyuan Yang, Soumyabrata Dev, Abraham G. Campbell
Visual Informatics, 2024, Issue 3, pp. 82-95 (14 pages)
Real-time rendering applications leverage heterogeneous computing to optimize performance. However, software development across multiple devices presents challenges, including data layout inconsistencies, synchronization issues, resource management complexities, and architectural disparities. Additionally, the creation of such systems requires verbose and unsafe programming models. Recent developments in domain-specific and unified shading languages aim to mitigate these issues. Yet, current programming models primarily address data layout consistency, neglecting other persistent challenges. In this paper, we introduce RenderKernel, a programming model designed to simplify the development of real-time rendering systems. Recognizing the need for a high-level approach, RenderKernel addresses the specific challenges of real-time rendering, enabling development on heterogeneous systems as if they were homogeneous. This model allows for early detection and prevention of errors due to system heterogeneity at compile-time. Furthermore, RenderKernel enables the use of common programming patterns from homogeneous environments, freeing developers from the complexities of underlying heterogeneous systems. Developers can focus on coding unique application features, thereby enhancing productivity and reducing the cognitive load associated with real-time rendering system development.
Keywords: Heterogeneous programming, high-level programming, real-time rendering, rendering systems
4. MilkyWay-2 supercomputer: system and application (cited 35 times)
Authors: Xiangke LIAO, Liquan XIAO, Canqun YANG, Yutong LU
Frontiers of Computer Science, 2014, Issue 3, pp. 345-356 (12 pages)
On June 17, 2013, the MilkyWay-2 (Tianhe-2) supercomputer was crowned the fastest supercomputer in the world on the 41st TOP500 list. This paper provides an overview of the MilkyWay-2 project and describes the design of its hardware and software systems. The key architectural features of MilkyWay-2 are highlighted, including neo-heterogeneous compute nodes integrating commodity-off-the-shelf processors and accelerators that share a similar instruction set architecture, powerful networks that employ proprietary interconnection chips to support massively parallel message-passing communications, a proprietary 16-core processor designed for scientific computing, efficient software stacks that provide a high performance file system, an emerging programming model for heterogeneous systems, and intelligent system administration. We perform extensive evaluation with wide-ranging applications, from the LINPACK and Graph500 benchmarks to massively parallel software deployed in the system.
Keywords: MilkyWay-2 supercomputer, petaflops computing, neo-heterogeneous architecture, interconnect network, heterogeneous programming model, system management, benchmark optimization, performance evaluation
5. Efficient fine-grained shared buffer management for multiple OpenCL devices
Authors: Chang-qing XUN, Dong CHEN, Qiang LAN, Chun-yuan ZHANG
Journal of Zhejiang University-Science C (Computers and Electronics), 2013, Issue 11, pp. 859-872 (14 pages)
OpenCL programming provides full code portability between different hardware platforms, and can serve as a good programming candidate for heterogeneous systems, which typically consist of a host processor and several accelerators. However, to make full use of the computing capacity of such a system, programmers are requested to manage diverse OpenCL-enabled devices explicitly, including distributing the workload between different devices and managing data transfer between multiple devices. All these tedious jobs pose a huge challenge for programmers. In this paper, a distributed shared OpenCL memory (DSOM) is presented, which relieves users of having to manage data transfer explicitly, by supporting shared buffers across devices. DSOM allocates shared buffers in the system memory and treats the on-device memory as a software-managed virtual cache buffer. To support fine-grained shared buffer management, we designed a kernel parser in DSOM for buffer access range analysis. A basic modified/shared/invalid cache coherency protocol is implemented for DSOM to maintain coherency for cache buffers. In addition, we propose a novel strategy to minimize communication cost between devices by launching each necessary data transfer as early as possible. This strategy enables overlap of data transfer with kernel execution. Our experimental results show that the applicability of our method for buffer access range analysis is good, and the efficiency of DSOM is high.
Keywords: Shared buffer, OpenCL, heterogeneous programming, fine-grained
6. OneGraph: a cross-architecture framework for large-scale graph computing on GPUs based on oneAPI
Authors: Shiyang Li, Jingyu Zhu, Jiaxun Han, Yuting Peng, Zhuoran Wang, Xiaoli Gong, Gang Wang, Jin Zhang, Xuqiang Wang
CCF Transactions on High Performance Computing, 2024, Issue 2, pp. 179-191 (13 pages)
The explosive growth of graph data sets has led to an increase in the computing power and storage resources required for graph computing. To handle large-scale graph processing, heterogeneous platforms have become necessary to provide sufficient computing power and storage. The most popular scheme for this is the CPU-GPU architecture. However, the steep learning curve and complex concurrency control of heterogeneous platforms pose a challenge for developers. Additionally, GPUs from different vendors have varying software stacks, making cross-platform porting and verification challenging. Recently, Intel proposed a unified programming model, named oneAPI, to manage multiple heterogeneous devices at the same time. It provides a friendlier programming model for C++ developers and a convenient concurrency control scheme, allowing devices from different vendors to be managed simultaneously. Hence there is an opportunity to utilize oneAPI to design a general cross-architecture framework for large-scale graph computing. In this paper, we propose a large-scale graph computing framework for multiple types of accelerators built on Intel oneAPI, which we name OneGraph. Our approach significantly reduces data transfer between GPU and CPU and masks the latency through asynchronous transfer, which significantly improves performance. We conducted rigorous performance tests on the framework using four classical graph algorithms. The experimental results show that our approach achieves an average speedup of 3.3x over state-of-the-art partitioning-based approaches. Moreover, thanks to the cross-architecture model of Intel oneAPI, the framework can be deployed on different GPU platforms without code modification. Our evaluation shows that OneGraph has less than 1% performance loss compared to the dedicated programming model on GPUs in large-scale graph computing.
Keywords: Heterogeneous programming, graph computing, out-of-memory processing, cross-architecture portability, oneAPI