Journal Articles
2 articles found
1. Optimizing Memory Access Efficiency in CUDA Kernel via Data Layout Technique
Authors: Neda Seifi, Abdullah Al-Mamun. Journal of Computer and Communications, 2024, Issue 5, pp. 124-139.
Over the past decade, Graphics Processing Units (GPUs) have revolutionized high-performance computing, playing pivotal roles in advancing fields like IoT, autonomous vehicles, and exascale computing. Despite these advancements, efficiently programming GPUs remains a daunting challenge, often relying on trial-and-error optimization methods. This paper introduces an optimization technique for CUDA programs through a novel Data Layout strategy, aimed at restructuring memory data arrangement to significantly enhance data access locality. Focusing on the dynamic programming algorithm for chained matrix multiplication—a critical operation across various domains including artificial intelligence (AI), high-performance computing (HPC), and the Internet of Things (IoT)—this technique facilitates more localized access. We specifically illustrate the importance of efficient matrix multiplication in these areas, underscoring the technique's broader applicability and its potential to address some of the most pressing computational challenges in GPU-accelerated applications. Our findings reveal a remarkable reduction in memory consumption and a substantial 50% decrease in execution time for CUDA programs utilizing this technique, thereby setting a new benchmark for optimization in GPU computing.
Keywords: Data Layout Optimization; CUDA Performance Optimization; GPU Memory Optimization; Dynamic Programming; Matrix Multiplication; Memory Access Pattern Optimization in CUDA
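The data-layout idea can be illustrated on the host side with the matrix-chain DP itself: the cost table is filled anti-diagonal by anti-diagonal, so storing each anti-diagonal contiguously, instead of scattering its entries across a row-major square table, keeps each fill pass walking adjacent memory. The sketch below is a minimal Python illustration of that restructuring; the diagonal-major layout and function names are this sketch's own assumptions, not the paper's actual CUDA implementation.

```python
def matrix_chain_rowmajor(dims):
    """Classic DP: cost[i][j] = min scalar multiplications for A_i..A_j,
    where A_k has shape dims[k] x dims[k+1]."""
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):              # chain length
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j]
                + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]


def matrix_chain_diagmajor(dims):
    """Same DP, but anti-diagonal d stores entries (i, i+d) contiguously:
    diag[d][i] == cost[i][i+d], so each fill pass targets one dense list."""
    n = len(dims) - 1
    diag = [[0] * (n - d) for d in range(n)]
    for d in range(1, n):
        for i in range(n - d):
            j = i + d
            diag[d][i] = min(
                diag[k - i][i] + diag[j - k - 1][k + 1]
                + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return diag[n - 1][0]
```

Both variants compute the same optimal cost; the payoff of the diagonal-major layout comes on a GPU, where threads filling one diagonal can read and write coalesced, contiguous addresses.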
2. Optimizing 2D convolution for DCUs
Authors: Wenlong Fan, Haobo Hua, Jiandong Shang, Zhuxin Wen, Hengliang Guo, Litao Zhang. CCF Transactions on High Performance Computing, 2025, Issue 2, pp. 142-154.
With the growing importance of convolution in deep learning, the development of efficient convolution algorithms has become an urgent requirement. DCU (Deep Computing Unit), as an emerging GPU-like accelerator, has a relatively underdeveloped deep learning ecosystem. Therefore, this study focuses on developing efficient convolution operators for DCU. Using DCU hardware, diverse memory access patterns are created and fine thread rearrangements are performed at the warp and thread levels to optimize memory access patterns, improve computational efficiency, and improve data layout. Based on redesigned memory access patterns, the more efficient Implicit GEMM and Winograd 3×3 convolution algorithms are successfully implemented on the DCU. In addition, load partitioning is also improved through multiple techniques. Finally, a heuristic strategy selection module is developed that can determine the optimal computation method based on the scale of convolution. A series of tests are conducted on DCU, and the results are compared with MIOpen. They demonstrate that the optimized algorithm can achieve a maximum improvement of 2.32×, with an average speedup of 1.29×, which fully demonstrates the efficiency of our method in implementing the 3×3 convolution algorithm on DCU.
Keywords: Convolution; Implicit GEMM; Winograd; Memory Access Pattern
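Winograd's F(2,3) algorithm is the 1D building block of the 2D F(2×2, 3×3) convolution referenced above: it produces two outputs of a 3-tap filter from a 4-sample input tile with 4 elementwise multiplications instead of 6. A small NumPy sketch of the transform-multiply-transform structure follows; it is illustrative only, and the paper's DCU kernels operate on the tiled 2D form with the hardware-specific layouts described in the abstract.

```python
import numpy as np

# Standard Winograd F(2,3) transform matrices.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)   # input transform
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])                # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)    # output transform


def winograd_f23(d, g):
    """Two outputs of a 3-tap FIR over the 4-sample tile d,
    computed with 4 elementwise multiplies instead of 6."""
    U = G @ g            # filter transform (precomputable once per filter)
    V = BT @ d           # input transform
    return AT @ (U * V)  # elementwise product, then output transform


def direct_f23(d, g):
    """Reference: y[i] = sum_k d[i+k] * g[k] for i in {0, 1}."""
    return np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                     d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
```

Nesting this 1D form in both spatial dimensions yields F(2×2, 3×3), which replaces 36 multiplications per 2×2 output tile with 16, the arithmetic saving that makes Winograd attractive for 3×3 convolution.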