Funding: Supported by the Young and Middle-aged Science and Technology Talents Promotion Project of 2022 Qinghai Province (No. 2022QHSKXRCTJ41), the Natural Science Foundation of Qinghai Province (No. 2023-ZJ-906M), and the National Natural Science Foundation of China (Nos. 62062059 and 62162053).
Abstract: With the rise of large AI models, Graphics Processing Units (GPUs) have become the preferred hardware for many scientific applications due to their superior floating-point throughput. This paper explores the use of CPU+GPU heterogeneous accelerators in the Global/Regional Assimilation and Prediction System (GRAPES). We moved the main time-consuming part of the scalar advection scheme (PRM) to the GPU. Specifically, we performed a detailed performance analysis of the PRM module, then refactored and ported the code to the GPU using C and CUDA C. In the process we applied a series of optimizations, including changing the array storage order, improving GPU memory-access patterns, and merging loops to increase the amount of computation performed per kernel. To reduce communication overhead, we also designed a communication-avoidance scheme. The final implementation is accurate within acceptable error margins and scales well. On a cluster of Intel(R) Xeon(R) Gold 6326 CPUs and NVIDIA A800 GPUs, using 16 CPU cores and 8 GPU accelerators, we achieved a speedup of up to 87.90x for the hotspot function and 5.21x overall for the scalar advection scheme.
Abstract: Over the past decade, Graphics Processing Units (GPUs) have revolutionized high-performance computing, playing a pivotal role in fields such as IoT, autonomous vehicles, and exascale computing. Despite these advances, programming GPUs efficiently remains a daunting challenge that often relies on trial-and-error optimization. This paper introduces an optimization technique for CUDA programs based on a novel data-layout strategy that restructures how data are arranged in memory to significantly improve access locality. We focus on the dynamic-programming algorithm for chained matrix multiplication, a critical operation across domains including artificial intelligence (AI), high-performance computing (HPC), and the Internet of Things (IoT), and show how the technique yields more localized access. We also illustrate the importance of efficient matrix multiplication in these areas, underscoring the technique's broader applicability to pressing computational challenges in GPU-accelerated applications. Our results show a marked reduction in memory consumption and a 50% decrease in execution time for CUDA programs using this technique.