Funding: Supported by the U.S. Department of Energy through its Advanced Grid Modeling program, the Exascale Computing Program (ECP), the Grid Modernization Laboratory Consortium (GMLC), the Advanced Research Projects Agency-Energy (ARPA-E), the National Quantum Information Science Research Centers, Co-design Center for Quantum Advantage (C2QA), and the Office of Advanced Scientific Computing Research (ASCR).
Abstract: With the global trend toward clean energy and decarbonization, power systems have been evolving at a pace never before seen in the history of electrification. This evolution makes power systems more dynamic and more distributed, with higher uncertainty. These new behaviors bring significant challenges to power system modeling and simulation: more data must be analyzed for larger systems, and more complex models must be solved in shorter time periods. Conventional computing approaches will not be sufficient for future power systems. This paper provides a historical review of computing for power system operation and planning, discusses technology advancements in high performance computing (HPC), and describes the drivers for employing HPC techniques. Application examples using different HPC techniques, including the latest quantum computing, are also presented to show how HPC can help meet the requirements of power system computing in a clean energy future.
Funding: Supported by the National Natural Science Foundation of China (Nos. 61272141 and 61120106005) and the National High-Tech R&D Program (863) of China (No. 2012AA01A301).
Abstract: The mismatch between compute performance and I/O performance has long been a stumbling block as supercomputers evolve from petaflops to exaflops. Many parallel applications today are I/O intensive, and their overall running times are typically limited by I/O performance. To quantify the I/O performance bottleneck and highlight the significance of achieving scalable performance in peta/exascale supercomputing, this paper introduces, for the first time, a formal definition of the 'storage wall' from the perspective of parallel application scalability. We quantify the effects of the storage bottleneck by providing a storage-bounded speedup, defining the storage wall quantitatively, presenting existence theorems for the storage wall, and classifying system architectures according to I/O performance variation. We analyze and extrapolate the existence of the storage wall through experiments on Tianhe-1A and case studies on Jaguar. These results provide insights on how to alleviate the storage-wall bottleneck in system design and how to achieve hardware/software optimizations in peta/exascale supercomputing.
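As a rough illustration of how a storage-bounded speedup can expose such a wall (the paper's formal definition is not reproduced here; the model below is an Amdahl-style assumption for a fixed-size workload whose compute time shrinks with the number of processes p while its I/O time T_IO(p) does not):

    S_{\mathrm{storage}}(p) = \frac{T_{\mathrm{comp}}(1) + T_{\mathrm{IO}}(1)}{\frac{T_{\mathrm{comp}}(1)}{p} + T_{\mathrm{IO}}(p)},
    \qquad
    S_{\mathrm{storage}}(p) \le \frac{T_{\mathrm{comp}}(1) + T_{\mathrm{IO}}(1)}{T_{\mathrm{IO}}(1)} \quad \text{whenever } T_{\mathrm{IO}}(p) \ge T_{\mathrm{IO}}(1).

Under this assumption, no amount of additional compute capability can push the speedup past a ceiling set by the storage system, which is the intuition behind the term "wall."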
Funding: King Abdulaziz University, Deanship of Scientific Research, Grant Number KEP-PHD-20-611-42.
Abstract: Recently, researchers have shown increasing interest in combining more than one programming model in systems running on high performance computing (HPC) platforms to reach exascale by applying parallelism at multiple levels. Combining different programming paradigms, such as the Message Passing Interface (MPI), Open Multi-Processing (OpenMP), and Open Accelerators (OpenACC), can increase computation speed and improve performance. When multiple models are integrated, however, the probability of runtime errors increases, and these errors are difficult to detect, especially in the absence of testing techniques designed for them. Numerous studies have been conducted to identify such errors, but no technique exists for detecting errors in three-level programming models. Despite the growing body of research that integrates the three programming models MPI, OpenMP, and OpenACC, no testing technology has been developed to detect the runtime errors, such as deadlocks and race conditions, that can arise from this integration. Therefore, this paper begins with a definition and explanation of the runtime errors, undetectable by compilers, that result from integrating the three programming models. For the first time, this paper presents a classification of the runtime errors that can result from this integration. The paper also proposes a parallel hybrid testing technique for detecting runtime errors in systems written in C++ that use the triple programming models MPI, OpenMP, and OpenACC. The hybrid technique combines static and dynamic analysis, since some errors can be detected statically while others can only be detected dynamically, and it therefore catches more errors than either approach alone. The proposed static analysis detects a wide range of error types in less time, whereas potential errors that may or may not occur depending on the operating environment are left to the dynamic analysis, which completes the validation.
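To make the target of such a technique concrete, below is a minimal, self-contained C++ sketch of a triple-model program (illustrative only, not taken from the paper; the compiler flags in the comment are assumptions). It compiles without warnings yet contains exactly the kind of defect the paper targets: a data race that only manifests at runtime.

    // Illustrative MPI + OpenMP + OpenACC program (not from the paper).
    // Example build with a GCC-based toolchain: mpicxx -fopenmp -fopenacc demo.cpp
    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
        int provided = 0;
        // MPI_THREAD_FUNNELED: only the main thread makes MPI calls below.
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        std::vector<double> x(n, 1.0);
        double* xp = x.data();
        double local_sum = 0.0;

        // OpenACC level: offload an element-wise update to the accelerator.
        #pragma acc parallel loop copy(xp[0:n])
        for (int i = 0; i < n; ++i) {
            xp[i] = 2.0 * xp[i];
        }

        // OpenMP level: the missing "reduction(+:local_sum)" clause makes this
        // a data race -- all threads update local_sum concurrently, and the
        // compiler accepts it silently.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            local_sum += xp[i];   // unsynchronized update: race condition
        }

        // MPI level: if any rank exited early (e.g., after a crash above), the
        // surviving ranks would block here indefinitely -- a potential deadlock.
        double global_sum = 0.0;
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) std::printf("sum = %f\n", global_sum);
        MPI_Finalize();
        return 0;
    }

A static pass can flag the missing reduction clause from the pragma structure alone, while whether the final MPI_Reduce ever deadlocks depends on the runtime behavior of the other ranks; this is the split between static and dynamic detection that the abstract describes.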
Funding: This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB0200501, the National Natural Science Foundation of China under Grant Nos. 61332009 and 61521092, the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2016A04, the Beijing Municipal Science and Technology Commission under Grant No. Z15010101009, the Open Project Program of the State Key Laboratory of Computer Architecture under Grant No. CARCH201503, the China Scholarship Council, and the Beijing Advanced Innovation Center for Imaging Technology.
Abstract: With the coming of the exascale supercomputing era, power efficiency has become the most important obstacle to building an exascale system. Dataflow architectures have a native advantage in achieving high power efficiency for scientific applications. However, state-of-the-art dataflow architectures fail to exploit high parallelism for loop processing. To address this issue, we propose a pipelining loop optimization method (PLO) that makes loop iterations flow through the processing element (PE) array of a dataflow accelerator. The method consists of two techniques: architecture-assisted hardware iteration and instruction-assisted software iteration. In the hardware iteration execution model, an on-chip loop controller is designed to generate loop indexes, reducing the complexity of the computing kernel and laying a good foundation for pipelined execution. In the software iteration execution model, additional loop instructions are introduced to solve the iteration-dependency problem. Via these two techniques, the average number of instructions ready to execute per cycle is increased to keep the floating-point units busy. Simulation results show that the proposed method outperforms the static and dynamic loop execution models in floating-point efficiency by 2.45x and 1.1x on average, respectively, while the hardware cost of the two techniques is acceptable.
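To make the idea of iterations "flowing" through the PE array concrete, here is a toy software model (not the paper's simulator; the stage count and trip count are made-up values): a loop controller injects one iteration index per cycle while earlier iterations advance through the stages, so N iterations on an S-stage pipeline finish in roughly N + S - 1 cycles instead of N x S.

    // Toy model of pipelined loop iteration over a chain of PE stages.
    #include <cstdio>
    #include <optional>
    #include <vector>

    int main() {
        const int num_stages = 4;   // depth of the PE chain (illustrative value)
        const int num_iters  = 10;  // loop trip count (illustrative value)

        // pipe[s] holds the iteration index currently occupying PE stage s.
        std::vector<std::optional<int>> pipe(num_stages);
        int next_index = 0;         // generated by the "loop controller"
        int retired = 0;
        int cycle = 0;

        while (retired < num_iters) {
            ++cycle;
            // Iterations flow one PE stage forward each cycle.
            for (int s = num_stages - 1; s > 0; --s) pipe[s] = pipe[s - 1];
            // The loop controller injects the next index, so the computing
            // kernel itself carries no index-update instructions.
            if (next_index < num_iters) pipe[0] = next_index++;
            else                        pipe[0] = std::nullopt;
            // The iteration now in the last stage completes during this cycle.
            if (pipe[num_stages - 1]) {
                ++retired;
                pipe[num_stages - 1] = std::nullopt;
            }
        }

        std::printf("pipelined: %d cycles vs. %d cycles one iteration at a time\n",
                    cycle, num_iters * num_stages);
        return 0;
    }

With the values above the model reports 13 cycles instead of 40, which is the effect the hardware-iteration technique aims for: the PE array stays full, so more instructions are ready to execute each cycle.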