High performance extended math library is used by many scientific engineering and artificial intelligence applications,which usually involves many common mathematical computations and the most time-consuming functions...High performance extended math library is used by many scientific engineering and artificial intelligence applications,which usually involves many common mathematical computations and the most time-consuming functions.In order to take full advantage of the high performance processors,these functions need to be parallelized and optimized intensively.It is common for processor vendors to supply highly optimized commercial math library.For example,Intel maintains oneMKL,and NVIDIA has cuBLAS,cuSolver,and cuFFT.In this paper,we release a new-generation high-performance extended math library,xMath 2.0,specifically designed for the SW26010-Pro many-core processor,which includes four major modules:BLAS,LAPACK,FFT,and SPARSE.Each module is optimized for the domestic SW26010-Pro processor,leveraging parallelization on the many-core CPE mesh and optimization techniques such as assembly instruction rearrangement and computation-communication overlapping.In xMath2.0,the BLAS module has an average performance increase of 146.02 times over the MPE version of GotoBLAS2,and the performance of BLAS level 3 functions has increased by 393.95 times.The LAPACK module(calling xMath BLAS)is 233.44 times better than LAPACK(calling GotoBLAS2).And the FFT module is 47.63 times faster than FFTW3.3.2.The library has been deployed on the domestic Sunway TaihuLight Pro supercomputer,which have been used by dozens of users.展开更多
区域气候模式CWRF(Climate-Weather Research and Forecasting model)是国家气候中心区域气候预测系统的重要组成部分,也是系统最耗时的程序。高性能计算是提高CWRF数值预报计算性能的关键技术,开展CWRF模式在国产神威众核架构上的移植...区域气候模式CWRF(Climate-Weather Research and Forecasting model)是国家气候中心区域气候预测系统的重要组成部分,也是系统最耗时的程序。高性能计算是提高CWRF数值预报计算性能的关键技术,开展CWRF模式在国产神威众核架构上的移植和优化,提高模式的模拟效率,对模式的扩展、开发能力和可持续发展具有重要意义。基于国产众核SW26010处理器,完成了CWRF区域气候模式的移植、性能分析和深入性能优化,采用访存优化、Cache命中率优化及众核加速优化等方法,对CWRF模式动力过程、物理过程和I/O过程计算代码进行重构及众核加速。结果表明:优化技术可使CWRF动力过程平均加速2倍,最高加速6.4倍,物理过程平均加速1.7倍,最高加速5.4倍,I/O过程加速1.2倍,程序整体最高加速1.4倍,计算误差在合理范围内。展开更多
基金supported in part by Special Project on High-Performance Computing under the National Key R&D Program(2020YFB0204601).
文摘High performance extended math library is used by many scientific engineering and artificial intelligence applications,which usually involves many common mathematical computations and the most time-consuming functions.In order to take full advantage of the high performance processors,these functions need to be parallelized and optimized intensively.It is common for processor vendors to supply highly optimized commercial math library.For example,Intel maintains oneMKL,and NVIDIA has cuBLAS,cuSolver,and cuFFT.In this paper,we release a new-generation high-performance extended math library,xMath 2.0,specifically designed for the SW26010-Pro many-core processor,which includes four major modules:BLAS,LAPACK,FFT,and SPARSE.Each module is optimized for the domestic SW26010-Pro processor,leveraging parallelization on the many-core CPE mesh and optimization techniques such as assembly instruction rearrangement and computation-communication overlapping.In xMath2.0,the BLAS module has an average performance increase of 146.02 times over the MPE version of GotoBLAS2,and the performance of BLAS level 3 functions has increased by 393.95 times.The LAPACK module(calling xMath BLAS)is 233.44 times better than LAPACK(calling GotoBLAS2).And the FFT module is 47.63 times faster than FFTW3.3.2.The library has been deployed on the domestic Sunway TaihuLight Pro supercomputer,which have been used by dozens of users.
文摘区域气候模式CWRF(Climate-Weather Research and Forecasting model)是国家气候中心区域气候预测系统的重要组成部分,也是系统最耗时的程序。高性能计算是提高CWRF数值预报计算性能的关键技术,开展CWRF模式在国产神威众核架构上的移植和优化,提高模式的模拟效率,对模式的扩展、开发能力和可持续发展具有重要意义。基于国产众核SW26010处理器,完成了CWRF区域气候模式的移植、性能分析和深入性能优化,采用访存优化、Cache命中率优化及众核加速优化等方法,对CWRF模式动力过程、物理过程和I/O过程计算代码进行重构及众核加速。结果表明:优化技术可使CWRF动力过程平均加速2倍,最高加速6.4倍,物理过程平均加速1.7倍,最高加速5.4倍,I/O过程加速1.2倍,程序整体最高加速1.4倍,计算误差在合理范围内。