Abstract: LAPACK (Linear Algebra PACKage) is a subroutine library for solving the most common problems in numerical linear algebra, designed to run efficiently on shared-memory vector and parallel processors. Only the general sequential code of LAPACK is available on the Internet, and optimizing it for a particular machine is burdensome. To address this problem, we develop an automatic parallelizing tool on the SGI POWER Challenge, and it shows good results.
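To make concrete what a typical LAPACK driver such as dgesv computes, here is a minimal pure-Python sketch of Gaussian elimination with partial pivoting. This is an illustration only, not LAPACK's code or API; the function name `lu_solve` is hypothetical.

```python
# Hypothetical sketch of what LAPACK's dgesv does: solve A x = b via
# LU factorization with partial pivoting. Illustrative only; LAPACK's
# optimized Fortran implementation is blocked and far more careful.

def lu_solve(a, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on copies so the caller's data is untouched.
    a = [row[:] for row in a]
    x = b[:]
    for k in range(n):
        # Partial pivoting: bring the row with the largest |pivot| to row k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        x[k], x[p] = x[p], x[k]
        # Eliminate entries below the pivot.
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            x[i] -= m * x[k]
    # Back substitution on the resulting upper-triangular system.
    for k in range(n - 1, -1, -1):
        x[k] = (x[k] - sum(a[k][j] * x[j] for j in range(k + 1, n))) / a[k][k]
    return x

# Example: 2x + y = 3, x + 3y = 5 has the solution x = 0.8, y = 1.4.
sol = lu_solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
```

It is exactly this kind of sequential reference code that the parallelizing tool described above would take as input.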
Abstract: This paper presents two approaches to improving the performance of numerical algebra software by describing block algorithms in LAPACK. Block algorithms can be built from higher-level and more efficient BLAS routines. The paper further examines the relationship between the efficiency of a block algorithm and the block size, and shows that this relationship depends not only on the scale of the algorithm and the problem but also on the architecture and characteristics of the target machine. Finally, the paper gives test results on the Hitachi SR2201 and SR8000.
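The blocking idea behind these algorithms can be sketched with a blocked matrix multiply, the pattern underlying Level-3 BLAS. This is an illustrative pure-Python sketch, not LAPACK or BLAS code; the block size `bs` is the tunable parameter whose best value, as the abstract notes, depends on the target machine.

```python
# Illustrative sketch of a blocked matrix multiply (the Level-3 BLAS
# pattern): operate on bs-by-bs blocks so that the blocks currently
# being multiplied fit in cache. Not actual BLAS/LAPACK code.

def blocked_matmul(a, b, bs):
    """Return C = A * B, computed block by block with block size bs."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, bs):          # block row of C
        for kk in range(0, m, bs):      # block of the shared dimension
            for jj in range(0, p, bs):  # block column of C
                # One block-pair product; data touched here stays within
                # roughly 3 * bs * bs elements, the cache-friendly core.
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + bs, p)):
                            c[i][j] += aik * b[k][j]
    return c
```

Production libraries choose `bs` per machine from cache sizes and register counts, which is why, as shown in the paper's measurements, no single block size is optimal across the SR2201 and SR8000.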
Funding: Supported in part by the Special Project on High-Performance Computing under the National Key R&D Program (2020YFB0204601).
Abstract: High-performance extended math libraries are used by many scientific, engineering, and artificial intelligence applications, which typically involve many common mathematical computations and highly time-consuming functions. To take full advantage of high-performance processors, these functions need to be parallelized and optimized intensively. It is common for processor vendors to supply highly optimized commercial math libraries; for example, Intel maintains oneMKL, and NVIDIA provides cuBLAS, cuSolver, and cuFFT. In this paper, we release a new-generation high-performance extended math library, xMath 2.0, specifically designed for the SW26010-Pro many-core processor. It includes four major modules: BLAS, LAPACK, FFT, and SPARSE. Each module is optimized for the domestic SW26010-Pro processor, leveraging parallelization on the many-core CPE mesh and optimization techniques such as assembly instruction rearrangement and computation-communication overlapping. In xMath 2.0, the BLAS module achieves an average speedup of 146.02x over the MPE version of GotoBLAS2, and the performance of the BLAS level-3 functions improves by 393.95x. The LAPACK module (calling xMath BLAS) is 233.44 times faster than LAPACK calling GotoBLAS2, and the FFT module is 47.63 times faster than FFTW 3.3.2. The library has been deployed on the domestic Sunway TaihuLight Pro supercomputer and is used by dozens of users.
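As a reference point for what the FFT module computes, here is a minimal recursive radix-2 Cooley-Tukey transform in pure Python. This is only a mathematical sketch of the transform itself; xMath's implementation is parallelized across the CPE mesh and is not shown here.

```python
import cmath

# Minimal recursive radix-2 Cooley-Tukey FFT. Shown only to illustrate
# the discrete Fourier transform that an FFT module computes; real
# libraries (FFTW, xMath) use iterative, cache- and core-aware variants.

def fft(x):
    """Discrete Fourier transform of x; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # transform of even-indexed samples
    odd = fft(x[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor combines the two half-size transforms.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

A constant input of length 4 transforms to a single DC component, i.e. `fft([1, 1, 1, 1])` is approximately `[4, 0, 0, 0]`, which is a convenient sanity check for any FFT implementation.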