As one of the most essential and important operations in linear algebra, the performance prediction of sparse matrix-vector multiplication (SpMV) on GPUs has got more and more attention in recent years. In 2012, Guo a...As one of the most essential and important operations in linear algebra, the performance prediction of sparse matrix-vector multiplication (SpMV) on GPUs has got more and more attention in recent years. In 2012, Guo and Wang put forward a new idea to predict the performance of SpMV on GPUs. However, they didn’t consider the matrix structure completely, so the execution time predicted by their model tends to be inaccurate for general sparse matrix. To address this problem, we proposed two new similar models, which take into account the structure of the matrices and make the performance prediction model more accurate. In addition, we predict the execution time of SpMV for CSR-V, CSR-S, ELL and JAD sparse matrix storage formats by the new models on the CUDA platform. Our experimental results show that the accuracy of prediction by our models is 1.69 times better than Guo and Wang’s model on average for most general matrices.展开更多
在图形处理器(GPU)上实现对角稀疏矩阵向量乘法(SpMV)可以充分利用GPU的并行计算能力,并加速矩阵向量乘法;然而,相关主流算法存在零元填充数据多、计算效率低的问题。针对上述问题,提出一种对角SpMV算法DIA-Dynamic(DIAgonal-Dynamic)...在图形处理器(GPU)上实现对角稀疏矩阵向量乘法(SpMV)可以充分利用GPU的并行计算能力,并加速矩阵向量乘法;然而,相关主流算法存在零元填充数据多、计算效率低的问题。针对上述问题,提出一种对角SpMV算法DIA-Dynamic(DIAgonal-Dynamic)。首先,设计一种全新的动态划分策略,根据矩阵的不同特征进行分块,在保证GPU高计算效率的同时大幅减少零元填充,去除冗余计算量;其次,提出一种对角稀疏矩阵存储格式BDIA(Block DIAgonal)存储分块数据,并调整数据布局,提高GPU上的访存性能;最后,基于GPU的底层进行条件分支优化,以减少分支判断,并使用动态共享内存解决向量的不规则访问问题。DIA-Dynamic与前沿Tile SpMV算法相比,平均加速比达到了1.88;与前沿BRCSD(Diagonal Compressed Storage based on Row-Blocks)-Ⅱ算法相比,平均零元填充减少了43%,平均加速比达到了1.70。实验结果表明,DIA-Dynamic能够有效提高GPU上对角SpMV的计算效率,缩短计算时间,提升程序性能。展开更多
Performance models provide insightful perspectives to predict performance and to propose optimization guidance.Although there has been much researches,pinpointing bottlenecks of various memory access patterns and reac...Performance models provide insightful perspectives to predict performance and to propose optimization guidance.Although there has been much researches,pinpointing bottlenecks of various memory access patterns and reaching high accurate prediction of both regular and irregular programs on various hardware configurations are still not trivial.This work proposes a novel model called process-RAM-feedback(PRF)to quantify the overhead of computation and data transmission time on general-purpose multi-core processors.The PRF model predicts the cost of instruction for singlecore by a directed acyclic graph(DAG)and the transmission time of memory access between each memory hierarchy through a newly designed cache simulator.By using performance modeling and feedback optimization method,this paper uses PRF model to analyze and optimize convolution,sparse matrix-vector multiplication and sn-sweep as case study for covering with typical regular kernel to irregular and data dependence.Through the PRF model,it obtains optimization guidance with various sparsity structures,algorithm designs,and instruction sets support on different data sizes.展开更多
The efficiency of three Krylov subspace methods with their ILU0-preconditioned version in solving the systems with the nonadiagonal sparse matrix is examined.The systems have arisen from the discretization of Poisson&...The efficiency of three Krylov subspace methods with their ILU0-preconditioned version in solving the systems with the nonadiagonal sparse matrix is examined.The systems have arisen from the discretization of Poisson's equation using the 4th and 6th-order compact schemes.Four matrix-vector multiplication techniques based on four sparse matrix storage schemes are considered in the algorithm of the Krylov subspace methods and their effects are explored.The convergence history,error reduction,iteration-resolution relation and CPU-time are addressed.The efficacy of various methods is evaluated against a benchmark scenario in which the conventional second-order central difference scheme is employed to discretize Poisson's equation.The Krylov subspace methods,paired with four distinct matrix-vector multiplication strategies across three discretization approaches,are tested and implemented within an incompressible fluid flow solver to solve the elliptic segment of the equations.The resulting solution process CPU-time surface gives a new vision regarding speeding up a CFD code with proper selection of discretization stencil and matrixvector multiplication technique.展开更多
文摘As one of the most essential and important operations in linear algebra, the performance prediction of sparse matrix-vector multiplication (SpMV) on GPUs has got more and more attention in recent years. In 2012, Guo and Wang put forward a new idea to predict the performance of SpMV on GPUs. However, they didn’t consider the matrix structure completely, so the execution time predicted by their model tends to be inaccurate for general sparse matrix. To address this problem, we proposed two new similar models, which take into account the structure of the matrices and make the performance prediction model more accurate. In addition, we predict the execution time of SpMV for CSR-V, CSR-S, ELL and JAD sparse matrix storage formats by the new models on the CUDA platform. Our experimental results show that the accuracy of prediction by our models is 1.69 times better than Guo and Wang’s model on average for most general matrices.
文摘在图形处理器(GPU)上实现对角稀疏矩阵向量乘法(SpMV)可以充分利用GPU的并行计算能力,并加速矩阵向量乘法;然而,相关主流算法存在零元填充数据多、计算效率低的问题。针对上述问题,提出一种对角SpMV算法DIA-Dynamic(DIAgonal-Dynamic)。首先,设计一种全新的动态划分策略,根据矩阵的不同特征进行分块,在保证GPU高计算效率的同时大幅减少零元填充,去除冗余计算量;其次,提出一种对角稀疏矩阵存储格式BDIA(Block DIAgonal)存储分块数据,并调整数据布局,提高GPU上的访存性能;最后,基于GPU的底层进行条件分支优化,以减少分支判断,并使用动态共享内存解决向量的不规则访问问题。DIA-Dynamic与前沿Tile SpMV算法相比,平均加速比达到了1.88;与前沿BRCSD(Diagonal Compressed Storage based on Row-Blocks)-Ⅱ算法相比,平均零元填充减少了43%,平均加速比达到了1.70。实验结果表明,DIA-Dynamic能够有效提高GPU上对角SpMV的计算效率,缩短计算时间,提升程序性能。
基金Supported by the National Key Research and Development Program of China(No.2017YFB0202105,2016YFB0201305,2016YFB0200803,2016YFB0200300)the National Natural Science Foundation of China(No.61521092,91430218,31327901,61472395,61432018).
文摘Performance models provide insightful perspectives to predict performance and to propose optimization guidance.Although there has been much researches,pinpointing bottlenecks of various memory access patterns and reaching high accurate prediction of both regular and irregular programs on various hardware configurations are still not trivial.This work proposes a novel model called process-RAM-feedback(PRF)to quantify the overhead of computation and data transmission time on general-purpose multi-core processors.The PRF model predicts the cost of instruction for singlecore by a directed acyclic graph(DAG)and the transmission time of memory access between each memory hierarchy through a newly designed cache simulator.By using performance modeling and feedback optimization method,this paper uses PRF model to analyze and optimize convolution,sparse matrix-vector multiplication and sn-sweep as case study for covering with typical regular kernel to irregular and data dependence.Through the PRF model,it obtains optimization guidance with various sparsity structures,algorithm designs,and instruction sets support on different data sizes.
文摘The efficiency of three Krylov subspace methods with their ILU0-preconditioned version in solving the systems with the nonadiagonal sparse matrix is examined.The systems have arisen from the discretization of Poisson's equation using the 4th and 6th-order compact schemes.Four matrix-vector multiplication techniques based on four sparse matrix storage schemes are considered in the algorithm of the Krylov subspace methods and their effects are explored.The convergence history,error reduction,iteration-resolution relation and CPU-time are addressed.The efficacy of various methods is evaluated against a benchmark scenario in which the conventional second-order central difference scheme is employed to discretize Poisson's equation.The Krylov subspace methods,paired with four distinct matrix-vector multiplication strategies across three discretization approaches,are tested and implemented within an incompressible fluid flow solver to solve the elliptic segment of the equations.The resulting solution process CPU-time surface gives a new vision regarding speeding up a CFD code with proper selection of discretization stencil and matrixvector multiplication technique.