Abstract: Sparse matrix-vector multiplication (SpMV) is a core operation in numerical computing, widely used in scientific computing, engineering simulation, and machine learning. SpMV performance optimization is mainly constrained by irregular sparsity patterns, and traditional optimization typically relies on manually designed storage formats, computation strategies, and memory-access patterns. Existing tensor compilers such as TACO and TVM can generate high-performance operators through domain-specific languages (DSLs), relieving developers of tedious manual optimization, but their support for sparse computation remains limited, and they struggle to adapt performance to different sparsity patterns. To address these problems, this paper proposes a sparse compilation framework named SparseMode, which generates efficient vectorized code for SpMV according to a matrix's sparsity pattern and adaptively adjusts its optimization strategy to the characteristics of the hardware platform. The framework first defines a domain-specific language, SpMV-DSL, which concisely and efficiently expresses sparse matrices and SpMV computations. It then introduces a sparsity-pattern-aware method that dynamically selects a computation strategy based on the matrix storage format and nonzero-element distribution defined in SpMV-DSL. Finally, it generates efficient parallel SpMV operator code through sparsity-pattern analysis and scheduling optimization, making full use of SIMD instructions to improve performance. SpMV experiments on different hardware platforms show that the operator code generated by SparseMode achieves speedups of up to 2.44× over the existing TACO and TVM tensor compilers.
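SparseMode's strategy selection is DSL-driven and far richer than can be shown here; the toy dispatcher below only illustrates the underlying idea of choosing a kernel from the nonzero distribution. The function name, the coefficient-of-variation feature, and the 0.5 threshold are all ours, chosen for illustration, not taken from the paper:

```python
import numpy as np

def pick_strategy(row_lengths):
    """Toy sparsity-pattern-aware dispatch.

    Matrices with near-uniform row lengths map well to a
    row-per-lane vectorized kernel; highly irregular ones are
    better served by a nonzero-splitting kernel. The coefficient
    of variation of row lengths is used as the (illustrative)
    selection feature.
    """
    mean = np.mean(row_lengths)
    cv = np.std(row_lengths) / mean if mean > 0 else 0.0
    return "row_vectorized" if cv < 0.5 else "nnz_split"

print(pick_strategy([4, 4, 4, 4]))      # regular rows
print(pick_strategy([1, 1, 1, 100]))    # one long row dominates
```

A real selector, as the abstract describes, would be built from a learned mapping between extracted structural features and measured kernel performance.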
Abstract: Sparse matrix-vector multiplication (SpMV) is unavoidable in almost all kinds of scientific computation, such as iterative methods for solving linear systems and eigenvalue problems. With the emergence and development of Graphics Processing Units (GPUs), highly efficient storage formats for SpMV need to be constructed, since SpMV performance is mainly determined by the sparse-matrix storage format. Based on the idea of the JAD format, this paper improves the ELLPACK-R format and reduces the waiting time between threads in a warp; in our experiments the speedup reached about 1.5×. Compared with other formats, such as CSR, ELL, and BiELL, our format delivers the best SpMV performance on over 70 percent of the test matrices. We also propose a parameter-based method to analyze how different formats affect performance, and construct a formula to estimate the computation cost and the number of iterations.
Funding: supported by the National Key Research and Development Program of China (No. 2022YFB3304402), the Natural Science Foundation of China (No. 12371321), the Guangdong Basic and Applied Basic Research Foundation (No. 2024A1515030197), and a Shenzhen research grant (No. CJGJZD20210408092806017).
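The ELLPACK-R layout that the abstract above builds on can be sketched as follows. This is a minimal CPU illustration of the format's key idea (a per-row length array that lets each row stop at its real nonzeros instead of reading padding); the paper's contribution is a GPU warp-level refinement of it, and the array names `values`, `cols`, and `row_len` are our own:

```python
import numpy as np

def ellpack_r_spmv(values, cols, row_len, x):
    """SpMV y = A @ x with A in ELLPACK-R layout.

    values, cols: (n_rows, max_len) arrays, every row padded to the
    longest row's length; row_len[i] holds the true number of
    nonzeros in row i, so the inner loop stops early instead of
    multiplying padding -- the addition ELLPACK-R makes over ELL.
    """
    n_rows = values.shape[0]
    y = np.zeros(n_rows)
    for i in range(n_rows):
        for j in range(row_len[i]):          # only real nonzeros
            y[i] += values[i, j] * x[cols[i, j]]
    return y

# A = [[4, 0, 1],
#      [0, 2, 0],
#      [3, 0, 5]]  packed row-wise, padded to width 2
values  = np.array([[4.0, 1.0], [2.0, 0.0], [3.0, 5.0]])
cols    = np.array([[0, 2], [1, 0], [0, 2]])
row_len = np.array([2, 1, 2])
print(ellpack_r_spmv(values, cols, row_len, np.array([1.0, 1.0, 1.0])))
# [5. 2. 8.]
```

On a GPU, one thread per row with this layout gives coalesced accesses when `values` and `cols` are stored column-major; the row-length array is what reduces the wasted work on short rows.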
Abstract: As modern industrial chains become increasingly complex and time-sensitive, traditional transportation planning methods encounter efficiency bottlenecks. To address this, we propose a parallelization approach based on Sparse Matrix-Vector Multiplication (SpMV) to accelerate the Transportation Simplex Algorithm (TSA) for large-scale transportation problems. Existing methods primarily exploit data parallelism but underutilize GPU computational resources. To overcome the key challenge of breadth-first search (BFS) traversal with node dependencies in the MODI algorithm, we reformulate sequential operations as SpMV computations to enhance parallelism. Branching logic in potential-vector computation and closed-loop search is unified through matrix formulations to eliminate divergence, and device-side loops are introduced to accelerate single iteration steps. Experiments on a 5000×10000 dataset demonstrate a 19× speedup for the parallel MODI algorithm and a 20× overall speedup for solving the transportation problem. Furthermore, the parallel TSA outperforms a commercial LP solver by 1.3× to 1.4× on large-scale instances.
Funding: supported by the National Natural Science Foundation of China (No. 22333003).
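The core trick of recasting BFS traversal as SpMV, which the abstract above applies inside the MODI algorithm, can be illustrated independently of the transportation problem. This is a generic sketch on a dense 0/1 adjacency matrix, not the paper's GPU implementation:

```python
import numpy as np

def bfs_levels(adj, source):
    """BFS over an adjacency matrix by repeated SpMV.

    One matrix-vector product (adj @ frontier) advances the whole
    frontier a hop at a time, replacing the node-by-node dependency
    of sequential BFS with a single data-parallel operation.
    """
    n = adj.shape[0]
    level = np.full(n, -1)
    frontier = np.zeros(n, dtype=bool)
    frontier[source] = True
    depth = 0
    while frontier.any():
        level[frontier] = depth
        # propagate one hop, then mask out already-visited nodes
        frontier = ((adj @ frontier) > 0) & (level == -1)
        depth += 1
    return level

# path graph 0-1-2-3, symmetric 0/1 adjacency
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
print(bfs_levels(adj, 0))  # [0 1 2 3]
```

On a GPU the adjacency matrix would be stored sparsely and the product executed by an SpMV kernel, so each BFS level costs one kernel launch instead of one step per node.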
Abstract: Sparse matrix-vector multiplication (SpMV) is one of the key kernels extensively employed in both industrial and scientific applications, and its computation and random memory accesses incur considerable overhead. To capitalize on higher compute rates and data-movement efficiency, there have been efforts to use mixed-precision SpMV. However, most existing techniques focus on single-grained precision selection for all matrices. In this work, we concentrate on hierarchical precision-selection strategies tailored for irregular matrices, driven by the need to achieve optimal load balancing among thread groups executing on GPUs. Based on the concept of strong connection, we first introduce a novel adaptive row-grained precision-selection strategy that surpasses the existing strategy within multi-precision Jacobi methods. Second, our experiments uncover a range within which converting double-precision floating-point numbers to single precision incurs a loss smaller than the machine precision FLT_EPSILON; this range is used for element-grained precision selection. We then propose a hierarchical precision-selection compressed sparse row (CSR) storage method and enhance the CSR-Vector kernel, achieving higher relative speedups and better load balancing than existing methods on a benchmark suite of 41 matrices. Finally, we integrate the mixed-precision SpMV into the generalized minimal residual (GMRES) algorithm, achieving faster execution while maintaining convergence accuracy similar to double-precision GMRES.
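The element-grained criterion described above, that a double may be demoted to single precision when the conversion loss stays below FLT_EPSILON, can be sketched as a per-element round-trip test. The helper names are ours, and the paper derives a value range up front rather than testing every element; values inside float32's normal range round with relative error well below FLT_EPSILON, while values outside it (underflow or overflow) do not:

```python
import numpy as np

FLT_EPSILON = np.finfo(np.float32).eps  # machine epsilon of single precision

def fits_in_single(v):
    """True if demoting v to float32 and back loses less than
    FLT_EPSILON in relative terms. Values within float32's normal
    range always pass (rounding error <= 2^-24 < FLT_EPSILON);
    values that underflow or overflow float32 fail."""
    if v == 0.0:
        return True
    rel_err = abs(float(np.float32(v)) - v) / abs(v)
    return rel_err < FLT_EPSILON

def split_by_precision(values):
    """Partition entries into a single-precision part and a
    double-precision remainder (element-grained selection)."""
    singles = [v for v in values if fits_in_single(v)]
    doubles = [v for v in values if not fits_in_single(v)]
    return singles, doubles

vals = [0.5, 1.0 / 3.0, 1e-300, 2.0 ** -30]
singles, doubles = split_by_precision(vals)
# 1e-300 underflows float32 and stays double; the rest go single
```

The single-precision bucket then halves both storage and memory traffic for those entries, which is where the SpMV speedup comes from.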
Abstract: In this article the author De Dong was incorrectly flagged as a corresponding author. The correct corresponding author of this article is Nurbol Luktarhan. The Original Article has been corrected. Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Abstract: The new generation of computing devices tends to support multiple floating-point formats and levels of computing precision. Besides single and double precision, half precision is embraced and widely supported by new computing devices. Low-precision representations have a compact memory footprint and lightweight computing strength, and they bring opportunities for optimizing BLAS routines. This paper proposes a new sparse-matrix partition approach based on the IEEE 754 standard floating-point format. An input sparse matrix in double precision is partitioned and transformed into several sub-matrices in different precisions without loss of accuracy: most non-zero elements can be stored in half or single precision if the most significant bits of the exponent and the least significant bits of the mantissa are zero in the double-precision representation. Based on this mixed-precision representation of the sparse matrix, we also present a new SpMV algorithm, pSpMV, for GPU devices. pSpMV reduces both the memory-access overhead and the computing strength of floating-point operations. Experimental results on two GPU devices show that pSpMV achieves a geometric mean speedup of 1.39× on Tesla V100 and 1.45× on Tesla P100 over double-precision SpMV across 2,554 sparse matrices.
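The bit-pattern test underlying the partition above can be sketched directly from the IEEE 754 field widths (a double has a 52-bit mantissa and 11-bit exponent, a single 23 and 8): a double is exactly representable as a single when its low 52 − 23 = 29 mantissa bits are zero and its unbiased exponent fits in single's normal range. The function names are ours, the sketch leaves zero and subnormals in the double bucket for simplicity, and the actual pSpMV partition also covers half precision, which is omitted here:

```python
import struct

def bits_of(x):
    """Raw 64-bit pattern of a Python float (IEEE 754 double)."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def exactly_single(x):
    """True if the double x is exactly representable as an IEEE 754
    single: the low 29 mantissa bits are zero and the unbiased
    exponent lies in single's normal range [-126, 127].
    (Zero and subnormals are conservatively left double here.)"""
    b = bits_of(x)
    mantissa_low = b & ((1 << 29) - 1)
    exponent = ((b >> 52) & 0x7FF) - 1023
    return mantissa_low == 0 and -126 <= exponent <= 127

print(exactly_single(0.5))        # True: short mantissa, small exponent
print(exactly_single(1.0 / 3.0))  # False: needs the full 52-bit mantissa
```

Because the test is purely on bit patterns, the partition is lossless by construction: demoted elements reproduce the original double exactly when promoted back.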
Abstract: Sparse matrix-vector multiplication (SpMV) is the computational core and bottleneck of sparse linear systems; its efficiency affects the overall performance of iterative solvers, and its optimization has long been a research focus in scientific computing and engineering applications. Discretization of partial differential equations produces sparse diagonal matrices, and because of their diverse nonzero distributions, no single method achieves the best time performance on all matrices. To address this problem, this paper proposes AST (Adaptive SpMV Tuning), an adaptive SpMV optimization method for sparse diagonal matrices on graphics processing units (GPUs). The method designs a feature space and builds a feature extractor to capture fine-grained structural features of a matrix; by analyzing the correlation between these features and SpMV methods, it builds an extensible set of candidate methods, forms a mapping from features to the optimal method, and constructs a performance predictor, enabling efficient prediction of the optimal method for a given matrix. Experimental results show that AST achieves a prediction accuracy of 85.8% with an average time-performance loss of 0.09. Compared with DIA (Diagonal), HDIA (Hacked DIA), HDC (Hybrid of DIA and Compressed Sparse Row), DIA-Adaptive, and DRM (Divide-Rearrange and Merge), it obtains average kernel-runtime speedups of 20.19×, 1.86×, 3.06×, 3.72×, and 1.53×, and floating-point performance speedups of 1.05×, 1.28×, 12.45×, 1.94×, and 0.97×, respectively.
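The DIA (diagonal) format that AST and its baselines above all build on stores each nonzero diagonal as a dense row plus its offset from the main diagonal. A minimal sketch following the scipy-style alignment convention (data[k][j] multiplies x[j] and contributes to row j − offset), with our own function and variable names, not AST itself:

```python
import numpy as np

def dia_spmv(offsets, data, x):
    """SpMV y = A @ x with A in DIA layout.

    data[k] is the diagonal at offset offsets[k], aligned by column:
    data[k][col] is A[col - offsets[k], col]. Entries whose row index
    falls outside the matrix are padding and are skipped.
    """
    n = len(x)
    y = np.zeros(n)
    for off, diag in zip(offsets, data):
        for col in range(n):
            row = col - off
            if 0 <= row < n:
                y[row] += diag[col] * x[col]
    return y

# tridiagonal A = [[ 2, -1,  0],
#                  [-1,  2, -1],
#                  [ 0, -1,  2]]
offsets = [-1, 0, 1]
data = np.array([[-1.0, -1.0,  0.0],   # sub-diagonal (last slot padding)
                 [ 2.0,  2.0,  2.0],   # main diagonal
                 [ 0.0, -1.0, -1.0]])  # super-diagonal (first slot padding)
print(dia_spmv(offsets, data, np.ones(3)))  # [1. 0. 1.]
```

The format's weakness, which motivates the adaptive selection in the abstract, is the padding: matrices whose nonzeros scatter across many sparse diagonals waste most of the stored entries.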
Abstract: Sparse matrix-vector multiplication (SpMV) is an important kernel in scientific and engineering computing, but on current platforms with hierarchical memory, SpMV with the traditional CSR (Compressed Sparse Row) format performs poorly, often running at far below 10% of the hardware's peak floating-point performance. Modern processor architectures generally rely on SIMD vectorization for acceleration, but CSR-based SpMV cannot be vectorized directly because of its irregular memory accesses. To exploit SIMD, for sparse matrices with local structure we propose a new storage format, CSRL (Compressed Sparse Row with Local information), which reduces the number of memory accesses during SpMV and makes full use of hardware SIMD vectorization for both loads and computation, improving SpMV performance. Experiments show that, compared with the well-known commercial library Intel MKL 10.3, this method improves performance by 29.5% on average and by up to 89%.
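For reference, a baseline CSR SpMV makes the irregularity the abstract above describes concrete: every nonzero requires an indexed gather `x[col_idx[j]]`, which is what blocks direct SIMD vectorization. CSRL's addition, not shown here, is to record runs of consecutive columns so that a whole run can be fetched with one contiguous vector load. A minimal sketch with our own variable names:

```python
import numpy as np

def csr_spmv(row_ptr, col_idx, vals, x):
    """Baseline SpMV y = A @ x with A in CSR layout.

    row_ptr[i]:row_ptr[i+1] delimits row i's nonzeros; the gather
    x[col_idx[lo:hi]] is the irregular access that defeats plain
    SIMD vectorization and that CSRL's column-run metadata targets.
    """
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        y[i] = np.dot(vals[lo:hi], x[col_idx[lo:hi]])
    return y

# A = [[1, 2, 0],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = np.array([0, 2, 3, 5])
col_idx = np.array([0, 1, 1, 0, 2])
vals    = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(csr_spmv(row_ptr, col_idx, vals, np.ones(3)))  # [3. 3. 9.]
```

Note how the first row's columns (0, 1) are consecutive: that is exactly the local structure CSRL records so the two x-values can be loaded as one contiguous SIMD lane pair instead of two gathers.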