To address the issues of sparse matrix load imbalance and parallelism degradation with increasing matrix size in row-split, the mainstream parallelization strategy for sparse-dense matrix-matrix multiplication (SpMM), we propose a new framework for parallel SpMM computation on DCUs (GPU-like accelerators). This framework is based on the standard CSR format and requires no additional format conversion, and thus offers strong generality. To address load imbalance, we introduce a coarse-grained two-level binning strategy that categorizes the rows of the sparse matrix into three groups based on the number of non-zero elements. Dedicated computation kernels are designed for each category to better accommodate the different types of computational tasks, thereby significantly improving load balance. To address the decline in parallelism as the matrix size increases, we design multiple optimized kernels and dynamically select the optimal configuration at runtime to maximize parallelism. Experimental results show that our proposed SpMM framework significantly outperforms two current state-of-the-art row-split-based SpMM algorithms (rocSPARSE and GE-SpMM), achieving speedups of 5.4× and 2.28×, respectively.
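The row-binning idea can be illustrated with a minimal sketch. The function below is a hypothetical illustration, not the paper's implementation: it partitions CSR rows into three bins by non-zero count, using illustrative thresholds (`small_max`, `medium_max`) that the abstract does not specify.

```python
import numpy as np

def bin_rows_by_nnz(row_ptr, small_max=32, medium_max=256):
    """Partition CSR rows into three bins by non-zero count.

    row_ptr: CSR row-pointer array of length (num_rows + 1).
    Thresholds are illustrative; the paper's actual cut-offs
    and two-level refinement are not described in the abstract.
    """
    # Non-zeros per row are consecutive differences of the row pointer.
    nnz_per_row = np.diff(row_ptr)
    small = np.flatnonzero(nnz_per_row <= small_max)
    medium = np.flatnonzero((nnz_per_row > small_max)
                            & (nnz_per_row <= medium_max))
    large = np.flatnonzero(nnz_per_row > medium_max)
    return small, medium, large
```

Each bin would then be dispatched to a kernel tuned for its row length (e.g., short rows packed several per wavefront, long rows split across many threads), which is the load-balancing effect the binning strategy targets.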
Funding: the National Key Research and Development Program of China (2024YFB4504103), the Major Science and Technology Special Projects in Henan Province (241111212300), and the National Key Research and Development Program of China (2023ZD0120604).