We have optimized the parallel threshold ILU algorithm(ParILUT)for GPUs.The optimizations are for three building blocks:candidate search and ILU residual computation,adding and removing elements,and threshold selectio...We have optimized the parallel threshold ILU algorithm(ParILUT)for GPUs.The optimizations are for three building blocks:candidate search and ILU residual computation,adding and removing elements,and threshold selection.Firstly,we fuse candidate search and ILU residual computation by modifying the ParILUT algorithm and extending the register-aware SpGEMM algorithm to calculate it.At the same time,we developed a GPU bin search algorithm to make the register-aware SpGEMM algorithm perform better in ParILUT.Secondly,we adopt a warp-row-parallel approach to add elements to new L and U and remove elements from candidates instead of the thread-row-parallel approach.And used the efficient GPU instructions to locate the positions of elements.Thirdly,we proposed a balanced classification tree in the threshold selection to balance the buckets’data,when a large number of elements with the same value.Finally,we experimented with the performance of each optimization and the whole ParILUT.And verified the correctness of the optimized ParILUT.The result indicates that the optimized ParILUT average speedup is 4.03 times over the original version,and the speedup increases with the amount of fill-in.展开更多
基金supported by the National Natural Science Foundation of China,under Grant 62172389.
文摘We have optimized the parallel threshold ILU algorithm(ParILUT)for GPUs.The optimizations are for three building blocks:candidate search and ILU residual computation,adding and removing elements,and threshold selection.Firstly,we fuse candidate search and ILU residual computation by modifying the ParILUT algorithm and extending the register-aware SpGEMM algorithm to calculate it.At the same time,we developed a GPU bin search algorithm to make the register-aware SpGEMM algorithm perform better in ParILUT.Secondly,we adopt a warp-row-parallel approach to add elements to new L and U and remove elements from candidates instead of the thread-row-parallel approach.And used the efficient GPU instructions to locate the positions of elements.Thirdly,we proposed a balanced classification tree in the threshold selection to balance the buckets’data,when a large number of elements with the same value.Finally,we experimented with the performance of each optimization and the whole ParILUT.And verified the correctness of the optimized ParILUT.The result indicates that the optimized ParILUT average speedup is 4.03 times over the original version,and the speedup increases with the amount of fill-in.