Convolution algorithms based on the Winograd implementation can reduce computational complexity and are widely used in CNNs.As an emerging GPU-like accelerator,DCU has achieved some performance optimization for the Wi...Convolution algorithms based on the Winograd implementation can reduce computational complexity and are widely used in CNNs.As an emerging GPU-like accelerator,DCU has achieved some performance optimization for the Winograd algorithm,but it fails to fully exploit the Matrix Cores of DCU to further enhance the efficiency of Winograd convolution computations.This paper proposes an improved fused Winograd convolution optimization scheme that integrates all transformation stages into a single kernel,which is specifically designed to exploit the characteristics of Matrix Cores.In the input transformation stage,we design an efficient data reuse mechanism that reduces redundant global memory accesses.In the element-wise matrix multiplication stage,we transform Hadamard products into batched GEMMs,boosting computational intensity and complying with the data layout requirements of Matrix Cores.During kernel fusion,we eliminate shared memory bank conflicts by reorganizing thread layout and further introduce software pipelining to effectively mask memory access latency.The results show that our method achieves average speedups of 1.35×and 1.72×(up to 1.81×and 2.78×)over the Winograd and Implicit GEMM algorithms in MIOpen under FP16 mode,and 1.22×and 1.53×(up to 1.55×and 1.88×)under FP32 mode.展开更多
基金funded by the National Key Research and Development Program of China(2023ZD0120604)the National Key Research and Development Program of China(2024YFB4504103)the Major Science and Technology Special Projects in Henan Province(241111212300).
文摘Convolution algorithms based on the Winograd implementation can reduce computational complexity and are widely used in CNNs.As an emerging GPU-like accelerator,DCU has achieved some performance optimization for the Winograd algorithm,but it fails to fully exploit the Matrix Cores of DCU to further enhance the efficiency of Winograd convolution computations.This paper proposes an improved fused Winograd convolution optimization scheme that integrates all transformation stages into a single kernel,which is specifically designed to exploit the characteristics of Matrix Cores.In the input transformation stage,we design an efficient data reuse mechanism that reduces redundant global memory accesses.In the element-wise matrix multiplication stage,we transform Hadamard products into batched GEMMs,boosting computational intensity and complying with the data layout requirements of Matrix Cores.During kernel fusion,we eliminate shared memory bank conflicts by reorganizing thread layout and further introduce software pipelining to effectively mask memory access latency.The results show that our method achieves average speedups of 1.35×and 1.72×(up to 1.81×and 2.78×)over the Winograd and Implicit GEMM algorithms in MIOpen under FP16 mode,and 1.22×and 1.53×(up to 1.55×and 1.88×)under FP32 mode.