Optimizing Winograd-based convolution with DCU’s matrix cores
Authors: Jiandong Shang, Fuchang Gao, Zhaopeng Li, Yizhe Sui, Gang Wu, Nan Wang, Lingling Wang, Dujuan Zhang. CCF Transactions on High Performance Computing, 2026, No. 1, pp. 107-119 (13 pages).
Convolution algorithms based on the Winograd implementation can reduce computational complexity and are widely used in CNNs. As an emerging GPU-like accelerator, DCU has achieved some performance optimization for the Winograd algorithm, but it fails to fully exploit the Matrix Cores of DCU to further enhance the efficiency of Winograd convolution computations. This paper proposes an improved fused Winograd convolution optimization scheme that integrates all transformation stages into a single kernel, which is specifically designed to exploit the characteristics of Matrix Cores. In the input transformation stage, we design an efficient data reuse mechanism that reduces redundant global memory accesses. In the element-wise matrix multiplication stage, we transform Hadamard products into batched GEMMs, boosting computational intensity and complying with the data layout requirements of Matrix Cores. During kernel fusion, we eliminate shared memory bank conflicts by reorganizing the thread layout and further introduce software pipelining to effectively mask memory access latency. The results show that our method achieves average speedups of 1.35× and 1.72× (up to 1.81× and 2.78×) over the Winograd and Implicit GEMM algorithms in MIOpen under FP16 mode, and 1.22× and 1.53× (up to 1.55× and 1.88×) under FP32 mode.
Keywords: Convolution; Fused Winograd; Batched GEMMs; Matrix Cores
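The abstract's step of turning Hadamard products into batched GEMMs can be illustrated with a small pure-Python sketch. It is not the authors' fused DCU kernel: the F(2×2, 3×3) tile size, the standard transform matrices, and every function name below are assumptions chosen for illustration, since the abstract does not state them. The point it shows is that, once the element-wise products are summed over input channels, each of the 16 positions in the 4×4 transformed tile becomes an independent (K × C) by (C × T) matrix multiplication, where K is the number of filters, C the number of channels, and T the number of tiles.

```python
# Standard F(2x2, 3x3) Winograd transform matrices (assumed tile size).
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(a, b):
    """Plain dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def direct_conv(inp, flt):
    """Reference: valid 3x3 cross-correlation, inp[c][h][w], flt[k][c][3][3]."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    return [[[sum(inp[c][y + r][x + s] * flt[k][c][r][s]
                  for c in range(C) for r in range(3) for s in range(3))
              for x in range(W - 2)] for y in range(H - 2)]
            for k in range(len(flt))]


def winograd_conv(inp, flt):
    """Winograd F(2x2, 3x3); output height and width must be even."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    K = len(flt)
    tiles = [(ty, tx) for ty in range((H - 2) // 2) for tx in range((W - 2) // 2)]
    T = len(tiles)

    # Filter transform: U[k][c] = G g G^T, a 4x4 matrix per (filter, channel).
    U = [[matmul(matmul(G, flt[k][c]), transpose(G)) for c in range(C)]
         for k in range(K)]
    # Input transform: V[c][t] = B^T d B for each overlapping 4x4 input tile.
    V = [[matmul(matmul(BT, [row[2 * tx:2 * tx + 4]
                             for row in inp[c][2 * ty:2 * ty + 4]]),
                 transpose(BT))
          for (ty, tx) in tiles] for c in range(C)]

    # Batched GEMM: for each of the 16 tile positions, gather a (K x C)
    # filter matrix and a (C x T) input matrix and multiply them. This
    # replaces the per-channel Hadamard products plus channel reduction.
    M = {}
    for xi in range(4):
        for nu in range(4):
            u_mat = [[U[k][c][xi][nu] for c in range(C)] for k in range(K)]
            v_mat = [[V[c][t][xi][nu] for t in range(T)] for c in range(C)]
            M[(xi, nu)] = matmul(u_mat, v_mat)

    # Output transform: Y = A^T m A gives one 2x2 output tile.
    out = [[[0.0] * (W - 2) for _ in range(H - 2)] for _ in range(K)]
    for k in range(K):
        for t, (ty, tx) in enumerate(tiles):
            m = [[M[(xi, nu)][k][t] for nu in range(4)] for xi in range(4)]
            y = matmul(matmul(AT, m), transpose(AT))
            for i in range(2):
                for j in range(2):
                    out[k][2 * ty + i][2 * tx + j] = y[i][j]
    return out
```

Calling `winograd_conv(inp, flt)` on an input of shape `[C][H][W]` and filters of shape `[K][C][3][3]` should agree with `direct_conv(inp, flt)` up to floating-point rounding. On real hardware the 16 small products would be issued as one batched GEMM so that matrix units stay busy; the sketch only demonstrates the algebraic regrouping, not the memory layout, kernel fusion, or pipelining described in the abstract.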