General matrix multiplication is a vital operation in high-performance computing and has wide applications in areas such as computational fluid dynamics and deep learning (DL). While there are many optimization techniques available for large matrix multiplications on CPUs and GPUs, handling batches of small matrix operations requires innovative solutions. Digital Signal Processors (DSPs) offer a promising alternative for processing DL workloads; however, the architectural differences between DSPs and conventional processors like CPUs and GPUs necessitate the development of specialized optimization strategies. This paper introduces mtSmm, an optimization approach tailored for small matrix multiplications on multi-core DSPs. Our approach focuses on the batch-as-vector paradigm, efficient on-chip memory management, and a well-designed micro-kernel. By maximizing computational resources, optimizing instruction-level and thread-level parallelism, and enhancing memory access patterns, our approach significantly improves performance. Experimental results on the FT-M7032 DSP demonstrate that our method can achieve up to 83% of the theoretical peak performance of the hardware, significantly outperforming current state-of-the-art methods for batches of small matrices.
Funding: supported by the National Key Research and Development Program of China under Grant No. 2023YFB3001503.
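To make the batch-as-vector paradigm concrete, the sketch below shows one common way it is realized: instead of vectorizing within each small matrix, element (i, j) of every matrix in the batch is stored contiguously, so the innermost loop runs over the batch dimension and can be vectorized. This is a minimal, hypothetical illustration of the general idea, not the paper's mtSmm implementation; the function name, layout, and parameters are assumptions.

```c
#include <stddef.h>

/* Hypothetical batch-as-vector sketch: matrices are stored interleaved,
 * with the batch index b varying fastest, i.e. A[(i*K + k)*batch + b].
 * The innermost loop then touches contiguous memory across the batch,
 * which a vectorizing compiler (or explicit SIMD) can exploit. */
void batched_smm(const float *A, const float *Bm, float *C,
                 size_t M, size_t K, size_t N, size_t batch)
{
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j)
            for (size_t k = 0; k < K; ++k)
                for (size_t b = 0; b < batch; ++b)   /* vectorizable */
                    C[(i*N + j)*batch + b] +=
                        A[(i*K + k)*batch + b] * Bm[(k*N + j)*batch + b];
}
```

On a real multi-core DSP, the batch loop would additionally be tiled across cores and staged through on-chip scratchpad memory, which is where the on-chip memory management and micro-kernel design described in the abstract come in.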