Funding: Supported by the Federal Ministry of Research, Technology and Space under funding code "KI-Servicezentrum Berlin-Brandenburg" 16IS22092.
Abstract: Optimizing GEneral Matrix Multiplication (GEMM) on GPU platforms is becoming increasingly critical to meet the growing computational demands of modern deep neural network research. While significant progress has been made in accelerating high-precision GEMM, the optimization of low-bit GEMM remains a challenging open problem. The CUTLASS library provides highly optimized low-bit GEMM templates leveraging Tensor Cores; however, performance varies considerably depending on tile and pipeline configurations across different GPU architectures. In this work, we propose a novel auto-tuning framework for low-bit CUTLASS GEMM, utilizing a neural network model to predict optimal GEMM template parameters for target GPUs. Our model is trained on a synthetic dataset with up to 116,100 unique samples, encompassing diverse matrix sizes across various Ampere GPUs, and is thoroughly evaluated on these hardware platforms. Experimental results show that our method achieves an accuracy of up to 95.11% on the validation dataset. Furthermore, real-time evaluations of low-bit data types on the A100 GPU demonstrate speedups of up to 1.99× for GEMM operations and 1.28× for the linear layer, compared to the default CUTLASS templates.
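The core idea of the abstract above, a learned model that maps a GEMM problem size to a CUTLASS tile/pipeline configuration, can be illustrated with a minimal sketch. The template list, features, and single linear layer below are assumptions for illustration only; they are not the paper's actual configuration space or trained network.

```python
# Hypothetical sketch of neural auto-tuning for GEMM templates: a model
# scores candidate (tile_M, tile_N, tile_K, pipeline_stages) configurations
# from the problem size (M, N, K) and picks the argmax. The candidate set
# and the untrained linear "model" are illustrative assumptions.
import numpy as np

# Illustrative candidate templates, not CUTLASS's real configuration space.
TEMPLATES = [
    (128, 128, 64, 3),
    (128, 64, 64, 4),
    (64, 128, 64, 4),
    (64, 64, 64, 5),
]

def featurize(m, n, k):
    """Log-scaled problem-size features, as a predictor might consume them."""
    return np.log2(np.array([m, n, k], dtype=np.float64))

def predict_template(m, n, k, weights, bias):
    """One linear layer + argmax stands in for the trained neural network."""
    scores = featurize(m, n, k) @ weights + bias
    return TEMPLATES[int(np.argmax(scores))]

# Untrained placeholder parameters: 3 features -> 4 template scores.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = np.zeros(4)

choice = predict_template(4096, 4096, 4096, W, b)
print(choice)
```

In practice the predicted tuple would select a pre-instantiated CUTLASS kernel at dispatch time, so the model's inference cost is paid once per problem shape rather than per GEMM call.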
Abstract: With the exponential growth of large language model (LLM) parameter counts, model deployment and inference face severe memory and compute resource challenges. Quantization, a core model-compression technique, significantly reduces storage requirements and computational cost by lowering the numerical precision of weights and activations. This survey first reviews the evolution of quantization, from classical Int8/Int4 methods to state-of-the-art ultra-low-bit algorithms, summarizing the technical characteristics and performance trends of representative methods, and observes that conventional real-valued quantization is limited by discretization error at extremely low bit widths and struggles to break through its performance ceiling. It then systematically surveys a line of work on complex-domain quantization. This line of work proposes a quantization paradigm based on the complex domain: by introducing two degrees of freedom, magnitude and phase, into the parameter representation, it significantly expands the model's expressive space. In addition, by analogy with the classical signal-processing paradigm of obtaining stable representations through Fourier transform and low-pass filtering of time-domain signals, it further proposes a technical route in which a real-valued model undergoes a complex-domain transform followed by complex-domain quantization, achieving stable multiplication-free inference. Experimental results show that this approach outperforms existing ultra-low-bit quantization methods on multiple benchmark datasets, effectively breaking the performance ceiling of real-valued models and demonstrating the potential of complex-domain quantization for efficient modeling with preserved performance. Overall, through a systematic analysis of the evolution of quantization techniques and of the complex-domain quantization line of work, this survey aims to reveal the development patterns and future trends of ultra-low-bit quantization, providing a reference for both theoretical research and engineering practice on efficient large models.
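The magnitude-plus-phase idea in the abstract above can be sketched concretely: quantize a complex weight's magnitude and phase separately, each to a few bits. The bit widths and codebooks below are assumptions for illustration, not the surveyed methods' actual schemes.

```python
# Illustrative sketch of complex-domain quantization: each complex weight is
# represented by a quantized magnitude (2^mag_bits uniform levels) and a
# quantized phase (2^phase_bits uniform angles), i.e. two degrees of freedom
# per parameter. Codebooks here are simple assumptions, not the paper's.
import numpy as np

def quantize_complex(w, mag_bits=2, phase_bits=2):
    """Quantize a complex array to 2^mag_bits magnitude levels and
    2^phase_bits uniformly spaced phase angles (assumes w is not all zero)."""
    mag, phase = np.abs(w), np.angle(w)
    # Magnitude codebook: mid-rise uniform levels up to the observed maximum.
    levels = 2 ** mag_bits
    step = mag.max() / levels
    q_mag = (np.clip(np.round(mag / step - 0.5), 0, levels - 1) + 0.5) * step
    # Phase codebook: 2^phase_bits angles on the unit circle.
    sector = 2 * np.pi / 2 ** phase_bits
    q_phase = np.round(phase / sector) * sector
    return q_mag * np.exp(1j * q_phase)

rng = np.random.default_rng(1)
w = rng.normal(size=8) + 1j * rng.normal(size=8)
wq = quantize_complex(w)
```

Note that with 2 phase bits every quantized value lies on a coordinate axis of the complex plane, so multiplying by it reduces to sign flips and real/imaginary swaps; this is one way the "multiplication-free inference" mentioned in the abstract can arise.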
Abstract: This paper studies cooperative detection for distributed radar on moving platforms under constrained hardware resources, and proposes a cooperative detection algorithm for heterogeneous data based on node-adaptive low-bit quantization. First, a heterogeneous data model with adaptive low-bit quantization is established for moving-platform distributed radar, and the likelihood function with respect to the target's multi-dimensional unknown state and the propagation attenuation is derived. Second, the coupling among the multi-dimensional unknown parameters induced by the nonlinear low-bit quantization transform is analyzed, the corresponding objective function is constructed, and a joint multi-dimensional parameter estimation algorithm based on batch gradient descent and differential evolution is designed. Finally, based on the Generalized Likelihood Ratio Test (GLRT) criterion, the heterogeneous detector and its statistical properties are derived for moving-platform distributed radar, a constant-false-alarm detection threshold is designed, and the system's theoretical performance is established. Numerical results from two sets of simulation experiments demonstrate the effectiveness and robustness of the proposed method and confirm its broad application prospects.
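Two building blocks from the abstract above, low-bit quantization at a node and a constant-false-alarm detection threshold, can be sketched as follows. The uniform quantizer, the simple energy statistic, and the Monte Carlo threshold are stand-in assumptions for illustration; the paper's actual GLRT detector and its closed-form threshold design are not reproduced here.

```python
# Hedged sketch: (1) a uniform low-bit quantizer such as a resource-limited
# radar node might apply to its samples, and (2) a detection threshold set by
# Monte Carlo under the noise-only hypothesis so that the false-alarm rate is
# approximately a chosen pfa. The energy statistic is a simplified stand-in
# for the paper's GLRT statistic.
import numpy as np

def low_bit_quantize(x, bits, vmax):
    """Uniform mid-rise quantizer to 2^bits levels on [-vmax, vmax]."""
    levels = 2 ** bits
    step = 2 * vmax / levels
    idx = np.clip(np.floor((x + vmax) / step), 0, levels - 1)
    return -vmax + (idx + 0.5) * step

def energy_statistic(y):
    """Simple energy detector statistic on the (quantized) samples."""
    return np.sum(y ** 2)

def cfar_threshold(n, bits, vmax, pfa, trials=20000, seed=0):
    """Monte Carlo threshold: quantile of the statistic under noise-only
    data passed through the same low-bit quantizer as live data."""
    rng = np.random.default_rng(seed)
    stats = [energy_statistic(low_bit_quantize(rng.normal(size=n), bits, vmax))
             for _ in range(trials)]
    return np.quantile(stats, 1 - pfa)

thr = cfar_threshold(n=64, bits=2, vmax=3.0, pfa=0.01)
```

Because the statistic of low-bit data is discrete, the achieved false-alarm rate sits at or below the requested pfa rather than matching it exactly; a closed-form threshold, as derived in the paper, avoids the Monte Carlo cost at each node.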