Funding: the National Key Research and Development Program of China under Grant 2023YFB2806000, the Postdoctoral Fellowship Program of CPSF under Grant GZC20241305, and the Proof of Concept Foundation of Xidian University Hangzhou Institute of Technology under Grant GNYZ2024JC004.
Abstract: Large language models (LLMs) have exhibited remarkable performance across a broad spectrum of tasks, yet their extensive computational and memory requirements present substantial challenges for deployment in resource-constrained scenarios. To address these challenges, this work introduces software and hardware co-optimization strategies aimed at enhancing the inference performance of LLMs on ARM CPU-based platforms. A mixed-precision quantization technique is employed, preserving the precision of critical weights to maintain model accuracy while quantizing non-essential weights to INT8, thereby reducing the model's memory footprint. This work also capitalizes on the SIMD instruction set of ARM CPUs to efficiently process model data. Furthermore, the inference framework is optimized by fusing components of the attention computation and streamlining the dequantization process through modifications to the scaling factor. These enhancements result in a significant reduction in model memory usage and improved throughput during the prefill and decode stages. The efficacy of the proposed approach is demonstrated through the optimization of the Qwen-1.8B model on Armv9, with only a 0.66% decrease in accuracy and a reduction in memory usage to 58.8% of the baseline, while achieving a 4.09× and 15.23× increase in inference performance for the prefill and decode stages over the baseline, respectively.
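The mixed-precision idea in the abstract can be illustrated with a minimal sketch: non-essential weights are quantized to INT8 with a scale factor used later for dequantization, while a small fraction of critical weights is kept in floating point. The selection criterion (largest magnitude), the critical ratio, and the per-tensor symmetric scale below are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def quantize_mixed_precision(weights, critical_ratio=0.01):
    """Keep the largest-magnitude ('critical') weights in FP32 and
    quantize the remaining weights to INT8 with one scale factor."""
    flat = np.abs(weights).ravel()
    k = max(1, int(critical_ratio * flat.size))
    threshold = np.partition(flat, -k)[-k]           # magnitude cutoff
    critical_mask = np.abs(weights) >= threshold     # weights kept in FP32

    # Symmetric per-tensor scale over the non-critical weights only.
    noncrit = np.where(critical_mask, 0.0, weights)
    scale = float(np.max(np.abs(noncrit))) / 127.0 if np.any(~critical_mask) else 1.0
    q = np.clip(np.round(noncrit / scale), -127, 127).astype(np.int8)

    return q, scale, critical_mask, weights * critical_mask

def dequantize(q, scale, critical_mask, critical_fp):
    """Reconstruct an approximate FP32 tensor: INT8 values are
    rescaled, critical weights are restored exactly."""
    return np.where(critical_mask, critical_fp, q.astype(np.float32) * scale)
```

The rounding error for each non-critical weight is bounded by half the scale factor, while critical weights survive unchanged, which is the mechanism by which such schemes trade memory for a small accuracy loss.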
Abstract: The power model of cloud servers is an important topic in research on energy optimization for cloud data centers. The CPU power model is a key component of the server power model, yet existing CPU power models do not account for CPU heterogeneity; in particular, power models for ARM-architecture server CPUs are lacking. Building on a survey and analysis of existing ARM CPU power models, this work proposes a new CPU power model for the ARM architecture: the hybrid-modeling-based CPU power model (Hybrid Based Model, HBM). HBM jointly considers modeling features such as CPU utilization and CPU performance events. Compared with existing performance-counter-based CPU power models, which achieve high estimation accuracy, HBM attains comparable accuracy at a lower model-training cost, making it better suited to CPU power modeling on ARM servers. HBM is validated experimentally with the Sysbench workload tool, and the results show that its mean relative error (MRE) stays within 1%, indicating good estimation accuracy. In addition, cross-architecture experiments on x86 and ARM servers show that servers of different architectures exhibit distinct CPU power behavior and should therefore be modeled with different CPU power modeling methods.
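A hybrid power model of the kind the abstract describes can be sketched as a linear fit of measured power against CPU utilization plus performance-event rates, evaluated with the same MRE metric. The feature set (utilization, an instruction rate, a cache-miss rate) and all sample numbers below are invented for illustration; they are not HBM's actual features or data.

```python
import numpy as np

# Hypothetical training samples: one row per sampling interval.
# Columns: CPU utilization (%), instructions retired per second,
# last-level-cache misses per second.
X = np.array([
    [10.0, 1.2e9,  3.0e6],
    [30.0, 3.5e9,  9.0e6],
    [55.0, 6.1e9,  1.8e7],
    [80.0, 8.8e9,  2.6e7],
    [95.0, 1.05e10, 3.1e7],
])
y = np.array([45.0, 58.0, 74.0, 92.0, 103.0])  # measured power (W)

# Fit P = c0 + c1*util + c2*instr + c3*llc_miss by least squares,
# combining a utilization feature with performance-event features.
A = np.hstack([np.ones((X.shape[0], 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_power(util, instr, llc_miss):
    """Estimate CPU power (W) for one sampling interval."""
    return float(coef @ np.array([1.0, util, instr, llc_miss]))

# Mean relative error (MRE) of the fit on the training samples.
mre = float(np.mean(np.abs(A @ coef - y) / y))
```

Utilization alone is cheap to collect but coarse; performance counters capture microarchitectural activity at a higher collection cost. Mixing both is the design choice that lets a hybrid model approach counter-based accuracy with less training effort.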