Abstract: In this study, a microchannel liquid cooling plate (LCP) is proposed, based on topology optimization (TO), for Intel Xeon processors with a 52.5 mm × 45 mm package. Firstly, a mathematical model for the topology optimization design of the LCP is established with heat dissipation and pressure drop as objectives, and a series of two-dimensional (2D) topology configurations is obtained for different weighting factors between the two objectives. It is found that the biomimetic character of the topologically optimized flow channels is more pronounced at low Reynolds numbers. Secondly, the topology configuration is extruded into a three-dimensional (3D) model for CFD simulation under actual operating conditions. The results show that the thermal resistance and pressure drop of the topology-optimized LCP are approximately 20%-50% lower than those of traditional serpentine and straight-microchannel structures, and its Nusselt number is up to 76.1% higher than that of the straight-microchannel design. Moreover, at high flow rates the straight-microchannel LCP exhibits significant backflow and vortices, and the topology-optimized LCP likewise tends to lose effectiveness through tree-root-shaped branch flows; suitable flow rate ranges for the LCPs are therefore provided. Furthermore, the measured temperatures and pressure drops are consistent with the numerical results, verifying the performance of the topology-optimized flow-channel LCP.
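The abstract does not state the weighted bi-objective functional explicitly; a minimal sketch of the usual density-based form, assuming the common mean-temperature and flow-dissipation objectives with a Brinkman-penalized design field $\gamma \in [0,1]$ (notation assumed, not taken from the paper), is

```latex
\min_{\gamma}\; J(\gamma) = w\,\Phi_{\mathrm{th}}(\gamma) + (1-w)\,\Phi_{\mathrm{fl}}(\gamma),
\qquad
\Phi_{\mathrm{th}} = \int_{\Omega} T \,\mathrm{d}\Omega,
\quad
\Phi_{\mathrm{fl}} = \int_{\Omega} \Big( \tfrac{1}{2}\,\mu\,\nabla\mathbf{u} : \nabla\mathbf{u} + \alpha(\gamma)\,\mathbf{u}\cdot\mathbf{u} \Big)\,\mathrm{d}\Omega,
```

where $w \in [0,1]$ is the weighting factor swept to generate the series of 2D configurations, $\Phi_{\mathrm{th}}$ penalizes high temperatures (heat dissipation objective), $\Phi_{\mathrm{fl}}$ is the fluid power dissipation (pressure drop objective), and $\alpha(\gamma)$ is the Brinkman inverse permeability interpolating between solid and fluid.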
Funding: Supported by the National Key Research and Development Program of China under Grant 2023YFB2806000, the Postdoctoral Fellowship Program of CPSF under Grant GZC20241305, and the Proof of Concept Foundation of Xidian University Hangzhou Institute of Technology under Grant GNYZ2024JC004.
Abstract: Large language models (LLMs) have exhibited remarkable performance across a broad spectrum of tasks, yet their extensive computational and memory requirements present substantial challenges for deployment in resource-constrained scenarios. To address these challenges, this work introduces software and hardware co-optimization strategies aimed at enhancing the inference performance of LLMs on ARM CPU-based platforms. A mixed-precision quantization technique is employed, preserving the precision of critical weights to maintain model accuracy while quantizing non-essential weights to INT8, thereby reducing the model's memory footprint. This work also capitalizes on the SIMD instruction set of ARM CPUs to process model data efficiently. Furthermore, the inference framework is optimized by fusing components of the attention computation and streamlining the dequantization process through modifications to the scaling factor. These enhancements yield a significant reduction in model memory usage and improved throughput during the prefill and decode stages. The efficacy of the proposed approach is demonstrated by optimizing the Qwen-1.8B model on Armv9: with only a 0.66% decrease in accuracy, memory usage falls to 58.8% of the baseline, while inference performance in the prefill and decode stages rises by 4.09× and 15.23× over the baseline, respectively.
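The abstract does not detail the quantization scheme; the following is a minimal NumPy sketch of one common mixed-precision approach, in which the largest-magnitude ("critical") weights are kept in full precision and the rest are quantized to symmetric INT8 with a per-channel scale. The outlier criterion and the names (quantize_mixed, outlier_ratio) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def quantize_mixed(w: np.ndarray, outlier_ratio: float = 0.01):
    """Per-output-channel symmetric INT8 quantization that keeps the
    largest-magnitude ("critical") weights in full precision.

    Returns (q, scales, mask, kept): INT8 weights, per-channel scales,
    a boolean map of preserved positions, and the preserved values.
    """
    # Mark the top `outlier_ratio` fraction of |w| as critical.
    thresh = np.quantile(np.abs(w), 1.0 - outlier_ratio)
    mask = np.abs(w) >= thresh
    kept = np.where(mask, w, 0.0).astype(np.float32)

    # Quantize the remaining weights per output channel (rows of w).
    base = np.where(mask, 0.0, w)
    scales = np.abs(base).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0.0] = 1.0                      # avoid div-by-zero
    q = np.clip(np.round(base / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32), mask, kept

def dequant_matmul(x: np.ndarray, q, scales, mask, kept) -> np.ndarray:
    """y = x @ W^T with W reconstructed on the fly. Folding `scales`
    into the accumulated result (one multiply per output channel,
    rather than dequantizing every weight) mirrors the streamlined
    dequantization described in the abstract."""
    acc = x @ q.T.astype(np.float32)   # INT8 path (FP accumulation here)
    y = acc * scales.T                 # one scale per output channel
    y += x @ kept.T                    # full-precision critical path
    return y

# Usage sketch:
# w = np.random.randn(128, 256).astype(np.float32)
# x = np.random.randn(4, 256).astype(np.float32)
# y = dequant_matmul(x, *quantize_mixed(w))   # approximates x @ w.T
```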
Funding: Project supported by the National Natural Science Foundation of China (No. 61272145) and the National High-Tech R&D Program (863) of China (No. 2012AA012706).
Abstract: OpenCL is an open heterogeneous programming framework. Although OpenCL programs are functionally portable, they are not performance portable, so code transformation often plays an irreplaceable role. When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and has therefore been used extensively. However, the locality optimizations embedded in GPU-specific OpenCL code are usually inherited without analysis, which may harm CPU performance. Typically, using OpenCL's local memory on multi-core/many-core CPUs can have the opposite performance effect, because local-memory arrays no longer match the hardware well and the associated synchronizations are costly. To resolve this dilemma, we analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels, which can thus be adapted for CPUs by (1) removing all the unwanted local-memory arrays together with the obsolete barrier statements and (2) optimizing the coarsened kernel code with vectorization and locality re-exploitation. Moreover, we have developed an automated tool chain that performs this transformation of GPU-specific OpenCL kernels into a CPU-friendly form, accompanied by a scheduler that forms a new OpenCL runtime. Experiments show that the automated transformation improves OpenCL kernel performance on a multi-core CPU by an average factor of 3.24. Satisfactory performance improvements are also achieved on Intel's many-integrated-core coprocessor. The resulting performance on both architectures is better than or comparable with the corresponding OpenMP performance.
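As a conceptual illustration of the transformation (not the tool chain's actual output), the NumPy sketch below emulates how a GPU-style work-group, which stages a halo tile in local memory, synchronizes, then computes, is rewritten for a CPU: the local-memory array and barrier disappear, and the per-work-item body is coarsened into one vectorized expression over the whole range. The 1D-stencil example and all names are assumptions for illustration.

```python
import numpy as np

GROUP = 64  # work-group size (illustrative)

def stencil_gpu_style(x: np.ndarray) -> np.ndarray:
    """GPU-flavored logic: each work-group stages a tile with halo in
    "local memory", barriers, then each work-item computes one output."""
    n = len(x) - 2
    y = np.empty(n, dtype=x.dtype)
    for g in range(0, n, GROUP):
        m = min(GROUP, n - g)
        tile = x[g:g + m + 2].copy()   # explicit local-memory staging
        # ... barrier(CLK_LOCAL_MEM_FENCE) would sit here on a GPU ...
        for i in range(m):             # one iteration per work-item
            y[g + i] = tile[i] + 2 * tile[i + 1] + tile[i + 2]
    return y

def stencil_cpu_coarsened(x: np.ndarray) -> np.ndarray:
    """CPU-friendly form: local array and barrier removed, and the whole
    work-group coarsened into one vectorizable expression on global data."""
    return x[:-2] + 2 * x[1:-1] + x[2:]

# x = np.arange(10_000, dtype=np.float32)
# assert np.allclose(stencil_gpu_style(x), stencil_cpu_coarsened(x))
```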
Abstract: With the rapid growth of informatization, the efficient management of large pools of cloud computing resources has become a major challenge in IT operations, and accurate workload prediction is a key technique for addressing it. This paper proposes STL-DeepAR-HW, a combined prediction model based on Seasonal and Trend decomposition using Loess (STL), the Holt-Winters model, and the deep autoregressive model DeepAR. The fast Fourier transform and the autocorrelation function are first used to extract the periodicity of the data, and the optimal period thus obtained drives an STL decomposition of the data into trend, seasonal, and residual components. DeepAR and Holt-Winters then forecast the trend and seasonal components, respectively, and their forecasts are combined into the final prediction. Experiments on the public AzurePublicDataset show that, compared with models such as Transformer, Stacked-LSTM, and Prophet, the combined model achieves higher accuracy and applicability in workload prediction.
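A minimal sketch of the combination pipeline follows, assuming statsmodels for STL and Holt-Winters; since DeepAR requires a trained neural network (e.g., from GluonTS), a simple linear-drift extrapolation stands in below where its trend forecast would plug in. Function names and the FFT-based period pick are illustrative assumptions.

```python
import numpy as np
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def dominant_period(y: np.ndarray) -> int:
    """Pick the dominant period from the FFT magnitude spectrum
    (the paper additionally cross-checks with the autocorrelation)."""
    spec = np.abs(np.fft.rfft(y - y.mean()))
    freqs = np.fft.rfftfreq(len(y))
    k = spec[1:].argmax() + 1           # skip the zero-frequency bin
    return max(2, int(round(1.0 / freqs[k])))

def stl_combo_forecast(y: np.ndarray, horizon: int) -> np.ndarray:
    period = dominant_period(y)
    parts = STL(y, period=period).fit()  # trend / seasonal / resid

    # Seasonal component -> Holt-Winters with additive seasonality.
    hw = ExponentialSmoothing(parts.seasonal, seasonal="add",
                              seasonal_periods=period).fit()
    seasonal_fc = hw.forecast(horizon)

    # Trend component -> DeepAR in the paper; a linear-drift
    # extrapolation stands in for the trained network here.
    t = np.arange(len(y))
    a, b = np.polyfit(t, parts.trend, 1)
    trend_fc = a * (t[-1] + 1 + np.arange(horizon)) + b

    return trend_fc + seasonal_fc        # recombine the two forecasts

# Usage sketch: y = a CPU-utilization series; fc = stl_combo_forecast(y, 48)
```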
Abstract: Numerical simulation based on response coefficients is a common method for assessing the environmental capacity of bays and harbors, but common ocean models offer no tracer module that can compute the response coefficient fields of multiple release points simultaneously and without mutual interference. Tailored to the response coefficient method, this study improves the tracer module (dye tracking, DYE) of the three-dimensional hydrodynamic ocean model FVCOM (Finite-Volume Community Ocean Model): multiple independent modules with the same functionality as the original DYE module are added and integrated in parallel, so that FVCOM can simultaneously compute multiple conservative tracer fields that do not interfere with one another. The scheme was tested on an idealized rectangular basin and on an idealized-topography case of Xiangshan Bay. The results show that the advection-diffusion processes of multi-point-source tracers simulated by the improved algorithm do not affect one another, and the simulated response coefficient fields agree with those of the traditional algorithm. Compared with the traditional algorithm, the improved one takes less time, raising computational efficiency by up to 85% for the idealized rectangular case and up to 78% for the Xiangshan Bay case, and it utilizes CPU processes more fully under parallel execution. Using the improved DYE module to compute response coefficient fields can shorten the overall time needed for marine environmental capacity assessment.
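For context, the response coefficient method that this modification accelerates rests on the linearity of the conservative advection-diffusion equation: in standard notation (assumed here, not quoted from the paper), the concentration field produced by $M$ pollutant sources superposes as

```latex
C(\mathbf{x}) = \sum_{i=1}^{M} \alpha_i(\mathbf{x})\, Q_i,
\qquad
\alpha_i(\mathbf{x}) = \frac{C_i(\mathbf{x})}{Q_i},
```

where $Q_i$ is the load released at source $i$ and $\alpha_i$ is its response coefficient field, obtained by simulating a release from source $i$ alone. Running the $M$ releases as independent, parallel DYE modules inside a single FVCOM integration, instead of as $M$ separate model runs, is what yields the reported savings in overall assessment time.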