Journal Articles
6 articles found
1. Convergence-aware operator-wise mixed-precision training
Authors: Wenhao Dai, Ziyi Jia, Yuesi Bai, Qingxiao Sun. CCF Transactions on High Performance Computing, 2025, Issue 1, pp. 43-57 (15 pages).
Abstract: With the support of more precision formats in emerging hardware architectures, mixed-precision has become a popular approach to accelerate deep learning (DL) training. Applying low-precision formats such as FP16 and BF16 to neural operators can save GPU memory while improving bandwidth. However, DL frameworks use black and white lists as default mixed-precision selections and cannot flexibly adapt to a variety of neural networks. In addition, existing work on automatic precision adjustment does not consider model convergence, and the decision cost of precision selection is high. To address these problems, this paper proposes CoMP, a non-intrusive framework for Convergence-aware operator-wise Mixed-Precision training. CoMP uses a two-stage precision adjustment based on epochs and batches to ensure convergence and performance, respectively. After that, CoMP performs subsequent training according to the searched optimal operator-wise mixed-precision plan. Experimental results on an A100 GPU show that CoMP achieves a maximum performance speedup of 1.15× compared with the PyTorch AMP implementation, while also saving up to 29.81% of GPU memory.
Keywords: GPU; mixed-precision; neural network training; auto-tuning; performance optimization
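To illustrate what an operator-wise mixed-precision plan can look like in practice, here is a minimal PyTorch sketch. The `PrecisionWrapped` module, the `apply_plan` helper, and the example plan are hypothetical names for illustration only, not CoMP's API; BF16 is used so the snippet also runs on CPU.

```python
import torch
import torch.nn as nn

# A sketch only: wrap selected operators so they run in low precision while the
# rest of the model stays in FP32. PrecisionWrapped, apply_plan, and the example
# plan are hypothetical, not CoMP's API; BF16 is used so this also runs on CPU.
class PrecisionWrapped(nn.Module):
    def __init__(self, inner, dtype):
        super().__init__()
        self.inner = inner.to(dtype)
        self.dtype = dtype

    def forward(self, x):
        y = self.inner(x.to(self.dtype))
        return y.float()  # hand FP32 activations back to the next operator

def apply_plan(model, plan):
    """plan maps qualified module names to dtypes, e.g. {'0': torch.bfloat16}."""
    for name, dtype in plan.items():
        parent = model
        *path, leaf = name.split(".")
        for part in path:
            parent = getattr(parent, part)
        setattr(parent, leaf, PrecisionWrapped(getattr(parent, leaf), dtype))
    return model

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
model = apply_plan(model, {"0": torch.bfloat16})  # first Linear in BF16, rest FP32
print(model(torch.randn(8, 64)).shape)
```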
2. Mixed-precision block incomplete sparse approximate preconditioner on Tensor core
Authors: Haoyuan Zhang, Wenpeng Ma, Wu Yuan, Jian Zhang, Zhonghua Lu. CCF Transactions on High Performance Computing, 2024, Issue 1, pp. 54-67 (14 pages).
Abstract: In this paper, we propose and implement a mixed-precision Block-ISAI preconditioner for solving linear systems from multiphysics areas. By leveraging FP32 computing, our approach accelerates the sparse matrix-vector product kernel while maintaining satisfactory accuracy. Meanwhile, an efficient, warp-based GPU implementation of the Block-ISAI preconditioner with Tensor core acceleration is proposed. For the matrix-multiplication portion, we use the double-precision Tensor cores on the NVIDIA A100 GPU to accelerate it. To showcase the effectiveness of our method, detailed comparisons are made, which show a noteworthy speedup: precisely, it is 6× faster than cuSPARSE and 11.2× faster than PETSc's built-in preconditioner.
Keywords: Block-ISAI; GPU; mixed-precision; Tensor core; preconditioner
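The following SciPy sketch shows the general mixed-precision preconditioning idea: the outer GMRES iteration stays in FP64 while the preconditioner is stored and applied in FP32. A Jacobi (diagonal) approximate inverse and a synthetic matrix stand in for the paper's warp-based Block-ISAI kernel; this is not the paper's implementation.

```python
import numpy as np
from scipy.sparse import diags, identity, random as sparse_random
from scipy.sparse.linalg import LinearOperator, gmres

# A sketch only: FP64 outer solve, FP32 preconditioner application.
n = 500
A = (sparse_random(n, n, density=0.01, random_state=0) + 10.0 * identity(n)).tocsr()
b = np.random.default_rng(0).standard_normal(n)

# FP32 diagonal approximate inverse (stand-in for the Block-ISAI preconditioner).
M32 = diags(1.0 / A.diagonal()).astype(np.float32).tocsr()

def apply_prec(r):
    # Apply the preconditioner in single precision, promote the result to FP64.
    return (M32 @ r.astype(np.float32)).astype(np.float64)

M = LinearOperator((n, n), matvec=apply_prec)
x, info = gmres(A, b, M=M)
print(info, np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```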
3. Enhancing LLM Inference Performance on ARM CPUs Through Software and Hardware Co-Optimization Strategies
Authors: Cheng Zhang, Xingyu Zhu, Longhao Chen, Tingjie Yang, Evens Pan, Guosheng Yu, Yang Zhao, Xiguang Wu, Bo Li, Wei Mao, Genquan Han. Integrated Circuits and Systems, 2025, Issue 2, pp. 49-57 (9 pages).
Abstract: Large language models (LLMs) have exhibited remarkable performance across a broad spectrum of tasks, yet their extensive computational and memory requirements present substantial challenges for deployment in resource-constrained scenarios. To address these challenges, this work introduces software and hardware co-optimization strategies aimed at enhancing the inference performance of LLMs on ARM CPU-based platforms. A mixed-precision quantization technique is employed, preserving the precision of critical weights to maintain model accuracy while quantizing non-essential weights to INT8, thereby reducing the model's memory footprint. This work also capitalizes on the SIMD instruction set of ARM CPUs to efficiently process model data. Furthermore, the inference framework is optimized by fusing components of the attention computation and streamlining the dequantization process through modifications to the scaling factor. These enhancements result in a significant reduction in model memory usage and improved throughput during the prefill and decode stages. The efficacy of the proposed approach is demonstrated through the optimization of the Qwen-1.8B model on Armv9, with only a 0.66% decrease in accuracy and a reduction in memory usage to 58.8% of the baseline, while achieving a 4.09× and 15.23× increase in inference performance for the prefill and decode stages over the baseline, respectively.
Keywords: model compression; mixed-precision quantization; ARM CPUs; SIMD optimization; LLM inference performance
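A minimal NumPy sketch of the mixed-precision weight-quantization idea follows: "critical" weight rows are kept in FP16 while the rest are quantized to symmetric per-channel INT8 and dequantized at matmul time. The outlier criterion and the 4x-median threshold are hypothetical illustrations, not the paper's selection rule.

```python
import numpy as np

# A sketch only: FP16 for critical rows, symmetric per-channel INT8 for the rest.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512)).astype(np.float32)  # [out_features, in_features]
W[7] *= 20.0                                            # plant one outlier channel

absmax = np.abs(W).max(axis=1)
critical = absmax > 4.0 * np.median(absmax)             # hypothetical criterion

scale = absmax / 127.0                                  # one scale per output channel
W_int8 = np.clip(np.round(W / scale[:, None]), -127, 127).astype(np.int8)
W_fp16 = W[critical].astype(np.float16)

def matvec(x):
    y = (W_int8.astype(np.float32) * scale[:, None]) @ x  # dequantize INT8 rows
    y[critical] = W_fp16.astype(np.float32) @ x            # FP16 path for critical rows
    return y

x = rng.standard_normal(512).astype(np.float32)
print("max abs error vs FP32:", np.abs(matvec(x) - W @ x).max())
```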
4. Establishing high performance AI ecosystem on Sunway platform
Authors: Sha Liu, Jie Gao, Xin Liu, Zeqiang Huang, Tianyu Zheng. CCF Transactions on High Performance Computing, 2021, Issue 3, pp. 224-241 (18 pages).
Abstract: To meet the demand for large computing power when training complex deep neural networks (DNNs), we establish an AI ecosystem on the Sunway platform to utilize the Sunway series of high-performance computers (HPC). We provide a specially optimized accelerating library for DNN operators on Sunway, namely SWDNNv2, supporting both single-precision and half-precision. Based on this highly efficient library, we refactor the PyTorch framework to fit the Sunway platform by adopting hardware-specific acceleration and MPI backend support. A Python-interface-based lightweight framework named SWMind is also developed from scratch to provide higher performance for some domain models. Techniques for training large models are also discussed, including mixed-precision and hybrid parallelism. The toolkits in the AI ecosystem have been applied to actual projects, such as training a large-scale multi-modality model. We have managed to train a 1-billion-parameter model and achieve performance relatively close to the NVIDIA Tesla V100. The high efficiency of SWDNNv2 is demonstrated by the performance of the GEMM operator, which achieves 88.23% and 84.5% of the FP32 and FP16 theoretical peak FLOPS, respectively, on the SW many-core CPU. The evaluation also shows the scalability of the AI framework by training a ResNet-50 model, with a parallel efficiency of 91.51% when scaling to 1024 CPUs.
Keywords: AI ecosystem; Sunway; DNN; mixed-precision; hybrid parallelism
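For the mixed-precision training technique mentioned in the abstract, here is a minimal NumPy illustration of loss scaling, the standard trick for keeping small gradients representable in half precision. It is illustrative only and is not SWMind or SWDNNv2 code.

```python
import numpy as np

# A sketch only: small gradients underflow to zero in FP16; multiplying the loss
# by a scale factor before the backward pass keeps them representable, and the
# scale is divided out before the FP32 master-weight update.
true_grad = 1e-8
print(np.float16(true_grad))                 # 0.0 -> the update is silently lost in FP16

scale = 2.0 ** 14
scaled_grad = np.float16(true_grad * scale)  # survives in FP16 after loss scaling
unscaled = np.float32(scaled_grad) / scale   # unscale before the optimizer step
print(unscaled)                              # ~1e-8, usable for FP32 master weights
```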
5. XHYPRE: a reliable parallel numerical algorithm library for solving large-scale sparse linear equations
Authors: Chuanying Li, Stef Graillat, Zhe Quan, Tong-Xiang Gu, Hao Jiang, Kenli Li. CCF Transactions on High Performance Computing, 2023, Issue 2, pp. 191-209 (19 pages).
Abstract: With the rapid development of supercomputers, large-scale computing has become increasingly widespread in various scientific research and engineering fields. Meanwhile, the precision and efficiency of large-scale floating-point arithmetic have always been a research hotspot in high-performance computing. This paper studies numerical methods for solving large-scale sparse linear equations, in which the accumulation of rounding errors during the solution process leads to inaccurate results, and large-scale data gives the solver a long running time. To address these issues, we use error-free transformation technology and mixed-precision ideas to construct a reliable parallel numerical algorithm framework based on HYPRE, which solves large-scale sparse linear equations with improved accuracy and accelerated numerical calculations. Moreover, we illustrate the implementation details of our technique with two cases. In one, we use error-free transformation technology to design high-precision iterative algorithms, such as GMRES, PCG, and BICGSTAB, which reduce rounding errors in the calculation process and make the results more accurate. In the other, we propose a mixed-precision iterative algorithm that utilizes low-precision formats to achieve higher computing power and reduce computing time. Experimental results demonstrate that XHYPRE has higher reliability and effectiveness. Our XHYPRE is on average 1.3× faster than HYPRE and reduces the number of iterations to 87.1% on average.
Keywords: high-performance computing; rounding errors; error-free transformation technology; mixed-precision
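The error-free transformation idea the abstract builds on can be shown with Knuth's TwoSum: the rounding error of a single floating-point addition is itself a floating-point number and can be recovered exactly, so a compensated sum carries it along. This is the textbook technique, not XHYPRE's GMRES/PCG/BICGSTAB implementation.

```python
# A sketch only: Knuth's TwoSum and a compensated summation built on it.
def two_sum(a, b):
    s = a + b
    bv = s - a
    e = (a - (s - bv)) + (b - bv)  # exact: a + b == s + e
    return s, e

def compensated_sum(values):
    s, comp = 0.0, 0.0
    for v in values:
        s, e = two_sum(s, v)
        comp += e                  # accumulate the low-order bits lost by each add
    return s + comp

vals = [1e16, 1.0, -1e16] * 1000
print(sum(vals), compensated_sum(vals))  # naive: 0.0, compensated: 1000.0
```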
6. Interplay Bitwise Operation in Emerging MRAM for Efficient In-memory Computing
Authors: Hao Cai, Honglan Jiang, Yongliang Zhou, Menglin Han, Bo Liu. CCF Transactions on High Performance Computing, 2020, Issue 3, pp. 282-296 (15 pages).
Abstract: In order to realize highly efficient magnetization switching in magnetic tunnel junctions (MTJs), several potential mechanisms have been exploited as interplay effects in the MTJ device, such as the interaction between spin orbit torque and spin transfer torque (STT), and between voltage-controlled magnetic anisotropy (VCMA) and STT. These interplay mechanisms have been experimentally explored and show improved switching energy efficiency compared with the traditional STT method. Considering the requirement of mixed-precision memory, we propose a novel write-only in-memory computing paradigm based on interplay bitwise operations in two-terminal or three-terminal MRAM bit-cells, which aims to reduce the layout overhead of peripheral computing circuits as well as to eliminate read decision failures during in-memory computing. Specifically, the proposed write-only bitwise in-memory computing is demonstrated with OR, AND, XOR, and full adder operations. Four nonvolatile approximate full adders (AxFAs) are proposed and implemented in different MRAM bit-cells. The AxFAs can be easily reconfigured into memory units with simple connections. Image processing applications, including the FA and XOR operations, are used to demonstrate the in-memory computing. Compared with the traditional sensing-based approach, more than 80% energy reduction is obtained using the proposed interplay write-only in-memory computing with the approximation setup. A 61.4% energy reduction is achieved using XOR functions based on the VCMA interplay mechanism.
Keywords: MTJ; interplay writing; mixed-precision memory; in-memory computing; image processing
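A minimal Python sketch of the approximate-full-adder idea follows. The truth table used here (sum bit taken as the inverse of the carry, correct for 6 of the 8 input patterns) is a common approximation from the literature and is not necessarily one of the paper's four AxFA designs; carries stay exact, so errors do not cascade through the ripple adder.

```python
import random

# A sketch only: exact vs. approximate full adder, compared on 8-bit additions.
def exact_fa(a, b, cin):
    return a ^ b ^ cin, (a & b) | (a & cin) | (b & cin)

def approx_fa(a, b, cin):
    cout = (a & b) | (a & cin) | (b & cin)
    return 1 - cout, cout          # sum approximated as NOT(carry)

def ripple_add(x, y, fa, bits=8):
    carry, out = 0, 0
    for i in range(bits):
        s, carry = fa((x >> i) & 1, (y >> i) & 1, carry)
        out |= s << i
    return out | (carry << bits)

random.seed(0)
pairs = [(random.randrange(256), random.randrange(256)) for _ in range(1000)]
print("exact adder correct:", all(ripple_add(x, y, exact_fa) == x + y for x, y in pairs))
err = sum(abs(ripple_add(x, y, approx_fa) - (x + y)) for x, y in pairs) / len(pairs)
print("mean absolute error with the approximate adder:", err)
```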