Funding: supported by the National Natural Science Foundation of China (Grant No. 62402525) and the Fundamental Research Funds for the Central Universities (Grant No. 2462023YJRC023).
Abstract: With the support of more precision formats in emerging hardware architectures, mixed precision has become a popular approach to accelerating deep learning (DL) training. Applying low-precision formats such as FP16 and BF16 to neural operators saves GPU memory while improving bandwidth utilization. However, DL frameworks rely on fixed black and white lists as the default mixed-precision selection and cannot flexibly adapt to the variety of neural networks. In addition, existing work on automatic precision adjustment does not consider model convergence, and the decision cost of precision selection is high. To address these problems, this paper proposes CoMP, a non-intrusive framework for Convergence-aware operator-wise Mixed-Precision training. CoMP uses a two-stage precision adjustment based on epochs and batches to ensure convergence and performance, respectively. CoMP then performs subsequent training according to the searched optimal operator-wise mixed-precision plan. Experimental results on an A100 GPU show that CoMP achieves a maximum speedup of 1.15× over the PyTorch AMP implementation while saving up to 29.81% of GPU memory.
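The two-stage adjustment can be sketched as a search over per-operator precision choices. The following is an illustrative sketch only, not CoMP's actual algorithm: stage 1 (epoch granularity) rejects low-precision candidates that hurt convergence, and stage 2 (batch granularity) keeps only those that actually speed up execution. All operator names, timings, and the loss model here are hypothetical.

```python
# Hypothetical two-stage operator-wise precision search (not CoMP's
# real implementation): filter by convergence first, then by speed.

def search_precision_plan(operators, fp16_loss_delta, fp16_speedup,
                          loss_tolerance=0.01):
    """Return the set of operators to run in FP16.

    operators        -- list of operator names
    fp16_loss_delta  -- per-op increase in validation loss when cast to FP16
    fp16_speedup     -- per-op batch-time speedup factor in FP16
    """
    # Stage 1: convergence-aware filter (epoch granularity).
    convergent = [op for op in operators
                  if fp16_loss_delta[op] <= loss_tolerance]
    # Stage 2: performance filter (batch granularity).
    return {op for op in convergent if fp16_speedup[op] > 1.0}

ops = ["conv1", "attention", "layernorm", "softmax"]
loss_delta = {"conv1": 0.001, "attention": 0.002,
              "layernorm": 0.05, "softmax": 0.0}
speedup = {"conv1": 1.4, "attention": 1.2,
           "layernorm": 1.1, "softmax": 0.9}

# layernorm fails the convergence filter; softmax fails the speed filter.
print(sorted(search_precision_plan(ops, loss_delta, speedup)))
```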
Funding: funded by the Key Technologies Research and Development Program (No. 2020YFB1709500).
Abstract: In this paper, we propose and implement a mixed-precision Block-ISAI preconditioner for solving linear systems from multiphysics areas. By leveraging FP32 computing, our approach accelerates the sparse matrix-vector product kernel while maintaining satisfactory accuracy. Meanwhile, an efficient, warp-based GPU implementation of the Block-ISAI preconditioner with Tensor Core acceleration is proposed: the matrix-multiplication portion is accelerated using the double-precision Tensor Cores on the NVIDIA A100 GPU. Detailed comparisons showcase the effectiveness of our method, with noteworthy speedups: it is 6× faster than cuSPARSE and 11.2× faster than PETSc's built-in preconditioner.
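The core mixed-precision idea behind the SpMV acceleration, independent of Block-ISAI itself, is to store matrix entries in FP32 (halving memory traffic) while accumulating in FP64. A minimal behavioural sketch, with FP32 rounding emulated via `struct` since pure-Python floats are FP64; the matrix layout here is a toy CSR-like format, not the paper's warp-based GPU kernel:

```python
# Minimal mixed-precision SpMV sketch: FP32 storage, FP64 accumulation.
import struct

def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def spmv_mixed(rows, x):
    """Toy sparse matrix-vector product.

    rows -- list of rows, each a list of (column_index, value) pairs
    x    -- dense vector (FP64)
    """
    y = []
    for row in rows:
        acc = 0.0                      # FP64 accumulator
        for j, v in row:
            acc += to_fp32(v) * x[j]   # entries rounded to FP32
        y.append(acc)
    return y

A = [[(0, 2.0), (1, 1.0)],
     [(1, 3.0)]]
print(spmv_mixed(A, [1.0, 1.0]))  # → [3.0, 3.0]
```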
Funding: supported by the National Key Research and Development Program of China under Grant 2023YFB2806000, the Postdoctoral Fellowship Program of CPSF under Grant GZC20241305, and the Proof of Concept Foundation of Xidian University Hangzhou Institute of Technology under Grant GNYZ2024JC004.
Abstract: Large language models (LLMs) have exhibited remarkable performance across a broad spectrum of tasks, yet their extensive computational and memory requirements present substantial challenges for deployment in resource-constrained scenarios. To address these challenges, this work introduces software and hardware co-optimization strategies aimed at enhancing the inference performance of LLMs on ARM CPU-based platforms. A mixed-precision quantization technique is employed, preserving the precision of critical weights to maintain model accuracy while quantizing non-essential weights to INT8, thereby reducing the model's memory footprint. This work also capitalizes on the SIMD instruction set of ARM CPUs to efficiently process model data. Furthermore, the inference framework is optimized by fusing components of the attention computation and streamlining the dequantization process through modifications to the scaling factor. These enhancements significantly reduce model memory usage and improve throughput during the prefill and decode stages. The efficacy of the proposed approach is demonstrated by optimizing the Qwen-1.8B model on Armv9: with only a 0.66% decrease in accuracy, memory usage drops to 58.8% of the baseline, while inference performance increases by 4.09× and 15.23× over the baseline for the prefill and decode stages, respectively.
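The mixed-precision quantization idea can be sketched as follows: weights judged critical stay in full precision, the rest map to INT8 through a scaling factor and back on dequantization. This is a hedged illustration only; the magnitude-threshold criterion for "critical" used here is an assumption, and the paper's actual selection rule may differ.

```python
# Hypothetical mixed-precision INT8 quantization sketch. Weights with
# |w| >= critical_threshold are kept in full precision; the rest are
# quantized to signed INT8 with one per-tensor scaling factor.

def quantize_mixed(weights, critical_threshold=1.0):
    critical = {i: w for i, w in enumerate(weights)
                if abs(w) >= critical_threshold}
    rest = [w for i, w in enumerate(weights) if i not in critical]
    scale = max((abs(w) for w in rest), default=1.0) / 127.0
    q = {i: round(w / scale) for i, w in enumerate(weights)
         if i not in critical}
    return critical, q, scale

def dequantize_mixed(critical, q, scale, n):
    return [critical[i] if i in critical else q[i] * scale
            for i in range(n)]

w = [0.1, -0.05, 2.5, 0.02]
crit, q, s = quantize_mixed(w)
restored = dequantize_mixed(crit, q, s, len(w))
print(crit)  # only the large weight (index 2) stays in full precision
print(max(abs(a - b) for a, b in zip(w, restored)))  # small INT8 error
```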
Funding: partially supported by the PACMAN (Parallel Architecture and Compiler technology of Mobile, Accelerated, and Networked systems) Laboratory of Tsinghua University.
Abstract: To meet the demand for large computing power to train complex deep neural networks (DNNs), we establish an AI ecosystem on the Sunway platform to utilize the Sunway series of high-performance computers (HPC). We provide a specially optimized accelerating library for DNN operators on Sunway, namely SWDNNv2, supporting both single precision and half precision. Based on this highly efficient library, we refactor the PyTorch framework to fit the Sunway platform by adopting hardware-specific acceleration and MPI backend support. A Python-interface-based lightweight framework named SWMind is also developed from scratch to provide higher performance for some domain models. Techniques for training large models, including mixed precision and hybrid parallelism, are also discussed. The toolkits in the AI ecosystem have been applied to actual projects, such as training a large-scale multi-modality model: we have managed to train a 1-billion-parameter model and achieve performance relatively close to that of the NVIDIA Tesla V100. The high efficiency of SWDNNv2 is demonstrated by the performance of the GEMM operator, which achieves 88.23% and 84.5% of the FP32 and FP16 theoretical peak FLOPS on the SW many-core CPU. The evaluation also shows the scalability of the AI framework by training a ResNet-50 model, with a parallel efficiency of 91.51% when scaling to 1024 CPUs.
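Peak-efficiency figures like the 88.23% quoted above follow from the standard GEMM flop count of 2·M·N·K. A small helper showing the arithmetic; the matrix size, runtime, and peak rate below are made-up illustration values, not Sunway measurements:

```python
# Compute achieved GEMM FLOPS and fraction of theoretical peak.

def gemm_efficiency(m, n, k, seconds, peak_flops):
    """A GEMM of shape (m x k) @ (k x n) costs 2*m*n*k flops."""
    flops = 2.0 * m * n * k / seconds
    return flops, flops / peak_flops

# Hypothetical example: a 4096^3 GEMM in 15 ms on a 14 TFLOPS device.
achieved, frac = gemm_efficiency(4096, 4096, 4096, 0.015, 14e12)
print(f"{achieved / 1e12:.2f} TFLOPS, {frac:.1%} of peak")
```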
Funding: supported by the NuSCAP (ANR-20-CE48-0014) project of the French National Agency for Research (ANR), the 173 program (2020-JCJQ-ZD-029), and the Science Challenge Project (TZ2016002).
Abstract: With the rapid development of supercomputers, large-scale computing has become increasingly widespread in various scientific research and engineering fields. Meanwhile, the precision and efficiency of large-scale floating-point arithmetic have always been a research hotspot in high-performance computing. This paper studies numerical methods for solving large-scale sparse linear equations, where the accumulation of rounding errors during the solution process leads to inaccurate results and large-scale data gives the solver a long running time. To address these issues, we use error-free transformation technology and mixed-precision ideas to construct XHYPRE, a reliable parallel numerical algorithm framework based on HYPRE, which solves large-scale sparse linear equations with improved accuracy and accelerated numerical calculation. We illustrate the implementation details of our technique with two cases. In the first, we use error-free transformation technology to design high-precision iterative algorithms, such as GMRES, PCG, and BiCGSTAB, which reduce rounding errors in the calculation process and make the results more accurate. In the second, we propose a mixed-precision iterative algorithm that utilizes low-precision formats to achieve higher computing throughput and reduce computing time. Experimental results demonstrate that XHYPRE has higher reliability and effectiveness: it is on average 1.3× faster than HYPRE and reduces the number of iterations to 87.1% of HYPRE's on average.
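The "error-free transformation" the abstract refers to is typified by Knuth's classic TwoSum: a floating-point addition a + b is split into the rounded sum s and its exact rounding error e, so that s + e equals a + b exactly. Compensated summation built on TwoSum is the standard way such transformations feed into high-precision iterative solvers; a minimal sketch (not XHYPRE's code):

```python
# Error-free transformation: TwoSum and a compensated summation on top.

def two_sum(a, b):
    """Knuth's TwoSum: return (s, e) with s = fl(a + b) and a + b = s + e."""
    s = a + b
    bv = s - a          # the part of b actually absorbed into s
    av = s - bv         # the part of a actually absorbed into s
    return s, (a - av) + (b - bv)

def compensated_sum(xs):
    """Sum xs while carrying the exact accumulated rounding error."""
    s = 0.0
    err = 0.0
    for x in xs:
        s, e = two_sum(s, x)
        err += e
    return s + err

xs = [1e16, 1.0, -1e16]
print(sum(xs))              # naive summation loses the 1.0 to rounding
print(compensated_sum(xs))  # the error-free transformation recovers it
```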
Funding: funded by the National Key R&D Program of China under Grant 2018YFB2202800 and the National Natural Science Foundation of China under Grant 61904028.
Abstract: To realize highly efficient magnetization switching in magnetic tunnel junctions (MTJs), several potential mechanisms have been exploited as interplay effects in MTJ devices, such as the interaction between spin-orbit torque and spin-transfer torque (STT), and between voltage-controlled magnetic anisotropy (VCMA) and STT. These interplay mechanisms have been experimentally explored with improved switching energy efficiency compared with the traditional STT method. Considering the requirements of mixed-precision memory, we propose a novel write-only in-memory computing paradigm based on interplay bitwise operations in two-terminal or three-terminal MRAM bit-cells, which aims to reduce the layout overhead of peripheral computing circuits as well as to eliminate read-decision failures during in-memory computing. Specifically, the proposed write-only bitwise in-memory computing is demonstrated with OR, AND, XOR, and full-adder operations. Four nonvolatile approximate full adders (AxFAs) are proposed and implemented in different MRAM bit-cells. The AxFAs can be easily reconfigured into memory units with simple connections. Image processing applications, including full-adder and XOR operations, are used to demonstrate the in-memory computing. Compared with the traditional sensing-based approach, more than 80% energy reduction is obtained using the proposed interplay write-only in-memory computing with the approximation setup, and a 61.4% energy reduction is achieved using the VCMA-interaction-based XOR functions.
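The bitwise operations listed above compose into a full adder: sum = a XOR b XOR cin, carry-out = majority(a, b, cin). Below is a purely behavioural model (no MRAM physics) of the exact full adder next to one common approximation that drops the XOR with the carry-in, illustrating the accuracy/energy trade-off that AxFAs exploit; this particular approximation is a generic textbook one, not necessarily any of the paper's four designs.

```python
# Behavioural full adder vs. a generic approximate full adder.

def full_adder(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)   # majority function
    return s, cout

def approx_full_adder(a, b, cin):
    # Approximation: the sum bit ignores cin, so it is wrong
    # whenever cin = 1 (4 of the 8 input patterns).
    return a ^ b, (a & b) | (a & cin) | (b & cin)

errors = sum(full_adder(a, b, c)[0] != approx_full_adder(a, b, c)[0]
             for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(errors)  # → 4
```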