Funding: Supported by the National Key Research and Development Program of China (Nos. 2017YFA0700902 and 2017YFB1003101), the 973 Program of China (No. 2015CB358800), and the National Science and Technology Major Project (No. 2018ZX01031102).
Abstract: Deep learning accelerators (DLAs) have proven to be efficient computational devices for processing deep learning algorithms, and various DLA architectures have been proposed and applied to different applications and tasks. However, for most DLAs, the programming interface is either difficult to use or not efficient enough. Most DLAs require programmers to write instructions directly, which is time-consuming and error-prone. The other prevailing programming interface for DLAs consists of high-performance libraries and deep learning frameworks, which are easy to use and friendly to users, but their high abstraction level limits control over the hardware resources and thus compromises the efficiency of the accelerator. This paper presents a design of a programming interface for DLAs. Various existing DLAs and their programming methods are first analyzed, and a methodology for designing programming interfaces for DLAs is proposed: a high-level assembly language (called DLA-AL), together with an assembler and a runtime for DLAs. DLA-AL is composed of a low-level assembly language and a set of high-level blocks. It allows experienced experts to fully exploit the potential of DLAs and achieve near-optimal performance, while end users with little knowledge of the hardware can develop deep learning algorithms on DLAs with minimal programming effort.
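The abstract does not show DLA-AL's actual syntax, so the following is only a minimal sketch of the layered design it describes: high-level blocks that expand into instructions of a low-level assembly language through an assembler. The block name, instruction mnemonics, and Assembler API are all invented for illustration.

```python
# Hypothetical illustration only: the abstract does not specify DLA-AL's syntax,
# so the block name, instruction mnemonics, and Assembler API below are invented.

class Assembler:
    """Collects the low-level instructions emitted by high-level blocks."""
    def __init__(self):
        self.instructions = []

    def emit(self, mnemonic, *operands):
        self.instructions.append((mnemonic, operands))

def conv2d_block(asm, src, weights, dst, tile=16):
    """A high-level block: expands one logical convolution into explicit
    load / compute / store instructions of a low-level ISA."""
    asm.emit("LOAD", src, "sram0", tile)       # stage an input tile on-chip
    asm.emit("LOAD", weights, "sram1", tile)   # stage a weight tile on-chip
    asm.emit("CONV", "sram0", "sram1", "acc")  # drive the MAC array
    asm.emit("STORE", "acc", dst)              # write results back to DRAM

asm = Assembler()
conv2d_block(asm, src="dram:input", weights="dram:w0", dst="dram:out")
for inst in asm.instructions:
    print(inst)
```

The point of the two-layer split is that an expert can bypass the blocks and write the emitted instructions by hand, while an end user only composes blocks.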
Funding: This work was supported by the National Key Research and Development Program of China under Grant No. 2017YFB1003103, the Natural Science Research Foundation of Jilin Province of China under Grant No. 20190201193JC, and the Fundamental Research Funds for the Central Universities, JLU.
Abstract: A wide variety of intelligence accelerators with promising performance and energy efficiency have been deployed in a broad range of applications such as computer vision and speech recognition. However, programming productivity hinders the deployment of deep learning accelerators. The low-level library invoked by the high-level deep learning framework, which supports end-to-end execution of a given model, is designed to reduce the programming burden on intelligence accelerators. Unfortunately, it is inflexible: developers must build a network model for every deep learning application, which often leads to unnecessary, repetitive implementation. In this paper, FlexPDA, a flexible and efficient programming framework for deep learning accelerators, is proposed. It provides more optimization opportunities than the low-level library and enables quick porting of applications to intelligence accelerators for fast upgrades. We evaluate FlexPDA using 10 representative operators selected from deep learning algorithms as well as an end-to-end network. The experimental results validate the effectiveness of FlexPDA, which achieves an end-to-end performance improvement of 1.620x over the low-level library.
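FlexPDA's API is not described in the abstract, so the sketch below only illustrates, under assumed names and in plain NumPy, the kind of operator-level flexibility such a framework targets: programming and fusing individual operators rather than invoking fixed library kernels one by one. The functions conv2d, relu, and fused_conv_relu are hypothetical stand-ins, not FlexPDA's interface.

```python
# Hypothetical sketch: FlexPDA's real API is not given in the abstract.
# It only illustrates operator-level programming and fusion in NumPy.

import numpy as np

def conv2d(x, w):
    """Reference 1x1 convolution (NCHW layout, stride 1) as a contraction."""
    return np.einsum("nchw,kc->nkhw", x, w)

def relu(x):
    return np.maximum(x, 0.0)

def fused_conv_relu(x, w):
    """Fused operator: one pass over the data instead of two kernel calls.
    This is the kind of optimization opportunity a flexible framework can
    expose that a sequence of fixed low-level library calls may not."""
    return np.maximum(np.einsum("nchw,kc->nkhw", x, w), 0.0)

x = np.random.randn(1, 8, 16, 16).astype(np.float32)
w = np.random.randn(4, 8).astype(np.float32)
assert np.allclose(relu(conv2d(x, w)), fused_conv_relu(x, w), atol=1e-5)
```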
Abstract: The increasing complexity of neural network applications has led to a demand for higher computational parallelism and more efficient synchronization in artificial intelligence (AI) chips. To achieve higher performance and lower power, a comprehensive and efficient approach is required to compile neural networks onto dedicated hardware. Our first-generation deep learning accelerator, the tensor computing unit, is presented together with its hardware and software solutions. It offers dedicated very long instruction word (VLIW) instructions and multi-level repeatable direct memory access (DMA). The former lowers the instruction bandwidth requirement and makes it easier to parallelize index and vector computations; the latter reduces the communication latency between the compute core and the asynchronous DMA and greatly alleviates programming complexity. For operator implementation and optimization, the compiler-based data-flow generator and the instruction macro generator first produce a set of parameterized operators. Then the tuner-configuration generator prunes the search space, and the distributed tuner framework selects the best data-flow pattern and the corresponding parameters. Our tensor computing unit supports all convolution parameters with full-shape dimensions. It can readily select proper operators to achieve 96% of the chip's peak performance for certain shapes and find the best-performing implementation within a limited power budget. An evaluation of a large number of convolution shapes on our tensor computing unit chip shows that the generated operators significantly outperform the handwritten ones, achieving 9% higher normalized performance than CUDA according to the silicon data.
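The tuning flow described above (generate parameterized candidates, prune the configuration space, then measure the survivors and keep the best data-flow pattern) can be sketched as below. The tiling knobs, the SRAM-capacity pruning rule, and the measure() stub are assumptions for illustration; the actual tuner-configuration generator and distributed tuner framework are not specified in the abstract.

```python
# Hypothetical sketch of the prune-then-measure tuning loop; the parameter
# space, pruning rule, and cost measurement below are invented placeholders.

import itertools
import random

def candidate_configs(h, w, c, sram_kb=512):
    """Enumerate tile sizes and prune candidates whose working set exceeds
    on-chip SRAM (the role of the tuner-configuration generator)."""
    for th, tw, tc in itertools.product([7, 14, 28, 56], [7, 14, 28, 56], [16, 32, 64]):
        if th <= h and tw <= w and tc <= c and th * tw * tc * 2 <= sram_kb * 1024:
            yield {"tile_h": th, "tile_w": tw, "tile_c": tc}

def measure(config):
    """Stand-in for running the generated operator on hardware and timing it."""
    return random.random()  # replace with a real latency measurement

def tune(h=56, w=56, c=64):
    """Pick the lowest-latency configuration (the role of the tuner framework)."""
    return min(candidate_configs(h, w, c), key=measure)

print(tune())
```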
Funding: Supported in part by the National Key Research and Development Program of China under Grant 2022YFB4400900 and in part by the Natural Science Foundation of China under Grant 62371223.
Abstract: Deep learning has recently gained significant prominence in real-world applications such as image recognition, natural language processing, and autonomous vehicles. While deep neural networks have different architectures, the main operations within these models are matrix-vector multiplications (MVMs). Compute-in-memory (CIM) architectures are promising solutions for accelerating these massive MVM operations because they alleviate the frequent data movement of traditional processors. Analog CIM macros leverage current-accumulating or charge-sharing mechanisms to perform multiply-and-accumulate (MAC) computations. Although they can achieve high throughput and efficiency, computing accuracy is sacrificed due to analog nonidealities. To ensure precise MAC calculations, it is crucial to analyze the sources of these nonidealities and identify their impacts, along with corresponding solutions. In this paper, a comprehensive linearity analysis and dedicated calibration methods for charge-domain static random-access memory (SRAM) based in-memory computing circuits are proposed. Based on the mechanism of charge-domain computing, we analyze nonidealities from three sources: the charge-injection effect, temperature variations, and ADC reference-voltage mismatch. By designing a 256×256 CIM macro and investigating it via post-layout simulation, we conclude that these nonidealities do not degrade computing linearity but only cause scaling and bias drift. To mitigate the identified scaling and bias drift, we propose three calibration methods ranging from the circuit level to the algorithm level, all of which show promising results. The comprehensive analysis and calibration methods can assist in designing CIM macros with more accurate MAC computations, thereby supporting more robust deep learning inference.
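Because the analysis finds that the nonidealities preserve linearity and only introduce scaling and bias drift, an algorithm-level calibration can be sketched as a simple linear fit: measure the macro's response to known inputs, fit y_measured ≈ a·y_ideal + b, and invert the fit at inference time. The toy drift values, calibration data, and fitting procedure below are assumptions for illustration, not the paper's circuits or methods.

```python
# Hypothetical sketch of algorithm-level calibration for a linear CIM macro,
# assuming the measured MAC output is  y_measured ≈ scale * y_ideal + bias.
# Drift values, weight/input statistics, and the fit are invented placeholders.

import numpy as np

rng = np.random.default_rng(0)

def cim_mac(x, w, scale=0.93, bias=4.2, noise=0.3):
    """Toy model of analog CIM columns: the ideal dot products distorted by
    gain (scaling) and offset (bias) drift plus a small read noise."""
    ideal = w @ x
    return scale * ideal + bias + rng.normal(0.0, noise, size=ideal.shape)

# Calibration: apply known inputs, fit (a, b) by least squares, then undo the
# linear distortion before handing results back to the network.
w = rng.integers(-1, 2, size=(16, 256)).astype(float)       # toy ternary weights
cal_inputs = rng.integers(0, 2, size=(256, 32)).astype(float)  # known test vectors
ideal = w @ cal_inputs
measured = cim_mac(cal_inputs, w)

a, b = np.polyfit(ideal.ravel(), measured.ravel(), deg=1)
corrected = (measured - b) / a

print("fitted scale/bias:", round(a, 3), round(b, 3))
print("max error after calibration:", np.abs(corrected - ideal).max())
```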