Abstract: Deep learning (DL) accelerators are critical for handling the growing computational demands of modern neural networks. Systolic array (SA)-based accelerators consist of a 2D mesh of processing elements (PEs) working cooperatively to accelerate matrix multiplication. The power efficiency of such accelerators is of primary importance, especially in the edge-AI regime. This work presents the SAPER-AI accelerator, an SA accelerator whose power intent is specified via a Unified Power Format representation in a simplified manner, requiring negligible microarchitectural optimization effort. The proposed accelerator switches off rows and columns of PEs in a coarse-grained manner, yielding an SA microarchitecture that adapts to the varying computational requirements of modern DL workloads. Our analysis demonstrates enhanced power efficiency ranging from 10% to 25% for the best-case 32×32 and 64×64 SA designs, respectively. Additionally, the power-delay product (PDP) exhibits a progressive improvement of around 6% for larger SA sizes. Moreover, a performance comparison between the MobileNet and ResNet50 models indicates generally better SA performance for the ResNet50 workload, owing to the more regular convolutions in ResNet50, which SAs favor, with the performance gap widening as the SA size increases.
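To make the coarse-grained gating idea concrete, the Python sketch below (purely illustrative, not the paper's UPF-based flow) estimates the fraction of full-array power that remains when only the PE rows and columns needed for a given matrix-multiplication tile stay powered; the linear power model and all names are assumptions.

```python
# Illustrative sketch (not the paper's actual UPF power-intent flow): estimate
# how many PE rows/columns a coarse-grained gating scheme can switch off for a
# given matrix-multiplication tile, and the fraction of full-array power that
# remains. The linear power model and all names here are assumptions.

def gated_array_power(sa_rows, sa_cols, tile_rows, tile_cols,
                      pe_dynamic=1.0, pe_leak=0.1):
    """Relative power of an sa_rows x sa_cols systolic array when only the
    PE rows/columns covering a tile_rows x tile_cols tile stay powered."""
    active_rows = min(sa_rows, tile_rows)
    active_cols = min(sa_cols, tile_cols)
    active = active_rows * active_cols              # PEs left powered on
    # Gated PEs are assumed to draw neither dynamic nor leakage power.
    power_on = active * (pe_dynamic + pe_leak)
    power_full = sa_rows * sa_cols * (pe_dynamic + pe_leak)
    return power_on / power_full                    # fraction of full-array power

# Example: a 64x64 array running a layer whose tile needs only 48 rows x 32 columns.
print(f"relative power: {gated_array_power(64, 64, 48, 32):.2f}")
```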
Abstract: Traditional von Neumann architectures suffer from severe energy and latency overheads due to intensive data movement between memory and processing units. In-Memory Computing (IMC) integrates computation within memory arrays, greatly mitigating this bottleneck. This paper provides a comprehensive review of IMC principles, implementations, and challenges across Static Random-Access Memory (SRAM), Dynamic Random-Access Memory (DRAM), and Non-Volatile Memories (NVMs) such as Resistive Random-Access Memory (RRAM), Magnetoresistive Random-Access Memory (MRAM), and Phase-Change Memory (PCM). We summarize architectural advances, device-level constraints, and system-level opportunities, with emphasis on the emerging class of resistive NVM-based IMC accelerators. Furthermore, we highlight engineering trade-offs, real-world application scenarios, and current industrial standardization efforts, offering guidance toward large-scale deployment of IMC technologies.
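As a rough illustration of the IMC principle surveyed above, the following Python sketch models an idealized resistive crossbar performing an analog matrix-vector multiply: weights are stored as cell conductances, inputs are applied as word-line voltages, and each bit-line current sums the products. Device non-idealities discussed in the paper are ignored, and all values here are assumptions.

```python
# Ideal (noise-free) crossbar model: bit-line current j = sum_i V_i * G_ij,
# i.e., Kirchhoff's current law turns the array readout into a dot product.
import numpy as np

def crossbar_mvm(conductance, voltages):
    """Ideal crossbar matrix-vector multiply: returns one current per bit line."""
    return voltages @ conductance

G = np.array([[1e-6, 2e-6],        # cell conductances in siemens, one column per output
              [3e-6, 4e-6]])
V = np.array([0.2, 0.1])           # word-line read voltages in volts
print(crossbar_mvm(G, V))          # bit-line currents in amperes
```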
Abstract: AI inference accelerators have drawn extensive attention, but none of the previous work performs a holistic and systematic benchmarking of them. First, an end-to-end AI inference pipeline consists of six stages spanning both the host and the accelerators, whereas previous work mainly evaluates hardware execution performance, which is only one stage on the accelerators. Second, there is a lack of systematic evaluation of different optimizations on AI inference accelerators. Using six representative AI workloads and a typical AI inference accelerator, Diannao, based on the Cambricon ISA, we implement five frequently used AI inference optimizations as user-configurable hyper-parameters. We explore the optimization space by sweeping the hyper-parameters and quantifying each optimization's effect on the chosen metrics. We also provide cross-platform comparisons between Diannao and traditional platforms (Intel CPUs and Nvidia GPUs). Our evaluation yields several new observations and insights, shedding light on a comprehensive understanding of AI inference accelerators' performance and informing the co-design of upper-level optimizations and the underlying hardware architecture.
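The hyper-parameter sweep described above can be pictured with a small harness like the Python sketch below; the knobs, metrics, and the run_inference stub are generic placeholders rather than the five optimizations actually benchmarked in the paper.

```python
# Hypothetical sweep harness: enumerate every combination of optimization
# settings and record the resulting metrics. The knobs below are placeholders,
# not the paper's actual five optimizations.
import itertools

SEARCH_SPACE = {
    "batch_size":   [1, 4, 16],
    "precision":    ["fp32", "fp16", "int8"],
    "layer_fusion": [False, True],
}

def run_inference(workload, config):
    """Stub for one benchmark run; a real harness would launch the workload
    on the accelerator with this config and measure the metrics here."""
    return {"latency_ms": 0.0, "throughput": 0.0}   # dummy metrics

def sweep(workload):
    keys = list(SEARCH_SPACE)
    results = []
    for values in itertools.product(*(SEARCH_SPACE[k] for k in keys)):
        config = dict(zip(keys, values))
        results.append((config, run_inference(workload, config)))
    return results

for config, metrics in sweep("resnet50"):
    print(config, metrics)
```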
Abstract: Artificial Narrow Intelligences (ANI) are rapidly becoming an integral part of everyday consumer technology. With products like ChatGPT, Midjourney, and Stable Diffusion gaining widespread popularity, the demand for local hosting of neural networks has significantly increased. However, the typical 'always-online' nature of these services presents several limitations, including dependence on reliable internet connections, privacy concerns, and ongoing operational costs. This essay explores potential hardware solutions for popularizing on-device inferencing of ANI on consumer hardware and speculates on the future of the industry.
Funding: Supported by the Waseda University Open Innovation Ecosystem Program for Pioneering Research (W-SPRING) under Grant Number JPMJSP2128.
Abstract: Large-scale neural networks have brought incredible changes to the world, reshaping people's lives and offering vast prospects. However, they also come with enormous demands for computational power and storage, and the core of their computational requirements lies in matrix multiplication units dominated by multiplication operations. To address this issue, we propose an area- and power-efficient multiplier-less processing element (PE) design. Prior to implementing the proposed PE, we apply a power-of-2 dictionary-based quantization to the model and confirm that this quantization method preserves the accuracy of the original model. In hardware design, we present a standard architecture of the PE and one 'bi-sign' variant. Our evaluation results demonstrate that a systolic array implementing our standard multiplier-less PE achieves approximately 38% lower power-delay product and a 13% smaller core area compared to a conventional multiply-and-accumulate PE, while the bi-sign PE design saves as much as 37% of the core area and 38% of the computation energy. Furthermore, the applied quantization reduces the model size and operand bit-width, leading to decreased on-chip memory usage and energy consumption for memory accesses. Additionally, the hardware schematic facilitates extension to support other sparsity-aware, energy-efficient techniques.
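A minimal sketch of the idea behind a multiplier-less PE follows, assuming a simple nearest-power-of-2 rounding rule (the paper's dictionary-based scheme may differ): weights are snapped to signed powers of two so that each multiplication in the MAC reduces to a bit shift.

```python
# Illustrative power-of-2 weight quantization and shift-based MAC. The rounding
# rule and integer activations are assumptions made for clarity.
import math

def quantize_pow2(w):
    """Snap a weight to the nearest signed power of two (zero stays zero)."""
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))        # nearest power-of-2 exponent
    return sign, exp

def shift_mac(acc, activation, sign, exp):
    """Multiplier-less MAC: multiplying by 2**exp becomes a bit shift."""
    prod = activation << exp if exp >= 0 else activation >> -exp
    return acc + sign * prod

# Example: weight 0.24 quantizes to +2**-2; with activation 8 the MAC adds +2.
sign, exp = quantize_pow2(0.24)
print(shift_mac(0, 8, sign, exp))
```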
Funding: Supported by the National Natural Science Foundation of China under Grant No. 61972415 and the Science Foundation of China University of Petroleum, Beijing under Grant Nos. 2462019YJRC004 and 2462020XKJS03.
Abstract: LU and Cholesky factorizations of dense matrices are among the most fundamental building blocks in numerous numerical applications. Because of their O(n^3) complexity, they may be the most time-consuming basic kernels in numerical linear algebra. For this reason, accelerating them on a variety of modern parallel processors has received much attention. In this paper, we implement LU and Cholesky factorizations on novel massively parallel artificial intelligence (AI) accelerators originally developed for deep neural network applications. We explore the data parallelism of the matrix factorizations and exploit the neural compute units and on-chip scratchpad memories of modern AI chips to accelerate them. The experimental results show that our various optimization methods bring performance improvements, providing up to 41.54 and 19.77 GFlop/s with single-precision data and 78.37 and 33.85 GFlop/s with half-precision data for LU and Cholesky factorizations, respectively, on a Cambricon AI accelerator.
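As a rough picture of why these factorizations map well onto matrix-multiply hardware, the NumPy sketch below shows a right-looking blocked Cholesky in which most of the floating-point work sits in the GEMM-like trailing-matrix update; the block size and the use of library solves are illustrative assumptions, not the paper's accelerator mapping.

```python
# Blocked (right-looking) Cholesky sketch: steps 1-2 are small serial work,
# while step 3 is a large GEMM-like update that matrix-multiply units handle well.
import numpy as np

def blocked_cholesky(a, nb=64):
    """In-place lower-triangular Cholesky of a symmetric positive-definite matrix."""
    n = a.shape[0]
    for k in range(0, n, nb):
        end = min(k + nb, n)
        # 1. Factor the diagonal block.
        a[k:end, k:end] = np.linalg.cholesky(a[k:end, k:end])
        if end < n:
            # 2. Triangular solve for the panel below the diagonal block.
            a[end:, k:end] = np.linalg.solve(
                a[k:end, k:end], a[end:, k:end].T).T
            # 3. GEMM-like trailing-matrix update: the bulk of the flops.
            a[end:, end:] -= a[end:, k:end] @ a[end:, k:end].T
    return np.tril(a)

# Example: factor a random SPD matrix and verify the result.
n = 256
m = np.random.randn(n, n)
spd = m @ m.T + n * np.eye(n)
L = blocked_cholesky(spd.copy(), nb=64)
print(np.allclose(L @ L.T, spd))    # True up to round-off
```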