On-device Artificial Intelligence (AI) accelerators capable of not only inference but also training neural network models are in increasing demand in industrial AI, where frequent retraining is crucial because production conditions change often. Batch normalization (BN) is fundamental to training convolutional neural networks (CNNs), but its implementation in compact accelerator chips remains challenging due to computational complexity, particularly in calculating statistical parameters and gradients across mini-batches. Existing accelerator architectures either compromise the training accuracy of CNNs through approximations or require substantial computational resources, limiting their practical deployment. We present a hardware-optimized BN accelerator that maintains training accuracy while significantly reducing computational overhead through three novel techniques: (1) resource sharing for efficient resource utilization across forward and backward passes, (2) interleaved buffering for reduced dynamic random-access memory (DRAM) access latencies, and (3) zero-skipping for minimal gradient computation. Implemented on a VCU118 Field-Programmable Gate Array (FPGA) at 100 MHz and validated using You Only Look Once version 2-tiny (YOLOv2-tiny) on the PASCAL Visual Object Classes (VOC) dataset, our normalization accelerator achieves a 72% reduction in processing time and 83% lower power consumption compared with a 2.4 GHz Intel Central Processing Unit (CPU) software normalization implementation, while maintaining accuracy (0.51% mean Average Precision (mAP) drop at 32-bit floating point (FP32), 1.35% at 16-bit brain floating point (bfloat16)). When integrated into a neural processing unit (NPU), the design demonstrates 63% and 97% performance improvements over AMD CPU and Reduced Instruction Set Computer-V (RISC-V) implementations, respectively. These results demonstrate that standard batch normalization can be implemented efficiently in hardware without sacrificing accuracy, enabling practical, power-efficient on-device CNN training with significantly reduced computational requirements.
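To make the computation concrete, the following is a minimal NumPy sketch of the standard per-channel BN forward and backward passes that such an accelerator must realize; it is an illustrative reference, not the paper's hardware design. The function names, the (N, C) layout, the eps value, and the explicit zero-gradient mask (which only hints at the zero-skipping idea for upstream gradients that arrive already zeroed) are assumptions for illustration.

```python
import numpy as np

# Minimal reference sketch (not the paper's hardware design) of the standard
# mini-batch BN forward and backward passes. Shapes, names, and eps are
# illustrative; the zero-gradient mask only hints at the zero-skipping idea.

def bn_forward(x, gamma, beta, eps=1e-5):
    """x: (N, C) mini-batch; gamma, beta: (C,) learnable scale and shift."""
    mu = x.mean(axis=0)                       # per-channel mini-batch mean
    var = x.var(axis=0)                       # per-channel mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalized activations
    y = gamma * x_hat + beta                  # scale and shift
    return y, (x_hat, var, gamma, eps)

def bn_backward(dy, cache):
    """dy: upstream gradient with the same shape as x."""
    x_hat, var, gamma, eps = cache
    nonzero = dy != 0                         # zero-skipping: zero dy elements
                                              # add nothing to the reductions
    dgamma = np.where(nonzero, dy * x_hat, 0.0).sum(axis=0)
    dbeta = np.where(nonzero, dy, 0.0).sum(axis=0)
    dx_hat = dy * gamma
    dx = (dx_hat - dx_hat.mean(axis=0)
          - x_hat * (dx_hat * x_hat).mean(axis=0)) / np.sqrt(var + eps)
    return dx, dgamma, dbeta
```

Here dgamma and dbeta are the per-channel parameter gradients accumulated over the mini-batch; in hardware, skipping the multiply-accumulates for zero upstream-gradient entries is what reduces gradient-computation work.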
To address the low accuracy and the tendency to overfit during training that conventional neural networks exhibit when classifying images on the CIFAR-10 (Canadian Institute For Advanced Research) dataset, a new network construction method combining convolutional neural networks with batch normalization is proposed. The method first applies data augmentation and border padding to the dataset, then modifies the typical CNN (Convolutional Neural Networks) structure by removing the pooling layers from the convolutional layer groups, retaining only the convolutional and BN (Batch Normalization) layers, and moderately increasing the number of convolutional groups. To verify the model's effectiveness and accuracy, six different network structures were designed and trained. Experimental results show that, for the same number of training epochs, the recommended model-6 performs best, reaching a test accuracy of 90.17%, breaking the long-standing difficulty of classic CNNs in reaching 90% accuracy on CIFAR-10 and providing a new solution and model reference for image classification.
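As a rough illustration of the described structure (convolutional groups containing only Conv and BN layers, with pooling removed and border padding applied), here is a hedged PyTorch sketch; the channel widths, ReLU activations, number of groups, and classifier head are assumptions for illustration, not the paper's exact model-6 configuration.

```python
import torch.nn as nn

# Hedged sketch of a convolutional group containing only Conv and BN layers
# (pooling removed), as the abstract describes. Channel widths, the ReLU
# activation, group count, and the classifier head are illustrative
# assumptions, not the paper's exact model-6 configuration.

def conv_bn_group(in_ch, out_ch, n_convs=2):
    layers = []
    for i in range(n_convs):
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                      kernel_size=3, padding=1, bias=False),  # border padding
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),  # activation choice is an assumption
        ]
    return nn.Sequential(*layers)

# Stacking several such groups for 32x32 CIFAR-10 inputs (depth assumed):
model = nn.Sequential(
    conv_bn_group(3, 64),
    conv_bn_group(64, 128),
    conv_bn_group(128, 256),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(256, 10),  # 10 CIFAR-10 classes
)
```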
Funding: supported by the National Research Foundation of Korea (NRF) grant for RLRC funded by the Korea government (MSIT) (No. 2022R1A5A8026986, RLRC); by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-01304, Development of Self-Learnable Mobile Recursive Neural Network Processor Technology); by the MSIT (Ministry of Science and ICT), Republic of Korea, under the Grand Information Technology Research Center support program (IITP-2024-2020-0-01462, Grand-ICT) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation); by the Korea Technology and Information Promotion Agency for SMEs (TIPA); and by the Korean government (Ministry of SMEs and Startups) Smart Manufacturing Innovation R&D program (RS-2024-00434259).