

A Hardware Accelerator Design for Long Short-Term Memory Neural Networks Based on Frequency-Domain Inference Computation
Abstract: Long Short-Term Memory (LSTM) neural networks, as a type of Recurrent Neural Network (RNN), can effectively handle long-term dependencies in sequential data, thereby avoiding the gradient vanishing or exploding problems that traditional RNNs encounter on long sequences. By introducing input gates, forget gates, and output gates, LSTM networks can selectively retain and forget information, capturing long-term variations in data; this makes them widely applicable in fields such as time series prediction, natural language processing, and speech recognition. However, the unique gating mechanism and state update process of LSTMs result in high computational complexity and a large number of parameters. This not only requires substantial memory but also demands significant computational power for both training and inference, creating challenges for deploying these networks on resource-constrained edge devices. Exploring methods to compress LSTM models so as to reduce their storage and computational demands is therefore crucial for enabling edge computing with LSTM networks. Against this background, this paper proposes a solution that compresses network parameters and improves inference speed while keeping the accuracy loss within an acceptable range: a hardware accelerator design for long short-term memory neural networks based on inference computation in the frequency domain. The method
utilizes block-circulant matrices to compress and store the network's weight parameters, combined with the Fast Fourier Transform (FFT) and frequency-domain activation functions to carry out network inference in the frequency domain, thereby avoiding the frequent time-domain/frequency-domain switching overhead incurred when processing different time samples. The Coordinate Rotation Digital Computer (CORDIC) algorithm is employed to replace multiplication operations and transcendental function computations in the frequency domain, enabling low-power hardware deployment of the LSTM. The design first partitions the input data and applies the FFT, then performs element-wise multiplication and accumulation with the frequency-domain weight matrix to obtain the accumulated outputs of the four gates. These outputs are processed in parallel through frequency-domain activation functions to update the cell state and hidden state. In this way, the forward computation of the LSTM is divided into four main modules: the FFT/IFFT module, the multiplication module, the accumulation module, and the activation function module. The FFT/IFFT module is based on the rotation mode of the CORDIC algorithm in the circular coordinate system, using fixed rotation angles and shift-add operations in place of traditional butterfly computations. The multiplication module uses the rotation mode of the CORDIC algorithm in the linear coordinate system to perform element-wise multiplication, combined with a parallel prediction algorithm to accelerate computation. The accumulation module sums the results of each row block. The activation function module adopts a frequency-domain linear approximation in place of the traditional activation functions, allowing inference to be performed entirely in the frequency domain. The proposed hardware accelerator is prototyped on a PYNQ-Z2 development board. Experimental results on an open-source time series dataset demonstrate that the accelerator achieves an
average network inference latency of 63.6 μs with a power consumption of 1.743 W. Compared to time-domain LSTM inference, latency is reduced by 44.2% and power consumption by 6.4%. Additionally, the resource utilization of BRAM and FIFO is only 5% and 2%, respectively, representing reductions of 83% and 91.2% compared to time-domain LSTM inference.
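The core compression idea summarized above can be illustrated with a minimal NumPy sketch. Assuming each weight submatrix is circulant and defined by its first column, a matrix-vector product reduces to a circular convolution, which the FFT turns into an element-wise product in the frequency domain; only O(n) parameters per n-by-n block need to be stored instead of n². The function names below are illustrative, not from the paper, and this floating-point version stands in for the paper's fixed-point CORDIC hardware.

```python
import numpy as np

def circulant_matvec_fft(c, x):
    # y = C @ x, where C is the circulant matrix whose first column is c.
    # This equals the circular convolution of c and x, which the FFT
    # converts into an element-wise product in the frequency domain.
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def block_circulant_matvec(first_cols, x, block):
    # first_cols[i][j]: defining vector (first column) of circulant block (i, j).
    # x is the full input vector; 'block' is the circulant block size.
    p = len(first_cols)        # number of row blocks
    q = len(first_cols[0])     # number of column blocks
    y = np.zeros(p * block)
    for i in range(p):
        acc = np.zeros(block)  # accumulate the row block's partial products
        for j in range(q):
            acc += circulant_matvec_fft(first_cols[i][j],
                                        x[j * block:(j + 1) * block])
        y[i * block:(i + 1) * block] = acc
    return y
```

In a hardware realization such as the one described, the FFT of each weight block would be precomputed offline, so only the input FFT, element-wise multiplications, and accumulations run at inference time.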
Authors: 靳松 (JIN Song); 陈诗琪 (CHEN Shi-Qi) (Department of Electronic and Communication Engineering, North China Electric Power University, Baoding, Hebei 071003; Hebei Key Laboratory of Power Internet of Things Technology, North China Electric Power University, Baoding, Hebei 071003)
Source: Chinese Journal of Computers (《计算机学报》), a Peking University Core journal, 2025, No. 8, pp. 1781-1794 (14 pages)
Funding: Supported by the Hebei Provincial Science and Technology Program (Platform No. SZX2020034) and the Hebei Natural Science Foundation (F2021502006).
Keywords: long short-term memory neural networks; block-circulant matrix; coordinate rotation digital computer (CORDIC) algorithm; frequency-domain inference computation; fast Fourier transform
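The CORDIC algorithm listed among the keywords replaces hardware multipliers with shift-add iterations. A minimal sketch of its linear-coordinate rotation mode, which the abstract says the multiplication module uses, is given below in floating-point Python for clarity; an actual hardware design would use fixed-point arithmetic where the scaling by 2^-k is a bit shift. The function name and iteration count are illustrative assumptions.

```python
def cordic_multiply(a, b, iterations=32):
    # Linear-coordinate rotation mode of CORDIC: drive z from b toward 0
    # in steps of +/- 2^-k while accumulating y += d_k * a * 2^-k, so y
    # converges to a * b. Convergence requires |b| <= 2 (the sum of all
    # step sizes); each step needs only a shift and an add, no multiplier.
    y, z = 0.0, b
    for k in range(iterations):
        d = 1.0 if z >= 0 else -1.0   # rotation direction
        y += d * a * (2.0 ** -k)      # shift-add accumulation of the product
        z -= d * (2.0 ** -k)          # drive the residual toward zero
    return y
```

After 32 iterations the residual in z is below 2^-31, so the product is accurate to roughly |a| * 2^-31; operands outside [-2, 2] would need pre-scaling by a power of two.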