

A Hardware Accelerator Design for Long Short-Term Memory Neural Networks Based on Frequency-Domain Inference Computation
Abstract: Long Short-Term Memory (LSTM) neural networks, as a type of Recurrent Neural Network (RNN), can effectively handle long-term dependencies in sequential data, thereby avoiding the gradient vanishing or exploding problems that traditional RNNs encounter on long sequences. By introducing input gates, forget gates, and output gates, LSTM networks can selectively retain and forget information, capturing long-term variations in data; this makes them widely applicable in fields such as time series prediction, natural language processing, and speech recognition. However, the unique gating mechanism and state update process of LSTMs result in high computational complexity and a large number of parameters. This not only requires substantial memory but also demands significant computational power for both training and inference, creating challenges for deploying these networks on resource-constrained edge devices. Exploring methods to compress LSTM models so as to reduce their storage and computational demands is therefore crucial for enabling edge computing with LSTM networks. Against this background, this paper proposes a solution that compresses network parameters and improves inference speed while keeping the accuracy loss within an acceptable range: a hardware accelerator design for long short-term memory neural networks based on inference computation in the frequency domain. The method
utilizes block-circulant matrices to compress and store the network's weight parameters, combined with the Fast Fourier Transform (FFT) and frequency-domain activation functions to carry out network inference in the frequency domain, thereby avoiding the frequent time-domain/frequency-domain switching overhead incurred when processing different time samples. The Coordinate Rotation Digital Computer (CORDIC) algorithm is employed to replace multiplication operations and transcendental function computations in the frequency domain, enabling low-power hardware deployment of the LSTM. The design first partitions the input data and applies the FFT, then performs element-wise multiplication and accumulation with the frequency-domain weight matrix to obtain the accumulated outputs of the four gates. These outputs are processed in parallel through frequency-domain activation functions to update the cell state and hidden state. In this way, the forward computation of the LSTM is divided into four main modules: the FFT/IFFT module, the multiplication module, the accumulation module, and the activation function module. The FFT/IFFT module is based on the rotation mode of the CORDIC algorithm in the circular coordinate system, using fixed rotation angles and shift-add operations in place of traditional butterfly computations. The multiplication module uses the rotation mode of the CORDIC algorithm in the linear coordinate system to perform element-wise multiplication, combined with a parallel prediction algorithm to accelerate computation. The accumulation module sums the results of each row block. The activation function module adopts a frequency-domain linear approximation in place of the traditional activation functions, allowing inference to be performed entirely in the frequency domain. The proposed hardware accelerator is prototyped on a PYNQ-Z2 development board. Experimental results on an open-source time series dataset demonstrate that the accelerator achieves an
average network inference latency of 63.6 μs with a power consumption of 1.743 W. Compared to time-domain LSTM inference, latency is reduced by 44.2% and power consumption by 6.4%. Additionally, the resource utilization of BRAM and FIFO is only 5% and 2%, respectively, representing reductions of 83% and 91.2% compared to time-domain LSTM inference.
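The core compression idea summarized above can be illustrated with a minimal NumPy sketch. Assuming each weight submatrix is circulant and defined by its first column, a matrix-vector product reduces to a circular convolution, which the FFT turns into an element-wise product in the frequency domain; only O(n) parameters per n-by-n block need to be stored instead of n². The function names below are illustrative, not from the paper, and this floating-point version stands in for the paper's fixed-point CORDIC hardware.

```python
import numpy as np

def circulant_matvec_fft(c, x):
    # y = C @ x, where C is the circulant matrix whose first column is c.
    # This equals the circular convolution of c and x, which the FFT
    # converts into an element-wise product in the frequency domain.
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def block_circulant_matvec(first_cols, x, block):
    # first_cols[i][j]: defining vector (first column) of circulant block (i, j).
    # x is the full input vector; 'block' is the circulant block size.
    p = len(first_cols)        # number of row blocks
    q = len(first_cols[0])     # number of column blocks
    y = np.zeros(p * block)
    for i in range(p):
        acc = np.zeros(block)  # accumulate the row block's partial products
        for j in range(q):
            acc += circulant_matvec_fft(first_cols[i][j],
                                        x[j * block:(j + 1) * block])
        y[i * block:(i + 1) * block] = acc
    return y
```

In a hardware realization such as the one described, the FFT of each weight block would be precomputed offline, so only the input FFT, element-wise multiplications, and accumulations run at inference time.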
Authors: 靳松 (JIN Song); 陈诗琪 (CHEN Shi-Qi) (Department of Electronic and Communication Engineering, North China Electric Power University, Baoding, Hebei 071003; Hebei Key Laboratory of Power Internet of Things Technology, North China Electric Power University, Baoding, Hebei 071003)
Source: Chinese Journal of Computers (《计算机学报》), a Peking University Core journal, 2025, No. 8, pp. 1781-1794 (14 pages)
Funding: Supported by the Hebei Provincial Science and Technology Program (Platform No. SZX2020034) and the Hebei Natural Science Foundation (F2021502006).
Keywords: long short-term memory neural networks; block-circulant matrix; coordinate rotation digital computer (CORDIC) algorithm; frequency-domain inference computation; fast Fourier transform
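The CORDIC algorithm listed among the keywords replaces hardware multipliers with shift-add iterations. A minimal sketch of its linear-coordinate rotation mode, which the abstract says the multiplication module uses, is given below in floating-point Python for clarity; an actual hardware design would use fixed-point arithmetic where the scaling by 2^-k is a bit shift. The function name and iteration count are illustrative assumptions.

```python
def cordic_multiply(a, b, iterations=32):
    # Linear-coordinate rotation mode of CORDIC: drive z from b toward 0
    # in steps of +/- 2^-k while accumulating y += d_k * a * 2^-k, so y
    # converges to a * b. Convergence requires |b| <= 2 (the sum of all
    # step sizes); each step needs only a shift and an add, no multiplier.
    y, z = 0.0, b
    for k in range(iterations):
        d = 1.0 if z >= 0 else -1.0   # rotation direction
        y += d * a * (2.0 ** -k)      # shift-add accumulation of the product
        z -= d * (2.0 ** -k)          # drive the residual toward zero
    return y
```

After 32 iterations the residual in z is below 2^-31, so the product is accurate to roughly |a| * 2^-31; operands outside [-2, 2] would need pre-scaling by a power of two.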