Survey of FPGA based recurrent neural network accelerator

doi:10.11959/j.issn.2096-109x.2019034

Abstract

Abstract:

Recurrent neural networks (RNNs) have been widely used in machine learning in recent years. On sequence learning tasks in particular, they outperform other neural networks such as CNNs. However, RNNs and their variants, such as LSTM and GRU, are fully connected networks with high computational and storage complexity, which makes inference slow and makes them hard to deploy in products. On the one hand, traditional computing platforms such as CPUs are not well suited to the large-scale matrix operations of RNNs. On the other hand, the shared memory and global memory of the GPU, a hardware acceleration platform, give GPU-based RNN accelerators relatively high power consumption. Because FPGAs offer parallel computing and low power consumption, they have increasingly been used in recent years as the hardware platform for RNN accelerators. This paper surveys recent FPGA-based RNN accelerators, summarizes the data optimization algorithms and hardware architecture design techniques they use, and proposes directions for future research.

Key words: recurrent neural network, FPGA, accelerator

CLC number:

TP391.1

Chen GAO, Fan ZHANG. Survey of FPGA based recurrent neural network accelerator[J]. Chinese Journal of Network and Information Security, 2019, 5(4): 1-13.

Figures/Tables (10): Figure 1, Figure 2, Figure 3, Table 1, Figure 4, Figure 5, Figure 6, Figure 7, Table 2, Figure 8

References (47)

[1]	HAO Y, QUIGLEY S. The implementation of a deep recurrent neural network language model on a Xilinx FPGA[J]. arXiv preprint arXiv:1710.10296, 2017.
[2]	SAK H, SENIOR A, BEAUFAYS F. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition[J]. arXiv preprint arXiv:1402.1128, 2014.
[3]	MIKOLOV T, KARAFIAT M, BURGET L, et al. Recurrent neural network based language model[C]// Eleventh Annual Conference of the International Speech Communication Association. 2010.
[4]	CHO K, VAN MERRIENBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[J]. arXiv preprint arXiv:1406.1078, 2014.
[5]	GRAVES A, MOHAMED A, HINTON G. Speech recognition with deep recurrent neural networks[C]// 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2013: 6645-6649.
[6]	BYEON W, BREUEL T M, RAUE F, et al. Scene labeling with LSTM recurrent neural networks[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 3547-3555.
[7]	ZHANG Y, WANG C, GONG L, et al. A power-efficient accelerator based on FPGA for LSTM network[C]// 2017 IEEE International Conference on Cluster Computing (CLUSTER). 2017: 629-630.
[8]	GUO K, ZENG S, YU J, et al. A survey of FPGA-based neural network accelerator[J]. arXiv preprint arXiv:1712.08934, 2017.
[9]	HWANG K, SUNG W. Single stream parallelization of generalized LSTM-like RNNs on a GPU[J]. arXiv preprint arXiv:1503.02852, 2015.
[10]	ABADI M, AGARWAL A, BARHAM P, et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems[J]. arXiv preprint arXiv:1603.04467, 2016.
[11]	OUYANG P, YIN S, WEI S. A fast and power efficient architecture to parallelize LSTM based RNN for cognitive intelligence applications[C]// The 54th Annual Design Automation Conference 2017. ACM, 2017: 63.
[12]	NURVITADHI E, SIM J, SHEFFIELD D, et al. Accelerating recurrent neural networks in analytics servers: comparison of FPGA, CPU, GPU, and ASIC[C]// 2016 26th International Conference on Field Programmable Logic and Applications (FPL). 2016: 1-4.
[13]	HOPFIELD J J. Neural networks and physical systems with emergent collective computational abilities[J]. Proceedings of the National Academy of Sciences, 1982, 79(8): 2554-2558.
[14]	HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[15]	CHUNG J, GULCEHRE C, CHO K H, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling[J]. arXiv preprint arXiv:1412.3555, 2014.
[16]	ZAREMBA W, SUTSKEVER I, VINYALS O. Recurrent neural network regularization[J]. arXiv preprint arXiv:1409.2329, 2014.
[17]	RYBALKIN V, PAPPALARDO A, GHAFFAR M M, et al. FINN-L: library extensions and design trade-off analysis for variable precision LSTM networks on FPGAs[J]. arXiv preprint arXiv:1807.04093, 2018.
[18]	RYBALKIN V, WEHN N, YOUSEFI M R, et al. Hardware architecture of bidirectional long short-term memory neural network for optical character recognition[C]// The Conference on Design, Automation & Test in Europe. European Design and Automation Association, 2017: 1394-1399.
[19]	GUAN Y, LIANG H, XU N, et al. FP-DNN: an automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates[C]// 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 2017: 152-159.
[20]	LI S, LI W, COOK C, et al. Independently recurrent neural network (IndRNN): building a longer and deeper RNN[C]// IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5457-5466.
[21]	HAJDUK Z. Reconfigurable FPGA implementation of neural networks[J]. Neurocomputing, 2018, 308: 227-234.
[22]	LIU B, DONG W, XU T, et al. E-ERA: an energy-efficient reconfigurable architecture for RNN using dynamically adaptive approximate computing[J]. IEICE Electronics Express, 2017, 14(15): 20170637.
[23]	SONG X, ZHOU F, CHEN Y W, et al. Design of real time double precision floating point matrix multiplier based on FPGA[J]. Journal of Zhejiang University (Engineering Science), 2008, 42(9): 1611-1615.
[24]	GUAN Y, YUAN Z, SUN G, et al. FPGA-based accelerator for long short-term memory recurrent neural networks[C]// IEEE Design Automation Conference (ASP-DAC). 2017: 629-634.
[25]	CHANG A X M, CULURCIELLO E. Hardware accelerators for recurrent neural networks on FPGA[C]// 2017 IEEE International Symposium on Circuits and Systems (ISCAS). 2017: 1-4.
[26]	CHANG A X M, MARTINI B, CULURCIELLO E. Recurrent neural networks hardware implementation on FPGA[J]. arXiv preprint arXiv:1511.05552, 2015.
[27]	LI S, WU C, LI H, et al. FPGA acceleration of recurrent neural network based language model[C]// 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines. IEEE, 2015: 111-118.
[28]	LEE M, HWANG K, PARK J, et al. FPGA-based low-power speech recognition with recurrent neural networks[C]// 2016 IEEE International Workshop on Signal Processing Systems (SiPS). 2016: 230-235.
[29]	WANG S, LI Z, DING C, et al. C-LSTM: enabling efficient LSTM using structured compression techniques on FPGAs[C]// ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2018: 11-20.
[30]	ZHANG Y, WANG C, GONG L, et al. Implementation and optimization of the accelerator based on FPGA hardware for LSTM network[C]// IEEE International Symposium on Parallel and Distributed Processing with Applications and IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC). 2017: 614-621.
[31]	LIAO Y, LI H, WANG Z. FPGA based real-time processing architecture for recurrent neural network[C]// International Conference on Intelligent and Interactive Systems and Applications. 2017: 705-709.
[32]	SALCIC Z, BERBER S, SECKER P. FPGA prototyping of RNN decoder for convolutional codes[J]. EURASIP Journal on Advances in Signal Processing, 2006, 2006(1): 015640.
[33]	FERREIRA J C, FONSECA J. An FPGA implementation of a long short-term memory neural network[C]// 2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig). 2016: 1-8.
[34]	SHIN S, HWANG K, SUNG W. Fixed-point performance analysis of recurrent neural networks[J]. arXiv preprint arXiv:1512.01322, 2015.
[35]	HAN S, POOL J, TRAN J, et al. Learning both weights and connections for efficient neural network[C]// Advances in Neural Information Processing Systems. 2015: 1135-1143.
[36]	HAN S, KANG J, MAO H, et al. ESE: efficient speech recognition engine with sparse LSTM on FPGA[C]// ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2017: 75-84.
[37]	ALI S M, SHAOJUN W, NING M, et al. A bandwidth in-sensitive low stall sparse matrix vector multiplication architecture on reconfigurable FPGA platform[C]// 13th IEEE International Conference on Electronic Measurement & Instruments (ICEMI). 2017: 171-176.
[38]	FOWERS J, OVTCHAROV K, STRAUSS K, et al. A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication[C]// IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 2014: 36-43.
[39]	NEIL D, LEE J H, DELBRUCK T, et al. Delta networks for optimized recurrent network computation[J]. arXiv preprint arXiv:1612.05571, 2016.
[40]	GAO C, NEIL D, CEOLINI E, et al. DeltaRNN: a power-efficient recurrent neural network accelerator[C]// ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2018: 21-30.
[41]	KINGSBURY B E D, SAINATH T N, SINDHWANI V. Low-rank matrix factorization for deep belief network training with high-dimensional output targets[P]. 2016-2-16.
[42]	XUE J, LI J, GONG Y. Restructuring of deep neural network acoustic models with singular value decomposition[C]// Interspeech. 2013: 2365-2369.
[43]	QIU J, WANG J, YAO S, et al. Going deeper with embedded FPGA platform for convolutional neural network[C]// ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016: 26-35.
[44]	LU Z, SINDHWANI V, SAINATH T N. Learning compact recurrent neural networks[J]. arXiv preprint arXiv:1604.02594, 2016.
[45]	RIZAKIS M, VENIERIS S I, KOURIS A, et al. Approximate FPGA-based LSTM under computation time constraints[J]. arXiv preprint arXiv:1801.02190, 2018.
[46]	LI Z, WANG S, DING C, et al. Efficient recurrent neural networks using structured matrices in FPGA[J]. arXiv preprint arXiv:1803.07661, 2018.
[47]	WANG Z, LIN J, WANG Z. Accelerating recurrent neural networks: a memory-efficient approach[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2017, 25(10): 2763-2775.

Table 1
Variant	Input weights	Hidden-layer weights	Total parameters
Standard RNN	M × N	N × N	MN + N²
LSTM	4M × N	4N × N	4(MN + N²)
GRU	3M × N	3N × N	3(MN + N²)
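
The weight counts in this table can be checked with a short sketch. It assumes that M is the input size and N the hidden size, that the total is the sum of the input and hidden-layer weight matrices, and that biases are ignored. The function name and the example sizes are hypothetical and are not taken from the paper.

```python
def rnn_weight_counts(m: int, n: int) -> dict:
    """Count weight parameters (biases ignored) for input size m and hidden size n.

    Assumption: each variant stacks g copies of an (m x n) input matrix and an
    (n x n) hidden matrix, with g = 1 for a standard RNN, 3 for GRU, 4 for LSTM.
    """
    gate_multiplier = {"RNN": 1, "GRU": 3, "LSTM": 4}
    return {name: g * (m * n + n * n) for name, g in gate_multiplier.items()}


# Hypothetical sizes chosen only for illustration.
counts = rnn_weight_counts(m=128, n=256)
for name, total in counts.items():
    print(f"{name}: {total} weights")
```

With these sizes an LSTM layer holds four times as many weights as a standard RNN layer, which is the storage pressure that compression and quantization try to relieve.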

Table 2
Reference	Model	Baseline platform	Quantization method	Data compression ratio	Speedup
[28]	LSTM	NVIDIA GeForce Titan X	6-bit fixed-point	4×	4.12×
[33]	LSTM	Intel Core i7-3770K	17-bit fixed-point	N/A	251×
[26]	LSTM	ARM Cortex-A9 CPU	16-bit fixed-point	N/A	21×
[34]	LSTM	N/A	Nonlinear quantization	5-9×	N/A
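
The fixed-point entries in this table refer to storing values on a uniform grid with a limited bit width. A minimal sketch of that idea follows. The split between integer and fractional bits, the function name, and the sample values are assumptions for illustration and do not reproduce the scheme of any cited work.

```python
def quantize_fixed_point(x: float, total_bits: int, frac_bits: int) -> float:
    """Round x to a signed fixed-point grid and saturate on overflow.

    total_bits includes the sign bit; frac_bits sets the step size 2**-frac_bits.
    Python's round() resolves exact .5 ties to the nearest even integer.
    """
    scale = 1 << frac_bits
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    q = round(x * scale)            # nearest integer code
    q = max(q_min, min(q_max, q))   # saturate instead of wrapping around
    return q / scale


# Hypothetical weights, quantized to 6 bits with 4 fractional bits.
weights = [0.7312, -1.25, 3.9]
quantized = [quantize_fixed_point(w, total_bits=6, frac_bits=4) for w in weights]
print(quantized)
```

A narrower word shrinks storage roughly in proportion to the bit width, at the cost of rounding error and saturation for values outside the representable range.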