Based on linear systolic array for convolutional neural network's calculation optimization and performance analysis

Chinese Journal of Network and Information Security, 2018, Vol. 4, Issue (12): 16-24. doi: 10.11959/j.issn.2096-109x.2018100

Qinrang LIU, Chongyang LIU, Jun ZHOU, Xiaolong WANG

National Digital Switching System Engineering and Technological R&D Center, Zhengzhou 450002, China

Revised: 2018-10-29 Online: 2018-12-01 Published: 2018-12-30
About the authors: Qinrang LIU (1975-), male, born in Suixian, Henan, research fellow at the National Digital Switching System Engineering and Technological R&D Center; his research interests include broadband information networks and network-on-chip design. | Chongyang LIU (1994-), male, born in Yichang, Hubei, master's student at the National Digital Switching System Engineering and Technological R&D Center; his research interests include artificial intelligence and deep learning. | Jun ZHOU (1979-), male, born in Huanggang, Hubei, lecturer at the National Digital Switching System Engineering and Technological R&D Center; his research interests include chip design and broadband information processing. | Xiaolong WANG (1993-), male, born in Minquan, Henan, master's student at the National Digital Switching System Engineering and Technological R&D Center; his research interests include broadband information networks and protocol parsing.
Supported by:
The National Science and Technology Major Project of China (2016ZX01012101); the National Natural Science Foundation of China (61572520); the National Natural Science Foundation Innovation Group Project of China (61521003)

Abstract:

Most FPGA-based convolutional neural network (CNN) accelerator designs fail to exploit sparsity effectively. To address this problem, and with both bandwidth and energy consumption in mind, two improved CNN computation optimization schemes based on the linear systolic array are proposed. First, convolution is transformed into matrix multiplication so that sparsity can be exploited. Second, because the traditional parallel matrix multiplier has a large I/O demand, the design is reworked around a linear systolic array. Finally, the advantages and disadvantages of the traditional parallel matrix multiplier and the two improved linear systolic arrays for CNN acceleration are compared and analyzed. Theoretical proof and analysis show that, compared with the parallel matrix multiplier, both improved linear systolic arrays make full use of sparsity while consuming less energy and occupying less I/O bandwidth.

Key words: linear systolic array, convolutional neural network, sparsity, I/O bandwidth, performance analysis
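
The first step described in the abstract, recasting convolution as a matrix product so that zero weights can be skipped, can be sketched as follows. This is a minimal illustration only, assuming a single-channel image and a single kernel; the names `im2col` and `sparse_conv` are hypothetical and are not taken from the paper, and "convolution" is used in the CNN sense (no kernel flip).

```python
def im2col(image, k, stride=1):
    """Unfold a 2-D image (list of lists) into rows of flattened k x k patches."""
    h, w = len(image), len(image[0])
    return [
        [image[i + di][j + dj] for di in range(k) for dj in range(k)]
        for i in range(0, h - k + 1, stride)
        for j in range(0, w - k + 1, stride)
    ]


def sparse_conv(image, kernel, stride=1):
    """Convolve as (patch matrix) x (flattened kernel), skipping zero weights.

    Returns (outputs, number of multiplications actually performed).
    """
    k = len(kernel)
    flat = [w for row in kernel for w in row]
    # Keep only the nonzero weights together with their column index.
    nonzero = [(idx, w) for idx, w in enumerate(flat) if w != 0]
    patches = im2col(image, k, stride)
    outputs = [sum(p[idx] * w for idx, w in nonzero) for p in patches]
    return outputs, len(nonzero) * len(patches)


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]  # half of the weights are zero
outputs, mults = sparse_conv(image, kernel)
dense_mults = len(im2col(image, 2)) * 4  # cost without zero skipping
```

In this toy case the zero-skipping product needs 8 multiplications where a dense product would need 16; once the kernel is a flat vector, skipping a zero weight is just skipping a column of the patch matrix.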

CLC number: TP183

Qinrang LIU, Chongyang LIU, Jun ZHOU, Xiaolong WANG. Based on linear systolic array for convolutional neural network's calculation optimization and performance analysis[J]. Chinese Journal of Network and Information Security, 2018, 4(12): 16-24.

Figures and tables (18)

Figures 1-11 and Tables 1, 3, 5 and 7: no captions available.

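Table 2. Access operation types and access counts: single-output linear systolic array vs. parallel matrix multiplier

Operation type	Access count, single-output linear systolic array	Access count, parallel matrix multiplier
Off-chip → on-chip and on-chip → off-chip	Same	Same
Cache outside PE → register inside PE	$2 n^{2}$	$2 n^{3}$
Register → register inside PE	$\frac{3 n^{2} (n - 1)}{2}$	-
Cache accesses for intermediate results	Same	Same
Cache accesses at output	$\frac{n^{2} (n - 1)}{2}$	$n^{2}$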
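
To get a feel for the size of these terms, the rows of Table 2 that differ between the two designs can be evaluated for a concrete matrix order n. The sketch below simply transcribes the table's formulas; the function name `access_counts` and the dictionary keys are hypothetical, and the empty cell for the parallel multiplier is treated as zero, which is an assumption.

```python
def access_counts(n):
    """Evaluate the Table 2 rows that differ between the two designs."""
    return {
        "single_output_systolic": {
            "cache_to_pe_register": 2 * n**2,
            "register_to_register": 3 * n**2 * (n - 1) // 2,
            "output_cache": n**2 * (n - 1) // 2,
        },
        "parallel_multiplier": {
            "cache_to_pe_register": 2 * n**3,
            "register_to_register": 0,  # assumption: the empty cell means none
            "output_cache": n**2,
        },
    }


counts = access_counts(4)
# Cache-to-register reads avoided by the systolic design: 2n^3 - 2n^2
saved_cache_reads = (
    counts["parallel_multiplier"]["cache_to_pe_register"]
    - counts["single_output_systolic"]["cache_to_pe_register"]
)
```

The integer divisions are exact because n²(n − 1) is always even. For n = 4 the saved cache reads equal 2n²(n − 1) = 96.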

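Table 4. Comparison of the single-output linear systolic array and the parallel matrix multiplier

Metric	Single-output linear systolic array	Parallel matrix multiplier
Cycles	$2 n^{2} + 1$	$n^{2}$
Differing access operations	$\frac{3 n^{2} (n - 1)}{2}$ (register reads)	$\frac{3 n^{2} (n - 1)}{2}$ (SRAM reads)
Register usage	$2n$ more	-
I/O ports and bandwidth	3	$3n$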

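Table 6. Access operation types and access counts: multi-output linear systolic array vs. parallel matrix multiplier

Metric	Multi-output linear systolic array	Parallel matrix multiplier
Cycles	$x n^{2} + \frac{2 n}{x}$	$n^{2}$
Differing access operations	$\frac{3 n^{2} (n - 1)}{2}$ (register reads)	$2 n^{2} (n - 1)$ (SRAM reads)
Register usage	$2n$ more	-
I/O ports and bandwidth	$2 + n$	$3n$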
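
The cycle and I/O-port formulas of Tables 4 and 6 can be put side by side for a given n. The sketch below only transcribes those formulas; `design_costs` is a hypothetical helper, and the parameter x is used exactly as it appears in Table 6 (its meaning is not defined in this excerpt, so the default x = 1 is purely illustrative).

```python
def design_costs(n, x=1):
    """Cycles and I/O ports for the three designs, per Tables 4 and 6."""
    return {
        "parallel": {"cycles": n**2, "io_ports": 3 * n},
        "single_output": {"cycles": 2 * n**2 + 1, "io_ports": 3},
        # x is the parameter from Table 6; its definition is an open assumption
        "multi_output": {"cycles": x * n**2 + 2 * n / x, "io_ports": 2 + n},
    }


costs = design_costs(4, x=1)
```

With n = 4 and x = 1 the multi-output array falls between the other two designs in both cycles and ports, which matches the qualitative ranking given elsewhere in the article.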


References (21)

[1]	HAN S, MAO H, DALLY W J. Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding[J]. Fiber, 2015, 56(4): 3-7.
[2]	QIU J, WANG J, YAO S, et al. Going deeper with embedded FPGA platform for convolutional neural network[C]// International Symposium on Field-Programmable Gate Arrays. 2016: 26-35.
[3]	SABOUR S, FROSST N, HINTON G E. Dynamic routing between capsules[C]// Annual Conference on Neural Information Processing Systems. 2017.
[4]	HAN S, LIU X, MAO H, et al. EIE: efficient inference engine on compressed deep neural network[J]. ACM Sigarch Computer Architecture News, 2016, 44(3): 243-254.
[5]	CHEN W, WILSON J, TYREE S, et al. Compressing neural networks with the hashing trick[C]// International Conference on Machine Learning. 2015: 2285-2294.
[6]	MA Y, CAO Y, VRUDHULA S, et al. Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks[C]// International Symposium on Field-Programmable Gate Arrays. 2017: 45-54.
[7]	LI N, TAKAKI S, TOMIOKAY Y, et al. A multistage dataflow implementation of a deep convolutional neural network based on FPGA for high-speed object recognition[C]// 2016 IEEE Southwest Symposium on Image Analysis and Interpretation. 2016: 165-168.
[8]	SUDA N, CHANDRA V, DASIKA G, et al. Throughput-optimized openCL-based FPGA accelerator for large-scale convolutional neural networks[C]// International Symposium on Field-Programmable Gate Arrays. 2016: 16-25.
[9]	XIAO Q, LIANG Y, LU L, et al. Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs[C]// The 54th Annual Design Automation Conference. 2017: 62-67.
[10]	CHEN Y H, KRISHNA T, EMER J S, et al. Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks[J]. IEEE Journal of Solid-State Circuits, 2017, 52(1): 127-138.
[11]	LIU Q R, LIU C Y. Calculation optimization for convolutional neural networks and FPGA-based accelerator design using the parameters sparsity[J]. JEIT, 2018, 40(6): 1368-1374.
[12]	LIU X, HAN S, MAO H, et al. Efficient sparse-winograd convolutional neural networks[C]// International Conference on Learning Representations. 2017.
[13]	JANG J W, CHOI S B, PRASANNA V K. Energy-and time-efficient matrix multiplication on FPGAs[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2005, 13(11): 1305-1319.
[14]	MATAM K K, LE H, PRASANNA V K. Energy efficient architecture for matrix multiplication on FPGAs[C]// International Conference on Field Programmable Logic and Applications. 2013: 1-4.
[15]	JIA Y, SHELHAMER E, DONAHUE J, et al. Caffe: convolutional architecture for fast feature embedding[C]// The 22nd ACM International Conference on Multimedia. 2014: 675-678.
[16]	JIA Y Q. Optimizing conv in caffe[R].
[17]	MOONS B, DE BRABANDERE B, VAN GOOL L, et al. Energy-efficient convnets through approximate computing[C]// Applications of Computer Vision. 2016: 1-8.
[18]	TIAN X, ZHOU F, CHEN Y W, et al. Design of field programmable gate array based real-time double-precision floating-point matrix multiplier[J]. Journal of Zhejiang University (Engineering Science), 2008, 42(9): 1611-1615.
[19]	HAN S, POOL J, TRAN J, et al. Learning both weights and connections for efficient neural network[C]// Annual Conference on Neural Information Processing Systems. 2015: 1135-1143.
[20]	LAI B C C, LIN J L. Efficient designs of multi-ported memory on FPGA[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2017, 25(1): 139-150.
[21]	CHEN J, LI J. The research of peer-to-peer network security[C]// The International Conference on Information Computing and Automation. 2015: 590-592.

Operation	Energy consumption/pJ	Relative energy cost
32 bit int ADD	0.1	1
32 bit float ADD	0.9	9
32 bit Register File	1	10
32 bit int MULT	3.1	31
32 bit float MULT	3.7	37
32 bit SRAM Cache	5	50
32 bit DRAM Memory	640	6 400
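
The per-operation energies above make it easy to see why moving reads from SRAM into registers matters. The sketch below is a rough estimate only: it weights operation counts by the listed energies and ignores everything else (control, leakage, wiring). The names `ENERGY_PJ` and `energy_pj` are hypothetical, and the read count reuses the $\frac{3 n^{2} (n - 1)}{2}$ figure from Table 4 purely as an illustrative workload.

```python
# Energy per 32-bit operation in pJ, copied from the table above.
ENERGY_PJ = {
    "int_add": 0.1,
    "float_add": 0.9,
    "register_file": 1.0,
    "int_mult": 3.1,
    "float_mult": 3.7,
    "sram_cache": 5.0,
    "dram": 640.0,
}


def energy_pj(op_counts):
    """Total energy in pJ for a dict mapping operation name -> count."""
    return sum(ENERGY_PJ[op] * count for op, count in op_counts.items())


n = 4
reads = 3 * n**2 * (n - 1) // 2  # illustrative count taken from Table 4
register_cost = energy_pj({"register_file": reads})
sram_cost = energy_pj({"sram_cache": reads})
```

For the same number of reads, serving them from the register file costs one fifth of the SRAM energy under these figures, and a single DRAM access costs as much as 640 register reads.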

Resource	Used	Available	Utilization
BRAM_18K	1 458	2 060	70.78%
DSP48E	1 792	2 800	64.00%
FF	1 792	607 200	28.12%
LUT	142 304	303 600	46.87%

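Convolutional layer	Original size	Block size
Convolutional layer 1	(3 025×363)×(363×96)	(95×12)×(12×3)
Convolutional layer 2	(729×1 200)×(1 200×256)	(23×38)×(38×8)
Convolutional layer 3	(169×2 304)×(2 304×384)	(6×72)×(72×12)
Convolutional layer 4	(169×1 728)×(1 728×384)	(6×54)×(54×12)
Convolutional layer 5	(169×1 728)×(1 728×256)	(6×54)×(54×8)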
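
The matrix sizes above follow from the usual convolution-to-matrix mapping: one row per output pixel, one column per kernel weight, and one output column per filter. The sketch below reproduces them under stated assumptions: the layer hyperparameters (output size, kernel size, input channels per group) are not given in this excerpt and are chosen here because they match AlexNet-style layers, and the partition factor of 32 per dimension is inferred from the numbers rather than stated by the authors. The names `gemm_shape` and `block_shape` are hypothetical.

```python
import math


def gemm_shape(out_h, out_w, k, c_in, c_out):
    """Shapes of the two matrices after unfolding a convolution layer."""
    rows, inner = out_h * out_w, k * k * c_in
    return (rows, inner), (inner, c_out)


def block_shape(shape_a, shape_b, parts=32):
    """Block sizes when every dimension is split into `parts` pieces (rounded up)."""
    (m, inner), (_, n) = shape_a, shape_b
    bm, bk, bn = (math.ceil(d / parts) for d in (m, inner, n))
    return (bm, bk), (bk, bn)


# (out_h, out_w, kernel, input channels per group, filters): assumed values
layers = {
    "conv1": (55, 55, 11, 3, 96),
    "conv2": (27, 27, 5, 48, 256),
    "conv3": (13, 13, 3, 256, 384),
    "conv4": (13, 13, 3, 192, 384),
    "conv5": (13, 13, 3, 192, 256),
}
shapes = {name: gemm_shape(*params) for name, params in layers.items()}
blocks = {name: block_shape(*shape) for name, shape in shapes.items()}
```

Under these assumptions every row of the table is reproduced; for example, layer 1 unfolds to (3025×363)×(363×96) and splits into (95×12)×(12×3) blocks.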

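Scheme	Characteristics
Parallel matrix multiplier of Ref. [11]	Highest bandwidth requirement but fewest cycles; highest energy consumption
Single-output linear systolic array	Lowest bandwidth requirement but most cycles; second-highest energy consumption
Multi-output linear systolic array	Bandwidth and cycles between the other two; lowest energy consumption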
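
The bandwidth-versus-cycles trade-off in this comparison can be illustrated with a generic, textbook-style linear systolic array. The sketch below is not the authors' single-output or multi-output design: it simulates a one-dimensional chain of processing elements (PEs) computing a matrix-vector product, assuming each PE already holds one row of the matrix locally. The function name `linear_systolic_matvec` is hypothetical.

```python
def linear_systolic_matvec(A, x):
    """Simulate y = A x on a chain of len(A) PEs fed through a single input port.

    One element of x enters PE 0 per cycle and is handed to the next PE on
    every following cycle; PE i accumulates A[i][j] * x[j] when x[j] arrives.
    Returns (y, cycles).
    """
    n, m = len(A), len(x)
    acc = [0] * n
    held = [None] * n  # (index, value) of the x element held by each PE
    cycles = 0
    for t in range(m + n - 1):
        incoming = (t, x[t]) if t < m else None
        held = [incoming] + held[:-1]  # shift operands one PE to the right
        for i, item in enumerate(held):
            if item is not None:
                j, value = item
                acc[i] += A[i][j] * value
        cycles += 1
    return acc, cycles


y, cycles = linear_systolic_matvec([[1, 2], [3, 4]], [5, 6])
```

Only one operand enters the array per cycle, so the input bandwidth stays constant regardless of the number of PEs, while the pipeline needs n − 1 extra cycles to drain. A fully parallel multiplier would instead broadcast operands to all units at once, trading I/O ports for fewer cycles.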

