Spatiotemporal squeeze-and-excitation residual multiplier network for video action recognition

doi:10.11959/j.issn.1000-436x.2019194

Abstract

Shallow networks and generic deep models used in two-stream architectures learn spatial and temporal information inadequately. To address this, squeeze-and-excitation residual networks are used for action recognition in both the spatial stream and the temporal stream, and identity mapping kernels are injected into the network as temporal filters to capture long-term temporal dependencies. To further strengthen the interaction between the spatial and temporal information of the squeeze-and-excitation residual networks, spatiotemporal features are fused by multiplication, and the effects of the fusion manner between the spatial and temporal streams, the number of fusions, and the fusion locations on recognition performance are studied. Because the performance of a single model is limited, three different strategies are proposed to generate multiple models, which are combined by direct averaging and weighted averaging to obtain the final recognition result. Experimental results on the HMDB51 and UCF101 datasets show that the proposed spatiotemporal squeeze-and-excitation residual multiplier network effectively improves action recognition performance.

Key words: action recognition, spatiotemporal stream, squeeze-and-excitation residual network, multiplication fusion, multi-model ensemble
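The three building blocks named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the nested-list feature layout, the bias-free fully connected weights `w1` and `w2`, and the 3-tap kernel are all assumptions made for illustration, and the exact point inside the residual unit where the product is applied is not specified on this page.

```python
import math


def squeeze_excite(feat, w1, w2):
    """Reweight the channels of `feat` (C channels, each a flat list of activations).

    w1 (C/r rows x C columns) and w2 (C rows x C/r columns) are hypothetical
    fully connected weights; biases are omitted for brevity.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(ch) / len(ch) for ch in feat]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields one gate in (0, 1) per channel.
    h = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, h)))) for row in w2]
    # Scale: multiply every activation in a channel by that channel's gate.
    return [[g * v for v in ch] for g, ch in zip(s, feat)]


def multiply_fuse(spatial, temporal):
    """Temporal-to-spatial fusion: element-wise product of same-shaped features."""
    return [[a * m for a, m in zip(cs, ct)] for cs, ct in zip(spatial, temporal)]


def temporal_filter(seq, kernel):
    """Slide an odd-length kernel over a per-frame feature sequence (zero padding)."""
    half = len(kernel) // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(seq):
                acc += w * seq[idx]
        out.append(acc)
    return out


# Identity initialisation (assumed 3-tap form): the filter starts as a pass-through.
IDENTITY_KERNEL = [0.0, 1.0, 0.0]
```

With `IDENTITY_KERNEL` the temporal filter returns its input unchanged, so adding it to a pretrained network does not alter the network's outputs before fine-tuning; training can then move the outer taps away from zero to mix information across frames.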

CLC number: TP391

Huilan LUO, Kang TONG. Spatiotemporal squeeze-and-excitation residual multiplier network for video action recognition[J]. Journal on Communications, 2019, 40(10): 189-198.

Figures and tables: 10 (Figures 1-5, Tables 1-5)

References (29)

[1]	HERATH S, HARANDI M, PORIKLI F. Going deeper into action recognition: a survey[J]. Image and Vision Computing, 2017(60): 4-21.
[2]	HU Q, QIN L, HUANG Q. Overview of human action recognition based on vision[J]. Chinese Journal of Computers, 2013, 36(12): 2512-2524.
[3]	ZHU Y, ZHAO J K, WANG Y N. A review of human action recognition based on deep learning[J]. Acta Automatica Sinica, 2016, 42(6): 848-857.
[4]	LUO H L, WANG C J, LU F. Survey of video behavior recognition[J]. Journal on Communications, 2018, 39(6): 173-184.
[5]	BOBICK A F, DAVIS J W. An appearance-based representation of action[C]// International Conference on Pattern Recognition. IEEE, 1996: 307-312.
[6]	WEINLAND D, RONFARD R, BOYER E. Free viewpoint action recognition using motion history volumes[J]. Computer Vision and Image Understanding, 2006, 104(2-3): 249-257.
[7]	YILMAZ A, SHAH M. Actions sketch: a novel action representation[C]// Computer Vision and Pattern Recognition. IEEE, 2005: 984-989.
[8]	WANG H, ULLAH M M, KLASER A, et al. Evaluation of local spatio-temporal features for action recognition[C]// British Machine Vision Conference. BMVA, 2009: 1-11.
[9]	KLASER A, SCHMID C. Action recognition by dense trajectories[C]// Computer Vision and Pattern Recognition. IEEE, 2011: 3169-3176.
[10]	WANG H, SCHMID C. Action recognition with improved trajectories[C]// International Conference on Computer Vision. IEEE, 2013: 3551-3558.
[11]	JI S, XU W, YANG M, et al. 3D convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2013, 35(1): 221-231.
[12]	TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]// International Conference on Computer Vision. IEEE, 2015: 4489-4497.
[13]	TRAN D, RAY J, SHOU Z, et al. ConvNet architecture search for spatiotemporal feature learning[J]. Computing Research Repository, 2017, 16(8): 178-190.
[14]	KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]// Computer Vision and Pattern Recognition. IEEE, 2014: 1725-1732.
[15]	SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]// Neural Information Processing Systems. NeurIPS, 2014: 568-576.
[16]	WANG L, XIONG Y, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[J]. ACM Transactions on Information Systems, 2016, 22(1): 20-36.
[17]	FEICHTENHOFER C, PINZ A, WILDES R P. Spatiotemporal residual networks for video action recognition[C]// Neural Information Processing Systems. NeurIPS, 2016: 3468-3476.
[18]	WANG X, FARHADI A, GUPTA A. Actions ~ transformations[C]// Computer Vision and Pattern Recognition. IEEE, 2016: 2658-2667.
[19]	WANG Y, LONG M, WANG J, et al. Spatiotemporal pyramid network for video action recognition[C]// Computer Vision and Pattern Recognition. IEEE, 2017: 2097-2106.
[20]	FEICHTENHOFER C, PINZ A, ZISSERMAN A. Convolutional two-stream network fusion for video action recognition[C]// Computer Vision and Pattern Recognition. IEEE, 2016: 1933-1941.
[21]	FEICHTENHOFER C, PINZ A, WILDES R P. Spatiotemporal multiplier networks for video action recognition[C]// Computer Vision and Pattern Recognition. IEEE, 2017: 7445-7454.
[22]	WANG L, GE L, LI R, et al. Three-stream CNNs for action recognition[J]. Pattern Recognition Letters, 2017, 92(C): 33-40.
[23]	BILEN H, FERNANDO B, GAVVES E, et al. Action recognition with dynamic image networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(12): 2799-2813.
[24]	HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]// Computer Vision and Pattern Recognition. IEEE, 2016: 770-778.
[25]	HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]// Computer Vision and Pattern Recognition. IEEE, 2018: 7132-7141.
[26]	SOOMRO K, ZAMIR A R, SHAH M. UCF101: a dataset of 101 human actions classes from videos in the wild[J]. Computer Science, 2012, 3(12): 1-9.
[27]	KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: a large video database for human motion recognition[C]// International Conference on Computer Vision. IEEE, 2011: 2556-2563.
[28]	ZHANG C L, ZHANG H, WEI X S, et al. Deep bimodal regression for apparent personality analysis[C]// European Conference on Computer Vision Workshops. Springer, 2016: 311-324.
[29]	KHOWAJA S A, LEE S-L. Semantic image networks for human action recognition[J]. The Computing Research Repository, 2019, 21(1): 1-30.

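Table 1
Dataset	Split	Recognition accuracy (spatial-stream network)	Recognition accuracy (temporal-stream network)
UCF101	split₁	83.40%	84.20%
UCF101	split₂	83.30%	85.90%
UCF101	split₃	83.00%	87.80%
UCF101	Average	83.20%	86.00%
HMDB51	split₁	47.60%	57.90%
HMDB51	split₂	48.60%	58.70%
HMDB51	split₃	44.70%	57.10%
HMDB51	Average	47.00%	57.90%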
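The average rows for the single-stream networks follow directly from the three per-split accuracies. As a quick arithmetic check, the snippet below recomputes them from the per-split values listed above; the dictionary layout and variable names are illustrative only.

```python
# Per-split accuracies (%) copied from the per-split rows above.
split_accuracy = {
    ("UCF101", "spatial"): [83.4, 83.3, 83.0],
    ("UCF101", "temporal"): [84.2, 85.9, 87.8],
    ("HMDB51", "spatial"): [47.6, 48.6, 44.7],
    ("HMDB51", "temporal"): [57.9, 58.7, 57.1],
}

# Mean over the three splits, rounded to one decimal place as in the table.
mean_accuracy = {
    key: round(sum(values) / len(values), 1)
    for key, values in split_accuracy.items()
}

for (dataset, stream), value in mean_accuracy.items():
    print(f"{dataset} {stream}: {value:.1f}%")
```

The recomputed means agree with the reported averages, and they show the temporal stream ahead of the spatial stream on both datasets, by about 3 points on UCF101 and about 11 points on HMDB51.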

Table 2
Number of fusions	Fusion location	Recognition accuracy
Single fusion	conv2_1_relu and conv2_1	65.9%
Single fusion	conv3_1_relu and conv3_1	66.1%
Single fusion	conv4_1_relu and conv4_1	66.5%
Single fusion	conv5_1_relu and conv5_1	67.1%
Two fusions	conv4_1_relu and conv4_1, conv5_1_relu and conv5_1	69.7%
Three fusions	conv3_1_relu and conv3_1, conv4_1_relu and conv4_1, conv5_1_relu and conv5_1	69.1%
Four fusions	conv2_1_relu and conv2_1, conv3_1_relu and conv3_1, conv4_1_relu and conv4_1, conv5_1_relu and conv5_1	67.6%

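Table 3
Number of fusions	Fusion location	Fusion direction	Recognition accuracy
Single fusion	conv2_1_relu and conv2_1	Temporal stream to spatial stream	65.9%
Single fusion	conv2_1_relu and conv2_1	Spatial stream to temporal stream	64.6%
Single fusion	conv5_1_relu and conv5_1	Temporal stream to spatial stream	67.1%
Single fusion	conv5_1_relu and conv5_1	Spatial stream to temporal stream	65.0%
Two fusions	conv4_1_relu and conv4_1, conv5_1_relu and conv5_1	Temporal stream to spatial stream	69.7%
Two fusions	conv4_1_relu and conv4_1, conv5_1_relu and conv5_1	Spatial stream to temporal stream	62.1%
Three fusions	conv3_1_relu and conv3_1, conv4_1_relu and conv4_1, conv5_1_relu and conv5_1	Temporal stream to spatial stream	69.1%
Three fusions	conv3_1_relu and conv3_1, conv4_1_relu and conv4_1, conv5_1_relu and conv5_1	Spatial stream to temporal stream	57.1%
Four fusions	conv2_1_relu and conv2_1, conv3_1_relu and conv3_1, conv4_1_relu and conv4_1, conv5_1_relu and conv5_1	Temporal stream to spatial stream	67.6%
Four fusions	conv2_1_relu and conv2_1, conv3_1_relu and conv3_1, conv4_1_relu and conv4_1, conv5_1_relu and conv5_1	Spatial stream to temporal stream	52.0%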

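Table 4
Method	Recognition accuracy
Strategy 1 (direct averaging)	68.5%
Strategy 1 (weighted averaging)	69.2%
Strategy 2 (direct averaging)	65.6%
Strategy 2 (weighted averaging)	67.6%
Strategy 3 (direct averaging)	68.8%
Strategy 3 (weighted averaging)	69.3%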
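The difference between direct and weighted averaging of several models can be sketched as follows. This is an illustrative sketch only: the function name `ensemble`, the toy score vectors, and the weights are hypothetical, and this page does not state how the authors chose their weights or how each strategy generates its models.

```python
def ensemble(scores, weights=None):
    """Fuse per-model class scores and return (fused_scores, predicted_class).

    scores:  one list of class scores per model, all of equal length.
    weights: one non-negative weight per model; None means direct averaging.
    """
    n_models = len(scores)
    if weights is None:
        weights = [1.0] * n_models  # direct averaging: every model counts equally
    total = sum(weights)
    n_classes = len(scores[0])
    fused = [
        sum(w * s[c] for w, s in zip(weights, scores)) / total
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: fused[c])
    return fused, predicted


# Toy example: three models scoring two classes (made-up numbers).
model_scores = [[0.60, 0.40], [0.30, 0.70], [0.55, 0.45]]
_, direct_prediction = ensemble(model_scores)
_, weighted_prediction = ensemble(model_scores, weights=[2.0, 1.0, 2.0])
```

In this toy case direct averaging picks class 1, while giving more weight to the first and third models flips the decision to class 0, which illustrates how weighting can change the ensemble output when models disagree.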

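Table 5
Method	UCF101	HMDB51
Improved dense trajectories [10]	86.4%	61.7%
3D residual convolutional network [13]	85.8%	54.9%
Two-stream convolutional neural network [15]	88.0%	59.4%
Convolutional two-stream network fusion [20]	91.8%	64.6%
Spatiotemporal pyramid network [19]	93.2%	66.1%
Spatiotemporal multiplier network [21]	94.2%	68.9%
Three-stream convolutional neural network [22]	92.1%	67.2%
Semantic image network [29]	92.1%	65.8%
Proposed method (strategy 3 + weighted averaging)	92.4%	69.3%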