Telecommunications Science ›› 2019, Vol. 35 ›› Issue (12): 99-111. doi: 10.11959/j.issn.1000-0801.2019286
About the authors:
ZHAO Duoduo (1995- ), female, a master's student at the School of Communication Engineering, Hangzhou Dianzi University. Her research interests include image processing and artificial intelligence.|ZHANG Jianwu (1961- ), male, Ph.D., professor and doctoral supervisor at the School of Communication Engineering, Hangzhou Dianzi University, senior member of the Chinese Institute of Electronics and the China Institute of Communications, and executive director of the Zhejiang Institute of Communications. His research interests include mobile communications, multimedia signal processing and artificial intelligence, and communication networks and information security.|GUO Chunsheng (1971- ), male, Ph.D., associate professor and master's supervisor at the School of Communication Engineering, Hangzhou Dianzi University. His research interests include video analysis and pattern recognition.|ZHOU Di (1975- ), male, senior engineer at Zhejiang Uniview Technologies Co., Ltd. and dean of the Uniview Research Institute. His research interests include video security and artificial intelligence.|ABDUSHARAFALHAKIMI MOHAMMED (1991- ), male, a Ph.D. candidate at Hangzhou Dianzi University. His research interests include image processing and artificial intelligence.
Duoduo ZHAO1,Jianwu ZHANG1(),Chunsheng GUO1,Di ZHOU2,ABDUSHARAFALHAKIMI MOHAMMED1
Revised:
2019-12-10
Online:
2019-12-20
Published:
2020-01-15
Supported by:
Abstract:
In recent years, deep learning methods that learn features automatically have been extensively explored for video behavior recognition. Building on a summary of the commonly used behavior recognition datasets, this paper reviews traditional behavior recognition methods and the basic principles of deep learning, then gives a comprehensive, systematic summary, comparison and analysis of behavior recognition methods based on different input modalities and different deep networks. Finally, it summarizes the progress of deep learning in behavior recognition and looks ahead to future trends.
CLC number:
Duoduo ZHAO,Jianwu ZHANG,Chunsheng GUO,Di ZHOU,ABDUSHARAFALHAKIMI MOHAMMED. A survey of video behavior recognition based on deep learning[J]. Telecommunications Science, 2019, 35(12): 99-111.
Table 1  Summary of datasets
| Name (year) | Description | Video samples | Best reported accuracy |
| --- | --- | --- | --- |
| KTH | 6 action classes performed by 25 subjects: hand waving, walking, jogging, running, clapping and boxing, in 4 different scenarios | 2,391 | 98.83% |
| Weizmann | 10 actions performed by 9 subjects: walking, running, jumping, waving, bending, etc. | 93 | 100% |
| Hollywood 2 | 12 action classes in 10 scene types: fighting, driving, handshaking, hugging, kissing, etc.; all samples come from 69 Hollywood films | 3,669 | 78.6% |
| Olympic Sports | 16 sports actions, including high jump and long jump | 783 | 96.6% |
| HMDB51 | 51 classes, each containing at least 101 video samples; sourced from public databases, films, YouTube, etc. | 6,849 | 82.1% |
| UCF datasets (2007- ) | (1) UCF-11 (2008) | 1,600 | 94.5% |
| | (2) UCF-50 | 6,676 | 99.98% |
| | (3) UCF-101 | 13,320 | 98.0% |
| | (4) UCF Sports | 150 | 96.2% |
| Sports-1M | 487 sports categories in six groups: aquatic, team, winter, ball, combat and animal sports | 1,133,158 | 75.9% |
| Kinetics | covers 700 human actions with at least 600 videos per action, each clip from a distinct YouTube video; actions include playing instruments, shaking hands and hugging | 650,000 | — |
| Moments in Time | labeled 3-second clips of dynamic scenes involving people, animals, objects or natural phenomena | 1,000,000 | — |
Table 2  Performance comparison of two-stream networks and their derivatives on UCF-101 and HMDB51
| Method | Pre-training dataset | UCF-101 | HMDB51 |
| --- | --- | --- | --- |
| Two-Stream (VGG-M) | ImageNet | 88.0 | 59.4 |
| Two-Stream + LSTM | ImageNet | 88.6 | — |
| Very deep Two-Stream (GoogLeNet) | ImageNet | 89.3 | — |
| Very deep Two-Stream (VGG-16) | ImageNet | 91.4 | 58.5 |
| Two-Stream fusion (VGG-16) | ImageNet | 92.5 | 65.4 |
| Two-Stream fusion (VGG-16) + IDT | ImageNet | 93.5 | 69.2 |
| ST-ResNet* | ImageNet | 93.4 | 66.4 |
| ST-ResNet* + IDT | ImageNet | 94.6 | 70.3 |
| ST-Pyramid Network (VGG-16) | ImageNet | 93.2 | 66.1 |
| ST-Pyramid Network (ResNet-50) | ImageNet | 93.8 | 66.5 |
| ST-Pyramid Network (BN-Inception) | ImageNet | 94.6 | 68.9 |
| ST-Multiplier | ImageNet | 94.2 | 68.9 |
| ST-Multiplier + IDT | ImageNet | 94.9 | 72.2 |
| TLE: FC-Pooling | ImageNet | 92.2 | 68.8 |
| TLE: Bilinear + TS | ImageNet | 95.1 | 70.6 |
| TLE: Bilinear | ImageNet | 95.6 | 71.1 |
| TSN | ImageNet | 94.2 | 69.4 |
| TSN (Inception v3) | ImageNet | 96.2 | 75.3 |
| Hidden Two-Stream (TSN) | Kinetics | 93.2 | 66.8 |
| Hidden Two-Stream (I3D) | Kinetics | 97.1 | 78.7 |
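The two-stream family compared in Table 2 scores each clip with two networks, an appearance (RGB) stream and a motion (optical-flow) stream, and combines their class scores at the end. As a minimal sketch of that score-level fusion in plain Python: the class scores and the fusion weights below are toy values for illustration, not numbers taken from any of the cited papers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def two_stream_fusion(spatial_logits, temporal_logits,
                      w_spatial=1.0, w_temporal=1.5):
    """Late (score-level) fusion: weighted average of the two streams'
    class probabilities. The weights are illustrative assumptions."""
    ps = softmax(spatial_logits)
    pt = softmax(temporal_logits)
    z = w_spatial + w_temporal
    return [(w_spatial * a + w_temporal * b) / z for a, b in zip(ps, pt)]

# Toy 4-class clip: appearance favors class 0, motion favors class 1.
spatial = [2.0, 0.5, 0.1, -1.0]    # RGB (appearance) stream scores
temporal = [0.3, 2.5, 0.0, -0.5]   # optical-flow (motion) stream scores
fused = two_stream_fusion(spatial, temporal)
print(max(range(len(fused)), key=fused.__getitem__))  # predicted class index
```

Weighting the flow stream somewhat higher mirrors the common finding that motion cues dominate on UCF-101, though the exact weights and fusion point (scores vs. intermediate features, as in Two-Stream fusion and ST-ResNet) vary from paper to paper.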
Table 3  Performance comparison of the C3D convolutional network and its derivatives on UCF-101 and HMDB51
| Method | Pre-training dataset | UCF-101 | HMDB51 |
| --- | --- | --- | --- |
| FstCN | ImageNet | 88.1 | 59.1 |
| C3D (one network) | Sports-1M | 82.3 | — |
| C3D ensemble | Sports-1M | 85.2 | — |
| C3D ensemble + IDT | Sports-1M | 90.1 | — |
| C3D + LSTM | — | 92.9 | 70.1 |
| T3D | ImageNet | 90.3 | 59.2 |
| T3D-Transfer | ImageNet | 91.7 | 61.1 |
| T3D + TSN | ImageNet | 93.2 | 63.5 |
| STRN | ImageNet | 93.2 | 64.9 |
| P3D ResNet | ImageNet + Sports-1M | 88.6 | — |
| P3D ResNet + IDT | ImageNet + Sports-1M | 93.7 | — |
| Multi-task C3D + LSTM | Sports-1M | 93.4 | 68.9 |
| R(2+1)D-RGB | Sports-1M | 93.6 | 66.6 |
| R(2+1)D-Flow | Sports-1M | 93.3 | 70.1 |
| R(2+1)D-Two-Stream | Sports-1M | 95.0 | 72.7 |
| R(2+1)D-RGB | Kinetics | 96.8 | 74.5 |
| R(2+1)D-Flow | Kinetics | 95.5 | 76.4 |
| R(2+1)D-Two-Stream | Kinetics | 97.3 | 78.7 |
| RGB-I3D | ImageNet + Kinetics | 95.6 | 74.8 |
| Flow-I3D | ImageNet + Kinetics | 96.7 | 77.1 |
| Two-Stream I3D | ImageNet + Kinetics | 98.0 | 80.7 |
| RGB-I3D | Kinetics | 95.1 | 74.3 |
| Flow-I3D | Kinetics | 96.5 | 77.3 |
| Two-Stream I3D | Kinetics | 97.8 | 80.9 |
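One idea behind the R(2+1)D rows in Table 3 is that a full t × d × d 3D convolution can be factorized into a 1 × d × d spatial convolution followed by a t × 1 × 1 temporal one. The helper below (hypothetical function names, biases ignored) sketches the parameter accounting from Tran et al. [59], where the intermediate channel count m is chosen so the factorized block stays within the original parameter budget.

```python
def full_3d_params(c_in, c_out, t, d):
    """Parameters of one t x d x d 3D convolution (biases ignored)."""
    return c_in * c_out * t * d * d

def r2plus1d_params(c_in, c_out, t, d):
    """Parameters of the factorized (2+1)D block: a 1 x d x d spatial
    convolution producing m channels, then a t x 1 x 1 temporal
    convolution. m is chosen, as in the R(2+1)D paper, to roughly
    match the full 3D convolution's parameter budget."""
    m = (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)
    spatial = c_in * m * d * d    # 1 x d x d spatial convolution
    temporal = m * c_out * t      # t x 1 x 1 temporal convolution
    return spatial + temporal

# A typical 3 x 3 x 3 kernel with 64 input and 64 output channels.
print(full_3d_params(64, 64, 3, 3), r2plus1d_params(64, 64, 3, 3))
```

For the 64 → 64 channel, 3 × 3 × 3 case this gives m = 144 and exactly the 110,592 parameters of the full 3D convolution, while inserting an extra nonlinearity between the two convolutions; that added nonlinearity and easier optimization are what the paper credits for the gains.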
Table 4  Performance comparison of other notable behavior recognition networks on UCF-101 and HMDB51
| Method | Pre-training dataset | UCF-101 | HMDB51 |
| --- | --- | --- | --- |
| TDD | ImageNet | 90.3 | 63.2 |
| TDD + IDT | ImageNet | 91.5 | 65.9 |
| LTC | Sports-1M | 91.7 | 64.8 |
| LTC + IDT | Sports-1M | 92.7 | 67.2 |
| Key-volume mining deep framework | ImageNet | 93.1 | 63.3 |
| AdaScan | ImageNet | 89.4 | 54.9 |
| AdaScan + IDT | ImageNet | 91.3 | 61.0 |
| AdaScan + IDT + C3D | ImageNet | 93.2 | 66.9 |
| RNN-FV (C3D + VGG-CCA) | — | 88.01 | 54.33 |
| RNN-FV (C3D + VGG-CCA) + IDT | — | 94.08 | 67.71 |
| DOVF | ImageNet | 94.9 | 71.7 |
| DOVF + MIFS | ImageNet | 95.3 | 75.0 |
| TVNet | — | 94.35 | 71.0 |
| TVNet + IDT | — | 95.4 | 72.6 |
| Four-Stream | ImageNet | 95.5 | 72.5 |
| Four-Stream + IDT | ImageNet | 96.0 | 74.9 |
| DTPP | ImageNet | 95.8 | 74.8 |
| DTPP + MIFS | ImageNet | 96.1 | 76.3 |
| DTPP + IDT | ImageNet | 96.2 | 75.3 |
| DTPP | Kinetics | 98.0 | 82.1 |
[1] | SCHULDT C , LAPTEV I , CAPUTO B . Recognizing human actions:a local SVM approach[C]// 17th International Conference on Pattern Recognition(ICPR),Aug 23-26,2004,Cambridge,UK. Piscataway:IEEE Press, 2004: 32-36. |
[2] | BLANK M , GORELICK L , SHECHTMAN E ,et al. Actions as space-time shapes[C]// 10th IEEE International Conference on Computer Vision(ICCV),Oct 17-21,2005,Beijing,China. Piscataway:IEEE Press, 2005: 1395-1402. |
[3] | GORELICK L , BLANK M , SHECHTMAN E ,et al. Actions as space-time shapes[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007,29(12): 2247-2253. |
[4] | MARSZALEK M , LAPTEV I , SCHMID C . Actions in context[C]// 22nd IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jun 20-25,2009,Florida,USA. Piscataway:IEEE Press, 2009: 2929-2936. |
[5] | NIEBLES J C , CHEN C W , LI F F . Modeling temporal structure of decomposable motion segments for activity classification[C]// 11th European Conference on Computer Vision (ECCV),Sep 5-11,2010,Heraklion,Crete,Greece. Berlin:Springer Verlag, 2010: 392-405. |
[6] | KUEHNE H , JHUANG H , GARROTE E ,et al. HMDB:a large video database for human motion recognition[C]// 16th IEEE International Conference on Computer Vision(ICCV),Nov 6-13,2011,Barcelona,Spain. Piscataway:IEEE Press, 2011: 2556-2563. |
[7] | LIU J G , LUO J B , SHAH M . Recognizing realistic actions from videos[C]// 22nd IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jun 20-25,2009,Florida,USA. Piscataway:IEEE Press, 2009: 1996-2003. |
[8] | REDDY K K , SHAH M . Recognizing 50 human action categories of Web videos[J]. Machine Vision and Applications, 2013,24(5): 971-981. |
[9] | SOOMRO K , ZAMIR A R , SHAH M . UCF101:a dataset of 101 human actions classes from videos in the wild[J]. arXiv:1212.0402, 2012. |
[10] | RODRIGUEZ M D , AHMED J , SHAH M . Action MACH:a spatio-temporal maximum average correlation height filter for action recognition[C]// 21st IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jun 24-26,2008,Anchorage,Alaska,USA. Piscataway:IEEE Press, 2008: 1-8. |
[11] | KARPATHY A , TODERICI G , SHETTY S ,et al. Large-scale video classification with convolutional neural networks[C]// 27th IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jun 23-28,2014,Columbus,USA. Piscataway:IEEE Press, 2014: 1725-1732. |
[12] | KAY W , CARREIRA J , SIMONYAN K ,et al. The kinetics human action video dataset[J]. arXiv:1705.06950, 2017. |
[13] | MONFORT M , ANDONIAN A , ZHOU B ,et al. Moments in time dataset:one million videos for event understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019(3): 1-9. |
[14] | XU W R , MIAO Z J , TIAN Y . A novel mid-level distinctive feature learning for action recognition via diffusion map[J]. Neurocomputing, 2016(218): 185-196. |
[15] | TONG M , WANG H Y , TIAN W J ,et al. Action recognition new framework with robust 3D-TCCHOGAC and 3D-HOOFGAC[J]. Multimedia Tools and Applications, 2017,76(2): 3011-3030. |
[16] | VISHWAKARMA D K , KAPOOR R , DHIMAN A . Unified framework for human activity recognition:an approach using spatial edge distribution and transform[J]. AEU-International Journal of Electronics and Communications, 2016,70(3): 341-353. |
[17] | WANG Y , TRAN V , HOAI M . Evolution-preserving dense trajectory descriptors[J]. arXiv:1702.04037, 2017. |
[18] | LI Y W , LI W X , MAHADEVAN V ,et al. VLAD3:encoding dynamics of deep features for action recognition[C]// 29th IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jun 27-30,2016,Las Vegas,USA. Piscataway:IEEE Press, 2016: 1951-1960. |
[19] | ZHU J , ZOU W , ZHU Z . End-to-end video-level representation learning for action recognition[C]// 24th International Conference on Pattern Recognition(ICPR),Aug 20-24,2018,Beijing,China. Piscataway:IEEE Press, 2018: 645-650. |
[20] | SUN Q , LIU H , MA L ,et al. A novel hierarchical bag-of-words model for compact action representation[J]. Neurocomputing, 2016(174): 722-732. |
[21] | IJJINA E P , MOHAN C K . Human action recognition using genetic algorithms and convolutional neural networks[J]. Pattern Recognition, 2016(59): 199-212. |
[22] | MAHASSENI B , TODOROVIC S . Regularizing long short term memory with 3D human-skeleton sequences for action recognition[C]// 29th IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jun 27-30,2016,Las Vegas,USA. Piscataway:IEEE Press, 2016: 3054-3062. |
[23] | ALHARBI N , GOTOH Y . A unified spatio-temporal human body region tracking approach to action recognition[J]. Neurocomputing, 2015(161): 56-64. |
[24] | MAHASSENI B , TODOROVIC S . Regularizing long short term memory with 3D human-skeleton sequences for action recognition[C]// 29th IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jun 27-30,2016,Las Vegas,USA. Piscataway:IEEE Press, 2016: 3054-3062. |
[25] | ZHANG X , BAO Y , ZHANG F ,et al. Qiniu submission to ActivityNet challenge 2018[J]. arXiv:1806.04391, 2018. |
[26] | LI Y , XU Z , WU Q ,et al. Submission to moments in time challenge 2018[J]. arXiv:1808.03766, 2018. |
[28] | WANG H , KLÄSER A , SCHMID C ,et al. Dense trajectories and motion boundary descriptors for action recognition[J]. International Journal of Computer Vision, 2013,103(1): 60-79. |
[29] | WANG H , SCHMID C . Action recognition with improved trajectories[C]// 18th IEEE International Conference on Computer Vision(ICCV),Dec 1-8,2013,Sydney,Australia. Piscataway:IEEE Press, 2013: 3551-3558. |
[30] | LECUN Y , BOTTOU L , BENGIO Y ,et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998,86(11): 2278-2324. |
[31] | KRIZHEVSKY A , SUTSKEVER I , HINTON G E . ImageNet classification with deep convolutional neural networks[C]// 25th Annual Conference on Neural Information Processing Systems,Dec 3-6,2012,Lake Tahoe,USA. Massachusetts:MIT Press, 2012: 1106-1114. |
[32] | SIMONYAN K , ZISSERMAN A . Very deep convolutional networks for large-scale image recognition[C]// 3rd International Conference on Learning Representations(ICLR),May 7-9,2015,San Diego,USA. New York:AMC Press, 2015: 1-14. |
[33] | HE K , ZHANG X , REN S ,et al. Deep residual learning for image recognition[C]// 29th IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jun 26-Jul 1,2016,Las Vegas,USA. Piscataway:IEEE Press, 2016: 770-778. |
[34] | SZEGEDY C , LIU W , JIA Y ,et al. Going deeper with convolutions[C]// 28th IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jun 7-12,2015,Boston,USA. Piscataway:IEEE Press, 2015: 7-12. |
[35] | ARIF S , WANG J , HASSAN U T ,et al. 3D-CNN-based fused feature maps with LSTM applied to action recognition[J]. Future Internet, 2019,11(2):42. |
[36] | JI S , XU W , YANG M ,et al. 3D convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013,35(1): 221-231. |
[37] | NG Y H , HAUSKNECHT M , VIJAYANARASIMHAN S ,et al. Beyond short snippets:deep networks for video classification[C]// 28th IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jun 7-12,2015,Boston,USA. Piscataway:IEEE Press, 2015: 4694-4702. |
[38] | LIU Z , HU H F . Spatiotemporal relation networks for video action recognition[J]. IEEE Access, 2019(7): 14969-14976. |
[39] | BACCOUCHE M , MAMALET F , WOLF C ,et al. Sequential deep learning for human action recognition[C]// 2nd International Conference on Human Behavior Understanding(HBU),Nov 16,2011,Amsterdam,Netherlands. Berlin:Springer Verlag, 2011: 29-39. |
[40] | DONAHUE J , HENDRICKS L A , ROHRBACH M ,et al. Long-term recurrent convolutional networks for visual recognition and description[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014,39(4): 677-691. |
[41] | ILG E , MAYER N , SAIKIA T ,et al. FlowNet 2.0:evolution of optical flow estimation with deep networks[C]// 30th IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jul 21-26,2017,Honolulu,USA. Piscataway:IEEE Press, 2017: 1467-1655. |
[42] | FISCHER P , DOSOVITSKIY A , ILG E ,et al. FlowNet:learning optical flow with convolutional networks[C]// 20th IEEE International Conference on Computer Vision(ICCV),Dec 11-18,2015,Santiago,Chile. Piscataway:IEEE Press, 2015: 2758-2766. |
[43] | YE H , WU Z , ZHAO R W ,et al. Evaluating Two-Stream CNN for Video Classification[C]// 5th ACM on International Conference on Multimedia Retrieval(ICMR),Jun 23-26,2015,Shanghai,China. New York:ACM, 2015: 435-442. |
[44] | WU Z , WANG X , JIANG Y G ,et al. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification[C]// 23rd ACM Multimedia Conference,Oct 26-30,2015,Brisbane,Australia. New York:ACM Press, 2015: 461-470. |
[45] | WU Z , JIANG Y G , WANG X ,et al. Multi-stream multi-class fusion of deep networks for video classification[C]// 24th ACM Multimedia Conference,Oct 15-19,2016,Amsterdam,Netherlands. New York:ACM Press, 2016: 791-800. |
[46] | LONG X , GAN C , MELO G D ,et al. Attention clusters:Purely attention based local feature integration for video classification[C]// 31st IEEE Conference on Computer Vision and Pattern Recognition (CVPR),Jun 18-22,2018,Salt Lake,USA. Piscataway:IEEE Press, 2018: 7834-7843. |
[47] | JIANG Y G , WU Z , TANG J ,et al. Modeling multimodal clues in a hybrid deep learning framework for video classification[J]. IEEE Transactions on Multimedia, 2018,20(11): 3137-3147. |
[48] | SIMONYAN K , ZISSERMAN A . Two-stream convolutional networks for action recognition in videos[C]// 28th Annual Conference on Neural Information Processing Systems(NIPS),Dec 8-13,2014,Montreal,Canada. Massachusetts:MIT Press, 2014: 568-576. |
[49] | FEICHTENHOFER C , PINZ A , ZISSERMAN A . Convolutional two-stream network fusion for video action recognition[C]// 29th IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jun 27-30,2016,Las Vegas,USA. Piscataway:IEEE Press, 2016: 1933-1941. |
[50] | WANG L , XIONG Y , WANG Z ,et al. Temporal segment networks:towards good practices for deep action recognition[C]// 14th European Conference on Computer Vision(ECCV),Oct 8-16,2016,Amsterdam,Netherlands. Berlin:Springer Verlag, 2016: 20-36. |
[51] | LAN Z , ZHU Y , HAUPTMANN A G . Deep local video feature for action recognition[C]// 30th IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jul 21-26,2017,Honolulu,USA. Piscataway:IEEE Press, 2017: 1219-1225. |
[52] | ZHOU B , ANDONIAN A , TORRALBA A . Temporal relational reasoning in videos[C]// 15th European Conference on Computer Vision(ECCV),Sep 8-14,2018,Munich,Germany. Berlin:Springer Verlag, 2018: 831-846. |
[53] | DIBA A , SHARMA V , VAN GOOL L . Deep temporal linear encoding networks[C]// 30th IEEE Conference on Computer Vision and Pattern Recognition(CVPR),July 21-26,2017,Honolulu,USA. Piscataway:IEEE Press, 2017: 1541-1550. |
[54] | TRAN D , BOURDEV L , FERGUS R ,et al. Learning spatiotemporal features with 3D convolutional networks[C]// 20th IEEE International Conference on Computer Vision(ICCV),Dec 11-18,2015,Santiago,Chile. Piscataway:IEEE Press, 2015: 4489-4497. |
[55] | SUN L , JIA K , YEUNG D Y ,et al. Human action recognition using factorized spatio-temporal convolutional networks[C]// 20th IEEE International Conference on Computer Vision(ICCV),Dec 11-18,2015,Santiago,Chile. Piscataway:IEEE Press, 2015: 4597-4605. |
[56] | QIU Z , YAO T , MEI T . Learning Spatio-temporal representation with pseudo-3D residual networks[C]// 22nd IEEE International Conference on Computer Vision(ICCV),Oct 22-29,2017,Venice,Italy. Piscataway:IEEE Press, 2017: 5534-5542. |
[57] | DIBA A , FAYYAZ M , SHARMA V ,et al. Temporal 3D ConvNets:new architecture and transfer learning for video classification[J]. arXiv:1711.08200, 2017. |
[58] | CARREIRA J , ZISSERMAN A . Quo Vadis,action recognition? A new model and the kinetics dataset[C]// 30th IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jul 21-26,2017,Honolulu,USA. Piscataway:IEEE Press, 2017: 6299-6308. |
[59] | TRAN D , WANG H , TORRESANI L ,et al. A closer look at spatiotemporal convolutions for action recognition[C]// 31st IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jun 18-22,2018,Salt Lake,USA. Piscataway:IEEE Press, 2018: 6450-6459. |
[60] | FAN L , HUANG W , GAN C ,et al. End-to-end learning of motion representation for video understanding[C]// 31st IEEE Conference on Computer Vision and Pattern Recognition (CVPR),Jun 18-22,2018,Salt Lake,USA. Piscataway:IEEE Press, 2018: 6016-6025. |
[61] | ZHU W , HU J , SUN G ,et al. A key volume mining deep framework for action recognition[C]// 29th IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jun 27-30,2016,Las Vegas,USA. Piscataway:IEEE Press, 2016: 1991-1999. |
[62] | KAR A , RAI N , SIKKA K ,et al. AdaScan:adaptive scan pooling in deep convolutional neural networks for human action recognition in Videos[C]// 30th IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jul 21-26,2017,Honolulu,USA. Piscataway:IEEE Press, 2017: 5699-5708. |
[63] | ZHU Y , LAN Z , NEWSAM S ,et al. Hidden two-stream convolutional networks for action recognition[C]// 14th Asian Conference on Computer Vision(ACCV),Dec 2-6,2018,Perth,Australia. Berlin:Springer Verlag, 2018: 363-378. |
[64] | WANG L , XIONG Y , WANG Z ,et al. Towards good practices for very deep two-stream ConvNets[J]. arXiv:1507.02159, 2015. |
[65] | FEICHTENHOFER C , PINZ A , WILDES R P . Spatiotemporal residual networks for video action recognition[C]// 30th Conference and Workshop on Neural Information Processing Systems (NIPS),Dec 5-10,2016,Barcelona,Spain.[S.l.:s.n]. 2016: 3476-3484. |
[66] | WANG Y , LONG M , WANG J ,et al. Spatiotemporal pyramid network for video action recognition[C]// 30th IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jul 21-26,2017,Honolulu,USA. Piscataway:IEEE Press, 2017: 2097-2106. |
[67] | FEICHTENHOFER C , PINZ A , WILDES R P . Spatiotemporal multiplier networks for video action recognition[C]// 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR),Jul 21-26,2017,Honolulu,USA. Piscataway:IEEE Press, 2017: 7445-7454. |
[68] | OUYANG X , XU S J , ZHANG C Y ,et al. A 3D-CNN and LSTM based multi-task learning architecture for action recognition[J]. IEEE Access, 2019(7): 40757-40770. |
[69] | WANG L , QIAO Y , TANG X . Action recognition with trajectory-pooled deep-convolutional descriptors[C]// 28th IEEE Conference on Computer Vision and Pattern Recognition(CVPR),Jun 7-12,2015,Boston,USA. Piscataway:IEEE Press, 2015: 4305-4314. |
[70] | VAROL G , LAPTEV I , SCHMID C . Long-term temporal convolutions for action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018: 1510-1517. |
[71] | LEV G , SADEH G , KLEIN B ,et al. RNN fisher vectors for action recognition and image annotation[C]// 14th European Conference on Computer Vision(ECCV),Oct 8-16,2016,Amsterdam,Netherlands. Berlin:Springer Verlag, 2016: 833-850. |
[72] | BILEN H , FERNANDO B , GAVVES E ,et al. Action recognition with dynamic image networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018,40(12): 2799-2813. |