Action recognition method based on fusion of skeleton and apparent features

doi:10.11959/j.issn.1000-436x.2022020

Abstract

Traditional skeleton-feature-based action recognition algorithms have difficulty distinguishing similar actions. To address this, an action recognition method based on the fusion of deep joint features and handcrafted apparent features was proposed. First, the spatial positions of the joints and their constraints were input into a long short-term memory (LSTM) model equipped with a spatio-temporal attention mechanism to obtain spatio-temporally weighted, highly separable deep joint features. Next, heat maps were introduced to locate the key frames and key joints, and handcrafted apparent features were extracted around the key joints as an effective complement to the deep joint features. Finally, the apparent features and the deep skeleton features were fused frame by frame to discriminate similar actions effectively. Simulation results show that, compared with state-of-the-art action recognition methods, the proposed method distinguishes similar actions effectively and clearly improves recognition accuracy.
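The attention weighting described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the LSTM itself is omitted, each joint is reduced to a single scalar feature, the raw attention scores are supplied by hand, and softmax normalization of both the spatial and the temporal weights is an assumption. The function names `softmax` and `attend` are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(frames, joint_scores, frame_scores):
    """Weight joints within each frame, then weight the frames.

    frames[t][j]       : feature of joint j at frame t (hypothetical scalar)
    joint_scores[t][j] : raw spatial attention score of joint j at frame t
    frame_scores[t]    : raw temporal attention score of frame t
    """
    beta = softmax(frame_scores)  # temporal weights over frames (assumed softmax)
    pooled = 0.0
    for t, frame in enumerate(frames):
        alpha = softmax(joint_scores[t])  # spatial weights over joints
        frame_feature = sum(a * x for a, x in zip(alpha, frame))
        pooled += beta[t] * frame_feature
    return pooled


# Two frames, two joints each; the first joint is scored as far more important.
frames = [[1.0, 3.0], [5.0, 7.0]]
focused = attend(frames, [[100.0, 0.0], [100.0, 0.0]], [0.0, 0.0])
uniform = attend(frames, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
```

With uniform scores the result is the plain average of all joint features; with a dominant score on one joint, that joint's features dominate the pooled value, which is the intended effect of attention.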

Key words: action recognition, LSTM, spatio-temporal attention mechanism, skeleton joint, apparent feature

CLC Number: TP391

Hongyan WANG, Hai YUAN. Action recognition method based on fusion of skeleton and apparent features[J]. Journal on Communications, 2022, 43(1): 138-148.

Figures/Tables 11

Data	Feature	Method	Cross subject	Cross view
Skeleton sequence	Handcrafted	LARP	50.08%	52.76%
Skeleton sequence	Handcrafted	Dynamic skeletons	60.23%	65.22%
Skeleton sequence	CNN	Multi temporal 3D CNN	66.85%	72.58%
Skeleton sequence	LSTM	ST-LSTM+Trust Gate	69.20%	77.70%
Skeleton sequence	RNN	Two-Stream RNN	71.30%	79.50%
Skeleton sequence	CNN	TSRJI	73.30%	80.30%
Skeleton sequence	LSTM	STA-LSTM	73.40%	81.20%
Skeleton sequence	LSTM	DS-LSTM	77.80%	87.33%
Skeleton sequence	CNN	Fuzzy fusion+CNN	84.22%	89.71%
Skeleton sequence	LSTM/Handcrafted	Proposed method	88.73%	90.01%

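Data	Feature	Method	Accuracy
Skeleton sequence	Handcrafted	HOJ3D	54.50%
Skeleton sequence	Handcrafted	LARP	74.20%
Skeleton sequence	RNN/LSTM	HBRNN-L	78.52%
Skeleton sequence	CNN	Multi-view dynamics+CNN	84.20%
Skeleton sequence	LSTM/Handcrafted	Proposed method	85.73%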

Method	Accuracy
Joint Feature	86.90%
Co-occurrence Feature	90.40%
STA-LSTM	91.50%
ST-LSTM+Trust Gate	93.30%
Two-Stream RNN	94.80%
Proposed method	95.46%

Dataset	STA-LSTM	STA-SC-LSTM	Two-stream fusion
NTU (cross subject)	73.40%	75.83%	88.73%
NTU (cross view)	81.20%	82.72%	90.01%
Northwestern-UCLA	-	77.58%	85.73%
SBU	91.50%	92.33%	95.46%
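To make the ablation above easier to read, the gain of two-stream fusion over STA-SC-LSTM can be computed in percentage points. The snippet below is only an arithmetic aid; the accuracies are copied from the table, and the names `ablation` and `gain_points` are hypothetical.

```python
# Accuracies (%) copied from the table: (STA-SC-LSTM, two-stream fusion).
ablation = {
    "NTU (cross subject)": (75.83, 88.73),
    "NTU (cross view)": (82.72, 90.01),
    "Northwestern-UCLA": (77.58, 85.73),
    "SBU": (92.33, 95.46),
}


def gain_points(before, after):
    """Absolute accuracy gain in percentage points, rounded to 2 decimals."""
    return round(after - before, 2)


gains = {name: gain_points(b, a) for name, (b, a) in ablation.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```

The gain is largest on the NTU cross-subject split and smallest on SBU, where the single-stream baseline is already above 92%.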

Two-stream fusion weights	NTU (cross subject)	NTU (cross view)	Northwestern-UCLA	SBU
(0.4, 0.6)	75.42%	79.93%	75.21%	93.84%
(0.5, 0.5)	84.07%	85.53%	82.29%	93.84%
(0.6, 0.4)	88.73%	90.01%	85.73%	95.46%
(0.7, 0.3)	81.34%	84.15%	81.91%	93.06%

