Video description method based on multidimensional and multimodal information

doi:10.11959/j.issn.1000-436x.2020037

Abstract

To address the difficulty of representing complex information in automatic video description, a multidimensional and multimodal visual feature extraction and fusion method is proposed. First, multidimensional features of the video sequence, such as static and dynamic attributes, are extracted through transfer learning, and an image description algorithm is used to extract the semantic information of the key frames in the video; together these steps make up the video feature extraction stage. Then, multi-layer long short-term memory (LSTM) networks fuse the multidimensional and multimodal information and generate a natural-language description of the video content. Experimental results on the MSVD and MSR-VTT datasets show that the proposed method outperforms the compared existing methods on the automatic video description task.

Keywords: video description, multimodal, transfer learning, long short-term memory network

CLC Number: TP391.4

Enjie DING, Zhongyu LIU, Yafeng LIU, Wanli YU. Video description method based on multidimensional and multimodal information[J]. Journal on Communications, 2020, 41(2): 36-43.

References

[1]	KOJIMA A, IZUMI M, TAMURA T, et al. Generating natural language description of human behavior from video images[C]// 15th International Conference on Pattern Recognition. ICPR, 2000: 728-731.
[2]	ZHAO B, LI X, LU X. CAM-RNN: co-attention model based RNN for video captioning[J]. IEEE Transactions on Image Processing, 2019, 28(11): 5552-5564.
[3]	PARK J, SONG C, HAN J. A study of evaluation metrics and datasets for video captioning[C]// 2017 International Conference on Intelligent Informatics and Biomedical Sciences. ICIIBMS, 2017: 172-175.
[4]	YI B, YANG Y, FUMIN S, et al. Describing video with attention-based bidirectional LSTM[J]. IEEE Transactions on Cybernetics, 2018, 49(7): 1-11.
[5]	KRISHNA R, HATA K, REN F, et al. Dense-captioning events in videos[C]// 2017 IEEE International Conference on Computer Vision. ICCV, 2017: 706-715.
[6]	SHEN Z, LI J, SU Z, et al. Weakly supervised dense video captioning[C]// IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2017: 1916-1924.
[7]	GUADARRAMA S, KRISHNAMOORTHY N, MALKARNENKAR G, et al. Youtube2text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition[C]// IEEE International Conference on Computer Vision. ICCV, 2013: 2712-2719.
[8]	ROHRBACH M, QIU W, TITOV I, et al. Translating video content to natural language descriptions[C]// IEEE International Conference on Computer Vision. IEEE, 2013: 433-440.
[9]	KOJIMA A, TAMURA T, FUKUNAGA K. Natural language description of human activities from video images based on concept hierarchy of actions[J]. International Journal of Computer Vision, 2002, 50(2): 171-184.
[10]	THOMASON J, VENUGOPALAN S, GUADARRAMA S, et al. Integrating language and vision to generate natural language descriptions of videos in the wild[C]// International Conference on Computational Linguistics. ICCL, 2014: 1218-1227.
[11]	JOHNSON M, SCHUSTER M, LE Q V, et al. Google’s multilingual neural machine translation system: enabling zero-shot translation[J]. Transactions of the Association for Computational Linguistics, 2017, 5(2): 339-351.
[12]	VENUGOPALAN S, XU H, DONAHUE J, et al. Translating videos to natural language using deep recurrent neural networks[C]// 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2015: 1494-1504.
[13]	VENUGOPALAN S, ROHRBACH M, DONAHUE J, et al. Sequence to sequence-video to text[C]// IEEE International Conference on Computer Vision. ICCV, 2015: 4534-4542.
[14]	YAO L, TORABI A, CHO K, et al. Describing videos by exploiting temporal structure[C]// IEEE International Conference on Computer Vision. ICCV, 2015: 4507-4515.
[15]	HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[16]	JIN Q, CHEN J, CHEN S, et al. Describing videos using multi-modal fusion[C]// The 24th ACM International Conference on Multimedia. ACM, 2016: 1087-1091.
[17]	CHEN Y, WANG S, ZHANG W, et al. Less is more: picking informative frames for video captioning[C]// The European Conference on Computer Vision. ECCV, 2018: 358-373.
[18]	CHEN T H, LIAO Y H, CHUANG C Y, et al. Show, adapt and tell: adversarial training of cross-domain image captioner[C]// IEEE International Conference on Computer Vision. ICCV, 2017: 521-530.
[19]	CHEN D L, DOLAN W B. Collecting highly parallel data for paraphrase evaluation[C]// The 49th Annual Meeting of the Association for Computational Linguistics. ACL, 2011: 190-200.
[20]	XU J, MEI T, YAO T, et al. MSR-VTT: a large video description dataset for bridging video and language[C]// IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2016: 5288-5296.
[21]	XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention[J]. Computer Science, 2015, 2(1): 2048-2057.
[22]	DENG J, DONG W, SOCHER R, et al. Imagenet: a large-scale hierarchical image database[C]// IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2009: 248-255.
[23]	ZHOU B, LAPEDRIZA A, KHOSLA A, et al. Places: a 10 million image database for scene recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(6): 1452-1464.
[24]	CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? a new model and the kinetics dataset[C]// The IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2017: 6299-6308.
[25]	IOFFE S, SZEGEDY C. Batch normalization: accelerating deep network training by reducing internal covariate shift[C]// International Conference on Machine Learning. ICML, 2015: 448-456.
[26]	HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]// IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2016: 770-778.
[27]	CHEN X, FANG H, LIN T Y, et al. Microsoft coco captions: data collection and evaluation server[J]. arXiv Preprint, arXiv:1504.00325, 2015.
[28]	VEDANTAM R, ZITNICK C L, PARIKH D, et al. Cider: consensus-based image description evaluation[C]// IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2015: 4566-4575.
[29]	RAMANISHKA V, DAS A, PARK D H, et al. Multimodal video description[C]// The 24th ACM International Conference on Multimedia. ACM, 2016: 1092-1096.
[30]	ZHANG X, GAO K, ZHANG Y, et al. Task-driven dynamic fusion: reducing ambiguity in video description[C]// IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2017: 3713-3721.

Method	MSVD METEOR	MSVD BLEU4	MSVD ROUGE-L	MSVD CIDEr	MSR-VTT METEOR	MSR-VTT BLEU4	MSR-VTT ROUGE-L	MSR-VTT CIDEr
MPool^[12]	29.1	33.3	-	-	23.7	30.4	52	35
SA^[14]	29.6	41.9	-	51.7	25	28.5	53.3	37.1
S2VT^[13]	29.8	-	-	-	25.7	31.4	55.9	35.2
VidLAB^[29]	-	-	-	-	27.7	39.1	60.6	44.4
TDDF^[30]	27.8	37.3	59.2	43.8	-	-	-	-
PickNet^[17]	33.1	46.1	62.9	76	27.2	38.9	59.5	42.1
Proposed method	33.6	46.7	65	76.8	28.5	39.3	61.2	44.6

Dataset	Video	Original manual annotations	Description generated by the proposed method
MSVD		A group of children on stage. A group of children reciting a poem onstage. A group of children singing a song on a stage.	A group of children sang to the audience on a stage in a studio.
MSVD		A man gettin flushed in the water. A man is going into a tunnel. A man is playing in pool.	A man slides down a swimming slide.
MSVD		A group of men run a track race. Men are running a race. Athletes are running down the track. Men are racing around a track.	A group of athletes raced on a playground.
MSR-VTT		A man is constructing a model. A man is standing. A workplace environment people doing work.	A group of people were doing experiments in a laboratory.
MSR-VTT		A lecturer is talking to his classroom. A lot of people are waiting for a lecture. A man is giving lecture in class.	A teacher was teaching a group of students in a classroom.
MSR-VTT		Man in wheelchair with broken arm and leg sits near to a table with movable top. The man in the wheel chair is setting up a prank. where the bow on the table will spill on him.	A nanny was serving a man with injured arms and legs.
