[1] KOJIMA A, IZUMI M, TAMURA T, et al. Generating natural language description of human behavior from video images[C]// 15th International Conference on Pattern Recognition. ICPR, 2000: 728-731.
[2] ZHAO B, LI X, LU X. CAM-RNN: co-attention model based RNN for video captioning[J]. IEEE Transactions on Image Processing, 2019, 28(11): 5552-5564.
[3] PARK J, SONG C, HAN J. A study of evaluation metrics and datasets for video captioning[C]// 2017 International Conference on Intelligent Informatics and Biomedical Sciences. ICIIBMS, 2017: 172-175.
[4] BIN Y, YANG Y, SHEN F, et al. Describing video with attention-based bidirectional LSTM[J]. IEEE Transactions on Cybernetics, 2018, 49(7): 1-11.
[5] KRISHNA R, HATA K, REN F, et al. Dense-captioning events in videos[C]// 2017 IEEE International Conference on Computer Vision. ICCV, 2017: 706-715.
[6] SHEN Z, LI J, SU Z, et al. Weakly supervised dense video captioning[C]// IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2017: 1916-1924.
[7] GUADARRAMA S, KRISHNAMOORTHY N, MALKARNENKAR G, et al. YouTube2Text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition[C]// IEEE International Conference on Computer Vision. ICCV, 2013: 2712-2719.
[8] ROHRBACH M, QIU W, TITOV I, et al. Translating video content to natural language descriptions[C]// IEEE International Conference on Computer Vision. ICCV, 2013: 433-440.
[9] KOJIMA A, TAMURA T, FUKUNAGA K. Natural language description of human activities from video images based on concept hierarchy of actions[J]. International Journal of Computer Vision, 2002, 50(2): 171-184.
[10] THOMASON J, VENUGOPALAN S, GUADARRAMA S, et al. Integrating language and vision to generate natural language descriptions of videos in the wild[C]// International Conference on Computational Linguistics. COLING, 2014: 1218-1227.
[11] JOHNSON M, SCHUSTER M, LE Q V, et al. Google's multilingual neural machine translation system: enabling zero-shot translation[J]. Transactions of the Association for Computational Linguistics, 2017, 5: 339-351.
[12] VENUGOPALAN S, XU H, DONAHUE J, et al. Translating videos to natural language using deep recurrent neural networks[C]// 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. NAACL-HLT, 2015: 1494-1504.
[13] VENUGOPALAN S, ROHRBACH M, DONAHUE J, et al. Sequence to sequence - video to text[C]// IEEE International Conference on Computer Vision. ICCV, 2015: 4534-4542.
[14] YAO L, TORABI A, CHO K, et al. Describing videos by exploiting temporal structure[C]// IEEE International Conference on Computer Vision. ICCV, 2015: 4507-4515.
[15] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[16] JIN Q, CHEN J, CHEN S, et al. Describing videos using multi-modal fusion[C]// The 24th ACM International Conference on Multimedia. ACM, 2016: 1087-1091.
[17] CHEN Y, WANG S, ZHANG W, et al. Less is more: picking informative frames for video captioning[C]// The European Conference on Computer Vision. ECCV, 2018: 358-373.
[18] CHEN T H, LIAO Y H, CHUANG C Y, et al. Show, adapt and tell: adversarial training of cross-domain image captioner[C]// IEEE International Conference on Computer Vision. ICCV, 2017: 521-530.
[19] CHEN D L, DOLAN W B. Collecting highly parallel data for paraphrase evaluation[C]// The 49th Annual Meeting of the Association for Computational Linguistics. ACL, 2011: 190-200.
[20] XU J, MEI T, YAO T, et al. MSR-VTT: a large video description dataset for bridging video and language[C]// IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2016: 5288-5296.
[21] XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention[C]// International Conference on Machine Learning. ICML, 2015: 2048-2057.
[22] DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]// IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2009: 248-255.
[23] ZHOU B, LAPEDRIZA A, KHOSLA A, et al. Places: a 10 million image database for scene recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(6): 1452-1464.
[24] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the Kinetics dataset[C]// The IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2017: 6299-6308.
[25] IOFFE S, SZEGEDY C. Batch normalization: accelerating deep network training by reducing internal covariate shift[C]// International Conference on Machine Learning. ICML, 2015: 448-456.
[26] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]// IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2016: 770-778.
[27] CHEN X, FANG H, LIN T Y, et al. Microsoft COCO captions: data collection and evaluation server[J]. arXiv preprint, arXiv:1504.00325, 2015.
[28] VEDANTAM R, ZITNICK C L, PARIKH D, et al. CIDEr: consensus-based image description evaluation[C]// IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2015: 4566-4575.
[29] RAMANISHKA V, DAS A, PARK D H, et al. Multimodal video description[C]// The 24th ACM International Conference on Multimedia. ACM, 2016: 1092-1096.
[30] ZHANG X, GAO K, ZHANG Y, et al. Task-driven dynamic fusion: reducing ambiguity in video description[C]// IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2017: 3713-3721.