Journal on Communications ›› 2020, Vol. 41 ›› Issue (2): 36-43. doi: 10.11959/j.issn.1000-436x.2020037

• Topics: Intelligent Mine

Video description method based on multidimensional and multimodal information

Enjie DING1, Zhongyu LIU1, Yafeng LIU1, Wanli YU2

  1 IoT/Perception Mine Research Center, China University of Mining & Technology, Xuzhou 221008, China
  2 Institute of Electrodynamics and Microelectronics, University of Bremen, Bremen 28359, Germany
  • Revised: 2020-01-14  Online: 2020-02-25  Published: 2020-03-09
  • Supported by:
    The National Key Research and Development Program of China (2017YFC0804400); The National Key Research and Development Program of China (2017YFC0804401)

Abstract:

To address the difficulty of representing complex information in automatic video description, a multi-dimensional, multi-modal visual feature extraction and fusion method is proposed. First, multi-dimensional features of the video sequence, such as its static and dynamic attributes, are extracted by transfer learning, and an image description algorithm is used to extract semantic information from the key frames of the video; together these make up the video feature representation. Then, multi-layer long short-term memory (LSTM) networks fuse the multi-dimensional and multi-modal information and generate a natural-language description of the video content. Experimental results show that the proposed method achieves better results than existing methods on the automatic video description task.
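
The fusion step can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the fusion rule, the feature sizes, or the number of layers, so the per-step concatenation in `fuse_by_concatenation`, the two-layer stack, the toy dimensions, and the random (untrained) weights are all assumptions made for illustration. The example only shows how static, dynamic, and semantic feature sequences could be combined and passed through stacked LSTM cells to obtain a fixed-size video encoding.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """Minimal LSTM cell with randomly initialised (untrained) weights."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n = input_size + hidden_size
        # One weight matrix and bias per gate: input, forget, candidate, output.
        self.w = {g: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                      for _ in range(hidden_size)] for g in "ifgo"}
        self.b = {g: [0.0] * hidden_size for g in "ifgo"}

    def _gate(self, g, z, act):
        return [act(sum(w * v for w, v in zip(row, z)) + b)
                for row, b in zip(self.w[g], self.b[g])]

    def step(self, x, h, c):
        z = x + h  # concatenate the input with the previous hidden state
        i = self._gate("i", z, sigmoid)
        f = self._gate("f", z, sigmoid)
        g = self._gate("g", z, math.tanh)
        o = self._gate("o", z, sigmoid)
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def fuse_by_concatenation(static, dynamic, semantic):
    """Concatenate the three per-step feature vectors (assumed fusion rule)."""
    return [s + d + m for s, d, m in zip(static, dynamic, semantic)]


def encode(sequence, cells):
    """Run stacked LSTM cells over the sequence; return the top layer's final hidden state."""
    states = [([0.0] * cell.hidden_size, [0.0] * cell.hidden_size) for cell in cells]
    for x in sequence:
        for k, cell in enumerate(cells):
            h, c = cell.step(x, *states[k])
            states[k] = (h, c)
            x = h  # the output of this layer feeds the next layer
    return states[-1][0]


rng = random.Random(0)
T = 4  # number of time steps (hypothetical)
static = [[rng.random() for _ in range(3)] for _ in range(T)]    # stand-in for appearance features
dynamic = [[rng.random() for _ in range(2)] for _ in range(T)]   # stand-in for motion features
semantic = [[rng.random() for _ in range(2)] for _ in range(T)]  # stand-in for key-frame caption features

fused = fuse_by_concatenation(static, dynamic, semantic)
cells = [LSTMCell(7, 5, rng), LSTMCell(5, 5, rng)]  # two layers, sizes chosen arbitrarily
video_code = encode(fused, cells)
print(len(fused[0]), len(video_code))
```

In a full captioning system, a decoder would take an encoding like `video_code` and emit the description word by word; that part, and the pretrained feature extractors obtained by transfer learning, are omitted here.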

Key words: video description, multimodal, transfer learning, long short-term memory network
