[1] KOJIMA A, IZUMI M, TAMURA T, et al. Generating natural language description of human behavior from video images[C]// 15th International Conference on Pattern Recognition. ICPR, 2000: 728-731.
[2] ZHAO B, LI X, LU X. CAM-RNN: co-attention model based RNN for video captioning[J]. IEEE Transactions on Image Processing, 2019, 28(11): 5552-5564.
[3] PARK J, SONG C, HAN J. A study of evaluation metrics and datasets for video captioning[C]// 2017 International Conference on Intelligent Informatics and Biomedical Sciences. ICIIBMS, 2017: 172-175.
[4] BIN Y, YANG Y, SHEN F, et al. Describing video with attention-based bidirectional LSTM[J]. IEEE Transactions on Cybernetics, 2018, 49(7): 1-11.
[5] KRISHNA R, HATA K, REN F, et al. Dense-captioning events in videos[C]// 2017 IEEE International Conference on Computer Vision. ICCV, 2017: 706-715.
[6] SHEN Z, LI J, SU Z, et al. Weakly supervised dense video captioning[C]// IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2017: 1916-1924.
[7] GUADARRAMA S, KRISHNAMOORTHY N, MALKARNENKAR G, et al. YouTube2Text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition[C]// IEEE International Conference on Computer Vision. ICCV, 2013: 2712-2719.
[8] ROHRBACH M, QIU W, TITOV I, et al. Translating video content to natural language descriptions[C]// IEEE International Conference on Computer Vision. ICCV, 2013: 433-440.
[9] KOJIMA A, TAMURA T, FUKUNAGA K. Natural language description of human activities from video images based on concept hierarchy of actions[J]. International Journal of Computer Vision, 2002, 50(2): 171-184.
[10] THOMASON J, VENUGOPALAN S, GUADARRAMA S, et al. Integrating language and vision to generate natural language descriptions of videos in the wild[C]// International Conference on Computational Linguistics. COLING, 2014: 1218-1227.
[11] JOHNSON M, SCHUSTER M, LE Q V, et al. Google's multilingual neural machine translation system: enabling zero-shot translation[J]. Transactions of the Association for Computational Linguistics, 2017, 5: 339-351.
[12] VENUGOPALAN S, XU H, DONAHUE J, et al. Translating videos to natural language using deep recurrent neural networks[C]// 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. NAACL-HLT, 2015: 1494-1504.
[13] VENUGOPALAN S, ROHRBACH M, DONAHUE J, et al. Sequence to sequence - video to text[C]// IEEE International Conference on Computer Vision. ICCV, 2015: 4534-4542.
[14] YAO L, TORABI A, CHO K, et al. Describing videos by exploiting temporal structure[C]// IEEE International Conference on Computer Vision. ICCV, 2015: 4507-4515.
[15] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[16] JIN Q, CHEN J, CHEN S, et al. Describing videos using multi-modal fusion[C]// The 24th ACM International Conference on Multimedia. ACM, 2016: 1087-1091.
[17] CHEN Y, WANG S, ZHANG W, et al. Less is more: picking informative frames for video captioning[C]// The European Conference on Computer Vision. ECCV, 2018: 358-373.
[18] CHEN T H, LIAO Y H, CHUANG C Y, et al. Show, adapt and tell: adversarial training of cross-domain image captioner[C]// IEEE International Conference on Computer Vision. ICCV, 2017: 521-530.
[19] CHEN D L, DOLAN W B. Collecting highly parallel data for paraphrase evaluation[C]// The 49th Annual Meeting of the Association for Computational Linguistics. ACL, 2011: 190-200.
[20] XU J, MEI T, YAO T, et al. MSR-VTT: a large video description dataset for bridging video and language[C]// IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2016: 5288-5296.
[21] XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention[C]// International Conference on Machine Learning. ICML, 2015: 2048-2057.
[22] DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]// IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2009: 248-255.
[23] ZHOU B, LAPEDRIZA A, KHOSLA A, et al. Places: a 10 million image database for scene recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(6): 1452-1464.
[24] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the Kinetics dataset[C]// The IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2017: 6299-6308.
[25] IOFFE S, SZEGEDY C. Batch normalization: accelerating deep network training by reducing internal covariate shift[C]// International Conference on Machine Learning. ICML, 2015: 448-456.
[26] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]// IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2016: 770-778.
[27] CHEN X, FANG H, LIN T Y, et al. Microsoft COCO captions: data collection and evaluation server[J]. arXiv preprint, arXiv:1504.00325, 2015.
[28] VEDANTAM R, ZITNICK C L, PARIKH D, et al. CIDEr: consensus-based image description evaluation[C]// IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2015: 4566-4575.
[29] RAMANISHKA V, DAS A, PARK D H, et al. Multimodal video description[C]// The 24th ACM International Conference on Multimedia. ACM, 2016: 1092-1096.
[30] ZHANG X, GAO K, ZHANG Y, et al. Task-driven dynamic fusion: reducing ambiguity in video description[C]// IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 2017: 3713-3721.