Fig. 1. Example of video captioning.
Fig. 2. Examples of (a) dynamic and (b) static semantic features.
Fig. 3. Video captioning model.
Fig. 4. The dynamic semantic network (DSN).
Fig. 5. The static semantic network (SSN).
Fig. 6. The caption generation network (CGN).
Fig. 7. Qualitative results on the MSVD dataset: correct captions with relevant semantic features.
Fig. 8. Qualitative results on the MSVD dataset: incorrect captions with relevant semantic features.
Table 1. Performance comparison between two semantic networks on the MSVD dataset
Table 2. Performance comparison among different feature models on the MSVD dataset
Table 3. Performance comparison among different models on the MSR-VTT dataset
Table 4. Performance comparison with other state-of-the-art models on the MSVD dataset
References
- K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778.
- D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015, pp. 4489-4497.
- S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko, "YouTube2Text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 2013, pp. 2712-2719.
- J. Xu, T. Mei, T. Yao, and Y. Rui, "MSR-VTT: a large video description dataset for bridging video and language," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 5288-5296.
- S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, "Sequence to sequence: video to text," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015, pp. 4534-4542.
- Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, "Jointly modeling embedding and translation to bridge video and language," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 4594-4602.
- L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, "Describing videos by exploiting temporal structure," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015, pp. 4507-4515.
- K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, 2015.
- Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng, "Semantic compositional networks for visual captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 1141-1150.
- Y. Pan, T. Yao, H. Li, and T. Mei, "Video captioning with transferred semantic attributes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 984-992.
- Y. Yu, H. Ko, J. Choi, and G. Kim, "End-to-end concept word detection for video captioning, retrieval, and question answering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 3261-3269.
- F. Nian, T. Li, Y. Wang, X. Wu, B. Ni, and C. Xu, "Learning explicit video attributes from mid-level representation for video captioning," Computer Vision and Image Understanding, vol. 163, pp. 126-138, 2017. https://doi.org/10.1016/j.cviu.2017.06.012
- J. Song, Z. Guo, L. Gao, W. Liu, D. Zhang, and H. T. Shen, "Hierarchical LSTM with adjusted temporal attention for video captioning," in Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), Melbourne, Australia, 2017, pp. 2737-2743.
- A. A. Liu, N. Xu, Y. Wong, J. Li, Y. T. Su, and M. Kankanhalli, "Hierarchical & multimodal video captioning: discovering and transferring multimodal knowledge for vision to language," Computer Vision and Image Understanding, vol. 163, pp. 113-125, 2017. https://doi.org/10.1016/j.cviu.2017.04.013
- K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 2002, pp. 311-318.
- R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: consensus-based image description evaluation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 4566-4575.