Acknowledgement
This work was supported by a research grant from Jeonju University in 2022.
References
- J. Zhang and Y. Peng, "Video captioning with object-aware spatio-temporal correlation and aggregation," IEEE Transactions on Image Processing, Vol.29, pp.6209-6222, 2020. https://doi.org/10.1109/TIP.2020.2988435
- B. Pan et al., "Spatio-temporal graph for video captioning with knowledge distillation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.10870-10879, 2020.
- L. Li, X. Gao, J. Deng, Y. Tu, Z. Zha, and Q. Huang, "Long short-term relation transformer with global gating for video captioning," IEEE Transactions on Image Processing, Vol.31, pp.2726-2738, 2022. https://doi.org/10.1109/TIP.2022.3158546
- L. Yao et al., "Describing videos by exploiting temporal structure," in Proceedings of the IEEE International Conference on Computer Vision, pp.4507-4515, 2015.
- Z. Gan et al., "Semantic compositional networks for visual captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.5630-5639, 2017.
- H. Chen, K. Lin, A. Maye, J. Li, and X. Hu, "A semantics-assisted video captioning model trained with scheduled sampling," Frontiers in Robotics and AI, Vol.7, Article 475767, 2020.
- B. Wang, L. Ma, W. Zhang, and W. Liu, "Reconstruction network for video captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.7622-7631, 2018.
- H. Ye, G. Li, Y. Qi, S. Wang, Q. Huang, and M. Yang, "Hierarchical modular network for video captioning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.17939-17948, 2022.
- S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, "Sequence to sequence - video to text," in Proceedings of the IEEE International Conference on Computer Vision, pp.4534-4542, 2015.
- S. Venugopalan et al., "Translating videos to natural language using deep recurrent neural networks," arXiv preprint arXiv:1412.4729, 2014.
- C. Szegedy et al., "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1-9, 2015.
- K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.770-778, 2016.
- D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3d convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, pp.4489-4497, 2015.
- S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, Vol.9, No.8, pp.1735-1780, 1997. https://doi.org/10.1162/neco.1997.9.8.1735
- S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1492-1500, 2017.
- M. Zolfaghari, K. Singh, and T. Brox, "Eco: Efficient convolutional network for online video understanding," in Proceedings of the European Conference on Computer Vision (ECCV), pp.695-712, 2018.
- J. Perez-Martin, B. Bustos, and J. Perez, "Improving video captioning with temporal composition of a visual-syntactic embedding," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.3039-3049, 2021.
- L. Yan et al., "Gl-rg: Global-local representation granularity for video captioning," arXiv preprint arXiv:2205.10706, 2022.
- D. Tran, J. Ray, Z. Shou, S. F. Chang, and M. Paluri, "Convnet architecture search for spatiotemporal feature learning," arXiv preprint arXiv:1708.05038, 2017.
- K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
- B. K. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence, Vol.17, No.1-3, pp.185-203, 1981. https://doi.org/10.1016/0004-3702(81)90024-2
- J. Donahue et al., "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.2625-2634, 2015.
- P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang, "Hierarchical recurrent neural encoder for video representation with application to captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1029-1038, 2016.
- L. Gao, Z. Guo, H. Zhang, X. Xu, and H. Shen, "Video captioning with attention-based lstm and semantic consistency," IEEE Transactions on Multimedia, Vol.19, No.9, pp.2045-2055, 2017. https://doi.org/10.1109/TMM.2017.2729019
- C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.2818-2826, 2016.
- H. Jegou, M. Douze, C. Schmid, and P. Perez, "Aggregating local descriptors into a compact image representation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.3304-3311, 2010.
- T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
- L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, "End-to-end dense video captioning with masked transformer," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.8739-8748, 2018.
- R. Girshick, "Fast r-cnn," in Proceedings of the IEEE International Conference on Computer Vision, pp.1440-1448, 2015.
- T. Wang, R. Zhang, Z. Lu, F. Zheng, R. Cheng, and P. Luo, "End-to-end dense video captioning with parallel decoding," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.6847-6857, 2021.
- V. Ramanishka et al., "Multimodal video description," in Proceedings of the 24th ACM International Conference on Multimedia, pp.1092-1096, 2016.
- B. Logan et al., "Mel frequency cepstral coefficients for music modeling," in Proceedings of the International Symposium on Music Information Retrieval (ISMIR), Plymouth, MA, Vol.270, p.11, 2000.
- Y. Xu, J. Yang, and K. Mao, "Semantic-filtered soft-split-aware video captioning with audio-augmented feature," Neurocomputing, Vol.357, pp.24-35, 2019. https://doi.org/10.1016/j.neucom.2019.05.027
- C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, "Videobert: A joint model for video and language representation learning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.7464-7473, 2019.
- J. Devlin, M. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
- S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification," in Proceedings of the European Conference on Computer Vision (ECCV), pp.305-321, 2018.
- S. Ging, M. Zolfaghari, H. Pirsiavash, and T. Brox, "Coot: Cooperative hierarchical transformer for video-text representation learning," Advances in Neural Information Processing Systems, Vol.33, pp.22605-22618, 2020.
- V. Iashin and E. Rahtu, "Multi-modal dense video captioning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp.958-959, 2020.
- A. Vaswani et al., "Attention is all you need," Advances in Neural Information Processing Systems, Vol.30, 2017.
- J. Carreira and A. Zisserman, "Quo vadis, action recognition? a new model and the kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.6299-6308, 2017.
- S. Hershey et al., "Cnn architectures for large-scale audio classification," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp.131-135, 2017.
- H. Luo et al., "Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning," Neurocomputing, Vol.508, pp.293-304, 2022. https://doi.org/10.1016/j.neucom.2022.07.028
- A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
- A. Radford et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning, PMLR, pp.8748-8763, 2021.
- P. H. Seo, A. Nagrani, A. Arnab, and C. Schmid, "End-to-end generative pretraining for multimodal video captioning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.17959-17968, 2022.
- A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lucic, and C. Schmid, "Vivit: A video vision transformer," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.6836-6846, 2021.
- D. Chen and W. B. Dolan, "Collecting highly parallel data for paraphrase evaluation," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.190-200, 2011.
- J. Xu, T. Mei, T. Yao, and Y. Rui, "Msr-vtt: A large video description dataset for bridging video and language," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.5288-5296, 2016.
- R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles, "Dense-captioning events in videos," in Proceedings of the IEEE International Conference on Computer Vision, pp.706-715, 2017.
- G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta, "Hollywood in homes: Crowdsourcing data collection for activity understanding," in Proceedings of the European Conference on Computer Vision (ECCV), pp.510-526, 2016.
- X. Wang, J. Wu, J. Chen, L. Li, Y. Wang, and W. Wang, "Vatex: A large-scale, high-quality multilingual dataset for video-and-language research," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.4581-4591, 2019.
- A. Rohrbach et al., "Movie description," arXiv preprint, 2016.
- L. Zhou, C. Xu, and J. J. Corso, "Towards automatic learning of procedures from web instructional videos," in Proceedings of the AAAI Conference on Artificial Intelligence, pp.7590-7598, 2018.
- A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, "A dataset for movie description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.3202-3212, 2015.
- S. Pini, M. Cornia, F. Bolelli, L. Baraldi, and R. Cucchiara, "M-vad names: a dataset for video captioning with naming," Multimedia Tools and Applications, Vol.78, pp.14007-14027, 2019. https://doi.org/10.1007/s11042-018-7040-z
- K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, "Bleu: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp.311-318, 2002.
- S. Banerjee and A. Lavie, "Meteor: An automatic metric for mt evaluation with improved correlation with human judgments," in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp.65-72, 2005.
- R. Vedantam, C. L. Zitnick, and D. Parikh, "Cider: Consensus-based image description evaluation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.4566-4575, 2015.
- C. Lin, "Rouge: A package for automatic evaluation of summaries," in Text Summarization Branches Out, pp.74-81, 2004.
- K. Cho et al., "Learning phrase representations using rnn encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
- A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele, "Coherent multi-sentence video description with variable level of detail," in Proceedings of the 36th German Conference on Pattern Recognition (GCPR), Münster, Germany, pp.184-195, 2014.
- C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," in Proceedings of the AAAI Conference on Artificial Intelligence, Vol.31, No.1, 2017.