Estimation of Automatic Video Captioning in Real Applications using Machine Learning Techniques and Convolutional Neural Network

  • Vaishnavi, J (Department of Computer and Information Science, Annamalai University)
  • Narmatha, V (Department of Computer and Information Science, Annamalai University)
  • Received : 2022.09.05
  • Published : 2022.09.30

Abstract

The rapid growth of online video services has allowed them to overtake television media in popularity within a short period. Online videos are preferred partly because captions displayed alongside the scenes improve understandability. Beyond entertainment media, marketing companies and other organizations also use captioned videos for product promotion. Captions serve many purposes, particularly for hearing-impaired and non-native viewers. Research therefore continues on automatically displaying appropriate captions for videos uploaded as shows, movies, educational videos, online classes, website content, and more. This paper addresses two parts. The first part applies a machine learning approach: the videos are preprocessed into frames and resized, and the resized frames are classified into multiple actions after feature extraction, where statistical GLCM features and Hu moments are used as features. The second part applies a deep learning approach, in which a CNN architecture is used to obtain the results. Finally, the two sets of results are compared, and the CNN achieves the best classification accuracy of 96.10%.
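
A minimal sketch (not the authors' exact code) of the preprocessing and hand-crafted feature-extraction stage described above: frames are sampled from a video, resized, and each frame is described by GLCM texture statistics and Hu moments. The video path, sampling interval, frame size, and the choice of downstream classifier are assumptions for illustration only.

import cv2
import numpy as np
from skimage.feature import graycomatrix, graycoprops  # scikit-image >= 0.19

def extract_frames(video_path, every_n=10, size=(128, 128)):
    """Read a video, keep every n-th frame, convert to grayscale, and resize."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            frames.append(cv2.resize(gray, size))
        idx += 1
    cap.release()
    return frames

def frame_features(gray):
    """GLCM contrast/homogeneity/energy/correlation plus the 7 Hu moments."""
    glcm = graycomatrix(gray, distances=[1], angles=[0], levels=256,
                        symmetric=True, normed=True)
    glcm_feats = [graycoprops(glcm, p)[0, 0]
                  for p in ("contrast", "homogeneity", "energy", "correlation")]
    hu = cv2.HuMoments(cv2.moments(gray)).flatten()
    return np.concatenate([glcm_feats, hu])

# Usage (hypothetical file): build a feature matrix for a classical classifier such as an SVM.
# frames = extract_frames("example_video.mp4")
# X = np.vstack([frame_features(f) for f in frames])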
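For the second (deep learning) part, the sketch below shows a small CNN that classifies the resized frames into action categories. The input shape, number of classes, layer sizes, and training settings are assumptions; the paper's reported architecture may differ.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(128, 128, 1), num_classes=5):
    """Stacked conv/pool blocks followed by a dense softmax classifier."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage (hypothetical data): X_train holds resized grayscale frames, y_train the action labels.
# model = build_cnn()
# model.fit(X_train, y_train, epochs=10, validation_split=0.1)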
