Image Caption Generation using Recurrent Neural Network

  • Received : 2016.02.18
  • Accepted : 2016.05.10
  • Published : 2016.08.15

Abstract

Automatic caption generation for an image is a very difficult task because it requires both computer vision and natural language processing technologies. However, the task has many important applications, such as early childhood education, image retrieval, and navigation for the blind. In this paper, we describe a Recurrent Neural Network (RNN) model for generating image captions that takes as input image features extracted by a Convolutional Neural Network (CNN). We demonstrate that our model produces state-of-the-art results in image caption generation experiments on the Flickr 8K, Flickr 30K, and MS COCO datasets.

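As a concrete illustration of the encoder-decoder setup the abstract describes, the sketch below conditions an RNN language model on CNN image features and predicts the caption one word at a time. This is a minimal, hypothetical sketch: the use of PyTorch, the GRU cell, and all names and dimensions are assumptions for illustration, not the authors' implementation.

    import torch
    import torch.nn as nn

    class CaptionRNN(nn.Module):
        # Hypothetical sketch: a GRU decoder conditioned on CNN image features.
        def __init__(self, feat_dim=4096, embed_dim=256, hidden_dim=512, vocab_size=10000):
            super().__init__()
            self.img_proj = nn.Linear(feat_dim, hidden_dim)   # map CNN features (e.g. a VGG fc layer) to the initial state
            self.embed = nn.Embedding(vocab_size, embed_dim)  # word id -> embedding vector
            self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)      # hidden state -> next-word logits

        def forward(self, img_feats, captions):
            # img_feats: (batch, feat_dim); captions: (batch, seq_len) word ids
            h0 = torch.tanh(self.img_proj(img_feats)).unsqueeze(0)  # (1, batch, hidden_dim)
            states, _ = self.gru(self.embed(captions), h0)          # (batch, seq_len, hidden_dim)
            return self.out(states)                                 # (batch, seq_len, vocab_size)

    model = CaptionRNN()
    feats = torch.randn(2, 4096)             # stand-in for CNN features of 2 images
    words = torch.randint(0, 10000, (2, 7))  # stand-in for 2 partial captions
    logits = model(feats, words)             # (2, 7, 10000) next-word scores

At training time, the logits would be scored against the ground-truth next words with a cross-entropy loss; at test time, the caption would be decoded greedily or with beam search, starting from a start-of-sentence token.
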
Acknowledgement

Grant: (Exobrain, Subproject 1) Development of an intelligence-evolving WiseQA platform technology for human knowledge-augmentation services

Supported by: Institute for Information & Communications Technology Promotion (IITP)
