http://dx.doi.org/10.5626/JOK.2016.43.8.878

Image Caption Generation using Recurrent Neural Network  

Lee, Changki (Kangwon National Univ.)
Publication Information
Journal of KIISE / v.43, no.8, 2016, pp. 878-882
Abstract
Automatic generation of captions for an image is a very difficult task, as it requires both computer vision and natural language processing technologies. Nevertheless, the task has many important applications, such as early childhood education, image retrieval, and navigation for the blind. In this paper, we describe a Recurrent Neural Network (RNN) model for generating image captions that takes as input image features extracted by a Convolutional Neural Network (CNN). We demonstrate that our models produce state-of-the-art results in image caption generation experiments on the Flickr 8K, Flickr 30K, and MS COCO datasets.
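As a rough illustration of the CNN-encoder / RNN-decoder pipeline the abstract describes, the sketch below shows one way such a model can be wired up. It is not the paper's implementation: the use of PyTorch, a frozen VGG-16 feature extractor, a GRU decoder, and the embedding/hidden dimensions are all assumptions made for the example.

```python
# Minimal sketch of a captioning model that conditions an RNN decoder on CNN
# image features (assumed architecture, not the paper's exact model).
import torch
import torch.nn as nn
import torchvision.models as models


class CaptionGenerator(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Pretrained CNN used as a fixed image feature extractor
        # (VGG-16 is an assumption; the paper cites the VGG networks).
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        self.cnn = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten(),
                                 *list(vgg.classifier.children())[:-1])
        for p in self.cnn.parameters():
            p.requires_grad = False
        # Project the 4096-d CNN feature to the decoder's initial hidden state.
        self.img_proj = nn.Linear(4096, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # per-step word scores

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, T) word indices
        feats = self.cnn(images)                             # (B, 4096)
        h0 = torch.tanh(self.img_proj(feats)).unsqueeze(0)   # (1, B, hidden_dim)
        emb = self.embed(captions)                           # (B, T, embed_dim)
        hidden, _ = self.rnn(emb, h0)                        # decode conditioned on image
        return self.out(hidden)                              # (B, T, vocab_size) logits
```

Training such a model would minimize per-step cross-entropy against the reference captions; at test time, words would be generated one at a time (e.g., by greedy or beam search), feeding each predicted word back into the decoder.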
Keywords
image caption; image caption generation; recurrent neural network; convolutional neural network
Citations & Related Records
  • Reference
1 Hodosh, Micah, Young, Peter, and Hockenmaier, Julia. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47: 853-899, 2013.
2 Mao, Junhua, Xu, Wei, Yang, Yi, Wang, Jiang, and Yuille, Alan. Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv:1412.6632, 2014.
3 Cho, Kyunghyun, van Merrienboer, Bart, Gulcehre, Caglar, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP, 2014.
4 Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. arXiv:1411.4555, 2014.
5 Karpathy, Andrej and Li, Fei-Fei. Deep visual-semantic alignments for generating image descriptions. arXiv:1412.2306, 2014.
6 Xu, Kelvin, Ba, Jimmy, Kiros, Ryan, Cho, Kyunghyun, Courville, Aaron, Salakhutdinov, Ruslan, Zemel, Richard, and Bengio, Yoshua. Show, attend and tell: Neural image caption generation with visual attention. ICML, 2015.
7 Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. ICLR, arXiv:1409.0473, 2015.
8 Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
9 Young, Peter, Lai, Alice, Hodosh, Micah, and Hockenmaier, Julia. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67-78, 2014.
10 Lin, Tsung-Yi, et al. Microsoft COCO: Common objects in context. arXiv preprint arXiv:1405.0312, 2014.
11 Papineni, Kishore, Roukos, Salim, Ward, Todd, and Zhu, Wei-Jing. BLEU: a method for automatic evaluation of machine translation. ACL, 2002.
12 Bastien, F. et al. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop. 2012.
13 Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K. and Kuksa, P. Natural Language Processing (Almost) from Scratch, Journal of Machine Learning Research (JMLR), 2011.