http://dx.doi.org/10.3745/KTSDE.2017.6.4.203

Design of a Deep Neural Network Model for Image Caption Generation  

Kim, Dongha (Dept. of Computer Science, Kyonggi University)
Kim, Incheol (Dept. of Computer Science, Kyonggi University)
Publication Information
KIPS Transactions on Software and Data Engineering / v.6, no.4, 2017, pp. 203-210
Abstract
In this paper, we propose an effective neural network model for image caption generation and model transfer. The model is a multi-modal recurrent neural network. It consists of five distinct layers, including a convolution neural network layer for extracting visual information from images, an embedding layer for converting each word into a low-dimensional feature, a recurrent neural network layer for learning caption sentence structure, and a multi-modal layer for combining visual and language information. The recurrent neural network layer is built from LSTM units, which are well known to be effective for learning and transferring sequence patterns. Moreover, the model has a unique structure in which the output of the convolution neural network layer is linked not only to the input of the initial state of the recurrent neural network layer but also to the input of the multi-modal layer, so that visual information extracted from the image is available at each recurrent step when generating the corresponding textual caption. Through comparative experiments on the open datasets Flickr8k, Flickr30k, and MSCOCO, we demonstrate that the proposed multi-modal recurrent neural network model achieves high caption accuracy and transfers well across datasets.
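The abstract's architecture, in particular the dual use of the CNN output (once to initialize the recurrent state, and again at every multi-modal fusion step), can be illustrated with a minimal NumPy sketch. All dimensions, weight matrices, and function names below are toy placeholders, not the authors' implementation; the CNN is replaced by a random feature vector standing in for extracted visual features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only)
vocab_size, embed_dim, hidden_dim, img_dim, mm_dim = 10, 8, 16, 12, 16

# Randomly initialized parameters for the sketch
W_embed = rng.normal(0, 0.1, (vocab_size, embed_dim))       # word embedding layer
W_img_init = rng.normal(0, 0.1, (img_dim, hidden_dim))      # image -> initial LSTM state
W_lstm = rng.normal(0, 0.1, (embed_dim + hidden_dim, 4 * hidden_dim))  # i, f, o, g gates
b_lstm = np.zeros(4 * hidden_dim)
W_mm_h = rng.normal(0, 0.1, (hidden_dim, mm_dim))           # multi-modal layer: hidden side
W_mm_v = rng.normal(0, 0.1, (img_dim, mm_dim))              # multi-modal layer: visual side
W_out = rng.normal(0, 0.1, (mm_dim, vocab_size))            # word-probability output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c):
    """One LSTM update with a single packed gate matrix."""
    z = np.concatenate([x, h]) @ W_lstm + b_lstm
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def caption_step_probs(img_feat, word_ids):
    """Return a per-step vocabulary distribution. The image features enter
    twice: once to initialize the LSTM state, and again inside the
    multi-modal layer at every step."""
    h = np.tanh(img_feat @ W_img_init)   # image-conditioned initial state
    c = np.zeros(hidden_dim)
    probs = []
    for w in word_ids:
        h, c = lstm_step(W_embed[w], h, c)
        m = np.tanh(h @ W_mm_h + img_feat @ W_mm_v)  # multi-modal fusion
        logits = m @ W_out
        e = np.exp(logits - logits.max())            # softmax over the vocabulary
        probs.append(e / e.sum())
    return np.array(probs)

img = rng.normal(size=img_dim)           # stand-in for CNN visual features
p = caption_step_probs(img, [1, 3, 5])
print(p.shape)                           # (3, 10): one distribution per step
```

In a trained model, each step's distribution would be sampled or arg-maxed to emit the next caption word; here the weights are random, so only the shapes and the data flow are meaningful.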
Keywords
Image Caption Generation; Deep Neural Network Model; Model Transfer; Multi-Modal Recurrent Neural Network
Citations & Related Records
Times Cited By KSCI: 1
1 Lisa Anne Hendricks et al., "Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data," Proc. of IEEE Conf. on CVPR, 2016.
2 Oriol Vinyals, Alexander Toshev et al., "Show and Tell: A Neural Image Caption Generator," Proc. of IEEE Conf. on CVPR, 2015.
3 Kelvin Xu, Jimmy Lei Ba et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention," Proc. of ICML, 2015.
4 Junhua Mao, Wei Xu, Yi Yang et al., "Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)," Proc. of ICLR, 2015.
5 Changki Lee, "Image Caption Generation using Recurrent Neural Network," Journal of KIISE, Vol.43, No.8, pp.878-882, 2016.
6 Sepp Hochreiter and Jürgen Schmidhuber, "Long Short-Term Memory," Neural Computation, Vol.9, No.8, pp.1735-1780, 1997.
7 Junyoung Chung et al., "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling," arXiv preprint arXiv:1412.3555, 2014.
8 Christian Szegedy, Sergey Ioffe et al., "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning," arXiv preprint arXiv:1602.07261, 2016.
9 Kishore Papineni, Salim Roukos et al., "BLEU: a Method for Automatic Evaluation of Machine Translation," Proc. of ACL, pp.311-318, 2002.
10 Tsung-Yi Lin, Michael Maire et al., "Microsoft COCO: Common Objects in Context," Proc. of ECCV, 2014.