
Design of a Deep Neural Network Model for Image Caption Generation


  • Received : 2016.12.15
  • Accepted : 2016.12.28
  • Published : 2017.04.30

Abstract

In this paper, we propose an effective neural network model for image caption generation and model transfer. The model is a multimodal recurrent neural network consisting of five distinct layers, including a convolutional neural network layer that extracts visual information from images, an embedding layer that converts each word into a low-dimensional feature, a recurrent neural network layer that learns caption sentence structure, and a multimodal layer that combines the visual and language information. The recurrent layer is built from LSTM units, which are well known to be effective for learning and transferring sequence patterns. Moreover, the model has a distinctive structure in which the output of the convolutional neural network layer is linked not only to the initial state of the recurrent neural network layer but also to the input of the multimodal layer, so that the visual information extracted from the image is available at every recurrent step of caption generation. Through comparative experiments on the open data sets Flickr8k, Flickr30k, and MSCOCO, we demonstrate that the proposed multimodal recurrent neural network model achieves high performance in terms of both caption accuracy and model transfer effectiveness.
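To make the described structure concrete, the following is a minimal PyTorch-style sketch of such a multimodal recurrent captioner. The layer sizes, the torchvision ResNet-50 backbone, and the use of simple concatenation inside the multimodal layer are all illustrative assumptions, not the paper's actual configuration; the sketch only shows the key wiring, namely that the CNN output feeds both the initial LSTM state and the multimodal layer at every time step.

```python
# Illustrative sketch only: layer sizes, backbone, and fusion method are
# assumptions, not the configuration reported in the paper.
import torch
import torch.nn as nn
import torchvision.models as models

class MultimodalCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, mm_dim=512):
        super().__init__()
        # CNN layer: extracts a visual feature vector from the image
        # (requires torchvision >= 0.13 for the `weights` argument).
        cnn = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # drop classifier
        self.img_proj = nn.Linear(2048, hidden_dim)           # match LSTM size
        # Embedding layer: converts each word into a low-dimensional feature.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Recurrent layer: LSTM units learn caption sentence structure.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Multimodal layer: combines visual and language information.
        self.multimodal = nn.Linear(hidden_dim * 2, mm_dim)
        self.classifier = nn.Linear(mm_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.cnn(images).flatten(1)   # (B, 2048)
        v = self.img_proj(feats)              # (B, hidden_dim)
        # The visual feature initializes the LSTM state...
        h0 = v.unsqueeze(0)                   # (1, B, hidden_dim)
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(self.embed(captions), (h0, c0))  # (B, T, hidden)
        # ...and is also fed to the multimodal layer at every time step,
        # the structural feature highlighted in the abstract.
        v_rep = v.unsqueeze(1).expand_as(out)               # (B, T, hidden)
        mm = torch.tanh(self.multimodal(torch.cat([out, v_rep], dim=-1)))
        return self.classifier(mm)            # per-step word logits

# Toy usage with random images and hypothetical token ids.
model = MultimodalCaptioner(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```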
