Active Vision from Image-Text Multimodal System Learning

능동 시각을 이용한 이미지-텍스트 다중 모달 체계 학습

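  • 김진화 (Interdisciplinary Program in Cognitive Science, Seoul National University)
  • 장병탁 (Department of Computer Science and Engineering, Seoul National University)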
  • Received : 2016.03.11
  • Accepted : 2016.04.19
  • Published : 2016.07.15

Abstract

In image classification, recent CNNs rival human performance, but more general recognition tasks remain limited. Here we deal with indoor images, which contain too much information to be processed directly and therefore require information reduction before recognition. To reduce the amount of data to process, variational inference or variational Bayesian methods are typically suggested for object detection; however, these methods suffer from the difficulty of marginalizing over the given space. In this study, we propose an image-text integrated recognition system that uses active vision based on Spatial Transformer Networks. The system learns to efficiently sample a partial region of a given image conditioned on the given language information. Our experimental results demonstrate a significant improvement over traditional approaches. We also discuss a qualitative analysis of the sampled images, the characteristics of the model, and its limitations.

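Image classification has reached human-level performance, but difficulties remain in general recognition problems. Indoor environments contain diverse information, so the amount of information to be processed needs to be reduced efficiently. Methods such as variational inference and variational Bayes have been introduced to localize the target object and thereby reduce the amount of information, but they are hard to compute in practice because the marginal probability distribution over all cases is difficult to obtain. In this study, we apply Spatial Transformer Networks to propose an image-text integrated recognition system that uses active vision. The system learns to efficiently sample part of an image based on the given text information. We show that this improves performance by a considerable margin on a problem that is difficult to solve with traditional methods. We also qualitatively analyze the images sampled by the proposed model to examine its characteristics.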
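
The sampling step described in the abstract can be illustrated with a minimal NumPy sketch of the grid generator and bilinear sampler used in Spatial Transformer Networks. This is not the authors' model: the restricted affine form (one isotropic scale plus a translation), the `glimpse_params` head that maps a text embedding to those three parameters, and all function and variable names are assumptions made for illustration only.

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """Map a regular output grid in [-1, 1]^2 to source coordinates
    through a 2x3 affine matrix theta."""
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, out_h),
        np.linspace(-1.0, 1.0, out_w),
        indexing="ij",
    )
    grid = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (out_h, out_w, 3)
    return grid @ theta.T  # (out_h, out_w, 2), last axis is (x_src, y_src)

def bilinear_sample(image, coords):
    """Sample a 2-D image at normalized (x, y) coordinates in [-1, 1]
    using bilinear interpolation; out-of-range points are clamped."""
    h, w = image.shape
    x = np.clip((coords[..., 0] + 1.0) * (w - 1) / 2.0, 0, w - 1)
    y = np.clip((coords[..., 1] + 1.0) * (h - 1) / 2.0, 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = x - x0
    wy = y - y0
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom

def glimpse_params(text_embedding, weights, bias):
    """Hypothetical localization head (an assumption, not from the paper):
    a linear map from a text embedding to (scale, tx, ty)."""
    s_raw, tx_raw, ty_raw = weights @ text_embedding + bias
    scale = 1.0 / (1.0 + np.exp(-s_raw))  # sigmoid keeps the zoom in (0, 1)
    return scale, np.tanh(tx_raw), np.tanh(ty_raw)

def sample_glimpse(image, scale, tx, ty, out_h, out_w):
    """Crop-and-zoom: scale sets the window size, (tx, ty) its center."""
    theta = np.array([[scale, 0.0, tx],
                      [0.0, scale, ty]])
    return bilinear_sample(image, affine_grid(theta, out_h, out_w))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((32, 32))          # stand-in for an indoor image
    text_embedding = rng.normal(size=16)  # stand-in for an encoded question
    weights = rng.normal(size=(3, 16)) * 0.1
    bias = np.zeros(3)
    scale, tx, ty = glimpse_params(text_embedding, weights, bias)
    glimpse = sample_glimpse(image, scale, tx, ty, out_h=8, out_w=8)
    print(glimpse.shape)
```

Because both the grid generator and the bilinear sampler are differentiable with respect to the transform parameters, a head like `glimpse_params` could in principle be trained end to end by backpropagation, which avoids marginalizing over candidate locations.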

Acknowledgement

Supported by: National Research Foundation of Korea (NRF), Institute for Information & Communications Technology Promotion (IITP)
