
Estimation of Manhattan Coordinate System using Convolutional Neural Network

  • Jinwoo Lee (Visual Computing Lab, Kookmin University) ;
  • Hyunjoon Lee (Intel Korea) ;
  • Junho Kim (Visual Computing Lab, Kookmin University)
  • Received : 2017.06.24
  • Accepted : 2017.07.06
  • Published : 2017.07.14

Abstract

In this paper, we propose a system that estimates Manhattan coordinate systems for urban scene images using a convolutional neural network (CNN). Estimating the Manhattan coordinate system of an image under the Manhattan world assumption is a basis for solving computer graphics and vision problems such as image adjustment and 3D scene reconstruction. We construct a CNN that estimates Manhattan coordinate systems based on GoogLeNet [1]. To train the CNN, we collect about 155,000 images under the Manhattan world assumption using the Google Street View APIs and compute their Manhattan coordinate systems with an existing calibration method to generate the dataset. In contrast to PoseNet [2], which trains a separate CNN per scene, our method learns from images under the Manhattan world assumption and can therefore estimate Manhattan coordinate systems for new images it has not been trained on. Experimental results show that our method estimates Manhattan coordinate systems with a median error of $3.157^{\circ}$ on a test set of Google Street View images from scenes not used in training. In addition, the proposed method achieves a lower median error on this test set than an existing calibration method [3].
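The evaluation reported above (a median angular error over a test set) can be sketched as follows. This is an illustrative sketch, not the paper's code: it assumes a Manhattan coordinate system is represented as a 3x3 rotation matrix, and the function names are hypothetical. The angular error between an estimated and a ground-truth frame is the rotation angle of their relative rotation.

```python
import numpy as np

def rotation_angle_deg(R_est, R_gt):
    """Angle (degrees) of the relative rotation between two 3x3 rotation matrices."""
    R = R_est.T @ R_gt
    # trace(R) = 1 + 2*cos(theta); clip guards against numerical drift outside [-1, 1]
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))

def median_error_deg(estimates, ground_truths):
    """Median angular error over a test set of estimated/ground-truth frame pairs."""
    errors = [rotation_angle_deg(Re, Rg) for Re, Rg in zip(estimates, ground_truths)]
    return float(np.median(errors))
```

The median (rather than the mean) is the statistic the abstract reports; it is robust to the occasional large failure case that angular-error distributions typically exhibit.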

Acknowledgement

Supported by: National Research Foundation of Korea (NRF), Institute for Information & Communications Technology Promotion (IITP)

References

  1. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
  2. A. Kendall, M. Grimes, and R. Cipolla, "PoseNet: A convolutional network for real-time 6-dof camera relocalization," in Proceedings of the International Conference on Computer Vision, 2015, pp. 2938-2946.
  3. H. Lee, E. Shechtman, J. Wang, and S. Lee, "Automatic upright adjustment of photographs with robust camera calibration," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 5, pp. 833-844, 2014. https://doi.org/10.1109/TPAMI.2013.166
  4. J. M. Coughlan and A. L. Yuille, "Manhattan World: Compass direction from a single image by Bayesian inference," in Proceedings of the International Conference on Computer Vision, 1999, pp. 941-947.
  5. K. H. Jang and S. K. Jung, "Practical modeling technique for large-scale 3D building models from ground images," Pattern Recognition Letters, vol. 30, no. 10, pp. 861-869, 2009. https://doi.org/10.1016/j.patrec.2009.04.004
  6. P. Denis, J. H. Elder, and F. J. Estrada, "Efficient edge-based methods for estimating Manhattan frames in urban imagery," in Proceedings of European Conference on Computer Vision, 2008, pp. 197-210.
  7. B. Li, K. Peng, X. Ying, and H. Zha, "Simultaneous vanishing point detection and camera calibration from single images," in Proceedings of the International Symposium on Visual Computing, 2010, pp. 151-160.
  8. J. R. Movellan, "Tutorial on Gabor filters," Univ. of California, San Diego, Tech. Rep., 2005.
  9. M. Zhai, S. Workman, and N. Jacobs, "Detecting vanishing points using global image context in a non-Manhattan world," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5657-5665.
  10. C. Wu, "Towards linear-time incremental structure from motion," in Proceedings of the International Conference on 3D Vision, 2013, pp. 127-134.
  11. "Google Street View Image API." [Online]. Available: https://developers.google.com/maps/documentation/streetview/
  12. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems." [Online]. Available: http://tensorflow.org/
  13. D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," in Proceedings of the International Conference on Learning Representations, 2015.
  14. E. Tretyak, O. Barinova, P. Kohli, and V. Lempitsky, "Geometric image parsing in man-made environments," International Journal of Computer Vision, vol. 97, no. 3, pp. 305-321, 2011. https://doi.org/10.1007/s11263-011-0488-1