High-Quality Depth Map Generation of Humans in Monocular Videos

A Method for Generating High-Quality Depth Information for Human Objects in Monocular Video

  • Received : 2014.04.12
  • Accepted : 2014.05.28
  • Published : 2014.06.01

Abstract

The quality of 2D-to-3D conversion depends on the accuracy of the depth assigned to scene objects. Manual depth painting is labor intensive because every frame must be painted individually. A human is one of the most challenging objects for high-quality conversion, because the human body is an articulated figure with many degrees of freedom (DOF). In addition, varied styles of clothing, accessories, and hair create a very complex silhouette around the 2D human object. We propose an efficient method to estimate visually pleasing depths of a human at every frame of a monocular video. First, a 3D template model is matched to a person in the video using a small number of user-specified correspondences. Our pose estimation with sequential joint angular constraints reproduces a wide range of human motions (e.g., spine bending) by allowing the use of a fully skinned 3D model with a large number of joints and DOFs. The initial depth of the 2D object in the video is assigned from the matched result and then propagated toward areas where depth is missing, producing a complete depth map. To handle complex silhouettes and appearances effectively, we introduce a partial depth propagation method based on color segmentation that preserves detail in the results. We compared our results with depth maps painted by experienced artists. The comparison shows that our method efficiently produces viable depth maps of humans in monocular videos.
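The pose-matching step in the abstract can be illustrated with a toy example. The sketch below is not the paper's solver: it assumes a planar joint chain instead of a fully skinned 3D model, models the joint angular constraints as simple per-joint bounds, and uses projected gradient descent with a numerical gradient. All function names and parameters (`forward_kinematics`, `fit_pose`, `limits`, `step`) are hypothetical.

```python
import math

def forward_kinematics(angles, lengths):
    """Return the 2D joint positions of a planar chain rooted at the origin."""
    x, y, total = 0.0, 0.0, 0.0
    points = []
    for angle, length in zip(angles, lengths):
        total += angle  # each joint rotates relative to its parent
        x += length * math.cos(total)
        y += length * math.sin(total)
        points.append((x, y))
    return points

def reprojection_error(angles, lengths, targets):
    """Sum of squared 2D distances between chain joints and user-marked points."""
    joints = forward_kinematics(angles, lengths)
    return sum((px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(joints, targets))

def fit_pose(lengths, targets, limits, iters=2000, step=0.05, eps=1e-5):
    """Projected gradient descent: take a step, then clamp each angle to its bounds."""
    angles = [min(max(0.0, lo), hi) for lo, hi in limits]
    for _ in range(iters):
        base = reprojection_error(angles, lengths, targets)
        grad = []
        for i in range(len(angles)):
            bumped = list(angles)
            bumped[i] += eps  # forward-difference estimate of the gradient
            grad.append((reprojection_error(bumped, lengths, targets) - base) / eps)
        angles = [min(max(a - step * g, lo), hi)
                  for a, g, (lo, hi) in zip(angles, grad, limits)]
    return angles

# Hypothetical user correspondences generated from a known pose.
lengths = [1.0, 1.0]
targets = forward_kinematics([0.5, 0.4], lengths)
fitted = fit_pose(lengths, targets, limits=[(-1.0, 1.0), (0.0, 1.5)])
```

With loose bounds the fitted angles approach the pose that generated the correspondences. Tightening a bound keeps the result inside the allowed range even when the correspondences alone would suggest otherwise, which is the role the angular constraints play in rejecting implausible poses.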

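The quality of converting monocular video to stereoscopic 3D depends on the accuracy of the depth assigned to the objects in the scene. Manually entering depth for scene objects at every frame is time-consuming, labor-intensive work. In particular, the human body, an articulated object with many degrees of freedom, is one of the most difficult objects for high-quality stereoscopic conversion. The very complex silhouettes created by varied clothing styles, accessories, and hair make the problem harder still. This paper proposes an efficient method for generating high-quality depth information for human objects in monocular video. First, based on a small number of user inputs, a 3D template model is matched to the 2D human object in the video through a pose estimation method with sequential joint angle constraints. After the initial depth is obtained from the matched 3D model, a partial depth propagation method based on color segmentation generates the final depth map, which covers the missing regions while preserving fine detail. A validation experiment comparing the results of the proposed method with the manual work of experienced artists shows that the proposed method efficiently generates depth information of comparable quality in monocular video.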
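The idea of propagating depth within color segments can be sketched in a few lines. This is a deliberately simplified stand-in, not the paper's propagation method: it assumes a segment label map has already been computed by some color segmentation, marks missing depth as `None`, and fills each missing pixel with the mean known depth of its segment. The function name `propagate_depth` and the data layout are hypothetical.

```python
def propagate_depth(depth, segments):
    """Fill missing depths (None) with the mean known depth of the same segment."""
    sums, counts = {}, {}
    # First pass: accumulate the known depths per segment label.
    for depth_row, label_row in zip(depth, segments):
        for d, label in zip(depth_row, label_row):
            if d is not None:
                sums[label] = sums.get(label, 0.0) + d
                counts[label] = counts.get(label, 0) + 1
    # Second pass: fill gaps; segments with no known depth stay None.
    filled = []
    for depth_row, label_row in zip(depth, segments):
        filled.append([
            d if d is not None
            else (sums[label] / counts[label] if label in counts else None)
            for d, label in zip(depth_row, label_row)
        ])
    return filled

# Initial depth from the matched model, with gaps outside the model silhouette.
depth = [[1.0, None, None],
         [3.0, None, 5.0]]
# Hypothetical labels from a color segmentation of the same frame.
segments = [[0, 0, 1],
            [0, 1, 1]]
complete = propagate_depth(depth, segments)
```

Restricting the fill to pixels that share a segment keeps depth from leaking across color boundaries, which is why segmentation helps around hair and clothing edges.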
