1. INTRODUCTION
Understanding 3D structure from a single image is a central problem that has received increased attention due to its wide applicability in many vision applications [31,32]. The 3D structure of a scene provides decisive clues, enabling major breakthroughs in several tasks such as scene labeling [1], object detection [2], and intrinsic image estimation [3]. Indeed, humans have a remarkable ability to infer scene structure in monocular situations, because prior information for understanding 3D structure can be learned from various monocular depth cues, including motion, occlusion, and shading. However, replicating this mechanism with machines is not trivial: it is difficult to discover the relations among the numerous monocular cues underlying an image.
There have been many attempts to estimate depth information from single images. We roughly classify the existing methods into two groups: semi-automatic and automatic approaches. The former requires human interaction for depth estimation. Conventional semi-automatic methods involve highly labor-intensive procedures, which is why they can achieve good performance. In fact, many 3D films have been converted from 2D ones through the following steps: separating objects in individual frames, specifying proper depths, and correcting errors after the final rendering. To reduce the burden of human interaction, scribble-based methods have been introduced [4,5]. In [4], the predicted depth is obtained from sparse depth scribbles through optimization-based propagation. Non-local random walks [5] can reduce the number of user scribbles. However, scribble-based methods are not suitable for automatic vision-related tasks.
Contrary to semi-automatic approaches, automatic depth estimation methods do not require any user interaction. For example, they estimate depth automatically using depth cues, such as shape from shading [6] and structure from motion [7]. However, their applications are limited to restricted settings, e.g., moving cameras or static scenes. After the emergence of large-scale RGB-D datasets, data-driven approaches to depth estimation have received a lot of attention. They can be classified into two categories: non-parametric sampling and parametric learning. Non-parametric sampling methods infer depth from the depth maps of retrieved candidates that are visually similar to the query image, under the assumption that visually similar scenes have similar geometric structures. The visually similar RGB-D candidates are selected using hand-crafted features, such as GIST [8] and HOG [9]. In [10] and [11], a warping process is performed to align the boundaries of the candidate depth maps with those of the query image.
Contrary to non-parametric sampling methods, parametric learning methods build a model that describes the relationship between a scene and the corresponding depth. In [12], a Markov Random Field (MRF) model is built to estimate a depth map from a single image; the model is trained on monocular cues and the relations among multiple regions in the input image. Nowadays, convolutional neural networks (CNNs) have succeeded in depth estimation from a single image. A multi-scale CNN framework for depth estimation is introduced in [13]. Also, the framework of [14], which jointly trains continuous CRFs and deep CNNs, can estimate plausible depth. More recently, fully convolutional networks (FCNs) [15] have become the mainstream for depth estimation.
In this paper, we integrate a non-parametric algorithm and CNNs for depth estimation from a single monocular scene. First, we estimate a coarse depth map using an FCN. A non-parametric sampling method then extracts warped depth maps whose boundaries are aligned with those of the query. The warped depth maps are concatenated with the activations of a convolutional layer in a CNN for depth refinement, and the network produces a plausible depth prediction. This paper is organized as follows. Section 2 presents related work. The proposed algorithm is explained in Section 3. Section 4 shows the implementation details and experimental results. Finally, we conclude the paper with some discussion and future work in Section 5.
2. RELATED WORKS
Recently, several methods have been proposed to directly estimate depth from a single image. Among these, we review and discuss the two lines of work most relevant to ours.
2.1 Non-parametric Sampling Methods
Non-parametric sampling methods have received a lot of attention since the emergence of large RGB-D datasets. They retrieve scenes that are visually similar to the query image, under the assumption that visually similar scenes also have similar geometric structures; depth is then estimated from the retrieved images. Depth Transfer [10] selects candidates from the RGB-D database using the GIST feature [8]. To align the candidates' depth boundaries with those of the query image, warped depth maps are obtained via SIFT flow [16] between the query image and the RGB candidates. A global optimization is then performed to regress the warped depth maps with a robust potential function. However, calculating SIFT flows and solving the optimization problem are computationally very expensive. Choi et al. [11] adopt Patch Match [17] for an efficient warping process. Their method, called Depth Analogy, estimates depth gradients rather than depth values directly, and the final depth prediction is obtained by Poisson reconstruction. This framework is reasonable owing to the statistical invariance property of depth gradients.
Contrary to Depth Transfer and Depth Analogy, Konrad et al. [18] propose a depth estimation method without a warping process. They assume that the locations of certain objects (e.g., sky, buildings, furniture) are quite consistent across the candidate images. Based on this assumption, the inferred depth values are taken as the median of the retrieved depth maps. A joint filtering is then executed to align the depth boundaries with those of the input image. However, if the candidate depths are not locally consistent with the query image, the median may fail to produce proper depth values.
2.2 Parametric Learning Methods
Another group of researchers focuses on defining a parametric model between RGB and depth. Saxena et al. [12] propose a Markov Random Field (MRF) model to estimate a depth map from a single image. A set of plane parameters is trained on a large-scale RGB-D dataset to capture the relationship between RGB and depth.
CNN-based depth estimation methods have been studied extensively. Eigen et al. [13] build a multi-scale CNN to estimate depth directly. The network consists of two deep network stacks: the first performs coarse depth prediction at low resolution, and the coarse prediction, after bilinear interpolation, is concatenated into the second network, which refines the coarse depth map. Liu et al. [14] propose a deep convolutional neural network that estimates depth using continuous CRFs and deep CNNs; the CRF captures the relationships between superpixels in the input image. Jointly training the CRF and the CNN is their main contribution.
More recently, fully convolutional networks (FCNs) [15] have become a popular choice in semantic segmentation. FCNs generate effective features and can be trained in an end-to-end manner. However, FCNs produce coarse output, because the network simply resizes low-resolution feature maps to high resolution using bilinear interpolation. Chen et al. [19] propose a framework that refines the FCN output using a fully connected CRF for semantic segmentation; in [20], the fully connected CRF is replaced with the domain transform. As in semantic segmentation, FCN-based depth estimation methods concentrate on refining depth boundaries, using a CRF [21] or annotations of relative depth [22].
Our proposed method is also based on an FCN for depth estimation from a single monocular scene. We integrate a non-parametric sampling framework to refine the coarse FCN output. Since we embed the boundary information of the non-parametric results into a CNN for refinement, the final depth contains both global and local information.
3. PROPOSED METHOD
The proposed method consists of three components: a parametric model for coarse depth estimation, a non-parametric sampling framework that produces warped depth maps, and CNN layers for depth refinement. The overall framework of the proposed method is shown in Fig. 1.
Fig. 1. The overall framework of the proposed method.
3.1 Parametric Depth Estimation Model: Fully Convolutional Network
The first component of the proposed method is an FCN, which produces a coarse depth map. We modify the VGG-16 net [23] into an FCN as shown in Fig. 2. The upper layers in Fig. 2 come from the VGG-16 net; they consist of 10 convolutional, 10 ReLU, and 4 max-pooling layers. We use parameters pretrained on ImageNet [24] for initialization. We concatenate the last two feature maps, which have 1/8 and 1/16 of the input image's scale; before concatenation, the feature maps are resized to the input resolution using bilinear interpolation. To obtain a better high-resolution prediction, we concatenate the middle activations with the last ones: since the middle activations undergo fewer pooling operations, they retain more structural information. Two convolutional layers are then applied to obtain the FCN output; we do not add a ReLU after the last convolutional layer. We adopt the loss used in [13]:

$$L(u, g) = \frac{1}{n}\sum_i d_i^2 \;-\; \frac{\lambda}{n^2}\Big(\sum_i d_i\Big)^2 \;+\; \frac{1}{n}\big(\lVert D_x d \rVert^2 + \lVert D_y d \rVert^2\big), \qquad d = u - g, \tag{1}$$
where u and g are the FCN output and its ground-truth, respectively, and λ weights the scale-invariant term (λ = 0.5 in [13]). In addition, D_x and D_y denote the horizontal and vertical difference matrices, and n is the total number of valid pixels. The first and second terms are the least-squares and scale-invariant differences, respectively, while the last term encourages the output to have gradients similar to those of the ground-truth. As mentioned in the previous sections, the FCN produces a coarse depth map; in our method, the local structures at depth boundaries are refined with the output of the non-parametric sampling method.
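For concreteness, here is a minimal sketch of the loss in (1), assuming PyTorch and λ = 0.5 as in [13]; valid-pixel masking is omitted for brevity, and the function name is our own:

```python
import torch

def depth_loss(u, g, lam=0.5):
    """Loss of Eq. (1): least-squares, scale-invariant, and gradient terms.
    u, g: (B, 1, H, W) predicted and ground-truth depth maps.
    lam: weight of the scale-invariant term; lam = 0.5 follows [13]."""
    d = u - g
    n = d[0].numel()                                 # valid pixels per image
    ls = (d ** 2).sum(dim=(1, 2, 3)) / n             # least-squares term
    si = d.sum(dim=(1, 2, 3)) ** 2 / n ** 2          # scale-invariant term
    dx = d[:, :, :, 1:] - d[:, :, :, :-1]            # horizontal differences D_x d
    dy = d[:, :, 1:, :] - d[:, :, :-1, :]            # vertical differences D_y d
    gr = ((dx ** 2).sum(dim=(1, 2, 3)) + (dy ** 2).sum(dim=(1, 2, 3))) / n
    return (ls - lam * si + gr).mean()
```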
Fig. 2. Fully convolutional network (FCN) in the proposed method.
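To make the fusion step of Fig. 2 concrete, the following sketch shows how the two VGG-16 feature maps could be upsampled, concatenated, and passed through the two trailing convolutions. The class, its layer names, and the channel widths are assumptions of ours; the paper specifies only the scales (1/8 and 1/16) and that the last convolution has no ReLU:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCNHead(nn.Module):
    """Fuse VGG-16 activations at 1/8 and 1/16 scale into a coarse depth map."""

    def __init__(self, c_mid=512, c_last=512, width=64):
        super().__init__()
        self.conv1 = nn.Conv2d(c_mid + c_last, width, 3, padding=1)
        self.conv2 = nn.Conv2d(width, 1, 3, padding=1)  # no ReLU afterwards

    def forward(self, feat_mid, feat_last, size):
        # Resize both feature maps to the input resolution (bilinear).
        f1 = F.interpolate(feat_mid, size=size, mode='bilinear', align_corners=False)
        f2 = F.interpolate(feat_last, size=size, mode='bilinear', align_corners=False)
        x = torch.cat([f1, f2], dim=1)            # concatenate middle and last activations
        return self.conv2(F.relu(self.conv1(x)))  # coarse depth map
```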
3.2 Non-Parametric Sampling Method
The most important component of non-parametric sampling approaches is scene retrieval. In our method, we retrieve visually similar scenes using a CNN descriptor [25], a 4096-dimensional feature vector. We confirm that the CNN descriptor outperforms hand-crafted features such as GIST or HOG in image retrieval. The CNN descriptors of the training set are pre-computed for efficient retrieval. Let $c_i$ denote the CNN descriptor of the image $I_i$ in the RGB-D database. Given the input image $I_0$ and its descriptor $c_0$, the visual similarity is measured by the sum of squared differences (SSD):

$$s(I_0, I_i) = \lVert c_0 - c_i \rVert_2^2. \tag{2}$$
In our experiments, we select the 7 RGB-D candidates with the smallest SSD. Fig. 3 shows a query input and the 7 visually similar scenes retrieved by the CNN descriptor. We can confirm that the CNN descriptor retrieves visually similar scenes, which also share similar 3D structures. To align the boundaries of the candidate depth maps with those of the input, we perform a warping process using Patch Match [17], which is much faster than SIFT flow [16]. Although it is hard to estimate a plausible depth map with a non-parametric approach alone, the warping process yields warped depth maps that carry depth-boundary information. The warped depth maps and the FCN output are used as the input of the depth refinement framework.
Fig. 3. (a) Input image, (b) visually similar scenes retrieved by the CNN descriptor.
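Candidate retrieval via (2) reduces to a nearest-neighbor search over the pre-computed descriptors. A minimal NumPy sketch, with array and function names of our own choosing:

```python
import numpy as np

def retrieve_candidates(c0, db_descriptors, k=7):
    """Return the indices of the k database scenes most similar to the query.
    c0: (4096,) CNN descriptor of the query image.
    db_descriptors: (M, 4096) pre-computed descriptors of the RGB-D database."""
    ssd = ((db_descriptors - c0) ** 2).sum(axis=1)  # Eq. (2) against every scene
    return np.argsort(ssd)[:k]                      # the k smallest SSDs
```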
3.3 Depth Refinement by CNNs
Traditional approaches refine depth using joint filtering techniques such as the joint bilateral filter [29] or the guided filter [30], usually adopting the color image as the guidance image. However, due to the structural inconsistency between RGB and depth, texture-copying artifacts frequently occur in the refined depth. Learning-based approaches resolve this problem: rather than using color directly as the guidance image, they use CNN layers for depth refinement. To effectively regress the FCN output and the warped depth maps, we build CNN layers as shown at the bottom of Fig. 1, modifying a residual network [26]. Kim et al. [26] show that learning only the residual converges faster and yields superior performance; it also mitigates the vanishing/exploding gradient problem in backpropagation. We confirm that learning only the residual converges faster and performs better than a general deep network. To supply local depth structures to the network, we concatenate the 7 warped depth maps with the first convolutional layer's activations. The desired output is then obtained by adding the residual to the FCN output. As in the FCN, (1) is used as the loss function.
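A minimal sketch of this refinement step follows, again in PyTorch for illustration. The number of layers and the channel width are our assumptions, since the paper fixes neither; the concatenation point, the residual addition, and the placement of batch normalization (after every convolution except the last, see Section 4) follow the text:

```python
import torch
import torch.nn as nn

class RefineNet(nn.Module):
    """Residual refinement CNN: the 7 warped depth maps are concatenated with
    the first convolutional layer's activations, and the predicted residual is
    added back to the coarse FCN output."""

    def __init__(self, n_warped=7, width=64):
        super().__init__()
        self.first = nn.Sequential(
            nn.Conv2d(1, width, 3, padding=1), nn.BatchNorm2d(width), nn.ReLU())
        self.body = nn.Sequential(
            nn.Conv2d(width + n_warped, width, 3, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(),
            nn.Conv2d(width, 1, 3, padding=1))  # last layer: no BN, no ReLU

    def forward(self, fcn_depth, warped):
        x = self.first(fcn_depth)          # activations of the first conv layer
        x = torch.cat([x, warped], dim=1)  # append the 7 warped depth maps
        return fcn_depth + self.body(x)    # residual learning: add to FCN output
```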
4. EXPERIMENTAL RESULTS
We evaluate the proposed method on the NYU Depth V2 dataset [27]. The dataset is composed of 1449 indoor scenes and the corresponding depth maps captured by a Kinect sensor. We select 795 RGB-D pairs as the training set, and the remaining 654 scenes are used as the test set. The training set is used to train the parameters of our deep neural networks; it also serves as the RGB-D database for non-parametric sampling. For training the CNN layers for depth refinement, 7 visually similar scenes are selected from the training set, excluding the query image itself. The proposed method is implemented on a standard desktop with an NVIDIA Titan GPU, using the popular CNN toolbox MatConvNet.
For training, images are resized from 640×480 to 320×240 because of GPU memory limits. The convolutional layers in the FCN (the upper layers in Fig. 2) are initialized from the pretrained VGG-16 net, and the others are randomly initialized with zero-mean Gaussian values. All parameters of the refinement CNN are also randomly initialized. The learning rate is 10^-4 for the FCN; for the refinement CNN, we can use the large learning rate of 10^-1 thanks to residual learning. The momentum is 0.9 and the batch size is 32 in all networks. In the refinement CNN, we add batch normalization [28] layers after all convolutional layers except the last one. Training deep neural networks is difficult because the distribution of each layer's inputs changes during training, a phenomenon called internal covariate shift, which slows down training. Batch normalization addresses the problem by normalizing each layer's inputs.
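The paper trains with MatConvNet; purely as an illustration of the stated hyperparameters, an equivalent optimizer setup in PyTorch might look like the following (the function name is hypothetical):

```python
import torch

def make_optimizers(fcn, refine_net):
    """SGD with momentum 0.9, as in Section 4: lr 1e-4 for the FCN and
    lr 1e-1 for the residual refinement CNN (the large lr is enabled by
    residual learning); the batch size of 32 is set in the data loader."""
    opt_fcn = torch.optim.SGD(fcn.parameters(), lr=1e-4, momentum=0.9)
    opt_ref = torch.optim.SGD(refine_net.parameters(), lr=1e-1, momentum=0.9)
    return opt_fcn, opt_ref
```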
We compare our method with one non-parametric method [18] and two parametric methods [12,13]. For the quantitative evaluation, we measure several errors that are standard in depth estimation. Let d and g denote the predicted depth and the ground-truth, respectively. N is the number of test images, i indicates the pixel index, and T is the total number of pixels over all test images. The measures are as follows:

root mean squared error in log space (log): $\sqrt{\frac{1}{T}\sum_{i}\left(\log d_i - \log g_i\right)^2}$

accuracy with threshold (thr): the percentage of pixels such that $\max\!\left(\frac{d_i}{g_i}, \frac{g_i}{d_i}\right) = \delta < thr$, for $thr \in \{1.25, 1.25^2, 1.25^3\}$.
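Under the definitions above, the two measures can be sketched as follows (NumPy; the arrays d and g hold the valid pixels of the test set, and the function names are our own):

```python
import numpy as np

def rmse_log(d, g):
    """Root mean squared error in log space over all valid pixels."""
    return np.sqrt(np.mean((np.log(d) - np.log(g)) ** 2))

def threshold_accuracy(d, g, thr=1.25):
    """Percentage of pixels whose ratio delta = max(d/g, g/d) is below thr."""
    delta = np.maximum(d / g, g / d)
    return 100.0 * np.mean(delta < thr)
```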
We analyze the performance of the proposed method in comparison with the competing algorithms. Fig. 4 shows our results. As mentioned in the previous sections, the FCN output (Fig. 4(c)) estimates global depth values well, but its depth boundaries are very smooth. By aggregating the FCN output with the warped depth maps through the CNN, the depth boundaries become much sharper. Furthermore, small objects, such as the projector and the bookshelves in the background, are refined, as shown in Fig. 4(d). Since no refinement by joint filtering techniques is performed, our method is also more trustworthy. Fig. 5 illustrates the qualitative comparison between our method and the competing methods. In general, the non-parametric methods perform poorly in estimating depth. The method of Eigen et al. infers plausible depth, but its outputs are very smooth. On the other hand, our method not only estimates global depth values but also preserves depth boundaries. Table 1 shows the quantitative evaluation with the above measures; it also shows that our method outperforms the non-parametric methods.
Fig. 4. Qualitative evaluation: (a) input image, (b) ground-truth, (c) FCN output, (d) ours.
Fig. 5. Comparison with competing algorithms: (a) input image, (b) ground-truth, (c) Depth Fusion [18], (d) Make3D [12], (e) Eigen et al. [13], (f) ours.
Table 1. Quantitative evaluation with competing algorithms.
5. CONCLUSION
This paper has presented a depth estimation method for a single monocular scene that combines a popular FCN with a non-parametric method. The FCN extracts globally plausible depth, whereas the outputs of the non-parametric method preserve local structures thanks to the Patch-Match-based warping process. These virtues are aggregated by a CNN that learns only the residual. Since we adopt recent techniques (i.e., batch normalization and residual learning) in training our deep networks, they converge quickly and produce better results. Experimental results show that the proposed method outperforms existing methods in estimating both the global and local structures of depth. In our experiments, we tried to use a large RGB-D dataset: it is well known that a large number of training images improves the performance of deep neural networks and helps to avoid overfitting, and non-parametric methods also perform better with a large database. However, a large database makes it harder to retrieve training images for the non-parametric component, so fast image search algorithms could further improve the proposed method. In future work, we will also extend the proposed method by jointly training the parametric and non-parametric components.
References
[1] X. Ren, L. Bo, and D. Fox, "RGB-(D) Scene Labeling: Features and Algorithms," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 2759-2766, 2012.
[2] D. Lin, S. Fidler, and R. Urtasun, "Holistic Scene Understanding for 3D Object Detection with RGBD Cameras," Proceedings of IEEE International Conference on Computer Vision, pp. 1417-1424, 2013.
[3] J. Jeon, S. Cho, X. Tong, and S. Lee, "Intrinsic Image Decomposition Using Structure-Texture Separation and Surface Normals," Proceedings of European Conference on Computer Vision, pp. 218-233, 2014.
[4] O. Wang, M. Lang, and M. Gross, "StereoBrush: Interactive 2D to 3D Conversion Using Discontinuous Warps," Proceedings of the 8th Eurographics Symposium on Sketch-Based Interfaces and Modeling, pp. 47-54, 2011.
[5] H. Yuan, S. Wu, P. Cheng, P. An, and S. Bao, "Nonlocal Random Walks Algorithm for Semi-Automatic 2D-to-3D Image Conversion," IEEE Signal Processing Letters, Vol. 22, No. 3, pp. 371-374, 2015. https://doi.org/10.1109/LSP.2014.2359643
[6] J. Atick, P. Griffin, and N. Redlich, "Statistical Approach to Shape from Shading: Reconstruction of Three-Dimensional Face Surfaces from Single Two-Dimensional Images," Neural Computation, Vol. 8, No. 6, pp. 1321-1340, 1996. https://doi.org/10.1162/neco.1996.8.6.1321
[7] R. Szeliski and P. Torr, "Geometrically Constrained Structure from Motion: Points on Planes," 3D Structure from Multiple Images of Large-Scale Environments, Springer Berlin Heidelberg, pp. 171-186, 1998.
[8] A. Oliva and A. Torralba, "Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope," International Journal of Computer Vision, Vol. 42, No. 3, pp. 145-175, 2001. https://doi.org/10.1023/A:1011139631724
[9] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 886-893, 2005.
[10] K. Karsch, C. Liu, and S. Kang, "Depth Extraction from Video Using Non-parametric Sampling," Proceedings of European Conference on Computer Vision, pp. 775-788, 2012.
[11] S. Choi, D. Min, B. Ham, Y. Kim, C. Oh, and K. Sohn, "Depth Analogy: Data-Driven Approach for Single Image Depth Estimation Using Gradient Samples," IEEE Transactions on Image Processing, Vol. 24, No. 12, pp. 5953-5966, 2015. https://doi.org/10.1109/TIP.2015.2495261
[12] A. Saxena, M. Sun, and A. Ng, "Make3D: Learning 3D Scene Structure from a Single Still Image," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31, No. 5, pp. 824-840, 2009. https://doi.org/10.1109/TPAMI.2008.132
[13] D. Eigen, C. Puhrsch, and R. Fergus, "Depth Map Prediction from a Single Image Using a Multi-Scale Deep Network," Advances in Neural Information Processing Systems, pp. 2366-2374, 2014.
[14] F. Liu, C. Shen, and G. Lin, "Deep Convolutional Neural Fields for Depth Estimation from a Single Image," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 5162-5170, 2015.
[15] J. Long, E. Shelhamer, and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.
[16] C. Liu, J. Yuen, and A. Torralba, "SIFT Flow: Dense Correspondence across Scenes and Its Applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33, No. 5, pp. 978-994, 2011. https://doi.org/10.1109/TPAMI.2010.147
[17] C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman, "PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing," ACM Transactions on Graphics, Vol. 28, No. 3, p. 24, 2009. https://doi.org/10.1145/1531326.1531330
[18] J. Konrad, M. Wang, P. Ishwar, C. Wu, and D. Mukherjee, "Learning-Based, Automatic 2D-to-3D Image and Video Conversion," IEEE Transactions on Image Processing, Vol. 22, No. 9, pp. 3485-3496, 2013. https://doi.org/10.1109/TIP.2013.2270375
[19] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs," arXiv preprint arXiv:1412.7062, 2014.
[20] L. C. Chen, J. Barron, and A. L. Yuille, "Semantic Image Segmentation with Task-Specific Edge Detection Using CNNs and a Discriminatively Trained Domain Transform," arXiv preprint arXiv:1511.03328, 2015.
[21] F. Liu, C. Shen, G. Lin, and I. D. Reid, "Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 5162-5170, 2015.
[22] W. Chen, Z. Fu, D. Yang, and J. Deng, "Single-Image Depth Perception in the Wild," arXiv preprint arXiv:1604.03901, 2016.
[23] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
[24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision, Vol. 115, No. 3, pp. 211-252, 2015. https://doi.org/10.1007/s11263-015-0816-y
[25] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
[26] J. Kim, J. Lee, and K. Lee, "Accurate Image Super-Resolution Using Very Deep Convolutional Networks," arXiv preprint arXiv:1511.04587, 2015.
[27] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor Segmentation and Support Inference from RGBD Images," Proceedings of European Conference on Computer Vision, pp. 746-760, 2012.
[28] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," Proceedings of International Conference on Machine Learning, pp. 448-456, 2015.
[29] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, "Joint Bilateral Upsampling," ACM Transactions on Graphics, Vol. 26, No. 3, p. 96, 2007. https://doi.org/10.1145/1276377.1276497
[30] K. He, J. Sun, and X. Tang, "Guided Image Filtering," Proceedings of European Conference on Computer Vision, pp. 1-14, 2010.
[31] D. Lee and S. Kwon, "A Recognition Method for Moving Objects Using Depth and Color Information," Journal of Korea Multimedia Society, Vol. 19, No. 4, pp. 681-688, 2016. https://doi.org/10.9717/kmms.2016.19.4.681
[32] S. Kim and H. Kang, "Semantic Segmentation of Indoor Scenes Using Depth Superpixel," Journal of Korea Multimedia Society, Vol. 19, No. 3, pp. 531-538, 2016. https://doi.org/10.9717/kmms.2016.19.3.531