
Deep Learning-based Depth Map Estimation: A Review

  • Abdullah, Jan (Major in Civil Engineering, School of Architectural, Civil, Environmental, and Energy Engineering, Kyungpook National University) ;
  • Safran, Khan (Major in Civil Engineering, School of Architectural, Civil, Environmental, and Energy Engineering, Kyungpook National University) ;
  • Suyoung, Seo (Major in Civil Engineering, School of Architectural, Civil, Environmental, and Energy Engineering, Kyungpook National University)
  • Received : 2023.01.17
  • Accepted : 2023.02.16
  • Published : 2023.02.28

Abstract

In this technically advanced era, we are surrounded by smartphones, computers, and cameras, which help us store visual information in 2D image planes. However, such images lack the 3D spatial information about the scene that is very useful for scientists, surveyors, engineers, and even robots. To address this problem, depth maps are generated for the respective image planes. A depth map, or depth image, is a single-channel image that carries the scene's three-dimensional information, i.e., xyz coordinates, where z is the object's distance from the camera axis. Depth estimation is a fundamental task for many applications, including augmented reality, object tracking, segmentation, scene reconstruction, distance measurement, autonomous navigation, and autonomous driving. Much work has been done to calculate depth maps. We reviewed the status of depth map estimation using different techniques from several papers, study areas, and models applied over the last 20 years. We surveyed different depth-mapping techniques based on traditional approaches and newly developed deep-learning methods. The primary purpose of this study is to present a detailed review of state-of-the-art traditional depth mapping techniques and recent deep learning methodologies. This study covers the critical points of each method from different perspectives, such as datasets, procedures, types of algorithms, loss functions, and well-known evaluation metrics. Similarly, this paper also discusses the subdomains of each method, such as supervised, unsupervised, and semi-supervised approaches, and elaborates on the challenges of the different methods. At the conclusion of this study, we discuss new ideas for future research and studies in depth map estimation.

Keywords

1. Introduction

Images are the primary source of visual information. With the advancement of computer vision technologies and computational photogrammetry, image data are utilized for many useful purposes like image segmentation, classification, object detection, simultaneous localization and mapping, autonomous driving, and depth mapping. Over the past few decades, much work has been done in 3D photogrammetry, especially depth mapping. Depth map estimation is a traditional computer vision task used to calculate the distance information of a scene along three-dimensional axes. To recover three-dimensional spatial information, depth estimation leverages two-dimensional data acquired by vision sensors such as cameras, smartphones, or time-of-flight sensors. When a vision sensor takes an image, it transfers the 3D scene from the real world onto a 2D image plane. For visual purposes, the 2D image is satisfactory; however, it does not supply depth or distance information for the 3D points relative to the camera axis. To obtain each pixel's distance, depth maps are produced for the corresponding RGB images. A depth map is a single-channel image that carries a distance or depth value for each pixel. A typical example of a depth map is shown in Fig. 1. The color intensity in the depth map corresponds to the pixel's depth or distance from the camera axis. Depth maps can be captured using specialized hardware such as the Microsoft Kinect, stereo cameras, lidar, sonar, radar, or depth cameras.


Fig. 1. Image with the respective depth map.

Depth mapping has been studied extensively over the past few decades. Moravec (1990) developed the first stereo-vision system for generating depth maps for robot navigation; the robot could navigate up to 20 m in its surroundings. New depth sensors then debuted in the late 1980s and early 2000s. The development of sonar (Leonard and Durrant-Whyte, 1991) and lidar (Hall, 2011) altered the nature of simultaneous localization, mapping, and navigation systems. The inventions of sonar and lidar were fascinating but had significant drawbacks. The fundamental limitation of lidar and sonar at the time was that they only extracted information along one axis, i.e., only in the plane directly in front of the detector. To overcome these limitations, 3D lidars and depth cameras such as the Kinect were introduced. Currently, depth maps are generated with the help of computer-aided software and deep-learning techniques.

Now that we have a basic idea of depth maps and how they are generated, we present a detailed overview of depth maps based on our analysis of existing and newly published papers. Furthermore, we also suggest directions for future research based on this analysis. One point worth noting is that our analysis differs from other reviews because we cover all types of depth mapping techniques, ranging from traditional and machine learning approaches to newly advanced deep learning methods.

2. Materials and Methods

We group our literature review and existing work into two main categories to understand depth mapping. The first section covers conventional or non-deep-learning techniques, which utilize hand-crafted features and assumptions to develop depth maps. This portion is mainly related to epipolar geometry and classic computer vision techniques. The second section provides an overview of deep learning methods, demonstrating the applications of advances in deep learning with some critical analysis and loss functions. Theories about neural networks relevant to this study are also presented. A schematic diagram of the different depth mapping techniques is shown in Fig. 2. The primary focus of this study is the second portion.


Fig. 2. Different methods used for development of depth maps.

2.1. Traditional methods / Hand-crafted techniques

Back in the 1830s, Wheatstone (1838) demonstrated depth perception in the human visual system. He invented a stereoscope that showed conclusively that the brain uses horizontal disparity to find the relative depth of an object with respect to the intersection point of the two optical axes in the physical world. This process is called stereopsis. Since Wheatstone's (1838) discovery, most studies on depth perception have centered on figuring out the mechanism underpinning disparity computation. Only a few decades later, Helmholtz (1925, 2013) formally proposed a depth cue akin to stereopsis in its potency and practicality. He discovered that objects at various distances from the observer move across the retinal surface at varying speeds whenever there is translational motion. This motion parallax is the capability to extract 3D shape from motion.

Such discoveries were the foundation of depth perception, and these theories were later translated into mathematical formulae; with the help of triangulation laws and powerful computers, depth maps could be generated. Following the work of Wheatstone (1838), binocular depth estimation needs two cameras (like human eyes), and such cameras need to be fully calibrated. After camera calibration, the corresponding points in the two images must be estimated. They can be found with the corner detector of Harris and Stephens (1988), speeded-up robust features (SURF) (Bay et al., 2008), the scale-invariant feature transform (SIFT) (Lowe, 1999), or other techniques such as template matching (Hartley and Zisserman, 2004). Once the correspondences are matched, the disparity is calculated. Another option is to replicate the work of Helmholtz (1925, 2013), which is known as structure from motion. In this case, the camera's motion is incorporated to estimate the depth map: a single camera is moved with a known baseline, and images of the scene are captured. In this estimation, a single camera is used to calculate the disparity and depth.
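
To make this correspondence step concrete, the following sketch matches features between a stereo pair and fits a fundamental matrix with RANSAC using OpenCV. It is only an illustrative example under assumed file names and parameter values, not a procedure taken from the papers cited above.

```python
# Hedged sketch: SIFT matching plus RANSAC fundamental-matrix fitting with OpenCV.
# Image paths and parameter values are illustrative assumptions.
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(left, None)
kp2, des2 = sift.detectAndCompute(right, None)

# Nearest-neighbour matching with Lowe's ratio test to discard ambiguous matches.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# RANSAC rejects outliers while estimating the fundamental matrix.
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
print("kept", int(inlier_mask.sum()), "inlier matches out of", len(good))
```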

The stereo camera system has some significant drawbacks. The first is that it requires the cameras to be calibrated. Other disadvantages include estimating a proper baseline, identifying inliers, and removing outliers. Similarly, brightness variation, lens blur, and camera-to-object distance are major sources of error in a stereo camera system, as they degrade image features such as edges and corners. Such problems are addressed in the work of Seo (2018), who modeled the edge profile with two blur parameters and varying brightness combinations over camera-to-object distances. Also, the stereo camera system is not feasible to install in some places, making it less effective.

Another traditional technique is depth from focus. This method requires many photos of a single scene with different focus settings. It works on the focus-defocus principle, where the camera sensor is moved nearer to or farther from the focal point, and the depth is measured from such focus-defocus images (Pentland, 1987; Tang et al., 2015; Tang et al., 2017). The main drawback of this method is that it requires a stack of images of a single scene, making it less applicable to real-world scenarios (Nayar et al., 1996). So how do these stereo pair, structure from motion (SFM), or focus-defocus methods give us depth? We summarize such practices in simplified form below.

2.1.1. Stereo depth estimation in the traditional way

In a depth image, the scene's geometry is encoded in an intensity image format. The pixel intensity delivers the distance information of the corresponding point in the scene. A depth image is created by applying a perspective transformation to a 3D world point P = (X, Y, Z). Each depth pixel's value is simply Z, where Z represents the distance along the optical axis. Modeling depth requires a set of calibrated cameras separated by a constant baseline. Once the images are captured and corresponding points are matched, a fundamental matrix is calculated for the cameras. The outliers are removed using RANdom SAmple Consensus (RANSAC) (Derpanis, 2005) or another robust estimator. The inliers, or matched points, are used to calculate the disparity. The disparity is the inverse of depth, in which each pixel stores 1/Z rather than Z. The disparity between a pair of stereo cameras records the geometry of the respective scene. A simple stereo camera setup is shown in Fig. 3.


Fig. 3. Stereo camera system.

A point to remember is that disparity is directly proportional to baseline and inversely proportional to depth (Hartley and Zisserman, 2004). These methods work for both stereo vision and SFM.
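
This relation can be stated explicitly: for a rectified pair with focal length f (in pixels) and baseline B, the disparity is d = fB/Z, so depth is recovered as Z = fB/d. The following sketch, with assumed calibration values, converts a block-matching disparity map into metric depth; it is a minimal illustration rather than a full rectification pipeline.

```python
# Hedged sketch: disparity from block matching, then depth Z = f * B / d.
# Calibration values (focal length, baseline) are illustrative assumptions.
import cv2
import numpy as np

left = cv2.imread("left_rect.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_rect.png", cv2.IMREAD_GRAYSCALE)

# Simple block matcher applied to an already-rectified pair.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # fixed-point to pixels

focal_px = 720.0   # assumed focal length in pixels
baseline_m = 0.12  # assumed baseline in metres

depth = np.zeros_like(disparity)
valid = disparity > 0
depth[valid] = focal_px * baseline_m / disparity[valid]  # larger disparity -> closer point
```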

Similarly, a depth map can be created from a monocular view. Such methods are known as depth from focus (Pentland, 1987; Tang et al., 2015). The Gaussian lens law is used to reconstruct a 3D view of the scene and calculate depth from focus-defocus. This method requires a stack of images of the same scene with varying lens distance or blur, as explained by Nayar et al. (1996), Hartley and Zisserman (2004), and Tang et al. (2015).
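
For illustration, the Gaussian (thin) lens law 1/f = 1/u + 1/v relates the focal length f, the object distance u, and the sensor distance v; once the sensor distance that brings a point into sharpest focus is known, the object distance follows as u = fv/(v − f). The values in the sketch below are assumed for illustration only.

```python
# Hedged sketch of the Gaussian (thin) lens law behind depth from focus:
# 1/f = 1/u + 1/v, so the object distance u = f*v / (v - f) once the sensor
# distance v giving the sharpest focus is known. Values are illustrative.
def depth_from_focus(focal_length_mm: float, sensor_distance_mm: float) -> float:
    """Return object distance (mm) for the lens setting that brings it into focus."""
    return focal_length_mm * sensor_distance_mm / (sensor_distance_mm - focal_length_mm)

# Example: a 50 mm lens focused with the sensor 52 mm behind the lens.
print(depth_from_focus(50.0, 52.0))  # ~1300 mm, i.e., the object is about 1.3 m away
```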

2.1.2. Monocular Depth Estimation in the Traditional Way

Till now, we have discussed stereo vision techniques or multi-image theories. Now we will focus on monocular depth estimation using handcrafted features, probabilistic models, and non-parametric approaches. In the early days, single-image depth estimation was approached indirectly. The earliest such work known to us is that of Hoiem et al. (2005), who automatically reconstructed a 3D scene from a single RGB image for a virtual environment. In their work, they assumed the outdoor environment consists of the sky, the ground, and vertical objects attached to the ground. They used hand-crafted features and categorized superpixels into one of these three classes. Then, by positioning the objects vertically on the ground plane, they automatically generated the virtual 3D environment utilizing the three classes and this assumption. Nevertheless, the finished products have a nice appearance. This work introduced, for the first time, many significant recurring ideas in depth mapping.

Following the footsteps and theories of Hoiem et al. (2005), researchers such as Liu et al. (2010), Karsch et al. (2014), and Ladický et al. (2014) used these theories to conduct depth analysis, an approach they call depth from semantic segmentation. For the assessment of depth, semantic information is crucial: a computer vision system may benefit from prior knowledge of a certain semantic class. For instance, we may estimate depth by determining that one of two blue areas in a photograph is the sky and the other is water. Ladický et al. (2014) demonstrated that combining monocular depth features with semantic object labels enhances the performance. However, they rely on manually created features and superpixels to segment the image. Karsch et al. (2014) use a SIFT flow-based mechanism with K-nearest neighbor (KNN) transfer to estimate static background depths from single images, which they augment with motion information to better estimate moving foreground subjects in videos. This can result in better alignment, but it necessitates having the entire dataset available at runtime and performing costly alignment procedures.

Another early work in monocular depth estimation is that of Michels et al. (2005). In their setup, a small robotic vehicle uses reinforcement learning to steer around obstacles. A vision component simulates a 2D laser scanner to assess the distance to the nearest obstacle in each direction. Together, the reinforcement learning and 2D laser components make up the proposed architecture. In this instance, depth is inferred from a single image since it provides a broader range than common binocular vision. Their vision system was trained in a supervised manner as a linear regression with handcrafted features. The vision system of Michels et al. (2005) was formulated very differently from present-day computer vision approaches but still gave a fundamental idea for depth estimation, which many researchers used later in their work.

After the work of Michels et al. (2005), many scientists produced depth maps, but the work of Saxena et al. (2005) revolutionized the depth estimation approach. They used synthetic data to estimate depth maps; such datasets consist of RGB images with their depth maps. Saxena et al. (2005) applied custom-made filters to discrete patches of the input image to extract image information, and a depth value was calculated for each small patch. They used global cues to correctly detect absolute depth by applying the filters at multiple scales. Similarly, they also incorporated the characteristics of surrounding patches and features from the same column in the image to find exact depth values. Features from the same image column are utilized based on the observation that most structures in the photos are vertical. Additionally, histograms of the feature values are computed for each patch, and the difference between the histogram of a patch and those of its nearby patches is fed to the system to improve its comprehension of neighboring patches. Using a Markov random field model, they evaluated the absolute scales of several image patches and inferred the depth image.

Some other researchers, such as Karsch et al. (2014) and Yang et al. (2022), used non-parametric methods to estimate depth maps. In their work, the depth of a query image was obtained by fusing the depths of photos with related photometric content retrieved from a database. Until now, we have studied methods that mostly use non-parametric or hand-crafted features. Hand-crafted features were helpful when access to fast machines was limited, but there were major issues, such as stereo correspondence being easily lost for monocular images. Similarly, in many cases depth estimation depends on multi-view geometry, which requires post-processing and tedious alignment procedures. It is also important to consider practical issues such as the amount of time and memory needed to compute results for various applications (Khan et al., 2020; Masoumian et al., 2022). For example, applications like obstacle avoidance in self-driving need high accuracy with low memory usage and short computation time to estimate depth and avoid accidents.

To overcome the memory requirement issue, the use of monocular images with deep learning seems to be efficient. After years of progress, depth mapping was tackled with fast computers and machine learning techniques. We will focus on some deep learning techniques employed for depth mapping. As mentioned earlier, depth estimation is an ill-posed problem and requires extensive knowledge of features and information; therefore, such scene reconstruction is handled with deep learning. Our goal is to show different state-of-the-art methods and compare their results. A simple flow diagram is shown in Fig. 4 below. We will also discuss the strengths and drawbacks of state-of-the-art techniques.


Fig. 4. Comparison between traditional methods and deep learning methods.

2.1.3. Depth Estimation Using Deep Learning / Artificial Neural Networks

Artificial neural networks mimic the human biological neural system. They comprise multiple nodes, or neurons, which are interconnected. Artificial intelligence and machine learning techniques imitate how humans gain certain knowledge. A complex form of artificial neural network consisting of multiple hidden layers is called a deep neural network; the more layers in a system, the deeper the network. The number of hidden layers in an artificial neural network depends on the task's complexity and the application.

Deep learning is very versatile and can be used in many applications across various fields, such as medicine, chemistry, biology, astronomy, physics, and civil engineering. Taking advantage of deep learning, many researchers have used it to estimate depth maps. It is evident that depth maps are very complex and require much knowledge and skill, and several recent attempts have been made to estimate depth using deep convolutional neural networks. Different types of techniques, such as supervised, semi-supervised, or unsupervised learning, are adopted to create depth maps. In general, supervised learning is a sub-branch of artificial neural networks and machine learning. It is distinguished by the way it trains convolutional neural network (CNN) models to accurately classify data or predict outcomes using labeled datasets: the model receives all available input data and adjusts its weights until it is properly fitted. Unsupervised learning, commonly referred to as self-supervised machine learning, employs data without labels and ground truth; it derives patterns from unlabeled data. Such models are trained with datasets that are generated simultaneously during or before training, and the majority of datasets in unsupervised depth mapping are generated with a stereo camera system. This is especially helpful when experts in the field are unaware of common characteristics in a dataset. Finally, semi-supervised learning is a hybrid approach in which a small amount of labeled data is combined with a large amount of unlabeled data during training; it falls somewhere between unsupervised and supervised learning. It can be very advantageous, as a small amount of labeled data, when combined with unlabeled data, can significantly improve learning accuracy. The flow diagram of such methods can be seen in Fig. 4. We will briefly discuss some state-of-the-art methods and their literature to get more insight into their work. An overview of the different types of methods used for depth perception is shown in Fig. 5, and a description of each method is given in Table 1.


Fig. 5. Different learning methods for depth estimation.

Table 1. Comparison of different methods for depth estimation


2.1.3.1. Supervised Learning

One of the first attempts to predict the depth map from monocular images with convolutional neural networks was made by Eigen et al. (2014). In their work, they used a multi-scale deep neural network to reconstruct a depth map for a respective RGB image. They developed a two-stage network that predicted depth maps at a coarse level and a fine level. The coarse-level network generated depths at a global level from the entire image. The prediction from the coarse network, along with the original RGB image, is then passed into the fine network, which refines the predicted depth using local information. Eigen et al. (2014) took advantage of transfer learning, as their network was pre-trained on the ImageNet (Krizhevsky et al., 2012) classification dataset. Using pre-trained weights worked better because the network performs better if training is initialized with existing weight values rather than random initialization.

Eigen and Fergus (2015) modified Eigen et al. (2014)'s network and incorporated surface normals and semantic labels. This time they made their network deeper by incorporating more convolutional layers to estimate multichannel feature maps. They created a three-stage network where the outputs of the first stage were fed into the subsequent stages. They used a single-scale, one-stack network, combining depth and normals to share the computational load; hence, they simultaneously predict the depth and normals for a single RGB image. Their new algorithm was 1.6 times faster than their previous one. Similarly, they used the same loss function as in their previous work, shown in the following:

\(\begin{aligned} L\left(y, y^{*}\right) & =\frac{1}{n} \sum_{i} d_{i}^{2}-\frac{1}{2 n^{2}}\left(\sum_{i} d_{i}\right)^{2} \\ & +\frac{1}{n} \sum_{i}\left[\left(\Delta_{x} d_{i}\right)^{2}+\left(\Delta_{y} d_{i}\right)^{2}\right]\end{aligned}\)       (1)

In Eq. (1), y is the ground truth depth, y* is the predicted depth, n is the number of depth pixels, and di = log yi − log yi* is the per-pixel log difference. Δx di and Δy di are the horizontal and vertical image gradients of the difference (Eigen and Fergus, 2015).
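
As an illustration, a scale-invariant log-depth loss with a gradient term in the spirit of Eq. (1) can be written in a few lines. The following is our own simplified PyTorch sketch, assuming dense and strictly positive depth maps; it is not the authors' released implementation.

```python
# Hedged PyTorch sketch of a scale-invariant log-depth loss with a gradient
# term, in the spirit of Eq. (1). Assumes dense, strictly positive depth maps.
import torch

def depth_loss(pred: torch.Tensor, gt: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    d = torch.log(pred) - torch.log(gt)          # per-pixel log difference d_i
    n = d.numel()
    scale_inv = (d ** 2).mean() - lam * d.sum() ** 2 / (n ** 2)
    # First-order differences approximate the horizontal/vertical gradients of d.
    grad_x = d[..., :, 1:] - d[..., :, :-1]
    grad_y = d[..., 1:, :] - d[..., :-1, :]
    grad_term = (grad_x ** 2).mean() + (grad_y ** 2).mean()
    return scale_inv + grad_term
```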

Their work demonstrated that a visual geometry group (VGG) (Simonyan and Zisserman, 2015)-based network performed best compared to the smaller Alex-Net (Krizhevsky et al., 2012), but it failed to achieve sharp transitions. When this algorithm was proposed, it achieved state-of-the-art monocular depth estimation results, showing the benefits of using multi-scale networks.

In image processing, edges and lines are important features used to detect object shapes in a scene. The deficiency in the above work of Eigen and Fergus can be overcome by post-processing the output results. The signal-to-noise ratio (SNR) method proposed by Seo (2021) can be applied to output depth maps to detect edges and lines. In that method, the SNR of a line is determined analytically, achieving quality results. We can enhance the output by enhancing the lines and edges in the image.

Liu et al. (2016) used conditional random fields with CNNs to develop depth maps from single images. Unlike the previous work of Eigen and Fergus (2015), who regressed depth maps directly from input images, Liu et al. (2016) used deep convolutional neural fields to model the relationship between neighboring parts of depth maps. Conditional random field (CRF) nodes modeled a probability distribution function to optimize the neural network. CRF nodes, or superpixels, were homogeneous regions of the input image. Their network consists of two parts (unary and pairwise): the unary convolutional part regresses the depth values, and the pairwise potential part encourages neighboring superpixels with the same appearance to take on the same depth values. The output of their network was a regressed depth for each superpixel. A point worth noting is that the qualitative results of Liu et al. (2016) were better, having sharper transitions and aligned local structures, but scale invariance was not considered in their proposed work.

Laina et al. (2016) proposed fully convolutional residual networks for depth estimation from a single image. In their work, they used simple up-sampling while removing fully connected layers, which made their network simpler, faster, and more efficient. They used Res-Net 50 (He et al., 2016) with pretrained weights as the encoder; using pretrained weights shortens the training time while giving good output results. Laina et al. (2016) experimented with different backbone architectures such as VGG (Simonyan and Zisserman, 2015), Alex-Net (Krizhevsky et al., 2012), and Res-Net (He et al., 2016), with Res-Net giving outstanding results. They used the BerHu loss (Zwald and Lambert-Lacroix, 2012) during training, as this loss performed well compared to the L2 loss when exploiting depth maps as ground truth. The outcomes of this study demonstrate the value of applying up-projections in the decoder to produce dense output predictions with improved resolution. Additionally, it demonstrates the possibility of employing Res-Net as the feature extractor rather than VGG or Alex-Net.
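
The BerHu (reverse Huber) loss behaves like the L1 loss for small residuals and like the L2 loss for large ones, with the threshold commonly tied to the largest residual in a batch. The sketch below is a common PyTorch formulation under that assumption; it is not the exact implementation of Laina et al. (2016).

```python
# Hedged PyTorch sketch of the BerHu (reverse Huber) loss:
# |x| for |x| <= c, (x^2 + c^2) / (2c) otherwise, with c chosen per batch.
import torch

def berhu_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    diff = torch.abs(pred - gt)
    c = (0.2 * diff.max()).detach().clamp(min=1e-6)  # common choice: 1/5 of the largest residual
    l1_part = diff
    l2_part = (diff ** 2 + c ** 2) / (2 * c)
    return torch.where(diff <= c, l1_part, l2_part).mean()
```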

Similarly, Wang et al. (2015) combined a global-context CNN and a regional CNN in a hierarchical CRF. In this research, the authors combined the work of Liu et al. (2016) and Eigen et al. (2014). The purpose of the global-context CNN was to estimate the semantic labels and depth in log space. The global-context CNN was a replica of Eigen et al. (2014)'s network, the only difference being that the last layer for semantic label prediction was removed in their proposed method. In addition, the regional CNN estimated the depth and semantic labels separately; it followed the same structure as the global CNN with some fine-tuning. The output from this network was semantic labels for segments and affinities to local depth templates. Such templates represent local structures such as corners, edges, and planes in a depth map. The problem with such a technique was that the algorithm could not estimate absolute depth, as absolute depth cannot be estimated only from local segments and predicted relative depth values. This issue was solved by subtracting the depth value of the pixel at the segment's center and rescaling to the range <0, 1>; absolute depths for the ground truth targets were also converted to relative depths. Their quantitative results were the best except for the threshold metric. The authors speculated that the worse performance on the threshold metric was due to not considering scale invariance.

Roy and Todorovic (2016) proposed monocular depth estimation using a neural regression forest. Their model combined convolutional neural networks and a random regression forest with binary regression trees. Depth prediction for each pixel was estimated at the root tree node with the help of a convolutional window centered on the image window. The features from the network's final convolutional layers at a node are used as input to the child nodes. The employed CNNs contain pooling layers, which causes the resolution to decrease as the features move down the tree, producing a multi-scale representation of the input. The same procedure is carried out for each split node. The CNN has fully connected layers at each node that output the probability of sending the output to the left or right child node. Furthermore, to ensure smooth depth predictions, the probability distribution predicted by the tree for pixel p is adjusted by bilateral filtering to include the probability distributions predicted for pixels in p's neighborhood. Some quantitative measures improved in their proposed work, but the neural regression forest performed worse than the previously discussed methods. Their metrics show that requiring smoothness of depth values across surrounding pixels with comparable appearance can improve performance.

To predict depth from monocular images, Fu et al. (2018) developed a deep ordinal regression network (DORN) using a multi-scale approach. The depth range is discretized into several intervals using a space-increasing discretization (SID) technique. This approach considers that the uncertainty of depth prediction grows as depth values increase, allowing larger errors for larger depth values. Depth prediction is formulated as an ordinal regression problem, which tries to classify each pixel into a set of ordered categories while still possessing the characteristics of both regression and classification. Fu et al. (2018) employed atrous spatial pyramid pooling with varying dilations and receptive fields of the filters. The purpose of spatial pyramid pooling was to avoid the repeated convolution and pooling operations that shrink feature map sizes, which is undesirable for dense prediction tasks. In their work, they compared VGG and Res-Net backbones and concluded that Res-Net performs best. The key outcomes are the benefit of the atrous spatial pyramid pooling (ASPP) module in producing feature maps with higher resolution and the superior performance of Res-Net over VGG.
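
The SID scheme places the bin edges uniformly in log space, so the intervals widen as depth grows. The following NumPy sketch illustrates the idea with an assumed depth range and bin count.

```python
# Hedged NumPy sketch of space-increasing discretization (SID): bin edges are
# uniform in log space, so intervals widen as depth grows. Range and bin count assumed.
import numpy as np

def sid_thresholds(alpha: float, beta: float, k: int) -> np.ndarray:
    """Return k+1 bin edges between alpha (min depth) and beta (max depth)."""
    i = np.arange(k + 1)
    return np.exp(np.log(alpha) + i * (np.log(beta) - np.log(alpha)) / k)

edges = sid_thresholds(alpha=1.0, beta=80.0, k=10)
print(np.round(edges, 2))  # near bins are narrow, far bins are wide
```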

Hu et al. (2018) proposed two main improvements for depth estimation tasks. First, they trained their encoder-decoder network to fuse features extracted at different scales with a refinement module. Second, they developed a loss function that measures inference error when training a model. They used the same Res-Net proposed by Fu et al. (2018) with skip connections, as it can capture more features with fewer trainable parameters. To overcome the loss of spatial resolution, Hu et al. (2018) used a multi-scale fusion module (MFM). This module combines discrete information at multiple scales into one representation. Such lower-level information helps recover finer spatial resolution, which can restore the minor details lost due to repeated downscaling. The resulting output showed crisp boundaries in depth maps. They also investigated the impact of various loss functions on the measurement of estimation errors near step edges, claiming that the gradient-difference loss is typically sensitive to positional shift and edge blur, unlike the depth-difference loss. Qualitative results show better output depth maps, but their network failed to accurately measure the reconstruction error of step edges when evaluating reconstruction accuracy.

Alhashim and Wonka (2019) applied transfer learning to estimate depth maps of a scene from a monocular image. They used an encoder-decoder network with weights pretrained on ImageNet (designed for image classification) to avoid training from random weights. Their network used a Dense-Net (Huang et al., 2018) backbone with skip connections. They introduced a novel loss function combining different losses to estimate the depth map of the respective RGB image. Transfer learning helped extract more features within less training time. The main drawback of their model was GPU memory occupancy: with a denser encoder, the number of parameters roughly doubles, which creates a big caveat in training and makes learning slow. In the decoder, the feature feed was reduced by half, which reduced performance instability. Another drawback was overfitting due to color augmentation in their model.

Kumari et al. (2019) generated depth images with a residual encoder-decoder CNN. In such an encoder-decoder network, the depth information is learned from pairs of color images and their corresponding depth maps. Their suggested model combines residual connections in the pooling and up-sampling layers with hourglass networks that analyze the encoded features at different scales. The inclusion of the hourglass module improved the output results, as this module emphasizes the flow of global to local information by analyzing the different scales; without it, the results were suboptimal. They also added a perceptual loss, which considers high-level features at different scales of abstraction. The addition of the perceptual loss helped the model converge faster.

2.1.3.2. Semi-Supervised and Self-Supervised Learning

Pixel-level ground truth depth datasets are always needed for supervised learning-based work. Most of the time, ground truth depth data are unavailable and difficult to annotate manually. To address this need, attention has recently switched to alternative training supervision signals. Such techniques are called semi-supervised or self-supervised training.

Garg et al. (2016) tried to solve the depth estimation task in an unsupervised manner. They proposed a network that could be trained without ground truth depth and justified their work by pointing to the weakness of supervised neural networks, which require a lot of training data. They used an autoencoder setup trained on pairs of stereo images to construct depth maps. The encoder is based on Alex-Net (Krizhevsky et al., 2012), in which the last connected layer is replaced with fully convolutional layers. Skip connections were introduced inside the network to get refined predictions; their advantage is combining global information with local information. During training, the network predicts the inverse depth of the left image. The right image, the disparity, and the predicted inverse depth of the left image are then used by the network to reconstruct the left image. In the training phase, the reconstruction loss is minimized by matching the reconstructed left image to the input.

Similarly, Godard et al. (2017) estimated monocular depth in an unsupervised manner. Their work was similar to the above-mentioned method (Garg et al., 2016), but they used a new loss with left-right consistency. Their proposed method showed that using only a reconstruction loss leads to poor depth predictions. Hence, they applied epipolar geometry constraints to generate disparity maps for both the left and right images using only the left image. The consistency of the disparity maps is enforced during the training phase, which improves the robustness and accuracy of the depth predictions. Their study demonstrated how a network can be trained in an unsupervised way for depth estimation using only stereo pairs of images.
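
The two supervision signals described above can be sketched as follows: the predicted left disparity warps the right image into the left view for a photometric reconstruction loss, and a left-right consistency term penalizes disagreement between the disparity maps. This is a simplified PyTorch rendering (the SSIM term and multi-scale details are omitted), not the authors' code.

```python
# Hedged PyTorch sketch of self-supervised stereo supervision: warp the right
# image into the left view with the predicted left disparity, then apply an L1
# photometric loss and a left-right disparity consistency term (SSIM omitted).
import torch
import torch.nn.functional as F

def warp_right_to_left(right: torch.Tensor, disp_left: torch.Tensor) -> torch.Tensor:
    """right: (B,C,H,W); disp_left: (B,1,H,W), disparity as a fraction of image width."""
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=right.device),
        torch.linspace(-1, 1, w, device=right.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1).clone()
    # Shift sampling locations left by the predicted disparity (in normalized units).
    grid[..., 0] = grid[..., 0] - 2.0 * disp_left.squeeze(1)
    return F.grid_sample(right, grid, align_corners=True)

def stereo_loss(left, right, disp_left, disp_right_warped, w_lr: float = 1.0):
    recon_left = warp_right_to_left(right, disp_left)
    photometric = torch.abs(recon_left - left).mean()
    lr_consistency = torch.abs(disp_left - disp_right_warped).mean()
    return photometric + w_lr * lr_consistency
```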

Another work that performed unsupervised learning for depth mapping is that of Zhou et al. (2017). Using photometric loss as supervision, they created an unsupervised learning framework to simultaneously estimate depth and ego-motion from a monocular camera. The dataset was extracted from video recordings: the videos were captured with a single camera, and because the camera location moves slightly between frames, consecutive frames behave like a stereo pair. Succeeding frames are utilized for training the network. A separate network was developed to estimate the camera motion between frames; hence, the camera calibration required for a stereo camera system was eliminated. Such a method performed better, but certain assumptions must hold true. The authors train a second network to estimate how well their model can explain each pixel. To prevent bad pixels, such as pixels on moving objects, from negatively affecting the training process, they weight the loss from each pixel using the value predicted for that pixel.

Following the trend, Casser et al. (2019) proposed another unsupervised approach. Their method can handle dynamic scenes by modeling object motion. They proposed a motion model similar to the ego-motion model, but differing in that it can precisely predict the motion of each object in 3-dimensional (3D) space. Their model takes pre-segmented RGB input image sequences; the dataset with segmentation information is fed into the motion model, which is then trained to predict the transformation vectors for each object in 3D space that produce the observed object appearance in the respective target frame. Their method not only models objects in 3D but also learns their motion on the fly. This is a fundamental method for modeling depth independently for the scene and for each object.

Chen et al. (2018) estimated depth from RGB images given depth at a very sparse set of pixels. Their work demonstrated that it is possible to efficiently transform sparse depth measurements, obtained using low-power depth sensors or simultaneous localization and mapping (SLAM) systems, into high-quality dense depth maps. To enhance the monocular depth estimation, they used semantic segmentation. The same network framework underlies both depth estimation and semantic segmentation, switching according to conditions. Region-aware depth estimation is performed using a novel left-right semantic consistency term, which increases both tasks' accuracy and resilience. Their model was trained in a semi-supervised manner with a stereo camera setup. They claim that their sparse depth information is very flexible, which helps the model generalize to multiple scenes and work for both indoor and outdoor scenes.

Mahjourian et al. (2018) presented a new unsupervised learning technique in which the system uses the whole 3D geometry of the scene. Unlike previous work, which uses ego-motion with pixel-wise or gradient-based losses in a small neighborhood, this technique considers the entire 3D scene and enforces consistency of the estimated 3D point clouds and ego-motion across consecutive frames. Backpropagation was used for the alignment of 3D structures. Although their algorithm performed better than previously discussed work, it fails when an object moves between two frames, as their loss function then misestimates the depth. Their system also needs improvement in largely dynamic scenes.

Finally, Goldman et al. (2019) used a two-stream Siamese network (Koch et al., 2015) for self-supervised monocular depth estimation. Two twin networks make up the Siamese design, each of which learns to predict a disparity map from a single image; however, only one of these networks is used at test time to determine depth. Their network was trained on pairs of stereo images simultaneously and symmetrically in a self-supervised way. The architecture expected input from stereo-pair cameras, manually calibrated and placed at a proper baseline. The network processed the input data in a self-supervised manner without labels; no ground truth or labeled depth was provided during training. The Siamese neural network (Koch et al., 2015) performed very well. Still, the training time doubled due to the two-stream behavior, as two networks were training on different images simultaneously. Similarly, the drawbacks of the traditional stereo camera system and its post-processing also apply to this method.

Up until this point, we have discussed various state-of-the-art methods for estimating monocular depth maps and seen how researchers tried to solve this problem using different models and learning methodologies. Table 2 provides a brief overview of all of the above-mentioned methods, their learning techniques, and their output results.

Table 2. Comparison of different state-of-the-art methods based on their learning methodology and output results


In Fig. 6, the 1st row shows the input RGB images, the 2nd row the ground truth, the 3rd row the results of Laina et al. (2016), the 4th row the results of He et al. (2018), and the 5th row the results of Kumari et al. (2019).


Fig. 6. Comparison of different state-of-the-art methods using the NYU v2 dataset taken from Kumari et al. (2019).

3. Results and Discussion

Depth estimation in general is considered a challenging task. The development of depth estimation still needs to focus on improving accuracy, transferability, and real-time performance. Most of the previous work in depth mapping depends on the hardware setup; such methods require extensive calibration and preprocessing to estimate the depth maps. Also, there was no standard evaluation method to assess the accuracy of the predicted depth maps. With recent developments in deep learning and convolutional neural networks, models are trained with huge datasets and evaluated on test datasets with established evaluation metrics. Several well-known datasets are used for training depth estimation networks; they are discussed in this section. With these datasets, we can also evaluate the performance of networks through qualitative as well as quantitative measurement of the predicted depth maps.

3.1. Datasets for Depth map

3.1.1. NYU V2 Dataset

This dataset is made from video sequences of various indoor settings. Its images and depth data come from the Microsoft Kinect RGB-D sensor. The dataset is made up of 464 video sequences of indoor scenes. There are 120k training samples and 654 testing samples.

3.1.2. DIODE Dataset

Dense Indoor and Outdoor DEpth (DIODE) is another dataset used for the depth estimation task. The DIODE dataset includes a variety of high-resolution color images with precise, dense, far-field depth measurements. It is the first publicly available dataset to contain RGB-D photos of both indoor and outdoor scenes collected using a single sensor suite. There are 8,574 training images and 753 test images of indoor scenes, and 16,884 training and 876 test images of outdoor scenes.

3.1.3. KITTI Dataset

The Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset is an outdoor dataset for deep learning-based object tracking, object detection, and monocular depth estimation. The KITTI dataset is captured with a vehicle driven through road environments. The vehicle was outfitted with two high-resolution color cameras, two gray-scale cameras, a laser scanner, and a GPS, with a maximum measuring range of 120 m. The dataset includes 93,000 RGB-D training samples from the city of Karlsruhe and the highway.

3.1.4. Cityscapes

Cityscapes is a large, diverse dataset of stereo video sequences shot on the streets of 50 different cities. It is a large-scale dataset that focuses on semantic understanding of urban street scenes, providing semantic, instance-wise, dense pixel annotations for 30 classes divided into 8 categories. The dataset includes approximately 5,000 finely annotated images and 20,000 coarsely annotated images. Data were collected over several months, during daylight hours and in good weather.

3.1.5. Make3D

The Make3D dataset is a monocular depth estimation dataset with 400 training images and 134 test samples. These images have a resolution of 2,272 × 1,704, while the ground truth depth maps have a resolution of 55 × 305. The depth maps in the Make3D dataset were obtained using a laser scanner and comprise daytime city and natural scenery. The depth ranges from 5 to 81 m, and any area beyond that is consistently mapped to 81 m.

3.1.6. Middlebury Dataset

The Middlebury Stereo dataset comprises pixel-accurate ground truth disparity data and high-resolution stereo sequences with complicated geometry. This dataset was captured with a structured-light scanner. Middlebury is an indoor scene dataset of 33 high-resolution, 6-megapixel photos. Two stereo DSLR cameras and two point-and-shoot cameras were used to capture the images. The images in this dataset have a resolution of 2,872 × 1,984.

3.2. Evaluation Metrics in Depth Estimation Tasks

There are a variety of evaluation metrics for monocular depth estimation. For model training, the datasets are split into training and test samples. Once the model is trained, the test samples are used to evaluate the model's performance and check the accuracy of the predicted output. To assess and compare the effectiveness of depth estimation systems, the evaluation measures defined by Eigen et al. (2014) are used. The evaluation metrics of Eigen et al. (2014) are taken as a standard and widely used in depth estimation tasks. Both error and accuracy measurements are used in the evaluation, as shown in the following:

Average relative error

\(\begin{aligned}(\operatorname{Rel})=\frac{1}{n} \sum_{p}^{n} \frac{\left|y_{p}-y_{p}^{\prime}\right|}{y_{p}}\end{aligned}\)       (2)

Root mean squared error

\(\begin{aligned}(\mathrm{RMS})=\sqrt{\frac{1}{n} \sum_{p}^{n}\left(y_{p}-y_{p}^{\prime}\right)^{2}}\end{aligned}\)       (3)

Average error

\(\begin{aligned}\left(\log _{10}\right)=\frac{1}{n} \sum_{p}^{n}\left|\log _{10}\left(y_{p}\right)-\log _{10}\left(y_{p}^{\prime}\right)\right|\end{aligned}\)       (4)

Threshold accuracy (δi)

\(\begin{aligned}\begin{array}{c}(\delta i)=\% \text { of } y_{p} \text { such that } \max \left(\frac{y_{p}}{y_{p}^{\prime}}, \frac{y_{p}^{\prime}}{y_{p}}\right)<1.25^{i} \\ (i=1,2,3)\end{array}\end{aligned}\)       (5)

In the above metrics, yp is the pixel value in the ground truth depth image y, yp′ is the pixel value in the predicted depth image y′, and n in Eqs. (2), (3), and (4) is the total number of pixels in each depth image.
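
These metrics can be implemented in a few lines; the following NumPy sketch assumes flattened, strictly positive ground truth and predicted depth arrays.

```python
# Hedged NumPy sketch of the evaluation metrics in Eqs. (2)-(5).
# Assumes y (ground truth) and y_pred are flattened, strictly positive arrays.
import numpy as np

def depth_metrics(y: np.ndarray, y_pred: np.ndarray) -> dict:
    rel = np.mean(np.abs(y - y_pred) / y)                     # Eq. (2)
    rms = np.sqrt(np.mean((y - y_pred) ** 2))                 # Eq. (3)
    log10 = np.mean(np.abs(np.log10(y) - np.log10(y_pred)))   # Eq. (4)
    ratio = np.maximum(y / y_pred, y_pred / y)                # Eq. (5)
    deltas = {f"delta_{i}": np.mean(ratio < 1.25 ** i) for i in (1, 2, 3)}
    return {"rel": rel, "rms": rms, "log10": log10, **deltas}
```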

We used publicly available pre-trained networks tested on the NYU Depth v2 and KITTI datasets in order to assess and contrast various deep learning-based monocular depth estimation approaches. The results are shown in Table 3, where performance evaluations of the various approaches are given along with the error and accuracy metrics.

Table 3. Quantitative comparison of different state-of-the-art methods


It is evident that with advances in deep learning techniques, the depth estimation problem is being solved with very high accuracy. In Table 3, we can observe that different researchers used different approaches to develop depth maps. The majority of techniques are evaluated on the publicly available NYU Depth v2 and KITTI datasets, as they are considered benchmarks for depth maps. The nature of the two datasets differs, but both are used to develop depth maps; hence, researchers adopt these datasets based on their needs. The overall evaluation metrics for both datasets are the same, as described in Section 3.2.

In our review, we found that the method of Hu et al. (2018) achieved the best results on the accuracy measures, while the model of Kumari et al. (2019) performed better on the error metrics. Both of these top-ranking works followed the supervised learning methodology, where RGB images with their respective depth maps were input to the models. Hence, we can summarize that if a model is trained in a supervised manner, it learns features dominantly and makes better predictions with high accuracy and minimal loss. However, supervised learning does not solve the whole problem. There is much room for improvement in semi-supervised and unsupervised learning, which may soon surpass supervised learning, because supervised learning datasets are not diverse: they have a constant and specific nature, such as only indoor scenes, outdoor scenes, or road environments. Hence, in the near future they cannot be applied to situations where the scene nature or geometry differs completely from the input dataset. The trend is rapidly switching toward unsupervised and semi-supervised learning, as these methods can be applied to any situation without providing labeled datasets.

4. Future Work and Challenges

We have examined the evolution of the deep learning-based depth estimation approaches used in the various publications, how these techniques have changed through time, and how their approaches differ in some specific ways. With each step, we gain insight into the potential future scope of work in this field. After all, depth estimation is still a difficult task. Accuracy and real-time performance will remain the main priorities in depth map estimation. In the past, geometric structures and conventional computer vision methods were used to create depth maps. Some researchers tried to produce depth maps from single images, but the majority of research at that time focused on multi-view or stereo-vision depth maps. With advances in machine learning and deep neural networks, CNNs have demonstrated a potent capacity to properly predict dense depth maps from a single image with acceptable image processing performance. Numerous studies have looked into the various depth network cues needed for monocular depth estimation in recent years. However, there are still some limitations to be overcome. We present future work suggestions that focus on improving network performance.

4.1. Training

Currently, depth datasets are very limited. New and larger labeled datasets could aid in the training of more accurate networks. Additional data augmentations, such as spatial scaling, color, brightness, and contrast augmentations, could be used to provide more varied data to the network and, hopefully, teach it to generalize better to new scenes. Similarly, to improve accuracy, researchers deepen the layers of deep neural networks, which increases memory usage and space complexity; thus, lightweight models should be developed to improve accuracy and training time.
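
As an illustration of such augmentations, the sketch below composes spatial and photometric transforms with torchvision. The specific transforms and magnitudes are assumptions for illustration, and the matching geometric transforms would also have to be applied to the paired depth maps.

```python
# Hedged torchvision sketch of the augmentations mentioned above (spatial scaling,
# colour, brightness, contrast). Magnitudes are illustrative; geometric transforms
# must also be applied to the paired depth map, which is omitted here.
from torchvision import transforms

rgb_augment = transforms.Compose([
    transforms.RandomResizedCrop(size=(228, 304), scale=(0.8, 1.0)),  # spatial scaling
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])
```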

4.2. Encoder and Decoder

Currently, encoder-decoder networks are typically used for monocular depth estimation. The depth features are severely lost after multiple layers of information processing, resulting in low-accuracy estimated depth maps that do not meet the requirements of practical applications. To train on depth estimation tasks, more efficient, lightweight, and denser encoders and decoders should be used. A variety of attention modules can also be incorporated into new networks; attention enhances the features with only a small increase in model parameters.

4.3. Objects in Motion and Occlusions

Realistic scenes are typically complicated, with many moving objects, occlusions, lighting changes, and weather changes. However, most existing depth estimation models only consider ideal conditions. Much work is needed to deal with dynamic objects and occluded scenes. Recently, some progress has been made in determining how to better estimate the depth of complex scenes to meet practical applications. Incorporating semantic, normal, and stereo-geometric information to train the network may help reduce the error.

5. Conclusions

With this review, we hope to advance the field of deep learning-based depth estimation. In light of this, we reviewed the relevant works on depth estimation from the perspective of the types of methods used and the training, including supervised, unsupervised, and semi-supervised learning, together with the loss functions and network frameworks used. We also covered some pressing issues and difficulties and proposed suggestions and encouraging lines of inquiry for further investigation. This paper provided a summary of the contributions made by this developing field to depth estimation using computer vision and deep learning. We attempted to review the most recent studies on depth estimation from many aspects, such as data input types, training methods, and learning methodologies, in combination with the various datasets and evaluation indicators. The architecture of deep learning models needs to be enhanced in the future to increase the accuracy and dependability of the proposed networks and decrease the inference time.

Acknowledgments

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2016R1D1A1B02011625).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Alhashim, I. and Wonka, P., 2019. High quality monocular depth estimation via transfer learning. arXiv preprint arXiv:1812.11941. https://doi.org/10.48550/arXiv.1812.11941 
  2. Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L., 2008. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3), 346-359. https://doi.org/10.1016/j.cviu.2007.09.014 
  3. Casser, V., Pirk, S., Mahjourian, R., and Angelova, A., 2019. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 8001-8008. https://doi.org/10.1609/aaai.v33i01.33018001 
  4. Chen, Z., Badrinarayanan, V., Drozdov, G., and Rabinovich, A., 2018. Estimating depth from RGB and sparse sensing. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, Sept. 8-14, pp. 176-192. https://doi.org/10.1007/978-3-030-01225-0_11 
  5. Derpanis, K. G., 2005. Overview of the RANSAC Algorithm. Image Rochester NY, 4(1), 2-3. 
  6. Eigen, D. and Fergus, R., 2015. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, Dec. 11-18, pp. 2650-2658. 
  7. Eigen, D., Puhrsch, C., and Fergus, R., 2014. Depth map prediction from a single image using a multi-scale deep network. Advances in Neural Information Processing Systems 27 (NIPS 2014). 
  8. Fu, H., Gong, M., Wang, C., Batmanghelich, K., and Tao, D., 2018. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, June 18-23, pp. 2002-2011. 
  9. Garg, R., Bg, V. K., Carneiro, G., and Reid, I., 2016. Unsupervised CNN for single view depth estimation: geometry to the rescue. In Proceedings of the Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, Oct. 11-14, pp. 740-756. 
  10. Godard, C., MacAodha, O., and Brostow, G. J., 2017. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, July 21-26, pp. 270-279. 
  11. Goldman, M., Hassner, T., and Avidan S., 2019. Learn stereo, infer mono: Siamese networks for self-supervised, monocular, depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, June 16-20. 
  12. Hall, D. S., 2011. U.S. Patent No. 7,969,558. Washington, DC: U.S. Patent and Trademark Office. 
  13. Harris, C. and Stephens, M., 1988. A combined corner and edge detector. In Proceedings of the Alvey Vision Conference, Manchester, UK, Aug. 31-Sept. 2, pp. 1-6. https://doi.org/10.5244/C.2.23 
  14. Hartley, R. and Zisserman, A., 2004. Multiple view geometry in computer vision (2nd ed.). Cambridge University Press. https://doi.org/10.1017/CBO9780511811685 
  15. He, K., Zhang, X., Ren, S., and Sun, J., 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, June 26-July 1, pp. 770-778. 
  16. He, L., Wang, G., and Hu, Z., 2018. Learning depth from single images with deep neural network embedding focal length. IEEE Transactions on Image Processing, 27(9), 4676-4689. https://doi.org/10.1109/TIP.2018.2832296 
  17. Helmholtz, H., 1924. Treatise on physiological optics, vol. 2. Dover Publications Inc. 
  18. Hoiem, D., Efros, A.A., and Hebert, M., 2005. Automatic photo pop-up. In Proceedings of the SIGGRAPH '05: Special Interest Group on Computer Graphics and Interactive Techniques Conference, Los Angeles, CA, USA, July 31-Aug. 4, pp. 577-584. https://doi.org/10.1145/1073204.1073232 
  19. Hu, J., Ozay, M., Zhang, Y., and Okatani, T., 2018. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, Jan. 7-11, pp. 1043-1051. https://doi.org/10.1109/WACV.2019.00116 
  20. Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q., 2018. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993. https://doi.org/10.48550/arXiv.1608.06993 
  21. Karsch, K., Liu, C., and Kang, S.B., 2014. Depth transfer: Depth extraction from video using non-parametric sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11), 2144-2158. https://doi.org/10.1109/TPAMI.2014.2316835 
  22. Khan, F., Salahuddin, S., and Javidnia, H., 2020. Deep learning-based monocular depth estimation methods-A state-of-the-art review. Sensors, 20(8), 2272. https://doi.org/10.3390/s20082272 
  23. Koch, G., Zemel, R., and Salakhutdinov, R., 2015. Siamese neural networks for one-shot image recognition. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, July 6-11, pp. 1-8. 
  24. Krizhevsky, A., Sutskever, I., and Hinton, G. E., 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, Dec. 3-6, pp. 1-9. 
  25. Kumari, S., Jha, R. R., Bhavsar, A., and Nigam, A., 2019. AutoDepth: Single image depth map estimation via residual CNN encoder-decoder and stacked hourglass. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, Sept. 22-25, pp. 340-344. https://doi.org/10.1109/ICIP.2019.8803006 
  26. Ladicky, L., Shi, J., and Pollefeys, M., 2014. Pulling things out of perspective. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, June 23-28, pp. 89-96. https://doi.org/10.1109/CVPR.2014.19 
  27. Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., and Navab, N., 2016. Deeper depth prediction with fully convolutional residual networks. In Proceedings of the 2016 4th International Conference on 3D Vision (3DV), Stanford, CA, USA, Oct. 25-28, pp. 239-248. https://doi.org/10.1109/3DV.2016.32 
  28. Leonard, J.J. and Durrant-Whyte, H. F., 1991. Mobile robot localization by tracking geometric beacons. IEEE Transactions on Robotics and Automation, 7(3), 376-382. https://doi.org/10.1109/70.88147 
  29. Liu, B., Gould, S., and Koller, D., 2010. Single image depth estimation from predicted semantic labels. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, June 13-18, pp. 1253-1260. https://doi.org/10.1109/CVPR.2010.5539823 
  30. Liu, F., Shen, C., Lin, G., and Reid, I., 2016. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10), 2024-2039. https://doi.org/10.1109/TPAMI.2015.2505283 
  31. Lowe, D. G., 1999. Object recognition from local scale-invariant features. In Proceedings of the 7th IEEE International Conference on Computer Vision, Kerkyra, Greece, Sept. 20-27, pp. 1150-1157. https://doi.org/10.1109/ICCV.1999.790410 
  32. Mahjourian, R., Wicke, M., and Angelova, A., 2018. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, June 18-22, pp. 5667-5675. 
  33. Masoumian, A., Rashwan, H.A., Cristiano, J., Asif, M. S., and Puig, D., 2022. Monocular depth estimation using deep learning: a review. Sensors, 22(14), 5353. https://doi.org/10.3390/s22145353 
  34. Michels, J., Saxena, A., and Ng, A.Y., 2005. High speed obstacle avoidance using monocular vision and reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, Aug. 7-11, pp. 593-600. https://doi.org/10.1145/1102351.1102426 
  35. Moravec, H.P., 1990. The Stanford Cart and the CMU Rover. In: Cox, I.J., Wilfong, G.T. (eds.), Autonomous Robot Vehicles, Springer, pp. 407-419. https://doi.org/10.1007/978-1-4613-8997-2_30 
  36. Nayar, S. K., Watanabe, M., and Noguchi, M., 1996. Real-time focus range sensor. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(12), 1186-1198. https://doi.org/10.1109/34.546256 
  37. Pentland, A. P., 1987. A new sense for depth of field. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(4), 523-531. https://doi.org/10.1109/TPAMI.1987.4767940 
  38. Roy, A. and Todorovic, S., 2016. Monocular depth estimation using neural regression forest. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 27-30, pp. 5506-5514. https://doi.org/10.1109/CVPR.2016.594 
  39. Saxena, A., Chung, S., and Ng, A., 2005. Learning depth from single monocular images. In Proceedings of the 19th Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, Dec. 5-8, pp. 1-8. 
  40. Seo, S., 2018. Edge modeling by two blur parameters in varying contrasts. IEEE Transactions on Image Processing, 27(6), 2701-2714. https://doi.org/10.1109/TIP.2018.2810504 
  41. Seo, S., 2021. SNR analysis for quantitative comparison of line detection methods. Applied Sciences, 11(21), 10088. https://doi.org/10.3390/app112110088 
  42. Simonyan, K. and Zisserman, A., 2015. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. https://doi.org/10.48550/arXiv.1409.1556 
  43. Tang, C., Hou, C., and Song, Z., 2015. Depth recovery and refinement from a single image using defocus cues. Journal of Modern Optics, 62(6), 441-448. https://doi.org/10.1080/09500340.2014.967321 
  44. Tang, H., Cohen, S., Price, B., Schiller, S., and Kutulakos, K. N., 2017. Depth from defocus in the wild. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, July 21-26, pp. 2740-2748. https://doi.org/10.1109/CVPR.2017.507 
  45. Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., and Yuille, A. L., 2015. Towards unified depth and semantic prediction from a single image. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, June 7-12, pp. 2800-2809. https://doi.org/10.1109/CVPR.2015.7298897 
  46. Wheatstone, C., 1838. Contributions to the physiology of vision - Part the first. On some remarkable, and hitherto unobserved, phenomena of binocular vision. Available online: https://www.stereoscopy.com/library/wheatstone-paper1838.html (accessed on Oct. 3, 2022). 
  47. Yang, J., Alvarez, J.M., and Liu, M., 2022. Non-parametric depth distribution modelling based depth inference for multi-view stereo. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, June 18-24, pp. 8616-8624. https://doi.org/10.1109/CVPR52688.2022.00843 
  48. Zhou, T., Brown, M., Snavely, N., and Lowe, D. G., 2017. Unsupervised learning of depth and ego-motion from video. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, July 21-26, pp. 1851-1858. 
  49. Zwald, L. and Lambert-Lacroix, S., 2012. The BerHu penalty and the grouped effect. arXiv preprint arXiv:1207.6868. https://doi.org/10.48550/arXiv.1207.6868