1. INTRODUCTION
Augmented reality (AR) has shown promising applications for education, industrial training, and even recreational purposes. For AR, the 3D camera (or object) pose has to be estimated from visual information in the camera images. In the literature, there are two main approaches for camera pose estimation: marker-based [1,2] and markerless [3,4]. Marker-based methods use fiducial markers that usually have primitive shapes and discriminative colors, and are therefore easy to implement and robust. However, placing markers in the user's space is visually disturbing and unpleasant for users, and marker-based methods are inherently vulnerable to occlusion, so they have become less attractive over time. Markerless methods instead use image features (such as corners or blobs) and their geometry to overcome the problems of marker-based methods. However, they require the target scene (or object) to be rigid and richly textured, and they are less reliable than marker-based methods.
Model-based methods [5, 6, 7] also do not require markers, so they can be considered a kind of markerless method. However, because model-based methods use 3D knowledge of the target scene (or object), i.e., a 3D scene/object model, they are highly reliable and computationally efficient.
Model-based methods estimate the camera pose from 3D-2D correspondences between the 3D model data and its corresponding 2D observations in the camera image. As each correspondence independently contributes to the camera pose estimation, model-based methods with robust estimators can be robust to occlusion. The workflow of a model-based method is shown in Fig. 1. Objects with dense (e.g., printed) textures are advantageous since the 3D-2D correspondences can be established directly by either feature point matching [8,9] or template matching [10,11]. On the other hand, when the target 3D object is textureless or poorly textured, edges are the only information to rely on. In such edge-based methods, the 3D object's mesh data is projected on the camera image and matched with its corresponding 2D image edges, which are detected via a local search in the normal direction of the projected object boundary. Then, the 3D camera (or object) pose between consecutive frames is recovered from the 2D displacements between the correspondences. Since the initial edge-based method, the RAPID tracker [5], was proposed, a number of variants with different computational frameworks have been reported and their performance has been considerably improved [6,7]. However, edge-based methods have suffered from matching errors commonly caused by clutter on either the object's surface or the background. To address this, the optimal local searching (OLS) method [12] introduced a robust way to find the correspondences in heavily cluttered backgrounds.
Fig. 1. Overall workflow of model-based 3D object tracking. (a) Camera image containing the target 3D object. (b) Mesh model of the 3D object. (c) Contour (in green) of the 3D mesh model projected on the image using the initial (or previous) camera pose. (d) Drawn 3D mesh model (in red) after refining the camera pose from the displacement between the contour and its corresponding image edges.
As another attractive approach for tracking textureless or poorly textured 3D objects, direct methods [13, 14, 15] are inherently less influenced by clutter on the object or in the background because they exploit the rich information in an image to estimate the camera pose instead of relying on local features such as edges. In direct methods, brightness (intensity) constancy between consecutive frames is commonly assumed. However, this assumption is violated by intensity variations, which are usually caused by illumination changes. To tackle this problem, a direct method [16], hereafter called D-IVM (Direct method with an Intensity Variation Model), introduced an approach that models intensity variations using the surface normal of the object under the Lambertian assumption.
Each of the OLS and D-IVM methods tackles limitations of its respective tracking approach. However, we observed that the local-search nature of the OLS method makes it weak against fast camera (or object) movements, which enlarge the distance between the correspondences and weaken edge strength, especially with cluttered backgrounds; conversely, because the D-IVM method uses information from the entire object region, it is weak against occlusion and inaccurate in object boundary matching. Based on the fact that the weakness of one method can be compensated by the other, we introduce an approach that smartly combines the OLS and D-IVM methods to create a system that is consistent and accurate in tracking and robust to occlusion. The main contribution of this paper is the design of a method that combines the two different types of tracking approaches to exploit their respective strengths and overcome their respective weaknesses.
2. TWO BASE METHODS
In this section, we describe the two types of model-based 3D object tracking methods that we combine in the proposed method. The first is the OLS method [12], which is an edge-based method, and the second is the D-IVM method [16], which is a direct method.
2.1 OLS Method
Given a 3D mesh model of the target object \(\mathbf{M}\), an initial camera pose \(\mathbf{E}^{0}\), and camera images obtained using a calibrated RGB camera, edge-based 3D object tracking methods estimate the current camera pose \(\mathbf{E}^{t}\) at time \(t\) by updating the camera pose of the previous frame \(\mathbf{E}^{t-1}\) with the infinitesimal camera motion \(\Delta\) between consecutive frames, so that \(\mathbf{E}^{t}=\mathbf{E}^{t-1}\Delta\). The infinitesimal camera motion is computed by iteratively minimizing the distances \(dist_{i}\) between the projections of boundary points \(\mathbf{M}_{i}\) sampled on the 3D mesh model with \(\mathbf{E}^{t-1}\) and their corresponding 2D image edges \(\mathbf{m}_{i}\) as follows:
\(\begin{array}{l} \widehat{\Delta}=\arg \min _{\Delta} \sum_{i=0}^{N-1} \varphi\left(dist_{i}\right) \\ dist_{i}=\left\|\mathbf{m}_{i}-\operatorname{Proj}\left(\mathbf{M}_{i} ; \mathbf{E}^{t-1}, \Delta, \mathbf{K}\right)_{i}\right\|^{2} \end{array}\) (1)
Here, \(\varphi(\cdot)\) is a robust estimator to penalize outliers, \(\mathbf{K}\) is the camera intrinsic matrix, and \(N\) is the number of boundary points. Within this formulation, the OLS method proposed a robust way of finding the corresponding image edges in cluttered backgrounds.
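To make the role of the robust estimator in Eq. (1) concrete, the following is a minimal Python sketch of one iteratively reweighted Gauss-Newton step; it assumes the residuals \(dist_{i}\) and their Jacobian rows with respect to the 6-DoF motion parameters are already computed, and it uses the Tukey biweight as an illustrative choice of \(\varphi\) (the estimator actually used in [12] may differ):

```python
import numpy as np

def tukey_weights(residuals, c=4.685):
    """Tukey biweight: weights shrink smoothly to zero for residuals beyond c."""
    r = np.abs(residuals)
    w = np.zeros_like(r, dtype=float)
    inlier = r < c
    w[inlier] = (1.0 - (r[inlier] / c) ** 2) ** 2
    return w

def robust_motion_step(J, residuals):
    """One iteratively reweighted Gauss-Newton step for the 6-DoF motion
    increment that reduces the robustified cost of Eq. (1).

    J:         (N, 6) Jacobian of the residuals w.r.t. the motion parameters
    residuals: (N,)   residuals dist_i for the current pose estimate
    """
    w = tukey_weights(residuals)
    JW = J * w[:, None]            # weight each Jacobian row
    H = JW.T @ J                   # weighted normal matrix
    g = JW.T @ residuals
    return np.linalg.solve(H, -g)  # 6-vector motion update (applied to E^{t-1})
```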
Based on the projected contour of the 3D object mesh model, the OLS method partitions the image into three regions \(\Phi^{\{+,0,-\}}\), with \(\Phi^{+}\) being the interior region, \(\Phi^{0}\) the contour, and \(\Phi^{-}\) the exterior region. The projected contour is sampled into points \(s_{i}\). From each sample \(s_{i}\), corresponding edge candidates \(c_{i}\) are searched on the image along 1D search lines \(l_{i}^{\{+,0,-\}}\) in the contour normal direction, within a certain range \(|\eta|\). Based on the three regions, matching candidates \(c_{i}^{\{+,0,-\}}\) are pixels whose local maximum gradient responses (computed using a [-1 0 1] filter mask) exceed a certain threshold \(\varepsilon\); the true correspondence \(c_{i}^{*}\) is then expected to be among the candidates \(c_{i}^{\{+,0,-\}}\). However, the OLS method searches for matching candidates only in confident directions: it prioritizes searching for potential matches toward the interior region through the 1D lines \(l_{i}^{\{+,0\}}\), and only when there is no match in the interior region does it search toward the exterior region.
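The following is a simplified Python sketch of this prioritized 1D search; it returns only a single best candidate per sample (rather than the full candidate sets \(c_{i}^{\{+,0,-\}}\)) and omits the appearance-based filtering described next, so it illustrates the search order rather than the exact procedure of [12]:

```python
import numpy as np

def search_correspondence(gray, s, n, search_range=20, eps=30.0):
    """Search for the image edge matching contour sample s along its outward
    unit normal n, trying the interior direction first.

    gray: grayscale image as a 2D float array
    s, n: 2D contour sample point and its outward unit normal (NumPy arrays)
    Returns the matched 2D point, or None if no gradient exceeds eps.
    """
    def best_edge_along(direction):
        offsets = np.arange(0, search_range + 1)
        pts = s[None, :] + offsets[:, None] * direction[None, :]
        xs = np.clip(np.round(pts[:, 0]).astype(int), 0, gray.shape[1] - 1)
        ys = np.clip(np.round(pts[:, 1]).astype(int), 0, gray.shape[0] - 1)
        vals = gray[ys, xs]
        grad = np.zeros_like(vals)
        grad[1:-1] = np.abs(vals[2:] - vals[:-2])   # [-1 0 1] mask along the line
        return pts, grad

    # Interior direction (-n) first; exterior (+n) only if no interior match.
    for direction in (-n, n):
        pts, grad = best_edge_along(direction)
        if grad.max() > eps:
            return pts[np.argmax(grad)]
    return None
```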
Besides the prioritized searching scheme, the OLS method models the local appearance of the object surface region and the background region using a histogram-based representation in the hue-saturation-value (HSV) color space, which keeps the model less sensitive to illumination changes. The appearance model is then used to suppress false edges caused by clutter on the object or in the background.
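As an illustration of how such an appearance model can suppress false edges, the following hedged sketch compares a hue-saturation histogram of the patch on the interior side of an edge candidate against precomputed foreground and background models; the exact histogram construction and decision rule in [12] may differ (here the value channel is dropped, which is one common way to reduce illumination sensitivity):

```python
import cv2

def hs_histogram(bgr_patch, bins=(16, 16)):
    """Normalized hue-saturation histogram of an image patch (the value
    channel is dropped here to further reduce illumination sensitivity)."""
    hsv = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, list(bins), [0, 180, 0, 256])
    return cv2.normalize(hist, hist, 1.0, 0.0, cv2.NORM_L1)

def looks_like_foreground(fg_hist, bg_hist, interior_patch):
    """Keep an edge candidate only if the patch on its interior side matches
    the object (foreground) model better than the background model."""
    h = hs_histogram(interior_patch)
    d_fg = cv2.compareHist(fg_hist, h, cv2.HISTCMP_BHATTACHARYYA)
    d_bg = cv2.compareHist(bg_hist, h, cv2.HISTCMP_BHATTACHARYYA)
    return d_fg < d_bg  # smaller Bhattacharyya distance means more similar
```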
2.2 D-IVM Method
Given a 3D mesh model of the target object \(\mathbf{M}\), an initial camera pose \(\mathbf{E}^{0}\), and camera images obtained using a calibrated RGB camera, direct 3D object tracking methods estimate the infinitesimal camera motion between consecutive frames as in edge-based methods. Under the brightness constancy assumption, image points at time \(t\) are mapped to their corresponding image points at time \(t+1\) as follows:
\(\mathbf{I}_{t+1}(\mathbf{m}+\Delta \mathbf{m})=\mathbf{I}_{t}(\mathbf{m})\) (2)
where \(\mathbf{I}_{t}(\mathbf{m})\) is the image intensity at \(\mathbf{m}\) at time \(t\). From this assumption, direct methods estimate the infinitesimal camera motion \(\Delta\) by iteratively minimizing the intensity differences \(diff_{i}\) as follows:
\(\begin{array}{l} \widehat{\Delta}=\arg \min _{\Delta} \sum_{i=0}^{N-1} \varphi\left(diff_{i}\right) \\ diff_{i}=\left\|\mathbf{I}_{t+1}\left(\operatorname{Proj}\left(\mathbf{M}_{i} ; \mathbf{E}^{t}, \Delta, \mathbf{K}\right)_{i}\right)-\mathbf{I}_{t}\left(\mathbf{m}_{i}\right)\right\|^{2} \end{array}\) (3)
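As a concrete illustration of the photometric residual in Eq. (3), the following minimal Python sketch evaluates \(diff_{i}\) for a set of model points; the projection helper and the nearest-neighbor intensity sampling are simplifying assumptions:

```python
import numpy as np

def photometric_residuals(I_t, I_t1, pts_t, model_pts, pose, K, project):
    """Residuals diff_i of Eq. (3): intensity at the reprojected location in
    frame t+1 minus intensity at the original location m_i in frame t.

    I_t, I_t1: grayscale images at t and t+1 (2D float arrays)
    pts_t:     (N, 2) pixel locations m_i in frame t
    model_pts: (N, 3) model points M_i corresponding to pts_t
    pose, K:   current camera pose estimate and intrinsic matrix
    project:   helper projecting 3D model points with (pose, K) to (N, 2) pixels
    """
    def sample(img, pts):
        # Nearest-neighbor sampling for brevity (interpolation would be smoother).
        xs = np.clip(np.round(pts[:, 0]).astype(int), 0, img.shape[1] - 1)
        ys = np.clip(np.round(pts[:, 1]).astype(int), 0, img.shape[0] - 1)
        return img[ys, xs]

    pts_t1 = project(model_pts, pose, K)
    return sample(I_t1, pts_t1) - sample(I_t, pts_t)
```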
Unlike conventional direct methods, the D-IVM method sought to address the problem that the brightness constancy is often violated due to illumination changes.
To model the intensity variations induced by illumination changes, the D-IVM method assumed that the 3D target object is rigid and has a Lambertian surface. Then, the observed image intensity at a 2D point \(\mathbf{m}\) on the image plane is expressed by:
\(\mathbf{I}(\mathbf{m})=\sigma(\mathbf{M}) \mathbf{n}(\mathbf{M})^{\mathrm{T}} \mathbf{l}\) (4)
where \(\sigma\) is the surface albedo, \(\mathbf{n}\) is the unit surface normal, and \(\mathbf{l}\) is the unknown light vector. Since the surface albedo is constant over time and the object is rigid (\(\mathbf{n}_{t+1}=\mathbf{n}_{t}\)), the intensity variation between consecutive frames can be expressed by:
\(\Delta \mathbf{I}=\sigma\left(\mathbf{n}_{t+1}^{\mathrm{T}} \mathbf{l}_{t+1}-\mathbf{n}_{t}^{\mathrm{T}} \mathbf{l}_{t}\right)=\sigma\left(\mathbf{n}_{t}^{\mathrm{T}} \mathbf{l}_{t}\right)\left(\frac{\mathbf{n}_{t}^{\mathrm{T}} \mathbf{l}_{t+1}}{\mathbf{n}_{t}^{\mathrm{T}} \mathbf{l}_{t}}-1\right)=\sigma\left(\mathbf{n}_{t}^{\mathrm{T}} \mathbf{l}_{t}\right)(\kappa-1)\) (5)
Therefore, Eq. (3) is modified to handle the intensity variations as follows:
\(\begin{array}{l} \widehat{\Delta}=\arg \min _{\Delta} \sum_{i=0}^{N-1} \varphi\left(dvar_{i}\right) \\ dvar_{i}=\left\|\mathbf{I}_{t+1}\left(\operatorname{Proj}\left(\mathbf{M}_{i} ; \mathbf{E}^{t}, \Delta, \mathbf{K}\right)_{i}\right)-\kappa \mathbf{I}_{t}\left(\mathbf{m}_{i}\right)\right\|^{2} \end{array}\) (6)
For details on how to minimize Eq. (6) and how to compute the compensation parameter \(\kappa\), refer to [16].
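To illustrate only how \(\kappa\) enters the residual of Eq. (6), the following sketch assumes the per-point surface normals and the light vectors at \(t\) and \(t+1\) are already available; estimating them robustly is the subject of [16]:

```python
import numpy as np

def compensated_residuals(I_t_vals, I_t1_vals, normals, l_t, l_t1):
    """Residuals dvar_i of Eq. (6): the frame-t intensities are scaled by
    kappa_i = (n_i^T l_{t+1}) / (n_i^T l_t) before comparison with frame t+1.

    I_t_vals:  (N,)   intensities I_t(m_i)
    I_t1_vals: (N,)   intensities I_{t+1} at the reprojected points
    normals:   (N, 3) unit surface normals n_i of the model points
    l_t, l_t1: (3,)   light vectors at t and t+1 (assumed estimated elsewhere)
    """
    # For visible, lit Lambertian points, n_i . l_t is positive (nonzero).
    kappa = (normals @ l_t1) / (normals @ l_t)
    return I_t1_vals - kappa * I_t_vals
```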
3. PROPOSED COMBINATION METHOD
In this paper, we carefully combine the OLS method [12] and the D-IVM method [16] to create a system that is more accurate, stable, and robust to occlusion. First, the camera pose is estimated for each frame using both the OLS and D-IVM methods with Eqs. (1) and (6). The camera poses estimated by the OLS and D-IVM methods at frame \(t\) are denoted \(p_{o}^{t}\) and \(p_{d}^{t}\), respectively. Although the proposed method is designed to rely mostly on the D-IVM method, as it is more stable and reliable, the starting poses (required for running both methods) are chosen depending on the estimation errors of both methods. The flowchart of the proposed method is shown in Fig. 2. For each frame, a camera pose is obtained using the D-IVM method and then refined using the OLS method.
Fig. 2. Flowchart of the proposed method.
When the system starts, at frame 0, an initial camera pose \(p^{0}\) loaded from a file is used as the starting point to estimate the camera pose \(p_{d}^{0}\) using the D-IVM method. With \(p_{d}^{0}\), the estimation error of the D-IVM method \(e_{d}\) is computed as follows:
\(e_{d}=\frac{1}{N} \sum_{i=0}^{N-1} dvar_{i}\) (7)
Here, \(dvar_{i}\) is the intensity difference in Eq. (6). After that, the OLS method uses \(p_{d}^{0}\) as the starting point to estimate the camera pose \(p_{o}^{0}\). With \(p_{o}^{0}\), the estimation error of the OLS method \(e_{o}\) is computed as follows:
\(e_{o}=\alpha \frac{1}{N} \sum_{i=0}^{N-1} dist_{i}\) (8)
Here, \(dist_{i}\) is the geometric distance in Eq. (1) and \(\alpha\) is a weight used to match the scale of \(e_{o}\) to that of \(e_{d}\). At this point, \(p_{o}^{0}\) is used as the final camera pose to draw virtual contents for AR. At frame \(t\), before running the D-IVM method, we compare \(e_{o}\) and \(e_{d}\). If \(e_{d}>e_{o}\), we use \(p_{o}^{t-1}\) as the initial pose for the D-IVM method; otherwise, we use \(p_{d}^{t-1}\). After that, \(p_{d}^{t}\) is obtained and \(e_{d}\) is updated using Eq. (7). Before running the OLS method, \(e_{o}\) and \(e_{d}\) are compared again so that the better pose (between \(p_{d}^{t-1}\) and \(p_{o}^{t-1}\)) is used as the initial pose. At the end, \(p_{o}^{t}\) is obtained as the final camera pose of the proposed method. By cooperatively combining the two different types of 3D object tracking methods in this way, we can take advantage of both: the consistency and stability of the D-IVM method and the robustness to occlusion of the OLS method.
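The per-frame selection logic can be summarized by the following Python sketch, in which run_divm, run_ols, dvar_error, and dist_error are assumed wrappers around the two base trackers and the error terms of Eqs. (7) and (8); the initialization of the OLS step follows the refinement reading of Fig. 2 and the frame-0 description (it starts from the new D-IVM pose unless the previous OLS pose had the lower error):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrackerState:
    p_d: np.ndarray  # D-IVM pose of the previous frame (e.g., 4x4 matrix)
    p_o: np.ndarray  # OLS pose of the previous frame
    e_d: float       # D-IVM error of the previous frame, Eq. (7)
    e_o: float       # OLS error of the previous frame, Eq. (8)

def track_frame(frame, prev_frame, state, run_divm, run_ols,
                dvar_error, dist_error, alpha):
    """One frame of the proposed combination (cf. Fig. 2)."""
    # D-IVM starts from whichever previous pose had the lower error.
    start = state.p_o if state.e_d > state.e_o else state.p_d
    p_d = run_divm(frame, prev_frame, start)
    e_d = dvar_error(frame, prev_frame, p_d)        # Eq. (7)

    # OLS refines the new D-IVM pose unless the previous OLS pose looks better.
    start = state.p_o if e_d > state.e_o else p_d
    p_o = run_ols(frame, start)
    e_o = alpha * dist_error(frame, p_o)            # Eq. (8)

    new_state = TrackerState(p_d=p_d, p_o=p_o, e_d=e_d, e_o=e_o)
    return p_o, new_state                           # p_o drives the AR overlay
```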
4. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we compare the proposed combination method with each of the base methods (OLS and D-IVM) to show that the combination can overcome the limitations of each base method. The comparison is first made in terms of three aspects: 1) tracking consistency, 2) tracking accuracy, and 3) robustness to occlusion. Tracking consistency evaluates whether the proposed method can track the object with challenging backgrounds more consistently or stably than both base methods. Tracking accuracy evaluates whether the proposed method has lower error rates than both base methods. Finally, robustness to occlusion evaluates whether the proposed method can keep tracking under occlusion better than both base methods. We also analyzed the computation time to show that the proposed method, despite combining two methods, can still run in real time for augmented reality.
4.1 Tracking Consistency and Accuracy
A printed cat object with a uniform, white color is tracked using a webcam (Logitech C922) at a resolution of 640 × 480. To analyze the tracking consistency and accuracy, we considered three types of backgrounds. The first type, shown in Fig. 3-(a), has a fairly homogeneous color. This type of background is less challenging to both base tracking methods (OLS and D-IVM) since the difference between the background color and the object color is distinguishable and the object contours can be clearly detected in the image. The second type, shown in Fig. 3-(b), has a few colors, a slight texture, and some strong edges. Finally, the last type, shown in Fig. 3-(c), is highly textured, with various colors and dense, strong edges. To compute estimation errors, we use the average distance between the sample points of the 3D mesh model projected with the estimated camera pose and their corresponding edges detected in the camera images. Fig. 4 shows the tracking errors of each method. For all background cases, the OLS method lost tracking at some point, while the D-IVM method showed good consistency in all cases. However, the D-IVM method was less accurate, i.e., the object boundary was loosely matched (Fig. 5). Combining both methods as proposed in this paper improved the consistency of the OLS method and the overall accuracy of both methods, as detailed in Table 1.
Fig. 3. Camera images for three background cases of different difficulty levels. (a) A background with a fairly homogeneous color that is clearly different from the object color, (b) a background with a slight texture, and (c) a background with a high texture and various colors.
Fig. 4. Estimation errors of the D-IVM, OLS, and proposed methods for backgrounds with different tracking difficulties.
Fig. 5. Tracking accuracy of the D-IVM and OLS methods.
Table 1. Tracking consistency and accuracy of different tracking methods.
4.2 Robustness to Occlusion
To show that the proposed method is more robust to occlusion than the OLS and D-IVM methods individually, we recorded experimental videos with different types and levels of occlusion. As the first type of occlusion, we used frames where the tracked object is only partially inside the camera frame without being occluded by another object, as shown in Fig. 6. Both the D-IVM and OLS methods were robust to this type of occlusion. This is likely because the occlusion does not affect the local luminosity of the tracked object, so local edges and color characteristics stay the same. Therefore, even when only a fraction of the object was visible, both methods kept tracking successfully, and the improvement by the proposed method was not noticeable.
Fig. 6. Tracking results of the D-IVM, OLS, and proposed methods when the object is partially visible. The results of (c) represent the final results of the proposed method.
However, with a different type of occlusion, where another object (a human hand in our experiments) partially occludes the tracked object, local luminosity changes introduced by shadows of the occluding object were usually observed. In such cases, the color characteristics of the tracked object may change between adjacent frames as the occluding object moves. As a result, the D-IVM method was vulnerable to this type of occlusion, whereas the OLS method was much more robust. The proposed method therefore benefited from the robustness of the OLS method to make the overall tracking more robust, as shown in Fig. 7.
Fig. 7. Tracking results of the D-IVM, OLS, and proposed methods when the object is partially occluded by a hand. The results of (c) represent the final results of the proposed method.
4.3 Computation Time
The proposed method aims to combine the OLS and D-IVM methods in such a way that the system runs in real time. To do so, we first implemented the algorithm in an optimized way such that both methods run faster than the original versions. As the OLS method runs on top of the D-IVM method in our implementation, we reduced its number of iterations to 3. To further improve the overall computation time of the proposed method, we parallelized certain parts of the code. Detailed information on the computation time is shown in Table 2. We analyzed the computation time on two computers with different specifications. The first computer is a 15-inch MacBook Pro with a 2.2 GHz Intel i7 processor and 16 GB of RAM. The second computer is a high-end PC with a 3.7 GHz Intel i7 processor and 32 GB of RAM. For the first (average) computer, as shown in Table 2, the total processing time for tracking was 75 ms, which corresponds to 13 frames per second (FPS). For the high-end computer, the total processing time was 55 ms, which corresponds to 18 FPS. As a result, although it is slower than both base methods, the proposed method could run in real time on both computer setups.
Table 2. Computation time of different tracking methods.
5. CONCLUSION
In this paper, we proposed a method that combines two different types of model-based 3D object tracking methods into a new system that is more consistent, accurate, and robust to occlusion. The proposed method relies mainly on the direct method, called D-IVM, taking advantage of its consistency and stability of tracking over time. The edge-based method, called OLS, is used to refine the camera pose and also to provide a potentially better initial camera pose for the next frame. We designed a workflow that takes advantage of both methods, with the OLS method bringing better robustness to occlusion. Experimental results showed that the proposed method had better tracking accuracy, tracking consistency, and robustness to occlusion than both base methods. Although its computation time was longer than those of the base methods, the proposed method could still run in real time on both an average laptop and a high-end computer.
References
[1] J. Moon, D. Park, H. Jung, Y. Kim, and S. Hwang, "An Image-Based Augmented Reality System for Multiple Users Using Multiple Markers," Journal of Korea Multimedia Society, Vol. 21, No. 10, pp. 1162-1170, 2018. https://doi.org/10.9717/KMMS.2018.21.10.1162
[2] H. Kato and M. Billinghurst, "Marker Tracking and HMD Calibration for a Video-Based Augmented Reality Conferencing System," Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality, pp. 85-94, 1999.
[3] K.W. Chia, A.D. Cheok, and S.J.D. Prince, "Online 6 DOF Augmented Reality Registration from Natural Features," Proceedings of the International Symposium on Mixed and Augmented Reality, pp. 305-313, 2002.
[4] A.I. Comport, E. Marchand, and F. Chaumette, "A Real-Time Tracker for Markerless Augmented Reality," Proceedings of the International Symposium on Mixed and Augmented Reality, pp. 36-45, 2003.
[5] C. Harris and C. Stennett, "RAPID: A Video-Rate Object Tracker," Proceedings of the British Machine Vision Conference, pp. 73-77, 1990.
[6] T. Drummond and R. Cipolla, "Real-Time Visual Tracking of Complex Structures," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, pp. 932-946, 2002. https://doi.org/10.1109/TPAMI.2002.1017620
[7] H. Wuest, F. Vial, and D. Stricker, "Adaptive Line Tracking with Multiple Hypotheses for Augmented Reality," Proceedings of the IEEE/ACM International Symposium on Mixed and Augmented Reality, pp. 62-69, 2005.
[8] I. Skrypnyk and D.G. Lowe, "Scene Modeling, Recognition, and Tracking with Invariant Image Features," Proceedings of the IEEE/ACM International Symposium on Mixed and Augmented Reality, pp. 110-119, 2004.
[9] S. Hinterstoisser, S. Benhimane, and N. Navab, "N3M: Natural 3D Markers for Real-Time Object Detection and Pose Estimation," Proceedings of the IEEE International Conference on Computer Vision, pp. 1-7, 2007.
[10] E. Ladikos, S. Benhimane, and N. Navab, "A Realtime Tracking System Combining Template-Based and Feature-Based Approaches," Proceedings of the International Conference on Computer Vision Theory and Applications, pp. 325-332, 2007.
[11] Y. Park, V. Lepetit, and W. Woo, "Handling Motion-Blur in 3D Tracking and Rendering for Augmented Reality," IEEE Transactions on Visualization and Computer Graphics, Vol. 18, No. 9, pp. 1449-1459, 2012. https://doi.org/10.1109/TVCG.2011.158
[12] B.-K. Seo, H. Park, J.-I. Park, S. Hinterstoisser, and S. Ilic, "Optimal Local Searching for Fast and Robust Textureless 3D Object Tracking in Highly Cluttered Backgrounds," IEEE Transactions on Visualization and Computer Graphics, Vol. 20, No. 1, pp. 99-110, 2013. https://doi.org/10.1109/TVCG.2013.94
[13] S. Baker and I. Matthews, "Lucas-Kanade 20 Years On: A Unifying Framework," International Journal of Computer Vision, Vol. 56, No. 3, pp. 221-255, 2004. https://doi.org/10.1023/B:VISI.0000011205.11775.fd
[14] G. Caron, A. Dame, and E. Marchand, "Direct Model Based Visual Tracking and Pose Estimation Using Mutual Information," Image and Vision Computing, Vol. 32, No. 1, pp. 54-63, 2014. https://doi.org/10.1016/j.imavis.2013.10.007
[15] J. Engel, T. Schops, and D. Cremers, "LSD-SLAM: Large-Scale Direct Monocular SLAM," Proceedings of the European Conference on Computer Vision, pp. 834-849, 2014.
[16] B.-K. Seo and H. Wuest, "A Direct Method for Robust Model-Based 3D Object Tracking from a Monocular RGB Image," Lecture Notes in Computer Science, Vol. 9915, pp. 551-562, 2016.