1. Introduction
The position and orientation of an object are jointly known as its pose. Position represents the three-dimensional location of the object, while orientation can be expressed as a series of consecutive rotations. Estimating the pose of a sensor is essential in navigation systems. Pose estimation has been studied over the past few decades, and various technologies have been investigated. Among the several approaches to estimating the pose of a sensor, the most common methods use inertial or visual sensors.
Inertial Measurement Unit (IMU) sensors have been widely used for pose estimation in various applications, including robots, aircraft, and navigation systems. An IMU normally contains three orthogonal accelerometer axes and three orthogonal gyroscope axes. It can acquire data at a high rate and compute accurate results without complex operations, so it can track fast and abrupt movements. In addition, IMU sensors can be small, lightweight, and low cost, and they can adopt wireless communication technologies. Furthermore, they are robust to illumination changes and visual occlusions. However, they suffer from drift that accumulates during position estimation [1,28]. Therefore, an additional sensor is normally used to aid the IMU in pose estimation. Some researchers proposed using the Global Positioning System (GPS) [2,25,26] with an IMU to overcome the long-term drift problem. However, the major drawback of GPS is intermittent loss of signal, especially in indoor environments [3].
Vision-based tracking provides high accuracy for pose estimation [4-6]. It can precisely track the relative motion between the camera and objects by measuring the movement of selected features across consecutive image frames. There are various methods for feature detection. Local detectors such as SIFT (Scale-Invariant Feature Transform) [7] and SURF (Speeded-Up Robust Features) [8] are invariant to rotation and illumination changes [9]. BRIEF [10], ORB [11], and AKAZE [12] are also well-known feature detectors for tracking. These feature points can be used to calculate the pose of the camera [13]. However, vision sensors normally suffer from a lack of robustness, a low data acquisition rate, and high computational cost. The main problem of vision-based tracking systems is instability in fast-moving dynamic scenes because of the loss of visual features. In addition, vision-based tracking systems are severely affected by occlusion. To overcome these problems, some researchers proposed approaches that take advantage of both vision and inertial sensors [14-16].
Typically, measurements from visual and inertial sensors are combined with a filtering method [27]. Kalman filter methods are commonly selected to perform sensor fusion by integrating measurements from various sensors. Kim and Park [17] developed a sensor fusion framework that combines radar, lidar, and camera data with an Extended Kalman Filter (EKF) for pose estimation. Foxlin and Naimark [18] combined inertial and vision data with a complementary Kalman filter for fiducial-based pose estimation. Rehbinder and Ghosh [19] adopted a gyroscope to measure orientation and proposed a vision-based system focused on reducing the drift error of the IMU framework. Although several studies have shown that inertial data can improve vision-based tracking, the optimal way to integrate data from inertial and vision sensors has yet to be established. Moreover, the integration of inertial data often causes instabilities in position estimation [20,24].
The purpose of this paper is to overcome these sensor instabilities and increase the accuracy of pose estimation by using a camera and an IMU sensor. The proposed approach fuses the data from the IMU and the camera to estimate the pose of the sensor more precisely and more robustly. As a result, the proposed method can estimate the pose of the sensor in real time, and it can be applied to various applications including robotics and autonomous driving. The remainder of this paper is organized as follows. Section 2 explains the approach to the proposed pose estimation method. Section 3 describes the methodology of the proposed pose estimation algorithm in detail, and Section 4 presents the results of experiments with indoor datasets. Finally, a summary of the paper is given in the last section.
2. Approach to the Camera-IMU Fusion
2.1 Feature Matching
To calculate the pose of the camera from image sequences, visual features must first be extracted. Feature points can be detected by detectors such as SIFT, SURF, and AKAZE. Because of its low computational cost and moderate performance, ORB [11] is widely used in feature tracking applications. Descriptors are then used to find matching relations between the given sets of feature points. Fig. 1 shows an example of the feature point matching result with the ORB detector and descriptor. Using the feature point matching relations and the camera calibration parameters, the camera extrinsic parameters can be calculated.
Fig. 1. Example of an ORB feature matching result.
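For illustration, a minimal OpenCV sketch of this detection and matching step could look as follows; the file names and variable names are placeholders, not part of the original implementation.

```python
# Minimal sketch of the ORB detection and matching step, assuming two
# consecutive frames; the file names below are hypothetical placeholders.
import cv2

img_prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
img_next = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)                 # ORB detector/descriptor
kp_prev, des_prev = orb.detectAndCompute(img_prev, None)
kp_next, des_next = orb.detectAndCompute(img_next, None)

# Brute-force Hamming matching is the usual choice for binary ORB descriptors.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_prev, des_next), key=lambda m: m.distance)
```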
However, as mentioned earlier, pose estimation results can be unstable for dynamic image sequences. Illumination changes, blur, and object motion can adversely affect the positions or matching relations of the feature points in the images. Consequently, the estimated pose can be unstable. To overcome this problem, we use epipolar geometry to remove outliers from the feature points [13,23].
If the camera is not moving, the same feature points must lie at the same positions in the next image frame. The distances between corresponding feature points are therefore calculated, and feature points whose distances exceed a threshold are removed from the pose estimation. If the camera is moving, on the other hand, the same feature points must lie on their epipolar lines in the next image frame, and feature points whose distances from those lines exceed the threshold are considered outliers. The epipolar line and the outliers are shown in Fig. 2.
Fig. 2. Example of the outlier removal process with epipolar geometry: (a) epipolar line example; (b) outliers detected by epipolar geometry. Red dots are considered outliers.
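A sketch of this vision-only outlier check is given below, assuming pts_prev and pts_next are N x 2 arrays of matched point coordinates derived from the ORB matches in the previous sketch; this is the purely image-based variant whose weakness is discussed next.

```python
# Vision-only outlier check: estimate F from the matches themselves, then
# compute the epipolar lines it induces in the next image.
import cv2
import numpy as np

pts_prev = np.float32([kp_prev[m.queryIdx].pt for m in matches])
pts_next = np.float32([kp_next[m.trainIdx].pt for m in matches])

# Fundamental matrix estimated purely from the matches (RANSAC). This is the
# step that degrades when a large portion of the matches are wrong.
F, inlier_mask = cv2.findFundamentalMat(pts_prev, pts_next, cv2.FM_RANSAC, 1.5)

# Epipolar lines in the next image induced by the points of the previous one;
# each row of lines_next holds the coefficients (a, b, c) of one line.
lines_next = cv2.computeCorrespondEpilines(pts_prev.reshape(-1, 1, 2), 1, F)
lines_next = lines_next.reshape(-1, 3)
```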
However, if a large portion of the feature points in an image are outliers (e.g., due to blur or occlusion), it is difficult to identify the outliers properly with epipolar geometry. The epipolar line is drawn from the fundamental matrix, which is calculated from the matched feature points and the camera parameters. If the feature point matching is inaccurate, the fundamental matrix and the epipolar lines become inaccurate, and outlier rejection becomes challenging. To overcome this instability of the vision sensor, we use the IMU sensor, rather than feature matching, as the initial estimator of the camera pose.
2.2 Camera-IMU Fusion
Normally, an IMU sensor has 6 DoF (degrees of freedom) and can measure linear acceleration and angular velocity at a high rate. As mentioned above, IMU sensors generate fast signals at a high rate during dynamic motions. However, they are sensitive to accumulated drift error because of the double integration in the pose estimation procedure. Camera sensors, on the other hand, can precisely estimate the ego-motion even over long estimation times, but they are weak against blurred features under fast and unpredicted motions. Fig. 3 shows an example of the pose estimated from the vision and inertial sensors; it plots the translation change along the x-axis over the frames. We obtained the data from each sensor at the same frame rate (30 fps). As the sensor captured the images in a static state, the ground truth must be zero. The errors of both sensors are low because the camera is static, but they show distinctly different tendencies. The IMU data exhibits rapidly fluctuating error and, because of drift, its error increases slightly over time. The vision sensor has no drift error, but its error occasionally spikes because of unpredicted changes in the image (e.g., illumination change, object motion). The aim of camera-IMU sensor fusion is to overcome these fundamental limitations of vision-only and IMU-only tracking by using both sensors complementarily.
Fig. 3. Example of pose estimation from the vision and IMU sensors.
Since the IMU can estimate the pose change between consecutive images regardless of abrupt motion, the transformation obtained from the IMU can be used as the initial value for the fundamental matrix calculation. After removing outliers with this initial fundamental matrix, the feature matching step produces a more precise transformation matrix. The overall procedure of the proposed method is shown in Fig. 4; a detailed explanation is given in the next section.
Fig. 4. Overall procedure of the proposed pose estimation method.
3. Proposed Pose Estimation Method
The goal of the proposed pose estimation is to find the corresponding feature points belonging to the same objects in consecutive image frames. First, the target feature points are detected in the first image of the sequence using a feature detector; the extracted features can be based on colors, texture, edges, GLOH, SURF, SIFT, etc. The proposed method uses ORB because of its fast and relatively robust feature detection. Second, the feature points are tracked to determine their locations in the subsequent images, which is especially difficult when the camera is moving. Finally, the pose change of the sensor is calculated from the obtained feature matching relations. In the proposed method, the IMU estimates the initial rotation (R0) and translation (t0). Since the raw IMU data contains bias error, a Kalman filter [21] is applied to reduce the error of the raw data. The initial fundamental matrix is calculated from the following equation.
\(F_{0}=K^{-T} R_{0} K^{T}\left[K R_{0}^{T} t_{0}\right]_{\times}\) (1)
where K is the intrinsic calibration matrix of the camera, which can be obtained from camera calibration (e.g., by bundle adjustment).
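A minimal NumPy sketch of Eq. (1) is shown below. R0 (3x3 rotation) and t0 (translation vector) are assumed to be the Kalman-filtered IMU estimates, K the intrinsic calibration matrix; the function names are our own, not from the original implementation.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]x such that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def initial_fundamental_matrix(K, R0, t0):
    """Eq. (1): F0 = K^-T R0 K^T [K R0^T t0]x, with R0, t0 from the IMU."""
    return np.linalg.inv(K).T @ R0 @ K.T @ skew(K @ R0.T @ t0)
```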
The feature points \(\left(d_{n-1}^{\prime}, d_{n}^{\prime}\right)\) are detected from the consecutive images \(\left(I_{n-1}, I_{n}\right)\) captured by the camera; the well-known ORB detector [11] is used in the proposed method. Using the initial fundamental matrix F0 and the feature points d'n-1, the epipolar lines can be drawn in the subsequent image In. Then, the distance between each feature point and its epipolar line is calculated with
\(d=\frac{|a u+b v+c|}{\sqrt{a^{2}+b^{2}}}\) (2)
where u and v are the coordinates of the feature point, and a, b, and c are the coefficients of the epipolar line ax + by + c = 0.
Feature points whose distance exceeds the threshold are considered outliers. The optimal threshold value was obtained experimentally. We apply a threshold of 1.5 pixels: if a corresponding feature point lies more than 1.5 pixels from its epipolar line, that feature point is treated as an outlier in the pose estimation procedure.
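A sketch of this test under the same assumptions as the earlier snippets (F0 from the Eq. (1) helper, pts_prev and pts_next as N x 2 pixel arrays) could look as follows.

```python
# Outlier test of Eq. (2) with the 1.5-pixel threshold.
import numpy as np

def epipolar_distances(F, pts_prev, pts_next):
    """Distance of each point in pts_next to the epipolar line F @ x_prev."""
    ones = np.ones((pts_prev.shape[0], 1))
    lines = (F @ np.hstack([pts_prev, ones]).T).T          # rows are (a, b, c)
    numerator = np.abs(np.sum(lines[:, :2] * pts_next, axis=1) + lines[:, 2])
    return numerator / np.linalg.norm(lines[:, :2], axis=1)

F0 = initial_fundamental_matrix(K, R0, t0)                 # from the Eq. (1) sketch
# Matches farther than 1.5 px from their epipolar line are flagged as outliers.
inlier_mask = epipolar_distances(F0, pts_prev, pts_next) < 1.5
```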
After the outliers are removed from the feature points \(\left(d_{n-1}^{\prime}, d_{n}^{\prime}\right)\), the remaining feature point matches are used to calculate a more precise fundamental matrix (F). The above procedure is then repeated to remove any outliers still remaining in the feature points. Finally, the pose of the sensor is estimated from the feature point matching relations.
Fig. 5 shows the error of the feature point matching as a function of the number of outlier removal iterations. We tested 100 indoor datasets, and the average error is plotted against the number of iterations. As can be seen in Fig. 5, five iterations are sufficient for the outlier removal process.
Fig. 5. Outlier removal result related to the number of iterations.
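The refinement loop might be sketched as below, reusing the hypothetical helpers from the earlier snippets; cv2.recoverPose is used here as a stand-in for the final pose computation, which the paper does not spell out.

```python
# Iterative outlier removal and fundamental matrix refinement (five iterations,
# following Fig. 5), followed by recovery of the relative pose.
import cv2

F = initial_fundamental_matrix(K, R0, t0)      # IMU-based initial estimate
for _ in range(5):
    mask = epipolar_distances(F, pts_prev, pts_next) < 1.5
    if mask.sum() < 8:                         # need at least 8 points for F
        break
    F, _ = cv2.findFundamentalMat(pts_prev[mask], pts_next[mask], cv2.FM_8POINT)

# Final relative pose from the surviving inlier matches.
E = K.T @ F @ K                                # essential matrix from F and K
_, R, t, _ = cv2.recoverPose(E, pts_prev[mask], pts_next[mask], K)
```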
In the next section, experimental results of the various pose estimation methods are demonstrated.
4. Experiments and Results
For the experiments, we used the commercial ZED 2 sensor (Stereolabs, San Francisco, USA), which integrates a camera and an IMU in one platform. The raw IMU data are transferred to the host computer at 100 Hz via a USB 3.0 port. The camera can capture image sequences at 30 fps with Full-HD resolution. We captured both data streams at 30 fps to synchronize the camera and IMU data. The elapsed time, i.e., the difference between the instant when the acquisition process starts and the instant when a new image frame is available, is collected together with the sensor data. The drift of the IMU is 0.35% along the translational axes and 0.005 °/m along the rotational axes. The configuration of the experiment is shown in Fig. 6. After collection, the data were transferred to a personal computer, and the various pose estimation methods were tested on an i7-4790K processor with 16 GB of memory running Windows 10.
Fig. 6. Configuration of the experiment.
Four pose estimation methods were evaluated. The proposed method and a pose estimation method using only feature point matching (ORB) were implemented with the OpenCV library [22]. For comparison, we also implemented a previous method that uses a Kalman filter to integrate the ORB and IMU results, and a pose estimation method using only the IMU. We collected data while moving the sensor over distances of 1 m to 5 m, ten times each (50 datasets). Example test results for each distance are shown in Fig. 7: subfigures (a) to (e) each show one test result for the datasets moving 1 m to 5 m, respectively, and the position estimates of the methods must converge to the corresponding distance. 'Proposed' is the pose estimation method proposed in this paper, 'IMU' uses only the IMU to estimate the position of the sensor, 'ORB' uses feature matching to calculate the position of the sensor, and 'IMU+ORB KF' is the previous method that fuses the IMU and camera with a Kalman filter.
Fig. 7. Pose estimation result for the example dataset.
As can be seen in the graphs, the pose estimation result from the IMU sensor (orange line) has the largest error because of drift; the IMU estimate is overestimated by the end of the measurement. The pose estimate from ORB feature matching (grey line) fluctuates because of image instability; in a complex scene, the feature matching error may increase significantly. Since the proposed method (blue line) and the Kalman filtering method (yellow line) both use the two sensors, they show improved performance in the given experiment. In addition, the proposed method shows a more stable tendency than the other methods. The average positions of the various methods for each dataset are shown in Table 1.
Table 1. Average position of the tested algorithms.
The average error for each dataset is shown in Fig. 8. The method using only the IMU shows the largest error, and its error grows as the data acquisition time increases. The method using only feature matching shows randomly distributed error. The error of the Kalman filtering method is slightly smaller than that of the feature matching or IMU method. The proposed method shows the smallest error of all the algorithms: its average error is 0.12%, compared with 1.48% for the IMU, 1.23% for ORB, and 0.81% for Kalman filtering.
We also tested each algorithm with data collected by a cart moving in a circle inside an office. Fig. 9 shows the position tracking results of the four algorithms, including the proposed method. The trajectory of the proposed method is the smoothest curve and appears closest to the real movement of the sensor. The Kalman filtering method also produces a smooth curve, but its end point does not match the beginning of the curve. Since the dataset shown in Fig. 9 has no ground truth, a quantitative error analysis is not appropriate in this case. As can be seen in Figs. 7-9 and Table 1, the proposed method shows the best performance among the tested algorithms.
Fig. 8. Average error of various algorithms for each dataset.
Fig. 9. Position tracking result of the indoor data.
5. Conclusion
In this research, a new pose estimation method using image sequences and an IMU sensor is proposed. Feature point matching has a fundamental instability problem in dynamic scenes. The proposed method uses the IMU sensor to obtain an initial pose estimate, and epipolar geometry is then applied to eliminate outliers from the feature points. With the outliers removed, the feature matching relations yield a more accurate pose of the sensor.
A major attribute of the proposed method is that it improves the accuracy of feature-based pose estimation, which is weak for moving image sequences. The proposed method uses the IMU sensor to compensate the pose estimation procedure, reducing the instability error in the pose estimation.
The proposed method was implemented and compared with the other methods. The experiments show that the proposed method provides more accurate and more robust pose estimation results than the other methods.
A limitation of the proposed method is its computational cost. In addition, abrupt movements can cause the algorithm to fail, as this is a fundamental problem of the sensors themselves. The proposed method therefore needs to be optimized for such abnormal situations through additional tests. Further research using datasets with ground truth is needed to analyze the error tendencies of the pose estimation trajectories of the various algorithms, and outdoor data should also be considered in future work.
Acknowledgment
This research was supported by the '5G based VR Device Core Technology Development Program' (IITP2020000103001, Development of 5G-based 3D spatial scanning device technology for virtual space composition), funded by the Ministry of Science and Information & Communication Technology (MSIT, Korea).
References
[1] A. R. Jimenez, F. Seco, J. C. Prieto, J. Guevara, "Indoor pedestrian navigation using an INS/EKF framework for yaw drift reduction and a foot-mounted IMU," in Proc. of 2010 7th Workshop on Positioning, Navigation and Communication, pp. 135-143, 2010.
[2] F. Caron, E. Duflos, D. Pomorski, P. Vanheeghe, "GPS/IMU data fusion using multisensor Kalman filtering: introduction of contextual aspects," Information Fusion, vol. 7, no. 2, pp. 221-230, 2006. https://doi.org/10.1016/j.inffus.2004.07.002
[3] F. Subhan, S. Ahmed, S. Haider, S. Saleem, A. Khan, S. Ahmed, M. Numan, "Hybrid Indoor Position Estimation using K-NN and MinMax," KSII Transactions on Internet and Information Systems, vol. 13, no. 9, pp. 4408-4428, 2019. https://doi.org/10.3837/tiis.2019.09.005
[4] M. Blosch, S. Weiss, D. Scaramuzza, R. Siegwart, "Vision based MAV navigation in unknown and unstructured environments," in Proc. of 2010 IEEE International Conference on Robotics and Automation, pp. 21-28, 2010.
[5] M. E. Ragab, G. F. Elkabbany, "A Parallel Implementation of Multiple Non-overlapping Cameras for Robot Pose Estimation," KSII Transactions on Internet and Information Systems, vol. 8, no. 11, pp. 4103-4117, 2014. https://doi.org/10.3837/tiis.2014.11.025
[6] L. Li, Y. Liu, T. Jiang, K. Wang, M. Fang, "Adaptive Trajectory Tracking of Nonholonomic Mobile Robots Using Vision-Based Position and Velocity Estimation," IEEE Transactions on Cybernetics, vol. 48, no. 2, pp. 571-582, 2018. https://doi.org/10.1109/tcyb.2016.2646719
[7] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1150-1157, 1999.
[8] H. Bay, A. Ess, T. Tuytelaars, L. V. Gool, "SURF: Speeded Up Robust Features," Computer Vision and Image Understanding, pp. 404-417, 2006.
[9] S. Jung, S. Song, M. Chang, S. Park, "Range Image Registration based on 2D Synthetic Images," Computer-Aided Design, vol. 94, pp. 16-27, 2018. https://doi.org/10.1016/j.cad.2017.08.001
[10] M. Calonder, V. Lepetit, C. Strecha, P. Fua, "BRIEF: Binary Robust Independent Elementary Features," in Proc. of ECCV 2010: Proc. of the 11th European Conference on Computer Vision, pp. 778-792, 2010.
[11] E. Rublee, V. Rabaud, K. Konolige, G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proc. of 2011 International Conference on Computer Vision, pp. 2564-2571, 2011.
[12] P. Alcantarilla, J. Nuevo, A. Bartoli, "Fast Explicit Diffusion for Accelerated Features in Nonlinear Scale Spaces," in Proc. of British Machine Vision Conference 2013, vol. 13, pp. 1-11, 2013.
[13] S. Jung, Y. Cho, D. Kim, M. Chang, "Moving Object Detection from Moving Camera Image Sequences Using an Inertial Measurement Unit Sensor," Applied Sciences, vol. 10, pp. 268, 2020. https://doi.org/10.3390/app10010268
[14] J. L., J. A. Besada, A. M. Bernardos, P. Tarrio, J. R. Casar, "A novel system for object pose estimation using fused vision and inertial data," Information Fusion, vol. 33, pp. 15-28, 2017. https://doi.org/10.1016/j.inffus.2016.04.006
[15] Y. Tian, J. Zhang, J. Tan, "Adaptive-frame-rate monocular vision and IMU fusion for robust indoor positioning," in Proc. of 2013 IEEE International Conference on Robotics and Automation, pp. 2257-2262, 2013.
[16] K. Kumar, A. Varghese, P. K. Reddy, N. Narendra, P. Swamy, M. G. Chandra, P. Balamuralidhar, "An Improved Tracking using IMU and Vision Fusion for Mobile Augmented Reality Applications," International Journal of Multimedia and its Applications, vol. 6, no. 5, 2014.
[17] T. Kim, T. H. Park, "Extended Kalman Filter (EKF) Design for Vehicle Position Tracking Using Reliability Function of Radar and Lidar," Sensors, vol. 20, pp. 4126, 2020. https://doi.org/10.3390/s20154126
[18] E. Foxlin, L. Naimark, "VIS-Tracker: A Wearable Vision-inertial Self-Tracker," in Proc. of IEEE VR2003, vol. 1, pp. 199, 2003.
[19] H. Rehbinder, B. K. Ghosh, "Pose estimation using line-based dynamic vision and inertial sensors," IEEE Transactions on Automatic Control, vol. 48, no. 2, pp. 186-199, 2003. https://doi.org/10.1109/TAC.2002.808464
[20] G. Qian, R. Chellappa, Q. Zheng, "Bayesian structure from motion using inertial information," in Proc. of International Conference on Image Processing, 2002.
[21] F. M. Mirzaei, S. I. Roumeliotis, "A Kalman Filter-Based Algorithm for IMU-Camera Calibration: Observability Analysis and Performance Evaluation," IEEE Transactions on Robotics, vol. 24, no. 5, pp. 1143-1156, 2008. https://doi.org/10.1109/TRO.2008.2004486
[22] G. Bradski, "The OpenCV Library," Dr. Dobb's Journal of Software Tools, 2000.
[23] S. Jung, Y. Cho, K. Lee, M. Chang, "Moving Object Detection with Single Moving Camera and IMU Sensor using Mask R-CNN Instance Image Segmentation," International Journal of Precision Engineering and Manufacturing, vol. 22, pp. 1049-1059, 2021. https://doi.org/10.1007/s12541-021-00527-9
[24] H. Deilamsalehy, T. C. Havens, "Sensor fused three-dimensional localization using IMU, camera and LiDAR," IEEE Sensors, pp. 1-3, 2016.
[25] E. D. Marti, D. Martin, J. Garcia, A. De la Escalera, J. M. Molina, J. M. Armingol, "Context-Aided Sensor Fusion for Enhanced Urban Navigation," Sensors, vol. 12, no. 12, pp. 16802-16837, 2012. https://doi.org/10.3390/s121216802
[26] C. Yu, H. Lan, F. Gu, F. Yu, N. El-Sheimy, "A Map/INS/Wi-Fi Integrated System for Indoor Location-based Service Applications," Sensors, vol. 17, pp. 1272, 2017. https://doi.org/10.3390/s17061272
[27] X. Li, Y. Wang, K. Khoshelham, "A Robust and Adaptive Complementary Kalman Filter based on Mahalanobis Distance for Ultra Wideband/Inertial Measurement Unit Fusion Positioning," Sensors, vol. 18, pp. 3435, 2018. https://doi.org/10.3390/s18103435
[28] Y. Wu, X. Niu, J. Du, L. Chang, H. Tang, H. Zhang, "Artificial Marker and MEMS IMU-based Pose Estimation Method to Meet Multirotor UAV Landing Requirements," Sensors, vol. 19, no. 24, pp. 5428, 2019. https://doi.org/10.3390/s19245428