DVIO: Depth aided visual inertial odometry for RGBD sensors
Abhishek Tyagi (a.tyagi@samsung.com), Yangwen Liang (liang.yw@samsung.com), Shuangquan Wang (shuangquan.w@samsung.com), Dongwoon Bai (dongwoon.bai@samsung.com)
2021-10-20

Abstract: In the past few years we have observed an increase in the usage of RGBD sensors in mobile devices. These sensors provide a good estimate of the depth map for the camera frame, which can be used in numerous augmented reality applications. This paper presents a new visual inertial odometry (VIO) system, which uses measurements from an RGBD sensor and an inertial measurement unit (IMU) sensor for estimating the motion state of the mobile device. The resulting system is called the depth-aided VIO (DVIO) system. In this system we add the depth measurement to the nonlinear optimization process. Specifically, we propose methods to use the depth measurement with a one-dimensional (1D) feature parameterization as well as a three-dimensional (3D) feature parameterization. In addition, we propose to utilize the depth measurement for estimating the time offset between unsynchronized IMU and RGBD sensors. Last but not least, we propose a novel block-based marginalization approach to speed up the marginalization process and maintain the real-time performance of the overall system. Experimental results validate that the proposed DVIO system outperforms other state-of-the-art VIO systems in terms of trajectory accuracy as well as processing time.

Nowadays, augmented reality (AR) is becoming a key feature of the latest mobile devices. Consequently, AR applications, such as 3D reconstruction, AR filters, in-home shopping, space exploration, AR gaming, AR meetings, etc., are shaping our post-COVID lifestyles. In all these applications, six degrees-of-freedom (6DOF) pose tracking of the device is critical to help localize the device in an unknown environment and render the augmentation with respect to the device. This problem has been solved using the simultaneous localization and mapping (SLAM) algorithm. Originally, visual SLAM was used to solve this problem. Since the original monocular visual SLAM work [1], multiple impressive implementations of visual SLAM have shown good performance in different applications [2], [3], [4]. Unfortunately, a visual SLAM system cannot recover the absolute scale of the environment due to the use of a single monocular camera. To overcome this, a stereo camera [5] or an RGBD sensor [6], [7] is introduced. Even with the scale estimation from other sources, the pose estimation in a visual SLAM system is negatively impacted when images exhibit motion blur due to fast motion, or when the scene has low texture, which reduces the number of trackable features. These issues mainly arise from the fact that visual SLAM uses only the two-dimensional (2D) feature measurements in the camera frame to estimate the pose. To tackle these issues, an inertial measurement unit (IMU) sensor is coupled with the camera to aid the pose estimation. Such a setup is termed visual inertial odometry (VIO).

[Figure 1: Trajectory comparisons of different VIO systems for the "hall2" sequence from the VCU-RVI benchmark [30]. The total duration of this sequence is 318 seconds and the total distance is 309 meters.]
[Figure 2 (caption fragment): ... VINS-Mono [13]. (c) shows the ground truth 3D model. VINS-RGBD [18] fails to initialize.]

In the past few years, multiple monocular VIO systems have shown good performance for pose estimation in challenging environments [8], [9]. These systems use either an extended Kalman filter (EKF)-based framework or a nonlinear optimization-based framework to create a tightly coupled VIO system. The main advantage of EKF-based VIO systems [10], [11] is that they are able to perform good pose estimation in challenging environments with low processing requirements. However, the main drawback of EKF-based VIO systems has been their reliability: they are less robust to the initialization of the filter and also have higher linearization errors in the estimated pose compared to nonlinear optimization systems. In the last few years, nonlinear optimization-based VIO systems [12], [13], [14] have shown good methods for tightly coupling the visual and inertial measurements using IMU preintegration [15] in a bundle adjustment [16] framework. These systems have shown better pose estimation accuracy as well as more robustness to the initialization motion needed by the system. Recent implementations [13] have also shown that, with a smart software architecture, real-time performance can be achieved on mobile devices. However, these implementations compute the vision residue using only the 2D position of the landmark in the camera frame. In the case of RGBD sensors, the depth map can be used to generate extra constraints on the landmark position, and hence improve the accuracy and robustness of the optimization step. In [17], it was shown how to use an EKF framework to perform tight coupling of the inertial, visual and depth measurements for RGBD-IMU SLAM, and [18] used nonlinear optimization to tightly couple IMU and visual measurements while using depth from an RGBD sensor to improve the 3D estimation of the 2D features. There is also recent work [19] that uses depth as a measurement in the optimization step together with a feature uncertainty model to improve the pose estimation.

In early work on integrating inertial measurements with RGBD data [20], the EKF framework was used to perform a loose coupling of the pose estimated using inertial data with the pose estimated using a sparse visual-only RGBD SLAM system. In [17], the authors proposed an RGBD-IMU tightly coupled EKF framework for pose estimation as well as calibration between the two sensors. They analyzed the observability of the nonlinear system describing the RGBD-IMU calibration and the 6DOF pose estimation of the IMU state. This method improved the performance of the EKF-based RGBD-IMU SLAM system and also reduced the processing time, but EKF-based systems can have lower pose accuracy due to higher linearization error and are not robust to the initialization of the state. In [12], a keyframe-based sliding window approach for VIO using nonlinear optimization was proposed; this work was then further extended in [13], which not only improved the accuracy of the system but also improved the initialization performance and the processing time of the overall system [8]. In [18], RGBD data is used with IMU data in a tightly coupled nonlinear optimization approach.
In this solution, the depth measurement of the 3D landmark is used to perform the triangulation of the feature depth in the first keyframe in which the feature is observed. This helps in better initialization of the feature in the sliding window. In this work they also reduced the size of the nonlinear optimization problem by holding the feature state constant for those landmarks that are initialized using the depth measurement. This approach resulted in a more accurate and robust optimization solution for the overall sliding window. However, in this system the depth measurement is not used as a constraint in the optimization step. In [19], the authors proposed a system that models the uncertainties in the depth measurement from an RGBD sensor using a Gaussian mixture model (GMM) and then incorporates the measurement in the initialization and optimization process of the VIO system. However, the depth measurement can be used to provide much stronger constraints for the pose estimation process, which will give better pose accuracy from the nonlinear optimization step. Moreover, for a commercially realistic VIO system using RGBD sensors, any increase in the optimization problem size needs to be compensated with an improvement in the processing time. Unlike the previously proposed VIO systems, we demonstrate that depth measurements from the RGBD sensor can be utilized to provide constraints for the nonlinear optimization process with different feature parameterizations as well as unsynchronized sensors.

In this work, we introduce a method to add the depth measurement to a nonlinear optimization-based VIO system. We first show how to use depth measurements in the optimization step of the VIO, and then describe a method for adding the depth measurement constraint in a sliding window that has 1D parameterized feature states. In addition, we also show the addition of the depth in a VIO system with the feature state parameterized using the full 3D observation of the landmark in the first keyframe. With the 3D parameterization, we are able to utilize all the constraints provided by depth measurements in the optimization process. We also show how depth measurements can be used to handle the scenario where there is a time offset between the RGBD data and the IMU data. With these changes we have increased the overall size of the optimization problem. To compensate for this increase, we present a block-by-block fast marginalization scheme that significantly reduces the overall processing time of the marginalization step and maintains the real-time performance of our VIO system. The performance improvements in both pose estimation accuracy and processing time are validated using an extensive range of test cases in our experiments.

As shown in Figure 1 and Figure 2, we demonstrate the performance of our proposed system compared with other state-of-the-art VIO systems. In Figure 1, the test sequence hall2 from the VCU-RVI benchmark [30] was obtained by walking in a looped trajectory (i.e., returning to the starting point). In Figure 2, the 3D point cloud of the object in the camera view is generated from the poses estimated by our proposed system and by VINS-Mono. Here, the test sequence is u3d spiral, which will be introduced in Section 3. From this comparison, one can see the impact of accurate pose estimation on the final result of the 3D reconstruction application. In the next section we go over the algorithmic details of our proposed system.
Similar to [13] and [18], the proposed DVIO is implemented using a software architecture consisting of two main threads, as shown in Figure 3. The first thread, named the frontend thread, handles the feature tracking and the depth map processing using the data from the RGBD sensor. In this thread, we detect 2D features [22] in the camera image and track them from one frame to the next using the KLT algorithm [23]. We also compute the depth of each 2D feature using the depth map in the RGBD data and the 2D feature location. The second thread, named the backend thread, handles the full state estimation using a sliding window-based nonlinear optimization. In this thread, the feature tracking data and the IMU data are used to perform the initialization of the estimator. In the DVIO implementation, the steps followed in the initialization process are the same as those defined in [13]. After initialization is successful, the sliding window is created using the IMU pose, speed and bias from the IMU preintegration [15] and the landmark observations in the camera frames computed in the frontend thread. This sliding window is then used to perform the state estimation using the nonlinear optimization process. Once we have the estimated state of the sliding window, it is shared with the loop closure module to perform the global pose graph optimization [13]. The state vector is also used to compute the camera pose or IMU pose at the input sensor frequency using motion-only bundle adjustment or IMU propagation, as described in [13].

The full state vector in the sliding window is defined as follows:

$$\mathcal{X} = \big[\, \mathbf{x}_0,\ \mathbf{x}_1,\ \ldots,\ \mathbf{x}_{K-1},\ \lambda_0,\ \lambda_1,\ \ldots,\ \lambda_{L-1} \,\big], \qquad (1)$$

where $\mathbf{x}_k$ is the IMU state that contains the pose, speed and bias of the IMU at the time of the $k$-th keyframe, $\lambda_l$ is the feature state for the $l$-th 3D landmark, and the sliding window has $K$ keyframes and $L$ features. Similar to [13] and [18], in the DVIO we estimate the state vector by minimizing the Mahalanobis norm of all the measurement residuals and the residual for the prior information from marginalization as follows:

$$\min_{\mathcal{X}} \Big\{ \big\| r_m(\mathcal{X}) \big\|^2 + \sum_{k \in \mathcal{K}} \big\| r_{\mathcal{B}}\big(\hat{\mathbf{z}}^{b_k}_{b_{k+1}}, \mathcal{X}\big) \big\|^2_{\mathbf{P}^{b_k}_{b_{k+1}}} + \sum_{(l,i,j) \in \mathcal{F}} \rho\Big( \big\| r_{\mathcal{A}}\big(\hat{\mathbf{z}}^{c_i}_{l}, \mathcal{X}\big) \big\|^2_{\mathbf{P}^{c_i}_{l}} + \big\| r_{\mathcal{C}}\big(\hat{\mathbf{z}}^{c_j}_{l}, \mathcal{X}\big) \big\|^2_{\mathbf{P}^{c_j}_{l}} \Big) \Big\} \qquad (2)$$

Here, $\rho(\cdot)$ is the Cauchy loss function, $\mathcal{K}$ is the index set of the sliding window, $\mathcal{F}$ is the set of feature observations within the sliding window, $\mathcal{X}$ is the set of state variables to optimize over, $\mathbf{P}^{b_k}_{b_{k+1}}$ and $\mathbf{P}^{c_j}_{l}$ are the covariance matrices used for computing the Mahalanobis norm as described in [13], $r_{\mathcal{B}}$ is the IMU residue factor, $r_{\mathcal{A}}$ is the vision residue factor for the first keyframe in which landmark $l$ is observed, $r_{\mathcal{C}}$ is the vision residue factor for all the other keyframes in which the landmark is observed, and $r_m(\mathcal{X})$ is the marginalization residue factor for the state after marginalization. In the DVIO, the IMU residue $r_{\mathcal{B}}$ and the marginalization residue $r_m$ are the same as those defined in [13] for the optimization step. For the vision residue factors $r_{\mathcal{A}}$ and $r_{\mathcal{C}}$, we propose two solutions based on the feature parameterization in the state vector. In the 1D feature parameterization, we use the same feature state as defined in [13], [19], [24], where we parameterize a 3D landmark using its inverse depth in the first camera keyframe in which the landmark is observed. In the 3D feature parameterization, we parameterize the 3D landmark using the inverse depth and the 2D location of the feature in the first camera keyframe in which the landmark is observed.
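As a structural illustration of the cost in (2), the following is a minimal Python sketch, not the authors' implementation (which uses the Ceres solver in C++); the function and variable names are illustrative, and the residual/covariance pairs are assumed to be supplied by the IMU preintegration and the vision frontend:

```python
import numpy as np

def cauchy_loss(s, c=1.0):
    # Cauchy robust loss applied to a squared Mahalanobis norm s.
    return c * c * np.log1p(s / (c * c))

def mahalanobis_sq(r, P):
    # Squared Mahalanobis norm r^T P^{-1} r of residual r with covariance P.
    return float(r @ np.linalg.solve(P, r))

def sliding_window_cost(r_marg, imu_factors, vision_factors):
    # imu_factors / vision_factors: lists of (residual, covariance) pairs.
    cost = float(r_marg @ r_marg)              # marginalization prior residue
    for r, P in imu_factors:                   # IMU preintegration factors
        cost += mahalanobis_sq(r, P)
    for r, P in vision_factors:                # robustified vision factors
        cost += cauchy_loss(mahalanobis_sq(r, P))
    return cost
```

In the actual system such a cost is not evaluated in one function; each factor is registered with the solver together with its Jacobians, and the robust loss is attached per vision factor.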
Since most of the current RGBD sensors used in mobile devices are based on structured light technology [25], the noise in the depth measurement is inversely proportional to the depth of the landmark; hence, using inverse depth enables us to use a Gaussian model for the depth measurement noise. In scenarios where no depth is available for a feature measurement due to holes in the depth map, we do not compute the depth constraint for that feature measurement.

In the case of the 1D feature parameterization, the feature state is defined as the inverse depth of the landmark in the first keyframe in which the feature is observed. This keyframe is called the anchor keyframe for the landmark, and all the subsequent keyframes in which the landmark is observed are called non-anchor keyframes for the landmark. The feature state of a landmark in the sliding window is initialized in two steps. In the first step, we compute the depth of the landmark in the anchor keyframe by taking the landmark's depth measurements in all the keyframes and transforming them to the anchor keyframe using the keyframe poses in the sliding window. The feature state is then initialized as the average of all the computed depths of the landmark in the anchor keyframe. The remaining landmarks that could not be initialized in the first step are initialized using triangulation. Specifically, the Direct Linear Transformation (DLT) algorithm [26] is utilized to estimate the depth of the landmark in the anchor keyframe using the 2D feature measurements in the sliding window.

The DVIO vision residual for a non-anchor keyframe, for a landmark $l$ observed in the anchor keyframe $i$ and a non-anchor keyframe $j$, is computed as follows:

$$r_{\mathcal{C}}\big(\hat{\mathbf{z}}^{c_j}_{l}, \mathcal{X}\big) = \begin{bmatrix} x^{c_j}_l / z^{c_j}_l - \bar{u}^{c_j}_l \\ y^{c_j}_l / z^{c_j}_l - \bar{v}^{c_j}_l \\ 1 / z^{c_j}_l - 1 / \bar{d}^{c_j}_l \end{bmatrix}, \qquad \begin{bmatrix} x^{c_j}_l \\ y^{c_j}_l \\ z^{c_j}_l \end{bmatrix} = \mathbf{R}^{c_j}_{c_i} \, \frac{1}{\lambda_l} \begin{bmatrix} \bar{u}^{c_i}_l \\ \bar{v}^{c_i}_l \\ 1 \end{bmatrix} + \mathbf{t}^{c_j}_{c_i},$$

where $\lambda_l$ is the inverse depth of the landmark in the anchor keyframe $i$, $(\bar{u}^{c_i}_l, \bar{v}^{c_i}_l)$ and $(\bar{u}^{c_j}_l, \bar{v}^{c_j}_l)$ are the measured 2D locations of the landmark in the keyframes $i$ and $j$ in the normalized homogeneous coordinate system, respectively, $\mathbf{R}^{c_j}_{c_i}$ and $\mathbf{t}^{c_j}_{c_i}$ are the rotation and translation from keyframe $i$ to keyframe $j$, and $\bar{d}^{c_j}_l$ is the depth measured for the landmark in keyframe $j$. Next, we use the depth $\bar{d}^{c_i}_l$ measured in the anchor keyframe and the estimated inverse depth $\lambda_l$ in the state to generate the vision residue in the anchor keyframe $i$:

$$r_{\mathcal{A}}\big(\hat{\mathbf{z}}^{c_i}_{l}, \mathcal{X}\big) = \lambda_l - \frac{1}{\bar{d}^{c_i}_l}.$$

The residues $r_{\mathcal{A}}$ and $r_{\mathcal{C}}$ combine the depth measurement from the depth sensor with the feature measurement in the camera image to create the vision constraints in the DVIO.

[Figure 3: DVIO software architecture. The frontend thread processes the RGBD data and creates the landmark feature data; the backend thread takes the IMU data and the feature data to perform the optimization-based VIO, and also integrates with loop closure and outputs the sensor pose.]

In the case of the 3D feature parameterization, the feature state is defined by the estimate of the 2D location of the feature and the inverse depth of the feature in the anchor keyframe. Hence, the feature state becomes a 3D vector, denoted as $\mathbf{f}_l = [u_l, v_l, \lambda_l]^{\top}$, where $(u_l, v_l)$ is the estimated 2D location of the feature in the anchor keyframe in the normalized coordinate system. The inverse depth part of the feature state is initialized in the same way as in the case of the 1D feature parameterization. For the initialization of the 2D feature location of the landmark in the anchor keyframe, we utilize the DLT algorithm [26]: using the 2D feature measurements for the landmark in the sliding window and the sliding window poses, we compute the 2D location of the feature in the anchor keyframe in the normalized coordinate system, which is used to initialize the feature state for the landmark.
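The depth-averaging initialization described above can be sketched in a few lines of numpy. This is a minimal sketch with illustrative names, assuming camera-to-world poses and normalized image coordinates; it is not the paper's code:

```python
import numpy as np

def init_inverse_depth(depth_meas, poses, anchor_idx, obs_normalized):
    """Initialize a landmark's inverse depth in its anchor keyframe by
    transforming each keyframe's depth measurement into the anchor frame
    and averaging the resulting depths.

    depth_meas: dict {keyframe index: measured depth d_k (meters)}
    poses: dict {keyframe index: (R_wc, t_wc)} camera-to-world poses
    obs_normalized: dict {keyframe index: (u, v)} normalized image coords
    """
    R_wa, t_wa = poses[anchor_idx]
    depths_in_anchor = []
    for k, d in depth_meas.items():
        u, v = obs_normalized[k]
        p_c = d * np.array([u, v, 1.0])      # back-project in keyframe k
        R_wc, t_wc = poses[k]
        p_w = R_wc @ p_c + t_wc              # lift to the world frame
        p_a = R_wa.T @ (p_w - t_wa)          # transform into the anchor frame
        if p_a[2] > 0:                       # keep points in front of the camera
            depths_in_anchor.append(p_a[2])
    if not depths_in_anchor:
        return None                          # fall back to DLT triangulation
    return 1.0 / np.mean(depths_in_anchor)   # feature state is inverse depth
```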
Similar to the 1D feature parameterization, we can compute the non-anchor keyframe vision residue using the new feature state as follows:

$$\begin{bmatrix} x^{c_j}_l \\ y^{c_j}_l \\ z^{c_j}_l \end{bmatrix} = \mathbf{R}^{c_j}_{c_i} \, \frac{1}{\lambda_l} \begin{bmatrix} u_l \\ v_l \\ 1 \end{bmatrix} + \mathbf{t}^{c_j}_{c_i},$$

and the residue can be obtained as

$$r_{\mathcal{C}}\big(\hat{\mathbf{z}}^{c_j}_{l}, \mathcal{X}\big) = \begin{bmatrix} x^{c_j}_l / z^{c_j}_l - \bar{u}^{c_j}_l \\ y^{c_j}_l / z^{c_j}_l - \bar{v}^{c_j}_l \\ 1 / z^{c_j}_l - 1 / \bar{d}^{c_j}_l \end{bmatrix}.$$

In the case of the anchor keyframe, we have both the estimated and the measured value of the landmark in full 3D. This allows us to extend the anchor keyframe vision residue to a three-dimensional residual:

$$r_{\mathcal{A}}\big(\hat{\mathbf{z}}^{c_i}_{l}, \mathcal{X}\big) = \begin{bmatrix} u_l - \bar{u}^{c_i}_l \\ v_l - \bar{v}^{c_i}_l \\ \lambda_l - 1 / \bar{d}^{c_i}_l \end{bmatrix}.$$

With this, we are able to use all the constraints provided by the feature and depth measurements in the sliding window. These residues are then used with the IMU residual and the marginalization factor to estimate a more accurate pose using the nonlinear optimization process. To perform the nonlinear optimization, we employ the Ceres solver [27] in our software implementation.

In most of the commercially available devices that have an RGBD sensor and an IMU sensor, the timestamps of the IMU and RGBD sensors are synchronized at the hardware level. This makes the state estimation easier, as there is no need to estimate the time offset between the two sensor modalities. In scenarios where there is a temporal misalignment (i.e., time offset) between the two sensor modalities, we can use the depth and feature measurements to perform the time offset estimation. In our consideration, we assume that there is only a time offset between the RGBD and IMU sensors and no time offset between the camera image and the depth map. Let us consider a time offset $t_d$ between the IMU timestamps $t_{\text{IMU}}$ and the RGBD timestamps $t_{\text{RGBD}}$. Similar to [28], we consider the general case where the time offset is constant but unknown, such that $t_{\text{IMU}} = t_{\text{RGBD}} + t_d$. The landmark's 2D feature observations in two consecutive image frames can be characterized as $(u^{k+1}_l, v^{k+1}_l)$ in frame $k{+}1$ at time $t_{k+1}$ and $(u^{k}_l, v^{k}_l)$ in image frame $k$ at time $t_k$. Combined with the depth observations $d^{k+1}_l$ and $d^{k}_l$ for the same landmark from the corresponding depth maps, we can obtain the 3D velocity of the feature as:

$$\mathbf{V}_l = \frac{1}{t_{k+1} - t_k} \left( \begin{bmatrix} u^{k+1}_l \\ v^{k+1}_l \\ d^{k+1}_l \end{bmatrix} - \begin{bmatrix} u^{k}_l \\ v^{k}_l \\ d^{k}_l \end{bmatrix} \right).$$

Using the velocity $\mathbf{V}_l$ and the time offset $t_d$, we can shift the landmark observation in a keyframe to the IMU timestamps:

$$\begin{bmatrix} \bar{u}_l(t_d) \\ \bar{v}_l(t_d) \\ \bar{d}_l(t_d) \end{bmatrix} = \begin{bmatrix} \bar{u}_l \\ \bar{v}_l \\ \bar{d}_l \end{bmatrix} + t_d\, \mathbf{V}_l.$$

Thus, we have moved the observation of the landmark in the keyframe to compensate for the time offset. Subsequently, we are able to obtain the vision residues for the 1D and 3D feature parameterizations with the shifted feature observations:

- 1D parameterized feature vision residue: $r_{\mathcal{A}}(t_d)$ and $r_{\mathcal{C}}(t_d)$, obtained by replacing the measured quantities $(\bar{u}^{c_i}_l, \bar{v}^{c_i}_l)$, $(\bar{u}^{c_j}_l, \bar{v}^{c_j}_l)$, $\bar{d}^{c_i}_l$ and $\bar{d}^{c_j}_l$ in the 1D residues with their time-shifted counterparts.
- 3D parameterized feature vision residue: $r_{\mathcal{A}}(t_d)$ and $r_{\mathcal{C}}(t_d)$, obtained in the same way from the 3D residues.

Here, $t_d$ is the time offset between the RGBD and IMU sensors, $(\bar{u}^{c_i}_l(t_d), \bar{v}^{c_i}_l(t_d))$ and $(\bar{u}^{c_j}_l(t_d), \bar{v}^{c_j}_l(t_d))$ are the shifted 2D observations of the landmark in the anchor keyframe $i$ and the non-anchor keyframe $j$, $\lambda_l$ is the estimated inverse depth of the landmark in the anchor keyframe $i$, $(u_l, v_l)$ is the estimated 2D feature observation of the landmark in the anchor keyframe $i$, $\bar{d}^{c_i}_l(t_d)$ and $\bar{d}^{c_j}_l(t_d)$ are the shifted depth measurements of the landmark in the anchor keyframe $i$ and the non-anchor keyframe $j$, and $r_{\mathcal{A}}(t_d)$ and $r_{\mathcal{C}}(t_d)$ are the vision residues for the anchor keyframe and the non-anchor keyframe using the time-shifted observations, respectively. In the case of the vision residue for the 3D parameterized feature, the estimated 2D feature observation is already in IMU time and hence does not require shifting by $t_d$. With the correct value of $t_d$, the above vision constraints can be used directly with the IMU constraints, as they will match the IMU timestamps.
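A small sketch of the velocity-based shift above, under the paper's constant-offset assumption; the names and numeric values are illustrative only:

```python
import numpy as np

def feature_velocity(obs_k, t_k, obs_k1, t_k1):
    """3D feature velocity from two consecutive frames.
    obs = np.array([u, v, d]): normalized 2D location plus measured depth."""
    return (obs_k1 - obs_k) / (t_k1 - t_k)

def shift_observation(obs, velocity, t_d):
    # Move the RGBD observation onto the IMU timeline: t_IMU = t_RGBD + t_d.
    return obs + t_d * velocity

# Example: shift a landmark observation by a 5 ms sensor time offset.
obs_k  = np.array([0.12, -0.03, 2.50])   # (u, v, d) at time t_k
obs_k1 = np.array([0.14, -0.02, 2.45])   # (u, v, d) at time t_{k+1}
V = feature_velocity(obs_k, 0.000, obs_k1, 0.033)
print(shift_observation(obs_k1, V, t_d=0.005))
```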
To estimate the optimal value of $t_d$, similar to [28], we add $t_d$ to the state vector and iteratively optimize it with the other state variables in the sliding window-based optimization by minimizing the following cost function:

$$\min_{\mathcal{X},\, t_d} \Big\{ \big\| r_m(\mathcal{X}) \big\|^2 + \sum_{k \in \mathcal{K}} \big\| r_{\mathcal{B}}\big(\hat{\mathbf{z}}^{b_k}_{b_{k+1}}, \mathcal{X}\big) \big\|^2_{\mathbf{P}^{b_k}_{b_{k+1}}} + \sum_{(l,i,j) \in \mathcal{F}} \rho\Big( \big\| r_{\mathcal{A}}(t_d) \big\|^2_{\mathbf{P}^{c_i}_{l}} + \big\| r_{\mathcal{C}}(t_d) \big\|^2_{\mathbf{P}^{c_j}_{l}} \Big) \Big\}$$

Here, the cost function is similar to the one defined in (2), except that the vision residues $r_{\mathcal{A}}(t_d)$ and $r_{\mathcal{C}}(t_d)$ are obtained using the time-shifted landmark observations and $t_d$ is added to the state vector $\mathcal{X}$.

In the sliding window-based nonlinear optimization process, every time the sliding window reaches the maximum number of nodes, we need to drop certain nodes from the window to bound the size of the optimization problem. As described in [13], when we want to drop a set of nodes $\mathcal{X}_m$ and keep only the remaining subset $\mathcal{X}_r$ ($\mathcal{X}_m \cup \mathcal{X}_r = \mathcal{X}$, $\mathcal{X}_m \cap \mathcal{X}_r = \emptyset$), marginalization is performed to drop the nodes from the optimization problem based on the Schur complement [29]. Similar to [13], in the DVIO, for a given linearized system

$$\begin{bmatrix} \mathbf{H}_{mm} & \mathbf{H}_{mr} \\ \mathbf{H}_{rm} & \mathbf{H}_{rr} \end{bmatrix} \begin{bmatrix} \delta\mathbf{x}_m \\ \delta\mathbf{x}_r \end{bmatrix} = \begin{bmatrix} \mathbf{b}_m \\ \mathbf{b}_r \end{bmatrix},$$

the Schur complement yields the prior on the remaining states:

$$\big(\mathbf{H}_{rr} - \mathbf{H}_{rm}\mathbf{H}_{mm}^{-1}\mathbf{H}_{mr}\big)\,\delta\mathbf{x}_r = \mathbf{b}_r - \mathbf{H}_{rm}\mathbf{H}_{mm}^{-1}\mathbf{b}_m.$$

The marginalization process described in [13] uses the eigendecomposition to compute the inverse of the full matrix block $\mathbf{H}_{mm}$, which can become very time consuming for a big matrix. For the marginalization in the DVIO, we utilize the sparse structure of the information matrix block $\mathbf{H}_{mm}$. Consequently, we rearrange the elements of the matrix: first we group together all the sparse diagonal blocks corresponding to the landmarks that are getting marginalized; next we group together the blocks corresponding to the keyframe pose and IMU constraints. With this, the diagonal of the matrix $\mathbf{H}_{mm}$ is separated into two blocks: the sparse block for landmarks and the dense block for pose and IMU constraints. We handle the marginalization of these two blocks separately, as sketched after this list:

- In the case of the 1D feature parameterization, the information matrix for the landmarks is a diagonal matrix. Hence, its inverse can be computed easily for a full-rank matrix; for a rank-deficient matrix, we use its pseudo-inverse. This allows us to marginalize all the landmarks in one simple step.
- In the case of the 3D feature parameterization, each landmark has three correlated coordinates. Hence, it is not possible to marginalize all the landmarks together in one single step. Instead, we compute the inverse of a 3 × 3 matrix and marginalize each landmark one by one. Compared to the original method, a lower processing time can still be achieved using the well-known closed-form solution for the inverse of any full-rank 3 × 3 matrix. We use the pseudo-inverse to handle rank-deficient information matrix blocks, which arise from landmarks at infinity or landmarks visible in only one keyframe.

Once all the landmarks are marginalized, the dense block can be marginalized using the standard marginalization step. Since the size of the remaining block is small and fixed, the Schur complement can be efficiently applied using the LDL decomposition [16]. With the block-based marginalization, we are able to achieve a speed-up in the processing time for both the 1D and the 3D feature parameterized DVIO. As a result, we are able to maintain the real-time performance of the DVIO, as shown in Section 3.2. The fast marginalization algorithm is summarized in Table 1.
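The following numpy sketch illustrates the idea with illustrative names (it is not the paper's code): a generic Schur-complement marginalization, and the fast path for the 1D case, assuming $\mathbf{H}$ and $\mathbf{b}$ have been rearranged so that the landmark states to be marginalized come first. In the 3D case the same pattern applies, inverting one 3 × 3 block per landmark in closed form.

```python
import numpy as np

def schur_marginalize(H, b, m):
    """Marginalize the first m states of the system H dx = b via the
    Schur complement; returns the prior (H', b') on the remaining states."""
    Hmm, Hmr = H[:m, :m], H[:m, m:]
    Hrm, Hrr = H[m:, :m], H[m:, m:]
    Hmm_inv = np.linalg.pinv(Hmm)        # pseudo-inverse handles rank deficiency
    return Hrr - Hrm @ Hmm_inv @ Hmr, b[m:] - Hrm @ Hmm_inv @ b[:m]

def fast_marginalize_1d(H, b, n_landmarks):
    """Block-based fast path for 1D-parameterized features: the landmark
    block of H is diagonal, so its (pseudo-)inverse is elementwise."""
    m = n_landmarks
    d = np.diag(H[:m, :m]).copy()
    d_inv = np.zeros_like(d)
    nz = d > 1e-12                        # pseudo-inverse of the diagonal block
    d_inv[nz] = 1.0 / d[nz]
    Hrm = H[m:, :m]
    H_new = H[m:, m:] - Hrm @ (d_inv[:, None] * H[:m, m:])
    b_new = b[m:] - Hrm @ (d_inv * b[:m])
    # The small dense pose/IMU block that remains is then marginalized with
    # the standard Schur-complement step (LDL decomposition in the text).
    return H_new, b_new
```

The speed-up comes entirely from replacing the dense inverse of the large landmark block with elementwise (1D) or per-landmark 3 × 3 (3D) inverses.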
Our main use case for the proposed DVIO is in augmented reality applications on mobile devices. In this section, we carry out experiments to compare the performance of our proposed DVIO system with state-of-the-art VIO systems, namely VINS-Mono [13], VINS-RGBD [18] and DUI-VIO [19], on handheld indoor test scenarios. In [8], the VINS-Mono system showed the best pose accuracy among all the evaluated VIO systems. Hence, we use VINS-Mono as the baseline for the current monocular camera-based VIO approaches. The VINS-RGBD system added depth to the initialization step of VINS-Mono and also used depth to make the feature state constant for certain features in the optimization step. The DUI-VIO system exploited the uncertainty of the depth data and added constraints computed using the depth measurement to the optimization step. These two systems use depth from an RGBD sensor together with IMU measurements to improve the pose estimation, which makes them good candidates to compare against the DVIO's performance.

To evaluate the performance, we created a test set of 35 data sequences. All the data sequences are captured in the rosbag format to enable their use with the ROS framework. The data sequences can be divided into three types:

- Synthetic data: The first set of sequences was created in a simulation environment using the Unity3D software. In these sequences, the camera moves in a photorealistic synthetic environment created using 3D models rendered in Unity3D. The device poses, trajectories and IMU data are obtained with a technique similar to that introduced in [31]. The RGBD data are generated noise free. These sequences will be made available to the community for future research.
- VINS-RGBD dataset: The second set of sequences used for evaluation was created by the VINS-RGBD [18] project. These sequences were captured handheld using an Intel RealSense D435i sensor inside a large room with three motion scenarios.
- VCU-RVI handheld benchmark: The third set of sequences was created by the VCU-RVI benchmark [30]. In this set, the data is captured using an Occipital Structure Core sensor with different motion scenarios. In total, twenty-seven data sequences covering ~2.7 kilometers of trajectory were recorded in various indoor environments.

The test set includes 6 sequences (rgbd simple, rgbd normal, rgbd rotation, u3d simple, u3d normal, u3d rotation) with unsynchronized timestamps between the RGBD and IMU sensors. With these sequences we are able to compare all the aspects of the DVIO with the other state-of-the-art VIO systems.

To compare the pose accuracy of the different systems on the test sequences, we use the RMSE of the absolute trajectory error (ATE) and the relative pose error (RPE) [32], [33] of the estimated trajectory with respect to the ground truth trajectory.

- ATE:
$$E_n = Q_n^{-1}\, S\, P_n, \qquad \text{ATE}_{\text{RMSE}} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \big\| \operatorname{trans}(E_n) \big\|^2}$$
- RPE for every other frame using consecutive frame pairs:
$$F_n = \big(Q_n^{-1} Q_{n+1}\big)^{-1} \big(P_n^{-1} P_{n+1}\big), \qquad \text{RPE}_{\text{RMSE}} = \sqrt{\frac{1}{N-1} \sum_{n=1}^{N-1} \big\| \operatorname{trans}(F_n) \big\|^2}$$

Here, $Q_n$ is the ground truth pose and $P_n$ is the estimated pose for frame $n$ in the sequence, and $S$ is the rigid body transformation between the ground truth and the estimated trajectory. $N$ is the total number of frames in the test sequence, and $\operatorname{trans}(E)$ refers to the translational component of the pose error $E$.

The results of the tests are summarized in Table 2 and Table 3. The smallest RMSE is highlighted in boldface for every sequence. In the tables, an "X" indicates that the method failed to initialize or resulted in a large (>50 meters) ATE. In Table 2, we present the results for the synthetic dataset and the VINS-RGBD dataset. For all the synthetic data sequences, the performance of the DVIO is better than that of the other systems. This shows that in a scenario with no measurement noise, the DVIO system has the best pose accuracy. In the VINS-RGBD dataset, the user moves around in a very large room, pointing the camera towards objects that are far away and performing quick motions.
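For reference, the ATE RMSE above can be computed directly from 4 × 4 homogeneous pose matrices. A minimal numpy sketch, assuming the alignment $S$ has already been estimated (the evaluations here rely on established tooling such as evo [33]):

```python
import numpy as np

def ate_rmse(Q, P, S):
    """ATE RMSE from ground truth poses Q, estimated poses P, and the
    alignment S (all 4x4 homogeneous matrices; Q and P are length-N lists)."""
    sq_errors = []
    for Qn, Pn in zip(Q, P):
        En = np.linalg.inv(Qn) @ S @ Pn          # pose error for frame n
        sq_errors.append(np.sum(En[:3, 3] ** 2))  # translational component
    return np.sqrt(np.mean(sq_errors))
```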
Due to these far objects and quick motions, we observe higher depth measurement errors in the VINS-RGBD sequences. In VINS-RGBD, the depth measurement is only used to initialize the feature state, hence its final estimated pose is less impacted by higher depth measurement errors. In contrast, we use the depth measurement to compute constraints in the optimization step. For this reason, we observe higher errors for the DVIO systems on the "rgbd" sequences compared to VINS-RGBD, but the performance gap is small.

In Table 3, we present the results for the VCU-RVI handheld benchmark dataset. The easy and motion sequences have relatively simple motion scenarios, and the DVIO achieves the best ATE for these sequences. The corridor and hall sequences have motion scenarios with longer trajectories, which causes a higher ATE for all the systems. In the corridor1 sequence, both DVIO 1D and DVIO 3D diverge at some point, and in the hall3 sequence, VINS-Mono, VINS-RGBD and DVIO 1D diverge. Overall, for these longer trajectory sequences, the DVIO performs best in 5 out of 7 sequences. The light and dynamic test sequences contain challenging motion scenarios with moving objects in the scene as well as changing lighting conditions. The DVIO performs best for all of these sequences except the light6 sequence. In the DVIO, the depth measurements provide additional constraints to the optimization step, which improves the pose estimation in these challenging scenarios. In Figure 4, we show the trajectory comparisons of the VIO systems for different test sequences.

In the evaluation experiments we observed that the main weakness of the DVIO system comes from the depth measurement error of the RGBD sensor. The accuracy of the depth measurement in an RGBD sensor is good for the depth range of 0.5 to 4 meters (room scale), but it deteriorates considerably as the scene depth increases beyond this range. We also notice that in scenes with many occlusions, the depth map has shadow regions with no depth measurement or very erroneous depth measurements. Since the proposed approach uses depth measurements to compute constraints in the optimization step, such large measurement errors have a correspondingly larger negative impact on the DVIO performance. Hence, these scenarios need to be handled to further improve the performance of a depth-aided VIO system. This can be done by improving the depth measurement accuracy at the RGBD sensor level or by using deep learning methods.

In these experiments we also observe that the quantitative difference between the performance of DVIO 1D and DVIO 3D is small. However, DVIO 3D shows the most reliable performance, with only one failure case, and also consistently performs best on the longer trajectory sequences. DVIO 3D has a bigger state size and computes more constraints in the optimization step compared to DVIO 1D. These extra constraints lead to better reliability and accuracy for DVIO 3D; however, they also increase the processing time. Hence, we recommend the DVIO 1D system for small-scale SLAM scenarios with stringent processing time requirements, whereas for large-scale indoor SLAM scenarios we recommend the DVIO 3D system for more robust and accurate performance.
In the DVIO, by adding the depth measurement to the optimization step and also increasing the state size in the 3D case, we have increased the problem size for the nonlinear optimization process. To compensate for this, we have improved the processing time of the marginalization step in the DVIO. In Table 4, we compare the total processing time and the marginalization time of a sliding window optimization for DVIO 1D with and without the fast marginalization, DVIO 3D with and without the fast marginalization, VINS-Mono and VINS-RGBD. These are average runtime numbers, generated by offline processing of the u3d sine test sequence on a Linux machine with an Intel Core i7-5930K processor and 64 GB DRAM. As one can see from the results, the fast marginalization provides 80% and 94% improvements in the marginalization step for the 1D and 3D parameterized feature scenarios, respectively. This enables the DVIO optimization thread to run at a 10 Hz update rate, allowing real-time performance of the backend thread.

In this paper, we proposed a method to add the depth measurement from an RGBD sensor to a nonlinear optimization-based VIO system. We have shown that the depth measurement can be used with both 1D and 3D feature parameterizations. Moreover, we proposed a method to utilize the depth measurement in the estimation of the time offset between unsynchronized RGBD and IMU sensors. For the marginalization process, we proposed a block-based fast marginalization scheme, which reduces the runtime significantly. Finally, we carried out extensive experiments to show that our proposed DVIO system outperforms the state of the art in terms of reliability, accuracy and runtime.

References:
[1] MonoSLAM: Real-Time Single Camera SLAM
[2] LSD-SLAM: Large-Scale Direct Monocular SLAM
[3] ORB-SLAM: A Versatile and Accurate Monocular SLAM System
[4] Sliding Window Based Monocular SLAM Using Nonlinear Optimization
[5] Vision-Based SLAM: Stereo and Monocular Approaches
[6] Real-time visual odometry from dense RGB-D images
[7] 3-D Mapping With an RGB-D Camera
[8] A Benchmark Comparison of Monocular Visual-Inertial Odometry Algorithms for Flying Robots
[9] A Comparative Analysis of Visual-Inertial SLAM for Assisted Wayfinding of the Visually Impaired
[10] A Multi-State Constraint Kalman Filter for Vision-aided Inertial Navigation
[11] Vision-aided inertial navigation for resource-constrained systems
[12] Keyframe-Based Visual-Inertial SLAM using Nonlinear Optimization
[13] VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator
[14] A general optimization-based framework for local odometry estimation with multiple sensors
[15] On-Manifold Preintegration for Real-Time Visual-Inertial Odometry
[16] Bundle adjustment: A modern synthesis
[17] IMU-RGBD camera 3D pose estimation and extrinsic calibration: Observability analysis and consistency improvement
[18] RGBD-Inertial Trajectory Estimation and Mapping for Ground Robots
[19] DUI-VIO: Depth Uncertainty Incorporated Visual Inertial Odometry based on an RGB-D Camera, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
[20] Fusion of Inertial and Visual Measurements for RGB-D SLAM on Mobile Devices
[21] Dense RGB-D-inertial SLAM with map deformations
[22] Good features to track
[23] An iterative image registration technique with an application to stereo vision
[24] Inverse depth for accurate photometric and geometric error minimisation in RGB-D dense visual odometry
[25] High-Accuracy Stereo Depth Maps Using Structured Light
[26] Multiple view geometry in computer vision
[27] Ceres Solver, http://ceres-solver.org
[28] Online Temporal Calibration for Monocular Visual-Inertial Systems
[29] Sliding window filter with application to planetary landing
[30] The VCU-RVI Benchmark: Evaluating Visual Inertial Odometry for Indoor Navigation Applications with an RGB-D Camera
[31] InteriorNet: Mega-scale Multi-sensor Photo-realistic Indoor Scenes Dataset
[32] A benchmark for the evaluation of RGB-D SLAM systems
[33] evo: Python package for the evaluation of odometry and SLAM