International Journal of Advanced Network, Monitoring and Controls, Volume 04, No.04, 2019
DOI: 10.21307/ijanmc-2019-070

Improved Stereo Vision Robot Locating and Mapping Method

Yu Haige
School of Computer Science and Engineering, Xi'an Technological University
Xi'an, Shaanxi, China
E-mail: 279084342@qq.com

Yu Fan
School of Computer Science and Engineering, Xi'an Technological University
Xi'an, Shaanxi, China
E-mail: yffshun@163.com

Wei Yanxi
School of Computer Science and Engineering, Xi'an Technological University
Xi'an, Shaanxi, China
E-mail: 407171251@qq.com

Abstract—An outstanding problem of vision-based SLAM is that it fails when the camera moves quickly or when the operating environment offers few features. Aiming at this problem, this paper proposes a SLAM method that fuses an IMU with vision. A stereo camera is used to extract ORB feature points from the images. During camera movement, if the number of extracted feature points falls below a threshold so that the camera motion cannot be estimated, or if the estimated rotation and translation exceed a threshold, the camera pose is estimated by fusing the IMU; otherwise, the feature points alone are used to estimate the camera pose. Nonlinear optimization is used both for pure feature-point pose estimation and for IMU-fused pose estimation. The experimental results show that binocular vision SLAM with IMU information estimates the camera pose more accurately.

Keywords—Robot; IMU; Stereo Vision; SLAM

I. INTRODUCTION

With the development of robot technology, more and more robots are entering our lives, such as sweeping robots and shopping-mall robots. Mobile robots are the product of the cross-fusion of many disciplines and technologies. Among the enabling technologies, SLAM (Simultaneous Localization and Mapping) is essential for mobile robots. SLAM means that the robot builds a map of the surrounding environment in real time from sensor data, without any prior knowledge, and infers its own position from that map. From the 1980s to the present, more and more sensors have been used in SLAM, from early sonar, to 2D/3D lidar, to monocular, binocular, RGB-D, ToF and other cameras. Compared with lidar, the cameras used in visual SLAM have become the focus of current SLAM research because of their low price, light weight, rich image information, and wide range of applications.

Stereo cameras generally consist of two pinhole cameras placed horizontally. Compared with monocular vision, which suffers from scale uncertainty and the pure-rotation problem, binocular cameras can compute pixel depth directly. At the same time, compared with RGB-D cameras, stereo cameras collect images directly from ambient light and can be used both indoors and outdoors. Compared with lidar, the main disadvantage of the camera as a SLAM sensor is that images blur when the camera moves too fast, and the camera does not work in scenes with insufficient texture and few feature points.

Aiming at these problems of visual SLAM systems, this paper proposes an algorithm that fuses the IMU into SLAM. Through the fusion of the IMU, a good initial pose can be provided for the system.
At the same time, during camera movement the IMU compensates for the shortcomings of visual SLAM, ensuring accurate camera pose estimation under fast camera motion and in environments that lack texture.

II. RELATED WORKS

A. Camera coordinate system

Camera models generally involve four coordinate systems: the pixel coordinate system, the image coordinate system, the world coordinate system, and the camera coordinate system, as shown in Figure 1.

Figure 1. Camera-related coordinate systems

Among them, $O_w X_w Y_w Z_w$ is the world coordinate system. The world coordinate system is the reference coordinate system of the visual SLAM system; the camera trajectory and the positions of map points are described in this coordinate system. Its unit is m. $O_i\text{-}xy$ is the image coordinate system. Its origin is the intersection of the camera's optical axis and the image plane. Its unit is mm. $O_c X_c Y_c Z_c$ is the camera coordinate system. Its origin is the camera's optical center; the directions parallel to the $x$-axis and $y$-axis of the image coordinate system are taken as the $X_c$-axis and $Y_c$-axis, respectively, and the direction perpendicular to the image plane is the $Z_c$-axis. Its unit is m. $O\text{-}uv$ is the pixel coordinate system. Its origin is generally the upper-left corner of the image, with the $u$-axis pointing right, parallel to the $x$-axis, and the $v$-axis parallel to the $y$-axis. Its unit is the pixel.

B. Camera projection model

The camera maps coordinate points of the three-dimensional world onto the two-dimensional image plane. This process is generally described by the pinhole model. Under the pinhole model, assume there is a space point $P$ with coordinates $[X, Y, Z]^T$. After projection through the pinhole $O$, the point $P$ falls on the imaging plane $o\text{-}xy$ at the imaging point $p$, whose coordinates are $[x, y, z]^T$. Let the distance from the imaging plane to the pinhole be the focal length $f$. According to the principle of similar triangles:

$\frac{Z}{f} = \frac{X}{x} = \frac{Y}{y}$  (1)

So we can get:

$x = f\frac{X}{Z}, \quad y = f\frac{Y}{Z}$  (2)

The pixel coordinate system differs from the imaging plane by a scaling and a translation of the origin. Suppose the pixel coordinates are scaled $\alpha$ times on the $u$-axis and $\beta$ times on the $v$-axis, and the origin is translated by $[c_x, c_y]^T$; then:

$u = \alpha x + c_x, \quad v = \beta y + c_y$  (3)

Substituting equation (2) into equation (3), and letting $f_x = \alpha f$ and $f_y = \beta f$, gives:

$u = f_x\frac{X}{Z} + c_x, \quad v = f_y\frac{Y}{Z} + c_y$  (4)

The unit of $f$ is m and the unit of $\alpha$ and $\beta$ is pixel/m, so the unit of $f_x$ and $f_y$ is the pixel. Written in matrix form:

$Z\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \mathbf{K}\mathbf{P}$  (5)

Among them, the matrix $\mathbf{K}$ is called the internal parameter (intrinsic) matrix of the camera, and $\mathbf{P}$ is the coordinate of the space point in the camera coordinate system.
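For concreteness, the projection of equations (4) and (5) can be written as a short Python sketch (not part of the original paper; the intrinsic values below are made-up examples):

import numpy as np

def project_pinhole(P_c, fx, fy, cx, cy):
    """Project a 3D point in the camera frame to pixel coordinates, as in eq. (4)/(5)."""
    X, Y, Z = P_c
    u = fx * X / Z + cx
    v = fy * Y / Z + cy
    return np.array([u, v])

# Example with made-up intrinsics: fx = fy = 500 px, principal point (320, 240)
print(project_pinhole(np.array([0.2, -0.1, 2.0]), 500.0, 500.0, 320.0, 240.0))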
Let $\mathbf{P}$ be the coordinates of the space point in the camera coordinate system and $\mathbf{P}_w$ its coordinates in the world coordinate system. Using the coordinate transformation:

$Z\mathbf{P}_{uv} = Z\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{K}(\mathbf{R}\mathbf{P}_w + \mathbf{t}) = \mathbf{K}\mathbf{T}\mathbf{P}_w$  (6)

Among them, $\mathbf{T}$ represents the pose of the camera relative to the world coordinate system and is also called the external parameter (extrinsic) of the camera.

In summary, the pinhole camera model uses similar triangles to relate space points to pixels, and it is a fairly idealized model. In practice there are errors in the manufacture and installation of optical lenses, which affect the propagation of light during imaging and cause distortion in the images collected by the camera. Here we mainly consider radial distortion and tangential distortion.

Radial distortion is caused by the shape of the lens, and it increases with the distance between a pixel and the image center. It can therefore be described by a polynomial in that distance; that is, quadratic and higher-order terms in the distance between the pixel and the image center are used for correction. The coordinate change before and after radial distortion correction is:

$x_{corrected} = x(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)$
$y_{corrected} = y(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)$  (7)

Among them, $[x, y]^T$ are the coordinates of the uncorrected point, $[x_{corrected}, y_{corrected}]^T$ are the coordinates of the point after the distortion is corrected, and $r$ is the distance from the point $(x, y)$ to the origin. $k_1$, $k_2$ and $k_3$ are three radial distortion parameters, which are usually obtained in the calibration step.

Tangential distortion arises because the lens and the imaging plane cannot be made strictly parallel during camera assembly. It can be corrected with two further parameters, $p_1$ and $p_2$:

$x_{corrected} = x + 2p_1 xy + p_2(r^2 + 2x^2)$
$y_{corrected} = y + p_1(r^2 + 2y^2) + 2p_2 xy$  (8)

Considering both types of distortion, the correct position of a pixel in the pixel coordinate system can be found through five distortion coefficients:

$x_{corrected} = x(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2p_1 xy + p_2(r^2 + 2x^2)$
$y_{corrected} = y(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1(r^2 + 2y^2) + 2p_2 xy$  (9)

In summary, the parameters describing the camera model mainly include the camera intrinsic parameters $f_x$, $f_y$, $c_x$, $c_y$ and the distortion correction parameters $k_1$, $k_2$, $k_3$, $p_1$, $p_2$.
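The five-coefficient model of equation (9) can likewise be sketched in Python (illustrative only; it simply evaluates the polynomial on normalized image coordinates, and the coefficient values are hypothetical):

import numpy as np

def distortion_model(x, y, k1, k2, k3, p1, p2):
    """Evaluate the radial + tangential distortion polynomial of eq. (9)
    on normalized image coordinates (x, y)."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    x_corr = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_corr = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_corr, y_corr

# Hypothetical coefficients of the kind returned by a calibration step
print(distortion_model(0.1, -0.05, k1=-0.28, k2=0.07, k3=0.0, p1=1e-4, p2=-2e-4))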
C. Stereo camera ranging principle

The binocular camera generally consists of two pinhole cameras placed horizontally that observe an object together. The aperture centers of the two cameras lie on one axis, and the distance between them is called the baseline $b$ of the binocular camera. Consider a space point $P$ that is imaged in the left-eye camera and in the right-eye camera, denoted $P_L$ and $P_R$. Because of the camera baseline, these two imaging positions differ. Denote the horizontal image coordinates on the left and right as $u_L$ and $u_R$. From the similarity of triangles:

$\frac{z - f}{z} = \frac{b - u_L + u_R}{b}$  (10)

Defining the disparity $d = u_L - u_R$, we get:

$z = \frac{fb}{d}$  (11)

The above model is an ideal one, intended to explain how a binocular camera measures the depth of a three-dimensional point. In practical applications, because of manufacturing and installation factors, it is difficult to make the imaging planes of the two cameras lie strictly in the same plane with strictly parallel optical axes. Therefore, before using a binocular camera for measurement, it should be calibrated to obtain the intrinsic parameters of the left and right cameras and the relative pose between them.

III. POSE ESTIMATION ALGORITHM

At present, fusion methods for vision sensors and IMUs can be divided into two types: loose coupling and tight coupling [1]. Loose coupling treats the vision sensor and the IMU as two separate modules, each of which computes pose information; the results are then fused, for example by an EKF [2]. Tight coupling refers to jointly processing the vision and IMU data in a nonlinear optimization to obtain the pose estimate. Because tight coupling can make full use of each sensor's data, this paper uses tight coupling to fuse vision and IMU data. First, the purely visual feature-point method is used to estimate the camera pose. Then, during camera movement, if the number of extracted feature points falls below a threshold so that the camera motion cannot be estimated, or if the estimated rotation and translation exceed a threshold, the camera pose is estimated by fusing the IMU; otherwise feature points are still used to estimate the camera pose.

A. Pose estimation using pure visual information

The ORB (Oriented FAST and Rotated BRIEF) algorithm was proposed by Ethan Rublee et al. in 2011 [3]. The ORB feature is composed of the FAST corner and the BRIEF descriptor: it adds orientation and scale invariance to the FAST corner, and features are described with binary BRIEF descriptors. When performing feature matching, the descriptors of the feature points are compared.

With known pixel matches between the left and right camera images, the binocular camera can directly obtain the 3D position corresponding to a pixel. Therefore, a stereo-camera-based SLAM system can use known 3D points and their matched projections in the current frame to obtain the current camera pose, without solving the camera motion through epipolar geometry [4]. This paper first uses the EPnP method [5] to solve the camera pose. The EPnP solution makes effective use of the matching-point information, and the camera pose is then refined iteratively.
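As an illustration only (the paper does not state which implementation it uses), an EPnP solution of this kind can be obtained from matched 3D-2D points with OpenCV's solver; the point arrays and intrinsics below are placeholders:

import cv2
import numpy as np

# Placeholder data: in practice these come from map points and matched ORB keypoints
object_points = np.random.rand(20, 3).astype(np.float32)          # n x 3 points in the world frame
image_points = (np.random.rand(20, 2) * 640.0).astype(np.float32)  # n x 2 pixel projections
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])  # example intrinsics
dist = np.zeros(5)                                                 # k1, k2, p1, p2, k3

# Solve the PnP problem with the EPnP algorithm [5]
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)   # rotation matrix R and translation vector tvec of the camera
print(ok, R, tvec.ravel())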
EPnP takes as input the coordinates $\{P_i^w,\ i = 1, 2, \ldots, n\}$ of $n$ space points in the world coordinate system and their corresponding image coordinates, and solves for the rotation matrix $\mathbf{R}$ and translation vector $\mathbf{t}$ of the camera motion. Four non-coplanar virtual control points are set in the world coordinate system, with homogeneous coordinates $\{C_j^w \mid j = 1, 2, 3, 4\}$. The relationship between the world coordinates of the space points and the control points is:

$P_i^w = \sum_{j=1}^{4} \alpha_{ij} C_j^w, \quad \text{with} \quad \sum_{j=1}^{4} \alpha_{ij} = 1$  (12)

Once the virtual control points are determined, and provided the four control points are not coplanar, the coefficients $\alpha_{ij},\ j = 1, \ldots, 4$ are uniquely determined. In the camera coordinate system the same weighted-sum relationship holds:

$P_i^c = \sum_{j=1}^{4} \alpha_{ij} C_j^c$  (13)

Substituting equation (13) into the camera model gives:

$s_i \begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} = \mathbf{K} P_i^c = \mathbf{K}\sum_{j=1}^{4}\alpha_{ij} C_j^c = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}\sum_{j=1}^{4}\alpha_{ij}\begin{bmatrix} x_j^c \\ y_j^c \\ z_j^c \end{bmatrix}$  (14)

The image coordinates $u_i, v_i$ in equation (14) are known, and the third row gives:

$s_i = \sum_{j=1}^{4} \alpha_{ij} z_j^c$  (15)

From equations (14) and (15):

$\sum_{j=1}^{4} \left( \alpha_{ij} f_x x_j^c + \alpha_{ij}(c_x - u_i) z_j^c \right) = 0$
$\sum_{j=1}^{4} \left( \alpha_{ij} f_y y_j^c + \alpha_{ij}(c_y - v_i) z_j^c \right) = 0$  (16)

To recover the coordinates of the points in the camera coordinate system, it is assumed that the coefficients $\alpha_{ij}$ in the camera coordinate system are the same as the $\alpha_{ij}$ in the world coordinate system; the problem then reduces to finding the positions of the four control points in the camera frame. Stacking the constraints for all $n$ points yields the linear system:

$\mathbf{M}\mathbf{X} = 0$  (17)

where $\mathbf{M}$ is a $2n \times 12$ matrix and $\mathbf{X} = [C_1^{cT}, C_2^{cT}, C_3^{cT}, C_4^{cT}]^T$ is the vector of the 12 unknowns to be solved. The solution lies in the null space of $\mathbf{M}$:

$\mathbf{X} = \sum_{i=1}^{N} \beta_i v_i$  (18)

Here $v_i$ are the right singular vectors of $\mathbf{M}$ whose corresponding singular values are 0; they can be obtained as the eigenvectors of $\mathbf{M}^T\mathbf{M}$ with eigenvalue 0. $N$ is the dimension of the null space of $\mathbf{M}^T\mathbf{M}$, and the $\beta_i$ are coefficients to be determined. Depending on the positions of the reference points, $N$ may take the values 1, 2, 3 or 4. Because the distances between control points are the same in the world coordinate system and in the camera coordinate system, six constraints are obtained, from which the undetermined coefficients can be solved.

When $N = 1$, the constraint is:

$\left\| \beta v^{[i]} - \beta v^{[j]} \right\|^2 = \left\| C_i^w - C_j^w \right\|^2$  (19)

and so:

$\beta = \frac{\sum_{[i,j] \in [1,4]} \left\| v^{[i]} - v^{[j]} \right\| \cdot \left\| C_i^w - C_j^w \right\|}{\sum_{[i,j] \in [1,4]} \left\| v^{[i]} - v^{[j]} \right\|^2}$  (20)

When $N = 2$:

$\left\| \left(\beta_1 v_1^{[i]} + \beta_2 v_2^{[i]}\right) - \left(\beta_1 v_1^{[j]} + \beta_2 v_2^{[j]}\right) \right\|^2 = \left\| C_i^w - C_j^w \right\|^2$  (21)

Since $\beta_1$ and $\beta_2$ appear only as quadratic terms, let $\boldsymbol{\beta} = [\beta_1^2, \beta_1\beta_2, \beta_2^2]^T$ and collect all $\left\| C_i^w - C_j^w \right\|^2$ into the vector $\boldsymbol{\rho}$, which yields the equation:

$\mathbf{L}\boldsymbol{\beta} = \boldsymbol{\rho}$  (22)

where $\mathbf{L}$ is a $6 \times 3$ matrix composed from $v_1$ and $v_2$. When $N = 3$, $\mathbf{L}$ is a $6 \times 6$ matrix.

In summary, the coordinates of the reference points in the camera coordinate system can be obtained and used as the initial value of an optimization whose variable is $\boldsymbol{\beta} = [\beta_1, \beta_2, \ldots, \beta_N]^T$ and whose objective function is:

$Error(\boldsymbol{\beta}) = \sum_{(i,j)\ \text{s.t.}\ i \neq j} \left( \left\| C_i^c - C_j^c \right\|^2 - \left\| C_i^w - C_j^w \right\|^2 \right)$  (23)
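As a brief aside on equation (12), the barycentric coefficients $\alpha_{ij}$ of a world point with respect to four non-coplanar control points reduce to a 4x4 linear system; a minimal Python sketch (with arbitrarily chosen control points) is:

import numpy as np

def barycentric_coefficients(P_w, C_w):
    """Solve eq. (12): find alpha with sum(alpha) = 1 and sum_j alpha_j * C_w[j] = P_w.
    C_w is a 4x3 array of non-coplanar control points, P_w a 3-vector."""
    A = np.vstack([C_w.T, np.ones((1, 4))])   # rows: x, y, z equations plus the sum-to-one row
    b = np.append(P_w, 1.0)
    return np.linalg.solve(A, b)

# Arbitrary example: control points at the origin and the three unit axes
C_w = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
alpha = barycentric_coefficients(np.array([0.2, 0.3, 0.4]), C_w)
print(alpha, alpha.sum())   # the coefficients sum to 1 and reproduce the point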
The $\boldsymbol{\beta}$ that minimizes the error is selected, the vector $\mathbf{X}$ is obtained, and the coordinates of the control points in the camera coordinate system are restored. At the same time, the coordinates of the reference points in the camera coordinate system are obtained from the barycentric coefficients. Finally, from the coordinates of the same set of points in the two coordinate systems, the pose transformation between the two coordinate systems is obtained. The solution steps are as follows:

a) Find the centroids:

$p_c = \frac{1}{n}\sum_{i=1}^{n} p_i^c, \quad p_w = \frac{1}{n}\sum_{i=1}^{n} p_i^w$  (24)

b) Remove the centroids:

$q_i^c = p_i^c - p_c, \quad q_i^w = p_i^w - p_w$  (25)

c) Compute the matrix $\mathbf{H}$:

$\mathbf{H} = \sum_{i=1}^{n} q_i^c \, q_i^{wT}$  (26)

d) Perform the SVD decomposition of $\mathbf{H}$:

$\mathbf{H} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$  (27)

e) Compute the rotation $\mathbf{R}$:

$\mathbf{R} = \mathbf{U}\mathbf{V}^T$  (28)

If $\det(\mathbf{R}) < 0$, then $\mathbf{R}(2,:) = -\mathbf{R}(2,:)$.

f) Compute the displacement $\mathbf{t}$:

$\mathbf{t} = p_c - \mathbf{R}\, p_w$  (29)

Taking the EPnP result as the initial value, g2o is then used to optimize the camera pose nonlinearly. The least-squares problem for the best camera pose is constructed as:

$\xi^{*} = \arg\min_{\xi} \frac{1}{2}\sum_{i=1}^{n} \left\| u_i - \frac{1}{s_i}\mathbf{K}\exp(\xi^{\wedge})P_i \right\|_2^2$  (30)

Among them, $u_i$ is the pixel coordinate of the projection point, $\mathbf{K}$ is the camera intrinsic matrix, $\xi$ is the camera pose in Lie-algebra form, $s_i$ is the depth of the point, and $P_i$ is the space point coordinate.
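Steps a) to f) admit a compact Python sketch (illustrative, assuming the two point sets are already in correspondence; the reflection case is handled here with the usual SVD sign correction rather than the row negation written in step e)):

import numpy as np

def align_point_sets(p_c, p_w):
    """Solve for R, t such that p_c ~= R @ p_w + t, following steps a)-f).
    p_c, p_w: n x 3 arrays of matched points in the camera and world frames."""
    pc_mean, pw_mean = p_c.mean(axis=0), p_w.mean(axis=0)   # a) centroids
    q_c, q_w = p_c - pc_mean, p_w - pw_mean                 # b) remove centroids
    H = q_c.T @ q_w                                         # c) H = sum q_i^c q_i^{wT}
    U, S, Vt = np.linalg.svd(H)                             # d) SVD
    R = U @ Vt                                              # e) rotation
    if np.linalg.det(R) < 0:                                # reflection: flip the last singular direction
        U[:, -1] = -U[:, -1]
        R = U @ Vt
    t = pc_mean - R @ pw_mean                               # f) translation
    return R, t

# Example with a known transform
rng = np.random.default_rng(0)
p_w = rng.normal(size=(10, 3))
R_true = np.linalg.qr(rng.normal(size=(3, 3)))[0]
if np.linalg.det(R_true) < 0:
    R_true[:, 0] = -R_true[:, 0]
p_c = p_w @ R_true.T + np.array([0.5, -0.2, 1.0])
R_est, t_est = align_point_sets(p_c, p_w)
print(np.allclose(R_est, R_true, atol=1e-6), t_est)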
B. Camera pose estimation method based on IMU

The measurement frequency of the IMU is usually higher than the frequency at which the camera collects images. For example, the binocular camera used in this paper has a frame rate of up to 60 FPS, while the IMU rate is up to 500 Hz. The difference in frequency means there are multiple IMU measurements between two image frames. Therefore, to fuse the information from the two sensors, the IMU data must be pre-integrated [6]; that is, only the IMU measurements between two image timestamps are integrated to obtain a relative pose, and the integration result is saved for later joint optimization.

The IMU-based camera pose estimation method involves three coordinate systems: the world coordinate system, the IMU coordinate system, and the camera coordinate system. All poses and feature-point coordinates are ultimately expressed in the world coordinate system. During the computation, the state quantities in the camera coordinate system are converted to the IMU coordinate system, and then to the world coordinate system. In this paper the letter W denotes the world coordinate system and the letter B denotes the IMU (body) coordinate system; $R_{WB}$ denotes the rotation from the IMU coordinate system to the world coordinate system, and $p_{WB}$ denotes the translation from the IMU coordinate system to the world coordinate system. The measured angular velocity and acceleration of the IMU are:

${}_B\tilde{\omega}_{WB}(t) = {}_B\omega_{WB}(t) + b^g(t) + \eta^g(t)$
$\tilde{a}_B(t) = R_{WB}^T(t)\left(a_W(t) - g_W\right) + b^a(t) + \eta^a(t)$  (31)

Among them, $b^a(t)$ and $b^g(t)$ are the biases of the accelerometer and the gyroscope, $\eta^a(t)$ and $\eta^g(t)$ are the noises of the accelerometer and the gyroscope, and $g_W$ is the gravity vector in the world coordinate system.

The derivatives of the rotation, velocity and translation are:

$\dot{R}_{WB} = R_{WB}\,{}_B\omega_{WB}^{\wedge}, \quad \dot{v}_W = a_W, \quad \dot{p}_W = v_W$  (32)

The rotation, velocity and translation in the world coordinate system are obtained by integration:

$R_{WB}(t+\Delta t) = R_{WB}(t)\,\mathrm{Exp}\!\left(\int_t^{t+\Delta t} {}_B\omega_{WB}(\tau)\, d\tau\right)$
$v_W(t+\Delta t) = v_W(t) + \int_t^{t+\Delta t} a_W(\tau)\, d\tau$
$p_W(t+\Delta t) = p_W(t) + \int_t^{t+\Delta t} v_W(\tau)\, d\tau + \iint_t^{t+\Delta t} a_W(\tau)\, d\tau^2$  (33)

Using equation (33) in discrete time with Euler integration:

$R_{WB}(t+\Delta t) = R_{WB}(t)\,\mathrm{Exp}\!\left({}_B\omega_{WB}(t)\,\Delta t\right)$
$v_W(t+\Delta t) = v_W(t) + a_W(t)\,\Delta t$
$p_W(t+\Delta t) = p_W(t) + v_W(t)\,\Delta t + \tfrac{1}{2} a_W(t)\,\Delta t^2$  (34)

The IMU model is obtained from equations (31) and (34):

$R(t+\Delta t) = R(t)\,\mathrm{Exp}\!\left(\left(\tilde{\omega}(t) - b^g(t) - \eta^{gd}(t)\right)\Delta t\right)$
$v(t+\Delta t) = v(t) + g\,\Delta t + R(t)\left(\tilde{a}(t) - b^a(t) - \eta^{ad}(t)\right)\Delta t$
$p(t+\Delta t) = p(t) + v(t)\,\Delta t + \tfrac{1}{2} g\,\Delta t^2 + \tfrac{1}{2} R(t)\left(\tilde{a}(t) - b^a(t) - \eta^{ad}(t)\right)\Delta t^2$  (35)

Suppose there are two image frames at times $t_i$ and $t_j$, with $t_j > t_i$. The IMU pre-integration observation model is then:

$\Delta\tilde{R}_{ij} = R_i^T R_j\,\mathrm{Exp}(\delta\phi_{ij})$
$\Delta\tilde{v}_{ij} = R_i^T\left(v_j - v_i - g\,\Delta t_{ij}\right) + \delta v_{ij}$
$\Delta\tilde{p}_{ij} = R_i^T\left(p_j - p_i - v_i\,\Delta t_{ij} - \tfrac{1}{2} g\,\Delta t_{ij}^2\right) + \delta p_{ij}$  (36)

Among them, $\delta\phi_{ij}$, $\delta v_{ij}$ and $\delta p_{ij}$ are the noise terms of the pre-integrated rotation, velocity and translation, respectively.

For the pose between two adjacent frames, this paper still uses a nonlinear optimization method to fuse the IMU information and the visual information. The state quantities to be optimized are:

$\theta = \left\{ R_{WB}^j,\; p_{WB}^j,\; v_{WB}^j,\; b_g^j,\; b_a^j \right\}$  (37)

In equation (37), $R_{WB}^j$, $v_{WB}^j$ and $p_{WB}^j$ are the rotation, velocity and translation of the IMU coordinate system relative to the world coordinate system at time $j$, and $b_g^j$ and $b_a^j$ are the random-walk biases of the gyroscope and accelerometer at time $j$. The optimal state $\theta^{*}$ is obtained by jointly optimizing the visual reprojection error and the IMU measurement error:

$\theta^{*} = \arg\min_{\theta}\left( \sum_k E_{proj}(k, j) + E_{IMU}(i, j) \right)$  (38)
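A minimal Python sketch of the discrete propagation in equations (34) and (35) is given below (not the paper's implementation; biases, noise terms and measurements are placeholders, and Exp is taken from SciPy's rotation utilities):

import numpy as np
from scipy.spatial.transform import Rotation

def imu_euler_step(R, v, p, omega_meas, a_meas, b_g, b_a, g_w, dt):
    """One Euler-integration step of eq. (35): propagate rotation, velocity and position
    from biased gyroscope/accelerometer measurements (noise terms omitted)."""
    omega = omega_meas - b_g                       # bias-corrected angular rate (body frame)
    a_body = a_meas - b_a                          # bias-corrected specific force (body frame)
    R_next = R @ Rotation.from_rotvec(omega * dt).as_matrix()   # R * Exp(omega * dt)
    v_next = v + g_w * dt + R @ a_body * dt
    p_next = p + v * dt + 0.5 * g_w * dt**2 + 0.5 * R @ a_body * dt**2
    return R_next, v_next, p_next

# Placeholder values: a stationary IMU measuring only gravity along -z of the world frame
g_w = np.array([0.0, 0.0, -9.81])
R, v, p = np.eye(3), np.zeros(3), np.zeros(3)
for _ in range(500):                               # 1 s of data at 500 Hz
    R, v, p = imu_euler_step(R, v, p,
                             omega_meas=np.zeros(3),
                             a_meas=np.array([0.0, 0.0, 9.81]),
                             b_g=np.zeros(3), b_a=np.zeros(3),
                             g_w=g_w, dt=1.0 / 500.0)
print(v, p)   # stays near zero: gravity cancels the measured specific force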
C. Experimental design

The host computer of the experimental platform is a laptop running Ubuntu 16.04, with 8 GB of RAM and a Core i5-8250U processor with a base frequency of 1.6 GHz. The robot platform is a Dashgo D1 mobile robot platform that supports the ROS development system; its overall size is 406 mm x 210 mm and the diameter of the driving wheels is 125 mm. The binocular camera sensor is a MYNT EYE D1000-IR-120/Color.

The experiments in this paper focus on the positioning accuracy of the robot. The evaluation index is the RMSE (root mean square error) of the robot position:

$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{p}_i - p_i\right)^2}$  (39)

where $\hat{p}_i$ is the estimated robot position and $p_i$ is the actual robot position.

Figure 2. Robot straight-line driving positioning experiment

Robot positioning experiments are performed in a corridor environment with few distinctive features and in a feature-rich indoor environment. In the corridor environment, the mobile robot carries the experimental equipment and travels 10 m at a constant speed in the forward direction of the camera; the positioning accuracy of pure vision and of vision fused with the IMU are recorded separately. In the feature-rich indoor environment, a straight-line experiment is also performed: the mobile robot moves 5 m forward in the camera direction at a constant speed, but at 2.5 times the speed of the previous experiment. The experiments are repeated several times and the results recorded.

TABLE I. EXPERIMENTAL RESULTS

Robot operating environment         Pure visual RMSE/m    Vision + IMU RMSE/m
Low-texture corridor environment    0.0746                0.02122
Feature-rich environment            0.1024                0.06502

From the experimental results, the positioning error of stereo vision fused with the IMU is smaller than that of pure vision, which indicates that visual positioning fused with the IMU is more accurate than vision-only positioning in low-texture environments and under fast robot motion.

IV. CONCLUSION

This paper studies the positioning technology of a robot system and proposes a robot positioning method based on binocular vision fused with an IMU. Compared with the pure-vision robot localization method, the proposed method is more robust in low-texture environments and under fast robot motion. The experimental results show that the visual positioning method integrated with the IMU overcomes the defects of pure visual positioning to a certain extent and improves the positioning accuracy of the robot.

REFERENCES

[1] A. Martinelli. Closed-Form Solution of Visual-Inertial Structure from Motion. International Journal of Computer Vision, 2014, 106(2): 138-152.
[2] R. C. Smith, P. Cheeseman. On the Representation and Estimation of Spatial Uncertainty. International Journal of Robotics Research, 1986, 5(4): 56-68.
[3] E. Rublee, V. Rabaud, K. Konolige, et al. ORB: An Efficient Alternative to SIFT or SURF. 2011 International Conference on Computer Vision. IEEE, 2012.
[4] Gao Xiang, Zhang Tao. Fourteen Lectures on Visual SLAM. Beijing: Publishing House of Electronics Industry, 2017.
[5] V. Lepetit, F. Moreno-Noguer, P. Fua. EPnP: An Accurate O(n) Solution to the PnP Problem. International Journal of Computer Vision, 2008, 81(2): 155-166.
[6] C. Forster, L. Carlone, F. Dellaert, et al. On-Manifold Preintegration for Real-Time Visual-Inertial Odometry. IEEE Transactions on Robotics, 2017, 33(1): 1-21.