HAL Id: hal-00840264, https://hal.inria.fr/hal-00840264. Submitted on 2 Jul 2013.

To cite this version: Omar Tahri, Helder Araujo, Youcef Mezouar, François Chaumette. Efficient Iterative Pose Estimation using an Invariant to Rotations. IEEE Transactions on Cybernetics, IEEE, 2014, 22 (4), pp. 199-207. 10.1109/TCYB.2013.2251331. hal-00840264.


Efficient Iterative Pose Estimation using an Invariant to Rotations

Omar Tahri, Helder Araujo, Youcef Mezouar and François Chaumette

Abstract—This paper deals with pose estimation using an iterative scheme. We show that using adequate visual information, pose estimation can be performed iteratively with only three independent unknowns, which are the translation parameters. Specifically, an invariant to rotational motion is used to estimate the camera position. In addition, an adequate transformation is applied to the proposed invariant to decrease the non-linearities between the variations in image space and 3D space. Once the camera position is estimated, we show that the rotation can be estimated efficiently using two different direct methods. The proposed approach is compared against two other methods from the literature. The results show that using our method, pose tracking in image sequences and the convergence rate for randomly generated poses are improved.

Index Terms—Pose estimation, invariant to rotation, projection onto the sphere.

I. INTRODUCTION

Pose estimation (also known as extrinsic camera calibration) consists in determining the position and orientation of a camera with respect to an object coordinate frame using image information. Pose estimation is a classical problem in computer vision [6], [15]. Nevertheless, there has recently been renewed interest in it as a result of automated navigation and model-based vision systems.

Numerous methods to estimate the pose have been proposed in the literature and giving an exhaustive list of them is certainly impossible. Nevertheless, they can be divided into several categories according to the features used or the nature of the estimation method, namely direct methods or iterative methods. The geometric features considered for the estimation of the pose are often points [6], segments [7], contours, conics [18] or

O. Tahri and H. Araujo are with ISR, University of Coimbra, Polo II, PT-3030-290 Coimbra, Portugal ([email protected], [email protected]).

Y. Mezouar is with IFMA - Institut Français de Mécanique Avancée, Institut Pascal, BP 10448, F-63000 Clermont-Ferrand, France ([email protected]).

F. Chaumette is with Inria Rennes-Bretagne Atlantique, Campus de Beaulieu, 35042 Rennes cedex, France ([email protected]).

image moments [22]. Another important issue is the registration problem. Purely geometric [7], or numerical and iterative [6], [2], [16] approaches may be considered. Linear approaches give closed-form solutions free of initialization [8], [1], [14]. However, the pose estimated using such methods is sensitive to image noise and to errors on the camera parameters. Full-scale non-linear optimization techniques [16] minimize the error between the observation and the projection of the features using the model, that is, the reprojection error. The main advantage of the non-linear and iterative approaches is their accuracy. Their main drawback is that they may be subject to local minima and, worse, divergence, if not correctly initialized. Furthermore, they usually require several iterations to minimize the cost function and are generally more time consuming than the direct methods. These problems (i.e. local minima, divergence and time cost) are mainly due to the non-linearities in the mapping between 3D space and image space. The non-linearities are also usually the main reason for the failure of pose filtering strategies [13]. This occurs especially when the initial state is not accurate or when abrupt motions happen (for instance, for the Extended Kalman Filter [21]).

In this paper, we deal with the selection of visual information that decreases the effect of the non-linearities between the variations in the image space and the 3D space. Firstly, we will show that the iterative estimation of the pose can be expressed as an unconstrained minimization of a cost function on three unknowns only, which are the translation parameters. More precisely, to estimate the camera position separately from the rotation, an invariant to rotations computed from the point projections onto the unit sphere will be used. Furthermore, these visual features are chosen to minimize the changes of their corresponding Jacobian matrices with respect to the camera position. Secondly, we will also show that once the camera position is obtained using an iterative method, the rotation can be computed directly, that is, without any iterative method. Therefore, the convergence speed and rate are functions of the translation only.

In the next section, we recall the unified projection model and some basic definitions. Section 3 describes the method we propose. Section 4 compares this method with two iterative methods from the literature.

FIGURE 1. Unified image formation.

II. DEFINITIONS

A. Notations

In the sequel, the following notations will be used:
– P = (X, Y, Z): 3D point coordinates;
– Ps = (xs, ys, zs): point projected onto the unit sphere;
– Pv: virtual point defined by a linear combination of the points projected onto the unit sphere;
– m = (x, y, 1): coordinates of a point projected onto the image plane, in metric units;
– p: coordinates of a point projected onto the image plane, in pixels;
– dij = √(2 − 2 P⊤si Psj): distance between two projected points Psi and Psj on the unit sphere;
– variables followed by * are computed for the camera pose to be estimated;
– all scalars are in italic;
– all matrices and vectors are in bold.

B. Camera Model

Central imaging systems can be modeled using two consecutive projections: spherical and then perspective. This geometric formulation, called the unified model, was proposed by Geyer and Daniilidis in [9]. Consider Fm the frame attached to a virtual unit sphere, as shown in Fig. 1. The frames attached to the sphere Fm and to the perspective camera Fp are related by a simple translation of −ξ along the Z-axis. Let P be a 3D point with coordinates P = (X, Y, Z) in Fm. The world point P is projected onto:

\mathbf{m} = (x, y, 1) = \left( \frac{X}{Z + \xi\|\mathbf{P}\|},\; \frac{Y}{Z + \xi\|\mathbf{P}\|},\; 1 \right) \qquad (1)

and then mapped to the homogeneous image-plane coordinates p = Km, where K is a 3×3 matrix of camera and mirror intrinsic parameters. The matrix K and the parameter ξ can be obtained from calibration using, for example, the methods proposed in [17]. In the sequel, the imaging system is assumed to be calibrated. In this case, the inverse projection onto the unit sphere can be obtained from:

\mathbf{P}_s = \gamma \left( x,\; y,\; 1 - \frac{\xi}{\gamma} \right) \qquad (2)

where

\gamma = \frac{\xi + \sqrt{1 + (1 - \xi^2)(x^2 + y^2)}}{1 + x^2 + y^2}.

The projection onto the unit sphere from the image plane is possible for all sensors obeying the unified model. In other words, it encompasses all sensors in this class [9]: perspective and catadioptric cameras. A large class of fisheye cameras is also covered by this model [4], [3].
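As an illustration of (1)-(2), the following minimal sketch lifts a pixel onto the unit sphere for a calibrated sensor. It is only a numerical example of the formulas above: it assumes NumPy, an intrinsic matrix K and a mirror parameter ξ, and the function name is ours, not from the paper.

import numpy as np

def project_to_sphere(p_pix, K, xi):
    # Metric image-plane coordinates m = (x, y, 1) obtained from the pixel point p
    x, y, _ = np.linalg.solve(K, np.array([p_pix[0], p_pix[1], 1.0]))
    r2 = x * x + y * y
    # gamma as defined below Eq. (2)
    gamma = (xi + np.sqrt(1.0 + (1.0 - xi ** 2) * r2)) / (1.0 + r2)
    # P_s = gamma * (x, y, 1 - xi/gamma) lies on the unit sphere
    return gamma * np.array([x, y, 1.0 - xi / gamma])

# Example with illustrative intrinsics (xi = 0 reduces to a perspective camera)
K = np.array([[800.0, 0.0, 400.0], [0.0, 800.0, 400.0], [0.0, 0.0, 1.0]])
Ps = project_to_sphere([520.0, 310.0], K, xi=0.0)
print(np.linalg.norm(Ps))  # ~1.0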

C. Pose estimation

Pose estimation consists in determining the rigid transformation cMo between the object frame Fo and the camera frame Fc, in unknown position, using the corresponding object image. It is well known that the relationship between an object point with coordinates Pc = [Xc, Yc, Zc, 1]⊤ in Fc and Po = [Xo, Yo, Zo, 1]⊤ in Fo can be written as:

\mathbf{P}_c = {}^{c}\mathbf{M}_o \mathbf{P}_o = \begin{bmatrix} {}^{c}\mathbf{R}_o & {}^{c}\mathbf{t}_o \\ \mathbf{0}_{1\times3} & 1 \end{bmatrix} \mathbf{P}_o. \qquad (3)

The matrix cMo can be estimated by minimizing the modulus of the error in the image:

e = \| \mathbf{s}({}^{c}\mathbf{M}_o) - \mathbf{s}^* \|, \qquad (4)

where s* is the value of a set of visual features computed in the image acquired with the camera at the unknown position, and s(cMo) is the value of the same set of features computed from the object model, the transformation cMo, and the camera model.

III. POSE ESTIMATION METHOD

In this section, we first propose new features to estimate the camera position separately from the rotation. We then propose a method for the direct estimation of the rotation once the translational part of the pose has been determined.


FIGURE 2. Relation between the 3D distance and the distance between projected points on the sphere.

A. Position estimation using an invariant to rotation

1) Invariant to rotations: Let dij be the distance between two projected points on the unit sphere:

d_{ij} = \sqrt{2 - 2\,\mathbf{P}_{s_i}^{\top}\mathbf{P}_{s_j}} \qquad (5)

It can easily be shown that the distance dij is invariant to any rotational motion applied to the camera frame. Indeed, if we apply a rotational motion to the camera frame, the points projected on the sphere undergo the same rotation. In this case, we have P′si = R Psi, where P′si and Psi are the projections of a 3D point onto the unit sphere corresponding to the two camera orientations and R is the matrix defining the rotation. Therefore, P′⊤si P′sj = P⊤si R⊤R Psj = P⊤si Psj since R⊤R = I, which proves the invariance of dij to rotational motions applied to the camera frame. Therefore, the variation of dij only depends on the translation. Furthermore, the Jacobian matrix that links the variation of dij to the translational displacement is given by:

\mathbf{J}_{d_{ij}} = -\frac{\mathbf{P}_{s_i}^{\top}\mathbf{J}_{\mathbf{P}_{s_j}} + \mathbf{P}_{s_j}^{\top}\mathbf{J}_{\mathbf{P}_{s_i}}}{d_{ij}} \qquad (6)

where JPsi and JPsj are the Jacobian matrices that relate the variations of the point coordinates on the unit sphere to the camera translational displacements. This Jacobian has the following form [11]:

\mathbf{J}_{\mathbf{P}_{s_i}} = \frac{-\mathbf{I} + \mathbf{P}_{s_i}\mathbf{P}_{s_i}^{\top}}{\|\mathbf{P}_i\|} \qquad (7)

where ‖Pi‖ is the distance of the 3D point to the center of the sphere. After inserting (7) into (6), we obtain:

\mathbf{J}_{d_{ij}} = -\frac{1}{d_{ij}}\left( \left( \frac{-1}{\|\mathbf{P}_j\|} + \frac{\mathbf{P}_{s_i}^{\top}\mathbf{P}_{s_j}}{\|\mathbf{P}_i\|} \right)\mathbf{P}_{s_i}^{\top} + \left( \frac{-1}{\|\mathbf{P}_i\|} + \frac{\mathbf{P}_{s_i}^{\top}\mathbf{P}_{s_j}}{\|\mathbf{P}_j\|} \right)\mathbf{P}_{s_j}^{\top} \right) \qquad (8)

Further to the invariance to rotations, which allows separating the estimation of the camera position from that of the orientation, it is also possible to decrease the non-linearities between the image space and 3D space. Indeed, the distance dij on the sphere behaves as a function that is approximately inversely proportional to the point depths ‖Pi‖. This means that its corresponding Jacobian matrix depends on the square of the inverse of the point depths. On the other hand, the inverse of the distance behaves approximately as a linear function of the point depths. This should provide better linearizing properties between the image space and 3D space. We therefore propose to use sij = 1/dij for all possible combinations of two projected points. Let us consider the case where the "mean" distance R of the points to the sphere center is such that R ≈ ‖Pi‖ ≈ ‖Pj‖, as shown in Figure 2. In this case, we have:

\mathbf{J}_{s_{ij}} \approx \frac{-1 + \mathbf{P}_{s_i}^{\top}\mathbf{P}_{s_j}}{R}\,\frac{(\mathbf{P}_{s_i} + \mathbf{P}_{s_j})^{\top}}{d_{ij}^{3}} \qquad (9)

Note that −1 + P⊤si Psj = −d²ij/2, so (9) can be written as:

\mathbf{J}_{s_{ij}} \approx \frac{\|\mathbf{P}_{s_i} + \mathbf{P}_{s_j}\|}{2 R\, d_{ij}}\,\frac{(\mathbf{P}_{s_i} + \mathbf{P}_{s_j})^{\top}}{\|\mathbf{P}_{s_i} + \mathbf{P}_{s_j}\|} \qquad (10)

From Figure 2, we have:

\frac{d_{ij}}{D_{ij}} = \frac{1}{R} \qquad (11)

By combining (11) with (10), we obtain:

\mathbf{J}_{s_{ij}} \approx \frac{\|\mathbf{P}_{s_i} + \mathbf{P}_{s_j}\|}{2 D_{ij}}\,\frac{(\mathbf{P}_{s_i} + \mathbf{P}_{s_j})^{\top}}{\|\mathbf{P}_{s_i} + \mathbf{P}_{s_j}\|} \qquad (12)

Note that (Psi + Psj)⊤/‖Psi + Psj‖ is the unit vector that passes through the middle of the two points Psi and Psj, and that ‖Psi + Psj‖ ≈ 2 if R ≫ Dij. This means that the Jacobian matrix Jsij behaves as a constant matrix when the point depth increases.

2) Noise propagation from image space to the new feature space: Applying non-linear transformations to data obtained from the sensor space changes the noise distribution. For instance, if the image noise is white Gaussian, the noise on the invariant features is no longer white Gaussian, since the applied transformation is non-linear. In practice, if the noise level is low, the use of the invariants sij provides adequate performance. If the noise level increases, the propagation of noise from the image to the feature sij should be taken into account. Let us start with the sensitivity of a point projected onto the sphere with respect to noise in the image plane. Taking the derivative of (2), the variation in the coordinates of


the point projected onto the sphere as a function of the variation in the coordinates of the image points (expressed in meters) is obtained, using a first-order approximation, by:

\Delta\mathbf{P}_s = \mathbf{J}_{\mathbf{P}_s/\mathbf{m}}\,\Delta\mathbf{m} \qquad (13)

where:

\mathbf{J}_{\mathbf{P}_s/\mathbf{m}} = \begin{bmatrix} \gamma + x\frac{\partial\gamma}{\partial x} & x\frac{\partial\gamma}{\partial y} & 0 \\ y\frac{\partial\gamma}{\partial x} & \gamma + y\frac{\partial\gamma}{\partial y} & 0 \\ \frac{\partial\gamma}{\partial x} & \frac{\partial\gamma}{\partial y} & 0 \end{bmatrix} \qquad (14)

with:

\frac{\partial\gamma}{\partial x} = \frac{x}{1 + x^2 + y^2}\left( \frac{1 - \xi^2}{\sqrt{1 + (1 - \xi^2)(x^2 + y^2)}} - 2\gamma \right)

\frac{\partial\gamma}{\partial y} = \frac{y}{1 + x^2 + y^2}\left( \frac{1 - \xi^2}{\sqrt{1 + (1 - \xi^2)(x^2 + y^2)}} - 2\gamma \right) \qquad (15)

where γ and ξ have been defined in Section II-B. Therefore, the variation of Ps with respect to the image points in pixels is obtained by:

\Delta\mathbf{P}_s = \mathbf{J}_{\mathbf{P}_s/\mathbf{m}}\,\mathbf{K}^{-1}\Delta\mathbf{p} \qquad (16)

Furthermore, from dij = √(2 − 2 P⊤si Psj), we have:

\Delta d_{ij} = -\frac{1}{d_{ij}}\left( \mathbf{P}_{s_j}^{\top}\Delta\mathbf{P}_{s_i} + \mathbf{P}_{s_i}^{\top}\Delta\mathbf{P}_{s_j} \right) \qquad (17)

As a result of (16) and (17), the variation of sij = 1/dij with respect to noise in the coordinates of the image points (in pixels) is obtained by:

\Delta s_{ij} = \mathbf{J}_{s_{ij}/\mathbf{p}} \begin{bmatrix} \Delta\mathbf{p}_i \\ \Delta\mathbf{p}_j \end{bmatrix} \qquad (18)

where \mathbf{J}_{s_{ij}/\mathbf{p}} = \left[ \mathbf{P}_{s_j}^{\top}\mathbf{J}_{\mathbf{P}_{s_i}/\mathbf{m}_i}\mathbf{K}^{-1} \quad \mathbf{P}_{s_i}^{\top}\mathbf{J}_{\mathbf{P}_{s_j}/\mathbf{m}_j}\mathbf{K}^{-1} \right] / d_{ij}^{3}.

In order to take into account the non-linear mapping from the image point coordinates to the features sij, each visual feature should be weighted by 1/‖Js*ij/p*‖, computed using the image point coordinates corresponding to the pose to be computed. More precisely, we use all possible combinations of swij = (1/dij)(1/‖Js*ij/p*‖) as measures to estimate the camera position.
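The following sketch computes the analytical Jacobian of (14)-(15) and the noise Jacobian of (18), from which the weight 1/‖Js*ij/p*‖ of each feature follows. It is only an illustration of the formulas above: it assumes NumPy, a calibrated K and ξ, and all function names are ours rather than from the paper.

import numpy as np

def jacobian_Ps_wrt_m(x, y, xi):
    # J_{Ps/m} from Eqs. (14)-(15): sensitivity of the sphere point to metric image coordinates
    r2 = x * x + y * y
    root = np.sqrt(1.0 + (1.0 - xi ** 2) * r2)
    gamma = (xi + root) / (1.0 + r2)
    common = ((1.0 - xi ** 2) / root - 2.0 * gamma) / (1.0 + r2)
    dg_dx, dg_dy = x * common, y * common
    return np.array([[gamma + x * dg_dx, x * dg_dy,         0.0],
                     [y * dg_dx,         gamma + y * dg_dy, 0.0],
                     [dg_dx,             dg_dy,             0.0]])

def noise_jacobian_sij(Psi, Psj, Jmi, Jmj, Kinv):
    # J_{sij/p} from Eq. (18): sensitivity of s_ij = 1/d_ij to pixel noise on the two points
    dij = np.sqrt(2.0 - 2.0 * float(Psi @ Psj))
    return np.hstack([Psj @ Jmi @ Kinv, Psi @ Jmj @ Kinv]) / dij ** 3

# Weight of the feature s_wij at the pose to be estimated (Section III-A2):
# weight_ij = 1.0 / np.linalg.norm(noise_jacobian_sij(Psi_star, Psj_star, Jmi, Jmj, np.linalg.inv(K)))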

3) Direct estimation of the rotation: We now present two direct methods to determine the rotation once the translation has been estimated.

a) Method based on an orthonormal basis computation: In [5] and [23], an angle-axis representation of a rotation matrix R computed from two points projected on the sphere has been considered to control rotational motions in visual servoing applications. The idea behind the rotation formula given in these works is equivalent to attaching an orthonormal frame basis to each camera pose using two points projected onto the unit sphere. More precisely, let Ps1 and Ps2 be two projected points

FIGURE 3. Definition of a vector basis from 2 projected points.

FIGURE 4. Definition of a vector basis from N projected points.

on the sphere for a pose 1 of the camera, and P*s1 and P*s2 their corresponding projected points for a camera pose 2. From Ps1 and Ps2, it is possible to define an orthonormal basis cRs = [v1; v2; v3] (see Figure 3) such that:

\mathbf{v}_1 = \mathbf{P}_{s_1}, \qquad (19)

\mathbf{v}_2 = \frac{\mathbf{P}_{s_2} - (\mathbf{P}_{s_2}^{\top}\mathbf{P}_{s_1})\mathbf{P}_{s_1}}{\|\mathbf{P}_{s_2} - (\mathbf{P}_{s_2}^{\top}\mathbf{P}_{s_1})\mathbf{P}_{s_1}\|}, \qquad (20)

\mathbf{v}_3 = \mathbf{v}_1 \times \mathbf{v}_2. \qquad (21)

and similarly an orthonormal basis c*Rs* using the points P*s1 and P*s2. If only a rotational motion is considered, the rotation matrix R between the two camera poses is determined by the matrix that transforms the vector basis cRs into c*Rs*:

\mathbf{R} = {}^{c*}\mathbf{R}_{s*}\,{}^{c}\mathbf{R}_{s}^{\top} \qquad (22)

For the sake of robustness, all projected points on the sphere should be used and not only two. In this work,


we propose a way to define the rotation matrix using all the points. The idea is based on the fact that the rotation matrix given by (22) can be obtained from two real projected points as well as from two virtual points rigidly attached to the set of projected points. Let us consider:

\mathbf{P}_{v_1} = \sum_{i=1}^{N} a_{1i}\mathbf{P}_{s_i}, \qquad \mathbf{P}_{v_2} = \sum_{i=1}^{N} a_{2i}\mathbf{P}_{s_i} \qquad (23)

two virtual points obtained by a linear combination of the real set of projected points on the sphere. Then, from Pv1 and Pv2, an orthonormal basis cRn can be defined as follows:

\mathbf{v}_{n1} = \frac{\mathbf{P}_{v_1}}{\|\mathbf{P}_{v_1}\|}, \qquad (24)

\mathbf{v}_{n2} = \frac{\mathbf{P}_{v_2} - (\mathbf{P}_{v_2}^{\top}\mathbf{v}_{n1})\mathbf{v}_{n1}}{\|\mathbf{P}_{v_2} - (\mathbf{P}_{v_2}^{\top}\mathbf{v}_{n1})\mathbf{v}_{n1}\|}, \qquad (25)

\mathbf{v}_{n3} = \mathbf{v}_{n1} \times \mathbf{v}_{n2}. \qquad (26)

Lemma 1: If only a rotational motion is considered, the rotation matrix R between the two camera poses is determined by the matrix that transforms the vector basis cRn into c*Rn*:

\mathbf{R} = {}^{c*}\mathbf{R}_{n*}\,{}^{c}\mathbf{R}_{n}^{\top} \qquad (27)

where cRn = [vn1; vn2; vn3] and c*Rn* = [v*n1; v*n2; v*n3] are computed using (23), (24), (25) and (26). The proof of this lemma is detailed in the appendix.

To define the vector bases cRn and c*Rn*, it is necessary to determine the parameters a1i and a2i in (23). More precisely, we have to define two virtual points P*v1 and P*v2 and then express them as linear combinations of the points P*si projected on the sphere and computed for the camera pose to be estimated. For the sake of simplicity, P*v1 and P*v2 are chosen to be unitary and perpendicular to each other. In that case, the basis [P*v1; P*v2; P*v1 × P*v2] is orthonormal.

In any given frame basis [P*v1; P*v2; P*v1 × P*v2], each point projected onto the sphere can be expressed as:

\mathbf{P}^{*}_{s_i} = b_{1i}\mathbf{P}^{*}_{v_1} + b_{2i}\mathbf{P}^{*}_{v_2} + b_{3i}\,\mathbf{P}^{*}_{v_1}\times\mathbf{P}^{*}_{v_2} \qquad (28)

Let B be the 3×N matrix that gives the coordinates of all the projected points in the new frame basis. We have:

\mathbf{P}^{*}_{st} = \mathbf{P}^{*}_{v}\,\mathbf{B}, \qquad (29)

where P*st = [P*s1 P*s2 ... P*sN], P*v = [P*v1 P*v2 P*v1 × P*v2], and

\mathbf{B} = \mathbf{P}^{*\top}_{v}\mathbf{P}^{*}_{st} \qquad (30)

In practice, a1i and a2i have to be chosen such that their corresponding virtual points are robust to noise. Our choice is based on characteristic features of the 3D structure: the center of gravity of the directions defined by the points, and the principal axis of those directions. More precisely, the first virtual point is defined by:

\mathbf{P}^{*}_{v_1} = \frac{1}{\left\| \sum_{i=1}^{N}\mathbf{P}^{*}_{s_i} \right\|}\sum_{i=1}^{N}\mathbf{P}^{*}_{s_i}, \qquad (31)

which corresponds to a1i = 1/‖Σ_{i=1}^{N} P*si‖. The second virtual point P*v2 is chosen as the unit vector perpendicular to P*v1 that lies in the plane defined by P*v1 and the major principal axis of the set of points projected on the sphere (see Fig. 4). The choice of the main principal axis as second axis allows having the majority of the points in its direction. Now that P*v1 and P*v2 have been determined, the matrix B can be computed using (30). From (28), it can be obtained that:

\mathbf{P}^{*}_{s_i} - b_{1i}\mathbf{P}^{*}_{v_1} = b_{2i}\mathbf{P}^{*}_{v_2} + b_{3i}\,\mathbf{P}^{*}_{v_1}\times\mathbf{P}^{*}_{v_2} \qquad (32)

By replacing P*v1 with Σ_{i=1}^{N} P*si / ‖Σ_{i=1}^{N} P*si‖, we obtain for all the points projected onto the sphere:

\mathbf{P}^{*}_{st}\,\mathbf{C} = \left[ \mathbf{P}^{*}_{v_2} \;\; \mathbf{P}^{*}_{v_1}\times\mathbf{P}^{*}_{v_2} \right]\mathbf{B}_{23} \qquad (33)

where B23 is composed of the last two rows of B, and C is an N×N matrix defined by clm = 1 − b1l/‖Σ_{i=1}^{N} P*si‖ if l = m and clm = −b1l/‖Σ_{i=1}^{N} P*si‖ for l ≠ m. By inverting (33), we obtain:

\left[ \mathbf{P}^{*}_{v_2} \;\; \mathbf{P}^{*}_{v_1}\times\mathbf{P}^{*}_{v_2} \right] = \mathbf{P}^{*}_{st}\,\mathbf{C}\,\mathbf{B}_{23}^{+}. \qquad (34)

The parameters a2i are then obtained as the first column of the matrix C B23⁺. Now that the coefficients a1i and a2i are defined, the vector basis for each camera pose can be obtained using (24), (25) and (26).
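A minimal sketch of the basis construction (24)-(26) and of the rotation recovery of Lemma 1 (Eq. (27)) is given below. It assumes that the combination coefficients a1 and a2 have already been determined as described above, and the function names are illustrative.

import numpy as np

def orthonormal_basis(Pv1, Pv2):
    # Build [vn1 | vn2 | vn3] from two virtual points, Eqs. (24)-(26)
    vn1 = Pv1 / np.linalg.norm(Pv1)
    vn2 = Pv2 - (Pv2 @ vn1) * vn1
    vn2 /= np.linalg.norm(vn2)
    vn3 = np.cross(vn1, vn2)
    return np.column_stack([vn1, vn2, vn3])

def rotation_from_virtual_points(Ps, Ps_star, a1, a2):
    # Ps, Ps_star: 3xN sets of projected points; a1, a2: coefficients of Eq. (23).
    # For a pure rotational motion, the returned matrix is the rotation between
    # the two camera poses (Lemma 1, Eq. (27)).
    cRn = orthonormal_basis(Ps @ a1, Ps @ a2)
    cRn_star = orthonormal_basis(Ps_star @ a1, Ps_star @ a2)
    return cRn_star @ cRn.T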

This way of obtaining the rotation matrix by building an orthonormal basis from a set of points has a dual application in practice: it allows a direct estimation of the rotation for the pose estimation problem, but it can also provide a rotation vector from a set of N points projected on the


sphere to control the rotational motions in image-based visual servoing. In the next paragraph, we recall another direct way of recovering the rotation matrix between two sets of projected points on the sphere.

b) Direct estimation of the rotation by solving an orthogonal Procrustes problem: The rotation matrix can also be obtained directly by solving an orthogonal Procrustes problem between the two sets of projected points on the sphere. We recall that the Orthogonal Procrustes problem is defined as the least-squares problem of transforming a given matrix F into a given matrix F′ by an orthogonal transformation R so that the sum of squares of the residual matrix E = RF − F′ is minimal [12]. Mathematically, this problem can be formulated as follows [19]:

\mathbf{R}\mathbf{F} = \mathbf{E} + \mathbf{F}', \qquad (35)

\mathbf{R}\mathbf{R}^{\top} = \mathbf{I}, \qquad (36)

\mathrm{tr}(\mathbf{E}^{\top}\mathbf{E}) \;\text{is minimal.} \qquad (37)

In our context, the matrices F and F′ are composed of the sets of all points projected onto the unit sphere, that is, Pst and P*st. The Orthogonal Procrustes Problem can be solved by computing the SVD decomposition of P*st P⊤st:

\mathbf{P}^{*}_{st}\mathbf{P}_{st}^{\top} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top} \qquad (38)

The rotation matrix between the two camera poses is then given by:

\mathbf{R} = \mathbf{U}\mathbf{V}^{\top} \qquad (39)
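A direct implementation of (38)-(39) takes only a few lines; the sketch below assumes NumPy and 3×N point sets, and the reflection guard at the end is a common numerical precaution that is not discussed in the text.

import numpy as np

def rotation_procrustes(Ps, Ps_star):
    # Orthogonal Procrustes solution of Eqs. (38)-(39) between two 3xN sets of sphere points
    U, _, Vt = np.linalg.svd(Ps_star @ Ps.T)
    R = U @ Vt
    if np.linalg.det(R) < 0:  # guard against a reflection (our addition, not from the paper)
        U[:, -1] *= -1
        R = U @ Vt
    return R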

4) Pose estimation algorithm: The pose estimation method is divided into two steps. First, we determine the translation between the initial pose and the pose to be estimated using the invariant to rotations as feature, as follows:

– Project the image points corresponding to the pose to be computed onto the sphere using (2).
– Compute the value of the feature vector st* for the pose to be estimated by stacking the features s*wij = (1/d*ij)(1/‖Js*ij/p*‖).
– Set the camera pose to its initial value:

{}^{c}\mathbf{M}_o = {}^{i}\mathbf{M}_o = \begin{bmatrix} {}^{i}\mathbf{R}_o & {}^{i}\mathbf{t}_o \\ \mathbf{0}_{1\times3} & 1 \end{bmatrix}

Minimization loop: while ‖st − st*‖ > ε, where ε is defined by the user:
– Project the 3D points of the object onto the unit sphere using the object model and the current value of the pose cMo.
– Compute the current value of the feature vector st corresponding to st* by stacking the features swij = (1/dij)(1/‖Js*ij/p*‖).
– Compute the Jacobian matrix Jst corresponding to st (Jst is an l × 3 matrix, where l is the number of distances between projected points on the sphere that are used).
– Compute the translational displacement ∆t = −λ J⁺st (st − st*), where λ is a scalar gain that tunes the convergence speed and J⁺st is the pseudo-inverse of Jst.
– Update cMo by adding the translational motion ∆t.

After the end of the minimization loop described above, the matrix cRi that defines the rotation between the initial camera pose (defined by iMo) and the camera pose to be computed can be obtained directly using one of the two methods presented in Section III-A3. The rotation between the object frame and the camera frame is then obtained by cRo = cRi iRo. This means that if the translational motion is well estimated using an invariant to rotations, the correct pose will be obtained. Let us recall that the iterative minimization process is nothing but an unconstrained minimization of a cost function of three unknowns only. Therefore, the convergence speed and rate are functions of the translation only.
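The sketch below summarizes the translation loop above using the analytical Jacobians of (6)-(7). It is only an illustration under simplifying assumptions: the weights wij stand for the terms 1/‖Js*ij/p*‖ and are assumed to be precomputed, the object points are assumed to be already expressed with the initial orientation applied, and all names are ours rather than from the paper.

import numpy as np

def features_and_jacobian(Pw, t, pairs, weights):
    # Stack s_wij = w_ij / d_ij and its l x 3 Jacobian with respect to the translation t.
    # Pw: 3xN object points with the initial orientation already applied (the features are
    # invariant to rotations, so only the translation is optimized here).
    Pc = Pw + t[:, None]
    norms = np.linalg.norm(Pc, axis=0)
    Ps = Pc / norms
    s, J = [], []
    for w, (i, j) in zip(weights, pairs):
        dij = np.sqrt(2.0 - 2.0 * float(Ps[:, i] @ Ps[:, j]))
        # Eq. (7): Jacobian of a sphere point with respect to the translation
        JPi = (-np.eye(3) + np.outer(Ps[:, i], Ps[:, i])) / norms[i]
        JPj = (-np.eye(3) + np.outer(Ps[:, j], Ps[:, j])) / norms[j]
        # Eq. (6): Jacobian of d_ij; with s_ij = 1/d_ij, J_s = -J_d / d_ij^2
        Jd = -(Ps[:, i] @ JPj + Ps[:, j] @ JPi) / dij
        s.append(w / dij)
        J.append(-w * Jd / dij ** 2)
    return np.array(s), np.vstack(J)

def estimate_translation(Pw, s_star, pairs, weights, t0, gain=1.0, eps=1e-8, max_iter=200):
    # Iterative loop of Section III-A4 over the three translation unknowns
    t = np.array(t0, dtype=float)
    for _ in range(max_iter):
        s, J = features_and_jacobian(Pw, t, pairs, weights)
        e = s - s_star
        if np.linalg.norm(e) <= eps:
            break
        t += -gain * np.linalg.pinv(J) @ e
    return t

Once the translation is found, the rotation follows directly from the rotation_from_virtual_points or rotation_procrustes sketches given earlier.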

IV. VALIDATION RESULTS

In this part, our pose estimation method is compared to two non-linear, iterative methods proposed respectively by [2] (method A in the following) and by [16] (method L in the following). Method L is a globally convergent algorithm that minimizes an error in object space: the error between the observation and the projection of the features using the model. On the other hand, method A minimizes an error defined in the image and improves the classical Lowe pose-estimation algorithm. A comparison of several iterative methods was made in [10] and showed that method A is the most accurate of the considered methods. When the pose is estimated using our method, the direct estimation of the rotation is obtained using the first method, given by (27), except when it is indicated that the Orthogonal Procrustes method has been used.

A. Results for pose tracking

In this paragraph, the ability of each method to track the pose of the camera with respect to a set of points in image sequences with abrupt motions is tested. A camera model with focal scaling factors Fx = Fy = 800 pixels/m and principal point coordinates ux = vx = 400 pixels has been used to compute the image points. For our method, the scalar gain λ has been set to 1.


The first sequence of 300 images is obtained using 9 non-coplanar points defined in the object frame by (refer also to Fig. 5.a):

\mathbf{X}_1 = \begin{bmatrix} 0.2 & -0.2 & -0.2 & 0.2 & 0 & 0 & 0.1 & -0.13 & 0.4 \\ 0.2 & -0.2 & 0.2 & -0.2 & 0 & 0.15 & 0.01 & 0 & 0.4 \\ 1.01 & 1.02 & 0.96 & 1.03 & 1 & 1 & 1 & 1.2 & 1.3 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix} \qquad (40)

White Gaussian noise with standard deviation equal to 0.5 has been added to the coordinates of each point in the image. Furthermore, the identity matrix has been used to initialize iMo for the first image of the sequence (the initial set of points is assumed to be in front of the camera, close to the optical axis and at 1 meter distance from the image plane). The pose computed for each image is used as the initialization to determine the pose for the following one using each method. The evolution of the real pose parameters of the camera with respect to the object frame is shown in Fig. 6. Let us consider the pose error defined by:

\mathbf{T}_e = \begin{bmatrix} \mathbf{R}_e & \mathbf{t}_e \\ \mathbf{0}_{1\times3} & 1 \end{bmatrix} = \mathbf{T}_r^{-1}\mathbf{T}_c, \qquad (41)

where Tr and Tc are respectively the real and the estimated poses. If the correct pose is obtained, Te is equal to the identity matrix (‖te‖ = 0 and Re = I3). Let θe be the norm of the rotation vector θe u corresponding to the rotation matrix Re (recall that θe u is linked to Re by Rodrigues' formula). The errors ‖te‖ and θe on the poses estimated using our method, method A and method L are shown respectively in Figs. 7, 8 and 9. From these plots, it can be seen that the values estimated using the three methods are similar and very close to the real ones. Furthermore, the errors on the estimated pose obtained using the three methods are similar.
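The error measures ‖te‖ and θe of (41) can be computed as in the short sketch below (NumPy, 4×4 homogeneous pose matrices; the function name is ours).

import numpy as np

def pose_error(T_real, T_est):
    # Eq. (41): T_e = T_r^{-1} T_c, returning (||t_e||, theta_e in degrees)
    Te = np.linalg.inv(T_real) @ T_est
    Re, te = Te[:3, :3], Te[:3, 3]
    # Rotation angle recovered from the trace of Re (angle-axis / Rodrigues relation)
    cos_theta = np.clip((np.trace(Re) - 1.0) / 2.0, -1.0, 1.0)
    return np.linalg.norm(te), np.degrees(np.arccos(cos_theta))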

The second image sequence is obtained using fewer points (5 non-coplanar points, refer to Fig. 5.b):

\mathbf{X}_2 = \begin{bmatrix} 0.2 & -0.2 & -0.2 & 0.2 & 0 \\ 0.2 & -0.2 & 0.2 & -0.2 & 0.4 \\ 1.01 & 1.01 & 0.95 & 1.03 & 1.1 \\ 1 & 1 & 1 & 1 & 1 \end{bmatrix} \qquad (42)

A stronger white Gaussian noise, with standard deviation equal to 2, has been added to the coordinates of each point. The results obtained using our method, method A and method L are shown respectively in Figs. 11, 12 and 13. From Fig. 11, it can be seen that the values of the pose estimated using our method closely follow the real ones. On the other hand, method A diverged and was not able to estimate the pose (refer to Fig. 12). Finally, as mentioned in [20], method L is affected by local minima. Indeed, from the plots it can be noticed that the pose switched several times to local minima (refer to Fig. 13).

B. Convergence for random poses

In this paragraph, we compare the convergence rate for random poses using our method, method L and method A. The following setup has been used:

– an object composed of 8 coplanar points, defined as follows, has been considered (refer also to Fig. 5.c):

\mathbf{X}_3 = \begin{bmatrix} -0.4 & 0.4 & -0.4 & 0.4 & 0.42 & -0.09 & 0.32 & -0.32 \\ -0.4 & -0.4 & 0.4 & 0.4 & -0.28 & 0.32 & 0 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix} \qquad (43)

– random poses have been generated as follows:
– 1000 random rotational motions are first applied to the point coordinates defined in the object frame. The rotation angles around the x-axis and the y-axis range from −π/2 to π/2, while the rotation angle around the optical axis ranges from 0 to 2π.
– for each generated rotation, a translational motion along the optical axis ranging from 1 meter to 4 meters is applied to the point coordinates defined in the object frame. Furthermore, the translational motions along the x-axis and the y-axis are chosen such that the point coordinates remain within the image limits [1 800; 1 800] pixels.

The error on the pose is calculated using ‖te‖ and θe computed from (41). Furthermore, for all methods, the identity matrix is used as the initial value of the pose matrix. Figures 14.a and 14.b give the distributions of ‖te‖ and θe using the three different methods and using perfect data (no noise on the point coordinates in the image). In other words, for each value of ‖te‖ and θe, the plots give the percentage of errors smaller than or equal to these values. From these figures, it can be seen that our method achieves a convergence rate of around 90%, while method L and method A achieve convergence rates of around 70% and 50% respectively. The cases of non-convergence to the global minimum using our method and method L are due to convergence to local minima. Conversely, when method A is used, the non-convergences to the global minimum are due both to divergence and to convergence to local minima.

Now, we test the convergence rate of the three methods using the same setup, but with 1 pixel Gaussian noise on the point coordinates in the image. In this experiment, the method of rotation estimation by solving the Orthogonal


Procrustes Problem (OPP) is also tested. The results obtained using each method are given in Fig. 15. From this figure, it can be noticed that the accuracy of all the pose estimation methods decreased. However, our iterative method gives a more accurate estimation of the poses. From Figure 15.b, it can be noticed that the rotation estimation obtained by solving the OPP gives slightly better results than the first method for estimating the rotation. On the other hand, as mentioned above, the direct method recovering the rotation by building an orthonormal basis from a set of projected points has a dual use, since it can also be used in image-based visual servoing [23], [5]. This does not seem possible using the formulation of the rotation as an OPP.

As shown above, the pose estimation can be performed as an iterative minimization without constraints over only three parameters, namely the translation parameters tx, ty and tz. This limits the space that has to be searched to find the global optimum. Since the method we propose allows a high convergence rate, this makes it possible to pre-initialize the iterative algorithm at random starting points. The low dimensionality of the space also permits spotting the local minima. Let us consider the following case, where the object point coordinates are defined by (44) and the translational motion to be computed is defined by the vector [0.7 0.4 2] m. Figure 16 shows the cost function ‖s*t − st‖ as a color level for 5 values of tz, with tx and ty ranging from −2 m to +2 m. From this figure, we can spot the positions of the local and global minima with respect to each other (the position of the minimal value of the cost function for each value of tz is marked by a cross in the images). We also note from Figures 16.d and 16.e that no local minima exist for tz > 2 m.

\mathbf{X}_4 = \begin{bmatrix} -0.4 & 0.4 & -0.4 & 0.4 & 0.5 \\ -0.4 & -0.4 & 0.4 & 0.4 & 0.6 \\ 1 & 1 & 1 & 1 & 1 \end{bmatrix} \qquad (44)

V. CONCLUSION AND FUTURE WORK

In this paper, we have proposed a new pose estimation method from a set of matched points, based on an invariant to rotations. The proposed method has been validated and compared to two different non-linear methods. The results obtained show that the proposed method achieves better pose tracking in image sequences, as well as a higher convergence rate, compared to the other methods considered. Future work will be devoted to extending this method to model-free pose estimation and also to camera calibration from a set of points.


FIGURE 5. Objects used for validation: a) object corresponding to X1, b) object corresponding to X2, c) planar object corresponding to X3.


FIGURE 6. Real values of the pose for image sequence 1 versus image number: left) translation vector entries in meters; right) rotation vector entries in degrees.


FIGURE 7. Error on the estimated pose parameters using our method for image sequence 1 versus image number: left) ‖te‖; right) θe.

APPENDIX A: PROOF OF LEMMA 1

If only a rotational motion is considered between the two camera poses, we have:



FIGURE 8. Error on the estimated pose parameters using method A for image sequence 1 versus image number: left) ‖te‖; right) θe.


FIGURE 9. Error on the estimated pose parameters using method L for image sequence 1 versus image number: left) ‖te‖; right) θe.


FIGURE 10. Real values of the pose for image sequence 2 versus image number: left) translation vector entries in meters; right) rotation vector entries in degrees.


FIGURE 11. Error on the estimated pose parameters using our method for image sequence 2 versus image number: left) ‖te‖; right) θe.

\mathbf{P}^{*}_{v_1} = \sum_{i=1}^{N} a_{1i}\mathbf{P}^{*}_{s_i} = \sum_{i=1}^{N} a_{1i}\,{}^{c*}\mathbf{R}_c\,\mathbf{P}_{s_i} = {}^{c*}\mathbf{R}_c\,\mathbf{P}_{v_1} \qquad (45)

\mathbf{P}^{*}_{v_2} = \sum_{i=1}^{N} a_{2i}\mathbf{P}^{*}_{s_i} = \sum_{i=1}^{N} a_{2i}\,{}^{c*}\mathbf{R}_c\,\mathbf{P}_{s_i} = {}^{c*}\mathbf{R}_c\,\mathbf{P}_{v_2} \qquad (46)


FIGURE 12. Error on the estimated pose parameters using method A for image sequence 2 versus image number: left) ‖te‖; right) θe.


FIGURE 13. Error on the estimated pose parameters using method L for image sequence 2 versus image number: left) ‖te‖; right) θe.


FIGURE 14. Percentage of convergence with perfect data: a) ‖te‖, b) θe.


FIGURE 15. Percentage of convergence with 1 pixel Gaussian noise on the image point coordinates: left) ‖te‖; right) θe.

By combining these equations with (24), we obtain:

\mathbf{v}^{*}_{n1} = \frac{{}^{c*}\mathbf{R}_c\,\mathbf{P}_{v_1}}{\sqrt{\mathbf{P}_{v_1}^{\top}\,{}^{c*}\mathbf{R}_c^{\top}\,{}^{c*}\mathbf{R}_c\,\mathbf{P}_{v_1}}} \qquad (47)


FIGURE 16. The cost function as a color level: (a) result for tz = 1.6 m, (b) result for tz = 1.8 m, (c) result for tz = 2 m, (d) result for tz = 2.2 m, (e) result for tz = 2.4 m.

from which we obtain:

\mathbf{v}^{*}_{n1} = {}^{c*}\mathbf{R}_c\,\mathbf{v}_{n1} \qquad (48)

Identically, it is possible to prove that:

\mathbf{v}^{*}_{n2} = {}^{c*}\mathbf{R}_c\,\mathbf{v}_{n2} \qquad (49)

Combining (48), (49) and (26) yields:

\mathbf{v}^{*}_{n3} = {}^{c*}\mathbf{R}_c\,\mathbf{v}_{n3} \qquad (50)

Therefore, by combining (48), (49) and (50), one obtains:

{}^{c*}\mathbf{R}_{n*} = {}^{c*}\mathbf{R}_c\,{}^{c}\mathbf{R}_n \qquad (51)

Finally, by multiplying both sides of the last equation by cR⊤n, we obtain:

{}^{c*}\mathbf{R}_{n*}\,{}^{c}\mathbf{R}_n^{\top} = {}^{c*}\mathbf{R}_c\,{}^{c}\mathbf{R}_n\,{}^{c}\mathbf{R}_n^{\top} = {}^{c*}\mathbf{R}_c \qquad (52)

ACKNOWLEDGMENT

We thank one of the anonymous reviewers for the suggestion about the Procrustes problem. This work was partly funded by grants from the French program "Investissement d'Avenir" managed by the National Research Agency (ANR), the European Commission (Auvergne FEDER funds) and the Région Auvergne in the framework of the LabEx IMobS3 (ANR-10-LABX-16-01), and by grants from PHC PESSOA 2011 (25116ND).

REFERENCES

[1] A. Ansar and K. Daniilidis. Linear Pose Estimation from Points or Lines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):578–589, 2003.
[2] H. Araujo, R. L. Carceroni, and C. M. Brown. A Fully Projective Formulation to Improve the Accuracy of Lowe's Pose-Estimation Algorithm. Computer Vision and Image Understanding, 70:227–238, 1998.
[3] J. Courbon, Y. Mezouar, L. Eck, and P. Martinet. A Generic Fisheye Camera Model for Robotic Applications. In IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS'07, pages 1683–1688, San Diego, CA, USA, 2007.
[4] J. Courbon, Y. Mezouar, and P. Martinet. Evaluation of the Unified Model on the Sphere for Fisheye Cameras in Robotic Applications. Advanced Robotics, 6(8):947–967, 2012.
[5] N. Cowan and D. E. Chang. Geometric Visual Servoing. IEEE Transactions on Robotics, 21(6):1128–1138, Dec. 2005.
[6] D. Dementhon and L. Davis. Model-Based Object Pose in 25 Lines of Code. International Journal of Computer Vision, 15(1-2):123–141, June 1995.
[7] M. Dhome, M. Richetin, J.-T. Lapreste, and G. Rives. Determination of the Attitude of 3D Objects from a Single Perspective View. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(12):1265–1278, Dec. 1989.
[8] P. Fiore. Efficient Linear Solution of Exterior Orientation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):140–148, Feb. 2001.
[9] C. Geyer and K. Daniilidis. Mirrors in Motion: Epipolar Geometry and Motion Estimation. International Journal of Computer Vision, 45(3):766–773, 2003.
[10] D. Grest, T. Petersen, and V. Krüger. A Comparison of Iterative 2D-3D Pose Estimation Methods for Real-Time Applications. Volume 5575 of Lecture Notes in Computer Science: Image Analysis, pages 706–715. Springer, 2009.
[11] T. Hamel and R. Mahony. Visual Servoing of an Under-Actuated Dynamic Rigid Body System: an Image-Based Approach. IEEE Transactions on Robotics and Automation, 18(2):187–198, April 2002.
[12] J. R. Hurley and R. B. Cattell. The Procrustes Program: Producing Direct Rotation to Test a Hypothesized Factor Structure. Systems Research and Behavioral Science, 7:258–262, 1962.
[13] F. Janabi-Sharifi and M. Marey. A Kalman-Filter-Based Method for Pose Estimation in Visual Servoing. IEEE Transactions on Robotics, 26(5):939–947, October 2010.
[14] V. Lepetit, F. Moreno-Noguer, and P. Fua. EPnP: An Accurate O(n) Solution to the PnP Problem. International Journal of Computer Vision, 12(2):155–166, December 2009.
[15] D. G. Lowe. Fitting Parameterized Three-Dimensional Models to Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:441–450, 1991.
[16] C.-P. Lu, G. Hager, and E. Mjolsness. Fast and Globally Convergent Pose Estimation from Video Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(6):610–622, June 2000.
[17] C. Mei and P. Rives. Single View Point Omnidirectional Camera Calibration from Planar Grids. In IEEE International Conference on Robotics and Automation, pages 3945–3950, April 2007.
[18] R. Safaee-Rad, I. Tchoukanov, K. Smith, and B. Benhabib. Three-Dimensional Location Estimation of Circular Features for Machine Vision. IEEE Transactions on Robotics and Automation, 8(5):624–640, Oct. 1992.
[19] P. H. Schönemann. A Generalized Solution of the Orthogonal Procrustes Problem. Psychometrika, 31(1):1–10, March 1966.
[20] G. Schweighofer and A. Pinz. Robust Pose Estimation from a Planar Target. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:2024–2030, 2006.
[21] A. Shademan and F. Janabi-Sharifi. Sensitivity Analysis of EKF and Iterated EKF Pose Estimation for Position-Based Visual Servoing. In IEEE International Conference on Control Applications, pages 755–760, Toronto, Canada, August 2005.
[22] O. Tahri and F. Chaumette. Complex Objects Pose Estimation Based on Image Moment Invariants. In IEEE Int. Conf. on Robotics and Automation, ICRA'05, pages 438–443, Barcelona, Spain, April 2005.
[23] R. Tatsambon Fomena, O. Tahri, and F. Chaumette. Distance-Based and Orientation-Based Visual Servoing from Three Points. IEEE Transactions on Robotics, 27(2):256–267, April 2011.

Omar Tahri was born in Fez, Morocco, in 1976. He received his Masters in photonics, images and system control from the Louis Pasteur University, Strasbourg, France, in 2000 and his Ph.D. degree in computer science from the University of Rennes, France, in March 2004. Since 2008, he has been working as a researcher with the computer vision lab (http://labvis.isr.uc.pt/) of the Institute for Systems and Robotics, Coimbra, Portugal (http://www.isr.uc.pt/). His research interests include mobile robotics, visual servoing and computer vision.

Helder Araujo is a Professor in the Department of Electrical and Computer Engineering of the University of Coimbra, Portugal, and also an Invited Professor at Blaise Pascal University, Clermont-Ferrand, France. He is a researcher at the Institute for Systems and Robotics-Coimbra. His research interests include computer and robot vision, robot navigation and localization, and sensor modeling for robot navigation and localization.

Youcef Mezouar was born in Paris, France, in 1973. He received the Ph.D. degree in computer science from the University of Rennes 1, Rennes, France, in 2001. He was a Postdoctoral Associate in the Robotics Laboratory of the Computer Science Department, Columbia University, New York, NY. Since 2002, he has been with the Robotics and Vision Group, Institut Pascal, Clermont-Ferrand, Aubiere, France. His current research interests include robotics, computer vision, vision-based control, and mobile robots.

François Chaumette (SM'10) graduated from École Nationale Supérieure de Mécanique, Nantes, France, in 1987. He received the Ph.D. degree in computer science from the University of Rennes, France, in 1990. He is currently the Head of the Lagadic Group (http://www.irisa.fr/lagadic) at INRIA, centre Rennes-Bretagne Atlantique and IRISA, Rennes. His research interests include robotics and computer vision, especially visual servoing and active perception.

Dr. Chaumette received the AFCET/CNRS Prize for the best French thesis in automatic control in 1991. He also received, with Ezio Malis, the 2002 King-Sun Fu Memorial Best IEEE Transactions on Robotics and Automation Paper Award. From 2001 to 2005, he was an Associate Editor of the IEEE Transactions on Robotics. He is currently on the Editorial Board of the International Journal of Robotics Research.