
Control Engineering

Klaas Jan Russcher

BSc assignment

Supervisors:

prof.dr.ir. S. Stramigioli

prof.dr.ir. A. de Boer

dr.ir. F. van der Heijden

September 2011

Report nr. 019CE2011

Control Engineering

EE-Math-CS

University of Twente

P.O.Box 217

7500 AE Enschede

The Netherlands


Contents

1 Introduction

2 Saliency Mapping
  2.1 Saliency in computer vision
  2.2 Saliency tests
  2.3 Conclusion

3 3D reconstruction
  3.1 Rectification
  3.2 LinMMSE estimation
  3.3 Optimal triangulation
  3.4 Parameter error analysis
      3.4.1 Monte Carlo analysis
      3.4.2 Parameter sweep
  3.5 Conclusion

4 3D estimation of a salient object
  4.1 NAO
  4.2 Obtain second image
  4.3 Coordinate system transformation
      4.3.1 Matching the coordinate systems
      4.3.2 Orientate coordinate system to camera 1
  4.4 3D estimation of a salient object
      4.4.1 Test setup
      4.4.2 Results

5 Conclusions and recommendations


Chapter 1

Introduction

Figure 1.1: Humanoid NAO.

In 2008 Rob Reilink developed a saliency algorithm for a humanoid head with stereo vision [7]. In this report an attempt is made to implement that saliency algorithm on the NAO humanoid (see figure 1.1). Because NAO does not have stereo vision, the 3D position of a salient object is estimated with 3D reconstruction.

In chapter 2 the theory of saliency mapping is explained. Saliency mapping is tested on some images to show under what conditions it works best. Chapter 3 treats the distance estimation; in this report the distance estimation is called 3D reconstruction. Three methods of 3D reconstruction are explained (rectification, linMMSE estimation and optimal triangulation). A Monte Carlo analysis and a parameter sweep are done to determine the least error-prone method. The saliency mapping and 3D reconstruction are combined in chapter 4. It is explained how the two images that are needed for 3D reconstruction are obtained. The coordinate systems of the saliency mapping, 3D reconstruction and NAO are matched and the algorithm is tested. Chapter 5 draws the conclusions and makes the recommendations.



Chapter 2

Saliency Mapping

Saliency mapping is a bottom-up visual attention system. In this system primal instincts determine the interesting points. The neurobiological model of saliency mapping was introduced by Koch and Ullman in 1985 [5]. In 1998 Itti, Koch and Niebur described how saliency mapping can be implemented in computer vision [4]. Section 2.1 will explain how saliency mapping in computer vision works. In section 2.2 saliency mapping will be tested on some images to show how well it works. The results of the test images will be discussed in the conclusion (section 2.3).

2.1 Saliency in computer vision

Itti, Koch and Niebur have derived an algorithm for saliency mapping in [4]. The general architecture of saliency mapping is displayed in figure 2.1. The saliency mapping algorithm will be explained using this figure. The first step in figure 2.1 is the linear filtering of the input image. It is only filtered for color and intensity. Filtering for orientation requires a filter that is not in the OpenCV vision library; creating this filter would cost too much time and is therefore beyond the scope of this project. Reilink showed in [7] that saliency mapping works well without the orientation filtering.

Figure 2.1: Architecture of saliency mapping [4].

The algorithm starts with a 640x480 RGB input image. This image is split into three new images: red (r), green (g) and blue (b). Intensity map I is created by element-wise adding these three images together and then dividing each element by three. The red, green and blue images are normalized by the intensity to decouple hue from intensity. Four color maps are created: red R = r − (g + b)/2, green G = g − (r + b)/2, blue B = b − (r + g)/2 and yellow Y = (r + g)/2 − |r − g|/2 − b. For each of the five maps a Gaussian pyramid of scale 8 is created (the source image is scale 0). These pyramids are used for the calculation of the center-surround differences (denoted by the symbol ⊖). The higher the scale of a Gaussian pyramid, the more smoothed an image is, which filters the small differences away. If an image at a low scale of the Gaussian pyramid (the center) is then compared with an image at a higher scale (the surround), the contrast of the center and its surround becomes visible.



For the intensity there are six maps I(c, s) created, with c = {2, 3, 4} and s = c + {3, 4}. The center-surround difference is the absolute difference between two scales of the Gaussian pyramid:

I(c, s) = |I(c) ⊖ I(s)|    (2.1)

For the color maps it is not just the center-surround difference of each color. Because the human eye is sensitive to color pairs, this sensitivity is simulated. The color pairs the human eye is sensitive to are green/red, red/green, blue/yellow and yellow/blue. The center-surround differences that are created for the color maps are RG(c, s) and BY(c, s):

RG(c, s) = |(R(c) − G(c)) ⊖ (G(s) − R(s))|    (2.2a)

BY(c, s) = |(B(c) − Y(c)) ⊖ (Y(s) − B(s))|    (2.2b)
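The decomposition just described can be sketched compactly with OpenCV and NumPy. The sketch below is illustrative only and is not the code used in this project; the function names, the pyramid depth handling and the clamping of the intensity before the hue normalization are assumptions.

# Illustrative sketch (not the project code) of the intensity/color maps,
# the Gaussian pyramids and the center-surround differences described above.
import cv2
import numpy as np

def feature_maps(bgr):
    b, g, r = [c.astype(np.float32) for c in cv2.split(bgr)]
    I = (r + g + b) / 3.0                      # intensity map
    norm = np.maximum(I, 1.0)                  # assumption: avoid division by zero
    r, g, b = r / norm, g / norm, b / norm     # decouple hue from intensity
    R = r - (g + b) / 2
    G = g - (r + b) / 2
    B = b - (r + g) / 2
    Y = (r + g) / 2 - np.abs(r - g) / 2 - b
    return I, R, G, B, Y

def gaussian_pyramid(img, scales=9):
    """Scale 0 is the source image, each next scale is down-sampled by 2."""
    pyr = [img]
    for _ in range(scales - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def center_surround(pyr, c, s):
    """|map(c) (-) map(s)|: the surround is rescaled to the center's resolution."""
    center = pyr[c]
    surround = cv2.resize(pyr[s], (center.shape[1], center.shape[0]))
    return np.abs(center - surround)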

All the center-surround differences are created now, but there is still a problem. When an image has a lot of high-intensity spots and just one weak red spot, the intensity spots will suppress the red spot. Therefore a map normalization operator (Λ) is created. This operator promotes maps with few peaks and suppresses maps with a lot of peaks. The map normalization operator consists of 3 steps (a code sketch follows the list):

1. Scale the map to a maximum M. In case of an 8-bit image, M will likely be 255.

2. Find all the local maxima of the map and calculate the average m over these maxima.

3. Scale the map with (M − m)^2 / M^2.
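A sketch of this operator, assuming a 2D floating-point map, is given below. The neighbourhood size used to detect the local maxima and the use of SciPy's maximum filter are assumptions; this is not the project code.

# Illustrative sketch of the map normalization operator (Lambda).
import numpy as np
from scipy.ndimage import maximum_filter

def normalize_map(m, M=255.0, neighborhood=15):
    # 1. Scale the map to the maximum M.
    m = m * (M / max(float(m.max()), 1e-9))
    # 2. Find the local maxima and average them.
    is_max = (maximum_filter(m, size=neighborhood) == m) & (m > 0)
    peaks = m[is_max]
    m_bar = float(peaks.mean()) if peaks.size else 0.0
    # 3. Promote maps with few peaks, suppress maps with many peaks.
    return m * (M - m_bar) ** 2 / M ** 2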

As shown in figure 2.1 there are now 12 color maps (6 RG and 6 BY) and 6 intensity maps. These maps are combined into two "conspicuity maps", one for the intensity I and one for the color C. The maps are scaled to scale 4 of the Gaussian pyramid and then a per-element addition results in a conspicuity map. The saliency map is constructed out of the two conspicuity maps:

S = (Λ(I) + Λ(C)) / 2    (2.3)

The final step in the saliency mapping algorithm that is used for this project is the selection of the most salient place. This is done by the winner-takes-all principle: the point on the saliency map with the highest value is the focus of attention.

The original saliency mapping algorithm also has an inhibition-of-return step (see figure 2.1). The inhibition of return suppresses the focus of attention for a couple of hundred milliseconds and prevents the algorithm from staring at the same object. This step is not used in this project because it could affect the goal of selecting the same object in two different images.

2.2 Saliency tests

Figure 2.2a is a test image of colored balloons in a bright blue sky. Figure 2.2b is the resulting saliency map. The two biggest balloons have approximately the same size; one is red and the other is blue. It is clear that the big red balloon attracts the most attention due to the contrast with the bright blue sky.


Figure 2.2: Balloon test image (a) and the resulting saliency map (b).



Figure 2.3a is an image of the Control Engineering lab. In the image the handle of a red screwdriver that is held by someone is visible. The red color of the screwdriver contrasts well with its background. This can be seen in the resulting saliency map (figure 2.3b).


Figure 2.3: Test image with red screwdriver (a) and the resulting saliency map (b).

The last test image is shown in figure 2.4a. It is also an image of the Control Engineering lab, but now without a salient object. The result for an image without a salient object is shown in figure 2.4b.


Figure 2.4: Test image of the Control Engineering lab (a) and the resulting saliency map (b).

2.3 Conclusion

Figure 2.2b shows that saliency mapping works for an image with big salient objects and a background that consists of a plain color. When the background gets more complicated and the salient object gets smaller, the saliency mapping tends to keep working well (figure 2.3b). When there are no salient objects the saliency mapping still tries to spot something, but the results are useless (figure 2.4b).
So the saliency mapping easily spots salient objects in scenes where the objects contrast well with their background. But when there are no contrasting objects the outcome of the saliency mapping is unpredictable. This is something to keep in mind: when the localization error is too large, it is also possible that the object does not contrast with its background well enough to be detected.



Chapter 3

3D reconstruction

In this chapter the subject of 3D reconstruction is treated. 3D reconstruction recovers the 3D coordinates of an object that is seen by two cameras at different positions. In a situation without noise it is not difficult to reconstruct the coordinates of an object. But most of the time the image lines will not cross each other in 3D space due to noise, so a more advanced method is needed to reconstruct an object's 3D coordinates. There are various methods to reconstruct a 3D point from two noisy image points. The methods that are covered here are the rectification method, the linMMSE estimation (both suggested by [9]) and the optimal triangulation method.
First the three methods of 3D reconstruction will be explained (sections 3.1 to 3.3). Then a parameter error analysis will be done to estimate which of the three reconstruction methods is the least prone to parameter errors (section 3.4). In the conclusion at the end of this chapter the results of the parameter error analysis will be discussed and the least error-prone reconstruction method will be chosen for use on NAO (section 3.5).

The most common symbols that are used in this chapter, with their meaning, are displayed in table 3.1.

Table 3.1: Definitions of parameters

Symbol               Definition
CCS                  Coordinate system attached to camera 1
CCS'                 Coordinate system attached to camera 2
x = [x, y]           Image coordinates of the point in image 1
x' = [x', y']        Image coordinates of the point in image 2
X = [X, Y, Z]        3D point in CCS space
X' = [X', Y', Z']    3D point in CCS' space
R                    Rotation matrix to convert CCS to CCS'
t                    Translation vector to translate CCS to CCS'
K                    Calibration matrix of camera 1
K'                   Calibration matrix of camera 2
P = K[I|0]           Projection matrix of camera 1
P' = K'[R|t]         Projection matrix of camera 2
D                    Focal distance

3.1 Rectification

The rectification method ([9], [2]) is the simplest of the three 3D reconstruction methods. It virtually manipulates the images in such a way that image 1 and image 2 get identical orientations. This means that a point of interest on an imaginary horizontal line in image 1 is also on that same imaginary horizontal line in image 2, only at another position along that line.



This makes the reconstruction of a 3D point much easier. The only difference between these two images is that image 2 is shifted a certain distance in the x-direction in comparison to image 1. When the images are rectified, triangulation [3] determines the 3D point X.

Set t' = −R^T t. The rotation matrix used to rotate both images has the form of formula (3.1a) with elements as in formula (3.1b).

R_rect = [r_1^T; r_2^T; r_3^T]    (3.1a)

r_1 = t' / |t'|,    r_2 = 1 / √(r_1(1)^2 + r_1(2)^2) · [−r_1(2), r_1(1), 0]^T,    r_3 = r_1 × r_2    (3.1b)

With R_rect calculated the images can be rectified (3.2).

x_rect = K R_rect K^{-1} x,    x'_rect = K' R_rect R K'^{-1} x'    (3.2)

The disparity is the difference between x_rect and x'_rect. It is used to determine the coordinates of the 3D point with the formulas in (3.3).

Z_rect = ‖t'‖ D / disparity,    X_rect = x_rect Z_rect / D,    Y_rect = y_rect Z_rect / D    (3.3)

This gives a point X_rect. But that point lies in the rectified coordinate system and not in the CCS coordinate system. X = R_rect^{-1} X_rect transfers X_rect back to CCS coordinates.
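A sketch of this reconstruction in NumPy is shown below; it assumes both cameras share the same calibration matrix K and mirrors formulas (3.1)-(3.3). Function names and details such as the homogeneous normalization are assumptions, and this is not the project's implementation.

# Illustrative sketch of the rectification reconstruction (3.1)-(3.3).
import numpy as np

def reconstruct_rectified(x1, x2, K, R, t, D):
    """x1, x2: pixel coordinates of the point in image 1/2; R, t: CCS -> CCS'."""
    t_p = -R.T @ t
    r1 = t_p / np.linalg.norm(t_p)
    r2 = np.array([-r1[1], r1[0], 0.0]) / np.hypot(r1[0], r1[1])
    r3 = np.cross(r1, r2)
    R_rect = np.vstack([r1, r2, r3])                        # (3.1)
    K_inv = np.linalg.inv(K)
    x1r = K @ R_rect @ K_inv @ np.append(x1, 1.0)           # (3.2)
    x2r = K @ R_rect @ R @ K_inv @ np.append(x2, 1.0)
    x1r, x2r = x1r / x1r[2], x2r / x2r[2]
    disparity = x1r[0] - x2r[0]
    Z = np.linalg.norm(t_p) * D / disparity                 # (3.3)
    X_rect = np.array([x1r[0] * Z / D, x1r[1] * Z / D, Z])
    return R_rect.T @ X_rect                                # back to CCS coordinates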

3.2 LinMMSE estimation

The reconstruction with the linear minimum mean squared error (linMMSE) estimation is described in [6] and [9]. It is derived from Kalman filtering, which is used to track a moving point. The linMMSE estimation is the update step of the Kalman filter. The advantage of this method is that it not only reconstructs a point X, but also constructs a covariance matrix. With the covariance matrix one is able to say something about the precision of the reconstructed point X.

The linMMSE estimation starts with the construction of a vector z and a matrix H (3.4). z_1 and z_2 are the positions of the point of attention in image 1 and image 2 respectively. H_1 and H_2 are the camera calibration matrices of camera 1 and camera 2, only without the last row.

z = [z_1; z_2],    H = [H_1; H_2]    (3.4)

The covariance matrices of point X and point X' are also needed for the linMMSE estimation. These can be calculated with formula (3.5) (I_2 is a 2x2 identity matrix). In this formula X is an initial estimate of where the point of attention could be; for this application it is the middle of the working space. σ is the uncertainty of point X and it is the distance of X to the edge of the working space. σ is used to construct C_x = σ^2 I_3. For the calculations the two covariance matrices are put in one covariance matrix (3.5).

C(X) = Z^2 σ^2 I_2,    C_n(X) = [C_1(X) 0; 0 C_2(X')]    (3.5)

The first estimate has three steps. First the innovation matrix S is calculated (3.6a). The second step is to calculate the linMMSE gain K_a (3.6b). The third step calculates the new X (3.6c); this is called the Kalman update.

S = H C_x H^T + C_n(X)    (3.6a)



K_a = C_x H^T S^{-1}    (3.6b)

X_first = X + K_a (z − H X)    (3.6c)

The second estimate has four steps. The first three steps (3.7) are basically the same as the first three steps of the first estimate. The last step calculates the new covariance matrix (3.8).

S = H C_x H^T + C_n(X_first)    (3.7a)

K_a = C_x H^T S^{-1}    (3.7b)

X_second = X + K_a (z − H X)    (3.7c)

C_x,new = C_x − K_a S K_a^T    (3.8)

The reconstructed X is X_second and the covariance matrix is C_x,new.
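The two estimation passes can be sketched as follows. The shapes of z and H and the way the measurement covariance C_n depends on the current estimate are taken from the description above; everything else (names, the function argument for C_n) is an assumption, not the project code.

# Illustrative sketch of the two-pass linMMSE update (3.4)-(3.8).
import numpy as np

def linmmse(z, H, X0, sigma, Cn_of):
    """z: stacked image measurements (3.4), H: stacked measurement matrix,
    X0: initial 3D estimate, sigma: its uncertainty,
    Cn_of(X): returns the measurement covariance (3.5) for a given estimate."""
    Cx = sigma ** 2 * np.eye(3)
    X = X0
    for _ in range(2):                          # first and second estimate
        S = H @ Cx @ H.T + Cn_of(X)             # innovation matrix (3.6a)/(3.7a)
        Ka = Cx @ H.T @ np.linalg.inv(S)        # linMMSE gain (3.6b)/(3.7b)
        X = X0 + Ka @ (z - H @ X0)              # Kalman update (3.6c)/(3.7c)
    Cx_new = Cx - Ka @ S @ Ka.T                 # new covariance (3.8)
    return X, Cx_new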

3.3 Optimal triangulation

The method of optimal triangulation described here is the one that is proposed in [3]. The method is based on maximum-likelihood estimation (MLE). Therefore the assumption is made that the noise that causes the uncertainty of the 3D point has a Gaussian distribution.
This method uses the obtained corresponding points of the two images (x and x' in homogeneous coordinates) and the fundamental matrix F. The fundamental matrix relates points in image 1 to corresponding points in image 2 so that x'^T · F · x = 0. The fundamental matrix is calculated from the camera matrices of both cameras (K and K'), the rotation matrix R and the translation vector t, as displayed in formula (3.9).

F = K'^{-T} · [t]_x · R · K^{-1}    (3.9)

[t]_x is a skew-symmetric matrix that is constructed from the translation vector t.
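As a sketch, formula (3.9) translates directly into a few lines of NumPy (names are generic and not from the project code):

# Illustrative sketch of the fundamental matrix (3.9) and the epipolar lines.
import numpy as np

def skew(t):
    """[t]_x: skew-symmetric matrix of the translation vector t."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K1, K2, R, t):
    return np.linalg.inv(K2).T @ skew(t) @ R @ np.linalg.inv(K1)

# l' = F x gives the epipolar line in image 2, l = F^T x' the line in image 1,
# with x and x' in homogeneous pixel coordinates.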

The dot product of F and x gives an epipolar line l' in image 2, and the dot product of the transposed F and x' gives an epipolar line l in image 1. When there are no errors, point x is positioned on epipolar line l and point x' is positioned on line l'. In practice this is never the case.

The optimal triangulation method minimizes a cost function. The cost function is the sum of the squared perpendicular distance between the image point x and the epipolar line l in image 1 and the squared perpendicular distance between the image point x' and the epipolar line l' in image 2.

In [3] the optimal triangulation method is presented as an 11-step algorithm. Here the same structure will be used to explain the algorithm. The first 5 steps are used to put x, x' and F in such a form that the cost function can be easily evaluated.

1. The first step is to define two matrices V and V' (3.10). These matrices move the points x and x' to the origin of their coordinate systems. x, y, x' and y' are the coordinates of the points x and x'. After these transformations x and x' are both at [0, 0, 1]^T of the new coordinate system.

V = [1 0 −x; 0 1 −y; 0 0 1],    V' = [1 0 −x'; 0 1 −y'; 0 0 1]    (3.10)

2. Replace the fundamental matrix to make it compatible with the new coordinate system (3.11).

F = V'^{-T} · F · V^{-1}    (3.11)

3. The left and right epipoles are calculated from F. The right epipole e is the null space of F and the left epipole e' is the null space of the transposed F. The epipoles have to be normalized so that e_1^2 + e_2^2 = 1 and e'_1^2 + e'_2^2 = 1, and each epipole is multiplied by the sign of its last element.



4. Now form the rotation matrices W and W'.

W = [e_1 e_2 0; −e_2 e_1 0; 0 0 1],    W' = [e'_1 e'_2 0; −e'_2 e'_1 0; 0 0 1]    (3.12)

5. Replace the fundamental matrix to adjust it to the new coordinate system.

F = W'^{-T} · F · W^{-1}    (3.13)

The cost function is given by formula (3.14). Steps 6 to 9 evaluate the cost function and thereby determine the image points that give the smallest error.

S(t) = t^2 / (1 + f^2 t^2) + (ct + d)^2 / ((at + b)^2 + f'^2 (ct + d)^2)    (3.14)

6. Step 5 results in a fundamental matrix of the form (3.15). a, b, c and d are determined by this matrix. f and f' are e_3 and e'_3 respectively.

F = [f f' d, −f' c, −f' d; −f b, a, b; −f d, c, d]    (3.15)

7. g(t) (3.16) is the numerator of the derivative of S(t) (3.14). Fill in the variables a, b, c, d, f and f' in g(t) as a polynomial in t. Solve g(t) = 0 to get the roots.

g(t) = t((at + b)^2 + f'^2 (ct + d)^2)^2 − (ad − bc)(1 + f^2 t^2)^2 (at + b)(ct + d)    (3.16)

8. Evaluate the cost function S(t) at each of the roots of g(t). If there are complex roots, only evaluate the real part of those roots. Also evaluate S(∞) (3.17). Select the t for which the cost function has the smallest value as t_min.

S(∞) = 1 / f^2 + c^2 / (a^2 + f'^2 c^2)    (3.17)

9. Construct the two epipolar lines l = [t_min f, 1, −t_min]^T and l' = [−f'(c t_min + d), a t_min + b, c t_min + d]^T. The point on an epipolar line l = [λ, µ, ν]^T that is closest to the origin is x = [−λν, −µν, λ^2 + µ^2]^T. Construct with this formula the corrected points x̂ and x̂'.

The last two steps transform the image points back to the original coordinate system and point X isdetermined.

10. Replace x and x' with x = V^{-1} W^T x̂ and x' = V'^{-1} W'^T x̂' respectively, using the matrices V, V' from step 1 and W, W' from step 4.

11. To complete the optimal triangulation method, point X has to be calculated. This is done with the use of matrix A (3.18), in which P^{i,T} denotes the i-th row of the projection matrix P. Take the singular value decomposition A = U D V^T. X is the last column of V. X is a homogeneous vector; divide X by its last element so that the last element becomes one.

A = [x · P^{3,T} − P^{1,T};  y · P^{3,T} − P^{2,T};  x' · P'^{3,T} − P'^{1,T};  y' · P'^{3,T} − P'^{2,T}]    (3.18)
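As an alternative to coding the eleven steps by hand, OpenCV provides the same optimal point correction (cv2.correctMatches) followed by a DLT triangulation (cv2.triangulatePoints). The sketch below assumes the fundamental matrix of (3.9) and the projection matrices P = K[I|0] and P' = K'[R|t]; it is not the project's implementation.

# Illustrative sketch: optimal triangulation with OpenCV's building blocks.
import cv2
import numpy as np

def reconstruct_optimal(x1, x2, F, P1, P2):
    pts1 = np.asarray(x1, dtype=np.float64).reshape(1, 1, 2)
    pts2 = np.asarray(x2, dtype=np.float64).reshape(1, 1, 2)
    # Steps 1-10: move both points onto corresponding epipolar lines,
    # minimizing the reprojection cost (3.14).
    c1, c2 = cv2.correctMatches(F, pts1, pts2)
    # Step 11: homogeneous DLT triangulation (the SVD of matrix A in (3.18)).
    Xh = cv2.triangulatePoints(P1, P2, c1.reshape(2, 1), c2.reshape(2, 1))
    return (Xh[:3] / Xh[3]).ravel()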



3.4 Parameter error analysis

As said in this chapter's introduction, the image points will be affected by noise. The rotation matrix R and the translation vector t describe the shift from the camera position in the initial robot configuration to the camera position after the robot has moved its head to another configuration. This rotation matrix and translation vector have uncertainties due to the errors in NAO's sensors. The image points x and x' have uncertainties because the images are built up out of pixels; therefore it is not possible to get the exact image point. And the calibration matrices K and K' have uncertainties caused by the calibration process.
For the 3D reconstruction the most robust method is wanted. The most robust method is the method that is the least sensitive to the parameter errors. Two analysis methods are used to determine the sensitivity of the 3D reconstruction methods to errors. The analysis methods treat each parameter separately: they analyze the error in the rotation matrix R first, then the error in the translation vector t, and so on. In this way it can be made clear for which errors the 3D reconstruction methods are the least sensitive.
The influence of five parameter errors has been analyzed. These five parameters are the focal distance of the camera (D), the camera center (cc) (these two parameters are in the camera calibration matrix K), the translation vector (t), the rotation matrix (R) and the image points. The uncertainties of the focal distance, camera center and image points are determined by the Camera Calibration Toolbox for Matlab. The uncertainties of the translation vector and the rotation matrix are estimated by tests performed on the NAO robot. For these tests the command is given to NAO to change the yaw and pitch angles of the head by 0.5 rad. Then the angle changes of the yaw and pitch are measured with NAO's internal sensors. The error is the measured angle change minus 0.5 rad. This test was done several times and resulted in an uncertainty for the rotation angle and the translation vector (see table 3.2).
The uncertainties that are used in the simulations are two times the measured standard deviations, which ensures a 95% certainty that the parameter value will be within the uncertainty region (the uncertainties of the parameters have a Gaussian distribution). All the parameter values with their uncertainties are displayed in table 3.2. The uncertainties for the camera center and the translation vector are equal for each element. The value for the image point is not given, because it has different values for image 1 and image 2 and it also depends on which of the five 3D points is used.

Table 3.2: NAO’s parameters and uncertainties

Parameter                   Value           Uncertainty (σ)   2 × σ
Focal distance (mm)         752             2                 4
Camera center (mm)          [331 253]^T     1.6               3.2
Image point (pixels)        -               0.12              0.24
Rotation angle (radian)     0.5             0.01              0.02
Translation vector (mm)     [28 29 −24]^T   1.3               2.6

For the simulation five 3D points were used (3.19); all distances are in millimeters. These points were transformed to image coordinates with the projection matrices without errors. The parameter errors were only applied in the 3D reconstruction methods.

X_{1-5} = [  0    0    0    40   250
             0    0   −25   40     0
           100  300  300   750  1500 ]    (3.19)

3.4.1 Monte Carlo analysis

The Monte Carlo method is suggested by [9] and described in [1]. The Monte Carlo method simulates the 3D reconstruction methods N times and each time it takes a random value within the uncertainty of the parameter. Each of the N simulations gives a 3D point X_est. This X_est is subtracted from the real 3D point X. The result is an error vector with errors in the x-, y- and z-direction. The systematic error is then calculated by taking the mean of each element of the error vector over the N simulations. The random error is calculated by taking the standard deviation of each element of the error vector over the N simulations. The RMS absolute error is the root mean square of these two errors.
The Monte Carlo analysis was done with N = 10,000. The results are displayed in table 3.3; the values in the table are worst-case scenarios. The uncertainties of the camera center and image point are given in pixel distance, the rotation angle in degrees, and the rest of the parameters and the errors are given in millimeters.
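A sketch of this procedure is given below for a single parameter (the focal distance); the other parameters are treated the same way. The reconstruct callable stands for any of the three methods, and the Gaussian sampling of the perturbation is an assumption; this is not the project code.

# Illustrative sketch of the Monte Carlo error analysis for one parameter.
import numpy as np

def monte_carlo_focal(reconstruct, X_true, x1, x2, params, sigma_D, N=10000):
    rng = np.random.default_rng()
    errors = np.zeros((N, 3))
    for i in range(N):
        p = dict(params)
        p["D"] = params["D"] + rng.normal(0.0, sigma_D)   # perturbed focal distance
        errors[i] = reconstruct(x1, x2, **p) - X_true     # error vector (x, y, z)
    systematic = errors.mean(axis=0)                      # systematic error
    random_err = errors.std(axis=0)                       # random error
    rms_abs = np.sqrt(systematic ** 2 + random_err ** 2)  # RMS absolute error
    return systematic, random_err, rms_abs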

Table 3.3: Results Monte Carlo analysis (absolute error)

Uncertainty                        RMS linMMSE estimation   RMS rectification   RMS optimal triangulation
                                   absolute error           absolute error      absolute error
Focal distance σ_D = 4 mm          80.35                    231.22              0.26
Camera center σ_cc = 3.2           472.15                   7,194.32            2.15
Rotation angle σ_α = 1.15°         780.91                   38,044.77           15.89
Translation vector σ_t = 2.6 mm    131.75                   145.51              19.01
Image point σ_x = 0.24             55.38                    72.08               29.61

3.4.2 Parameter sweep

The parameter sweep also simulates the 3D reconstruction method N times, but now it sweeps through the uncertainty of the parameter: it starts at −Δmax and stops at +Δmax. The error vector is calculated and the norm of the error vector is taken. If there are several 3D simulation points, the average of the error vector norms is taken. The final step is to plot the result of the N simulations against the parameter error.
The results of this analysis are plotted in figure 3.1. The errors on the x-axis are the deviations from the real parameter value.
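A corresponding sketch of the sweep is shown below; here project stands for the error-free projection of a 3D point to the two images and reconstruct for a reconstruction method that accepts the swept parameter error. Both names are placeholders, not project code.

# Illustrative sketch of the parameter sweep over one parameter error.
import numpy as np

def parameter_sweep(reconstruct, project, points_3d, delta_max, N=201):
    deltas = np.linspace(-delta_max, delta_max, N)
    curve = np.zeros(N)
    for i, d in enumerate(deltas):
        norms = []
        for X in points_3d:
            x1, x2 = project(X)                          # error-free image points
            X_est = reconstruct(x1, x2, param_error=d)   # method with the swept error
            norms.append(np.linalg.norm(X_est - X))      # norm of the error vector
        curve[i] = np.mean(norms)                        # average over the 3D points
    return deltas, curve                                 # plot curve vs. deltas (figure 3.1)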

3.5 Conclusion

The three 3D reconstruction methods were analyzed in the previous section (3.4) for their sensitivity to parameter errors. The Monte Carlo method and the parameter sweep give a clear picture of which reconstruction method is the least error prone. Both figure 3.1 and table 3.3 show that the optimal triangulation method is by far the least error-prone 3D reconstruction method. Only for the image location error do the rectification method and the linMMSE estimation come near the values of the optimal triangulation method.
A reason for the large errors in the rectification method and the linMMSE estimation is the small translation vector t of NAO: it is in the order of a few centimeters. This makes these two methods sensitive to parameter errors. Testing shows that making the translation vector for instance 10 times larger reduces the errors in both methods significantly, whereas the error in the optimal triangulation method stays the same.
Another reason for the large errors in the linMMSE estimation is the initial estimate of the point of attention and the uncertainty (σ) of that point. Because there is no idea where the point of attention is, the middle of the working space is chosen with an uncertainty that covers approximately the whole working space. This gives an initial guess that is most likely not close to the point of attention, which causes a large error in the linMMSE estimation.

Based on the results in figure 3.1 and table 3.3, the optimal triangulation method is chosen as the method for the 3D reconstruction because it gives the best results.



[Figure 3.1 consists of five parameter sweep plots, each showing |error| (mm) for the rectification, linMMSE estimation and optimal triangulation methods against: (a) the focal distance error (mm), (b) the camera center error (mm), (c) the rotation angle error (rad), (d) the translational vector error (mm) and (e) the image location error (mm).]

Figure 3.1: Results of the parameter sweep.



Chapter 4

3D estimation of a salient object

This chapter presents the whole algorithm for the 3D estimation of a salient object (section 4.4). The main parts have already been treated in chapter 2 (saliency mapping) and in section 3.3 (optimal triangulation method). Section 4.1 presents the robot that is used. Section 4.2 shows the conditions under which the second image is captured. The transformation of the coordinate system is explained in section 4.3.

4.1 NAO

The humanoid used for this project is NAO (figure 1.1). NAO is manufactured by the French company Aldebaran Robotics. It has a height of 58 cm and weighs 4.3 kg [8]. The humanoid's equipment includes two CMOS digital cameras. The fields of view of these two cameras do not overlap; therefore taking an image with both cameras at the same time cannot be used for 3D reconstruction. NAO also has 32 Hall effect sensors, a two-axis gyrometer and a three-axis accelerometer. These sensors are used to determine the position of NAO.
NAO is supported by the NaoQi API framework. This framework enables the user to let NAO walk and move easily. It is also used for getting the position of the cameras and the pitch and yaw angles of NAO's head.

4.2 Obtain second image

When image 1 is obtained it does not matter in which position NAO is or when it is obtained. That is different for image 2. Image 2 has to be obtained as fast as possible after image 1 because, as time elapses, NAO's surroundings will most likely change. The field of view at camera position 2 has to overlap the field of view at camera position 1; therefore the position of NAO does matter when image 2 is obtained. The fields of view have to overlap, otherwise a 3D reconstruction cannot be made.

When a salient object is spotted in image 1, the yaw and pitch angles of NAO's head are changed so that the salient object moves to the image center. The maximum angle changes are the angle changes that would move an object from the border of the image to the center.
The salient object in image 1 is represented by image coordinates. The coordinates of the image center are subtracted from the salient object coordinates and then divided by the image center coordinates. This gives the relative image coordinates. These relative image coordinates are multiplied by the maximum angle changes. The result of these angle changes is that the object will appear in the center of the image.

There are two conditions which the angle changes have to meet (a sketch of this rule is given below). The first condition is that the angle changes have to be at least 0.1 radian. This condition is needed because the angles must also change when the salient object is close to the center of the image. The second condition is that the angle changes have to result in an angle that is in the range of the yaw and pitch of NAO's head. When this condition is not met, the angle change is reduced until it is in the yaw or pitch range. When the angle change then gets smaller than 0.1 radian, the angle change becomes 0.1 radian in the opposite direction.
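The sketch below illustrates this rule. The image size, the maximum angle changes and NAO's joint limits used here are rough assumptions for illustration; the actual values would come from the camera's field of view and from NAO's documentation.

# Illustrative sketch of the head-motion rule used to obtain image 2.
import numpy as np

def head_angle_changes(salient_px, current, img_size=(640, 480),
                       max_change=(0.6, 0.45),                  # assumed max changes (rad)
                       limits=((-2.08, 2.08), (-0.67, 0.51))):  # assumed yaw/pitch range
    cx, cy = img_size[0] / 2.0, img_size[1] / 2.0
    rel = ((salient_px[0] - cx) / cx, (salient_px[1] - cy) / cy)  # relative coordinates
    changes = []
    for rel_c, max_c, cur, (lo, hi) in zip(rel, max_change, current, limits):
        d = rel_c * max_c
        if abs(d) < 0.1:                       # condition 1: at least 0.1 rad
            d = np.copysign(0.1, d if d != 0 else 1.0)
        d = np.clip(cur + d, lo, hi) - cur     # condition 2: stay within the joint range
        if abs(d) < 0.1:                       # reduced below the minimum: flip direction
            d = -np.copysign(0.1, d)
        changes.append(float(d))
    return changes                             # [delta_yaw, delta_pitch]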



4.3 Coordinate system transformation

The position of the camera that is acquired from NAO is transformed twice. First it is rotated to match the orientation used in the saliency mapping and in the camera calibration. Then it is transformed so that the camera position of image 1 is at the origin of the coordinate system. This is necessary for the calculation of the rotation matrix and the translation vector.

4.3.1 Matching the coordinate systems

For the saliency mapping and the calibration of the camera a 2D coordinate system is used. The origin of those coordinate systems is at the upper left corner of the images; x is the width and y is the height of the image. For a Cartesian coordinate system, z would then be the depth of the image.
The orientation of NAO's coordinate system is as shown in figure 4.1.

Figure 4.1: NAO’s coordinate system [8].

To rotate NAO's coordinate system to the desired orientation, it is first rotated by −90° around the x-axis and then by −90° around the z-axis. The resulting rotation matrix is shown in formula (4.1).

R_xz = [1 0 0; 0 cos(−90°) sin(−90°); 0 −sin(−90°) cos(−90°)] · [cos(−90°) sin(−90°) 0; −sin(−90°) cos(−90°) 0; 0 0 1] = [0 −1 0; 0 0 −1; 1 0 0]    (4.1)

4.3.2 Orientate coordinate system to camera 1

Figure 4.2: Transform NAO.

The rotation matrix and translation vector that are needed for the optimal triangulation method are based on a coordinate system with its origin at the camera of image 1. The coordinate system is orientated in such a way that the x-axis is along the width of the image and the y-axis along the height of the image. Figure 4.2 shows a schematic drawing of NAO's head and torso. Originally the origin of the coordinate system is somewhere at the center of the torso; it has to be moved to the camera in the head of NAO.



To transform the coordinate system to the camera, four transformation matrices are needed. First the coordinate system is translated to the center of the head joint. Then the coordinate system is rotated twice, once for the yaw and once for the pitch. The last transformation is the translation from the head joint to the camera. Equation (4.2) shows these four transformation matrices. The yaw angle is φ and the pitch angle is ϕ.

[1 0 0 0; 0 1 0 −dstJ; 0 0 1 0; 0 0 0 1] · [1 0 0 0; 0 cos(φ) −sin(φ) 0; 0 sin(φ) cos(φ) 0; 0 0 0 1] · [cos(ϕ) 0 sin(ϕ) 0; 0 1 0 0; −sin(ϕ) 0 cos(ϕ) 0; 0 0 0 1] · [1 0 0 −dstX; 0 1 0 −dstY; 0 0 1 −dstZ; 0 0 0 1]    (4.2)
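A sketch of this chain as 4x4 homogeneous matrices is shown below. The offsets dstJ, dstX, dstY and dstZ and the angle conventions follow formula (4.2); the function name and the NumPy formulation are my own, not the project code.

# Illustrative sketch of the transformation chain (4.2).
import numpy as np

def torso_to_camera(yaw, pitch, dstJ, dstX, dstY, dstZ):
    def translation(x, y, z):
        T = np.eye(4)
        T[:3, 3] = [x, y, z]
        return T
    R_yaw = np.eye(4)                          # rotation for the yaw angle (phi)
    R_yaw[1:3, 1:3] = [[np.cos(yaw), -np.sin(yaw)],
                       [np.sin(yaw),  np.cos(yaw)]]
    R_pitch = np.eye(4)                        # rotation for the pitch angle
    R_pitch[0, 0] = R_pitch[2, 2] = np.cos(pitch)
    R_pitch[0, 2] = np.sin(pitch)
    R_pitch[2, 0] = -np.sin(pitch)
    # translate to the head joint, rotate for yaw and pitch, translate to the camera
    return translation(0, -dstJ, 0) @ R_yaw @ R_pitch @ translation(-dstX, -dstY, -dstZ)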

4.4 3D estimation of a salient object

All the individual parts of the 3D estimation of a salient object have been explained in the previous chapters and sections. The complete algorithm is constructed from these parts. It is represented as a six-step algorithm (a code sketch follows the list):

1. Take an image with NAO's camera. This image will be called image 1. Also record the angles of NAO's head joint.

2. Create the saliency map for image 1 (see chapter 2). It returns the most salient position (x1) inthe image.

3. Turn NAO's head towards the salient object and take a second image (see section 4.2). This image will be called image 2. Here too, record the angles of NAO's head joint.

4. Create the saliency map for image 2. It returns the most salient position (x2) in the image.

5. Perform the coordinate transformation (see section 4.3). Calculate the rotation matrix and thetranslational vector.

6. The two image coordinates of the salient object are now known (x1 and x2). Also the rotation matrix and the translation vector are known. These are the four parameters for the optimal triangulation method. Now reconstruct the 3D point with the optimal triangulation method (see section 3.3).
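Put together, the six steps read roughly as the sketch below. The helper names (get_camera_image, get_head_angles, move_head, saliency_peak, camera_motion) are placeholders for the routines described in this report, not actual NaoQi API calls; fundamental_matrix, head_angle_changes and reconstruct_optimal refer to the earlier sketches.

# Illustrative sketch of the complete six-step algorithm (placeholder helpers).
import numpy as np

def estimate_salient_3d(nao, K):
    img1 = nao.get_camera_image()                    # step 1: take image 1
    angles1 = nao.get_head_angles()                  #         and the head joint angles
    x1 = saliency_peak(img1)                         # step 2: most salient point in image 1
    nao.move_head(*head_angle_changes(x1, angles1))  # step 3: turn towards the object
    img2 = nao.get_camera_image()                    #         and take image 2
    angles2 = nao.get_head_angles()
    x2 = saliency_peak(img2)                         # step 4: most salient point in image 2
    R, t = camera_motion(angles1, angles2)           # step 5: coordinate transformation
    F = fundamental_matrix(K, K, R, t)               # step 6: optimal triangulation
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t.reshape(3, 1)])
    return reconstruct_optimal(x1, x2, F, P1, P2)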

4.4.1 Test setup

For the test setup NAO is put on the floor, facing a white wall. A bright red object is taped to the white wall. The distance between NAO and the wall is 30 cm and the red object is at the camera center when image 1 is taken. The reason for this is that the optimal triangulation method puts the origin of the coordinate system at the center of image 1. When the red object is at the center of image 1, the x- and y-values of the reconstructed 3D point should be approximately zero and the z-value should be the distance from the camera to the red object. These conditions ensure that it can easily be determined whether the 3D estimation of a salient object works.

4.4.2 Results

Figure 4.3 shows image 1 and image 2 that were taken for the first test.




Figure 4.3: Test 1: image 1 (a) and image 2 (b).

The saliency maps that belong to the images of the first test are shown in figure 4.4.


Figure 4.4: Test 1: saliency maps of image 1 (a) and image 2 (b).

The reconstructed 3D point for the first test is [0.12,−5.4,−229]T .

For the second test the red object is placed in a corner of the camera image. This ensures that the angle changes are relatively large. Figure 4.5 shows the captured images 1 and 2.


Figure 4.5: Test 2: image 1 (a) and image 2 (b).

The resulting saliency maps for the second test are shown in figure 4.6.




Figure 4.6: Test 2: saliency maps of image 1 (a) and image 2 (b).

The reconstructed 3D point for the second test is [43,−34, 1271]T .



Chapter 5

Conclusions and recommendations

The reconstructed 3D point of test 1 is a point that is approximately 23 cm behind NAO. The second reconstructed 3D point is 1.3 m in front of NAO. This bears no relation to the 30 cm in front of NAO that it should be. In fact, numerous tests were done and each reconstructed 3D point was completely different; only in a few cases was it near the 30 cm in front of NAO. The 3D estimation of a salient object has been analyzed and there are two possible error sources for why it does not work.

As can be seen in figure 4.4 and figure 4.6 there can be no misunderstanding about at which image point the red object is. The salient places have the size of a few pixels, but the pixel error should not be the problem (see section 3.5). So the problem is not in the saliency mapping part.
For the optimal triangulation method the translation vector from camera position 1 to camera position 2 is needed. Camera position 1 is taken here as the origin of the coordinate system and should therefore have the coordinates [0, 0, 0]^T. However, for the first test these coordinates are [3.7, −1.1, −4.6]^T and for the second test [−21.4, 19.1, 33.2]^T. This indicates that there is something wrong with the coordinate system transformations (section 4.3).

The first error source could be the distance from the original coordinate system to the head joint and the distance from the head joint to the camera (see figure 4.2). If these distances have a significant error, the transformation will not result in a coordinate system with the origin at NAO's camera.
The second error source could be the error in the yaw and pitch angles of NAO's head. If the command is given to NAO to set these angles to 0, the angles are not set completely to 0. The error in the angle is approximately 0.02 radian. This is 20% of the minimum angle change and that could cause a significant error. So the conclusion is that the 3D estimation of a salient object does not work due to coordinate transformation errors.

To make the 3D estimation of a salient object work it is recommended to have NAO calibrated by Aldebaran Robotics. The calibration ensures that if the pitch and yaw angles are set to a certain value, NAO will move its head to exactly that position.
When NAO is calibrated, the positions of NAO's camera and NAO's head joint can be calculated. If the pitch and yaw angles are 0, the distance of NAO's camera in the z-direction is known. Then the pitch of NAO's head should be changed until the value of the camera's z-coordinate becomes 0. This forms a triangle with two known angles and one known side, which can be used to calculate the distance in the y-direction from the head joint to the camera. When this distance is subtracted from the distance in the y-direction in the case that the yaw and pitch angles are 0, the result is the distance from the original coordinate system to the head joint.

Figure 5.1: Triangulation.

Another possible explanation for why the algorithm does not work came to me just before printing this report. To determine the position of an object it is required that the epipoles of the two images (O and O', see figure 5.1) do not coincide. For NAO the assumption was made that the epipoles do not coincide, but it is possible that for NAO the epipoles do coincide, and that could explain the results. So the recommendation is to first check whether the epipoles of the images taken by NAO coincide. If they do, the optimal triangulation could still be the best 3D reconstruction method if the movements of NAO are adapted with this new knowledge.



Bibliography

[1] Costas Anyfantakis, Guglielmo Maria Caporale, and Nikitas Pittis. Parameter instability and forecasting performance: a Monte Carlo study. Discussion paper series DEDP 04/04, London Metropolitan University, Dept. of Economics, London, 2004.

[2] Andrea Fusiello, Emanuele Trucco, and Alessandro Verri. A compact algorithm for rectification ofstereo pairs. Machine Vision and Applications, 12:16–22, 2000.

[3] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, 2000.

[4] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998.

[5] C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying neural circuitry.Human Neurobiology, 4:219–227, 1985.

[6] K. K. Lee, K. H. Wong, M. M. Y. Chang, Y. K. Yu, and M. K. Leung. Extended Kalman filtering approach to stereo video stabilization. In 19th International Conference on Pattern Recognition, pages 3470–3473, 2008.

[7] R. Reilink, S. Stramigioli, F. van der Heijden, and G. van Oort. Saliency-based humanoid gaze emulation using a moving camera setup. MSc thesis 019CE2008, University of Twente, Dept. of Electrical Engineering, Control Engineering.

[8] Aldebaran Robotics. NAO documentation. Online, July 2011.

[9] Ferdi van der Heijden. 3D position estimation from 2 cameras. Available on request at the Department of Electrical Engineering, Mathematics and Computer Science (EEMCS) of the University of Twente, April 2011.
