
Lecture Notes 2. An Invitation to 3-D Vision: From Images to Models (in preparation). Y. Ma, J. Košecká, S. Soatto and S. Sastry. © Yi Ma et al.

Image formation

Background

This chapter requires basic notions of elementary physics and geometry. In addition, it requires knowledge of the basic properties of rigid body transformations as described in the previous chapter. Extra linear algebraic notions such as affine and projective transformations can be found in Appendix ??.

0.1 Representation of images

In a broad figurative sense, vision is the inverse problem of image formation: the latter studies how objects give rise to images, while the former attempts to use images to recover a description of objects in space. Therefore, designing vision algorithms requires first developing a suitable model of image formation. Suitable in this context does not necessarily mean physically accurate: the level of abstraction and complexity in modeling image formation must trade off physical constraints and mathematical simplicity in order to result in a manageable model (i.e. one that can be easily inverted).

Physical models of image formation easily exceed the level of complexity necessary and appropriate for this book. We will confine ourselves to purely geometric models; for a more accurate account of the image formation process, the reader is referred to optics books (such as the one by Born and Wolf [?]).

An image, as far as this book is concerned, is a two-dimensional brightness array. In other words, it is a map I, defined on a compact region Ω of a two-dimensional surface, taking values in the positive reals. For instance, in the case of a camera, Ω is a planar, rectangular region occupied by the photographic medium (or by the CCD sensor), so that we have

\[
I : \Omega \subset \mathbb{R}^2 \to \mathbb{R}_+; \quad (x, y) \mapsto I(x, y). \tag{1}
\]

Such an image can be represented, for instance, as the graph of I, as in Figure 1. In the case of a digital image, both the domain Ω and the range R_+ are discretized. For instance, Ω = [1, 640] × [1, 480] ⊂ Z^2 and R_+ is replaced by [0, 255] ⊂ Z_+. Such an image can be represented as an array of numbers as in Table 1.

The values taken by the map I depend upon physical properties of the scene being viewed, such as its shape, its material reflectance properties and the distribution of the light sources. Despite the fact that Figure 1 and Table 1 do not seem very indicative of the properties of the scene they portray, this is how images are represented in a computer. A different representation of the same image, better suited for interpretation by the human visual system, is obtained by generating a picture: a scene, different from the true one, that produces on the imaging sensor (the eye in this case) the same image as the original scene. In this sense pictures are "controlled illusions": they are flat, yet they produce in the eye the same image as the original scene. A picture of the same image I described in Figure 1 and Table 1 is shown in Figure 2. Although the latter seems more informative about the content of the scene, it is merely a different representation and contains exactly the same information.
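To make the two representations tangible, here is a small sketch (our own illustration, assuming NumPy and Matplotlib are available; the brightness values are synthetic) that renders one and the same array both as the graph of I, as in Figure 1, and as a picture, as in Figure 2:

```python
import numpy as np
import matplotlib.pyplot as plt

# A toy brightness map I : Omega -> R+, discretized on a 64 x 64 grid
# and quantized to [0, 255], as for a digital image.
x, y = np.meshgrid(np.arange(64), np.arange(64))
I = (128 + 100 * np.sin(x / 8.0) * np.cos(y / 8.0)).astype(np.uint8)

fig = plt.figure(figsize=(10, 4))

# Representation 1: the graph of I, a two-dimensional surface (cf. Figure 1).
ax1 = fig.add_subplot(1, 2, 1, projection="3d")
ax1.plot_surface(x, y, I.astype(float), cmap="gray")
ax1.set_xlabel("x"); ax1.set_ylabel("y"); ax1.set_zlabel("I")

# Representation 2: exactly the same numbers rendered as a picture (cf. Figure 2).
ax2 = fig.add_subplot(1, 2, 2)
ax2.imshow(I, cmap="gray", vmin=0, vmax=255)
ax2.set_xlabel("x"); ax2.set_ylabel("y")

plt.show()
```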

Page 2: Image formation - George Mason Universitykosecka/it835/lect2.pdfIn a broad gurative sense, vision is the inverse problem of image formation: the latter studies how objects give raise

2

[Surface plot: the brightness values I plotted over the (x, y) image domain.]

Figure 1: An image I represented as a two-dimensional surface.

188 186 188 187 168 130 101 99 110 113 112 107 117 140 153 153 156 158 156 153
189 189 188 181 163 135 109 104 113 113 110 109 117 134 147 152 156 163 160 156
190 190 188 176 159 139 115 106 114 123 114 111 119 130 141 154 165 160 156 151
190 188 188 175 158 139 114 103 113 126 112 113 127 133 137 151 165 156 152 145
191 185 189 177 158 138 110 99 112 119 107 115 137 140 135 144 157 163 158 150
193 183 178 164 148 134 118 112 119 117 118 106 122 139 140 152 154 160 155 147
185 181 178 165 149 135 121 116 124 120 122 109 123 139 141 154 156 159 154 147
175 176 176 163 145 131 120 118 125 123 125 112 124 139 142 155 158 158 155 148
170 170 172 159 137 123 116 114 119 122 126 113 123 137 141 156 158 159 157 150
171 171 173 157 131 119 116 113 114 118 125 113 122 135 140 155 156 160 160 152
174 175 176 156 128 120 121 118 113 112 123 114 122 135 141 155 155 158 159 152
176 174 174 151 123 119 126 121 112 108 122 115 123 137 143 156 155 152 155 150
175 169 168 144 117 117 127 122 109 106 122 116 125 139 145 158 156 147 152 148
179 179 180 155 127 121 118 109 107 113 125 133 130 129 139 153 161 148 155 157
176 183 181 153 122 115 113 106 105 109 123 132 131 131 140 151 157 149 156 159
180 181 177 147 115 110 111 107 107 105 120 132 133 133 141 150 154 148 155 157
181 174 170 141 113 111 115 112 113 105 119 130 132 134 144 153 156 148 152 151
180 172 168 140 114 114 118 113 112 107 119 128 130 134 146 157 162 153 153 148
186 176 171 142 114 114 116 110 108 104 116 125 128 134 148 161 165 159 157 149
185 178 171 138 109 110 114 110 109 97 110 121 127 136 150 160 163 158 156 150

Table 1: The image I represented as a two-dimensional table of integers.

0.2 Lenses, surfaces and light

In order to describe the image formation process, we must specify the value of the map I at each point (x, y) in Ω. Such a value I(x, y) is typically called image brightness, or more formally irradiance. It has the units of power per unit area (W/m^2) and describes the energy falling onto a small patch of the imaging sensor. Given the power δE of the light reaching a small image patch of area δA, the image irradiance at the point (x, y) is

\[
I(x, y) = \frac{\delta E}{\delta A}(x, y). \tag{2}
\]

The irradiance at a point of coordinates (x, y) is typically obtained by integrating energy both in time (for instance over the shutter interval of a camera, or the integration time of a CCD) and in space. The region of space that contributes to the irradiance at (x, y) depends upon the geometry and optics of the imaging device, and is by no means trivial to determine. In Section 0.2.1 we will adopt common simplifying assumptions to approximately determine it.


[Picture rendering of I, with x and y axes in pixel units.]

Figure 2: A “picture” of the image I (compare with Figure 1).

Once the region of space is determined, the energy it contributes depends upon the geometry and material of the scene as well as on the distribution of light sources.

0.2.1 Imaging through lenses

An optical system is a set of lenses used to “direct” light. By directing light we mean a controlled change in the direction of propagation, which can be achieved by means of diffraction, refraction and reflection. For the sake of simplicity, we neglect the effects of diffraction and reflection in a lens system, and we only consider refraction. Even so, a complete description of the functioning of a (purely refractive) lens is well beyond the scope of this book. Therefore, we will only consider the simplest possible model, that of a thin lens in a black box.

A thin lens is defined by an axis, called the optical axis, and a plane perpendicular to the axis, called the focal plane, with a circular aperture centered at the optical center, i.e. the intersection of the focal plane with the optical axis. The thin lens is characterized by one parameter, usually indicated by f and called the focal length, and by two functional properties. The first is that all rays entering the aperture parallel to the optical axis intersect on the optical axis at a distance f from the optical center. The point of intersection is called the focus of the lens (see Figure 3). The second property is that all rays through the optical center are undeflected. Now, consider a point p ∈ E^3, not too far from the optical axis, at a distance Z along the optical axis from the optical center. Draw, from the point p, two rays: one parallel to the optical axis, and one through the optical center (Figure 4). The first one intersects the optical axis at the focus, the second remains undeflected (by the defining properties of the thin lens). Call x the point where the two rays intersect, and let z be its distance from the optical center. By decomposing any other ray from p into a component parallel to the optical axis and one through the optical center, we can argue that all rays from p intersect at x on the opposite side of the lens. In particular, a ray from x parallel to the optical axis must go through p. Using similar triangles, from Figure 4 we obtain the following fundamental equation of the thin lens:

\[
\frac{1}{Z} + \frac{1}{z} = \frac{1}{f}. \tag{3}
\]
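As a quick numerical check of (3), the following sketch (ours; the helper name and the example values are made up) solves the thin lens equation for the image distance z given the object distance Z and the focal length f:

```python
def thin_lens_image_distance(Z: float, f: float) -> float:
    """Solve the thin lens equation 1/Z + 1/z = 1/f for the image
    distance z, given object distance Z and focal length f."""
    if Z <= f:
        raise ValueError("a real image requires Z > f")
    return 1.0 / (1.0 / f - 1.0 / Z)

# A point 1 m away, seen through a 50 mm lens, is brought into focus
# slightly behind the focal plane:
print(thin_lens_image_distance(Z=1.0, f=0.05))  # ~0.0526 m
```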



Figure 3: The rays parallel to the optical axis intersect at the focus.

The point x is called the image of the point p.


Figure 4: The image of the point p is the point x where the ray from p parallel to the optical axis and the ray through the optical center intersect.

Therefore, under the assumption of a thin lens, the irradiance at the point x of coordinates (x, y) on the image plane is obtained by integrating all the energy emitted from the region of space contained in the cone determined by the geometry of the lens. If such a region does not contain light sources, but only opaque surfaces, it is necessary to assess how such surfaces radiate energy towards the sensor. Before doing so, we introduce the notion of “field of view”: the angle subtended by the aperture of the lens as seen from the focus. If D is the diameter of the lens, then the field of view is 2 arctan(D/(2f)).
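For a feel of the numbers, here is a minimal sketch (ours; the aperture and focal length values are made up) computing this field of view:

```python
import numpy as np

def field_of_view(D: float, f: float) -> float:
    """Full angle 2 * arctan(D / (2 f)), in radians, subtended by an
    aperture of diameter D seen from a distance f."""
    return 2.0 * np.arctan(D / (2.0 * f))

# A 25 mm aperture with a 50 mm focal length:
print(np.degrees(field_of_view(D=0.025, f=0.05)))  # ~28 degrees
```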


0.2.2 Imaging through a pin-hole

If we let the aperture of a thin lens decrease to zero, all rays are forced to go through the optical center, and therefore they remain undeflected. Consequently, the aperture of the cone decreases to zero, and the only points that contribute to the irradiance at the image point x lie on a line through p and the optical center. If we let p have coordinates X(p) = [X, Y, Z]^T relative to a reference frame centered at the optical center, with the optical axis being the Z-axis, then it is immediate to see from Figure 5 that the coordinates of x and p are related by an ideal perspective projection:

\[
x = -f\frac{X}{Z}, \qquad y = -f\frac{Y}{Z}. \tag{4}
\]

Note that any other point on the line through p projects onto the same coordinates [x, y]^T. This imaging model is the so-called ideal pin-hole.


Figure 5: The image of the point p is the point x where the ray through the optical center o intersects an image plane at a distance f from the optical center.

The pin-hole model is an idealization: when the aperture decreases, diffraction effects become dominant, and therefore the (purely refractive) thin lens model does not hold. Furthermore, as the aperture decreases to zero, the energy going through the lens also becomes zero. The pin-hole is a purely geometric model that approximates well-focused imaging systems. In this book we will use the pin-hole model as much as possible and concentrate on the geometric aspects of image formation.

Notice that there is a negative sign in each of the formulae (4). This makes the image of an object appear upside down on the image plane (or on the retina, for the human eye). To eliminate this effect, we can simply flip the image: (x, y) ↦ (−x, −y). This corresponds to placing the image plane in front of the optical center instead, at Z = +f rather than Z = −f.
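Here is a minimal sketch of the resulting projection (ours, assuming NumPy; the function name is made up), using the flipped image plane Z = +f so that the negative signs of (4) disappear:

```python
import numpy as np

def pinhole_project(points: np.ndarray, f: float) -> np.ndarray:
    """Project 3-D points (rows [X, Y, Z] in the camera frame) onto an
    image plane placed at Z = +f in front of the optical center, i.e.
    equation (4) with the sign flipped: x = f X / Z, y = f Y / Z."""
    points = np.asarray(points, dtype=float)
    return f * points[:, :2] / points[:, 2:3]

# Two points on the same ray through the optical center project onto the
# same image coordinates: depth is lost in the projection.
pts = np.array([[1.0, 2.0, 4.0],
                [2.0, 4.0, 8.0]])
print(pinhole_project(pts, f=0.05))  # both rows are [0.0125, 0.025]
```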

0.3 Geometric camera models

The above sections describe a model of the image formation process, where geometric quantities are expressed in a “camera” reference frame, centered at the optical center and aligned with the optical axis. In order to establish a relationship between the position of points in the world, expressed for instance with respect to an inertial reference frame, and the points measured in an image, it is necessary to describe the relation between the camera reference frame and the world frame. In this section we describe such a relation as a series of transformations of coordinates. Inverting such a chain of transformations is the task of “camera calibration”, which is the subject of Chapter ??.


0.3.1 Rigid body motion of camera

Consider an orthogonal inertial reference frame (o, X, Y, Z), called the world frame. We write the coordinates of a point p in the world frame as

\[
X_w = [X_w, Y_w, Z_w]^T \in \mathbb{R}^3. \tag{5}
\]

In order to write the coordinates of the same point with respect to another reference frame, for instance the camera frame c (with coordinates X_c), it is necessary to describe the transformation between the world frame and the camera frame. We will indicate such a transformation by g_wc. With this notation, the coordinates in the world frame and those in the camera frame are related by

\[
X_w = g_{wc}(X_c) \tag{6}
\]

and vice versa X_c = g_cw(X_w), where g_cw = g_wc^{-1} is the inverse transformation, which maps coordinates in the world frame onto coordinates in the camera frame.

Transformations of coordinates between orthonormal reference frames are characterized by the rigid motion of the reference frame. Therefore, as discussed in Chapter ??, a rigid change of coordinates is described by the position of the origin of the new frame and by the orientation of the new frame relative to the old one. Consider now the change of coordinates g_wc:

\[
g_{wc} = (R_{wc}, T_{wc}), \tag{7}
\]

where T_wc ∈ R^3 and R_wc is a rotation matrix. The change of coordinates of an arbitrary point is then given by

\[
X_w = g_{wc}(X_c) = R_{wc} X_c + T_{wc}. \tag{8}
\]

A more compact way of writing the above equation is to use the homogeneous representation,

\[
X \doteq [X^T, 1]^T \in \mathbb{R}^4, \qquad g_{wc} = \begin{bmatrix} R_{wc} & T_{wc} \\ 0 & 1 \end{bmatrix} \in \mathbb{R}^{4 \times 4}, \tag{9}
\]

so that

\[
X_w \doteq g_{wc} X_c = \begin{bmatrix} R_{wc} & T_{wc} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} X_c \\ 1 \end{bmatrix}. \tag{10}
\]

The inverse transformation, which maps coordinates in the world frame onto coordinates in the camera frame, is given by

\[
g_{cw} = (R_{cw}, T_{cw}) = (R_{wc}^T, -R_{wc}^T T_{wc}), \tag{11}
\]

as can be easily verified by

\[
g_{cw}\, g_{wc} = g_{wc}\, g_{cw} = I_{4 \times 4}, \tag{12}
\]

which is the identity transformation; therefore g_cw = g_wc^{-1}.
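The homogeneous representation (9) and the inverse formula (11) are easy to verify numerically. A minimal sketch (ours, assuming NumPy; the helper name and the example motion are made up):

```python
import numpy as np

def homogeneous(R: np.ndarray, T: np.ndarray) -> np.ndarray:
    """4 x 4 homogeneous representation of g = (R, T), equation (9)."""
    g = np.eye(4)
    g[:3, :3] = R
    g[:3, 3] = T
    return g

# Example motion: rotation by 30 degrees about the Z-axis plus a translation.
th = np.deg2rad(30)
R_wc = np.array([[np.cos(th), -np.sin(th), 0.0],
                 [np.sin(th),  np.cos(th), 0.0],
                 [0.0,         0.0,        1.0]])
T_wc = np.array([1.0, 2.0, 3.0])

g_wc = homogeneous(R_wc, T_wc)
# Inverse transformation, equation (11): R_cw = R_wc^T, T_cw = -R_wc^T T_wc.
g_cw = homogeneous(R_wc.T, -R_wc.T @ T_wc)

# Equation (12): the composition is the identity in both orders.
assert np.allclose(g_cw @ g_wc, np.eye(4))
assert np.allclose(g_wc @ g_cw, np.eye(4))
```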


0.3.2 Ideal camera

Let us consider a generic point p, with coordinates X_w ∈ R^3 relative to the world reference frame. As we have just seen, its coordinates relative to the camera reference frame are given by X_c = R_cw X_w + T_cw ∈ R^3. So far we have used subscripts to avoid confusion. From now on, for simplicity, we drop them by adopting the convention R = R_cw, T = T_cw and g = (R, T). In the following sections we denote the coordinates of a point relative to the camera frame by X = [X, Y, Z]^T and the coordinates of the same point relative to the world frame by X_0; X_0 is also used to denote the coordinates of the point relative to the initial location of the moving camera frame. Using this convention we have

\[
X = g(X_0) = R X_0 + T. \tag{13}
\]

Attaching a coordinate frame to the center of projection, with the optical axis aligned with the Z-axis, and adopting the ideal pin-hole camera model (Figure 5), the point with coordinates X is projected onto the point with coordinates

\[
x = \begin{bmatrix} x \\ y \end{bmatrix} = -\frac{f}{Z} \begin{bmatrix} X \\ Y \end{bmatrix}, \tag{14}
\]

where [x, y]^T are coordinates expressed in a two-dimensional reference frame (the retinal image frame) centered at the principal point (the intersection between the optical axis and the image plane), with the x-axis and y-axis parallel to the X-axis and Y-axis respectively, and f represents the focal length, i.e. the distance of the image plane from the center of projection. In homogeneous coordinates this relationship can be written as

\[
Z \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} -f & 0 & 0 & 0 \\ 0 & -f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}, \tag{15}
\]

where [X, Y, Z, 1]^T ≐ X is the representation of a 3-D point in homogeneous coordinates and [x, y, 1]^T ≐ x are the homogeneous (projective) coordinates of its image in the retinal plane. Since the Z-coordinate (the depth of the point p) is usually unknown, we may simply denote it as an arbitrary positive scalar λ ∈ R_+. Also notice that in the above equation we can decompose the matrix into

\[
\begin{bmatrix} -f & 0 & 0 & 0 \\ 0 & -f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} = \begin{bmatrix} -f & 0 & 0 \\ 0 & -f & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}. \tag{16}
\]

Define two matrices

\[
A_f = \begin{bmatrix} -f & 0 & 0 \\ 0 & -f & 0 \\ 0 & 0 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 3}, \qquad P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \in \mathbb{R}^{3 \times 4}. \tag{17}
\]

Also notice that from the coordinate transformation we have, for X = [X, Y, Z, 1]^T,

\[
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix} \begin{bmatrix} X_0 \\ Y_0 \\ Z_0 \\ 1 \end{bmatrix}. \tag{18}
\]


To summarize, using the above notation, the geometric model for an ideal camera can be described as

\[
\lambda \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} -f & 0 & 0 \\ 0 & -f & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix} \begin{bmatrix} X_0 \\ Y_0 \\ Z_0 \\ 1 \end{bmatrix}, \tag{19}
\]

or in matrix form

\[
\lambda x = A_f P X = A_f P g X_0. \tag{20}
\]
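A minimal sketch of the ideal camera model (20) (ours, assuming NumPy; the focal length, motion and point are made-up values):

```python
import numpy as np

f = 0.05  # focal length [m] (made-up value)

# A_f and P as defined in equation (17).
Af = np.diag([-f, -f, 1.0])
P = np.hstack([np.eye(3), np.zeros((3, 1))])

# A rigid motion g = (R, T): here the world frame sits 2 m in front of the camera.
g = np.eye(4)
g[:3, 3] = [0.0, 0.0, 2.0]

# Homogeneous world coordinates X0 = [X0, Y0, Z0, 1]^T of a point.
X0 = np.array([0.4, 0.2, 2.0, 1.0])

# Equation (20): lambda x = A_f P g X0.
lam_x = Af @ P @ g @ X0
x = lam_x / lam_x[2]  # divide by lambda = Z to get [x, y, 1]^T
print(x)              # retinal coordinates, negative signs as in (14)
```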

0.3.3 Camera with intrinsic parameters

There is a variety of mechanisms to measure the brightness value (or, more properly, the “radiance”) of points in the 3-D world (e.g., traditional film, CCD cameras) and transform it into a form usable by a digital computing device. The process involves first capturing the irradiance onto a continuous medium (film) or an already discretized medium (CCD camera), followed by a sampling and quantization process to obtain the final instance of an image in digital form. Due to the sampling process and the size of the medium, position measurements are transformed into pixel measurements. This process introduces a (possibly different) scaling of the original image in both the x and y directions. Furthermore, in the image reference frame the origin is in the upper left corner, with the x-axis corresponding to the horizontal boundary of the image. To obtain the actual image coordinates in pixels, the scaling of the coordinates

\[
\begin{bmatrix} x_s \\ y_s \end{bmatrix} = \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \tag{21}
\]

is followed by a translation and an inversion of the axes:

\[
x_{im} = -(x_s - o_x), \qquad y_{im} = -(y_s - o_y),
\]

where (o_x, o_y) are the coordinates (in pixels) of the principal point (the intersection of the optical axis with the image plane) relative to the image reference frame, i.e. the point where the Z-axis intersects the image plane. So the actual image coordinates are given by the vector x_im = [x_im, y_im]^T ∈ R^2 instead of the ideal image coordinates [x, y]^T. The above steps can be written in homogeneous coordinates in matrix form in the following way:

\[
x_{im} \doteq \begin{bmatrix} x_{im} \\ y_{im} \\ 1 \end{bmatrix} = \begin{bmatrix} -s_x & 0 & o_x \\ 0 & -s_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \tag{22}
\]

where x_im and y_im are the actual image coordinates in pixels. When s_x = s_y the pixels are square. In case the pixels are not rectangular, a more general form of the scaling matrix can be considered:

\[
S = \begin{bmatrix} s_x & s_\theta \\ 0 & s_y \end{bmatrix} \in \mathbb{R}^{2 \times 2}, \tag{23}
\]


where s_θ is proportional to cot(θ), with θ the angle between the image axes x and y. The transformation matrix in (22) then takes the general form

\[
A_s = \begin{bmatrix} -s_x & -s_\theta & o_x \\ 0 & -s_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 3}. \tag{24}
\]

In many practical applications it is commonly assumed that s_θ = 0. In addition to these changes of coordinates, in the case of a large field of view one can often observe image distortions along radial directions. These are typically modeled as

\[
x = x_d (1 + a_1 r^2 + a_2 r^4), \tag{25}
\]
\[
y = y_d (1 + a_1 r^2 + a_2 r^4), \tag{26}
\]

where (x_d, y_d) are the coordinates of the distorted points, r^2 = x_d^2 + y_d^2, and a_1, a_2 are considered additional camera parameters. However, for simplicity, in this book we will not consider this type of distortion and refer the reader to [?].
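For illustration only, here is a sketch of the radial distortion model (25)-(26) (ours; the function name and coefficient values are made up):

```python
def undistort(xd: float, yd: float, a1: float, a2: float):
    """Radial distortion model (25)-(26): map distorted coordinates
    (xd, yd) to corrected coordinates (x, y)."""
    r2 = xd**2 + yd**2                    # r^2 = xd^2 + yd^2
    factor = 1.0 + a1 * r2 + a2 * r2**2   # 1 + a1 r^2 + a2 r^4
    return xd * factor, yd * factor

# The correction grows with the distance from the image center:
print(undistort(0.1, 0.0, a1=0.1, a2=0.01))  # nearly unchanged near the center
print(undistort(1.0, 1.0, a1=0.1, a2=0.01))  # noticeably displaced
```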

Now combining the projection model from the previous section with the scaling and translation yields a more realistic model of the transformation between the homogeneous coordinates of a 3-D point relative to the camera frame and the homogeneous coordinates of its image expressed in pixels:

\[
\lambda \begin{bmatrix} x_{im} \\ y_{im} \\ 1 \end{bmatrix} = \begin{bmatrix} -s_x & -s_\theta & o_x \\ 0 & -s_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} -f & 0 & 0 \\ 0 & -f & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}.
\]

Notice that in the above equation the effect of a real camera is carried through two stages. The first stage is a standard perspective projection with respect to a normalized coordinate system (as if the focal length were f = 1), characterized by the standard projection matrix P = [I_{3×3}, 0]. The second stage is a further transformation (on the image x so obtained) that depends on the parameters of the camera: the focal length f, the scaling factors s_x, s_y and s_θ, and the center offsets o_x, o_y. This second transformation is characterized by the product of the two matrices A_s and A_f:

\[
A = A_s A_f = \begin{bmatrix} -s_x & -s_\theta & o_x \\ 0 & -s_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} -f & 0 & 0 \\ 0 & -f & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} f s_x & f s_\theta & o_x \\ 0 & f s_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 3}. \tag{27}
\]

Such a decoupling allows us to write the projection equation in the following way:

\[
\lambda x_{im} = A P X = \begin{bmatrix} f s_x & f s_\theta & o_x \\ 0 & f s_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}. \tag{28}
\]

The 3 × 3 matrix A collects all the parameters that are “intrinsic” to the particular camera, which are therefore called intrinsic parameters; the matrix P represents the perspective projection. The matrix A is usually called the intrinsic parameter matrix, or simply the calibration matrix, of the camera.


When A is known, the normalized coordinates x can be obtained from the pixel coordinates x_im by a simple inversion of A:

\[
\lambda x = \lambda A^{-1} x_{im} = P X = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}. \tag{29}
\]

The information about the matrix A can be obtained through the process of camera calibration described in Chapter ??.

The normalized coordinate system corresponds to an ideal pinhole camera with the image plane located in front of the center of projection and focal length equal to 1. Given this geometric interpretation, the individual entries of the matrix A correspond to:

1/s_x : size of a horizontal pixel in meters [m/pixel],
1/s_y : size of a vertical pixel in meters [m/pixel],
α_x = f s_x : focal length expressed in horizontal pixels [pixel],
α_y = f s_y : focal length expressed in vertical pixels [pixel],
α_x/α_y : aspect ratio σ.

To summarize, the overall geometric relationship between the 3-D coordinates X_0 = [X_0, Y_0, Z_0]^T relative to the world frame and the corresponding image coordinates x_im = [x_im, y_im]^T (in pixels) depends on the rigid body displacement between the camera frame and the world frame (also called the extrinsic calibration parameters), an ideal projection, and the camera intrinsic parameters. The overall geometry of the image formation model is therefore captured by the following equation:

\[
\lambda \begin{bmatrix} x_{im} \\ y_{im} \\ 1 \end{bmatrix} = \begin{bmatrix} f s_x & f s_\theta & o_x \\ 0 & f s_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix} \begin{bmatrix} X_0 \\ Y_0 \\ Z_0 \\ 1 \end{bmatrix}, \tag{30}
\]

or in matrix form

\[
\lambda x_{im} = A P X = A P g X_0. \tag{31}
\]
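Putting the whole chain together, here is a sketch of equation (31) (ours, assuming NumPy; all parameter values are made up), which also recovers the normalized coordinates via (29):

```python
import numpy as np

# Intrinsic parameters (all made-up example values).
f, sx, sy, s_theta = 0.05, 8000.0, 8000.0, 0.0  # [m], [pixel/m], [pixel/m]
ox, oy = 320.0, 240.0                            # principal point [pixel]

# Calibration matrix A, equation (27).
A = np.array([[f * sx, f * s_theta, ox],
              [0.0,    f * sy,      oy],
              [0.0,    0.0,         1.0]])

# Standard projection matrix P and a rigid motion g = (R, T).
P = np.hstack([np.eye(3), np.zeros((3, 1))])
g = np.eye(4)
g[:3, 3] = [0.1, 0.0, 2.0]  # camera displaced relative to the world frame

# Equation (31): lambda x_im = A P g X0.
X0 = np.array([0.4, 0.2, 2.0, 1.0])  # homogeneous world coordinates
lam_xim = A @ P @ g @ X0
xim = lam_xim / lam_xim[2]           # pixel coordinates [xim, yim, 1]^T
print(xim)                           # e.g. [370., 260., 1.]

# Equation (29): normalized coordinates recovered by inverting A.
print(np.linalg.inv(A) @ xim)
```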

0.3.4 Spherical projection

The pin-hole camera model outlined in the previous sections assumes a planar imaging surface. An alternative imaging surface that is also commonly considered is that of a sphere; this choice is partly motivated by the retina shapes often encountered in biological systems. For spherical projection, we simply choose the imaging surface to be the unit sphere S^2 = {X ∈ R^3 : ‖X‖ = 1}, where ‖·‖ is the standard Euclidean norm. The spherical projection is then defined by the map π_s from R^3 to S^2:

\[
\pi_s : \mathbb{R}^3 \to S^2, \qquad X \mapsto x = \frac{X}{\|X\|}.
\]

Similarly to the case of perspective projection, the relationship between the coordinates of 3-D points and their image projections can be expressed as

\[
\lambda x_{im} = A P X = A P g X_0, \tag{32}
\]

where the scale is λ = Z in the case of perspective projection and λ = √(X^2 + Y^2 + Z^2) in the case of spherical projection. Therefore, mathematically, perspective projection and spherical projection are equivalent to each other; the only difference is that the unknown scale λ takes different values.
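A minimal sketch of the spherical projection π_s (ours, assuming NumPy; the function name is made up):

```python
import numpy as np

def spherical_project(X: np.ndarray) -> np.ndarray:
    """Spherical projection pi_s: map X onto the unit sphere S^2,
    x = X / ||X||, with the Euclidean norm."""
    X = np.asarray(X, dtype=float)
    return X / np.linalg.norm(X)

X = np.array([1.0, 2.0, 2.0])
print(spherical_project(X))   # [1/3, 2/3, 2/3], a unit vector
print(np.linalg.norm(X))      # 3.0: here lambda = ||X|| instead of lambda = Z
```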


0.3.5 Approximate camera models

The most commonly used approximation to the perspective projection model is the so-called orthographic projection. Light rays in the orthographic model travel along lines parallel to the optical axis, so the relationship between image points and 3-D points is particularly simple: x = X, y = Y. The geometric model for an “orthographic camera” can thus be expressed as

\[
\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}, \tag{33}
\]

or simply, in matrix form,

\[
x = P_2 X, \tag{34}
\]

where P_2 = [I_{2×2}, 0] ∈ R^{2×3}.

Orthographic projection is a good approximation to perspective projection when the variation of depth among the viewed points is much smaller than their distance from the image plane. In case the viewed points lie in a plane parallel to the image plane, the image of the points is simply a scaled version of the original. This scaling can be explicitly incorporated into the orthographic projection model, leading to the so-called weak-perspective model. In this case the relationship between image points and 3-D points is

\[
x = f\frac{X}{\bar{Z}}, \qquad y = f\frac{Y}{\bar{Z}},
\]

where Z̄ is the average distance of the points viewed by the camera. Denoting the scaling factor s = f/Z̄, we can express the weak-perspective (scaled orthographic) camera model as

\[
\begin{bmatrix} x \\ y \end{bmatrix} = s \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}, \tag{35}
\]

or simply, in matrix form,

\[
x = s P_2 X. \tag{36}
\]

These approximate projection models often lead to simplified and efficient algorithms for the estimation of the unknown structure and displacement of the cameras [?].
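To see the quality of the weak-perspective approximation numerically, here is a sketch (ours, assuming NumPy; the scene points are made up) comparing it with full perspective projection on a shallow scene:

```python
import numpy as np

f = 1.0  # focal length in normalized units

# A shallow scene: the depth variation (~0.2) is small compared to the
# average distance from the camera (Z_bar = 10).
pts = np.array([[1.0, 0.5, 10.0],
                [0.8, 0.2, 10.1],
                [1.2, 0.4,  9.9]])

# Perspective projection: x = f X / Z, y = f Y / Z (per-point depth).
persp = f * pts[:, :2] / pts[:, 2:3]

# Weak perspective, equations (35)-(36): one common scale s = f / Z_bar.
Z_bar = pts[:, 2].mean()
weak = (f / Z_bar) * pts[:, :2]

print(np.abs(persp - weak).max())  # small residual: the models nearly agree
```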

0.4 Summary

0.5 Exercises

1. Show that any point on the line through the optical center and p projects onto the same image coordinates (equation (4)).

2. Show how perspective projection approximates orthographic projection when the scene occupies a volume whose diameter is small compared to its distance from the camera. Characterize other conditions under which the two projection models produce similar results (equal in the limit).


3. Consider a thin round lens imaging a plane parallel to the lens at a distance d from the focal plane. Determine the region of this plane that contributes to the image I at the point x. (Hint: consider first a one-dimensional imaging model, then extend to a two-dimensional image.)

4. Perspective projection vs. orthographic projection
It is common sense that, with a perspective camera, one cannot tell an object from another object that is exactly twice as big but twice as far away. This is a classic ambiguity introduced by perspective projection. Use the ideal camera model to explain why this is true. Is the same also true for orthographic projection? Explain.

5. Calibration matrix
Compute the calibration matrix A that represents the transformation from image I to I′ as shown in Figure 6. Note that, from the definition of the calibration matrix, you need to use homogeneous coordinates to represent points on the images.

[Sketch: the normalized image I, with corner at (−1, 1), is mapped onto the 640 × 480 pixel image I′, with origin o and principal point offset o_x; axes x and y.]

Figure 6: Transformation of a normalized image into pixel coordinates.

Suppose that the resulting image I′ is further digitized into an array of 480 × 640 pixels and the intensity value of each pixel is quantized to an integer in [0, 255]. How many different digitized images can one possibly obtain from such a process? Give an exact expression for the number.