A High-Performance Stereo Vision System for Obstacle Detection
Todd A. Williamson
September 25, 1998
CMU-RI-TR-98-24
Robotics Institute
Carnegie Mellon University
Pittsburgh, PA 15213
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
©1998 Todd A. Williamson
This research was partially sponsored by the collaborative agreement between Carnegie Mellon University and Toyota Motor Corporation.
Abstract
Intelligent vehicle research to date has made great progress toward true autonomy. Integrated systems for on-road vehicles, which include road following, headway maintenance, tactical-level planning, avoidance of large obstacles, and inter-vehicle coordination, have been demonstrated. One of the weakest points of current automated cars, however, is the lack of a reliable system to detect small obstacles on the road surface. In order to be useful at highway speeds, such a system must be able to detect small (~15cm) obstacles at long ranges (~100m), with a cycle rate of at least 2 Hz.

This dissertation presents an obstacle detection system that uses trinocular stereo to detect very small obstacles at long range on highways. The system makes use of the apparent orientation of surfaces in the image in order to determine whether pixels belong to vertical or horizontal surfaces. A simple confidence measure is applied to reject false positives introduced by image noise. The system is capable of detecting objects as small as 14cm high at ranges well in excess of 100m.

The obstacle detection system described here relies on several factors. First, the camera system is configured in such a way that even small obstacles generate detectable range measurements. This is done by using a very long baseline, telephoto lenses, and rigid camera mounts. Second, extremely accurate calibration procedures allow accurate determination of these range differences. Multibaseline stereo is used to reduce the number of false matches and to improve range accuracy. Special image filtering techniques are used to enhance the very weak image textures present on the road surface, reducing the number of false range measurements. Finally, a technique for determining the surface orientation directly from stereo data is used to detect the presence of obstacles.

A system to detect obstacles is not useful if it does not run in near real-time. In order to improve performance, this dissertation includes a detailed analysis of each stage of the stereo algorithm. An efficient method for rectifying images for trinocular stereo is presented. An analysis of memory usage and cache performance of the stereo matching loop has been performed to allow efficient implementation on systems using general-purpose CPUs. Finally, a method for efficiently determining surface orientation directly from stereo data is described.
Acknowledgements

First of all I want to thank my advisor, Chuck Thorpe, for his unending patience and guidance, particularly when I was getting into a level of mathematical detail that was tedious to us both. My decision to take a two-year leave of absence in Japan did not faze him in the least, and he was never less than supportive.

I also want to express my thanks to Martial Hébert (note the oft overlooked accent) for sharing his great knowledge of projective geometry, stereo vision, and obstacle detection with me. He has spent several hours of his life explaining things to me that were perhaps better learned elsewhere; for this I am grateful.

I feel that the environment in the Vision and Autonomous Systems Center here at CMU, and the Robotics Institute of which VASC is a part, have both contributed greatly to my research. Whenever a research problem arose, I always felt that I could go and ask practically anyone about it, and if they didn't know the answer, they could point me towards someone who did. I have a feeling that it is this sort of environment that I will miss most when I leave CMU; hopefully I can be instrumental in fostering a similar environment wherever I go.

Of my colleagues in VASC and RI, I want to express particular thanks to Dave LaRose, for much insight into how electrical engineers think about computer vision problems. I think that many people in computer vision who have a computer science background could benefit from a more thorough understanding of signal processing principles. Similarly, John Hancock did a lot of thinking about the obstacle detection problem before I even decided to make it my thesis topic, and I benefitted both directly and indirectly from the work that he has done. Other people who have provided me with invaluable advice (both technical and personal) include Jennie Kay, Conrad Poelman, Jeff Schneider, Bill Ross, Toshihiko Suzuki, Stuart Fairley, and Parag Batavia.

Finally, I want to express thanks to my family. My mother, who returned to graduate school at the same time that I was finishing high school, blazed the trail for me to follow. She made it look easy. My father has continually expressed a confidence in me that I often felt was unfounded, but I appreciate it greatly. Finally, I want to thank my wife Hiroko, who has followed me to Pittsburgh from Tokyo, and dealt with incredible culture shock, in order for me to complete my Ph.D. She has also dealt with our limited funds and a lot of uncertainty for our future, and for that I am thankful.
Contents
Abstract i
Acknowledgements iii
Contents v
1 Introduction 1
1.1 Background. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Intelligent Vehicle Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Obstacle Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Stereo Vision for Obstacle Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5.1 Traditional Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5.2 “Ground Plane Stereo” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5.3.1 Multibaseline (Trinocular) Stereo . . . . . . . . . . . . . . . . . . . . . . . 10
1.5.3.2 Laplacian of Gaussian (LoG) Filtering . . . . . . . . . . . . . . . . . . . 12
1.5.4 Obstacle Detection from Stereo Output . . . . . . . . . . . . . . . . . . . . . . 14
1.5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.6 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2 Mathematical Fundamentals 23
2.1 Mathematics of Stereo Vision. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1.1 Homography Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.2 Fundamental Matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.3 Relationship Between Homography Matrices . . . . . . . . . . . . . . . . . . . . 29
3 Calibration 31
3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Weak Calibration of Multibaseline Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Image Warping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Computing Homography Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.3 Finding the Epipole . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.4 Improving Accuracy of Recovered Parameters . . . . . . . . . . . . . . . . . . . 40
3.2.5 Stereo Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Global (metric or Euclidean) calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.1 Practical and Accurate Metric Calibration. . . . . . . . . . . . . . . . . . . . . . . 45
3.4 Summary of the Calibration Method Steps. . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4 Stereo Algorithm 51
4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Multibaseline Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 LoG Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Rectification and Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5 Stereo Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.6 Sub-pixel Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5 Implementation 65
5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 CMU Video-Rate Multibaseline Stereo Machine. . . . . . . . . . . . . . . . . . . . . . 66
5.2.1 LoG Filter and Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2.2 Rectification (Geometry Compensation) . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2.3 Stereo Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.4 Stereo Machine Performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3 Software Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3.1 Multibaseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3.2 LoG Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3.2.1 Determining LoG Filter Coefficients . . . . . . . . . . . . . . . . . . . . . . 72
5.3.3 Rectification and Stereo Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3.3.1 The stereo matching main loop. . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3.3.2 Rectification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.3.3 Rectification strategy for the (r,d,c) ordering . . . . . . . . . . . . . . . . 83
5.3.3.4 Rectification strategy for the (r,c,d) ordering . . . . . . . . . . . . . . . . 84
5.3.3.5 Rectification strategy for the (d,r,c) ordering: . . . . . . . . . . . . . . . 86
5.3.3.6 Computing the Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3.3.7 Memory Use in Stereo Matching . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.3.8 Benchmarks for the (r,c,d) case . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3.4 CPU-Specific Implementation Issues . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6 Obstacle Detection 97
6.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3 Approaches to Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.4 Traditional Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.5 “Ground Plane Stereo” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.6 Height Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.7 Obstacle Detection from Stereo Output . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.7.1 Computing the two types of stereo efficiently . . . . . . . . . . . . . . . . . 113
6.8 Obstacle Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7 Obstacle Detection Results 121
7.1 Obstacle Detection System Performance . . . . . . . . . . . . . . . . . . . . . . . . 123
7.2 Stereo Range to Detected Obstacles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.3 Experiments From a Moving Vehicle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.4 Other Detected Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.5 Lateral Position and Extent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.6 Night Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.7 Repeated Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
8 Conclusions 135
8.1 Contributions of This Thesis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
8.2.1 Determining More Orientations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8.2.2 Test in an Offroad Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8.2.3 Use Temporal Information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8.2.4 Obstacle Avoidance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
8.2.5 Further Optimizations and Speed Enhancements . . . . . . . . . . . . . . . . 138
Bibliography 141
Chapter 1
Introduction
This dissertation presents an obstacle detection system that uses trinocular stereo
to detect very small obstacles at long range on highways. The system makes use of the
apparent orientation of surfaces in the image in order to determine whether pixels
belong to vertical or horizontal surfaces. A simple confidence measure is applied to
reject false positives introduced by image noise. The system is capable of detecting
objects as small as 14cm high at ranges well in excess of 100m.
1.1. Background
Until the invention of mechanical vehicles, most transportation systems possessed some degree of autonomy. In order to be a good beast of burden, an animal not
only has to be strong enough to carry the load, but it also must be intelligent enough to
follow a path, avoid colliding with things, and not run off of cliffs. This degree of
autonomy was lost with the transition to human-controlled mechanical vehicles. Since
the 1960s, a number of research groups around the world have been attempting to
restore some of this intelligence.
There are several good reasons to develop intelligent vehicles. Perhaps the first
reason that occurs to most people is convenience. Although many people enjoy driving
to some extent, almost everyone finds the driving task tedious at times. The idea of
getting into a car, programming it for the desired destination, and then relaxing while
in transit thus holds some appeal.
Perhaps a more compelling reason to build intelligent vehicles is to solve traffic
problems. If such a car existed, it should be able to drive much more precisely than a
human can. With increased precision, cars can travel faster and closer together, effec-
tively increasing the capacity of existing roadways.
The most compelling reason for adding autonomous capability to automobiles is
surely increased safety. Government studies attribute 96.2% of accidents in the United
States to driver error [Treat et al. 79]. Many of these accidents could be avoided with
autonomous vehicle technology, either by controlling the car to avoid the accident, or
by warning the driver of a dangerous situation so that she can take appropriate action.
1.2. Intelligent Vehicle Research
For the purposes of definition, an Intelligent Vehicle is a vehicle equipped with
sensors and computing that allow it to perceive the world around it, and to decide on
appropriate action. If the vehicle is also equipped with actuators, the vehicle may be
completely or partially computer-controlled. In the absence of such actuators, the sys-
tem may act in a warning capacity.
Research in intelligent vehicles has a long history. Various research groups experi-
mented with limited automation using analog electronics as early as 1960
([Gardels 60], [Oshima et al. 65]). However, real progress in the problem was not
made until inexpensive cameras and computing enabled vision-based lane tracking in
the mid-to-late 1980s (e.g. [Dickmanns & Zapp 86], [Kluge & Thorpe 89]). Research
in automated headway control solved another piece of the problem and allowed appli-
cations such as automated convoying ([Cro & Parker 70], [Kories et al. 88]). In 1995,
the Carnegie Mellon Navlab 5 vehicle steered 98% of the distance between Washing-
ton, DC and San Diego (a distance of 2800 miles) autonomously, demonstrating that
vision-based road following is a mature technology. Progress has also been made in
the area of high-level planning in the presence of other traffic ([Reece 92], [Suk-
thankar 97]).
As part of a demonstration of Automated Highway System concepts in 1997, many
different groups from around the world demonstrated integrated vehicle systems. The
vehicles from Carnegie Mellon consisted of two cars, a van, and two full-sized city
buses. Integrated capabilities that were demonstrated included road following, lane
changes, inter-vehicle communication, detection and awareness of surrounding vehi-
cles, and detection and avoidance of large obstacles.
1.3. Obstacle Detection
Most of the progress in intelligent vehicles has been made in handling predictable
situations (which is not to say that the situations are necessarily common, just predict-
able). In line with this, much of the work on obstacle detection has focused on detect-
ing other vehicles and large, unambiguous obstacles such as traffic barrels. Many of
these methods can successfully detect moving vehicles, but the more difficult problem
of finding small, static road debris such as tires and crates remains unsolved.
Deciding exactly what size obstacle we need to be able to detect at what minimum
range is a complicated problem which has been addressed by many different research-
ers in different ways. Hancock [Hancock 97] used the equations derived by Kelly
[Kelly 95] for cross-country navigation to arrive at a distance of 65 m ahead for a 20
cm high obstacle, with the following assumptions:
• vehicle is travelling at 60 mph (26.7 m/s)
• vehicle can decelerate at 0.7 g (6.8 m/s²)
• there is a 0.5 second delay time between sensing of the obstacle and application of the brakes
• processing is performed at a cycle rate of 0.3 seconds
• the sensor is located 1 meter above the ground

He also calculates that the sensor must have a vertical angular resolution of at least 0.1° and a vertical field of view that is the same, 0.1°, implying that a single line sensor would be sufficient. In reality, many of these assumptions are optimistic. For instance, although many cars may be able to sustain 0.7 g deceleration on dry pavement under ideal conditions, it is unrealistic to expect this kind of performance from all vehicles under all conditions. We would also like to be able to travel at higher speeds when the law permits it. Additionally, there is empirical evidence that we may need to avoid obstacles as small as 6” (14 cm) tall in order to avoid damage to the vehicle. Lastly, even if a single scan line is sufficient, it is better to have many pixels on the obstacle in order to enhance the reliability of detection results.

The combination of the above factors leads us to the conclusion that we would like to detect smaller obstacles at somewhat larger ranges. Simply changing the speed from 60 to 65 mph and the deceleration to 0.5 g implies a necessary distance of 100 meters.

Sensors such as automotive radar do not have the acuity to find small obstacles at such large distances, and have significant difficulties with non-metallic obstacles such as wood, cement, or animals. While a variety of competing methods have been proposed for on-road obstacle detection, most of the work has focused on detecting large objects, especially other vehicles (e.g. [Luong et al. 95]). Although the problem of detecting static obstacles has been tackled in both the cross-country and indoor mobile robot navigation literature (e.g. [Matthies 92]), these systems have operated at low speeds (5-10 mph) and short range.
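The distance figures above follow from standard stopping-distance kinematics: reaction distance plus braking distance. The sketch below uses that simple model, treating the brake delay plus one processing cycle as the reaction time; it does not reproduce Hancock's detailed 65 m derivation exactly, but it shows how the required sensing range scales with speed and deceleration.

```python
G = 9.81  # m/s^2

def stopping_distance(v, decel, delay):
    """Distance covered during the sensing/braking delay, plus the
    braking distance v^2 / (2a)."""
    return v * delay + v ** 2 / (2.0 * decel)

# Hancock-style assumptions: 60 mph, 0.7 g, 0.5 s delay + 0.3 s cycle.
print(round(stopping_distance(26.7, 0.7 * G, 0.8), 1))   # ~73 m
# Relaxed assumptions from the text: 65 mph and 0.5 g braking.
print(round(stopping_distance(29.1, 0.5 * G, 0.8), 1))   # ~110 m
```

The second case lands close to the 100 m figure quoted in the text.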
1.4. Stereo Vision for Obstacle Detection
This thesis presents a solution to the obstacle detection problem based on trinocu-
lar stereo vision. The solution presented is capable of detecting small obstacles, on the
order of 15 centimeters tall, on the road surface at ranges of 100 meters or more in
front of the vehicle.
Stereo vision is an ideal method for solving the obstacle detection problem for a
variety of reasons. If we expect to someday equip every vehicle on the highway with
its own obstacle detection system, then the use of an active sensor such as radar or
ladar requires great care to avoid interference between the signals emanating from dif-
ferent vehicles. This argues for the use of passive sensing devices such as video cam-
eras. In addition, cameras and computers are continually getting smaller and less
expensive. Although prices are not yet low enough to include three cameras and a
powerful computer on every car, current trends will make it possible within the next
five years. Yet another factor is that a stereo system lacks moving parts, which implies
less wear and thus greater reliability.
1.5. Thesis Overview
This section presents a summary of the main ideas and results of this dissertation,
which will be presented in greater detail throughout the remaining chapters. First, we
discuss the problems posed by a straight-forward application of stereo vision to the
obstacle detection problem. Following this, a method is presented to solve these prob-
lems. The next two sections discuss major algorithmic choices and their impact on the
quality of the stereo output. This is followed by a section describing the actual method
that we use for detecting obstacles from stereo disparity data. Finally, we present a
summary of obstacle detection results.
1.5.1. Traditional Stereo
As illustrated in Figure 1-1, traditional stereo processing involves taking two
images of a scene at the same time from different viewpoints. Each point in one of the
images is constrained by the camera geometry to lie along a line (called the epipolar
line) in the other image. Its position on this line is related to the distance of the point
from the cameras.
In order to make the search more reliable, instead of comparing individual pixels
from the two images, small regions are compared.
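The region comparison described here can be sketched as a sum-of-absolute-differences (SAD) search slid along the epipolar line, assumed here to be a horizontal scanline of rectified images; the window size and search range are illustrative choices, not the system's parameters.

```python
import numpy as np

def sad_curve(left, right, row, col, win=8, d_max=150):
    """SAD matching error between a window in `left` and candidate
    windows displaced along the same row of `right`."""
    patch = left[row:row + win, col:col + win].astype(np.int64)
    errors = []
    for d in range(d_max):
        cand = right[row:row + win, col - d:col - d + win].astype(np.int64)
        errors.append(int(np.abs(patch - cand).sum()))
    return np.array(errors)

# Synthetic rectified pair: the "left" view is the "right" view shifted
# by 7 pixels, so the error curve bottoms out at disparity 7.
rng = np.random.default_rng(0)
right = rng.integers(0, 255, size=(16, 400))
left = np.roll(right, 7, axis=1)
print(int(np.argmin(sad_curve(left, right, row=4, col=200))))  # 7
```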
Two examples of this are shown in Figure 1-1. In this example, the scene is of the inside of a garage. The garage door has calibration targets attached to it, and the images have been filtered, both of which serve to enhance the image texture. Two regions are chosen as examples of stereo matching.
Figure 1-1: Traditional Stereo Processing (left and right image regions, their differences, and a plot of matching error versus disparity)

For the example region on the door of the garage, we see that the regions searched
in the stereo matching (shown in detail below the images) match very well. The upper
curve on the graph at the bottom shows the matching error (sum of absolute differ-
ences, SAD) as a function of the displacement along the epipolar line. This graph
shows a strong global minimum at the correct value of 100.
On the other hand, the example on the garage floor does not match as well. This is
due to the fact that since the ground is tilted with respect to the camera axis, points
which are higher in the image are actually farther away and thus match at a different
location. This is seen as a difference in the slope of the line on the ground. The lower
curve of the graph shows that the global minimum of the matching error does not
occur at the correct position (which would be at a value of around 155).
It is clear from this example that a simple application of traditional stereo tech-
niques will not be sufficient for detecting obstacles on a road surface; points on the
ground such as those shown in the example will produce incorrect results, particularly
in regions where the image texture is low. Since the problem is caused by a difference
in the geometry of the surfaces being observed, the solution to this problem is to com-
pensate for the different geometry.
1.5.2. “Ground Plane Stereo”
The simplest way to solve the problems described in the previous section is to
warp one of the images (using a projective warping function) so that the images would
appear to be exactly the same if all of the pixels in the image were on some typical
ground plane. This results in a situation as shown in Figure 1-2. Both images now
appear to be the same for pixels which are on the ground, but pixels which are on a
vertical surface such as the wall of the garage are now warped in much the same way
that the ground pixels were warped in traditional stereo. This means of computing ste-
reo (described in more detail in [Williamson & Thorpe 98a]) is similar to the tilted
horopter method of Burt et al. [Burt et al. 95], except that in our case, instead of
attempting to determine the parameters of the ground plane at each iteration, we use a
horizontal plane that is fixed relative to the vehicle as the starting point for our stereo
search.
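The warping step can be sketched with a plane homography applied by inverse mapping. This is a minimal sketch assuming the 3x3 ground-plane homography H is already known from calibration, and it uses nearest-neighbour sampling where a real implementation would interpolate.

```python
import numpy as np

def warp_homography(img, H):
    """Warp img by the 3x3 homography H, which maps each output pixel
    (x, y, 1) to its source location in img (inverse mapping,
    nearest-neighbour sampling)."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1).astype(np.float64)
    src = H @ pts
    sx = np.rint(src[0] / src[2]).astype(int)
    sy = np.rint(src[1] / src[2]).astype(int)
    ok = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(img)
    out.reshape(-1)[ok] = img[sy[ok], sx[ok]]
    return out

# With an identity homography the image is unchanged; a projective H
# derived from the ground plane would instead slide ground pixels into
# registration while shearing vertical surfaces.
img = np.arange(25).reshape(5, 5)
assert (warp_homography(img, np.eye(3)) == img).all()
```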
Comparing the results from the ground plane method with the results from the tra-
ditional method, we notice several differences. First, the global minimum of the
matching error curve for the point on the ground (the lower curve) now appears at the
correct location. The value of the error at the minimum is also lower than before, since
it matches better. Second, although the global minimum of the curve for the point on
the door is still at the correct location, the trough of the minimum is much wider, indi-
cating a less certain result. The value at the minimum is larger, indicating that it
doesn’t match as well.
This example illustrates an interesting result: if we compute stereo using both methods, it is possible to determine whether a given point lies on a vertical surface (if the traditional method produces a lower minimum error) or on a horizontal surface (if the ground plane method produces a lower minimum error). The correct disparity can also be determined from the position of the lower minimum.
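The decision rule can be sketched as follows. The arrays stand for a region's two matching-error curves, and the `margin` tolerance is a hypothetical parameter for illustration, not something specified in the text.

```python
import numpy as np

def classify_surface(err_traditional, err_ground, margin=0.0):
    """Pick the surface type whose stereo variant matched better, and
    return the disparity at that variant's minimum."""
    t_min, g_min = err_traditional.min(), err_ground.min()
    if t_min + margin < g_min:
        return "vertical", int(np.argmin(err_traditional))
    return "horizontal", int(np.argmin(err_ground))

# Toy curves: traditional stereo matches this region far better, so it
# is labelled vertical, with the disparity taken at that curve's minimum.
label, disp = classify_surface(np.array([9.0, 2.0, 8.0]),
                               np.array([7.0, 6.0, 7.5]))
print(label, disp)  # vertical 1
```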
Figure 1-2: “Ground Plane Stereo” (left and right image regions, their differences, and a plot of matching error versus disparity)
Since most obstacles that we are concerned with contain nearly-vertical surfaces,
detecting such obstacles becomes very easy using this method.
One issue that must be addressed is what conditions are necessary for this method
to work reliably. For example, if two surfaces appear in the same image region (near
where the garage door meets the ground, for instance), which surface will be chosen?
The most important factor is the magnitude of the image texture on each surface.
Another factor is how close the surface directions are to being vertical or horizontal.
Figure 1-3 shows the results of applying both methods to a typical input image set.
The gray coding in both cases represents the number of pixels of displacement along
the epipolar line (dark is negative, medium gray is zero, and bright is positive). As
expected, the ground plane method does very well on the ground pixels, but poorly on
the wall in the background. Conversely, the traditional method works well on vertical
features such as the lamp post and the wall, but there is a lot of noise on the ground
surface.
1.5.3. Implementation
Figure 1-4 shows the architecture of the system that we have implemented. Three
CCD cameras with 35mm lenses are arranged in a triangular configuration, mounted
on top of our Toyota Avalon test vehicle. The distance between the outer set of cam-
eras is about 1.5m.
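The effect of the long baseline and telephoto lenses can be illustrated with the pinhole relation d = fB/Z. The 9 µm pixel pitch below is an assumed value for converting the 35 mm focal length to pixels, not a figure from the text.

```python
# Disparity of a point at range Z for baseline B and focal length f:
# d = f * B / Z (in pixels when f is expressed in pixels).
pixel_pitch = 9e-6               # metres per pixel (assumed)
focal_px = 0.035 / pixel_pitch   # 35 mm lens -> ~3889 pixels
baseline = 1.5                   # metres, outer camera pair
for Z in (20.0, 50.0, 100.0):
    print(f"{Z:5.0f} m -> {focal_px * baseline / Z:6.1f} px disparity")
```

Even at 100 m the disparity remains tens of pixels, which is why small range differences between an obstacle and the road stay measurable.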
The computation that is performed is based on that used by the CMU Video
Rate Multibaseline Stereo Machine [Kanade et al. 96]. The images are first passed
through a Laplacian of Gaussian (LoG) filter, then rectified to align the epipolar lines.
Stereo matching is then performed using both the traditional method and the ground
plane method. Based on the output of both methods, the further step of obstacle detec-
tion and localization is performed.
Figure 1-3: Example output of both methods (original image, traditional method output, ground plane method output)

1.5.3.1. Multibaseline (Trinocular) Stereo

There are several benefits to adding a third camera in a triangular configuration. The most important of these is that the epipolar lines for different pairs of cameras are
in different directions (as illustrated in Figure 1-5). This is due to the fact that the epi-
polar direction is the same as the direction of displacement between the cameras.

Figure 1-4: Architecture of Stereo Obstacle Detection System (each of the three cameras feeds a LoG filter and image rectification stage; the rectified images go to stereo matching, then obstacle detection/localization)

Figure 1-5: Three cameras in an “L” configuration give different epipolar directions

This
is important in situations where the image has texture in one direction but not in the
other (for example, the top border of the obstacle in Figure 1-3).
Another benefit of adding additional cameras is that it allows multiple measure-
ments at each point. This is useful in increasing accuracy and rejecting noise. Further-
more, a system containing only two cameras can be confused by repeated patterns in
the image (such as lines painted on the road surface). With three cameras, this problem
is eliminated.
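The disambiguation effect can be illustrated numerically: summing error curves from camera pairs with different baselines (after normalising their disparity axes to a common baseline, as multibaseline stereo does) leaves the true minimum reinforced while the spurious minima wash out. The curves below are synthetic stand-ins, not data from the system.

```python
import numpy as np

# Two SAD-style curves on a shared disparity axis. Each pair alone is
# ambiguous: a repeated pattern gives two deep minima, but only the
# true one (d = 40) appears in both pairs.
d = np.arange(100)
true_dip = 8.0 * np.exp(-((d - 40) ** 2) / 8.0)
pair_a = 10.0 - true_dip - 8.5 * np.exp(-((d - 60) ** 2) / 8.0)
pair_b = 10.0 - true_dip - 8.5 * np.exp(-((d - 25) ** 2) / 8.0)
combined = pair_a + pair_b   # only the shared minimum survives

# A single pair locks onto the wrong dip; the pair-sum recovers d = 40.
print(int(np.argmin(pair_a)), int(np.argmin(combined)))  # 60 40
```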
Adding a fourth camera (or more) does provide some additional benefit, but it
becomes much more difficult to perform the stereo matching efficiently.
Figure 1-6 shows the output for the ground plane method from Figure 1-3 if only
two cameras are used. The number of incorrectly matched pixels is much larger.
1.5.3.2. Laplacian of Gaussian (LoG) Filtering
Laplacian of Gaussian filtering is a well-accepted means of extracting features to
match from multiple cameras, while at the same time compensating for differences in
camera gain and bias. We use an LoG filter with a high gain in order to enhance the
texture of the otherwise featureless gray asphalt. The results of this filtering are shown
in Figure 1-7. The increase in image texture is very apparent.
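A sketch of such a filter follows. The kernel is forced to zero DC gain so that a constant offset between cameras (bias) produces no response; the sigma value and window sizes are illustrative choices, not the parameters used by the system.

```python
import numpy as np

def log_kernel(sigma, radius=None):
    """Sample a Laplacian-of-Gaussian kernel and subtract its mean so
    the filter has zero DC gain (flat regions map to zero)."""
    r = radius or int(3 * sigma)
    y, x = np.mgrid[-r:r + 1, -r:r + 1].astype(np.float64)
    s2 = sigma * sigma
    g = np.exp(-(x * x + y * y) / (2 * s2))
    k = (x * x + y * y - 2 * s2) / (s2 * s2) * g
    return k - k.mean()

def convolve2d_same(img, k):
    """Naive 'same'-size correlation (the LoG kernel is symmetric, so
    correlation and convolution coincide)."""
    r = k.shape[0] // 2
    pad = np.pad(img.astype(np.float64), r, mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = (pad[i:i + 2 * r + 1, j:j + 2 * r + 1] * k).sum()
    return out

# A flat region (e.g. uniform camera bias) produces no response.
flat = np.full((9, 9), 100.0)
print(round(float(np.abs(convolve2d_same(flat, log_kernel(1.0))).max()), 6))  # 0.0
```

In this scheme the "high gain" the text mentions would amount to scaling the filter output before quantization.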
Figure 1-6: Example using only two cameras

The importance of the LoG filter to our algorithm is illustrated in Figure 1-8. The
lack of image texture on the road surface causes the entire region to be unmatchable,
though regions with higher texture, such as the obstacle itself and the curb, are still
computed correctly.
Figure 1-7: Image before and after LoG filtering
Figure 1-8: Example of stereo output without LoG filter
1.5.4. Obstacle Detection from Stereo Output
As discussed in Section 1.5.2, our method involves performing two types of stereo
matching (for vertical and horizontal surfaces), and comparing the absolute errors to
determine if a particular image region belongs to a vertical or horizontal surface. The
result of this is shown in Figure 1-9. The regions shown in the lower image are coded
by the size of the difference between the minimum errors. Brighter regions indicate
that the vertical match is much better than the ground plane match. Thus regions which
appear white are most likely to be vertical, and black regions are most likely to be hor-
izontal.
Figure 1-9: Detected vertical surfaces

Regions of very low texture (such as the white stripe down the side of the road)
sometimes match well as vertical surfaces because of differences between the individual
cameras being used.
In order to remove such false obstacles from consideration, we use a very simple
confidence measure. For regions which are actual vertical surfaces, we expect that the
traditional stereo matching method will return a relatively large number of pixels at
approximately the same depth. Conversely, if a region belongs to a horizontal plane,
we would expect the traditional method to report a number of different depths. Using
standard connected components labeling methods on the disparity image generated
from traditional stereo matching, we get the image of Figure 1-10. This image encodes
the size (in pixels) of the region to which each pixel belongs. Large regions appear
brighter, and these regions are more likely to be obstacles. By requiring detected
obstacle regions to pass this consistency check, we can remove most false positive
detections.
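The consistency check can be sketched as follows (hypothetical disparity values; a simple 4-connected flood fill stands in for whatever labeling routine the system uses):

```python
# Minimal sketch of the consistency check on hypothetical data: label
# connected regions of constant disparity, then score each pixel by the
# size of its region. Real vertical surfaces yield one large region.
from collections import deque

def region_sizes(disparity):
    """4-connected components of equal disparity; returns size per pixel."""
    rows, cols = len(disparity), len(disparity[0])
    label = [[-1] * cols for _ in range(rows)]
    sizes = [[0] * cols for _ in range(rows)]
    next_label = 0
    for r0 in range(rows):
        for c0 in range(cols):
            if label[r0][c0] != -1:
                continue
            queue, members = deque([(r0, c0)]), [(r0, c0)]
            label[r0][c0] = next_label
            while queue:
                r, c = queue.popleft()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < rows and 0 <= cc < cols
                            and label[rr][cc] == -1
                            and disparity[rr][cc] == disparity[r][c]):
                        label[rr][cc] = next_label
                        queue.append((rr, cc))
                        members.append((rr, cc))
            for r, c in members:
                sizes[r][c] = len(members)
            next_label += 1
    return sizes

# A real obstacle gives a block of equal disparities (7); the road gives
# a gradient of depths, so no single disparity forms a large region.
disp = [
    [1, 2, 3, 4, 5],
    [2, 7, 7, 4, 5],
    [3, 7, 7, 5, 6],
    [4, 5, 6, 6, 7],
]
sizes = region_sizes(disp)
print(sizes[1][1])   # 4: the obstacle block is the largest region
```

Thresholding the per-pixel region size then suppresses isolated, inconsistent depth readings on the road surface.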
Combining the images of Figure 1-9 and Figure 1-10, we get the detected obstacle
output of Figure 1-11. Obstacles are shown in black. This example shows two 14cm
high obstacles, which are pieces of wood painted white and black. The obstacles are
100m in front of the vehicle.
1.5.5. Results
Figure 1-10: Size of regions of constant disparity

We have collected a set of test data using wooden obstacles of four different
heights (9, 14, 19, and 29cm) and three different colors (white, black, and gray) at
measured distances from 50 meters to 150 meters.
Figure 1-12 shows the accuracy of the detected range for all 12 obstacles. As
expected, the measured range is very accurate when the object is close, and gets
increasingly less accurate as the obstacle gets farther away.
The results of running the obstacle detection system are shown in Table 1-1. This
table shows that we were successfully able to detect obstacles that are bigger than 9cm
at up to 110m.

Figure 1-11: Detected Obstacles

Figure 1-12: Stereo range accuracy (detected range vs. actual range, both in meters)
Figure 1-13 shows an example trace of an obstacle detection run. The vehicle
moved at a constant rate (about 25 km/h) toward a 14cm black obstacle of the type
shown in Figure 1-9. The data was taken at 15 fps and processed off-line. The obstacle
is detected in every frame of the data, out to a maximum range of approximately
110m (which is the beginning of the data set).

Table 1-1: Obstacle Detection Results

         Black                   Grey                    White
         9cm  14cm 19cm 30cm    9cm  14cm 19cm 30cm    9cm  14cm 19cm 30cm
  50m     ✓    ✓    ✓    ✓      ✓    ✓    ✓    ✓      ✓    ✓    ✓    ✓
  60m     ✓    ✓    ✓    ✓      ✕    ✓    ✓    ✓      ✓    ✓    ✓    ✓
  70m     ✓    ✓    ✓    ✓      ✓    ✓    ✓    ✓      ✓    ✓    ✓    ✓
  80m     ✓    ✓    ✓    ✓      ✓    ✓    ✓    ✓      ✓    ✓    ✓    ✓
  90m     ✓    ✓    ✓    ✓      ✕    ✓    ✓    ✓      ✓    ✓    ✓    ✓
 100m     ✓    ✓    ✓    ✓      ✕    ✓    ✓    ✓      ✓    ✓    ✓    ✓
 110m     ✓    ✓    ✓    ✓      ✕    ✓    ✓    ✓      ✕    ✓    ✓    ✓
 120m     ✓    ✕    ✓    ✓      ✓    ✕    ✓    ✓      ✕    ✓    ✓    ✓
 130m     ✕    ✕    ✓    ✕      ✕    ✓    ✓    ✓      ✕    ✕    ✕    ✕
 140m     ✓    ✕    ✓    ✓      ✓    ✓    ✓    ✓      ✕    ✕    ✕    ✕
 150m     ✓    ✓    ✓    ✓      ✓    ✓    ✓    ✓      ✕    ✕    ✕    ✕

Figure 1-13: Detection trace for 14cm obstacle (detected range in meters vs. frame number at 15 fps)
Figure 1-14 shows the same type of trace, this time for a standard 12oz (350ml)
white soda can. The soda can is first reliably detected at 57m.
Each of the previous examples has shown only the detections that actually repre-
sented the obstacle. Of course, there are many more detected objects. A full trace is
shown in Figure 1-15, along with an example image from the set and a diagram show-
ing an overhead view of the scene. The detections can be divided into three sets, repre-
senting the obstacle, the curb behind the obstacle, and the building in the background.
Also note that there are no false detections that are closer than the obstacle.
1.6. Thesis Outline
This thesis consists of a number of chapters, which are divided according to the
major topics to be presented. Each chapter begins with an introduction and a separate
discussion of related work. This is followed by a detailed discussion of the topic at
hand.
Figure 1-14: Detection trace for a soda can (detected range in meters vs. frame number at 4 fps)
Chapter 2 briefly introduces the mathematics of projective geometry that will be
used throughout this document. Assuming a pinhole camera model, we derive a com-
pletely general mathematical model for multiple images of a static scene. This chapter
provides some of the fundamental equations upon which the stereo obstacle detection
system is built.
Chapter 3 introduces the problem of calibrating a set of cameras to be used for
multibaseline stereo. A weak calibration method is presented that allows the determi-
nation of just enough parameters to allow the computation of stereo disparity. In order
to perform this calibration, all that is required is images of two planar surfaces (for
example, a wall and a relatively flat patch of ground) in the world, taken from each of
the cameras. Since we are interested in viewing small objects at long range, additional
methods are presented that provide a means to compute these parameters very accu-
rately by adding images of additional planar surfaces, which may be obtained by mov-
ing the vehicle and capturing images at different distances.
Figure 1-15: Other detected points (detected range in meters vs. frame number at 4 fps), with an example image from the set and an overhead diagram of the vehicle, road, obstacle, curb, and building
In Chapter 3 we also present a method for performing metric calibration of the ste-
reo system. A metric calibration provides a mapping from the natural coordinates for
stereo processing (pixels and disparity values) into 3D coordinates that can be used for
vehicle control. The accuracy requirements for obstacle position are much less strin-
gent than for stereo matching, since we cannot expect millimeter precision in position
at 100 meter range. The geometry of the situation, as well as other factors, prohibit
such high accuracy position recovery. The method that is described makes use of three
images of a vertical plane at known distances, a horizontal ground plane, and two
points within the vertical plane at known lateral positions. The data for both calibrations
can thus be collected at the same time.
Chapter 4 presents the stereo algorithm that is used as the basis for the obstacle
detection system. This chapter discusses the stereo algorithm at a high level, in terms
of what sort of processing is necessary to produce high-quality output. The discussion
of how this algorithm can be implemented efficiently is left until Chapter 5. The first
section of this chapter discusses the benefits of using at least three cameras in the ste-
reo system. Following that, the stages of the stereo algorithm are presented in order of
processing. The first step of the algorithm is preprocessing using an LoG filter. The
reasons why such preprocessing is necessary are discussed in detail. The next step of
the algorithm is rectification and interpolation of the images. The method used for
interpolation is discussed in some detail, but the discussion of rectification is delayed
until Chapter 5, where it will be presented at great length. The next stage of the algo-
rithm is the actual stereo matching. A number of different metrics for image similarity
are presented, and a discussion of the benefits and drawbacks of each follows. This
chapter ends with a discussion of sub-pixel interpolation.
Chapter 5 is devoted to mid-level implementation issues. These issues are those
that are not high-level algorithmic issues such as those presented in Chapter 4, but yet
are still at a high enough level that they apply to any general-purpose computing plat-
form. Since the research presented in this dissertation has grown out of an attempt to
use the CMU Video-Rate Multibaseline Stereo Machine for obstacle detection, the
algorithm used by that machine is presented first. Details of the software implementation
of this algorithm are presented in the sections that follow. Following a section on
the implementation of the LoG filter, the stereo matching main loop is presented in
pseudo-code. The discussion of rectification methods is closely tied to a discussion of
the memory and cache performance required by three different possible implementations
of the stereo main loop. The chapter ends with the presentation of benchmark
data that supports the analysis of memory usage by the stereo algorithm, and shows the
significant performance improvements that can be achieved by attention to memory
usage.

Chapter 6 discusses how the output of the stereo algorithm can be used to build an
effective obstacle detection system. First, I present the major problem posed by trying
to apply traditional stereo techniques to a highway environment. The solution to this
problem, presented in the next section, is something that I call "Ground-Plane Stereo",
which is equivalent to what others have called "tilted-horopter stereo". The sections
that follow describe how the combination of traditional stereo and ground-plane stereo
can be used to determine the orientation of the surfaces being viewed, and how this
orientation can be used as a cue for obstacle detection.

Chapter 7 presents results obtained using the system. The first section examines
the performance of the stereo algorithm and the importance of multiple cameras and
LoG filtering. This is followed by an analysis of the accuracy of stereo range when
applied to detected obstacles. The rest of the chapter presents actual obstacle detection
results. The algorithm was tested on a variety of obstacles of different sizes and colors.
The results of these tests show that the system is capable of achieving our stated goal,
detecting a 15 centimeter obstacle 100 meters in front of the vehicle, under daylight
conditions. In fact, the system is capable of detecting the obstacle at even larger distances.
A series of tests was also run at nighttime to determine whether the system will
continue to function at night. At night, the ability of the system to detect obstacles is
limited by the extent of the region illuminated by the vehicle's headlights, which for
low beams is much less than 100 meters. In addition to the single obstacle detection
runs, we have also performed repeated tests in order to determine the repeatability and
reliability of the results. These tests give us some idea of the probability of detection
versus range for a particular obstacle.

Finally, Chapter 8 takes a look at the contributions of this thesis, and possible
future work. The contributions of this dissertation are presented in three main areas:
camera calibration, the stereo algorithm itself, and obstacle detection. There are several
interesting directions in which this research can be extended; I conclude with a
look at a number of possible topics for future work.
Chapter 2
Mathematical Fundamentals
Much of the research described in this thesis depends on projective geometry. The
definitive reference for projective geometry as it is applied to computer vision is
[Faugeras 93]. While it is an excellent and complete reference, a more concise deriva-
tion of the necessary equations is possible for the special case that we consider in this
thesis: a set of images of a static world, each taken from a different viewpoint. This
chapter presents a derivation of these necessary equations.
The derivation presented here is simplified by choosing a special coordinate sys-
tem whose origin is located at the camera focus and whose axes are aligned with the
camera axes. This simplification allows a more concise derivation of the fundamental
equations describing a system of multiple cameras, without any loss of generality. It
also eliminates some of the confusion that can be caused by presenting a mapping
between 3D homogeneous coordinates and 2D homogeneous coordinates by avoiding
3D homogeneous coordinates altogether.
First, the basics of projective geometry for stereo will be presented, with a deriva-
tion of the fundamental stereo equation and the epipolar geometry. This is followed by
the derivation of homography matrices relating multiple images of a planar surface,
and a brief look at the fundamental matrix.
2.1. Mathematics of Stereo Vision
Projective geometry provides a useful set of tools for thinking about computer
vision problems. The main idea of projective geometry is that image coordinates
(inherently a 2-D space of columns c and rows r) can be represented as 3-D homoge-
neous coordinates, by the following relationship:
$$\begin{bmatrix} c \\ r \end{bmatrix} \Leftrightarrow \begin{bmatrix} \alpha c \\ \alpha r \\ \alpha \end{bmatrix} \tag{2-1}$$
So, to convert from a 3-D homogeneous coordinate to a 2-D image coordinate, all
that is needed is to divide each of the first two elements by the third. This is a many-to-
one mapping. To convert a 2-D image coordinate into a homogeneous coordinate, we
can choose an arbitrary third coordinate (usually we choose 1 for simplicity) and mul-
tiply the column and row by this element.
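The conversion can be sketched as follows (the helper names are illustrative, not from the thesis):

```python
# Sketch of the 2-D <-> homogeneous conversion described above
# (illustrative helper names, not from the thesis).

def to_image(h):
    """Homogeneous (a, b, w) -> image (c, r): divide by the third element."""
    a, b, w = h
    return (a / w, b / w)

def to_homogeneous(c, r, w=1.0):
    """Image (c, r) -> homogeneous, with an arbitrary third coordinate w."""
    return (c * w, r * w, w)

# Every scalar multiple maps to the same image point (many-to-one):
print(to_image((320.0, 240.0, 1.0)))   # (320.0, 240.0)
print(to_image((640.0, 480.0, 2.0)))   # (320.0, 240.0)
```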
What makes this concept useful is that camera projections can be written as linear
equations in homogeneous coordinates. Suppose we have the camera geometry shown
in Figure 2-1. A set of camera coordinates (x,y,z) are defined with the origin at the
focus of the camera. The z axis is aligned with the camera viewing direction. In the
image plane we define the coordinate system in terms of rows and columns (c,r). If we
then define the 3x3 matrix A:
$$A = \begin{bmatrix} f & 0 & u \\ 0 & \gamma f & v \\ 0 & 0 & 1 \end{bmatrix} \tag{2-2}$$

then we can represent the mapping from camera coordinates (x, y, z) to image coordinates
(c, r) by:

$$A\begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} fx + uz \\ \gamma fy + vz \\ z \end{bmatrix} = z\begin{bmatrix} \frac{fx}{z} + u \\ \frac{\gamma fy}{z} + v \\ 1 \end{bmatrix} = z\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} \tag{2-3}$$

Figure 2-1: Geometry of camera projection

where the equations $c = \frac{fx}{z} + u$ and $r = \frac{\gamma fy}{z} + v$ can be easily derived from the
geometry of similar triangles; f is the focal length of the camera, γ is the aspect ratio,
and (u,v) is the image center of the camera. This equation provides a compact and sim-
ple representation of the camera geometry, turning a nonlinear equation into a linear
equation.
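A quick numeric check of this projection, with made-up intrinsic parameter values:

```python
# Numeric check of equations (2-2)/(2-3): multiplying camera coordinates
# by A and dividing by z reproduces c = f*x/z + u, r = gamma*f*y/z + v.
# The parameter values are invented for illustration.

f, gamma, u, v = 800.0, 1.0, 320.0, 240.0
A = [[f, 0.0, u],
     [0.0, gamma * f, v],
     [0.0, 0.0, 1.0]]

def matvec(M, p):
    return [sum(M[i][j] * p[j] for j in range(3)) for i in range(3)]

x, y, z = 1.0, -0.5, 10.0
proj = matvec(A, [x, y, z])          # = (f*x + u*z, gamma*f*y + v*z, z)
c, r = proj[0] / proj[2], proj[1] / proj[2]

assert abs(c - (f * x / z + u)) < 1e-12
assert abs(r - (gamma * f * y / z + v)) < 1e-12
print(c, r)   # 400.0 200.0
```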
Note also that since the matrix A is invertible, equation (2-3) can be inverted:
$$\begin{bmatrix} x \\ y \\ z \end{bmatrix} = zA^{-1}\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} \tag{2-4}$$

For each point (c,r) in the image, this equation tells us the corresponding line in world
coordinates, parameterized by z.

Now suppose that we have two cameras, represented by primed coordinates
((c',r'), (x',y',z'), and A') and unprimed coordinates ((c,r), (x,y,z), and A). If we also
know the rotation and translation between the two camera coordinate systems, represented
by the 3x3 rotation matrix R and the 3-D translation vector t, so that

$$\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = R\begin{bmatrix} x \\ y \\ z \end{bmatrix} + t \tag{2-5}$$

then we can substitute equation (2-4) into equation (2-5) twice (once for each camera)
and simplify, giving us

$$z'\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} = A'\left(RA^{-1}z\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + t\right) = zA'RA^{-1}\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + A't \tag{2-6}$$

This equation embodies the relationship between points in two different images of the
same scene. If we define $H_\infty = A'RA^{-1}$ (a 3x3 matrix) and $e = A't$ (a 3-vector),
then this equation becomes:

$$z'\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} = zH_\infty\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + e \tag{2-7}$$

From this equation, we can see the following:

• in the limit as z approaches infinity, the effects of e become negligible, and

$$z'\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} = zH_\infty\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} \tag{2-8}$$

• in the limit as $z/z'$ approaches zero,

$$z'\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} = e \tag{2-9}$$

From these equations we can see that for any given point (c,r) in the first camera,
the point (c',r') in the second camera must lie on the line connecting e (called the epipole,
which is the image of one camera's focus in the other camera) to the point
$H_\infty[c\ r\ 1]^T$ (which is the point at infinity, depending only on the rotation between the
cameras). This line is called the epipolar line. In particular, the point must lie between
e and $H_\infty[c\ r\ 1]^T$ on this line.

2.1.1. Homography Matrices

Being a mapping from a 3-D space (c,r,z) to a 2-D space (c',r'), equation (2-7) is
of course not invertible. But if instead of taking images of a general scene, we
take images of a planar surface (such as a wall or the road surface), we can add an
additional constraint. One way of expressing the general equation of a plane is:
$$n^T\begin{bmatrix} x \\ y \\ z \end{bmatrix} = d \tag{2-10}$$

where n is the unit normal vector to the plane, and d is the normal distance of the
plane from the origin. This can be rewritten as:

$$\frac{1}{d}n^TA^{-1}z\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} = 1 \tag{2-11}$$

If we now multiply the e in equation (2-7) by equation (2-11), we get

$$z'\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} = zH_\infty\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + \frac{en^T}{d}A^{-1}z\begin{bmatrix} c \\ r \\ 1 \end{bmatrix}, \qquad \frac{z'}{z}\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} = \left(H_\infty + \frac{en^T}{d}A^{-1}\right)\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} \tag{2-12}$$

Note that this is a linear equation which relates the coordinates of points in the two
images of a planar surface defined by the parameters n and d. The 3x3 matrix
$H_\infty + \frac{en^T}{d}A^{-1}$ is called a homography matrix. Note also that as d goes to infinity, the
homography matrix becomes $H_\infty$. Although equation (2-12) refers to a particular
matrix that we can compute, we must note that if we are to try to compute any homography
matrix (including $H_\infty$) directly from matching sets of image points, we will only
be able to determine it up to a scale factor.

For each point in one image, a homography matrix defines one location in the
other image on the epipolar line corresponding to that point. Thus two homography
matrices yield two points on the epipolar line for each pixel, which is enough to determine
the epipolar geometry, including the epipole e. It is not possible to compute $H_\infty$
from general homography matrices without other information.
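The epipolar-line geometry of equation (2-7) can be checked numerically. The sketch below uses made-up intrinsics and a deliberately simple sideways-translation geometry (identity rotation), an assumption made only for clarity of the example:

```python
# Numeric sketch of the epipolar line implied by equation (2-7), with
# invented camera parameters: as z varies, the matched point stays
# collinear with H_inf*[c,r,1]^T and the epipole e.

def matvec(M, p):
    return [sum(M[i][j] * p[j] for j in range(3)) for i in range(3)]

def cross(a, b):
    return [a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Identical intrinsics, identity rotation, pure sideways translation
# (a simple rectified-style geometry, chosen for clarity).
A = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
H_inf = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # A' R A^-1 with A'=A, R=I
t = [-1.2, 0.0, 0.0]                  # 1.2 m baseline
e = matvec(A, t)                      # epipole e = A' t

p = [400.0, 260.0, 1.0]               # a pixel (c, r) in the first image
for z in (10.0, 50.0, 100.0):
    q = [z * a + b for a, b in zip(matvec(H_inf, p), e)]   # z' [c',r',1]^T
    # q lies in the plane spanned by H_inf*p and e: triple product is zero.
    assert abs(dot(q, cross(matvec(H_inf, p), e))) < 1e-6
    # Equivalently, q is annihilated by e x (H_inf p) (the fundamental-
    # matrix constraint derived in the next section).
    assert abs(dot(q, cross(e, matvec(H_inf, p)))) < 1e-6
```

For every depth z the predicted second-image point sits on the same line through $H_\infty p$ and $e$, which is exactly the disparity search line used by the matcher.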
2.1.2. Fundamental and Essential Matrices
If we take the cross product with e on both sides of equation (2-7), we get

$$e \times z'\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} = e \times zH_\infty\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + e \times e \tag{2-13}$$

We then take the dot product with $[c'\ r'\ 1]^T$; since both $e \times e$ and the dot product of
$[c'\ r'\ 1]^T$ with $e \times z'[c'\ r'\ 1]^T$ vanish, this yields

$$\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} \cdot \left(e \times H_\infty\begin{bmatrix} c \\ r \\ 1 \end{bmatrix}\right) = 0 \tag{2-14}$$

The matrix quantity $[e]_\times H_\infty$ is called the fundamental matrix. It encodes information
about the epipolar line for each pixel, but the information about the endpoints (e
and $H_\infty[c\ r\ 1]^T$) is lost.

Another related matrix is the essential matrix, due to Longuet-Higgins
[Longuet-Higgins 81]. It can be defined as

$$E = A'^{T}FA \tag{2-15}$$

which describes the relationship between the world coordinates of points observed in
the frames of reference of the two cameras, via the following equation

$$\begin{bmatrix} x' & y' & z' \end{bmatrix}E\begin{bmatrix} x \\ y \\ z \end{bmatrix} = 0 \tag{2-16}$$

where (x,y,z) and (x',y',z') are the world coordinates of a single point observed in
the coordinate systems of the two cameras (or any scalar multiples thereof).

2.1.3. Relationship Between Homography Matrices

Given two homography matrices $H_1$ and $H_2$,

$$H_2 - H_1 = \left(H_\infty + \frac{en_2^TA^{-1}}{d_2}\right) - \left(H_\infty + \frac{en_1^TA^{-1}}{d_1}\right) = e\left(\frac{n_2}{d_2} - \frac{n_1}{d_1}\right)^TA^{-1} \tag{2-17}$$

If we define n' and d' such that $\frac{n'}{d'} = \frac{n_2}{d_2} - \frac{n_1}{d_1}$, then we can write $H_2$ as

$$H_2 = H_1 + (H_2 - H_1) = H_1 + \frac{en'^T}{d'}A^{-1} \tag{2-18}$$

which has the same form as the general homography matrix in equation (2-12). This
indicates that it is not necessary to know $H_\infty$ in order to know the epipolar geometry.
Any pair of homography matrices can be used to define two points on the epipolar line
for each pixel. The epipole e can be computed (up to a scale factor) from any two
homography matrices, since the result of equation (2-17) is a rank 1 matrix (it is
the outer product of two 3-vectors); any rank 1 matrix can be decomposed into two
component vectors with the only ambiguity being what scale to assign to each vector.
Furthermore, all homography matrices for a given camera geometry belong to a three-dimensional
affine subspace of the set of all 3x3 invertible matrices, which can be
parameterized by $n'/d'$. $H_\infty$ is one special member of this subspace.

It is necessary to note, as we did in Section 2.1.1, that in general we can only compute
homography matrices up to an unknown scale factor. If the two homographies $H_1$
and $H_2$ do not share the same scale factor, then equation (2-17) is meaningless, since
the $H_\infty$ terms will not cancel. Therefore, one must be very careful to somehow compute
the relative scale of the two matrices when attempting to apply equations of the
form of equation (2-17). This matter will be further addressed in Section 3.2.3.
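The rank-1 structure of equation (2-17) can be verified numerically. The camera parameters and planes below are invented for illustration, and the two homographies are built with a consistent relative scale (the caveat just noted):

```python
# Numeric sketch of equation (2-17) with made-up parameters: the
# difference of two plane homographies (sharing one scale) is rank 1,
# since it equals e (n2/d2 - n1/d1)^T A^-1, and its nonzero columns are
# all proportional to the epipole e.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def outer(a, b):
    return [[a[i] * b[j] for j in range(3)] for i in range(3)]

A_inv = [[1 / 800.0, 0.0, -320.0 / 800.0],
         [0.0, 1 / 800.0, -240.0 / 800.0],
         [0.0, 0.0, 1.0]]                 # inverse of illustrative intrinsics
H_inf = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
e = [-960.0, 0.0, 0.0]

def homography(n, d):
    """H = H_inf + e n^T A^-1 / d for a plane with unit normal n, distance d."""
    S = matmul(outer(e, n), A_inv)
    return [[H_inf[i][j] + S[i][j] / d for j in range(3)] for i in range(3)]

H1 = homography([0.0, 0.0, 1.0], 20.0)    # a vertical wall 20 m ahead
H2 = homography([0.0, 1.0, 0.0], 1.5)     # the ground plane, 1.5 m below

D = [[H2[i][j] - H1[i][j] for j in range(3)] for i in range(3)]

# Rank 1: every 2x2 minor of D vanishes.
minors = [D[i][j] * D[k][l] - D[i][l] * D[k][j]
          for i in range(3) for k in range(i + 1, 3)
          for j in range(3) for l in range(j + 1, 3)]
assert all(abs(m) < 1e-9 for m in minors)

# A nonzero column of D is a scalar multiple of the epipole e.
col2 = [D[i][2] for i in range(3)]
assert abs(col2[1] * e[0] - col2[0] * e[1]) < 1e-9
assert abs(col2[2] * e[0] - col2[0] * e[2]) < 1e-9
```

Decomposing the rank-1 difference into its two component vectors recovers e up to scale, which is the procedure Section 3.2.3 builds on.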
Chapter 3

Calibration

Perhaps the most important problem for computing stereo range data from a set of
cameras is the problem of accurately calibrating the cameras relative to each other.

In the system that we have implemented, calibration occurs in two steps. The first
step is to do what is known as "weak calibration" for stereo processing. Here "weak"
refers to the fact that we only know enough about the system to do stereo matching. In
particular, the mapping from the results of matching into 3D world coordinates is
unknown. The weak calibration must be done very accurately in order to ensure that
the search for matching pixels between the images is in fact looking for point correspondences
that are geometrically feasible.

The second step is to do metric calibration, which allows us to map from a set of
corresponding points in the images into a 3D (x,y,z) coordinate relative to the camera
in the world. For our application (detection of obstacles on the road surface), we can-
not expect the results of this mapping to be very accurate, since the range resolution
for far-away points is very low. Although we make some attempts to perform the cali-
bration with relatively high accuracy, the accuracy of metric calibration is not as
essential as it is for the weak calibration.
3.1. Related Work
The weak calibration method presented here is an adaptation of a method used pre-
viously at Carnegie Mellon by the Video-Rate Multibaseline Stereo Machine group,
particularly Kazuo Oda and Tak Yoshigahara. Their method is documented in
[Oda 96b], and is based on the weakly calibrated stereo ideas of Faugeras
[Faugeras 92]. I have extended the method further to optimize for multiple planar sur-
faces at the same time, which allows direct computation of the epipoles as well as
being more accurate.
The method used to turn the weak calibration into a metric calibration is com-
pletely ad-hoc, based on the result obtained in equation (3-22). This equation is well
known, having been derived independently in [Faugeras 92] and [Hartley et al. 92].
Another, more principled method for determining the mapping between the results of
weakly calibrated stereo and Euclidean coordinates is presented in
[Devernay & Faugeras 96], though the goal of their method is to recover Euclidean
coordinates without measuring distances to points in the world. The results that they
obtain are thus not metric results, although the mapping between a Euclidean space
and a true metric space can be found by making a small number of measurements.
3.2. Weak Calibration of Multibaseline Stereo
In order to perform stereo matching, for each point (c,r) we need to know what the
possible corresponding points (c’,r’ ) in the second image are. If we only know this
information, the system is said to be weakly calibrated. That is to say that although we
know the set of possible corresponding points between the two images, we do not nec-
essarily know the physical interpretation (i.e., 3D location) of a particular point corre-
spondence. The problem of determining this set of corresponding points is a problem
of calibration.
The fundamental projective equation describing stereo is:
$$z'\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} = zH_\infty\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + e \tag{3-1}$$

which, for each pixel (c,r) in the first image, describes a line segment between
$H_\infty[c\ r\ 1]^T$ and e along which the corresponding point (c',r') must lie.

Given the discussion of Section 2.1, several methods of calibration present themselves:

1. Measure the projection matrix A of each camera, and the translation t and rotation
R between them. Given these parameters, we can compute any of the other
quantities that we need. The main problem with this is that it is very difficult to
measure these parameters accurately. The usual method for measuring A is to
take the camera into a laboratory where very accurate measurements can be
made under controlled conditions. Since we expect that these parameters may
change over time (e.g. because of vehicle vibration), we need a calibration
method that can be done quickly and in place on the vehicle.

2. Measure $H_\infty$ and e. If we know these two quantities, we know both ends of the
epipolar line that we need to search. The problem with this is that it is not always
easy to measure $H_\infty$: doing so requires pointing the stereo system at a
scene that is so far away that it is indistinguishable from infinity. It is possible to
roughly calculate how far away that is for a given system; for ours it is roughly
4250m.
3. Measure the fundamental matrix. There are two problems with this. First is that
the fundamental matrix does not provide information about where the endpoints
of the epipolar line are, so we do not know where to start and end our search.
Secondly, even if we manage to find corresponding points using only the fundamental
matrix, the relationship of these correspondences to the distance from the
camera is unclear.

4. Measure a homography matrix for some "typical" plane, and e. The homography
matrix gives us one corresponding point on the epipolar line for each pixel; e is
another point. We also know that we expect points to lie near the "typical" plane,
so we can search in a region about that point along the epipolar line.

We have chosen solution 4 because it allows us to recalibrate often and quickly, as
well as having other benefits which will be described later.

3.2.1. Image Warping

Given a homography matrix H, it is possible to apply the transformation described
by that matrix to one of the images. This is known as projective image warping. After
such a homography is applied, a point in one image will lie at exactly the same pixel
coordinate in the other image if and only if it lies on the plane described by the homography.

The homography describes a real-valued mapping from the coordinates of one
image to the coordinates of the other. The pixel value at (c',r') of the warped image
should be the value of the original image at:

$$\alpha\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} = H^{-1}\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} \tag{3-2}$$

After division by α, we have a real coordinate (c,r) that represents the location of the
corresponding point in the original image. In reality, since we only have values at discrete
pixel locations, we need to interpolate those values to find the best approximation
to the actual value. In general, bilinear interpolation is sufficient. If I(c,r)
represents the pixel value of the image I at the coordinate (c,r), $c_i$ is the integer part of
c, and $c_f$ is the floating point remainder $(c - c_i)$ (with $r_i$ and $r_f$ defined similarly for r),
then we have:

$$I(c,r) \approx (1-c_f)(1-r_f)\,I(c_i,r_i) + c_f(1-r_f)\,I(c_i+1,r_i) + (1-c_f)\,r_f\,I(c_i,r_i+1) + c_f r_f\,I(c_i+1,r_i+1) \tag{3-3}$$

To simplify the notation, we will use $W(I,H)$ to represent the image obtained from
image I after warping by H. The value of this image at the pixel with coordinates (c,r)
would then be $W(I,H)(c,r)$.

3.2.2. Computing Homography Matrices

Since any given point in a 2D image can be represented by any of an infinite number
of homogeneous coordinates, all scalar multiples of each other, we cannot expect
to directly solve equation (3-1) for the homography matrix (which we will call H).
One way to represent equality in homogeneous coordinates is to write the cross-product
of the two homogeneous coordinates that are supposed to be equal, and set it equal
to zero. This has the effect of constraining the two coordinates to be scalar multiples of
each other.

Thus the problem becomes:

$$x' \times Hx = 0 \tag{3-4}$$

Since, given any solution H to this problem, all scalar multiples of H are also solutions,
we can arbitrarily set one element of H to whatever value we like (in general we
usually set $H_{33}$ to 1) and solve for the other eight. Doing this we get two linear equations
per image point, and a total of eight unknowns, so four point correspondences are
required to compute H.
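The image warping and bilinear interpolation of equations (3-2) and (3-3) above can be sketched together (toy image and homography, illustrative only):

```python
# Sketch of equations (3-2)/(3-3) on a toy image: warp by H by mapping
# each destination pixel back through H^-1 and sampling the source with
# bilinear interpolation.

def matvec(M, p):
    return [sum(M[i][j] * p[j] for j in range(3)) for i in range(3)]

def bilinear(img, c, r):
    ci, ri = int(c), int(r)
    cf, rf = c - ci, r - ri
    return ((1 - cf) * (1 - rf) * img[ri][ci]
            + cf * (1 - rf) * img[ri][ci + 1]
            + (1 - cf) * rf * img[ri + 1][ci]
            + cf * rf * img[ri + 1][ci + 1])

def warp(img, H_inv, out_rows, out_cols):
    out = [[0.0] * out_cols for _ in range(out_rows)]
    for rp in range(out_rows):
        for cp in range(out_cols):
            a, b, alpha = matvec(H_inv, [cp, rp, 1.0])
            c, r = a / alpha, b / alpha          # source location (c, r)
            if 0 <= c < len(img[0]) - 1 and 0 <= r < len(img) - 1:
                out[rp][cp] = bilinear(img, c, r)
    return out

# A pure half-pixel shift in c: H maps c -> c + 0.5, so H^-1 shifts back.
H_inv = [[1.0, 0.0, -0.5], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
img = [[0.0, 10.0, 20.0, 30.0]] * 3
warped = warp(img, H_inv, 3, 4)
print(warped[1][2])   # halfway between 10 and 20 -> 15.0
```

The half-pixel case shows why interpolation matters: the warped value 15.0 exists nowhere in the source image but is exactly what sub-pixel matching needs.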
Ideally, though, we would like to know the parameters of H to very high precision
so that we can accurately compute depth using sub-pixel interpolation template match-
ing.
For an autonomous vehicle, the goal is to recognize small obstacles (as small as
20cm or so) at long range (60-100m in front of the vehicle). In order to accomplish
this, a combination of telephoto lenses and a large baseline becomes necessary. In this
situation, small inaccuracies in the calibration can cause large errors. In one particular
situation that we have studied, using a 1.2m baseline and 35mm lenses, a 1mm error (1
part in 1000) in the computed position of the camera can cause the epipolar line to be
off by as much as 2 pixels in certain parts of the image at extreme disparities. Accurate
computation of homography matrices is therefore essential.
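As a concrete baseline, the four-correspondence linear solve described in Section 3.2.2 can be sketched as follows (synthetic correspondences; plain Gaussian elimination is an implementation choice of this sketch, not necessarily what the thesis implementation used):

```python
# Sketch of the four-point solve for H with H33 fixed to 1: each
# correspondence (c,r)->(c',r') gives two linear equations, so four
# correspondences give an 8x8 system.

def solve(M, b):
    """Plain Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def homography_from_points(pairs):
    rows, rhs = [], []
    for (c, r), (cp, rp) in pairs:
        rows.append([c, r, 1.0, 0.0, 0.0, 0.0, -cp * c, -cp * r]); rhs.append(cp)
        rows.append([0.0, 0.0, 0.0, c, r, 1.0, -rp * c, -rp * r]); rhs.append(rp)
    h = solve(rows, rhs)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]

def apply_h(H, c, r):
    d = H[2][0] * c + H[2][1] * r + 1.0
    return ((H[0][0] * c + H[0][1] * r + H[0][2]) / d,
            (H[1][0] * c + H[1][1] * r + H[1][2]) / d)

# Synthetic test: generate correspondences from a known H, recover it.
H_true = [[1.1, 0.02, 5.0], [0.01, 0.95, -3.0], [1e-4, 2e-4, 1.0]]
pts = [(0.0, 0.0), (100.0, 0.0), (0.0, 80.0), (100.0, 80.0)]
pairs = [(p, apply_h(H_true, *p)) for p in pts]
H_est = homography_from_points(pairs)
assert all(abs(H_est[i][j] - H_true[i][j]) < 1e-6
           for i in range(3) for j in range(3))
```

A solve like this only provides a starting point; as the text explains, the parameters must then be refined by minimizing the image-space residual.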
Thus, some method of accurately determining calibration parameters is necessary.
The most obvious way to do this is to minimize the residual error between one image
and the other image when warped by the homography. The error we want to minimize
is
(3-5)
where I and I’ are the two images to be matched, and W is the region of the image that
corresponds to the planar surface. The standard way to minimize E would be to com-
pute its derivative:
(3-6)
where is the gradient of the warped image (which can also be written in
terms of gradients of the original image if desired), and
E W I′ H,( ) c r,( ) I c r,( )–( )2
c r,( ) W∈∑=
E∂Hij∂
---------- 2 W I′ H,( ) c r,( ) I c r,( )–( ) W I′ H,( )dc'd
------------------------ c' r',( ) c'dHijd
---------- c r,( ) W I′ H,( )dr'd
------------------------ c' r',( ) r'dHijd
---------- c r,( )+
c r,( ) W∈∑=
W I′ H,( )dc’d
------------------------
3.2 Weak Calibration of Multibaseline Stereo 37
(3-7)
Since this equation depends on the image data, we do not expect to find a closed-
form solution for the minimum of E by setting the derivatives to zero. Instead, we
must apply some type of nonlinear optimization to minimize the error. For this, we use
a program that has been in use at Carnegie Mellon for several years
[Oda 96a][Oda 96b]. It asks the user to select four matching points in a set of two
images, and to outline the planar region. This data is used to compute a starting set of
parameters for H. Since most nonlinear optimization techniques need an initial set of
parameters that is close to the minimum, and since the computation of the error and its
derivatives is a very computationally intensive process, we make use of image pyramids
when computing homography matrices.

A lower resolution version of both images is obtained by simply replacing each
block of four adjacent pixels with their average. This is done for each level of the pyr-
amid. The homography matrix parameters for the lower resolution images are derived
by:

    \begin{bmatrix} H_{11} & H_{12} & H_{13} \\ H_{21} & H_{22} & H_{23} \\ H_{31} & H_{32} & H_{33} \end{bmatrix}
    \;\Rightarrow\;
    \begin{bmatrix} H_{11} & H_{12} & \tfrac{1}{2}H_{13} \\ H_{21} & H_{22} & \tfrac{1}{2}H_{23} \\ 2H_{31} & 2H_{32} & H_{33} \end{bmatrix}    (3-8)

It is easy to verify that this gives the correct answer. If

    \alpha \begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix}
    = \begin{bmatrix} H_{11} & H_{12} & H_{13} \\ H_{21} & H_{22} & H_{23} \\ H_{31} & H_{32} & H_{33} \end{bmatrix}
      \begin{bmatrix} c \\ r \\ 1 \end{bmatrix}    (3-9)

then

    \begin{bmatrix} H_{11} & H_{12} & \tfrac{1}{2}H_{13} \\ H_{21} & H_{22} & \tfrac{1}{2}H_{23} \\ 2H_{31} & 2H_{32} & H_{33} \end{bmatrix}
    \begin{bmatrix} \tfrac{1}{2}c \\ \tfrac{1}{2}r \\ 1 \end{bmatrix}
    = \alpha \begin{bmatrix} \tfrac{1}{2}c' \\ \tfrac{1}{2}r' \\ 1 \end{bmatrix}    (3-10)
38 Chapter 3. Calibration
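The rescaling rule of equation (3-8) is easy to verify numerically; this small sketch (hypothetical helper names, not the thesis code) applies a homography at full and half resolution:

```python
def apply_h(H, c, r):
    """Apply a 3x3 homography to pixel (c, r)."""
    w = H[2][0]*c + H[2][1]*r + H[2][2]
    return ((H[0][0]*c + H[0][1]*r + H[0][2]) / w,
            (H[1][0]*c + H[1][1]*r + H[1][2]) / w)

def downsample_h(H):
    """Rescale a homography for images downsampled by two (equation 3-8)."""
    return [[H[0][0],     H[0][1],     0.5*H[0][2]],
            [H[1][0],     H[1][1],     0.5*H[1][2]],
            [2.0*H[2][0], 2.0*H[2][1], H[2][2]]]

H = [[1.02, 0.01, 5.0], [-0.02, 0.99, -3.0], [1e-4, 2e-5, 1.0]]
c, r = 120.0, 80.0
cp, rp = apply_h(H, c, r)                       # full-resolution correspondence
cp2, rp2 = apply_h(downsample_h(H), c/2, r/2)   # half-resolution correspondence
assert abs(cp2 - cp/2) < 1e-9 and abs(rp2 - rp/2) < 1e-9
```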
At each level of the pyramid (starting from the lowest resolution), a Levenberg-
Marquardt nonlinear optimization is used to minimize E (which requires computing the
derivative). The resulting parameters are then transformed for the next higher resolu-
tion level, and the optimization is performed again using these parameters as a starting
point. The results of the total optimization are shown in Figure 3-1: the residuals are
displayed as difference images, with the intensities normalized by the same factor in
order to make the errors visible. For this case, the residual error was reduced by
roughly half.
3.2.3. Finding the Epipole
In order to find the epipole, all that is necessary is to compute two homographies
for different planes:

    H_2 - H_1 = \left( H_\infty + \frac{e\, n_2^T A^{-1}}{d_2} \right)
              - \left( H_\infty + \frac{e\, n_1^T A^{-1}}{d_1} \right)
              = e \left( \frac{n_2}{d_2} - \frac{n_1}{d_1} \right)^T A^{-1}    (3-11)
Note that the resulting matrix is rank 1 (it is the outer product of two 3-vectors). This
means that we can determine both of the vectors, but only to within a scale factor.
3.2 Weak Calibration of Multibaseline Stereo 39
Since any scalar multiple of a homogeneous coordinate represents the same point, this
is all that is necessary.

Figure 3-1: Results of homography computation (left image; right image; residual
after choosing four points; residual after optimization).

There is one problem with this, however. Since we were able to compute the
homographies only up to an arbitrary scale factor, the cancellation of H_\infty in
equation (3-11) is not possible unless we compute the relative scale factors
of the two matrices.

To accomplish this, we use the fact that the difference between the two homogra-
phies is rank 1 for the correct scale factor. So we simply have to find β such that
H_1 - \beta H_2 is rank 1. In general, because of rounding errors and imperfect assumptions
made by our model, there will be no β that accomplishes this exactly. We evaluate how
good any given β is by computing the Singular Value Decomposition of H_1 - \beta H_2 and
taking the ratio of the largest and second largest singular values. Finding the best value
of β then becomes a simple 1D optimization problem which can be solved by any
number of methods.
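As a concrete illustration of this 1D search, the following sketch (pure Python; a simple grid search and a closed-form symmetric 3×3 eigensolver stand in for whatever SVD routine and optimizer the thesis software actually used) scores each β by the ratio of the two largest singular values of H_1 - \beta H_2:

```python
import math

def sym_eigvals3(A):
    # eigenvalues of a symmetric 3x3 matrix, largest first (trigonometric form)
    a, b, c = A[0][0], A[1][1], A[2][2]
    d, e, f = A[0][1], A[0][2], A[1][2]
    q = (a + b + c) / 3.0
    p2 = (a-q)**2 + (b-q)**2 + (c-q)**2 + 2.0*(d*d + e*e + f*f)
    p = math.sqrt(p2 / 6.0) or 1e-30
    B = [[(A[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    detB = (B[0][0]*(B[1][1]*B[2][2] - B[1][2]*B[2][1])
            - B[0][1]*(B[1][0]*B[2][2] - B[1][2]*B[2][0])
            + B[0][2]*(B[1][0]*B[2][1] - B[1][1]*B[2][0]))
    phi = math.acos(max(-1.0, min(1.0, detB / 2.0))) / 3.0
    e1 = q + 2.0*p*math.cos(phi)
    e3 = q + 2.0*p*math.cos(phi + 2.0*math.pi/3.0)
    return sorted([e1, 3.0*q - e1 - e3, e3], reverse=True)

def rank1_score(M):
    # ratio of second-largest to largest singular value (0 for an exact rank-1 matrix)
    MtM = [[sum(M[k][i]*M[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
    s = [math.sqrt(max(v, 0.0)) for v in sym_eigvals3(MtM)]
    return s[1]/s[0] if s[0] > 0.0 else 0.0

# two homographies of the form H_inf + e n^T, the second with an unknown scale 0.7
H_inf = [[1.0, 0.01, 2.0], [0.0, 1.0, -1.0], [1e-4, 0.0, 1.0]]
e_vec = [0.5, -0.2, 0.01]
n1, n2 = [0.01, 0.02, 0.3], [0.03, -0.01, 0.5]
H1 = [[H_inf[i][j] + e_vec[i]*n1[j] for j in range(3)] for i in range(3)]
H2 = [[0.7*(H_inf[i][j] + e_vec[i]*n2[j]) for j in range(3)] for i in range(3)]

diff = lambda b: [[H1[i][j] - b*H2[i][j] for j in range(3)] for i in range(3)]
best_beta = min((0.5 + 0.002*i for i in range(1001)), key=lambda b: rank1_score(diff(b)))
assert abs(best_beta - 1.0/0.7) < 0.005   # H1 - beta*H2 is rank 1 at beta = 1/0.7
```

Any smarter 1D optimizer (golden-section, Brent) would do; the grid search only keeps the sketch short.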
The mathematics described here will be used in Section 5.3.3, when we discuss
Image Rectification, which is a process by which the stereo search is set up to be a
very regular computation which can be implemented efficiently.
3.2.4. Improving Accuracy of Recovered Parameters
As was noted in the last section, the computation of the epipole e from a pair of
homography matrices requires a second step to normalize the difference between the
matrices so that it is rank 1. This is due to the fact that the class of all homography
matrices, for a given camera geometry, is such that the difference between any two
matrices must be a rank 1 matrix (as can be seen from equation (3-11)). Thus the com-
putation of two distinct homography matrices, optimizing 16 separate parameters, has
too many degrees of freedom. Another way to express this is that once we have the
first homography matrix for a pair of cameras, all we need to know to compute another
homography matrix are the two 3-vectors e and \left( \frac{n_2}{d_2} - \frac{n_1}{d_1} \right)^T A^{-1}. Since we are taking
the outer product of these two vectors, their relative scale doesn't matter (we can
divide one vector by some quantity and multiply the other by the same quantity and
still get the same homography matrix). This means that the first homography that we
compute for a pair of cameras requires eight parameters, but the second one only
requires five. Furthermore, if we know two homographies for a pair of cameras, the
third and subsequent ones only require three parameters each (since we already know
e up to a scale factor). Similarly, if we already know one set of homographies for a
system with two baselines, we need to determine two different values for e, but the
other vector is the same in both cases. The number of parameters necessary to describe
a set of homographies for a set of cameras is summarized in Table 3-1. For example, a
set of 3 cameras with 3 planes requires 27 parameters.
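The counts in Table 3-1 can be turned into a small bookkeeping function; this sketch (my helper, not from the thesis software) reproduces the 27-parameter example:

```python
def n_params(cameras, planes):
    """Parameter count for a set of homographies, following Table 3-1."""
    first = 8 + (5 if planes >= 2 else 0) + 3*max(planes - 2, 0)   # first baseline
    extra = 8 + (3 if planes >= 2 else 0)   # each additional baseline; later planes are free
    return first + (cameras - 2)*extra

assert n_params(3, 3) == 27   # the example given in the text
assert n_params(2, 1) == 8    # a single homography for one camera pair
```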
We define a new error metric E':

    E' = \sum_{p \in P} \sum_{b \in B} \sum_{(c,r) \in W}
         \big( \mathcal{W}(I_{bp}, H_{bp})(c,r) - I_{0p}(c,r) \big)^2    (3-12)

where B is the set of baselines (numbered from 1 to the number of cameras) and P is
the set of planar surfaces for which we have images. Image 0 is used as a reference
image which is compared to all of the other images. From the previous discussion, the
parameters necessary to compute H_{bp}, the homography matrix for a particular base-
line b and planar surface p, are:
• one full homography matrix for each baseline: H_b
• one 3-vector for each baseline, representing the epipole: e_b
• one 3-vector for each planar surface: n_p, defined by n_p^T = \left( \frac{n_p}{d_p} - \frac{n_0}{d_0} \right)^T A_0^{-1}. Note that n_0 = 0.
The equation for H_{bp} is then:

    H_{bp} = H_b + e_b n_p^T    (3-13)

The equations for the derivatives of E' can be obtained from equation (3-6) and
equation (3-12) as follows:
Table 3-1: Parameters needed to describe a set of homographies for multiple planes and multiple cameras

                               first baseline    second and additional baselines
first plane                          8                        8
second plane                         5                        3
third and additional planes          3                        0
    \frac{\partial E'}{\partial (H_{b'})_{ij}} = \sum_{p \in P} \sum_{b \in B} \sum_{k,l}
        \frac{\partial E_{bp}}{\partial (H_{bp})_{kl}} \cdot \frac{\partial (H_{bp})_{kl}}{\partial (H_{b'})_{ij}}

    \frac{\partial E'}{\partial (e_{b'})_{i}} = \sum_{p \in P} \sum_{b \in B} \sum_{k,l}
        \frac{\partial E_{bp}}{\partial (H_{bp})_{kl}} \cdot \frac{\partial (H_{bp})_{kl}}{\partial (e_{b'})_{i}}

    \frac{\partial E'}{\partial (n_{p'})_{i}} = \sum_{p \in P} \sum_{b \in B} \sum_{k,l}
        \frac{\partial E_{bp}}{\partial (H_{bp})_{kl}} \cdot \frac{\partial (H_{bp})_{kl}}{\partial (n_{p'})_{i}}    (3-14)

where \partial E_{bp} / \partial (H_{bp})_{kl} refers to the quantity in equation (3-6), computed for a particular
baseline and planar surface. The missing pieces are:

    \frac{\partial (H_{bp})_{kl}}{\partial (H_{b'})_{ij}} =
        \begin{cases} 1 & \text{if } b' = b,\ k = i,\ \text{and } l = j \\ 0 & \text{otherwise} \end{cases}

    \frac{\partial (H_{bp})_{kl}}{\partial (e_{b'})_{i}} =
        \begin{cases} (n_p)_l & \text{if } b' = b \text{ and } k = i \\ 0 & \text{otherwise} \end{cases}

    \frac{\partial (H_{bp})_{kl}}{\partial (n_{p'})_{j}} =
        \begin{cases} (e_b)_k & \text{if } p' = p \text{ and } l = j \\ 0 & \text{otherwise} \end{cases}    (3-15)

The program described in Section 3.2.2 was rewritten to handle an arbitrary number of
baselines and planar surfaces, using the above equations to optimize the large system
for the best set of parameters. As expected, the residual matching errors after optimi-
zation are slightly higher (since degrees of freedom have been removed from the prob-
lem), but satisfactory solutions are found consistently.
3.2.5. Stereo Search
In order to use the results of the calibration technique described in the previous
section, we rewrite equation (3-1) to use the parameters that we have computed:
    z_b \begin{bmatrix} c_b \\ r_b \\ 1 \end{bmatrix}
    = z H_b \begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + e_b    (3-16)
Note that for a given (c,r) and z for the reference camera, this equation tells us the loca-
tion of the corresponding points in all of the other cameras. In order to perform the stereo
search, we need to decide in what increments we will move along the line segment
defined by H_b [c\; r\; 1]^T and e_b, which is equivalent to asking what values of z we want
to test.

Since the image is sampled at pixel boundaries, it makes sense to search in one-
pixel increments. Smaller search steps could be used to yield sub-pixel precision (up to
some limit determined by the particular camera configuration being used). Larger
steps could be used to reduce the total number of steps searched, thus increasing com-
putational speed at the expense of resolution (though if the steps are too large it is pos-
sible to miss the correct match completely).
Dividing equation (3-16) by z, we get:

    \frac{z_b}{z} \begin{bmatrix} c_b \\ r_b \\ 1 \end{bmatrix}
    = H_b \begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + \frac{1}{z} e_b    (3-17)
which is a more convenient representation for the equation. By dividing the first two
elements of the left-hand side by the third, the corresponding location (c’,r’ ) in the
second image is determined. Because this division is required, all scalar multiples of
equation (3-17) are equally valid for defining the search space.
In order to perform the search, we rewrite the equation once more:
    \frac{z_b}{z} \begin{bmatrix} c_b \\ r_b \\ 1 \end{bmatrix}
    = H_b \begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + d \cdot \frac{e_b}{s}    (3-18)

where s is a scale factor that determines how large the search steps will be and d is an
integer. In general, since the relative magnitudes of the e_b will all be different, the step
size will be different in each of the images. In practice, we always adjust s so that the
steps are one pixel for the longest baseline (which also corresponds to the e_b with the
largest magnitude). This implies that the step size on the shorter baselines will be less than
one pixel.
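A minimal sketch of the search-space enumeration of equation (3-18) (hypothetical helper, toy numbers; with the third component of e_b zero, the steps come out to exactly |e_b|/s pixels):

```python
def search_positions(H, e, c, r, s, dmax):
    """Candidate correspondences (c_b, r_b) for integer disparities d = 0..dmax (eq. 3-18)."""
    base = [H[0][0]*c + H[0][1]*r + H[0][2],
            H[1][0]*c + H[1][1]*r + H[1][2],
            H[2][0]*c + H[2][1]*r + H[2][2]]
    out = []
    for d in range(dmax + 1):
        v = [base[i] + d*e[i]/s for i in range(3)]
        out.append((v[0]/v[2], v[1]/v[2]))   # divide through by the third coordinate
    return out

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
e_b = [40.0, 0.0, 0.0]                       # toy epipole direction
pos = search_positions(I3, e_b, 100.0, 50.0, 40.0, 3)
assert pos[0] == (100.0, 50.0)               # d = 0: the homography mapping itself
assert pos[1] == (101.0, 50.0)               # s = |e_b| gives one-pixel steps
```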
3.3. Global (metric or Euclidean) calibration
Although much stereo processing and inference can be done without ever mapping
the image coordinates back into metric 3D space, our eventual goal (obstacle avoid-
ance) requires at least some measure of the size, position, and range of the objects
observed by the stereo system. A high degree of precision may not be possible (since
the accuracy of stereo range decreases with distance), but it is also not necessary.
The process of stereo matching produces a value of d at each pixel (c,r). The ques-
tion then becomes: what is the relationship between the (c,r,d) coordinates and Euclid-
ean (x,y,z) coordinates? From equation (3-17) and equation (3-18), we have that

    d = \frac{s}{z}    (3-19)

Combining this with equation (2-3), we can write the relationship as a linear map-
ping between 3D homogeneous coordinates:

    z \begin{bmatrix} c \\ r \\ d \\ 1 \end{bmatrix}
    = \begin{bmatrix} f & 0 & u & 0 \\ 0 & \gamma f & v & 0 \\ 0 & 0 & 0 & s \\ 0 & 0 & 1 & 0 \end{bmatrix}
      \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}    (3-20)
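The mapping of equation (3-20) and its inverse can be sketched as follows (the intrinsic values here are illustrative, not the calibrated ones):

```python
# illustrative intrinsics: focal length f, aspect gamma, image center (u, v), disparity scale s
f, gamma, u, v, s = 800.0, 1.0, 320.0, 240.0, 4000.0

def project(x, y, z):
    """(x, y, z) -> (c, r, d) via z * [c, r, d, 1]^T = M [x, y, z, 1]^T (equation 3-20)."""
    return (f*x/z + u, gamma*f*y/z + v, s/z)

def backproject(c, r, d):
    """Invert equation (3-20): recover (x, y, z) from a pixel and its disparity."""
    z = s / d                      # from d = s/z, equation (3-19)
    return ((c - u)*z/f, (r - v)*z/(gamma*f), z)

assert backproject(*project(1.0, 2.0, 10.0)) == (1.0, 2.0, 10.0)
```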
Often it is convenient to have the origin of our world coordinate system be differ-
ent from the focus of one of the cameras. This is easily accomplished by simply right-
multiplying by a 4×4 rigid transformation:

    z \begin{bmatrix} c \\ r \\ d \\ 1 \end{bmatrix}
    = \begin{bmatrix} f & 0 & u & 0 \\ 0 & \gamma f & v & 0 \\ 0 & 0 & 0 & s \\ 0 & 0 & 1 & 0 \end{bmatrix}
      \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}
      \begin{bmatrix} x' \\ y' \\ z' \\ 1 \end{bmatrix}
    = P^{-1} \begin{bmatrix} x' \\ y' \\ z' \\ 1 \end{bmatrix}    (3-21)

or equivalently, since the matrices are invertible,

    \alpha \begin{bmatrix} x' \\ y' \\ z' \\ 1 \end{bmatrix}
    = P \begin{bmatrix} c \\ r \\ d \\ 1 \end{bmatrix}    (3-22)

where α represents the fact that we need to divide through by the fourth (homogeneous)
coordinate. Since the resulting P is still a 4×4 matrix, we can solve for it using a variety
of linear algebraic tools. The minimal data necessary to solve this problem is a set of five
points, no four of which are coplanar.
3.3.1. Practical and Accurate Metric Calibration
Although five points provides for a minimal solution to the calibration problem,
the solution thus obtained is very sensitive to measurement errors (both in the mea-
surement of disparity and in the measurement of real-world distances). Since we
already have tools for determining sets of homography matrices very accurately, it
makes sense to use them for metric calibration.
With cameras mounted on top of an automobile, it is easy to find vertical and hori-
zontal planes, and to move the vehicle around within the ground plane. As an example
of one way to calibrate the system fairly accurately using homography matrices, con-
sider taking images of a wall that is vertical and perpendicular to the direction of travel
of the vehicle. In our standard vehicle coordinate system, such a plane is a plane of
constant z.

The equation for a homography can be written as

    \frac{z'}{z} \begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix}
    = H \begin{bmatrix} c \\ r \\ 1 \end{bmatrix}
    = H_\infty \begin{bmatrix} c \\ r \\ 1 \end{bmatrix}
      + e\, \frac{n^T A^{-1}}{h} \begin{bmatrix} c \\ r \\ 1 \end{bmatrix}    (3-23)

which, when compared with equation (3-18), yields the following relationship for
points on the plane:

    d = \frac{1}{\|e\|\, h}\, n^T A^{-1} \begin{bmatrix} c \\ r \\ 1 \end{bmatrix}
      = n^T \begin{bmatrix} c \\ r \\ 1 \end{bmatrix}    (3-24)

where the final n^T absorbs the constant factors. Thus the homography for the plane
defines the disparity d of a point on the plane for each point in the image. If we expand
out the part of equation (3-22) that deals with the z coordinate, we get that

    z' = \frac{P_{31} c + P_{32} r + P_{33} d + P_{34}}{P_{41} c + P_{42} r + P_{43} d + P_{44}}    (3-25)

Substituting equation (3-24) into equation (3-25) and rearranging terms, we get

    \big( P_{41} c + P_{42} r + P_{43}(n_1 c + n_2 r + n_3) + P_{44} \big)\, z'
    = P_{31} c + P_{32} r + P_{33}(n_1 c + n_2 r + n_3) + P_{34}    (3-26)

Collecting terms in c and r yields

    (P_{41} z' + P_{43} n_1 z' - P_{31} - P_{33} n_1)\, c
    + (P_{42} z' + P_{43} n_2 z' - P_{32} - P_{33} n_2)\, r
    + (P_{44} z' + P_{43} n_3 z' - P_{34} - P_{33} n_3) = 0    (3-27)

Since this equation must be true for all c and r, the coefficients multiplying c and r and
the constant term must each be zero:
    P_{41} z' + P_{43} n_1 z' - P_{31} - P_{33} n_1 = 0
    P_{42} z' + P_{43} n_2 z' - P_{32} - P_{33} n_2 = 0
    P_{44} z' + P_{43} n_3 z' - P_{34} - P_{33} n_3 = 0    (3-28)

Since this set of equations does not define the scale of the parameters (given a solu-
tion, multiplying all of the parameters by some scalar would be another solution), we
can arbitrarily decide to set P_{43} equal to 1. Intuitively, the denominator of
equation (3-25) determines the location of the plane at infinity in (c,r,d) space, since
when the denominator goes to zero, the 3D coordinates of the point will go to infinity.
Therefore, we cannot set P_{44} to 1 as in the previous section, because this effectively
requires the denominator to have a non-zero constant term. We can set P_{43} to 1
because we can be confident that the equation for the plane at infinity will depend on
d.
Thus each plane of constant z gives us a set of three linear equations in seven
unknowns. Three such planes are required to solve for the parameters of P. We collect
the data for several planes by very carefully driving the car along a straight line that is
perpendicular to the wall that we are observing. The homographies for all of the planes
can be computed at once using the technique described in Section 3.2.4. If we arrange
the problem like so:

    \begin{bmatrix}
    1 & 0 & n_{11} & 0 & -z_1' & 0 & 0 \\
    0 & 1 & n_{12} & 0 & 0 & -z_1' & 0 \\
    0 & 0 & n_{13} & 1 & 0 & 0 & -z_1' \\
      &   &        & \vdots & & & \\
    1 & 0 & n_{k1} & 0 & -z_k' & 0 & 0 \\
    0 & 1 & n_{k2} & 0 & 0 & -z_k' & 0 \\
    0 & 0 & n_{k3} & 1 & 0 & 0 & -z_k'
    \end{bmatrix}
    \begin{bmatrix} P_{31} \\ P_{32} \\ P_{33} \\ P_{34} \\ P_{41} \\ P_{42} \\ P_{44} \end{bmatrix}
    =
    \begin{bmatrix} n_{11} z_1' \\ n_{12} z_1' \\ n_{13} z_1' \\ \vdots \\ n_{k1} z_k' \\ n_{k2} z_k' \\ n_{k3} z_k' \end{bmatrix}    (3-29)

then it is a linear problem of the form X p = Y, and the least squares solution for the
parameters of P can be obtained by the pseudo-inverse, i.e.

    p = (X^T X)^{-1} X^T Y    (3-30)

If the solution is unstable (because X^T X is not invertible), SVD can be applied to com-
pute a suitable pseudo-inverse. The problem generalizes to more than three planes, and
the solution becomes more accurate with additional data.
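A sketch of the normal-equations solution of equation (3-30), with a generic Gaussian elimination standing in for whatever linear-algebra routine was actually used (toy data, not calibration data):

```python
def lstsq(X, Y):
    """Least squares via normal equations: p = (X^T X)^{-1} X^T Y (equation 3-30)."""
    n = len(X[0])
    A = [[sum(X[k][i]*X[k][j] for k in range(len(X))) for j in range(n)] for i in range(n)]
    b = [sum(X[k][i]*Y[k] for k in range(len(X))) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, n):
            m = A[row][col] / A[col][col]
            for j in range(col, n):
                A[row][j] -= m*A[col][j]
            b[row] -= m*b[col]
    p = [0.0]*n
    for row in range(n - 1, -1, -1):
        p[row] = (b[row] - sum(A[row][j]*p[j] for j in range(row + 1, n))) / A[row][row]
    return p

# overdetermined toy system with a known solution
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 9.0]]
true_p = [2.0, -1.0]
Y = [x[0]*true_p[0] + x[1]*true_p[1] for x in X]
p = lstsq(X, Y)
assert abs(p[0] - 2.0) < 1e-8 and abs(p[1] + 1.0) < 1e-8
```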
Of the sixteen unknowns in P, eight of the variables are determined by this proce-
dure. If we also have a homography matrix for the ground plane (which can also be
computed at the same time as the other homographies), we can establish it as a plane
where y’ is zero:
    y' = \frac{P_{21} c + P_{22} r + P_{23} d + P_{24}}{P_{41} c + P_{42} r + P_{43} d + P_{44}} = 0    (3-31)

Again using equation (3-24), rearranging terms, and again noting that the equation
must be true for all c and r, we get

    P_{21} + P_{23} n_1 = 0
    P_{22} + P_{23} n_2 = 0
    P_{24} + P_{23} n_3 = 0    (3-32)

which allows us to solve for P_{21}, P_{22}, and P_{24} in terms of P_{23}. If we observe one
point (c,r,d) at a known height y, we have:

    y = \frac{-P_{23} (n_1 c + n_2 r - d + n_3)}{P_{41} c + P_{42} r + P_{43} d + P_{44}}    (3-33)

which we can solve for P_{23} since we know the values of all of the other quantities.

All that remains to be determined are the values in the first row of the matrix.
Since planes of constant x are scarce, a different method is in order. We already have
many views of a vertical surface for which the disparity at every pixel is known. If we
can measure the x coordinate for four points that are not coplanar in (c,r,d) space, then
we can set up the following equation:

    \begin{bmatrix} c_0 & r_0 & d_0 & 1 \\ c_1 & r_1 & d_1 & 1 \\ c_2 & r_2 & d_2 & 1 \\ c_3 & r_3 & d_3 & 1 \end{bmatrix}
    \begin{bmatrix} P_{11} \\ P_{12} \\ P_{13} \\ P_{14} \end{bmatrix}
    =
    \begin{bmatrix}
    x_0 (P_{41} c_0 + P_{42} r_0 + P_{43} d_0 + P_{44}) \\
    x_1 (P_{41} c_1 + P_{42} r_1 + P_{43} d_1 + P_{44}) \\
    x_2 (P_{41} c_2 + P_{42} r_2 + P_{43} d_2 + P_{44}) \\
    x_3 (P_{41} c_3 + P_{42} r_3 + P_{43} d_3 + P_{44})
    \end{bmatrix}    (3-34)

The least squares solution for the parameters can be computed in the same way as for
equation (3-29). In practice, a set of points for the optimization can be determined by
measuring the x coordinate of two world points on the vertical surface and tracking the
points through the sequence of images of that surface.

The preceding section has described one example calibration and a set of tools for
computing the parameters of the calibration matrix. In other situations where this type
of calibration is necessary, it may be more convenient to capture images of different
types of surfaces, but the same techniques and equations are applicable.
3.4. Summary of the Calibration Method Steps
1. Collect at least three images of a vertical planar surface, perpendicular to the
direction of travel of the vehicle and at a known (measured) distance. The sur-
face should have sufficient texture to allow accurate matching.
2. Choose two points on the planar surface, and measure the lateral position and
height of those points.
3. If not already done in 1., collect an image of a horizontal plane, preferably the
ground plane.
4. Using the weak calibration method described in Section 3.2.4, match all of the
planes collected above to obtain a set of weak calibration parameters H_b, e_b,
and n_p.
5. Use the n_p values from the different vertical planes to solve equation (3-30).
This is enough to allow for metric computation of z.
6. Use the n_p value for the ground plane along with the measured height of a single
point on a vertical plane to solve equation (3-32) and equation (3-33). This
allows metric computation of y.
7. Use the measured lateral positions and equation (3-34) to determine the remain-
ing parameters of P, allowing for the metric computation of x.
3.5. Calibration Accuracy
The metric calibration procedure described in the previous sections was used to
calibrate the cameras on our vehicle. We used a garage door as the vertical surface, and
the garage floor as a horizontal surface (the images shown in Figure 1-1 are one set of
images from this calibration set). A total of eight planar surfaces were matched:
images of the garage door were taken at 5-meter intervals from 15 to 45 meters, and
the ground plane from the 45-meter image was used for the horizontal surface.
The calibration was then tested on a set of images of obstacles, taken at 10-meter
intervals from 50 to 150 meters, with about one meter precision. For each obstacle,
stereo matching was performed, and the results were hand-segmented to ensure that no
outlier pixels were included. Then the distance to the obstacle was computed using the
metric calibration parameters derived via the method described in this chapter. The
results are shown in Figure 3-2. The data for 120 meters was erased accidentally, but
the remaining data shows that the calibration is reasonably accurate. The three curves
plotted on the graph represent the correct result (in the center) and the expected results
if the stereo match were off by one pixel in either direction from the correct result.
Since the calibration data only goes out to 45 meters, the results shown in the
graph are all extrapolated from the calibration data and we therefore expect the cali-
bration to become less accurate as distance increases. The errors seen in the graph can
thus be explained as some combination of calibration error, error in measurement of
the ground truth (not more than one meter), and possible stereo matching error (not
more than one pixel).
Figure 3-2: Calibration accuracy. Measured range (m) is plotted against ground truth
(m) over 40–160 m; the center curve is the correct result, and the outer curves show the
expected result if the stereo match were off by one pixel in either direction.
Chapter 4
Stereo Algorithm
The research described in this thesis originated from an effort to apply the CMU
Video-Rate Multibaseline Stereo Machine to the problem of detecting highway obsta-
cles. The stereo algorithm used in this research is thus based on the algorithm used by
the stereo machine.
No claim is made that this algorithm is necessarily the best way of computing
depth from multiple camera views. Many algorithmic choices within the stereo
machine seem to have been made for ease and speed of implementation rather than for
accuracy of the final result. In fact, the software implementation of this algorithm is
more accurate than the hardware, since it does not suffer from most of the limitations
that were imposed on the hardware by speed, cost, or ease of design.
The basic system that I have constructed is shown in Figure 4-1. With the excep-
tion of the “obstacle detection/localization” box, each of the boxes in this figure will
be touched upon in both this chapter and the next. This chapter provides motivation for
why a particular step in the algorithm has been chosen; Chapter 5 will go into the
implementation details of how each processing step can be performed efficiently and
accurately. The methods used in the “obstacle detection/localization” box are
described in Chapter 6.

4.1. Related Work

The CMU Video-Rate Multibaseline Stereo Machine is described in detail in
[Kanade et al. 96], though the ideas behind the design decisions that were made are not
described there.
Figure 4-1: Architecture of Stereo Obstacle Detection System (each of three cameras
feeds a LoG filter and an image rectification stage, followed by stereo matching and
obstacle detection/localization).
[Matthies 92] derives several different stereo algorithms using a statistical frame-
work, including the basic SSD search method upon which the work in this thesis is
based. All of the pieces of our algorithm are discussed in some detail in [Faugeras 93],
although many other stereo vision algorithms are also discussed.

The method for computing stereo from more than two cameras that is used in this
thesis was first described in [Okutomi & Kanade 93]. The algorithm described there is
called SSSD-in-inverse-distance. The main idea, that matching errors from multiple
baselines can be added together to evaluate different possible geometries, has been
retained in this work. The final section of this chapter discusses alternatives to the SSD
metric. The “inverse distance” part has been generalized in this dissertation through
the use of projective geometry such that any set of planes in the world can be used;
however, if uniform sampling in the image is desired, then the perpendicular distances
to subsequent planes are still controlled by the inverse distance formula.

Convolution with the Laplacian of Gaussian operator has a long history as a fea-
ture detector. It was first used by Marr and Hildreth [Marr & Hildreth 80] as an edge
detector. Nishihara [Nishihara 84] first used the sign of the LoG-filtered image for ste-
reo matching. The use of more bits of information from the LoG filter was a natural
extension of this.

4.2. Multibaseline Stereo

Although only two cameras are required to compute range from image data, there
are several advantages to using more than two cameras for stereo vision:
1. since the epipolar direction is the same as the direction of camera displacement, it
is possible to arrange for the epipolar directions of multiple cameras to lie in differ-
ent directions in the image, thus taking advantage of image texture in any direction
(an example of this is illustrated in Figure 4-2). An example of where this is useful
is when viewing a horizontal feature such as the curb in Figure 4-3. When viewing
this region with only a pair of cameras with a horizontal baseline, there is very lit-
tle in the image to distinguish one location from another. The addition of a vertical
baseline allows us to take advantage of the available texture.
2. repeating texture in the image can confuse a two camera system by causing match-
ing ambiguities; these ambiguities are eliminated when additional cameras are
present, assuming that the camera spacing is not an integer multiple of the texture
spacing; the latter issue can be avoided by not placing the cameras at an even spac-
ing.
3. as in any measurement process, additional measurements allow more accurate
results by averaging noise; in the case of a large number of cameras, outliers can
be rejected by voting or robust statistics
4. shorter baselines are less prone to matching error while longer baselines are more
accurate; the combination is better than either alone
5. different regions of space are occluded for each camera pair; therefore the prob-
lems caused by occlusion are somewhat ameliorated by using multiple cameras
For advantages 1 and 2 adding a third camera is sufficient; fourth and additional
cameras do not yield any additional benefits. The advantages of 3-5 continue to grow
past the fourth camera. Thus there is a large benefit to adding the third camera, and the
benefits diminish with the fourth and additional cameras.

Figure 4-2: Three cameras in an “L” configuration give different epipolar directions

4.3. LoG Filtering

Since the Laplacian of Gaussian is a second derivative operator, places where the
LoG-filtered image is zero are places where the intensity of the original image has
maximum variation, i.e., edges. In addition to being a good edge detector, the LoG fil-
ter also has the following two properties:
• it has a tunable Gaussian filter for filtering out high-frequency image noise
• since the LoG function naturally integrates to zero, any bias in intensity between
the cameras is eliminated (it subtracts out)

Since the zero crossings of the LoG-filtered image are interesting points, it makes
sense that points that are near to zero will also be interesting. Therefore we pre-filter
our images with an LoG filter. We use a small standard deviation for the filter since we
do not want to aggressively remove high-frequency texture from the image. In addition,
we apply the filter with a high gain, and saturate values that overflow or underflow at
the maximum and minimum representable values. This has the effect of accentuating
regions that are near to zero crossings.

The result of LoG filtering is shown in Figure 4-3. The increase in image texture,
particularly on the road surface, is very apparent. In practice, the texture extracted by
this method is consistent even between different cameras, and thus is very useful for
stereo matching in such bland environments.

One question that remains is how large the gain on the filter should be. Experi-
ments performed with a large number of different gains in an attempt to determine the
optimal gain value had predictable results: the optimal gain depends on the relative
contrast of the image. Images that have very little contrast benefit from a large gain
(even if the noise is amplified greatly, it is still better than having no signal to match
whatsoever). On the other hand, images with high contrast match well without any
additional enhancement.
In practice, the gain should probably depend on the image data itself. A system
that automatically adjusted the gain so that the contrast was as high as possible without
saturating the image would be a good solution, though nothing of this type has been
implemented.
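The pre-filtering described above (small standard deviation, high gain, saturation) can be sketched in one dimension; the kernel form and parameter values here are illustrative only, not the thesis implementation:

```python
import math

def log_kernel(sigma, half=4):
    """Sampled 1-D Laplacian-of-Gaussian kernel, adjusted to sum exactly to zero."""
    k = [(x*x/sigma**4 - 1.0/sigma**2) * math.exp(-x*x/(2.0*sigma**2))
         for x in range(-half, half + 1)]
    mean = sum(k) / len(k)
    return [v - mean for v in k]   # zero DC response: constant offsets subtract out

def log_filter(signal, sigma=1.0, gain=8.0):
    """Convolve with the LoG, amplify by `gain`, and saturate to [-128, 127]."""
    k = log_kernel(sigma)
    h = len(k) // 2
    out = []
    for i in range(h, len(signal) - h):
        v = gain * sum(k[j]*signal[i + j - h] for j in range(len(k)))
        out.append(max(-128.0, min(127.0, v)))
    return out

# a constant intensity bias between two cameras disappears after filtering
a = [10.0, 10.0, 12.0, 30.0, 12.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0]
b = [v + 40.0 for v in a]   # same scene, biased camera
fa, fb = log_filter(a), log_filter(b)
assert all(abs(x - y) < 1e-6 for x, y in zip(fa, fb))
```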
4.4. Rectification and Interpolation
After calibrating the camera system, the necessary geometric constraints between
cameras for stereo matching are known. For each pixel in the reference image, we can
compute the coordinates of a set of possible corresponding points in each of the other
images. In general, these coordinates will not fall on integer pixel boundaries; thus
some method of estimating the correct value of arbitrary points in the image is neces-
sary.

Figure 4-3: Image before and after LoG filtering
The correct method for interpolation would be to convolve with a sinc function to
remove higher order harmonics that are introduced in the sampling process. In practice
the sinc function has a large support, which requires a large filter size and is therefore
computationally intensive. A reasonable approximation is to use a Gaussian filter for
interpolation. When combined with the LoG filter, this effectively produces an LoG
filter with a larger standard deviation (the new \sigma is \sqrt{\sigma_L^2 + \sigma_G^2}), while interpolating
the data as well.
In practice, for signals which have a cutoff frequency that is sufficiently less than
the Nyquist limit (which we can ensure by choosing our LoG filter coefficients care-
fully), bilinear interpolation has proven to be sufficient for estimating actual image
values at non-integer pixel locations. Bilinear interpolation also has the advantage of
being easy to implement efficiently, since it only involves the four neighboring pixels.
The gain in processing speed more than offsets the small loss in output quality.
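A minimal sketch of bilinear interpolation at a non-integer location (2×2 toy image):

```python
def bilinear(img, c, r):
    """Bilinear interpolation of an image at a non-integer location (c, r)."""
    c0, r0 = int(c), int(r)
    fc, fr = c - c0, r - r0
    # weighted sum of the four neighboring pixels
    return ((1-fc)*(1-fr)*img[r0][c0]   + fc*(1-fr)*img[r0][c0+1]
          + (1-fc)*fr    *img[r0+1][c0] + fc*fr    *img[r0+1][c0+1])

img = [[10.0, 20.0], [30.0, 40.0]]
assert bilinear(img, 0.5, 0.5) == 25.0    # center of the four pixels
assert bilinear(img, 0.25, 0.0) == 12.5   # along the top row
```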
Image rectification is the process of transforming an image so that it has particular
alignment properties (such as having the epipolar search directions aligned with the
scan lines of the image). Any such desired transformation can be represented as a
homography matrix, and projective image warping is then used to generate the trans-
formed image. Since the details of how this rectification is done are very different
between the stereo machine and the software implementation, the discussion of the
exact methods used is postponed until the chapter on implementation.
4.5. Stereo Matching
The previous sections discussed how we can compute corresponding pixels
between images that have high enough contrast to allow us to differentiate objects at
different distances from the camera. What remains to be done is to search through the
possible distances at each point and decide which one is best supported by the image
data.
Ideally, we would just be able to compare the pixel values for each possible dis-
tance, and choose the ones that match best. If we assume that image noise is roughly
Gaussian, then the best measure of the similarity of pixels is simply the squared differ-
ence between them. In practice, we find that outliers are actually much more likely
than a Gaussian model would predict. One large factor that causes this to be the case
for stereo matching is that the appearance of objects when viewed from different direc-
tions can be different. Two examples of when this occurs are specular reflections and
occluding edges.
Since a good statistical model of such outlier points would be difficult if not
impossible to construct, we are left with the problem of finding an error metric that is
less sensitive to outlier points while being practical to compute. One such operator is
the absolute value of the difference between pixel values.
Of course, since the pixel values are discrete integers between 0 and 255, the
chances are good that several different pixels will match equally well even if we have
the correct statistical model. With the addition of possible image noise, it becomes
likely that an incorrect disparity will match well. In order to compensate for this, we
must make some further assumptions about the scene that we are viewing. The sim-
plest assumption that we can make is that points in a small region of the image should
all match in roughly the same way. This assumption is violated at occluding edges, and
at points in the image with extreme slope compared to the reference plane. Methods
for dealing with the latter problem will be discussed in a later chapter.
The error metric for a particular pixel and disparity, for a single baseline (between
cameras 0 and 1) is then:
E_{01}(x,y,d) = \sum_{(i,j) \in W(x,y)} \left| I_0(i,j) - I_1(i,j,d) \right|    (4-1)

where I_1(i,j,d) is the appropriately interpolated value that matches I_0(i,j) at distance d, and W(x,y) represents a window of pixels around the image point (x,y).

One consideration is how to modify this metric for multiple baselines. The theoretically correct error metric assuming Gaussian noise would be to compute the variance of the set of image intensities in place of |I_0(i,j) − I_1(i,j,d)|. The metric corresponding to the absolute difference metric in the case of multiple baselines is:

\sum_{k=0}^{n} \left| n\, I_k(i,j,d) - \sum_{l=0}^{n} I_l(i,j,d) \right|    (4-2)

which is just the sum of the absolute differences from the mean (the variance would be the sum of the squared differences from the mean). Table 4-1 contains a list of different possible metrics and their computational cost. Since we are implementing this algorithm on modern computer hardware, multiply, add, and absolute value operations are assumed to be equivalent in cost.

Table 4-1: Possible Error Metrics (n is the number of cameras)

  absolute difference (variance):
      \sum_{k=0}^{n} \left| n\, I_k(i,j,d) - \sum_{l=0}^{n} I_l(i,j,d) \right|
      O(n), 5n − 2 operations (13 for three cameras); alternatively O(n²), (3/2)n² − (1/2)n − 1 operations (11 for three cameras)

  squared difference (variance):
      n \sum_{k=0}^{n} I_k(i,j,d)^2 - \left( \sum_{k=0}^{n} I_k(i,j,d) \right)^2
      O(n), 3n operations (9 for three cameras)

  absolute difference (reference):
      \sum_{k=1}^{n} \left| I_0(i,j) - I_k(i,j,d) \right|
      O(n), 3n − 4 operations (5 for three cameras)
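The two single-pass metrics that recur below can be sketched as follows (illustrative Python, not the thesis implementation); values holds the LoG-filtered pixel values seen by each camera for one candidate (pixel, disparity) pair:

```python
def abs_diff_reference(values):
    """'absolute difference (reference)': sum of |I_0 - I_k| over the
    non-reference cameras (camera 0 is the reference)."""
    ref = values[0]
    return sum(abs(ref - v) for v in values[1:])

def abs_diff_variance(values):
    """'absolute difference (variance)', equation (4-2): sum of
    |n*I_k - sum(I_l)|, i.e. n times the total absolute deviation
    from the mean, computed without any division."""
    n = len(values)
    total = sum(values)
    return sum(abs(n * v - total) for v in values)
```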
Since whatever metric we choose will be evaluated for each pixel at each search
distance, it is of critical importance that we choose a metric that can be evaluated with
as few operations as possible in order to achieve an implementation that runs quickly.
As is shown in the table, the metric from equation (4-2), although theoretically
correct, is one of the most computationally expensive. The use of variance as an error
metric is motivated by the assumption that the pixels from each of the camera views
are equivalent measurements of the same underlying property, and thus that their mean
is the best estimate of that quantity and their variance is the best estimate of similarity.
In the algorithm that we have described previously, one of the cameras (the refer-
ence camera, camera 0) is special. For each pixel of the image from that camera, we
perform a search over a set of possible distances, comparing that pixel to different pix-
els from the other cameras. If we instead make the assumption that the pixel in the ref-
erence camera has the correct value (instead of assuming that the mean is the correct
value), and we want to find the set of pixels in the other cameras that match it best, we
get the metrics that are marked as (reference) in the table. This metric, though not
Table 4-1: Possible Error Metrics (continued)

  squared difference (reference):
      \sum_{k=1}^{n} \left( I_0(i,j) - I_k(i,j,d) \right)^2
      O(n), 3n − 4 operations (5 for three cameras)

  all absolute differences:
      \sum_{k=0}^{n} \sum_{l \ne k} \left| I_k(i,j,d) - I_l(i,j,d) \right|
      O(n²), (3/2)n² − (3/2)n − 1 operations (8 for three cameras)

  special case for three cameras (x is the cost of a max(x,y) operation): 4x + 1 operations
being strictly correct, is a good compromise that is about twice as fast and produces good results.

One other metric worth mentioning is the one marked "all absolute differences" in the table. This is the metric that results if the (reference) metric is expanded so that no particular camera is special. Though this metric has no mathematical basis, it can be computed very efficiently, particularly for large numbers of cameras.

In general, we have used "absolute difference (reference)", though we have also experimented with "all absolute differences". The loss in accuracy caused was barely detectable, while the increase in performance was large.

4.6. Sub-pixel Interpolation

When more precision is required, there are two options in general. Either the step size of the stereo search can be made smaller, or sub-pixel interpolation can be applied to the results. Both methods are limited in the extent to which they can be applied; the amount of information contained in the images is limited by the resolution of the camera, the focal length of the lenses, and the longest baseline in the system. Changing the step size in general multiplies the running time of the algorithm by a constant (though adaptive schemes which do not do this can be imagined). Sub-pixel interpolation of results, on the other hand, is a constant-time operation that uses data that should already be available.

The idea behind sub-pixel interpolation is that the matching error should be a relatively smooth function, and therefore it makes sense to fit a smooth function to the error data near the minimum to more accurately determine where exactly it is. Since the function needs to be fit with a minimum of computation, a low-order polynomial (which can be fit with linear algebra) is a good choice. The lowest-order polynomial that has a minimum is a quadratic. Therefore, we fit a quadratic to a set of points near the minimum. At least three points are required, though it is possible to use more.
The linear equation that must be solved is:

\begin{bmatrix} d_0^2 & d_0 & 1 \\ \vdots & \vdots & \vdots \\ d_n^2 & d_n & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} E_0 \\ \vdots \\ E_n \end{bmatrix}    (4-3)

where the d_i are the disparities of the points near the minimum and the E_i are their corresponding matching errors. Since we are really interested in where the interpolated minimum is relative to the discrete minimum that we have already found, we can use (−1, 0, 1) or (−2, −1, 0, 1, 2) for the d_i and simply add the resulting offset to the discrete minimum. If we use the minimum and one point on either side, the equation simplifies to:

\begin{bmatrix} 1 & -1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} E_{-1} \\ E_0 \\ E_1 \end{bmatrix}    (4-4)

which can be solved by inverting the matrix:

\begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} \tfrac{1}{2} & -1 & \tfrac{1}{2} \\ -\tfrac{1}{2} & 0 & \tfrac{1}{2} \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} E_{-1} \\ E_0 \\ E_1 \end{bmatrix}    (4-5)

Since the minimum of the function E = a d^2 + b d + c is at d = -b / (2a), we can substitute and get:

d_{min} = -\frac{1}{2} \, \frac{E_1 - E_{-1}}{E_1 - 2E_0 + E_{-1}}    (4-6)

This does require a computationally expensive division operation, but depending on the hardware it might be a good trade-off versus doing extra search.

Empirical evidence suggests that sub-pixel interpolation can be used down to a
resolution of about one-fourth of an original image pixel. Below that point, even the
results for smooth, highly-textured surfaces seem to be more or less random.
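The interpolation of equation (4-6) costs only a handful of operations per pixel; a sketch (illustrative, with a guard for a flat error curve):

```python
def subpixel_offset(e_minus, e0, e_plus):
    """Offset of the quadratic minimum from the discrete minimum,
    per equation (4-6); e_minus, e0, e_plus are the matching errors
    at disparities d-1, d, d+1."""
    denom = e_plus - 2.0 * e0 + e_minus
    if denom == 0.0:          # flat error curve: no refinement possible
        return 0.0
    return -0.5 * (e_plus - e_minus) / denom
```

For a symmetric error curve the offset is zero, and for errors sampled from a true quadratic the exact minimum is recovered.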
Chapter 5
Implementation
This chapter describes two implementations of the multibaseline stereo algorithm.
The first section describes the hardware implementation used in the CMU Video-Rate
Multibaseline Stereo Machine. Although the research described in this thesis was per-
formed after the stereo machine had already been designed and built, there are several
reasons to include a discussion of it here: a) the existing documentation of the stereo
machine is somewhat sparse, not including several details relevant to the software
implementation, b) the algorithms used are directly based on the algorithm used by the
stereo machine, and c) several important differences between the implementations will
be discussed. The second section of this chapter discusses the software implementa-
tion of the stereo algorithm.
Several key insights are made in this chapter. Perhaps the most important insight
is that special rectification techniques, discussed in Section 5.3.3, can be used to allow
trinocular stereo to be computed efficiently. A detailed analysis of memory and cache
usage of three different implementations of the stereo main loop leads to a clear choice
which is supported by benchmark data. Additionally, an efficient method for perform-
ing the LoG filter and a means for determining the LoG filter coefficients are discussed in Section 5.3.2.
5.1. Related Work
During the last few years, several commercial stereo vision systems based on PC
hardware have appeared on the market (e.g. the SVM by SRI [Konolige 97] and Tri-
Clops by PointGrey Research [PointGrey 98]). Unfortunately, most of the innards of
these systems are proprietary and thus I can only speculate that these groups must have
done much of the same analysis that is presented in this chapter.
5.2. CMU Video-Rate Multibaseline Stereo Machine
The stereo machine consists of a number of custom-built 9U VME boards con-
nected in a system. The system is described in some detail in [Kanade et al. 96].
The algorithm used by the stereo machine (see Figure 5-2) works by first digitizing
the images from each of the cameras (up to 6 in the current design).
5.2.1. LoG Filter and Quantization
Each of these images is then passed through an 11x11 LoG filter which was imple-
mented in hardware by a pair of special-purpose 2D 8-bit convolution chips
(PDSP16488, made by GEC Plessey). Since the convolution hardware had a maxi-
mum mask size of 7x7, the filter was decomposed into a 7x7 Gaussian filter with a
standard deviation of one pixel followed by a 7x7 LoG filter, also with a standard
deviation of one pixel. This chained convolution is mathematically identical (modulo round-off errors) to an 11x11 LoG filter with a standard deviation of √2 pixels. The gain is controlled by a series of programmable multiply and shift operations, and a
final selection of 8 bits of the 16-bit output of the convolver chip.
This 8-bit output is then quantized down to 4 bits using another lookup table that is
part of the stereo machine hardware. The set of values that worked best in this lookup
table (and thus became the default) effectively just maps the range from -8 through 7
to 0 through 15 while saturating smaller values to 0 and larger values to 15. This is an
effective gain enhancement of a factor of 16, since the 8-bit range of the convolver
output has been reduced to 4 bits using only the low-order bits and discarding the
high-order bits.
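The default table behaves like the following sketch (illustrative):

```python
def quantize_4bit(log_value):
    """Quantize an 8-bit signed LoG output to 4 bits: the range
    -8..7 maps to 0..15, and values outside that range saturate."""
    return max(0, min(15, log_value + 8))
```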
5.2.2. Rectification (Geometry Compensation)
The 4-bit LoG-filtered data is then passed on to the geometry compensation unit. This unit is of particular importance because it performs a very general transformation on each of the input images, to rectify these images before performing the SAD computation which comes next. For each pixel in the reference image, a number of possible "distances" ζ from the camera are evaluated (see Figure 5-1). For each pixel at each distance, the offset from the pixel to the corresponding point in each of the other cameras is retrieved from a look-up table. The value stored in the lookup table consists of two 8-bit pixel offsets (for the column and row directions in the image), and two 4-bit fractions representing the fractional part of the desired location in the image in 1/16ths of a pixel.

Figure 5-1: Geometry compensation. For each pixel (i,j) of the base image and each distance ζ, an interpolated pixel is fetched from the inspection image, the absolute difference is taken, and the result is added to the values from the other camera pairs.
The (column, row) coordinates of the corresponding points are computed by taking
the 8-bit integer offset in each direction and adding it to the current pixel position (i,j).
Since the hardware that does pixel addressing uses 8 bit registers, the maximum image
size is limited to 256x256 pixels. To approximate the correct pixel intensity at the
desired location, a bilinear interpolation of the four nearest pixels is performed using
the fractional offsets retrieved from the lookup table.
Note that the lookup table can contain any values whatsoever, so it is possible to
correct for lens distortion, or to operate with one camera upside down, or to use cam-
eras with lenses of different focal lengths. The primary limitation of the geometry
compensation circuit is that each 4x4 pixel region of the base image must be offset by the
same amount, since there is only one lookup table entry for each 4x4 pixel region of
the image. This was done to keep the size of the lookup table from being too large to
implement. While this is not much of a limitation as long as the camera geometry is
close to that of a traditional stereo system, the extreme geometries that are dealt with
in this thesis are often problematic.
Calibration for the stereo machine consists entirely of computing the values to load into the lookup tables. These values can be computed directly from the homography matrices by the simple formula

\begin{bmatrix} a\, I(i,j,\zeta) \\ a\, J(i,j,\zeta) \\ a \end{bmatrix} = H_\zeta \begin{bmatrix} i \\ j \\ 1 \end{bmatrix}    (5-1)

and normalization by a to convert from homogeneous coordinates to 2D coordinates.
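Equation (5-1) amounts to pushing each pixel through H_ζ and splitting the normalized result into an integer offset and a 1/16th-pixel fraction; a sketch (illustrative Python; the stereo machine's actual packed table format is not reproduced here):

```python
def lookup_entry(H, i, j):
    """Apply a 3x3 homography H (list of rows) to pixel (i, j) per
    equation (5-1), then split each coordinate into an integer
    offset from (i, j) and a fraction in 1/16ths of a pixel."""
    x = H[0][0] * i + H[0][1] * j + H[0][2]
    y = H[1][0] * i + H[1][1] * j + H[1][2]
    a = H[2][0] * i + H[2][1] * j + H[2][2]
    I, J = x / a, y / a                  # normalize homogeneous coords
    di, dj = int(I) - i, int(J) - j      # integer pixel offsets
    fi = int(round((I - int(I)) * 16))   # 4-bit fraction (1/16 pixel)
    fj = int(round((J - int(J)) * 16))
    return di, dj, fi, fj
```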
5.2.3. Stereo Matching
In the next stage of the stereo machine,
the absolute value of difference (AD) is per-
formed pixel-by-pixel for the base camera
(camera #0) paired with each of the other
cameras. The results of the AD computation
are summed over all of the camera pairs,
resulting in a sum of absolute differences
(SAD) value for each pixel for each dispar-
ity level.
The resulting SAD values are then
smoothed by summing over a local window,
the size of which is programmable from 5x5
to 13x13. The result is called the SSAD. In
the final stage, for each pixel, the disparity
level with the minimum SSAD value is
found, and the SSAD values of the mini-
mum and its neighbors are sent to the C40
DSP processing board, where the disparity
levels can be interpolated for higher accuracy.
5.2.4. Stereo Machine Performance
The stereo machine processes images at a constant rate of roughly 30 million
pixel-disparities per second (counting pixels processed in camera #0), regardless of the
number of cameras in use. Thus the frame rate depends on the number of pixels processed and the number of disparities searched. When using the maximum values for each (256x240 image, 60 disparity levels searched), the frame rate is roughly 7.5 Hz.

Figure 5-2: Architecture of the CMU Stereo Machine. Images from the six-camera head pass through A/D and LoG frame grabbers, geometry compensation, SAD computation over the image pairs, windowed SSAD computation (vertical and horizontal sums), and a minimum finder, then on to an eight-processor C40 DSP array; a VxWorks real-time processor and a Sun workstation control the system over the VME bus and Ethernet.
5.3. Software Implementation
The software implementation uses almost the same algorithm, with a few minor
changes to adapt from processing in parallel hardware to processing serially in soft-
ware.
5.3.1. Multibaseline
As discussed in Section 4.2, there is a large benefit to using three cameras, and a
diminished benefit to the fourth and additional cameras. On the other hand, it turns out
that there are rectification methods that allow two-camera stereo matching to be implemented very efficiently in software.
method that allows a slightly less efficient implementation for three cameras. The
extension to four or more cameras is much more difficult, and requires a large increase
in computation.
Given that four or more cameras give diminishing returns for greatly increased
computational cost, we decided to concentrate on developing a fast trinocular stereo
system in software.
5.3.2. LoG Filter
A straightforward serial implementation of 2D convolution in software is very computationally expensive (it is O(pwh), where w and h are the width and height of the convolution template and p is the number of pixels in the image), so an alternative filtering operation is necessary. The standard optimization technique of splitting a 2D filter into two 1D filters does not apply, since the LoG filter is not separable. Some experimentation with different filters revealed that a 7x7 LoG filter with a standard deviation of one pixel works almost as well, with greatly reduced computational cost.
The CMU stereo machine uses a larger filter in part to compensate for image noise that
is introduced by the custom digitization hardware built into the machine. Since the
software implementation uses a commercial digitizer board, the filter size can be
reduced without perceptible loss in output quality.
The formula for an LoG filter is

L(x,y) = \frac{-1}{\pi\sigma^4} \left( 1 - \frac{x^2 + y^2}{2\sigma^2} \right) e^{-\frac{x^2 + y^2}{2\sigma^2}} = \frac{-1}{2\pi\sigma^6} (\sigma^2 - x^2)\, e^{-\frac{x^2}{2\sigma^2}} e^{-\frac{y^2}{2\sigma^2}} + \frac{-1}{2\pi\sigma^6} (\sigma^2 - y^2)\, e^{-\frac{x^2}{2\sigma^2}} e^{-\frac{y^2}{2\sigma^2}}    (5-2)
which is the sum of two separable filters. Thus a new algorithm consisting of four 1D
filters and a summation is possible. The complexity of the new algorithm is O(p(w+h)), which is significantly smaller than O(pwh). The actual number of necessary multiply-accumulate operations per pixel is reduced from 49 to 28 for a 7x7 filter (which would be reduced further to 14 if the filter were separable).
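The decomposition in equation (5-2) can be checked numerically: each of the two terms is a 1D polynomial-times-Gaussian in one direction multiplied by a plain Gaussian in the other, which is what permits the four-1D-filter implementation. A sketch (illustrative):

```python
import math

def log_2d(x, y, sigma):
    """Laplacian-of-Gaussian, direct 2D form of equation (5-2)."""
    r2 = x * x + y * y
    s2 = sigma * sigma
    return (-1.0 / (math.pi * s2 * s2)) * (1.0 - r2 / (2.0 * s2)) \
        * math.exp(-r2 / (2.0 * s2))

def log_separable_sum(x, y, sigma):
    """The same filter written as the sum of two separable terms:
    each term factors into a function of x times a function of y."""
    s2 = sigma * sigma
    c = -1.0 / (2.0 * math.pi * s2 ** 3)
    gx = math.exp(-x * x / (2.0 * s2))
    gy = math.exp(-y * y / (2.0 * s2))
    return c * (s2 - x * x) * gx * gy + c * (s2 - y * y) * gx * gy
```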
Another option that I have considered is to use a recursive filter such as those sug-
gested by Deriche [Deriche 90]. The recursive filter implementations have the advan-
tage that they take a constant number of operations independent of the size of σ.
Unfortunately, the constant in this case is 32 multiply-accumulate operations, which is
slightly larger than the case described above. If a larger σ value became necessary, this
method would become advantageous.
5.3.2.1. Determining LoG Filter Coefficients
In order to perform a 2D convolution on discrete data, continuous equations such
as equation (5-2) must be converted into discrete quantities. The values must be discrete in the spatial (row and column) domain. Each value in the convolution template must also be a discrete quantity. There are several reasons why we might want to limit the range of possible values for the filter coefficients:
• the hardware performing the convolution might have a limited range of possible coefficients (this is the case with the CMU stereo machine)
• the CPU that we are using might have special SIMD instructions that perform multiple multiply operations on small data types with one instruction
• we might want to store the accumulated results of the convolution in a small data type; the need to avoid overflow restricts the range of coefficients

Since the Intel MMX instructions allow us to perform up to four 16-bit operations per instruction, we need to keep the accumulated results as 16-bit quantities. In order to perform a convolution on 8-bit data under these circumstances, it is easy to see that the sum of the filter coefficients must be less than 2^8 = 256. The digital signal processing literature contains surprisingly little information about the optimal method for choosing filter coefficients when the range of possible values is severely limited, as in this case.

The most straightforward manner in which to compute the coefficients would be to simply evaluate the function at each discrete point, scaled so that the largest value of the filter function maps to the largest representable value (thus guaranteeing that as much precision as possible is retained), and then round off the result:

c_i = \mathrm{rnd}(f(i) \cdot scale)    (5-3)

There are three main problems with this approach:
• the resulting coefficients are not guaranteed to sum to zero, which was one of the selling points of the LoG filter in eliminating camera bias
• it is possible that some other scale factors might produce a set of coefficients that
is closer to the actual true values
• a division by the scale factor is required at the end of the convolution operation; division operations are expensive, so we would like to convert this into a binary shift operation instead

The values of the set of filter coefficients are determined by a single parameter, the scale factor. Since we would like for the final division of the convolution to be implementable as a binary shift operation, we are actually interested in scale factors of the form 2^j/n, where j is the maximum feasible amount that we can shift the final result, and n is an integer with 0 < n ≤ 2^j.

Although some clever method might reduce the search space of this problem, the size of the problem is small enough that we can solve it by brute force, trying every possible value of n.

The algorithm used to find filter coefficients is thus to try every possible value of n, searching for values for which the resulting coefficients sum to zero. Since many such solutions exist, we search among these solutions for the one for which the error

\sum_i \left( c_i - \frac{2^j}{n} f(i) \right)^2    (5-4)

is minimized. This ensures that the coefficients sum to zero, are as close as possible to representing the original function, and that the convolution can be computed efficiently without any division operations.
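The search can be sketched as follows (illustrative Python; samples holds the continuous filter function f evaluated at the template positions, and the real search would also respect the accumulator-overflow bound discussed above):

```python
def find_coefficients(samples, j):
    """Try every scale factor of the form 2**j / n (equation 5-3),
    keep integer coefficient sets that sum to zero, and return the
    (coefficients, error, n) triple minimizing equation (5-4)."""
    best = None
    for n in range(1, 2 ** j + 1):
        scale = 2.0 ** j / n
        coeffs = [round(f * scale) for f in samples]
        if sum(coeffs) != 0:
            continue  # coefficients must sum to zero to cancel camera bias
        err = sum((c - f * scale) ** 2 for c, f in zip(coeffs, samples))
        if best is None or err < best[1]:
            best = (coeffs, err, n)
    return best
```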
5.3.3. Rectification and Stereo Matching

In order to efficiently implement stereo search on modern processors, both the computation and the data access patterns of the algorithm must be very regular. In the best case, the data would be accessed sequentially, and accesses to data that must be accessed more than once would be clustered so that the data cache of the processor can be
effective. The computation should be as regular as possible (avoiding branches caused
by if-then type constructs) since frequent branches are very inefficient on modern pro-
cessors. Note that none of these issues apply to the hardware implementation in the
CMU stereo machine, since it performs computations at a constant rate, and all mem-
ory accesses occur in one cycle.
5.3.3.1. The stereo matching main loop
The stereo search is fundamentally three-dimensional, since the image has two
dimensions, and we are searching in the third dimension of depth. Since this implies
that there will be three nested loops in the algorithm (over the (c,r) pixel coordinates
and the disparity d), one question that arises is, in what order should the computation
be performed? If the goal is to tailor the rectification process so that the matching can
be performed as quickly as possible, then we must consider whether the ordering of
the computation has an effect on the execution speed of the final program. The main
loop of the stereo matching algorithm can be written in pseudo-code as follows:

    for (outer-loop) {
      for (middle-loop) {
        for (inner-loop) {
          SAD(c,r,d) = MATCHING_ERROR(c,r,d);
          HORIZONTAL_SUM(c,r,d) = HORIZONTAL_SUM(c-1,r,d) + SAD(c,r,d)
                                  - SAD(c-WINDOW_WIDTH,r,d);
          VERTICAL_SUM(c,r,d) = VERTICAL_SUM(c,r-1,d) + HORIZONTAL_SUM(c,r,d)
                                - HORIZONTAL_SUM(c,r-WINDOW_HEIGHT,d);
          if (VERTICAL_SUM(c,r,d) < MIN_SSAD(c,r)) {
            MIN_SSAD(c,r) = VERTICAL_SUM(c,r,d);
            RESULT_IMAGE(c,r) = d;
          }
        }
      }
    }

The general idea is that we will compute the matching error (SAD) for each pixel, and then add up the errors over a small window centered at each pixel to produce the SSAD. Instead of performing the window summation at each pixel as we get to it, it is more efficient to maintain a "horizontal sum" as we move across the image. If we have

    HORIZONTAL_SUM(c,r,d) = SAD(c-WINDOW_WIDTH+1,r,d) + ... + SAD(c,r,d)

then we can compute it via the recurrence shown in the pseudo-code. Similarly, we can add up the horizontal sums to get the value of the summation over a window. An important aspect of this algorithm is that the running time does not depend on the window size.

Aside from the images themselves, the algorithm uses four arrays to hold intermediate values:
• SAD(c,r,d) is the metric error between pixels for the pixel at (c,r) and disparity d
• HORIZONTAL_SUM(c,r,d) holds the "horizontal sums" described above
• VERTICAL_SUM(c,r,d) holds the vertical sums
• MIN_SSAD(c,r) holds the minimum value of the SSAD for each pixel
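The horizontal-sum recurrence can be sketched in isolation (illustrative); note that each output value costs one add and one subtract regardless of the window width:

```python
def windowed_sums(sad_row, window_width):
    """Sum SAD values over a sliding horizontal window using the
    running-sum recurrence: add the entering pixel, subtract the
    leaving one, and emit a sum once the window is fully inside."""
    sums = []
    running = 0
    for c, value in enumerate(sad_row):
        running += value                          # entering pixel
        if c >= window_width:
            running -= sad_row[c - window_width]  # leaving pixel
        if c >= window_width - 1:
            sums.append(running)
    return sums
```

The same recurrence applied down the columns of the horizontal sums yields the full SSAD in constant time per pixel.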
The remainder of this section will refer to this main loop pseudo-code, as we describe different high-level algorithmic choices in an attempt to find the fastest possible general implementation of trinocular stereo matching.

In particular, we will discuss in what order the three nested loops should be performed. Since images are usually arranged in row-major order in memory, and since rows and columns are otherwise mathematically equivalent, we will only consider possible orderings in which the loop over rows is outside the loop over columns. This leaves us with three possible loop orderings (from outermost to innermost): (r,c,d), (r,d,c), and (d,r,c).

5.3.3.2. Rectification

Looking at the stereo main loop, we see that MATCHING_ERROR(c,r,d) will be called for every possible value of c, r, and d, which is in total NUM_C*NUM_R*NUM_D iterations. The simplest implementation would be, for each pixel (c,r) and disparity d, to compute the coordinates of the corresponding point in each of the images, and then perform bilinear interpolation on the closest pixels to find the correct value. The error metric can then be computed using this value. This method requires that we perform a separate coordinate computation and interpolation for each iteration of the loop, both of which are relatively expensive operations.
For the SAD metric discussed earlier,

    MATCHING_ERROR(c,r,d) = ABS(IM1(c,r,d) - IM0(c,r))
                          + ABS(IM2(c,r,d) - IM0(c,r));
where IMn(c,r,d) refers to the coordinates in image n of the point corresponding
to the point (c,r) in IM0, at disparity d.
The primary goal of image rectification is to re-sample the images in such a way
that we only have to do NUM_C*NUM_R interpolations, thus saving a factor of
NUM_D. A secondary goal is to arrange the memory accesses to the images such that
we can take advantage of the cache architecture of the CPU.
The first step in image rectification is to warp the input images such that
IM1(c,r,d) and IM2(c,r,d) always have integer pixel coordinates, so that we
never have to do interpolation after the warping has been completed.
Let us once again return to the basic equation, equation (3-18):

\frac{z_b}{z} \begin{bmatrix} c_b \\ r_b \\ 1 \end{bmatrix} = H_b \begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + d\, \frac{e_b}{s}    (5-5)

In order for (c_b, r_b) to be integers for any integer (c,r) and d, we need to do two things:
• change the images so that when d is zero, (c_b, r_b) falls on an integer boundary
• make e_b/s become an integer offset

We can accomplish the first goal by simply warping images 1 and 2 by their respective homography matrices. The relationship between the coordinates in the image before warping (unprimed) and the coordinates after warping (primed) is:
\begin{bmatrix} c_b' \\ r_b' \\ 1 \end{bmatrix} = H_b^{-1} \begin{bmatrix} c_b \\ r_b \\ 1 \end{bmatrix}    (5-6)
An example of what the images look like before and after warping by H, with H being
a homography for the ground plane, is shown in Figure 5-3 and Figure 5-4.
The second goal is a little bit more difficult. We would like to warp the images such that the epipolar direction e_b/s is an integer offset in the image coordinates. Since the images are aligned already, any transformation that is applied should preserve the alignment.

Figure 5-3: Original (LoG-filtered) images
As a first step, we can warp the image such that the epipolar directions correspond
to the rows and columns of the image. If we call this new warping function M, we get
an equation like this:
\frac{1}{s}\, M \begin{bmatrix} H_1^{-1} e_1 & H_2^{-1} e_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}    (5-7)

Figure 5-4: Images after warping by homography for the ground plane

The H^{-1} terms occur because we have already warped the images by the H matrices, and thus the epipolar directions have been changed. This equation is underdetermined. It seems natural to map the remaining perpendicular directions to each other:
\frac{1}{s}\, M \begin{bmatrix} H_1^{-1} e_1 & H_2^{-1} e_2 & H_1^{-1} e_1 \times H_2^{-1} e_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}    (5-8)

and then the equation can be solved for M as follows:

M = s \begin{bmatrix} H_1^{-1} e_1 & H_2^{-1} e_2 & H_1^{-1} e_1 \times H_2^{-1} e_2 \end{bmatrix}^{-1}    (5-9)

Now we define the following three matrices:

W_0 = M, \quad W_1 = M H_1^{-1}, \quad W_2 = M H_2^{-1}    (5-10)

Each W matrix is then used to warp its respective image. After this transformation, points that are located on the plane corresponding to H_1 and H_2 will be located at the same pixel coordinate in all three images. Search along the epipolar direction is accomplished by moving along a row or column of the image. The results of warping the images of Figure 5-3 by the W matrices are shown in Figure 5-5.

Intuitively, the H matrices warp the images to appear as if they were taken from virtual cameras whose image planes were all parallel to each other. They also shift the images along their epipolar lines so that points on the corresponding plane appear to have zero disparity. The M matrix of equation (5-9) then further rotates the virtual cameras so that the image planes are all parallel to the plane defined by the foci of the three cameras. At the same time, the images are warped to a new coordinate system so that the x-axis is in the direction of one of the baselines, and the y-axis is in the direction of the other. Note that this means that this method will not work well if the cameras are nearly colinear.

As discussed in Section 3.2.5, the relative scales of e_1 and e_2 are determined by the relative geometry of the three cameras. In general we choose s so that the search
Figure 5-5: Images after warping to align epipoles with image rows and columns. The image size is roughly 1900x2800; original images were 640x240. After warping, the epipolar direction is vertical in the second image, and horizontal in the third.
step along the epipolar direction corresponding to the longest baseline is about one
pixel. Since the other search direction is shorter, this causes the warping function to
expand the image (since it maps the epipolar step in the original image to a one-pixel
step size in the warped image). This means that the size of the image may increase
greatly after warping (in general, it will increase by the ratio of the lengths of the base-
lines). Additionally, the resulting images will be skewed if the original baselines were
not orthogonal.
An increase in the size of the images is undesirable since the main loop of the ste-
reo algorithm loops over all of the pixels. This is in addition to the extra overhead for
generating larger images during the warping process. Even if we were to regularly
subsample the expanded image, the fact that the pixels are no longer adjacent would
cause reduced performance. The combination of these effects would cancel any
increase in performance gained from the rectification.
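Equations (5-9) and (5-10) reduce to building one 3x3 matrix and inverting it; a sketch using numpy (illustrative; H1, H2, e1, e2 are assumed to come from the calibration described earlier):

```python
import numpy as np

def warping_matrices(H1, H2, e1, e2, s=1.0):
    """Build M per equation (5-9) and the W matrices of equation
    (5-10) from the plane homographies H1, H2 and epipoles e1, e2."""
    d1 = np.linalg.inv(H1) @ e1          # epipolar direction, image 1
    d2 = np.linalg.inv(H2) @ e2          # epipolar direction, image 2
    A = np.column_stack([d1, d2, np.cross(d1, d2)])
    M = s * np.linalg.inv(A)             # (1/s) M [d1 d2 d1xd2] = I
    W0 = M
    W1 = M @ np.linalg.inv(H1)
    W2 = M @ np.linalg.inv(H2)
    return W0, W1, W2
```

By construction M satisfies equation (5-8), so the two epipolar directions are mapped onto the image axes.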
Because of the disadvantages of increased image size, we want to apply one fur-
ther set of warping functions to reduce the size back down to something near the orig-
inal image size. Such functions can be implemented by applying a further set of
warping matrices Li:
$$W_0 = L_0 M, \qquad W_1 = L_1 M H_1^{-1}, \qquad W_2 = L_2 M H_2^{-1} \qquad \text{(5-11)}$$

These matrices Li should be derived to satisfy the following constraints:

1. $L_1L_0^{-1}$ and $L_2L_0^{-1}$ are matrices consisting entirely of integer components. This ensures that each pixel in the new image 0 will appear at an exact integer coordinate in the other images. This allows the stereo matching to be done entirely without interpolation.

2. $L_0M$ should cause as little distortion as possible. This can be accomplished by ensuring that it is near the identity matrix.
3. $L_1 (1, 0, 0)^T = (1, 0, 0)^T$. This ensures that the epipolar line for the longest baseline is aligned with the scan lines. We can do this because the orientation of the resulting images is not important.

4. One last constraint that depends on the loop ordering. The idea here is to ensure that pixels that will be accessed sequentially will be in sequential memory locations, thus taking maximal advantage of the caching hardware in the computer.

Usually, we will set $L_0 = L_1$ (where the baseline between cameras 0 and 1 is the longest), which satisfies half of the first constraint. This combined with constraint #3 yields that IM1(c+1,r,d) = IM1(c,r,d+1).
The next three sections will discuss the optimal rectification strategies for the three
different loop orderings. Each strategy will produce a solution with a set of unknown
parameters. The method for finding an optimal set of these parameters will be dis-
cussed later.
5.3.3.3. Rectification strategy for the (r,d,c) ordering
Since the innermost loop is over c, we want IMn(c+1,r,d) to be located directly to the right of IMn(c,r,d). This is already true for IM1 if we set $L_0 = L_1$. We add two further constraints:

• $L_2L_0^{-1}\,(1, 0, 0)^T = (1, 0, 0)^T$: this causes IM2(c+1,r,d) = IM2(c,r,d) + (1,0).

• $L_2\,(0, 1, 0)^T = (a, b, 0)^T$, where a and b are integers. This ensures that IM2(c,r,d) is at an integer location in the image for all d.

The combination of all of the constraints gives us that:

$$L_2 = \begin{bmatrix} 1 & a & 0 \\ 0 & b & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{(5-12)}$$

and

$$L_2L_0^{-1} = \begin{bmatrix} 1 & e & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{(5-13)}$$

$$L_0 = \left(L_2L_0^{-1}\right)^{-1} L_2 \qquad \text{(5-14)}$$

where e and f must also be integers. This gives us that

$$L_0 = L_1 = \begin{bmatrix} 1 & a - \frac{eb}{f} & 0 \\ 0 & \frac{b}{f} & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{(5-15)}$$

After warping by these matrices, a point (c,r) in IM0 corresponds to the point (c,r) in IM1, and to the point (c+ar, br) in IM2. To move up one disparity level in IM1, we move one pixel to the right, to (c+1,r). To move up a disparity level in IM2, we add (a,b) to get (c+ar+a, br+b). Thus IM2(c,r,d+1) = IM2(c,r+1,d).

5.3.3.4. Rectification strategy for the (r,c,d) ordering

Since the innermost loop is over d, we want IMn(c,r,d+1) to be located directly to the right of IMn(c,r,d). This is already true for IM1 if we set $L_0 = L_1$. We introduce one further constraint:

• $L_2\,(0, 1, 0)^T = (1, 0, 0)^T$: this causes IM2(c,r,d+1) to be to the right of IM2(c,r,d).

We know that:

$$L_2L_0^{-1} = \begin{bmatrix} a & e & 0 \\ b & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{(5-16)}$$

for some integers a, b, e, and f. This implies

$$L_2\begin{pmatrix}1\\0\\0\end{pmatrix} = \begin{bmatrix} a & e & 0 \\ b & f & 0 \\ 0 & 0 & 1 \end{bmatrix} L_1 \begin{pmatrix}1\\0\\0\end{pmatrix} = \begin{bmatrix} a & e & 0 \\ b & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{pmatrix}1\\0\\0\end{pmatrix} = \begin{pmatrix}a\\b\\0\end{pmatrix} \qquad \text{(5-17)}$$

thus we have that:

$$L_2 = \begin{bmatrix} a & 1 & 0 \\ b & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{(5-18)}$$

solving for $L_1$ yields

$$L_1 = \begin{bmatrix} 1 & \frac{f}{af - be} & 0 \\ 0 & -\frac{b}{af - be} & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{(5-19)}$$

After warping by these matrices, a point (c,r) in IM0 corresponds to the point (c,r) in IM1, and to the point (ac+er, bc+fr) in IM2. To move up one disparity level in either IM1 or IM2, we simply move one pixel to the right, to (c+1,r) or (ac+er+1, bc+fr) respectively.
5.3.3.5. Rectification strategy for the (d,r,c) ordering

Since the innermost loop is over c for this ordering, we could in theory use the same rectification methods as for the (r,d,c) ordering, but there is a better method.

Given that $L_2L_0^{-1}$ consists only of integers, it follows that IM2 will be larger than (or the same size as) IM0. Suppose that

$$L_2L_0^{-1} = \begin{bmatrix} a & e & 0 \\ b & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{(5-20)}$$

and suppose that we know that the epipolar step is (∆c,∆r). If we then look at the coordinates of the point $(c + e\Delta r - f\Delta c,\; r + b\Delta c - a\Delta r)$, we get:

$$\begin{pmatrix} a(c + e\Delta r - f\Delta c) + e(r + b\Delta c - a\Delta r) \\ b(c + e\Delta r - f\Delta c) + f(r + b\Delta c - a\Delta r) \end{pmatrix} = \begin{pmatrix} ac + er + (eb - af)\Delta c \\ bc + fr + (eb - af)\Delta r \end{pmatrix} \qquad \text{(5-21)}$$

which are the coordinates in IM2 of the point (c,r) of IM0, at disparity (eb−af). This implies that if we hold d constant and loop over r and c, the set of pixels in IM2 that will be accessed is the same as the set for $d' = d + n(eb - af)$ for any integer n.

Thus, a better method of rectifying the images is to compute (eb−af) separate images, using the same rectification matrices that we computed for the (r,d,c) method.

5.3.3.6. Computing the Parameters

Each of the above three methods relies on computing a set of four integers (a,b,e,f) that satisfy the remaining constraint that $L_0$ should come as close as possible to inverting M (thus causing $L_0M$ to be near the identity matrix), while satisfying the constraint that (a,b,e,f) must be integers. There are several reasons for doing this:

• we would like to keep the number of pixels in IM0 approximately the same as the number of pixels in the original image, the idea being not to arbitrarily increase or decrease the resolution of the resulting depth image

• since the resulting depth image will be computed in the same coordinate frame as that of IM0, it is best to have that coordinate system be as close to the original camera coordinates as possible

Since our goal is to compute their values, it is helpful to note that the range of each of these integers (a,b,e,f) is limited for a number of reasons:

• the parameters in $L_0$ are limited by the fact that we are trying to invert M

• the parameters in $L_2$ are limited by the fact that we do not want to consider solutions that cause the size of IM2 to grow very large, since some computation is required to generate each pixel of IM2.

• we would like for the quantity (eb−af) to be small, since it represents the number of pixels in IM2 for each pixel in IM0.

We have developed a program that, given ranges of possible values for (a,b,e,f), tries all possible combinations, looking for a set that has the following properties:

1. $L_0$ and $L_2$ are both invertible

2. the sum of squared difference between the elements of $L_0M$ and the elements of the identity matrix is minimized

3. the size of the resulting IM2 is as small as possible

While some other, better metrics for determining these parameters might exist, the method just described has been sufficient for our needs.

An example of what the images look like after optimization for the (r,c,d) case is shown in Figure 5-6.
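The exhaustive search just described might be sketched as follows. The search ranges, the 3x3 representation of M, and the restriction to properties #1 and #2 (property #3, the size of IM2, is omitted for brevity) are simplifying assumptions:

```c
/* Exhaustive search over integer parameters (a,b,e,f), scoring each
   combination by how close L0*M is to the identity (property #2) and
   skipping singular cases (property #1).  L0 is built from (5-19) for
   the (r,c,d) ordering; M and the range limit are caller-supplied. */

double l0_identity_score(const double l0[3][3], const double m[3][3]) {
    double s = 0.0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            double p = 0.0;
            for (int k = 0; k < 3; k++) p += l0[i][k] * m[k][j];
            double target = (i == j) ? 1.0 : 0.0;
            s += (p - target) * (p - target);
        }
    return s;
}

/* Try all (a,b,e,f) in [-lim,lim]; returns best score, fills best[4]. */
double search_params(const double m[3][3], int lim, int best[4]) {
    double best_score = 1e30;
    for (int a = -lim; a <= lim; a++)
      for (int b = -lim; b <= lim; b++)
        for (int e = -lim; e <= lim; e++)
          for (int f = -lim; f <= lim; f++) {
              if (a * f - b * e == 0) continue;   /* property #1 */
              double det = (double)(a * f - b * e);
              double l0[3][3] = {{1, f / det, 0},
                                 {0, -b / det, 0},
                                 {0, 0, 1}};      /* (5-19), with L0 = L1 */
              double s = l0_identity_score(l0, m);
              if (s < best_score) {
                  best_score = s;
                  best[0] = a; best[1] = b; best[2] = e; best[3] = f;
              }
          }
    return best_score;
}
```

For an M that is already near the identity, the search should return a score near zero; real use would add the IM2-size term to the score.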
5.3.3.7. Memory Use in Stereo Matching
For a particular camera geometry, it is easy to see that the above methods should each find the same optimal value for $L_0$. Let us call the resulting width and height of IM0 after warping NUM_C and NUM_R respectively. It is then easy to show that IM2 will have (eb−af)*NUM_C*NUM_R pixels after warping.

Let us now examine the memory access patterns of the three different possible loop orderings. After accessing a pixel for the first time, some number of loop iterations will occur before that pixel is used again. During each of the intervening iterations, a different pixel from the same image will be accessed. Thus, in order for the pixel to still be in the cache when it gets accessed again, all of the intervening pixels must also be in the cache (assuming a least-recently-used cache replacement strategy).

Table 5-1 shows the number of accesses to image memory that are needed before returning to the same pixel location, for each possible loop ordering. Note that the number of intervening pixels in IM2 depends on the exact warping parameters that are found for each optimization method. The example that is referred to in the table is for the actual camera geometry used on our vehicle, which uses 640x240 images with 256 disparity levels. For this case, the value of (eb−af) is 11.

Table 5-1: Cache size necessary for fastest possible image access

Loop order      Accesses before returning to the same pixel                          Example (640x240
(outer,mid,in)  IM0            IM1              IM2                                  image, 256 disp.)
r c d           0              NUM_D - 1        b*NUM_C*NUM_D - f*NUM_D - eb + fa    165,386
r d c           NUM_C          NUM_C - 1        b*NUM_C*NUM_D - f*NUM_D - eb + fa    162,308
d r c           NUM_C*NUM_R    NUM_C*NUM_R - 1  NUM_C*NUM_R - b*NUM_C + eb - fa      460,154

Figure 5-6: Final images after warping by the optimal matrices for the (r,c,d) case. The center image (for camera #1) has been reduced by 50% to make it fit on the page. After warping, epipolar lines in both the second and third images correspond to scan lines.

Of the rectification parameters, the value b shown in Table 5-1 has the most influence on the necessary cache size, so it is worth examining a little further. Since it multiplies large coefficients, we would like for b to be as small as possible. For all three rectification methods, b must be nonzero (this is required to keep the rectification matrices from becoming singular), but the constraints that are applied to keep the image sizes small cause b to tend toward small values, and it almost always has the value 1.

In addition to the memory access patterns for image data, we must also consider the access patterns for variables used in storing intermediate results. Storing each of these arrays in their entirety, for all values of c, r, and d, is not practical due to the sheer size of the data. Even a small image and a small number of disparities (256x256, 32 disparities, 16 bits per storage element) would take almost thirteen megabytes of storage. Even if we have an amortized memory throughput of 640 MB/sec, the current top of the line at this writing, the memory accesses alone would take about 20 ms (one actual memory access to each location), making frame rate (33 ms) very difficult to achieve. Since we want to deal with larger images and much larger search ranges, and actually do some processing on the data, the problem is very difficult.

Each of the intermediate values in the algorithm is written exactly once, and then read exactly once some time later. Instead of storing all of the intermediate results for each possible c, r, and d, we instead store the values only until they are needed again. After a value has been used, the location that it was stored in can be "recycled" for storing the next value. Table 5-2 shows the minimum necessary size for each of the intermediate variables, for each of the three possible loop orderings. The intermediate values are each assumed to require 16 bits, except for MIN_SSAD, which requires 24 (16 for the minimum SSAD value and 8 for the location of the minimum).
Table 5-2: Intermediate Storage Required for Stereo Main Loop

Loop order  SAD size            HORIZONTAL_SUM size        VERTICAL_SUM size  MIN_SSAD size  Example (640x240 image,
                                                                                            11x11 filter, 256 disp.)
r c d       WINDOW_WIDTH*NUM_D  WINDOW_HEIGHT*NUM_C*NUM_D  NUM_C*NUM_D        1              3,937,795
r d c       WINDOW_WIDTH        WINDOW_HEIGHT*NUM_C*NUM_D  NUM_C*NUM_D        NUM_C          3,688,342
d r c       WINDOW_WIDTH        WINDOW_HEIGHT*NUM_C        NUM_C              NUM_R*NUM_C    476,182

The "recycling" of memory locations is actually very important. By writing to a memory location that is the same as the location that we just read from, we ensure that this location is in the cache, and thus the store operation will occur very quickly.

Table 5-3 shows the total necessary cache size for all memory accesses to be cached optimally. The value of b is assumed to be 1, and constant terms that do not contain at least one of NUM_C, NUM_R, or NUM_D have been dropped.

Table 5-3: Total cache size needed

Loop order  Total necessary cache size                                            Example
r c d       (2*WINDOW_HEIGHT + 3)*NUM_C*NUM_D + (2*WINDOW_WIDTH - f + 1)*NUM_D    4,103,181
r d c       (2*WINDOW_HEIGHT + 3)*NUM_C*NUM_D + 6*NUM_C - f*NUM_D                 3,850,650
d r c       6*NUM_C*NUM_R + (WINDOW_HEIGHT + 1)*NUM_C                             936,336

It is clear from the table that the (d,r,c) case is rather dramatically superior to the other possible loop orderings (which are roughly equivalent) in terms of the amount of cache required to attain optimal performance. In order to achieve the best performance, a slightly modified stereo main loop is required. With this modification, the outer loop does not walk through the values of d sequentially. Instead, it increments in steps of (eb−af). This modification (which I have just discovered as I write this) cuts down on the necessary cache size from (eb−af+5)*NUM_C*NUM_R to just 6*NUM_C*NUM_R.
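The totals in Table 5-3 are simple enough to compute mechanically, as in this sketch. Note that plugging the raw 640x240 image dimensions into these formulas does not reproduce the example column exactly, presumably because the table uses the post-warping values of NUM_C and NUM_R and the fitted value of f:

```c
/* Working-set sizes from Table 5-3 (b assumed to be 1, constant terms
   dropped, as in the table).  Units are storage elements.  Argument names:
   nc = NUM_C, nr = NUM_R, nd = NUM_D, wh = WINDOW_HEIGHT,
   ww = WINDOW_WIDTH, f = rectification parameter f. */

long cache_rcd(long nc, long nd, long wh, long ww, long f) {
    return (2 * wh + 3) * nc * nd + (2 * ww - f + 1) * nd;
}

long cache_rdc(long nc, long nd, long wh, long f) {
    return (2 * wh + 3) * nc * nd + 6 * nc - f * nd;
}

long cache_drc(long nc, long nr, long wh) {
    return 6 * nc * nr + (wh + 1) * nc;   /* no NUM_D dependence at all */
}
```

The (d,r,c) total is the only one with no NUM_D term, which is exactly why it scales so much better as the disparity search range grows.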
I have implemented all three possible loop orderings in C code. The ease and
efficiency of implementation of the (r,c,d) loop caused me to spend the most time
optimizing it (in assembly language) for the Pentium II processors that we have been
using in this project. I have since discovered the results presented in this section, so it
might have been better to spend some time optimizing the (d,r,c) case. It is the
optimized (r,c,d) code that has been used to implement the near-real-time system that
runs on the vehicle.
Most modern processors have small L1 data caches on-chip (16K for the Pentium
II) and larger L2 unified caches off-chip (typically 512K for the Pentium II). In order
to achieve maximum performance, all of the data should at least fit into the L2 cache.
Ideally, it would all fit in the L1 cache. Since the data that we have accounted for will
not be the only items in the cache, we should really aim to use only half of the L2
cache, leaving room for other data and code. Since the L1 cache is often separated into
separate data and instruction caches, we can plan to use more of it if our target is for all
of the data to fit in the L1 cache.
In order to make all of the data fit into a particular given cache size, we can
decrease NUM_D, NUM_C, and NUM_R. We then have to call our main loop several
times in order to cover all of the pixels and disparity levels that we originally intended
to process. A fair amount of bookkeeping also needs to be done to avoid having to
repeat some computation when making these extra calls to the main loop.
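The blocking scheme just described might look like the following sketch, where stereo_block-style callbacks stand in for the real main loop, the tile sizes are illustrative, and the border bookkeeping mentioned above is omitted:

```c
/* Tile the (column, disparity) iteration space so that each call to the
   main-loop callback touches a working set small enough for the target
   cache.  count_cells is a stand-in "main loop" that just counts the
   cells it was asked to process, so coverage can be checked. */

typedef void (*block_fn)(int c0, int c1, int d0, int d1, long *acc);

void count_cells(int c0, int c1, int d0, int d1, long *acc) {
    *acc += (long)(c1 - c0) * (d1 - d0);
}

long run_tiled(int num_c, int num_d, int tile_c, int tile_d, block_fn fn) {
    long acc = 0;
    for (int c = 0; c < num_c; c += tile_c)
        for (int d = 0; d < num_d; d += tile_d) {
            int c1 = (c + tile_c < num_c) ? c + tile_c : num_c;
            int d1 = (d + tile_d < num_d) ? d + tile_d : num_d;
            /* The real system must also avoid recomputing filter sums
               shared across tile borders; that bookkeeping is omitted. */
            fn(c, c1, d, d1, &acc);
        }
    return acc;
}
```

With the factor-of-16 reduction discussed below (e.g. 160-pixel-wide tiles at 64 disparities), every pixel/disparity pair is still visited exactly once.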
In order to reduce the data size to fit in the L1 cache, the (r,c,d) and (r,d,c) loop
orderings each require a reduction in the data size of about a factor of 256. In order to
fit in the L2 cache, a factor of 16 is necessary. In order to achieve a reduction of 256 in
the example case, we would have to cut both the width of the image and the number of
disparities by a large factor. If we evenly distribute the cuts, this means processing 16
disparity levels on an image that is just 40 pixels wide. With such a small number of iterations, the constant loop overhead factors become very prominent, so that any gains from avoiding cache misses are overwhelmed by the overhead. A factor of 16 is much more reasonable, processing 160 pixel wide images with 64 disparity levels at once. Since the (d,r,c) loop ordering only requires a reduction of around a factor of 64 for the L1 cache and a factor of 4 for the L2 cache, this case is even easier to optimize.
5.3.3.8. Benchmarks for the (r,c,d) case
Table 5-4 shows the results of testing the (r,c,d) loop ordering for various image widths and numbers of disparities searched. The first number is the number of iterations of the loop body that are executed per second (in millions). For reference, the CMU stereo machine performs at a constant rate of 30. The second entry in each table cell (in parentheses) shows the necessary cache size as computed by the formula in Table 5-3. Note that the table shows a "sweet spot" when the image width is around 40 or 80 and the number of disparities is approximately 128 or 192. In this region, the necessary cache size is significantly less than 512K, the actual amount of L2 cache installed in the machine. As the number of iterations in either loop gets small, the constant time loop overhead becomes similar to the actual time spent inside the loop, and performance no longer scales well. In particular, since the innermost loop has been unrolled to handle 8 disparity levels at once, a search over 32 disparities only takes 4 loop iterations. Thus it is not surprising that the performance is low when only 32 or 64 disparity levels are searched.

Table 5-4: Performance of the (r,c,d) algorithm (higher is faster)

             Number of disparities searched
Image width  32           64             96             128            192            256
640          14.7 (513K)  16.6 (1,025K)  17.3 (1,538K)  17.3 (2,050K)  16.6 (3,140K)  16.1 (4,101K)
320          15.3 (257K)  18.9 (513K)    19.0 (770K)    19.1 (1,026K)  18.8 (1,539K)  17.7 (2,053K)
160          16.1 (129K)  21.2 (257K)    20.4 (386K)    20.9 (514K)    20.1 (771K)    18.9 (1,029K)
80           15.4 (65K)   20.7 (129K)    23.5 (194K)    23.8 (258K)    23.4 (387K)    21.7 (517K)
40           13.7 (33K)   19.1 (65K)     22.4 (98K)     23.2 (130K)    24.9 (195K)    24.7 (261K)
20           11.3 (17K)   16.6 (33K)     19.1 (50K)     20.8 (66K)     22.9 (102K)    23.7 (133K)

Entries are millions of loop-body iterations per second, with the necessary cache size in parentheses.

The performance degrades gradually as the necessary cache size gets larger, since the "least recently used" cache replacement strategy causes the algorithm to throw away elements from the largest intermediate value storage area (HORIZONTAL_SUM in this case) first, while holding onto the other values. As the number of loop iterations increases even further, even the smaller arrays no longer fit in the cache, and performance degrades further.

In this set of benchmarks, the number of loop iterations becomes too small before the necessary cache size gets small enough to fit within the L1 cache. There is no real solution to this problem for the (r,c,d) or (r,d,c) algorithms. It might be possible to reduce the size of the data sufficiently with the (d,r,c) algorithm using a greatly reduced image size.

Although the analysis included in this section is very complex, it also has correspondingly large benefits. Using the same code on the same processor, we are able to make a fairly high-level algorithmic decision about which loop ordering provides optimal performance, although ease of optimization led us to use a different method in our system. The reasons behind this choice will be described in detail in Section 5.3.4. In addition, we are able to tune the data size such that the chosen algorithm runs as fast as possible. A relatively simple change in the size of the data can cause the speed of the program to increase by more than 50%, and this cache optimization method works independently of any other optimizations that are applied to the code. Furthermore, the optimizations described here are only dependent on the cache architecture of the target processor. If the cache sizes are known, then code that is optimal with respect to cache effects can be written in a high-level language without consideration of other processor characteristics.
5.3 Software Implementation 95
5.3.4. CPU-Specific Implementation Issues
Currently the system as described in this document exists in two forms: as a pro-
gram written in generic C code, and as a program written in generic C code with opti-
mizations written in i386 (Intel Pentium II) assembly language with MMX™ technology.
The MMX extensions allow the Pentium II CPU to perform a limited number of
what are called SIMD-in-a-register instructions. Instead of performing one operation
on two 64-bit quantities per instruction, these MMX instructions can perform two 32-
bit operations, four 16-bit operations, or eight 8-bit operations during the execution
time of one CPU instruction. The same arithmetic operation is performed on each of
the smaller data elements (thus the term SIMD, Single Instruction Multiple Data).
More information can be obtained from Intel [Intel 97].
Reimplementing the stereo inner loop using MMX instructions is relatively
straightforward. The main consideration is that there can be no data dependencies
between the operations that are to be performed in parallel. As an example of this, we
see that HORIZONTAL_SUM(c,r,d) depends on
HORIZONTAL_SUM(c-1,r,d). This implies that we cannot perform the operations
of the inner loop for multiple columns in parallel. Similarly, the VERTICAL_SUM
dependencies imply that we cannot perform the operations for multiple rows at once.
This leaves as the only possibility performing the computations for multiple disparity
levels at the same time.
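Why the disparity axis is the only safe one can be seen even in scalar C: the eight d-indexed accumulators below are mutually independent, exactly the pattern one MMX register would hold, while the c and r recurrences are not. This is a plain-C illustration of the data layout, not the MMX implementation itself:

```c
/* Accumulate SAD for LANES consecutive disparity levels together, the way
   one 64-bit MMX register holds eight 8-bit (or four 16-bit) lanes.  The
   inner k-loop has no dependency between lanes, so it maps to a single
   SIMD instruction per i; the analogous loop over c or r would not,
   because HORIZONTAL_SUM(c) depends on HORIZONTAL_SUM(c-1). */

enum { LANES = 8 };

void sad_lanes(const unsigned char *ref, const unsigned char *tgt,
               int len, int d0, unsigned short out[LANES]) {
    for (int k = 0; k < LANES; k++) out[k] = 0;
    for (int i = 0; i < len; i++)
        for (int k = 0; k < LANES; k++) {      /* one SIMD op per i */
            int diff = ref[i] - tgt[i + d0 + k];
            out[k] += (unsigned short)(diff < 0 ? -diff : diff);
        }
}
```

Because the lanes differ only in d, the eight target pixels tgt[i+d0] .. tgt[i+d0+7] are adjacent in memory, which is also why the rectified (r,c,d) layout loads them with a single 64-bit read.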
If we are to perform operations on 8 pixels at once, those 8 pixels must be stored in
sequential locations in memory (otherwise the overhead of reading and assembling the
8 pixels into one 64-bit word would defeat the purpose of this optimization). Since the
8 pixels must refer to 8 different disparity levels, this implies that the only loop order-
ing that can be optimized fully with MMX is the (r,c,d) ordering. The optimization
using MMX is expected to yield a 4- to 8-fold speed improvement. When I contrasted
this with the at most 2-fold improvement yielded by optimizing for cache issues by
using the (d,r,c) ordering, I made the logical decision that (r,c,d) was the better choice
and spent most of my time optimizing that method. In fact, the use of MMX improved
the performance of the (r,c,d) method by about a factor of four.
Chapter 6
Obstacle Detection
The previous chapters have described how we compute very accurate stereo dis-
parity data (and thus stereo range maps) quickly. In order to build an obstacle detec-
tion system, the missing link is a method for determining which range points belong to
obstacles, and grouping these possibly disjoint sets of points into a small number of
obstacle regions that can be presented to a higher level planner so that the vehicle can
take appropriate action.
Most research groups that have attacked this problem have begun by building an
elevation map from the range data. Then they have applied some relatively simple
tests to the elevation map to determine if the vehicle could travel along each point in
its path. This is generally accomplished by placing the vehicle in the map, and com-
puting whether the resulting position violates any kinematic or dynamic constraints. In
this chapter we present a method, based on weakly calibrated stereo, that attempts to
identify obstacles directly from the stereo imagery.
The idea behind this method is that at each pixel we will attempt to find a stereo
match in two different ways. One of these ways will match very well if the pixel lies
on a vertical surface. The other method will match well if the pixel is on a horizontal
surface. By comparing the results of both methods, we can determine which type of
surface the pixel is most likely to belong to. The assumption is that the obstacles that
we need to avoid will contain at least some pixels that will be classified as vertical.
The following sections will first describe methods for matching pixels of different ori-
entations. This is followed by a section describing how pixels are classified as obsta-
cles.
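The comparison of the two matching methods can be sketched as follows. The error inputs would come from the two stereo passes, and the margin threshold is a hypothetical stand-in for the confidence measure described later in this chapter:

```c
/* Classify a pixel by comparing its best matching error under the
   vertical-surface hypothesis against its best error under the
   ground-plane hypothesis.  vert_err and ground_err are the two SAD
   minima; margin is an illustrative confidence threshold that rejects
   pixels where the two hypotheses are too close to call, guarding
   against noise-induced false positives. */

typedef enum { SURF_GROUND, SURF_VERTICAL, SURF_UNKNOWN } surface_t;

surface_t classify_pixel(int vert_err, int ground_err, int margin) {
    if (vert_err + margin < ground_err)
        return SURF_VERTICAL;      /* obstacle candidate */
    if (ground_err + margin < vert_err)
        return SURF_GROUND;        /* drivable road surface */
    return SURF_UNKNOWN;           /* ambiguous: do not report */
}
```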
The groups of individual obstacle pixels that are thus identified must then be
grouped into a small number of obstacle regions. For each of these regions, the size
and 3D location must then be computed. The final section of this chapter describes this
process.
6.1. Related Work
Previous obstacle detection systems fall into several different categories:
• Off-road systems. The slow speed of the robot traversing rough terrain implies that detecting obstacles at short range and/or with long processing time is acceptable. The complexity of the environment usually makes long processing times unavoidable. Two examples of this sort of system are presented in [Matthies et al. 95] and [Hébert et al. 97]. The latter contains several different approaches to the problem, using multiple sensors including laser radar and stereo vision.

• Indoor systems. The amount of indoor mobile robot research that includes obstacle detection is incredibly large. In general, however, obstacle detection systems for indoor mobile robots are only designed to detect obstacles at very short
ranges (relative to the 50 to 150 meter ranges dealt with in this thesis). Detection at short range with fast cycle time is necessitated by the fact that indoor environments are often both complex and dynamic. Two examples of such systems are discussed in [Horswill 93] and [Thrun et al. 97].

• On-road systems. A fair amount of research has been done in detecting obstacles on-road. Most of these systems ([Luong et al. 95] is an example using vision; [Langer 97] uses radar) have concentrated on the problem of detecting other vehicles. Other systems such as [Bruyelle & Postaire 93] show promise in being able to detect people on the road, but lack the speed and acuity to detect such obstacles at highway speeds.

Additionally, [Matthies & Grandjean 94] provide an excellent, detailed analysis of obstacle detectability based on the assumption that obstacles will be detected by computing the difference in height along a step edge. The performance of the system described in this thesis exceeds these limits only because the detection method is different.

The lack of a convincing system to detect small objects on the road surface at long range has motivated this thesis work.

One of the primary methods used in the obstacle detection algorithm described in this chapter, the pre-warping so that the ground plane matches, is not a new idea. [Burt et al. 95] contains an excellent summary of the history of this method, tracking it back to [Nishihara 84], who applied a shearing function to compensate for the difference in disparity over the very large templates he was using.

The computation of surface orientation directly from stereo data has been performed previously by Robert [Robert et al. 94]. The system presented in that paper searches the entire space of possible surface orientations and chooses the best match. The problem with this method (as can be seen in the data presented in the paper) is that there is not enough information contained in the image to determine the surface orientation very accurately. By comparing only a small number of planes (two in the case presented here), we have been able to achieve the necessary performance with very little additional computation.
6.2. System Architecture
Our test vehicle is shown in Figure 6-1. Figure 6-2 shows the architecture of the
system that we have built to implement our approach to obstacle detection. Three CCD
cameras with 35mm lenses are arranged in a triangular configuration, mounted on top
of our Toyota Avalon test vehicle. The distance between the outer set of cameras is
about 1.2 meters. The center camera is offset by 0.5 meters horizontally, and 0.3
meters vertically from the rightmost camera.
The computation that is performed is based on that used by the CMU Video Rate Multibaseline Stereo Machine [Kanade et al. 96], as described in previous chapters. Stereo matching is then performed using both the "traditional method" and the "ground plane method", which will be described in subsequent sections of this chapter. Based on the output of both methods, the further step of obstacle detection and localization is performed.
6.3. Approaches to Stereo
In Section 3.2.4, we presented a method for calibrating our set of cameras with very high precision. The results of this method were:

• one full homography matrix for each baseline: $H_b$

• one 3-vector for each baseline, representing the epipole: $e_b$

• one 3-vector for each planar surface: $\bar{n}_p = \left(\frac{n_p}{d_p} - \frac{n_0}{d_0}\right)^T A_0^{-1}$. Note that $\bar{n}_0 = 0$.

Remember that $n_p$ is the unit normal vector to plane p, $d_p$ is the perpendicular distance from the origin, and $A_0^{-1}$ contains camera intrinsic parameters.

The matrix $H_b$ provides a starting point for stereo search for each pixel (it maps points in one image to points in the other). After warping by $H_b$, points located on the corresponding plane are located at the same pixel coordinate in all of the images. The optimized rectification and stereo search methods described in Chapter 5 then allow us to step in one pixel increments along the epipolar line in either direction from this starting point.

Figure 6-1: Toyota Avalon test vehicle

Figure 6-2: Architecture of Stereo Obstacle Detection System (LoG filtering and image rectification for each of the three cameras, followed by stereo matching and obstacle detection/localization)
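Warping a pixel by $H_b$ is an ordinary projective mapping; a minimal sketch (row-major 3x3 layout is an illustrative choice):

```c
/* Map a pixel (c, r) through a 3x3 homography H, as in warping by Hb:
   (x, y, w)^T = H (c, r, 1)^T, then dehomogenize.  H is row-major. */
void apply_homography(const double h[9], double c, double r,
                      double *c_out, double *r_out) {
    double x = h[0] * c + h[1] * r + h[2];
    double y = h[3] * c + h[4] * r + h[5];
    double w = h[6] * c + h[7] * r + h[8];
    *c_out = x / w;   /* w != 0 for points not mapped to infinity */
    *r_out = y / w;
}
```

Points on the plane that $H_b$ corresponds to land at the same coordinates in every warped image, which is what makes it a natural zero-disparity starting point for the search.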
From equation (2-12), we have that

$$H_b = H_\infty + \frac{e_b n^T}{d_n} A^{-1} \qquad \text{(6-1)}$$

for some $n$ and $d_n$. For compactness, let us define $\bar{n}^T = \frac{n^T}{d_n} A^{-1}$. Combining this with equation (3-18), we get

$$\frac{z_b}{z}\begin{pmatrix}c_b\\r_b\\1\end{pmatrix} = \left(H_\infty + e_b\bar{n}^T\right)\begin{pmatrix}c\\r\\1\end{pmatrix} + \frac{d}{s}e_b = H_\infty\begin{pmatrix}c\\r\\1\end{pmatrix} + \left(\bar{n}^T\begin{pmatrix}c\\r\\1\end{pmatrix} + \frac{d}{s}\right)e_b \qquad \text{(6-2)}$$

further combining this with equation (3-17), we get a formula for z, the range to the object being viewed:

$$z = \frac{1}{\bar{n}^T\begin{pmatrix}c\\r\\1\end{pmatrix} + \frac{d}{s}} \qquad \text{(6-3)}$$

so that in general, the range depends on all three of c, r, and d.

Combining this result with equation (2-3) and writing out the components of $\bar{n}$ explicitly, we get
$$\frac{1}{d_n}\left(n_1 x + n_2 y + n_3 z\right) + \frac{d}{s}z = 1 \qquad \text{(6-4)}$$

thus, the equation for surfaces of constant d is simply the equation for a plane in world coordinates. Therefore, the output of the stereo matching algorithm (which is the value for d which matched best at each pixel) is really indicating which of this family of planes the pixel is most likely to belong to.

When attempting to recognize obstacles, there are at least three different obvious approaches to computing stereo. The following three sections will describe these options in more detail.

6.4. Traditional Stereo

Let us assume that $n^T = (0\ 0\ 1)$. This means that the normal to the plane being observed is parallel to the camera axis. Using equation (6-4), we can see that

$$z = \frac{1}{\frac{1}{d_n} + \frac{d}{s}} \qquad \text{(6-5)}$$

As $d_n$ approaches infinity, this converges to the traditional stereo result that

$$z = \frac{s}{d} \qquad \text{(6-6)}$$

and, as we expect, the set of planes of constant d are in fact also planes of constant z. This case is shown in Figure 6-3.

Two examples of stereo computed with this method (where $H_b$ corresponds to a plane whose normal is parallel to the camera axis) are shown in Figure 6-4. In this example, the scene is the inside of a garage. The garage door has calibration targets attached to it. The images have also already been LoG filtered to enhance image texture. Two regions are chosen as examples of stereo matching to illustrate the problems with traditional stereo processing.
Figure 6-3: Planes of constant "disparity" for the "Traditional Stereo" method

Figure 6-4: Traditional Stereo Processing (right and left image regions with their differences for the garage door and floor, and matching error vs. disparity curves "wall_tra.out" and "ground_t.out")

For the example region on the garage door, we see that the regions searched in the stereo matching (shown in detail below the images) match very well. The upper curve on the graph at the bottom shows the matching error (sum of absolute differences, SAD) as a function of the displacement along the epipolar line. This graph shows a strong global minimum at the correct value of 100.
However, the example on the garage floor does not match as well. This is due to
the fact that since the ground is tilted with respect to the camera axis, points which are
higher in the image are actually farther away and thus match at a different location (a
different value of d). This is seen as a difference in the slope of the line on the ground.
The lower curve of the graph shows that the global minimum of the matching error
does not occur at the correct position (which is at a value of around 155).
It is clear from this example that a simple application of traditional stereo tech-
niques will not be sufficient for detecting obstacles on a road surface; points on the
ground such as those shown in the example will produce incorrect results, particularly
in regions where the image texture is low. Since the problem is caused by a difference
in the geometry of the surfaces being observed, the solution to this problem is to com-
pensate for the different geometry.
6.5. “Ground Plane Stereo”
One way to solve the problems described in the previous section is to use an $H_b$
that corresponds to a plane that is similar to what we expect for the ground. In this
case, the set of planes defined by equation (6-4) is somewhat more complicated than
for vertical planes. Figure 6-5 shows what the set of planes of constant d would look
like in the case of an idealized example. In the general case, all of the planes pass
through the same intersection line with the x-y plane, and the value of d controls the
pivot angle about this line. The special case of traditional stereo has this line being
infinitely far downward, causing the set of planes to be vertical and parallel.
An example of stereo computed with an $H_b$ corresponding to a horizontal surface
is shown in Figure 6-6. Both images now appear to be almost identical for pixels
which are on the ground, but pixels which are on a vertical surface such as the wall of
Figure 6-5: Planes of constant “disparity” for the “Ground Plane Stereo” method (height (m) versus distance (m)). Parameters are 1m baseline, 35mm lenses, 1/2” CCD, cameras aligned perfectly and 2m above the ground

Figure 6-6: “Ground Plane Stereo” (left and right image regions with their differences, and matching error versus disparity curves; “wall_gro.out”, “ground_g.out”)
the garage are now warped in much the same way that the ground pixels were warped
in the previous example of traditional stereo. This means of computing stereo is simi-
lar to the tilted horopter method of [Burt et al. 95], except that in our case, instead of
attempting to determine the parameters of the ground plane at each iteration, we use a
single horizontal plane that is fixed relative to the vehicle as the basis to start our ste-
reo search.
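The fixed ground-plane warp amounts to pushing homogeneous pixel coordinates through a single 3x3 homography. A minimal sketch of that mapping is below; the identity matrix is a placeholder, since the real system builds its homography from the camera geometry and the assumed ground plane.

```python
# Sketch of applying a plane-induced homography to a pixel.  The
# identity matrix below is a stand-in, not the system's actual
# ground-plane homography.
def apply_homography(H, c, r):
    """Map pixel (c, r) through the 3x3 homography H (nested lists)."""
    x = H[0][0] * c + H[0][1] * r + H[0][2]
    y = H[1][0] * c + H[1][1] * r + H[1][2]
    w = H[2][0] * c + H[2][1] * r + H[2][2]
    return x / w, y / w   # normalize homogeneous coordinates

I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(apply_homography(I3, 320, 120))   # (320.0, 120.0)
```

Warping a whole image simply applies this mapping (with interpolation) at every pixel, so pixels that actually lie on the assumed plane line up between the two warped views.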
Comparing the results from the ground plane method with the results from the tra-
ditional method, we notice several differences. First, the global minimum of the
matching error curve for the point on the ground (the lower curve) now appears at the
correct location. The value of the error at the minimum is also lower than before, since
it matches better. Second, although the global minimum of the curve for the point on
the door is still at the correct location, the trough of the minimum is much wider, indi-
cating a less certain result. The value at the minimum is larger, indicating that it
doesn’t match as well.
These two examples suggest an alternate approach: if we compute stereo using
both methods, it is possible to determine whether a given point lies on a vertical sur-
face (if the traditional method produces a lower minimum error) or on a horizontal sur-
face (if the ground plane method produces a lower minimum error). The correct
disparity can also be determined from the position of the lower minimum.
Since most obstacles that we are concerned with contain nearly-vertical surfaces,
detecting such obstacles becomes both easier and more reliable using this new method.
One issue that should be addressed is what conditions are necessary for this
method to work reliably. For example, if two surfaces appear in the same image region
(near where the garage door meets the ground, for instance), which surface will be
chosen? The most important factor is the magnitude of the image texture on each sur-
face. Another factor is how close the surface directions are to being vertical or hori-
zontal.
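The decision rule described above can be sketched as follows; the two error curves are made-up stand-ins for real summed-SAD curves, and the dictionary representation is purely illustrative.

```python
# Sketch of the vertical-vs-horizontal decision: run both stereo
# hypotheses and keep the one whose error curve has the lower minimum.
# The curves below are toy data, not real SAD measurements.
def classify_surface(err_traditional, err_ground):
    """Return (label, disparity) from two matching-error curves,
    each a mapping disparity -> summed matching error."""
    d_t = min(err_traditional, key=err_traditional.get)
    d_g = min(err_ground, key=err_ground.get)
    if err_traditional[d_t] < err_ground[d_g]:
        return "vertical", d_t
    return "horizontal", d_g

err_t = {98: 900, 100: 200, 102: 850}   # sharp, low minimum
err_g = {98: 700, 100: 650, 102: 720}   # shallow, higher minimum
print(classify_surface(err_t, err_g))   # ('vertical', 100)
```

The winning hypothesis also supplies the disparity estimate, exactly as the text describes.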
Figure 6-7 shows the results of applying both methods to a typical input image set.
The gray coding in both cases represents the number of pixels of displacement along
the epipolar line (dark is negative, medium gray is zero, and bright is positive). As
Figure 6-7: Example output of both methods (panels: original image, traditional method output, ground plane method output)
expected, the ground plane method does very well on the ground pixels, but poorly on
the wall in the background. Conversely, the traditional method works well on vertical
features such as the lamp post and the wall, but many pixels on the ground surface are
mis-matched.
6.6. Height Stereo
The ground plane method described in the previous section produces output that is
closely related to the height of the objects being viewed. This leads to the question: “is
it possible to compute height directly using stereo vision?”
Let us suppose that

\[ H_b = H_\infty + e_b \frac{n^T}{d_n} A^{-1} \tag{6-7} \]

where $n$ and $d_n$ refer to a horizontal plane. By varying the value of $d_n$, we can pro-
duce homographies for other planes that are parallel to the ground plane. The correct
equation for epipolar search (in analogy with equation (6-2)) then becomes

\[ \frac{z_b}{z} \begin{bmatrix} c_b \\ r_b \\ 1 \end{bmatrix} = \left( H_\infty + e_b n^T \right) \begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + \frac{d}{s}\, e_b n^T \begin{bmatrix} c \\ r \\ 1 \end{bmatrix} = H_\infty \begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + \left( \frac{d}{s} + 1 \right) \left( n^T \begin{bmatrix} c \\ r \\ 1 \end{bmatrix} \right) e_b \tag{6-8} \]

This equation is different from equation (6-2) in one fundamental way. With
equation (6-2), changing d is the same as adding a constant offset to the previous pixel
position. This offset does not depend on the location of a pixel in the image. This con-
trasts with equation (6-8), where the offset is not a constant. This implies that the effi-
cient rectification and stereo search methods that we developed in Chapter 5 are not
useful for this method.
Nevertheless, it is possible to compute height using stereo, by computing the
results of equation (6-8) for each pixel and disparity level and interpolating. The out-
put of this height method is shown in Figure 6-8. The triangular section on the lower
right represents pixels for which all possible matches lay outside the image for this
method.
The results of this simple height stereo method are reasonably good, but it seems to
have problems at long distances (toward the top of the image). In order to understand
why this is, we must consider the quantity $n^T (c\ r\ 1)^T$, which is a scalar that multi-
plies $e$. If the horizon appears in the image, pixels at the horizon will be perpendicular
to the direction of the ground plane, and this scalar will be zero. The effect of this will
be that pixels on the horizon will all map to the same location, regardless of the value
of d. Another effect is that pixels that are close to the horizon will move very little as d
increases, while pixels that are far from the horizon will move much more. Pixels that
are above the horizon will actually move backwards (as if looking at the ground
behind the cameras).
It is clear from this that no single choice of step size for d will give good results for
Figure 6-8: Output with height calibration (panels: intensity image, disparity output)
this method, and the result of that can be seen in the upper portion of Figure 6-8. We
mention the method here only for the sake of completeness.
6.7. Obstacle Detection from Stereo Output
As discussed in section 6.5, our method involves performing two types of stereo
matching (for vertical and horizontal surfaces), and comparing the absolute errors to
determine if a particular image pixel belongs to a vertical or horizontal surface. The
vertical surface result of this is shown in Figure 6-9. The pixels shown in the lower
part of the image are coded by the size of the difference between the minimum errors
found by the two methods. Brighter pixels indicate that the vertical match is much bet-
ter than the ground plane match. Thus pixels which appear white are most likely to be
vertical, and black pixels are most likely to be horizontal.
Figure 6-9: Detected vertical surfaces
Regions of very low texture (such as the black spot in the center of the road) some-
times match well as vertical surfaces, since the amount of signal that can be used to
determine the surface orientation is very small compared to the amount of camera
noise.
In order to remove such false obstacles from consideration, we compute a simple
confidence measure. For regions which are actual vertical surfaces, we expect that the
traditional stereo matching method will return a relatively large number of pixels at
approximately the same depth. Conversely, if a region belongs to a horizontal plane,
we would expect the traditional method to report a number of different depths. Using
standard connected components labeling methods on the disparity image generated
from traditional stereo matching, we get the image of Figure 6-10. The gray level in
this image encodes the size of the region of similar depths to which each pixel belongs.
Large regions appear brighter, and these regions are more likely to be obstacles. By
requiring detected obstacle regions to pass this consistency check, we can remove
most false positive detections.
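A minimal sketch of this consistency measure, assuming 4-connectivity and exact disparity equality (the text only specifies standard connected-components labeling, so these details are illustrative):

```python
from collections import deque

# Sketch of the consistency check: label 4-connected regions of equal
# disparity and record each region's size.  The disparity values below
# are toy data.
def region_sizes(disp):
    """Return an image where each pixel holds the size of its
    constant-disparity 4-connected region."""
    h, w = len(disp), len(disp[0])
    size = [[0] * w for _ in range(h)]
    seen = [[False] * w for _ in range(h)]
    for r0 in range(h):
        for c0 in range(w):
            if seen[r0][c0]:
                continue
            region, queue = [], deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:                      # flood fill one region
                r, c = queue.popleft()
                region.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w and not seen[rr][cc] \
                            and disp[rr][cc] == disp[r0][c0]:
                        seen[rr][cc] = True
                        queue.append((rr, cc))
            for r, c in region:
                size[r][c] = len(region)
    return size

disp = [[5, 5, 9],
        [5, 7, 9],
        [5, 7, 9]]
print(region_sizes(disp))   # [[4, 4, 3], [4, 2, 3], [4, 2, 3]]
```

Pixels in large constant-disparity regions get high values, matching the brighter pixels of Figure 6-10.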
Figure 6-10: Size of regions of constant disparity

Whether a pixel belongs to an obstacle or not is determined by comparing the “ver-
tical surfaces” output of Figure 6-9 to a threshold. If the value of this image is higher
than the threshold, then the pixel is likely to belong to a vertical surface. Then we
check the same pixel location in the image of Figure 6-10, and compare its value in
this image to another threshold. If it passes this test, then it belongs to a region of the
image that has provided consistent results. Pixels that pass both tests are declared to be
candidate obstacle pixels. An example of the detected obstacle output is shown in
Figure 6-11. Obstacles are shown in black. This example shows a 14cm (6”) high
obstacle, which is a piece of wood painted black. The obstacle is roughly 100m in
front of the vehicle.

Figure 6-11: Detected Obstacles

Some other points in the image are also reported as obstacles. The curbs on the
right and left are both identified relatively consistently, as is the building in the back-
ground. The curb in the background is too short and too far away to be reliably
detected.
In order to show that the system is not sensitive to large amounts of texture on the
ground plane, we have also tested in situations such as that shown in Figure 6-12. The
system does not detect any obstacles on the ground of the parking lot, despite the large
amount of image texture provided by the painted lines. The car and trees in the back-
ground are correctly detected in regions where they have sufficient texture.
6.7.1. Sub-pixel interpolation
Since the obstacle detection method described in the previous sections does not
depend on accurately determining the distance to particular pixels in the scene (it
instead attempts to determine the surface orientation at those pixels), sub-pixel accu-
racy in matching is not necessary for the determination of whether an obstacle is
present or not.
On the other hand, if accurate determination of the range to obstacles is desired
then sub-pixel interpolation is necessary, at least for those pixels that have been deter-
mined to lie on the obstacle. In practice, our system has not used sub-pixel interpola-
tion since we have been more concerned with being able to detect the obstacles than
with trying to determine their position. If this sort of obstacle detection system were to
be used on an autonomous vehicle, the accuracy requirements of whatever obstacle
avoidance system is used would determine whether sub-pixel interpolation is neces-
sary or not.
Figure 6-12: Output of system with highly textured ground plane
6.7.2. Computing the two types of stereo efficiently
Since we want to compute both types of stereo (for the “ground plane method” and
the “traditional method”), we need to find an efficient method for doing so. The only
difference between the two methods is the matrices used for rectification. Let us sup-
pose that the matrices $H_b$ refer to the ground plane. The rectification equation is

\[ \alpha_b \begin{bmatrix} c_b \\ r_b \\ 1 \end{bmatrix} = W_b \begin{bmatrix} c \\ r \\ 1 \end{bmatrix} \tag{6-9} \]

where the exact formula for $\alpha_b$ is not important, since its only purpose is to provide a
scale factor for division.
For the ground plane method, we have $W_b = L_b M H_b^{-1}$. For the traditional
method, we have $W'_b = L_b M \left( H_b + e_b n_v^T \right)^{-1}$, where $n_v$ is given by
$n_v^T = \left( \dfrac{n_v}{d_v} - \dfrac{n_g}{d_g} \right)^T A_0^{-1}$. The v subscripted variables refer to the surface normal and dis-
tance to a vertical plane, and the g subscripted variables refer to the ground plane.
Suppose we were to warp image b using both W matrices, producing two warped
images. The mapping between corresponding points in the resulting images would be:
\[ \frac{\alpha_b}{\alpha'_b} \begin{bmatrix} c_b \\ r_b \\ 1 \end{bmatrix} = W_b W'^{-1}_b \begin{bmatrix} c'_b \\ r'_b \\ 1 \end{bmatrix} = L_b M H_b^{-1} \left( H_b + e_b n_v^T \right) M^{-1} L_b^{-1} \begin{bmatrix} c'_b \\ r'_b \\ 1 \end{bmatrix} = \left( I + L_b M H_b^{-1} e_b n_v^T M^{-1} L_b^{-1} \right) \begin{bmatrix} c'_b \\ r'_b \\ 1 \end{bmatrix} \tag{6-10} \]

From equation (5-9), we have that

\[ M H_b^{-1} e_b = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \tag{6-11} \]

Furthermore, condition 3 on page 83 gives us that

\[ L_b \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \tag{6-12} \]

so that equation (6-10) can be simplified to

\[ \frac{\alpha_b}{\alpha'_b} \begin{bmatrix} c_b \\ r_b \\ 1 \end{bmatrix} = \begin{bmatrix} c'_b \\ r'_b \\ 1 \end{bmatrix} + \left( n_v^T M^{-1} L_b^{-1} \begin{bmatrix} c'_b \\ r'_b \\ 1 \end{bmatrix} \right) \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \tag{6-13} \]

Note that the parenthesized expression in this equation is a scalar, so that correspond-
ing pixels between the two images are all located on the same scan line of the image,
offset by the quantity
\[ n_v^T M^{-1} L_b^{-1} \begin{bmatrix} c'_b \\ r'_b \\ 1 \end{bmatrix} \tag{6-14} \]

For convenience, let us define

\[ \eta = \left( n_v^T M^{-1} L_b^{-1} \right)^T \tag{6-15} \]

which is a 3-vector.
For the pixel (c,r) in image 0, with disparity d, in the ground plane method the corre-
sponding pixel will appear in image 1 at (c+d, r). For the traditional method, the corre-
sponding pixel is at

\[ \left( c + d + \eta^T \begin{bmatrix} c + d \\ r \\ 1 \end{bmatrix},\ r \right) \tag{6-16} \]

in the image warped for the ground plane method. As might be expected, the function
of equation (6-14) simply appears in this equation as a disparity offset that depends on
the image location.
In order to implement both types of stereo matching efficiently, it would be nice if
we could re-use some of the intermediate results of one method for the other. In order
to solve this problem, it is helpful to realize that the planes for which we compute ste-
reo do not have to be perfectly vertical or horizontal, as long as they are close enough
to be useful tests of “verticalness” or “horizontalness”. In order to make efficient com-
putation possible, we must choose a vertical plane such that $\eta_0 = 0$ (it will become
clear why in a moment). Intuitively, this is a requirement on the slope of the vertical
plane. What it means is that as we move across a row of the rectified image, the range
to both the vertical and horizontal planes must change at the same rate. In practice,
$\eta_0$ is almost always near to zero anyway, which is a result of the fact that both our vertical
and horizontal planes are nearly parallel to the camera rows, and that our rectification
procedure attempts to warp the images as little as possible. Our solution is to set
$\eta_0 = 0$. If this is the case, the pixel at (c,r) and disparity d will appear at
$(c + d + \eta_1 r + \eta_2,\ r)$.
In the stereo main loop (presented in Section 5.3.3.1), if we compute stereo for the
ground plane case we must compute MATCHING_ERROR(c,r,d) for every possi-
ble value of (c,r) and d. From the above discussion, we can see that this is equivalent
to computing MATCHING_ERROR(c,r,d-($\eta_1 r + \eta_2$)) for the traditional case.
Although this is not an integer disparity for the traditional case, it is a perfectly valid
result that we can use.
Since setting $\eta_0 = 0$ removed the dependency on image columns,
MATCHING_ERROR(c,r,d+1) for the ground plane case is also equivalent to
MATCHING_ERROR(c,r,d+1-($\eta_1 r + \eta_2$)) for the traditional case. By this logic, it
should also be possible to reuse the HORIZONTAL_SUM calculations:
HORIZONTAL_SUM(c,r,d) for the ground plane case is the same as
HORIZONTAL_SUM(c,r,d-($\eta_1 r + \eta_2$)) for the traditional case.
The problem comes in with the VERTICAL_SUM computations. Since $\eta_1$ is not
zero (and the result would not be very interesting if it were), the HORIZONTAL_SUM
computations from different rows refer to different sets of non-integer disparity levels.
The sets are offset by $\eta_1$, which is unlikely to be an integer. As an example, suppose
$\eta_1 = 1.2$. This would produce HORIZONTAL_SUM values for row 0 at disparities of
{..., -2, -1, 0, 1, 2, 3, ...}. For row 1, it would produce values for disparities
{..., -0.8, 0.2, 1.2, 2.2, 3.2, 4.2, ...}, and so on for the rows down through the image.
We solve this problem by making an approximation: for each row of the image, we
round off the set of disparities to the closest integer for the purpose of adding them
together into a VERTICAL_SUM. So, for example, the sums produced for row 1 would
be rounded off and used as if the set of disparities were actually
{..., -1, 0, 1, 2, 3, 4, ...}.
The effect of the previous discussion is that the following pseudo-code is able to
compute correct disparities for the “ground plane method”, and approximate dispari-
ties for the “traditional method”:
    for (outer-loop) {
      for (middle-loop) {
        for (inner-loop) {
          SAD(c,r,d) = MATCHING_ERROR(c,r,d);
          HORIZONTAL_SUM(c,r,d) = HORIZONTAL_SUM(c-1,r,d)
                                + SAD(c,r,d) - SAD(c-WINDOW_WIDTH,r,d);
          VERTICAL_SUM(c,r,d) = VERTICAL_SUM(c,r-1,d)
                              + HORIZONTAL_SUM(c,r,d)
                              - HORIZONTAL_SUM(c,r-WINDOW_HEIGHT,d);
          if (VERTICAL_SUM(c,r,d) < MIN_SSAD(c,r)) {
            MIN_SSAD(c,r) = VERTICAL_SUM(c,r,d);
            RESULT_IMAGE(c,r) = d;
          }
          VERTICAL_SUM_TRAD(c,r,d) = VERTICAL_SUM_TRAD(c,r-1,d-VS_OFFSET(r))
                                   + HORIZONTAL_SUM(c,r,d)
                                   - HORIZONTAL_SUM(c,r-WINDOW_HEIGHT,d-HS_OFFSET(r));
          if (VERTICAL_SUM_TRAD(c,r,d) < MIN_SSAD_TRAD(c,r)) {
            MIN_SSAD_TRAD(c,r) = VERTICAL_SUM_TRAD(c,r,d);
            RESULT_IMAGE_TRAD(c,r) = d;
          }
        }
      }
    }
VS_OFFSET(r) and HS_OFFSET(r) are precomputed from the values of $\eta_1$
and $\eta_2$.
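For reference, a brute-force version of what the windowed-SAD main loop computes can be sketched as below; the running sums, the row-dependent VS_OFFSET/HS_OFFSET, and the second (traditional) accumulator are omitted, and the images are toy data.

```python
# Brute-force reference for the windowed-SAD stereo loop: per-disparity
# absolute differences summed over a square window, keeping the
# lowest-error disparity per pixel.  The incremental-sum optimization
# and the row-dependent offsets are omitted for clarity.
def box_stereo(left, right, max_d, win=1):
    h, w = len(left), len(left[0])
    best = [[0] * w for _ in range(h)]
    min_ssad = [[float("inf")] * w for _ in range(h)]
    for d in range(max_d + 1):
        # pixel-wise absolute difference at this disparity level
        sad = [[abs(left[r][c] - right[r][c - d]) if c - d >= 0 else 10**6
                for c in range(w)] for r in range(h)]
        for r in range(win, h - win):
            for c in range(win, w - win):
                s = sum(sad[rr][cc]
                        for rr in range(r - win, r + win + 1)
                        for cc in range(c - win, c + win + 1))
                if s < min_ssad[r][c]:
                    min_ssad[r][c] = s
                    best[r][c] = d
    return best

# Toy images: the left view is the right view shifted by 2 pixels.
base = [[(7 * r + 13 * c) % 17 for c in range(10)] for r in range(5)]
right = base
left = [[base[r][c - 2] if c >= 2 else 0 for c in range(10)] for r in range(5)]
print(box_stereo(left, right, 4)[2][5])   # 2: the true shift is recovered
```

The incremental HORIZONTAL_SUM/VERTICAL_SUM version in the pseudo-code above produces the same sums while touching each pixel only a constant number of times per disparity level.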
6.8. Obstacle Clustering
In order to be useful to a high-level planner, we need to take the set of pixels that
are found to be obstacles, and reduce it to a small number of obstacle regions. The
method that we have used for doing this is straightforward. First, a simple one-pass
connected-components labeling is performed on the obstacle image (the intersection
of Figure 6-9 and Figure 6-10). While the labeling is being performed, statistics
are maintained for each region, including its size, centroid, mean disparity, maximum
disparity, minimum disparity, and bounding box. Connected regions whose size is
above a certain threshold are declared to be obstacles.
Using the metric calibration methods described in Section 3.3, we can compute 3D
coordinates for the obstacle centroid and mean disparity, or for the closest point. We
can also compute the 3D extent of the obstacle. These parameters are then available to
a higher level module for deciding on and executing appropriate actions.
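The one-pass statistics gathering can be sketched as follows, assuming the pixels have already been labeled; the tuple format and toy values are illustrative, not the system's data structures.

```python
# Sketch of per-region statistics accumulated in a single pass over
# labeled obstacle pixels.  Input format and values are hypothetical.
def region_stats(pixels):
    """pixels: iterable of (label, c, r, d) tuples.  Returns, per label:
    size, centroid, mean/min/max disparity, and bounding box."""
    stats = {}
    for label, c, r, d in pixels:
        s = stats.setdefault(label, {"size": 0, "sum_c": 0, "sum_r": 0,
                                     "sum_d": 0.0, "min_d": d, "max_d": d,
                                     "bbox": [c, r, c, r]})
        s["size"] += 1
        s["sum_c"] += c
        s["sum_r"] += r
        s["sum_d"] += d
        s["min_d"] = min(s["min_d"], d)
        s["max_d"] = max(s["max_d"], d)
        b = s["bbox"]          # [min_c, min_r, max_c, max_r]
        b[0], b[1] = min(b[0], c), min(b[1], r)
        b[2], b[3] = max(b[2], c), max(b[3], r)
    for s in stats.values():
        s["centroid"] = (s["sum_c"] / s["size"], s["sum_r"] / s["size"])
        s["mean_d"] = s["sum_d"] / s["size"]
    return stats

obs = [(1, 10, 5, 40), (1, 11, 5, 42), (1, 10, 6, 41), (2, 50, 8, 12)]
s = region_stats(obs)
print(s[1]["size"], s[1]["mean_d"], s[1]["bbox"])   # 3 41.0 [10, 5, 11, 6]
```

A size threshold on these regions then yields the reported obstacles, and the centroid and mean disparity feed the metric 3D computation.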
6.9. System Parameters
The obstacle detection system has several parameters which can be set at compile
time in order to control different aspects of the system. A summary of those parame-
ters and a rough idea of the effects of changing them is presented here.
LoG filter size: the size of the LoG filter mask is directly related to the value of σ
for the LoG filter. Making σ larger causes the filter to remove more high-frequency
components from the image, as well as making the mask larger and thus causing the
filtering process to be slower. Smaller values of σ allow more high-frequency compo-
nents through the filter, which may allow more noise to pass through.
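For concreteness, a LoG mask can be built from σ as sketched below; the mask half-width of 4σ and the normalization are common conventions and are not necessarily those used in the thesis implementation.

```python
import math

# Sketch of a Laplacian-of-Gaussian mask built from sigma.  The 4*sigma
# half-width and the 1/(pi*sigma^4) normalization are common textbook
# conventions, assumed here for illustration.
def log_kernel(sigma):
    half = int(math.ceil(4 * sigma))
    s2 = sigma * sigma
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = x * x + y * y
            row.append(-(1.0 / (math.pi * s2 * s2))
                       * (1.0 - r2 / (2.0 * s2))
                       * math.exp(-r2 / (2.0 * s2)))
        kernel.append(row)
    return kernel

k = log_kernel(1.0)
print(len(k))        # 9: mask size (and filtering cost) grows with sigma
print(k[4][4] < 0)   # True: strong center lobe of opposite sign to the ring
```

Doubling σ roughly doubles the mask width, which is why larger σ both removes more high-frequency content and slows the filtering.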
LoG filter gain: as discussed in Section 4.3, the gain of the LoG filter controls the
ability of the filter to enhance image texture.
Rectification step size, s: although this has been implicitly set according to the
discussion in Section 3.2.5, it could be made larger in order to perform sub-pixel
matching or smaller in order to reduce the amount of search.
Disparity search range: the range of disparities searched controls how far away
from the ground plane (or other target plane) a point in the world can be, and still be
properly matched by the stereo algorithm. The speed of the stereo matching part of the
algorithm depends linearly on the size of the search space, so it needs to be set to the
smallest possible value that still allows recognition of obstacles under all conditions.
Stereo matching window size: in the process of stereo matching, matching errors
are summed over a window. This amounts to an assumption that all of the pixels
within the window will belong to the same surface. If multiple surfaces appear within
the matching window, the algorithm will usually either lock onto one surface or the
other, or produce a result that is intermediate between the two. In some rare cases, it
can produce a completely incorrect result. Reducing the size of the window also
reduces the size of the patch that is required to belong to the same surface, causing
more pixels to be matched correctly, at the expense of increasing susceptibility to
image noise. Conversely, increasing the window size decreases susceptibility to image
noise, having a smoothing effect, but it increases incorrect matches at the borders
between surfaces.
Vertical surface threshold: this parameter controls how much better a pixel must
match as a vertical surface than as a horizontal surface in order to be considered a can-
didate obstacle. Small values tend to produce many noise points, as seen in Figure 6-9.
If the value is too large, obstacles will not be detected.
Consistency threshold: this parameter controls the size of the region of constant
disparity (in pixels) that a pixel must belong to in order to be considered an obstacle can-
didate. In general, the size of this threshold can be set to any small value in the range
of 5-15 with similar results. If the value is too small, many small erroneous obstacle
regions can appear. If the value is too large, small obstacles may not be detected.
Obstacle clustering threshold: controls the number of adjacent pixels that must
be declared as obstacle pixels in order for an obstacle to be reported. Regions smaller
than 10 pixels tend to be unreliable, so we eliminate them based on their size.
Chapter 7
Obstacle Detection Results
The previous chapters have presented the design and implementation consider-
ations that have gone into our obstacle detection system. This chapter presents the
results of a number of different obstacle detection experiments designed to test our
system under a variety of different conditions.
In order to test the performance of the system with respect to different sizes and
colors of obstacles, we constructed test obstacles out of four common sizes of lumber,
1"x4", 1"x6", 1"x8", and 1"x12". Each type of lumber was cut into pieces that were
12" long, and the pieces were spray painted black, white, or gray, thus producing 12
different obstacles. When used in the tests, the boards were propped up on their edges,
producing obstacles of four different heights, approximately 9 cm, 14 cm, 19 cm, and
29 cm tall (note that the commercial lumber sold in the U.S. as "1x4" is not 4" wide).
Additionally, during testing a number of other objects were used as obstacles.
Although the only such obstacle that will be presented in this section is a 12 oz. (355
ml) Diet Pepsi can, many other objects were also tested. These objects included peo-
ple, bricks, stones, boards lying flat on the road, and paper plates. In general, the sys-
tem performed as expected in that taller obstacles and obstacles with higher contrast
relative to the road surface were detected at longer distances than obstacles that were
shorter or had lower contrast.
The camera system was adjusted and calibrated in the “car barn”, a large garage-
like space on the Carnegie Mellon campus (which appears in the images of
Figure 6-3). The car barn was used because it provides a controlled environment with
a flat floor. Straight lines on the floor are provided by a section of railroad track,
which meets the garage door at a right angle. First, the car was placed at the far end of
the garage, facing the garage door and aligned with the railroad tracks. This places the
cameras about 45 meters from the door. All three cameras were then adjusted so that
the images of the door roughly overlapped, thus ensuring that the (very narrow) cam-
era fields of view would overlap sufficiently to compute stereo disparity for objects
over a wide range.
The calibration was performed using the methods outlined in Chapter 3, both the
weak calibration and the metric calibration. Seven sets of images of the garage door
were taken at five meter intervals from 15 meters to 45 meters. In the 45-meter image,
matching was performed both for the garage door as a vertical plane and for the garage
floor as a horizontal plane. The origin of the vehicle coordinate system was set to be
the point where the left front tire touches the ground. The lateral offsets to the left and
right edges of the door were measured, and those features were used for the metric cal-
ibration.
The tests were performed at a site in the borough of Homestead, near Pittsburgh, Penn-
sylvania. The site is a rarely-used stretch of city street. An overhead diagram of the
site is included in Figure 7-6. The road surface is relatively new asphalt, paved in the
last few years and almost unused, although the asphalt has seen some weathering and
no longer has the black and shiny appearance of fresh asphalt. Tests have also been
performed on concrete roadways; although the results are not included here, there was
no substantial difference in system performance.
The total length of straight road available at the test site prevents testing at dis-
tances exceeding 150 meters. Due to timing difficulties and restrictions on the amount
of data that can be collected at once with our system, many of the data sets taken with
the vehicle in motion begin with the obstacle at distances of 110 meters or less.
7.1. Obstacle Detection System Performance
The stereo processing on the vehicle was performed on a 300 MHz Intel
Pentium II PC with a Matrox digitizer. More recently, we have performed some tests
on a 400 MHz Pentium II processor. The processing times for a set of three 640x240
images, searching 96 disparity levels, and computing both “traditional stereo” and
“ground plane stereo” are shown in Table 7-1. Note that the stereo matching part of the
algorithm sees a large benefit from the increased memory bus speed, indicating that
the performance of that part of the algorithm is probably limited by memory speed.
The overall cycle time of the system installed on the vehicle is on the order of 1.5 sec-
onds per frame.
                                     300 MHz (66 MHz bus)    400 MHz (100 MHz bus)
                                     Pentium II              Pentium II
    LoG Filtering (3 images)         150 ms                  115 ms
    Image Rectification (3 images)   151 ms                  111 ms
    Stereo Matching (both ground
      plane & traditional)           750 ms                  500 ms
    Obstacle Detection               350 ms                  290 ms

Table 7-1: Stereo system benchmarks
7.2. Stereo Range to Detected Obstacles
In order to assess the accuracy of the range measurements obtained for obstacles
from the complete system, we collected data from a stopped vehicle for each of the 12
obstacles at measured ranges. The ranges used went in 10 meter increments from 50
meters to 150 meters. Figure 7-1 shows a plot of the range as measured by hand versus
the range reported by the obstacle detection system. For an obstacle to be reported, a
region containing at least ten candidate obstacle pixels must have been found. The dia-
monds in this graph represent the range to the pixel of the obstacle that reported the
closest range. The plus signs represent the mean of the ranges reported by all of the
pixels. As expected, the measured range is very accurate when the object is close, and
gets increasingly less accurate as the obstacle gets farther away due to the inherent loss
of stereo accuracy at long range. There is one severe outlier in the data at 150 meters,
which is caused by a single pixel mismatch. Even though the closest pixel for the 150
meter data shows up at 80 meters, the fact that the mean distance is still near 150
meters shows that the other pixels were not mismatched. It would thus be possible to
Figure 7-1: Stereo range accuracy (detected range (m) versus actual range (m); data from “rangeacc.out”)
filter out such outlier pixels by using the mean or median range instead of the closest
range to the object.
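Such a filter could be sketched as below; the range values are toy numbers mimicking the single 80 m mismatch among measurements near 150 m.

```python
import statistics

# Sketch of the suggested outlier filter: report the median of the
# per-pixel ranges instead of the closest one.  The ranges below are
# toy values with one mismatched pixel at 80 m.
def reported_range(pixel_ranges):
    return statistics.median(pixel_ranges)

ranges = [149.0, 151.0, 150.0, 148.5, 80.0, 152.0, 150.5]
print(min(ranges))              # 80.0: the closest-pixel rule is fooled
print(reported_range(ranges))   # 150.0: the median ignores the outlier
```

Because an obstacle region contributes many pixels at nearly the same depth, a single mismatched pixel shifts the median far less than it shifts the minimum.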
The obstacle detection performance results on this data set are summarized in
Table 7-2. Checkmarks represent successful detection of the obstacle, whereas X
marks represent a failure to detect the obstacle. This table shows at least two interest-
ing facts. First, we were successfully able to detect obstacles that are 14 cm or taller at
up to 110m. Second, the white and gray obstacles are more difficult to detect because
of the lower contrast between obstacle and road surface pixels. The white obstacles
cannot be detected at all beyond 120 meters, regardless of size.
7.3. Experiments From a Moving Vehicle
A number of further experiments were performed with various obstacles while the
vehicle was moving. The cycle time for the computers installed on the vehicle is 1.5
seconds, but we want to get frequent depth measurements in anticipation of hardware
that can run the algorithm at a faster rate. Accordingly, we recorded the image data to
the hard disk at a faster rate (either 4 frames per second or 15 frames per second), and
processed the data off-line.
            Black                    Grey                     White
            9cm  14cm 19cm 30cm      9cm  14cm 19cm 30cm      9cm  14cm 19cm 30cm
     50m    ✓    ✓    ✓    ✓        ✓    ✓    ✓    ✓        ✓    ✓    ✓    ✓
     60m    ✓    ✓    ✓    ✓        ✕    ✓    ✓    ✓        ✓    ✓    ✓    ✓
     70m    ✓    ✓    ✓    ✓        ✓    ✓    ✓    ✓        ✓    ✓    ✓    ✓
     80m    ✓    ✓    ✓    ✓        ✓    ✓    ✓    ✓        ✓    ✓    ✓    ✓
     90m    ✓    ✓    ✓    ✓        ✕    ✓    ✓    ✓        ✓    ✓    ✓    ✓
    100m    ✓    ✓    ✓    ✓        ✕    ✓    ✓    ✓        ✓    ✓    ✓    ✓
    110m    ✓    ✓    ✓    ✓        ✕    ✓    ✓    ✓        ✕    ✓    ✓    ✓
    120m    ✓    ✕    ✓    ✓        ✓    ✕    ✓    ✓        ✕    ✓    ✓    ✓
    130m    ✕    ✕    ✓    ✕        ✕    ✓    ✓    ✓        ✕    ✕    ✕    ✕
    140m    ✓    ✕    ✓    ✓        ✓    ✓    ✓    ✓        ✕    ✕    ✕    ✕
    150m    ✓    ✓    ✓    ✓        ✓    ✓    ✓    ✓        ✕    ✕    ✕    ✕

Table 7-2: Obstacle Detection Results (✓ = detected, ✕ = not detected)
For each of the runs presented in this section, the vehicle was driven toward the
obstacle at a roughly constant speed of between 10 and 25 miles per hour (the upper
restriction occurs because it is the speed limit at our test site).
The system returns several (on the order of 5-15) obstacle regions per frame, corre-
sponding to other objects such as the curbs, lampposts, and buildings, as well as the
obstacle that we have placed. The graphs of this section show segmented results, so
that only the detected obstacles that correspond to the desired obstacle are shown. This
allows us to examine whether the system can detect the obstacle at a given range. A
brief analysis of the other objects that are detected will occur in the following section.
Figure 7-2 shows an example trace of an obstacle detection run. The vehicle
moved towards a 30 centimeter (12”) high white obstacle of the type shown in
Figure 6-6. The data was taken at 4 frames per second. The obstacle is detected in all
but one frame of the data, out to a maximum range of approximately 110 meters
(which is the beginning of the data set).
Figure 7-3 shows a similar trace, this time for a 14 centimeter (6”) black obstacle.
Figure 7-2: Detection trace for 30cm obstacle. [Plot of range (m) versus frame number (4 fps).]
The density of the data is higher because the images were collected at 15 frames per
second. Again, the obstacle is detected reliably from the beginning of the data set
(around 110 meters) until the end of the data set.
The system reaches its limitations when viewing a 9 centimeter (4”) white obstacle,
as in Figure 7-4. The obstacle is not reliably detected until about 40 meters (though it
is detected for one frame at about 55 meters).
Since these results were surprisingly good, we decided to attempt a more difficult
obstacle. Figure 7-5 shows the same type of trace, this time for a standard 12 oz.
(350ml) soda can, which is mostly white. The soda can is first reliably detected at 57
meters.
7.4. Other Detected Objects
Each of the previous examples has shown only the detections that actually repre-
sented the obstacle. Of course, there are likely to be many objects in the world that sat-
isfy our obstacle detection algorithm, which is just looking for surfaces that
Figure 7-3: Detection trace for 14cm obstacle. [Plot of range (m) versus frame number (15 fps).]
consistently appear to be closer to vertical than horizontal. In fact, the system does
detect a large number of objects. A full trace is shown in Figure 7-6. This trace is the
same data set as shown in Figure 7-2, along with an example image from the set and a
diagram showing an overhead view of the scene. The detections can be divided (in this
Figure 7-4: Detection trace for 9cm obstacle. [Plot of range (m) versus frame number (4 fps).]
Figure 7-5: Detection trace for soda can. [Plot of range (m) versus frame number (4 fps).]
case, by hand) into three sets, representing the obstacle, the curb behind the obstacle,
and the building in the background. While this is not an analytical result, it is a con-
vincing argument that the number of false positive obstacles reported from the system
is not high. Perhaps more importantly, there are no false detections that are closer than
the obstacle, implying that the system has a low false positive rate for pixels that corre-
spond to a plain asphalt road surface.
7.5. Lateral Position and Extent
In addition to the range data, the stereo system can also provide us with informa-
tion about the 3D position and extent of the obstacle. This sort of data is shown in
Figure 7-7. The LoG filtering and the width of the SAD summation window tend to
make obstacles appear larger than they really are. The obstacle in this example is about
35 centimeters wide, but we detect its extent to be around 50 centimeters. The strange
trajectory that the obstacle appears to follow is actually due to a mid-course correction
Figure 7-6: Other detected points. [Plot of range (m) versus frame number (4 fps), shown with an example image and an overhead diagram of the scene labelling the road, curb, vehicle, obstacle, and building.]
that the driver made to keep the obstacle from leaving the field of view of the system
as the vehicle approached the obstacle.
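The widening effect described above can be illustrated with a quick back-of-the-envelope calculation. The sketch below is illustrative only: the focal length and total edge smear are assumed numbers, not the system's actual parameters.

```python
def apparent_extent_m(true_width_m, range_m, focal_px, smear_px):
    """Approximate detected extent of an obstacle: the LoG filter support and
    the SAD summation window smear the obstacle's edges outward by roughly
    smear_px pixels in total, and each pixel subtends range_m / focal_px
    meters at the obstacle's range."""
    meters_per_pixel = range_m / focal_px
    return true_width_m + smear_px * meters_per_pixel

# Hypothetical numbers: a 35 cm obstacle seen at 50 m with a 2000-pixel
# focal length and ~6 pixels of total edge smear appears about 50 cm wide.
w = apparent_extent_m(0.35, 50.0, 2000.0, 6)
```

With a telephoto lens (large focal length in pixels) the per-pixel smear in meters stays small even at long range, which is consistent with the modest 35 cm to 50 cm inflation observed.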
7.6. Night Data
In addition to the data shown so far, we have also done experiments at night. The
results of these experiments are shown in Figure 7-8 and Figure 7-9. Figure 7-8 shows
detection of a white 14 centimeter (6”) obstacle. The points marked with “+” are from
images collected with high beams, and the points marked with diamonds are with low
beams. This obstacle was detectable at 100 meters with high beams, and at about 55
meters with low beams. Figure 7-9 shows the results for a black 14 centimeter obstacle
under the same conditions. As expected, the black obstacle is much more difficult to
detect — 60 meters with the high beams, and about 37 meters with the low beams.
Qualitatively, at night-time the system is able to detect obstacles at about the same
time that the obstacle becomes visible in the image to a human. It is questionable
whether a human could identify the object as an obstacle from monocular image data
alone. As an example, the first image in which the black obstacle was detected with
Figure 7-7: Detected obstacle extent and trajectory. [Plot of lateral position (m) versus range (m).]
low beams in Figure 7-9 is shown in Figure 7-10. The black obstacle in the center of
the road is just barely visible; a gray obstacle is also visible on the right side of the
road.
Figure 7-8: Detection trace for 14cm white obstacle at night. “+” data is with high beams, diamonds are with low beams. [Plot of range (m) versus frame number (4 fps).]
Figure 7-9: Detection trace for 14cm black obstacle at night. “+” data is with high beams, diamonds are with low beams. [Plot of range (m) versus frame number (4 fps).]
7.7. Repeated Experiments
In an attempt to show how the probability of detection for a particular obstacle can
be quantified, we have performed a set of repeated experiments. As in Section 7.3, the
obstacle was placed on the road surface, and the car was driven toward the obstacle
while data was collected at 4 frames per second. The data collection continued through
12 different passes toward the obstacle.
Table 7-3 shows the accumulated results of the 12 test runs with a 9 centimeter
(4”) tall black obstacle. Over 1000 sets of images from the three cameras were col-
lected and processed. For each frame, the distance to the obstacle was determined by
our algorithm. If the obstacle was not detected, the ranges reported from frames before
and/or after were interpolated to determine an approximate range. Using the results of
Range bin     Total frames   Frames detected   Percent detected
<30 m              86               86               100
30-40 m            50               50               100
40-50 m            71               71               100
50-60 m            70               70               100
60-70 m            85               84                98.8
70-80 m            81               81               100
80-90 m            93               85                91.3
90-100 m           92               76                82.6
100-110 m         119               99                83.2
110-120 m          91               60                65.9
120-130 m          48               28                58.3
>130 m             62               32                51.6

Table 7-3: Results of repeated experiments
Figure 7-10: First frame in which the black obstacle was detected at night. The black obstacle is in the center, and a gray obstacle is also visible to the right.
this procedure, we classified each frame into one of 12 bins depending on the detected
range. Each such image represents an opportunity to detect an obstacle within a given
range of distances. The row of Table 7-3 marked “total frames” shows the total num-
ber of image frames that were classified into each bin. The next row shows the number
of frames for which the obstacle was detected. The last row then shows the percentage
of frames in which the obstacle was detected.
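The binning and percentage computation can be sketched as follows. This is a hypothetical reimplementation of the analysis procedure, not the code actually used for the thesis.

```python
BIN_EDGES = [30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130]  # meters

def bin_index(range_m):
    """Index of the 12 range bins: <30m, 30-40m, ..., 120-130m, >130m."""
    i = 0
    for edge in BIN_EDGES:
        if range_m >= edge:
            i += 1
    return i

def detection_rates(frames):
    """frames: iterable of (range_m, detected) pairs, one per image frame.
    Returns a list of per-bin (total frames, frames detected, percent)."""
    total, hits = [0] * 12, [0] * 12
    for range_m, detected in frames:
        b = bin_index(range_m)
        total[b] += 1
        hits[b] += bool(detected)
    return [(t, h, round(100.0 * h / t, 1) if t else None)
            for t, h in zip(total, hits)]

# Reproducing the 60-70m bin of Table 7-3: 85 frames, 84 detected -> 98.8%
rates = detection_rates([(65.0, True)] * 84 + [(65.0, False)])
```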
Chapter 8
Conclusions
8.1. Contributions of This Thesis
The primary contribution of this thesis is an obstacle detection system that uses
trinocular stereo to detect very small obstacles at long range on highways. The system
makes use of the apparent orientation of surfaces in the image in order to determine
whether pixels belong to vertical or horizontal surfaces. A simple confidence measure
is applied to reject false positives introduced by image noise. The system is capable of
detecting objects as small as 14cm high at ranges well in excess of 100m. To my
knowledge, no existing system is capable of this level of performance.
In order to make the obstacle detection system function, several other contribu-
tions have been made:
High Precision Calibration Methods. The calibration methodology presented in
Chapter 3 provides a simple method for computing weak calibration parameters, based
only on multiple views of planar surfaces. The precision is increased by the addition of
multiple surfaces. Additionally, an easy method for extending the weak calibration to a
full metric calibration is presented. This method can be applied to the same data used
for weak calibration with the addition of a small number of measurements.
Rectification for Efficient Three Camera Stereo. Chapter 5 presents a method
for rectifying a three camera system so that stereo disparity can be computed effi-
ciently. The method works by the application of constraints to the warping functions
used for rectification.
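A rectifying warp of this kind is a 2D projective transform of pixel coordinates. As a minimal sketch (the matrix entries would come from the constrained solution described above; the identity matrix below is only a placeholder):

```python
def warp_point(H, x, y):
    """Apply a 3x3 projective warping matrix H to pixel (x, y),
    returning the rectified pixel position."""
    xw = H[0][0] * x + H[0][1] * y + H[0][2]
    yw = H[1][0] * x + H[1][1] * y + H[1][2]
    w  = H[2][0] * x + H[2][1] * y + H[2][2]
    return xw / w, yw / w

# With the identity warp, pixels are unchanged.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pt = warp_point(IDENTITY, 3.0, 4.0)
```

The point of constraining the three cameras' warps jointly is that, after rectification, corresponding points lie along axis-aligned scanlines, so the disparity search reduces to a simple shift.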
Analysis of Memory and Cache Usage of Stereo Algorithms. Implementation of
stereo on multi-purpose CPUs (as opposed to special-purpose hardware) requires
some attention to how memory and the CPU cache are used. Chapter 5 presents an
analysis of three different variations on the stereo algorithm with respect to their cache
usage, including benchmarks that support my calculations.
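To make the loop-ordering distinction concrete, here is a scalar sketch of two orderings of a pixelwise matching loop, named after the thesis's (row, column, disparity) notation. This is not the benchmarked MMX code, and the windowed SAD summation is omitted; the two functions produce identical disparities but stream memory very differently: the (d,r,c) version makes one sequential pass over both images per disparity, so its working set per pass is small.

```python
def disparity_rcd(L, R, ndisp):
    """(row, col, disp) ordering: for each pixel, scan all candidate
    disparities before moving on. Touches ndisp pixels of R per output."""
    rows, cols = len(L), len(L[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(ndisp, cols):
            best, best_d = None, 0
            for d in range(ndisp):
                cost = abs(L[r][c] - R[r][c - d])
                if best is None or cost < best:
                    best, best_d = cost, d
            out[r][c] = best_d
    return out

def disparity_drc(L, R, ndisp):
    """(disp, row, col) ordering: one disparity at a time over the whole
    image, keeping a running best-cost image."""
    rows, cols = len(L), len(L[0])
    best = [[None] * cols for _ in range(rows)]
    out = [[0] * cols for _ in range(rows)]
    for d in range(ndisp):
        for r in range(rows):
            for c in range(ndisp, cols):
                cost = abs(L[r][c] - R[r][c - d])
                if best[r][c] is None or cost < best[r][c]:
                    best[r][c], out[r][c] = cost, d
    return out

# The two orderings agree on a small synthetic image pair.
L = [[3, 1, 4, 1, 5, 9, 2, 6], [2, 7, 1, 8, 2, 8, 1, 8]]
R = [[5, 3, 1, 4, 1, 5, 9, 2], [9, 2, 7, 1, 8, 2, 8, 1]]
d1 = disparity_rcd(L, R, 3)
d2 = disparity_drc(L, R, 3)
```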
Efficient Calculation of Stereo to Test Surface Orientation. The method pre-
sented in Section 6.7.1 allows efficient computation of stereo for multiple hypothe-
sized surface orientations at once. The results of this can be used to decide which
surface orientation is most likely.
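Conceptually (this is a simplification, not the simultaneous computation of Section 6.7.1), testing a surface orientation amounts to scoring a window whose tested disparity varies with image row: zero variation hypothesizes a vertical surface, while a fixed row-to-row change hypothesizes the receding ground plane. A sketch over a one-pixel-wide vertical strip:

```python
def orientation_cost(L, R, r, c, d, ddrow, half):
    """SAD over a vertical strip centered at (r, c), where the disparity
    tested at row r+dr is d + ddrow*dr. ddrow = 0 corresponds to a vertical
    surface; ddrow > 0 to the ground plane."""
    cost = 0
    for dr in range(-half, half + 1):
        dd = int(round(ddrow * dr))
        cost += abs(L[r + dr][c] - R[r + dr][c - d - dd])
    return cost

# Synthetic pair: every row of R is the matching row of L shifted by a
# constant disparity of 2, i.e. the patch behaves like a vertical surface.
L = [[(x * x + 3 * y) % 17 for x in range(12)] for y in range(5)]
R = [[row[(x + 2) % 12] for x in range(12)] for row in L]
vertical = orientation_cost(L, R, 2, 6, 2, 0.0, 2)
ground = orientation_cost(L, R, 2, 6, 2, 1.0, 2)
```

Classifying the pixel then reduces to comparing the two costs; here the vertical hypothesis wins, as it should for a constant-disparity patch.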
Implementation in “Slow Real-Time”. The entire obstacle detection system has
been implemented in Intel Pentium MMX assembly language to achieve cycle times
of around 1 second. It has been integrated into a complete obstacle detection and track-
ing system, and demonstrated running live on our test vehicle at speeds of up to 25
MPH.
8.2. Future Work
There are a number of logical directions in which this work could be extended.
8.2.1. Determining More Orientations
An obvious extension of this work would be to adapt the algorithm to compute the
best match out of a number of possible surface orientations (instead of only vertical
and horizontal). The number of orientations that can be distinguished is a function of
both the available surface texture and the window size.
8.2.2. Test in an Offroad Environment
A number of research groups are building cross-country navigation systems. These
systems also need to detect and avoid obstacles. My obstacle detection system should
be tested to determine if it continues to function well in such an environment. In par-
ticular, in the presence of a highly textured environment the choice of window sizes
may become much more important, since a large window may overlap several regions
with different surface orientations. The solution presented in this thesis works well
with a large window size because the ground is relatively bland, so that the texture on
the obstacle dominates. When this is not the case, the system may not be able to detect
such small obstacles.
8.2.3. Use Temporal Information
The obstacle detection system currently views each frame of video as if it were a
completely new situation, independent of what came before. A method for directing
the stereo search to parts of the image that are likely to belong to the road surface, and
particularly to those regions where obstacles are predicted to appear from past data,
could be used to increase both processing speed and accuracy.
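The simplest form of this idea can be sketched as follows: the vehicle's own motion predicts where a static obstacle should appear in the next frame, and the standard disparity relation d = f·B/Z turns a predicted range window into a restricted disparity search. The speed, focal length, and baseline below are hypothetical, not the system's parameters.

```python
def predict_range(range_m, speed_mps, dt_s):
    """Predicted range to a static obstacle one frame later, given the
    vehicle's forward speed."""
    return range_m - speed_mps * dt_s

def disparity_gate(pred_range_m, focal_px, baseline_m, tol_m=5.0):
    """Disparity interval (d_far, d_near) covering pred_range_m +/- tol_m,
    using d = focal_px * baseline_m / Z."""
    d_far = focal_px * baseline_m / (pred_range_m + tol_m)
    d_near = focal_px * baseline_m / max(pred_range_m - tol_m, 1.0)
    return d_far, d_near

# At 10 m/s and 4 fps, an obstacle at 100 m is expected at 97.5 m next frame.
z = predict_range(100.0, 10.0, 0.25)
gate = disparity_gate(z, 2000.0, 1.0)
```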
A more complicated system could combine the current system with data from
vehicle sensors to build an accurate model of the road in front of the vehicle over time,
perhaps even including super-resolution textures.
8.2.4. Obstacle Avoidance
After detecting the obstacles, we of course need to avoid hitting them. This is very
much an open research problem. Once the obstacle has been detected, an appropriate
course of action must be decided. This course of action is a function of (at least) the
size and position of the obstacle, the speed of the vehicle, weather conditions, vehicle
maneuverability, and the state of other vehicles in the vicinity. The options may
include swerving, changing lanes, stopping, straddling the obstacle, slowing down, or
even hitting the obstacle.
8.2.5. Further Optimizations and Speed Enhancements
Although these are not research topics per se, the obstacle detection system could
benefit from another pass or two of optimization. The following paragraphs highlight
some of the places where optimization is likely to be fruitful.
In accordance with the results derived in Section 5.3.3.7, a speed improvement of
approximately 50% is possible by simply reducing the amount of data (image size and
number of disparities) that is processed in one chunk. Since it is necessary to continue
to process large images and large numbers of disparities, a method that allows efficient
division of the problem into smaller problems without introducing a lot of overhead
would be necessary to take advantage of this.
Additionally, since the cache requirements of the (d,r,c) algorithm are much
smaller than either of the other options, a fast method of implementing this algorithm
with SIMD instructions has the potential of running much faster than the current
implementation.
The optimization of Section 6.7.1, while much faster than computing stereo sepa-
rately for the two different surface orientations, runs slowly because it performs a
large number of unaligned accesses on the Pentium II processor. I am not convinced
that there is no way to avoid this.
Further optimization should also be possible in the LoG filtering process. This
operation by itself should be possible at near frame rate. If that does not seem possible
in software, then special-purpose hardware for performing 2D convolutions could be
employed. Similarly, the rectification process is nothing but a 2D projective warping
of the image. This process is very common in 3D rendering (for texture mapping), and
thus very fast and cheap graphics hardware to perform this operation is available. It
would not be surprising if faster software implementations also existed.
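For reference, the LoG kernel itself is cheap to construct; the sketch below builds the standard discrete Marr-Hildreth operator. The kernel size and sigma are illustrative choices, not the values used by the system.

```python
import math

def log_kernel(sigma, size):
    """Discrete Laplacian-of-Gaussian kernel, shifted to zero mean so the
    filter responds with zero on constant (textureless) regions."""
    h = size // 2
    s2 = sigma * sigma
    k = [[0.0] * size for _ in range(size)]
    for y in range(-h, h + 1):
        for x in range(-h, h + 1):
            r2 = x * x + y * y
            k[y + h][x + h] = ((r2 - 2 * s2) / (s2 * s2)) * math.exp(-r2 / (2 * s2))
    # Subtract the mean introduced by truncating the infinite support.
    m = sum(map(sum, k)) / (size * size)
    for row in k:
        for i in range(size):
            row[i] -= m
    return k

k = log_kernel(1.0, 9)
total = sum(map(sum, k))
```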
Very little effort has been made to optimize the section of the code that takes the
output of stereo matching and finds the obstacle regions. Since this code takes a signif-
icant fraction of the time spent by the obstacle detection algorithm, it may be worth-
while to take another look at it to see what optimizations are possible. Since many of
the algorithms used are common vision techniques (such as connected components
labelling), optimized libraries may be available.
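Connected components labelling on a binary obstacle mask can be sketched as below. This is a generic flood-fill version for illustration, not the optimized library routine suggested above.

```python
from collections import deque

def label_components(mask):
    """4-connected component labelling of a binary mask via BFS flood fill.
    Returns (number of components, label image with labels 1..n)."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    n_labels = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not labels[r][c]:
                n_labels += 1
                labels[r][c] = n_labels
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        yy, xx = y + dy, x + dx
                        if (0 <= yy < rows and 0 <= xx < cols
                                and mask[yy][xx] and not labels[yy][xx]):
                            labels[yy][xx] = n_labels
                            queue.append((yy, xx))
    return n_labels, labels

# Two separate blobs -> two candidate obstacle regions.
n, lab = label_components([[1, 1, 0, 0],
                           [0, 0, 0, 1],
                           [0, 0, 1, 1]])
```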