A High-Performance Stereo Vision System for Obstacle Detection
Todd A. Williamson
September 25, 1998
CMU-RI-TR-98-24
Robotics Institute
Carnegie Mellon University
Pittsburgh, PA 15213
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
©1998 Todd A. Williamson
This research was partially sponsored by the collaborative agreement between Carnegie Mellon University and Toyota Motor Corporation.
Abstract
Intelligent vehicle research to date has made great progress toward true autonomy. Integrated systems for on-road vehicles, which include road following, headway maintenance, tactical-level planning, avoidance of large obstacles, and inter-vehicle coordination, have been demonstrated. One of the weakest points of current automated cars, however, is the lack of a reliable system to detect small obstacles on the road surface. In order to be useful at highway speeds, such a system must be able to detect small (~15cm) obstacles at long ranges (~100m), with a cycle rate of at least 2 Hz.

This dissertation presents an obstacle detection system that uses trinocular stereo to detect very small obstacles at long range on highways. The system makes use of the apparent orientation of surfaces in the image in order to determine whether pixels belong to vertical or horizontal surfaces. A simple confidence measure is applied to reject false positives introduced by image noise. The system is capable of detecting objects as small as 14cm high at ranges well in excess of 100m.

The obstacle detection system described here relies on several factors. First, the camera system is configured in such a way that even small obstacles generate detectable range measurements. This is done by using a very long baseline, telephoto lenses, and rigid camera mounts. Second, extremely accurate calibration procedures allow accurate determination of these range differences. Multibaseline stereo is used to reduce the number of false matches and to improve range accuracy. Special image filtering techniques are used to enhance the very weak image textures present on the road surface, reducing the number of false range measurements. Finally, a technique for determining the surface orientation directly from stereo data is used to detect the presence of obstacles.

A system to detect obstacles is not useful if it does not run in near real-time. In order to improve performance, this dissertation includes a detailed analysis of each stage of the stereo algorithm. An efficient method for rectifying images for trinocular stereo is presented. An analysis of memory usage and cache performance of the stereo matching loop has been performed to allow efficient implementation on systems using general-purpose CPUs. Finally, a method for efficiently determining surface orientation directly from stereo data is described.
Acknowledgements

First of all I want to thank my advisor, Chuck Thorpe, for his unending patience and guidance, particularly when I was getting into a level of mathematical detail that was tedious to us both. My decision to take a two-year leave of absence in Japan did not faze him in the least, and he was never less than supportive.

I also want to express my thanks to Martial Hébert (note the oft overlooked accent) for sharing his great knowledge of projective geometry, stereo vision, and obstacle detection with me. He has spent several hours of his life explaining things to me that were perhaps better learned elsewhere; for this I am grateful.

I feel that the environment in the Vision and Autonomous Systems Center here at CMU, and the Robotics Institute of which VASC is a part, have both contributed greatly to my research. Whenever a research problem arose, I always felt that I could go and ask practically anyone about it, and if they didn't know the answer, they could point me towards someone who did. I have a feeling that it is this sort of environment that I will miss most when I leave CMU; hopefully I can be instrumental in fostering a similar environment wherever I go.

Of my colleagues in VASC and RI, I want to express particular thanks to Dave LaRose, for much insight into how electrical engineers think about computer vision problems. I think that many people in computer vision who have a computer science background could benefit from a more thorough understanding of signal processing principles. Similarly, John Hancock did a lot of thinking about the obstacle detection problem before I even decided to make it my thesis topic, and I benefitted both directly and indirectly from the work that he has done. Other people who have provided me with invaluable advice (both technical and personal) include Jennie Kay, Conrad Poelman, Jeff Schneider, Bill Ross, Toshihiko Suzuki, Stuart Fairley, and Parag Batavia.

Finally, I want to express thanks to my family. My mother, who returned to graduate school at the same time that I was finishing high school, blazed the trail for me to follow. She made it look easy. My father has continually expressed a confidence in me that I often felt was unfounded, but I appreciate it greatly. Finally, I want to thank my wife Hiroko, who has followed me to Pittsburgh from Tokyo, and dealt with incredible culture shock, in order for me to complete my Ph.D. She has also dealt with our limited funds and a lot of uncertainty for our future, and for that I am thankful.
Contents
Abstract i
Acknowledgements iii
Contents v
1 Introduction 1
1.1 Background. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Intelligent Vehicle Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Obstacle Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Stereo Vision for Obstacle Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5.1 Traditional Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5.2 “Ground Plane Stereo” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5.3.1 Multibaseline (Trinocular) Stereo . . . . . . . . . . . . . . . . . . . . . . . 10
1.5.3.2 Laplacian of Gaussian (LoG) Filtering . . . . . . . . . . . . . . . . . . . 12
1.5.4 Obstacle Detection from Stereo Output . . . . . . . . . . . . . . . . . . . . . . 14
1.5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.6 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2 Mathematical Fundamentals 23
2.1 Mathematics of Stereo Vision. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1.1 Homography Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.2 Fundamental Matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.3 Relationship Between Homography Matrices . . . . . . . . . . . . . . . . . . . . 29
3 Calibration 31
3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Weak Calibration of Multibaseline Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Image Warping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Computing Homography Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.3 Finding the Epipole . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.4 Improving Accuracy of Recovered Parameters . . . . . . . . . . . . . . . . . . . 40
3.2.5 Stereo Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Global (metric or Euclidean) calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.1 Practical and Accurate Metric Calibration. . . . . . . . . . . . . . . . . . . . . . . 45
3.4 Summary of the Calibration Method Steps. . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4 Stereo Algorithm 51
4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Multibaseline Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 LoG Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Rectification and Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5 Stereo Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.6 Sub-pixel Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5 Implementation 65
5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 CMU Video-Rate Multibaseline Stereo Machine. . . . . . . . . . . . . . . . . . . . . . 66
5.2.1 LoG Filter and Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2.2 Rectification (Geometry Compensation) . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2.3 Stereo Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.4 Stereo Machine Performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3 Software Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3.1 Multibaseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3.2 LoG Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3.2.1 Determining LoG Filter Coefficients . . . . . . . . . . . . . . . . . . . . . . 72
5.3.3 Rectification and Stereo Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3.3.1 The stereo matching main loop. . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3.3.2 Rectification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.3.3 Rectification strategy for the (r,d,c) ordering . . . . . . . . . . . . . . . . 83
5.3.3.4 Rectification strategy for the (r,c,d) ordering . . . . . . . . . . . . . . . . 84
5.3.3.5 Rectification strategy for the (d,r,c) ordering: . . . . . . . . . . . . . . . 86
5.3.3.6 Computing the Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3.3.7 Memory Use in Stereo Matching . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.3.8 Benchmarks for the (r,c,d) case . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3.4 CPU-Specific Implementation Issues . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6 Obstacle Detection 97
6.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3 Approaches to Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.4 Traditional Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.5 “Ground Plane Stereo” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.6 Height Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.7 Obstacle Detection from Stereo Output . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.7.1 Computing the two types of stereo efficiently . . . . . . . . . . . . . . . . . 113
6.8 Obstacle Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7 Obstacle Detection Results 121
7.1 Obstacle Detection System Performance . . . . . . . . . . . . . . . . . . . . . . . . 123
7.2 Stereo Range to Detected Obstacles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.3 Experiments From a Moving Vehicle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.4 Other Detected Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.5 Lateral Position and Extent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.6 Night Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.7 Repeated Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
8 Conclusions 135
8.1 Contributions of This Thesis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
8.2.1 Determining More Orientations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8.2.2 Test in an Offroad Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8.2.3 Use Temporal Information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8.2.4 Obstacle Avoidance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
8.2.5 Further Optimizations and Speed Enhancements . . . . . . . . . . . . . . . . 138
Bibliography 141
Chapter 1
Introduction
This dissertation presents an obstacle detection system that uses trinocular stereo
to detect very small obstacles at long range on highways. The system makes use of the
apparent orientation of surfaces in the image in order to determine whether pixels
belong to vertical or horizontal surfaces. A simple confidence measure is applied to
reject false positives introduced by image noise. The system is capable of detecting
objects as small as 14cm high at ranges well in excess of 100m.
1.1. Background
Until the invention of mechanical vehicles, most transportation systems possessed some degree of autonomy. In order to be a good beast of burden, an animal not
only has to be strong enough to carry the load, but it also must be intelligent enough to
follow a path, avoid colliding with things, and not run off of cliffs. This degree of
autonomy was lost with the transition to human-controlled mechanical vehicles. Since
the 1960s, a number of research groups around the world have been attempting to
restore some of this intelligence.
There are several good reasons to develop intelligent vehicles. Perhaps the first
reason that occurs to most people is convenience. Although many people enjoy driving
to some extent, almost everyone finds the driving task tedious at times. The idea of
getting into a car, programming it for the desired destination, and then relaxing while
in transit thus holds some appeal.
Perhaps a more compelling reason to build intelligent vehicles is to solve traffic
problems. If such a car existed, it should be able to drive much more precisely than a
human can. With increased precision, cars can travel faster and closer together, effec-
tively increasing the capacity of existing roadways.
The most compelling reason for adding autonomous capability to automobiles is
surely increased safety. Government studies attribute 96.2% of accidents in the United
States to driver error [Treat et al. 79]. Many of these accidents could be avoided with
autonomous vehicle technology, either by controlling the car to avoid the accident, or
by warning the driver of a dangerous situation so that she can take appropriate action.
1.2. Intelligent Vehicle Research
For the purposes of definition, an Intelligent Vehicle is a vehicle equipped with
sensors and computing that allow it to perceive the world around it, and to decide on
appropriate action. If the vehicle is also equipped with actuators, the vehicle may be
completely or partially computer-controlled. In the absence of such actuators, the sys-
tem may act in a warning capacity.
Research in intelligent vehicles has a long history. Various research groups experi-
mented with limited automation using analog electronics as early as 1960
([Gardels 60], [Oshima et al. 65]). However, real progress in the problem was not
made until inexpensive cameras and computing enabled vision-based lane tracking in
the mid-to-late 1980s (e.g. [Dickmanns & Zapp 86], [Kluge & Thorpe 89]). Research
in automated headway control solved another piece of the problem and allowed appli-
cations such as automated convoying ([Cro & Parker 70], [Kories et al. 88]). In 1995,
the Carnegie Mellon Navlab 5 vehicle steered 98% of the distance between Washing-
ton, DC and San Diego (a distance of 2800 miles) autonomously, demonstrating that
vision-based road following is a mature technology. Progress has also been made in
the area of high-level planning in the presence of other traffic ([Reece 92], [Suk-
thankar 97]).
As part of a demonstration of Automated Highway System concepts in 1997, many
different groups from around the world demonstrated integrated vehicle systems. The
vehicles from Carnegie Mellon consisted of two cars, a van, and two full-sized city
buses. Integrated capabilities that were demonstrated included road following, lane
changes, inter-vehicle communication, detection and awareness of surrounding vehi-
cles, and detection and avoidance of large obstacles.
1.3. Obstacle Detection
Most of the progress in intelligent vehicles has been made in handling predictable
situations (which is not to say that the situations are necessarily common, just predict-
able). In line with this, much of the work on obstacle detection has focused on detect-
ing other vehicles and large, unambiguous obstacles such as traffic barrels. Many of
these methods can successfully detect moving vehicles, but the more difficult problem
of finding small, static road debris such as tires and crates remains unsolved.
Deciding exactly what size obstacle we need to be able to detect at what minimum
range is a complicated problem which has been addressed by many different research-
ers in different ways. Hancock [Hancock 97] used the equations derived by Kelly
[Kelly 95] for cross-country navigation to arrive at a distance of 65 m ahead for a 20
cm high obstacle, with the following assumptions:
• vehicle is travelling at 60 mph (26.7 m/s)
• vehicle can decelerate at 0.7 g (6.8 m/s²)
• there is a 0.5 second delay time between sensing of the obstacle and application of the brakes
• processing is performed at a cycle rate of 0.3 seconds
• the sensor is located 1 meter above the ground

He also calculates that the sensor must have a vertical angular resolution of at least 0.1° and a vertical field of view that is the same, 0.1°, implying that a single line sensor would be sufficient. In reality, many of these assumptions are optimistic. For instance, although many cars may be able to sustain 0.7 g deceleration on dry pavement under ideal conditions, it is unrealistic to expect this kind of performance from all vehicles under all conditions. We would also like to be able to travel at higher speeds when the law permits it. Additionally, there is empirical evidence that we may need to avoid obstacles as small as 6” (14 cm) tall in order to avoid damage to the vehicle. Lastly, even if a single scan line is sufficient, it is better to have many pixels on the obstacle in order to enhance the reliability of detection results.

The combination of the above factors leads us to the conclusion that we would like to detect smaller obstacles at somewhat larger ranges. Simply changing the speed from 60 to 65 mph and the deceleration to 0.5 g implies a necessary distance of 100 meters.

Sensors such as automotive radar do not have the acuity to find small obstacles at such large distances, and have significant difficulties with non-metallic obstacles such as wood, cement, or animals. While a variety of competing methods have been proposed for on-road obstacle detection, most of the work has focused on detecting large objects, especially other vehicles (e.g. [Luong et al. 95]). Although the problem of detecting static obstacles has been tackled in both the cross-country and indoor mobile robot navigation literature (e.g. [Matthies 92]), these systems have operated at low speeds (5-10 mph) and short range.
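The distance figures above follow from standard stopping-distance kinematics: reaction distance plus braking distance. The sketch below uses that simple model, treating the brake delay plus one processing cycle as the reaction time; it does not reproduce Hancock's detailed 65 m derivation exactly, but it shows how the required sensing range scales with speed and deceleration.

```python
G = 9.81  # m/s^2

def stopping_distance(v, decel, delay):
    """Distance covered during the sensing/braking delay, plus the
    braking distance v^2 / (2a)."""
    return v * delay + v ** 2 / (2.0 * decel)

# Hancock-style assumptions: 60 mph, 0.7 g, 0.5 s delay + 0.3 s cycle.
print(round(stopping_distance(26.7, 0.7 * G, 0.8), 1))   # ~73 m
# Relaxed assumptions from the text: 65 mph and 0.5 g braking.
print(round(stopping_distance(29.1, 0.5 * G, 0.8), 1))   # ~110 m
```

The second case lands close to the 100 m figure quoted in the text.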
1.4. Stereo Vision for Obstacle Detection
This thesis presents a solution to the obstacle detection problem based on trinocu-
lar stereo vision. The solution presented is capable of detecting small obstacles, on the
order of 15 centimeters tall, on the road surface at ranges of 100 meters or more in
front of the vehicle.
Stereo vision is an ideal method for solving the obstacle detection problem for a
variety of reasons. If we expect to someday equip every vehicle on the highway with
its own obstacle detection system, then the use of an active sensor such as radar or
ladar requires great care to avoid interference between the signals emanating from dif-
ferent vehicles. This argues for the use of passive sensing devices such as video cam-
eras. In addition, cameras and computers are continually getting smaller and less
expensive. Although prices are not yet low enough to include three cameras and a
powerful computer on every car, current trends will make it possible within the next
five years. Yet another factor is that a stereo system lacks moving parts, which implies
less wear and thus greater reliability.
1.5. Thesis Overview
This section presents a summary of the main ideas and results of this dissertation,
which will be presented in greater detail throughout the remaining chapters. First, we
discuss the problems posed by a straight-forward application of stereo vision to the
obstacle detection problem. Following this, a method is presented to solve these prob-
lems. The next two sections discuss major algorithmic choices and their impact on the
quality of the stereo output. This is followed by a section describing the actual method
that we use for detecting obstacles from stereo disparity data. Finally, we present a
summary of obstacle detection results.
1.5.1. Traditional Stereo
As illustrated in Figure 1-1, traditional stereo processing involves taking two
images of a scene at the same time from different viewpoints. Each point in one of the
images is constrained by the camera geometry to lie along a line (called the epipolar
line) in the other image. Its position on this line is related to the distance of the point
from the cameras.
In order to make the search more reliable, instead of comparing individual pixels
from the two images, small regions are compared.
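The region comparison described here can be sketched as a sum-of-absolute-differences (SAD) search slid along the epipolar line, assumed here to be a horizontal scanline of rectified images; the window size and search range are illustrative choices, not the system's parameters.

```python
import numpy as np

def sad_curve(left, right, row, col, win=8, d_max=150):
    """SAD matching error between a window in `left` and candidate
    windows displaced along the same row of `right`."""
    patch = left[row:row + win, col:col + win].astype(np.int64)
    errors = []
    for d in range(d_max):
        cand = right[row:row + win, col - d:col - d + win].astype(np.int64)
        errors.append(int(np.abs(patch - cand).sum()))
    return np.array(errors)

# Synthetic rectified pair: the "left" view is the "right" view shifted
# by 7 pixels, so the error curve bottoms out at disparity 7.
rng = np.random.default_rng(0)
right = rng.integers(0, 255, size=(16, 400))
left = np.roll(right, 7, axis=1)
print(int(np.argmin(sad_curve(left, right, row=4, col=200))))  # 7
```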
Two examples of this are shown in Figure 1-1. In this example, the scene is of the inside of a garage. The garage door has calibration targets attached to it, and the images have been filtered, both of which serve to enhance the image texture. Two regions are chosen as examples of stereo matching.
Figure 1-1: Traditional Stereo Processing (left and right image regions, their differences, and a plot of matching error versus disparity)

For the example region on the door of the garage, we see that the regions searched
in the stereo matching (shown in detail below the images) match very well. The upper
curve on the graph at the bottom shows the matching error (sum of absolute differ-
ences, SAD) as a function of the displacement along the epipolar line. This graph
shows a strong global minimum at the correct value of 100.
On the other hand, the example on the garage floor does not match as well. This is
due to the fact that since the ground is tilted with respect to the camera axis, points
which are higher in the image are actually farther away and thus match at a different
location. This is seen as a difference in the slope of the line on the ground. The lower
curve of the graph shows that the global minimum of the matching error does not
occur at the correct position (which would be at a value of around 155).
It is clear from this example that a simple application of traditional stereo tech-
niques will not be sufficient for detecting obstacles on a road surface; points on the
ground such as those shown in the example will produce incorrect results, particularly
in regions where the image texture is low. Since the problem is caused by a difference
in the geometry of the surfaces being observed, the solution to this problem is to com-
pensate for the different geometry.
1.5.2. “Ground Plane Stereo”
The simplest way to solve the problems described in the previous section is to
warp one of the images (using a projective warping function) so that the images would
appear to be exactly the same if all of the pixels in the image were on some typical
ground plane. This results in a situation as shown in Figure 1-2. Both images now
appear to be the same for pixels which are on the ground, but pixels which are on a
vertical surface such as the wall of the garage are now warped in much the same way
that the ground pixels were warped in traditional stereo. This means of computing ste-
reo (described in more detail in [Williamson & Thorpe 98a]) is similar to the tilted
horopter method of Burt et al. [Burt et al. 95], except that in our case, instead of
attempting to determine the parameters of the ground plane at each iteration, we use a
horizontal plane that is fixed relative to the vehicle as the starting point for our stereo
search.
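The warping step can be sketched with a plane homography applied by inverse mapping. This is a minimal sketch assuming the 3x3 ground-plane homography H is already known from calibration, and it uses nearest-neighbour sampling where a real implementation would interpolate.

```python
import numpy as np

def warp_homography(img, H):
    """Warp img by the 3x3 homography H, which maps each output pixel
    (x, y, 1) to its source location in img (inverse mapping,
    nearest-neighbour sampling)."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1).astype(np.float64)
    src = H @ pts
    sx = np.rint(src[0] / src[2]).astype(int)
    sy = np.rint(src[1] / src[2]).astype(int)
    ok = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(img)
    out.reshape(-1)[ok] = img[sy[ok], sx[ok]]
    return out

# With an identity homography the image is unchanged; a projective H
# derived from the ground plane would instead slide ground pixels into
# registration while shearing vertical surfaces.
img = np.arange(25).reshape(5, 5)
assert (warp_homography(img, np.eye(3)) == img).all()
```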
Comparing the results from the ground plane method with the results from the tra-
ditional method, we notice several differences. First, the global minimum of the
matching error curve for the point on the ground (the lower curve) now appears at the
correct location. The value of the error at the minimum is also lower than before, since
it matches better. Second, although the global minimum of the curve for the point on
the door is still at the correct location, the trough of the minimum is much wider, indi-
cating a less certain result. The value at the minimum is larger, indicating that it
doesn’t match as well.
This example illustrates an interesting result: if we compute stereo using both methods, it is possible to determine whether a given point lies on a vertical surface (if the traditional method produces a lower minimum error) or on a horizontal surface (if the ground plane method produces a lower minimum error). The correct disparity can also be determined from the position of the lower minimum.
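The decision rule can be sketched as follows. The arrays stand for a region's two matching-error curves, and the `margin` tolerance is a hypothetical parameter for illustration, not something specified in the text.

```python
import numpy as np

def classify_surface(err_traditional, err_ground, margin=0.0):
    """Pick the surface type whose stereo variant matched better, and
    return the disparity at that variant's minimum."""
    t_min, g_min = err_traditional.min(), err_ground.min()
    if t_min + margin < g_min:
        return "vertical", int(np.argmin(err_traditional))
    return "horizontal", int(np.argmin(err_ground))

# Toy curves: traditional stereo matches this region far better, so it
# is labelled vertical, with the disparity taken at that curve's minimum.
label, disp = classify_surface(np.array([9.0, 2.0, 8.0]),
                               np.array([7.0, 6.0, 7.5]))
print(label, disp)  # vertical 1
```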
Figure 1-2: “Ground Plane Stereo” (left and right image regions, their differences, and a plot of matching error versus disparity)
Since most obstacles that we are concerned with contain nearly-vertical surfaces,
detecting such obstacles becomes very easy using this method.
One issue that must be addressed is what conditions are necessary for this method
to work reliably. For example, if two surfaces appear in the same image region (near
where the garage door meets the ground, for instance), which surface will be chosen?
The most important factor is the magnitude of the image texture on each surface.
Another factor is how close the surface directions are to being vertical or horizontal.
Figure 1-3 shows the results of applying both methods to a typical input image set.
The gray coding in both cases represents the number of pixels of displacement along
the epipolar line (dark is negative, medium gray is zero, and bright is positive). As
expected, the ground plane method does very well on the ground pixels, but poorly on
the wall in the background. Conversely, the traditional method works well on vertical
features such as the lamp post and the wall, but there is a lot of noise on the ground
surface.
1.5.3. Implementation
Figure 1-4 shows the architecture of the system that we have implemented. Three
CCD cameras with 35mm lenses are arranged in a triangular configuration, mounted
on top of our Toyota Avalon test vehicle. The distance between the outer set of cam-
eras is about 1.5m.
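The effect of the long baseline and telephoto lenses can be illustrated with the pinhole relation d = fB/Z. The 9 µm pixel pitch below is an assumed value for converting the 35 mm focal length to pixels, not a figure from the text.

```python
# Disparity of a point at range Z for baseline B and focal length f:
# d = f * B / Z (in pixels when f is expressed in pixels).
pixel_pitch = 9e-6               # metres per pixel (assumed)
focal_px = 0.035 / pixel_pitch   # 35 mm lens -> ~3889 pixels
baseline = 1.5                   # metres, outer camera pair
for Z in (20.0, 50.0, 100.0):
    print(f"{Z:5.0f} m -> {focal_px * baseline / Z:6.1f} px disparity")
```

Even at 100 m the disparity remains tens of pixels, which is why small range differences between an obstacle and the road stay measurable.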
The computation that is performed is based on that used by the CMU Video
Rate Multibaseline Stereo Machine [Kanade et al. 96]. The images are first passed
through a Laplacian of Gaussian (LoG) filter, then rectified to align the epipolar lines.
Stereo matching is then performed using both the traditional method and the ground
plane method. Based on the output of both methods, the further step of obstacle detec-
tion and localization is performed.
Figure 1-3: Example output of both methods (original image, traditional method output, ground plane method output)

1.5.3.1. Multibaseline (Trinocular) Stereo

There are several benefits to adding a third camera in a triangular configuration. The most important of these is that the epipolar lines for different pairs of cameras are
in different directions (as illustrated in Figure 1-5). This is due to the fact that the epi-
polar direction is the same as the direction of displacement between the cameras.

Figure 1-4: Architecture of Stereo Obstacle Detection System (each of the three cameras feeds a LoG filter and image rectification stage; the rectified images go to stereo matching, then obstacle detection/localization)

Figure 1-5: Three cameras in an “L” configuration give different epipolar directions

This
is important in situations where the image has texture in one direction but not in the
other (for example, the top border of the obstacle in Figure 1-3).
Another benefit of adding additional cameras is that it allows multiple measure-
ments at each point. This is useful in increasing accuracy and rejecting noise. Further-
more, a system containing only two cameras can be confused by repeated patterns in
the image (such as lines painted on the road surface). With three cameras, this problem
is eliminated.
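The disambiguation effect can be illustrated numerically: summing error curves from camera pairs with different baselines (after normalising their disparity axes to a common baseline, as multibaseline stereo does) leaves the true minimum reinforced while the spurious minima wash out. The curves below are synthetic stand-ins, not data from the system.

```python
import numpy as np

# Two SAD-style curves on a shared disparity axis. Each pair alone is
# ambiguous: a repeated pattern gives two deep minima, but only the
# true one (d = 40) appears in both pairs.
d = np.arange(100)
true_dip = 8.0 * np.exp(-((d - 40) ** 2) / 8.0)
pair_a = 10.0 - true_dip - 8.5 * np.exp(-((d - 60) ** 2) / 8.0)
pair_b = 10.0 - true_dip - 8.5 * np.exp(-((d - 25) ** 2) / 8.0)
combined = pair_a + pair_b   # only the shared minimum survives

# A single pair locks onto the wrong dip; the pair-sum recovers d = 40.
print(int(np.argmin(pair_a)), int(np.argmin(combined)))  # 60 40
```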
Adding a fourth camera (or more) does provide some additional benefit, but it
becomes much more difficult to perform the stereo matching efficiently.
Figure 1-6 shows the output for the ground plane method from Figure 1-3 if only
two cameras are used. The number of incorrectly matched pixels is much larger.
1.5.3.2. Laplacian of Gaussian (LoG) Filtering
Laplacian of Gaussian filtering is a well-accepted means of extracting features to
match from multiple cameras, while at the same time compensating for differences in
camera gain and bias. We use an LoG filter with a high gain in order to enhance the
texture of the otherwise featureless gray asphalt. The results of this filtering are shown
in Figure 1-7. The increase in image texture is very apparent.
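A sketch of such a filter follows. The kernel is forced to zero DC gain so that a constant offset between cameras (bias) produces no response; the sigma value and window sizes are illustrative choices, not the parameters used by the system.

```python
import numpy as np

def log_kernel(sigma, radius=None):
    """Sample a Laplacian-of-Gaussian kernel and subtract its mean so
    the filter has zero DC gain (flat regions map to zero)."""
    r = radius or int(3 * sigma)
    y, x = np.mgrid[-r:r + 1, -r:r + 1].astype(np.float64)
    s2 = sigma * sigma
    g = np.exp(-(x * x + y * y) / (2 * s2))
    k = (x * x + y * y - 2 * s2) / (s2 * s2) * g
    return k - k.mean()

def convolve2d_same(img, k):
    """Naive 'same'-size correlation (the LoG kernel is symmetric, so
    correlation and convolution coincide)."""
    r = k.shape[0] // 2
    pad = np.pad(img.astype(np.float64), r, mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = (pad[i:i + 2 * r + 1, j:j + 2 * r + 1] * k).sum()
    return out

# A flat region (e.g. uniform camera bias) produces no response.
flat = np.full((9, 9), 100.0)
print(round(float(np.abs(convolve2d_same(flat, log_kernel(1.0))).max()), 6))  # 0.0
```

In this scheme the "high gain" the text mentions would amount to scaling the filter output before quantization.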
Figure 1-6: Example using only two cameras

The importance of the LoG filter to our algorithm is illustrated in Figure 1-8. The
lack of image texture on the road surface causes the entire region to be unmatchable,
though regions with higher texture, such as the obstacle itself and the curb, are still
computed correctly.
Figure 1-7: Image before and after LoG filtering
Figure 1-8: Example of stereo output without LoG filter
1.5.4. Obstacle Detection from Stereo Output
As discussed in Section 1.5.2, our method involves performing two types of stereo
matching (for vertical and horizontal surfaces), and comparing the absolute errors to
determine if a particular image region belongs to a vertical or horizontal surface. The
result of this is shown in Figure 1-9. The regions shown in the lower image are coded
by the size of the difference between the minimum errors. Brighter regions indicate
that the vertical match is much better than the ground plane match. Thus regions which
appear white are most likely to be vertical, and black regions are most likely to be hor-
izontal.
Figure 1-9: Detected vertical surfaces

Regions of very low texture (such as the white stripe down the side of the road)
sometimes match well as vertical surfaces because of differences between the individual
cameras being used.
In order to remove such false obstacles from consideration, we use a very simple
confidence measure. For regions which are actual vertical surfaces, we expect that the
traditional stereo matching method will return a relatively large number of pixels at
approximately the same depth. Conversely, if a region belongs to a horizontal plane,
we would expect the traditional method to report a number of different depths. Using
standard connected components labeling methods on the disparity image generated
from traditional stereo matching, we get the image of Figure 1-10. This image encodes
the size (in pixels) of the region to which each pixel belongs. Large regions appear
brighter, and these regions are more likely to be obstacles. By requiring detected
obstacle regions to pass this consistency check, we can remove most false positive
detections.
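The consistency check can be sketched as follows (hypothetical disparity values; a simple 4-connected flood fill stands in for whatever labeling routine the system uses):

```python
# Minimal sketch of the consistency check on hypothetical data: label
# connected regions of constant disparity, then score each pixel by the
# size of its region. Real vertical surfaces yield one large region.
from collections import deque

def region_sizes(disparity):
    """4-connected components of equal disparity; returns size per pixel."""
    rows, cols = len(disparity), len(disparity[0])
    label = [[-1] * cols for _ in range(rows)]
    sizes = [[0] * cols for _ in range(rows)]
    next_label = 0
    for r0 in range(rows):
        for c0 in range(cols):
            if label[r0][c0] != -1:
                continue
            queue, members = deque([(r0, c0)]), [(r0, c0)]
            label[r0][c0] = next_label
            while queue:
                r, c = queue.popleft()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < rows and 0 <= cc < cols
                            and label[rr][cc] == -1
                            and disparity[rr][cc] == disparity[r][c]):
                        label[rr][cc] = next_label
                        queue.append((rr, cc))
                        members.append((rr, cc))
            for r, c in members:
                sizes[r][c] = len(members)
            next_label += 1
    return sizes

# A real obstacle gives a block of equal disparities (7); the road gives
# a gradient of depths, so no single disparity forms a large region.
disp = [
    [1, 2, 3, 4, 5],
    [2, 7, 7, 4, 5],
    [3, 7, 7, 5, 6],
    [4, 5, 6, 6, 7],
]
sizes = region_sizes(disp)
print(sizes[1][1])   # 4: the obstacle block is the largest region
```

Thresholding the per-pixel region size then suppresses isolated, inconsistent depth readings on the road surface.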
Combining the images of Figure 1-9 and Figure 1-10, we get the detected obstacle
output of Figure 1-11. Obstacles are shown in black. This example shows two 14cm
high obstacles, which are pieces of wood painted white and black. The obstacles are
100m in front of the vehicle.
1.5.5. Results
Figure 1-10: Size of regions of constant disparity

We have collected a set of test data using wooden obstacles of four different
heights (9, 14, 19, and 29cm) and three different colors (white, black, and gray) at
measured distances from 50 meters to 150 meters.
Figure 1-12 shows the accuracy of the detected range for all 12 obstacles. As
expected, the measured range is very accurate when the object is close, and gets
increasingly less accurate as the obstacle gets farther away.
The results of running the obstacle detection system are shown in Table 1-1. This
table shows that we were successfully able to detect obstacles that are bigger than 9cm
at up to 110m.

Figure 1-11: Detected Obstacles

Figure 1-12: Stereo range accuracy (detected range vs. actual range, both in meters)
Figure 1-13 shows an example trace of an obstacle detection run. The vehicle
moved at a constant rate (about 25 km/h) toward a 14cm black obstacle of the type
shown in Figure 1-9. The data was taken at 15 fps and processed off-line. The obstacle
is detected in every frame of the data, out to a maximum range of approximately
110m (which is the beginning of the data set).

Table 1-1: Obstacle Detection Results

         Black                   Grey                    White
         9cm  14cm 19cm 30cm    9cm  14cm 19cm 30cm    9cm  14cm 19cm 30cm
  50m     ✓    ✓    ✓    ✓      ✓    ✓    ✓    ✓      ✓    ✓    ✓    ✓
  60m     ✓    ✓    ✓    ✓      ✕    ✓    ✓    ✓      ✓    ✓    ✓    ✓
  70m     ✓    ✓    ✓    ✓      ✓    ✓    ✓    ✓      ✓    ✓    ✓    ✓
  80m     ✓    ✓    ✓    ✓      ✓    ✓    ✓    ✓      ✓    ✓    ✓    ✓
  90m     ✓    ✓    ✓    ✓      ✕    ✓    ✓    ✓      ✓    ✓    ✓    ✓
 100m     ✓    ✓    ✓    ✓      ✕    ✓    ✓    ✓      ✓    ✓    ✓    ✓
 110m     ✓    ✓    ✓    ✓      ✕    ✓    ✓    ✓      ✕    ✓    ✓    ✓
 120m     ✓    ✕    ✓    ✓      ✓    ✕    ✓    ✓      ✕    ✓    ✓    ✓
 130m     ✕    ✕    ✓    ✕      ✕    ✓    ✓    ✓      ✕    ✕    ✕    ✕
 140m     ✓    ✕    ✓    ✓      ✓    ✓    ✓    ✓      ✕    ✕    ✕    ✕
 150m     ✓    ✓    ✓    ✓      ✓    ✓    ✓    ✓      ✕    ✕    ✕    ✕

Figure 1-13: Detection trace for 14cm obstacle (detected range in meters vs. frame number at 15 fps)
Figure 1-14 shows the same type of trace, this time for a standard 12oz (350ml)
white soda can. The soda can is first reliably detected at 57m.
Each of the previous examples has shown only the detections that actually repre-
sented the obstacle. Of course, there are many more detected objects. A full trace is
shown in Figure 1-15, along with an example image from the set and a diagram show-
ing an overhead view of the scene. The detections can be divided into three sets, repre-
senting the obstacle, the curb behind the obstacle, and the building in the background.
Also note that there are no false detections that are closer than the obstacle.
1.6. Thesis Outline
This thesis consists of a number of chapters, which are divided according to the
major topics to be presented. Each chapter begins with an introduction and a separate
discussion of related work. This is followed by a detailed discussion of the topic at
hand.
Figure 1-14: Detection trace for a soda can (detected range in meters vs. frame number at 4 fps)
Chapter 2 briefly introduces the mathematics of projective geometry that will be
used throughout this document. Assuming a pinhole camera model, we derive a com-
pletely general mathematical model for multiple images of a static scene. This chapter
provides some of the fundamental equations upon which the stereo obstacle detection
system is built.
Chapter 3 introduces the problem of calibrating a set of cameras to be used for
multibaseline stereo. A weak calibration method is presented that allows the determi-
nation of just enough parameters to allow the computation of stereo disparity. In order
to perform this calibration, all that is required is images of two planar surfaces (for
example, a wall and a relatively flat patch of ground) in the world, taken from each of
the cameras. Since we are interested in viewing small objects at long range, additional
methods are presented that provide a means to compute these parameters very accu-
rately by adding images of additional planar surfaces, which may be obtained by mov-
ing the vehicle and capturing images at different distances.
Figure 1-15: Other detected points (detected range in meters vs. frame number at 4 fps), with an example image from the set and an overhead diagram of the vehicle, road, obstacle, curb, and building
In Chapter 3 we also present a method for performing metric calibration of the ste-
reo system. A metric calibration provides a mapping from the natural coordinates for
stereo processing (pixels and disparity values) into 3D coordinates that can be used for
vehicle control. The accuracy requirements for obstacle position are much less strin-
gent than for stereo matching, since we cannot expect millimeter precision in position
at 100 meter range. The geometry of the situation, as well as other factors, prohibit
such high accuracy position recovery. The method that is described makes use of three
images of a vertical plane at known distances, a horizontal ground plane, and two
points within the vertical plane at known lateral positions. The data for both calibrations
can thus be collected at the same time.
Chapter 4 presents the stereo algorithm that is used as the basis for the obstacle
detection system. This chapter discusses the stereo algorithm at a high level, in terms
of what sort of processing is necessary to produce high-quality output. The discussion
of how this algorithm can be implemented efficiently is left until Chapter 5. The first
section of this chapter discusses the benefits of using at least three cameras in the ste-
reo system. Following that, the stages of the stereo algorithm are presented in order of
processing. The first step of the algorithm is preprocessing using an LoG filter. The
reasons why such preprocessing is necessary are discussed in detail. The next step of
the algorithm is rectification and interpolation of the images. The method used for
interpolation is discussed in some detail, but the discussion of rectification is delayed
until Chapter 5, where it will be presented at great length. The next stage of the algo-
rithm is the actual stereo matching. A number of different metrics for image similarity
are presented, and a discussion of the benefits and drawbacks of each follows. This
chapter ends with a discussion of sub-pixel interpolation.
Chapter 5 is devoted to mid-level implementation issues. These issues are those
that are not high-level algorithmic issues such as those presented in Chapter 4, but yet
are still at a high enough level that they apply to any general-purpose computing plat-
form. Since the research presented in this dissertation has grown out of an attempt to
use the CMU Video-Rate Multibaseline Stereo Machine for obstacle detection, the
algorithm used by that machine is presented first. Details of the software implementation
of this algorithm are presented in the sections that follow. Following a section on
the implementation of the LoG filter, the stereo matching main loop is presented in
pseudo-code. The discussion of rectification methods is closely tied to a discussion of
the memory and cache performance required by three different possible implementations
of the stereo main loop. The chapter ends with the presentation of benchmark
data that supports the analysis of memory usage by the stereo algorithm, and shows the
significant performance improvements that can be achieved by attention to memory
usage.

Chapter 6 discusses how the output of the stereo algorithm can be used to build an
effective obstacle detection system. First, I present the major problem posed by trying
to apply traditional stereo techniques to a highway environment. The solution to this
problem, presented in the next section, is something that I call "Ground-Plane Stereo",
which is equivalent to what others have called "tilted-horopter stereo". The sections
that follow describe how the combination of traditional stereo and ground-plane stereo
can be used to determine the orientation of the surfaces being viewed, and how this
orientation can be used as a cue for obstacle detection.

Chapter 7 presents results obtained using the system. The first section examines
the performance of the stereo algorithm and the importance of multiple cameras and
LoG filtering. This is followed by an analysis of the accuracy of stereo range when
applied to detected obstacles. The rest of the chapter presents actual obstacle detection
results. The algorithm was tested on a variety of obstacles of different sizes and colors.
The results of these tests show that the system is capable of achieving our stated goal,
detecting a 15 centimeter obstacle 100 meters in front of the vehicle, under daylight
conditions. In fact, the system is capable of detecting the obstacle at even larger distances.
A series of tests was also run at nighttime to determine whether the system will
continue to function at night. At night, the ability of the system to detect obstacles is
limited by the extent of the region illuminated by the vehicle's headlights, which for
low beams is much less than 100 meters. In addition to the single obstacle detection
runs, we have also performed repeated tests in order to determine the repeatability and
reliability of the results. These tests give us some idea of the probability of detection
versus range for a particular obstacle.

Finally, Chapter 8 takes a look at the contributions of this thesis, and possible
future work. The contributions of this dissertation are presented in three main areas:
camera calibration, the stereo algorithm itself, and obstacle detection. There are several
interesting directions in which this research can be extended; I conclude with a
look at a number of possible topics for future work.
Chapter 2
Mathematical Fundamentals
Much of the research described in this thesis depends on projective geometry. The
definitive reference for projective geometry as it is applied to computer vision is
[Faugeras 93]. While it is an excellent and complete reference, a more concise deriva-
tion of the necessary equations is possible for the special case that we consider in this
thesis: a set of images of a static world, each taken from a different viewpoint. This
chapter presents a derivation of these necessary equations.
The derivation presented here is simplified by choosing a special coordinate sys-
tem whose origin is located at the camera focus and whose axes are aligned with the
camera axes. This simplification allows a more concise derivation of the fundamental
equations describing a system of multiple cameras, without any loss of generality. It
also eliminates some of the confusion that can be caused by presenting a mapping
between 3D homogeneous coordinates and 2D homogeneous coordinates by avoiding
3D homogeneous coordinates altogether.
First, the basics of projective geometry for stereo will be presented, with a deriva-
tion of the fundamental stereo equation and the epipolar geometry. This is followed by
the derivation of homography matrices relating multiple images of a planar surface,
and a brief look at the fundamental matrix.
2.1. Mathematics of Stereo Vision
Projective geometry provides a useful set of tools for thinking about computer
vision problems. The main idea of projective geometry is that image coordinates
(inherently a 2-D space of columns c and rows r) can be represented as 3-D homoge-
neous coordinates, by the following relationship:
$$\begin{bmatrix} c \\ r \end{bmatrix} \Leftrightarrow \begin{bmatrix} \alpha c \\ \alpha r \\ \alpha \end{bmatrix} \tag{2-1}$$
So, to convert from a 3-D homogeneous coordinate to a 2-D image coordinate, all
that is needed is to divide each of the first two elements by the third. This is a many-to-
one mapping. To convert a 2-D image coordinate into a homogeneous coordinate, we
can choose an arbitrary third coordinate (usually we choose 1 for simplicity) and mul-
tiply the column and row by this element.
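The conversion can be sketched as follows (the helper names are illustrative, not from the thesis):

```python
# Sketch of the 2-D <-> homogeneous conversion described above
# (illustrative helper names, not from the thesis).

def to_image(h):
    """Homogeneous (a, b, w) -> image (c, r): divide by the third element."""
    a, b, w = h
    return (a / w, b / w)

def to_homogeneous(c, r, w=1.0):
    """Image (c, r) -> homogeneous, with an arbitrary third coordinate w."""
    return (c * w, r * w, w)

# Every scalar multiple maps to the same image point (many-to-one):
print(to_image((320.0, 240.0, 1.0)))   # (320.0, 240.0)
print(to_image((640.0, 480.0, 2.0)))   # (320.0, 240.0)
```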
What makes this concept useful is that camera projections can be written as linear
equations in homogeneous coordinates. Suppose we have the camera geometry shown
in Figure 2-1. A set of camera coordinates (x,y,z) are defined with the origin at the
focus of the camera. The z axis is aligned with the camera viewing direction. In the
image plane we define the coordinate system in terms of rows and columns (c,r). If we
then define the 3x3 matrix A:
$$A = \begin{bmatrix} f & 0 & u \\ 0 & \gamma f & v \\ 0 & 0 & 1 \end{bmatrix} \tag{2-2}$$

then we can represent the mapping from camera coordinates (x, y, z) to image coordinates
(c, r) by:

$$A\begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} fx + uz \\ \gamma fy + vz \\ z \end{bmatrix} = z\begin{bmatrix} \frac{fx}{z} + u \\ \frac{\gamma fy}{z} + v \\ 1 \end{bmatrix} = z\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} \tag{2-3}$$

Figure 2-1: Geometry of camera projection

where the equations $c = \frac{fx}{z} + u$ and $r = \frac{\gamma fy}{z} + v$ can be easily derived from the
geometry of similar triangles; f is the focal length of the camera, γ is the aspect ratio,
and (u,v) is the image center of the camera. This equation provides a compact and sim-
ple representation of the camera geometry, turning a nonlinear equation into a linear
equation.
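A quick numeric check of this projection, with made-up intrinsic parameter values:

```python
# Numeric check of equations (2-2)/(2-3): multiplying camera coordinates
# by A and dividing by z reproduces c = f*x/z + u, r = gamma*f*y/z + v.
# The parameter values are invented for illustration.

f, gamma, u, v = 800.0, 1.0, 320.0, 240.0
A = [[f, 0.0, u],
     [0.0, gamma * f, v],
     [0.0, 0.0, 1.0]]

def matvec(M, p):
    return [sum(M[i][j] * p[j] for j in range(3)) for i in range(3)]

x, y, z = 1.0, -0.5, 10.0
proj = matvec(A, [x, y, z])          # = (f*x + u*z, gamma*f*y + v*z, z)
c, r = proj[0] / proj[2], proj[1] / proj[2]

assert abs(c - (f * x / z + u)) < 1e-12
assert abs(r - (gamma * f * y / z + v)) < 1e-12
print(c, r)   # 400.0 200.0
```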
Note also that since the matrix A is invertible, equation (2-3) can be inverted:
$$\begin{bmatrix} x \\ y \\ z \end{bmatrix} = zA^{-1}\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} \tag{2-4}$$

For each point (c,r) in the image, this equation tells us the corresponding line in world
coordinates, parameterized by z.

Now suppose that we have two cameras, represented by primed coordinates
((c',r'), (x',y',z'), and A') and unprimed coordinates ((c,r), (x,y,z), and A). If we also
know the rotation and translation between the two camera coordinate systems, represented
by the 3x3 rotation matrix R and the 3-D translation vector t, so that

$$\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = R\begin{bmatrix} x \\ y \\ z \end{bmatrix} + t \tag{2-5}$$

then we can substitute equation (2-4) into equation (2-5) twice (once for each camera)
and simplify, giving us

$$z'\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} = A'\left(RA^{-1}z\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + t\right) = zA'RA^{-1}\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + A't \tag{2-6}$$

This equation embodies the relationship between points in two different images of the
same scene. If we define $H_\infty = A'RA^{-1}$ (a 3x3 matrix) and $e = A't$ (a 3-vector),
then this equation becomes:

$$z'\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} = zH_\infty\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + e \tag{2-7}$$

From this equation, we can see the following:

• in the limit as z approaches infinity, the effects of e become negligible, and

$$z'\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} = zH_\infty\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} \tag{2-8}$$

• in the limit as $z/z'$ approaches zero,

$$z'\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} = e \tag{2-9}$$

From these equations we can see that for any given point (c,r) in the first camera,
the point (c',r') in the second camera must lie on the line connecting e (called the epipole,
which is the image of one camera's focus in the other camera) to the point
$H_\infty[c\ r\ 1]^T$ (which is the point at infinity, depending only on the rotation between the
cameras). This line is called the epipolar line. In particular, the point must lie between
e and $H_\infty[c\ r\ 1]^T$ on this line.

2.1.1. Homography Matrices

Being a mapping from a 3-D space (c,r,z) to a 2-D space (c',r'), equation (2-7) is
of course not invertible. But if instead of taking images of a general scene, we
take images of a planar surface (such as a wall or the road surface), we can add an
additional constraint. One way of expressing the general equation of a plane is:
$$n^T\begin{bmatrix} x \\ y \\ z \end{bmatrix} = d \tag{2-10}$$

where n is the unit normal vector to the plane, and d is the normal distance of the
plane from the origin. This can be rewritten as:

$$\frac{1}{d}n^TA^{-1}z\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} = 1 \tag{2-11}$$

If we now multiply the e in equation (2-7) by equation (2-11), we get

$$z'\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} = zH_\infty\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + \frac{en^T}{d}A^{-1}z\begin{bmatrix} c \\ r \\ 1 \end{bmatrix}, \qquad \frac{z'}{z}\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} = \left(H_\infty + \frac{en^T}{d}A^{-1}\right)\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} \tag{2-12}$$

Note that this is a linear equation which relates the coordinates of points in the two
images of a planar surface defined by the parameters n and d. The 3x3 matrix
$H_\infty + \frac{en^T}{d}A^{-1}$ is called a homography matrix. Note also that as d goes to infinity, the
homography matrix becomes $H_\infty$. Although equation (2-12) refers to a particular
matrix that we can compute, we must note that if we are to try to compute any homography
matrix (including $H_\infty$) directly from matching sets of image points, we will only
be able to determine it up to a scale factor.

For each point in one image, a homography matrix defines one location in the
other image on the epipolar line corresponding to that point. Thus two homography
matrices yield two points on the epipolar line for each pixel, which is enough to determine
the epipolar geometry, including the epipole e. It is not possible to compute $H_\infty$
from general homography matrices without other information.
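The epipolar-line geometry of equation (2-7) can be checked numerically. The sketch below uses made-up intrinsics and a deliberately simple sideways-translation geometry (identity rotation), an assumption made only for clarity of the example:

```python
# Numeric sketch of the epipolar line implied by equation (2-7), with
# invented camera parameters: as z varies, the matched point stays
# collinear with H_inf*[c,r,1]^T and the epipole e.

def matvec(M, p):
    return [sum(M[i][j] * p[j] for j in range(3)) for i in range(3)]

def cross(a, b):
    return [a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Identical intrinsics, identity rotation, pure sideways translation
# (a simple rectified-style geometry, chosen for clarity).
A = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
H_inf = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # A' R A^-1 with A'=A, R=I
t = [-1.2, 0.0, 0.0]                  # 1.2 m baseline
e = matvec(A, t)                      # epipole e = A' t

p = [400.0, 260.0, 1.0]               # a pixel (c, r) in the first image
for z in (10.0, 50.0, 100.0):
    q = [z * a + b for a, b in zip(matvec(H_inf, p), e)]   # z' [c',r',1]^T
    # q lies in the plane spanned by H_inf*p and e: triple product is zero.
    assert abs(dot(q, cross(matvec(H_inf, p), e))) < 1e-6
    # Equivalently, q is annihilated by e x (H_inf p) (the fundamental-
    # matrix constraint derived in the next section).
    assert abs(dot(q, cross(e, matvec(H_inf, p)))) < 1e-6
```

For every depth z the predicted second-image point sits on the same line through $H_\infty p$ and $e$, which is exactly the disparity search line used by the matcher.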
2.1.2. Fundamental and Essential Matrices
If we take the cross product with e on both sides of equation (2-7), we get

$$e \times z'\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} = e \times zH_\infty\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + e \times e \tag{2-13}$$

We then take the dot product with $[c'\ r'\ 1]^T$; since both $e \times e$ and the dot product of
$[c'\ r'\ 1]^T$ with $e \times z'[c'\ r'\ 1]^T$ vanish, this yields

$$\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} \cdot \left(e \times H_\infty\begin{bmatrix} c \\ r \\ 1 \end{bmatrix}\right) = 0 \tag{2-14}$$

The matrix quantity $[e]_\times H_\infty$ is called the fundamental matrix. It encodes information
about the epipolar line for each pixel, but the information about the endpoints (e
and $H_\infty[c\ r\ 1]^T$) is lost.

Another related matrix is the essential matrix, due to Longuet-Higgins
[Longuet-Higgins 81]. It can be defined as

$$E = A'^{T}FA \tag{2-15}$$

which describes the relationship between the world coordinates of points observed in
the frames of reference of the two cameras, via the following equation

$$\begin{bmatrix} x' & y' & z' \end{bmatrix}E\begin{bmatrix} x \\ y \\ z \end{bmatrix} = 0 \tag{2-16}$$

where (x,y,z) and (x',y',z') are the world coordinates of a single point observed in
the coordinate systems of the two cameras (or any scalar multiples thereof).

2.1.3. Relationship Between Homography Matrices

Given two homography matrices $H_1$ and $H_2$,

$$H_2 - H_1 = \left(H_\infty + \frac{en_2^TA^{-1}}{d_2}\right) - \left(H_\infty + \frac{en_1^TA^{-1}}{d_1}\right) = e\left(\frac{n_2}{d_2} - \frac{n_1}{d_1}\right)^TA^{-1} \tag{2-17}$$

If we define n' and d' such that $\frac{n'}{d'} = \frac{n_2}{d_2} - \frac{n_1}{d_1}$, then we can write $H_2$ as

$$H_2 = H_1 + (H_2 - H_1) = H_1 + \frac{en'^T}{d'}A^{-1} \tag{2-18}$$

which has the same form as the general homography matrix in equation (2-12). This
indicates that it is not necessary to know $H_\infty$ in order to know the epipolar geometry.
Any pair of homography matrices can be used to define two points on the epipolar line
for each pixel. The epipole e can be computed (up to a scale factor) from any two
homography matrices, since the result of equation (2-17) is a rank 1 matrix (it is
the outer product of two 3-vectors); any rank 1 matrix can be decomposed into two
component vectors with the only ambiguity being what scale to assign to each vector.
Furthermore, all homography matrices for a given camera geometry belong to a three-dimensional
affine subspace of the set of all 3x3 invertible matrices, which can be
parameterized by $n'/d'$. $H_\infty$ is one special member of this subspace.

It is necessary to note, as we did in Section 2.1.1, that in general we can only compute
homography matrices up to an unknown scale factor. If the two homographies $H_1$
and $H_2$ do not share the same scale factor, then equation (2-17) is meaningless, since
the $H_\infty$ terms will not cancel. Therefore, one must be very careful to somehow compute
the relative scale of the two matrices when attempting to apply equations of the
form of equation (2-17). This matter will be further addressed in Section 3.2.3.
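The rank-1 structure of equation (2-17) can be verified numerically. The camera parameters and planes below are invented for illustration, and the two homographies are built with a consistent relative scale (the caveat just noted):

```python
# Numeric sketch of equation (2-17) with made-up parameters: the
# difference of two plane homographies (sharing one scale) is rank 1,
# since it equals e (n2/d2 - n1/d1)^T A^-1, and its nonzero columns are
# all proportional to the epipole e.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def outer(a, b):
    return [[a[i] * b[j] for j in range(3)] for i in range(3)]

A_inv = [[1 / 800.0, 0.0, -320.0 / 800.0],
         [0.0, 1 / 800.0, -240.0 / 800.0],
         [0.0, 0.0, 1.0]]                 # inverse of illustrative intrinsics
H_inf = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
e = [-960.0, 0.0, 0.0]

def homography(n, d):
    """H = H_inf + e n^T A^-1 / d for a plane with unit normal n, distance d."""
    S = matmul(outer(e, n), A_inv)
    return [[H_inf[i][j] + S[i][j] / d for j in range(3)] for i in range(3)]

H1 = homography([0.0, 0.0, 1.0], 20.0)    # a vertical wall 20 m ahead
H2 = homography([0.0, 1.0, 0.0], 1.5)     # the ground plane, 1.5 m below

D = [[H2[i][j] - H1[i][j] for j in range(3)] for i in range(3)]

# Rank 1: every 2x2 minor of D vanishes.
minors = [D[i][j] * D[k][l] - D[i][l] * D[k][j]
          for i in range(3) for k in range(i + 1, 3)
          for j in range(3) for l in range(j + 1, 3)]
assert all(abs(m) < 1e-9 for m in minors)

# A nonzero column of D is a scalar multiple of the epipole e.
col2 = [D[i][2] for i in range(3)]
assert abs(col2[1] * e[0] - col2[0] * e[1]) < 1e-9
assert abs(col2[2] * e[0] - col2[0] * e[2]) < 1e-9
```

Decomposing the rank-1 difference into its two component vectors recovers e up to scale, which is the procedure Section 3.2.3 builds on.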
Chapter 3

Calibration

Perhaps the most important problem for computing stereo range data from a set of
cameras is the problem of accurately calibrating the cameras relative to each other.

In the system that we have implemented, calibration occurs in two steps. The first
step is to do what is known as "weak calibration" for stereo processing. Here "weak"
refers to the fact that we only know enough about the system to do stereo matching. In
particular, the mapping from the results of matching into 3D world coordinates is
unknown. The weak calibration must be done very accurately in order to ensure that
the search for matching pixels between the images is in fact looking for point correspondences
that are geometrically feasible.

The second step is to do metric calibration, which allows us to map from a set of
corresponding points in the images into a 3D (x,y,z) coordinate relative to the camera
in the world. For our application (detection of obstacles on the road surface), we can-
not expect the results of this mapping to be very accurate, since the range resolution
for far-away points is very low. Although we make some attempts to perform the cali-
bration with relatively high accuracy, the accuracy of metric calibration is not as
essential as it is for the weak calibration.
3.1. Related Work
The weak calibration method presented here is an adaptation of a method used pre-
viously at Carnegie Mellon by the Video-Rate Multibaseline Stereo Machine group,
particularly Kazuo Oda and Tak Yoshigahara. Their method is documented in
[Oda 96b], and is based on the weakly calibrated stereo ideas of Faugeras
[Faugeras 92]. I have extended the method further to optimize for multiple planar sur-
faces at the same time, which allows direct computation of the epipoles as well as
being more accurate.
The method used to turn the weak calibration into a metric calibration is com-
pletely ad-hoc, based on the result obtained in equation (3-22). This equation is well
known, having been derived independently in [Faugeras 92] and [Hartley et al. 92].
Another, more principled method for determining the mapping between the results of
weakly calibrated stereo and Euclidean coordinates is presented in
[Devernay & Faugeras 96], though the goal of their method is to recover Euclidean
coordinates without measuring distances to points in the world. The results that they
obtain are thus not metric results, although the mapping between a Euclidean space
and a true metric space can be found by making a small number of measurements.
3.2. Weak Calibration of Multibaseline Stereo
In order to perform stereo matching, for each point (c,r) we need to know what the
possible corresponding points (c’,r’ ) in the second image are. If we only know this
information, the system is said to be weakly calibrated. That is to say that although we
know the set of possible corresponding points between the two images, we do not nec-
essarily know the physical interpretation (i.e., 3D location) of a particular point corre-
spondence. The problem of determining this set of corresponding points is a problem
of calibration.
The fundamental projective equation describing stereo is:
$$z'\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} = zH_\infty\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + e \tag{3-1}$$

which, for each pixel (c,r) in the first image, describes a line segment between
$H_\infty[c\ r\ 1]^T$ and e along which the corresponding point (c',r') must lie.

Given the discussion of Section 2.1, several methods of calibration present themselves:

1. Measure the projection matrix A of each camera, and the translation t and rotation
R between them. Given these parameters, we can compute any of the other
quantities that we need. The main problem with this is that it is very difficult to
measure these parameters accurately. The usual method for measuring A is to
take the camera into a laboratory where very accurate measurements can be
made under controlled conditions. Since we expect that these parameters may
change over time (e.g. because of vehicle vibration), we need a calibration
method that can be done quickly and in place on the vehicle.

2. Measure $H_\infty$ and e. If we know these two quantities, we know both ends of the
epipolar line that we need to search. The problem with this is that it is not always
easy to measure $H_\infty$: doing so requires pointing the stereo system at a
scene that is so far away that it is indistinguishable from infinity. It is possible to
roughly calculate how far away that is for a given system; for ours it is roughly
4250m.
3. Measure the fundamental matrix. There are two problems with this. First is that
the fundamental matrix does not provide information about where the endpoints
of the epipolar line are, so we do not know where to start and end our search.
Secondly, even if we manage to find corresponding points using only the fundamental
matrix, the relationship of these correspondences to the distance from the
camera is unclear.

4. Measure a homography matrix for some "typical" plane, and e. The homography
matrix gives us one corresponding point on the epipolar line for each pixel; e is
another point. We also know that we expect points to lie near the "typical" plane,
so we can search in a region about that point along the epipolar line.

We have chosen solution 4 because it allows us to recalibrate often and quickly, as
well as having other benefits which will be described later.

3.2.1. Image Warping

Given a homography matrix H, it is possible to apply the transformation described
by that matrix to one of the images. This is known as projective image warping. After
such a homography is applied, a point in one image will lie at exactly the same pixel
coordinate in the other image if and only if it lies on the plane described by the homography.

The homography describes a real-valued mapping from the coordinates of one
image to the coordinates of the other. The pixel value at (c',r') of the warped image
should be the value of the original image at:

$$\alpha\begin{bmatrix} c \\ r \\ 1 \end{bmatrix} = H^{-1}\begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix} \tag{3-2}$$

After division by α, we have a real coordinate (c,r) that represents the location of the
corresponding point in the original image. In reality, since we only have values at discrete
pixel locations, we need to interpolate those values to find the best approximation
to the actual value. In general, bilinear interpolation is sufficient. If I(c,r)
represents the pixel value of the image I at the coordinate (c,r), $c_i$ is the integer part of
c, and $c_f$ is the floating point remainder $(c - c_i)$ (with $r_i$ and $r_f$ defined similarly for r),
then we have:

$$I(c,r) \approx (1-c_f)(1-r_f)\,I(c_i,r_i) + c_f(1-r_f)\,I(c_i+1,r_i) + (1-c_f)\,r_f\,I(c_i,r_i+1) + c_f r_f\,I(c_i+1,r_i+1) \tag{3-3}$$

To simplify the notation, we will use $W(I,H)$ to represent the image obtained from
image I after warping by H. The value of this image at the pixel with coordinates (c,r)
would then be $W(I,H)(c,r)$.

3.2.2. Computing Homography Matrices

Since any given point in a 2D image can be represented by any of an infinite number
of homogeneous coordinates, all scalar multiples of each other, we cannot expect
to directly solve equation (3-1) for the homography matrix (which we will call H).
One way to represent equality in homogeneous coordinates is to write the cross-product
of the two homogeneous coordinates that are supposed to be equal, and set it equal
to zero. This has the effect of constraining the two coordinates to be scalar multiples of
each other.

Thus the problem becomes:

$$x' \times Hx = 0 \tag{3-4}$$

Since, given any solution H to this problem, all scalar multiples of H are also solutions,
we can arbitrarily set one element of H to whatever value we like (in general we
usually set $H_{33}$ to 1) and solve for the other eight. Doing this we get two linear equations
per image point, and a total of eight unknowns, so four point correspondences are
required to compute H.
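The image warping and bilinear interpolation of equations (3-2) and (3-3) above can be sketched together (toy image and homography, illustrative only):

```python
# Sketch of equations (3-2)/(3-3) on a toy image: warp by H by mapping
# each destination pixel back through H^-1 and sampling the source with
# bilinear interpolation.

def matvec(M, p):
    return [sum(M[i][j] * p[j] for j in range(3)) for i in range(3)]

def bilinear(img, c, r):
    ci, ri = int(c), int(r)
    cf, rf = c - ci, r - ri
    return ((1 - cf) * (1 - rf) * img[ri][ci]
            + cf * (1 - rf) * img[ri][ci + 1]
            + (1 - cf) * rf * img[ri + 1][ci]
            + cf * rf * img[ri + 1][ci + 1])

def warp(img, H_inv, out_rows, out_cols):
    out = [[0.0] * out_cols for _ in range(out_rows)]
    for rp in range(out_rows):
        for cp in range(out_cols):
            a, b, alpha = matvec(H_inv, [cp, rp, 1.0])
            c, r = a / alpha, b / alpha          # source location (c, r)
            if 0 <= c < len(img[0]) - 1 and 0 <= r < len(img) - 1:
                out[rp][cp] = bilinear(img, c, r)
    return out

# A pure half-pixel shift in c: H maps c -> c + 0.5, so H^-1 shifts back.
H_inv = [[1.0, 0.0, -0.5], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
img = [[0.0, 10.0, 20.0, 30.0]] * 3
warped = warp(img, H_inv, 3, 4)
print(warped[1][2])   # halfway between 10 and 20 -> 15.0
```

The half-pixel case shows why interpolation matters: the warped value 15.0 exists nowhere in the source image but is exactly what sub-pixel matching needs.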
Ideally, though, we would like to know the parameters of H to very high precision
so that we can accurately compute depth using sub-pixel interpolation template match-
ing.
For an autonomous vehicle, the goal is to recognize small obstacles (as small as
20cm or so) at long range (60-100m in front of the vehicle). In order to accomplish
this, a combination of telephoto lenses and a large baseline becomes necessary. In this
situation, small inaccuracies in the calibration can cause large errors. In one particular
situation that we have studied, using a 1.2m baseline and 35mm lenses, a 1mm error (1
part in 1000) in the computed position of the camera can cause the epipolar line to be
off by as much as 2 pixels in certain parts of the image at extreme disparities. Accurate
computation of homography matrices is therefore essential.
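As a concrete baseline, the four-correspondence linear solve described in Section 3.2.2 can be sketched as follows (synthetic correspondences; plain Gaussian elimination is an implementation choice of this sketch, not necessarily what the thesis implementation used):

```python
# Sketch of the four-point solve for H with H33 fixed to 1: each
# correspondence (c,r)->(c',r') gives two linear equations, so four
# correspondences give an 8x8 system.

def solve(M, b):
    """Plain Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def homography_from_points(pairs):
    rows, rhs = [], []
    for (c, r), (cp, rp) in pairs:
        rows.append([c, r, 1.0, 0.0, 0.0, 0.0, -cp * c, -cp * r]); rhs.append(cp)
        rows.append([0.0, 0.0, 0.0, c, r, 1.0, -rp * c, -rp * r]); rhs.append(rp)
    h = solve(rows, rhs)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]

def apply_h(H, c, r):
    d = H[2][0] * c + H[2][1] * r + 1.0
    return ((H[0][0] * c + H[0][1] * r + H[0][2]) / d,
            (H[1][0] * c + H[1][1] * r + H[1][2]) / d)

# Synthetic test: generate correspondences from a known H, recover it.
H_true = [[1.1, 0.02, 5.0], [0.01, 0.95, -3.0], [1e-4, 2e-4, 1.0]]
pts = [(0.0, 0.0), (100.0, 0.0), (0.0, 80.0), (100.0, 80.0)]
pairs = [(p, apply_h(H_true, *p)) for p in pts]
H_est = homography_from_points(pairs)
assert all(abs(H_est[i][j] - H_true[i][j]) < 1e-6
           for i in range(3) for j in range(3))
```

A solve like this only provides a starting point; as the text explains, the parameters must then be refined by minimizing the image-space residual.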
Thus, some method of accurately determining calibration parameters is necessary.
The most obvious way to do this is to minimize the residual error between one image
and the other image when warped by the homography. The error we want to minimize
is
(3-5)
where I and I’ are the two images to be matched, and W is the region of the image that
corresponds to the planar surface. The standard way to minimize E would be to com-
pute its derivative:
(3-6)
where is the gradient of the warped image (which can also be written in
terms of gradients of the original image if desired), and
E W I′ H,( ) c r,( ) I c r,( )–( )2
c r,( ) W∈∑=
E∂Hij∂
---------- 2 W I′ H,( ) c r,( ) I c r,( )–( ) W I′ H,( )dc'd
------------------------ c' r',( ) c'dHijd
---------- c r,( ) W I′ H,( )dr'd
------------------------ c' r',( ) r'dHijd
---------- c r,( )+
c r,( ) W∈∑=
W I′ H,( )dc’d
------------------------
3.2 Weak Calibration of Multibaseline Stereo 37
(3-7)
Since this equation depends on the image data, we do not expect to find a closed-
form solution for the minimum of E by setting the derivatives to zero. Instead, we
must apply some type of nonlinear optimization to minimize the error. For this, we use
a program that has been in use at Carnegie Mellon for several years
[Oda 96a][Oda 96b]. It asks the user to select four matching points in a set of two
images, and to outline the planar region. This data is used to compute a starting set of
parameters for H. Since most nonlinear optimization techniques need an initial set of
parameters that is close to the minimum, and since the computation of the error and its
derivatives is a very computationally intensive process, we make use of image pyramids
when computing homography matrices.

A lower resolution version of both images is obtained by simply replacing each
block of four adjacent pixels with their average. This is done for each level of the pyr-
amid. The homography matrix parameters for the lower resolution images are derived
by:

    \begin{bmatrix} H_{11} & H_{12} & H_{13} \\ H_{21} & H_{22} & H_{23} \\ H_{31} & H_{32} & H_{33} \end{bmatrix}
    \;\Rightarrow\;
    \begin{bmatrix} H_{11} & H_{12} & \tfrac{1}{2}H_{13} \\ H_{21} & H_{22} & \tfrac{1}{2}H_{23} \\ 2H_{31} & 2H_{32} & H_{33} \end{bmatrix}    (3-8)

It is easy to verify that this gives the correct answer. If

    \alpha \begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix}
    = \begin{bmatrix} H_{11} & H_{12} & H_{13} \\ H_{21} & H_{22} & H_{23} \\ H_{31} & H_{32} & H_{33} \end{bmatrix}
      \begin{bmatrix} c \\ r \\ 1 \end{bmatrix}    (3-9)

then

    \begin{bmatrix} H_{11} & H_{12} & \tfrac{1}{2}H_{13} \\ H_{21} & H_{22} & \tfrac{1}{2}H_{23} \\ 2H_{31} & 2H_{32} & H_{33} \end{bmatrix}
    \begin{bmatrix} \tfrac{1}{2}c \\ \tfrac{1}{2}r \\ 1 \end{bmatrix}
    = \alpha \begin{bmatrix} \tfrac{1}{2}c' \\ \tfrac{1}{2}r' \\ 1 \end{bmatrix}    (3-10)
38 Chapter 3. Calibration
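The rescaling rule of equation (3-8) is easy to verify numerically; this small sketch (hypothetical helper names, not the thesis code) applies a homography at full and half resolution:

```python
def apply_h(H, c, r):
    """Apply a 3x3 homography to pixel (c, r)."""
    w = H[2][0]*c + H[2][1]*r + H[2][2]
    return ((H[0][0]*c + H[0][1]*r + H[0][2]) / w,
            (H[1][0]*c + H[1][1]*r + H[1][2]) / w)

def downsample_h(H):
    """Rescale a homography for images downsampled by two (equation 3-8)."""
    return [[H[0][0],     H[0][1],     0.5*H[0][2]],
            [H[1][0],     H[1][1],     0.5*H[1][2]],
            [2.0*H[2][0], 2.0*H[2][1], H[2][2]]]

H = [[1.02, 0.01, 5.0], [-0.02, 0.99, -3.0], [1e-4, 2e-5, 1.0]]
c, r = 120.0, 80.0
cp, rp = apply_h(H, c, r)                       # full-resolution correspondence
cp2, rp2 = apply_h(downsample_h(H), c/2, r/2)   # half-resolution correspondence
assert abs(cp2 - cp/2) < 1e-9 and abs(rp2 - rp/2) < 1e-9
```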
At each level of the pyramid (starting from the lowest resolution), a Levenberg-
Marquardt nonlinear optimization is used to minimize E (which requires computing the
derivative). The resulting parameters are then transformed for the next higher resolu-
tion level, and the optimization is performed again using these parameters as a starting
point. The results of the total optimization are shown in Figure 3-1: the residuals are
displayed as difference images, with the intensities normalized by the same factor in
order to make the errors visible. For this case, the residual error was reduced by
roughly half.
3.2.3. Finding the Epipole
In order to find the epipole, all that is necessary is to compute two homographies
for different planes:

    H_2 - H_1 = \left( H_\infty + \frac{e\, n_2^T A^{-1}}{d_2} \right)
              - \left( H_\infty + \frac{e\, n_1^T A^{-1}}{d_1} \right)
              = e \left( \frac{n_2}{d_2} - \frac{n_1}{d_1} \right)^T A^{-1}    (3-11)
Note that the resulting matrix is rank 1 (it is the outer product of two 3-vectors). This
means that we can determine both of the vectors, but only to within a scale factor.
3.2 Weak Calibration of Multibaseline Stereo 39
Since any scalar multiple of a homogeneous coordinate represents the same point, this
is all that is necessary.

Figure 3-1: Results of homography computation (left image; right image; residual
after choosing four points; residual after optimization).

There is one problem with this, however. Since we were able to compute the
homographies only up to an arbitrary scale factor, the cancellation of H_\infty in
equation (3-11) is not possible unless we compute the relative scale factors
of the two matrices.

To accomplish this, we use the fact that the difference between the two homogra-
phies is rank 1 for the correct scale factor. So we simply have to find β such that
H_1 - \beta H_2 is rank 1. In general, because of rounding errors and imperfect assumptions
made by our model, there will be no β that accomplishes this exactly. We evaluate how
good any given β is by computing the Singular Value Decomposition of H_1 - \beta H_2 and
taking the ratio of the largest and second largest singular values. Finding the best value
of β then becomes a simple 1D optimization problem which can be solved by any
number of methods.
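As a concrete illustration of this 1D search, the following sketch (pure Python; a simple grid search and a closed-form symmetric 3×3 eigensolver stand in for whatever SVD routine and optimizer the thesis software actually used) scores each β by the ratio of the two largest singular values of H_1 - \beta H_2:

```python
import math

def sym_eigvals3(A):
    # eigenvalues of a symmetric 3x3 matrix, largest first (trigonometric form)
    a, b, c = A[0][0], A[1][1], A[2][2]
    d, e, f = A[0][1], A[0][2], A[1][2]
    q = (a + b + c) / 3.0
    p2 = (a-q)**2 + (b-q)**2 + (c-q)**2 + 2.0*(d*d + e*e + f*f)
    p = math.sqrt(p2 / 6.0) or 1e-30
    B = [[(A[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    detB = (B[0][0]*(B[1][1]*B[2][2] - B[1][2]*B[2][1])
            - B[0][1]*(B[1][0]*B[2][2] - B[1][2]*B[2][0])
            + B[0][2]*(B[1][0]*B[2][1] - B[1][1]*B[2][0]))
    phi = math.acos(max(-1.0, min(1.0, detB / 2.0))) / 3.0
    e1 = q + 2.0*p*math.cos(phi)
    e3 = q + 2.0*p*math.cos(phi + 2.0*math.pi/3.0)
    return sorted([e1, 3.0*q - e1 - e3, e3], reverse=True)

def rank1_score(M):
    # ratio of second-largest to largest singular value (0 for an exact rank-1 matrix)
    MtM = [[sum(M[k][i]*M[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
    s = [math.sqrt(max(v, 0.0)) for v in sym_eigvals3(MtM)]
    return s[1]/s[0] if s[0] > 0.0 else 0.0

# two homographies of the form H_inf + e n^T, the second with an unknown scale 0.7
H_inf = [[1.0, 0.01, 2.0], [0.0, 1.0, -1.0], [1e-4, 0.0, 1.0]]
e_vec = [0.5, -0.2, 0.01]
n1, n2 = [0.01, 0.02, 0.3], [0.03, -0.01, 0.5]
H1 = [[H_inf[i][j] + e_vec[i]*n1[j] for j in range(3)] for i in range(3)]
H2 = [[0.7*(H_inf[i][j] + e_vec[i]*n2[j]) for j in range(3)] for i in range(3)]

diff = lambda b: [[H1[i][j] - b*H2[i][j] for j in range(3)] for i in range(3)]
best_beta = min((0.5 + 0.002*i for i in range(1001)), key=lambda b: rank1_score(diff(b)))
assert abs(best_beta - 1.0/0.7) < 0.005   # H1 - beta*H2 is rank 1 at beta = 1/0.7
```

Any smarter 1D optimizer (golden-section, Brent) would do; the grid search only keeps the sketch short.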
The mathematics described here will be used in Section 5.3.3, when we discuss
Image Rectification, which is a process by which the stereo search is set up to be a
very regular computation which can be implemented efficiently.
3.2.4. Improving Accuracy of Recovered Parameters
As was noted in the last section, the computation of the epipole e from a pair of
homography matrices requires a second step to normalize the difference between the
matrices so that it is rank 1. This is due to the fact that the class of all homography
matrices, for a given camera geometry, is such that the difference between any two
matrices must be a rank 1 matrix (as can be seen from equation (3-11)). Thus the com-
putation of two distinct homography matrices, optimizing 16 separate parameters, has
too many degrees of freedom. Another way to express this is that once we have the
first homography matrix for a pair of cameras, all we need to know to compute another
homography matrix are the two 3-vectors e and \left( \frac{n_2}{d_2} - \frac{n_1}{d_1} \right)^T A^{-1}. Since we are taking
the outer product of these two vectors, their relative scale doesn't matter (we can
divide one vector by some quantity and multiply the other by the same quantity and
still get the same homography matrix). This means that the first homography that we
compute for a pair of cameras requires eight parameters, but the second one only
requires five. Furthermore, if we know two homographies for a pair of cameras, the
third and subsequent ones only require three parameters each (since we already know
e up to a scale factor). Similarly, if we already know one set of homographies for a
system with two baselines, we need to determine two different values for e, but the
other vector is the same in both cases. The number of parameters necessary to describe
a set of homographies for a set of cameras is summarized in Table 3-1. For example, a
set of 3 cameras with 3 planes requires 27 parameters.
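The counts in Table 3-1 can be turned into a small bookkeeping function; this sketch (my helper, not from the thesis software) reproduces the 27-parameter example:

```python
def n_params(cameras, planes):
    """Parameter count for a set of homographies, following Table 3-1."""
    first = 8 + (5 if planes >= 2 else 0) + 3*max(planes - 2, 0)   # first baseline
    extra = 8 + (3 if planes >= 2 else 0)   # each additional baseline; later planes are free
    return first + (cameras - 2)*extra

assert n_params(3, 3) == 27   # the example given in the text
assert n_params(2, 1) == 8    # a single homography for one camera pair
```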
We define a new error metric E':

    E' = \sum_{p \in P} \sum_{b \in B} \sum_{(c,r) \in W}
         \big( \mathcal{W}(I_{bp}, H_{bp})(c,r) - I_{0p}(c,r) \big)^2    (3-12)

where B is the set of baselines (numbered from 1 to the number of cameras) and P is
the set of planar surfaces for which we have images. Image 0 is used as a reference
image which is compared to all of the other images. From the previous discussion, the
parameters necessary to compute H_{bp}, the homography matrix for a particular base-
line b and planar surface p, are:
• one full homography matrix for each baseline: H_b
• one 3-vector for each baseline, representing the epipole: e_b
• one 3-vector for each planar surface: n_p, defined by n_p^T = \left( \frac{n_p}{d_p} - \frac{n_0}{d_0} \right)^T A_0^{-1}. Note that n_0 = 0.
The equation for H_{bp} is then:

    H_{bp} = H_b + e_b n_p^T    (3-13)

The equations for the derivatives of E' can be obtained from equation (3-6) and
equation (3-12) as follows:
Table 3-1: Parameters needed to describe a set of homographies for multiple planes and multiple cameras

                               first baseline    second and additional baselines
first plane                          8                        8
second plane                         5                        3
third and additional planes          3                        0
    \frac{\partial E'}{\partial (H_{b'})_{ij}} = \sum_{p \in P} \sum_{b \in B} \sum_{k,l}
        \frac{\partial E_{bp}}{\partial (H_{bp})_{kl}} \cdot \frac{\partial (H_{bp})_{kl}}{\partial (H_{b'})_{ij}}

    \frac{\partial E'}{\partial (e_{b'})_{i}} = \sum_{p \in P} \sum_{b \in B} \sum_{k,l}
        \frac{\partial E_{bp}}{\partial (H_{bp})_{kl}} \cdot \frac{\partial (H_{bp})_{kl}}{\partial (e_{b'})_{i}}

    \frac{\partial E'}{\partial (n_{p'})_{i}} = \sum_{p \in P} \sum_{b \in B} \sum_{k,l}
        \frac{\partial E_{bp}}{\partial (H_{bp})_{kl}} \cdot \frac{\partial (H_{bp})_{kl}}{\partial (n_{p'})_{i}}    (3-14)

where \partial E_{bp} / \partial (H_{bp})_{kl} refers to the quantity in equation (3-6), computed for a particular
baseline and planar surface. The missing pieces are:

    \frac{\partial (H_{bp})_{kl}}{\partial (H_{b'})_{ij}} =
        \begin{cases} 1 & \text{if } b' = b,\ k = i,\ \text{and } l = j \\ 0 & \text{otherwise} \end{cases}

    \frac{\partial (H_{bp})_{kl}}{\partial (e_{b'})_{i}} =
        \begin{cases} (n_p)_l & \text{if } b' = b \text{ and } k = i \\ 0 & \text{otherwise} \end{cases}

    \frac{\partial (H_{bp})_{kl}}{\partial (n_{p'})_{j}} =
        \begin{cases} (e_b)_k & \text{if } p' = p \text{ and } l = j \\ 0 & \text{otherwise} \end{cases}    (3-15)

The program described in Section 3.2.2 was rewritten to handle an arbitrary number of
baselines and planar surfaces, using the above equations to optimize the large system
for the best set of parameters. As expected, the residual matching errors after optimi-
zation are slightly higher (since degrees of freedom have been removed from the prob-
lem), but satisfactory solutions are found consistently.
3.2.5. Stereo Search
In order to use the results of the calibration technique described in the previous
section, we rewrite equation (3-1) to use the parameters that we have computed:
    z_b \begin{bmatrix} c_b \\ r_b \\ 1 \end{bmatrix}
    = z H_b \begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + e_b    (3-16)
Note that for a given (c,r) and z for the reference camera, this equation tells us the loca-
tion of the corresponding points in all of the other cameras. In order to perform the stereo
search, we need to decide in what increments we will move along the line segment
defined by H_b [c\; r\; 1]^T and e_b, which is equivalent to asking what values of z we want
to test.

Since the image is sampled at pixel boundaries, it makes sense to search in one-
pixel increments. Smaller search steps could be used to yield sub-pixel precision (up to
some limit determined by the particular camera configuration being used). Larger
steps could be used to reduce the total number of steps searched, thus increasing com-
putational speed at the expense of resolution (though if the steps are too large it is pos-
sible to miss the correct match completely).
Dividing equation (3-16) by z, we get:

    \frac{z_b}{z} \begin{bmatrix} c_b \\ r_b \\ 1 \end{bmatrix}
    = H_b \begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + \frac{1}{z} e_b    (3-17)
which is a more convenient representation for the equation. By dividing the first two
elements of the left-hand side by the third, the corresponding location (c’,r’ ) in the
second image is determined. Because this division is required, all scalar multiples of
equation (3-17) are equally valid for defining the search space.
In order to perform the search, we rewrite the equation once more:
    \frac{z_b}{z} \begin{bmatrix} c_b \\ r_b \\ 1 \end{bmatrix}
    = H_b \begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + d \cdot \frac{e_b}{s}    (3-18)

where s is a scale factor that determines how large the search steps will be and d is an
integer. In general, since the relative magnitudes of the e_b will all be different, the step
size will be different in each of the images. In practice, we always adjust s so that the
steps are one pixel for the longest baseline (which also corresponds to the e_b with the
largest magnitude). This implies that the step size on the shorter baselines will be less than
one pixel.
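A minimal sketch of the search-space enumeration of equation (3-18) (hypothetical helper, toy numbers; with the third component of e_b zero, the steps come out to exactly |e_b|/s pixels):

```python
def search_positions(H, e, c, r, s, dmax):
    """Candidate correspondences (c_b, r_b) for integer disparities d = 0..dmax (eq. 3-18)."""
    base = [H[0][0]*c + H[0][1]*r + H[0][2],
            H[1][0]*c + H[1][1]*r + H[1][2],
            H[2][0]*c + H[2][1]*r + H[2][2]]
    out = []
    for d in range(dmax + 1):
        v = [base[i] + d*e[i]/s for i in range(3)]
        out.append((v[0]/v[2], v[1]/v[2]))   # divide through by the third coordinate
    return out

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
e_b = [40.0, 0.0, 0.0]                       # toy epipole direction
pos = search_positions(I3, e_b, 100.0, 50.0, 40.0, 3)
assert pos[0] == (100.0, 50.0)               # d = 0: the homography mapping itself
assert pos[1] == (101.0, 50.0)               # s = |e_b| gives one-pixel steps
```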
3.3. Global (metric or Euclidean) calibration
Although much stereo processing and inference can be done without ever mapping
the image coordinates back into metric 3D space, our eventual goal (obstacle avoid-
ance) requires at least some measure of the size, position, and range of the objects
observed by the stereo system. A high degree of precision may not be possible (since
the accuracy of stereo range decreases with distance), but it is also not necessary.
The process of stereo matching produces a value of d at each pixel (c,r). The ques-
tion then becomes: what is the relationship between the (c,r,d) coordinates and Euclid-
ean (x,y,z) coordinates? From equation (3-17) and equation (3-18), we have that

    d = \frac{s}{z}    (3-19)

Combining this with equation (2-3), we can write the relationship as a linear map-
ping between 3D homogeneous coordinates:

    z \begin{bmatrix} c \\ r \\ d \\ 1 \end{bmatrix}
    = \begin{bmatrix} f & 0 & u & 0 \\ 0 & \gamma f & v & 0 \\ 0 & 0 & 0 & s \\ 0 & 0 & 1 & 0 \end{bmatrix}
      \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}    (3-20)
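The mapping of equation (3-20) and its inverse can be sketched as follows (the intrinsic values here are illustrative, not the calibrated ones):

```python
# illustrative intrinsics: focal length f, aspect gamma, image center (u, v), disparity scale s
f, gamma, u, v, s = 800.0, 1.0, 320.0, 240.0, 4000.0

def project(x, y, z):
    """(x, y, z) -> (c, r, d) via z * [c, r, d, 1]^T = M [x, y, z, 1]^T (equation 3-20)."""
    return (f*x/z + u, gamma*f*y/z + v, s/z)

def backproject(c, r, d):
    """Invert equation (3-20): recover (x, y, z) from a pixel and its disparity."""
    z = s / d                      # from d = s/z, equation (3-19)
    return ((c - u)*z/f, (r - v)*z/(gamma*f), z)

assert backproject(*project(1.0, 2.0, 10.0)) == (1.0, 2.0, 10.0)
```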
Often it is convenient to have the origin of our world coordinate system be differ-
ent from the focus of one of the cameras. This is easily accomplished by simply right-
multiplying by a 4×4 rigid transformation:

    z \begin{bmatrix} c \\ r \\ d \\ 1 \end{bmatrix}
    = \begin{bmatrix} f & 0 & u & 0 \\ 0 & \gamma f & v & 0 \\ 0 & 0 & 0 & s \\ 0 & 0 & 1 & 0 \end{bmatrix}
      \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}
      \begin{bmatrix} x' \\ y' \\ z' \\ 1 \end{bmatrix}
    = P^{-1} \begin{bmatrix} x' \\ y' \\ z' \\ 1 \end{bmatrix}    (3-21)

or equivalently, since the matrices are invertible,

    \alpha \begin{bmatrix} x' \\ y' \\ z' \\ 1 \end{bmatrix}
    = P \begin{bmatrix} c \\ r \\ d \\ 1 \end{bmatrix}    (3-22)

where α represents the fact that we need to divide through by the fourth (homogeneous)
coordinate. Since the resulting P is still a 4×4 matrix, we can solve for it using a variety
of linear algebraic tools. The minimal data necessary to solve this problem is a set of five
points, no four of which are coplanar.
3.3.1. Practical and Accurate Metric Calibration
Although five points provides for a minimal solution to the calibration problem,
the solution thus obtained is very sensitive to measurement errors (both in the mea-
surement of disparity and in the measurement of real-world distances). Since we
already have tools for determining sets of homography matrices very accurately, it
makes sense to use them for metric calibration.
With cameras mounted on top of an automobile, it is easy to find vertical and hori-
zontal planes, and to move the vehicle around within the ground plane. As an example
of one way to calibrate the system fairly accurately using homography matrices, con-
sider taking images of a wall that is vertical and perpendicular to the direction of travel
of the vehicle. In our standard vehicle coordinate system, such a plane is a plane of
constant z.

The equation for a homography can be written as

    \frac{z'}{z} \begin{bmatrix} c' \\ r' \\ 1 \end{bmatrix}
    = H \begin{bmatrix} c \\ r \\ 1 \end{bmatrix}
    = H_\infty \begin{bmatrix} c \\ r \\ 1 \end{bmatrix}
      + e\, \frac{n^T A^{-1}}{h} \begin{bmatrix} c \\ r \\ 1 \end{bmatrix}    (3-23)

which, when compared with equation (3-18), yields the following relationship for
points on the plane:

    d = \frac{1}{\|e\|\, h}\, n^T A^{-1} \begin{bmatrix} c \\ r \\ 1 \end{bmatrix}
      = n^T \begin{bmatrix} c \\ r \\ 1 \end{bmatrix}    (3-24)

where the final n^T absorbs the constant factors. Thus the homography for the plane
defines the disparity d of a point on the plane for each point in the image. If we expand
out the part of equation (3-22) that deals with the z coordinate, we get that

    z' = \frac{P_{31} c + P_{32} r + P_{33} d + P_{34}}{P_{41} c + P_{42} r + P_{43} d + P_{44}}    (3-25)

Substituting equation (3-24) into equation (3-25) and rearranging terms, we get

    \big( P_{41} c + P_{42} r + P_{43}(n_1 c + n_2 r + n_3) + P_{44} \big)\, z'
    = P_{31} c + P_{32} r + P_{33}(n_1 c + n_2 r + n_3) + P_{34}    (3-26)

Collecting terms in c and r yields

    (P_{41} z' + P_{43} n_1 z' - P_{31} - P_{33} n_1)\, c
    + (P_{42} z' + P_{43} n_2 z' - P_{32} - P_{33} n_2)\, r
    + (P_{44} z' + P_{43} n_3 z' - P_{34} - P_{33} n_3) = 0    (3-27)

Since this equation must be true for all c and r, the coefficients multiplying c and r and
the constant term must each be zero:
    P_{41} z' + P_{43} n_1 z' - P_{31} - P_{33} n_1 = 0
    P_{42} z' + P_{43} n_2 z' - P_{32} - P_{33} n_2 = 0
    P_{44} z' + P_{43} n_3 z' - P_{34} - P_{33} n_3 = 0    (3-28)

Since this set of equations does not define the scale of the parameters (given a solu-
tion, multiplying all of the parameters by some scalar would be another solution), we
can arbitrarily decide to set P_{43} equal to 1. Intuitively, the denominator of
equation (3-25) determines the location of the plane at infinity in (c,r,d) space, since
when the denominator goes to zero, the 3D coordinates of the point will go to infinity.
Therefore, we cannot set P_{44} to 1 as in the previous section, because this effectively
requires the denominator to have a non-zero constant term. We can set P_{43} to 1
because we can be confident that the equation for the plane at infinity will depend on
d.
Thus each plane of constant z gives us a set of three linear equations in seven
unknowns. Three such planes are required to solve for the parameters of P. We collect
the data for several planes by very carefully driving the car along a straight line that is
perpendicular to the wall that we are observing. The homographies for all of the planes
can be computed at once using the technique described in Section 3.2.4. If we arrange
the problem like so:

    \begin{bmatrix}
    1 & 0 & n_{11} & 0 & -z_1' & 0 & 0 \\
    0 & 1 & n_{12} & 0 & 0 & -z_1' & 0 \\
    0 & 0 & n_{13} & 1 & 0 & 0 & -z_1' \\
      &   &        & \vdots & & & \\
    1 & 0 & n_{k1} & 0 & -z_k' & 0 & 0 \\
    0 & 1 & n_{k2} & 0 & 0 & -z_k' & 0 \\
    0 & 0 & n_{k3} & 1 & 0 & 0 & -z_k'
    \end{bmatrix}
    \begin{bmatrix} P_{31} \\ P_{32} \\ P_{33} \\ P_{34} \\ P_{41} \\ P_{42} \\ P_{44} \end{bmatrix}
    =
    \begin{bmatrix} n_{11} z_1' \\ n_{12} z_1' \\ n_{13} z_1' \\ \vdots \\ n_{k1} z_k' \\ n_{k2} z_k' \\ n_{k3} z_k' \end{bmatrix}    (3-29)

then it is a linear problem of the form X p = Y, and the least squares solution for the
parameters of P can be obtained by the pseudo-inverse, i.e.

    p = (X^T X)^{-1} X^T Y    (3-30)

If the solution is unstable (because X^T X is not invertible), SVD can be applied to com-
pute a suitable pseudo-inverse. The problem generalizes to more than three planes, and
the solution becomes more accurate with additional data.
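A sketch of the normal-equations solution of equation (3-30), with a generic Gaussian elimination standing in for whatever linear-algebra routine was actually used (toy data, not calibration data):

```python
def lstsq(X, Y):
    """Least squares via normal equations: p = (X^T X)^{-1} X^T Y (equation 3-30)."""
    n = len(X[0])
    A = [[sum(X[k][i]*X[k][j] for k in range(len(X))) for j in range(n)] for i in range(n)]
    b = [sum(X[k][i]*Y[k] for k in range(len(X))) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, n):
            m = A[row][col] / A[col][col]
            for j in range(col, n):
                A[row][j] -= m*A[col][j]
            b[row] -= m*b[col]
    p = [0.0]*n
    for row in range(n - 1, -1, -1):
        p[row] = (b[row] - sum(A[row][j]*p[j] for j in range(row + 1, n))) / A[row][row]
    return p

# overdetermined toy system with a known solution
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 9.0]]
true_p = [2.0, -1.0]
Y = [x[0]*true_p[0] + x[1]*true_p[1] for x in X]
p = lstsq(X, Y)
assert abs(p[0] - 2.0) < 1e-8 and abs(p[1] + 1.0) < 1e-8
```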
Of the sixteen unknowns in P, eight of the variables are determined by this proce-
dure. If we also have a homography matrix for the ground plane (which can also be
computed at the same time as the other homographies), we can establish it as a plane
where y’ is zero:
    y' = \frac{P_{21} c + P_{22} r + P_{23} d + P_{24}}{P_{41} c + P_{42} r + P_{43} d + P_{44}} = 0    (3-31)

Again using equation (3-24), rearranging terms, and again noting that the equation
must be true for all c and r, we get

    P_{21} + P_{23} n_1 = 0
    P_{22} + P_{23} n_2 = 0
    P_{24} + P_{23} n_3 = 0    (3-32)

which allows us to solve for P_{21}, P_{22}, and P_{24} in terms of P_{23}. If we observe one
point (c,r,d) at a known height y, we have:

    y = \frac{-P_{23} (n_1 c + n_2 r - d + n_3)}{P_{41} c + P_{42} r + P_{43} d + P_{44}}    (3-33)

which we can solve for P_{23} since we know the values of all of the other quantities.

All that remains to be determined are the values in the first row of the matrix.
Since planes of constant x are scarce, a different method is in order. We already have
many views of a vertical surface for which the disparity at every pixel is known. If we
can measure the x coordinate for four points that are not coplanar in (c,r,d) space, then
we can set up the following equation:

    \begin{bmatrix} c_0 & r_0 & d_0 & 1 \\ c_1 & r_1 & d_1 & 1 \\ c_2 & r_2 & d_2 & 1 \\ c_3 & r_3 & d_3 & 1 \end{bmatrix}
    \begin{bmatrix} P_{11} \\ P_{12} \\ P_{13} \\ P_{14} \end{bmatrix}
    =
    \begin{bmatrix}
    x_0 (P_{41} c_0 + P_{42} r_0 + P_{43} d_0 + P_{44}) \\
    x_1 (P_{41} c_1 + P_{42} r_1 + P_{43} d_1 + P_{44}) \\
    x_2 (P_{41} c_2 + P_{42} r_2 + P_{43} d_2 + P_{44}) \\
    x_3 (P_{41} c_3 + P_{42} r_3 + P_{43} d_3 + P_{44})
    \end{bmatrix}    (3-34)

The least squares solution for the parameters can be computed in the same way as for
equation (3-29). In practice, a set of points for the optimization can be determined by
measuring the x coordinate of two world points on the vertical surface and tracking the
points through the sequence of images of that surface.

The preceding section has described one example calibration and a set of tools for
computing the parameters of the calibration matrix. In other situations where this type
of calibration is necessary, it may be more convenient to capture images of different
types of surfaces, but the same techniques and equations are applicable.
3.4. Summary of the Calibration Method Steps
1. Collect at least three images of a vertical planar surface, perpendicular to the
direction of travel of the vehicle and at a known (measured) distance. The sur-
face should have sufficient texture to allow accurate matching.
2. Choose two points on the planar surface, and measure the lateral position and
height of those points.
3. If not already done in 1., collect an image of a horizontal plane, preferably the
ground plane.
4. Using the weak calibration method described in Section 3.2.4, match all of the
planes collected above to obtain a set of weak calibration parameters H_b, e_b,
and n_p.
5. Use the n_p values from the different vertical planes to solve equation (3-30).
This is enough to allow for metric computation of z.
6. Use the n_p value for the ground plane along with the measured height of a single
point on a vertical plane to solve equation (3-32) and equation (3-33). This
allows metric computation of y.
7. Use the measured lateral positions and equation (3-34) to determine the remain-
ing parameters of P, allowing for the metric computation of x.
3.5. Calibration Accuracy
The metric calibration procedure described in the previous sections was used to
calibrate the cameras on our vehicle. We used a garage door as the vertical surface, and
the garage floor as a horizontal surface (the images shown in Figure 1-1 are one set of
images from this calibration set). A total of eight planar surfaces were matched:
images of the garage door were taken at 5-meter intervals from 15 to 45 meters, and
the ground plane from the 45-meter image was used for the horizontal surface.
The calibration was then tested on a set of images of obstacles, taken at 10-meter
intervals from 50 to 150 meters, with about one meter precision. For each obstacle,
stereo matching was performed, and the results were hand-segmented to ensure that no
outlier pixels were included. Then the distance to the obstacle was computed using the
metric calibration parameters derived via the method described in this chapter. The
results are shown in Figure 3-2. The data for 120 meters was erased accidentally, but
the remaining data shows that the calibration is reasonably accurate. The three curves
plotted on the graph represent the correct result (in the center) and the expected results
if the stereo match were off by one pixel in either direction from the correct result.
Since the calibration data only goes out to 45 meters, the results shown in the
graph are all extrapolated from the calibration data and we therefore expect the cali-
bration to become less accurate as distance increases. The errors seen in the graph can
thus be explained as some combination of calibration error, error in measurement of
the ground truth (not more than one meter), and possible stereo matching error (not
more than one pixel).
Figure 3-2: Calibration accuracy. Measured range (m) is plotted against ground truth
(m) over 40–160 m; the center curve is the correct result, and the outer curves show the
expected result if the stereo match were off by one pixel in either direction.
Chapter 4
Stereo Algorithm
The research described in this thesis originated from an effort to apply the CMU
Video-Rate Multibaseline Stereo Machine to the problem of detecting highway obsta-
cles. The stereo algorithm used in this research is thus based on the algorithm used by
the stereo machine.
No claim is made that this algorithm is necessarily the best way of computing
depth from multiple camera views. Many algorithmic choices within the stereo
machine seem to have been made for ease and speed of implementation rather than for
accuracy of the final result. In fact, the software implementation of this algorithm is
more accurate than the hardware, since it does not suffer from most of the limitations
that were imposed on the hardware by speed, cost, or ease of design.
The basic system that I have constructed is shown in Figure 4-1. With the excep-
tion of the “obstacle detection/localization” box, each of the boxes in this figure will
be touched upon in both this chapter and the next. This chapter provides motivation for
why a particular step in the algorithm has been chosen; Chapter 5 will go into the
implementation details of how each processing step can be performed efficiently and
accurately. The methods used in the “obstacle detection/localization” box are
described in Chapter 6.

4.1. Related Work

The CMU Video-Rate Multibaseline Stereo Machine is described in detail in
[Kanade et al. 96], though the ideas behind the design decisions that were made are not
described there.
Figure 4-1: Architecture of Stereo Obstacle Detection System (each of three cameras
feeds a LoG filter and an image rectification stage, followed by stereo matching and
obstacle detection/localization).
[Matthies 92] derives several different stereo algorithms using a statistical frame-
work, including the basic SSD search method upon which the work in this thesis is
based. All of the pieces of our algorithm are discussed in some detail in [Faugeras 93],
although many other stereo vision algorithms are also discussed.

The method for computing stereo from more than two cameras that is used in this
thesis was first described in [Okutomi & Kanade 93]. The algorithm described there is
called SSSD-in-inverse-distance. The main idea, that matching errors from multiple
baselines can be added together to evaluate different possible geometries, has been
retained in this work. The final section of this chapter discusses alternatives to the SSD
metric. The “inverse distance” part has been generalized in this dissertation through
the use of projective geometry such that any set of planes in the world can be used;
however, if uniform sampling in the image is desired, then the perpendicular distances
to subsequent planes are still controlled by the inverse distance formula.

Convolution with the Laplacian of Gaussian operator has a long history as a fea-
ture detector. It was first used by Marr and Hildreth [Marr & Hildreth 80] as an edge
detector. Nishihara [Nishihara 84] first used the sign of the LoG-filtered image for ste-
reo matching. The use of more bits of information from the LoG filter was a natural
extension of this.

4.2. Multibaseline Stereo

Although only two cameras are required to compute range from image data, there
are several advantages to using more than two cameras for stereo vision:
1. since the epipolar direction is the same as the direction of camera displacement, it
is possible to arrange for the epipolar directions of multiple cameras to lie in differ-
ent directions in the image, thus taking advantage of image texture in any direction
(an example of this is illustrated in Figure 4-2). An example of where this is useful
is when viewing a horizontal feature such as the curb in Figure 4-3. When viewing
this region with only a pair of cameras with a horizontal baseline, there is very lit-
tle in the image to distinguish one location from another. The addition of a vertical
baseline allows us to take advantage of the available texture.
2. repeating texture in the image can confuse a two camera system by causing match-
ing ambiguities; these ambiguities are eliminated when additional cameras are
present, assuming that the camera spacing is not an integer multiple of the texture
spacing; the latter issue can be avoided by not placing the cameras at an even spac-
ing.
3. as in any measurement process, additional measurements allow more accurate
results by averaging noise; in the case of a large number of cameras, outliers can
be rejected by voting or robust statistics
4. shorter baselines are less prone to matching error while longer baselines are more
accurate; the combination is better than either alone
5. different regions of space are occluded for each camera pair; therefore the prob-
lems caused by occlusion are somewhat ameliorated by using multiple cameras
For advantages 1 and 2 adding a third camera is sufficient; fourth and additional
cameras do not yield any additional benefits. The advantages of 3-5 continue to grow
past the fourth camera. Thus there is a large benefit to adding the third camera, and the
benefits diminish with the fourth and additional cameras.

Figure 4-2: Three cameras in an “L” configuration give different epipolar directions

4.3. LoG Filtering

Since the Laplacian of Gaussian is a second derivative operator, places where the
LoG-filtered image is zero are places where the intensity of the original image has
maximum variation, i.e., edges. In addition to being a good edge detector, the LoG fil-
ter also has the following two properties:
• it has a tunable Gaussian filter for filtering out high-frequency image noise
• since the LoG function naturally integrates to zero, any bias in intensity between
the cameras is eliminated (it subtracts out)

Since the zero crossings of the LoG-filtered image are interesting points, it makes
sense that points that are near to zero will also be interesting. Therefore we pre-filter
our images with an LoG filter. We use a small standard deviation for the filter since we
do not want to aggressively remove high-frequency texture from the image. In addition,
we apply the filter with a high gain, and saturate values that overflow or underflow at
the maximum and minimum representable values. This has the effect of accentuating
regions that are near to zero crossings.

The result of LoG filtering is shown in Figure 4-3. The increase in image texture,
particularly on the road surface, is very apparent. In practice, the texture extracted by
this method is consistent even between different cameras, and thus is very useful for
stereo matching in such bland environments.

One question that remains is how large the gain on the filter should be. Experi-
ments performed with a large number of different gains in an attempt to determine the
optimal gain value had predictable results: the optimal gain depends on the relative
contrast of the image. Images that have very little contrast benefit from a large gain
(even if the noise is amplified greatly, it is still better than having no signal to match
whatsoever). On the other hand, images with high contrast match well without any
additional enhancement.
In practice, the gain should probably depend on the image data itself. A system
that automatically adjusted the gain so that the contrast was as high as possible without
saturating the image would be a good solution, though nothing of this type has been
implemented.
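The pre-filtering described above (small standard deviation, high gain, saturation) can be sketched in one dimension; the kernel form and parameter values here are illustrative only, not the thesis implementation:

```python
import math

def log_kernel(sigma, half=4):
    """Sampled 1-D Laplacian-of-Gaussian kernel, adjusted to sum exactly to zero."""
    k = [(x*x/sigma**4 - 1.0/sigma**2) * math.exp(-x*x/(2.0*sigma**2))
         for x in range(-half, half + 1)]
    mean = sum(k) / len(k)
    return [v - mean for v in k]   # zero DC response: constant offsets subtract out

def log_filter(signal, sigma=1.0, gain=8.0):
    """Convolve with the LoG, amplify by `gain`, and saturate to [-128, 127]."""
    k = log_kernel(sigma)
    h = len(k) // 2
    out = []
    for i in range(h, len(signal) - h):
        v = gain * sum(k[j]*signal[i + j - h] for j in range(len(k)))
        out.append(max(-128.0, min(127.0, v)))
    return out

# a constant intensity bias between two cameras disappears after filtering
a = [10.0, 10.0, 12.0, 30.0, 12.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0]
b = [v + 40.0 for v in a]   # same scene, biased camera
fa, fb = log_filter(a), log_filter(b)
assert all(abs(x - y) < 1e-6 for x, y in zip(fa, fb))
```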
4.4. Rectification and Interpolation
After calibrating the camera system, the necessary geometric constraints between
cameras for stereo matching are known. For each pixel in the reference image, we can
compute the coordinates of a set of possible corresponding points in each of the other
images. In general, these coordinates will not fall on integer pixel boundaries; thus
some method of estimating the correct value of arbitrary points in the image is neces-
sary.

Figure 4-3: Image before and after LoG filtering
The correct method for interpolation would be to convolve with a sinc function to
remove higher order harmonics that are introduced in the sampling process. In practice
the sinc function has a large support, which requires a large filter size and is therefore
computationally intensive. A reasonable approximation is to use a Gaussian filter for
interpolation. When combined with the LoG filter, this effectively produces an LoG
filter with a larger standard deviation (the new \sigma is \sqrt{\sigma_L^2 + \sigma_G^2}), while interpolating
the data as well.
In practice, for signals which have a cutoff frequency that is sufficiently less than
the Nyquist limit (which we can ensure by choosing our LoG filter coefficients care-
fully), bilinear interpolation has proven to be sufficient for estimating actual image
values at non-integer pixel locations. Bilinear interpolation also has the advantage of
being easy to implement efficiently, since it only involves the four neighboring pixels.
The gain in processing speed more than offsets the small loss in output quality.
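A minimal sketch of bilinear interpolation at a non-integer location (2×2 toy image):

```python
def bilinear(img, c, r):
    """Bilinear interpolation of an image at a non-integer location (c, r)."""
    c0, r0 = int(c), int(r)
    fc, fr = c - c0, r - r0
    # weighted sum of the four neighboring pixels
    return ((1-fc)*(1-fr)*img[r0][c0]   + fc*(1-fr)*img[r0][c0+1]
          + (1-fc)*fr    *img[r0+1][c0] + fc*fr    *img[r0+1][c0+1])

img = [[10.0, 20.0], [30.0, 40.0]]
assert bilinear(img, 0.5, 0.5) == 25.0    # center of the four pixels
assert bilinear(img, 0.25, 0.0) == 12.5   # along the top row
```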
Image rectification is the process of transforming an image so that it has particular
alignment properties (such as having the epipolar search directions aligned with the
scan lines of the image). Any such desired transformation can be represented as a
homography matrix, and projective image warping is then used to generate the trans-
formed image. Since the details of how this rectification is done are very different
between the stereo machine and the software implementation, the discussion of the
exact methods used is postponed until the chapter on implementation.
4.5. Stereo Matching
The previous sections discussed how we can compute corresponding pixels
between images that have high enough contrast to allow us to differentiate objects at
different distances from the camera. What remains to be done is to search through the
possible distances at each point and decide which one is best supported by the image
data.
Ideally, we would just be able to compare the pixel values for each possible dis-
tance, and choose the ones that match best. If we assume that image noise is roughly
Gaussian, then the best measure of the similarity of pixels is simply the squared differ-
ence between them. In practice, we find that outliers are actually much more likely
than a Gaussian model would predict. One large factor that causes this to be the case
for stereo matching is that the appearance of objects when viewed from different direc-
tions can be different. Two examples of when this occurs are specular reflections and
occluding edges.
Since a good statistical model of such outlier points would be difficult if not
impossible to construct, we are left with the problem of finding an error metric that is
less sensitive to outlier points while being practical to compute. One such operator is
the absolute value of the difference between pixel values.
Of course, since the pixel values are discrete integers between 0 and 255, the
chances are good that several different pixels will match equally well even if we have
the correct statistical model. With the addition of possible image noise, it becomes
likely that an incorrect disparity will match well. In order to compensate for this, we
must make some further assumptions about the scene that we are viewing. The sim-
plest assumption that we can make is that points in a small region of the image should
all match in roughly the same way. This assumption is violated at occluding edges, and
at points in the image with extreme slope compared to the reference plane. Methods
for dealing with the latter problem will be discussed in a later chapter.
The error metric for a particular pixel and disparity, for a single baseline (between
cameras 0 and 1) is then:
E_{01}(x,y,d) = \sum_{(i,j) \in W(x,y)} \left| I_0(i,j) - I_1(i,j,d) \right|    (4-1)

where I_1(i,j,d) is the appropriately interpolated value that matches I_0(i,j) at distance d, and W(x,y) represents a window of pixels around the image point (x,y).

One consideration is how to modify this metric for multiple baselines. The theoretically correct error metric assuming Gaussian noise would be to compute the variance of the set of image intensities in place of |I_0(i,j) − I_1(i,j,d)|. The metric corresponding to the absolute difference metric in the case of multiple baselines is:

\sum_{k=0}^{n} \left| n\, I_k(i,j,d) - \sum_{l=0}^{n} I_l(i,j,d) \right|    (4-2)

which is just the sum of the absolute differences from the mean (the variance would be the sum of the squared differences from the mean). Table 4-1 contains a list of different possible metrics and their computational cost. Since we are implementing this algorithm on modern computer hardware, multiply, add, and absolute value operations are assumed to be equivalent in cost.

Table 4-1: Possible Error Metrics (n is the number of cameras)

  absolute difference (variance):
      \sum_{k=0}^{n} \left| n\, I_k(i,j,d) - \sum_{l=0}^{n} I_l(i,j,d) \right|
      O(n), 5n − 2 operations (13 for three cameras); alternatively O(n²), (3/2)n² − (1/2)n − 1 operations (11 for three cameras)

  squared difference (variance):
      n \sum_{k=0}^{n} I_k(i,j,d)^2 - \left( \sum_{k=0}^{n} I_k(i,j,d) \right)^2
      O(n), 3n operations (9 for three cameras)

  absolute difference (reference):
      \sum_{k=1}^{n} \left| I_0(i,j) - I_k(i,j,d) \right|
      O(n), 3n − 4 operations (5 for three cameras)
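The two single-pass metrics that recur below can be sketched as follows (illustrative Python, not the thesis implementation); values holds the LoG-filtered pixel values seen by each camera for one candidate (pixel, disparity) pair:

```python
def abs_diff_reference(values):
    """'absolute difference (reference)': sum of |I_0 - I_k| over the
    non-reference cameras (camera 0 is the reference)."""
    ref = values[0]
    return sum(abs(ref - v) for v in values[1:])

def abs_diff_variance(values):
    """'absolute difference (variance)', equation (4-2): sum of
    |n*I_k - sum(I_l)|, i.e. n times the total absolute deviation
    from the mean, computed without any division."""
    n = len(values)
    total = sum(values)
    return sum(abs(n * v - total) for v in values)
```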
Since whatever metric we choose will be evaluated for each pixel at each search
distance, it is of critical importance that we choose a metric that can be evaluated with
as few operations as possible in order to achieve an implementation that runs quickly.
As is shown in the table, the metric from equation (4-2), although theoretically
correct, is one of the most computationally expensive. The use of variance as an error
metric is motivated by the assumption that the pixels from each of the camera views
are equivalent measurements of the same underlying property, and thus that their mean
is the best estimate of that quantity and their variance is the best estimate of similarity.
In the algorithm that we have described previously, one of the cameras (the refer-
ence camera, camera 0) is special. For each pixel of the image from that camera, we
perform a search over a set of possible distances, comparing that pixel to different pix-
els from the other cameras. If we instead make the assumption that the pixel in the ref-
erence camera has the correct value (instead of assuming that the mean is the correct
value), and we want to find the set of pixels in the other cameras that match it best, we
get the metrics that are marked as (reference) in the table. This metric, though not
Table 4-1: Possible Error Metrics (continued)

  squared difference (reference):
      \sum_{k=1}^{n} \left( I_0(i,j) - I_k(i,j,d) \right)^2
      O(n), 3n − 4 operations (5 for three cameras)

  all absolute differences:
      \sum_{k=0}^{n} \sum_{l \ne k} \left| I_k(i,j,d) - I_l(i,j,d) \right|
      O(n²), (3/2)n² − (3/2)n − 1 operations (8 for three cameras)

  special case for three cameras (x is the cost of a max(x,y) operation): 4x + 1 operations
being strictly correct, is a good compromise that is about twice as fast and produces good results.

One other metric worth mentioning is the one marked "all absolute differences" in the table. This is the metric that results if the (reference) metric is expanded so that no particular camera is special. Though this metric has no mathematical basis, it can be computed very efficiently, particularly for large numbers of cameras.

In general, we have used "absolute difference (reference)", though we have also experimented with "all absolute differences". The loss in accuracy caused was barely detectable, while the increase in performance was large.

4.6. Sub-pixel Interpolation

When more precision is required, there are two options in general. Either the step size of the stereo search can be made smaller, or sub-pixel interpolation can be applied to the results. Both methods are limited in the extent to which they can be applied; the amount of information contained in the images is limited by the resolution of the camera, the focal length of the lenses, and the longest baseline in the system. Changing the step size in general multiplies the running time of the algorithm by a constant (though adaptive schemes which do not do this can be imagined). Sub-pixel interpolation of results, on the other hand, is a constant-time operation that uses data that should already be available.

The idea behind sub-pixel interpolation is that the matching error should be a relatively smooth function, and therefore it makes sense to fit a smooth function to the error data near the minimum to more accurately determine where exactly it is. Since the function needs to be fit with a minimum of computation, a low-order polynomial (which can be fit with linear algebra) is a good choice. The lowest-order polynomial that has a minimum is a quadratic. Therefore, we fit a quadratic to a set of points near the minimum. At least three points are required, though it is possible to use more.
The linear equation that must be solved is:

\begin{bmatrix} d_0^2 & d_0 & 1 \\ \vdots & \vdots & \vdots \\ d_n^2 & d_n & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} E_0 \\ \vdots \\ E_n \end{bmatrix}    (4-3)

where the d_i are the disparities of the points near the minimum and the E_i are their corresponding matching errors. Since we are really interested in where the interpolated minimum is relative to the discrete minimum that we have already found, we can use (−1, 0, 1) or (−2, −1, 0, 1, 2) for the d_i and simply add the resulting offset to the discrete minimum. If we use the minimum and one point on either side, the equation simplifies to:

\begin{bmatrix} 1 & -1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} E_{-1} \\ E_0 \\ E_1 \end{bmatrix}    (4-4)

which can be solved by inverting the matrix:

\begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} \tfrac{1}{2} & -1 & \tfrac{1}{2} \\ -\tfrac{1}{2} & 0 & \tfrac{1}{2} \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} E_{-1} \\ E_0 \\ E_1 \end{bmatrix}    (4-5)

Since the minimum of the function E = a d^2 + b d + c is at d = -b / (2a), we can substitute and get:

d_{min} = -\frac{1}{2} \, \frac{E_1 - E_{-1}}{E_1 - 2E_0 + E_{-1}}    (4-6)

This does require a computationally expensive division operation, but depending on the hardware it might be a good trade-off versus doing extra search.

Empirical evidence suggests that sub-pixel interpolation can be used down to a
resolution of about one-fourth of an original image pixel. Below that point, even the
results for smooth, highly-textured surfaces seem to be more or less random.
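The interpolation of equation (4-6) costs only a handful of operations per pixel; a sketch (illustrative, with a guard for a flat error curve):

```python
def subpixel_offset(e_minus, e0, e_plus):
    """Offset of the quadratic minimum from the discrete minimum,
    per equation (4-6); e_minus, e0, e_plus are the matching errors
    at disparities d-1, d, d+1."""
    denom = e_plus - 2.0 * e0 + e_minus
    if denom == 0.0:          # flat error curve: no refinement possible
        return 0.0
    return -0.5 * (e_plus - e_minus) / denom
```

For a symmetric error curve the offset is zero, and for errors sampled from a true quadratic the exact minimum is recovered.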
Chapter 5
Implementation
This chapter describes two implementations of the multibaseline stereo algorithm.
The first section describes the hardware implementation used in the CMU Video-Rate
Multibaseline Stereo Machine. Although the research described in this thesis was per-
formed after the stereo machine had already been designed and built, there are several
reasons to include a discussion of it here: a) the existing documentation of the stereo
machine is somewhat sparse, not including several details relevant to the software
implementation, b) the algorithms used are directly based on the algorithm used by the
stereo machine, and c) several important differences between the implementations will
be discussed. The second section of this chapter discusses the software implementa-
tion of the stereo algorithm.
Several key insights are made in this chapter. Perhaps the most important insight
is that special rectification techniques, discussed in Section 5.3.3, can be used to allow
trinocular stereo to be computed efficiently. A detailed analysis of memory and cache
usage of three different implementations of the stereo main loop leads to a clear choice
which is supported by benchmark data. Additionally, an efficient method for perform-
ing the LoG filter and a means for determining the LoG filter coefficients are discussed in Section 5.3.2.
5.1. Related Work
During the last few years, several commercial stereo vision systems based on PC
hardware have appeared on the market (e.g. the SVM by SRI [Konolige 97] and Tri-
Clops by PointGrey Research [PointGrey 98]). Unfortunately, most of the innards of
these systems are proprietary and thus I can only speculate that these groups must have
done much of the same analysis that is presented in this chapter.
5.2. CMU Video-Rate Multibaseline Stereo Machine
The stereo machine consists of a number of custom-built 9U VME boards con-
nected in a system. The system is described in some detail in [Kanade et al. 96].
The algorithm used by the stereo machine (see Figure 5-2) works by first digitizing
the images from each of the cameras (up to 6 in the current design).
5.2.1. LoG Filter and Quantization
Each of these images is then passed through an 11x11 LoG filter which was imple-
mented in hardware by a pair of special-purpose 2D 8-bit convolution chips
(PDSP16488, made by GEC Plessey). Since the convolution hardware had a maxi-
mum mask size of 7x7, the filter was decomposed into a 7x7 Gaussian filter with a
standard deviation of one pixel followed by a 7x7 LoG filter, also with a standard
deviation of one pixel. This chained convolution is mathematically identical (modulo round-off errors) to an 11x11 LoG filter with a standard deviation of √2 pixels. The gain is controlled by a series of programmable multiply and shift operations, and a
final selection of 8 bits of the 16-bit output of the convolver chip.
This 8-bit output is then quantized down to 4 bits using another lookup table that is
part of the stereo machine hardware. The set of values that worked best in this lookup
table (and thus became the default) effectively just maps the range from -8 through 7
to 0 through 15 while saturating smaller values to 0 and larger values to 15. This is an
effective gain enhancement of a factor of 16, since the 8-bit range of the convolver
output has been reduced to 4 bits using only the low-order bits and discarding the
high-order bits.
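The default table behaves like the following sketch (illustrative):

```python
def quantize_4bit(log_value):
    """Quantize an 8-bit signed LoG output to 4 bits: the range
    -8..7 maps to 0..15, and values outside that range saturate."""
    return max(0, min(15, log_value + 8))
```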
5.2.2. Rectification (Geometry Compensation)
The 4-bit LoG-filtered data is then passed on to the geometry compensation unit. This unit is of particular importance because it performs a very general transformation on each of the input images, to rectify these images before performing the SAD computation which comes next. For each pixel in the reference image, a number of possible "distances" ζ from the camera are evaluated (see Figure 5-1). For each pixel at each distance, the offset from the pixel to the corresponding point in each of the other cameras is retrieved from a look-up table. The value stored in the lookup table consists of two 8-bit pixel offsets (for the column and row directions in the image), and two 4-bit fractions representing the fractional part of the desired location in the image in 1/16ths of a pixel.

Figure 5-1: Geometry compensation. For each pixel (i,j) of the base image and each distance ζ, an interpolated pixel is fetched from the inspection image, the absolute difference is taken, and the result is added to the values from the other camera pairs.
The (column, row) coordinates of the corresponding points are computed by taking
the 8-bit integer offset in each direction and adding it to the current pixel position (i,j).
Since the hardware that does pixel addressing uses 8 bit registers, the maximum image
size is limited to 256x256 pixels. To approximate the correct pixel intensity at the
desired location, a bilinear interpolation of the four nearest pixels is performed using
the fractional offsets retrieved from the lookup table.
Note that the lookup table can contain any values whatsoever, so it is possible to
correct for lens distortion, or to operate with one camera upside down, or to use cam-
eras with lenses of different focal lengths. The primary limitation of the geometry
compensation circuit is that each 4x4 pixel region of the base image must be offset by the
same amount, since there is only one lookup table entry for each 4x4 pixel region of
the image. This was done to keep the size of the lookup table from being too large to
implement. While this is not much of a limitation as long as the camera geometry is
close to that of a traditional stereo system, the extreme geometries that are dealt with
in this thesis are often problematic.
Calibration for the stereo machine consists entirely of computing the values to load into the lookup tables. These values can be computed directly from the homography matrices by the simple formula

\begin{bmatrix} a\, I(i,j,\zeta) \\ a\, J(i,j,\zeta) \\ a \end{bmatrix} = H_\zeta \begin{bmatrix} i \\ j \\ 1 \end{bmatrix}    (5-1)

and normalization by a to convert from homogeneous coordinates to 2D coordinates.
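Equation (5-1) amounts to pushing each pixel through H_ζ and splitting the normalized result into an integer offset and a 1/16th-pixel fraction; a sketch (illustrative Python; the stereo machine's actual packed table format is not reproduced here):

```python
def lookup_entry(H, i, j):
    """Apply a 3x3 homography H (list of rows) to pixel (i, j) per
    equation (5-1), then split each coordinate into an integer
    offset from (i, j) and a fraction in 1/16ths of a pixel."""
    x = H[0][0] * i + H[0][1] * j + H[0][2]
    y = H[1][0] * i + H[1][1] * j + H[1][2]
    a = H[2][0] * i + H[2][1] * j + H[2][2]
    I, J = x / a, y / a                  # normalize homogeneous coords
    di, dj = int(I) - i, int(J) - j      # integer pixel offsets
    fi = int(round((I - int(I)) * 16))   # 4-bit fraction (1/16 pixel)
    fj = int(round((J - int(J)) * 16))
    return di, dj, fi, fj
```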
5.2.3. Stereo Matching
In the next stage of the stereo machine,
the absolute value of difference (AD) is per-
formed pixel-by-pixel for the base camera
(camera #0) paired with each of the other
cameras. The results of the AD computation
are summed over all of the camera pairs,
resulting in a sum of absolute differences
(SAD) value for each pixel for each dispar-
ity level.
The resulting SAD values are then
smoothed by summing over a local window,
the size of which is programmable from 5x5
to 13x13. The result is called the SSAD. In
the final stage, for each pixel, the disparity
level with the minimum SSAD value is
found, and the SSAD values of the mini-
mum and its neighbors are sent to the C40
DSP processing board, where the disparity
levels can be interpolated for higher accuracy.
5.2.4. Stereo Machine Performance
The stereo machine processes images at a constant rate of roughly 30 million
pixel-disparities per second (counting pixels processed in camera #0), regardless of the
number of cameras in use. Thus the frame rate depends on the number of pixels processed and the number of disparities searched. When using the maximum values for each (256x240 image, 60 disparity levels searched), the frame rate is roughly 7.5 Hz.

Figure 5-2: Architecture of the CMU Stereo Machine. Images from the six-camera head pass through A/D and LoG frame grabbers, geometry compensation, SAD computation over the image pairs, windowed SSAD computation (vertical and horizontal sums), and a minimum finder, then on to an eight-processor C40 DSP array; a VxWorks real-time processor and a Sun workstation control the system over the VME bus and Ethernet.
5.3. Software Implementation
The software implementation uses almost the same algorithm, with a few minor
changes to adapt from processing in parallel hardware to processing serially in soft-
ware.
5.3.1. Multibaseline
As discussed in Section 4.2, there is a large benefit to using three cameras, and a
diminished benefit to the fourth and additional cameras. On the other hand, it turns out
that there are rectification methods that allow two-camera stereo matching to be implemented very efficiently in software.
method that allows a slightly less efficient implementation for three cameras. The
extension to four or more cameras is much more difficult, and requires a large increase
in computation.
Given that four or more cameras give diminishing returns for greatly increased
computational cost, we decided to concentrate on developing a fast trinocular stereo
system in software.
5.3.2. LoG Filter
A straightforward serial implementation of 2D convolution in software is very computationally expensive (it is O(pwh), where w and h are the width and height of the convolution template and p is the number of pixels in the image), so an alternative filtering operation is necessary. The standard optimization technique of splitting a 2D filter into two 1D filters does not apply, since the LoG filter is not separable. Some experimentation with different filters revealed that a 7x7 LoG filter with a standard deviation of one pixel works almost as well, with greatly reduced computational cost.
The CMU stereo machine uses a larger filter in part to compensate for image noise that
is introduced by the custom digitization hardware built into the machine. Since the
software implementation uses a commercial digitizer board, the filter size can be
reduced without perceptible loss in output quality.
The formula for an LoG filter is

L(x,y) = \frac{-1}{\pi\sigma^4} \left( 1 - \frac{x^2 + y^2}{2\sigma^2} \right) e^{-\frac{x^2 + y^2}{2\sigma^2}} = \frac{-1}{2\pi\sigma^6} (\sigma^2 - x^2)\, e^{-\frac{x^2}{2\sigma^2}} e^{-\frac{y^2}{2\sigma^2}} + \frac{-1}{2\pi\sigma^6} (\sigma^2 - y^2)\, e^{-\frac{x^2}{2\sigma^2}} e^{-\frac{y^2}{2\sigma^2}}    (5-2)
which is the sum of two separable filters. Thus a new algorithm consisting of four 1D
filters and a summation is possible. The complexity of the new algorithm is O(p(w+h)), which is significantly smaller than O(pwh). The actual number of necessary multiply-accumulate operations per pixel is reduced from 49 to 28 for a 7x7 filter (which would be reduced further to 14 if the filter were separable).
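The decomposition in equation (5-2) can be checked numerically: each of the two terms is a 1D polynomial-times-Gaussian in one direction multiplied by a plain Gaussian in the other, which is what permits the four-1D-filter implementation. A sketch (illustrative):

```python
import math

def log_2d(x, y, sigma):
    """Laplacian-of-Gaussian, direct 2D form of equation (5-2)."""
    r2 = x * x + y * y
    s2 = sigma * sigma
    return (-1.0 / (math.pi * s2 * s2)) * (1.0 - r2 / (2.0 * s2)) \
        * math.exp(-r2 / (2.0 * s2))

def log_separable_sum(x, y, sigma):
    """The same filter written as the sum of two separable terms:
    each term factors into a function of x times a function of y."""
    s2 = sigma * sigma
    c = -1.0 / (2.0 * math.pi * s2 ** 3)
    gx = math.exp(-x * x / (2.0 * s2))
    gy = math.exp(-y * y / (2.0 * s2))
    return c * (s2 - x * x) * gx * gy + c * (s2 - y * y) * gx * gy
```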
Another option that I have considered is to use a recursive filter such as those sug-
gested by Deriche [Deriche 90]. The recursive filter implementations have the advan-
tage that they take a constant number of operations independent of the size of σ.
Unfortunately, the constant in this case is 32 multiply-accumulate operations, which is
slightly larger than the case described above. If a larger σ value became necessary, this
method would become advantageous.
5.3.2.1. Determining LoG Filter Coefficients
In order to perform a 2D convolution on discrete data, continuous equations such
as equation (5-2) must be converted into discrete quantities. The values must be discrete in the spatial (row and column) domain. Each value in the convolution template must also be a discrete quantity. There are several reasons why we might want to limit the range of possible values for the filter coefficients:
• the hardware performing the convolution might have a limited range of possible coefficients (this is the case with the CMU stereo machine)
• the CPU that we are using might have special SIMD instructions that perform multiple multiply operations on small data types with one instruction
• we might want to store the accumulated results of the convolution in a small data type; the need to avoid overflow restricts the range of coefficients

Since the Intel MMX instructions allow us to perform up to four 16-bit operations per instruction, we need to keep the accumulated results as 16-bit quantities. In order to perform a convolution on 8-bit data under these circumstances, it is easy to see that the sum of the filter coefficients must be less than 2^8 = 256. The digital signal processing literature contains surprisingly little information about the optimal method for choosing filter coefficients when the range of possible values is severely limited, as in this case.

The most straightforward manner in which to compute the coefficients would be to simply evaluate the function at each discrete point, scaled so that the largest value of the filter function maps to the largest representable value (thus guaranteeing that as much precision as possible is retained), and then round off the result:

c_i = \mathrm{rnd}(f(i) \cdot scale)    (5-3)

There are three main problems with this approach:
• the resulting coefficients are not guaranteed to sum to zero, which was one of the selling points of the LoG filter in eliminating camera bias
• it is possible that some other scale factors might produce a set of coefficients that
is closer to the actual true values
• a division by the scale factor is required at the end of the convolution operation; division operations are expensive, so we would like to convert this into a binary shift operation instead

The values of the set of filter coefficients are determined by a single parameter, the scale factor. Since we would like for the final division of the convolution to be implementable as a binary shift operation, we are actually interested in scale factors of the form 2^j/n, where j is the maximum feasible amount that we can shift the final result, and n is an integer with 0 < n ≤ 2^j.

Although some clever method might reduce the search space of this problem, the size of the problem is small enough that we can solve it by brute force, trying every possible value of n.

The algorithm used to find filter coefficients is thus to try every possible value of n, searching for values for which the resulting coefficients sum to zero. Since many such solutions exist, we search among these solutions for the one for which the error

\sum_i \left( c_i - \frac{2^j}{n} f(i) \right)^2    (5-4)

is minimized. This ensures that the coefficients sum to zero, are as close as possible to representing the original function, and that the convolution can be computed efficiently without any division operations.
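The search can be sketched as follows (illustrative Python; samples holds the continuous filter function f evaluated at the template positions, and the real search would also respect the accumulator-overflow bound discussed above):

```python
def find_coefficients(samples, j):
    """Try every scale factor of the form 2**j / n (equation 5-3),
    keep integer coefficient sets that sum to zero, and return the
    (coefficients, error, n) triple minimizing equation (5-4)."""
    best = None
    for n in range(1, 2 ** j + 1):
        scale = 2.0 ** j / n
        coeffs = [round(f * scale) for f in samples]
        if sum(coeffs) != 0:
            continue  # coefficients must sum to zero to cancel camera bias
        err = sum((c - f * scale) ** 2 for c, f in zip(coeffs, samples))
        if best is None or err < best[1]:
            best = (coeffs, err, n)
    return best
```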
5.3.3. Rectification and Stereo Matching

In order to efficiently implement stereo search on modern processors, both the computation and the data access patterns of the algorithm must be very regular. In the best case, the data would be accessed sequentially, and accesses to data that must be accessed more than once would be clustered so that the data cache of the processor can be
effective. The computation should be as regular as possible (avoiding branches caused
by if-then type constructs) since frequent branches are very inefficient on modern pro-
cessors. Note that none of these issues apply to the hardware implementation in the
CMU stereo machine, since it performs computations at a constant rate, and all mem-
ory accesses occur in one cycle.
5.3.3.1. The stereo matching main loop
The stereo search is fundamentally three-dimensional, since the image has two
dimensions, and we are searching in the third dimension of depth. Since this implies
that there will be three nested loops in the algorithm (over the (c,r) pixel coordinates
and the disparity d), one question that arises is, in what order should the computation
be performed? If the goal is to tailor the rectification process so that the matching can
be performed as quickly as possible, then we must consider whether the ordering of
the computation has an effect on the execution speed of the final program. The main
loop of the stereo matching algorithm can be written in pseudo-code as follows:

    for (outer-loop) {
      for (middle-loop) {
        for (inner-loop) {
          SAD(c,r,d) = MATCHING_ERROR(c,r,d);
          HORIZONTAL_SUM(c,r,d) = HORIZONTAL_SUM(c-1,r,d) + SAD(c,r,d)
                                  - SAD(c-WINDOW_WIDTH,r,d);
          VERTICAL_SUM(c,r,d) = VERTICAL_SUM(c,r-1,d) + HORIZONTAL_SUM(c,r,d)
                                - HORIZONTAL_SUM(c,r-WINDOW_HEIGHT,d);
          if (VERTICAL_SUM(c,r,d) < MIN_SSAD(c,r)) {
            MIN_SSAD(c,r) = VERTICAL_SUM(c,r,d);
            RESULT_IMAGE(c,r) = d;
          }
        }
      }
    }

The general idea is that we will compute the matching error (SAD) for each pixel, and then add up the errors over a small window centered at each pixel to produce the SSAD. Instead of performing the window summation at each pixel as we get to it, it is more efficient to maintain a "horizontal sum" as we move across the image. If we have

    HORIZONTAL_SUM(c,r,d) = SAD(c-WINDOW_WIDTH+1,r,d) + ... + SAD(c,r,d)

then we can compute it via the recurrence shown in the pseudo-code. Similarly, we can add up the horizontal sums to get the value of the summation over a window. An important aspect of this algorithm is that the running time does not depend on the window size.

Aside from the images themselves, the algorithm uses four arrays to hold intermediate values:
• SAD(c,r,d) is the metric error between pixels for the pixel at (c,r) and disparity d
• HORIZONTAL_SUM(c,r,d) holds the "horizontal sums" described above
• VERTICAL_SUM(c,r,d) holds the vertical sums
• MIN_SSAD(c,r) holds the minimum value of the SSAD for each pixel
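The horizontal-sum recurrence can be sketched in isolation (illustrative); note that each output value costs one add and one subtract regardless of the window width:

```python
def windowed_sums(sad_row, window_width):
    """Sum SAD values over a sliding horizontal window using the
    running-sum recurrence: add the entering pixel, subtract the
    leaving one, and emit a sum once the window is fully inside."""
    sums = []
    running = 0
    for c, value in enumerate(sad_row):
        running += value                          # entering pixel
        if c >= window_width:
            running -= sad_row[c - window_width]  # leaving pixel
        if c >= window_width - 1:
            sums.append(running)
    return sums
```

The same recurrence applied down the columns of the horizontal sums yields the full SSAD in constant time per pixel.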
The remainder of this section will refer to this main loop pseudo-code, as we describe different high-level algorithmic choices in an attempt to find the fastest possible general implementation of trinocular stereo matching.

In particular, we will discuss in what order the three nested loops should be performed. Since images are usually arranged in row-major order in memory, and since rows and columns are otherwise mathematically equivalent, we will only consider possible orderings in which the loop over rows is outside the loop over columns. This leaves us with three possible loop orderings (from outermost to innermost): (r,c,d), (r,d,c), and (d,r,c).

5.3.3.2. Rectification

Looking at the stereo main loop, we see that MATCHING_ERROR(c,r,d) will be called for every possible value of c, r, and d, which is in total NUM_C*NUM_R*NUM_D iterations. The simplest implementation would be, for each pixel (c,r) and disparity d, to compute the coordinates of the corresponding point in each of the images, and then perform bilinear interpolation on the closest pixels to find the correct value. The error metric can then be computed using this value. This method requires that we perform a separate coordinate computation and interpolation for each iteration of the loop, both of which are relatively expensive operations.
For the SAD metric discussed earlier,

    MATCHING_ERROR(c,r,d) = ABS(IM1(c,r,d) - IM0(c,r))
                          + ABS(IM2(c,r,d) - IM0(c,r));
where IMn(c,r,d) refers to the coordinates in image n of the point corresponding
to the point (c,r) in IM0, at disparity d.
The primary goal of image rectification is to re-sample the images in such a way
that we only have to do NUM_C*NUM_R interpolations, thus saving a factor of
NUM_D. A secondary goal is to arrange the memory accesses to the images such that
we can take advantage of the cache architecture of the CPU.
The first step in image rectification is to warp the input images such that
IM1(c,r,d) and IM2(c,r,d) always have integer pixel coordinates, so that we
never have to do interpolation after the warping has been completed.
Let us once again return to the basic equation, equation (3-18):

\frac{z_b}{z} \begin{bmatrix} c_b \\ r_b \\ 1 \end{bmatrix} = H_b \begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + d\, \frac{e_b}{s}    (5-5)

In order for (c_b, r_b) to be integers for any integer (c,r) and d, we need to do two things:
• change the images so that when d is zero, (c_b, r_b) falls on an integer boundary
• make e_b/s become an integer offset

We can accomplish the first goal by simply warping images 1 and 2 by their respective homography matrices. The relationship between the coordinates in the image before warping (unprimed) and the coordinates after warping (primed) is:
\begin{bmatrix} c_b' \\ r_b' \\ 1 \end{bmatrix} = H_b^{-1} \begin{bmatrix} c_b \\ r_b \\ 1 \end{bmatrix}    (5-6)
An example of what the images look like before and after warping by H, with H being
a homography for the ground plane, is shown in Figure 5-3 and Figure 5-4.
The second goal is a little bit more difficult. We would like to warp the images such that the epipolar direction e_b/s is an integer offset in the image coordinates. Since the images are aligned already, any transformation that is applied should preserve the alignment.

Figure 5-3: Original (LoG-filtered) images
As a first step, we can warp the image such that the epipolar directions correspond
to the rows and columns of the image. If we call this new warping function M, we get
an equation like this:
\frac{1}{s}\, M \begin{bmatrix} H_1^{-1} e_1 & H_2^{-1} e_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}    (5-7)

Figure 5-4: Images after warping by homography for the ground plane

The H^{-1} terms occur because we have already warped the images by the H matrices, and thus the epipolar directions have been changed. This equation is underdetermined. It seems natural to map the remaining perpendicular directions to each other:
\frac{1}{s}\, M \begin{bmatrix} H_1^{-1} e_1 & H_2^{-1} e_2 & H_1^{-1} e_1 \times H_2^{-1} e_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}    (5-8)

and then the equation can be solved for M as follows:

M = s \begin{bmatrix} H_1^{-1} e_1 & H_2^{-1} e_2 & H_1^{-1} e_1 \times H_2^{-1} e_2 \end{bmatrix}^{-1}    (5-9)

Now we define the following three matrices:

W_0 = M, \quad W_1 = M H_1^{-1}, \quad W_2 = M H_2^{-1}    (5-10)

Each W matrix is then used to warp its respective image. After this transformation, points that are located on the plane corresponding to H_1 and H_2 will be located at the same pixel coordinate in all three images. Search along the epipolar direction is accomplished by moving along a row or column of the image. The results of warping the images of Figure 5-3 by the W matrices are shown in Figure 5-5.

Intuitively, the H matrices warp the images to appear as if they were taken from virtual cameras whose image planes were all parallel to each other. They also shift the images along their epipolar lines so that points on the corresponding plane appear to have zero disparity. The M matrix of equation (5-9) then further rotates the virtual cameras so that the image planes are all parallel to the plane defined by the foci of the three cameras. At the same time, the images are warped to a new coordinate system so that the x-axis is in the direction of one of the baselines, and the y-axis is in the direction of the other. Note that this means that this method will not work well if the cameras are nearly colinear.

As discussed in Section 3.2.5, the relative scales of e_1 and e_2 are determined by the relative geometry of the three cameras. In general we choose s so that the search
Figure 5-5: Images after warping to align epipoles with image rows and columns. The image size is roughly 1900x2800; original images were 640x240. After warping, the epipolar direction is vertical in the second image, and horizontal in the third.
step along the epipolar direction corresponding to the longest baseline is about one
pixel. Since the other search direction is shorter, this causes the warping function to
expand the image (since it maps the epipolar step in the original image to a one-pixel
step size in the warped image). This means that the size of the image may increase
greatly after warping (in general, it will increase by the ratio of the lengths of the base-
lines). Additionally, the resulting images will be skewed if the original baselines were
not orthogonal.
An increase in the size of the images is undesirable since the main loop of the ste-
reo algorithm loops over all of the pixels. This is in addition to the extra overhead for
generating larger images during the warping process. Even if we were to regularly
subsample the expanded image, the fact that the pixels are no longer adjacent would
cause reduced performance. The combination of these effects would cancel any
increase in performance gained from the rectification.
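Equations (5-9) and (5-10) reduce to building one 3x3 matrix and inverting it; a sketch using numpy (illustrative; H1, H2, e1, e2 are assumed to come from the calibration described earlier):

```python
import numpy as np

def warping_matrices(H1, H2, e1, e2, s=1.0):
    """Build M per equation (5-9) and the W matrices of equation
    (5-10) from the plane homographies H1, H2 and epipoles e1, e2."""
    d1 = np.linalg.inv(H1) @ e1          # epipolar direction, image 1
    d2 = np.linalg.inv(H2) @ e2          # epipolar direction, image 2
    A = np.column_stack([d1, d2, np.cross(d1, d2)])
    M = s * np.linalg.inv(A)             # (1/s) M [d1 d2 d1xd2] = I
    W0 = M
    W1 = M @ np.linalg.inv(H1)
    W2 = M @ np.linalg.inv(H2)
    return W0, W1, W2
```

By construction M satisfies equation (5-8), so the two epipolar directions are mapped onto the image axes.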
Because of the disadvantages of increased image size, we want to apply one fur-
ther set of warping functions to reduce the size back down to something near the orig-
inal image size. Such functions can be implemented by applying a further set of
warping matrices Li:
$$W_0 = L_0 M, \qquad W_1 = L_1 M H_1^{-1}, \qquad W_2 = L_2 M H_2^{-1} \qquad \text{(5-11)}$$

These matrices Li should be derived to satisfy the following constraints:

1. $L_1L_0^{-1}$ and $L_2L_0^{-1}$ are matrices consisting entirely of integer components. This ensures that each pixel in the new image 0 will appear at an exact integer coordinate in the other images. This allows the stereo matching to be done entirely without interpolation.

2. $L_0M$ should cause as little distortion as possible. This can be accomplished by ensuring that it is near the identity matrix.
3. $L_1 (1, 0, 0)^T = (1, 0, 0)^T$. This ensures that the epipolar line for the longest baseline is aligned with the scan lines. We can do this because the orientation of the resulting images is not important.

4. One last constraint that depends on the loop ordering. The idea here is to ensure that pixels that will be accessed sequentially will be in sequential memory locations, thus taking maximal advantage of the caching hardware in the computer.

Usually, we will set $L_0 = L_1$ (where the baseline between cameras 0 and 1 is the longest), which satisfies half of the first constraint. This combined with constraint #3 yields that IM1(c+1,r,d) = IM1(c,r,d+1).
The next three sections will discuss the optimal rectification strategies for the three
different loop orderings. Each strategy will produce a solution with a set of unknown
parameters. The method for finding an optimal set of these parameters will be dis-
cussed later.
5.3.3.3. Rectification strategy for the (r,d,c) ordering
Since the innermost loop is over c, we want IMn(c+1,r,d) to be located directly to the right of IMn(c,r,d). This is already true for IM1 if we set $L_0 = L_1$. We add two further constraints:

• $L_2L_0^{-1}\,(1, 0, 0)^T = (1, 0, 0)^T$: this causes IM2(c+1,r,d) = IM2(c,r,d) + (1,0).

• $L_2\,(0, 1, 0)^T = (a, b, 0)^T$, where a and b are integers. This ensures that IM2(c,r,d) is at an integer location in the image for all d.

The combination of all of the constraints gives us that:

$$L_2 = \begin{bmatrix} 1 & a & 0 \\ 0 & b & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{(5-12)}$$

and

$$L_2L_0^{-1} = \begin{bmatrix} 1 & e & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{(5-13)}$$

$$L_0 = \left(L_2L_0^{-1}\right)^{-1} L_2 \qquad \text{(5-14)}$$

where e and f must also be integers. This gives us that

$$L_0 = L_1 = \begin{bmatrix} 1 & a - \frac{eb}{f} & 0 \\ 0 & \frac{b}{f} & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{(5-15)}$$

After warping by these matrices, a point (c,r) in IM0 corresponds to the point (c,r) in IM1, and to the point (c+ar, br) in IM2. To move up one disparity level in IM1, we move one pixel to the right, to (c+1,r). To move up a disparity level in IM2, we add (a,b) to get (c+ar+a, br+b). Thus IM2(c,r,d+1) = IM2(c,r+1,d).

5.3.3.4. Rectification strategy for the (r,c,d) ordering

Since the innermost loop is over d, we want IMn(c,r,d+1) to be located directly to the right of IMn(c,r,d). This is already true for IM1 if we set $L_0 = L_1$. We introduce one further constraint:

• $L_2\,(0, 1, 0)^T = (1, 0, 0)^T$: this causes IM2(c,r,d+1) to be to the right of IM2(c,r,d).

We know that:

$$L_2L_0^{-1} = \begin{bmatrix} a & e & 0 \\ b & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{(5-16)}$$

for some integers a, b, e, and f. This implies

$$L_2\begin{pmatrix}1\\0\\0\end{pmatrix} = \begin{bmatrix} a & e & 0 \\ b & f & 0 \\ 0 & 0 & 1 \end{bmatrix} L_1 \begin{pmatrix}1\\0\\0\end{pmatrix} = \begin{bmatrix} a & e & 0 \\ b & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{pmatrix}1\\0\\0\end{pmatrix} = \begin{pmatrix}a\\b\\0\end{pmatrix} \qquad \text{(5-17)}$$

thus we have that:

$$L_2 = \begin{bmatrix} a & 1 & 0 \\ b & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{(5-18)}$$

solving for $L_1$ yields

$$L_1 = \begin{bmatrix} 1 & \frac{f}{af - be} & 0 \\ 0 & -\frac{b}{af - be} & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{(5-19)}$$

After warping by these matrices, a point (c,r) in IM0 corresponds to the point (c,r) in IM1, and to the point (ac+er, bc+fr) in IM2. To move up one disparity level in either IM1 or IM2, we simply move one pixel to the right, to (c+1,r) or (ac+er+1, bc+fr) respectively.
5.3.3.5. Rectification strategy for the (d,r,c) ordering

Since the innermost loop is over c for this ordering, we could in theory use the same rectification methods as for the (r,d,c) ordering, but there is a better method.

Given that $L_2L_0^{-1}$ consists only of integers, it follows that IM2 will be larger than (or the same size as) IM0. Suppose that

$$L_2L_0^{-1} = \begin{bmatrix} a & e & 0 \\ b & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{(5-20)}$$

and suppose that we know that the epipolar step is (∆c,∆r). If we then look at the coordinates of the point $(c + e\Delta r - f\Delta c,\; r + b\Delta c - a\Delta r)$, we get:

$$\begin{pmatrix} a(c + e\Delta r - f\Delta c) + e(r + b\Delta c - a\Delta r) \\ b(c + e\Delta r - f\Delta c) + f(r + b\Delta c - a\Delta r) \end{pmatrix} = \begin{pmatrix} ac + er + (eb - af)\Delta c \\ bc + fr + (eb - af)\Delta r \end{pmatrix} \qquad \text{(5-21)}$$

which are the coordinates in IM2 of the point (c,r) of IM0, at disparity (eb−af). This implies that if we hold d constant and loop over r and c, the set of pixels in IM2 that will be accessed is the same as the set for $d' = d + n(eb - af)$ for any integer n.

Thus, a better method of rectifying the images is to compute (eb−af) separate images, using the same rectification matrices that we computed for the (r,d,c) method.

5.3.3.6. Computing the Parameters

Each of the above three methods relies on computing a set of four integers (a,b,e,f) that satisfy the remaining constraint that $L_0$ should come as close as possible to inverting M (thus causing $L_0M$ to be near the identity matrix), while satisfying the constraint that (a,b,e,f) must be integers. There are several reasons for doing this:

• we would like to keep the number of pixels in IM0 approximately the same as the number of pixels in the original image, the idea being not to arbitrarily increase or decrease the resolution of the resulting depth image

• since the resulting depth image will be computed in the same coordinate frame as that of IM0, it is best to have that coordinate system be as close to the original camera coordinates as possible

Since our goal is to compute their values, it is helpful to note that the range of each of these integers (a,b,e,f) is limited for a number of reasons:

• the parameters in $L_0$ are limited by the fact that we are trying to invert M

• the parameters in $L_2$ are limited by the fact that we do not want to consider solutions that cause the size of IM2 to grow very large, since some computation is required to generate each pixel of IM2.

• we would like for the quantity (eb−af) to be small, since it represents the number of pixels in IM2 for each pixel in IM0.

We have developed a program that, given ranges of possible values for (a,b,e,f), tries all possible combinations, looking for a set that has the following properties:

1. $L_0$ and $L_2$ are both invertible

2. the sum of squared difference between the elements of $L_0M$ and the elements of the identity matrix is minimized

3. the size of the resulting IM2 is as small as possible

While some other, better metrics for determining these parameters might exist, the method just described has been sufficient for our needs.

An example of what the images look like after optimization for the (r,c,d) case is shown in Figure 5-6.
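The exhaustive search just described might be sketched as follows. The search ranges, the 3x3 representation of M, and the restriction to properties #1 and #2 (property #3, the size of IM2, is omitted for brevity) are simplifying assumptions:

```c
/* Exhaustive search over integer parameters (a,b,e,f), scoring each
   combination by how close L0*M is to the identity (property #2) and
   skipping singular cases (property #1).  L0 is built from (5-19) for
   the (r,c,d) ordering; M and the range limit are caller-supplied. */

double l0_identity_score(const double l0[3][3], const double m[3][3]) {
    double s = 0.0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            double p = 0.0;
            for (int k = 0; k < 3; k++) p += l0[i][k] * m[k][j];
            double target = (i == j) ? 1.0 : 0.0;
            s += (p - target) * (p - target);
        }
    return s;
}

/* Try all (a,b,e,f) in [-lim,lim]; returns best score, fills best[4]. */
double search_params(const double m[3][3], int lim, int best[4]) {
    double best_score = 1e30;
    for (int a = -lim; a <= lim; a++)
      for (int b = -lim; b <= lim; b++)
        for (int e = -lim; e <= lim; e++)
          for (int f = -lim; f <= lim; f++) {
              if (a * f - b * e == 0) continue;   /* property #1 */
              double det = (double)(a * f - b * e);
              double l0[3][3] = {{1, f / det, 0},
                                 {0, -b / det, 0},
                                 {0, 0, 1}};      /* (5-19), with L0 = L1 */
              double s = l0_identity_score(l0, m);
              if (s < best_score) {
                  best_score = s;
                  best[0] = a; best[1] = b; best[2] = e; best[3] = f;
              }
          }
    return best_score;
}
```

For an M that is already near the identity, the search should return a score near zero; real use would add the IM2-size term to the score.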
5.3.3.7. Memory Use in Stereo Matching
For a particular camera geometry, it is easy to see that the above methods should each find the same optimal value for $L_0$. Let us call the resulting width and height of IM0 after warping NUM_C and NUM_R respectively. It is then easy to show that IM2 will have (eb−af)*NUM_C*NUM_R pixels after warping.

Let us now examine the memory access patterns of the three different possible loop orderings. After accessing a pixel for the first time, some number of loop iterations will occur before that pixel is used again. During each of the intervening iterations, a different pixel from the same image will be accessed. Thus, in order for the pixel to still be in the cache when it gets accessed again, all of the intervening pixels must also be in the cache (assuming a least-recently-used cache replacement strategy).

Table 5-1 shows the number of accesses to image memory that are needed before returning to the same pixel location, for each possible loop ordering. Note that the number of intervening pixels in IM2 depends on the exact warping parameters that are found for each optimization method. The example that is referred to in the table is for the actual camera geometry used on our vehicle, which uses 640x240 images with 256 disparity levels. For this case, the value of (eb−af) is 11.

Table 5-1: Cache size necessary for fastest possible image access

Loop order      Accesses before returning to the same pixel                          Example (640x240
(outer,mid,in)  IM0            IM1              IM2                                  image, 256 disp.)
r c d           0              NUM_D - 1        b*NUM_C*NUM_D - f*NUM_D - eb + fa    165,386
r d c           NUM_C          NUM_C - 1        b*NUM_C*NUM_D - f*NUM_D - eb + fa    162,308
d r c           NUM_C*NUM_R    NUM_C*NUM_R - 1  NUM_C*NUM_R - b*NUM_C + eb - fa      460,154

Figure 5-6: Final images after warping by the optimal matrices for the (r,c,d) case. The center image (for camera #1) has been reduced by 50% to make it fit on the page. After warping, epipolar lines in both the second and third images correspond to scan lines.

Of the rectification parameters, the value b shown in Table 5-1 has the most influence on the necessary cache size, so it is worth examining a little further. Since it multiplies large coefficients, we would like for b to be as small as possible. For all three rectification methods, b must be nonzero (this is required to keep the rectification matrices from becoming singular), but the constraints that are applied to keep the image sizes small cause b to tend toward small values, and it almost always has the value 1.

In addition to the memory access patterns for image data, we must also consider the access patterns for variables used in storing intermediate results. Storing each of these arrays in their entirety, for all values of c, r, and d, is not practical due to the sheer size of the data. Even a small image and a small number of disparities (256x256, 32 disparities, 16 bits per storage element) would take almost thirteen megabytes of storage. Even if we have an amortized memory throughput of 640 MB/sec, the current top of the line at this writing, the memory accesses alone would take about 20 ms (one actual memory access to each location), making frame rate (33 ms) very difficult to achieve. Since we want to deal with larger images and much larger search ranges, and actually do some processing on the data, the problem is very difficult.

Each of the intermediate values in the algorithm is written exactly once, and then read exactly once some time later. Instead of storing all of the intermediate results for each possible c, r, and d, we instead store the values only until they are needed again. After a value has been used, the location that it was stored in can be "recycled" for storing the next value. Table 5-2 shows the minimum necessary size for each of the intermediate variables, for each of the three possible loop orderings. The intermediate values are each assumed to require 16 bits, except for MIN_SSAD, which requires 24 (16 for the minimum SSAD value and 8 for the location of the minimum).
Table 5-2: Intermediate Storage Required for Stereo Main Loop

Loop order  SAD size            HORIZONTAL_SUM size        VERTICAL_SUM size  MIN_SSAD size  Example (640x240 image,
                                                                                            11x11 filter, 256 disp.)
r c d       WINDOW_WIDTH*NUM_D  WINDOW_HEIGHT*NUM_C*NUM_D  NUM_C*NUM_D        1              3,937,795
r d c       WINDOW_WIDTH        WINDOW_HEIGHT*NUM_C*NUM_D  NUM_C*NUM_D        NUM_C          3,688,342
d r c       WINDOW_WIDTH        WINDOW_HEIGHT*NUM_C        NUM_C              NUM_R*NUM_C    476,182

The "recycling" of memory locations is actually very important. By writing to a memory location that is the same as the location that we just read from, we ensure that this location is in the cache, and thus the store operation will occur very quickly.

Table 5-3 shows the total necessary cache size for all memory accesses to be cached optimally. The value of b is assumed to be 1, and constant terms that do not contain at least one of NUM_C, NUM_R, or NUM_D have been dropped.

Table 5-3: Total cache size needed

Loop order  Total necessary cache size                                            Example
r c d       (2*WINDOW_HEIGHT + 3)*NUM_C*NUM_D + (2*WINDOW_WIDTH - f + 1)*NUM_D    4,103,181
r d c       (2*WINDOW_HEIGHT + 3)*NUM_C*NUM_D + 6*NUM_C - f*NUM_D                 3,850,650
d r c       6*NUM_C*NUM_R + (WINDOW_HEIGHT + 1)*NUM_C                             936,336

It is clear from the table that the (d,r,c) case is rather dramatically superior to the other possible loop orderings (which are roughly equivalent) in terms of the amount of cache required to attain optimal performance. In order to achieve the best performance, a slightly modified stereo main loop is required. With this modification, the outer loop does not walk through the values of d sequentially. Instead, it increments in steps of (eb−af). This modification (which I have just discovered as I write this) cuts down on the necessary cache size from (eb−af+5)*NUM_C*NUM_R to just 6*NUM_C*NUM_R.
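The totals in Table 5-3 are simple enough to compute mechanically, as in this sketch. Note that plugging the raw 640x240 image dimensions into these formulas does not reproduce the example column exactly, presumably because the table uses the post-warping values of NUM_C and NUM_R and the fitted value of f:

```c
/* Working-set sizes from Table 5-3 (b assumed to be 1, constant terms
   dropped, as in the table).  Units are storage elements.  Argument names:
   nc = NUM_C, nr = NUM_R, nd = NUM_D, wh = WINDOW_HEIGHT,
   ww = WINDOW_WIDTH, f = rectification parameter f. */

long cache_rcd(long nc, long nd, long wh, long ww, long f) {
    return (2 * wh + 3) * nc * nd + (2 * ww - f + 1) * nd;
}

long cache_rdc(long nc, long nd, long wh, long f) {
    return (2 * wh + 3) * nc * nd + 6 * nc - f * nd;
}

long cache_drc(long nc, long nr, long wh) {
    return 6 * nc * nr + (wh + 1) * nc;   /* no NUM_D dependence at all */
}
```

The (d,r,c) total is the only one with no NUM_D term, which is exactly why it scales so much better as the disparity search range grows.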
I have implemented all three possible loop orderings in C code. The ease and
efficiency of implementation of the (r,c,d) loop caused me to spend the most time
optimizing it (in assembly language) for the Pentium II processors that we have been
using in this project. I have since discovered the results presented in this section, so it
might have been better to spend some time optimizing the (d,r,c) case. It is the
optimized (r,c,d) code that has been used to implement the near-real-time system that
runs on the vehicle.
Most modern processors have small L1 data caches on-chip (16K for the Pentium
II) and larger L2 unified caches off-chip (typically 512K for the Pentium II). In order
to achieve maximum performance, all of the data should at least fit into the L2 cache.
Ideally, it would all fit in the L1 cache. Since the data that we have accounted for will
not be the only items in the cache, we should really aim to use only half of the L2
cache, leaving room for other data and code. Since the L1 cache is often separated into
separate data and instruction caches, we can plan to use more of it if our target is for all
of the data to fit in the L1 cache.
In order to make all of the data fit into a particular given cache size, we can
decrease NUM_D, NUM_C, and NUM_R. We then have to call our main loop several
times in order to cover all of the pixels and disparity levels that we originally intended
to process. A fair amount of bookkeeping also needs to be done to avoid having to
repeat some computation when making these extra calls to the main loop.
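The blocking scheme just described might look like the following sketch, where stereo_block-style callbacks stand in for the real main loop, the tile sizes are illustrative, and the border bookkeeping mentioned above is omitted:

```c
/* Tile the (column, disparity) iteration space so that each call to the
   main-loop callback touches a working set small enough for the target
   cache.  count_cells is a stand-in "main loop" that just counts the
   cells it was asked to process, so coverage can be checked. */

typedef void (*block_fn)(int c0, int c1, int d0, int d1, long *acc);

void count_cells(int c0, int c1, int d0, int d1, long *acc) {
    *acc += (long)(c1 - c0) * (d1 - d0);
}

long run_tiled(int num_c, int num_d, int tile_c, int tile_d, block_fn fn) {
    long acc = 0;
    for (int c = 0; c < num_c; c += tile_c)
        for (int d = 0; d < num_d; d += tile_d) {
            int c1 = (c + tile_c < num_c) ? c + tile_c : num_c;
            int d1 = (d + tile_d < num_d) ? d + tile_d : num_d;
            /* The real system must also avoid recomputing filter sums
               shared across tile borders; that bookkeeping is omitted. */
            fn(c, c1, d, d1, &acc);
        }
    return acc;
}
```

With the factor-of-16 reduction discussed below (e.g. 160-pixel-wide tiles at 64 disparities), every pixel/disparity pair is still visited exactly once.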
In order to reduce the data size to fit in the L1 cache, the (r,c,d) and (r,d,c) loop
orderings each require a reduction in the data size of about a factor of 256. In order to
fit in the L2 cache, a factor of 16 is necessary. In order to achieve a reduction of 256 in
the example case, we would have to cut both the width of the image and the number of
disparities by a large factor. If we evenly distribute the cuts, this means processing 16
disparity levels on an image that is just 40 pixels wide. With such a small number of iterations, the constant loop overhead factors become very prominent, so that any gains from avoiding cache misses are overwhelmed by the overhead. A factor of 16 is much more reasonable, processing 160 pixel wide images with 64 disparity levels at once. Since the (d,r,c) loop ordering only requires a reduction of around a factor of 64 for the L1 cache and a factor of 4 for the L2 cache, this case is even easier to optimize.
5.3.3.8. Benchmarks for the (r,c,d) case
Table 5-4 shows the results of testing the (r,c,d) loop ordering for various image widths and numbers of disparities searched. The first number is the number of iterations of the loop body that are executed per second (in millions). For reference, the CMU stereo machine performs at a constant rate of 30. The second entry in each table cell (in parentheses) shows the necessary cache size as computed by the formula in Table 5-3. Note that the table shows a "sweet spot" when the image width is around 40 or 80 and the number of disparities is approximately 128 or 192. In this region, the necessary cache size is significantly less than 512K, the actual amount of L2 cache installed in the machine. As the number of iterations in either loop gets small, the constant time loop overhead becomes similar to the actual time spent inside the loop, and performance no longer scales well. In particular, since the innermost loop has been unrolled to handle 8 disparity levels at once, a search over 32 disparities only takes 4 loop iterations. Thus it is not surprising that the performance is low when only 32 or 64 disparity levels are searched.

Table 5-4: Performance of the (r,c,d) algorithm (higher is faster)

             Number of disparities searched
Image width  32           64             96             128            192            256
640          14.7 (513K)  16.6 (1,025K)  17.3 (1,538K)  17.3 (2,050K)  16.6 (3,140K)  16.1 (4,101K)
320          15.3 (257K)  18.9 (513K)    19.0 (770K)    19.1 (1,026K)  18.8 (1,539K)  17.7 (2,053K)
160          16.1 (129K)  21.2 (257K)    20.4 (386K)    20.9 (514K)    20.1 (771K)    18.9 (1,029K)
80           15.4 (65K)   20.7 (129K)    23.5 (194K)    23.8 (258K)    23.4 (387K)    21.7 (517K)
40           13.7 (33K)   19.1 (65K)     22.4 (98K)     23.2 (130K)    24.9 (195K)    24.7 (261K)
20           11.3 (17K)   16.6 (33K)     19.1 (50K)     20.8 (66K)     22.9 (102K)    23.7 (133K)

Entries are millions of loop-body iterations per second, with the necessary cache size in parentheses.

The performance degrades gradually as the necessary cache size gets larger, since the "least recently used" cache replacement strategy causes the algorithm to throw away elements from the largest intermediate value storage area (HORIZONTAL_SUM in this case) first, while holding onto the other values. As the number of loop iterations increases even further, even the smaller arrays no longer fit in the cache, and performance degrades further.

In this set of benchmarks, the number of loop iterations becomes too small before the necessary cache size gets small enough to fit within the L1 cache. There is no real solution to this problem for the (r,c,d) or (r,d,c) algorithms. It might be possible to reduce the size of the data sufficiently with the (d,r,c) algorithm using a greatly reduced image size.

Although the analysis included in this section is very complex, it also has correspondingly large benefits. Using the same code on the same processor, we are able to make a fairly high-level algorithmic decision about which loop ordering provides optimal performance, although ease of optimization led us to use a different method in our system. The reasons behind this choice will be described in detail in Section 5.3.4. In addition, we are able to tune the data size such that the chosen algorithm runs as fast as possible. A relatively simple change in the size of the data can cause the speed of the program to increase by more than 50%, and this cache optimization method works independently of any other optimizations that are applied to the code. Furthermore, the optimizations described here are only dependent on the cache architecture of the target processor. If the cache sizes are known, then code that is optimal with respect to cache effects can be written in a high-level language without consideration of other processor characteristics.
5.3 Software Implementation 95
5.3.4. CPU-Specific Implementation Issues
Currently the system as described in this document exists in two forms: as a pro-
gram written in generic C code, and as a program written in generic C code with opti-
mizations written in i386 (Intel Pentium II) assembly language with MMX™ technology.
The MMX extensions allow the Pentium II CPU to perform a limited number of
what are called SIMD-in-a-register instructions. Instead of performing one operation
on two 64-bit quantities per instruction, these MMX instructions can perform two 32-
bit operations, four 16-bit operations, or eight 8-bit operations during the execution
time of one CPU instruction. The same arithmetic operation is performed on each of
the smaller data elements (thus the term SIMD, Single Instruction Multiple Data).
More information can be obtained from Intel [Intel 97].
Reimplementing the stereo inner loop using MMX instructions is relatively
straightforward. The main consideration is that there can be no data dependencies
between the operations that are to be performed in parallel. As an example of this, we
see that HORIZONTAL_SUM(c,r,d) depends on
HORIZONTAL_SUM(c-1,r,d). This implies that we cannot perform the operations
of the inner loop for multiple columns in parallel. Similarly, the VERTICAL_SUM
dependencies imply that we cannot perform the operations for multiple rows at once.
This leaves as the only possibility performing the computations for multiple disparity
levels at the same time.
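Why the disparity axis is the only safe one can be seen even in scalar C: the eight d-indexed accumulators below are mutually independent, exactly the pattern one MMX register would hold, while the c and r recurrences are not. This is a plain-C illustration of the data layout, not the MMX implementation itself:

```c
/* Accumulate SAD for LANES consecutive disparity levels together, the way
   one 64-bit MMX register holds eight 8-bit (or four 16-bit) lanes.  The
   inner k-loop has no dependency between lanes, so it maps to a single
   SIMD instruction per i; the analogous loop over c or r would not,
   because HORIZONTAL_SUM(c) depends on HORIZONTAL_SUM(c-1). */

enum { LANES = 8 };

void sad_lanes(const unsigned char *ref, const unsigned char *tgt,
               int len, int d0, unsigned short out[LANES]) {
    for (int k = 0; k < LANES; k++) out[k] = 0;
    for (int i = 0; i < len; i++)
        for (int k = 0; k < LANES; k++) {      /* one SIMD op per i */
            int diff = ref[i] - tgt[i + d0 + k];
            out[k] += (unsigned short)(diff < 0 ? -diff : diff);
        }
}
```

Because the lanes differ only in d, the eight target pixels tgt[i+d0] .. tgt[i+d0+7] are adjacent in memory, which is also why the rectified (r,c,d) layout loads them with a single 64-bit read.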
If we are to perform operations on 8 pixels at once, those 8 pixels must be stored in
sequential locations in memory (otherwise the overhead of reading and assembling the
8 pixels into one 64-bit word would defeat the purpose of this optimization). Since the
8 pixels must refer to 8 different disparity levels, this implies that the only loop order-
ing that can be optimized fully with MMX is the (r,c,d) ordering. The optimization
using MMX is expected to yield a 4- to 8-fold speed improvement. When I contrasted
this with the at most 2-fold improvement yielded by optimizing for cache issues by
using the (d,r,c) ordering, I made the logical decision that (r,c,d) was the better choice
and spent most of my time optimizing that method. In fact, the use of MMX improved
the performance of the (r,c,d) method by about a factor of four.
Chapter 6
Obstacle Detection
The previous chapters have described how we compute very accurate stereo dis-
parity data (and thus stereo range maps) quickly. In order to build an obstacle detec-
tion system, the missing link is a method for determining which range points belong to
obstacles, and grouping these possibly disjoint sets of points into a small number of
obstacle regions that can be presented to a higher level planner so that the vehicle can
take appropriate action.
Most research groups that have attacked this problem have begun by building an
elevation map from the range data. Then they have applied some relatively simple
tests to the elevation map to determine if the vehicle could travel along each point in
its path. This is generally accomplished by placing the vehicle in the map, and com-
puting whether the resulting position violates any kinematic or dynamic constraints. In
this chapter we present a method, based on weakly calibrated stereo, that attempts to
identify obstacles directly from the stereo imagery.
The idea behind this method is that at each pixel we will attempt to find a stereo
match in two different ways. One of these ways will match very well if the pixel lies
on a vertical surface. The other method will match well if the pixel is on a horizontal
surface. By comparing the results of both methods, we can determine which type of
surface the pixel is most likely to belong to. The assumption is that the obstacles that
we need to avoid will contain at least some pixels that will be classified as vertical.
The following sections will first describe methods for matching pixels of different ori-
entations. This is followed by a section describing how pixels are classified as obsta-
cles.
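The comparison of the two matching methods can be sketched as follows. The error inputs would come from the two stereo passes, and the margin threshold is a hypothetical stand-in for the confidence measure described later in this chapter:

```c
/* Classify a pixel by comparing its best matching error under the
   vertical-surface hypothesis against its best error under the
   ground-plane hypothesis.  vert_err and ground_err are the two SAD
   minima; margin is an illustrative confidence threshold that rejects
   pixels where the two hypotheses are too close to call, guarding
   against noise-induced false positives. */

typedef enum { SURF_GROUND, SURF_VERTICAL, SURF_UNKNOWN } surface_t;

surface_t classify_pixel(int vert_err, int ground_err, int margin) {
    if (vert_err + margin < ground_err)
        return SURF_VERTICAL;      /* obstacle candidate */
    if (ground_err + margin < vert_err)
        return SURF_GROUND;        /* drivable road surface */
    return SURF_UNKNOWN;           /* ambiguous: do not report */
}
```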
The groups of individual obstacle pixels that are thus identified must then be
grouped into a small number of obstacle regions. For each of these regions, the size
and 3D location must then be computed. The final section of this chapter describes this
process.
6.1. Related Work
Previous obstacle detection systems fall into several different categories:
• Off-road systems. The slow speed of the robot traversing rough terrain implies that detecting obstacles at short range and/or with long processing time is acceptable. The complexity of the environment usually makes long processing times unavoidable. Two examples of this sort of system are presented in [Matthies et al. 95] and [Hébert et al. 97]. The latter contains several different approaches to the problem, using multiple sensors including laser radar and stereo vision.

• Indoor systems. The amount of indoor mobile robot research that includes obstacle detection is incredibly large. In general, however, obstacle detection systems for indoor mobile robots are only designed to detect obstacles at very short
ranges (relative to the 50 to 150 meter ranges dealt with in this thesis). Detection at short range with fast cycle time is necessitated by the fact that indoor environments are often both complex and dynamic. Two examples of such systems are discussed in [Horswill 93] and [Thrun et al. 97].

• On-road systems. A fair amount of research has been done in detecting obstacles on-road. Most of these systems ([Luong et al. 95] is an example using vision; [Langer 97] uses radar) have concentrated on the problem of detecting other vehicles. Other systems such as [Bruyelle & Postaire 93] show promise in being able to detect people on the road, but lack the speed and acuity to detect such obstacles at highway speeds.

Additionally, [Matthies & Grandjean 94] provide an excellent, detailed analysis of obstacle detectability based on the assumption that obstacles will be detected by computing the difference in height along a step edge. The performance of the system described in this thesis exceeds these limits only because the detection method is different.

The lack of a convincing system to detect small objects on the road surface at long range has motivated this thesis work.

One of the primary methods used in the obstacle detection algorithm described in this chapter, the pre-warping so that the ground plane matches, is not a new idea. [Burt et al. 95] contains an excellent summary of the history of this method, tracking it back to [Nishihara 84], who applied a shearing function to compensate for the difference in disparity over the very large templates he was using.

The computation of surface orientation directly from stereo data has been performed previously by Robert [Robert et al. 94]. The system presented in that paper searches the entire space of possible surface orientations and chooses the best match. The problem with this method (as can be seen in the data presented in the paper) is that there is not enough information contained in the image to determine the surface orientation very accurately. By comparing only a small number of planes (two in the case presented here), we have been able to achieve the necessary performance with very little additional computation.
6.2. System Architecture
Our test vehicle is shown in Figure 6-1. Figure 6-2 shows the architecture of the
system that we have built to implement our approach to obstacle detection. Three CCD
cameras with 35mm lenses are arranged in a triangular configuration, mounted on top
of our Toyota Avalon test vehicle. The distance between the outer set of cameras is
about 1.2 meters. The center camera is offset by 0.5 meters horizontally, and 0.3
meters vertically from the rightmost camera.
The computation that is performed is based on that used by the CMU Video Rate Multibaseline Stereo Machine [Kanade et al. 96], as described in previous chapters. Stereo matching is then performed using both the "traditional method" and the "ground plane method", which will be described in subsequent sections of this chapter. Based on the output of both methods, the further step of obstacle detection and localization is performed.
6.3. Approaches to Stereo
In Section 3.2.4, we presented a method for calibrating our set of cameras with very high precision. The results of this method were:

• one full homography matrix for each baseline: $H_b$

• one 3-vector for each baseline, representing the epipole: $e_b$

• one 3-vector for each planar surface: $\bar{n}_p = \left(\frac{n_p}{d_p} - \frac{n_0}{d_0}\right)^T A_0^{-1}$. Note that $\bar{n}_0 = 0$.

Remember that $n_p$ is the unit normal vector to plane p, $d_p$ is the perpendicular distance from the origin, and $A_0^{-1}$ contains camera intrinsic parameters.

The matrix $H_b$ provides a starting point for stereo search for each pixel (it maps points in one image to points in the other). After warping by $H_b$, points located on the corresponding plane are located at the same pixel coordinate in all of the images. The optimized rectification and stereo search methods described in Chapter 5 then allow us to step in one pixel increments along the epipolar line in either direction from this starting point.

Figure 6-1: Toyota Avalon test vehicle

Figure 6-2: Architecture of Stereo Obstacle Detection System (LoG filtering and image rectification for each of the three cameras, followed by stereo matching and obstacle detection/localization)
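Warping a pixel by $H_b$ is an ordinary projective mapping; a minimal sketch (row-major 3x3 layout is an illustrative choice):

```c
/* Map a pixel (c, r) through a 3x3 homography H, as in warping by Hb:
   (x, y, w)^T = H (c, r, 1)^T, then dehomogenize.  H is row-major. */
void apply_homography(const double h[9], double c, double r,
                      double *c_out, double *r_out) {
    double x = h[0] * c + h[1] * r + h[2];
    double y = h[3] * c + h[4] * r + h[5];
    double w = h[6] * c + h[7] * r + h[8];
    *c_out = x / w;   /* w != 0 for points not mapped to infinity */
    *r_out = y / w;
}
```

Points on the plane that $H_b$ corresponds to land at the same coordinates in every warped image, which is what makes it a natural zero-disparity starting point for the search.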
From equation (2-12), we have that

$$H_b = H_\infty + \frac{e_b n^T}{d_n} A^{-1} \qquad \text{(6-1)}$$

for some $n$ and $d_n$. For compactness, let us define $\bar{n}^T = \frac{n^T}{d_n} A^{-1}$. Combining this with equation (3-18), we get

$$\frac{z_b}{z}\begin{pmatrix}c_b\\r_b\\1\end{pmatrix} = \left(H_\infty + e_b\bar{n}^T\right)\begin{pmatrix}c\\r\\1\end{pmatrix} + \frac{d}{s}e_b = H_\infty\begin{pmatrix}c\\r\\1\end{pmatrix} + \left(\bar{n}^T\begin{pmatrix}c\\r\\1\end{pmatrix} + \frac{d}{s}\right)e_b \qquad \text{(6-2)}$$

further combining this with equation (3-17), we get a formula for z, the range to the object being viewed:

$$z = \frac{1}{\bar{n}^T\begin{pmatrix}c\\r\\1\end{pmatrix} + \frac{d}{s}} \qquad \text{(6-3)}$$

so that in general, the range depends on all three of c, r, and d.

Combining this result with equation (2-3) and writing out the components of $\bar{n}$ explicitly, we get
$$\frac{1}{d_n}\left(n_1 x + n_2 y + n_3 z\right) + \frac{d}{s}z = 1 \qquad \text{(6-4)}$$

thus, the equation for surfaces of constant d is simply the equation for a plane in world coordinates. Therefore, the output of the stereo matching algorithm (which is the value for d which matched best at each pixel) is really indicating which of this family of planes the pixel is most likely to belong to.

When attempting to recognize obstacles, there are at least three different obvious approaches to computing stereo. The following three sections will describe these options in more detail.

6.4. Traditional Stereo

Let us assume that $n^T = (0\ 0\ 1)$. This means that the normal to the plane being observed is parallel to the camera axis. Using equation (6-4), we can see that

$$z = \frac{1}{\frac{1}{d_n} + \frac{d}{s}} \qquad \text{(6-5)}$$

As $d_n$ approaches infinity, this converges to the traditional stereo result that

$$z = \frac{s}{d} \qquad \text{(6-6)}$$

and, as we expect, the set of planes of constant d are in fact also planes of constant z. This case is shown in Figure 6-3.

Two examples of stereo computed with this method (where $H_b$ corresponds to a plane whose normal is parallel to the camera axis) are shown in Figure 6-4. In this example, the scene is the inside of a garage. The garage door has calibration targets attached to it. The images have also already been LoG filtered to enhance image texture. Two regions are chosen as examples of stereo matching to illustrate the problems with traditional stereo processing.
Figure 6-3: Planes of constant "disparity" for the "Traditional Stereo" method

Figure 6-4: Traditional Stereo Processing (right and left image regions with their differences for the garage door and floor, and matching error vs. disparity curves "wall_tra.out" and "ground_t.out")

For the example region on the garage door, we see that the regions searched in the stereo matching (shown in detail below the images) match very well. The upper curve on the graph at the bottom shows the matching error (sum of absolute differences, SAD) as a function of the displacement along the epipolar line. This graph shows a strong global minimum at the correct value of 100.
However, the example on the garage floor does not match as well. This is due to
the fact that since the ground is tilted with respect to the camera axis, points which are
higher in the image are actually farther away and thus match at a different location (a
different value of d). This is seen as a difference in the slope of the line on the ground.
The lower curve of the graph shows that the global minimum of the matching error
does not occur at the correct position (which is at a value of around 155).
It is clear from this example that a simple application of traditional stereo tech-
niques will not be sufficient for detecting obstacles on a road surface; points on the
ground such as those shown in the example will produce incorrect results, particularly
in regions where the image texture is low. Since the problem is caused by a difference
in the geometry of the surfaces being observed, the solution to this problem is to com-
pensate for the different geometry.
6.5. “Ground Plane Stereo”
One way to solve the problems described in the previous section is to use an $H_b$
that corresponds to a plane that is similar to what we expect for the ground. In this
case, the set of planes defined by equation (6-4) is somewhat more complicated than
for vertical planes. Figure 6-5 shows what the set of planes of constant d would look
like in the case of an idealized example. In the general case, all of the planes pass
through the same intersection line with the x-y plane, and the value of d controls the
pivot angle about this line. The special case of traditional stereo has this line being
infinitely far downward, causing the set of planes to be vertical and parallel.
An example of stereo computed with an $H_b$ corresponding to a horizontal surface
is shown in Figure 6-6. Both images now appear to be almost identical for pixels
which are on the ground, but pixels which are on a vertical surface such as the wall of
Figure 6-5: Planes of constant “disparity” for the “Ground Plane Stereo” method (height (m) versus distance (m)). Parameters are 1m baseline, 35mm lenses, 1/2” CCD, cameras aligned perfectly and 2m above the ground

Figure 6-6: “Ground Plane Stereo” (left and right image regions with their differences, and matching error versus disparity curves; “wall_gro.out”, “ground_g.out”)
the garage are now warped in much the same way that the ground pixels were warped
in the previous example of traditional stereo. This means of computing stereo is simi-
lar to the tilted horopter method of [Burt et al. 95], except that in our case, instead of
attempting to determine the parameters of the ground plane at each iteration, we use a
single horizontal plane that is fixed relative to the vehicle as the basis to start our ste-
reo search.
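The fixed ground-plane warp amounts to pushing homogeneous pixel coordinates through a single 3x3 homography. A minimal sketch of that mapping is below; the identity matrix is a placeholder, since the real system builds its homography from the camera geometry and the assumed ground plane.

```python
# Sketch of applying a plane-induced homography to a pixel.  The
# identity matrix below is a stand-in, not the system's actual
# ground-plane homography.
def apply_homography(H, c, r):
    """Map pixel (c, r) through the 3x3 homography H (nested lists)."""
    x = H[0][0] * c + H[0][1] * r + H[0][2]
    y = H[1][0] * c + H[1][1] * r + H[1][2]
    w = H[2][0] * c + H[2][1] * r + H[2][2]
    return x / w, y / w   # normalize homogeneous coordinates

I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(apply_homography(I3, 320, 120))   # (320.0, 120.0)
```

Warping a whole image simply applies this mapping (with interpolation) at every pixel, so pixels that actually lie on the assumed plane line up between the two warped views.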
Comparing the results from the ground plane method with the results from the tra-
ditional method, we notice several differences. First, the global minimum of the
matching error curve for the point on the ground (the lower curve) now appears at the
correct location. The value of the error at the minimum is also lower than before, since
it matches better. Second, although the global minimum of the curve for the point on
the door is still at the correct location, the trough of the minimum is much wider, indi-
cating a less certain result. The value at the minimum is larger, indicating that it
doesn’t match as well.
These two examples suggest an alternate approach: if we compute stereo using
both methods, it is possible to determine whether a given point lies on a vertical sur-
face (if the traditional method produces a lower minimum error) or on a horizontal sur-
face (if the ground plane method produces a lower minimum error). The correct
disparity can also be determined from the position of the lower minimum.
Since most obstacles that we are concerned with contain nearly-vertical surfaces,
detecting such obstacles becomes both easier and more reliable using this new method.
One issue that should be addressed is what conditions are necessary for this
method to work reliably. For example, if two surfaces appear in the same image region
(near where the garage door meets the ground, for instance), which surface will be
chosen? The most important factor is the magnitude of the image texture on each sur-
face. Another factor is how close the surface directions are to being vertical or hori-
zontal.
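The decision rule described above can be sketched as follows; the two error curves are made-up stand-ins for real summed-SAD curves, and the dictionary representation is purely illustrative.

```python
# Sketch of the vertical-vs-horizontal decision: run both stereo
# hypotheses and keep the one whose error curve has the lower minimum.
# The curves below are toy data, not real SAD measurements.
def classify_surface(err_traditional, err_ground):
    """Return (label, disparity) from two matching-error curves,
    each a mapping disparity -> summed matching error."""
    d_t = min(err_traditional, key=err_traditional.get)
    d_g = min(err_ground, key=err_ground.get)
    if err_traditional[d_t] < err_ground[d_g]:
        return "vertical", d_t
    return "horizontal", d_g

err_t = {98: 900, 100: 200, 102: 850}   # sharp, low minimum
err_g = {98: 700, 100: 650, 102: 720}   # shallow, higher minimum
print(classify_surface(err_t, err_g))   # ('vertical', 100)
```

The winning hypothesis also supplies the disparity estimate, exactly as the text describes.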
Figure 6-7 shows the results of applying both methods to a typical input image set.
The gray coding in both cases represents the number of pixels of displacement along
the epipolar line (dark is negative, medium gray is zero, and bright is positive). As
Figure 6-7: Example output of both methods (panels: original image, traditional method output, ground plane method output)
expected, the ground plane method does very well on the ground pixels, but poorly on
the wall in the background. Conversely, the traditional method works well on vertical
features such as the lamp post and the wall, but many pixels on the ground surface are
mis-matched.
6.6. Height Stereo
The ground plane method described in the previous section produces output that is
closely related to the height of the objects being viewed. This leads to the question: “is
it possible to compute height directly using stereo vision?”
Let us suppose that

\[ H_b = H_\infty + e_b \frac{n^T}{d_n} A^{-1} \tag{6-7} \]

where $n$ and $d_n$ refer to a horizontal plane. By varying the value of $d_n$, we can pro-
duce homographies for other planes that are parallel to the ground plane. The correct
equation for epipolar search (in analogy with equation (6-2)) then becomes

\[ \frac{z_b}{z} \begin{bmatrix} c_b \\ r_b \\ 1 \end{bmatrix} = \left( H_\infty + e_b n^T \right) \begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + \frac{d}{s}\, e_b n^T \begin{bmatrix} c \\ r \\ 1 \end{bmatrix} = H_\infty \begin{bmatrix} c \\ r \\ 1 \end{bmatrix} + \left( \frac{d}{s} + 1 \right) \left( n^T \begin{bmatrix} c \\ r \\ 1 \end{bmatrix} \right) e_b \tag{6-8} \]

This equation is different from equation (6-2) in one fundamental way. With
equation (6-2), changing d is the same as adding a constant offset to the previous pixel
position. This offset does not depend on the location of a pixel in the image. This con-
trasts with equation (6-8), where the offset is not a constant. This implies that the effi-
cient rectification and stereo search methods that we developed in Chapter 5 are not
useful for this method.
Nevertheless, it is possible to compute height using stereo, by computing the
results of equation (6-8) for each pixel and disparity level and interpolating. The out-
put of this height method is shown in Figure 6-8. The triangular section on the lower
right represents pixels for which all possible matches lay outside the image for this
method.
The results of this simple height stereo method are reasonably good, but it seems to
have problems at long distances (toward the top of the image). In order to understand
why this is, we must consider the quantity $n^T (c\ r\ 1)^T$, which is a scalar that multi-
plies $e$. If the horizon appears in the image, pixels at the horizon will be perpendicular
to the direction of the ground plane, and this scalar will be zero. The effect of this will
be that pixels on the horizon will all map to the same location, regardless of the value
of d. Another effect is that pixels that are close to the horizon will move very little as d
increases, while pixels that are far from the horizon will move much more. Pixels that
are above the horizon will actually move backwards (as if looking at the ground
behind the cameras).
It is clear from this that no single choice of step size for d will give good results for
Figure 6-8: Output with height calibration (panels: intensity image, disparity output)
this method, and the result of that can be seen in the upper portion of Figure 6-8. We
mention the method here only for the sake of completeness.
6.7. Obstacle Detection from Stereo Output
As discussed in section 6.5, our method involves performing two types of stereo
matching (for vertical and horizontal surfaces), and comparing the absolute errors to
determine if a particular image pixel belongs to a vertical or horizontal surface. The
vertical surface result of this is shown in Figure 6-9. The pixels shown in the lower
part of the image are coded by the size of the difference between the minimum errors
found by the two methods. Brighter pixels indicate that the vertical match is much bet-
ter than the ground plane match. Thus pixels which appear white are most likely to be
vertical, and black pixels are most likely to be horizontal.
Figure 6-9: Detected vertical surfaces
Regions of very low texture (such as the black spot in the center of the road) some-
times match well as vertical surfaces, since the amount of signal that can be used to
determine the surface orientation is very small compared to the amount of camera
noise.
In order to remove such false obstacles from consideration, we compute a simple
confidence measure. For regions which are actual vertical surfaces, we expect that the
traditional stereo matching method will return a relatively large number of pixels at
approximately the same depth. Conversely, if a region belongs to a horizontal plane,
we would expect the traditional method to report a number of different depths. Using
standard connected components labeling methods on the disparity image generated
from traditional stereo matching, we get the image of Figure 6-10. The gray level in
this image encodes the size of the region of similar depths to which each pixel belongs.
Large regions appear brighter, and these regions are more likely to be obstacles. By
requiring detected obstacle regions to pass this consistency check, we can remove
most false positive detections.
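A minimal sketch of this consistency measure, assuming 4-connectivity and exact disparity equality (the text only specifies standard connected-components labeling, so these details are illustrative):

```python
from collections import deque

# Sketch of the consistency check: label 4-connected regions of equal
# disparity and record each region's size.  The disparity values below
# are toy data.
def region_sizes(disp):
    """Return an image where each pixel holds the size of its
    constant-disparity 4-connected region."""
    h, w = len(disp), len(disp[0])
    size = [[0] * w for _ in range(h)]
    seen = [[False] * w for _ in range(h)]
    for r0 in range(h):
        for c0 in range(w):
            if seen[r0][c0]:
                continue
            region, queue = [], deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:                      # flood fill one region
                r, c = queue.popleft()
                region.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w and not seen[rr][cc] \
                            and disp[rr][cc] == disp[r0][c0]:
                        seen[rr][cc] = True
                        queue.append((rr, cc))
            for r, c in region:
                size[r][c] = len(region)
    return size

disp = [[5, 5, 9],
        [5, 7, 9],
        [5, 7, 9]]
print(region_sizes(disp))   # [[4, 4, 3], [4, 2, 3], [4, 2, 3]]
```

Pixels in large constant-disparity regions get high values, matching the brighter pixels of Figure 6-10.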
Figure 6-10: Size of regions of constant disparity

Whether a pixel belongs to an obstacle or not is determined by comparing the “ver-
tical surfaces” output of Figure 6-9 to a threshold. If the value of this image is higher
than the threshold, then the pixel is likely to belong to a vertical surface. Then we
check the same pixel location in the image of Figure 6-10, and compare its value in
this image to another threshold. If it passes this test, then it belongs to a region of the
image that has provided consistent results. Pixels that pass both tests are declared to be
candidate obstacle pixels. An example of the detected obstacle output is shown in
Figure 6-11. Obstacles are shown in black. This example shows a 14cm (6”) high
obstacle, which is a piece of wood painted black. The obstacle is roughly 100m in
front of the vehicle.

Figure 6-11: Detected Obstacles

Some other points in the image are also reported as obstacles. The curbs on the
right and left are both identified relatively consistently, as is the building in the back-
ground. The curb in the background is too short and too far away to be reliably
detected.
In order to show that the system is not sensitive to large amounts of texture on the
ground plane, we have also tested in situations such as that shown in Figure 6-12. The
system does not detect any obstacles on the ground of the parking lot, despite the large
amount of image texture provided by the painted lines. The car and trees in the back-
ground are correctly detected in regions where they have sufficient texture.
6.7.1. Sub-pixel interpolation
Since the obstacle detection method described in the previous sections does not
depend on accurately determining the distance to particular pixels in the scene (it
instead attempts to determine the surface orientation at those pixels), sub-pixel accu-
racy in matching is not necessary for the determination of whether an obstacle is
present or not.
On the other hand, if accurate determination of the range to obstacles is desired
then sub-pixel interpolation is necessary, at least for those pixels that have been deter-
mined to lie on the obstacle. In practice, our system has not used sub-pixel interpola-
tion since we have been more concerned with being able to detect the obstacles than
with trying to determine their position. If this sort of obstacle detection system were to
be used on an autonomous vehicle, the accuracy requirements of whatever obstacle
avoidance system is used would determine whether sub-pixel interpolation is neces-
sary or not.
Figure 6-12: Output of system with highly textured ground plane
6.7.2. Computing the two types of stereo efficiently
Since we want to compute both types of stereo (for the “ground plane method” and
the “traditional method”), we need to find an efficient method for doing so. The only
difference between the two methods is the matrices used for rectification. Let us sup-
pose that the matrices $H_b$ refer to the ground plane. The rectification equation is

\[ \alpha_b \begin{bmatrix} c_b \\ r_b \\ 1 \end{bmatrix} = W_b \begin{bmatrix} c \\ r \\ 1 \end{bmatrix} \tag{6-9} \]

where the exact formula for $\alpha_b$ is not important, since its only purpose is to provide a
scale factor for division.
For the ground plane method, we have $W_b = L_b M H_b^{-1}$. For the traditional
method, we have $W'_b = L_b M \left( H_b + e_b n_v^T \right)^{-1}$, where $n_v$ is given by
$n_v^T = \left( \dfrac{n_v}{d_v} - \dfrac{n_g}{d_g} \right)^T A_0^{-1}$. The v subscripted variables refer to the surface normal and dis-
tance to a vertical plane, and the g subscripted variables refer to the ground plane.
Suppose we were to warp image b using both W matrices, producing two warped
images. The mapping between corresponding points in the resulting images would be:
\[ \frac{\alpha_b}{\alpha'_b} \begin{bmatrix} c_b \\ r_b \\ 1 \end{bmatrix} = W_b W'^{-1}_b \begin{bmatrix} c'_b \\ r'_b \\ 1 \end{bmatrix} = L_b M H_b^{-1} \left( H_b + e_b n_v^T \right) M^{-1} L_b^{-1} \begin{bmatrix} c'_b \\ r'_b \\ 1 \end{bmatrix} = \left( I + L_b M H_b^{-1} e_b n_v^T M^{-1} L_b^{-1} \right) \begin{bmatrix} c'_b \\ r'_b \\ 1 \end{bmatrix} \tag{6-10} \]

From equation (5-9), we have that

\[ M H_b^{-1} e_b = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \tag{6-11} \]

Furthermore, condition 3 on page 83 gives us that

\[ L_b \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \tag{6-12} \]

so that equation (6-10) can be simplified to

\[ \frac{\alpha_b}{\alpha'_b} \begin{bmatrix} c_b \\ r_b \\ 1 \end{bmatrix} = \begin{bmatrix} c'_b \\ r'_b \\ 1 \end{bmatrix} + \left( n_v^T M^{-1} L_b^{-1} \begin{bmatrix} c'_b \\ r'_b \\ 1 \end{bmatrix} \right) \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \tag{6-13} \]

Note that the parenthesized expression in this equation is a scalar, so that correspond-
ing pixels between the two images are all located on the same scan line of the image,
offset by the quantity
\[ n_v^T M^{-1} L_b^{-1} \begin{bmatrix} c'_b \\ r'_b \\ 1 \end{bmatrix} \tag{6-14} \]

For convenience, let us define

\[ \eta = \left( n_v^T M^{-1} L_b^{-1} \right)^T \tag{6-15} \]

which is a 3-vector.
For the pixel (c,r) in image 0, with disparity d, in the ground plane method the corre-
sponding pixel will appear in image 1 at (c+d, r). For the traditional method, the corre-
sponding pixel is at

\[ \left( c + d + \eta^T \begin{bmatrix} c + d \\ r \\ 1 \end{bmatrix},\ r \right) \tag{6-16} \]

in the image warped for the ground plane method. As might be expected, the function
of equation (6-14) simply appears in this equation as a disparity offset that depends on
the image location.
In order to implement both types of stereo matching efficiently, it would be nice if
we could re-use some of the intermediate results of one method for the other. In order
to solve this problem, it is helpful to realize that the planes for which we compute ste-
reo do not have to be perfectly vertical or horizontal, as long as they are close enough
to be useful tests of “verticalness” or “horizontalness”. In order to make efficient com-
putation possible, we must choose a vertical plane such that $\eta_0 = 0$ (it will become
clear why in a moment). Intuitively, this is a requirement on the slope of the vertical
plane. What it means is that as we move across a row of the rectified image, the range
to both the vertical and horizontal planes must change at the same rate. In practice,
$\eta_0$ is almost always near to zero anyway, which is a result of the fact that both our vertical
and horizontal planes are nearly parallel to the camera rows, and that our rectification
procedure attempts to warp the images as little as possible. Our solution is to set
$\eta_0 = 0$. If this is the case, the pixel at (c,r) and disparity d will appear at
$(c + d + \eta_1 r + \eta_2,\ r)$.
In the stereo main loop (presented in Section 5.3.3.1), if we compute stereo for the
ground plane case we must compute MATCHING_ERROR(c,r,d) for every possi-
ble value of (c,r) and d. From the above discussion, we can see that this is equivalent
to computing MATCHING_ERROR(c,r,d-($\eta_1 r + \eta_2$)) for the traditional case.
Although this is not an integer disparity for the traditional case, it is a perfectly valid
result that we can use.
Since setting $\eta_0 = 0$ removed the dependency on image columns,
MATCHING_ERROR(c,r,d+1) for the ground plane case is also equivalent to
MATCHING_ERROR(c,r,d+1-($\eta_1 r + \eta_2$)) for the traditional case. By this logic, it
should also be possible to reuse the HORIZONTAL_SUM calculations:
HORIZONTAL_SUM(c,r,d) for the ground plane case is the same as
HORIZONTAL_SUM(c,r,d-($\eta_1 r + \eta_2$)) for the traditional case.
The problem comes in with the VERTICAL_SUM computations. Since $\eta_1$ is not
zero (and the result would not be very interesting if it were), the HORIZONTAL_SUM
computations from different rows refer to different sets of non-integer disparity levels.
The sets are offset by $\eta_1$, which is unlikely to be an integer. As an example, suppose
$\eta_1 = 1.2$. This would produce HORIZONTAL_SUM values for row 0 at disparities of
{..., -2, -1, 0, 1, 2, 3, ...}. For row 1, it would produce values for disparities
{..., -0.8, 0.2, 1.2, 2.2, 3.2, 4.2, ...}, and so on for the rows down through the image.
We solve this problem by making an approximation: for each row of the image, we
round off the set of disparities to the closest integer for the purpose of adding them
together into a VERTICAL_SUM. So, for example, the sums produced for row 1 would
be rounded off and used as if the set of disparities were actually
{..., -1, 0, 1, 2, 3, 4, ...}.
The effect of the previous discussion is that the following pseudo-code is able to
compute correct disparities for the “ground plane method”, and approximate dispari-
ties for the “traditional method”:
    for (outer-loop) {
      for (middle-loop) {
        for (inner-loop) {
          SAD(c,r,d) = MATCHING_ERROR(c,r,d);
          HORIZONTAL_SUM(c,r,d) = HORIZONTAL_SUM(c-1,r,d)
                                + SAD(c,r,d) - SAD(c-WINDOW_WIDTH,r,d);
          VERTICAL_SUM(c,r,d) = VERTICAL_SUM(c,r-1,d)
                              + HORIZONTAL_SUM(c,r,d)
                              - HORIZONTAL_SUM(c,r-WINDOW_HEIGHT,d);
          if (VERTICAL_SUM(c,r,d) < MIN_SSAD(c,r)) {
            MIN_SSAD(c,r) = VERTICAL_SUM(c,r,d);
            RESULT_IMAGE(c,r) = d;
          }
          VERTICAL_SUM_TRAD(c,r,d) = VERTICAL_SUM_TRAD(c,r-1,d-VS_OFFSET(r))
                                   + HORIZONTAL_SUM(c,r,d)
                                   - HORIZONTAL_SUM(c,r-WINDOW_HEIGHT,d-HS_OFFSET(r));
          if (VERTICAL_SUM_TRAD(c,r,d) < MIN_SSAD_TRAD(c,r)) {
            MIN_SSAD_TRAD(c,r) = VERTICAL_SUM_TRAD(c,r,d);
            RESULT_IMAGE_TRAD(c,r) = d;
          }
        }
      }
    }
VS_OFFSET(r) and HS_OFFSET(r) are precomputed from the values of $\eta_1$
and $\eta_2$.
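For reference, a brute-force version of what the windowed-SAD main loop computes can be sketched as below; the running sums, the row-dependent VS_OFFSET/HS_OFFSET, and the second (traditional) accumulator are omitted, and the images are toy data.

```python
# Brute-force reference for the windowed-SAD stereo loop: per-disparity
# absolute differences summed over a square window, keeping the
# lowest-error disparity per pixel.  The incremental-sum optimization
# and the row-dependent offsets are omitted for clarity.
def box_stereo(left, right, max_d, win=1):
    h, w = len(left), len(left[0])
    best = [[0] * w for _ in range(h)]
    min_ssad = [[float("inf")] * w for _ in range(h)]
    for d in range(max_d + 1):
        # pixel-wise absolute difference at this disparity level
        sad = [[abs(left[r][c] - right[r][c - d]) if c - d >= 0 else 10**6
                for c in range(w)] for r in range(h)]
        for r in range(win, h - win):
            for c in range(win, w - win):
                s = sum(sad[rr][cc]
                        for rr in range(r - win, r + win + 1)
                        for cc in range(c - win, c + win + 1))
                if s < min_ssad[r][c]:
                    min_ssad[r][c] = s
                    best[r][c] = d
    return best

# Toy images: the left view is the right view shifted by 2 pixels.
base = [[(7 * r + 13 * c) % 17 for c in range(10)] for r in range(5)]
right = base
left = [[base[r][c - 2] if c >= 2 else 0 for c in range(10)] for r in range(5)]
print(box_stereo(left, right, 4)[2][5])   # 2: the true shift is recovered
```

The incremental HORIZONTAL_SUM/VERTICAL_SUM version in the pseudo-code above produces the same sums while touching each pixel only a constant number of times per disparity level.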
6.8. Obstacle Clustering
In order to be useful to a high-level planner, we need to take the set of pixels that
are found to be obstacles, and reduce it to a small number of obstacle regions. The
method that we have used for doing this is straightforward. First, a simple one-pass
connected-components labeling is performed on the obstacle image (the intersection
of Figure 6-9 and Figure 6-10). While the labeling is being performed, statistics
are maintained for each region, including its size, centroid, mean disparity, maximum
disparity, minimum disparity, and bounding box. Connected regions whose size is
above a certain threshold are declared to be obstacles.
Using the metric calibration methods described in Section 3.3, we can compute 3D
coordinates for the obstacle centroid and mean disparity, or for the closest point. We
can also compute the 3D extent of the obstacle. These parameters are then available to
a higher level module for deciding on and executing appropriate actions.
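The one-pass statistics gathering can be sketched as follows, assuming the pixels have already been labeled; the tuple format and toy values are illustrative, not the system's data structures.

```python
# Sketch of per-region statistics accumulated in a single pass over
# labeled obstacle pixels.  Input format and values are hypothetical.
def region_stats(pixels):
    """pixels: iterable of (label, c, r, d) tuples.  Returns, per label:
    size, centroid, mean/min/max disparity, and bounding box."""
    stats = {}
    for label, c, r, d in pixels:
        s = stats.setdefault(label, {"size": 0, "sum_c": 0, "sum_r": 0,
                                     "sum_d": 0.0, "min_d": d, "max_d": d,
                                     "bbox": [c, r, c, r]})
        s["size"] += 1
        s["sum_c"] += c
        s["sum_r"] += r
        s["sum_d"] += d
        s["min_d"] = min(s["min_d"], d)
        s["max_d"] = max(s["max_d"], d)
        b = s["bbox"]          # [min_c, min_r, max_c, max_r]
        b[0], b[1] = min(b[0], c), min(b[1], r)
        b[2], b[3] = max(b[2], c), max(b[3], r)
    for s in stats.values():
        s["centroid"] = (s["sum_c"] / s["size"], s["sum_r"] / s["size"])
        s["mean_d"] = s["sum_d"] / s["size"]
    return stats

obs = [(1, 10, 5, 40), (1, 11, 5, 42), (1, 10, 6, 41), (2, 50, 8, 12)]
s = region_stats(obs)
print(s[1]["size"], s[1]["mean_d"], s[1]["bbox"])   # 3 41.0 [10, 5, 11, 6]
```

A size threshold on these regions then yields the reported obstacles, and the centroid and mean disparity feed the metric 3D computation.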
6.9. System Parameters
The obstacle detection system has several parameters which can be set at compile
time in order to control different aspects of the system. A summary of those parame-
ters and a rough idea of the effects of changing them is presented here.
LoG filter size: the size of the LoG filter mask is directly related to the value of σ
for the LoG filter. Making σ larger causes the filter to remove more high-frequency
components from the image, as well as making the mask larger and thus causing the
filtering process to be slower. Smaller values of σ allow more high-frequency compo-
nents through the filter, which may allow more noise to pass through.
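For concreteness, a LoG mask can be built from σ as sketched below; the mask half-width of 4σ and the normalization are common conventions and are not necessarily those used in the thesis implementation.

```python
import math

# Sketch of a Laplacian-of-Gaussian mask built from sigma.  The 4*sigma
# half-width and the 1/(pi*sigma^4) normalization are common textbook
# conventions, assumed here for illustration.
def log_kernel(sigma):
    half = int(math.ceil(4 * sigma))
    s2 = sigma * sigma
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = x * x + y * y
            row.append(-(1.0 / (math.pi * s2 * s2))
                       * (1.0 - r2 / (2.0 * s2))
                       * math.exp(-r2 / (2.0 * s2)))
        kernel.append(row)
    return kernel

k = log_kernel(1.0)
print(len(k))        # 9: mask size (and filtering cost) grows with sigma
print(k[4][4] < 0)   # True: strong center lobe of opposite sign to the ring
```

Doubling σ roughly doubles the mask width, which is why larger σ both removes more high-frequency content and slows the filtering.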
LoG filter gain: as discussed in Section 4.3, the gain of the LoG filter controls the
ability of the filter to enhance image texture.
Rectification step size, s: although this has been implicitly set according to the
discussion in Section 3.2.5, it could be made larger in order to perform sub-pixel
matching or smaller in order to reduce the amount of search.
Disparity search range: the range of disparities searched controls how far away
from the ground plane (or other target plane) a point in the world can be, and still be
properly matched by the stereo algorithm. The speed of the stereo matching part of the
algorithm depends linearly on the size of the search space, so it needs to be set to the
smallest possible value that still allows recognition of obstacles under all conditions.
Stereo matching window size: in the process of stereo matching, matching errors
are summed over a window. This amounts to an assumption that all of the pixels
within the window will belong to the same surface. If multiple surfaces appear within
the matching window, the algorithm will usually either lock onto one surface or the
other, or produce a result that is intermediate between the two. In some rare cases, it
can produce a completely incorrect result. Reducing the size of the window also
reduces the size of the patch that is required to belong to the same surface, causing
more pixels to be matched correctly, at the expense of increasing susceptibility to
image noise. Conversely, increasing the window size decreases susceptibility to image
noise, having a smoothing effect, but it increases incorrect matches at the borders
between surfaces.
Vertical surface threshold: this parameter controls how much better a pixel must
match as a vertical surface than as a horizontal surface in order to be considered a can-
didate obstacle. Small values tend to produce many noise points, as seen in Figure 6-9.
If the value is too large, obstacles will not be detected.
Consistency threshold: this parameter controls the size of the region of constant
disparity (in pixels) that a pixel must belong to in order to be considered an obstacle can-
didate. In general, the size of this threshold can be set to any small value in the range
of 5-15 with similar results. If the value is too small, many small erroneous obstacle
regions can appear. If the value is too large, small obstacles may not be detected.
Obstacle clustering threshold: controls the number of adjacent pixels that must
be declared as obstacle pixels in order for an obstacle to be reported. Regions smaller
than 10 pixels tend to be unreliable, so we eliminate them based on their size.
Chapter 7
Obstacle Detection Results
The previous chapters have presented the design and implementation consider-
ations that have gone into our obstacle detection system. This chapter presents the
results of a number of different obstacle detection experiments designed to test our
system under a variety of different conditions.
In order to test the performance of the system with respect to different sizes and
colors of obstacles, we constructed test obstacles out of four common sizes of lumber,
1"x4", 1"x6", 1"x8", and 1"x12". Each type of lumber was cut into pieces that were
12" long, and the pieces were spray painted black, white, or gray, thus producing 12
different obstacles. When used in the tests, the boards were propped up on their edges,
producing obstacles of four different heights, approximately 9 cm, 14 cm, 19 cm, and
29 cm tall (note that the commercial lumber sold in the U.S. as "1x4" is not 4" wide).
Additionally, during testing a number of other objects were used as obstacles.
Although the only such obstacle that will be presented in this section is a 12 oz. (355
ml) Diet Pepsi can, many other objects were also tested. These objects included peo-
ple, bricks, stones, boards lying flat on the road, and paper plates. In general, the sys-
tem performed as expected in that taller obstacles and obstacles with higher contrast
relative to the road surface were detected at longer distances than obstacles that were
shorter or had lower contrast.
The camera system was adjusted and calibrated in the “car barn”, a large garage-
like space on the Carnegie Mellon campus (which appears in the images of
Figure 6-3). The car barn was used because it provides a controlled environment with
a flat floor. Straight lines on the floor are provided by a section of railroad track,
which meets the garage door at a right angle. First, the car was placed at the far end of
the garage, facing the garage door and aligned with the railroad tracks. This places the
cameras about 45 meters from the door. All three cameras were then adjusted so that
the images of the door roughly overlapped, thus ensuring that the (very narrow) cam-
era fields of view would overlap sufficiently to compute stereo disparity for objects
over a wide range.
The calibration was performed using the methods outlined in Chapter 3, both the
weak calibration and the metric calibration. Seven sets of images of the garage door
were taken at five meter intervals from 15 meters to 45 meters. In the 45-meter image,
matching was performed both for the garage door as a vertical plane and for the garage
floor as a horizontal plane. The origin of the vehicle coordinate system was set to be
the point where the left front tire touches the ground. The lateral offsets to the left and
right edges of the door were measured, and those features were used for the metric cal-
ibration.
The tests were performed at a site in the borough of Homestead, near Pittsburgh, Penn-
sylvania. The site is a rarely-used stretch of city street. An overhead diagram of the
site is included in Figure 7-6. The road surface is relatively new asphalt, paved in the
last few years and almost unused, although the asphalt has seen some weathering and
no longer has the black and shiny appearance of fresh asphalt. Tests have also been
performed on concrete roadways; although the results are not included here, there was
no substantial difference in system performance.
The total length of straight road available at the test site prevents testing at dis-
tances exceeding 150 meters. Due to timing difficulties and restrictions on the amount
of data that can be collected at once with our system, many of the data sets taken with
the vehicle in motion begin with the obstacle at distances of 110 meters or less.
7.1. Obstacle Detection System Performance
The stereo processing on the vehicle was performed on a 300 MHz Intel
Pentium II PC with a Matrox digitizer. More recently, we have performed some tests
on a 400 MHz Pentium II processor. The processing times for a set of three 640x240
images, searching 96 disparity levels, and computing both “traditional stereo” and
“ground plane stereo” are shown in Table 7-1. Note that the stereo matching part of the
algorithm sees a large benefit from the increased memory bus speed, indicating that
the performance of that part of the algorithm is probably limited by memory speed.
The overall cycle time of the system installed on the vehicle is on the order of 1.5 sec-
onds per frame.
                                     300 MHz (66 MHz bus)    400 MHz (100 MHz bus)
                                     Pentium II              Pentium II
    LoG Filtering (3 images)         150 ms                  115 ms
    Image Rectification (3 images)   151 ms                  111 ms
    Stereo Matching (both ground
      plane & traditional)           750 ms                  500 ms
    Obstacle Detection               350 ms                  290 ms

Table 7-1: Stereo system benchmarks
7.2. Stereo Range to Detected Obstacles
In order to assess the accuracy of the range measurements obtained for obstacles
from the complete system, we collected data from a stopped vehicle for each of the 12
obstacles at measured ranges. The ranges used went in 10 meter increments from 50
meters to 150 meters. Figure 7-1 shows a plot of the range as measured by hand versus
the range reported by the obstacle detection system. For an obstacle to be reported, a
region containing at least ten candidate obstacle pixels must have been found. The dia-
monds in this graph represent the range to the pixel of the obstacle that reported the
closest range. The plus signs represent the mean of the ranges reported by all of the
pixels. As expected, the measured range is very accurate when the object is close, and
gets increasingly less accurate as the obstacle gets farther away due to the inherent loss
of stereo accuracy at long range. There is one severe outlier in the data at 150 meters,
which is caused by a single pixel mismatch. Even though the closest pixel for the 150
meter data shows up at 80 meters, the fact that the mean distance is still near 150
meters shows that the other pixels were not mismatched. It would thus be possible to
Figure 7-1: Stereo range accuracy (detected range (m) versus actual range (m); data from “rangeacc.out”)
filter out such outlier pixels by using the mean or median range instead of the closest
range to the object.
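Such a filter could be sketched as below; the range values are toy numbers mimicking the single 80 m mismatch among measurements near 150 m.

```python
import statistics

# Sketch of the suggested outlier filter: report the median of the
# per-pixel ranges instead of the closest one.  The ranges below are
# toy values with one mismatched pixel at 80 m.
def reported_range(pixel_ranges):
    return statistics.median(pixel_ranges)

ranges = [149.0, 151.0, 150.0, 148.5, 80.0, 152.0, 150.5]
print(min(ranges))              # 80.0: the closest-pixel rule is fooled
print(reported_range(ranges))   # 150.0: the median ignores the outlier
```

Because an obstacle region contributes many pixels at nearly the same depth, a single mismatched pixel shifts the median far less than it shifts the minimum.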
The obstacle detection performance results on this data set are summarized in
Table 7-2. Checkmarks represent successful detection of the obstacle, whereas X
marks represent a failure to detect the obstacle. This table shows at least two interest-
ing facts. First, we were successfully able to detect obstacles that are 14 cm or taller at
up to 110m. Second, the white and gray obstacles are more difficult to detect because
of the lower contrast between obstacle and road surface pixels. The white obstacles
cannot be detected at all beyond 120 meters, regardless of size.
7.3. Experiments From a Moving Vehicle
A number of further experiments were performed with various obstacles while the
vehicle was moving. The cycle time for the computers installed on the vehicle is 1.5
seconds, but we want to get frequent depth measurements in anticipation of hardware
that can run the algorithm at a faster rate. Accordingly, we recorded the image data to
the hard disk at a faster rate (either 4 frames per second or 15 frames per second), and
processed the data off-line.
            Black                    Grey                     White
            9cm  14cm 19cm 30cm      9cm  14cm 19cm 30cm      9cm  14cm 19cm 30cm
     50m    ✓    ✓    ✓    ✓        ✓    ✓    ✓    ✓        ✓    ✓    ✓    ✓
     60m    ✓    ✓    ✓    ✓        ✕    ✓    ✓    ✓        ✓    ✓    ✓    ✓
     70m    ✓    ✓    ✓    ✓        ✓    ✓    ✓    ✓        ✓    ✓    ✓    ✓
     80m    ✓    ✓    ✓    ✓        ✓    ✓    ✓    ✓        ✓    ✓    ✓    ✓
     90m    ✓    ✓    ✓    ✓        ✕    ✓    ✓    ✓        ✓    ✓    ✓    ✓
    100m    ✓    ✓    ✓    ✓        ✕    ✓    ✓    ✓        ✓    ✓    ✓    ✓
    110m    ✓    ✓    ✓    ✓        ✕    ✓    ✓    ✓        ✕    ✓    ✓    ✓
    120m    ✓    ✕    ✓    ✓        ✓    ✕    ✓    ✓        ✕    ✓    ✓    ✓
    130m    ✕    ✕    ✓    ✕        ✕    ✓    ✓    ✓        ✕    ✕    ✕    ✕
    140m    ✓    ✕    ✓    ✓        ✓    ✓    ✓    ✓        ✕    ✕    ✕    ✕
    150m    ✓    ✓    ✓    ✓        ✓    ✓    ✓    ✓        ✕    ✕    ✕    ✕

Table 7-2: Obstacle Detection Results (✓ = detected, ✕ = not detected)
For each of the runs presented in this section, the vehicle was driven toward the
obstacle at a roughly constant speed of between 10 and 25 miles per hour (the upper
restriction occurs because it is the speed limit at our test site).
The system returns several (on the order of 5-15) obstacle regions per frame, corre-
sponding to other objects such as the curbs, lampposts, and buildings, as well as the
obstacle that we have placed. The graphs of this section show segmented results, so
that only the detected obstacles that correspond to the desired obstacle are shown. This
allows us to examine whether the system can detect the obstacle at a given range. A
brief analysis of the other objects that are detected will occur in the following section.
Figure 7-2 shows an example trace of an obstacle detection run. The vehicle
moved towards a 30 centimeter (12”) high white obstacle of the type shown in
Figure 6-6. The data was taken at 4 frames per second. The obstacle is detected in all
but one frame of the data, out to a maximum range of approximately 110 meters
(which is the beginning of the data set).
Figure 7-3 shows a similar trace, this time for a 14 centimeter (6”) black obstacle.
Figure 7-2: Detection trace for 30cm obstacle. [Plot of range (m) versus frame number (4 fps).]
The density of the data is higher because the images were collected at 15 frames per
second. Again, the obstacle is detected reliably from the beginning of the data set
(around 110 meters) until the end of the data set.
The system reaches its limitations when viewing a 9 centimeter (4”) white obstacle,
as in Figure 7-4. The obstacle is not reliably detected until about 40 meters (though it
is detected for one frame at about 55 meters).
Since these results were surprisingly good, we decided to attempt a more difficult
obstacle. Figure 7-5 shows the same type of trace, this time for a standard 12 oz.
(350ml) soda can, which is mostly white. The soda can is first reliably detected at 57
meters.
7.4. Other Detected Objects
Each of the previous examples has shown only the detections that actually repre-
sented the obstacle. Of course, there are likely to be many objects in the world that sat-
isfy our obstacle detection algorithm, which is just looking for surfaces that
Figure 7-3: Detection trace for 14cm obstacle. [Plot of range (m) versus frame number (15 fps).]
consistently appear to be closer to vertical than horizontal. In fact, the system does
detect a large number of objects. A full trace is shown in Figure 7-6. This trace is the
same data set as shown in Figure 7-2, along with an example image from the set and a
diagram showing an overhead view of the scene. The detections can be divided (in this
Figure 7-4: Detection trace for 9cm obstacle. [Plot of range (m) versus frame number (4 fps).]
Figure 7-5: Detection trace for soda can. [Plot of range (m) versus frame number (4 fps).]
case, by hand) into three sets, representing the obstacle, the curb behind the obstacle,
and the building in the background. While this is not an analytical result, it is a con-
vincing argument that the number of false positive obstacles reported from the system
is not high. Perhaps more importantly, there are no false detections that are closer than
the obstacle, implying that the system has a low false positive rate for pixels that corre-
spond to a plain asphalt road surface.
7.5. Lateral Position and Extent
In addition to the range data, the stereo system can also provide us with informa-
tion about the 3D position and extent of the obstacle. This sort of data is shown in
Figure 7-7. The LoG filtering and the width of the SAD summation window tend to
make obstacles appear larger than they really are. The obstacle in this example is about
35 centimeters wide, but we detect its extent to be around 50 centimeters. The strange
trajectory that the obstacle appears to follow is actually due to a mid-course correction
Figure 7-6: Other detected points. [Plot of range (m) versus frame number (4 fps), shown with an example image and an overhead diagram of the scene labelling the road, curb, vehicle, obstacle, and building.]
that the driver made to keep the obstacle from leaving the field of view of the system
as the vehicle approached the obstacle.
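The widening effect described above can be illustrated with a quick back-of-the-envelope calculation. The sketch below is illustrative only: the focal length and total edge smear are assumed numbers, not the system's actual parameters.

```python
def apparent_extent_m(true_width_m, range_m, focal_px, smear_px):
    """Approximate detected extent of an obstacle: the LoG filter support and
    the SAD summation window smear the obstacle's edges outward by roughly
    smear_px pixels in total, and each pixel subtends range_m / focal_px
    meters at the obstacle's range."""
    meters_per_pixel = range_m / focal_px
    return true_width_m + smear_px * meters_per_pixel

# Hypothetical numbers: a 35 cm obstacle seen at 50 m with a 2000-pixel
# focal length and ~6 pixels of total edge smear appears about 50 cm wide.
w = apparent_extent_m(0.35, 50.0, 2000.0, 6)
```

With a telephoto lens (large focal length in pixels) the per-pixel smear in meters stays small even at long range, which is consistent with the modest 35 cm to 50 cm inflation observed.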
7.6. Night Data
In addition to the data shown so far, we have also done experiments at night. The
results of these experiments are shown in Figure 7-8 and Figure 7-9. Figure 7-8 shows
detection of a white 14 centimeter (6”) obstacle. The points marked with “+” are from
images collected with high beams, and the points marked with diamonds are with low
beams. This obstacle was detectable at 100 meters with high beams, and at about 55
meters with low beams. Figure 7-9 shows the results for a black 14 centimeter obstacle
under the same conditions. As expected, the black obstacle is much more difficult to
detect — 60 meters with the high beams, and about 37 meters with the low beams.
Qualitatively, at night-time the system is able to detect obstacles at about the same
time that the obstacle becomes visible in the image to a human. It is questionable
whether a human could identify the object as an obstacle from monocular image data
alone. As an example, the first image in which the black obstacle was detected with
Figure 7-7: Detected obstacle extent and trajectory. [Plot of lateral position (m) versus range (m).]
low beams in Figure 7-9 is shown in Figure 7-10. The black obstacle in the center of
the road is just barely visible; a gray obstacle is also visible on the right side of the
road.
Figure 7-8: Detection trace for 14cm white obstacle at night. “+” data is with high beams, diamonds are with low beams. [Plot of range (m) versus frame number (4 fps).]
Figure 7-9: Detection trace for 14cm black obstacle at night. “+” data is with high beams, diamonds are with low beams. [Plot of range (m) versus frame number (4 fps).]
7.7. Repeated Experiments
In an attempt to show how the probability of detection for a particular obstacle can
be quantified, we have performed a set of repeated experiments. As in Section 7.3, the
obstacle was placed on the road surface, and the car was driven toward the obstacle
while data was collected at 4 frames per second. The data collection continued through
12 different passes toward the obstacle.
Table 7-3 shows the accumulated results of the 12 test runs with a 9 centimeter
(4”) tall black obstacle. Over 1000 sets of images from the three cameras were col-
lected and processed. For each frame, the distance to the obstacle was determined by
our algorithm. If the obstacle was not detected, the ranges reported from frames before
and/or after were interpolated to determine an approximate range. Using the results of
Range bin     Total frames   Frames detected   Percent detected
<30 m              86               86               100
30-40 m            50               50               100
40-50 m            71               71               100
50-60 m            70               70               100
60-70 m            85               84                98.8
70-80 m            81               81               100
80-90 m            93               85                91.3
90-100 m           92               76                82.6
100-110 m         119               99                83.2
110-120 m          91               60                65.9
120-130 m          48               28                58.3
>130 m             62               32                51.6

Table 7-3: Results of repeated experiments
Figure 7-10: First frame in which the black obstacle was detected at night. The black obstacle is in the center, and a gray obstacle is also visible to the right.
this procedure, we classified each frame into one of 12 bins depending on the detected
range. Each such image represents an opportunity to detect an obstacle within a given
range of distances. The row of Table 7-3 marked “total frames” shows the total num-
ber of image frames that were classified into each bin. The next row shows the number
of frames for which the obstacle was detected. The last row then shows the percentage
of frames in which the obstacle was detected.
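The binning and percentage computation can be sketched as follows. This is a hypothetical reimplementation of the analysis procedure, not the code actually used for the thesis.

```python
BIN_EDGES = [30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130]  # meters

def bin_index(range_m):
    """Index of the 12 range bins: <30m, 30-40m, ..., 120-130m, >130m."""
    i = 0
    for edge in BIN_EDGES:
        if range_m >= edge:
            i += 1
    return i

def detection_rates(frames):
    """frames: iterable of (range_m, detected) pairs, one per image frame.
    Returns a list of per-bin (total frames, frames detected, percent)."""
    total, hits = [0] * 12, [0] * 12
    for range_m, detected in frames:
        b = bin_index(range_m)
        total[b] += 1
        hits[b] += bool(detected)
    return [(t, h, round(100.0 * h / t, 1) if t else None)
            for t, h in zip(total, hits)]

# Reproducing the 60-70m bin of Table 7-3: 85 frames, 84 detected -> 98.8%
rates = detection_rates([(65.0, True)] * 84 + [(65.0, False)])
```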
Chapter 8
Conclusions
8.1. Contributions of This Thesis
The primary contribution of this thesis is an obstacle detection system that uses
trinocular stereo to detect very small obstacles at long range on highways. The system
makes use of the apparent orientation of surfaces in the image in order to determine
whether pixels belong to vertical or horizontal surfaces. A simple confidence measure
is applied to reject false positives introduced by image noise. The system is capable of
detecting objects as small as 14cm high at ranges well in excess of 100m. To my
knowledge, no existing system is capable of this level of performance.
In order to make the obstacle detection system function, several other contribu-
tions have been made:
High Precision Calibration Methods. The calibration methodology presented in
Chapter 3 provides a simple method for computing weak calibration parameters, based
only on multiple views of planar surfaces. The precision is increased by the addition of
multiple surfaces. Additionally, an easy method for extending the weak calibration to a
full metric calibration is presented. This method can be applied to the same data used
for weak calibration with the addition of a small number of measurements.
Rectification for Efficient Three Camera Stereo. Chapter 5 presents a method
for rectifying a three camera system so that stereo disparity can be computed effi-
ciently. The method works by the application of constraints to the warping functions
used for rectification.
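A rectifying warp of this kind is a 2D projective transform of pixel coordinates. As a minimal sketch (the matrix entries would come from the constrained solution described above; the identity matrix below is only a placeholder):

```python
def warp_point(H, x, y):
    """Apply a 3x3 projective warping matrix H to pixel (x, y),
    returning the rectified pixel position."""
    xw = H[0][0] * x + H[0][1] * y + H[0][2]
    yw = H[1][0] * x + H[1][1] * y + H[1][2]
    w  = H[2][0] * x + H[2][1] * y + H[2][2]
    return xw / w, yw / w

# With the identity warp, pixels are unchanged.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pt = warp_point(IDENTITY, 3.0, 4.0)
```

The point of constraining the three cameras' warps jointly is that, after rectification, corresponding points lie along axis-aligned scanlines, so the disparity search reduces to a simple shift.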
Analysis of Memory and Cache Usage of Stereo Algorithms. Implementation of
stereo on multi-purpose CPUs (as opposed to special-purpose hardware) requires
some attention to how memory and the CPU cache are used. Chapter 5 presents an
analysis of three different variations on the stereo algorithm with respect to their cache
usage, including benchmarks that support my calculations.
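To make the loop-ordering distinction concrete, here is a scalar sketch of two orderings of a pixelwise matching loop, named after the thesis's (row, column, disparity) notation. This is not the benchmarked MMX code, and the windowed SAD summation is omitted; the two functions produce identical disparities but stream memory very differently: the (d,r,c) version makes one sequential pass over both images per disparity, so its working set per pass is small.

```python
def disparity_rcd(L, R, ndisp):
    """(row, col, disp) ordering: for each pixel, scan all candidate
    disparities before moving on. Touches ndisp pixels of R per output."""
    rows, cols = len(L), len(L[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(ndisp, cols):
            best, best_d = None, 0
            for d in range(ndisp):
                cost = abs(L[r][c] - R[r][c - d])
                if best is None or cost < best:
                    best, best_d = cost, d
            out[r][c] = best_d
    return out

def disparity_drc(L, R, ndisp):
    """(disp, row, col) ordering: one disparity at a time over the whole
    image, keeping a running best-cost image."""
    rows, cols = len(L), len(L[0])
    best = [[None] * cols for _ in range(rows)]
    out = [[0] * cols for _ in range(rows)]
    for d in range(ndisp):
        for r in range(rows):
            for c in range(ndisp, cols):
                cost = abs(L[r][c] - R[r][c - d])
                if best[r][c] is None or cost < best[r][c]:
                    best[r][c], out[r][c] = cost, d
    return out

# The two orderings agree on a small synthetic image pair.
L = [[3, 1, 4, 1, 5, 9, 2, 6], [2, 7, 1, 8, 2, 8, 1, 8]]
R = [[5, 3, 1, 4, 1, 5, 9, 2], [9, 2, 7, 1, 8, 2, 8, 1]]
d1 = disparity_rcd(L, R, 3)
d2 = disparity_drc(L, R, 3)
```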
Efficient Calculation of Stereo to Test Surface Orientation. The method pre-
sented in Section 6.7.1 allows efficient computation of stereo for multiple hypothe-
sized surface orientations at once. The results of this can be used to decide which
surface orientation is most likely.
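Conceptually (this is a simplification, not the simultaneous computation of Section 6.7.1), testing a surface orientation amounts to scoring a window whose tested disparity varies with image row: zero variation hypothesizes a vertical surface, while a fixed row-to-row change hypothesizes the receding ground plane. A sketch over a one-pixel-wide vertical strip:

```python
def orientation_cost(L, R, r, c, d, ddrow, half):
    """SAD over a vertical strip centered at (r, c), where the disparity
    tested at row r+dr is d + ddrow*dr. ddrow = 0 corresponds to a vertical
    surface; ddrow > 0 to the ground plane."""
    cost = 0
    for dr in range(-half, half + 1):
        dd = int(round(ddrow * dr))
        cost += abs(L[r + dr][c] - R[r + dr][c - d - dd])
    return cost

# Synthetic pair: every row of R is the matching row of L shifted by a
# constant disparity of 2, i.e. the patch behaves like a vertical surface.
L = [[(x * x + 3 * y) % 17 for x in range(12)] for y in range(5)]
R = [[row[(x + 2) % 12] for x in range(12)] for row in L]
vertical = orientation_cost(L, R, 2, 6, 2, 0.0, 2)
ground = orientation_cost(L, R, 2, 6, 2, 1.0, 2)
```

Classifying the pixel then reduces to comparing the two costs; here the vertical hypothesis wins, as it should for a constant-disparity patch.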
Implementation in “Slow Real-Time”. The entire obstacle detection system has
been implemented in Intel Pentium MMX assembly language to achieve cycle times
of around 1 second. It has been integrated into a complete obstacle detection and track-
ing system, and demonstrated running live on our test vehicle at speeds of up to 25
MPH.
8.2. Future Work
There are a number of logical directions in which this work could be extended.
8.2.1. Determining More Orientations
An obvious extension of this work would be to adapt the algorithm to compute the
best match out of a number of possible surface orientations (instead of only vertical
and horizontal). The number of orientations that can be distinguished is a function of
both the available surface texture and the window size.
8.2.2. Test in an Offroad Environment
A number of research groups are building cross-country navigation systems. These
systems also need to detect and avoid obstacles. My obstacle detection system should
be tested to determine if it continues to function well in such an environment. In par-
ticular, in the presence of a highly textured environment the choice of window sizes
may become much more important, since a large window may overlap several regions
with different surface orientations. The solution presented in this thesis works well
with a large window size because the ground is relatively bland, so that the texture on
the obstacle dominates. When this is not the case, the system may not be able to detect
such small obstacles.
8.2.3. Use Temporal Information
The obstacle detection system currently views each frame of video as if it were a
completely new situation, independent of what came before. A method for directing
the stereo search to parts of the image that are likely to belong to the road surface, and
particularly to those regions where obstacles are predicted to appear from past data,
could be used to increase both processing speed and accuracy.
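The simplest form of this idea can be sketched as follows: the vehicle's own motion predicts where a static obstacle should appear in the next frame, and the standard disparity relation d = f·B/Z turns a predicted range window into a restricted disparity search. The speed, focal length, and baseline below are hypothetical, not the system's parameters.

```python
def predict_range(range_m, speed_mps, dt_s):
    """Predicted range to a static obstacle one frame later, given the
    vehicle's forward speed."""
    return range_m - speed_mps * dt_s

def disparity_gate(pred_range_m, focal_px, baseline_m, tol_m=5.0):
    """Disparity interval (d_far, d_near) covering pred_range_m +/- tol_m,
    using d = focal_px * baseline_m / Z."""
    d_far = focal_px * baseline_m / (pred_range_m + tol_m)
    d_near = focal_px * baseline_m / max(pred_range_m - tol_m, 1.0)
    return d_far, d_near

# At 10 m/s and 4 fps, an obstacle at 100 m is expected at 97.5 m next frame.
z = predict_range(100.0, 10.0, 0.25)
gate = disparity_gate(z, 2000.0, 1.0)
```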
A more complicated system could combine the current system with data from
vehicle sensors to build an accurate model of the road in front of the vehicle over time,
perhaps even including super-resolution textures.
8.2.4. Obstacle Avoidance
After detecting the obstacles, we of course need to avoid hitting them. This is very
much an open research problem. Once the obstacle has been detected, an appropriate
course of action must be decided. This course of action is a function of (at least) the
size and position of the obstacle, the speed of the vehicle, weather conditions, vehicle
maneuverability, and the state of other vehicles in the vicinity. The options may
include swerving, changing lanes, stopping, straddling the obstacle, slowing down, or
even hitting the obstacle.
8.2.5. Further Optimizations and Speed Enhancements
Although these are not research topics per se, the obstacle detection system could
benefit from another pass or two of optimization. The following paragraphs highlight
some of the places where optimization is likely to be fruitful.
In accordance with the results derived in Section 5.3.3.7, a speed improvement of
approximately 50% is possible by simply reducing the amount of data (image size and
number of disparities) that is processed in one chunk. Since it is necessary to continue
to process large images and large numbers of disparities, a method that allows efficient
division of the problem into smaller problems without introducing a lot of overhead
would be necessary to take advantage of this.
Additionally, since the cache requirements of the (d,r,c) algorithm are much
smaller than either of the other options, a fast method of implementing this algorithm
with SIMD instructions has the potential of running much faster than the current
implementation.
The optimization of Section 6.7.1, while much faster than computing stereo sepa-
rately for the two different surface orientations, runs slowly because it performs a
large number of unaligned accesses on the Pentium II processor. I am not convinced
that there is no way to avoid this.
Further optimization should also be possible in the LoG filtering process. This
operation by itself should be possible at near frame rate. If that does not seem possible
in software, then special-purpose hardware for performing 2D convolutions could be
employed. Similarly, the rectification process is nothing but a 2D projective warping
of the image. This process is very common in 3D rendering (for texture mapping), and
thus very fast and cheap graphics hardware to perform this operation is available. It
would not be surprising if faster software implementations also existed.
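For reference, the LoG kernel itself is cheap to construct; the sketch below builds the standard discrete Marr-Hildreth operator. The kernel size and sigma are illustrative choices, not the values used by the system.

```python
import math

def log_kernel(sigma, size):
    """Discrete Laplacian-of-Gaussian kernel, shifted to zero mean so the
    filter responds with zero on constant (textureless) regions."""
    h = size // 2
    s2 = sigma * sigma
    k = [[0.0] * size for _ in range(size)]
    for y in range(-h, h + 1):
        for x in range(-h, h + 1):
            r2 = x * x + y * y
            k[y + h][x + h] = ((r2 - 2 * s2) / (s2 * s2)) * math.exp(-r2 / (2 * s2))
    # Subtract the mean introduced by truncating the infinite support.
    m = sum(map(sum, k)) / (size * size)
    for row in k:
        for i in range(size):
            row[i] -= m
    return k

k = log_kernel(1.0, 9)
total = sum(map(sum, k))
```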
Very little effort has been made to optimize the section of the code that takes the
output of stereo matching and finds the obstacle regions. Since this code takes a signif-
icant fraction of the time spent by the obstacle detection algorithm, it may be worth-
while to take another look at it to see what optimizations are possible. Since many of
the algorithms used are common vision techniques (such as connected components
labelling), optimized libraries may be available.
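Connected components labelling on a binary obstacle mask can be sketched as below. This is a generic flood-fill version for illustration, not the optimized library routine suggested above.

```python
from collections import deque

def label_components(mask):
    """4-connected component labelling of a binary mask via BFS flood fill.
    Returns (number of components, label image with labels 1..n)."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    n_labels = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not labels[r][c]:
                n_labels += 1
                labels[r][c] = n_labels
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        yy, xx = y + dy, x + dx
                        if (0 <= yy < rows and 0 <= xx < cols
                                and mask[yy][xx] and not labels[yy][xx]):
                            labels[yy][xx] = n_labels
                            queue.append((yy, xx))
    return n_labels, labels

# Two separate blobs -> two candidate obstacle regions.
n, lab = label_components([[1, 1, 0, 0],
                           [0, 0, 0, 1],
                           [0, 0, 1, 1]])
```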