Accuracy in Real-Time Depth Maps

John MORRIS
Centre for Image Technology and Robotics (CITR)
Computer Science/Electrical Engineering
University of Auckland, New Zealand
and
School of Electrical and Electronics Engineering, Chung-Ang University, Seoul

[Photo: Iolanthe II drifting off Waiheke Is]

Outline

• Background
• Motivation

• Problem
• Collision Avoidance

• Accuracy
• Parallel Axes case
• Verging Axes
• Optimizing

• Active Illumination
• Algorithm Performance

• Stereo Algorithms
• Which one is best?

Motivation

• Stereo Vision has many applications
  • Aerial Mapping
  • Forensics
    • Crime Scenes
    • Traffic Accidents
  • Mining
    • Mine face measurement
  • Civil Engineering
    • Structure monitoring
  • General Photogrammetry
    • Non contact measurement
• Most of these are not time critical …

Motivation

• Time-critical Applications
  • Most existing applications are not time critical
  • Several would benefit from real-time feedback as data is collected
    • Traffic accident scene assessment
      • There’s pressure to clear the scene and let traffic continue
      • Investigators have to rely on experience while taking images
    • Mining
      • Real-time feedback could direct machinery to follow a pre-determined plan
    • …

and then there’s …
• Collision avoidance
  • Without real-time performance, it’s useless!

Motivation

• Collision avoidance
  • Why stereo?
  • RADAR keeps airplanes from colliding
  • SONAR
    • Keeps soccer-playing robots from fouling each other
    • Guides your automatic vacuum cleaner
  • Active methods are fine for ‘sparse’ environments
    • Airplane density isn’t too large
    • Only 5 robots / team
    • Only one vacuum cleaner

Motivation

• Collision avoidance• What about Seoul (Bangkok, London, New York, …) traffic?

• How many vehicles can rely upon active methods?

• Reflected pulse is many dB below probe pulse!

• What fraction of other vehicles can use the same active method before even the most sophisticated detectors get confused? (and car insurance becomes unaffordable)

• Sonar, in particular, is also subject to considerable environmental noise

• Passive methods (sensor only) are the only ‘safe’ solution
• In fact, with stereo, one technique for resolving problems may be assisted by environmental noise!

Stereo Photogrammetry

Pairs of images giving different views of the scene can be used to compute a depth (disparity) map

Key task – Correspondence
Locate matching regions in both images

Epipolar constraint
Align images so that matches must appear in the same scan line in L & R images

Depth Maps

[Images: Ground Truth; Computed: Census; Computed: Pixel-to-Pixel]

Which is the better algorithm?

Vision research tends to be rather visual! There is a tendency to publish images ‘proving’ efficacy, efficiency, etc.

Performance and Accuracy

• I will use
• Performance to describe the quality of matching
  • For how many points was the distance computed correctly?
  • Metrics
    • % of good matches
    • Standard deviation of matching error distribution
  • Function of the matching algorithm, image quality, etc
• Accuracy for precision of depth measurements
  • Assuming a pixel is matched correctly, how accurate is the computed depth? or
  • What is the resolution of depth measurements?
  • Metric
    • Error in depth - absolute or relative (% of measured depth)
  • Function of stereo configuration and sensor resolution (pixel number and size)

Accuracy

• Traditional (film-based) stereophotogrammetry limited by film grain size
  • Small enough so that mechanical accuracy of the measuring equipment became the limiting factor and
  • Accuracy was determined by your $ budget
    • More $s -> higher resolution equipment
• Mapping
• Digital cameras
  • discrete (large but shrinking!) pixels: significant accuracy considerations

Stereo Geometry

• How accurate are these depth maps?
  • In collision avoidance, we need to know the current distance to an object and be able to derive our relative velocity
• Example:
  • An object’s image ‘has a disparity of 20 pixels’
    = Its image in the R image is displaced by 20 pixels relative to the L image
  • Accuracy of its position?
    • First approximation ~ 5% (1 / 20)
• How do we obtain better accuracy?
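The 1/d estimate can be checked numerically. The following is a minimal sketch, assuming the parallel-axis relation z = b f / (p d); the baseline, focal length and pixel size are illustrative values chosen here, not figures from the talk. A one-pixel change in a disparity of d pixels changes the computed distance by the fraction 1/(d+1), which is close to 1/d.

```python
def depth_from_disparity(d_pixels, baseline_m=0.5, focal_m=0.008, pixel_m=10e-6):
    """Distance for a disparity in pixels: z = b * f / (p * d).

    The default camera parameters are illustrative assumptions only.
    """
    return baseline_m * focal_m / (pixel_m * d_pixels)


def relative_depth_step(d_pixels):
    """Relative change in z when the disparity grows by one pixel."""
    z = depth_from_disparity(d_pixels)
    z_next = depth_from_disparity(d_pixels + 1)
    return (z - z_next) / z


if __name__ == "__main__":
    for d in (5, 10, 20, 40):
        print(f"d = {d:3d} px  z = {depth_from_disparity(d):6.2f} m  "
              f"one-pixel step = {100 * relative_depth_step(d):4.1f}%  "
              f"(1/d = {100 / d:4.1f}%)")
```

The relative step does not depend on the camera parameters, only on the disparity, so larger disparities (closer objects or longer baselines) give finer relative depth steps.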

Stereo Camera Configuration

• Standard Case: two cameras with parallel optical axes

b: baseline (camera separation)
camera angular FoV
Dsens: sensor width
n: number of pixels
p: pixel width
f: focal length
a: object extent
D: distance to object

Stereo Camera Configuration

• Standard Case – Two cameras with parallel optical axes

• Rays are drawn through each pixel in the image

• Ray intersections represent points imaged onto the centre of each pixel

Points along these lines have the same LR displacement (disparity)

but
• an object must fit into the Common Field of View (CFoV)
• Clearly depth resolution increases as the object gets closer to the camera
• Distance, $z = \dfrac{b f}{p\, d}$, where d is the disparity (in pixels), f the focal length and p the pixel size
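Differentiating the distance relation shows how the depth step grows with distance. This is a short derivation, assuming the disparity d is measured in whole pixels, so that the smallest resolvable change is $\delta d = 1$:

```latex
z = \frac{b f}{p\, d}
\quad\Rightarrow\quad
\frac{\partial z}{\partial d} = -\frac{b f}{p\, d^{2}} = -\frac{p\, z^{2}}{b f}
\quad\Rightarrow\quad
|\delta z| \approx \frac{p\, z^{2}}{b f}\,\delta d ,
\qquad
\frac{|\delta z|}{z} \approx \frac{\delta d}{d}
```

The absolute depth step grows with the square of the distance and shrinks with the product b f. The relative step is simply one pixel in d.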

Depth Accuracy – Canonical Configuration

[Plot: horizontal axis D2 (m), 0 to 10; vertical axis D2 (m), -0.005 to 0.03; annotations: Asymptote, Best D2]

• Given an object of extent, a, there’s an optimum position for it!
• Assuming baseline, b, can be varied
• Common fallacy – just increase b to increase accuracy

Stereo Camera Configuration

• This result is easily understood if you consider an object of extent, a
• To be completely measured, it must lie in the Common Field of View (CFoV)
but
• place it as close to the camera as you can so that you can obtain the best accuracy, say at D
• Now increase b to increase the accuracy at D
• But you must increase D so that the object stays within the CFoV!
• Detailed analysis leads to the previous curve and an optimum value of b for a given a

[Figure: lines of constant LR displacement (disparity), with b, D and a marked]

Stereophotogrammetry vs Collision Avoidance

• This result is more relevant for stereo photogrammetry
  • You are trying to accurately determine the geometry of some object
  • It’s fragile, dangerous, … and you must use non-contact measurement
• For collision avoidance, you are more concerned with measuring the closest approach of an object (ie any point on the object!)
  • you can increase the baseline so that the critical point stays within the CFoV

[Figure label: Dcritical]


Increasing the baseline

[Plot: % good matches against baseline, b]

Images: ‘corridor’ set (ray-traced)
Matching algorithms: P2P, SAD

Increasing the baseline decreases performance!!

Increasing the baseline

Examine the distribution of errors

[Plot: standard deviation against baseline, b]

Images: ‘corridor’ set (ray-traced)
Matching algorithms: P2P, SAD

Increasing the baseline decreases performance!!

Increased Baseline, Decreased Performance

• Reasons
  • Statistical
    • Higher disparity range: increased probability of matching incorrectly - you’ve simply got more choices!
  • Perspective
    • Scene objects are not fronto-planar
    • Surfaces angled to the camera axes subtend different numbers of pixels in L and R images
  • Scattering
    • Perfect scattering (Lambertian) surface assumption
    • OK at small angular differences, increasing failure at higher angles
  • Occlusions
    • Number of hidden regions increases as angular difference increases: increasing number of ‘monocular’ points for which there is no 3D information!

Accuracy in Collision Avoidance

• Accuracy is important!
  • Your ability to calculate an optimum avoidance strategy depends on an accurate measure of the collision velocity
  • Luckily, accuracy does increase as an object approaches the critical region, but we’d still like to measure the collision velocity accurately at as large a distance as possible!
• For parallel camera axes,
  $D = \dfrac{f b}{d}$
• where
  $d = x_L - x_R = n p$

Nice, simple (if reciprocal) relationship!

D: distance
f: focal length
b: baseline
d: measured disparity
xL|R: position in L|R image
n: number of pixels
p: pixel size

Parallel Camera Axis Configuration

• Accuracy depends on d - or the difference in image position in L and R images and, in a digital system, on the number of pixels in d
• Measurable regions also must lie in the CFoV
• This configuration is rather wasteful
  • Observe how much of the image planes of the two cameras is wasted!

[Figure label: Dcritical]

Evolution

• Human eyes ‘verge’ on an object to estimate its distance, ie the eyes fix on the object in the field of view

[Figures: configuration commonly used in stereo systems (parallel axes); configuration discovered by evolution millions of years ago (verging axes)]

Note immediately that the CFoV is much larger!

Nothing is free!

• Since the CFoV is much larger, more sensor pixels are being used and depth accuracy should increase
but
• Geometry is much more complicated!
  • Position on the image planes of a point at (x,z) in the scene:

    $x_L = \dfrac{f}{p} \tan\left( \arctan\dfrac{b+2x}{2z} - \phi \right)$

    $x_R = \dfrac{f}{p} \tan\left( \arctan\dfrac{b-2x}{2z} - \phi \right)$

    where $\phi$ is the vergence angle

• Does the increased accuracy warrant the additional computational complexity?

Note: In real fixed systems, computational complexity can be reduced, see the notes on real-time stereo!
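The image-position expressions for verging axes can be evaluated directly. The sketch below rests on two assumptions that are not stated on the slide: the second expression is taken to be the R-image position, and each camera's coordinate is measured towards the other camera, so the disparity is the sum of the two positions. Under those assumptions the parallel-axis case (vergence angle 0) reduces to f b / (p z), and the disparity vanishes at the fixation point. The camera parameters are illustrative values only.

```python
import math


def image_positions(x, z, baseline, focal, pixel, vergence):
    """Pixel positions of scene point (x, z) in the L and R images.

    Follows the slide's expressions; each coordinate is assumed to be
    measured towards the other camera (mirrored sign convention).
    """
    scale = focal / pixel
    x_left = scale * math.tan(math.atan((baseline + 2 * x) / (2 * z)) - vergence)
    x_right = scale * math.tan(math.atan((baseline - 2 * x) / (2 * z)) - vergence)
    return x_left, x_right


def disparity(x, z, baseline, focal, pixel, vergence):
    """Disparity in pixels; a sum because of the mirrored coordinates."""
    x_left, x_right = image_positions(x, z, baseline, focal, pixel, vergence)
    return x_left + x_right


if __name__ == "__main__":
    b, f, p = 0.5, 0.008, 10e-6        # illustrative values only
    z_fix = 5.0                        # fixation distance on the centre line
    phi = math.atan(b / (2 * z_fix))   # vergence angle that fixates (0, z_fix)

    print(disparity(0.0, z_fix, b, f, p, phi))   # ~0 at the fixation point
    print(disparity(0.0, 4.0, b, f, p, phi))     # nearer point: positive
    print(disparity(0.0, 6.0, b, f, p, phi))     # farther point: negative
    print(disparity(0.3, 5.0, b, f, p, 0.0), b * f / (p * 5.0))  # parallel axes
```

Points nearer than the fixation point give positive disparities and farther points give negative ones, which is the behaviour that zero disparity matching relies on.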

Depth Accuracy

OK - better … but it’s not exactly spectacular!

Is it worth the additional computational load?

Depth accuracy

A minor improvement?

• What happened?

• As the cameras turn in, Dmin gets smaller!
• If Dmin is the critical distance, D < Dmin isn’t useful!

This area is now wasted!

Look at the optical configuration!

• If we increase f, then Dmin returns to the critical value!

[Figures: Original f; Increased f]

Depth Accuracy - Verging axes, increased f

Now the depth accuracy has increased dramatically!

Note that at large f, the CFoV does not extend very far!

Increased focal length

• Lenses with large f
  • Thinner
  • Fewer aberrations
    • Better images
  • Cheaper?
• Alternatively, lower pixel resolution can be used to achieve better depth accuracy ...

Zero disparity matching

• With verging axes, at the fixation point, scene points appear with zero disparity (in the same place on both L and R images)
• If the fixation point is set at some sub-critical distance (eg an ‘early warning’ point), then matching algorithms can focus on a small range of disparities about 0
• With verging axes, both +ve and -ve disparities appear
  • Potential for fast, high performance matching focussing on this region
  • Possible research project!

This is similar to the way our vision system works: we focus on the area around the fixation point and have a higher density of photoreceptors (cones) in the centre of our retina

[Figure: loci for d = 0, d = +1 and d = -1]

Non-parallel axis geometry

• Points with the same disparity lie on circles now

• For parallel axes, they lie on straight lines

Verging axis geometry

• Points with the same disparity lie on Vieth-Müller circles with the baseline as a chord
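The circle property follows from the inscribed-angle theorem: every point on an arc through both camera centres sees the baseline under the same angle, so the angular disparity is constant along that arc. The following is a minimal numerical check, assuming camera centres at x = -b/2 and x = +b/2 on the line z = 0 and a fixation point on the centre line; the function names and values are illustrative only.

```python
import math


def vieth_muller_circle(baseline, z_fix):
    """Centre height and radius of the circle through both camera centres
    (at x = -b/2 and x = +b/2, z = 0) and the fixation point (0, z_fix)."""
    half_b = baseline / 2
    centre_z = (z_fix ** 2 - half_b ** 2) / (2 * z_fix)
    radius = z_fix - centre_z
    return centre_z, radius


def subtended_angle(x, z, baseline):
    """Angle between the rays from scene point (x, z) to the two cameras."""
    half_b = baseline / 2
    return math.atan2(x + half_b, z) - math.atan2(x - half_b, z)


if __name__ == "__main__":
    b, z_fix = 0.5, 5.0                      # illustrative values only
    centre_z, radius = vieth_muller_circle(b, z_fix)
    for theta_deg in (-60, -30, 0, 30, 60):  # positions around the far arc
        theta = math.radians(theta_deg)
        x = radius * math.sin(theta)
        z = centre_z + radius * math.cos(theta)
        print(f"x = {x:6.2f}  z = {z:5.2f}  "
              f"angle = {subtended_angle(x, z, b):.6f} rad")
```

Every printed angle equals twice the vergence angle that fixates (0, z_fix), which is what makes this circle the zero-disparity locus.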

Zero disparity matching (ZDM)

• Using a fixation point in some critical region introduces the possibility of faster matching
• It can alleviate the statistical factor reducing matching quality
  • You search over a restricted disparity range
  • Several ‘pyramidal’ matching techniques have been proposed (and success claimed!) for conventional parallel geometries
  • These techniques could be adapted to ZDM
• Care:
  • It has no effect on the other three factors!

Why is stereo such a good candidate for dense collision avoidance applications?

• One serious drawback
  • It doesn’t work with textureless or featureless regions
  • There’s nothing for the matching algorithm to match!
• Active illumination
  • Impressing a textured pattern (basically any one will do!) on the scene
  • Several groups (including ours!) have demonstrated that this is effective - increasing matching performance significantly
• Real benefit
  • Environmental ‘noise’ (ambient light patterns) does not interfere!!
  • In fact, it may provide the texture needed to assist matching

Thus multiple vehicles impressing ‘eye-safe’ (near IR) patterns onto the environment should only help each other

Metrics (Introduce some science!)

• Compute the distribution of differences between depth maps derived for an algorithm and the ground truth

[Histogram: distribution of differences; horizontal axis from -44 to +41, vertical axis from -0.1 to 0.6]

From this distribution, we can derive several measures:
• % of good matches (error ≤ 0.5)
• Histogram mean (bias)
• Histogram Std Dev (spread)

Generally we have used the “% of good matches” metric
Mean and standard deviation are used as auxiliary metrics
Running time was also measured
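The three measures can be computed from a pair of disparity maps in a few lines. This is a minimal sketch, assuming both maps are supplied as flat sequences of disparities in pixels; the function name and the toy data are hypothetical, and the 0.5 pixel tolerance is the threshold quoted above.

```python
import statistics


def matching_metrics(computed, ground_truth, tolerance=0.5):
    """Summarise disparity errors between a computed map and the ground truth.

    Both maps are given as flat sequences of disparities in pixels.
    """
    errors = [c - g for c, g in zip(computed, ground_truth)]
    good = sum(1 for e in errors if abs(e) <= tolerance)
    return {
        "good_match_percent": 100.0 * good / len(errors),
        "mean": statistics.fmean(errors),      # bias of the histogram
        "std_dev": statistics.pstdev(errors),  # spread of the histogram
    }


if __name__ == "__main__":
    truth = [10, 10, 12, 12, 15, 15, 20, 20]
    computed = [10, 11, 12, 12, 15, 13, 20, 20]   # two mismatches
    print(matching_metrics(computed, truth))
```

A real evaluation would also have to decide how to treat pixels with no ground truth or no computed match; this sketch ignores that question.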

StereoSets

Ray-Traced Images …

SNR = SNR = +36dB SNR = 0 dB

• The ‘Corridor’ set are synthetic (perfect) images
  • Generated by ray-tracing software
  • Possible to corrupt them with various levels of noise to test robustness

Algorithms Taxonomy

• Area-based
  • Match regions in both images, eg 9×9 windows surrounding a pixel
  • Dense depth maps
    • A depth assigned to every pixel
  • Tend to have dataflow computation styles
    • Most suitable for hardware implementation
• Feature-based
  • Look for features first, then attempt matches
    • eg edge-detect, then match edges
  • Sparse depth maps
  • Less suitable for hardware implementation
    • More branches in the logic

We concentrated on area-based algorithms
Our original goal was hardware (FPGA) implementation

Some trials on simple feature-based matching showed no improvement over area-based algorithms

Algorithms

• Area-based
  • Correlation
    • A window is moved along a scanline in the R image until the best match with a similar-sized window in the L image is found
    • ‘Best match’ defined by various cost functions
      • Multiplicative correlation
      • Normalized squared differences
      • … (many other variations!)
      • Sum of absolute differences (SAD)
    • Ignore occlusions (pixels visible in one image only)
• Dynamic
  • Attempt to find the best matching path through a region bounded by the corresponding pixel in the R image and the maximum disparity
  • Can recognize occlusions
• … and many more (graph cut, pyramidal, optical flow, … )
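The correlation search described above can be sketched for a single scanline. This is an illustration under stated assumptions, not the implementation evaluated in the talk: it uses a one-dimensional window rather than a square one for brevity, takes SAD as the cost function, and assumes a point at position x in the L row appears at x - d in the R row. All names are hypothetical.

```python
def sad_cost(left_row, right_row, x, disparity, radius):
    """Sum of absolute differences between a window centred on x in the
    L scanline and the window shifted by `disparity` in the R scanline."""
    return sum(
        abs(left_row[x + k] - right_row[x + k - disparity])
        for k in range(-radius, radius + 1)
    )


def match_scanline(left_row, right_row, max_disparity, radius):
    """Best disparity for every pixel whose search windows fit in the row."""
    disparities = {}
    for x in range(radius + max_disparity, len(left_row) - radius):
        costs = [sad_cost(left_row, right_row, x, d, radius)
                 for d in range(max_disparity + 1)]
        disparities[x] = costs.index(min(costs))  # lowest cost wins
    return disparities


if __name__ == "__main__":
    right = [3, 7, 1, 9, 4, 8, 2, 6, 5, 0, 7, 3, 9, 1, 8, 4]
    shift = 2
    left = [0] * shift + right[:-shift]   # L row is the R row shifted by 2 pixels
    print(match_scanline(left, right, max_disparity=4, radius=1))
```

Because the cost for each candidate disparity is independent of the others, the inner list comprehension is the part that hardware can evaluate in parallel.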

Algorithms Evaluated

• Area-based
  • Correlation
    • 3 different cost functions
      • Multiplicative correlation
      • Normalized squared differences
      • Sum of absolute differences (SAD)
  • Census
    • Reduces pixel intensity differences to a single bit
    • Counts bit differences
    • Claimed suitable for hardware implementation
• Dynamic
  • Birchfield and Tomasi’s Pixel-to-Pixel chosen because it takes occlusions into account
• Most others are too computationally expensive for real-time implementation
  • Even taking potential parallelism in hardware into account!
  • eg graph-cut (best results, but slow – >100s per image!)

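Algorithm Details

• Correlation Algorithm Cost Functions
  • Corr1 – Normalized intensity difference
    $C(x,y,\delta) = \dfrac{\sum \left( I_L(x,y) - I_R(x,y-\delta) \right)^2}{\sqrt{\sum I_L(x,y)^2 \, \sum I_R(x,y-\delta)^2}}$
  • Corr2 – Normalized multiplicative correlation
    $C(x,y,\delta) = \dfrac{\sum I_L(x,y)\, I_R(x,y-\delta)}{\sqrt{\sum I_L(x,y)^2 \, \sum I_R(x,y-\delta)^2}}$
  • SAD
    $C(x,y,\delta) = \sum \left| I_L(x,y) - I_R(x,y-\delta) \right|$
  where the sums run over the matching window and $\delta$ is the candidate disparity
• Census
  • Rank ordering of pixel intensities over an inner window forms a ‘census vector’ (one bit / pixel in window)
  • Cost function is Hamming distance of these vectors
  • Summed over outer window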
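The Census cost can be sketched in the same style. The code assumes the usual form of the census transform, in which each neighbour in the inner window is compared with the centre pixel to give one bit. The details of the variant evaluated in the talk are not given, so the function names, window radii and test image below are all illustrative assumptions.

```python
def census_vector(image, row, col, radius):
    """One bit per neighbour in the inner window: 1 if darker than the centre."""
    centre = image[row][col]
    return [
        1 if image[row + dr][col + dc] < centre else 0
        for dr in range(-radius, radius + 1)
        for dc in range(-radius, radius + 1)
        if (dr, dc) != (0, 0)
    ]


def hamming(a, b):
    """Number of positions at which two census vectors differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))


def census_cost(left, right, row, col, disparity, inner, outer):
    """Hamming distances summed over the outer window around (row, col)."""
    total = 0
    for dr in range(-outer, outer + 1):
        for dc in range(-outer, outer + 1):
            r, c = row + dr, col + dc
            total += hamming(census_vector(left, r, c, inner),
                             census_vector(right, r, c - disparity, inner))
    return total


if __name__ == "__main__":
    right = [[(7 * r + 3 * c * c) % 11 for c in range(12)] for r in range(7)]
    left = [[0, 0] + row[:-2] for row in right]   # L image: R image shifted by 2
    costs = [census_cost(left, right, 3, 7, d, inner=1, outer=1) for d in range(4)]
    print(costs)   # the cost at d = 2 is zero, where the windows coincide
```

Only comparisons and bit counts are needed, which is consistent with the claim that Census suits hardware implementation.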

Typical set of experiments

• Census algorithm
  • Two operational parameters
    – length of the census vector, ie size of the window over which a rank (ordering) transform is performed
    – size of correlation window
  • Trials were run for all reasonable combinations of the two parameters on all 6 test images
    • + one additional aerial photograph pair from IGN, Paris
• These trials locate optimal values of the algorithm parameters
  • w, window ‘radius’ for simple correlation algorithms
  • the two window parameters for Census
  • (match reward, occlusion penalty) for Pixel-to-pixel

Census Good Matches – Corridor

[Surface plot: good match % (0% to 70%) for the Corridor set against the two Census parameters; marked point: one parameter = 4, β = 3]

Census Good Matches – All images

[Surface plots: good match % against the two Census parameters, one panel per image set: Corridor (0% to 70%), Madroom (0% to 35%), Map (0% to 100%), Sawtooth (0% to 100%), Tsukuba (0% to 70%), Venus (0% to 100%)]

The setting with one parameter = 4 and β = 3 is close to best for all images

Census – Corridor - Metrics

[Surface plots against the two Census parameters: Std. Dev. (0 to 6) and Mean (-8 to 0)]

Mean
• Approaches 0, as expected, for larger windows

Std. Dev.
• Becomes smaller for larger windows
• Narrower error peak centred on zero
• Matching really is improving!

Pixel-to-Pixel

• Birchfield and Tomasi
• ‘Dynamic’ algorithm
• Attempts to find the best matching ‘path’
• Cost function
  $\gamma(M) = N_{occ}\,\kappa_{occ} - N_m\,\kappa_r + \sum \text{dissimilarity}$
  where $N_{occ}$ is the number of occlusions and $N_m$ the number of matches
• Variable parameters
  • $\kappa_{occ}$ – Occlusion penalty
  • $\kappa_r$ – Matching reward
• Dissimilarity
  • Usually | IL – IR |
  • Other variations possible: sub-pixel matching, etc
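The cost function can be evaluated for a given candidate match sequence as follows. This sketch covers only the cost of one sequence, not the dynamic-programming search that finds the cheapest one. The function name and the toy scanlines are hypothetical, and the penalty and reward values are simply the ones reported elsewhere in this talk as working well.

```python
def match_sequence_cost(left_row, right_row, matches, occlusions,
                        occlusion_penalty, match_reward):
    """Cost of one candidate match sequence along a scanline pair.

    `matches` is a list of (x_left, x_right) pixel pairs and `occlusions`
    is the number of occlusions in the sequence.
    """
    dissimilarity = sum(abs(left_row[xl] - right_row[xr]) for xl, xr in matches)
    return (occlusions * occlusion_penalty
            - len(matches) * match_reward
            + dissimilarity)


if __name__ == "__main__":
    left = [10, 10, 50, 52, 90, 90]
    right = [10, 50, 51, 90, 90, 90]
    # Two hypothetical explanations of the same pair of scanlines
    all_shifted = [(1, 0), (2, 1), (3, 2), (4, 3), (5, 4)]
    unshifted = [(i, i) for i in range(6)]
    for name, seq, n_occ in (("shift by 1", all_shifted, 1),
                             ("no shift", unshifted, 0)):
        print(name, match_sequence_cost(left, right, seq, n_occ,
                                        occlusion_penalty=30, match_reward=8))
```

The shifted sequence pays one occlusion penalty but has almost no dissimilarity, so it has the lower cost (-9 against 31). A larger occlusion penalty makes the algorithm less willing to accept such disparity changes.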

Pixel-to-Pixel Results

[Surface plots: good match % against the occlusion penalty (occ) and matching reward (r), one panel per image set: P2P - Corridor - Zero Error (0% to 70%), P2P - Madroom - Zero Error (0% to 60%), P2P - Map - Zero Error (0% to 100%), P2P - Sawtooth - Zero Error (80% to 90%), P2P - Tsukuba - Zero Error (62% to 80%), P2P - Venus - Zero Error (0% to 100%)]

occ = 30, r = 8 produces good results for all images

Correlation Results

[Plot: Corridor set, Good Match % (0% to 90%) against window radius r from 1 to 10, where the window is (2r+1)×(2r+1); series: Corr1, Corr2, SAD]

Optimum, r ~ 4 (9×9 window)

Compare algorithms

• Measure the performance!

[Bar chart: % correct matches (0% to 100%) for Corridor, Madroom, Map, Sawtooth, Tsukuba and Venus; series: Census - Zero Error, Census 2 - Zero Error, SAD, Correlation]

SAD performs as well as the others over a range of images!

Large matching windows are better ...

• 6 sets of images

[Plots: % correct match against window ‘radius’ (1 to 10), one panel each for Corridor, Madroom, Map, Sawtooth, Tsukuba and Venus; series: Correlation - Corr1, Correlation - SAD]

Comparisons – Good Matches

[Bar chart: good match % (0% to 100%) for Corridor, Madroom, Map, Sawtooth, Tsukuba and Venus; series: Census (4, 3), Census (7, 5), Corr1 (4), Corr1 (10), Corr2 (4), Corr2 (10), SAD (4), SAD (10), P2P (5, 6); annotations: Most good matches, Lowest Std. Dev.]

Using best parameters for ‘Corridor’

Running Time

[Bar chart: Corridor set, vertical axis 0.0 to 400.0; bars in order: Census (4,3), Census (7,5), Corr1 (4), Corr1 (10), Corr2 (4), Corr2 (10), SAD (4), SAD (10), P2P (5, 6); data labels in order of appearance: 58.4, 360.7, 3.8, 19.4, 3.5, 17.2, 4.1, 21.0, 2.2, 2.4, 2.2]

Robustness to Noise

[Images: Original (no noise); SNR = +36dB; SNR = +24dB; SNR = 0dB]

Effect of Noise on Matching Quality

[Plot: vertical axis 0% to 70%; horizontal axis SNR from Inf dB, then 60dB down to -15dB in 3dB steps; series: Census (4, 3), Census (7, 5), Corr1 (4), Corr1 (10), Corr2 (4), Corr2 (10), SAD (4), SAD (10), P2P (5, 6); annotated curves: P2P(5,5), SAD(4)]

Which algorithm?

• Dynamic algorithms (Pixel-to-Pixel) perform best
  • Better matching in most tests
  • Detect occlusions
  • Run faster
• Sum of absolute differences (SAD) is almost as good
  • For hardware implementation, it’s
    • Simple
      • Mainly adders or subtractors
    • Regular
      • Space efficient
    • Parallel
      • Each possible disparity can be evaluated at the same time
• We have built VHDL models and demonstrated that practical systems will fit onto modern FPGAs and run at 30fps using the SAD algorithm

Conclusions

• Verging camera configurations provide
  • better accuracy
    • but change f also to get the best results!
  • potential for faster / better matching
• Active illumination solves a key matching problem and is not sensitive to environmental noise
• For hardware implementation, dynamic programming works well

[Photo: Iolanthe II waiting in Whangarei for the Whangarei-Vanuatu race start, June, 2007]