Graphical Models for Computer Vision
![Page 1: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/1.jpg)
Graphical Models for Computer Vision
Pedro F. Felzenszwalb
Brown University
Joint work with Dan Huttenlocher, Joshua Schwartz,
Ross Girshick, David McAllester, Deva Ramanan, Allie Shapiro, John Oberlin
![Page 2: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/2.jpg)
Vision Problems
• Low-level vision: Restoration, Depth estimation
• High-level vision: Segmentation, Recognition
[Slide collage of paper excerpts: Figure 8 (restoration of an input with missing values), Figure 5 (stereo results for the Tsukuba image pair), Figure 6 (energy of the stereo solution vs. number of message update iterations, multiscale vs. standard), and summary text from a belief propagation paper evaluated on the Middlebury stereo benchmark.]
![Page 3: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/3.jpg)
Bayesian Framework
• Bayesian approach
- We observe an image Y
- Hidden variables X --- depth map, object labels, etc.
- Vision involves statistical inference --- P(X|Y)
• Challenges
- Building good models for X and Y
- Thousands of random variables and large state spaces
![Page 4: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/4.jpg)
• Image Restoration
• Object Detection
• Multi-scale Models
![Page 5: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/5.jpg)
Image Restoration
• Random variables
- X : clean picture
- Y : observed image
• P(X) : Markov random field
- Nearby pixels tend to be similar
- Markov blanket = 4 neighbors
• P(Y|X) : iid noise at each pixel
- Yi = Xi + ei
[Images: clean picture X and observed image Y]
![Page 6: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/6.jpg)
MAP estimation
• Minimize -log P(X|Y)

E(X) = Σ_i D(Xi, Yi) + Σ_ij V(Xi, Xj)

- D enforces consistency with the data (-log P(Y|X))
- V enforces smoothness (-log P(X))
• Computational burden
- huge number of variables
- large state spaces
- high treewidth
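The energy above can be evaluated directly; a minimal Python sketch on a 4-connected grid, with a quadratic data cost and truncated-quadratic smoothness cost as illustrative choices (the function name and parameters are ours, not from the talk):

```python
def mrf_energy(X, Y, lam=1.0, trunc=100.0):
    """E(X) = sum_i D(Xi, Yi) + sum_ij V(Xi, Xj) on a 4-connected grid.

    D is a quadratic data term; V is a truncated-quadratic smoothness
    term. X and Y are lists of rows of equal size.
    """
    rows, cols = len(X), len(X[0])
    energy = 0.0
    for i in range(rows):
        for j in range(cols):
            # data term: consistency with the observation
            energy += (X[i][j] - Y[i][j]) ** 2
            # smoothness over right and down neighbors (each edge once)
            if j + 1 < cols:
                energy += min(lam * (X[i][j] - X[i][j + 1]) ** 2, trunc)
            if i + 1 < rows:
                energy += min(lam * (X[i][j] - X[i + 1][j]) ** 2, trunc)
    return energy
```

MAP estimation amounts to minimizing this quantity over all label assignments X, which is exactly where the huge state space and high treewidth bite.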
![Page 7: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/7.jpg)
Discontinuity Costs
MAP with different choices for V(a,b) = W(a-b):
• Quadratic W: X is smooth
• Potts model W: X is piecewise constant
• Truncated quadratic W: X is piecewise smooth
[Plots of the three cost functions W(x)]
![Page 8: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/8.jpg)
Computation
• MCMC (simulated annealing)
- very general
- slow...
• Graph-cuts
- huge impact
- works extremely well on restricted models
• Loopy belief propagation
- very general
- can be very fast
![Page 9: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/9.jpg)
Runtime of Loopy BP
• Runtime depends on
- Time for computing a message
- Number of message updates for convergence
• Can exploit special problem structure to address both
![Page 10: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/10.jpg)
Message Computation
• M(b) = min_a (V(a,b) + M1(a) + M2(a) + M3(a) + D(a))
• M(b) = min_a (V(a,b) + H(a))
- k possible values for each pixel
- O(k²) time by "brute force"
[Figure: node sending message M to a neighbor]
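The brute-force update is a direct double loop over labels; a minimal Python sketch (H and V here are illustrative placeholders for the aggregated incoming costs and the discontinuity cost):

```python
def message_brute_force(H, V):
    """M(b) = min_a (V(a, b) + H(a)); O(k^2) over k labels.

    H[a] aggregates the data cost and incoming messages at the
    sending node; V(a, b) is the discontinuity cost.
    """
    k = len(H)
    return [min(V(a, b) + H[a] for a in range(k)) for b in range(k)]
```

With thousands of pixels, many iterations, and large k, this quadratic inner step is the cost the next slides attack.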
![Page 11: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/11.jpg)
Fast Message Computation
• M(b) = min_a (V(a,b) + H(a))
- States are integers and V(a,b) = W(b-a)
- M(b) = min_a (W(b-a) + H(a))
• Convolution of H and W in the (min,+) semi-ring
- No known general fast algorithm like the FFT
- Best general algorithm: O(k²/log k)
- Fast methods for restricted W (we can pick W)
![Page 12: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/12.jpg)
Fast Min-Convolution
• A(b) = argmin_a (W(b-a) + H(a))
• Assume W(x) is convex
- If b' ≥ b then A(b') ≥ A(b) --- "no crossings"
- O(k log k) divide-and-conquer method
- A little more work yields an O(k) method
[Plot: lower envelope of the shifted costs, with A(b) marking the minimizing a for each b]
![Page 13: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/13.jpg)
Fast Min-Convolution
• If W(x) = min(E(x), F(x)) then M_W(b) = min(M_E(b), M_F(b))
• For truncated quadratic W
- E is quadratic
- F is constant
- Both convex: two O(k) computations plus O(k) to combine
[Plot: truncated quadratic W as the minimum of a parabola and a constant]
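The two steps above can be sketched as follows, assuming integer states 0..k-1 and W(x) = c·x² truncated at T; the linear-time quadratic part is the standard lower-envelope-of-parabolas construction, and the truncated case takes a pointwise min with the constant piece (variable names and the specific T are ours):

```python
def min_conv_quadratic(h, c=1.0):
    """d[p] = min_q (c*(p-q)^2 + h[q]) in O(k), via the lower
    envelope of the parabolas c*(x-q)^2 + h[q]."""
    k = len(h)
    v = [0] * k          # locations of parabolas in the envelope
    z = [0.0] * (k + 1)  # boundaries between envelope parabolas
    z[0], z[1] = -1e20, 1e20
    r = 0                # index of rightmost parabola in the envelope
    for q in range(1, k):
        while True:
            # intersection of parabola q with the current rightmost one
            s = ((h[q] + c * q * q) - (h[v[r]] + c * v[r] * v[r])) \
                / (2 * c * (q - v[r]))
            if s <= z[r]:
                r -= 1   # parabola q hides the rightmost one; pop it
            else:
                break
        r += 1
        v[r] = q
        z[r] = s
        z[r + 1] = 1e20
    d = [0.0] * k
    j = 0
    for p in range(k):
        while z[j + 1] < p:
            j += 1       # find the envelope parabola covering p
        d[p] = c * (p - v[j]) ** 2 + h[v[j]]
    return d

def min_conv_truncated_quadratic(h, c=1.0, T=10.0):
    """W(x) = min(c*x^2, T): combine the two convex pieces pointwise.

    The constant piece F(x) = T gives the flat message min(h) + T.
    """
    d = min_conv_quadratic(h, c)
    flat = min(h) + T
    return [min(dp, flat) for dp in d]
```

The result is exact, not an approximation: it returns the same values as the O(k²) brute-force min, just computed in linear time.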
![Page 14: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/14.jpg)
Multi-Grid
• Number of updates for convergence is large
- Information needs to propagate across the whole image
• Define a hierarchy of problems
- Use messages from one level to initialize the one below
- Good initialization leads to fast convergence
[Figure: grids at level 0 and coarser level 1 of the hierarchy]
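One way the hierarchy can be built is by summing data costs over 2×2 blocks and initializing each finer level's messages from its parent; a minimal sketch (the BP update loop itself is omitted, and block-sum coarsening with copied messages is one common choice, not necessarily the talk's exact scheme):

```python
def coarsen_data_costs(D):
    """One level of the multi-grid hierarchy for grid BP.

    D[i][j] is a list of k label costs at pixel (i, j); a coarse
    'pixel' sums the costs of the (up to) 2x2 fine pixels it covers,
    so a coarse label scores the whole block it represents.
    """
    rows, cols = len(D), len(D[0])
    k = len(D[0][0])
    cr, cc = (rows + 1) // 2, (cols + 1) // 2
    coarse = [[[0.0] * k for _ in range(cc)] for _ in range(cr)]
    for i in range(rows):
        for j in range(cols):
            for a in range(k):
                coarse[i // 2][j // 2][a] += D[i][j][a]
    return coarse

def init_messages_from_parent(parent_msgs, rows, cols):
    """Initialize fine-level messages by copying from the coarse parent.

    Each fine pixel starts from the converged message of the coarse
    pixel that covers it; BP at the fine level then only refines.
    """
    return [[list(parent_msgs[i // 2][j // 2]) for j in range(cols)]
            for i in range(rows)]
```

Coarsening is applied recursively until the grid is tiny; BP is then run coarse-to-fine, a few iterations per level.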
![Page 15: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/15.jpg)
Hierarchical Algorithm
• Number of levels = log of image size
• LBP converges after ~10 iterations at each level (500×500 image)
[Plot: energy of the Tsukuba solution vs. number of iterations, multiscale vs. standard]
![Page 16: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/16.jpg)
Image Restoration
• Truncated quadratic discontinuity costs
• Quadratic data terms, no data for masked pixels
• 256 states per pixel, propagating over large areas
![Page 17: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/17.jpg)
Stereo Depth Estimation
• State-of-the-art accuracy at frame-rate
• Simple, elegant model
[Images: left camera, right camera, disparities]
![Page 18: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/18.jpg)
• Image Restoration
• Object Detection
• Multi-scale Models
![Page 19: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/19.jpg)
Object Detection
[PASCAL VOC dataset]
![Page 20: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/20.jpg)
Part-Based Models
• Each part has an appearance model
• Flexible geometric arrangement
- Easier to model appearance of part than whole object
- Factorization leads to better generalization
[Fischler, Elschlager 73]
![Page 21: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/21.jpg)
Graphical Model
• Object with n parts
• Random variables
- X = (X1 ... Xn) : object configuration
- Xi : location/pose of one part
- Y : image
• P(X) : Markov random field
- captures which geometric configurations are likely
• P(Y|X) : part appearance models + background model
[Figure: graphical model with part nodes X1, X2, X3, X4 and image Y]
![Page 22: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/22.jpg)
Data Model
• We would like P(Y|X) to factor
• Assume
- 1) pixels (features) in the background are iid
- 2) parts don't overlap
[Figure: part regions A1, A2, A3 and background BG = Y \ {A1, ..., A3}]
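Under assumptions (1) and (2) the likelihood factors into per-part terms; in symbols, writing $p_{\mathrm{bg}}$ for the background model and $A_i$ for the pixels covered by part $i$:

```latex
P(Y \mid X)
  = p_{\mathrm{bg}}(BG) \prod_{i=1}^{n} p_i(A_i \mid X_i)
  = p_{\mathrm{bg}}(Y) \prod_{i=1}^{n}
      \frac{p_i(A_i \mid X_i)}{p_{\mathrm{bg}}(A_i)}
```

so $-\log P(Y \mid X)$ is, up to a constant independent of $X$, a sum of per-part data costs $D_i(X_i) = -\log \frac{p_i(A_i \mid X_i)}{p_{\mathrm{bg}}(A_i)}$.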
![Page 23: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/23.jpg)
Inference
• Fully factored data model
• Further assume P(X)
- tree-structured
- pairwise relative positions
• Fast MAP estimation using min-convolutions
- O(nk) time, with n = number of parts and k = number of states per part
- As fast as detecting each part separately
[Figure: tree-structured model X1, X2, X3, X4, each part connected to the image Y]
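The tree inference can be sketched for the simplest case, a star-shaped model with one root; this version uses an O(k²) brute-force min per edge for clarity, where a min-convolution would reduce each edge to O(k) and give the O(nk) total (all names and costs here are illustrative, not the talk's):

```python
def star_map(D_root, D_leaves, V):
    """MAP configuration of a star-shaped part model.

    D_root[b]: data cost for the root at location b.
    D_leaves[i][a]: data cost for leaf part i at location a.
    V(a, b): deformation cost between a leaf at a and the root at b.
    Each leaf sends a message to the root; the root picks its best
    location, then each leaf backtracks its own best location.
    """
    k = len(D_root)
    # msgs[i][b] = min_a (V(a, b) + D_leaves[i][a])
    msgs = [[min(V(a, b) + Dl[a] for a in range(k)) for b in range(k)]
            for Dl in D_leaves]
    root_score = [D_root[b] + sum(m[b] for m in msgs) for b in range(k)]
    b_star = min(range(k), key=lambda b: root_score[b])
    leaves = [min(range(k), key=lambda a: V(a, b_star) + Dl[a])
              for Dl in D_leaves]
    return root_score[b_star], b_star, leaves
```

A general tree works the same way: messages flow from the leaves toward the root, then locations are read off root-to-leaves.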
![Page 24: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/24.jpg)
Human Pose Estimation
![Page 25: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/25.jpg)
Object Category Detection
• Mixture of part-based models for each category
• Photometric invariant features (HOG)
• Discriminative learning (Latent SVM)
• Leading approach on PASCAL VOC benchmark
![Page 26: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/26.jpg)
Car
high scoring true positives | high scoring false positives
[PASCAL VOC dataset]
![Page 27: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/27.jpg)
Horse
high scoring true positives | high scoring false positives
[PASCAL VOC dataset]
![Page 28: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/28.jpg)
• Image Restoration
• Object Detection
• Multi-scale Models
![Page 29: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/29.jpg)
Curve Models
• Model for a curve
- Sequence of control points
- Markov model P(X) captures local shape
- Drift: hard to control the accumulation of local variation
[Figure: curve pairs that locally look very similar / very different]
![Page 30: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/30.jpg)
Multi-Scale Sequence Model
• Capture local properties at multiple resolutions
- Subsample A to get B
- A local property of B is a non-local property of A
• Full model: tree-width = 2
[Figure: sequences A, B, C at successive resolutions]
![Page 31: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/31.jpg)
Models for Closed Curves
[Graphs: 1st-order Markov model and multi-scale model]
Both graphs have tree-width 2
Multi-scale model captures global shape properties
![Page 32: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/32.jpg)
Samples from P(X)
Multi-scale model captures global shape properties
![Page 33: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/33.jpg)
Shape Recognition
Swedish leaf dataset: 15 species, 75 examples per species (25 training, 50 test)

Nearest neighbor classification accuracy (%):
- Multi-scale model: 96.28
- Inner distance: 94.13
- Shape context: 88.12
![Page 34: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/34.jpg)
Shape Detection
[Figures: shape models and their detections in images]
![Page 35: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/35.jpg)
Boundary Detection
• Lots of regularities
- continuity, smoothness, closure, parallel lines, symmetry
• Can we build a “low/mid-level” model for P(X)?
[BSDS]
![Page 36: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/36.jpg)
Local Patterns
• Look at each 3x3 patch
- 2⁹ = 512 possible patterns
• Energy model
- Parameterized by 512 costs
- Symmetries reduce this to ~100
• Captures continuity, frequency of junctions, etc.
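A sketch of the pattern encoding: each 3×3 binary patch maps to an index in 0..511, so the energy is one table lookup per patch; the bit ordering and the zero-padding at image borders below are arbitrary conventions of ours:

```python
def pattern_index(X, i, j):
    """Index in 0..511 of the 3x3 binary patch centered at (i, j).

    X is a binary image (0/1 entries); pixels outside the image are
    treated as 0; bits are ordered row-major over the patch.
    """
    idx = 0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            idx <<= 1
            r, c = i + di, j + dj
            if 0 <= r < len(X) and 0 <= c < len(X[0]):
                idx |= X[r][c]
    return idx

def pattern_energy(X, cost):
    """Sum the per-pattern costs over every 3x3 patch; cost has 512 entries."""
    return sum(cost[pattern_index(X, i, j)]
               for i in range(len(X)) for j in range(len(X[0])))
```

The ~100 free parameters arise because patterns related by rotation, mirroring, or foreground/background swap can share one cost entry.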
![Page 37: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/37.jpg)
Coarse Local Patterns
• Coarsenings X^1, ..., X^K
- X^{i+1} is a function of X^i
• Look at 3x3 blocks at all resolutions
- with different pattern costs per level (V_i ≠ V_j)
![Page 38: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/38.jpg)
Coarse Patterns
[Grid of coarse patterns, ordered by frequency (high to low) across resolutions]
![Page 39: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/39.jpg)
MCMC Inference
• Repeatedly update pixels
• P(X) is not Markov
- a 3x3 block in X^K might depend on the whole picture
• Efficient MCMC via the multi-scale representation
- Energy difference is local over X^1, ..., X^K
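The key point, that the energy difference of a single-pixel update can be computed locally, can be illustrated with a simple Markov (Ising-style) energy; this is a stand-in for the multi-scale pattern model, where the "local" set would instead be the affected 3×3 blocks at every level of the stack:

```python
import math
import random

def ising_energy(X, beta=1.0):
    """Global energy: beta * number of disagreeing 4-neighbor pairs."""
    rows, cols = len(X), len(X[0])
    e = 0.0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols and X[i][j] != X[i][j + 1]:
                e += beta
            if i + 1 < rows and X[i][j] != X[i + 1][j]:
                e += beta
    return e

def flip_delta(X, i, j, beta=1.0):
    """Energy change from flipping pixel (i, j), using only its 4 neighbors."""
    delta = 0.0
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = i + di, j + dj
        if 0 <= r < len(X) and 0 <= c < len(X[0]):
            # a disagreeing edge becomes agreeing after the flip, and vice versa
            delta += -beta if X[i][j] != X[r][c] else beta
    return delta

def metropolis_step(X, temp=1.0, beta=1.0, rng=random):
    """One single-pixel Metropolis update driven by the local delta."""
    i = rng.randrange(len(X))
    j = rng.randrange(len(X[0]))
    d = flip_delta(X, i, j, beta)
    if d <= 0 or rng.random() < math.exp(-d / temp):
        X[i][j] = 1 - X[i][j]
```

Because flip_delta matches the global energy difference exactly, the sampler never needs to touch the whole image per update; the multi-scale model achieves the same locality by bookkeeping the changed blocks at each resolution.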
![Page 40: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/40.jpg)
Restoring noisy images
• iid noise, 20% of pixels flipped
• P(Xi = 1 | Y)
[Images: observation Y and restoration X]
![Page 41: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/41.jpg)
Restoring noisy images
• iid noise, 20% of pixels flipped
• P(Xi = 1 | Y)
[Images: a second example, observation Y and restoration X]
![Page 42: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/42.jpg)
![Page 43: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/43.jpg)
![Page 44: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/44.jpg)
Summary
• Graphical models permeate computer vision
- Image restoration
- Depth estimation
- Segmentation / Edge detection
- Object recognition
• A lot of work to do in object recognition/detection
- Better data models
- Structure variation
• Need better priors for low/mid-level vision
![Page 45: Graphical Models for Computer Vision](https://reader030.fdocuments.us/reader030/viewer/2022021010/6203c701bff43e6a5a0384fa/html5/thumbnails/45.jpg)
Thank you
[Slide repeats the low-level / high-level vision collage from page 2.]