Post on 10-Jan-2016
A Bayesian Approach to Recognition
Moshe Blank, Ita Lifshitz
Reverend Thomas Bayes, 1702-1761
Agenda
Bayesian decision theory
  Maximum Likelihood
  Bayesian Estimation
Recognition
  Simple probabilistic model
  Mixture model
  More advanced probabilistic model
  “One-Shot” Learning
Bayesian Decision Theory
We are given a training set T of samples of class c.
Given a query image x, we want to know the probability that it belongs to the class, p(x)
We know that the class has some fixed distribution with unknown parameters θ, that is, the form of p(x|θ) is known
Bayes rule tells us:
p(x|T) = ∫p(x,θ|T)dθ = ∫p(x|θ)p(θ|T)dθ
Maximum Likelihood Estimation
What can we do about p(θ|T)?
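Choose the parameter value θML that makes the training data most probable:
θML = arg max_θ P(T|θ)
Approximate the posterior by a delta function centered at that value: p(θ|T) = δ(θ – θML)
The integral then collapses to a single term: ∫p(x|θ)p(θ|T)dθ = p(x|θML)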
ML Illustration
Assume that the points of T are drawn from some normal distribution with known variance and unknown mean
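For this illustration the maximizer has a closed form: with known variance, the mean that maximizes P(T|θ) is the sample average. The sketch below assumes a small made-up training set and variance (both hypothetical values chosen only for the example).

```python
import math

def log_likelihood(samples, mean, variance):
    """Log of P(T | theta) for i.i.d. normal samples with known variance."""
    n = len(samples)
    squared_error = sum((x - mean) ** 2 for x in samples)
    return -0.5 * n * math.log(2 * math.pi * variance) - squared_error / (2 * variance)

def ml_mean(samples):
    """Closed-form maximizer of the likelihood over the mean: the sample average."""
    return sum(samples) / len(samples)

training_set = [2.1, 1.7, 2.4, 1.9, 2.2]  # hypothetical samples T
known_variance = 0.25                     # assumed known, as on the slide
theta_ml = ml_mean(training_set)

# Moving the mean away from theta_ml in either direction lowers the likelihood.
for candidate in (theta_ml - 0.1, theta_ml, theta_ml + 0.1):
    print(round(candidate, 2), round(log_likelihood(training_set, candidate, known_variance), 3))
```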
Bayesian Estimation
The Bayesian Estimation approach considers θ as a random variable.
Before we observe the training data, the parameters are described by a prior p(θ) which is typically very broad.
Once the data is observed, we can make use of Bayes’ formula to find posterior p(θ|T). Since some values of the parameters are more consistent with the data than others, the posterior is narrower than prior.
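The narrowing of the posterior can be seen in the same normal model used for the ML illustration. The sketch below assumes a normal prior on the unknown mean (a conjugate choice made here for convenience, not something the slides specify); the sample values and prior settings are hypothetical.

```python
def gaussian_posterior(samples, known_variance, prior_mean, prior_variance):
    """Posterior over the mean of a normal with known variance, given a normal prior."""
    n = len(samples)
    # Precisions (inverse variances) add: one term from the prior, one per sample.
    precision = 1.0 / prior_variance + n / known_variance
    post_variance = 1.0 / precision
    post_mean = post_variance * (prior_mean / prior_variance + sum(samples) / known_variance)
    return post_mean, post_variance

samples = [2.1, 1.7, 2.4, 1.9, 2.2]      # hypothetical training set T
prior_mean, prior_variance = 0.0, 100.0  # a very broad prior p(theta)
post_mean, post_variance = gaussian_posterior(samples, 0.25, prior_mean, prior_variance)
print(post_mean, post_variance)          # posterior variance is far below the prior's 100.0
```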
Bayesian Estimation
Unlike ML, Bayesian estimation does not choose a specific value for θ, but instead performs a weighted average over all possible values of θ.
Why is it more accurate than ML?
Maximal Likelihood vs Bayesian
ML and Bayesian estimations are asymptotically equivalent and “consistent”.
ML is typically computationally easier.
ML is often easier to interpret: it returns the single best model (parameter), whereas Bayesian estimation gives a weighted average of models.
But for finite training data (and given a reliable prior), Bayesian estimation is more accurate (it uses more of the information).
Bayesian with “flat” prior is essentially ML; with asymmetric and broad priors the methods lead to different solutions.
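The comparison can be made concrete for a normal model with known variance, where both predictive densities have closed forms. This is a sketch under that assumption, with a normal prior on the mean; the function names and numbers are illustrative only.

```python
import math

def normal_pdf(x, mean, variance):
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def ml_predictive(x, samples, known_variance):
    """Plug-in density p(x | theta_ML), with theta_ML the sample mean."""
    theta_ml = sum(samples) / len(samples)
    return normal_pdf(x, theta_ml, known_variance)

def bayes_predictive(x, samples, known_variance, prior_mean, prior_variance):
    """p(x | T): the integral of p(x | theta) p(theta | T) over theta, in closed form."""
    precision = 1.0 / prior_variance + len(samples) / known_variance
    post_variance = 1.0 / precision
    post_mean = post_variance * (prior_mean / prior_variance + sum(samples) / known_variance)
    # Averaging over theta adds the posterior uncertainty to the data variance.
    return normal_pdf(x, post_mean, known_variance + post_variance)

few = [2.1]                                                      # one training sample
many = [2.0 + 0.01 * ((i % 5) - 2) for i in range(500)]          # 500 training samples
print(ml_predictive(2.1, few, 0.25), bayes_predictive(2.1, few, 0.25, 0.0, 1e6))
print(ml_predictive(2.0, many, 0.25), bayes_predictive(2.0, many, 0.25, 0.0, 1e6))
```

With a single sample the Bayesian predictive is visibly wider than the ML plug-in density; with 500 samples the two are nearly identical, which matches the asymptotic equivalence stated above.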
Agenda
Bayesian decision theory
Recognition
  Simple probabilistic model
  Mixture model
  More advanced probabilistic model
  “One-Shot” Learning
Objective
Given an image, decide whether or not it contains an object of a specific class.
Main Issues
Representation
Learning
Recognition
Approaches to Recognition
Photometric properties – filter subspaces, neural networks, principal component analysis…
Geometric constraints between low level object features – alignment, geometric invariance, geometric hashing…
Object Model
Fischler & Elschlager, 1973
Yuille, ‘91
Brunelli & Poggio, ‘93
Lades, v.d. Malsburg et al. ‘93
Cootes, Lanitis, Taylor et al. ‘95
Amit & Geman, ‘95, ‘99
Perona et al. ‘95, ‘96, ‘98, ‘00, ‘02
Model: constellation of Parts
Perona’s Approach
Objects are represented as a probabilistic constellation of rigid parts (features).
The variability within a class is represented by a joint probability density function on the shape of the constellation and the appearance of the parts.
Agenda
Bayesian decision theory
Recognition
  Simple probabilistic model
    Model parameterization
    Feature Selection
    Learning
  Mixture model
  More advanced probabilistic model
  “One-Shot” Learning
Weber, Welling, Perona - 2000
Unsupervised Learning of Models for Recognition
Towards Automatic Discovery of Object Categories
Unsupervised Learning
Learn to recognize object class given a set of class and background pictures, without preprocessing – labeling, segmentation, alignment.
Model Description
Each object is constructed of F parts, each of a certain type.
Relations between the part locations define the shape of the object.
Image Model
Image is transformed into a collection of parts
Objects are modeled as sub collections
Model Parameterization
Given an image we detect potential object parts, to obtain the following observable:
Hypothesis
When presented with an un-segmented and unlabeled image, we do not know which parts correspond to the foreground.
Assuming the image contains the object, we use a vector of indices h to indicate which of the observables correspond to a foreground point (i.e. a real part of the object).
We call h a hypothesis, since it is a guess at the structure of the object. h = (h1, …, hT) is not observable.
Additional Hidden Variables
We denote by x^m the locations of the unobserved object parts.
b = sign(h) – binary vector indicating which parts were detected
n = number of background parts detected of each type
Probabilistic Model
We can now define a generative probabilistic model for the object class using the probability density function:
Model Details
Since n, b are determined by Xo, h, we have:
By Bayesian rule:
Model Details
Full table of joint probabilities (for small F) or F independent detection rate probabilities for large F
Model Details
Poisson probability density function with parameter Mt for detection of feature of type t
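A minimal sketch of this background term, assuming the counts for different feature types are independent so that the probabilities multiply (an assumption of this example; the slide states only the per-type Poisson form). The function names are hypothetical.

```python
import math

def poisson_pmf(count, mean):
    """Probability of observing `count` events when the average number is `mean`."""
    return math.exp(-mean) * mean ** count / math.factorial(count)

def background_count_probability(counts, means):
    """Product over feature types t of Poisson(n_t; M_t), assuming independent types."""
    prob = 1.0
    for n_t, m_t in zip(counts, means):
        prob *= poisson_pmf(n_t, m_t)
    return prob

# Hypothetical example: two feature types with average background counts 1.0 and 2.0.
print(background_count_probability([0, 3], [1.0, 2.0]))
```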
Model Details
Uniform probability over all hypotheses consistent with n and b
Model Details
Where - coordinates of all foreground detections, and - coordinates of all background detections
Sample object classes
Invariance to Translation, Rotation and Scale
There is no use in modeling the shape of the object in terms of absolute pixel positions of the features.
We apply a transformation to the features’ coordinates to make the shape invariant to translation, rotation and scale.
But the feature detector must be invariant to the transformations as well!
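One common way to obtain such invariance is to express every feature location relative to two reference features. The sketch below uses that construction as an assumption; the slides do not say which transformation was applied, and `normalize_shape` is a hypothetical name.

```python
import numpy as np

def normalize_shape(points, ref_a=0, ref_b=1):
    """Map points so that points[ref_a] lands on (0, 0) and points[ref_b] on (1, 0)."""
    pts = np.asarray(points, dtype=float)
    z = pts[:, 0] + 1j * pts[:, 1]        # treat 2-D points as complex numbers
    base = z[ref_b] - z[ref_a]
    if base == 0:
        raise ValueError("reference features must be at distinct locations")
    # Subtracting removes translation; dividing by `base` removes rotation and scale.
    w = (z - z[ref_a]) / base
    return np.column_stack([w.real, w.imag])

shape = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
angle = np.pi / 6
rotation = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
moved = 2.5 * shape @ rotation.T + np.array([10.0, -4.0])  # same shape, new pose
print(np.allclose(normalize_shape(shape), normalize_shape(moved)))
```

The price of this choice is that the two reference features must be detected, and their localization noise propagates to all other coordinates.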
Automatic Part Selection
Find points of interest in all training images
Apply Vector Quantization and clustering to get 100 total candidate patterns.
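The clustering step can be sketched with plain k-means on flattened patches. This is an illustrative stand-in, not necessarily the clustering procedure used in the talk; the deterministic initialization (evenly spaced rows) is chosen only to keep the example reproducible, and random restarts are the more usual choice.

```python
import numpy as np

def kmeans(vectors, k, iterations=20):
    """Cluster row vectors into k groups; returns (centers, labels)."""
    data = np.asarray(vectors, dtype=float)
    # Deterministic initialization: k evenly spaced rows of the data.
    centers = data[np.linspace(0, len(data) - 1, k).astype(int)].copy()
    for _ in range(iterations):
        # Assign every vector to its nearest center (the quantization step).
        distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each non-empty cluster's center to the mean of its members.
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # Final assignment against the final centers.
    distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return centers, distances.argmin(axis=1)

# Hypothetical usage: `patches` would hold flattened pixel patches around interest points.
patches = np.vstack([np.array([[i * 0.1, 0.0] for i in range(5)]),
                     np.array([[10.0 + i * 0.1, 10.0] for i in range(5)])])
centers, labels = kmeans(patches, k=2)
print(labels)
```

For the setting described above, `k` would be set to the desired number of candidate patterns (100 on the slide).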
Automatic Part Selection
Points of interest patterns
Method Scheme
Part Selection → Model Learning → Test
Automatic Part Selection
Find subset of candidate parts of (small) size F to be used in the model that gives the best performance in the learning phase.
[Figure labels: 57%, 87%, 51%]
Learning
Goal: Find θ = {μ, Σ, p(b), M} which best explains the observed (training) data
μ, Σ – expectation and covariance parameters of the joint Gaussian modeling the shape of the foreground
b – random variable denoting whether each of the parts of the model is detected or not
M – average number of background detections for each of the parts
Learning
Goal: Find θ = {μ, Σ, p(b), M} which best explains the observed (training) data,
i.e. maximize the likelihood:
θ* = arg max_θ p(Xo | θ)
Done using the EM method
Expectation Maximization (EM)
EM is an iterative optimization method to estimate some unknown parameters θ, given measurement data, but not given some “hidden” variables J.
We want to maximize the posterior probability of the parameters θ given the data U, marginalizing over J:
θ* = arg max_θ Σ_J p(θ, J | U)
Expectation Maximization (EM)
Choose an initial parameter guess θ0.
E-Step: estimate the unobserved (hidden) data using the observed data and the current guess of the parameters θk.
M-Step: compute the maximum likelihood estimate of the parameters θk+1 using the estimated data.
Expectation Maximization (EM)
EM alternates between estimating the unknowns θ and the hidden variables J.
The EM algorithm converges to a local maximum, not necessarily the global one.
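The alternation is easiest to see on a toy problem. The sketch below runs EM on a one-dimensional mixture of two Gaussians, where the hidden variable J is the component that generated each point. It is a generic textbook instance, not the constellation-model learner from the talk, and the data are made up.

```python
import numpy as np

def em_two_gaussians(data, iterations=50):
    """EM for a 1-D mixture of two Gaussians; returns parameters and the log-likelihood trace."""
    x = np.asarray(data, dtype=float)
    # theta_0: a crude initial guess taken from the data range.
    means = np.array([x.min(), x.max()])
    variances = np.array([x.var(), x.var()])
    weights = np.array([0.5, 0.5])
    log_likelihoods = []
    for _ in range(iterations):
        # E-step: soft estimate of the hidden labels under the current parameters.
        densities = (weights * np.exp(-(x[:, None] - means) ** 2 / (2 * variances))
                     / np.sqrt(2 * np.pi * variances))
        total = densities.sum(axis=1, keepdims=True)
        log_likelihoods.append(float(np.log(total).sum()))
        resp = densities / total
        # M-step: maximum likelihood re-estimate of the parameters given those labels.
        counts = resp.sum(axis=0)
        weights = counts / len(x)
        means = (resp * x[:, None]).sum(axis=0) / counts
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / counts
    return weights, means, variances, log_likelihoods

data = [-2.1, -2.0, -1.9, -2.2, -1.8, 2.9, 3.0, 3.1, 3.2, 2.8]  # hypothetical measurements
weights, means, variances, trace = em_two_gaussians(data)
print(means)
```

The recorded log-likelihood never decreases from one iteration to the next, which is the property behind convergence to a local maximum. A different starting guess can end in a different maximum.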
Method Scheme
Part Selection → Model Learning → Test
Recognition
Using the maximum a posteriori approach we consider the ratio
R =
where h0 is the null hypothesis – which explains all parts as background noise.
We accept the image as belonging to the class if R is above a certain threshold.
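The thresholding step can be sketched as follows, working with logarithms to avoid underflow. How the two likelihoods are computed is outside the sketch; `accept_as_class` and its arguments are hypothetical names, with `log_p_object` standing for the log-likelihood under the object model and `log_p_null` for the log-likelihood under the null hypothesis h0.

```python
import math

def accept_as_class(log_p_object, log_p_null, threshold=1.0):
    """Return True when R = p_object / p_null exceeds the threshold."""
    log_ratio = log_p_object - log_p_null
    return log_ratio > math.log(threshold)

print(accept_as_class(-10.0, -12.0))                  # R = e^2, above a threshold of 1
print(accept_as_class(-10.0, -12.0, threshold=10.0))  # e^2 is below a threshold of 10
```

Raising the threshold trades false positives for false negatives, which is what an ROC curve traces out.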
Database
Two classes – faces and cars
100 training images for each class
100 test images for each class
Images vary in scale, location of the object, lighting conditions
Images have cluttered background
No manual preprocessing
Learning Results
Model Performance
Average training and testing errors, measured as 1 - Area(ROC)
The results suggest a 4-part model for faces and a 5-part model for cars as optimal.
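The error measure 1 - Area(ROC) can be computed directly from classifier scores, using the fact that the area under the ROC curve equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The scores below are hypothetical.

```python
def roc_area(positive_scores, negative_scores):
    """Area under the ROC curve, counted over all positive/negative score pairs."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positive_scores) * len(negative_scores))

def roc_error(positive_scores, negative_scores):
    """The error measure 1 - Area(ROC)."""
    return 1.0 - roc_area(positive_scores, negative_scores)

object_scores = [0.9, 0.8, 0.4]      # hypothetical scores for images of the class
background_scores = [0.5, 0.3, 0.2]  # hypothetical scores for background images
print(roc_error(object_scores, background_scores))
```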
Multiple use of parts
Part ‘C’ has high variance along the vertical direction – can be detected in several locations – bumper, license plate or roof.
Part Labels:
Recognition Results
Average success rate (at equal false positive and false negative rates):
• Faces: 93.5%
• Cars: 86.5%
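The operating point where the false positive and false negative rates are equal can be located by scanning candidate thresholds. A minimal sketch with hypothetical scores; with finitely many samples the two rates may not match exactly, so the closest point is used.

```python
def equal_error_rate(positive_scores, negative_scores):
    """Scan thresholds and return the error rate where FN and FP rates are closest."""
    best_gap, best_rate = None, None
    for threshold in sorted(set(positive_scores) | set(negative_scores)):
        # An example is accepted when its score is at or above the threshold.
        fn_rate = sum(s < threshold for s in positive_scores) / len(positive_scores)
        fp_rate = sum(s >= threshold for s in negative_scores) / len(negative_scores)
        gap = abs(fn_rate - fp_rate)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (fn_rate + fp_rate) / 2
    return best_rate

object_scores = [0.9, 0.8, 0.7, 0.4]      # hypothetical
background_scores = [0.6, 0.3, 0.2, 0.1]  # hypothetical
error = equal_error_rate(object_scores, background_scores)
print("success rate:", 1.0 - error)
```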
Agenda
Bayesian decision theory
Recognition
  Simple probabilistic model
  Mixture model
  More advanced probabilistic model
  “One-Shot” Learning
Mixture Model
A Gaussian model works well for homogeneous classes, but real-life objects can be far from homogeneous.
Can we extend our approach to multi-modal classes?
Mixture Model
An object is modeled using Ω different components, each of which is a probabilistic model:
Each component “sees the whole picture”. Components are trained together.
Database
Faces with different viewing angles – 0°, 15°, …, 90°
Cars – rear view and side view
Tree leaves – of several types
Tuning of the mixture components
Each training image was assigned to the component which responds to it the most, i.e. one that maximizes .
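A sketch of the assignment step, with one-dimensional Gaussians standing in for the component models. It assumes that "responds the most" means the largest weighted component likelihood, which is equivalent to the largest posterior probability of the component; the quantity on the slide is not legible, so this reading is an assumption, and all names are hypothetical.

```python
import math

def normal_pdf(x, mean, variance):
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def mixture_density(x, components):
    """components: list of (weight, mean, variance); returns the weighted sum of densities."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in components)

def most_responsive_component(x, components):
    """Index of the component with the largest weighted likelihood for x."""
    scores = [w * normal_pdf(x, m, v) for w, m, v in components]
    return max(range(len(scores)), key=scores.__getitem__)

components = [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]  # two hypothetical components
for observation in (0.3, 4.2):
    print(observation, most_responsive_component(observation, components))
```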
Results
Misclassification error at even false positive and false negative rate for training and test sets
Zero false alarm detection rate (ZFA-DR).
Separately trained components
Two components trained independently on two subclasses of the cars class.
When merged into a mixture model with p(w) = 0.5, they gave worse results than a two-component model trained on both subclasses simultaneously.
Agenda
Bayesian decision theory
Recognition
  Simple probabilistic model
  Mixture model
  More advanced probabilistic model
    Feature Selection
    Model parameterization
    Results
  “One-Shot” Learning
Fergus, Perona, Zisserman
Object Class Recognition By Scale Invariant Learning - Proc. of the IEEE Conf on Computer Vision and Pattern Recognition - 2003
Object Class Recognition By Scale Invariant Learning
Extended version of the previous model (by Weber et al.)
New feature detector
Probabilistic model for appearance instead of feature types
Feature Detection
Kadir-Brady feature detector
Detects salient regions over different scales and locations
Choose N most salient regions
Each feature contains scale and location information
Notation
X – Shape: locations of the features
A – Appearance: representations of the features
S – Scale: vector of feature scales
h – Hypothesis: which part is represented by which observed feature
Feature Appearance
The feature's contents are rescaled to an 11x11 pixel patch
Normalization
Reduce data dimension from 121 to 15 dimensions using the PCA method
Result is the appearance vector for the part
[Figure: 11x11 patch → Normalize → Projection onto PCA basis → coefficients c1, c2, …, c15]
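The dimension reduction can be sketched with a PCA basis obtained from the singular value decomposition of the centered patch matrix. The normalization step is left out because the slides do not specify it, and the random patches below are placeholders for real training patches.

```python
import numpy as np

def fit_pca(patches, n_components=15):
    """Learn a PCA basis from a stack of patches; returns (mean, basis)."""
    data = np.asarray(patches, dtype=float).reshape(len(patches), -1)  # (N, 121)
    mean = data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]

def appearance_vector(patch, mean, basis):
    """Project one 11x11 patch onto the basis, giving the coefficients c1..c15."""
    flat = np.asarray(patch, dtype=float).reshape(-1)
    return basis @ (flat - mean)

rng = np.random.default_rng(0)
training_patches = rng.normal(size=(40, 11, 11))  # placeholder data
mean, basis = fit_pca(training_patches)
print(appearance_vector(training_patches[0], mean, basis).shape)
```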
Recognition
Assume we have learned the model parameters θ. Given an image, we extract X, S, A and can make a Bayesian decision:
We apply a threshold to the likelihood ratio R to decide whether the input image belongs to the class.
Recognition
The term p(X, S, A | θ) can be factored into:
Each of the terms has a closed (computable) form given the model parameters θ
Part appearance pdf
Foreground model: Gaussian
Clutter model: Gaussian
Shape pdf
Foreground model: Gaussian
Clutter model: Uniform
Relative Scale pdf
Foreground model: Gaussian over log(scale)
Clutter model: Uniform over log(scale)
Detection Probability pdf
Foreground model: probability of detection (example values from the figure: 0.8, 0.75, 0.9)
Clutter model: Poisson probability density function on the number of detections
Learning
Want to estimate model parameters:
Using the EM method, find the θ that will best explain the training set images, i.e. maximize the likelihood:
Sample Model
Confusion Table
How good is a model for object class A at distinguishing images of class B from background images?
Comparison of Results
Average performance of the models at ROC equal error rates:
Scale invariant learning:
Agenda
Bayesian decision theory
Recognition
  Simple probabilistic model
  Mixture model
  More advanced probabilistic model
  “One-Shot” Learning
Fei-Fei, Fergus, Perona
A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories - Proc. ICCV. 2003
Small Training Set
Humans can learn a new category using very few training examples.
A rule of thumb in machine learning says that the number of training examples should be 5-10 times the number of model parameters.
Can computers do better?
Prior knowledge about objects
Incorporating prior knowledge
Bayesian methods allow us to use “prior” information p(θ) about the nature of objects. Given new observations, we can update our knowledge into a “posterior” p(θ|x).
Bayesian Decision
Given a test image, we want to make a Bayesian decision by comparing:
P(object | test, train) vs. P(clutter | test, train)
P(object | test, train) ∝ P(test | object, train) p(object)
P(test | object, train) = ∫P(test | θ, object) p(θ | object, train) dθ
Bayesian Decision
∫P(test | θ, object) p(θ | object, train) dθ
Until now we used the ML approach – approximating p(θ | object, train) by a delta function centered at θML = arg max_θ p(train | θ, object).
This will not work for a small training set.
Maximum Likelihood vs. Bayesian Learning
Maximum Likelihood
Bayesian Learning
Experimental setup
Learn three object categories using ML approach
Estimate the prior hyper-parameters
Use VBEM (variational Bayesian EM) to learn a new object category from few images
Prior Hyper-Parameters
Performance Results – Motorbikes
[Figure panels: 1 training image, 5 training images]
Performance Results – Face Model
[Figure panels: 1 training image, 5 training images]
Results Comparison
Algorithm                                      # training images   Learning speed   Error rate
Burl, et al.; Weber, et al.; Fergus, et al.    200~400             Hours            5.6-10 %
Bayesian One-Shot                              1~5                 < 1 min          8-15 %
References
Object Class Recognition By Scale Invariant Learning – Fergus, Perona, Zisserman – 2003
A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories – Fei-Fei, Fergus, Perona – 2003
Towards Automatic Discovery of Object Categories – Weber, Welling, Perona – 2000
Unsupervised Learning of Models for Recognition – Weber, Welling, Perona – 2000
Recognition of Planar Object Classes – Burl, Perona – 1996
Pattern Classification and Scene Analysis – Duda, Hart – 1973