MLE, EM and MAP (UMIACS, cmsc828d/lecture23.pdf)
Principal Components Analysis
MLE, EM and MAP
CMSC 828D Fundamentals of Computer Vision
Fall 2000
Outline
• Lagrange Multipliers
• Principal Components Analysis
• Review of parameter estimation
• Notation and Problem Definition
• Maximum Likelihood Estimation
• Difficulties
• Bayesian view
• Maximum A Posteriori Estimation
• Algorithms: Expectation Maximization
Lagrange Multipliers
• Find stationary points of a function f(x) subject to one or more constraints g(x) = 0
• Consider the surface g(x) = 0
  – The direction of increase of f is ∇f
  – However, moving in this direction may take us away from the constraint surface
  – Idea: move along the component of ∇f that lies in the surface
  – Denote this component ∇_g f
  – At the extremum point this component will be stationary: ∇_g f = 0
  – How do we get ∇_g f? Take ∇f and subtract from it the part a that takes it out of the surface g:
    ∇_g f = ∇f − a
Finding the component of ∇f along g
• Now let us move by a small displacement δ along the surface g
  – g(x + δ) = g(x) + δ · ∇g
  – But the new point still lies on the surface, so g(x + δ) = 0
  – So δ · ∇g = 0
  – ⇒ ∇g is perpendicular to motions along the surface
• But we wanted to remove any piece of ∇f that was perpendicular to g(x) = 0
• This will be a vector of the form ∇_g f = ∇f + λ∇g (for some λ)
Lagrangian
• Consider the Lagrangian function
  L(x, λ) = f + λg
• Its derivatives are
  ∇_x L(x, λ) = ∇f + λ∇g,   ∂L(x, λ)/∂λ = g
• Extremize the Lagrangian:
  ∇_x L(x, λ) = ∇f + λ∇g = 0,   ∂L(x, λ)/∂λ = g(x) = 0
• So this gives us both the constraint equation and the way to optimize the function on the surface.
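As a concrete check of this recipe, here is a small numerical sketch (a hypothetical example, not from the slides): extremize f(x, y) = x + y on the circle g(x, y) = x² + y² − 1 = 0 by solving ∇f + λ∇g = 0 together with g = 0.

```python
import math

# Hypothetical example: extremize f(x, y) = x + y subject to g(x, y) = x^2 + y^2 - 1 = 0.
# Stationarity of L = f + lambda*g gives 1 + 2*lambda*x = 0 and 1 + 2*lambda*y = 0,
# so x = y = -1/(2*lambda); the constraint 2*x^2 = 1 then fixes lambda = -1/sqrt(2) or +1/sqrt(2).

def lagrange_extremum(lam):
    """Return the stationary point (x, y) implied by a given multiplier lambda."""
    x = -1.0 / (2.0 * lam)
    return x, x                           # symmetry of the two equations forces x = y

lam = -1.0 / math.sqrt(2.0)               # this sign of lambda selects the maximum of f
x, y = lagrange_extremum(lam)

assert abs(x**2 + y**2 - 1.0) < 1e-12     # the constraint equation g = 0 holds
assert abs(1.0 + 2.0 * lam * x) < 1e-12   # stationarity grad f + lambda * grad g = 0 holds
```

The maximum lands at x = y = 1/√2, exactly as the two conditions predict.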
Principal Components Analysis
Key technique in dealing with data
• Data Reduction
  – Experimental measurements produce lots of data
  – Scientists postulate lots of hypotheses as to what factors affect the data, and create overly complex models
  – Goal: find the factors that affect the data most, and create small models
• Knowledge discovery
  – Collect lots of data
  – Are there patterns hidden in the collected data that can help us develop a model and an understanding?
  – Can we use this understanding to classify a new piece of data?
• Applications: almost all of computer vision
  – Especially face recognition, tracking, pattern recognition, etc.
Basics
• Record data: a d-dimensional data vector x
• Record N observations
• Mean:
  x̄ = (1/N) Σ_{i=1}^N x_i
• Covariance:
  Σ = (1/N) Σ_{i=1}^N (x_i − x̄)(x_i − x̄)′
• Problem: d can be very large
  – "megapixel camera": d > 1 million (values of the intensity at the pixels)
  – an image is a point in a d-dimensional space
• Need a way to capture the information in the data using very few "coordinates"
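The mean and covariance above can be sketched in a few lines of NumPy (synthetic data; the shapes and variable names are assumptions for illustration):

```python
import numpy as np

# Rows of X are the N observations x_i, each a d-dimensional vector (synthetic data).
rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))

x_bar = X.mean(axis=0)                    # x_bar = (1/N) sum_i x_i
centered = X - x_bar
Sigma = centered.T @ centered / N         # (1/N) sum_i (x_i - x_bar)(x_i - x_bar)'

assert Sigma.shape == (d, d)
assert np.allclose(Sigma, np.cov(X.T, bias=True))   # matches NumPy's biased covariance
```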
Principal Components Analysis
• Consider a vector x that lies in a d-dimensional linear space
• Let vectors u_k, k = 1, …, d define a basis in the space
  x = Σ_k c_k u_k
  x is characterized by d coordinates {c_k}; different x_i have different coordinates {c_k}_i
• Now consider the case where the vectors x lie on a lower dimensional manifold
  – A smaller number of coordinates is enough
  – For small d, if the points are spread along the axes, it may be easy to recognize the basis
  – For larger d, and if the points are not along the axes, it is harder
  – Need mathematical tools
Dimension Reduction
• Expressing the points using the basis vectors along the axes, we still need all the coordinates to describe the x_i
• However, with an alternate basis we may need only a few variables and a constant to describe the points:
  x = Σ_{k=1}^M c_k u_k + Σ_{k=M+1}^d b_k u_k
• Complexity of most algorithms is a power of d
• Mathematical questions to answer:
  – Best basis: how do we find the basis that is best lined up with the data?
  – Approximation: if we only want the best M-dimensional basis, how do we select it?
  – How do we account for noise?
Approximation
• Given a dataset {x_i} with N members
• Write each vector in a basis {u_k}
• Coefficients: c_k = x′u_k
• Approximate each x_i as the sum of a variable part and a constant part; the dimension of the variable part is M:
  x_i = Σ_{k=1}^d c_k^i u_k ≈ Σ_{k=1}^M c_k^i u_k + Σ_{k=M+1}^d b_k u_k
• Error in approximating a particular vector:
  ε_i = x_i − Σ_{k=1}^M c_k^i u_k − Σ_{k=M+1}^d b_k u_k = Σ_{k=M+1}^d (c_k^i − b_k) u_k
• Define the sum-of-squares error function C (using the orthonormality of the u_k):
  C(b_k) = Σ_{i=1}^N Σ_{k=M+1}^d (c_k^i − b_k)²
Getting the parameters b_k and u_k
• Evaluate b_k by setting ∂C/∂b_k = 0:
  b_k = (1/N) Σ_{i=1}^N c_k^i = u_k′ x̄
• To get the best basis vectors u_k, define the cost function
  E_M = Σ_{i=1}^N Σ_{k=M+1}^d ( u_k′(x_i − x̄) )²
      = Σ_{k=M+1}^d u_k′ [ Σ_{i=1}^N (x_i − x̄)(x_i − x̄)′ ] u_k
      = N Σ_{k=M+1}^d u_k′ Σ u_k
  (the constant factor N does not affect the minimizer, so below we work with Σ_{k=M+1}^d u_k′ Σ u_k)
• Minimize E_M with respect to u_k
• However, the expression is homogeneous in u_k
• The obvious (trivial) solution is u_k = 0
Finding the best basis
• To avoid the trivial solution we need a constraint
• Basis vectors have unit magnitude and are orthonormal: ||u_k|| = 1, u_j · u_k = δ_jk
• How do we optimize subject to constraints?
  – Lagrange multipliers!
• Cost function with constraints:
  E_M = Σ_{k=M+1}^d u_k′ Σ u_k − Σ_j Σ_k μ_jk ( u_j′ u_k − δ_jk )
• This can be written in the form
  E_M = Tr( U′ΣU ) − Tr( M(U′U − I) )
  where U = [ u_{M+1} | … | u_d ] collects the basis vectors as columns and M = {μ_jk}
• Minimizing with respect to u_k:
  (Σ + Σ′)U − U(M + M′) = 0  ⇒  ΣU = UM
• U is an orthonormal matrix with the basis vectors as columns
• So any set of U and M that satisfies M = U′ΣU will do
PCA
• Choose the simplest solution
  – U: the eigenvectors of Σ as columns
  – M: the diagonal matrix of eigenvalues
• Algorithm
  1. Compute the mean of the data: x̄ = (Σ_i x_i)/N
  2. Compute the covariance of the data: Σ = (1/N) Σ_i (x_i − x̄)(x_i − x̄)′
  3. Compute the eigenvectors u_i and corresponding eigenvalues λ_i of Σ, sorted according to the magnitude of λ_i
  4. For a desired approximation dimension M, x_i can be written as
     x_i ≈ Σ_{k=1}^M c_k^i u_k + Σ_{k=M+1}^d (x̄′u_k) u_k
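The four steps above can be sketched directly in NumPy (synthetic data; all shapes and values are illustrative assumptions):

```python
import numpy as np

# Sketch of the four-step PCA algorithm on synthetic correlated data.
rng = np.random.default_rng(1)
N, d, M = 500, 5, 2
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))   # correlated d-dimensional data

x_bar = X.mean(axis=0)                                  # step 1: mean
Sigma = (X - x_bar).T @ (X - x_bar) / N                 # step 2: covariance
lam, U = np.linalg.eigh(Sigma)                          # step 3: eigenvectors/eigenvalues
order = np.argsort(lam)[::-1]                           # sort by magnitude of lambda_i
lam, U = lam[order], U[:, order]

# step 4: approximate each x_i with the first M modes plus the mean contribution
coeffs = (X - x_bar) @ U[:, :M]                         # variable-part coefficients
X_hat = x_bar + coeffs @ U[:, :M].T

# the mean squared residual equals the sum of the discarded eigenvalues
err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
assert np.isclose(err, lam[M:].sum())
```

The final assertion is the point of the derivation: the cost E_M paid for truncating at M is exactly the sum of the eigenvalues left out.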
Selecting the approximation dimension M
• The proportion of variance in the data captured when we truncate at a given M is
  ( Σ_{i=1}^M λ_i ) / ( Σ_{i=1}^d λ_i )
• Two strategies:
  – 1st: specify a desired threshold, e.g. 99%
  – 2nd: look at the magnitudes of the ratios λ_i / λ_{i+1}
    • In some problems the ratio exhibits a sharp peak at some value of i
    • This is the "intrinsic dimension" of the problem
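Both strategies can be sketched on an assumed eigenvalue spectrum (the numbers below are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical sorted eigenvalue spectrum with two dominant modes.
lam = np.array([10.0, 8.0, 0.5, 0.1, 0.05])

# 1st strategy: smallest M whose captured-variance proportion exceeds a threshold.
proportion = np.cumsum(lam) / lam.sum()
M = int(np.argmax(proportion >= 0.95)) + 1    # first index where proportion >= 95%
assert M == 2                                 # the first two modes capture ~96.5%

# 2nd strategy: look for a sharp peak in the ratios lambda_i / lambda_{i+1}.
ratios = lam[:-1] / lam[1:]
intrinsic_dim = int(np.argmax(ratios)) + 1    # ratio spikes right after i = 2
assert intrinsic_dim == 2
```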
Application: Face/fingerprint recognition
• 128 faces at 64×64 resolution for training
  – d = 4096
  – Perform PCA, choosing the first 20 modes (16 shown beside)
  – Approximate new faces using these
  – Greater than 95% accuracy claimed on a database of 7000 faces
• Also used for fingerprint storage and recognition
• If interested, check
  http://c3iwww.epfl.ch/projects_activities/jmv/fingerprint_identification.html
Pedestrian shapes from PCA modes
• Problem: track moving pedestrians from a moving camera
• Solution: generate PCA modes ("eigenvectors") from pedestrian shapes
• Track pedestrian shapes in new images by searching over variations in the PCA modes
Movie
• From Philomin et al 2000
Summary
Principal Components Analysis (PCA) exploits the redundancy in multivariate data. It allows one to:
• Determine relationships among the variables
• Reduce the dimensionality of the data set without a significant loss of information
Parameter Estimation
MLE and MAP
Problem Introduction
• A model is characterized by the values of a parameter vector θ
• We have several observations of a process that we think follows this model
• Using this observation set as "training data", we want to find the most probable values of the parameters
• Observations have errors and are assumed to follow a probability distribution
• Two approaches:
  – Maximum Likelihood Estimation (MLE)
    • Expectation Maximization algorithm
  – Maximum A Posteriori Estimation (MAP)
    • "Bayesian approach"
• This talk only touches on a vast field, but hopefully it will make you familiar with the jargon
Notation
• θ: the parameter vector being estimated; θ̂: a test value to be compared
• E.g., if N(µ, σ) is the 1-D normal distribution
  N(µ, σ) = (1/(√(2π) σ)) exp( −½ ((x − µ)/σ)² )
  the parameters to be estimated are µ and σ
• For d-dimensional data with mean µ and covariance matrix Σ,
  N(µ, Σ) = (1/( (2π)^{d/2} |Σ|^{1/2} )) exp( −½ (x − µ)′ Σ⁻¹ (x − µ) )
  the parameters to be estimated are µ and Σ
• The data set on which the estimation is based is denoted D
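The d-dimensional density above can be evaluated directly; here is a short sketch (the function name and test values are illustrative assumptions), checked against the 1-D formula:

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate N(mu, Sigma) at x: exp(-0.5 (x-mu)' Sigma^-1 (x-mu)) / sqrt((2 pi)^d |Sigma|)."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# Sanity check: for d = 1 this must reduce to the 1-D formula on the slide.
x, mu, sigma = 0.3, 0.0, 2.0
p_1d = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
p_dd = gaussian_density(np.array([x]), np.array([mu]), np.array([[sigma ** 2]]))
assert np.isclose(p_1d, p_dd)
```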
Maximum Likelihood Estimation
• Use a set of N data points x_i belonging to a training set D, assumed to be drawn independently from the probability density p(x|θ), to estimate θ
• Because the observations are independent,
  p(D|θ) = Π_{i=1}^N p(x_i|θ)
• The likelihood of θ with respect to the samples in D is p(D|θ)
  – the probability that the set of observations in the dataset would have occurred, given the parameters θ
• The maximum likelihood estimate θ̂ is the value of θ that maximizes this probability
• Estimation problem: treat p(D|θ) as a function of θ and find the value that maximizes it
Log Likelihood Function
• Probabilities are positive
• The logarithm is a monotonically increasing function
• So the maxima of the likelihood function occur at the same values as those of its logarithm
• The logarithm is easier to work with:
  – Converts products to sums
  – Shrinks big numbers and small numbers to O(1)
  – Makes the resulting cost function easier to differentiate
• Denoted l(θ):
  l(θ) = ln p(D|θ) = Σ_{k=1}^N ln p(x_k|θ)
• The estimate is
  θ̂ = argmax_θ l(θ)
  found by solving
  ∇_θ ln p(D|θ) = Σ_{k=1}^N ∇_θ ln p(x_k|θ) = 0
• The estimate can be a local maximum rather than the global maximum
Maximum Likelihood Estimation
• Summary
  – Given a dataset whose elements are assumed to be distributed according to a probability distribution p(x|θ)
  – Create the likelihood function for the data set: the probability that the data set could have come out of the assumed probability distribution with given parameters θ
  – If the observations in the dataset are independent, the likelihood function is
    p(D|θ) = Π_{i=1}^N p(x_i|θ)
  – Using the log of the likelihood function is often more convenient
  – The parameters are estimated by maximizing the likelihood, or its log, with respect to θ
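For the 1-D Gaussian of the notation slide, the maximization can be done in closed form: setting the gradient of l(µ, σ) to zero yields the sample mean and the (biased) sample standard deviation. A short sketch on synthetic data (all numbers assumed for illustration):

```python
import numpy as np

# Draw synthetic data from N(3, 2) and recover the parameters by MLE.
rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=100_000)

mu_hat = x.mean()                                  # argmax of l over mu
sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))    # note the 1/N (not 1/(N-1)) factor

assert abs(mu_hat - 3.0) < 0.05
assert abs(sigma_hat - 2.0) < 0.05
```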
Expectation Maximization
• Algorithm for approximate maximum likelihood parameter estimation when features are missing
• Situation:
  – Given a set of N data points x_i belonging to a training set D
  – The data are d-dimensional
  – Some of the data points are missing features, or have poorly measured values
  – A good data point has all of its features; a bad data point has some feature values missing or corrupted
• Separate the features into a good set D_g and a bad set D_b
• Using a guess θ^i, fix some of the parameters and form the expected log-likelihood over the unknown features:
  Q(θ; θ^i) = E_{D_b}[ ln p(D_g, D_b; θ) | D_g; θ^i ]
• Maximize Q with respect to the unfixed values; fix the found values
• Repeat for the previously fixed values
• Sometimes we prefer to apply EM even when there are no missing features
• Q may be simpler to optimize
• We get an approximate MLE solution
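A classic instance of this idea, sketched below: a two-component 1-D Gaussian mixture, where the unknown component labels play the role of the "missing features". The E-step fills them in as posterior responsibilities; the M-step maximizes the resulting Q. All numbers are illustrative assumptions, not from the slides.

```python
import numpy as np

# Synthetic data: two well-separated Gaussian clusters.
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(4.0, 1.0, 500)])

mu = np.array([-1.0, 1.0])        # assumed initial guesses
sigma = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

for _ in range(50):
    # E-step: posterior responsibility of each component for each point
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
           / (np.sqrt(2.0 * np.pi) * sigma)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters by maximizing the expected log-likelihood Q
    Nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / len(x)

est = np.sort(mu)
assert abs(est[0] + 2.0) < 0.3 and abs(est[1] - 4.0) < 0.3   # recovers the two means
```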
Maximum A Posteriori Estimation
• In MLE the estimated parameter vector θ̂ is not treated as a random variable
• This is against the philosophy of "Bayesian" methods, in which everything is random:
  – We have an estimate of a "prior" probability
  – We make a measurement
  – Based on this measurement we convert/update the prior probability to a "posterior" one
• Thus we are given a prior probability for the parameters, p(θ)
• In MAP methods, instead of maximizing the likelihood p(D|θ) we maximize p(D|θ)p(θ) (equivalently, l(θ) + ln p(θ))
• In this context, MLE can be viewed as finding the most likely values of θ assuming all values are equally likely a priori
MAP methods
• Goal: knowing the prior p(θ), compute the posterior p(θ|D)
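For a Gaussian likelihood with known variance and a Gaussian prior on the mean, maximizing l(µ) + ln p(µ) has a closed form: a precision-weighted blend of the prior mean and the sample mean. A sketch with illustrative (assumed) numbers:

```python
import numpy as np

# Data x_i ~ N(mu, sigma^2) with known sigma; prior mu ~ N(mu0, tau^2).
sigma, mu0, tau = 1.0, 0.0, 0.5
rng = np.random.default_rng(4)
x = rng.normal(2.0, sigma, size=20)
n = len(x)

mu_mle = x.mean()
# MAP estimate: argmax of l(mu) + ln p(mu), a precision-weighted average.
mu_map = (n * mu_mle / sigma**2 + mu0 / tau**2) / (n / sigma**2 + 1.0 / tau**2)

# The prior pulls the estimate toward mu0 = 0, so it lies between mu0 and the MLE.
assert 0.0 < mu_map < mu_mle
```

With a flat prior (τ → ∞) the prior term vanishes and the MAP estimate reduces to the MLE, which is the point made in the slide above.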
Sources
• Christopher Bishop, "Neural Networks for Pattern Recognition", Clarendon Press, 1995.
• R.O. Duda and P.E. Hart, "Pattern Classification and Scene Analysis", 1973 (new edition with D. Stork expected in 2000)
  – A classic, but a bit heavy
• "Numerical Recipes"
  – For a general discussion of MLE
• The web