Copyright © 2001-2018 K.R. Pattipati
Fall 2018
October 1, 8 & 15, 2018
Prof. Krishna R. Pattipati
Dept. of Electrical and Computer Engineering
University of Connecticut
Contact: [email protected], (860) 486-2890
Lectures 4 and 5: ML and Bayesian Learning,
Density Estimation, Gaussian Mixtures and EM,
Variational Bayesian Inference, and Performance
Assessment of Classifiers
Reading List
• Duda, Hart and Stork, Sections 3.1-3.6, Chapter 4
• Bishop, Section 2.5, Chapter 9, Sections 10.1, 10.2
• Murphy, Chapter 4, Chapter 11, Section 21.6
• Theodoridis, Chapter 7, Chapter 12
Lecture Outline
• Estimating Parameters of Densities From Data
– Maximum Likelihood Methods
– Bayesian Learning
• Estimating Probability Densities (Nonparametric)
– Histogram Methods
– Parzen Windows
– Probabilistic Neural Network
– k-nearest Neighbor Approach
• Mixture Models and EM
• Performance Assessment of Classifiers
Recall Bayesian Classifiers
Need to know $\{P(z=j),\ p(x\,|\,z=j,\theta_j)\}$. We usually estimate them from data:
• Parametric Methods (assume a form for the density)
• Nonparametric Methods: estimate the density via Parzen windows, PNN, RCE, NN, k-NN, ...
• Mixture Models: mixtures of Gaussians, Multinoullis, Multinomials, Student-t, ...
[Graphical model: hidden categorical label z, observed features x, with parameters $\{\mu_k, \Sigma_k\}$]
Estimating Parameters of Densities from Data
Estimating parameters of densities from data: need $\{P(z=k),\ p(x\,|\,z=k,\theta_k)\}$, $k = 1, 2, \ldots, C$. For example, in the Gaussian case,
$\theta_k = (\{\mu_k\},\{\Sigma_k\})$ (general case) or $(\{\mu_k\},\Sigma)$ (hyperellipsoid case) or $(\{\mu_k\},\sigma^2 I)$ (hypersphere case)
Data: $D_k: x_k^1, x_k^2, x_k^3, \ldots\,$, $k = 1, 2, 3, \ldots, C$. Let $n_k$ = number of samples from class $k$, with $\sum_{k=1}^{C}n_k = N$.
Assuming samples are independent,
$L(\theta) = p(D\,|\,\theta) = \prod_{k=1}^{C}\prod_{j=1}^{n_k}p(x_k^j\,|\,z=k,\theta_k)\,P(z=k)$
$l(\theta) = \ln L(\theta) = \ln p(D\,|\,\theta) = \sum_{k=1}^{C}\sum_{j=1}^{n_k}\ln p(x_k^j\,|\,z=k,\theta_k) + \sum_{k=1}^{C}n_k\ln\pi_k$, where $\pi_k := P(z=k)$.
Optimal ML Estimate of $\pi_k$
Optimal $\pi_k$: $\max_{\{\pi_k\}}\sum_{k=1}^{C}n_k\ln\pi_k$ s.t. $\sum_{k=1}^{C}\pi_k = 1$.
Lagrangian: $\max_{\{\pi_k\}}\Big[\sum_{k=1}^{C}n_k\ln\pi_k + \lambda\Big(1-\sum_{k=1}^{C}\pi_k\Big)\Big]$
$\Rightarrow n_k = \lambda\hat\pi_k$; summing over k, $\lambda = \sum_k n_k = N \Rightarrow \hat\pi_k = \frac{n_k}{N}$
The fraction of samples of class k. Intuitively appealing.
(Recall: ln x is concave, so -ln x is convex.)
What if some classes did not have samples in the training data?
Zero count or sparse data or black swan problem. Then use
$\hat\pi_k = \frac{n_k+1}{N+C}$: the Laplace rule of succession, or add-one smoothing.
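A minimal NumPy sketch of the two prior estimates above (an illustration, not from the slides; `labels` is a hypothetical array of integer class indices 0..C-1):

```python
import numpy as np

def class_priors(labels, C, smooth=False):
    """ML priors n_k/N, or Laplace add-one smoothing (n_k+1)/(N+C)."""
    counts = np.bincount(labels, minlength=C).astype(float)  # n_k per class
    if smooth:
        return (counts + 1.0) / (labels.size + C)  # no class gets zero mass
    return counts / labels.size                    # pure ML estimate

labels = np.array([0, 0, 1, 2, 2, 2])              # class 3 unseen (black swan)
print(class_priors(labels, C=4))                   # ML: last prior is 0
print(class_priors(labels, C=4, smooth=True))      # smoothed: all positive
```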
Optimal $\mu_k$: Hyperellipsoid Case
Optimal ML estimate of $\mu_k$ (when the covariance matrices for all classes are taken to be equal, i.e., $\Sigma_i = \Sigma$):
$l(\{\mu_k\},\Sigma) = \sum_{k=1}^{C}\sum_{j=1}^{n_k}\ln p(x_k^j\,|\,z=k,\mu_k,\Sigma) = -\frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{k=1}^{C}\sum_{j=1}^{n_k}(x_k^j-\mu_k)^T\Sigma^{-1}(x_k^j-\mu_k) + \text{const.}$
$\frac{\partial l}{\partial\mu_k} = \sum_{j=1}^{n_k}\Sigma^{-1}(x_k^j-\hat\mu_k) = 0 \;\Rightarrow\; \hat\mu_k = \frac{1}{n_k}\sum_{j=1}^{n_k}x_k^j,\quad k = 1, 2, \ldots, C$
The ML estimate for the mean is just the sample mean!
(Recall: $\nabla_x(x^TAx) = (A+A^T)x = 2Ax$ for symmetric A.)
Optimal $\Sigma$
We know $\frac{\partial l}{\partial\Sigma^{-1}} = 0$ at the optimum. Writing
$l(\{\hat\mu_k\},\Sigma) = \frac{N}{2}\ln|\Sigma^{-1}| - \frac{1}{2}\sum_{k=1}^{C}\sum_{j=1}^{n_k}(x_k^j-\hat\mu_k)^T\Sigma^{-1}(x_k^j-\hat\mu_k)$
$\frac{\partial l}{\partial\Sigma^{-1}} = \frac{N}{2}\hat\Sigma - \frac{1}{2}\sum_{k=1}^{C}\sum_{j=1}^{n_k}(x_k^j-\hat\mu_k)(x_k^j-\hat\mu_k)^T = 0$
$\Rightarrow \hat\Sigma = \frac{1}{N}\sum_{k=1}^{C}\sum_{j=1}^{n_k}(x_k^j-\hat\mu_k)(x_k^j-\hat\mu_k)^T$
The ML estimate of the covariance matrix is the arithmetic average of $(x_k^j-\hat\mu_k)(x_k^j-\hat\mu_k)^T$ over all classes.
We can show that $E[\hat\Sigma] = \frac{N-C}{N}\Sigma$, so use the unbiased version
$\hat\Sigma = \frac{1}{N-C}\sum_{k=1}^{C}\sum_{j=1}^{n_k}(x_k^j-\hat\mu_k)(x_k^j-\hat\mu_k)^T$
Aside: for A positive definite, $|A| = \prod_{i=1}^{n}\lambda_i$ and $|A| = \exp(\ln|A|) = \exp(\operatorname{tr}\ln A)$, i.e., $\ln|A| = \operatorname{tr}(\ln A)$; also $\frac{\partial\ln|A|}{\partial A} = A^{-1}$ for symmetric A.
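A small NumPy sketch of the pooled ML covariance and its bias-corrected form (an assumption: the data arrive as a list of per-class arrays `X_k` of shape (n_k, p)):

```python
import numpy as np

def pooled_covariance(class_data, unbiased=False):
    """Pooled covariance; E[ML estimate] = ((N - C)/N) * Sigma."""
    N = sum(X.shape[0] for X in class_data)
    C = len(class_data)
    p = class_data[0].shape[1]
    S = np.zeros((p, p))
    for X in class_data:
        d = X - X.mean(axis=0)   # subtract the class sample mean
        S += d.T @ d             # within-class scatter
    return S / (N - C) if unbiased else S / N

rng = np.random.default_rng(0)
data = [rng.normal(size=(20, 3)) + m for m in (0.0, 2.0, -1.0)]
print(np.round(pooled_covariance(data, unbiased=True), 2))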
Proof of $E[\hat\Sigma] = \frac{N-C}{N}\Sigma$
$\hat\Sigma = \frac{1}{N}\sum_{k=1}^{C}\sum_{j=1}^{n_k}(x_k^j-\hat\mu_k)(x_k^j-\hat\mu_k)^T$, and we know $\hat\mu_k = \frac{1}{n_k}\sum_{j=1}^{n_k}x_k^j$.
$\hat\Sigma = \frac{1}{N}\sum_{k=1}^{C}\Big[\sum_{j=1}^{n_k}x_k^j x_k^{jT} - n_k\hat\mu_k\hat\mu_k^T\Big]$
Using $E[x_k^j x_k^{jT}] = \mu_k\mu_k^T + \Sigma$ and $E[\hat\mu_k\hat\mu_k^T] = \mu_k\mu_k^T + \frac{1}{n_k}\Sigma$,
$E[\hat\Sigma] = \frac{1}{N}\sum_{k=1}^{C}\Big[n_k(\mu_k\mu_k^T+\Sigma) - n_k\Big(\mu_k\mu_k^T + \frac{1}{n_k}\Sigma\Big)\Big] = \frac{1}{N}\sum_{k=1}^{C}(n_k-1)\Sigma = \frac{N-C}{N}\,\Sigma$
Shrinkage Methods
• Linear discriminants may outperform quadratic discriminants when training data is small.
• Shrink the covariance matrices toward a common value (a convex combination of $\hat\Sigma_k$ and $\hat\Sigma$): possibly biased, but results in a less variable estimator:
$\hat\Sigma_k(\alpha) = \frac{(1-\alpha)\,n_k\hat\Sigma_k + \alpha N\hat\Sigma}{(1-\alpha)\,n_k + \alpha N}$
$\hat\Sigma(\gamma) = (1-\gamma)\hat\Sigma + \gamma\,\frac{\operatorname{tr}(\hat\Sigma)}{p}\,I;\quad 0 \le \gamma \le 1$
• Regularized Discriminant Analysis: $\hat\Sigma_k(\alpha,\gamma) = (1-\gamma)\hat\Sigma_k(\alpha) + \gamma\,\frac{\operatorname{tr}(\hat\Sigma_k(\alpha))}{p}\,I$
• Select $\alpha$ and $\gamma$ to minimize the error rate via cross-validation.
• If variables are scaled to have zero mean and unit variance: $\hat\Sigma(\gamma) = \gamma\operatorname{diag}(\hat\Sigma) + (1-\gamma)\hat\Sigma$. This has a Bayesian (MAP) interpretation.
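A hedged NumPy sketch of the two shrinkage steps above (RDA-style); in practice $\alpha$ and $\gamma$ would be chosen by cross-validation, not hard-coded as here:

```python
import numpy as np

def shrink_class_cov(S_k, n_k, S_pooled, N, alpha):
    """Shrink a class covariance toward the pooled covariance."""
    return (((1 - alpha) * n_k * S_k + alpha * N * S_pooled)
            / ((1 - alpha) * n_k + alpha * N))

def shrink_to_identity(S, gamma):
    """Shrink a covariance toward the scaled identity (tr(S)/p) I."""
    p = S.shape[0]
    return (1 - gamma) * S + gamma * (np.trace(S) / p) * np.eye(p)

S_k = np.array([[2.0, 0.9], [0.9, 1.0]])   # noisy class estimate (n_k small)
S = np.array([[1.5, 0.3], [0.3, 1.2]])     # pooled estimate (N large)
S_rda = shrink_to_identity(shrink_class_cov(S_k, 15, S, 200, 0.5), 0.2)
print(np.round(S_rda, 3))
```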
ML-based Discriminants
Unequal Covariance Case: $\hat\mu_k = \frac{1}{n_k}\sum_{j=1}^{n_k}x_k^j;\qquad \hat\Sigma_k = \frac{1}{n_k-1}\sum_{j=1}^{n_k}(x_k^j-\hat\mu_k)(x_k^j-\hat\mu_k)^T$
As $N\to\infty$, they approach the performance of the optimal Bayesian classifier. However, for finite N, a linear rule may outperform a quadratic one!
Note that we need $n_k \ge p$ for each class. If not, use shrinkage methods.
ML-estimated quadratic rule ($C + Cp + Cp(p+1)/2$ parameters):
$j = \arg\max_{k\in\{1,2,\ldots,C\}}\Big[\ln\hat\pi_k - \tfrac{1}{2}\ln|\hat\Sigma_k| - \tfrac{1}{2}(x-\hat\mu_k)^T\hat\Sigma_k^{-1}(x-\hat\mu_k)\Big]$
ML-estimated best linear rule ($C + Cp + p(p+1)/2$ parameters):
$j = \arg\max_{k\in\{1,2,\ldots,C\}}\Big[\ln\hat\pi_k + \hat\mu_k^T\hat\Sigma^{-1}x - \tfrac{1}{2}\hat\mu_k^T\hat\Sigma^{-1}\hat\mu_k\Big]$
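A minimal sketch of the two decision rules above; the means, covariances and priors are assumed to be the ML estimates computed earlier:

```python
import numpy as np

def quadratic_score(x, mu, Sigma, prior):
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return np.log(prior) - 0.5 * logdet - 0.5 * d @ np.linalg.solve(Sigma, d)

def linear_score(x, mu, Sigma_pooled, prior):
    w = np.linalg.solve(Sigma_pooled, mu)          # Sigma^{-1} mu_k
    return np.log(prior) + w @ x - 0.5 * w @ mu

x = np.array([0.5, -0.2])
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), 2 * np.eye(2)]
priors = [0.6, 0.4]
print(np.argmax([quadratic_score(x, m, S, P)        # quadratic rule
                 for m, S, P in zip(mus, Sigmas, priors)]))
Sig_pool = np.eye(2)                                 # pooled covariance
print(np.argmax([linear_score(x, m, Sig_pool, P)     # linear rule
                 for m, P in zip(mus, priors)]))
```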
Recursive Estimation of Parameters-1
What if "big streaming data"? Recursive estimation of the mean:
$\hat\mu_k^{n+1} = \Big(1-\frac{1}{n+1}\Big)\hat\mu_k^n + \frac{1}{n+1}\,x^{n+1} = \hat\mu_k^n + \frac{1}{n+1}\big(x^{n+1}-\hat\mu_k^n\big)$
• In general, we can use the SA (stochastic approximation) algorithm. The ML condition
$\frac{1}{n_k}\sum_{j=1}^{n_k}\nabla_\theta\ln p(x_k^j\,|\,z=k,\theta) = 0 \;\xrightarrow{\,n_k\to\infty\,}\; E[\nabla_\theta\ln p(x\,|\,z=k,\theta)] = 0$
suggests the recursion $\hat\theta_k^{n+1} = \hat\theta_k^n + \eta_n\,\nabla_\theta\ln p(x^{n+1}\,|\,z=k,\theta)\big|_{\theta=\hat\theta_k^n}$.
Idea of Stochastic Approximation: let $f(\theta) = E[g(\theta)\,|\,\theta]$; we seek roots of $f(\theta) = u$; assume $E[(g-f)^2\,|\,\theta] \le \sigma^2$. Iterate
$\theta^{n+1} = \theta^n + \eta_n\,[u - g(\theta^n)]$
with gains satisfying $\lim_{n\to\infty}\eta_n = 0$, $\sum_{n=1}^{\infty}\eta_n = \infty$, $\sum_{n=1}^{\infty}\eta_n^2 < \infty$.
$\eta_n = 1/n$, $\eta_n = K/(n+1)$, and $\eta_n = K/n^m$ with $\tfrac{1}{2} < m \le 1$ satisfy the SA conditions.
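A small sketch of the recursive (streaming) mean update above; each new sample nudges the estimate with a 1/(n+1) stochastic-approximation gain:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_hat, n = np.zeros(2), 0
for _ in range(10_000):
    x = rng.normal(loc=[3.0, -1.0], scale=1.0, size=2)  # streaming sample
    mu_hat += (x - mu_hat) / (n + 1)   # mu^{n+1} = mu^n + (x - mu^n)/(n+1)
    n += 1
print(np.round(mu_hat, 2))             # close to the true mean [3, -1]
```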
Recursive Estimation of Parameters-2
Unequal Covariance Case. For class k:
$\hat\mu_k = \frac{1}{n_k}\sum_{j=1}^{n_k}x_k^j;\qquad \hat\Sigma_k = \frac{1}{n_k-1}\sum_{j=1}^{n_k}(x_k^j-\hat\mu_k)(x_k^j-\hat\mu_k)^T$
Mean recursion (after n samples of class k):
$\hat\mu_k^{n+1} = \hat\mu_k^n + \frac{1}{n+1}\,\big(x_k^{n+1}-\hat\mu_k^n\big)$
Covariance recursion for class k: expanding $\hat\Sigma_k^{n+1} = \frac{1}{n}\sum_{j=1}^{n+1}(x_k^j-\hat\mu_k^{n+1})(x_k^j-\hat\mu_k^{n+1})^T$ and substituting the mean recursion,
$\hat\Sigma_k^{n+1} = \frac{n-1}{n}\,\hat\Sigma_k^n + \frac{1}{n+1}\,\big(x_k^{n+1}-\hat\mu_k^n\big)\big(x_k^{n+1}-\hat\mu_k^n\big)^T$
Recursive Estimation of Parameters-3
Covariance Recursion in the Hyperellipsoid Case
Update means via $\hat\mu_i^{n+1} = \Big(1-\frac{1}{n_i+1}\Big)\hat\mu_i^n + \frac{1}{n_i+1}\,x^{n+1};\quad i = 1, 2, \ldots, C$
For the pooled covariance $\hat\Sigma = \frac{1}{n-C}\sum_{k=1}^{C}\sum_j(x_k^j-\hat\mu_k)(x_k^j-\hat\mu_k)^T$, suppose the new (n+1)st sample is from class l. Then
$\hat\Sigma^{n+1} = \frac{n-C}{n+1-C}\,\hat\Sigma^n + \frac{1}{n+1-C}\cdot\frac{n_l}{n_l+1}\,\big(x^{n+1}-\hat\mu_l^n\big)\big(x^{n+1}-\hat\mu_l^n\big)^T$
Bayesian Learning
The posterior probability $P(z=i\,|\,x)$ is the key.
Have data: $D_k = \{x_k^1, x_k^2, \ldots, x_k^{n_k}\}$, $k = 1, 2, \ldots, C$; $D = D_1\cup D_2\cup\cdots\cup D_C$.
We can only get $P(z=k\,|\,x,D_k)$:
$P(z=k\,|\,x,D) = \frac{p(x\,|\,z=k,D_k)\,P(z=k)}{\sum_{i=1}^{C}p(x\,|\,z=i,D_i)\,P(z=i)}$
Class-conditional density and posterior density of parameters:
$p(x\,|\,z=k,D_k) = \int p(x\,|\,z=k,\theta)\,p(\theta\,|\,z=k,D_k)\,d\theta$
$p(\theta\,|\,z=k,D_k) = \frac{p(D_k\,|\,z=k,\theta)\,p(\theta\,|\,z=k)}{\int p(D_k\,|\,z=k,\theta)\,p(\theta\,|\,z=k)\,d\theta} = \frac{\prod_{j=1}^{n_k}p(x_k^j\,|\,z=k,\theta)\,p(\theta\,|\,z=k)}{\int\prod_{j=1}^{n_k}p(x_k^j\,|\,z=k,\theta)\,p(\theta\,|\,z=k)\,d\theta}$
If $p(D_k\,|\,z=k,\theta)$ has a sharp peak at $\hat\theta$, then so does $p(\theta\,|\,z=k,D_k)$.
Tough to compute unless a reproducing density. Need simulation.
Illustration of Bayesian Learning - 1
Bayesian learning of the mean of Gaussian distributions in one dimension.
Strong prior (small variance) ⇒ the posterior mean "shrinks" towards the prior mean.
Weak prior (large variance) ⇒ the posterior mean is similar to the MLE.
[Figure: two panels, prior variance = 1.00 and prior variance = 5.00, each showing the prior, likelihood, and posterior; gaussInferParamsMean1d from Murphy, Page 121]
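A hedged one-dimensional sketch of the conjugate-Gaussian posterior for the mean (known noise variance), reproducing the shrinkage behavior described above:

```python
import numpy as np

def posterior_mean_1d(x, mu0, var0, noise_var):
    """Posterior N(mu_n, var_n) for a Gaussian mean with prior N(mu0, var0)."""
    n = x.size
    var_n = 1.0 / (1.0 / var0 + n / noise_var)          # posterior variance
    mu_n = var_n * (mu0 / var0 + x.sum() / noise_var)   # posterior mean
    return mu_n, var_n

x = np.random.default_rng(2).normal(2.0, 1.0, size=5)
print(posterior_mean_1d(x, mu0=0.0, var0=1.0, noise_var=1.0))  # strong prior
print(posterior_mean_1d(x, mu0=0.0, var0=5.0, noise_var=1.0))  # weak prior
```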
Illustration of Bayesian Learning - 2
Bayesian learning of the mean of Gaussian distributions in one and two dimensions.
As the number of samples increases, the posterior density peaks at the true value.
Sensor Fusion
Bayesian sensor fusion appropriately combines measurements from multiple sensors based on the uncertainty of the sensor measurements. Larger uncertainty ⇒ less weight.
[Figure: three panels: equally reliable sensors (R, G); G more reliable than R; R more reliable in y, G more reliable in x. The fused estimate is shown in black; sensorFusion2d from Murphy, Page 123]
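A minimal sketch of precision-weighted fusion of two Gaussian sensor readings of the same 2-D quantity (the behavior shown in the panels above):

```python
import numpy as np

def fuse(y1, R1, y2, R2):
    J1, J2 = np.linalg.inv(R1), np.linalg.inv(R2)   # information matrices
    P = np.linalg.inv(J1 + J2)                      # fused covariance
    return P @ (J1 @ y1 + J2 @ y2), P               # fused mean, covariance

y1, y2 = np.array([0.0, 0.0]), np.array([1.0, -1.0])
R1 = np.diag([0.01, 1.0])    # sensor 1 reliable in x, noisy in y
R2 = np.diag([1.0, 0.01])    # sensor 2 noisy in x, reliable in y
mu, P = fuse(y1, R1, y2, R2)
print(np.round(mu, 3))       # fused estimate leans on each sensor's strength
```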
Recursive Bayesian Learning
Likelihood Recursion: let $D_k^{n_k} = \{x_k^1, x_k^2, \ldots, x_k^{n_k-1}, x_k^{n_k}\} = \{D_k^{n_k-1}, x_k^{n_k}\}$. Then
$p(D_k^{n_k}\,|\,z=k,\theta) = p(x_k^{n_k}\,|\,z=k,\theta)\;p(D_k^{n_k-1}\,|\,z=k,\theta)$
Recursion for the Posterior Density:
$p(\theta\,|\,z=k,D_k^{n_k}) = \frac{p(x_k^{n_k}\,|\,z=k,\theta)\,p(\theta\,|\,z=k,D_k^{n_k-1})}{\int p(x_k^{n_k}\,|\,z=k,\theta)\,p(\theta\,|\,z=k,D_k^{n_k-1})\,d\theta}$, where $p(\theta\,|\,z=k,D_k^0) = p(\theta\,|\,z=k)$.
Problem: in general, we need to store all training samples to calculate $p(\theta\,|\,z=k,D_k^{n_k-1})$.
For the exponential family (e.g., Gaussian, exponential, Rayleigh, Gamma, Beta, Poisson, Bernoulli, Binomial, Multinomial) we need only a few parameters to characterize $p(\theta\,|\,z=k,D_k^{n_k-1})$. They are called sufficient statistics.
Recursive Learning of Mean and Covariance
Want to learn both the mean vector and the covariance:
$p(x\,|\,\mu,\Sigma) = N(\mu,\Sigma)$, with prior $p(\mu,\Sigma) = p(\mu\,|\,\Sigma)\,p(\Sigma) = NIW(\mu,\Sigma\,|\,m_0,k_0,\nu_0,S_0)$ (NIW = Normal-Inverse-Wishart):
$p(\mu\,|\,\Sigma) = N\Big(\mu;\, m_0,\, \frac{1}{k_0}\Sigma\Big)$ (Gaussian)
$p(\Sigma) = IW(\Sigma;\, S_0,\nu_0) \propto |\Sigma|^{-(\nu_0+p+1)/2}\exp\Big(-\frac{1}{2}\operatorname{tr}(S_0\Sigma^{-1})\Big)$ (inverse Wishart)
Given $D = \{x^1, x^2, \ldots, x^n\}$: $p(\mu,\Sigma\,|\,D) = NIW(\mu,\Sigma\,|\,m_n,k_n,\nu_n,S_n)$, where
$k_n = k_0 + n;\qquad \nu_n = \nu_0 + n$
$m_n = \frac{k_0}{k_n}m_0 + \frac{n}{k_n}\bar x = m_0 + \frac{n}{k_n}(\bar x - m_0);\qquad \bar x = \frac{1}{n}\sum_{i=1}^{n}x^i$
$S_n = S_0 + \sum_{i=1}^{n}(x^i-\bar x)(x^i-\bar x)^T + \frac{k_0 n}{k_n}(\bar x - m_0)(\bar x - m_0)^T$ (HW: get a recursive expression)
The Wishart is a generalization of the Gamma and chi-squared distributions.
Bayesian Generalization of Laplacian Smoothing
Recall $L(\{\pi_k\}) \propto \prod_{k=1}^{C}\pi_k^{n_k}$, i.e., $p(D\,|\,\pi) \propto \prod_k\pi_k^{n_k}$, with $N = \sum_{k=1}^{C}n_k$.
Conjugate prior: Dirichlet distribution
$p(\pi\,|\,\alpha) = Dir(\pi\,|\,\alpha) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_C)}\prod_{k=1}^{C}\pi_k^{\alpha_k-1};\qquad \alpha_0 = \sum_{k=1}^{C}\alpha_k$
Posterior:
$p(\pi\,|\,D,\alpha) = \frac{p(D\,|\,\pi)\,p(\pi\,|\,\alpha)}{p(D\,|\,\alpha)} = Dir(\pi\,|\,\alpha_1+n_1,\ldots,\alpha_C+n_C)$
MAP estimate: $\hat\pi_k^{MAP} = \frac{n_k+\alpha_k-1}{N+\alpha_0-C}$.  Conditional mean (MMSE estimate): $\hat\pi_k^{MMSE} = \frac{n_k+\alpha_k}{N+\alpha_0}$.
Mode and Mean are not the same here!
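A short sketch of the two Dirichlet-posterior point estimates above (an illustration; `alpha` is an assumed symmetric prior pseudo-count vector):

```python
import numpy as np

counts = np.array([3.0, 1.0, 0.0])       # n_k, with one class unseen
alpha = np.full(3, 2.0)                  # Dirichlet prior parameters
N, a0, C = counts.sum(), alpha.sum(), counts.size

pi_map = (counts + alpha - 1) / (N + a0 - C)   # posterior mode
pi_mmse = (counts + alpha) / (N + a0)          # posterior mean
print(pi_map, pi_mmse)                         # mode != mean for Dirichlet
```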
Operations on Gaussians - 1
Sum of Gaussian vectors is Gaussian:
$x_1\sim N(\mu_1,\Sigma_{11});\ x_2\sim N(\mu_2,\Sigma_{22});\ \operatorname{cov}(x_1,x_2)=\Sigma_{12}$
$\Rightarrow x_1+x_2 \sim N(\mu_1+\mu_2,\ \Sigma_{11}+\Sigma_{22}+\Sigma_{12}+\Sigma_{12}^T)$
Linear transformations of Gaussians are Gaussian: $x\sim N(\mu,\Sigma) \Rightarrow y = Ax \sim N(A\mu,\ A\Sigma A^T)$
Marginal and conditional densities: let
$x = \begin{pmatrix}x_1\\x_2\end{pmatrix}\sim N(\mu,\Sigma);\quad \Sigma = \begin{pmatrix}\Sigma_{11}&\Sigma_{12}\\ \Sigma_{12}^T&\Sigma_{22}\end{pmatrix};\quad J = \Sigma^{-1} = \begin{pmatrix}J_{11}&J_{12}\\ J_{12}^T&J_{22}\end{pmatrix}$
Then the marginal & conditional densities are also Gaussian:
$p(x_1) = N(\mu_1,\Sigma_{11})$
$p(x_1\,|\,x_2) = N\big(\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2),\ \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}^T\big) = N\big(\mu_1 - J_{11}^{-1}J_{12}(x_2-\mu_2),\ J_{11}^{-1}\big)$
with $J_{11} = \big(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}^T\big)^{-1};\qquad J_{12} = -J_{11}\Sigma_{12}\Sigma_{22}^{-1}$
This is what happens in Least Squares & Kalman filtering. If $x_1$ is scalar (e.g., Gibbs sampling), the information form could save computation.
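A compact sketch of the conditioning formulas above for a joint N(mu, Sigma) split into blocks $x_1$, $x_2$:

```python
import numpy as np

def condition(mu, Sigma, idx1, idx2, x2):
    """Return the mean and covariance of p(x1 | x2)."""
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    K = S12 @ np.linalg.inv(S22)             # regression gain
    mu_c = mu[idx1] + K @ (x2 - mu[idx2])    # conditional mean
    Sigma_c = S11 - K @ S12.T                # conditional covariance
    return mu_c, Sigma_c

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
print(condition(mu, Sigma, [0], [1], np.array([1.0])))   # N(0.8, 0.36)
```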
Marginals & Conditionals of 2D Gaussian
Marginal and conditional densities: let
$x = \begin{pmatrix}x_1\\x_2\end{pmatrix}\sim N(\mu,\Sigma);\quad \mu = 0;\quad \Sigma = \begin{pmatrix}1&0.8\\0.8&1\end{pmatrix}$
$p(x_1) = N(\mu_1,\sigma_1^2) = N(0,1)$
$p(x_1\,|\,x_2) = N\big(\mu_1+\rho\tfrac{\sigma_1}{\sigma_2}(x_2-\mu_2),\ \sigma_1^2(1-\rho^2)\big)$. If $x_2 = 1$, $p(x_1\,|\,x_2=1) = N(0.8,\ 0.36)$.
$J = \Sigma^{-1} = \frac{1}{1-\rho^2}\begin{pmatrix}1&-\rho\\-\rho&1\end{pmatrix} = \begin{pmatrix}2.778&-2.222\\-2.222&2.778\end{pmatrix}$
[Figure: joint $p(x_1,x_2)$, marginal $p(x_1)$, and conditional $p(x_1|x_2=1)$; gaussCondition2Ddemo2 from Murphy, Page 112]
Operations on Gaussians - 2
The covariance matrix captures marginal independencies between variables:
$\Sigma_{ij} = 0 \Leftrightarrow x_i\text{ and }x_j\text{ are independent }(x_i\perp x_j);\qquad J = \Sigma^{-1}$
The information matrix captures conditional independencies:
$J_{ij} = 0 \Leftrightarrow x_i \perp x_j \mid \{x_1,x_2,\ldots,x_n\}\setminus\{x_i,x_j\}$
Non-zero entries in J correspond to edges in the dependency network.
The product of Gaussian PDFs is proportional to a Gaussian (information form):
$p_i(x;\mu_i,J_i) = \frac{|J_i|^{1/2}}{(2\pi)^{p/2}}\exp\big\{-\tfrac{1}{2}(x-\mu_i)^TJ_i(x-\mu_i)\big\} = \exp\big(a_i+\eta_i^Tx-\tfrac{1}{2}x^TJ_ix\big)$
with $\eta_i = J_i\mu_i;\qquad a_i = -\tfrac{1}{2}\big(p\ln 2\pi - \ln|J_i| + \mu_i^TJ_i\mu_i\big)$
$\prod_{i=1}^{n}p_i(x;\mu_i,J_i) \propto p(x;\mu,J)$, where $J = \sum_{i=1}^{n}J_i$ and $J\mu = \sum_{i=1}^{n}J_i\mu_i$.
Valid for division also.
Sampling From Multivariate Gaussian
$x \sim N(\mu,\Sigma) = N(\mu,J^{-1})$; J is the Precision (or Information) Matrix.
Method 1: eigendecomposition $\Sigma = Q\Lambda Q^T$; $z\sim N(0,I)$; $x = \mu + Q\Lambda^{1/2}z$
Method 2: Cholesky decomposition of $\Sigma = LL^T$; $z\sim N(0,I)$; $x = \mu + Lz$
Method 3: given the Cholesky decomposition of the precision matrix, $J = BB^T$: $z\sim N(0,I)$; solve $B^Ty = z$; $x = \mu + y$
Method 4: Gibbs sampler using the covariance matrix / precision matrix. Let $x\sim N(\mu,\Sigma) = N(\mu,J^{-1})$. Then
$p(x_i\,|\,x_{-i}) \sim N\big(\mu_i + \Sigma_{i,-i}\Sigma_{-i,-i}^{-1}(x_{-i}-\mu_{-i}),\ \Sigma_{ii}-\Sigma_{i,-i}\Sigma_{-i,-i}^{-1}\Sigma_{-i,i}\big) = N\Big(\mu_i - \frac{1}{J_{ii}}J_{i,-i}(x_{-i}-\mu_{-i}),\ \frac{1}{J_{ii}}\Big)$
Methods 1 and 2 require $O(n^3)$ computation. Method 3 can exploit the sparsity of J.
The information form of the Gibbs sampler does not require matrix inversion! Relation to Gauss-Seidel, successive over-relaxation, etc.: https://arxiv.org/pdf/1505.03512.pdf
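A sketch of Methods 2 and 3 above: sampling N(mu, Sigma) from a Cholesky factor of the covariance, or of the precision $J = \Sigma^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

L = np.linalg.cholesky(Sigma)             # Sigma = L L^T (Method 2)
x_cov = mu + L @ rng.standard_normal(2)

J = np.linalg.inv(Sigma)                  # precision matrix
B = np.linalg.cholesky(J)                 # J = B B^T (Method 3)
z = rng.standard_normal(2)
y = np.linalg.solve(B.T, z)               # solve B^T y = z => cov(y) = J^{-1}
x_prec = mu + y
print(x_cov, x_prec)                      # two draws from N(mu, Sigma)
```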
ML versus Bayesian Learning

| ML | Bayesian |
|---|---|
| Computationally simpler (calculus, optimization) | Complex numerical integration (unless a reproducing density) |
| Single best model; easier to interpret | Weighted average of models |
| Does not use a priori information on $\theta$ | Uses a priori information on $\theta$ |
| $p(x\,|\,z=k,D_k) \approx p(x\,|\,z=k,\hat\theta_{ML})$ | $p(x\,|\,z=k,D_k) = \int p(x\,|\,z=k,\theta)\,p(\theta\,|\,z=k,D_k)\,d\theta$ |
Density Estimation: Histograms
• Assume x was already standardized; then, to construct a histogram, divide the range into a set of B evenly spread bins. Then,
$h(b) = \frac{K_b}{N},\quad b = 1, 2, \ldots, B$
where $K_b$ = number of points which fall inside bin b.
The choice of B is crucial:
B large ⇒ estimated density is spiky ("noisy")
B small ⇒ smoothed density
⇒ there is an optimum choice for B.
We can construct the histogram sequentially, considering data one point at a time.
[Figure: histograms of the same data with B = 3, B = 7, and B = 12]
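A tiny sketch of the histogram estimate $h(b) = K_b/N$ over B equal bins (per the slide's convention; dividing by the bin width would give a proper density):

```python
import numpy as np

x = np.random.default_rng(4).normal(size=500)
B = 12
edges = np.linspace(x.min(), x.max(), B + 1)
K_b, _ = np.histogram(x, bins=edges)   # counts per bin
h = K_b / x.size                       # estimated bin probabilities
print(np.round(h, 3), h.sum())         # probabilities sum to 1
```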
Major Problems with the Histogram Algorithm
• The estimated density is not smooth (it has discontinuities at the boundaries of the bins).
• In high dimensions, we need $B^p$ bins (curse of dimensionality ⇒ requires a huge number of data points to estimate the density).
• In high dimensions:
– Density is concentrated in a small part of the space.
– Most of the bins will be empty ⇒ estimated density = 0.
– As the number of dimensions grows, a thin shell of constant thickness on the interior of a sphere ends up containing almost all of its volume ⇒ most of the volume is near the surface!
Kernel and K-nearest Neighbor Methods-1
General idea: suppose we want to find p(x), and suppose we draw N points from p(x).
$\operatorname{Prob}(x\in R) = P = \int_R p(x')\,dx';\qquad \operatorname{Prob}(x\notin R) = 1-P$
$\operatorname{Prob}(k\text{ of these fall in }R) = \binom{N}{k}P^k(1-P)^{N-k} = B(k;N,P)$
Expected number falling in region R:
$E[k] = \sum_{k=0}^{N}k\binom{N}{k}P^k(1-P)^{N-k} = NP\sum_{k=1}^{N}\binom{N-1}{k-1}P^{k-1}(1-P)^{N-k} = NP$
So, $E[k] = NP$.
Kernel and K-nearest Neighbor Methods-2
Expected fraction of points falling in region R: $E\big[\tfrac{k}{N}\big] = P$
Variance of $\tfrac{k}{N}$: $E\Big[\big(\tfrac{k}{N}-P\big)^2\Big] = \frac{P(1-P)}{N}$
As $N\to\infty$, the variance → 0, so $\tfrac{k}{N}$ is a good estimate of P . . . . . .(1)
Also, $P = \int_R p(x')\,dx' \approx p(x)\,V$ . . . . . .(2)
So, $p(x) \approx \frac{k}{NV}$ . . . . . .(3)
Note that R should be large for (1) to hold. However, R should be small for (2) to hold ⇒ an optimal choice for R.
Effect of N on Probability Estimates
We are trying to estimate P via $\tfrac{k}{N}$. As N increases, the estimate peaks at the true value (it becomes a delta function).
Kernel and K-nearest Neighbor Methods-3
There are basically two approaches to using Eq. (3) for density estimation:
• Fix V and determine k from the data ⇒ Kernel-Based Estimation
• Fix k and determine the corresponding volume from the data ⇒ k-nearest neighbor approach.
Major disadvantage: needs all the data.
Both of these methods converge to the true densities as $N\to\infty$, provided:
• $V\to 0$ as $N\to\infty$ and $NV\to\infty$ as $N\to\infty$
• $k\to\infty$ as $N\to\infty$ and $k/N\to 0$
Typically, select $V = \frac{1}{\sqrt{N}}$ for kernel-based methods; select $k = \sqrt{N}$ for k-nearest neighbor methods.
Selection of V and k
[Figure]
Kernel Estimators-1
• Suppose we take the region R to be a hypercube with sides of length h centered on the point x. Its volume is $V = h^p$.
• We can find an expression for k, the number of points which fall within this region, by defining a Kernel Function H(u), also known as the Parzen Window:
$H(u) = \begin{cases}1 & \text{if } |u_i| \le \tfrac{1}{2},\ i = 1, 2, \ldots, p\\ 0 & \text{otherwise}\end{cases}$
H(u) is a unit hypercube centered at the origin. (Gaussian windows could be used as well.)
For each data point $x^j$, the quantity $H\big(\frac{x-x^j}{h}\big)$ is equal to unity (1) if the point $x^j$ falls inside the hypercube of side h centered at x, and 0 otherwise. The total number of points falling inside the hypercube is
$k = \sum_{j=1}^{N}H\Big(\frac{x-x^j}{h}\Big)$
$\hat p(x) = \frac{k}{Nh^p} = \frac{1}{Nh^p}\sum_{j=1}^{N}H\Big(\frac{x-x^j}{h}\Big)$
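A minimal sketch of the hypercube Parzen estimate $\hat p(x)$ above:

```python
import numpy as np

def parzen_hypercube(x, data, h):
    """p_hat(x) = (1/(N h^p)) sum_j H((x - x_j)/h), H = unit hypercube."""
    u = (x - data) / h                          # (N, p) scaled offsets
    inside = np.all(np.abs(u) <= 0.5, axis=1)   # H(u) = 1 inside the cube
    N, p = data.shape
    return inside.sum() / (N * h**p)

data = np.random.default_rng(5).normal(size=(1000, 2))
# roughly the N(0, I) peak density 1/(2*pi) ~ 0.159 for moderate h
print(parzen_hypercube(np.zeros(2), data, h=0.5))
```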
Gaussian Parzen Windows
$H\Big(\frac{x}{h}\Big) = \frac{1}{(2\pi)^{p/2}h^p}\exp\Big[-\frac{1}{2}\frac{\|x\|^2}{h^2}\Big]$
[Figure: example of two-dimensional circularly symmetric normal Parzen windows for three different values of h]
Kernel Estimators-2
$\hat p(x) = \frac{k}{NV} = \frac{1}{Nh^p}\sum_{j=1}^{N}H\Big(\frac{x-x^j}{h}\Big)$
$\hat p(x)$ = a superposition of N cubes of side h, with each cube centered on one of the data points.
We can smooth out this estimate by choosing different forms for the kernel function H(u). Example: Gaussian Parzen Windows.
$E[\hat p(x)] = \frac{1}{N}\sum_{j=1}^{N}E\Big[\frac{1}{h^p}H\Big(\frac{x-x^j}{h}\Big)\Big] = \frac{1}{h^p}E\Big[H\Big(\frac{x-v}{h}\Big)\Big] = \int\frac{1}{h^p}H\Big(\frac{x-v}{h}\Big)p(v)\,dv$
A convolution of H and p: a blurred version of p(x) as seen through the window.
Large N ⇒ good estimate for h → 0.
Small N ⇒ need to select h properly; for small N, h small ⇒ noisy estimate.
Illustration of Density Estimation: Effect of h
Three Parzen-window density estimates based on
the same set of five samples, using the Gaussian
Parzen window functions
Density Estimation: Effects of h and N
Parzen-window estimates of a univariate normal density using
different window widths and numbers of samples
Bivariate Density Estimation
Parzen-window estimates of a bivariate normal density using
different window widths and numbers of samples
Bimodal Density Estimation
Parzen-window estimates of a bimodal distribution using different window widths and numbers of samples. Note that as n → ∞, the estimates are the same and match the true distribution.
Gaussian Windows: PNN
Volume of a hypersphere in p dimensions of radius $\sigma$: $V = \frac{2\sigma^p\pi^{p/2}}{p\,\Gamma(p/2)}$, where $\Gamma(a) = \int_0^\infty u^{a-1}e^{-u}\,du$.
$\hat p(x) = \frac{1}{N}\sum_{j=1}^{N}\frac{1}{(2\pi)^{p/2}\sigma^p}\exp\Big[-\frac{(x-x^j)^T(x-x^j)}{2\sigma^2}\Big]$
= a sum of multivariate Gaussian distributions centered at each training sample.
(Can also form $\hat p(x\,|\,z=k)$ for each class, replacing N by $n_k$ and summing over the samples $x^j$ from class k.)
Also called a "Probabilistic Neural Network (PNN)".
Small $\sigma$ ⇒ the density estimate will have discontinuities.
Larger $\sigma$ causes a greater degree of interpolation (smoothness).
Probabilistic Neural Network-1
• Note that each component of x has the same $\sigma$ ⇒ must scale the $x_i$ to have the same range.
Scale, compute $\hat\Sigma_k$, and set
$\hat\sigma_k^2 = \frac{1}{p}\operatorname{tr}(\hat\Sigma_k);\qquad \hat\sigma^2 = \frac{1}{N}\sum_{k=1}^{C}n_k\hat\sigma_k^2$
The PNN is fairly insensitive to $\sigma$ over a wide range. Sometimes it is easier to experiment with $\sigma$ than to compute it this way.
Probabilistic Neural Network-2
[Figure: PNN structure. Input units $x_1, x_2, \ldots, x_p$ feed pattern units (one per training sample, $n_1, \ldots, n_C$ per class), which feed per-class summing units scaled by $\frac{1}{(2\pi)^{p/2}\sigma^p n_k}$ and weighted by the priors $P(z=1), \ldots, P(z=C)$; the output unit picks the maximum.]
• Speed up the PNN by using cluster centers as representative patterns.
Cluster using K-means, LVQ, ....
Probabilistic Neural Network-2 (continued)
One form of pattern unit structure: with inputs $x_1, x_2, \ldots, x_p$, $\|x\|^2$, and $-1/2$, and weights $x_{1k}^j, \ldots, x_{pk}^j$, $\|x_k^j\|^2$, the pattern unit computes
$z_k^j = x^Tx_k^j - \tfrac{1}{2}\|x\|^2 - \tfrac{1}{2}\|x_k^j\|^2 = -\tfrac{1}{2}\|x-x_k^j\|^2 \le 0$
and outputs $e^{z_k^j/\sigma^2}$.
(j = sample number, k = class, p = dimension of pattern)
Probabilistic Neural Network-3
Second form of pattern unit structure: compute the coordinate differences $x_1-x_{1k}^j, x_2-x_{2k}^j, \ldots, x_p-x_{pk}^j$, square each, and sum to get
$z_k^j = \|x-x_k^j\|^2$
then output $e^{-z_k^j/2\sigma^2}$.
(j = sample number, k = class, p = dimension of pattern)
Alternative Forms of Kernels
Manhattan distance, 1-norm:
$\hat p(x) = \frac{1}{N(2h)^p}\sum_{j=1}^{N}\exp\Big[-\sum_{i=1}^{p}\frac{|x_i-x_i^j|}{h}\Big]$
Cauchy distribution:
$\hat p(x) = \frac{1}{Nh^p}\sum_{j=1}^{N}\prod_{i=1}^{p}\frac{1}{\pi}\Big[1+\Big(\frac{x_i-x_i^j}{h}\Big)^2\Big]^{-1}$
sinc-squared kernel:
$\hat p(x) = \frac{1}{N}\sum_{j=1}^{N}\prod_{i=1}^{p}\frac{1}{2\pi}\operatorname{sinc}^2\Big(\frac{x_i-x_i^j}{2}\Big)$, where $\operatorname{sinc}(x) = \frac{\sin x}{x}$
Tri-cube kernel:
$\hat p(x) = \frac{1}{N}\sum_{j=1}^{N}\prod_{i=1}^{p}\frac{70}{81}\big(1-|x_i-x_i^j|^3\big)^3;\quad |x_i-x_i^j| \le 1$
Epanechnikov kernel:
$\hat p(x) = \frac{1}{N}\sum_{j=1}^{N}\prod_{i=1}^{p}\frac{3}{4}\big(1-(x_i-x_i^j)^2\big);\quad |x_i-x_i^j| \le 1$
K-Means Algorithm or K-Centers Algorithm
Decide on K, the number of clusters, in advance.
Suppose there are N data points, $D = \{x^1, x^2, \ldots, x^N\}$.
Want to find K representative vectors $\mu_1, \mu_2, \ldots, \mu_K$ ("cluster centers", "codebook vectors").
The K-means algorithm seeks to partition the data into K disjoint subsets $\{C_j\}$ containing $\{n_j\}$ points, in such a way as to minimize the sum-of-squares clustering function
$J = \sum_{j=1}^{K}\sum_{i\in C_j}\|x^i-\mu_j\|^2$
Clearly, if the $\{C_j\}$ are known, then $\mu_j = \frac{1}{n_j}\sum_{i\in C_j}x^i$: the means of the clusters.
Batch Version of K-Means Algorithm (Unsupervised Learning)
• 1) Start with any K random feature vectors as K centers. Alternately, assign the N points into K sets randomly and compute their means; these means serve as initial centers.
2) For i = 1, 2, . . . . . , N: assign pattern i to cluster $C_j$ if $j = \arg\min_{1\le k\le K}\|x^i-\mu_k\|^2$
3) Recompute means via: $\mu_j = \frac{1}{n_j}\sum_{i\in C_j}x^i$
4) If centers have changed, go to step 2. Else, stop.
Covariance of each cluster: $\hat\Sigma_j = \frac{1}{n_j-1}\sum_{i\in C_j}(x^i-\mu_j)(x^i-\mu_j)^T$ etc.
Initialization
Initialization (Recall David Arthur and Sergei Vassilvitskii's "k-means++: The Advantages of Careful Seeding", Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms):
a. Choose the initial center at random. Let $x^{n_1}$ be the data point; store $n_1$.
b. For k = 2, .., K
For n = 1, 2, .., N & n ≠ $n_i$, i = 1, 2, .., k-1:
$D(x^n) = \min_{1\le i\le k-1}\|x^n-\mu_i\|_2^2\quad$ (or $\exp\big(-\min_{1\le i\le k-1}\|x^n-\mu_i\|_2^2\big)$)
End
Select $x^{n_k}$ probabilistically, with
$p(x^n) = \frac{D(x^n)}{\sum_n D(x^n)};\quad n \ne n_i,\ i = 1, 2, \ldots, k-1$
Store $n_k$
End
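A hedged sketch of the $D^2$ seeding above followed by the batch K-means loop from the previous slide (it assumes no cluster goes empty, which holds for well-separated data):

```python
import numpy as np

def kmeanspp_init(X, K, rng):
    centers = [X[rng.integers(len(X))]]               # first center at random
    for _ in range(1, K):
        D2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = D2 / D2.sum()                         # favor far-away points
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

def kmeans(X, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = kmeanspp_init(X, K, rng)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - mu[None]) ** 2).sum(-1), axis=1)
        new_mu = np.array([X[labels == j].mean(0) for j in range(K)])
        if np.allclose(new_mu, mu):                   # centers unchanged: stop
            break
        mu = new_mu
    return mu, labels

X = np.vstack([np.random.default_rng(6).normal(m, 0.3, size=(50, 2))
               for m in (0.0, 3.0)])
print(np.round(kmeans(X, 2)[0], 2))                   # two recovered centers
```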
Selection of K
• Pick K to minimize the sum of errors (training error) + a model complexity term. The sum is called the prediction error:
$PE = \frac{1}{N}\sum_{j=1}^{K}\sum_{n\in C_j}\|x^n-\mu_j\|^2 + \frac{2\sigma^2Kp}{N}$
where $\sigma^2$ is the variance of the noise in the data.
• Kurtosis-based measure:
– Kurtosis for a normalized Gaussian: $\frac{E[x^4]}{\sigma^4} = 3$
– Find the excess kurtosis for each cluster: $K_{ji} = \frac{1}{n_j}\sum_{n\in C_j}\Big(\frac{x_i^n-\mu_{ji}}{\sigma_{ji}}\Big)^4 - 3;\quad i = 1, 2, \ldots, p$
– $K_T = \frac{1}{Kp}\sum_{j=1}^{K}\sum_{i=1}^{p}|K_{ji}|$; plot $K_T$ vs. K and pick the K that gives the minimum $K_T$.
– Can also use this idea in a dynamic cluster splitting scheme (see Vlassis and Likas, IEEE Trans.-SMCA, July 1999, pp. 393-399).
Online Version of K-Means
• Start with K randomly chosen centers $\{\mu_j\}_{j=1}^K$.
• For each data point $x^i$, update the nearest $\mu_j$ via
$\mu_j^{new} = \mu_j^{old} + \eta\,(x^i-\mu_j^{old})$
Leave all others the same.
Can use it for dynamic data. For static data, need to go through the data multiple times!
⇒ Vector Quantization via Stochastic Approximation.
Supervised Algorithm
Supervised Algorithm (Learning Vector Quantization {know class labels!}):
Start with K codebook vectors. For each data point $x^i$, find the closest codebook vector (or center) $\mu_j$:
$\mu_j^{new} = \mu_j^{old} + \eta\,(x^i-\mu_j^{old})$ if $x^i$ is classified correctly.
$\mu_j^{new} = \mu_j^{old} - \eta\,(x^i-\mu_j^{old})$ if $x^i$ is classified incorrectly.
It is a version of reinforcement learning. We will take up the variants of the algorithm in Lecture 8.
k-Nearest Neighbor Approach
• Fix k and allow the volume V to vary:
$\hat p(x) = \frac{k}{NV}$, where V is the volume of a hypersphere centered at the point x that contains exactly k points.
Again, small k ⇒ noisy; large k ⇒ smooth.
The major use of the k-nearest neighbor technique is not in probability estimation, but in classification:
$\hat p(x\,|\,z=i) = \frac{k_i}{n_iV};\qquad \hat P(z=i) = \frac{n_i}{N};\qquad \hat p(x) = \frac{k}{NV}$
$\hat P(z=i\,|\,x) = \frac{\hat p(x\,|\,z=i)\,\hat P(z=i)}{\hat p(x)} = \frac{k_i}{k}$
• Assign a point to class i if class i has the largest number of points among the k nearest neighbors ⇒ MAP rule.
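A small sketch of the k-NN MAP rule above: among the k nearest training points, pick the class with the largest count $k_i$:

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=5):
    d2 = np.sum((X_train - x) ** 2, axis=1)      # squared distances
    nn = np.argsort(d2)[:k]                      # indices of k nearest
    return np.bincount(y_train[nn]).argmax()     # majority class = MAP

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
print(knn_classify(np.array([2.5, 2.5]), X, y, k=5))   # likely class 1
```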
1-Nearest Neighbor Classifier
The 1-NN classifier has an interesting geometric interpretation.
Data points become corner points of triangles spanned by each reference point and two of its neighbors. The network of triangles is called a Delaunay Triangulation (or Delaunay Net).
Perpendicular bisectors of the triangles' edges form the borders of polygonal patches delimiting the local neighborhood of each point. The resulting network of polygons is called a Voronoi Diagram (or Voronoi Net).
The decision boundary follows sections of the Voronoi Net.
1-NN & k-NN adapt to any data: high variability & low bias.
[Figure: two panels of a 1-NN classifier in the $(x_1, x_2)$ plane, with the decision boundary shown]
Gaussian Mixtures, Maximum Likelihood and Expectation Maximization Algorithm - 1
Why Gaussian Mixtures?
• Parametric ⇒ fast but limited
• Nonparametric ⇒ general but slow (requires a lot of data)
• Mixture Models:
$p(x) = \sum_{j=1}^{M}p(x\,|\,j)\,P_j;\qquad P_j \ge 0;\ \sum_{j=1}^{M}P_j = 1;\qquad \int_x p(x\,|\,j)\,dx = 1\ \ \forall j$
A similar technique works for conditional density estimation (function approximation): $p(x\,|\,z=k)$, $k = 1, 2, \ldots, C$; RBF; mixture-of-experts models.
Gaussian Mixtures, Maximum Likelihood and Expectation Maximization Algorithm - 2
[Figure: mixture network. Component densities $p(x|1), \ldots, p(x|j), \ldots, p(x|M)$ over inputs $x_1, x_2, \ldots, x_p$, combined with weights $P_1, \ldots, P_j, \ldots, P_M$ to form $p(x)$.]
$p(x\,|\,j) = N(\mu_j,\Sigma_j)$, typically $N(\mu_j,\sigma_j^2I) = \frac{1}{(2\pi\sigma_j^2)^{p/2}}\,e^{-\|x-\mu_j\|^2/(2\sigma_j^2)}$
Problem: given data $D = \{x^1, x^2, \ldots, x^N\}$, find the ML estimates of $\{P_j,\mu_j,\sigma_j\}_{j=1}^M$.
Let $\theta = \{P_j,\mu_j,\sigma_j\}$:
$\max_\theta L = p(D\,|\,\theta) \Leftrightarrow \max_\theta\ln p(D\,|\,\theta) = l \Leftrightarrow \min_\theta[-\ln p(D\,|\,\theta)] = \min_\theta J$
Gaussian Mixtures, Maximum Likelihood and Expectation Maximization Algorithm - 3
$J = -\sum_{i=1}^{N}\ln p(x^i\,|\,\theta) = -\sum_{i=1}^{N}\ln\Big[\sum_{j=1}^{M}p(x^i\,|\,j)\,P_j\Big]$
$\frac{\partial J}{\partial\mu_j} = -\sum_{i=1}^{N}\frac{P_j\,p(x^i\,|\,j)}{\sum_{k=1}^{M}p(x^i\,|\,k)\,P_k}\cdot\frac{x^i-\mu_j}{\sigma_j^2} = -\sum_{i=1}^{N}P(j\,|\,x^i)\,\frac{x^i-\mu_j}{\sigma_j^2}$ . . . . . . . . . . (1)
using $p(x^i\,|\,j) = \frac{1}{(2\pi\sigma_j^2)^{p/2}}\,e^{-\|x^i-\mu_j\|^2/(2\sigma_j^2)} \Rightarrow \frac{\partial p(x^i\,|\,j)}{\partial\mu_j} = p(x^i\,|\,j)\,\frac{x^i-\mu_j}{\sigma_j^2}$
The ratio $P(j\,|\,x^i) = \frac{P_j\,p(x^i\,|\,j)}{\sum_k P_k\,p(x^i\,|\,k)}$ is the posterior. Note the simplicity of the gradient.
So, $\min J$ s.t. $\sum_{j=1}^{M}P_j = 1;\ 0 \le P_j \le 1$. Lagrangian: $l = J + \lambda\Big(\sum_{j=1}^{M}P_j - 1\Big)$
Gaussian Mixtures, Maximum Likelihood and Expectation Maximization Algorithm - 4
$\frac{\partial J}{\partial\sigma_j} = \sum_{i=1}^{N}\frac{P_j\,p(x^i\,|\,j)}{\sum_{k=1}^{M}p(x^i\,|\,k)\,P_k}\Big[\frac{p}{\sigma_j}-\frac{\|x^i-\mu_j\|^2}{\sigma_j^3}\Big] = \sum_{i=1}^{N}P(j\,|\,x^i)\Big[\frac{p}{\sigma_j}-\frac{\|x^i-\mu_j\|^2}{\sigma_j^3}\Big]$ . . . . . .(2)
(here p is the dimension of the feature vector)
$\frac{\partial l}{\partial P_j} = -\sum_{i=1}^{N}\frac{p(x^i\,|\,j)}{\sum_{k=1}^{M}p(x^i\,|\,k)\,P_k} + \lambda = 0 \Rightarrow \sum_{i=1}^{N}P(j\,|\,x^i) = \lambda P_j$
Summing over j: $\lambda = N \Rightarrow P_j = \frac{1}{N}\sum_{i=1}^{N}P(j\,|\,x^i)$ . . . . . . . . . . . . . . (3)
Gaussian Mixtures, Maximum Likelihood and Expectation Maximization Algorithm - 5
Necessary conditions of optimality: set the gradients equal to zero. From (1)-(3),
$\hat\mu_j = \frac{\sum_{i=1}^{N}P(j\,|\,x^i)\,x^i}{\sum_{i=1}^{N}P(j\,|\,x^i)}$
$\hat\sigma_j^2 = \frac{1}{p}\cdot\frac{\sum_{i=1}^{N}P(j\,|\,x^i)\,\|x^i-\hat\mu_j\|^2}{\sum_{i=1}^{N}P(j\,|\,x^i)}$
$\hat P_j = \frac{1}{N}\sum_{i=1}^{N}P(j\,|\,x^i)$, noting that $\sum_{j=1}^{M}P(j\,|\,x^i) = 1$ and $\sum_{j=1}^{M}\sum_{i=1}^{N}P(j\,|\,x^i) = N$.
$P(j\,|\,x^i)$ is the responsibility of component j for sample i.
General case: $\hat\Sigma_j = \frac{\sum_{i=1}^{N}P(j\,|\,x^i)\,(x^i-\hat\mu_j)(x^i-\hat\mu_j)^T}{\sum_{i=1}^{N}P(j\,|\,x^i)}$
These are coupled non-linear equations.
Methods of Solution: NLP
Nonlinear Programming (NLP) techniques: $\theta^{k+1} = \theta^k - \eta_k H^{-1}\nabla l(\theta^k);\quad 0 < \eta_k \le 1$
$H = I$ ⇒ SD or gradient method
$H = \nabla^2J$ ⇒ Newton's method
$H = \nabla^2J + \lambda I$ ⇒ Levenberg-Marquardt method
$H = J^TJ + \lambda I$ ⇒ Levenberg-Marquardt version of the Gauss-Newton method
Various versions of the quasi-Newton method
Various versions of the conjugate gradient method
Best to compute the Hessian using the finite difference method.
EM Algorithm
M-step (by setting the gradient to zero):
$\mu_j^{new} = \frac{\sum_{i=1}^{N}\hat P^{old}(j\,|\,x^i)\,x^i}{\sum_{i=1}^{N}\hat P^{old}(j\,|\,x^i)};\qquad (\sigma_j^2)^{new} = \frac{1}{p}\cdot\frac{\sum_{i=1}^{N}\hat P^{old}(j\,|\,x^i)\,\|x^i-\mu_j^{new}\|^2}{\sum_{i=1}^{N}\hat P^{old}(j\,|\,x^i)};\qquad P_j^{new} = \frac{1}{N}\sum_{i=1}^{N}\hat P^{old}(j\,|\,x^i)$
E-step (evaluating posterior probabilities / responsibilities):
$\hat P^{new}(j\,|\,x^i) = \frac{p(x^i\,|\,j)\,P_j^{new}}{\sum_{m=1}^{M}p(x^i\,|\,m)\,P_m^{new}}$
How did we get these equations, and why? .... Later.
Gauss-Seidel view of EM.
Sequential Estimation - 1
Sequential Estimation ⇒ Stochastic Approximation:
$\hat\mu_j^{n+1} = \frac{\sum_{i=1}^{n+1}P(j\,|\,x^i)\,x^i}{\sum_{i=1}^{n+1}P(j\,|\,x^i)} = \hat\mu_j^n + \eta_j^{n+1}\,\big(x^{n+1}-\hat\mu_j^n\big),\qquad \eta_j^{n+1} = \frac{P(j\,|\,x^{n+1})}{\sum_{i=1}^{n+1}P(j\,|\,x^i)}$
Note:
$\frac{1}{\eta_j^{n+1}} = \frac{\sum_{i=1}^{n+1}P(j\,|\,x^i)}{P(j\,|\,x^{n+1})} = 1 + \frac{\sum_{i=1}^{n}P(j\,|\,x^i)}{P(j\,|\,x^{n+1})} = 1 + \frac{P(j\,|\,x^n)}{P(j\,|\,x^{n+1})}\cdot\frac{1}{\eta_j^n}$
Sequential Estimation ⇒ Stochastic Approximation - 2
• Sometimes we replace the gain by $\eta_j^{n+1} = \frac{P(j\,|\,x^{n+1})}{(n+1)\hat P_j^n}$, or use $\frac{1}{\eta_j^{n+1}} = 1 + \frac{P(j\,|\,x^n)}{P(j\,|\,x^{n+1})}\cdot\frac{1}{\eta_j^n}$.
Similarly, for the variance: starting from
$(\hat\sigma_j^2)^{n+1} = \frac{1}{p}\cdot\frac{\sum_{i=1}^{n+1}P(j\,|\,x^i)\,\|x^i-\hat\mu_j^{n+1}\|^2}{\sum_{i=1}^{n+1}P(j\,|\,x^i)}$
expand $\|x^i-\hat\mu_j^{n+1}\|^2 = \|x^i-\hat\mu_j^n\|^2 - 2(x^i-\hat\mu_j^n)^T(\hat\mu_j^{n+1}-\hat\mu_j^n) + \|\hat\mu_j^{n+1}-\hat\mu_j^n\|^2$ and use $\sum_{i=1}^{n+1}P(j\,|\,x^i)\,(x^i-\hat\mu_j^n) = \big[\sum_{i=1}^{n+1}P(j\,|\,x^i)\big](\hat\mu_j^{n+1}-\hat\mu_j^n)$ to get
$(\hat\sigma_j^2)^{n+1} = \big(1-\eta_j^{n+1}\big)(\hat\sigma_j^2)^n + \frac{\eta_j^{n+1}}{p}\,\|x^{n+1}-\hat\mu_j^n\|^2 - \frac{1}{p}\,\|\hat\mu_j^{n+1}-\hat\mu_j^n\|^2$
Sequential Estimation ⇒ Stochastic Approximation - 3
• Similarly, recalling $\hat\mu_j^{n+1}-\hat\mu_j^n = \eta_j^{n+1}\,(x^{n+1}-\hat\mu_j^n)$, so that $\|\hat\mu_j^{n+1}-\hat\mu_j^n\|^2 = (\eta_j^{n+1})^2\,\|x^{n+1}-\hat\mu_j^n\|^2$:
$(\hat\sigma_j^2)^{n+1} = (\hat\sigma_j^2)^n + \eta_j^{n+1}\Big[\frac{1}{p}\big(1-\eta_j^{n+1}\big)\,\|x^{n+1}-\hat\mu_j^n\|^2 - (\hat\sigma_j^2)^n\Big]$
$\hat P_j^{n+1} = \hat P_j^n + \frac{1}{n+1}\big(P(j\,|\,x^{n+1}) - \hat P_j^n\big)$
EM Algorithm for Gaussian Mixture Problem-1
Key ideas of EM as applied to the Gaussian Mixture Problem:
$J = -\sum_{i=1}^{N}\ln p(x^i\,|\,\theta)$, so
$J^{new} - J^{old} = -\sum_{i=1}^{N}\ln\frac{p^{new}(x^i)}{p^{old}(x^i)} = -\sum_{i=1}^{N}\ln\Big[\sum_{j=1}^{M}\frac{P_j^{new}\,p^{new}(x^i\,|\,j)}{P^{old}(j\,|\,x^i)\,p^{old}(x^i)}\,P^{old}(j\,|\,x^i)\Big]$
Idea (general view): let x = data; z = hidden variables (mixture labels); $\theta$ = parameters; q(z) = any arbitrary distribution. Since $\ln p(x,z\,|\,\theta) = \ln p(z\,|\,x,\theta) + \ln p(x\,|\,\theta)$,
$\ln p(x\,|\,\theta) = E_q\Big[\ln\frac{p(x,z\,|\,\theta)}{q(z)}\Big] - E_q\Big[\ln\frac{p(z\,|\,x,\theta)}{q(z)}\Big] = \ln L(q,\theta) + KL\big(q(z)\,\|\,p(z\,|\,x,\theta)\big)$
$J = -\ln p(x\,|\,\theta) = -\ln L(q,\theta) - KL\big(q(z)\,\|\,p(z\,|\,x,\theta)\big)$; since $KL \ge 0$, $J \le -\ln L(q,\theta)$.
E-step: $q(z) = p(z\,|\,x,\theta^{old})$ (makes the KL term zero).
M-step: $\theta^{new} = \arg\min_\theta[-\ln L(q,\theta)] = \arg\min_\theta E_q[-\ln p(x,z\,|\,\theta)] = \arg\min_\theta Q(\theta,\theta^{old})$
Note: $-\ln L(q,\theta) = Q(\theta,\theta^{old}) - H_q(z)$, where $Q(\theta,\theta^{old}) = E_q[-\ln p(x,z\,|\,\theta)]$ with $q(z) = p(z\,|\,x,\theta^{old})$, and $H_q(z)$ is the entropy of q.
[Figure: decomposition of $J = -\ln p(x\,|\,\theta)$ into $-\ln L(q,\theta)$ and the $KL(q\,\|\,p)$ term]
EM Algorithm for Gaussian Mixture Problem-2
• For concave ln (Jensen's inequality): $\ln\sum_i\lambda_ix_i \ge \sum_i\lambda_i\ln x_i$, where $\sum_i\lambda_i = 1,\ \lambda_i \ge 0$; equivalently, $-\ln\sum_i\lambda_ix_i \le -\sum_i\lambda_i\ln x_i$. Hence
$J^{new} - J^{old} \le -\sum_{i=1}^{N}\sum_{j=1}^{M}P^{old}(j\,|\,x^i)\,\ln\Big[\frac{P_j^{new}\,p^{new}(x^i\,|\,j)}{P^{old}(j\,|\,x^i)\,p^{old}(x^i)}\Big]$
Minimizing the bound over the new parameters, i.e., minimizing $Q(\theta^{new},\theta^{old}) = -\sum_{i=1}^{N}\sum_{j=1}^{M}P^{old}(j\,|\,x^i)\,\ln\big[P_j^{new}\,p^{new}(x^i\,|\,j)\big]$, will lead to a decrease in $J(\theta)$.
Note: at $\theta = \theta^{old}$, $J(\theta^{old}) = -\ln L(q,\theta^{old})$ (forcing $KL = 0$); $J(\theta^{new}) \le -\ln L(q,\theta^{new})$ since $KL \ge 0$; and the surrogate $Q(\theta,\theta^{old})$ and $J(\theta)$ have the same gradient at $\theta^{old}$.
[Figure: $J(\theta)$ and the majorizing surrogate $Q(\theta,\theta^{old})$ touching at $\theta^{old}$, with the surrogate minimized at $\theta^{new}$]
EM Algorithm for Gaussian Mixture Problem-3
• Optimization problem:
min $\;Q = -\sum_{i=1}^{N}\sum_{j=1}^{M}P^{old}(j\,|\,x^i)\,\ln P_j^{new} - \sum_{i=1}^{N}\sum_{j=1}^{M}P^{old}(j\,|\,x^i)\,\ln p^{new}(x^i\,|\,j)$
s.t. $\;\sum_{j=1}^{M}P_j^{new} = 1;\quad P_j^{new} \ge 0;\quad j = 1, 2, \ldots, M$
For Gaussian conditional probability density functions, dropping terms that depend only on the old parameters, we get
$Q = -\sum_{i=1}^{N}\sum_{j=1}^{M}P^{old}(j\,|\,x^i)\Big[\ln P_j^{new} - p\ln\sigma_j^{new} - \frac{\|x^i-\mu_j^{new}\|^2}{2(\sigma_j^{new})^2}\Big]$
EM Algorithm for Gaussian Mixture Problem-4
$\mu_j^{new} = \frac{\sum_{i=1}^{N}P^{old}(j\,|\,x^i)\,x^i}{\sum_{i=1}^{N}P^{old}(j\,|\,x^i)}$
$(\sigma_j^2)^{new} = \frac{1}{p}\cdot\frac{\sum_{i=1}^{N}P^{old}(j\,|\,x^i)\,\|x^i-\mu_j^{new}\|^2}{\sum_{i=1}^{N}P^{old}(j\,|\,x^i)}$
$P_j^{new} = \frac{1}{N}\sum_{i=1}^{N}P^{old}(j\,|\,x^i)$
General case: $\Sigma_j^{new} = \frac{\sum_{i=1}^{N}P^{old}(j\,|\,x^i)\,(x^i-\hat\mu_j^{new})(x^i-\hat\mu_j^{new})^T}{\sum_{i=1}^{N}P^{old}(j\,|\,x^i)}$
Graphical Illustration of E and M Steps
$J = -\ln L(q,\theta) - KL\big(q(z)\,\|\,p(z\,|\,x,\theta)\big);\qquad KL\big(q(z)\,\|\,p(z\,|\,x,\theta)\big) \ge 0$
E-step: $q(z) = p(z\,|\,x,\theta^{old})$. Then $-\ln L(q,\theta^{old}) = -\ln p(x\,|\,\theta^{old}) = J(\theta^{old})$. Why? Because $KL(q(z)\,\|\,p(z\,|\,x,\theta^{old})) = 0$ at this choice of q.
M-step: $\theta^{new} = \arg\min_\theta[-\ln L(q,\theta)]$. Then $-\ln L(q,\theta^{new}) \le -\ln L(q,\theta^{old})$ and $J(\theta^{new}) \le -\ln L(q,\theta^{new})$. Why? Because $KL\big(q(z) = p(z\,|\,x,\theta^{old})\,\|\,p(z\,|\,x,\theta^{new})\big) \ge 0$.
Note: EM is a Maximum Likelihood algorithm. Is there a Bayesian version? Yes: if you assume priors on $(\{\mu_j,\sigma_j^2,P_j\})$ ⇒ called Variational Bayesian Inference.
An Alternate View of EM for Gaussian Mixtures - 1
[Graphical model: hidden (latent) variable z → observation x]
z is an M-dimensional binary random vector such that $z_j \in \{0,1\}$ and $\sum_{j=1}^{M}z_j = 1$. The only possible vectors are $\{z = e_i : i = 1, 2, \ldots, M\}$, where $e_i$ is the i-th unit vector.
$P(z_j = 1) = P_j \;\Rightarrow\; P(z) = \prod_{j=1}^{M}P_j^{z_j}$
x is a p-dimensional random vector such that
$p(x\,|\,z) = \prod_{j=1}^{M}\big[N(x;\mu_j,\Sigma_j)\big]^{z_j}$
$p(x) = \sum_z p(x,z) = \sum_z P(z)\,p(x\,|\,z) = \sum_z\prod_{j=1}^{M}\big[P_j\,N(x;\mu_j,\Sigma_j)\big]^{z_j} = \sum_{j=1}^{M}P_j\,N(x;\mu_j,\Sigma_j)$
⇒ the pdf of x is a Gaussian Mixture.
An Alternate View of EM for Gaussian Mixtures - 2
Problem: given incomplete (partial) data $D = \{x^1, x^2, \ldots, x^N\}$, find the ML estimates of $\{P_j,\mu_j,\Sigma_j\}_{j=1}^M$.
Let $\theta = \{P_j,\mu_j,\Sigma_j\}_{j=1}^M$: $\min_\theta J$, where $J = -\ln p(D\,|\,\theta)$.
If we have several observations $\{x^n : n = 1, 2, \ldots, N\}$, each data point will have a corresponding latent vector $z^n$. Note the generality.
Complete data: $D_c = \{(x^1,z^1), (x^2,z^2), \ldots, (x^N,z^N)\}$
$\ln p(D_c\,|\,\theta) = \sum_{n=1}^{N}\sum_{j=1}^{M}z_j^n\Big\{\ln P_j - \frac{p}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_j| - \frac{1}{2}\|x^n-\mu_j\|_{\Sigma_j^{-1}}^2\Big\}$
where $\|x^n-\mu_j\|_{\Sigma_j^{-1}}^2 = (x^n-\mu_j)^T\Sigma_j^{-1}(x^n-\mu_j)$.
An Alternate View of EM for Gaussian Mixtures - 3
If we had complete data, estimation would be trivial: similar to the Gaussian case, except that we estimate with the subsets of data assigned to each mixture component.
In EM, replace each latent variable by its expectation with respect to the posterior density during the E-step:
$\langle z_j^n\rangle = E[z_j^n\,|\,x^n,\theta] = P(z_j^n = 1\,|\,x^n,\theta) = \frac{P_j\,N(x^n;\mu_j,\Sigma_j)}{\sum_{k=1}^{M}P_k\,N(x^n;\mu_k,\Sigma_k)} =: \gamma_j^n$ (the responsibilities)
In EM, minimize the expected value of the negative complete-data log-likelihood during the M-step:
$Q(\theta,\theta^{old}) = -E_Z[\ln p(D_c\,|\,\theta)] = -\sum_{n=1}^{N}\sum_{j=1}^{M}\gamma_j^n\Big\{\ln P_j - \frac{p}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_j| - \frac{1}{2}\|x^n-\mu_j\|_{\Sigma_j^{-1}}^2\Big\}$
EM Algorithm for Gaussian Mixtures - 4
1. Initialize the means $\{\mu_j\}_{j=1}^M$, covariances $\{\Sigma_j\}_{j=1}^M$, and mixing coefficients $\{P_j\}_{j=1}^M$. Evaluate
$J = -\ln p(x\,|\,\theta) = -\sum_{n=1}^{N}\ln\Big\{\sum_{j=1}^{M}P_j\,N(x^n;\mu_j,\Sigma_j)\Big\}$
2. E-step: evaluate the responsibilities using the current parameter values:
$\gamma_j^n = \frac{P_j\,N(x^n;\mu_j,\Sigma_j)}{\sum_{k=1}^{M}P_k\,N(x^n;\mu_k,\Sigma_k)};\quad j = 1, 2, \ldots, M;\ n = 1, 2, \ldots, N$
$N_j = \sum_{n=1}^{N}\gamma_j^n;\quad j = 1, 2, \ldots, M$
3. M-step: re-estimate the parameters using the current responsibilities:
$\mu_j^{new} = \frac{1}{N_j}\sum_{n=1}^{N}\gamma_j^n\,x^n$
$\Sigma_j^{new} = \frac{1}{N_j}\sum_{n=1}^{N}\gamma_j^n\,(x^n-\mu_j^{new})(x^n-\mu_j^{new})^T$
$P_j^{new} = \frac{N_j}{N}$
4. Evaluate the negative log-likelihood and check for convergence of the parameters or the likelihood. If not converged, go to step 2.
For an unbiased estimate of the covariance, divide by $N_j - \frac{1}{N_j}\sum_{n=1}^{N}(\gamma_j^n)^2$ instead of $N_j$; this goes to $N_j - 1$ for the (0-1) case.
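A compact NumPy sketch of steps 1-4 above; `gauss_pdf` and `em_gmm` are illustrative helper names, and a full implementation would also monitor the log-likelihood for convergence and guard against collapsing components:

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    """N(x; mu, Sigma) evaluated at each row of X."""
    p = mu.size
    d = X - mu
    quad = np.einsum('ij,ij->i', d, np.linalg.solve(Sigma, d.T).T)
    norm = np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_gmm(X, M, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    N, p = X.shape
    mu = X[rng.choice(N, M, replace=False)]       # initialize means from data
    Sigma = [np.cov(X.T) for _ in range(M)]
    P = np.full(M, 1.0 / M)
    for _ in range(iters):
        # E-step: responsibilities gamma[n, j]
        dens = np.column_stack([P[j] * gauss_pdf(X, mu[j], Sigma[j])
                                for j in range(M)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities
        Nj = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nj[:, None]
        for j in range(M):
            d = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * d).T @ d / Nj[j]
        P = Nj / N
    return mu, Sigma, P

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(m, 0.5, size=(100, 2)) for m in (-2.0, 2.0)])
mu, Sigma, P = em_gmm(X, M=2)
print(np.round(mu, 2), np.round(P, 2))   # recovered means and mixing weights
```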
Illustration of EM Algorithm for Gaussian Mixtures
[Figure: five scatter panels showing the cluster assignments at iteration 0 (loglik -Inf), iteration 1 (loglik -4929.3761), iteration 2 (loglik -441.9557), iteration 7 (loglik -389.7686), and iteration 8 (loglik -389.7602), plus a plot of the average log-likelihood vs. iteration; Murphy, Page 353, mixGaussDemoFaithful]
Relation of Gaussian Mixtures to K-means

Suppose $\Sigma_j = \epsilon I$ for $j = 1, 2, \dots, M$. Then

$\gamma_j^n = \dfrac{P_j\, N(x^n; \mu_j, \epsilon I)}{\sum_{k=1}^{M} P_k\, N(x^n; \mu_k, \epsilon I)} = \dfrac{P_j\, e^{-\|x^n - \mu_j\|^2/2\epsilon}}{\sum_{k=1}^{M} P_k\, e^{-\|x^n - \mu_k\|^2/2\epsilon}};\ j = 1, \dots, M;\ n = 1, \dots, N$

As $\epsilon \to 0$: $\gamma_j^n \to 1$ if $j = \arg\min_k \|x^n - \mu_k\|^2$; the rest go to zero, as long as none of the $P_j$ is zero.

The expected value of the negative log likelihood of the complete data is

$E_Z\{-\ln p(D_c \mid \theta)\} = \dfrac{1}{2\epsilon}\sum_{n=1}^{N}\sum_{j=1}^{M} \gamma_j^n\, \|x^n - \mu_j\|^2 + \text{constant}$

So, K-means minimizes $\dfrac{1}{2}\sum_{n=1}^{N}\sum_{j=1}^{M} \gamma_j^n\, \|x^n - \mu_j\|^2$ with hard assignments $\gamma_j^n \in \{0, 1\}$.
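A quick numerical check of this hardening effect (the point and means below are arbitrary illustrative values):

```python
import numpy as np

# Soft responsibilities for one point x under equal weights and Sigma_j = eps * I;
# as eps -> 0 they harden to a one-hot assignment of the nearest mean (K-means).
x = np.array([0.9, 0.1])
mu = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
for eps in [1.0, 0.1, 0.01]:
    d2 = ((x - mu) ** 2).sum(axis=1)
    g = np.exp(-(d2 - d2.min()) / (2 * eps))   # shift by min for numerical stability
    print(eps, g / g.sum())
```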
Variational Bayesian Inference - 1

• $w$ is a latent vector (continuous or discrete):
− the mixture vector (discrete), $z$
− the parameters ($\{\mu_j, \Lambda_j, P_j\}$)
• $x$ is a $p$-dimensional random vector.
[Graphical model: hidden (latent) variables $w$ generate the observation $x$.]

Recall

$\ln p(x) = \ln L(q(w)) + KL(q(w)\, \|\, p(w \mid x))$

$\ln L(q(w)) = \int q(w) \ln \dfrac{p(x, w)}{q(w)}\, dw = E_q[\ln p(x, w)] + H(w)$

$KL(q(w)\, \|\, p(w \mid x)) = -\int q(w) \ln \dfrac{p(w \mid x)}{q(w)}\, dw = -E_q[\ln p(w \mid x)] - H(w)$

Since $KL(\cdot \| \cdot) \ge 0$, $J = -\ln p(x) \le -\ln L(q(w))$: the negative evidence is upper-bounded by $-\ln L(q(w))$.

Variational inference typically assumes $q(w)$ to be factorized:

$q(w) = \prod_{j=1}^{K} q_j(w_j)$; the $\{w_j\}$ are disjoint groups.

Example: $q(w) = q(z)\, q(\{\mu_j, \Lambda_j, P_j\})$
[Graphical model of the mixture: $P \to z \to x \leftarrow \{\mu_j, \Lambda_j^{-1}\}$.]
Variational Bayesian Inference - 2

• Minimize the upper bound $-\ln L(q(w))$ with respect to $q_j(w_j)$ while keeping $\{q_i(w_i) : i \ne j\}$ constant (a la Gauss-Seidel).
• Iterative algorithm for finding the factors $\{q_j(w_j)\}$:

$\ln L(q(w)) = \int \prod_{i=1}^{K} q_i(w_i)\, \ln \dfrac{p(x, w)}{\prod_{i=1}^{K} q_i(w_i)}\, dw$
$= \int q_j(w_j) \left\{ \int \ln p(x, w) \prod_{i \ne j} q_i(w_i)\, dw_i \right\} dw_j + H(w_j) + \sum_{i \ne j} H(w_i)$
$= \int q_j(w_j)\, E_{i \ne j}[\ln p(x, w)]\, dw_j + \sum_{i} H(w_i)$

Setting the variation of $-\ln L(q(w))$ with respect to $q_j(w_j)$ to zero, with a Lagrange multiplier for normalization: $E_{i \ne j}[\ln p(x, w)] - 1 - \ln q_j(w_j) + \lambda = 0$, so

$\ln q_j^*(w_j) = E_{i \ne j}[\ln p(x, w)] + \text{const}, \qquad q_j^*(w_j) = \dfrac{e^{E_{i \ne j}[\ln p(x, w)]}}{\int e^{E_{i \ne j}[\ln p(x, w)]}\, dw_j}$

The log of the optimal $q_j$ is the expectation of the log of the joint distribution with respect to all of the other factors $\{q_i(w_i) : i \ne j\}$. This idea is used in loopy belief propagation and expectation propagation also.
Application to Gaussian Mixtures - 1

Here $w$ involves the mixture variables and the component parameters.

Model assumptions:

$q(w) = q(z)\, q(\{\mu_j, \Lambda_j, P_j\}_{j=1}^{M})$; $z$ is a binary random vector of dimension $M$.

Mixture distribution: $p(\{z^n\}_{n=1}^{N} \mid \{P_j\}_{j=1}^{M}) = \prod_{n=1}^{N} \prod_{j=1}^{M} P_j^{z_j^n}$

Data likelihood given the latent variables:

$p(\{x^n\}_{n=1}^{N} \mid \{z^n\}_{n=1}^{N}, \{\mu_j, \Lambda_j\}_{j=1}^{M}) = \prod_{n=1}^{N} \prod_{j=1}^{M} \left[ N(x^n; \mu_j, \Lambda_j^{-1}) \right]^{z_j^n}$

We also assume priors on $\{P_j, \mu_j, \Lambda_j\}_{j=1}^{M}$: the Bayesian approach.

$p(P) = Dirichlet(P \mid \alpha_0) = \dfrac{\Gamma(M\alpha_0)}{\Gamma(\alpha_0)^M} \prod_{j=1}^{M} P_j^{\alpha_0 - 1}$; the conjugate prior to the multinomial

$\Gamma(\alpha) = \int_0^{\infty} t^{\alpha - 1} e^{-t}\, dt$; $\Gamma(n + 1) = n\,\Gamma(n)$; $\Gamma(n) = (n - 1)!$ for integer $n$
Application to Gaussian Mixtures - 2

Model assumptions (continued). The prior on the component parameters is Gaussian-Wishart:

$p(\{\mu_j, \Lambda_j\}_{j=1}^{M}) = p(\{\mu_j\} \mid \{\Lambda_j\})\, p(\{\Lambda_j\}) = \prod_{j=1}^{M} N(\mu_j; m_0, (\beta_0 \Lambda_j)^{-1})\, Wishart(\Lambda_j; W_0, \nu_0)$

(In one dimension, the Wishart reduces to a Gamma; see Bishop's book.)

The joint distribution of all random variables decomposes as follows:

$p(\{x^n\}, \{z^n\}, \{P_j, \mu_j, \Lambda_j\}) = p(\{x^n\} \mid \{z^n\}, \{\mu_j, \Lambda_j\})\, p(\{z^n\} \mid \{P_j\})\, p(P)\, p(\{\mu_j\} \mid \{\Lambda_j\})\, p(\{\Lambda_j\})$

$= \prod_{n=1}^{N} \prod_{j=1}^{M} \left[ P_j\, N(x^n; \mu_j, \Lambda_j^{-1}) \right]^{z_j^n} \cdot \dfrac{\Gamma(M\alpha_0)}{\Gamma(\alpha_0)^M} \prod_{j=1}^{M} P_j^{\alpha_0 - 1} \cdot \prod_{j=1}^{M} N(\mu_j; m_0, (\beta_0 \Lambda_j)^{-1})\, Wishart(\Lambda_j; W_0, \nu_0)$

[Graphical model: $P \to z^n \to x^n \leftarrow \{\mu_j, \Lambda_j^{-1}\}$.]
Application to Gaussian Mixtures - 3

Variational Bayes M-step (VBM-step) .... it is easier to see the M-step first.

$\ln q^*(\{P_j, \mu_j, \Lambda_j\}_{j=1}^{M}) = E_{q(\{z^n\})}\left[ \ln p(\{x^n\}, \{z^n\}, \{P_j, \mu_j, \Lambda_j\}) \right]$

$= E_{q(\{z^n\})} \ln \left\{ \prod_{n=1}^{N} \prod_{j=1}^{M} \left[ P_j\, N(x^n; \mu_j, \Lambda_j^{-1}) \right]^{z_j^n} \cdot \dfrac{\Gamma(M\alpha_0)}{\Gamma(\alpha_0)^M} \prod_{j=1}^{M} P_j^{\alpha_0 - 1} \cdot \prod_{j=1}^{M} N(\mu_j; m_0, (\beta_0 \Lambda_j)^{-1})\, Wishart(\Lambda_j; W_0, \nu_0) \right\} + \text{const}$

$= \sum_{j=1}^{M} \left[ (\alpha_0 - 1) + \sum_{n=1}^{N} E[z_j^n] \right] \ln P_j + \sum_{j=1}^{M} \ln \left\{ N(\mu_j; m_0, (\beta_0 \Lambda_j)^{-1})\, Wishart(\Lambda_j; W_0, \nu_0) \right\} + \sum_{n=1}^{N} \sum_{j=1}^{M} E[z_j^n] \ln N(x^n; \mu_j, \Lambda_j^{-1}) + \text{constant}$

Hence the optimal factor for the parameters itself factorizes:

$q^*(\{P_j, \mu_j, \Lambda_j\}_{j=1}^{M}) = q^*(P) \prod_{j=1}^{M} q^*(\mu_j, \Lambda_j)$
$q^*(P) = Dirichlet(P; \{\alpha_j\}_{j=1}^{M})$
$q^*(\mu_j, \Lambda_j) = $ Gaussian-Wishart
Application to Gaussian Mixtures - 4

Updated factorized distribution after the M-step:

$q^*(\{P_j, \mu_j, \Lambda_j\}_{j=1}^{M}) = q^*(P) \prod_{j=1}^{M} q^*(\mu_j, \Lambda_j)$

$q^*(P) = Dirichlet(P; \{\alpha_j\}_{j=1}^{M});\quad \alpha_j = \alpha_0 + N_j;\quad E[P_j] = \dfrac{\alpha_0 + N_j}{M\alpha_0 + N}$

$q^*(\mu_j, \Lambda_j) = N(\mu_j; m_j, (\beta_j \Lambda_j)^{-1}) \cdot Wishart(\Lambda_j; W_j, \nu_j)$

$\beta_j = \beta_0 + N_j;\quad m_j = \dfrac{1}{\beta_j}(\beta_0 m_0 + N_j \bar{x}_j);\quad \nu_j = \nu_0 + N_j$

$W_j^{-1} = W_0^{-1} + N_j S_j + \dfrac{\beta_0 N_j}{\beta_0 + N_j}(\bar{x}_j - m_0)(\bar{x}_j - m_0)^T$

where $N_j = \sum_{n=1}^{N} \gamma_j^n$, $\bar{x}_j = \dfrac{1}{N_j}\sum_{n=1}^{N} \gamma_j^n x^n$, and $S_j = \dfrac{1}{N_j}\sum_{n=1}^{N} \gamma_j^n (x^n - \bar{x}_j)(x^n - \bar{x}_j)^T$.

The updates for $\{N_j, \bar{x}_j, S_j\}$ are similar to ML. Sequential VBEM?
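A sketch of these hyperparameter updates under the stated assumptions (the function name vbm_step is my own; the responsibilities r are assumed to come from the VBE-step of the next slides):

```python
import numpy as np

def vbm_step(X, r, alpha0, beta0, m0, W0_inv, nu0):
    """One VBM-step: map responsibilities r (N, M) to the Dirichlet and
    Gaussian-Wishart hyperparameters per the updates above (Bishop, Ch. 10)."""
    N, p = X.shape
    Nj = r.sum(axis=0) + 1e-10                      # N_j (guarded against 0)
    xbar = (r.T @ X) / Nj[:, None]                  # x-bar_j
    alpha = alpha0 + Nj                             # Dirichlet counts
    beta = beta0 + Nj
    nu = nu0 + Nj
    m = (beta0 * m0 + Nj[:, None] * xbar) / beta[:, None]
    W_inv = np.empty((len(Nj), p, p))
    for j in range(len(Nj)):
        d = X - xbar[j]
        Sj = (r[:, j, None] * d).T @ d / Nj[j]      # S_j
        dm = (xbar[j] - m0)[:, None]
        W_inv[j] = W0_inv + Nj[j] * Sj + (beta0 * Nj[j] / (beta0 + Nj[j])) * (dm @ dm.T)
    return alpha, beta, m, nu, W_inv
```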
Application to Gaussian Mixtures - 5

Variational Bayes E-step (VBE-step):

$\ln q^*(\{z^n\}_{n=1}^{N}) = E_{q(\{P_j, \mu_j, \Lambda_j\})}\left[ \ln p(\{x^n\}, \{z^n\}, \{P_j, \mu_j, \Lambda_j\}) \right]$

$= E_{q(\{P_j, \mu_j, \Lambda_j\})} \ln \prod_{n=1}^{N} \prod_{j=1}^{M} \left[ P_j\, N(x^n; \mu_j, \Lambda_j^{-1}) \right]^{z_j^n} + \text{constant}$

$= \sum_{n=1}^{N} \sum_{j=1}^{M} z_j^n \ln \rho_j^n + \text{constant}$

where $\ln \rho_j^n = E[\ln P_j] + \dfrac{1}{2} E[\ln |\Lambda_j|] - \dfrac{p}{2}\ln 2\pi - \dfrac{1}{2} E_{\mu_j, \Lambda_j}\left[ \|x^n - \mu_j\|^2_{\Lambda_j} \right]$

$q^*(\{z^n\}) = \prod_{n=1}^{N} \prod_{j=1}^{M} (r_j^n)^{z_j^n}$, where $r_j^n = \dfrac{\rho_j^n}{\sum_{k=1}^{M} \rho_k^n}$ .... the responsibilities
Application to Gaussian Mixtures - 6

Variational Bayes E-step (VBE-step) .... continued.

− Evaluation of the responsibilities.
− Recall $\ln \rho_j^n = E[\ln P_j] + \dfrac{1}{2} E[\ln |\Lambda_j|] - \dfrac{p}{2}\ln 2\pi - \dfrac{1}{2} E_{\mu_j, \Lambda_j}\left[ \|x^n - \mu_j\|^2_{\Lambda_j} \right]$

$E[\ln P_j] = \psi(\alpha_j) - \psi\left( \sum_{k=1}^{M} \alpha_k \right)$; $\psi(\alpha) = \dfrac{d}{d\alpha}\ln\Gamma(\alpha)$ .... the digamma function

$E[\ln |\Lambda_j|] = \sum_{i=1}^{p} \psi\left( \dfrac{\nu_j + 1 - i}{2} \right) + p\ln 2 + \ln |W_j|$

$E_{\mu_j, \Lambda_j}\left[ (x^n - \mu_j)^T \Lambda_j (x^n - \mu_j) \right] = \dfrac{p}{\beta_j} + \nu_j (x^n - m_j)^T W_j (x^n - m_j)$

Since these expectations are available in closed form,

$r_j^n \propto \tilde{\pi}_j\, \tilde{\Lambda}_j^{1/2} \exp\left( -\dfrac{p}{2\beta_j} - \dfrac{\nu_j}{2}(x^n - m_j)^T W_j (x^n - m_j) \right)$, with $\tilde{\pi}_j = e^{E[\ln P_j]}$ and $\tilde{\Lambda}_j = e^{E[\ln |\Lambda_j|]}$.

See Bishop, Chapter 10.
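Putting the three expectations together, a minimal VBE-step sketch (the function name vbe_step is my own; it uses scipy.special.digamma). Alternating vbe_step and vbm_step, with r passed between them, gives the VBEM loop:

```python
import numpy as np
from scipy.special import digamma

def vbe_step(X, alpha, beta, m, nu, W_inv):
    """One VBE-step: responsibilities r_j^n from the closed-form expectations above."""
    N, p = X.shape
    M = len(alpha)
    W = np.linalg.inv(W_inv)
    ln_pi = digamma(alpha) - digamma(alpha.sum())             # E[ln P_j]
    log_rho = np.empty((N, M))
    for j in range(M):
        ln_lam = (digamma((nu[j] + 1 - np.arange(1, p + 1)) / 2).sum()
                  + p * np.log(2) + np.linalg.slogdet(W[j])[1])  # E[ln |Lambda_j|]
        d = X - m[j]
        quad = p / beta[j] + nu[j] * np.einsum('ni,ij,nj->n', d, W[j], d)
        log_rho[:, j] = ln_pi[j] + 0.5 * ln_lam - 0.5 * p * np.log(2 * np.pi) - 0.5 * quad
    log_rho -= log_rho.max(axis=1, keepdims=True)             # normalize stably
    r = np.exp(log_rho)
    return r / r.sum(axis=1, keepdims=True)
```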
VB Approximation of Gaussian Mixtures

• In VBEM, start with a large $M$ and a very small $\alpha_0 \ll 1$ (e.g., 0.001).
• It automatically prunes clusters with very few members ("rich get richer").
• In this example, we start with 6 clusters, but only 2 remain at the end.

[Figure: VBEM on the Old Faithful data at iteration 94; left, the surviving cluster fits; right, the counts assigned to each of the 6 initial clusters. mixGaussVbDemoFaithful from Murphy, Page 755.]
Bayesian Model Selection - 1

• $\mathcal{M}$ = set of models (e.g., linear, quadratic discriminants); model $m$ is specified by a parameter vector $\theta_m$; $m \in \mathcal{M}$.
• Bayesian model selection (maximize the probability of the model given the data):

$P(m \mid D) = \dfrac{p(D \mid m)\, p(m)}{\sum_{l} p(D \mid l)\, p(l)}$

• Bayes factors for comparing models $m$ and $l$:

$BF(m, l) = \dfrac{p(D \mid m)}{p(D \mid l)} = \dfrac{p(m \mid D)/p(m)}{p(l \mid D)/p(l)} \approx \dfrac{e^{-\frac{1}{2}BIC(m)}}{e^{-\frac{1}{2}BIC(l)}}$

$BF(m, l) > 1 \Rightarrow$ model $m$ is preferred over model $l$.

BIC = Bayesian Information Criterion (using the negative log likelihood).
Bayesian Model Selection - 2

Bayesian Information Criterion (BIC) .... minimize BIC:

$BIC_m = -2\ln p(D \mid \hat{\theta}_m) + dof(\hat{\theta}_m)\ln N$ .... the Schwarz criterion

Valid for regression, classification, and density estimation.

For linear and quadratic classifiers, $dof(\hat{\theta}_m)$ is (note: only $(C-1)$ free probabilities):
− $Cp + C$: equal and spherical covariance
− $Cp + p + C - 1$: equal and spherical, feature-dependent covariance
− $Cp + p(p+1)/2 + C - 1$: equal and general covariance
− $Cp + Cp(p+1)/2 + C - 1$: unequal and general covariance

BIC is closely related to Minimum Description Length (MDL).

Adjusted BIC (for small samples): replace $\ln N$ with $\ln[(N + 2)/24]$.

Akaike Information Criterion (AIC) .... minimize AIC:

$AIC_m = -2\ln p(D \mid \hat{\theta}_m) + 2\, dof(\hat{\theta}_m)$

For small $N$ and Gaussian models:

$AIC_c = AIC + \dfrac{2\, dof(\hat{\theta}_m)\left( dof(\hat{\theta}_m) + 1 \right)}{N - dof(\hat{\theta}_m) - 1}$
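A small helper capturing both criteria (a sketch: dof must be supplied by the model; the last line shows the "unequal and general" count from the list above for illustrative C and p):

```python
import numpy as np

def bic(neg_loglik, dof, N):
    # BIC_m = -2 ln p(D | theta_hat_m) + dof * ln N
    return 2 * neg_loglik + dof * np.log(N)

def aic(neg_loglik, dof, N=None, small_sample=False):
    # AIC_m = -2 ln p(D | theta_hat_m) + 2 * dof, optionally with the
    # small-N Gaussian correction AIC_c
    val = 2 * neg_loglik + 2 * dof
    if small_sample:
        val += 2 * dof * (dof + 1) / (N - dof - 1)
    return val

C, p = 3, 4
dof_quadratic = C * p + C * p * (p + 1) // 2 + C - 1   # unequal, general covariances
```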
Binary Classification, BIC & AIC - 1
Class z = 0: N(0,1);zero mean … Null hypothesis, H0
Class z = 1: N (µ,1); µ 0 …. Alternative hypothesis, H1
Under null hypothesis, sample mean
Note .
So, under H1:
1 2int : { , ,..., }NN scalar data po s D x x x
1~ (0, ) ~ (0,1)x N N x N
N
x x
~ (0,1)N x N
0
( | | ) 1 ( . ., 2 0.05),
0 1 .
.
, : | |
If P N x c for a specified c e g c for
we are confident that with probability
is the probability of faslsely rejecting H
cSo test statistic is x
N
Binary Classification, BIC & AIC - 2
BIC
AIC
2
0
1
2 22
1
1 1
2
1 0
ˆ ˆ2ln ( | ) ( ) ln
( )
( ) ln ln
ln ln, ( ) ( ) | |
m
Ni
i
N Ni i
i i
BIC p D dof N
BIC H x
BIC H x x N x N x N
N NSo BIC H BIC H if x x
N N
2
0
1
2 22
1
1 1
2
1 0
ˆ ˆ2ln ( | ) 2. ( )
( )
( ) 2 2
2 2, ( ) ( ) | |
; 2
m
Ni
i
N Ni i
i i
AIC p D dof
AIC H x
AIC H x x x N x
So AIC H AIC H if x xN N
Similar to classical hypothesis testing c
sample number-
dependent threshold
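A quick comparison of the two $|\bar{x}|$ thresholds for illustrative values of N:

```python
import numpy as np

# |x-bar| thresholds for preferring H1 over H0, derived above
for N in [10, 100, 1000, 10000]:
    print(N, "BIC:", np.sqrt(np.log(N) / N), "AIC:", np.sqrt(2 / N))
```

Once $\ln N > 2$ (i.e., $N > e^2 \approx 7.4$), the BIC threshold exceeds the AIC threshold, so BIC is the more conservative test for essentially all sample sizes of interest.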
Holdout Method of Cross Validation

Goal: estimate the probability of error (misclassification rate), $PE$.

• Holdout Method: split the $N$ samples into $N - K$ for training and $K$ for validation; typically $K \approx N/5$.

Error count = $R$ out of $K$ $\Rightarrow$ $\widehat{PE} = R/K$.

From the binomial distribution: $\sigma_{PE} = \sqrt{\dfrac{PE(1 - PE)}{K}}$

To obtain an estimate of $PE$ within 1%: $0.01 \ge 2\sqrt{\dfrac{PE(1 - PE)}{K}}$

If $PE \approx 0.05$: $K \ge \dfrac{4\,PE(1 - PE)}{10^{-4}} = 40000\,PE(1 - PE) = 1900$

When $PE \approx 1/2$, we need lots of samples .... 10,000.

We can also estimate the class-conditional error rate, $PE(z = k)$. Then

$PE = \sum_{k=1}^{C} P(z = k)\, PE(z = k)$

Are there better bounds?
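The sample-size calculation above as a one-line helper (z = 2 approximates the 95% band; the function name is my own):

```python
import numpy as np

def holdout_size(PE, abs_err=0.01, z=2.0):
    """Validation-set size K so that z standard deviations of the binomial
    error estimate fit within abs_err."""
    return int(np.ceil(z**2 * PE * (1 - PE) / abs_err**2))

print(holdout_size(0.05))  # 1900
print(holdout_size(0.5))   # 10000
```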
Markov & Chebyshev Inequalities

• Suppose we have a nonnegative rv $x$ (discrete or continuous); assume continuous WLOG.
• Markov inequality:

$E(x) = \int_0^{\infty} x f(x)\, dx \ge \int_{\epsilon}^{\infty} x f(x)\, dx \ge \epsilon \int_{\epsilon}^{\infty} f(x)\, dx = \epsilon\, P(x \ge \epsilon) \Rightarrow P(x \ge \epsilon) \le \dfrac{E(x)}{\epsilon}$

• Chebyshev inequality:

$P(|x - E(x)| \ge \epsilon) = P\left( (x - E(x))^2 \ge \epsilon^2 \right) \le \dfrac{E\left[ (x - E(x))^2 \right]}{\epsilon^2} = \dfrac{\sigma^2}{\epsilon^2}$

Example: $\bar{x}$ = sample mean of $n$ numbers with mean $m$ and variance $\sigma^2$:

$P(|\bar{x} - m| \ge \epsilon) \le \dfrac{\sigma^2}{n\epsilon^2}$

For a binary classifier with unknown probability of error $P$ and sample error rate $S$:

$P(|S - P| \ge \epsilon) \le \dfrac{P(1 - P)}{n\epsilon^2} \le \dfrac{1}{4n\epsilon^2}$; for $n = 100$, $\epsilon = 0.2$: the bound is 0.0625 .... loose
Hoeffding's Inequality

Let $Y = \sum_{i=1}^{n} X_i$, with independent $X_i \in [a_i, b_i]$ and $R_i = b_i - a_i$. Then

$P\{|Y - E(Y)| \ge n\epsilon\} \le 2\, e^{-2n^2\epsilon^2 / \sum_{i=1}^{n} R_i^2}$; when $R_i = b_i - a_i = 1$, $P\{|Y - E(Y)| \ge n\epsilon\} \le 2\, e^{-2n\epsilon^2}$

Proof sketch:

Let $Z_i = X_i - E(X_i)$, so that $E(Z_i) = 0$ and $Z_i \in [-R_i/2, R_i/2]$.

$P\{|Y - E(Y)| \ge n\epsilon\} = P\{|\sum_i Z_i| \ge n\epsilon\} = P\{\sum_i Z_i \ge n\epsilon\} + P\{-\sum_i Z_i \ge n\epsilon\}$; consider one tail.

For any $t > 0$:

$P\{\sum_i Z_i \ge n\epsilon\} = P\{e^{t\sum_i Z_i} \ge e^{tn\epsilon}\} \le e^{-tn\epsilon}\, E[e^{t\sum_i Z_i}]$ (Markov inequality)
$= e^{-tn\epsilon} \prod_{i=1}^{n} E[e^{tZ_i}]$ (independence)

By Hoeffding's lemma (convexity of $e^{tz}$ on $[-R_i/2, R_i/2]$ together with $E(Z_i) = 0$ and Jensen's inequality; see the link below):

$E[e^{tZ_i}] \le e^{t^2 R_i^2 / 8}$

So $P\{\sum_i Z_i \ge n\epsilon\} \le e^{-tn\epsilon}\, e^{t^2 \sum_i R_i^2 / 8}$. The RHS is minimized when $t = 4n\epsilon / \sum_i R_i^2$, giving

$P\{\sum_i Z_i \ge n\epsilon\} \le e^{-2n^2\epsilon^2 / \sum_i R_i^2} = e^{-2n\epsilon^2}$ when $R_i = 1$

https://en.wikipedia.org/wiki/Hoeffding%27s_lemma
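Inverting the bound for the number of samples needed (a sketch; compare with the Chebyshev example on the previous slide, where n = 100 only guaranteed a tail probability of 0.0625 at eps = 0.2):

```python
import numpy as np

def hoeffding_n(eps, delta, R=1.0):
    """Samples n so that P(|sample mean - E| >= eps) <= delta for rv's with range R,
    from 2 exp(-2 n eps^2 / R^2) <= delta."""
    return int(np.ceil(R**2 * np.log(2 / delta) / (2 * eps**2)))

print(hoeffding_n(0.2, 0.0625))  # 44 samples suffice for the same guarantee
```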
Rationale behind Cross Validation

• Suppose we have classifiers/models $M_1, M_2, \dots, M_C$.
• Training data $D$ and validation data $V$.
• Samples are assumed to be i.i.d.

Average log likelihood on the validation set: $\hat{l}_m = \dfrac{1}{|V|}\sum_{x_i \in V} \ln p(x_i \mid \hat{\theta}_m)$

Since $\hat{\theta}_m$ does not depend on $V$: $E\{\hat{l}_m\} = l_m$.

By Hoeffding's inequality, with a union bound over the $C$ models:

$P\{\max_m |\hat{l}_m - l_m| \ge \epsilon\} = P\{\cup_m\, |\hat{l}_m - l_m| \ge \epsilon\} \le \sum_m P\{|\hat{l}_m - l_m| \ge \epsilon\} \le 2C\, e^{-2|V|\epsilon^2}$

So if $\delta = 2C\, e^{-2|V|\epsilon^2}$, then $\epsilon = \sqrt{\dfrac{\ln(2C/\delta)}{2|V|}}$.

"Confidence is cheap but accuracy is more expensive": $\delta$ enters only through a logarithm, while $\epsilon$ shrinks only as $1/\sqrt{|V|}$.

$\max_m \hat{l}_m \ge \max_m l_m - 2\epsilon = \max_m l_m - O\left( \sqrt{\dfrac{\ln C}{|V|}} \right)$

So, with probability at least $(1 - \delta)$, one chooses the best model.
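A quick evaluation of this accuracy term for illustrative values of C, delta, and |V|:

```python
import numpy as np

# eps = sqrt(ln(2C/delta) / (2|V|)): accuracy of validation-based model selection
C, delta = 10, 0.05
for V in [100, 1000, 10000]:
    print(V, np.sqrt(np.log(2 * C / delta) / (2 * V)))
```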
S-fold Cross Validation

• S-fold cross validation: partition the data into $S$ folds $D_1, D_2, \dots, D_S$ of size $N/S$ each; in run $s$, train on the other $S - 1$ folds and validate on $D_s$. $S$ is typically 5-10.

$PE = \dfrac{1}{N}\sum_{s=1}^{S}\sum_{i \in D_s} I(\alpha(x_i) \ne z_i)$; $I(\cdot)$ = indicator function

• $S = N$: N-fold cross validation, or the leave-one-out CV (LOOCV) method:

$PE = \dfrac{1}{N}\sum_{i=1}^{N} I(\alpha_i(x_i) \ne z_i)$; $\alpha_i(x) = f(x, D \setminus \{(x_i, z_i)\})$

• Practical scheme: 5x2 cross-validation. Variation: 5 repetitions of 2-fold cross-validation on a randomized dataset.
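A generic S-fold CV sketch (fit and predict are hypothetical callables standing in for any classifier; S = N gives LOOCV):

```python
import numpy as np

def s_fold_cv_error(X, z, fit, predict, S=5, seed=0):
    """PE = (1/N) sum_s sum_{i in D_s} I(alpha(x_i) != z_i), with alpha
    trained on the other S-1 folds."""
    N = len(z)
    idx = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(idx, S)
    errors = 0
    for s in range(S):
        val = folds[s]
        train = np.hstack([folds[t] for t in range(S) if t != s])
        model = fit(X[train], z[train])
        errors += np.sum(predict(model, X[val]) != z[val])
    return errors / N
```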
Summary of Validation and Model Selection Methods

• Validation: simple split (holdout), S-fold CV, leave-one-out CV ($S = N$ in S-fold CV).
• Model selection.
• Regularization and model selection.
• Any combination of CV, model selection, and regularization (hyper-parameter selection) is possible. See AMLbook.com.
Bootstrap Method of Validation

• Bootstrap Method: sample from $D = \{x_1, x_2, \dots, x_N\}$ with uniform probability $(1/N)$; each $x_i$ is drawn independently with replacement.

Let $b$ = bootstrap index, $b = 1, 2, \dots, B$; $B$ = number of bootstrap samples (typically 50-200).

Do $b = 1, 2, \dots, B$:
  Bootstrap sample $D_b = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ $\Rightarrow$ $PE_{training}(b)$
  Validation data: $A_b = D \setminus D_b$ (samples not in the bootstrap) $\Rightarrow$ $PE_{val}(b)$
End

$PE = 0.632\, \overline{PE_{val}(b)} + 0.368\, \overline{PE_{training}(b)}$ .... the Efron estimator
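A sketch of the 0.632 estimator (again with hypothetical fit/predict callables; averages are taken over the B bootstrap replicates):

```python
import numpy as np

def efron_632(X, z, fit, predict, B=100, seed=0):
    """0.632 bootstrap: PE = 0.632 * mean out-of-bag error
    + 0.368 * mean training (resubstitution) error."""
    rng = np.random.default_rng(seed)
    N = len(z)
    pe_val, pe_train = [], []
    for _ in range(B):
        boot = rng.integers(0, N, N)              # sample with replacement
        oob = np.setdiff1d(np.arange(N), boot)    # A_b = D \ D_b
        model = fit(X[boot], z[boot])
        pe_train.append(np.mean(predict(model, X[boot]) != z[boot]))
        if len(oob):
            pe_val.append(np.mean(predict(model, X[oob]) != z[oob]))
    return 0.632 * np.mean(pe_val) + 0.368 * np.mean(pe_train)
```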
Confusion Matrix

• Bootstrap Method (cont'd). Why 0.632/0.368?

$P\{\text{observation} \notin \text{bootstrap sample}\} = \left( 1 - \dfrac{1}{N} \right)^N \to e^{-1} \approx 0.368$ as $N \to \infty$
$P\{\text{observation} \in \text{validation samples}\} = 0.368$
$P\{\text{observation} \in \text{training samples}\} = 0.632$

• Confusion Matrix: $P = [P_{ij}]$

$P_{ij} = P\{\text{decision } i \mid z = j\} \approx \dfrac{N_{ij}}{N_j};\ i = 0, 1, 2, \dots, C;\ j = 1, 2, \dots, C$

In words, just count the errors from the validation set, bootstrap, etc. for class $j$ and divide by the number of samples from class $j$ (the decision $i = 0$ can serve as a reject option).
Performance Metrics for Binary Classification - 1

• Contingency table showing the differences between the true and predicted classes for a set of labeled examples (rows: PREDICTED; columns: TRUE; no reject option):

                        TRUE: Fault                TRUE: No-Fault              Total
  Positive detection    Detected faults (TP)       False alarms (FP)           Total positive detections
  Negative detection    Missed faults (FN)         Correct rejections (TN)     Total negative detections
  Total                 Faulty samples             Fault-free samples          All samples (N)

• The following metrics can be derived from the confusion matrix:
− PD (Sensitivity, Recall): TP/(TP+FN)
− PF (1 - Specificity): FP/(TN+FP)
− Positive prediction power (Precision): TP/(TP+FP)
− Negative prediction power: TN/(TN+FN)
− Correct classification rate (CCR): (TP+TN)/N
− Misclassification rate: (FP+FN)/N
− Odds-ratio: (TP*TN)/(FP*FN)
− Kappa:

$\kappa = \dfrac{CCR - \sum_{i=1}^{2} P_{row(i)}\, P_{col(i)}}{1 - \sum_{i=1}^{2} P_{row(i)}\, P_{col(i)}}$

where $P_{row(i)}$ = fraction of entries in row $i$ and $P_{col(i)}$ = fraction of entries in column $i$ (when computing these, all four entries should first be normalized to sum to 1).

Poor: $\kappa < 0.4$; Good: $0.4 < \kappa < 0.75$; Excellent: $\kappa > 0.75$.
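All of the listed metrics from the four counts, as a small helper (the name binary_metrics is my own; kappa uses the row/column proportions as above):

```python
def binary_metrics(TP, FP, FN, TN):
    """Binary-classification metrics from the four confusion-matrix counts."""
    N = TP + FP + FN + TN
    ccr = (TP + TN) / N
    # chance agreement from row/column proportions (entries normalized to sum to 1)
    p_e = ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / N**2
    return {
        "PD (recall)":       TP / (TP + FN),
        "PF":                FP / (TN + FP),
        "precision":         TP / (TP + FP),
        "neg. pred. power":  TN / (TN + FN),
        "CCR":               ccr,
        "misclassification": (FP + FN) / N,
        "odds ratio":        (TP * TN) / (FP * FN),
        "kappa":             (ccr - p_e) / (1 - p_e),
    }
```

For example, binary_metrics(3023, 1518, 1977, 3482) reproduces the aircraft-engine numbers on the next slide (PD = 0.605, PF = 0.304, kappa = 0.301).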
Performance Metrics for Binary Classification - 2

Aircraft engine data (counts, with each column normalized by the true-class total):

                        Fault          No-Fault
  Positive detection    3023 (TP)      1518 (FP)
  Negative detection    1977 (FN)      3482 (TN)

  Normalized            0.605          0.304
                        0.395          0.696

Metrics:
• $P_D$ = 0.605; false negative rate = 0.395
• $P_F$ = 0.304 (false positive rate)
• Correct classification rate = 0.65; misclassification rate = 0.35
• Odds ratio = 3.51
• Kappa = 0.301 .... poor
• Positive prediction power = 0.666; negative prediction power = 0.638
• Prevalence = 0.5 (priors)

In terms of $P_D$, $P_F$, and the priors $P_1 = P(z = 1)$, $P_0 = P(z = 0)$:

$Kappa = \dfrac{2 P_1 P_0 (P_D - P_F)}{P_1 + (P_1 P_D + P_0 P_F)(P_0 - P_1)} = P_D - P_F$ if $P_1 = P_0$

Recall $R = P_D = 0.605$; Precision $P = \dfrac{P_D}{P_D + P_F} = 0.666$ (equal priors)

$F$ score $= \dfrac{2PR}{P + R} = 0.634$
Confusion Matrices for Multiple Classes - 1

• One-versus-One
− Generates $C(C-1)/2$ confusion matrices; for $C = 3$:
  • C1 vs. C2
  • C1 vs. C3
  • C2 vs. C3
• One-versus-All
− Generates $C$ confusion matrices:
  • C1 vs. C2 & C3
  • C2 vs. C1 & C3
  • C3 vs. C1 & C2
− These can be summed, creating a single 2x2 matrix.

[Figure: the confusion matrix for C classes, the one-versus-one confusion matrices, and the summed one-versus-all confusion matrix.]
Confusion Matrices for Multiple Classes - 2

• Some-versus-Rest
− Generates $2^{C-1} - 1$ confusion matrices.
− Both the true and false classifications may be sums.
− Here is a four-class example:
  • C1 vs. C2 & C3 & C4
  • C2 vs. C1 & C3 & C4
  • C1 & C2 vs. C3 & C4
  • C3 vs. C1 & C2 & C4
  • C1 & C3 vs. C2 & C4
  • C2 & C3 vs. C1 & C4
  • C1 & C2 & C3 vs. C4

Can form the basis for code-book based classifiers.
Cobweb

• Cobweb diagram
− Illustrates the probability that each class will be predicted incorrectly (the off-diagonal cells of the confusion matrix).
− Shows the relative performance between classes for each classifier.
− High-performance classifiers have poor visibility in the cobweb.
− May be difficult to interpret for high numbers of classes: $C(C-1)/2$ rays.
Other Metrics of Performance Assessment

• Fawcett's Extension
− Sums the areas under the ROC curves for each class versus the rest, weighted by the probability of that class:

$AUC_F = \sum_{i=1}^{C} P(z = i)\, AUC(i, \text{rest})$; AUC = area under the ROC curve

• Hand and Till Function
− Averages the areas under the curves for each pair of classes:

$HT = \dfrac{2}{C(C-1)} \sum_{i=1}^{C} \sum_{j > i} AUC(i, j)$

• Macro-Average Modified
− Uses both the geometric mean and the average correct classification rate:

$MAVG_{MOD} = 0.75 \left( \prod_{i=1}^{C} P_{ii} \right)^{1/C} + 0.25 \left( \dfrac{1}{C} \sum_{i=1}^{C} P_{ii} \right)$
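A sketch of the Hand and Till average (auc_pair(i, j) is a hypothetical callable returning the pairwise AUC between classes i and j):

```python
def hand_till(auc_pair, C):
    """HT = 2/(C(C-1)) * sum of AUC(i, j) over unordered class pairs."""
    total = sum(auc_pair(i, j) for i in range(C) for j in range(i + 1, C))
    return 2.0 * total / (C * (C - 1))
```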
Comparing Classifiers

• Suppose we have classifiers A and B.
• $n_A$ = number of errors made by A but not by B.
• $n_B$ = number of errors made by B but not by A.
• McNemar's Test: check if

$\dfrac{|n_A - n_B| - 1}{\sqrt{n_A + n_B}} \sim N(0, 1)$

• Need $|n_A - n_B| > 5$ for a significant difference.
• To detect a 1% difference in error rates, need at least 500 samples.
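The test statistic as a one-liner (the counts below are illustrative):

```python
import numpy as np

def mcnemar_z(nA, nB):
    """McNemar statistic: approximately N(0, 1) if A and B have equal error rates."""
    return (abs(nA - nB) - 1) / np.sqrt(nA + nB)

print(mcnemar_z(30, 50))  # |z| > 2 suggests a significant difference at the 5% level
```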
References on Performance Assessment

1. Alpaydin, E., Introduction to Machine Learning, MIT Press, Cambridge, 2004.
2. Kuncheva, L.I., Combining Pattern Classifiers: Methods and Algorithms, John Wiley and Sons, Hoboken, NJ, 2004.
3. Ferri, C., "Volume Under the ROC Surface for Multi-class Problems," Univ. Politecnica de Valencia, Spain, 2003.
4. Mossman, D., "Three-way ROCs," Medical Decision Making, 19(1):78-89, 1999.
5. Patel, A., and Markey, M.K., "Comparison of three-class classification performance metrics: a case study in breast cancer CAD," Medical Imaging 2005: Image Processing.
6. Ferri, C., Hernandez, J., and Salido, M.A., "Volume Under the ROC Surface for Multiclass Problems. Exact Computation and Evaluation of Approximations," Technical Report DSIC, Univ. Politec. Valencia, 2003. http://www.dsic.upv.es/users/elp/cferri/VUS.pdf
7. Mooney, C.Z., and Duval, R.D., Bootstrapping: A Non-Parametric Approach to Statistical Inference, Newbury Park, CA: Sage Publications, 1993.
8. Martin, J.K., and Hirschberg, D.S., "Small sample statistics for classification error rates, I: Error rate measurements," Technical Report 96-21, Dept. of Information & Computer Science, University of California, Irvine, 1996.
Summary

Estimating Parameters of Densities From Data
 Maximum Likelihood Methods
 Bayesian Learning
Probability Density Estimation
 Histogram Methods
 Parzen Windows
 Probabilistic Neural Network
 k-nearest Neighbor Approach
Mixture Models
 Estimate parameters via the EM and VBEM algorithms
 Various interpretations of EM
Performance Assessment of Classifiers