1
CS340 Machine learning: Gaussian classifiers
2
Correlated features
• Height and weight are not independent
3
Multivariate Gaussian
• Multivariate Normal (MVN)
• Exponent is the Mahalanobis distance between x and µ
Σ is the covariance matrix (positive definite)
$$\mathcal{N}(x \mid \mu, \Sigma) \;\stackrel{\text{def}}{=}\; \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\left[-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right]$$
$$\Delta = (x-\mu)^T \Sigma^{-1} (x-\mu)$$
$$x^T \Sigma x > 0 \quad \forall x \neq 0$$
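A minimal numerical sketch (MATLAB/Octave, with made-up values for µ and Σ) of evaluating this density via the Mahalanobis distance; mvnpdf, the Statistics Toolbox routine also used later in these slides, should return the same value:

mu = [0; 0]; Sigma = [2 1; 1 2];               % illustrative mean and positive definite covariance
x  = [1; -1];  p = numel(x);
delta = (x - mu)' / Sigma * (x - mu);          % Mahalanobis distance (x-mu)' * inv(Sigma) * (x-mu)
dens  = exp(-0.5*delta) / ((2*pi)^(p/2) * sqrt(det(Sigma)));
% check: mvnpdf(x', mu', Sigma) gives the same density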
4
Bivariate Gaussian
• Covariance matrix is
where the correlation coefficient is
and satisfies -1 ≤ ρ ≤ 1
• Density is
$$\Sigma = \begin{pmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix}, \qquad \rho = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
$$p(x,y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2\rho x y}{\sigma_x\sigma_y}\right)\right) \qquad \text{(zero-mean case)}$$
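As a quick numerical check (MATLAB/Octave sketch with made-up values for σx, σy, ρ), the bivariate formula above matches the general MVN density built from Σ:

sx = 1.0; sy = 2.0; rho = 0.5;                 % illustrative parameters, not from the slides
Sigma = [sx^2 rho*sx*sy; rho*sx*sy sy^2];      % covariance built from the bivariate parameterization
x = 0.3; y = -1.2;                             % evaluate the zero-mean density at one point
p1 = 1/(2*pi*sx*sy*sqrt(1-rho^2)) * ...
     exp(-1/(2*(1-rho^2)) * (x^2/sx^2 + y^2/sy^2 - 2*rho*x*y/(sx*sy)));
p2 = mvnpdf([x y], [0 0], Sigma);              % same value from the general MVN formula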
5
Spherical, diagonal, full covariance
$$\Sigma_{\text{spherical}} = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{pmatrix}, \qquad \Sigma_{\text{diagonal}} = \begin{pmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{pmatrix}, \qquad \Sigma_{\text{full}} = \begin{pmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix}$$
6
Surface plots
7
Generative classifier
• A generative classifier is one that defines a class-conditional density p(x|y=c) and combines this with a class prior p(c) to compute the class posterior
• Examples:
  – Naïve Bayes
  – Gaussian classifiers
• Alternative is a discriminative classifier, that estimates p(y=c|x) directly.
$$p(y=c \mid x) = \frac{p(x \mid y=c)\, p(y=c)}{\sum_{c'} p(x \mid y=c')\, p(c')}$$
Naïve Bayes: $p(x \mid y=c) = \prod_{j=1}^{d} p(x_j \mid y=c)$
Gaussian classifier: $p(x \mid y=c) = \mathcal{N}(x \mid \mu_c, \Sigma_c)$
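A minimal sketch (MATLAB/Octave, hypothetical class parameters) of the generative recipe with Gaussian class-conditionals: evaluate p(x|y=c) for each class, weight by the prior, and normalize:

mu  = {[0 0], [3 3]};  Sig = {eye(2), [2 0.5; 0.5 1]};   % illustrative per-class means and covariances
prior = [0.7 0.3];                                        % class prior p(y = c)
x   = [1 1];
lik = [mvnpdf(x, mu{1}, Sig{1}), mvnpdf(x, mu{2}, Sig{2})];   % class-conditional densities p(x | y = c)
post = lik .* prior / sum(lik .* prior);                      % class posterior p(y = c | x) by Bayes' rule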
8
Naïve Bayes with Bernoulli features
• Consider this class-conditional density
• The resulting class posterior (using the plug-in rule) has the form
• This can be rewritten as
$$p(x \mid y=c) = \prod_{i=1}^{d} \theta_{ic}^{I(x_i=1)} (1-\theta_{ic})^{I(x_i=0)}$$
$$p(y=c \mid x) = \frac{p(y=c)\, p(x \mid y=c)}{p(x)} = \frac{\pi_c \prod_{i=1}^{d} \theta_{ic}^{I(x_i=1)} (1-\theta_{ic})^{I(x_i=0)}}{p(x)}$$
$$p(Y=c \mid x, \theta, \pi) = \frac{p(x \mid y=c)\, p(y=c)}{\sum_{c'} p(x \mid y=c')\, p(y=c')} = \frac{\exp\left[\log p(x \mid y=c) + \log p(y=c)\right]}{\sum_{c'} \exp\left[\log p(x \mid y=c') + \log p(y=c')\right]}$$
$$= \frac{\exp\left[\log\pi_c + \sum_i I(x_i=1)\log\theta_{ic} + I(x_i=0)\log(1-\theta_{ic})\right]}{\sum_{c'} \exp\left[\log\pi_{c'} + \sum_i I(x_i=1)\log\theta_{ic'} + I(x_i=0)\log(1-\theta_{ic'})\right]}$$
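A sketch (MATLAB/Octave, with made-up θ and π) of evaluating this posterior in log space, which is how the exp-of-logs form above is typically computed:

theta = [0.8 0.1; 0.5 0.5; 0.2 0.9];       % theta(i,c) = p(x_i = 1 | y = c); d = 3 features, C = 2 classes
piC   = [0.6 0.4];                          % class priors pi_c
x     = [1 0 1];                            % a binary feature vector
logp  = log(piC);
for c = 1:2
    logp(c) = logp(c) + sum(x'.*log(theta(:,c)) + (1-x').*log(1-theta(:,c)));
end
post = exp(logp - max(logp));  post = post / sum(post);   % normalized class posterior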
9
Form of the class posterior
• From previous slide
• Define
• Then the posterior is given by the softmax function
• This is called softmax because it acts like the max function when |βc| → ∞
$$p(Y=c \mid x, \theta, \pi) \propto \exp\left[\log\pi_c + \sum_i I(x_i=1)\log\theta_{ic} + I(x_i=0)\log(1-\theta_{ic})\right]$$
$$x' = [1,\; I(x_1=1),\; I(x_1=0),\; \ldots,\; I(x_d=1),\; I(x_d=0)]$$
$$\beta_c = [\log\pi_c,\; \log\theta_{1c},\; \log(1-\theta_{1c}),\; \ldots,\; \log\theta_{dc},\; \log(1-\theta_{dc})]$$
$$p(Y=c \mid x, \beta) = \frac{\exp[\beta_c^T x']}{\sum_{c'} \exp[\beta_{c'}^T x']}$$
$$p(Y=c \mid x) = \begin{cases} 1.0 & \text{if } c = \arg\max_{c'} \beta_{c'}^T x' \\ 0.0 & \text{otherwise} \end{cases}$$
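The same computation written as a softmax over βcᵀx', using the expanded vector x' and weight vectors βc defined above (MATLAB/Octave sketch, same made-up θ and π as before):

theta = [0.8 0.1; 0.5 0.5; 0.2 0.9];  piC = [0.6 0.4];  x = [1 0 1];
xp = [1, reshape([x == 1; x == 0], 1, [])];        % x' = [1, I(x1=1), I(x1=0), ..., I(xd=1), I(xd=0)]
B  = zeros(2, numel(xp));
for c = 1:2
    B(c,:) = [log(piC(c)), reshape([log(theta(:,c)'); log(1-theta(:,c)')], 1, [])];   % beta_c
end
a = B * xp';
post = exp(a - max(a)) / sum(exp(a - max(a)));     % softmax over beta_c' * x'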
10
Two-class case
• From previous slide
• In the binary case, Y ∈ {0,1}, the softmax becomes the logistic (sigmoid) function σ(u) = 1/(1 + e^{−u})
$$p(Y=c \mid x, \beta) = \frac{\exp[\beta_c^T x']}{\sum_{c'} \exp[\beta_{c'}^T x']}$$
$$p(Y=1 \mid x, \theta) = \frac{e^{\beta_1^T x'}}{e^{\beta_1^T x'} + e^{\beta_0^T x'}} = \frac{1}{1 + e^{(\beta_0 - \beta_1)^T x'}} = \frac{1}{1 + e^{-w^T x'}} = \sigma(w^T x'), \qquad w \stackrel{\text{def}}{=} \beta_1 - \beta_0$$
11
Sigmoid function
• σ(ax + b): a controls the steepness, b shifts the threshold.
• For small a and x ≈ −b/a, the function is roughly linear
[Figure: sigmoid curves σ(ax + b) for a = 0.3, 1.0, 3.0 and b = −30, 0, +30]
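A sketch (MATLAB/Octave) of the kind of plot this slide shows; assuming one panel varies the steepness a (with b = 0) and the other varies the threshold b (with a = 1):

sigm = @(u) 1 ./ (1 + exp(-u));
x1 = linspace(-10, 10, 200);
subplot(1,2,1);                                   % varying steepness a, b = 0
plot(x1, sigm(0.3*x1), x1, sigm(1.0*x1), x1, sigm(3.0*x1));
legend('a=0.3', 'a=1.0', 'a=3.0');
x2 = linspace(-60, 60, 200);
subplot(1,2,2);                                   % varying threshold b, a = 1
plot(x2, sigm(x2 - 30), x2, sigm(x2), x2, sigm(x2 + 30));
legend('b=-30', 'b=0', 'b=+30');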
12
Sigmoid function in 2D
MacKay 39.3
σ(w1 x1 + w2 x2) = σ(wT x): w is perpendicular to the decision boundary
13
Logit function
• Let p=p(y=1) and η be the log odds
• Then p = σ(η) and η = logit(p)
• η is the natural parameter of the Bernoulli distribution, and p = E[y] is the moment parameter
• If η = wᵀx, then wᵢ is the amount by which the log-odds increases when xᵢ increases by one unit
$$\eta = \log\frac{p}{1-p}$$
$$\sigma(\eta) = \frac{1}{1+e^{-\eta}} = \frac{e^{\eta}}{e^{\eta}+1} = \frac{\frac{p}{1-p}}{\frac{p}{1-p}+1} = \frac{\frac{p}{1-p}}{\frac{p+1-p}{1-p}} = p$$
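A tiny check (MATLAB/Octave) that logit and sigmoid are inverses, matching the algebra above:

sigm  = @(eta) 1 ./ (1 + exp(-eta));
logit = @(p) log(p ./ (1 - p));
p   = 0.3;
eta = logit(p);      % the log odds
sigm(eta)            % recovers p = 0.3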
14
Gaussian classifiers
• Class posterior (using plug-in rule)
• We will consider the form of this equation for various special cases:
  – Σ1 = Σ0 (two classes, shared covariance)
  – tied Σ, many classes
  – general case (arbitrary Σc)
$$p(Y=c \mid x) = \frac{p(x \mid Y=c)\, p(Y=c)}{\sum_{c'=1}^{C} p(x \mid Y=c')\, p(Y=c')} = \frac{\pi_c\, |2\pi\Sigma_c|^{-1/2} \exp\left[-\tfrac{1}{2}(x-\mu_c)^T\Sigma_c^{-1}(x-\mu_c)\right]}{\sum_{c'} \pi_{c'}\, |2\pi\Sigma_{c'}|^{-1/2} \exp\left[-\tfrac{1}{2}(x-\mu_{c'})^T\Sigma_{c'}^{-1}(x-\mu_{c'})\right]}$$
15
Σ1 = Σ0
• Class posterior simplifies to
$$p(Y=1 \mid x) = \frac{p(x \mid Y=1)\, p(Y=1)}{p(x \mid Y=1)\, p(Y=1) + p(x \mid Y=0)\, p(Y=0)} = \frac{\pi_1 \exp\left[-\tfrac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\right]}{\pi_1 \exp\left[-\tfrac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\right] + \pi_0 \exp\left[-\tfrac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)\right]}$$
$$= \frac{\pi_1 e^{a_1}}{\pi_1 e^{a_1} + \pi_0 e^{a_0}} = \frac{1}{1 + \frac{\pi_0}{\pi_1} e^{a_0 - a_1}}, \qquad a_c \stackrel{\text{def}}{=} -\tfrac{1}{2}(x-\mu_c)^T\Sigma^{-1}(x-\mu_c)$$
16
Σ1 = Σ0
• Class posterior simplifies to
$$p(Y=1 \mid x) = \frac{1}{1 + \exp\left[-\log\frac{\pi_1}{\pi_0} + a_0 - a_1\right]}$$
$$a_0 - a_1 = -\tfrac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0) + \tfrac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1) = -(\mu_1-\mu_0)^T\Sigma^{-1}x + \tfrac{1}{2}(\mu_1-\mu_0)^T\Sigma^{-1}(\mu_1+\mu_0)$$
so
$$p(Y=1 \mid x) = \frac{1}{1 + \exp[-\beta^T x - \gamma]} = \sigma(\beta^T x + \gamma) \qquad \text{(a linear function of } x\text{)}$$
$$\beta \stackrel{\text{def}}{=} \Sigma^{-1}(\mu_1 - \mu_0), \qquad \gamma \stackrel{\text{def}}{=} -\tfrac{1}{2}(\mu_1-\mu_0)^T\Sigma^{-1}(\mu_1+\mu_0) + \log\frac{\pi_1}{\pi_0}$$
$$\sigma(\eta) \stackrel{\text{def}}{=} \frac{1}{1+e^{-\eta}} = \frac{e^{\eta}}{e^{\eta}+1}$$
17
Decision boundary
• Rewrite class posterior as
• If Σ=I, then w=(µ1-µ0) is in the direction of µ1-µ0, so the hyperplane is orthogonal to the line between the two means, and intersects it at x0
• If π1=π0, then x0 = 0.5(µ1+µ0) is midway between the two means
• If π1 increases, x0 decreases, so the boundary shifts toward µ0 (so more space gets mapped to class 1)
$$p(Y=1 \mid x) = \sigma(\beta^T x + \gamma) = \sigma(w^T(x - x_0))$$
$$w = \beta = \Sigma^{-1}(\mu_1 - \mu_0)$$
$$x_0 = \tfrac{1}{2}(\mu_1 + \mu_0) - \frac{\log(\pi_1/\pi_0)}{(\mu_1-\mu_0)^T\Sigma^{-1}(\mu_1-\mu_0)}\,(\mu_1 - \mu_0)$$
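A numerical sketch (MATLAB/Octave, with made-up µ0, µ1, Σ and priors) of these quantities; increasing π1 moves x0 toward µ0, enlarging the region assigned to class 1:

mu0 = [0; 0]; mu1 = [3; 1]; Sigma = [2 0.5; 0.5 1];   % illustrative parameters
pi1 = 0.7; pi0 = 0.3;
beta  = Sigma \ (mu1 - mu0);                                   % Sigma^{-1} (mu1 - mu0)
gamma = -0.5*(mu1 - mu0)' * (Sigma \ (mu1 + mu0)) + log(pi1/pi0);
w  = beta;
x0 = 0.5*(mu1 + mu0) - (mu1 - mu0) * log(pi1/pi0) / ((mu1 - mu0)' * (Sigma \ (mu1 - mu0)));
x  = [1; 2];
pY1 = 1 / (1 + exp(-(beta'*x + gamma)));        % p(Y = 1 | x) = sigma(beta' x + gamma)
% sigma(w' * (x - x0)) gives the same posterior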
18
Decision boundary in 1d
Discontinuous decision region
19
Decision boundary in 2d
20
Tied Σ, many classes
• Similarly to before
• This is the multinomial logit or softmax function
$$p(Y=c \mid x) = \frac{\pi_c \exp\left[-\tfrac{1}{2}(x-\mu_c)^T\Sigma^{-1}(x-\mu_c)\right]}{\sum_{c'} \pi_{c'} \exp\left[-\tfrac{1}{2}(x-\mu_{c'})^T\Sigma^{-1}(x-\mu_{c'})\right]} = \frac{\exp\left[\mu_c^T\Sigma^{-1}x - \tfrac{1}{2}\mu_c^T\Sigma^{-1}\mu_c + \log\pi_c\right]}{\sum_{c'} \exp\left[\mu_{c'}^T\Sigma^{-1}x - \tfrac{1}{2}\mu_{c'}^T\Sigma^{-1}\mu_{c'} + \log\pi_{c'}\right]}$$
(the common $-\tfrac{1}{2}x^T\Sigma^{-1}x$ term cancels between numerator and denominator because Σ is tied)
$$\theta_c \stackrel{\text{def}}{=} \begin{pmatrix} -\tfrac{1}{2}\mu_c^T\Sigma^{-1}\mu_c + \log\pi_c \\ \Sigma^{-1}\mu_c \end{pmatrix} = \begin{pmatrix} \gamma_c \\ \beta_c \end{pmatrix}$$
$$p(Y=c \mid x) = \frac{e^{\theta_c^T x'}}{\sum_{c'} e^{\theta_{c'}^T x'}} = \frac{e^{\beta_c^T x + \gamma_c}}{\sum_{c'} e^{\beta_{c'}^T x + \gamma_{c'}}}, \qquad x' \stackrel{\text{def}}{=} [1;\, x]$$
21
Tied Σ, many classes
• Discriminant function
• Decision boundary is again linear, since the xᵀΣ⁻¹x terms cancel
• If Σ = I, then the decision boundaries are orthogonal to µi - µj, otherwise skewed
$$g_c(x) = -\tfrac{1}{2}(x-\mu_c)^T\Sigma^{-1}(x-\mu_c) + \log p(Y=c) = \beta_c^T x + \beta_{c0}$$
$$\beta_c = \Sigma^{-1}\mu_c, \qquad \beta_{c0} = -\tfrac{1}{2}\mu_c^T\Sigma^{-1}\mu_c + \log\pi_c$$
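A sketch (MATLAB/Octave) of classifying a point with these linear discriminants; the means and priors are borrowed from the final slide's example, the tied Σ is an assumption:

mus   = [0 0; 0 5; 5 5];                  % class means, one per row (from the last slide's example)
Sigma = [2 0.5; 0.5 1.5];                 % assumed shared (tied) covariance
piC   = [1/3 1/3 1/3];
x     = [2; 3];
g = zeros(1, 3);
for c = 1:3
    betac  = Sigma \ mus(c,:)';                            % beta_c = Sigma^{-1} mu_c
    betac0 = -0.5 * mus(c,:) * (Sigma \ mus(c,:)') + log(piC(c));
    g(c)   = betac' * x + betac0;                          % linear discriminant g_c(x)
end
[~, yhat] = max(g);                        % predicted class label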
22
Decision boundaries
[x,y] = meshgrid(linspace(-10,10,100), linspace(-10,10,100));
[m,n] = size(x);  X = [x(:) y(:)];                        % grid points as rows
g1 = reshape(mvnpdf(X, mu1(:)', S1), [m n]); ...          % class-conditional densities on the grid
contour(x, y, g2*p2 - max(g1*p1, g3*p3), [0 0], '-k');    % zero contour = boundary of the class-2 region
23
Σ0, Σ1 arbitrary
• If the Σ are unconstrained, we end up with cross product terms, leading to quadratic decision boundaries
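A sketch (MATLAB/Octave, hypothetical parameters) of the unconstrained case: each class keeps its own Σc, so the discriminant retains its quadratic term in x:

mu  = {[0; 0], [3; 1]};
Sig = {[1 0; 0 1], [2 0.8; 0.8 1]};        % different covariance per class
piC = [0.5 0.5];
x   = [1; 1];
g = zeros(1, 2);
for c = 1:2
    d    = x - mu{c};
    g(c) = -0.5*log(det(Sig{c})) - 0.5 * d' * (Sig{c} \ d) + log(piC(c));   % quadratic in x
end
[~, yhat] = max(g);                         % class with largest discriminant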
24
General case
µ1 = (0, 0), µ2 = (0, 5), µ3 = (5, 5), π = (1/3, 1/3, 1/3)