Bayesian Machine Learning
Andrew Gordon Wilson
ORIE 6741, Lecture 2: Bayesian Basics
https://people.orie.cornell.edu/andrew/orie6741
Cornell University, August 25, 2016
1 / 17
Canonical Machine Learning Problems
▶ Linear Classification
▶ Polynomial Regression
▶ Clustering with Gaussian Mixtures
2 / 17
Linear Classification
▶ Data: D = {(x_i, y_i)}_{i=1}^n, with x_i ∈ R^D and y_i ∈ {+1, −1}.
▶ Model:

p(y_i = +1 | θ, x_i) = 1 if θ^T x_i ≥ 0, and 0 otherwise. (1)

▶ Parameters: θ ∈ R^D
▶ Goal: To infer θ from data and to predict future labels p(y|D, x).
3 / 17
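The hard-threshold model in Eq. (1) can be sketched in a few lines of NumPy; the parameter θ and data below are made up for illustration, not taken from the lecture.

```python
import numpy as np

# Hard-threshold linear classifier, Eq. (1):
# predict +1 when theta^T x >= 0, and -1 otherwise.

def predict(theta, X):
    """Return labels in {+1, -1} for the rows of X under parameter theta."""
    return np.where(X @ theta >= 0, 1, -1)

theta = np.array([1.0, -2.0])
X = np.array([[3.0, 1.0],   # theta^T x =  1 -> +1
              [1.0, 1.0],   # theta^T x = -1 -> -1
              [2.0, 1.0]])  # theta^T x =  0 -> +1 (on the decision boundary)
labels = predict(theta, X)
print(labels)  # [ 1 -1  1]
```

Note that this model puts all its probability mass on one label; the Bayesian treatment later in the course replaces this hard rule with a posterior over θ.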
Polynomial Regression
▶ Data: D = {(x_i, y_i)}_{i=1}^n, with x_i ∈ R and y_i ∈ R.
▶ Model: y_i = a_0 + a_1 x_i + a_2 x_i² + · · · + a_m x_i^m + ε_i, where ε_i ∼ N(0, σ²).
▶ Parameters: θ = (a_0, . . . , a_m, σ²)^T.
▶ Goal: To infer θ from the data and to predict future outputs p(y|D, x, m).
4 / 17
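A minimal sketch of fitting this model by maximum likelihood (data and coefficients below are synthetic, for illustration only): with Gaussian noise, the maximum-likelihood coefficients are the least-squares solution, and the maximum-likelihood σ² is the mean squared residual.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2
a_true = np.array([1.0, -0.5, 0.25])            # a_0, a_1, a_2
x = np.linspace(-3, 3, 50)
y = np.polyval(a_true[::-1], x) + rng.normal(0, 0.1, x.size)

Phi = np.vander(x, m + 1, increasing=True)      # columns: 1, x, x^2
a_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None) # ML estimate of (a_0, ..., a_m)
sigma2_hat = np.mean((Phi @ a_hat - y) ** 2)    # ML estimate of sigma^2

print(np.round(a_hat, 2))
```

The recovered coefficients should be close to `a_true`, and `sigma2_hat` close to the true noise variance 0.01.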
Clustering with Gaussian Mixtures (Density Estimation)
▶ Data: D = {x_i}_{i=1}^n, with x_i ∈ R^D.
▶ Model:

x_i ∼ Σ_{j=1}^m w_j p_j(x_i), (2)
where p_j(x_i) = N(x_i; μ_j, Σ_j). (3)

▶ Parameters: θ = ((μ_1, Σ_1), . . . , (μ_m, Σ_m), w)
▶ Goal: To infer θ from the data, predict the density p(x|D, m), and infer which points belong to which cluster.
5 / 17
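The generative process in Eqs. (2)–(3) can be sketched directly: draw a latent component index j with probability w_j, then draw the point from N(μ_j, Σ_j). The mixture parameters below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.array([0.3, 0.7])                        # mixture weights, sum to 1
mus = np.array([[-4.0, 0.0], [4.0, 0.0]])       # component means
Sigmas = np.array([np.eye(2), np.eye(2)])       # component covariances

n = 1000
z = rng.choice(len(w), size=n, p=w)             # latent cluster assignments
X = np.array([rng.multivariate_normal(mus[j], Sigmas[j]) for j in z])

print(X.shape, np.bincount(z) / n)              # (1000, 2), roughly [0.3, 0.7]
```

Inference reverses this process: given X but not z, recover θ and the posterior over which component generated each point.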
Statistics from Scratch
Basic Regression Problem
▶ Training set of N targets (observations) y = (y(x_1), . . . , y(x_N))^T.
▶ Observations evaluated at inputs X = (x_1, . . . , x_N)^T.
▶ Want to predict the value of y(x∗) at a test input x∗.
For example: Given CO2 concentrations y measured at times X, what will the CO2
concentration be for x∗ = 2024, 10 years from now?
Just knowing high school math, what might you try?
7 / 17
Statistics from Scratch
[Figure: airline passengers (thousands) plotted against year, 1949–1961.]
8 / 17
Statistics from Scratch
Guess the parametric form of a function that could fit the data
▶ f(x, w) = w^T x   [Linear function of w and x]
▶ f(x, w) = w^T φ(x)   [Linear function of w]   (Linear Basis Function Model)
▶ f(x, w) = g(w^T φ(x))   [Non-linear in x and w]   (E.g., Neural Network)

φ(x) is a vector of basis functions. For example, if φ(x) = (1, x, x²) and x ∈ R¹, then f(x, w) = w_0 + w_1 x + w_2 x² is a quadratic function.

Choose an error measure E(w) and minimize it with respect to w:
▶ E(w) = Σ_{i=1}^N [f(x_i, w) − y(x_i)]²
9 / 17
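Minimizing E(w) for the linear basis function model is an ordinary least-squares problem. A sketch on made-up data, using the quadratic basis φ(x) = (1, x, x²) from the example above:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 5.0, 10.0, 17.0])      # exactly 1 + x^2, noise-free

# Design matrix: row i is phi(x_i) = (1, x_i, x_i^2)
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Minimize E(w) = sum_i (w^T phi(x_i) - y_i)^2
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
E = np.sum((Phi @ w - y) ** 2)

print(np.round(w, 6))  # approximately [1, 0, 1], with E approximately 0
```

Because the data were generated exactly by a function in the model class, the minimized error is (numerically) zero; with noisy data it would not be.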
Statistics from Scratch
A probabilistic approach
We could explicitly account for noise in our model.
▶ y(x) = f(x, w) + ε(x), where ε(x) is a noise function.
One commonly takes ε(x) = N(0, σ²) for i.i.d. additive Gaussian noise, in which case

p(y(x) | x, w, σ²) = N(y(x); f(x, w), σ²)   Observation Model (4)
p(y | X, w, σ²) = Π_{i=1}^N N(y(x_i); f(x_i, w), σ²)   Likelihood (5)

▶ Maximize the likelihood of the data p(y|X, w, σ²) with respect to σ² and w.
For a Gaussian noise model, this approach makes the same predictions as minimizing a squared-loss error function:

log p(y|X, w, σ²) ∝ −(1/2σ²) Σ_{i=1}^N [f(x_i, w) − y(x_i)]² (6)
10 / 17
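Eq. (6) can be checked numerically on toy data: up to an additive constant that does not depend on w, the Gaussian log likelihood equals −1/(2σ²) times the squared error, so both objectives are maximized/minimized by the same w.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(0, 0.1, x.size)          # toy data, f(x, w) = w * x
sigma2 = 0.1 ** 2

def sq_error(w):
    return np.sum((w * x - y) ** 2)

def log_lik(w):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                  - (w * x - y) ** 2 / (2 * sigma2))

# log_lik(w) + sq_error(w) / (2 sigma^2) is the same constant for every w:
consts = [log_lik(w) + sq_error(w) / (2 * sigma2) for w in (0.0, 1.0, 2.0, 5.0)]
print(np.allclose(consts, consts[0]))  # True
```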
Statistics from Scratch
▶ The probabilistic approach helps us interpret the error measure in a deterministic approach, and gives us a sense of the noise level σ².
▶ Probabilistic methods thus provide an intuitive framework for representing uncertainty, and for model development.
▶ Both approaches are prone to over-fitting for flexible f(x, w): low error on the training data, high error on the test set.
Regularization
▶ Use a penalized log likelihood (or error function), such as

log p(y|X, w) ∝ −(1/2σ²) Σ_{i=1}^n [f(x_i, w) − y(x_i)]²   [model fit]   − λ w^T w   [complexity penalty]. (7)

▶ But how should we define complexity, and how much should we penalize complexity?
▶ Can set λ using cross-validation.
11 / 17
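A sketch of the penalized fit in Eq. (7) with λ set by cross-validation, as suggested above. The data, basis, and candidate λ grid are made up for illustration; the penalized minimizer is ridge regression, w = (Φ^TΦ + λI)⁻¹Φ^T y.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + rng.normal(0, 0.2, x.size)
Phi = np.vander(x, 8, increasing=True)          # flexible degree-7 polynomial basis

def ridge_fit(Phi, y, lam):
    """Minimizer of squared error + lam * w^T w."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

def loo_error(lam):
    """Leave-one-out cross-validation error for a given lambda."""
    errs = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        w = ridge_fit(Phi[mask], y[mask], lam)
        errs.append((Phi[i] @ w - y[i]) ** 2)
    return np.mean(errs)

lams = [1e-6, 1e-3, 1e-1, 1e1, 1e3]
best = min(lams, key=loo_error)
print(best)
```

Cross-validation answers "how much to penalize" empirically, but not the slide's deeper question of what complexity is; the Bayesian view on the next slide addresses that differently.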
Bayesian Inference
Bayes’ Rule
p(a|b) = p(b|a) p(a) / p(b),   p(a|b) ∝ p(b|a) p(a). (8)

posterior = (likelihood × prior) / marginal likelihood,   p(w | y, X, σ²) = p(y|X, w, σ²) p(w) / p(y|X, σ²). (9)

Predictive Distribution

p(y | x∗, y, X) = ∫ p(y | x∗, w) p(w | y, X) dw. (10)
▶ Think of each setting of w as a different model. Eq. (10) is a Bayesian model average: an average of infinitely many models weighted by their posterior probabilities.
▶ No over-fitting, automatically calibrated complexity.
▶ Eq. (10) is intractable for many likelihoods p(y|X, w, σ²) and priors p(w).
▶ Typically we are more interested in the induced distribution over functions than in the parameters w. It can be hard to have intuitions for priors p(w).
12 / 17
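One case where Eqs. (9)–(10) are tractable in closed form is the linear basis function model with a Gaussian prior w ∼ N(0, α⁻¹I) (the prior and all numbers below are assumptions for this sketch): the posterior is N(w; m, S) with S = (αI + Φ^TΦ/σ²)⁻¹ and m = SΦ^T y/σ², and the predictive distribution at x∗ is Gaussian with mean φ(x∗)^T m and variance φ(x∗)^T S φ(x∗) + σ².

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, sigma2 = 1.0, 0.1 ** 2                   # prior precision, noise variance
x = np.linspace(0, 1, 15)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, x.size)  # toy data from a line
Phi = np.column_stack([np.ones_like(x), x])     # phi(x) = (1, x)

# Posterior p(w | y, X) = N(w; m, S), Eq. (9)
S = np.linalg.inv(alpha * np.eye(2) + Phi.T @ Phi / sigma2)
m = S @ Phi.T @ y / sigma2

# Predictive p(y | x*, y, X), Eq. (10), at x* = 0.5
phi_star = np.array([1.0, 0.5])
mean_star = phi_star @ m
var_star = phi_star @ S @ phi_star + sigma2     # parameter uncertainty + noise

print(round(mean_star, 2))  # close to 1 + 2 * 0.5 = 2
```

Note the predictive variance exceeds the noise variance σ²: the extra term is exactly the uncertainty contributed by averaging over models w.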
Parametric Regression Review
Deterministic
E(w) = Σ_{i=1}^N (f(x_i, w) − y_i)². (11)

Maximum Likelihood

p(y(x) | x, w) = N(y(x); f(x, w), σ_n²), (12)
p(y | X, w) = Π_{i=1}^N N(y(x_i); f(x_i, w), σ_n²). (13)

Bayesian

posterior = (likelihood × prior) / marginal likelihood,   p(w | y, X) = p(y|X, w) p(w) / p(y|X). (14)

p(y | x∗, y, X) = ∫ p(y | x∗, w) p(w | y, X) dw. (15)
13 / 17
Worked Example: Basis Regression (Chalkboard)
▶ We have data D = {(x_i, y_i)}_{i=1}^N.
▶ We use the model:

y = w^T φ(x) + ε, (16)
ε ∼ N(0, σ²). (17)

▶ We want to make predictions of y∗ for any x∗.
▶ We will now explore this question on the whiteboard using maximum likelihood and Bayesian approaches.
▶ We will consider topics such as regularization, cross-validation, Bayesian model averaging, and conjugate priors.
14 / 17
Rant: Regularisation = MAP ≠ Bayesian Inference
Example: Density Estimation
▶ Observations y_1, . . . , y_N drawn from an unknown density p(y).
▶ Model: p(y|θ) = w_1 N(y | μ_1, σ_1²) + w_2 N(y | μ_2, σ_2²), with θ = {w_1, w_2, μ_1, μ_2, σ_1, σ_2}.
▶ Likelihood: p(y|θ) = Π_{i=1}^N p(y_i | θ).

Can learn all free parameters θ using maximum likelihood...
15 / 17
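...but maximum likelihood is degenerate here, which is the point of the rant. A numerical sketch (data made up): place μ_1 on one data point and shrink σ_1, and the log likelihood grows without bound.

```python
import numpy as np

y = np.array([-1.0, 0.0, 1.0, 2.0])             # toy observations

def log_lik(w1, mu1, s1, mu2, s2):
    """Log likelihood of the two-component Gaussian mixture."""
    comp1 = w1 * np.exp(-(y - mu1) ** 2 / (2 * s1 ** 2)) / np.sqrt(2 * np.pi * s1 ** 2)
    comp2 = (1 - w1) * np.exp(-(y - mu2) ** 2 / (2 * s2 ** 2)) / np.sqrt(2 * np.pi * s2 ** 2)
    return np.sum(np.log(comp1 + comp2))

# Fix mu_1 = y_1 = -1 and send sigma_1 -> 0: the log likelihood diverges,
# since one component becomes a spike on a single data point while the
# other component covers the rest.
vals = [log_lik(0.5, -1.0, s1, 1.0, 1.0) for s1 in (1.0, 0.1, 0.01, 0.001)]
print([round(v, 1) for v in vals])  # strictly increasing
```

So "learning all free parameters by maximum likelihood" has no finite optimum for this model; the next slide contrasts the MAP and fully Bayesian responses.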
Regularisation = MAP ≠ Bayesian Inference
Regularisation or MAP
▶ Find

argmax_θ log p(θ|y) = log p(y|θ)   [model fit]   + log p(θ)   [complexity penalty]   (up to an additive constant).

▶ Choose p(θ) such that p(θ) → 0 faster than p(y|θ) → ∞ as σ_1 or σ_2 → 0.

Bayesian Inference

▶ Predictive Distribution: p(y∗|y) = ∫ p(y∗|θ) p(θ|y) dθ.
▶ Parameter Posterior: p(θ|y) ∝ p(y|θ) p(θ).
▶ p(θ) need not be zero anywhere in order to make reasonable inferences. Can use a sampling scheme, with conjugate posterior updates for each separate mixture component, using an inverse-Gamma prior on the variances σ_1², σ_2².
16 / 17
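A numerical sketch of the MAP fix above (the data and the inverse-Gamma hyperparameters a, b are made up): the log inverse-Gamma prior on σ_1² goes to −∞ faster than the likelihood goes to +∞ as σ_1 → 0, so the penalized objective no longer diverges at the degenerate spike.

```python
import numpy as np

y = np.array([-1.0, 0.0, 1.0, 2.0])             # toy observations
a, b = 2.0, 1.0                                 # inverse-Gamma(a, b) hyperparameters

def log_lik(s1):
    """Mixture log likelihood with mu_1 = -1 pinned to a data point, sigma_2 = 1."""
    comp1 = 0.5 * np.exp(-(y + 1.0) ** 2 / (2 * s1 ** 2)) / np.sqrt(2 * np.pi * s1 ** 2)
    comp2 = 0.5 * np.exp(-(y - 1.0) ** 2 / 2) / np.sqrt(2 * np.pi)
    return np.sum(np.log(comp1 + comp2))

def log_prior(s1):
    """Log inverse-Gamma density on sigma_1^2, up to an additive constant."""
    v = s1 ** 2
    return -(a + 1) * np.log(v) - b / v         # the -b/v term dominates as v -> 0

# The penalized objective log p(y|theta) + log p(theta) now decreases
# as sigma_1 -> 0, instead of diverging to +infinity:
vals = [log_lik(s1) + log_prior(s1) for s1 in (1.0, 0.1, 0.01)]
print([round(v, 1) for v in vals])
```

This shows why MAP can rescue the optimization, while the fully Bayesian predictive distribution above avoids the problem without requiring p(θ) to vanish anywhere.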