
Bayesian Machine Learning

Andrew Gordon Wilson

ORIE 6741, Lecture 2: Bayesian Basics

https://people.orie.cornell.edu/andrew/orie6741

Cornell University, August 25, 2016

1 / 17

Canonical Machine Learning Problems

▶ Linear Classification

▶ Polynomial Regression

▶ Clustering with Gaussian Mixtures

2 / 17

Linear Classification

▶ Data: D = {(x_i, y_i)}_{i=1}^n,

x_i ∈ R^D,

y_i ∈ {+1, −1}.

▶ Model:

p(y_i = +1 | θ, x_i) = 1 if θ^T x_i ≥ 0, and 0 otherwise.   (1)

▶ Parameters: θ ∈ R^D

▶ Goal: To infer θ from the data and to predict future labels p(y | D, x).

3 / 17
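To make the hard-threshold model in Eq. (1) concrete, here is a minimal Python sketch. The synthetic data, the parameter values, and the helper name `predict` are illustrative assumptions, not part of the slides.

```python
# A minimal sketch of the hard-threshold linear classifier in Eq. (1).
# The data, the helper name `predict`, and all parameter values are
# illustrative assumptions.
import numpy as np

def predict(theta, X):
    """Return +1 where theta^T x_i >= 0, and -1 otherwise."""
    return np.where(X @ theta >= 0.0, 1, -1)

rng = np.random.default_rng(0)
D, n = 2, 100
theta_true = np.array([1.0, -2.0])      # "true" parameters (assumed)
X = rng.normal(size=(n, D))             # inputs x_i in R^D
y = predict(theta_true, X)              # labels y_i in {+1, -1}

theta_guess = np.array([0.5, -1.0])     # a candidate theta to evaluate
accuracy = np.mean(predict(theta_guess, X) == y)
print(f"training accuracy of theta_guess: {accuracy:.2f}")
```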

Polynomial Regression

▶ Data: D = {(x_i, y_i)}_{i=1}^n,

x_i ∈ R, y_i ∈ R.

▶ Model: y_i = a_0 + a_1 x_i + a_2 x_i^2 + · · · + a_m x_i^m + ε_i, where ε_i ∼ N(0, σ^2).

▶ Parameters: θ = (a_0, . . . , a_m, σ^2)^T.

▶ Goal: To infer θ from the data and to predict future outputs p(y | D, x, m).

4 / 17
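A minimal sketch of this polynomial model, fit by least squares (which, as a later slide shows, is the maximum-likelihood estimate under the Gaussian noise assumption). The data generation and variable names are illustrative assumptions.

```python
# Sketch of polynomial regression: y_i = a_0 + a_1 x_i + ... + a_m x_i^m + eps_i,
# eps_i ~ N(0, sigma^2).  Data and hyperparameters are assumed for illustration.
import numpy as np

rng = np.random.default_rng(1)
n, m, sigma = 50, 3, 0.3
a_true = np.array([1.0, -2.0, 0.5, 0.1])          # a_0, ..., a_m (assumed)

x = rng.uniform(-2, 2, size=n)
Phi = np.vander(x, m + 1, increasing=True)        # columns: 1, x, x^2, ..., x^m
y = Phi @ a_true + sigma * rng.normal(size=n)

a_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # ML estimate of (a_0, ..., a_m)
sigma2_hat = np.mean((Phi @ a_hat - y) ** 2)      # ML estimate of sigma^2

x_star = 1.5
phi_star = x_star ** np.arange(m + 1)
print("predicted mean at x* = 1.5:", phi_star @ a_hat)
```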

Clustering with Gaussian Mixtures (Density Estimation)

▶ Data: D = {x_i}_{i=1}^n,

x_i ∈ R^D.

▶ Model:

x_i ∼ Σ_{j=1}^m w_j p_j(x_i),   (2)

where p_j(x_i) = N(x_i; µ_j, Σ_j).   (3)

▶ Parameters: θ = ((µ_1, Σ_1), . . . , (µ_m, Σ_m), w)

▶ Goal: To infer θ from the data, predict the density p(x | D, m), and infer which points belong to which cluster.

5 / 17
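A minimal sketch of the mixture density in Eqs. (2)-(3), including the cluster responsibilities one gets from Bayes' rule for the "which points belong to which cluster" question. The two-component parameters below are illustrative assumptions; no fitting is shown.

```python
# Evaluate p(x) = sum_j w_j N(x; mu_j, Sigma_j) and the posterior cluster
# responsibilities for a single point.  All parameter values are assumed.
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density N(x; mu, Sigma)."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def mixture_density(x, weights, mus, Sigmas):
    return sum(w * gaussian_pdf(x, mu, S) for w, mu, S in zip(weights, mus, Sigmas))

weights = [0.3, 0.7]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]

x = np.array([1.0, 1.0])
print("p(x) =", mixture_density(x, weights, mus, Sigmas))

# Responsibilities p(cluster j | x) proportional to w_j N(x; mu_j, Sigma_j):
resp = np.array([w * gaussian_pdf(x, mu, S) for w, mu, S in zip(weights, mus, Sigmas)])
print("responsibilities:", resp / resp.sum())
```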

Bayesian Modelling (Theory of Everything)

[Figure from Ghahramani (2015).]

6 / 17

Statistics from Scratch

Basic Regression Problem

▶ Training set of N targets (observations) y = (y(x_1), . . . , y(x_N))^T.

▶ Observations evaluated at inputs X = (x_1, . . . , x_N)^T.

▶ Want to predict the value of y(x_∗) at a test input x_∗.

For example: Given CO2 concentrations y measured at times X, what will the CO2 concentration be for x_∗ = 2024, 10 years from now?

Just knowing high school math, what might you try?

7 / 17

Statistics from Scratch

[Figure: Airline Passengers (Thousands) versus Year, 1949–1961.]

8 / 17

Statistics from Scratch

Guess the parametric form of a function that could fit the data

▶ f(x, w) = w^T x   [Linear function of w and x]

▶ f(x, w) = w^T φ(x)   [Linear function of w]   (Linear Basis Function Model)

▶ f(x, w) = g(w^T φ(x))   [Non-linear in x and w]   (E.g., Neural Network)

φ(x) is a vector of basis functions. For example, if φ(x) = (1, x, x^2) and x ∈ R^1, then f(x, w) = w_0 + w_1 x + w_2 x^2 is a quadratic function.

Choose an error measure E(w), and minimize it with respect to w:

▶ E(w) = Σ_{i=1}^N [f(x_i, w) − y(x_i)]^2

9 / 17
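A minimal sketch of fitting a linear basis function model by minimizing the squared error E(w) above. The basis φ(x) = (1, x, x^2) and the synthetic data are illustrative assumptions.

```python
# Minimize E(w) = sum_i [f(x_i, w) - y(x_i)]^2 for f(x, w) = w^T phi(x).
# Basis and data are assumed for illustration.
import numpy as np

def phi(x):
    """Basis functions phi(x) = (1, x, x^2)."""
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 30)
y = 1.0 + 0.5 * x - 2.0 * x ** 2 + 0.1 * rng.normal(size=x.shape)  # assumed data

Phi = phi(x)                                      # N x 3 design matrix
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # minimizes E(w) in closed form
print("fitted weights:", w_hat)
print("training error E(w):", np.sum((Phi @ w_hat - y) ** 2))
```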

Statistics from Scratch

A probabilistic approach

We could explicitly account for noise in our model.

▶ y(x) = f(x, w) + ε(x), where ε(x) is a noise function.

One commonly takes ε(x) ∼ N(0, σ^2) for i.i.d. additive Gaussian noise, in which case

p(y(x) | x, w, σ^2) = N(y(x); f(x, w), σ^2)   Observation Model   (4)

p(y | X, w, σ^2) = ∏_{i=1}^N N(y(x_i); f(x_i, w), σ^2)   Likelihood   (5)

▶ Maximize the likelihood of the data p(y | X, w, σ^2) with respect to σ^2, w.

For a Gaussian noise model, this approach will make the same predictions as using a squared loss error function:

log p(y | X, w, σ^2) ∝ − (1/(2σ^2)) Σ_{i=1}^N [f(x_i, w) − y(x_i)]^2   (6)

10 / 17
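A small numerical check of Eq. (6): under i.i.d. Gaussian noise, differences in the log likelihood between two weight settings equal −1/(2σ^2) times the differences in squared error, so maximizing the likelihood over w reproduces least squares. The model f(x, w) = w_0 + w_1 x and the data below are illustrative assumptions.

```python
# Verify that log p(y | X, w, sigma^2) and -E(w)/(2 sigma^2) differ only by a
# constant in w.  Data and the linear model are assumed for illustration.
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 0.25
x = np.linspace(0, 1, 20)
y = 2.0 + 3.0 * x + np.sqrt(sigma2) * rng.normal(size=x.shape)

def f(x, w):
    return w[0] + w[1] * x

def log_likelihood(w):
    resid = f(x, w) - y
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - resid ** 2 / (2 * sigma2))

def squared_error(w):
    return np.sum((f(x, w) - y) ** 2)

w_a, w_b = np.array([1.0, 1.0]), np.array([2.0, 3.0])
# The two printed numbers agree exactly:
print(log_likelihood(w_b) - log_likelihood(w_a))
print(-(squared_error(w_b) - squared_error(w_a)) / (2 * sigma2))
```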

Statistics from Scratch

▶ The probabilistic approach helps us interpret the error measure in a deterministic approach, and gives us a sense of the noise level σ^2.

▶ Probabilistic methods thus provide an intuitive framework for representing uncertainty and for model development.

▶ Both approaches are prone to over-fitting for flexible f(x, w): low error on the training data, high error on the test set.

Regularization

▶ Use a penalized log likelihood (or error function), such as

log p(y | X, w) ∝ −(1/(2σ^2)) Σ_{i=1}^n (f(x_i, w) − y(x_i))^2   [model fit]   − λ w^T w   [complexity penalty].   (7)

▶ But how should we define complexity, and how much should we penalize complexity?

▶ Can set λ using cross-validation.

11 / 17
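A minimal sketch of the penalized objective in Eq. (7) for a linear basis function model: the λ w^T w penalty gives ridge regression, and λ can be chosen by cross-validation as the last bullet suggests. The data, basis, fold count, and candidate λ values are illustrative assumptions.

```python
# Ridge regression (penalized least squares) with lambda chosen by k-fold
# cross-validation.  All data and hyperparameter choices are assumed.
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + 0.2 * rng.normal(size=x.shape)       # assumed data
Phi = np.vander(x, 10, increasing=True)                  # degree-9 polynomial basis

def ridge_fit(Phi, y, lam):
    """Minimize sum_i (w^T phi_i - y_i)^2 + lam * w^T w (closed form)."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

indices = rng.permutation(len(y))
folds = np.array_split(indices, 5)                       # fixed 5-fold split

def cv_error(lam):
    """Held-out squared error, averaged over the folds."""
    err = 0.0
    for fold in folds:
        train = np.setdiff1d(indices, fold)
        w = ridge_fit(Phi[train], y[train], lam)
        err += np.sum((Phi[fold] @ w - y[fold]) ** 2)
    return err / len(y)

lambdas = [1e-6, 1e-3, 1e-1, 1.0, 10.0]
best = min(lambdas, key=cv_error)
print("lambda chosen by cross-validation:", best)
```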

Bayesian Inference

Bayes’ Rule

p(a|b) = p(b|a)p(a)/p(b) , p(a|b) ∝ p(b|a)p(a) . (8)

posterior = (likelihood × prior) / marginal likelihood,

p(w | y, X, σ^2) = p(y | X, w, σ^2) p(w) / p(y | X, σ^2).   (9)

Predictive Distribution

p(y | x_∗, y, X) = ∫ p(y | x_∗, w) p(w | y, X) dw.   (10)

▶ Think of each setting of w as a different model. Eq. (10) is a Bayesian model average, an average of infinitely many models weighted by their posterior probabilities.

▶ No over-fitting, automatically calibrated complexity.

▶ Eq. (10) is intractable for many likelihoods p(y | X, w, σ^2) and priors p(w).

▶ Typically we are more interested in the induced distribution over functions than in the parameters w. It can be hard to have intuitions for priors p(w).

12 / 17
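To illustrate the Bayesian model average in Eq. (10), here is a sketch for a one-parameter model y = w x + ε, where the integral over w is replaced by a sum over a fine grid (a stand-in for the generally intractable integral). The grid, the N(0, 1) prior, and the data are illustrative assumptions.

```python
# Approximate p(y* | x*, y, X) = ∫ p(y* | x*, w) p(w | y, X) dw by a grid sum
# for the 1-D model y = w*x + eps, eps ~ N(0, sigma^2).  All settings assumed.
import numpy as np

rng = np.random.default_rng(5)
sigma2 = 0.1
x = np.linspace(0, 1, 15)
y = 0.8 * x + np.sqrt(sigma2) * rng.normal(size=x.shape)

w_grid = np.linspace(-3, 3, 601)                   # discretize w
log_prior = -0.5 * w_grid ** 2                     # N(0, 1) prior, up to a constant
log_lik = np.array([np.sum(-(w * x - y) ** 2) / (2 * sigma2) for w in w_grid])
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()                                 # posterior weights p(w | y, X)

x_star = 2.0
# Every grid value of w is a different model; average their predictions,
# weighted by posterior probability.
pred_mean = np.sum(post * w_grid * x_star)
pred_var = sigma2 + np.sum(post * (w_grid * x_star - pred_mean) ** 2)
print("predictive mean and variance at x*:", pred_mean, pred_var)
```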

Parametric Regression Review

Deterministic

E(w) = Σ_{i=1}^N (f(x_i, w) − y_i)^2.   (11)

Maximum Likelihood

p(y(x) | x, w) = N(y(x); f(x, w), σ_n^2),   (12)

p(y | X, w) = ∏_{i=1}^N N(y(x_i); f(x_i, w), σ_n^2).   (13)

Bayesian

posterior = (likelihood × prior) / marginal likelihood,

p(w | y, X) = p(y | X, w) p(w) / p(y | X).   (14)

p(y | x_∗, y, X) = ∫ p(y | x_∗, w) p(w | y, X) dw.   (15)

13 / 17

Worked Example: Basis Regression (Chalkboard)

▶ We have data D = {(x_i, y_i)}_{i=1}^N.

▶ We use the model:

y = w^T φ(x) + ε,   (16)

ε ∼ N(0, σ^2).   (17)

▶ We want to make predictions of y_∗ for any x_∗.

▶ We will now explore this question on the whiteboard using maximum likelihood and Bayesian approaches.

▶ We will consider topics such as regularization, cross-validation, Bayesian model averaging, and conjugate priors.

14 / 17
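The chalkboard derivation itself is not in the transcript, but here is one standard version of the Bayesian side of this worked example under conjugate assumptions: a Gaussian prior w ~ N(0, α^{-1} I) with known noise variance σ^2 makes both the posterior over w and the predictive for y_∗ Gaussian in closed form. The prior precision α, the basis, and the data are illustrative assumptions.

```python
# Conjugate Bayesian treatment of y = w^T phi(x) + eps (Eqs. 16-17), assuming
# w ~ N(0, alpha^{-1} I) and known sigma^2.  All numerical settings assumed.
import numpy as np

rng = np.random.default_rng(6)
alpha, sigma2 = 1.0, 0.05

def phi(x):
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)   # assumed basis

x = np.linspace(-1, 1, 25)
y = 0.5 - x + 0.8 * x ** 2 + np.sqrt(sigma2) * rng.normal(size=x.shape)
Phi = phi(x)

# Posterior p(w | y, X) = N(w; m, S)
S_inv = alpha * np.eye(Phi.shape[1]) + Phi.T @ Phi / sigma2
S = np.linalg.inv(S_inv)
m = S @ Phi.T @ y / sigma2

# Predictive p(y* | x*, y, X) = N(y*; m^T phi(x*), sigma2 + phi(x*)^T S phi(x*))
x_star = np.array([0.5])
phi_star = phi(x_star)[0]
pred_mean = m @ phi_star
pred_var = sigma2 + phi_star @ S @ phi_star
print("predictive mean, variance at x* = 0.5:", pred_mean, pred_var)
```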

Rant: Regularisation = MAP ≠ Bayesian Inference

Example: Density Estimation

▶ Observations y_1, . . . , y_N drawn from an unknown density p(y).

▶ Model: p(y | θ) = w_1 N(y | µ_1, σ_1^2) + w_2 N(y | µ_2, σ_2^2), with θ = {w_1, w_2, µ_1, µ_2, σ_1, σ_2}.

▶ Likelihood: p(y | θ) = ∏_{i=1}^N p(y_i | θ).

Can learn all free parameters θ using maximum likelihood...

15 / 17
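The "..." hints at the problem the next slide addresses: plain maximum likelihood is degenerate for this mixture, because placing one component exactly on a data point and shrinking its variance drives the likelihood to infinity. A minimal numerical sketch of that pathology follows; the data and parameter settings are illustrative assumptions.

```python
# Show the likelihood blowing up as mu_1 sits on an observation and sigma_1 -> 0.
# Data and parameter values are assumed for illustration.
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(0.0, 1.0, size=50)

def log_gauss(y, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)

def log_likelihood(y, w1, mu1, s1, mu2, s2):
    a = np.log(w1) + log_gauss(y, mu1, s1)
    b = np.log(1 - w1) + log_gauss(y, mu2, s2)
    return np.sum(np.logaddexp(a, b))

mu1 = y[0]                      # put the first component exactly on a data point
for sigma1 in [1.0, 0.1, 0.01, 0.001]:
    print(sigma1, log_likelihood(y, 0.5, mu1, sigma1, 0.0, 1.0))
# The log likelihood keeps increasing as sigma1 shrinks: a degenerate "solution".
```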

Regularisation = MAP ≠ Bayesian Inference

Regularisation or MAP

▶ Find

argmax_θ log p(θ | y) = log p(y | θ)   [model fit]   + log p(θ)   [complexity penalty]   + const.

▶ Choose p(θ) such that p(θ) → 0 faster than p(y | θ) → ∞ as σ_1 or σ_2 → 0.

Bayesian Inference

▶ Predictive Distribution: p(y_∗ | y) = ∫ p(y_∗ | θ) p(θ | y) dθ.

▶ Parameter Posterior: p(θ | y) ∝ p(y | θ) p(θ).

▶ p(θ) need not be zero anywhere in order to make reasonable inferences. Can use a sampling scheme, with conjugate posterior updates for each separate mixture component, using an inverse Gamma prior on the variances σ_1^2, σ_2^2.

16 / 17
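The following sketch only illustrates the MAP side of this slide, not the full sampling scheme: adding log p(θ) with an inverse-Gamma prior on the variance (hyperparameters a, b assumed) makes p(θ) → 0 faster than the likelihood blows up as σ_1 → 0, so the degenerate solution from the previous slide no longer wins. Data and settings are illustrative assumptions.

```python
# MAP objective for the pinned-component setting of the previous sketch,
# with an inverse-Gamma penalty on the variance.  All settings assumed.
import numpy as np
from math import lgamma

rng = np.random.default_rng(8)
y = rng.normal(0.0, 1.0, size=50)

def log_gauss(y, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)

def log_lik(sigma1):
    a = np.log(0.5) + log_gauss(y, y[0], sigma1)   # component pinned on y[0]
    b = np.log(0.5) + log_gauss(y, 0.0, 1.0)
    return np.sum(np.logaddexp(a, b))

def log_inv_gamma(v, a=2.0, b=1.0):
    """Log of the inverse-Gamma density evaluated at variance v."""
    return a * np.log(b) - lgamma(a) - (a + 1) * np.log(v) - b / v

for sigma1 in [1.0, 0.1, 0.01, 0.001]:
    print(sigma1, log_lik(sigma1) + log_inv_gamma(sigma1 ** 2))
# Unlike the raw likelihood, the MAP objective now decreases as sigma_1 -> 0.
```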

Learning with Stochastic Gradient Descent

Chalkboard.

17 / 17
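Since this slide defers to the chalkboard, here is one minimal sketch of stochastic gradient descent on the squared error of the linear basis function model f(x, w) = w^T φ(x). The learning rate, basis, and data are illustrative assumptions, not the chalkboard derivation itself.

```python
# Stochastic gradient descent on E(w) = sum_i (w^T phi(x_i) - y_i)^2,
# processing one data point at a time.  All settings assumed for illustration.
import numpy as np

rng = np.random.default_rng(9)

def phi(x):
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

x = rng.uniform(-1, 1, size=200)
y = 0.3 + 1.5 * x - 0.7 * x ** 2 + 0.1 * rng.normal(size=x.shape)
Phi = phi(x)

w = np.zeros(3)
lr = 0.1
for epoch in range(50):
    for i in rng.permutation(len(y)):               # one point at a time
        grad = 2 * (Phi[i] @ w - y[i]) * Phi[i]     # gradient of (f(x_i, w) - y_i)^2
        w -= lr * grad
print("weights after SGD:", w)
```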