CS B351 STATISTICAL LEARNING


AGENDA
Learning coin flips, learning Bayes net parameters
Likelihood functions, maximum likelihood estimation (MLE)
Priors, maximum a posteriori estimation (MAP)
Bayesian estimation


LEARNING COIN FLIPS
Observe that c out of N draws are cherries (data)
Intuition: c/N might be a good hypothesis for the fraction of cherries in the bag
"Intuitive" parameter estimate: the empirical distribution, P(cherry) = c / N
(Why is this reasonable? Perhaps we got a bad draw!)


LEARNING COIN FLIPS
Observe that c out of N draws are cherries (data)
Let the unknown fraction of cherries be q (hypothesis)
Probability of drawing a cherry is q
Assumption: draws are independent and identically distributed (i.i.d.)


LEARNING COIN FLIPS
Probability of drawing a cherry is q
Assumption: draws are independent and identically distributed (i.i.d.)
Probability of drawing 2 cherries is q·q = q^2
Probability of drawing 2 limes is (1-q)^2
Probability of drawing 1 cherry followed by 1 lime: q·(1-q)


LIKELIHOOD FUNCTION
Likelihood: the probability of the data d = {d1, …, dN} given the hypothesis q
P(d|q) = ∏_j P(d_j|q)   (i.i.d. assumption)
P(d_j|q) = q if d_j = Cherry, 1-q if d_j = Lime   (the probability model, assuming q is given)
Therefore P(d|q) = q^c (1-q)^(N-c)   (gather the c cherry terms together, then the N-c lime terms)
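As a minimal sketch (not from the slides) of this computation, the likelihood of a sequence of draws can be evaluated for a few candidate values of q; the 'C'/'L' encoding of draws and the candidate values below are illustrative choices.

```python
# Evaluating the likelihood q^c (1-q)^(N-c) for a few candidate values of q.
# 'C' marks a cherry draw, 'L' a lime draw (an illustrative encoding).

def likelihood(draws, q):
    """P(d|q) under the i.i.d. Bernoulli model: product over the draws."""
    p = 1.0
    for d in draws:
        p *= q if d == 'C' else (1.0 - q)
    return p

draws = ['C', 'C', 'L']              # 2 cherries out of N = 3 draws
for q in (0.25, 0.5, 2/3, 0.9):
    print(f"q={q:.2f}  P(d|q)={likelihood(draws, q):.4f}")
# Of these candidates, q = 2/3 (= c/N) gives the largest likelihood.
```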


MAXIMUM LIKELIHOOD
Likelihood of data d = {d1, …, dN} given q:
P(d|q) = q^c (1-q)^(N-c)
[Plots: P(data|q) as a function of q for 1/1, 2/2, 2/3, 2/4, 2/5, 10/20, and 50/100 cherry draws]


MAXIMUM LIKELIHOOD
Peaks of the likelihood function seem to hover around the fraction of cherries…
Sharpness indicates some notion of certainty…
[Plot: P(data|q) vs. q for 50/100 cherry]


MAXIMUM LIKELIHOOD
P(d|q) is the likelihood function
The quantity argmax_q P(d|q) is known as the maximum likelihood estimate (MLE)
[Plots: P(data|q) vs. q — for 1/1 cherry, q=1 is the MLE; for 2/2 cherry, q=1; for 2/3 cherry, q=2/3; for 2/4 cherry, q=1/2; for 2/5 cherry, q=2/5]


PROOF: EMPIRICAL FREQUENCY IS THE MLE
l(q) = log P(d|q) = log [ q^c (1-q)^(N-c) ]
     = log [ q^c ] + log [ (1-q)^(N-c) ]
     = c log q + (N-c) log (1-q)
Setting dl/dq(q) = 0 gives the maximum likelihood estimate:
dl/dq(q) = c/q – (N-c)/(1-q)
At the MLE, c/q – (N-c)/(1-q) = 0  =>  q = c/N
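A quick numeric sanity check of this derivation (an illustrative sketch, with c and N chosen arbitrarily): on a fine grid of q values, the log-likelihood peaks at q = c/N.

```python
# Check that c*log(q) + (N-c)*log(1-q) is maximized at q = c/N.

import math

c, N = 2, 5

def log_likelihood(q):
    return c * math.log(q) + (N - c) * math.log(1 - q)

grid = [i / 1000 for i in range(1, 1000)]   # avoid q = 0 and q = 1
q_best = max(grid, key=log_likelihood)
print(q_best, c / N)                        # both are 0.4
```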


MAXIMUM LIKELIHOOD FOR BN
For any BN, the ML parameters of any CPT can be derived as the fraction of observed values in the data, conditioned on matching parent values

[BN: Earthquake → Alarm ← Burglar]
N = 1000 examples, with E: 500 and B: 200
P(E) = 0.5, P(B) = 0.2
Observed alarm counts: A|E,B: 19/20;  A|B: 188/200;  A|E: 170/500;  A|neither: 1/380

E  B  P(A|E,B)
T  T  0.95
F  T  0.95
T  F  0.34
F  F  0.003


FITTING CPTS VIA MLE
M examples D = (d[1], …, d[M])
Each d[i] is a complete example of all variables in the Bayes net
Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN

Suppose the BN has a single variable X. Estimate X's CPT, P(X) (just learning a coin flip as usual):
P_MLE(X) = empirical distribution of D
P_MLE(X=T) = Count(X=T) / M
P_MLE(X=F) = Count(X=F) / M

Suppose the BN is X → Y. Estimate P(X) and P(Y|X):
Estimate P_MLE(X) as usual
Estimate P_MLE(Y|X) with:

P(Y|X)    X=T                              X=F
Y=T       Count(Y=T,X=T) / Count(X=T)      Count(Y=T,X=F) / Count(X=F)
Y=F       Count(Y=F,X=T) / Count(X=T)      Count(Y=F,X=F) / Count(X=F)

In general, for P(Y|X1,…,Xk) (e.g. a node Y with parents X1, X2, X3), for each setting of (y, x1,…,xk):
Compute Count(y, x1,…,xk)
Compute Count(x1,…,xk)
Set P_MLE(y | x1,…,xk) = Count(y, x1,…,xk) / Count(x1,…,xk)
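A small sketch of this counting procedure, assuming each example is represented as a dict mapping variable names to values (a representation chosen here for illustration; the slides only specify complete examples d[1..M]):

```python
# MLE CPT fitting by counting occurrences of (child, parents) and parents.

from collections import Counter

def fit_cpt(data, child, parents):
    """Return P_MLE(child | parents) as {(parent_values, child_value): prob}."""
    joint = Counter()    # Count(y, x1,...,xk)
    parent = Counter()   # Count(x1,...,xk)
    for d in data:
        xs = tuple(d[p] for p in parents)
        joint[(xs, d[child])] += 1
        parent[xs] += 1
    return {key: n / parent[key[0]] for key, n in joint.items()}

data = [{'X': True, 'Y': True}, {'X': True, 'Y': False},
        {'X': True, 'Y': True}, {'X': False, 'Y': False}]
print(fit_cpt(data, child='Y', parents=['X']))
# X=True: P(Y=T) = 0.667, P(Y=F) = 0.333;  X=False: P(Y=F) = 1.0
```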


OTHER MLE RESULTS
Categorical distributions (non-binary discrete variables): the empirical distribution is the MLE — make a histogram, divide by N
Continuous Gaussian distributions: MLE mean = average of the data; MLE standard deviation = standard deviation of the data
[Plots: histogram of the data and the fitted Gaussian (normal) distribution]
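A minimal sketch of the Gaussian case, with illustrative numbers: the MLE mean is the sample average, and the MLE standard deviation is the population standard deviation (dividing by N rather than N-1).

```python
# Gaussian MLE: mean and (population) standard deviation of the data.

import math

data = [98.2, 101.5, 99.7, 100.4, 97.9, 102.3]   # illustrative measurements
N = len(data)
mu = sum(data) / N
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / N)
print(f"mu_MLE = {mu:.2f}, sigma_MLE = {sigma:.2f}")
```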


NICE PROPERTIES OF MLE
Easy to compute (for certain probability models)
With enough data, the q_MLE estimate will approach the true unknown value of q


PROBLEMS WITH MLE
The MLE was easy to compute… but what happens when we don't have much data?
Motivation: you hand me a coin from your pocket; 1 flip, and it turns up tails. What's the MLE?
q_MLE has a high variance with small sample sizes


VARIANCE OF AN ESTIMATOR: INTUITION
The dataset D is just a sample of the underlying distribution; if we could "do over" the sample, we might get a new dataset D'.
With D', our MLE estimate q_MLE' might be different. How much? How often?
Assume all values of q are equally likely. In the case of 1 draw, D would just as likely have been a lime; in that case, q_MLE = 0.
So with probability 0.5, q_MLE would be 1, and with the same probability, q_MLE would be 0.
High variance: typical "do overs" give drastically different results!
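The "do over" intuition can be simulated directly; the following sketch (true q, sample sizes, and number of trials are arbitrary choices for illustration) shows how much q_MLE = c/N jumps around for small N.

```python
# Repeatedly draw small datasets from a bag with true q = 0.5 and look at
# the spread of the resulting q_MLE = c/N estimates.

import random

random.seed(0)
true_q, trials = 0.5, 10000

for N in (1, 5, 100):
    estimates = []
    for _ in range(trials):
        c = sum(random.random() < true_q for _ in range(N))
        estimates.append(c / N)
    mean = sum(estimates) / trials
    var = sum((e - mean) ** 2 for e in estimates) / trials
    print(f"N={N:3d}  mean of q_MLE = {mean:.3f}  variance = {var:.4f}")
# With N = 1 the variance is about 0.25 (every estimate is 0 or 1); it shrinks as N grows.
```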


FITTING BAYESIAN NETWORK CPTS WITH MLE
Potential problem: for large k, very few datapoints will share the parent values x1,…,xk!
The expected number of matching datapoints is O(M·P(x1,…,xk)), and some value combinations may be even rarer
This is known as data fragmentation


IS THERE A BETTER WAY? BAYESIAN LEARNING


AN ALTERNATIVE APPROACH: BAYESIAN ESTIMATION
P(D|q) is the likelihood
P(q) is the hypothesis prior
P(q|D) = 1/Z · P(D|q) P(q) is the posterior: the distribution of hypotheses given the data
[Figure: BN with hypothesis q as the parent of observations d[1], d[2], …, d[M]]


BAYESIAN PREDICTION
For a new draw Y: use the hypothesis posterior to predict P(Y|D)
[Figure: BN with q as the parent of d[1], d[2], …, d[M] and of Y]


CANDY EXAMPLE
• Candy comes in 2 flavors, cherry and lime, with identical wrappers
• The manufacturer makes 5 indistinguishable bags:
  h1: C 100%, L 0%
  h2: C 75%, L 25%
  h3: C 50%, L 50%
  h4: C 25%, L 75%
  h5: C 0%, L 100%
• Suppose we draw …
• What bag are we holding? What flavor will we draw next?


BAYESIAN LEARNING
Main idea: compute the probability of each hypothesis, given the data
Data D: …
Hypotheses: h1, …, h5 (the bags above)
We want P(hi|D)… but all we have is P(D|hi)!


USING BAYES' RULE
P(hi|D) = α P(D|hi) P(hi) is the posterior
(Recall, 1/α = P(D) = Σ_i P(D|hi) P(hi))
P(D|hi) is the likelihood
P(hi) is the hypothesis prior


COMPUTING THE POSTERIOR
Assume draws are independent
Let (P(h1),…,P(h5)) = (0.1, 0.2, 0.4, 0.2, 0.1)
D = { 10 limes }

P(D|h1) = 0
P(D|h2) = 0.25^10
P(D|h3) = 0.5^10
P(D|h4) = 0.75^10
P(D|h5) = 1^10

P(D|h1)P(h1) = 0
P(D|h2)P(h2) ≈ 2e-7
P(D|h3)P(h3) ≈ 4e-4
P(D|h4)P(h4) ≈ 0.011
P(D|h5)P(h5) = 0.1
Sum = 1/α ≈ 0.111

P(h1|D) = 0
P(h2|D) ≈ 0.00
P(h3|D) ≈ 0.00
P(h4|D) ≈ 0.10
P(h5|D) ≈ 0.90


POSTERIOR HYPOTHESES


PREDICTING THE NEXT DRAW
P(Y|D) = Σ_i P(Y|hi, D) P(hi|D) = Σ_i P(Y|hi) P(hi|D)
P(h1|D) = 0, P(h2|D) ≈ 0.00, P(h3|D) ≈ 0.00, P(h4|D) ≈ 0.10, P(h5|D) ≈ 0.90
P(Y|h1) = 0, P(Y|h2) = 0.25, P(Y|h3) = 0.5, P(Y|h4) = 0.75, P(Y|h5) = 1
Probability that the next candy drawn is a lime: P(Y|D) ≈ 0.975
[Figure: BN with hypothesis H as the parent of D and Y]
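A short sketch reproducing this candy computation (list encodings of the priors and per-hypothesis lime probabilities are illustrative):

```python
# Posterior over h1..h5 after drawing 10 limes, then the predictive
# probability that the next candy is a lime.

priors = [0.1, 0.2, 0.4, 0.2, 0.1]         # P(h1)..P(h5)
lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(lime | hi)

unnormalized = [p * (l ** 10) for p, l in zip(priors, lime_prob)]  # P(D|hi) P(hi)
Z = sum(unnormalized)
posterior = [u / Z for u in unnormalized]                          # P(hi|D)

p_next_lime = sum(l * post for l, post in zip(lime_prob, posterior))
print([round(p, 3) for p in posterior])   # [0.0, 0.0, 0.003, 0.101, 0.896]
print(round(p_next_lime, 3))              # ~0.973 (the slide's 0.975 uses the rounded posteriors)
```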


P(NEXT CANDY IS LIME | D)


BACK TO COIN FLIPS: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
Assume P(q) is uniform
P(q|D) = 1/Z · P(D|q) = 1/Z · q^c (1-q)^(N-c)
What's P(Y|D)?
[Figure: BN with q as the parent of d[1], d[2], …, d[M] and Y]


ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
P(Y|D) = ∫ q · P(q|D) dq = 1/Z ∫ q^(c+1) (1-q)^(N-c) dq
=> Z = ∫ q^c (1-q)^(N-c) dq = c! (N-c)! / (N+1)!
=> P(Y|D) = 1/Z · (c+1)! (N-c)! / (N+2)! = (c+1) / (N+2)
Can think of this as a "correction" using "virtual counts"
[Figure: BN with q as the parent of d[1], d[2], …, d[M] and Y]


NONUNIFORM PRIORS
P(q|D) ∝ P(D|q) P(q) = q^c (1-q)^(N-c) P(q)
Define, for all q, the probability that I believe in q
[Plot: a prior P(q) over q ∈ [0, 1]]


BETA DISTRIBUTION
Beta_{a,b}(q) = γ · q^(a-1) (1-q)^(b-1)
a, b are hyperparameters > 0
γ is a normalization constant
a = b = 1 is the uniform distribution


POSTERIOR WITH BETA PRIOR
Posterior ∝ q^c (1-q)^(N-c) P(q)
          = γ q^(c+a-1) (1-q)^(N-c+b-1)
          = Beta_{a+c, b+N-c}(q)
Prediction = mean: E[q] = (c+a) / (N+a+b)
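To make the posterior-mean prediction (c+a)/(N+a+b) concrete, here is a tiny sketch with illustrative hyperparameters and counts (not taken from the slides):

```python
# Beta-prior update: start with hyperparameters (a, b), observe c cherries
# out of N draws; the predictive probability of cherry is the posterior mean.

def beta_predict(a, b, c, N):
    """Posterior mean of q under a Beta(a, b) prior after c of N cherry draws."""
    return (c + a) / (N + a + b)

print(beta_predict(a=1, b=1, c=0, N=1))   # uniform prior, one lime: 1/3 rather than the MLE's 0
print(beta_predict(a=5, b=5, c=2, N=3))   # a stronger prior pulls 2/3 toward 1/2 (7/13 ~= 0.54)
```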


POSTERIOR WITH BETA PRIOR
What does this mean? The prior specifies a "virtual count" of a-1 heads and b-1 tails
See heads: increment a; see tails: increment b
The effect of the prior diminishes with more data


CHOOSING A PRIOR
Part of the design process; must be chosen according to your intuition
Uninformed belief: a = b = 1; strong belief: a, b high


FITTING CPTS VIA MAP
M examples D = (d[1], …, d[M]), virtual counts a, b
Estimate P_MAP(Y|X) by assuming we've already seen a examples of Y=T and b examples of Y=F:

P(Y|X)    X=T                                          X=F
Y=T       (Count(Y=T,X=T) + a) / (Count(X=T) + a + b)  (Count(Y=T,X=F) + a) / (Count(X=F) + a + b)
Y=F       (Count(Y=F,X=T) + b) / (Count(X=T) + a + b)  (Count(Y=F,X=F) + b) / (Count(X=F) + a + b)
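A MAP version of the earlier CPT-fitting sketch: add the virtual counts to the observed counts before normalizing. The dict representation of examples and the choice a = b = 1 are illustrative assumptions, not fixed by the slides.

```python
# MAP CPT fit for Boolean Y with one Boolean parent X, using virtual counts.

from collections import Counter

def fit_cpt_map(data, child, parent, a=1, b=1):
    """P_MAP(child | parent), with virtual counts a (for T) and b (for F)."""
    joint, marg = Counter(), Counter()
    for d in data:
        joint[(d[parent], d[child])] += 1
        marg[d[parent]] += 1
    cpt = {}
    for x in (True, False):
        denom = marg[x] + a + b
        cpt[(x, True)] = (joint[(x, True)] + a) / denom
        cpt[(x, False)] = (joint[(x, False)] + b) / denom
    return cpt

data = [{'X': True, 'Y': True}, {'X': True, 'Y': True}, {'X': False, 'Y': False}]
print(fit_cpt_map(data, 'Y', 'X'))
# X=True: P(Y=T) = 3/4 instead of the MLE's 1; X=False never seen with Y=T, yet P(Y=T) = 1/3, not 0
```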


PROPERTIES OF MAP
Approaches the MLE as the dataset grows large (the effect of the prior diminishes in the face of evidence)
More stable estimates than MLE with small sample sizes: lower variance, but added bias
Needs a designer's judgment to set the prior


EXTENSIONS OF BETA PRIORS
Parameters of multi-valued (categorical) distributions, e.g. histograms: Dirichlet prior
The mathematical derivation is more complex, but in practice it still takes the form of "virtual counts"

[Plots: histogram-style categorical distributions fit with virtual counts of 0, 1, 5, and 10]
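A categorical analogue of the Beta "virtual counts", sketched with illustrative bins and pseudo-count value (the Dirichlet prior itself is not derived here):

```python
# Dirichlet-style smoothing: add a pseudo-count to every bin of the
# histogram before normalizing.

from collections import Counter

def categorical_map(observations, values, pseudo=1):
    counts = Counter(observations)
    total = len(observations) + pseudo * len(values)
    return {v: (counts[v] + pseudo) / total for v in values}

obs = ['red', 'red', 'green']
print(categorical_map(obs, values=['red', 'green', 'blue']))
# {'red': 0.5, 'green': 0.333, 'blue': 0.167} -- 'blue' is never observed but not assigned 0
```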


RECAP
Parameter learning via coin flips
Maximum likelihood
Bayesian learning with a Beta prior
Learning Bayes net parameters


NEXT TIME
Introduction to machine learning
R&N 18.1–18.3