Transcript of "The Naïve Bayes Classifier" (zhe/teach/pdf/naive-bayes.pdf)

Page 1:

The Naïve Bayes Classifier

Machine Learning, Spring 2019

The slides are mainly from Vivek Srikumar.

Pages 2–3: Today's lecture

• The naïve Bayes Classifier

• Learning the naïve Bayes Classifier

Pages 4–5: Model joint probability p(X, Y)

Let's use Bayes' rule for predicting y given an input x:

P(y | x) = P(x | y) P(y) / P(x)

This is the posterior probability of the label being y for this input x. Predict y for the input x using this posterior.

Pages 6–7: MAP prediction from p(X, Y)

Predict y for the input x as the label with the highest posterior:

argmax_y P(y | x) = argmax_y P(x | y) P(y) / P(x) = argmax_y P(x | y) P(y)

(The denominator P(x) is the same for every label, so it can be dropped from the argmax.)

Don't confuse this with MAP learning, which finds a hypothesis by argmax_h P(h | D).

Page 8: MAP prediction

Predict y for the input x using

argmax_y P(x | y) P(y)

Here P(x | y) is the likelihood of observing this input x when the label is y, and P(y) is the prior probability of the label being y.

All we need are these two sets of probabilities.

Pages 9–11: Example: Tennis again

Likelihoods:

Temperature  Wind    P(T, W | Tennis = Yes)
Hot          Strong  0.15
Hot          Weak    0.4
Cold         Strong  0.1
Cold         Weak    0.35

Temperature  Wind    P(T, W | Tennis = No)
Hot          Strong  0.4
Hot          Weak    0.1
Cold         Strong  0.3
Cold         Weak    0.2

Prior:

Play tennis  P(Play tennis)
Yes          0.3
No           0.7

Without any other information, what is the prior probability that I should play tennis?

On days that I do play tennis, what is the probability that the temperature is T and the wind is W?

On days that I don't play tennis, what is the probability that the temperature is T and the wind is W?

Pages 12–15: Example: Tennis again

(Same likelihood and prior tables as above.)

Input: Temperature = Hot (H), Wind = Weak (W)

Should I play tennis?

argmax_y P(H, W | play?) P(play?)

P(H, W | Yes) P(Yes) = 0.4 × 0.3 = 0.12

P(H, W | No) P(No) = 0.1 × 0.7 = 0.07

MAP prediction = Yes
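This computation is easy to script as a sanity check. Below is a minimal Python sketch (the dictionary layout and the name map_predict are my own, not from the slides) that reproduces the numbers above:

# A minimal sketch of MAP prediction for the tennis example.
# Table values come from the slides; the data layout is assumed.

prior = {"Yes": 0.3, "No": 0.7}

# Joint likelihood P(Temperature, Wind | Play) for each label.
likelihood = {
    "Yes": {("Hot", "Strong"): 0.15, ("Hot", "Weak"): 0.4,
            ("Cold", "Strong"): 0.1, ("Cold", "Weak"): 0.35},
    "No":  {("Hot", "Strong"): 0.4, ("Hot", "Weak"): 0.1,
            ("Cold", "Strong"): 0.3, ("Cold", "Weak"): 0.2},
}

def map_predict(x):
    # argmax over labels of P(x | y) * P(y); P(x) is the same for
    # every label, so it never needs to be computed.
    return max(prior, key=lambda y: likelihood[y][x] * prior[y])

print(map_predict(("Hot", "Weak")))  # -> "Yes" (0.12 vs. 0.07)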

Pages 16–17: How hard is it to learn probabilistic models?

#   O  T  H  W  Play?
1   S  H  H  W  -
2   S  H  H  S  -
3   O  H  H  W  +
4   R  M  H  W  +
5   R  C  N  W  +
6   R  C  N  S  -
7   O  C  N  S  +
8   S  M  H  W  -
9   S  C  N  W  +
10  R  M  N  W  +
11  S  M  N  S  +
12  O  M  H  S  +
13  O  H  N  W  +
14  R  M  H  S  -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)

We need to learn:

1. The prior P(Play?)
2. The likelihoods P(X | Play?)

Pages 18–21: How hard is it to learn probabilistic models?

(Same dataset as above. Number of values for each feature: Outlook 3, Temperature 3, Humidity 3, Wind 2.)

Prior P(play?)

• A single number (Why only one?)

Likelihood P(X | Play?)

• There are 4 features

• For each value of Play? (+/-), we need a value for each possible assignment: P(x1, x2, x3, x4 | Play?)

• (3 ⋅ 3 ⋅ 3 ⋅ 2 − 1) = 53 parameters in each case, one for each assignment

Pages 22–24: How hard is it to learn probabilistic models?

In general:

Prior P(Y)

• If there are k labels, then k − 1 parameters (why not k?)

Likelihood P(X | Y)

• If there are d Boolean features:

• We need a value for each possible P(x1, x2, …, xd | y) for each y

• k(2^d − 1) parameters

Need a lot of data to estimate so many numbers!

Pages 25–27: How hard is it to learn probabilistic models?

Prior P(Y)

• If there are k labels, then k − 1 parameters (why not k?)

Likelihood P(X | Y)

• If there are d Boolean features:

• We need a value for each possible P(x1, x2, …, xd | y) for each y

• k(2^d − 1) parameters

Need a lot of data to estimate so many numbers! High model complexity: easy to overfit!

How can we deal with this?

Answer: Make independence assumptions

Page 28: Recall: Conditional independence

Suppose X, Y, and Z are random variables.

X is conditionally independent of Y given Z if the probability distribution of X is independent of the value of Y when Z is observed:

P(X | Y, Z) = P(X | Z)

Or equivalently:

P(X, Y | Z) = P(X | Z) P(Y | Z)

Pages 29–30: Modeling the features

P(x1, x2, …, xd | y) required k(2^d − 1) parameters.

What if all the features were conditionally independent given the label? That is,

P(x1, x2, …, xd | y) = P(x1 | y) P(x2 | y) ⋯ P(xd | y)

This is the Naïve Bayes Assumption. It requires only d numbers for each label: kd parameters overall. Not bad!
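To make the gap concrete, here is a quick arithmetic sketch (my own illustration; the formulas are the slides' counts for k labels and d Boolean features):

# Parameter counts before and after the naïve Bayes assumption.

def params_full_joint(k, d):
    # One probability per joint assignment of the d Boolean features,
    # minus one because they sum to 1, for each of the k labels.
    return k * (2**d - 1)

def params_naive_bayes(k, d):
    # One number P(x_j = 1 | y) per feature per label.
    return k * d

for d in (4, 10, 20):
    print(d, params_full_joint(2, d), params_naive_bayes(2, d))
# d=4:  30 vs. 8
# d=10: 2046 vs. 20
# d=20: 2097150 vs. 40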

Pages 31–33: The Naïve Bayes Classifier

Assumption: Features are conditionally independent given the label Y.

To predict, we need two sets of probabilities:
– Prior P(y)
– For each xj, the likelihood P(xj | y)

Decision rule:

h_NB(x) = argmax_y P(y) P(x1, x2, …, xd | y)
        = argmax_y P(y) ∏_j P(xj | y)
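A minimal Python sketch of this decision rule (my own illustration; the data layout is assumed, and a sum of logs replaces the slides' product of probabilities, which avoids numerical underflow without changing the argmax):

import math

# `prior[y]` holds P(y); `likelihood[y][j][v]` holds P(x_j = v | y).
def h_nb(x, prior, likelihood):
    def score(y):
        # log P(y) + sum_j log P(x_j | y) is monotone in
        # P(y) * prod_j P(x_j | y), so the argmax is the same.
        s = math.log(prior[y])
        for j, v in enumerate(x):
            s += math.log(likelihood[y][j][v])
        return s
    return max(prior, key=score)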

Page 34: Today's lecture

• The naïve Bayes Classifier

• Learning the naïve Bayes Classifier

Pages 35–36: Learning the naïve Bayes Classifier

• What is the hypothesis function h defined by?
– A collection of probabilities:
  • Prior for each label: P(y)
  • Likelihoods for feature xj given a label: P(xj | y)

Suppose we have a dataset D = {(xi, yi)} with m examples, and we want to learn the classifier in a probabilistic way.
– What is a probabilistic criterion to select the hypothesis?

A note on convention for this section:
• Examples in the dataset are indexed by the subscript i (e.g. xi)
• Features within an example are indexed by the subscript j
• The j-th feature of the i-th example will be xij

Page 37: Learning the naïve Bayes Classifier

Maximum likelihood estimation: choose

h_ML = argmax_h P(D | h)

Here h is defined by all the probabilities used to construct the naïve Bayes decision.

Pages 38–39: Maximum likelihood estimation

Given a dataset D = {(xi, yi)} with m examples:

P(D | h) = ∏_i P(xi, yi | h)

Each example in the dataset is independent and identically distributed, so we can represent P(D | h) as this product. Each factor asks: "What probability would this particular h assign to the pair (xi, yi)?"

Pages 40–43: Maximum likelihood estimation

Given a dataset D = {(xi, yi)} with m examples:

P(D | h) = ∏_i P(xi, yi | h) = ∏_i P(yi | h) P(xi | yi, h)

Applying the Naïve Bayes assumption to the second factor (xij is the j-th feature of xi):

P(D | h) = ∏_i P(yi | h) ∏_j P(xij | yi, h)

How do we proceed?

Pages 44–45: Learning the naïve Bayes Classifier

Maximum likelihood estimation: we have P(D | h) as a product over examples and features. What next?

We need to make a modeling assumption about the functional form of these probability distributions.

Pages 46–49: Learning the naïve Bayes Classifier

Maximum likelihood estimation. For simplicity, suppose there are two labels (1 and 0) and all features are binary.

• Prior: P(y = 1) = p and P(y = 0) = 1 − p

• Likelihood for each feature given a label:
  • P(xj = 1 | y = 1) = aj and P(xj = 0 | y = 1) = 1 − aj
  • P(xj = 1 | y = 0) = bj and P(xj = 0 | y = 0) = 1 − bj

h consists of p and all the a's and b's.
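The equations on these slides did not survive extraction. Under the parameterization above, and using the Iverson bracket [z] introduced on the next pages, the per-example likelihood should take the following form (a reconstruction, written in LaTeX):

P(x_i, y_i \mid h)
  = \Big( p \prod_j a_j^{[x_{ij}=1]} (1-a_j)^{[x_{ij}=0]} \Big)^{[y_i=1]}
    \Big( (1-p) \prod_j b_j^{[x_{ij}=1]} (1-b_j)^{[x_{ij}=0]} \Big)^{[y_i=0]}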

Pages 50–51: Learning the naïve Bayes Classifier

Maximum likelihood estimation for the prior, P(y = 1) = p and P(y = 0) = 1 − p. The terms of the log-likelihood that involve p are

∑_i ( [yi = 1] log p + [yi = 0] log(1 − p) )

[z] is called the indicator function or the Iverson bracket. Its value is 1 if the argument z is true and zero otherwise.
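The slides display the result on the following pages; for completeness, here is the one-line derivation for the prior (a standard Bernoulli maximum likelihood calculation, added here, in LaTeX):

\frac{d}{dp} \sum_i \Big( [y_i{=}1]\log p + [y_i{=}0]\log(1-p) \Big)
  = \frac{\sum_i [y_i{=}1]}{p} - \frac{\sum_i [y_i{=}0]}{1-p} = 0
\quad\Longrightarrow\quad
\hat{p} = \frac{1}{m} \sum_i [y_i = 1]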

Page 52: Learning the naïve Bayes Classifier

Maximum likelihood estimation for the likelihood of each feature given a label:
• P(xj = 1 | y = 1) = aj and P(xj = 0 | y = 1) = 1 − aj
• P(xj = 1 | y = 0) = bj and P(xj = 0 | y = 0) = 1 − bj

Pages 53–55: Learning the naïve Bayes Classifier

Substituting and deriving the argmax, we get count-based estimates:

P(y = 1) = p = ( ∑_i [yi = 1] ) / m

P(xj = 1 | y = 1) = aj = ( ∑_i [xij = 1 and yi = 1] ) / ( ∑_i [yi = 1] )

P(xj = 1 | y = 0) = bj = ( ∑_i [xij = 1 and yi = 0] ) / ( ∑_i [yi = 0] )
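These estimates are just counts, so they take only a few lines of code. A minimal sketch, assuming binary labels in {0, 1} and a binary feature matrix X of shape m × d (the array names are mine):

import numpy as np

def fit_naive_bayes(X, y):
    # Counting estimates from the slides: fractions of examples.
    p = np.mean(y == 1)          # P(y = 1)
    a = X[y == 1].mean(axis=0)   # a_j = P(x_j = 1 | y = 1)
    b = X[y == 0].mean(axis=0)   # b_j = P(x_j = 1 | y = 0)
    # Note: pure maximum likelihood; a count of zero gives a
    # zero probability (smoothing is not covered on these slides).
    return p, a, b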

Pages 56–60: Let's learn a naïve Bayes classifier

(Same 14-example dataset as above.)

P(Play = +) = 9/14    P(Play = −) = 5/14

P(O = S | Play = +) = 2/9

P(O = R | Play = +) = 3/9

P(O = O | Play = +) = 4/9

And so on, for the other attributes and also for Play = −.
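As a check, these counts can be reproduced directly from the table (a sketch; the tuple encoding of the dataset is mine):

from collections import Counter

# The 14 examples, as (Outlook, Temperature, Humidity, Wind, Play?).
data = [
    ("S","H","H","W","-"), ("S","H","H","S","-"), ("O","H","H","W","+"),
    ("R","M","H","W","+"), ("R","C","N","W","+"), ("R","C","N","S","-"),
    ("O","C","N","S","+"), ("S","M","H","W","-"), ("S","C","N","W","+"),
    ("R","M","N","W","+"), ("S","M","N","S","+"), ("O","M","H","S","+"),
    ("O","H","N","W","+"), ("R","M","H","S","-"),
]

labels = Counter(row[-1] for row in data)
print(labels["+"], "/", len(data))         # 9 / 14  ->  P(Play = +)

outlook_pos = Counter(row[0] for row in data if row[-1] == "+")
print(outlook_pos["S"], "/", labels["+"])  # 2 / 9   ->  P(O = S | Play = +)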

Page 61: Naïve Bayes: Learning and Prediction

• Learning
– Count how often features occur with each label; normalize to get likelihoods
– Priors come from the fraction of examples with each label
– Generalizes to multiclass

• Prediction
– Use the learned probabilities to find the highest scoring label

Page 62: Summary: Naïve Bayes

• Independence assumption
– All features are independent of each other given the label

• Maximum likelihood learning: learning is simple
– Generalizes to real-valued features

• Prediction via maximum a posteriori (MAP)
– Generalizes beyond binary classification