Machine Learning and Data Mining I · 2018. 10. 27.
DS 4400
Alina Oprea
Associate Professor, CCIS
Northeastern University
October 25, 2018
Machine Learning and Data Mining I
Review
Outline
• Naïve Bayes classifier
– Density Estimation
• Application
– Document classification
• Review of supervised learning methods
• Metrics for evaluating classifiers
Prior and Joint Probabilities
Density Estimation
Example – Learning Joint Probability Distribution
Pros and Cons of Density Estimators
• Pros
– Density Estimators can learn distribution of training data
– Can compute probability for a record
– Can do inference (predict likelihood of record)
• Cons
– Can overfit to the training data and not generalize to test data
– Curse of dimensionality
Naïve Bayes classifier fixes these cons!
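The curse-of-dimensionality con can be made concrete with a small sketch (binary features assumed; the counts are standard, not from the slides): a full joint density estimator needs exponentially many parameters, while Naïve Bayes needs only linearly many.

```python
# Parameter counts for modeling d binary features: full joint table vs.
# Naive Bayes. Illustrative sketch; d (features) and k (classes) are assumed.

def joint_table_params(d):
    # A full joint distribution over d binary features has 2^d entries;
    # one is determined by the sum-to-one constraint.
    return 2 ** d - 1

def naive_bayes_params(d, k):
    # Per class: one Bernoulli parameter per feature, plus k - 1 free priors.
    return d * k + (k - 1)

for d in (10, 20, 30):
    print(d, joint_table_params(d), naive_bayes_params(d, 2))
```

At d = 30 the full table already needs over a billion parameters, while Naïve Bayes needs 61.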
Bayes’ Rule
LDA
• Classify to one of k classes
• Logistic regression (a discriminative model) computes P[Y = 1 | X = x] directly
  – Assumes a sigmoid function
• LDA (a generative model) uses Bayes' Theorem to estimate it
  – P[Y = k | X = x] = P[X = x | Y = k] P[Y = k] / P[X = x]
  – Let π_k = P[Y = k] be the prior probability of class k and f_k(x) = P[X = x | Y = k]
LDA
Assume f_k(x) is Gaussian! Unidimensional case (d = 1)
Assumption: σ_1 = … = σ_k = σ
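A minimal sketch of one-dimensional LDA under these assumptions (Gaussian f_k(x) per class, shared variance); the data a caller passes in is assumed, not the slides' example:

```python
import math

# 1-D LDA sketch: estimate per-class means mu_k, priors pi_k, and a pooled
# variance shared across classes (the sigma_1 = ... = sigma_k assumption).

def fit_lda(xs, ys):
    classes = sorted(set(ys))
    n = len(xs)
    mus = {k: sum(x for x, y in zip(xs, ys) if y == k) /
              sum(1 for y in ys if y == k) for k in classes}
    # Pooled (shared) variance estimate.
    var = sum((x - mus[y]) ** 2 for x, y in zip(xs, ys)) / (n - len(classes))
    priors = {k: sum(1 for y in ys if y == k) / n for k in classes}
    return mus, priors, var

def lda_predict(mus, priors, var, x):
    # Discriminant delta_k(x) = x*mu_k/var - mu_k^2/(2*var) + log(pi_k);
    # predict the class with the largest discriminant.
    return max(mus, key=lambda k: x * mus[k] / var
               - mus[k] ** 2 / (2 * var) + math.log(priors[k]))
```

For example, fitting on points clustered near 1 (class 0) and near 3 (class 1) makes `lda_predict` split the line roughly at the midpoint.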
Classification Setting
P[Y = k | X = x] = P[Y = k] P[X = x | Y = k] / P[X = x]

P[Y = k | X = x] = P[Y = k] P[X_1 = x_1 ∧ … ∧ X_d = x_d | Y = k] / P[X_1 = x_1 ∧ … ∧ X_d = x_d]
Naïve Bayes Classifier
P[Y = k | X = x] = P[Y = k] P[X_1 = x_1 ∧ … ∧ X_d = x_d | Y = k] / P[X_1 = x_1 ∧ … ∧ X_d = x_d]
Training Naïve Bayes
(Step-by-step worked example: conditional probabilities such as 3/4, 1/4, and 2/3 estimated by counting over a small training table.)
Laplace Smoothing
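A minimal sketch of Laplace (add-one) smoothing: a feature value that never co-occurs with class k in the training data still gets a nonzero estimated probability.

```python
# Laplace (add-one) smoothing: add 1 to each count and the number of possible
# feature values to the denominator, so no estimate is ever exactly zero.

def smoothed_prob(count_value_and_class, count_class, num_values):
    # (count + 1) / (total + number of possible values)
    return (count_value_and_class + 1) / (count_class + num_values)

# Unseen value with a binary feature: 1/5 instead of 0.
p = smoothed_prob(0, 3, 2)
```

Without smoothing, a single zero conditional probability would zero out the whole product for that class.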
Using the Naïve Bayes Classifier
P[Y = k | X = x] = P[Y = k] P[X_1 = x_1 ∧ … ∧ X_d = x_d | Y = k] / P[X_1 = x_1 ∧ … ∧ X_d = x_d]
Predict the class k that maximizes the numerator (the denominator does not depend on k).
Naïve Bayes Classifier
• For each class label k:
  1. Estimate the prior P[Y = k] from the data
  2. For each value v of attribute X_j:
     • Estimate P[X_j = v | Y = k]
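The training procedure above can be sketched as a counting loop; the toy dataset in the usage line is made up for illustration:

```python
from collections import Counter

# Training Naive Bayes by counting: estimate priors P[Y = k] and
# conditionals P[X_j = v | Y = k] as relative frequencies.

def train_naive_bayes(X, y):
    n = len(y)
    priors = {k: c / n for k, c in Counter(y).items()}
    cond = {}
    for k in priors:
        rows = [x for x, label in zip(X, y) if label == k]
        for j in range(len(X[0])):
            counts = Counter(row[j] for row in rows)
            # cond[(j, k)][v] approximates P[X_j = v | Y = k]
            cond[(j, k)] = {v: c / len(rows) for v, c in counts.items()}
    return priors, cond

priors, cond = train_naive_bayes([[1, 0], [1, 1], [0, 0], [0, 1]], [1, 1, 0, 0])
```

Note that these raw frequency estimates can be zero, which is exactly what Laplace smoothing fixes.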
Computing Probabilities
Naïve Bayes Summary
Document Classification
Text Classification: Examples
Bag of Words Representation
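A minimal sketch of the bag-of-words representation: a document becomes a vector of word counts over a fixed vocabulary, and word order is discarded. The vocabulary and text here are made up.

```python
from collections import Counter

# Bag of words: count how often each vocabulary word appears in the document.
# Out-of-vocabulary words are simply dropped.

def bag_of_words(doc, vocab):
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

vocab = ["the", "ball", "game", "election"]
vec = bag_of_words("The ball game the fans loved", vocab)  # [2, 1, 1, 0]
```

Here "the" counts twice, while "fans" and "loved" are outside the vocabulary and vanish from the vector.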
Another View of Naïve Bayes
Document Classification with Naïve Bayes
Review Naïve Bayes
• Density Estimators can estimate a joint probability distribution from data
• Risk of overfitting and curse of dimensionality
• Naïve Bayes assumes that features are independent given the label
  – Reduces the complexity of density estimation
  – Even though the assumption is not always true, Naïve Bayes works well in practice
• Applications: text classification with bag-of-words representation
  – Naïve Bayes becomes a linear classifier
• Naïve Bayes is a generative model
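One way to see the "linear classifier" point: with bag-of-words counts x, the log-posterior for class k is (up to a constant) log P[Y = k] + Σ_j x_j · log P[word_j | Y = k], a dot product plus a bias. A sketch with made-up word probabilities:

```python
import math

# Naive Bayes score in log space: log prior + sum of x_j * log P[word_j | Y=k].
# This is linear in the count vector x, so the decision boundary is linear.

def nb_log_score(x, log_prior, log_word_probs):
    return log_prior + sum(xj * lp for xj, lp in zip(x, log_word_probs))

# Two classes over a 2-word vocabulary; probabilities are illustrative.
score_spam = nb_log_score([2, 1], math.log(0.4), [math.log(0.7), math.log(0.3)])
score_ham = nb_log_score([2, 1], math.log(0.6), [math.log(0.2), math.log(0.8)])
```

Comparing the two scores and picking the larger is exactly the Naïve Bayes decision rule, now expressed as comparing two linear functions of x.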
Traditional learning
• Linear classifiers
  – Logistic regression, LDA, perceptrons
• Decision trees
• Ensembles
  – Random Forests
  – AdaBoost
• SVM
  – Linear SVM
  – Kernels
• Naïve Bayes
Confusion Matrix
Accuracy and Error
Precision & Recall
F-Score
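These four metrics all derive from the binary confusion matrix; a minimal sketch with made-up counts:

```python
# Accuracy, precision, recall, and F1 from a binary confusion matrix
# (tp = true positives, fp = false positives, fn = false negatives,
# tn = true negatives). The counts below are illustrative.

def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # of predicted positives, how many are real
    recall = tp / (tp + fn)             # of real positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics(90, 10, 30, 870)
```

On imbalanced data, accuracy alone can look high even for a useless classifier (predict the majority class every time), which is what motivates precision and recall.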
A Word of Caution
Receiver Operating Characteristic (ROC)
Performance Depends on Threshold
ROC Example
ROC Curve
Area Under the ROC Curve
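A sketch of how an ROC curve and its AUC are computed: sweep a decision threshold over the classifier's scores, record (FPR, TPR) at each threshold, and integrate with the trapezoidal rule. The scores and labels in the test are made up.

```python
# ROC curve: for each threshold t, classify score >= t as positive and
# record (false positive rate, true positive rate).

def roc_points(scores, labels):
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in [float("inf")] + sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    # Area under the curve by the trapezoidal rule.
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

A classifier whose scores perfectly separate the classes reaches AUC 1.0, while random scoring hovers around 0.5.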
Comparing Supervised Learning
Acknowledgements
• Slides made using resources from:
– Andrew Ng
– Eric Eaton
– David Sontag
– Andrew Moore
• Thanks!