Machine Learning and Data Mining I · 2018. 10. 27.
DS 4400
Alina Oprea
Associate Professor, CCIS
Northeastern University
October 25, 2018
Machine Learning and Data Mining I
Review
Outline
• Naïve Bayes classifier
– Density Estimation
• Application
– Document classification
• Review of supervised learning methods
• Metrics for evaluating classifiers
Prior and Joint Probabilities
Density Estimation
Example – Learning Joint Probability Distribution
Pros and Cons of Density Estimators
• Pros
– Density Estimators can learn distribution of training data
– Can compute probability for a record
– Can do inference (predict likelihood of record)
• Cons
– Can overfit to the training data and not generalize to test data
– Curse of dimensionality
Naïve Bayes classifier fixes these cons!
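The curse-of-dimensionality con can be made concrete with a small sketch (binary features assumed; the counts are standard, not from the slides): a full joint density estimator needs exponentially many parameters, while Naïve Bayes needs only linearly many.

```python
# Parameter counts for modeling d binary features: full joint table vs.
# Naive Bayes. Illustrative sketch; d (features) and k (classes) are assumed.

def joint_table_params(d):
    # A full joint distribution over d binary features has 2^d entries;
    # one is determined by the sum-to-one constraint.
    return 2 ** d - 1

def naive_bayes_params(d, k):
    # Per class: one Bernoulli parameter per feature, plus k - 1 free priors.
    return d * k + (k - 1)

for d in (10, 20, 30):
    print(d, joint_table_params(d), naive_bayes_params(d, 2))
```

At d = 30 the full table already needs over a billion parameters, while Naïve Bayes needs 61.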
Bayes’ Rule
LDA
• Classify to one of k classes
• Logistic regression (a discriminative model) computes P[Y = 1 | X = x] directly
  – Assumes a sigmoid function
• LDA (a generative model) uses Bayes' Theorem to estimate it
  – P[Y = k | X = x] = P[X = x | Y = k] P[Y = k] / P[X = x]
  – Let π_k = P[Y = k] be the prior probability of class k and f_k(x) = P[X = x | Y = k]
LDA
Assume f_k(x) is Gaussian! Unidimensional case (d = 1)
Assumption: σ_1 = … = σ_k = σ
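A minimal sketch of one-dimensional LDA under these assumptions (Gaussian f_k(x) per class, shared variance); the data a caller passes in is assumed, not the slides' example:

```python
import math

# 1-D LDA sketch: estimate per-class means mu_k, priors pi_k, and a pooled
# variance shared across classes (the sigma_1 = ... = sigma_k assumption).

def fit_lda(xs, ys):
    classes = sorted(set(ys))
    n = len(xs)
    mus = {k: sum(x for x, y in zip(xs, ys) if y == k) /
              sum(1 for y in ys if y == k) for k in classes}
    # Pooled (shared) variance estimate.
    var = sum((x - mus[y]) ** 2 for x, y in zip(xs, ys)) / (n - len(classes))
    priors = {k: sum(1 for y in ys if y == k) / n for k in classes}
    return mus, priors, var

def lda_predict(mus, priors, var, x):
    # Discriminant delta_k(x) = x*mu_k/var - mu_k^2/(2*var) + log(pi_k);
    # predict the class with the largest discriminant.
    return max(mus, key=lambda k: x * mus[k] / var
               - mus[k] ** 2 / (2 * var) + math.log(priors[k]))
```

For example, fitting on points clustered near 1 (class 0) and near 3 (class 1) makes `lda_predict` split the line roughly at the midpoint.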
Classification Setting
P[Y = k | X = x] = P[Y = k] P[X = x | Y = k] / P[X = x]

P[Y = k | X = x] = P[Y = k] P[X_1 = x_1 ∧ … ∧ X_d = x_d | Y = k] / P[X_1 = x_1 ∧ … ∧ X_d = x_d]
Naïve Bayes Classifier
P[Y = k | X = x] = P[Y = k] P[X_1 = x_1 ∧ … ∧ X_d = x_d | Y = k] / P[X_1 = x_1 ∧ … ∧ X_d = x_d]
Training Naïve Bayes
(Step-by-step worked example: conditional probabilities such as 3/4, 1/4, and 2/3 estimated by counting over a small training table.)
Laplace Smoothing
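A minimal sketch of Laplace (add-one) smoothing: a feature value that never co-occurs with class k in the training data still gets a nonzero estimated probability.

```python
# Laplace (add-one) smoothing: add 1 to each count and the number of possible
# feature values to the denominator, so no estimate is ever exactly zero.

def smoothed_prob(count_value_and_class, count_class, num_values):
    # (count + 1) / (total + number of possible values)
    return (count_value_and_class + 1) / (count_class + num_values)

# Unseen value with a binary feature: 1/5 instead of 0.
p = smoothed_prob(0, 3, 2)
```

Without smoothing, a single zero conditional probability would zero out the whole product for that class.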
Using the Naïve Bayes Classifier
P[Y = k | X = x] = P[Y = k] P[X_1 = x_1 ∧ … ∧ X_d = x_d | Y = k] / P[X_1 = x_1 ∧ … ∧ X_d = x_d]
Predict the class k that maximizes the numerator (the denominator does not depend on k).
Naïve Bayes Classifier
• For each class label k:
  1. Estimate the prior P[Y = k] from the data
  2. For each value v of attribute X_j:
     • Estimate P[X_j = v | Y = k]
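The training procedure above can be sketched as a counting loop; the toy dataset in the usage line is made up for illustration:

```python
from collections import Counter

# Training Naive Bayes by counting: estimate priors P[Y = k] and
# conditionals P[X_j = v | Y = k] as relative frequencies.

def train_naive_bayes(X, y):
    n = len(y)
    priors = {k: c / n for k, c in Counter(y).items()}
    cond = {}
    for k in priors:
        rows = [x for x, label in zip(X, y) if label == k]
        for j in range(len(X[0])):
            counts = Counter(row[j] for row in rows)
            # cond[(j, k)][v] approximates P[X_j = v | Y = k]
            cond[(j, k)] = {v: c / len(rows) for v, c in counts.items()}
    return priors, cond

priors, cond = train_naive_bayes([[1, 0], [1, 1], [0, 0], [0, 1]], [1, 1, 0, 0])
```

Note that these raw frequency estimates can be zero, which is exactly what Laplace smoothing fixes.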
Computing Probabilities
Naïve Bayes Summary
Document Classification
Text Classification: Examples
Bag of Words Representation
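A minimal sketch of the bag-of-words representation: a document becomes a vector of word counts over a fixed vocabulary, and word order is discarded. The vocabulary and text here are made up.

```python
from collections import Counter

# Bag of words: count how often each vocabulary word appears in the document.
# Out-of-vocabulary words are simply dropped.

def bag_of_words(doc, vocab):
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

vocab = ["the", "ball", "game", "election"]
vec = bag_of_words("The ball game the fans loved", vocab)  # [2, 1, 1, 0]
```

Here "the" counts twice, while "fans" and "loved" are outside the vocabulary and vanish from the vector.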
Another View of Naïve Bayes
Document Classification with Naïve Bayes
Review Naïve Bayes
• Density Estimators can estimate a joint probability distribution from data
• Risk of overfitting and curse of dimensionality
• Naïve Bayes assumes that features are independent given the label
  – Reduces the complexity of density estimation
  – Even though the assumption is not always true, Naïve Bayes works well in practice
• Applications: text classification with bag-of-words representation
  – Naïve Bayes becomes a linear classifier
• Naïve Bayes is a generative model
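One way to see the "linear classifier" point: with bag-of-words counts x, the log-posterior for class k is (up to a constant) log P[Y = k] + Σ_j x_j · log P[word_j | Y = k], a dot product plus a bias. A sketch with made-up word probabilities:

```python
import math

# Naive Bayes score in log space: log prior + sum of x_j * log P[word_j | Y=k].
# This is linear in the count vector x, so the decision boundary is linear.

def nb_log_score(x, log_prior, log_word_probs):
    return log_prior + sum(xj * lp for xj, lp in zip(x, log_word_probs))

# Two classes over a 2-word vocabulary; probabilities are illustrative.
score_spam = nb_log_score([2, 1], math.log(0.4), [math.log(0.7), math.log(0.3)])
score_ham = nb_log_score([2, 1], math.log(0.6), [math.log(0.2), math.log(0.8)])
```

Comparing the two scores and picking the larger is exactly the Naïve Bayes decision rule, now expressed as comparing two linear functions of x.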
Traditional learning
• Linear classifiers
  – Logistic regression, LDA, perceptrons
• Decision trees
• Ensembles
  – Random Forests
  – AdaBoost
• SVM
  – Linear SVM
  – Kernels
• Naïve Bayes
Confusion Matrix
Accuracy and Error
Precision & Recall
F-Score
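These four metrics all derive from the binary confusion matrix; a minimal sketch with made-up counts:

```python
# Accuracy, precision, recall, and F1 from a binary confusion matrix
# (tp = true positives, fp = false positives, fn = false negatives,
# tn = true negatives). The counts below are illustrative.

def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # of predicted positives, how many are real
    recall = tp / (tp + fn)             # of real positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics(90, 10, 30, 870)
```

On imbalanced data, accuracy alone can look high even for a useless classifier (predict the majority class every time), which is what motivates precision and recall.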
A Word of Caution
Receiver Operating Characteristic (ROC)
Performance Depends on Threshold
ROC Example
ROC Curve
Area Under the ROC Curve
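A sketch of how an ROC curve and its AUC are computed: sweep a decision threshold over the classifier's scores, record (FPR, TPR) at each threshold, and integrate with the trapezoidal rule. The scores and labels in the test are made up.

```python
# ROC curve: for each threshold t, classify score >= t as positive and
# record (false positive rate, true positive rate).

def roc_points(scores, labels):
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in [float("inf")] + sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    # Area under the curve by the trapezoidal rule.
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

A classifier whose scores perfectly separate the classes reaches AUC 1.0, while random scoring hovers around 0.5.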
Comparing Supervised Learning
Acknowledgements
• Slides made using resources from:
– Andrew Ng
– Eric Eaton
– David Sontag
– Andrew Moore
• Thanks!