
Transcript of Decision tree, kNN and model selection

Page 1:

Decision tree, kNN and model selection

Yifeng Tao, School of Computer Science, Carnegie Mellon University

Slides adapted from Ziv Bar-Joseph, Tom Mitchell, Eric Xing

Introduction to Machine Learning

Page 2:

Outline

o Decision tree and random forest
o kNN
o Model selection and overfitting

Page 3:

Decision trees

o One of the most intuitive classifiers
o Easy to understand and construct
o Surprisingly, also works very well

[Slide from Ziv Bar-Joseph]

Page 4:

The ‘best’ classifier

o Decision trees are quite robust, intuitive, and, surprisingly, very accurate
o XGBoost: an ensemble method of decision trees

[Slide from https://www.quora.com/What-machine-learning-approaches-have-won-most-Kaggle-competitions]

Page 5:

Structure of a decision tree

o An advertising example:
o Internal nodes correspond to attributes (features)
o Leaves correspond to classification outcomes
o Edges denote assignment

[Slide from Ziv Bar-Joseph]

Page 6:

Dataset

[Slide from Ziv Bar-Joseph]

Page 7:

Building a decision tree
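
The slide body here was an image and is not preserved in the transcript. As a minimal Python sketch of the recursive procedure, assuming categorical attributes (the data layout and the `best_attribute` helper are illustrative assumptions, with `information_gain` as sketched on pages 14-16 below):

```python
from collections import Counter

def best_attribute(samples, labels, attributes):
    """Pick the attribute maximizing information gain (see pages 14-16)."""
    return max(attributes,
               key=lambda a: information_gain([s[a] for s in samples], labels))

def build_tree(samples, labels, attributes):
    """ID3-style recursion: samples are dicts of attribute -> value,
    labels are the aligned class labels, attributes is a set of names."""
    if len(set(labels)) == 1:            # homogeneous node: make a leaf
        return labels[0]
    if not attributes:                   # out of attributes: majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = best_attribute(samples, labels, attributes)
    node = {"attribute": best, "children": {}}
    for value in {s[best] for s in samples}:
        idx = [i for i, s in enumerate(samples) if s[best] == value]
        node["children"][value] = build_tree([samples[i] for i in idx],
                                             [labels[i] for i in idx],
                                             attributes - {best})
    return node
```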

[Slide from Ziv Bar-Joseph]

Page 8:

Identifying ‘bestAttribute’

o There are many possible ways to select the best attribute for a given set.

o We will discuss one possible way, which is based on information theory and generalizes well to non-binary variables.

[Slide from Ziv Bar-Joseph]

Page 9:

Entropy

o Quantifies the amount of uncertainty associated with a specific probability distribution

o The higher the entropy, the less confident we are in the outcome

o Definition: H(X) = - Σ_x P(X=x) log2 P(X=x)

[Slide from Ziv Bar-Joseph]

Page 10:

Entropy

o Definition: H(X) = - Σ_x P(X=x) log2 P(X=x)

o So, if P(X=1) = 1, then H(X) = 0

o If P(X=1) = 0.5, then H(X) = 1 (see the sketch below)
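
A minimal sketch that reproduces these two values from the definition (illustrative code, not from the slides):

```python
import math

def entropy(probs):
    """H(X) = -sum_x P(X=x) * log2 P(X=x), with 0 * log 0 taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0, 0.0]))   # P(X=1) = 1   -> 0 bits
print(entropy([0.5, 0.5]))   # P(X=1) = 0.5 -> 1 bit
```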

[Slide from Ziv Bar-Joseph]

Page 11:

Interpreting entropy

o Entropy can be interpreted from an information standpoint
o Assume both sender and receiver know the distribution. How many bits, on average, would it take to transmit one value?
o If P(X=1) = 1, then the answer is 0 (we don't need to transmit anything)
o If P(X=1) = 0.5, then the answer is 1 (either value is equally likely)
o If 0 < P(X=1) < 0.5 or 0.5 < P(X=1) < 1, then the answer is between 0 and 1

[Slide from Ziv Bar-Joseph]

Page 12:

Conditional entropy

o Entropy measures the uncertainty in a specific distribution

o What if we know something about the transmission?

o For example, say I want to send the label (Liked) when the length is known

o This becomes a conditional entropy problem: H(Li | Le=v) is the entropy of Liked among movies with length v.

[Slide from Ziv Bar-Joseph]

Page 13:

Conditional entropy: Examples for specific values

o Let's compute H(Li | Le=v):
o 1. H(Li | Le = S) = 0.92
o 2. H(Li | Le = M) = 0
o 3. H(Li | Le = L) = 0.92

[Slide from Ziv Bar-Joseph]

Page 14:

Conditional entropy

o We can generalize the conditional entropy idea to determine H(Li | Le)

o Definition: H(Y | X) = Σ_v P(X=v) H(Y | X=v) (see the sketch below)
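
A sketch of this definition, estimating the probabilities from paired samples and reusing the `entropy` helper from page 10 (illustrative, not from the slides):

```python
from collections import Counter, defaultdict

def conditional_entropy(xs, ys):
    """H(Y | X) = sum_v P(X=v) * H(Y | X=v), estimated from paired samples."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    n = len(xs)
    total = 0.0
    for ys_v in groups.values():
        dist = [c / len(ys_v) for c in Counter(ys_v).values()]
        total += (len(ys_v) / n) * entropy(dist)  # weight by P(X=v)
    return total
```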

[Slide from Ziv Bar-Joseph]

Page 15:

Conditional entropy: Example

o Let's compute H(Li | Le)

[Slide from Ziv Bar-Joseph]

Page 16:

Information gain

o How much do we gain (in terms of reduction in entropy) from knowing one of the attributes?

o In other words, what is the reduction in entropy from this knowledge?
o Definition: IG(Y|X) = H(Y) - H(Y|X)

o IG(Y|X) is always ≥ 0 (proof: Jensen's inequality; see the sketch below)
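
Combining the two helpers sketched above gives the split criterion directly (illustrative code):

```python
from collections import Counter

def information_gain(xs, ys):
    """IG(Y | X) = H(Y) - H(Y | X); non-negative by Jensen's inequality."""
    h_y = entropy([c / len(ys) for c in Counter(ys).values()])
    return h_y - conditional_entropy(xs, ys)
```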

[Slide from Ziv Bar-Joseph]

Page 17:

Where we are

o We were looking for a good criterion for selecting the best attribute for a node split

o We defined entropy, conditional entropy, and information gain
o We will now use information gain as our criterion for a good split
o That is, bestAttribute will return the attribute that maximizes the information gain at each node

[Slide from Ziv Bar-Joseph]

Page 18:

Building a decision tree

[Slide from Ziv Bar-Joseph]

Page 19:

Example: Root attribute

o P(Li=yes) = 2/3
o H(Li) = 0.91

o H(Li | T) = 0.61
o H(Li | Le) = 0.61
o H(Li | D) = 0.36
o H(Li | F) = 0.85

o IG(Li | T) = 0.91 - 0.61 = 0.30
o IG(Li | Le) = 0.91 - 0.61 = 0.30
o IG(Li | D) = 0.91 - 0.36 = 0.55
o IG(Li | F) = 0.91 - 0.85 = 0.06
o D (director) maximizes the information gain, so it is selected as the root attribute

[Slide from Ziv Bar-Joseph]

Page 20:

Building a tree

[Slide from Ziv Bar-Joseph]

Page 21:

Building a tree

o We only need to focus on the records (samples) associated with this node

o We eliminated the 'director' attribute: all samples have the same director

o P(Li=yes) = 1/4, H(Li) = 0.81
o H(Li | T) = 0, H(Li | Le) = 0, H(Li | F) = 0.5
o IG(Li | T) = 0.81, IG(Li | Le) = 0.81, IG(Li | F) = 0.31

[Slide from Ziv Bar-Joseph]

Page 22:

Final tree

[Slide from Ziv Bar-Joseph]

Page 23:

Continuous values

o Either use a threshold to turn the attribute into a binary one, or discretize it
o It's possible to compute the information gain for all possible thresholds (there is a finite number of training samples); see the sketch below
o Harder if we wish to assign more than two values (can be done recursively)
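
A sketch of the exhaustive threshold scan, reusing the `information_gain` helper from page 16 (using midpoints between consecutive sorted values as candidates is an illustrative choice):

```python
def best_threshold(values, labels):
    """Score every candidate threshold for a continuous attribute and
    return (gain, threshold) for the best binary split."""
    vs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(vs, vs[1:])]
    return max((information_gain([v <= t for v in values], labels), t)
               for t in candidates)
```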

[Slide from Ziv Bar-Joseph]

Page 24:

Additional points

o The algorithm we gave reaches homogeneous nodes (or runs out of attributes)

o This is dangerous: for datasets with many (irrelevant) attributes the algorithm will continue to split nodes

o This will lead to overfitting!

[Slide from Ziv Bar-Joseph]

Page 25:

Overfitting

o Training/empirical error: the fraction of the training samples the hypothesis misclassifies, err_train(h) = (1/n) Σ_i 1[h(x_i) ≠ y_i]

o Error over the whole data distribution: err_true(h) = P_(x,y)~D [h(x) ≠ y]

[Slide from Tom Mitchell]

Page 26:

Avoiding overfitting: Tree pruning

o One possible way in decision trees to avoid overfitting
o Other ways: do not grow the tree beyond some maximum depth; do not split if the splitting criterion (e.g. mutual information) is below some threshold; stop growing when the split is not statistically significant

o Split the data into training and test sets
o Build the tree using the training set
o For all internal nodes (starting at the root):
o   remove the subtree rooted at the node
o   assign the class to be the most common among the training set
o   check the test data error
o   if the error is lower, keep the change
o   otherwise restore the subtree; repeat for all nodes in the subtree (see the sketch below)
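
A minimal sketch of this pruning loop, operating on the dict-based tree from the page 7 sketch; `path` (the sequence of edge values from the root to the candidate node) and `predict` (a function walking the tree to a leaf) are illustrative assumptions:

```python
import copy

def test_error(tree, samples, labels, predict):
    """Fraction of held-out samples the tree misclassifies."""
    wrong = sum(predict(tree, s) != y for s, y in zip(samples, labels))
    return wrong / len(labels)

def try_prune(tree, path, majority_label, samples, labels, predict):
    """Replace the subtree at `path` with the most common training label;
    keep the change only if the held-out error does not increase."""
    pruned = copy.deepcopy(tree)
    node = pruned
    for edge in path[:-1]:
        node = node["children"][edge]
    node["children"][path[-1]] = majority_label
    if test_error(pruned, samples, labels, predict) <= \
       test_error(tree, samples, labels, predict):
        return pruned  # error no worse: keep the change
    return tree        # otherwise restore the original subtree
```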

[Slide from Ziv Bar-Joseph]

Page 27:

Random forest: bootstrap

o A collection of decision trees: an ensemble method
o For each tree we select a subset of the attributes (the square root of |A| is recommended) and a subset of the samples, and build the tree using just these attributes
o An input sample is classified using majority voting over the trees (see the usage sketch below)
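
For reference, scikit-learn's random forest exposes exactly these two knobs; a minimal usage sketch on a synthetic dataset (the dataset is an illustrative assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=16, random_state=0)

# max_features="sqrt": each split considers ~sqrt(|A|) attributes;
# bootstrap=True: each tree is trained on a bootstrap resample of the data.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))  # majority vote over the 100 trees
```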

[Slide from Ziv Bar-Joseph]

Page 28:

Revisit Document Classification: Vector Space Representation

o Each document is a vector, one component for each term (= word).

o Normalize to unit length.

o High-dimensional vector space:
o   Terms are axes: 10,000+ dimensions, or even 100,000+
o   Docs are vectors in this space (see the sketch below)
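
A tiny sketch of this representation on toy documents (the documents are illustrative, not from the slides):

```python
import numpy as np

docs = ["the cat sat", "the dog sat on the mat", "cat and dog"]
vocab = sorted({w for d in docs for w in d.split()})

# One component per term: raw counts, then normalize each doc to unit length.
V = np.array([[d.split().count(t) for t in vocab] for d in docs], dtype=float)
V /= np.linalg.norm(V, axis=1, keepdims=True)
print(np.linalg.norm(V, axis=1))  # each document vector now has length 1
```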

[Slide from Eric Xing]

Page 29:

Classes in a Vector Space

[Slide from Eric Xing]

Page 30:

Test Document = ?

[Slide from Eric Xing]

Page 31:

K-Nearest Neighbor (kNN) classifier

[Slide from Eric Xing]

Page 32:

kNN is an instance of Instance-Based Learning

o What makes an Instance-Based Learner?
o   A distance metric
o   How many nearby neighbors to look at?
o   A weighting function (optional; see the sketch below)
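
These three ingredients translate directly into a short sketch (given numpy arrays; uniform weights by default; all names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5, weight=lambda d: 1.0):
    """kNN with the three instance-based ingredients: a distance metric,
    k nearby neighbors, and an optional weighting function."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance metric
    votes = Counter()
    for i in np.argsort(dists)[:k]:              # k nearest neighbors
        votes[y_train[i]] += weight(dists[i])    # optionally distance-weighted
    return votes.most_common(1)[0][0]            # (weighted) majority vote
```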

[Slide from Eric Xing]

Page 33:

Euclidean Distance Metric

[Slide from Eric Xing]

Page 34:

1-Nearest Neighbor (kNN) classifier

[Slide from Eric Xing]

Page 35:

5-Nearest Neighbor (kNN) classifier

[Slide from Eric Xing]

Page 36:

Nearest-Neighbor Learning Algorithm

o Learning is just storing the representations of the training examples in D.

o Testing instance x:
o   Compute the similarity between x and all examples in D.
o   Assign x the category of the most similar example in D.

o Does not explicitly compute a generalization or category prototypes.

o Also called:
o   Case-based learning
o   Memory-based learning
o   Lazy learning

[Slide from Eric Xing]

Page 37:

kNN Is Close to Optimal

o Cover and Hart, 1967
o Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate [the error rate of a classifier that knows the model that generated the data]
o In particular, the asymptotic error rate is 0 if the Bayes rate is 0.
o Decision boundary:

[Slide from Eric Xing]

Page 38:

Overfitting

[Slide from Eric Xing]

Page 39:

Another example

[Figure from Christopher Bishop]

Page 40:

Bias-variance decomposition

o Now let's look more closely into two sources of error in a function approximator: bias and variance.

o In the following we show the bias-variance decomposition using linear regression (LR) as an example.

[Slide from Eric Xing]

Page 41:

Loss functions for regression

[Slide from Eric Xing]

Page 42:

Expected loss

[Slide from Eric Xing]

Page 43:

Bias-variance decomposition
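
The equations on this and the neighboring slides were images and are not preserved in the transcript. For squared loss at a fixed input x, the standard decomposition (a reconstruction in the notation h(x) = E[t|x] used on page 45, not the slide's own rendering) is:

```latex
\mathbb{E}\big[(y(x; D) - t)^2\big]
= \underbrace{\big(\mathbb{E}_D[y(x; D)] - h(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\big[(y(x; D) - \mathbb{E}_D[y(x; D)])^2\big]}_{\text{variance}}
+ \underbrace{\mathbb{E}_t\big[(t - h(x))^2 \mid x\big]}_{\text{noise}}
```

Here y(x; D) is the regressor learned from training set D, and the outer expectation on the left is over both D and the target t.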

[Slide from Eric Xing]

Page 44:

Bias-variance tradeoff: Regularized Regression

[Slide from Eric Xing]

Page 45:

Bias² + variance vs. regularizer

o Bias² + variance predicts the (shape of the) test error quite well.
o However, bias and variance cannot be computed, since they rely on knowing the true distribution of x and t (and hence h(x) = E[t|x]).

[Slide from Eric Xing]

Page 46:

Model Selection

o Suppose we are trying to select among several different models for a learning problem.

o Examples:
o   Polynomial regression
o     Model selection: we wish to automatically and objectively decide if the degree M should be, say, 0, 1, …, or 10.
o   Etc…

o The Problem:

[Slide from Eric Xing]

Page 47:

Cross Validation

[Slide from Eric Xing]

Page 48:

Cross Validation

[Slide from Matt Gormley]

Page 49:

Example

o When α=1/N, the algorithm is known as Leave-One-Out Cross-Validation (LOOCV); see the sketch below
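
A minimal scikit-learn sketch of both regimes (the iris dataset and the kNN model are illustrative choices, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

scores_10fold = cross_val_score(model, X, y, cv=10)            # K=10, alpha=0.1
scores_loocv = cross_val_score(model, X, y, cv=LeaveOneOut())  # alpha=1/N
print(scores_10fold.mean(), scores_loocv.mean())
```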

[Slide from Eric Xing]

Page 50:

Practical issues for CV

o How to decide the values for K and α?
o   Commonly used: K=10 and α=0.1.
o   When data sets are small relative to the number of models being evaluated, we need to decrease α and increase K.

o Bias-variance trade-off
o   Small α usually leads to low bias. In principle, LOOCV provides an almost unbiased estimate of the generalization ability of a classifier, especially when the number of available training samples is severely limited.
o   Large α can reduce variance, but will lead to under-use of the data, causing high bias.

o One important point is that the test data Dtest is never used in CV, because doing so would result in overly (indeed dishonestly) optimistic accuracy rates during the testing phase.

[Slide from Eric Xing]

Page 51:

Regularization

o More in lecture 1…

[Slide from Eric Xing]

Page 52:

AIC and BIC

o Akaike information criterion (AIC): AIC = 2k − 2 ln(L̂), where k is the number of estimated parameters and L̂ is the maximized likelihood

o Bayesian information criterion (BIC): BIC = k ln(n) − 2 ln(L̂), where n is the number of data points (see the sketch below)
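
Once a model's maximized log-likelihood is known, both criteria are one-liners; a minimal sketch (the function names are illustrative):

```python
import math

def aic(log_lik, k):
    """AIC = 2k - 2 ln(L-hat): k estimated parameters."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """BIC = k ln(n) - 2 ln(L-hat): the penalty grows with sample size n."""
    return k * math.log(n) - 2 * log_lik
```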

[Slide from Wikipedia]

Page 53:

Take home message

o Decision tree is a discriminative classifier
o   Built recursively based on maximizing information gain
o   Prevent overfitting through pruning and bootstrap

o kNN is close to optimal (asymptotically)

o Tradeoff between bias and variance of models

o Methods to reduce overfitting:
o   Cross-validation
o   Regularization
o   AIC, BIC
o   Ensemble methods: bootstrap

Page 54:

References

o Eric Xing, Tom Mitchell. 10701 Introduction to Machine Learning: http://www.cs.cmu.edu/~epxing/Class/10701-06f/

o Eric Xing, Ziv Bar-Joseph. 10701 Introduction to Machine Learning: http://www.cs.cmu.edu/~epxing/Class/10701/

o Matt Gormley. 10601 Introduction to Machine Learning: http://www.cs.cmu.edu/~mgormley/courses/10601/index.html

o Christopher M. Bishop. Pattern Recognition and Machine Learning

o Wikipedia
