
Decision tree, kNN and model selection

Yifeng Tao, School of Computer Science, Carnegie Mellon University

Slides adapted from Ziv Bar-Joseph, Tom Mitchell, Eric Xing

Yifeng Tao Carnegie Mellon University 1

Introduction to Machine Learning

Outline

oDecision tree and random forest
okNN
oModel selection and overfitting

Yifeng Tao Carnegie Mellon University 2

Decision trees

oOne of the most intuitive classifiers
oEasy to understand and construct
oSurprisingly, also works very well

Yifeng Tao Carnegie Mellon University 3

[Slide from Ziv Bar-Joseph]

The ‘best’ classifier

oThey are quite robust, intuitive and, surprisingly, very accurate
oXGBoost: an ensemble method of decision trees

Yifeng Tao Carnegie Mellon University 4

[Slide from https://www.quora.com/What-machine-learning-approaches-have-won-most-Kaggle-competitions]

Structure of a decision tree

oAn advertising example:
o Internal nodes correspond to attributes (features)
o Leaves correspond to classification outcomes
o Edges denote assignments

Yifeng Tao Carnegie Mellon University 5

[Slide from Ziv Bar-Joseph]

Dataset

Yifeng Tao Carnegie Mellon University 6

[Slide from Ziv Bar-Joseph]

Building a decision tree

Yifeng Tao Carnegie Mellon University 7

[Slide from Ziv Bar-Joseph]

Identifying ‘bestAttribute’

oThere are many possible ways to select the best attribute for a given set.

oWe will discuss one possible way, which is based on information theory and generalizes well to non-binary variables.

Yifeng Tao Carnegie Mellon University 8

[Slide from Ziv Bar-Joseph]

Entropy

oQuantifies the amount of uncertainty associated with a specific probability distribution

oThe higher the entropy, the less confident we are in the outcome
oDefinition: H(X) = -Σ_x P(X=x) log2 P(X=x)

Yifeng Tao Carnegie Mellon University 9

[Slide from Ziv Bar-Joseph]

Entropy

oDefinition: H(X) = -Σ_x P(X=x) log2 P(X=x)

oSo, if P(X=1) = 1, then H(X) = -1·log2(1) = 0

oIf P(X=1) = .5, then H(X) = -.5·log2(.5) - .5·log2(.5) = 1

Yifeng Tao Carnegie Mellon University 10

[Slide from Ziv Bar-Joseph]
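
The definition above is easy to turn into code. A minimal sketch (not from the slides), assuming the labels arrive as a plain Python or NumPy sequence; the function name entropy is my own:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

print(entropy([1, 1, 1, 1]))  # P(X=1) = 1: no uncertainty, entropy 0
print(entropy([0, 1, 0, 1]))  # P(X=1) = .5: maximal uncertainty, entropy 1
```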

Interpreting entropy

oEntropy can be interpreted from an information standpoint
oAssume both sender and receiver know the distribution. How many bits, on average, would it take to transmit one value?
oIf P(X=1) = 1 then the answer is 0 (we don't need to transmit anything)
oIf P(X=1) = .5 then the answer is 1 (either value is equally likely)
oIf 0 < P(X=1) < .5 or .5 < P(X=1) < 1 then the answer is between 0 and 1

Yifeng Tao Carnegie Mellon University 11

[Slide from Ziv Bar-Joseph]

Conditional entropy

oEntropy measures the uncertainty in a specific distribution

oWhat if we know something about the transmission?

oFor example, say I want to send the label (liked) when the length is known

oThis becomes a conditional entropy problem: H(Li | Le=v) is the entropy of Liked among movies with length v.

Yifeng Tao Carnegie Mellon University 12

[Slide from Ziv Bar-Joseph]

Conditional entropy: Examples for specific values

oLet's compute H(Li | Le=v):
o1. H(Li | Le = S) = .92
o2. H(Li | Le = M) = 0
o3. H(Li | Le = L) = .92

Yifeng Tao Carnegie Mellon University 13

[Slide from Ziv Bar-Joseph]

Conditional entropy

oWe can generalize the conditional entropy idea to determine H(Li | Le)

oDefinition: H(Y | X) = Σ_v P(X=v) H(Y | X=v)

Yifeng Tao Carnegie Mellon University 14

[Slide from Ziv Bar-Joseph]

Conditional entropy: Example

oLet's compute H(Li | Le) = Σ_v P(Le=v) H(Li | Le=v)

Yifeng Tao Carnegie Mellon University 15

[Slide from Ziv Bar-Joseph]

Information gain

oHow much do we gain (in terms of reduction in entropy) from knowing one of the attributes?

oIn other words, what is the reduction in entropy from this knowledge?
oDefinition: IG(Y|X) = H(Y) - H(Y|X)

o IG(Y|X) is always ≥ 0 (Proof: Jensen's inequality)

Yifeng Tao Carnegie Mellon University 16

[Slide from Ziv Bar-Joseph]
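
A minimal sketch of these two definitions in code (not from the slides; the names and the NumPy-array inputs are my own assumptions):

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(y, x):
    """H(Y | X) = sum over values v of P(X=v) * H(Y | X=v)."""
    return sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))

def information_gain(y, x):
    """IG(Y | X) = H(Y) - H(Y | X); always >= 0."""
    return entropy(y) - conditional_entropy(y, x)
```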

Where we are

oWe were looking for a good criterion for selecting the best attribute for a node split

oWe defined entropy, conditional entropy and information gain
oWe will now use information gain as our criterion for a good split
oThat is, bestAttribute will return the attribute that maximizes the information gain at each node

Yifeng Tao Carnegie Mellon University 17

[Slide from Ziv Bar-Joseph]

Building a decision tree

Yifeng Tao Carnegie Mellon University 18

[Slide from Ziv Bar-Joseph]
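
To make the recursion concrete, here is a minimal ID3-style sketch (my own code, not the slides' pseudocode), assuming X is a NumPy array of categorical attribute values, y the class labels, and attributes the list of column indices still available; the dict-based tree representation is my own choice:

```python
import numpy as np
from collections import Counter

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(y, x):
    return entropy(y) - sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))

def build_tree(X, y, attributes):
    """Recursively split on the attribute with maximum information gain.

    Returns either a class label (leaf) or a nested dict {best_attr: {value: subtree}}.
    """
    # Stop when the node is homogeneous or we run out of attributes.
    if len(np.unique(y)) == 1 or not attributes:
        return Counter(y).most_common(1)[0][0]          # majority class at this node

    # bestAttribute: the attribute that maximizes information gain here.
    best = max(attributes, key=lambda a: information_gain(y, X[:, a]))
    remaining = [a for a in attributes if a != best]

    children = {}
    for v in np.unique(X[:, best]):                     # one edge per attribute value
        mask = X[:, best] == v
        children[v] = build_tree(X[mask], y[mask], remaining)
    return {best: children}
```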

Example: Root attribute

oP(Li=yes) = 2/3
oH(Li) = .91

oH(Li | T) = 0.61
oH(Li | Le) = 0.61
oH(Li | D) = 0.36
oH(Li | F) = 0.85

oIG(Li | T) = .91 - .61 = 0.30
oIG(Li | Le) = .91 - .61 = 0.30
oIG(Li | D) = .91 - .36 = 0.55
oIG(Li | F) = .91 - .85 = 0.06

Yifeng Tao Carnegie Mellon University 19

[Slide from Ziv Bar-Joseph]

Building a tree

Yifeng Tao Carnegie Mellon University 20

[Slide from Ziv Bar-Joseph]

Building a tree

oWe only need to focus on the records (samples) associated with this node

oWe eliminated the ‘director’ attribute. All samples have the same director

oP(Li=yes) = ¼, H(Li) = .81
oH(Li | T) = 0, H(Li | Le) = 0, H(Li | F) = 0.5
oIG(Li | T) = 0.81, IG(Li | Le) = 0.81, IG(Li | F) = 0.31

Yifeng Tao Carnegie Mellon University 21

[Slide from Ziv Bar-Joseph]

Final tree

Yifeng Tao Carnegie Mellon University 22

[Slide from Ziv Bar-Joseph]

Continuous values

oEither use a threshold to turn it into a binary attribute, or discretize
oIt's possible to compute information gain for all possible thresholds (there are a finite number of training samples); see the sketch after this slide
oHarder if we wish to assign more than two values (can be done recursively)

Yifeng Tao Carnegie Mellon University 23

[Slide from Ziv Bar-Joseph]
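
A sketch of the threshold search mentioned above (my own code, assuming a binary "x ≤ t" split and NumPy inputs):

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_threshold(x, y):
    """Try every midpoint between consecutive sorted values of a continuous
    attribute x and return the threshold with the largest information gain."""
    values = np.unique(x)
    best_t, best_ig = None, -1.0
    for t in (values[:-1] + values[1:]) / 2.0:
        left = x <= t
        h_cond = left.mean() * entropy(y[left]) + (~left).mean() * entropy(y[~left])
        ig = entropy(y) - h_cond
        if ig > best_ig:
            best_t, best_ig = t, ig
    return best_t, best_ig
```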

Additional points

oThe algorithm we gave reaches homogeneous nodes (or runs out of attributes)

oThis is dangerous: for datasets with many (non-relevant) attributes the algorithm will continue to split nodes

oThis will lead to overfitting!

Yifeng Tao Carnegie Mellon University 24

[Slide from Ziv Bar-Joseph]

Overfitting

oTraining/empirical error: error_train(h) = (1/N) Σ_i 1[h(x_i) ≠ y_i], the fraction of training examples misclassified

oError over the whole data distribution: error_true(h) = P_(x,y)~D[h(x) ≠ y]

Yifeng Tao Carnegie Mellon University 25

[Slide from Tom Mitchell]

Avoiding overfitting: Tree pruning

oOne possible way in decision trees to avoid overfitting
oOther ways: do not grow the tree beyond some maximum depth; do not split if the splitting criterion (e.g. mutual information) is below some threshold; stop growing when the split is not statistically significant

oSplit data into train and test sets
oBuild the tree using the training set
oFor all internal nodes (starting at the root):

o remove the subtree rooted at the node
o assign the class to be the most common among the training set
o check the test data error
o if the error is lower, keep the change
o otherwise restore the subtree; repeat for all nodes in the subtree (a sketch in code follows this slide)

Yifeng Tao Carnegie Mellon University 26

[Slide from Ziv Bar-Joseph]
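
A compact sketch of this reduced-error pruning loop, assuming the dict-based tree produced by the build_tree sketch earlier (built on X_train/y_train, with X_val/y_val as the held-out test split); predict, error and prune are my own names:

```python
import numpy as np
from collections import Counter

def predict(tree, x):
    """Follow edges until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        a, children = next(iter(tree.items()))
        # Fall back to an arbitrary child for attribute values never seen in training.
        tree = children.get(x[a], next(iter(children.values())))
    return tree

def error(tree, X, y):
    return np.mean([predict(tree, xi) != yi for xi, yi in zip(X, y)])

def prune(tree, X_train, y_train, X_val, y_val):
    if not isinstance(tree, dict):                      # already a leaf
        return tree
    # Try replacing the subtree rooted here with the most common training class.
    leaf = Counter(y_train).most_common(1)[0][0]
    if error(leaf, X_val, y_val) < error(tree, X_val, y_val):
        return leaf                                     # held-out error is lower: keep the change
    # Otherwise restore the split and repeat for the nodes in the subtree.
    a, children = next(iter(tree.items()))
    for v in list(children):
        mask = X_train[:, a] == v
        children[v] = prune(children[v], X_train[mask], y_train[mask], X_val, y_val)
    return tree
```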

Random forest: bootstrap

oA collection of decision trees: an ensemble method
oFor each tree we select a subset of the attributes (recommended: square root of |A|) and a bootstrap sample of the training data, and build the tree using just these attributes
oAn input sample is classified using majority voting (see the sketch below)

Yifeng Tao Carnegie Mellon University 27

[Slide from Ziv Bar-Joseph]
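
A minimal sketch of this bagging recipe (not the slides' code). It leans on scikit-learn's DecisionTreeClassifier and assumes a numeric feature matrix; the function names are my own:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = max(1, int(np.sqrt(d)))                      # recommended sqrt(|A|) attributes per tree
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)            # bootstrap sample of the rows (with replacement)
        cols = rng.choice(d, size=k, replace=False)  # random subset of the attributes
        tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def random_forest_predict(forest, X):
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in forest])    # (n_trees, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])  # majority vote per sample
```

In practice one would usually reach for sklearn.ensemble.RandomForestClassifier, which re-samples the attribute subset at every split rather than once per tree.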

Revisit Document Classification: Vector Space Representation

oEach document is a vector, one component for each term (= word).

oNormalize to unit length.
oHigh-dimensional vector space:

o Terms are axes: 10,000+ dimensions, or even 100,000+
o Docs are vectors in this space

Yifeng Tao Carnegie Mellon University 28

[Slide from Eric Xing]
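
A small sketch of this representation using scikit-learn's CountVectorizer (an assumption; the slides do not name a tool), with a hypothetical toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

docs = ["the cat sat on the mat", "the dog chased the cat"]   # hypothetical toy documents

X = CountVectorizer().fit_transform(docs)   # one component (column) per term
X = normalize(X, norm="l2")                 # scale each document vector to unit length

print(X.toarray().round(2))
```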

Classes in a Vector Space

Yifeng Tao Carnegie Mellon University 29

[Slide from Eric Xing]

Test Document = ?

Yifeng Tao Carnegie Mellon University 30

[Slide from Eric Xing]

K-Nearest Neighbor (kNN) classifier

Yifeng Tao Carnegie Mellon University 31

[Slide from Eric Xing]

kNN is an instance of Instance-Based Learning

oWhat makes an Instance-Based Learner?
o A distance metric
o How many nearby neighbors to look at?
o A weighting function (optional)

Yifeng Tao Carnegie Mellon University 32

[Slide from Eric Xing]

Euclidean Distance Metric

Yifeng Tao Carnegie Mellon University 33

[Slide from Eric Xing]

1-Nearest Neighbor (kNN) classifier

Yifeng Tao Carnegie Mellon University 34

[Slide from Eric Xing]

5-Nearest Neighbor (kNN) classifier

Yifeng Tao Carnegie Mellon University 35

[Slide from Eric Xing]

Nearest-Neighbor Learning Algorithm

oLearning is just storing the representations of the training examples in D.

oTesting instance x:
o Compute the similarity between x and all examples in D.
o Assign x the category of the most similar example in D.

oDoes not explicitly compute a generalization or category prototypes.
oAlso called:

o Case-based learning
o Memory-based learning
o Lazy learning

Yifeng Tao Carnegie Mellon University 36

[Slide from Eric Xing]
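
A minimal sketch of this lazy learner in NumPy (my own class name; Euclidean distance and an unweighted majority vote are assumed):

```python
import numpy as np
from collections import Counter

class KNNClassifier:
    """'Learning' just stores the training examples; all work happens at test time."""
    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = np.asarray(X, dtype=float), np.asarray(y)
        return self

    def predict(self, X):
        preds = []
        for x in np.asarray(X, dtype=float):
            dists = np.linalg.norm(self.X - x, axis=1)   # Euclidean distance to every stored example
            nearest = np.argsort(dists)[: self.k]        # indices of the k closest neighbors
            preds.append(Counter(self.y[nearest]).most_common(1)[0][0])  # majority vote
        return np.array(preds)
```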

kNN Is Close to Optimal

oCover and Hart (1967)
oAsymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate [the error rate of a classifier that knows the model that generated the data]

oIn particular, the asymptotic error rate is 0 if the Bayes rate is 0.
oDecision boundary:

Yifeng Tao Carnegie Mellon University 37

[Slide from Eric Xing]

Overfitting

Yifeng Tao Carnegie Mellon University 38

[Slide from Eric Xing]

Another example

Yifeng Tao Carnegie Mellon University 39

[Figure from Christopher Bishop]

Bias-variance decomposition

oNow let's look more closely into two sources of error in a function approximator:

oIn the following we show the Bias-variance decomposition using LR as an example.

Yifeng Tao Carnegie Mellon University 40

[Slide from Eric Xing]

Loss functions for regression

Yifeng Tao Carnegie Mellon University 41

[Slide from Eric Xing]

Expected loss

Yifeng Tao Carnegie Mellon University 42

[Slide from Eric Xing]

Bias-variance decomposition

Yifeng Tao Carnegie Mellon University 43

[Slide from Eric Xing]

Bias-variance tradeoff: Regularized Regression

Yifeng Tao Carnegie Mellon University 44

[Slide from Eric Xing]

Bias² + variance vs. regularizer

oBias² + variance predicts the (shape of the) test error quite well.
oHowever, bias and variance cannot be computed in practice, since doing so relies on knowing the true distribution of x and t (and hence h(x) = E[t|x]).

Yifeng Tao Carnegie Mellon University 45

[Slide from Eric Xing]
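
When the true h(x) is known, the bias²/variance curve can be reproduced by simulation. An illustrative sketch (my own setup, in the spirit of Bishop's sin-curve example rather than the slide's exact experiment), fitting degree-9 ridge regression to many small datasets:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 1, 100)[:, None]
h = lambda x: np.sin(2 * np.pi * x).ravel()              # the (normally unknown) true h(x) = E[t|x]

def fit_once(lam):
    x = rng.uniform(0, 1, size=25)[:, None]
    t = h(x) + rng.normal(0, 0.3, size=25)               # noisy targets from one sampled dataset
    phi = PolynomialFeatures(degree=9)
    model = Ridge(alpha=lam).fit(phi.fit_transform(x), t)
    return model.predict(phi.transform(x_grid))

for lam in [1e-4, 1e-2, 1.0]:
    preds = np.stack([fit_once(lam) for _ in range(200)])     # many datasets, many fits
    bias2 = np.mean((preds.mean(axis=0) - h(x_grid)) ** 2)    # squared gap between the average fit and h
    var = np.mean(preds.var(axis=0))                          # spread of the fits around their average
    print(f"lambda={lam:g}  bias^2={bias2:.3f}  variance={var:.3f}")
```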

Model Selection

oSuppose we are trying to select among several different models for a learning problem.

oExamples:
o polynomial regression
o Model selection: we wish to automatically and objectively decide if M should be, say, 0, 1, …, or 10.
o Etc…

oThe Problem:

Yifeng Tao Carnegie Mellon University 46

[Slide from Eric Xing]

Cross Validation

Yifeng Tao Carnegie Mellon University 47

[Slide from Eric Xing]

Cross Validation

Yifeng Tao Carnegie Mellon University 48

[Slide from Matt Gormley]

Example

oWhen α=1/N, the algorithm is known as Leave-One-Out Cross-Validation (LOOCV)

Yifeng Tao Carnegie Mellon University 49

[Slide from Eric Xing]
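
A minimal K-fold sketch (my own code), assuming the usual scheme where each round holds out a fraction α = 1/K of the data; K = N reproduces LOOCV as noted above. model_factory is any callable returning a fresh model with fit/predict:

```python
import numpy as np

def k_fold_cv(model_factory, X, y, K=10, seed=0):
    """Average held-out error over K folds; K = len(y) gives LOOCV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errors = []
    for i in range(K):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        model = model_factory().fit(X[train], y[train])
        errors.append(np.mean(model.predict(X[val]) != y[val]))
    return float(np.mean(errors))

# e.g. pick k for the kNN sketch above by comparing CV errors:
# k_fold_cv(lambda: KNNClassifier(k=1), X, y) vs. k_fold_cv(lambda: KNNClassifier(k=5), X, y)
```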

Practical issues for CV

oHow to decide the values for K and α?
o Commonly used: K = 10 and α = 0.1.
o When data sets are small relative to the number of models being evaluated, we need to decrease α and increase K.

oBias-variance trade-off
o Small α usually leads to low bias. In principle, LOOCV provides an almost unbiased estimate of the generalization ability of a classifier, especially when the number of available training samples is severely limited.
o Large α can reduce variance, but will lead to under-use of the data, causing high bias.

oOne important point is that the test data Dtest is never used in CV, because doing so would result in overly (indeed dishonestly) optimistic accuracy rates during the testing phase.

Yifeng Tao Carnegie Mellon University 50

[Slide from Eric Xing]

Regularization

oMore in lecture 1…

Yifeng Tao Carnegie Mellon University 51

[Slide from Eric Xing]

AIC and BIC

oAkaike information criterion (AIC): AIC = 2k − 2 ln(L̂)

oBayesian information criterion (BIC): BIC = k ln(n) − 2 ln(L̂)

oHere k is the number of model parameters, L̂ the maximized likelihood, and n the number of data points; lower values are better.

Yifeng Tao Carnegie Mellon University 52

[Slide from Wikipedia]
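
Both criteria are straightforward to compute once the maximized log-likelihood is known. A sketch (my own helper names), including the Gaussian log-likelihood that applies to ordinary least-squares models:

```python
import numpy as np

def aic_bic(log_likelihood, k, n):
    """AIC = 2k - 2 ln L_hat,  BIC = k ln(n) - 2 ln L_hat  (lower is better)."""
    return 2 * k - 2 * log_likelihood, k * np.log(n) - 2 * log_likelihood

def gaussian_ols_loglik(residuals):
    """Maximized log-likelihood of a linear model with Gaussian noise,
    using the ML variance estimate sigma^2 = RSS / n."""
    n = len(residuals)
    sigma2 = np.mean(np.asarray(residuals) ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
```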

Take home message

oDecision tree is a discriminative classifier
o Built recursively based on maximizing information gain
o Prevent overfitting through pruning and bootstrap

okNN is close to optimal (asymptotically)
oTradeoff between bias and variance of models

oMethods to reduce overfitting:
o Cross-validation
o Regularization
o AIC, BIC
o Ensemble methods: bootstrap

Yifeng Tao Carnegie Mellon University 53

References

oEric Xing, Tom Mitchell. 10701 Introduction to Machine Learning: http://www.cs.cmu.edu/~epxing/Class/10701-06f/

oEric Xing, Ziv Bar-Joseph. 10701 Introduction to Machine Learning: http://www.cs.cmu.edu/~epxing/Class/10701/

oMatt Gormley. 10601 Introduction to Machine Learning: http://www.cs.cmu.edu/~mgormley/courses/10601/index.html

oChristopher M. Bishop. Pattern Recognition and Machine Learning.
oWikipedia

Yifeng Tao Carnegie Mellon University 54