
Data Mining - Volinsky - 2011 - Columbia University

Classification and Supervised Learning

Credits: Hand, Mannila and Smyth; Cook and Swayne; Padhraic Smyth's notes; Shawndra Hill's notes


Classification

• Classification or supervised learning
  – prediction for a categorical response
    • T/F, color, etc.
    • often quantized real values, or non-scaled numerics
  – can be used with categorical predictors
  – can be used for missing data – as a response in itself!
  – methods for fitting can be
    • parametric (e.g. linear discriminant)
    • algorithmic (e.g. trees)
    • logistic regression (with a threshold on the response probability)


• Because labels are known, you can build parametric models for the classes

• can also define decision regions and decision boundaries


Types of classification models

• Probabilistic, based on p( x | ck )
  – Naïve Bayes
  – Linear discriminant analysis
• Regression-based, based on p( ck | x )
  – Logistic regression: linear predictor of the logit
  – Neural networks: non-linear extension of logistic regression
• Discriminative models, which focus on locating optimal decision boundaries
  – Decision trees: most popular
  – Support vector machines (SVM): currently trendy, computationally complex
  – Nearest neighbor: simple, elegant


Evaluating Classifiers

• Classifiers predict the class for new data
  – some models also give class probability estimates
• Simplest metric: accuracy = % classified correctly (in-sample or out-of-sample)
  – not always a great idea – e.g. fraud, where one class is rare
• Recall: ROC area – the area under the ROC curve (see the sketch below)
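Neither metric is computed on the slides; as a concrete illustration, here is a minimal scikit-learn sketch on a made-up, fraud-like (95/5) dataset. The data and model are placeholders used only to show the API.

```python
# Sketch only: accuracy vs. ROC area on an imbalanced binary problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Accuracy uses hard class labels; with 95% negatives, "always predict 0"
# would already score ~0.95, which is why accuracy can mislead on fraud-like data.
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# ROC area uses the predicted probabilities and is insensitive to the class prior.
print("ROC AUC :", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```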


Linear Discriminant Analysis

• LDA – parametric classification (see the sketch below)
  – assumes a multivariate normal distribution for each class, with equal covariance structure
  – decision boundaries are a linear combination of the variables
  – compares the difference between class means with the variance within each class
  – pros:
    • easy to define the likelihood
    • easy to define the boundary
    • easy to measure goodness of fit
    • easy to interpret
  – cons:
    • very rare for data to come close to multivariate normal!
    • works only on numeric predictors
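A minimal illustration (not from the slides) of fitting LDA with scikit-learn; the iris data merely stands in for the flea-beetle data on the next slide, and the new observation is made up.

```python
# Sketch only: LDA assumes Gaussian classes with a shared covariance matrix,
# which yields linear decision boundaries.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis()

# In-sample confusion is optimistic; cross-validation gives a fairer error estimate.
print("CV accuracy:", cross_val_score(lda, X, y, cv=5).mean())

# Fit on all data and classify a new observation (feature values are invented).
lda.fit(X, y)
print("predicted class:", lda.predict([[5.8, 2.8, 4.5, 1.3]]))
```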


LDA

• Flea Beetles data
  – clear classification rule for new data


          1    2    3   Error
   1     20    0    1   0.048
   2      0   22    0   0.00
   3      3    0   28   0.097
   Total                0.054

In-sample misclassification rate = 5.4%. Better to do X-val.

Courtesy Cook/Swayne


Classification (Decision) Trees

• Trees are one of the most popular and useful of all data mining models (see the fitting sketch below)
• Algorithmic version of classification
• Pros:
  – no distributional assumptions
  – can handle real and nominal inputs
  – speed and scalability
  – robustness to outliers and missing values
  – interpretability
  – compactness of classification rules
• Cons:
  – interpretability?
  – several tuning parameters to set with little guidance
  – decision boundary is non-continuous
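A minimal sketch (not from the slides) of fitting and reading a classification tree with scikit-learn; the iris data and the depth cap are arbitrary choices for illustration.

```python
# Sketch only: fit a small classification tree and print its rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # depth cap = one simple tuning knob
tree.fit(iris.data, iris.target)

# The fitted tree is a set of axis-parallel threshold rules, readable as text.
print(export_text(tree, feature_names=list(iris.feature_names)))
```

The printed rules are the "compact classification rules" the pros list refers to.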


Decision Tree Example

[Figure: scatterplot of customers plotted by Income and Debt]

Example: Do people pay their bills? Courtesy P. Smyth


Decision Tree Example

[Figure: the same Income–Debt scatterplot after a first split, Income > t1; the resulting regions are not yet labelled (??)]


Decision Tree Example

[Figure: the scatterplot after two splits, Income > t1 and Debt > t2; one region is still unlabelled (??)]


Decision Tree Example

[Figure: the scatterplot after three splits, Income > t1, Debt > t2, and Income > t3, giving the final decision regions]


Decision Tree Example

[Figure: the same final partition of the Income–Debt plane]

Note: tree boundaries are piecewise linear and axis-parallel


Example: Titanic Data

• On the Titanic
  – 1313 passengers
  – 34% survived
  – was it a random sample?
  – or did survival depend on features of the individual?
    • sex
    • age
    • class

     pclass  survived  name                                             age      embarked     sex
  1  1st     1         Allen, Miss Elisabeth Walton                     29.0000  Southampton  female
  2  1st     0         Allison, Miss Helen Loraine                       2.0000  Southampton  female
  3  1st     0         Allison, Mr Hudson Joshua Creighton              30.0000  Southampton  male
  4  1st     0         Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)  25.0000  Southampton  female
  5  1st     1         Allison, Master Hudson Trevor                     0.9167  Southampton  male
  6  2nd     1         Anderson, Mr Harry                               47.0000  Southampton  male


Decision trees

• At the first 'split', decide which variable best separates the survivor and non-survivor cases:

[Figure: classification tree for the Titanic data, where N is the node size and p the proportion surviving.
 Root (N: 1313, p: 0.34) splits on Sex:
   Male (N: 850, p: 0.16) splits on Age:
     Less than 12 (N: 29, p: 0.73)
     Greater than 12 (N: 821, p: 0.15), which splits on Class:
       2nd or 3rd (N: 646, p: 0.10)
       1st Class (N: 175, p: 0.31)
   Female (N: 463, p: 0.66) splits on Class:
     1st or 2nd Class (N: 250, p: 0.912)
     3rd Class (N: 213, p: 0.37)]

Goodness of split is determined by the ‘purity’ of the leaves


Decision Tree Induction

• Basic algorithm (a greedy algorithm)
  – the tree is constructed in a top-down, recursive, divide-and-conquer manner
  – at the start, all the training examples are at the root
  – examples are partitioned recursively to maximize purity
• Conditions for stopping partitioning
  – all samples in a node belong to the same class
  – leaf node smaller than a specified threshold
  – tradeoff between complexity and generalizability
• Predictions for new data:
  – classification by majority vote among the training examples in the leaf
  – probability based on the training data that ended up in that leaf
  – class probability estimates can also be used directly


Determining optimal splits via Purity

• Purity can be measured by the Gini index or by entropy
  – for a node n with m classes, where p_k is the proportion of class k in node n:

      Gini(n) = 1 − Σ_{k=1..m} p_k²          Entropy(n) = − Σ_{k=1..m} p_k log( p_k )

• The goodness of a split s (resulting in two nodes s1 and s2) is assessed by the weighted Gini from s1 and s2:

      Purity(s) = ( n_s1 / n ) Gini(s1) + ( n_s2 / n ) Gini(s2)

  where n_s1 and n_s2 are the number of cases sent to s1 and s2, and n = n_s1 + n_s2.


Example

• Two-class problem: 400 observations in each class – (400, 400)
• Calculate the Gini index for each candidate split:
  – Split A: (300, 100) and (100, 300)
  – Split B: (200, 400) and (200, 0)
• What about the misclassification rate? (A worked calculation follows below.)
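A worked check (not on the slide) of these numbers, using the Gini and weighted-purity formulas from the previous slide:

```python
# Sketch: compare splits A and B by weighted Gini and by misclassification rate.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def misclass(counts):
    return 1 - max(counts) / sum(counts)

def weighted(metric, left, right):
    n = sum(left) + sum(right)
    return sum(left) / n * metric(left) + sum(right) / n * metric(right)

split_a = [(300, 100), (100, 300)]
split_b = [(200, 400), (200, 0)]

print("parent Gini:", gini((400, 400)))                          # 0.5
print("A: Gini", weighted(gini, *split_a),                       # 0.375
      " misclassification", weighted(misclass, *split_a))        # 0.25
print("B: Gini", round(weighted(gini, *split_b), 3),             # 0.333 -- Gini prefers B
      " misclassification", weighted(misclass, *split_b))        # 0.25 -- same as A
```

Both splits have the same misclassification rate (0.25), but Gini prefers split B because it produces one perfectly pure node; this is the usual argument for growing trees with Gini or entropy rather than raw error.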


Finding the right size

• Use a hold-out sample (n-fold cross-validation)
• Overfit a tree – one with many leaves
• Snip the tree back and use the hold-out sample for prediction; calculate the predictive error
• Record the error rate for each tree size
• Repeat for n folds
• Plot the average error rate as a function of tree size
• Fit a tree of the optimal size to the entire data set (see the sketch below)
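A minimal sketch (not from the slides) of this procedure with scikit-learn, choosing the number of leaves by cross-validation; the iris data is only a placeholder.

```python
# Sketch: pick the tree size (number of leaves) by n-fold cross-validation,
# then refit a tree of that size on all of the data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
sizes = range(2, 15)  # candidate numbers of leaves

# Average 10-fold error rate for each candidate size.
errors = [1 - cross_val_score(DecisionTreeClassifier(max_leaf_nodes=k, random_state=0),
                              X, y, cv=10).mean()
          for k in sizes]

best = sizes[int(np.argmin(errors))]
print("errors by size:", np.round(errors, 3))
print("best number of leaves:", best)

# 'Final model' = tree of the chosen size fit on ALL of the data.
final_tree = DecisionTreeClassifier(max_leaf_nodes=best, random_state=0).fit(X, y)
```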


Finding the right size: Iris data



Multi-class example

• Be careful with examples that have more than 2 classes
  – the fitted tree might never predict some of the classes


Notes on X-Validation with Trees

To do n-fold x-validation:
• split the data into n folds
• use each fold to find the optimal number of nodes
• average the results of the folds to pick the overall optimum k
• the 'final model' is the tree of size k fit on ALL the data

However, if the best trees in each fold are very different (e.g. different terminal nodes), this is a cause for alarm.


Regression Trees

• Trees can also be used for regression, when the response is real-valued (see the sketch below)
  – the leaf prediction is the mean value instead of class probability estimates
  – can use variance as a purity measure
  – helpful with categorical predictors
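A minimal sketch (not from the slides) of a regression tree on the tips data shown on the next slide, assuming the copy of that dataset bundled with seaborn.

```python
# Sketch: regression tree predicting tip from the other variables.
# Leaf predictions are means; splits minimize within-node variance (squared error).
import pandas as pd
import seaborn as sns
from sklearn.tree import DecisionTreeRegressor, export_text

tips = sns.load_dataset("tips")                      # total_bill, tip, sex, smoker, day, time, size
X = pd.get_dummies(tips.drop(columns="tip"))         # one-hot encode the categorical predictors
y = tips["tip"]

reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(export_text(reg, feature_names=list(X.columns)))
```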


Tips data


Treating Missing Data in Trees

• Missing values are common in practice

• Approaches to handling missing values (a small example follows below):
  – some tree algorithms can handle missing data automatically, during both training and testing
    • send the example being classified down both branches and average the predictions
  – treat "missing" as a unique value (if the variable is categorical)
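A minimal illustration (not from the slides) of the "missing as its own category" approach using pandas; the column names are invented for the example.

```python
# Sketch: recode NaN in a categorical predictor as an explicit "missing" level
# before one-hot encoding, so a tree can split on missingness itself.
import numpy as np
import pandas as pd

df = pd.DataFrame({"embarked": ["S", "C", np.nan, "Q", np.nan],
                   "age": [29.0, 2.0, 30.0, np.nan, 25.0]})

df["embarked"] = df["embarked"].fillna("missing")        # categorical: new level
df["age_missing"] = df["age"].isna().astype(int)         # numeric: add an indicator
df["age"] = df["age"].fillna(df["age"].median())         # and fill with a neutral value

print(pd.get_dummies(df, columns=["embarked"]))
```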


Extensions of Classification Trees

• Can use non-binary (multi-way) splits
  – tend to increase complexity substantially, and don't improve performance
  – binary splits are interpretable, even by non-experts
  – easy to compute and visualize
• Can also consider linear-combination splits
  – can improve predictive performance, but hurts interpretability
  – harder to optimize
• Loss function
  – some errors may be more costly than others
  – can incorporate the costs into the Gini calculation

• Plain old trees usually work quite well


Why Trees are widely used in Practice

• Can handle high-dimensional data
  – builds a model using one dimension at a time
• Can handle any type of input variables
  – including categorical predictors
• Invariant to monotonic transformations of the input variables
  – e.g., using x, 10x + 2, log(x), 2^x, etc., will not change the tree
  – so scaling is not a factor – the user can be sloppy!
• Trees are (somewhat) interpretable
  – a domain expert can "read off" the tree's logic as rules

• Tree algorithms are relatively easy to code and test


Limitations of Trees

• Difficulty in modelling linear structure

• Lack of smoothness

• High variance
  – trees can be "unstable" as a function of the sample
    • e.g., a small change in the data -> a completely different tree
  – this causes two problems:
    • 1. high variance contributes to prediction error
    • 2. high variance reduces interpretability
  – trees are good candidates for model combining
    • often used with boosting and bagging


Decision Trees are not stable

[Figure from Duda, Hart & Stork, Chap. 8]

Moving just one example slightly may lead to quite different trees and space partitions!

Lack of stability against small perturbations of the data.


Random Forests

• Another con for trees:
  – trees are sensitive to the primary split, which can lead the tree in inappropriate directions
  – one way to see this: fit a tree on a random sample, or a bootstrapped sample, of the data
• Solution: random forests
  – an ensemble of unpruned decision trees
  – each tree is built on a random subset (or bootstrap sample) of the training data
  – at each split point, only a random subset of predictors is considered
  – the prediction is simply the majority vote of the trees (or the mean prediction of the trees)
• Has the advantages of trees, with more robustness and a smoother decision rule
• More on this later, but worth knowing about now (a minimal sketch follows below)
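A minimal sketch (not from the slides) using the scikit-learn implementation; the simulated data and settings are placeholders.

```python
# Sketch: a random forest is many trees, each grown on a bootstrap sample,
# each considering a random subset of predictors at every split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=300,     # number of trees
                                max_features="sqrt",  # predictors tried at each split
                                random_state=0)

print("single tree CV accuracy:", cross_val_score(tree, X, y, cv=5).mean())
print("random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```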


Other Models: k-NN

• k-Nearest Neighbors (kNN)
• To classify a new point:
  – look at the k nearest neighbor(s) from the training set
  – what is the class distribution of these neighbors?


K-nearest neighbor

• Advantages
  – simple to understand
  – simple to implement – nonparametric
• Disadvantages
  – what is k?
    • k = 1: high variance, sensitive to the data
    • k large: robust, reduces variance, but blends everything together – includes 'far away' points
  – what is "near"?
    • Euclidean distance assumes all inputs are equally important
    • how do you deal with categorical data?
  – no interpretable model
• Best to use cross-validation to pick k (see the sketch below)
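A minimal sketch (not from the slides) of picking k by cross-validation with scikit-learn; scaling is included because Euclidean distance is sensitive to the units of the inputs, and the iris data is only a placeholder.

```python
# Sketch: choose k for a kNN classifier by cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
search = GridSearchCV(knn,
                      {"kneighborsclassifier__n_neighbors": list(range(1, 26))},
                      cv=10)
search.fit(X, y)

print("best k:", search.best_params_["kneighborsclassifier__n_neighbors"])
print("CV accuracy at best k:", round(search.best_score_, 3))
```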


Probabilistic (Bayesian) Models for Classification

If you belong to class ck, you have a distribution over input vectors x: p( x | ck )

If given priors p( ck ), we can get the posterior distribution over the classes, p( ck | x )

At each point in x space, we have a vector of predicted class probabilities, allowing for decision boundaries

Bayes rule (as applied to classification):

    p( ck | x ) = p( x | ck ) p( ck ) / Σ_j p( x | cj ) p( cj )


Example of Probabilistic Classification

[Figure: class-conditional densities p( x | c1 ) and p( x | c2 ), with the posterior p( c1 | x ) plotted below on a 0 to 1 scale]


Decision Regions and Bayes Error Rate

[Figure: the same densities p( x | c1 ) and p( x | c2 ), with the x-axis partitioned into regions labelled class c1 and class c2]

Optimal decision regions = regions where one class is more likely

Optimal decision regions => optimal decision boundaries


Decision Regions and Bayes Error Rate

[Figure: the same densities, with the area where the optimal classifier errs – the overlap of p( x | c1 ) and p( x | c2 ) – shaded]

Under certain conditions we can estimate the BEST-case error, IF our model is correct.

Bayes error rate = fraction of examples misclassified by the optimal classifier (the shaded area above).

If max_k p( ck | x ) = 1 everywhere, then there is no error. Hence:

    p(error) = ∫ [ 1 − max_k p( ck | x ) ] p( x ) dx


Procedure for optimal Bayes classifier

• For each class learn a model p( x | ck )

– E.g., each class is multivariate Gaussian with its own mean and covariance

• Use Bayes rule to obtain p( ck | x )

  => this yields the optimal decision regions/boundaries
  => use these decision regions/boundaries for classification (a sketch follows below)

• Correct in theory… but practical problems include:

– How do we model p( x | ck ) ?

– Even if we know the model for p( x | ck ), modeling a distribution or density will be very difficult in high dimensions (e.g., p = 100)
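A minimal sketch (not from the slides) of this procedure with one Gaussian per class, using numpy and scipy; the two simulated classes and their parameters are invented for illustration.

```python
# Sketch: fit a multivariate Gaussian p(x | ck) per class, then apply Bayes rule.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0, 0], np.eye(2), size=200)       # class c1 training data
X1 = rng.multivariate_normal([2, 2], np.eye(2), size=100)       # class c2 training data

# Step 1: learn a model p(x | ck) for each class (here: sample mean and covariance).
models = [multivariate_normal(X.mean(axis=0), np.cov(X.T)) for X in (X0, X1)]
priors = np.array([len(X0), len(X1)], dtype=float)
priors /= priors.sum()                                          # p(ck) from class frequencies

# Step 2: Bayes rule gives p(ck | x) up to the shared normalizer p(x).
def posterior(x):
    joint = np.array([m.pdf(x) * p for m, p in zip(models, priors)])
    return joint / joint.sum()

print("p(ck | x) at x = (1, 1):", posterior([1.0, 1.0]).round(3))
```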


Naïve Bayes Classifiers

• To simplify things in high dimensions, make a conditional independence assumption on p( x | ck ), i.e.

      p( x | ck ) = Π_j p( xj | ck )

• Typically used with categorical variables
  – real-valued variables are discretized to create nominal versions
• Comments:
  – simple to train (estimate conditional probabilities for each feature-class pair)
  – often works surprisingly well in practice
    • e.g., state of the art for text classification, the basis of many widely used spam filters


Play-tennis example: estimating P(C=win|x)

Outlook   Temperature  Humidity  Windy  Win?
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  Y
rain      mild         high      false  Y
rain      cool         normal    false  Y
rain      cool         normal    true   N
overcast  cool         normal    true   Y
sunny     mild         high      false  N
sunny     cool         normal    false  Y
rain      mild         normal    false  Y
sunny     mild         normal    true   Y
overcast  mild         high      true   Y
overcast  hot          normal    false  Y
rain      mild         high      true   N

outlook:      P(sunny|y)    = 2/9   P(sunny|n)    = 3/5
              P(overcast|y) = 4/9   P(overcast|n) = 0
              P(rain|y)     = 3/9   P(rain|n)     = 2/5
temperature:  P(hot|y)      = 2/9   P(hot|n)      = 2/5
              P(mild|y)     = 4/9   P(mild|n)     = 2/5
              P(cool|y)     = 3/9   P(cool|n)     = 1/5
humidity:     P(high|y)     = 3/9   P(high|n)     = 4/5
              P(normal|y)   = 6/9   P(normal|n)   = 2/5
windy:        P(true|y)     = 3/9   P(true|n)     = 3/5
              P(false|y)    = 6/9   P(false|n)    = 2/5

P(y) = 9/14   P(n) = 5/14


Play-tennis example: classifying X

• An unseen sample X = <rain, hot, high, false>

• P(X|y)·P(y) = P(rain|y)·P(hot|y)·P(high|y)·P(false|y)·P(y) = 3/9·2/9·3/9·6/9·9/14 = 0.010582

• P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286

• Sample X is classified as class n (you'll lose!); the sketch below reproduces this calculation
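A small sketch (not from the slides) that recomputes these numbers from the table on the previous slide by simple counting; it is a bare-bones naïve Bayes without smoothing.

```python
# Sketch: naïve Bayes by counting, reproducing the play-tennis calculation.
data = [  # (outlook, temperature, humidity, windy, win?)
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "Y"), ("rain", "mild", "high", "false", "Y"),
    ("rain", "cool", "normal", "false", "Y"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "Y"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "Y"), ("rain", "mild", "normal", "false", "Y"),
    ("sunny", "mild", "normal", "true", "Y"), ("overcast", "mild", "high", "true", "Y"),
    ("overcast", "hot", "normal", "false", "Y"), ("rain", "mild", "high", "true", "N"),
]

def score(x, cls):
    rows = [r for r in data if r[-1] == cls]
    p = len(rows) / len(data)                      # prior P(class)
    for j, value in enumerate(x):                  # product of P(x_j | class)
        p *= sum(1 for r in rows if r[j] == value) / len(rows)
    return p

x = ("rain", "hot", "high", "false")
for cls in ("Y", "N"):
    print(cls, round(score(x, cls), 6))            # ~0.010582 for Y, ~0.018286 for N
```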


The independence hypothesis…

• … makes computation possible

• … yields optimal classifiers when satisfied

• … but is seldom satisfied in practice, as attributes (variables) are often correlated.

• Yet, empirically, naïve Bayes performs really well in practice.


Naïve Bayes

Estimate of the probability that a point x will belong to class ck:

    p( ck | x ) ∝ p( ck ) Π_{j=1..p} p( xj | ck )

The individual factors p( xj | ck ) act as "weights of evidence" for class ck.

If there are two classes, we look at the ratio of the two probabilities.