Support Vector Machines, Decision Trees, Cross-validation


Transcript of Support Vector Machines, Decision Trees, Cross-validation

Page 1: Support Vector Machines, Decision Trees, Cross-validation

Peter Fox

Data Analytics – ITWS-4963/ITWS-6965

Week 11a, April 7, 2014

Support Vector Machines, Decision Trees, Cross-validation

Page 2: Support Vector Machines, Decision Trees, Cross-validation

Reading?


Page 3: Support Vector Machines, Decision Trees, Cross-validation

Probabilities…

library(kernlab)

data(promotergene)

## create test and training set

ind <- sample(1:dim(promotergene)[1],20)

genetrain <- promotergene[-ind, ]

genetest <- promotergene[ind, ]

## train a support vector machine

gene <- ksvm(Class~., data=genetrain, kernel="rbfdot",
             kpar=list(sigma=0.015), C=70, cross=4, prob.model=TRUE)

## predict gene type probabilities on the test set

genetype <- predict(gene, genetest, type="probabilities")

Page 4: Support Vector Machines, Decision Trees, Cross-validation

Result

> genetype

+ -

[1,] 0.205576217 0.794423783

[2,] 0.150094660 0.849905340

[3,] 0.262062226 0.737937774

[4,] 0.939660586 0.060339414

[5,] 0.003164823 0.996835177

[6,] 0.502406898 0.497593102

[7,] 0.812503448 0.187496552

[8,] 0.996382257 0.003617743

[9,] 0.265187582 0.734812418

[10,] 0.998832291 0.001167709

[11,] 0.576491204 0.423508796

[12,] 0.973798660 0.026201340

[13,] 0.098598411 0.901401589

[14,] 0.900670101 0.099329899

[15,] 0.012571774 0.987428226

[16,] 0.977704079 0.022295921

[17,] 0.137304637 0.862695363

[18,] 0.972861575 0.027138425

[19,] 0.224470227 0.775529773

[20,] 0.004691973 0.995308027

Page 5: Support Vector Machines, Decision Trees, Cross-validation

Glass

library(e1071)

library(rpart)

data(Glass, package="mlbench")

index <- 1:nrow(Glass)

testindex <- sample(index, trunc(length(index)/3))

testset <- Glass[testindex,]

trainset <- Glass[-testindex,]

svm.model <- svm(Type ~ ., data = trainset, cost = 100, gamma = 1)

svm.pred <- predict(svm.model, testset[,-10])


Page 6: Support Vector Machines, Decision Trees, Cross-validation

> table(pred = svm.pred, true = testset[,10])

true

pred 1 2 3 5 6 7

1 12 9 1 0 0 0

2 6 19 6 5 2 2

3 1 0 2 0 0 0

5 0 0 0 0 0 0

6 0 0 0 0 1 0

7 0 1 0 0 0 4

Page 7: Support Vector Machines, Decision Trees, Cross-validation

Now what?

# now what happens?

> rpart.model <- rpart(Type ~ ., data = trainset)

> rpart.pred <- predict(rpart.model, testset[,-10], type = "class")
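To answer that, build the same kind of confusion table used for the SVM on the previous page (a minimal sketch, reusing the objects above):

# compare the tree's predictions against the true classes,
# exactly as was done for svm.pred on page 6
table(pred = rpart.pred, true = testset[,10])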


Page 8: Support Vector Machines, Decision Trees, Cross-validation

General idea behind trees

• Although the basic philosophy of all decision-tree classifiers is identical, there are many possibilities for their construction.

• Among the key points in selecting an algorithm to build decision trees, some should be highlighted for their importance (a split-criterion sketch follows this list):

– The criterion for choosing the feature to be tested at each node

– How to compute the partition of the set of examples

– When to decide that a node is a leaf

– The criterion for selecting the class to assign to each leaf
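As an illustration of the first point, one common split criterion is Gini impurity; the gini() helper below is our own sketch, and the Ba < 0.335 threshold is borrowed from the root split rpart finds on page 18:

# Gini impurity of a vector of class labels: 1 - sum of squared class proportions
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}
gini(trainset$Type)                        # impurity at the parent node
gini(trainset$Type[trainset$Ba <  0.335])  # impurity in the left child
gini(trainset$Type[trainset$Ba >= 0.335])  # impurity in the right child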

Page 9: Support Vector Machines, Decision Trees, Cross-validation

• Some important advantages of decision trees can be pointed out, including:

– They can be applied to any type of data

– The final structure of the classifier is quite simple and can be stored and handled in a graceful manner

– They handle conditional information very efficiently, subdividing the space into sub-spaces that are handled individually

– They are normally robust and insensitive to misclassification in the training set

– The resulting trees are usually quite understandable and can easily be used to obtain a better understanding of the phenomenon in question. This is perhaps the most important of all the advantages listed


Page 10: Support Vector Machines, Decision Trees, Cross-validation

Stopping – leaves on the tree

• A number of stopping conditions can be used to stop the recursive process. The algorithm stops when any one of the conditions is true (see the rpart.control sketch after this list):

– All the samples belong to the same class, i.e. have the same label, since the sample is already "pure"

– Most of the points are already of the same class. This is a generalization of the first condition, with some error threshold

– There are no remaining attributes on which the samples may be further partitioned

– There are no samples left for a branch of the test attribute
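In rpart these stopping rules are exposed as control parameters; a minimal sketch on the Glass training set (the values shown are rpart's defaults, spelled out for illustration):

# minsplit:  do not attempt a split on fewer than 20 samples
# minbucket: every leaf must contain at least 7 samples
# cp:        stop when a split no longer improves the overall fit by at least 1%
ctrl <- rpart.control(minsplit = 20, minbucket = 7, cp = 0.01)
rpart(Type ~ ., data = trainset, control = ctrl)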

Page 11: Support Vector Machines, Decision Trees, Cross-validation

Recursive partitioning

• Recursive partitioning is a fundamental tool in data mining. It helps us explore the structure of a set of data, while developing easy-to-visualize decision rules for predicting a categorical (classification tree) or continuous (regression tree) outcome.

• The rpart programs build classification or regression models of a very general structure using a two-stage procedure (a sketch of both model types follows); the resulting models can be represented as binary trees.
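Both flavours use the same rpart call and differ only in the method argument; a minimal sketch using data that appears later in this deck:

# classification tree: categorical outcome
rpart(Type ~ ., data = trainset, method = "class")
# regression tree: continuous outcome (cu.summary is used on page 37)
rpart(Mileage ~ Price + Type, data = cu.summary, method = "anova")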


Page 12: Support Vector Machines, Decision Trees, Cross-validation

Recursive partitioning

• The tree is built by the following two-stage process (a sketch of both stages follows):

– First, the single variable is found which best splits the data into two groups ('best' will be defined later). The data is separated, and this process is then applied separately to each sub-group, and so on recursively, until the subgroups either reach a minimum size or no improvement can be made.

– The second stage of the procedure consists of using cross-validation to trim back the full tree.
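A minimal sketch of the two stages on the Glass training set (the cp values are our illustrative choices):

# stage 1: grow a deliberately over-large tree by relaxing cp
full.tree <- rpart(Type ~ ., data = trainset, cp = 0.001)
# stage 2: trim it back using the cross-validated error from the cp table
best.cp <- full.tree$cptable[which.min(full.tree$cptable[, "xerror"]), "CP"]
pruned.tree <- prune(full.tree, cp = best.cp)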


Page 13: Support Vector Machines, Decision Trees, Cross-validation

Why are we careful doing this?

• Because we will USE these trees, i.e. apply them to make decisions about what things are and what to do with them!


Page 14: Support Vector Machines, Decision Trees, Cross-validation

> printcp(rpart.model)

Classification tree:

rpart(formula = Type ~ ., data = trainset)

Variables actually used in tree construction:

[1] Al Ba Mg RI

Root node error: 92/143 = 0.64336

n= 143

CP nsplit rel error xerror xstd

1 0.206522 0 1.00000 1.00000 0.062262

2 0.195652 1 0.79348 0.92391 0.063822

3 0.050725 2 0.59783 0.63043 0.063822

4 0.043478 5 0.44565 0.64130 0.063990

5 0.032609 6 0.40217 0.57609 0.062777

6 0.010000 7 0.36957 0.51087 0.061056

Page 15: Support Vector Machines, Decision Trees, Cross-validation

plotcp(rpart.model)


Page 16: Support Vector Machines, Decision Trees, Cross-validation

> rsq.rpart(rpart.model)

Classification tree:

rpart(formula = Type ~ ., data = trainset)

Variables actually used in tree construction:

[1] Al Ba Mg RI

Root node error: 92/143 = 0.64336

n= 143

CP nsplit rel error xerror xstd

1 0.206522 0 1.00000 1.00000 0.062262

2 0.195652 1 0.79348 0.92391 0.063822

3 0.050725 2 0.59783 0.63043 0.063822

4 0.043478 5 0.44565 0.64130 0.063990

5 0.032609 6 0.40217 0.57609 0.062777

6 0.010000 7 0.36957 0.51087 0.061056

Warning message:

In rsq.rpart(rpart.model) : may not be applicable for this method

Page 17: Support Vector Machines, Decision Trees, Cross-validation

rsq.rpart


Page 18: Support Vector Machines, Decision Trees, Cross-validation

> print(rpart.model)

n= 143

node), split, n, loss, yval, (yprob)

* denotes terminal node

1) root 143 92 2 (0.3 0.36 0.091 0.056 0.049 0.15)

2) Ba< 0.335 120 70 2 (0.35 0.42 0.11 0.058 0.058 0.0083)

4) Al< 1.42 71 33 1 (0.54 0.28 0.15 0.014 0.014 0)

8) RI>=1.517075 58 22 1 (0.62 0.28 0.086 0.017 0 0)

16) RI< 1.518015 21 1 1 (0.95 0 0.048 0 0 0) *

17) RI>=1.518015 37 21 1 (0.43 0.43 0.11 0.027 0 0)

34) RI>=1.51895 25 10 1 (0.6 0.2 0.16 0.04 0 0)

68) Mg>=3.415 18 4 1 (0.78 0 0.22 0 0 0) *

69) Mg< 3.415 7 2 2 (0.14 0.71 0 0.14 0 0) *

35) RI< 1.51895 12 1 2 (0.083 0.92 0 0 0 0) *

9) RI< 1.517075 13 7 3 (0.15 0.31 0.46 0 0.077 0) *

5) Al>=1.42 49 19 2 (0.082 0.61 0.041 0.12 0.12 0.02)

10) Mg>=2.62 33 6 2 (0.12 0.82 0.061 0 0 0) *

11) Mg< 2.62 16 10 5 (0 0.19 0 0.37 0.37 0.062) *

3) Ba>=0.335 23 3 7 (0.043 0.043 0 0.043 0 0.87) *


Page 19: Support Vector Machines, Decision Trees, Cross-validation

Tree plot


plot(x, uniform=FALSE, branch=1, compress=FALSE, nspace, margin=0, minbranch=.3, ...)

> plot(rpart.model, compress=TRUE)

> text(rpart.model, use.n=TRUE)

Page 20: Support Vector Machines, Decision Trees, Cross-validation

And if you are brave

summary(rpart.model)

… pages….


Page 21: Support Vector Machines, Decision Trees, Cross-validation

Wait, did anyone LOOK at the data?

> names(Glass)

[1] "RI" "Na" "Mg" "Al" "Si" "K" "Ca" "Ba" "Fe" "Type"

> head(Glass)

RI Na Mg Al Si K Ca Ba Fe Type

1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1

2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1

3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1

4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1

5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1

6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1


Page 22: Support Vector Machines, Decision Trees, Cross-validation

rpart.pred

> rpart.pred

91 163 138 135 172 20 1 148 169 206 126 157 107 39 150 203 151 110 73 104 85 93 144 160 145 89 204 7 92 51

1 1 2 1 5 2 1 1 5 7 2 1 7 1 1 7 2 2 2 1 2 2 2 2 1 2 7 1 2 1

186 14 190 56 184 82 125 12 168 175 159 36 117 114 154 62 139 5 18 98 27 183 42 66 155 180 71 83 123 11

7 1 7 2 2 2 1 1 5 5 2 1 1 1 1 7 2 1 1 1 1 5 1 1 1 5 2 1 2 2

195 101 136 45 130 6 72 87 173 121 3

7 2 1 1 5 2 1 2 5 1 2

Levels: 1 2 3 5 6 7


Page 23: Support Vector Machines, Decision Trees, Cross-validation

plot(rpart.pred)


Page 24: Support Vector Machines, Decision Trees, Cross-validation

Hierarchical clustering


> dswiss <- dist(as.matrix(swiss))

> hs <- hclust(dswiss)

> plot(hs)

Page 25: Support Vector Machines, Decision Trees, Cross-validation

ctree


require(party)

swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic, data = swiss)

plot(swiss_ctree)

Page 26: Support Vector Machines, Decision Trees, Cross-validation


Page 27: Support Vector Machines, Decision Trees, Cross-validation

hclust for iris


Page 28: Support Vector Machines, Decision Trees, Cross-validation

plot(iris_ctree)


Page 29: Support Vector Machines, Decision Trees, Cross-validation

Ctree

> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)

> print(iris_ctree)

Conditional inference tree with 4 terminal nodes

Response: Species

Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

Number of observations: 150

1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264

2)* weights = 50

1) Petal.Length > 1.9

3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894

4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865

5)* weights = 46

4) Petal.Length > 4.8

6)* weights = 8

3) Petal.Width > 1.7

7)* weights = 46

Page 30: Support Vector Machines, Decision Trees, Cross-validation

> plot(iris_ctree, type="simple")


Page 31: Support Vector Machines, Decision Trees, Cross-validation

New dataset to work with trees

library(rpart)  # the kyphosis data ships with rpart
fitK <- rpart(Kyphosis ~ Age + Number + Start, method="class", data=kyphosis)

printcp(fitK) # display the results

plotcp(fitK) # visualize cross-validation results

summary(fitK) # detailed summary of splits

# plot tree

plot(fitK, uniform=TRUE, main="Classification Tree for Kyphosis")

text(fitK, use.n=TRUE, all=TRUE, cex=.8)

# create attractive postscript plot of tree

post(fitK, file = "kyphosistree.ps", title = "Classification Tree for Kyphosis") # might need to convert to PDF (distill)


Page 32: Support Vector Machines, Decision Trees, Cross-validation


Page 33: Support Vector Machines, Decision Trees, Cross-validation


> pfitK <- prune(fitK, cp = fitK$cptable[which.min(fitK$cptable[,"xerror"]),"CP"])

> plot(pfitK, uniform=TRUE, main="Pruned Classification Tree for Kyphosis")

> text(pfitK, use.n=TRUE, all=TRUE, cex=.8)

> post(pfitK, file = "ptree.ps", title = "Pruned Classification Tree for Kyphosis")

Page 34: Support Vector Machines, Decision Trees, Cross-validation


> fitK <- ctree(Kyphosis ~ Age + Number + Start, data=kyphosis)

> plot(fitK, main="Conditional Inference Tree for Kyphosis")

Page 35: Support Vector Machines, Decision Trees, Cross-validation


> plot(fitK, main="Conditional Inference Tree for Kyphosis",type="simple")

Page 36: Support Vector Machines, Decision Trees, Cross-validation

randomForest

> require(randomForest)

> fitKF <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis)

> print(fitKF) # view results

Call:

randomForest(formula = Kyphosis ~ Age + Number + Start, data = kyphosis)

Type of random forest: classification

Number of trees: 500

No. of variables tried at each split: 1

OOB estimate of error rate: 20.99%

Confusion matrix:

absent present class.error

absent 59 5 0.0781250

present 12 5 0.7058824

> importance(fitKF) # importance of each predictor

MeanDecreaseGini

Age 8.654112

Number 5.584019

Start 10.168591

Random forests improve predictive accuracy by generating a large number of bootstrapped trees (each grown on a random sample of the observations, with a random subset of variables considered at each split), classifying a case using each tree in this new "forest", and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification).
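That vote can be made visible; a minimal sketch using predict.all (the three kyphosis rows are an arbitrary choice):

# per-tree votes for the first three kyphosis cases; predict.all=TRUE
# returns both the aggregate vote and each individual tree's prediction
votes <- predict(fitKF, kyphosis[1:3, ], predict.all = TRUE)
votes$aggregate               # the forest's majority-vote answers
table(votes$individual[1, ])  # how the 500 trees split on case 1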

Page 37: Support Vector Machines, Decision Trees, Cross-validation

More on another dataset.

# Regression Tree Example

library(rpart)

# build the tree

fitM <- rpart(Mileage~Price + Country + Reliability + Type, method="anova", data=cu.summary)

printcp(fitM) # display the results….

Root node error: 1354.6/60 = 22.576

n=60 (57 observations deleted due to missingness)

CP nsplit rel error xerror xstd

1 0.622885 0 1.00000 1.03165 0.176920

2 0.132061 1 0.37711 0.51693 0.102454

3 0.025441 2 0.24505 0.36063 0.079819

4 0.011604 3 0.21961 0.34878 0.080273

5 0.010000 4 0.20801 0.36392 0.075650

Page 38: Support Vector Machines, Decision Trees, Cross-validation

Mileage…

plotcp(fitM) # visualize cross-validation results

summary(fitM) # detailed summary of splits

Page 39: Support Vector Machines, Decision Trees, Cross-validation


par(mfrow=c(1,2))

rsq.rpart(fitM) # visualize cross-validation results

Page 40: Support Vector Machines, Decision Trees, Cross-validation

# plot tree

plot(fitM, uniform=TRUE, main="Regression Tree for Mileage")

text(fitM, use.n=TRUE, all=TRUE, cex=.8)

# prune the tree

pfitM <- prune(fitM, cp=0.01160389) # from cptable

# plot the pruned tree

plot(pfitM, uniform=TRUE, main="Pruned Regression Tree for Mileage")

text(pfitM, use.n=TRUE, all=TRUE, cex=.8)


Page 41: Support Vector Machines, Decision Trees, Cross-validation


?????

Page 42: Support Vector Machines, Decision Trees, Cross-validation

# Conditional Inference Tree for Mileage

fit2M <- ctree(Mileage~Price + Country + Reliability + Type, data=na.omit(cu.summary))


Page 43: Support Vector Machines, Decision Trees, Cross-validation

Example

n <- 150 # number of data points

p <- 2 # dimension

sigma <- 1 # variance of the distribution

meanpos <- 0 # centre of the distribution of positive examples

meanneg <- 3 # centre of the distribution of negative examples

npos <- round(n/2) # number of positive examples

nneg <- n-npos # number of negative examples

# Generate the positive and negative examples

xpos <- matrix(rnorm(npos*p,mean=meanpos,sd=sigma),npos,p)

xneg <- matrix(rnorm(nneg*p,mean=meanneg,sd=sigma),nneg,p)

x <- rbind(xpos,xneg)

# Generate the labels

y <- matrix(c(rep(1,npos),rep(-1,nneg)))


Page 44: Support Vector Machines, Decision Trees, Cross-validation

Train/test

ntrain <- round(n*0.8) # number of training examples

tindex <- sample(n,ntrain) # indices of training samples

xtrain <- x[tindex,]

xtest <- x[-tindex,]

ytrain <- y[tindex]

ytest <- y[-tindex]

istrain <- rep(0,n) # indicator: 1 = training sample, 0 = test sample

istrain[tindex] <- 1


Page 45: Support Vector Machines, Decision Trees, Cross-validation

Example

svp <- ksvm(xtrain, ytrain, type="C-svc", kernel='vanilladot', C=100, scaled=c())

> alpha(svp)

[[1]]

[1] 100.00000 100.00000 100.00000 100.00000 58.00009 88.71759 100.00000 69.28250

> svp

Support Vector Machine object of class "ksvm"

SV type: C-svc (classification)

parameter : cost C = 100

Linear (vanilla) kernel function.

Number of Support Vectors : 8

Objective Function Value : -713.6363

Training error : 0.025


Page 46: Support Vector Machines, Decision Trees, Cross-validation

Example

> alphaindex(svp)

[[1]]

[1] 2 14 31 35 38 60 68 92

> b(svp)

[1] -4.20671

plot(svp,data=xtrain)


Page 47: Support Vector Machines, Decision Trees, Cross-validation

Cross-validation

• Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set.

• It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.

• I.e. predictive and prescriptive analytics…

Page 48: Support Vector Machines, Decision Trees, Cross-validation

Cross-validation

In a prediction problem, a model is usually given a dataset of known data on which training is run (the training dataset), and a dataset of unknown (or first-seen) data against which the model is tested (the testing dataset).

Sound familiar?


Page 49: Support Vector Machines, Decision Trees, Cross-validation

Cross-validation

The goal of cross-validation is to define a dataset to "test" the model in the training phase (i.e., the validation dataset), in order to limit problems like overfitting,

and to give insight into how the model will generalize to an independent data set (i.e., an unknown dataset, for instance from a real problem), etc.


Page 50: Support Vector Machines, Decision Trees, Cross-validation

Common types of x-validation (a minimal k-fold sketch below)

• K-fold

• 2-fold (do you know this one?)

• Repeated random sub-sampling

• Leave-out sub-sampling
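A k-fold sketch on the Glass data (k = 5 and classification accuracy are our illustrative choices):

# split the cases into k folds at random; in turn, hold each fold out,
# train on the rest, and score the held-out fold
k <- 5
folds <- sample(rep(1:k, length.out = nrow(Glass)))
acc <- numeric(k)
for (i in 1:k) {
  fit    <- rpart(Type ~ ., data = Glass[folds != i, ])
  pred   <- predict(fit, Glass[folds == i, ], type = "class")
  acc[i] <- mean(pred == Glass$Type[folds == i])
}
mean(acc)  # cross-validated estimate of accuracy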

• Lab on Friday… to try these out


Page 51: Support Vector Machines, Decision Trees, Cross-validation

Admin info (keep/print this slide)

• Class: ITWS-4963/ITWS-6965

• Hours: 12:00pm-1:50pm Tuesday/Friday

• Location: SAGE 3101

• Instructor: Peter Fox

• Instructor contact: [email protected], 518.276.4862 (do not leave a msg)

• Contact hours: Monday** 3:00-4:00pm (or by email appt)

• Contact location: Winslow 2120 (sometimes Lally 207A, announced by email)

• TA: Lakshmi Chenicheri [email protected]

• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014

– Schedule, lectures, syllabus, reading, assignments, etc.


Page 52: Support Vector Machines, Decision Trees, Cross-validation

Ozone

> library(e1071)

> library(rpart)

> data(Ozone, package="mlbench") # beware of "()"

> ## split data into a train and test set

> index <- 1:nrow(Ozone)

> testindex <- sample(index, trunc(length(index)/3))

> testset <- na.omit(Ozone[testindex,-3])

> trainset <- na.omit(Ozone[-testindex,-3])

> svm.model <- svm(V4 ~ ., data = trainset, cost = 1000, gamma = 0.0001)

> svm.pred <- predict(svm.model, testset[,-3])

> crossprod(svm.pred - testset[,3]) / length(testindex) # mean squared error

See: http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf