Interpreting weighted kNN, forms of clustering, decision trees and Bayesian inference
Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 7a, March 3, 2014, SAGE 3101
Interpreting weighted kNN, forms of clustering, decision trees and Bayesian inference
Contents
Weighted kNN
require(kknn)
data(iris)
m <- dim(iris)[1]
val <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
iris.learn <- iris[-val,] # train
iris.valid <- iris[val,] # test
# Possible kernel choices: "rectangular" (standard unweighted knn), "triangular",
# "epanechnikov" (or beta(2,2)), "biweight" (or beta(3,3)), "triweight" (or beta(4,4)),
# "cos", "inv", "gaussian", "rank" and "optimal".
iris.kknn <- kknn(Species~., iris.learn, iris.valid, distance = 1, kernel = "triangular")
names(iris.kknn)
• fitted.values – Vector of predictions.
• CL – Matrix of classes of the k nearest neighbors.
• W – Matrix of weights of the k nearest neighbors.
• D – Matrix of distances of the k nearest neighbors.
• C – Matrix of indices of the k nearest neighbors.
• prob – Matrix of predicted class probabilities.
• response – Type of response variable: one of continuous, nominal or ordinal.
• distance – Parameter of the Minkowski distance.
• call – The matched call.
• terms – The 'terms' object used.
Look at the output
> head(iris.kknn$W)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.4493696 0.2306555 0.1261857 0.1230131 0.07914805 0.07610159 0.014184110
[2,] 0.7567298 0.7385966 0.5663245 0.3593925 0.35652546 0.24159191 0.004312408
[3,] 0.5958406 0.2700476 0.2594478 0.2558161 0.09317996 0.09317996 0.042096849
[4,] 0.6022069 0.5193145 0.4229427 0.1607861 0.10804205 0.09637177 0.055297983
[5,] 0.7011985 0.6224216 0.5183945 0.2937705 0.16230921 0.13964231 0.053888244
[6,] 0.5898731 0.5270226 0.3273701 0.1791715 0.15297478 0.08446215 0.010180454
Look at the output
> head(iris.kknn$D)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.7259100 1.0142464 1.1519716 1.1561541 1.2139825 1.2179988 1.2996261
[2,] 0.2508639 0.2695631 0.4472127 0.6606040 0.6635606 0.7820818 1.0267680
[3,] 0.6498131 1.1736274 1.1906700 1.1965092 1.4579977 1.4579977 1.5401298
[4,] 0.2695631 0.3257349 0.3910409 0.5686904 0.6044323 0.6123406 0.6401741
[5,] 0.7338183 0.9272845 1.1827617 1.7344095 2.0572618 2.1129288 2.3235298
[6,] 0.5674645 0.6544263 0.9306719 1.1357241 1.1719707 1.2667669 1.3695454
Look at the output
> head(iris.kknn$C)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 86 38 43 73 92 85 60
[2,] 31 20 16 21 24 15 7
[3,] 48 80 44 36 50 63 98
[4,] 4 21 25 6 20 26 1
[5,] 68 79 70 65 87 84 75
[6,] 91 97 100 96 83 93 81
> head(iris.kknn$prob)
setosa versicolor virginica
[1,] 0 0.3377079 0.6622921
[2,] 1 0.0000000 0.0000000
[3,] 0 0.8060743 0.1939257
[4,] 1 0.0000000 0.0000000
[5,] 0 0.0000000 1.0000000
[6,] 0 0.0000000 1.0000000
Look at the output
> head(iris.kknn$fitted.values)
[1] virginica setosa versicolor setosa virginica virginica
Levels: setosa versicolor virginica
Contingency tables
fitiris <- fitted(iris.kknn)
table(iris.valid$Species, fitiris)
fitiris
setosa versicolor virginica
setosa 17 0 0
versicolor 0 18 2
virginica 0 1 12
# rectangular – no weight
fitiris2
setosa versicolor virginica
setosa 17 0 0
versicolor 0 18 2
virginica 0 2 11
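The slide does not show how fitiris2 was produced; a minimal sketch (an assumption: same train/test split, refit with the unweighted rectangular kernel):
iris.kknn2 <- kknn(Species~., iris.learn, iris.valid, distance = 1, kernel = "rectangular") # unweighted kNN (assumed)
fitiris2 <- fitted(iris.kknn2)
table(iris.valid$Species, fitiris2)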
The plot
pcol <- as.character(as.numeric(iris.valid$Species))
pairs(iris.valid[1:4], pch = pcol, col = c("green3", "red")[(iris.valid$Species != fitiris)+1])
New dataset - ionosphere
require(kknn)
data(ionosphere)
ionosphere.learn <- ionosphere[1:200,]
ionosphere.valid <- ionosphere[-c(1:200),]
fit.kknn <- kknn(class ~ ., ionosphere.learn, ionosphere.valid)
table(ionosphere.valid$class, fit.kknn$fit)
b g
b 19 8
g 2 122
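A quick sketch (not on the slide) of the misclassification rate implied by this table:
mean(fit.kknn$fit != ionosphere.valid$class) # (8 + 2) / 151, about 0.066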
Vary the parameters - ionosphere
> (fit.train1 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15,
kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1))
Call:
train.kknn(formula = class ~ ., data = ionosphere.learn, kmax = 15, distance = 1, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"))
Type of response variable: nominal
Minimal misclassification: 0.12
Best kernel: rectangular
Best k: 2
> table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class)
b g
b 25 4
g 2 120
# for comparison, the original fit.kknn (previous slide):
    b   g
  b 19   8
  g  2 122
Alter distance - ionosphere
> (fit.train2 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15,
kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2))
Type of response variable: nominal
Minimal misclassification: 0.12
Best kernel: rectangular
Best k: 2
> table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class)
b g
b 20 5
g 7 119
#1 (fit.train1, distance = 1):
    b   g
  b 25   4
  g  2 120
#0 (original fit.kknn):
    b   g
  b 19   8
  g  2 122
(Weighted) kNN
• Advantages
  – Robust to noisy training data (especially if we use an inverse function of distance, e.g. the inverse square, to weight the neighbors' votes)
  – Effective if the training data is large
• Disadvantages
  – Need to determine the value of the parameter k (the number of nearest neighbors)
  – With distance-based learning it is not clear which type of distance to use, or which attributes produce the best results. Shall we use all attributes or only certain attributes?
Additional factors
• Dimensionality – with too many dimensions the closest neighbors are too far away to be considered close
• Overfitting – does closeness mean right classification? (e.g. noise or incorrect data, like a wrong street address -> wrong lat/lon) – beware of k=1!
• Correlated features – double weighting
• Relative importance – including/excluding features
More factors
• Sparseness – the standard distance measure (e.g. Jaccard) loses meaning when there is no overlap
• Errors – unintentional and intentional
• Computational complexity
• Sensitivity to distance metrics – especially with features on different scales (recall ages, versus impressions, versus clicks, and especially binary values: gender, logged in/not) – see the scaling sketch after this list
• Does not account for changes over time
• Model updating as new data comes in
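A minimal sketch (not from the slides) of handling the different-scales problem: standardize each feature before computing distances. This reuses the val indices from the iris split earlier.
iris.s <- iris
iris.s[, 1:4] <- scale(iris.s[, 1:4]) # each feature to mean 0, sd 1, so no attribute dominates the distance
fit.s <- kknn(Species~., iris.s[-val,], iris.s[val,], distance = 1, kernel = "triangular")
table(iris.s[val,]$Species, fitted(fit.s))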
Lots of clustering options
• http://wiki.math.yorku.ca/index.php/R:_Cluster_analysis
• Clustergram – this graph is useful in exploratory analysis for non-hierarchical clustering algorithms like k-means, and for hierarchical cluster algorithms when the number of observations is large enough to make dendrograms impractical.
• (remember our attempt at a dendrogram for mapmeans?)
Cluster plotting
source("http://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt") # source_https() code from github
require(RCurl)
require(colorspace)
source_https("https://raw.github.com/talgalili/R-code-snippets/master/clustergram.r")
data(iris)
set.seed(250)
par(cex.lab = 1.5, cex.main = 1.2)
Data <- scale(iris[,-5]) # scaling
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> head(Data)
Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,] -0.8976739 1.01560199 -1.335752 -1.311052
[2,] -1.1392005 -0.13153881 -1.335752 -1.311052
[3,] -1.3807271 0.32731751 -1.392399 -1.311052
[4,] -1.5014904 0.09788935 -1.279104 -1.311052
[5,] -1.0184372 1.24503015 -1.335752 -1.311052
[6,] -0.5353840 1.93331463 -1.165809 -1.048667
• Look at the location of the cluster points on the Y axis. See when they remain stable, when they start flying around, and what happens to them at higher numbers of clusters (do they re-group together?)
• Observe the strands of the datapoints. Even if the cluster centers are not ordered, the lines for each item might (this needs more research and thinking) tend to move together – hinting at the real number of clusters
• Run the plot multiple times to observe the stability of the cluster formation (and location)
http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/
clustergram(Data, k.range = 2:8, line.width = 0.004) # line.width - adjust according to Y-scale
Any good?
set.seed(500)
Data2 <- scale(iris[,-5])
par(cex.lab = 1.2, cex.main = .7)
par(mfrow = c(3,2))
for(i in 1:6) clustergram(Data2, k.range = 2:8 , line.width = .004, add.center.points = T)
# why does this produce different plots?
# what defaults are used (kmeans)
# PCA?? Remember your linear algebra
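The plots differ between iterations because kmeans (which clustergram calls internally) starts from random centers, and the default is a single start. A sketch showing how nstart stabilizes a plain kmeans run:
km.1 <- kmeans(Data2, centers = 3) # default: one random start
km.25 <- kmeans(Data2, centers = 3, nstart = 25) # keep the best of 25 random starts
km.1$tot.withinss # total within-cluster sum of squares; can vary between runs
km.25$tot.withinss # stable across runs: the best of 25 starts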
How can you tell it is good?
set.seed(250)
Data <- rbind( cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),
cbind(rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3)),
cbind(rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3)))
clustergram(Data, k.range = 2:5 , line.width = .004, add.center.points = T)
More complex…
set.seed(250)
Data <- rbind( cbind(rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),
cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),
cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3)),
cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3)))
clustergram(Data, k.range = 2:8 , line.width = .004, add.center.points = T)
Exercise - swiss
par(mfrow = c(2,3))
swiss.x <- scale(as.matrix(swiss[, -1]))
set.seed(1)
for(i in 1:6) clustergram(swiss.x, k.range = 2:6, line.width = 0.01)
clusplot
Hierarchical clustering
> dswiss <- dist(as.matrix(swiss))
> hs <- hclust(dswiss)
> plot(hs)
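To turn the dendrogram into actual cluster assignments, cut the tree at a chosen number of clusters – a minimal sketch (not on the slide) using cutree:
> groups <- cutree(hs, k = 3) # cut the dendrogram into 3 clusters
> table(groups) # cluster sizes
> rect.hclust(hs, k = 3, border = "red") # outline the 3 clusters on the existing plot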
ctree
require(party)
swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic, data = swiss)
plot(swiss_ctree)
pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
splom extra!
require(lattice)
super.sym <- trellis.par.get("superpose.symbol")
splom(~iris[1:4], groups = Species, data = iris,
panel = panel.superpose,
key = list(title = "Three Varieties of Iris",
columns = 3,
points = list(pch = super.sym$pch[1:3],
col = super.sym$col[1:3]),
text = list(c("Setosa", "Versicolor", "Virginica"))))
splom(~iris[1:3]|Species, data = iris,
layout=c(2,2), pscales = 0,
varnames = c("Sepal\nLength", "Sepal\nWidth", "Petal\nLength"),
page = function(...) {
ltext(x = seq(.6, .8, length.out = 4),
y = seq(.9, .6, length.out = 4),
labels = c("Three", "Varieties", "of", "Iris"),
cex = 2)
})
parallelplot(~iris[1:4] | Species, iris)
parallelplot(~iris[1:4], iris, groups = Species, horizontal.axis = FALSE, scales = list(x = list(rot = 90)))
hclust for iris
plot(iris_ctree)
Ctree
> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)
> print(iris_ctree)
Conditional inference tree with 4 terminal nodes
Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 150
1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
2)* weights = 50
1) Petal.Length > 1.9
3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
5)* weights = 46
4) Petal.Length > 4.8
6)* weights = 8
3) Petal.Width > 1.7
7)* weights = 46
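To see how well this tree classifies, compare its fitted predictions with the actual species – a one-line sketch (not on the slide):
> table(predict(iris_ctree), iris$Species) # confusion matrix on the training data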
> plot(iris_ctree, type="simple”)
New dataset to work with trees
require(rpart) # rpart() and the kyphosis dataset come from the rpart package
fitK <- rpart(Kyphosis ~ Age + Number + Start, method="class", data=kyphosis)
printcp(fitK) # display the results
plotcp(fitK) # visualize cross-validation results
summary(fitK) # detailed summary of splits
# plot tree
plot(fitK, uniform=TRUE, main="Classification Tree for Kyphosis")
text(fitK, use.n=TRUE, all=TRUE, cex=.8)
# create attractive postscript plot of tree
post(fitK, file = "kyphosistree.ps", title = "Classification Tree for Kyphosis") # might need to convert to PDF (distill)
> pfitK <- prune(fitK, cp = fitK$cptable[which.min(fitK$cptable[,"xerror"]),"CP"])
> plot(pfitK, uniform=TRUE, main="Pruned Classification Tree for Kyphosis")
> text(pfitK, use.n=TRUE, all=TRUE, cex=.8)
> post(pfitK, file = "ptree.ps", title = "Pruned Classification Tree for Kyphosis")
> fitK <- ctree(Kyphosis ~ Age + Number + Start, data=kyphosis)
> plot(fitK, main="Conditional Inference Tree for Kyphosis")
> plot(fitK, main="Conditional Inference Tree for Kyphosis",type="simple")
randomForest
> require(randomForest)
> fitKF <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis)
> print(fitKF) # view results
Call:
randomForest(formula = Kyphosis ~ Age + Number + Start, data = kyphosis)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 20.99%
Confusion matrix:
absent present class.error
absent 59 5 0.0781250
present 12 5 0.7058824
> importance(fitKF) # importance of each predictor
MeanDecreaseGini
Age 8.654112
Number 5.584019
Start 10.168591
Random forests improve predictive accuracy by generating a large number of bootstrapped trees (each based on random samples of the variables), classifying a case with every tree in this new "forest", and deciding the final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification).
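That vote-combining step can be inspected directly: without newdata, predict() on a randomForest object returns out-of-bag results. A minimal sketch using the fitKF model above:
> head(predict(fitKF, type = "vote")) # fraction of the 500 trees voting absent/present for each case
> head(predict(fitKF, type = "response")) # the resulting majority-vote class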
More on another dataset.
# Regression Tree Example
library(rpart)
# build the tree
fitM <- rpart(Mileage~Price + Country + Reliability + Type, method="anova", data=cu.summary)
printcp(fitM) # display the results….
Root node error: 1354.6/60 = 22.576
n=60 (57 observations deleted due to missingness)
CP nsplit rel error xerror xstd
1 0.622885 0 1.00000 1.03165 0.176920
2 0.132061 1 0.37711 0.51693 0.102454
3 0.025441 2 0.24505 0.36063 0.079819
4 0.011604 3 0.21961 0.34878 0.080273
5 0.010000 4 0.20801 0.36392 0.075650
Mileage…
plotcp(fitM) # visualize cross-validation results
summary(fitM) # detailed summary of splits
<we will leave this for Friday to look at>
par(mfrow=c(1,2))
rsq.rpart(fitM) # visualize cross-validation results
# plot tree
plot(fitM, uniform=TRUE, main="Regression Tree for Mileage ")
text(fitM, use.n=TRUE, all=TRUE, cex=.8)
# prune the tree
pfitM<- prune(fitM, cp=0.01160389) # from cptable
# plot the pruned tree
plot(pfitM, uniform=TRUE, main="Pruned Regression Tree for Mileage")
text(pfitM, use.n=TRUE, all=TRUE, cex=.8)
post(pfitM, file = "ptree2.ps", title = "Pruned Regression Tree for Mileage")
# Conditional Inference Tree for Mileage
fit2M <- ctree(Mileage~Price + Country + Reliability + Type, data=na.omit(cu.summary))
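As with the kyphosis trees, the fitted tree can then be plotted – a sketch mirroring the earlier slides:
plot(fit2M, main="Conditional Inference Tree for Mileage")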
Enough of trees!
Bayes
> cl <- kmeans(iris[,1:4], 3)
> table(cl$cluster, iris[,5])
setosa versicolor virginica
2 0 2 36
1 0 48 14
3 50 0 0
# naiveBayes() comes from the e1071 package
> require(e1071)
> m <- naiveBayes(iris[,1:4], iris[,5])
> table(predict(m, iris[,1:4]), iris[,5])
setosa versicolor virginica
setosa 50 0 0
versicolor 0 47 3
virginica 0 3 47
pairs(iris[1:4],main="Iris Data (red=setosa,green=versicolor,blue=virginica)", pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])
Digging into iris
classifier <- naiveBayes(iris[,1:4], iris[,5])
table(predict(classifier, iris[,-5]), iris[,5], dnn=list('predicted','actual'))
actual
predicted setosa versicolor virginica
setosa 50 0 0
versicolor 0 47 3
virginica 0 3 47
Digging into iris
> classifier$apriori
iris[, 5]
setosa versicolor virginica
50 50 50
> classifier$tables$Petal.Length
Petal.Length
iris[, 5] [,1] [,2]
setosa 1.462 0.1736640
versicolor 4.260 0.4699110
virginica 5.552 0.5518947
Digging into iris
plot(function(x) dnorm(x, 1.462, 0.1736640), 0, 8, col="red", main="Petal length distribution for the 3 different species")
curve(dnorm(x, 4.260, 0.4699110), add=TRUE, col="blue")
curve(dnorm(x, 5.552, 0.5518947 ), add=TRUE, col = "green")
http://www.ugrad.stat.ubc.ca/R/library/mlbench/html/HouseVotes84.html
> require(mlbench)
> data(HouseVotes84)
> model <- naiveBayes(Class ~ ., data = HouseVotes84)
> predict(model, HouseVotes84[1:10,-1])
[1] republican republican republican democrat democrat democrat republican republican republican
[10] democrat
Levels: democrat republican
House Votes 1984
> predict(model, HouseVotes84[1:10,-1], type = "raw")
democrat republican
[1,] 1.029209e-07 9.999999e-01
[2,] 5.820415e-08 9.999999e-01
[3,] 5.684937e-03 9.943151e-01
[4,] 9.985798e-01 1.420152e-03
[5,] 9.666720e-01 3.332802e-02
[6,] 8.121430e-01 1.878570e-01
[7,] 1.751512e-04 9.998248e-01
[8,] 8.300100e-06 9.999917e-01
[9,] 8.277705e-08 9.999999e-01
[10,] 1.000000e+00 5.029425e-11
House Votes 1984
> pred <- predict(model, HouseVotes84[,-1])
> table(pred, HouseVotes84$Class)
pred democrat republican
democrat 238 13
republican 29 155
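Some vote/party combinations never occur in the training data, which gives zero probabilities that wipe out whole posterior products. A sketch (not on the slide) using naiveBayes's laplace argument to smooth the counts:
> model2 <- naiveBayes(Class ~ ., data = HouseVotes84, laplace = 3) # add-3 smoothing of the frequency tables
> pred2 <- predict(model2, HouseVotes84[,-1])
> table(pred2, HouseVotes84$Class)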
So now you could complete this:
> data(HairEyeColor)
> mosaicplot(HairEyeColor)
> margin.table(HairEyeColor,3)
Sex
Male Female
279 313
> margin.table(HairEyeColor,c(1,3))
Sex
Hair Male Female
Black 56 52
Brown 143 143
Red 34 37
Blond 46 81
Construct a naïve Bayes classifier and test it.
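One hedged way to start the exercise (assumptions: predict Sex from Hair and Eye, expanding the contingency table into one row per person):
> require(e1071)
> hec <- as.data.frame(HairEyeColor) # columns Hair, Eye, Sex, Freq
> cases <- hec[rep(seq_len(nrow(hec)), hec$Freq), 1:3] # one row per individual
> m2 <- naiveBayes(Sex ~ Hair + Eye, data = cases)
> table(predict(m2, cases[, c("Hair","Eye")]), cases$Sex)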
Assignments to come…
• Term project (A6). Due ~ week 13. 30% (25% written, 5% oral; individual).
• Assignment 7: Predictive and Prescriptive Analytics. Due ~ week 9/10. 20% (15% written, 5% oral; individual).
Coming weeks
• I will be out of town Friday March 21 and 28.
• On March 21 you will have a lab – attendance will be taken – to work on assignments (term project (6) and Assignment 7). Your project proposals (Assignment 5) are due on March 18.
• On March 28 you will have a lecture on SVM, so Tuesday March 25 will be a lab.
• Back to the regular schedule in April (except the 18th).
Admin info (keep/print this slide)
• Class: ITWS-4963/ITWS-6965
• Hours: 12:00pm-1:50pm Tuesday/Friday
• Location: SAGE 3101
• Instructor: Peter Fox
• Instructor contact: [email protected], 518.276.4862 (do not leave a msg)
• Contact hours: Monday** 3:00-4:00pm (or by email appt)
• Contact location: Winslow 2120 (sometimes Lally 207A – announced by email)
• TA: Lakshmi Chenicheri [email protected]
• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014
  – Schedule, lectures, syllabus, reading, assignments, etc.