1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 10b, April 10, 2015 Labs: Cross Validation,...


Peter Fox

Data Analytics – ITWS-4963/ITWS-6965

Week 10b, April 10, 2015

Labs: Cross Validation, RandomForest, Multi-Dimensional Scaling,

Dimension Reduction, Factor Analysis

Advertisements

• Mike Schroepfer, Facebook CTO, will be lecturing and doing Q&A for Data and Society (CSCI 4967/6963) on Friday, April 24 in Walker 5113 from 9 to 11.

• Who would attend (must confirm)?

If you did not complete svm

• Lab9b_svm{1,11}_2015.R

Cross-validation - coleman

> head(coleman)
  salaryP fatherWc sstatus teacherSc motherLev     Y
1    3.83    28.87    7.20      26.6      6.19 37.01
2    2.89    20.10  -11.71      24.4      5.17 26.51
3    2.86    69.05   12.32      25.7      7.04 36.51
4    2.92    65.40   14.28      25.7      7.10 40.70
5    3.06    29.59    6.31      25.4      6.15 37.10
6    2.07    44.82    6.16      21.6      6.41 33.90

Lab11_11_2014.R

> call <- call("lmrob", formula = Y ~ .)
> # set up folds for cross-validation
> folds <- cvFolds(nrow(coleman), K = 5, R = 10)
> # perform cross-validation
> cvTool(call, data = coleman, y = coleman$Y, cost = rtmspe,
+ folds = folds, costArgs = list(trim = 0.1))
             CV
 [1,] 0.9880672
 [2,] 0.9525881
 [3,] 0.8989264
 [4,] 1.0177694
 [5,] 0.9860661
 [6,] 1.8369717
 [7,] 0.9550428
 [8,] 1.0698466
 [9,] 1.3568537
[10,] 0.8313474

Warning messages:
1: In lmrob.S(x, y, control = control) :
  S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
2: In lmrob.S(x, y, control = control) :
  S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
3: In lmrob.S(x, y, control = control) :
  find_scale() did not converge in 'maxit.scale' (= 200) iterations
4: In lmrob.S(x, y, control = control) :
  find_scale() did not converge in 'maxit.scale' (= 200) iterations
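What cvFolds() and cvTool() automate can be sketched by hand in base R. The block below is only an illustration under stated assumptions, not the lab's code: ordinary lm() is swapped in for lmrob() so that no extra packages are needed, the built-in mtcars data stands in for coleman, and rtmspe_sketch() is a hypothetical helper that mimics the idea of a root trimmed mean squared prediction error by dropping the largest squared errors before averaging.

```r
# Hand-rolled repeated K-fold cross-validation (illustrative sketch).
# Assumptions: lm() replaces lmrob(), mtcars replaces coleman, and
# rtmspe_sketch() is a hypothetical stand-in for cvTools' rtmspe().
set.seed(1)

rtmspe_sketch <- function(y, yhat, trim = 0.1) {
  sq <- sort((y - yhat)^2)                        # squared errors, ascending
  keep <- length(sq) - floor(length(sq) * trim)   # drop the largest `trim` share
  sqrt(mean(sq[seq_len(keep)]))
}

cv_once <- function(data, K = 5) {
  n <- nrow(data)
  fold <- sample(rep(seq_len(K), length.out = n))  # random fold labels
  pred <- numeric(n)
  for (k in seq_len(K)) {
    test <- fold == k
    fit <- lm(mpg ~ wt + hp, data = data[!test, ])       # fit on K - 1 folds
    pred[test] <- predict(fit, newdata = data[test, ])   # predict held-out fold
  }
  rtmspe_sketch(data$mpg, pred)   # one cost value per replication
}

cv_results <- replicate(10, cv_once(mtcars, K = 5))  # 10 replications, like R = 10
print(cv_results)
```

Each replication reshuffles the fold assignment, which is why the slide's output shows ten different CV values for a single model.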

Lab11b_12_2014.R

> cvFits
5-fold CV results:
  Fit       CV
1  LS 1.674485
2  MM 1.147130
3 LTS 1.291797

Best model:
  CV
"MM"

50 and 75% subsets

fitLts50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5)
cvFitLts50 <- cvLts(fitLts50, cost = rtmspe, folds = folds,
    fit = "both", trim = 0.1)

# 75% subsets
fitLts75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75)
cvFitLts75 <- cvLts(fitLts75, cost = rtmspe, folds = folds,
    fit = "both", trim = 0.1)

# combine and plot results
cvFitsLts <- cvSelect("0.5" = cvFitLts50, "0.75" = cvFitLts75)

cvFitsLts (50/75)

> cvFitsLts
5-fold CV results:
   Fit reweighted      raw
1  0.5   1.291797 1.640922
2 0.75   1.065495 1.232691

Best model:
reweighted        raw
    "0.75"     "0.75"

Tuning

tuning <- list(tuning.psi = c(3.14, 3.44, 3.88, 4.68))

# perform cross-validation
cvFitsLmrob <- cvTuning(fitLmrob$call, data = coleman, y = coleman$Y,
    tuning = tuning, cost = rtmspe, folds = folds,
    costArgs = list(trim = 0.1))

cvFitsLmrob

5-fold CV results:
  tuning.psi       CV
1       3.14 1.179620
2       3.44 1.156674
3       3.88 1.169436
4       4.68 1.133975

Optimal tuning parameter:
   tuning.psi
CV       4.68

Lab11b_18

> # cv.glm() and glm.diag() are in the boot package; mammals is in MASS
> library(boot)
> data(mammals, package = "MASS")
> mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
> (cv.err <- cv.glm(mammals, mammals.glm)$delta)
[1] 0.4918650 0.4916571
> (cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta)
[1] 0.4967271 0.4938003
> # As this is a linear model we could calculate the leave-one-out
> # cross-validation estimate without any extra model-fitting.
> muhat <- fitted(mammals.glm)
> mammals.diag <- glm.diag(mammals.glm)
> (cv.err <- mean((mammals.glm$y - muhat)^2/(1 - mammals.diag$h)^2))
[1] 0.491865
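The claim that a linear model needs no extra fitting for leave-one-out cross-validation can be checked directly. The sketch below uses base R only and, as an assumption for self-containment, the built-in cars data in place of mammals: it compares the hat-value shortcut (each residual divided by 1 - h_i) against a brute-force loop that refits the model once per observation.

```r
# Leave-one-out CV for a linear model: hat-value shortcut vs. explicit refits.
# Assumption: the built-in `cars` data stands in for the mammals data.
fit <- lm(dist ~ speed, data = cars)

# Shortcut: scale each residual by 1 / (1 - h_i); no refitting required
h <- hatvalues(fit)
loo_shortcut <- mean((residuals(fit) / (1 - h))^2)

# Brute force: refit n times, each time predicting the single held-out row
loo_errors <- sapply(seq_len(nrow(cars)), function(i) {
  fit_i <- lm(dist ~ speed, data = cars[-i, ])
  cars$dist[i] - predict(fit_i, newdata = cars[i, ])
})
loo_brute <- mean(loo_errors^2)

print(c(shortcut = loo_shortcut, brute_force = loo_brute))
```

The two numbers agree up to floating-point error, which mirrors the slide's observation that the shortcut reproduces the first component of cv.glm()'s delta.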

Lab11b_18

> # leave-one-out and 11-fold cross-validation prediction error for
> # the nodal data set. Since the response is a binary variable
> # an appropriate cost function is
> cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
> nodal.glm <- glm(r ~ stage+xray+acid, binomial, data = nodal)
> (cv.err <- cv.glm(nodal, nodal.glm, cost, K = nrow(nodal))$delta)
[1] 0.1886792 0.1886792
> (cv.11.err <- cv.glm(nodal, nodal.glm, cost, K = 11)$delta)
[1] 0.2264151 0.2228551
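To see why this cost function suits a binary response, it helps to run it on a handful of values. The numbers below are made up purely for illustration; the point is that abs(r - pi) > 0.5 is true exactly when the predicted probability falls on the wrong side of 0.5, so the mean is a misclassification rate.

```r
# The slide's cost function applied to a few hypothetical values.
# r: observed 0/1 outcomes; pi: predicted probabilities (made-up numbers).
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

observed  <- c(1, 0, 1, 0)
predicted <- c(0.9, 0.2, 0.4, 0.6)

# |r - pi| = 0.1, 0.2, 0.6, 0.6: the last two cross the 0.5 threshold,
# so they count as misclassified
error_rate <- cost(observed, predicted)

# Same quantity written as an explicit classification error
error_rate_alt <- mean((predicted > 0.5) != observed)

print(c(error_rate, error_rate_alt))  # 0.5 0.5
```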

randomForest

> library(e1071)
> library(rpart)
> library(mlbench) # etc.
> data(kyphosis)
> require(randomForest) # or library(randomForest)
> fitKF <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis)
> print(fitKF) # view results
> importance(fitKF) # importance of each predictor

# what else can you do?

data(swiss) # fertility?

Lab10b_rf3_2015.R

data(Glass, package = "mlbench") # Type ~ <what>?

data(Titanic) # Survived ~ .

Find - Mileage~Price + Country + Reliability + Type
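One possible way to take up the "fertility?" prompt is a regression forest on the swiss data. This is a sketch rather than the lab's solution: treating Fertility as the response with every other column as a predictor is an assumption, and ntree = 200 and the seed are arbitrary choices.

```r
# Sketch: random forest regression for the swiss data.
# Assumptions: Fertility is the response, all other columns are predictors,
# and ntree = 200 is an arbitrary illustrative choice.
library(randomForest)

set.seed(42)
data(swiss)
fitSwiss <- randomForest(Fertility ~ ., data = swiss,
                         ntree = 200, importance = TRUE)

print(fitSwiss)               # out-of-bag MSE and % variance explained
print(importance(fitSwiss))   # importance measures per predictor

# Without newdata, predict() returns out-of-bag predictions: each row is
# predicted only by trees that did not see it during training
oob_pred <- predict(fitSwiss)
oob_rmse <- sqrt(mean((swiss$Fertility - oob_pred)^2))
print(oob_rmse)
```

Because the out-of-bag error is computed on rows each tree never saw, it plays a role similar to the cross-validation estimates earlier in the lab without needing separate folds.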

MDS

• Lab8b_mds1_2015.R
• Lab8b_mds2_2015.R
• Lab8b_mds3_2015.R
• http://www.statmethods.net/advstats/mds.html
• http://gastonsanchez.com/blog/how-to/2013/01/23/MDS-in-R.html

R – many ways (of course)

library(igraph)
g <- graph.full(nrow(dist.au))
V(g)$label <- city.names
layout <- layout.mds(g, dist = as.matrix(dist.au))
plot(g, layout = layout, vertex.size = 3)

Distances between Australian cities

# dist.au <- read.csv("http://rosetta.reltech.org/TC/v15/Mapping/data/dist-Aus.csv")
# Lab8b_mds1_2015.R
row.names(dist.au) <- dist.au[, 1]
dist.au <- dist.au[, -1]
dist.au
##       A   AS    B    D    H    M    P    S
## A     0 1328 1600 2616 1161  653 2130 1161
## AS 1328    0 1962 1289 2463 1889 1991 2026
## B  1600 1962    0 2846 1788 1374 3604  732
## D  2616 1289 2846    0 3734 3146 2652 3146
## H  1161 2463 1788 3734    0  598 3008 1057
## M   653 1889 1374 3146  598    0 2720  713
## P  2130 1991 3604 2652 3008 2720    0 3288
## S  1161 2026  732 3146 1057  713 3288    0

Distances between Australian cities

fit <- cmdscale(dist.au, eig = TRUE, k = 2)
x <- fit$points[, 1]
y <- fit$points[, 2]
plot(x, y, pch = 19, xlim = range(x) + c(0, 600))
city.names <- c("Adelaide", "Alice Springs", "Brisbane", "Darwin", "Hobart",
    "Melbourne", "Perth", "Sydney")
text(x, y, pos = 4, labels = city.names)

Try the other MDS functions...
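Before trying other packages, it can help to see what classical MDS computes. The sketch below rebuilds the two-dimensional configuration by hand (double-centre the squared distances, take the leading eigenvectors, scale by the square roots of the eigenvalues) and compares it with cmdscale(). The distance matrix is typed in from the dist.au printout so the block runs without the CSV download; the variable names are illustrative.

```r
# Classical MDS by hand, checked against cmdscale().
# The distance matrix is copied from the dist.au printout on the slide.
labs <- c("A", "AS", "B", "D", "H", "M", "P", "S")
D <- matrix(c(
     0, 1328, 1600, 2616, 1161,  653, 2130, 1161,
  1328,    0, 1962, 1289, 2463, 1889, 1991, 2026,
  1600, 1962,    0, 2846, 1788, 1374, 3604,  732,
  2616, 1289, 2846,    0, 3734, 3146, 2652, 3146,
  1161, 2463, 1788, 3734,    0,  598, 3008, 1057,
   653, 1889, 1374, 3146,  598,    0, 2720,  713,
  2130, 1991, 3604, 2652, 3008, 2720,    0, 3288,
  1161, 2026,  732, 3146, 1057,  713, 3288,    0
), nrow = 8, byrow = TRUE, dimnames = list(labs, labs))

n <- nrow(D)
J <- diag(n) - matrix(1 / n, n, n)   # centering matrix
B <- -0.5 * J %*% D^2 %*% J          # double-centred squared distances
e <- eigen(B, symmetric = TRUE)

k <- 2
coords_manual <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))
coords_builtin <- cmdscale(D, k = k)

# Eigenvectors are only defined up to sign, so compare absolute values
max_diff <- max(abs(abs(coords_manual) - abs(coords_builtin)))
print(max_diff)
```

The sign ambiguity is also why an MDS map may come out mirrored relative to a real map: distances alone do not fix orientation.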

In R

function (library)
• cmdscale() (stats)
• smacofSym() (smacof)
• wcmdscale() (vegan)
• pco() (ecodist)
• pco() (labdsv)
• pcoa() (ape)

• Only stats is loaded by default, and the rest are not installed by default

Do these dimension reductions

• Lab8b_dr1_2015.R
• Lab8b_dr2_2015.R
• Lab8b_dr3_2015.R
• Lab8b_dr4_2015.R

Factor Analysis

# irt.fa() and score.irt() below are from the psych package
library(psych)
data(iqitems)
#
data(ability)
ability.irt <- irt.fa(ability)
ability.scores <- score.irt(ability.irt, ability)

data(attitude)
cor(attitude)
# Compute eigenvalues and eigenvectors of the correlation matrix.
pfa.eigen <- eigen(cor(attitude))
pfa.eigen$values
# set a value for the number of factors (for clarity)
factors <- 2
# Extract and transform two components.
pfa.eigen$vectors[, 1:factors] %*%
    diag(sqrt(pfa.eigen$values[1:factors]), factors, factors)
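The eigen-decomposition step can be sanity-checked in base R. The sketch below, using illustrative variable names, scales the eigenvectors of the attitude correlation matrix into loadings and verifies two properties: keeping all components reproduces the correlation matrix, and each column's sum of squared loadings equals its eigenvalue. It also reports communalities for the two-component solution.

```r
# Principal-component loadings for the attitude data, with two sanity checks.
R <- cor(attitude)
e <- eigen(R)

loadings_all <- e$vectors %*% diag(sqrt(e$values))   # all 7 components
rownames(loadings_all) <- colnames(attitude)

factors <- 2
loadings_2 <- loadings_all[, 1:factors]

# Check 1: keeping every component reproduces the correlation matrix
recon_error <- max(abs(loadings_all %*% t(loadings_all) - R))

# Check 2: each column's sum of squared loadings equals its eigenvalue
ss_loadings <- colSums(loadings_2^2)

# Communality: share of each variable's variance captured by two components
communality <- rowSums(loadings_2^2)

print(round(loadings_2, 3))
print(round(communality, 3))
```

Variables with low communality are poorly represented by the two retained components, which is one informal cue for whether two factors are enough.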

Glass

index <- 1:nrow(Glass)
testindex <- sample(index, trunc(length(index)/3))
testset <- Glass[testindex,]
trainset <- Glass[-testindex,]
# cor() needs numeric input, so leave out the factor column Type
cor(testset[, sapply(testset, is.numeric)])

Factor Analysis?

Try these

• example_exploratoryFactorAnalysis.R on dataset_exploratoryFactorAnalysis.csv
  – http://rtutorialseries.blogspot.com/2011/10/r-tutorial-series-exploratory-factor.html
• http://www.statmethods.net/advstats/factor.html
• http://stats.stackexchange.com/questions/1576/what-are-the-differences-between-factor-analysis-and-principal-component-analysi
• Lab10b_fa{1,2,4,5}_2015.R
