Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 10b, April 10, 2015
Labs: Cross Validation, RandomForest, Multi-Dimensional Scaling,
Dimension Reduction, Factor Analysis
Advertisements
• Mike Schroepfer, Facebook CTO, will be lecturing and doing Q&A for Data and Society (CSCI 4967/6963) on Friday, April 24 in Walker 5113 from 9 to 11.
• Who would attend (must confirm)?

If you did not complete svm
• Lab9b_svm{1,11}_2015.R

Cross-validation - coleman
> library(robustbase)  # provides the coleman data set
> data(coleman)
> head(coleman)
  salaryP fatherWc sstatus teacherSc motherLev     Y
1    3.83    28.87    7.20      26.6      6.19 37.01
2    2.89    20.10  -11.71      24.4      5.17 26.51
3    2.86    69.05   12.32      25.7      7.04 36.51
4    2.92    65.40   14.28      25.7      7.10 40.70
5    3.06    29.59    6.31      25.4      6.15 37.10
6    2.07    44.82    6.16      21.6      6.41 33.90

Lab11_11_2014.R
> library(cvTools)  # cvFolds, cvTool, rtmspe; lmrob comes from robustbase
> call <- call("lmrob", formula = Y ~ .)
> # set up folds for cross-validation
> folds <- cvFolds(nrow(coleman), K = 5, R = 10)
> # perform cross-validation
> cvTool(call, data = coleman, y = coleman$Y, cost = rtmspe,
+ folds = folds, costArgs = list(trim = 0.1))
CV
[1,] 0.9880672
[2,] 0.9525881
[3,] 0.8989264
[4,] 1.0177694
[5,] 0.9860661
[6,] 1.8369717
[7,] 0.9550428
[8,] 1.0698466
[9,] 1.3568537
[10,] 0.8313474
Warning messages:
1: In lmrob.S(x, y, control = control) :
  S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
2: In lmrob.S(x, y, control = control) :
  S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
3: In lmrob.S(x, y, control = control) :
  find_scale() did not converge in 'maxit.scale' (= 200) iterations
4: In lmrob.S(x, y, control = control) :
  find_scale() did not converge in 'maxit.scale' (= 200) iterations
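
What cvFolds() and cvTool() do can be sketched in base R. With K = 5 and R = 10 the data are split into five folds ten separate times, which is why the CV output above has ten rows. The sketch below makes two simplifying assumptions that are not in the lab: lm() on the built-in mtcars data stands in for lmrob() on coleman, and the cost is a plain root mean squared prediction error instead of the trimmed rtmspe.

```r
# Repeated K-fold cross-validation written out by hand (base R only).
# Assumptions: lm() on mtcars replaces lmrob() on coleman, and the cost
# is an untrimmed root mean squared prediction error (not rtmspe).
set.seed(1)                      # arbitrary seed, for reproducible folds
K <- 5                           # number of folds
R <- 10                          # number of replications
n <- nrow(mtcars)

cv_once <- function() {
  # randomly assign each row to one of K folds of near-equal size
  fold <- sample(rep(seq_len(K), length.out = n))
  pred <- numeric(n)
  for (k in seq_len(K)) {
    test <- fold == k
    fit <- lm(mpg ~ wt + hp, data = mtcars[!test, ])
    pred[test] <- predict(fit, newdata = mtcars[test, ])
  }
  sqrt(mean((mtcars$mpg - pred)^2))   # error over all held-out predictions
}

cv <- replicate(R, cv_once())    # one CV estimate per replication
cv
mean(cv)
```

The spread of the R estimates shows how much a single 5-fold split depends on the random fold assignment.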

Lab11b_12_2014.R
> cvFits
5-fold CV results:
Fit CV
1 LS 1.674485
2 MM 1.147130
3 LTS 1.291797
Best model:
CV
"MM"

50 and 75% subsets
# 50% subsets
fitLts50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5)
cvFitLts50 <- cvLts(fitLts50, cost = rtmspe, folds = folds,
fit = "both", trim = 0.1)
# 75% subsets
fitLts75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75)
cvFitLts75 <- cvLts(fitLts75, cost = rtmspe, folds = folds,
fit = "both", trim = 0.1)
# combine and plot results
cvFitsLts <- cvSelect("0.5" = cvFitLts50, "0.75" = cvFitLts75)

cvFitsLts (50/75)
> cvFitsLts
5-fold CV results:
Fit reweighted raw
1 0.5 1.291797 1.640922
2 0.75 1.065495 1.232691
Best model:
reweighted raw
"0.75" "0.75"

Tuning
# fitLmrob is assumed to be an lmrob fit created earlier in the lab script,
# e.g. fitLmrob <- lmrob(Y ~ ., data = coleman)
tuning <- list(tuning.psi = c(3.14, 3.44, 3.88, 4.68))
# perform cross-validation
cvFitsLmrob <- cvTuning(fitLmrob$call, data = coleman, y = coleman$Y,
                        tuning = tuning, cost = rtmspe, folds = folds,
                        costArgs = list(trim = 0.1))

cvFitsLmrob
5-fold CV results:
tuning.psi CV
1 3.14 1.179620
2 3.44 1.156674
3 3.88 1.169436
4 4.68 1.133975
Optimal tuning parameter:
tuning.psi
CV       4.68

Lab11b_18
library(boot)                     # cv.glm, glm.diag, nodal
data(mammals, package = "MASS")   # mammals data set
mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
(cv.err <- cv.glm(mammals, mammals.glm)$delta)
[1] 0.4918650 0.4916571
> (cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta)
[1] 0.4967271 0.4938003
# As this is a linear model we could calculate the leave-one-out
# cross-validation estimate without any extra model-fitting.
muhat <- fitted(mammals.glm)
mammals.diag <- glm.diag(mammals.glm)
(cv.err <- mean((mammals.glm$y - muhat)^2/(1 - mammals.diag$h)^2))
[1] 0.491865
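
The shortcut above works because, for a least-squares fit, the leave-one-out prediction error of observation i equals its ordinary residual divided by (1 - h_i), where h_i is the leverage. A minimal check in base R follows, assuming the built-in cars data in place of mammals so that no extra package is needed; hatvalues() plays the role of glm.diag()$h.

```r
# Check the leave-one-out shortcut for a linear model against brute force.
# Assumption: the built-in cars data replaces mammals.
fit <- lm(dist ~ speed, data = cars)

# Shortcut: squared residuals, each inflated by 1 / (1 - h_i)^2,
# where h_i is the leverage (diagonal of the hat matrix).
h <- hatvalues(fit)
loo_shortcut <- mean((residuals(fit) / (1 - h))^2)

# Brute force: refit n times, each time predicting the left-out row.
n <- nrow(cars)
err <- numeric(n)
for (i in seq_len(n)) {
  fit_i <- lm(dist ~ speed, data = cars[-i, ])
  err[i] <- cars$dist[i] - predict(fit_i, newdata = cars[i, ])
}
loo_brute <- mean(err^2)

all.equal(loo_shortcut, loo_brute)
```

The two numbers should agree up to floating-point error, while the shortcut needs only one model fit.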

Lab11b_18
# leave-one-out and 11-fold cross-validation prediction error for
# the nodal data set. Since the response is a binary variable
# an appropriate cost function is
> cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
> nodal.glm <- glm(r ~ stage+xray+acid, binomial, data = nodal)
> (cv.err <- cv.glm(nodal, nodal.glm, cost, K = nrow(nodal))$delta)
[1] 0.1886792 0.1886792
> (cv.11.err <- cv.glm(nodal, nodal.glm, cost, K = 11)$delta)
[1] 0.2264151 0.2228551
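
In the cv.glm() results, the first element of delta is the raw cross-validation estimate and the second is the bias-adjusted one. The cost function itself is just a misclassification rate at a 0.5 threshold, which a small made-up example makes concrete (the values of r and pi below are hypothetical, chosen only for illustration):

```r
# The slide's cost function: r is the observed 0/1 response, pi the
# predicted probability; an observation counts as an error when the
# prediction falls on the wrong side of 0.5.
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

# Hypothetical values, chosen only to illustrate the calculation.
r  <- c(1, 0, 1, 0)
pi <- c(0.9, 0.2, 0.4, 0.6)
cost(r, pi)   # 0.5: the last two observations are misclassified
```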

randomForest
> library(e1071)
> library(rpart)
> library(mlbench) # etc.
> data(kyphosis)
> require(randomForest) # or library(randomForest)
> fitKF <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis)
> print(fitKF) # view results
> importance(fitKF) # importance of each predictor
# what else can you do?
data(swiss) # fertility?
Lab10b_rf3_2015.R
data(Glass, package = "mlbench") # Type ~ <what>?
data(Titanic) # Survived ~ .
Find - Mileage ~ Price + Country + Reliability + Type
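
One possible way to start on the swiss exercise is a regression forest for Fertility on all other columns. This is only a sketch: it assumes the randomForest package is installed, and ntree and the seed are arbitrary choices, not values from the lab script.

```r
# Regression forest for the swiss data (one possible take on the exercise).
# Assumes the randomForest package is installed; ntree and the seed are
# arbitrary choices for this sketch.
if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)
  data(swiss)
  set.seed(42)
  fitSwiss <- randomForest(Fertility ~ ., data = swiss,
                           ntree = 500, importance = TRUE)
  print(fitSwiss)               # out-of-bag MSE and % variance explained
  print(importance(fitSwiss))   # %IncMSE and IncNodePurity per predictor
  varImpPlot(fitSwiss)          # the same importances as a dot chart
}
```

Because the response is numeric, randomForest() fits a regression forest here; with a factor response such as Kyphosis it fits a classification forest instead.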

MDS
• Lab8b_mds1_2015.R
• Lab8b_mds2_2015.R
• Lab8b_mds3_2015.R
• http://www.statmethods.net/advstats/mds.html
• http://gastonsanchez.com/blog/how-to/2013/01/23/MDS-in-R.html
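
R – many ways (of course)
# dist.au and city.names are defined on the slides that follow
library(igraph)
g <- graph.full(nrow(dist.au))   # make_full_graph() in newer igraph releases
V(g)$label <- city.names
layout <- layout.mds(g, dist = as.matrix(dist.au))   # layout_with_mds() in newer releases
plot(g, layout = layout, vertex.size = 3)
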
Distances between Australian cities
# dist.au <- read.csv("http://rosetta.reltech.org/TC/v15/Mapping/data/dist-Aus.csv")
# Lab8b_mds1_2015.R
row.names(dist.au) <- dist.au[, 1]
dist.au <- dist.au[, -1]
dist.au
##       A   AS    B    D    H    M    P    S
## A     0 1328 1600 2616 1161  653 2130 1161
## AS 1328    0 1962 1289 2463 1889 1991 2026
## B  1600 1962    0 2846 1788 1374 3604  732
## D  2616 1289 2846    0 3734 3146 2652 3146
## H  1161 2463 1788 3734    0  598 3008 1057
## M   653 1889 1374 3146  598    0 2720  713
## P  2130 1991 3604 2652 3008 2720    0 3288
## S  1161 2026  732 3146 1057  713 3288    0

Distances between Australian cities
fit <- cmdscale(dist.au, eig = TRUE, k = 2)
x <- fit$points[, 1]
y <- fit$points[, 2]
plot(x, y, pch = 19, xlim = range(x) + c(0, 600))
city.names <- c("Adelaide", "Alice Springs", "Brisbane", "Darwin", "Hobart",
"Melbourne", "Perth", "Sydney")
text(x, y, pos = 4, labels = city.names)
Try the other MDS functions...
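
Classical MDS, as computed by cmdscale(), amounts to double-centering the squared distances and taking the leading eigenvectors. The sketch below does this by hand and compares the result with cmdscale(). The distance matrix is typed in from the slide so that the example does not depend on downloading dist-Aus.csv; the helper names (J, B, pts_manual, pts_cmd) are illustrative only.

```r
# Classical (metric) MDS by hand, compared with cmdscale().
# The matrix is copied from the slide; row/column order is A, AS, B, D, H, M, P, S.
cities <- c("A", "AS", "B", "D", "H", "M", "P", "S")
d <- matrix(c(
     0, 1328, 1600, 2616, 1161,  653, 2130, 1161,
  1328,    0, 1962, 1289, 2463, 1889, 1991, 2026,
  1600, 1962,    0, 2846, 1788, 1374, 3604,  732,
  2616, 1289, 2846,    0, 3734, 3146, 2652, 3146,
  1161, 2463, 1788, 3734,    0,  598, 3008, 1057,
   653, 1889, 1374, 3146,  598,    0, 2720,  713,
  2130, 1991, 3604, 2652, 3008, 2720,    0, 3288,
  1161, 2026,  732, 3146, 1057,  713, 3288,    0),
  nrow = 8, byrow = TRUE, dimnames = list(cities, cities))

n <- nrow(d)
J <- diag(n) - matrix(1 / n, n, n)      # centering matrix
B <- -0.5 * J %*% d^2 %*% J             # double-centered squared distances
e <- eigen(B, symmetric = TRUE)
k <- 2
pts_manual <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))

pts_cmd <- cmdscale(d, k = k)

# Coordinates are only defined up to a sign flip of each axis,
# so compare absolute values.
all.equal(abs(pts_manual), abs(pts_cmd), check.attributes = FALSE)
```

The sign ambiguity is also why an MDS map of the cities can come out mirrored relative to a real map.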

In R
function (library)
• cmdscale() (stats)
• smacofSym() (smacof)
• wcmdscale() (vegan)
• pco() (ecodist)
• pco() (labdsv)
• pcoa() (ape)
• Only stats is loaded by default, and the rest are not installed by default

Do these dimension reductions
• Lab8b_dr1_2015.R
• Lab8b_dr2_2015.R
• Lab8b_dr3_2015.R
• Lab8b_dr4_2015.R

Factor Analysis
library(psych)  # irt.fa() and score.irt() come from the psych package
data(iqitems)
#
data(ability)
ability.irt <- irt.fa(ability)
ability.scores <- score.irt(ability.irt,ability)
data(attitude)
cor(attitude)
# Compute eigenvalues and eigenvectors of the correlation matrix.
pfa.eigen<-eigen(cor(attitude))
pfa.eigen$values
# set a value for the number of factors (for clarity)
factors<-2
# Extract and transform two components.
pfa.eigen$vectors[, 1:factors] %*%
  diag(sqrt(pfa.eigen$values[1:factors]), factors, factors)
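
The product above scales each retained eigenvector by the square root of its eigenvalue, which gives principal-component style loadings. A short sketch of what those loadings mean follows; the names Rmat, loadings2, communality and loadings_all are illustrative and not from the lab scripts.

```r
# Loadings from the eigendecomposition of the correlation matrix,
# following the slide's pfa.eigen approach.
data(attitude)                       # built-in data set, 7 variables
Rmat <- cor(attitude)
pfa.eigen <- eigen(Rmat)
factors <- 2

# Loadings: eigenvectors scaled by the square roots of their eigenvalues.
loadings2 <- pfa.eigen$vectors[, 1:factors] %*%
  diag(sqrt(pfa.eigen$values[1:factors]), factors, factors)
rownames(loadings2) <- colnames(attitude)

# Communality: share of each variable's variance captured by the
# retained components (row sums of squared loadings).
communality <- rowSums(loadings2^2)
round(communality, 3)

# With all components retained, the loadings reproduce the correlation matrix.
p <- ncol(attitude)
loadings_all <- pfa.eigen$vectors %*% diag(sqrt(pfa.eigen$values), p, p)
all.equal(loadings_all %*% t(loadings_all), Rmat, check.attributes = FALSE)
```

Keeping only two columns gives a rank-2 approximation of the correlation matrix; the communalities show how well each variable is covered by it.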

Glass
data(Glass, package = "mlbench")
index <- 1:nrow(Glass)
testindex <- sample(index, trunc(length(index)/3))
testset <- Glass[testindex,]
trainset <- Glass[-testindex,]
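# cor() is lower-case in R and needs numeric columns, so drop the factor Type
cor(testset[, sapply(testset, is.numeric)])
Factor Analysis?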

Try these
• example_exploratoryFactorAnalysis.R on dataset_exploratoryFactorAnalysis.csv
  – http://rtutorialseries.blogspot.com/2011/10/r-tutorial-series-exploratory-factor.html
• http://www.statmethods.net/advstats/factor.html
• http://stats.stackexchange.com/questions/1576/what-are-the-differences-between-factor-analysis-and-principal-component-analysi
• Lab10b_fa{1,2,4,5}_2015.R