1
Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 4b, February 20, 2015
Lab: regression, kNN and K-means results, interpreting
and evaluating models
Classification (2)
• Retrieve the abalone.csv dataset
• Predicting the age of abalone from physical measurements.
• The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope: a boring and time-consuming task.
• Other measurements, which are easier to obtain, are used to predict the age.
• Perform knn classification to get predictors for Age (Rings). Interpretation not required.
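A minimal sketch of the kNN step, assuming the `class` package. Because abalone.csv is not reproduced here, the data frame below is synthetic stand-in data with hypothetical column names (`length`, `diameter`, `rings`), and the three age bins are an illustrative choice rather than part of the lab.

```r
library(class)  # provides knn()

set.seed(42)
n <- 300
# Synthetic stand-in for abalone.csv: column names and values are hypothetical.
rings <- sample(3:20, n, replace = TRUE)
abalone <- data.frame(
  length   = 0.10 + 0.030 * rings + rnorm(n, sd = 0.05),
  diameter = 0.08 + 0.025 * rings + rnorm(n, sd = 0.05),
  rings    = rings
)

# kNN predicts a class, so bin the ring counts into age groups
# (cut points chosen for illustration only).
abalone$age <- cut(abalone$rings, breaks = c(0, 8, 11, Inf),
                   labels = c("young", "adult", "old"))

# Hold out 30% of the rows for testing.
idx   <- sample(seq_len(n), size = round(0.7 * n))
train <- abalone[idx,  c("length", "diameter")]
test  <- abalone[-idx, c("length", "diameter")]

# Classify each test row by majority vote among its 5 nearest training rows.
pred <- knn(train, test, cl = abalone$age[idx], k = 5)
table(actual = abalone$age[-idx], predicted = pred)
```

With the real file, the same pattern would apply after replacing the synthetic data frame with the result of read.csv and selecting the measured columns as predictors.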
2
What did you get?
• See pdf
3
Clustering (3)
• The Iris dataset (in R use data("iris") to load it)
• The 5th column is the species and you want to find how many clusters without using that information
• Create a new data frame and remove the fifth column
• Apply kmeans (you choose k) with 1000 iterations
• Use table(iris[,5],<your clustering>) to assess results
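The steps above can be sketched as follows. The choice k = 3, the seed, and `nstart = 10` are assumptions made for illustration; the exercise leaves k to you.

```r
set.seed(123)                  # kmeans starts from random centres

iris_features <- iris[, -5]    # new data frame without the Species column

# k = 3 is an assumed choice; iter.max = 1000 follows the exercise.
km <- kmeans(iris_features, centers = 3, iter.max = 1000, nstart = 10)

# Rows are the true species, columns are the cluster labels found by kmeans.
comparison <- table(iris[, 5], km$cluster)
comparison
```

Cluster numbers are arbitrary labels, so a good result is one where each row of the table is concentrated in a single column, not one where the counts sit on the diagonal.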
4
Return object
cluster A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers A matrix of cluster centres.
totss The total sum of squares.
withinss Vector of within-cluster sum of squares, one component per cluster.
tot.withinss Total within-cluster sum of squares, i.e., sum(withinss).
betweenss The between-cluster sum of squares, i.e. totss-tot.withinss.
size The number of points in each cluster.
5
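A short sketch of how these components can be read off a fitted object and how the sum-of-squares identities listed above can be checked. The iris data, k = 3, and the seed are assumptions for illustration.

```r
set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3, iter.max = 1000)

km$size       # number of points in each cluster
km$centers    # one row of feature means per cluster

# The identities from the component list:
check_within  <- all.equal(km$tot.withinss, sum(km$withinss))
check_between <- all.equal(km$betweenss, km$totss - km$tot.withinss)

# Share of total variation explained by the clustering (closer to 1 is tighter).
explained <- km$betweenss / km$totss
explained
```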
Contingency tables
• See pdf file
6
Contingency tables
> table(nyt1$Impressions,nyt1$Gender) #
0 1
1 69 85
2 389 395
3 975 937
4 1496 1572
5 1897 2012
6 1822 1927
7 1525 1696
8 1142 1203
9 722 711
10 366 400
11 214 200
12 86 101
13 41 43
14 10 9
15 5 7
16 0 4
17 0 1
7
Contingency table - displays the (multivariate) frequency distribution of the variables.
Tests for significance (not now)
> table(nyt1$Clicks,nyt1$Gender)
        0     1
  1 10335 10846
  2   415   440
  3     9    17
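A minimal sketch of building and summarising a two-way contingency table. It uses the built-in `mtcars` data as a stand-in for nyt1, which is an assumption made so the example runs without the lab files.

```r
# Cross-tabulate two discrete variables: cylinders by gears.
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)
tab

row_totals <- margin.table(tab, 1)   # counts per cylinder class
col_props  <- prop.table(tab, 2)     # proportions within each gear column
col_props
```

Column proportions are often easier to compare than raw counts when the groups differ in size, as the two Gender columns do in the tables above.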
Regression Exercises
• Using the EPI dataset find the single most important factor in increasing the EPI in a given region
• Examine distributions down to the leaf nodes and build up an EPI "model"
8
Linear and least-squares
> EPI_data<- read.csv("EPI_data.csv")
> attach(EPI_data)
> boxplot(ENVHEALTH,DALY,AIR_H,WATER_H)
> lmENVH<-lm(ENVHEALTH~DALY+AIR_H+WATER_H)
> lmENVH
… (what should you get?)
> summary(lmENVH)
…
> cENVH<-coef(lmENVH)
9
Linear and least-squares
> lmENVH<-lm(ENVHEALTH~DALY+AIR_H+WATER_H)
> lmENVH
Call:
lm(formula = ENVHEALTH ~ DALY + AIR_H + WATER_H)
Coefficients:
(Intercept)         DALY        AIR_H      WATER_H
 -2.673e-05    5.000e-01    2.500e-01    2.500e-01
> summary(lmENVH)
…
> cENVH<-coef(lmENVH)
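Coefficients of 0.5, 0.25 and 0.25 with an intercept near zero are consistent with a response that is, up to rounding, a fixed weighted sum of the predictors. A sketch with simulated data shows least squares recovering such weights; the variable names mirror the slide, but every value below is made up.

```r
set.seed(7)
n <- 200
DALY    <- runif(n, 0, 100)
AIR_H   <- runif(n, 0, 100)
WATER_H <- runif(n, 0, 100)

# Assumed construction: weighted sum plus small rounding-like noise.
ENVHEALTH <- 0.5 * DALY + 0.25 * AIR_H + 0.25 * WATER_H + rnorm(n, sd = 0.003)

fit  <- lm(ENVHEALTH ~ DALY + AIR_H + WATER_H)
cfit <- coef(fit)              # named vector of estimates
round(cfit, 3)

coef(summary(fit))             # estimates, std. errors, t values, p-values
r2 <- summary(fit)$r.squared   # close to 1 for a near-exact linear relation
r2
```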
10
Read the documentation!
11
Linear and least-squares
> summary(lmENVH)
Call:
lm(formula = ENVHEALTH ~ DALY + AIR_H + WATER_H)
Residuals:
       Min         1Q     Median         3Q        Max
-0.0072734 -0.0027299  0.0001145  0.0021423  0.0055205
Coefficients:
              Estimate Std. Error   t value Pr(>|t|)
(Intercept) -2.673e-05  6.377e-04    -0.042    0.967
DALY         5.000e-01  1.922e-05 26020.669   <2e-16 ***
AIR_H        2.500e-01  1.273e-05 19645.297   <2e-16 ***
WATER_H      2.500e-01  1.751e-05 14279.903   <2e-16 ***
---
12
p < 0.01 : very strong presumption against null hypothesis vs. this fit
0.01 < p < 0.05 : strong presumption against null hypothesis
0.05 < p < 0.1 : low presumption against null hypothesis
p > 0.1 : no presumption against the null hypothesis
Linear and least-squares
Continued:
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.003097 on 178 degrees of freedom
(49 observations deleted due to missingness)
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 3.983e+09 on 3 and 178 DF, p-value: < 2.2e-16
> names(lmENVH)
[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign"
[7] "qr" "df.residual" "na.action" "xlevels" "call" "terms"
[13] "model"
13
Object of class lm:
An object of class "lm" is a list containing at least the following components:
coefficients a named vector of coefficients
residuals the residuals, that is response minus fitted values.
fitted.values the fitted mean values.
rank the numeric rank of the fitted linear model.
weights (only for weighted fits) the specified weights.
df.residual the residual degrees of freedom.
call the matched call.
terms the terms object used.
contrasts (only where relevant) the contrasts used.
xlevels (only where relevant) a record of the levels of the factors used in fitting.
offset the offset used (missing if none were used).
y if requested, the response used.
x if requested, the model matrix used.
model if requested (the default), the model frame used.
14
> plot(ENVHEALTH,col="red")
> points(lmENVH$fitted.values,col="blue")
> Huh?
15
Plot original versus fitted
Try again!
16
> plot(ENVHEALTH[!is.na(ENVHEALTH)], col="red")
> points(lmENVH$fitted.values,col="blue")
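The first plot misaligns because lm() drops every row with a missing value in any model variable, so the fitted values are shorter than the original vector. Filtering on the response alone lines up only if the predictors have no additional missing rows. A sketch with simulated data (all names and values hypothetical) showing two more general ways to align the two series:

```r
set.seed(3)
x <- rnorm(20)
z <- rnorm(20)
y <- 1 + 2 * x - z + rnorm(20, sd = 0.1)
y[c(2, 5)] <- NA   # missing responses
x[9] <- NA         # a row lm() also drops, although y is present there

fit <- lm(y ~ x + z)
n_fitted  <- length(fit$fitted.values)   # rows actually used in the fit
n_nonna_y <- sum(!is.na(y))              # larger: row 9 has y but no x

# Option 1: the fitted values are named by original row, so use the names
# to pick out the matching observations.
kept <- as.integer(names(fit$fitted.values))
y_used <- y[kept]

# Option 2: na.exclude pads fitted() with NA back to the original length.
fit2 <- lm(y ~ x + z, na.action = na.exclude)
fitted_full <- fitted(fit2)
```

Either `plot(y_used, col="red")` followed by `points(fit$fitted.values, col="blue")`, or `plot(y, col="red")` followed by `points(fitted_full, col="blue")`, then overlays matching rows.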
Predict
> cENVH<-coef(lmENVH)
> DALYNEW<-c(seq(5,95,5)) #2
> AIR_HNEW<-c(seq(5,95,5)) #3
> WATER_HNEW<-c(seq(5,95,5)) #4
17
Predict
> NEW<-data.frame(DALYNEW,AIR_HNEW,WATER_HNEW)
> pENV<- predict(lmENVH,NEW,interval="prediction")
> cENV<- predict(lmENVH,NEW,interval="confidence") # look up what this does
18
Predict object returns
predict.lm produces a vector of predictions or a matrix of predictions and bounds with column names fit, lwr, and upr if interval is set. Access via [,1] etc.
If se.fit is TRUE, a list with the following components is returned:
fit vector or matrix as above
se.fit standard error of predicted means
residual.scale residual standard deviations
df degrees of freedom for residual
19
Output from predict
> head(pENV)
       fit      lwr      upr
1       NA       NA       NA
2 11.55213 11.54591 11.55834
3 18.29168 18.28546 18.29791
4       NA       NA       NA
5 69.92533 69.91915 69.93151
6 90.20589 90.19974 90.21204
…
20
> tail(pENV)
fit lwr upr
226 NA NA NA
227 NA NA NA
228 34.95256 34.94641 34.95871
229 59.00213 58.99593 59.00834
230 24.20951 24.20334 24.21569
231 38.03701 38.03084 38.04319
21
Read the documentation!
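The head() and tail() output above runs to row 231 and contains NA rows, although NEW holds only 19 rows. That is the pattern predict() gives when the columns of newdata do not carry the names used in the model formula: the variables are then looked up elsewhere (here, most likely in the attached data). A sketch with simulated data, all values made up, where the new columns are named to match the formula:

```r
set.seed(11)
d <- data.frame(DALY = runif(50, 0, 100),
                AIR_H = runif(50, 0, 100),
                WATER_H = runif(50, 0, 100))
d$ENVHEALTH <- 0.5 * d$DALY + 0.25 * d$AIR_H + 0.25 * d$WATER_H +
  rnorm(50, sd = 0.5)

fit <- lm(ENVHEALTH ~ DALY + AIR_H + WATER_H, data = d)

# Column names match the formula, so predict() uses these 19 rows.
NEW <- data.frame(DALY = seq(5, 95, 5),
                  AIR_H = seq(5, 95, 5),
                  WATER_H = seq(5, 95, 5))

pENV <- predict(fit, NEW, interval = "prediction")  # bounds for a new observation
cENV <- predict(fit, NEW, interval = "confidence")  # bounds for the mean response
head(pENV)

# Prediction intervals include residual noise, so they are the wider of the two.
wider <- all((pENV[, "upr"] - pENV[, "lwr"]) > (cENV[, "upr"] - cENV[, "lwr"]))
wider
```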
22
Classification Exercises (Lab3b_knn1.R)
> library(class) # provides knn()
> nyt1<-read.csv("nyt1.csv")
> nyt1<-nyt1[which(nyt1$Impressions>0 & nyt1$Clicks>0 & nyt1$Age>0),]
> nnyt1<-dim(nyt1)[1] # shrink it down!
> sampling.rate=0.9
> num.test.set.labels=nnyt1*(1.-sampling.rate)
> training <-sample(1:nnyt1,sampling.rate*nnyt1, replace=FALSE)
> train<-subset(nyt1[training,],select=c(Age,Impressions))
> testing<-setdiff(1:nnyt1,training)
> test<-subset(nyt1[testing,],select=c(Age,Impressions))
> cg<-nyt1$Gender[training]
> true.labels<-nyt1$Gender[testing]
> classif<-knn(train,test,cg,k=5)
#
> classif
> attributes(.Last.value)
# interpretation to come!
23
K Nearest Neighbors (classification)
Script – Lab3b_knn1_2015.R
> nyt1<-read.csv("nyt1.csv")
… from week 3b slides or script
> classif<-knn(train,test,cg,k=5)
#
> head(true.labels)
[1] 1 0 0 1 1 0
> head(classif)
[1] 1 1 1 1 0 0
Levels: 0 1
> ncorrect<-true.labels==classif
> table(ncorrect)["TRUE"] # or > length(which(ncorrect))
> What do you conclude?
24
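One way to turn the vector of matches into something interpretable is an accuracy figure plus a confusion table. The sketch below follows the same split-and-classify pattern, but on the built-in iris data so that it runs without nyt1.csv; the seed and the use of iris are assumptions.

```r
library(class)  # provides knn()

set.seed(2015)
n <- nrow(iris)
sampling.rate <- 0.9

training <- sample(1:n, sampling.rate * n, replace = FALSE)
testing  <- setdiff(1:n, training)

train <- iris[training, 1:4]
test  <- iris[testing, 1:4]
cg          <- iris$Species[training]   # known labels for the training rows
true.labels <- iris$Species[testing]    # held-out labels for scoring

classif <- knn(train, test, cg, k = 5)

ncorrect <- true.labels == classif
accuracy <- mean(ncorrect)              # fraction of test rows classified correctly
confusion <- table(true = true.labels, predicted = classif)
accuracy
confusion
```

Comparing the accuracy with the share of the most common class in the test set gives a baseline: a classifier that barely beats that share has learned little from the predictors.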
Classification Exercises (Lab3b_knn2_2015.R)
2 examples in the script
25
Clustering Exercises
• Lab3b_kmeans1.R
• Lab3b_kmeans2.R – plotting up results from the iris clustering
26
Regression
> bronx<-read.xlsx("sales/rollingsales_bronx.xls",pattern="BOROUGH",stringsAsFactors=FALSE,sheetIndex=1,startRow=5,header=TRUE)
> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE) )
> m1<-lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET),data=bronx)
What’s wrong?
27
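One likely answer, sketched on a tiny made-up data frame (column names follow the Bronx file, values are hypothetical): a zero sale price or zero area becomes -Inf under log(), and lm() refuses to fit non-finite values.

```r
# Hypothetical rows; zeros stand in for missing or placeholder records.
sales <- data.frame(
  SALE.PRICE        = c(0, 250000, 410000, 0, 320000, 515000),
  GROSS.SQUARE.FEET = c(1200, 1500, 2400, 0, 1800, 3000)
)

log(0)   # -Inf

# Capture the failure instead of stopping the script.
res <- tryCatch(
  lm(log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = sales),
  error = function(e) e
)
failed <- inherits(res, "error")
failed

# Keep only rows where both quantities are positive, then refit.
clean <- sales[which(sales$SALE.PRICE > 0 & sales$GROSS.SQUARE.FEET > 0), ]
m <- lm(log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = clean)
coef(m)
```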
Clean up…
> bronx<-bronx[which(bronx$GROSS.SQUARE.FEET>0 & bronx$LAND.SQUARE.FEET>0 & bronx$SALE.PRICE>0),]
> m1<-lm(log(SALE.PRICE)~log(GROSS.SQUARE.FEET),data=bronx)
#
> summary(m1)
28
Call:
lm(formula = log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = bronx)
Residuals:
Min 1Q Median 3Q Max
-14.4529 0.0377 0.4160 0.6572 3.8159
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.0271 0.3088 22.75 <2e-16 ***
log(GROSS.SQUARE.FEET) 0.7013 0.0379 18.50 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.95 on 2435 degrees of freedom
Multiple R-squared: 0.1233, Adjusted R-squared: 0.1229
F-statistic: 342.4 on 1 and 2435 DF, p-value: < 2.2e-16
29
Plot
> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))
> abline(m1,col="red",lwd=2)
# then
> plot(resid(m1))
30
Another model (2)?
Add two more variables to the linear model: LAND.SQUARE.FEET and NEIGHBORHOOD
Repeat but suppress the intercept (2a)
31
Model 3/4
Model 3
Log(SALE.PRICE) vs. (no intercept) Log(GROSS.SQUARE.FEET), Log(LAND.SQUARE.FEET), NEIGHBORHOOD, BUILDING.CLASS.CATEGORY
Model 4
Log(SALE.PRICE) vs. (no intercept) Log(GROSS.SQUARE.FEET), Log(LAND.SQUARE.FEET), NEIGHBORHOOD*BUILDING.CLASS.CATEGORY
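The two specifications differ only in how the categorical terms enter the formula. A sketch of the syntax on the built-in `mtcars` data, used here as an assumed stand-in for the Bronx file: `0 +` suppresses the intercept, `+` adds main effects only, and `*` adds main effects together with their interaction.

```r
# Main effects only, no intercept (the shape of Model 3).
m_main <- lm(mpg ~ 0 + log(wt) + factor(cyl) + factor(am), data = mtcars)

# a * b expands to a + b + a:b (the shape of Model 4).
m_int  <- lm(mpg ~ 0 + log(wt) + factor(cyl) * factor(am), data = mtcars)

names(coef(m_main))   # one coefficient per cyl level, no "(Intercept)"
names(coef(m_int))    # additionally contains "factor(cyl)...:factor(am)..." terms

has_intercept <- "(Intercept)" %in% names(coef(m_main))
has_intercept
```

With no intercept, the first factor gets a coefficient for every level instead of differences from a baseline level, and the reported R-squared is computed against zero, not against the mean, so it is not comparable with the R-squared of a model that has an intercept.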
32
Solution model 2
> m2<-lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD),data=bronx)
> summary(m2)
> plot(resid(m2))
#
> m2a<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD),data=bronx)
> summary(m2a)
> plot(resid(m2a))
33
34
Solution model 3 and 4
> m3<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)+factor(bronx$BUILDING.CLASS.CATEGORY),data=bronx)
> summary(m3)
> plot(resid(m3))
#
> m4<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)*factor(bronx$BUILDING.CLASS.CATEGORY),data=bronx)
> summary(m4)
> plot(resid(m4))
35
36
Assignment 3
• Preliminary and Statistical Analysis. Due ~ March 6. 15% (written)
– Distribution analysis and comparison, visual ‘analysis’, statistical model fitting and testing of some of the nyt1…31 datasets.
37