03 Modelling
Transcript of 03 Modelling
![Page 1: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/1.jpg)
Hadley Wickham
plyr: Modelling large data
Tuesday, 7 July 2009
![Page 2: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/2.jpg)
1. Strategy for analysing large data.
2. Introduction to the Texas housing data.
3. What’s happening in Houston?
4. Using models as a tool.
5. Using models in their own right.
![Page 3: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/3.jpg)
Large data strategy
Start with a single unit, and identify interesting patterns.
Summarise patterns with a model.
Apply model to all units.
Look for units that don’t fit the pattern.
Summarise with a single model.
![Page 4: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/4.jpg)
Texas housing data
For each metropolitan area (45) in Texas, for each month from 2000 to 2009 (112):
Number of houses listed and sold
Total value of houses, and average sale price
Average time on market
CC BY http://www.flickr.com/photos/imagesbywestfall/3510831277/
![Page 5: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/5.jpg)
Strategy
Start with a single city (Houston).
Explore patterns & fit models.
Apply models to all cities.
![Page 6: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/6.jpg)
[Plot: Houston average sale price (avgprice) by date, 2000–2008]
![Page 7: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/7.jpg)
[Plot: Houston monthly sales by date, 2000–2008]
![Page 8: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/8.jpg)
[Plot: Houston average time on market (onmarket) by date, 2000–2008]
![Page 9: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/9.jpg)
Seasonal trends
These make it much harder to see the long-term trend. How can we remove the seasonal pattern?
(There are many sophisticated techniques from time series, but what’s the simplest thing that might work?)
![Page 10: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/10.jpg)
[Plot: Houston average sale price (avgprice) by month, pooling years]
![Page 11: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/11.jpg)
Challenge
What does the following function do?
deseas <- function(var, month) {
  resid(lm(var ~ factor(month))) +
    mean(var, na.rm = TRUE)
}
How could you use it in conjunction with transform to deseasonalise the data? What if you wanted to deseasonalise every city?
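As a sanity check, the behaviour of deseas can be verified on simulated data: it removes the month-level means while preserving the overall mean. This is a minimal sketch on made-up numbers, not the Texas data:

```r
# Toy demonstration of deseas(): remove a monthly seasonal effect
# while preserving the overall mean. Simulated data only.
deseas <- function(var, month) {
  resid(lm(var ~ factor(month))) +
    mean(var, na.rm = TRUE)
}

month <- rep(1:12, times = 5)              # five years of monthly data
seasonal <- sin(2 * pi * month / 12) * 10  # strong seasonal pattern
trend <- seq_along(month) * 0.1            # weak long-term trend
x <- 100 + seasonal + trend

x_ds <- deseas(x, month)

# The deseasonalised series keeps the original mean...
stopifnot(abs(mean(x_ds) - mean(x)) < 1e-8)
# ...but the month-to-month means are now essentially flat.
stopifnot(sd(tapply(x_ds, month, mean)) < sd(tapply(x, month, mean)))
```

Because the model is just an intercept plus month dummies, the residuals within each month average to zero, so adding back the grand mean leaves a series with the seasonal component stripped out but the long-term trend intact.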
![Page 12: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/12.jpg)
houston <- transform(houston,
  avgprice_ds = deseas(avgprice, month),
  listings_ds = deseas(listings, month),
  sales_ds = deseas(sales, month),
  onmarket_ds = deseas(onmarket, month))

qplot(month, sales_ds, data = houston, geom = "line", group = year) + avg
![Page 13: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/13.jpg)
[Plot: Houston deseasonalised average price (avgprice_ds) by month]
![Page 14: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/14.jpg)
Models as tools
Here we’re using the linear model as a tool: we don’t care about the coefficients or the standard errors, we’re just using it to remove a striking pattern.
Tukey described this process as residuals and reiteration: by removing a striking pattern, we can see more subtle patterns.
![Page 15: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/15.jpg)
[Plot: Houston deseasonalised average price (avgprice_ds) by date, 2000–2008]
![Page 16: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/16.jpg)
[Plot: Houston deseasonalised sales (sales_ds) by date, 2000–2008]
![Page 17: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/17.jpg)
[Plot: Houston deseasonalised time on market (onmarket_ds) by date, 2000–2008]
![Page 18: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/18.jpg)
Summary
Most variables seem to be a combination of a strong seasonal pattern and a weaker long-term trend.
How do these patterns hold up for the rest of Texas? We’ll focus on sales.
![Page 19: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/19.jpg)
[Plot: monthly sales by date for all Texas cities, 2000–2008]
![Page 20: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/20.jpg)
tx <- read.csv("tx-house-sales.csv")
qplot(date, sales, data = tx, geom = "line", group = city)

tx <- ddply(tx, "city", transform, sales_ds = deseas(sales, month))
qplot(date, sales_ds, data = tx, geom = "line", group = city)
![Page 21: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/21.jpg)
[Plot: deseasonalised sales (sales_ds) by date for all cities]
![Page 22: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/22.jpg)
It works, but...
It doesn’t give us any insight into the similarity of the patterns across multiple cities. Are the trends the same or different?
So instead of throwing the models away and just using the residuals, let’s keep the models and explore them in more depth.
![Page 23: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/23.jpg)
Two new tools
dlply: takes a data frame, splits it up in the same way as ddply, applies a function to each piece and combines the results into a list.
ldply: takes a list, splits it into elements, applies a function to each element and then combines the results into a data frame.
dlply + ldply = ddply
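The identity above can be checked on a toy data frame. A sketch assuming plyr is installed (the data and column names here are made up for illustration):

```r
# dlply splits a data frame into a labelled list; ldply turns a
# labelled list back into a data frame. Composed, they behave like ddply.
library(plyr)

df <- data.frame(
  g = rep(c("a", "b"), each = 3),
  y = c(1, 2, 3, 10, 20, 30)
)

# dlply: data frame -> list, one element per group
pieces <- dlply(df, "g", function(piece) mean(piece$y))

# ldply: list -> data frame; the split labels are restored as a column
out <- ldply(pieces, function(m) c(mean_y = m))

stopifnot(is.list(pieces), length(pieces) == 2)
stopifnot(out$g == c("a", "b"))
stopifnot(out$mean_y == c(2, 20))
```

This is the pattern used below: dlply to fit one model per city, then ldply to pull the pieces back into a data frame for plotting.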
![Page 24: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/24.jpg)
models <- dlply(tx, "city", function(df)
  lm(sales ~ factor(month), data = df))

models[[1]]
coef(models[[1]])

ldply(models, coef)
![Page 25: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/25.jpg)
Notice we didn’t have to do anything to have the coefficients labelled correctly.
Behind the scenes plyr records the labels used for the split step, and ensures they are preserved across multiple plyr calls.
Labelling
Tuesday, 7 July 2009
![Page 26: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/26.jpg)
What are some problems with this model? How could you fix them?
Is the format of the coefficients optimal?
Turn to the person next to you and discuss for 2 minutes.
Back to the model
Tuesday, 7 July 2009
![Page 27: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/27.jpg)
qplot(date, log10(sales), data = tx, geom = "line", group = city)

models2 <- dlply(tx, "city", function(df)
  lm(log10(sales) ~ factor(month), data = df))

coef2 <- ldply(models2, function(mod) {
  data.frame(
    month = 1:12,
    effect = c(0, coef(mod)[-1]),
    intercept = coef(mod)[1])
})
![Page 28: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/28.jpg)
qplot(date, log10(sales), data = tx, geom = "line", group = city)

models2 <- dlply(tx, "city", function(df)
  lm(log10(sales) ~ factor(month), data = df))

coef2 <- ldply(models2, function(mod) {
  data.frame(
    month = 1:12,
    effect = c(0, coef(mod)[-1]),
    intercept = coef(mod)[1])
})

Log-transforming sales makes the coefficients comparable (ratios).
Putting the coefficients in rows means they can be plotted more easily.
![Page 29: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/29.jpg)
[Plot: monthly effect by month, one line per city]
qplot(month, effect, data = coef2, group = city, geom = "line")
![Page 30: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/30.jpg)
[Plot: 10^effect by month, one line per city]
qplot(month, 10 ^ effect, data = coef2, group = city, geom = "line")
![Page 31: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/31.jpg)
[Faceted plot: 10^effect by month, one panel per city]
qplot(month, 10 ^ effect, data = coef2, geom = "line") + facet_wrap(~ city)
![Page 32: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/32.jpg)
What should we do next?
What do you think?
You have 30 seconds to come up with (at least) one idea.
![Page 33: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/33.jpg)
My ideas
Fit a single model, log(sales) ~ city * factor(month), and look at the residuals.
Fit individual models, log(sales) ~ factor(month) + ns(date, 3), and look for cities that don’t fit.
![Page 34: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/34.jpg)
# One approach: fit a single model
mod <- lm(log10(sales) ~ city + factor(month), data = tx)
tx$sales2 <- 10 ^ resid(mod)

qplot(date, sales2, data = tx, geom = "line", group = city)
last_plot() + facet_wrap(~ city)
![Page 35: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/35.jpg)
[Plot: sales2 (residuals, back-transformed) by date, one line per city]
qplot(date, sales2, data = tx, geom = "line", group = city)
![Page 36: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/36.jpg)
[Faceted plot: sales2 by date, one panel per city]
last_plot() + facet_wrap(~ city)
![Page 37: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/37.jpg)
# Another approach: the essence of most cities is a seasonal
# term plus a long-term smooth trend. We could fit this model
# to each city, and then look for models which don't fit well.
library(splines)
models3 <- dlply(tx, "city", function(df) {
  lm(log10(sales) ~ factor(month) + ns(date, 3), data = df)
})

# Extract R-squared from each model
rsq <- function(mod) c(rsq = summary(mod)$r.squared)
quality <- ldply(models3, rsq)
![Page 38: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/38.jpg)
[Dotplot: R² (rsq) for each city, unordered]
qplot(rsq, city, data = quality)
![Page 39: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/39.jpg)
[Dotplot: R² (rsq) for each city, ordered by rsq]
qplot(rsq, reorder(city, rsq), data = quality)
How are the good fits different from the bad fits?
![Page 40: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/40.jpg)
quality$poor <- quality$rsq < 0.7
tx2 <- merge(tx, quality, by = "city")

mfit <- ldply(models3, function(mod) {
  data.frame(
    resid = resid(mod),
    pred = predict(mod))
})
tx2 <- cbind(tx2, mfit[, -1])
Can you think of any potential problems with this line?
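One potential problem with that cbind() is that it relies on the rows of mfit lining up exactly with the rows of tx2, i.e. on merge() returning rows in the same order as the dlply() split. An order-safe alternative is to carry an explicit key and join on it. A sketch on simulated data (the tiny data frame and its values are made up, not the real Texas data):

```r
# Instead of cbind-ing by position, return the key columns from each
# model and merge on them, so row order no longer matters.
library(plyr)

tx <- data.frame(
  city  = rep(c("Abilene", "Waco"), each = 4),
  month = rep(1:4, times = 2),
  sales = c(5, 6, 7, 8, 50, 60, 70, 80)
)

models <- dlply(tx, "city", function(df) lm(sales ~ month, data = df))

# Include month alongside resid/pred so the join has an explicit key
mfit <- ldply(models, function(mod) {
  data.frame(month = mod$model$month,
             resid = resid(mod),
             pred  = predict(mod))
})

tx2 <- merge(tx, mfit, by = c("city", "month"))

stopifnot(nrow(tx2) == nrow(tx))
# fitted + residual recovers the observed value, row by row
stopifnot(abs(tx2$sales - (tx2$pred + tx2$resid)) < 1e-8)
```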
![Page 41: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/41.jpg)
[Faceted plot: log10(sales) by date, one panel per city — raw data]
![Page 42: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/42.jpg)
[Faceted plot: model predictions (pred) by date, one panel per city]
![Page 43: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/43.jpg)
[Faceted plot: model residuals (resid) by date, one panel per city]
![Page 44: 03 Modelling](https://reader031.fdocuments.us/reader031/viewer/2022020122/54881624b47959dd0c8b5672/html5/thumbnails/44.jpg)
Conclusions
A simple (and relatively small) example, but it shows how collections of models can be useful for gaining understanding.
Each attempt illustrated something new about the data.
plyr made it easy to create and summarise collections of models, so we didn’t have to worry about the mechanics.