Maps and Time Series Stat 579
Heike Hofmann
Outline
• Melting and Casting
• Maps: polygons, choropleth
• Time series
Warm-up
• Start R and load data ‘fbi’ from http://www.hofroe.net/stat579/crimes-2012.csv
• This data set contains number of crimes by type for each state in the U.S.
• Investigate which states have the highest number of crimes (almost independently of type)
• Pick one state and crime type and plot a time series
getting ready for loops
• Let’s concentrate on the years since 2000
• Pick a state and fit a model (use lm) for the number of burglaries over time (i.e. lm(Burglary~Year) )
• Save the resulting object. Investigate it with your poking and prodding functions.
• Extract the coefficients (intercept and slope) from the model
• Repeat for another state.
• How can we extract coefficients for all states?
• Want to run the same block of code multiple times:
• Loop or iteration
for (i in allstates) {
  onestate <- subset(fbi, state==i & Year >= 2000)
  model <- lm(Burglary~Year, data=onestate)
  print(coef(model))
}
Iterations
[Diagram: a block of commands is run on each iteration and produces output]
Why should we avoid loops?
• speed of for-loops still is an issue
• main reason: lots of error-prone housekeeping chores before and after the ‘meat’
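To make the housekeeping concrete, here is a minimal sketch that collects the coefficients for all states, first with a loop and then without one. The data frame toy and its values are invented stand-ins for fbi (the real data is not used here), and fit_one is a hypothetical helper name.

```r
# Toy stand-in for the fbi data; all values are invented for illustration.
slopes <- c(A = 5, B = -2, C = 1)
toy <- expand.grid(state = names(slopes), Year = 2000:2012,
                   stringsAsFactors = FALSE)
toy$Burglary <- 100 + unname(slopes[toy$state]) * (toy$Year - 2000)

# Loop version: the result container has to be set up before the loop
# and filled by hand inside it (the housekeeping chores).
allstates <- unique(toy$state)
coefs <- matrix(NA_real_, nrow = length(allstates), ncol = 2,
                dimnames = list(allstates, c("intercept", "slope")))
for (i in allstates) {
  onestate <- subset(toy, state == i & Year >= 2000)
  model <- lm(Burglary ~ Year, data = onestate)
  coefs[i, ] <- coef(model)
}

# Loop-free version: split the data by state and apply one function
# to every piece; the result container is built automatically.
fit_one <- function(d) coef(lm(Burglary ~ Year, data = d))
coefs2 <- t(sapply(split(toy, toy$state), fit_one))
```

Both coefs and coefs2 hold one row per state. The second version has no index variable and no pre-allocated container that could go out of sync with the data.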
fbi exploration
• Draw a scatterplot of population size against the number of violent crimes in 2012. What is your conclusion? How do things change in 2011?
• Plot population against the number of burglaries in 2012. What is your conclusion there?
• What should we look at instead?
Reshaping Data
• Two step process:
• melt: get data into a “convenient” shape, i.e. one that is particularly flexible
• cast: cast data into new shape(s) that are better suited for analysis
• id.vars: all identifiers (keys) and qualitative variables
• measure.vars: all quantitative variables
[Figure: original data with columns key, X1, X2, X3, X4, X5, and its molten form, “long & skinny”, in which X1 to X5 are stacked under key]
melt.data.frame(data, id.vars, measure.vars, na.rm = F, ...)
Casting
• Function cast: dcast(dataset, rows ~ columns, aggregate)
[Figure: result table laid out by rows and columns; each cell holds aggregate(data)]
Data aggregation sometimes is just a transformation
> fbi.melt <- melt(fbi, id.vars=c("State","Abbr","Population"), measure.vars=4:12)

> head(fbi.melt)
       State Abbr Population      variable  value
1    Alabama   AL    4708708 Violent.crime  21179
2     Alaska   AK     698473 Violent.crime   4421
3    Arizona   AZ    6595778 Violent.crime  26929
4   Arkansas   AR    2889450 Violent.crime  14959
5 California   CA   36961664 Violent.crime 174459
6   Colorado   CO    5024748 Violent.crime  16976

> tail(fbi.melt)
            State Abbr Population            variable value
445       Vermont   VT     621760 Motor.vehicle.theft   448
446      Virginia   VA    7882590 Motor.vehicle.theft 11419
447    Washington   WA    6664195 Motor.vehicle.theft 23680
448 West Virginia   WV    1819777 Motor.vehicle.theft  2741
449     Wisconsin   WI    5654774 Motor.vehicle.theft  8926
450       Wyoming   WY     544270 Motor.vehicle.theft   771

> summary(fbi.melt)
        State          Abbr        Population                                       variable       value
 Alabama   :  9   AK     :  9   Min.   :  544270   Violent.crime                       : 50   Min.   :      7
 Alaska    :  9   AL     :  9   1st Qu.: 1796619   Murder.and.nonnegligent.manslaughter: 50   1st Qu.:   1536
 Arizona   :  9   AR     :  9   Median : 4403094   Forcible.rape                       : 50   Median :  11056
 Arkansas  :  9   AZ     :  9   Mean   : 6128138   Robbery                             : 50   Mean   :  47124
 California:  9   CA     :  9   3rd Qu.: 6664195   Aggravated.assault                  : 50   3rd Qu.:  37964
 Colorado  :  9   CO     :  9   Max.   :36961664   Property.crime                      : 50   Max.   :1009614
 (Other)   :396   (Other):396                      (Other)                             :150
Incidences are now easy to compute:
•fbi.melt$irate <- fbi.melt$value/fbi.melt$Population
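As a small worked example, the same division can be checked by hand on the first two rows of the head(fbi.melt) output. The data frame name two_rows and the per100k column are assumptions added for illustration; scaling to a rate per 100,000 inhabitants is a common convention, not something the slides prescribe.

```r
# First two rows of head(fbi.melt), copied from the output above.
two_rows <- data.frame(
  State = c("Alabama", "Alaska"),
  Population = c(4708708, 698473),
  value = c(21179, 4421)   # Violent.crime counts
)

# Incidence rate: number of crimes per inhabitant.
two_rows$irate <- two_rows$value / two_rows$Population

# Assumed convention: rescale to crimes per 100,000 inhabitants,
# which is easier to read than a small fraction.
two_rows$per100k <- 1e5 * two_rows$irate
print(two_rows)
```

Alaska has far fewer violent crimes than Alabama in absolute numbers, but a higher rate once population is taken into account.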
Recreate this chart of incidence rates
[Figure: incidence rates by state, one panel per crime type (Murder.and.nonnegligent.manslaughter, Forcible.rape, Robbery, Motor.vehicle.theft, Aggravated.assault, Violent.crime, Burglary, Larceny.theft, Property.crime); states are ordered by reorder(State, irate); the count axis runs from 0 to 4000]
Then, cast
• Row variables, column variables, and a summary function (sum, mean, max, etc)
• dcast(molten, row ~ col, summary)
• dcast(molten, row1 + row2 ~ col, summary)
• dcast(molten, row ~ . , summary)
• dcast(molten, . ~ col, summary)
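The four formula patterns can be tried on a tiny data set. This is a sketch only: it assumes the reshape2 package is installed, and the data frame wide with its values is invented for illustration.

```r
library(reshape2)

# Invented toy data: two states, two years, two crime types.
wide <- data.frame(
  State = c("A", "A", "B", "B"),
  Year = c(2011, 2012, 2011, 2012),
  Burglary = c(10, 12, 20, 18),
  Robbery = c(1, 2, 3, 4)
)

# Melt: identifiers stay as columns, measurements are stacked
# into the columns 'variable' and 'value'.
molten <- melt(wide, id.vars = c("State", "Year"),
               measure.vars = c("Burglary", "Robbery"))

# row ~ col: one row per state, one column per crime type.
by_state_type <- dcast(molten, State ~ variable, sum, value.var = "value")

# row1 + row2 ~ col: one row per state and year.
by_state_year <- dcast(molten, State + Year ~ variable, sum, value.var = "value")

# row ~ . : one total per state (the result column is named ".").
by_state <- dcast(molten, State ~ ., sum, value.var = "value")

# . ~ col: one total per crime type.
by_type <- dcast(molten, . ~ variable, sum, value.var = "value")
```

Passing value.var explicitly avoids the message that dcast prints when it has to guess the value column.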
Casting
• Using dcast:
• find the number of all offenses in 2009
• find the number of offenses by type of crime
• find the number of all offenses by state
What is a map?
[Figure: scatterplot of lat against long]
Set of points specifying latitude and longitude
[Figure: lat against long, with the dots connected]
Polygon: connect dots in correct order
What is a map?
[Figure: lat against long for a larger region made up of several polygons]
Polygon: connect only the correct dots
Grouping
• Use parameter group to connect the “right” dots (need to create grouping sometimes)
[Figures: four maps of lat against long, one for each of the following qplot calls]
qplot(long, lat, geom="point", data=states)
qplot(long, lat, geom="path", data=states, group=group)
qplot(long, lat, geom="polygon", data=states, group=group, fill=region)
qplot(long, lat, geom="polygon", data=states.map, fill=lat, group=group)
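The effect of group can be seen without any map data. The sketch below assumes ggplot2 is installed and uses an invented data frame, squares, holding the corners of two separate squares. It is written with ggplot() rather than qplot(), since recent ggplot2 releases mark qplot() as deprecated; the qplot calls above should behave the same way in versions that still support them.

```r
library(ggplot2)

# Invented toy "map": corner points of two separate unit squares.
squares <- data.frame(
  long  = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat   = c(0, 0, 1, 1,  0, 0, 1, 1),
  group = rep(c("left", "right"), each = 4)
)

# Without a group aesthetic all eight points are treated as ONE polygon,
# so the two squares are joined by spurious edges.
p_wrong <- ggplot(squares, aes(long, lat)) +
  geom_polygon(fill = "grey80", colour = "black")

# With group, each square is drawn as its own polygon.
p_right <- ggplot(squares, aes(long, lat, group = group)) +
  geom_polygon(fill = "grey80", colour = "black")

# print(p_wrong); print(p_right)   # uncomment to compare the two plots
```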
Practice
• Using the maps package, pull out map data for all US counties
counties <- map_data("county")
• Draw a map of counties (polygons & path geom)
• Colour all counties called “story”
• Advanced: What county names are used often?
Merging Data
• Merging data from different datasets:
merge(x, y, by = intersect(names(x), names(y)),
      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
      sort = TRUE, suffixes = c(".x",".y"), incomparables = NULL, ...)
e.g.:
states.fbi <- merge(states, fbi.cast, by.x="", by.y="Abbr")
[Figure: tables keyed by region (e.g. alabama) are merged into one table with columns X1, X2, region, X3]
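A minimal sketch of how by.x, by.y and all.x behave, using two invented data frames (map_toy and crime_toy, with made-up coordinates and counts). It also shows one practical point for maps: merge() may reorder rows, so the polygon points should be put back in their original order before plotting.

```r
# Invented map-like data: three points per region, with a drawing order.
map_toy <- data.frame(
  region = c("alabama", "alabama", "alabama", "alaska", "alaska", "alaska"),
  long   = c(-88, -85, -87, -150, -140, -145),
  lat    = c(35, 35, 30, 70, 70, 60),
  order  = 1:6
)

# Invented crime data keyed by a differently named column.
crime_toy <- data.frame(
  name     = c("alabama", "texas"),
  Burglary = c(100, 200)
)

# Default (all = FALSE): only keys present in BOTH tables survive,
# so alaska and texas are dropped.
inner <- merge(map_toy, crime_toy, by.x = "region", by.y = "name")

# all.x = TRUE keeps every map point; unmatched regions get NA.
left <- merge(map_toy, crime_toy, by.x = "region", by.y = "name",
              all.x = TRUE)

# Restore the drawing order of the polygon points after merging.
left <- left[order(left$order), ]
```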
Practice
• Merge the fbi crime data and the map of the States
• Plot choropleth maps of crimes.
• Describe the patterns that you see.
• Advanced: try to cluster the states according to crime rates (use hclust)
Time Series
• 24 x 24 grid across Central America
• satellite captured data: temperature, near surface temperature (surftemp), pressure, ozone, cloud coverage: low (cloudlow), medium (cloudmid), high (cloudhigh)
• for each location monthly averages for Jan 1995 to Dec 2000
NASA Meteorological Data
[Figure: map grid, Gridx 1 to 24 by Gridy 1 to 24]
What is a Time Series?
[Figure: ts plotted against TimeIndx as points]
for each location multiple measurements
[Figure: ts plotted against TimeIndx, points joined by a line]
connected by a line
[Figure: ts plotted against TimeIndx, one line per location]
but only connect the right points
qplot(time, temperature, geom="point", data=subset(nasa, (x==1) & (y==1)))
qplot(time, temperature, geom="line", data=subset(nasa, (x==1) & (y==1)))
qplot(time, temperature, geom="line", data=subset(nasa, (x==1) & (y %in% c(1,15))), group=y)
Practice
• For each location, draw a time series for pressure. What do you expect? Are there surprising values? Which are they?
• Plot near surface temperatures for each location. Which locations show the highest range in temperatures? Which locations show the highest overall increase in temperatures?
use ddply to get these summaries
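A sketch of what such a ddply call could look like, assuming the plyr package is installed. The data frame nasa_toy and its values are invented; the column names x, y, time and temperature follow the qplot calls above, and temp_range and increase are hypothetical summary names.

```r
library(plyr)

# Invented stand-in for the nasa data: a 2 x 2 grid, 12 time points.
nasa_toy <- expand.grid(x = 1:2, y = 1:2, time = 1:12)
nasa_toy$temperature <- 290 + 3 * nasa_toy$x * (nasa_toy$time %% 2) +
  0.5 * nasa_toy$y * nasa_toy$time

# One row of summaries per location (x, y):
#   temp_range: largest minus smallest value in the series
#   increase:   last measurement minus first measurement
summ <- ddply(nasa_toy, c("x", "y"), summarise,
              temp_range = max(temperature) - min(temperature),
              increase = temperature[which.max(time)] -
                         temperature[which.min(time)])

# Locations sorted by temperature range, largest first.
summ[order(-summ$temp_range), ]
```

Taking last minus first is only one way to define an overall increase; a slope from lm(temperature ~ time) per location would be another.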