HW1 solution - pdixon.stat.iastate.edu · HW1 solution Author: Yushan Gu Created Date: 2/18/2020...


HW1 solution
Yushan Gu
2020/2/11

#library(maps)
#library(spsample)

library(sp)
library(RgoogleMaps)
library(dplyr)
library(readxl)
library(gstat)
library(mgcv)

setwd(".")
# Change this to the appropriate working directory containing the data files, if necessary

Each part was worth 4 points, except for questions 1b, 1e, 4b and 4c, which were 2 points each. There are 27 parts, so a total of 100 points. Partial credit was given for all questions when appropriate.

Problem 1

a) No. They are point pattern data. The primary information is the location of each wind farm, with secondary information attached to each location (e.g. the number of turbines).

Geostatistical data exist everywhere (e.g. elevation or soil pH) and vary continuously. The number of turbines only exists where there is a wind farm. You could consider this areal data, with the areas being the footprint of each wind farm (i.e., something equivalent to a county or district in our example). But then it’s not clear how to define comparable areas for the “non-wind farm” sites. There are lots of practical problems defining the entire region (except for tiny dots) as one non-wind farm area.

Note: This was a difficult question because it forced you to think carefully about the data. It would have been easier if the data set didn’t include the number of turbines.

b) Latitude values are positive and longitude values are negative because the data were collected from wind farms in the Midwestern United States, which lies in the Northern and Western Hemispheres.

c)

Wind <- read.table("wind.txt", head=T)

zoom <- MaxZoom(range(Wind$lat_DD), range(Wind$long_DD))
mid.lat <- mean(range(Wind$lat_DD))
mid.long <- mean(range(Wind$long_DD))
midwest <- GetMap(c(mid.lat, mid.long), zoom=zoom, maptype='terrain')

PlotOnStaticMap(midwest, Wind$lat_DD, Wind$long_DD, pch=19, col=4)


# or

Wind.sp <- Wind
coordinates(Wind.sp) <- c('long_DD','lat_DD')
proj4string(Wind.sp) <- CRS('+proj=longlat +datum=NAD83')

bubbleMap(Wind.sp,map=midwest,zcol='total_turb')

[Figure: bubble map of total_turb on the terrain map; legend classes <59.4, <82.4, <100, <151.8, <293]

There was no 1d). I removed a question and forgot to relabel the remainder.

e) Zone 14

Notes: You could look at the UTM zone map online or compute the zone. There are 60 zones spanning 360 degrees of longitude, with 0 longitude at the boundary of zones 30 and 31. The middle location (mid.long, mid.lat) is computed above. So, the zone is given by:

(180+mid.long) / 6

## [1] 13.79964

Zone numbers start at 1, so a value between 13 and 14 falls in zone 14.

And if you want, you can check that by projecting (mid.long, mid.lat) into a UTM zone and checking that the Easting is between approx. 200,000 and approx. 800,000 and should be close to 500,000 (the middle of a UTM zone).

cbind(mid.long, mid.lat) %>%
  SpatialPoints(proj4string=CRS('+proj=longlat +datum=NAD83')) %>%
  spTransform(CRS("+proj=utm +zone=14"))

## SpatialPoints:
##      mid.long mid.lat
## [1,] 658956.6 4149928
## Coordinate Reference System (CRS) arguments: +proj=utm +zone=14
## +ellps=WGS84
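The zone arithmetic in part e) can be wrapped in a small helper. This is only a minimal sketch in base R: `utm_zone` is a hypothetical name (it is not a function from `sp`), and it ignores the special zone exceptions around Norway and Svalbard.

```r
# Hypothetical helper (not part of sp): UTM zone number from longitude in
# decimal degrees, negative for west. Zones are 6 degrees wide and zone 1
# starts at 180 W, so the zone is the integer part of (180 + lon)/6, plus 1.
utm_zone <- function(lon) {
  zone <- floor((180 + lon) / 6) + 1
  # lon = 180 would give 61; treat it as belonging to zone 60
  pmin(zone, 60)
}

# A longitude close to the middle of the data, and one further east
utm_zone(c(-97.2, -92.8))
# [1] 14 15
```

Applying the helper to a whole vector of longitudes shows directly which locations fall outside the zone chosen for the projection.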

f) Since the farms are located in different UTM zones, using a single UTM coordinate system for all of them may not be that precise.

Note: Again, you can look at the UTM map or compute the zone for each location.

(180+Wind$long_DD)/6

##  [1] 13.17710 13.07483 13.61280 13.29182 13.39580 13.61385 13.16850
##  [8] 13.22282 13.28707 13.62720 14.42205 13.50585 13.12592 13.06858
## [15] 13.32862 14.39333 14.42445 13.25138 14.53070 13.72432 14.24023
## [22] 14.43468 13.62248 13.36920 13.19547

g)

Wind.sp[Wind.sp$FID %in% c(33011,48278,47308,25090,25853),] %>%
  spDists(longlat = T) -> D1; D1

##           [,1]      [,2]     [,3]      [,4]      [,5]
## [1,]    0.0000  165.5740 1095.448 1246.2575  710.1593
## [2,]  165.5740    0.0000 1261.010 1082.7423  587.0285
## [3,] 1095.4479 1261.0099    0.000 2334.1569 1718.2682
## [4,] 1246.2575 1082.7423 2334.157    0.0000  904.6636
## [5,]  710.1593  587.0285 1718.268  904.6636    0.0000

Distance between FID’s 33011 and 48278: 2334.157km

Distance between FID’s 48278 and 47308: 1718.268km

Distance between FID’s 25090 and 25853: 165.5740km
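To see what a long/lat distance calculation involves, here is a minimal base R sketch of the haversine formula. It assumes a spherical Earth with a mean radius of 6371 km, whereas `spDists(longlat = T)` works on the WGS84 ellipsoid, so the result should be close to, but not identical to, the value in D1. The function name `haversine_km` is made up for this illustration.

```r
# Great-circle distance in km on a sphere (haversine formula).
# Assumption: mean Earth radius of 6371 km; spDists() uses the WGS84
# ellipsoid instead, so the numbers will differ slightly.
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) just above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Farms 25853 and 25090, using the coordinates printed in part i)
d <- haversine_km(-98.32319, 37.37579, -98.23679, 38.86589)
d
```

The spherical value should land within a fraction of a kilometre of the 165.574 km reported above.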

h)

Wind.UTM <- spTransform(Wind.sp, CRS("+proj=utm +zone=14"))
Wind.UTM[Wind.UTM$FID %in% c(33011,48278,47308,25090,25853),] %>%
  spDists(longlat = F) %>%
  `/`(1000) -> D2; D2

##           [,1]      [,2]     [,3]      [,4]      [,5]
## [1,]    0.0000  165.5145 1095.020 1245.7720  710.7336
## [2,]  165.5145    0.0000 1260.522 1082.3218  587.5135
## [3,] 1095.0204 1260.5217    0.000 2333.2583 1719.4333
## [4,] 1245.7720 1082.3218 2333.258    0.0000  905.1439
## [5,]  710.7336  587.5135 1719.433  905.1439    0.0000

Distance between FID’s 33011 and 48278: 2333.2583km

Distance between FID’s 48278 and 47308: 1719.433km

Distance between FID’s 25090 and 25853: 165.5145km

i)

Wind.sp[Wind.sp$FID %in% c(33011,48278,47308,25090,25853),]

##              coordinates   FID total_turb
## 3  (-98.32319, 37.37579) 25853        293
## 10 (-98.23679, 38.86589) 25090        155
## 12  (-98.96489, 27.5126) 48278         57
## 15 (-100.0283, 48.52459) 33011         71
## 19 (-92.81579, 42.15957) 47308         19

The subset keeps the rows in data order, so indices 1 to 5 in D1, D2 and longi correspond to FIDs 25853, 25090, 48278, 33011 and 47308.

Wind.sp[Wind.sp$FID %in% c(33011,48278,47308,25090,25853),] %>%
  coordinates %>%
  `[`(,1) -> longi

(D1-D2)[1,2] %>% abs; longi[1]-longi[2]

## [1] 0.05953986
## [1] -0.0864

(D1-D2)[3,4] %>% abs; longi[3]-longi[4]

## [1] 0.8986258
## [1] 1.063398

(D1-D2)[3,5] %>% abs; longi[3]-longi[5]

## [1] 1.165094
## [1] -6.149102

The two distances are most similar for farms with FIDs 25090 and 25853, since the difference between their longitudes is smallest. They are least similar for farms with FIDs 48278 and 47308, since the difference between their longitudes is largest.

j) Two answers are possible:

Error in the data: WGS84 and NAD83 are reconciled with each other, so they should give the same locations.

No error in the database, if the database is not in NAD83. GPS uses the WGS84 datum, and the datum for this data set may not be the same (we assumed it to be NAD83). Different datums use different mathematical models of the Earth, so the reported long/lat coordinates may differ.


Problem 2

Note: There are two ways to implement these computations in R. My code had a loop. You could use apply() on a matrix. That is, generate 2000*50 random numbers, store in a matrix with 2000 columns and 50 rows, then apply the summary functions to each column. Yushan’s code shows you both.

nsim <- 2000
temp <- matrix(rlnorm(nsim*50, 2, 1.1), nrow=50)
range(temp)

## [1]   0.08834931 859.98623306

m1 <- apply(temp, 2, mean)
ltemp <- log(temp)
lm <- apply(ltemp, 2, mean)
lv <- apply(ltemp, 2, var)
m2 <- exp(lm + lv/2 - lv/50)
m3 <- exp(lm)
m4 <- apply(temp, 2, median)
c(mean(m1), mean(m2), mean(m3), mean(m4) )

## [1] 13.601527 13.557487  7.506242  7.570491

c(sd(m1), sd(m2), sd(m3), sd(m4) )

## [1] 2.936850 2.747827 1.199715 1.506741

exp(2+1.1^2/2)

## [1] 13.53123

# or

nsim <- 2000
m1 <- m2 <- m3 <- m4 <- rep(NA, nsim)
for (i in 1:nsim) {
  Y <- rlnorm(50, 2, 1.1)
  m1[i] <- mean(Y)
  lY <- log(Y); mlog <- mean(lY); vlog <- var(lY)
  m2[i] <- exp(mlog+vlog/2-vlog/50)
  m3[i] <- exp(mlog)
  m4[i] <- median(Y)
}
c(mean(m1), mean(m2), mean(m3), mean(m4))

## [1] 13.470130 13.416613  7.454165  7.511292

c(sd(m1), sd(m2), sd(m3), sd(m4))

## [1] 2.858320 2.681066 1.159974 1.456480

a)

abs(c(mean(m1), mean(m2), mean(m3), mean(m4))-13.53)<0.3

## [1] TRUE TRUE FALSE FALSE

Estimator #1 and Estimator #2 are nearly unbiased estimators of the population mean.
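The 0.3 cutoff above is a rule of thumb. One possible alternative, sketched here under the same simulation settings, is to compare each estimated bias with its Monte Carlo standard error. The seed and the object names (`est`, `bias`, `mc_se`) are choices made for this sketch, not part of the original solution.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
nsim <- 2000
n <- 50
true_mean <- exp(2 + 1.1^2 / 2)

# One column per simulated data set, as in the apply() version
y <- matrix(rlnorm(nsim * n, 2, 1.1), nrow = n)
ly <- log(y)
mlog <- colMeans(ly)
vlog <- apply(ly, 2, var)

est <- cbind(
  m1 = colMeans(y),
  m2 = exp(mlog + vlog / 2 - vlog / n),
  m3 = exp(mlog),
  m4 = apply(y, 2, median)
)

bias <- colMeans(est) - true_mean
mc_se <- apply(est, 2, sd) / sqrt(nsim)   # Monte Carlo SE of each average
round(rbind(bias, mc_se, z = bias / mc_se), 3)
```

A bias within two or three Monte Carlo standard errors of zero is consistent with an unbiased estimator; estimators #3 and #4 should miss the population mean by many standard errors.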


b) It is more appropriate to consider this an estimator of the population median.

c) Since estimator #2 has a smaller sd than estimator #1, it is more precise.

d) Since estimator #3 has a smaller sd than estimator #4, it is more precise.

e) The comparison doesn’t make sense, since estimator #2 is an estimator of the population mean but estimator #3 is an estimator of the population median.
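The following short derivation, using standard lognormal results and the simulation settings above (meanlog 2, sdlog 1.1, n = 50), may help explain why estimator #3, `exp(mlog)` in the code, targets the median and not the mean.

```latex
\begin{align*}
\log Y_i &\sim N(\mu, \sigma^2), \qquad \mu = 2,\ \sigma = 1.1,\ n = 50 \\
\mathrm{E}[Y] &= \exp(\mu + \sigma^2/2) = \exp(2.605) \approx 13.53 \\
\mathrm{median}(Y) &= \exp(\mu) = \exp(2) \approx 7.39 \\
\overline{\log Y} &\sim N(\mu, \sigma^2/n)
  \;\Rightarrow\;
  \mathrm{E}\bigl[\exp(\overline{\log Y})\bigr]
  = \exp\bigl(\mu + \sigma^2/(2n)\bigr) = \exp(2.0121) \approx 7.48
\end{align*}
```

The simulated averages of estimator #3 (7.51 and 7.45) sit close to 7.48, which is near the population median of 7.39 and far from the population mean of 13.53.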

Problem 3

a)

rain <- read_excel('IArainfall.xlsx')
rain <- as.data.frame(rain)
rain$Longitude <- -rain$Longitude

rain.sp <- rain
coordinates(rain.sp) <- c('Longitude', 'Latitude')
proj4string(rain.sp) <- CRS('+proj=longlat +datum=NAD83')

rain.utm <- spTransform(rain.sp, CRS('+proj=utm +zone=15'))
spplot(rain.sp, 'Inches', scales=list(draw=T))
# or

[Figure: rainfall (Inches) at the station locations in long/lat coordinates, 96°W to 90°W and 40.5°N to 43.5°N; legend classes [27.7,30.37], (30.37,33.03], (33.03,35.7], (35.7,38.36], (38.36,41.03]]

spplot(rain.utm, 'Inches', scales=list(draw=T))

[Figure: rainfall (Inches) at the station locations in UTM zone 15 coordinates, Easting about 300000 to 700000 and Northing about 4500000 to 4800000; legend classes [27.7,30.37], (30.37,33.03], (33.03,35.7], (35.7,38.36], (38.36,41.03]]

b)

IAborder <- read.csv('IAborder.csv', as.is=T)
IAborder.sp <- IAborder
coordinates(IAborder.sp) <- c('Longitude','Latitude')
proj4string(IAborder.sp) <- CRS('+proj=longlat +datum=NAD83')
IAborder.utm <- spTransform(IAborder.sp, CRS('+proj=utm +zone=15'))

ia.poly <- Polygon(IAborder.utm)
ia.sppoly <- SpatialPolygons(list(Polygons(list(ia.poly), 'StudyArea') ))
proj4string(ia.sppoly) <- CRS('+proj=utm +zone=15')
plot(ia.sppoly)


c)

predgrid <- expand.grid(Longitude=seq(200000, 750000, 10000),
                        Latitude=seq(4470000, 4823000, 10000))

coordinates(predgrid) <- c('Longitude','Latitude')
proj4string(predgrid) <- CRS('+proj=utm +zone=15')
gridded(predgrid) <- T

temp <- over(ia.sppoly, predgrid, returnList=T)

predgrid2 <- predgrid[temp[[1]], ]
plot(predgrid2)
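The clipping step keeps only the grid points that fall inside the state boundary. As a rough illustration of that idea, and not of how `over()` is actually implemented, here is a base R sketch of a ray-casting point-in-polygon test on a toy triangle. The function name and the toy coordinates are invented for the example.

```r
# Ray-casting test: is the point (px, py) inside the polygon whose vertices
# are (vx, vy)? Hypothetical helper, a stand-in for the clipping idea only.
point_in_polygon <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- FALSE
  j <- n                      # index of the previous vertex
  for (i in seq_len(n)) {
    # Does edge j -> i straddle the horizontal line through the point?
    if ((vy[i] > py) != (vy[j] > py)) {
      # x coordinate where the edge crosses that line
      x_cross <- vx[i] + (py - vy[i]) * (vx[j] - vx[i]) / (vy[j] - vy[i])
      # Count crossings to the right of the point; odd count = inside
      if (px < x_cross) inside <- !inside
    }
    j <- i
  }
  inside
}

# Toy study area (a triangle) and a regular grid over its bounding box
vx <- c(0, 10, 0)
vy <- c(0, 0, 10)
grid <- expand.grid(x = seq(0.5, 9.5, 1), y = seq(0.5, 9.5, 1))
keep <- mapply(point_in_polygon, grid$x, grid$y,
               MoreArgs = list(vx = vx, vy = vy))
grid_in <- grid[keep, ]
nrow(grid_in)
```

Points that lie exactly on an edge can go either way with this kind of test, which is one reason to rely on the tested `sp` routines for real boundaries.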

d)

# quadratic trend surface

fitRain2b <- lm(Inches ~ Latitude + Longitude + I(Latitude^2) + I(Longitude^2) +
                I(Latitude*Longitude), data=rain.utm)

predgrid2$quad <- predict(fitRain2b, newdata=predgrid2)
spplot(predgrid2, 'quad')

[Figure: map of the quadratic trend surface predictions from lm(); color scale from 26 to 38 inches]

# or

fitRain2 <- krige(Inches ~ 1, rain.utm, degree=2, newdata=predgrid2)

## [ordinary or weighted least squares prediction]

spplot(fitRain2, 'var1.pred')

[Figure: map of the quadratic trend surface predictions from krige(); color scale from 26 to 38 inches]

e)

# 2D smoothing spline

fitRain <- gam(Inches ~ s(Longitude, Latitude), data=rain.utm)
predgrid2$predRain <- predict(fitRain, newdata=data.frame(predgrid2) )
spplot(predgrid2, 'predRain')

[Figure: map of the 2D smoothing spline predictions; color scale from 28 to 38 inches]

f)

summary(fitRain2b)

##
## Call:
## lm(formula = Inches ~ Latitude + Longitude + I(Latitude^2) +
##     I(Longitude^2) + I(Latitude * Longitude), data = rain.utm)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.51535 -0.95023  0.07629  0.78341  2.90103
##
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)              6.576e+02  4.452e+02   1.477  0.14632
## Latitude                -2.425e-04  1.891e-04  -1.282  0.20605
## Longitude               -1.165e-04  7.074e-05  -1.647  0.10631
## I(Latitude^2)            2.258e-11  2.012e-11   1.122  0.26756
## I(Longitude^2)          -5.738e-11  1.009e-11  -5.689 7.94e-07 ***
## I(Latitude * Longitude)  3.921e-11  1.447e-11   2.710  0.00936 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.265 on 47 degrees of freedom
## Multiple R-squared:  0.8356, Adjusted R-squared:  0.8182
## F-statistic: 47.79 on 5 and 47 DF,  p-value: < 2.2e-16

summary(fitRain)

##
## Family: gaussian
## Link function: identity
##
## Formula:
## Inches ~ s(Longitude, Latitude)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  35.3594     0.1654   213.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##                         edf Ref.df     F p-value
## s(Longitude,Latitude) 10.59  14.61 17.95  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) =  0.835   Deviance explained = 86.9%
## GCV = 1.8563  Scale est. = 1.4503    n = 53

The fitted 2D spline model is more complicated, since its effective degrees of freedom (10.59) exceed the degrees of freedom of the quadratic trend model (5).

Note: If you look at the summary of the spline, you see two sets of information, one set for the parametric coefficients (i.e., fixed effects) and one set for the smooth terms. The only fixed effect in this model is the intercept. That tells you that the smooth effects do not include the intercept. The quadratic model has 6 parameters if you count the intercept and 5 if you do not. To compare to the spline, you need to ignore the intercept. 10.6 is larger than 5, so the spline is more complicated (wigglier) than a quadratic.

g)

fitRain$sig2 %>% sqrt

## [1] 1.204293

The RMSE for the quadratic trend model is 1.265 and for the 2D spline model is 1.204. Therefore, the 2D spline model has the smaller RMSE.

Note: the RMSE for the quadratic model is on the summary output, labelled Residual standard error.
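As a small check of what "Residual standard error" means for an `lm` fit, the sketch below reproduces it by hand on simulated toy data. The data, seed and variable names are made up for the illustration; only `summary()$sigma`, `residuals()` and `df.residual()` are standard R.

```r
set.seed(2)  # arbitrary seed for a reproducible toy data set
toy <- data.frame(x = runif(40), z = runif(40))
toy$y <- 30 + 5 * toy$x - 3 * toy$z^2 + rnorm(40, sd = 1.2)

fit <- lm(y ~ x + z + I(z^2), data = toy)

# Residual standard error as reported by summary()
rse_summary <- summary(fit)$sigma

# The same quantity by hand: sqrt(RSS / residual degrees of freedom)
rse_by_hand <- sqrt(sum(residuals(fit)^2) / df.residual(fit))

c(rse_summary, rse_by_hand)
```

Note that the divisor is the residual degrees of freedom (n minus the number of estimated coefficients), not n.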

h) I would expect to get the same map, since changing the scale does not change the locations of the points on the map.
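Assuming the change of scale means expressing the same coordinates in different units (for example metres versus kilometres), a quick simulated check shows that a quadratic trend surface gives the same fitted values either way. The data below are simulated, not the Iowa rainfall data, and all names are placeholders.

```r
set.seed(3)  # arbitrary seed; the data below are simulated, not the Iowa data
n <- 60
sim <- data.frame(
  east_m  = runif(n, 200000, 750000),    # made-up UTM-like eastings (m)
  north_m = runif(n, 4470000, 4823000)   # made-up UTM-like northings (m)
)
sim$rain <- 30 + 8e-6 * (sim$east_m - 200000) + rnorm(n)

# Same locations expressed in kilometres
sim$east_km  <- sim$east_m / 1000
sim$north_km <- sim$north_m / 1000

quad_m <- lm(rain ~ north_m + east_m + I(north_m^2) + I(east_m^2) +
               I(north_m * east_m), data = sim)
quad_km <- lm(rain ~ north_km + east_km + I(north_km^2) + I(east_km^2) +
                I(north_km * east_km), data = sim)

# Coefficients change with the units, fitted values do not
max_diff <- max(abs(fitted(quad_m) - fitted(quad_km)))
max_diff
```

The largest difference should be at the level of numerical rounding error, so the predicted surface, and hence the map, is unchanged.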

i) I would expect a similar but not identical map, since the locations of the points on the map change slightly when we use Longitude, Latitude coordinates instead of UTM coordinates.

Note: You can see this if you plot LongLat locations then UTM locations. The North side of Iowa “shrinks” a bit in UTM coordinates. You should be able to explain why this happens.
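One way to reason about the "shrinking" is to compute how long one degree of longitude is on the ground at the southern and northern edges of the rainfall map. The sketch below assumes a spherical Earth of radius 6371 km, and the function name is hypothetical.

```r
# Approximate east-west length (km) of one degree of longitude at a given
# latitude, on a sphere of radius 6371 km (an assumption; the real Earth is
# slightly ellipsoidal).
km_per_degree_lon <- function(lat_deg, radius_km = 6371) {
  (pi / 180) * radius_km * cos(lat_deg * pi / 180)
}

south <- km_per_degree_lon(40.5)  # near the southern edge of the rainfall map
north <- km_per_degree_lon(43.5)  # near the northern edge
c(south = south, north = north, ratio = north / south)
```

A degree of longitude covers less ground in the north, so a map drawn in metres is narrower there than a map drawn in degrees.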

Problem 4

a)

locs <- spsample(ia.sppoly, n=100, type='random')
plot(ia.sppoly)
points(locs)


b)

locs2 <- spsample(ia.sppoly, n=100, type='hexagonal')
plot(ia.sppoly)
points(locs2)

c)

locs3 <- spsample(ia.sppoly, n=100, type='clustered', nclusters=4)
plot(ia.sppoly)
points(locs3, pch=19, col=4)


d) Much more time, money, and other resources are required for the random and hexagonal sampling designs, since they collect data from all over the state. The cluster sample only requires traveling to four areas of the state.

Note: This practical difference is especially large if you do the random and hexagonal sampling correctly by sampling locations in a random order. You would go back and forth across the state a lot of times. You can sample the cluster design in random order by randomly choosing a cluster, then randomly choosing locations within the cluster.

e) The hexagonal sampling design will give the most precise estimate, since it is a systematic sample, as we discussed in lecture. It spreads points out and maximizes information when nearby observations are correlated.

The clustered sampling design will give the least precise estimate. Data collected from the same cluster should be highly correlated, and it’s possible that all 4 clusters are located in places with high rainfall, or all in places with low rainfall. It’s much easier to get an inaccurate result.
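The ranking of the three designs can be illustrated with a small simulation. This is only a sketch under strong assumptions: the study area is the unit square, the "rainfall" surface is an invented smooth function, and a square grid with a random start stands in for the hexagonal design.

```r
set.seed(4)  # arbitrary seed
# Made-up smooth "rainfall" surface on the unit square (an assumption, used
# only so that nearby locations have similar values)
surface <- function(x, y) 30 + 4 * x + 3 * sin(2 * pi * y)

n_rep <- 500
est <- matrix(NA_real_, n_rep, 3,
              dimnames = list(NULL, c("random", "grid", "clustered")))

# 10 x 10 square grid, shifted by a random offset in each replicate
g <- expand.grid(x = seq(0, 0.9, 0.1), y = seq(0, 0.9, 0.1))

for (r in seq_len(n_rep)) {
  # Simple random sample of 100 locations
  est[r, "random"] <- mean(surface(runif(100), runif(100)))

  # Systematic sample: the grid with a random starting offset
  est[r, "grid"] <- mean(surface(g$x + runif(1, 0, 0.1),
                                 g$y + runif(1, 0, 0.1)))

  # Clustered sample: 4 random centres, 25 locations close to each
  cx <- rep(runif(4, 0.05, 0.95), each = 25)
  cy <- rep(runif(4, 0.05, 0.95), each = 25)
  est[r, "clustered"] <- mean(surface(cx + runif(100, -0.05, 0.05),
                                      cy + runif(100, -0.05, 0.05)))
}

# Spread of the estimated mean under each design (smaller = more precise)
sd_by_design <- apply(est, 2, sd)
round(sd_by_design, 3)
```

With a surface like this one, the systematic design should show the smallest spread and the clustered design the largest, in line with the argument in part e).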
