
An Introduction to Mapping and Spatial Modelling in R

By and © Richard Harris, School of Geographical Sciences, University of Bristol

An Introduction to Mapping and Spatial Modelling in R by Richard Harris is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at www.social-statistics.org.

    You are free:

to Share: to copy, distribute and transmit the work

to Remix: to adapt the work

Under the following conditions:

Attribution: You must attribute the work in the following manner: Based on An Introduction to Mapping and Spatial Modelling R by Richard Harris (www.social-statistics.org).

Noncommercial: You may not use this work for commercial purposes. Use for education in a recognised higher education institution (a College or University) is permissible.

Share Alike: If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.

With the understanding that:

Waiver: Any of the above conditions can be waived if you get permission from the copyright holder (Richard Harris, [email protected]).

Public Domain: Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.

Other Rights: In no way are any of the following rights affected by the license: your fair dealing or fair use rights, or other applicable copyright exceptions and limitations; the author's moral rights; rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.

Notice: For any reuse or distribution, you must make clear to others the license terms of this work, which apply also to derivatives.

    (Document version 0.1, November, 2013. Draft version.)

Introduction and contents

This document presents a short introduction to R, highlighting some geographical functionality. Specifically, it provides:

    A basic introduction to R (Session 1)

    A short 'showcase' of using R for data analysis and mapping (Session 2)

    Further information about how R works (Session 3)

    Guidance on how to use R as a simple GIS (Session 4)

    Details on how to create a spatial weights matrix (Session 5)

    An introduction to spatial regression modelling including Geographically Weighted Regression (Session 6)

    Further sessions will be added in the months (more likely, years) ahead.

The document is provided in good faith and the contents have been tested by the author. However, use is entirely at the user's risk. Absolutely no responsibility or liability is accepted by the author for consequences arising from this document howsoever it is used. It is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License (see above).

Before starting, the following should be considered.

    First, you will notice that in this document the pages and, more unusually, the lines are numbered. The reason is educational: it makes directing a class to a specific part of a page easier and faster. For other readers, the line numbers can be ignored.

    Second, the sessions presume that, as well as R, a number of additional R packages (libraries) have been installed and are available to use. You can install them by following the 'Before you begin' instructions below.

    Third, each session is written to be completed in a single sitting. If that is not possible, then it would normally be possible to stop at a convenient point, save the workspace before quitting R, then reload the saved workspace when you wish to continue. Note, however, that whereas the additional packages (libraries) need be installed only once, they must be loaded each time you open R and require them. Any objects that were attached before quitting R also need to be attached again to take you back to the point at which you left off. See the sections entitled 'Saving and loading workspaces', 'Attaching a data frame' and 'Installing and loading one or more of the packages (libraries)' on pages 10, 31 and 37 for further information.
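As a sketch of that resume-a-session workflow (the file name and the objects here are illustrative, not commands from the sessions themselves):

> save.image("session1.RData")   # save the workspace before quitting R
> # ...later, in a new R session...
> load("session1.RData")         # the saved objects are restored
> library(sp)                    # but any packages used must be loaded again
> attach(schools.data)           # and any attached data frames re-attached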


Before you begin

Install R. It can be downloaded from http://cran.r-project.org/. I am currently using version 3.0.2.

Start R. Use the drop-down menus to change your working directory to somewhere you are happy to download all the files you need for this tutorial.

At the > prompt type,

download.file("http://dl.dropboxusercontent.com/u/214159700/RIntro.zip", "Rintro.zip")

and press return.

Next, type

unzip("Rintro.zip")

    All the data you need for the sessions are now available in the working directory.

If you would like to install all the libraries (packages) you need for these practicals, type

load("begin.RData")

and then

install.libs()

You are advised to read 'Installing and loading one or more of the packages (libraries)' on p. 37 before doing so.

Please note: this is a draft version of the document and has not yet been thoroughly checked for typos and other errors.


Session 1: Getting Started with R

This session provides a brief introduction to how R works and introduces some of the more common commands and procedures. Don't worry if not everything is clear at this stage. The purpose is to get you started, not to make you an expert user. If you would prefer to jump straight to seeing R in action, then move on to Session 2 (p.13) and come back to this introduction later.

1.1 About R

R is an open source software package, licensed under the GNU General Public Licence. You can obtain and install it for free, with versions available for PCs, Macs and Linux. To find out what is available, go to the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/

    Being free is not necessarily a good reason to use R. However, R is also well developed, well documented, widely used and well supported by an extensive user community. It is not just software for 'hobbyists'. It is widely used in research, both academic and commercial. It has well developed capabilities for mapping and spatial analysis.

In his book R in a Nutshell (O'Reilly, 2010), Joseph Adler writes that R is "very good at plotting graphics, analyzing data, and fitting statistical models using data that fits in the computer's memory". Nevertheless, no software provides the perfect tool for every job and Adler adds that it's "not good at storing data in complicated structures, efficiently querying data, or working with data that doesn't fit in the computer's memory".

    To these caveats it should be added that R does not offer spreadsheet editing of data of the type found, for example, in Microsoft Excel. Consequently, it is often easier to prepare and 'clean' data prior to loading them into R. There is an add-in to R that provides some integration with Excel. Go to http://rcom.univie.ac.at/ and look for RExcel.

    A possible barrier to learning R is that it is generally command-line driven. That is, the user types a command that the software interprets and responds to. This can be daunting for those who are used to extensive graphical user interfaces (GUIs) with drop-down menus, tabs, pop-up menus, left or right-clicking and other navigational tools to steer you through a process. It may mean that R takes a while longer to learn; however, that time is well spent. Once you know the commands it is usually much faster to type them than to work through a series of menu options. They can be easily edited to change things such as the size or colour of symbols on a graph, and a log or script of the commands can be saved for use on another occasion or for sharing with others.

That said, a fairly simple and platform-independent GUI called R Commander can be installed (see http://cran.r-project.org/web/packages/Rcmdr/index.html). Field et al.'s book Discovering Statistics Using R provides a comprehensive introduction to statistical analysis in R using both command lines and R Commander.

1.2 Getting Started

Assuming R has been installed in the normal way on your computer, clicking on the link/shortcut to R on the desktop will open the RGui, offering some drop-down menu options, and also the R Console, within which R commands are typed and executed. The appearance of the RGui differs a little depending upon the operating system being used (Windows, Mac or Linux) but having used one it should be fairly straightforward to navigate around another.


Figure 1.1. Screenshot of the RGui for Windows

    1.2.1 Using R as a calculator

At its simplest, R can be used as a calculator. Typing 1 + 1 after the prompt > will (after pressing the return/enter key) produce the result 2, as in the following example:

> 1 + 1
[1] 2

Comments can be indicated with a hash tag and will be ignored:

> # This is a comment, no need to type it

Some other simple mathematical expressions are given below.

> 10 - 5
[1] 5
> 10 * 2
[1] 20
> 10 - 5 * 2    # The order of operations gives priority to multiplication
[1] 0
> (10 - 5) * 2  # The use of brackets changes the order
[1] 10
> sqrt(100)     # Uses the function that calculates the square root
[1] 10
> 10^2          # 10 squared
[1] 100
> 100^0.5       # 100 to the power of 0.5, i.e. the square root again
[1] 10
> 10^3
[1] 1000
> log10(100)    # Uses the function that calculates the common log
[1] 2
> log10(1000)
[1] 3
> 100 / 5
[1] 20
> 100^0.5 / 5
[1] 2


1.2.2 Incomplete commands

    If you see the + symbol instead of the usual (>) prompt it is because what has been typed is incomplete. Often there is a missing bracket. For example,

> sqrt(
+ 100   # The + symbol indicates that the command is incomplete
+ )
[1] 10
> (1 + 2) * (5 - 1
+ )
[1] 12

Commands broken over multiple lines can be easier to read.

> for (i in 1:10) {   # This is a simple loop
+ print(i)            # printing the numbers 1 to 10 on-screen
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

    1.2.3 Repeating or modifying a previous command

If there is a mistake in a line of code that needs to be corrected, or if some previously typed commands will be repeated, then the up (↑) and down (↓) arrow keys on the keyboard can be used to scroll between previous entries in the R Console. Try it!

    1.3 Scripting and Logging in R

    1.3.1 Scripting

You can create a new script file from the drop-down menu File → New script (in Windows) or File → New Document (Mac OS). It is basically a text file in which you could write, for example, a series of R commands to be selected and run (using Ctrl + R in Windows). You could also write the commands in another text editor such as Notepad. Although you will not be able to use Ctrl + R in the same way, if R crashes for any reason you will not lose your script file.

    1.3.2 Logging

You can save the contents of the R Console window to a text file, which will then give you a log file of the commands you have been using (including any mistakes). The easiest way to do this is to click on the R Console (to take the focus from the scripting window) and then use File → Save History (in Windows) or File → Save As (Mac). Note that graphics are not usually plotted in the R Console and therefore need to be saved separately.

    1.4 Some R Basics

    1.4.1 Functions, assignments and getting help

    It is helpful to understand R as an object-oriented system that assigns information to objects within the current workspace. The workspace is simply all the objects that have been created or loaded since beginning the session in R. Look at it this way: the objects are like box files, containing useful information, and the workspace is a larger storage container, keeping the box files together. A useful feature of this is that R can operate on multiple tables of data at once: they are just stored as separate objects within the workspace.

To view the objects currently in the workspace, type

> ls()
character(0)

    Doing this runs the function ls(), which lists the contents of the workspace. The result, character(0), indicates that the workspace is empty. (Assuming it currently is).

To find out more about a function, type ? or help with the function name,

> ?ls()
> help(ls)

    This will provide details about the function, including examples of its use. It will also list the arguments required to run the function, some of which may be optional and some of which may have default values which can be changed as required. Consider, for example,

    > ?log()

A required argument is x, which is the data value or values. Typing log() omits any data and generates an error. However, log(100) works just fine. The argument base takes a default value of exp(1), i.e. e, which is approximately 2.72 and means the natural logarithm is calculated. Because the default is assumed unless otherwise stated, log(100) gives the same answer as log(100, base=exp(1)). Using log(100, base=10) gives the common logarithm, which can also be calculated using the convenience function log10(100).
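To see the effect of the base argument directly:

> log(100)                # the natural logarithm, using the default base
[1] 4.60517
> log(100, base=exp(1))   # identical to the above
[1] 4.60517
> log(100, base=10)       # the common logarithm
[1] 2
> log10(100)              # the convenience function gives the same result
[1] 2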

The results of mathematical expressions can be assigned to objects, as can the outcome of many commands executed in the R Console. When the object is given a name different to other objects within the current workspace, a new object will be created. Where the name and object already exist, the previous contents of the object will be over-written without warning, so be careful!

> a <- 5
> print(a)
[1] 5
> b <- 20
> print(b)
[1] 20
> print(a * b)
[1] 100
> a <- a * b
> print(a)
[1] 100

In these examples the assignment is achieved using the combination of < and -, as in a <- 100. Alternatively, assign("a", 100) could be used or, more simply, a = 100. The print(...) command can often be omitted, though it is useful, and sometimes necessary (for example, when what you hope should appear on-screen doesn't).

> f = a * b
> print(f)
[1] 2000
> f
[1] 2000
> sqrt(b)
[1] 4.472136
> print(sqrt(b), digits=3)   # The additional parameter now specifies
[1] 4.47                     # the number of significant figures
> c(a,b)                     # The c(...) function combines its arguments
[1] 100  20
> c(a,sqrt(b))
[1] 100.000000   4.472136
> print(c(a,sqrt(b)), digits=3)
[1] 100.00   4.47

    1.4.2 Naming objects in the workspace

Although the naming of objects is flexible, there are some exceptions,

> _a <- 10
Error: unexpected input in "_"
> 2a <- 10
Error: unexpected symbol in "2a"

Names are also case sensitive, so a and A are different objects:

> A <- 10
> a == A
[1] FALSE

The following is rarely sensible because the object won't appear in the workspace, although it is there,

> .a <- 10
> ls()
[1] "a" "b" "f"
> .a
[1] 10
> rm(.a, A)   # Removes the objects .a and A (see below)

    1.4.3 Removing objects from the workspace

From typing ls() we know when the workspace is not empty. To remove an object from the workspace it can be referenced explicitly, as in rm(A), or indirectly by its position in the workspace. To see how the second of these options works, type

    > ls()


  • [1] "a" "b" "f"The output returned from the ls() function is here a vector of length three where the first element is the first object (alphabetically) in the workspace, the second is the second object, and so forth. We can access specific elements by using notation of the form ls[index.number]. So, the first element the first object in the workspace can be obtained using,

    > ls()[1] # Get the brackets right! some rounded some square[1] "a"> ls()[2][1] "b"

    Note how the square brackets [] are used to reference specific elements within the vector. Similarly,

    > ls()[3][1] "f"> ls()[c(1,3)][1] "a" "f"> ls()[c(1,2,3)][1] "a" "b" "f"> ls()[c(1:3)] # 1:3 means the numbers 1 to 3[1] "a" "b" "f"

    Using the remove function, rm(...), the second and third objects in the workspace can be removed using

    > rm(list=ls()[c(1,3)])> ls()[1] "b"

    Alternatively, objects can be removed by name> rm(b)

    To delete all the objects in the workspace and therefore empty it, type the following code but be warned! there is no undo function. Whenever rm(...) is used the objects are deleted permanently.

    > rm(list=ls())> ls()character(0) # In other words, the workspace is empty

    1.4.4 Saving and loading workspaces

    Because objects are deleted permanently, a sensible precaution prior to using rm(...) is to save the workspace. To do so permits the workspace to be reloaded if necessary and the objects recovered. One way to save the workspace is to use

> save.image(file.choose(new=T))

Alternatively, the drop-down menus can be used (File → Save Workspace in the Windows version of the RGui). In either case, type the extension .RData manually or else it risks being omitted, making it harder to locate and reload what has been saved. Try creating a couple of objects in your workspace and then save it with the name workspace1.RData.

To load a previously saved workspace, use

> load(file.choose())

    or the drop-down menus.

    When quitting R, it will prompt to save the workspace image. If the option to save is chosen it will


be saved to the file .RData within the working directory. Assuming that directory is the default one, the workspace and all the objects it contains will be reloaded automatically each and every time R is opened, which could be useful but also potentially irritating. To stop it, locate and delete the file. The current working directory is identified using the get working directory function, getwd(), and changed most easily using the drop-down menus.

> getwd()
[1] "/Users/rich_harris"

(Your working directory will differ from the above.)
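The working directory can also be changed from the command line using setwd(...); the path below is purely illustrative:

> setwd("C:/Users/rich_harris/R_work")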

    Tip: A good strategy for file management is to create a new folder for each project in R, saving the workspace regularly using a naming convention such as Dec_8_1.RData, Dec_8_2.RData etc. That way you can easily find and recover work.
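If you wanted to generate such file names automatically, something like the following sketch would work (the naming pattern is an assumption based on the tip above, not part of the original text):

> fname <- paste(format(Sys.time(), "%b_%d"), "_1.RData", sep="")   # e.g. "Dec_08_1.RData"
> save.image(fname)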

1.5 Quitting R

Before quitting R, you may wish to save the workspace. To quit R use either the drop-down menus or

> q()

As promised, you will be prompted whether to save the workspace. Answering yes will save the workspace to the file .RData in the current working directory (see section 1.4.4, 'Saving and loading workspaces', on page 10, above). To exit without the prompt, use

> q(save = "no")

or, more simply,

> q("no")

1.6 Getting Help

In addition to the use of the ? or help() documentation and the material available at CRAN, http://cran.r-project.org/, R has an active user community. Helpful mailing lists can be accessed from www.r-project.org/mail.html.

Perhaps the best all-round introduction to R is An Introduction to R, which is freely available at CRAN (http://cran.r-project.org/manuals.html) or by using the drop-down Help menus in the RGui. It is clear and succinct.

    I also have a free introduction to statistical analysis in R which accompanies the book Statistics for Geography and Environmental Science. It can be obtained from http://www.social-statistics.org/?p=354.

There are many books available. My favourite, with a moderate statistical leaning and written with clarity, is,

    Maindonald, J. & Braun, J., 2007. Data Analysis and Graphics using R (2nd edition). Cambridge: CUP.

    I also find useful,

    Adler, J., 2010. R in a Nutshell. O'Reilly: Sebastopol, CA.

Crawley, M.J., 2005. Statistics: An Introduction using R. Chichester: Wiley (which is a shortened version of The R Book by the same author).


  • Field, A., Miles, J. & Field, Z., 2012. Discovering Statistics Using R. London: Sage

    However, none of these books is about mapping or spatial analysis (of particular interest to me as a geographer). For that, the authoritative guide making the links between geographical information science, geographical data analysis and R (but not really written for R newcomers) is,

Bivand, R.S., Pebesma, E.J. & Gómez-Rubio, V., 2008. Applied Spatial Data Analysis with R. Berlin: Springer.

    Also helpful is,

    Ward, M.D. & Skrede Gleditsch, K., 2008. Spatial Regression Models. London: Sage. (Which uses R code examples).

    And

    Chun, Y. & Griffith, D.A., 2013. Spatial Statistics and Geostatistics. London: Sage. (I found this book a little eccentric but it contains some very good tips on its subject and gives worked examples in R).

    The following book has a short section of maps as well as other graphics in R (and is also, as the title suggests, good for practical guidance on how to analyse surveys using cluster and stratified sampling, for example):

    Lumley, T., 2010. Complex Surveys. A Guide to Analysis Using R. Hoboken, NJ: Wiley.

Springer publish an ever-growing series of books under the banner Use R! If you are interested in visualization, time-series analysis, Bayesian approaches, econometrics, data mining and more, then you'll find something of relevance at http://www.springer.com/series/6991. But you may well also find what you are looking for, for free, on the Internet.


Session 2: A Demonstration of R

This session provides a quick tour of some of R's functionality, with a focus on some geographical applications. The idea here is to showcase a little of what R can do rather than to provide a comprehensive explanation of all that is going on. Aim for an intuitive understanding of the commands and procedures but do not worry about the detail. More information about the workings of R is given in the next session. More details about how to use R as a GIS and for spatial analysis are given in Sessions 4, 5 and 6.

Note: this session assumes the libraries RgoogleMaps, png, sp and spdep are installed and available for use. You can find out which packages you currently have installed by using

> row.names(installed.packages())

If the packages cannot be found then they can be installed using

> install.packages(c("RgoogleMaps","png","sp","spdep"))

Note that you may need administrative rights on your computer to install the packages (see Section 3.5.1, Installing and loading one or more of the packages (libraries), p.37).

2.1 Getting Started

As the focus of this session is on showing what R can do rather than teaching you how to do it, instead of requiring you to type a series of commands they will be executed automatically from a previously written source file (a script: see Section 1.3.1, page 7). As the commands are executed we will ask R to echo (print) them to the screen so you can follow what is going on. At regular intervals you will be prompted to press return before the script continues.

To begin, type,

> source(file.choose(), echo=T)

    and load the source file session2.R. After some comments that you should ignore, you will be prompted to load the .csv file schools.csv:

> ## Read in the file schools.csv
> wait()
Please press return

> schools.data <- read.csv(file.choose())
> head(schools.data)   # Shows the first few rows of the data
> tail(schools.data)   # Shows the bottom few rows of the data

We can produce a summary of each column in the data table using

> summary(schools.data)

    In this instance, each column is a continuous variable so we obtain a six-number summary of the centre and spread of each variable.

    The names of the variables are


> names(schools.data)

Next, the number of columns and rows, and a check row-by-row to see if the data are complete (have no missing data):

> ncol(schools.data)
> nrow(schools.data)
> complete.cases(schools.data)

It is not the most comprehensive check but everything appears to be in order.

2.3 Some simple graphics

The file schools.csv contains information about the location and some attributes of schools in Greater London (in 2008). The locations are given as a grid reference (Easting, Northing). The information is not real but is realistic. It should not, however, be used to make inferences about real schools in London.

    Of particular interest is the average attainment on leaving primary school (elementary school) of pupils entering their first year of secondary school. Do some schools in London attract higher attaining pupils more than others? The variable attainment contains this information.

    A stripchart and then a histogram will show that (not surprisingly) there is variation in the average prior attainment by school.

> attach(schools.data)
> stripchart(attainment, method="stack", xlab="Mean Prior Attainment by School")
> hist(attainment, col="light blue", border="dark blue", freq=F, ylim=c(0,0.30),
+ xlab="Mean attainment")

Here the histogram is scaled so the total area sums to one. To this we can add a rug plot,

> rug(attainment)

also a density curve, a Normal curve for comparison and a legend.

> lines(density(sort(attainment)))
> xx <- seq(min(attainment), max(attainment), length.out=100)
> yy <- dnorm(xx, mean=mean(attainment), sd=sd(attainment))
> lines(xx, yy, lty="dotted")
> rm(xx, yy)
> legend("topright", legend=c("density curve","Normal curve"),
+ lty=c("solid","dotted"))

It would be interesting to know if attainment varies by school type. A simple way to consider this is to produce a box plot. The data contain a series of dummy variables for each of a series of school types (Voluntary Aided Church of England school: coe = 1; Voluntary Aided Roman Catholic: rc = 1; Voluntary Controlled faith school: vol.con = 1; another type of faith school: other.faith = 1; a selective school (sets an entrance exam): selective = 1). We will combine these into a single, categorical variable then produce the box plot showing the distribution of average attainment by school type.

First the categorical variable:

> school.type <- rep("Not Faith/Selective", times=nrow(schools.data))
> school.type[coe==1] <- "VA CofE"
> school.type[rc==1] <- "VA RC"
> school.type[vol.con==1] <- "VC faith"
> school.type[other.faith==1] <- "Other Faith"
> school.type[selective==1] <- "Selective"
> school.type <- factor(school.type)
> levels(school.type)   # A list of the categories
[1] "Not Faith/Selective" "Other Faith" "Selective" [etc.]

Now the box plots:

> par(mai=c(1,1.4,0.5,0.5))   # Changes the graphic margins
> boxplot(attainment ~ school.type, horizontal=T, xlab="Mean attainment", las=1,
+ cex.axis=0.8)   # Includes options to draw the boxes and labels horizontally
> abline(v=mean(attainment), lty="dashed")   # Adds the mean value to the plot
> legend("topright", legend="Grand Mean", lty="dashed")

    Not surprisingly, the selective schools (those with an entrance exam) recruit the pupils with highest average prior attainment.

    Figure 2.1. A histogram with annotation in R


Figure 2.2. Mean prior attainment by school type

2.4 Some simple statistics

It appears (in Figure 2.2) that there are differences in the levels of prior attainment of pupils in different school types. We can test whether the variation is significant using an analysis of variance.

> summary(aov(attainment ~ school.type))
             Df Sum Sq Mean Sq F value Pr(>F)
school.type   5  479.8   95.95   71.42 <2e-16 ***
[etc.]

The differences are significant. We can also compare the mean attainment of the schools with the highest proportions of free school meal (FSM) eligible pupils against that of the schools with the lowest proportions:

> attainment.high.fsm.schools <- attainment[fsm > quantile(fsm, probs=0.75)]
> # Finds the attainment scores for schools with the highest proportions of FSM pupils
> attainment.low.fsm.schools <- attainment[fsm < quantile(fsm, probs=0.25)]
> t.test(attainment.high.fsm.schools, attainment.low.fsm.schools)

	Welch Two Sample t-test

data:  attainment.high.fsm.schools and attainment.low.fsm.schools
t = -15.0431, df = 154.164, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.437206 -2.639240
sample estimates:
mean of x mean of y
 26.58352  29.62174


It comes as little surprise to learn that those schools with the greatest proportions of FSM-eligible pupils are also those recruiting lower attaining pupils on average (mean attainment 26.6 vs 29.6, t = -15.0, p < 0.001; the 95% confidence interval is from -3.44 to -2.64).

Exploring this further, the Pearson correlation between the mean prior attainment of pupils entering each school and the proportion of them that are FSM eligible is -0.689, and significant (p < 0.001):

> round(cor(fsm, attainment),3)
> cor.test(fsm, attainment)

	Pearson's product-moment correlation

data:  fsm and attainment
t = -18.1731, df = 365, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7394165 -0.6313939
sample estimates:
       cor
-0.6892159

Of course, the use of the Pearson correlation assumes that the relationship is linear, so let's check:

> plot(attainment ~ fsm)
> abline(lm(attainment ~ fsm))   # Adds a line of best fit (a regression line)

There is some suggestion that the relationship might be curvilinear. However, we will ignore that here.
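If we did want to explore it, one quick check (a sketch, not part of the original analysis; the model name model.q is arbitrary) is to add a quadratic term and see whether it is significant:

> model.q <- lm(attainment ~ fsm + I(fsm^2))
> summary(model.q)   # A significant squared term would suggest curvature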

    Finally, some regression models. The first seeks to explain the mean prior attainment scores for the schools in London by the proportion of their intake who are free school meal eligible. (The result is the line of best fit added to the scatterplot above).

    The second model adds a variable giving the proportion of the intake of a white ethnic group.

The third adds a dummy variable indicating whether the school is selective or not.

> model1 <- lm(attainment ~ fsm, data=schools.data)
> summary(model1)

Call:
lm(formula = attainment ~ fsm, data = schools.data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.8871 -0.7413 -0.1186  0.5487  3.6681

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  29.6190     0.1148  258.12   <2e-16 ***
[etc.]

> model2 <- lm(attainment ~ fsm + white, data=schools.data)
> summary(model2)

Call:
lm(formula = attainment ~ fsm + white, data = schools.data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9442 -0.7295 -0.1335  0.5111  3.7837

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  30.1250     0.1979  152.21  < 2e-16 ***
fsm          -7.2502     0.4214  -17.20  < 2e-16 ***
white        -0.8722     0.2796   -3.12  0.00196 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.164 on 364 degrees of freedom
Multiple R-squared: 0.4887,	Adjusted R-squared: 0.4859
F-statistic: 173.9 on 2 and 364 DF,  p-value: < 2.2e-16

> model3 <- lm(attainment ~ fsm + white + selective, data=schools.data)
> summary(model3)

Call:
lm(formula = attainment ~ fsm + white + selective, data = schools.data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.6262 -0.5620  0.0537  0.5607  3.6215

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  29.1706     0.1689 172.712   <2e-16 ***
[etc.]

The second and third models can be compared using an analysis of variance:

> anova(model2, model3)
Analysis of Variance Table

Model 1: attainment ~ fsm + white
Model 2: attainment ~ fsm + white + selective
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    364 307.42
2    363 306.54  1   0.88222 1.0447 0.3074

    The residual error, measured by the residual sum of squares (RSS), is not very different for the two models, and that difference, 0.882, is not significant (F = 1.045, p = 0.307).

2.5 Some simple maps

For a geographer like myself, R becomes more interesting when we begin to look at its geographical data handling capabilities.

The schools data contain geographical coordinates and are therefore geographical data. Consequently, they can be mapped. The simplest way for point data is to use a 2-dimensional plot, making sure the aspect ratio is fixed correctly.

> plot(Easting, Northing, asp=1, main="Map of London schools")
> # The argument asp=1 fixes the aspect ratio correctly

Amongst the attribute data for the schools, the variable esl gives the proportion of pupils who speak English as an additional language. It would be interesting for the size of the symbol on the map to be proportional to it.

> plot(Easting, Northing, asp=1, main="Map of London schools",
+ cex=sqrt(esl*5))

It would also be nice to add a little colour to the map. We might, for example, change the default plotting 'character' to a filled circle with a yellow background.

> plot(Easting, Northing, asp=1, main="Map of London schools",
+ cex=sqrt(esl*5), pch=21, bg="yellow")

A more interesting option would be to have the circles filled with a colour gradient that is related to a second variable in the data: the proportion of pupils eligible for free school meals, for example.

To achieve this, we can begin by creating a simple colour palette and assigning each school to a quartile class of the fsm variable:

> palette <- c("yellow", "orange", "red", "purple")
> map.class <- cut(fsm, quantile(fsm), labels=FALSE, include.lowest=TRUE)
> plot(Easting, Northing, asp=1, main="Map of London schools",
+ cex=sqrt(esl*5), pch=21, bg=palette[map.class])

    It would be good to add a legend, and perhaps a scale bar and North arrow. Nevertheless, as a first map in R this isn't too bad!
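As an illustration of how a legend might be added (the class labels below are assumptions based on the quartile classification above, not the author's own code):

> legend("topleft", legend=c("lowest fsm", "2nd quartile", "3rd quartile", "highest fsm"),
+ pch=21, pt.bg=palette, pt.cex=1.5, title="FSM quartile")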


Figure 2.3. A simple point map in R

    Why don't we be a bit more ambitious and overlay the map on a Google Maps tile, adding a legend as we do so? This requires us to load an additional library for R and to have an active Internet connection.

> library(RgoogleMaps)

If you get an error such as the following,

Error in library(RgoogleMaps) : there is no package called 'RgoogleMaps'

it is because the library has not been installed.

    Assuming that the data frame, schools.data, remains in the workspace and attached (it will be if you have followed the instructions above), and that the colour palette created above has not been deleted, then the map shown in Figure 2.4 is created with the following code:

> MyMap <- MapBackground(lat=Lat, lon=Long)
> PlotOnStaticMap(MyMap, Lat, Long, cex=sqrt(esl*5), pch=21, bg=palette[map.class])
> legend("topleft", legend=paste("

Remember that the data are simulated. The points shown on the map are not the true locations of schools in London. Do not worry about understanding the code in detail; the purpose is to see the sort of things R can do with geographical data. We will look more closely at the detail in later sessions.

    Figure 2.4. A slightly less simple map produced in R

2.6 Some simple geographical analysis

Remember the regression models from earlier? It would be interesting to test the assumption that the residuals exhibit independence by looking for spatial dependencies. To do this we will consider to what degree the residual value for any one school correlates with the mean residual value for its six nearest other schools (the choice of six is completely arbitrary).

First, we will take a copy of the schools data and convert it into an explicitly spatial object in R:

> detach(schools.data)
> schools.xy <- schools.data
> library(sp)
> attach(schools.xy)
> coordinates(schools.xy) <- c("Easting", "Northing")   # Converts into a spatial object
> class(schools.xy)
> detach(schools.xy)
> proj4string(schools.xy) <- CRS("+init=epsg:27700")
> # Sets the Coordinate Referencing System (the British National Grid)

Second, we find the six nearest neighbours for each school.

> library(spdep)
> nearest.six <- knearneigh(schools.xy, k=6, RANN=F)
> # RANN = F to override the use of the RANN package that may not be installed

    We can learn from this that the six nearest schools to the first school in the data (row 1) are schools 5, 38, 2, 40, 223 and 6:

> nearest.six$nn[1,]
[1]   5  38   2  40 223   6

The neighbours object, nearest.six, is an object of class knn:

> class(nearest.six)

It is next converted into the more generic class of neighbours.

> neighbours <- knn2nb(nearest.six)
> class(neighbours)
[1] "nb"
> summary(neighbours)
Neighbour list object:
Number of regions: 367
Number of nonzero links: 2202
Percentage nonzero weights: 1.634877
Average number of links: 6
[etc.]

    The connections between each point and its neighbours can then be plotted. It may take a few minutes.

> plot(neighbours, coordinates(schools.xy))

Having identified the six nearest neighbours to each school we could give each equal weight in a spatial weights matrix or, alternatively, decrease the weight with distance away (so the first nearest neighbour gets most weight and the sixth nearest the least). Creating a matrix with equal weight given to all neighbours is sufficient for the time being.

> spatial.weights <- nb2listw(neighbours)
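The check described at the start of this section could then be completed along the following lines (a sketch, not the author's original code): with equal, row-standardised weights the spatially lagged value of the residuals is the mean residual of each school's six neighbours.

> resids <- residuals(model3)
> mean.neighbour.resids <- lag.listw(spatial.weights, resids)
> cor(resids, mean.neighbour.resids)   # Correlation with the neighbours' mean residual
> moran.test(resids, spatial.weights)  # A more formal test of spatial autocorrelation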

2.7 Tidying up

It is better to save your workspace regularly whilst you are working (see Section 1.4.4, 'Saving and loading workspaces', page 10) and certainly before you finish. Don't forget to include the extension .RData when saving. Having done so, you can tidy up the workspace.

> save.image(file.choose(new=T))
> rm(list=ls())   # Be careful, it deletes everything!

2.8 Further Information

A simple introduction to graphics and statistical analysis in R is given in Statistics for Geography and Environmental Science: An Introduction in R, available at http://www.social-statistics.org/?p=354.


Session 3: A Little More about the workings of R

This session provides a little more guidance on the 'inner workings' of R. All the commands are contained in the file session3.R and can be run using it (see the section 'Scripting' on p.7). You can, if you wish, skip this session and move straight on to the sessions on mapping and spatial modelling in R, returning to this later to better understand some of the commands and procedures you will have used.

3.1 Classes, types and coercion

Let us create two objects, each a vector containing ten elements. The first will be the numbers from one to ten, recorded as integers. The second will be the same sequence but now recorded as real numbers (that is, 'floating point' numbers, those with a decimal place).

> b <- 1:10
> b
 [1]  1  2  3  4  5  6  7  8  9 10
> c <- seq(from=1, to=10, by=1)
> c
 [1]  1  2  3  4  5  6  7  8  9 10

Note that in the second case, we could just type,

> c <- seq(1, 10, 1)
> c
 [1]  1  2  3  4  5  6  7  8  9 10

This works because if we don't explicitly name the arguments (by omitting from=1 etc.) then R will assume we are giving values to the arguments in their default order, which in this case is the order from, to and by. Type ?seq and look under Usage for this to make a little more sense.
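The same applies generally: named arguments can be given in any order, whereas unnamed ones are matched by position. For example, all of the following are equivalent:

> seq(from=1, to=10, by=1)   # fully named
> seq(1, 10, 1)              # positional: from, to, by
> seq(by=1, to=10, from=1)   # named arguments can appear in any order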

    In any case, the two objects, b and c, appear the same on screen but one is an object of class integer whereas the other is an object of class numeric and of type double (double precision in the memory space).

> class(b)
[1] "integer"
> class(c)
[1] "numeric"
> typeof(c)
[1] "double"

Often it is possible to coerce an object from one class and type to another.

> b <- as.integer(b)
> class(b)
[1] "integer"
> b <- as.numeric(b)
> class(b)
[1] "numeric"
> typeof(b)
[1] "double"
> class(c)
> c <- as.integer(c)
> class(c)
[1] "integer"
> c


 [1]  1  2  3  4  5  6  7  8  9 10
> c <- as.character(c)
> class(c)
[1] "character"
> c
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

    The examples above are trivial. However, it is important to understand that seemingly generic functions like summary(...) can produce outputs that are dependent upon the class type. Try, for example,

    > class(b)[1] "numeric"> summary(b) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 3.25 5.50 5.50 7.75 10.00 > class(c)[1] "character"> summary(c) Length Class Mode 10 character character

In the first instance, a six-number summary of the centre and spread of the numeric data is given. That makes no sense for character data, so the second summary instead gives the length of the vector, its class type and its storage mode.
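The same class dependence appears with other classes too. For a factor (categorical data), for example, summary(...) tabulates the levels:

> f <- factor(c("a", "b", "b"))
> summary(f)
a b
1 2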

    A more interesting example is provided if we consider the plot(...) command, used first with a single data variable, secondly with two variables in a data table, and finally on a model of the relationship between those two variables.

    The first variable is created by generating 100 observations drawn randomly from a Normal distribution with mean of 100 and a standard deviation of 20.

> var1 <- rnorm(n=100, mean=100, sd=20)
> set.seed(1)   # Fixing the random number seed means the same values are drawn each time
> var1 <- rnorm(n=100, mean=100, sd=20)
> class(var1)
[1] "numeric"
> length(var1)   # The number of elements in the vector
[1] 100
> summary(var1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  55.71   90.12  102.30  102.20  113.80  148.00
> head(var1)   # The first few elements
[1]  87.47092 103.67287  83.28743 131.90562 106.59016  83.59063
> tail(var1)   # The last few elements
[1] 131.73667 111.16973  74.46816  88.53469  75.50775  90.53199

    They seem fine! Returning to the use of the plot(...) command, in this instance it simply plots the data in order of their position in the vector.

    > plot(var1)


Figure 3.1. A simple plot of a numeric vector

    To demonstrate a different interpretation of the plot command, a second variable is created that is a function of the first but with some random error.

> set.seed(101)
> var2 <- 3 * var1 + rnorm(n=100, mean=0, sd=25)   # A function of var1 plus random error
> mydata <- data.frame(x = var1, y = var2)
> class(mydata)
[1] "data.frame"
> head(mydata)
          x        y
1  87.47092 264.2619
2 103.67287 334.8301
3  83.28743 242.9887
4 131.90562 411.0758
5 106.59016 337.5397
6  83.59063 290.1211
> nrow(mydata)   # The number of rows in the data
[1] 100
> ncol(mydata)   # The number of columns
[1] 2

    In this case, plotting the data frame will produce a scatter plot (to which the line of best fit shown in Figure 3.2 will be added shortly).

> plot(mydata)

If there had been more than two columns in the data table, or if they had not been arranged in x, y order, then the plot could be produced by referencing the columns directly. All the following are equivalent:


> with(mydata, plot(x, y))       # Here the order is x, y
> with(mydata, plot(y ~ x))      # Here it is y ~ x
> plot(mydata$x, mydata$y)
> plot(mydata[,1], mydata[,2])   # Plot using the first and second columns
> plot(mydata[,2] ~ mydata[,1])

    The attach(...) command could also be used. This is introduced in Section 3.2.2, 'Attaching a data frame' on page 31.

    Figure 3.2. A scatter plot. A line of best fit has been added.

    The line of best fit in Figure 3.2 is a regression line. To fit the regression model, summarising the relationship between y and x, use

> model1 <- lm(y ~ x, data=mydata)
> class(model1)
[1] "lm"

    model1 is an object of class lm, short for linear model. Using the summary(...) function summarises the relationship between y and x.

> summary(model1)

Call:
lm(formula = y ~ x, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max
-57.102 -16.274   0.484  15.188  47.290

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.6462    13.6208   0.635    0.527
x             3.0042     0.1313  22.878   <2e-16 ***
[etc.]

The equation of the fitted line is y = 8.65 + 3.00x.

Now, using the plot(...) function on the object of class lm has an effect that is different from the previous two cases. It produces a series of diagnostic plots to help check that the assumptions of regression have been met.

> plot(model1)

The first plot is a check for non-constant variance and outliers, the second for normality of the model residuals, the third is similar to the first, and the fourth identifies both extreme residuals and leverage points.

    These four plots can be viewed together, changing the default graphical parameters to show the plots in a 2-by-2 array (as in Figure 3.3).

> par(mfrow = c(2,2))   # Sets the graphical output to be 2 x 2
> plot(model1)

    Finally, we might like to go back to our previous scatter plot and add the regression line of best fit to it,

> par(mfrow = c(1,1))   # Resets the window to a single graph
> plot(mydata)
> abline(model1)

Figure 3.3. Default plots for an object of class linear model

3.2 Data frames

The preceding section introduced the data frame as a class of object containing a table of data where the variables are the columns of the data and the rows are the observations.


> class(mydata)
> summary(mydata)

Looking at the data summary, the object mydata contains two columns, labelled x and y. These column headers can also be revealed by using

> names(mydata)
[1] "x" "y"

or with

> colnames(mydata)
[1] "x" "y"

    The row names appear to be the numbers from 1 to 100 (the number of rows in the data), though actually they are character data:

> rownames(mydata)
  [1] "1" "2" "3" "4" "5" "6" "7" "8" [etc.]
> class(rownames(mydata))
[1] "character"

The column names can be changed either individually or together. Individually:

> names(mydata)[1] <- "v1"
> names(mydata)[2] <- "v2"
> names(mydata)
[1] "v1" "v2"

All at once:

> names(mydata) <- c("x", "y")
> names(mydata)
[1] "x" "y"

as can the row names,

> rownames(mydata)[1] <- "0"
> rownames(mydata)
  [1] "0" "2" "3" "4" "5" "6" "7" "8" [etc.]
> rownames(mydata) = seq(from=0, by=1, length.out=nrow(mydata))
> rownames(mydata)
  [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" [etc.]

The above can be especially useful when merging data tables with GIS shapefiles in R (because the first entry in an attribute table for a shapefile usually is given an ID of 0). Otherwise, it is usually easiest for the first row in a data table to be labelled 1, so let's put them back to how they were.

> rownames(mydata) = 1:nrow(mydata)
> rownames(mydata)
  [1] "1" "2" "3" "4" "5" "6" "7" "8" [etc.]

    3.2.1 Referencing rows and columns in a data frame

The square bracket notation can be used to index specific rows, columns or cells in the data frame. For example:

> mydata[1,]   # The first row of data
         x        y
1 87.47092 264.2619
> mydata[2,]   # The second row of data
         x        y
2 103.6729 334.8301
> round(mydata[2,],2)   # The second row, rounded to 2 decimal places
       x      y


2 103.67 334.83
> mydata[nrow(mydata),]   # The final row of the data
           x       y
100 90.53199 261.236
> mydata[,1]   # The first column of data
  [1]  87.47092 103.67287  83.28743 131.90562 [etc.]
> mydata[,2]   # The second column, which here is also
  [1] 264.2619 334.8301 242.9887 411.0758 337.5397 [etc.]
> mydata[,ncol(mydata)]   # the final column of data
  [1] 264.2619 334.8301 242.9887 411.0758 337.5397 [etc.]
> mydata[1,1]   # The data in the first row of the first column
[1] 87.47092
> mydata[5,2]   # The data in the fifth row of the second column
[1] 337.5397
> round(mydata[5,2],0)
[1] 338

Specific columns of data can also be referenced using the $ notation,

> mydata$x   # Equivalent to mydata[,1] because the column name is x
  [1]  87.47092 103.67287  83.28743 131.90562 106.59016 [etc.]
> mydata$y
  [1] 264.2619 334.8301 242.9887 411.0758 337.5397 290.1211 [etc.]
> summary(mydata$x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  55.71   90.12  102.30  102.20  113.80  148.00
> summary(mydata$y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  140.4   284.1   314.1   315.6   355.7   447.6
> mean(mydata$x)
[1] 102.1777
> median(mydata$y)
[1] 314.1226
> sd(mydata$x)   # Gives the standard deviation of x
[1] 17.96399
> boxplot(mydata$y)
> boxplot(mydata$y, horizontal=T, main="Boxplot of variable y")

(Boxplots are sometimes said to be easier to read when drawn horizontally.)

One way to avoid the use of the $ notation is to use the function with(...) instead:

> with(mydata, var(x))   # Gives the variance of x
[1] 322.7048
> with(mydata, plot(y, xlab="Observation number"))

    However, even this is cumbersome so the attach function may be preferred...

    3.2.2 Attaching a data frame

    Sometimes any of the ways to access a specific part of a data table becomes tiresome and it is useful to reference the column or variable name directly. For example, instead of having to type mean(mydata[,1]), mean(mydata$x) or with(mydata, mean(x)) it would be easier just to refer to the variable of interest, x, as in mean(x).

To achieve this the attach(...) command is used. Compare, for example,

> mean(x)


Error in mean(x) : object 'x' not found

(which generates an error because there is no object called x in the workspace; it is only a column name within the data frame mydata) with

> attach(mydata)
> mean(x)
[1] 102.1777

(which works fine). If, to use the earlier analogy, objects in R's workspace are like box files, then now you have opened one up and its contents (which include the variable x) are visible.

To detach the contents of the data frame use detach(...)

> detach(mydata)
> mean(x)
Error in mean(x) : object 'x' not found

It is sensible to use detach when the data frame is no longer being used or else confusion can arise when multiple data frames contain the same column names, as in the following example:

> attach(mydata)
> mean(x)   # This will give the mean of mydata$x
[1] 102.1777
> mydata2 = data.frame(x = 1:10, y = 11:20)
> head(mydata2)
  x  y
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
> attach(mydata2)
The following object(s) are masked from 'mydata':

    x, y

> mean(x)   # This will now give the mean of mydata2$x
[1] 5.5
> detach(mydata2)
> mean(x)
[1] 102.1777
> detach(mydata)
> rm(mydata2)

    3.2.3 Sub-setting the data table and logical queries

Subsets of a data frame can be created by referencing specific rows within it. For example, imagine we want a table only of those observations that have a value above the mean of some variable.

> attach(mydata)
> subset <- which(x > mean(x))
> class(subset)
[1] "integer"
> subset
 [1]  2  4  5  7  8  9 11 12 15 18 19 20 21 22 25 30 31 33 [etc.]
> mydata.sub <- mydata[subset,]
> head(mydata.sub)
         x        y
2 103.6729 334.8301
4 131.9056 411.0758
5 106.5902 337.5397


7 109.7486 354.7155
8 114.7665 351.4811
9 111.5156 367.4726

    Note how the row names of this subset have been inherited from the parent data frame.

    A more direct approach is to define the subset as a logical vector that is either true or false dependent upon whether a condition is met.

> subset <- x > mean(x)
> class(subset)
[1] "logical"
> subset
  [1] FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE [etc.]
> mydata.sub <- mydata[subset,]
> head(mydata.sub)
         x        y
2 103.6729 334.8301
4 131.9056 411.0758
5 106.5902 337.5397
7 109.7486 354.7155
8 114.7665 351.4811
9 111.5156 367.4726

A yet more succinct way of achieving the same is:

> mydata.sub <- mydata[x > mean(x),]
> # Selects those rows that meet the logical condition, and all columns
> head(mydata.sub)
         x        y
2 103.6729 334.8301
4 131.9056 411.0758
5 106.5902 337.5397
7 109.7486 354.7155
8 114.7665 351.4811
9 111.5156 367.4726

    In the same way, to select those rows where x is greater than or equal to the mean of x and y is greater than or equal to the mean of y

> mydata.sub <- mydata[x >= mean(x) & y >= mean(y),]
> # The symbol & is used for and

Or, those rows where x is less than the mean of x or y is less than the mean of y,

> mydata.sub <- mydata[x < mean(x) | y < mean(y),]
> # The symbol | is used for or
> detach(mydata)

3.2.4 Handling missing data

Missing data are recorded in R as NA. To illustrate how they are handled, we can introduce some missing values into the data:

> mydata[1,1] = NA
> mydata[2,2] = NA
> head(mydata)
          x        y
1        NA 264.2619
2 103.67287       NA
3  83.28743 242.9887
4 131.90562 411.0758
5 106.59016 337.5397
6  83.59063 290.1211


R will, by default, report NA or an error when some calculations are tried with missing data:

> mean(mydata$x)
[1] NA
> quantile(mydata$y)
Error in quantile.default(mydata$y) :
  missing values and NaN's not allowed if 'na.rm' is FALSE

To overcome this, the default can be changed or the missing data removed.

To ignore the missing data in the calculation,

> mean(mydata$x, na.rm=T)
[1] 102.3263
> quantile(mydata$y, na.rm=T)   # Divides the data into quartiles
       0%       25%       50%       75%      100%
 140.4475  282.1064  313.7536  356.5862  447.6020

Alternatively, there are various ways to remove the missing data. For example,

> subset <- !is.na(mydata$x)
> head(subset)
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

Using the subset,

> x2 <- mydata$x[subset]
> mean(x2)
[1] 102.3263

More succinctly,

> with(mydata, mean(x[!is.na(x)]))
[1] 102.3263

Alternatively, a new data frame can be created without any missing data, whereby any row with any missing value is omitted.

> subset <- complete.cases(mydata)
> head(subset)
[1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE
> mydata.complete = mydata[subset,]
> head(mydata.complete)
          x        y
3  83.28743 242.9887
4 131.90562 411.0758
5 106.59016 337.5397
6  83.59063 290.1211
7 109.74858 354.7155
8 114.76649 351.4811

    3.2.5 Reading data from a file into a data frame

    The accompanying file schools.csv (used in Session 2) contains information about the location and some attributes of schools in Greater London (in 2008). The locations are given as a grid reference (Easting, Northing). The information is not real but is realistic.

    A standard way to read a file into a data frame, with cases corresponding to lines and variables to fields in the file, is to use the read.table(...) command.


> ?read.table

In the case of schools.csv, the file is comma delimited and has column headers. Looking through the arguments for read.table, the data might be read into R using

> schools.data <- read.table("schools.csv", header=TRUE, sep=",")
> ncol(schools.data)
[1] 17
> nrow(schools.data)
[1] 366
> summary(schools.data)
      FSM              EAL              SEN          [etc.]
 Min.   :0.0000   Min.   :0.0000   Min.   :0.00000
 1st Qu.:0.1323   1st Qu.:0.1472   1st Qu.:0.00800
 Median :0.2500   Median :0.3165   Median :0.02000
 Mean   :0.2702   Mean   :0.3491   Mean   :0.02308
 3rd Qu.:0.3897   3rd Qu.:0.5122   3rd Qu.:0.03400
 Max.   :0.7730   Max.   :1.0000   Max.   :0.11300

    It seems to be fine.

    For more about importing and exporting data in R, consult the R help document, R Data Import/Export (see under the Help menu in R or http://cran.r-project.org/manuals.html).

3.3 Lists

A list is a little like a data frame but offers a more flexible way to gather objects of different classes together. For example,

> mylist <- list(schools.data, model1, "a")
> class(mylist)
[1] "list"

To find the number of components in a list, use length(...),

> length(mylist)
[1] 3

    Here the first component is the data frame containing the schools data. The second component is the linear model created earlier. The third is the character a. To reference a specific component, double square brackets are used:

    > head(mylist[[1]], n=3)


    FSM   EAL   SEN white blk.car blk.afr indian pakistani [etc.]
1 0.659 0.583 0.031 0.217   0.032   0.222  0.002     0.020
2 0.391 0.424 0.001 0.350   0.087   0.126  0.003     0.012
3 0.708 0.943 0.038 0.048   0.000   0.239  0.000     0.004

> summary(mylist[[2]])

Call:
lm(formula = y ~ x, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max
-57.102 -16.274   0.484  15.188  47.290

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.6462    13.6208   0.635    0.527
x             3.0042     0.1313  22.878   <2e-16 ***
[etc.]

> class(mylist[[3]])
[1] "character"

The double square brackets can be combined with single ones. For example,

> mylist[[1]][1,]
    FSM   EAL   SEN white blk.car blk.afr indian pakistani [etc.]
1 0.659 0.583 0.031 0.217   0.032   0.222  0.002     0.020

is the first row of the schools data. The first cell of the same data is

> mylist[[1]][1,1]
[1] 27

3.4 Writing a function

In brief, a function is written in R in the following way,

> function.name <- function(argument1, argument2) {
+ # code to execute
+ }
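For example, a simple (hypothetical) my.function might return the mean of a vector after first dropping any missing values:

> my.function <- function(x) {
+ mean(x, na.rm=TRUE)   # remove NAs, then average
+ }
> my.function(c(1, 2, NA, 3))
[1] 2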

3.5 R packages for mapping and spatial data analysis

By default, R comes with a base set of packages and methods for data analysis and visualization. However, there are many other packages available, too, that greatly extend R's value and functionality. These packages are listed alphabetically at http://cran.r-project.org/web/packages/available_packages_by_name.html.

    Because there are so many, it can be useful to browse the packages by topic (at http://cran.r-project.org/web/views/). The topic, or 'task view' of particular interest here is the analysis of spatial data: http://cran.r-project.org/web/views/Spatial.html

    3.5.1 Installing and loading one or more of the packages (libraries)

Note: If reading this in class it is likely that the packages have been installed already or you will not have the administrative rights to install them. If so, this section is for information only. There is also no need to install the packages if you have done so already (when following the instructions under 'Before you begin' on p.3).

To install a specific package the install.packages(...) command is used, as in:

> install.packages("ctv")
Installing package(s) into '/Users/ggrjh/Library/R/2.13/library'
(as 'lib' is unspecified)
trying URL 'http://cran.uk.r-project.org/bin/macosx/leopard/contrib/2.13/ctv_0.7-4.tgz'
Content type 'application/x-gzip' length 289693 bytes (282 Kb)
opened URL
==================================================
downloaded 282 Kb

    The package needs to be installed once but loaded each time R is started, using the library(...) command

    > library("ctv")In this case what has been installed is a package that will now allow all the packages associated with the spatial task view to be installed together, using:

    > install.views("Spatial")Note that installing packages may, by default, require access to a directory/folder for which administrative rights are required. If necessary, it is entirely possible to install R (and therefore the additional packages) in, for example, 'My Documents' or on a USB stick.

    3.5.2 Checking which packages are installed

You can see which packages are installed by using

> row.names(installed.packages())

To see whether a specific package is installed use

> is.element("sp", installed.packages()[,1])

    # replace sp with another package name to check if that is installed
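Combining the two ideas, a small sketch (not in the original text) that installs a package only if it is not already present:

> if (!is.element("sp", installed.packages()[,1])) {
+     install.packages("sp")
+ }
> library(sp)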

3.6 Tidying up and quitting

You may want to save and/or tidy up your workspace before quitting R. See sections 1.5 and 2.7 on pages 11 and 23.


3.7 Further Information

See An Introduction to R, available at CRAN (http://cran.r-project.org/manuals.html) or by using the drop-down Help menus in the RGui.


Session 4: Using R as a simple GIS

This session introduces some of the libraries available for R that allow it to be used as a simple GIS. Examples of plotting (x, y) point data, producing choropleth maps, creating a raster, and undertaking a spatial join are given. The code required may seem quite difficult on first impression but it becomes easier with practice, plus often it is a case of 'recycling' existing code, making small changes where required. The advantages of using R instead of a normal standalone GIS are addressed with an example of mapping the residuals from a simple econometric (regression) model and looking for a correlation between neighbours.

    For this session you will need to have the following libraries installed: sp, maptools, GISTools, classInt, RColorBrewer, raster and spdep (see 'Installing and loading one or more of the packages (libraries)', p. 37).

    4.1 Simple Maps of XY Data

    4.1.1 Loading and Mapping XY Data

We begin by reading some XY Data into R. The file landprices.csv is in a comma separated format and contains information about land parcels in Beijing, including a point georeference (a centroid) marking the centre of the land parcel. The data are simulated rather than real, but they are realistic.

> landdata <- read.csv("landprices.csv")   # reads the comma separated file
> head(landdata, n=3)
         x       y LNPRICE LNAREA DCBD DELE DRIVER DPARK Y0405 Y0607 Y0809
1 454393.1 4417809    9.29  11.84 8.28 7.25   7.38  8.09     0     1     0
2 442744.9 4417781    8.64  10.97 9.04 5.61   8.41  7.51     0     1     0
3 444191.7 4416996    6.69   7.92 8.89 5.62   8.22  7.27     0     1     0

    Given the geographical coordinates, it is possible to map the data as they are in much the same way as we did in Section 2.5 ('Some simple maps'). At its simplest,

> with(landdata, plot(x, y, asp=1))

However, given an interest in undertaking spatial analysis in R, it would be better to convert the data into what R will explicitly recognise as a spatial object. For this we will require the sp library. Assuming it is installed,

> library(sp)

We can now coerce landdata into an object of class spatial (sp) by telling R that the geographical coordinates for the data are found in columns 1 and 2 of the current data frame.

> coordinates(landdata) = c(1,2)
> class(landdata)
[1] "SpatialPointsDataFrame"
attr(,"package")
[1] "sp"
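As an aside, had landdata not already been converted, the same coercion can be written with a formula naming the coordinate columns (an equivalent form, assuming the columns are named x and y as above):

> coordinates(landdata) <- ~x+y   # equivalent to coordinates(landdata) = c(1,2)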

The locations of the data points are now simply plotted using

> plot(landdata)

or, if we prefer a different symbol,

> plot(landdata, pch=10)
> plot(landdata, pch=4)

(type ?points and scroll down to below 'pch values' to see the options for different types of point character)

    4.1.2 The Coordinate Reference System (CRS)

If you type summary(landdata) you will find it is an object of class SpatialPointsDataFrame, meaning it is comprised of geographic data (the Spatial Points) and also attribute data (information about the land parcels at those points). The bounding box, the rectangle enclosing all the points, has the coordinates (428554.2, 4406739.2) at its bottom-left corner and the coordinates (463693.8, 4440346.6) at its top-right. A six number summary is given for each of the attribute variables to show their average and range. Above these summaries you will find some NAs, indicating missing information. What are missing are the details about the coordinate reference system (CRS) for the data. If the geographical coordinates were in (longitude, latitude) format then we could use code of the form

> proj4string(landdata) <- CRS("+proj=longlat +ellps=WGS84")

However, they are not: they are Eastings and Northings in metres. Instead, the proj.4 string appropriate to the projection is assigned,

> crs <- CRS("...")   # insert here the proj.4 string for the projection used by the data
> proj4string(landdata) <- crs
> summary(landdata)
> plot(landdata, pch=21, bg="yellow", cex=0.7)

    # cex means character expansion: it controls the symbol size
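To check what R now records about the CRS, two standard sp functions can be used (a brief sketch, not in the original text):

> proj4string(landdata)    # prints the proj.4 string, or NA if none has been set
> is.projected(landdata)   # TRUE for a projected CRS, FALSE for longitude/latitude, NA if unset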

    4.1.3 Adding an additional map layer

    The map still lacks a sense of geographical context so we will add a polygon shapefile giving the boundaries of districts in Beijing. The file is called 'beijing_districts.shp'. This first needs to be loaded into R which in turn requires the maptools library:

> library(maptools)
> districts <- readShapePoly("beijing_districts.shp")   # reads the polygon shapefile
> summary(districts)

As before, the coordinate reference system is missing but is the same as for the land price data.

> proj4string(districts) <- crs
> summary(districts)

We can now plot the boundaries of the districts and then overlay the point data on top:

> plot(districts)
> plot(landdata, pch=21, bg="yellow", cex=0.7, add=T)

4.2 Creating a choropleth map using GISTools

A choropleth map is one where the area units (the polygons) are shaded according to some measurement of them. Looking at the attribute data for the districts shapefile (which is accessed using the code districts@data) we find it includes a measure of the population density,

> head(districts@data, n=3)
  SP_ID   POPDEN   JOBDEN
0     0 0.891959 1.495150
1     1 1.246510 0.466264
2     2 5.489660 2.787890

That is a variable we can map. The easiest way to do this is using the GISTools library,

> library(GISTools)

    As with most libraries, if we want to know more about what it can do, type ? followed by its name


(so here ?GISTools) and follow the link to the main index. If you do, you will find there is a function called choropleth with the description: Draws a choropleth map given a spatialPolygons object, a variable and a shading scheme.

Currently we have the spatialPolygons object. It is the object districts.

> class(districts)
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"

The difference between a spatialPolygons object and a SpatialPolygonsDataFrame is that the first contains only the geographical information required to draw the boundaries of a map whereas the second also contains attribute data that could be used to shade it (but doesn't have to be; another variable could be used). We will use the variable POPDEN. All that is left is a shading scheme. In the first instance, let's allow another function, auto.shading, to do the work for us.

> shades <- auto.shading(districts@data$POPDEN)
> choropleth(districts, districts@data$POPDEN)
> plot(landdata, pch=21, bg="yellow", cex=0.7, add=T)

It would be good to add a legend to the map. To do so, use the following command and then click towards the bottom-right of your map in the place where you would like the legend to go:

> locator(1)
$x
[1] 462440

$y
[1] 4407000

We can now place the legend at those map units,

> choro.legend(462440, 4407000, shades, fmt="%4.1f", title='Population density')

    In practice, it may take some trial-and-error to get the legend in the right place.

Similarly, we can add a north arrow and a map scale,

> north.arrow(461000, 4445000, "N", len=1000, col="light gray")
> map.scale(425000,4400000,10000,"km",5,subdiv=2,tcol='black',scol='black',
+ sfcol='black')

    The result is shown in Figure 4.1, below.


Figure 4.1. An example of a choropleth map produced in R

    4.2.1 Some alternative choropleth maps

What we see on a choropleth map and how we interpret it is a function of the classification used to shade the areas. Typing ?auto.shading we discover that the default for auto.shading is to use a quantile classification with five (n + 1) categories and a red shading scheme. We might like to compare this with using a standard deviation based classification and one based on the range.

> x <- districts@data$POPDEN
> shades2 <- auto.shading(x, cutter=sdCuts)      # a standard deviation classification
> shades3 <- auto.shading(x, cutter=rangeCuts)   # a range-based classification
> map.details <- function(shading) print(shading$breaks)
> # map.details is a stand-in helper reporting the break points used;
> # the original definition was garbled in the source

> # The quantile classification, as before
> choropleth(districts, x, shades)
> map.details(shades)

> # The standard deviation classification
> choropleth(districts, x, shades2)
> map.details(shades2)


> # The range-based classification
> choropleth(districts, x, shades3)
> map.details(shades3)

4.3 XY Maps with the point symbols shaded

In Section 4.1.1, the maps we produced of the land price data were not very elegant. It would be nice to alter the size or shading of the points according to some attribute of the data; for example, the land value. This information is contained in the dataset,

> head(landdata@data, n=3)
  LNPRICE LNAREA DCBD DELE DRIVER DPARK Y0405 Y0607 Y0809
1    9.29  11.84 8.28 7.25   7.38  8.09     0     1     0
2    8.64  10.97 9.04 5.61   8.41  7.51     0     1     0
3    6.69   7.92 8.89 5.62   8.22  7.27     0     1     0

    Altering the size is achieved simply enough by passing some function of the variable (LNPRICE) to the character expansion argument (cex). For example,

> x <- landdata@data$LNPRICE
> plot(landdata, pch=21, bg="yellow", cex=0.2*x)

    Shading them according to their value is a little harder. The process is to cut the values into groups (using a quintile classification, for example). Then create a colour palette for each group. Finally, map the points and shade them by group.

    4.3.1 Creating the groups

    There are various ways this can be done. The first stage is to find the lower and upper values (the break points) for each group and an easy way to do that is to use the classInt library that is designed for the purpose.

> library(classInt)

For example, for a quantile classification with five groups the break points are:

    > classIntervals(x, 5, "quantile")(we can see the number of land parcels in each group is approximately but not exactly equal)

For a 'natural breaks' classification they are

> classIntervals(x, 5, "fisher")

or

> classIntervals(x, 5, "jenks")

For an equal interval classification

> classIntervals(x, 5, "equal")

For a standard deviation classification

> classIntervals(x, 5, "sd")

    and so forth (see ?classIntervals for more options)

    Let's use a natural breaks classification, extracting the break points from the classIntervals(...) function and storing them in an object called break.points,

> break.points <- classIntervals(x, 5, "fisher")$brks
> break.points
[1]  4.850  6.300  7.105  7.865  8.820 11.060


The second stage is to assign each of the land price values to one of the five groups. This is done using the cut(...) function.

> groups <- cut(x, break.points, labels=FALSE, include.lowest=TRUE)
> # labels=FALSE gives the group numbers 1 to 5
> table(groups)
groups
  1   2   3   4   5
159 267 352 227 112

Which should be the same as for

> classIntervals(x, 5, "fisher")
style: fisher
   [4.85,6.3)   [6.3,7.105) [7.105,7.865)  [7.865,8.82)  [8.82,11.06]
          159           267           352           227           112

    4.3.2 Creating the colour palette and mapping the data

    This is most easily done using the RColorBrewer library which is based on the ColorBrewer website, http://www.colorbrewer.org and is designed to create nice looking colour palettes especially for thematic maps.

> library(RColorBrewer)

In fact, you have used this library already when creating the choropleth maps. It is implicit in the command auto.shading(x, cutter=sdCuts, cols=brewer.pal(5,'Greens')), for example, where the function brewer.pal(...) is a call to RColorBrewer asking it to create a sequential palette of five colours going from light to dark green.
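To browse all of the ColorBrewer palettes available (a quick aside, using standard RColorBrewer functions):

> display.brewer.all()      # plots every available palette for inspection
> brewer.pal(5, "Greens")   # the five hex colour codes of the Greens palette
[1] "#EDF8E9" "#BAE4B3" "#74C476" "#31A354" "#006D2C"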

We can create a colour palette in the same way,

> palette <- brewer.pal(5, "YlOrRd")   # an illustrative palette; any five-class sequential scheme will do
> plot(districts)
> plot(landdata, pch=21, bg=palette[groups], cex=0.2*x, add=T)

and add the north arrow and scale bar

> north.arrow(473000, 4445000, "N", len=1000, col="light gray")
> map.scale(425000,4400000,10000,"km",5,subdiv=2,tcol='black',scol='black',
+ sfcol='black')

Adding the legend is a little harder and uses the legend(...) function,

> legend("bottomright", legend=c("4.85 to 6.30", "6.30 to 7.105",
+ "7.105 to 7.865", "7.865 to 8.82", "8.82 to 11.06"),   # labels follow the break points
+ pch=21, pt.bg=palette, pt.cex=0.8, title="Log land price")

Figure 4.2. A shaded point map produced in R

4.4 Creating a raster grid

A problem with the map shown in Figure 4.2 is that of over-plotting: multiple points occupy the same space on the map and obscure each other. One solution is to convert the points into a raster grid, giving the average land value per grid cell.

We begin by loading the raster library

> library(raster)

then define the length of each raster cell (here giving a 1km by 1km cell size as the units are metres)

> cell.length <- 1000

The extent of the grid is taken from the bounding box of the districts map,

> bbox(districts)
        min       max
x  418358.5  473517.4
y 4391094.2 4447245.9

    and this information will be used to calculate the number of columns (ncol) and the number of rows (nrow) for the grid:

> xmin <- bbox(districts)[1,1]
> xmax <- bbox(districts)[1,2]
> ymin <- bbox(districts)[2,1]
> ymax <- bbox(districts)[2,2]
> ncol <- round((xmax - xmin) / cell.length, 0)
> nrow <- round((ymax - ymin) / cell.length, 0)
> ncol


[1] 55
> nrow
[1] 56

We then create a blank 55 by 56 raster grid, extract the point coordinates and the land values, and rasterise the points by taking the mean log land price within each cell:

> blank.grid <- raster(ncols=ncol, nrows=nrow, xmn=xmin, xmx=xmax,
+ ymn=ymin, ymx=ymax)
> xs <- coordinates(landdata)[,1]
> ys <- coordinates(landdata)[,2]
> xy <- cbind(xs, ys)
> x <- landdata@data$LNPRICE
> land.grid = rasterize(xy, blank.grid, x, mean)
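Before mapping it, we might check the result; a quick sketch using the raster package's values(...) function (cells containing no land parcels are NA):

> summary(values(land.grid))   # the distribution of the cell means, including the NA count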

We can now plot the grid using the default values,

> plot(land.grid)
> plot(districts, add=T)

or customise it a little

> break.points <- c(4.85, 6.30, 7.105, 7.865, 8.82, 11.06)   # illustrative: re-using the natural breaks found earlier
> palette <- brewer.pal(5, "YlOrRd")                         # an illustrative five-colour palette to match
> plot(land.grid, col=palette, breaks=break.points, main="Mean log land value")
> plot(districts, add=T)
> map.scale(426622,4394549,10000,"km",5,subdiv=2,tcol='black',scol='black',
+ sfcol='black')
> north.arrow(468230, 4394708, "N", len=1000, col="white")

    Figure 4.3. The land price data rasterised


4.5 Spatially joining data

A further example of the sort of GIS functionality available in R is given by assigning each of the land parcel points the attribute data for the districts within which they are located: a spatial join based on a point-in-polygon operation.

    This is simple to achieve with the basis being the over(...) function, here overlaying the points on the polygons. For example,

> head(over(landdata, districts), n=3)
  SP_ID  POPDEN  JOBDEN
1    57 17.9478 19.3322
2     8 22.9676 20.4468
3    15 32.4849 65.3001

shows that the first point in the land price data is located in the 58th of the districts. The reason that it is the 58th and not the 57th is that the IDs (SP_ID) are numbered beginning from zero not one, which is common for GIS. We can easily check this is correct:

> plot(districts[58,])
> plot(landdata[1,], pch=21, add=T)

- the point is indeed within the polygon.
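The same over(...) output can also be used to tally how many land parcels fall within each district; a small sketch (not in the original text):

> counts <- table(over(landdata, districts)$SP_ID)   # parcels per district ID
> head(counts)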

    To join the data we will create a new data frame containing the attribute data from landdata and the results of the overlay function above.

> joined.data <- data.frame(landdata@data, over(landdata, districts))
> head(joined.data, n=3)
  LNPRICE LNAREA DCBD DELE DRIVER DPARK Y0405 Y0607 Y0809 SP_ID  POPDEN  JOBDEN
1    9.29  11.84 8.28 7.25   7.38  8.09     0     1     0    57 17.9478 19.3322
2    8.64  10.97 9.04 5.61   8.41  7.51     0     1     0     8 22.9676 20.4468
3    6.69   7.92 8.89 5.62   8.22  7.27     0     1     0    15 32.4849 65.3001

    These data cannot be plotted as they are. What we have is just a data table. It is not linked to any map.

> class(joined.data)
[1] "data.frame"

    However, they can be linked to the geography of the existing map to create a new Spatial Points-with-attribute data object that can then be mapped in the way described in Section 4.3, p.43.

> combined.map <- SpatialPointsDataFrame(coordinates(landdata), joined.data)
> class(combined.map)
[1] "SpatialPointsDataFrame"
attr(,"package")
[1] "sp"
> proj4string(combined.map) <- crs
> head(combined.map@data)

> x <- combined.map@data$POPDEN   # mapping one of the joined (district-level) variables
> break.points <- classIntervals(x, 5, "quantile")$brks
> groups <- cut(x, break.points, labels=FALSE, include.lowest=TRUE)
> palette <- brewer.pal(5, "YlOrRd")   # an illustrative palette
> plot(districts)
> plot(combined.map, pch=21, bg=palette[groups], cex=0.8, add=T)


# Plot the data
> map.scale(426622,4396549,10000,"km",5,subdiv=2,tcol='black',scol='black',
+ sfcol='black')
> north.arrow(473000, 4445000, "N", len=1000, col="white")

# Add the map scale and north arrow
> classIntervals(x, 5, "quantile")

# Look at the break points to include in the legend
> legend("bottomright", legend=c("0.87 to [etc.]"),   # labels follow the quantile break points
+ pch=21, pt.bg=palette, pt.cex=0.8)

Making a few tweaks to a script can be much faster than having to go through a number of drop-down menus, tabs, right-clicks, etc. to achieve what you want.

Third, precisely because R's spatial capabilities do not stand alone from the rest of its functionality, they allow for the integration of statistical and spatial ways of working. Three examples follow.

    4.6.1 Example 1: mapping regression residuals

    We can use the combined dataset to fit a hedonic land price model, estimating some of the predictors of land price at each of the locations. The variables are:

LNPRICE: log of the land price (RMB per square metre)
LNAREA: log of the land parcel size
DCBD: distance to the CBD on a log scale
DELE: distance to the nearest elementary school on a log scale
DRIVER: distance to the nearest river on a log scale
DPARK: distance to the nearest park on a log scale
Y0405: a dummy variable (1 or 0) indicating whether the land was sold in the period 2004-5
Y0607: a dummy variable indicating whether the land was sold in the period 2006-7
Y0809: a dummy variable indicating whether the land was sold in the period 2008-9
(All of the remaining land parcels were sold before 2004)
SP_ID: an ID for the polygon (district) in which the land parcel is located
POPDEN: the population density in each district (data source: the fifth census, 2000)
JOBDEN: the job density in each district (data source: the fifth census, 2000)

    Remember: the data are simulated and not genuine.

To fit the regression model we use the function lm(...), short for linear model.

> model1 <- lm(LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN +
+ Y0405 + Y0607 + Y0809, data=combined.map@data)
> summary(model1)

Call:
lm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN +
    JOBDEN + Y0405 + Y0607 + Y0809, data = combined.map@data)

Residuals:
     Min       1Q   Median       3Q      Max
-2.83427 -0.59053 -0.03558  0.53943  2.97878

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.916563   0.575031  20.723  < 2e-16 ***
DCBD        -0.259700   0.055202  -4.705 2.87e-06 ***
DELE        -0.079764   0.032784  -2.433  0.01513 *
DRIVER       0.063370   0.029665   2.136  0.03288 *
DPARK       -0.294466   0.046747  -6.299 4.31e-10 ***
POPDEN       0.004552   0.001047   4.346 1.51e-05 ***
JOBDEN       0.006990   0.003009   2.323  0.02037 *
Y0405       -0.183913   0.057289  -3.210  0.00136 **
Y0607        0.152825   0.087152   1.754  0.07979 .
Y0809        0.803620   0.118338   6.791 1.81e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 0.868 on 1107 degrees of freedom
Multiple R-squared: 0.295,     Adjusted R-squared: 0.2893
F-statistic: 51.47 on 9 and 1107 DF,  p-value: < 2.2e-16

To obtain the residuals (errors) from this model we can use any of the functions residuals(...), rstandard(...) or rstudent(...) to obtain the 'raw' residuals, the standardised residuals and the Studentised residuals, respectively.

> residuals <- rstudent(model1)   # the Studentised residuals, used for the map below
> summary(residuals)
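As an aside (a minimal sketch, not in the original text), the usual non-spatial regression checks still apply; for example, plotting the residuals against the fitted values to look for non-linearity or non-constant variance:

> plot(fitted(model1), residuals, xlab="Fitted values", ylab="Studentised residuals")
> abline(h=0, lty="dashed")   # residuals should scatter evenly around zero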

    An advantage of using R is that we can now map the residuals to look for geographical patterns that, if they exist, would violate the assumption of independent errors (and potentially affect both the estimate of the model coefficients and their standard errors).

> break.points <- c(min(residuals), -1.96, 0, 1.96, max(residuals))
> # the -1.96/+1.96 limits match the legend below; the middle break is illustrative
> groups <- cut(residuals, break.points, labels=FALSE, include.lowest=TRUE)
> palette <- brewer.pal(4, "RdBu")   # an illustrative diverging palette
> plot(districts)
> plot(combined.map, pch=21, bg=palette[groups], cex=1, add=T)
> map.scale(426622,4396549,10000,"km",5,subdiv=2,tcol='black',scol='black',
+ sfcol='black')
> north.arrow(473000, 4445000, "N", len=1000, col="white")
> legend("bottomright", legend=c("< -1.96", "-1.96 to 0", "0 to 1.96", "> 1.96"),
+ pch=21, pt.bg=palette, pt.cex=0.8, title="Studentised residuals")

    Figure 4.5. Map of the regression residuals from a model predicting the land parcel prices


4.6.2 Example 2: Comparing values with those of a nearest neighbour

Is there a geographical pattern to the residuals in Figure 4.5? Perhaps, although this raises the question of what a random pattern would actually look like. What there definitely is, is a significant correlation between the residual value at any one point and that of its nearest neighbouring point:

> library(spdep)   # Loads the spatial dependence library
> knn1 <- knearneigh(coordinates(combined.map), k=1)$nn
> head(knn1, 3)
     [,1]
[1,]   10    # The nearest neighbour to point 1 is point 10
[2,]  172    # The nearest neighbour to point 2 is point 172
[3,]  152    [etc.]
> cor.test(residuals, residuals[knn1])
# Calculates the correlation between a point and its nearest neighbour

    Pearson's product-moment correlation

data:  residuals and residuals[knn1]
t = 13.0338, df = 1115, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3116038 0.4134505
sample estimates:
      cor
0.3636132
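Looking ahead to Session 5, the same idea can be expressed as a formal test of spatial autocorrelation, Moran's I. A minimal sketch using spdep functions (not part of the original text): the nearest-neighbour matrix is converted to a neighbours list and then to a spatial weights object,

> knn1.nb <- knn2nb(knearneigh(coordinates(combined.map), k=1))   # neighbours list from the k=1 neighbours
> moran.test(residuals, nb2listw(knn1.nb))                        # Moran's I test under these weights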

    4.6.3 Example 3: Exporting the data as a new shapefile

    If we wished, we could now save the residual values as a new shapefile to be used in other GIS. This is straightforward and uses the same procedure to create a Spatial Points-with-attribute data object that we used in Section 4.5.

A sketch of the procedure (the object and file names here are illustrative):

> residuals.map <- SpatialPointsDataFrame(coordinates(combined.map),
+ data.frame(residuals))
> proj4string(residuals.map) <- crs
> writePointsShape(residuals.map, "residuals")   # from maptools; writes residuals.shp and its supporting files

Reference:

Bivand, R.S., Pebesma, E.J. & Gómez-Rubio, V., 2008. Applied Spatial Data Analysis with R. Berlin: Springer.

4.8 Getting Help

There is a mailing list for discussing the development and use of R functions and packages for handling and analysis of spatial, and particularly geographical, data. It can be subscribed to at www.stat.math.ethz.ch/mailman/listinfo/r-sig-geo.

    The spatial cheat sheet by Barry Rowlingson at Lancaster University is really helpful: http://www.maths.lancs.ac.uk/~rowlings/Teaching/UseR2012/cheatsheet.html

    There are some excellent R spatial tips and tutorials on Chris Brunsdon's Rpubs site, http://rpubs.com/chrisbrunsdon, and on James Cheshire's website, http://spatial.ly/r/.

Perhaps the hardest thing is to remember which library to use when. At the risk of over-simplification:

Use library(maptools) to import and export shapefiles
Use library(sp) to create and manipulate spatial objects in R
Use library(GISTools) to help create maps
Use library(RColorBrewer) to create colour schemes for maps
Use library(raster) to create rasters and rasterisations of data
Use library(spdep) to look at the spatial dependencies in data and (as we shall see in the next session) to create spatial weights matrices

Tip: if you begin by loading library(GISTools) you will find maptools, sp and RColorBrewer (amongst others) are automatically loaded too.


Session 5: Defining Neighbours

This session demonstrates more of R's capability for spatial analysis by showing how to create spatial weightings based on contiguity, nearest neighbours and by distance. Methods of inverse distance weighting are also introduced and used to produce Moran plots and tests of spatial dependency in data. We find that the assumption of independence amongst the residuals of the regression model fitted in the previous session appears to be unwarranted.

    5.1 Identifying neighbours

    5.1.1 Getting Started

If you have closed and restarted R since the last session, load the workspace session5.RData which contains the districts polygons and the combined map created previously in Session 4 (see Section 4.1.3, p.40 and Section 4.5, p.47). All the code for this session is contained in the file Session5.R.

> load(file.choose())   # Load the workspace session5.RData

Also load the spdep library,

    > library(spdep)

    5.1.2 Creating a contiguity matrix

A contiguity matrix is one that identifies polygons that share boundaries and (in what is called the Queen's case) corners too. In other words, it identifies neighbouring areas. To do this we use the poly2nb(...) function, which converts the polygons to an object of class neighbours.

> contig <- poly2nb(districts)
> typeof(contig)   # the neighbours object is stored as a list

  • [1] "list"Looking at the first part of this list we find that the first district has two neighbours: polygons 52 and 54.

> contig[[1]]
[1] 52 54

The second district has five neighbours: polygons 3, 4, 6, 99, 100

> contig[[2]]
[1]   3   4   6  99 100

We can confirm this is correct by plotting the district and its neighbours on a map

> plot(districts)
> plot(districts[2,], col="red", add=T)
> plot(districts[c(3,4,6,99,100),], col="yellow", add=T)
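The connectivity of the whole map can also be summarised; a short sketch (not in the original text) using card(...), the spdep function that counts each polygon's neighbours:

> table(card(contig))     # how many districts have 1, 2, 3, ... neighbours
> summary(card(contig))   # the distribution of neighbour counts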

    5.1.3 K nearest neighbours (knn)

    It would not make sense to evaluate contiguity of point data. Instead, we could find, for example, the six nearest neighbours to each point:

> knear6 <- knearneigh(coordinates(combined.map), k=6)
> names(knear6)
[1] "nn"        "np"        "k"         "dimension" "x"
> head(knear6$nn)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   10   11 1077 1076   93  453
[2,]  172  110  155  162  153  156
[3,]  152  149  150  151  169  168
[4,]  135  143  679  148  678  674
[5,]   32  164  166  165  167  168
[6,]  432  920  969 1024  919  968
> head(knear6$np)
[1] 1117
> head(knear6$k)
[1] 6
> head(knear6$dimension)
[1] 2
> head(knear6$x, n=3)
            x       y
[1,] 454393.1 4417809
[2,] 442744.9 4417781
[3,] 444191.7 4416996

    Imagine we are interested in calculating the correlation between some variable (call it x) at each of the points and at each of the points' closest neighbour. From the above [head(knear6$nn)] we can see this is the correlation between x1, x2, x3, x4, x5, x6, (etc.) and x10, x172, x152, x135, x32, x432, (etc.). For the correlation with the second closest neighbours it would be with x11, x110, x149, x143, x164, x920, (etc.), for the third closest, x1077, x155, x150, x679, x166, x969, (etc.), and so forth. Using the simulated data about the price of land parcels in Beijing, we can calculate these correlations as follows:

> x <- combined.map@data$LNPRICE   # the (log) land price values

> cor(x, x[knear6$nn[,1]])   # Correlation with the 1st nearest neighbour
[1] 0.5275507
> cor(x, x[knear6$nn[,2]])   # Correlation with the 2nd nearest neighbour
[1] 0.4053884
> cor(x, x[knear6$nn[,3]])   # Correlation with the 3rd nearest neighbour
[1] 0.4073915
> cor(x, x[knear6$nn[,6]])   # Correlation with the 6th nearest neighbour
[1] 0.3163395

What these values suggest is that even at the sixth nearest neighbour, the value of a land parcel at any given point tends to be similar to the value of the land parcels around it: an example of positive spatial autocorrelation.

An issue is that the threshold of six nearest neighbours is purely arbitrary. An interesting question is how far (how many neighbours away) we can typically go from a point and still find a similarity in the land price values. One way to determine this would be to carry on with the calculations above, repeating the procedure until we get to, say, the 250th nearest neighbour. This is, in fact, what we will do, but automating the procedure. One way to achieve this is to use a for loop:

> knear250 <- knearneigh(coordinates(combined.map), k=250)   # Find the 250 nearest neighbours
> correlations <- vector("numeric", 250)   # Creates an object to store the correlations
> for (i in 1:250) {
+ correlations[i] <- cor(x, x[knear250$nn[,i]])
+ }
> correlations
  [1] 0.52755068 0.40538835 0.40739150 0.35804349 0.33960454 0.31633952 [etc.]

    Another way which amounts to the same thing is to make use of R's ability to apply a function sequentially to columns (or rows) in an array of data:

> correlations <- apply(knear250$nn, 2, function(j) cor(x, x[j]))
> # correlate x with each column of the neighbour matrix in turn
> correlations
  [1] 0.52755068 0.40538835 0.40739150 0.35804349 0.33960454 0.31633952 [etc.]

In either case, once we have the correlations they can be plotted,

> plot(correlations, xlab="nth nearest neighbour", ylab="Correlation")
> lines(lowess(correlations))   # Adds a smoothed trend line

Looking at the plot (Figure 5.1) we find that the land prices become more dissimilar (less correlated) the further away we go from each point, dropping to zero correlation after about the 200th nearest neighbour. The rate of decrease in the correlation is greatest to about the 35th neighbour, after which it begins to flatten.

    We can also determine the p-values associated with each of these correlations and identify which are not significant at a 99% confidence,

> pvals <- apply(knear250$nn, 2, function(j) cor.test(x, x[j])$p.value)
> # a sketch: the original calculation was garbled, but this extracts the p-value
> # of the correlation test for each nth nearest neighbour
> which(pvals > 0.01)
 [1]  63  88 110 115 121 125 134 136 137 138 140 142 145 146 [etc.]

It is from about the 100th neighbour that the correlations begin to become insignificant. Whether this is useful information or not is a moot point: a measure of statistical significance is really only an indirect measure of the sample size. It may be better to make a decision about the threshold at which the neighbours are not substantively correlated based on the actual correlations (the effect sizes) rather than their p-values. Whilst it remains a subjective choice, here we will use the 35th neighbour as the limit, before which the correlations are