
Ann Arbor ASA ‘Up and Running’ With R

Prepared by volunteers of the Ann Arbor chapter of the American Statistical Association, in cooperation with the Department of Statistics and the Center for Statistical Consultation and Research of the University of Michigan

October 27th, 2010

Ann Arbor ASA (Up and Running with R)


R Class Agenda

Brief Introduction to R
Using R Help
Introduction to Functions Available in R
Working with Data
Importing/Exporting Data
Graphs
Simple Models
Writing Functions/Programming


What is R?

R is a computing language commonly used for statistical analysis

R is open source, which means that the source code is available to all users

R is a free software package; download it at http://www.r-project.org/


More About R

Most statistical analysis is done using pre-defined functions in R.

These functions are available in many different packages.

When you download R, you have access to many functions from the ‘base’ package and the other packages installed with it.

More advanced functions will require that you download other packages.


What can you do with R?

Methods for many topics in statistics are readily available, such as linear modeling, linear mixed modeling, multivariate analysis, clustering, non-parametric methods, and classification

R is well known for producing high-quality graphics. Simple plots are easy, and with a little more practice users can produce publishable graphics!


Time to Launch R

Find R on your computer: Start>Statistical Software Packages>R

Go to the File menu and choose ‘New script’

This opens the editor window, where we will type our script

It is more convenient to type here than in your workspace

Try typing in both the workspace and the editor window


Data Objects in R

Users create different data objects in R

Data objects include variables, arrays of numbers, character strings, functions, and other more complicated structures

‘<-’ assigns a value to a data object with a name of your choice

Type ‘a<-7’ in your editor window

Submit this command by highlighting it and pressing ctrl+r

Practice creating different data objects and submit them to the workspace


Data Objects in R

Type ‘objects()’

This allows you to see that you have created the object ‘a’ during this R session

You can view previously submitted commands by using the up/down arrow keys on your keyboard

You can remove this object by typing ‘rm(a)’

Try removing some objects you created and then type ‘objects()’ to check that they are no longer listed
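The commands from the two ‘Data Objects in R’ slides can be tried together. This is a minimal sketch; the object name b and its value are made up for illustration and are not part of the class materials.

```r
# Create two objects with the assignment operator
a <- 7
b <- "hello"          # 'b' is an illustrative name, not from the slides

# List the objects created so far in this session
objects()

# Remove 'a', then confirm that it is no longer listed
rm(a)
"a" %in% objects()    # FALSE once 'a' has been removed
"b" %in% objects()    # 'b' is still there
```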


Getting Help in R

To get help on any specific function: type ‘help(name of function)’ OR type ‘?(name of function)’

Sometimes help() finds nothing because the function belongs to a package that is not loaded. In that case type ‘??(name of function)’ to search the help pages of all installed packages

Try searching for help on ‘hist’ or ‘lm’

Two popular R resource websites: Rseek.org and nabble.com


A Simple Example to Get You Started

To set up a vector named x use the R command:

    x <- c(5,4,3,6)

This is an assignment statement using the function c(), which creates a vector by concatenating its arguments

Perform vector/matrix arithmetic:

    v <- 3*x - 5
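To see what the arithmetic on this slide does, the short sketch below runs it and prints the results. Arithmetic in R is applied element by element, so each entry of x is multiplied by 3 and then reduced by 5.

```r
# Build the vector from the slide
x <- c(5, 4, 3, 6)

# Element-wise arithmetic: every entry of x is transformed
v <- 3 * x - 5
v          # 10 7 4 13

# Two vectors of the same length are also combined element by element
x + v      # 15 11 7 19
```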


R Reference Card (created by Tom Short)

There are thousands of available functions in R, but this Reference Card covers enough of them to provide a strong working knowledge

Let’s take a minute to look at the organization of the Reference Card and try out a few of the functions available!


Generating Sequences/Replicating Objects

Sequences: submit the following commands

    seq(-5, 5, by=.2)
    seq(length=51, from=-5, by=.2)

Both produce a sequence from -5 to 5 with a distance of .2 between elements

Replications: submit the following commands

    rep(x, times=5)
    rep(x, each=5)

Both replicate x 5 times, but in a different order: times=5 repeats the whole vector 5 times, while each=5 repeats each element 5 times before moving on to the next
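The difference between the two forms of seq() and of rep() is easiest to see side by side. This sketch reuses the vector x from the earlier slide and uses 2 repetitions instead of 5 to keep the output short; length.out is the full name of the argument abbreviated as length on the slide.

```r
x <- c(5, 4, 3, 6)

# Two ways to describe the same sequence
s1 <- seq(-5, 5, by = 0.2)
s2 <- seq(from = -5, by = 0.2, length.out = 51)
length(s1)             # 51
all.equal(s1, s2)      # TRUE

# 'times' repeats the whole vector; 'each' repeats element by element
rep(x, times = 2)      # 5 4 3 6 5 4 3 6
rep(x, each = 2)       # 5 5 4 4 3 3 6 6
```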


Working with Data Sets

There are many data sets available for use in R

Type ‘data()’ to see what’s available

We will work with the trees data set

Type ‘data(trees)’

This data set is now ready to use in R

The following are useful commands:

    summary(trees)    # summary of variables
    dim(trees)        # dimension of data set
    names(trees)      # see variable names
    attach(trees)     # attach the variable names for use in R
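The commands above can be run in one go as follows. The detach() call at the end is an addition to the slide: it undoes attach() so that the column names do not linger in later work.

```r
# Load the built-in trees data set
data(trees)

summary(trees)     # summary of each variable
dim(trees)         # 31 rows, 3 columns
names(trees)       # "Girth" "Height" "Volume"

# After attach(), columns can be used by name without the trees$ prefix
attach(trees)
mean_height <- mean(Height)
detach(trees)      # undo attach() when finished

mean_height
```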


Extracting Data

R has saved the data set trees as a data frame object

Check this by typing ‘class(trees)’

R stores this data in matrix row/column format: data.frame[rows,columns]

    trees[c(1:2),2]                  # we see the first 2 rows and 2nd column
    trees[3,c("Height","Girth")]     # can also reference column names
    trees[-c(10:20),"Height"]        # skips rows 10-20 for variable Height
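The three indexing forms can be checked against the size of the data set. This is a small sketch; the object names first_two, row3 and no_10_20 are illustrative.

```r
data(trees)
class(trees)                                # "data.frame"

# Rows 1-2 of column 2 (Height), returned as a plain vector
first_two <- trees[1:2, 2]

# Row 3, with columns picked (and ordered) by name
row3 <- trees[3, c("Height", "Girth")]

# A negative index drops rows: everything except rows 10-20
no_10_20 <- trees[-(10:20), "Height"]
length(no_10_20)                            # 20, since 11 of the 31 rows are dropped
```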


Extracting Data (continued)

The subset() command is very useful to extract data in a logical manner.

The 1st argument is the data; the 2nd argument is the logical subset requirement

    subset(trees, Height>80)               # rows where tree height > 80
    subset(trees, Height<70 & Girth>10)    # rows where tree height < 70 AND girth > 10
    subset(trees, Height<60 | Girth>11)    # rows where tree height < 60 OR girth > 11
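As a quick check of what subset() returns, the sketch below stores each subset and compares the first one with the equivalent bracket indexing from the previous slide. The object names are illustrative.

```r
data(trees)

tall           <- subset(trees, Height > 80)
short_and_wide <- subset(trees, Height < 70 & Girth > 10)
either         <- subset(trees, Height < 60 | Girth > 11)

# Every row kept satisfies the condition
all(tall$Height > 80)

# The same rows can be selected with a logical index inside [ , ]
nrow(tall) == nrow(trees[trees$Height > 80, ])
```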


Importing Data

The most common (and easiest) file to import is a text file, using the read.table() command

R needs to be told where the file is located

You can set the working directory, which tells R where all your files are located, by typing

    setwd("C:\\Users\\hicksk\\Desktop")

OR you can point to the working directory through the menus by going to File>Change dir… and choosing the location of your files

OR you can include the full location of your file in your read.table() command


Using the read.table() command

Go to the ASA Ann Arbor Chapter’s website, look under the R Classes section, open ‘furniture.zip’ and save the files to your desktop

Remember we must tell R where these files are located to read them in properly

    read.table("C:\\Users\\hicksk\\Desktop\\furniture.txt", header=TRUE, sep="")

In a Windows path it is important to use double backslashes \\ rather than a single backslash \ (forward slashes / also work)

Tell R whether you have column names on your data with header=TRUE or header=FALSE


Using read.table() (cont’d)

Remember, another way of specifying the file’s location is to set the working directory first and then read in the file

    setwd("C:\\Users\\hicksk\\Desktop")
    read.table("furniture.txt", header=TRUE, sep="")

OR we had the option of pointing to the location through the menus by going to File>Change dir… and choosing the file’s location. We would then be able to read the file as above by typing

    read.table("furniture.txt", header=TRUE, sep="")
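The two ways of locating a file can be compared without the class files. The sketch below is an assumption-laden stand-in: it writes a tiny made-up file called furniture.txt into R’s temporary directory (the real furniture data is not reproduced here) and then reads it back both ways.

```r
# Hypothetical stand-in for furniture.txt; the values below are made up.
data_dir <- tempdir()
writeLines(
  c("Area Cost Type",
    "100 250 chair",
    "120 300 desk",
    "90 180 chair"),
  file.path(data_dir, "furniture.txt")
)

# Option 1: give the full path to the file
furn1 <- read.table(file.path(data_dir, "furniture.txt"),
                    header = TRUE, sep = "")

# Option 2: set the working directory, then use the bare file name
old_dir <- setwd(data_dir)     # setwd() returns the previous directory
furn2 <- read.table("furniture.txt", header = TRUE, sep = "")
setwd(old_dir)                 # restore the previous working directory

identical(furn1, furn2)        # both routes read the same data
str(furn1)
```

file.path() builds the path with the right separator for the operating system, which avoids the backslash issue altogether.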


read.table(), read.csv() and Missing Values

It is also popular to import csv files, since Excel files are easily converted to csv files

read.csv() and read.table() are very similar, although they handle missing values differently

read.csv() automatically assigns ‘NA’ to missing values

read.table() with sep="" will not load a file in which a row has an empty entry, because that row then appears to have too few fields, so you must type ‘NA’ in place of missing values before reading the file into R


read.table(), read.csv() and Missing Values (cont’d)

Let’s remove a data entry from both “furniture.txt” and “furniture.csv”

From the first row, erase 100 from the Area column

Now try to read in the data from these two files using read.table() and read.csv()

You should see that you cannot read the data in using the read.table() command unless you input an entry for the missing value
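The behaviour described in this exercise can be reproduced without editing the class files. The sketch below uses small made-up files (hypothetical stand-ins for furniture.csv and furniture.txt) in which the Area entry of the first row has been erased, and wraps read.table() in tryCatch() so the error is captured instead of stopping the script.

```r
# Hypothetical stand-ins for the furniture files; the values are made up.
csv_file <- tempfile(fileext = ".csv")
txt_file <- tempfile(fileext = ".txt")

# The Area entry of the first data row is missing in both files
writeLines(c("Area,Cost,Type", ",250,chair", "120,300,desk"), csv_file)
writeLines(c("Area Cost Type", "250 chair", "120 300 desk"), txt_file)

# read.csv() sees the empty field between commas and fills in NA
furn_csv <- read.csv(csv_file)
furn_csv$Area

# read.table() with whitespace separators only sees a short row and fails
txt_result <- tryCatch(
  read.table(txt_file, header = TRUE, sep = ""),
  error = function(e) e
)
inherits(txt_result, "error")

# Typing NA in place of the missing entry makes the file readable again
writeLines(c("Area Cost Type", "NA 250 chair", "120 300 desk"), txt_file)
furn_txt <- read.table(txt_file, header = TRUE, sep = "")
furn_txt$Area
```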


Other Options for Importing Data

When you download R, you should have automatically obtained the foreign package

By submitting ‘library(foreign)’, you will have many more options for importing data: read.xport(), read.spss(), read.dta(), read.mtp()

For more information on these options, simply submit ‘help(read.XXXX)’


Exporting Data

You can export data by using the write.table() command

    write.table(trees, "treesDATA.txt", row.names=FALSE, sep=",")

The first argument specifies that we want the trees data set exported

The second argument is the name of the file to be written. By default the file is written to the current working directory unless you give a location

row.names=FALSE tells R that we do not wish to preserve the row names

sep="," tells R to write the file comma delimited
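One way to confirm an export worked is to read the file back in. The sketch below writes to R’s temporary directory rather than the working directory (an assumption made so that the example leaves no files behind) and uses read.csv() because the file is comma delimited.

```r
data(trees)

# Write the data set as a comma-delimited file without row names
out_file <- file.path(tempdir(), "treesDATA.txt")
write.table(trees, out_file, row.names = FALSE, sep = ",")

# Peek at the first lines of the file as text
readLines(out_file, n = 3)

# Read it back and compare with the original
trees_back <- read.csv(out_file)
dim(trees_back)
all.equal(trees_back$Girth, trees$Girth)
```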


Furniture Data Set

Let’s assign a name to the furniture data set as we read it in so we can do some analysis

    furn <- read.table("furniture.txt", sep="", header=TRUE)

To get a better understanding of our data set, use some useful commands:

    dim(furn)
    summary(furn)
    names(furn)
    attach(furn)


Graphs in R Using the Furniture Data

R can produce both very simple and very complex graphs

We will only get a brief introduction today but I encourage you to investigate further

Let’s start by making a simple scatter plot of the Area and Cost variables from our furniture data set

    plot(Area, Cost, main="Area vs Cost", xlab="Area", ylab="Cost")

We have told R to put Area on the x-axis and Cost on the y-axis, and provided a title and axis labels


Graphs in R

Let’s look at the distribution of our variables using some different graphs in R

    hist(Area)              # histogram of Area
    hist(Cost)              # histogram of Cost
    boxplot(Cost ~ Type)    # boxplot of Cost by Type

We can make the boxplot much prettier

    boxplot(Cost ~ Type, main="Boxplot of Cost by Type", col=c("orange", "green", "blue"), xlab="Type", ylab="Cost")


Graphs in R

We can also look at a scatter plot matrix of all variables in a data set by using the pairs() function

    pairs(furn)

Or we can look at a correlation/covariance matrix of the numeric variables

    cor(furn[,c(2:3)])
    cov(furn[,c(2:3)])


Graphs in R/Simple Models

Let’s perform a simple linear regression using the furniture data set

    m1 <- lm(Cost ~ Area)
    summary(m1)
    coef(m1)
    fitted.values(m1)
    residuals(m1)

We can also plot the residuals against the fitted values

    plot(fitted.values(m1), residuals(m1))
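For readers without the class files, the same steps can be sketched on a small made-up data frame. Everything about the data below is hypothetical; only the function calls follow the slide. The sketch passes the data through the data= argument instead of relying on attach(), and adds predict() as one way to obtain a fitted value at a chosen Area.

```r
# Hypothetical data standing in for the furniture file; values are made up.
furn <- data.frame(
  Area = c(100, 120, 150, 180, 200, 230, 260, 280, 300),
  Cost = c(205, 260, 310, 350, 430, 455, 540, 560, 640)
)

# Simple linear regression of Cost on Area
m1 <- lm(Cost ~ Area, data = furn)

summary(m1)              # coefficient table, R-squared, and so on
coef(m1)                 # intercept and slope
head(fitted.values(m1))  # fitted Cost for the first rows
head(residuals(m1))      # observed minus fitted

# Predicted Cost for a new piece with Area = 250
pred_250 <- predict(m1, newdata = data.frame(Area = 250))
pred_250
```

Using data= keeps the model tied to one data frame, which avoids surprises when several attached data sets share column names.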


Graphs in R/Simple Models

Let’s continue with our scatter plot of Area and Cost

    plot(Area, Cost, main="Cost Regression Example", xlab="Area", ylab="Cost")
    abline(lm(Cost~Area), col=3, lty=1)
    lines(lowess(Cost~Area), col=3, lty=2)

Now let’s interactively add a legend

    legend(locator(1), c("Linear", "Lowess"), lty=c(1,2), col=3)

You can point to your graph and place the legend where you wish!
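locator() needs a mouse click, so it only works in an interactive session. A non-interactive variant of the same plot is sketched below under two assumptions: the data frame is made up, and the legend is placed with the keyword "topleft" instead of a click. The pdf(NULL) and dev.off() lines only exist so the sketch runs without a screen; leave them out when working in the R GUI.

```r
# Hypothetical data standing in for the furniture file; values are made up.
furn <- data.frame(
  Area = c(100, 120, 150, 180, 200, 230, 260, 280, 300),
  Cost = c(205, 260, 310, 350, 430, 455, 540, 560, 640)
)

fit    <- lm(Cost ~ Area, data = furn)      # straight-line fit
smooth <- lowess(furn$Area, furn$Cost)      # locally weighted smoother

pdf(NULL)   # null graphics device; omit in an interactive session
plot(furn$Area, furn$Cost, main = "Cost Regression Example",
     xlab = "Area", ylab = "Cost")
abline(fit, col = 3, lty = 1)               # solid regression line
lines(smooth, col = 3, lty = 2)             # dashed lowess curve
legend("topleft", legend = c("Linear", "Lowess"), lty = c(1, 2), col = 3)
invisible(dev.off())
```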


Graphs in R/Simple Models

Now let’s identify different points on the graph

    identify(Area, Cost, row.names(furn))

This makes it easy to identify outliers

We can use the locator() command to quantify differences between the regression fit and the lowess line

    locator(2)

Now let’s compare predicted values of Cost when Area is equal to 250


Multiple Regression

Now let’s fit a multiple regression using both Area and Type as predictors in the model

    m2 <- lm(Cost ~ Area + Type)
    summary(m2)

Now let’s see if our multiple regression model is significantly better than the simple model by using ANOVA

    anova(m1, m2)

The ANOVA table compares the two nested regression models by testing the null hypothesis that the Type predictor is not needed in the model. Since the p-value<.05, we have evidence to conclude that Type is an important predictor.
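The nested-model comparison can be sketched on made-up data as well. The data frame below is hypothetical, so its p-value says nothing about the real furniture data; the point is only where to find the test in the output of anova().

```r
# Hypothetical data standing in for the furniture file; values are made up.
furn <- data.frame(
  Area = c(100, 120, 150, 180, 200, 230, 260, 280, 300),
  Cost = c(205, 260, 310, 350, 430, 455, 540, 560, 640),
  Type = factor(rep(c("chair", "desk", "table"), times = 3))
)

m1 <- lm(Cost ~ Area, data = furn)          # simple model
m2 <- lm(Cost ~ Area + Type, data = furn)   # adds Type as a predictor

# F test of the null hypothesis that the Type coefficients are all zero
comparison <- anova(m1, m2)
comparison

# The second row holds the test: extra degrees of freedom and the p-value
comparison$Df[2]
comparison[2, "Pr(>F)"]
```

With three furniture types, Type enters the model as two indicator variables, which is why the comparison uses 2 degrees of freedom.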


Writing Functions

You can easily write your own programs and functions in R

Type in the following function named f1:

    f1 <- function(m, n) {
      result <- m + n
      return(result)
    }

Now type ‘f1(3,5)’ and you should see that your function ran for the values 3,5 as specified, returning 8


Working with If-Then Statements

Here’s an example of how if-then works in R:

You’ll see since 10>5, it printed “GO BLUE”

You can tell R to do multiple items using the following structure

    if (logical condition) {
      do this
      and this
      and this
    }
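The example this slide refers to did not survive in the transcript. A minimal sketch consistent with its description (the condition 10>5 is true, so “GO BLUE” is printed) might look like the following; the variable names score and message_text are made up for illustration.

```r
# Single statement: printed because the condition 10 > 5 is TRUE
if (10 > 5) print("GO BLUE")

# Several statements: wrap them in braces
score <- 10                      # illustrative value
if (score > 5) {
  message_text <- "GO BLUE"
  print(message_text)
  print(score - 5)
}
```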


If-Else Conditions

We can make if-then statements slightly more complex using if-else conditions. Here’s an example:

    if (4>5) {
      print("Happy Halloween")
      print("BOO")
    } else {
      print("Merry XMAS")
      print("HO HO HO")
    }


For Loop/While Loop

For loops can be quite helpful when writing functions. Here’s an example:

    for (i in 1:5) {
      print(i+1)
    }

While loops are also quite handy. Here’s an example:

    f2 <- function(x) {
      while (x<5) {
        x <- x+1
        print(x)
      }
    }

    f2(-5)


Practice Problem #1

Create a sequence that starts at 0 and goes to 5 with a step of 0.5

Replicate ‘a b c’ 3 times

Replicate ‘a’ 3 times, ‘b’ 3 times, ‘c’ 3 times in one command


Practice Problem #2

Make a histogram of the “Girth” variable from the ‘trees’ data set. Include a title.

Make a boxplot of the “Height” variable from the ‘trees’ data set. Color it blue and label your axes.

Make a scatter plot of Girth and Height. Add the regression line.


Practice Problem #3

Create a simple linear model with Girth as the predictor and Height as the response. Extract the coefficients.

Now add Volume to the model. How can we tell if this model is preferred to the simpler model?


Practice Problem #4

Fix x at a number smaller than 5. Use a ‘while loop’ to create a sequence that starts at x and increases by 2 until you reach 20.

Create a function that will return the product of any two numbers.


Thank you for your attention!

Additional R Resources:

R project home: http://www.r-project.org
R documentation: http://www.r-project.org/other-docs.html
R help forum: http://www.nabble.com/R-help-f13820.html
R Journal: http://journal.r-project.org/
R Graphical Gallery: http://addictedtor.free.fr/graphiques/
R Graphical Manual: http://bm2.genes.nig.ac.jp/RGM2/
R Seek: http://www.rseek.org/


Acknowledgements/References

Thank you to Brady West for allowing the use of his R introductory materials.

http://www.r-project.org

http://addictedtor.free.fr/graphiques/