Getting Started with R - Vanderbilt University

Getting Started with R

Stacy Hoehn

February 27, 2008

Stacy Hoehn Getting Started with R

The History of R

R evolved from the S language, which was first developed at AT&T Bell Laboratories by Rick Becker, John Chambers, and Allan Wilks.

Their idea was to provide a software tool for professional statisticians who wanted to combine state-of-the-art graphics with powerful model-fitting capabilities.

S is made up of 3 components:

1 Statistical Modeling

2 Data Exploration

3 Sophisticated Calculator

The History of R – Continued

S evolved into S-Plus, which is very powerful but also very expensive.

Ross Ihaka and Robert Gentleman from New Zealand wrote a stripped-down free version of S, which became known as R, for use by universities.

R is currently distributed under the GNU General Public License and developed by the user community.

Useful Resources

Websites

R Homepage: www.r-project.org

User’s Manual: http://cran.r-project.org/doc/manuals/R-intro.pdf

Another User’s Manual: cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf

Book:

Michael Crawley’s Statistics: An Introduction using R

Installing R

Go to the nearest CRAN (Comprehensive R Archive Network) mirror: http://cran.mtu.edu/

Click on Windows (or your operating system).

Click on base.

Click on R-2.6.2-win32.exe to download the setup program.

Install this program once the download is completed.

Launch the R GUI (graphical user interface).

Packages

Many applications of R use a package, which is a library of special functions designed for a specific problem. For example, the package BSDA has several functions that are useful for basic statistics and data analysis.

To install a package, go to the R GUI > Packages Menu > Install Packages.

You only have to install a package once, but each time you open the R GUI, you will need to load the package (Packages Menu > Load Package).
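
The install-once, load-every-session workflow can also be done from the console. This is a minimal sketch, assuming an internet connection and a configured CRAN mirror for the install step; the variable name pkg is only for illustration.

```r
# Name of the example package mentioned in the text.
pkg <- "BSDA"

if (!requireNamespace(pkg, quietly = TRUE)) {
  # Installing is needed only once per machine.
  message("Not installed yet; run install.packages(\"", pkg, "\") once.")
} else {
  # Loading must be repeated in every new R session.
  library(pkg, character.only = TRUE)
}
```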

Fundamental Data Objects

vector - A vector is an indexed set of values that are all of the same type. Some possible vector types are numeric, character, and logical.

data.frame - A data frame is a table-like structure. Experimental results and other data sets are often collected in this form.

Given a data object x, class(x) tells you what the class of x is. This can be useful when trying to decide which functions can be used on x.

Vectors

> x <- 99

> x

[1] 99

> y <- c(1, 2, 3)

> y

[1] 1 2 3

> z <- "statistics"

> class(z)

[1] "character"

> w <- c(TRUE, FALSE, FALSE, TRUE)

> class(w)

[1] "logical"

Basics about Vectors

To create a sequential list of numbers:

> data1 <- 1:5

> data1

[1] 1 2 3 4 5

To append numbers to a previous data set:

> data1 <- c(data1, 6, 7, 8, 9)

> data1

[1] 1 2 3 4 5 6 7 8 9

To repeat a number:

> data2 <- rep(1, 5)

> data2

[1] 1 1 1 1 1

More Basics about Vectors

To change an entry:

> data2[1] <- -99

> data2

[1] -99 1 1 1 1

> list <- c(-1.1, 2, -3, -4, 6.6, 7, 8.1, 20)

> abs(list)

[1] 1.1 2.0 3.0 4.0 6.6 7.0 8.1 20.0

To change an entry based on a logical expression:

> list2 <- list

> list2[list2 > 10] <- 10

> list2

[1] -1.1 2.0 -3.0 -4.0 6.6 7.0 8.1 10.0
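
To see why the replacement above works, it can help to look at the logical vector on its own. This is a small sketch; the names values, too_big, and capped are chosen here for illustration.

```r
values <- c(-1.1, 2, -3, -4, 6.6, 7, 8.1, 20)

# The comparison returns one TRUE/FALSE per element.
too_big <- values > 10

# which() converts the logical vector into positions; only entry 8 exceeds 10.
which(too_big)

# Indexing with the logical vector selects (and here overwrites) the TRUE entries.
capped <- values
capped[too_big] <- 10
capped
```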

Arithmetic Operations

Numeric vectors can be used in arithmetic expressions. The operations are performed element by element.

Example: Make a data list containing all the odd numbers between 1 and 25.

> odds1 <- 2 * c(0:12) + 1

> odds1

[1] 1 3 5 7 9 11 13 15 17 19 21 23 25

> odds2 <- seq(1, 25, 2)

> odds2

[1] 1 3 5 7 9 11 13 15 17 19 21 23 25
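
A short sketch of what "element by element" means for two vectors of the same length; the vectors a and b are made up for illustration.

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

# Each result entry combines the entries in the same position.
a + b       # 11 22 33
a * b       # 10 40 90

# A single number is reused for every entry, as in 2 * c(0:12) + 1.
2 * a + 1   # 3 5 7
```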

Mean and Variance

> temp <- c(70, 74, 72, 68, 76, 88, 80, 66)

> sum(temp)/length(temp)

[1] 74.25

> mean(temp)

[1] 74.25

> sort(temp)

[1] 66 68 70 72 74 76 80 88

> var(temp)

[1] 50.78571

> sd(temp)

[1] 7.12641

> summary(temp)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  66.00   69.50   73.00   74.25   77.00   88.00
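
Just as mean(temp) matches sum(temp)/length(temp), var() and sd() can be checked against their definitions. The sketch below assumes only that var() computes the sample variance (dividing by n - 1); the names n, var_by_hand, and sd_by_hand are illustrative.

```r
temp <- c(70, 74, 72, 68, 76, 88, 80, 66)
n <- length(temp)

# Sample variance: squared deviations from the mean, divided by n - 1.
var_by_hand <- sum((temp - mean(temp))^2) / (n - 1)

# The standard deviation is the square root of the variance.
sd_by_hand <- sqrt(var_by_hand)

var_by_hand   # agrees with var(temp)
sd_by_hand    # agrees with sd(temp)
```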

Histograms

> list <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 8, 8, 8)

> hist(list)

[Figure: Histogram of list; x-axis: list, y-axis: Frequency]

> hist(list, prob = T, breaks = 4)

[Figure: Histogram of list; x-axis: list, y-axis: Density]

Random Sampling

sample(x, size, replace = FALSE, prob) takes a random sample of the specified size from the elements of the vector x. Probabilities of selecting the values in x can optionally be specified with the argument prob.

Example: Simulate tossing a coin 50 times.

> coin <- c("H", "T")

> tosses <- sample(coin, 50, replace = TRUE)

> table(tosses)

tosses
 H  T
26 24

Example: Simulate choosing 6 lottery numbers from 1 to 54.

> sample(1:54, 6, replace = FALSE)

[1] 39 5 12 13 17 16
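
The prob argument is not used in the examples above. The sketch below simulates a biased coin, assuming a 70% chance of heads (an arbitrary value picked for illustration); set.seed() with an arbitrary seed makes the random draw repeatable.

```r
set.seed(2008)   # arbitrary seed so the sample can be reproduced

coin <- c("H", "T")

# prob gives the selection probability of each element of coin, in order.
biased <- sample(coin, 50, replace = TRUE, prob = c(0.7, 0.3))

# Tabulate the outcomes; heads should usually appear more often than tails.
counts <- table(biased)
counts
```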

Distributions

Normal Distribution

pnorm(x,mean,sd,lower.tail=TRUE) finds P(X ≤ x), where X is normally distributed with the given mean and sd

qnorm(p,mean,sd,lower.tail=TRUE) finds the value c such that P(X ≤ c) = p

Example: Scores on a particular standardized test follow a normal distribution with a mean of 80 and standard deviation of 10. What is the probability that a student scores below 70? Above 95? What score would you expect only 5 percent to score below?

> pnorm(70, 80, 10, lower.tail = TRUE)

[1] 0.1586553

> pnorm(95, 80, 10, lower.tail = FALSE)

[1] 0.0668072

> qnorm(0.05, 80, 10, lower.tail = TRUE)

[1] 63.55146
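
Two relationships are worth checking here: pnorm() with a mean and sd gives the same answer as standardizing first, and qnorm() undoes pnorm(). This is a sketch using the test-score numbers from the example; the variable names are illustrative.

```r
# Probability of scoring below 70, computed directly ...
p_direct <- pnorm(70, mean = 80, sd = 10)

# ... and by converting 70 to a z-score first.
z <- (70 - 80) / 10
p_standardized <- pnorm(z)

# qnorm() is the inverse of pnorm(): plugging the cutoff back in returns 0.05.
cutoff <- qnorm(0.05, mean = 80, sd = 10)
p_back <- pnorm(cutoff, mean = 80, sd = 10)

# Probability of a score between 70 and 95: difference of two lower tails.
p_between <- pnorm(95, 80, 10) - pnorm(70, 80, 10)
```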

More Distributions

t-Distribution

pt(q,df,lower.tail=TRUE)

qt(p,df,lower.tail=TRUE)

χ2-Distribution

pchisq(q,df,lower.tail=TRUE)

qchisq(p,df,lower.tail=TRUE)

Binomial Distribution

pbinom(q,size,prob,lower.tail=TRUE)

qbinom(p,size,prob,lower.tail=TRUE)
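
These functions follow the same pattern as pnorm() and qnorm(). The calls below are a sketch with made-up inputs (10 coin tosses, 10 and 5 degrees of freedom) chosen only to show the argument order.

```r
# Binomial: probability of at most 3 heads in 10 fair tosses (176/1024).
p_binom <- pbinom(3, size = 10, prob = 0.5)

# t: the value with 97.5% of the distribution below it, for 10 df.
t_crit <- qt(0.975, df = 10)
pt(t_crit, df = 10)                             # returns 0.975

# Chi-squared: upper-tail cutoff for 5 df, then the tail area above it.
chi_crit <- qchisq(0.95, df = 5)
pchisq(chi_crit, df = 5, lower.tail = FALSE)    # returns 0.05
```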

Getting Help

If you know the name of the function that you need help with, type a question mark (?) followed immediately by the function name. For example, to get more information about the function sample(), type

?sample

If you don’t know the exact name of the function that you need, use help.search(""). For example,

help.search("histograms")

will display a list of all functions related to histograms.

Things to Try on Your Own

1 Create a vector x1 that contains all of the even numbers between -10 and 10 (inclusive).

2 Use a logical expression to replace all of the negative entries in x1 with 0’s.

3 Take a random sample of 10 distinct numbers between 1 and 100.

4 Suppose that average body temperatures are normally distributed with a mean of 98.6 and standard deviation of 1.2. What is the probability that a randomly selected individual will have a body temperature between 97 and 100 degrees?

Data Frames

Data frames are matrix-like structures, in which the columns are vectors of possibly different types. You can think of data frames as ‘data matrices’ with one row per observational unit. Many experiments are best described by data frames.

> info <- data.frame(SEX = c("M", "M", "F"),
+     AGE = c(18, 19, 17), HEIGHT = c(69, 72, 62))

> info

  SEX AGE HEIGHT
1   M  18     69
2   M  19     72
3   F  17     62
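
To confirm that the columns really are vectors of different types, the class of each column can be inspected. This is a sketch; note that the class reported for SEX depends on the R version (a factor by default before R 4.0.0, a character vector from 4.0.0 on).

```r
info <- data.frame(SEX = c("M", "M", "F"),
                   AGE = c(18, 19, 17), HEIGHT = c(69, 72, 62))

# One class per column; AGE and HEIGHT are numeric.
sapply(info, class)

# Number of observational units (rows) and variables (columns).
nrow(info)
ncol(info)

# Each column behaves like an ordinary vector.
mean(info$HEIGHT)
```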

Rows and Columns of Data Frames

You can specify which row(s) and column(s) of a data frame you want to view.

> info[1, ]

  SEX AGE HEIGHT
1   M  18     69

> info[, 2:3]

  AGE HEIGHT
1  18     69
2  19     72
3  17     62

> info[1:2, c(1, 3)]

  SEX HEIGHT
1   M     69
2   M     72

Sorting Data Frames

Often data is better viewed when sorted. The following code sorts by HEIGHT.

> byHeight <- info[order(info[, "HEIGHT"]), ]

> byHeight

  SEX AGE HEIGHT
3   F  17     62
1   M  18     69
2   M  19     72

> byHeightRev <- info[rev(order(info[, "HEIGHT"])), ]

> byHeightRev[1:3, ]

  SEX AGE HEIGHT
2   M  19     72
1   M  18     69
3   F  17     62
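
As an alternative to rev(order(...)), order() accepts a decreasing argument, and it can take several columns to break ties. The sketch below rebuilds the small info data frame so that it runs on its own; the result names are illustrative.

```r
info <- data.frame(SEX = c("M", "M", "F"),
                   AGE = c(18, 19, 17), HEIGHT = c(69, 72, 62))

# Tallest first, without calling rev().
by_height_desc <- info[order(info$HEIGHT, decreasing = TRUE), ]

# Sort by SEX first, then by AGE within each SEX.
by_sex_age <- info[order(info$SEX, info$AGE), ]

by_height_desc
by_sex_age
```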

Reading Data from an External File

The simplest way to create large data frames is by reading data from an external file, such as a CSV file saved from Excel, using the read.table() or read.csv() functions. For example, you can use the data CD that was included with your textbook. This data can also be downloaded from www.thomsonedu.com/statistics.

Example:

> fam <- read.csv("E:/Excel Comma/Chapter 7/families.csv",
+     header = TRUE)

> dim(fam)

[1] 43886 6
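
If the data CD is not at hand, read.csv() can be tried on a small file created on the spot. This is a sketch: the temporary file and its three rows are made up for illustration and are not the textbook data.

```r
# Write a tiny comma-separated file to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("TYPE,PERSONS,INCOME",
             "1,2,43450",
             "1,2,79000",
             "1,2,51306"), csv_path)

# header = TRUE tells R that the first line holds the column names.
small <- read.csv(csv_path, header = TRUE)

dim(small)     # 3 rows, 3 columns
names(small)

# Remove the temporary file again.
unlink(csv_path)
```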

Manipulating Data Frames

One way to access the column vectors of a data frame is using $. For example, fam$INCOME is the vector corresponding to the income column. Alternatively, we can use the attach() function.

> attach(fam)

> names(fam)

[1] "TYPE" "PERSONS" "CHILDREN" "INCOME" "REGION" "EDUCATION"

> INCOME[1:5]

[1] 43450 79000 51306 24850 65145

> detach(fam)

Obtaining Summaries of Data Frames

> table(fam$REGION)

    1     2     3     4
10149 10390 13457  9890

> hist(fam$INCOME)

[Figure: Histogram of fam$INCOME; x-axis: fam$INCOME, y-axis: Frequency]

Subsetting Data Frames

Suppose you only want the data about a subset of the families. You can isolate this information using a logical expression.

> reg1 <- fam[fam$REGION == 1, ]

> reg1[1:3, ]

  TYPE PERSONS CHILDREN INCOME REGION EDUCATION
1    1       2        0  43450      1        39
2    1       2        0  79000      1        40
3    1       2        0  51306      1        39

> rich <- fam[fam$INCOME > 3e+05, ]

> rich[1:3, ]

      TYPE PERSONS CHILDREN INCOME REGION EDUCATION
2246     1       2        0 369121      1        44
13493    1       6        0 320552      2        43
14877    1       2        0 379395      2        45
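
Logical expressions can be combined with & (and) or | (or) to apply several conditions at once. The sketch below uses a small made-up data frame, fam_demo, rather than the textbook file, and also shows subset() as an equivalent way to write the same selection.

```r
# Hypothetical stand-in for the families data (values invented).
fam_demo <- data.frame(INCOME = c(43450, 79000, 51306, 24850, 65145),
                       REGION = c(1, 1, 2, 3, 1))

# Families in region 1 with income above 50000.
high_reg1 <- fam_demo[fam_demo$REGION == 1 & fam_demo$INCOME > 50000, ]

# subset() evaluates the condition inside the data frame, so $ is not needed.
same_rows <- subset(fam_demo, REGION == 1 & INCOME > 50000)

high_reg1
```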

Assessing Normality of Data

A researcher collected the body temperatures and heart rates of 130 college students.

> body <- read.csv("E:/Excel Comma/Chapter 9/bodytemp.csv",
+     header = FALSE)

> colnames(body) <- c("TEMP", "SEX", "HEART")

Do the heart rates look normally distributed?

> hist(body$HEART)

[Figure: Histogram of body$HEART; x-axis: body$HEART, y-axis: Frequency]

Quantile-Quantile Plots

An alternative way to assess the normality of data is using a quantile-quantile plot. If the points fall close to a straight line, the data is approximately normal.

> qqnorm(body$HEART)

> qqline(body$HEART)

[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]
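
What qqnorm() plots can be rebuilt by hand: the sorted data against the normal quantiles at evenly spread probabilities. The sketch below uses simulated values as a stand-in for the heart rates (the bodytemp.csv file is not assumed to be available); the seed, mean, and sd are arbitrary.

```r
set.seed(130)                          # arbitrary seed
x <- rnorm(130, mean = 74, sd = 7)     # simulated stand-in data

# Theoretical quantiles: standard normal quantiles at the probabilities
# returned by ppoints(), one for each observation.
theoretical <- qnorm(ppoints(length(x)))

# Sample quantiles: the data in increasing order.
sample_q <- sort(x)

# qqnorm() returns the same coordinates when asked not to draw the plot.
qq <- qqnorm(x, plot.it = FALSE)

# A correlation close to 1 corresponds to points lying near a straight line.
cor(theoretical, sample_q)
```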

Hypothesis Testing

For normally distributed data:

t.test(x, alternative = c("two.sided", "less", "greater"), mu = 0)

Example: The AMA claims that the average heart rate of American college students is 72 beats per minute. Is there evidence to conclude that the actual average heart rate is not 72?

> t.test(x = body$HEART, mu = 72, alternative = "two.sided")

One Sample t-test

data:  body$HEART
t = 2.844, df = 129, p-value = 0.005182
alternative hypothesis: true mean is not equal to 72
95 percent confidence interval:
 72.53607 74.98701
sample estimates:
mean of x
 73.76154
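
The t statistic and p-value reported by t.test() can be reproduced from the usual one-sample formulas. This sketch uses simulated heart rates (arbitrary seed, mean, and sd), not the values from bodytemp.csv, so its numbers will differ from the output above.

```r
set.seed(72)                           # arbitrary seed
hr <- rnorm(130, mean = 74, sd = 7)    # simulated stand-in data
mu0 <- 72                              # hypothesized mean
n <- length(hr)

# t = (sample mean - hypothesized mean) / standard error of the mean
t_stat <- (mean(hr) - mu0) / (sd(hr) / sqrt(n))

# Two-sided p-value: both tails of the t distribution with n - 1 df.
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# The built-in test gives the same statistic and p-value.
res <- t.test(hr, mu = mu0, alternative = "two.sided")
```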

Two Sample Hypothesis Testing

For normally distributed data:

t.test(x, y, alternative, mu = 0)

Example: Is there reason to believe that the average heart rate for males is different than for females based on the data we have?

> t.test(x = body[body$SEX == 1, ]$HEART, y = body[body$SEX ==

+ 2, ]$HEART, alternative = "two.sided", mu = 0)

Welch Two Sample t-test

data:  body[body$SEX == 1, ]$HEART and body[body$SEX == 2, ]$HEART
t = -0.6319, df = 116.704, p-value = 0.5287
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.243732  1.674501
sample estimates:
mean of x mean of y
 73.36923  74.15385
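
The output is labeled "Welch" because t.test() does not assume equal variances in the two groups unless told to, which is also why the degrees of freedom are not a whole number. The sketch below compares the two versions on simulated groups (sizes, means, and sds are arbitrary, not taken from the heart-rate data).

```r
set.seed(65)                        # arbitrary seed
g1 <- rnorm(65, mean = 73, sd = 6)  # simulated group 1
g2 <- rnorm(65, mean = 74, sd = 8)  # simulated group 2

# Default: Welch test, variances allowed to differ.
welch <- t.test(g1, g2, alternative = "two.sided", mu = 0)

# Classical pooled test, which assumes equal variances.
pooled <- t.test(g1, g2, var.equal = TRUE)

welch$method       # contains "Welch"
welch$parameter    # degrees of freedom, generally not an integer
pooled$parameter   # n1 + n2 - 2 = 128
```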

Pearson’s χ2 Goodness of Fit Tests

Example: If we toss a die 150 times and get 22 ones, 21 twos, 22 threes, 27 fours, 22 fives, and 36 sixes, is the die fair?

> freq <- c(22, 21, 22, 27, 22, 36)

> probs <- rep(1/6, 6)

> chisq.test(x = freq, p = probs)

Chi-squared test for given probabilities

data:  freq
X-squared = 6.72, df = 5, p-value = 0.2423

Since the p-value is large, there is not enough evidence to conclude that the die is not fair.
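
The statistic reported by chisq.test() can be checked directly: it is the sum of (observed - expected)^2 / expected over the six faces, compared with a chi-squared distribution with 6 - 1 = 5 degrees of freedom. A sketch, with illustrative variable names:

```r
freq <- c(22, 21, 22, 27, 22, 36)
probs <- rep(1/6, 6)

# Expected count for each face of a fair die: 150 * 1/6 = 25.
expected <- sum(freq) * probs

# Pearson's statistic: (9 + 16 + 9 + 4 + 9 + 121) / 25 = 6.72
x_squared <- sum((freq - expected)^2 / expected)

# Upper-tail area of the chi-squared distribution with 5 df.
p_value <- pchisq(x_squared, df = length(freq) - 1, lower.tail = FALSE)
```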

Goodness of Fit (continued)

Exercise 39 in Chapter 9: The lunar cycle (29 days) was divided into 10 (unequal) periods, and the number of animal bites in each period is given. Is there a temporal trend in the incidence of bites?

> freq <- c(137, 150, 163, 201, 269, 155, 142, 146, 148, 110)

> probs <- c(rep(3/29, 9), 2/29)

> chisq.test(x = freq, p = probs)

Chi-squared test for given probabilities

data:  freq
X-squared = 85.4797, df = 9, p-value = 1.308e-14

Since the p-value is so small, there does seem to be a temporal trend in the incidence of bites.
