Getting Started with R
Stacy Hoehn
February 27, 2008
The History of R
R evolved from the S language, which was first developed at AT&T Bell Laboratories by Rick Becker, John Chambers, and Allan Wilks.
Their idea was to provide a software tool for professional statisticians who wanted to combine state-of-the-art graphics with powerful model-fitting capabilities.
S is made up of 3 components:
1 Statistical Modeling
2 Data Exploration
3 Sophisticated Calculator
The History of R – Continued
S evolved into S-Plus, which is very powerful but also very expensive.
Ross Ihaka and Robert Gentleman from New Zealand wrote a stripped-down free version of S, which became known as R, for use by universities.
R is currently distributed under the GNU General Public License and developed by the user community.
Useful Resources
Websites
R Homepage: www.r-project.org
User’s Manual: http://cran.r-project.org/doc/manuals/R-intro.pdf
Another User’s Manual: cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf
Book:
Michael Crawley’s Statistics: An Introduction using R
Installing R
Go to the nearest CRAN (Comprehensive R Archive Network) mirror: http://cran.mtu.edu/
Click on Windows (or your operating system).
Click on base.
Click on R-2.6.2-win32.exe to download the setup program.
Install this program once the download is completed.
Launch the R GUI (graphical user interface).
Packages
Many applications of R use a package, which is a library of special functions designed for a specific problem. For example, the package BSDA has several functions that are useful for basic statistics and data analysis.
To install a package, go to the R GUI > Packages Menu > Install Packages.
You only have to install a package once, but each time you open the R GUI, you will need to load the package (Packages Menu > Load Package).
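The same two steps can also be typed at the R console. The following is a minimal sketch, assuming an internet connection for the one-time install; the requireNamespace() guard is an addition for illustration so that the code still runs when BSDA has not been installed yet.

```r
# One-time installation from CRAN (needs an internet connection).
# Uncomment the next line to run it.
# install.packages("BSDA")

# Load the package at the start of each session, if it is installed.
bsda_available <- requireNamespace("BSDA", quietly = TRUE)
if (bsda_available) {
  library(BSDA)
} else {
  message("BSDA is not installed yet; run install.packages(\"BSDA\") once.")
}
```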
Fundamental Data Objects
vector - A vector is an indexed set of values that are all of the same type. Some possible vector types are numeric, character, and logical.
data.frame - A data frame is a table-like structure. Experimental results and other data sets are often collected in this form.
Given a data object x, class(x) tells you what the class of x is. This can be useful when trying to decide which functions can be used on x.
Vectors
> x <- 99
> x
[1] 99
> y <- c(1, 2, 3)
> y
[1] 1 2 3
> z <- "statistics"
> class(z)
[1] "character"
> w <- c(TRUE, FALSE, FALSE, TRUE)
> class(w)
[1] "logical"
Basics about Vectors
To create a sequential list of numbers:
> data1 <- 1:5
> data1
[1] 1 2 3 4 5
To append numbers to a previous data set:
> data1 <- c(data1, 6, 7, 8, 9)
> data1
[1] 1 2 3 4 5 6 7 8 9
To repeat a number:
> data2 <- rep(1, 5)
> data2
[1] 1 1 1 1 1
More Basics about Vectors
To change an entry:
> data2[1] <- -99
> data2
[1] -99 1 1 1 1
> list <- c(-1.1, 2, -3, -4, 6.6, 7, 8.1, 20)
> abs(list)
[1] 1.1 2.0 3.0 4.0 6.6 7.0 8.1 20.0
To change an entry based on a logical expression:
> list2 <- list
> list2[list2 > 10] <- 10
> list2
[1] -1.1 2.0 -3.0 -4.0 6.6 7.0 8.1 10.0
Arithmetic Operations
Numeric vectors can be used in arithmetic expressions. The operations are performed element by element.
Example: Make a data list containing all the odd numbers between 1 and 25.
> odds1 <- 2 * c(0:12) + 1
> odds1
[1] 1 3 5 7 9 11 13 15 17 19 21 23 25
> odds2 <- seq(1, 25, 2)
> odds2
[1] 1 3 5 7 9 11 13 15 17 19 21 23 25
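To see what "element by element" means when two vectors are combined, here is a small sketch. The vectors a and b are made-up values for illustration only.

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

sum_ab  <- a + b       # 11 22 33: first with first, second with second, ...
prod_ab <- a * b       # 10 40 90
squares <- a^2         # 1 4 9
shifted <- 2 * a + 1   # 3 5 7: the single numbers 2 and 1 are reused for every entry

print(sum_ab)
print(prod_ab)
print(squares)
print(shifted)
```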
Mean and Variance
> temp <- c(70, 74, 72, 68, 76, 88, 80, 66)
> sum(temp)/length(temp)
[1] 74.25
> mean(temp)
[1] 74.25
> sort(temp)
[1] 66 68 70 72 74 76 80 88
> var(temp)
[1] 50.78571
> sd(temp)
[1] 7.12641
> summary(temp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  66.00   69.50   73.00   74.25   77.00   88.00
Histograms
> list <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 8, 8, 8)
> hist(list)
[Figure: "Histogram of list"; x-axis: list (1 to 8), y-axis: Frequency (0 to 5)]
> hist(list, prob = TRUE, breaks = 4)
[Figure: "Histogram of list"; x-axis: list (0 to 8), y-axis: Density (0.00 to 0.15)]
Random Sampling
sample(x, size, replace = FALSE, prob) takes a random sample of the specified size from the elements of the vector x. Probabilities of selecting the values in x can optionally be specified with the argument prob.
Example: Simulate tossing a coin 50 times.
> coin <- c("H", "T")
> tosses <- sample(coin, 50, replace = TRUE)
> table(tosses)
tosses
 H  T
26 24
Example: Simulate choosing 6 lottery numbers from 1 to 54.
> sample(1:54, 6, replace = FALSE)
[1] 39 5 12 13 17 16
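The prob argument is not used in the examples above. As a sketch of how it works, the code below simulates a biased coin; the 70/30 split, the 1000 tosses, and the seed are arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed so that the run can be repeated
coin <- c("H", "T")

# prob gives one selection probability per element of coin, in the same order
biased <- sample(coin, 1000, replace = TRUE, prob = c(0.7, 0.3))

counts <- table(biased)
print(counts / length(biased))  # observed proportions, roughly 0.7 and 0.3
```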
Distributions
Normal Distribution
pnorm(q, mean, sd, lower.tail = TRUE) finds P(X ≤ q), where X is normal with the given mean and standard deviation
qnorm(p, mean, sd, lower.tail = TRUE) finds the value c such that P(X ≤ c) = p
Example: Scores on a particular standardized test follow a normal distribution with a mean of 80 and standard deviation of 10. What is the probability that a student scores below 70? Above 95? What score would you expect only 5 percent to score below?
> pnorm(70, 80, 10, lower.tail = TRUE)
[1] 0.1586553
> pnorm(95, 80, 10, lower.tail = FALSE)
[1] 0.0668072
> qnorm(0.05, 80, 10, lower.tail = TRUE)
[1] 63.55146
More Distributions
t-Distribution
pt(q,df,lower.tail=TRUE)
qt(p,df,lower.tail=TRUE)
χ²-Distribution
pchisq(q,df,lower.tail=TRUE)
qchisq(p,df,lower.tail=TRUE)
Binomial Distribution
pbinom(q,size,prob,lower.tail=TRUE)
qbinom(p,size,prob,lower.tail=TRUE)
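These functions follow the same pattern as pnorm() and qnorm(): the p-functions return probabilities and the q-functions return quantiles. The sketch below uses made-up inputs (10 degrees of freedom, 10 coin tosses, and so on) purely for illustration.

```r
# P(T <= 0) for a t-distribution with 10 degrees of freedom.
# The t-distribution is symmetric about 0, so this is 0.5.
p_t <- pt(0, df = 10)

# Critical value for a two-sided 5% t-test with 10 degrees of freedom
t_crit <- qt(0.975, df = 10)

# Upper-tail chi-squared probability and the 95th percentile, 1 degree of freedom
p_chi    <- pchisq(3.84, df = 1, lower.tail = FALSE)
chi_crit <- qchisq(0.95, df = 1)

# P(at most 5 heads in 10 tosses of a fair coin) = 638/1024
p_bin <- pbinom(5, size = 10, prob = 0.5)

print(c(p_t = p_t, t_crit = t_crit, p_chi = p_chi,
        chi_crit = chi_crit, p_bin = p_bin))
```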
Getting Help
If you know the name of the function that you need help with, type a question mark (?) followed immediately by the function name. For example, to get more information about the function sample(), type
?sample
If you don’t know the exact name of the function that you need, use help.search(""). For example,
help.search("histograms")
will display a list of all functions related to histograms.
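Two further helpers that are part of a standard R installation can be handy here; they are not mentioned in the original slides and are shown only as a sketch. args() prints a function's argument list without opening the help page, and apropos() lists objects whose names contain a given string.

```r
# Show the argument list of sample() without opening its help page
print(args(sample))

# List objects on the search path whose names contain "hist"
matches <- apropos("hist")
print(matches)

# ??histograms is shorthand for help.search("histograms")
```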
Things to Try on Your Own
1 Create a vector x1 that contains all of the even numbers between -10 and 10 (inclusive).
2 Use a logical expression to replace all of the negative entries in x1 with 0’s.
3 Take a random sample of 10 distinct numbers between 1 and 100.
4 Suppose that average body temperatures are normally distributed with a mean of 98.6 and standard deviation of 1.2. What is the probability that a randomly selected individual will have a body temperature between 97 and 100 degrees?
Data Frames
Data frames are matrix-like structures, in which the columns are vectors of possibly different types. You can think of data frames as ‘data matrices’ with one row per observational unit. Many experiments are best described by data frames.
> info <- data.frame(SEX = c("M", "M", "F"),
+     AGE = c(18, 19, 17), HEIGHT = c(69, 72, 62))
> info
  SEX AGE HEIGHT
1   M  18     69
2   M  19     72
3   F  17     62
Rows and Columns of Data Frames
You can specify which row(s) and column(s) of a data frame you want to view.
> info[1, ]
  SEX AGE HEIGHT
1   M  18     69
> info[, 2:3]
  AGE HEIGHT
1  18     69
2  19     72
3  17     62
> info[1:2, c(1, 3)]
  SEX HEIGHT
1   M     69
2   M     72
Sorting Data Frames
Often data is better viewed when sorted. The following code sorts by HEIGHT.
> byHeight <- info[order(info[, "HEIGHT"]), ]
> byHeight
  SEX AGE HEIGHT
3   F  17     62
1   M  18     69
2   M  19     72
> byHeightRev <- info[rev(order(info[, "HEIGHT"])), ]
> byHeightRev[1:3, ]
  SEX AGE HEIGHT
2   M  19     72
1   M  18     69
3   F  17     62
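As a possible alternative to rev(order(...)), order() accepts a decreasing argument, and it can take more than one column to break ties. The sketch below rebuilds the small info data frame so that it runs on its own; the names byHeightDesc and bySexAge are made up for illustration.

```r
info <- data.frame(SEX = c("M", "M", "F"),
                   AGE = c(18, 19, 17),
                   HEIGHT = c(69, 72, 62))

# Tallest first, without wrapping order() in rev()
byHeightDesc <- info[order(info$HEIGHT, decreasing = TRUE), ]
print(byHeightDesc)

# Sort by SEX first, then by AGE within each SEX
bySexAge <- info[order(info$SEX, info$AGE), ]
print(bySexAge)
```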
Reading Data from an External File
The simplest way to create large data frames is by reading data from an external file, such as a comma-separated (CSV) file saved from Excel, using the read.table() or read.csv() functions. For example, you can use the data CD that was included with your textbook. This data can also be downloaded from www.thomsonedu.com/statistics.
Example:
> fam <- read.csv("E:/Excel Comma/Chapter 7/families.csv",
+     header = TRUE)
> dim(fam)
[1] 43886 6
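If the data CD is not at hand, the same call can be tried on a small file. The sketch below writes a tiny CSV to a temporary location and reads it back; the file contents, the column subset, and the name demo are made up for illustration and are not the textbook data.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("TYPE,PERSONS,INCOME",
             "1,2,43450",
             "1,4,79000",
             "2,3,51306"), csv_path)

# header = TRUE tells read.csv() that the first line holds the column names
demo <- read.csv(csv_path, header = TRUE)

print(dim(demo))   # number of rows, number of columns
str(demo)          # column names and types
print(head(demo, 2))

unlink(csv_path)   # remove the temporary file
```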
Manipulating Data Frames
One way to access the column vectors of a data frame is using $. For example, fam$INCOME is the vector corresponding to the income column. Alternatively, we can use the attach() function.
> attach(fam)
> names(fam)
[1] "TYPE" "PERSONS" "CHILDREN" "INCOME" "REGION" "EDUCATION"
> INCOME[1:5]
[1] 43450 79000 51306 24850 65145
> detach(fam)
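A third option, not covered in the original slides, is with(), which evaluates one expression using the columns of a data frame without leaving anything attached afterwards. The sketch below uses a made-up three-row data frame called demo rather than the textbook file.

```r
demo <- data.frame(PERSONS = c(2, 4, 3),
                   INCOME  = c(43450, 79000, 51306))

# Inside with(), column names can be used directly
mean_income       <- with(demo, mean(INCOME))
income_per_person <- with(demo, INCOME / PERSONS)

print(mean_income)
print(income_per_person)
```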
Obtaining Summaries of Data Frames
> table(fam$REGION)
    1     2     3     4
10149 10390 13457  9890
> hist(fam$INCOME)
[Figure: "Histogram of fam$INCOME"; x-axis: fam$INCOME (0e+00 to 4e+05), y-axis: Frequency (0 to 10000)]
Subsetting Data Frames
Suppose you only want the data about a subset of the families. You can isolate this information using a logical expression.
> reg1 <- fam[fam$REGION == 1, ]
> reg1[1:3, ]
  TYPE PERSONS CHILDREN INCOME REGION EDUCATION
1    1       2        0  43450      1        39
2    1       2        0  79000      1        40
3    1       2        0  51306      1        39
> rich <- fam[fam$INCOME > 3e+05, ]
> rich[1:3, ]
      TYPE PERSONS CHILDREN INCOME REGION EDUCATION
2246     1       2        0 369121      1        44
13493    1       6        0 320552      2        43
14877    1       2        0 379395      2        45
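Logical expressions can be combined with & (and) and | (or). The sketch below uses a made-up five-row data frame called demo, not the textbook file, and also shows subset() as one possible shorthand for the same selection.

```r
demo <- data.frame(INCOME = c(43450, 79000, 51306, 24850, 65145),
                   REGION = c(1, 1, 2, 2, 1))

# Families in region 1 whose income is above 50000
high_reg1 <- demo[demo$REGION == 1 & demo$INCOME > 50000, ]
print(high_reg1)

# subset() lets the column names be used directly
same <- subset(demo, REGION == 1 & INCOME > 50000)
print(same)
```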
Assessing Normality of Data
A researcher collected the body temperatures and heart rates of 130 college students.
> body <- read.csv("E:/Excel Comma/Chapter 9/bodytemp.csv",
+     header = FALSE)
> colnames(body) <- c("TEMP", "SEX", "HEART")
Do the heart rates look normally distributed?
> hist(body$HEART)
[Figure: "Histogram of body$HEART"; x-axis: body$HEART (55 to 90), y-axis: Frequency (0 to 30)]
Quantile-Quantile Plots
An alternative way to assess the normality of data is using a quantile-quantile plot. If the points fall close to a straight line, the data are approximately normal.
> qqnorm(body$HEART)
> qqline(body$HEART)
[Figure: "Normal Q-Q Plot"; x-axis: Theoretical Quantiles (-2 to 2), y-axis: Sample Quantiles (60 to 90)]
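To get a feel for what "close to a straight line" looks like, it can help to compare a sample that really is normal with one that is clearly skewed. The sketch below uses simulated data, not the bodytemp file; the sample size, the distribution parameters, the seed, and the helper qq_cor() are all made up for illustration.

```r
set.seed(42)  # arbitrary seed so that the run can be repeated
normal_data <- rnorm(130, mean = 74, sd = 7)   # simulated normal sample
skewed_data <- rexp(130, rate = 1 / 74)        # simulated right-skewed sample

# Draw the two Q-Q plots side by side
op <- par(mfrow = c(1, 2))
qqnorm(normal_data, main = "Normal sample")
qqline(normal_data)
qqnorm(skewed_data, main = "Skewed sample")
qqline(skewed_data)
par(op)

# Rough numeric summary of straightness: the correlation between the
# theoretical and sample quantiles (closer to 1 means straighter).
qq_cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}
print(qq_cor(normal_data))
print(qq_cor(skewed_data))
```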
Hypothesis Testing
For normally distributed data:
t.test(x, alternative = c("two.sided", "less", "greater"), mu = 0)
Example: The AMA claims that the average heart rate of American college students is 72 beats per minute. Is there evidence to conclude that the actual average heart rate is not 72?
> t.test(x = body$HEART, mu = 72, alternative = "two.sided")
	One Sample t-test

data:  body$HEART
t = 2.844, df = 129, p-value = 0.005182
alternative hypothesis: true mean is not equal to 72
95 percent confidence interval:
 72.53607 74.98701
sample estimates:
mean of x
 73.76154
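The t statistic that t.test() reports is the sample mean minus the hypothesized mean, divided by the standard error sd/sqrt(n). The sketch below checks this by hand using the small temp vector from the Mean and Variance slide, since the bodytemp file is not reproduced here; the hypothesized mean of 72 is an arbitrary choice for illustration.

```r
temp <- c(70, 74, 72, 68, 76, 88, 80, 66)
mu0  <- 72                # hypothesized mean (arbitrary for this sketch)
n    <- length(temp)

# t statistic and two-sided p-value computed by hand
t_manual <- (mean(temp) - mu0) / (sd(temp) / sqrt(n))
p_manual <- 2 * pt(abs(t_manual), df = n - 1, lower.tail = FALSE)

# The same quantities taken from the object returned by t.test()
res <- t.test(temp, mu = mu0, alternative = "two.sided")

print(c(by_hand = t_manual, from_t.test = unname(res$statistic)))
print(c(by_hand = p_manual, from_t.test = res$p.value))
```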
Two Sample Hypothesis Testing
For normally distributed data:
t.test(x, y, alternative, mu = 0)
Example: Is there reason to believe that the average heart rate for males is different from that for females, based on the data we have?
> t.test(x = body[body$SEX == 1, ]$HEART, y = body[body$SEX ==
+ 2, ]$HEART, alternative = "two.sided", mu = 0)
	Welch Two Sample t-test

data:  body[body$SEX == 1, ]$HEART and body[body$SEX == 2, ]$HEART
t = -0.6319, df = 116.704, p-value = 0.5287
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.243732  1.674501
sample estimates:
mean of x mean of y
 73.36923  74.15385
Pearson’s χ² Goodness of Fit Tests
Example: If we toss a die 150 times and get 22 ones, 21 twos, 22 threes, 27 fours, 22 fives, and 36 sixes, is the die fair?
> freq = c(22, 21, 22, 27, 22, 36)
> probs = rep(1/6, 6)
> chisq.test(x = freq, p = probs)
	Chi-squared test for given probabilities

data:  freq
X-squared = 6.72, df = 5, p-value = 0.2423
Since the p-value is large, there is not enough evidence to conclude that the die is not fair.
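The X-squared value is the sum of (observed - expected)^2 / expected over the six faces. The sketch below recomputes it by hand for the die data and compares it with chisq.test(); the names expected, x_squared, and p_value are made up for illustration.

```r
freq  <- c(22, 21, 22, 27, 22, 36)
probs <- rep(1/6, 6)

# Expected count for each face of a fair die: 150 tosses * 1/6 = 25
expected <- sum(freq) * probs

# Pearson's statistic and its upper-tail p-value with 6 - 1 = 5 degrees of freedom
x_squared <- sum((freq - expected)^2 / expected)
p_value   <- pchisq(x_squared, df = length(freq) - 1, lower.tail = FALSE)

res <- chisq.test(x = freq, p = probs)

print(c(by_hand = x_squared, from_chisq.test = unname(res$statistic)))
print(c(by_hand = p_value, from_chisq.test = res$p.value))
```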
Goodness of Fit (continued)
Exercise 39 in Chapter 9: The lunar cycle (29 days) was divided into 10 (unequal) periods, and the number of animal bites in each period is given. Is there a temporal trend in the incidence of bites?
> freq = c(137, 150, 163, 201, 269, 155, 142, 146, 148, 110)
> probs = c(rep(3/29, 9), 2/29)
> chisq.test(x = freq, p = probs)
	Chi-squared test for given probabilities

data:  freq
X-squared = 85.4797, df = 9, p-value = 1.308e-14
Since the p-value is so small, there is strong evidence that bites are not spread evenly over the lunar cycle, so there does seem to be a temporal trend in the incidence of bites.