Getting Started with R - Vanderbilt University

Getting Started with R

Stacy Hoehn

February 27, 2008


The History of R

R evolved from the S language, which was first developed at AT&T Bell Laboratories by Rick Becker, John Chambers, and Allan Wilks.

Their idea was to provide a software tool for professional statisticians who wanted to combine state-of-the-art graphics with powerful model-fitting capabilities.

S is made up of 3 components:

1 Statistical Modeling

2 Data Exploration

3 Sophisticated Calculator

The History of R – Continued

S evolved into S-Plus, which is very powerful but also very expensive.

Ross Ihaka and Robert Gentleman from New Zealand wrote a stripped-down free version of S, which became known as R, for use by universities.

R is currently distributed under the GNU General Public License (an open-source license) and developed by the user community.

Useful Resources

Websites

R Homepage: www.r-project.org

User’s Manual: http://cran.r-project.org/doc/manuals/R-intro.pdf

Another User’s Manual: cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf

Michael Crawley’s Statistics: An Introduction using R

Installing R

Go to the nearest CRAN (Comprehensive R Archive Network) mirror: http://cran.mtu.edu/

Click on Windows (or your operating system).

Click on base.

Click on R-2.6.2-win32.exe to download the setup program.

Install this program once the download is completed.

Launch the R GUI (graphical user interface).

Packages

Many applications of R use a package, which is a library of special functions designed for a specific problem. For example, the package BSDA has several functions that are useful for basic statistics and data analysis.

To install a package, go to the R GUI > Packages Menu > Install Packages.

You only have to install a package once, but each time you open the R GUI, you will need to load the package (Packages Menu > Load Package).
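The menu steps have command-line equivalents, which can be handy in scripts. The sketch below is a minimal illustration: the install line is commented out because it needs an internet connection, and the variable name has_bsda is only an illustrative choice.

```r
# Install once (requires an internet connection); commented out so the
# rest of this sketch runs offline.
# install.packages("BSDA")

# Load the package in every new session. requireNamespace() reports
# whether the package is installed without stopping on an error.
has_bsda <- requireNamespace("BSDA", quietly = TRUE)
if (has_bsda) {
  library(BSDA)
} else {
  message("BSDA is not installed; run install.packages(\"BSDA\") first.")
}
```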

Fundamental Data Objects

vector - A vector is an indexed set of values that are all of the same type. Some possible vector types are numeric, character, and logical.

data.frame - A data frame is a table-like structure. Experimental results and other data sets are often collected in this form.

Given a data object x, class(x) tells you what the class of x is. This can be useful when trying to decide which functions can be used on x.

Vectors

> x <- 99

> x

[1] 99

> y <- c(1, 2, 3)

> y

[1] 1 2 3

> z <- "statistics"

> class(z)

[1] "character"

> w <- c(TRUE, FALSE, FALSE, TRUE)

> class(w)

[1] "logical"

Basics about Vectors

To create a sequential list of numbers:

> data1 <- 1:5

> data1

[1] 1 2 3 4 5

To append numbers to a previous data set:

> data1 <- c(data1, 6, 7, 8, 9)

> data1

[1] 1 2 3 4 5 6 7 8 9

To repeat a number:

> data2 <- rep(1, 5)

> data2

[1] 1 1 1 1 1

More Basics about Vectors

To change an entry:

> data2[1] <- -99

> data2

[1] -99 1 1 1 1

> list <- c(-1.1, 2, -3, -4, 6.6, 7, 8.1, 20)

> abs(list)

[1] 1.1 2.0 3.0 4.0 6.6 7.0 8.1 20.0

To change an entry based on a logical expression:

> list2 <- list

> list2[list2 > 10] <- 10

> list2

[1] -1.1 2.0 -3.0 -4.0 6.6 7.0 8.1 10.0
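To see why the logical expression works, it can help to look at the intermediate logical vector. The sketch below uses the same numbers under hypothetical names (vals, mask, capped); pmin() is shown only as an alternative way to cap values, not as the approach used on the slide.

```r
vals <- c(-1.1, 2, -3, -4, 6.6, 7, 8.1, 20)

# The comparison yields one TRUE/FALSE per element.
mask <- vals > 10
mask          # only the last element is TRUE
which(mask)   # position of the TRUE entry: 8

# Assigning through the mask changes only the selected entries.
capped <- vals
capped[mask] <- 10

# pmin() caps every element at 10 in a single step.
capped_alt <- pmin(vals, 10)
```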

Arithmetic Operations

Numeric vectors can be used in arithmetic expressions. The operations are performed element by element.

Example: Make a data list containing all the odd numbers between 1 and 25.

> odds1 <- 2 * c(0:12) + 1

> odds1

[1] 1 3 5 7 9 11 13 15 17 19 21 23 25

> odds2 <- seq(1, 25, 2)

> odds2

[1] 1 3 5 7 9 11 13 15 17 19 21 23 25
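A short sketch of what "element by element" means for two vectors of the same length. The vectors a and b are made up for illustration.

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

sums     <- a + b   # 11 22 33 44
products <- a * b   # 10 40 90 160
ratios   <- b / a   # 10 10 10 10
squares  <- a^2     # 1 4 9 16

# A single number is recycled across every element of the vector.
shifted <- a + 100  # 101 102 103 104
```

This recycling is also why 2 * c(0:12) + 1 produces a whole vector of odd numbers.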

Mean and Variance

> temp <- c(70, 74, 72, 68, 76, 88, 80, 66)

> sum(temp)/length(temp)

[1] 74.25

> mean(temp)

[1] 74.25

> sort(temp)

[1] 66 68 70 72 74 76 80 88

> var(temp)

[1] 50.78571

> sd(temp)

[1] 7.12641

> summary(temp)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  66.00   69.50   73.00   74.25   77.00   88.00
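var() and sd() use the sample formulas, dividing by n - 1 rather than n. A minimal sketch that reproduces them by hand for the same temp vector (the names deviations, var_by_hand, and sd_by_hand are illustrative):

```r
temp <- c(70, 74, 72, 68, 76, 88, 80, 66)
n <- length(temp)

# Deviation of each observation from the sample mean.
deviations <- temp - mean(temp)

# Sample variance: sum of squared deviations divided by n - 1.
var_by_hand <- sum(deviations^2) / (n - 1)

# Sample standard deviation: square root of the sample variance.
sd_by_hand <- sqrt(var_by_hand)

var_by_hand   # agrees with var(temp)
sd_by_hand    # agrees with sd(temp)
```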

Histograms

> list <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 8, 8, 8)

> hist(list)

[Figure: histogram titled "Histogram of list"; x-axis from 1 to 8]

> hist(list, prob = TRUE, breaks = 4)

[Figure: histogram titled "Histogram of list"; x-axis from 0 to 8]

Random Sampling

sample(x, size, replace = FALSE, prob) takes a random sample of the specified size from the elements of the vector x. Probabilities of selecting the values in x can optionally be specified with the argument prob.

Example: Simulate tossing a coin 50 times.

> coin <- c("H", "T")

> tosses <- sample(coin, 50, replace = TRUE)

> table(tosses)

tosses
 H  T
26 24

Example: Simulate choosing 6 lottery numbers from 1 to 54.

> sample(1:54, 6, replace = FALSE)

[1] 39 5 12 13 17 16
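Random samples differ from run to run, so your output will not match the numbers shown. The sketch below illustrates two points under stated assumptions: set.seed() makes a simulation repeatable (the seed value 2008 is arbitrary), and the prob argument can model a biased coin (the 70/30 split is made up).

```r
set.seed(2008)  # arbitrary seed, so reruns give the same draws

coin <- c("H", "T")

# A biased coin: heads with probability 0.7, tails with probability 0.3.
biased <- sample(coin, 1000, replace = TRUE, prob = c(0.7, 0.3))
prop_heads <- mean(biased == "H")   # should be close to 0.7
table(biased)

# replace = FALSE is the default, so the six numbers are distinct.
lottery <- sample(1:54, 6)
```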

Distributions

Normal Distribution

pnorm(x, mean, sd, lower.tail = TRUE) finds P(X ≤ x) for a normal random variable X with the given mean and standard deviation

qnorm(p, mean, sd, lower.tail = TRUE) finds the value c such that P(X ≤ c) = p

Example: Scores on a particular standardized test follow a normal distribution with a mean of 80 and standard deviation of 10. What is the probability that a student scores below 70? Above 95? What score would you expect only 5 percent to score below?

> pnorm(70, 80, 10, lower.tail = TRUE)

[1] 0.1586553

> pnorm(95, 80, 10, lower.tail = FALSE)

[1] 0.0668072

> qnorm(0.05, 80, 10, lower.tail = TRUE)

[1] 63.55146
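The probability of falling between two values is the difference of two lower-tail probabilities. A minimal sketch for the same test-score distribution; the question "between 70 and 95" is an added illustration, not part of the original example.

```r
# P(70 < X < 95) = P(X <= 95) - P(X <= 70) for X ~ Normal(80, 10)
p_between <- pnorm(95, mean = 80, sd = 10) - pnorm(70, mean = 80, sd = 10)
p_between   # about 0.7745

# Standardizing first gives the same answer on the standard normal scale.
z_low  <- (70 - 80) / 10   # -1
z_high <- (95 - 80) / 10   # 1.5
p_between_z <- pnorm(z_high) - pnorm(z_low)
```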

More Distributions

t-Distribution

pt(q,df,lower.tail=TRUE)

qt(p,df,lower.tail=TRUE)

χ²-Distribution

pchisq(q,df,lower.tail=TRUE)

qchisq(p,df,lower.tail=TRUE)

Binomial Distribution

pbinom(q,size,prob,lower.tail=TRUE)

qbinom(p,size,prob,lower.tail=TRUE)
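These functions follow the same p/q naming pattern as pnorm() and qnorm(). A few illustrative calls; the degrees of freedom, sample size, and probabilities are chosen only as examples.

```r
# Critical value with 2.5% in the upper tail of a t distribution, 10 df.
t_crit <- qt(0.975, df = 10)          # about 2.228

# Upper-tail probability beyond that critical value: 0.025.
t_upper <- pt(t_crit, df = 10, lower.tail = FALSE)

# 95th percentile of a chi-squared distribution with 1 df.
chi_crit <- qchisq(0.95, df = 1)      # about 3.841

# P(at most 3 heads in 10 tosses of a fair coin) = 176/1024.
p_binom <- pbinom(3, size = 10, prob = 0.5)   # 0.171875
```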

Getting Help

If you know the name of the function that you need help with, type a question mark (?) followed immediately by the function name. For example, to get more information about the function sample(), type

?sample

If you don’t know the exact name of the function that you need, use help.search("") with a search term between the quotes. For example,

help.search("histograms")

will display a list of all functions related to histograms.

Things to Try on Your Own

1 Create a vector x1 that contains all of the even numbers between -10 and 10 (inclusive).

2 Use a logical expression to replace all of the negative entries in x1 with 0’s.

3 Take a random sample of 10 distinct numbers between 1 and 100.

4 Suppose that average body temperatures are normally distributed with a mean of 98.6 and standard deviation of 1.2. What is the probability that a randomly selected individual will have a body temperature between 97 and 100 degrees?

Data Frames

Data frames are matrix-like structures, in which the columns are vectors of possibly different types. You can think of data frames as ‘data matrices’ with one row per observational unit. Many experiments are best described by data frames.

> info <- data.frame(SEX = c("M", "M", "F"),
+     AGE = c(18, 19, 17), HEIGHT = c(69, 72, 62))

> info

  SEX AGE HEIGHT
1   M  18     69
2   M  19     72
3   F  17     62

Rows and Columns of Data Frames

You can specify which row(s) and column(s) of a data frame you want to view.

> info[1, ]

  SEX AGE HEIGHT
1   M  18     69

> info[, 2:3]

  AGE HEIGHT
1  18     69
2  19     72
3  17     62

> info[1:2, c(1, 3)]

  SEX HEIGHT
1   M     69
2   M     72

Sorting Data Frames

Often data is better viewed when sorted. The following code sorts by HEIGHT.

> byHeight <- info[order(info[, "HEIGHT"]), ]

> byHeight

  SEX AGE HEIGHT
3   F  17     62
1   M  18     69
2   M  19     72

> byHeightRev <- info[rev(order(info[, "HEIGHT"])), ]

> byHeightRev[1:3, ]

  SEX AGE HEIGHT
2   M  19     72
1   M  18     69
3   F  17     62
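order() can also sort in descending order directly and can break ties with a second column. The sketch below rebuilds the small info data frame so that it runs on its own; the names by_height_desc and by_sex_then_age are illustrative. With no tied heights, decreasing = TRUE gives the same row order as rev(order(...)).

```r
info <- data.frame(SEX = c("M", "M", "F"),
                   AGE = c(18, 19, 17),
                   HEIGHT = c(69, 72, 62))

# Tallest first, without calling rev().
by_height_desc <- info[order(info$HEIGHT, decreasing = TRUE), ]

# Sort by SEX first, then by AGE within each SEX.
by_sex_then_age <- info[order(info$SEX, info$AGE), ]

by_height_desc
by_sex_then_age
```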

Reading Data from an External File

The simplest way to create large data frames is by reading data from an external file, such as a comma-separated (CSV) file saved from Excel, using the read.table() or read.csv() functions. For example, you can use the data CD that was included with your textbook. This data can also be downloaded from www.thomsonedu.com/statistics.

Example:

> fam <- read.csv("E:/Excel Comma/Chapter 7/families.csv",
+     header = TRUE)

> dim(fam)

[1] 43886 6
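If you do not have the data CD at hand, you can try read.csv() on a small file that you create yourself. The sketch below writes a made-up three-row file to a temporary location and reads it back; the column names and values are toy data, not the textbook's families.csv.

```r
# Write a tiny CSV file to a temporary path (toy data for illustration).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("TYPE,PERSONS,INCOME",
             "1,2,43450",
             "1,2,79000",
             "1,6,51306"),
           csv_path)

# header = TRUE treats the first line as column names.
toy <- read.csv(csv_path, header = TRUE)

dim(toy)    # 3 rows, 3 columns
names(toy)  # "TYPE" "PERSONS" "INCOME"
str(toy)    # shows the type of each column
```

Checking dim() and str() right after reading a file is a quick way to confirm that the rows, columns, and types came through as expected.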

Manipulating Data Frames

One way to access the column vectors of a data frame is using $. For example, fam$INCOME is the vector corresponding to the income column. Alternatively, we can use the attach() function.

> attach(fam)

> names(fam)

[1] "TYPE" "PERSONS" "CHILDREN" "INCOME" "REGION" "EDUCATION"

> INCOME[1:5]

[1] 43450 79000 51306 24850 65145

> detach(fam)
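A third option, offered here as an alternative rather than as the approach on the slide, is with(), which makes the columns visible only for a single expression so there is nothing to detach afterwards. The sketch uses the small info data frame from earlier so that it runs without the data CD.

```r
info <- data.frame(SEX = c("M", "M", "F"),
                   AGE = c(18, 19, 17),
                   HEIGHT = c(69, 72, 62))

# Column access with $.
mean_age <- mean(info$AGE)                 # 18

# with() evaluates the expression using the columns of info.
mean_height <- with(info, mean(HEIGHT))    # 203 / 3
```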

Obtaining Summaries of Data Frames

> table(fam$REGION)

    1     2     3     4
10149 10390 13457  9890

> hist(fam$INCOME)

[Figure: Histogram of fam$INCOME; x-axis: fam$INCOME, from 0e+00 to 4e+05]

Subsetting Data Frames

Suppose you only want the data about a subset of the families. You can isolate this information using a logical expression.

> reg1 <- fam[fam$REGION == 1, ]

> reg1[1:3, ]

  TYPE PERSONS CHILDREN INCOME REGION EDUCATION
1    1       2        0  43450      1        39
2    1       2        0  79000      1        40
3    1       2        0  51306      1        39

> rich <- fam[fam$INCOME > 3e+05, ]

> rich[1:3, ]

      TYPE PERSONS CHILDREN INCOME REGION EDUCATION
2246     1       2        0 369121      1        44
13493    1       6        0 320552      2        43
14877    1       2        0 379395      2        45
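Logical expressions can be combined with & (and) and | (or). The sketch below uses a made-up five-row data frame (toy) in place of fam, and shows subset() as an equivalent shorthand.

```r
# Toy stand-in for fam; the values are made up for illustration.
toy <- data.frame(REGION = c(1, 1, 2, 3, 2),
                  INCOME = c(43450, 320000, 51306, 24850, 379395))

# Rows that are in region 2 AND have income above 300,000.
rich_reg2 <- toy[toy$REGION == 2 & toy$INCOME > 3e+05, ]

# subset() expresses the same filter without repeating toy$.
rich_reg2_alt <- subset(toy, REGION == 2 & INCOME > 3e+05)

# Rows that are in region 1 OR region 3.
reg1_or_3 <- toy[toy$REGION == 1 | toy$REGION == 3, ]
```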

Assessing Normality of Data

A researcher collected the body temperatures and heart rates of 130 college students.

> body <- read.csv("E:/Excel Comma/Chapter 9/bodytemp.csv",
+     header = FALSE)

> colnames(body) <- c("TEMP", "SEX", "HEART")

Do the heart rates look normally distributed?

> hist(body$HEART)

[Figure: Histogram of body$HEART; x-axis: body$HEART, from 55 to 90]

Quantile-Quantile Plots

Another way to assess the normality of data is with a quantile-quantile plot. If the points fall close to a straight line, the data are approximately normal.

> qqnorm(body$HEART)

> qqline(body$HEART)

[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles, from -2 to 2]

Hypothesis Testing

For normally distributed data: t.test(x, alternative = c("two.sided", "less", "greater"), mu = 0)

Example: The AMA claims that the average heart rate of American college students is 72 beats per minute. Is there evidence to conclude that the actual average heart rate is not 72?

> t.test(x = body$HEART, mu = 72, alternative = "two.sided")

One Sample t-test

data:  body$HEART
t = 2.844, df = 129, p-value = 0.005182
alternative hypothesis: true mean is not equal to 72
95 percent confidence interval:
 72.53607 74.98701
sample estimates:
mean of x
 73.76154
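The t statistic and p-value reported by t.test() can be reproduced from the one-sample formula t = (mean - mu) / (sd / sqrt(n)). Because the bodytemp.csv file is not available here, the sketch below uses simulated heart rates (hr) as a stand-in; the seed, mean, and standard deviation of the simulation are arbitrary assumptions, so the numbers will not match the output above.

```r
set.seed(1)                            # arbitrary seed for repeatability
hr <- rnorm(130, mean = 74, sd = 7)    # simulated stand-in, NOT the real data
mu0 <- 72                              # hypothesized mean

n <- length(hr)

# One-sample t statistic computed by hand.
t_by_hand <- (mean(hr) - mu0) / (sd(hr) / sqrt(n))

# Two-sided p-value: twice the upper-tail area beyond |t| with n - 1 df.
p_by_hand <- 2 * pt(abs(t_by_hand), df = n - 1, lower.tail = FALSE)

# The built-in test gives the same statistic and p-value.
res <- t.test(hr, mu = mu0, alternative = "two.sided")
res$statistic
res$p.value
```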

Two Sample Hypothesis Testing

For normally distributed data: t.test(x, y, alternative, mu = 0)

Example: Is there reason to believe that the average heart rate for males differs from that for females, based on the data we have?

> t.test(x = body[body$SEX == 1, ]$HEART, y = body[body$SEX ==

+ 2, ]$HEART, alternative = "two.sided", mu = 0)

Welch Two Sample t-test

data:  body[body$SEX == 1, ]$HEART and body[body$SEX == 2, ]$HEART
t = -0.6319, df = 116.704, p-value = 0.5287
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.243732  1.674501
sample estimates:
mean of x mean of y
 73.36923  74.15385

Pearson’s χ² Goodness of Fit Tests

Example: If we toss a die 150 times and get 22 ones, 21 twos, 22 threes, 27 fours, 22 fives, and 36 sixes, is the die fair?

> freq = c(22, 21, 22, 27, 22, 36)

> probs = rep(1/6, 6)

> chisq.test(x = freq, p = probs)

Chi-squared test for given probabilities

data:  freq
X-squared = 6.72, df = 5, p-value = 0.2423

Since the p-value is large, there is not enough evidence to conclude that the die is not fair.
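The statistic reported by chisq.test() is the sum of (observed - expected)² / expected over the six faces. A minimal sketch that reproduces the die example by hand; the names expected, x2, and p_value are illustrative.

```r
freq <- c(22, 21, 22, 27, 22, 36)
probs <- rep(1/6, 6)

# Expected count for each face of a fair die: 150 * 1/6 = 25.
expected <- sum(freq) * probs

# Pearson's chi-squared statistic.
x2 <- sum((freq - expected)^2 / expected)    # 6.72

# Upper-tail p-value with (number of categories - 1) degrees of freedom.
p_value <- pchisq(x2, df = length(freq) - 1, lower.tail = FALSE)   # about 0.2423
```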

Goodness of Fit (continued)

Exercise 39 in Chapter 9: The lunar cycle (29 days) was divided into 10 (unequal) periods, and the number of animal bites in each period is given. Is there a temporal trend in the incidence of bites?

> freq = c(137, 150, 163, 201, 269, 155, 142, 146, 148, 110)

> probs = c(rep(3/29, 9), 2/29)

> chisq.test(x = freq, p = probs)

Chi-squared test for given probabilities

data:  freq
X-squared = 85.4797, df = 9, p-value = 1.308e-14

Since the p-value is so small, there does seem to be a temporal trend in the incidence of bites.
