BIOSTAT640 R Lab1 for Spring 2016 - UMasscourses.umass.edu/biep640w/pdf/R lab session 1...


BIOSTAT640 R Lab1 for Spring 2016

Minming Li & Steele H. Valenzuela

Feb.1, 2016

This is the first R lab session of the course BIOSTAT640 at UMass during the Spring 2016 semester. I, Minming (Matt) Li, am going to use the R code from some questions of the first 3 homework sets as examples for the R lab session. For the Amherst class on Wed (02/03/2016), I am going to present this document, and during this talk I am also going to cover the key points and personal suggestions for using RStudio, R Markdown, etc. The material starting from Page 14 (the part headed "First Session") was written by Steele Valenzuela in 2015, and I have combined it into this file.

For HW1 questions after Q4, please refer to the other version of the solution by Prof. Carol Bigelow. Here we will mainly use R to solve HW1: Q1-Q4.

Q1:

a <- c(5, 10, 6, 11, 5, 14, 30, 11, 17, 3, 9, 3, 8, 8, 5, 5, 7, 4, 3, 7, 9, 11, 11, 9, 4)
breaks <- seq(0, 30, by=5) # set intervals
# breaks
a.cut <- cut(a, breaks, right=TRUE)

a.freq <- table(a.cut)
# a.freq
a.cumfreq <- cumsum(a.freq) # cumsum means the cumulative sums
# a.cumfreq
a.relfreq <- a.freq/length(a)
# a.relfreq
a.cumrelfreq <- cumsum(a.relfreq)
# a.cumrelfreq

Q1table <- cbind(a.freq, a.cumfreq, a.relfreq, a.cumrelfreq)
colnames(Q1table) <- c("Freq", "Cum Freq", "Rel Freq", "Cum Rel Freq")
Q1table

##         Freq Cum Freq Rel Freq Cum Rel Freq
## (0,5]      9        9     0.36         0.36
## (5,10]     9       18     0.36         0.72
## (10,15]    5       23     0.20         0.92
## (15,20]    1       24     0.04         0.96
## (20,25]    0       24     0.00         0.96
## (25,30]    1       25     0.04         1.00
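The right=TRUE argument in cut() decides which end of each interval is closed. That matters here because several observations (5, 10, 30) sit exactly on a break. Below is a minimal sketch of the difference, using three illustrative values rather than the full homework vector; the object names v, closed.right and closed.left are made up for this example.

```r
# Illustrative values that fall exactly on the breaks
v <- c(5, 10, 30)
breaks <- seq(0, 30, by = 5)

# right = TRUE: intervals are (lower, upper], so 5 is counted in (0,5]
closed.right <- cut(v, breaks, right = TRUE)

# right = FALSE: intervals are [lower, upper), so 5 moves to [5,10)
# and 30 falls outside every interval (NA) unless include.lowest = TRUE
closed.left <- cut(v, breaks, right = FALSE)

as.character(closed.right)
as.character(closed.left)
```

With right=FALSE the value 30 would be dropped from the frequency table, so the choice of right=TRUE above keeps all 25 observations.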

Q2: Boxplot for cholesterol levels (mg/dl) for two groups of men


grp1 <- c(233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325)

grp2 <- c(344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213)

cholesterol <- cbind(grp1, grp2)
# cholesterol
# median(grp1) #242.5
# median(grp2) #207

# 2a: Side-by-side box plot
boxplot(grp1, grp2)

[Figure: side-by-side boxplots of group 1 and group 2; vertical axis from 150 to 350 mg/dl.]

# 2b. Side-by-side histograms with same definitions (starting value, ending value, tick marks, etc) of the horizontal and vertical axes.
# summary(grp1)
# summary(grp2)
par(mfrow=c(1,2))
hist(grp1, breaks=seq(120, 360, by=20))
hist(grp2, breaks=seq(120, 360, by=20))
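The two hist() calls share the same breaks, which fixes the horizontal axis, but each plot still picks its own vertical range. One possible way to force a common vertical axis as well is to compute the bin counts first and pass a shared ylim. This is a sketch, not the homework solution; the names brks, h1, h2 and ymax are introduced only for this example.

```r
grp1 <- c(233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325)
grp2 <- c(344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213)

brks <- seq(120, 360, by = 20)

# plot = FALSE returns the histogram object (bin counts etc.) without drawing
h1 <- hist(grp1, breaks = brks, plot = FALSE)
h2 <- hist(grp2, breaks = brks, plot = FALSE)

# Tallest bar across both groups sets the common vertical limit
ymax <- max(h1$counts, h2$counts)

op <- par(mfrow = c(1, 2))   # remember the old settings
plot(h1, ylim = c(0, ymax), main = "Group 1", xlab = "Cholesterol (mg/dl)")
plot(h2, ylim = c(0, ymax), main = "Group 2", xlab = "Cholesterol (mg/dl)")
par(op)                      # restore the old settings
```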


[Figure: two side-by-side histograms, "Histogram of grp1" and "Histogram of grp2"; horizontal axes from 150 to 350, vertical axes labeled Frequency (0 to 5 for grp1, 0 to 4 for grp2).]

Comparing the two distributions, the first group of men has a higher median, Q1, Q3, and minimum than the second group, but a smaller maximum.

Q3: Prob(11 or more of the 64 exposed firefighters reporting breath shortness):

1-pbinom(10, 64, 1/22)

## [1] 0.0001360821

# OR WE CAN USE: sum(dbinom(11:64, 64, 1/22)) # 0.0001360821
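The calculation uses the complement rule, P(X >= 11) = 1 - P(X <= 10). As a cross-check, the sketch below computes the same probability three ways; lower.tail = FALSE is a standard argument of pbinom(), while the object names are made up for this example.

```r
n <- 64       # exposed firefighters
p <- 1/22     # probability used in the homework

# 1. Complement of the cumulative probability P(X <= 10)
p.complement <- 1 - pbinom(10, n, p)

# 2. Upper tail directly: P(X > 10), which is the same event as X >= 11
p.uppertail <- pbinom(10, n, p, lower.tail = FALSE)

# 3. Sum of the individual probabilities P(X = 11), ..., P(X = 64)
p.direct <- sum(dbinom(11:n, n, p))

c(p.complement, p.uppertail, p.direct)
```

All three should agree up to rounding error. The upper-tail form tends to be the more accurate choice when the probability is very small.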

Q4: 4a. What proportion of women have weights that are outside the ACES-II ejection seat acceptable range?

pnorm(140, mean=143, sd=29) + (1-pnorm(211, mean=143, sd=29))

## [1] 0.4683215

4b. In a sample of 1000 women, how many are expected to have weights below the 140 lb threshold?


1000*pnorm(140, mean=143, sd=29)

## [1] 458.8036

So, there are about 459 women that are expected to have weights below the 140 lb threshold.
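The same answers can be reached by standardizing first, which is how the problem would be worked by hand with a Z table. A short sketch, with variable names chosen only for this illustration:

```r
mu    <- 143   # mean weight (lb)
sigma <- 29    # standard deviation (lb)
lower <- 140   # lower limit of the acceptable range
upper <- 211   # upper limit of the acceptable range

# Convert the limits to z-scores
z.lower <- (lower - mu) / sigma
z.upper <- (upper - mu) / sigma

# 4a: P(Z < z.lower) + P(Z > z.upper) on the standard normal scale
p.outside <- pnorm(z.lower) + pnorm(z.upper, lower.tail = FALSE)

# 4b: expected count below the lower limit in a sample of 1000
expected.below <- 1000 * pnorm(z.lower)

p.outside
expected.below
```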

For HW2 Q1 and Q2, please refer to the other version of the solution by Prof. Carol Bigelow. Here we will mainly use R to solve HW2-Q3.

Q3: Let’s download the dataset.

library(foreign)
url <- "http://people.umass.edu/biep640w/datasets/week02.dta"
dat <- read.dta(file = url) # read.dta: Read Stata Binary Files
# dat
head(dat)

##    temp boiling
## 1 210.8  29.211
## 2 210.1  28.559
## 3 208.4  27.972
## 4 202.5  24.697
## 5 200.6  23.726
## 6 200.1  23.369

x <- dat$temp
y1 <- dat$boiling
m1 <- lm(y1 ~ x)
# OR: m1 <- lm(boiling~temp, data=dat)
# OR: m1 <- lm(dat$boiling ~ dat$temp)

summary(m1)

##
## Call:
## lm(formula = y1 ~ x)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.8026  0.0057  0.1418  0.5573  1.0022
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -65.34300    4.64381  -14.07 1.73e-14 ***
## x             0.44379    0.02419   18.35  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.157 on 29 degrees of freedom
## Multiple R-squared:  0.9207, Adjusted R-squared:  0.9179
## F-statistic: 336.6 on 1 and 29 DF,  p-value: < 2.2e-16

m1$coefficients # output the intercept and slope

## (Intercept)           x
## -65.3429977   0.4437942

# confint(m1) # output the CI for the parameters in the fitted model
anova(m1)

## Analysis of Variance Table
##
## Response: y1
##           Df Sum Sq Mean Sq F value    Pr(>F)
## x          1 450.56  450.56   336.6 < 2.2e-16 ***
## Residuals 29  38.82    1.34
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(x, y1, main="Simple Linear Regression of Y=boiling on X=temp", xlab="temp",
     ylab="boiling", pch=18) # pch means "plotting character"; integer codes 0 to 25 select built-in symbols

abline(m1, col="red", lty=1, lwd=2) # lty: "line type"; lwd: "line width".

[Figure: scatterplot titled "Simple Linear Regression of Y=boiling on X=temp", with temp (180 to 210) on the horizontal axis, boiling (15 to 25) on the vertical axis, and the fitted line in red.]

From the above summary(m1) output, we know that:

(a): Parameter Estimates: -65.34 and 0.44, so regression line: Y=-65.34+0.44*X;

(b): ANOVA (Analysis of Variance) table is obtained above using the anova(m1) code;

(c): R-square=0.9207;

(d): The scatterplot with the fitted line is plotted above.
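The pieces of the summary(m1) output are tied together by the ANOVA table. As a sanity check, the sketch below rebuilds R-squared, the F statistic and the residual standard error from the sums of squares printed by anova(m1). The numbers are copied from the output above; the variable names are made up for this example.

```r
# Values copied from the anova(m1) table
ss.reg <- 450.56   # sum of squares explained by x
ss.res <- 38.82    # residual sum of squares
df.reg <- 1
df.res <- 29

# R-squared = explained variation / total variation
r.squared <- ss.reg / (ss.reg + ss.res)

# F = mean square for regression / residual mean square
f.stat <- (ss.reg / df.reg) / (ss.res / df.res)

# Residual standard error = square root of the residual mean square
resid.se <- sqrt(ss.res / df.res)

round(c(r.squared = r.squared, f.stat = f.stat, resid.se = resid.se), 4)
```

Up to the rounding of the printed table, these should match the 0.9207, 336.6 and 1.157 reported by summary(m1).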

(2). Now, instead of Y=boiling, we want to use newy=100*log10(boiling)

newy=100*log10(y1)
m2 <- lm(newy ~ x)
summary(m2)

##
## Call:
## lm(formula = newy ~ x)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -14.2548   0.3286   0.6079   0.9923   1.4159
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -48.85829   11.88807   -4.11 0.000297 ***
## x             0.92615    0.06192   14.96 3.62e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.962 on 29 degrees of freedom
## Multiple R-squared:  0.8852, Adjusted R-squared:  0.8813
## F-statistic: 223.7 on 1 and 29 DF,  p-value: 3.623e-15

m2$coefficients # output the intercept and slope

## (Intercept)           x
## -48.8582854   0.9261547

# confint(m2) # output the CI for the parameters in the fitted model
anova(m2)

## Analysis of Variance Table
##
## Response: newy
##           Df Sum Sq Mean Sq F value    Pr(>F)
## x          1 1962.2 1962.25  223.69 3.623e-15 ***
## Residuals 29  254.4    8.77
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(x, newy, main="Simple Linear Regression of Y=100*log10(boiling) on X=temp", xlab="temp",
     ylab="100*log10(boiling)", pch=18) # pch means "plotting character"; integer codes 0 to 25 select built-in symbols

abline(m2, col="red", lwd=2) # lwd means "line width"

[Figure: scatterplot titled "Simple Linear Regression of Y=100*log10(boiling) on X=temp", with temp (180 to 210) on the horizontal axis, 100*log10(boiling) (110 to 140) on the vertical axis, and the fitted line in red.]

From the above summary(m2) output, we know that:

(a): Parameter Estimates: -48.86 and 0.93, so regression line: 100*log10(Y)=-48.86+0.93*X;

(b): ANOVA (Analysis of Variance) table is obtained above using the anova(m2) code;

(c): R-square=0.8852;

(d): The scatterplot with the fitted line is plotted above.

Solution (one paragraph of text that is interpretation of analysis):

Did you notice that the scatter plot of these data reveals two outlying values? Their inclusion may or may not be appropriate.

If all n=31 data points are included in the analysis, then the model that explains more of the variability in boiling point is Y=boiling point modeled linearly in X=temp. It has a greater R^2 (92.07% vs. 88.52%).

Be careful - It would not make sense to compare the residual mean squares of the two models because the scales of measurement involved are different.
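To see why residual mean squares depend on the scale of Y while R^2 does not, here is a small illustration. It is a simplified stand-in: it uses the built-in cars dataset and a plain rescaling (Y multiplied by 100) rather than the homework data and the log transformation, and all object names are made up for this example.

```r
# Fit the same relationship with Y on two different scales
fit.raw <- lm(dist ~ speed, data = cars)

cars.scaled <- transform(cars, dist100 = 100 * dist)  # same data, Y in different units
fit.scaled <- lm(dist100 ~ speed, data = cars.scaled)

# R-squared is unit free, so it is unchanged by the rescaling
r2.raw    <- summary(fit.raw)$r.squared
r2.scaled <- summary(fit.scaled)$r.squared

# The residual standard error carries the units of Y, so it is 100 times larger
sigma.raw    <- summary(fit.raw)$sigma
sigma.scaled <- summary(fit.scaled)$sigma

c(r2.raw = r2.raw, r2.scaled = r2.scaled)
c(sigma.raw = sigma.raw, sigma.scaled = sigma.scaled)
```

The residual standard error (and hence the residual mean square) changes simply because the units changed, which is why it cannot be used to rank models whose responses are on different scales.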

For HW3 Q1, Q4, Q5, please refer to the other version of the solution by Prof. Carol Bigelow. Here we will mainly use R to solve Q2 and Q3.

Q3: Use R to reproduce the ANOVA tables you worked with in Q2. Let’s download the dataset.

library(foreign)
url <- "http://people.umass.edu/biep640w/datasets/larvae.dta"
dat <- read.dta(file = url) # read.dta: Read Stata Binary Files
# dat
head(dat)

##   id     y    x1    x2
## 1  1 2.836 0.150 0.425
## 2  2 2.966 0.214 0.439
## 3  3 2.687 0.487 0.301
## 4  4 2.679 0.509 0.325
## 5  5 2.827 0.570 0.371
## 6  6 2.442 0.593 0.093

dim(dat)

## [1] 15 4

str(dat)

## 'data.frame':	15 obs. of  4 variables:
##  $ id: num  1 2 3 4 5 6 7 8 9 10 ...
##  $ y : num  2.84 2.97 2.69 2.68 2.83 ...
##  $ x1: num  0.15 0.214 0.487 0.509 0.57 ...
##  $ x2: num  0.425 0.439 0.301 0.325 0.371 ...
##  - attr(*, "datalabel")= chr "PubHlth 640 Unit 2 Regression - Larvae data"
##  - attr(*, "time.stamp")= chr "10 Feb 2013 16:36"
##  - attr(*, "formats")= chr  "%9.0g" "%9.0g" "%9.0g" "%9.0g"
##  - attr(*, "types")= int  254 254 254 254
##  - attr(*, "val.labels")= chr  "" "" "" ""
##  - attr(*, "var.labels")= chr  "larva id" "log10(survival)" "log10(dose)" "log10(weight)"
##  - attr(*, "expansion.fields")=List of 2
##   ..$ : chr  "_dta" "note1" "\"Week 3 homework assignment exercises 2 and 3\""
##   ..$ : chr  "_dta" "note0" "1"
##  - attr(*, "version")= int 12

summary(dat)

##        id             y               x1               x2
##  Min.   : 1.0   Min.   :2.351   Min.   :0.1500   Min.   :0.0930
##  1st Qu.: 4.5   1st Qu.:2.430   1st Qu.:0.5395   1st Qu.:0.1745
##  Median : 8.0   Median :2.452   Median :0.7390   Median :0.2890
##  Mean   : 8.0   Mean   :2.567   Mean   :0.7007   Mean   :0.2779
##  3rd Qu.:11.5   3rd Qu.:2.683   3rd Qu.:0.8845   3rd Qu.:0.3675
##  Max.   :15.0   Max.   :2.966   Max.   :1.1940   Max.   :0.4390

# model regression on x1 alone
m1 <- lm(y ~ x1, data=dat)
summary(m1)

##
## Call:
## lm(formula = y ~ x1, data = dat)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.184135 -0.051477  0.002581  0.067284  0.188219
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.95220    0.07356  40.136 5.13e-15 ***
## x1          -0.54986    0.09734  -5.649 7.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1067 on 13 degrees of freedom
## Multiple R-squared:  0.7105, Adjusted R-squared:  0.6883
## F-statistic: 31.91 on 1 and 13 DF,  p-value: 7.944e-05

m1$coefficients # output the intercept and slope

## (Intercept)          x1
##   2.9521990  -0.5498559


confint(m1) # output the CI for the parameters in the fitted model

##                  2.5 %     97.5 %
## (Intercept)  2.7932931  3.1111050
## x1          -0.7601409 -0.3395709

anova(m1)

## Analysis of Variance Table
##
## Response: y
##           Df  Sum Sq Mean Sq F value    Pr(>F)
## x1         1 0.36327 0.36327  31.911 7.944e-05 ***
## Residuals 13 0.14799 0.01138
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# model regression on x2 alone
m2 <- lm(y ~ x2, data=dat)
summary(m2)

##
## Call:
## lm(formula = y ~ x2, data = dat)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.14119 -0.11629  0.04170  0.07746  0.17741
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.1847     0.0820  26.644 9.92e-13 ***
## x2            1.3756     0.2747   5.007  0.00024 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1159 on 13 degrees of freedom
## Multiple R-squared:  0.6585, Adjusted R-squared:  0.6322
## F-statistic: 25.07 on 1 and 13 DF,  p-value: 0.00024

m2$coefficients # output the intercept and slope

## (Intercept)          x2
##    2.184706    1.375579

confint(m2) # output the CI for the parameters in the fitted model

##                 2.5 %   97.5 %
## (Intercept) 2.0075646 2.361847
## x2          0.7820388 1.969119


anova(m2)

## Analysis of Variance Table
##
## Response: y
##           Df  Sum Sq Mean Sq F value  Pr(>F)
## x2         1 0.33667 0.33667  25.068 0.00024 ***
## Residuals 13 0.17459 0.01343
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# model regression on x1 and x2
m3 <- lm(y ~ x1 + x2, data=dat)
summary(m3)

##
## Call:
## lm(formula = y ~ x1 + x2, data = dat)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.071789 -0.049678 -0.001716  0.029783  0.129122
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.58900    0.08361  30.966 8.09e-13 ***
## x1          -0.37848    0.06637  -5.702 9.88e-05 ***
## x2           0.87497    0.17248   5.073 0.000274 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06263 on 12 degrees of freedom
## Multiple R-squared:  0.9079, Adjusted R-squared:  0.8926
## F-statistic: 59.18 on 2 and 12 DF,  p-value: 6.085e-07

m3$coefficients # output the intercept and slopes

## (Intercept)          x1          x2
##   2.5889956  -0.3784805   0.8749749

confint(m3) # output the CI for the parameters in the fitted model

##                  2.5 %     97.5 %
## (Intercept)  2.4068296  2.7711616
## x1          -0.5230972 -0.2338637
## x2           0.4991656  1.2507841

anova(m3)

## Analysis of Variance Table
##
## Response: y
##           Df  Sum Sq Mean Sq F value    Pr(>F)
## x1         1 0.36327 0.36327  92.623 5.409e-07 ***
## x2         1 0.10093 0.10093  25.733 0.0002739 ***
## Residuals 12 0.04706 0.00392
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
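The anova() table for m3 lists sequential sums of squares: the x1 row is the same 0.36327 as in the x1-only model, and the x2 row is the extra variation explained once x1 is already in the model. The sketch below reproduces the x2 row from the residual sums of squares of m1 and m3. The numbers are copied from the tables above, and the variable names are made up for this illustration.

```r
# Residual sums of squares copied from anova(m1) and anova(m3)
sse.m1 <- 0.14799   # model with x1 only
sse.m3 <- 0.04706   # model with x1 and x2
df.m3  <- 12        # residual degrees of freedom for m3

# Extra sum of squares for x2 given x1 = drop in residual SS
ss.x2.given.x1 <- sse.m1 - sse.m3

# Partial F statistic for adding x2 to the model with x1
f.x2 <- ss.x2.given.x1 / (sse.m3 / df.m3)

# R-squared of m3 from the same pieces (total SS taken from the m1 table)
ss.total <- 0.36327 + 0.14799
r2.m3 <- 1 - sse.m3 / ss.total

round(c(ss.x2.given.x1 = ss.x2.given.x1, f.x2 = f.x2, r2.m3 = r2.m3), 4)
```

Small differences from the printed 25.733 come from rounding in the copied table values.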

As you can see above, we ran almost exactly the same code three times, once each for m1, m2, and m3. This is a good opportunity to write a function in R.

model.regress <- function(x) {
  a <- summary(x)
  b <- x$coefficients
  c <- confint(x)
  d <- anova(x)
  print(a); print(b); print(c); print(d)
  # the last line typically specifies what one wants the function to print
}

m1 <- lm(y ~ x1, data=dat)
m2 <- lm(y ~ x2, data=dat)
m3 <- lm(y ~ x1 + x2, data=dat)

model.regress(m1)

##
## Call:
## lm(formula = y ~ x1, data = dat)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.184135 -0.051477  0.002581  0.067284  0.188219
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.95220    0.07356  40.136 5.13e-15 ***
## x1          -0.54986    0.09734  -5.649 7.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1067 on 13 degrees of freedom
## Multiple R-squared:  0.7105, Adjusted R-squared:  0.6883
## F-statistic: 31.91 on 1 and 13 DF,  p-value: 7.944e-05
##
## (Intercept)          x1
##   2.9521990  -0.5498559
##                  2.5 %     97.5 %
## (Intercept)  2.7932931  3.1111050
## x1          -0.7601409 -0.3395709
## Analysis of Variance Table
##
## Response: y
##           Df  Sum Sq Mean Sq F value    Pr(>F)
## x1         1 0.36327 0.36327  31.911 7.944e-05 ***
## Residuals 13 0.14799 0.01138
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# model.regress(m2)
# model.regress(m3)
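A possible variation on model.regress is to return the pieces in a named list instead of printing them, so the results can be reused later, and to apply the function to several models at once with lapply(). This is only a sketch: the function name model.report is made up, and the built-in mtcars data stands in for the larvae data so the example runs without a download.

```r
# Collect the same four pieces, but return them instead of printing
model.report <- function(fit) {
  list(
    summary      = summary(fit),
    coefficients = coef(fit),
    conf.int     = confint(fit),
    anova        = anova(fit)
  )
}

# mtcars ships with R; it is a stand-in for the larvae dataset here
fits <- list(
  wt    = lm(mpg ~ wt, data = mtcars),
  hp    = lm(mpg ~ hp, data = mtcars),
  wt.hp = lm(mpg ~ wt + hp, data = mtcars)
)

# One call per model, results kept in a list with the same names
reports <- lapply(fits, model.report)

# Pull a single quantity out of every report, e.g. R-squared
r2 <- sapply(reports, function(r) r$summary$r.squared)
round(r2, 4)

# A single piece can still be printed on demand
reports$wt.hp$anova
```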


First Session

Steele H. Valenzuela

January 2015

Downloading and Installing R and RStudio (for PC/Mac)

If you have an Ubuntu operating system (OS) or any other OS aside from a PC/Mac, then you must love Linux and/or working in the terminal. That’s impressive and you most likely can skip ahead to the section titled ‘First Session’.

In order to download R, please see the CRAN website (https://cran.r-project.org). The website appears underdeveloped but you’re at the right spot. Under the heading Download and Install R, you’ll see the links to download R for your respective OS.

In order to download RStudio (a fancy graphical user interface [GUI]), please see the RStudio website. This website is not underdeveloped but rather crisp. Moving on, under the heading Installers for ALL Platforms, you’ll see the links to download RStudio for your respective OS. For your respective OS, it should be the first or the second link.

The best explanation to give as to why you need both is this: RStudio is the fancy and friendly GUI that keeps you sane and organized as well as providing an inviting workplace, hence the very point of a GUI. You CANNOT use RStudio without R. R is what is under the hood, the engine per se, and RStudio is the interior and exterior of your car/truck/Italian moped. I am ‘automobile illiterate’ so please don’t make me continue with this metaphor.

Launching and Exiting RStudio

Launching RStudio For PC users, please do the following: Start > All Programs > RStudio

For Mac users, please do the following: Applications > RStudio; or click on the RStudio icon shortcut on your dock; or in Spotlight search (Cmd + Spacebar, or the magnifying glass in the upper right corner), enter RStudio and press return.

Once in a while, you will be asked to update RStudio. Please do so.

Exiting RStudio From the toolbar at the top, at the far left: Click RStudio. From the drop down menu, click Quit RStudio. For PC, the shortcut is [insert shortcut here]. For Macs, the shortcut is Cmd + Q.

Toolbar and Windows

Upon starting RStudio, it will be intimidating, but do not fear, your loyal TA is here. There is a lot to grasp but getting your feet wet is the best way to progress with R.

Toolbar There are many options present in the toolbar, but for now, there are 4 commands we will highlight.

The first command is under the following: File > New File From there, one should see R Script. Similar toStata’s do-file, a script allows you to write notes, commands, and anything else one may desire for a session.

The second command is near the first: File > New Project From there, a box will pop up, asking you to Create a Project From. Choose the first option, New Directory, followed by Empty Directory, and give your project a name. For the sake of this class, name your project by the respective week of the class or the session number.

The third command is under the following: Tools > Import Dataset > From Text File.../From Web URL... Uploading the data can be painful and if it takes you longer than 10 minutes, please consult me or any online help forums. Unless one is scraping data from Twitter or using some other complex method, it should be relatively easy. If need be, copy and paste into Excel and save as a comma separated value (CSV) file and then import. You’ll also notice the shortcut symbol for Import Dataset, a sheet and an arrow.
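The CSV route can also be done entirely from the console rather than through the Import Dataset menu. The sketch below writes a tiny data frame to a temporary CSV file and reads it back; the data frame toy (three rows borrowed from the boiling-point output earlier) and the file name toy.csv are made up for this example.

```r
# A small data frame to stand in for data pasted from a spreadsheet
toy <- data.frame(
  id      = 1:3,
  temp    = c(210.8, 210.1, 208.4),
  boiling = c(29.211, 28.559, 27.972)
)

# Write it to a CSV file in R's temporary folder
csv.path <- file.path(tempdir(), "toy.csv")
write.csv(toy, csv.path, row.names = FALSE)  # row.names = FALSE avoids an extra index column

# Read it back; read.csv() assumes a header row and comma separators
toy.back <- read.csv(csv.path)

str(toy.back)
```

For a file of your own, the path passed to read.csv() would be wherever the CSV was saved.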

The fourth command is under the following: Tools > Global Options For now, Appearance and Pane Layout are readily available to change the appearance of RStudio, such as the size of the text, the color scheme of the script and code, and various other options.

RStudio Layout Upon starting RStudio, you will see a layout similar to this (enter image). One half of the computer monitor is devoted to the source code, scripts, and the console, which runs the code. As for the other half of the monitor, you will see an environment where objects are stored, such as datasets and functions, as well as the name of your files, plots, packages, and other tabs that you may explore.

First Session

Key to Colors The default color theme for RStudio is TextMate. If you recall, you may change the theme under Tools > Global Options > Appearance.

Green- These are comments. In R, lines that begin with a number sign/pound sign/hashtag (#) are comments.

Blue- These are commands. In the console, each command follows the prompt, an angle bracket ( > ).

Black- This is R output. Compare it with the output you get.

Preliminaries (working directory, start log, input data) In order to start or track a log, there is something called git. You may have heard of GitHub. We’ll talk about that at a later time. As for now, simply start a new file or script to save your notes and commands.

Your working directory is the default location to save and generate files, datasets, etc. Here are two helpful commands.

# The following command displays your current working directory.

getwd() # takes no arguments; the default location will be where the project is specified


# The following command changes your current working directory to whatever you set it to.

setwd("path/to/your/folder") # placeholder path, replace with your own; this is a pain to write out but it'll become useful
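A cautious pattern when changing the working directory is to remember the old location and switch back when finished. The sketch below uses R's temporary folder as a safe stand-in for a real project folder; the names old.wd and new.wd are made up for this example.

```r
# Remember where we started
old.wd <- getwd()

# tempdir() always exists, so it is a safe target for a demonstration
new.wd <- tempdir()
setwd(new.wd)

# Files are now read from and written to the new location
basename(getwd()) == basename(new.wd)

# file.path() builds paths with the right separator for your OS
file.path(getwd(), "my-notes.R")

# Switch back so the rest of the session is unaffected
setwd(old.wd)
```

Inside an RStudio project the working directory is already set to the project folder, so setwd() is often unnecessary there.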

R has the ability to import data from a web URL as well as a text file (CSV, comma delimited or tab files, etc.). R contains packages, unique sets of commands and datasets, which are developed by users from all over the world. Before loading a package with library(), you must install it as such: install.packages("package name"). In order to load the same dataset from the Stata handout, the following commands will ensue.

library(foreign) # a package that translates Stata, SAS, or SPSS data into R

stata <- "http://www.pauldickman.com/survival/ivf.dta" # Web URL, note '.../ivf.dta'

ivf <- read.dta(file = stata) # ivf is the name of the dataset whereas read.dta is the command

As I’m sure you’ve noticed, the symbol ‘<-’ is the assignment operator: it is how objects are stored in R.

View Data Structure Here are commands to view the structure of the data.

str(ivf) # the structure of dataset ivf

## 'data.frame': 641 obs. of 6 variables:

## $ id : num 1 2 3 4 5 6 7 8 9 10 ...

## $ matage : int 33 34 34 30 35 37 31 31 33 33 ...

## $ hyp : int 0 0 0 0 0 0 0 1 1 0 ...

## $ gestwks: num 37.7 39.2 35.7 39.3 38.4 ...

## $ sex : Factor w/ 2 levels "male","female": 2 2 2 1 2 1 1 2 1 2 ...

## $ bweight: int 2410 2977 2100 3270 2620 3260 3750 1450 3200 3675 ...

## - attr(*, "datalabel")= chr "In Vitro Fertilization data"

## - attr(*, "time.stamp")= chr "27 Aug 2001 13:11"

## - attr(*, "formats")= chr "%9.0g" "%8.0g" "%8.0g" "%9.0g" ...

## - attr(*, "types")= int 102 98 98 102 98 105

## - attr(*, "val.labels")= chr "" "" "" "" ...

## - attr(*, "var.labels")= chr "identity number" "maternal age (years)" "hypertension (1=yes, 0=no)" "gestational age (weeks)" ...

## - attr(*, "version")= int 7

## - attr(*, "label.table")=List of 1

## ..$ sex: Named int 1 2

## .. ..- attr(*, "names")= chr "male" "female"

dim(ivf) # displays the dimensions of dataset ivf

## [1] 641 6

names(ivf) # displays variable/column names

## [1] "id" "matage" "hyp" "gestwks" "sex" "bweight"

summary(ivf) # summary of ivf


## id matage hyp gestwks sex

## Min. : 1 Min. :23 Min. :0.000 Min. :24.7 male :326

## 1st Qu.:161 1st Qu.:31 1st Qu.:0.000 1st Qu.:38.0 female:315

## Median :321 Median :34 Median :0.000 Median :39.1

## Mean :321 Mean :34 Mean :0.139 Mean :38.7

## 3rd Qu.:481 3rd Qu.:37 3rd Qu.:0.000 3rd Qu.:40.1

## Max. :641 Max. :43 Max. :1.000 Max. :42.4

## NA's :2

## bweight

## Min. : 630

## 1st Qu.:2850

## Median :3200

## Mean :3129

## 3rd Qu.:3550

## Max. :4650

##

summary(ivf$sex) # counts per level of the categorical (factor) variable sex

## male female

## 326 315

summary(ivf$bweight) # numerical summary of variable bweight

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 630 2850 3200 3130 3550 4650

Examine Data The simplest way to view your data is with the following command:

View(ivf) # will display a table in the same pane as scripts

To view snippets of your data, implement the following commands:

head(ivf) # default display is 6 rows

## id matage hyp gestwks sex bweight

## 1 1 33 0 37.74 female 2410

## 2 2 34 0 39.15 female 2977

## 3 3 34 0 35.72 female 2100

## 4 4 30 0 39.29 male 3270

## 5 5 35 0 38.38 female 2620

## 6 6 37 0 37.86 male 3260

tail(ivf, 10) # displays tail end of ivf, more specifically, the last 10 rows

## id matage hyp gestwks sex bweight

## 632 632 31 0 39.66 female 3940

## 633 633 33 0 39.37 female 3545

## 634 634 35 0 39.86 female 3320

## 635 635 33 0 38.17 male 3350


## 636 636 34 0 41.15 male 2972

## 637 637 28 0 38.58 female 2850

## 638 638 38 1 38.44 male 3182

## 639 639 26 0 38.94 female 3048

## 640 640 31 0 40.43 female 3183

## 641 641 31 0 38.15 male 2920

Lastly, you may view specific rows and columns as such:

ivf[10, 6] # 10th row, 6th column

ivf[325, ] # 325th row, all columns

ivf[ , 4] # all 641 rows, 4th column
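Rows and columns can also be selected by name or by a condition, which is often easier to read than position numbers. The sketch below rebuilds the six rows shown by head(ivf) as a small data frame called ivf.head (a made-up name) so that it runs without downloading the data.

```r
# The first six rows of ivf, typed in from the head(ivf) output
ivf.head <- data.frame(
  id      = 1:6,
  matage  = c(33, 34, 34, 30, 35, 37),
  hyp     = c(0, 0, 0, 0, 0, 0),
  gestwks = c(37.74, 39.15, 35.72, 39.29, 38.38, 37.86),
  sex     = factor(c("female", "female", "female", "male", "female", "male"),
                   levels = c("male", "female")),
  bweight = c(2410, 2977, 2100, 3270, 2620, 3260)
)

# Column by name instead of position: 2nd row, column "bweight"
second.weight <- ivf.head[2, "bweight"]

# Rows that satisfy a condition, all columns
males <- ivf.head[ivf.head$sex == "male", ]

# Condition on rows plus a choice of columns
light <- ivf.head[ivf.head$bweight < 2500, c("id", "bweight")]

# subset() is a more readable alternative for interactive work
males.gest <- subset(ivf.head, sex == "male", select = c(id, gestwks))

second.weight
males
light
males.gest
```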

Single Variable Description In R, if one would like to highlight a single variable, implement the followingsyntax:

mean(ivf$bweight) # dataset followed by '$' followed by variable name

## [1] 3129

If one would like to explore a single variable, here are some commands:

min(ivf$matage); max(ivf$matage) # similar to SAS, you may use a semicolon to separate commands on one line

## [1] 23

## [1] 43

range(ivf$gestwks) # range

## [1] 24.69 42.35

mean(ivf$gestwks); median(ivf$gestwks) # mean and median

## [1] 38.69

## [1] 39.15

sd(ivf$gestwks) # standard deviation

## [1] 2.33

table(ivf$sex) # frequency

##

## male female

## 326 315
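The summary(ivf) output above reports two missing values (NA's). Functions such as mean(), sd() and range() return NA when the variable contains missing values unless told to drop them. A small sketch with a made-up vector, hyp.toy, shows the behavior:

```r
# A made-up 0/1 variable with two missing values
hyp.toy <- c(0, 0, 1, NA, 0, 1, NA, 0)

# By default a single NA makes the result NA
mean.default <- mean(hyp.toy)

# na.rm = TRUE drops the missing values before computing
mean.narm <- mean(hyp.toy, na.rm = TRUE)

# Count the missing values
n.missing <- sum(is.na(hyp.toy))

# table() silently drops NA unless asked to show it
table(hyp.toy, useNA = "ifany")

mean.default
mean.narm
n.missing
```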

The base graphics in R are really just that, basic, but in the coming weeks, new graphical packages will be explained and implemented. For now, here are a few commands:

hist(ivf$bweight) # histogram

[Figure: histogram titled "Histogram of ivf$bweight"; horizontal axis ivf$bweight from 1000 to 5000, vertical axis Frequency from 0 to 200.]

boxplot(ivf$bweight) # boxplot

[Figure: boxplot of ivf$bweight; vertical axis from 1000 to 4000.]

stem(ivf$bweight) # stem and leaf plot...not sure if you'll need to ever use this

##

## The decimal point is 2 digit(s) to the right of the |

##

## 6 | 301

## 8 | 3628

## 10 | 0206

## 12 | 0123

## 14 | 035024569

## 16 | 17745689

## 18 | 02792456

## 20 | 00880001357

## 22 | 0225689223466679

## 24 | 011122235680000114455555688899

## 26 | 000011122222345556678990000000222223345557777788

## 28 | 00000123344445556667788999990000011111222222333344445556666677778888

## 30 | 00000000000111222333444444444555666667788888889900000000000001222222+21

## 32 | 00000000122223333333444444566666666777778888999999900000000112222333+23

## 34 | 00000001222234445555566666666677888999000000011222444445555556666667+1

## 36 | 0000000011222233344455566668888899000001122222333444555667777788899

## 38 | 00011122245667889913344444577888

## 40 | 02334466892234488

## 42 | 0002389011234

## 44 | 22350025

## 46 | 5

Two Variable Descriptives Here are a few commands for two variable descriptives:

table(ivf$sex, ivf$hyp) # 2-way table

##

## 0 1

## male 273 52

## female 277 37

table(ivf$sex, ivf$hyp)/641 # cell counts divided by the number of rows (641) to give proportions; prop.table() is another option

##

## 0 1

## male 0.42590 0.08112

## female 0.43214 0.05772
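Base R also provides prop.table() for turning a table of counts into proportions, including row or column proportions. The sketch below retypes the 2-way counts shown above into a table called tab (a made-up name) so it runs on its own. Note that these counts add up to 639 rather than 641, because table() leaves out the rows with a missing value, so the overall proportions differ slightly from the ones obtained by dividing by 641.

```r
# The 2-way counts from table(ivf$sex, ivf$hyp), typed in by hand
tab <- as.table(matrix(
  c(273, 277, 52, 37), nrow = 2,
  dimnames = list(sex = c("male", "female"), hyp = c("0", "1"))
))

# Proportion of the grand total in each cell
overall <- prop.table(tab)

# Row proportions: within each sex, the share with hyp = 0 and hyp = 1
by.sex <- prop.table(tab, margin = 1)

# Percentages rounded to one decimal place
round(100 * by.sex, 1)

# Counts with row and column totals added
addmargins(tab)
```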

And some graphical commands for two variable descriptives:

boxplot(ivf$gestwks ~ ivf$sex) # one continuous variable by one categorical variable

[Figure: side-by-side boxplots of ivf$gestwks for male and female; vertical axis from 25 to 40.]

plot(ivf$gestwks, ivf$bweight, main = "insert main title here") # simple plot command

[Figure: scatterplot titled "insert main title here"; horizontal axis ivf$gestwks from 25 to 40, vertical axis ivf$bweight from 1000 to 4000.]

plot(ivf$gestwks, ivf$bweight, xlab = "x-axis label here", ylab = "y-axis label here")

[Figure: the same scatterplot with the axis labels "x-axis label here" and "y-axis label here".]

We can already see that in terms of summary statistics, Stata can be quite useful, so no hard feelings if you use Stata in the beginning.

Very Important Information

In RStudio, under one of the panes, you will see a tab for files. Depending on your working directory, you may see your own files from another project. DO NOT DELETE THESE FILES AS THEY WILL BE PERMANENTLY DELETED. This happened to a student from a class last semester...

There is a lot I have not included because there is simply too much material. If you are ever confused about a command, type the following:

?plot # this will direct you to a file under the help tab

??plot # a more general search containing the words after '??'

Lastly, there are many resources online, such as stackoverflow or ‘http://www.ats.ucla.edu/stat/r/’, just to name a few. Professor Nicholas Reich taught a course last semester in R and his website is at this link.
