Biostatistics with R
Source: vanburenlab.com/wp-content/uploads/2017/11/R-Projects-2017.pdf


5 Practice your new skills in R and present your results

R projects

Section 1

Project Instructions

These projects are learning exercises. You are encouraged to read through all of the projects, as I have included notes that are meant to teach some useful details about the R language for processing data. Each student should present their own project, but you are encouraged to get help from your peers or from the instructor as needed.

1. Select one of the projects described in the following sections of this chapter.

2. Use R to import the respective data file.

3. Use R to explore the data and become familiar with its structure and variable types, and process the data so it is correctly formatted for statistical analysis.

4. Use R to analyze the data to answer the questions posed in the project description.

5. Make a graph that supports this analysis.

6. Give a 5-minute presentation that:

A. Gives a synopsis of the project goal / hypothesis

B. Describes the dataset for the project

C. Describes how you used R to import and handle the dataset

D. Describes the appropriate choice of statistical test and how you conducted the analysis with R

E. Interprets the result of the analysis

F. Shows a supporting graph

Links to these instructions and data files for the project options may be found at http://vanburenlab.com

Section 2

Project 1: Snow

Before Game of Thrones, there was John Snow, one of the founders of epidemiology. He was famous for demonstrating that the source of a cholera outbreak in London was water supplies, and not “bad air”. He did this by creating a convincing map showing the pattern of the outbreak around water pumps. We will pose some other questions to answer with Snow’s counts of deaths and other cases of cholera, which are organized by district and by water company.

The project poses two questions:

1. Does the number of deaths due to cholera relative to other cases of cholera have an association with the water company? [In other words, did one of the water companies have a death rate that was worse?] Note: there were two water companies.

2. Does the number of deaths due to cholera relative to other cases of cholera have an association with the district in which the cases occurred? [In other words, did some districts have worse death rates than others?] Note: there were nine districts. If there is an association, which pairings of districts showed a difference in death rates?

Data file: http://vanburenlab.com/wp-content/uploads/2017/11/snow.csv

Answering these questions requires Chi-squared analysis. Manipulation of the data will be required to build the appropriate tables for analysis. Note that the way we did this in class is not appropriate for this dataset, as the counts are already tabulated, and we need to extract and add certain numbers together. What we need to do is make a new table to answer question (1) and another new table to answer question (2). To do this, we need to learn a few more R tricks.

The discussion below supposes that you have managed to import the data file into a data frame called Snow.

The data frame should look like this:

> Snow
              District        Water.company Deaths Other.Cases
1  St.Savior,Southwark SouthwarkAndVauxhall    406       19211
2  St.Savior,Southwark              Lambeth     72       14129
3   St.Olave,Southwark SouthwarkAndVauxhall    277       18361
4   St.Olave,Southwark              Lambeth      0           0
5  St.George,Southwark SouthwarkAndVauxhall    388       24651
6  St.George,Southwark              Lambeth     99       23613
7           Bermondsey SouthwarkAndVauxhall    821       57063
8           Bermondsey              Lambeth      0        1785
9            Newington SouthwarkAndVauxhall    458       31482
10           Newington              Lambeth     58       33473
11             Lambeth SouthwarkAndVauxhall    525       54457
12             Lambeth              Lambeth    138       83648
13          Wandsworth SouthwarkAndVauxhall    268       18122
14          Wandsworth              Lambeth      7        3863
15         Campberwell SouthwarkAndVauxhall    352       23120
16         Campberwell              Lambeth     33       10445
17         Rotherhithe SouthwarkAndVauxhall    207       14744
18         Rotherhithe              Lambeth      0           0

For Question (1), we need a table that sums up these data by water company. There is more than one approach that would work. I will try a strategy that uses some clever indexing.

Recall that we can create a vector with a range of integers using a colon between the first and last number in the range:

> 1:18
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18

Vectors like this can be used as indices for variables. For example, suppose I wanted to get the names of districts for rows 13 to 18:

> Snow$District[13:18]
[1] Wandsworth  Wandsworth  Campberwell Campberwell Rotherhithe Rotherhithe
9 Levels: Bermondsey Campberwell Lambeth Newington ... Wandsworth

Each district is served by one or both water companies, so there are two entries for each district. Likewise, each water company served multiple districts, so there are nine entries for each water company. To focus on each question separately, we need to sum up numbers appropriately.

Coming back to Question (1), we need separate sums for each water company, and we will need another function that can get every other index number:

> seq(from=1, to=18, by=2) #for SouthwarkAndVauxhall indices
[1]  1  3  5  7  9 11 13 15 17
> seq(from=2, to=18, by=2) #for Lambeth indices
[1]  2  4  6  8 10 12 14 16 18

Note that the ‘by’ parameter tells R how much to increment between numbers, and the only difference between these two lines is the starting number ‘from’.

The next lines will get the sums that we need. They use the sum function, indexing using the seq function and matrix-style indexing of data frames:

> SV.deaths<-sum(Snow[seq(from=1, to=18, by=2),3])
> SV.cases<-sum(Snow[seq(from=1, to=18, by=2),4])
> L.deaths<-sum(Snow[seq(from=2, to=18, by=2),3])
> L.cases<-sum(Snow[seq(from=2, to=18, by=2),4])

Let’s examine this piece:

Snow[seq(from=1, to=18, by=2),3]

We have previously seen that you can extract a column from a data frame by using $ notation: Snow$Deaths

Above we see that we can also extract specific cells using coordinates of rows and columns in the data frame. For example, Snow[1,2] would be the element in the first row of the second column of the Snow data frame:

> Snow[1,2]
[1] SouthwarkAndVauxhall
Levels: Lambeth SouthwarkAndVauxhall

The row and column indices are separated by a comma, and each can be a vector representing multiple rows or multiple columns. The 3rd column is ‘Deaths’, so Snow[seq(from=1, to=18, by=2),3] gets a vector of the number of Deaths for each of the SouthwarkAndVauxhall entries. The sum function just adds the elements of this vector together to get a single value.
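Indexing by position works here because the rows strictly alternate between the two companies. As a cross-check, here is a minimal sketch that sums by company name instead. It assumes the data frame is typed in by hand from the printed table rather than imported, and the names districts and totals are mine, for illustration only.

```r
# Rebuild Snow by hand from the printed table (assumption: in practice you
# would import snow.csv instead of typing the numbers in).
districts <- c("St.Savior,Southwark", "St.Olave,Southwark", "St.George,Southwark",
               "Bermondsey", "Newington", "Lambeth", "Wandsworth",
               "Campberwell", "Rotherhithe")
Snow <- data.frame(
  District      = rep(districts, each = 2),
  Water.company = rep(c("SouthwarkAndVauxhall", "Lambeth"), times = 9),
  Deaths        = c(406, 72, 277, 0, 388, 99, 821, 0, 458, 58,
                    525, 138, 268, 7, 352, 33, 207, 0),
  Other.Cases   = c(19211, 14129, 18361, 0, 24651, 23613, 57063, 1785,
                    31482, 33473, 54457, 83648, 18122, 3863, 23120, 10445,
                    14744, 0)
)

# Logical indexing: keep only the rows whose company matches the name.
SV.deaths <- sum(Snow$Deaths[Snow$Water.company == "SouthwarkAndVauxhall"])
L.deaths  <- sum(Snow$Deaths[Snow$Water.company == "Lambeth"])

# aggregate() sums both count columns for both companies in one call.
totals <- aggregate(cbind(Deaths, Other.Cases) ~ Water.company,
                    data = Snow, FUN = sum)
print(totals)
```

Name-based indexing keeps working if the rows are ever reordered, which position-based indexing with seq() does not.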

To do a Chi-squared analysis, we need to make a table of these four numbers: SV.deaths, SV.cases, L.deaths, L.cases. Here’s how:

> by.water.company<-matrix(c(SV.deaths,SV.cases, L.deaths,L.cases),byrow=TRUE, nrow=2)
> by.water.company
     [,1]   [,2]
[1,] 3702 261211
[2,]  407 170956

The above statement makes a matrix with entries supplied by the first parameter (a vector). The matrix is populated ‘by row’ and the number of rows is 2, as specified by the other two parameters listed.
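To see why byrow=TRUE matters, here is a small sketch that fills the same four totals both ways. The names counts, by.row and by.col are made up for this illustration.

```r
# The four totals in the order SV.deaths, SV.cases, L.deaths, L.cases
counts <- c(3702, 261211, 407, 170956)

by.row <- matrix(counts, byrow = TRUE, nrow = 2)  # fills row 1 first, then row 2
by.col <- matrix(counts, nrow = 2)                # default: fills column 1 first

by.row[1, ]  # both SouthwarkAndVauxhall numbers, as intended
by.col[1, ]  # deaths for both companies: one company per column instead
```

Without byrow=TRUE the matrix comes out transposed, so the row and column labels added afterwards would be attached to the wrong numbers.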

To keep track of these entries, let’s make sure that columns and rows in this table are labelled appropriately:

> colnames(by.water.company)<-c("Deaths","Other.cases")
> rownames(by.water.company)<-c("SouthwarkAndVauxhall","Lambeth")
> by.water.company
                     Deaths Other.cases
SouthwarkAndVauxhall   3702      261211
Lambeth                 407      170956

That’s what you need for the analysis of the first question.
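As a rough sketch of the next step, the table can be handed to chisq.test(), R's built-in Chi-squared test. The matrix is rebuilt here from the totals printed above so that the example runs on its own. The names result and death.share are mine, and treating deaths divided by the row total as the quantity of interest is an assumption you should check against the project question.

```r
# Rebuild the labelled 2 x 2 table from the totals printed above
by.water.company <- matrix(
  c(3702, 261211, 407, 170956), byrow = TRUE, nrow = 2,
  dimnames = list(c("SouthwarkAndVauxhall", "Lambeth"),
                  c("Deaths", "Other.cases"))
)

# Chi-squared test of association between water company and outcome.
# For a 2 x 2 table R applies Yates' continuity correction by default.
result <- chisq.test(by.water.company)
print(result)

result$expected   # counts expected if company and outcome were unrelated

# Share of deaths within each row, to see the direction of any difference
death.share <- by.water.company[, "Deaths"] / rowSums(by.water.company)
print(death.share)
```

The test only says whether there is an association. Comparing the two shares tells you which company fared worse.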

Let’s look at what is needed in the second question, and see if we can be more efficient. Here we need to sum by District.

> attach(Snow)
> Deaths.by.district<-Deaths[seq(from=1, to=18, by=2)]+Deaths[seq(from=2, to=18, by=2)]
> Cases.by.district<-Other.Cases[seq(from=1, to=18, by=2)]+Other.Cases[seq(from=2, to=18, by=2)]

Those statements created separate vectors with the number of deaths by district and the number of other cases by district, respectively. Next we will use the cbind (short for ‘column bind’) function to stick these vectors together as columns in a table.

> by.district<-cbind(Deaths.by.district,Cases.by.district)
> by.district
      Deaths.by.district Cases.by.district
 [1,]                478             33340
 [2,]                277             18361
 [3,]                487             48264
 [4,]                821             58848
 [5,]                516             64955
 [6,]                663            138105
 [7,]                275             21985
 [8,]                385             33565
 [9,]                207             14744

Note that this function automatically assigns the component variable names as the column names in the new table. Let’s add appropriate row names:

> rownames(by.district)<-District[seq(from=1, to=18, by=2)]
> by.district
                    Deaths.by.district Cases.by.district
St.Savior,Southwark                478             33340
St.Olave,Southwark                 277             18361
St.George,Southwark                487             48264
Bermondsey                         821             58848
Newington                          516             64955
Lambeth                            663            138105
Wandsworth                         275             21985
Campberwell                        385             33565
Rotherhithe                        207             14744

And that is the table needed to answer Question (2).
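One possible way to carry the analysis forward is sketched below. It runs an overall Chi-squared test on the district table and then compares districts two at a time with pairwise.prop.test(). The project text does not say which pairwise method to use, so that choice, the "holm" adjustment for multiple comparisons, and the names overall and pairs are all my assumptions. The table is rebuilt from the printed values so the example runs on its own.

```r
# Rebuild the district table from the values printed above
by.district <- cbind(
  Deaths.by.district = c(478, 277, 487, 821, 516, 663, 275, 385, 207),
  Cases.by.district  = c(33340, 18361, 48264, 58848, 64955, 138105,
                         21985, 33565, 14744)
)
rownames(by.district) <- c("St.Savior,Southwark", "St.Olave,Southwark",
                           "St.George,Southwark", "Bermondsey", "Newington",
                           "Lambeth", "Wandsworth", "Campberwell",
                           "Rotherhithe")

# Overall test: is there any association between district and outcome?
overall <- chisq.test(by.district)
print(overall)

# Pairwise follow-up: a two-column matrix is read as (successes, failures),
# so each row's proportion is deaths / (deaths + other cases).
pairs <- pairwise.prop.test(by.district, p.adjust.method = "holm")
print(pairs$p.value)   # lower triangle of adjusted p-values, one per pairing
```

With nine districts there are 36 pairings, which is why some adjustment for multiple comparisons is worth considering.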

Section 3

Project 2: Bugs

Mosquitos are assholes. While getting a free meal at our expense, they transmit several nasty diseases, including malaria, dengue fever, yellow fever, Zika, West Nile, chikungunya, and even elephantiasis. Let’s agree that mosquitos should be wiped out. Until that glorious day arrives, we should try our best to keep them from biting us. The dataset that we will examine is for 5 different candidate repellent applications and subsequent ‘bite rates’ by mosquitos, with 30 subjects per treatment. The dataset for this project has 2-column data, where the first column is a code (1-5) for the treatment given (explained in the description file), and the second column is a measure of the ‘bite rate’. From the published research:

The entomological parameter studied was ‘man-mosquito contact’ (number of mosquito bites) which was compared among the ‘test groups’ (those with treated patches) and the control group. The ‘Per man-hour biting rate’ (between 1700 – 1800 hrs) using the ‘Flashlight method’ was recorded by health assistants.

Question for this project:

1. Is there a difference in the outcomes among the different treatments? If so, what are the specific pairwise differences between treatments?

You will need to do ANOVA followed by pairwise t-tests to answer these questions.

Data file: http://vanburenlab.com/wp-content/uploads/2017/11/mosquito_patch.csv

Description: http://vanburenlab.com/wp-content/uploads/2017/11/mosquito_patch.txt

Special note:

1. The first column contains a factor, namely the treatment. However, the study authors represented that factor with a number instead of text, which is not so good for R. The reason that this is not good is that when you read the file into R, R will infer that this column contains integer data instead of factor data. Suppose that you made a data frame of this dataset called Bugs. The structure would look like this:

> str(Bugs)
'data.frame': 150 obs. of 2 variables:
 $ trt.mosq: int 1 1 1 1 1 1 1 1 1 1 ...
 $ y.mosq  : num 4.5 10.04 13.05 0.26 10.61 ...

So the first variable (trt.mosq) is considered to be an integer when it should be a factor. This is especially bad because it won’t cause an error that prevents the analysis from proceeding; instead, R would do an incorrect analysis that performs a regression of the outcome variable against the numbers 1-5, even though those numbers have no numerical meaning in this context (remember, they represent factors or groups and not some increasing value).
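The silent nature of this mistake is easy to demonstrate. The sketch below uses made-up random numbers in place of the real bite rates (an assumption, so the example runs without the data file) and fits the same model twice. The names as.number and as.group are mine.

```r
set.seed(1)
# Stand-in data: 5 treatment codes x 30 subjects, with invented bite rates.
Bugs <- data.frame(
  trt.mosq = rep(1:5, each = 30),
  y.mosq   = rnorm(150, mean = 10, sd = 3)   # NOT the study data
)

# Treatment left as an integer: R fits a single slope (a regression)
as.number <- anova(lm(y.mosq ~ trt.mosq, data = Bugs))

# Treatment converted to a factor: R compares five group means (an ANOVA)
as.group <- anova(lm(y.mosq ~ factor(trt.mosq), data = Bugs))

as.number$Df[1]   # 1 degree of freedom: one slope
as.group$Df[1]    # 4 degrees of freedom: five groups minus one
```

Checking the degrees of freedom in the output is a quick way to catch this: five treatment groups should give 4, not 1.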

The fix for this is easy; we overwrite the data frame variable trt.mosq using the factor() function:

> Bugs$trt.mosq<-factor(Bugs$trt.mosq)

And now when we examine the structure:

> str(Bugs)
'data.frame': 150 obs. of 2 variables:
 $ trt.mosq: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ y.mosq  : num 4.5 10.04 13.05 0.26 10.61 ...

we see that the first variable has been converted to a factor.

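Remember this for your own data: factors are best recorded as text, with words or codes such as "A", "B", "C", "C1", etc. In versions of R before 4.0.0, read.csv() converts such text columns to factors by default. From R 4.0.0 onward the default is to keep them as character strings, so either pass stringsAsFactors=TRUE when importing or convert the column with factor() as shown above.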
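Once the treatment column is a factor, the ANOVA and the pairwise t-tests might look like the following sketch. The bite rates here are invented random numbers (an assumption, so the example runs without the data file), and the "holm" adjustment and the names fit and pw are my choices rather than anything specified in the project.

```r
set.seed(42)
# Stand-in data with the same shape as the project file (values are made up)
Bugs <- data.frame(
  trt.mosq = factor(rep(1:5, each = 30)),
  y.mosq   = rnorm(150, mean = rep(c(12, 6, 6, 8, 10), each = 30), sd = 3)
)

# One-way ANOVA: does mean bite rate differ among the treatments?
fit <- aov(y.mosq ~ trt.mosq, data = Bugs)
print(summary(fit))

# Pairwise t-tests between treatments, adjusted for multiple comparisons
pw <- pairwise.t.test(Bugs$y.mosq, Bugs$trt.mosq, p.adjust.method = "holm")
print(pw$p.value)   # lower triangle: one adjusted p-value per pair
```

The pairwise step is only meaningful when the ANOVA itself indicates a difference among the groups.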

Section 4

Project 3: Rats

Some rats are little and some rats are big. Unfair! The dataset for this project contains 3-column data on rat weight gain in response to six different kinds of treatment with recombinant bovine growth hormone. The columns are Gender, Group (the treatment), and Gain, although they are unlabelled in the data file.

The questions for this project are:

1. Is weight gain associated with Gender?

2. Is weight gain associated with Treatment? If so, what pairwise comparisons of treatments show a statistical difference?

You will need to perform an ANOVA, followed by pairwise t-tests for a significant ANOVA result (for Question 2), to answer these questions.

Data file: http://vanburenlab.com/wp-content/uploads/2017/11/bgh1.dat_.txt

Description: http://vanburenlab.com/wp-content/uploads/2017/11/bgh1.txt

Special notes:

1. The data file is space separated, so you will need to use the read.table() function instead of the read.csv() function to import this data file.

2. The data file does not have column labels. Supposing you imported the data into a data frame called Rats, you can add column labels like this:

> colnames(Rats)<-c("Gender","Treatment","Gain")

3. As in Project 2, the study authors have recorded factor groups with integers. This is not good in R, because R will infer that these are integer variables and will then incorrectly treat the analysis like a regression against those integer values (even though they have no numerical meaning). This can be fixed by overwriting the integer values using the factor() function. Supposing you imported the data into a data frame called Rats, and labelled the columns as suggested above, you can fix the data type problem with:

> Rats$Gender<-factor(Rats$Gender)
> Rats$Treatment<-factor(Rats$Treatment)
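The three special notes for the Rats project can be strung together as in the sketch below. The four lines of text stand in for the real file, and both the numbers and the integer codes in them are invented, so treat them as assumptions. With the real data you would pass the data file's URL or path to read.table() instead of text=.

```r
# Invented, space-separated sample with no header row (NOT the real data)
raw <- "1 1 25.3
1 2 30.1
2 1 22.8
2 2 27.4"

# read.table() splits on whitespace by default; header = FALSE because the
# file has no column labels
Rats <- read.table(text = raw, header = FALSE)

# Note 2: label the columns
colnames(Rats) <- c("Gender", "Treatment", "Gain")

# Note 3: the integer codes are groups, so convert them to factors
Rats$Gender    <- factor(Rats$Gender)
Rats$Treatment <- factor(Rats$Treatment)

str(Rats)   # both code columns should now be listed as Factor
```

Running str() after each step is a cheap way to confirm that the import and the conversions did what you expected before starting the ANOVA.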