A gentle introduction to R – how to load in data and produce summary statistics
BRC MH Bioinformatics group
Tutorial outline
• How to install R on your own computers
– It's free
– But it's already installed on these computers
• Loading data from Excel
• Plotting
• Summary statistics
Files
• Data and slides on:
• http://core.brc.iop.kcl.ac.uk/brc-bioinformatics-workshop-october-2012
Show file extensions
• Uncheck ‘hide extensions for known file types’
• Click ‘Apply’
Installing R – skip as already installed
And follow operating system specific installation instructions
Starting R on these computers
Help files
Loading help files
• A useful function is read.table()
– It allows you to read data from spreadsheets into R
• To see its help file you can use
?read.table
• You can use ?function_name for any function to see a help file
Loading data into R from Excel
From Excel
Open testdata.xls
From Excel
• You need to save it as a comma-separated value file (.csv): go to File > Save As > Other Formats
R working directory
• To open a file you will need to point R towards the folder that contains it.
• You can do this with setwd(), but we’ll do it using the mouse
• Suppose you have the file in My Documents
Browsing folders
• To check that you are in the right folder type
getwd()
• To see files in this folder you can type
list.files()
• To list the current variables type
ls()
• Nothing should be loaded yet
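A minimal sketch of the same checks done entirely from the console, including setwd(). The folder used here is R's temporary directory, which is only a stand-in for wherever your own data lives.

```r
# Remember where we started so we can go back afterwards
old.dir <- getwd()

# Point R at a folder; tempdir() is a stand-in for your own data folder
setwd(tempdir())

# Confirm the working directory and list the files it contains
current.dir <- getwd()
print(current.dir)
print(list.files())

# List the variables currently loaded in this R session
print(ls())

# Return to the original folder
setwd(old.dir)
```

On your own machine you would pass the path of your data folder to setwd() instead of tempdir().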
Loading data
To follow along with this section, make sure your R working directory is the folder that contains the tutorial data
• Read the contents of file testdata.csv into an R variable my.data with:
my.data <- read.csv('testdata.csv')
• read.csv is a wrapper for read.table, which lets you specify more details about your file, e.g.:
my.data <- read.table('testdata.csv',sep=',',header=TRUE)
read.table()
• sep : Column separator
• header : Does the first row of the file contain column headers?
• skip : Number of rows to skip at the top of the file
• ?read.table for other useful parameters
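The sketch below shows read.csv and read.table reading the same file. It does not use the tutorial's testdata.csv; instead it writes a tiny made-up file (the column names and values are assumptions for illustration) so that it runs anywhere.

```r
# Write a tiny example file; the columns and values are made up
example.file <- tempfile(fileext = ".csv")
writeLines(c("height,weight,gender",
             "170,65,female",
             "182,80,male",
             "165,58,female"),
           example.file)

# read.csv assumes comma separators and a header row
data.a <- read.csv(example.file)

# read.table needs both stated explicitly
data.b <- read.table(example.file, sep = ",", header = TRUE)

# Compare the two results and show the data
print(all.equal(data.a, data.b))
print(data.a)
```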
Looking at loaded data
• Check your new variable is in the R environment:
ls()
• Take a look at the first few rows:
head(my.data)
• Generate some basic summary stats:
summary(my.data)
• Check the dimensions of your dataset:
dim(my.data)
• Number of rows and columns
nrow(my.data)
ncol(my.data)
• Row and column names
rownames(my.data)
colnames(my.data)
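To see what these inspection functions return without needing the tutorial file, here is a small sketch on a made-up data frame. The name toy.data and its values are assumptions, standing in for my.data.

```r
# A small made-up data frame standing in for my.data
toy.data <- data.frame(height = c(170, 182, 165, 175),
                       weight = c(65, 80, 58, 72),
                       gender = factor(c("female", "male", "female", "male")))

print(head(toy.data, n = 2))  # first two rows only
print(summary(toy.data))      # one summary per column
print(dim(toy.data))          # number of rows, then number of columns
print(nrow(toy.data))
print(ncol(toy.data))
print(colnames(toy.data))
```

Note that dim() gives the row count first and the column count second, matching the [row, column] order used when subsetting.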
Subsetting Data
• Look at the first row:
my.data[1,]
• Look at the first column:
my.data[,1]
• Look at the third column of row 10
my.data[10,3]
• Look at the first column for rows 100 to 110
my.data[100:110,1]
• Same as above, but save to a variable
my.subset <- my.data[100:110,1]
• Look at rows 30,40,50 and 60
my.data[c(30,40,50,60),]
• Same as above but pre-defining the index vector
my.indices <- c(30, 40, 50, 60)
my.data[my.indices,]
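The [row, column] indexing above can be tried on a small made-up data frame. In this sketch toy.data and its columns are assumptions for illustration; the indexing itself works the same way on my.data.

```r
# Ten made-up rows standing in for my.data
toy.data <- data.frame(id = 1:10,
                       height = seq(150, 195, by = 5),
                       weight = seq(50, 95, by = 5))

first.row <- toy.data[1, ]     # one row: still a data frame
first.col <- toy.data[, 1]     # one column: a plain vector
one.cell  <- toy.data[3, 2]    # row 3, column 2
id.range  <- toy.data[4:6, 1]  # first column for rows 4 to 6

# Pick arbitrary rows with a pre-defined index vector
toy.indices <- c(2, 4, 6)
some.rows <- toy.data[toy.indices, ]

print(first.row)
print(one.cell)
print(some.rows)
```

Selecting a single column returns a vector, while selecting a single row returns a one-row data frame, because the columns of a row can hold different types.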
You can subset on names instead of indices:
• Look at the column named 'weight' for row 1:
my.data[1,'weight']
• Look at the columns named 'weight' and 'height' for row 1:
my.data[1,c('weight','height')]
• Same as above but pre-define the colnames vector
cols <- c('weight','height')
my.data[1,cols]
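A short sketch of name-based subsetting on made-up data (toy.data is an assumption, not the tutorial dataset). It also shows the $ shorthand for pulling out a whole column by name.

```r
# Made-up data with named columns
toy.data <- data.frame(height = c(170, 182, 165),
                       weight = c(65, 80, 58))

# One named column for row 1
w1 <- toy.data[1, "weight"]

# Several named columns, returned in the order they are requested
toy.cols <- c("weight", "height")
both <- toy.data[1, toy.cols]

# $ extracts a whole column by name
all.weights <- toy.data$weight

print(w1)
print(both)
print(all.weights)
```

Subsetting by name keeps working if the columns of the file are later reordered, which positional indices do not.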
Negative indices exclude elements:
• Look at all columns except the second for row 1
my.data[1,-2]
• Extract all rows except 1-100
my.new.data <- my.data[-1:-100,]
• Extract all rows except 35, 67,101
my.indices <- -1 * c(35, 67, 101)
my.new.data <- my.data[my.indices,]
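The sketch below repeats the negative-index patterns on a ten-row made-up data frame (toy.data is an assumption for illustration) so the effect is easy to check by eye.

```r
# Ten made-up rows standing in for my.data
toy.data <- data.frame(id = 1:10,
                       height = seq(150, 195, by = 5),
                       weight = seq(50, 95, by = 5))

# All columns except the second, for row 1
row1.no.height <- toy.data[1, -2]

# All rows except 1 to 3; -(1:3) is an equivalent spelling of -1:-3
without.first.three <- toy.data[-(1:3), ]

# All rows except an arbitrary set
drop.these <- -1 * c(2, 5, 9)
kept <- toy.data[drop.these, ]

print(row1.no.height)
print(without.first.three$id)
print(kept$id)
```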
Quiz!
• How tall is the person in the 7th row?
• What gender is the person in the 300th row?
• For the people in rows 20-30, who is the heaviest?
• For the people in rows 110, 350, 219, 74, who is the tallest?
• Save all rows except 500-600 in a variable my.new.data
• How many males and females are in this new dataset?
Formatting problems
Data isn't comma-separated?
• Specify the separator in read.table
• Tab-delimited text is another common format, for which you can use sep="\t"
Load "testdata.txt", a tab-delimited version of the data
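A minimal sketch of reading tab-delimited text. It writes its own small file rather than using testdata.txt, and the column names and values are made up.

```r
# Write a small tab-delimited file; "\t" is the tab character
tab.file <- tempfile(fileext = ".txt")
writeLines(c("height\tweight\tgender",
             "170\t65\tfemale",
             "182\t80\tmale"),
           tab.file)

# Tell read.table that columns are separated by tabs
tab.data <- read.table(tab.file, sep = "\t", header = TRUE)
print(tab.data)
```

read.delim() is a ready-made wrapper for the same job: it calls read.table with sep = "\t" and header = TRUE as defaults.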
Data has extra header information at the top?
• Either delete this data in Excel before exporting to csv
• Or, use the skip=N argument to read.table
Have a look at "testdata_1.csv" in Excel and then load it into R using read.table
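The skip argument can be sketched as follows. The file here is made up and is not testdata_1.csv; it simply has two lines of extra information above the real header.

```r
# A made-up file with two lines of notes above the column headers
messy.file <- tempfile(fileext = ".csv")
writeLines(c("Study: example export",
             "Created by: someone",
             "height,weight,gender",
             "170,65,female",
             "182,80,male"),
           messy.file)

# Skip the two note lines; the header is then read from line 3
skipped.data <- read.table(messy.file, sep = ",", header = TRUE, skip = 2)
print(skipped.data)
```

For your own file, open it in a text editor or Excel first to count how many lines sit above the header row.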
Factors are inconsistently named
• R will just read in the data you give it.
• If you aren't consistent naming the levels of your factors it will see them as different levels
• R is case sensitive. 'MyLevel' != 'mylevel'
Load the data from testdata_2.csv and have a look at the gender variable.
Try and fix the problems in Excel and reload.
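As an alternative to fixing the file in Excel, inconsistent labels can often be spotted and tidied in R. This is only a sketch on made-up values (they are not taken from testdata_2.csv), and it assumes the only problems are letter case and stray spaces.

```r
# Made-up labels with inconsistent case and a trailing space
gender.raw <- c("Male", "male", "FEMALE", "female", "Female ", "male")

# table() counts each distinct spelling separately
print(table(gender.raw))

# Trim surrounding spaces, lower-case everything, then make a factor
gender.clean <- factor(tolower(trimws(gender.raw)))
print(levels(gender.clean))
print(table(gender.clean))
```

Running table() on a column straight after loading is a quick way to notice this kind of problem.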
Measurements and units in a single column
• If you store values like 10kg, R will not interpret this as a numeric column
Try loading file 'testdata_3.csv'. What has happened to the weight and height information?
Try loading again so that the two are loaded as character vectors.
Have a look at the sub() function and see if you can fix the problem
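One possible approach with sub(), sketched on a made-up file rather than testdata_3.csv. It assumes every value carries the same unit suffix ("kg" or "cm"); mixed units would need more care.

```r
# A made-up file where units are stored alongside the numbers
unit.file <- tempfile(fileext = ".csv")
writeLines(c("weight,height",
             "65kg,170cm",
             "80kg,182cm"),
           unit.file)

# Load the columns as character vectors rather than factors
unit.data <- read.csv(unit.file, stringsAsFactors = FALSE)

# Remove the unit text, then convert what is left to numbers
unit.data$weight <- as.numeric(sub("kg", "", unit.data$weight))
unit.data$height <- as.numeric(sub("cm", "", unit.data$height))

str(unit.data)
```

sub() replaces the first match of a pattern in each string, which is enough when the unit appears once per value.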
Excel has just screwed up your data
• Older versions of Excel have a limit of 65536 rows. If you open a larger dataset in Excel it will be truncated. If you then save this dataset you will be saving the truncated version.
Avoid opening large datasets in Excel; use R instead
• Excel tries to be helpful by formatting elements for you. Try the following and then open in Excel, save as csv and reload into R. What has happened?
my.genes<-c('MASH1','SOX2','OCT4')
write.csv(my.genes, file='mygenes.csv')
Plotting
Drawing histograms
Optional exercises –
1) Try drawing a histogram of height
2) Try and label the x axis [hint: read the help file]
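The slide's own histogram code is not part of this transcript, so the following is only a sketch on simulated weights (toy.weights is an assumption). It shows the basic hist() call together with its main and xlab arguments.

```r
# When run as a script with no screen, draw to a throwaway device
if (!interactive()) pdf(NULL)

# Simulated weights standing in for my.data$weight
set.seed(1)
toy.weights <- rnorm(100, mean = 70, sd = 10)

# Draw the histogram with a title and an x-axis label
h <- hist(toy.weights,
          main = "Histogram of weight",
          xlab = "Weight (kg)")

# hist() also returns the bin counts it plotted
print(h$counts)

if (!interactive()) invisible(dev.off())
```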
Drawing normal QQ plots
qqnorm(my.data$weight);qqline(my.data$weight)
Drawing scatterplots
Optional exercises: try these. Do you understand this plot?
plot(height~weight,data=my.data)
plot(height~weight,data=my.data,col=as.numeric(factor(gender)))
Drawing boxplots
boxplot(height~gender,data=my.data)
Saving plots
JPEGs
jpeg("boxplot.jpg")
boxplot(height~gender,data=my.data)
dev.off()
PDFs
pdf("boxplot.pdf")
boxplot(height~gender,data=my.data)
dev.off()
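The pattern in both cases is the same: open a file device, draw, then close the device with dev.off(). A self-contained sketch on made-up data follows; toy.data and the output file name are assumptions, and the file goes to R's temporary folder.

```r
# Made-up data standing in for my.data
toy.data <- data.frame(height = c(160, 170, 165, 180, 175, 185),
                       gender = c("female", "female", "female",
                                  "male", "male", "male"))

# 1. Open the file device (width and height are in inches for pdf)
plot.file <- file.path(tempdir(), "toy_boxplot.pdf")
pdf(plot.file, width = 6, height = 4)

# 2. Draw the plot; it goes to the file, not the screen
boxplot(height ~ gender, data = toy.data, ylab = "Height (cm)")

# 3. Close the device so the file is written out completely
invisible(dev.off())

print(file.exists(plot.file))
```

Forgetting dev.off() is a common cause of empty or unreadable plot files.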
Summary statistics
Functions Covered
http://www.statmethods.net/index.html
Writing tables
Calculate Mean and SD
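The code from this slide is not in the transcript. As a sketch of one common way to do it, the example below uses mean(), sd() and tapply() on made-up data (toy.data is an assumption).

```r
# Made-up data standing in for my.data
toy.data <- data.frame(height = c(160, 170, 180, 175, 165, 185),
                       gender = c("female", "female", "male",
                                  "male", "female", "male"))

# Overall mean and standard deviation of one column
mean.height <- mean(toy.data$height)
sd.height <- sd(toy.data$height)

# The same statistic split by group
group.means <- tapply(toy.data$height, toy.data$gender, mean)

# Missing values give NA unless they are removed explicitly
mean.without.na <- mean(c(1, NA, 3), na.rm = TRUE)

print(mean.height)
print(sd.height)
print(group.means)
print(mean.without.na)
```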
Correlate phenotypes and test for group differences
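The slide's code is not in the transcript, so this is a sketch using two standard base R tests: cor.test() for a correlation and t.test() for a difference between two groups. The data are simulated and all variable names are assumptions.

```r
# Simulated data: 25 females and 25 males
set.seed(42)
n <- 50
gender <- rep(c("female", "male"), each = n / 2)
height <- rnorm(n, mean = ifelse(gender == "male", 178, 165), sd = 7)
weight <- 0.9 * height - 85 + rnorm(n, sd = 8)
toy.data <- data.frame(height, weight, gender)

# Pearson correlation between two measurements
height.weight.cor <- cor.test(toy.data$height, toy.data$weight)

# Two-sample t-test of height between the two groups
group.test <- t.test(height ~ gender, data = toy.data)

print(height.weight.cor$estimate)
print(group.test$p.value)
```

By default t.test() does not assume equal variances in the two groups (the Welch test); pass var.equal = TRUE for the classical version.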
It is always important to check model assumptions before making statistical inferences
Linear regression
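The slide's regression code is not in the transcript. The sketch below fits a simple linear model with lm() on simulated data (all names and values are assumptions) and then draws a normal QQ plot of the residuals as one quick check of the model assumptions.

```r
# When run as a script with no screen, draw to a throwaway device
if (!interactive()) pdf(NULL)

# Simulated data with a known linear relationship plus noise
set.seed(7)
toy.data <- data.frame(weight = runif(40, min = 50, max = 100))
toy.data$height <- 120 + 0.7 * toy.data$weight + rnorm(40, sd = 5)

# Fit height as a linear function of weight
fit <- lm(height ~ weight, data = toy.data)

print(coef(fit))               # intercept and slope
print(summary(fit)$r.squared)  # proportion of variance explained

# Residuals should look roughly normal if the model is reasonable
fit.residuals <- residuals(fit)
qqnorm(fit.residuals)
qqline(fit.residuals)

if (!interactive()) invisible(dev.off())
```

summary(fit) prints the full table of estimates, standard errors and p-values.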