To err is human – to R is divine R from step 1 for the experimental biologist with an eye on the tomoRRow! Schraga Schwartz, Bioinformatic Workshop, February 2010

Page 1:

To err is human – to R is divine

R from step 1 for the experimental biologist with an eye on the tomoRRow!

Schraga Schwartz, Bioinformatic Workshop, February 2010

Page 2:

Outline

• Why R

• How R

• iRis

• Down syndRome

• WheRe R

Page 3:

Why R?

Page 4:

R from step 1 for the experimental biologist with an eye on the tomoRRow!

The R programming language is a lot like magic... except that instead of spells you have functions.

Page 5:

SPSS / Excel user = muggle

SPSS and Excel users are like muggles. They are limited in their ability to change their environment. The way they approach a problem is constrained by how programmers employed by SPSS or Microsoft thought it should be approached. And they have to pay money to use this constraining software.

Page 6:

R user = wizard

R users are like wizards. They can rely on functions (spells) that have been developed for them by statistical researchers, but they can also create their own. They don't have to pay to use them, and once experienced enough (like Dumbledore), they are almost unlimited in their ability to change their environment.

Page 7:

R’s strengths

• Data management & manipulation

• Statistics

• Graphics

• Programming language

• Active user community

• Free!

Page 8:

R's weaknesses

• Not user friendly at the start.

• Minimal GUI.

• No commercial support.

• Substantially slower than other programming languages (e.g. Perl, Java, C++).

Page 9:

R graphics: the sky's the limit!

http://addictedtor.free.fr/graphiques/

Page 10:

How R?

Page 11:

R as a calculator

• Calculator: +, -, /, *, ^, log(), exp(), sqrt(), …

(17*0.35)^(1/3)

log(10)

exp(1)

3^-1
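One detail that is easy to miss in the calculator examples: log() is the natural logarithm. A minimal sketch, using only base R functions, of the other log bases (which become handy later when working with log2-transformed expression data):

```r
log(exp(1))        # natural log of e: 1
log10(1000)        # base-10 log: 3
log2(8)            # base-2 log: 3
log(8, base = 2)   # same as log2(8)
3^-1               # same as 1/3
```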

Page 12:

Variables in R

• Variables are assigned using either "=" or "<-"

x=12.6

x

[1] 12.6

Page 13:

Numeric vectors

A vector composed of numbers. Such a vector may be created:

1. Using the c() (short for concatenate) function:

y=c(3,7,9,11)

> y
[1] 3 7 9 11

2. Using the rep(what,how_many_times) function:

y=rep(3,30)

3. Using the ":" operator, signifying "a series of integers between":

y=1:30
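The three ways of building a numeric vector can be tried side by side. This is a small sketch; the variable names y1 to y4 are made up for illustration, and seq() is an additional base R function that is not on the slide:

```r
y1 <- c(3, 7, 9, 11)        # explicit values
y2 <- rep(3, 5)             # 3 3 3 3 3
y3 <- 1:5                   # 1 2 3 4 5
y4 <- seq(0, 1, by = 0.25)  # 0.00 0.25 0.50 0.75 1.00 (non-integer steps)
length(y2)                  # 5
```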

Page 14:

Boolean vectors

A boolean variable can be either TRUE or FALSE.

b=c(TRUE,FALSE,TRUE,FALSE,TRUE,TRUE)

sum(b) #number of "TRUE" elements
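sum(b) works because R treats TRUE as 1 and FALSE as 0 in arithmetic. A short sketch of what follows from that (which() is a base R function not shown on the slide):

```r
b <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
sum(b)     # 4: number of TRUE elements
mean(b)    # proportion of TRUE elements (4 out of 6)
sum(!b)    # 2: number of FALSE elements
which(b)   # positions of the TRUE elements: 1 3 5 6
```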

Page 15:

Vector manipulation

n=c(1,4,5,6,7,2,3,4,5,6) #creates a vector with the numbers in the brackets, stores it in n

length(n) #number of elements

n[3] #extract 3rd element in n

n[-2] #extract all of n but 2nd element

n[1:3] #extract first three elements of n

n[c(1,3,4)] #extract first, third, and fourth elements of n

Page 16:

Vector manipulation…

n+1 #add 1 to all elements in n

n*2 #multiply all elements in n by two

sum(n)

mean(n)

median(n)

var(n)

min(n)

max(n)

log(n) #take the natural log of every element in n

Page 17:

More advanced manipulation

n<4 #returns a boolean vector of the same length as n, with TRUE for each value smaller than 4 and FALSE for all other values

n[n<4] #extract all elements in n smaller than 4

n[n<4 & n!=1] #extract elements smaller than 4 AND different from 1

n[n<4 | n!=1] #extract elements smaller than 4 OR different from 1

sum(n[n<4]) #sum of elements in n with values smaller than 4
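Working through the logical indexing above on the same vector n shows the difference between & and |, and between counting and summing. A minimal sketch:

```r
n <- c(1, 4, 5, 6, 7, 2, 3, 4, 5, 6)
n[n < 4]            # 1 2 3
n[n < 4 & n != 1]   # 2 3
n[n < 4 | n != 1]   # every element: each one is either below 4 or not equal to 1
sum(n < 4)          # 3: how many elements are below 4
sum(n[n < 4])       # 6: the sum of those elements
```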

Page 18:

Functions (spells…) in R

- Functions are bits of code that receive something as input (termed: arguments) and produce something as output (termed: return value).

- A function can be recognized by the round brackets "()" following the function name.

- For example, the argument of the "mean" function is a vector of numbers; the return value is their average.
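Writing your own spell follows the same pattern of arguments in, return value out. The sketch below defines a hypothetical helper, sem(), which is not part of the slides or of base R; the value of the last expression in the function body is what the function returns:

```r
# Hypothetical helper (illustration only): standard error of the mean
sem <- function(x) {
  sd(x) / sqrt(length(x))   # value of the last expression is returned
}

n <- c(1, 4, 5, 6, 7, 2, 3, 4, 5, 6)
mean(n)   # 4.3: a built-in function
sem(n)    # our own function, called in exactly the same way
```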

Page 19:

Basic visualization of numbers

• barplot(n)

• plot(n)

• hist(n)

• boxplot(n)

• pie(n)

Page 20:

barplot(n,col="red")

[Figure: bar plot of the values in n, bars in red]

Page 21:

plot(n,col="red")

[Figure: values of n plotted against their index, points in red; x axis: Index]

Page 22:

hist(n,col="red")

[Figure: histogram of the values in n, bars in red; y axis: Frequency]

Page 23:

boxplot(n,col="red")

[Figure: box plot of the values in n, box in red]

Page 24:

pie(n[1:3])

[Figure: pie chart with three slices labeled 1, 2 and 3]

Page 25:

Help in R

Type ? + function_name, e.g.:

?barplot

Help pages contain the following components:

- function_name(package): if the package is not installed, this is the time to install it and load it (using "library")

- Description: brief overview

- Usage

- Arguments: description of the arguments (input)

- Details: more information

- Value: value returned by the function (output)

- See also: great way to learn new stuff you didn't even know you wanted to do!

- Examples: can be copy-pasted as is! Highly informative!

Page 26:

Other vectors

Character vectors:

nms=c("miriam","schragi","chaim","jochanan","ephraim","avraham","yemima","shakked","ayala","adi")

names(n)=nms #giving a name to each value in the numeric vector n

n["shakked"]

Class Exercise: Redraw some of the previous plots with the modified n!

Page 27:

The paste() function

Concatenates several character strings into a single string, separated by the string given in the sep argument (default: sep=" ")

paste("To","err","is human.","To R is","divine!",sep="_")
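The result of the call above, plus two patterns used later in these slides (building chromosome names and building plot titles), can be sketched as:

```r
paste("To", "err", "is human.", "To R is", "divine!", sep = "_")
# "To_err_is human._To R is_divine!"
paste("chr", 1:3, sep = "")   # "chr1" "chr2" "chr3": works on whole vectors
paste("Mean is", 5.5)         # "Mean is 5.5": numbers are converted to text
```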

Page 28:

Factor vectors (We love factors!)

f=as.factor(c("stupid","stupid","smart","stupid","imbecile","smart","smart","imbecile"))

levels(f) #possible values an element of f can have

summary(f) #provides the number of times each level occurs

Class Exercise: Compare summary(n), summary(b), and summary(f) – note the difference in output!
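For reference, this is what the factor f looks like from the inside. The integer codes returned by as.numeric() are what later slides rely on when coloring points with col=as.numeric(...). A minimal sketch:

```r
f <- as.factor(c("stupid", "stupid", "smart", "stupid",
                 "imbecile", "smart", "smart", "imbecile"))
levels(f)       # "imbecile" "smart" "stupid" (sorted alphabetically)
summary(f)      # counts per level: imbecile 2, smart 3, stupid 3
as.numeric(f)   # internal integer codes: 3 3 2 3 1 2 2 1
```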

Page 29:

The data.frame Class (We also love data.frames!)

• A data.frame is simply a table

• Each column may be of a different class (i.e. one column may be numeric, another may be a character, a third may be boolean and a fourth may be a factor)

• All values in a given column must be of the same class

• The number of rows in each column must be identical.

age gender disease
50 M TRUE
43 M FALSE
25 F TRUE
18 M TRUE
72 F FALSE
65 M FALSE
45 F TRUE
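The example table can be typed directly into R. In this sketch the name patients is made up for illustration; each column is a vector of a single class, and all columns have the same length:

```r
patients <- data.frame(
  age     = c(50, 43, 25, 18, 72, 65, 45),                   # numeric
  gender  = factor(c("M", "M", "F", "M", "F", "M", "F")),    # factor
  disease = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)   # boolean
)
dim(patients)             # 7 rows, 3 columns
sapply(patients, class)   # class of each column: numeric, factor, logical
```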

Page 30:

Iris database

Petal (Hebrew: עלה כותרת)

Sepal (Hebrew: עלה גביע)

Page 31:

The iris dataset

Page 32:

The fascinating questions

• What are typical lengths and widths of sepals and petals?

• Do these change from one species of iris to another?

• Do longer petals tend to be wider?

• Do longer petals tend to correlate with longer (or wider) sepals?

• Do such correlations change from one species of iris to another?

Page 33:

Playing with data frames - I

1. Set the working directory to the directory you're working in:

setwd("F:/presentations/R presentation")

(Note: getwd() tells you which directory you're in)

2. Load the table you want to work with (make sure you saved it as tab delimited file!):

ir=read.table(file="iris_dataset.txt",sep="\t",header=T) #loads iris_dataset.txt into variable "ir". Assumes that the file is tab delimited, and that the first line is a header.

Page 34:

Playing with data frames - II

class(ir) #shows the class of ir

dim(ir) #returns the number of rows and columns in ir

ir[1,2] #first row, second column in ir

ir[1,] #all columns in first row of ir

ir[,1] #all rows in first column of ir

ir$seplen #same as above

ir[,"seplen"] #same as above

ir[,c("seplen","sepwid")] #first two columns of ir

ir[,1:2] #same as above

summary(ir) #each of the columns is summarized according to its class

Page 35:

Playing with data frames - III

ir$seplen>6 #returns a boolean vector with TRUE and FALSE values depending on whether seplen is greater than 6

ir[ir$seplen>6,] #returns a subset of ir containing all columns of all rows in which seplen is greater than 6

ir[ir$seplen>6,c("seplen","sepwid")] #returns same rows as above, but only "seplen" and "sepwid" columns

ir[ir$seplen>6 & ir$sepwid >3,c("seplen","sepwid")] #returns same columns as above, but only rows in which seplen is greater than 6 and sepwid is greater than 3
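If iris_dataset.txt is not at hand, the subsetting above can still be tried. The sketch below assumes that the workshop file matches R's built-in iris data and simply renames its columns to the short names used on these slides; that equivalence is an assumption, not something the slides state:

```r
# Assumption: rebuild "ir" from the built-in iris data with the slides' column names
ir <- iris
names(ir) <- c("seplen", "sepwid", "petlen", "petwid", "species")

long <- ir[ir$seplen > 6, ]                                       # all columns, long sepals only
both <- ir[ir$seplen > 6 & ir$sepwid > 3, c("seplen", "sepwid")]  # two conditions, two columns
nrow(ir)     # 150
nrow(long)   # same as sum(ir$seplen > 6)
```

The row condition goes before the comma and the column selection after it; leaving either side empty means "all".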

Page 36:

Visualization

hist(ir$seplen) #histogram of seplen

[Figure: histogram of ir$seplen; x axis: ir$seplen; y axis: Frequency]

Page 37:

Visualization - II

hist(ir$seplen,30) #histogram of seplen with roughly 30 bins

[Figure: histogram of ir$seplen with narrower bins; x axis: ir$seplen; y axis: Frequency]

Page 38:

Visualization - III

mean_seplen=mean(ir$seplen)

hist(ir$seplen,20,col="light blue",main="Distribution of Sepal lengths",xlab="Lengths of sepal (cm)",sub=paste("Mean sepal length is",mean_seplen))

[Figure: light blue histogram titled "Distribution of Sepal lengths"; x axis: Lengths of sepal (cm); y axis: Frequency; subtitle: Mean sepal length is 5.84333333333333]

Page 39:

The tapply() function

Suppose you want to obtain the average age of patients (a numeric variable) as a function of their gender (a factor variable). And suppose the data is stored in the data frame data. The magic spell is:

tapply(data$age,data$gender,mean)

The tapply function receives three arguments:

- A numeric vector

- A factor variable, dividing the numeric vector into groups

- A function (mean,min,max,sd,sum)

age gender disease
50 M TRUE
43 M FALSE
25 F TRUE
18 M TRUE
72 F FALSE
65 M FALSE
45 F TRUE
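Applied to the small table above, the spell can be checked by hand. A minimal sketch, with the ages and genders typed in from the table (means is a made-up variable name):

```r
data <- data.frame(age    = c(50, 43, 25, 18, 72, 65, 45),
                   gender = factor(c("M", "M", "F", "M", "F", "M", "F")))

means <- tapply(data$age, data$gender, mean)
means                                  # F: 142/3 (about 47.3), M: 176/4 = 44
tapply(data$age, data$gender, max)     # F: 72, M: 65
```

The result is named by the levels of the factor, so means["M"] picks out the value for men.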

Page 40:

Visualization - IV

mean_per_species=tapply(ir$seplen,ir$species,mean) #calculates the mean value of ir$seplen after dividing it into three groups based on ir$species

barplot(mean_per_species,col="red")

[Figure: bar plot of mean sepal length for setosa, versicolor and virginica, bars in red]

Page 41:

Adding packages

Page 42:

Select mirror

Page 43:

Select library

Page 44:

Class exercise

Install the following three libraries:

gplots, lattice, car

These libraries will be used in subsequent examples.

Page 45:

Visualization - V

sd_per_species=tapply(ir$seplen,ir$species,sd) #calculate standard deviation

library(gplots) #loads all functions in gplots into workspace (including the barplot2 function)

barplot2(mean_per_species, plot.ci = T, ci.l = mean_per_species-sd_per_species, ci.u = mean_per_species+sd_per_species,col="red",ylab="Mean sepal lengths")

[Figure: bar plot of mean sepal length per species (setosa, versicolor, virginica) with error bars of one standard deviation; y axis: Mean sepal lengths]

Page 46:

Visualization - VI

library(gplots)

plotmeans(ir$seplen~ir$species,xlab="species",ylab="Sepal length")

[Figure: mean sepal length per species with error bars; x axis: species (setosa, versicolor, virginica; n=50 each); y axis: Sepal length]

Page 47:

Looking at correlations

plot(ir$petlen,ir$petwid) #plotting one set of numbers as a function of another

[Figure: scatter plot; x axis: ir$petlen; y axis: ir$petwid]

Page 48:

Arguments of the plot function

Some arguments of the plot() function (get more by typing "?plot.default"):

x – x values (defaults to 1:number of points)

y – y values (the distribution)

type – type of plot: can be "l" (line), "p" (points) or more

pch – plotting symbol (e.g. 19 for a solid circle)

col – color (either numbers or names of colors); can receive multiple colors

lwd – line width

lty – line type

xlab, ylab – X and Y labels

main, sub – main title (top of chart) and subtitle (beneath the X label)

Page 49:

More sophisticated plotting

plot(ir$petlen,ir$petwid,col=as.numeric(ir$species),pch=19,xlab="Petal length",ylab="Petal width")

[Figure: scatter plot of petal width against petal length, points colored by species]

Page 50:

And more sophisticated plot, with legend and P values

stat=cor.test(ir$petlen,ir$petwid)

rval=stat$estimate

pval=stat$p.value

plot(ir$petlen,ir$petwid,col=as.numeric(ir$species),pch=19,xlab="Petal length",ylab="Petal width",main=paste("R=",rval," ; P=",pval,sep=""))

legend(x="topleft",legend=levels(ir$species),col=1:3,lty=1,lwd=2) #adding a legend

[Figure: the same scatter plot with the title "R=0.962865431402796 ; P=0" and a legend listing setosa, versicolor and virginica]
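The title in the figure carries fifteen digits of R. One possible way to tidy it up, sketched here on the built-in iris data (assumed to match the workshop file), uses signif() and format.pval() from base R before pasting:

```r
stat <- cor.test(iris$Petal.Length, iris$Petal.Width)   # Pearson correlation by default
rval <- signif(stat$estimate, 3)                        # 0.963: three significant digits
pval <- format.pval(stat$p.value, digits = 2)           # compact text version of the P value
title_txt <- paste("R=", rval, " ; P=", pval, sep = "")
title_txt                                               # can be passed to plot() as main=
```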

Page 51:

Plotting correlations as a function of a third factor variable

library("lattice")

xyplot(ir$seplen ~ ir$sepwid | ir$species)

[Figure: three scatter-plot panels (setosa, versicolor, virginica); x axis: ir$sepwid; y axis: ir$seplen]

Page 52:

Looking at everything as a function of everything else

pairs(ir[,1:4])

pairs(ir[,1:4],col=ir$species,upper.panel=NULL)

[Figure: matrix of pairwise scatter plots (lower triangle only) for Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, points colored by species]

Page 53:

Even more sophisticated…

library(car)

scatterplot.matrix(ir[,1:4],groups=ir$species,ellipse=T,levels=0.95,upper.panel=NULL, smooth=F) #in current versions of car this function is named scatterplotMatrix

[Figure: scatter-plot matrix for Sepal.Length, Sepal.Width, Petal.Length and Petal.Width with one ellipse per species and a legend listing setosa, versicolor and virginica]

Page 54:

And more (for the highly motivated or extremely bored…)

upperpanel.cor <- function(x, y, method="pearson", digits=2, ...) {
  points(x, y, type="n") #set up the panel without drawing the points
  usr <- par("usr"); on.exit(par(usr=usr)) #restore the coordinate system on exit
  par(usr = c(0, 1, 0, 1)) #use 0-1 coordinates for placing the text
  correl <- cor.test(x, y, method=method)
  r <- correl$estimate
  pval <- correl$p.value
  color <- "black"
  if (pval<0.05) color <- "blue" #significant correlations are written in blue
  txt <- paste("r=", format(r,digits=digits), "\npval=", format(pval,digits=digits), sep="")
  text(0.5, 0.5, txt, col=color)
}

scatterplot.matrix(ir[,1:4],groups=ir$species,ellipse=T,levels=0.95,upper.panel=upperpanel.cor,cex=0.3,smooth=F,main="This is cool!!!")

Page 55:

Final output

[Figure: scatter-plot matrix titled "This is cool!!!", with per-species ellipses in the lower panels, a legend (setosa, versicolor, virginica), and the following correlations printed in the upper panels]

Sepal.Length vs Sepal.Width: r=-0.12, pval=0.15
Sepal.Length vs Petal.Length: r=0.87, pval=0
Sepal.Length vs Petal.Width: r=0.82, pval=0
Sepal.Width vs Petal.Length: r=-0.43, pval=4.5e-08
Sepal.Width vs Petal.Width: r=-0.37, pval=4.1e-06
Petal.Length vs Petal.Width: r=0.96, pval=0

Page 56:

Saving Graphics to Files

• Before running the visualization function, redirect all plots to a file of a certain type. Possibilities:

– jpeg(filename)

– png(filename)

– pdf(filename)

– postscript(filename)

• After running the visualization function, close the graphics device using dev.off()

Page 57:

Saving graphics

Example:

pdf("F:/test.pdf")

barplot(1:10,col="red")

dev.off()

Note: The different graphics device functions can also receive arguments for the width and height of the canvas. Use "?" + function name (e.g. ?jpeg) to see the arguments.
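A version of the example that runs on any machine, with an explicit canvas size. This is a sketch: the output path is a hypothetical temporary file rather than F:/test.pdf, and for pdf() the width and height are given in inches:

```r
out <- file.path(tempdir(), "test.pdf")   # hypothetical output path
pdf(out, width = 7, height = 5)           # open the file as the current graphics device
barplot(1:10, col = "red")                # drawn into the file, not on screen
dev.off()                                 # close the device so the file is complete
file.exists(out)                          # TRUE
```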

Page 58:

Statistics

• t.test #Student's t test

• wilcox.test #Mann-Whitney test

• kruskal.test #Kruskal-Wallis rank sum test

• chisq.test #chi squared test

• cor.test #Pearson/Spearman correlations

• lm(),glm() #linear and generalized linear models

• p.adjust #adjustment of P values for multiple testing using FDR, Bonferroni, or whatnot…
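Two of these functions are used heavily in the Down syndrome example that follows, so here is a small sketch of each. It uses the built-in iris data for the t test and four made-up P values for the adjustment:

```r
setosa     <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]
res <- t.test(setosa, versicolor)    # two-sample t test (Welch version by default)
res$p.value                          # the P value alone, as used later on

p <- c(0.01, 0.02, 0.03, 0.04)       # made-up P values
p.adjust(p, method = "bonferroni")   # 0.04 0.08 0.12 0.16: each multiplied by 4
p.adjust(p, method = "fdr")          # 0.04 0.04 0.04 0.04
```

Most test functions return a list-like object; the pieces (p.value, estimate, statistic) are reached with $.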

Page 59:

Down Syndrome

Page 60:

The fascinating research question

Do genes from any particular chromosome alter their expression levels in Down syndrome?

Page 61:

GEO database: A paradise of numbers

Page 62:

Getting the data!

Page 63:

Loading the data (look at it first!)

setwd("F:/Presentations/R presentation/") #sets the work directory

a=read.table(file="GSE5390_series_matrix.txt",sep="\t",header=T,comment.char="!") #loads the gene expression values and stores them in a

names(a)=c("id","down1","down2","down3","down4","down5","down6","down7","healthy1","healthy2","healthy3","healthy4","healthy5","healthy6","healthy7","healthy8") #give informative names to columns in a

Page 64:

The merge() function

a=

name age gender disease
inky 50 M TRUE
dinky 43 M FALSE
Jose 25 F TRUE
Oleg 18 M TRUE
Jesus 72 F FALSE
Cleopatra 65 M FALSE
Stanislav 45 F TRUE

b=

name IQ
dinky 43
Cleopatra 750
Stanislav 565

merge(a,b,by="name") OR merge(a,b,by.x="name",by.y="name")

name age gender disease IQ
dinky 43 M FALSE 43
Cleopatra 65 M FALSE 750
Stanislav 45 F TRUE 565
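The two tables are small enough to type in and merge for real. A minimal sketch (only the name, age and IQ columns are reproduced; m and m_all are made-up names, and all.x is an additional merge() argument not shown on the slide):

```r
a <- data.frame(name = c("inky", "dinky", "Jose", "Oleg", "Jesus", "Cleopatra", "Stanislav"),
                age  = c(50, 43, 25, 18, 72, 65, 45))
b <- data.frame(name = c("dinky", "Cleopatra", "Stanislav"),
                IQ   = c(43, 750, 565))

m <- merge(a, b, by = "name")                       # only names present in both tables
nrow(m)                                             # 3
m_all <- merge(a, b, by = "name", all.x = TRUE)     # keep every row of a
sum(is.na(m_all$IQ))                                # 4: rows of a with no match in b
```

Note that merge() may reorder the rows by the key column, so the output is not guaranteed to follow the order of either input.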

Page 65:

Merging data

convert=read.table(file="convert_affyprobes_2_chromosome_location_from_UCSC.txt",sep="\t",header=T)

b=merge(a,convert,by="id") #merges a and convert by the columns indicated by the by arguments. In other words, the column "id" in "a" is compared to the column "id" in "convert". Only lines in which the two values are identical are retained, yielding a new data frame with shared values & shared information.

Page 66:

Assign informative names

downcols=2:8

healthycols=9:16

allarraycols=c(downcols,healthycols)

Page 67:

Calculate Fold Change between disease and healthy

Step 1: calculate mean expression values for all patients with Down syndrome

b$meandown=apply(b[,downcols],1,mean)

Step 2: calculate mean expression values for all healthy subjects

b$meanhealthy=apply(b[,healthycols],1,mean)

Step 3: calculate the difference between the two (since the data is log transformed, a difference of logs corresponds to a ratio)

b$dif=b$meandown-b$meanhealthy

Step 4: anti-log the difference to obtain the fold change

b$foldchange=2^b$dif
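Why subtracting and then anti-logging gives a fold change: on a log2 scale, 2^(x - y) equals 2^x / 2^y. A toy sketch with made-up numbers for a single gene:

```r
meandown    <- 9    # made-up mean log2 expression in Down syndrome samples
meanhealthy <- 8    # made-up mean log2 expression in healthy samples
dif <- meandown - meanhealthy

2^dif                        # 2: expression is doubled
2^meandown / 2^meanhealthy   # 2: the same ratio computed on the original scale
2^(-1)                       # 0.5: a difference of -1 means expression is halved
```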

Page 68:

Calculate P values

Step 1: Create a function which receives a row as input, knows how to break it up into disease and control groups, and yields a P value

GetPval=function(line) {
  ttest=t.test(line[downcols-1],line[healthycols-1]) #indices shift by one because the "id" column is not passed in
  ttest$p.value
}

Step 2: Apply this function to all rows of the data frame

b$pval=apply(b[,allarraycols],1,GetPval)

Step 3: Adjust P values for multiple testing

b$adjustedPval=p.adjust(b$pval,method="fdr")
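The three steps can be rehearsed without the GEO file. The sketch below uses simulated numbers (20 made-up genes, 7 + 8 arrays) in place of the real expression table; expr, group1 and group2 are hypothetical names, and one gene is shifted on purpose so that there is something to find:

```r
set.seed(1)                                           # reproducible made-up data
expr <- matrix(rnorm(20 * 15, mean = 8), nrow = 20)   # 20 genes (rows) x 15 arrays (columns)
expr[1, 1:7] <- expr[1, 1:7] + 5                      # gene 1: raised in the first group
group1 <- 1:7                                         # columns of the first group
group2 <- 8:15                                        # columns of the second group

GetPval <- function(line) {
  t.test(line[group1], line[group2])$p.value          # one t test per row
}
pval <- apply(expr, 1, GetPval)                       # "1" means: apply over rows
adjusted <- p.adjust(pval, method = "fdr")
which.min(pval)                                       # row with the smallest P value
```

The adjusted values are never smaller than the raw ones, which is why fewer genes pass a cutoff after adjustment.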

Page 69:

Saving data frames to a file

write.table(b,file="DownWithPvals.txt",sep="\t",row.names=F,col.names=T) #generates a tab-delimited file with column names, without row names containing the data in the data frame b

Page 70:

Finding significant events

sigs=b[b$foldchange>1.75 & b$adjustedPval<0.01,] #finding events with a large fold change (as written, over-expression only) and significant adjusted P values

sigs=sigs[order(sigs$adjustedPval,decreasing=T),] #sorting table by adjusted P value, largest first (use decreasing=F to list the most significant events first)

Page 71:

Finding and plotting % significantly over/under expressed genes per chromosome

percentages=summary(sigs$chr)*100/summary(b$chr) #divides the number of times each chromosome appears in "sigs" by the number of times it appears in the original data (chr must be a factor; since R 4.0 read.table needs stringsAsFactors=T for that)

barplot(percentages,las=3,col="light blue",ylab="% significant genes",main="To R is divine!") #barplot depicting the percentage of genes from each chromosome within sigs
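The division works element by element because both summaries are counts over the same factor levels, in the same order. A toy sketch with made-up chromosome labels (all_chr and sig_chr are hypothetical stand-ins for b$chr and sigs$chr):

```r
# Made-up chromosome labels for all genes and for the significant subset
all_chr <- factor(c("chr1", "chr1", "chr1", "chr1", "chr2", "chr2", "chr21", "chr21"))
sig_chr <- factor(c("chr1", "chr21"), levels = levels(all_chr))   # same levels as all_chr

summary(sig_chr)                                  # chr1 1, chr2 0, chr21 1
percentages <- summary(sig_chr) * 100 / summary(all_chr)
percentages                                       # chr1 25, chr2 0, chr21 50
```

A subset of a data frame keeps the factor levels of the original, so levels with no significant genes still appear, with a count of 0.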

Page 72:

[Figure: bar plot titled "R - for a better tomoRRow!"; y axis: % significant genes; x axis: every chromosome label in the data, including unplaced and haplotype entries such as chr1_random, chr5_h2_hap1, chr6_cox_hap1 and chr6_qbl_hap2]

Page 73:

Even better plot…

validchrs=c(paste("chr",1:22,sep=""),"chrX","chrY")

percentages=percentages[validchrs]

barplot(percentages,las=3,col="light blue",ylab="% significant genes",main="R - for a better tomoRRow!")


Results…

[Figure: bar plot titled "To R is divine!"; y-axis "% significant genes", ticks 0 to 5; x-axis chr1 to chr22, chrX, chrY]


Volcano plots: P values as a function of fold change

plot(log2(b$foldchange),-log2(b$pval),col=(b$chr=="chr21")+1,pch=19,xlab="log fold-change",ylab="-log P value")

legend(x="topleft",legend=c("non chr 21","chr 21"),lty=1,col=1:2,lwd=3)

abline(h=-log2(0.001),col="blue",lty=3)

abline(v=c(log2(1.75),-log2(1.75)),col="blue",lty=3)

text(2,17,"Significantly\nOver-represented",col="blue")

text(-1.4,17,"Significantly\nUnder-represented",col="blue")

abline() function: adds horizontal or vertical line/s to a plot, depending on whether the "h" or "v" argument is populated (it can do more sophisticated things as well)

text() function: receives x,y coordinates on the plot, as well as the text to draw there
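A self-contained sketch of abline() and text() on simulated data, for trying the calls without the microarray table. The vectors `lfc` and `pval` are hypothetical random numbers, not real measurements, and the cut-off positions are arbitrary.

```r
# Hypothetical simulated data; no real measurements are used here
set.seed(1)
lfc  <- rnorm(200)            # simulated log2 fold changes
pval <- runif(200)            # simulated P values

pdf(NULL)                     # null device so the sketch runs without a display; omit interactively
plot(lfc, -log2(pval), pch = 19, xlab = "log2 fold-change", ylab = "-log2 P value")

abline(h = -log2(0.001), col = "blue", lty = 3)   # one horizontal line
abline(v = c(-1, 1), col = "blue", lty = 3)       # two vertical lines from a single call
text(2, 8, "two\nlines", col = "blue")            # "\n" starts a new line in the label
dev.off()
```

Passing a vector to "h" or "v" draws one line per value, which is why a single abline() call is enough for the two symmetric fold-change cut-offs.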


Volcano plot


A particular R strength: genetics

• Bioconductor is a suite of additional functions and some 200 packages dedicated to analysis, visualization, and management of genetic data

• Much more functionality than the software released by Affymetrix (Affy) or Illumina


Where R?


R homepage: http://www.r-project.org/


Choose server…


Click on “Windows”


Click “base”


Click on “Download” link and follow installation guidelines…


There you R!


Installing Tinn-R

• Go to: http://www.sciviews.org/Tinn-R/

• Scroll to bottom of page


Loading R from within Tinn-R


Final Tips

• Use http://www.rseek.org/ & Google for finding help on what you want

• Know your objects’ classes: class(x)

• Know your functions’ arguments. Use "?function_name" to learn what arguments a function receives & what its return values are.

• Each help file provides examples, which can be copy-pasted into R as is. Extremely useful!

• MOST IMPORTANT - the more time you spend using R, the more comfortable you become with it. DESPAIR NOT – and you will never look back!
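The tips on classes and arguments can be tried out as follows. This is only a small sketch; the objects `x`, `m` and `tab` are made up for illustration.

```r
x <- c(1.5, 2, 3)
class(x)                      # "numeric"

m <- matrix(1:6, nrow = 2)
class(m)                      # "matrix" (R 4.0.0 and later also report "array")

tab <- data.frame(a = 1:2, b = c("u", "v"))
class(tab)                    # "data.frame"

args(order)                   # prints the argument list of a function
# ?order                      # opens the help page (interactive sessions)
# example(order)              # runs the examples from the help page
```

example() is a shortcut for the copy-paste tip: it runs the examples section of a help page directly.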


Final Words of Warning

• “Using R is a bit akin to smoking. The beginning is difficult, one may get headaches and even gag the first few times. But in the long run, it becomes pleasurable and even addictive. Yet, deep down, for those willing to be honest, there is something not fully healthy in it.” --Francois Pinard



Thank you!

May the R be with you!


Quick hands-on

• Generate a numeric vector called a containing the numbers 1, 3, 4, 5, 9.

• Calculate the square root (sqrt) of the values in a.

• Create a barplot displaying a.

• Show a as a regular plot, showing the values in red.

• Label the x-axis of the plot "R is gReat", and the y-axis "I love R".
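One possible way through the exercise, offered as a sketch rather than the official solution. It assumes the axis labels belong on the regular plot; try it yourself first.

```r
a <- c(1, 3, 4, 5, 9)         # numeric vector
sqrt(a)                       # element-wise square root

pdf(NULL)                     # null device so this runs without a display; omit interactively
barplot(a)
plot(a, col = "red", xlab = "R is gReat", ylab = "I love R")
dev.off()
```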


Hands-On - II

Based on the Down syndrome microarrays:

- Find the 10 genes showing the highest differences between healthy and sick.

- Create bar plots showing the average values in sick, and in healthy for those ten genes.

- For true geeks: Add error bars to the graph.
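A hedged sketch of one approach, run on simulated numbers because the workshop data is not reproduced here. The matrix `expr`, the column positions in `healthy` and `sick`, and the choice of "largest absolute difference between group means" as the ranking are all assumptions. Error bars are drawn with arrows(), a common base-graphics idiom.

```r
set.seed(42)
# Hypothetical stand-in for the microarray table: 50 genes x 6 samples
expr <- matrix(rnorm(300, mean = 8), nrow = 50,
               dimnames = list(paste("gene", 1:50, sep = ""), NULL))
healthy <- 1:3                # assumed column positions of healthy samples
sick    <- 4:6                # assumed column positions of sick samples

mean_h <- rowMeans(expr[, healthy])
mean_s <- rowMeans(expr[, sick])

# Ten genes with the largest absolute difference between group means
top <- order(abs(mean_s - mean_h), decreasing = TRUE)[1:10]

means <- rbind(healthy = mean_h[top], sick = mean_s[top])
sds   <- rbind(apply(expr[top, healthy], 1, sd),
               apply(expr[top, sick], 1, sd))

pdf(NULL)                     # null device; omit when working interactively
# barplot() returns the x positions of the bar midpoints
mids <- barplot(means, beside = TRUE, las = 3, col = c("light blue", "red"),
                ylim = c(0, max(means + sds)), legend.text = TRUE)
# arrows() with angle = 90 and code = 3 draws flat caps at both ends
arrows(mids, means - sds, mids, means + sds, angle = 90, code = 3, length = 0.03)
dev.off()
```

The error bars here show one standard deviation on either side of the mean; standard errors would work the same way with `sds / sqrt(3)`.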


Todo

• multiple panels

• lists, loops, lapply, sapply

• regular expressions