Introduction to R, Aedín Culhane, aedin@jimmy.harvard.edu


Page 1: Introduction to R Aedín Culhane aedin@jimmy.harvard.edu aedin

Introduction to R

Aedín Culhane, aedin@jimmy.harvard.edu

http://bcb.dfci.harvard.edu/~aedin

http://www.hsph.harvard.edu/research/aedin-culhane/


Jan 2009: Data Analysts Captivated by R’s Power

“R is really important to the point that it’s hard to overvalue it,” said Daryl Pregibon, a research scientist at Google, which uses the software widely. “It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems.”

Nov 10 2010: Names You Need to Know in 2011: R Data Analysis Software

“R is rapidly augmenting or replacing other statistical analysis packages at universities”


▫Open source development: flexible, extensible
▫Large number of statistical and numerical methods
▫High-quality visualization and graphical tools
▫Extended by a very large collection of rapidly developing packages


R

•Why is it called R?
▫The name is partly based on the (first) names of the first two R authors and partly a play on the name of the Bell Labs language ‘S’
▫Initially written by Robert Gentleman & Ross Ihaka, Dept of Statistics, University of Auckland, New Zealand (1996)


Short R History

1991: Ross Ihaka and Robert Gentleman begin work on a project that will become R
1993: The first announcement of R
1995: R available by ftp
1996: A mailing list is started and maintained by Martin Maechler at ETH
1997: The R core group is formed
2000: R 1.0.0 is released


Short R History Continued

2001: Bioconductor for the analysis and comprehension of genomic data using R
2008: The Omegahat project to enable connectivity between R and other languages
2010: A former co-founder and employees of SPSS found Revolution Analytics, a company which offers a commercial package around R
2011: The RStudio project provides a free, open-source integrated development environment (IDE) for R


R

R project (v2.15, April 2012)

pre v2.15 biannual release (April, October)

post v2.15 annual release (April)

Download core and contributed packages from CRAN

Link: R Task Views


R Interface

•Default R interface
•RStudio
▫www.rstudio.org
▫Cross-platform: Windows/Mac/Linux
•Others
▫Notepad++, TinnR, RCMDR, etc.


RStudio

•4 windows
-Editor, Console, History, Files/plots
•Code completion
•Easy access to help (F1)
•One-step Sweave pdf generation
•Searchable history
•Keyboard shortcuts
▫http://www.rstudio.org/docs/using/keyboard_shortcuts


Starting with R

• The R environment is controlled by hidden files in the startup directory: .RData, .Rhistory and .Rprofile (optional). These are very useful.
• History (.Rhistory) means you can automatically save all commands you type
• .RData saves everything in memory (can be large, so be careful)
• Best to rename these using
▫ save.image(file="S01_GeneProjectMay2012.RData")
▫ save(myVec, file="S01_GeneProjectMay2012.RData")
▫ savehistory(file="S01_GeneProjectMay2012.Rhistory")
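As a minimal sketch of the renaming advice above, the following saves the workspace and a single object under explicit file names, then restores the object. The name myVec comes from the slide; the temporary directory, the second file name, and the values in myVec are assumptions made only so the example does not write into a real project folder.

```r
# Illustrative only: write to a temporary directory, not a real project folder
outDir    <- tempdir()
rdataFile <- file.path(outDir, "S01_GeneProjectMay2012.RData")
vecFile   <- file.path(outDir, "S01_myVec.RData")   # hypothetical file name

myVec <- c(2.5, 3.1, 4.8)   # made-up values

save.image(file = rdataFile)   # every object in the workspace
save(myVec, file = vecFile)    # only the named object

# savehistory() needs an interactive session with history support,
# so it is guarded here
if (interactive()) {
  try(savehistory(file = file.path(outDir, "S01_GeneProjectMay2012.Rhistory")))
}

# Remove the object, then restore it from its file
rm(myVec)
loaded <- load(vecFile)   # load() returns the names of the restored objects
```

Saving one object with save() keeps the file small, whereas save.image() grows with everything held in memory.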


Tips for project management

•Save commands to a script, myscript.R

## In R
source("myscript.R")

## Or from the command line
R CMD BATCH myscript.R

•Save scripts as S01_xxxDate.R, S02_xxxDate.R, etc., where xxx is the project name

•Use folders or Projects in RStudio
getwd()
setwd()
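The tips above can be sketched as follows, assuming a hypothetical project folder created under the temporary directory and a two-line script written on the fly. Only source(), getwd() and setwd() come from the slide; the folder name and script contents are illustrative.

```r
# Illustrative only: create a small script in a temporary project folder
projDir <- file.path(tempdir(), "myproject")   # hypothetical project folder
dir.create(projDir, showWarnings = FALSE)
scriptFile <- file.path(projDir, "myscript.R")
writeLines(c("x <- 1:10", "xMean <- mean(x)"), scriptFile)

oldDir <- getwd()      # remember where we started
setwd(projDir)         # make the project folder the working directory
source("myscript.R")   # run every command in the script
setwd(oldDir)          # return to the original directory
```

After source() finishes, the objects created by the script (here x and xMean) are available in the workspace.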


Overview of Bioconductor

Aedín Culhane, aedin@jimmy.harvard.edu

http://bcb.dfci.harvard.edu/~aedin

http://www.hsph.harvard.edu/research/aedin-culhane


Bioconductor

Release coincides with R release.

Current: Bioconductor 2.10 (release coincides with R 2.15)

To install, use the script on the Bioconductor website:

source("http://www.bioconductor.org/biocLite.R")

biocLite()
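The biocLite script above matches Bioconductor 2.10. On current Bioconductor releases it has been replaced by the BiocManager package from CRAN. The sketch below assumes a hypothetical helper, missing_pkgs(), and uses two package names mentioned elsewhere in these slides as examples. It checks what is already installed and attempts installation only in an interactive session, since that step needs network access.

```r
# Hypothetical helper: return the packages from 'pkgs' that are not yet installed
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

wanted <- c("Biobase", "affy")   # example packages named in these slides
todo   <- missing_pkgs(wanted)

# Installation needs network access, so it is only attempted interactively
if (length(todo) > 0 && interactive()) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(todo)
}
```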


What Packages do I need?

Specific to your data and analysis pipeline, but for examples see:

•Bioconductor Workshops

•Bioconductor Workflows


Packages Overview

Bioconductor web site

• Bioconductor BiocViews Task view
▫Software
▫Annotation Data
▫Experimental Data


Main types of Annotation Packages

• Gene centric AnnotationDbi packages:
▫ Organism: org.Mm.eg.db
▫ Technology/Platform: hgu133plus2.db
▫ Gene sets and pathways (biology level): GO.db or KEGG.db
▫ .db packages can be queried with SQL or accessed using annotation package functions (toTable, get, mget)
• Genome centric GenomicFeatures packages:
▫ Transcriptome level: TxDb.Hsapiens.UCSC.hg19.knownGene
▫ Generic features: can be generated via GenomicFeatures
• biomaRt:
▫ Query web-based `biomart' resources for genes, sequence, SNPs, etc.
• See http://www.bioconductor.org/help/course-materials/2011/BioC2011/LabStuff/AnnotationSlidesBioc2011.pdf


Bioconductor resources

• Mailing list (sign up for the daily digest)
• Documentation, workshop/course material online
▫Slides from talks, pdfs of tutorials, R code
• Help available for each software package
▫Each package MUST contain a vignette (how-to)
• Other resources: www.Rseek.org, www.r-bloggers.com


Vignette

•Tutorials that provide a worked example of a package
•Required in Bioconductor packages
•Written in Sweave (Leisch, 2002)
▫LaTeX dynamic reports in which R code is embedded and executable
▫All R code in a vignette is checked (and executed) by R CMD check
▫http://www.bioconductor.org/docs/vignettes.html

library("Biobase")
library("GOstats") # Load package of interest
openVignette()
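Vignettes can also be listed and opened with base R alone, which helps when Biobase is not installed. The sketch below uses the grid package purely as an assumption, because it ships with a standard R installation; substitute the package you are interested in.

```r
# List the vignettes shipped with an installed package.
# "grid" is used only because it is part of a standard R installation.
v <- vignette(package = "grid")
vignetteTable <- v$results[, c("Item", "Title"), drop = FALSE]

# Opening a viewer or browser only makes sense in an interactive session
if (interactive()) {
  print(vignette("grid", package = "grid"))   # open one vignette
  print(browseVignettes(package = "grid"))    # browse all of them
}
```

vignetteTable has one row per vignette, with its short name (Item) and its title.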


Getting Data into R & Bioconductor

Aedín Culhane, aedin@jimmy.harvard.edu

http://www.hsph.harvard.edu/research/aedin-culhane/


Simple Excel Spreadsheet Data

• Simple table
▫read.table()
▫read.csv()
▫scan()
• However, many data types need more specialized readers. See Technologies on BiocViews.
▫http://www.bioconductor.org/packages/release/BiocViews.html
• Large data files. Also see http://www.revolutionanalytics.com
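The three simple-table readers listed above can be compared on a small file. This is a minimal sketch: the file is written to a temporary location, and the gene names and values are made-up illustrative data rather than anything from the slides.

```r
# Illustrative data: a tiny comma-separated table written to a temporary file
csvFile <- tempfile(fileext = ".csv")
writeLines(c("gene,control,treated",
             "BRCA1,5.2,7.9",
             "TP53,6.1,6.0",
             "EGFR,4.4,8.3"), csvFile)

# read.csv() assumes a header row and comma separators
expr <- read.csv(csvFile, row.names = 1)

# read.table() is the general form; header and separator are stated explicitly
expr2 <- read.table(csvFile, header = TRUE, sep = ",", row.names = 1)

# scan() returns a plain vector; here the header line is skipped
# and every field is read as text
fields <- scan(csvFile, what = "character", sep = ",", skip = 1, quiet = TRUE)
```

read.csv() and read.table() both return a data frame with genes as row names, while scan() leaves any reshaping to you.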


Some common data types

•Microarray
•SNP
•NGS

May 2011


A Microarray Overview


Reading Affymetrix Data

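library(affy)
require(affy) # Alternative

# Read all CEL files in a folder into an AffyBatch object
affybatch <- ReadAffy(celfile.path="[Location of your data]")

# Or read and RMA-normalize the CEL files in the working directory in one step
eSet <- justRMA()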


Sample R code


Other Arrays

•Illumina
▫lumi package
•2-color spotted arrays
▫limma package
•Other arrays
▫http://www.bioconductor.org/help/workflows/oligo-arrays/


Next Generation Sequencing Data


Public Microarray Data

ArrayExpress: 21,997 studies (622,617 profiles)

GEO: 22,735 studies (558,074 profiles)

Statistics as of May 2011


R Code


More on GEOquery

require(GEOquery)

Let's try to load the GDS810 dataset which contains data on Alzheimer's disease at various stages of severity.

GDS810 <- getGEO("GDS810")

The getGEO function returns an object of class GEOData. You can get a description of this class like this:

help("GEOData-class")

Meta(GDS810)
Columns(GDS810)
head(Table(GDS810))


Assessing Data Quality


ExpressionSet Class in R


R basics: Getting help

•To get help
▫?mean
▫help(mean)
•help.search("mean")
•apropos("mean")
•example(mean)
•http://www.bioconductor.org/help/
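A short sketch of how the help functions above differ in practice. The variable names are illustrative, and the calls that open a help viewer are guarded so the block also runs in a non-interactive session.

```r
# Objects whose names contain "mean" among the attached packages
hits <- apropos("mean")

# apropos() takes a regular expression: names starting with "mean"
startsWithMean <- apropos("^mean")

# The remaining calls open a help viewer or print to the console,
# so they are left for interactive use
if (interactive()) {
  print(help(mean))             # same as ?mean
  print(help.search("mean"))    # search all installed help pages
  example(mean)                 # run the examples from the help page
}
```

apropos() searches object names only, while help.search() searches the text of installed help pages, so the second is the one to use when you do not know what a function is called.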