JOINING THE DARK SIDE: HTS DATA ANALYSIS WITH R-BIOCONDUCTOR Pieta Schofield Barton Group Talk

Page 1:

JOINING THE DARK SIDE:

HTS DATA ANALYSIS WITH R-BIOCONDUCTOR

Pieta Schofield, Barton Group Talk

Page 2:

“aRrgh”: a bit of chatter on the mail list

• “Ignorance is bliss”
  • The less I have to know about R, the more blissful I will be

or

• “Knowledge is power”
  • The more I know R, the more powerful I find it is

Page 3:

WARNING!

This talk contains snippets of R code that may be distressing to those of a nervous disposition

Page 4:

R – Software environment for doing data analysis and statistics

• “It’s like marmite, you love it or hate it”
• … and like marmite, I have to confess I love it.

• Extensible
• Scriptable
• Terminal based
• Free
• Open source
• Multiplatform (multiprocessor)
• …

! Bioconductor !

…but then I also still love vim, so what do I know?

Page 5:

Bioconductor
• Constantly growing set of R packages
• Focused on biological data analysis
• Common installation method
• Relatively easy package management
• Attempts at some common coding, testing and documentation standards
• Common (reusable/reused) data structures

Page 6:

Over the years I have made a living writing code in:
• FORTRAN (IV, 77, 95)
• RPL-Filetab (RapidGen decision table language)
• APL (functional array processing language)
• Modula-2 (Pascal dialect)
• RPL-RS/1 (BBN Statistics Language)
• Basic / VisualBasic
• C / C++ / VisualC++
• Objective-C
• ActionScript
• Matlab
• Java
• Perl
• Python
• R

Learn a language's strengths and play to them

Page 7:

I am too old to keep swapping syntaxes!

I use R exclusively (with a bit of bash)
• Microarrays
  • affy: for Affymetrix gene-chips
  • limma: DE tool for microarrays
• “Great RNA-seq Experiment”
  • DE tools are mainly R-Bioconductor packages
  • edgeR, DESeq, BBSeq, SAMR, limma, BitSeq…
  • …too many to mention
• HTS data analysis
  • ChIP-seq, MNase-seq, SNP calling

Page 8:

What is available for HTS data analysis?

• Lots, and growing every day
• A few core packages for efficient storage and processing of sequence-based data and annotations
  • These are continually being refined and improved
  • Sometimes at the cost of backward compatibility
  • Best to keep R and Bioconductor packages up to date
• Integrated tools are being built on top of these core packages for specific analysis and visualisation

Page 9:

Which packages are worth the investment in learning?

Are any packages worth the investment of learning/switching to R?

Page 10:

Biostrings
• Set of classes for representing large biological sequences (DNA/RNA/amino acids)
  • Base class type
    • XString (BString) => DNAString, RNAString, AAString
  • Collection class
    • XStringSet…
  • Pairwise and multiple sequence alignments
• Set of methods for manipulation and computation on these classes
• Set of methods for sequence matching and pairwise alignments

Page 11:

    # Code to demonstrate the use of Biostrings

    library(Biostrings)

    dna <- DNAString("TCAACGTTGAATAGCGTACCG")
    #> dna
    #  21-letter "DNAString" instance
    #seq: TCAACGTTGAATAGCGTACCG

    # translate the whole sequence in reading frame 1
    aa <- AAString(translate(dna))
    #> aa
    #  7-letter "AAString" instance
    #seq: STLNSVP

    # translate the three forward reading frames; adj trims the 3' end
    # so that each frame has a length that is a multiple of three
    orfs <- AAStringSet(lapply(1:3,
      function(x){
        adj <- c(0,2,1)
        AAString(translate(dna[x:(length(dna)-adj[x])]))
      }
    ))
    #> orfs
    #  A AAStringSet instance of length 3
    #    width seq
    #[1]     7 STLNSVP
    #[2]     6 QR*IAY
    #[3]     6 NVE*RT

Page 12:

    library(Biostrings)

    dna <- DNAString("TCAACGTTGAATAGCGTACCGAACGTTGAATATCGTTGAATAG")
    p1 <- DNAString("AACGTT")

    dinucFreq <- dinucleotideFrequency(dna)
    #> dinucFreq
    #AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
    # 5  3  2  4  1  1  5  0  4  1  0  4  4  2  3  3

    trinucFreq <- trinucleotideFrequency(dna)
    #> trinucFreq
    #AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA CAC CAG CAT
    #  0   2   0   3   0   1   2   0   0   1   0   0   3   1   0   0   1   0   0   0
    #CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAC GAG GAT GCA GCC GCG GCT
    #  0   0   1   0   1   0   0   4   0   0   0   0   4   0   0   0   0   0   1   0
    #GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAG TAT TCA TCC TCG TCT TGA TGC TGG TGT
    #  0   0   0   0   1   0   0   3   0   1   2   1   1   0   1   0   3   0   0   0
    #TTA TTC TTG TTT
    #  0   0   3   0

    cpm <- countPattern(p1, dna)
    #> cpm
    #[1] 2

Page 13:

    # Read in a FASTA file and a GTF features file
    # Find chromosomes/contigs with features
    # Generate a FASTA file with only those chromosomes/contigs
    # containing features

    library(Biostrings)

    # read in FASTA file
    gg4 <- readDNAStringSet("~/data/ensembl/gg4/Gg4_73.fa")

    # read in GTF (no header row; skip "#" comment lines)
    gtf <- read.delim("~/data/ensembl/gg4/Galgal4.gtf", sep="\t",
                      header=FALSE, comment.char="#")

    # get chromosomes that exist in the GTF file
    # (unique() works whether V1 was read as character or factor)
    chrs <- unique(as.character(gtf$V1))

    # FASTA names look like "1 dna:chromosome ...";
    # keep only the text before the first space
    seq.ids <- sub(" .*", "", names(gg4))

    # subset the strings
    gg4.trim <- gg4[seq.ids %in% chrs]

    # write out new FASTA file
    writeXStringSet(gg4.trim,
                    filepath="~/data/ensembl/hg19_73/gg4_73_trimmed.fa",
                    format="fasta", width=256)

Page 14:

    mpm <- matchPattern(p1, dna)
    #> mpm
    #  Views on a 43-letter DNAString subject
    #subject: TCAACGTTGAATAGCGTACCGAACGTTGAATATCGTTGAATAG
    #views:
    #    start end width
    #[1]     3   8     6 [AACGTT]
    #[2]    22  27     6 [AACGTT]

    mpm1 <- matchPattern(p1, dna, max.mismatch = 1)
    #> mpm1
    #  Views on a 43-letter DNAString subject
    #subject: TCAACGTTGAATAGCGTACCGAACGTTGAATATCGTTGAATAG
    #views:
    #    start end width
    #[1]     3   8     6 [AACGTT]
    #[2]    22  27     6 [AACGTT]
    #[3]    32  37     6 [ATCGTT]

    ....

    dna <- DNAString("TCAACGTTGAAT")
    print(date())
    #[1] "Sun Nov 10 14:45:31 2013"
    v1 <- vmatchPattern(dna, gg4.trim)
    print(date())
    #[1] "Sun Nov 10 14:45:36 2013"

Page 15:

    > v1
    MIndex object of length 934
    $`1 dna:chromosome chromosome:Galgal4:1:1:195276750:1 REF`
    IRanges of length 4
            start       end width
    [1]  13647018  13647029    12
    [2]  56226448  56226459    12
    [3]  78853166  78853177    12
    [4] 143811701 143811712    12

    $`10 dna:chromosome chromosome:Galgal4:10:1:19911089:1 REF`
    IRanges of length 1
           start      end width
    [1] 14846274 14846285    12

    $`11 dna:chromosome chromosome:Galgal4:11:1:19401079:1 REF`
    IRanges of length 0

    ...
    <931 more elements>

    > v1[[1]]
    IRanges of length 4
            start       end width
    [1]  13647018  13647029    12
    [2]  56226448  56226459    12
    [3]  78853166  78853177    12
    [4] 143811701 143811712    12

Page 16:

Range Data
• Slight annoyance
  • Data structure proliferation: similar but different
  • Sometimes there are easy conversion methods
  • Sometimes there are not!
• GenomicRanges
  • GRanges & GRangesList
• IRanges uses RLE (run-length encoding)
• Run-length encoding

Raw: AAAACAAAAATTGTGGGG
RLE: A4C1A5T2G1T1G4
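The run-length encoding on this slide can be reproduced in a few lines. This is a minimal sketch using base R's `rle()` as a stand-in for the Bioconductor `Rle` class, so that it runs without any packages installed; the variable names are illustrative only.

```r
# Run-length encode the example sequence from the slide with base R's rle().
raw <- "AAAACAAAAATTGTGGGG"

# split the string into single characters and find runs of identical letters
bases <- strsplit(raw, "")[[1]]
runs <- rle(bases)

# paste each run's letter and length together: value1 length1 value2 length2 ...
encoded <- paste0(runs$values, runs$lengths, collapse = "")
print(encoded)

# inverse.rle() expands the runs again, so the encoding is lossless
decoded <- paste(inverse.rle(runs), collapse = "")
stopifnot(identical(decoded, raw))
```

The saving is small for this short string but grows with long runs of identical values, which is the case for per-base data such as coverage along a chromosome.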

Page 17:

GenomicRanges
• GRanges class
  • It is a named list of ranges.
  • Coordinate data: seqname, range, start, end, strand
  • Metadata: user-specified data fields
• Aligned read classes
  • Gapped aligned reads: GAlignments
  • Gapped aligned read pairs: GAlignmentPairs
• Importing reads from BAM files
  • Frontend to Rsamtools
  • Tools for iterative access to large files
• SummarizedExperiment
  • Managing a matrix of ranges and samples

Page 18:

    library(GenomicRanges)

    # ten ranges on three chromosomes, with three metadata columns
    gr <- GRanges(seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
                  ranges = IRanges(1:10, end = 7:16, names = head(letters, 10)),
                  strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
                  score = rnorm(10, 20, 2),
                  GC = seq(1, 0, length.out = 10),
                  group = c(rep("S", 4), rep("T", 2), rep("S", 4)))

    seqlengths(gr) <- c(chr1 = 249250621, chr2 = 243199373, chr3 = 198022430)
    gr

    GRanges with 10 ranges and 3 metadata columns:
        seqnames    ranges strand |     score        GC       group
           <Rle> <IRanges>  <Rle> | <numeric> <numeric> <character>
      a     chr1  [ 1,  7]      - |  17.78724 1.0000000           S
      b     chr2  [ 2,  8]      + |  24.25995 0.8888889           S
      c     chr2  [ 3,  9]      + |  19.32891 0.7777778           S
      d     chr2  [ 4, 10]      * |  20.88689 0.6666667           S
      e     chr1  [ 5, 11]      * |  21.50293 0.5555556           T
      f     chr1  [ 6, 12]      + |  18.47575 0.4444444           T
      g     chr3  [ 7, 13]      + |  21.90163 0.3333333           S
      h     chr3  [ 8, 14]      + |  23.35226 0.2222222           S
      i     chr3  [ 9, 15]      - |  19.72108 0.1111111           S
      j     chr3  [10, 16]      - |  18.54088 0.0000000           S
      ---
      seqlengths:
            chr1      chr2      chr3
       249250621 243199373 198022430

Page 19:

IRanges & GRanges methods
• Access
  • seqnames(), range(), strand(), mcols(), start(), end()
• Manipulate
  • split(), unlist(), c()
• Lots of range set operators
  • reduce(), disjoin(), shift(), flank(), union(), intersect(), setdiff(), gaps(), restrict()
• Find
  • findOverlaps(), nearest(), precede(), follow()
• Calculate
  • coverage()
  • summarizeOverlaps()

Page 20:

Page 21:

    # required packages
    library(GenomicRanges)
    library(Gviz)

    # make a GRanges object
    ducks <- GRanges(rep("chrW", 6),
                     IRanges(start = c(50, 180, 260, 800, 600, 1240),
                             width = c(15, 20, 40, 100, 500, 20)),
                     strand = rep("*", 6),
                     group = rep(c("Huey", "Dewey", "Louie"), c(1, 3, 2)))

    # make an annotation track of the object
    duckTrack <- AnnotationTrack(ducks,
                                 genome="gg4",
                                 name="Ducks")

    # make an annotation track of the reduced object
    duckRed <- AnnotationTrack(reduce(ducks),
                               genome="gg4",
                               name="Ducks Reduce")

    # save it as a pdf
    outFile <- "/homes/pschofield/scratch/NOBACK/ducks.pdf"
    pdf(outFile, width=7, height=2)
    plotTracks(list(duckTrack, duckRed), showId=TRUE)
    dev.off()
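To make clear what `reduce()` does to the duck ranges before they are plotted, here is a base-R sketch of the idea. It is not the IRanges implementation: `reduce_ranges()` is a hypothetical helper written for illustration, and it assumes all ranges sit on one sequence with strand ignored (true for the `chrW`, `"*"` ducks). Like `reduce()` with its default settings, it merges overlapping and directly adjacent ranges.

```r
# Base-R sketch of range reduction (hypothetical helper, not the IRanges code).
# Assumes a single sequence and ignores strand.
reduce_ranges <- function(start, end) {
  if (length(start) == 0) {
    return(data.frame(start = numeric(0), end = numeric(0), width = numeric(0)))
  }
  ord <- order(start, end)
  s <- start[ord]
  e <- end[ord]
  # furthest end seen so far; a range opens a new group when it starts
  # beyond that point (plus one, so directly adjacent ranges also merge)
  reach <- cummax(e)
  new_group <- c(TRUE, s[-1] > head(reach, -1) + 1)
  group <- cumsum(new_group)
  out_start <- as.vector(tapply(s, group, min))
  out_end <- as.vector(tapply(e, group, max))
  data.frame(start = out_start, end = out_end, width = out_end - out_start + 1)
}

# the same starts and widths as the ducks object
duck_start <- c(50, 180, 260, 800, 600, 1240)
duck_width <- c(15, 20, 40, 100, 500, 20)
duck_end <- duck_start + duck_width - 1

merged <- reduce_ranges(duck_start, duck_end)
print(merged)
```

The range starting at 800 lies entirely inside the one starting at 600, so the six input ranges collapse to five.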

Page 22:

Visualisation
• Gviz

“Gviz uses the biomaRt and the rtracklayer packages to perform live annotation queries to Ensembl and UCSC and translates this to e.g. gene/transcript structures in viewports of the grid graphics package. This results in genomic information plotted together with your data”

Programmatically produced “IGB style” plots

Page 23:

Page 24:

Page 25:

Page 26:

Many track classes
• AlignedReadTrack
• AnnotationTrack
• BiomartGeneRegionTrack
• DataTrack
• GeneRegionTrack
• GenomeAxisTrack
• IdeogramTrack
• NumericTrack
• RangeTrack
• ReferenceTrack
• SequenceTrack
• StackedTrack
• UcscTrack

Page 27:

• Fastq files
  • Quality Assessment
  • Alignment
• Feature Processing
  • Peak calling (ChIP-seq)
  • Read allocation (RNA-seq)
  • SNP calling
• Annotational Processing
  • Gene-set, functional analysis
  • Differential Expression analysis
  • SNP mapping

Page 28:

• Quality Assessment
  • ShortRead, htSeqTools
• Alignment
  • Rsubread, gmapR, GSNAP, Rbowtie, Biostrings
• Accessing Alignments
  • ShortRead, Rsamtools
• Annotation
  • BSgenome, annotate, biomaRt, topGO, GenomicFeatures…

“I’ve been vaguely aware of biomaRt for a few years. Inexplicably, I’ve only recently started to use it. It’s one of the most useful applications I’ve ever used.”
Neil Saunders, What You’re Doing Is Rather Desperate

Page 29:

• Differential Expression
  • edgeR, DESeq, limma, samr …
• ChIP-seq
  • ChIPpeakAnno, QuasR, PICS, nucleR
• Motif discovery
  • rGADEM, MotIV, seqLogo
• SNP
  • VariantAnnotation, snpStats
• Visualization
  • GenomeGraphs, ggbio, Gviz, rtracklayer, biovizBase

Page 30:

“Easy” Parallelisation
• Although their performance has improved, try to avoid for loops; R is designed for processing lists
  • lapply(), mapply() and relatives
  • do.call(), Map()
  • endoapply(), mendoapply() (IRanges)
• parallel package
  • mclapply(), parLapply(), mcMap()

The overheads of parallel processing will eventually limit speed-up (diminishing returns).
Some R libraries are already multithreaded and compiled using OpenMP; be aware.
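The step from `lapply()` to `mclapply()` is usually a one-line change, as the sketch below tries to show. The workload (GC fraction of a few short sequences) and the helper `gc_fraction()` are made up for illustration, and the core count of 2 is an assumption rather than a recommendation.

```r
library(parallel)

# Hypothetical toy workload: GC fraction of each sequence in a named list.
seqs <- list(a = "TCAACGTTGAATAGCGTACCG",
             b = "AACGTT",
             c = "GGGGCCCC",
             d = "ATATAT")

gc_fraction <- function(s) {
  bases <- strsplit(s, "")[[1]]
  mean(bases %in% c("G", "C"))
}

# serial version: apply the function to every list element
serial <- lapply(seqs, gc_fraction)

# forked version: same call shape, plus mc.cores.
# mclapply() relies on forking, which is unavailable on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
forked <- mclapply(seqs, gc_fraction, mc.cores = n_cores)

# both routes give the same named list
stopifnot(identical(serial, forked))
print(unlist(serial))
```

For a job this small the cost of forking outweighs any gain, which is the overhead point made on the slide; the pattern pays off when each list element takes seconds rather than microseconds.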

Page 31:

That’s all, thank you for listening.

If you have tips and techniques for using R, I would be very pleased to hear them.

Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, et al. (2013) Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8): e1003118. doi:10.1371/journal.pcbi.1003118