
Post on 26-Dec-2015


JOINING THE DARK SIDE:

HTS DATA ANALYSIS WITH R-BIOCONDUCTOR

Pieta Schofield, Barton Group Talk

“aRrgh”: a bit of chatter on the mail list

• “Ignorance is bliss”
  • The less I have to know about R, the more blissful I will be

or

• “Knowledge is power”
  • The more I know R, the more powerful I find it is

WARNING! This talk contains snippets of R code that may be distressing to those of a nervous disposition

R – Software environment for doing data analysis and statistics
• “It’s like Marmite, you love it or hate it”
• … and like Marmite, I have to confess I love it.

• Extensible
• Scriptable
• Terminal based
• Free
• Open source
• Multiplatform (multiprocessor)…

! Bioconductor !
…but then I also still love vim, so what do I know?

Bioconductor
• Constantly growing set of R packages
• Focused on biological data analysis
• Common installation method
• Relatively easy package management
• Attempts at some common coding, testing and documentation standards
• Common (reusable/reused) data structures

Over the years I have made a living writing code in:
FORTRAN (IV, 77, 95)
RPL-Filetab (RapidGen decision table language)
APL (functional array processing language)
Modula-2 (Pascal dialect)
RPL-RS/1 (BBN Statistics Language)
Basic / VisualBasic
C / C++ / VisualC++
Objective-C
ActionScript
Matlab
Java
Perl
Python
R

Learn a language's strengths and play to them

I am too old to keep swapping syntaxes!

I use R exclusively (with a bit of bash)
• Microarrays
  • affy: for Affymetrix gene chips
  • limma: DE tool for microarrays
• “Great RNA-seq Experiment”
  • DE tools, mainly R-Bioconductor packages
  • edgeR, DESeq, BBSeq, SAMR, limma, BitSeq…
  • …too many to mention
• HTS data analysis
  • ChIP-seq, MNase-seq, SNP calling

What is available for HTS data analysis?

• Lots, and growing every day
• A few core packages for efficient storage and processing of sequence-based data and annotations
  • These are continually being refined and improved
  • Sometimes at the cost of backward compatibility
  • Best to keep R and Bioconductor packages up to date
• Integrated tools are being built on top of these core packages for specific analysis and visualisation

Which packages are worth the investment in learning?

Are any packages worth the investment of learning/switching to R?

Biostrings
• Set of classes for representing large biological sequences (DNA/RNA/amino acids)
  • Base class type
    • XString (BString) => DNAString, RNAString, AAString
  • Collection class
    • XStringSet…
  • Pairwise and multiple sequence alignments
• Set of methods for manipulation and computation on these classes
• Set of methods for sequence matching and pairwise alignments

    # Code to demonstrate the use of Biostrings

    library(Biostrings)

    dna <- DNAString("TCAACGTTGAATAGCGTACCG")
    #> dna
    #  21-letter "DNAString" instance
    #seq: TCAACGTTGAATAGCGTACCG

    # translate() returns an AAString directly
    aa <- translate(dna)
    #> aa
    #  7-letter "AAString" instance
    #seq: STLNSVP

    # Translate the three forward reading frames; adj trims the 3' end
    # so that each frame holds a whole number of codons
    adj <- c(0, 2, 1)
    orfs <- AAStringSet(lapply(1:3, function(x) {
        translate(dna[x:(length(dna) - adj[x])])
    }))
    #> orfs
    #  A AAStringSet instance of length 3
    #    width seq
    #[1]     7 STLNSVP
    #[2]     6 QR*IAY
    #[3]     6 NVE*RT

    library(Biostrings)

    dna <- DNAString("TCAACGTTGAATAGCGTACCGAACGTTGAATATCGTTGAATAG")
    p1 <- DNAString("AACGTT")

    # counts of every overlapping 2-mer
    dinucFreq <- dinucleotideFrequency(dna)
    #> dinucFreq
    #AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
    # 5  3  2  4  1  1  5  0  4  1  0  4  4  2  3  3

    # counts of every overlapping 3-mer
    trinucFreq <- trinucleotideFrequency(dna)
    #> trinucFreq
    #AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA CAC CAG CAT
    #  0   2   0   3   0   1   2   0   0   1   0   0   3   1   0   0   1   0   0   0
    #CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAC GAG GAT GCA GCC GCG GCT
    #  0   0   1   0   1   0   0   4   0   0   0   0   4   0   0   0   0   0   1   0
    #GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAG TAT TCA TCC TCG TCT TGA TGC TGG TGT
    #  0   0   0   0   1   0   0   3   0   1   2   1   1   0   1   0   3   0   0   0
    #TTA TTC TTG TTT
    #  0   0   3   0

    # number of exact occurrences of p1 in dna
    cpm <- countPattern(p1, dna)
    #> cpm
    #[1] 2

    # Read in a FASTA file and a GTF features file
    # Find chromosomes/contigs with features
    # Generate a FASTA file with only those chromosomes/contigs
    # containing features

    library(Biostrings)

    # read in FASTA file
    gg4 <- readDNAStringSet("~/data/ensembl/gg4/Gg4_73.fa")

    # read in GTF
    gtf <- read.delim("~/data/ensembl/gg4/Galgal4.gtf", sep = "\t", header = FALSE)

    # get chromosomes that exist in the GTF file
    # (unique() works whether V1 was read as a factor or as character;
    #  levels() returns NULL for character columns)
    chrs <- unique(as.character(gtf$V1))

    # sequence ID = FASTA header up to the first space
    ids <- sub(" .*", "", names(gg4))

    # subset the strings
    gg4.trim <- gg4[ids %in% chrs]

    # write out new FASTA file
    writeXStringSet(gg4.trim,
                    "~/data/ensembl/hg19_73/gg4_73_trimmed.fa",
                    format = "fasta", width = 256)

    # (continues the pattern-counting example: p1 and the 43-letter dna)
    mpm <- matchPattern(p1, dna)
    #> mpm
    #  Views on a 43-letter DNAString subject
    #subject: TCAACGTTGAATAGCGTACCGAACGTTGAATATCGTTGAATAG
    #views:
    #    start end width
    #[1]     3   8     6 [AACGTT]
    #[2]    22  27     6 [AACGTT]

    # allow one mismatch
    mpm1 <- matchPattern(p1, dna, max.mismatch = 1)
    #> mpm1
    #  Views on a 43-letter DNAString subject
    #subject: TCAACGTTGAATAGCGTACCGAACGTTGAATATCGTTGAATAG
    #views:
    #    start end width
    #[1]     3   8     6 [AACGTT]
    #[2]    22  27     6 [AACGTT]
    #[3]    32  37     6 [ATCGTT]

    ....

    # (continues the FASTA/GTF example: match one pattern against every
    #  sequence in gg4.trim, timing the search)
    dna <- DNAString("TCAACGTTGAAT")
    print(date())
    #[1] "Sun Nov 10 14:45:31 2013"
    v1 <- vmatchPattern(dna, gg4.trim)
    print(date())
    #[1] "Sun Nov 10 14:45:36 2013"

    > v1
    MIndex object of length 934
    $`1 dna:chromosome chromosome:Galgal4:1:1:195276750:1 REF`
    IRanges of length 4
            start       end width
    [1]  13647018  13647029    12
    [2]  56226448  56226459    12
    [3]  78853166  78853177    12
    [4] 143811701 143811712    12

    $`10 dna:chromosome chromosome:Galgal4:10:1:19911089:1 REF`
    IRanges of length 1
           start      end width
    [1] 14846274 14846285    12

    $`11 dna:chromosome chromosome:Galgal4:11:1:19401079:1 REF`
    IRanges of length 0

    ...
    <931 more elements>

    > v1[[1]]
    IRanges of length 4
            start       end width
    [1]  13647018  13647029    12
    [2]  56226448  56226459    12
    [3]  78853166  78853177    12
    [4] 143811701 143811712    12

Range Data
• Slight annoyance
  • Data structure proliferation: similar but different
  • Sometimes there are easy conversion methods
  • Sometimes there are not!
• GenomicRanges
  • GRanges & GRangesList
• IRanges uses RLE (run-length encoding)
• Run-length encoding

Raw: AAAACAAAAATTGTGGGG
RLE: A4C1A5T2G1T1G4
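The encoding on this slide can be reproduced with base R. This is only a sketch of the idea using `rle()` and `inverse.rle()` from base R; it is not the `Rle` class that Bioconductor uses internally, and the variable names are made up for illustration.

```r
# Split the raw string from the slide into single characters
raw <- strsplit("AAAACAAAAATTGTGGGG", "")[[1]]

# rle() collapses each run of identical values into (value, length)
enc <- rle(raw)

# Paste values and run lengths together in the slide's notation
encoded <- paste0(enc$values, enc$lengths, collapse = "")
encoded
# [1] "A4C1A5T2G1T1G4"

# inverse.rle() expands the runs back to the original vector
decoded <- paste(inverse.rle(enc), collapse = "")
```

The saving comes from long runs: a coverage vector over a chromosome is mostly long stretches of the same value, so storing runs is far smaller than storing one value per base.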

GenomicRanges
• GRanges class
  • It is a named list of ranges.
  • Coordinate data: seqname, range, start, end, strand
  • Metadata: user-specified data fields
• Aligned read classes
  • Gapped aligned reads: GAlignments
  • Gapped aligned read pairs: GAlignmentPairs
• Importing reads from BAM files
  • Front end to Rsamtools
  • Tools for iterative access to large files
• SummarizedExperiment
  • Managing a matrix of ranges and samples
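As a rough sketch of the "iterative access to large files" point: a `BamFile` opened with a `yieldSize` hands back reads one chunk at a time. The example below assumes a current Bioconductor, where `readGAlignments()` lives in the GenomicAlignments package, and it uses the small example BAM that ships with Rsamtools as a stand-in for a real data file. The chunk size of 1000 is arbitrary.

```r
library(Rsamtools)          # BamFile()
library(GenomicAlignments)  # readGAlignments()

# Example BAM shipped with Rsamtools (stand-in for a real, large file)
bamPath <- system.file("extdata", "ex1.bam", package = "Rsamtools",
                       mustWork = TRUE)

# yieldSize caps how many records each read call returns
bf <- BamFile(bamPath, yieldSize = 1000)
open(bf)

total <- 0L
repeat {
  chunk <- readGAlignments(bf)     # next chunk as a GAlignments object
  if (length(chunk) == 0L) break   # nothing left to read
  total <- total + length(chunk)   # replace with real per-chunk work
}
close(bf)

total
```

Only one chunk is held in memory at a time, so the same loop works on files far larger than RAM.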

    library(GenomicRanges)

    gr <- GRanges(seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
                  ranges = IRanges(1:10, end = 7:16, names = head(letters, 10)),
                  strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
                  # metadata columns (score is random, so values vary per run)
                  score = rnorm(10, 20, 2),
                  GC = seq(1, 0, length.out = 10),
                  group = c(rep("S", 4), rep("T", 2), rep("S", 4)))

    seqlengths(gr) <- c(249250621, 243199373, 198022430)
    gr

    GRanges with 10 ranges and 3 metadata columns:
        seqnames    ranges strand |     score        GC       group
           <Rle> <IRanges>  <Rle> | <numeric> <numeric> <character>
      a     chr1  [ 1,  7]      - |  17.78724 1.0000000           S
      b     chr2  [ 2,  8]      + |  24.25995 0.8888889           S
      c     chr2  [ 3,  9]      + |  19.32891 0.7777778           S
      d     chr2  [ 4, 10]      * |  20.88689 0.6666667           S
      e     chr1  [ 5, 11]      * |  21.50293 0.5555556           T
      f     chr1  [ 6, 12]      + |  18.47575 0.4444444           T
      g     chr3  [ 7, 13]      + |  21.90163 0.3333333           S
      h     chr3  [ 8, 14]      + |  23.35226 0.2222222           S
      i     chr3  [ 9, 15]      - |  19.72108 0.1111111           S
      j     chr3  [10, 16]      - |  18.54088 0.0000000           S
      ---
      seqlengths:
            chr1      chr2      chr3
       249250621 243199373 198022430

IRanges & GRanges methods
• Access
  • seqnames(), ranges(), strand(), mcols(), start(), end()
• Manipulate
  • split(), unlist(), c()
• Lots of range set operators
  • reduce(), disjoin(), shift(), flank(), union(), intersect(), setdiff(), gaps(), restrict()
• Find
  • findOverlaps(), nearest(), precede(), follow()
• Calculate
  • coverage()
  • summarizeOverlaps()
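A few of these methods on toy data may make the list more concrete. This is a minimal sketch: the chromosome name `chrW` and all coordinates are invented, and only `reduce()`, `findOverlaps()`, `countOverlaps()` and `coverage()` are shown.

```r
library(GenomicRanges)

# Invented features and reads on a hypothetical chromosome "chrW"
features <- GRanges("chrW", IRanges(start = c(10, 50, 100),
                                    end   = c(30, 70, 120)))
reads    <- GRanges("chrW", IRanges(start = c(5, 12, 25, 60, 200),
                                    width = 10))

# reduce(): merge overlapping reads into non-overlapping ranges
merged <- reduce(reads)

# findOverlaps(): one hit per (read, feature) pair that overlaps
hits <- findOverlaps(reads, features)

# countOverlaps(): number of reads overlapping each feature
perFeature <- countOverlaps(features, reads)

# coverage(): per-base read depth, stored run-length encoded
depth <- coverage(reads)
maxDepth <- max(depth[["chrW"]])
```

Here the first two reads overlap each other, so `merged` has four ranges and the maximum depth is 2; the read starting at 200 hits no feature.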

    # required packages
    library(GenomicRanges)
    library(Gviz)

    # make a GRanges object
    ducks <- GRanges(rep("chrW", 6),
                     IRanges(start = c(50, 180, 260, 800, 600, 1240),
                             width = c(15, 20, 40, 100, 500, 20)),
                     strand = rep("*", 6),
                     group = rep(c("Huey", "Dewey", "Louie"), c(1, 3, 2)))

    # make an annotation track of the object
    duckTrack <- AnnotationTrack(ducks,
                                 genome = "gg4",
                                 name = "Ducks")

    # make an annotation track of the reduced object
    duckRed <- AnnotationTrack(reduce(ducks),
                               genome = "gg4",
                               name = "Ducks Reduce")

    # save it as a pdf
    outFile <- "/homes/pschofield/scratch/NOBACK/ducks.pdf"
    pdf(outFile, width = 7, height = 2)
    plotTracks(list(duckTrack, duckRed), showId = TRUE)
    dev.off()

Visualisation
• Gviz

“Gviz uses the biomaRt and the rtracklayer packages to perform live annotation queries to Ensembl and UCSC and translates this to e.g. gene/transcript structures in viewports of the grid graphics package. This results in genomic information plotted together with your data”

Programmatically produced “IGB style” plots

Many track classes
• AlignedReadTrack
• AnnotationTrack
• BiomartGeneRegionTrack
• DataTrack
• GeneRegionTrack
• GenomeAxisTrack
• IdeogramTrack
• NumericTrack
• RangeTrack
• ReferenceTrack
• SequenceTrack
• StackedTrack
• UcscTrack

• Fastq files
  • Quality assessment
  • Alignment
• Feature processing
  • Peak calling (ChIP-seq)
  • Read allocation (RNA-seq)
  • SNP calling
• Annotational processing
  • Gene-set, functional analysis
  • Differential expression analysis
  • SNP mapping

• Quality assessment
  • ShortRead, htSeqTools
• Alignment
  • Rsubread, gmapR, GSNAP, Rbowtie, Biostrings
• Accessing alignments
  • ShortRead, Rsamtools
• Annotation
  • BSgenome, annotate, biomaRt, topGO, GenomicFeatures…

“I’ve been vaguely aware of biomaRt for a few years. Inexplicably, I’ve only recently started to use it. It’s one of the most useful applications I’ve ever used.” (Neil Saunders, What You’re Doing Is Rather Desperate)

• Differential expression
  • edgeR, DESeq, limma, samr…
• ChIP-seq
  • ChIPpeakAnno, QuasR, PICS, nucleR
• Motif discovery
  • rGADEM, MotIV, seqLogo
• SNP
  • VariantAnnotation, snpStats
• Visualization
  • GenomeGraphs, ggbio, Gviz, rtracklayer, biovizBase

“Easy” Parallelisation
• Although it has improved, try to avoid for loops; R is designed for processing lists
  • lapply(), mapply() and relatives
  • do.call(), Map()
  • endoapply(), mendoapply() (IRanges)
• parallel package
  • mclapply(), parLapply(), mcMap()

The overheads of parallel processing will eventually limit the speed-up (diminishing returns).
Some R libraries are already multithreaded and compiled using OpenMP, so be aware.
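The step from a list-style call to its forked counterpart is small, which is the sense in which this is "easy". The sketch below uses a made-up helper, `gcFraction()`, and a handful of toy sequences; the choice of two cores is arbitrary, and on Windows (where forking is unavailable) it falls back to one.

```r
library(parallel)

# Hypothetical per-item task: GC fraction of one sequence string
gcFraction <- function(s) {
  bases <- strsplit(s, "")[[1]]
  mean(bases %in% c("G", "C"))
}

seqs <- c("TCAACGTTGAAT", "GGGCCC", "ATATAT", "ACGT")

# Serial, loop-free version
serial <- vapply(seqs, gcFraction, numeric(1), USE.NAMES = FALSE)

# Forked version: same call shape, work spread over mc.cores processes
cores <- if (.Platform$OS.type == "windows") 1L else 2L
forked <- unlist(mclapply(seqs, gcFraction, mc.cores = cores))
```

For a task this cheap the forking overhead outweighs any gain; the pattern pays off only when each item takes real time. `parLapply()` follows the same shape but needs a cluster created with `makeCluster()` first.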

That’s all, thank you for listening.

If you have tips and techniques for using R, I would be very pleased to hear them.

Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, et al. (2013) Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8): e1003118. doi:10.1371/journal.pcbi.1003118