An introduction to RNA-seq - Universitetet i Bergeninge/inf389-2010/MD_RNASeqIntro.pdf · 2010. 11....
-
Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis
Deep sequencing of transcriptomes
An introduction to RNA-seq
Michael Dondrup
UNI BCCS
2 November 2010
-
Transcriptomics by Ultra-Fast Sequencing
Microarrays have been the primary high-throughput transcriptomics tool for almost a decade.
• New approach: sequence the transcriptome
• Millions of short read fragments (15 – 400 bp) from NGS machines
Remarks:
• SAGE/CAGE and other tag-based methods are not suitable for bacteria
• Reverse transcription: RNA → cDNA
• Most papers use a reference genome
• Count read fragments per bin (CDS, ORF, exon, intergenic region, window of size n)
-
Outline
Introduction
Lab procedures
Bioinformatics analysis
    Workflow
    Application examples
Statistics of DE analysis
    Normalization
    Trials and distributions
    Statistical testing
-
RNA-seq, a revolutionary tool?
-
RNA-seq, a revolutionary tool
-
RNA-seq, a revolutionary tool!
-
Applications
• Gene structure (e.g. exons/introns)
• Finding novel transcripts:
    • non-annotated (pseudo-)genes
    • non-coding RNA (ncRNA, sRNA)
    • antisense RNA
• Transcription start sites (TSS)
• Operon structure
• De novo transcriptome assembly (when no reference exists)
• Metagenomics approach (sampling of bacterial communities by RNA-seq)
• (Semi-)quantitative approach: differential expression (DE)
-
Overview
[Figure: lab workflow from sample to short sequence reads. Stages shown: sample, total RNA, small RNA / mRNA, cDNA, short sequence reads. Steps shown: RNA extraction and purification; depletion of rRNA, amplification of mRNA; fragmentation; reverse transcription; high-throughput sequencing.]
Remarks:
• DNase I treatment
• prokaryotes don't have polyA
• rRNA depletion: ≈90% removal
• rRNA/RNA before: 97-99%, after: 90%
• induces bias
• no depletion e.g. for meta-RNA-seq
• directional by: ss-cDNA or adapter ligation
-
cDNA Sequencing
• Illumina/Solexa
• ABI/SOLiD
• Roche/454
• direct RNA sequencing is under development
High coverage (Illumina, SOLiD) is preferred over read length (454) if no transcriptome assembly is required
-
Overview
[Figure: bioinformatics workflow. Inputs: sequence data (FASTA, FASTQ), reference sequence, genome annotation. Steps: quality control & filtering; aligning reads to reference; transcriptome assembly; alignment statistics & filtering; compute coverage; binning (genes, intergenic regions, etc.). Analyses: DE analysis; transcription start sites/operons; search for novel transcripts; transcript variants analysis (splicing, intron/exon, etc.); visualization.]
-
Filtering of reads
• by base qualities
• by length
• duplicate reads (identical sequence):
    • removal
    • condense into a single read
    • probabilistic (have not seen this applied)
    • duplicates are suspected to be artifacts
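The filters listed here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: reads are taken to be (sequence, qualities) pairs with Phred scores already decoded to integers, and the function name `filter_reads` and the thresholds `min_len` and `min_mean_q` are hypothetical, not taken from any particular tool.

```python
from collections import OrderedDict

def filter_reads(reads, min_len=20, min_mean_q=20.0):
    """Filter reads by length and mean base quality, then condense duplicates.

    reads: iterable of (sequence, qualities), qualities being a list of ints.
    Returns a list of (sequence, qualities, copies); reads with an identical
    sequence are condensed into the first one seen, with a copy count.
    """
    kept = OrderedDict()  # sequence -> [qualities of first copy, copy count]
    for seq, quals in reads:
        if len(seq) < min_len or not quals:
            continue  # too short (or no quality values at all)
        if sum(quals) / len(quals) < min_mean_q:
            continue  # mean base quality too low
        if seq in kept:
            kept[seq][1] += 1  # duplicate: only increase the count
        else:
            kept[seq] = [quals, 1]
    return [(seq, quals, copies) for seq, (quals, copies) in kept.items()]
```

Keeping the copy count makes it possible to choose later between plain removal and condensing, since the information is not thrown away at this step.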
-
Alignments
Challenges are the same as with all NGS data:
• millions/billions of reads
• rather short reads
• sequencing errors
• sequencing bias
(Short-)read alignment programs used in RNA-seq:
• BWA
• Bowtie
• SHRiMP
• SOAP
• ELAND
• BLAT
• BLAST (not so good!)
• ...
-
Challenges begin after the alignment
Mapping fragments to the genome:
[Figure: fragments mapped to the genome]
-
Filtering possibilities
• unique alignments
• sequence identity
• alignment quality score
• alignment length
• proportion of read length aligned
-
Pseudo coverage
• count of alignments spanning a genomic position
• can be computed with 1 bp resolution or for larger windows
• used in visualization
• pseudo: because we do not know the size of the transcriptome
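A minimal sketch of how pseudo coverage could be computed, assuming alignments are given as 0-based, end-inclusive (start, end) pairs on a single sequence. The function names are hypothetical. The per-base version uses a difference array, so the cost is linear in the number of alignments plus the genome length.

```python
def pseudo_coverage(alignments, genome_length):
    """Count, for every position, the alignments that span it.

    alignments: iterable of (start, end), 0-based, end inclusive.
    """
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        diff[start] += 1      # coverage goes up where an alignment starts
        diff[end + 1] -= 1    # and down right after it ends
    coverage, running = [], 0
    for delta in diff[:genome_length]:
        running += delta
        coverage.append(running)
    return coverage

def window_coverage(coverage, w):
    """Mean coverage in consecutive windows of w positions (last may be shorter)."""
    windows = [coverage[i:i + w] for i in range(0, len(coverage), w)]
    return [sum(win) / len(win) for win in windows]
```

For example, two alignments (0, 2) and (1, 3) on a sequence of length 5 give the per-base profile 1, 2, 2, 1, 0.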
-
Examples
-
Interval binning
• DE analysis requires binning
• typical bins: exons, CDS, transcripts, introns, genes
• reads and genes are represented as genomic intervals [start, end]
• fast interval overlap algorithms:
    • Sorting-based methods
    • Interval-tree-based methods (as in the IRanges Bioconductor package)
    • Nested containment lists (Alekseyenko & Lee, Bioinformatics 2007)
• Result: a read count $n_i \in \mathbb{N}_0$ of reads per bin $i$
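As an illustration of the sorting-based idea, the sketch below counts the reads overlapping each bin with two sorted lists and binary search. It assumes closed intervals [start, end] on a single strand and sequence; `count_reads_per_bin` is a hypothetical name, and this is not how IRanges or nested containment lists are implemented.

```python
from bisect import bisect_left, bisect_right

def count_reads_per_bin(reads, bins):
    """Count reads overlapping each bin.

    reads: list of (start, end) closed intervals.
    bins:  dict mapping a bin name to its (start, end) closed interval.
    A read overlaps a bin unless it ends before the bin starts or starts
    after the bin ends.
    """
    starts = sorted(start for start, _ in reads)
    ends = sorted(end for _, end in reads)
    counts = {}
    for name, (bin_start, bin_end) in bins.items():
        started = bisect_right(starts, bin_end)   # reads starting at or before bin end
        finished = bisect_left(ends, bin_start)   # reads ending before bin start
        counts[name] = started - finished
    return counts
```

Note that a read overlapping two bins is counted in both; whether that is desirable depends on the analysis.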
-
Search for ncRNAs
• define an arbitrary coverage threshold c0
• search for continuous intergenic regions (seeds) with coverage c > c0
• possibly extend over small gaps
• "intergenic" here means: regions that do not overlap a CDS (why?)
• extract the sequence and search databases
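The seed-and-extend search can be sketched as follows. The inputs are assumptions made for illustration: `cov` is a per-base coverage list, `in_cds` is a list of booleans of the same length that is True wherever a position overlaps a CDS, and `max_gap` is a hypothetical parameter for the largest gap that seeds may be merged across.

```python
def find_candidate_regions(cov, in_cds, c0, max_gap=0):
    """Return (start, end) regions with coverage > c0 outside any CDS.

    Seeds are maximal runs of qualifying positions. Neighbouring seeds are
    merged when the gap between them is at most max_gap positions long and
    contains no CDS position.
    """
    seeds, start = [], None
    for i, (c, cds) in enumerate(zip(cov, in_cds)):
        hit = c > c0 and not cds
        if hit and start is None:
            start = i                      # a new seed begins
        elif not hit and start is not None:
            seeds.append((start, i - 1))   # the current seed ends
            start = None
    if start is not None:
        seeds.append((start, len(cov) - 1))

    merged = []
    for s, e in seeds:
        if merged:
            prev_s, prev_e = merged[-1]
            gap = range(prev_e + 1, s)
            if len(gap) <= max_gap and not any(in_cds[i] for i in gap):
                merged[-1] = (prev_s, e)   # extend over the small gap
                continue
        merged.append((s, e))
    return merged
```

The sequence of each returned region would then be extracted and searched against databases, which is outside the scope of this sketch.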
-
TSS and operon detection
• try to find out where coverage changes quickly (a TSS candidate)
• compute first-order differences d(c_i) for each genomic position i (i.e. discrete differentiation) using a sliding window of width w
• look for maxima/minima upstream of a gene (maximum for the + strand, minimum for the - strand)
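One possible reading of this procedure is sketched below. The exact definition of the windowed difference is an assumption: here d[i] is the mean coverage of the w positions starting at i minus the mean coverage of the w positions before i. The function names and the `upstream` search distance are hypothetical.

```python
def windowed_differences(cov, w):
    """d[i] = mean(cov[i:i+w]) - mean(cov[i-w:i]); 0.0 where the windows do not fit."""
    n = len(cov)
    d = [0.0] * n
    for i in range(w, n - w + 1):
        d[i] = (sum(cov[i:i + w]) - sum(cov[i - w:i])) / w
    return d

def tss_candidate(cov, gene_start, gene_end, strand, w=3, upstream=50):
    """Position of the sharpest coverage change upstream of a gene.

    '+' strand: largest increase within `upstream` bases before gene_start.
    '-' strand: largest decrease within `upstream` bases after gene_end.
    """
    d = windowed_differences(cov, w)
    if strand == "+":
        lo, hi = max(0, gene_start - upstream), gene_start
        return max(range(lo, hi + 1), key=lambda i: d[i])
    lo, hi = gene_end, min(len(cov) - 1, gene_end + upstream)
    return min(range(lo, hi + 1), key=lambda i: d[i])
```

For the - strand the returned index is the first position after the drop in coverage, so the last covered base is one position to its left.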
-
Why data normalization for RNA-seq data?
Account for systematic technical errors/bias:
• different library sizes: different numbers of reads
• different gene lengths
• some sequencing methods seem to prefer longer transcripts even more
• GC content might have an influence too
• limited read capacity: highly expressed genes "steal" reads
This is not trivial (e.g. the common 5' preference): we must differentiate between biological effects and technical effects.
Btw.: isn't this a déjà vu?
-
Normalization methods
• RPKM (Mortazavi et al., 2008)
• housekeeping or constant reference gene (e.g. POLR2A)
• upper-quartile (75th percentile) normalization
• quantile normalization (as for microarray data, see: Bolstad et al., 2003)
See: Bullard et al. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics (2010)
-
RPKM
reads per kilobase of exon per million mapped sequence reads
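$$\mathrm{rpkm}(n_i, l_i, N) = 10^9 \cdot \frac{n_i}{l_i \, N}$$
$n_i$: number of reads mapped to gene $i$
$l_i$: length of the gene sequence in bases $b$
$N$: total number of mapped reads
$b$: genomic base position (used as a unit here, although it is not really a unit!)
The factor $10^9 = 10^3 \cdot 10^6$ rescales "per base, per read" to "per kilobase, per million reads".
A very simple gene-specific scaling factor. Some publications also took log values of similarly scaled data.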
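As a worked example, RPKM is commonly computed as 10^9 · n / (l · N), where n is the read count of the gene, l its length in bases and N the total number of mapped reads. The function below is a direct transcription of that formula; the numbers in the usage note are made up.

```python
def rpkm(n_i, l_i, N):
    """Reads per kilobase of exon per million mapped reads.

    n_i: reads mapped to gene i
    l_i: length of gene i in bases
    N:   total number of mapped reads in the sample
    """
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million reads)
    return 1e9 * n_i / (l_i * N)
```

A gene of 2,000 bases with 100 reads in a library of 10 million mapped reads gets rpkm(100, 2000, 10_000_000) = 5.0. Doubling the gene length or the library size at the same count halves the value.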
-
Quantile normalization
• no assumption about the original data distribution
• afterwards, all samples have the same empirical distribution
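A minimal sketch of quantile normalization in plain Python, assuming every sample is a list of values for the same genes in the same order. Ties are broken by position here, which is a simplification; the function name is hypothetical.

```python
def quantile_normalize(samples):
    """Give every sample the same empirical distribution.

    samples: list of equal-length lists (one list per sample).
    Each value is replaced by the mean, across samples, of the values
    that share its rank.
    """
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [
        sum(s[rank] for s in sorted_samples) / len(samples)
        for rank in range(n_genes)
    ]
    normalized = []
    for s in samples:
        order = sorted(range(n_genes), key=lambda g: s[g])  # gene indices by rank
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]
        normalized.append(out)
    return normalized
```

After normalization the sorted values of every sample are identical, while the rank order of the genes within each sample is preserved.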
-
Method comparison by Bullard et al.
• gold standard: qRT-PCR data (MAQC project)
• divide the gold standard data into DE, non-DE, and no-call
• try to reproduce the classification based on RNA-seq data (+ microarray data) using statistical tests
Results:
• housekeeping, upper-quartile, and quantile normalization are preferable over RPKM and total counts
• the choice of the normalization method had a larger effect on performance than the choice of the statistical model (test)
-
What is significant here?
• Is a gene significantly differentially expressed under two conditions (DE analysis)?
Remarks:
• We need a model for the data; more precisely, we need a model that can explain the variance in the data.
• Significance: the probability of rejecting the null hypothesis in a statistical test just by chance, while there is in reality no effect, is low.
• We are dealing with count data, thus we are dealing with discrete statistics (for now).
-
Bernoulli trials
A trial with two outcomes: success and failure
p: probability of success
probability of failure: p′ = 1 − p
-
Binomial distribution
The number of successes $k$ in a series of $n$ i.i.d. Bernoulli trials follows a binomial distribution:
$$\Pr(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$
[Figure: probability mass function]
-
Poisson distribution
The distribution of rare events.
• we know a rate, but not the probability of success, in a series of independent Bernoulli trials occurring in space/time
• the number of trials is large while each individual trial has a low probability of success
• e.g.: number of phone calls received in a call center per hour
• number of defective devices on an assembly line per day
$$f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$$
-
Remarks
• It is tempting to use a Poisson model and set λ to the (pseudo-)coverage.
• $E(f; \lambda) = \sigma^2 = \lambda$ (mean and variance are both λ)
Preconditions:
• Independence: trials do not depend on previous events
• Lack of clustering: the probability of two simultaneous events is low
• Rate is constant over space/time
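The defining property of the Poisson model, that mean and variance are both equal to the rate λ, can be checked numerically. The sketch below evaluates the probability mass function λ^k e^(−λ) / k! in log space and sums over a truncated range of k; the helper names and the truncation point are choices made for this illustration.

```python
import math

def poisson_pmf(k, lam):
    """Pr(X = k) for a Poisson distribution with rate lam > 0."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def mean_and_variance(pmf, k_max):
    """Mean and variance of a pmf on 0..k_max (the tail beyond k_max is ignored)."""
    mean = sum(k * pmf(k) for k in range(k_max + 1))
    var = sum((k - mean) ** 2 * pmf(k) for k in range(k_max + 1))
    return mean, var
```

With λ = 4 and k_max = 100 both values come out as 4 up to rounding. Replicate counts whose sample variance clearly exceeds their mean are therefore a hint that a Poisson model is too narrow.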
-
Poisson distribution ...
• is not a suitable model for RNA-seq data in general
• has been found sufficient for technical variation in RNA-seq data
• biological variance ≥ technical variance
• keep in mind: we do not know the length of the transcriptome → bp is not a unit!
• two-stage random process (a bit sloppy!): C(x) = Sequencing(Transcription(x))
• → overdispersion
-
Negative binomial distribution
• k: number of failures before r successes occur, each trial succeeding with probability p
• used for overdispersion problems, with a dispersion parameter
$X \sim \mathrm{NB}(r, p)$, alternatively written as $\mathrm{NB}(\mu, \sigma^2)$
$$f(k) \equiv \Pr(X = k) = \binom{k+r-1}{r-1} p^r (1-p)^k \quad \text{for } k = 0, 1, 2, \ldots$$
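A small sketch of the negative binomial in the "k failures before r successes, success probability p" convention, which is also the convention of R's `dnbinom(k, size = r, prob = p)`. For simplicity r is restricted to positive integers here; the function names are hypothetical.

```python
import math

def nb_pmf(k, r, p):
    """Pr(X = k): k failures before the r-th success, success probability p."""
    return math.comb(k + r - 1, r - 1) * p ** r * (1 - p) ** k

def nb_mean_var(r, p):
    """Mean and variance of NB(r, p) in the same convention."""
    mean = r * (1 - p) / p
    var = r * (1 - p) / p ** 2
    return mean, var
```

In this convention the variance can be rewritten as mean + mean² / r, so it always exceeds the mean, and 1/r plays the role of a dispersion parameter. For r = 10 and p = 0.5 the mean is 10 and the variance 20, twice what a Poisson model with the same mean would allow.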
-
NB
[Figure: probability mass function dnbinom(k, r, p) plotted against k (0 to 40) for r=2, p=0.5; r=5, p=0.5; r=10, p=0.5; r=20, p=0.5; r=10, p=0.3]
-
How do we assess significance?
• We have discrete probabilities
• We can enumerate all possibilities
• principle: enumerate all outcomes which are equally or more extreme than the given one
-
Example: Fisher’s exact test
• Exact test for contingency tables with small sample sizes
• the probability of a single table follows the hypergeometric distribution
• for large samples, approximation by the chi-square test
• the dieting example
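The enumeration principle can be made concrete for a 2×2 table. The sketch below computes a two-sided p-value by summing the hypergeometric probabilities of all tables with the same margins that are at most as probable as the observed one. The function name is hypothetical, and integer weights are used so that the comparison is exact.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def weight(x):
        # Number of ways to obtain x in the top-left cell with fixed margins;
        # dividing by comb(n, row1) gives the hypergeometric probability.
        return comb(col1, x) * comb(n - col1, row1 - x)

    lo = max(0, row1 - (n - col1))   # smallest possible top-left count
    hi = min(row1, col1)             # largest possible top-left count
    observed = weight(a)
    extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return extreme / comb(n, row1)
```

As a usage example with illustrative counts (whether they match the dieting example on the slide is an assumption): for a table with 1 and 9 dieters versus 11 and 3 non-dieters among men and women, fisher_exact_two_sided(1, 9, 11, 3) is about 0.0028.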
-
edgeR and DESeq
Model: $K_{ij} \sim \mathrm{NB}(\mu_{ij}, \sigma^2_{ij})$ for gene $i$ and sample $j$ in a replicated experiment*
• the mean $\mu$ and the variance $\sigma^2$ must be estimated from the replicates*
• edgeR: $\sigma^2 = \mu + \alpha \mu^2$
• DESeq: $\mu_{ij} = q_{i,\rho(j)} \, s_j$
• DESeq: $\sigma^2_{ij} = \mu_{ij} + s_j^2 \, v_{i,\rho(j)}$
The two packages use different approaches for parameter fitting and testing.
*DESeq also works with no or few replicates, but with reduced power
-
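Testing in DESeq similar to Fisher's test
As test statistic: the total counts in the two conditions, $K_{iA}$ and $K_{iB}$
Now we need a p-value:
$$p_i = \frac{\displaystyle\sum_{\substack{a+b=k_{iS} \\ p(a,b) \le p(K_{iA}, K_{iB})}} p(a,b)}{\displaystyle\sum_{a+b=k_{iS}} p(a,b)}$$
Here $k_{iS} = K_{iA} + K_{iB}$ is the observed total over both conditions.
Now the model comes in: $p(a,b) = \Pr(K_{iA} = a)\,\Pr(K_{iB} = b)$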
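The p-value above can be sketched directly: enumerate all splits (a, b) of the observed total, weight each by the product of two negative binomial probabilities, and sum those that are at most as probable as the observed split. This is an illustration only, not the DESeq implementation. The means and variances for the two conditions are assumed to be given, whereas the package estimates them from the data, and all function names are hypothetical.

```python
import math

def nb_pmf_mean_var(k, mean, var):
    """Negative binomial pmf parametrized by mean and variance (var > mean > 0)."""
    p = mean / var                      # success probability
    r = mean * mean / (var - mean)      # size parameter, may be non-integer
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)

def exact_nb_pvalue(k_a, k_b, mean_a, var_a, mean_b, var_b):
    """Sum of p(a, b) over splits no more probable than the observed one,
    divided by the sum over all splits with a + b = k_a + k_b."""
    k_s = k_a + k_b

    def joint(a, b):
        return nb_pmf_mean_var(a, mean_a, var_a) * nb_pmf_mean_var(b, mean_b, var_b)

    p_obs = joint(k_a, k_b)
    tolerance = 1 + 1e-9                # treat numerically equal splits as ties
    total = extreme = 0.0
    for a in range(k_s + 1):
        p_ab = joint(a, k_s - a)
        total += p_ab
        if p_ab <= p_obs * tolerance:
            extreme += p_ab
    return extreme / total
```

With identical parameters for both conditions (mean 5, variance 10), a balanced observation such as (5, 5) gives a p-value of 1, while a lopsided one such as (0, 20) gives roughly 0.002.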
-
Summary & Outlook (my 50 cents)
• RNA-seq is as promising as it is complex
• read mapping and binning are working fine, though parameters need to be explored
• normalization, statistical models, and tests need more work
• more aggressive normalization should be explored
• many methods are a bit ad hoc or use arbitrary thresholds
• no framework for within-sample significance testing