An introduction to RNA-seq - Universitetet i Bergeninge/inf389-2010/MD_RNASeqIntro.pdf · 2010. 11....

Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

Deep sequencing of transcriptomes: An introduction to RNA-seq

Michael Dondrup

UNI BCCS

2 November 2010

1 / 40


Transcriptomics by Ultra-Fast Sequencing

Microarrays have been the primary high-throughput transcriptomics tool for almost a decade.

• New approach: sequence the transcriptome
• Millions of short read fragments (15 – 400 bp) from NGS machines

Remarks:

• SAGE/CAGE and other tag-based methods are not suitable for bacteria
• Reverse transcription: RNA → cDNA
• Most papers use a reference genome
• Count read fragments per bin (CDS, ORF, exon, intergenic region, window of size n)

2 / 40


Outline

Introduction

Lab procedures

Bioinformatics analysis
  Workflow
  Application examples

Statistics of DE analysis
  Normalization
  Trials and distributions
  Statistical testing

3 / 40


RNA-seq, a revolutionary tool?

4 / 40


RNA-seq, a revolutionary tool

4 / 40


RNA-seq, a revolutionary tool!

4 / 40


Applications

• Gene structure (e.g. exons/introns)
• Finding novel transcripts:
  • non-annotated (pseudo) genes
  • non-coding RNA (ncRNA, sRNA)
  • antisense RNA
• Transcription start sites (TSS)
• Operon structure
• De novo transcriptome assembly (if no reference exists)
• Metagenomics approach (sampling of bacterial communities by RNA-seq)
• (Semi-)quantitative approach: differential expression (DE)

5 / 40




Overview

[Figure: lab workflow from sample to short sequence reads. Stages shown: sample, total RNA, small RNA / mRNA, cDNA, short sequence reads (e.g. ACTGATGTGAT). Steps shown: RNA extraction and purification; depletion of rRNA, amplification of mRNA; fragmentation; reverse transcription; high-throughput sequencing.]

Remarks:

• DNase I treatment
• prokaryotes don't have polyA
• rRNA depletion: ≈90% removal
• rRNA/RNA before: 97-99%, after: 90%
• induces bias
• no depletion e.g. for meta-RNA-seq
• directional by: ss-cDNA or adapter ligation

7 / 40


cDNA Sequencing

• Illumina/Solexa
• ABI/SOLiD
• Roche/454
• direct RNA sequencing is under development

High coverage (Illumina, SOLiD) is preferred over read length (454) if no transcriptome assembly is required.

8 / 40




Overview

[Figure: bioinformatics workflow. Inputs: sequence data (FASTA, FASTQ), reference sequence, genome annotation. Processing steps: quality control & filtering; aligning reads to reference; alignment statistics & filtering; transcriptome assembly; compute coverage; binning (genes, intergenic regions, etc.). Analyses: DE analysis; transcription start sites/operons; search for novel transcripts; transcript variants analysis (splicing, intron/exon, etc.); visualization.]

10 / 40


Filtering of reads

• by base qualities
• by length
• duplicate reads (identical sequence)
  • removal
  • condense into single read
  • probabilistic (have not seen this applied)
  • duplicates are suspected to be artifacts
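The filters listed above can be sketched in a few lines of Python. This is only an illustration: the function name `filter_reads`, the cutoffs (mean Phred quality 20, minimum length 15) and the toy reads are assumptions, not values from the lecture.

```python
from statistics import mean

def filter_reads(reads, min_mean_quality=20, min_length=15):
    """Keep reads that pass quality and length cutoffs; collapse exact duplicates.

    `reads` is an iterable of (sequence, qualities) pairs, where `qualities`
    is a list of per-base Phred scores. The cutoffs are illustrative defaults.
    """
    seen = set()
    kept = []
    for seq, quals in reads:
        if len(seq) < min_length:
            continue  # too short
        if mean(quals) < min_mean_quality:
            continue  # low average base quality
        if seq in seen:
            continue  # identical sequence already kept: condense into one read
        seen.add(seq)
        kept.append((seq, quals))
    return kept

example_reads = [
    ("ACTGATGTGATACTGGTC", [30] * 18),
    ("ACTGATGTGATACTGGTC", [32] * 18),  # duplicate sequence, dropped
    ("ACTGATG", [35] * 7),              # shorter than min_length
    ("TTTCCCGTGATAATCCGC", [10] * 18),  # mean quality below cutoff
]
filtered_reads = filter_reads(example_reads)
```

Only the first read survives here. Whether duplicates should be removed at all is a judgment call, since highly expressed genes can legitimately produce identical reads.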

11 / 40


Alignments

Challenges are the same as with all NGS data:

• millions/billions of reads
• rather short reads
• sequencing errors
• sequencing bias

(Short-)read alignment programs used in RNA-seq:

• BWA
• Bowtie
• SHRiMP
• SOAP
• ELAND
• BLAT
• BLAST (not so good!)
• ...

12 / 40


Challenges begin after the alignment

Mapping fragments to the genome:

13 / 40


Filtering possibilities

• unique alignments
• sequence identity
• alignment quality score
• alignment length
• proportion of read length aligned
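As a sketch of how these criteria combine, the following Python snippet filters a list of alignment records. The `Alignment` record, its field names and all thresholds are hypothetical; real aligners report these quantities in their own formats.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Alignment:
    read_id: str
    read_length: int
    aligned_length: int  # number of read bases covered by the alignment
    matches: int         # identical bases within the aligned part
    score: int           # quality score reported by the aligner

def filter_alignments(alignments, min_identity=0.95, min_score=20,
                      min_aligned_length=15, min_fraction=0.9):
    """Apply the five filters from the slide; thresholds are illustrative."""
    hits_per_read = Counter(a.read_id for a in alignments)
    kept = []
    for a in alignments:
        unique = hits_per_read[a.read_id] == 1
        identity = a.matches / a.aligned_length
        fraction = a.aligned_length / a.read_length
        if (unique and identity >= min_identity and a.score >= min_score
                and a.aligned_length >= min_aligned_length
                and fraction >= min_fraction):
            kept.append(a)
    return kept

example_alignments = [
    Alignment("r1", 36, 36, 36, 40),  # passes all filters
    Alignment("r2", 36, 36, 35, 40),  # r2 maps to two places: not unique
    Alignment("r2", 36, 36, 35, 40),
    Alignment("r3", 36, 20, 20, 40),  # only 56% of the read aligned
    Alignment("r4", 36, 36, 30, 40),  # identity 83%
    Alignment("r5", 36, 36, 36, 5),   # low alignment score
]
kept_alignments = filter_alignments(example_alignments)
```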

14 / 40


Pseudo coverage

• count of alignments spanning a genomic position
• can be computed with 1 bp resolution or for larger windows
• used in visualization
• pseudo: because we do not know the size of the transcriptome
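A minimal sketch of the pseudo-coverage computation, assuming alignments are given as 1-based inclusive (start, end) pairs on a single sequence. The function names and the difference-array approach are illustrative choices, not taken from the lecture.

```python
from itertools import accumulate

def pseudo_coverage(alignments, genome_length):
    """Per-base count of alignments spanning each position (1 bp resolution)."""
    diff = [0] * (genome_length + 2)
    for start, end in alignments:
        diff[start] += 1      # coverage rises where an alignment starts
        diff[end + 1] -= 1    # and falls just after it ends
    # prefix sums turn the differences into coverage; drop index 0 and the sentinel
    return list(accumulate(diff))[1:genome_length + 1]

def window_coverage(coverage, window):
    """Mean coverage in consecutive, non-overlapping windows."""
    chunks = [coverage[i:i + window] for i in range(0, len(coverage), window)]
    return [sum(chunk) / len(chunk) for chunk in chunks]

coverage = pseudo_coverage([(1, 4), (3, 6), (3, 3)], genome_length=8)
windows = window_coverage(coverage, window=4)
```

The difference array keeps the cost proportional to the number of alignments plus the genome length, rather than to the total number of aligned bases.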

15 / 40


Examples

16 / 40


Interval binning

• DE analysis requires binning
• typical bins: exons, CDS, transcripts, introns, genes
• reads and genes are represented as genomic intervals [start, end]
• fast interval overlap algorithms:
  • sorting-based methods
  • interval-tree-based methods (as in the IRanges Bioconductor package)
  • nested containment lists (Alekseyenko & Lee, Bioinformatics 2007)
• Result: a read count $n_i \in \mathbb{N}_0$ of reads in bin $i$
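One possible sorting-based method can be sketched as follows. It relies on the observation that the reads overlapping a bin [s, e] are those that start at or before e, minus those that already ended before s. The function name and the toy coordinates are assumptions for illustration; this is not the IRanges implementation.

```python
from bisect import bisect_left, bisect_right

def count_reads_per_bin(reads, bins):
    """Count reads overlapping each bin; intervals are inclusive (start, end)."""
    starts = sorted(start for start, _ in reads)
    ends = sorted(end for _, end in reads)
    counts = {}
    for name, (bin_start, bin_end) in bins.items():
        started = bisect_right(starts, bin_end)   # reads with start <= bin_end
        finished = bisect_left(ends, bin_start)   # reads with end < bin_start
        counts[name] = started - finished
    return counts

example_reads = [(1, 10), (5, 14), (20, 29), (28, 37), (50, 59)]
example_bins = {"geneA": (1, 12), "geneB": (25, 40), "intergenic": (41, 49)}
read_counts = count_reads_per_bin(example_reads, example_bins)
```

After sorting, each bin costs two binary searches. Note that a read overlapping two bins is counted in both; how to treat such reads is a separate decision.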

17 / 40


Search for ncRNAs

• define an arbitrary coverage threshold $c_0$
• search for continuous intergenic regions (seeds) with coverage $c > c_0$
• possibly extend over small gaps
• "intergenic" here means: regions that do not overlap a CDS (why?)
• extract sequence and search databases
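The seed-and-extend search above can be sketched as follows, assuming a per-base coverage list and CDS annotations as 1-based inclusive intervals. The function name, the `max_gap` parameter and the rule that a gap is never bridged across a CDS are illustrative assumptions.

```python
def find_candidate_regions(coverage, cds_intervals, c0, max_gap=0):
    """Return 1-based inclusive (start, end) regions with coverage > c0 outside CDSs.

    Neighbouring seeds are merged when at most `max_gap` positions separate
    them and none of those positions lies inside a CDS.
    """
    in_cds = [False] * len(coverage)
    for start, end in cds_intervals:
        for idx in range(start - 1, end):
            in_cds[idx] = True

    regions = []
    for idx, c in enumerate(coverage):
        if c <= c0 or in_cds[idx]:
            continue
        pos = idx + 1
        if regions:
            last_end = regions[-1][1]
            gap = pos - last_end - 1
            # in_cds[last_end:idx] covers exactly the gap positions (0-based)
            if gap <= max_gap and not any(in_cds[last_end:idx]):
                regions[-1][1] = pos
                continue
        regions.append([pos, pos])
    return [tuple(region) for region in regions]

example_coverage = [0, 5, 6, 0, 7, 0, 0, 9, 9, 9, 0, 4]
example_cds = [(8, 10)]
seeds = find_candidate_regions(example_coverage, example_cds, c0=3)
extended = find_candidate_regions(example_coverage, example_cds, c0=3, max_gap=1)
```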

18 / 40


Search for ncRNAs

19 / 40


TSS and operon detection

• try to find out where coverage changes quickly (a TSS candidate)
• compute first-order differences $d(c_i)$ for each genomic position $i$ (aka differentiation) using a sliding window of width $w$
• look for maxima/minima upstream of a gene (max for + strand, min for - strand)
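A rough sketch of this idea in Python. The exact definition of the windowed difference is not given on the slide; here it is assumed to be the mean coverage in the window starting at position i minus the mean coverage in the window just before i. Function names, 0-based indexing and the `upstream` search range are also assumptions.

```python
def windowed_difference(coverage, w):
    """d[i] = mean(coverage[i:i+w]) - mean(coverage[i-w:i]); None near the edges."""
    n = len(coverage)
    d = [None] * n
    for i in range(w, n - w + 1):
        d[i] = (sum(coverage[i:i + w]) - sum(coverage[i - w:i])) / w
    return d

def tss_candidate(coverage, gene_5prime, strand, w=3, upstream=10):
    """0-based index of the steepest coverage change upstream of a gene's 5' end."""
    d = windowed_difference(coverage, w)
    if strand == "+":
        lo, hi = max(0, gene_5prime - upstream), gene_5prime
        pick = max   # coverage rises along the + strand
    else:
        lo, hi = gene_5prime, min(len(coverage) - 1, gene_5prime + upstream)
        pick = min   # coverage drops when read in + direction
    candidates = [i for i in range(lo, hi + 1) if d[i] is not None]
    if not candidates:
        return None
    return pick(candidates, key=lambda i: d[i])

plus_coverage = [0] * 10 + [8] * 15    # transcript starts at index 10
minus_coverage = [8] * 15 + [0] * 10   # coverage ends after index 14
plus_tss = tss_candidate(plus_coverage, gene_5prime=14, strand="+")
minus_tss = tss_candidate(minus_coverage, gene_5prime=10, strand="-")
```

On real data the difference signal is noisy, so the window width and the upstream search range would need tuning.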

20 / 40


TSS and operon detection

21 / 40




Why data normalization for RNA-seq data?

Account for systematic technical errors/bias:

• different library sizes: different numbers of reads
• different gene lengths
• some sequencing methods seem to prefer longer transcripts even more
• GC content might have an influence too
• limited read capacity: highly expressed genes steal reads

This is not trivial (e.g. common 5' preference). We need to differentiate between biological effects and technical effects.

Btw.: isn't this a déjà-vu?

23 / 40


Normalization methods

• RPKM (Mortazavi et al., 2008)
• housekeeping or constant reference gene (e.g. POLR2A)
• upper quartile (75th percentile) normalization
• quantile normalization (as for microarray (MA) data, see: Bolstad et al., 2003)

See: Bullard et al. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics (2010)

24 / 40


RPKM

reads per kilobase of exon per million mapped sequence reads

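$$\mathrm{rpkm}(n_i, l_i, N) = 10^9 \cdot \frac{n_i}{l_i \, N}$$

• $n_i$: number of mapped reads
• $l_i$: length of gene sequence, measured in bases ($b$)
• $N$: total number of mapped reads
• $b$: genomic base position (as a unit; this is not really a unit!)

The factor $10^9$ combines "per kilobase" ($10^3$) and "per million reads" ($10^6$).

A very simple gene-specific scaling factor. Some publications also took log values of similarly scaled data.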
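A worked example of the standard RPKM definition (reads per kilobase per million mapped reads). The counts and lengths below are made-up numbers for illustration.

```python
def rpkm(n_i, l_i, N):
    """Reads per kilobase of gene length per million mapped reads.

    n_i: reads mapped to gene i, l_i: gene length in bases,
    N: total number of mapped reads in the library.
    """
    return 1e9 * n_i / (l_i * N)

# Same read count, same library, but gene B is twice as long as gene A.
rpkm_gene_a = rpkm(n_i=500, l_i=2000, N=10_000_000)
rpkm_gene_b = rpkm(n_i=500, l_i=4000, N=10_000_000)
```

Gene A gets an RPKM of 25 and gene B 12.5: with equal read counts, the longer gene is assigned the lower expression value.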

25 / 40


Quantile normalization

• no assumption about original data distribution
• results in the data being all samples from the same distribution
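A minimal pure-Python sketch of quantile normalization: each sample's values are replaced, rank by rank, with the mean of the values at that rank across all samples. The function name and toy data are assumptions, and ties are resolved by input order rather than averaged, which is a simplification.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of samples (equally long lists, same gene order)."""
    n_genes = len(samples[0])
    sorted_samples = [sorted(sample) for sample in samples]
    # reference distribution: mean of the k-th smallest values across samples
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples) for k in range(n_genes)
    ]
    normalized = []
    for sample in samples:
        order = sorted(range(n_genes), key=lambda g: sample[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]  # gene keeps its rank, gets the reference value
        normalized.append(out)
    return normalized

example_samples = [
    [5, 2, 3, 4],
    [40, 10, 30, 20],
    [3, 4, 6, 8],
]
normalized_samples = quantile_normalize(example_samples)
```

After normalization every sample contains exactly the same set of values, so only the ranking of genes within a sample is retained.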

26 / 40


Method comparison by Bullard et al.

• gold standard: qRT-PCR data (MAQC project)
• divide the gold standard data into DE, non-DE, and no-call
• try to reproduce the classification based on RNA-seq data (+ MA data) using statistical tests

Results:

• housekeeping, upper quartile, and quantile normalization preferable over RPKM and total counts
• the choice of the normalization method had a larger effect on the performance than the choice of the statistical model (test)

27 / 40


What is significant here?

• Is a gene significantly differentially expressed under two conditions (DE analysis)?

Remarks:

• We need a model for the data; more precisely, we need a model that can explain the variance in the data.
• Significance: the probability of rejecting the null hypothesis in a statistical test just by chance, while there is in reality no effect, is low.
• We are dealing with count data, thus we are dealing with discrete statistics (for now).

28 / 40


Bernoulli trials

A trial with two outcomes: success and failure

• $p$: probability of success
• probability of failure: $p' = 1 - p$

29 / 40


Binomial distribution

The number of successes in a series of n iid Bernoulli trials follows a binomial distribution:
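For reference, the standard probability mass function for $k$ successes in $n$ trials with success probability $p$ (the formula itself is not preserved in this transcript) is:

$$\Pr(X = k) = \binom{n}{k} \, p^k (1-p)^{n-k}, \quad k = 0, 1, \dots, n$$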

[Figure: probability mass function]

30 / 40


Poisson distribution

The distribution of rare events.

• we know a rate but not the probability of a success in a series of independent Bernoulli trials occurring in space/time
• the number of trials is large while each individual trial has a low probability of success
• e.g.: number of phone calls received in a call center per hour
• number of defective devices on an assembly line per day

$$f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$$

31 / 40


Remarks

• It is tempting to use a Poisson model and set λ as the (pseudo-)coverage.
• $E(f; \lambda) = \sigma^2 = \lambda$

Preconditions:

• Independence: trials do not depend on previous events
• Lack of clustering: the probability of two simultaneous events is low
• Rate is constant over space/time

32 / 40


Poisson distribution ...

• is not a suitable model for RNA-seq data in general
• has been found sufficient for technical variation in RNA-seq data
• biological variance ≥ technical variance
• keep in mind: we do not know the length of the transcriptome → bp is not a unit!
• two-stage random process (a bit sloppy!): C(x) = Sequencing(Transcription(x))
• → overdispersion
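The two-stage argument can be illustrated with a small simulation. As an assumption for this sketch, the "Transcription" stage draws the true expression level from a gamma distribution (mean 20, variance 100) and the "Sequencing" stage draws a Poisson count around that level; technical replicates keep the level fixed at 20. The Poisson sampler is Knuth's multiplication method, written out because Python's standard library has none.

```python
import random
from math import exp
from statistics import mean, variance

def poisson_sample(rng, lam):
    """Draw one Poisson(lam) variate (Knuth's method, fine for moderate lam)."""
    threshold = exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

rng = random.Random(1)
n_replicates = 5000

# Technical replicates: same expression level, only sequencing noise.
technical = [poisson_sample(rng, 20.0) for _ in range(n_replicates)]

# Biological replicates: expression level itself varies, then sequencing noise.
biological = [
    poisson_sample(rng, rng.gammavariate(4.0, 5.0)) for _ in range(n_replicates)
]

technical_ratio = variance(technical) / mean(technical)
biological_ratio = variance(biological) / mean(biological)
```

The variance-to-mean ratio stays close to 1 for the technical replicates, as a Poisson model predicts, but is several times larger for the biological replicates. A gamma-Poisson mixture of this kind is a negative binomial distribution.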

33 / 40


Negative binomial distribution

• $k$: number of failures before $r$ successes, each with probability $p$, occur
• used for overdispersion problems with a dispersion parameter

$X \sim \mathrm{NB}(r, p)$, alternatively written as $\mathrm{NB}(\mu, \sigma^2)$

$$f(k) \equiv \Pr(X = k) = \binom{k+r-1}{r-1} \, p^r (1-p)^k \quad \text{for } k = 0, 1, 2, \dots$$
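A short numerical check of the negative binomial, using the convention of the first bullet ($k$ failures before the $r$-th success, success probability $p$), which is also the parameterization of R's `dnbinom(k, size = r, prob = p)`. Under this convention the mean is $\mu = r(1-p)/p$ and the variance is $\sigma^2 = r(1-p)/p^2$. The function name and parameter values are illustrative.

```python
from math import comb

def nb_pmf(k, r, p):
    """Pr(X = k): k failures before the r-th success, success probability p."""
    return comb(k + r - 1, r - 1) * p**r * (1 - p)**k

r, p = 10, 0.5
support = range(400)  # the tail beyond this is negligible for these parameters
total = sum(nb_pmf(k, r, p) for k in support)
nb_mean = sum(k * nb_pmf(k, r, p) for k in support)
nb_variance = sum((k - nb_mean) ** 2 * nb_pmf(k, r, p) for k in support)
```

For r = 10 and p = 0.5 this gives a mean of 10 and a variance of 20, so the variance exceeds the mean, unlike in the Poisson case.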

34 / 40


NB

[Figure: probability mass function dnbinom(k, r, p) plotted against k (0 to 40) for r=2, p=0.5; r=5, p=0.5; r=10, p=0.5; r=20, p=0.5; r=10, p=0.3.]

35 / 40


How do we assess significance?

• We have discrete probabilities
• We can enumerate all possibilities
• principle: enumerate all outcomes which are equally or more extreme than the given one

36 / 40


Example: Fisher’s exact test

• Exact test for contingency tables with small sample sizes
• the probability of a single table follows the hypergeometric distribution
• for large samples: approximation by the chi-square test
• the dieting example
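The enumeration principle can be sketched for a 2x2 table as follows. The table counts are illustrative (a small dieting-style table of the kind often used to introduce the test) and are not taken from the lecture. The two-sided p-value sums the hypergeometric probabilities of all tables with the same margins that are at most as probable as the observed one.

```python
from math import comb

def table_probability(a, b, c, d):
    """Hypergeometric probability of the 2x2 table [[a, b], [c, d]] given its margins."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

def fisher_exact_two_sided(a, b, c, d):
    row1, row2, col1 = a + b, c + d, a + c
    p_observed = table_probability(a, b, c, d)
    p_value = 0.0
    # x runs over all values the top-left cell can take with fixed margins
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, row2 - (col1 - x))
        if p <= p_observed * (1 + 1e-9):  # small tolerance for float ties
            p_value += p
    return p_value

# Illustrative table: rows = group 1 / group 2, columns = dieting / not dieting
p_single_table = table_probability(1, 9, 11, 3)
p_two_sided = fisher_exact_two_sided(1, 9, 11, 3)
```

For this table the observed configuration has probability of about 0.0013 and the two-sided p-value is about 0.0028.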

37 / 40


edgeR and DESeq

Model: $K_{ij} \sim \mathrm{NB}(\mu_{ij}, \sigma^2_{ij})$ for gene $i$ and sample $j$ in a replicated experiment*

• mean $\mu$ and variance $\sigma^2$ must be estimated from the replicates*
• edgeR: $\sigma^2 = \mu + \alpha \mu^2$
• DESeq: $\mu_{ij} = q_{i,\rho(j)} \, s_j$
• DESeq: $\sigma^2_{ij} = \mu_{ij} + s_j^2 \, v_{i,\rho(j)}$

Both packages use different approaches for parameter fitting and testing.

*DESeq also works with no or few replicates, but with reduced power.

38 / 40


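Testing in DESeq similar to Fisher's test

As test statistic: the total counts in two conditions, $K_{iA}$ and $K_{iB}$, with $k_{iS} = K_{iA} + K_{iB}$.

Now we need a p-value:

$$p_i = \frac{\displaystyle\sum_{\substack{a+b=k_{iS} \\ p(a,b) \le p(K_{iA}, K_{iB})}} p(a,b)}{\displaystyle\sum_{a+b=k_{iS}} p(a,b)}$$

Now the model comes in: $p(a,b) = \Pr(K_{iA} = a)\,\Pr(K_{iB} = b)$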
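The p-value formula above can be sketched in Python as follows. This is not DESeq's code: the negative binomial parameters are simply passed in as assumed values, whereas DESeq estimates them from the data. The conversion from mean and variance to the (r, p) form uses $p = \mu/\sigma^2$ and $r = \mu^2/(\sigma^2 - \mu)$, which requires $\sigma^2 > \mu$.

```python
from math import exp, lgamma, log

def nb_pmf_mean_var(k, mu, var):
    """Negative binomial pmf parameterized by mean and variance (var > mu)."""
    p = mu / var
    r = mu * mu / (var - mu)
    log_pmf = (lgamma(k + r) - lgamma(r) - lgamma(k + 1)
               + r * log(p) + k * log(1 - p))
    return exp(log_pmf)

def exact_test_p_value(k_a, k_b, mu_a, var_a, mu_b, var_b):
    """Sum p(a, b) over all splits a + b = k_a + k_b at most as likely as observed."""
    total = k_a + k_b
    p_observed = nb_pmf_mean_var(k_a, mu_a, var_a) * nb_pmf_mean_var(k_b, mu_b, var_b)
    numerator = denominator = 0.0
    for a in range(total + 1):
        p_ab = nb_pmf_mean_var(a, mu_a, var_a) * nb_pmf_mean_var(total - a, mu_b, var_b)
        denominator += p_ab
        if p_ab <= p_observed * (1 + 1e-9):  # small tolerance for float ties
            numerator += p_ab
    return numerator / denominator

# Assumed null model: both conditions have mean 50 and variance 50 + 0.1 * 50**2
mu, var = 50.0, 300.0
p_balanced = exact_test_p_value(50, 50, mu, var, mu, var)
p_skewed = exact_test_p_value(10, 90, mu, var, mu, var)
```

With identical parameters in both conditions, the balanced split (50, 50) is the most likely outcome and yields a p-value of 1, while the skewed split (10, 90) yields a very small p-value.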

39 / 40


Summary and outlook (my 50 cents)

• RNA-seq is as promising as it is complex
• read mapping and binning are working fine, though parameters need to be explored
• normalization and statistical models and tests need more work
• more aggressive normalization should be explored
• many methods are a bit ad hoc or use arbitrary thresholds
• no framework for within-sample significance testing

40 / 40
