An introduction to RNA-seq - Universitetet i Bergeninge/inf389-2010/MD_RNASeqIntro.pdf · 2010. 11....

43
Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis Deep sequencing of transcriptomes An introduction to RNA-seq Michael Dondrup UNI BCCS 2. november 2010 1 / 40

Transcript of An introduction to RNA-seq - Universitetet i Bergeninge/inf389-2010/MD_RNASeqIntro.pdf · 2010. 11....

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Deep sequencing of transcriptomesAn introduction to RNA-seq

    Michael Dondrup

    UNI BCCS

    2. november 2010

    1 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Transcriptomics by Ultra-Fast Sequencing

    Microarrays have been the primary transcriptomicshigh-throughput tool for almost a decade.• New approach: sequence the transcriptome• Millions of short read fragments (15 – 400 bp) from NGS

    machinesRemarks:• SAGE/CAGE and other tag based methods not suitable for

    bacteria• Reverse transcription: RNA −→ cDNA• Most papers use a reference genome• Count read fragments per bin (CDS, ORF, exon, intergenic

    region, window of size n)

    2 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Outline

    Introduction

    Lab procedures

    Bioinformatics analysisWorkflowApplication examples

    Statistics of DE analysisNormalizationTrials and distributionsStatistical testing

    3 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    RNA-seq, a revolutionary tool?

    4 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    RNA-seq, a revolutionary tool

    4 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    RNA-seq, a revolutionary tool!

    4 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Applications

    • Gene structure (e.g. Exons/Introns)• Finding novel transcripts:

    • non-annotated (pseudo) Genes• non-coding RNA (ncRNA, sRNA)• antisense RNA

    • Transcription Start Sites (TSS)• Operon structure• De-novo transcriptome assembly (given no reference exists)• Metagenomics approach (sampling of bact. communities by

    RNA-seq)• (semi-) quantitative approach: differential expression (DE)

    5 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Outline

    Introduction

    Lab procedures

    Bioinformatics analysisWorkflowApplication examples

    Statistics of DE analysisNormalizationTrials and distributionsStatistical testing

    6 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Overview

    cDNA

    small RNA mRNA

    total RNA

    sample

    ACTGATGTGAT

    ACTTTCCCGTGATAATCCGCTTATGTGATACTGGTCCAAAAATGAT

    RNA extractionpurification

    reverse transcription

    fragmentation

    depletion of rRNA,amplification of mRNA

    high-throughputsequencing

    short sequencereads

    Remarks:

    • DNase I treatment• prokaryotes don’t have

    polyA• rRNA depletion: ≈90%

    removal• rRNA/RNA before: 97-99,

    after 90%• induces bias• no depletion e.g. for

    meta-RNA-seq• directional by: ss-cDNA or

    adapter ligation7 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    cDNA Sequencing

    • Illumina/Solexa• ABI/Solid• Roche/454• direct RNA sequencing is under development

    High-coverage (Illumina, Solid) preferred over read-length (454),if no transcriptome assembly required

    8 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Outline

    Introduction

    Lab procedures

    Bioinformatics analysisWorkflowApplication examples

    Statistics of DE analysisNormalizationTrials and distributionsStatistical testing

    9 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Overview

    Quality control & filtering

    Aligning reads to reference

    Alignment statistics & filtering

    reference sequence Transcriptome assembly

    Binning (genes, intergenic regions, etc)

    genome annotation

    Compute coverage

    Sequence data, (FASTA, FASTQ)

    DE analysis Transcription start sites/Operons

    Search for novel transcripts

    Transcript variants analysis (Splicing, Intron/

    Exon, etc)

    Visualization

    10 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Filtering of reads

    • by base-qualities• lenght• Duplicate reads (identical sequence)

    • removal• condense into single read• probabilistic (have not seen this applied)• duplicates are suspected to be artifacts

    11 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    AlignmentsChallenges are the same as with all NGS data:• millions/billions of reads• rather short reads• sequencing errors• sequencing bias

    (Short)-read alignment programs used in RNA-seq:• BWA• Bowtie• Shrimp• SOAP• Eland• blat• blast (not so good!)• ...

    12 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Challenges begin after the alignmentMapping fragments to the genome:

    13 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Filtering possibilities

    • unique alignments• sequence identity• alignment quality score• alignment length• proportion of read length aligned

    14 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Pseudo coverage

    • count of alignments spanning a genomic position• can be computed with 1bp resolution or for larger windows• used in visualization• pseudo: because we do not know the size of the

    transcriptome

    15 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Examples

    16 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Examples

    16 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Interval binning

    • DE analysis requires binning• typical bins: exons, CDS, transcripts, introns, genes• reads and genes are represented as genomic intervals

    [start, end]• fast interval overlap algorithms:

    • Sorting based methods• Interval tree based methods (as in the IRanges Bioconducto

    package)• Nested containment lists (Alekseyenko & Lee, Bioinformatics

    2007)

    • Result: a read count ni ∈ N0 reads bin i

    17 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Search for ncRNAs

    • define an arbitrary coverage threshold c0• search for continuos intergenic regions (seed) c > c0• possibly extend over small gaps• with "intergenic": regions that do not overlap a CDS (why?)• extract sequence and search databases

    18 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Search for ncRNAs

    19 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    TSS and operon detection

    • try to find out where coverage changes quickly (a TSScandidate)

    • compute first order differences d(ci) for each genomicposition i (aka differentiation) using a sliding window of widthw

    • look for maxima/minima upstream of a gene (max for +, minfor - strand)

    20 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    TSS and operon detection

    21 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Outline

    Introduction

    Lab procedures

    Bioinformatics analysisWorkflowApplication examples

    Statistics of DE analysisNormalizationTrials and distributionsStatistical testing

    22 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Why data normalization for RNA-seq data?

    Account for systematic technical errors/bias:• different library sizes: different numbers of reads• different gene lengths• some sequencing methods seem to prefer longer transcripts

    even more• GC content might have an influence too• limited read capacity: highly-expressed genes stealreads

    This is not trivial (e.g. common 5’ preference). Differentiatebetween biological effects and technical effectsBtw.: isn’t this a déjà-vu?

    23 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Normalization methods

    • RPKM (Mortazavi et al., 2008)• house-keeping or constant reference gene (e.g. POLR2A)• upper quartil (75% pecentile) normalization• quantile normalization (as for MA data, see: Bolstad et al.,

    2003)

    See: Bullard et al. Evaluation of statistical methods fornormalization and differential expression in mRNA-Seqexperiments. BMC bioinformatics (2010)

    24 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    RPKM

    reads per kilobase of exon per million mapped sequence reads

    rpkm(ni , li ,N) =niliN

    [1

    1010b]

    ni : number of mapped readsli : length of gene sequenceN: total number of mapped readsb: genomic base position (as a unit, this is not really a unit!)A very simple gene specific scaling factor. Some publicationsalso took log values of similarly scaled data.

    25 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Quantile normalization

    • no assumption about original data distribution• results in the data being all samples from the same

    distribution

    26 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Method comparison by Bullard et al.

    • gold standard: qRT-PCR data (MAQC project)• divide the gold standard data in DE, non-DE, no-call• try to reproduce classification based on RNA-seq data (+MA

    data) using statistical tests

    Results:• house-keeping, upper quartile, and quantile preferable over

    RPKM and total counts• the choice of the normalization method had a larger effect on

    the performance than the choice of the statistical model(test)

    27 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    What is significant here?

    • Is a gene significantly differentially expressed under twoconditions (DE analysis)?

    Remarks:

    • We need a model for the data, more precisely we need amodel that can explain the variance in the data.

    • Significance: the probability of rejecting the null-hypothesisin a statistical test setting just by chance, while there is – inreality – no effect, is low

    • are dealing with count data, thus we are dealing withdiscrete statistics (for now).

    28 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Bernoulli trials

    A trial with two outcomes: success and failurep: probability of successprobability of failure: p′ = 1−p

    29 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Binomial distributionThe number of successes in a series of n iid Benoulli trials followa binomial distribution:

    probability mass function30 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Poisson distributionThe distribution of rare events.• we know a rate but not the probability of a success in a

    series of independent Bernoulli trials occurring in space/time• the number of trials is large while each individual trial has

    low probability of success• e.g.: number of phone calls received in a call center per hour• number of defective devices on an assembly line per day

    f (k ;λ ) = λk e−λk! ,

    31 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Remarks

    • It is tempting to use a Poisson model and set λ as the(pseudo-) coverage.

    • E(f ;λ ) = σ2 = λ

    Preconditions:• Independence: trials do not depend on previous events• Lack of clustering, prob. of two simultaneous events is low• Rate is constant over space/time

    32 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Poisson distribution ...

    • is not a suitable model for RNA-seq data in general• has been found sufficient for technical variation in RNA-seq

    data• biological variance ≥ technical variance• keep in mind: we do not know the the length of the

    trancriptome ,→bp is not a unit!• two-stage random-process (bit sloppy!) :

    C(x) = Sequencing(Transcription(x))• → overdispersion

    33 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Negative binomial distribution

    • k : number of failures before r successes with prob. p occur• used for overdispersion problems with a dispersion

    parameter

    X ∼ NB(r , p)alternatively written asNB(µ, σ2)

    f (k)≡ Pr(X = k) =(k+r−1

    r−1)(1−p)r pk for k = 0,1,2, . . .

    34 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    NB

    0 10 20 30 40

    0.00

    0.05

    0.10

    0.15

    0.20

    0.25

    probability mass function

    k

    dnbi

    nom

    (k, r

    , p)

    r=2, p=0.5r=5, p=0.5r=10, p=0.5r=20, p=0.5r=10, p=0.3

    35 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    How do we assess significance?

    • We have discrete probabilities• We can enumerate all possibilities• principle: enumerate all outcomes which are

    equally or more extreme than the given one

    36 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Example: Fisher’s exact test

    • Exact test for contingency tables with small sample sizes• the probability of a single table follows the hyper-geometric

    distribution• for large samples approximation by chi-sqare-test• the dieting example

    37 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    edgeR and DEseq

    Model: Kij ≡ NB(µij , σ2if ) gene:i , sample : j in a replicatedexperiment*

    • mean:µ and variance σ2 must be estimated from thereplicates*

    • edgeR: σ2 = µ +αµ2

    • DEseq: µij = qi ,ρ(j)sj• DEseq: σ2ij = µij + s

    2j vi ,ρ(j)

    Both packages use different approaches for parameter fitting andtesting.*DEseq also works with no or few replicates, but with reduced power

    38 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Testing in DEseq similar to Fisher’s test

    As test statistic: the total counts in two conditions: KiA, KiBNow we need a p-value:

    pi =

    ∑a+b=kiS ,

    p(a,b)≤p(KiA,KiB )

    p(a,b)

    ∑a+b=kiS

    p(a,b)

    Now the model comes in: p(a,b) = Pr(KiA = a)Pr(KiB = b)

    39 / 40

  • Introduction Lab procedures Bioinformatics analysis Statistics of DE analysis

    Summary Outlook (my 50 cent)

    • RNA-seq is as promising as complex• read mapping and binning are working fine though

    parameters need to be explored• normalization and statistical models and tests need more

    work• more agressive normalization should be explored• many methods are bit ad-hoc or use arbitrary thresholds• no framework for within sample significance testing

    40 / 40

    IntroductionLab proceduresBioinformatics analysisWorkflowApplication examples

    Statistics of DE analysisNormalizationTrials and distributionsStatistical testing