Poisson Distribution in Genome Assembly

Poisson Example: Genome Assembly• Goal: figure out the sequence of DNA nucleotides (ACTG) along the entire genome

• Problem: Sequencers generate random short reads

• Solution: assemble genome from short reads using computers. Whole Genome Shotgun Assembly.

MinION, a palm‐sized gene sequencer made by UK‐based Oxford Nanopore Technologies

Short Reads assemble into Contigs

Promise of Genomics

I think I found the corner piece!

How many short reads do we need?

Where is the Poisson?• G ‐ genome length (in bp) • L ‐ short read average length • N – number of short read sequenced • λ – sequencing redundancy = LN/G• x‐ number of short reads covering a given site on the genome

Ewens, Grant, Chapter 5.1

Poisson as a limit of Binomial. For a given site on the genome for each short read Prob(site covered): p=L/G is very small.Number of attempts (short reads): N is very large. Their product (sequencing redundancy): λ = NL/G is O(1).

What fraction of genome is covered?• Coverage: λ=NL/G, X – r.v. equal to the number of times a given site is coveredPoisson: P(X=x)= λx*exp(‐ λ) /x!P(X=0)=exp(‐ λ), P(X>0)=1‐ exp(‐ λ)

• Total length covered: G*[1‐ exp(‐ λ)]λ

How many contigs?• Probability that a given short read is the right end of a contig = no left ends of other reads fall within it.

• Left ends of each of N‐1 N other reads has Prob: p=(L‐1)/G L/G to fall within given read.Expected number of short reads extending a given one is N*p=N*L/G= λProbability that none will extend it = exp(‐ λ):

• Number of contigs: Ncontigs=Ne‐ λ:

Average length of a contig?• Length of a genome covered:Gcovered=G∙ P(X>0)=G ∙ (1‐ exp(‐ λ))

• Number of contigs Ncontigs=N ∙ e‐ λ

• Average length of a contig =<L>=Σi Li /Ncontigs=Gcovered/Ncontigs=

G ∙ (1‐ exp(‐ λ))/ N ∙ e‐ λ=L ∙ (1‐ exp(‐ λ))/ λ ∙ e‐ λ

Estimate

• Human genome is 3x109 bp long• Chromosome 1 spans about 250x106 bp• Illumina generates short reads 100 bp long• How many short reads are needed to completely assemble the 1st chromosome?

The answer is N=44x106 short (100bp) reads

or λ =17.6 fold redundant coverage.At 0.1$/Mb that means that the reads for de novo full assembly of

human genome would cost(3 x109 x 17.6 /106) x 0.1$

=$5300 /genomeIn reality is cheaper as we don’t

need de novo assembly

What spoils these estimates?

• Try searching human genome for TTAGGGTTAGGGTTAGGG (18 bases) and you will find 100s of matches

• How many one expects by pure chance? 2 ∙ 3x109 ∙ 4‐18=0.08 << 1 match

• Genome has lots of repeats. DNA repeats make assembly difficult

• Side remark. If it was not for repeats 17 bases would be enough to specify unique position in the human genome. log(2 ∙ 3x109)/log(4)=16.2412

• That is why microRNAs recognizing ~22 bases don’t have a lot of off‐target cleavage

Stuttering human genome

Repeats are like sky puzzle pieces

How many repeats are in eukaryotic genomes?

The Little Prince, Antoine de Saint‐Exupéry, 1943

Repetitive DNA and next-generation sequencing: computational challenges and solutionsTodd J. Treangen & Steven L. SalzbergNature Reviews Genetics 13, 36-46(January 2012)doi:10.1038/nrg3117

De Bruijn graph

How to assemble genome with repeats?

L=1000

L=5000

• Answer:longer reads

• Problem:cheap sequencing

=short reads

Credit: XKCD comics

Poisson Distribution in Genome Assembly · Where is the Poisson? • G ‐genome length (in bp) •...

Transcript of Poisson Distribution in Genome Assembly · Where is the Poisson? • G ‐genome length (in bp) •...

Poisson Distribution in Genome Assembly · Where is the Poisson? • G ‐genome length (in bp) •...

Documents

Transcript of Poisson Distribution in Genome Assembly · Where is the Poisson? • G ‐genome length (in bp) •...

Full-length cDNA sequencing for genome annotation …...Full-Length cDNA Sequencing for Genome Annotation and Analysis of Alternative Splicing Tyson A. Clark1, Bo Wang2, Ting Hon1,

05 Poisson

Poisson Processes

5.1 The Poisson Distribution and the Poisson Process...5.1 The Poisson Distribution and the Poisson Process Poisson behavior is so pervasive in natural phenomena and the Poisson distribution

Poisson Distribution. Poisson Poisson 1781-1840.

Palindromes in Viral Genomes - UTEP MATHEMATICS · 2007. 2. 15. · Length 22 Palindrome • Approximate the palindrome distribution by a Poisson process with rate • Here n = genome

distribusi poisson

Poisson Statistics - MITweb.mit.edu/jgross/Public/mit-classes/8.13/poisson... · 10/9/2011 · Poisson statistics are everywhere. Examples of Poisson-distributed measurements include

Wolpert formulae for the Poisson bracket of Hitchin Length ... · Wolpert formulae for the Poisson bracket of Hitchin Length Functions November 17, 2016 Abstract In this paper, we

Poisson limit

Poisson Distributions

DIRECTINVERSIONOFYOUNG'SMODULUSAND POISSON ...

Coding schemes for run-length infomation based on poisson ... · A Poisson distribution can be filtered or truncated at either end of the probability frequency spectrum, if runs below

Poisson fumé

Estimating telomere length from whole genome sequence data · Estimating telomere length from whole genome sequence data Zhihao Ding1, Massimo Mangino2, Abraham Aviv3, UK10K Consortium,

Poisson Models Ebook.spatstat.org/sample-chapters/chapter09.pdf · 2019-06-26 · E Poisson Models 301 length−2 and determines the average number of points per unit area. Scaling

Genome-wide Association Analysis in Humans Links ... et al., GWA... · ARTICLE Genome-wide Association Analysis in Humans Links Nucleotide Metabolism to Leukocyte Telomere Length

Genome-Wide Association Analysis in Humans …...ARTICLE Genome-Wide Association Analysis in Humans Links Nucleotide Metabolism to Leukocyte Telomere Length Chen Li,1 ,3 85Svetlana

Lecture 16 MCMC for Poisson response models. Lecture Contents Multilevel Poisson Model MCMC Algorithms for Poisson Models MLwiN for Poisson models WinBUGS.

Genome Sequencing and Structural Variation · 2012-12-27 · WGS & SVs Peter N. Robinson Structural variants Array CGH SV Discovery Poisson GC Content Read depth CNV Calling Genome