Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in...
Transcript of Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in...
![Page 1: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/1.jpg)
Genome Assembly Using de Bruijn Graphs
Biostatistics 666
![Page 2: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/2.jpg)
Previously: Reference Based Analyses
โข Individual short reads are aligned to reference
โข Genotypes generated by examining reads
overlapping each position
โข Works very well for SNPs and relatively well for other types of variant
![Page 3: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/3.jpg)
Shotgun Sequence Reads
โข Typical short read might be <25-100 bp long and not very informative on its own
โข Reads must be arranged (aligned) relative to each
other to reconstruct longer sequences
![Page 4: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/4.jpg)
Read Alignment
โข The first step in analysis of human short read data is to align each read to genome, typically using a hash table based indexing procedure
โข This process now takes no more than a few hours per million reads โฆ
โข Analyzing these data without a reference human genome would require much longer reads or result in very fragmented assemblies
5โ-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3โ
Reference Genome (3,000,000,000 bp)
GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA Short Read (30-100 bp)
![Page 5: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/5.jpg)
Mapping Quality โข Measures the confidence in an alignment, which
depends on: โ Size and repeat structure of the genome โ Sequence content and quality of the read โ Number of alternate alignments with few mismatches
โข The mapping quality is usually also measured on a
โPhredโ scale
โข Idea introduced by Li, Ruan and Durbin (2008) Genome Research 18:1851-1858
![Page 6: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/6.jpg)
Shotgun Sequence Data
5โ-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3โ Reference Genome
GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA
AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTG ATGCTAGCTGATAGCTAGCTAGCTGATGAGCC
ATAGCTAGATAGCTGATGAGCCCGATCGCTGCTAGCTC TAGCTGATAGCTAGATAGCTGATGAGCCCGAT
Sequence Reads
Reads overlapping a position of interest are used to calculate genotype likelihoods and Interpreted using population information.
![Page 7: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/7.jpg)
Limitations of Reference Based Analyses
โข For some species, no suitable reference genome available
โข The reference genome may be incomplete, particularly near centromeres and telomeres
โข Alignment is difficult in highly variable regions
โข Alignment and analysis methods need to be customized for each type of variant
![Page 8: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/8.jpg)
Assembly Based Analyses
โข Assembly based approaches to study genetic variation โ Implementation, challenges and examples
โข Approaches that naturally extend to multiple variant types
![Page 9: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/9.jpg)
De Bruijn Graphs
โข A representation of available sequence data
โข Each k-mer (or short word) is a node in the graph
โข Words linked together when they occur consecutively
Short Sequence
De Bruijn Graph Representation
Iqbal (2012)
![Page 10: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/10.jpg)
Effective Read Depth
โข Overlaps must exceed k-mer length to register in a de Bruijn graph
โข This requirement effectively reduces coverage
โข Give read read length L, word length k, and expected depth D โฆ
๐ท๐๐๐๐๐๐๐๐๐ = ๐ท๐ฟ โ ๐ + 1
๐ฟ
![Page 11: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/11.jpg)
Cleaning
โข De Bruijn graphs are typically โcleanedโ before analysis
โข Cleaning involves removing portions of the graph that have very low coverage
โข For example, most paths with depth = 1 and even with depth <= 2 throughout are likely to be errors
![Page 12: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/12.jpg)
Variation in a de Bruijn Graph
โข Variation in sequence produces a bubble in a de Bruijn graph
โข Do all bubbles represent true variation? What are other alternative explanations?
Iqbal (2012)
![Page 13: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/13.jpg)
Effective Read Depth - Consequences
โข Consider a simple example where L = 100
โข With k = 21 โฆ โ Each read includes 80 words โ Each SNP generates a bubble of length 22 โ A single read may enable SNP discovery
โข With k = 75 โฆ
โ Each read includes 26 words โ Each SNP generates a bubble of length 76 โ Multiple overlapping reads required to discover SNP
![Page 14: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/14.jpg)
Properties of de Bruijn Graphs
โข Many useful properties of genome assemblies (including de Bruijn graphs) can be studied using results of Lander and Waterman (1988)
โข Described number of assembled contigs and their lengths as a function of genome size, length of fragments, and required overlap
![Page 15: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/15.jpg)
Lander and Waterman (1988) Notation
โข The genome size G
โข The number of fragments in assembly N
โข The length of sequenced fragments L โ The fractional overlap required for assembly ัฒ
โข The depth of coverage c = LN/G
โข Probability a clone starts at a position ฮฑ = N/G
![Page 16: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/16.jpg)
Number of Contigs Ne-c(1-ัฒ)
โข Consider the probability that a fragment starts is not linked to another before ending ๐ผ 1 โ ๐ผ ๐ฟ 1โ๐ = ๐ผ 1 โ๐/๐บ
๐บ๐๐ 1โ๐ = ๐ผ๐โ๐ 1โ๐
โข Then, the expected number of fragments that are
not linked to another is ๐บ๐ผ๐โ๐ 1โ๐ = ๐๐โ๐ 1โ๐
โข This is also the number of contigs!
![Page 17: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/17.jpg)
Number of Contigs
Genome Coverage (or depth) c
Num
ber o
f Con
tigs (
G/L
uni
ts)
Number of contigs peaks when depth
c = (1 โ ฮฑ)-1
![Page 18: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/18.jpg)
Contig Lengths โข Probability a fragment ends the contig:
๐โ๐ 1โ๐
โข Probability of contig with exactly j fragments: 1 โ ๐โ๐ 1โ๐ ๐โ1
๐โ๐ 1โ๐
โข The number of contigs with j fragments is: ๐๐โ2๐ 1โ๐ 1โ ๐โ๐ 1โ๐ ๐โ1
โข How many contigs will have 2+ fragments?
![Page 19: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/19.jpg)
Contig Lengths (in bases) โข The expected contig length, in fragments, is
๐ธ ๐ฝ = ๐๐ 1โ๐ โข Each fragment contributes X bases โฆ
๐ ๐ = ๐ = 1 โ ๐ผ ๐โ1๐ผ for 0 < ๐ โค ๐ฟ(1 โ ๐) ๐ ๐ = ๐ฟ = 1 โ ๐ผ ๐ฟ(1โ๐)
โข After some algebra:
๐ธ ๐ = ๐ฟ[1โ๐โ๐ 1โ๐
๐โ ๐๐โ๐ 1โ๐ ]
โข The expected contig length in bases is E(X) E(J)
๐ฟ[๐๐ 1โ๐ โ 1
๐โ ๐]
![Page 20: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/20.jpg)
Contig Lengths
Genome Coverage (or depth) c
Cont
ig L
engt
h (in
uni
ts o
f L b
ase
pairs
)
Lander and Waterman
also studied gap lengths
![Page 21: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/21.jpg)
Enhanced De Bruijn Graphs
โข Usefulness of a de Bruijn graph increases if we annotate each note with useful information
โข Basic information might include the number of times each word was observed
โข More detailed information might include the specific individuals in which the word was present
![Page 22: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/22.jpg)
Variant Analysis Algorithm 1: โBubble Callingโ
โข Create a de Bruijn graph of reference genome โ Bubbles in this graph are paralogous sequences
โข Using a different label, assemble sample of interest
โข Systematically search for bubbles
โ Nodes where two divergent paths eventually connect
Iqbal (2012)
![Page 23: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/23.jpg)
Word size k and Accessible Genome
Iqbal (2012)
![Page 24: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/24.jpg)
Power of Homozygous Variant Discovery (100-bp reads, no errors)
Iqbal (2012)
![Page 25: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/25.jpg)
Power of Heterozygous Variant Discovery (100-bp reads, no errors)
Iqbal (2012)
![Page 26: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/26.jpg)
Power of Homozygous Variant Discovery (Simulated 30x genomes, 100-bp reads)
Iqbal (2012)
![Page 27: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/27.jpg)
Power of Heterozygous Variant Discovery (Simulated 30x genomes, 100-bp reads)
Dotted lines (โฆ) refer to theoretical expectations. Solid lines (---) refer to simulation results. Iqbal (2012)
![Page 28: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/28.jpg)
Variant Analysis Algorithm 2: Path Divergence
โข Bubble calling requires accurately both alleles โ Power depends on word length k, allele length,
genome complexity and error model โ Low power for the largest events
โข Path divergence searches for regions where a
sample path differs from the reference
โข Especially increases power for deletions โ Deletion often easier to assemble than reference
![Page 29: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/29.jpg)
Path Divergence Example
Black line represents assembly of sample. We can infer a variant between positions a and b,
because the path between them differs from reference.
![Page 30: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/30.jpg)
Variant Analysis Algorithm 3: Multi-Sample Analysis
โข Improves upon simple bubble calling by tracking which paths occur on each sample
โข Improved ability to distinguish true variation from paralogous sequence and errors
Iqbal (2012)
![Page 31: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/31.jpg)
Classifying Sites
โข Evaluate ratio of coverage along the two branches of each bubble and in each individual
โข If the ratio is uniform across individuals โฆ โ Error: Ratio consistently low for one branch โ Repeat: Ratio constant across individuals
โข If the ratio varies across individuals โฆ
โ Variant: Ratio clusters around 0, ยฝ and 1 with probability of these outcomes depending on HWE
![Page 32: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/32.jpg)
Variant Analysis Algorithm 4: Genotyping
โข Calculate probability that a certain number of k-mers cover each path
โข To improve accuracy, short duplicate regions within a path can be ignored.
โข Allows likelihood calculation for use in imputation algorithms
Iqbal (2012)
![Page 33: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/33.jpg)
Example Application to High Coverage Genome
โข 26x, 100-bp reads, k = 55
โข 2,777,252,792 unique k-mers โ 2,691,115,653 also in reference โ 23% more k-mers before cleaning
โข 2,686,963 bubbles found by Bubble Caller
โ 5.6% of these also present in reference
โข 528,651 divergent paths โ 39.8% of these also present in reference
โข 2,245,279 SNPs, 361,531 short indels, 1,100 large or complex events
โ Reproduces 67% of heterozygotes from mapping (87% of homozygotes)
![Page 34: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/34.jpg)
Comparison to Mapping Based Algorithms
![Page 35: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/35.jpg)
Summary
โข Assembly based algorithms currently reach about 80% of the genome
โข These algorithms can handle different variant types more conveniently than mapping based approaches
โข Incorporating population information allows repeats to be distinguished from true variation
![Page 36: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐. ๐1โ๐ โข Each fragment contributes X bases](https://reader033.fdocuments.us/reader033/viewer/2022043006/5f902c667703620135533450/html5/thumbnails/36.jpg)
Recommended Reading
โข Iqbal, Caccamo, Turner, Flicek and McVean (2012) Nature Genetics 44:226-232
โข Lander and Waterman (1988) Genomics 2:231-239