Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in...

36
Genome Assembly Using de Bruijn Graphs Biostatistics 666

Transcript of Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in...

Page 1: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Genome Assembly Using de Bruijn Graphs

Biostatistics 666

Page 2: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Previously: Reference Based Analyses

โ€ข Individual short reads are aligned to reference

โ€ข Genotypes generated by examining reads

overlapping each position

โ€ข Works very well for SNPs and relatively well for other types of variant

Page 3: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Shotgun Sequence Reads

โ€ข Typical short read might be <25-100 bp long and not very informative on its own

โ€ข Reads must be arranged (aligned) relative to each

other to reconstruct longer sequences

Page 4: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Read Alignment

โ€ข The first step in analysis of human short read data is to align each read to genome, typically using a hash table based indexing procedure

โ€ข This process now takes no more than a few hours per million reads โ€ฆ

โ€ข Analyzing these data without a reference human genome would require much longer reads or result in very fragmented assemblies

5โ€™-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3โ€™

Reference Genome (3,000,000,000 bp)

GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA Short Read (30-100 bp)

Page 5: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Mapping Quality โ€ข Measures the confidence in an alignment, which

depends on: โ€“ Size and repeat structure of the genome โ€“ Sequence content and quality of the read โ€“ Number of alternate alignments with few mismatches

โ€ข The mapping quality is usually also measured on a

โ€œPhredโ€ scale

โ€ข Idea introduced by Li, Ruan and Durbin (2008) Genome Research 18:1851-1858

Page 6: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Shotgun Sequence Data

5โ€™-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3โ€™ Reference Genome

GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA

AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTG ATGCTAGCTGATAGCTAGCTAGCTGATGAGCC

ATAGCTAGATAGCTGATGAGCCCGATCGCTGCTAGCTC TAGCTGATAGCTAGATAGCTGATGAGCCCGAT

Sequence Reads

Reads overlapping a position of interest are used to calculate genotype likelihoods and Interpreted using population information.

Page 7: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Limitations of Reference Based Analyses

โ€ข For some species, no suitable reference genome available

โ€ข The reference genome may be incomplete, particularly near centromeres and telomeres

โ€ข Alignment is difficult in highly variable regions

โ€ข Alignment and analysis methods need to be customized for each type of variant

Page 8: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Assembly Based Analyses

โ€ข Assembly based approaches to study genetic variation โ€“ Implementation, challenges and examples

โ€ข Approaches that naturally extend to multiple variant types

Page 9: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

De Bruijn Graphs

โ€ข A representation of available sequence data

โ€ข Each k-mer (or short word) is a node in the graph

โ€ข Words linked together when they occur consecutively

Short Sequence

De Bruijn Graph Representation

Iqbal (2012)

Page 10: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Effective Read Depth

โ€ข Overlaps must exceed k-mer length to register in a de Bruijn graph

โ€ข This requirement effectively reduces coverage

โ€ข Give read read length L, word length k, and expected depth D โ€ฆ

๐ท๐‘’๐‘’๐‘’๐‘’๐‘’๐‘’๐‘’๐‘’๐‘’ = ๐ท๐ฟ โˆ’ ๐‘˜ + 1

๐ฟ

Page 11: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Cleaning

โ€ข De Bruijn graphs are typically โ€œcleanedโ€ before analysis

โ€ข Cleaning involves removing portions of the graph that have very low coverage

โ€ข For example, most paths with depth = 1 and even with depth <= 2 throughout are likely to be errors

Page 12: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Variation in a de Bruijn Graph

โ€ข Variation in sequence produces a bubble in a de Bruijn graph

โ€ข Do all bubbles represent true variation? What are other alternative explanations?

Iqbal (2012)

Page 13: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Effective Read Depth - Consequences

โ€ข Consider a simple example where L = 100

โ€ข With k = 21 โ€ฆ โ€“ Each read includes 80 words โ€“ Each SNP generates a bubble of length 22 โ€“ A single read may enable SNP discovery

โ€ข With k = 75 โ€ฆ

โ€“ Each read includes 26 words โ€“ Each SNP generates a bubble of length 76 โ€“ Multiple overlapping reads required to discover SNP

Page 14: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Properties of de Bruijn Graphs

โ€ข Many useful properties of genome assemblies (including de Bruijn graphs) can be studied using results of Lander and Waterman (1988)

โ€ข Described number of assembled contigs and their lengths as a function of genome size, length of fragments, and required overlap

Page 15: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Lander and Waterman (1988) Notation

โ€ข The genome size G

โ€ข The number of fragments in assembly N

โ€ข The length of sequenced fragments L โ€“ The fractional overlap required for assembly ัฒ

โ€ข The depth of coverage c = LN/G

โ€ข Probability a clone starts at a position ฮฑ = N/G

Page 16: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Number of Contigs Ne-c(1-ัฒ)

โ€ข Consider the probability that a fragment starts is not linked to another before ending ๐›ผ 1 โˆ’ ๐›ผ ๐ฟ 1โˆ’๐œƒ = ๐›ผ 1 โˆ’๐‘/๐บ

๐บ๐‘’๐‘ 1โˆ’๐œƒ = ๐›ผ๐‘’โˆ’๐‘’ 1โˆ’๐œƒ

โ€ข Then, the expected number of fragments that are

not linked to another is ๐บ๐›ผ๐‘’โˆ’๐‘’ 1โˆ’๐œƒ = ๐‘๐‘’โˆ’๐‘’ 1โˆ’๐œƒ

โ€ข This is also the number of contigs!

Page 17: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Number of Contigs

Genome Coverage (or depth) c

Num

ber o

f Con

tigs (

G/L

uni

ts)

Number of contigs peaks when depth

c = (1 โ€“ ฮฑ)-1

Page 18: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Contig Lengths โ€ข Probability a fragment ends the contig:

๐‘’โˆ’๐‘’ 1โˆ’๐œƒ

โ€ข Probability of contig with exactly j fragments: 1 โˆ’ ๐‘’โˆ’๐‘’ 1โˆ’๐œƒ ๐‘—โˆ’1

๐‘’โˆ’๐‘’ 1โˆ’๐œƒ

โ€ข The number of contigs with j fragments is: ๐‘๐‘’โˆ’2๐‘’ 1โˆ’๐œƒ 1โˆ’ ๐‘’โˆ’๐‘’ 1โˆ’๐œƒ ๐‘—โˆ’1

โ€ข How many contigs will have 2+ fragments?

Page 19: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Contig Lengths (in bases) โ€ข The expected contig length, in fragments, is

๐ธ ๐ฝ = ๐‘’๐‘’ 1โˆ’๐œƒ โ€ข Each fragment contributes X bases โ€ฆ

๐‘ƒ ๐‘‹ = ๐‘š = 1 โˆ’ ๐›ผ ๐‘šโˆ’1๐›ผ for 0 < ๐‘š โ‰ค ๐ฟ(1 โˆ’ ๐œƒ) ๐‘ƒ ๐‘‹ = ๐ฟ = 1 โˆ’ ๐›ผ ๐ฟ(1โˆ’๐œƒ)

โ€ข After some algebra:

๐ธ ๐‘‹ = ๐ฟ[1โˆ’๐‘’โˆ’๐‘ 1โˆ’๐œƒ

๐‘’โˆ’ ๐œƒ๐‘’โˆ’๐‘’ 1โˆ’๐œƒ ]

โ€ข The expected contig length in bases is E(X) E(J)

๐ฟ[๐‘’๐‘’ 1โˆ’๐œƒ โˆ’ 1

๐‘โˆ’ ๐œƒ]

Page 20: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Contig Lengths

Genome Coverage (or depth) c

Cont

ig L

engt

h (in

uni

ts o

f L b

ase

pairs

)

Lander and Waterman

also studied gap lengths

Page 21: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Enhanced De Bruijn Graphs

โ€ข Usefulness of a de Bruijn graph increases if we annotate each note with useful information

โ€ข Basic information might include the number of times each word was observed

โ€ข More detailed information might include the specific individuals in which the word was present

Page 22: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Variant Analysis Algorithm 1: โ€œBubble Callingโ€

โ€ข Create a de Bruijn graph of reference genome โ€“ Bubbles in this graph are paralogous sequences

โ€ข Using a different label, assemble sample of interest

โ€ข Systematically search for bubbles

โ€“ Nodes where two divergent paths eventually connect

Iqbal (2012)

Page 23: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Word size k and Accessible Genome

Iqbal (2012)

Page 24: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Power of Homozygous Variant Discovery (100-bp reads, no errors)

Iqbal (2012)

Page 25: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Power of Heterozygous Variant Discovery (100-bp reads, no errors)

Iqbal (2012)

Page 26: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Power of Homozygous Variant Discovery (Simulated 30x genomes, 100-bp reads)

Iqbal (2012)

Page 27: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Power of Heterozygous Variant Discovery (Simulated 30x genomes, 100-bp reads)

Dotted lines (โ€ฆ) refer to theoretical expectations. Solid lines (---) refer to simulation results. Iqbal (2012)

Page 28: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Variant Analysis Algorithm 2: Path Divergence

โ€ข Bubble calling requires accurately both alleles โ€“ Power depends on word length k, allele length,

genome complexity and error model โ€“ Low power for the largest events

โ€ข Path divergence searches for regions where a

sample path differs from the reference

โ€ข Especially increases power for deletions โ€“ Deletion often easier to assemble than reference

Page 29: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Path Divergence Example

Black line represents assembly of sample. We can infer a variant between positions a and b,

because the path between them differs from reference.

Page 30: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Variant Analysis Algorithm 3: Multi-Sample Analysis

โ€ข Improves upon simple bubble calling by tracking which paths occur on each sample

โ€ข Improved ability to distinguish true variation from paralogous sequence and errors

Iqbal (2012)

Page 31: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Classifying Sites

โ€ข Evaluate ratio of coverage along the two branches of each bubble and in each individual

โ€ข If the ratio is uniform across individuals โ€ฆ โ€“ Error: Ratio consistently low for one branch โ€“ Repeat: Ratio constant across individuals

โ€ข If the ratio varies across individuals โ€ฆ

โ€“ Variant: Ratio clusters around 0, ยฝ and 1 with probability of these outcomes depending on HWE

Page 32: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Variant Analysis Algorithm 4: Genotyping

โ€ข Calculate probability that a certain number of k-mers cover each path

โ€ข To improve accuracy, short duplicate regions within a path can be ignored.

โ€ข Allows likelihood calculation for use in imputation algorithms

Iqbal (2012)

Page 33: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Example Application to High Coverage Genome

โ€ข 26x, 100-bp reads, k = 55

โ€ข 2,777,252,792 unique k-mers โ€“ 2,691,115,653 also in reference โ€“ 23% more k-mers before cleaning

โ€ข 2,686,963 bubbles found by Bubble Caller

โ€“ 5.6% of these also present in reference

โ€ข 528,651 divergent paths โ€“ 39.8% of these also present in reference

โ€ข 2,245,279 SNPs, 361,531 short indels, 1,100 large or complex events

โ€“ Reproduces 67% of heterozygotes from mapping (87% of homozygotes)

Page 34: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Comparison to Mapping Based Algorithms

Page 35: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Summary

โ€ข Assembly based algorithms currently reach about 80% of the genome

โ€ข These algorithms can handle different variant types more conveniently than mapping based approaches

โ€ข Incorporating population information allows repeats to be distinguished from true variation

Page 36: Genome Assembly Using de Bruijn GraphsContig Lengths (in bases) โ€ข The expected contig length, in fragments, is ๐ธ๐ฝ= ๐‘’. ๐‘’1โˆ’๐œƒ โ€ข Each fragment contributes X bases

Recommended Reading

โ€ข Iqbal, Caccamo, Turner, Flicek and McVean (2012) Nature Genetics 44:226-232

โ€ข Lander and Waterman (1988) Genomics 2:231-239