Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly...
Transcript of Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly...
![Page 2: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/2.jpg)
• De novo genome assembly• The fragment assembly problem• Shortest superstring problem• Seven bridges of Königsberg• Assembly as a graph theoretical problem• We construct a de Bruijn graph• Underlying assumptions of genome assemblies
2Sebastian Schmeier
![Page 3: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/3.jpg)
• The process of generating a new genome sequence from NGS genome sequence reads based on assembly algorithms
• Assembly involves joining short sequence fragments together into long pieces – contigs
3
reads
genome
sequencing
Sebastian Schmeier
![Page 4: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/4.jpg)
4
ATGCG
GCGTG
GTGGC
TGGCA
Sebastian Schmeier
![Page 5: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/5.jpg)
5
ATGCG
GCGTG
GTGGC
TGGCA
Vertex
Sebastian Schmeier
![Page 6: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/6.jpg)
6
ATGCG
GCGTG
GTGGC
TGGCA
Vertex
Edge
Sebastian Schmeier
![Page 7: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/7.jpg)
7
ATGCG
GCGTG
GTGGC
TGGCA
Vertex
Edge
Sebastian Schmeier
![Page 8: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/8.jpg)
8
ATGCG
GCGTG
GTGGC
TGGCA
Vertex
Edge
Sebastian Schmeier
![Page 9: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/9.jpg)
9
ATGCG
GCGTG
GTGGC
TGGCA
Vertex
ATGCGGCGTG
GTGGCTGGCA
ATGCGTGGCAgenome
Edge
Sebastian Schmeier
![Page 10: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/10.jpg)
• Given: A set of reads (strings) {s1, s2, … , sn }• Do: Determine a large string s that “best explains” the reads
• What do we mean by “best explains”?• What assumptions might we require?
10Sebastian Schmeier
![Page 11: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/11.jpg)
• Objective: Find a string s such that• all reads s1, s2, … , sn are substrings of s• s is as short as possible
• Assumptions:§ Reads are 100% accurate§ Identical reads must come from the same location on the genome§ “best” = “simplest”
11Sebastian Schmeier
![Page 12: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/12.jpg)
12
ATGCG
GCGTG
GTGGC
TGGCA
12
3
ATGCGGCGTG
GTGGCTGGCA
ATGCGTGGCA
• The assumption is that all substrings are represented• Even modern sequencers that generate 100nt reads do not cover all possible 100-mers
Sebastian Schmeier
![Page 13: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/13.jpg)
13
ATGCGGCGTG
GTGGCTGGCA
• Thus, people generally use k-mers of certain length ß Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG, GCG, CGT, GTGGTG, TGG GGCTGG, GGC, GCA
3-mers
Sebastian Schmeier
![Page 14: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/14.jpg)
14
ATGCGGCGTG
GTGGCTGGCA
• Thus, people generally use k-mers of certain length ß Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG, GCG, CGT, GTGGTG, TGG GGCTGG, GGC, GCA
3-mers
makethemunique
Sebastian Schmeier
![Page 15: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/15.jpg)
15
ATGCGGCGTG
GTGGCTGGCA
• Thus, people generally use k-mers of certain length ß Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG, GCG, CGT, GTGGTG, TGG GGCTGG, GGC, GCA
3-mers
ATG
TGC
GCG CGTGTG
TGG GGC
GCA Drawedgefromxtoywheresuffixfromxoverlapsprefixfromy
Sebastian Schmeier
![Page 16: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/16.jpg)
16
ATGCGGCGTG
GTGGCTGGCA
• Thus, people generally use k-mers of certain length ß Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG, GCG, CGT, GTGGTG, TGG GGCTGG, GGC, GCA
3-mers
ATG
TGC
GCG CGTGTG
TGG GGC
GCA Drawedgefromxtoywheresuffixfromxoverlapsprefixfromy
Sebastian Schmeier
![Page 17: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/17.jpg)
17
ATGCGGCGTG
GTGGCTGGCA
• Thus, people generally use k-mers of certain length ß Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG, GCG, CGT, GTGGTG, TGG GGCTGG, GGC, GCA
3-mers
ATG
TGC
GCG CGTGTG
TGG GGC
GCA Drawedgefromxtoywheresuffixfromxoverlapsprefixfromy
Sebastian Schmeier
![Page 18: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/18.jpg)
18
ATGCGGCGTG
GTGGCTGGCA
• Thus, people generally use k-mers of certain length ß Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG, GCG, CGT, GTGGTG, TGG GGCTGG, GGC, GCA
3-mers
ATG
TGC
GCG CGTGTG
TGG GGC
GCA Drawedgefromxtoywheresuffixfromxoverlapsprefixfromy
Sebastian Schmeier
![Page 19: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/19.jpg)
19
ATGCGGCGTG
GTGGCTGGCA
• Thus, people generally use k-mers of certain length ß Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG, GCG, CGT, GTGGTG, TGG GGCTGG, GGC, GCA
3-mers
ATG
TGC
GCG CGTGTG
TGG GGC
GCA
FindHamiltonianpath,thatis,apaththatvisitseveryvertexexactlyonceRecordtheFirstle0erofeachvertex+Allle0ersoflastvertex
Sebastian Schmeier
![Page 20: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/20.jpg)
20
ATGCGGCGTG
GTGGCTGGCA
• Thus, people generally use k-mers of certain length ß Here we use 3-mers by cutting the original reads into reads of length 3ß UNFORTUNATELY: The Hamiltonian path problem is very difficult to solve (np-complete)
ATG, TGC, GCG, GCG, CGT, GTGGTG, TGG GGCTGG, GGC, GCA
3-mers
ATG
TGC
GCG CGTGTG
TGG GGC
GCA
FindHamiltonianpath,thatis,apaththatvisitseveryvertexexactlyonceRecordtheFirstle0erofeachvertex+Allle0ersoflastvertex
ATGCGTGGCA
Sebastian Schmeier
![Page 21: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/21.jpg)
21
ATGCGGCGTG
GTGGCTGGCA
• Thus, people generally use k-mers of certain length ß Here we use 3-mers by cutting the original reads into reads of length 3ß UNFORTUNATELY: The Hamiltonian path problem is very difficult to solve (np-complete)
ATG, TGC, GCG, GCG, CGT, GTGGTG, TGG GGCTGG, GGC, GCA
3-mers
ATG
TGC
GCG CGTGTG
TGG GGC
GCA
FindHamiltonianpath,thatis,apaththatvisitseveryvertexexactlyonceRecordtheFirstle0erofeachvertex+Allle0ersoflastvertex
ATGCGTGGCAATGGCGTGCA
Sebastian Schmeier
![Page 22: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/22.jpg)
• In 1735 Leonhard Euler was presented with the following problem:§ find a walk through the city that would cross each bridge once and only once§ He proved that a connected graph with undirected edges contains an Eulerian cycle
exactly when every node in the graph has an even number of edges touching it.§ For the Königsberg Bridge graph, this is not the case because each of the four nodes
has an odd number of edges touching it and so the desired stroll through the city does not exist.
22https://en.wikipedia.org/wiki/Leonhard_Euler Sebastian Schmeier
![Page 23: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/23.jpg)
23
A B
D
Vertex
Directed edge
• The degree of a vertex: # of edges connected to it• outdegree: # of outgoing edges• indegree: # of ingoing edges• degree(B)?• outdegree(B)?• indegree(D)?
C
Sebastian Schmeier
![Page 24: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/24.jpg)
• The case of directed graphs is similar :§ A graph in which indegrees are equal to outdegrees for all nodes is
called 'balanced'. § Euler's theorem states that a connected directed graph has an Eulerian
cycle if and only if it is balanced.
• Mathematically/computationally finding Eulerian path is much easier than Hamiltonian
à we need to reformulate our assembly problem
24Sebastian Schmeier
![Page 25: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/25.jpg)
We construct a de Bruijn graph:§ edges represent k-mers § vertices correspond to (k-1)-mers
1. Form a node for every distinct prefix or suffix of a k-mer2. Connect vertex x to vertex y with a directed edge if some k-mer (e.g., ATG) has prefix x
(e.g., AT) and suffix y (e.g., TG), and label the edge with this k-mer.
25
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
Sebastian Schmeier
![Page 26: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/26.jpg)
We construct a de Bruijn graph:§ edges represent k-mers § vertices correspond to (k-1)-mers
1. Form a node for every distinct prefix or suffix of a k-mer2. Connect vertex x to vertex y with a directed edge if some k-mer (e.g., ATG) has prefix x
(e.g., AT) and suffix y (e.g., TG), and label the edge with this k-mer.
26
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT
Sebastian Schmeier
![Page 27: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/27.jpg)
We construct a de Bruijn graph:§ edges represent k-mers § vertices correspond to (k-1)-mers
1. Form a node for every distinct prefix or suffix of a k-mer2. Connect vertex x to vertex y with a directed edge if some k-mer (e.g., ATG) has prefix x
(e.g., AT) and suffix y (e.g., TG), and label the edge with this k-mer.
27
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
Sebastian Schmeier
![Page 28: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/28.jpg)
We construct a de Bruijn graph:§ edges represent k-mers § vertices correspond to (k-1)-mers
1. Form a node for every distinct prefix or suffix of a k-mer2. Connect vertex x to vertex y with a directed edge if some k-mer (e.g., ATG) has prefix x
(e.g., AT) and suffix y (e.g., TG), and label the edge with this k-mer.
28
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
GG
Sebastian Schmeier
![Page 29: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/29.jpg)
We construct a de Bruijn graph:§ edges represent k-mers § vertices correspond to (k-1)-mers
1. Form a node for every distinct prefix or suffix of a k-mer2. Connect vertex x to vertex y with a directed edge if some k-mer (e.g., ATG) has prefix x
(e.g., AT) and suffix y (e.g., TG), and label the edge with this k-mer.
29
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
GG
GC CA
CGGT
Sebastian Schmeier
![Page 30: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/30.jpg)
We construct a de Bruijn graph:§ edges represent k-mers § vertices correspond to (k-1)-mers
1. Form a node for every distinct prefix or suffix of a k-mer2. Connect vertex x to vertex y with a directed edge if some k-mer (e.g., ATG) has prefix x
(e.g., AT) and suffix y (e.g., TG), and label the edge with this k-mer.
30
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
GG
GC CA
CGGT
ATG
Sebastian Schmeier
![Page 31: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/31.jpg)
We construct a de Bruijn graph:§ edges represent k-mers § vertices correspond to (k-1)-mers
1. Form a node for every distinct prefix or suffix of a k-mer2. Connect vertex x to vertex y with a directed edge if some k-mer (e.g., ATG) has prefix x
(e.g., AT) and suffix y (e.g., TG), and label the edge with this k-mer.
31
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
GG
GC CA
CGGT
ATG
TGG
Sebastian Schmeier
![Page 32: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/32.jpg)
We construct a de Bruijn graph:§ edges represent k-mers § vertices correspond to (k-1)-mers
1. Form a node for every distinct prefix or suffix of a k-mer2. Connect vertex x to vertex y with a directed edge if some k-mer (e.g., ATG) has prefix x
(e.g., AT) and suffix y (e.g., TG), and label the edge with this k-mer.
32
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
GG
GC CA
CGGT
ATG
TGG GGC
TGC
GTG GCGCGT
GCA
Sebastian Schmeier
![Page 33: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/33.jpg)
Can we find a DNA sequence containing all k-mers?à In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once?à Eulerian path• a vertex v is semibalanced if |indegree(v) – outdegree(v)| = 1• a connected graph has an Eulerian path if and only if it contains at most two semibalanced
vertices
33
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
GG
GC CA
CGGT
ATG
TGG GGC
TGC
GTG GCGCGT
GCA
Sebastian Schmeier
![Page 34: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/34.jpg)
Can we find a DNA sequence containing all k-mers?à In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once?à Eulerian path• a vertex v is semibalanced if |indegree(v) – outdegree(v)| = 1• a connected graph has an Eulerian path if and only if it contains at most two semibalanced
vertices
34
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
GG
GC CA
CGGT
ATG
TGG GGC
TGC
GTG GCGCGT
GCA
Sebastian Schmeier
![Page 35: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/35.jpg)
Can we find a DNA sequence containing all k-mers?à In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once?à Eulerian path
35
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
GG
GC CA
CGGT
ATG
TGG GGC
TGC
GTG GCGCGT
GCA
ATGGCGTGCA
Sebastian Schmeier
![Page 36: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/36.jpg)
Can we find a DNA sequence containing all k-mers?à In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once?à Eulerian path
36
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
GG
GC CA
CGGT
ATG
TGG GGC
TGC
GTG GCGCGT
GCA
ATGGCGTGCA
ATGCGTGGCASebastian Schmeier
![Page 37: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/37.jpg)
• Four hidden assumptions that do not hold for next-generation sequencing We took for granted that:1. we can generate all k-mers present in the genome2. all k-mers are error free3. each k-mer appears at most once in the genome 4. the genome consists of a single chromosome
37Sebastian Schmeier
![Page 38: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/38.jpg)
• Four hidden assumptions that do not hold for next-generation sequencing We took for granted that:1. we can generate all k-mers present in the genome2. all k-mers are error free
• Due to these reasons we do NOT choose the longest possible k-mer• The smaller the k-mer the higher the possibility that we see all k-mers• Errors:
38
ATGGCGTGCAATGTGGGGCGCGCGTGTGTGCGCA
Mostlyunaffectedk-mers
ATGGCGTGCA100%affectedk-mers
K=3
K=10
Sebastian Schmeier
![Page 39: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/39.jpg)
Each k-mer appears at most once in the genome à repeats• This is most often not true• This is known as k-mer multiplicity
39
k-mers: ATG, GCA, TGC, TGC, GTG, GTG, GCG, GCG, CGT, CGT
Distinct (k-1)-mers:
AT TG GC CA
CGGT
ATG TGC
GTG GCGCGT
GCA
ATGCGTGCGTGCA
Sebastian Schmeier
![Page 40: Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly problem • Shortest superstring problem • Seven bridges of Königsberg • Assembly](https://reader034.fdocuments.us/reader034/viewer/2022050509/5f9a0933d5c4042c1742f4f2/html5/thumbnails/40.jpg)
‹#› 40
ReferencesHow to apply de Bruijn graphs to genome assembly. Phillip E C Compeau, Pavel A Pevzner & Glenn Tesler. Nature Biotechnology29, 987–991 (2011) doi:10.1038/nbt.2023 Published online 08 November 2011
Sequence Assembly. Lecture by Mark Craven ([email protected]). BMI/CS 576 (www.biostat.wisc.edu/bmi576/), Fall 2011
Sebastian [email protected]://sschmeier.com/bioinf-workshop/