CSCI2950-C Lecture 4 DNA Sequencing and Fragment...
Transcript of CSCI2950-C Lecture 4 DNA Sequencing and Fragment...
9/24/09
1
CSCI2950-C Lecture 4
DNA Sequencing and Fragment Assembly
Ben Raphael Sept. 22, 2009
http://cs.brown.edu/courses/csci2950-c/
l-mer composition Def: Given string s, the Spectrum ( s, l ) is
unordered multiset of all possible (n – l + 1) l-mers in a string s of length n
• The order of individual elements in Spectrum ( s, l ) does not matter
• For s = TATGGTGC all of the following are equivalent representations of Spectrum ( s, 3 ):
{TAT, ATG, TGG, GGT, GTG, TGC} {ATG, GGT, GTG, TAT, TGC, TGG} {TGG, TGC, TAT, GTG, GGT, ATG}
9/24/09
2
The SBH Problem Goal: Reconstruct a string from its l-mer
composition
Input: A multiset S, representing all l-mers from an (unknown) string s
Output: String s such that Spectrum ( s,l ) = S
SBH: An Example DNA Sequencing AGT CCA ATC ATCCAGT TCC CAG
S = { ATC, CCA, CAG, TCC, AGT }
9/24/09
3
SBH: Hamiltonian Path Approach
S = { ATG AGG TGC TCC GTC GGT GCA CAG }
Path visited every VERTEX once: Hamiltonian Path
ATG AGG TGC TCC H GTC GGT GCA CAG
ATG C A G G T C C
Directed graph G = (S,E). es,t E if suffix of s = prefix of t, where
length(suffix/prefix) = l - 1
SBH: Hamiltonian Path Approach
A more complicated graph:
S = { ATG TGG TGC GTG GGC GCA GCG CGT }
9/24/09
4
SBH: Hamiltonian Path Approach
S = { ATG TGG TGC GTG GGC GCA GCG CGT } Path 1:
ATGCGTGGCA
ATGGCGTGCA
Path 2:
SBH: Eulerian Path Approach S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
Vertices correspond to ( l – 1 ) – mers : { AT, TG, GC, GG, GT,
CA, CG }
Edges correspond to l – mers from S
AT
GT CG
CA GC TG
GG Find path that visits every EDGE once
(l-1) deBruijn graph of S: Vertices: (l-1)-mers in S Directed edges: consecutive (l-1) mers in S
9/24/09
5
SBH: Eulerian Path Approach
S = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT } Two different paths give different sequence reconstructions:
ATGGCGTGCA ATGCGTGGCA
AT TG GC CA
GG
GT CG
AT
GT CG
CA GC TG
GG
Euler Theorem
• A graph is balanced if for every vertex the number of incoming edges equals to the number of outgoing edges:
in(v)=out(v)
• Theorem: A connected graph is Eulerian if and only if each of its vertices is balanced.
9/24/09
6
Euler Theorem: Proof
• Eulerian → balanced
for every edge entering v (incoming edge) there exists an edge leaving v (outgoing edge). Therefore
in(v)=out(v)
• Balanced → Eulerian
???
Algorithm for Constructing an Eulerian Cycle
a. Start with an arbitrary vertex v and form an arbitrary cycle with unused edges until a dead end is reached. Since the graph is Eulerian this dead end is necessarily the starting point, i.e., vertex v.
9/24/09
7
Algorithm for Constructing an Eulerian Cycle (cont’d)
b. If cycle from (a) above is not an Eulerian cycle, it must contain a vertex w, which has untraversed edges. Perform step (a) again, using vertex w as the starting point. Once again, we will end up in the starting vertex w.
Algorithm for Constructing an Eulerian Cycle (cont’d)
c. Combine the cycles from (a) and (b) into a single cycle and iterate step (b).
9/24/09
8
Euler Theorem: Extension
• Theorem: A connected graph has an Eulerian path if and only if it contains at most two semi-balanced vertices and all other vertices are balanced.
Back to sequence assembly… Overlap Graph
Repeat Repeat Repeat
Find a path visiting every VERTEX exactly once: Hamiltonian path problem
Each vertex represents a read from the original sequence. Vertices from repeats are connected to many others.
9/24/09
9
Fragment Assembly: Overlap Graph
Find a path visiting every VERTEX exactly once in the OVERLAP graph:
Hamiltonian path problem
NP-complete: algorithms unknown
Repeat Graph: Eulerian Approach Repeat Repeat Repeat
Genome sequence is path visiting every EDGE exactly once: Eulerian path problem
Suppose each repeat and unique string (non-repeat) represented by directed edge.
Edge of multiplicity 3
9/24/09
10
Multiple Repeats Repeat1 Repeat1 Repeat2 Repeat2
Can be easily constructed with any number of repeats
Approaches to Fragment Assembly (cont’d)
Find a path visiting every EDGE exactly once in the REPEAT graph:
Eulerian path problem
Linear time algorithms are known
9/24/09
11
Repeat Graph
(a) DNA sequence with a triple repeat R;
(b) the layout graph;
(c) construction of the de Bruijn graph by gluing repeats;
(d) de Bruijn graph.
Pevzner P. A. et.al. PNAS 2001;98:9748-9753
Building Repeat Graph • Problem: Construct the repeat graph from
a collection of reads.
• Solution: Break the reads into smaller pieces.
?
9/24/09
12
Building Repeat Graph
• Reads are constructed from an original sequence in lengths that allow biologists a high level of certainty.
• They are then broken again into k-mers
Repeat Graph Vertices correspond to ( k – 1 ) – mers in each read
Edges correspond to k – mers in each read
Example: S = ATGGCGTGCA
Reads = {ATGGC, GGCGTG, GTGCA}
3-mers = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
AT
GT CG
CA GC TG
GG
Two Eulerian paths: (visit every EDGE once)
ATGGCGTGCA ATGCGTGGCA
9/24/09
13
Repeat Graph Vertices correspond to ( k – 1 ) – mers in each read
Edges correspond to k – mers in each read
Example: S = ATGGCGTGCA
Reads = {ATGGC, GGCGTG, GTGCA}
3-mers = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
AT
GT CG
CA GC TG
GG
Reads in Repeat Graph
S = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT } Two different paths give different sequence reconstructions:
ATGGCGTGCA
AT
GT CG
CA GC TG
GG
9/24/09
14
Reads in Repeat Graph Example: S = ATGGCGTGCA
Reads = {ATGGC, GGCGTG, GTGCA}
3-mers = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
ATGCGTGGCA
AT TG GC CA
GG
GT CG
Not all Eulerian paths preserve reads!
Eulerian Superpath Example: S = ATGGCGTGCA
Reads = {ATGGC, GGCGTG, GTGCA}
3-mers = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
AT
GT CG
CA GC TG
GG
ATGGCGTGCA ATGCGTGGCA
Eulerian superpath: an Eulerian path that contains set of paths (reads) as subpaths.
9/24/09
15
EULER Fragment Assembly Approach
• Input: Reads s1, …, sN
• Further subdivide reads into k-mers (k = 20)
• Build repeat graph on resulting k-mers • Each read is path in resulting graph. • Solve Eulerian Superpath Problem.
Given an Eulerian graph and a collection of paths in this graph, find an Eulerian path in this graph that contains all these paths as subpaths.
Simplifying the Repeat Graph
• Collapse paths
• Collapse “tree-like” structures with unique path
• “Unzip” edges
C D
3 1 2 4 Concat(1,2,3,4)
B
A CD
B
A
C
9/24/09
16
Additional challenges in EULER Approach
1. Errors in reads
2. Reverse-complement of DNA string
3. Using mate-pair information to simplify the repeat graph.
4. Multiplicities of edges generally unknown (Copy number problem).
Sequencing Errors • If an error exists in one of the reads, the
error will be perpetuated among all of the k-mers broken from that read.
9/24/09
17
Error Correction from k-mers
• “Consensus first” approach to error correction.
• Let Gk = {k-mers in genome G} • Let T = {all k-tuples appearing in > M reads} • Gk ≈ T.
Error Correction from k-mers
• A string s is called a T-string if all its l-tuples belong to T.
• Spectral Alignment Problem. Given a string s and a spectrum T, find the minimum number of mutations in s that transform s into a T-string.
• Solving Spectral Alignment Problem attempts to eliminate most point mutation errors before reconstructing the original sequence.
• Not perfect!
9/24/09
18
Forward and Reverse Complements
3’ 5’
5’ 3’
We obtain reads from both strands of DNA. Do not know strand of origin. s = CAGT s’ = ACTG (reverse complement)
DNA is copied in 5’ 3’ direction.
Forward and Reverse Complements
3’ 5’
5’ 3’
We want assembly of either strand, but NOT a mix of both.
DNA is copied in 5’ 3’ direction.
9/24/09
19
Forward and Reverse Complements
3’ 5’
5’ 3’
In Euler assembler, include reverse complement of each read.
“assume that S contains a complement of every read and that the de Bruijn graph can be partitioned into two subgraphs (the “canonical” one and its reverse complement)”
Alternative approaches use bidirected graphs.
Sources
• Serafim Batzoglou http://ai.stanford.edu/~serafim/CS262_2006/ (Sequencing slides)
• http://bioalgorithms.info (Euler slides) • Euler assembler. Pevzner, Tang, and
Waterman (PNAS 2001): on website