CSCI2950-C Lecture 4 DNA Sequencing and Fragment...

19
9/24/09 1 CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assembly Ben Raphael Sept. 22, 2009 http://cs.brown.edu/courses/csci2950-c/ l-mer composition Def: Given string s, the Spectrum ( s, l ) is unordered multiset of all possible (n – l + 1) l-mers in a string s of length n The order of individual elements in Spectrum ( s, l ) does not matter • For s = TATGGTGC all of the following are equivalent representations of Spectrum ( s, 3 ): {TAT, ATG, TGG, GGT, GTG, TGC} {ATG, GGT, GTG, TAT, TGC, TGG} {TGG, TGC, TAT, GTG, GGT, ATG}

Transcript of CSCI2950-C Lecture 4 DNA Sequencing and Fragment...

Page 1: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

1

CSCI2950-C Lecture 4

DNA Sequencing and Fragment Assembly

Ben Raphael Sept. 22, 2009

http://cs.brown.edu/courses/csci2950-c/

l-mer composition Def: Given string s, the Spectrum ( s, l ) is

unordered multiset of all possible (n – l + 1) l-mers in a string s of length n

•  The order of individual elements in Spectrum ( s, l ) does not matter

•  For s = TATGGTGC all of the following are equivalent representations of Spectrum ( s, 3 ):

{TAT, ATG, TGG, GGT, GTG, TGC} {ATG, GGT, GTG, TAT, TGC, TGG} {TGG, TGC, TAT, GTG, GGT, ATG}

Page 2: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

2

The SBH Problem Goal: Reconstruct a string from its l-mer

composition

Input: A multiset S, representing all l-mers from an (unknown) string s

Output: String s such that Spectrum ( s,l ) = S

SBH: An Example DNA Sequencing AGT CCA ATC ATCCAGT TCC CAG

S = { ATC, CCA, CAG, TCC, AGT }

Page 3: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

3

SBH: Hamiltonian Path Approach

S = { ATG AGG TGC TCC GTC GGT GCA CAG }

   Path visited every VERTEX once: Hamiltonian Path 

ATG  AGG  TGC  TCC H  GTC  GGT  GCA  CAG 

ATG  C A G G T C C 

Directed graph G = (S,E). es,t E if suffix of s = prefix of t, where

length(suffix/prefix) = l - 1

SBH: Hamiltonian Path Approach

A more complicated graph:

S = { ATG TGG TGC GTG GGC GCA GCG CGT }

Page 4: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

4

SBH: Hamiltonian Path Approach

S = { ATG TGG TGC GTG GGC GCA GCG CGT } Path 1:

              ATGCGTGGCA 

ATGGCGTGCA 

Path 2: 

SBH: Eulerian Path Approach S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

Vertices correspond to ( l – 1 ) – mers : { AT, TG, GC, GG, GT,

CA, CG }

Edges correspond to l – mers from S

AT

GT CG

CA GC TG

GG Find path that visits every EDGE once

(l-1) deBruijn graph of S: Vertices: (l-1)-mers in S Directed edges: consecutive (l-1) mers in S

Page 5: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

5

SBH: Eulerian Path Approach

S = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT } Two different paths give different sequence reconstructions:

ATGGCGTGCA ATGCGTGGCA

AT TG GC CA

GG

GT CG

AT

GT CG

CA GC TG

GG

Euler Theorem

•  A graph is balanced if for every vertex the number of incoming edges equals to the number of outgoing edges:

in(v)=out(v)

•  Theorem: A connected graph is Eulerian if and only if each of its vertices is balanced.

Page 6: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

6

Euler Theorem: Proof

•  Eulerian → balanced

for every edge entering v (incoming edge) there exists an edge leaving v (outgoing edge). Therefore

in(v)=out(v)

•  Balanced → Eulerian

???

Algorithm for Constructing an Eulerian Cycle

a. Start with an arbitrary vertex v and form an arbitrary cycle with unused edges until a dead end is reached. Since the graph is Eulerian this dead end is necessarily the starting point, i.e., vertex v.

Page 7: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

7

Algorithm for Constructing an Eulerian Cycle (cont’d)

b. If cycle from (a) above is not an Eulerian cycle, it must contain a vertex w, which has untraversed edges. Perform step (a) again, using vertex w as the starting point. Once again, we will end up in the starting vertex w.

Algorithm for Constructing an Eulerian Cycle (cont’d)

c. Combine the cycles from (a) and (b) into a single cycle and iterate step (b).

Page 8: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

8

Euler Theorem: Extension

•  Theorem: A connected graph has an Eulerian path if and only if it contains at most two semi-balanced vertices and all other vertices are balanced.

Back to sequence assembly… Overlap Graph

Repeat Repeat Repeat

Find a path visiting every VERTEX exactly once: Hamiltonian path problem

Each vertex represents a read from the original sequence. Vertices from repeats are connected to many others.

Page 9: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

9

Fragment Assembly: Overlap Graph

Find a path visiting every VERTEX exactly once in the OVERLAP graph:

Hamiltonian path problem

NP-complete: algorithms unknown

Repeat Graph: Eulerian Approach Repeat Repeat Repeat

Genome sequence is path visiting every EDGE exactly once: Eulerian path problem

Suppose each repeat and unique string (non-repeat) represented by directed edge.

Edge of multiplicity 3

Page 10: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

10

Multiple Repeats Repeat1 Repeat1 Repeat2 Repeat2

Can be easily constructed with any number of repeats

Approaches to Fragment Assembly (cont’d)

Find a path visiting every EDGE exactly once in the REPEAT graph:

Eulerian path problem

Linear time algorithms are known

Page 11: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

11

Repeat Graph

(a) DNA sequence with a triple repeat R;

(b) the layout graph;

(c) construction of the de Bruijn graph by gluing repeats;

(d) de Bruijn graph.

Pevzner P. A. et.al. PNAS 2001;98:9748-9753

Building Repeat Graph •  Problem: Construct the repeat graph from

a collection of reads.

•  Solution: Break the reads into smaller pieces.

?

Page 12: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

12

Building Repeat Graph

•  Reads are constructed from an original sequence in lengths that allow biologists a high level of certainty.

•  They are then broken again into k-mers

Repeat Graph Vertices correspond to ( k – 1 ) – mers in each read

Edges correspond to k – mers in each read

Example: S = ATGGCGTGCA

Reads = {ATGGC, GGCGTG, GTGCA}

3-mers = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

AT

GT CG

CA GC TG

GG

Two Eulerian paths: (visit every EDGE once)

ATGGCGTGCA ATGCGTGGCA

Page 13: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

13

Repeat Graph Vertices correspond to ( k – 1 ) – mers in each read

Edges correspond to k – mers in each read

Example: S = ATGGCGTGCA

Reads = {ATGGC, GGCGTG, GTGCA}

3-mers = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

AT

GT CG

CA GC TG

GG

Reads in Repeat Graph

S = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT } Two different paths give different sequence reconstructions:

ATGGCGTGCA

AT

GT CG

CA GC TG

GG

Page 14: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

14

Reads in Repeat Graph Example: S = ATGGCGTGCA

Reads = {ATGGC, GGCGTG, GTGCA}

3-mers = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

ATGCGTGGCA

AT TG GC CA

GG

GT CG

Not all Eulerian paths preserve reads!

Eulerian Superpath Example: S = ATGGCGTGCA

Reads = {ATGGC, GGCGTG, GTGCA}

3-mers = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

AT

GT CG

CA GC TG

GG

ATGGCGTGCA ATGCGTGGCA

Eulerian superpath: an Eulerian path that contains set of paths (reads) as subpaths.

Page 15: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

15

EULER Fragment Assembly Approach

•  Input: Reads s1, …, sN

•  Further subdivide reads into k-mers (k = 20)

•  Build repeat graph on resulting k-mers •  Each read is path in resulting graph. •  Solve Eulerian Superpath Problem.

Given an Eulerian graph and a collection of paths in this graph, find an Eulerian path in this graph that contains all these paths as subpaths.

Simplifying the Repeat Graph

•  Collapse paths

•  Collapse “tree-like” structures with unique path

•  “Unzip” edges

C D

3 1 2 4 Concat(1,2,3,4)

B

A CD

B

A

C

Page 16: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

16

Additional challenges in EULER Approach

1.  Errors in reads

2.  Reverse-complement of DNA string

3.  Using mate-pair information to simplify the repeat graph.

4.  Multiplicities of edges generally unknown (Copy number problem).

Sequencing Errors •  If an error exists in one of the reads, the

error will be perpetuated among all of the k-mers broken from that read.

Page 17: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

17

Error Correction from k-mers

•  “Consensus first” approach to error correction.

•  Let Gk = {k-mers in genome G} •  Let T = {all k-tuples appearing in > M reads} •  Gk ≈ T.

Error Correction from k-mers

•  A string s is called a T-string if all its l-tuples belong to T.

•  Spectral Alignment Problem. Given a string s and a spectrum T, find the minimum number of mutations in s that transform s into a T-string.

•  Solving Spectral Alignment Problem attempts to eliminate most point mutation errors before reconstructing the original sequence.

•  Not perfect!

Page 18: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

18

Forward and Reverse Complements

3’ 5’

5’ 3’

We obtain reads from both strands of DNA. Do not know strand of origin. s = CAGT s’ = ACTG (reverse complement)

DNA is copied in 5’ 3’ direction.

Forward and Reverse Complements

3’ 5’

5’ 3’

We want assembly of either strand, but NOT a mix of both.

DNA is copied in 5’ 3’ direction.

Page 19: CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assemblycs.brown.edu/.../Fall2009/LectureSlides/Lecture4.pdf · 2009. 9. 24. · Back to sequence assembly… Overlap Graph Repeat

9/24/09

19

Forward and Reverse Complements

3’ 5’

5’ 3’

In Euler assembler, include reverse complement of each read.

“assume that S contains a complement of every read and that the de Bruijn graph can be partitioned into two subgraphs (the “canonical” one and its reverse complement)”

Alternative approaches use bidirected graphs.

Sources

•  Serafim Batzoglou http://ai.stanford.edu/~serafim/CS262_2006/ (Sequencing slides)

•  http://bioalgorithms.info (Euler slides) •  Euler assembler. Pevzner, Tang, and

Waterman (PNAS 2001): on website