Rapid Global Alignments How to align genomic sequences in (more or less) linear time.

Motivation

• Genomic sequences are very long:

Human genome = 3 x 10^9 long

Mouse genome = 2.7 x 10^9 long

• Aligning genomic regions is useful for revealing common gene structure

Useful to compare regions > 1,000,000-long

Main Idea

Genomic regions of interest contain ordered islands of similarity, such as genes

1. Find local alignments

2. Chain an optimal subset of them

Outline

• Methods to FIND Local Alignments

Sorting k-long words

Suffix Trees

• Methods to CHAIN Local Alignments

Dynamic Programming

Sparse Dynamic Programming

Methods to FIND Local Alignments

1. Sorting k-long words: BLAST, BLAT, and the like

2. Suffix Trees

Finding Local Alignments: Sorting k-long words

Given sequences x, y:

1. Write down all (w, 0, i): w = x_{i+1}…x_{i+k}

and all (z, 1, j): z = y_{j+1}…y_{j+k}

2. Sort them lexicographically

3. Deduce all k-long matches between x and y

4. Extend to local alignments
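
Steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration: the function name kmer_matches is made up here, offsets are 0-based as in the triples above, and step 4 (extension to local alignments) is left out.

```python
def kmer_matches(x, y, k):
    """Return sorted (i, j) offsets with x[i:i+k] == y[j:j+k]."""
    # Step 1: write down all (word, source, offset) triples; source 0 = x, 1 = y.
    words = [(x[i:i + k], 0, i) for i in range(len(x) - k + 1)]
    words += [(y[j:j + k], 1, j) for j in range(len(y) - k + 1)]
    # Step 2: sort lexicographically, so equal words end up next to each other.
    words.sort()
    # Step 3: scan each run of equal words and pair its x offsets with its y offsets.
    matches = []
    start = 0
    while start < len(words):
        end = start
        while end < len(words) and words[end][0] == words[start][0]:
            end += 1
        run = words[start:end]
        xs = [pos for _, src, pos in run if src == 0]
        ys = [pos for _, src, pos in run if src == 1]
        matches.extend((i, j) for i in xs for j in ys)
        start = end
    return sorted(matches)


print(kmer_matches("caggc", "ggcag", 3))  # [(0, 2), (2, 0)]
```

The two reported pairs are the words cag (offset 0 in x, offset 2 in y) and ggc (offset 2 in x, offset 0 in y).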

Sorting k-long words: example

Let x, y be matched with 3-long words:

x = caggc: (cag,0,0), (agg,0,1), (ggc,0,2)

y = ggcag: (ggc,1,0), (gca,1,1), (cag,1,2)

Sorted: (agg,0,1), (cag,0,0), (cag,1,2), (gca,1,1), (ggc,0,2), (ggc,1,0)

Matches:

1. cag: x_1 x_2 x_3 = y_3 y_4 y_5

2. ggc: x_3 x_4 x_5 = y_1 y_2 y_3

Running time

• Worst case: O(N x M) matches (e.g., two highly repetitive sequences)

• In practice: a large value of k results in a short list of matches

Tradeoff:

Low k: worse running time

High k: significant alignments missed

PatternHunter:

Sampling non-consecutive positions increases the likelihood of detecting a conserved region, for a fixed value of k (refer to Lecture 3)

Suffix Trees

• Suffix trees are a method to find all maximal matches between two strings (and much more)

Example: x = dabdac

[Figure: suffix tree of x = dabdac, with leaves numbered 1 to 6]

Definition of a Suffix Tree

Definition:

For string x = x_1…x_m, a suffix tree is:

A rooted tree with m leaves

Leaf i: x_i…x_m

Each edge is labeled with a substring of x

No two edges out of a node start with the same letter

It follows that every substring of x corresponds to an initial part of a path from the root to a leaf

Constructing a Suffix Tree

• Naïve algorithm: O( N^2 ) time

• Better algorithms: O( N ) time

(outside the scope of this class – too technical and not so interesting)

Memory: O( N ) but with a significant constant

Naïve Algorithm to Construct a Suffix Tree

1. Initialize tree T: a single root node r

2. Insert special symbol $ at end of x

3. For j = 1 to m

• Find longest match of x_j…x_m to T, starting from r

• Split edge where match stops: new node w

• Create edge (w, j), and label with unmatched portion of x_j…x_m
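
A minimal Python sketch of the naïve construction follows. The names Node, build_suffix_tree, count_leaves and contains are illustrative, not from the lecture. Edges are stored as index pairs into the string, and the sketch assumes that $ does not occur in x.

```python
class Node:
    def __init__(self):
        self.children = {}  # first letter -> (start, end, child); label is s[start:end]
        self.leaf = None    # 1-based suffix number, set on leaves only


def build_suffix_tree(x):
    """Naive O(N^2) construction; assumes '$' does not occur in x."""
    s = x + "$"                       # step 2: unique terminator
    root = Node()                     # step 1: a single root node
    for j in range(len(s)):           # step 3: insert suffix s[j:]
        node, pos = root, j
        while True:
            first = s[pos]
            if first not in node.children:
                # Match stopped at a node: hang a new leaf edge here.
                leaf = Node()
                leaf.leaf = j + 1
                node.children[first] = (pos, len(s), leaf)
                break
            start, end, child = node.children[first]
            k = 0
            while k < end - start and s[start + k] == s[pos + k]:
                k += 1
            if k == end - start:
                # Whole edge matched: keep matching below it.
                node, pos = child, pos + k
                continue
            # Match stopped inside the edge: split it at a new node w.
            w = Node()
            node.children[first] = (start, start + k, w)
            w.children[s[start + k]] = (start + k, end, child)
            leaf = Node()
            leaf.leaf = j + 1
            w.children[s[pos + k]] = (pos + k, len(s), leaf)
            break
    return root, s


def count_leaves(node):
    if node.leaf is not None:
        return 1
    return sum(count_leaves(child) for _, _, child in node.children.values())


def contains(root, s, pattern):
    """True if pattern is an initial part of some root-to-leaf path."""
    node, p = root, 0
    while p < len(pattern):
        edge = node.children.get(pattern[p])
        if edge is None:
            return False
        start, end, child = edge
        for k in range(start, end):
            if p == len(pattern):
                return True
            if s[k] != pattern[p]:
                return False
            p += 1
        node = child
    return True


root, s = build_suffix_tree("dabda")
print(count_leaves(root), contains(root, s, "bda"), contains(root, s, "dd"))
```

Because the terminator is unique, no suffix is a prefix of another, so every insertion ends by creating exactly one new leaf.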

Example of Suffix Tree Construction

x = d a b d a $

1. Insert d a b d a $

2. Insert a b d a $

3. Insert b d a $

4. Insert d a $

5. Insert a $

6. Insert $

[Figure: the suffix tree after each insertion, with leaves numbered 1 to 6]

Faster Construction

Several algorithms

O( N ) time,

O( N ) memory with a big constant

Technical but not deep, outside the scope of this course

Optional: Gusfield, chapter 6

Memory to Store Suffix Tree

• Can store in O( N ) memory!

• Every edge is labeled with (i, j):

(i, j) denotes x_i…x_j

• Tree has O( N ) nodes

Proof:

1. Every internal node has at least two children, so # internal nodes ≤ # leaves - 1

2. # leaves = |x|

Application: find all matches between x, y

1. Build suffix tree for x, mark nodes with x

2. Insert y in suffix tree, mark all nodes y “passes from” with y

The path label of every node marked with both x and y is a common substring

Example of Suffix Tree construction

x = d a b d a $

y = a b a d a $

1. Construct tree for x

2. Insert a b a d a $

3. Insert b a d a $

4. Insert a d a $

5. Insert d a $

6. Insert a $

7. Insert $

[Figure: the suffix tree of x, with each node marked x and/or y as the suffixes of y are inserted]

Application: String search on a database

Say we have a database D = { s_1, s_2, …, s_n } (e.g., proteins)

Question: Given new string x, find all matches of x to database

1. Build suffix tree for { s_1, …, s_n }

2. All new queries x take O( |x| ) time (somewhat like BLAST)

Application: common substrings of k strings

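To find the longest common substring of s_1, s_2, …, s_n:

1. Build suffix tree for s_1, …, s_n

2. All nodes labeled { s_{i_1}, …, s_{i_k} } represent a match between s_{i_1}, …, s_{i_k}

The deepest node labeled with all n strings gives the longest common substring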

Methods to CHAIN Local Alignments

Sparse Dynamic Programming: O(N log N)

The Problem: Find a Chain of Local Alignments

(x, y) → (x’, y’) requires x < x’ and y < y’

Each local alignment has a weight

FIND the chain with highest total weight

Quadratic Time Solution

• Build Directed Acyclic Graph (DAG):

Nodes: local alignments [(x_a, x_b) (y_a, y_b)] & score

Each local alignment is a node v_i with alignment score s_i

Directed edges: local alignments that can be chained

• edge ( (x_a, x_b, y_a, y_b), (x_c, x_d, y_c, y_d) )

• x_a < x_b < x_c < x_d

• y_a < y_b < y_c < y_d

Quadratic Time Solution

Dynamic programming:

Initialization: Find each node v_a s.t. there is no edge (u, v_a)

Set score of V(a) to be s_a

Iteration: For each v_i, optimal path ending in v_i has total score:

V(i) = max over edges (v_j, v_i) of ( weight(v_j, v_i) + V(j) )

Termination: Optimal global chain:

j = argmax ( V(j) ); trace chain from v_j

Worst case time: quadratic
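
A sketch of this quadratic DP in Python. Two assumptions are made here that the slide does not state: the weight of edge (v_j, v_i) is taken to be the score s_i of the alignment being entered, and each alignment is a tuple (x_a, x_b, y_a, y_b, score) with x_a < x_b. The function name and the example alignments are made up for illustration.

```python
def chain_quadratic(alignments):
    """alignments: list of (xa, xb, ya, yb, score). Returns (best score, chain indices)."""
    # Sorting by xa gives a topological order: an edge j -> i needs xb_j < xa_i.
    order = sorted(range(len(alignments)), key=lambda i: alignments[i][0])
    V, back, done = {}, {}, []
    for i in order:
        xa, _, ya, _, score = alignments[i]
        V[i], back[i] = score, None          # initialization: chain of i alone
        for j in done:                        # iteration: try every predecessor
            _, xb_j, _, yb_j, _ = alignments[j]
            if xb_j < xa and yb_j < ya and V[j] + score > V[i]:
                V[i], back[i] = V[j] + score, j
        done.append(i)
    if not V:
        return 0, []
    end = max(V, key=V.get)                   # termination: best chain end
    best, chain = V[end], []
    while end is not None:                    # trace the chain backwards
        chain.append(end)
        end = back[end]
    return best, chain[::-1]


# Hypothetical alignments: (xa, xb, ya, yb, score)
example = [(1, 3, 1, 3, 5), (4, 6, 5, 7, 3), (2, 8, 4, 9, 6), (9, 11, 10, 12, 4)]
print(chain_quadratic(example))  # (12, [0, 1, 3])
```

The inner loop over all earlier alignments is what makes the running time quadratic.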

Sparse Dynamic Programming

Back to the LCS problem:

• Given two sequences x = x_1, …, x_m

y = y_1, …, y_n

• Find the longest common subsequence

Quadratic solution with DP

• How about when “hits” x_i = y_j are sparse?

Sparse Dynamic Programming

x = 15 3 24 16 20 4 24 3 11 18

y = 4 20 24 3 11 15 11 4 18 20

[Figure: grid with x along one axis and y along the other; the hits x_i = y_j are marked]

• Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead

Sparse Dynamic Programming – L.I.S.

• Longest Increasing Subsequence

• Given a sequence over an ordered alphabet

x = x_1, …, x_m

• Find the longest subsequence

s = s_1, …, s_k

such that s_1 < s_2 < … < s_k

Sparse LCS expressed as LIS

Create a sequence w

• Every matching point x-to-y, (i, j), is inserted into w as follows:

• For each position j of x, from smallest to largest, insert in w the points (i, j), in decreasing order of i (the position in y)

• The 11 example points are inserted in the order given

• Any two points (y_a, x_a), (y_b, x_b) can be chained iff

a is before b in w, and y_a < y_b
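
The construction of w can be sketched as below. The function name matching_points is illustrative and positions are 1-based. The example uses the two sequences as transcribed from the figure, and slide_w is the sequence w listed on the next slide. On these transcribed sequences the sketch also reports the point (6, 1), for the shared value 15, in addition to the 11 points of the example.

```python
from collections import defaultdict


def matching_points(x, y):
    """Return the points (i, j) with y[i] == x[j], 1-based, in insertion order for w."""
    # Index y: value -> increasing list of positions i where it occurs.
    rows = defaultdict(list)
    for i, value in enumerate(y, start=1):
        rows[value].append(i)
    w = []
    # For each position j of x, smallest first, add its hits in decreasing i.
    for j, value in enumerate(x, start=1):
        for i in reversed(rows.get(value, [])):
            w.append((i, j))
    return w


x = [15, 3, 24, 16, 20, 4, 24, 3, 11, 18]
y = [4, 20, 24, 3, 11, 15, 11, 4, 18, 20]
slide_w = [(4, 2), (3, 3), (10, 5), (2, 5), (8, 6), (1, 6),
           (3, 7), (4, 8), (7, 9), (5, 9), (9, 10)]
w = matching_points(x, y)
print([p for p in w if p != (6, 1)] == slide_w)  # True
```

Listing the hits of one position j in decreasing i is what prevents two points with the same j from both appearing in an increasing subsequence.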

[Figure: the 11 matching points on the x-y grid, numbered 1 to 11 in the order they are inserted into w]

Sparse LCS expressed as LIS

Create a sequence w

w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

Consider now w’s elements as ordered lexicographically, where

• (y_a, x_a) < (y_b, x_b) if y_a < y_b

Claim: An increasing subsequence of w is a common subsequence of x and y

Sparse Dynamic Programming for LIS

• Algorithm:

initialize empty array L

/* at each point, l_j will contain the smallest last element among all j-long increasing subsequences found so far */

for i = 1 to |w|

binary search for w[i] in L, to find l_j < w[i] ≤ l_{j+1}

replace l_{j+1} with w[i] (append w[i] if no such l_{j+1} exists)

keep a backptr w[i] → l_j
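
The algorithm can be sketched in Python with the standard bisect module. Points are compared by their y-index only, as in the ordering defined for w. The function name and the parallel arrays tail_keys and tail_idx are illustrative choices.

```python
from bisect import bisect_left


def longest_increasing_chain(w):
    """w: list of (i, j) points. Returns a longest subsequence with strictly increasing i."""
    tail_keys = []            # tail_keys[k]: smallest i ending an increasing run of length k+1
    tail_idx = []             # index into w of the point stored in tail_keys[k]
    back = [None] * len(w)    # back pointers for recovering the chain
    for idx, (i, _j) in enumerate(w):
        # Binary search: tail_keys[pos-1] < i <= tail_keys[pos].
        pos = bisect_left(tail_keys, i)
        back[idx] = tail_idx[pos - 1] if pos > 0 else None
        if pos == len(tail_keys):
            tail_keys.append(i)
            tail_idx.append(idx)
        else:
            tail_keys[pos] = i
            tail_idx[pos] = idx
    # Follow back pointers from the end of the longest run.
    chain = []
    idx = tail_idx[-1] if tail_idx else None
    while idx is not None:
        chain.append(w[idx])
        idx = back[idx]
    return chain[::-1]


w = [(4, 2), (3, 3), (10, 5), (2, 5), (8, 6), (1, 6),
     (3, 7), (4, 8), (7, 9), (5, 9), (9, 10)]
print(longest_increasing_chain(w))
# [(1, 6), (3, 7), (4, 8), (5, 9), (9, 10)]
```

Each of the |w| steps costs one binary search, so the total is O(|w| log |w|).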

Sparse Dynamic Programming for LIS

Example: w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

L =

1. (4,2)

2. (3,3)

3. (3,3) (10,5)

4. (2,5) (10,5)

5. (2,5) (8,6)

6. (1,6) (8,6)

7. (1,6) (3,7)

8. (1,6) (3,7) (4,8)

9. (1,6) (3,7) (4,8) (7,9)

10. (1,6) (3,7) (4,8) (5,9)

11. (1,6) (3,7) (4,8) (5,9) (9,10)

Longest common subsequence:

s = 4, 24, 3, 11, 18

Sparse DP for rectangle chaining

• 1, …, N: rectangles

• (h_j, l_j): y-coordinates of rectangle j

• w(j): weight of rectangle j

• V(j): optimal score of chain ending in j

• L: list of triplets (l_j, V(j), j)

L is sorted by l_j

L is implemented as a balanced binary tree

[Figure: a rectangle in the x-y plane, with h and l marking its two y-coordinates]

Sparse DP for rectangle chaining

Go through rectangle x-coordinates, from lowest to highest:

1. When on the leftmost end of i:

a. j: rectangle in L, with largest l_j < h_i

b. V(i) = w(i) + V(j)

2. When on the rightmost end of i:

a. j: rectangle in L, with largest l_j ≤ l_i

b. If V(i) > V(j):

i. INSERT (l_i, V(i), i) in L

ii. REMOVE all (l_k, V(k), k) with V(k) ≤ V(i) & l_k ≥ l_i
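
A sketch of the sweep in Python, under several assumptions not on the slide: each rectangle is a tuple (x_left, x_right, h, l, weight) with h < l and a positive weight, left ends are processed before right ends at equal x, and L is kept as a sorted Python list searched with bisect instead of a balanced binary tree. The list keeps the logic identical but makes each insertion O(N) rather than O(log N). The function name and the example rectangles are hypothetical.

```python
from bisect import bisect_left, bisect_right


def chain_rectangles(rects):
    """rects: list of (x_left, x_right, h, l, weight). Returns (best score, chain indices)."""
    events = []
    for i, (x1, x2, _h, _l, _w) in enumerate(rects):
        events.append((x1, 0, i))   # kind 0: leftmost end
        events.append((x2, 1, i))   # kind 1: rightmost end
    events.sort()
    keys, items = [], []            # L: keys = l_j sorted; items = (V(j), j), V increasing
    V = [0] * len(rects)
    back = [None] * len(rects)
    for _, kind, i in events:
        _, _, h, l, weight = rects[i]
        if kind == 0:
            p = bisect_left(keys, h)            # entries keys[:p] have l_j < h_i
            V[i] = weight + (items[p - 1][0] if p else 0)
            back[i] = items[p - 1][1] if p else None
        else:
            p = bisect_right(keys, l)           # entries keys[:p] have l_j <= l_i
            if p == 0 or V[i] > items[p - 1][0]:
                q = bisect_left(keys, l)        # first entry with l_k >= l_i
                r = q
                while r < len(keys) and items[r][0] <= V[i]:
                    r += 1                      # dominated entries are consecutive
                keys[q:r] = [l]                 # REMOVE them and INSERT i
                items[q:r] = [(V[i], i)]
    if not rects:
        return 0, []
    end = max(range(len(rects)), key=V.__getitem__)
    best, chain = V[end], []
    while end is not None:
        chain.append(end)
        end = back[end]
    return best, chain[::-1]


# Hypothetical rectangles: (x_left, x_right, h, l, weight)
rects = [(0, 2, 0, 2, 5), (1, 4, 5, 8, 6), (3, 5, 3, 4, 3), (6, 8, 5, 7, 4), (9, 10, 9, 10, 2)]
print(chain_rectangles(rects))  # (14, [0, 2, 3, 4])
```

The removals keep V strictly increasing along L, which is why the entry with the largest l_j below h_i is also the best predecessor.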

Example

[Figure: five rectangles in the x-y plane, labeled id: weight as 1: 5, 2: 6, 3: 3, 4: 4, 5: 2; coordinate ticks 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]

Time Analysis

1. Sorting the x-coords takes O(N log N)

2. Going through x-coords: N steps

3. Each of N steps requires O(log N) time:

• Searching L takes log N

• Inserting to L takes log N

• All deletions are consecutive, so log N per deletion

• Each element is deleted at most once: N log N for all deletions

• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree