Rapid Global Alignments
How to align genomic sequences in (more or less) linear time
Motivation
• Genomic sequences are very long:
Human genome = 3 x 10^9 long
Mouse genome = 2.7 x 10^9 long
• Aligning genomic regions is useful for revealing common gene structure
Useful to compare regions > 1,000,000-long
Main Idea
Genomic regions of interest contain ordered islands of similarity, such as genes
1. Find local alignments
2. Chain an optimal subset of them
Outline
• Methods to FIND Local Alignments
Sorting k-long words
Suffix Trees
• Methods to CHAIN Local Alignments
Dynamic Programming
Sparse Dynamic Programming
Methods to FIND Local Alignments
1. Sorting k-long words: BLAST, BLAT, and the like
2. Suffix Trees
Finding Local Alignments: Sorting k-long words
Given sequences x, y:
1. Write down all (w, 0, i): w = x[i+1]…x[i+k]
(z, 1, j): z = y[j+1]…y[j+k]
2. Sort them lexicographically
3. Deduce all k-long matches between x and y
4. Extend to local alignments
Sorting k-long words: example
Let x, y be matched with 3-long words:
x = caggc: (cag,0,0), (agg,0,1), (ggc,0,2)
y = ggcag: (ggc,1,0), (gca,1,1), (cag,1,2)
Sorted: (agg,0,1), (cag,0,0), (cag,1,2), (gca,1,1), (ggc,0,2), (ggc,1,0)
Matches:
1. cag: x1x2x3 = y3y4y5
2. ggc: x3x4x5 = y1y2y3
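The sorting procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by BLAST or BLAT; the function name `kmer_matches` is hypothetical, positions are 0-based as in the example triples, and step 4 (extending matches to local alignments) is left out.

```python
def kmer_matches(x, y, k):
    """Return all (i, j) with x[i:i+k] == y[j:j+k], using 0-based offsets."""
    # Step 1: write down every k-long word, tagged with its source (0 = x, 1 = y).
    triples = [(x[i:i + k], 0, i) for i in range(len(x) - k + 1)]
    triples += [(y[j:j + k], 1, j) for j in range(len(y) - k + 1)]

    # Step 2: sort lexicographically, so equal words become adjacent.
    triples.sort()

    # Step 3: scan each run of equal words and pair x offsets with y offsets.
    matches = []
    start = 0
    while start < len(triples):
        end = start
        while end < len(triples) and triples[end][0] == triples[start][0]:
            end += 1
        run = triples[start:end]
        xs = [pos for _, src, pos in run if src == 0]
        ys = [pos for _, src, pos in run if src == 1]
        matches.extend((i, j) for i in xs for j in ys)
        start = end
    return sorted(matches)


print(kmer_matches("caggc", "ggcag", 3))  # [(0, 2), (2, 0)]
```

The two pairs correspond to the cag and ggc matches of the example. A run in which one word occurs many times in both sequences produces every pairing, which is where the quadratic worst case comes from.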
Running time
• Worst case: O(NxM)
• In practice: a large value of k results in a short list of matches
Tradeoff:
Low k: worse running time
High k: significant alignments missed
PatternHunter:
Sampling non-consecutive positions increases the likelihood of detecting a conserved region, for a fixed value of k (see Lecture 3)
Suffix Trees
• Suffix trees are a method to find all maximal matches between two strings (and much more)
Example: x = dabdac
[Figure: suffix tree of dabdac, with leaves numbered 1 to 6]
Definition of a Suffix Tree
Definition:
For string x = x1…xm, a suffix tree is:
A rooted tree with m leaves
Leaf i: xi…xm
Each edge is a substring
No two edges out of the same node start with the same letter
It follows that every substring of x corresponds to an initial part of a path from the root to a leaf
Constructing a Suffix Tree
• Naïve algorithm: O( N^2 ) time
• Better algorithms: O( N ) time
(outside the scope of this class – too technical and not so interesting)
Memory: O( N ) but with a significant constant
Naïve Algorithm to Construct a Suffix Tree
1. Initialize tree T: a single root node r
2. Insert special symbol $ at end of x
3. For j = 1 to m (with m now counting the appended $)
• Find longest match of xj…xm to T, starting from r
• Split edge where match stops (if it stops inside an edge): new node w
• Create an edge from w to a new leaf j, and label it with the unmatched portion of xj…xm
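The naïve construction above can be sketched as follows. This is an illustrative quadratic-time version that stores edge labels as copied substrings rather than (i, j) index pairs; the names `Node`, `build_suffix_tree`, `contains` and `leaves` are hypothetical, and the sketch assumes x itself does not contain the symbol $.

```python
class Node:
    def __init__(self):
        self.children = {}  # first letter of edge label -> (edge label, child Node)
        self.leaf = None    # 1-based start of the suffix, set on leaves only


def build_suffix_tree(x):
    """Naive O(N^2) construction; assumes '$' does not occur in x."""
    s = x + "$"
    root = Node()
    for j in range(len(s)):
        node, rest = root, s[j:]
        while True:
            if rest[0] not in node.children:
                # The match stops at a node: hang a new leaf directly off it.
                leaf = Node()
                leaf.leaf = j + 1
                node.children[rest[0]] = (rest, leaf)
                break
            label, child = node.children[rest[0]]
            n = 0  # length of the common prefix of the edge label and the suffix
            while n < len(label) and n < len(rest) and label[n] == rest[n]:
                n += 1
            if n == len(label):
                # The whole edge matched: keep walking down.
                node, rest = child, rest[n:]
                continue
            # The match stops inside the edge: split it at a new node w.
            w = Node()
            w.children[label[n]] = (label[n:], child)
            node.children[rest[0]] = (label[:n], w)
            leaf = Node()
            leaf.leaf = j + 1
            w.children[rest[n]] = (rest[n:], leaf)
            break
    return root


def contains(root, p):
    """True if p labels an initial part of some root-to-leaf path."""
    node = root
    while p:
        if p[0] not in node.children:
            return False
        label, child = node.children[p[0]]
        n = min(len(label), len(p))
        if label[:n] != p[:n]:
            return False
        node, p = child, p[n:]
    return True


def leaves(node):
    """Collect the leaf numbers below a node."""
    if not node.children:
        return [node.leaf]
    return [i for _, child in node.children.values() for i in leaves(child)]


tree = build_suffix_tree("dabda")
print(sorted(leaves(tree)))   # [1, 2, 3, 4, 5, 6]
print(contains(tree, "bda"))  # True
print(contains(tree, "dd"))   # False
```

Because $ occurs only at the end, no suffix is a prefix of another, so every insertion ends by creating exactly one new leaf.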
Example of Suffix Tree Construction
x = d a b d a $
1. Insert d a b d a $
2. Insert a b d a $
3. Insert b d a $
4. Insert d a $
5. Insert a $
6. Insert $
[Figure: the tree after each insertion; the final tree has leaves 1 to 6]
Faster Construction
Several algorithms
O( N ) time,
O( N ) memory with a big constant
Technical but not deep, outside the scope of this course
Optional: Gusfield, chapter 6
Memory to Store Suffix Tree
• Can store in O( N ) memory!
• Every edge is labeled with (i, j):
(i,j) denotes xi…xj
• Tree has O( N ) nodes
Proof:
1. # internal nodes ≤ # leaves – 1, since every internal node has at least two children
2. # leaves = |x| (counting the terminal $)
Application: find all matches between x, y
1. Build suffix tree for x, mark nodes with x
2. Insert y in suffix tree, mark all nodes y “passes from” with y
The path label of every node marked both x and y is a common substring
Example of Suffix Tree construction
x = d a b d a $
y = a b a d a $
1. Construct tree for x
2. Insert a b a d a $
3. Insert b a d a $
4. Insert a d a $
5. Insert d a $
6. Insert a $
7. Insert $
[Figure: the suffix tree of x after inserting the suffixes of y; nodes are marked x, y, or both]
Application: String search on a database
Say we have a database D = { s1, s2, …, sn } (e.g., proteins)
Question: Given new string x, find all matches of x to database
1. Build suffix tree for {s1,…, sn}
2. All new queries x take O( |x| ) time (somewhat like BLAST)
Application: common substrings of k strings
To find the longest common substring of s1, s2, …sn
1. Build suffix tree for s1,…, sn
2. All nodes labeled {si1, …, sik} represent a match between si1, …, sik
Methods to CHAIN Local Alignments
Sparse Dynamic Programming: O(N log N)
The Problem: Find a Chain of Local Alignments
Chaining (x, y) → (x’, y’) requires x < x’ and y < y’
Each local alignment has a weight
FIND the chain with highest total weight
Quadratic Time Solution
• Build Directed Acyclic Graph (DAG):
Nodes: local alignments [(xa,xb) (ya,yb)] & score
Directed edges: local alignments that can be chained
• edge ( (xa, xb, ya, yb), (xc, xd, yc, yd) )
• xa < xb < xc < xd
• ya < yb < yc < yd
Each local alignment is a node vi with alignment score si
Quadratic Time Solution
Dynamic programming:
Initialization:
Find each node va s.t. there is no edge (u, va)
Set score of V(a) to be sa
Iteration:
For each vi, optimal path ending in vi has total score:
V(i) = max over all edges (vj, vi) of ( weight(vj, vi) + V(j) )
Termination:
Optimal global chain:
j = argmax ( V(j) ); trace chain from vj
Worst case time: quadratic
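A minimal sketch of this quadratic dynamic program follows. It makes two assumptions that the slides do not state: the edge weight weight(vj, vi) is taken to be simply the score si of the alignment being entered (no gap penalty), and alignments are passed as tuples (xa, xb, ya, yb, score). The function name `chain_quadratic` is hypothetical.

```python
def chain_quadratic(alignments):
    """alignments: list of (xa, xb, ya, yb, score). Returns (best score, chain of indices)."""
    # Processing in order of start coordinate xa guarantees that every possible
    # predecessor of an alignment has already been scored.
    order = sorted(range(len(alignments)), key=lambda i: alignments[i][0])
    V, back = {}, {}
    for pos, i in enumerate(order):
        xa, xb, ya, yb, s = alignments[i]
        V[i], back[i] = s, None  # a chain may start at i
        for j in order[:pos]:
            _, xb_j, _, yb_j, _ = alignments[j]
            # Edge (vj, vi) exists only if j ends strictly before i starts, in x and in y.
            if xb_j < xa and yb_j < ya and V[j] + s > V[i]:
                V[i], back[i] = V[j] + s, j
    if not V:
        return 0, []
    # Termination: start from the best-scoring node and follow back pointers.
    end = max(V, key=V.get)
    best, chain = V[end], []
    while end is not None:
        chain.append(end)
        end = back[end]
    return best, chain[::-1]


example = [
    (0, 2, 0, 2, 5),
    (3, 5, 6, 8, 6),
    (4, 6, 3, 4, 3),
    (7, 9, 5, 7, 4),
    (10, 12, 9, 11, 2),
]
print(chain_quadratic(example))  # (14, [0, 2, 3, 4])
```

The inner loop examines every earlier alignment for every alignment, which is the quadratic cost that sparse dynamic programming avoids.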
Sparse Dynamic Programming
Back to the LCS problem:
• Given two sequences x = x1, …, xm
y = y1, …, yn
• Find the longest common subsequence
Quadratic solution with DP
• How about when “hits” xi = yj are sparse?
Sparse Dynamic Programming
x = 15 3 24 16 20 4 24 3 11 18
y = 4 20 24 3 11 15 11 4 18 20
[Figure: grid with x along the columns and y along the rows; the hits xi = yj are marked]
• Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead
Sparse Dynamic Programming – L.I.S.
• Longest Increasing Subsequence
• Given a sequence over an ordered alphabet
x = x1, …, xm
• Find the longest subsequence
s = s1, …, sk
such that s1 < s2 < … < sk
Sparse LCS expressed as LIS
Create a sequence w
• Every matching point x-to-y, (i, j), is inserted into w as follows:
• For each position j of x, from smallest to largest, insert in w the points (i, j), in decreasing order of i
• The 11 example points are inserted in the order given
• Any two points (ya, xa), (yb, xb) can be chained iff
a is before b in w, and ya < yb
[Figure: the x-by-y grid with the 11 match points numbered 1 to 11 in insertion order]
Sparse LCS expressed as LIS
Create a sequence w
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
Consider now w’s elements as ordered lexicographically, where
• (ya, xa) < (yb, xb) if ya < yb
Claim: An increasing subsequence of w is a common subsequence of x and y
Sparse Dynamic Programming for LIS
• Algorithm:
initialize empty array L
/* at each point, lj holds the smallest last element among all j-long increasing subsequences seen so far */
for i = 1 to |w|
binary search for w[i] in L, to find lj < w[i] ≤ lj+1
replace lj+1 with w[i] (append w[i] if L has no lj+1)
keep a backptr from w[i] to lj
Sparse Dynamic Programming for LIS
Example:
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
L =
1. (4,2)
2. (3,3)
3. (3,3) (10,5)
4. (2,5) (10,5)
5. (2,5) (8,6)
6. (1,6) (8,6)
7. (1,6) (3,7)
8. (1,6) (3,7) (4,8)
9. (1,6) (3,7) (4,8) (7,9)
10. (1,6) (3,7) (4,8) (5,9)
11. (1,6) (3,7) (4,8) (5,9) (9,10)
Longest common subsequence:
s = 4, 24, 3, 11, 18
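The reduction from sparse LCS to LIS can be sketched in Python with the standard `bisect` module. The function name `sparse_lcs` is hypothetical. Note one difference from the worked example: the sketch enumerates every hit between x and y, so it also includes the match on the value 15, which is not among the 11 points listed above; the resulting subsequence is the same.

```python
from bisect import bisect_left
from collections import defaultdict


def sparse_lcs(x, y):
    """Longest common subsequence of x and y via LIS over the match points."""
    # Index the positions (1-based) of every symbol of y.
    where = defaultdict(list)
    for i, sym in enumerate(y, start=1):
        where[sym].append(i)

    # Build w: columns j of x in increasing order, rows i in decreasing order
    # within a column, so two points of one column can never be chained.
    w = [(i, j)
         for j, sym in enumerate(x, start=1)
         for i in reversed(where.get(sym, []))]

    tails = []     # tails[t] = smallest row ending an increasing run of length t+1
    tail_pt = []   # index in w of the point currently stored in tails[t]
    back = [None] * len(w)
    for idx, (i, _) in enumerate(w):
        t = bisect_left(tails, i)  # tails[t-1] < i <= tails[t]
        back[idx] = tail_pt[t - 1] if t > 0 else None
        if t == len(tails):
            tails.append(i)
            tail_pt.append(idx)
        else:
            tails[t] = i
            tail_pt[t] = idx

    # Follow back pointers from the end of the longest run.
    chain = []
    idx = tail_pt[-1] if tail_pt else None
    while idx is not None:
        chain.append(w[idx])
        idx = back[idx]
    chain.reverse()
    return [x[j - 1] for _, j in chain]


x = [15, 3, 24, 16, 20, 4, 24, 3, 11, 18]
y = [4, 20, 24, 3, 11, 15, 11, 4, 18, 20]
print(sparse_lcs(x, y))  # [4, 24, 3, 11, 18]
```

Each point costs one binary search, so the running time is O(h log h) for h hits, plus the time to enumerate them.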
Sparse DP for rectangle chaining
• 1,…, N: rectangles
• (hj, lj): y-coordinates of rectangle j
• w(j): weight of rectangle j
• V(j): optimal score of chain ending in j
• L: list of triplets (lj, V(j), j)
L is sorted by lj
L is implemented as a balanced binary tree
[Figure: a rectangle with its y-coordinates h and l marked on the y axis]
Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from lowest to highest:
1. When on the leftmost end of i:
a. j: rectangle in L, with largest lj < hi
b. V(i) = w(i) + V(j)
2. When on the rightmost end of i:
a. j: rectangle in L, with largest lj ≤ li
b. If V(i) > V(j), or no such j exists:
i. INSERT (li, V(i), i) in L
ii. REMOVE all (lk, V(k), k) with V(k) ≤ V(i) & lk ≥ li
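The sweep above can be sketched as follows, under several assumptions not stated in the slides: rectangles are given as (x_start, x_end, h, l, weight) with h < l, so h is the y-coordinate where a rectangle starts and l where it ends; all weights are positive; and L is kept as a sorted Python list searched with `bisect` instead of a balanced binary tree. Searches are O(log N), but list insertion and deletion are not, so this shows the logic rather than the O(N log N) bound. The name `chain_rectangles` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def chain_rectangles(rects):
    """rects: list of (x_start, x_end, h, l, weight), h < l, weight > 0.
    Returns (best score, chain of rectangle indices)."""
    # Left ends sort before right ends at equal x, so a rectangle ending at x
    # is not yet available to one starting at the same x.
    events = []
    for i, (xs, xe, _, _, _) in enumerate(rects):
        events.append((xs, 0, i))
        events.append((xe, 1, i))
    events.sort()

    keys = []    # the l values in L, kept sorted
    items = []   # parallel list of (V(j), j); V increases along L
    V = [0] * len(rects)
    back = [None] * len(rects)
    for _, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:
            # Leftmost end of i: best predecessor is the entry with largest lj < hi.
            p = bisect_left(keys, h)
            V[i] = w
            if p > 0:
                V[i] += items[p - 1][0]
                back[i] = items[p - 1][1]
        else:
            # Rightmost end of i: compare with the entry with largest lj <= li.
            p = bisect_right(keys, l)
            if p == 0 or V[i] > items[p - 1][0]:
                # Entries with lk >= li and V(k) <= V(i) are consecutive: drop them.
                q = bisect_left(keys, l)
                r = q
                while r < len(keys) and items[r][0] <= V[i]:
                    r += 1
                keys[q:r] = [l]
                items[q:r] = [(V[i], i)]

    if not rects:
        return 0, []
    end = max(range(len(rects)), key=lambda i: V[i])
    best, chain = V[end], []
    while end is not None:
        chain.append(end)
        end = back[end]
    return best, chain[::-1]


example = [
    (0, 2, 0, 2, 5),
    (3, 5, 6, 8, 6),
    (4, 6, 3, 4, 3),
    (7, 9, 5, 7, 4),
    (10, 12, 9, 11, 2),
]
print(chain_rectangles(example))  # (14, [0, 2, 3, 4])
```

In this run, the entry for rectangle 1 is removed from L when rectangle 3 ends, because rectangle 3 ends at a smaller y-coordinate with a higher chain score.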
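Example
[Figure: five rectangles in the x-y plane, labelled rectangle: weight as 1: 5, 2: 6, 3: 3, 4: 4, 5: 2; coordinate ticks at 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]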
Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
• Searching L takes log N
• Inserting to L takes log N
• All deletions are consecutive, so log N per deletion
• Each element is deleted at most once: N log N for all deletions
• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree