Sequence Alignment. CS262 Lecture 3, Win06, Batzoglou Sequence Alignment...
-
date post
22-Dec-2015 -
Category
Documents
-
view
215 -
download
0
Transcript of Sequence Alignment. CS262 Lecture 3, Win06, Batzoglou Sequence Alignment...
Sequence Alignment
CS262 Lecture 3, Win06, Batzoglou
Sequence Alignment
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
DefinitionGiven two strings x = x1x2...xM, y = y1y2…yN,
an alignment is an assignment of gaps to positions0,…, N in x, and 0,…, N in y, so as to line up each
letter in one sequence with either a letter, or a gapin the other sequence
AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCGGTCGATTTGCCCGAC
CS262 Lecture 3, Win06, Batzoglou
The Needleman-Wunsch Algorithm
1. Initialization.F(0, 0) = F(0, j) = F(i, 0) = 0
2. Main Iteration. a. For each i = 1……M
For each j = 1……N F(i-1,j-1) + s(xi,
yj) [case 1] F(i, j) = max F(i-1, j) – d [case 2] F(i, j-1) – d [case 3]
DIAG, if [case 1] Ptr(i,j) = LEFT, if [case 2]
UP, if [case 3]
3. Termination. F(M, N) is the optimal score, andfrom Ptr(M, N) can trace back optimal alignment
CS262 Lecture 3, Win06, Batzoglou
The Smith-Waterman algorithm
Idea: Ignore badly aligning regions
Modifications to Needleman-Wunsch:
Initialization: F(0, j) = F(i, 0) = 0
0
Iteration: F(i, j) = max F(i – 1, j) – d
F(i, j – 1) – d
F(i – 1, j – 1) + s(xi, yj)
CS262 Lecture 3, Win06, Batzoglou
Scoring the gaps more accurately
Simple, linear gap model:Gap of length nincurs penalty nd
However, gaps usually occur in bunches
Convex gap penalty function:(n):for all n, (n + 1) - (n) (n) - (n – 1)
Algorithm: O(N3) time, O(N2) space
(n)
(n)
CS262 Lecture 3, Win06, Batzoglou
Compromise: affine gaps
(n) = d + (n – 1)e | | gap gap open extend
To compute optimal alignment,
At position i, j, need to “remember” best score if gap is open best score if gap is not open
F(i, j): score of alignment x1…xi to y1…yj
ifif xi aligns to yj
G(i, j): score ifif xi aligns to a gap after yj
H(i, j): score ifif yj aligns to a gap after xi
V(i, j) = best score of alignment x1…xi to y1…yj
de
(n)
CS262 Lecture 3, Win06, Batzoglou
Needleman-Wunsch with affine gaps
Why do we need matrices F, G, H?
• xi aligns to yj
x1……xi-1 xi xi+1
y1……yj-1 yj -
2. xi aligns to a gap
x1……xi-1 xi xi+1
y1……yj …- -
Add -d
Add -e
G(i+1, j) = V(i, j) – d
G(i+1, j) = G(i, j) – e
Because, perhaps
G(i, j) < V(i, j)
(it is best to align xi to yj if we were aligningonly x1…xi to y1…yj and not the rest of x, y),
but on the contrary
G(i, j) – e > V(i, j) – d
(i.e., had we “fixed” our decision that xi alignsto yj, we could regret it at the next step whenaligning x1…xi+1 to y1…yj)
CS262 Lecture 3, Win06, Batzoglou
Needleman-Wunsch with affine gaps
Initialization: V(i, 0) = d + (i – 1)eV(0, j) = d + (j – 1)e
Iteration:V(i, j) = max{ F(i, j), G(i, j), H(i, j) }
F(i, j) = V(i – 1, j – 1) + s(xi, yj)
V(i – 1, j) – d G(i, j) = max
G(i – 1, j) – e
V(i, j – 1) – d H(i, j) = max
H(i, j – 1) – e
Termination: V(i, j) has the best alignment
Time?Space?
CS262 Lecture 3, Win06, Batzoglou
To generalize a little…
… think of how you would compute optimal alignment with this gap function
….in time O(MN)
(n)
CS262 Lecture 3, Win06, Batzoglou
Bounded Dynamic Programming
Assume we know that x and y are very similar
Assumption: # gaps(x, y) < k(N)
xi Then, | implies | i – j | < k(N)
yj
We can align x and y more efficiently:
Time, Space: O(N k(N)) << O(N2)
CS262 Lecture 3, Win06, Batzoglou
Bounded Dynamic Programming
Initialization:
F(i,0), F(0,j) undefined for i, j > k
Iteration:
For i = 1…M
For j = max(1, i – k)…min(N, i+k)
F(i – 1, j – 1)+ s(xi, yj)
F(i, j) = max F(i, j – 1) – d, if j > i – k(N)
F(i – 1, j) – d, if j < i + k(N)
Termination: same
Easy to extend to the affine gap case
x1 ………………………… xM
y1 …
……
……
……
……
…
yN
k(N)
CS262 Lecture 3, Win06, Batzoglou
Linear-Space Alignment
CS262 Lecture 3, Win06, Batzoglou
Subsequences and Substrings
Definition A string x’ is a substring of a string x,if x = ux’v for some prefix string u and suffix string v
(similarly, x’ = xi…xj, for some 1 i j |x|)
A string x’ is a subsequence of a string xif x’ can be obtained from x by deleting 0 or more letters
(x’ = xi1…xik, for some 1 i1 … ik |x|)
Note: a substring is always a subsequence
Example: x = abracadabray = cadabr; substringz = brcdbr; subseqence, not substring
CS262 Lecture 3, Win06, Batzoglou
Hirschberg’s algortihm
Given a set of strings x, y,…, a common subsequence is a string u that is a subsequence of all strings x, y, …
• Longest common subsequence Given strings x = x1 x2 … xM, y = y1 y2 … yN, Find longest common subsequence u = u1 … uk
• Algorithm:F(i – 1, j)
• F(i, j) = max F(i, j – 1)F(i – 1, j – 1) + [1, if xi = yj; 0 otherwise]
• Ptr(i, j) = (same as in N-W)
• Termination: trace back from Ptr(M, N), and prepend a letter to u whenever • Ptr(i, j) = DIAG and F(i – 1, j – 1) < F(i, j)
• Hirschberg’s algorithm solves this in linear space
CS262 Lecture 3, Win06, Batzoglou
F(i,j)
Introduction: Compute optimal score
It is easy to compute F(M, N) in linear space
Allocate ( column[1] )
Allocate ( column[2] )
For i = 1….M
If i > 1, then:
Free( column[i – 2] )
Allocate( column[ i ] )
For j = 1…N
F(i, j) = …
CS262 Lecture 3, Win06, Batzoglou
Linear-space alignment
To compute both the optimal score and the optimal alignment:
Divide & Conquer approach:
Notation:
xr, yr: reverse of x, y
E.g. x = accgg;
xr = ggcca
Fr(i, j): optimal score of aligning xr1…xr
i & yr1…yr
j
same as aligning xM-i+1…xM & yN-j+1…yN
CS262 Lecture 3, Win06, Batzoglou
Linear-space alignment
Lemma: (assume M is even)
F(M, N) = maxk=0…N( F(M/2, k) + Fr(M/2, N-k) )
x
y
M/2
k*
F(M/2, k) Fr(M/2, N-k)
CS262 Lecture 3, Win06, Batzoglou
Linear-space alignment
• Now, using 2 columns of space, we can compute
for k = 1…M, F(M/2, k), Fr(M/2, N-k)
PLUS the backpointers
CS262 Lecture 3, Win06, Batzoglou
Linear-space alignment
• Now, we can find k* maximizing F(M/2, k) + Fr(M/2, N-k)
• Also, we can trace the path exiting column M/2 from k*
k*
k*+1
0 1 …… M/2 M/2+1 …… M M+1
CS262 Lecture 3, Win06, Batzoglou
Linear-space alignment
• Iterate this procedure to the left and right!
N-k*
M/2M/2
k*
CS262 Lecture 3, Win06, Batzoglou
Linear-space alignment
Hirschberg’s Linear-space algorithm:
MEMALIGN(l, l’, r, r’): (aligns xl…xl’ with yr…yr’)
1. Let h = (l’-l)/22. Find (in Time O((l’ – l) (r’-r)), Space O(r’-r))
the optimal path, Lh, entering column h-1, exiting column hLet k1 = pos’n at column h – 2 where Lh enters
k2 = pos’n at column h + 1 where Lh exits
3. MEMALIGN(l, h-2, r, k1)
4. Output Lh
5. MEMALIGN(h+1, l’, k2, r’)
Top level call: MEMALIGN(1, M, 1, N)
CS262 Lecture 3, Win06, Batzoglou
Linear-space alignment
Time, Space analysis of Hirschberg’s algorithm: To compute optimal path at middle column,
For box of size M N,Space: 2NTime: cMN, for some constant c
Then, left, right calls cost c( M/2 k* + M/2 (N-k*) ) = cMN/2
All recursive calls cost Total Time: cMN + cMN/2 + cMN/4 + ….. = 2cMN = O(MN)
Total Space: O(N) for computation,O(N+M) to store the optimal alignment
CS262 Lecture 3, Win06, Batzoglou
Heuristic Local Alignerers
1. The basic indexing & extension technique
2. Indexing: techniques to improve sensitivityPairs of Words, Patterns
3. Systems for local alignment
CS262 Lecture 3, Win06, Batzoglou
State of biological databases
http://www.cbs.dtu.dk/databases/DOGS/
~10x per3 years
CS262 Lecture 3, Win06, Batzoglou
State of biological databases
• Number of genes in these genomes:
Mammals: ~24,000 Insects: ~14,000 Worms: ~17,000 Fungi: ~6,000-10,000
Small organisms: 100s-1,000s
• Each known or predicted gene has one or more associated protein sequences
• >1,000,000 known / predicted protein sequences
CS262 Lecture 3, Win06, Batzoglou
Some useful applications of alignments
• Given a newly discovered gene, Does it occur in other species? How fast does it evolve?
• Assume we try Smith-Waterman:
The entire genomic database
Our new gene
104
1010 - 1012
CS262 Lecture 3, Win06, Batzoglou
Some useful applications of alignments
• Given a newly sequenced organism,• Which subregions align with other organisms?
Potential genes Other biological characteristics
• Assume we try Smith-Waterman:
The entire genomic database
Our newly sequenced mammal
3109
1010 - 1012
CS262 Lecture 3, Win06, Batzoglou
Indexing-based local alignment
(BLAST- Basic Local Alignment Search Tool)
Main idea:
1. Construct a dictionary of all the words in the query
2. Initiate a local alignment for each word match between query and DB
Running Time: O(MN)
However, orders of magnitude faster than Smith-Waterman
query
DB
CS262 Lecture 3, Win06, Batzoglou
Indexing-based local alignment
Dictionary:
All words of length k (~10)
Alignment initiated between words of alignment score T
(typically T = k)
Alignment:
Ungapped extensions until score
below statistical threshold
Output:
All local alignments with score
> statistical threshold
……
……
query
DB
query
scan
CS262 Lecture 3, Win06, Batzoglou
Indexing-based local alignment—Extensions
A C G A A G T A A G G T C C A G T
C
C
C
T
T
C C
T
G
G
A T
T
G
C
G
A
Example:
k = 4
The matching word GGTC initiates an alignment
Extension to the left and right with no gaps until alignment falls < C below best so far
Output:
GTAAGGTCC
GTTAGGTCC
CS262 Lecture 3, Win06, Batzoglou
Indexing-based local alignment—Extensions
A C G A A G T A A G G T C C A G T
C
T
G
A
T
C C
T
G
G
A
T
T
G C
G
A
Gapped extensions
• Extensions with gaps in a band around anchor
Output:
GTAAGGTCCAGTGTTAGGTC-AGT
CS262 Lecture 3, Win06, Batzoglou
Indexing-based local alignment—Extensions
A C G A A G T A A G G T C C A G T
C
T
G
A
T
C C
T
G
G
A
T
T
G C
G
A
Gapped extensions until threshold
• Extensions with gaps until score < C below best score so far
Output:
GTAAGGTCCAGTGTTAGGTC-AGT
CS262 Lecture 3, Win06, Batzoglou
Sensitivity-Speed Tradeoff
long words
(k = 15)
short words
(k = 7)
Sensitivity Speed
Kent WJ, Genome Research 2002
Sens.
Speed
X%
CS262 Lecture 3, Win06, Batzoglou
Sensitivity-Speed Tradeoff
Methods to improve sensitivity/speed
1. Using pairs of words
2. Using inexact words
3. Patterns—non consecutive positions
……ATAACGGACGACTGATTACACTGATTCTTAC……
……GGCACGGACCAGTGACTACTCTGATTCCCAG……
……ATAACGGACGACTGATTACACTGATTCTTAC……
……GGCGCCGACGAGTGATTACACAGATTGCCAG……
TTTGATTACACAGAT T G TT CAC G
CS262 Lecture 3, Win06, Batzoglou
Measured improvement
Kent WJ, Genome Research 2002
CS262 Lecture 3, Win06, Batzoglou
Non-consecutive words—Patterns
Patterns increase the likelihood of at least one match within a long conserved region
3 common
5 common
7 common
Consecutive Positions Non-Consecutive Positions
6 common
On a 100-long 70% conserved region: Consecutive Non-consecutive
Expected # hits: 1.07 0.97Prob[at least one hit]: 0.30 0.47
CS262 Lecture 3, Win06, Batzoglou
Advantage of Patterns
11 positions
11 positions
10 positions
CS262 Lecture 3, Win06, Batzoglou
Multiple patterns
• K patterns Takes K times longer to scan Patterns can complement one another
• Computational problem: Given: a model (prob distribution) for homology between two regions Find: best set of K patterns that maximizes Prob(at least one match)
TTTGATTACACAGAT T G TT CAC G T G T C CAG TTGATT A G
Buhler et al. RECOMB 2003Sun & Buhler RECOMB 2004
How long does it take to search the query?
CS262 Lecture 3, Win06, Batzoglou
Variants of BLAST
• NCBI BLAST: search the universe http://www.ncbi.nlm.nih.gov/BLAST/• MEGABLAST: http://genopole.toulouse.inra.fr/blast/megablast.html
Optimized to align very similar sequences• Works best when k = 4i 16• Linear gap penalty
• WU-BLAST: (Wash U BLAST) http://blast.wustl.edu/ Very good optimizations Good set of features & command line arguments
• BLAT http://genome.ucsc.edu/cgi-bin/hgBlat Faster, less sensitive than BLAST Good for aligning huge numbers of queries
• CHAOS http://www.cs.berkeley.edu/~brudno/chaos Uses inexact k-mers, sensitive
• PatternHunter http://www.bioinformaticssolutions.com/products/ph/index.php Uses patterns instead of k-mers
• BlastZ http://www.psc.edu/general/software/packages/blastz/ Uses patterns, good for finding genes
• Typhon http://typhon.stanford.edu Uses multiple alignments to improve sensitivity/speed tradeoff
CS262 Lecture 3, Win06, Batzoglou
Example
Query: gattacaccccgattacaccccgattaca (29 letters) [2 mins]
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences) 1,726,556 sequences; 8,074,398,388 total letters
>gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence Length = 144487 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus
Query: 4 tacaccccgattacaccccga 24 ||||||| |||||||||||||
Sbjct: 125138 tacacccagattacaccccga 125158
Score = 34.2 bits (17),
Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus
Query: 4 tacaccccgattacaccccga 24 ||||||| |||||||||||||
Sbjct: 125104 tacacccagattacaccccga 125124
>gi|28173089|gb|AC104321.7| Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence Length = 139823 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus
Query: 4 tacaccccgattacaccccga 24 ||||||| |||||||||||||
Sbjct: 3891 tacacccagattacaccccga 3911
CS262 Lecture 3, Win06, Batzoglou
Example
Query: Human atoh enhancer, 179 letters [1.5 min]
Result: 57 blast hits1. gi|7677270|gb|AF218259.1|AF218259 Homo sapiens ATOH1 enhanc... 355 1e-95 2. gi|22779500|gb|AC091158.11| Mus musculus Strain C57BL6/J ch... 264 4e-68 3. gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhanc... 256 9e-66 4. gi|28875397|gb|AF467292.1| Gallus gallus CATH1 (CATH1) gene... 78 5e-12 5. gi|27550980|emb|AL807792.6| Zebrafish DNA sequence from clo... 54 7e-05 6. gi|22002129|gb|AC092389.4| Oryza sativa chromosome 10 BAC O... 44 0.068 7. gi|22094122|ref|NM_013676.1| Mus musculus suppressor of Ty ... 42 0.27 8. gi|13938031|gb|BC007132.1| Mus musculus, Similar to suppres... 42 0.27
gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhancer sequence Length = 1517 Score = 256 bits (129), Expect = 9e-66 Identities = 167/177 (94%),
Gaps = 2/177 (1%) Strand = Plus / Plus Query: 3 tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62 ||||||||||||| ||||||||||||||||||| |||||||||||||||||||||||||| Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203
Query: 63 cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122 |||||||||||||||||||||||||| ||||||||| |||||||||||||||| ||||| Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262
Query: 123 cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179 ||||||||||||| || ||| |||||||||||||||||||| |||||||||||||||
Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318 http://www.ncbi.nlm.nih.gov/BLAST/
CS262 Lecture 3, Win06, Batzoglou
The Four-Russian Algorithmbrief overview
A (not so useful) speedup of Dynamic Programming
[Arlazarov, Dinic, Kronrod, Faradzev 1970]
CS262 Lecture 3, Win06, Batzoglou
Main Observation
Within a rectangle of the DP matrix,values of D depend onlyon the values of A, B, C,
and substrings xl...l’, yr…r’
Definition: A t-block is a t t square of
the DP matrix
Idea: Divide matrix in t-blocks,Precompute t-blocks
Speedup: O(t)
A B
C
D
xl xl’
yr
yr’
t
CS262 Lecture 3, Win06, Batzoglou
The Four-Russian Algorithm
Main structure of the algorithm:
1. Divide NN DP matrix into KK log2N-blocks that overlap by 1 column & 1 row
2. For i = 1……K
3. For j = 1……K
4. Compute Di,j as a function of Ai,j, Bi,j, Ci,j, x[li…l’i], y[rj…r’j]
Time: O(N2 / log2N)
times the cost of step 4t t t
CS262 Lecture 3, Win06, Batzoglou
The Four-Russian Algorithm
t t t