Pairwise Sequence Alignment (I)
![Page 1: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/1.jpg)
Pairwise Sequence Alignment (I)
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Sept. 22, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm
![Page 2: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/2.jpg)
Comparing Genes in Two Genomes
• Small islands of similarity corresponding to similarities between exons
• Such comparisons are quite common in biology research
![Page 3: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/3.jpg)
Alignment of sequences is one of the most basic and most important problems in bioinformatics…
![Page 4: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/4.jpg)
Outline
• Defining the problem of alignment
• The longest common subsequence problem
• Dynamic programming algorithms for alignment
![Page 5: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/5.jpg)
Aligning Two Strings
Given the strings:
• v = ATGTTAT
• w = ATCGTAC
One possible alignment of the strings:
AT_GTTAT_
ATCGT_A_C
1st row – string v with space symbols “_” inserted
2nd row – string w with space symbols “_” inserted
![Page 6: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/6.jpg)
Aligning Two Strings (cont’d)
Another way to represent each row shows the number of symbols of the sequence present up to a given position. For example the above sequences can be represented as:
0 1 2 2 3 4 5 6 7 7
 A T _ G T T A T _
 A T C G T _ A _ C
0 1 2 3 4 5 5 6 6 7
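The counting rule above can be sketched in a few lines of Python. The function name `prefix_counts` and the `gap` parameter are illustrative choices, not part of the lecture; the sketch assumes gaps are written as "_".

```python
def prefix_counts(aligned_row, gap="_"):
    """Return how many real symbols have been seen up to each alignment position."""
    counts = [0]  # before the first column, no symbols have been consumed
    for symbol in aligned_row:
        # A gap leaves the count unchanged; any other symbol increases it by one.
        counts.append(counts[-1] + (symbol != gap))
    return counts


print(prefix_counts("AT_GTTAT_"))  # [0, 1, 2, 2, 3, 4, 5, 6, 7, 7]
print(prefix_counts("ATCGT_A_C"))  # [0, 1, 2, 3, 4, 5, 5, 6, 6, 7]
```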
![Page 7: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/7.jpg)
Alignment Matrix
Both rows of the alignment can be represented in the resulting matrix:
 A T _ G T T A T _
 A T C G T _ A _ C
Alignment matrix:
0 1 2 2 3 4 5 6 7 7
0 1 2 3 4 5 5 6 6 7
![Page 8: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/8.jpg)
Alignment as a Path in the Edit Graph
0 1 2 2 3 4 5 6 7 7
 A T _ G T T A T _
 A T C G T _ A _ C
0 1 2 3 4 5 5 6 6 7
Path so far: (0,0), (1,1)
![Page 9: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/9.jpg)
Alignment as a Path in the Edit Graph
0 1 2 2 3 4 5 6 7 7
 A T _ G T T A T _
 A T C G T _ A _ C
0 1 2 3 4 5 5 6 6 7
Path so far: (0,0), (1,1), (2,2)
![Page 10: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/10.jpg)
Alignment as a Path in the Edit Graph
0 1 2 2 3 4 5 6 7 7
 A T _ G T T A T _
 A T C G T _ A _ C
0 1 2 3 4 5 5 6 6 7
Path so far: (0,0), (1,1), (2,2), (2,3), (3,4)
![Page 11: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/11.jpg)
Alignment as a Path in the Edit Graph
0 1 2 2 3 4 5 6 7 7
 A T _ G T T A T _
 A T C G T _ A _ C
0 1 2 3 4 5 5 6 6 7
(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
- End Result -
![Page 12: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/12.jpg)
Alignment as a Path in the Edit Graph
Every path in the edit graph corresponds to an alignment:
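As a minimal sketch of this correspondence, the Python function below turns a path, given as a list of (i, j) vertices, back into the two rows of an alignment. The name `path_to_alignment` is hypothetical. The sketch assumes that i counts consumed symbols of v, that j counts consumed symbols of w, and that gaps are written as "_".

```python
def path_to_alignment(v, w, path, gap="_"):
    """Convert an edit-graph path (list of (i, j) vertices) into two aligned rows."""
    v_row, w_row = [], []
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        step = (i1 - i0, j1 - j0)
        if step == (1, 1):    # diagonal edge: one symbol from each string
            v_row.append(v[i0])
            w_row.append(w[j0])
        elif step == (1, 0):  # only v advances: gap in w
            v_row.append(v[i0])
            w_row.append(gap)
        elif step == (0, 1):  # only w advances: gap in v
            v_row.append(gap)
            w_row.append(w[j0])
        else:
            raise ValueError(f"not an edit-graph edge: {(i0, j0)} -> {(i1, j1)}")
    return "".join(v_row), "".join(w_row)


path = [(0, 0), (1, 1), (2, 2), (2, 3), (3, 4), (4, 5),
        (5, 5), (6, 6), (7, 6), (7, 7)]
print(path_to_alignment("ATGTTAT", "ATCGTAC", path))
# ('AT_GTTAT_', 'ATCGT_A_C')
```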
![Page 13: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/13.jpg)
How to Score an Alignment?
• Simplest
– Every match scores 1
– Every mismatch scores 0
– An alignment is scored based on the number of common symbols
– Leads to the longest common subsequence problem
• More sophisticated
– ?
– ?
– To be covered later
![Page 14: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/14.jpg)
Alignments in Edit Graph (cont’d)
→ and ↓ edges represent indels in v and w
• Score 0.
↘ edges represent exact matches.
• Score 1.
![Page 15: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/15.jpg)
Alignments in Edit Graph (cont’d)
The score of the alignment path in the graph is 5.
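The simple scoring scheme can be checked with a short Python sketch. The function name `simple_score` is illustrative, and the sketch assumes gaps are written as "_". A column scores 1 only when both rows hold the same non-gap symbol; every other column scores 0.

```python
def simple_score(v_row, w_row, gap="_"):
    """Score an alignment: 1 per matching column, 0 for mismatches and indels."""
    if len(v_row) != len(w_row):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(1 for a, b in zip(v_row, w_row) if a == b and a != gap)


print(simple_score("AT_GTTAT_", "ATCGT_A_C"))  # 5
```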
![Page 16: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/16.jpg)
The Longest Common Subsequence (LCS) Problem
• Find the longest subsequence common to two strings.
Input: Two strings, v and w.
Output: The longest common subsequence of v and w.
A subsequence is not necessarily consecutive
v = ATGTTAT
w = ATCGTAC
v = AT_GTTAT_
    || || |
w = ATCGT_A_C
Longest common subsequence: “ATGTA”
The longest common subsequence corresponds to the best alignment
![Page 17: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/17.jpg)
How to solve the LCS problem efficiently?
![Page 18: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/18.jpg)
Brute Force Approach
• Enumerate all the sequences up to length min(|v|,|w|)
• For each one, check to see if it is a subsequence of v and w
• Very expensive… (How many sequences do we have to enumerate?)
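To make the cost concrete, here is a brute-force sketch in Python. It is a slight variation on the steps above: instead of enumerating every possible sequence, it enumerates only the subsequences of the shorter string (still up to 2^n candidates for a string of length n) and checks each against the other string. The names `is_subsequence` and `brute_force_lcs` are illustrative.

```python
from itertools import combinations


def is_subsequence(candidate, text):
    """True if candidate can be obtained from text by deleting symbols."""
    remaining = iter(text)
    # "symbol in remaining" advances the iterator, so order is respected.
    return all(symbol in remaining for symbol in candidate)


def brute_force_lcs(v, w):
    """Try candidates from longest to shortest; exponential in the shorter length."""
    short, long_ = (v, w) if len(v) <= len(w) else (w, v)
    for length in range(len(short), 0, -1):
        for positions in combinations(range(len(short)), length):
            candidate = "".join(short[p] for p in positions)
            if is_subsequence(candidate, long_):
                return candidate
    return ""


print(brute_force_lcs("ATGTTAT", "ATCGTAC"))  # ATGTA
```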
![Page 19: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/19.jpg)
The Idea of Dynamic Programming
• Think of an alignment as a path in an edit graph
• We only need to keep track of the best alignment (i.e., the longest common subsequence)
• Score a longer alignment based on shorter alignments
![Page 20: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/20.jpg)
Alignment as a Path in the Edit Graph
    0 1 2 2 3 4 5 6 7 7
v =  A T _ G T T A T _
w =  A T C G T _ A _ C
    0 1 2 3 4 5 5 6 6 7
(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
Use each cell to store the best alignment so far…
![Page 21: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/21.jpg)
Alignment: Dynamic Programming
Use this scoring algorithm
$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\
s_{i-1,j} \\
s_{i,j-1}
\end{cases}
$$
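The recurrence can be transcribed almost literally as a memoized recursive function. This is only a sketch: the name `lcs_length_recursive` is hypothetical, the boundary values s(i, 0) = s(0, j) = 0 are assumed, and Python strings are 0-indexed, so v_i becomes `v[i - 1]`.

```python
from functools import lru_cache


def lcs_length_recursive(v, w):
    """Evaluate s(n, m) top-down, caching each s(i, j) so it is computed once."""

    @lru_cache(maxsize=None)
    def s(i, j):
        if i == 0 or j == 0:       # assumed boundary: an empty prefix has no matches
            return 0
        best = max(s(i - 1, j), s(i, j - 1))
        if v[i - 1] == w[j - 1]:   # v_i = w_j in the 1-indexed notation of the slides
            best = max(best, s(i - 1, j - 1) + 1)
        return best

    return s(len(v), len(w))


print(lcs_length_recursive("ATGTTAT", "ATCGTAC"))  # 5
```

For long sequences the recursion depth becomes a problem in Python, which is one reason to fill the table iteratively instead.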
![Page 22: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/22.jpg)
Dynamic Programming Example
• An empty prefix of either sequence contains no matches
• Initialize the boundary of the grid to zero: $s_{i,0} = 0$ for all i and $s_{0,j} = 0$ for all j
![Page 23: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/23.jpg)
Dynamic Programming Example
$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{value from NW} + 1, \text{ if } v_i = w_j \\
s_{i-1,j} & \text{value from North (top)} \\
s_{i,j-1} & \text{value from West (left)}
\end{cases}
$$
Keep track of the best alignment score and the path contributing to it
![Page 24: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/24.jpg)
Alignment: Backtracking
Arrows show where the score originated from.
↑ if from the top
← if from the left
↖ if vi = wj
![Page 25: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/25.jpg)
Dynamic Programming Example
Continuing with the scoring algorithm gives this result.
![Page 26: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/26.jpg)
LCS Algorithm
LCS(v, w)
  for i ← 0 to n
    s_{i,0} ← 0
  for j ← 0 to m
    s_{0,j} ← 0
  for i ← 1 to n
    for j ← 1 to m
      s_{i,j} ← max { s_{i-1,j},  s_{i,j-1},  s_{i-1,j-1} + 1 if v_i = w_j }
      b_{i,j} ← “↑” if s_{i,j} = s_{i-1,j}
                “←” if s_{i,j} = s_{i,j-1}
                “↖” if s_{i,j} = s_{i-1,j-1} + 1
  return (s_{n,m}, b)
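A Python rendering of this pseudocode might look as follows. It is a sketch rather than the lecture's own code: the strings "up", "left" and "diag" stand in for the arrows, boundary cells keep `None` as their pointer, and indices are shifted by one because Python strings start at 0.

```python
def lcs(v, w):
    """Fill the score table s and the backtracking pointers b; return (s[n][m], b)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]     # boundary row and column stay 0
    b = [[None] * (m + 1) for _ in range(n + 1)]  # pointers for interior cells only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1])
            if v[i - 1] == w[j - 1]:
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)
            # Same tie-breaking order as the pseudocode: top, then left, then diagonal.
            if s[i][j] == s[i - 1][j]:
                b[i][j] = "up"
            elif s[i][j] == s[i][j - 1]:
                b[i][j] = "left"
            else:
                b[i][j] = "diag"
    return s[n][m], b


score, pointers = lcs("ATGTTAT", "ATCGTAC")
print(score)  # 5
```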
![Page 27: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/27.jpg)
Now What?
• LCS(v,w) created the alignment grid
• Now we need a way to read the best alignment of v and w
• Follow the arrows backwards from the (|v|,|w|) cell
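The backtracking step can be sketched as below. To keep the example self-contained, it does not store the arrows; it re-derives each arrow from the score table while walking back from the (|v|,|w|) cell, which is an equivalent variation. The function name `lcs_alignment` and the "_" gap symbol are assumptions for illustration.

```python
def lcs_alignment(v, w, gap="_"):
    """Return one best alignment of v and w under the simple (LCS) scoring."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1])
            if v[i - 1] == w[j - 1]:
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)

    v_row, w_row = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and s[i][j] == s[i - 1][j]:    # score came from the top: gap in w
            v_row.append(v[i - 1])
            w_row.append(gap)
            i -= 1
        elif j > 0 and s[i][j] == s[i][j - 1]:  # score came from the left: gap in v
            v_row.append(gap)
            w_row.append(w[j - 1])
            j -= 1
        else:                                   # diagonal: v_i and w_j match
            v_row.append(v[i - 1])
            w_row.append(w[j - 1])
            i -= 1
            j -= 1
    # The rows were built from the end of the alignment, so reverse them.
    return "".join(reversed(v_row)), "".join(reversed(w_row))


v_row, w_row = lcs_alignment("ATGTTAT", "ATCGTAC")
print(v_row)
print(w_row)
```

Several alignments can share the best score, so the rows printed here need not be identical to the alignment shown on the earlier slides; they contain the same number of matching columns.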
![Page 28: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/28.jpg)
LCS Runtime
• Creating the n × m matrix of best scores from vertex (0,0) to all other vertices takes O(nm) time.
• Why O(nm)? The pseudocode consists of a “for” loop over j nested inside a “for” loop over i, so the constant-time update of a cell runs n × m times.
![Page 29: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/29.jpg)
How do we improve the scoring of alignments?
Can we still find an alignment efficiently?
We’ll talk about these later…
![Page 30: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/30.jpg)
The LCS Recurrence Revisited
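• The formula can be rewritten by adding zero to the edges that come from an indel, since the penalty of an indel is 0:
$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \quad \text{(matching score)} \\
s_{i-1,j} + 0 & \text{(insertion/deletion score)} \\
s_{i,j-1} + 0 & \text{(insertion/deletion score)}
\end{cases}
$$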
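Writing the zero explicitly suggests treating the two scores as parameters. The sketch below does that; the function name `alignment_score` and the parameters `match` and `indel` are hypothetical, and the way the boundary cells accumulate `indel` is an assumption that has no effect while `indel` is 0. With the default values it computes exactly the LCS score.

```python
def alignment_score(v, w, match=1, indel=0):
    """LCS-style recurrence with the matching and indel scores made explicit."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary handling: a prefix aligned against nothing is all indels.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + indel
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(s[i - 1][j] + indel, s[i][j - 1] + indel)
            if v[i - 1] == w[j - 1]:
                best = max(best, s[i - 1][j - 1] + match)
            s[i][j] = best
    return s[n][m]


print(alignment_score("ATGTTAT", "ATCGTAC"))  # 5, the LCS score
```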
![Page 31: Pairwise Sequence Alignment (I)](https://reader035.fdocuments.us/reader035/viewer/2022062501/56815d36550346895dcb367a/html5/thumbnails/31.jpg)
What You Should Know
• How an alignment corresponds to a path in an edit graph
• How the LCS problem corresponds to alignment with a simple scoring method
• How the dynamic programming algorithm solves the LCS problem (= simple alignment)