Comp. Genomics
Recitation 1
Outline
• Sequence alignment
• End-space free alignment
• Alignment with gaps
Alignment basic step
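[Figure: the three possible last columns when aligning x[1..i] with y[1..j], shown for xi = G and yj = C: xi aligned with yj (G over C), xi aligned with a gap (G over -), and a gap aligned with yj (- over C).]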
Global alignment
• All of x has to be aligned with all of y
• Therefore, every gap is “paid for”
• The solution score is found in one cell
[Figure: DP matrix. The alignment score is in the bottom-right cell; the traceback runs all the way back to the top-left cell.]
Global alignment
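• Input: Sequences x, y (of lengths n and m)
• Output: Maximum score alignment
• F(i,j): score of aligning x[1..i] with y[1..j]; $\sigma(a,b)$ denotes the score of aligning a with b, where either may be a gap "-"
• Base conditions:
  • $F(i,0) = \sum_{k=1}^{i} \sigma(x_k, -)$
  • $F(0,j) = \sum_{k=1}^{j} \sigma(-, y_k)$
• Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i, y_j) \\ F(i-1,j) + \sigma(x_i, -) \\ F(i,j-1) + \sigma(-, y_j) \end{cases}$$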
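The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the function name and the simple match/mismatch/gap scores standing in for $\sigma$ are assumptions chosen for the example.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of x and y.

    The scoring values are illustrative assumptions; any sigma would do.
    """
    n, m = len(x), len(y)
    # F[i][j] = best score of aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Base conditions: a prefix aligned against nothing is all gaps.
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # x_i aligned with y_j
                          F[i - 1][j] + gap,     # x_i aligned with a gap
                          F[i][j - 1] + gap)     # y_j aligned with a gap
    # Global alignment: the score sits in the bottom-right cell.
    return F[n][m]
```

For example, `global_alignment_score("ACGT", "AGT")` aligns three matching pairs and pays for one gap.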
Local alignment
• A substring of x is aligned with a substring of y
• Gaps outside the aligned substrings are “costless”
• The solution equals the maximum score cell in the DP matrix
• Base conditions:
  • F(i,0) = 0
  • F(0,j) = 0
• Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i, y_j) \\ F(i-1,j) + \sigma(x_i, -) \\ F(i,j-1) + \sigma(-, y_j) \\ 0 \end{cases}$$
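A minimal Python sketch of the local-alignment recurrence follows. The function name and the match/mismatch/gap values are assumptions for illustration (the example slide uses BLOSUM50 instead); the two differences from global alignment are the extra 0 option and taking the maximum over all cells.

```python
def local_alignment_score(x, y, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of x and y (illustrative scoring values)."""
    n, m = len(x), len(y)
    # First row and column stay 0: leading gaps are free.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,                     # start a fresh alignment here
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
            best = max(best, F[i][j])            # solution: maximum cell anywhere
    return best
```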
Local alignment example
    -   H   E   A   G   A   W   G   H   E   E
-   0   0   0   0   0   0   0   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0   0
A   0   0   0   5   0   5   0   0   0   0   0
W   0   0   0   0   2   0  20  12   4   0   0
H   0  10   2   0   0   0  12  18  22  14   6
E   0   2  16   8   0   0   4  10  18  28  20
A   0   0   8  21  13   5   0   4  10  20  27
E   0   0   6  13  18  12   4   0   4  16  26
Scoring: match and mismatch scores from BLOSUM50; gap: -8
Best local alignment (score 28, the maximum cell):
AWGHE
AW-HE
Overlap matches (end space free alignment)
• Something between global and local
• Consider aligning a gene x to a (bacterial) genome y
• Gaps in the beginning and end of x and y are costless
• But all of x should be aligned
• Base conditions:
  • F(i,0) = 0
  • F(0,j) = 0
• Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i, y_j) \\ F(i-1,j) + \sigma(x_i, -) \\ F(i,j-1) + \sigma(-, y_j) \end{cases}$$
• The optimal solution is found at the last row/column (not necessarily at bottom right corner)
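As a sketch of the end-space free variant, the Python function below zeroes the first row and column, uses the global recurrence (no 0 option), and reads the answer from the last row or last column. The function name and scoring values are illustrative assumptions.

```python
def overlap_alignment_score(x, y, match=1, mismatch=-1, gap=-2):
    """Best end-space free (overlap) alignment score, illustrative scoring."""
    n, m = len(x), len(y)
    # F(i,0) = F(0,j) = 0: gaps at the beginning cost nothing.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Gaps at the end are free too: take the best cell of the
    # last row or the last column, not only the corner.
    last_row = max(F[n])
    last_col = max(F[i][m] for i in range(n + 1))
    return max(last_row, last_col)
```

With a short "gene" inside a longer "genome", `overlap_alignment_score("ACGT", "TTTACGTTTT")` scores only the four matching columns; the flanking T's are not penalized.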
Handling weird gaps
• Affine gap: different costs for opening a “new” gap and for extending an “old” one
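[Figure: the same three options for the last alignment column (xi aligned with yj, xi aligned with a gap, a gap aligned with yj), with xi = G and yj = C. Now we care whether the preceding column was already a gap.]
• Two new things to keep track of, so two additional matrices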
Alignment with Affine Gap Penalty
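Base conditions:
• $M(0,0) = 0$
• $M(i,0) = I_x(i,0) = W_g + iW_s$
• $M(0,j) = I_y(0,j) = W_g + jW_s$
The three matrices, by the last column of the alignment of x[1..i] with y[1..j]:
• M(i,j): xi is aligned with yj
• Iy(i,j): yj is aligned with a gap (x ends in a gap)
• Ix(i,j): xi is aligned with a gap (y ends in a gap)
Recursive computation, with $W_g, W_s < 0$:
$$M(i,j) = \max \begin{cases} M(i-1,j-1) + \sigma(x_i, y_j) \\ I_x(i-1,j-1) + \sigma(x_i, y_j) \\ I_y(i-1,j-1) + \sigma(x_i, y_j) \end{cases}$$
$$I_x(i,j) = \max \begin{cases} M(i-1,j) + W_g + W_s \\ I_x(i-1,j) + W_s \end{cases}$$
and symmetrically
$$I_y(i,j) = \max \begin{cases} M(i,j-1) + W_g + W_s \\ I_y(i,j-1) + W_s \end{cases}$$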
The optimal solution is the maximum of the relevant cells in the three matrices
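The three-matrix scheme can be sketched in Python as below. The function name, the default values of `Wg` and `Ws`, and the match/mismatch scores are assumptions for illustration; the `Iy` update is taken to be the mirror image of the `Ix` update, and boundary cells not covered by the base conditions are set to minus infinity.

```python
NEG_INF = float("-inf")

def affine_alignment_score(x, y, Wg=-3.0, Ws=-1.0, match=1.0, mismatch=-1.0):
    """Global alignment score with affine gaps: a gap of length L costs Wg + L*Ws."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # last column: x_i with y_j
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # last column: x_i with a gap
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # last column: y_j with a gap
    M[0][0] = 0.0
    for i in range(1, n + 1):
        M[i][0] = Ix[i][0] = Wg + i * Ws
    for j in range(1, m + 1):
        M[0][j] = Iy[0][j] = Wg + j * Ws
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            Ix[i][j] = max(M[i - 1][j] + Wg + Ws,   # open a new gap
                           Ix[i - 1][j] + Ws)       # extend an old gap
            Iy[i][j] = max(M[i][j - 1] + Wg + Ws,
                           Iy[i][j - 1] + Ws)
    # Optimal solution: best of the three matrices at the final cell.
    return max(M[n][m], Ix[n][m], Iy[n][m])
```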
When do constant and affine gap costs differ?
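• Consider aligning AGAGACTGACGCTTA with ATATTA
Alignment 1 (four short gaps):
AGAGACTGACGCTTA
----A-T-A---TTA
Alignment 2 (one mismatch, one long gap):
AGAGACTGACGCTTA
ATA---------TTA
Constant penalty: mismatch -5, gap -1
Affine penalty: mismatch -5, gap open -3, gap extend -0.5
Scores (matches score 0; the gap-open cost covers the first position of a gap):
              Constant   Affine
Alignment 1      -9       -14.5
Alignment 2     -14       -12
The constant penalty prefers alignment 1; the affine penalty prefers alignment 2.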
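The four scores in this example can be checked with a small column-by-column scorer. This is a sketch under two assumptions that reproduce the slide's numbers: matches score 0, and the first position of each gap run is charged the open cost while every further position is charged the extension cost. The function name is hypothetical.

```python
def score_alignment(top, bottom, mismatch, gap_open, gap_extend, match=0.0):
    """Score a given gapped alignment; a constant penalty is gap_open == gap_extend."""
    assert len(top) == len(bottom)
    total = 0.0
    prev_gap_row = None  # which row carried a gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # Continuing a gap in the same row is an extension; otherwise a new gap.
            total += gap_extend if prev_gap_row == row else gap_open
            prev_gap_row = row
        else:
            total += match if a == b else mismatch
            prev_gap_row = None
    return total

x = "AGAGACTGACGCTTA"
aln1 = "----A-T-A---TTA"
aln2 = "ATA---------TTA"
constant = [score_alignment(x, a, -5, -1, -1) for a in (aln1, aln2)]
affine = [score_alignment(x, a, -5, -3, -0.5) for a in (aln1, aln2)]
```

Here `constant` holds the two constant-penalty scores and `affine` the two affine ones, in the order alignment 1, alignment 2.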
Question
• Given two sequences x and y, the fragmentation number of x,y is the minimal k such that x and y can be broken into substrings x1,x2,...,xk ; y1,y2,...,yk and every xi is a substring of the corresponding yi
• Suggest an algorithm for finding the fragmentation number of two sequences
Solution
• Global alignment with the following modifications:
  • No penalty for gaps at the ends of y
  • Gaps are only allowed in x (characters of x may not be skipped)
  • Mismatches are not allowed (score -∞)
  • Affine gap score, with open cost 1 and extension cost 0
Question
• How do we align two sequences with a bound k on the maximal number of gaps?
• Analyze the complexity
Solution
We will divide every cell in the alignment matrix into 2k sub-cells. The meaning of a sub-cell is as follows:
• k sub-cells with superscript 1: $M^1_l(i,j)$ = cost of the optimal alignment with l gaps opened and l gaps closed
• k sub-cells with superscript 2: $M^2_l(i,j)$ = cost of the optimal alignment with l gaps opened and l-1 gaps closed
Solution
• The update rule for sub-cells with superscript 1:
$$M^1_l(i,j) = \max \begin{cases} M^1_l(i-1,j-1) + \sigma(x_i, y_j) \\ M^2_l(i-1,j-1) + \sigma(x_i, y_j) \end{cases}$$
• The update rule for sub-cells with superscript 2:
$$M^2_l(i,j) = \max \begin{cases} M^2_l(i,j-1) + \sigma(-, y_j) \\ M^2_l(i-1,j) + \sigma(x_i, -) \\ M^1_{l-1}(i-1,j) + \sigma(x_i, -) \\ M^1_{l-1}(i,j-1) + \sigma(-, y_j) \end{cases}$$
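The sub-cell idea can be sketched in Python as follows. The slides do not state base conditions, so the sketch assumes that only the empty alignment with zero gaps starts at score 0 and everything else starts at minus infinity; the function name and scoring values are also assumptions. Each of the O(k·n·m) sub-cells is filled in constant time, which suggests O(k·n·m) time overall.

```python
NEG_INF = float("-inf")

def bounded_gap_alignment(x, y, k, match=1, mismatch=-1, gap=-1):
    """Best global alignment score using at most k gaps (runs of gap columns)."""
    n, m = len(x), len(y)
    # M1[l][i][j]: l gaps opened and l closed (last column is a match/mismatch)
    # M2[l][i][j]: l gaps opened, l-1 closed (last column is inside a gap)
    M1 = [[[NEG_INF] * (m + 1) for _ in range(n + 1)] for _ in range(k + 1)]
    M2 = [[[NEG_INF] * (m + 1) for _ in range(n + 1)] for _ in range(k + 1)]
    M1[0][0][0] = 0  # assumed base condition: empty alignment, no gaps
    for l in range(k + 1):
        for i in range(n + 1):
            for j in range(m + 1):
                if i > 0 and j > 0:
                    s = match if x[i - 1] == y[j - 1] else mismatch
                    M1[l][i][j] = max(M1[l][i - 1][j - 1], M2[l][i - 1][j - 1]) + s
                if l > 0:
                    cands = []
                    if i > 0:  # x_i against a gap: extend gap l, or open it
                        cands += [M2[l][i - 1][j], M1[l - 1][i - 1][j]]
                    if j > 0:  # y_j against a gap
                        cands += [M2[l][i][j - 1], M1[l - 1][i][j - 1]]
                    if cands:
                        M2[l][i][j] = max(cands) + gap
    # Returns -inf when no alignment with at most k gaps exists.
    return max(max(M1[l][n][m], M2[l][n][m]) for l in range(k + 1))
```

With `k=0` and sequences of different lengths no alignment exists, so the function returns minus infinity.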
What about arbitrary gap functions?
• Suppose the gap cost is an arbitrary function γ(k) of the gap length k
• Then, when computing cell (i,j), we need to look “back” at all possible gap lengths
Alignment with arbitrary gap functions
Recursive Computation:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i, y_j) \\ \max_{k=0,\dots,i-1} F(k,j) + \gamma(i-k) \\ \max_{k=0,\dots,j-1} F(i,k) + \gamma(j-k) \end{cases}$$
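A direct Python transcription of this recurrence is sketched below. The function name and the match/mismatch values are assumptions; `gamma` is any function from gap length to (negative) score, and the only assumed base condition is F(0,0) = 0, with boundary cells filled by the recurrence itself.

```python
def general_gap_alignment(x, y, gamma, match=1, mismatch=-1):
    """Global alignment score where a gap of length L scores gamma(L)."""
    n, m = len(x), len(y)
    F = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = float("-inf")
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                best = F[i - 1][j - 1] + s
            # Look back over every possible gap length in each direction.
            for k in range(i):
                best = max(best, F[k][j] + gamma(i - k))
            for k in range(j):
                best = max(best, F[i][k] + gamma(j - k))
            F[i][j] = best
    return F[n][m]
```

For instance, `general_gap_alignment("AAACCCGGG", "AAAGGG", lambda L: -4)` charges a flat -4 for the single length-3 gap.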
Complexity
Suppose the two sequences are of length n.
$$T(n) = \sum_{i=0}^{n} \sum_{j=0}^{n} (i + j + 1) = (n+1)\sum_{i=0}^{n} i + (n+1)\sum_{j=0}^{n} j + (n+1)^2 = 2(n+1)\frac{n(n+1)}{2} + (n+1)^2 = O(n^3)$$
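As a quick sanity check of the sum, the snippet below counts i + j + 1 units of work per cell and compares the total with (n+1)^3, which is what the closed form simplifies to. The function name is hypothetical.

```python
def total_cell_work(n):
    """Sum of (i + j + 1) over all cells 0 <= i, j <= n."""
    return sum(i + j + 1 for i in range(n + 1) for j in range(n + 1))

# n(n+1)^2 + (n+1)^2 = (n+1)^3, hence cubic growth.
checks = [total_cell_work(n) == (n + 1) ** 3 for n in range(20)]
```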
LCS
• Longest common non-contiguous subsequence:
• Use global alignment with similarity scores
  • +1 for match
  • 0 for indel
  • -∞ for mismatches
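With these scores, the global alignment DP reduces to the familiar LCS table. A minimal Python sketch (the function name is an assumption):

```python
def lcs_length(a, b):
    """LCS length via global alignment: +1 match, 0 indel, -inf mismatch."""
    NEG_INF = float("-inf")
    n, m = len(a), len(b)
    # Indels cost 0, so the base conditions F(i,0) and F(0,j) are all 0.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = F[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1] else NEG_INF
            F[i][j] = max(diag, F[i - 1][j], F[i][j - 1])
    return F[n][m]
```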
Exercise: Shortest common supersequence
• A is called a non-contiguous supersequence of B if B is a non-contiguous subsequence of A.
• e.g., YABADABADU is a non-contiguous supersequence of BABU (for example, yaBAdaBadU)
• Problem: Given A and B, find their shortest common supersequence
Solution
• For A=“PRIDE”, B=“PARADE”:
• Compute LCS using global alignment:
A=P-R-IDE
B=PARA-DE
• PARAIDE: shortest common supersequence
• Notice that PRDE is the longest common subsequence of A and B.
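The construction can be sketched in Python: fill the LCS table, then trace back, emitting every aligned column once. Function names are assumptions, and ties in the traceback may yield a different, equally short supersequence than the one on the slide.

```python
def shortest_common_supersequence(a, b):
    """Build one shortest common supersequence of a and b from the LCS table."""
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    out = []  # built back to front
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:          # shared character: emit once
            out.append(a[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:  # character only from a
            out.append(a[i - 1]); i -= 1
        else:                             # character only from b
            out.append(b[j - 1]); j -= 1
    out.extend(reversed(a[:i]))           # leftover prefix (at most one is non-empty)
    out.extend(reversed(b[:j]))
    return "".join(reversed(out))

def is_subsequence(s, t):
    """True if s is a (non-contiguous) subsequence of t."""
    it = iter(t)
    return all(c in it for c in s)
```

The length of the result is always len(a) + len(b) minus the LCS length, which is 5 + 6 - 4 = 7 for PRIDE and PARADE.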
Exercise: Finding repeats
• Basic objective: find a pair of subsequences within a string x with maximum similarity
• Simple (albeit wrong) idea: Find an optimal alignment of x with itself! (Why is this wrong?)
• But using local alignment is still a good idea
Variant #1
• First variant: the two sequences may not overlap
• Solution: Absence of overlap means that there exists an index k such that one substring is in x[1..k] and the other in x[k+1..n]
  • Check local alignments between x[1..k] and x[k+1..n] for all 1<=k<n
  • Pick the highest-scoring alignment
• Complexity: O(n^3) time and O(n) space
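A Python sketch of this variant is given below. It keeps only two DP rows per local alignment, which is one way to stay within O(n) space when only the score is needed; the function names and scoring values are assumptions.

```python
def local_score(x, y, match=2, mismatch=-1, gap=-2):
    """Best local alignment score, keeping two DP rows (linear space)."""
    prev = [0] * (len(y) + 1)
    best = 0
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            s = match if a == b else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best

def best_nonoverlapping_repeat(x):
    """Try every split point k; n-1 local alignments of O(n^2) each."""
    return max((local_score(x[:k], x[k:]) for k in range(1, len(x))), default=0)
```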
Variant #1, Pictorially
Variant #2
• Second variant: the two sequences must be consecutive (tandem repeat)
• Solution: Similar to variant #1, but somewhat “ends-free”: seek a global alignment between x[1..k] and x[k+1..n], with
  • No penalties for gaps in the beginning of x[1..k]
  • No penalties for gaps in the end of x[k+1..n]
• Complexity: O(n^3) time and O(n) space
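The tandem-repeat variant can be sketched along the same lines. For each split point, the first column is zeroed (free gaps at the beginning of the left part) and the answer is read from the whole last row (free gaps at the end of the right part). The function name and scoring values are assumptions.

```python
def best_tandem_repeat(x, match=1, mismatch=-1, gap=-2):
    """Best score of a suffix of x[:k] aligned with a prefix of x[k:], over all k."""
    best = 0
    for k in range(1, len(x)):
        left, right = x[:k], x[k:]
        # F(0,j): gaps at the beginning of `right` are paid for.
        prev = [j * gap for j in range(len(right) + 1)]
        for a in left:
            cur = [0]  # F(i,0) = 0: skipping a prefix of `left` is free
            for j, b in enumerate(right, 1):
                s = match if a == b else mismatch
                cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
            prev = cur
        # Last row: all of `left` is consumed; the rest of `right` is free.
        best = max(best, max(prev))
    return best
```

Two rows per split point keep the space linear, and the n-1 split points with an O(n^2) table each give the cubic running time.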
Variant #2, Pictorially