CAP5510 – Bioinformatics
Sequence Comparison
Tamer Kahveci
CISE Department
University of Florida
Goals
• Understand major sequence comparison algorithms.
• Gain hands-on experience.
Why Compare Sequences ?
• Prediction of function
• Construction of phylogeny
• Shotgun assembly
• Finding motifs
• Understanding of biological processes
Question
• Q = AATTCGA
• X = ACATCGG
• Y = CATTCGCC
• Z = ATTCCGC
• Form groups of 2-3. Sort X, Y, and Z in decreasing similarity to Q. (5 min)
Dot Plot
[Figure: dot plot with Q = AATTCGA across the top and X = ACATCGG down the side]
How can we compute similarity?
O(m+n) time
Is it a good scheme ?
Use longer subsequences (k-gram)
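A minimal sketch of the k-gram dot plot, assuming a dot is placed wherever the k-gram starting at a row of X equals the k-gram starting at a column of Q. The function names `dot_plot` and `render` are illustrative, not from the slides.

```python
def dot_plot(q, x, k=1):
    """Return the set of (row, col) cells where the k-gram of x starting
    at row equals the k-gram of q starting at col."""
    dots = set()
    for r in range(len(x) - k + 1):
        for c in range(len(q) - k + 1):
            if x[r:r + k] == q[c:c + k]:
                dots.add((r, c))
    return dots


def render(q, x, dots):
    """Draw the plot as text: q across the top, x down the side."""
    lines = ["  " + " ".join(q)]
    for r, letter in enumerate(x):
        cells = ["*" if (r, c) in dots else "." for c in range(len(q))]
        lines.append(letter + " " + " ".join(cells))
    return "\n".join(lines)


Q, X = "AATTCGA", "ACATCGG"
print(render(Q, X, dot_plot(Q, X, k=1)))  # single letters: many spurious dots
print()
print(render(Q, X, dot_plot(Q, X, k=2)))  # 2-grams: only a few dots remain
```

With k = 1 every shared letter produces a dot; with k = 2 only the cells that start a shared pair of letters survive, which is why longer k-grams give a cleaner picture.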
Sequence Comparison
• How to align
– Global alignment: align entire sequences
– Local alignment: align subsequences
• How to evaluate
– Distance
– Score
Global Alignment
• Q = AATTCGA
•     |rr|||r
• X = ACATCGG
• 4 match
• 3 mismatch

• Q = A-ATTCGA
•     |i|d|||r
• X = ACA-TCGG
• 5 match
• 1 insert
• 1 delete
• 1 mismatch
Similarity is defined in terms of Distance / Score of alignment
Many combinations of Insert / delete / (mis)match
Each Alignment Maps to a Path
[Figure: an alignment drawn as a path through the grid, with AATTCG across the top and ACATCG down the side]
Edit Distance
• Minimum number of insert / delete / replace operators to transform one sequence into the other.
• Q = AATTCGA
•     |  |||    => 3
• X = ACATCGG
How do we find the minimum edit distance ?
Global sequence alignment (Needleman-Wunsch)
• Compute distance recursively : dynamic programming.
Case 1 : match (0) or mismatch (1)
Case 2 : delete (1)
Case 3 : insert (1)
Case 0 : one string is empty (n)
Global sequence alignment (Needleman-Wunsch)
• Optimal string alignment
• D(i,j) = edit distance between A(1:i) and B(1:j)
• d(a,b) = 0 if a = b, 1 otherwise.
• Recurrence relation
– D(i,0) = Σ d(A(k),-), 1 <= k <= i
– D(0,j) = Σ d(-,B(k)), 1 <= k <= j
– D(i,j) = Min {
  • D(i-1,j) + d(A(i),-),
  • D(i,j-1) + d(-,B(j)),
  • D(i-1,j-1) + d(A(i),B(j)) }
DP Example: Backtracking
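|   |   | A | A | T | T | C | G |
|---|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| A | 1 | 0 | 1 | 2 | 3 | 4 | 5 |
| C | 2 | 1 | 1 | 2 | 3 | 3 | 4 |
| A | 3 | 2 | 1 | 2 | 3 | 4 | 4 |
| T | 4 | 3 | 2 | 1 | 2 | 3 | 4 |
| C | 5 | 4 | 3 | 2 | 2 | 2 | 3 |
| G | 6 | 5 | 4 | 3 | 3 | 3 | 2 |

• O(mn) time and space
• Reconstruct alignment
• O(max{m,n}) space if alignment not needed. How?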
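The table and backtracking step can be sketched as follows, assuming unit costs for insert, delete and mismatch as in the recurrence. The helper names `edit_distance_table` and `traceback` are illustrative only.

```python
def edit_distance_table(a, b):
    """Fill the full (len(a)+1) x (len(b)+1) table D of edit distances."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i                      # delete the first i letters of a
    for j in range(1, n + 1):
        D[0][j] = j                      # insert the first j letters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # delete a[i]
                          D[i][j - 1] + 1,        # insert b[j]
                          D[i - 1][j - 1] + sub)  # match or mismatch
    return D


def traceback(a, b, D):
    """Walk back from D[m][n] to D[0][0] and rebuild one optimal alignment."""
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


D = edit_distance_table("ACATCG", "AATTCG")
print(D[-1][-1])                         # bottom-right cell of the table
print(*traceback("ACATCG", "AATTCG", D), sep="\n")
```

Several optimal alignments may exist; this traceback prefers the diagonal move when there is a tie, which is an arbitrary choice.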
Number of Alignments
• N(n, m) = number of alignments of sequences of n and m letters (not necessarily optimal alignment).
• N(0, i) = N(i, 0) = 1
• N(n, m) = N(n-1, m) + N(n, m-1) + N(n-1, m-1)
• $N(n, n) \sim (1 + \sqrt{2})^{2n+1} n^{-1/2}$
• $N(1000, 1000) > 10^{767}$
• $10^{80}$ atoms in the universe!
Edit Distance: a Good Measure?
• Compare these two alignments. Which one is better ?
• Q = AATTCGA
•     |  |||
• X = ACATCGG

• Q = A-ATTCGA
•     | | |||
• X = ACA-TCGG

Scoring scheme:
• +1 for each match
• -1 for each mismatch/indel
Can be computed the same as edit distance by including +1 for each match
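To make the comparison concrete, the sketch below counts the columns of each alignment and evaluates both measures. The function names are illustrative; the scoring values are the +1 / -1 scheme stated above.

```python
def alignment_stats(top, bottom):
    """Count (matches, mismatches, indels) over the columns of a gapped alignment."""
    assert len(top) == len(bottom)
    matches = mismatches = indels = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            indels += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, indels


def edit_cost(top, bottom):
    """Edit distance view: every mismatch or indel costs 1, matches cost 0."""
    _, mismatches, indels = alignment_stats(top, bottom)
    return mismatches + indels


def score(top, bottom, match=1, penalty=-1):
    """Score view: +1 per match, -1 per mismatch or indel."""
    matches, mismatches, indels = alignment_stats(top, bottom)
    return match * matches + penalty * (mismatches + indels)


for top, bottom in [("AATTCGA", "ACATCGG"), ("A-ATTCGA", "ACA-TCGG")]:
    print(top, bottom, "cost:", edit_cost(top, bottom), "score:", score(top, bottom))
```

Both alignments have edit cost 3, so edit distance cannot tell them apart, while the score rewards the extra match in the gapped alignment (2 versus 1).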
More Trouble: Scoring Matrices
• Different mutations may occur at different rates in nature. Why ?
• E.g., each amino acid = three nucleotides. Transformation of one amino acid to another due to a single nucleotide modification may be biased
– E = GAA, GAG
– D = GAU, GAC
– F = UUU, UUC
– E similar to D, not similar to F
• Mutation probability of different pairs of nucleotides may differ.
• PAM, BLOSUM matrices
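The BLOSUM45 Matrix

|   | A | R | N | D | C | Q | E | G | H | I | L | K | M | F | P | S | T | W | Y | V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 5 | -2 | -1 | -2 | -1 | -1 | -1 | 0 | -2 | -1 | -1 | -1 | -1 | -2 | -1 | 1 | 0 | -2 | -2 | 0 |
| R | -2 | 7 | 0 | -1 | -3 | 1 | 0 | -2 | 0 | -3 | -2 | 3 | -1 | -2 | -2 | -1 | -1 | -2 | -1 | -2 |
| N | -1 | 0 | 6 | 2 | -2 | 0 | 0 | 0 | 1 | -2 | -3 | 0 | -2 | -2 | -2 | 1 | 0 | -4 | -2 | -3 |
| D | -2 | -1 | 2 | 7 | -3 | 0 | 2 | -1 | 0 | -4 | -3 | 0 | -3 | -4 | -1 | 0 | -1 | -4 | -2 | -3 |
| C | -1 | -3 | -2 | -3 | 12 | -3 | -3 | -3 | -3 | -3 | -2 | -3 | -2 | -2 | -4 | -1 | -1 | -5 | -3 | -1 |
| Q | -1 | 1 | 0 | 0 | -3 | 6 | 2 | -2 | 1 | -2 | -2 | 1 | 0 | -4 | -1 | 0 | -1 | -2 | -1 | -3 |
| E | -1 | 0 | 0 | 2 | -3 | 2 | 6 | -2 | 0 | -3 | -2 | 1 | -2 | -3 | 0 | 0 | -1 | -3 | -2 | -3 |
| G | 0 | -2 | 0 | -1 | -3 | -2 | -2 | 7 | -2 | -4 | -3 | -2 | -2 | -3 | -2 | 0 | -2 | -2 | -3 | -3 |
| H | -2 | 0 | 1 | 0 | -3 | 1 | 0 | -2 | 10 | -3 | -2 | -1 | 0 | -2 | -2 | -1 | -2 | -3 | 2 | -3 |
| I | -1 | -3 | -2 | -4 | -3 | -2 | -3 | -4 | -3 | 5 | 2 | -3 | 2 | 0 | -2 | -2 | -1 | -2 | 0 | 3 |
| L | -1 | -2 | -3 | -3 | -2 | -2 | -2 | -3 | -2 | 2 | 5 | -3 | 2 | 1 | -3 | -3 | -1 | -2 | 0 | 1 |
| K | -1 | 3 | 0 | 0 | -3 | 1 | 1 | -2 | -1 | -3 | -3 | 5 | -1 | -3 | -1 | -1 | -1 | -2 | -1 | -2 |
| M | -1 | -1 | -2 | -3 | -2 | 0 | -2 | -2 | 0 | 2 | 2 | -1 | 6 | 0 | -2 | -2 | -1 | -2 | 0 | 1 |
| F | -2 | -2 | -2 | -4 | -2 | -4 | -3 | -3 | -2 | 0 | 1 | -3 | 0 | 8 | -3 | -2 | -1 | 1 | 3 | 0 |
| P | -1 | -2 | -2 | -1 | -4 | -1 | 0 | -2 | -2 | -2 | -3 | -1 | -2 | -3 | 9 | -1 | -1 | -3 | -3 | -3 |
| S | 1 | -1 | 1 | 0 | -1 | 0 | 0 | 0 | -1 | -2 | -3 | -1 | -2 | -2 | -1 | 4 | 2 | -4 | -2 | -1 |
| T | 0 | -1 | 0 | -1 | -1 | -1 | -1 | -2 | -2 | -1 | -1 | -1 | -1 | -1 | -1 | 2 | 5 | -3 | -1 | 0 |
| W | -2 | -2 | -4 | -4 | -5 | -2 | -3 | -2 | -3 | -2 | -2 | -2 | -2 | 1 | -3 | -4 | -3 | 15 | 3 | -3 |
| Y | -2 | -1 | -2 | -2 | -3 | -1 | -2 | -3 | 2 | 0 | 0 | -1 | 0 | 3 | -3 | -2 | -1 | 3 | 8 | -1 |
| V | 0 | -2 | -3 | -3 | -1 | -3 | -3 | -3 | -3 | 3 | 1 | -2 | 1 | 0 | -3 | -1 | 0 | -3 | -1 | 5 |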
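score(H,P) = -2, gap penalty = -8

|   |   | H | E | A | G | A | W | G | H | E | E |
|---|---|---|---|---|---|---|---|---|---|---|---|
|   | 0 | -8 | -16 | -24 | -32 | -40 | -48 | -56 | -64 | -72 | -80 |
| P | -8 | -2 |  |  |  |  |  |  |  |  |  |
| A | -16 |  |  |  |  |  |  |  |  |  |  |
| W | -24 |  |  |  |  |  |  |  |  |  |  |
| H | -32 |  |  |  |  |  |  |  |  |  |  |
| E | -40 |  |  |  |  |  |  |  |  |  |  |
| A | -48 |  |  |  |  |  |  |  |  |  |  |
| E | -56 |  |  |  |  |  |  |  |  |  |  |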
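score(E,P) = 0, score(E,A) = -1, score(H,A) = -2

|   |   | H | E | A | G | A | W | G | H | E | E |
|---|---|---|---|---|---|---|---|---|---|---|---|
|   | 0 | -8 | -16 | -24 | -32 | -40 | -48 | -56 | -64 | -72 | -80 |
| P | -8 | -2 | -8 |  |  |  |  |  |  |  |  |
| A | -16 | -10 | -3 |  |  |  |  |  |  |  |  |
| W | -24 |  |  |  |  |  |  |  |  |  |  |
| H | -32 |  |  |  |  |  |  |  |  |  |  |
| E | -40 |  |  |  |  |  |  |  |  |  |  |
| A | -48 |  |  |  |  |  |  |  |  |  |  |
| E | -56 |  |  |  |  |  |  |  |  |  |  |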
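|   |   | H | E | A | G | A | W | G | H | E | E |
|---|---|---|---|---|---|---|---|---|---|---|---|
|   | 0 | -8 | -16 | -24 | -32 | -40 | -48 | -56 | -64 | -72 | -80 |
| P | -8 | -2 | -8 | -16 | -24 | -33 | -42 | -49 | -57 | -65 | -73 |
| A | -16 | -10 | -3 | -4 | -12 | -19 | -28 | -36 | -44 | -52 | -60 |
| W | -24 | -18 | -11 | -6 | -7 | -15 | -4 | -12 | -21 | -29 | -37 |
| H | -32 | -14 | -18 | -13 | -8 | -9 | -12 | -6 | -2 | -11 | -19 |
| E | -40 | -22 | -8 | -16 | -16 | -9 | -12 | -14 | -6 | 4 | -5 |
| A | -48 | -30 | -16 | -3 | -11 | -11 | -12 | -12 | -14 | -4 | 2 |
| E | -56 | -38 | -24 | -11 | -6 | -12 | -14 | -15 | -12 | -8 | 2 |

Optimal alignment:
H E A G A W G H E - E
- P - - A W - H E A E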
Distance vs. Similarity
• Similarity model: s(a,b), g’(k)
• Distance model: d(a,b), g(k)
If there is a constant c such that
– s(a,b) = c – d(a,b)
– g’(k) = g(k) – kc/2
then the similarity-optimal alignment = the distance-optimal alignment.
Global Alignment ?
• Q = A-ATTCGA
•     | | |||
• X = ACA-TCGG

• Q = AATTCGA-
•      |||||
• Y = CATTCGCC

Which one is more similar to Q?
Local alignment: highest scoring subsequence alignment. How can we find it?
Brute force: $O(n^3 m^3)$
Gotoh (Smith-Waterman): O(nm)
Local Suffix Alignment
• V(i,j) = score of the best local suffix alignment of X[1:i] and Y[1:j]
• V(i,0) = V(0,j) = 0
• V(i,j) = max{ 0,
  V(i-1,j-1) + s(x(i), y(j)),
  V(i-1,j) + s(x(i), -),
  V(i,j-1) + s(-, y(j)) }
Local Alignment
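• The optimal local alignment is given by the pair of prefixes X[1:i], Y[1:j] with the highest local suffix alignment score, i.e. the maximum V(i,j) over all i and j.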
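Local Alignment Example

Match = +5
Mismatch = -4

P runs down the rows, Q across the columns.

|   | -- | G | C | T | G | G | A | A | G | G | C | A | T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -- | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| G | 0 | 5 | 1 | 0 | 5 | 5 | 1 | 0 | 5 | 5 | 1 | 0 | 0 |
| C | 0 | 1 | 10 | 6 | 2 | 1 | 1 | 0 | 1 | 1 | 10 | 6 | 2 |
| A | 0 | 0 | 6 | 6 | 2 | 0 | 6 | 6 | 2 | 0 | 6 | 15 | 11 |
| G | 0 | 5 | 2 | 2 | 11 | 7 | 3 | 2 | 11 | 7 | 3 | 11 | 11 |
| A | 0 | 1 | 1 | 0 | 7 | 7 | 11 | 8 | 7 | 7 | 3 | 8 | 7 |
| G | 0 | 5 | 1 | 0 | 5 | 11 | 7 | 7 | 13 | 12 | 8 | 4 | 4 |
| C | 0 | 0 | 10 | 6 | 2 | 7 | 7 | 3 | 9 | 8 | 17 | 13 | 9 |
| A | 0 | 0 | 6 | 6 | 2 | 3 | 11 | 12 | 8 | 5 | 13 | 22 | 18 |
| C | 0 | 0 | 5 | 2 | 2 | 0 | 7 | 8 | 8 | 4 | 18 | 18 | 18 |
| G | 0 | 5 | 1 | 1 | 7 | 7 | 5 | 4 | 13 | 13 | 14 | 14 | 14 |

P’s subsequence: G C A G A G C A
Q’s subsequence: G A A G – G C A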
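A minimal sketch of the local alignment recurrence with traceback. Match = +5 and mismatch = -4 come from the example; the indel penalty is not stated on the slide, so `gap=-4` is an assumption (it agrees with the first rows of the table). The function name `smith_waterman` is illustrative.

```python
def smith_waterman(x, y, match=5, mismatch=-4, gap=-4):
    """Return (best score, aligned part of x, aligned part of y)."""
    m, n = len(x), len(y)
    V = [[0] * (n + 1) for _ in range(m + 1)]   # V(i,0) = V(0,j) = 0
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            V[i][j] = max(0,
                          V[i - 1][j - 1] + s,
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)
            if V[i][j] > best:                  # remember the highest cell
                best, bi, bj = V[i][j], i, j
    # Trace back from the highest cell until a zero cell is reached.
    top, bottom = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and V[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + s:
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + gap:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("AAAGATTACATTT", "CCGATTACAGG"))
```

Calling `smith_waterman("GCAGAGCACG", "GCTGGAAGGCAT")` runs the same procedure on the P and Q of the example.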
Goals
• Other important sequence comparison problems
– banded alignment
– end free search
– pattern search
– non-overlapping alignments
– gaps
– linear-space algorithms
– bitwise operations
– neighborhood searching
– NFAs
– approximate alignment
Banded Global Alignment
• Two sequences differ by at most w edit operations (w<<n).
• How can we align ?
Banded Alignment Example
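• O(wn) time and space.
• Example:
– w = 3
– Match = +1
– Mismatch = -1
– Indel = -2

Only cells within w of the main diagonal are filled in.

|   |   | A | C | C | A | C | A | C | A |
|---|---|---|---|---|---|---|---|---|---|
|   | 0 | -2 | -4 | -6 |  |  |  |  |  |
| A | -2 | 1 | -1 | -3 | -5 |  |  |  |  |
| C | -4 | -1 | 2 | 0 | -2 | -4 |  |  |  |
| A | -6 | -3 | 0 | 1 | 1 | -1 | -3 |  |  |
| C |  | -5 | -2 | 1 | 0 | 2 | 0 | -2 |  |
| C |  |  | -4 | -1 | 0 | 1 | 1 | 1 | -1 |
| A |  |  |  | -3 | 0 | -1 | 2 | 0 | 2 |
| T |  |  |  |  | -2 | -1 | 0 | 1 | 0 |
| A |  |  |  |  |  | -3 | 0 | -1 | 2 |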
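The banded computation can be sketched as below, assuming cells with |i - j| > w are simply treated as unreachable. The function name `banded_global_score` is illustrative; the default scores are the ones from the example.

```python
NEG_INF = float("-inf")


def banded_global_score(x, y, w, match=1, mismatch=-1, indel=-2):
    """Global alignment score restricted to the band |i - j| <= w.

    Each row keeps only its in-band cells (a dict keyed by column),
    so time is O(w * n) and space is O(w)."""
    m, n = len(x), len(y)
    if abs(m - n) > w:
        raise ValueError("the final cell lies outside the band")
    prev = {j: j * indel for j in range(0, min(n, w) + 1)}   # row 0
    for i in range(1, m + 1):
        cur = {}
        for j in range(max(0, i - w), min(n, i + w) + 1):
            best = NEG_INF
            if j > 0 and (j - 1) in prev:        # diagonal: match or mismatch
                s = match if x[i - 1] == y[j - 1] else mismatch
                best = max(best, prev[j - 1] + s)
            if j in prev:                        # from above: indel
                best = max(best, prev[j] + indel)
            if j > 0 and (j - 1) in cur:         # from the left: indel
                best = max(best, cur[j - 1] + indel)
            cur[j] = best
        prev = cur
    return prev[n]


print(banded_global_score("ACACCATA", "ACCACACA", w=3))
```

The result is only guaranteed to equal the unrestricted optimum when the optimal path stays inside the band, which is the premise of the slide (the sequences differ by at most w edit operations).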
End space free alignment
• --CCA-TGAC
• TTCCAGTG--
• How can we find it?
Pattern search
• AAGCAGCCATGACGGAAAT
• CCAGTG
• How can we find it?
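One common way to answer this is to reuse the edit distance table but initialize the row for the empty pattern to zero, so an occurrence may start anywhere in the text. The sketch below follows that idea; the function name and the threshold parameter `k` are illustrative assumptions, not taken from the slide.

```python
def approximate_occurrences(text, pattern, k):
    """0-based end positions j such that some substring of text ending
    at j is within k edit operations of pattern."""
    m = len(pattern)
    col = list(range(m + 1))        # before any text: D[i] = i
    ends = []
    for j, c in enumerate(text):
        new = [0] * (m + 1)         # D[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            sub = 0 if pattern[i - 1] == c else 1
            new[i] = min(col[i - 1] + sub,   # match or mismatch
                         col[i] + 1,         # extra letter in the text
                         new[i - 1] + 1)     # pattern letter skipped
        if new[m] <= k:
            ends.append(j)
        col = new
    return ends


print(approximate_occurrences("AAGCAGCCATGACGGAAAT", "CCAGTG", 1))
```

Only one column of the table is kept at a time, so the search uses O(m) space and O(mn) time.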
Non-overlapping Local Alignments
• GCTCTGCGAATA
• CGTTGAGATACT
• Find all non-overlapping local alignments with score > threshold.
• Two alignments overlap if they share the same letter pair.
• How do we find them?
Non-overlapping Local Alignments
1. Compute DP matrix
2. Find the largest scoring alignment > threshold
3. Report the alignment
4. Remove the effects of the alignment from the matrix
5. Go to step 2
Next: Closer look into gaps
Gaps
• Q = AATTCGAG
•      |||||
• Y = -ATTCGC-

• Q = AATTCGAG
•     |||||
• Z = AATTCC--
Which one is more similar to Q ?
Starting an indel is less likely than continuing an indel.
Affine gap model: Large gap open and smaller gap extend penalty.
How can we compute it ?
Computing affine gaps
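• 3 cases, by how the alignment of x[1:i] and y[1:j] ends:
– E: y(j) is aligned to a gap
– F: x(i) is aligned to a gap
– G: x(i) is aligned to y(j)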
Recursions
• E(i, 0) = gap_open + i x gap_extend
• E(i,j) = max{E(i, j-1) + gap_extend, V(i, j-1) + gap_open + gap_extend}
Recursions
• F(0, j) = gap_open + j x gap_extend
• F(i,j) = max{F(i-1, j) + gap_extend, V(i-1, j) + gap_open + gap_extend}
Recursions
• G(i,j) = V(i-1, j-1) + s(x(i), y(j))
Recursions
• V(i, 0) = gap_open + i x gap_extend
• V(0, j) = gap_open + j x gap_extend
• V(i, j) = max{E(i, j), F(i, j), G(i, j)}
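The E, F, G, V recursions can be put together as in the sketch below. It assumes gap_open and gap_extend are negative numbers (penalties added to a score that is maximized), takes G from V(i-1, j-1), and uses arbitrary default values for illustration. The function name `affine_global_score` is hypothetical.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Global alignment score under the affine gap model.

    A gap of length L costs gap_open + L * gap_extend."""
    m, n = len(x), len(y)
    V = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    E = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # ends with y(j) against a gap
    F = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # ends with x(i) against a gap
    V[0][0] = 0
    for i in range(1, m + 1):                          # boundary column, as on the slides
        V[i][0] = E[i][0] = gap_open + i * gap_extend
    for j in range(1, n + 1):                          # boundary row, as on the slides
        V[0][j] = F[0][j] = gap_open + j * gap_extend
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            E[i][j] = max(E[i][j - 1] + gap_extend,
                          V[i][j - 1] + gap_open + gap_extend)
            F[i][j] = max(F[i - 1][j] + gap_extend,
                          V[i - 1][j] + gap_open + gap_extend)
            G = V[i - 1][j - 1] + s                    # x(i) aligned to y(j)
            V[i][j] = max(E[i][j], F[i][j], G)
    return V[m][n]


print(affine_global_score("AAATTTCCC", "AAACCC"))
```

In the usage line, the best alignment matches AAA and CCC and spends one gap of length 3 on TTT: 6 matches plus (-3) + 3 x (-1).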
Other Gap Models
• Constant: fixed gap penalty per gap regardless of length
• Non-linear: Gap cost increase is non-linear.
– E.g., g(n) = -(1 + 1/2 + 1/3 + … + 1/n)
• Arbitrary
DP in Linear Space ?
Linear Space DP
• Keep two vectors at a time:
– Two columns or two rows
• O(min{m,n}) space
• O(mn) time
• No backtracking
[Figure: DP grid with AATTCG across the top and ACATCG down the side]
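A minimal sketch of the two-vector idea for edit distance, assuming unit costs. The shorter sequence is laid along the row so the two vectors have min{m,n} + 1 entries. The function name is illustrative.

```python
def edit_distance_two_rows(a, b):
    """Edit distance using only the previous and the current row."""
    if len(b) > len(a):
        a, b = b, a                      # keep the shorter sequence along the row
    prev = list(range(len(b) + 1))       # row 0: D(0, j) = j
    for i, ca in enumerate(a, start=1):
        cur = [i] + [0] * len(b)         # D(i, 0) = i
        for j, cb in enumerate(b, start=1):
            cur[j] = min(prev[j] + 1,                # delete
                         cur[j - 1] + 1,             # insert
                         prev[j - 1] + (ca != cb))   # match or mismatch
        prev = cur                       # the older row is no longer needed
    return prev[-1]


print(edit_distance_two_rows("ACATCG", "AATTCG"))
```

Only the final value survives, which is why this version cannot backtrack to recover the alignment itself.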
Linear Space DP with Backtracking
• Find midpoint of the alignment
– Align the first half
– Align the second half
– Choose the point with the best sum of score/distance
• Search the upper left and lower right of the midpoint
Linear Space DP with Backtracking: Time Complexity
• 2(n/2 x m) = nm
• 2(n/4 x k) + 2(n/4 x (m-k)) = nm/2
• …
• $nm/2^i$
• Adds up to 2nm
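Written out, the levels of the recursion form a geometric series. The sketch below assumes, as in the bullets above, that level i of the recursion touches at most $nm/2^i$ cells in total, with T(n, m) denoting the total work.

```latex
\[
T(n,m) \;\le\; nm + \frac{nm}{2} + \frac{nm}{4} + \dots
       \;=\; nm \sum_{i \ge 0} \frac{1}{2^{i}}
       \;=\; 2nm
\]
```

So recovering the alignment in linear space costs at most about twice the time of the plain O(nm) table fill.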
Next: inversions
Alignment with Inversions
• A’ = T and G’ = C
• ACTCTCTCGCTGTACTG
• AATCT-ACTACTGCTTG
• Each letter is inverted only once.
• An inversion cost (inv) for each inverted block.
• How to find the alignment?
Alignment with Inversions
1. For i = 1:m
   1. For j = 1:n
      1. For g = 1:i
         1. For h = 1:j
            1. Compute Z(g,h; i,j)
      2. V(i,j) = max{
         » max{V(g-1,h-1) + Z(g,h; i,j)} + inv
         » V(i-1,j-1) + s(xi, yj)
         » V(i-1,j) + ins
         » V(i,j-1) + del }
• $O(n^6)$ time
Alignment with Inversions: Faster Method
1. Find all local alignments of x and y’ (Z)
2. V(i,j) = max{
   1. max{V(g-1, h-1) + Z(g, h; i, j)} + inv,
   2. V(i-1, j-1) + s(xi, yj),
   3. V(i-1, j) + ins,
   4. V(i, j-1) + del }
• O(nmL) time, where L is the average number of inverse alignments ending at (i,j)
Recap & Goals
• Other important sequence comparison problems
– banded alignment
– end free search
– pattern search
– non-overlapping alignments
– gaps
– linear-space algorithms
– inversions
– bitwise operations
– neighborhood searching
– NFAs
– approximate alignment
– homology
Pattern Searching with Bitwise Operations
UM-92 (A3)
Pattern Searching with Bitwise Operations (1)
• Simple case: find all exact matches to y in x
• Rj[i] = 1 if the first i letters of y match the last i letters of x[1:j]
• R0[i] = 1 (if i = 0), 0 (if 0 < i <= m)
• Rj+1[i] = 1 (if Rj[i-1] = 1 and y[i] = x[j+1]), 0 (else)
• Match if Rj[m] = 1
(x: text of length n; y: pattern of length m)
Pattern Searching with Bitwise Operations (2)
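• Si[k] =
– 1 if y[i] = kth letter in the alphabet
– 0 else
– (for i = 1, 2, …, m)
• Rj+1 = (right shift of Rj) AND (S[k])
– where x[j+1] = kth letter in the alphabet and S[k] is the column of S for that letter
– the shift brings in a 1 at position 1, because Rj[0] = 1 for every j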
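A minimal sketch of the exact-match case with Python integers as bit vectors. Bit i-1 of the integer stands for position i of y, so the "right shift" of the slides becomes a left shift here; that mapping and the names `shift_and_search` and `S` are choices made for this sketch. A non-empty pattern is assumed.

```python
def shift_and_search(text, pattern):
    """0-based end positions of exact occurrences of pattern in text."""
    m = len(pattern)
    S = {}                                   # S[c]: bit i-1 set when y[i] == c
    for i, c in enumerate(pattern):
        S[c] = S.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)                    # bit for "all m letters matched"
    R = 0
    ends = []
    for j, c in enumerate(text):
        # Shift, bring in a 1 for the empty prefix, keep positions where y[i] == c.
        R = ((R << 1) | 1) & S.get(c, 0)
        if R & accept:
            ends.append(j)
    return ends


print(shift_and_search("AATAACAATACAT", "AATAC"))
```

Each text letter costs one shift, one OR and one AND, provided the pattern fits in a machine word; Python integers remove that limit at some cost in speed.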
Pattern Searching with Bitwise Operations (3)
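• x = AATAACAATACAT
• y = AATAC

S (one row per letter of y, one column per letter of the alphabet):

|   | A | C | G | T |
|---|---|---|---|---|
| A | 1 | 0 | 0 | 0 |
| A | 1 | 0 | 0 | 0 |
| T | 0 | 0 | 0 | 1 |
| A | 1 | 0 | 0 | 0 |
| C | 0 | 1 | 0 | 0 |

R after reading the first letter of x (A): 1 0 0 0 0 (one bit per letter of y)

Next step, reading the second letter of x (A):
• 11000 (right shift of R)
• AND 11010 (column A of S)
• = 11000 (new R)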
Pattern Searching with Bitwise Operations (4)
• Harder case: one edit operation allowed
• Use R and R1
• R for exact match
• R1j[i] = 1 if the first i letters of y match the last i letters of x[1:j] with at most one edit operation.
Pattern Searching with Bitwise Operations (5)
• Insertion
1. y[1:i] matches x[:j] exactly
   • insert x[j+1]
2. y[1:i-1] matches x[:j] with one insertion
   • match y[i] with x[j+1] if they are equal
• R1j+1 = (Rj) OR ((right shift of R1j) AND (S[k]))
– where x[j+1] = kth letter in the alphabet
• Similar reasoning for delete and replace
Pattern Searching with Bitwise Operations (6)
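x = AATAACAATACAT, y = AATAC. Columns show the state after each of the first four letters of x; the remaining columns are not filled in.

R:

|   | A | A | T | A |
|---|---|---|---|---|
| A | 1 | 1 | 0 | 1 |
| A | 0 | 1 | 0 | 0 |
| T | 0 | 0 | 1 | 0 |
| A | 0 | 0 | 0 | 1 |
| C | 0 | 0 | 0 | 0 |

R1:

|   | A | A | T | A |
|---|---|---|---|---|
| A | 1 | 1 | 1 | 1 |
| A | 0 | 1 | 1 | 1 |
| T | 0 | 0 | 1 | 1 |
| A | 0 | 0 | 0 | 1 |
| C | 0 | 0 | 0 | 0 |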
Pattern Searching with Bitwise Operations (7)
• General problem: k edit operations are allowed
• Use R1, R2, …, Rk
• Update $R^{z+1}$ using $R^{z}$ and $R^{z+1}$
• Improve running time by partitioning y into k+1 pieces (next slide).
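A sketch of the k-error case in the same bit-vector style, keeping one vector per error level and combining match, insert, replace and delete at every text letter. Bit i-1 again stands for position i of y, so shifts go left here. The function name `bitap_search` and the exact way the four cases are combined are a standard formulation assumed for this sketch, not copied from the slides; a non-empty pattern is assumed.

```python
def bitap_search(text, pattern, k):
    """0-based end positions where pattern matches a substring of text
    with at most k edit operations."""
    m = len(pattern)
    S = {}
    for i, c in enumerate(pattern):
        S[c] = S.get(c, 0) | (1 << i)
    full = (1 << m) - 1
    accept = 1 << (m - 1)
    # R[d]: bit i-1 is set when y[1:i] matches a suffix of the text read
    # so far with <= d edits. Initially y[1:i] matches the empty text by
    # skipping i pattern letters, which is allowed when i <= d.
    R = [((1 << d) - 1) & full for d in range(k + 1)]
    ends = []
    for j, c in enumerate(text):
        mask = S.get(c, 0)
        old_below = R[0]                      # R[d-1] before this text letter
        R[0] = ((R[0] << 1) | 1) & mask       # exact level
        for d in range(1, k + 1):
            old = R[d]
            match = ((old << 1) | 1) & mask   # letters agree, no new edit
            insert = old_below                # extra letter in the text
            replace = (old_below << 1) | 1    # y[i] replaced by the text letter
            delete = (R[d - 1] << 1) | 1      # y[i] skipped (R[d-1] already updated)
            R[d] = (match | insert | replace | delete) & full
            old_below = old
        if R[k] & accept:
            ends.append(j)
    return ends


print(bitap_search("AATAACAATACAT", "AATAC", 1))
```

With k = 0 the inner loop never runs and the function reduces to exact matching.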
Improving Running Time of Approximate Pattern Search
• For searching k edit distance threshold
• Partition y into k+1 pieces
• At least one of them is an exact match.
• Why ? (Dirichlet principle)
[Figure: y partitioned into k+1 = 4 pieces for k = 3]
Dirichlet (pigeonhole) Principle
• NK balls
• K+1 boxes
• Put balls in boxes
• At least one box contains < N balls
Improving Running Time of Approximate Pattern Search
• Search each partition for an exact match.
• Align around the exact matches only
• Is it a good idea?
• $(k+1)\,n / A^{m/(k+1)}$ random matches, where A is the alphabet size
Neighborhood Searching
Myers-94 (A4)
Neighborhood Searching (1)
• Find all subsequences of x within D edit distance to y.
• Assumption: $m = \log_A n$
• D-neighborhood of y = D-N(y) = set of all sequences within D edit distance to y
1. Find D-N(y)
2. Find exact matches to all the sequences in D-N(y) in x
Neighborhood Searching (2)
• Condensed D-neighborhood of y = D-N’(y) = sequences in D-N(y) that do not have a proper prefix in D-N(y)
Searching Neighbors: Hash Table
• Fix the length of the sequence to search (say m)
• Create a hash table for all subsequences of x of length m.
• Lookup for query sequence of length m
• Positions: 01234567…
• x = CACACATGGTA
• Table for m = 4 (# = no occurrence):
– AAAA -> #
– …
– ACAC -> 1
– …
– CACA -> 0, 2
– …
– TTTT -> #
Hash Table
• {A, C, G, T}
• {0, 1, 2, 3}
• {00, 01, 10, 11}
• GTCAT, first 3-gram GTC
– 101101 = 45
– (((2 x 4) + 3) x 4) + 1
• GTCAT, second 3-gram TCA
– (((3 x 4) + 1) x 4) + 0
– 52
• O(n) space and construction time
• What happens when the query is
– shorter?
– longer?
• What happens when
– alphabet size is large?
– m is large?
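The encoding and table can be sketched as follows, assuming the DNA alphabet with A, C, G, T mapped to 0..3. A Python dict stands in for the full table of $4^m$ slots, so absent keys play the role of the "#" entries; that substitution and the names `encode`, `build_index` and `lookup` are choices made for this sketch.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Read the k-mer as a base-4 number (2 bits per letter)."""
    h = 0
    for c in kmer:
        h = h * 4 + CODE[c]
    return h


def build_index(x, m):
    """Map the code of every length-m substring of x to its start positions."""
    index = {}
    h = 0
    for pos, c in enumerate(x):
        h = (h * 4 + CODE[c]) % (4 ** m)     # rolling update: keep the last m letters
        if pos >= m - 1:
            index.setdefault(h, []).append(pos - m + 1)
    return index


def lookup(index, query):
    """Start positions of exact occurrences of a length-m query."""
    return index.get(encode(query), [])


index = build_index("CACACATGGTA", 4)
print(encode("GTC"), encode("TCA"))
print(lookup(index, "CACA"), lookup(index, "ACAC"), lookup(index, "AAAA"))
```

The rolling update computes each new code from the previous one in constant time, which is what makes the O(n) construction time possible.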
Neighborhood Searching (3)
• O(Dn) worst case
• $O(D\, n^{f(D/m)} \log n)$ expected time, where f(D/m) is an increasing concave function
• What if m is large (i.e., $m > \log_A n$)?
– Dirichlet principle
Using NFA for Sequence Matching
Baeza-Yates, Navarro – 99 (A6)
Using NFA for Sequence Matching (1)
[Figure: NFA for the pattern “patt”, searching inside “waitt”. Legend: match, mismatch, delete and insert transitions on the pattern; 1 = active state, 0 = inactive state]
Using NFA for Sequence Matching (2)
• If A(i,j) = 1 then A(i+d,j+d) = 1 for d>0
• Keep each diagonal’s first active node.
• Di = k if the first active node in diagonal i is k.
• Computation of D (next slide)
A’(i,j) = (A(i,j-1) AND x(k) == y(j)) OR A(i-1,j) OR A(i-1,j-1) OR A’(i-1,j-1)
Using NFA for Sequence Matching (3)
Using NFA for Sequence Matching (4)
• NFA gets large for long y and large error threshold
• How can we manage long y?
– Dirichlet principle
![Page 77: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/77.jpg)
77
Using NFA for Sequence Matching (5)
• Extension: Searching multiple patterns in parallel
![Page 78: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/78.jpg)
78
Approximate Global Alignment of Sequences
T. Kahveci, V. Ramaswamy, H. Tao, T. Li - 2005
![Page 79: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/79.jpg)
79
The problem
• Given sequences X and Y
– Bounded: Find global alignment of X and Y with at most k edit ops.
– Unbounded: Find global alignment of X and Y with p% approximation
• p = 100 % = optimal alignment.
![Page 80: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/80.jpg)
80
Frequency Vectors [KS’01]
• Frequency vector is the count of each letter, in the order [nA, nC, nG, nT].
– f(s = AATGATAG) = [4, 0, 2, 2].
• Edit operations & frequency vectors:
– (del. G), s = AAT.ATAG => f(s) = [4, 0, 1, 2]
– (ins. C), s = AACTATAG => f(s) = [4, 1, 1, 2]
– (A→C), s = ACCTATAG => f(s) = [3, 2, 1, 2]
• Use frequency vectors to measure distance!
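The vectors on this slide can be reproduced with a few lines of Python. This is a small sketch assuming the DNA alphabet in the order A, C, G, T; the names `ALPHABET` and `freq_vector` are illustrative.

```python
ALPHABET = "ACGT"  # assumed component order: [nA, nC, nG, nT]


def freq_vector(s):
    """Count how often each alphabet letter occurs in s."""
    return [s.count(letter) for letter in ALPHABET]


print(freq_vector("AATGATAG"))  # [4, 0, 2, 2]  original string
print(freq_vector("AATATAG"))   # [4, 0, 1, 2]  after deleting one G
print(freq_vector("AACTATAG"))  # [4, 1, 1, 2]  after inserting a C
print(freq_vector("ACCTATAG"))  # [3, 2, 1, 2]  after replacing an A with a C
```

A deletion lowers one component by 1, an insertion raises one component by 1, and a replacement does both at once.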
![Page 81: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/81.jpg)
81
An Approximation to ED: Frequency Distance (FD)
• s = AATGATAG => f(s)=[4, 0, 2, 2]
• q = ACTTAGC => f(q)=[2, 2, 1, 2]
– dec = (4-2) + (2-1) = 3
– inc = (2-0) = 2
– FD(f(s),f(q)) = 3
– ED(q,s) = 4
• FD(f(s1),f(s2)) = max{inc,dec}.
• FD(f(s1),f(s2)) ≤ ED(s1,s2).
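The lower bound holds because one edit operation changes inc by at most 1 and dec by at most 1, so max{inc, dec} drops by at most 1 per operation. The sketch below checks the slide's numbers; the function names are illustrative, and `edit_distance` is the textbook unit-cost dynamic program, used here only for comparison.

```python
def freq_vector(s, alphabet="ACGT"):
    """Letter counts in the order given by alphabet (assumed: A, C, G, T)."""
    return [s.count(letter) for letter in alphabet]


def frequency_distance(u, v):
    """FD = max(inc, dec) between two frequency vectors."""
    inc = sum(b - a for a, b in zip(u, v) if b > a)  # total count that must grow
    dec = sum(a - b for a, b in zip(u, v) if a > b)  # total count that must shrink
    return max(inc, dec)


def edit_distance(s, t):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # match or replace
        prev = cur
    return prev[-1]


s, q = "AATGATAG", "ACTTAGC"
print(frequency_distance(freq_vector(s), freq_vector(q)))  # 3
print(edit_distance(s, q))                                 # 4
```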
![Page 82: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/82.jpg)
82
Distance Prediction using Frequency Vectors
A C T - - T A G
R I I
A A T G A T A G
A C T T A G C
* * * *
A A T G A T A
ED
GED
Given frequency vectors of two strings x and y, GED(x,y) is normally distributed.
Q = [12, 10, 3, 5]
U = [11, 11, 4, 4]
V = [6, 5, 9, 10]
![Page 83: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/83.jpg)
83
Mean :
Variance :
![Page 84: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/84.jpg)
84
Bounded Alignment: lower bounding the alignment
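• Mi,j = Edit distance between the prefixes of X and Y that end at entry (i,j)
• d = lower bound on the ED between the remaining suffixes of X and Y, holding with at least p% probability.
• If (Mi,j + d > cutoff) then
– No solution exists from (i,j) with p% probability.
– Remove entry (i,j)
[Figure: DP matrix of X and Y; Mi,j covers the prefixes up to entry (i,j), and d bounds the remaining suffixes with p% probability]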
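The pruning rule can be sketched in Python as follows. As a loudly stated assumption, this sketch replaces the probabilistic bound d with the exact frequency distance of the two suffixes, which is always a valid lower bound, so no correct alignment is ever discarded. All names are illustrative, and the input is assumed to use only the letters A, C, G, T.

```python
INF = float("inf")
ALPHABET = "ACGT"  # assumed alphabet


def suffix_vectors(s):
    """vecs[i] is the frequency vector of the suffix s[i:]."""
    vecs = [[0] * len(ALPHABET)]
    for ch in reversed(s):
        v = vecs[-1][:]              # copy, then bump one counter
        v[ALPHABET.index(ch)] += 1
        vecs.append(v)
    vecs.reverse()
    return vecs


def frequency_distance(u, v):
    inc = sum(b - a for a, b in zip(u, v) if b > a)
    dec = sum(a - b for a, b in zip(u, v) if a > b)
    return max(inc, dec)


def bounded_edit_distance(x, y, k):
    """Return ED(x, y) if it is at most k, otherwise None."""
    fx, fy = suffix_vectors(x), suffix_vectors(y)
    n, m = len(x), len(y)
    M = [[INF] * (m + 1) for _ in range(n + 1)]  # INF marks a removed entry
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                best = 0
            else:
                best = INF
                if i > 0:
                    best = min(best, M[i - 1][j] + 1)
                if j > 0:
                    best = min(best, M[i][j - 1] + 1)
                if i > 0 and j > 0:
                    best = min(best, M[i - 1][j - 1] + (x[i - 1] != y[j - 1]))
            # Keep the entry only if prefix cost + suffix lower bound fits the cutoff.
            if best + frequency_distance(fx[i], fy[j]) <= k:
                M[i][j] = best
    return None if M[n][m] == INF else M[n][m]


print(bounded_edit_distance("AATGATAG", "ACTTAGC", 4))  # 4
print(bounded_edit_distance("AATGATAG", "ACTTAGC", 3))  # None
```

The sketch still visits every entry. A practical version would skip entries whose neighbors are all removed instead of scanning the full matrix.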
![Page 85: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/85.jpg)
85
Cost of computing lower bound?
• Frequency vectors can be computed in O(1) time incrementally.
Vector order: [A C G T]
A A T T C => [2 1 0 2]
A T T C => [1 1 0 2] (leading A dropped)
![Page 86: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/86.jpg)
86
Unbounded Alignment: upper bounding the alignment
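[Figure: DP matrix of X and Y; Mi,j covers the prefixes up to entry (i,j), and Dij bounds the remaining suffixes]
• Dij = upper bound to the distance between suffixes.
• Use min over all (i,j) of {Mi,j + Dij} as cutoff.
• Prune if (Mi,j + d > cutoff)
• Dij: desirable if it is
– Computed quickly
– Tight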
![Page 87: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/87.jpg)
87
How to Compute the Upper Bound?
[Figure: DP matrix of X and Y with entry Mi,j and the suffix bound Di,j]
Dij = distance for a sample alignment (suffix), e.g.
- A A C C T C
G C A T C T A
Di,j = 4
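The examples on these slides are consistent with one simple sample alignment: right-align the two suffixes, pad the shorter one with leading gaps, and count gaps plus mismatches. The sketch below assumes that reading; the function name `sample_upper_bound` is illustrative.

```python
def sample_upper_bound(xs, ys):
    """Cost of the right-aligned, gap-padded alignment of two suffixes.

    Any concrete alignment costs at least as much as the optimal one,
    so this value is an upper bound on the edit distance.
    """
    overlap = min(len(xs), len(ys))
    gaps = abs(len(xs) - len(ys))                  # leading gap columns
    tail_x = xs[len(xs) - overlap:]
    tail_y = ys[len(ys) - overlap:]
    mismatches = sum(a != b for a, b in zip(tail_x, tail_y))
    return gaps + mismatches


print(sample_upper_bound("AACCTC", "GCATCTA"))  # 4
```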
![Page 88: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/88.jpg)
88
Cost of Computing Upper Bound?
• Upper bound can be computed in O(1) time incrementally.
A A T C T G
- C T C A G
D = 3

A A T C T G
- - T C A G
D = 3

A T C T G
C T C A G
D = 2

[Figure: DP matrix with X = AATCTG and Y = CTCAG]
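One way to obtain the O(1) cost, sketched under the assumption that the sample alignment is the right-aligned, gap-padded one: precompute H[t], the number of mismatches among the last t aligned pairs, then read each bound off with one subtraction and one lookup. The names `make_upper_bound` and `D` are hypothetical.

```python
def make_upper_bound(x, y):
    """Return D(i, j): an upper bound on ED(x[i:], y[j:]) in O(1) per call."""
    n, m = len(x), len(y)
    t_max = min(n, m)
    # H[t] = mismatches among the last t characters of x and y, right-aligned.
    H = [0] * (t_max + 1)
    for t in range(1, t_max + 1):
        H[t] = H[t - 1] + (x[n - t] != y[m - t])

    def D(i, j):
        len_x, len_y = n - i, m - j
        return abs(len_x - len_y) + H[min(len_x, len_y)]

    return D


D = make_upper_bound("AATCTG", "CTCAG")
print(D(0, 0), D(0, 1), D(1, 0))  # 3 3 2
```

Moving to a neighboring entry changes only the leading columns of the alignment, which is why the table never has to be rebuilt.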
![Page 89: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/89.jpg)
89
Optimization 3: Path Prune
[Figure: DP matrix with removed entries marked X]
• No solution exists from entry (i, j) if its path to entry (0, 0) is blocked.
• Remove entry (i,j)
![Page 91: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/91.jpg)
91
Unbounded Alignment: Time
![Page 92: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/92.jpg)
92
Unbounded Alignment: Space
![Page 93: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/93.jpg)
93
Bounded Alignment: Time
![Page 94: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/94.jpg)
94
Bounded Alignment: Space
![Page 95: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/95.jpg)
95
Recap & Goals
• Other important sequence comparison problems
– banded alignment
– end free search
– pattern search
– non-overlapping alignments
– gaps
– linear-space algorithms
– inversions
– bitwise operations
– neighborhood searching
– NFAs
– Homology
![Page 96: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/96.jpg)
96
What is Similarity Anyway ?
• Similar: have similar letters
• Homolog: have a common ancestor
• Not exactly the same!
• Three types of homology
– Paralog
– Ortholog
– Xenolog
[Figure: Parent Organism diverging into Organism A and Organism B]
![Page 97: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/97.jpg)
97
Paralog & Ortholog (1)
• "Two genes are said to be paralogous if they are derived from a duplication event, but orthologous if they are derived from a speciation event." W-H Li
1. A gene called A in species w
2. is duplicated producing initially two copies of A.
3. With time the two copies diverge by evolution forming related genes A1 and A2. These two genes are said to be paralogous to one another. Paralogy typically involves comparisons within a species.
![Page 98: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/98.jpg)
98
Paralog & Ortholog (2)
• Two species, x and y, evolve from species w, their common ancestor. The descendants of the A1 and A2 genes are now called A1x, A1y, and A2x, A2y to reflect which species they now occupy. A1x is orthologous to A1y and A2x is orthologous to A2y. The comparison is between two species.
![Page 99: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/99.jpg)
99
Xenolog
• Xenology is defined as that condition (horizontal transfer) where the history of the gene involves an interspecies transfer of genetic material. It does not include transfer between organelles and the nucleus. It is the only form of homology in which the history has an episode where the descent is not from parent to offspring but, rather, from one organism to another.
![Page 100: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/100.jpg)
100
Paralog, Ortholog, Xenolog
[Figure: paralog, ortholog, and xenolog relationships]
![Page 101: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/101.jpg)
101
Recommended Reading
• Fitch, WM, “Homology: a personal view on some of the problems”, Trends Genet., 2000, 16: 227-231
![Page 102: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/102.jpg)
102
Overview
• Dot plots
• Dynamic programming solutions
– Local, global alignments and their extensions
• Distance and similarity models
• Gap models
• Improvements and variations on the computation of sequence similarity
• Similarity versus homology
![Page 103: 1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.](https://reader035.fdocuments.us/reader035/viewer/2022062314/56649e675503460f94b62630/html5/thumbnails/103.jpg)
103
Next: Substitution Patterns
• Predict substitutions
• What are scoring matrices and how are they derived?