Text Comparison of Genetic Sequences Shiri Azenkot Pomona College DIMACS REU 2004.
Text Comparison of Genetic Sequences
Shiri Azenkot
Pomona College
DIMACS REU 2004
Comparing Two Strings
Definition: A string is a sequence of consecutive characters.
Examples:
– “hello world”
– “0123456”
– DNA sequences
– text file
Comparing Two Strings
If X and Y are strings, how similar are they?
Edit distance, d(X, Y) – smallest number of operations needed to make X look like Y.
• Allowed operations:
– Insert a character
– Delete a character
– Replace a character
• Running time: O(mn) with a dynamic programming algorithm, where m and n are the lengths of X and Y
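The O(mn) bound comes from filling a table with one cell per pair of prefixes of X and Y. A minimal sketch of that dynamic program follows; the function name `edit_distance` and the two-row layout are illustrative choices, not taken from the slides.

```python
def edit_distance(x: str, y: str) -> int:
    """Classic edit distance with insert, delete, and replace (each costs 1)."""
    m, n = len(x), len(y)
    # prev[j] holds the distance between the first i-1 characters of x
    # and the first j characters of y.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # delete x[i-1]
                curr[j - 1] + 1,     # insert y[j-1]
                prev[j - 1] + cost,  # replace, or keep a matching character
            )
        prev = curr
    return prev[n]


print(edit_distance("abcdef", "defabc"))
```

Each of the m·n cells is computed in constant time, which gives the stated running time; keeping only two rows reduces the memory to O(n).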
Comparing Two Strings
If X and Y are strings, how similar are they?
Edit distance, d(X, Y) – smallest number of operations needed to make X look like Y.
X = abcdef Y = defabc
d(X, Y) = ?
• Transforming X into Y one operation at a time:
– X = abcdef, # operations = 0
– X = bcdef, # operations = 1 (delete a)
– X = cdef, # operations = 2 (delete b)
– X = def, # operations = 3 (delete c)
– X = defa, # operations = 4 (insert a)
– X = defab, # operations = 5 (insert b)
– X = defabc, # operations = 6 (insert c)
d(X, Y) = 6
Does this seem too high?
Edit Distance with Moves
• d(X, Y): smallest number of operations to make X look like Y.
– New operation: move a substring
X = abcdef Y = defabc
d(X, Y) = 1
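To make the move operation concrete, here is a small sketch of a single substring move. The helper `move_substring` and its index convention are assumptions made for illustration; the slides do not define an interface.

```python
def move_substring(s: str, start: int, end: int, dest: int) -> str:
    """Cut s[start:end] out of s and reinsert it at index dest of what remains."""
    if not 0 <= start <= end <= len(s):
        raise ValueError("invalid substring bounds")
    block = s[start:end]
    rest = s[:start] + s[end:]
    if not 0 <= dest <= len(rest):
        raise ValueError("invalid destination")
    return rest[:dest] + block + rest[dest:]


# Moving "abc" to the end turns X into Y in one operation.
print(move_substring("abcdef", 0, 3, 3))
```

With this operation available, the six character edits of the earlier example collapse into a single move.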
Edit Distance with Moves
• Some applications:
– Computational biology – DNA sequences
– Text editing
– Webpage updating
Edit Distance with Moves
• The problem is NP-hard
• Algorithm approximates d(X, Y) deterministically
• Run time: O(n log n)
• Edit Sensitive Parsing (ESP) Algorithm:
1. Parse each string into a 2-3 tree
2. Compare nodes (substrings) of the trees to compute edit distance approximation
Edit Distance with Moves: Algorithm
1. Parse each string into a 2-3 tree
Every node represents a substring
X = bagcabagehead
[Figure: 2-3 tree over the leaves b a g c a b a g e h e a d; one node shown is bagca, spanning the first five leaves]
Edit Distance with Moves: Algorithm
1. Parse each string into a 2-3 tree
Every node represents a substring
Y = cabageheadbag
[Figure: 2-3 tree over the leaves c a b a g e h e a d b a g]
Edit Distance with Moves: Algorithm
2. Compare nodes (substrings) of the trees to compute edit distance approximation
2.1 Find frequencies of occurrence of each substring.
X:
– Upper level: bagca (1), bagehead (1)
– Lower level: bag (1), ca (1), ba (1), geh (1), ead (1)
– Leaves: b a g c a b a g e h e a d
Edit Distance with Moves: Algorithm
2. Compare nodes (substrings) of the trees to compute edit distance approximation
2.1 Find frequencies of occurrence of each substring.
Y:
– Upper level: caba (1), gehea (1), dbag (1)
– Lower level: ca (1), ba (1), geh (1), ea (1), db (1), ag (1)
– Leaves: c a b a g e h e a d b a g
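Edit Distance with Moves: Algorithm
2. Compare nodes (substrings) of the trees to compute edit distance approximation
2.1 Find frequencies of occurrence of each substring.
2.2 Subtract characteristic vectors to get approximation for d(X, Y)
X: bagca 1, bagehead 1, bag 1, ca 1, ba 1, geh 1, ead 1
Y: caba 1, gehea 1, dbag 1, ca 1, ba 1, geh 1, ea 1, db 1, ag 1
Only ca, ba, and geh occur in both trees; the other 4 nodes of X and 6 nodes of Y each contribute 1, so the difference = 10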
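The subtraction in step 2.2 can be checked with a few lines of code. This sketch does not implement the ESP parsing itself: the node lists are copied by hand from the trees on the slides, and the names `X_NODES`, `Y_NODES`, and `vector_difference` are made up for the example.

```python
from collections import Counter

# Internal nodes of the two parse trees, transcribed from the slides
# (the parsing step is assumed to have been done already).
X_NODES = ["bagca", "bagehead", "bag", "ca", "ba", "geh", "ead"]
Y_NODES = ["caba", "gehea", "dbag", "ca", "ba", "geh", "ea", "db", "ag"]


def vector_difference(nodes_x, nodes_y) -> int:
    """Sum of absolute differences between the substring frequency vectors."""
    freq_x, freq_y = Counter(nodes_x), Counter(nodes_y)
    # A Counter returns 0 for a substring it has never seen.
    return sum(abs(freq_x[s] - freq_y[s]) for s in freq_x.keys() | freq_y.keys())


print(vector_difference(X_NODES, Y_NODES))
```

Adding the single-character leaves to both lists leaves the result unchanged, because X and Y contain exactly the same characters.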
Edit Distance with Moves: Algorithm
2. Compare nodes (substrings) of the trees to compute edit distance approximation
2.1 Find frequencies of occurrence of each substring.
2.2 Subtract characteristic vectors to get approximation for d(X, Y)
Actual edit distance with moves?
d(bagcabagehead, cabageheadbag) = 1
Edit Distance with Moves
Goals for this project:
– Implement this algorithm
– Test algorithm on DNA sequences
Questions to think about:
– How accurate is the approximation?
– How applicable is this technique for comparing large biological sequences?
– This algorithm finds repeating structures within the sequences when comparing them. Do these structures have significance?
– Do such structures exist for real sequences?
Acknowledgements
• Mentor: Graham Cormode, DIMACS Postdoc
• DIMACS REU 2004
• References:
– Benedetto, D., Caglioti, E., Loreto, V., “Language Trees and Zipping”. Physical Review Letters, 2002
– Cormode, G., Muthukrishnan, S., “The String Edit Distance Matching Problem with Moves”.