Text Comparison of Genetic Sequences Shiri Azenkot Pomona College DIMACS REU 2004.


Page 1

Text Comparison of Genetic Sequences

Shiri Azenkot

Pomona College

DIMACS REU 2004

Page 2

Comparing Two Strings

Definition: A string is a sequence of consecutive characters.

Examples:
– “hello world”
– “0123456”
– DNA sequences
– text file

Page 3

Comparing Two Strings

If X and Y are strings, how similar are they?

Edit distance, d(X, Y): the smallest number of operations needed to make X look like Y.

• Allowed operations:
– Insert a character
– Delete a character
– Replace a character

• Running time: O(mn) with a dynamic programming algorithm, where m and n are the lengths of X and Y
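The dynamic programming algorithm mentioned above can be sketched as follows. This is a minimal illustration of the standard table-filling approach for insert/delete/replace edit distance, not code from the project; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(x, y):
    """Edit distance with insert, delete and replace, each costing 1."""
    m, n = len(x), len(y)
    # dist[i][j] holds the edit distance between x[:i] and y[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # delete all i characters of x[:i]
    for j in range(n + 1):
        dist[0][j] = j          # insert all j characters of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete x[i-1]
                dist[i][j - 1] + 1,         # insert y[j-1]
                dist[i - 1][j - 1] + cost,  # replace, or keep a match
            )
    return dist[m][n]
```

The table has (m+1)(n+1) cells and each is filled in constant time, which is where the O(mn) bound comes from. For instance, `edit_distance("kitten", "sitting")` evaluates to 3.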

Page 4

Comparing Two Strings

If X and Y are strings, how similar are they?

Edit distance, d(X, Y): the smallest number of operations needed to make X look like Y.

X = abcdef, Y = defabc

d(X, Y) = ?

# operations = 0

Page 5

Comparing Two Strings

If X and Y are strings, how similar are they?

Edit distance, d(X, Y): the smallest number of operations needed to make X look like Y.

X = bcdef, Y = defabc

d(X, Y) = ?

# operations = 1

Page 6

Comparing Two Strings

If X and Y are strings, how similar are they?

Edit distance, d(X, Y): the smallest number of operations needed to make X look like Y.

X = cdef, Y = defabc

d(X, Y) = ?

# operations = 2

Page 7

Comparing Two Strings

If X and Y are strings, how similar are they?

Edit distance, d(X, Y): the smallest number of operations needed to make X look like Y.

X = def, Y = defabc

d(X, Y) = ?

# operations = 3

Page 8

Comparing Two Strings

If X and Y are strings, how similar are they?

Edit distance, d(X, Y): the smallest number of operations needed to make X look like Y.

X = defa, Y = defabc

d(X, Y) = ?

# operations = 4

Page 9

Comparing Two Strings

If X and Y are strings, how similar are they?

Edit distance, d(X, Y): the smallest number of operations needed to make X look like Y.

X = defab, Y = defabc

d(X, Y) = ?

# operations = 5

Page 10

Comparing Two Strings

If X and Y are strings, how similar are they?

Edit distance, d(X, Y): the smallest number of operations needed to make X look like Y.

X = defabc, Y = defabc

d(X, Y) = 6

# operations = 6

Does this seem too high?

Page 11

Edit Distance with Moves

• d(X, Y): smallest number of operations to make X look like Y.
– New operation: move a substring

X = abcdef, Y = defabc

d(X, Y) = 1
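To see why one operation suffices once moves are allowed, the move can be written out directly. The sketch below is illustrative only: `move_substring` and `one_move_apart` are hypothetical helper names, and the brute-force search is for checking small examples, not part of the approximation algorithm.

```python
def move_substring(s, start, end, dest):
    """Cut s[start:end] out of s and reinsert it at index dest of what remains."""
    block = s[start:end]
    rest = s[:start] + s[end:]
    return rest[:dest] + block + rest[dest:]


def one_move_apart(x, y):
    """Brute-force check: can a single substring move turn x into y?"""
    n = len(x)
    for start in range(n):
        for end in range(start + 1, n + 1):
            # After cutting the block, n - (end - start) characters remain,
            # so there are that many + 1 places to reinsert it.
            for dest in range(n - (end - start) + 1):
                if move_substring(x, start, end, dest) == y:
                    return True
    return False


# Moving the block "abc" to the end of the string gives Y in one step.
print(move_substring("abcdef", 0, 3, 3))   # defabc
```

The search tries every block and every destination, so it is only practical for short strings; it is here to confirm small cases such as the one on this slide.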

Page 12

Edit Distance with Moves

• d(X, Y): smallest number of operations to make X look like Y.
– New operation: move a substring

• Some applications
– Computational biology: DNA sequences
– Text editing
– Webpage updating

Page 13

Edit Distance with Moves

• The problem is NP-hard

• Algorithm approximates d(X, Y) deterministically

• Run time: O(n log n)

• Edit Sensitive Parsing (ESP) Algorithm:
1. Parse each string into a 2-3 tree
2. Compare nodes (substrings) of the trees to compute an approximation of the edit distance

Page 14

Edit Distance with Moves: Algorithm

1. Parse each string into a 2-3 tree

Every node represents a substring

X = bagcabagehead

[Figure: 2-3 tree for X. Leaves: b a g c a b a g e h e a d. One internal node shown: bagca.]
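The shape of such a tree, where every node has two or three children and spells out a substring, can be illustrated with a deliberately simplified grouping. The sketch below is not ESP: it cuts blocks at fixed positions (pairs, with a final triple when the count is odd), whereas ESP chooses block boundaries from local context so that the parse is robust to edits. The names `group_2_3` and `naive_parse` are hypothetical, and the resulting blocks differ from the parse shown on the slides.

```python
def group_2_3(nodes):
    """Group adjacent nodes into blocks of 2, ending with a block of 3 if needed."""
    if len(nodes) <= 3:
        return ["".join(nodes)]
    groups = []
    i = 0
    while i < len(nodes):
        # Take a block of 3 only when exactly 3 nodes remain; otherwise take 2.
        size = 3 if len(nodes) - i == 3 else 2
        groups.append("".join(nodes[i:i + size]))
        i += size
    return groups


def naive_parse(s):
    """Return the levels of the tree, from single characters up to the root."""
    levels = [list(s)]
    while len(levels[-1]) > 1:
        levels.append(group_2_3(levels[-1]))
    return levels


for level in naive_parse("bagcabagehead"):
    print(level)
```

Each level has at most half as many nodes as the one below it, so the tree has logarithmic height. With fixed block positions, however, a single inserted character shifts every later block, which is the problem an edit-sensitive parse is designed to avoid.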

Page 15

Edit Distance with Moves: Algorithm

1. Parse each string into a 2-3 tree

Every node represents a substring

Y = cabageheadbag

[Figure: 2-3 tree for Y. Leaves: c a b a g e h e a d b a g.]

Page 16

Edit Distance with Moves: Algorithm

2. Compare nodes (substrings) of the trees to compute edit distance approximation

2.1 Find frequencies of occurrence of each substring.

X (frequency of each substring in parentheses):

Upper level: bagca (1), bagehead (1)
Lower level: bag (1), ca (1), ba (1), geh (1), ead (1)
Leaves: b a g c a b a g e h e a d

Page 17

Edit Distance with Moves: Algorithm

2. Compare nodes (substrings) of the trees to compute edit distance approximation

2.1 Find frequencies of occurrence of each substring.

Y (frequency of each substring in parentheses):

Upper level: caba (1), gehea (1), dbag (1)
Lower level: ca (1), ba (1), geh (1), ea (1), db (1), ag (1)
Leaves: c a b a g e h e a d b a g

Page 18

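Edit Distance with Moves: Algorithm

2. Compare nodes (substrings) of the trees to compute edit distance approximation

2.1 Find frequencies of occurrence of each substring.

2.2 Subtract characteristic vectors to get approximation for d(X, Y)

X: bagca (1), bagehead (1), bag (1), ca (1), ba (1), geh (1), ead (1)

Y: caba (1), gehea (1), dbag (1), ca (1), ba (1), geh (1), ea (1), db (1), ag (1)

Vector for X minus vector for Y, summed in absolute value = 10

The substrings ca, ba and geh occur once in both trees and cancel. The remaining 4 substrings of X and 6 substrings of Y each contribute 1, giving 10.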
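The subtraction in step 2.2 can be sketched with frequency counters. The substring lists below are copied from the example trees on the slides, not computed by ESP. The root and the single-character leaves are left out, which is an assumption made here to match the slide's total. The function name `l1_distance` is chosen for illustration.

```python
from collections import Counter

# Internal-node substrings read off the slides' trees for X and Y
# (assumption: root and single-character leaves are omitted).
X_NODES = ["bagca", "bagehead", "bag", "ca", "ba", "geh", "ead"]
Y_NODES = ["caba", "gehea", "dbag", "ca", "ba", "geh", "ea", "db", "ag"]


def l1_distance(nodes_x, nodes_y):
    """Sum of absolute differences between the two frequency vectors."""
    vx, vy = Counter(nodes_x), Counter(nodes_y)
    # A Counter returns 0 for a substring it has never seen.
    return sum(abs(vx[s] - vy[s]) for s in vx.keys() | vy.keys())


print(l1_distance(X_NODES, Y_NODES))   # 10
```

Including the leaves would not change the result for this pair, since X and Y contain exactly the same characters.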

Page 19

Edit Distance with Moves: Algorithm

2. Compare nodes (substrings) of the trees to compute edit distance approximation

2.1 Find frequencies of occurrence of each substring.

2.2 Subtract characteristic vectors to get approximation for d(X, Y)

Actual edit distance with moves?

d(bagcabagehead, cabageheadbag) = 1

Page 20

Edit Distance with Moves

Goals for this project:
– Implement this algorithm
– Test algorithm on DNA sequences

Questions to think about:
– How accurate is the approximation?
– How applicable is this technique for comparing large biological sequences?
– This algorithm finds repeating structures within the sequences when comparing them. Do these structures have significance?
– Do such structures exist for real sequences?

Page 21

Acknowledgements

• Mentor: Graham Cormode, DIMACS Postdoc
• DIMACS REU 2004
• References:
– Benedetto, D., Caglioti, E., Loreto, V., “Language Trees and Zipping”. Physical Review Letters, 2002
– Cormode, G., Muthukrishnan, S., “The String Edit Distance Matching Problem with Moves”.