Similarity and Correction of Strings and Trees: Towards a Correction of XML Documents

Agata SAVARY

Université François-Rabelais de Tours, Campus de Blois, Laboratoire d’Informatique

IPIPAN Seminar, 24 April 2006


String-to-string correction


Traditional string-to-string correction

(Wagner & Fischer 1974, Lowrance & Wagner 1975, …)

• CONTEXT:
  – Finite set of symbols (alphabet)
  – Elementary operations on symbols (editing operations, e.g. deletion, insertion, or replacement of a letter, inversion of two adjacent letters) with their costs (usually 1 per operation)
  – Sequences of editing operations (edit sequences; each operation applies to the word resulting from the previous operations) with their costs (sums of the costs of the editing operations involved)
  – Measure of similarity between words A and B (edit distance or error distance): minimum cost over all edit sequences transforming A into B
• INPUT:
  – Two words A and B
• OUTPUT:
  – Distance between A and B


Examples of elementary edit operations

• Insertion of a letter: monter → montaer, monter → montrer
• Deletion of a letter: monter → montr, monter → monte
• Replacement of a letter by another: monter → ponter, monter → conter
• Transposition of two adjacent letters: monter → mnoter, monter → montre

Each elementary operation has a non-negative cost. From now on we assume cost 1 for each elementary operation.
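The four elementary operations can be sketched in a few lines of Python. This is only an illustration: the function name `one_edit_neighbors` and the choice of the lowercase Latin alphabet are assumptions, not part of the talk.

```python
def one_edit_neighbors(word: str, alphabet: str) -> dict[str, set[str]]:
    """Group every word reachable from `word` by one elementary operation."""
    # All ways of cutting the word into a left and a right part.
    splits = [(word[:k], word[k:]) for k in range(len(word) + 1)]
    return {
        "insertion": {l + c + r for l, r in splits for c in alphabet},
        "deletion": {l + r[1:] for l, r in splits if r},
        "replacement": {
            l + c + r[1:] for l, r in splits if r for c in alphabet if c != r[0]
        },
        "transposition": {
            l + r[1] + r[0] + r[2:]
            for l, r in splits
            if len(r) > 1 and r[0] != r[1]
        },
    }


# Hypothetical usage with the examples of the slide.
ops = one_edit_neighbors("monter", "abcdefghijklmnopqrstuvwxyz")
print("montrer" in ops["insertion"], "montre" in ops["transposition"])
```

With unit costs, every word in these four sets is at distance 1 from the input word.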


Edit sequence

• Edit sequence = sequence of elementary edit operations
• For each pair of words X and Y, many edit sequences exist that transform X into Y.
• Example 1: transforming sorting into string:
  – sorting → srting → sting → string (3 operations)
  – sorting → sotring → string (2 operations)
  – sorting → srting → string (2 operations)
  – sorting → strting → string (2 operations)
  – sorting → srting → sting → sing → sring → string (5 operations)
  – .................
• Example 2: transforming abc into ca:
  – abc → ac → ca (2 operations)
  – abc → cabc → cac → ca (3 operations)
• From now on, we’ll be interested in linear edit sequences (Du & Chang 1992), i.e. such that the operations are performed from left to right, and no further operation may alter the result of a previous operation.

[Slide callouts: four of the sequences above are labelled "Linear sequence".]


Edit (error) distance

• Cost of an edit sequence = sum of the costs of all elementary operations included in the sequence
  – sorting → srting → sting → string (3 operations), cost = 3
  – sorting → sotring → string (2 operations), cost = 2
  – sorting → srting → sting → sing → sring → string (5 operations), cost = 5
• Edit distance (error distance) between two words X and Y, ed(X,Y) = minimal cost over all edit sequences transforming X into Y:
  – ed(sorting, string) = 2
  – ed(abc, ca) = 2, if all edit sequences are taken into account
  – ed(abc, ca) = 3, if only the linear edit sequences are taken into account
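The value obtained when all edit sequences are taken into account can be checked by brute force. The sketch below runs a breadth-first search over words, applying one elementary operation per step. The function names are hypothetical, and two restrictions are assumptions that should not change the minimum: only letters of the two words are used, and intermediate words are capped in length.

```python
from collections import deque


def neighbors(word, alphabet):
    """Yield every word obtained from `word` by one elementary operation."""
    for k in range(len(word) + 1):
        left, right = word[:k], word[k:]
        for c in alphabet:
            yield left + c + right                        # insertion
        if right:
            yield left + right[1:]                        # deletion
            for c in alphabet:
                if c != right[0]:
                    yield left + c + right[1:]            # replacement
        if len(right) > 1:
            yield left + right[1] + right[0] + right[2:]  # transposition


def unrestricted_distance(source, target):
    """Length of the shortest edit sequence, linear or not (unit costs)."""
    alphabet = sorted(set(source + target))
    max_len = len(source) + len(target)   # assumed bound on intermediate words
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        word, cost = queue.popleft()
        if word == target:
            return cost
        for nxt in neighbors(word, alphabet):
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, cost + 1))
    raise ValueError("target not reachable")


print(unrestricted_distance("abc", "ca"))
print(unrestricted_distance("sorting", "string"))
```

The search is exponential in the distance, so it is useful only as a sanity check on very short words.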


Calculating the edit distance (1/4)

Notation: word $X = x_1 x_2 \dots x_i \dots x_n$; the prefix of length i of X: $X[i] = x_1 x_2 \dots x_i$

[Figure: the word $x_1 x_2 x_3 \dots x_i \dots x_n$ with its prefix X[i] marked]

It is possible to calculate the distance between two prefixes X[i+1] and Y[j+1] on the basis of the distances between shorter prefixes: 3 cases

If $x_{i+1} = y_{j+1}$ then

ed(X[i+1],Y[j+1]) = ed(X[i],Y[j])


Calculating the edit distance (2/4)

If $x_i = y_{j+1}$ and $x_{i+1} = y_j$ (the last 2 characters may be transposed) then 4 sub-cases are possible:

• The cheapest sequence transforming X[i+1] into Y[j+1] contains a transposition of $x_i$ and $x_{i+1}$ (transposition’s cost):

ed(X[i+1],Y[j+1]) = ed(X[i-1],Y[j-1]) + 1

• The cheapest sequence transforming X[i+1] into Y[j+1] contains the replacement of $x_{i+1}$ by $y_{j+1}$ (replacement’s cost):

ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1

• The cheapest sequence transforming X[i+1] into Y[j+1] contains the insertion of $y_{j+1}$ (insertion’s cost):

ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1

• The cheapest sequence transforming X[i+1] into Y[j+1] contains the deletion of $x_{i+1}$ (deletion’s cost):

ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1


Calculating the edit distance (3/4)

Otherwise (if $x_{i+1} \neq y_{j+1}$, and ($x_i \neq y_{j+1}$ or $x_{i+1} \neq y_j$)) 3 sub-cases are possible:

• The cheapest sequence transforming X[i+1] into Y[j+1] contains the replacement of $x_{i+1}$ by $y_{j+1}$ (replacement’s cost):

ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1

• The cheapest sequence transforming X[i+1] into Y[j+1] contains the insertion of $y_{j+1}$ (insertion’s cost):

ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1

• The cheapest sequence transforming X[i+1] into Y[j+1] contains the deletion of $x_{i+1}$ (deletion’s cost):

ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1


Calculating the edit distance (4/4)

Edit distance between X[i] and Y[j], recursive definition:

For i=0,...,m, j=0,...,n:

1° ed(X[-1],Y[j]) = ed(X[i],Y[-1]) = max(m,n)

2° ed(X[0],Y[j]) = j

ed(X[i],Y[0]) = i

3°

$$
\mathrm{ed}(X[i+1],Y[j+1]) =
\begin{cases}
\mathrm{ed}(X[i],Y[j]) & \text{if } x_{i+1} = y_{j+1} \\
1 + \min\{\mathrm{ed}(X[i],Y[j]),\ \mathrm{ed}(X[i+1],Y[j]),\ \mathrm{ed}(X[i],Y[j+1]),\ \mathrm{ed}(X[i-1],Y[j-1])\} & \text{if } x_i = y_{j+1} \text{ and } x_{i+1} = y_j \\
1 + \min\{\mathrm{ed}(X[i],Y[j]),\ \mathrm{ed}(X[i+1],Y[j]),\ \mathrm{ed}(X[i],Y[j+1])\} & \text{otherwise}
\end{cases}
$$
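The recursive definition above translates almost line by line into a dynamic-programming routine. The following is a minimal sketch, assuming unit costs. The name `edit_distance_matrix` is hypothetical, and the sentinel row and column of rule 1° are replaced by an explicit bounds check.

```python
def edit_distance_matrix(x: str, y: str) -> list[list[int]]:
    """Return d with d[i][j] = ed(X[i], Y[j]) for all prefixes of x and y."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # rule 2: ed(X[i], Y[0]) = i
    for j in range(n + 1):
        d[0][j] = j                      # rule 2: ed(X[0], Y[j]) = j
    for i in range(m):
        for j in range(n):
            if x[i] == y[j]:             # x_{i+1} = y_{j+1}
                d[i + 1][j + 1] = d[i][j]
                continue
            best = min(d[i][j],          # replacement
                       d[i + 1][j],      # insertion
                       d[i][j + 1])      # deletion
            # Transposition: x_i = y_{j+1} and x_{i+1} = y_j.
            if i >= 1 and j >= 1 and x[i - 1] == y[j] and x[i] == y[j - 1]:
                best = min(best, d[i - 1][j - 1])
            d[i + 1][j + 1] = 1 + best
    return d


print(edit_distance_matrix("sorting", "string")[-1][-1])
print(edit_distance_matrix("abc", "ca")[-1][-1])
```

The bottom-right cell holds the distance between the two complete words; the whole matrix costs O(m·n) time and space.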


Calculating the edit distance: dynamic programming

         s  o  r  t  i  n  g
      0  1  2  3  4  5  6  7
   s  1  0  1  2  3  4  5  6
   t  2  1  1  2  2  3  4  5
   r  3  2  2  1  2  3  4  5
   i  4  3  3  2  2  2  3  4
   n  5  4  4  3  3  3  2  3
   g  6  5  5  4  4  4  3  2

Cell [i,j] contains the edit distance between the prefix [1,...,i] of one word and the prefix [1,...,j] of the other word.

Cell [n,m] contains the edit distance between the 2 words.


Dynamic programming: case 1

         s  o  r  t  i  n  g
      0  1  2  3  4  ?  ?  ?
   s  1  0  1  2  3  ?  ?  ?
   t  2  1  1  2  2  ?  ?  ?
   r  ?  ?  ?  ?  ?  ?  ?  ?
   i  ?  ?  ?  ?  ?  ?  ?  ?
   n  ?  ?  ?  ?  ?  ?  ?  ?
   g  ?  ?  ?  ?  ?  ?  ?  ?

Computing cell [i+1,j+1] when $x_{i+1} = y_{j+1}$


Dynamic programming: case 2

         s  o  r  t  i  n  g
      0  1  2  3  4  ?  ?  ?
   s  1  0  1  2  3  ?  ?  ?
   t  2  1  1  2  2  ?  ?  ?
   r  3  2  2  1  2  ?  ?  ?
   i  ?  ?  ?  ?  ?  ?  ?  ?
   n  ?  ?  ?  ?  ?  ?  ?  ?
   g  ?  ?  ?  ?  ?  ?  ?  ?

Computing cell [i+1,j+1] when $x_i = y_{j+1}$ and $x_{i+1} = y_j$


Dynamic programming: case 3

         s  o  r  t  i  n  g
      0  1  2  3  4  ?  ?  ?
   s  1  0  1  2  3  ?  ?  ?
   t  2  1  1  2  2  ?  ?  ?
   r  3  2  2  1  2  ?  ?  ?
   i  4  3  3  2  2  ?  ?  ?
   n  ?  ?  ?  ?  ?  ?  ?  ?
   g  ?  ?  ?  ?  ?  ?  ?  ?

Computing cell [i+1,j+1] when $x_{i+1} \neq y_{j+1}$ and ($x_i \neq y_{j+1}$ or $x_{i+1} \neq y_j$)


String-to-language correction


String-to-language correction: problem definition

• CONTEXT:
  – Finite set of symbols (alphabet)
  – Elementary edit operations on symbols (as before) with their costs (1 per operation)
  – Edit sequences (as before)
  – Edit distance (error distance) between words: as before
• INPUT:
  – Regular grammar describing words (a finite set of words in particular)
  – Incorrect word A (unrecognizable by the grammar)
  – Threshold t
• OUTPUT:
  – A set of correct words B1, B2, …, Bn whose distance from A stays within t (the nearest neighbors of A)


String-to-language correction: simplistic approach

• METHOD:
  – For each word B recognizable by the grammar, calculate the edit distance matrix between A and B.
  – Propose candidates whose distance from A does not exceed the threshold t (ed(A,B) ≤ t).
• FEASIBILITY:
  – Impossible in case of infinite languages
• COMPLEXITY:
  – O(n * m * |D|)
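For a finite language, the simplistic approach amounts to one full matrix per dictionary word. A minimal sketch follows, assuming the language is given as a plain Python list of words; `ed` and `naive_candidates` are hypothetical names, and `ed` re-implements the unit-cost recursion with transposition used for string-to-string correction.

```python
def ed(x: str, y: str) -> int:
    """Unit-cost edit distance with transposition of adjacent letters."""
    m, n = len(x), len(y)
    d = [[max(i, j) if 0 in (i, j) else 0 for j in range(n + 1)]
         for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]
                continue
            best = min(d[i - 1][j - 1], d[i][j - 1], d[i - 1][j])
            if i > 1 and j > 1 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
                best = min(best, d[i - 2][j - 2])
            d[i][j] = 1 + best
    return d[m][n]


def naive_candidates(word: str, dictionary: list[str], t: int) -> list[str]:
    """Compare `word` with every dictionary entry: one full matrix per entry."""
    return sorted(b for b in dictionary if ed(word, b) <= t)


print(naive_candidates("aply", ["apple", "apples", "apply"], 2))
```

Every entry costs a complete matrix even when it shares a long prefix with the previous one, which is the redundancy that an exploration of an automaton can avoid.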


String-to-language correction: threshold-controlled depth-first exploration of an FSA (Oflazer 1996, …)


String correction with respect to a deterministic FSA (1/4)

[Figure: deterministic FSA with states 1 to 9 and transitions labelled a, p, l, y, e, s]

Word to be corrected: *aply, threshold 2

         a  p  p  l  e
      0  1  2  3  4  5
   a  1  0  1  2  3  4
   p  2  1  0  1  2  3
   l  3  2  1  1  1  2
   y  4  3  2  2  2  2

• Part of the matrix is calculated only once for all valid words sharing the same prefix appl
• Each time a transition is followed, a new column is calculated in the edit distance matrix
• If we get to a final state and the edit distance remains within the threshold, a new candidate has been found

Candidates found so far: apple


String correction with respect to a deterministic FSA (2/4)

         a  p  p  l  e  s
      0  1  2  3  4  5  6
   a  1  0  1  2  3  4  5
   p  2  1  0  1  2  3  4
   l  3  2  1  1  1  2  3
   y  4  3  2  2  2  2  3

Candidates found so far: apple


String correction with respect to a deterministic FSA (3/4)

         a  p  p  l  e  s
      0  1  2  3  4  5  6
   a  1  0  1  2  3  4  5
   p  2  1  0  1  2  3  4
   l  3  2  1  1  1  2  3
   y  4  3  2  2  2  2  3

• Backtracking results in deleting the current column

Candidates found so far: apple


String correction with respect to a deterministic FSA (4/4)

         a  p  p  l  y
      0  1  2  3  4  5
   a  1  0  1  2  3  4
   p  2  1  0  1  2  3
   l  3  2  1  1  1  2
   y  4  3  2  2  2  1

Candidates found: apple, apply


Controlling the search space by the threshold

[Figure: FSA with states 1, 2, 8, 9 and transitions labelled a, b, c, d]

Word to be corrected: abcbb, t=2

                  a  b  b  b  b  b  b
          -2 -1   0  1  2  3  4  5  6
     -2    +  +   +  +  +  +  +  +  +
     -1    +  0   1  2  3  4  5  6  7
   a  0    +  1   0  1  2  3  4  5  6
   b  1    +  2   1  0  1  2  3  4  5
   c  2    +  3   2  1  1  2  3  4  5
   b  3    +  4   3  2  1  1  2  3  4
   b  4    +  5   4  3  2  1  1  2  3

• If every value in the current column exceeds the threshold, the whole path is cut off
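The depth-first exploration with one matrix column per transition, backtracking, and the threshold cut-off can be sketched as follows. Several things here are assumptions made for illustration: the automaton is represented by a trie built from a word list (nested dictionaries, with the key "$" marking final states), and the names `build_trie` and `correct` are hypothetical. Discarding a column on backtracking corresponds simply to returning from the recursive call.

```python
def build_trie(words):
    """Build a trie (a deterministic acyclic automaton) as nested dicts."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True          # assumed marker for a final state
    return root


def correct(word, trie, t):
    """Return (candidate, distance) pairs with distance <= t."""
    m = len(word)
    results = []

    def explore(node, path, col, prev_col):
        # col[i] = distance between the prefix word[:i] and the current path
        if node.get("$") and col[m] <= t:
            results.append((path, col[m]))
        for ch, child in node.items():
            if ch == "$":
                continue
            new = [col[0] + 1]    # one new column per transition
            for i in range(1, m + 1):
                if word[i - 1] == ch:
                    cell = col[i - 1]
                else:
                    cell = 1 + min(col[i - 1], col[i], new[i - 1])
                    if (prev_col is not None and i >= 2
                            and word[i - 2] == ch and word[i - 1] == path[-1]):
                        cell = min(cell, 1 + prev_col[i - 2])  # transposition
                new.append(cell)
            if min(new) <= t:     # otherwise the whole path is cut off
                explore(child, path + ch, new, col)

    explore(trie, "", list(range(m + 1)), None)
    return sorted(results)


trie = build_trie(["apple", "apples", "apply"])
print(correct("aply", trie, 2))
```

Columns for the shared prefix appl are computed once and reused for apple, apples and apply, and the path through apples is abandoned because its column contains no value within the threshold.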


Tree-to-tree correction


Tree-to-tree correction (Selkow 1977, …)

• CONTEXT:
  – Finite set of node symbols (alphabet)
  – Elementary edit operations on trees:
    • Insertion of a leaf
    • Deletion of a leaf
    • Renaming of a node (leaf or internal node)
  – Non-negative cost for each elementary operation
  – Edit sequences (sequences of edit operations) with their costs (sums of the costs of the editing operations involved)
  – Edit distance between two trees A and B: minimum cost over all edit sequences transforming A into B
• INPUT:
  – Two trees A and B
• OUTPUT:
  – Distance between A and B


Comparing two trees (Selkow 1977, …)

• A partial tree A0:i is the root of A and its subtrees A0,...,Ai
• The comparison is based on comparing roots, and then recursively comparing the roots’ subtrees

[Figure: tree A with root(A) and subtrees A0, A1, A2; tree B with root(B) and subtrees B0, B1, B2, B3; the partial trees A0:1 and B0:2 are marked; node labels are drawn from a, b, c, d, e, f]


Edit distance matrix between two trees (Selkow 1977, …)

          -1   0   1   2   3
    -1     1   4  14  15  16
     0     4   2  12  13  14
     1    15  13   3   4   5
     2    16  14   4   4   4

Cell [-1,-1] contains the cost of renaming root(A) into root(B).

Cell [i,j] contains the edit distance between the partial trees A0:i and B0:j.

Cell [n,m] contains the edit distance between the 2 trees.


Calculation of the tree matrix (Selkow 1977, …)

          -1   0   1   2   3
    -1     1   4  14  15  16
     0     4   2  12  13  14
     1    15  13   3   4   5
     2    16  14   4   4   ?

Cell [i,j] (here [2,3]) is obtained from its three neighbours:

• From cell [i-1,j-1]: adding the edit distance between Ai and Bj (here +0)
• From cell [i-1,j]: adding the cost of deleting Ai (here +1)
• From cell [i,j-1]: adding the cost of inserting Bj (here +1)
• Taking the minimum (here min(4+0, 5+1, 4+1) = 4)
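The matrix calculation above can be sketched recursively. This is an illustration under stated assumptions rather than Selkow's original formulation: a tree is a `(label, children)` tuple, renaming costs 1 (0 when the labels agree), and inserting or deleting a whole subtree costs its number of nodes, since it is built or removed leaf by leaf at unit cost. The names `size` and `tree_distance` are hypothetical, and index 0 in the code plays the role of index -1 on the slides.

```python
def size(tree):
    """Number of nodes, i.e. the cost of inserting or deleting the subtree."""
    _, children = tree
    return 1 + sum(size(c) for c in children)


def tree_distance(a, b):
    """Edit distance between two ordered labelled trees (unit costs)."""
    (label_a, kids_a), (label_b, kids_b) = a, b
    n, m = len(kids_a), len(kids_b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0 if label_a == label_b else 1          # renaming the root
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + size(kids_a[i - 1])   # delete subtrees of A
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + size(kids_b[j - 1])   # insert subtrees of B
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + tree_distance(kids_a[i - 1], kids_b[j - 1]),
                d[i - 1][j] + size(kids_a[i - 1]),    # deleting A_i
                d[i][j - 1] + size(kids_b[j - 1]),    # inserting B_j
            )
    return d[n][m]


# Hypothetical example: a(b(d), c) compared with a(c).
A = ("a", [("b", [("d", [])]), ("c", [])])
B = ("a", [("c", [])])
print(tree_distance(A, B))
```

The recursion on subtree pairs mirrors the "vertical" step, while each matrix is a string-like alignment of the two sequences of sibling subtrees.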


Extension to the correction of XML documents

• The validity of a node is described by a set of regular expressions, e.g. E = ab*c + db*
• The "horizontal" correction on a siblings’ level is similar to the string-to-language correction (Oflazer 1996)
• The "vertical" correction is inspired by the tree-to-tree correction (Selkow 1977)

[Figure: XML tree made of the elements <root>; <x>, <y>, <z>; <a>, <b>, <c>, <b>, <b>]
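The validity condition on a node can be illustrated with Python's `re` module: the sequence of child labels is read as a word and matched against the expression E = ab*c + db*, where "+" denotes union and is written "|" in Python's syntax. The helper name `children_valid` and the restriction to one-letter element names are assumptions made for this sketch.

```python
import re

# E = ab*c + db*  (union written "|" in Python's regex syntax)
E = re.compile(r"ab*c|db*")


def children_valid(labels):
    """True if the sequence of child labels forms a word accepted by E."""
    return E.fullmatch("".join(labels)) is not None


print(children_valid(["a", "b", "b", "c"]))       # a word of ab*c
print(children_valid(["a", "b", "c", "b", "b"]))  # not accepted by E
```

A sibling sequence that fails this check is the kind of word for which the "horizontal" correction looks for nearest neighbors in the language of E.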


Main idea

• String-to-string (Wagner & Fischer 1974)
• String-to-(regular) language (Oflazer 1996)
• Tree-to-tree (Selkow 1977)
• Tree-to-(regular) tree language (Cheriat, Savary, Bouchou, Halfeld, to be continued)


Edit distance matrix with edit sequences

Cell [i,j] contains the edit distance between the partial trees A0:i and B0:j, and the edit sequence necessary to transform A0:i into B0:j.

          -1    0    1    2                                      3
    -1   ...  ...  ...  ...                                    ...
     0   ...  ...  ...  ...                                    ...
     1   ...  ...  ...  [3, <(R,0.1,f),(D,1.1,/),(I,2,e)>]     ...
     2   ...  ...  ...  ...                                    ...


Bibliography

• Clarke, G., Barnard, D. T., Duncan, N. (1995): Tree-to-tree Correction for Document Trees. Technical Report 95-372, Department of Computing and Information Science, Queen’s University, Kingston, Ontario
• Du, M. W., Chang, S. C. (1992): A model and a fast algorithm for multiple errors spelling correction. Acta Informatica, Vol. 29. Springer Verlag, pp. 281-302
• Hall, P., Dowling, G. (1980): Approximate String Matching. ACM Computing Surveys, Vol. 12(4). ACM, New York, pp. 381-402
• Lowrance, R., Wagner, R. A. (1975): An Extension of the String-to-String Correction Problem. Journal of the ACM, Vol. 22(2), pp. 177-183
• Mihov, S., Schulz, K. (2004): Fast approximate search in large dictionaries. Computational Linguistics, Vol. 30(4). MIT Press, Cambridge, Massachusetts, pp. 451-477
• Oflazer, K. (1996): Error-tolerant finite state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, Vol. 22(1). MIT Press, Cambridge, Massachusetts, pp. 73-89
• Selkow, S. (1977): The tree-to-tree editing problem. Information Processing Letters, Vol. 6(6), pp. 184-186
• Wagner, R. A. (1974): Order-n Correction for Regular Languages. Communications of the ACM, Vol. 17(5), pp. 265-268
• Wagner, R. A., Fischer, M. J. (1974): The String-to-String Correction Problem. Journal of the ACM, Vol. 21(1), pp. 168-173


Some details of the state of the art

• Wagner & Fischer (1974):
  – Elegant and solid theoretical definition of the string-to-string correction problem
  – 3 elementary operations on single letters admitted (insertion, deletion, replacement)
  – Model of a trace describing the edit distance between two strings
  – Dynamic programming method
• Lowrance & Wagner (1975):
  – Additional elementary operation: inversion of two adjacent letters
  – Restriction of the cost function
• Du & Chang (1992):
  – Cost 1 for each elementary operation
  – Restriction to linear editing sequences
  – Application to the nearest neighbor search in a dictionary, with a threshold
• Oflazer (1996):
  – Nearest-neighbor search in finite-state automata
  – Application to large natural-language dictionaries
• Selkow (1977), Tai (1979), Zhang & Shasha (1989), Clarke, Barnard & Duncan (1995), de Rougemont (2003):
  – Tree-to-tree correction problem
• Mihov & Schulz (2004):
  – Levenshtein automaton
  – Backward dictionary
• Bouchou, B. & Halfeld Ferrari Alves, M. (2003):
  – Incremental validation of XML documents resulting from updates: human-computer interaction