SpSim

Measuring Spelling Similarity for Cognate Identification. Luís Gomes, Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa. EPIA 2011, October 10, 2011, Lisboa

description

This talk presents SpSim, a new string similarity measure for identifying cognates. It tolerates characteristic spelling differences that are automatically extracted from a set of cognates known a priori. Talk given at EPIA 2011, October 10, 2011, Lisboa.

Transcript of SpSim

Page 1: SpSim

Measuring Spelling Similarity for Cognate Identification

Luís Gomes

Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa

EPIA 2011, October 10, 2011, Lisboa

Page 2: SpSim

What are cognates?

In linguistics, cognates are words that have a common etymological origin. – Wikipedia

Example

The words etymology (English) and etimologia (Portuguese) both derive from Greek etymología through Latin etymologia.

EPIA 2011 Measuring Spelling Similarity for Cognate Identification Luís Gomes

Page 4: SpSim

What are cognates?

I am particularly interested in cognates of different languages that retain the same meaning, such as

German      symbole    themen   operative
English     symbols    themes   operational
French      symboles   thèmes   opérationnelle
Spanish     símbolos   temas    operativa
Portuguese  símbolos   temas    operacional
Italian     simboli    temi     operativa

Page 6: SpSim

What are cognates?

I am particularly interested in cognates of different languages that retain the same meaning, such as

German      demokratische  aspekte   justiz
English     democratic     aspects   justice
French      démocratique   aspects   justice
Spanish     democrática    aspectos  justicia
Portuguese  democrática    aspectos  justiça
Italian     democratica    aspetti   giustizia

Page 7: SpSim

Extracting cognates from parallel corpora

Example parallel sentences

The Member States shall coordinate their economic policies within the Union .

Os Estados - Membros coordenam as suas políticas económicas no âmbito da União .

Spelling similarity

Cognates typically have similar spellings.

Association

Translations tend to co-occur systematically in parallel texts, while non-translations co-occur by chance.

Page 10: SpSim

Extracting cognates from parallel corpora

Spelling similarity measures

- EDSim (Edit-Distance-based Similarity)
- LCSR (Longest Common Subsequence Ratio)
- and a few others ...

Association measures

- Dice
- SCP (Symmetric Conditional Probability)
- Mutual Information
- and many others ...

Page 11: SpSim

Extracting cognates from parallel corpora

Most commonly used spelling similarity measures

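$$\mathrm{EDSim}(w_1, w_2) = 1 - \frac{\mathrm{ED}(w_1, w_2)}{\max(|w_1|, |w_2|)}$$

ED(w1, w2) is the edit distance between words w1 and w2, and |w| is the length of word w.

$$\mathrm{LCSR}(w_1, w_2) = \frac{\mathrm{LCS}(w_1, w_2)}{\max(|w_1|, |w_2|)}$$

LCS(w1, w2) is the length of the longest common subsequence between words w1 and w2.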

Page 12: SpSim

Extracting cognates from parallel corpora

Problem with these measures

They look at strings too literally!

EDSim(photographic, fotográfica) = 0.5

LCSR(photographic, fotográfica) = 0.58

The spelling similarity score should be closer to 1.0 to reflect human judgement.
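
The two baseline measures can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: it assumes ED is the standard Levenshtein distance with unit costs, and that the Portuguese word carries its accent (fotográfica, written here with an explicit escape so the accented letter is a single character). The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edsim(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def lcsr(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest


if __name__ == "__main__":
    en, pt = "photographic", "fotogr\u00e1fica"  # fotográfica
    print(edsim(en, pt))           # 0.5
    print(round(lcsr(en, pt), 2))  # 0.58
```

Under these assumptions the sketch gives the two scores quoted on the slide: six edits over a maximum length of 12, and a common subsequence of seven letters.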

Page 13: SpSim

How does SpSim work?

First we align the two strings to find differences

This takes O(|w1| |w2|) time, just like computing ED(w1, w2) or LCS(w1, w2).

^  ph  o  t  o  g  r  aph  i  c     $
^  f   o  t  o  g  r  áf   i  c  o  $

Then we check which differences we may ignore

- Is “ph f” in the hashtable?
- Is “aph áf” in the hashtable?
- Is “ o” in the hashtable?

In learning mode we would insert these differences into the hashtable instead.
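
The align-then-look-up step might be sketched as follows. The talk does not say which alignment routine is used, so this sketch swaps in Python's `difflib.SequenceMatcher` as a stand-in aligner (it is not guaranteed to give a minimum-edit alignment), and a plain `set` plays the role of the hashtable. The function names are hypothetical, and the Portuguese word is assumed to be fotográfico with its accent.

```python
from difflib import SequenceMatcher


def differences(w1: str, w2: str) -> list[tuple[str, str]]:
    """Align w1 and w2 (with ^ and $ boundary markers) and return the
    non-matching spans as (span in w1, span in w2) pairs."""
    a, b = f"^{w1}$", f"^{w2}$"
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def learn(known: set, w1: str, w2: str) -> None:
    """Learning mode: record every difference seen in a known cognate pair."""
    known.update(differences(w1, w2))


def unknown_differences(known: set, w1: str, w2: str) -> list[tuple[str, str]]:
    """Scoring mode: keep only the differences that are not in the table."""
    return [d for d in differences(w1, w2) if d not in known]


if __name__ == "__main__":
    known = set()
    learn(known, "photographic", "fotogr\u00e1fico")  # fotográfico
    print(sorted(known))
    print(unknown_differences(known, "phase", "fase"))  # []
```

After learning from the single pair above, the table holds the three differences listed on the slide, so the "ph" to "f" difference in phase/fase is recognised and can be ignored.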

Page 17: SpSim

How does SpSim work?

Finally, we compute SpSim

$$\mathrm{SpSim}(w_1, w_2) = 1 - \frac{\sum_i D_i}{|w_1| + |w_2|}$$

D_i is the length of each difference that cannot be ignored. If no difference is ignored, then SpSim(w1, w2) = EDSim(w1, w2).
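
A rough sketch of the final scoring step is given below. It rests on several assumptions that the slide does not settle: the aligner is `difflib.SequenceMatcher` (a stand-in, not necessarily the author's alignment), the length D_i of a difference is taken to be the number of characters it spans in both words together, and the normaliser is the sum of the two word lengths as written in the formula above. Under this reading the score need not coincide with EDSim when nothing is ignored. All names are hypothetical.

```python
from difflib import SequenceMatcher


def differences(w1: str, w2: str) -> list[tuple[str, str]]:
    """Non-matching spans of the aligned words, as (span in w1, span in w2)."""
    a, b = f"^{w1}$", f"^{w2}$"
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def spsim(w1: str, w2: str, ignorable: set) -> float:
    """1 minus the total length of non-ignorable differences over |w1| + |w2|.

    ASSUMPTION: the length of a difference counts characters on both sides.
    """
    total = len(w1) + len(w2)
    if total == 0:
        return 1.0
    penalty = sum(len(src) + len(tgt)
                  for src, tgt in differences(w1, w2)
                  if (src, tgt) not in ignorable)
    return 1 - penalty / total


if __name__ == "__main__":
    en, pt = "photographic", "fotogr\u00e1fico"  # fotográfico
    learned = set(differences(en, pt))
    print(spsim(en, pt, learned))  # 1.0: every difference is known
    print(spsim(en, pt, set()))    # lower: nothing can be ignored
```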

Page 18: SpSim

How does SpSim work?

Problem: over-generalization

Some differences, such as inserting an “o” in the Portuguese word, are too vague and may occur by chance (i.e., even if the words are totally unrelated).

^  ph  o  t  o  g  r  aph  i  c     $
^  f   o  t  o  g  r  áf   i  c  o  $

Page 19: SpSim

How does SpSim work?

Solution: contextualize first and generalize afterwards

Contextualized differences are less likely to occur by chance. Example: insert an “o” at the end of the Portuguese word if the English word ends with a “c”.

^pho  t  o  g  raphi  c$
^fo   t  o  g  ráfi   co$

Whenever we find the same difference in a different context we may generalize it.

^pha  s  e  $
^fa   s  e  $

“^pho ^fo” + “^pha ^fa” = “^ph ^f”
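
One way to picture the generalization rule is the sketch below. It is an assumption-laden illustration, not the talk's data structure: each contextualized difference is stored with exactly one character of shared context on each side, and generalizing two observations of the same difference keeps only the context on which they agree (`None` stands for "any character"). The `CtxDiff` type and `generalize` function are hypothetical names.

```python
from typing import NamedTuple, Optional


class CtxDiff(NamedTuple):
    left: Optional[str]   # character before the difference, None = any
    src: str              # differing span in the first word
    tgt: str              # differing span in the second word
    right: Optional[str]  # character after the difference, None = any


def generalize(a: CtxDiff, b: CtxDiff) -> Optional[CtxDiff]:
    """Merge two observations of the same difference, dropping any
    context on which they disagree. Returns None if the differences
    themselves are not the same."""
    if (a.src, a.tgt) != (b.src, b.tgt):
        return None
    left = a.left if a.left == b.left else None
    right = a.right if a.right == b.right else None
    return CtxDiff(left, a.src, a.tgt, right)


if __name__ == "__main__":
    from_photographic = CtxDiff("^", "ph", "f", "o")  # "^pho" vs "^fo"
    from_phase = CtxDiff("^", "ph", "f", "a")         # "^pha" vs "^fa"
    print(generalize(from_photographic, from_phase))
    # CtxDiff(left='^', src='ph', tgt='f', right=None), i.e. "^ph" vs "^f"
```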

Page 20: SpSim

Experimental setup

Corpora

I used a parallel corpus of texts from the European Constitution in five language pairs.

Method

1. Obtain a list of putative cognates by thresholding an association measure (Dice).
2. Manually verify all putative cognates.
3. Compare the precision, recall and f-measure of SpSim and EDSim for a series of different threshold values.
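
Step 3 can be illustrated with a small helper that scores a similarity measure at one threshold. This is a sketch under assumptions: a pair is predicted to be a cognate when its score is greater than or equal to the threshold (the talk does not say whether the comparison is strict), and the toy scores below are invented purely for illustration.

```python
def evaluate(scored_pairs, threshold):
    """Return (precision, recall, f_measure) for one threshold.

    scored_pairs: iterable of (similarity score, manually verified label).
    ASSUMPTION: a pair is accepted when score >= threshold.
    """
    tp = fp = fn = 0
    for score, is_cognate in scored_pairs:
        accepted = score >= threshold
        if accepted and is_cognate:
            tp += 1
        elif accepted:
            fp += 1
        elif is_cognate:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Invented toy data: (score, verified as cognate?)
    toy = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
    for t in (0.3, 0.5, 0.7):
        print(t, evaluate(toy, t))
```

Sweeping the threshold from 0 to 1 and plotting the three values for each measure gives curves of the kind compared later in the talk.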

Page 21: SpSim

Experimental setup

I used Dice to extract the initial list of putative cognates

$$\mathrm{Dice}(x, y) = \frac{2\,F(x, y)}{F(x) + F(y)}$$

F(x, y) is the number of co-occurrences of x and y in all parallel segments s_i:

$$F(x, y) = \sum_i \min\bigl(f(x, s_i),\, f(y, s_i)\bigr)$$

F(x) and F(y) are the total numbers of occurrences of x and y in all parallel segments s_i:

$$F(x) = \sum_i f(x, s_i) \qquad F(y) = \sum_i f(y, s_i)$$
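
The Dice computation can be sketched directly from these definitions. The sketch assumes that f(x, s_i) is the number of times x occurs on the source side of segment s_i and f(y, s_i) the number of times y occurs on its target side, and that segments are already tokenized. The function name and the toy segments are made up for illustration.

```python
def dice(x, y, segments):
    """Dice association between source word x and target word y.

    segments: list of (source tokens, target tokens) pairs.
    """
    f_xy = f_x = f_y = 0
    for source_tokens, target_tokens in segments:
        count_x = source_tokens.count(x)
        count_y = target_tokens.count(y)
        f_xy += min(count_x, count_y)  # co-occurrences in this segment
        f_x += count_x
        f_y += count_y
    return 2 * f_xy / (f_x + f_y) if f_x + f_y else 0.0


if __name__ == "__main__":
    # Invented toy corpus, not taken from the talk's data.
    toy = [
        ("the union shall act".split(), "a uni\u00e3o actua".split()),
        ("the union".split(), "a uni\u00e3o".split()),
        ("the member states".split(), "os estados membros".split()),
    ]
    print(dice("union", "uni\u00e3o", toy))  # 1.0
    print(dice("the", "a", toy))             # 0.8
    print(dice("union", "estados", toy))     # 0.0
```

Pairs whose score reaches the chosen cut-off (0.6 in the talk) would then be kept as putative cognates.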

Page 22: SpSim

Experimental setup

Extracted all pairs of words with Dice ≥ 0.6.

Summary of extraction and manual verification

Language Pair        Accepted  Rejected  Total
German-English            269       878   1147
English-Spanish           399       749   1148
English-French            380       825   1205
English-Portuguese        410       796   1206
French-Italian            635       974   1609

Page 23: SpSim

Comparing SpSim to EDSim

[Figure: precision, recall and f-measure of EDSim and SpSim plotted against the threshold (0 to 1), in two panels. English–Portuguese: SpSim learned from 16 examples (edsim > 0.9). English–German: SpSim learned from 4 examples (edsim > 0.9).]

Page 24: SpSim

Comparing SpSim to EDSim

[Figure: precision, recall and f-measure of EDSim and SpSim plotted against the threshold (0 to 1), in two panels. English–Portuguese: SpSim learned from 57 examples (edsim > 0.8). English–German: SpSim learned from 25 examples (edsim > 0.8).]

Page 25: SpSim

Comparing SpSim to EDSim

[Figure: precision, recall and f-measure of EDSim and SpSim plotted against the threshold (0 to 1), in two panels. English–Portuguese: SpSim learned from 140 examples (edsim > 0.7). English–German: SpSim learned from 46 examples (edsim > 0.7).]

Page 26: SpSim

Comparing SpSim to EDSim

[Figure: precision, recall and f-measure of EDSim and SpSim plotted against the threshold (0 to 1), in two panels. English–Portuguese: SpSim learned from 202 examples (edsim > 0.6). English–German: SpSim learned from 61 examples (edsim > 0.6).]

Page 27: SpSim

Comparing SpSim to EDSim

[Figure: precision, recall and f-measure of EDSim and SpSim plotted against the threshold (0 to 1), in two panels. English–Portuguese: SpSim learned from 246 examples (edsim > 0.5). English–German: SpSim learned from 75 examples (edsim > 0.5).]

Page 28: SpSim

Comparing SpSim to EDSim

[Figure: precision, recall and f-measure of EDSim and SpSim plotted against the threshold (0 to 1), in two panels. English–Portuguese: SpSim learned from 267 examples (edsim > 0.4). English–German: SpSim learned from 89 examples (edsim > 0.4).]

Page 29: SpSim

Comparing SpSim to EDSim

[Figure: precision, recall and f-measure of EDSim and SpSim plotted against the threshold (0 to 1), in two panels. English–Portuguese: SpSim learned from 299 examples (edsim > 0.3). English–German: SpSim learned from 106 examples (edsim > 0.3).]

Page 30: SpSim

Comparing SpSim to EDSim

[Figure: precision, recall and f-measure of EDSim and SpSim plotted against the threshold (0 to 1), in two panels. English–Portuguese: SpSim learned from 346 examples (edsim > 0.2). English–German: SpSim learned from 147 examples (edsim > 0.2).]

Page 31: SpSim

Comparing SpSim to EDSim

[Figure: precision, recall and f-measure of EDSim and SpSim plotted against the threshold (0 to 1), in two panels. English–Portuguese: SpSim learned from 380 examples (edsim > 0.1). English–German: SpSim learned from 219 examples (edsim > 0.1).]

Page 32: SpSim

Comparing SpSim to EDSim

[Figure: one panel per language pair; only the panel labels survive.]

EN-ES: 18 examples (edsim > 0.9)
EN-FR: 31 examples (edsim > 0.9)
EN-PT: 16 examples (edsim > 0.9)
DE-EN: 4 examples (edsim > 0.9)
FR-IT: 14 examples (edsim > 0.9)

Page 33: SpSim

Comparing SpSim to EDSim

[Figure: one panel per language pair; only the panel labels survive.]

EN-ES: 103 examples (edsim > 0.8)
EN-FR: 93 examples (edsim > 0.8)
EN-PT: 57 examples (edsim > 0.8)
DE-EN: 25 examples (edsim > 0.8)
FR-IT: 124 examples (edsim > 0.8)

Page 34: SpSim

Comparing SpSim to EDSim

[Figure: one panel per language pair; only the panel labels survive.]

EN-ES: 168 examples (edsim > 0.7)
EN-FR: 149 examples (edsim > 0.7)
EN-PT: 140 examples (edsim > 0.7)
DE-EN: 46 examples (edsim > 0.7)
FR-IT: 251 examples (edsim > 0.7)

Page 35: SpSim

Comparing SpSim to EDSim

[Figure: one panel per language pair; only the panel labels survive.]

EN-ES: 203 examples (edsim > 0.6)
EN-FR: 181 examples (edsim > 0.6)
EN-PT: 202 examples (edsim > 0.6)
DE-EN: 61 examples (edsim > 0.6)
FR-IT: 362 examples (edsim > 0.6)

Page 36: SpSim

Comparing SpSim to EDSim

[Figure: one panel per language pair; only the panel labels survive.]

EN-ES: 244 examples (edsim > 0.5)
EN-FR: 220 examples (edsim > 0.5)
EN-PT: 246 examples (edsim > 0.5)
DE-EN: 75 examples (edsim > 0.5)
FR-IT: 449 examples (edsim > 0.5)

Page 37: SpSim

Comparing SpSim to EDSim

[Figure: one panel per language pair; only the panel labels survive.]

EN-ES: 255 examples (edsim > 0.4)
EN-FR: 234 examples (edsim > 0.4)
EN-PT: 267 examples (edsim > 0.4)
DE-EN: 89 examples (edsim > 0.4)
FR-IT: 502 examples (edsim > 0.4)

Page 38: SpSim

Comparing SpSim to EDSim

[Figure: one panel per language pair; only the panel labels survive.]

EN-ES: 286 examples (edsim > 0.3)
EN-FR: 260 examples (edsim > 0.3)
EN-PT: 299 examples (edsim > 0.3)
DE-EN: 106 examples (edsim > 0.3)
FR-IT: 538 examples (edsim > 0.3)

Page 39: SpSim

Comparing SpSim to EDSim

[Figure: one panel per language pair; only the panel labels survive.]

EN-ES: 329 examples (edsim > 0.2)
EN-FR: 301 examples (edsim > 0.2)
EN-PT: 346 examples (edsim > 0.2)
DE-EN: 147 examples (edsim > 0.2)
FR-IT: 581 examples (edsim > 0.2)

Page 40: SpSim

Comparing SpSim to EDSim

[Figure: one panel per language pair; only the panel labels survive.]

EN-ES: 368 examples (edsim > 0.1)
EN-FR: 343 examples (edsim > 0.1)
EN-PT: 380 examples (edsim > 0.1)
DE-EN: 219 examples (edsim > 0.1)
FR-IT: 622 examples (edsim > 0.1)

Page 41: SpSim

Conclusions

- SpSim learns fast
- SpSim has much better recall than EDSim (and LCSR)
- SpSim has the same time complexity as EDSim and LCSR
- SpSim is almost as easy to implement as EDSim or LCSR

Page 45: SpSim

Thanks for listening

Questions?
