Measures of Text Similarity

Presented By: Ehsan Asgarian, Ferdowsi University of Mashhad


Page 1: Measures of Text  Similarity

Measures of Text Similarity

Presented By: Ehsan Asgarian

Ferdowsi University of Mashhad

Page 2: Measures of Text  Similarity

Agenda

• Introduction
• Syntactical (String-Based) Similarity
    • Character-Based (word level)
    • Term-Based (sentence or document level)
• Semantic (Knowledge-Based) Similarity
    • Path Length
    • Information Content
    • Relatedness (Dictionary-based method)
• Statistical (Corpus-Based) Similarity

Page 3: Measures of Text  Similarity

Why text similarity?

Used everywhere in NLP:

• Information retrieval (Query vs Document)
• Text classification (Document vs Category)
• Word-sense disambiguation (Context vs Context)
• Automatic evaluation
    • Machine translation (Gold Standard vs Generated)
    • Text summarization (Summary vs Original)

Page 4: Measures of Text  Similarity

Distance Function

A metric on a set X is a function d : X × X → R (where R is the set of real numbers).

For all x, y, z in X, this function is required to satisfy the following conditions:

• d(x, y) ≥ 0 (non-negativity, or separation axiom)
• d(x, y) = 0 if and only if x = y (coincidence axiom)
• d(x, y) = d(y, x) (symmetry)
• d(x, z) ≤ d(x, y) + d(y, z) (subadditivity / triangle inequality)

Page 6: Measures of Text  Similarity

Syntactical (String-Based) Similarity

Character-Based
• LCS
• Levenshtein
• N-gram
• Jaro
• Jaro-Winkler
• Soundex (phonetic algorithms)
• …

Term-Based
• Block Distance
• Euclidean Distance
• Cosine Similarity
• Jaccard Similarity
• Dice's Coefficient
• Tanimoto
• Tversky
• Matching Coefficient
• Overlap Coefficient

Page 8: Measures of Text  Similarity

LCS & Levenshtein Distance

• LCS (Longest Common Substring): measures the similarity between two strings by the length of the longest contiguous chain of characters that exists in both strings.

• Levenshtein: defines the distance between two strings as the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. (Also allowing the transposition of two adjacent characters gives the Damerau-Levenshtein variant.)
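Both measures on this slide can be computed with a small dynamic-programming table. The following is a minimal sketch, not a reference implementation; the function names `levenshtein` and `longest_common_substring` are chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous run of characters that occurs in both strings."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1    # extend the run ending at (i-1, j-1)
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]
```

For example, `levenshtein("kitten", "sitting")` is 3 (two substitutions and one insertion), and `longest_common_substring("similarity", "dissimilar")` is `"similar"`.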

Page 9: Measures of Text  Similarity

N-gram Distance

• An n-gram is a contiguous sub-sequence of n items (characters or words) from a given sequence of text.

• N-gram similarity algorithms compare the character or word n-grams of two strings.

• The score is computed by dividing the number of shared n-grams by the maximal number of n-grams; subtracting this similarity from 1 gives a distance.
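A minimal sketch of the n-gram score for character n-grams follows. It assumes that "maximal number of n-grams" means the larger of the two strings' n-gram counts and that repeated n-grams are counted with multiplicity; both are interpretations, and the function names are illustrative.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> list:
    """All contiguous character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_similarity(s1: str, s2: str, n: int = 2) -> float:
    """Shared n-grams divided by the larger n-gram count (assumed normalisation)."""
    g1, g2 = Counter(char_ngrams(s1, n)), Counter(char_ngrams(s2, n))
    if not g1 or not g2:
        return 0.0                          # a string shorter than n has no n-grams
    shared = sum((g1 & g2).values())        # multiset intersection
    return shared / max(sum(g1.values()), sum(g2.values()))
```

With bigrams, "night" and "nacht" share only "ht" out of four bigrams each, so `ngram_similarity("night", "nacht")` is 0.25.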

Page 10: Measures of Text  Similarity

Jaro Distance

• Jaro is based on the number and order of the common characters between two strings; it takes into account typical spelling deviations and is mainly used in the area of record linkage.

$$ d_{jaro}(s_1, s_2) = \begin{cases} 0 & \text{if } m = 0 \\ \dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise} \end{cases} $$

where:
• m is the number of matching characters. Two characters from s1 and s2, respectively, are considered matching only if they are the same and not farther apart than $\lfloor \max(|s_1|, |s_2|) / 2 \rfloor - 1$ positions.
• t is half the number of transpositions.

Page 11: Measures of Text  Similarity

Jaro-Winkler Distance

• Jaro-Winkler is an extension of Jaro distance; it uses a prefix scale which gives more favorable ratings to strings that match from the beginning for a set prefix length:

$$ d_{jaro\text{-}winkler}(s_1, s_2) = \begin{cases} d_{jaro}(s_1, s_2) & \text{if } d_{jaro}(s_1, s_2) < b_t \\ d_{jaro}(s_1, s_2) + |pref| \cdot p \cdot \left(1 - d_{jaro}(s_1, s_2)\right) & \text{otherwise} \end{cases} $$

where:
• |pref| is the length of the common prefix at the start of the strings, up to a maximum of 4 characters.
• p is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. It should not exceed 0.25, otherwise the distance can become larger than 1. The standard value for this constant in Winkler's work is p = 0.1.
• b_t is the boost threshold; in Winkler's implementation it was 0.7.

Page 12: Measures of Text  Similarity

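Example

• Given the strings Ehsan and Eihhoos we find:

m = 2 (E and h; the two ‘s’ characters are not considered matches because they lie outside the match window of ⌊7/2⌋ − 1 = 2),

|s1| = 5, |s2| = 7

The matched characters E and h appear in the same order in both strings, so there are no transpositions: t = 0

$$ d_{jaro}(s_1, s_2) = \frac{1}{3}\left(\frac{2}{5} + \frac{2}{7} + \frac{2 - 0}{2}\right) \approx 0.562 $$

To find the Jaro-Winkler score using the standard weight p = 0.1, we continue to find |pref| = 1. Applying the prefix adjustment gives:

$$ d_{jaro\text{-}winkler}(s_1, s_2) = 0.562 + 1 \cdot 0.1 \cdot (1 - 0.562) \approx 0.606 $$

Since 0.562 is below the boost threshold b_t = 0.7, an implementation that applies the threshold leaves the score at 0.562.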
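The Jaro and Jaro-Winkler definitions can be sketched in a few lines. This sketch assumes the usual match window of ⌊max(|s1|, |s2|)/2⌋ − 1 positions, greedy left-to-right matching, and the boost threshold applied exactly as in the case formula; the function and parameter names are illustrative.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro score in [0, 1]; higher means more similar."""
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)   # assumed match window
    used1, used2 = [False] * len(s1), [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Matched characters in original order; out-of-order pairs are transpositions.
    a = [c for c, u in zip(s1, used1) if u]
    b = [c for c, u in zip(s2, used2) if u]
    t = sum(x != y for x, y in zip(a, b)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1,
                 boost_threshold: float = 0.7, max_prefix: int = 4) -> float:
    """Jaro score boosted for a shared prefix, if it reaches the threshold."""
    dj = jaro(s1, s2)
    if dj < boost_threshold:
        return dj
    prefix = 0
    for x, y in zip(s1, s2):
        if x != y or prefix == max_prefix:
            break
        prefix += 1
    return dj + prefix * p * (1 - dj)
```

On the commonly used pair MARTHA / MARHTA (m = 6, t = 1, |pref| = 3), `jaro` gives about 0.944 and `jaro_winkler` about 0.961.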

Page 13: Measures of Text  Similarity

Character-Based Similarity (cont.)

• Needleman-Wunsch (used in bioinformatics to align protein or nucleotide sequences)
• Smith-Waterman (performs local sequence alignment)
• Soundex: a phonetic algorithm for indexing names by sound, as pronounced in English
• Keyboard-Key Distance
• …

Page 16: Measures of Text  Similarity

Document Representation: Term vector space

Page 17: Measures of Text  Similarity

Example: Term vector space

• Normalize
• Tokenize
• Stemming
• Stop Word Removal

Page 18: Measures of Text  Similarity

Local Term Weighting

No.  Name             Formula / Description
1    Binary           = 1 if the term exists in the document, or else 0
2    Term Frequency   = the number of occurrences of term i in document j
3    Log              =
4    Normal
5    Augnorm

• Position of the word in the phrase, sentence, paragraph or document
• Part of Speech tag (Noun, Verb, Adj, Adv, …)
• Named Entity (Person, Time, Location, …)

Page 19: Measures of Text  Similarity

Global Term Weighting

No.  Name      Formula / Description
1    Binary
2    Normal
3    IDF       where df_i is the number of documents in which term i occurs
4    GfIdf     where gf_i is the total number of times term i occurs in the whole collection
5    Entropy
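To illustrate how a local and a global weight combine, here is a minimal sketch that multiplies raw term frequency (local) by an IDF weight (global). The IDF formula is not legible in the transcript, so the common form log(N / df_i) with a natural logarithm is assumed here; `tf_idf` is an illustrative name.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # df[t]: number of documents in which term t occurs at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)                   # local weight: raw term frequency
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})  # assumed IDF
    return weights
```

A term that occurs in every document gets weight 0 under this assumed IDF, which is the intended effect of a global weight: terms that do not discriminate between documents contribute nothing.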

Page 20: Measures of Text  Similarity

Hamming & Euclidean Distance

• Block distance (also known as L1 distance, city block distance or Manhattan distance): the sum of the absolute differences between corresponding components of two vectors. In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ; for binary vectors the two coincide.

• Euclidean distance (L2 distance): the square root of the sum of squared differences between corresponding elements of the two vectors.
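The three distances named on this slide can be sketched directly from their definitions. The function names are illustrative, and the vectors are plain Python sequences of numbers.

```python
import math


def hamming(a, b) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def block_distance(u, v) -> float:
    """L1 / Manhattan distance: sum of absolute component differences."""
    return sum(abs(x - y) for x, y in zip(u, v))


def euclidean(u, v) -> float:
    """L2 distance: square root of the sum of squared component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
```

On binary term vectors such as `[1, 0, 1]` and `[0, 0, 1]`, `hamming` and `block_distance` return the same value, which is why the two names are often used interchangeably for term-based comparison.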

Page 21: Measures of Text  Similarity

Cosine Similarity Measure
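Only the title of this slide survives in the transcript. As a hedged sketch of the measure it names: cosine similarity is usually computed as the dot product of two term vectors divided by the product of their lengths. The function name and the choice to return 0 for a zero vector are assumptions.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0                          # assumed convention for empty documents
    return dot / (norm_u * norm_v)
```

Because the score depends only on direction, a document and a copy of it repeated twice (for example `[1, 2, 3]` and `[2, 4, 6]`) have similarity 1.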

Page 22: Measures of Text  Similarity

Matching & Overlap coefficient

• Simple Matching coefficient: as a distance, the number of terms (variables) in which documents (objects) s1 and s2 mismatch, divided by the total number of terms (variables):

$$ d_{SMC}(s_1, s_2) = 1 - \frac{\text{number of terms in which } s_1 \text{ and } s_2 \text{ match}}{\text{number of terms}} $$

• Overlap coefficient: a similarity measure related to the Jaccard index that measures the overlap between two sets, and is defined as the size of the intersection divided by the smaller of the sizes of the two sets:

$$ d_{overlap}(s_1, s_2) = 1 - \frac{|s_1 \cap s_2|}{\min(|s_1|, |s_2|)} $$
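A minimal sketch of both distances follows. It assumes the simple matching distance is taken over equal-length binary term vectors and the overlap distance over term sets; the function names and the value returned for an empty set are illustrative choices.

```python
def smc_distance(x, y) -> float:
    """Share of positions at which two equal-length binary vectors mismatch."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of terms")
    mismatches = sum(a != b for a, b in zip(x, y))
    return mismatches / len(x)


def overlap_distance(s1, s2) -> float:
    """1 minus intersection size over the size of the smaller set."""
    s1, s2 = set(s1), set(s2)
    if not s1 or not s2:
        return 1.0                          # assumed convention for empty sets
    return 1 - len(s1 & s2) / min(len(s1), len(s2))
```

Note that the overlap distance is 0 whenever one set is contained in the other, for example `{"a", "b"}` and `{"a", "b", "c"}`.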

Page 23: Measures of Text  Similarity

Jaccard & Sørensen–Dice Distance

• Jaccard: the Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.

$$ d_{jaccard}(s_1, s_2) = 1 - \frac{|s_1 \cap s_2|}{|s_1 \cup s_2|} $$

• Sørensen–Dice (Dice's coefficient): Sørensen's original formula was intended to be applied to presence/absence data, and is:

$$ d_{\text{Sørensen-Dice}}(s_1, s_2) = 1 - \frac{2\,|s_1 \cap s_2|}{|s_1| + |s_2|} $$
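Both distances follow directly from set operations. A minimal sketch over sets of terms; the function names and the treatment of two empty sets as distance 0 are illustrative choices.

```python
def jaccard_distance(s1, s2) -> float:
    """1 minus |intersection| / |union|."""
    s1, s2 = set(s1), set(s2)
    if not s1 and not s2:
        return 0.0                          # assumed: two empty sets are identical
    return 1 - len(s1 & s2) / len(s1 | s2)


def dice_distance(s1, s2) -> float:
    """1 minus 2 * |intersection| / (|s1| + |s2|)."""
    s1, s2 = set(s1), set(s2)
    if not s1 and not s2:
        return 0.0
    return 1 - 2 * len(s1 & s2) / (len(s1) + len(s2))
```

For `{"a", "b", "c"}` and `{"b", "c", "d"}` the Jaccard distance is 1 − 2/4 = 0.5 and the Dice distance is 1 − 4/6 ≈ 0.33; Dice always rates a partially overlapping pair as at least as similar as Jaccard does.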

Page 24: Measures of Text  Similarity

Tanimoto & Tversky Distance

• Tanimoto: Tanimoto distance is often referred to, erroneously, as a synonym for Jaccard distance.

$$ d_{Tanimoto}(s_1, s_2) = 1 - \frac{|s_1 \cap s_2|}{|s_1| + |s_2| - |s_1 \cap s_2|} $$

• Tversky: the Tversky index can be seen as a generalization of Dice's coefficient and the Tanimoto coefficient.

$$ d_{Tversky}(s_1, s_2) = 1 - \frac{|s_1 \cap s_2|}{|s_1 \cap s_2| + a\,|s_1 - s_2| + b\,|s_2 - s_1|} $$

• Setting a = b = 1 produces the Tanimoto coefficient; setting a = b = 0.5 produces Dice's coefficient.
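The relationship between the Tversky index and the other set-based coefficients can be checked numerically. A minimal sketch of the similarity form (the distance is 1 minus this value); the function name is illustrative, and the parameters a and b follow the slide.

```python
def tversky_index(s1, s2, a: float, b: float) -> float:
    """|s1 & s2| / (|s1 & s2| + a * |s1 - s2| + b * |s2 - s1|)."""
    s1, s2 = set(s1), set(s2)
    common = len(s1 & s2)
    denom = common + a * len(s1 - s2) + b * len(s2 - s1)
    return common / denom if denom else 0.0


x, y = {"a", "b", "c"}, {"b", "c", "d"}
tanimoto_like = tversky_index(x, y, 1, 1)      # equals |x & y| / |x | y|
dice_like = tversky_index(x, y, 0.5, 0.5)      # equals 2 |x & y| / (|x| + |y|)
```

Choosing a ≠ b makes the index asymmetric, which is useful when one text is treated as the reference and the other as the candidate.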

Page 26: Measures of Text  Similarity

Semantic (Knowledge-Based) Similarity

Path Length
• Simple Path Length
• Wu & Palmer
• Leacock & Chodorow

Information Content
• Resnik
• Lin
• Jiang & Conrath

Relatedness (Dictionary-based method)
• Hirst-St.Onge (HSO)
• Lesk
• vector pairs

Page 28: Measures of Text  Similarity

WordNet Similarity

[Figure: fragment of the WordNet taxonomy. Nodes: Entity, Object, Artifact, Structure, Instrumentality, Area, Conveyance, Room, Vehicle, Compartment, Car/Gondola, Car/Elevator_car, Airship, Elevator, Wheeled_vehicle, Motor_vehicle, Car/Railway_car, Car/Automobile, Caboose, Freight_car, Suspension, Train, Coupe, Sedan, Taxi, Engine, Rear_window. Edge types: IS-A (Hyponym), Has-Part (Holonym), Part-of (Meronym), Member-of (Meronym).]

Page 29: Measures of Text  Similarity

Definitions and Notation

• pathlen(c1, c2): the number of edges in the shortest path between concepts c1 and c2.

• depth(ci): the depth of a node is the length of the path to it from the global root, i.e., depth(ci) = pathlen(root, ci).

• LCS(c1, c2): the Least Common Subsumer is the lowest node in the hierarchy that is a hypernym of both c1 and c2.

• rel(c1, c2): given the semantic relatedness between two concepts c1 and c2, the relatedness rel(w1, w2) between two words w1 and w2 can be calculated as

$$ rel(w_1, w_2) = \max_{c_1 \in s(w_1),\; c_2 \in s(w_2)} rel(c_1, c_2) $$

where s(wi) is “the set of concepts in the taxonomy that are senses of word wi” (Resnik 1995).

Page 30: Measures of Text  Similarity

Path length

• Path-length based similarity:

$$ Sim_{path}(c_1, c_2) = -\log\, pathlen(c_1, c_2) $$

or

$$ Sim_{path}(c_1, c_2) = \frac{1}{pathlen(c_1, c_2) + 1} $$

Page 31: Measures of Text  Similarity

Leacock & Chodorow

• The L&Ch measure returns a score denoting how similar two word senses are, based on the shortest path that connects the senses and the maximum depth of the taxonomy in which the senses occur.

$$ Sim_{L\&Ch}(c_1, c_2) = -\log \frac{pathlen(c_1, c_2)}{2 \cdot D} $$

where D is the maximum depth of the taxonomy.

Page 32: Measures of Text  Similarity

Wu and Palmer

• The W&P measure returns a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer.

$$ Sim_{W\&P}(c_1, c_2) = \frac{2 \cdot depth(LCS(c_1, c_2))}{depth(c_1) + depth(c_2)} $$
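The path-based measures can be tried out on a small tree. The sketch below uses a hypothetical IS-A taxonomy (not the real WordNet hierarchy), takes depth(root) = 0 as in the definitions, and for Leacock & Chodorow assumes the denominator is twice the maximum depth of this toy taxonomy. All names are illustrative.

```python
import math

# Hypothetical IS-A taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "vehicle": None,
    "motor_vehicle": "vehicle",
    "wheeled_vehicle": "vehicle",
    "car": "motor_vehicle",
    "coupe": "car",
    "sedan": "car",
    "railway_car": "wheeled_vehicle",
    "caboose": "railway_car",
}


def ancestors(c):
    """Chain [c, parent(c), ..., root]."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain


def depth(c):
    return len(ancestors(c)) - 1            # depth(root) == 0


def lcs(c1, c2):
    """Least Common Subsumer: lowest node that is an ancestor of both."""
    seen = set(ancestors(c1))
    return next(c for c in ancestors(c2) if c in seen)


def pathlen(c1, c2):
    return depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2))


def sim_path(c1, c2):
    return 1 / (pathlen(c1, c2) + 1)


def sim_lch(c1, c2):
    max_depth = max(depth(c) for c in PARENT)        # assumed: depth of the whole taxonomy
    if pathlen(c1, c2) == 0:
        return math.inf                              # log(0) is undefined for identical concepts
    return -math.log(pathlen(c1, c2) / (2 * max_depth))


def sim_wup(c1, c2):
    total = depth(c1) + depth(c2)
    return 1.0 if total == 0 else 2 * depth(lcs(c1, c2)) / total
```

In this toy tree, coupe and sedan meet at car (path length 2), while coupe and caboose meet only at the root (path length 6), so every measure ranks the first pair as more similar.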

Page 34: Measures of Text  Similarity

Information Content

• P(c) = probability of seeing a concept of type c in a large corpus = probability of seeing instances of that concept:

$$ P(c) = \frac{\sum_{w \in words(c)} count(w)}{N} $$

• where words(c) is the set of words subsumed by concept c, and N is the number of words in the corpus that are also in the thesaurus.

• P(root) = 1, since all words are subsumed by the root concept.

• The lower a concept is in the hierarchy, the lower its probability.

Page 35: Measures of Text  Similarity

Information Content (cont.)

Train probabilities by counting in a corpus: each word counts as an occurrence of all concepts “containing” it

Page 36: Measures of Text  Similarity

Information Content (cont.)

• Information content of a concept:

$$ IC(concept) = -\log(P(concept)) $$

Page 37: Measures of Text  Similarity

Information Content (cont.)

[Figure: the same WordNet taxonomy fragment as on the WordNet Similarity slide (Entity down to Coupe, Sedan, Taxi, Caboose and Freight_car), with IS-A (Hyponym), Has-Part (Holonym), Part-of (Meronym) and Member-of (Meronym) relations.]

e.g. corpus size = 10000 words (logarithms to base 10):

• IC(vehicle) = -log(75/10000) = 2.12
• IC(caboose) = -log(10/10000) = 3
• IC(freight car) = -log(1/10000) = 4
• IC(coupe) = -log(14/10000) = 2.85
• IC(sedan) = -log(16/10000) = 2.80
• IC(taxi) = -log(34/10000) = 2.47
• …
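The numbers in this example can be reproduced with a few lines. The sketch assumes base-10 logarithms (which is what the values 3 and 4 imply) and assumes that vehicle subsumes exactly the five leaf words listed, whose counts add up to 75; the names are illustrative.

```python
import math

N = 10_000                                  # corpus size from the example
LEAF_COUNTS = {"caboose": 10, "freight_car": 1, "coupe": 14, "sedan": 16, "taxi": 34}


def information_content(count: int, n: int = N) -> float:
    """IC = -log10(P), with P estimated as count / n."""
    return -math.log10(count / n)


# Each word counts as an occurrence of every concept containing it, so the
# count of a parent concept is the sum over the words it subsumes (assumed set).
vehicle_count = sum(LEAF_COUNTS.values())
vehicle_ic = information_content(vehicle_count)
```

Rare concepts low in the hierarchy (freight car, IC = 4) carry more information than frequent or general ones (vehicle, IC ≈ 2.12).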

Page 38: Measures of Text  Similarity

Resnik

• Resnik is equal to the information content (IC) of the Least Common Subsumer (most informative subsumer). This means that the value will always be greater than or equal to zero. The upper bound on the value is generally quite large and varies depending upon the size of the corpus used to determine information content values.

$$ Sim_{Resnik}(c_1, c_2) = IC(LCS(c_1, c_2)) $$

Page 39: Measures of Text  Similarity

Lin & Jiang-Conrath

• Lin: the Lin measure scales the information content of the Least Common Subsumer by the sum of the information content of concepts c1 and c2 themselves.

$$ Sim_{Lin}(c_1, c_2) = \frac{2 \cdot IC(LCS(c_1, c_2))}{IC(c_1) + IC(c_2)} $$

• Jiang-Conrath: takes the difference between this sum and twice the information content of the Least Common Subsumer.

$$ d_{JC}(c_1, c_2) = IC(c_1) + IC(c_2) - 2 \cdot IC(LCS(c_1, c_2)) $$

$$ Sim_{JC}(c_1, c_2) = \frac{1}{d_{JC}(c_1, c_2)} $$
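The three information-content measures differ only in how they combine IC(c1), IC(c2) and IC(LCS). The sketch below reuses counts from the information-content example (corpus size 10000, base-10 logarithms) and assumes, purely for illustration, that vehicle is the Least Common Subsumer of coupe and sedan. All names are hypothetical.

```python
import math

N = 10_000
COUNT = {"vehicle": 75, "coupe": 14, "sedan": 16}
# Assumed for illustration: the LCS of coupe and sedan is vehicle.
LCS_TABLE = {frozenset(("coupe", "sedan")): "vehicle"}


def ic(c):
    return -math.log10(COUNT[c] / N)


def lcs(c1, c2):
    return c1 if c1 == c2 else LCS_TABLE[frozenset((c1, c2))]


def sim_resnik(c1, c2):
    return ic(lcs(c1, c2))


def sim_lin(c1, c2):
    return 2 * ic(lcs(c1, c2)) / (ic(c1) + ic(c2))


def dist_jc(c1, c2):
    return ic(c1) + ic(c2) - 2 * ic(lcs(c1, c2))


def sim_jc(c1, c2):
    d = dist_jc(c1, c2)
    return math.inf if d == 0 else 1 / d    # identical concepts have distance 0
```

Under these assumptions Resnik gives about 2.12 for coupe and sedan, Lin about 0.75, and the Jiang-Conrath distance is about 1.40. Lin is bounded by 1, which it reaches for identical concepts, whereas Resnik is unbounded above.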

Page 41: Measures of Text  Similarity

Lesk

• The Lesk algorithm is used for word sense disambiguation.

• Dictionary-based method: makes use of glosses, a property of dictionaries.

• Lesk: two concepts/senses are similar if their glosses contain overlapping words.

$$ Sim_{Lesk}(c_1, c_2) = overlap(gloss(c_1), gloss(c_2)) $$

Page 42: Measures of Text  Similarity

Extended Lesk

• Extended Lesk measure (extended gloss overlap): two concepts/senses are similar if their glosses, and the glosses of related concepts, contain overlapping words.

• Let RELs be the set of possible WordNet relations whose glosses we compare.

• For each n-word phrase seen in both glosses, eLesk adds n²; longer overlaps are rare, and should be weighted more heavily.

$$ Sim_{eLesk}(c_1, c_2) = \sum_{r,\, q \in RELs} overlap(gloss(r(c_1)), gloss(q(c_2))) $$

Page 43: Measures of Text  Similarity

Extended Lesk (Example)

• drawing paper: paper that is specially prepared for use in drafting

• decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface

The glosses share the one-word phrase "paper" and the two-word phrase "specially prepared":

overlap(decal, drawing paper) = 1² + 2² = 5

If considering hyponyms only:

$$ Sim_{eLesk}(c_1, c_2) = overlap(gloss(c_1), gloss(c_2)) + overlap(gloss(hypo(c_1)), gloss(c_2)) + overlap(gloss(c_1), gloss(hypo(c_2))) + overlap(gloss(hypo(c_1)), gloss(hypo(c_2))) $$
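The gloss overlap score from this example can be sketched as a greedy procedure: repeatedly find the longest phrase shared by both glosses, add the square of its length, and remove it from further consideration. This is a simplified sketch (whitespace tokenisation, lowercasing, no stop-word handling), and the function names are illustrative.

```python
_USED_A, _USED_B = object(), object()       # sentinels marking consumed words


def _longest_common_phrase(a, b):
    """Return (length, start_in_a, start_in_b) of the longest shared word run."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best


def overlap(gloss1: str, gloss2: str) -> int:
    """Sum of n**2 over the shared n-word phrases, longest phrases first."""
    a, b = gloss1.lower().split(), gloss2.lower().split()
    score = 0
    while True:
        n, i, j = _longest_common_phrase(a, b)
        if n == 0:
            return score
        score += n * n
        a[i:i + n] = [_USED_A] * n          # distinct sentinels never match each other
        b[j:j + n] = [_USED_B] * n


drawing_paper = "paper that is specially prepared for use in drafting"
decal = ("the art of transferring designs from specially prepared paper "
         "to a wood or glass or metal surface")
score = overlap(decal, drawing_paper)
```

For the two glosses above the procedure first consumes "specially prepared" (2² = 4) and then "paper" (1² = 1), giving a score of 5.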

Page 44: Measures of Text  Similarity

Hirst-St.Onge (HSO)

• The HSO measure works by finding lexical chains linking the two word senses.

$$ Sim_{HSO}(c_1, c_2) = C - pathlen(c_1, c_2) - k \cdot turns(c_1, c_2) $$

• where C and k are constants (in practice, they used C = 8 and k = 1), and turns(c1, c2) is the number of times the path between c1 and c2 changes direction.

Page 45: Measures of Text  Similarity

vector pairs

• The vector measure creates a co-occurrence matrix for each word used in the WordNet glosses from a given corpus, and then represents each gloss/concept with a vector that is the average of these co-occurrence vectors.

$$ Sim_{vector}(c_1, c_2) = \cos(angle(\vec{v}_1, \vec{v}_2)) = \frac{\vec{v}_1 \cdot \vec{v}_2}{|\vec{v}_1|\,|\vec{v}_2|} $$

• where c1 and c2 are the two given concepts, v1 and v2 are the gloss vectors corresponding to the concepts, and angle returns the angle between the vectors.

Page 47: Measures of Text  Similarity

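Statistical (Corpus-Based) Similarity

Corpus-based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora:

• PMI (Pointwise Mutual Information)
• LSA (Latent Semantic Analysis)
• ESA (Explicit Semantic Analysis)
• pLSI (probabilistic Latent Semantic Indexing)
• NMF (Non-negative Matrix Factorization)
• LDA (Latent Dirichlet Allocation)
• DISCO
• …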

Page 48: Measures of Text  Similarity

Thanks For Your Attention