Document Similarity Measures


Page 1: Document Similarity Measures

Document Similarity Measures

Content:

1. Knowledge-based word semantic similarity
   1. Shortest path similarity
   2. Leacock & Chodorow similarity
   3. Lesk similarity
   4. Wu & Palmer (Wu and Palmer, 1994) similarity metric
   5. Resnik (Resnik, 1995) information content based measure
   6. Measure introduced by Lin (Lin, 1998)
   7. Jiang & Conrath (Jiang and Conrath, 1997) measure of similarity
   8. Hirst & St. Onge (Hirst and St-Onge, 1998) measure of similarity
2. Corpus-based measures of semantic similarity
   1. Pointwise Mutual Information
   2. Normalized Google Similarity Distance
   3. Explicit Semantic Analysis

Page 2: Document Similarity Measures

Introduction

Text similarity (uses):

• Information retrieval (query vs. document)
• Text classification (document vs. category)
• Word-sense disambiguation (context vs. context)
• Automatic evaluation of machine translation (gold standard vs. generated)
• Text summarization (summary vs. original)

Word Similarity (Introduction): Words can be similar if:

• They mean the same thing (synonyms)
• They mean the opposite (antonyms)
• They are used in the same way (red, green)
• They are used in the same context (doctor, hospital, scalpel)
• One is a type of another (poodle, dog, mammal)

Page 3: Document Similarity Measures

Knowledge-based word semantic similarity

Definition: A knowledge base such as Wikipedia or an encyclopedia, an annotated resource such as WordNet, or a large text corpus is used to achieve this kind of semantic similarity. We first identify the relationship(s) among the words of a given document with the help of the resources above, and then apply a metric or procedure to calculate the semantic relations between word pairs.

E.g., Identifying semantic relations between words using WordNet:

Figure: WordNet-like Hierarchy

Page 4: Document Similarity Measures

Knowledge-based word semantic similarity

The shortest path similarity is determined as:

Sim_path = 1 / length

where length is the length of the shortest path between two concepts using node-counting (including the end nodes).

The Leacock & Chodorow similarity (Leacock and Chodorow, 1998) is determined as:

Sim_lch = -log( length / (2 * D) )

where length is the length of the shortest path between two concepts using node-counting, and D is the maximum depth of the taxonomy.
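The two path-based formulas above can be sketched on a toy IS-A hierarchy; the concepts and links below are invented stand-ins for WordNet, for illustration only:

```python
import math

# Hypothetical toy IS-A hierarchy (child -> parent), a stand-in for WordNet.
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "poodle": "dog",
    "cat": "animal",
}

def path_to_root(concept):
    """Concepts on the path from `concept` up to the root, inclusive."""
    path = [concept]
    while PARENT[concept] is not None:
        concept = PARENT[concept]
        path.append(concept)
    return path

def shortest_path_length(c1, c2):
    """Node-counting length of the shortest path between two concepts."""
    up1 = {c: i for i, c in enumerate(path_to_root(c1))}
    for j, c in enumerate(path_to_root(c2)):
        if c in up1:               # first shared concept = least common subsumer
            return up1[c] + j + 1  # nodes on both sides plus the LCS itself
    raise ValueError("no common ancestor")

def sim_path(c1, c2):
    return 1.0 / shortest_path_length(c1, c2)

def sim_lch(c1, c2, max_depth):
    return -math.log(shortest_path_length(c1, c2) / (2.0 * max_depth))
```

Here the path poodle-dog-animal-cat has node-counting length 4, so Sim_path = 0.25, and with maximum taxonomy depth 4, Sim_lch = -log(4/8) = log 2.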

Lesk similarity: The Lesk similarity of two concepts is defined as a function of the overlap between the corresponding definitions, as provided by a dictionary. It is based on an algorithm proposed by Lesk (1986) as a solution for word sense disambiguation.
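A minimal sketch of the Lesk overlap idea; the glosses and stopword list below are invented, while a real implementation would use full dictionary (e.g. WordNet) definitions:

```python
# Glosses below are invented; a real Lesk implementation uses dictionary
# definitions and counts overlapping content words between them.
STOPWORDS = {"a", "an", "the", "of", "to", "and", "that", "is"}

def lesk_overlap(gloss1, gloss2):
    """Similarity = number of shared non-stopword tokens between two glosses."""
    toks1 = {w for w in gloss1.lower().split() if w not in STOPWORDS}
    toks2 = {w for w in gloss2.lower().split() if w not in STOPWORDS}
    return len(toks1 & toks2)

bank_money = "a financial institution that accepts deposits of money"
bank_river = "sloping land beside a body of water"
vault = "a strongroom where money and valuables are kept by a financial institution"
# The money sense of "bank" overlaps more with "vault" than the river sense does.
```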

Page 5: Document Similarity Measures

Knowledge-based word semantic similarity

The Wu & Palmer (Wu and Palmer, 1994) similarity metric measures the depth of two given concepts in the WordNet taxonomy, and the depth of the least common subsumer (LCS), and combines these figures into a similarity score:

Sim_wup = 2 * depth(LCS) / ( depth(concept1) + depth(concept2) )

The measure introduced by Resnik (Resnik, 1995) returns the information content (IC) of the LCS of two concepts:

Sim_res = IC(LCS)

where IC is defined as:

IC(c) = -log P(c)

and P(c) is the probability of encountering an instance of concept c in a large corpus.
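The Wu & Palmer and Resnik formulas reduce to simple arithmetic once depths and probabilities are known; the depth table and probability below are hypothetical (real values come from WordNet and corpus counts):

```python
import math

# Hypothetical node-counting depths, with the root at depth 1 (assumption);
# in practice these come from the WordNet taxonomy.
DEPTH = {"entity": 1, "animal": 2, "dog": 3, "poodle": 4, "cat": 3}

def sim_wup(depth_c1, depth_c2, depth_lcs):
    """Wu & Palmer: LCS depth normalized by the two concept depths."""
    return 2.0 * depth_lcs / (depth_c1 + depth_c2)

def information_content(p_c):
    """IC(c) = -log P(c), with P(c) estimated from a large corpus."""
    return -math.log(p_c)

def sim_res(p_lcs):
    """Resnik: the information content of the least common subsumer."""
    return information_content(p_lcs)

# poodle vs. cat: their LCS in the toy hierarchy is "animal"
wup = sim_wup(DEPTH["poodle"], DEPTH["cat"], DEPTH["animal"])  # 2*2/(4+3)
res = sim_res(0.2)  # hypothetical P(animal) = 0.2, so IC = -log 0.2 = log 5
```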

Page 6: Document Similarity Measures

Knowledge-based word semantic similarity

The measure introduced by Lin (Lin, 1998) builds on Resnik’s measure of similarity, and adds a normalization factor consisting of the information content of the two input concepts:

Sim_lin = 2 * IC(LCS) / ( IC(concept1) + IC(concept2) )

Jiang & Conrath (Jiang and Conrath, 1997) measure of similarity:

Sim_jnc = 1 / ( IC(concept1) + IC(concept2) - 2 * IC(LCS) )
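Both IC-based measures are a few arithmetic operations once the information-content values are known; a sketch with purely illustrative IC values:

```python
def sim_lin(ic1, ic2, ic_lcs):
    """Lin (1998): Resnik's IC(LCS) normalized by the concepts' own IC."""
    return 2.0 * ic_lcs / (ic1 + ic2)

def sim_jnc(ic1, ic2, ic_lcs):
    """Jiang & Conrath (1997): inverse of the IC-based distance."""
    return 1.0 / (ic1 + ic2 - 2.0 * ic_lcs)

# Hypothetical values: IC(concept1) = 5, IC(concept2) = 4, IC(LCS) = 3
lin = sim_lin(5.0, 4.0, 3.0)  # 2*3 / (5+4)
jnc = sim_jnc(5.0, 4.0, 3.0)  # 1 / (5+4-2*3)
```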

The Hirst & St. Onge (Hirst and St-Onge, 1998) measure of similarity: It determines the similarity strength of a pair of synsets by detecting lexical chains between the pair in a text using the WordNet hierarchy.

Page 7: Document Similarity Measures

Corpus-Based Measures of semantic similarity

Important points

Corpus-based measures differ from knowledge-based methods in that they do not require any encoded understanding of either the vocabulary or the grammar of a text’s language.

In many of the scenarios where Corpus-based measures would be advantageous, robust language-specific resources (e.g. WordNet) may not be available.

Thus, state-of-the-art corpus-based measures may be the only available approach in languages with scarce resources.

Initial setting for corpus based measure of semantic similarity:

Use a vector representation of each text in an N-dimensional space, where N is the number of unique words in the pair of texts.

Each of the two texts can be treated like a vector in this N-dimensional space.

The distance between the two vectors is an indication of the similarity of the two texts.

Page 8: Document Similarity Measures

Corpus-Based Measures of semantic similarity

Point-wise Mutual Information (PMI): It is a common approach to quantifying relatedness. Here we discuss three ways to measure term relatedness using PMI.

(1) The first is based on Wikipedia page counts:

pmi_p(t_i, t_j) = log2( N * p(i, j) / ( p(i) * p(j) ) )

where p(i, j) is the number of Wikipedia articles containing both terms t_i and t_j, p(i) is the number of articles which contain t_i, and N is the total number of Wikipedia articles.

(2) The second is based on term counts in Wikipedia articles:

pmi_t(t_i, t_j) = log2( T * t(i, j) / ( t(i) * t(j) ) )

where T is the number of terms in Wikipedia, t(i, j) is the number of times t_i and t_j occur adjacently in Wikipedia, and t(i) is the number of occurrences of t_i in Wikipedia.

(3) The third is a combination of the two PMI measures above:

pmi_c(t_i, t_j) = log2( N * pt(i, j) / ( p(i) * p(j) ) )

where pt(i, j) is the number of Wikipedia articles containing t_i and t_j as adjacent terms. Since pt(i, j) <= p(i, j), it is obvious that pmi_c(t_i, t_j) <= pmi_p(t_i, t_j), and pmi_c(t_i, t_j) is stricter and more accurate for measuring relatedness.
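The three PMI variants can be sketched directly from their formulas; every count below is invented for illustration, not a real Wikipedia statistic:

```python
import math

# All counts below are hypothetical.
N = 1_000_000               # total number of Wikipedia articles
T = 500_000_000             # total number of terms in Wikipedia
p_i, p_j = 2000, 3000       # articles containing t_i, t_j respectively
p_ij = 600                  # articles containing both t_i and t_j
t_i, t_j = 90_000, 120_000  # occurrences of the terms themselves
t_ij = 450                  # adjacent occurrences "t_i t_j"
pt_ij = 400                 # articles where t_i and t_j occur adjacently

pmi_p = math.log2(N * p_ij / (p_i * p_j))   # (1) page-count PMI
pmi_t = math.log2(T * t_ij / (t_i * t_j))   # (2) term-count PMI
pmi_c = math.log2(N * pt_ij / (p_i * p_j))  # (3) combined, stricter PMI

# pt_ij <= p_ij guarantees pmi_c <= pmi_p, as noted above.
assert pmi_c <= pmi_p
```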

Page 9: Document Similarity Measures

Corpus-Based Measures of semantic similarity

Normalized Google Similarity Distance (NGD) is a measure of similarity between terms, proposed by Cilibrasi and Vitanyi (2007), based on information distance and Kolmogorov complexity. It can be applied to compute term similarity from the World Wide Web, or any large enough corpus, using the page counts of terms.

Google distance is a semantic similarity measure derived from the number of hits returned by the Google search engine for a given set of keywords. Keywords with the same or similar meanings in a natural language sense tend to be "close" in units of Google distance, while words with dissimilar meanings tend to be farther apart.

NGD(x, y) = ( max{ log f(x), log f(y) } - log f(x, y) ) / ( log M - min{ log f(x), log f(y) } )

where M is the total number of web pages searched by Google; f(x) and f(y) are the numbers of hits for search terms x and y, respectively; and f(x, y) is the number of web pages on which both x and y occur.

If the two search terms x and y never occur together on the same web page, but do occur separately, the normalized Google distance between them is infinite. If both terms always occur together, their NGD is zero.
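A minimal sketch of the NGD formula; the hit counts below are invented, not real search-engine counts:

```python
import math

def ngd(f_x, f_y, f_xy, M):
    """Normalized Google Distance from page counts (Cilibrasi & Vitanyi, 2007)."""
    lx, ly, lxy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(lx, ly) - lxy) / (math.log(M) - min(lx, ly))

# Hypothetical hit counts for two related terms on an M-page index.
d = ngd(f_x=46_000_000, f_y=12_000_000, f_xy=4_000_000, M=8 * 10**9)

# When the two terms always co-occur (all three counts coincide),
# the numerator vanishes and the distance is zero.
```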

Page 10: Document Similarity Measures

Corpus-Based Measures of semantic similarity

Cosine Similarity

Example:
T1 = 2W1 + 3W2 + 5W3
T2 = 3W1 + 7W2 + W3

cos θ = T1 · T2 / ( |T1| * |T2| ) = 0.6758

Figure: T1 and T2 plotted as vectors in the three-dimensional space spanned by W1, W2, and W3.

Page 11: Document Similarity Measures


Cosine Similarity: Example

Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas.

The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph.

"There is no need for alarm," Civil Defense Director Eugenio Cabral said in a television alert shortly before midnight Saturday.

Cabral said residents of the province of Barahona should closely follow Gilbert's movement.

An estimated 100,000 people live in the province, including 70,000 in the city of Barahona, about 125 miles west of Santo Domingo.

Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a hurricane Saturday night.

The National Hurricane Center in Miami reported its position at 2 a.m. Sunday at latitude 16.1 north, longitude 67.5 west, about 140 miles south of Ponce, Puerto Rico, and 200 miles southeast of Santo Domingo.

The National Weather Service in San Juan, Puerto Rico, said Gilbert was moving westward at 15 mph with a "broad area of cloudiness and heavy weather" rotating around the center of the storm.

The weather service issued a flash flood watch for Puerto Rico and the Virgin Islands until at least 6 p.m. Sunday.

Strong winds associated with Gilbert brought coastal flooding, strong southeast winds and up to 12 feet to Puerto Rico's south coast.

Page 12: Document Similarity Measures


Cosine Similarity: Example (Document Vectors for selected terms)

• Document 1 – Gilbert: 3, Hurricane: 2, Rains: 1, Storm: 2, Winds: 2

• Document 2 – Gilbert: 2, Hurricane: 1, Rains: 0, Storm: 1, Winds: 2

Cosine Similarity: 0.9439
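Both cosine examples (pages 10 and 12) can be checked with a direct implementation of the formula:

```python
import math

def cosine(u, v):
    """cos θ = u·v / (|u| * |v|) for dense term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Page 10 example: T1 = 2W1 + 3W2 + 5W3, T2 = 3W1 + 7W2 + W3
sim1 = cosine([2, 3, 5], [3, 7, 1])              # ≈ 0.6758

# Page 12 example: counts for (Gilbert, Hurricane, Rains, Storm, Winds)
sim2 = cosine([3, 2, 1, 2, 2], [2, 1, 0, 1, 2])  # ≈ 0.9439
```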

Page 13: Document Similarity Measures

Explicit Semantic Analysis

Important Points:

1. Determine the extent to which each word is associated with every concept of Wikipedia via term frequency or some other method.

2. For a text, sum up the associated concept vectors for a composite text concept vector.

3. Compare the texts using a standard cosine similarity or other vector similarity measure.

4. Advantage: The vectors can be analyzed and tweaked because they are closely tied to Wikipedia concepts.

A sample system setup to calculate an explicit semantic relatedness score. (ref: Evgeniy Gabrilovich, Shaul Markovitch: Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. IJCAI 2007: 1606-1611)
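Steps 1-3 above can be sketched with a tiny, entirely hypothetical word-to-concept table; real ESA derives TF-IDF weights for every word over all Wikipedia concepts:

```python
import math

# Invented word-to-concept weights standing in for step 1's association scores.
CONCEPT_VECTORS = {
    "hurricane": {"Tropical cyclone": 0.9, "Weather": 0.5},
    "storm":     {"Tropical cyclone": 0.6, "Weather": 0.7},
    "bank":      {"Bank (finance)": 0.8, "River": 0.3},
}

def text_vector(words):
    """Step 2: sum the concept vectors of the words in a text."""
    vec = {}
    for w in words:
        for concept, weight in CONCEPT_VECTORS.get(w, {}).items():
            vec[concept] = vec.get(concept, 0.0) + weight
    return vec

def cosine(u, v):
    """Step 3: cosine similarity of two sparse concept vectors."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

sim = cosine(text_vector(["hurricane", "storm"]), text_vector(["storm"]))
```

Texts sharing Wikipedia-like concepts score high, while texts mapped to disjoint concepts score zero, which is what makes the vectors inspectable and tweakable.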

Page 14: Document Similarity Measures

Reference

• Michael Mohler, Rada Mihalcea: Text-to-Text Semantic Similarity for Automatic Short Answer Grading. EACL 2009: 567-575.

• Evgeniy Gabrilovich, Shaul Markovitch: Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. IJCAI 2007: 1606-1611.

• Claudia d'Amato, Steffen Staab, Nicola Fanizzi: On the Influence of Description Logics Ontologies on Conceptual Similarity. EKAW 2008: 48-63.

• Similarity-based Learning Methods for the Semantic Web (http://www.di.uniba.it/~cdamato/PhDThesis_dAmato.pdf)