10/22/2015 ACM WIDM'2005 1 Semantic Similarity Methods in WordNet and Their Application to Information...

06/16/22 ACM WIDM'2005 1 Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web Giannis Varelas Epimenidis Voutsakis Paraskevi Raftopoulou Euripides G.M. Petrakis Evangelos Milios

04/20/23 ACM WIDM'2005 1

Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web

Giannis Varelas, Epimenidis Voutsakis, Paraskevi Raftopoulou, Euripides G.M. Petrakis, Evangelos Milios

Semantic Similarity

Semantic similarity relates to computing the conceptual similarity between terms that are not lexicographically similar, e.g., “car” and “automobile”

Map two terms to an ontology and compute their relationship in that ontology

Objectives

We investigate several semantic similarity methods and we evaluate their performance: http://www.ece.tuc.gr/similarity

We propose the Semantic Similarity Retrieval Model (SSRM) for computing similarity between documents containing semantically similar but not necessarily lexicographically similar terms: http://www.ece.tuc.gr/intellisearch

Ontologies

Tools of information representation on a subject

Hierarchical categorization of terms from general to most specific terms
    object → artifact → construction → stadium

Domain ontologies representing knowledge of a domain, e.g., the MeSH medical ontology

General ontologies representing common sense knowledge about the world, e.g., WordNet

WordNet

A vocabulary and a thesaurus offering a hierarchical categorization of natural language terms

More than 100,000 terms

An ontology of natural language terms

Nouns, verbs, adjectives and adverbs are grouped into synonym sets (synsets)

Synsets represent terms or concepts
    stadium, bowl, arena, sports stadium – (a large structure for open-air sports or entertainments)

WordNet Hierarchies

The synsets are also organized into senses

Senses: different meanings of the same term

The synsets are related to other synsets higher or lower in the hierarchy by different types of relationships, e.g.
    Hyponym/Hypernym (Is-A relationships)
    Meronym/Holonym (Part-Of relationships)

Nine noun and several verb Is-A hierarchies

A Fragment of the WordNet Is-A Hierarchy

Semantic Similarity Methods

Map terms to an ontology and compute their relationship in that ontology

Four main categories of methods:
    Edge counting: path length between terms
    Information content: as a function of their probability of occurrence in a corpus
    Feature based: similarity between their properties (e.g., definitions) or based on their relationships to other similar terms
    Hybrid: combine the above ideas
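The edge counting category can be illustrated with a minimal Python sketch. The taxonomy below is a hypothetical hand-made Is-A fragment (not real WordNet data), and `path_length` and `wu_palmer` are illustrative names. The second function follows the commonly cited Wu and Palmer style formula; it is an assumption about what such a method computes, not the implementation evaluated in these slides.

```python
# Hypothetical Is-A fragment (child -> parent); an assumption for illustration only.
PARENT = {
    "artifact": "object",
    "construction": "artifact",
    "stadium": "construction",
    "conveyance": "artifact",
    "vehicle": "conveyance",
    "car": "vehicle",
}


def ancestors(term):
    """Return the chain [term, parent, ..., root]."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_length(a, b):
    """Edge counting: number of Is-A edges on the path between a and b."""
    index_in_b = {t: i for i, t in enumerate(ancestors(b))}
    for steps_from_a, t in enumerate(ancestors(a)):
        if t in index_in_b:
            # t is the most specific common subsumer of a and b.
            return steps_from_a + index_in_b[t]
    raise ValueError("terms are not in the same hierarchy")


def wu_palmer(a, b):
    """2 * depth(lcs) / (depth(a) + depth(b)), counting the root as depth 1."""
    common = set(ancestors(b))
    lcs = next(t for t in ancestors(a) if t in common)
    depth = lambda t: len(ancestors(t))
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(path_length("car", "stadium"))
print(round(wu_palmer("car", "stadium"), 3))
```

Unlike the plain path length, the second score also depends on how deep the common subsumer sits, so two terms that are equally far apart score higher when they are lower in the hierarchy.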

Example

Edge counting: the distance between “conveyance” and “ceramic” is 2

An information content method would associate the two terms with their common subsumer and with their probabilities of occurrence in a corpus
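The information content idea can be sketched in Python as follows. The taxonomy and the corpus counts are made up for illustration, and the two scores follow the commonly cited Resnik and Lin formulations; the function names are hypothetical and nothing here reproduces the implementation behind the slides.

```python
import math

# Hypothetical Is-A fragment and corpus counts (assumptions for illustration only).
PARENT = {
    "artifact": "object",
    "conveyance": "artifact",
    "ceramic": "artifact",
    "car": "conveyance",
}
COUNT = {"object": 10, "artifact": 20, "conveyance": 30, "ceramic": 15, "car": 25}


def ancestors(term):
    """Return the chain [term, parent, ..., root]."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def cumulative_count(term):
    """Occurrences of the term plus occurrences of every term below it."""
    return sum(c for t, c in COUNT.items() if term in ancestors(t))


TOTAL = cumulative_count("object")


def ic(term):
    """Information content: -log of the probability of meeting the concept."""
    return -math.log(cumulative_count(term) / TOTAL)


def common_subsumer(a, b):
    """Most specific term that subsumes both a and b."""
    common = set(ancestors(b))
    return next(t for t in ancestors(a) if t in common)


def resnik(a, b):
    return ic(common_subsumer(a, b))


def lin(a, b):
    return 2 * resnik(a, b) / (ic(a) + ic(b))


print(common_subsumer("conveyance", "ceramic"))
print(round(resnik("conveyance", "ceramic"), 4), round(lin("conveyance", "ceramic"), 4))
```

A frequent, general subsumer carries little information, so both scores stay low for this pair even though the two terms are only two edges apart.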

Semantic Similarity on WordNet

The most popular methods are evaluated

All methods are applied to a set of 38 term pairs

Their similarity values are correlated with scores obtained by humans

The higher the correlation of a method with the human scores, the better the method is
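A minimal sketch of this evaluation step, assuming Pearson's correlation coefficient (the slides do not name the coefficient). The scores below are invented for four term pairs and are not data from the evaluation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Invented scores for four term pairs (assumption, for illustration only).
human_scores = [3.9, 3.1, 1.2, 0.4]
method_scores = [0.95, 0.80, 0.35, 0.10]

print(round(pearson(human_scores, method_scores), 3))
```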

Evaluation

Method            Type            Correlation
Rada 1989         Edge Counting   0.59
Wu 1994           Edge Counting   0.74
Li 2003           Edge Counting   0.82
Leacock 1998      Edge Counting   0.82
Richardson 1994   Edge Counting   0.63
Resnik 1999       Info. Content   0.79
Lin 1993          Info. Content   0.82
Lord 2003         Info. Content   0.79
Jiang 1998        Info. Content   0.83
Tversky 1977      Feature Based   0.73
Rodriguez 2003    Hybrid          0.71

Observations

Edge counting/Info. Content methods work by exploiting structure information

Good methods take the position of the terms into account
    Higher similarity for terms which are close together but lower in the hierarchy, e.g., [Li et al. 2003]

Information Content is measured on WordNet rather than on a corpus [Seco2002]

Similarity only for nouns and verbs
    No taxonomic structure for other parts of speech
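For reference, measuring information content on WordNet itself is usually written as below. This is the formulation commonly attributed to Seco et al., stated here as an assumption about what the slide cites; $hypo(c)$ denotes the number of hyponyms of concept $c$ and $max_{wn}$ the total number of concepts in the taxonomy.

```latex
IC_{wn}(c) = 1 - \frac{\log\bigl(hypo(c) + 1\bigr)}{\log\bigl(max_{wn}\bigr)}
```

Under this definition a leaf concept, which has no hyponyms, gets the maximum information content of 1, while the root gets 0, and no corpus statistics are needed.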

http://www.ece.tuc.gr/similarity

Semantic Similarity Retrieval Model (SSRM)

Classic retrieval models retrieve documents with the same query terms

SSRM will retrieve documents which also contain semantically similar terms

Queries and documents are initially assigned tf-idf weights

$q = (q_1, q_2, \ldots, q_N)$, $d = (d_1, d_2, \ldots, d_N)$
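A minimal Python sketch of how such weight vectors could be built. The slides do not say which tf-idf variant is used, so the normalized term frequency and the plain logarithmic idf below are assumptions, and `tfidf_vectors` is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        max_tf = max(tf.values())
        vectors.append(
            {t: (f / max_tf) * math.log(n_docs / df[t]) for t, f in tf.items()}
        )
    return vectors


docs = [["car", "race", "car"], ["automobile", "race"]]
for vector in tfidf_vectors(docs):
    print(vector)
```

In this toy collection "car" and "automobile" end up in different dimensions, which is exactly the situation where a model that matches only identical terms finds no overlap between them.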

SSRM

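I. Query term re-weighting: similar terms reinforce each other

$$q_i' = q_i + \sum_{j \neq i,\; sim(i,j) \geq t} q_j \, sim(i,j)$$

II. Query term expansion with synonyms and similar terms

$$q_i' = q_i + \sum_{j \in Q,\; j \neq i,\; sim(i,j) \geq T} \frac{1}{n} \, q_j \, sim(i,j)$$

III. Document similarity

$$Sim(q,d) = \frac{\sum_i \sum_j q_i \, d_j \, sim(i,j)}{\sum_i \sum_j q_i \, d_j}$$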
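The three SSRM steps can be sketched in Python as below. This is a sketch under stated assumptions, not the authors' implementation: the threshold defaults are placeholders, `n` is taken to be the number of terms added for each query term, the `candidates` mapping stands in for an ontology lookup, and the toy `sim` table is invented.

```python
# Invented term similarities (assumption); a real system would query an ontology.
SIM = {
    frozenset(("car", "automobile")): 1.0,
    frozenset(("car", "train")): 0.5,
}


def sim(a, b):
    return 1.0 if a == b else SIM.get(frozenset((a, b)), 0.0)


def reweight(query, sim, t=0.8):
    """Step I: each query term gains weight from other query terms with sim >= t."""
    return {
        i: q_i + sum(q_j * sim(i, j) for j, q_j in query.items()
                     if j != i and sim(i, j) >= t)
        for i, q_i in query.items()
    }


def expand(query, candidates, sim, T=0.9):
    """Step II: add candidate terms whose similarity to a query term is >= T."""
    expanded = dict(query)
    for j, q_j in query.items():
        similar = [i for i in candidates.get(j, []) if i != j and sim(i, j) >= T]
        n = len(similar)  # assumption: n counts the terms added for query term j
        for i in similar:
            # A new term starts from weight 0; an existing one keeps its q_i.
            expanded[i] = expanded.get(i, 0.0) + q_j * sim(i, j) / n
    return expanded


def ssrm_similarity(query, doc, sim):
    """Step III: similarity-weighted sum over all query/document term pairs."""
    num = sum(q_i * d_j * sim(i, j)
              for i, q_i in query.items() for j, d_j in doc.items())
    den = sum(q_i * d_j for i, q_i in query.items() for j, d_j in doc.items())
    return num / den if den else 0.0


query = expand(reweight({"car": 1.0}, sim), {"car": ["automobile", "train"]}, sim)
print(query)
print(ssrm_similarity({"car": 1.0}, {"automobile": 2.0}, sim))
```

With these toy values a document that mentions only "automobile" still matches a query for "car", which a model that compares identical terms only would score as zero.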

Query Term Expansion

Observations

Specification of T?
    Large T may lead to topic drift

Word sense disambiguation for expanding with the correct sense

Expansion with co-occurring terms?
    SVD, local/global analysis

Semantic similarity between terms of different parts of speech?

Work with compound terms (phrases)

Evaluation of SSRM

SSRM is evaluated through intellisearch, a system for information retrieval on the WWW

1.5 million Web pages with images

Images are described by surrounding text

The problem of image retrieval is transformed into a problem of text retrieval

http://www.ece.tuc.gr/intellisearch

Methods

Vector Space Model (VSM)

SSRM

Each method is represented by a precision/recall plot

Each point is the average precision/recall over 20 queries

20 queries from the list of the most frequent Google image queries
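One way such a plot point could be computed is sketched below in Python. The slides do not describe the exact procedure, so the fixed cutoff `k`, the function names and the toy rankings are all assumptions.

```python
def precision_recall_at(ranked, relevant, k):
    """Precision and recall of the top-k answers for one query."""
    retrieved = ranked[:k]
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_over_queries(runs, k):
    """runs: list of (ranked answers, set of relevant answers), one per query."""
    points = [precision_recall_at(ranked, relevant, k) for ranked, relevant in runs]
    n = len(points)
    return sum(p for p, _ in points) / n, sum(r for _, r in points) / n


# Invented rankings for two queries (assumption, for illustration only).
runs = [
    (["a", "b"], {"a"}),
    (["x", "y"], {"y", "z"}),
]
print(average_over_queries(runs, k=2))
```

Repeating the computation for increasing values of `k` yields one averaged point per cutoff, which together trace a precision/recall curve for a method.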

Experimental Results

MeSH and MedLine

MeSH: ontology for medical and biological terms by the N.L.M.
    22,000 terms

MedLine: the premier bibliographic medical database of N.L.M.
    13 million references

Evaluation on MedLine

Conclusions

Semantic similarity methods approximated the human notion of similarity, reaching correlation up to 83%

SSRM exploits this information for improving the performance of retrieval

SSRM can work with any semantic similarity method and any ontology

Future Work

Experimentation with more data sets (TREC) and ontologies

Extend SSRM to work with
    Compound terms
    More parts of speech (e.g., adverbs)
    Co-occurring terms
    More term relationships in WordNet

More elaborate methods for specification of thresholds

Try our system on the Web

Semantic Similarity System: http://www.ece.tuc.gr/similarity

SSRM: http://www.ece.tuc.gr/intellisearch