04/20/23 ACM WIDM'2005 1
Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web
Giannis Varelas, Epimenidis Voutsakis, Paraskevi Raftopoulou, Euripides G.M. Petrakis, Evangelos Milios
Semantic Similarity
Semantic similarity relates to computing the conceptual similarity between terms which are not lexicographically similar, e.g., “car” and “automobile”
Map two terms to an ontology and compute their relationship in that ontology
Objectives
We investigate several semantic similarity methods and evaluate their performance
  http://www.ece.tuc.gr/similarity
We propose the Semantic Similarity Retrieval Model (SSRM) for computing similarity between documents containing semantically similar but not necessarily lexicographically similar terms
  http://www.ece.tuc.gr/intellisearch
Ontologies
Tools of information representation on a subject
Hierarchical categorization of terms from general to most specific terms
  object → artifact → construction → stadium
Domain ontologies represent knowledge of a domain, e.g., the MeSH medical ontology
General ontologies represent common-sense knowledge about the world, e.g., WordNet
WordNet
A vocabulary and a thesaurus offering a hierarchical categorization of natural language terms
More than 100,000 terms
An ontology of natural language terms
Nouns, verbs, adjectives and adverbs are grouped into synonym sets (synsets)
Synsets represent terms or concepts
  stadium, bowl, arena, sports stadium – (a large structure for open-air sports or entertainments)
WordNet Hierarchies
The synsets are also organized into senses
Senses: different meanings of the same term
The synsets are related to other synsets higher or lower in the hierarchy by different types of relationships, e.g.,
  Hyponym/Hypernym (Is-A relationships)
  Meronym/Holonym (Part-Of relationships)
Nine noun and several verb Is-A hierarchies
A Fragment of the WordNet Is-A Hierarchy
Semantic Similarity Methods
Map terms to an ontology and compute their relationship in that ontology
Four main categories of methods:
  Edge counting: path length between terms
  Information content: a function of the terms' probability of occurrence in a corpus
  Feature based: similarity between their properties (e.g., definitions) or based on their relationships to other similar terms
  Hybrid: combine the above ideas
Example
Edge counting: the distance between “conveyance” and “ceramic” is 2
An information content method would associate the two terms with their common subsumer and with their probabilities of occurrence in a corpus
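Edge counting can be sketched on a small Is-A tree. The taxonomy below is a hypothetical toy (only the chain object, artifact, construction, stadium comes from the slides; "building" and "vehicle" are invented for illustration), and the function names are assumptions, not part of any WordNet API. The distance is the number of edges on the path through the most specific common subsumer.

```python
from typing import Dict, List, Optional

# Hypothetical toy Is-A taxonomy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "object": None,
    "artifact": "object",
    "construction": "artifact",
    "stadium": "construction",
    "building": "construction",  # invented node, for illustration only
    "vehicle": "artifact",       # invented node, for illustration only
}


def ancestors(term: str) -> List[str]:
    """Return the chain [term, parent, ..., root]."""
    chain = []
    node: Optional[str] = term
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def common_subsumer(a: str, b: str) -> str:
    """Most specific term that lies above (or equals) both a and b."""
    above_a = set(ancestors(a))
    for node in ancestors(b):  # walks upward from b, so the first hit is the lowest
        if node in above_a:
            return node
    raise ValueError("terms do not share a root")


def path_length(a: str, b: str) -> int:
    """Number of Is-A edges between a and b via their common subsumer."""
    lcs = common_subsumer(a, b)
    return ancestors(a).index(lcs) + ancestors(b).index(lcs)


if __name__ == "__main__":
    print(path_length("stadium", "building"))      # siblings under "construction"
    print(common_subsumer("stadium", "vehicle"))
```

In this toy tree two siblings are at distance 2, which is the same situation the slide describes for “conveyance” and “ceramic”.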
Semantic Similarity on WordNet
The most popular methods are evaluated
All methods are applied to a set of 38 term pairs
Their similarity values are correlated with scores obtained by humans
The higher the correlation of a method, the better the method is
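The evaluation step amounts to correlating a method's scores with human scores over the same term pairs. The sketch below assumes Pearson's correlation coefficient, which the slides do not name explicitly; the function name and the sample numbers are made up for illustration and are not results from the study.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical scores for four term pairs (not data from the paper).
    human_scores = [3.9, 3.1, 1.2, 0.4]
    method_scores = [0.95, 0.70, 0.35, 0.05]
    print(round(pearson(human_scores, method_scores), 3))
```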
Evaluation
Method            Type            Correlation
Rada 1989         Edge Counting   0.59
Wu 1994           Edge Counting   0.74
Li 2003           Edge Counting   0.82
Leacock 1998      Edge Counting   0.82
Richardson 1994   Edge Counting   0.63
Resnik 1999       Info. Content   0.79
Lin 1998          Info. Content   0.82
Lord 2003         Info. Content   0.79
Jiang 1998        Info. Content   0.83
Tversky 1977      Feature Based   0.73
Rodriguez 2003    Hybrid          0.71
Observations
Edge counting/Info. Content methods work by exploiting structure information
Good methods take the position of the terms into account
  Higher similarity for terms which are close together but lower in the hierarchy, e.g., [Li et al. 2003]
Information Content is measured on WordNet rather than on a corpus [Seco2002]
Similarity only for nouns and verbs
  No taxonomic structure for other parts of speech
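Measuring information content on the taxonomy itself, rather than on a corpus, can be illustrated as follows. The slides do not give the formula; this sketch assumes the commonly cited intrinsic form IC(c) = 1 - log(hypo(c) + 1) / log(N), where hypo(c) is the number of hyponyms below c and N is the total number of concepts. The toy taxonomy and function names are hypothetical.

```python
import math
from typing import Dict, List

# Hypothetical toy Is-A taxonomy: parent -> direct children.
CHILDREN: Dict[str, List[str]] = {
    "object": ["artifact"],
    "artifact": ["construction", "vehicle"],
    "construction": ["stadium", "building"],
    "vehicle": [],
    "stadium": [],
    "building": [],
}


def hyponym_count(term: str) -> int:
    """Number of concepts strictly below `term` in the taxonomy."""
    return sum(1 + hyponym_count(child) for child in CHILDREN[term])


def intrinsic_ic(term: str) -> float:
    """Assumed intrinsic IC: 1 for leaves, 0 for the root, in between otherwise."""
    total_concepts = len(CHILDREN)
    return 1.0 - math.log(hyponym_count(term) + 1) / math.log(total_concepts)


if __name__ == "__main__":
    for term in CHILDREN:
        print(term, round(intrinsic_ic(term), 3))
```

Under this assumption, more specific terms (fewer hyponyms) carry more information, which matches the observation that position in the hierarchy matters.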
http://www.ece.tuc.gr/similarity
Semantic Similarity Retrieval Model (SSRM)
Classic retrieval models retrieve documents with the same query terms
SSRM will retrieve documents which also contain semantically similar terms
Queries and documents are initially assigned tf×idf weights
  q = (q1, q2, …, qN), d = (d1, d2, …, dN)
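The initial tf×idf weighting can be sketched as below. The slides do not say which tf×idf variant is used; this assumes raw term frequency multiplied by log(N / df), one common choice. The function name and the sample documents are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse term -> weight vector per tokenized document."""
    n_docs = len(docs)
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        term_freq = Counter(tokens)
        vectors.append(
            {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}
        )
    return vectors


if __name__ == "__main__":
    docs = [["car", "fast"], ["car", "red", "red"]]
    for vec in tfidf_vectors(docs):
        print(vec)
```

A term that occurs in every document gets weight 0 under this variant, so only discriminating terms contribute to the vectors q and d.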
SSRM
I. Query term re-weighting: similar terms reinforce each other
  $$ q_i' = q_i + \sum_{j \neq i,\; sim(i,j) \geq t} q_j \, sim(i,j) $$
II. Query term expansion with synonyms and similar terms
  $$ q_i' = q_i + \sum_{j \in Q,\; j \neq i,\; sim(i,j) \geq T} \frac{1}{n} \, q_j \, sim(i,j) $$
III. Document similarity
  $$ Sim(q,d) = \frac{\sum_i \sum_j q_i \, d_j \, sim(i,j)}{\sum_i \sum_j q_i \, d_j} $$
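A minimal sketch of two of the SSRM steps on this slide, query term re-weighting and document similarity, assuming sparse term-to-weight dictionaries and a term similarity function in [0, 1]. The names `make_sim`, `reweight` and `document_similarity`, the similarity values and the threshold are hypothetical. Query expansion is left out because it needs the ontology neighbourhood of each term, which the slides do not spell out.

```python
from typing import Callable, Dict, Tuple

Weights = Dict[str, float]
SimFn = Callable[[str, str], float]


def make_sim(pairs: Dict[Tuple[str, str], float]) -> SimFn:
    """Build a symmetric toy similarity function; sim(x, x) = 1, unknown pairs = 0."""
    table: Dict[Tuple[str, str], float] = {}
    for (a, b), value in pairs.items():
        table[(a, b)] = value
        table[(b, a)] = value

    def sim(a: str, b: str) -> float:
        return 1.0 if a == b else table.get((a, b), 0.0)

    return sim


def reweight(query: Weights, sim: SimFn, t: float) -> Weights:
    """Step I: each query term is reinforced by other query terms with sim >= t."""
    return {
        i: q_i + sum(
            q_j * sim(i, j)
            for j, q_j in query.items()
            if j != i and sim(i, j) >= t
        )
        for i, q_i in query.items()
    }


def document_similarity(query: Weights, doc: Weights, sim: SimFn) -> float:
    """Step III: similarity-weighted sum over all query/document term pairs."""
    numerator = sum(
        q_i * d_j * sim(i, j) for i, q_i in query.items() for j, d_j in doc.items()
    )
    denominator = sum(q_i * d_j for q_i in query.values() for d_j in doc.values())
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    sim = make_sim({("car", "automobile"): 0.9})  # hypothetical similarity value
    query = reweight({"car": 0.5, "automobile": 0.5, "banana": 0.2}, sim, t=0.8)
    print(query)
    print(document_similarity({"car": 1.0}, {"automobile": 1.0}, sim))
```

Unlike a cosine over matching terms, this score is non-zero for a document that shares no term with the query, as long as some of its terms are semantically similar to query terms.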
Query Term Expansion
Observations
Specification of T?
  Large T may lead to topic drift
Word sense disambiguation for expanding with the correct sense
Expansion with co-occurring terms?
  SVD, local/global analysis
Semantic similarity between terms of different parts of speech?
Work with compound terms (phrases)
Evaluation of SSRM
SSRM is evaluated through intellisearch, a system for information retrieval on the WWW
1.5 million Web pages with images
Images are described by surrounding text
The problem of image retrieval is transformed into a problem of text retrieval
http://www.ece.tuc.gr/intellisearch
Methods
Vector Space Model (VSM)
SSRM
Each method is represented by a precision/recall plot
Each point is the average precision/recall over 20 queries
20 queries from the list of the most frequent Google image queries
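One way to obtain the points of such a precision/recall plot is sketched below, assuming each query yields a ranked answer list and a set of relevant documents, and that points are computed at fixed answer-set sizes and averaged over queries. The cutoffs, function names and sample data are assumptions for illustration, not the exact protocol of the study.

```python
from typing import List, Sequence, Set, Tuple


def precision_recall_at(
    ranked: Sequence[str], relevant: Set[str], k: int
) -> Tuple[float, float]:
    """Precision and recall of the top-k answers for one query."""
    retrieved = ranked[:k]
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_curve(
    runs: List[Tuple[Sequence[str], Set[str]]], cutoffs: Sequence[int]
) -> List[Tuple[float, float]]:
    """One (avg precision, avg recall) point per cutoff, averaged over all queries."""
    points = []
    for k in cutoffs:
        per_query = [precision_recall_at(ranked, rel, k) for ranked, rel in runs]
        avg_p = sum(p for p, _ in per_query) / len(per_query)
        avg_r = sum(r for _, r in per_query) / len(per_query)
        points.append((avg_p, avg_r))
    return points


if __name__ == "__main__":
    # Two hypothetical queries with made-up document ids.
    runs = [
        (["a", "b", "c", "d"], {"a", "c"}),
        (["x", "y", "z", "w"], {"y"}),
    ]
    print(average_curve(runs, cutoffs=[1, 2, 4]))
```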
Experimental Results
MeSH and MedLine
MeSH: ontology for medical and biological terms by the N.L.M.
  22,000 terms
MedLine: the premier bibliographic medical database of the N.L.M.
  13 million references
Evaluation on MedLine
Conclusions
Semantic similarity methods approximated the human notion of similarity, reaching correlation up to 83%
SSRM exploits this information for improving the performance of retrieval
SSRM can work with any semantic similarity method and any ontology
Future Work
Experimentation with more data sets (TREC) and ontologies
Extend SSRM to work with
  Compound terms
  More parts of speech (e.g., adverbs)
  Co-occurring terms
  More term relationships in WordNet
More elaborate methods for specification of thresholds
Try our system on the Web
Semantic Similarity System: http://www.ece.tuc.gr/similarity
SSRM: http://www.ece.tuc.gr/intellisearch