Semantic Access to Data from the Web

Raquel Trillo*, Laura Po+, Sergio Ilarri*, Sonia Bergamaschi+ and E. Mena*

1st International Workshop on Interoperability through Semantic Data and Service Integration (ISDSI’09), Camogli (Genova), Italy, 25th June 2009

* Distributed Information Systems Group, http://sid.cps.unizar.es, University of Zaragoza, Spain

+ Databases Group, http://www.dbgroup.unimo.it/, University of Modena and Reggio Emilia, Italy


Transcript of Semantic Access to Data from the Web

Page 1: Semantic Access to Data from the Web

Semantic Access to Data from the Web

Raquel Trillo*, Laura Po+, Sergio Ilarri*, Sonia Bergamaschi+ and E. Mena*

1st International Workshop on Interoperability through Semantic Data and Service Integration (ISDSI’09)

Camogli (Genova), Italy, 25th June 2009

* Distributed Information Systems Group, http://sid.cps.unizar.es, University of Zaragoza, Spain

+ Databases Group, http://www.dbgroup.unimo.it/, University of Modena and Reggio Emilia, Italy

Page 2: Semantic Access to Data from the Web

Outline

Introduction.

Basic Architecture of the system:
    Discovering the Semantics of User Keywords.
    Semantics-Guided Data Retrieval.

Improvements to the Basic Architecture:
    Probabilistic Word Sense Disambiguation.
    Retrieval of Synonyms of User Keywords.

Conclusions and Future Work.

ISDSI’09 Camogli (Italy), 25th June 2009

Page 3: Semantic Access to Data from the Web

Introduction


Search engines have become the best allies of users:
    They index most of the non-hidden Web.
    They succeed when users ask for popular information on the Web.

Traditional search engines are based on syntactic techniques (no semantics), so they have problems with:
    Polysemous words: words with several meanings (senses/interpretations). Example: mouse (animal, Mickey Mouse, input device, etc.).
    Synonymous words: different representations (words) with the same meaning. Examples: automobile or car; lorry or truck.
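The two problems can be reproduced with a few lines of code. The sketch below is only an illustration: the four documents and the function name `syntactic_search` are invented for this example, and real search engines are far more elaborate than exact token matching.

```python
# Toy corpus (invented): d1 and d2 are about the same concept but use
# different words; d3 and d4 use the same word with different meanings.
DOCS = {
    "d1": "second-hand truck for sale",
    "d2": "lorry drivers wanted in london",
    "d3": "wireless mouse with usb receiver",
    "d4": "the field mouse is a small rodent",
}


def syntactic_search(keyword, docs):
    """Return the ids of the documents that contain the keyword as an exact token."""
    return sorted(doc_id for doc_id, text in docs.items()
                  if keyword in text.split())


truck_hits = syntactic_search("truck", DOCS)  # misses d2, which says "lorry"
mouse_hits = syntactic_search("mouse", DOCS)  # mixes the device and the animal
print(truck_hits, mouse_hits)
```

The synonym problem shows up as a missing document (d2), and the polysemy problem as two unrelated documents returned together (d3 and d4).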

Page 4: Semantic Access to Data from the Web

Introduction


Number of results returned by a search engine for two synonyms:

Truck: 172,000,000

Lorry: 4,760,000

Page 5: Semantic Access to Data from the Web

Introduction: Semantic Search


Semantic search engines can overcome the problems of traditional search engines:
    They consider the semantics of the keywords and not only their representation (how they are written).

Our proposal:
    Classify the results of a traditional search into different categories by considering their possible meanings.
    Consider the synonyms of the user keywords to retrieve more pages.

Page 6: Semantic Access to Data from the Web

Introduction: Web Clustering


Over the last decades, different techniques to cluster documents have appeared:
    Traditional clustering algorithms cannot be applied to search result clustering.

Features that a clustering technique for web search should have:
    Separate the pages relevant to the user from the irrelevant ones.
    Provide browsable summaries of each cluster.
    Be applied to snippets and not to whole pages.
    Be incremental and provide results as soon as possible.
    Allow overlapping between groups.

Page 7: Semantic Access to Data from the Web

Outline

Introduction.

Basic Architecture of the system:
    Discovering the Semantics of User Keywords:
        Obtaining the possible keyword senses (meanings).
        Selecting the most probable sense of each user keyword.
    Semantics-Guided Data Retrieval:
        Lexical annotations of results of a traditional search.
        Categorization of results.

Improvements to the Basic Architecture.

Conclusions and Future Work.


Page 8: Semantic Access to Data from the Web

Basic Architecture of the system

[Diagram: the system has two modules. "Discovering the semantics of user keywords" takes the keywords, performs the extraction of keyword senses (output: possible keyword senses) and the disambiguation of keyword senses (output: selected senses). "Semantics-Guided Data Retrieval" searches the keywords in traditional search engines (output: hits, the results of a traditional SE), performs the lexical annotation of the hits (title and snippet) by considering the possible keyword senses (output: annotated hits), categorizes the hits (output: clusters or categories of hits) and selects the most probable intended category by using the selected senses (output: semantic cluster of hits).]

Goal: Discover the intended meaning of each user keyword.

How: A Word Sense Disambiguation algorithm that works in two phases:
    Phase 1: Discover the possible meanings (senses) from semantic resources such as ontologies, thesauri, etc.
    Phase 2: For each keyword, select one intended meaning by considering the context.

Page 9: Semantic Access to Data from the Web

Obtaining the Possible Keyword Senses of each User Keyword


Consulting a well-known general-purpose shared thesaurus such as WordNet:
    Advantages: It is fast and provides a reliable set of senses.
    Disadvantages: It does not cover different domains of knowledge with the same level of detail. Example: the meaning of developer as “somebody who designs and implements software” does not appear.

Consulting the knowledge stored in different pools of ontologies available on the Web and using synonym probability measures to remove redundant interpretations:
    Advantages: The more ontologies consulted, the greater the chances of finding the semantics intended by the user.
    Disadvantages: It could introduce noise and irrelevant information.
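As an illustration of the second option, the sketch below merges the candidate senses obtained from several sources when their synonym sets overlap enough. It rests on assumptions: the slides do not define the synonym probability measure, so Jaccard overlap is used here as a simple stand-in, and the source names, synonym sets and threshold are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two synonym sets, in [0, 1] (a stand-in for a synonym probability measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def merge_senses(candidates, threshold=0.5):
    """Greedily merge candidate senses whose synonym sets overlap enough.

    candidates: list of (source, synonym_set) pairs for one keyword.
    Returns a list of dicts holding the merged synonyms and the contributing sources.
    """
    merged = []
    for source, synonyms in candidates:
        for sense in merged:
            if jaccard(sense["synonyms"], synonyms) >= threshold:
                # Redundant interpretation: fold it into the existing sense.
                sense["synonyms"] |= set(synonyms)
                sense["sources"].append(source)
                break
        else:
            # No existing sense is similar enough: keep it as a new sense.
            merged.append({"synonyms": set(synonyms), "sources": [source]})
    return merged


# Hypothetical candidate senses for the keyword "developer".
candidates = [
    ("thesaurus", {"developer", "property developer", "builder"}),
    ("ontology_a", {"developer", "programmer", "software engineer"}),
    ("ontology_b", {"developer", "programmer", "coder", "software engineer"}),
]
senses = merge_senses(candidates)
print(len(senses), senses[1]["sources"])
```

The two ontology entries collapse into a single software-related sense, while the thesaurus sense stays separate. A lower threshold removes more redundancy but also risks merging senses that are really different, which is the noise the slide warns about.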

Page 10: Semantic Access to Data from the Web

Obtaining the Possible Keyword Senses of each User Keyword


Option 1: Consulting a well-known general-purpose shared thesaurus such as WordNet.

Option 2: Consulting the knowledge stored in different pools of ontologies and using synonym probability measures to remove redundant interpretations.

The trade-off between the two approaches is not totally clear:
    Implement both options, beginning with the WordNet one.
    Perform an experimental evaluation to decide which approach to adopt.

Page 11: Semantic Access to Data from the Web

Selecting the most probable sense of each User Keyword

[Diagram: within "Discovering the semantics of user keywords", the extraction of keyword senses produces the possible keyword senses, and the disambiguation of keyword senses produces the selected senses.]

Goal: Select the most probable intended meaning for each user keyword.

How: Using Word Sense Disambiguation techniques:
    Many features can be considered in the context of a written document, but here the process is more complex: there is no syntax of whole sentences, there are few keywords (fewer than 5), etc.

Page 12: Semantic Access to Data from the Web


Selecting the most probable sense of each User Keyword

Try to emulate the behaviour of a human by considering the possible meanings of the rest of the keywords:
    If star appears in the context “Star Hollywood”, then the most probable intended meaning is “famous actor/actress”.
    If star appears in the context “Star Sky”, then the most probable intended meaning is “celestial body”.

The proposed architecture does not depend on a particular Word Sense Disambiguation technique:
    Probabilistic Word Sense Disambiguation techniques that combine different algorithms can be used.
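The “Star Hollywood” and “Star Sky” examples can be sketched with a very simple overlap-based disambiguator. This is not the technique used by the authors, who leave the WSD technique open; it is a simplified Lesk-style heuristic, and the sense inventory `SENSES` with its related words is entirely hypothetical.

```python
# Hypothetical sense inventory: each sense has a label and a bag of related words.
SENSES = {
    "star": {
        "celestial body": {"sky", "night", "sun", "astronomy", "galaxy"},
        "famous actor/actress": {"hollywood", "film", "movie", "celebrity", "actor"},
    },
    "hollywood": {
        "film industry": {"film", "movie", "actor", "studio", "celebrity"},
    },
    "sky": {
        "atmosphere seen from Earth": {"night", "sun", "cloud", "astronomy"},
    },
}


def disambiguate(keyword, context_keywords, senses=SENSES):
    """Pick the sense of `keyword` that shares the most words with the context.

    The context is made of the other keywords plus the related words of all
    their possible senses. Ties are resolved in favour of the first sense.
    """
    context = set(context_keywords)
    for other in context_keywords:
        for related in senses.get(other, {}).values():
            context |= related
    scores = {label: len(related & context)
              for label, related in senses[keyword].items()}
    return max(scores, key=scores.get), scores


print(disambiguate("star", ["hollywood"])[0])
print(disambiguate("star", ["sky"])[0])
```

With so few keywords, the only context available is the other keywords and what the semantic resources say about them, which is why the context is expanded with the related words of their senses.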


Page 14: Semantic Access to Data from the Web

Semantics-Guided Data Retrieval

[Diagram: the keywords are searched in traditional search engines to obtain the hits (results of a traditional SE). The lexical annotation of hits (title and snippet) uses the possible keyword senses to produce the annotated hits. The categorization of hits produces the clusters or categories of hits. The selection of the most probable intended category uses the selected senses to produce the semantic cluster of hits.]

Goal: Select the hits relevant to the user and filter out the irrelevant ones.

How:
    Phase 1: Retrieval by using traditional techniques.
    Phases 2 and 3: Lexical annotation of the hits and their classification by using Word Sense Disambiguation.
    Phase 4: Selection of the category corresponding to the selected senses.

Page 15: Semantic Access to Data from the Web

Lexical Annotation of the Results of a Traditional Search Engine

[Diagram: the hits (results of a traditional SE; for each hit, its title, URL and snippet) go through the cleansing of hits and then, together with the possible keyword senses, through the lexical annotation, which produces the hits annotated with the possible keyword senses.]

Goal: Associate a meaning with each user keyword that appears in each returned hit (title, URL and snippet) by considering the possible meanings of the keyword.

How:
    Cleansing each hit to remove stopwords and marks without semantic information.
    Performing WSD by considering the context of the words (their neighbouring words within a window).
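The two steps (cleansing and window-based WSD) can be sketched as follows. The stopword list, the window size, the sense inventory `STAR_SENSES` and the overlap scoring are assumptions made for this illustration, and treating “marks without semantic information” as markup tags and punctuation is one possible reading of the slide.

```python
import re

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for", "with"}

# Hypothetical senses of "star", each with a bag of related words.
STAR_SENSES = {
    "celestial body": {"sky", "night", "sun", "galaxy", "telescope"},
    "famous actor/actress": {"hollywood", "film", "movie", "celebrity", "actor"},
}


def cleanse(text):
    """Lowercase the text, drop markup tags and punctuation, and remove stopwords."""
    text = re.sub(r"<[^>]+>", " ", text.lower())
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOPWORDS]


def annotate(hit_text, keyword, senses, window=3):
    """Annotate each occurrence of `keyword` with a sense label, or 'unknown'."""
    tokens = cleanse(hit_text)
    annotations = []
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        # Context: up to `window` tokens on each side of the occurrence.
        neighbours = set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
        scores = {label: len(related & neighbours) for label, related in senses.items()}
        best = max(scores, key=scores.get)
        annotations.append(best if scores[best] > 0 else "unknown")
    return annotations


print(annotate("<b>Star</b> maps of the night sky", "star", STAR_SENSES))
print(annotate("Hollywood star signs new movie deal", "star", STAR_SENSES))
print(annotate("Star Wars tickets", "star", STAR_SENSES))
```

The third call shows why an explicit unknown annotation is useful: a snippet may use the keyword in a sense that the semantic resources do not cover.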

Page 16: Semantic Access to Data from the Web

Lexical Annotation of the Results of a Traditional Search Engine

Only information from snippets is used to perform the lexical annotation.

New senses for words appear, but they are integrated into semantic resources only once they have become widespread.

Page 17: Semantic Access to Data from the Web

Categorization of the Annotated Results

Goal: Associate a category with each annotated hit.

How:
    Defining the categories by considering the possible keyword senses.
    Associating a category with each hit by considering its annotations.

Example:

Keyword senses:
    K1 (Hollywood): S11
    K2 (Star): S21 (celestial body), S22 (actor/actress)

Annotated hits: Hit1(S11, S21), Hit2(S11, S22), Hit3(S11, S22), Hit4(S11, ?), ...

Categories:
    C1(S11, S21): Hit1, ...
    C2(S1U, S21): Hit4, ...
    C3(S11, S22): Hit2, Hit3, ...
    C4(S1U, S22): ...
    C5(S1U, S2U): ...
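One way to read the categorization step is as a Cartesian product of the possible senses of each keyword, with one extra “unknown” sense per keyword. The sketch below follows that reading; the function names and the single generic `"U"` marker (instead of one marker per keyword, as in S1U and S2U) are choices made for this example only.

```python
from itertools import product

UNKNOWN = "U"  # generic marker for "no known sense" of a keyword


def build_categories(keyword_senses):
    """One category per combination of senses, with an extra unknown sense per keyword."""
    options = [senses + [UNKNOWN] for senses in keyword_senses.values()]
    return list(product(*options))


def categorize(hits, categories):
    """Group hit ids by their tuple of annotations; None means no sense was found."""
    clusters = {category: [] for category in categories}
    for hit_id, annotations in hits:
        key = tuple(a if a is not None else UNKNOWN for a in annotations)
        clusters[key].append(hit_id)
    return clusters


keyword_senses = {"hollywood": ["S11"], "star": ["S21", "S22"]}
categories = build_categories(keyword_senses)

# Annotated hits, in the order returned by the search engine.
hits = [("Hit1", ("S11", "S21")), ("Hit2", ("S11", "S22")), ("Hit3", ("S11", "S22"))]
clusters = categorize(hits, categories)
print(len(categories), clusters[("S11", "S22")])
```

With one sense for Hollywood and two for star, this yields (1 + 1) x (2 + 1) = 6 categories. Because hits are appended in input order, each cluster keeps the ranking of the underlying search engine.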

Page 18: Semantic Access to Data from the Web

Categorization of the Annotated Results

Categories:
    C1(S11, S21): Hit1, ...
    C2(S1U, S21): Hit4, ...
    C3(S11, S22): Hit2, Hit3, ...
    C4(S1U, S22): ...
    C5(S1U, S2U): ...
    C6(S11, S2U): ...

S1U: unknown sense for Hollywood. S2U: unknown sense for star.

Select the category (cluster) that corresponds to the senses selected for the user keywords.

The hits of each category are ordered following the ranking returned by the search engine.

Page 19: Semantic Access to Data from the Web

Problems of Basic Architecture


Problem 1: The system only selects the most probable intended category, but the user may be interested in another one.

Problem 2: Sometimes, even for a human, it is very difficult to decide which meaning of a word is being used.

Problem 3: The system does not consider the synonyms of the keywords.


Page 21: Semantic Access to Data from the Web

Probabilistic Word Sense Disambiguation

Show more interpretations to the user: instead of showing the user only the category corresponding to the most probable senses, show him/her all the categories, sorted by the probability associated with each category.

Categories before sorting:
    C1(S11, S21): Hit1, ...
    C2(S1U, S21): Hit4, ...
    C3(S11, S22): Hit2, Hit3, ...
    C4(S1U, S22): ...
    C5(S1U, S2U): ...
    C6(S11, S2U): ...

Categories sorted by probability:
    C3(S11, S22): Hit2, Hit3, ...
    C4(S1U, S22): ...
    C1(S11, S21): Hit1, ...
    C2(S1U, S21): Hit4, ...
    C6(S11, S2U): ...
    C5(S1U, S2U): ...

Page 22: Semantic Access to Data from the Web

Probabilistic Word Sense Disambiguation

It is based on a probabilistic combination of different WSD algorithms, so the process does not depend on the effectiveness of a single algorithm.

It associates a probability with each lexical annotation, which indicates the reliability level of the annotation.

So, each hit will be associated with several categories, each with a certain probability.

[Figure: example probabilities 0.75, 0.20 and 0.05.]
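A minimal sketch of the probabilistic combination is given below. The slides do not say how the algorithms are combined, so a weighted average of their sense distributions is assumed here. The two algorithm outputs, the equal weights and the distribution for Hollywood are invented, and they were chosen so that the combined distribution for star matches the example probabilities 0.75, 0.20 and 0.05. Multiplying the per-keyword probabilities assumes that the annotations are independent.

```python
def combine(distributions, weights):
    """Weighted average of the sense distributions produced by several WSD algorithms."""
    total = sum(weights)
    labels = {label for dist in distributions for label in dist}
    return {
        label: sum(w * dist.get(label, 0.0)
                   for dist, w in zip(distributions, weights)) / total
        for label in labels
    }


def category_probabilities(annotation_dists):
    """Probability of each category as the product of its per-keyword probabilities."""
    result = {(): 1.0}
    for dist in annotation_dists:
        result = {cat + (label,): p * q
                  for cat, p in result.items()
                  for label, q in dist.items()}
    return result


# Hypothetical outputs of two WSD algorithms for "star" in one hit.
algo_a = {"S22": 0.9, "S21": 0.1}
algo_b = {"S22": 0.6, "S21": 0.3, "S2U": 0.1}
star = combine([algo_a, algo_b], weights=[1, 1])

# Hypothetical combined distribution for "hollywood" in the same hit.
hollywood = {"S11": 0.9, "S1U": 0.1}

probs = category_probabilities([hollywood, star])
ranked = sorted(probs, key=probs.get, reverse=True)
print(ranked[0], round(probs[ranked[0]], 3))
```

Each hit now belongs to every category with some probability, so the categories can be shown to the user in decreasing order of probability instead of showing a single one.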

Page 23: Semantic Access to Data from the Web

Retrieval of Synonyms of User Keywords

Probabilistic Word Sense Disambiguation: Associate with each hit the product of the probabilities of its annotations, and use this value to rank the hits classified inside a category (group or cluster).

Enrichment of the clusters by retrieving synonyms of the senses that represent the category.

Example: C3(S11 (Hollywood), S22 (star)): Hit2, Hit3, ... Synonyms of S22: celebrity, actor/actress.
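Both ideas on this slide can be sketched briefly. The ranking function follows the slide directly (product of the annotation probabilities). The query expansion is an assumption about how the synonyms could be used to retrieve more pages: one extra query is built per synonym. The probabilities and the helper names are hypothetical.

```python
from math import prod


def rank_hits(cluster):
    """Sort the hits of one category by the product of their annotation probabilities."""
    return sorted(cluster, key=lambda hit: prod(hit["probs"]), reverse=True)


def expanded_queries(keywords, synonyms):
    """Build one extra query per synonym by replacing the keyword it stands for."""
    queries = []
    for i, keyword in enumerate(keywords):
        for synonym in synonyms.get(keyword, []):
            queries.append(" ".join(keywords[:i] + [synonym] + keywords[i + 1:]))
    return queries


# Hits of one category, each with the probabilities of its two annotations.
cluster = [
    {"id": "Hit2", "probs": [0.9, 0.6]},  # product 0.54
    {"id": "Hit3", "probs": [0.8, 0.9]},  # product 0.72
]
ranked = [hit["id"] for hit in rank_hits(cluster)]

# Synonyms of the sense selected for "star" in this category.
queries = expanded_queries(["hollywood", "star"], {"star": ["celebrity", "actor"]})
print(ranked, queries)
```

The hits returned for the extra queries would then be annotated and added to the same category, which is how the clusters get enriched.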


Page 25: Semantic Access to Data from the Web


Related Work

There exist several techniques for clustering the results of a web search, but most of them are based only on statistical techniques.

Some approaches consider semantics, such as:

Hao et al. 2008: Uses only WordNet and assumes a predefined set of categories.

Hemayati et al. 2007: Limited to queries with a single keyword and does not allow overlapping categories.

Page 26: Semantic Access to Data from the Web


Conclusions and Future Work

We have proposed an architecture to group the results of a standard search engine into different categories:

The categories are defined by the senses of the input keywords.

The system has the features that are desirable in this kind of system.

Non-popular searches do not remain hidden.

Next steps:

Implementation of the proposed system.

Design of a set of experiments with users to evaluate it.

Page 27: Semantic Access to Data from the Web

Semantic Access to Data from the Web

Raquel Trillo*, Laura Po+, Sergio Ilarri*, Sonia Bergamaschi+, E. Mena*

1st International Workshop on Interoperability through Semantic Data and Service Integration (ISDSI’09)

Camogli (Genova), Italy, 25th June 2009

* http://sid.cps.unizar.es, University of Zaragoza

+ http://www.dbgroup.unimo.it/, University of Modena and Reggio Emilia

Grazie mille! Thank you very much!

Questions and suggestions.