Label Normalization and Lexical Annotation for Schema and Ontology Matching

International Doctorate School in Information and Communication Technologies

Università degli Studi di Modena e Reggio Emilia

Serena Sorrentino

XXIII Cycle

Computer Engineering and Science

Advisor: Prof. Sonia Bergamaschi

Co-Advisor: Prof. Sanda Harabagiu


Outline

• Overview
• Schema Matching
• Lexical Annotation
• The MOMIS Data Integration System
• Open Problems and Contributions
• Semi-Automatic Lexical Annotation
• Schema Label Normalization
• Uncertainty in Automatic Annotation
• Conclusion & Future Work


Schema Matching - Definition

Schema matching is the task of finding the semantic correspondences (mappings) between elements of two schemata

Input:
• Schema Information: element names, data types, constraints…
• Instance Information: used to characterize the content and semantics of schema elements
• Auxiliary Information: dictionaries, thesauri, user input…

Output:
• Match Result: a set of mapping elements, each of which specifies that certain elements of S1 are mapped to certain elements of S2
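As a minimal sketch of the Match Result defined above, a mapping element can be modelled as a pair of element sets, one per schema. The class name `MappingElement`, its fields, and the example element names are illustrative assumptions, not part of the presented system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MappingElement:
    """One correspondence: some elements of S1 mapped to some elements of S2."""
    s1_elements: frozenset
    s2_elements: frozenset

# A match result is a set of mapping elements (element names are hypothetical).
match_result = {
    MappingElement(frozenset({"S1.Customer"}), frozenset({"S2.Client"})),
    MappingElement(frozenset({"S1.Street", "S1.City"}), frozenset({"S2.Address"})),
}

def mapped_to(result, s1_element):
    """Return every S2 element that some mapping element associates with s1_element."""
    targets = set()
    for mapping in result:
        if s1_element in mapping.s1_elements:
            targets |= mapping.s2_elements
    return targets
```

Using sets on both sides keeps the structure general enough for 1:1 as well as n:m correspondences.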


Lexical Annotation for Schema Matching

DBGroup Approach: starting from the “hidden” meanings associated with schema labels (i.e. class and attribute names, also called terms), it is possible to discover lexical relationships among schema elements

Lexical Annotation of schema labels is the explicit assignment of meanings w.r.t. a reference lexical thesaurus (WordNet in our case)

Lexical relationships (inter-schema knowledge):
• SYN (Synonym-of) between two synonym terms
• BT (Broader Term) between two terms where the first generalizes the second (the opposite is NT, Narrower Term)
• RT (Related Term) between two terms that are generally used together in the same context

Schema derived relationships (intra-schema knowledge):
• BT/NT (from ISA relationships, and from a Foreign Key (FK) in relational sources when it is a Primary Key in both the original and the referenced relation)
• RT (from nested elements in XML files and from FKs in relational sources)

[S. Bergamaschi, S. Castano, M. Vincini, D. Beneventano. Semantic integration of heterogeneous information sources. DKE Journal, 2001]


Lexical Annotation - Example

[Figure: the schema labels Customer and Client are annotated with meanings (synsets in WordNet). Meanings shown: “someone who pays for goods or services”, “a person who seeks the advice of a lawyer”, “any computer that is hooked up to a computer network”. Senses shown: Customer#1, Client#1, Client#2, Client#3. Customer#1 and Client#1 belong to the same synset, so the relationship Customer SYN Client is discovered. Other WordNet links shown: hypernym, hyponym, holonym, meronymy.]

Lexical Relationship Discovery:
• SYN: synonym in WordNet
• BT/NT: hypernym/hyponym WordNet relationship
• RT: meronym relationship (part of) or sibling in WordNet
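The discovery rules above can be sketched over a tiny in-memory lexicon. This is only an illustration: the synset identifiers and the `HYPERNYM` and `PART_OF` tables are invented toy data standing in for WordNet, and the function `relation` is a hypothetical name.

```python
# Toy stand-in for WordNet (assumed data): child synset -> parent synset.
HYPERNYM = {
    "brochure.n.01": "book.n.01",
    "novel.n.01": "book.n.01",
    "book.n.01": "publication.n.01",
}
# Toy meronymy pairs (part, whole).
PART_OF = {("chapter.n.01", "book.n.01")}

def ancestors(synset):
    """All synsets reachable by following hypernym links upwards."""
    found = set()
    while synset in HYPERNYM:
        synset = HYPERNYM[synset]
        found.add(synset)
    return found

def relation(a, b):
    """Return SYN, BT, NT, RT or None for two annotated synsets."""
    if a == b:
        return "SYN"                      # same synset
    if a in ancestors(b):
        return "BT"                       # a generalizes b
    if b in ancestors(a):
        return "NT"                       # a specializes b
    if (a, b) in PART_OF or (b, a) in PART_OF:
        return "RT"                       # meronym (part of)
    if a in HYPERNYM and HYPERNYM.get(a) == HYPERNYM.get(b):
        return "RT"                       # siblings under the same hypernym
    return None
```

In this toy lexicon, two labels annotated with the same synset yield SYN, while `brochure.n.01` and `novel.n.01` are siblings and therefore yield RT.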



The MOMIS Data Integration System

The MOMIS System (Mediator EnvirOnment for Multiple Information Sources) is an I3 framework designed for the integration of structured and semi-structured data sources

[Figure: MOMIS architecture. Data sources (RDB, XML, …) are wrapped into Local Schema 1 … Local Schema N; manual or automatic lexical annotation associates synsets with schema labels; Common Thesaurus generation collects lexical relationships, schema derived relationships, user supplied relationships and inferred relationships; cluster generation and global schema generation then produce the global classes and the mapping tables.]


Open Problems and Contributions: Automatic Lexical Annotation

[Figure: two example schemata, S1 and S2, with labels such as CLIENT, CLIENT_ID, NAME, ADDRESS, COUNTRY, CITY, STREET_ADDRESS, PURCHASE_ORDER, PO_ID, PRODUCT_CODE, QTY, PRICE, TSP_INFO, INVOCE_NR]

• Manual Annotation is a tedious and non-scalable task: we need a method to perform Automatic or Semi-automatic Annotation
• Non-Dictionary Words, i.e. Compound Nouns (CNs), abbreviations, acronyms: schema labels need to be normalized
• Fully Automatic Annotation (i.e. “on-the-fly”) is intrinsically uncertain: uncertain annotations need to be dealt with



Word Sense Disambiguation for Semi-Automatic Lexical Annotation

WSD (Word Sense Disambiguation) is the ability to identify the meanings of words in context by a computational technique [R. Navigli, Word sense disambiguation: A survey. ACM Comput. Surv., 2009]

The semi-automatic CWSD (Combined Word Sense Disambiguation) method:
• associates one or more WordNet meanings with each label
• combines two WSD algorithms:
  • SD (Structural Disambiguation), which exploits the schema derived relationships
  • WND (WordNet Domains Disambiguation), which exploits WordNet Domains [B. Magnini, et al., The role of domain information in Word Sense Disambiguation, Journal of Natural Language Engineering, 2002]
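A rough sketch of how two such algorithms might be combined follows. The slides do not spell out the combination rule, so the policy used here (keep the senses chosen by SD when there are any, otherwise fall back to the domain filter) is an assumption, as are the function names and the domain labels attached to each sense.

```python
def wnd(candidate_senses, sense_domains, selected_domains):
    """Domain filter: keep the senses tagged with at least one selected domain."""
    return {
        sense for sense in candidate_senses
        if sense_domains.get(sense, set()) & selected_domains
    }

def cwsd(candidate_senses, sd_senses, sense_domains, selected_domains):
    """Assumed combination: prefer the structural result, else use the domain filter."""
    structural = set(sd_senses) & set(candidate_senses)
    if structural:
        return structural
    return wnd(candidate_senses, sense_domains, selected_domains)

# Hypothetical domain tags for the senses of the label "Client".
SENSE_DOMAINS = {
    "Client#1": {"commerce"},
    "Client#2": {"law"},
    "Client#3": {"computer_science"},
}
CANDIDATES = {"Client#1", "Client#2", "Client#3"}

annotation = cwsd(CANDIDATES, set(), SENSE_DOMAINS, {"commerce"})
```

The returned set may hold more than one sense, which mirrors the statement that CWSD associates one or more meanings with each label.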


The CWSD method

[Figure: CWSD workflow. Automatic wrapping of the sources extracts the class and attribute names and the schema derived relationships; the integration designer selects the relevant domains; the SD and WND algorithms are combined by CWSD to produce the annotated schemata; the resulting lexical relationships are stored in the Common Thesaurus.]


Experimental Evaluation

We evaluated CWSD on a real data set: three levels of a subtree of the Yahoo and Google directories (“society and culture” and “society”, respectively)

WSD Algorithm   Recall   Precision   F-Measure
SD              0.08     0.97        0.15
WND             0.67     0.70        0.68
CWSD            0.74     0.74        0.74

Publications related to CWSD:
• S. Bergamaschi, L. Po, S. Sorrentino. Automatic Annotation in Data Integration Systems. OTM Workshops 2007
• S. Bergamaschi, L. Po, A. Sala, S. Sorrentino. Data source annotation in data integration systems. DBISP2P 2007



Schema label normalization is the reduction of each label to some standardized form that can be easily recognized

In our case: the process of abbreviation expansion and CN (Compound Noun) annotation


[Figure: (a) relationships discovered without schema normalization; (b) relationships discovered with schema normalization. Both panels show SYN relationships between two schemata containing the labels PO and PurchaseOrder. Legend: right relationship, false negative relationship, false positive relationship.]


The Schema Label Normalization method

We propose a semi-automatic schema label normalization method which is composed of three phases:

• Phase 1: selecting the labels to be normalized; tokenizing labels into separate words; identifying abbreviations and CNs among the tokenized words
• Phase 2: abbreviation expansion (see Maciej Gawinecki’s presentation)
• Phase 3: interpreting CNs; creating new WordNet entries and meanings for the CNs
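The tokenizing and identifying steps can be illustrated with a small sketch. The splitting rules (underscores and other separators, then camelCase boundaries) and the toy `DICTIONARY` that stands in for a WordNet lookup are assumptions made for the example, not the rules of the presented method.

```python
import re

# Toy dictionary standing in for a WordNet lookup (assumed data).
DICTIONARY = {"purchase", "order", "street", "address", "client", "name", "price"}

def tokenize_label(label):
    """Split a schema label on separators and camelCase boundaries."""
    tokens = []
    for part in re.split(r"[^A-Za-z0-9]+", label):
        # Uppercase runs (acronyms), capitalized or lowercase words, digit runs.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part))
    return [token.lower() for token in tokens]

def analyze_label(label, dictionary=DICTIONARY):
    """Flag candidate abbreviations (non-dictionary tokens) and candidate CNs."""
    tokens = tokenize_label(label)
    return {
        "tokens": tokens,
        "candidate_abbreviations": [t for t in tokens if t not in dictionary],
        "is_compound": len(tokens) > 1,
    }
```

With this toy dictionary, `PO_ID` is split into two tokens that are both flagged as candidate abbreviations, while `PurchaseOrder` is recognized as a candidate compound noun made of dictionary words.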


CN Annotation

Compound Noun (CN): a term composed of two or more words, called constituents

Endocentric CNs: they consist of a head (i.e. the part that contains the basic meaning of the CN) and modifiers, which restrict this meaning. E.g. “delivery company”

Our method can be summed up in four main steps


CN constituent disambiguation & pruning

1. CN constituent disambiguation
• head and modifiers disambiguation: by applying CWSD

2. Redundant constituent identification and pruning
• Redundant words: words that do not contribute new information, i.e. information already derivable from the schema or from the lexical thesaurus
• E.g. for the attribute “company address” of the class “company”, the constituent “company” is not considered, as the relationship holding between a class and its attributes is implicit in the schema
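The schema-derived part of the pruning step can be sketched as follows. Only the class-name case from the example is covered; redundancy derived from the lexical thesaurus is not modelled, and the function name and the guard against emptying a label are assumptions.

```python
def prune_redundant(attribute_tokens, class_tokens):
    """Drop attribute constituents that merely repeat the name of the owning class."""
    class_words = set(class_tokens)
    pruned = [token for token in attribute_tokens if token not in class_words]
    # Assumed guard: never prune a label down to nothing.
    return pruned if pruned else list(attribute_tokens)

# The attribute "company address" of the class "company".
remaining = prune_redundant(["company", "address"], ["company"])
```

Here `remaining` keeps only the constituent that adds information beyond the implicit class-attribute relationship.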


CN interpretation via semantic relationships

3. CN interpretation: selecting, among a set of predefined semantic relationships (in our case the nine Levi’s relationships: CAUSE, HAVE, MAKE, IN, FOR, ABOUT, USE, BE, FROM [Levi, J. N., The Syntax and Semantics of Complex Nominals. Academic Press, 1978]), the one that best captures the relationship between the head and the modifier

Intuition: the semantic relationship between head and modifier is the same as the one holding between their unique beginners (i.e. the 25 top concepts in the WordNet noun hierarchy), so we manually select the correct Levi’s relationship only for each pair of unique beginners

[Figure: for the CN “delivery company”, hyponym chains link the unique beginner Group#1 to Company#1 and the unique beginner Act#2 to Delivery#1, with Institution#1 and Transport#1 shown as intermediate nodes. The relationship MAKE chosen for the pair of unique beginners is inherited by the pair Company#1, Delivery#1.]

Why Levi’s relationships?
• they are suitable for interpreting pairs of unique beginners
• a detailed, fine-grained interpretation is not required in our context
• they can be used during the CN gloss definition
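The intuition above amounts to a table lookup keyed on unique beginners. The sketch below assumes a toy hypernym table (the real WordNet chains are longer) and a one-entry relationship table; both, together with the function names, are illustrative only.

```python
# Toy hypernym links (assumed, shortened): sense -> more general sense.
HYPERNYM = {
    "company#1": "institution#1",
    "institution#1": "group#1",
    "delivery#1": "transport#1",
    "transport#1": "act#2",
}

# Manually chosen relationship per pair (head beginner, modifier beginner).
LEVI_BY_BEGINNERS = {("group#1", "act#2"): "MAKE"}

def unique_beginner(sense):
    """Climb the hypernym links until a top-level concept is reached."""
    while sense in HYPERNYM:
        sense = HYPERNYM[sense]
    return sense

def interpret_cn(head_sense, modifier_sense):
    """Inherit the relationship selected for the pair of unique beginners."""
    key = (unique_beginner(head_sense), unique_beginner(modifier_sense))
    return LEVI_BY_BEGINNERS.get(key)
```

Because the table is keyed on the 25 unique beginners rather than on individual senses, it stays small enough to be filled in by hand.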


Creation of a new WN meaning for a CN

4.a Gloss definition
• Company#1 gloss: an institution created to conduct business
• Delivery#1 gloss: the act of delivering or distributing something
• Relationship: Modifier MAKE Head
• Delivery_Company gloss: an institution created to conduct business make the act of delivering or distributing something

4.b Inclusion of the new CN meaning in WN

[Figure: the new synset Delivery_Company#1 is connected to Company#1 and Delivery#1; the links shown are Hypernym/Hyponym and Related Term, with SYNSETµ and SYNSETβ as generic synsets.]
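The gloss composition shown in step 4.a can be reproduced with simple string concatenation, and step 4.b with a new lexicon entry. In this sketch the link types (hypernym link to the head, related-term link to the modifier) are one reading of the figure and should be treated as an assumption, as should the dictionary layout and the function names.

```python
def compose_gloss(head_gloss, relation, modifier_gloss):
    """Join the two glosses with the selected relationship, as in the example."""
    return f"{head_gloss} {relation.lower()} {modifier_gloss}"

def add_cn_entry(lexicon, cn_sense, head_sense, modifier_sense, relation):
    """Add a new sense for the CN to a toy lexicon (dict of sense -> entry)."""
    lexicon[cn_sense] = {
        "gloss": compose_gloss(
            lexicon[head_sense]["gloss"], relation, lexicon[modifier_sense]["gloss"]
        ),
        "hypernym": head_sense,        # assumed link type
        "related": [modifier_sense],   # assumed link type
    }
    return lexicon[cn_sense]

lexicon = {
    "Company#1": {"gloss": "an institution created to conduct business"},
    "Delivery#1": {"gloss": "the act of delivering or distributing something"},
}
entry = add_cn_entry(lexicon, "Delivery_Company#1", "Company#1", "Delivery#1", "MAKE")
```

The resulting `entry["gloss"]` matches the Delivery_Company gloss given on the slide.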


Experimental Evaluation

Evaluation over five different data sets (including relational and XML schemata)

Evaluating the lexical annotation process:

                                            Precision   Recall   F-Measure
Lexical Annotation without Normalization    0.78        0.36     0.49
Lexical Annotation with Normalization       0.71        0.66     0.68

Evaluating the discovered lexical relationships:

                                                  Precision   Recall   F-Measure
Relationships discovered without Normalization    0.52        0.47     0.49
Relationships discovered with Normalization       0.79        0.75     0.77

Publications related to Schema Label Normalization:
• S. Sorrentino, S. Bergamaschi, M. Gawinecki, L. Po, Schema Label Normalization for Improving Schema Matching, DKE Journal, 2010
• S. Sorrentino, S. Bergamaschi, M. Gawinecki, L. Po, Schema Normalization for Improving Schema Matching, ER 2009



Uncertainty in Automatic Annotation

In Automatic Lexical Annotation, uncertainty is assessed in terms of probability

The PWSD (Probabilistic Word Sense Disambiguation) algorithm:
• automatically associates one or more WordNet meanings with schema labels
• automatically assigns to each annotation a probability value that indicates the reliability of the annotation itself
• is based on a probabilistic combination of different WSD algorithms
• uses the Dempster-Shafer theory [Shafer, G., A Mathematical Theory of Evidence, Princeton 1976] to combine the results of the different WSD algorithms


Example

[Figure: PWSD workflow. Each WSD algorithm has a confidence (WSD Algorithm 1: 70%, WSD Algorithm 2: 60%, WSD Algorithm 3: 50%, …) and annotates the schema labels; for each label, a table records which meanings (label#1, label#2, label#3) were selected by WSD 1, WSD 2, …, WSD N; the Dempster-Shafer theory combines these outputs into probabilistic annotations.]

Schema Element          Annotation    Prob. Value
Source1.Book            book#1        0.65
Source1.Book            book#3        0.17
Source2.Brochure        brochure#1    0.60
Source2.Book Heading    heading#2     0.48
…


Probabilistic Lexical Relationships

Starting from the probabilistic annotation, PWSD derives a set of probabilistic lexical relationships between schema elements

[Figure: relationships discovered with WordNet First Sense compared with the probabilistic relationships discovered by PWSD; probability values shown on the edges: 0.42, 0.38, 0.40, 0.57, 0.56, 0.39, 0.62, 0.51, 0.78, 0.64, 0.23]
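One way to picture this derivation is sketched below. The slides do not state how a relationship's probability is computed, so the sketch assumes that the two annotations are independent and multiplies their probabilities; the sense-level relation table, the threshold parameter and the function name are likewise illustrative assumptions.

```python
# Probabilistic annotations: schema element -> {sense: probability}.
ANNOTATIONS = {
    "Source1.Book": {"book#1": 0.65, "book#3": 0.17},
    "Source2.Brochure": {"brochure#1": 0.60},
}

# Toy sense-level relations (assumed): book#1 is broader than brochure#1.
SENSE_RELATIONS = {("book#1", "brochure#1"): "BT"}

def probabilistic_relationships(annotations, sense_relations, threshold=0.0):
    """Derive (element, relation, element, probability) tuples from annotations."""
    results = []
    elements = sorted(annotations)
    for i, first in enumerate(elements):
        for second in elements[i + 1:]:
            for sense_a, prob_a in annotations[first].items():
                for sense_b, prob_b in annotations[second].items():
                    # Toy lookup: only the (first, second) direction is checked.
                    rel = sense_relations.get((sense_a, sense_b))
                    probability = prob_a * prob_b  # independence assumption
                    if rel is not None and probability >= threshold:
                        results.append((first, rel, second, probability))
    return results

relationships = probabilistic_relationships(ANNOTATIONS, SENSE_RELATIONS)
```

Raising `threshold` discards low-probability relationships, trading recall for precision.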


Experimental Results

Evaluation on 2 relational schemata of the Amalgam integration benchmark and 3 ontologies from the OAEI’06 benchmark

Evaluating the lexical annotation process:

WSD method             Precision   Recall   F-Measure
WordNet First Sense    0.75        0.54     0.63
PWSD*                  0.63        0.73     0.68

* Threshold = 0.2

Evaluating the discovered lexical relationships:

WSD method             Precision   Recall   F-Measure
WordNet First Sense    0.80        0.65     0.72
PWSD*                  0.80        0.71     0.75

* Threshold = 0.15

Publications related to PWSD:
• L. Po, S. Sorrentino, Automatic generation of probabilistic relationships for improving schema matching, Information Systems Journal, 2011
• L. Po, S. Sorrentino, S. Bergamaschi, D. Beneventano, Lexical knowledge extraction: an effective approach to schema and ontology matching, ECKM 2009


NORMS and ALA

The Schema Label Normalization functionalities have been implemented in a tool called NORMS (NORMalizer of Schemata), which allows the designer to enhance the normalized labels by correcting potential errors [S. Sorrentino, S. Bergamaschi, M. Gawinecki, NORMS: an automatic tool to perform schema label normalization, ICDE 2011]

CWSD and PWSD have been implemented in a tool called ALA (Automatic Lexical Annotator). It has been integrated within the MOMIS System [S. Bergamaschi, L. Po, S. Sorrentino, A. Corni, Dealing with Uncertainty in Lexical Annotation, ERPD 2009]


Conclusion

Automatic and Semi-Automatic methods to perform Label Normalization and Lexical Annotation have been presented:
• CWSD
• Schema Label Normalization
• PWSD

Automatic methods to extract (probabilistic) lexical relationships have been proposed, and their effectiveness in improving schema matching has been shown

All the methods have been implemented in the context of the MOMIS Data Integration System. However, they can be applied in the general contexts of schema and ontology matching


Future Work

New research lines:

• inclusion and integration of other knowledge resources for automatic lexical annotation:
  • Domain-Specific Resources such as domain ontologies, domain thesauri etc. to address the problem of specific domain terms in schemata (e.g., the biomedical term “aromatase”, which is an enzyme involved in the production of estrogen)
  • Generic resources: Wikipedia, dictionaries etc.

• inclusion of instance-information extraction techniques to improve the automatic annotation and relationship discovery processes and to solve the problem of non-informative schema labels
  • The use of RELEVANT [S. Bergamaschi, C. Sartori, F. Guerra, M. Orsini, Extracting Relevant Attribute Values for Improved Search. IEEE Internet Computing 2007], which is a tool to extract (and add to the schema) metadata about the relevant instance values of an attribute, is a promising direction


Publications

Journals:

• Po, L. and Sorrentino, S. (2011). Automatic generation of probabilistic relationships for improving schema matching. Information Systems Journal, Special Issue on Semantic Integration of Data, Multimedia, and Services, 36(2):192-208.

• Sorrentino, S., Bergamaschi, S., Gawinecki, M., and Po, L. (2010). Schema label normalization for improving schema matching. DKE Journal, 69(12):1254-1273.

International Conferences and Workshops:

• Sorrentino, S., Bergamaschi, S., and Gawinecki, M. (2011). NORMS: an automatic tool to perform schema label normalization. In Press, Accepted Manuscript (Demo Paper), IEEE International Conference on Data Engineering, ICDE 2011, April 11-16, Hannover.

• Sorrentino, S., Bergamaschi, S., Gawinecki, M., and Po, L. (2009). Schema normalization for improving schema matching. In Proceedings of the 28th International Conference on Conceptual Modeling, ER 2009, Gramado, Brazil, 9-12 November, pages 280-293.

• Beneventano, D., Bergamaschi, S., and Sorrentino, S. (2009). Extending WordNet with compound nouns for semi-automatic annotation in data integration systems. In Proceedings of the IEEE NLP-KE Conference, Dalian, China, 24-27 September 2009.

• Bergamaschi, S., Po, L., Sorrentino, S., and Corni, A. (2009). Dealing with Uncertainty in Lexical Annotation. Revista de Informática Teórica e Aplicada, RITA, ER 2009 Poster and Demonstrations Session, 16(2):93-96.


• Beneventano, D., Orsini, M., Po, L., Sala, A., and Sorrentino, S. (2009). An ontology-based data integration system for data and multimedia sources. In Proceedings of the Third International Conference on Semantic Computing, IEEE ICSC 2009, Berkeley, CA, USA, September 14-16, pages 606-611. IEEE Computer Society.

• Beneventano, D., Orsini, M., Po, L., and Sorrentino, S. (2009). The MOMIS-STASIS approach for Ontology-Based Data Integration. In Proceedings of the 1st International Workshop on Interoperability through Semantic Data and Service Integration, ISDSI 2009, Camogli (GE), Italy, June 25.

• Po, L., Sorrentino, S., Bergamaschi, S., and Beneventano, D. (2009). Lexical knowledge extraction: an effective approach to schema and ontology matching. Proceedings of the European Conference on Knowledge Management, ECKM 2009, 3-4 September, Vicenza.

• Bergamaschi, S., Po, L., Sala, A., and Sorrentino, S. (2007). Data source annotation in data integration systems. In Proceedings of the Fifth International Workshop on Databases, Information Systems and Peer-to-Peer Computing, DBISP2P, at the 33rd International Conference on Very Large Data Bases (VLDB 2007), University of Vienna, Austria, September 24.

• Bergamaschi, S., Po, L., and Sorrentino, S. (2007). Automatic Annotation in Data Integration Systems. In Proceedings of the OTM Workshops, Portugal, November 27-28.


National Conferences:

• S. Bergamaschi, L. Po, S. Sorrentino, A. Corni, "Uncertainty in data integration systems: automatic generation of probabilistic relationships", VI Conference of the Italian Chapter of AIS, ITAIS 2009, Costa Smeralda, Italy, October 2-3 2009.

• S. Bergamaschi, S. Sorrentino, "Semi-automatic compound nouns annotation for data integration systems", Proceedings of the 17th Italian Symposium on Advanced Database Systems, SEBD 2009, Camogli (Genova), Italy, 21-24 June 2009.

• S. Bergamaschi, L. Po, and S. Sorrentino, "Automatic annotation for mapping discovery in data integration systems", Proceedings of the Sixteenth Italian Symposium on Advanced Database Systems, SEBD 2008, Mondello (Palermo), Italy, 22-25 June 2008 (pp 334-341).

Book Chapters:

• Bergamaschi, S., Beneventano, D., Po, L., Sorrentino, S. (2011). Automatic Schema Mapping through Normalization and Annotation. In Press, in Second Search Computing Workshop: Challenges and Directions, 2010, LNCS State-of-the-Art Survey.

• Bergamaschi, S., Po, L., Sorrentino, S., Corni, A. “Uncertainty in data integration systems: automatic generation of probabilistic relationships”, to appear in Management of the Interconnected World (A. D’Atri, M. De Marco, A. Maria Braccini, F. Cariddu eds.), Springer, ISBN/ISSN: 978-3-7908-2403-2, 2010.


Projects

• NeP4B - Networked Peers for Business, MIUR funded research project, FIRB 2005 (2006-2009) (http://www.dbgroup.unimo.it/nep4b)

• STASIS - SofTware for Ambient Semantic Interoperable Services - Project FP6-2005-IST-5-034980 (2006-2008) (http://www.dbgroup.unimo.it/stasis/)

• “Searching for a needle in mountains of data!” project funded by the Fondazione Cassa di Risparmio di Modena within the Bando di Ricerca Internazionale (2008-2010) (http://www.dbgroup.unimo.it/keymantic)


Thanks for your attention!


Evaluation Measures

TP: True Positive, FP: False Positive, FN: False Negative, TN: True Negative

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-Measure = 2 · Precision · Recall / (Precision + Recall)
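For reference, the standard definitions of the three measures in terms of TP, FP and FN can be written as a few small functions. The function names are arbitrary; the last line recomputes one F-Measure from the precision and recall values reported for CWSD in the experimental evaluation.

```python
def precision(tp, fp):
    """Fraction of returned items that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of correct items that are returned."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Recomputing the CWSD row (precision 0.74, recall 0.74) gives 0.74 again.
cwsd_f = round(f_measure(0.74, 0.74), 2)
```

The same function reproduces the other reported rows, for example precision 0.97 and recall 0.08 for SD round to an F-Measure of 0.15.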


Unique beginners

• The top level concepts of the WordNet hierarchy are the 25 unique beginners (e.g., act, animal, artifact etc.) for WordNet English nouns defined in [Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K., WordNet: An on-line lexical database. International Journal of Lexicography, 1990]


Levi’s relationships set

[Table: Levi’s relationships set, where M = Modifier and H = Head]

[Levi, J. N., The Syntax and Semantics of Complex Nominals. Academic Press, 1978]


Dempster-Shafer theory

The Dempster-Shafer theory is a mathematical theory of evidence. It allows evidence from different sources to be combined: using this theory, for each algorithm we assign a probability mass function m(·) to the set of all possible meanings for the term under consideration

• The mass functions of the WSD algorithms are combined by using Dempster’s rule of combination

• In the end, to obtain the probability assigned to each meaning, the belief mass function concerning a set of meanings is split
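The two bullets above can be made concrete with a short sketch of Dempster's rule of combination. Several modelling choices here are assumptions rather than statements from the slides: each algorithm puts its confidence on the set of senses it selects and the remainder on the whole frame, and the final mass of a set is split evenly among its members. All names are illustrative.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule: multiply masses, keep intersections, renormalize conflict."""
    combined = {}
    conflict = 0.0
    for (set_a, mass_a), (set_b, mass_b) in product(m1.items(), m2.items()):
        intersection = set_a & set_b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {senses: mass / (1.0 - conflict) for senses, mass in combined.items()}

def split_evenly(masses):
    """Assumed splitting step: share the mass of each set equally among its senses."""
    probabilities = {}
    for senses, mass in masses.items():
        for sense in senses:
            probabilities[sense] = probabilities.get(sense, 0.0) + mass / len(senses)
    return probabilities

FRAME = frozenset({"book#1", "book#2", "book#3"})
# Algorithm 1 (70% confidence) selects book#1; the rest goes to the whole frame.
m1 = {frozenset({"book#1"}): 0.7, FRAME: 0.3}
# Algorithm 2 (60% confidence) selects book#1 and book#3.
m2 = {frozenset({"book#1", "book#3"}): 0.6, FRAME: 0.4}

probabilities = split_evenly(combine(m1, m2))
```

Assigning the unused confidence to the whole frame models ignorance rather than disagreement, so two algorithms that both support a sense reinforce each other.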
