Angelo Augusto Frozza, Ronaldo dos Santos Mello {frozza, ronaldo}@inf.ufsc.br A Method for Defining...

Angelo Augusto Frozza, Ronaldo dos Santos Mello{frozza, ronaldo}@inf.ufsc.br

A Method for Defining Semantic Similarities between GML Schemas

Angelo Augusto Frozza – UFSC / UNIPLAC
Ronaldo dos Santos Mello – UFSC

GBD UFSC: Data Base Group of Santa Catarina Federal University

Summary

1. Introduction

2. Method overview

3. Preprocessing

4. Definition of the similarity score

5. Mapping catalog

6. Conclusion

Motivation

• GIS have been extensively used by several kinds of organizations

• Organizations may need to interchange geographic data
– Problem: data heterogeneity
  • the same geographic entity may have different representations in different organizations
– Solutions for supporting geographic data interoperability among autonomous and heterogeneous sources are required

Motivation

• Information interchange among GIS must resolve heterogeneities at two levels:
– syntactic
– semantic

• Syntactic level -> schema heterogeneity
– requires conversion of export and import formats
– does not ensure that the data have any meaning to new users

• Semantic level
– do two geographic entities represent the same real-world fact?

Trends

• Current solutions for syntactic and semantic interoperability among GIS are based on the use of standards and ontologies

• Main initiatives
– Geography Markup Language (GML)
– Web Ontology Language (OWL)

Proposal

• A method for semi-automated determination of semantic similarities between elements of distinct GML schemas
– uses an ontology as the basis for common knowledge
– may involve expert user intervention

• Contributions
– Support for the development of GIS that require semantic interoperability
– A solution applied to recent technologies for representing geographic data and ontologies
  • GML and OWL
– The method is applied to the urban registration domain
  • Little explored in related work
  • Domain with large potential for practical applications
– The method focuses on the integration of small, non-interconnected data sources

The Proposed Method

[Figure: overview of the method in three stages. Input: the OWL domain ontology, the GML’ schema, and the GML’’ schemas, each read through a wrapper. Processing (at the GML’ schema home): similarity definition and mapping definition, marked (a) and (b). Output.]

Data Preprocessing

• A wrapper is used to convert the ontology and the GML schemas into a canonical (tree) structure

OWL tree
– Complex element: O1 = “Parcel”
– Attributes: O2 = “address” (string); O3 = “BlockNumber” (integer)
– Relationships: O4 = “isPart” (“Block”, atomic); O5 = “hasRepresentation” (“geographicRepresentation”, multivalued)

GML tree
– Complex element: G1 = “ParcelArea”
– Attributes: G2 = “address” (string); G3 = “Block” (integer)
– Relationships: G4 = “isPart” (“BlockMTR”, atomic)
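As a rough illustration of the canonical tree, the two example structures could be held in a small node type. This is only a sketch: the class name `Node` and its field names are hypothetical, and the slides do not say how the wrapper represents nodes internally.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of the canonical tree (hypothetical representation)."""
    name: str
    kind: str                          # "complex", "attribute" or "relationship"
    datatype: Optional[str] = None     # attributes only
    target: Optional[str] = None       # relationships only: the related concept
    cardinality: Optional[str] = None  # relationships only: "atomic" or "multivalued"
    children: List["Node"] = field(default_factory=list)


# O1..O5 from the example
owl_parcel = Node("Parcel", "complex", children=[
    Node("address", "attribute", datatype="string"),
    Node("BlockNumber", "attribute", datatype="integer"),
    Node("isPart", "relationship", target="Block", cardinality="atomic"),
    Node("hasRepresentation", "relationship",
         target="geographicRepresentation", cardinality="multivalued"),
])

# G1..G4 from the example
gml_parcel = Node("ParcelArea", "complex", children=[
    Node("address", "attribute", datatype="string"),
    Node("Block", "attribute", datatype="integer"),
    Node("isPart", "relationship", target="BlockMTR", cardinality="atomic"),
])
```

Keeping attributes and relationships as children of the complex element mirrors the three levels named in the figure legend.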

Similarity Score Definition

• Types of conflicts considered:
– Nomenclature
  • Synonyms
  • Homonyms
– Composition
  • Structure (properties)
– Relationships
  • Generalization/Specialization
  • Association

Similarity Score Definition

• We adapt the metrics proposed by Dorneles et al. (2004):
– Metrics for Complex Values (MCV)
  • applied to data structures (complex elements)
– Metrics for Atomic Values (MAV)
  • applied to simple data (strings, dates, …)
  • application domain dependent

• This metric set refers to a taxonomy appropriate to XML data handling.

Similarity Score Definition

• Each GML schema tree node is tested against each ontology tree node

1. A node name is initially tested for equality against a table of synonyms:

SYNONYM    | CLASS  | LANGUAGE
Lote       | Parcel | pt
Parcel     | Parcel | en
Quadra     | Block  | pt
Quarteirão | Block  | pt
Block      | Block  | en
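The equality test of step 1 can be sketched as a lookup over the synonym rows shown above. The function name `classes_for` is hypothetical, and the case-insensitive comparison is an assumption; the slides only say the name is tested for equality.

```python
# Rows of the synonym table from the slide: (language, class, synonym).
SYNONYMS = [
    ("pt", "Parcel", "Lote"),
    ("en", "Parcel", "Parcel"),
    ("pt", "Block", "Quadra"),
    ("pt", "Block", "Quarteirão"),
    ("en", "Block", "Block"),
]


def classes_for(name: str) -> list:
    """Return the classes whose synonym equals `name` (case-insensitive by assumption)."""
    key = name.casefold()
    return sorted({cls for _lang, cls, syn in SYNONYMS if syn.casefold() == key})
```

A name such as "BlockMTR" yields an empty list here, which is the case handled by the name similarity search of step 3.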

Similarity Score Definition

2. If one or more corresponding synonyms are found, a structure similarity metric is applied on each positive result

[Figure: the OWL tree (O1 with children O2 to O5) and the GML tree (G1 with children G2 to G4); the roots O1 and G1 are both labeled Parcel.]

Similarity Score Definition

3. If no corresponding synonym is found, a new search is done on the synonym table, applying a name similarity metric

Example: “BlockMTR” ≈ “Block”
Chosen metric: Jaro Winkler
– It extends the Jaro metric
– It prevents strings that differ only at the end from having a large distance between them
– It considers the concept of prefix

Similarity Score Definition

4. If the name similarity score is acceptable, the structure similarity metric is applied on each result

The pair with the highest similarity score is chosen

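Structure Similarity Metric

$$\mathrm{tupleSim}(\varepsilon_p, \varepsilon_d) = \frac{\sum_{i=1}^{n} \max_{1 \le j \le m} \mathrm{sim}(\varepsilon_p^{i}, \varepsilon_d^{j})}{\max(m, n)}$$

– p : set of element nodes from the GML schema tree
– d : set of class nodes from the ontology tree
– εp : a node in set p
– εd : a node in set d
– εp^i, εd^j : the i-th child of εp and the j-th child of εd
– n and m : number of children of εp and εd, respectively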
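A minimal sketch of the structure metric, assuming the pairwise child scores are supplied by a function `sim`. The function name `tuple_sim`, the handling of childless nodes, and the zero score for pairs not listed in `SCORES` are assumptions; the three non-zero scores are the ones given in the worked example of the slides.

```python
from typing import Callable, Sequence


def tuple_sim(
    children_p: Sequence[str],
    children_d: Sequence[str],
    sim: Callable[[str, str], float],
) -> float:
    """Sum, over the children of the GML node, of the best score against any
    child of the ontology node, divided by the larger number of children."""
    n, m = len(children_p), len(children_d)
    if n == 0 or m == 0:
        return 0.0  # assumption: nothing to compare means no structural similarity
    total = sum(max(sim(cp, cd) for cd in children_d) for cp in children_p)
    return total / max(m, n)


# Pairwise scores from the worked example; every other pair is assumed to score 0.
SCORES = {("G2", "O2"): 1.0, ("G3", "O3"): 0.95, ("G4", "O4"): 0.98}


def example_sim(p: str, d: str) -> float:
    return SCORES.get((p, d), 0.0)


result = tuple_sim(["G2", "G3", "G4"], ["O2", "O3", "O4", "O5"], example_sim)
```

Dividing by the larger child count penalizes the unmatched ontology child O5, which is why three strong matches still give a score of only about 0.73.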

[Figure: the OWL and GML example trees (complex elements O1 and G1 with their attribute and relationship children), with the two compared nodes labeled εp and εd.]

Simple Attribute Metric

• This metric is composed of
– the Jaro Winkler metric for names
– data type compatibility analysis

$$\mathrm{attrSim}(\varepsilon_p, \varepsilon_d) = \frac{\mathrm{nameSim}(\varepsilon_p, \varepsilon_d) + \mathrm{typeSim}(\varepsilon_p, \varepsilon_d)}{2}$$

• nameSim: attribute name similarity
• typeSim: data type similarity
• names and data types have different weights
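The attribute metric can be sketched as the plain average in the formula. The slides do not define how data type compatibility is scored, so `type_sim` below is purely hypothetical (1 for identical types, 0.5 for two different numeric types, 0 otherwise), and no extra weighting is applied because the weights are not given.

```python
NUMERIC_TYPES = {"integer", "decimal", "float", "double"}  # hypothetical grouping


def type_sim(type_p: str, type_d: str) -> float:
    """Hypothetical data type compatibility score."""
    if type_p == type_d:
        return 1.0
    if type_p in NUMERIC_TYPES and type_d in NUMERIC_TYPES:
        return 0.5
    return 0.0


def attr_sim(name_score: float, type_score: float) -> float:
    """Average of name similarity and type similarity, as in the formula."""
    return (name_score + type_score) / 2


# address (string) against address (string): identical name and type.
same_attribute = attr_sim(1.0, type_sim("string", "string"))
```

The name score is passed in already computed, so any name metric, including Jaro Winkler, can be used upstream.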

Jaro Winkler Metric

JaroWinklerScore(s,t) = JaroScore(s,t) + (prefixLength * PREFIXSCALE * (1 - JaroScore(s,t)))

– prefixLength: the length of the common prefix at the start of the two strings

– PREFIXSCALE: a constant scaling factor for how much the score is adjusted upwards for having common prefixes

Examples:
• Block ≈ BlockMTR: 0.875 + (0.5 * 0.125) = 0.937
• ParcelCTM ≈ ParcelTaxable: 0.820 + (0.6 * 0.179) = 0.927
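The formula can be tried out with a small, self-contained implementation. This is a sketch under stated assumptions: `prefix_scale=0.1` and `max_prefix=6` are not given in the slides and were chosen because they reproduce the Block / BlockMTR example, where a common prefix of five characters contributes 0.5 (Winkler's original formulation caps the prefix at four characters). The comparison is case-sensitive.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity between two strings (case-sensitive)."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half-transpositions.
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, prefix_scale: float = 0.1, max_prefix: int = 6) -> float:
    """JaroScore + prefixLength * PREFIXSCALE * (1 - JaroScore)."""
    score = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)
```

For "Block" and "BlockMTR" this gives a Jaro score of 0.875 and a Jaro Winkler score of 0.9375, in line with the first example.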

Relationship Metric

• This metric is composed of
– the Jaro Winkler metric for names
– concept similarity
– cardinality constraint analysis

$$\mathrm{relSim}(\varepsilon_p, \varepsilon_d) = \frac{\mathrm{nameSim}(\varepsilon_p, \varepsilon_d) + \mathrm{concSim}(\varepsilon_p, \varepsilon_d) + \mathrm{cardSim}(\varepsilon_p, \varepsilon_d)}{3}$$

• nameSim: relationship name similarity
• concSim: concept similarity
• cardSim: cardinality similarity

• The components of the formula have different weights
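A sketch of the relationship metric as the plain average of its three components. How concept and cardinality similarity are computed is not stated in the slides, so two assumptions are made here: `card_sim` is a hypothetical all-or-nothing comparison, and the concept score for "Block" against "BlockMTR" reuses the Jaro Winkler value from the Jaro Winkler example.

```python
def card_sim(card_p: str, card_d: str) -> float:
    """Hypothetical cardinality score: 1 when the constraints agree, else 0."""
    return 1.0 if card_p == card_d else 0.0


def rel_sim(name_score: float, conc_score: float, card_score: float) -> float:
    """Average of name, concept and cardinality similarity, as in the formula."""
    return (name_score + conc_score + card_score) / 3


# isPart ("BlockMTR", atomic) against isPart ("Block", atomic):
# identical names, concept score assumed to be 0.9375, identical cardinality.
score = rel_sim(1.0, 0.9375, card_sim("atomic", "atomic"))
```

Under these assumptions the score is about 0.979, which rounds to the 0.98 reported for relSim(G4, O4) in the worked example.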

Example of Similarity Definition

– sim2 = attrSim (G2, O2) = 1 [address ≈ address]
– sim3 = attrSim (G3, O3) = 0.95 [BlockNumber ≈ Block]
– sim4 = relSim (G4, O4) = 0.98 [isPart ≈ isPart]

G1 has n = 3 children and O1 has m = 4 children, so max(m, n) = 4:

– tupleSim (G1, O1) = (sim2 + sim3 + sim4) / 4
– tupleSim (G1, O1) = (1 + 0.95 + 0.98) / 4 = 0.73

[Figure: the OWL and GML example trees, with the compared attribute and relationship nodes highlighted.]

Mapping Catalog

• The catalog is composed of two sets of tables

1. Information about the imported GML schemas (metadata)

2. Schema mappings
i. Each element in the main GML schema may have an equivalent concept in the ontology
ii. Elements of the GML’’ schemas, and their similarities, are related to the concepts from the main GML schema and the ontology

Mapping Catalog - Example

OWL x GML’

ID | Ontology Concepts | Main GML       | Main GML similarity
1  | Block             | Block          | 1
2  | Parcel            | ParcelDesigned | 0.92

GML’ x GML’’

GML’ (ID) | Imported GML’’ | GML’’ similarity
1         | BlockMTR       | 0.94
2         | ParcelMTR      | 0.96
1         | BlockDesigned  | 0.90
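Reading the example as two tables joined on the ID column, a lookup from an imported GML’’ element to its ontology concept could look like the sketch below. The dictionary layout, key names, and the function `resolve` are hypothetical; only the values come from the example.

```python
# OWL x GML': main GML elements and their ontology concepts, keyed by ID.
ONTOLOGY_MAP = {
    1: {"concept": "Block", "main_gml": "Block", "similarity": 1.0},
    2: {"concept": "Parcel", "main_gml": "ParcelDesigned", "similarity": 0.92},
}

# GML' x GML'': imported elements, each pointing at a main GML ID.
IMPORTED_MAP = [
    {"gml_id": 1, "imported": "BlockMTR", "similarity": 0.94},
    {"gml_id": 2, "imported": "ParcelMTR", "similarity": 0.96},
    {"gml_id": 1, "imported": "BlockDesigned", "similarity": 0.90},
]


def resolve(imported_name: str):
    """Return (main GML element, ontology concept) for an imported element, or None."""
    for row in IMPORTED_MAP:
        if row["imported"] == imported_name:
            main = ONTOLOGY_MAP[row["gml_id"]]
            return main["main_gml"], main["concept"]
    return None
```

For instance, `resolve("ParcelMTR")` follows ID 2 to the main element ParcelDesigned and the concept Parcel.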

Conclusion

• Our work is based on the following assumptions:

– Geographic data interchange happens mainly among domains with some affinity

– Geographic data are better defined semantically on a specific domain than through domain generalization

• In this context, we expect that our method is useful as part of a system for GIS data integration

Main Contribution

• This work proposes a solution for the problem of semantic interoperability among GML schemas within the domain of urban registration

• Method characteristics

– an ontology that represents the domain knowledge

– semi-automated equivalence determination

Related Work

• Related work focuses on translating queries executed on closely interconnected heterogeneous environments

• This work focuses on data integration in environments that are not necessarily interconnected

• This research includes a scenario where:

– small municipalities, individually, have no means to keep complex systems

– geographic data are spread over many institutions

• On the other hand, as a consortium, they could promote data interchange through a mechanism that would identify the similarity among them

Future Work

• To define and execute experiments to validate and improve the method

• To increase the scope of the domain

• To extend the method to be applied to other domains

– To consider other ontologies

• To provide the integration of GML instances

• To specify an environment for distributed geographic data queries based on the mappings

A Method for Defining Semantic Similarities between GML Schemas

Thanks!

Angelo Augusto Frozza, Ronaldo dos Santos Mello
{frozza, ronaldo}@inf.ufsc.br

GBD UFSC: Data Base Group of Santa Catarina Federal University

Application: Urban Register

• Ontology

Application: Urban Register

• GML schema

[Figure: class diagram of the PARCEL part of the urban register GML schema. Element labels recoverable from the diagram: Parcel, Parcel_Area, Parcel_MTR, Parcel_MTR_Actual, Parcel_MTR_Designed, Parcel_Legal, Parcel_Taxable, Block_MTR, Address, Postal Code, Cemetery, Limit_Parcel, Wall, Fence, Parcel_Front, Front_Parcel, Front_Main, Front_Secondary, Compatibilization, Ocupation_Real state. Relationship labels: Over, Inside, Contains, Has, Belong, Forma. Multiplicities shown: 1, 0..1, 0..*, 1..*.]