Angelo Augusto Frozza, Ronaldo dos Santos Mello{frozza, ronaldo}@inf.ufsc.br
A Method for Defining Semantic Similarities between GML Schemas
Angelo Augusto Frozza – UFSC / UNIPLAC Ronaldo dos Santos Mello - UFSC
GBD UFSC - Data Base Group of Santa Catarina Federal University
Summary
1. Introduction
2. Method overview
3. Preprocessing
4. Definition of the similarity score
5. Mapping catalog
6. Conclusion
Motivation
• GIS are used extensively by many kinds of organizations
• Organizations may need to interchange geographic data
  – Problem: data heterogeneity
    • the same geographic entity may have different representations in different organizations
  – Solutions are required to support geographic data interoperability among autonomous and heterogeneous sources
Motivation
• Information interchange among GIS must resolve heterogeneities at two levels:
  – syntactic
  – semantic
• Syntactic level -> schema heterogeneity
  – requires conversion of export and import formats
  – does not ensure that the data have any meaning to new users
• Semantic level
  – do two geographic entities represent the same real-world fact?
Trends
• Current solutions for syntactic and semantic interoperability among GIS are based on standards and ontologies
• Main initiatives
  – Geography Markup Language (GML)
  – Web Ontology Language (OWL)
Proposal
• A method for semi-automated determination of semantic similarities between elements of distinct GML schemas
  – uses an ontology as the basis for common knowledge
  – may involve intervention by an expert user
• Contributions
  – Support for the development of GIS that require semantic interoperability
  – A solution applied to recent technologies for representing geographic data and ontologies
    • GML and OWL
  – The method is applied to the urban registration domain
    • Little explored in related work
    • Domain with large potential for practical applications
  – The method focuses on the integration of small, non-interconnected data sources
The Proposed Method
[Figure: overview of the method. Stages: Input, Processing (on GML’ schema home), Output. Elements: domain ontology (OWL), GML’ schema, GML’’ schemas, one wrapper per source, similarity definition, mapping definition. Flows are marked (a) and (b).]
Data Preprocessing
• A wrapper is used to convert the ontology and the GML schemas into a canonical (tree) structure

OWL tree
  O1 = “Parcel” [complex element]
  O2 = “address” (string) [attribute]
  O3 = “BlockNumber” (integer) [attribute]
  O4 = “isPart” (“Block”, atomic) [relationship]
  O5 = “hasRepresentation” (“geographicRepresentation”, multivalued) [relationship]

GML tree
  G1 = “ParcelArea” [complex element]
  G2 = “address” (string) [attribute]
  G3 = “Block” (integer) [attribute]
  G4 = “isPart” (“BlockMTR”, atomic) [relationship]
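One way to picture the canonical tree is as a small set of record types, one per node kind. The sketch below is only an illustration: the class and field names (`ComplexElement`, `Attribute`, `Relationship`, `cardinality`) are assumptions, not names used by the authors or by any wrapper implementation.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Attribute:
    name: str
    data_type: str          # e.g. "string", "integer"


@dataclass
class Relationship:
    name: str
    target: str             # name of the related concept or element
    cardinality: str        # "atomic" or "multivalued", as on the slide


@dataclass
class ComplexElement:
    name: str
    attributes: List[Attribute] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)

    def children(self) -> List[Union[Attribute, Relationship]]:
        """All direct children of the node, attributes first."""
        return [*self.attributes, *self.relationships]


# The two trees from the slide (O1..O5 and G1..G4).
owl_parcel = ComplexElement(
    "Parcel",
    attributes=[Attribute("address", "string"),
                Attribute("BlockNumber", "integer")],
    relationships=[
        Relationship("isPart", "Block", "atomic"),
        Relationship("hasRepresentation", "geographicRepresentation",
                     "multivalued"),
    ],
)
gml_parcel = ComplexElement(
    "ParcelArea",
    attributes=[Attribute("address", "string"),
                Attribute("Block", "integer")],
    relationships=[Relationship("isPart", "BlockMTR", "atomic")],
)

print(len(owl_parcel.children()), len(gml_parcel.children()))  # 4 3
```

Keeping attributes and relationships in separate lists makes it easy to apply a different metric to each kind of child later on.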
Similarity Score Definition
• Types of conflicts considered:
  – Nomenclature
    • Synonyms
    • Homonyms
  – Composition
    • Structure (properties)
  – Relationships
    • Generalization/Specialization
    • Association
Similarity Score Definition
• We adapt the metrics proposed by Dorneles et al. (2004):
  – Metrics for Complex Values (MCV)
    • applied to data structures (complex elements)
  – Metrics for Atomic Values (MAV)
    • applied to simple data (strings, dates, …)
    • dependent on the application domain
• This set of metrics follows a taxonomy suited to XML data handling.
Similarity Score Definition
• Each GML schema tree node is tested against each ontology tree node
1. A node name is initially tested for equality against a table of synonyms:

LANGUAGE  CLASS   SYNONYM
en        Block   Block
pt        Block   Quarteirão
pt        Block   Quadra
en        Parcel  Parcel
pt        Parcel  Lote
Similarity Score Definition
2. If one or more corresponding synonyms are found, a structure similarity metric is applied to each positive result
[Figure: the OWL and GML trees, with the root nodes O1 and G1 both labelled Parcel]
Similarity Score Definition
3. If no corresponding synonym is found, a new search is done on the synonym table, applying a name similarity metric
  Example: “BlockMTR” ≈ “Block”
  Chosen metric: Jaro-Winkler
    • It extends the Jaro metric
    • It prevents strings that differ only at the end from having a large distance between them
    • It takes the common prefix of the two strings into account
Similarity Score Definition
4. If the similarity score is acceptable, the structure similarity metric is applied to each result
• The pair with the highest similarity score is chosen
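Steps 1 to 4 can be read as a small candidate-selection routine. The sketch below is a rough illustration under several assumptions: the function names, the `threshold` of 0.7 for an "acceptable" score, and the use of `difflib.SequenceMatcher` as a stand-in name metric (it is not Jaro-Winkler) are all hypothetical, and the structure similarity is passed in as a callable rather than computed.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional, Tuple

# Synonym table from the slides, reduced to synonym -> ontology class.
SYNONYMS = {
    "Block": "Block", "Quarteirão": "Block", "Quadra": "Block",
    "Parcel": "Parcel", "Lote": "Parcel",
}


def stand_in_name_sim(a: str, b: str) -> float:
    """Stand-in name metric (difflib ratio), NOT the Jaro-Winkler metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_node(
    name: str,
    struct_sim: Callable[[str, str], float],
    name_sim: Callable[[str, str], float] = stand_in_name_sim,
    threshold: float = 0.7,  # assumed cut-off for an "acceptable" score
) -> Optional[Tuple[str, float]]:
    # Step 1: exact lookup in the synonym table.
    candidates = {cls for syn, cls in SYNONYMS.items() if syn == name}
    # Step 3: no exact hit, so search again with a name similarity metric.
    if not candidates:
        candidates = {cls for syn, cls in SYNONYMS.items()
                      if name_sim(name, syn) >= threshold}
    if not candidates:
        return None  # left for the expert user to resolve
    # Steps 2 and 4: apply the structure metric to every candidate
    # and keep the pair with the highest score.
    best_score, best_cls = max((struct_sim(name, cls), cls)
                               for cls in candidates)
    return best_cls, best_score


# Hypothetical structure scores, only to exercise the routine.
def fake_struct(gml_name: str, cls: str) -> float:
    return {"Block": 0.9, "Parcel": 0.4}[cls]


print(match_node("Lote", fake_struct))      # ('Parcel', 0.4)
print(match_node("BlockMTR", fake_struct))  # ('Block', 0.9)
print(match_node("Road", fake_struct))      # None
```

Returning `None` when nothing passes the threshold leaves room for the expert-user intervention that the method allows for.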
Structure Similarity Metric
– εp : a node in set p
– εd : a node in set d
– p : set of element nodes from the GML schema tree
– d : set of class nodes from the ontology tree
– εp^i, εd^j : children of εp and εd that are compared with each other
– n and m : number of children of εp and εd, respectively

$$\mathrm{tupleSim}(\varepsilon_p, \varepsilon_d) = \frac{\sum_{\varepsilon_p^i,\ \varepsilon_d^j} \mathrm{sim}(\varepsilon_p^i, \varepsilon_d^j)}{\max(m, n)}$$
Similarity Score Definition
[Figure: the tupleSim formula next to the OWL and GML trees (complex elements O1, G1; attributes O2, O3, G2, G3; relationships O4, O5, G4), with the compared nodes marked εp and εd]
Simple Attribute Metric
• This metric is composed of
  – the Jaro-Winkler metric for names
  – data type compatibility analysis
• nameSim – attribute name similarity
• typeSim – data type similarity
• names and data types have different weights

$$\mathrm{attrSim}(\varepsilon_p, \varepsilon_d) = \frac{\mathrm{nameSim}(\varepsilon_p, \varepsilon_d) + \mathrm{typeSim}(\varepsilon_p, \varepsilon_d)}{2}$$
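As a minimal sketch, the attribute metric can be written as a weighted mean of the name score and a type-compatibility score. The slides do not say how typeSim is computed or what the weights are, so the `TYPE_COMPAT` table, the `type_sim` rule and the weight parameters below are assumptions for illustration; with the default equal weights the function reduces to the plain average of the two components.

```python
# Hypothetical compatibility scores for pairs of different data types.
# Identical types score 1.0; unlisted pairs score 0.0.
TYPE_COMPAT = {
    frozenset({"integer", "decimal"}): 0.5,
}


def type_sim(type_p: str, type_d: str) -> float:
    """Assumed data type compatibility score in [0, 1]."""
    if type_p == type_d:
        return 1.0
    return TYPE_COMPAT.get(frozenset({type_p, type_d}), 0.0)


def attr_sim(name_sim: float, type_p: str, type_d: str,
             w_name: float = 1.0, w_type: float = 1.0) -> float:
    """Weighted mean of name similarity and type similarity."""
    total = w_name + w_type
    return (w_name * name_sim + w_type * type_sim(type_p, type_d)) / total


# Identical name and type.
print(attr_sim(1.0, "string", "string"))               # 1.0
# Illustrative name score of 0.909 with matching integer types.
print(round(attr_sim(0.909, "integer", "integer"), 2))  # 0.95
```

Passing the name score in as a number keeps the sketch independent of the string metric that produces it.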
Jaro-Winkler Metric
JaroWinklerScore(s,t) = JaroScore(s,t) + (prefixLength * PREFIXSCALE * (1 - JaroScore(s,t)))
  – prefixLength: the length of the common prefix at the start of the two strings
  – PREFIXSCALE: a constant scaling factor that controls how much the score is adjusted upwards for a common prefix
Examples:
• Block ≈ BlockMTR: 0.875 + (0.5 * 0.125) = 0.937
• ParcelCTM ≈ ParcelTaxable: 0.820 + (0.6 * 0.179) = 0.927
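The formula above can be sketched in plain Python as follows. Two points are assumptions rather than facts from the slides: `PREFIXSCALE` is taken as 0.1, and the prefix length is left uncapped by default (many implementations cap it at 4 characters). Both are inferred from the Block / BlockMTR example, where prefixLength * PREFIXSCALE is 0.5 for a five-character prefix. The `max_prefix` parameter and the final clamp to 1.0 are additions of this sketch.

```python
from typing import Optional


def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters match if equal and no further apart than this window.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, prefix_scale: float = 0.1,
                 max_prefix: Optional[int] = None) -> float:
    """Jaro score boosted by the length of the common prefix."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b:
            break
        prefix += 1
    if max_prefix is not None:
        prefix = min(prefix, max_prefix)
    # Clamp, because an uncapped prefix could push the score above 1.
    return min(1.0, base + prefix * prefix_scale * (1 - base))


print(jaro("Block", "BlockMTR"))                        # close to 0.875
print(jaro_winkler("Block", "BlockMTR"))                # close to 0.9375
print(jaro_winkler("Block", "BlockMTR", max_prefix=4))  # close to 0.925
```

With the uncapped prefix the sketch agrees with the Block / BlockMTR example up to rounding; with the conventional cap of 4 the score is slightly lower.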
Relationship Metric
• This metric is composed of
  – the Jaro-Winkler metric for names
  – concept similarity
  – cardinality constraint analysis
• nameSim – relationship name similarity
• concSim – concept similarity
• cardSim – cardinality similarity
• The components of the formula have different weights

$$\mathrm{relSim}(\varepsilon_p, \varepsilon_d) = \frac{\mathrm{nameSim}(\varepsilon_p, \varepsilon_d) + \mathrm{concSim}(\varepsilon_p, \varepsilon_d) + \mathrm{cardSim}(\varepsilon_p, \varepsilon_d)}{3}$$
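The relationship metric has the same shape, with three components instead of two. In the sketch below the `card_sim` rule (1.0 when both cardinalities agree, 0.0 otherwise) and the `weights` parameter are assumptions, since the slides give neither the cardinality rule nor the weight values. The concept score of 0.9375 used in the example is an illustrative input for comparing the target concepts Block and BlockMTR.

```python
from typing import Tuple


def card_sim(card_p: str, card_d: str) -> float:
    """Assumed cardinality check: 1.0 if both sides agree, else 0.0."""
    return 1.0 if card_p == card_d else 0.0


def rel_sim(name_sim: float, conc_sim: float, card_p: str, card_d: str,
            weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted mean of name, concept and cardinality similarity."""
    w_name, w_conc, w_card = weights
    total = w_name + w_conc + w_card
    return (w_name * name_sim
            + w_conc * conc_sim
            + w_card * card_sim(card_p, card_d)) / total


# isPart vs isPart, targets scored 0.9375, both relationships atomic.
score = rel_sim(1.0, 0.9375, "atomic", "atomic")
print(round(score, 2))  # 0.98
```

With equal weights the function is the plain average of the three components.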
Example of Similarity Definition
– sim2 = attrSim (G2, O2) = 1 [address ≈ address]
– sim3 = attrSim (G3, O3) = 0.95 [BlockNumber ≈ Block]
– sim4 = relSim (G4, O4) = 0.98 [isPart ≈ isPart]
– tupleSim(G1, O1) = (sim2 + sim3 + sim4) / max(n, m), with n = 3 children of G1 and m = 4 children of O1
– tupleSim(G1, O1) = (1 + 0.95 + 0.98) / 4 = 0.73
[Figure: the OWL and GML trees (complex elements O1, G1; attributes O2, O3, G2, G3; relationships O4, O5, G4), with the compared pairs highlighted step by step]
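The arithmetic of this example can be checked with a few lines of Python. The function below is a sketch that takes the scores of the already matched child pairs as input; how the pairs are selected is outside its scope, and the name `tuple_sim` and its signature are chosen here for illustration.

```python
from typing import Sequence


def tuple_sim(pair_scores: Sequence[float],
              n_children_p: int, n_children_d: int) -> float:
    """Sum of matched-child scores divided by the larger child count."""
    denominator = max(n_children_p, n_children_d)
    if denominator == 0:
        return 0.0  # two leaf nodes: nothing to compare structurally
    return sum(pair_scores) / denominator


# sim2, sim3 and sim4 from the example; G1 has 3 children, O1 has 4.
score = tuple_sim([1.0, 0.95, 0.98], n_children_p=3, n_children_d=4)
print(round(score, 2))  # 0.73
```

Dividing by the larger child count penalizes O5 (hasRepresentation), which has no counterpart in the GML tree.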
Mapping Catalog
• The catalog is composed of two sets of tables
  1. Information about the imported GML schemas (metadata)
  2. Schema mappings
    i. Each element of the main GML schema may have an equivalent concept in the ontology
    ii. Elements of the GML’’ schemas, with their similarities, are related to the concepts of the main GML schema and the ontology
Mapping Catalog - Example

OWL x GML’
ID  Ontology Concepts  Main GML        Main GML similarity
1   Block              Block           1
2   Parcel             ParcelDesigned  0.92

GML’ x GML’’
GML’  Imported GML’’  GML’’ similarity
1     BlockMTR        0.94
2     ParcelMTR       0.96
1     BlockDesigned   0.90
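To make the two mapping tables concrete, the sketch below loads the example rows into an in-memory SQLite database and resolves an imported element to its ontology concept. The table and column names are assumptions made for illustration, the metadata tables are left out, and the GML’ column of the second table is read here as a reference to the ID of the first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE main_mapping (            -- OWL x GML'
    id               INTEGER PRIMARY KEY,
    ontology_concept TEXT NOT NULL,
    main_gml_element TEXT NOT NULL,
    similarity       REAL NOT NULL
);
CREATE TABLE imported_mapping (        -- GML' x GML''
    main_id          INTEGER NOT NULL REFERENCES main_mapping(id),
    imported_element TEXT NOT NULL,
    similarity       REAL NOT NULL
);
""")
conn.executemany(
    "INSERT INTO main_mapping VALUES (?, ?, ?, ?)",
    [(1, "Block", "Block", 1.0), (2, "Parcel", "ParcelDesigned", 0.92)],
)
conn.executemany(
    "INSERT INTO imported_mapping VALUES (?, ?, ?)",
    [(1, "BlockMTR", 0.94), (2, "ParcelMTR", 0.96),
     (1, "BlockDesigned", 0.90)],
)


def concept_for(imported_element: str):
    """Return (ontology concept, main GML element, GML'' similarity)."""
    return conn.execute(
        """
        SELECT m.ontology_concept, m.main_gml_element, i.similarity
        FROM imported_mapping AS i
        JOIN main_mapping AS m ON m.id = i.main_id
        WHERE i.imported_element = ?
        """,
        (imported_element,),
    ).fetchone()


print(concept_for("ParcelMTR"))  # ('Parcel', 'ParcelDesigned', 0.96)
```

Because every imported element points at a main GML element, each new GML’’ schema only needs to be compared with the main schema, not with every other imported schema.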
Conclusion
• The assumptions on which our work is based are
  – Geographic data interchange happens mainly among domains with some affinity
  – Geographic data are better defined semantically within a specific domain than through domain generalization
• In this context, we expect our method to be useful as part of a system for GIS data integration
Main Contribution
• This work proposes a solution for the problem of semantic interoperability among GML schemas within the domain of urban registration
• Method characteristics
– an ontology that represents the domain knowledge
– semi-automated equivalence determination
Related Work
• Related work focuses on translating queries executed in closely interconnected heterogeneous environments
• This work focuses on data integration in environments that are not necessarily interconnected
• This research considers a scenario where:
  – small municipalities, individually, lack the means to maintain complex systems
  – geographic data are spread over many institutions
• As a consortium, however, they could promote data interchange through a mechanism that identifies the similarities among their data
Future Work
• To define and execute experiments to validate and improve the method
• To increase the scope of the domain
• To extend the method to be applied to other domains
– To consider other ontologies
• To provide the integration of GML instances
• To specify an environment for distributed geographic data queries based on the mappings
A Method for Defining Semantic Similarities between GML Schemas
Thanks!
GBD UFSC - Data Base Group of Santa Catarina Federal University
Angelo Augusto Frozza, Ronaldo dos Santos Mello
{frozza, ronaldo}@inf.ufsc.br
Application: Urban Register
• Ontology
Application: Urban Register
• GML schema
[Figure: class diagram of the urban register GML schema. Classes: PARCEL, Parcel, Compatibilization, Postal Code, Cemetery, Address, Limit_Parcel, Wall, Fence, Parcel_Front, Front_Parcel, Front_Secondary, Front_Main, Ocupation_Real state, Block_MTR, Parcel_MTR, Parcel_MTR_Actual, Parcel_MTR_Designed, Parcel_Legal, Parcel_Taxable, Parcel_Area. Relationship labels: Over, Inside, Contains, Has, Forma, Belong. Cardinalities shown: 1, 0..1, 0..*, 1..*.]