Towards Gazetteer Integration Through an Instance-based Thesauri Mapping Approach
Daniela F. Brauner, Marco A. Casanova, Ruy L. Milidiú
{dani, casanova, [email protected]}
Pontifical Catholic University of Rio de Janeiro (PUC-Rio)
Department of Informatics
© Daniela F. Brauner, Marco A. Casanova, Ruy L. Milidiú
Summary
• Motivation
• Gazetteers & Thesauri
• Gazetteer Integration
• Instance-based Thesauri Mapping
– Conceptual and Statistical Model
– Experiments with Geographic Data
• Conclusions
Motivation
• Goal – Gazetteer Integration
– how to migrate entries from gazetteer GB to gazetteer GA
• Problems
1. Duplicated Entries Elimination: Gazetteers may “overlap” – requires detecting and eliminating duplicates
2. Reclassification of migrated entries: Gazetteers may adopt different classification schemes – requires mapping the classification scheme of GB to that of GA
Gazetteers & Thesauri
• Gazetteer
– a gazetteer is “a geographical dictionary (as at the back of an atlas) containing a list of geographic names, together with their geographic locations and other descriptive information” [WordNet 2005].
– a gazetteer is a catalog of geographic features, where each entry has as attributes:
• a unique ID
• a unique type – a term taken from a feature type thesaurus
• a name
• optionally, a location – an approximation of the feature footprint
WordNet (2005), “WordNet - a lexical database for the English language”. Cognitive Science Laboratory, Princeton University, Princeton, NJ – USA. Available at: http://wordnet.princeton.edu
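The entry attributes listed on this slide can be sketched as a small record type. This is a minimal illustration only: the class name `GazetteerEntry`, its field names, and the choice of a single (latitude, longitude) point as the footprint approximation are assumptions made here, not part of the slides. The identifier `example-1` is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class GazetteerEntry:
    """One gazetteer entry with the four attributes listed on the slide."""
    entry_id: str        # unique ID
    feature_type: str    # a term taken from a feature type thesaurus
    name: str
    # Optional footprint approximation. Using one (latitude, longitude)
    # point is an assumption of this sketch.
    location: Optional[Tuple[float, float]] = None


# An entry with a location, using values shown for the ADL Gazetteer.
rio = GazetteerEntry(
    entry_id="adlgaz-1-1457061-32",
    feature_type="populated places",
    name="Rio de Janeiro - Brazil",
    location=(-22.9, -43.2333),
)

# The location is optional, so an entry may omit it (hypothetical entry).
unlocated = GazetteerEntry("example-1", "streams", "Unnamed stream")
```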
Gazetteers & Thesauri
• Thesauri
– a thesaurus is “a structured and defined list of terms which standardizes words used for indexing” [UNESCO 1995]
– thesaurus relationships
• NT – narrower term
• BT – broader term
• RT – related term
• ...
UNESCO (1995), “UNESCO Thesaurus”. United Nations Educational, Scientific and Cultural Organization, 1995. http://www.ulcc.ac.uk/unesco
Gazetteers & Thesauri
Example: ADL Gazetteer entries (classes are terms of the ADL Feature Type Thesaurus)
identifier | display-name | class | gml:y | gml:x
adlgaz-1-1457057-00 | Rio de Janeiro, Estado do - Brazil | administrative areas | -22.0 | -42.5
adlgaz-1-1457059-20 | Rio de Janeiro, Serra do - Brazil | mountains | -17.95 | -44.95
adlgaz-1-1457061-32 | Rio de Janeiro - Brazil | populated places | -22.9 | -43.2333
adlgaz-1-1437138-6b | Janeiro, Rio de - Brazil | streams | -11.85 | -45.15
adlgaz-1-3223719-6f | Rio de Janeiro - Loreto, Departamento de - Peru | populated places | -4.3833 | -71.8167
ADL (1999), “Alexandria Digital Library Gazetteer”, Map and Imagery Lab, Davidson Library, University of California, Santa Barbara. Available at: http://www.alexandria.ucsb.edu/gazetteer
Gazetteers & Thesauri
ADL Feature Type Thesaurus (sample terms rooted at ‘regions’)
regions
. agricultural regions
. biogeographic regions
. . barren lands
. . deserts
. . forests
. . . petrified forests
. . . rain forests
. . . woods
. . grasslands
. . habitats
. . jungles
. . oases
. . shrublands
. . snow regions
. . tundras
. . wetlands
. climatic regions
. coastal zones
. economic regions
. land regions
. . continents
. . islands
. . . archipelagos
. . subcontinents
. linguistic regions
. map regions
. . chart regions
. . map quadrangle regions
. . UTM zones
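The dotted outline above encodes BT/NT relationships through indentation depth. As a rough sketch, it can be parsed into a map from each term to its broader term. The function names `parse_outline` and `narrower` are hypothetical, and the sample text is a short excerpt of the outline rather than the full thesaurus.

```python
from typing import Dict, List, Optional

# Excerpt of the outline shown on the slide (one ". " per hierarchy level).
OUTLINE = """\
regions
. land regions
. . continents
. . islands
. . . archipelagos
. . subcontinents
"""


def parse_outline(text: str) -> Dict[str, Optional[str]]:
    """Map each term to its broader term (BT); root terms map to None."""
    broader: Dict[str, Optional[str]] = {}
    stack: List[str] = []  # stack[d] is the most recent term seen at depth d
    for line in text.splitlines():
        if not line.strip():
            continue
        tokens = line.split()
        depth = 0
        while depth < len(tokens) and tokens[depth] == ".":
            depth += 1
        term = " ".join(tokens[depth:])
        del stack[depth:]  # forget terms at this depth or deeper
        broader[term] = stack[-1] if stack else None
        stack.append(term)
    return broader


def narrower(broader: Dict[str, Optional[str]], term: str) -> List[str]:
    """Return the narrower terms (NT) of a term, in outline order."""
    return [t for t, bt in broader.items() if bt == term]


bt = parse_outline(OUTLINE)
print(bt["islands"])                 # land regions
print(narrower(bt, "land regions"))  # ['continents', 'islands', 'subcontinents']
```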
Gazetteers & Thesauri
ADL Feature Type Thesaurus (sample entry)
islands » A feature type category for places such as the island of Manhattan.
Used for: » The category islands is used instead of any of the following.
-atolls
-cays
-island arcs
-isles
-islets
-keys (islands)
-land-tied islands
-mangrove islands
Broader Terms: land regions » islands is a subtype of "land regions".
Related Terms: » The following is a list of other categories related to islands (non-hierarchical relationships)
-bars (physiographic)
Scope Note: Tracts of land smaller than a continent, surrounded by the water of an ocean, sea, lake or stream. [Glossary of Geology, 4th ed.]. » Definition of islands.
Gazetteers & Thesauri
[Figure: architecture diagram with the components Wrapper, DataSource, Mediator, LocalCatalogue (CAT), ReferenceGazetteer (GAZ), LocalDataSource (DS), ExternalCatalogue (CAT) and ExternalGazetteer (GAZ)]
Gazetteer Integration
• Gazetteer Integration Problem
– how to migrate entries from gazetteer GB to gazetteer GA
[Figure: ADL Gazetteer as GA with thesaurus TA; GEOnet as GB with thesaurus TB]
ADL (1999), “Alexandria Digital Library Gazetteer”, Map and Imagery Lab, Davidson Library, University of California, Santa Barbara. Available at: http://www.alexandria.ucsb.edu/gazetteer
GNS (2006), “GEOnet Names Server”, U.S. National Geospatial Intelligence Agency, USA. Available at: http://gnswww.nga.mil/geonames/GNS
Gazetteer Integration
• Duplicated Entries Elimination:
– Gazetteers GA and GB may have entries that represent the same real-world features
• use footprints to detect possible duplicates
[Figure: sample entries from GA (ADL Gazetteer, thesaurus TA) and GB (GEOnet, thesaurus TB), with columns identifier, display-name, class, gml:y, gml:x; an entry fa in FA and an entry fb in FB that represent the same feature are marked fa ≡ fb]
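The slide only says that footprints are used to detect possible duplicates. One way this could look, assuming point footprints and a fixed distance tolerance, is sketched below. The great-circle (haversine) distance, the 5 km tolerance, the brute-force pairing and the GB identifiers `gb-0001` and `gb-0002` are all assumptions of this sketch, not the authors' procedure.

```python
import math
from typing import List, Tuple

Entry = Tuple[str, float, float]  # (identifier, latitude, longitude)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points in degrees."""
    radius_km = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def candidate_duplicates(ga: List[Entry], gb: List[Entry],
                         max_km: float = 5.0) -> List[Tuple[str, str]]:
    """Pair entries of GA and GB whose point footprints lie within max_km."""
    return [
        (id_a, id_b)
        for id_a, lat_a, lon_a in ga
        for id_b, lat_b, lon_b in gb
        if haversine_km(lat_a, lon_a, lat_b, lon_b) <= max_km
    ]


# GA rows come from the ADL sample; the GB rows are hypothetical.
GA = [("adlgaz-1-1457061-32", -22.9, -43.2333),
      ("adlgaz-1-1437138-6b", -11.85, -45.15)]
GB = [("gb-0001", -22.9028, -43.2075),
      ("gb-0002", -4.3833, -71.8167)]

print(candidate_duplicates(GA, GB))
```

Pairs found this way are only candidates. A real pipeline would likely also compare names or other attributes before declaring fa ≡ fb.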
Gazetteer Integration
• Reclassification of migrated entries:
– Gazetteers may adopt different classification schemes – requires mapping the classification scheme of GB to that of GA
[Figure: ADL Gazetteer as GA with thesaurus TA and GEOnet as GB with thesaurus TB; the mapping is written m(tb) = ta]
Gazetteer Integration
• Aligning terms does not work...
...
Gazetteer Integration
• Aligning term definitions is even worse...
1. (ADL) bay: indentations of a coastline or shoreline enclosing a part of a body of water; bodies of water partly surrounded by land.
2. (GNS) bay: a coastal indentation between two capes or headlands, larger than a cove but smaller than a gulf.
3. (GNS) island: tracts of land, smaller than a continent, surrounded by water at high water.
Gazetteer Integration
• Formal approaches (based on DL) are hopeless...
...
<owl:Class rdf:ID="Island">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="http://sweet.jpl.nasa.gov/ontology/space.owl#surroundedBy_2D" />
      <owl:allValuesFrom>
        <owl:Class>
          <owl:unionOf rdf:parseType="Collection">
            <owl:Class rdf:about="#OceanRegion" />
            <owl:Class rdf:about="#LandwaterRegion" />
          </owl:unionOf>
        </owl:Class>
      </owl:allValuesFrom>
    </owl:Restriction>
  </rdfs:subClassOf>
  <rdfs:subClassOf rdf:resource="#LandRegion" />
</owl:Class>
...
</rdf:RDF>
SWEET (2006) The Semantic Web for Earth and Environmental Terminology (SWEET). Jet Propulsion Laboratory, California Institute of Technology. Available at: http://sweet.jpl.nasa.gov/index.html
Gazetteer Integration
• Instance-based Thesauri Mapping:
– use duplicates to figure out how to map the classification scheme of GB to that of GA
[Figure: sample entries from GA (ADL Gazetteer, thesaurus TA) and GB (GEOnet, thesaurus TB); duplicates fa ≡ fb between FA and FB are used to derive the mapping m(tb) = ta]
Instance-based Thesauri Mapping Approach
Conceptual and Statistical Model
• n(ta, tb) = number of occurrences of pairs of objects fa and fb such that:
– fa ∈ GA and fb ∈ GB
– fa ≡ fb
– ta and tb are the types of fa and fb, respectively
• n(ta) = the number of entries in FA classified as ta
[Figure: gazetteers GA and GB with thesauri TA and TB and entry sets FA and FB]
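The two counts defined on this slide can be accumulated in one pass over the detected duplicates. The sketch below assumes each duplicate is already reduced to its pair of types (ta, tb), and that every entry of FA appears in exactly one duplicate pair, so that n(ta) can be counted from the same list. The function name `count_cooccurrences` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_cooccurrences(type_pairs: Iterable[Tuple[str, str]]):
    """Return (n_pair, n_ta) from the (ta, tb) types of duplicates fa ≡ fb.

    n_pair[(ta, tb)] plays the role of n(ta, tb) and n_ta[ta] the role of
    n(ta), under the assumption that each fa has exactly one duplicate fb.
    """
    n_pair: Counter = Counter()
    n_ta: Counter = Counter()
    for ta, tb in type_pairs:
        n_pair[(ta, tb)] += 1
        n_ta[ta] += 1
    return n_pair, n_ta


# Toy input: types of six detected duplicates.
pairs = ([("islands", "ISL")] * 3
         + [("islands", "ISLS")]
         + [("bays", "BAY")] * 2)
n_pair, n_ta = count_cooccurrences(pairs)
print(n_pair[("islands", "ISL")], n_ta["islands"])  # 3 4
```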
Instance-based Thesauri Mapping Approach
Conceptual and Statistical Model
• P(ta, tb) = Mapping Rate Estimator
– an estimation for the frequency that the term ta maps to tb, for each pair of terms ta ∈ TA and tb ∈ TB
P(ta, tb) = (n(ta, tb) + Δ) / (n(ta) + 1), where Δ = 1 / |TB|
[Figure: gazetteers GA and GB with thesauri TA and TB and entry sets FA and FB]
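A short check of why the correction term is useful, reading the estimator on this slide as P(ta, tb) = (n(ta, tb) + Δ) / (n(ta) + 1) with Δ = 1/|TB|. The derivation assumes that every entry counted in n(ta) contributes to exactly one n(ta, tb), so that the pair counts for a fixed ta add up to n(ta). Under that assumption the rates for a fixed ta sum to one, and a term with no observed duplicates gets the uniform rate 1/|TB|.

```latex
\sum_{t_b \in T_B} P(t_a, t_b)
  = \frac{\sum_{t_b \in T_B} n(t_a, t_b) + |T_B| \cdot \Delta}{n(t_a) + 1}
  = \frac{n(t_a) + 1}{n(t_a) + 1}
  = 1,
\qquad
n(t_a) = 0 \;\Rightarrow\; P(t_a, t_b) = \Delta = \frac{1}{|T_B|}
```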
Instance-based Thesauri Mapping Approach
Conceptual and Statistical Model
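• Threshold Mapping Rate: the value that P(ta, tb) must reach for a mapping to be accepted
• m(tb) = ta iff P(ta, tb) reaches the threshold mapping rate
Problem: What is the value of the threshold?
[Figure: gazetteers GA and GB with thesauri TA and TB and entry sets FA and FB]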
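Applying a threshold to the mapping rates can be sketched as follows. Two details are assumptions, since the comparison is not legible on the slide: the sketch accepts a pair when its rate is greater than or equal to the threshold, and when several ta pass for the same tb it keeps the one with the highest rate so that m stays a function. The name `map_terms` is hypothetical, and the last rate in the example is made up for illustration.

```python
from typing import Dict, Tuple


def map_terms(rates: Dict[Tuple[str, str], float],
              threshold: float) -> Dict[str, str]:
    """Return m as a dict tb -> ta for pairs whose rate meets the threshold."""
    best: Dict[str, Tuple[str, float]] = {}
    for (ta, tb), p in rates.items():
        # Keep only the highest-rated ta for each tb (assumed tie-breaking).
        if p >= threshold and (tb not in best or p > best[tb][1]):
            best[tb] = (ta, p)
    return {tb: ta for tb, (ta, _) in best.items()}


rates = {
    ("agricultural sites", "FRM"): 0.96974,
    ("bays", "BAY"): 0.50039,
    ("islands", "ISL"): 0.93422,
    ("populated places", "ISL"): 0.0005,  # hypothetical low rate
}
m = map_terms(rates, threshold=0.4)
print(m)
```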
Experiments with Geographic Data
Data collection
– ADL Gazetteer (ADL Feature Type Thesaurus - TA)
• Instances: 16783
• Thesaurus terms: 210
– GEOnet Names Server (GEOnet Thesaurus - TB)
• Instances: 87608
• Thesaurus terms: 642
ADL (1999), “Alexandria Digital Library Gazetteer”, Map and Imagery Lab, Davidson Library, University of California, Santa Barbara. Available at: http://www.alexandria.ucsb.edu/gazetteer
GNS (2006), “GEOnet Names Server”, U.S. National Geospatial Intelligence Agency, USA. Available at: http://gnswww.nga.mil/geonames/GNS
Experiments with Geographic Data
Model Evaluation & Test
• Data collected was partitioned into 7 datasets
– 6 for tuning
– 1 for testing
Testing set
Tuning sets
Experiments with Geographic Data
6-fold cross-validation
[Figure: collected data, showing the testing set and the training set (Tk)]
Experiments with Geographic Data
ADL (ta) | GEOnet (tb) | n(ta, tb)
bays | BAY | 40
beaches | BCH | 66
countries, 2nd order divisions | ADM2 | 30
forests | PRK | 2
islands | ISL | 562
islands | ISLS | 38
mountains | HLL | 222
mountains | HLLS | 166
physiographic features | UPLD | 204
populated places | PPL | 10961
populated places | PPLL | 12
populated places | ISL | 6
power generation sites | PS | 10
railroad features | RSTN | 406
railroad features | RSTP | 201
ridges | RDGE | 131
ridges | SPUR | 55
wetlands | MRSH | 9
wetlands | FLTT | 5
...

ADL (ta) | n(ta)
islands | 600
mountains | 388
physiographic features | 204
populated places | 10979
power generation sites | 10
railroad features | 607
ridges | 186
wetlands | 14
...
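As a worked example, the counts listed on this slide can be plugged into the estimator, read here as P(ta, tb) = (n(ta, tb) + Δ) / (n(ta) + 1) with Δ = 1/|TB|. The sketch uses |TB| = 642, the number of GEOnet thesaurus terms given on the data collection slide. Which subset of the data the counts were taken from is not stated, so the resulting rates are illustrative and need not match the rates reported for the testing step.

```python
T_B_SIZE = 642  # GEOnet thesaurus terms, from the data collection slide

# A few n(ta, tb) and n(ta) values copied from the slide.
n_pair = {
    ("islands", "ISL"): 562,
    ("islands", "ISLS"): 38,
    ("populated places", "PPL"): 10961,
    ("populated places", "ISL"): 6,
    ("wetlands", "MRSH"): 9,
    ("wetlands", "FLTT"): 5,
}
n_ta = {"islands": 600, "populated places": 10979, "wetlands": 14}


def mapping_rate(ta: str, tb: str) -> float:
    """Estimate P(ta, tb) with the correction term delta = 1 / |TB|."""
    delta = 1.0 / T_B_SIZE
    return (n_pair.get((ta, tb), 0) + delta) / (n_ta.get(ta, 0) + 1)


for ta, tb in n_pair:
    print(f"{ta:18s} {tb:5s} {mapping_rate(ta, tb):.4f}")
```

With these counts, islands/ISL and populated places/PPL come out well above 0.9, while populated places/ISL stays close to zero, which is the separation a threshold is meant to exploit.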
Experiments with Geographic Data
6-fold cross-validation
[Figure: validation step over the collected data, showing the testing set, the training set (Tk) and the validation set (Vk)]
Experiments with Geographic Data
6-fold cross-validation
Collected data
Testing set
Estimated Threshold Mapping Rate
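The slides describe the partitioning (7 datasets, 6 for tuning, 1 for testing) but not how a threshold is scored within each fold. The sketch below therefore covers only the mechanics: it splits the data, iterates over the 6 folds, and leaves the scoring to a caller-supplied function. Averaging the best threshold of each fold is an assumption of this sketch, and the names `partition`, `folds` and `estimate_threshold` are hypothetical.

```python
import random
from statistics import mean
from typing import Callable, Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]  # (ta, tb) types of one detected duplicate
Score = Callable[[List[Pair], List[Pair], float], float]


def partition(data: Sequence[Pair], parts: int = 7, seed: int = 0) -> List[List[Pair]]:
    """Shuffle the data and split it into `parts` nearly equal datasets."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::parts] for i in range(parts)]


def folds(tuning: List[List[Pair]]) -> Iterator[Tuple[List[Pair], List[Pair]]]:
    """Yield (training set Tk, validation set Vk) for each tuning dataset k."""
    for k, validation in enumerate(tuning):
        training = [p for j, part in enumerate(tuning) if j != k for p in part]
        yield training, validation


def estimate_threshold(tuning: List[List[Pair]],
                       candidates: Sequence[float],
                       score: Score) -> float:
    """Average, over all folds, the candidate threshold with the best score."""
    best_per_fold = []
    for training, validation in folds(tuning):
        best = max(candidates, key=lambda th: score(training, validation, th))
        best_per_fold.append(best)
    return mean(best_per_fold)


# Usage with toy data: one dataset is held out for testing, six are tuned on.
data = [("ta%d" % (i % 3), "tb%d" % (i % 5)) for i in range(70)]
datasets = partition(data)
testing, tuning = datasets[0], datasets[1:]
```

A real `score` function would build the mapping from the training set and measure its quality on the validation set for the given threshold.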
Experiments with Geographic Data
6-fold cross-validation
Testing Step (testing set, threshold: 0.4)
Legend: C = correct term alignments; P = proposed term alignments
C | P | Accuracy = C/P
26 | 29 | 89.7%
Example: aligned terms
ta | tb | P(ta,tb)
agricultural sites | FRM | 0.96974
bays | BAY | 0.50039
forests | RESF | 0.50078
islands | ISL | 0.93422
lakes | LK | 0.91849
...
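The accuracy figure on this slide is the share of proposed term alignments that are correct, C/P. A minimal sketch of that ratio follows. How the authors judged an alignment to be correct is not stated, so the `reference` set and the placeholder terms in the example are hypothetical.

```python
from typing import Set, Tuple

Alignment = Tuple[str, str]  # (ta, tb)


def accuracy(proposed: Set[Alignment], reference: Set[Alignment]) -> float:
    """Share of proposed term alignments that also appear in the reference."""
    if not proposed:
        raise ValueError("no alignments were proposed")
    return len(proposed & reference) / len(proposed)


# Reproducing the ratio reported on the slide: C = 26, P = 29.
print(round(100 * 26 / 29, 1))  # 89.7

# Toy example with placeholder terms (hypothetical).
proposed = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
reference = {("a1", "b1"), ("a3", "b3"), ("a4", "b4")}
print(accuracy(proposed, reference))
```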
Conclusions
• Conclusions:
– duplicates help reclassification!
– a “semantic approach” may work when “syntactic approaches” fail (badly)
• If you buy the idea, you also get...
– a strategy to gradually learn how to reclassify gazetteer entries (as in a mediator)
– a strategy to mediate access to object catalogs in general (as long as it is possible to detect duplicates)
– (Gazetteer for the Brazilian territory:
• extracted from the ADL Gazetteer
• entries classified according to 4 different (aligned) schemes
• encapsulated by Web services)