Towards Gazetteer Integration Through an Instance-based Thesauri Mapping Approach

Daniela F. Brauner, Marco A. Casanova, Ruy L. Milidiú

{dani, casanova, [email protected]}

Pontifical Catholic University of Rio de Janeiro (PUC-Rio)

Department of Informatics

© Daniela F. Brauner, Marco A. Casanova, Ruy L. Milidiú

Summary

• Motivation

• Gazetteers & Thesauri

• Gazetteer Integration

• Instance-based Thesauri Mapping

– Conceptual and Statistical Model

– Experiments with Geographic Data

• Conclusions

Motivation

• Goal – Gazetteer Integration

– how to migrate entries from gazetteer GB to gazetteer GA

• Problems

1. Duplicated Entries Elimination: Gazetteers may “overlap” – requires detecting and eliminating duplicates

2. Reclassification of migrated entries: Gazetteers may adopt different classification schemes – requires mapping the classification scheme of GB to that of GA

Gazetteers & Thesauri

• Gazetteer

– a gazetteer is “a geographical dictionary (as at the back of an atlas) containing a list of geographic names, together with their geographic locations and other descriptive information” [WordNet 2005].

– a gazetteer is a catalog of geographic features, where each entry has the following attributes:

• a unique ID

• a unique type – a term taken from a feature type thesaurus

• a name

• optionally, a location – an approximation of the feature footprint
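The attribute list above can be sketched as a small record type. This is a minimal illustration, not the authors' data model: the class name `GazetteerEntry` and its field names are assumptions, and the footprint is simplified to a single (y, x) point.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class GazetteerEntry:
    """One gazetteer entry (hypothetical field names)."""
    entry_id: str                                   # unique ID
    feature_type: str                               # term from a feature type thesaurus
    name: str
    location: Optional[Tuple[float, float]] = None  # (gml:y, gml:x) point, optional


# Values taken from the ADL Gazetteer sample entry for Rio de Janeiro.
rio = GazetteerEntry(
    entry_id="adlgaz-1-1457061-32",
    feature_type="populated places",
    name="Rio de Janeiro - Brazil",
    location=(-22.9, -43.2333),
)

# The location is optional, so an entry without a footprint is still valid.
unlocated = GazetteerEntry("example-0001", "streams", "Example stream")
```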

WordNet (2005), “WordNet - a lexical database for the English language”. Cognitive Science Laboratory, Princeton University, Princeton, NJ – USA. Available at: http://wordnet.princeton.edu

Gazetteers & Thesauri

• Thesauri

– a thesaurus is “a structured and defined list of terms which standardizes words used for indexing” [UNESCO 1995]

– thesaurus relationships

• NT – narrower term

• BT – broader term

• RT – related term

• ...

UNESCO (1995), “UNESCO Thesaurus”. United Nations Educational, Scientific and Cultural Organization. Available at: http://www.ulcc.ac.uk/unesco

Gazetteers & Thesauri

Ex: ADL Gazetteer entries (the class column uses terms from the ADL Feature Type Thesaurus)

identifier | display-name | class | gml:y | gml:x
adlgaz-1-1457057-00 | Rio de Janeiro, Estado do - Brazil | administrative areas | -22.0 | -42.5
adlgaz-1-1457059-20 | Rio de Janeiro, Serra do - Brazil | mountains | -17.95 | -44.95
adlgaz-1-1457061-32 | Rio de Janeiro - Brazil | populated places | -22.9 | -43.2333
adlgaz-1-1437138-6b | Janeiro, Rio de - Brazil | streams | -11.85 | -45.15
adlgaz-1-3223719-6f | Rio de Janeiro - Loreto, Departamento de - Peru | populated places | -4.3833 | -71.8167

ADL (1999), “Alexandria Digital Library Gazetteer”, Map and Imagery Lab, Davidson Library, University of California, Santa Barbara. Available at: http://www.alexandria.ucsb.edu/gazetteer

Gazetteers & Thesauri

ADL Feature Type Thesaurus (sample terms rooted at ‘regions’)

regions

. agricultural regions

. biogeographic regions

. . barren lands

. . deserts

. . forests

. . . petrified forests

. . . rain forests

. . . woods

. . grasslands

. . habitats

. . jungles

. . oases

. . shrublands

. . snow regions

. . tundras

. . wetlands

. climatic regions

. coastal zones

. economic regions

. land regions

. . continents

. . islands

. . . archipelagos

. . subcontinents

. linguistic regions

. map regions

. . chart regions

. . map quadrangle regions

. . UTM zones

Gazetteers & Thesauri

ADL Feature Type Thesaurus (sample entry)

islands » A feature type category for places such as the island of Manhattan.

Used for: » The category islands is used instead of any of the following.

-atolls

-cays

-island arcs

-isles

-islets

-keys (islands)

-land-tied islands

-mangrove islands

Broader Terms: land regions » islands is a subtype of "land regions".

Related Terms: » The following is a list of other categories related to islands (non-hierarchical relationships)

-bars (physiographic)

Scope Note: Tracts of land smaller than a continent, surrounded by the water of an ocean, sea, lake or stream. [Glossary of Geology, 4th ed.] » Definition of islands.
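The sample entry can be sketched as a small data structure with its "Used for", "Broader Terms" and "Related Terms" fields. This is a minimal illustration, assuming an acyclic broader-term hierarchy; `ThesaurusTerm`, `preferred_term` and `ancestors` are hypothetical names, not part of the ADL thesaurus.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ThesaurusTerm:
    term: str
    used_for: List[str] = field(default_factory=list)  # non-preferred labels
    broader: Optional[str] = None                       # BT
    related: List[str] = field(default_factory=list)   # RT
    scope_note: str = ""


def preferred_term(thesaurus: Dict[str, ThesaurusTerm], label: str) -> Optional[str]:
    """Resolve a label to the preferred term that is used instead of it."""
    if label in thesaurus:
        return label
    for entry in thesaurus.values():
        if label in entry.used_for:
            return entry.term
    return None


def ancestors(thesaurus: Dict[str, ThesaurusTerm], term: str) -> List[str]:
    """Follow broader-term links upwards (assumes the hierarchy has no cycles)."""
    chain: List[str] = []
    current = thesaurus.get(term)
    while current is not None and current.broader is not None:
        chain.append(current.broader)
        current = thesaurus.get(current.broader)
    return chain


thesaurus = {
    "regions": ThesaurusTerm("regions"),
    "land regions": ThesaurusTerm("land regions", broader="regions"),
    "islands": ThesaurusTerm(
        "islands",
        used_for=["atolls", "cays", "island arcs", "isles", "islets",
                  "keys (islands)", "land-tied islands", "mangrove islands"],
        broader="land regions",
        related=["bars (physiographic)"],
    ),
}

print(preferred_term(thesaurus, "atolls"))  # islands
print(ancestors(thesaurus, "islands"))      # ['land regions', 'regions']
```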

Gazetteers & Thesauri

[Figure: mediator architecture with the components Wrapper, DataSource, Mediator, LocalCatalogue (CAT), ReferenceGazetteer (GAZ), LocalDataSource (DS), ExternalCatalogue (CAT) and ExternalGazetteer (GAZ).]

Gazetteer Integration

• Gazetteer Integration Problem

– how to migrate entries from gazetteer GB to gazetteer GA

[Figure: gazetteer GA (the ADL Gazetteer, with thesaurus TA) and gazetteer GB (GEOnet, with thesaurus TB).]

GNS (2006), “GEOnet Names Server”, U.S. National Geospatial Intelligence Agency, USA. Available at: http://gnswww.nga.mil/geonames/GNS

Gazetteer Integration

• Duplicated Entries Elimination:

– Gazetteers GA and GB may have entries that represent the same real-world features

• use footprints to detect possible duplicates

[Figure: entry sets FA (from GA, the ADL Gazetteer, thesaurus TA) and FB (from GB, GEOnet, thesaurus TB), each shown as a table with columns identifier, display-name, class, gml:y, gml:x; a duplicate pair is marked fa ≡ fb.]
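The slides say only that footprints are used to detect possible duplicates. The sketch below shows one way this could look, assuming point footprints (gml:y as latitude, gml:x as longitude) and a distance tolerance. The function names, the `max_km` tolerance and the `gb-…` identifiers are all hypothetical, and a real matcher would probably also compare names.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def candidate_duplicates(entries_a: Dict[str, Point],
                         entries_b: Dict[str, Point],
                         max_km: float = 5.0) -> List[Tuple[str, str]]:
    """Pair entries whose footprints lie within max_km of each other (assumed tolerance)."""
    pairs = []
    for id_a, point_a in entries_a.items():
        for id_b, point_b in entries_b.items():
            if haversine_km(point_a, point_b) <= max_km:
                pairs.append((id_a, id_b))
    return pairs


# GA entries from the ADL sample; GB entries are made up for illustration.
ga = {"adlgaz-1-1457061-32": (-22.9, -43.2333),
      "adlgaz-1-1437138-6b": (-11.85, -45.15)}
gb = {"gb-0001": (-22.9, -43.23),
      "gb-0002": (10.0, 10.0)}

print(candidate_duplicates(ga, gb))
```

The nested loop compares every pair, which is fine for a sketch but would need a spatial index for gazetteers of realistic size.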

Gazetteer Integration

• Reclassification of migrated entries:

– Gazetteers may adopt different classification schemes – requires mapping the classification scheme of GB to that of GA

[Figure: thesaurus TB of GB (GEOnet) mapped to thesaurus TA of GA (the ADL Gazetteer), with m(tb) = ta.]

Gazetteer Integration

• Aligning terms does not work...

...

Gazetteer Integration

• Aligning term definitions is even worse...

1. (ADL) bay: indentations of a coastline or shoreline enclosing a part of a body of water; bodies of water partly surrounded by land.

2. (GNS) bay: a coastal indentation between two capes or headlands, larger than a cove but smaller than a gulf.

3. (GNS) island: tracts of land, smaller than a continent, surrounded by water at high water.

Gazetteer Integration

• Formal approaches (based on DL) are hopeless...

...
<owl:Class rdf:ID="Island">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="http://sweet.jpl.nasa.gov/ontology/space.owl#surroundedBy_2D" />
      <owl:allValuesFrom>
        <owl:Class>
          <owl:unionOf rdf:parseType="Collection">
            <owl:Class rdf:about="#OceanRegion" />
            <owl:Class rdf:about="#LandwaterRegion" />
          </owl:unionOf>
        </owl:Class>
      </owl:allValuesFrom>
    </owl:Restriction>
  </rdfs:subClassOf>
  <rdfs:subClassOf rdf:resource="#LandRegion" />
</owl:Class>
...
</rdf:RDF>

SWEET (2006) The Semantic Web for Earth and Environmental Terminology (SWEET). Jet Propulsion Laboratory, California Institute of Technology. Available at: http://sweet.jpl.nasa.gov/index.html

Gazetteer Integration

• Instance-based Thesauri Mapping:

– use duplicates to figure out how to map the classification scheme of GB to that of GA

[Figure: entry sets FA (from GA, the ADL Gazetteer, thesaurus TA) and FB (from GB, GEOnet, thesaurus TB), each shown as a table with columns identifier, display-name, class, gml:y, gml:x; duplicate pairs fa ≡ fb are shown together with the mapping m(tb) = ta.]

Instance-based Thesauri Mapping Approach

Conceptual and Statistical Model

• n(ta, tb) = number of occurrences of pairs of objects fa and fb such that:

– fa ∈ GA and fb ∈ GB

– fa ≡ fb

– ta and tb are the types of fa and fb, respectively

• n(ta) = the number of entries in FA classified as ta

[Figure: entry sets FA and FB of gazetteers GA and GB, with thesauri TA and TB.]
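The two counts can be sketched with `collections.Counter`. This is a minimal illustration, assuming that FA holds the GA entries that take part in a duplicate pair, so that n(ta) is the sum of n(ta, tb) over tb; the count tables shown in the experiments are consistent with this reading (for example 562 + 38 = 600 for islands). The function name `count_types` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_types(duplicate_types: Iterable[Tuple[str, str]]):
    """Return (n_pair, n_a) from the (ta, tb) type pairs of duplicates fa ≡ fb.

    n_pair[(ta, tb)] is n(ta, tb); n_a[ta] is n(ta) under the assumption
    that every counted GA entry appears in exactly one duplicate pair.
    """
    pairs = list(duplicate_types)  # allow any iterable, including generators
    n_pair = Counter(pairs)
    n_a = Counter(ta for ta, _ in pairs)
    return n_pair, n_a


# Made-up duplicate pairs, reduced to their (ADL type, GEOnet type).
sample = [("islands", "ISL")] * 3 + [("islands", "ISLS")] + [("bays", "BAY")] * 2
n_pair, n_a = count_types(sample)

print(n_pair[("islands", "ISL")])  # 3
print(n_a["islands"])              # 4
```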

Instance-based Thesauri Mapping Approach

Conceptual and Statistical Model

• P(ta, tb) = Mapping Rate Estimator

– an estimate of the frequency with which the term ta maps to tb, for each pair of terms ta ∈ TA and tb ∈ TB

$$P(t_a, t_b) = \frac{n(t_a, t_b) + \Delta}{n(t_a) + 1}, \qquad \text{where } \Delta = \frac{1}{|T_B|}$$
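The estimator is a smoothed relative frequency: Δ = 1/|TB| is added to every pair count and 1 to the denominator. A minimal sketch follows. The function name `mapping_rate` is an assumption, and the counts in the example are hypothetical, chosen only to show the arithmetic with |TB| = 642, the GEOnet thesaurus size reported in the experiments.

```python
def mapping_rate(n_pair: int, n_a: int, size_tb: int) -> float:
    """P(ta, tb) = (n(ta, tb) + delta) / (n(ta) + 1), with delta = 1 / |TB|."""
    delta = 1.0 / size_tb
    return (n_pair + delta) / (n_a + 1)


SIZE_TB = 642  # number of GEOnet thesaurus terms

# Hypothetical counts n(ta, tb) and n(ta).
print(round(mapping_rate(2, 3, SIZE_TB), 5))  # 0.50039
print(round(mapping_rate(1, 1, SIZE_TB), 5))  # 0.50078
print(round(mapping_rate(0, 3, SIZE_TB), 5))  # 0.00039
```

When n(ta) equals the sum of n(ta, tb) over all tb, the estimates for a fixed ta sum to 1 over TB, because the |TB| smoothing terms add up to exactly the extra 1 in the denominator. A pair that was never observed still gets a small non-zero rate.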

Instance-based Thesauri Mapping Approach

Conceptual and Statistical Model

• Threshold Mapping Rate

• m(tb) = ta iff P(ta, tb) reaches the threshold mapping rate

Problem: What is the value of the threshold?
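Given the estimated rates, the mapping keeps only the pairs that reach the threshold mapping rate. The sketch below makes two assumptions that the slides do not state: the comparison is "greater than or equal", and if several terms ta reach the threshold for the same tb, the one with the highest rate wins. The function name `select_mappings` is hypothetical. The rates are the example values from the testing step, plus one made-up low-rate pair.

```python
from typing import Dict, Tuple


def select_mappings(rates: Dict[Tuple[str, str], float],
                    threshold: float) -> Dict[str, str]:
    """Return m as a dict tb -> ta for pairs with P(ta, tb) >= threshold."""
    best: Dict[str, Tuple[str, float]] = {}
    for (ta, tb), p in rates.items():
        # Assumed tie-break: keep the highest-rated ta for each tb.
        if p >= threshold and (tb not in best or p > best[tb][1]):
            best[tb] = (ta, p)
    return {tb: ta for tb, (ta, _) in best.items()}


rates = {
    ("agricultural sites", "FRM"): 0.96974,
    ("bays", "BAY"): 0.50039,
    ("forests", "RESF"): 0.50078,
    ("islands", "ISL"): 0.93422,
    ("lakes", "LK"): 0.91849,
    ("populated places", "ISL"): 0.0005,  # made-up low rate, filtered out
}

mapping = select_mappings(rates, threshold=0.4)
print(mapping["ISL"])  # islands
```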

Experiments with Geographic Data

Data collection

– ADL Gazetteer (ADL Feature Type Thesaurus - TA)

• Instances: 16783

• Thesaurus terms: 210

– GEOnet Names Server (GEOnet Thesaurus - TB)

• Instances: 87608

• Thesaurus terms: 642

Experiments with Geographic Data

Model Evaluation & Test

• The collected data was partitioned into 7 datasets

– 6 for tuning

– 1 for testing

[Figure: the collected data divided into six tuning sets and one testing set.]

Experiments with Geographic Data

6-fold cross-validation

[Figure: collected data, showing the testing set and a training set (Tk).]

Experiments with Geographic Data

ADL (ta) | GEOnet (tb) | n(ta, tb)
bays | BAY | 40
beaches | BCH | 66
countries, 2nd order divisions | ADM2 | 30
forests | PRK | 2
islands | ISL | 562
islands | ISLS | 38
mountains | HLL | 222
mountains | HLLS | 166
physiographic features | UPLD | 204
populated places | PPL | 10961
populated places | PPLL | 12
populated places | ISL | 6
power generation sites | PS | 10
railroad features | RSTN | 406
railroad features | RSTP | 201
ridges | RDGE | 131
ridges | SPUR | 55
wetlands | MRSH | 9
wetlands | FLTT | 5
...

ADL (ta) | n(ta)
islands | 600
mountains | 388
physiographic features | 204
populated places | 10979
power generation sites | 10
railroad features | 607
ridges | 186
wetlands | 14
...

Experiments with Geographic Data

6-fold cross-validation

Validation Step

[Figure: collected data, showing the testing set, a training set (Tk) and a validation set (Vk).]
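The partitioning into one testing set and six tuning sets, and the 6-fold rounds over the tuning sets, can be sketched as follows. The slides do not say how the data was split, so the random shuffle, the seed and the function names `partition` and `cross_validation_rounds` are assumptions. In round k one tuning set serves as the validation set Vk and the other five form the training set Tk.

```python
import random
from typing import Iterable, Iterator, List, Sequence, Tuple


def partition(items: Iterable, parts: int, seed: int = 0) -> List[list]:
    """Shuffle the items and deal them into `parts` roughly equal datasets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::parts] for i in range(parts)]


def cross_validation_rounds(tuning_sets: Sequence[list]) -> Iterator[Tuple[list, list]]:
    """Yield (Tk, Vk): each tuning set is the validation set exactly once."""
    for k, validation in enumerate(tuning_sets):
        training = [x for i, s in enumerate(tuning_sets) if i != k for x in s]
        yield training, validation


# 70 stand-in items; in the experiments these would be duplicate pairs.
datasets = partition(range(70), parts=7)
testing_set, tuning_sets = datasets[0], datasets[1:]

rounds = list(cross_validation_rounds(tuning_sets))
print(len(rounds))  # 6
```

The testing set is never part of any Tk or Vk, so it stays untouched until the final testing step.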

Experiments with Geographic Data

6-fold cross-validation

[Figure: collected data and testing set, with the estimated threshold mapping rate.]

Experiments with Geographic Data

6-fold cross-validation

Testing Step

[Figure: collected data and testing set.]

• Threshold: 0.4

C | P | Accuracy = C/P
26 | 29 | 89.7%

Legend: C: correct term alignments; P: proposed term alignments

Example: Aligned terms

ta | tb | P(ta, tb)
agricultural sites | FRM | 0.96974
bays | BAY | 0.50039
forests | RESF | 0.50078
islands | ISL | 0.93422
lakes | LK | 0.91849
...
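The accuracy figure is the share of proposed alignments that are correct, C/P. The sketch below computes it against a reference alignment. The slides do not say how correctness was judged, so the `reference` dictionary, the function name `accuracy` and the term codes are all hypothetical.

```python
from typing import Dict


def accuracy(proposed: Dict[str, str], reference: Dict[str, str]) -> float:
    """Accuracy = C / P: proposed alignments (tb -> ta) that match the reference."""
    if not proposed:
        raise ValueError("no alignments were proposed")
    correct = sum(1 for tb, ta in proposed.items() if reference.get(tb) == ta)
    return correct / len(proposed)


# Made-up term codes, for illustration only.
proposed = {"B1": "a1", "B2": "a2", "B3": "a3", "B4": "a4"}
reference = {"B1": "a1", "B2": "a2", "B3": "a3", "B4": "a9"}

print(accuracy(proposed, reference))  # 0.75

# The reported result: 26 correct out of 29 proposed alignments.
print(round(100 * 26 / 29, 1))  # 89.7
```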

Conclusions

• Conclusions:

– duplicates help reclassification!

– a “semantic approach” may work when “syntactic approaches” fail (badly)

• If you buy the idea, you also get...

– a strategy to gradually learn how to reclassify gazetteer entries (as in a mediator)

– a strategy to mediate access to object catalogs in general (as long as it is possible to detect duplicates)

– (Gazetteer for the Brazilian territory:

• extracted from the ADL Gazetteer

• entries classified according to 4 different (aligned) schemes

• encapsulated by Web services)
