Extracting Geographical Gazetteers from the Internet

Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03

Page 1: Extracting Geographical Gazetteers from the Internet

Extracting Geographical Gazetteers from the Internet

Olga Uryupina, 30.05.03

Page 2: Extracting Geographical Gazetteers from the Internet

Overview

• Named Entity Recognition & Gazetteers
• Data
• Initial Algorithm
• Bootstrapping approach
• Evaluation
• ToDo

Page 3: Extracting Geographical Gazetteers from the Internet

NE Recognition

National Gallery of Scotland – The nucleus of the Gallery was formed by the Royal Institution's collection, later expanded by bequests and purchasing. Playfair designed (1850-57) the imposing classical building to house the works.

Page 4: Extracting Geographical Gazetteers from the Internet

State-of-the-art systems

Standard approaches usually combine:
• Rules
• Statistics
• Gazetteers

Classes distinguished:
• Person
• Organisation
• Location

Page 5: Extracting Geographical Gazetteers from the Internet

NE Recognition – with and without gazetteers

(Mikheev, Moens, and Grover, 1999) ran their system in different modes

               Full gazetteer         No gazetteer
               Recall   Precision     Recall   Precision
organisation   90%      93%           86%      85%
person         96%      98%           90%      95%
location       95%      94%           46%      59%

Page 6: Extracting Geographical Gazetteers from the Internet

Fine-grained NER

Washington wants protection for its peacekeepers. Until it gets its way the Administration is holding up renewal of the U.N. peacekeeping mandate in Bosnia.

Page 10: Extracting Geographical Gazetteers from the Internet

Manually created gazetteers

Available resources:
• Word lists from the Web
• Atlases & maps
• Digital gazetteers (e.g. Alexandria Digital Library)

Page 11: Extracting Geographical Gazetteers from the Internet

Manually created gazetteers – drawbacks

• Only positive data (no way to find out whether Mainau island does not exist or is simply not listed)

• Difficult to adjust when new classes are required

• Not available for most languages:

Aquisgrana

Page 12: Extracting Geographical Gazetteers from the Internet

Task

We can get rid of manually compiled gazetteers by using the Internet.

Task: subclassify locations using Internet counts (obtained from the AltaVista search engine).

Offline vs. Online processing

Page 13: Extracting Geographical Gazetteers from the Internet

Data

Manually created gazetteer (1260 items)

Classes:
• COUNTRY Pitcairn
• REGION Bavaria/Bayern
• RIVER Oder
• ISLAND Savai‘i
• MOUNTAIN Ohmberge
• CITY Nancy

Washington: 11xCITY, 1xMOUNTAIN, 2xISLAND, (31+1+1)xREGION

Page 14: Extracting Geographical Gazetteers from the Internet

Data

Gazetteer example

Toronto CITY

Totonicapan CITY, REGION

Trinidad CITY, RIVER, ISLAND

Page 15: Extracting Geographical Gazetteers from the Internet

Data

For each class, we sampled 100 items from the gazetteer. Because the lists overlap, this yields 520 distinct items (TRAINING data). The remaining items were used for TESTING.

CITY: ...
REGION: ...
COUNTRY: ...
RIVER: ..., Victoria, ...
ISLAND: ..., Victoria, ...
MOUNTAIN: ..., Victoria, ...

TRAINING: Victoria [+CITY, +REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]

Page 16: Extracting Geographical Gazetteers from the Internet

Initial system

For each class a set of keywords was created.

ISLAND: island, islands, archipelago

Page 17: Extracting Geographical Gazetteers from the Internet

Initial system

For each item X to be classified, queries of the form „X KEYWORD“ and „KEYWORD of X“ are sent to the AltaVista search engine.

Query                       Count     Count / #X
Newfoundland                622385
Newfoundland island         501       0.00080
island of Newfoundland      3505      0.00563
Newfoundland islands        7         0.00001
islands of Newfoundland     83        0.00013
Newfoundland archipelago    1         0.00000
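The normalisation in this example can be sketched in a few lines of Python. The counts are the ones read from the Newfoundland example above. The function names are illustrative assumptions, and no search engine is contacted.

```python
def build_queries(item, keywords):
    """Return the two query forms from the slide for every keyword."""
    queries = []
    for kw in keywords:
        queries.append(f"{item} {kw}")     # "X KEYWORD"
        queries.append(f"{kw} of {item}")  # "KEYWORD of X"
    return queries


def normalise(counts, item_count):
    """Divide each query count by the count for the bare item."""
    return {query: count / item_count for query, count in counts.items()}


# Counts as read from the Newfoundland example.
item_count = 622385
counts = {
    "Newfoundland island": 501,
    "island of Newfoundland": 3505,
    "Newfoundland islands": 7,
    "islands of Newfoundland": 83,
    "Newfoundland archipelago": 1,
}

features = normalise(counts, item_count)
for query, value in features.items():
    print(f"{query:26s} {value:.5f}")
```

Dividing by the count for the bare item keeps frequent and rare names comparable, which is presumably why the learners are given ratios rather than raw counts.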

Page 18: Extracting Geographical Gazetteers from the Internet

Initial system

Machine learners use the counts to induce classifications. Learners tested for this task:
• C4.5
• TiMBL
• Ripper

Page 19: Extracting Geographical Gazetteers from the Internet

Initial system – drawbacks

Still needs manually created resources:
• Set of patterns
• Initial gazetteer (TRAINING)

Only online (slow) processing: the system can classify only items provided by the user; it cannot extract new names itself

Page 20: Extracting Geographical Gazetteers from the Internet

Bootstrapping

Riloff & Jones, 1999 – Bootstrapping for IE task

[Diagram: ITEMS <-> PATTERNS]

Page 21: Extracting Geographical Gazetteers from the Internet

Bootstrapping

Main problem – noise: the pattern set can get infected

Remedies:
• Vaccine (external algorithm for evaluating patterns)
• Stop lists
• Human experts

Page 22: Extracting Geographical Gazetteers from the Internet

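Bootstrapping loop (diagram):

Initial gazetteer -> Extraction items
-> Collecting patterns
-> Discarding most general patterns
-> Learning classifiers (yields Extraction patterns and the Learned high-precision classifier)
-> Collecting items
-> Discarding common names
-> Classifying items
-> Extraction items for the next loop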

Page 25: Extracting Geographical Gazetteers from the Internet

Collecting patterns (step 1)

• Go to AltaVista
• ask for an item
• download the first n pages
• match with a simple regexp -> patterns
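The matching part of step 1 might look like the sketch below. The slide does not give the actual regexp, so the tokenisation, the `window` parameter and the function name are assumptions; the sketch only handles single-word items and works on text that has already been downloaded.

```python
import re
from collections import Counter


def collect_patterns(text, item, window=1):
    """Collect context patterns around `item`, with the item replaced by X.

    For each occurrence, emit left-context ("w X") and right-context ("X w")
    patterns of 1..window words. The window size is an assumption.
    """
    words = re.findall(r"\w+", text.lower())
    target = item.lower()
    patterns = Counter()
    for i, word in enumerate(words):
        if word != target:
            continue
        for n in range(1, window + 1):
            left = words[max(0, i - n):i]
            right = words[i + 1:i + 1 + n]
            if len(left) == n:   # skip truncated context at the text start
                patterns[" ".join(left) + " X"] += 1
            if len(right) == n:  # skip truncated context at the text end
                patterns["X " + " ".join(right)] += 1
    return patterns


page = "The island of Newfoundland is large. Ferries sail to Newfoundland and back."
found = collect_patterns(page, "Newfoundland", window=2)
print(found.most_common(5))
```

Summing such counters over all items of a class gives a ranked pattern list of the kind shown in the step 1 example, although the exact scoring used there is not stated.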

Page 26: Extracting Geographical Gazetteers from the Internet

Example – step 1

10 best patterns for ISLAND:
of X 70
the X 60
X and 58
X the 55
to X 53
in X 52
and X 47
X is 45
X in 45
on X 45

Page 28: Extracting Geographical Gazetteers from the Internet

Rescoring (step 2)

Goal: discard overly general patterns

$$ s'(p, c_i) = s(p, c_i) - \sum_{j \neq i} s(p, c_j)\, a(c_j, c_i) $$

$s(p, c)$ – score of pattern p for class c

$\sum_{j \neq i} s(p, c_j)\, a(c_j, c_i)$ – penalty for appearing in more than one class
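A minimal sketch of the rescoring step, reading the rule on this slide as s'(p, c_i) = s(p, c_i) - sum over j != i of s(p, c_j) * a(c_j, c_i). The slide does not define a(c_j, c_i), so the sketch takes it as a caller-supplied function; the toy scores and the uniform weight are hypothetical.

```python
def rescore(scores, weight):
    """Subtract from each pattern score a penalty for other classes.

    scores: {class_name: {pattern: score}}
    weight: function (c_j, c_i) -> a(c_j, c_i); not defined on the slide,
            so it must be supplied by the caller.
    """
    rescored = {}
    for c_i, pattern_scores in scores.items():
        rescored[c_i] = {}
        for pattern, score in pattern_scores.items():
            penalty = sum(
                scores[c_j].get(pattern, 0) * weight(c_j, c_i)
                for c_j in scores
                if c_j != c_i
            )
            rescored[c_i][pattern] = score - penalty
    return rescored


# Toy scores (hypothetical, not taken from the slides).
scores = {
    "ISLAND": {"of X": 70, "X island": 20},
    "RIVER": {"of X": 65, "X river": 25},
}


def uniform(c_j, c_i):
    """Assumed weight of 1.0 for every class pair, for illustration only."""
    return 1.0


result = rescore(scores, uniform)
print(result["ISLAND"])
```

With these toy numbers the general pattern "of X" drops below the class-specific "X island", which is the effect the step 2 example shows for the real data.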

Page 29: Extracting Geographical Gazetteers from the Internet

Example – step 2

10 best patterns for ISLAND:
X island 17
island of X 9
X islands 8
island X 7
islands X 7
insel X 7
the island X 6
X elects 5
of X islands 5
zealand X 4

Page 31: Extracting Geographical Gazetteers from the Internet

Learning classifiers (step 3)

The 20 best patterns are used to train Ripper (as in the initial system)

Produced classifiers:
• high-recall
• high-accuracy
• high-precision

Page 32: Extracting Geographical Gazetteers from the Internet

Example – step 3

• High-recall classifier for ISLAND:
if #(„X island“)/#X >= 0.003879
    classify X as +ISLAND
if #(„and X islands“)/#X >= 0.000002
    classify X as +ISLAND
if #(„insel X“)/#X >= 0.017099
    classify X as +ISLAND
otherwise
    classify X as –ISLAND

• Extraction patterns: „X island“, „and X islands“, „insel X“
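The high-recall rule list above is a disjunction of threshold tests on normalised counts, which can be written down directly. The thresholds are copied from the slide; the function name, the data layout and the example counts are assumptions made for illustration.

```python
# Thresholds copied from the high-recall ISLAND classifier on the slide.
HIGH_RECALL_ISLAND_RULES = [
    ("X island", 0.003879),
    ("and X islands", 0.000002),
    ("insel X", 0.017099),
]


def classify_island(counts, item_count, rules=HIGH_RECALL_ISLAND_RULES):
    """Return True (+ISLAND) if any rule fires, otherwise False (-ISLAND).

    counts: {pattern: hit count for the pattern instantiated with the item}
    item_count: hit count for the item alone, written #X on the slide
    """
    if item_count <= 0:
        return False  # no evidence at all: fall through to the default class
    for pattern, threshold in rules:
        if counts.get(pattern, 0) / item_count >= threshold:
            return True
    return False


# Hypothetical counts for two candidate names.
print(classify_island({"X island": 500}, 100000))
print(classify_island({"X island": 100}, 100000))
```

The patterns that appear in the rules are also the ones reused as extraction patterns in the later steps.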

Page 33: Extracting Geographical Gazetteers from the Internet

One more example – step 3

• High-accuracy classifier for ISLAND:
if #(„X island“)/#X >= 0.000636
    classify X as +ISLAND
if #(„and X islands“)/#X >= 0.000002 and #(„X sea“)/#X >= 0.000013 and #(„X geography“) < 13
    classify X as +ISLAND
if #(„X islands“)/#X >= 0.000056 and #(„pacific islands X“)/#X >= 0.000006
    classify X as +ISLAND
otherwise
    classify X as –ISLAND

Page 36: Extracting Geographical Gazetteers from the Internet

Collecting and discarding items (steps 4&5)

The same procedure as in step 1: go to AltaVista, ask for extraction patterns (cf. step 3), ...

Discarding: common names (beginning with lower-case letters), stop words (not necessary, but saves time)
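The discarding step can be sketched as a simple filter. The stop list below is hypothetical; the slide only says that candidates beginning with a lower-case letter are dropped and that a stop list is an optional shortcut.

```python
STOP_WORDS = {"About", "All", "The"}  # hypothetical stop list


def discard_candidates(candidates, stop_words=STOP_WORDS):
    """Keep candidates that start with an upper-case letter and are not stop words."""
    kept = []
    for name in candidates:
        if not name or not name[0].isupper():
            continue  # common name: begins with a lower-case letter
        if name in stop_words:
            continue  # optional filter: avoids classifying obvious non-names
        kept.append(name)
    return kept


print(discard_candidates(["About", "Achill", "island", "Akutan", "All"]))
```

Anything the filter lets through is still checked by the high-precision classifier in step 6, so the filter only needs to be cheap, not exact.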

Page 37: Extracting Geographical Gazetteers from the Internet

Example – steps 4 and 5

Extracted islands (alphabetically):

About, Abyss, Achill, Active, Adatara, Akutan, Alaska, Alaskan, Albarella, All, Amelia, American

Page 39: Extracting Geographical Gazetteers from the Internet

Classifying (step 6)

The high-precision classifier (cf. step 3) is run on the collected items:
• rejected items are discarded
• accepted items are used for extraction at the next loop
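Steps 1 to 6 can be tied together in a short control-flow sketch. Every callable passed in is a placeholder for one step, not the system's actual implementation. Whether accepted items replace or extend the extraction items is not stated on the slides; the sketch assumes they replace them.

```python
def bootstrap(seed_items, collect_patterns, rescore, learn, collect_items,
              discard, n_loops=2):
    """Run the six-step loop; each callable stands in for one step.

    learn(patterns) must return (extraction_patterns, accept), where
    accept(item) is the high-precision classifier returning True or False.
    """
    extraction_items = set(seed_items)
    gazetteer = set(seed_items)
    for _ in range(n_loops):
        patterns = collect_patterns(extraction_items)     # step 1
        patterns = rescore(patterns)                      # step 2
        extraction_patterns, accept = learn(patterns)     # step 3
        candidates = collect_items(extraction_patterns)   # step 4
        candidates = discard(candidates)                  # step 5
        accepted = {c for c in candidates if accept(c)}   # step 6
        gazetteer |= accepted
        extraction_items = accepted  # assumption: accepted items seed the next loop
    return gazetteer


# Toy stand-ins for the six steps (all hypothetical).
result = bootstrap(
    seed_items={"Achill"},
    collect_patterns=lambda items: ["X island"],
    rescore=lambda patterns: patterns,
    learn=lambda patterns: (patterns, lambda item: item != "Big"),
    collect_items=lambda patterns: ["Akutan", "Big", "about"],
    discard=lambda cands: [c for c in cands if c[0].isupper()],
    n_loops=1,
)
print(sorted(result))
```

Passing the steps in as functions keeps the loop itself independent of the search engine and of the learner.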

Page 40: Extracting Geographical Gazetteers from the Internet

Example – step 6

Extracted islands (alphabetically):

Achill, Akutan, Albarella, Amelia, Andaman, Ascension, Bainbridge, Baltrum, Beaver, Big, Block, Bouvet

Page 41: Extracting Geographical Gazetteers from the Internet

Evaluation

Classifiers:
• initial system
• bootstrapping from the seed gazetteer
• bootstrapping from positive examples only

Item lists:
• bootstrapping from the seed gazetteer

Page 42: Extracting Geographical Gazetteers from the Internet

Initial system – evaluation

Class Accuracy

CITY 74.3%

ISLAND 95.8%

RIVER 88.8%

MOUNTAIN 88.7%

COUNTRY 98.8%

REGION 82.3%

average 88.1%

Page 43: Extracting Geographical Gazetteers from the Internet

Bootstrapping – evaluation

Class      Initial system   After the 1st loop   After the 2nd loop
CITY       74.3%            51.2%                62.0%
ISLAND     95.8%            91.4%                96.4%
RIVER      88.8%            91.5%                89.6%
MOUNTAIN   88.7%            89.1%                88.8%
COUNTRY    98.8%            99.2%                99.6%
REGION     82.3%            80.4%                82.6%
average    88.1%            83.8%                86.5%

Page 44: Extracting Geographical Gazetteers from the Internet

Comparing the performance

RIVER, MOUNTAIN, COUNTRY – the new system is better!

ISLAND – the new system improved and became better after the 2nd loop.

REGION – infected category („departments of X“); however, the system is improving.

CITY – very heterogeneous class (homonymy); 1st loop – „streets of X“, 2nd loop – „km from X“, „ort X“.

Page 45: Extracting Geographical Gazetteers from the Internet

Comparing the systems

Bootstrapping (vs. the initial system):
+ patterns learned automatically
+ word lists produced
- cheap seed gazetteer

Problem: it's easy to download huge lists of islands etc., but very difficult to check them and classify properly

Page 46: Extracting Geographical Gazetteers from the Internet

Learning from positives

CITY: ...
REGION: ...
COUNTRY: ...
RIVER: ..., Victoria, ...
ISLAND: ..., Victoria, ...
MOUNTAIN: ..., Victoria, ...

Before: => TRAINING: Victoria [+CITY, +REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]

Now: => TRAINING: Victoria [-CITY, -REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
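On one reading of this slide, the difference between "Before" and "Now" is where the training labels come from: the full gazetteer entry, or only the sampled positive lists. The sketch below illustrates that reading with the Victoria example; the function names and data layout are assumptions.

```python
CLASSES = ["CITY", "REGION", "COUNTRY", "RIVER", "ISLAND", "MOUNTAIN"]


def labels_from_gazetteer(item, gazetteer):
    """Before: labels come from the full gazetteer entry for the item."""
    return {c: c in gazetteer[item] for c in CLASSES}


def labels_from_samples(item, samples):
    """Now: an item is positive for a class only if it was sampled for it."""
    return {c: item in samples[c] for c in CLASSES}


# Victoria as on the slide: listed under five classes in the gazetteer,
# but sampled only for RIVER, ISLAND and MOUNTAIN.
gazetteer = {"Victoria": {"CITY", "REGION", "RIVER", "ISLAND", "MOUNTAIN"}}
samples = {
    "CITY": set(), "REGION": set(), "COUNTRY": set(),
    "RIVER": {"Victoria"}, "ISLAND": {"Victoria"}, "MOUNTAIN": {"Victoria"},
}

before = labels_from_gazetteer("Victoria", gazetteer)
now = labels_from_samples("Victoria", samples)
print(before)
print(now)
```

Under the positives-only setting some negative labels are therefore wrong (Victoria is a city), which is a kind of noise that the learner has to tolerate.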

Page 47: Extracting Geographical Gazetteers from the Internet

Initial system – evaluation

Class      Precompiled gazetteer   Positives only
CITY       74.3%                   50.3%
ISLAND     95.8%                   94.1%
RIVER      88.8%                   91.0%
MOUNTAIN   88.7%                   89.3%
COUNTRY    98.8%                   99.6%
REGION     82.3%                   86.9%
average    88.1%                   85.2%

Page 48: Extracting Geographical Gazetteers from the Internet

Bootstrapping with positives only – evaluation

Class 1st loop 2nd loop

CITY 39.3% 44.1%

ISLAND 94.5% 95.8%

RIVER 91.2% 91.1%

MOUNTAIN 90.1% 91.2%

COUNTRY 98.7% 99.6%

REGION 86.5% 81.6%

average 83.4% 83.9%

Page 49: Extracting Geographical Gazetteers from the Internet

New items

New ISLANDs:
true islands 121 (90.3%)
    found in the atlases 93
    not found 28
descriptions 5 (3.7%)
parts of names 3 (2.2%)
mistakes 5 (3.7%)
_______
all 134

Page 50: Extracting Geographical Gazetteers from the Internet

Conclusion

Advantages of our approach:
• very little manually collected data required (seed gazetteer)
• no sophisticated engineering – patterns produced automatically
• on-line classifiers provide negative information and are applicable to any entity
• new items (off-line gazetteer) collected automatically

Page 51: Extracting Geographical Gazetteers from the Internet

ToDo

• new classes -> hierarchy
• multi-word expressions
• more elaborate learning from positive examples
• determine locations (where is X?)