Extracting Geographical Gazetteers from the Internet
Olga Uryupina
30.05.03
Overview
• Named Entity Recognition & Gazetteers
• Data
• Initial Algorithm
• Bootstrapping approach
• Evaluation
• ToDo
NE Recognition
National Gallery of Scotland – The nucleus of the Gallery was formed by the Royal Institution's collection, later expanded by bequests and purchasing. Playfair designed (1850-57) the imposing classical building to house the works.
State-of-the-art systems
Standard approaches usually combine
• Rules
• Statistics
• Gazetteers
Classes distinguished:
• Person
• Organisation
• Location
NE Recognition – with and without gazetteers
(Mikheev, Moens, and Grover, 1999) ran their system in different modes
              Full gazetteer        No gazetteer
              Recall   Precision    Recall   Precision
organisation  90%      93%          86%      85%
person        96%      98%          90%      95%
location      95%      94%          46%      59%
Fine-grained NER
Washington wants protection for its peacekeepers. Until it gets its way the Administration is holding up renewal of the U.N. peacekeeping mandate in Bosnia.
Manually created gazetteers
Available resources:
• Word lists from the Web
• Atlases & maps
• Digital gazetteers (e.g. Alexandria Digital Library)
Manually created gazetteers – drawbacks
• Only positive data (no way to find out whether Mainau island does not exist or is simply not listed)
• Difficult to adjust when new classes are required
• Not available for most languages: Aquisgrana
Task
We can get rid of manually compiled gazetteers by using the Internet.
Task: subclassify locations using Internet counts (obtained from the AltaVista search engine).
Offline vs. Online processing
Data
Manually created gazetteer (1260 items)
Classes:
• COUNTRY Pitcairn
• REGION Bavaria/Bayern
• RIVER Oder
• ISLAND Savai‘i
• MOUNTAIN Ohmberge
• CITY Nancy
Washington: 11xCITY, 1xMOUNTAIN, 2xISLAND, (31+1+1)xREGION
Data
Gazetteer example
Toronto       CITY
Totonicapan   CITY, REGION
Trinidad      CITY, RIVER, ISLAND
Data
For each class we sample 100 items from the gazetteer. As the lists overlap, this results in 520 different items (TRAINING data). The rest was used for TESTING.
CITY: ...
REGION: ...
COUNTRY: ...
RIVER: ..., Victoria, ...
ISLAND: ..., Victoria, ...
MOUNTAIN: ..., Victoria, ...
TRAINING: Victoria [+CITY, +REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
Initial system
For each class a set of keywords was created.
ISLAND: island, islands, archipelago
Initial system
For each item X to be classified, queries of the form "X KEYWORD" and "KEYWORD of X" are sent to the AltaVista search engine.
Query                      Count     Count / #Newfoundland
Newfoundland               622385
Newfoundland island        501       0.00080
island of Newfoundland     3505      0.00563
Newfoundland islands       7         0.00001
islands of Newfoundland    83        0.00013
Newfound. archipelago      1         0.00000
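The ratios in the Newfoundland example can be reproduced with a few lines of Python. This is only a sketch: it assumes the counts on the slide read 501, 3505, 7, 83 and 1 against 622385 hits for the bare name, and the function name `relative_counts` is made up for illustration.

```python
def relative_counts(item_count, query_counts):
    """Divide each query's hit count by the hit count of the bare item."""
    return {query: count / item_count for query, count in query_counts.items()}

# Counts as read from the Newfoundland example (assumed split of the digits).
features = relative_counts(622385, {
    "Newfoundland island": 501,
    "island of Newfoundland": 3505,
    "Newfoundland islands": 7,
    "islands of Newfoundland": 83,
    "Newfound. archipelago": 1,
})

for query, value in features.items():
    print(f"{query}: {value:.5f}")
```

Normalising by the count of the bare name keeps frequent and rare names comparable, which is presumably why the learners are given ratios rather than raw counts.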
Initial system
Machine learners use the counts to induce classifications. Learners tested for this task:
• C4.5
• TiMBL
• Ripper
Initial system – drawbacks
Still needs manually created resources:
• Set of patterns
• Initial gazetteer (TRAINING)
Only online (slow) processing – the system can only classify items provided by the user, but cannot extract new names itself
Bootstrapping
Riloff & Jones, 1999 – Bootstrapping for IE task
[Diagram: ITEMS and PATTERNS]
Bootstrapping
Main problem – noise: the pattern set can get infected
Remedies:
• Vaccine (external algorithm for evaluating patterns)
• Stop lists
• Human experts
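[Diagram: bootstrapping loop]
Initial gazetteer -> Extraction items -> Collecting patterns -> Discarding most general patterns -> Learning classifiers -> Extraction patterns -> Collecting items -> Discarding common names -> Classifying items (with the learned high-precision classifier) -> Extraction items (next loop)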
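The loop in the diagram can be written down as a small driver function. This is a rough sketch, not the system itself: every callable passed in (`collect_patterns`, `rescore`, `learn`, `collect_items`, `drop_common`) is a hypothetical placeholder for one box of the diagram, and merging the accepted items into the item set is an assumption about how "the next loop" is seeded.

```python
def bootstrap(seed_items, collect_patterns, rescore, learn,
              collect_items, drop_common, loops=2):
    """Run the six-step loop `loops` times; all step functions are supplied by the caller."""
    items = set(seed_items)
    classifier = None
    for _ in range(loops):
        patterns = collect_patterns(items)                   # step 1
        patterns = rescore(patterns)                         # step 2
        classifier, extraction_patterns = learn(patterns)    # step 3
        candidates = collect_items(extraction_patterns)      # step 4
        candidates = drop_common(candidates)                 # step 5
        accepted = {c for c in candidates if classifier(c)}  # step 6
        items |= accepted  # assumption: accepted items join the extraction items
    return items, classifier

# Toy stand-ins for the real steps, only to show the data flow.
items, clf = bootstrap(
    seed_items={"Achill"},
    collect_patterns=lambda its: ["X island", "of X"],
    rescore=lambda pats: [p for p in pats if p != "of X"],
    learn=lambda pats: (lambda name: name != "About", pats),
    collect_items=lambda pats: ["About", "akutan", "Amelia"],
    drop_common=lambda names: [n for n in names if n[0].isupper()],
    loops=1,
)
print(sorted(items))
```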
Collecting patterns (step 1)
• Go to AltaVista
• ask for an item
• download first n pages
• match with a simple regexp -> patterns
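The matching part of step 1 might look as follows. The slides do not give the actual regexp, so this sketch assumes a context of one word on each side of the item and lower-cases the context words; the function name `collect_patterns` and the sample page are invented for illustration, and the download from the search engine is left out.

```python
import re
from collections import Counter

def collect_patterns(item, pages):
    """Count one-word left and right contexts of `item`, written as patterns over X."""
    # Optional word before the item and optional word after it.
    regex = re.compile(
        r"(?:\b(\w+)\s+)?\b" + re.escape(item) + r"\b(?:\s+(\w+))?")
    counts = Counter()
    for page in pages:
        for left, right in regex.findall(page):
            if left:
                counts[f"{left.lower()} X"] += 1
            if right:
                counts[f"X {right.lower()}"] += 1
    return counts

# Hypothetical page text standing in for a downloaded document.
pages = ["The island of Achill is large. Achill island lies west of Ireland."]
counts = collect_patterns("Achill", pages)
print(counts.most_common())
```

Summing such counters over all items of a class gives a ranked pattern list of the kind shown in the step 1 example.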
Example – step 1
10 best patterns for ISLAND:
of X     70
the X    60
X and    58
X the    55
to X     53
in X     52
and X    47
X is     45
X in     45
on X     45
Rescoring (step 2)
Goal: discard too general patterns
$$ s'(p, c_i) = s(p, c_i) - \sum_{j \neq i} s(p, c_j)\, a(c_j, c_i) $$

$s(p, c)$ – score of pattern p for class c
$\sum_{j \neq i} s(p, c_j)\, a(c_j, c_i)$ – penalty for appearing in more than one class
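One way to read the rescoring step is: subtract from each pattern's score for a class a weighted sum of its scores for all other classes. The sketch below assumes that reading and, since the slides do not define the weight a, uses a single constant `a` for every pair of classes. The CITY count is a made-up toy value; the ISLAND counts are taken from the step 1 and step 2 examples.

```python
def rescore(scores, a=1.0):
    """scores: {class: {pattern: s(p, c)}}. Returns {class: {pattern: s'(p, c)}}."""
    rescored = {}
    for ci, patterns in scores.items():
        rescored[ci] = {}
        for pattern, score in patterns.items():
            # Penalty: the pattern's (weighted) scores in every other class.
            penalty = sum(a * other.get(pattern, 0)
                          for cj, other in scores.items() if cj != ci)
            rescored[ci][pattern] = score - penalty
    return rescored

scores = {
    "ISLAND": {"of X": 70, "X island": 17},
    "CITY": {"of X": 65},  # hypothetical count, for illustration only
}
new_scores = rescore(scores)
print(new_scores["ISLAND"])
```

A general pattern such as "of X" scores highly in several classes and is pushed down, while a class-specific pattern such as "X island" keeps its score.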
Example – step 2
10 best patterns for ISLAND:
X island        17
island of X     9
X islands       8
island X        7
islands X       7
insel X         7
the island X    6
X elects        5
of X islands    5
zealand X       4
Learning classifiers (step 3)
20 best patterns are used to train Ripper (as in the initial system)
Produced classifiers:
• high-recall
• high-accuracy
• high-precision
Example – step 3
• High-recall classifier for ISLAND:
if #("X island")/#X >= 0.003879 classify X as +ISLAND
if #("and X islands")/#X >= 0.000002 classify X as +ISLAND
if #("insel X")/#X >= 0.017099 classify X as +ISLAND
otherwise classify X as –ISLAND
• Extraction patterns: "X island", "and X islands", "insel X"
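The high-recall rule list translates directly into code. The thresholds below are the ones on the slide; the function name `is_island`, the dictionary layout and the counts in the usage lines are assumptions made for this sketch.

```python
# (pattern, threshold) pairs from the high-recall ISLAND classifier.
HIGH_RECALL_ISLAND = [
    ("X island", 0.003879),
    ("and X islands", 0.000002),
    ("insel X", 0.017099),
]

def is_island(item_count, pattern_counts, rules=HIGH_RECALL_ISLAND):
    """Return True (+ISLAND) if any pattern's relative count reaches its threshold."""
    if item_count == 0:
        return False  # no hits for the bare name: nothing to normalise by
    return any(pattern_counts.get(pattern, 0) / item_count >= threshold
               for pattern, threshold in rules)

# Hypothetical counts: #X = 1000 hits for the bare name.
print(is_island(1000, {"X island": 5}))
print(is_island(1000, {"X island": 1}))
```

Each rule fires on its own, so the classifier is a plain disjunction; the otherwise branch is the `False` result of `any`.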
One more example – step 3
• High-accuracy classifier for ISLAND:
if #("X island")/#X >= 0.000636 classify X as +ISLAND
if #("and X islands")/#X >= 0.000002 and #("X sea")/#X >= 0.000013 and #("X geography") < 13 classify X as +ISLAND
if #("X islands")/#X >= 0.000056 and #("pacific islands X")/#X >= 0.000006 classify X as +ISLAND
otherwise classify X as –ISLAND
Collecting and discarding items (steps 4&5)
The same procedure as in step 1: go to AltaVista, ask for extraction patterns (cf. step 3), ...
Discarding: common names (beginning with lower-case letters), stop words (not necessary, but saves time)
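The discarding step amounts to a simple filter. In this sketch the stop list is a hypothetical placeholder (the slides do not show the one actually used), and the function name `discard_common` is invented.

```python
STOP_WORDS = {"The", "This", "And"}  # hypothetical stop list

def discard_common(candidates, stop_words=STOP_WORDS):
    """Keep candidates that start with an upper-case letter and are not stop words."""
    return [c for c in candidates
            if c and c[0].isupper() and c not in stop_words]

kept = discard_common(["the", "The", "Achill", "island", "Akutan"])
print(kept)
```

Anything that survives this filter is still only a candidate; words such as "About" or "All" are left for the classifier in step 6 to reject.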
Example – steps 4 and 5
Extracted islands (alphabetically):
About
Abyss
Achill
Active
Adatara
Akutan
Alaska
Alaskan
Albarella
All
Amelia
American
Classifying (step 6)
The high-precision classifier (cf. step 3) is run on the collected items:
• rejected items are discarded
• accepted items are used for extraction at the next loop
Example – step 6
Extracted islands (alphabetically):
Achill
Akutan
Albarella
Amelia
Andaman
Ascension
Bainbridge
Baltrum
Beaver
Big
Block
Bouvet
Evaluation
Classifiers:
• initial system
• bootstrapping from the seed gazetteer
• bootstrapping from positive examples only
Item lists:
• bootstrapping from the seed gazetteer
Initial system – evaluation
Class      Accuracy
CITY       74.3%
ISLAND     95.8%
RIVER      88.8%
MOUNTAIN   88.7%
COUNTRY    98.8%
REGION     82.3%
average    88.1%
Bootstrapping – evaluation
Class      Initial system   After the 1st loop   After the 2nd loop
CITY       74.3%            51.2%                62.0%
ISLAND     95.8%            91.4%                96.4%
RIVER      88.8%            91.5%                89.6%
MOUNTAIN   88.7%            89.1%                88.8%
COUNTRY    98.8%            99.2%                99.6%
REGION     82.3%            80.4%                82.6%
average    88.1%            83.8%                86.5%
Comparing the performance
RIVER, MOUNTAIN, COUNTRY – the new system is better!
ISLAND – the new system improved and became better after the 2nd loop.
REGION – infected category ("departments of X"); however, the system is improving.
CITY – very heterogeneous class (homonymy); 1st loop – "streets of X", 2nd loop – "km from X", "ort X".
Comparing the systems
Bootstrapping (vs. the initial system):
+ patterns learned automatically
+ word lists produced
- cheap seed gazetteer
Problem: it's easy to download huge lists of islands etc., but very difficult to check them and classify properly
Learning from positives
CITY: ...
REGION: ...
COUNTRY: ...
RIVER: ..., Victoria, ...
ISLAND: ..., Victoria, ...
MOUNTAIN: ..., Victoria, ...
Before: => TRAINING: Victoria [+CITY, +REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
Now: => TRAINING: Victoria [-CITY, -REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
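The difference between the two training set-ups can be made concrete. In the positives-only setting an item is labelled + for a class only if it occurs in that class's sampled list, and - for every other class, whatever the full gazetteer says. The sketch below assumes exactly that; the function name and the tiny sample lists are illustrative, built from names that appear on the slides.

```python
def positives_only_labels(samples):
    """samples: {class: set of sampled names}. Returns {name: {class: bool}}."""
    classes = list(samples)
    names = set().union(*samples.values())
    return {name: {c: name in samples[c] for c in classes} for name in names}

# Toy sample lists (not the real 100-item samples).
samples = {
    "CITY": {"Nancy"},
    "RIVER": {"Oder", "Victoria"},
    "ISLAND": {"Victoria"},
}
labels = positives_only_labels(samples)
print(labels["Victoria"])
```

Victoria comes out as -CITY here although the gazetteer lists it as a city, which is the kind of label noise the positives-only results have to cope with.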
Initial system – evaluation
Class      Precompiled gazetteer   Positives only
CITY       74.3%                   50.3%
ISLAND     95.8%                   94.1%
RIVER      88.8%                   91.0%
MOUNTAIN   88.7%                   89.3%
COUNTRY    98.8%                   99.6%
REGION     82.3%                   86.9%
average    88.1%                   85.2%
Bootstrapping with positives only – evaluation
Class      1st loop   2nd loop
CITY       39.3%      44.1%
ISLAND     94.5%      95.8%
RIVER      91.2%      91.1%
MOUNTAIN   90.1%      91.2%
COUNTRY    98.7%      99.6%
REGION     86.5%      81.6%
average    83.4%      83.9%
New items
New ISLANDs:
true islands              121 (90.3%)
   found in the atlases   93
   not found              28
descriptions              5 (3.7%)
parts of names            3 (2.2%)
mistakes                  5 (3.7%)
all                       134
Conclusion
Advantages of our approach:
• very little manually collected data required (seed gazetteer)
• no sophisticated engineering – patterns produced automatically
• on-line classifiers provide negative information and are applicable to any entity
• new items (off-line gazetteer) collected automatically
ToDo
• new classes -> hierarchy
• multi-word expressions
• more elaborate learning from positive examples
• determine locations (where is X?)