Extracting Geographical Gazetteers from the Internet

Olga Uryupina, 30.05.03



  • Extracting Geographical Gazetteers from the Internet
    Olga Uryupina
    30.05.03

  • Overview
    Named Entity Recognition & Gazetteers
    Data
    Initial Algorithm
    Bootstrapping approach
    Evaluation
    ToDo

  • NE Recognition
    National Gallery of Scotland
    The nucleus of the Gallery was formed by the Royal Institution's collection, later expanded by bequests and purchasing. Playfair designed (1850-57) the imposing classical building to house the works.

  • State-of-the-art systems
    Standard approaches usually combine:
      Rules
      Statistics
      Gazetteers
    Classes distinguished:
      Person
      Organisation
      Location

  • NE Recognition with and without gazetteers
    (Mikheev, Moens, and Grover, 1999) ran their system in different modes:

                    Full gazetteer         No gazetteer
                    Recall   Precision     Recall   Precision
    organisation    90%      93%           86%      85%
    person          96%      98%           90%      95%
    location        95%      94%           46%      59%

  • Fine-grained NER
    Washington wants protection for its peacekeepers. Until it gets its way the Administration is holding up renewal of the U.N. peacekeeping mandate in Bosnia.

  • Manually created gazetteers
    Available resources:
      Word lists from the Web
      Atlases & maps
      Digital gazetteers (e.g. Alexandria Digital Library)

  • Manually created gazetteers: drawbacks
    Only positive data (no way to find out whether Mainau island does not exist or is simply not listed)
    Difficult to adjust when new classes are required
    Not available for most languages, e.g. Aquisgrana (the Italian name of Aachen)

  • Task
    We can get rid of manually compiled gazetteers by using the Internet.
    Task: subclassify locations using Internet counts (obtained from the AltaVista search engine).
    Offline vs. online processing

  • Data
    Manually created gazetteer (1260 items)
    Classes (with one example each):
      COUNTRY    Pitcairn
      REGION     Bavaria/Bayern
      RIVER      Oder
      ISLAND     Savaii
      MOUNTAIN   Ohmberge
      CITY       Nancy

    Washington: 11xCITY, 1xMOUNTAIN, 2xISLAND, (31+1+1)xREGION

  • Data
    Gazetteer example:

    Toronto       CITY
    Totonicapan   CITY, REGION
    Trinidad      CITY, RIVER, ISLAND

  • Data
    For each class we sample 100 items from the gazetteer. As the lists overlap, this results in 520 different items (TRAINING data). The rest was used for TESTING.

    CITY: ...
    REGION: ...
    COUNTRY: ...
    RIVER: ..., Victoria, ...
    ISLAND: ..., Victoria, ...
    MOUNTAIN: ..., Victoria, ...
    TRAINING: Victoria [+CITY, +REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
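The sampling step can be sketched roughly as follows. This is only an illustration: it assumes the gazetteer is held as a mapping from class name to a set of names, and the function name `sample_training`, the `seed` parameter and the toy gazetteer are hypothetical (the toy entries reuse names from the gazetteer example slides).

```python
import random

# Toy gazetteer built from names shown on the slides (not the real 1260-item resource).
TOY_GAZETTEER = {
    "CITY": {"Toronto", "Totonicapan", "Trinidad", "Nancy"},
    "REGION": {"Totonicapan", "Bavaria"},
    "RIVER": {"Trinidad", "Oder"},
    "ISLAND": {"Trinidad", "Savaii"},
}

def sample_training(gazetteer, per_class, seed=0):
    """Sample up to `per_class` names from each class list.

    Every sampled name is labelled against the full gazetteer
    (True = +CLASS, False = -CLASS); all unsampled names form the test set.
    """
    rng = random.Random(seed)
    sampled = set()
    for names in gazetteer.values():
        pool = sorted(names)  # sorted for reproducible sampling
        sampled.update(rng.sample(pool, min(per_class, len(pool))))
    all_names = set().union(*gazetteer.values())
    training = {
        name: {cls: name in members for cls, members in gazetteer.items()}
        for name in sorted(sampled)
    }
    testing = all_names - sampled
    return training, testing
```

Because the class lists overlap, the number of distinct training items is usually smaller than the number of classes times `per_class`, which is the effect the slide describes (520 items rather than 600).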

  • Initial system
    For each class a set of keywords was created.

    ISLAND: island, islands, archipelago

  • Initial system
    For each item X to be classified, queries of the form "X KEYWORD" and "KEYWORD of X" are sent to the AltaVista search engine.

    Query                       Count     Count / #X
    Newfoundland                622385
    Newfoundland island         501       0.00080
    island of Newfoundland      3505      0.00563
    Newfoundland islands        7         0.00001
    islands of Newfoundland     83        0.00013
    Newfound. archipelago       10 (?)    0.00000
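A minimal sketch of how the queries and the relative counts could be produced is given below. It assumes the hit counts have already been obtained from a search engine; no search API is called, and the function names `build_queries` and `relative_counts` are hypothetical. The two example counts are the ones read off the Newfoundland slide.

```python
def build_queries(item, keywords):
    """Return the 'X KEYWORD' and 'KEYWORD of X' queries for one item."""
    queries = []
    for keyword in keywords:
        queries.append(f"{item} {keyword}")
        queries.append(f"{keyword} of {item}")
    return queries

def relative_counts(item_count, query_counts):
    """Normalise each query's hit count by the hit count of the bare item (#X)."""
    return {query: count / item_count for query, count in query_counts.items()}

# Counts as read from the Newfoundland slide.
island_keywords = ["island", "islands", "archipelago"]
newfoundland_hits = 622385
observed = {"Newfoundland island": 501, "island of Newfoundland": 3505}
features = relative_counts(newfoundland_hits, observed)
```

Normalising by the count of the bare item keeps frequent and rare names comparable, which is presumably why the learners are given ratios rather than raw counts.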

  • Initial system
    Machine learners use the counts to induce classifications.
    Learners tested for this task:
      C4.5
      TiMBL
      Ripper

  • Initial system: drawbacks
    Still needs manually created resources:
      Set of patterns
      Initial gazetteer (TRAINING)
    Only online (slow) processing: the system can only classify items provided by the user, but cannot extract new names itself

  • Bootstrapping
    Riloff & Jones, 1999: bootstrapping for an IE task

    ITEMS <-> PATTERNS

  • Bootstrapping
    Main problem: noise (the pattern set can get infected)
    Remedies:
      Vaccine (external algorithm for evaluating patterns)
      Stop lists
      Human experts

  • Bootstrapping loop (diagram)
    Initial gazetteer -> extraction items -> collecting patterns -> discarding most general patterns -> learning classifiers -> extraction patterns -> collecting items -> discarding common names -> classifying items (learned high-precision classifier) -> extraction items

  • Collecting patterns (step 1)
    Go to AltaVista
    ask for an item
    download first n pages
    match with a simple regexp
    -> patterns

  • Example: step 1
    10 best patterns for ISLAND:

    of X     70
    the X    60
    X and    58
    X the    55
    to X     53
    in X     52
    and X    47
    X is     45
    X in     45
    on X     45
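Step 1 could look roughly like the sketch below. The slides do not show the actual regexp, so the expressions here are an assumption: they take one word of context to the left or right of the item, whereas the patterns on later slides also include two-word contexts. Downloading pages is left out; `pages` is simply a list of already fetched texts, and `collect_patterns` is a hypothetical name.

```python
import re
from collections import Counter

def collect_patterns(item, pages):
    """Count one-word left and right contexts of `item`, with the item replaced by X."""
    name = re.escape(item)
    left = re.compile(r"(\w+)\s+" + name + r"\b", re.IGNORECASE)
    right = re.compile(r"\b" + name + r"\s+(\w+)", re.IGNORECASE)
    counts = Counter()
    for text in pages:
        for match in left.finditer(text):
            counts[f"{match.group(1).lower()} X"] += 1
        for match in right.finditer(text):
            counts[f"X {match.group(1).lower()}"] += 1
    return counts

# Made-up page snippets for illustration only.
example_pages = [
    "The island of Savaii is in Samoa.",
    "Ferries to Savaii and Upolu",
]
example_patterns = collect_patterns("Savaii", example_pages)
```

Summing such counters over all items of a class would give per-class pattern frequencies like the ones listed above, dominated at this stage by very general contexts such as "of X".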

  • Rescoring (step 2)
    Goal: discard patterns that are too general

    [Formula not preserved in the transcript. Its annotated parts: the score of pattern p for class c, and a penalty for appearing in more than one class.]

  • Example: step 2
    10 best patterns for ISLAND:

    X island        17
    island of X     9
    X islands       8
    island X        7
    islands X       7
    insel X         7
    the island X    6
    X elects        5
    of X islands    5
    zealand X       4
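The exact rescoring formula is not legible in this transcript, so the following is not the author's formula. It is one possible way to combine a per-class score with a penalty for appearing in other classes, shown only to illustrate the idea: the own-class count minus the highest count in any other class. The function name `rescore` and the toy counts are hypothetical.

```python
def rescore(counts):
    """Penalise patterns that are also frequent in other classes.

    counts[c][p] is how often pattern p was seen for class c.
    Hypothetical score: own-class count minus the highest count in any other class.
    """
    scores = {}
    for cls, own in counts.items():
        scores[cls] = {}
        for pattern, n in own.items():
            others = [counts[o].get(pattern, 0) for o in counts if o != cls]
            scores[cls][pattern] = n - max(others, default=0)
    return scores

# Toy counts for illustration only.
toy_counts = {
    "ISLAND": {"of X": 70, "X island": 17},
    "RIVER": {"of X": 68, "X river": 20},
}
toy_scores = rescore(toy_counts)
best_island = max(toy_scores["ISLAND"], key=toy_scores["ISLAND"].get)
```

With any penalty of this kind, a general context such as "of X" drops down the ranking while class-specific contexts such as "X island" rise, which matches the change between the step 1 and step 2 examples.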

  • Learning classifiers (step 3)
    The 20 best patterns are used to train Ripper (as in the initial system)
    Produced classifiers:
      high-recall
      high-accuracy
      high-precision

  • Example: step 3
    High-recall classifier for ISLAND:
      if #(X island)/#X >= 0.003879 classify X as +ISLAND
      if #(and X islands)/#X >= 0.000002 classify X as +ISLAND
      if #(insel X)/#X >= 0.017099 classify X as +ISLAND
      otherwise classify X as -ISLAND
    Extraction patterns: X island, and X islands, insel X
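The high-recall rule list translates directly into code. The sketch below assumes the ratios #(pattern)/#X have already been computed and are passed in as a dictionary; the names `HIGH_RECALL_ISLAND` and `classify_island` are hypothetical, while the patterns and thresholds are the ones on the slide.

```python
# (pattern, threshold) pairs taken from the high-recall ISLAND classifier.
HIGH_RECALL_ISLAND = [
    ("X island", 0.003879),
    ("and X islands", 0.000002),
    ("insel X", 0.017099),
]

def classify_island(ratios, rules=HIGH_RECALL_ISLAND):
    """Return True (+ISLAND) if any pattern ratio reaches its threshold, else False (-ISLAND).

    `ratios` maps a pattern to #(pattern)/#X; missing patterns count as 0.
    """
    return any(ratios.get(pattern, 0.0) >= threshold for pattern, threshold in rules)
```

For example, `classify_island({"X island": 0.005})` accepts the item, while an item whose ratios all stay below the thresholds falls through to the default negative class.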

  • One more example: step 3
    High-accuracy classifier for ISLAND:
      if #(X island)/#X >= 0.000636 classify X as +ISLAND
      if #(and X islands)/#X >= 0.000002 and #(X sea)/#X >= 0.000013 and #(X geography)= 0.000056 and #(pacific islands X)/#X >= 0.000006 classify X as +ISLAND
      otherwise classify X as -ISLAND

  • Collecting and discarding items (steps 4 & 5)
    The same procedure as in step 1: go to AltaVista, ask for extraction patterns (cf. step 3), ...

    Discarding: common names (beginning with lower-case letters), stop words (not necessary, but saves time)
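The discarding step amounts to a simple filter. In the sketch below the stop list is a hypothetical placeholder, since the slides do not say which words it contained, and `discard_common` is an illustrative name.

```python
# Hypothetical stop list; the actual list used is not given on the slides.
STOP_WORDS = {"The", "And", "Of"}

def discard_common(candidates, stop_words=STOP_WORDS):
    """Keep candidates that start with an upper-case letter and are not stop words."""
    return [
        word for word in candidates
        if word[:1].isupper() and word not in stop_words
    ]
```

A call such as `discard_common(["Achill", "island", "The", "Akutan", "About"])` keeps `Achill`, `Akutan` and `About`: capitalised non-names survive this cheap filter and are only removed later by the classifier.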

  • Example: steps 4 and 5
    Extracted islands (alphabetically):

    About, Abyss, Achill, Active, Adatara, Akutan, Alaska, Alaskan, Albarella, All, Amelia, American

  • Classifying (step 6)
    The high-precision classifier (cf. step 3) is run on the collected items:
      rejected items are discarded
      accepted items are used for extraction at the next loop
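Putting steps 1 to 6 together, the control flow of one bootstrapping run might be sketched as below. All three callables are hypothetical stand-ins for the components described on the slides. Whether the seed items stay in the extraction set after the first loop is not stated, so this sketch assumes that only newly accepted items are used for the next loop.

```python
def bootstrap(seed_items, collect_patterns, learn, collect_items, loops=2):
    """Skeleton of the bootstrapping loop.

    collect_patterns(items) -> rescored patterns               (steps 1-2)
    learn(patterns)         -> (extraction_patterns, accept)   (step 3)
    collect_items(patterns) -> candidate names                 (steps 4-5)
    accept(name)            -> bool, high-precision classifier (step 6)
    """
    items = set(seed_items)
    gazetteer = set(seed_items)
    for _ in range(loops):
        patterns = collect_patterns(items)
        extraction_patterns, accept = learn(patterns)
        candidates = collect_items(extraction_patterns)
        items = {name for name in candidates if accept(name)}  # assumption: seeds not re-added
        gazetteer |= items
    return gazetteer
```

Passing the components in as functions keeps the loop itself independent of the search engine and of the particular learner.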

  • Example: step 6
    Extracted islands (alphabetically):

    Achill, Akutan, Albarella, Amelia, Andaman, Ascension, Bainbridge, Baltrum, Beaver, Big, Block, Bouvet

  • Evaluation
    Classifiers:
      initial system
      bootstrapping from the seed gazetteer
      bootstrapping from positive examples only
    Item lists:
      bootstrapping from the seed gazetteer

  • Initial system: evaluation

    Class       Accuracy
    CITY        74.3%
    ISLAND      95.8%
    RIVER       88.8%
    MOUNTAIN    88.7%
    COUNTRY     98.8%
    REGION      82.3%
    average     88.1%

  • Bootstrapping: evaluation

    Class       Initial system   After the 1st loop   After the 2nd loop
    CITY        74.3%            51.2%                62.0%
    ISLAND      95.8%            91.4%                96.4%
    RIVER       88.8%            91.5%                89.6%
    MOUNTAIN    88.7%            89.1%                88.8%
    COUNTRY     98.8%            99.2%                99.6%
    REGION      82.3%            80.4%                82.6%
    average     88.1%            83.8%                86.5%
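The "average" row is consistent with an unweighted mean over the six classes, as the short check below shows. The variable and function names are illustrative; the numbers are the ones in the table.

```python
# Per-class accuracies in the order CITY, ISLAND, RIVER, MOUNTAIN, COUNTRY, REGION.
ACCURACY = {
    "initial": [74.3, 95.8, 88.8, 88.7, 98.8, 82.3],
    "1st loop": [51.2, 91.4, 91.5, 89.1, 99.2, 80.4],
    "2nd loop": [62.0, 96.4, 89.6, 88.8, 99.6, 82.6],
}

def macro_average(values):
    """Unweighted mean over classes, rounded to one decimal like the table."""
    return round(sum(values) / len(values), 1)

averages = {system: macro_average(values) for system, values in ACCURACY.items()}
```

Since every class counts equally, the sharp drop for CITY alone accounts for most of the lower average after the first loop.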

  • Comparing the performance
    RIVER, MOUNTAIN, COUNTRY: the new system is better!
    ISLAND: the new system improved and became better after the 2nd loop.
    REGION: infected category (departments of X); however, the system is improving.
    CITY: very heterogeneous class (homonymy); 1st loop: streets of X; 2nd loop: km from X, ort X.

  • Comparing the systems
    Bootstrapping (vs. the initial system):
      + patterns learned automatically
      + word lists produced
      cheap seed gazetteer

    Problem: it's easy to download huge lists of islands etc., but very difficult to check them and classify them properly

  • Learning from positives
    CITY: ...
    REGION: ...
    COUNTRY: ...
    RIVER: ..., Victoria, ...
    ISLAND: ..., Victoria, ...
    MOUNTAIN: ..., Victoria, ...
    Before: => TRAINING: Victoria [+CITY, +REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
    Now: => TRAINING: Victoria [-CITY, -REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
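The difference between the two labelling schemes can be illustrated with a small sketch. It assumes that "before" means looking a name up in the full gazetteer and "now" means looking it up only in the sampled positive lists; the function name `training_labels` and both toy dictionaries are hypothetical and contain only the Victoria example.

```python
CLASSES = ["CITY", "REGION", "RIVER", "ISLAND", "MOUNTAIN", "COUNTRY"]

def training_labels(name, class_lists):
    """Format +CLASS/-CLASS labels for `name` given per-class name lists."""
    labels = [
        ("+" if name in class_lists.get(cls, ()) else "-") + cls
        for cls in CLASSES
    ]
    return "[" + ", ".join(labels) + "]"

# Before: the full gazetteer lists Victoria under five classes.
full_gazetteer = {c: {"Victoria"} for c in ["CITY", "REGION", "RIVER", "ISLAND", "MOUNTAIN"]}
# Now: only the sampled positive lists are consulted.
sampled_lists = {c: {"Victoria"} for c in ["RIVER", "ISLAND", "MOUNTAIN"]}

before = training_labels("Victoria", full_gazetteer)
now = training_labels("Victoria", sampled_lists)
```

The second scheme needs no checked gazetteer, at the price of some wrong negative labels (here -CITY and -REGION for Victoria).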

  • Initial system: evaluation

    Class       Precompiled gazetteer   Positives only
    CITY        74.3%                   50.3%
    ISLAND      95.8%                   94.1%
    RIVER       88.8%                   91.0%
    MOUNTAIN    88.7%                   89.3%
    COUNTRY     98.8%                   99.6%
    REGION      82.3%                   86.9%
    average     88.1%                   85.2%

  • Bootstrapping with positives only: evaluation

    Class       1st loop   2nd loop
    CITY        39.3%      44.1%
    ISLAND      94.5%      95.8%
    RIVER       91.2%      91.1%
    MOUNTAIN    90.1%      91.2%
    COUNTRY     98.7%      99.6%
    REGION      86.5%      81.6%
    average     83.4%      83.9%

  • New items
    New ISLANDs:

    true islands               121 (90.3%)
      found in the atlases      93
      not found                 28
    descriptions                 5 (3.7%)
    parts of names               3 (2.2%)
    mistakes                     5 (3.7%)
    all                        134
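The percentages in this breakdown can be reproduced from the raw counts. The snippet below is just that arithmetic, with illustrative variable names.

```python
# Counts from the New ISLANDs slide.
NEW_ISLANDS = {
    "true islands": 121,
    "descriptions": 5,
    "parts of names": 3,
    "mistakes": 5,
}

total = sum(NEW_ISLANDS.values())
shares = {kind: round(100 * count / total, 1) for kind, count in NEW_ISLANDS.items()}
```

The share of true islands can be read as the precision of the extracted ISLAND list.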

  • Conclusion
    Advantages of our approach:
      very little manually collected data required (seed gazetteer)
      no sophisticated engineering: patterns are produced automatically
      on-line classifiers provide negative information and are applicable to any entity
      new items (off-line gazetteer) are collected automatically

  • ToDo
    new classes -> hierarchy
    multi-word expressions
    more elaborate learning from positive examples
    determine locations (where is X?)