Multi-lingual Concept Extraction with ... - Anna Lisa Gentile · Alfredo Alba, Anni Coden, Anna Lisa...

Multi-lingual Concept Extraction with Linked Data and Human-in-the-Loop Alfredo Alba, Anni Coden, Anna Lisa Gentile, Daniel Gruhl, Petar Ristoski, Steve Welch IBM Research

Page 1:

Multi-lingual Concept Extraction with Linked Data and Human-in-the-Loop

Alfredo Alba, Anni Coden, Anna Lisa Gentile, Daniel Gruhl, Petar Ristoski, Steve Welch

IBM Research

Page 2:

Motivation

Page 3:

Motivation

§ extract information from a novel corpus

§ what are the relevant concepts in the domain?

§ limited domain and language knowledge

§ IDEA: combine statistical techniques with user-in-the-loop

Page 4:

Domain Learning Assistant

• Start with a small number of seeds (1)

• Get suggestions of new surface forms

• The user accepts or rejects each suggestion

Page 5:

Finding concept candidates

The safety and efficacy of filgrastim are similar in adults and children receiving cytotoxic chemotherapy

La eficacia y la seguridad del filgrastim son similares en los adultos y en los niños tratados con quimioterapia citotóxica

La sicurezza e l’efficacia del filgrastim sono simili negli adulti e nei bambini sottoposti a chemioterapia citotossica

Die Wirksamkeit und Unbedenklichkeit von Filgrastim ist bei Erwachsenen und bei Kindern, die eine zytotoxische Chemotherapie erhalten, vergleichbar

Page 6:

Finding concept candidates

Plasma elimination half-life of oral pravastatin is 1.5 to 2 hours.

L’emivita plasmatica di eliminazione del pravastatin orale è compresa tra un’ora e mezzo e due ore.

Page 7:

Finding concept candidates

Candidates: {eggs, flour}

“mix eggs and flour” → mix <candidate> and <candidate>

mix <candidate> and <candidate> → “mix sugar and butter”

Candidates: {eggs, flour, sugar, butter}

“melt the butter” → melt the <candidate>

Page 8:

Finding concept candidates

Candidates: {uova, farina}

“amalgamare uova e farina” → amalgamare <candidate> e <candidate>

amalgamare <candidate> e <candidate> → “amalgamare zucchero e burro”

Candidates: {uova, farina, zucchero, burro}

“sciogliere il burro” → sciogliere il <candidate>
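The English and Italian walk-throughs on the two slides above can be reproduced with a small sketch. It is not the authors' implementation: it assumes single-token candidates, whitespace tokenization, and whole-sentence patterns, and the names `extract_patterns`, `apply_patterns` and `expand` are hypothetical.

```python
import re

PLACEHOLDER = "<candidate>"


def extract_patterns(sentences, candidates):
    """Turn each sentence that mentions a candidate into a context pattern."""
    patterns = set()
    for sentence in sentences:
        tokens = sentence.split()
        if any(tok in candidates for tok in tokens):
            patterns.add(
                " ".join(PLACEHOLDER if tok in candidates else tok for tok in tokens)
            )
    return patterns


def apply_patterns(sentences, patterns):
    """Collect the tokens that fill the placeholder slots of any pattern."""
    found = set()
    for pattern in patterns:
        # Literal words stay literal; each slot matches exactly one token.
        regex = re.escape(pattern).replace(re.escape(PLACEHOLDER), r"(\S+)")
        for sentence in sentences:
            match = re.fullmatch(regex, sentence)
            if match:
                found.update(match.groups())
    return found


def expand(sentences, candidates):
    """One round: candidates -> patterns -> enlarged candidate set."""
    patterns = extract_patterns(sentences, candidates)
    return set(candidates) | apply_patterns(sentences, patterns)


english = ["mix eggs and flour", "mix sugar and butter", "melt the butter"]
italian = ["amalgamare uova e farina", "amalgamare zucchero e burro", "sciogliere il burro"]

print(sorted(expand(english, {"eggs", "flour"})))   # ['butter', 'eggs', 'flour', 'sugar']
print(sorted(expand(italian, {"uova", "farina"})))  # ['burro', 'farina', 'uova', 'zucchero']
```

The same code handles both languages because it only compares token sequences; no language-specific features are involved.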

Page 9:

Multi-lingual experiment

HYPOTHESIS: same behavior, regardless of the language

§ we start with very few seeds (one could be sufficient) for each language

§ we extract context patterns and use them to generate new candidates

§ we ask the user to accept/reject the candidates

§ we repeat for a fixed number of iterations in all languages
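The four steps above can be sketched as a loop. This is a minimal illustration under stated assumptions, not the system described in the talk: candidates are single tokens, patterns are whole sentences, the user is replaced by a callback `accept`, and the toy sentences are invented (only the drug names come from the slides). The names `propose` and `run` are hypothetical.

```python
import re

SLOT = "<candidate>"


def propose(sentences, lexicon):
    """Suggest new surface forms that share a context pattern with the lexicon."""
    patterns = set()
    for sentence in sentences:
        tokens = sentence.split()
        if any(tok in lexicon for tok in tokens):
            patterns.add(" ".join(SLOT if tok in lexicon else tok for tok in tokens))
    suggestions = set()
    for pattern in patterns:
        regex = re.escape(pattern).replace(re.escape(SLOT), r"(\S+)")
        for sentence in sentences:
            match = re.fullmatch(regex, sentence)
            if match:
                suggestions.update(match.groups())
    return suggestions - lexicon  # only forms not already in the lexicon


def run(sentences, seeds, accept, iterations):
    """Repeat propose/accept for a fixed number of iterations."""
    lexicon = set(seeds)
    growth = []  # lexicon size after each iteration
    for _ in range(iterations):
        for candidate in propose(sentences, lexicon):
            if accept(candidate):  # user-in-the-loop decision
                lexicon.add(candidate)
        growth.append(len(lexicon))
    return lexicon, growth


# Invented toy corpus; "water" is a candidate the simulated user rejects.
corpus = [
    "take irbesartan daily",
    "take pravastatin daily",
    "take water daily",
    "oral pravastatin tablets",
    "oral filgrastim tablets",
]
gold = {"irbesartan", "pravastatin", "filgrastim"}

lexicon, growth = run(corpus, {"irbesartan"}, accept=lambda c: c in gold, iterations=3)
print(sorted(lexicon))  # ['filgrastim', 'irbesartan', 'pravastatin']
print(growth)           # [2, 3, 3]
```

Passing the accept/reject decision as a callback means the same loop works with a live user or with a simulated one backed by a gold standard.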

Page 10:

Multi-lingual experiment: Drug Discovery

§ DATA: parallel corpus from the European Medicines Agency (EMEA)

  § documents related to medicinal products

  § translations into 22 official languages of the European Union

  § 1,500 documents for most of the languages

  § we used 4 languages (en, es, it, de)

§ TASK: build a lexicon of clinical drugs

§ user-in-the-loop simulated by constructing a Gold Standard (GS) of drug names extracted from Linked Open Data (we used DBpedia, http://dbpedia.org)
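One way such a gold standard could be gathered is a SPARQL query for drug labels in a given language. The slides do not say which class or property was used; the sketch below assumes the DBpedia ontology class `dbo:Drug` and `rdfs:label`, and the helper name `drug_label_query` and its `limit` parameter are hypothetical. It only builds the query text and does not contact any endpoint.

```python
def drug_label_query(lang, limit=1000):
    """Build a SPARQL query for labels of dbo:Drug resources in one language.

    Assumption: dbo:Drug and rdfs:label are a reasonable choice; the
    original work may have selected drug names differently.
    """
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?label WHERE {{
  ?drug a dbo:Drug ;
        rdfs:label ?label .
  FILTER (lang(?label) = "{lang}")
}}
LIMIT {limit}
""".strip()


# One query per language used in the experiment.
for lang in ("en", "es", "it", "de"):
    print(drug_label_query(lang))
    print()
```

The resulting label set can then play the role of the accept/reject oracle: a candidate is accepted when it appears in the gold standard.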

Page 11:

Drug Discovery: One seed

§ initial seeds: single seed

  § One drug name which appears in each corpus (e.g. “irbesartan”)

§ 20 iterations

§ learning curves for all languages are comparable

Discovery growth for glimpse for English (en), Italian (it), Spanish (es) and German (de). Average correlation amongst all languages r = 0.998.
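The caption reports an average correlation across languages. A plausible reading, assumed here, is the mean Pearson r over all pairs of per-language discovery-growth curves. The curve values below are made up for illustration and are not the experiment's data; `pearson` and `average_pairwise_r` are hypothetical helper names.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_pairwise_r(curves):
    """Mean Pearson r over every pair of curves (one curve per language)."""
    pairs = list(combinations(curves.values(), 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Illustrative lexicon sizes per iteration (invented numbers).
curves = {
    "en": [1, 5, 12, 20, 26],
    "it": [1, 4, 11, 19, 24],
    "es": [1, 6, 13, 21, 27],
    "de": [1, 4, 10, 18, 25],
}
print(round(average_pairwise_r(curves), 3))
```

A value close to 1 means the curves rise in the same way, which is what "comparable learning curves" amounts to numerically.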

Page 12:

Drug Discovery: Linked Data seeds

§ initial seeds: 20% of available Linked Data (DBpedia)

§ 5-fold validation (randomly selected 20%, same drugs for all languages)

§ choice of initial seeds does not impact the results

Discovery growth with 5-fold cross validation on the EMEA dataset using DBpedia as seeds. Each plot shows the discovery growth for each of the randomly generated 5 folds and reports the Pearson correlation (r) amongst them.
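The seed selection can be sketched as a random split of the gold standard into five disjoint folds, each holding roughly 20% of the entries; one fold serves as seeds and the rest is what the system has to discover. How the folds were actually drawn is not stated on the slide, so the shuffling scheme, the `rng_seed` parameter and the placeholder drug names are assumptions.

```python
import random


def seed_folds(gold_standard, k=5, rng_seed=0):
    """Split the gold standard into k disjoint, near-equal random folds."""
    items = sorted(gold_standard)           # fixed order so the shuffle is reproducible
    random.Random(rng_seed).shuffle(items)
    return [set(items[i::k]) for i in range(k)]


# Placeholder names; a real run would use the DBpedia drug list.
gold = {f"drug_{i}" for i in range(10)}

for fold_id, seeds in enumerate(seed_folds(gold), start=1):
    held_out = gold - seeds                 # entries the system must rediscover
    print(fold_id, len(seeds), len(held_out))
```

Using the same folds for every language keeps the seeds identical across en, es, it and de, as the slide specifies.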

Page 13:

Drug Discovery: benefit of Linked Data

§ glimpse → one manually provided seed

§ glimpseLD → Linked Data seeds

§ in 10 iterations glimpseLD can cover the same lexicon that would take more than 20 iterations with glimpse

Human-in-the-loop experiment with a subject matter expert (physician)

Page 14:

Multi-lingual experiment: Colors

§ DATA: Twitter stream, 1st-14th of January 2016

  – lang: En, De, Es, It

  § contain at least one mention of a color

  § gold standard lists of colors from Wikidata and DBpedia

§ balance dataset sizes in different languages

  § 155,828 tweets per language

§ TASK: expand the lexicon of colors

§ user-in-the-loop: 4 native speakers, 10 iterations

Page 15:

Multi-lingual experiment: Colors

§ new color items extracted from Twitter data:

  § German: 5

  § Italian: 5

  § English: 19

  § Spanish: 22

    § azulgrana

    § rojo vivo

    § “limn” (in place of the color limón)

Page 16:

Conclusions

WHAT

§ knowledge resources are never complete/exhaustive

§ construct/improve dictionaries from text corpora

HOW

§ iterative and purely statistical algorithm

§ no feature extraction required

§ comparable behavior for different languages

§ organically incorporates human feedback

Page 17:

Multi-lingual Concept Extraction with Linked Data and Human-in-the-Loop

IBM Research

[email protected] @AnLiGentile

Alfredo Alba, Anni Coden, Anna Lisa Gentile, Daniel Gruhl, Petar Ristoski, Steve Welch