Gerhard Weikum
Max Planck Institute for Informatics
http://www.mpi-inf.mpg.de/~weikum/
For a Few Triples More
Acknowledgements
LOD: RDF Triples on the Web
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png
owl:sameAs
rdf.freebase.com/ns/en.rome
owl:sameAs
owl:sameAs
data.nytimes.com/51688803696189142301
Coord
geonames.org/3169070/roma
N 41° 54' 10'' E 12° 29' 2''
dbpprop:citizenOf
dbpedia.org/resource/Rome
rdf:type
rdfs:subclassOf
yago/wordnet:Actor109765278
rdf:type
rdfs:subclassOf
yago/wikicategory:ItalianComposer
yago/wordnet:Artist109812338
prop:actedIn
imdb.com/name/nm0910607/
LOD: Linked RDF Triples on the Web
prop:composedMusicFor
imdb.com/title/tt0361748/
dbpedia.org/resource/Ennio_Morricone
LOD: Linked RDF Triples on the Web
• Size: 30 Billion triples
• Linkage: 500 Million links
• Dynamics: encyclopedic reference data
The Good, the Bad, and the Ugly
30 billion triples – still not enough?
No! Consider:
1. Dynamics
2. Linkage
3. Ubiquity
For a Few Triples More
Outline
Why More Triples: Dynamics, Linkage, Ubiquity
Web-Scale Linkage
Explain Title
Wrap-up
Linkage & Ubiquity: Named-Entity Disambiguation
1. Dynamics: in a Fast-Paced World
Anecdotal examples:
• <… rdf:about="http://dbpedia.org/resource/Steve_Jobs"> <dbpprop:occupation …> Chairman and CEO, Apple Inc.
• <… "http://... Ellen_Johnson_Sirleaf"> <dcterms:subject rdf:resource="http://... Category:Nobel_Peace_Prize_laureates"/>
• <… "http://... Scarlett_Johansson"> <dbpprop:spouse rdf:resource="http://... Ryan_Reynolds"/>
• <… "http://... Clint_Eastwood"> <dbpprop:spouse …>Dina Ruiz 1 child
  <… rdf:about="http://... Clint_Eastwood"> <dbpprop:spouse …>Maggie Johnson 2 children
• <… rdf:about="http://... Paul_McCartney"> <dbpprop:spouse rdf:resource="http://… Linda_McCartney"/>
  <… rdf:about="http://... Paul_McCartney"> <dbpprop:spouse rdf:resource="http://… Heather_Mills"/>
  <… rdf:about="http://... Paul_McCartney"> <dbpprop:spouse …>Nancy Shevell
still there
not there
never there
both there
none there
1. Dynamics: As Fresh As Possible
http://data.gov.uk/openspending
1. Dynamics: Closer to the Sources
RDF Data on the Web produced by:
• Maintained, but mostly "static" reference collections (e.g. geo): mostly fresh
• Periodic exports from curated databases (e.g. gov, bio, music): often stale
• Periodic extraction from Web sources (e.g. encyclopedia, news): often stale
• Tags in social streams and advertisements: very noisy
Get closer to the data origin:
• RDF engines (SPARQL APIs) for production DBs
• view maintenance by pub-sub push (feeds)
• Deep-Web crawl/query for surfacing of RDF data
1. Dynamics: Nothing Lasts Forever
Even old and "static" data often needs temporal scope (timepoint, timespan) for proper interpretation
Need to add temporal properties to RDF and SPARQL
with reification, or use quads (quints, pints, etc.)
1: PaulMcCartney hasSpouse HeatherMills [11-Jun-2002, 2008]
2: PaulMcCartney hasSpouse NancyShevell [Oct-2011, now]
3: PaulMcCartney gotHonor SirPaul [1999]
1 validFrom 11-Jun-2002
1 validUntil 2008
2 validFrom Oct-2011
3 happenedOn 1999
Select ?w Where {
  ?id1: PM gotHonor SirPaul . ?id1 happenedOn ?t .
  ?id2: PM hasSpouse ?w . ?id2 validFrom ?b . ?id2 validUntil ?e .
  ?t containedIn [?b,?e] . }
but: principled, expressive, easy-to-use
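The temporal scoping idea can be illustrated with a small sketch. This is not the talk's proposal for extending RDF or SPARQL; it only mimics reified facts with validity intervals in plain Python. The `Fact` class, the helper names, and the reduction of dates to whole years are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """A triple plus a validity interval (years only, a simplifying assumption)."""
    subject: str
    predicate: str
    obj: str
    valid_from: Optional[int] = None   # None = open start
    valid_until: Optional[int] = None  # None = still valid ("now")


# Facts from the slide, with dates coarsened to years.
FACTS = [
    Fact("PaulMcCartney", "hasSpouse", "HeatherMills", 2002, 2008),
    Fact("PaulMcCartney", "hasSpouse", "NancyShevell", 2011, None),
    Fact("PaulMcCartney", "gotHonor", "SirPaul", 1999, 1999),
]


def holds_at(fact, year):
    """True if `year` lies inside the fact's validity interval."""
    after_start = fact.valid_from is None or fact.valid_from <= year
    before_end = fact.valid_until is None or year <= fact.valid_until
    return after_start and before_end


def objects_at(facts, subject, predicate, year):
    """Objects o such that (subject, predicate, o) holds in `year`."""
    return [f.obj for f in facts
            if f.subject == subject and f.predicate == predicate
            and holds_at(f, year)]


def spouses_at_honor(facts, person, honor):
    """Rough analogue of the slide's query: spouse(s) valid when the honor happened."""
    result = []
    for h in facts:
        if (h.subject == person and h.predicate == "gotHonor"
                and h.obj == honor and h.valid_from is not None):
            result.extend(objects_at(facts, person, "hasSpouse", h.valid_from))
    return result
```

With only the three facts listed on the slide, `spouses_at_honor(FACTS, "PaulMcCartney", "SirPaul")` returns an empty list, because neither marriage interval contains 1999; `objects_at(FACTS, "PaulMcCartney", "hasSpouse", 2005)` returns `["HeatherMills"]`.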
1. Dynamics: Nothing Lasts Forever
http://www.mpi-inf.mpg.de/yago-naga/yago/
2. Linkage: sameAs Links
dbpedia.org/resource/Linda_Louise_Eastman owl:sameAs yago-knowledge.org/resource/Linda_McCartney
www.freebase.com/view/en/man_with_no_name owl:sameAs dbpedia.org/page/Clint_Eastwood
data.linkedmdb.org/page/film/38166 owl:sameAs de.dbpedia.org/page/Zwei_glorreiche_Halunken
LOD statistics: 30 billion triples, 500 million links
330 million links trivial (ID-based) within pub, within bio
tens of millions of links near-trivial (DBpedia, Freebase, Yago, GeoNames)
sameas.org: 17 million bundles for 50 million URIs
data.nytimes.com: 5000 people, 2000 locations
Way too few for a world with:
1 million people, 10 million locations, tens of millions of species,
6 million books, 2 million movies, 10 million songs, etc.
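The notion of sameAs "bundles" (groups of URIs that are transitively linked by owl:sameAs) can be sketched with a union-find structure. This is only an illustration of the concept, not how sameas.org computes its bundles; the function name is hypothetical and the `example.org` URI is invented to show a transitive merge.

```python
def same_as_bundles(links):
    """Group URIs into bundles: connected components of the sameAs graph."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    bundles = {}
    for uri in list(parent):
        bundles.setdefault(find(uri), set()).add(uri)
    return list(bundles.values())


# The three links from the slide, plus one hypothetical link (example.org)
# that joins an existing bundle through transitivity.
LINKS = [
    ("dbpedia.org/resource/Linda_Louise_Eastman",
     "yago-knowledge.org/resource/Linda_McCartney"),
    ("www.freebase.com/view/en/man_with_no_name",
     "dbpedia.org/page/Clint_Eastwood"),
    ("data.linkedmdb.org/page/film/38166",
     "de.dbpedia.org/page/Zwei_glorreiche_Halunken"),
    ("example.org/id/linda",  # hypothetical URI, not from the slides
     "dbpedia.org/resource/Linda_Louise_Eastman"),
]

BUNDLES = same_as_bundles(LINKS)
```

Here seven URIs collapse into three bundles, one of which holds three URIs. A bundle is only as good as its links: a single wrong sameAs link merges two entities for every consumer of the bundle.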
2. Linkage: sameAs Coverage
3. Ubiquity: Web-of-Data & Web-of-Contents
3. Ubiquity: Web of Data & Other Contents
RDF data and Web contents need to be interconnected
RDFa & microformats provide the mechanism
How do we get the Web RDF-annotated (at large scale)?
Largely automated, but allow humans in the loop
3. Ubiquity: Web of Data & Other Contents
May 2, 2011
Maestro Morricone will perform on the stage of the Smetana Hall to conduct the Czech National Symphony Orchestra and Choir. The concert will feature both Classical compositions and soundtracks such as the Ecstasy of Gold. In programme two concerts for July 14th and 15th.
<html … May 2, 2011
<div typeof=event:music>
<span id="Maestro_Morricone">Maestro Morricone
<a rel="sameAs" resource="dbpedia…/Ennio_Morricone"/></span>
…
<span property="event:location">Smetana Hall</span>
…
<span property="rdf:type" resource="yago:performance">The concert</span> will feature …
<span property="event:date" content="14-07-2011"></span>July 1
</div>
Why a Few Triples More?
Linked Data is great!
But still in its infancy
Need to add triples to capture further issues:
• Dynamics: Where is the live data?
• Linkage: Where are the links in Linked Data?
• Ubiquity: Where are the paths between the Web-of-Data and the Web?
Outline
Why More Triples: Dynamics, Linkage, Ubiquity
Web-Scale Linkage
Explain Title
Wrap-up
Linkage & Ubiquity: Named-Entity Disambiguation
Named-Entity Disambiguation (NED)
Harry fought with you know who. He defeats the dark lord.
Three NLP tasks:
1) named-entity detection: segment & label by HMM or CRF (e.g. Stanford NER tagger)
2) co-reference resolution: link to preceding NP (trained classifier over linguistic features)
3) named-entity disambiguation: map each mention (name) to canonical entity (entry in KB)
Harry Potter
Dirty Harry
Lord Voldemort
The Who (band)
Prince Harry of England
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Mentions, Meanings, Mappings
D5 Overview May 30, 2011
Sergio means Sergio_Leone
Sergio means Serge_Gainsbourg
Ennio means Ennio_Antonelli
Ennio means Ennio_Morricone
Eli means Eli_(bible)
Eli means ExtremeLightInfrastructure
Eli means Eli_Wallach
Ecstasy means Ecstasy_(drug)
Ecstasy means Ecstasy_of_Gold
trilogy means Star_Wars_Trilogy
trilogy means Lord_of_the_Rings
trilogy means Dollars_Trilogy
… … …
KB
Eli (bible)
Eli Wallach
Mentions (surface names)
Entities (meanings)
Dollars Trilogy
Lord of the Rings
Star Wars Trilogy
Benny Andersson
Benny Goodman
Ecstasy of Gold
Ecstasy (drug)
?
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Mention-Entity Graph
Dollars Trilogy
Lord of the Rings
Star Wars
Ecstasy of Gold
Ecstasy (drug)
Eli (bible)
Eli Wallach
KB+Stats
weighted undirected graph with two types of nodes
Popularity(m,e):
• freq(e|m)
• length(e)
• #links(e)
Similarity(m,e):
• cos/Dice/KL (context(m), context(e))
bag-of-words or language model: words, bigrams, phrases
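The Similarity(m,e) edge weight can be sketched with a bag-of-words model. The code below shows plain cosine and Dice overlap between a mention's context and an entity's context; it is not AIDA's actual feature computation. The two entity context strings are invented for illustration, and tokenization is a naive whitespace split.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Naive bag-of-words: lowercase and split on whitespace."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dice(a, b):
    """Dice coefficient on the sets of distinct words."""
    words_a, words_b = set(a), set(b)
    if not words_a and not words_b:
        return 0.0
    return 2 * len(words_a & words_b) / (len(words_a) + len(words_b))


# Context of the mention "Ecstasy", taken from the running example sentence.
MENTION_CONTEXT = bag_of_words(
    "sergio talked to ennio about eli role in the ecstasy scene "
    "this sequence on the graveyard was a highlight in sergio trilogy "
    "of western films"
)

# Hypothetical entity contexts (e.g. keywords harvested from KB articles).
ENTITY_CONTEXTS = {
    "Ecstasy_of_Gold": bag_of_words(
        "ennio morricone soundtrack western graveyard scene"),
    "Ecstasy_(drug)": bag_of_words("psychoactive drug mdma pill"),
}

SCORES = {e: cosine(MENTION_CONTEXT, ctx) for e, ctx in ENTITY_CONTEXTS.items()}
```

In this toy setting the film-music entity scores above zero and the drug scores exactly zero, since its context shares no word with the sentence. A real system would weight terms (e.g. by IDF) and use phrases rather than single words, as the slide notes.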
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Mention-Entity Graph
Dollars Trilogy
Lord of the Rings
Star Wars
Ecstasy of Gold
Ecstasy (drug)
Eli (bible)
Eli Wallach
KB+Stats
weighted undirected graph with two types of nodes
Popularity(m,e):
• freq(e|m)
• length(e)
• #links(e)
Similarity(m,e):
• cos/Dice/KL (context(m), context(e))
joint mapping
Mention-Entity Graph
Dollars Trilogy
Lord of the Rings
Star Wars
Ecstasy of Gold
Ecstasy (drug)
Eli (bible)
Eli Wallach
KB+Stats
weighted undirected graph with two types of nodes
Popularity(m,e):
• freq(m,e|m)
• length(e)
• #links(e)
Similarity(m,e):
• cos/Dice/KL (context(m), context(e))
Coherence(e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Mention-Entity Graph
KB+Stats
weighted undirected graph with two types of nodes
Popularity(m,e):
• freq(m,e|m)
• length(e)
• #links(e)
Similarity(m,e):
• cos/Dice/KL (context(m), context(e))
Coherence(e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)
American Jews, film actors, artists, Academy Award winners
Metallica songs, Ennio Morricone songs, artifacts, soundtrack music
spaghetti westerns, film trilogies, movies, artifacts
Dollars Trilogy
Lord of the Rings
Star Wars
Ecstasy of Gold
Ecstasy (drug)
Eli (bible)
Eli Wallach
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Mention-Entity Graph
KB+Stats
weighted undirected graph with two types of nodes
Popularity(m,e):
• freq(m,e|m)
• length(e)
• #links(e)
Similarity(m,e):
• cos/Dice/KL (context(m), context(e))
Coherence(e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)
http://.../wiki/Dollars_Trilogy
http://.../wiki/The_Good,_the_Bad,_the_Ugly
http://.../wiki/Clint_Eastwood
http://.../wiki/Honorary_Academy_Award
http://.../wiki/The_Good,_the_Bad,_the_Ugly
http://.../wiki/Metallica
http://.../wiki/Bellagio_(casino)
http://.../wiki/Ennio_Morricone
http://.../wiki/Sergio_Leone
http://.../wiki/The_Good,_the_Bad,_the_Ugly
http://.../wiki/For_a_Few_Dollars_More
http://.../wiki/Ennio_Morricone
Dollars Trilogy
Lord of the Rings
Star Wars
Ecstasy of Gold
Ecstasy (drug)
Eli (bible)
Eli Wallach
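The overlap(links) coherence measure can be sketched as a Jaccard coefficient over the sets of Wikipedia pages linked with each entity. Two things here are assumptions rather than facts from the slides: the choice of Jaccard as the overlap function, and the assignment of the three link lists to Eli Wallach, Ecstasy of Gold and Dollars Trilogy (the extraction lost which list belongs to which node).

```python
def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Link sets as listed on the slide (page names only); the mapping from
# list to entity is an assumption.
LINKS = {
    "Eli_Wallach": {
        "Dollars_Trilogy", "The_Good,_the_Bad,_the_Ugly",
        "Clint_Eastwood", "Honorary_Academy_Award",
    },
    "Ecstasy_of_Gold": {
        "The_Good,_the_Bad,_the_Ugly", "Metallica",
        "Bellagio_(casino)", "Ennio_Morricone",
    },
    "Dollars_Trilogy": {
        "Sergio_Leone", "The_Good,_the_Bad,_the_Ugly",
        "For_a_Few_Dollars_More", "Ennio_Morricone",
    },
}


def coherence(e1, e2, links=LINKS):
    """Entity-entity edge weight from link overlap."""
    return jaccard(links[e1], links[e2])
```

Under these assumptions, Ecstasy of Gold and Dollars Trilogy share two of six distinct pages, while Eli Wallach shares one of seven with each of the other two, so all three pairs receive a non-zero coherence weight.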
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Mention-Entity Graph
KB+Stats
Popularity(m,e):
• freq(m,e|m)
• length(e)
• #links(e)
Similarity(m,e):
• cos/Dice/KL (context(m), context(e))
Coherence(e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)
Metallica on Morricone tribute; Bellagio water fountain show; Yo-Yo Ma; Ennio Morricone composition
The Magnificent Seven; The Good, the Bad, and the Ugly; Clint Eastwood; University of Texas at Austin
For a Few Dollars More; The Good, the Bad, and the Ugly; Man with No Name trilogy; soundtrack by Ennio Morricone
weighted undirected graph with two types of nodes
Dollars Trilogy
Lord of the Rings
Star Wars
Ecstasy of Gold
Ecstasy (drug)
Eli (bible)
Eli Wallach
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Joint Mapping
• Build mention-entity graph or joint-inference factor graph from knowledge and statistics in KB
• Compute high-likelihood mapping (ML or MAP) or dense subgraph such that: each m is connected to exactly one e (or at most one e)
Coherence Graph Algorithm
• Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e)
• Greedy approximation: iteratively remove weakest entity and its edges
• Keep alternative solutions, then use local/randomized search
[J. Hoffart et al.: EMNLP'11]
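The greedy approximation can be sketched as follows. This is a simplified illustration, not the implementation from Hoffart et al.: it keeps removing the entity with the lowest weighted degree as long as every affected mention retains another candidate, and stops when each mention has exactly one entity left. It omits the steps of remembering the best intermediate subgraph and the local/randomized search. All edge weights below are invented for illustration and are not the values from the figure.

```python
from collections import defaultdict

# Mention-entity edge weights (illustrative values only).
ME_WEIGHTS = {
    ("Eli", "Eli_(bible)"): 50, ("Eli", "Eli_Wallach"): 30,
    ("Ecstasy", "Ecstasy_(drug)"): 50, ("Ecstasy", "Ecstasy_of_Gold"): 20,
    ("trilogy", "Star_Wars_Trilogy"): 50, ("trilogy", "Dollars_Trilogy"): 30,
}

# Entity-entity coherence edge weights (illustrative values only).
EE_WEIGHTS = {
    ("Eli_Wallach", "Ecstasy_of_Gold"): 90,
    ("Eli_Wallach", "Dollars_Trilogy"): 80,
    ("Ecstasy_of_Gold", "Dollars_Trilogy"): 100,
    ("Eli_(bible)", "Ecstasy_(drug)"): 10,
}


def weighted_degree(entity, alive, me_weights, ee_weights):
    """Sum of weights of edges incident to `entity` in the current subgraph."""
    total = sum(w for (_, e), w in me_weights.items() if e == entity)
    for (a, b), w in ee_weights.items():
        if (a == entity and b in alive) or (b == entity and a in alive):
            total += w
    return total


def greedy_disambiguate(me_weights, ee_weights):
    """Iteratively drop the weakest removable entity; return mention -> entity."""
    candidates = defaultdict(set)
    for mention, entity in me_weights:
        candidates[mention].add(entity)
    alive = {entity for _, entity in me_weights}

    while True:
        # An entity may go only if each mention listing it keeps another candidate.
        removable = [
            e for e in alive
            if all(len(c & alive) > 1 for c in candidates.values() if e in c)
        ]
        if not removable:
            break
        # Ties are broken by entity name to keep the run deterministic.
        weakest = min(
            removable,
            key=lambda e: (weighted_degree(e, alive, me_weights, ee_weights), e),
        )
        alive.remove(weakest)

    return {m: next(iter(c & alive)) for m, c in candidates.items()}


MAPPING = greedy_disambiguate(ME_WEIGHTS, EE_WEIGHTS)
```

In this toy graph the three popular but mutually incoherent candidates (Star Wars Trilogy, Ecstasy (drug), Eli (bible)) are removed in turn, and the coherent film-related entities remain, even though each of them has the lower mention-entity weight.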
Named-Entity Disambiguation: State-of-the-Art
Online tools:
https://d5gate.ag5.mpi-sb.mpg.de/webaida/
http://tagme.di.unipi.it/
http://spotlight.dbpedia.org/demo/index.html
http://viewer.opencalais.com/
http://wikipedia-miner.cms.waikato.ac.nz/demos/annotate/
etc.
Literature:
• Razvan Bunescu, Marius Pasca: EACL 2006
• Silviu Cucerzan: EMNLP 2007
• David Milne, Ian Witten: CIKM 2008
• S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD 2009
• G. Limaye, S. Sarawagi, S. Chakrabarti: VLDB 2010
• Paolo Ferragina, Ugo Scaiella: CIKM 2010
• Mark Dredze et al.: COLING 2010
• Johannes Hoffart et al.: EMNLP 2011
etc. etc.
NED: Experimental Evaluation
Benchmark:
• Extended CoNLL 2003 dataset: 1400 newswire articles
• originally annotated with mention markup (NER), now with NED mappings to Yago and Freebase
• difficult texts:
  … Australia beats India … Australian_Cricket_Team
  … White House talks to Kremlin … President_of_the_USA
  … EDS made a contract with … HP_Enterprise_Services
Results:
Best: AIDA method with prior+sim+coh + robustness test
82% precision @100% recall, 87% mean average precision
Comparison to other methods, see paper
J. Hoffart et al.: Robust Disambiguation of Named Entities in Text, EMNLP 2011
http://www.mpi-inf.mpg.de/yago-naga/aida/
AIDA: Accurate Online Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/aida/
Interesting Research Issues
• More efficient graph algorithms (multicore, etc.)
• Allow mentions of unknown entities, mapped to null
• Short and difficult texts:
  • tweets, headlines, etc.
  • fictional texts: novels, song lyrics, etc.
  • incoherent texts
• Disambiguation beyond entity names:
  • coreferences: pronouns, paraphrases, etc.
  • common nouns, verbal phrases (general WSD)
• Leverage deep-parsing structures, leverage semantic types
Why Named Entity Disambiguation is Key
• Linked data is best if it has many good links
• New & rich contents mostly in traditional Web
• Create sameAs links in (X)HTML contents, via RDFa
• Links for named entities give best mileage/effort
• Methods & tools greatly advanced & gradually maturing
• Keep human in the loop, embed NED in authoring tools
Outline
Why More Triples: Dynamics, Linkage, Ubiquity
Web-Scale Linkage
Explain Title
Wrap-up
Linkage & Ubiquity: Named-Entity Disambiguation
Variants of NED at Web Scale
• How to run this on a big batch of 1 million input texts?
  partition inputs across distributed machines, organize dictionary appropriately, … exploit cross-document contexts
• How to deal with inputs from different time epochs?
  consider time-dependent contexts, map to entities of proper epoch (e.g. harvested from Wikipedia history)
• How to handle Web-scale inputs (100 million pages) restricted to a set of interesting entities? (e.g. tracking politicians and companies)
Tools can map short text onto entities in a few seconds
owl:sameAs
rdf.freebase.com/ns/en.rome_ny
owl:sameAs
owl:sameAs
data.nytimes.com/51688803696189142301
Coord
geonames.org/5134301/city_of_rome
N 43° 12' 46'' W 75° 27' 20''
dbpprop:citizenOf
dbpedia.org/resource/Rome
rdf:type
rdfs:subclassOf
yago/wordnet:Actor109765278
rdf:type
rdfs:subclassOf
yago/wikicategory:ItalianComposer
yago/wordnet:Artist109812338
prop:actedIn
imdb.com/name/nm0910607/
Linked RDF Triples on the Web
prop:composedMusicFor
imdb.com/title/tt0361748/
dbpedia.org/resource/Ennio_Morricone
referential data quality: automatic, dynamic, high coverage!
?
? ?
Outline
Why More Triples: Dynamics, Linkage, Ubiquity
Web-Scale Linkage
Explain Title
Wrap-up
Linkage & Ubiquity: Named-Entity Disambiguation
Summary
Linked Data is great!
But it needs more triples to capture:
• Dynamics: (Deep-Web) sources feeds, pub-sub, … ? fresh & versioned triples
• Linkage: LOD entity mapping user community
• Ubiquity: RDFa entity disambiguation authoring
Outlook
For a Few Triples More
Challenge 1: generate high-quality sameAs links in RDFa & across all LOD sources
For a Few Triples Less
Challenge 2: add efficient top-k ranking to queries over RDF-in-context
Thank You !