Gerhard Weikum Max Planck Institute for Informatics weikum

Gerhard Weikum, Max Planck Institute for Informatics
http://www.mpi-inf.mpg.de/~weikum/

For a Few Triples More


Page 1: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Gerhard Weikum, Max Planck Institute for Informatics
http://www.mpi-inf.mpg.de/~weikum/

For a Few Triples More

Page 2: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Acknowledgements

Page 3: Gerhard  Weikum Max Planck Institute  for Informatics weikum

LOD: RDF Triples on the Web

http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png

Page 4: Gerhard  Weikum Max Planck Institute  for Informatics weikum

LOD: Linked RDF Triples on the Web

[Figure: linked-data graph.
Nodes: dbpedia.org/resource/Ennio_Morricone, dbpedia.org/resource/Rome, rdf.freebase.com/ns/en.rome, data.nytimes.com/51688803696189142301, geonames.org/3169070/roma (Coord: N 41° 54' 10'' E 12° 29' 2''), yago/wikicategory:ItalianComposer, yago/wordnet:Artist109812338, yago/wordnet:Actor109765278, imdb.com/name/nm0910607/, imdb.com/title/tt0361748/.
Edge labels: owl:sameAs (three times), dbpprop:citizenOf, rdf:type (twice), rdfs:subclassOf (twice), prop:actedIn, prop:composedMusicFor.]

Page 5: Gerhard  Weikum Max Planck Institute  for Informatics weikum

LOD: Linked RDF Triples on the Web

• Size: 30 Billion triples

• Linkage: 500 Million links

• Dynamics: encyclopedic reference data

Page 6: Gerhard  Weikum Max Planck Institute  for Informatics weikum

The Good, the Bad, and the Ugly

Page 7: Gerhard  Weikum Max Planck Institute  for Informatics weikum

30 billion triples – still not enough?

No! Consider:
1. Dynamics
2. Linkage
3. Ubiquity

For a Few Triples More

Page 8: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Outline

Explain Title

Why More Triples: Dynamics, Linkage, Ubiquity

Linkage & Ubiquity: Named-Entity Disambiguation

Web-Scale Linkage

Wrap-up

Page 9: Gerhard  Weikum Max Planck Institute  for Informatics weikum

1. Dynamics: in a Fast Paced World

Anecdotal examples:
• <… rdf:about="http://dbpedia.org/resource/Steve_Jobs"> <dbpprop:occupation …> Chairman and CEO, Apple Inc.
• <… "http://... Ellen_Johnson_Sirleaf"> <dcterms:subject rdf:resource="http://... Category:Nobel_Peace_Prize_laureates"/>
• <… "http://... Scarlett_Johansson"> <dbpprop:spouse rdf:resource="http://... Ryan_Reynolds"/>
• <… "http://... Clint_Eastwood"> <dbpprop:spouse …>Dina Ruiz 1 child
  <… rdf:about="http://... Clint_Eastwood"> <dbpprop:spouse …>Maggie Johnson 2 children
• <… rdf:about="http://... Paul_McCartney"> <dbpprop:spouse rdf:resource="http://… Linda_McCartney"/>
  <… rdf:about="http://... Paul_McCartney"> <dbpprop:spouse rdf:resource="http://… Heather_Mills"/>
  <… rdf:about="http://... Paul_McCartney"> <dbpprop:spouse …>Nancy Shevell

Annotations next to the examples: still there; not there; never there; both there; none there

Page 10: Gerhard  Weikum Max Planck Institute  for Informatics weikum

1. Dynamics: As Fresh As Possible

http://data.gov.uk/openspending

Page 11: Gerhard  Weikum Max Planck Institute  for Informatics weikum

1. Dynamics: Updates in the Web of Data

http://sindice.com

Page 12: Gerhard  Weikum Max Planck Institute  for Informatics weikum

1. Dynamics: Closer to the Sources

RDF Data on the Web produced by:
• Maintained, but mostly "static" reference collections (e.g. geo): mostly fresh
• Periodic exports from curated databases (e.g. gov, bio, music): often stale
• Periodic extraction from Web sources (e.g. encyclopedia, news): often stale
• Tags in social streams and advertisements: very noisy

Get closer to the data origin:
• RDF engines (SPARQL APIs) for production DBs
• view maintenance by pub-sub push (feeds)
• Deep-Web crawl/query for surfacing of RDF data

Page 13: Gerhard  Weikum Max Planck Institute  for Informatics weikum

1. Dynamics: Nothing Lasts Forever

Even old and "static" data often needs temporal scope (timepoint, timespan) for proper interpretation

1: PaulMcCartney hasSpouse HeatherMills [11-Jun-2002, 2008]
2: PaulMcCartney hasSpouse NancyShevell [Oct-2011, now]
3: PaulMcCartney gotHonor SirPaul [1999]

1 validFrom 11-Jun-2002
1 validUntil 2008
2 validFrom Oct-2011
3 happenedOn 1999

Need to add temporal properties to RDF and SPARQL with reification, or use quads (quints, pints, etc.)

but: principled, expressive, easy-to-use

Select ?w Where {
?id1: PM gotHonor SirPaul . ?id1 happenedOn ?t .
?id2: PM hasSpouse ?w . ?id2 validFrom ?b . ?id2 validUntil ?e .
?t containedIn [?b,?e] . }
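The reified facts and the query on this slide can be sketched in plain Python. This is only an illustration under assumptions: dates are reduced to years, a missing validUntil is treated as "still valid" (the slide's query would require the property to be present), and the names `Fact`, `holds_at`, `spouses_at` and `spouses_when_honored` are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    fact_id: int
    subject: str
    predicate: str
    obj: str


# The three base facts from the slide, each with an identifier.
FACTS = [
    Fact(1, "PaulMcCartney", "hasSpouse", "HeatherMills"),
    Fact(2, "PaulMcCartney", "hasSpouse", "NancyShevell"),
    Fact(3, "PaulMcCartney", "gotHonor", "SirPaul"),
]

# Meta-facts about fact identifiers (years only, a simplification).
VALID_FROM = {1: 2002, 2: 2011}
VALID_UNTIL = {1: 2008}  # fact 2 has no end: assumed still valid ("now")
HAPPENED_ON = {3: 1999}


def holds_at(fact_id: int, year: int) -> bool:
    """True if the validity interval of the fact contains the given year."""
    start = VALID_FROM.get(fact_id)
    if start is None:
        return False
    end = VALID_UNTIL.get(fact_id, float("inf"))  # open-ended interval
    return start <= year <= end


def spouses_at(subject: str, year: int) -> list[str]:
    """Spouses of `subject` whose hasSpouse fact is valid in `year`."""
    return [f.obj for f in FACTS
            if f.subject == subject and f.predicate == "hasSpouse"
            and holds_at(f.fact_id, year)]


def spouses_when_honored(subject: str, honor: str) -> list[str]:
    """Mirror of the slide's query: spouses whose interval contains the honor date."""
    result = []
    for f in FACTS:
        if f.subject == subject and f.predicate == "gotHonor" and f.obj == honor:
            year = HAPPENED_ON.get(f.fact_id)
            if year is not None:
                result.extend(spouses_at(subject, year))
    return result
```

With the slide's data, `spouses_at("PaulMcCartney", 2005)` yields HeatherMills, while the honor query returns nothing because 1999 lies outside both marriage intervals listed here.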

Page 14: Gerhard  Weikum Max Planck Institute  for Informatics weikum

1. Dynamics: Nothing Lasts Forever

http://www.mpi-inf.mpg.de/yago-naga/yago/

Page 15: Gerhard  Weikum Max Planck Institute  for Informatics weikum

2. Linkage: sameAs Links

dbpedia.org/resource/Linda_Louise_Eastman owl:sameAs yago-knowledge.org/resource/Linda_McCartney

www.freebase.com/view/en/man_with_no_name owl:sameAs dbpedia.org/page/Clint_Eastwood

data.linkedmdb.org/page/film/38166 owl:sameAs de.dbpedia.org/page/Zwei_glorreiche_Halunken

LOD statistics: 30 billion triples, 500 million links
• 330 million links trivial (ID-based): within pub, within bio
• tens of millions of links near-trivial: DBpedia, Freebase, Yago, GeoNames
• sameas.org: 17 million bundles for 50 million URIs
• data.nytimes.com: 5000 people, 2000 locations

Way too few for a world with:
1 million people, 10 million locations, tens of millions of species, 6 million books, 2 million movies, 10 million songs, etc. etc.
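A "bundle" of co-referring URIs can be read as a connected component of the sameAs graph. The sketch below computes such bundles with a small union-find; it is an assumption about how bundling could work, not a description of how sameas.org is implemented, and `same_as_bundles` is a hypothetical helper name.

```python
from collections import defaultdict


def same_as_bundles(links):
    """Group URIs into bundles: connected components of the sameAs graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two bundles

    bundles = defaultdict(set)
    for uri in list(parent):
        bundles[find(uri)].add(uri)
    return sorted(sorted(bundle) for bundle in bundles.values())


# The three sameAs links shown on the slide.
LINKS = [
    ("dbpedia.org/resource/Linda_Louise_Eastman",
     "yago-knowledge.org/resource/Linda_McCartney"),
    ("www.freebase.com/view/en/man_with_no_name",
     "dbpedia.org/page/Clint_Eastwood"),
    ("data.linkedmdb.org/page/film/38166",
     "de.dbpedia.org/page/Zwei_glorreiche_Halunken"),
]

if __name__ == "__main__":
    for bundle in same_as_bundles(LINKS):
        print(bundle)
```

Because sameAs is treated as transitive here, a single questionable link merges two bundles entirely, which is one reason link accuracy matters as much as coverage.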

Page 16: Gerhard  Weikum Max Planck Institute  for Informatics weikum

2. Linkage: sameAs Coverage

Page 17: Gerhard  Weikum Max Planck Institute  for Informatics weikum

2. Linkage: sameAs Accuracy

http://sameas.org

Page 18: Gerhard  Weikum Max Planck Institute  for Informatics weikum

3. Ubiquity: Web-of-Data & Web-of-Contents

Page 19: Gerhard  Weikum Max Planck Institute  for Informatics weikum

3. Ubiquity: Web of Data & Other Contents

RDF data and Web contents need to be interconnected
RDFa & microformats provide the mechanism

How do we get the Web RDF-annotated (at large scale)?
Largely automated, but allow humans in the loop

Page 20: Gerhard  Weikum Max Planck Institute  for Informatics weikum

3. Ubiquity: Web of Data & Other Contents

May 2, 2011

Maestro Morricone will perform on the stage of the Smetana Hall to conduct the Czech National Symphony Orchestra and Choir. The concert will feature both Classical compositions and soundtracks such as the Ecstasy of Gold. In programme two concerts for July 14th and 15th.

<html … May 2, 2011

<div typeof=event:music>
<span id="Maestro_Morricone">Maestro Morricone<a rel="sameAs" resource="dbpedia…/Ennio_Morricone"/></span>
…
<span property="event:location">Smetana Hall</span>
…
<span property="rdf:type" resource="yago:performance">The concert</span> will feature …
<span property="event:date" content="14-07-2011"></span>July 1
</div>
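Once a mention has been mapped to an entity, emitting markup in the style of the slide is mechanical. The following sketch wraps the first occurrence of a mention in a span carrying a sameAs link; `annotate_mention` is a hypothetical helper, and the attribute layout simply copies the slide's example rather than any particular RDFa profile.

```python
from html import escape


def annotate_mention(text: str, mention: str, entity_uri: str) -> str:
    """Wrap the first occurrence of `mention` in a span with a sameAs link."""
    start = text.find(mention)
    if start < 0:
        return escape(text)  # mention not found: return escaped text unchanged
    end = start + len(mention)
    span = '<span id="{id}">{label}<a rel="sameAs" resource="{uri}"/></span>'.format(
        id=escape(mention.replace(" ", "_"), quote=True),
        label=escape(mention),
        uri=escape(entity_uri, quote=True),
    )
    return escape(text[:start]) + span + escape(text[end:])


if __name__ == "__main__":
    print(annotate_mention(
        "Maestro Morricone will perform on the stage of the Smetana Hall",
        "Maestro Morricone",
        "http://dbpedia.org/resource/Ennio_Morricone",
    ))
```

The hard part is choosing the right entity URI for each mention, which is the named-entity disambiguation problem discussed later in the talk.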

Page 21: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Why a Few Triples More?

Linked Data is great!
But still in its infancy
Need to add triples to capture further issues:

• Dynamics: Where is the live data?
• Linkage: Where are the links in Linked Data?
• Ubiquity: Where are the paths between the Web-of-Data and the Web?

Page 22: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Outline

Explain Title

Why More Triples: Dynamics, Linkage, Ubiquity

Linkage & Ubiquity: Named-Entity Disambiguation

Web-Scale Linkage

Wrap-up

Page 23: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Entities on the Web

http://sig.ma

Page 24: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Named-Entity Disambiguation (NED)

Harry fought with you know who. He defeats the dark lord.

Candidate entities shown: Harry Potter, Dirty Harry, Prince Harry of England, Lord Voldemort, The Who (band)

Three NLP tasks:

1) named-entity detection: segment & label by HMM or CRF (e.g. Stanford NER tagger)

2) co-reference resolution: link to preceding NP (trained classifier over linguistic features)

3) named-entity disambiguation: map each mention (name) to canonical entity (entry in KB)

Page 25: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Mentions, Meanings, Mappings

Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.

KB:
Sergio means Sergio_Leone
Sergio means Serge_Gainsbourg
Ennio means Ennio_Antonelli
Ennio means Ennio_Morricone
Eli means Eli_(bible)
Eli means ExtremeLightInfrastructure
Eli means Eli_Wallach
Ecstasy means Ecstasy_(drug)
Ecstasy means Ecstasy_of_Gold
trilogy means Star_Wars_Trilogy
trilogy means Lord_of_the_Rings
trilogy means Dollars_Trilogy
… … …

[Figure: mentions (surface names) to be mapped (?) to entities (meanings). Entities shown: Eli (bible), Eli Wallach, Dollars Trilogy, Lord of the Rings, Star Wars Trilogy, Benny Andersson, Benny Goodman, Ecstasy of Gold, Ecstasy (drug)]

Page 26: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Mention-Entity Graph

Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.

[Figure: weighted undirected graph with two types of nodes. Candidate entities: Dollars Trilogy, Lord of the Rings, Star Wars, Ecstasy of Gold, Ecstasy (drug), Eli (bible), Eli Wallach]

KB+Stats

Popularity(m,e):
• freq(e|m)
• length(e)
• #links(e)

Similarity(m,e):
• cos/Dice/KL (context(m), context(e))
  bag-of-words or language model: words, bigrams, phrases
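As a rough illustration of the two mention-entity features, the sketch below computes a popularity prior of the freq(e|m) kind from link-anchor counts and compares bag-of-words contexts with cosine and Dice. The helper names and the example contexts in the usage section are invented for illustration; KL divergence and language models are left out.

```python
import math
from collections import Counter


def prior(anchor_counts: dict[str, int]) -> dict[str, float]:
    """freq(e|m): share of occurrences of mention m that refer to entity e."""
    total = sum(anchor_counts.values())
    return {e: c / total for e, c in anchor_counts.items()} if total else {}


def bag(text: str) -> Counter:
    """Bag-of-words context: lower-cased whitespace tokens with counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def dice(a: set, b: set) -> float:
    """Dice coefficient between two word sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


if __name__ == "__main__":
    # Hypothetical contexts, for illustration only.
    mention_ctx = bag("role in the ecstasy scene on the graveyard western films")
    gold_ctx = bag("soundtrack western film graveyard scene morricone")
    drug_ctx = bag("psychoactive drug mdma party")
    print(cosine(mention_ctx, gold_ctx), cosine(mention_ctx, drug_ctx))
```

In this toy setup the context of "Ecstasy" overlaps with the soundtrack description but not with the drug description, so similarity can outweigh a popularity prior that favors the drug.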

Page 27: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Mention-Entity Graph

Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.

[Figure: weighted undirected graph with two types of nodes. Candidate entities: Dollars Trilogy, Lord of the Rings, Star Wars, Ecstasy of Gold, Ecstasy (drug), Eli (bible), Eli Wallach]

KB+Stats

Popularity(m,e):
• freq(e|m)
• length(e)
• #links(e)

Similarity(m,e):
• cos/Dice/KL (context(m), context(e))

joint mapping

Page 28: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Mention-Entity Graph

Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.

[Figure: weighted undirected graph with two types of nodes. Candidate entities: Dollars Trilogy, Lord of the Rings, Star Wars, Ecstasy of Gold, Ecstasy (drug), Eli (bible), Eli Wallach]

KB+Stats

Popularity(m,e):
• freq(m,e|m)
• length(e)
• #links(e)

Similarity(m,e):
• cos/Dice/KL (context(m), context(e))

Coherence(e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)

Page 29: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Mention-Entity Graph

Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.

[Figure: weighted undirected graph with two types of nodes. Candidate entities: Dollars Trilogy, Lord of the Rings, Star Wars, Ecstasy of Gold, Ecstasy (drug), Eli (bible), Eli Wallach]

KB+Stats

Popularity(m,e):
• freq(m,e|m)
• length(e)
• #links(e)

Similarity(m,e):
• cos/Dice/KL (context(m), context(e))

Coherence(e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)

Types attached to candidate entities in the figure:
• American Jews, film actors, artists, Academy Award winners
• Metallica songs, Ennio Morricone songs, artifacts, soundtrack music
• spaghetti westerns, film trilogies, movies, artifacts

Page 30: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Mention-Entity Graph

Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.

[Figure: weighted undirected graph with two types of nodes. Candidate entities: Dollars Trilogy, Lord of the Rings, Star Wars, Ecstasy of Gold, Ecstasy (drug), Eli (bible), Eli Wallach]

KB+Stats

Popularity(m,e):
• freq(m,e|m)
• length(e)
• #links(e)

Similarity(m,e):
• cos/Dice/KL (context(m), context(e))

Coherence(e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)

Links attached to candidate entities in the figure:
• http://.../wiki/Dollars_Trilogy, http://.../wiki/The_Good,_the_Bad,_the_Ugly, http://.../wiki/Clint_Eastwood, http://.../wiki/Honorary_Academy_Award
• http://.../wiki/The_Good,_the_Bad,_the_Ugly, http://.../wiki/Metallica, http://.../wiki/Bellagio_(casino), http://.../wiki/Ennio_Morricone
• http://.../wiki/Sergio_Leone, http://.../wiki/The_Good,_the_Bad,_the_Ugly, http://.../wiki/For_a_Few_Dollars_More, http://.../wiki/Ennio_Morricone

Page 31: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Mention-Entity Graph

Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.

[Figure: weighted undirected graph with two types of nodes. Candidate entities: Dollars Trilogy, Lord of the Rings, Star Wars, Ecstasy of Gold, Ecstasy (drug), Eli (bible), Eli Wallach]

KB+Stats

Popularity(m,e):
• freq(m,e|m)
• length(e)
• #links(e)

Similarity(m,e):
• cos/Dice/KL (context(m), context(e))

Coherence(e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)

Anchor words attached to candidate entities in the figure:
• Metallica on Morricone tribute, Bellagio water fountain show, Yo-Yo Ma, Ennio Morricone composition
• The Magnificent Seven, The Good, the Bad, and the Ugly, Clint Eastwood, University of Texas at Austin
• For a Few Dollars More, The Good, the Bad, and the Ugly, Man with No Name trilogy, soundtrack by Ennio Morricone

Page 32: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Joint Mapping

• Build mention-entity graph or joint-inference factor graph from knowledge and statistics in KB
• Compute high-likelihood mapping (ML or MAP) or dense subgraph such that: each m is connected to exactly one e (or at most one e)

[Figure: mention-entity graph with numeric edge weights]

Page 33: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Coherence Graph Algorithm

• Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e)
• Greedy approximation: iteratively remove weakest entity and its edges
• Keep alternative solutions, then use local/randomized search

[Figure: mention-entity graph with numeric edge weights]

[J. Hoffart et al.: EMNLP'11]
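The greedy approximation can be sketched as follows. This is a simplified reading of the bullets above, not the published algorithm: it repeatedly drops the entity with the smallest weighted degree unless that entity is the last remaining candidate of some mention, and it skips the "keep alternative solutions" and local/randomized search steps. The function name and the weights in the usage example are hypothetical.

```python
def greedy_disambiguate(me_weights, ee_weights):
    """me_weights: {(mention, entity): w}; ee_weights: {(entity, entity): w}."""
    candidates = {}
    for (mention, entity) in me_weights:
        candidates.setdefault(mention, set()).add(entity)
    active = set().union(*candidates.values())

    def degree(e):
        # Weighted degree: mention edges plus edges to still-active entities.
        d = sum(w for (_, x), w in me_weights.items() if x == e)
        d += sum(w for (a, b), w in ee_weights.items()
                 if (a == e and b in active) or (b == e and a in active))
        return d

    def removable(e):
        # Never remove the last remaining candidate of any mention.
        return all(len(c & active) > 1 for c in candidates.values() if e in c)

    while True:
        pool = [e for e in active if removable(e)]
        if not pool:
            break
        weakest = min(pool, key=lambda e: (degree(e), e))  # name breaks ties
        active.remove(weakest)

    # If a mention still has several candidates, take the strongest one.
    return {m: max(c & active, key=lambda e: (degree(e), e))
            for m, c in candidates.items()}


if __name__ == "__main__":
    # Hypothetical weights, for illustration only.
    me = {
        ("Eli", "Eli_(bible)"): 50, ("Eli", "Eli_Wallach"): 30,
        ("Ecstasy", "Ecstasy_(drug)"): 60, ("Ecstasy", "Ecstasy_of_Gold"): 20,
        ("trilogy", "Star_Wars_Trilogy"): 50, ("trilogy", "Dollars_Trilogy"): 30,
    }
    ee = {
        ("Eli_Wallach", "Ecstasy_of_Gold"): 80,
        ("Eli_Wallach", "Dollars_Trilogy"): 90,
        ("Ecstasy_of_Gold", "Dollars_Trilogy"): 100,
    }
    print(greedy_disambiguate(me, ee))
```

In the toy input each mention's locally strongest candidate is the wrong one, yet the coherence edges among Eli_Wallach, Ecstasy_of_Gold and Dollars_Trilogy keep those three entities alive while the isolated candidates are removed.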

Page 34: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Coherence Graph Algorithm

• Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e)
• Greedy approximation: iteratively remove weakest entity and its edges
• Keep alternative solutions, then use local/randomized search

[Figure: mention-entity graph with numeric edge weights]

[J. Hoffart et al.: EMNLP'11]

Page 35: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Coherence Graph Algorithm

• Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e)
• Greedy approximation: iteratively remove weakest entity and its edges
• Keep alternative solutions, then use local/randomized search

[Figure: mention-entity graph with numeric edge weights]

[J. Hoffart et al.: EMNLP'11]

Page 36: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Coherence Graph Algorithm

• Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e)
• Greedy approximation: iteratively remove weakest entity and its edges
• Keep alternative solutions, then use local/randomized search

[Figure: mention-entity graph with numeric edge weights]

[J. Hoffart et al.: EMNLP'11]

Page 37: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Named-Entity Disambiguation: State-of-the-Art

Online tools:
https://d5gate.ag5.mpi-sb.mpg.de/webaida/
http://tagme.di.unipi.it/
http://spotlight.dbpedia.org/demo/index.html
http://viewer.opencalais.com/
http://wikipedia-miner.cms.waikato.ac.nz/demos/annotate/
etc.

Literature:
• Razvan Bunescu, Marius Pasca: EACL 2006
• Silviu Cucerzan: EMNLP 2007
• David Milne, Ian Witten: CIKM 2008
• S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD 2009
• G. Limaye, S. Sarawagi, S. Chakrabarti: VLDB 2010
• Paolo Ferragina, Ugo Scaiella: CIKM 2010
• Mark Dredze et al.: COLING 2010
• Johannes Hoffart et al.: EMNLP 2011
etc. etc.

Page 38: Gerhard  Weikum Max Planck Institute  for Informatics weikum

NED: Experimental Evaluation

Benchmark:
• Extended CoNLL 2003 dataset: 1400 newswire articles
• originally annotated with mention markup (NER), now with NED mappings to Yago and Freebase
• difficult texts:
  … Australia beats India … → Australian_Cricket_Team
  … White House talks to Kremlin … → President_of_the_USA
  … EDS made a contract with … → HP_Enterprise_Services

Results:
Best: AIDA method with prior+sim+coh + robustness test
82% precision @ 100% recall, 87% mean average precision
Comparison to other methods, see paper

J. Hoffart et al.: Robust Disambiguation of Named Entities in Text, EMNLP 2011
http://www.mpi-inf.mpg.de/yago-naga/aida/

Page 39: Gerhard  Weikum Max Planck Institute  for Informatics weikum

AIDA: Accurate Online Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/aida/

Page 40: Gerhard  Weikum Max Planck Institute  for Informatics weikum

AIDA: Accurate Online Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/aida/

Page 41: Gerhard  Weikum Max Planck Institute  for Informatics weikum

AIDA: Accurate Online Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/aida/

Page 42: Gerhard  Weikum Max Planck Institute  for Informatics weikum

AIDA: Accurate Online Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/aida/

Page 43: Gerhard  Weikum Max Planck Institute  for Informatics weikum

AIDA: Accurate Online Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/aida/

Page 44: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Interesting Research Issues

• More efficient graph algorithms (multicore, etc.)

• Allow mentions of unknown entities, mapped to null

• Short and difficult texts:
  • tweets, headlines, etc.
  • fictional texts: novels, song lyrics, etc.
  • incoherent texts

• Disambiguation beyond entity names:
  • coreferences: pronouns, paraphrases, etc.
  • common nouns, verbal phrases (general WSD)

• Leverage deep-parsing structures, leverage semantic types

Page 45: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Why Named Entity Disambiguation is Key

• Linked data is best if it has many good links

• New & rich contents mostly in traditional Web

• Create sameAs links in (X)HTML contents, via RDFa

• Links for named entities give best mileage/effort

• Methods & tools greatly advanced & gradually maturing

• Keep human in the loop, embed NED in authoring tools

Page 46: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Outline

Explain Title

Why More Triples: Dynamics, Linkage, Ubiquity

Linkage & Ubiquity: Named-Entity Disambiguation

Web-Scale Linkage

Wrap-up

Page 47: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Variants of NED at Web Scale

Tools can map short text onto entities in a few seconds

• How to run this on a big batch of 1 million input texts?
  partition inputs across distributed machines, organize dictionary appropriately, … exploit cross-document contexts

• How to deal with inputs from different time epochs?
  consider time-dependent contexts, map to entities of proper epoch (e.g. harvested from Wikipedia history)

• How to handle Web-scale inputs (100 million pages) restricted to a set of interesting entities? (e.g. tracking politicians and companies)

Page 48: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Linked RDF Triples on the Web

[Figure: linked-data graph, with question marks on links.
Nodes: dbpedia.org/resource/Ennio_Morricone, dbpedia.org/resource/Rome, rdf.freebase.com/ns/en.rome_ny, data.nytimes.com/51688803696189142301, geonames.org/5134301/city_of_rome (Coord: N 43° 12' 46'' W 75° 27' 20''), yago/wikicategory:ItalianComposer, yago/wordnet:Artist109812338, yago/wordnet:Actor109765278, imdb.com/name/nm0910607/, imdb.com/title/tt0361748/.
Edge labels: owl:sameAs (three times), dbpprop:citizenOf, rdf:type (twice), rdfs:subclassOf (twice), prop:actedIn, prop:composedMusicFor.]

referential data quality: automatic, dynamic, high coverage!

Page 49: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Outline

Explain Title

Why More Triples: Dynamics, Linkage, Ubiquity

Linkage & Ubiquity: Named-Entity Disambiguation

Web-Scale Linkage

Wrap-up

Page 50: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Summary

Linked Data is great!
But it needs more triples to capture:

• Dynamics: (Deep-Web) sources → feeds, pub-sub, … → fresh & versioned triples
• Linkage: LOD → entity mapping → user community
• Ubiquity: RDFa → entity disambiguation → authoring

Page 51: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Outlook

For a Few Triples More

Challenge 1: generate high-quality sameAs links in RDFa & across all LOD sources

For a Few Triples Less

Challenge 2: add efficient top-k ranking to queries over RDF-in-context

Page 52: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Thank You !