From Linked Data to Tightly Integrated Data

Post on 24-Apr-2015

description

Invited Talk at the 3rd Workshop on Linked Data in Linguistics: Multilingual Knowledge Resources and Natural Language Processing. Reykjavik, Iceland, 27th May 2014.

The ideas behind the Web of Linked Data have great allure. Apart from the prospect of large amounts of freely available data, we are also promised nearly effortless interoperability. Common data formats and protocols have indeed made it easier than ever to obtain and work with information from different sources simultaneously, opening up new opportunities in linguistics, library science, and many other areas. In this talk, however, I argue that the true potential of Linked Data can only be appreciated when extensive cross-linkage and integration engender an even higher degree of interconnectedness. This can take the form of shared identifiers, e.g. those based on Wikipedia and WordNet, which can be used to describe numerous forms of linguistic and commonsense knowledge. An alternative is to rely on sameAs and similarity links, which can automatically be discovered using scalable approaches like the LINDA algorithm but need to be interpreted with great care, as we have observed in experimental studies. A closer level of linkage is achieved when resources are also connected at the taxonomic level, as exemplified by the MENTA approach to taxonomic data integration. Such integration means that one can buy into ecosystems already carrying a range of valuable pre-existing assets. Even more tightly integrated resources like Lexvo.org combine triples from multiple sources into unified, coherent knowledge bases. Finally, I also comment on how to address some remaining challenges that are still impeding a more widespread adoption of Linked Data on the Web. In the long run, I believe that such steps will lead us to significantly more tightly integrated Linked Data.

Transcript of From Linked Data to Tightly Integrated Data

From Linked Data to Tightly Integrated Data

May 2014

Gerard de Melo, Tsinghua University, Beijing

25 Years of the World Wide Web: 1989−2014

http://geekcom.wordpress.com/2009/03/19/

Tim Berners-Lee

Documents for human viewing

From Text to Structured Data

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

IE

Source: Marko Grobelnik, Dunja Mladenic. KDD 2007.

NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..

The Semantic Web

http://geekcom.wordpress.com/2009/03/19/

Tim Berners-Lee

[Graph edge labels: colleague, born in (Frankfurt), described by, created by]

Publish data in the right form right from the start

Assign URIs not just to Documents, also to People, etc.

http://www.demelo.org/gdm/#GDM

http://dblp.l3s.de/d2r/page/publications/conf/cikm/MeloW09

Assign URIs to Predicates (Edge Types)

created by: http://purl.org/dc/elements/1.1/creator
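
As a rough illustration of the two points above, the "created by" edge can be written as a single triple in which subject, predicate, and object are all URIs. The sketch below serializes it as one N-Triples line. Pairing the DBLP publication URL with the personal URI, and the helper name `to_ntriple`, are assumptions made for illustration; the predicate is the Dublin Core creator property.

```python
def to_ntriple(subject: str, predicate: str, obj: str) -> str:
    """Serialize one triple whose three parts are URIs as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


# URIs taken from the slides; which publication belongs to which person is an
# assumption made for this example.
PUBLICATION = "http://dblp.l3s.de/d2r/page/publications/conf/cikm/MeloW09"
CREATOR = "http://purl.org/dc/elements/1.1/creator"
PERSON = "http://www.demelo.org/gdm/#GDM"

line = to_ntriple(PUBLICATION, CREATOR, PERSON)
print(line)
```

Because the person and the predicate are URIs rather than strings, other datasets can point at exactly the same node and edge type.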

Challenge: Simplify Publishing

http://www.gauson.com/blog/2007/12/09/minimal-template-for-blogspot/

Freebase: Better UI but not universal

Big Knowledge Graphs

YAGO2. Hoffart et al. WWW 2011.

Lexical Knowledge Bases

Etymological Wordnet

LREC 2014 Poster Session P17, 16:45-18:05

Also Christian Chiarcos today

Lexical Intensity Orderings

okay < good < great < superb (from weak to strong)

de Melo & Bansal. Transactions of the ACL, 2013.

Metaphors: ICSI MetaNet Project

WebChild: Common-Sense

Common-Sense Relations, Properties, Comparisons

Tandon et al. WSDM 2014.

Tandon et al. AAAI 2014.

Tandon et al. AAAI 2011.

Linked Data in Use

Input: Keywords, the World's Data

Output: Address User's Needs

Used in IBM's Jeopardy!-winning Watson system

The Plan

Linked Data

Really Linked Data

Integrated Data

Tightly Integrated Data

Really Linked Data

Just converting to RDF is trivial

Use entities instead of literals where possible

Book 23 --author--> “Franz Kafka”

Book 23 --author--> Author 14

Author 14 --name--> “Franz Kafka”

Author 14 --born in--> Prague

Performance 1 --language--> “en”

Performance 2 --language--> “English”

Performance 3 --language--> “engl.”

Performance 1 --language--> English

Performance 2 --language--> English

Performance 3 --language--> English

English: http://lexvo.org/id/iso639-3/eng
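
The step from language literals to a shared entity can be sketched as a simple lookup. This is only a minimal sketch: the lookup table and the function name `language_uri` are hypothetical, and a real dataset would need a far more complete mapping. The URI prefix is the one shown on the slide.

```python
LEXVO_LANGUAGE_PREFIX = "http://lexvo.org/id/iso639-3/"

# Hypothetical lookup table: surface forms found in the data -> ISO 639-3 code.
LITERAL_TO_ISO639_3 = {
    "en": "eng",
    "english": "eng",
    "engl.": "eng",
}


def language_uri(literal):
    """Return the Lexvo.org language URI for a literal, or None if unknown."""
    code = LITERAL_TO_ISO639_3.get(literal.strip().lower())
    return LEXVO_LANGUAGE_PREFIX + code if code else None


performances = {
    "Performance 1": "en",
    "Performance 2": "English",
    "Performance 3": "engl.",
}
linked = {name: language_uri(value) for name, value in performances.items()}
for name, uri in linked.items():
    print(name, "language", uri)
```

After this step all three performances point at the same node, so a query for English-language performances no longer has to guess every spelling.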

Vocabulary / Ontology Re-Use

http://lov.okfn.org/

Linked Data Cloud

Identifiers and Cross-Linkage

Arguably more important than RDF as a format

Example: Google Knowledge Graph

Buy into rich existing ecosystems

Focal Point: WordNet

UWN (CIKM 2009): over 1,000,000 words in over 100 languages

UWN/MENTA: Universal WordNet

Focal Point: Lexvo.org

[Figure: Lexvo.org entities Ukrainian, Cyrillic (Script), and Ukraine; the Lexvo.org entity Ukraine is connected by owl:sameAs to Ukraine in GeoNames]

[Figure: "My Resource" linked to the Lexvo.org entity Ukrainian]

Lexvo.org API: Identifiers

.getLanguageURIforISO639P1("uk")

Lexvo.org API: Identifiers

.getTermURI("car", "eng")

RDF:

“car”@en l:means sumo:Automobile

lexvo:term/eng/car l:means sumo:Automobile
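
To make the term identifiers concrete, here is a small Python stand-in for the `.getTermURI` call shown above. It is not the Lexvo.org API itself: the function name, the expansion of the `lexvo:` prefix to `http://lexvo.org/id/`, and the percent-encoding of the term are all assumptions made for illustration.

```python
from urllib.parse import quote

# Assumed expansion of the lexvo: prefix used on the slide.
LEXVO_ID = "http://lexvo.org/id/"


def get_term_uri(term: str, iso639_3: str) -> str:
    """Build an identifier of the form lexvo:term/<language>/<term>."""
    # Percent-encoding the term is an assumption, so that spaces and other
    # reserved characters still yield a valid URI.
    return f"{LEXVO_ID}term/{iso639_3}/{quote(term, safe='')}"


uri = get_term_uri("car", "eng")
triple = (uri, "l:means", "sumo:Automobile")
print(triple)
```

Using a term URI as the subject, instead of the literal “car”@en, gives the word a node of its own that further statements can attach to.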

Focal Point: Lexvo.org

Semantic Web Journal 2014

Focal Point: Lexvo.org

Lexvo.org

Roget's Thesaurus

WordNet Evocation Links

Etymological WordNet

PropBank lexicon

NomBank lexicon

MPQA Subjectivity Lexicon

AFINN Affective Lexicon

CMU Pronunciation Dictionary

Linked Entities

Source: Gerhard Weikum. For a few Triples more.

LINDA: Creating Links

LINDA: Böhm et al. CIKM 2012

Scale to Billion Triples Challenge Dataset despite dependencies

SameAs Links

[Figure: Ukraine in Lexvo.org connected by owl:sameAs to Ukraine in GeoNames]

Leibnizian Identity

For all x: x = x

For all x, y, p: x = y => p(x) = p(y)

Identity vs. Near-Identity

Official Standard & Leibniz

Automatic linkers & sameas.org

Einstein's Miracle Year owl:sameAs Einstein
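
Why a near-identity link is harmful under strict (Leibnizian) semantics can be sketched with a small union-find: once sameAs is treated as symmetric and transitive, every property of one node becomes a property of the whole class. The URIs and facts below are hypothetical stand-ins for the Einstein example.

```python
def same_as_classes(links):
    """Group URIs into equivalence classes under symmetric, transitive sameAs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    classes = {}
    for node in list(parent):
        classes.setdefault(find(node), set()).add(node)
    return list(classes.values())


# Hypothetical links: the second one is only a near-identity.
links = [
    ("dbpedia:Albert_Einstein", "freebase:Einstein"),
    ("dbpedia:Annus_Mirabilis_papers", "dbpedia:Albert_Einstein"),
]
facts = {
    "dbpedia:Albert_Einstein": {("bornIn", "Ulm")},
    "dbpedia:Annus_Mirabilis_papers": {("year", "1905")},
}

merged = []
for cls in same_as_classes(links):
    properties = set()
    for node in cls:
        properties |= facts.get(node, set())
    merged.append((cls, properties))
print(merged)
```

All three URIs end up in one class, so the person inherits the year of the papers and the papers inherit a birthplace.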

Merging Lexical Resources

ACL 2010, AAAI 2013

Identity Constraints

Idea: Exploit Dataset-specific Unique Names Assumptions

[Graph nodes: dbpedia:Paul, dbpedia:Paulie (redirect), musicbrainz:Paulie, dblp:Paula, dbpedia:Paula, freebase:Paul]

Use set-based formalism to account for exceptions + to avoid quadratic number of pairwise constraints

Add edge weights

[Edge weights shown in the graph: 2, 2, 1, 1, 1, 1]

Goal: Consistency, minimizing weighted edge deletions
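
The objective above can be made concrete with an exhaustive search on the six-node example. This is only a sketch of the objective, not the method from the paper: it tries every subset of edges, which is feasible only for tiny graphs. The slides give the node names and the weights but not which edge carries which weight, so the edge list and the single DBpedia constraint below are assumptions.

```python
from itertools import combinations

# Hypothetical sameAs graph as (node, node, weight) triples.
EDGES = [
    ("dbpedia:Paul", "dbpedia:Paulie", 2),
    ("dbpedia:Paul", "freebase:Paul", 2),
    ("dbpedia:Paulie", "musicbrainz:Paulie", 1),
    ("musicbrainz:Paulie", "dblp:Paula", 1),
    ("dblp:Paula", "dbpedia:Paula", 1),
    ("freebase:Paul", "dbpedia:Paula", 1),
]
NODES = sorted({n for a, b, _ in EDGES for n in (a, b)})

# One set-based constraint for DBpedia: nodes in different groups must stay
# distinct; the redirect shares a group with its target (the exception).
CONSTRAINTS = [
    [{"dbpedia:Paul", "dbpedia:Paulie"}, {"dbpedia:Paula"}],
]


def components(nodes, edges):
    """Map each node to a representative of its connected component."""
    adjacent = {n: set() for n in nodes}
    for a, b, _ in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    comp = {}
    for start in nodes:
        if start in comp:
            continue
        comp[start] = start
        stack = [start]
        while stack:
            current = stack.pop()
            for neighbour in adjacent[current]:
                if neighbour not in comp:
                    comp[neighbour] = start
                    stack.append(neighbour)
    return comp


def consistent(nodes, edges, constraints):
    """True if no component contains nodes from two groups of one constraint."""
    comp = components(nodes, edges)
    for groups in constraints:
        for g1, g2 in combinations(groups, 2):
            if {comp[n] for n in g1} & {comp[n] for n in g2}:
                return False
    return True


def min_weight_deletions(nodes, edges, constraints):
    """Try every subset of edges; return (cost, removed) with minimal cost."""
    best = None
    for k in range(len(edges) + 1):
        for removed in combinations(edges, k):
            kept = [e for e in edges if e not in removed]
            if consistent(nodes, kept, constraints):
                cost = sum(w for _, _, w in removed)
                if best is None or cost < best[0]:
                    best = (cost, removed)
    return best


cost, removed = min_weight_deletions(NODES, EDGES, CONSTRAINTS)
print(cost, removed)
```

Note that separating dbpedia:Paula from the other two DBpedia nodes requires cutting every path between them, which is why a single deletion is not enough in this example.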

Algorithm

Capture separation between nodes, which requires edge deletions along all paths

See Paper for details, incl. relationship to Hungarian Algorithm and Graph Cuts

Leighton & Rao style Region Growing

Experiments

BTC: Large Linked Data Web crawl, 20GB gzipped

sameas.org: Most well-known collection of sameAs links, aggregated from various Linked Data sources

>500,000 node pairs, but algorithm removes only 280,000 edges

Identity Links

Must distinguish identity from near-identity

Can automatically identify 500,000 inconsistent URI pairs

Fix using LP Graph Algorithm

Use more specific properties!

lvont:strictlySameAs (Lexvo.org)

skos:closeMatch

etc.

Questions?

Image: Question Answering over Linked Data Workshop

The PlanThe Plan

Linked Data

Really Linked Data

Integrated Data

Tightly Integrated Data

Taxonomic Links

A user wants a list of „Art Schools in Europe“

Multilingual Taxonomies

A Swedish user wants a list of „Konstskolor i Europa“

MENTA

200+ Wikipedia editions, WordNet, etc.

Predict Individual Identity Links: WordNet-Wikipedia, Article-Redirect, Article-Category, etc.

Predict Individual Taxonomic Links: Article → Category, Category → WordNet

Taxonomic Links: MENTA

Use Identity Constraint Algorithm to form equivalence classes

Markov Chain Random Walk with Restarts to Rank Parents
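
The parent-ranking step can be illustrated with a generic random walk with restart, computed by power iteration. This is a sketch of the general technique only; how MENTA actually builds and weights its Markov chain is not given in the slides. The graph, the node names, the edge weights, and the restart probability are all hypothetical.

```python
def random_walk_with_restart(graph, start, restart=0.15, iterations=100):
    """Approximate the stationary distribution of a walk that restarts at `start`.

    `graph` maps each node to a dict {successor: weight}; every successor
    must itself be a key of `graph`.
    """
    score = {node: 0.0 for node in graph}
    score[start] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, mass in score.items():
            out = graph[node]
            total = sum(out.values())
            if total == 0:
                # Dangling node: send the walker back to the start node.
                nxt[start] += (1 - restart) * mass
                continue
            for target, weight in out.items():
                nxt[target] += (1 - restart) * mass * weight / total
        nxt[start] += restart  # total mass is 1, so the restart share is `restart`
        score = nxt
    return score


# Hypothetical candidate-parent graph for one entity.
graph = {
    "article:Example_Art_School": {
        "category:Art schools in Europe": 2.0,
        "category:1735 establishments": 1.0,
    },
    "category:Art schools in Europe": {"wordnet:school": 1.0},
    "category:1735 establishments": {},
    "wordnet:school": {},
}

scores = random_walk_with_restart(graph, "article:Example_Art_School")
candidates = [n for n in graph if n != "article:Example_Art_School"]
ranked = sorted(candidates, key=scores.get, reverse=True)
print(ranked)
```

Parents reachable through heavier or shorter paths from the entity collect more probability mass and therefore rank higher.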

UWN/MENTA

CIKM 2010 Best Paper Award

MENTA: Multilingual Entity Taxonomy

UWN/MENTA (de Melo & Weikum 2010)

● multilingual extension of WordNet, with 800,000 words in 250 languages

● 4.8 million instances/classes from multilingual Wikipedia editions

Multilingual extension of WordNet for word senses and taxonomical information over 200 languages

Questions?

Image: Question Answering over Linked Data Workshop

The PlanThe Plan

Linked Data

Really Linked Data

Integrated Data

Tightly Integrated Data

Challenge: Locked Away Data

Hard to run advanced algorithms over a SPARQL interface

Many sites don't provide downloads.

Challenge: Lost Data

http://sparqles.okfn.org/

Servers offline

Poor archiving

Dumps need to be archived and integrated.

Challenge: Updates

Need to be able to update when data changes

Need algorithmic solutions, not one-time process.

YAGO2s: Biega et al. 2013

Requirement: Integration Algorithm Pipelines

Input: Various Data

Output: Tightly Integrated Data

Lexvo.org

Semantic Web Journal 2014

Knowledge Graphs

Most large-scale knowledge bases have ground facts only

bornIn(Einstein,Ulm)

acquired(Microsoft,Powerset)

But language is much more expressive

● All humans are mortal.

● At least three but not more than 10 people know this secret.

● Three years ago, most people believed that Microsoft would buy Yahoo within months.

Challenge: Time

Temporal scope missing

Source: Gerhard Weikum. For a few Triples more.

OWL, RDFS, Description Logics

WebProtégé http://protege.stanford.edu/

Limit expressivity to get decidability.

Focus on class hierarchies and property axioms.

Cannot create new rules, e.g. to model “grandparent”, “uncle”, “legal adult”!
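
To show what such a rule looks like once arbitrary rules are allowed, here is a minimal sketch of the "grandparent" rule applied to a set of parent facts. The fact set and the function name are hypothetical, and the code only illustrates the rule itself, not any particular reasoner.

```python
def grandparents(parent_of):
    """Apply the rule: parent(x, y) and parent(y, z) => grandparent(x, z)."""
    return {
        (x, z)
        for (x, y1) in parent_of
        for (y2, z) in parent_of
        if y1 == y2
    }


# Hypothetical ground facts: (parent, child).
parent_of = {("Ann", "Bob"), ("Bob", "Cid"), ("Bob", "Dee")}
derived = grandparents(parent_of)
print(sorted(derived))
```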

Reasoning

Humans cannot act before being born (or, actually, before being conceived)

(=>
  (and
    (human ?HUMAN)
    (birthdate ?HUMAN ?T)
    (agent ?PROCESS ?HUMAN))
  (beforeOrEqual
    (daysBefore (BeginFn ?T) 365)
    (BeginFn (WhenFn ?PROCESS))))

Reasoning: SPASS-XDB
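
Read procedurally, the axiom says that a process with a human agent may start no earlier than 365 days before that human's birthdate. A minimal sketch of this check follows. The function name and the example process dates are hypothetical, and the code only tests the constraint on given dates rather than performing first-order reasoning as a prover would.

```python
from datetime import date, timedelta


def violates_birth_rule(birthdate, process_start, slack_days=365):
    """True if the process begins more than `slack_days` before the birthdate."""
    earliest_allowed = birthdate - timedelta(days=slack_days)
    return process_start < earliest_allowed


einstein_born = date(1879, 3, 14)

# Hypothetical process start dates, used only to exercise the rule.
print(violates_birth_rule(einstein_born, date(1905, 1, 1)))
print(violates_birth_rule(einstein_born, date(1870, 1, 1)))
```

A fact that fails this check, such as an invention dated before the inventor's birth, can then be flagged as a likely extraction or linking error.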

Search Interfaces

“Which companies were created during the last century in Silicon Valley?”

YAGO2: WWW 2011 Best Demo Award

Common-Sense Inference

I found the following restaurant near your current location: La Dolce Vita Pizza. 2318 Columbus Ave.

I'd rather have something healthier

Tandon et al. AAAI 2014

Conclusion

Really Linked Data
► Shared Identifiers
► Proper Interlinking

Integrated Data
► Taxonomical Integration

Tightly Integrated Data
► Processing Pipelines
► Towards Common-Sense Inference

www.demelo.org

gdm@demelo.org