Harvesting Knowledge from Web Data and Text CIKM 2010 Tutorial (1/2 Day)
Harvesting Knowledge from Web Data and Text
CIKM 2010 Tutorial (1/2 Day)
Hady W. Lauw1, Ralf Schenkel2, Fabian Suchanek3, Martin Theobald4,
and Gerhard Weikum4
1 Institute for Infocomm Research, Singapore   2 Saarland University, Saarbruecken
3 INRIA Saclay, Paris   4 Max Planck Institute for Informatics, Saarbruecken
All slides for download…
http://www.mpi-inf.mpg.de/yago-naga/CIKM10-tutorial/
Harvesting Knowledge from Web Data 2
Outline
• Part I – What and Why – Available Knowledge Bases
• Part II – Extracting Knowledge
• Part III – Ranking and Searching
• Part IV – Conclusion and Outlook
Motivation
Elvis Presley1935 - 1977
Elvis, when I
need you, I
can hear you!
Will there ever be someone like him again?
Motivation
Another Elvis
Elvis Presley: The Early Years
Elvis spent more weeks at the top of the charts than any other artist.
www.fiftiesweb.com/elvis.htm
Motivation
Personal relationships of Elvis Presley – Wikipedia
...when Elvis was a young teen... another girl whom the singer's mother hoped Presley would... The writer called Elvis "a hillbilly cat"
en.wikipedia.org/.../Personal_relationships_of_Elvis_Presley
Another singer called Elvis, young
Motivation
Dear Mr. Page, you don’t understand me. I just...
Elvis Presley - Official page for Elvis Presley
Welcome to the Official Elvis Presley Web Site, home of the undisputed King of Rock 'n' Roll and his beloved Graceland ...
www.elvis.com/
Motivation
Other (more serious?) queries:
• when is Madonna's next concert in Europe?
• which protein inhibits atherosclerosis?
• who was king of England when Napoleon I was emperor of France? (King George III)
• is there another famous singer named "Elvis"?
• has any scientist ever won the Nobel Prize in Literature? (Bertrand Russell)
• which countries have a HDI comparable to Sweden's?
• which scientific papers have led to patents?
This Tutorial
[Diagram: two different entities, both labeled "Elvis"; which one has type singer?]
In this tutorial, we will explain
• how the knowledge is organized
• how we can construct knowledge bases
• what knowledge bases exist already
• how we can query knowledge bases
Mr. Page, let's try this again. Is there another singer named Elvis?
Ontologies
[Diagram: example ontology graph. The instances Elvis and Tupelo carry the labels "Elvis" / "The King"; type edges connect Elvis to the class singer and Tupelo to the class city; subclassOf edges lead from singer to person to entity, and from city to location to entity; a bornIn edge connects Elvis to Tupelo; the class scientists also appears.]
Classes
Instances
Relations
Labels/words
The same label for two entities: homonymy
The same entity has two labels: synonymy
Classes
[Diagram: Elvis has type singer (and type scientist?); singer is a subclassOf person, person a subclassOf entity.]
Transitivity: type(x,y) /\ subclassOf(y,z) => type(x,z)
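This transitivity rule lends itself to a simple fixpoint computation. A minimal sketch in Python (the class hierarchy and type facts below are illustrative, not taken from an actual knowledge base):

```python
# Infer type(x, z) from type(x, y) and subclassOf(y, z) by fixpoint iteration.
subclass_of = {
    "singer": "person",
    "scientist": "person",
    "person": "entity",
}

type_facts = {("Elvis", "singer")}

# Repeatedly apply type(x,y) /\ subclassOf(y,z) => type(x,z) until no change.
changed = True
while changed:
    changed = False
    for x, y in list(type_facts):
        z = subclass_of.get(y)
        if z is not None and (x, z) not in type_facts:
            type_facts.add((x, z))
            changed = True

# Elvis is now also inferred to be a person and an entity.
```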
Relations
[Diagram: the bornIn relation connects Elvis (type singer, subclassOf person, subclassOf entity) to Tupelo (type city, subclassOf location); person is the domain of bornIn, city its range.]
Domain and range constraints: domain(r,c) /\ r(x,y) => type(x,c) range(r,c) /\ r(x,y) => type(y,c)
Looks like higher order, but is not. Consider introducing a predicate fact(r,x,y)
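With facts stored as fact(r, x, y) triples, the two constraints can be checked mechanically. A minimal sketch (the relations and type assertions below are illustrative):

```python
# Check domain/range constraints over facts stored as fact(r, x, y) triples.
domain = {"bornIn": "person"}
range_ = {"bornIn": "city"}

types = {("Elvis", "person"), ("Tupelo", "city")}
facts = [("bornIn", "Elvis", "Tupelo"),
         ("bornIn", "Tupelo", "Elvis")]  # the second fact violates both rules

def violates(fact):
    r, x, y = fact
    if r in domain and (x, domain[r]) not in types:
        return True  # domain(r,c) /\ r(x,y) => type(x,c) is violated
    if r in range_ and (y, range_[r]) not in types:
        return True  # range(r,c) /\ r(x,y) => type(y,c) is violated
    return False

consistent = [f for f in facts if not violates(f)]
# consistent: [("bornIn", "Elvis", "Tupelo")]
```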
Event Entities
[Diagram: an event entity connects Elvis via winner, Grammy Award via prize, and 1967 via year.]
An event entity is an artificial entity introduced to represent an n-ary relationship.

Winner          Prize          Year
Elvis Presley   Grammy Award   1967
...             ...            ...

Event entities allow representing arbitrary relational data as binary graphs (e.g., one event entity per table row, such as #42, #43, ...).
Reification
[Diagram: fact identifiers #42 and #43 stand for the facts bornIn(Elvis, Tupelo) and won(Elvis, Grammy Award); meta-facts attach a year (1967) and a source (Wikipedia) to these identifiers.]
Reification is the method of creating an entity that represents a fact.
There are different ways to reify a fact; this is the one used in this talk.
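A minimal sketch of this style of reification, using fact identifiers (the helper names are ours; the id #42 mirrors the slide's example):

```python
# Reify facts by assigning each base triple an identifier; meta-facts
# then use that identifier as their subject.
facts = {}     # fact id -> (subject, predicate, object)
meta = []      # (fact id, predicate, object)
counter = [42]

def add_fact(s, p, o):
    fid = "#{}".format(counter[0])
    counter[0] += 1
    facts[fid] = (s, p, o)
    return fid

won = add_fact("Elvis", "won", "Grammy Award")
meta.append((won, "year", 1967))
meta.append((won, "source", "Wikipedia"))
```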
RDF
[Diagram: the bornIn fact and the class hierarchy from the earlier example, now modeled as RDF resources.]
The Resource Description Framework (RDF) is a W3C standard that provides a standard vocabulary to model ontologies.
An RDF ontology can be seen as a directed labeled multi-graph where
• the nodes are entities
• the edges are labeled with relations
Edges (facts) are commonly written
• as triples <Elvis, bornIn, Tupelo>
• as literals bornIn(Elvis, Tupelo)
[W3C recommendation: RDF, 2004]
Outline
• Part I – What and Why ✔ – Available Knowledge Bases
• Part II – Extracting Knowledge
• Part III – Ranking and Searching
• Part IV – Conclusion and Outlook
Cyc
Douglas Lenat
What if we could make all common sense knowledge computer-processable?
Cyc project
• started in 1984
• driven by a staff of 20
• goal: formalize knowledge manually
[Lenat, Comm. ACM, 1995]
Cyc: Language
CycL is the formal language that Cyc uses to represent knowledge.
(Semantics based on first-order logic, syntax based on LISP)

(#$forall ?A (#$implies (#$isa ?A #$Animal) (#$thereExists ?M (#$mother ?A ?M))))
(#$arity #$GovernmentFn 1) (#$arg1Isa #$GovernmentFn #$GeopoliticalEntity) (#$resultIsa #$GovernmentFn #$RegionalGovernment)
(#$governs (#$GovernmentFn #$Canada) #$Canada)
http://cyc.com/cycdoc/ref/cycl-syntax.html + a logical reasoner
Cyc: Knowledge
Cyc project
#$Love
Strong affection for another agent arising out of kinship or personal ties. Love may be felt towards things, too: warm attachment, enthusiasm, or devotion. #$Love is a collection, as further explained under #$Happiness. Specialized forms of #$Love are #$Love-Romantic, platonic love, maternal love, infatuation, agape, etc.
guid: bd589433-9c29-11b1-9dad-c379636f7270
direct instance of: #$FeelingType
direct specialization of: #$Affection
direct generalization of: #$Love-Romantic
http://cyc.com/cycdoc/vocab/emotion-vocab.html#Love
Facts and axioms about: Transportation, Ecology, everyday living, chemistry, healthcare, animals, law, computer science...
“If a computer network implements IEEE 802.11 Wireless LAN Protocol and some computer is a node in that computer network, then that computer is vulnerable to decryption. “ http://cyc.com/cyc/technology/whatiscyc_dir/maptest
Cyc: Summary
            Cyc                              SUMO
License     proprietary, free for research   GNU GPL
Entities    500k                             20k
Assertions  5m                               70k
Relations   15k
Tools       Reasoner, NL understanding tool  Reasoner
URL         http://cyc.com                   http://ontologyportal.org
References  [Lenat, Comm. ACM 1995]          [Niles, FOIS 2001]
SUMO (the Suggested Upper Merged Ontology) is a research project in a similar spirit, driven by Adam Pease of Articulate Software
http://cyc.com/cyc/technology/whatiscyc_dir/whatsincyc http://ontologyportal.org
WordNet
George Miller
What if we could make the English language computer-processable?
• started in 1985
• Cognitive Science Laboratory, Princeton University
• written by lexicographers
• goal: support automatic text analysis and AI applications
[Miller, CACM 1995]
WordNet: Lexical Database
[Diagram: the word "camera" has two senses, photographic camera and television camera; synonymous words share a sense, polysemous words have several senses.]
WordNet
WordNet: Semantic Relations
[Diagram: examples of semantic relations. Hypernymy: a toaster is a kitchen appliance; meronymy: an optical lens is part of a camera; is-value-of: slow and fast are values of speed.]
WordNet: Semantic Relations

Relation                   Meaning              Examples
Synonymy (N, V, Adj, Adv)  same sense           (camera, photographic camera), (mountain climbing, mountaineering), (fast, speedy)
Antonymy (Adj, Adv)        opposite             (fast, slow), (buy, sell)
Hypernymy (N)              is-a                 (camera, photographic equipment), (mountain climbing, climb)
Meronymy (N)               part                 (camera, optical lens), (camera, view finder)
Troponymy (V)              manner               (buy, subscribe), (sell, retail)
Entailment (V)             X must mean doing Y  (buy, pay), (sell, give)
WordNet: Hierarchy
Hypernymy Is-A relations
WordNet: Size
Type                        Number
#words                      155k
#senses                     117k
#word-sense pairs           207k
%words that are polysemous  17%
License: proprietary, free for research
http://wordnet.princeton.edu/wordnet/man2.1/wnstats.7WN.html
Downloadable at http://wordnet.princeton.edu
Wikipedia
Jimmy Wales
If a small number of people can create a knowledge base, how about a LARGE number of people?
• started in 2001
• driven by Wikimedia Foundation, and a large number of volunteers
• goal: build world's largest encyclopedia
Wikipedia: Entities and Attributes
Entities
Attributes
Wikipedia: Synonymy and Polysemy
Redirection (synonyms)
Disambiguation (polysemy)
Wikipedia: Classes/Categories
Class hierarchy different from WordNet
Wikipedia: Others
Navigation/Topic box
Inter-lingual Links
Wikipedia: Numbers
Growth 2001 - 2008
English:
• 1B words, 2.8M articles, 152K contributors
All (250 languages):
• 1.74B words, 9.25M articles, 283K contributors
vs. Britannica:
• 25x as many words, ½ the average article length
License: Creative Commons Attribution-ShareAlike (CC-BY-SA)
Downloadable at http://download.wikimedia.org/
Automatically Constructed Knowledge Bases
• Manual approaches (Cyc, WordNet, Wikipedia)
  – produce high quality knowledge bases
  – labor-intensive and limited in scope
Can we construct the knowledge bases automatically?
YAGO… , etc.
YAGO
Can we exploit Wikipedia and WordNet to build an ontology?
• started as PhD thesis in 2007
• now major project at the Max Planck Institute for Informatics in Germany
• goal: extract ontology from Wikipedia with high accuracy and consistency
YAGO
[Suchanek et al., WWW 2007]
YAGO: Construction
• Exploit conceptual categories: the Wikipedia category "Rock singer" yields type(Elvis Presley, Rock Singer).
• Exploit infoboxes: the infobox entry "Born: 1935" yields born(Elvis Presley, 1935).
• Add WordNet: the class hierarchy comes from WordNet, e.g. subclassOf(Singer, Person).
[Slide shows the Wikipedia article: "Blah blah blub fasel (do not read this, better listen to the talk) blah blah Elvis blub (you are still reading this) blah Elvis blah blub later became astronaut blah", with Categories: Rock singer and an infobox Born: 1935.]
YAGO: Consistency Checks
• Check uniqueness of entities and functional arguments.
• Check domains and ranges of relations.
• Check type coherence.
[Diagram: candidate facts such as born(Elvis, 1935) and type edges into Rock Singer, Guitarist, Guitar, and Physics are validated against the class hierarchy (singer subclassOf person); incoherent candidates are rejected.]
YAGO: Relations
About People         About Locations           About Other Things
actedIn              establishedOnDate         happenedIn
bornIn / on date     established from / until
diedIn / on date     hasCapital                isCalled
created / on date    hasPopulation             foundIn
discovered           locatedIn                 produced
hasChild, hasSpouse  hasCurrency               hasProductionLanguage
family name          hasInflation              hasISBN
graduatedFrom        hasPolitician             hasPredecessor
...                  ...                       ...
ca. 100 relations with range and domain
YAGO: Numbers

                 YAGO    YAGO+GeoNames
Entities         2.6m    10m
  organizations  0.5m    0.5m
  people         0.8m    0.8m
  classes        0.5m    0.5m
Facts            30m     240m
Relations        86      92
Precision        95%     95%
License: Creative Commons Attribution-NonCommercial (CC-NC-BY)
License Creative Commons Attribution-NonCommercial (CC-NC-BY)
Downloadable at http://mpii.de/yago incl. converters for RDF, XML, databases
DBpedia
Can we harvest facts more exhaustively with community effort?
Can we harvest facts more exhaustively with community effort?
• community effort started in 2007• driven by Free U. Berlin, U. Leipzig, OpenLink• goal: "extract structured information from Wikipedia and to make this information available on the Web"
[Bizer et al., Journal of Web Semantics 2009]
DBpedia: Ontology
In YAGO, the taxonomy is based on WordNet classes.
DBpedia:
• places entities extracted from Wikipedia into its own ontology
• hand-crafted: 259 classes, 6 levels, 1200 properties
• emphasizes recall
• only half of extracted entities are currently placed in its own ontology
• alternative classifications: Wikipedia, YAGO, UMBEL (OpenCyc)
DBpedia: Mapping Rules
DBpedia mapping rules:
• map Wikipedia infoboxes and tables to its ontology
• map values to target datatypes (normalize units, ignore deviant values)
Community effort:
• hand-craft mapping rules
• expand ontology
< http://en.wikipedia.org/wiki/Elvis_Presley >
{{Infobox musical artist
|Name = Elvis Presley
|Background = solo_singer
|Birth_name = Elvis Aaron Presley}}
< http://dbpedia.org/page/Elvis_Presley >
foaf:name "Elvis Presley";
background "solo_singer";
foaf:givenName "Elvis Aaron Presley";
Note that the values do not change.
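A toy version of such a mapping rule; the regular expression and the attribute-to-property dictionary below are simplified stand-ins for DBpedia's community-crafted mappings:

```python
import re

# Sketch of a DBpedia-style mapping rule: parse infobox attributes and
# map them to ontology properties (simplified, illustrative mapping).
infobox = """{{Infobox musical artist
|Name = Elvis Presley
|Background = solo_singer
|Birth_name = Elvis Aaron Presley
}}"""

mapping = {
    "Name": "foaf:name",
    "Background": "background",
    "Birth_name": "foaf:givenName",
}

triples = []
for attr, value in re.findall(r"\|\s*(\w+)\s*=\s*([^\n|]+)", infobox):
    prop = mapping.get(attr)
    if prop:
        triples.append(("dbpedia:Elvis_Presley", prop, value.strip()))
```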
DBPedia: Numbers
Type           Number
Facts          English: 257m (YAGO: 240m); all languages: 1b
Entities       3.4m overall (YAGO: 10m); 1.5m in DBpedia ontology
People         312k
Locations      413k
Organizations  140k
License        Creative Commons Attribution-ShareAlike 3.0 (CC-BY-SA 3.0)
plus
• 5.5m links to external Web pages
• 1.5m links to images
• 5m links to other RDF data sets
Downloadable at http://dbpedia.org
Freebase
What if we could harvest both automatic extraction and user contribution?
• started in 2000
• driven by Metaweb, part of Google since Jul 2010
• goals:
  • "an open shared database of the world's knowledge"
  • "a massive, collaboratively-edited database of cross-linked data"
Freebase
Like DBpedia and YAGO, Freebase imports data from Wikipedia.
Differences:
• also imports from other sources (e.g., ChefMoz, NNDB, and MusicBrainz)
• includes individually contributed data
• users can collaboratively edit its data (without having to edit Wikipedia)
Freebase: User Contribution

Edit Entities:
• create new entities
• assign a new type/class to an entity
• add/change attributes
• connect to other entities
• upload/edit images

Review:
• flag vandalism
• flag entities to be merged/deleted
• vote on flagged content (3 unanimous votes, or an expert has to be tie-breaker)

Edit Schema:
• define new class, specifying the attributes of the class
• class definition can only be changed by creator/admin
• class not part of commons until peer-reviewed & promoted by staff/admin

Data Game:
• finding aliases in Wikipedia redirects
• extracting dates of events from Wikipedia articles
• using the Yahoo image search API to find candidates
Freebase: Community
Experts:
• tie breaker in reviews
• split entities
• "rewind" changes
New experts inducted by current experts.

Admins:
• create new classes and attributes
• respond to community suggestions
Promoted by staff or other admins.

Members:
• contribute (edit, review, vote)
Anyone can be a member.
Freebase: Numbers
Type        Number
Facts       41m
Entities    13m (YAGO: 10m)
People      2m
Locations   946k
Businesses  567k
Films       397k
License     Creative Commons Attribution (CC-BY)
Downloadable at http://download.freebase.com
Question Answering Systems
Objective is to answer user queries from an underlying knowledge base.
• data from Wikipedia and user edits
• natural language translation of queries
• 9m entities, 300m facts
• computes answers from an internal knowledge base of curated, structured data
• stores not just facts, but also algorithms and models
Application: Semantic Similarity
• Task: determine similarity between two words
  – topological distance of two words in the graph
  – taxonomic distance: hierarchical is-a relations
• Example application: correct real-word spelling errors
Tofu is made from soy jeans.[Hirst et al., Natural Language Engineering 2001]
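A toy illustration of taxonomic distance for this task: under the invented is-a graph below (not WordNet), "beans" is taxonomically much closer to "soy" than "jeans" is, which is the kind of signal used to flag the real-word error:

```python
from collections import deque

# Toy taxonomic distance over an invented is-a graph (not WordNet).
is_a = {
    "soy": ["legume"], "beans": ["legume"], "legume": ["food"],
    "jeans": ["trousers"], "trousers": ["clothing"],
    "food": ["entity"], "clothing": ["entity"],
}

# Build an undirected adjacency structure over the is-a edges.
graph = {}
for node, parents in is_a.items():
    for p in parents:
        graph.setdefault(node, set()).add(p)
        graph.setdefault(p, set()).add(node)

def distance(a, b):
    # Plain breadth-first search over the undirected graph.
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

# distance("soy", "beans") == 2, distance("soy", "jeans") == 6
```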
Application: Sentiment Orientation
• Task: determine an adjective's polarity (positive or negative)
  – same polarity connected by synonymic relations
  – opposite polarity by antonymic relations
• Example application: overall sentiment of customer reviews
[Diagram: GOOD linked by synonymy to right, suitable, proper, appropriate; BAD linked to spoiled, forged, risky, defective.]
[Hu et al., KDD 2004]
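The propagation idea can be sketched in a few lines; the synonym/antonym links below are a small illustrative fragment, not WordNet:

```python
# Propagate polarity through synonym links (same polarity) and antonym
# links (opposite polarity), starting from a seed word.
synonyms = [("good", "right"), ("right", "suitable"), ("suitable", "proper"),
            ("bad", "spoiled"), ("spoiled", "defective")]
antonyms = [("good", "bad")]

polarity = {"good": +1}  # seed
links = ([(a, b, +1) for a, b in synonyms] +
         [(a, b, -1) for a, b in antonyms])

changed = True
while changed:
    changed = False
    for a, b, sign in links:
        for x, y in ((a, b), (b, a)):  # links are symmetric
            if x in polarity and y not in polarity:
                polarity[y] = polarity[x] * sign
                changed = True

# polarity["proper"] == +1, polarity["defective"] == -1
```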
Application: Annotation of Web Data
[Limaye et al., VLDB 2010]
• Task: given a data source in the form of a Web table
  – annotate a column with an entity type
  – annotate a pair of columns with a relationship type
  – annotate a table cell with an entity ID
Application: Map Annotation
Idea: • Determine geographical entities in the vicinity (by GPS coordinates) • Show information about these entities (from DBpedia)
Possible Applications:• Map search on the Internet• Enhanced Reality applications
[Becker et al., Linking Open Data Workshop 2008]
Application: Faceted Search
DBpedia Browser
search is “full text search within results”
Constraints are listed for possible deletion
Suggestions based on current consideration set
Attributes and values based on frequency (?)
Summary
• Part I covers what knowledge bases are– Knowledge representation model (RDF)– Manual knowledge bases:
• WordNet: expert-driven, English words• Wikipedia: community-driven, entities/attributes
– Automatically extracted knowledge bases:• YAGO: Wikipedia + WordNet, automated, high precision• DBpedia: Wikipedia + community-crafted mapping rules, high recall• Freebase: Wikipedia + other databases + user edits
• Part II will cover how to extract information included in the knowledge bases
References for Part I
• C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, S. Hellmann: DBpedia – A Crystallization Point for the Web of Data. Journal of Web Semantics, Issue 7, Pages 154–165, 2009.
• C. Becker, C. Bizer: DBpedia Mobile: A Location-Enabled Linked Data Browser. Linking Open Data Workshop, 2008.
• G. Hirst and A. Budanitsky: Correcting real-word spelling errors by restoring lexical cohesion. Natural Language Engineering 11(1): 87–111, 2001.
• M. Hu and B. Liu: Mining and Summarizing Customer Reviews. KDD, 2004.
• J. Kamps, M. Marx, R. J. Mokken, and M. de Rijke: Using WordNet to Measure Semantic Orientations of Adjectives. LREC, 2004.
• D. Lenat: CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 1995.
• G. Limaye, S. Sarawagi, and S. Chakrabarti: Annotating and Searching Web Tables Using Entities, Types and Relationships. VLDB, 2010.
• G. A. Miller: WordNet: A Lexical Database for English. Communications of the ACM 38(11): 39–41, 1995.
• F. M. Suchanek, G. Kasneci, and G. Weikum: Yago – A Core of Semantic Knowledge. WWW, 2007.
• I. Niles and A. Pease: Towards a Standard Upper Ontology. FOIS-2001, Ogunquit, Maine, October 17–19, 2001.
• World Wide Web Consortium: RDF Primer. W3C Recommendation, 2004. http://www.w3.org/TR/rdf-primer/
Outline
• Part I – What and Why ✔ – Available Knowledge Bases ✔
• Part II – Extracting Knowledge
• Part III – Ranking and Searching
• Part IV – Other topics
Entities & Classes
...
Which entity types (classes, unary predicates) are there?
Which subsumptions should hold(subclass/superclass, hyponym/hypernym, inclusion dependencies)?
Which individual entities belong to which classes?
Which names denote which entities?
scientists, doctoral students, computer scientists, … female humans, male humans, married humans, …
subclassOf (computer scientists, scientists), subclassOf (scientists, humans), …
instanceOf (Surajit Chaudhuri, computer scientists), instanceOf (Barbara Liskov, computer scientists), instanceOf (Barbara Liskov, female humans), …
means ("Lady Di", Diana Spencer), means ("Diana Frances Mountbatten-Windsor", Diana Spencer), …
means ("Madonna", Madonna Louise Ciccone), means ("Madonna", Madonna (painting by Edvard Munch)), …
Binary Relations
Which instances (pairs of individual entities) are there for given binary relations with specific type signatures?
hasAdvisor (Jim Gray, Mike Harrison)
hasAdvisor (Hector Garcia-Molina, Gio Wiederhold)
hasAdvisor (Susan Davidson, Hector Garcia-Molina)
graduatedAt (Jim Gray, Berkeley)
graduatedAt (Hector Garcia-Molina, Stanford)
hasWonPrize (Jim Gray, Turing Award)
bornOn (John Lennon, 9-Oct-1940)
diedOn (John Lennon, 8-Dec-1980)
marriedTo (John Lennon, Yoko Ono)
Which additional & interesting relation types are there between given classes of entities?
competedWith(x,y), nominatedForPrize(x,y), …
divorcedFrom(x,y), affairWith(x,y), …
assassinated(x,y), rescued(x,y), admired(x,y), …
Higher-arity Relations & Reasoning
• Time, location & provenance annotations
• Knowledge representation – how do we model & store these?
• Consistency reasoning – how do we filter out inconsistent facts that the extractor produced?
Facts (RDF triples):
1: (JimGray, hasAdvisor, MikeHarrison)
2: (SurajitChaudhuri, hasAdvisor, JeffUllman)
3: (Madonna, marriedTo, GuyRitchie)
4: (NicolasSarkozy, marriedTo, CarlaBruni)
5: (ManchesterU, wonCup, ChampionsLeague)

Facts about facts:
6: (1, inYear, 1968)
7: (2, inYear, 2006)
8: (3, validFrom, 22-Dec-2000)
9: (3, validUntil, Nov-2008)
10: (4, validFrom, 2-Feb-2008)
11: (2, source, SigmodRecord)
12: (5, inYear, 1999)
13: (5, location, CampNou)
14: (5, source, Wikipedia)
Outline
• Part I – What and Why ✔ – Available Knowledge Bases ✔
• Part II – Extracting Knowledge
• Part III – Ranking and Searching
• Part IV – Conclusion and Outlook
Outline
• Part II – Extracting Knowledge
  • Pattern-based Extraction
  • Consistency Reasoning
  • Higher-arity Relations: Space & Time
Framework: Information Extraction (IE)
many sources
one source
Surajit obtained his PhD in CS from Stanford University under the supervision of Prof. Jeff Ullman. He later joined HP and worked closely with Umesh Dayal …
source-centric IE
instanceOf (Surajit, scientist)
inField (Surajit, computer science)
hasAdvisor (Surajit, Jeff Ullman)
almaMater (Surajit, Stanford U)
workedFor (Surajit, HP)
friendOf (Surajit, Umesh Dayal)
…
yield-centric harvesting
hasAdvisor: Student → Advisor
almaMater: Student → University

source-centric IE: 1) recall! 2) precision
yield-centric harvesting: 1) precision (near-human quality!) 2) recall

Student            Advisor
Surajit Chaudhuri  Jeffrey Ullman
Alon Halevy        Jeffrey Ullman
Jim Gray           Mike Harrison
…                  …

Student            University
Surajit Chaudhuri  Stanford U
Alon Halevy        Stanford U
Jim Gray           UC Berkeley
…                  …
Framework: Knowledge Representation
...
• RDF (Resource Description Framework, W3C):
  - subject-property-object (SPO) triples / binary relations
  - highly structured, but no (prescriptive) schema
  - first-order logical reasoning over binary predicates
• Frames, F-Logic, description logics: OWL/DL/lite
• Also: higher-order logics, epistemic logics
Facts (RDF triples):
1: (JimGray, hasAdvisor, MikeHarrison)
2: (SurajitChaudhuri, hasAdvisor, JeffUllman)
3: (Madonna, marriedTo, GuyRitchie)
4: (NicolasSarkozy, marriedTo, CarlaBruni)

Reification: facts about facts:
5: (1, inYear, 1968)
6: (2, inYear, 2006)
7: (3, validFrom, 22-Dec-2000)
8: (3, validUntil, Nov-2008)
9: (4, validFrom, 2-Feb-2008)
10: (2, source, SigmodRecord)
Temporal, spatial, & provenance annotations can refer to reified facts via fact identifiers (approx. equiv. to higher-arity RDF: Sub Prop Obj Time Location Source)
This tutorial!
Picking Low-Hanging Fruit (First)
Deterministic Pattern Matching
...
[Kushmerick 97; Califf & Mooney 99; Gottlob 01, …]
Wrapper Induction
[Gottlob et al: VLDB’01, PODS’04,…]
...
• Wrapper induction:
  • Hierarchical document structure, XHTML, XML
  • Pattern learning for restricted regular languages (ELog, combining concepts of XPath & FOL)
  • Visual interfaces
  • See e.g. http://www.lixto.com/, http://w4f.sourceforge.net/
...
Tapping on Web Tables [Cafarella et al: PVLDB'08; Sarawagi et al: PVLDB'09]
Problem: discover interesting relations
  wonAward: Person × Award
  nominatedForAward: Person × Award
  …
from many table headers and co-occurring cells
Relational Fact Extraction From Plain Text
• Hearst patterns [Hearst: COLING'92]
  – POS-enhanced regular expression matching in natural-language text
NP0 {,} such as {NP1, NP2, … (and|or)} {,} NPn
NP0 {,} {NP1, NP2, … NPn-1} {,} or other NPn
…
“The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string.”
isA(“Bambara ndang”, “bow lute”)
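A crude approximation of the first pattern in plain Python, without POS tagging (noun phrases are approximated as word sequences, so this is far less robust than the POS-enhanced original):

```python
import re

# Simplistic instance of the "NP0 such as NP1 ..." Hearst pattern.
text = ("The bow lute, such as the Bambara ndang, is plucked and has an "
        "individual curved neck for each string.")

pattern = re.compile(
    r"(?:the\s+)?([\w\s]+?),?\s+such as\s+(?:the\s+)?([\w\s]+?)[,.]",
    re.IGNORECASE)

facts = [("isA", instance.strip(), concept.strip())
         for concept, instance in pattern.findall(text)]
# facts: [("isA", "Bambara ndang", "bow lute")]
```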
• Noun classification from predicate-argument structures [Hindle: ACL'90]
  – Clustering of nouns by similar verbal phrases
  – Similarity based on co-occurrence frequencies (mutual information)
DIPRE
• DIPRE: "Dual Iterative Pattern Relation Extraction"
  – (Almost) unsupervised, iterative gathering of facts and patterns
  – Positive & negative examples as seeds for target relation, e.g. +(Hillary, Bill) +(Carla, Nicolas) –(Larry, Google)
  – Specificity threshold for new patterns based on occurrence frequency
[Brin: WebDB‘98]
(Hillary, Bill)
(Carla, Nicolas)
X and her husband Y
X and Y on their honeymoon
X and Y and their children
X has been dating with Y
X loves Y
(Angelina, Brad)
(Hillary, Bill)
(Victoria, David)
(Carla, Nicolas)
(Larry, Google)…
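One DIPRE-style iteration can be sketched as follows; the toy corpus and the simplistic "middle context" patterns are ours, and a real system would add specificity thresholds:

```python
# One bootstrapping iteration over a toy corpus: seed pairs yield
# patterns (here: the text between the two names); the patterns then
# harvest new pairs; negative seeds prune wrong extractions.
corpus = [
    "Hillary and her husband Bill appeared together.",
    "Carla and her husband Nicolas visited Berlin.",
    "Angelina and her husband Brad arrived.",
    "Larry and his company Google grew fast.",
]
seeds = {("Hillary", "Bill"), ("Carla", "Nicolas")}
negatives = {("Larry", "Google")}

def middle(sentence, x, y):
    # Context between the first occurrences of x and y, if x precedes y.
    i, j = sentence.find(x), sentence.find(y)
    return sentence[i + len(x):j] if 0 <= i < j else None

# Step 1: learn patterns from sentences that contain a seed pair.
patterns = set()
for x, y in seeds:
    for s in corpus:
        ctx = middle(s, x, y)
        if ctx is not None:
            patterns.add(ctx)

# Step 2: apply the patterns to harvest new candidate pairs.
harvested = set()
for s in corpus:
    for p in patterns:
        if p in s:
            left, right = s.split(p, 1)
            harvested.add((left.strip(), right.split()[0]))

harvested -= negatives  # prune with negative seeds
```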
• Snowball/QXtract [Agichtein, Gravano: DL'00, SIGMOD'01+'03]
  – Refined patterns and statistical measures
  – >80% recall at >85% precision over a large news corpus
  – QXtract demo additionally allowed user feedback in the iteration loop
DIPRE/Snowball/QXtract[Brin: WebDB’98; Agichtein,Gravano: SIGMOD’01+‘03]
Help from NLP: Dependency Parsing!
Software tools:
CMU Link Parser: http://www.link.cs.cmu.edu/link/
Stanford Lex Parser: http://nlp.stanford.edu/software/lex-parser.shtml
OpenNLP Tools: http://opennlp.sourceforge.net/
ANNIE Open-Source Information Extraction: http://www.aktors.org/technologies/annie/
LingPipe: http://alias-i.com/lingpipe/ (commercial license)
Carla has been seen dating with Ben.
• Analyze lexico-syntactic structure of sentences
  – Part-of-Speech (POS) tagging & dependency parsing
  – Prefer shorter dependency paths for fact candidates
NNP VBZ VBN VBN VBG IN NNP dating(Carla, Ben)
Open-Domain Gathering of Facts (Open IE)
...
[Etzioni,Cafarella et al:WWW’04, IJCAI‘07; Weld,Hoffman,Wu: SIGMOD-Rec‘08]
Analyze verbal phrases between entities for new relation types
Carla has been seen dating with Ben.
Rumors about Carla indicate there is something between her and Ben.
• unsupervised bootstrapping with short dependency paths
• self-supervised classifier for (noun, verb-phrase, noun) triples
• build statistics & prune sparse candidates
• group/cluster candidates for new relation types and their facts
… seen dating with …
… partying with …
{datesWith, partiesWith}, {affairWith, flirtsWith}, {romanticRelation}, …
(Carla, Ben), (Carla, Sofie), …
(Carla, Ben), (Paris, Heidi), …
But:
• results are often noisy
• clusters are not canonicalized relations
• far from near-human quality
Learning More Mappings
Kylin Ontology Generator (KOG): learn classifier for subclassOf across Wikipedia & WordNet using
• YAGO as training data• advanced ML methods (MLN‘s, SVM‘s)• rich features from various sources
> 3 million entities, > 1 million with infoboxes, > 500,000 categories
[Wu & Weld: CIKM’07, WWW‘08 ]
• Category/class name similarity measures
• Category instances and their infobox templates: template names, attribute names (e.g. knownFor)
• Wikipedia edit history: refinement of categories
• Hearst patterns: C such as X, X and Y and other C's, …
• Other search-engine statistics: co-occurrence frequencies
Entity Disambiguation
"Penn", "U Penn" → University of Pennsylvania
"Penn State", "PSU" → Pennsylvania State University
"PSU" → Passenger Service Unit
"Penn" → Sean Penn
"PSU" → Pennsylvania (US State)
Names ↔ Entities: a many-to-many mapping
• ill-defined with zero context
• known as record linkage for names in record fields
• Wikipedia offers rich candidate mappings: disambiguation pages, re-directs, inter-wiki links, anchor texts of href links
Individual Entity Disambiguation
"Penn" → University of Pennsylvania? Sean Penn? Penn State University?
• name similarity: edit distances, n-gram overlap, …
• context similarity: record level, words/phrases level
• context similarity: text around names, classes & facts around entities
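A sketch combining name similarity (character n-gram overlap) with context similarity (word overlap); the 0.3/0.7 weights and the candidate contexts are illustrative:

```python
# Score candidate entities for an ambiguous name by combining character
# n-gram similarity of the names with word overlap of the contexts.
def ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def score(name, mention_ctx, entity, entity_ctx):
    name_sim = jaccard(ngrams(name), ngrams(entity))
    ctx_sim = jaccard(set(mention_ctx.lower().split()),
                      set(entity_ctx.lower().split()))
    return 0.3 * name_sim + 0.7 * ctx_sim  # illustrative weights

mention, context = "Penn", "She studied at Penn and published on databases"
candidates = {
    "University of Pennsylvania": "university studied research published databases",
    "Sean Penn": "actor film Hollywood movie",
}
best = max(candidates, key=lambda e: score(mention, context, e, candidates[e]))
```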
Into the Wild
[Slide shows heterogeneous sources in the wild: XML fragments, treebank text, "Univ. Park", …]
Typical Approaches:
Challenge: efficiency & scalability
Collective Entity Disambiguation
• Consider a set of names {n1, n2, …} in the same context and sets of candidate entities E1 = {e11, e12, …}, E2 = {e21, e22, …}, …
• Define a joint objective function (e.g. likelihood for a probabilistic model) that rewards coherence of the mappings map(n1) = x1 ∈ E1, map(n2) = x2 ∈ E2, …
[Doan et al: AAAI‘05; Singla,Domingos: ICDM’07; Chakrabarti et al: KDD‘09, …]
• Solve optimization problem
Stuart Russell
Michael Jordan
Stuart Russell(computer scientist)
Stuart Russell (DJ)
Michael Jordan(computer scientist)
Michael Jordan (NBA)
Declarative Extraction Frameworks
• IBM's SystemT [Krishnamurthy et al: SIGMOD Rec.'08, ICDE'08]
  – Fully declarative extraction framework
  – SQL-style operators, cost models, full optimizer support
• DBLife/Cimple [DeRose, Doan et al: CIDR'07, VLDB'07]
  – Online community portal centered around the DB domain (regular crawls of DBLP, conferences, homepages, etc.)
• More commercial endeavors:
  – FreeBase.com, WolframAlpha.com, Sig.ma, TrueKnowledge.com, Google.com/squared
[Slide shows DBLife screenshots aggregating data from DBWorld, DBLP, Google Scholar, Google Images, and researcher homepages.]
Probabilistic Extraction Models
• Hidden Markov Models (HMMs) [Rabiner: Proc. IEEE'89; Sutton, McCallum: MIT Press'06]
  – Markov chain (directed graphical model) with "hidden" states Y, observations X, and transition probabilities
  – Factorizes the joint distribution P(Y,X)
  – Assumes independence among observations
• Conditional Random Fields (CRFs) [Lafferty, McCallum, Pereira: ML'01; Sarawagi, Cohen: NIPS'04]
  – Markov random field (undirected graphical model)
  – Models the conditional distribution P(Y|X) (less strict independence assumptions)
• Joint segmentation and disambiguation of input strings onto entities and classes: NER, POS tagging, etc.
• Trained, e.g., on bibliographic entries, no manual labeling required
"I went skiing with Fernando Pereira in British Columbia."
Pattern-Based Harvesting
Seed facts: (Hillary, Bill), (Carla, Nicolas)

Patterns:
X and her husband Y
X and Y on their honeymoon
X and Y and their children
X has been dating with Y
X loves Y
• good for recall
• noisy, drifting
• not robust enough for high precision

Facts & Fact Candidates:
(Angelina, Brad)
(Hillary, Bill)
(Victoria, David)
(Carla, Nicolas)
(Angelina, Brad)
(Yoko, John)
(Carla, Benjamin)
(Larry, Google)
(Kate, Pete)
(Victoria, David)
[Hearst 92; Brin 98; Agichtein 00; Etzioni 04; …]
Outline
• Part II – Extracting Knowledge
  • Pattern-based Extraction ✔
  • Consistency Reasoning
  • Higher-arity Relations: Space & Time
French Marriage Problem
...
isMarriedTo: person × person
isMarriedTo: frenchPolitician × person
French Marriage Problem
Facts in KB:
married (Hillary, Bill)
married (Carla, Nicolas)
married (Angelina, Brad)

New facts or fact candidates:
married (Cecilia, Nicolas)
married (Carla, Benjamin)
married (Carla, Mick)
married (Michelle, Barack)
married (Yoko, John)
married (Kate, Leonardo)
married (Carla, Sofie)
married (Larry, Google)
1) for recall: pattern-based harvesting2) for precision: consistency reasoning1) for recall: pattern-based harvesting2) for precision: consistency reasoning
Reasoning about Fact Candidates
Use consistency constraints to prune false candidates!
Ground atoms:
spouse(Hillary,Bill), spouse(Carla,Nicolas), spouse(Cecilia,Nicolas), spouse(Carla,Ben), spouse(Carla,Mick), spouse(Carla,Sofie)
f(Hillary), f(Carla), f(Cecilia), f(Sofie)
m(Bill), m(Nicolas), m(Ben), m(Mick)
First-order-logic rules (restricted):
spouse(x,y) ∧ diff(y,z) ⇒ ¬spouse(x,z)
spouse(x,y) ∧ diff(w,y) ⇒ ¬spouse(w,y)
spouse(x,y) ⇒ f(x)
spouse(x,y) ⇒ m(y)
spouse(x,y) ⇒ (f(x) ∧ m(y)) ∨ (m(x) ∧ f(y))
Rules reveal inconsistencies
→ find consistent subset(s) of atoms (“possible world(s)“, “the truth“)
Rules can be weighted (e.g. by fraction of ground atoms that satisfy a rule)
→ uncertain / probabilistic data
→ compute prob. distr. over (a subset of) ground atoms being “true“
Markov Logic Networks (MLN‘s) [Richardson/Domingos: ML 2006]
Map logical constraints & fact candidates into probabilistic graphical model: Markov Random Field (MRF)
FOL rules:
s(x,y) ⇒ m(y)
s(x,y) ⇒ f(x)
s(x,y) ∧ diff(y,z) ⇒ ¬s(x,z)
s(x,y) ∧ diff(w,y) ⇒ ¬s(w,y)
f(x) ⇒ ¬m(x)
m(x) ⇒ ¬f(x)
Base facts w/ entities:
s(Carla,Nicolas), s(Cecilia,Nicolas), s(Carla,Ben), s(Carla,Sofie), …
Grounding:
s(Ca,Nic) ⇒ ¬s(Ce,Nic)
s(Ca,Nic) ⇒ ¬s(Ca,Ben)
s(Ca,Nic) ⇒ ¬s(Ca,So)
s(Ca,Ben) ⇒ ¬s(Ca,So)
s(Ca,Nic) ⇒ m(Nic)
s(Ce,Nic) ⇒ m(Nic)
s(Ca,Ben) ⇒ m(Ben)
s(Ca,So) ⇒ m(So)
Grounding: Literal → Boolean Var; Reasoning: Literal → Binary RV
Markov Logic Networks (MLN‘s) [Richardson,Domingos: ML 2006]
Map logical constraints & fact candidates into probabilistic graphical model: Markov Random Field (MRF)
(Figure: MRF over the ground literals m(Ben), m(Nic), m(So), s(Ca,Nic), s(Ce,Nic), s(Ca,Ben), s(Ca,So))
• RVs coupled by MRF edge if they appear in same clause
• MRF assumption: P[Xi | X1..Xn] = P[Xi | MB(Xi)]
→ joint distribution has product form over all cliques
• Variety of algorithms for joint inference: Gibbs sampling, other MCMC, belief propagation, randomized MaxSat, …
Markov Logic Networks (MLN‘s) [Richardson,Domingos: ML 2006]
Map logical constraints & fact candidates into probabilistic graphical model: Markov Random Field (MRF)
(Figure: the same MRF, now with clause weights such as 0.1, 0.2, 0.5, 0.6, 0.7, 0.8)
Consistency reasoning: prune low-confidence facts!
StatSnowball [Zhu et al: WWW‘09], BioSnowball [Liu et al: KDD‘10]
EntityCube, MSR Asia: http://entitycube.research.microsoft.com/
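For a handful of ground atoms, MLN-style marginals can even be computed exactly by enumerating all possible worlds, where a world's unnormalized probability is exp of the total weight of its satisfied ground clauses. The atoms and clause weights below are an invented toy instance, and real MLN engines use MCMC or belief propagation instead of enumeration:

```python
# Exact marginal inference for a tiny Markov-Logic-style model by enumerating
# all 2^n possible worlds; atoms and clause weights are illustrative only.
from itertools import product
from math import exp

atoms = ["s_Ca_Nic", "s_Ce_Nic", "s_Ca_So"]

def clause_weight(world):
    """Sum of weights of satisfied weighted ground clauses."""
    w = 0.0
    # soft evidence: each candidate fact has a pattern-derived weight
    w += 2.0 * world["s_Ca_Nic"]   # strong evidence for spouse(Carla, Nicolas)
    w += 1.0 * world["s_Ce_Nic"]
    w += 0.5 * world["s_Ca_So"]
    # functional dependency as a heavily weighted soft constraint:
    # s(Ca,Nic) => not s(Ca,So), and s(Ca,Nic) => not s(Ce,Nic)
    w += 4.0 * (not (world["s_Ca_Nic"] and world["s_Ca_So"]))
    w += 4.0 * (not (world["s_Ca_Nic"] and world["s_Ce_Nic"]))
    return w

# unnormalized probability of each world ~ exp(sum of satisfied clause weights)
worlds = [dict(zip(atoms, vals)) for vals in product([True, False], repeat=3)]
Z = sum(exp(clause_weight(w)) for w in worlds)
marginal = {
    a: sum(exp(clause_weight(w)) for w in worlds if w[a]) / Z for a in atoms
}
print(marginal)
```

The marginals reflect both the per-fact evidence and the mutual-exclusion constraints: the strongly supported atom ends up most probable, and the competing spouse candidates are pushed down.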
Related Alternative Probabilistic Models
• Constrained Conditional Models [Roth et al. 2007]
– log-linear classifiers with constraint-violation penalty, mapped into Integer Linear Programs
• Factor Graphs with Imperative Variable Coordination [McCallum et al. 2008]
– RV‘s share “factors“ (joint feature functions); generalizes MRF, BN, CRF, …; inference via advanced MCMC; flexible coupling & constraining of RV‘s
Software tools:
alchemy.cs.washington.edu
code.google.com/p/factorie/
research.microsoft.com/en-us/um/cambridge/projects/infernet/
Reasoning for KB Growth: Direct Route
Facts in KB New fact candidates:
married (Hillary, Bill)married (Carla, Nicolas)married (Angelina, Brad)
married (Cecilia, Nicolas)married (Carla, Benjamin)married (Carla, Mick)married (Carla, Sofie)married (Larry, Google)
+
Patterns:
X and her husband YX and Y and their childrenX has been dating with YX loves Y
?
• KB facts are true; fact candidates & patterns are hypotheses
• grounded constraints → clauses with hypotheses as vars
• cast into Weighted Max-Sat with weights from pattern stats
• customized approximation algorithm
• unifies: fact/candidate consistency, pattern goodness, entity disambiguation
[Suchanek,Sozio,Weikum: WWW’09]
www.mpi-inf.mpg.de/yago-naga/sofie/
Direct approach:
SOFIE: Facts & Patterns Consistency [Suchanek,Sozio,Weikum: WWW’09]
www.mpi-inf.mpg.de/yago-naga/sofie/
Constraints to connect facts, fact candidates & patterns:
functional dependencies:
spouse(x,y): x → y, y → x
relation properties:
asymmetry, transitivity, acyclicity, …
type constraints, inclusion dependencies:
spouse ⊆ Person × Person, capitalOfCountry ⊆ cityOfCountry
domain-specific constraints:
bornInYear(x) + 10years ≤ graduatedInYear(x)
hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
pattern-fact duality:
occurs(p,x,y) ∧ expresses(p,R) ⇒ R(x,y)
occurs(p,x,y) ∧ R(x,y) ⇒ expresses(p,R)
name(-in-context)-to-entity mapping:
means(n,e1) ∨ means(n,e2) ∨ …
• Grounded into large propositional Boolean formula in CNF
• Max-Sat solver for joint inference (complete truth assignment to all candidate patterns & facts)
¬Spouse (Victoria, David) ∨ ¬Spouse (Rebecca, David)
¬Spouse (Victoria, David) ∨ ¬Spouse (Victoria, Tom)
…
occurs (husband, Victoria, David) ∧ expresses (husband, Spouse) ⇒ Spouse (Victoria, David)
occurs (dating, Rebecca, David) ∧ expresses (dating, Spouse) ⇒ Spouse (Rebecca, David)
…
occurs (husband, Victoria, David) ∧ Spouse (Victoria, David) ⇒ expresses (husband, Spouse)
…
∀x,y,z: R(x,y) ∧ R(x,z) ⇒ y=z
∀x,y,w: R(x,y) ∧ R(w,y) ⇒ x=w
…
∀x,y: R(x,y) ⇒ ¬R(y,x)
…
∀p,x,y: occurs (p, x, y) ∧ expresses (p, R) ⇒ R (x, y)
∀p,x,y: occurs (p, x, y) ∧ R (x, y) ⇒ expresses (p, R)
SOFIE Example
Facts in KB:
Spouse (HillaryClinton, BillClinton)
Spouse (CarlaBruni, NicolasSarkozy)
Fact hypotheses:
Spouse (Rebecca, David)
Spouse (Victoria, David)
Spouse (Victoria, Tom)
Pattern hypotheses (weight [1] each):
expresses (X and her husband Y, Spouse)
expresses (X Y and their children, Spouse)
expresses (X dating with Y, Spouse)
Weighted occurrences ([100], [40], [60], [20], [10]):
occurs (X and her husband Y, Hillary, Bill)
occurs (X Y and their children, Hillary, Bill)
occurs (X and her husband Y, Victoria, David)
occurs (X dating with Y, Rebecca, David)
occurs (X dating with Y, Victoria, Tom)
(further clause weights shown on the slide: [60], [20], [60])
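The joint decision over pattern and fact hypotheses can be illustrated with a brute-force weighted Max-Sat solver; the variables, clauses, and weights below are a simplified toy version of the example, not SOFIE's actual encoding or algorithm:

```python
# A brute-force weighted Max-Sat sketch of SOFIE-style joint inference:
# find the truth assignment maximizing the total weight of satisfied clauses.
# Hypotheses and weights are an invented toy instance.
from itertools import product

vars_ = ["spouse_Vic_Dav", "spouse_Reb_Dav", "expr_husband", "expr_dating"]

def satisfied_weight(a):
    total = 0.0
    # pattern occurrences with known facts (unit clauses for "expresses")
    total += 100 * a["expr_husband"]   # "husband" pattern seen with known spouses
    total += 10 * a["expr_dating"]     # "dating" pattern seen mostly off-target
    # pattern-fact duality: occurs & expresses => fact  (as implications)
    total += 60 * (not a["expr_husband"] or a["spouse_Vic_Dav"])
    total += 20 * (not a["expr_dating"] or a["spouse_Reb_Dav"])
    # functional dependency as a near-hard clause: David has at most one spouse
    total += 1000 * (not (a["spouse_Vic_Dav"] and a["spouse_Reb_Dav"]))
    return total

best = max(
    (dict(zip(vars_, vals)) for vals in product([True, False], repeat=4)),
    key=satisfied_weight,
)
print(best)
```

The optimum accepts the "husband" pattern and Spouse(Victoria, David) while rejecting the weak "dating" pattern together with Spouse(Rebecca, David), exactly the kind of joint pruning the slides describe.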
Soft Rules vs. Hard Constraints
Enforce FD‘s (mutual exclusion) as hard constraints:
hasAdvisor(x,y) ∧ diff(y,z) ⇒ ¬hasAdvisor(x,z)
Generalize to other forms of constraints:
Hard constraint:
hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
Soft constraint:
firstPaper(x,p) ∧ firstPaper(y,q) ∧ author(p,x) ∧ author(q,y) ∧ inYear(p) > inYear(q) + 5years ⇒ hasAdvisor(x,y) [0.6]
• Datalog-style grounding (deductive & potentially recursive): open issue for arbitrary constraints → rethink reasoning!
• Combined with weighted constraints: no longer regular MaxSat → constrained (weighted) MaxSat instead
Pattern Harvesting, Revisited [Suchanek et al: KDD’06; Nakashole et al: WebDB’10, WSDM’11]
narrow / nasty / noisy patterns:
X and his famous advisor Y
X carried out his doctoral research in math under the supervision of Y
X jointly developed the method with Y
→ using noisy patterns loses precision & slows down MaxSat; using narrow & dropping nasty patterns loses recall!
POS-lifted n-gram itemsets as patterns:
X { PRP ADJ advisor } Y
X { his doctoral research, under the supervision of } Y
X { PRP doctoral research, IN DET supervision of } Y
confidence weights, using seeds and counter-seeds:
seeds: (MosheVardi, CatrielBeeri), (JimGray, MikeHarrison)
counter-seeds: (MosheVardi, RonFagin), (AlonHalevy, LarryPage)
confidence of pattern p ~ #p with seeds / (#p with seeds + #p with counter-seeds)
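The seed/counter-seed scoring can be sketched as follows; the occurrence counts are invented, and the exact scoring formula in the cited papers may differ from this simple ratio:

```python
# Estimating pattern confidence from seeds and counter-seeds; the observed
# occurrences are invented for illustration.

seeds = {("MosheVardi", "CatrielBeeri"), ("JimGray", "MikeHarrison")}
counter_seeds = {("MosheVardi", "RonFagin"), ("AlonHalevy", "LarryPage")}

# observed pattern occurrences: pattern -> set of (X, Y) pairs it matched
occurrences = {
    "X and his famous advisor Y": {
        ("MosheVardi", "CatrielBeeri"), ("JimGray", "MikeHarrison"),
    },
    "X jointly developed the method with Y": {
        ("MosheVardi", "CatrielBeeri"), ("MosheVardi", "RonFagin"),
        ("AlonHalevy", "LarryPage"),
    },
}

def confidence(pattern):
    pos = len(occurrences[pattern] & seeds)
    neg = len(occurrences[pattern] & counter_seeds)
    return pos / (pos + neg) if pos + neg else 0.0

conf = {p: confidence(p) for p in occurrences}
print(conf)
```

Here the advisor-specific pattern scores 1.0, while the "jointly developed" pattern, which also fires for co-authors and counter-seeds, drops to 1/3.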
Outline
• Part II–Extracting Knowledge
• Pattern-based Extraction• Consistency Reasoning• Higher-arity Relations: Space & Time
Harvesting Knowledge from Web Data 97
✔
✔
Higher-arity Relations: Space & Time
• YAGO-2 Preview
Harvesting Knowledge from Web Data 98
www.mpi-inf.mpg.de/yago-naga/
estimated precision > 95% (for basic relations excl. space, time & provenance)
French Marriage Problem (Revisited)
Facts in KB:
1: married (Hillary, Bill)
2: married (Carla, Nicolas)
3: married (Angelina, Brad)
New fact candidates:
4: married (Cecilia, Nicolas)
5: married (Carla, Benjamin)
6: married (Carla, Mick)
7: divorced (Madonna, Guy)
8: domPartner (Angelina, Brad)
Temporal scopes:
validFrom (2, 2008)
validFrom (4, 1996), validUntil (4, 2007)
validFrom (5, 2010)
validFrom (6, 2006)
validFrom (7, 2008)
Challenge: Temporal Knowledge Harvesting
For all people in Wikipedia (100,000‘s), gather all spouses, incl. divorced & widowed, and the corresponding time periods! >95% accuracy, >95% coverage, in one night
Difficult Dating
(Even More Difficult) Implicit Dating
• explicit dates vs. implicit dates relative to other dates
• vague dates, relative dates
• narrative text, relative order
TARSQI: Extracting Time Annotations
Hong Kong is poised to hold the first election in more than half <TIMEX3 tid="t3" TYPE="DURATION" VAL="P100Y">a century</TIMEX3> that includes a democracy advocate seeking high office in territory controlled by the Chinese government in Beijing. A pro-democracy politician, Alan Leong, announced <TIMEX3 tid="t4" TYPE="DATE" VAL="20070131">Wednesday</TIMEX3> that he had obtained enough nominations to appear on the ballot to become the territory’s next chief executive. But he acknowledged that he had no chance of beating the Beijing-backed incumbent, Donald Tsang, who is seeking re-election. Under electoral rules imposed by Chinese officials, only 796 people on the election committee – the bulk of them with close ties to mainland China – will be allowed to vote in the <TIMEX3 tid="t5" TYPE="DATE" VAL="20070325">March 25</TIMEX3> election. It will be the first contested election for chief executive since Britain returned Hong Kong to China in <TIMEX3 tid="t6" TYPE="DATE" VAL="1997">1997</TIMEX3>. Mr. Tsang, an able administrator who took office during the early stages of a sharp economic upturn in <TIMEX3 tid="t7" TYPE="DATE" VAL="2005">2005</TIMEX3>, is popular with the general public. Polls consistently indicate that three-fifths of Hong Kong’s people approve of the job he has been doing. It is of course a foregone conclusion – Donald Tsang will be elected and will hold office for <TIMEX3 tid="t9" beginPoint="t0" endPoint="t8“ TYPE="DURATION" VAL="P5Y">another five years </TIMEX3>, said Mr. Leong, the former chairman of the Hong Kong Bar Association.
[Verhagen et al: ACL‘05]http://www.timeml.org/site/tarsqi/
→ extraction errors!
13 Relations between Time Intervals [Allen, 1984; Allen & Hayes 1989]
A Before B ↔ B After A
A Meets B ↔ B MetBy A
A Overlaps B ↔ B OverlappedBy A
A Starts B ↔ B StartedBy A
A During B ↔ B Contains A
A Finishes B ↔ B FinishedBy A
A Equal B
(Figure: interval diagrams illustrating each relation)
Possible Worlds in Time [Wang,Yahya,Theobald: MUD Workshop ‘10]
(Figure: timelines ‘00–‘07 for the state relations playsFor(Beckham, Real), with interval confidences 0.6 and 0.4 (total 1.0), and playsFor(Ronaldo, Real), with interval confidences 0.4, 0.2, 0.2, 0.1 (total 0.9); derived-interval probabilities 0.36, 0.16, 0.12, 0.08)
Base facts: independent
Derived facts: non-independent → need lineage!
playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1,T2)
⇒ teamMates(Beckham, Ronaldo)
• Closed and complete representation model (incl. lineage) → Stanford Trio project [Widom: CIDR’05, Benjelloun et al: VLDB’06]
• Interval computation remains linear in the number of bins
• Confidence computation per bin is #P-complete
• In general requires possible-worlds-based sampling techniques (Gibbs-style sampling, Luby-Karp, etc.)
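For small alternatives the possible-worlds semantics can be evaluated directly: enumerate the joint choices of interval for each independent base fact and sum the probabilities of worlds where the derived fact holds. The interval distributions below are simplified assumptions inspired by the slide, not its exact numbers:

```python
# Probability of a derived fact over uncertain time intervals, computed by
# enumerating possible worlds; interval distributions are illustrative only.
from itertools import product

# base facts: mutually exclusive alternative validity intervals (+ "not valid")
beckham = [((2003, 2005), 0.6), ((2005, 2007), 0.4)]               # playsFor(Beckham, Real)
ronaldo = [((2000, 2002), 0.1), ((2004, 2007), 0.8), (None, 0.1)]  # playsFor(Ronaldo, Real)

def overlaps(t1, t2):
    return t1 is not None and t2 is not None and t1[0] < t2[1] and t2[0] < t1[1]

# teamMates(Beckham, Ronaldo) holds in a world iff the two intervals overlap;
# base facts are independent, so world probabilities multiply
p_teammates = sum(
    p1 * p2
    for (t1, p1), (t2, p2) in product(beckham, ronaldo)
    if overlaps(t1, t2)
)
print(round(p_teammates, 3))  # prints 0.8
```

Enumeration is exponential in the number of base facts, which is why the slides point to lineage-aware representations and sampling for the general case.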
Open Problems and Challenges in IE (I)
• High precision & high recall at affordable cost
→ robust pattern analysis & reasoning
• Declarative, self-optimizing workflows
→ incorporate pattern & reasoning steps into IE queries/programs
• Scale, dynamics, life-cycle
→ grow & maintain KB with near-human-quality over long periods
→ parallel processing, lazy / lifted inference, …
• Types and constraints
→ explore & understand different families of constraints: soft rules & hard constraints, rich DL, beyond CWA
• Open-domain knowledge harvesting
→ turn names, phrases & table cells into entities & relations
Open Problems and Challenges in IE (II)
• Gathering implicit and relative time annotations
→ biographies & news, relative orderings; aggregate & reconcile observations
• Incomplete and uncertain temporal scopes
→ incorrect, incomplete, unknown begin/end; vague dating
• Consistency reasoning
→ extended MaxSat, extended Datalog, prob. graph. models, etc. for resolving inconsistencies on uncertain facts & uncertain time
• Temporal querying (revived)
→ query language (T-SPARQL?), no schema; confidence weights & ranking
Outline
• Part II–Extracting Knowledge
• Pattern-based Extraction• Consistency Reasoning• Higher-arity Relations: Space & Time
Harvesting Knowledge from Web Data 111
✔
✔
✔
Harvesting Knowledge from Web Data
References for Part II
• E. Agichtein, L. Gravano, J. Pavel, V. Sokolova, A. Voskoboynik. Snowball: a prototype system for extracting relations from large text collections. SIGMOD, 2001.
• James Allen. Towards a general theory of action and time. Artif. Intell., 23(2), 1984.
• M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, O. Etzioni. Open information extraction from the web. IJCAI, 2007.
• R. Baumgartner, S. Flesca, G. Gottlob. Visual web information extraction with Lixto. VLDB, 2001.
• S. Brin. Extracting patterns and relations from the World Wide Web. WebDB, 1998.
• M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, Y. Zhang. WebTables: exploring the power of tables on the web. PVLDB, 1(1), 2008.
• M. E. Califf, R. J. Mooney. Relational learning of pattern-match rules for information extraction. AAAI, 1999.
• P. DeRose, W. Shen, F. Chen, Y. Lee, D. Burdick, A. Doan, R. Ramakrishnan. DBLife: A community information management platform for the database research community. CIDR, 2007.
• A. Doan, L. Gravano, R. Ramakrishnan, S. Vaithyanathan (Eds.). Special issue on information extraction. SIGMOD Record, 37(4), 2008.
• O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, A. Yates. Web-scale information extraction in KnowItAll. WWW, 2004.
• G. Gottlob, C. Koch, R. Baumgartner, M. Herzog, S. Flesca. The Lixto data extraction project - back and forth between theory and practice. PODS, 2004.
• R. Gupta, S. Sarawagi. Answering Table Augmentation Queries from Unstructured Lists on the Web. PVLDB, 2(1), 2009.
• M. A. Hearst. Automatic acquisition of hyponyms from large text corpora. COLING, 1992.
• D. Hindle. Noun classification from predicate-argument structures. ACL, 1990.
• R. Krishnamurthy, Y. Li, S. Raghavan, F. Reiss, S. Vaithyanathan, H. Zhu. SystemT: a system for declarative information extraction. SIGMOD Record, 37(4), 2008.
• S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti. Collective Annotation of Wikipedia Entities in Web Text. KDD, 2009.
• N. Kushmerick. Wrapper induction: efficiency and expressiveness. Artif. Intell., 118(1-2), 2000.
• J. Lafferty, A. McCallum, F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ML, 2001.
• X. Liu, Z. Nie, N. Yu, J.-R. Wen. BioSnowball: automated population of Wikis. KDD, 2010.
• A. McCallum, K. Schultz, S. Singh. FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs. NIPS, 2009.
• N. Nakashole, M. Theobald, G. Weikum. Find your Advisor: Robust Knowledge Gathering from the Web. WebDB, 2010.
• L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 1989.
• M. Richardson, P. Domingos. Markov Logic Networks. ML, 2006.
• D. Roth, W. Yih. Global Inference for Entity and Relation Identification via a Linear Programming Formulation. MIT Press, 2007.
• S. Sarawagi. Information extraction. Foundations and Trends in Databases, 1(3), 2008.
• S. Sarawagi, W. W. Cohen. Semi-Markov conditional random fields for information extraction. NIPS, 2004.
• W. Shen, X. Li, A. Doan. Constraint-Based Entity Matching. AAAI, 2005.
• P. Singla, P. Domingos. Entity resolution with Markov Logic. ICDM, 2006.
• F. M. Suchanek, M. Sozio, G. Weikum. SOFIE: a self-organizing framework for information extraction. WWW, 2009.
• F. M. Suchanek, G. Ifrim, G. Weikum. Combining linguistic and statistical analysis to extract relations from web documents. KDD, 2006.
• C. Sutton, A. McCallum. An Introduction to Conditional Random Fields for Relational Learning. MIT Press, 2006.
• R. C. Wang, W. W. Cohen. Language-independent set expansion of named entities using the web. ICDM, 2007.
• Y. Wang, M. Yahya, M. Theobald. Time-aware Reasoning in Uncertain Knowledge Bases. VLDB/MUD, 2010.
• D. S. Weld, R. Hoffmann, F. Wu. Using Wikipedia to bootstrap open information extraction. SIGMOD Record, 37(4), 2008.
• F. Wu, D. S. Weld. Autonomously semantifying Wikipedia. CIKM, 2007.
• F. Wu, D. S. Weld. Automatically refining the Wikipedia infobox ontology. WWW, 2008.
• A. Yates, M. Banko, M. Broadhead, M. J. Cafarella, O. Etzioni, S. Soderland. TextRunner: Open information extraction on the web. HLT-NAACL, 2007.
• J. Zhu, Z. Nie, X. Liu, B. Zhang, J.-R. Wen. StatSnowball: a statistical approach to extracting entity relationships. WWW, 2009.
Harvesting Knowledge from Web Data
Outline• Part I
– What and Why– Available Knowledge Bases
• Part II– Extracting Knowledge
• Part III– Ranking and Searching
• Part IV– Conclusion and Outlook
✔
✔
✔
Harvesting Knowledge from Web Data
Outline for Part III
• Part III.1: Querying Knowledge Bases– A short overview of SPARQL– Extensions to SPARQL
• Part III.2: Searching and Ranking Entities• Part III.3: Searching and Ranking Facts
Harvesting Knowledge from Web Data
SPARQL
• Query language for RDF from the W3C• Main component:
– select-project-join combination of triple patterns → graph pattern queries on the knowledge base
Harvesting Knowledge from Web Data 116
SPARQL – Example
Example query: Find all actors from Ontario (that are in the knowledge base)
(Figure: sample knowledge graph with isA edges (Albert_Einstein isA physicist, scientist, vegetarian; Otto_Hahn isA chemist, scientist; Jim_Carrey and Mike_Myers isA actor), bornIn edges to Ulm, Frankfurt, Newmarket, Scarborough, and locatedIn edges to Germany, Ontario, Canada, Europe)
Harvesting Knowledge from Web Data 117
SPARQL – Example
Example query: Find all actors from Ontario (that are in the knowledge base)
(Figure: matching subgraph: Jim_Carrey and Mike_Myers isA actor, bornIn Newmarket / Scarborough, both locatedIn Ontario, locatedIn Canada)
Find subgraphs of this form (?person, ?loc are variables; actor, Ontario are constants):
SELECT ?person WHERE {
  ?person isA actor .
  ?person bornIn ?loc .
  ?loc locatedIn Ontario . }
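As a rough illustration of how such select-project-join evaluation proceeds, here is a tiny hand-rolled triple-pattern matcher over the example KB. It is a sketch of the semantics only, not a real SPARQL engine, and the join strategy (patterns joined left to right over growing variable bindings) is a simplifying assumption:

```python
# Evaluating a graph-pattern query with a minimal triple-pattern matcher;
# the KB is a fragment of the slide's example graph.

kb = [
    ("Jim_Carrey", "isA", "actor"),
    ("Mike_Myers", "isA", "actor"),
    ("Albert_Einstein", "isA", "physicist"),
    ("Jim_Carrey", "bornIn", "Newmarket"),
    ("Mike_Myers", "bornIn", "Scarborough"),
    ("Albert_Einstein", "bornIn", "Ulm"),
    ("Newmarket", "locatedIn", "Ontario"),
    ("Scarborough", "locatedIn", "Ontario"),
    ("Ulm", "locatedIn", "Germany"),
]

def match(pattern, triple, binding):
    """Extend a variable binding if the triple matches the pattern, else None."""
    b = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if b.get(p, t) != t:
                return None
            b[p] = t
        elif p != t:
            return None
    return b

def query(patterns):
    bindings = [{}]
    for pat in patterns:  # join triple patterns one after the other
        bindings = [b2 for b in bindings for t in kb
                    if (b2 := match(pat, t, b)) is not None]
    return bindings

# SELECT ?person WHERE { ?person isA actor . ?person bornIn ?loc .
#                        ?loc locatedIn Ontario . }
rows = query([("?person", "isA", "actor"),
              ("?person", "bornIn", "?loc"),
              ("?loc", "locatedIn", "Ontario")])
print(sorted(b["?person"] for b in rows))
```

Running the query binds ?person to Jim_Carrey and Mike_Myers, mirroring the subgraph match in the figure.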
SPARQL – More Features
• Eliminate duplicates in results:
SELECT DISTINCT ?c WHERE {?person isA actor. ?person bornIn ?loc. ?loc locatedIn ?c}
• Return results in some order, with optional LIMIT n clause:
SELECT ?person WHERE {?person isA actor. ?person bornIn ?loc. ?loc locatedIn Ontario} ORDER BY DESC(?person)
• Optional matches and filters on bounded vars:
SELECT ?person WHERE {?person isA actor. OPTIONAL{?person bornIn ?loc}. FILTER (!BOUND(?loc))}
• More operators: ASK, DESCRIBE, CONSTRUCT
Harvesting Knowledge from Web Data
Harvesting Knowledge from Web Data
SPARQL: Extensions from W3C
W3C SPARQL 1.1 draft:
• Aggregations (COUNT, AVG, …)
• Subqueries
• Negation: syntactic sugar for OPTIONAL {?x … } FILTER(!BOUND(?x))
Harvesting Knowledge from Web Data
SPARQL: Extensions from Research (1)
More complex graph patterns:• Transitive paths [Anyanwu et al., WWW07]
SELECT ?p, ?c WHERE {
  ?p isA scientist . ?p ??r ?c . ?c isA Country . ?c locatedIn Europe .
  PathFilter(cost(??r) < 5) .
  PathFilter(containsAny(??r, ?t)) . ?t isA City . }
• Regular expressions [Kasneci et al., ICDE08]SELECT ?p, ?c WHERE { ?p isA ?s. ?s isA scientist. ?p (bornIn | livesIn | citizenOf) locatedIn* Europe.}
Harvesting Knowledge from Web Data 122
SPARQL: Extensions from Research (2)
Queries over federated RDF sources:
• Determine distribution of triple patterns as part of query (for example in ARQ from Jena)
• Automatically route triple predicates to useful sources
– Potentially requires mapping of identifiers from different sources
Harvesting Knowledge from Web Data 123
Harvesting Knowledge from Web Data
RDF+SPARQL: Systems
• BigOWLIM
• OpenLink Virtuoso
• Jena with different backends
• Sesame
• OntoBroker
• SW-Store, Hexastore, RDF-3X (no reasoning)
System deployments with >10^11 triples (see http://esw.w3.org/LargeTripleStores)
Harvesting Knowledge from Web Data
Outline for Part III
• Part III.1: Querying Knowledge Bases• Part III.2: Searching and Ranking Entities
– Entity Importance: Graph Analysis– Entity Search: Language Models
• Part III.3: Searching and Ranking Facts
Harvesting Knowledge from Web Data
Why ranking is essential
• Queries often have a huge number of results:– scientists from Canada– conferences in Toronto– publications in databases– actors from the U.S.
• Ranking as integral part of search• Huge number of app-specific ranking methods:
paper/citation count, impact, salary, …• Need for generic ranking
Harvesting Knowledge from Web Data
Extending Entities with Keywords
Remember: entities occur in facts in documents
→ associate entities with terms in those documents
(Example associated terms: chancellor, Germany, scientist, election, Stuttgart21, Guido Westerwelle, France, Nicolas Sarkozy)
Harvesting Knowledge from Web Data
Digression 1: Graph Authority Measures
Idea: incoming links are endorsements & increase page authority; authority is higher if links come from high-authority pages
Random walk: uniformly random choice of links + random jumps
Authority (page q) = stationary prob. of visiting q

PR(q) = (1−ε) · Σ_{p: (p,q) ∈ E} PR(p) / outdeg(p) + ε / |V|
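The stationary distribution of this random walk can be computed by power iteration. A minimal sketch of the formula above, on an invented three-node link graph (dangling nodes and convergence checks are left out for brevity):

```python
# Power-iteration PageRank:
# PR(q) = (1 - eps) * sum over in-links p of PR(p)/outdeg(p) + eps/|V|.
# The toy link graph is invented for illustration.

def pagerank(links, eps=0.15, iters=100):
    nodes = sorted(links)
    pr = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for q in nodes:
            new[q] = eps / len(nodes) + (1 - eps) * sum(
                pr[p] / len(links[p]) for p in nodes if q in links[p]
            )
        pr = new
    return pr

links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
pr = pagerank(links)
print(pr)
```

Node c collects endorsements from both a and b and therefore ends up with the highest authority; the scores remain a probability distribution over the nodes.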
Harvesting Knowledge from Web Data
Graph-Based Entity Importance
Combine several paradigms:
• Keyword search on associated terms to determine candidate entities
• Pagerank or similar measure to determine important entities
• Ranking can combine entity rank with keyword-based score
Harvesting Knowledge from Web Data 130
Digression 2: Language Models (LMs)
State-of-the-art model in text retrieval
(Figure: query q matched against LM(θ1) of document d1 and LM(θ2) of document d2)
• each document di has LM: generative probability distribution of terms with parameter θi
• query q viewed as sample from LM(θ1), LM(θ2), …
• estimate likelihood P[ q | LM(θi) ] that q is a sample of the LM of document di (q is „generated by“ di)
• rank by descending likelihoods (best „explanation“ of q)
Harvesting Knowledge from Web Data 131
Language Models for Text: Example
(Figure: document d, a bag of terms A, A, A, A, B, B, C, C, C, D, E, E, E, E, E, is a sample of model M used for parameter estimation)
estimate likelihood of observing query: P[ A A B C E E | M ]
Harvesting Knowledge from Web Data 132
Language Models for Text: Smoothing
(Figure: document d used for parameter estimation of model M, plus a background corpus and/or smoothing: Laplace smoothing, Jelinek-Mercer, Dirichlet smoothing, …)
estimate likelihood of observing query: P[ A B C E F | M ]
Harvesting Knowledge from Web Data 133
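The smoothing idea can be made concrete with a small Jelinek-Mercer mixture of document and corpus statistics; the two toy documents reuse the slide's A..F vocabulary, and the mixing weight 0.8 is an arbitrary assumption:

```python
# Jelinek-Mercer smoothed query likelihood, a sketch of the mixture model:
# P[q|d] = prod_i ( lam * P[q_i|d] + (1-lam) * P[q_i|corpus] ).
from collections import Counter

docs = {
    "d1": "A A A A B B C C C D E E E E E".split(),
    "d2": "A B C D E F F F".split(),
}
corpus = [w for d in docs.values() for w in d]
corpus_tf = Counter(corpus)

def score(query, doc, lam=0.8):
    tf = Counter(docs[doc])
    s = 1.0
    for w in query.split():
        p_doc = tf[w] / len(docs[doc])
        p_bg = corpus_tf[w] / len(corpus)
        s *= lam * p_doc + (1 - lam) * p_bg
    return s

# "F" never occurs in d1: without smoothing d1's score would be zero
ranked = sorted(docs, key=lambda d: score("A B C E F", d), reverse=True)
print(ranked, score("A B C E F", "d1"))
```

Smoothing keeps d1 in the ranking with a small but nonzero score even though it never contains F, while d2, which explains the whole query, ranks first.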
Some LM Basics

simple MLE (independence assumption; overfits):
s(q,d) = P[q|d] = Π_i P[q_i|d] ~ Σ_i log( tf(q_i,d) / Σ_k tf(k,d) )

mixture model for smoothing, with P[q] estimated from query log or corpus:
s(q,d) = Π_i ( λ·P[q_i|d] + (1−λ)·P[q_i] )
~ Σ_i log( λ · tf(q_i,d) / Σ_k tf(k,d) + (1−λ) · df(q_i) / Σ_k df(k) )
~ Σ_i log( 1 + (λ / (1−λ)) · (tf(q_i,d) / Σ_k tf(k,d)) · (Σ_k df(k) / df(q_i)) )

KL divergence (Kullback-Leibler div., aka relative entropy); rank by ascending „improbability“:
KL(q|d) ~ Σ_i P[i|q] · log( P[i|q] / P[i|d] )
Harvesting Knowledge from Web Data 134
Entity Search with LM Ranking [Z. Nie et al.: WWW’07]
query: keywords → answer: entities
LM (entity e) = prob. distr. of words seen in context of e (weighted by confidence)

s(e,q) = Π_i ( λ·P[q_i|e] + (1−λ)·P[q_i] ) ~ KL( LM(q) | LM(e) )

query q: „French player who won world championship“
candidate entities:
e1: David Beckham (context: played for ManU, Real, LA Galaxy; David Beckham champions league; England lost match against France; married to spice girl …)
e2: Ruud van Nistelroy
e3: Ronaldinho
e4: Zinedine Zidane (context: Zizou champions league 2002; Real Madrid won final; Zinedine Zidane best player; France world cup 1998 …)
e5: FC Barcelona
Harvesting Knowledge from Web Data
Outline for Part III
• Part III.1: Querying Knowledge Bases• Part III.2: Searching and Ranking Entities• Part III.3: Searching and Ranking Facts
– General ranking issues– NAGA-style ranking– Language Models for facts
What makes a fact „good“?
Confidence: Prefer results that are likely correct
• accuracy of info extraction
• trust in sources (authenticity, authority)
e.g. bornIn (Jim Gray, San Francisco) from „Jim Gray was born in San Francisco“ (en.wikipedia.org)
vs. livesIn (Michael Jackson, Tibet) from „Fans believe Jacko hides in Tibet“ (www.michaeljacksonsightings.com)
Informativeness: Prefer results with salient facts. Statistical estimation from:
• frequency in answer
• frequency on Web
• frequency in query log
e.g. q: Einstein isa ? → Einstein isa scientist vs. Einstein isa vegetarian
q: ?x isa vegetarian → Einstein isa vegetarian vs. Whocares isa vegetarian
Conciseness: Prefer results that are tightly connected
• size of answer graph
• cost of Steiner tree
e.g. Einstein won NobelPrize, Bohr won NobelPrize, Einstein isa vegetarian, Cruise isa vegetarian, Cruise born 1962, Bohr died 1962
Diversity: Prefer variety of facts
e.g. E won …, E discovered …, E played … rather than E won …, E won …, E won …, E won …
Harvesting Knowledge from Web Data 137
How can we implement this?
Confidence: Prefer results that are likely correct
→ empirical accuracy of IE; PR/HITS-style estimate of trust; combine into: max { accuracy (f,s) * trust(s) | s ∈ witnesses(f) }
Informativeness: Prefer results with salient facts
→ PR/HITS-style entity/fact ranking [V. Hristidis et al., S. Chakrabarti, …]; IR models: tf*idf … [K. Chang et al., …] or Statistical Language Models
Conciseness: Prefer results that are tightly connected
→ graph algorithms (BANKS, STAR, …) [J.X. Yu et al., S. Chakrabarti et al., B. Kimelfeld et al., A. Markovetz et al., B.C. Ooi et al., G. Kasneci et al., …]
Diversity: Prefer variety of facts
→ Statistical Language Models
Harvesting Knowledge from Web Data 138
LMs: From Entities to Facts
Document / Entity LM‘s:
• LM for doc/entity: prob. distr. of words
• LM for query: (prob. distr. of) words
• LM‘s: rich for docs/entities, super-sparse for queries
→ richer query LM with query expansion, etc.
Triple LM‘s:
• LM for facts: (degenerate prob. distr. of) triple
• LM for queries: (degenerate prob. distr. of) triple pattern
• LM‘s: apples and oranges →
– expand query variables by S,P,O values from DB/KB
– enhance with witness statistics
– query LM then is prob. distr. of triples!
Harvesting Knowledge from Web Data 139
LMs for Triples and Triple Patterns
triples (facts f) with witness statistics (Σ: 2600):
f1: Beckham p ManchesterU 200
f2: Beckham p RealMadrid 300
f3: Beckham p LAGalaxy 20
f4: Beckham p ACMilan 30
f5: Kaka p ACMilan 300
f6: Kaka p RealMadrid 150
f7: Zidane p ASCannes 20
f8: Zidane p Juventus 200
f9: Zidane p RealMadrid 350
f10: Tidjani p ASCannes 10
f11: Messi p FCBarcelona 400
f12: Henry p Arsenal 200
f13: Henry p FCBarcelona 150
f14: Ribery p BayernMunich 100
f15: Drogba p Chelsea 150
f16: Casillas p RealMadrid 20
triple patterns (queries q) → LM(q) + smoothing:
q: Beckham p ?y
→ Beckham p ManU 200/550, Beckham p Real 300/550, Beckham p Galaxy 20/550, Beckham p Milan 30/550
q: ?x p ASCannes
→ Zidane p ASCannes 20/30, Tidjani p ASCannes 10/30
q: Cruyff ?r FCBarcelona
→ Cruyff playedFor FCBarca 200/500, Cruyff playedAgainst FCBarca 50/500, Cruyff coached FCBarca 250/500
q: ?x p ?y
→ Messi p FCBarcelona 400/2600, Zidane p RealMadrid 350/2600, Kaka p ACMilan 300/2600, …
LM(q): { t → P[t | t matches q] ~ #witnesses(t) }
LM(answer f): { t → P[t | t matches f] ~ 1 for f }
smooth all LM‘s; rank results by ascending KL(LM(q) | LM(f))
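The KL-based ranking for a single triple pattern can be sketched directly from the witness counts; the smoothing constant below is an arbitrary assumption, and real systems smooth more carefully:

```python
# Ranking answers to a triple pattern by ascending KL(LM(q) | LM(f)), where
# LM(q) is proportional to witness counts and LM(f) is a smoothed point mass
# on fact f. Counts follow the slide; the smoothing constant is an assumption.
from math import log

witnesses = {
    ("Beckham", "p", "ManU"): 200,
    ("Beckham", "p", "Real"): 300,
    ("Beckham", "p", "Galaxy"): 20,
    ("Beckham", "p", "Milan"): 30,
}
total = sum(witnesses.values())
lm_q = {t: c / total for t, c in witnesses.items()}  # LM of "Beckham p ?y"

def kl(p, q):
    return sum(p[t] * log(p[t] / q[t]) for t in p)

def lm_answer(fact, mu=0.1):
    # smoothed degenerate distribution: mass mostly on the answer fact itself
    return {t: (1 - mu) * (t == fact) + mu / len(witnesses) for t in witnesses}

ranked = sorted(witnesses, key=lambda f: kl(lm_q, lm_answer(f)))
print(ranked[0])
```

The ordering reproduces the intuition behind the formula: answers whose point mass sits where the query LM puts most probability (here Beckham p Real, with the most witnesses) have the smallest divergence.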
Harvesting Knowledge from Web Data 140
LMs for Composite Queries
q: Select ?x,?c Where { ?x bornIn France . ?x playsFor ?c . ?c in UK . }
facts with witness counts:
f1: Beckham p ManU 200; f7: Zidane p ASCannes 20; f8: Zidane p Juventus 200; f9: Zidane p RealMadrid 300; f10: Tidjani p ASCannes 10; f12: Henry p Arsenal 200; f13: Henry p FCBarca 150; f14: Ribery p Bayern 100; f15: Drogba p Chelsea 150
f21: Zidane bI F 200; f22: Tidjani bI F 20; f23: Henry bI F 200; f24: Ribery bI F 200; f25: Drogba bI F 30; f26: Drogba bI IC 100; f27: Zidane bI ALG 50
f31: ManU in UK 200; f32: Arsenal in UK 160; f33: Chelsea in UK 140
queries q with subqueries q1 … qn; results are n-tuples of triples t1 … tn
LM(q): P[q1 … qn] = Π_i P[qi]
LM(answer): P[t1 … tn] = Π_i P[ti]
KL(LM(q)|LM(answer)) = Σ_i KL(LM(qi)|LM(ti))

P[ Henry bI F, Henry p Arsenal, Arsenal in UK ] ~ (200/650) · (200/2600) · (160/500)
P[ Drogba bI F, Drogba p Chelsea, Chelsea in UK ] ~ (30/650) · (150/2600) · (140/500)
Harvesting Knowledge from Web Data
Extensions: Keywords
• Consider witnesses/sources (provenance meta-facts)• Allow text predicates with each triple pattern (à la XQ-FT)
Problem: not everything is triplified
European composers who have won the Oscar,whose music appeared in dramatic western scenes,and who also wrote classical pieces ?
Select ?p Where { ?p instanceOf Composer . ?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe . ?p hasWon ?a .?a Name AcademyAward . ?p contributedTo ?movie [western, gunfight, duel, sunset] . ?p composed ?music [classical, orchestra, cantata, opera] . }
Semantics: triples match struct. pred.witnesses match text pred.
Harvesting Knowledge from Web Data 142
Grouping ofkeywords or phrasesboosts expressiveness
French politicians married to Italian singers? Select ?p1, ?p2 Where { ?p1 instanceOf ?c1 [France, politics] . ?p2 instanceOf ?c2 [Italy, singer] . ?p1 marriedTo ?p2 . }
CS researchers whose advisors worked on the Manhattan project?Select ?r, ?a Where {?r instOf researcher [“computer science“] . ?a workedOn ?x [“Manhattan project“] .?r hasAdvisor ?a . }
Select ?r, ?a Where {?r ?p1 ?o1 [“computer science“] . ?a ?p2 ?o2 [“Manhattan project“] .?r ?p3 ?a . }
Harvesting Knowledge from Web Data 143
LMs for Keyword-Augmented Queriesq: Select ?x, ?c Where { France ml ?x [goalgetter, “top scorer“] . ?x p ?c . ?c in UK [champion, “cup winner“, double] . }
subqueries qi with keywords w1 … wm
results are still n-tuples of triples ti
LM(qi): P[triple ti | w1 … wm] = λ Σk P[ti | wk] + (1−λ) P[ti]
LM(answer fi) analogous
KL(LM(q) | LM(answer fi)) = Σi KL(LM(qi) | LM(fi))
result ranking prefers (n-tuples of) triples whose witnesses score high on the subquery keywords
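The keyword mixture can be written as a small helper. The smoothing weight λ and the summation over the per-keyword probabilities are our reading of the slide's formula, and the numbers are made up for illustration:

```python
def keyword_lm(p_t_given_wk, p_t, lam=0.5):
    """Mixture LM for a triple t under keywords w1..wm:
    lam * (average of P[t | wk] over the keywords) + (1 - lam) * P[t],
    i.e. Jelinek-Mercer-style smoothing against the keyword-free LM."""
    per_keyword = sum(p_t_given_wk) / len(p_t_given_wk)
    return lam * per_keyword + (1.0 - lam) * p_t

# triple "Henry p Arsenal" under the keywords [goalgetter, "top scorer"]:
score = keyword_lm([0.4, 0.6], p_t=0.2, lam=0.5)
print(score)  # 0.35
```

A triple with strong witness support for the keywords keeps a high score even when its prior P[t] is modest, which is exactly the ranking preference stated above.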
Extensions: Query Relaxation
f1: Beckham p ManU 200
f7: Zidane p ASCannes 20
f9: Zidane p Real 300
f10: Tidjani p ASCannes 10
f12: Henry p Arsenal 200
f15: Drogba p Chelsea 150

f31: ManU in UK 200
f32: Arsenal in UK 160
f33: Chelsea in UK 140

f21: Zidane bI F 200
f22: Tidjani bI F 20
f23: Henry bI F 200
f24: Ribery bI F 200
f26: Drogba bI IC 100
f27: Zidane bI ALG 50
[ Zidane bI F, Zidane p Real, Real in ESP ]
[ Drogba bI IC, Drogba p Chelsea, Chelsea in UK ]
[ Drogba resOf F, Drogba p Chelsea, Chelsea in UK ]
q(2): … Where { ?x bornIn ?y . ?x p ?c . ?c in UK . }
q(4): … Where { ?x bornIn IC . ?x p ?c . ?c in UK . }
LM(q*) = λ0 LM(q) + λ1 LM(q(1)) + λ2 LM(q(2)) + …
replace e in q by e(i) in q(i):
precompute P := LM(e ?p ?o) and Q := LM(e(i) ?p ?o)
set λi ~ 1/2 (KL(P|Q) + KL(Q|P))
replace r in q by r(i) in q(i): LM(?s r(i) ?o)
replace e in q by ?x in q(i): LM(?x r ?o)
…
LM's of e, r, … are probability distributions over triples!
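The weight computation for a relaxed query can be sketched as follows; the `kl` helper and the toy distributions are our own illustration, and we only compute the symmetrized divergence that the slide bases λi on:

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence between two distributions given as dicts over triples."""
    return sum(pv * math.log(pv / max(q.get(t, 0.0), eps))
               for t, pv in p.items() if pv > 0)

def sym_divergence(P, Q):
    """1/2 (KL(P|Q) + KL(Q|P)) between the LM of an entity e and the LM of
    its candidate replacement e(i); a small value means a close relaxation,
    which should translate into a larger mixture weight."""
    return 0.5 * (kl(P, Q) + kl(Q, P))

# Illustrative LMs of triples with subject France vs. subject Ivory Coast:
P = {"Zidane bI F": 0.6, "Henry bI F": 0.4}
Q = {"Drogba bI IC": 0.9, "Zidane bI ALG": 0.1}
weight_input = sym_divergence(P, Q)
```

Identical distributions give divergence 0, and the measure is symmetric, so it can be precomputed once per (e, e(i)) pair.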
Extensions: Diversification
q: Select ?p, ?c Where { ?p isa SoccerPlayer . ?p playedFor ?c . }
Top-10 without diversification:
1 Beckham, ManchesterU
2 Beckham, RealMadrid
3 Beckham, LAGalaxy
4 Beckham, ACMilan
5 Zidane, RealMadrid
6 Kaka, RealMadrid
7 Cristiano Ronaldo, RealMadrid
8 Raul, RealMadrid
9 van Nistelrooy, RealMadrid
10 Casillas, RealMadrid

Top-10 with diversification:
1 Beckham, ManchesterU
2 Beckham, RealMadrid
3 Zidane, RealMadrid
4 Kaka, ACMilan
5 Cristiano Ronaldo, ManchesterU
6 Messi, FCBarcelona
7 Henry, Arsenal
8 Ribery, BayernMunich
9 Drogba, Chelsea
10 Luis Figo, Sporting Lissabon
rank results f1 … fk by ascending
KL(LM(q) | LM(fi)) − (1−λ) KL(LM(fi) | LM({f1..fk}\{fi}))
implemented by greedy re-ranking of the fi's in a candidate pool
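A greedy re-ranking along these lines can be sketched in Python. The mixing weight, the averaging of the already-selected LMs into a single "rest" model, and all toy distributions are our own simplifications for illustration:

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence for distributions given as dicts over triples."""
    return sum(pv * math.log(pv / max(q.get(t, 0.0), eps))
               for t, pv in p.items() if pv > 0)

def avg_lm(lms):
    """Average several LMs (dicts) into one distribution."""
    out = {}
    for lm in lms:
        for t, v in lm.items():
            out[t] = out.get(t, 0.0) + v / len(lms)
    return out

def diversify(lm_q, candidates, lam=0.5):
    """Greedy re-ranking: repeatedly pick the candidate f minimizing
    KL(LM(q)|LM(f)) - (1 - lam) * KL(LM(f)|LM(already selected)),
    trading relevance against redundancy with the picks so far."""
    selected, pool = [], dict(candidates)
    while pool:
        def score(name):
            relevance = kl(lm_q, pool[name])
            if not selected:
                return relevance
            rest = avg_lm([candidates[s] for s in selected])
            return relevance - (1.0 - lam) * kl(pool[name], rest)
        best = min(pool, key=score)
        selected.append(best)
        del pool[best]
    return selected

lm_q = {"plays soccer": 1.0}
cands = {
    "Beckham, RealMadrid": {"plays soccer": 0.9, "RealMadrid": 0.1},
    "Zidane, RealMadrid":  {"plays soccer": 0.9, "RealMadrid": 0.1},
    "Messi, FCBarcelona":  {"plays soccer": 0.8, "FCBarcelona": 0.2},
}
order = diversify(lm_q, cands)
```

With these toy LMs the slightly less relevant but novel Messi result is promoted over the redundant second RealMadrid result, mirroring the contrast between the two top-10 lists above.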
Searching and Ranking – Summary
• Don't re-invent the wheel:
  LMs are an elegant and expressive means for ranking;
  consider both data & workload statistics
• Extensions should be conceptually simple:
  LMs can capture informativeness, personalization,
  relaxation, diversity – all in the same framework
• Unified ranking model for the complete query language:
  still work to do
References for Part III
• SPARQL Query Language for RDF, W3C Recommendation, 15 January 2008, http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/
• SPARQL New Features and Rationale, W3C Working Draft, 2 July 2009, http://www.w3.org/TR/2009/WD-sparql-features-20090702/
• Kemafor Anyanwu, Angela Maduko, Amit P. Sheth: SPARQ2L: towards support for subgraph extraction queries in RDF databases. WWW Conference, 2007
• Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE, 2002
• Soumen Chakrabarti: Dynamic personalized pagerank in entity-relation graphs. WWW Conference, 2007
• Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang: EntityRank: searching entities directly and holistically. VLDB, 2007
• Shady Elbassuoni, Maya Ramanath, Ralf Schenkel, Marcin Sydow, Gerhard Weikum: Language-model-based ranking for queries on RDF-graphs. CIKM, 2009
• Djoerd Hiemstra: Language Models. Encyclopedia of Database Systems, 2009
• Vagelis Hristidis, Heasoo Hwang, Yannis Papakonstantinou: Authority-based keyword search in databases. ACM Transactions on Database Systems 33(1), 2008
• Gjergji Kasneci, Maya Ramanath, Mauro Sozio, Fabian M. Suchanek, Gerhard Weikum: STAR: Steiner-Tree Approximation in Relationship Graphs. ICDE, 2009
• Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, Gerhard Weikum: NAGA: Searching and Ranking Knowledge. ICDE, 2008
• Mounia Lalmas: XML Retrieval. Morgan & Claypool Publishers, 2009
• Thomas Neumann, Gerhard Weikum: The RDF-3X engine for scalable management of RDF data. VLDB Journal 19(1), 2010
• Zaiqing Nie, Yunxiao Ma, Shuming Shi, Ji-Rong Wen, Wei-Ying Ma: Web object retrieval. WWW Conference, 2007
• Desislava Petkova, W. Bruce Croft: Hierarchical Language Models for Expert Finding in Enterprise Corpora. ICTAI, 2006
• Nicoleta Preda, Gjergji Kasneci, Fabian M. Suchanek, Thomas Neumann, Wenjun Yuan, Gerhard Weikum: Active knowledge: dynamically enriching RDF knowledge bases by web services. SIGMOD Conference, 2010
• Pavel Serdyukov, Djoerd Hiemstra: Modeling Documents as Mixtures of Persons for Expert Finding. ECIR, 2008
• ChengXiang Zhai: Statistical Language Models for Information Retrieval. Morgan & Claypool Publishers, 2008
Outline
• Part I– What and Why ✔– Available Knowledge Bases ✔
• Part II– Extracting Knowledge ✔
• Part III– Ranking and Searching ✔
• Part IV– Conclusion and Outlook
But back to the original question...
Will there ever be a famous singer called Elvis again?

?x hasGivenName "Elvis" .
?x type singer .
But back to the original question...
http://mpii.de/yago
?x = Elvis_Costello
?singer = wordnet_singer_110599806
?d = 1954-08-25
We found him! Can we find out more about this guy?
But back to the original question...
http://mpii.de/yago
Alright, and even more?
Linking Open Data: Goal
[Figure: two knowledge graphs about Elvis – "Costellopedia" (born 1954, plays guitar) and YAGO]
Can we combine knowledge from different sources?
Linking Open Data: URIs
1. Define a name space:
   http://dbpedia.org/resource
   http://costello.org
2. Define entity names in that name space:
   http://dbpedia.org/resource/ElvisCostello
   http://costello.org/Elvis

Every entity has a worldwide unique identifier (a Uniform Resource Identifier, URI). There is a W3C standard for that.
[W3C URI]
Linking Open Data: Cool URIs
1. Define a name space
2. Define entity names in that name space
3. Make them accessible online
[Diagram: a client dereferences http://costello.org/Elvis and the server returns RDF data (born 1954)]
There is a W3C description for that. [W3C CoolURI]
Linking Open Data: Links
1. Define a name space
2. Define entity names in that name space
3. Make them accessible online
4. Define equivalence links

This is an entity resolution problem. Use
• keys (e.g., the ISBN)
• similar identifiers
• similar labels (names)
• common properties
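A toy sketch of these entity-resolution heuristics; all field names, the threshold, and the example records are made up for illustration, and real linkers weight and combine the signals far more carefully:

```python
def same_entity(e1, e2):
    """Tiny entity-resolution heuristic: a shared key (here an ISBN-style
    field) decides immediately; otherwise fall back to matching labels
    plus at least one common property."""
    if e1.get("isbn") and e1.get("isbn") == e2.get("isbn"):
        return True  # a key such as the ISBN is decisive on its own
    same_label = e1.get("label", "").lower() == e2.get("label", "").lower()
    shared = set(e1.get("props", [])) & set(e2.get("props", []))
    return same_label and len(shared) >= 1

a = {"label": "Elvis Costello", "props": ["born:1954", "plays:guitar"]}
b = {"label": "elvis costello", "props": ["born:1954"]}
# same_entity(a, b) -> True: matching labels and a shared property
```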
Goal of theW3C group
[Bizer JSWIS 2009]
Linking Open Data: Status so far
Currently (2010):
• 200 ontologies
• 25 billion triples
• 400 million links
http://richard.cyganiak.de/2007/10/lod/imagemap.html
Querying Semantic Data
Sindice is an index for the Semantic Web developed at DERI in Galway, Ireland.
Sindice exploits
• RDF dumps available on the Web
• RDF information embedded into HTML pages
• RDF data available via cool URIs
• inter-ontology links
http://sindice.com
[Tummarello ISWC 2007]
Querying Semantic Data
... far from perfect... but far from useless...
Conclusion
• We have seen the knowledge representation model of ontologies, RDF. In a nutshell, RDF is a kind of distributed entity-relationship model.
• We have seen numerous existing knowledge bases, manually constructed (Cyc and WordNet) and automatically constructed (YAGO, DBpedia, Freebase, TrueKnowledge, etc.)
• We have seen techniques for creating such knowledge bases (pattern-based extraction and reasoning-based extraction, with uncertainty)
• We have seen techniques for querying and ranking the knowledge (with SPARQL and language-model-based ranking)
• We have seen that many knowledge bases already exist and that there is ongoing work to interlink them
• We have seen that there is indeed a promising singer called Elvis
The End
Feel free to contact us with further questions
The slides are available at http://www.mpi-inf.mpg.de/yago-naga/CIKM10-tutorial/
Hady Lauw
Institute for Infocomm Research, Singapore
http://hadylauw.com

Martin Theobald
Max-Planck Institute for Informatics, Saarbrücken
http://mpii.de/~mtb

Fabian M. Suchanek
INRIA Saclay, Paris
http://suchanek.name

Ralf Schenkel
Saarland University
http://people.mmci.uni-saarland.de/~schenkel/
References for Part IV
• [W3C URI] W3C: "Architecture of the World Wide Web, Volume One". Recommendation, 15 December 2004, http://www.w3.org/TR/webarch/
• [W3C CoolURI] W3C: "Cool URIs for the Semantic Web". Interest Group Note, 3 December 2008, http://www.w3.org/TR/cooluris/
• [Bizer JSWIS 2009] C. Bizer, T. Heath, T. Berners-Lee: "Linked Data – The Story So Far". International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009
• [Tummarello ISWC 2007] G. Tummarello, R. Delbru, E. Oren: "Sindice.com: Weaving the Open Linked Data". ISWC/ASWC 2007