Harvesting Knowledge from Web Data and Text CIKM 2010 Tutorial (1/2 Day)
Harvesting Knowledge from Web Data and Text
CIKM 2010 Tutorial (1/2 Day)
Hady W. Lauw1, Ralf Schenkel2, Fabian Suchanek3, Martin Theobald4,
and Gerhard Weikum4
1 Institute for Infocomm Research, Singapore   2 Saarland University, Saarbruecken
3 INRIA Saclay, Paris   4 Max Planck Institute for Informatics, Saarbruecken
All slides for download…
http://www.mpi-inf.mpg.de/yago-naga/CIKM10-tutorial/
Harvesting Knowledge from Web Data 2
Outline
• Part I – What and Why – Available Knowledge Bases
• Part II – Extracting Knowledge
• Part III – Ranking and Searching
• Part IV – Conclusion and Outlook
Motivation
Elvis Presley1935 - 1977
Elvis, when I
need you, I
can hear you!
Will there ever be someone like him again?
Motivation
Another Elvis
Elvis Presley: The Early Years
Elvis spent more weeks at the top of the charts than any other artist.
www.fiftiesweb.com/elvis.htm
Motivation
Personal relationships of Elvis Presley – Wikipedia
...when Elvis was a young teen... another girl whom the singer's mother hoped Presley would... The writer called Elvis "a hillbilly cat"
en.wikipedia.org/.../Personal_relationships_of_Elvis_Presley
Another singer called Elvis, young
Motivation
Dear Mr. Page, you don’t understand me. I just...
Elvis Presley - Official page for Elvis Presley
Welcome to the Official Elvis Presley Web Site, home of the undisputed King of Rock 'n' Roll and his beloved Graceland ...
www.elvis.com/
Motivation
Other (more serious?) queries:
• when is Madonna's next concert in Europe?
• which protein inhibits atherosclerosis?
• who was king of England when Napoleon I was emperor of France? (King George III)
• is there another famous singer named "Elvis"?
• has any scientist ever won the Nobel Prize in Literature? (Bertrand Russell)
• which countries have a HDI comparable to Sweden's?
• which scientific papers have led to patents?
This Tutorial
[Diagram: two different entities, both labeled "Elvis"; which one has type singer?]
In this tutorial, we will explain
• how the knowledge is organized
• how we can construct knowledge bases
• what knowledge bases exist already
• how we can query knowledge bases
Mr. Page, let's try this again. Is there another singer named Elvis?
Ontologies
[Diagram: example ontology graph. The instances Elvis and Tupelo carry the labels "Elvis" / "The King"; type edges connect Elvis to the class singer and Tupelo to the class city; subclassOf edges lead from singer to person to entity, and from city to location to entity; a bornIn edge connects Elvis to Tupelo; the class scientists also appears.]
Classes
Instances
Relations
Labels/words
The same label for two entities: homonymy
The same entity has two labels: synonymy
Classes
[Diagram: Elvis has type singer (and type scientist?); singer is a subclassOf person, person a subclassOf entity.]
Transitivity: type(x,y) /\ subclassOf(y,z) => type(x,z)
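This transitivity rule lends itself to a simple fixpoint computation. A minimal sketch in Python (the class hierarchy and type facts below are illustrative, not taken from an actual knowledge base):

```python
# Infer type(x, z) from type(x, y) and subclassOf(y, z) by fixpoint iteration.
subclass_of = {
    "singer": "person",
    "scientist": "person",
    "person": "entity",
}

type_facts = {("Elvis", "singer")}

# Repeatedly apply type(x,y) /\ subclassOf(y,z) => type(x,z) until no change.
changed = True
while changed:
    changed = False
    for x, y in list(type_facts):
        z = subclass_of.get(y)
        if z is not None and (x, z) not in type_facts:
            type_facts.add((x, z))
            changed = True

# Elvis is now also inferred to be a person and an entity.
```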
Relations
[Diagram: the bornIn relation connects Elvis (type singer, subclassOf person, subclassOf entity) to Tupelo (type city, subclassOf location); person is the domain of bornIn, city its range.]
Domain and range constraints: domain(r,c) /\ r(x,y) => type(x,c) range(r,c) /\ r(x,y) => type(y,c)
Looks like higher order, but is not. Consider introducing a predicate fact(r,x,y)
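With facts stored as fact(r, x, y) triples, the two constraints can be checked mechanically. A minimal sketch (the relations and type assertions below are illustrative):

```python
# Check domain/range constraints over facts stored as fact(r, x, y) triples.
domain = {"bornIn": "person"}
range_ = {"bornIn": "city"}

types = {("Elvis", "person"), ("Tupelo", "city")}
facts = [("bornIn", "Elvis", "Tupelo"),
         ("bornIn", "Tupelo", "Elvis")]  # the second fact violates both rules

def violates(fact):
    r, x, y = fact
    if r in domain and (x, domain[r]) not in types:
        return True  # domain(r,c) /\ r(x,y) => type(x,c) is violated
    if r in range_ and (y, range_[r]) not in types:
        return True  # range(r,c) /\ r(x,y) => type(y,c) is violated
    return False

consistent = [f for f in facts if not violates(f)]
# consistent: [("bornIn", "Elvis", "Tupelo")]
```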
Event Entities
[Diagram: an event entity connects Elvis via winner, Grammy Award via prize, and 1967 via year.]
An event entity is an artificial entity introduced to represent an n-ary relationship.

Winner          Prize          Year
Elvis Presley   Grammy Award   1967
...             ...            ...

Event entities allow representing arbitrary relational data as binary graphs (e.g., one event entity per table row, such as #42, #43, ...).
Reification
[Diagram: fact identifiers #42 and #43 stand for the facts bornIn(Elvis, Tupelo) and won(Elvis, Grammy Award); meta-facts attach a year (1967) and a source (Wikipedia) to these identifiers.]
Reification is the method of creating an entity that represents a fact.
There are different ways to reify a fact; this is the one used in this talk.
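A minimal sketch of this style of reification, using fact identifiers (the helper names are ours; the id #42 mirrors the slide's example):

```python
# Reify facts by assigning each base triple an identifier; meta-facts
# then use that identifier as their subject.
facts = {}     # fact id -> (subject, predicate, object)
meta = []      # (fact id, predicate, object)
counter = [42]

def add_fact(s, p, o):
    fid = "#{}".format(counter[0])
    counter[0] += 1
    facts[fid] = (s, p, o)
    return fid

won = add_fact("Elvis", "won", "Grammy Award")
meta.append((won, "year", 1967))
meta.append((won, "source", "Wikipedia"))
```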
RDF
[Diagram: the bornIn fact and the class hierarchy from the earlier example, now modeled as RDF resources.]
The Resource Description Framework (RDF) is a W3C standard that provides a standard vocabulary to model ontologies.
An RDF ontology can be seen as a directed labeled multi-graph where
• the nodes are entities
• the edges are labeled with relations
Edges (facts) are commonly written
• as triples <Elvis, bornIn, Tupelo>
• as literals bornIn(Elvis, Tupelo)
[W3C recommendation: RDF, 2004]
Outline
• Part I – What and Why ✔ – Available Knowledge Bases
• Part II – Extracting Knowledge
• Part III – Ranking and Searching
• Part IV – Conclusion and Outlook
Cyc
Douglas Lenat
What if we could make all common sense knowledge computer-processable?
Cyc project
• started in 1984
• driven by a staff of 20
• goal: formalize knowledge manually
[Lenat, Comm. ACM, 1995]
Cyc: Language
CycL is the formal language that Cyc uses to represent knowledge.
(Semantics based on first-order logic, syntax based on LISP)

(#$forall ?A (#$implies (#$isa ?A #$Animal) (#$thereExists ?M (#$mother ?A ?M))))
(#$arity #$GovernmentFn 1) (#$arg1Isa #$GovernmentFn #$GeopoliticalEntity) (#$resultIsa #$GovernmentFn #$RegionalGovernment)
(#$governs (#$GovernmentFn #$Canada) #$Canada)
http://cyc.com/cycdoc/ref/cycl-syntax.html + a logical reasoner
Cyc: Knowledge
Cyc project
#$Love
Strong affection for another agent arising out of kinship or personal ties. Love may be felt towards things, too: warm attachment, enthusiasm, or devotion. #$Love is a collection, as further explained under #$Happiness. Specialized forms of #$Love are #$Love-Romantic, platonic love, maternal love, infatuation, agape, etc.
guid: bd589433-9c29-11b1-9dad-c379636f7270
direct instance of: #$FeelingType
direct specialization of: #$Affection
direct generalization of: #$Love-Romantic
http://cyc.com/cycdoc/vocab/emotion-vocab.html#Love
Facts and axioms about: Transportation, Ecology, everyday living, chemistry, healthcare, animals, law, computer science...
“If a computer network implements IEEE 802.11 Wireless LAN Protocol and some computer is a node in that computer network, then that computer is vulnerable to decryption. “ http://cyc.com/cyc/technology/whatiscyc_dir/maptest
Cyc: Summary
            Cyc                              SUMO
License     proprietary, free for research   GNU GPL
Entities    500k                             20k
Assertions  5m                               70k
Relations   15k
Tools       Reasoner, NL understanding tool  Reasoner
URL         http://cyc.com                   http://ontologyportal.org
References  [Lenat, Comm. ACM 1995]          [Niles, FOIS 2001]
SUMO (the Suggested Upper Merged Ontology) is a research project in a similar spirit, driven by Adam Pease of Articulate Software
http://cyc.com/cyc/technology/whatiscyc_dir/whatsincyc http://ontologyportal.org
WordNet
George Miller
What if we could make the English language computer-processable?
• started in 1985
• Cognitive Science Laboratory, Princeton University
• written by lexicographers
• goal: support automatic text analysis and AI applications
[Miller, CACM 1995]
WordNet: Lexical Database
[Diagram: the word "camera" has two senses, photographic camera and television camera; synonymous words share a sense, polysemous words have several senses.]
WordNet
WordNet: Semantic Relations
[Diagram: examples of semantic relations. Hypernymy: a toaster is a kitchen appliance; meronymy: an optical lens is part of a camera; is-value-of: slow and fast are values of speed.]
WordNet: Semantic Relations

Relation                   Meaning              Examples
Synonymy (N, V, Adj, Adv)  same sense           (camera, photographic camera), (mountain climbing, mountaineering), (fast, speedy)
Antonymy (Adj, Adv)        opposite             (fast, slow), (buy, sell)
Hypernymy (N)              is-a                 (camera, photographic equipment), (mountain climbing, climb)
Meronymy (N)               part                 (camera, optical lens), (camera, view finder)
Troponymy (V)              manner               (buy, subscribe), (sell, retail)
Entailment (V)             X must mean doing Y  (buy, pay), (sell, give)
WordNet: Hierarchy
Hypernymy Is-A relations
WordNet: Size
Type                        Number
#words                      155k
#senses                     117k
#word-sense pairs           207k
%words that are polysemous  17%
License: proprietary, free for research
http://wordnet.princeton.edu/wordnet/man2.1/wnstats.7WN.html
Downloadable at http://wordnet.princeton.edu
Wikipedia
Jimmy Wales
If a small number of people can create a knowledge base, how about a LARGE number of people?
• started in 2001
• driven by Wikimedia Foundation, and a large number of volunteers
• goal: build world's largest encyclopedia
Wikipedia: Entities and Attributes
Entities
Attributes
Wikipedia: Synonymy and Polysemy
Redirection (synonyms)
Disambiguation (polysemy)
Wikipedia: Classes/Categories
Class hierarchy different from WordNet
Wikipedia: Others
Navigation/Topic box
Inter-lingual Links
Wikipedia: Numbers
Growth 2001 - 2008
English:
• 1B words, 2.8M articles, 152K contributors
All (250 languages):
• 1.74B words, 9.25M articles, 283K contributors
vs. Britannica:
• 25x as many words, ½ the average article length
License: Creative Commons Attribution-ShareAlike (CC-BY-SA)
Downloadable at http://download.wikimedia.org/
Automatically Constructed Knowledge Bases
• Manual approaches (Cyc, WordNet, Wikipedia)
  – produce high quality knowledge bases
  – labor-intensive and limited in scope
Can we construct the knowledge bases automatically?
YAGO… , etc.
YAGO
Can we exploit Wikipedia and WordNet to build an ontology?
• started as PhD thesis in 2007
• now major project at the Max Planck Institute for Informatics in Germany
• goal: extract ontology from Wikipedia with high accuracy and consistency
YAGO
[Suchanek et al., WWW 2007]
YAGO: Construction
• Exploit conceptual categories: the Wikipedia category "Rock singer" yields type(Elvis Presley, Rock Singer).
• Exploit infoboxes: the infobox entry "Born: 1935" yields born(Elvis Presley, 1935).
• Add WordNet: the class hierarchy comes from WordNet, e.g. subclassOf(Singer, Person).
[Slide shows the Wikipedia article: "Blah blah blub fasel (do not read this, better listen to the talk) blah blah Elvis blub (you are still reading this) blah Elvis blah blub later became astronaut blah", with Categories: Rock singer and an infobox Born: 1935.]
YAGO: Consistency Checks
• Check uniqueness of entities and functional arguments.
• Check domains and ranges of relations.
• Check type coherence.
[Diagram: candidate facts such as born(Elvis, 1935) and type edges into Rock Singer, Guitarist, Guitar, and Physics are validated against the class hierarchy (singer subclassOf person); incoherent candidates are rejected.]
YAGO: Relations
About People         About Locations           About Other Things
actedIn              establishedOnDate         happenedIn
bornIn / on date     established from / until
diedIn / on date     hasCapital                isCalled
created / on date    hasPopulation             foundIn
discovered           locatedIn                 produced
hasChild, hasSpouse  hasCurrency               hasProductionLanguage
family name          hasInflation              hasISBN
graduatedFrom        hasPolitician             hasPredecessor
...                  ...                       ...
ca. 100 relations with range and domain
YAGO: Numbers

                 YAGO    YAGO+GeoNames
Entities         2.6m    10m
  organizations  0.5m    0.5m
  people         0.8m    0.8m
  classes        0.5m    0.5m
Facts            30m     240m
Relations        86      92
Precision        95%     95%
License: Creative Commons Attribution-NonCommercial (CC-NC-BY)
License Creative Commons Attribution-NonCommercial (CC-NC-BY)
Downloadable at http://mpii.de/yago incl. converters for RDF, XML, databases
DBpedia
Can we harvest facts more exhaustively with community effort?
Can we harvest facts more exhaustively with community effort?
• community effort started in 2007• driven by Free U. Berlin, U. Leipzig, OpenLink• goal: "extract structured information from Wikipedia and to make this information available on the Web"
[Bizer et al., Journal of Web Semantics 2009]
DBpedia: Ontology
In YAGO, the taxonomy is based on WordNet classes.
DBpedia:
• places entities extracted from Wikipedia into its own ontology
• hand-crafted: 259 classes, 6 levels, 1200 properties
• emphasizes recall
• only half of extracted entities are currently placed in its own ontology
• alternative classifications: Wikipedia, YAGO, UMBEL (OpenCyc)
DBpedia: Mapping Rules
DBpedia mapping rules:
• map Wikipedia infoboxes and tables to its ontology
• map values to target datatypes (normalize units, ignore deviant values)
Community effort:
• hand-craft mapping rules
• expand ontology
< http://en.wikipedia.org/wiki/Elvis_Presley >
{{Infobox musical artist
|Name = Elvis Presley
|Background = solo_singer
|Birth_name = Elvis Aaron Presley}}
< http://dbpedia.org/page/Elvis_Presley >
foaf:name "Elvis Presley";
background "solo_singer";
foaf:givenName "Elvis Aaron Presley";
Note that the values do not change.
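A toy version of such a mapping rule; the regular expression and the attribute-to-property dictionary below are simplified stand-ins for DBpedia's community-crafted mappings:

```python
import re

# Sketch of a DBpedia-style mapping rule: parse infobox attributes and
# map them to ontology properties (simplified, illustrative mapping).
infobox = """{{Infobox musical artist
|Name = Elvis Presley
|Background = solo_singer
|Birth_name = Elvis Aaron Presley
}}"""

mapping = {
    "Name": "foaf:name",
    "Background": "background",
    "Birth_name": "foaf:givenName",
}

triples = []
for attr, value in re.findall(r"\|\s*(\w+)\s*=\s*([^\n|]+)", infobox):
    prop = mapping.get(attr)
    if prop:
        triples.append(("dbpedia:Elvis_Presley", prop, value.strip()))
```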
DBPedia: Numbers
Type           Number
Facts          English: 257m (YAGO: 240m); all languages: 1b
Entities       3.4m overall (YAGO: 10m); 1.5m in DBpedia ontology
People         312k
Locations      413k
Organizations  140k
License        Creative Commons Attribution-ShareAlike 3.0 (CC-BY-SA 3.0)
plus
• 5.5m links to external Web pages
• 1.5m links to images
• 5m links to other RDF data sets
Downloadable at http://dbpedia.org
Freebase
What if we could harvest both automatic extraction and user contribution?
• started in 2000
• driven by Metaweb, part of Google since Jul 2010
• goals:
  • "an open shared database of the world's knowledge"
  • "a massive, collaboratively-edited database of cross-linked data"
Freebase
Like DBpedia and YAGO, Freebase imports data from Wikipedia.
Differences:
• also imports from other sources (e.g., ChefMoz, NNDB, and MusicBrainz)
• includes individually contributed data
• users can collaboratively edit its data (without having to edit Wikipedia)
Freebase: User Contribution

Edit Entities:
• create new entities
• assign a new type/class to an entity
• add/change attributes
• connect to other entities
• upload/edit images

Review:
• flag vandalism
• flag entities to be merged/deleted
• vote on flagged content (3 unanimous votes, or an expert has to be tie-breaker)

Edit Schema:
• define new class, specifying the attributes of the class
• class definition can only be changed by creator/admin
• class not part of commons until peer-reviewed & promoted by staff/admin

Data Game:
• finding aliases in Wikipedia redirects
• extracting dates of events from Wikipedia articles
• using the Yahoo image search API to find candidates
Freebase: Community
Experts:
• tie breaker in reviews
• split entities
• "rewind" changes
New experts inducted by current experts.

Admins:
• create new classes and attributes
• respond to community suggestions
Promoted by staff or other admins.

Members:
• contribute (edit, review, vote)
Anyone can be a member.
Freebase: Numbers
Type        Number
Facts       41m
Entities    13m (YAGO: 10m)
People      2m
Locations   946k
Businesses  567k
Films       397k
License     Creative Commons Attribution (CC-BY)
Downloadable at http://download.freebase.com
Question Answering Systems
Objective is to answer user queries from an underlying knowledge base.
• data from Wikipedia and user edits
• natural language translation of queries
• 9m entities, 300m facts
• computes answers from an internal knowledge base of curated, structured data
• stores not just facts, but also algorithms and models
Application: Semantic Similarity
• Task: determine similarity between two words
  – topological distance of two words in the graph
  – taxonomic distance: hierarchical is-a relations
• Example application: correct real-word spelling errors
Tofu is made from soy jeans.[Hirst et al., Natural Language Engineering 2001]
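A toy illustration of taxonomic distance for this task: under the invented is-a graph below (not WordNet), "beans" is taxonomically much closer to "soy" than "jeans" is, which is the kind of signal used to flag the real-word error:

```python
from collections import deque

# Toy taxonomic distance over an invented is-a graph (not WordNet).
is_a = {
    "soy": ["legume"], "beans": ["legume"], "legume": ["food"],
    "jeans": ["trousers"], "trousers": ["clothing"],
    "food": ["entity"], "clothing": ["entity"],
}

# Build an undirected adjacency structure over the is-a edges.
graph = {}
for node, parents in is_a.items():
    for p in parents:
        graph.setdefault(node, set()).add(p)
        graph.setdefault(p, set()).add(node)

def distance(a, b):
    # Plain breadth-first search over the undirected graph.
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

# distance("soy", "beans") == 2, distance("soy", "jeans") == 6
```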
Application: Sentiment Orientation
• Task: determine an adjective's polarity (positive or negative)
  – same polarity connected by synonymic relations
  – opposite polarity by antonymic relations
• Example application: overall sentiment of customer reviews
[Diagram: GOOD linked by synonymy to right, suitable, proper, appropriate; BAD linked to spoiled, forged, risky, defective.]
[Hu et al., KDD 2004]
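The propagation idea can be sketched in a few lines; the synonym/antonym links below are a small illustrative fragment, not WordNet:

```python
# Propagate polarity through synonym links (same polarity) and antonym
# links (opposite polarity), starting from a seed word.
synonyms = [("good", "right"), ("right", "suitable"), ("suitable", "proper"),
            ("bad", "spoiled"), ("spoiled", "defective")]
antonyms = [("good", "bad")]

polarity = {"good": +1}  # seed
links = ([(a, b, +1) for a, b in synonyms] +
         [(a, b, -1) for a, b in antonyms])

changed = True
while changed:
    changed = False
    for a, b, sign in links:
        for x, y in ((a, b), (b, a)):  # links are symmetric
            if x in polarity and y not in polarity:
                polarity[y] = polarity[x] * sign
                changed = True

# polarity["proper"] == +1, polarity["defective"] == -1
```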
Application: Annotation of Web Data
[Limaye et al., VLDB 2010]
• Task: given a data source in the form of a Web table
  – annotate a column with an entity type
  – annotate a pair of columns with a relationship type
  – annotate a table cell with an entity ID
Application: Map Annotation
Idea: • Determine geographical entities in the vicinity (by GPS coordinates) • Show information about these entities (from DBpedia)
Possible Applications:• Map search on the Internet• Enhanced Reality applications
[Becker et al., Linking Open Data Workshop 2008]
Application: Faceted Search
DBpedia Browser
search is “full text search within results”
Constraints are listed for possible deletion
Suggestions based on current consideration set
Attributes and values based on frequency (?)
Summary
• Part I covers what knowledge bases are– Knowledge representation model (RDF)– Manual knowledge bases:
• WordNet: expert-driven, English words• Wikipedia: community-driven, entities/attributes
– Automatically extracted knowledge bases:• YAGO: Wikipedia + WordNet, automated, high precision• DBpedia: Wikipedia + community-crafted mapping rules, high recall• Freebase: Wikipedia + other databases + user edits
• Part II will cover how to extract information included in the knowledge bases
References for Part I
• C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, S. Hellmann: DBpedia – A Crystallization Point for the Web of Data. Journal of Web Semantics, Issue 7, Pages 154–165, 2009.
• C. Becker, C. Bizer: DBpedia Mobile: A Location-Enabled Linked Data Browser. Linking Open Data Workshop, 2008.
• G. Hirst and A. Budanitsky: Correcting real-word spelling errors by restoring lexical cohesion. Natural Language Engineering 11(1): 87–111, 2001.
• M. Hu and B. Liu: Mining and Summarizing Customer Reviews. KDD, 2004.
• J. Kamps, M. Marx, R. J. Mokken, and M. de Rijke: Using WordNet to Measure Semantic Orientations of Adjectives. LREC, 2004.
• D. Lenat: CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 1995.
• G. Limaye, S. Sarawagi, and S. Chakrabarti: Annotating and Searching Web Tables Using Entities, Types and Relationships. VLDB, 2010.
• G. A. Miller: WordNet: A Lexical Database for English. Communications of the ACM 38(11): 39–41, 1995.
• F. M. Suchanek, G. Kasneci, and G. Weikum: Yago – A Core of Semantic Knowledge. WWW, 2007.
• I. Niles and A. Pease: Towards a Standard Upper Ontology. FOIS-2001, Ogunquit, Maine, October 17–19, 2001.
• World Wide Web Consortium: RDF Primer. W3C Recommendation, 2004. http://www.w3.org/TR/rdf-primer/
Outline
• Part I – What and Why ✔ – Available Knowledge Bases ✔
• Part II – Extracting Knowledge
• Part III – Ranking and Searching
• Part IV – Other topics
Entities & Classes
...
Which entity types (classes, unary predicates) are there?
Which subsumptions should hold(subclass/superclass, hyponym/hypernym, inclusion dependencies)?
Which individual entities belong to which classes?
Which names denote which entities?
scientists, doctoral students, computer scientists, … female humans, male humans, married humans, …
subclassOf (computer scientists, scientists), subclassOf (scientists, humans), …
instanceOf (Surajit Chaudhuri, computer scientists), instanceOf (Barbara Liskov, computer scientists), instanceOf (Barbara Liskov, female humans), …
means ("Lady Di", Diana Spencer), means ("Diana Frances Mountbatten-Windsor", Diana Spencer), …
means ("Madonna", Madonna Louise Ciccone), means ("Madonna", Madonna (painting by Edvard Munch)), …
Binary Relations
Which instances (pairs of individual entities) are there for given binary relations with specific type signatures?
hasAdvisor (Jim Gray, Mike Harrison)
hasAdvisor (Hector Garcia-Molina, Gio Wiederhold)
hasAdvisor (Susan Davidson, Hector Garcia-Molina)
graduatedAt (Jim Gray, Berkeley)
graduatedAt (Hector Garcia-Molina, Stanford)
hasWonPrize (Jim Gray, Turing Award)
bornOn (John Lennon, 9-Oct-1940)
diedOn (John Lennon, 8-Dec-1980)
marriedTo (John Lennon, Yoko Ono)
Which additional & interesting relation types are there between given classes of entities?
competedWith(x,y), nominatedForPrize(x,y), …
divorcedFrom(x,y), affairWith(x,y), …
assassinated(x,y), rescued(x,y), admired(x,y), …
Higher-arity Relations & Reasoning
• Time, location & provenance annotations
• Knowledge representation – how do we model & store these?
• Consistency reasoning – how do we filter out inconsistent facts that the extractor produced?
Facts (RDF triples):
1: (JimGray, hasAdvisor, MikeHarrison)
2: (SurajitChaudhuri, hasAdvisor, JeffUllman)
3: (Madonna, marriedTo, GuyRitchie)
4: (NicolasSarkozy, marriedTo, CarlaBruni)
5: (ManchesterU, wonCup, ChampionsLeague)

Facts about facts:
6: (1, inYear, 1968)
7: (2, inYear, 2006)
8: (3, validFrom, 22-Dec-2000)
9: (3, validUntil, Nov-2008)
10: (4, validFrom, 2-Feb-2008)
11: (2, source, SigmodRecord)
12: (5, inYear, 1999)
13: (5, location, CampNou)
14: (5, source, Wikipedia)
Outline
• Part I – What and Why ✔ – Available Knowledge Bases ✔
• Part II – Extracting Knowledge
• Part III – Ranking and Searching
• Part IV – Conclusion and Outlook
Outline
• Part II – Extracting Knowledge
  • Pattern-based Extraction
  • Consistency Reasoning
  • Higher-arity Relations: Space & Time
Framework: Information Extraction (IE)
many sources
one source
Surajit obtained his PhD in CS from Stanford University under the supervision of Prof. Jeff Ullman. He later joined HP and worked closely with Umesh Dayal …
source-centric IE
instanceOf (Surajit, scientist)
inField (Surajit, computer science)
hasAdvisor (Surajit, Jeff Ullman)
almaMater (Surajit, Stanford U)
workedFor (Surajit, HP)
friendOf (Surajit, Umesh Dayal)
…
yield-centric harvesting
hasAdvisor: Student → Advisor
almaMater: Student → University

source-centric IE: 1) recall! 2) precision
yield-centric harvesting: 1) precision (near-human quality!) 2) recall

Student            Advisor
Surajit Chaudhuri  Jeffrey Ullman
Alon Halevy        Jeffrey Ullman
Jim Gray           Mike Harrison
…                  …

Student            University
Surajit Chaudhuri  Stanford U
Alon Halevy        Stanford U
Jim Gray           UC Berkeley
…                  …
Framework: Knowledge Representation
...
• RDF (Resource Description Framework, W3C):
  - subject-property-object (SPO) triples / binary relations
  - highly structured, but no (prescriptive) schema
  - first-order logical reasoning over binary predicates
• Frames, F-Logic, description logics: OWL/DL/lite
• Also: higher-order logics, epistemic logics
Facts (RDF triples):
1: (JimGray, hasAdvisor, MikeHarrison)
2: (SurajitChaudhuri, hasAdvisor, JeffUllman)
3: (Madonna, marriedTo, GuyRitchie)
4: (NicolasSarkozy, marriedTo, CarlaBruni)

Reification: facts about facts:
5: (1, inYear, 1968)
6: (2, inYear, 2006)
7: (3, validFrom, 22-Dec-2000)
8: (3, validUntil, Nov-2008)
9: (4, validFrom, 2-Feb-2008)
10: (2, source, SigmodRecord)
Temporal, spatial, & provenance annotations can refer to reified facts via fact identifiers (approx. equiv. to higher-arity RDF: Sub Prop Obj Time Location Source)
This tutorial!
Picking Low-Hanging Fruit (First)
Deterministic Pattern Matching
...
[Kushmerick 97; Califf & Mooney 99; Gottlob 01, …]
Wrapper Induction
[Gottlob et al: VLDB’01, PODS’04,…]
...
• Wrapper induction:
  • Hierarchical document structure, XHTML, XML
  • Pattern learning for restricted regular languages (ELog, combining concepts of XPath & FOL)
  • Visual interfaces
  • See e.g. http://www.lixto.com/, http://w4f.sourceforge.net/
...
Tapping on Web Tables [Cafarella et al: PVLDB'08; Sarawagi et al: PVLDB'09]
Problem: discover interesting relations
  wonAward: Person × Award
  nominatedForAward: Person × Award
  …
from many table headers and co-occurring cells
Relational Fact Extraction From Plain Text
• Hearst patterns [Hearst: COLING'92]
  – POS-enhanced regular expression matching in natural-language text
NP0 {,} such as {NP1, NP2, … (and|or)} {,} NPn
NP0 {,} {NP1, NP2, … NPn-1} {,} or other NPn
…
“The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string.”
isA(“Bambara ndang”, “bow lute”)
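A crude approximation of the first pattern in plain Python, without POS tagging (noun phrases are approximated as word sequences, so this is far less robust than the POS-enhanced original):

```python
import re

# Simplistic instance of the "NP0 such as NP1 ..." Hearst pattern.
text = ("The bow lute, such as the Bambara ndang, is plucked and has an "
        "individual curved neck for each string.")

pattern = re.compile(
    r"(?:the\s+)?([\w\s]+?),?\s+such as\s+(?:the\s+)?([\w\s]+?)[,.]",
    re.IGNORECASE)

facts = [("isA", instance.strip(), concept.strip())
         for concept, instance in pattern.findall(text)]
# facts: [("isA", "Bambara ndang", "bow lute")]
```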
• Noun classification from predicate-argument structures [Hindle: ACL'90]
  – Clustering of nouns by similar verbal phrases
  – Similarity based on co-occurrence frequencies (mutual information)
DIPRE
• DIPRE: "Dual Iterative Pattern Relation Extraction"
  – (Almost) unsupervised, iterative gathering of facts and patterns
  – Positive & negative examples as seeds for target relation, e.g. +(Hillary, Bill) +(Carla, Nicolas) –(Larry, Google)
  – Specificity threshold for new patterns based on occurrence frequency
[Brin: WebDB‘98]
(Hillary, Bill)
(Carla, Nicolas)
X and her husband Y
X and Y on their honeymoon
X and Y and their children
X has been dating with Y
X loves Y
(Angelina, Brad)
(Hillary, Bill)
(Victoria, David)
(Carla, Nicolas)
(Larry, Google)…
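One DIPRE-style iteration can be sketched as follows; the toy corpus and the simplistic "middle context" patterns are ours, and a real system would add specificity thresholds:

```python
# One bootstrapping iteration over a toy corpus: seed pairs yield
# patterns (here: the text between the two names); the patterns then
# harvest new pairs; negative seeds prune wrong extractions.
corpus = [
    "Hillary and her husband Bill appeared together.",
    "Carla and her husband Nicolas visited Berlin.",
    "Angelina and her husband Brad arrived.",
    "Larry and his company Google grew fast.",
]
seeds = {("Hillary", "Bill"), ("Carla", "Nicolas")}
negatives = {("Larry", "Google")}

def middle(sentence, x, y):
    # Context between the first occurrences of x and y, if x precedes y.
    i, j = sentence.find(x), sentence.find(y)
    return sentence[i + len(x):j] if 0 <= i < j else None

# Step 1: learn patterns from sentences that contain a seed pair.
patterns = set()
for x, y in seeds:
    for s in corpus:
        ctx = middle(s, x, y)
        if ctx is not None:
            patterns.add(ctx)

# Step 2: apply the patterns to harvest new candidate pairs.
harvested = set()
for s in corpus:
    for p in patterns:
        if p in s:
            left, right = s.split(p, 1)
            harvested.add((left.strip(), right.split()[0]))

harvested -= negatives  # prune with negative seeds
```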
• Snowball/QXtract [Agichtein, Gravano: DL'00, SIGMOD'01+'03]
  – Refined patterns and statistical measures
  – >80% recall at >85% precision over a large news corpus
  – QXtract demo additionally allowed user feedback in the iteration loop
DIPRE/Snowball/QXtract[Brin: WebDB’98; Agichtein,Gravano: SIGMOD’01+‘03]
Help from NLP: Dependency Parsing!
Software tools:
CMU Link Parser: http://www.link.cs.cmu.edu/link/
Stanford Lex Parser: http://nlp.stanford.edu/software/lex-parser.shtml
OpenNLP Tools: http://opennlp.sourceforge.net/
ANNIE Open-Source Information Extraction: http://www.aktors.org/technologies/annie/
LingPipe: http://alias-i.com/lingpipe/ (commercial license)
Carla has been seen dating with Ben.
• Analyze lexico-syntactic structure of sentences
  – Part-of-Speech (POS) tagging & dependency parsing
  – Prefer shorter dependency paths for fact candidates
NNP VBZ VBN VBN VBG IN NNP dating(Carla, Ben)
Open-Domain Gathering of Facts (Open IE)
...
[Etzioni,Cafarella et al:WWW’04, IJCAI‘07; Weld,Hoffman,Wu: SIGMOD-Rec‘08]
Analyze verbal phrases between entities for new relation types
Carla has been seen dating with Ben.
Rumors about Carla indicate there is something between her and Ben.
• unsupervised bootstrapping with short dependency paths
• self-supervised classifier for (noun, verb-phrase, noun) triples
• build statistics & prune sparse candidates
• group/cluster candidates for new relation types and their facts
… seen dating with …
… partying with …
{datesWith, partiesWith}, {affairWith, flirtsWith}, {romanticRelation}, …
(Carla, Ben), (Carla, Sofie), …
(Carla, Ben), (Paris, Heidi), …
But:
• results are often noisy
• clusters are not canonicalized relations
• far from near-human quality
Learning More Mappings
Kylin Ontology Generator (KOG): learn classifier for subclassOf across Wikipedia & WordNet using
• YAGO as training data• advanced ML methods (MLN‘s, SVM‘s)• rich features from various sources
> 3 million entities, > 1 million with infoboxes, > 500,000 categories
[Wu & Weld: CIKM’07, WWW‘08 ]
• Category/class name similarity measures
• Category instances and their infobox templates: template names, attribute names (e.g. knownFor)
• Wikipedia edit history: refinement of categories
• Hearst patterns: C such as X, X and Y and other C's, …
• Other search-engine statistics: co-occurrence frequencies
Entity Disambiguation
"Penn", "U Penn" → University of Pennsylvania
"Penn State", "PSU" → Pennsylvania State University
"PSU" → Passenger Service Unit
"Penn" → Sean Penn
"PSU" → Pennsylvania (US State)
Names ↔ Entities: a many-to-many mapping
• ill-defined with zero context
• known as record linkage for names in record fields
• Wikipedia offers rich candidate mappings: disambiguation pages, re-directs, inter-wiki links, anchor texts of href links
Individual Entity Disambiguation
"Penn" → University of Pennsylvania? Sean Penn? Penn State University?
• name similarity: edit distances, n-gram overlap, …
• context similarity: record level, words/phrases level
• context similarity: text around names, classes & facts around entities
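A sketch combining name similarity (character n-gram overlap) with context similarity (word overlap); the 0.3/0.7 weights and the candidate contexts are illustrative:

```python
# Score candidate entities for an ambiguous name by combining character
# n-gram similarity of the names with word overlap of the contexts.
def ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def score(name, mention_ctx, entity, entity_ctx):
    name_sim = jaccard(ngrams(name), ngrams(entity))
    ctx_sim = jaccard(set(mention_ctx.lower().split()),
                      set(entity_ctx.lower().split()))
    return 0.3 * name_sim + 0.7 * ctx_sim  # illustrative weights

mention, context = "Penn", "She studied at Penn and published on databases"
candidates = {
    "University of Pennsylvania": "university studied research published databases",
    "Sean Penn": "actor film Hollywood movie",
}
best = max(candidates, key=lambda e: score(mention, context, e, candidates[e]))
```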
Into the Wild
[Slide shows heterogeneous sources in the wild: XML fragments, treebank text, "Univ. Park", …]
Typical Approaches:
Challenge: efficiency & scalability
Collective Entity Disambiguation
• Consider a set of names {n1, n2, …} in the same context and sets of candidate entities E1 = {e11, e12, …}, E2 = {e21, e22, …}, …
• Define a joint objective function (e.g. likelihood for a probabilistic model) that rewards coherence of the mappings map(n1) = x1 ∈ E1, map(n2) = x2 ∈ E2, …
[Doan et al: AAAI‘05; Singla,Domingos: ICDM’07; Chakrabarti et al: KDD‘09, …]
• Solve optimization problem
Stuart Russell
Michael Jordan
Stuart Russell(computer scientist)
Stuart Russell (DJ)
Michael Jordan(computer scientist)
Michael Jordan (NBA)
Declarative Extraction Frameworks
• IBM's SystemT [Krishnamurthy et al: SIGMOD Rec.'08, ICDE'08]
  – Fully declarative extraction framework
  – SQL-style operators, cost models, full optimizer support
• DBLife/Cimple [DeRose, Doan et al: CIDR'07, VLDB'07]
  – Online community portal centered around the DB domain (regular crawls of DBLP, conferences, homepages, etc.)
• More commercial endeavors:
  – FreeBase.com, WolframAlpha.com, Sig.ma, TrueKnowledge.com, Google.com/squared
[Slide shows DBLife screenshots aggregating data from DBWorld, DBLP, Google Scholar, Google Images, and researcher homepages.]
Probabilistic Extraction Models
• Hidden Markov Models (HMMs) [Rabiner: Proc. IEEE'89; Sutton, McCallum: MIT Press'06]
  – Markov chain (directed graphical model) with "hidden" states Y, observations X, and transition probabilities
  – Factorizes the joint distribution P(Y,X)
  – Assumes independence among observations
• Conditional Random Fields (CRFs) [Lafferty, McCallum, Pereira: ML'01; Sarawagi, Cohen: NIPS'04]
  – Markov random field (undirected graphical model)
  – Models the conditional distribution P(Y|X) (less strict independence assumptions)
• Joint segmentation and disambiguation of input strings onto entities and classes: NER, POS tagging, etc.
• Trained, e.g., on bibliographic entries, no manual labeling required
"I went skiing with Fernando Pereira in British Columbia."
Pattern-Based Harvesting
Seed facts: (Hillary, Bill), (Carla, Nicolas)

Patterns:
X and her husband Y
X and Y on their honeymoon
X and Y and their children
X has been dating with Y
X loves Y
• good for recall
• noisy, drifting
• not robust enough for high precision

Facts & Fact Candidates:
(Angelina, Brad)
(Hillary, Bill)
(Victoria, David)
(Carla, Nicolas)
(Angelina, Brad)
(Yoko, John)
(Carla, Benjamin)
(Larry, Google)
(Kate, Pete)
(Victoria, David)
[Hearst 92; Brin 98; Agichtein 00; Etzioni 04; …]
Outline
• Part II – Extracting Knowledge
  • Pattern-based Extraction ✔
  • Consistency Reasoning
  • Higher-arity Relations: Space & Time
French Marriage Problem
...
isMarriedTo: person × person
isMarriedTo: frenchPolitician × person
French Marriage Problem
Facts in KB:
married (Hillary, Bill)
married (Carla, Nicolas)
married (Angelina, Brad)

New facts or fact candidates:
married (Cecilia, Nicolas)
married (Carla, Benjamin)
married (Carla, Mick)
married (Michelle, Barack)
married (Yoko, John)
married (Kate, Leonardo)
married (Carla, Sofie)
married (Larry, Google)
1) for recall: pattern-based harvesting2) for precision: consistency reasoning1) for recall: pattern-based harvesting2) for precision: consistency reasoning
Reasoning about Fact Candidates
Use consistency constraints to prune false candidates!
Ground atoms:
spouse(Hillary,Bill), spouse(Carla,Nicolas), spouse(Cecilia,Nicolas), spouse(Carla,Ben), spouse(Carla,Mick), spouse(Carla,Sofie)
f(Hillary), f(Carla), f(Cecilia), f(Sofie)
m(Bill), m(Nicolas), m(Ben), m(Mick)
First-order-logic rules (restricted):
spouse(x,y) ∧ diff(y,z) ⇒ ¬spouse(x,z)
spouse(x,y) ∧ diff(w,y) ⇒ ¬spouse(w,y)
spouse(x,y) ⇒ f(x)
spouse(x,y) ⇒ m(y)
spouse(x,y) ⇒ (f(x) ∧ m(y)) ∨ (m(x) ∧ f(y))
Rules reveal inconsistencies
→ find consistent subset(s) of atoms (“possible world(s)“, “the truth“)
Rules can be weighted (e.g. by fraction of ground atoms that satisfy a rule)
→ uncertain / probabilistic data
→ compute prob. distr. over (a subset of) ground atoms being “true“
Markov Logic Networks (MLN‘s) [Richardson/Domingos: ML 2006]
Map logical constraints & fact candidates into probabilistic graphical model: Markov Random Field (MRF)
FOL rules:
s(x,y) ⇒ m(y)
s(x,y) ⇒ f(x)
s(x,y) ∧ diff(y,z) ⇒ ¬s(x,z)
s(x,y) ∧ diff(w,y) ⇒ ¬s(w,y)
f(x) ⇒ ¬m(x)
m(x) ⇒ ¬f(x)
Base facts w/ entities:
s(Carla,Nicolas), s(Cecilia,Nicolas), s(Carla,Ben), s(Carla,Sofie), …
Grounding:
s(Ca,Nic) ⇒ ¬s(Ce,Nic)
s(Ca,Nic) ⇒ ¬s(Ca,Ben)
s(Ca,Nic) ⇒ ¬s(Ca,So)
s(Ca,Ben) ⇒ ¬s(Ca,So)
s(Ca,Nic) ⇒ m(Nic)
s(Ce,Nic) ⇒ m(Nic)
s(Ca,Ben) ⇒ m(Ben)
s(Ca,So) ⇒ m(So)
Grounding: Literal → Boolean Var; Reasoning: Literal → Binary RV
Markov Logic Networks (MLN‘s) [Richardson,Domingos: ML 2006]
Map logical constraints & fact candidates into probabilistic graphical model: Markov Random Field (MRF)
(Figure: MRF over the ground literals m(Ben), m(Nic), m(So), s(Ca,Nic), s(Ce,Nic), s(Ca,Ben), s(Ca,So))
• RVs coupled by MRF edge if they appear in same clause
• MRF assumption: P[Xi | X1..Xn] = P[Xi | MB(Xi)]
→ joint distribution has product form over all cliques
• Variety of algorithms for joint inference: Gibbs sampling, other MCMC, belief propagation, randomized MaxSat, …
Markov Logic Networks (MLN‘s) [Richardson,Domingos: ML 2006]
Map logical constraints & fact candidates into probabilistic graphical model: Markov Random Field (MRF)
(Figure: the same MRF, now with clause weights such as 0.1, 0.2, 0.5, 0.6, 0.7, 0.8)
Consistency reasoning: prune low-confidence facts!
StatSnowball [Zhu et al: WWW‘09], BioSnowball [Liu et al: KDD‘10]
EntityCube, MSR Asia: http://entitycube.research.microsoft.com/
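For a handful of ground atoms, MLN-style marginals can even be computed exactly by enumerating all possible worlds, where a world's unnormalized probability is exp of the total weight of its satisfied ground clauses. The atoms and clause weights below are an invented toy instance, and real MLN engines use MCMC or belief propagation instead of enumeration:

```python
# Exact marginal inference for a tiny Markov-Logic-style model by enumerating
# all 2^n possible worlds; atoms and clause weights are illustrative only.
from itertools import product
from math import exp

atoms = ["s_Ca_Nic", "s_Ce_Nic", "s_Ca_So"]

def clause_weight(world):
    """Sum of weights of satisfied weighted ground clauses."""
    w = 0.0
    # soft evidence: each candidate fact has a pattern-derived weight
    w += 2.0 * world["s_Ca_Nic"]   # strong evidence for spouse(Carla, Nicolas)
    w += 1.0 * world["s_Ce_Nic"]
    w += 0.5 * world["s_Ca_So"]
    # functional dependency as a heavily weighted soft constraint:
    # s(Ca,Nic) => not s(Ca,So), and s(Ca,Nic) => not s(Ce,Nic)
    w += 4.0 * (not (world["s_Ca_Nic"] and world["s_Ca_So"]))
    w += 4.0 * (not (world["s_Ca_Nic"] and world["s_Ce_Nic"]))
    return w

# unnormalized probability of each world ~ exp(sum of satisfied clause weights)
worlds = [dict(zip(atoms, vals)) for vals in product([True, False], repeat=3)]
Z = sum(exp(clause_weight(w)) for w in worlds)
marginal = {
    a: sum(exp(clause_weight(w)) for w in worlds if w[a]) / Z for a in atoms
}
print(marginal)
```

The marginals reflect both the per-fact evidence and the mutual-exclusion constraints: the strongly supported atom ends up most probable, and the competing spouse candidates are pushed down.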
Related Alternative Probabilistic Models
• Constrained Conditional Models [Roth et al. 2007]
– log-linear classifiers with constraint-violation penalty, mapped into Integer Linear Programs
• Factor Graphs with Imperative Variable Coordination [McCallum et al. 2008]
– RV‘s share “factors“ (joint feature functions); generalizes MRF, BN, CRF, …; inference via advanced MCMC; flexible coupling & constraining of RV‘s
Software tools:
alchemy.cs.washington.edu
code.google.com/p/factorie/
research.microsoft.com/en-us/um/cambridge/projects/infernet/
Reasoning for KB Growth: Direct Route
Facts in KB New fact candidates:
married (Hillary, Bill)married (Carla, Nicolas)married (Angelina, Brad)
married (Cecilia, Nicolas)married (Carla, Benjamin)married (Carla, Mick)married (Carla, Sofie)married (Larry, Google)
+
Patterns:
X and her husband YX and Y and their childrenX has been dating with YX loves Y
?
• KB facts are true; fact candidates & patterns are hypotheses
• grounded constraints → clauses with hypotheses as vars
• cast into Weighted Max-Sat with weights from pattern stats
• customized approximation algorithm
• unifies: fact/candidate consistency, pattern goodness, entity disambiguation
[Suchanek,Sozio,Weikum: WWW’09]
www.mpi-inf.mpg.de/yago-naga/sofie/
Direct approach:
SOFIE: Facts & Patterns Consistency [Suchanek,Sozio,Weikum: WWW’09]
www.mpi-inf.mpg.de/yago-naga/sofie/
Constraints to connect facts, fact candidates & patterns:
functional dependencies:
spouse(x,y): x → y, y → x
relation properties:
asymmetry, transitivity, acyclicity, …
type constraints, inclusion dependencies:
spouse ⊆ Person × Person, capitalOfCountry ⊆ cityOfCountry
domain-specific constraints:
bornInYear(x) + 10years ≤ graduatedInYear(x)
hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
pattern-fact duality:
occurs(p,x,y) ∧ expresses(p,R) ⇒ R(x,y)
occurs(p,x,y) ∧ R(x,y) ⇒ expresses(p,R)
name(-in-context)-to-entity mapping:
means(n,e1) ∨ means(n,e2) ∨ …
• Grounded into large propositional Boolean formula in CNF
• Max-Sat solver for joint inference (complete truth assignment to all candidate patterns & facts)
¬Spouse (Victoria, David) ∨ ¬Spouse (Rebecca, David)
¬Spouse (Victoria, David) ∨ ¬Spouse (Victoria, Tom)
…
occurs (husband, Victoria, David) ∧ expresses (husband, Spouse) ⇒ Spouse (Victoria, David)
occurs (dating, Rebecca, David) ∧ expresses (dating, Spouse) ⇒ Spouse (Rebecca, David)
…
occurs (husband, Victoria, David) ∧ Spouse (Victoria, David) ⇒ expresses (husband, Spouse)
…
∀x,y,z: R(x,y) ∧ R(x,z) ⇒ y=z
∀x,y,w: R(x,y) ∧ R(w,y) ⇒ x=w
…
∀x,y: R(x,y) ⇒ ¬R(y,x)
…
∀p,x,y: occurs (p, x, y) ∧ expresses (p, R) ⇒ R (x, y)
∀p,x,y: occurs (p, x, y) ∧ R (x, y) ⇒ expresses (p, R)
SOFIE Example
Facts in KB:
Spouse (HillaryClinton, BillClinton)
Spouse (CarlaBruni, NicolasSarkozy)
Fact hypotheses:
Spouse (Rebecca, David)
Spouse (Victoria, David)
Spouse (Victoria, Tom)
Pattern hypotheses (weight [1] each):
expresses (X and her husband Y, Spouse)
expresses (X Y and their children, Spouse)
expresses (X dating with Y, Spouse)
Weighted occurrences ([100], [40], [60], [20], [10]):
occurs (X and her husband Y, Hillary, Bill)
occurs (X Y and their children, Hillary, Bill)
occurs (X and her husband Y, Victoria, David)
occurs (X dating with Y, Rebecca, David)
occurs (X dating with Y, Victoria, Tom)
(further clause weights shown on the slide: [60], [20], [60])
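The joint decision over pattern and fact hypotheses can be illustrated with a brute-force weighted Max-Sat solver; the variables, clauses, and weights below are a simplified toy version of the example, not SOFIE's actual encoding or algorithm:

```python
# A brute-force weighted Max-Sat sketch of SOFIE-style joint inference:
# find the truth assignment maximizing the total weight of satisfied clauses.
# Hypotheses and weights are an invented toy instance.
from itertools import product

vars_ = ["spouse_Vic_Dav", "spouse_Reb_Dav", "expr_husband", "expr_dating"]

def satisfied_weight(a):
    total = 0.0
    # pattern occurrences with known facts (unit clauses for "expresses")
    total += 100 * a["expr_husband"]   # "husband" pattern seen with known spouses
    total += 10 * a["expr_dating"]     # "dating" pattern seen mostly off-target
    # pattern-fact duality: occurs & expresses => fact  (as implications)
    total += 60 * (not a["expr_husband"] or a["spouse_Vic_Dav"])
    total += 20 * (not a["expr_dating"] or a["spouse_Reb_Dav"])
    # functional dependency as a near-hard clause: David has at most one spouse
    total += 1000 * (not (a["spouse_Vic_Dav"] and a["spouse_Reb_Dav"]))
    return total

best = max(
    (dict(zip(vars_, vals)) for vals in product([True, False], repeat=4)),
    key=satisfied_weight,
)
print(best)
```

The optimum accepts the "husband" pattern and Spouse(Victoria, David) while rejecting the weak "dating" pattern together with Spouse(Rebecca, David), exactly the kind of joint pruning the slides describe.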
Soft Rules vs. Hard Constraints
Enforce FD‘s (mutual exclusion) as hard constraints:
hasAdvisor(x,y) ∧ diff(y,z) ⇒ ¬hasAdvisor(x,z)
Generalize to other forms of constraints:
Hard constraint:
hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
Soft constraint:
firstPaper(x,p) ∧ firstPaper(y,q) ∧ author(p,x) ∧ author(q,y) ∧ inYear(p) > inYear(q) + 5years ⇒ hasAdvisor(x,y) [0.6]
• Datalog-style grounding (deductive & potentially recursive): open issue for arbitrary constraints → rethink reasoning!
• Combined with weighted constraints: no longer regular MaxSat → constrained (weighted) MaxSat instead
Pattern Harvesting, Revisited [Suchanek et al: KDD’06; Nakashole et al: WebDB’10, WSDM’11]
narrow / nasty / noisy patterns:
X and his famous advisor Y
X carried out his doctoral research in math under the supervision of Y
X jointly developed the method with Y
→ using noisy patterns loses precision & slows down MaxSat; using narrow & dropping nasty patterns loses recall!
POS-lifted n-gram itemsets as patterns:
X { PRP ADJ advisor } Y
X { his doctoral research, under the supervision of } Y
X { PRP doctoral research, IN DET supervision of } Y
confidence weights, using seeds and counter-seeds:
seeds: (MosheVardi, CatrielBeeri), (JimGray, MikeHarrison)
counter-seeds: (MosheVardi, RonFagin), (AlonHalevy, LarryPage)
confidence of pattern p ~ #p with seeds / (#p with seeds + #p with counter-seeds)
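The seed/counter-seed scoring can be sketched as follows; the occurrence counts are invented, and the exact scoring formula in the cited papers may differ from this simple ratio:

```python
# Estimating pattern confidence from seeds and counter-seeds; the observed
# occurrences are invented for illustration.

seeds = {("MosheVardi", "CatrielBeeri"), ("JimGray", "MikeHarrison")}
counter_seeds = {("MosheVardi", "RonFagin"), ("AlonHalevy", "LarryPage")}

# observed pattern occurrences: pattern -> set of (X, Y) pairs it matched
occurrences = {
    "X and his famous advisor Y": {
        ("MosheVardi", "CatrielBeeri"), ("JimGray", "MikeHarrison"),
    },
    "X jointly developed the method with Y": {
        ("MosheVardi", "CatrielBeeri"), ("MosheVardi", "RonFagin"),
        ("AlonHalevy", "LarryPage"),
    },
}

def confidence(pattern):
    pos = len(occurrences[pattern] & seeds)
    neg = len(occurrences[pattern] & counter_seeds)
    return pos / (pos + neg) if pos + neg else 0.0

conf = {p: confidence(p) for p in occurrences}
print(conf)
```

Here the advisor-specific pattern scores 1.0, while the "jointly developed" pattern, which also fires for co-authors and counter-seeds, drops to 1/3.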
Outline
• Part II–Extracting Knowledge
• Pattern-based Extraction• Consistency Reasoning• Higher-arity Relations: Space & Time
Harvesting Knowledge from Web Data 97
✔
✔
Higher-arity Relations: Space & Time
• YAGO-2 Preview
Harvesting Knowledge from Web Data 98
www.mpi-inf.mpg.de/yago-naga/
estimated precision > 95% (for basic relations excl. space, time & provenance)
French Marriage Problem (Revisited)
Facts in KB:
1: married (Hillary, Bill)
2: married (Carla, Nicolas)
3: married (Angelina, Brad)
New fact candidates:
4: married (Cecilia, Nicolas)
5: married (Carla, Benjamin)
6: married (Carla, Mick)
7: divorced (Madonna, Guy)
8: domPartner (Angelina, Brad)
Temporal scopes:
validFrom (2, 2008)
validFrom (4, 1996), validUntil (4, 2007)
validFrom (5, 2010)
validFrom (6, 2006)
validFrom (7, 2008)
Challenge: Temporal Knowledge Harvesting
For all people in Wikipedia (100,000‘s), gather all spouses, incl. divorced & widowed, and the corresponding time periods! >95% accuracy, >95% coverage, in one night
Difficult Dating
(Even More Difficult) Implicit Dating
• explicit dates vs. implicit dates relative to other dates
• vague dates, relative dates
• narrative text, relative order
TARSQI: Extracting Time Annotations
Hong Kong is poised to hold the first election in more than half <TIMEX3 tid="t3" TYPE="DURATION" VAL="P100Y">a century</TIMEX3> that includes a democracy advocate seeking high office in territory controlled by the Chinese government in Beijing. A pro-democracy politician, Alan Leong, announced <TIMEX3 tid="t4" TYPE="DATE" VAL="20070131">Wednesday</TIMEX3> that he had obtained enough nominations to appear on the ballot to become the territory’s next chief executive. But he acknowledged that he had no chance of beating the Beijing-backed incumbent, Donald Tsang, who is seeking re-election. Under electoral rules imposed by Chinese officials, only 796 people on the election committee – the bulk of them with close ties to mainland China – will be allowed to vote in the <TIMEX3 tid="t5" TYPE="DATE" VAL="20070325">March 25</TIMEX3> election. It will be the first contested election for chief executive since Britain returned Hong Kong to China in <TIMEX3 tid="t6" TYPE="DATE" VAL="1997">1997</TIMEX3>. Mr. Tsang, an able administrator who took office during the early stages of a sharp economic upturn in <TIMEX3 tid="t7" TYPE="DATE" VAL="2005">2005</TIMEX3>, is popular with the general public. Polls consistently indicate that three-fifths of Hong Kong’s people approve of the job he has been doing. It is of course a foregone conclusion – Donald Tsang will be elected and will hold office for <TIMEX3 tid="t9" beginPoint="t0" endPoint="t8“ TYPE="DURATION" VAL="P5Y">another five years </TIMEX3>, said Mr. Leong, the former chairman of the Hong Kong Bar Association.
[Verhagen et al: ACL‘05]http://www.timeml.org/site/tarsqi/
→ extraction errors!
13 Relations between Time Intervals [Allen, 1984; Allen & Hayes 1989]
A Before B ↔ B After A
A Meets B ↔ B MetBy A
A Overlaps B ↔ B OverlappedBy A
A Starts B ↔ B StartedBy A
A During B ↔ B Contains A
A Finishes B ↔ B FinishedBy A
A Equal B
(Figure: interval diagrams illustrating each relation)
Possible Worlds in Time [Wang,Yahya,Theobald: MUD Workshop ‘10]
(Figure: timelines ‘00–‘07 for the state relations playsFor(Beckham, Real), with interval confidences 0.6 and 0.4 (total 1.0), and playsFor(Ronaldo, Real), with interval confidences 0.4, 0.2, 0.2, 0.1 (total 0.9); derived-interval probabilities 0.36, 0.16, 0.12, 0.08)
Base facts: independent
Derived facts: non-independent → need lineage!
playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1,T2)
⇒ teamMates(Beckham, Ronaldo)
• Closed and complete representation model (incl. lineage) → Stanford Trio project [Widom: CIDR’05, Benjelloun et al: VLDB’06]
• Interval computation remains linear in the number of bins
• Confidence computation per bin is #P-complete
• In general requires possible-worlds-based sampling techniques (Gibbs-style sampling, Luby-Karp, etc.)
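For small alternatives the possible-worlds semantics can be evaluated directly: enumerate the joint choices of interval for each independent base fact and sum the probabilities of worlds where the derived fact holds. The interval distributions below are simplified assumptions inspired by the slide, not its exact numbers:

```python
# Probability of a derived fact over uncertain time intervals, computed by
# enumerating possible worlds; interval distributions are illustrative only.
from itertools import product

# base facts: mutually exclusive alternative validity intervals (+ "not valid")
beckham = [((2003, 2005), 0.6), ((2005, 2007), 0.4)]               # playsFor(Beckham, Real)
ronaldo = [((2000, 2002), 0.1), ((2004, 2007), 0.8), (None, 0.1)]  # playsFor(Ronaldo, Real)

def overlaps(t1, t2):
    return t1 is not None and t2 is not None and t1[0] < t2[1] and t2[0] < t1[1]

# teamMates(Beckham, Ronaldo) holds in a world iff the two intervals overlap;
# base facts are independent, so world probabilities multiply
p_teammates = sum(
    p1 * p2
    for (t1, p1), (t2, p2) in product(beckham, ronaldo)
    if overlaps(t1, t2)
)
print(round(p_teammates, 3))  # prints 0.8
```

Enumeration is exponential in the number of base facts, which is why the slides point to lineage-aware representations and sampling for the general case.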
Open Problems and Challenges in IE (I)
• High precision & high recall at affordable cost
→ robust pattern analysis & reasoning
• Declarative, self-optimizing workflows
→ incorporate pattern & reasoning steps into IE queries/programs
• Scale, dynamics, life-cycle
→ grow & maintain KB with near-human-quality over long periods
→ parallel processing, lazy / lifted inference, …
• Types and constraints
→ explore & understand different families of constraints: soft rules & hard constraints, rich DL, beyond CWA
• Open-domain knowledge harvesting
→ turn names, phrases & table cells into entities & relations
Open Problems and Challenges in IE (II)
• Gathering implicit and relative time annotations
→ biographies & news, relative orderings; aggregate & reconcile observations
• Incomplete and uncertain temporal scopes
→ incorrect, incomplete, unknown begin/end; vague dating
• Consistency reasoning
→ extended MaxSat, extended Datalog, prob. graph. models, etc. for resolving inconsistencies on uncertain facts & uncertain time
• Temporal querying (revived)
→ query language (T-SPARQL?), no schema; confidence weights & ranking
Outline
• Part II–Extracting Knowledge
• Pattern-based Extraction• Consistency Reasoning• Higher-arity Relations: Space & Time
Harvesting Knowledge from Web Data 111
✔
✔
✔
Harvesting Knowledge from Web Data
References for Part II
• E. Agichtein, L. Gravano, J. Pavel, V. Sokolova, A. Voskoboynik. Snowball: a prototype system for extracting relations from large text collections. SIGMOD, 2001.
• James Allen. Towards a general theory of action and time. Artif. Intell., 23(2), 1984.
• M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, O. Etzioni. Open information extraction from the web. IJCAI, 2007.
• R. Baumgartner, S. Flesca, G. Gottlob. Visual web information extraction with Lixto. VLDB, 2001.
• S. Brin. Extracting patterns and relations from the World Wide Web. WebDB, 1998.
• M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, Y. Zhang. WebTables: exploring the power of tables on the web. PVLDB, 1(1), 2008.
• M. E. Califf, R. J. Mooney. Relational learning of pattern-match rules for information extraction. AAAI, 1999.
• P. DeRose, W. Shen, F. Chen, Y. Lee, D. Burdick, A. Doan, R. Ramakrishnan. DBLife: A community information management platform for the database research community. CIDR, 2007.
• A. Doan, L. Gravano, R. Ramakrishnan, S. Vaithyanathan (Eds.). Special issue on information extraction. SIGMOD Record, 37(4), 2008.
• O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, A. Yates. Web-scale information extraction in KnowItAll. WWW, 2004.
• G. Gottlob, C. Koch, R. Baumgartner, M. Herzog, S. Flesca. The Lixto data extraction project - back and forth between theory and practice. PODS, 2004.
• R. Gupta, S. Sarawagi. Answering Table Augmentation Queries from Unstructured Lists on the Web. PVLDB, 2(1), 2009.
• M. A. Hearst. Automatic acquisition of hyponyms from large text corpora. COLING, 1992.
• D. Hindle. Noun classification from predicate-argument structures. ACL, 1990.
• R. Krishnamurthy, Y. Li, S. Raghavan, F. Reiss, S. Vaithyanathan, H. Zhu. SystemT: a system for declarative information extraction. SIGMOD Record, 37(4), 2008.
• S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti. Collective Annotation of Wikipedia Entities in Web Text. KDD, 2009.
• N. Kushmerick. Wrapper induction: efficiency and expressiveness. Artif. Intell., 118(1-2), 2000.
• J. Lafferty, A. McCallum, F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ML, 2001.
• X. Liu, Z. Nie, N. Yu, J.-R. Wen. BioSnowball: automated population of Wikis. KDD, 2010.
• A. McCallum, K. Schultz, S. Singh. FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs. NIPS, 2009.
• N. Nakashole, M. Theobald, G. Weikum. Find your Advisor: Robust Knowledge Gathering from the Web. WebDB, 2010.
• L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 1989.
• M. Richardson, P. Domingos. Markov Logic Networks. ML, 2006.
• D. Roth, W. Yih. Global Inference for Entity and Relation Identification via a Linear Programming Formulation. MIT Press, 2007.
• S. Sarawagi. Information extraction. Foundations and Trends in Databases, 1(3), 2008.
• S. Sarawagi, W. W. Cohen. Semi-Markov conditional random fields for information extraction. NIPS, 2004.
• W. Shen, X. Li, A. Doan. Constraint-Based Entity Matching. AAAI, 2005.
• P. Singla, P. Domingos. Entity resolution with Markov Logic. ICDM, 2006.
• F. M. Suchanek, M. Sozio, G. Weikum. SOFIE: a self-organizing framework for information extraction. WWW, 2009.
• F. M. Suchanek, G. Ifrim, G. Weikum. Combining linguistic and statistical analysis to extract relations from web documents. KDD, 2006.
• C. Sutton, A. McCallum. An Introduction to Conditional Random Fields for Relational Learning. MIT Press, 2006.
• R. C. Wang, W. W. Cohen. Language-independent set expansion of named entities using the web. ICDM, 2007.
• Y. Wang, M. Yahya, M. Theobald. Time-aware Reasoning in Uncertain Knowledge Bases. VLDB/MUD, 2010.
• D. S. Weld, R. Hoffmann, F. Wu. Using Wikipedia to bootstrap open information extraction. SIGMOD Record, 37(4), 2008.
• F. Wu, D. S. Weld. Autonomously semantifying Wikipedia. CIKM, 2007.
• F. Wu, D. S. Weld. Automatically refining the Wikipedia infobox ontology. WWW, 2008.
• A. Yates, M. Banko, M. Broadhead, M. J. Cafarella, O. Etzioni, S. Soderland. TextRunner: Open information extraction on the web. HLT-NAACL, 2007.
• J. Zhu, Z. Nie, X. Liu, B. Zhang, J.-R. Wen. StatSnowball: a statistical approach to extracting entity relationships. WWW, 2009.
Harvesting Knowledge from Web Data
Outline• Part I
– What and Why– Available Knowledge Bases
• Part II– Extracting Knowledge
• Part III– Ranking and Searching
• Part IV– Conclusion and Outlook
✔
✔
✔
Harvesting Knowledge from Web Data
Outline for Part III
• Part III.1: Querying Knowledge Bases– A short overview of SPARQL– Extensions to SPARQL
• Part III.2: Searching and Ranking Entities• Part III.3: Searching and Ranking Facts
Harvesting Knowledge from Web Data
SPARQL
• Query language for RDF from the W3C• Main component:
– select-project-join combination of triple patterns → graph pattern queries on the knowledge base
Harvesting Knowledge from Web Data 116
SPARQL – Example
Example query: Find all actors from Ontario (that are in the knowledge base)
(Figure: sample knowledge graph with isA edges (Albert_Einstein isA physicist, scientist, vegetarian; Otto_Hahn isA chemist, scientist; Jim_Carrey and Mike_Myers isA actor), bornIn edges to Ulm, Frankfurt, Newmarket, Scarborough, and locatedIn edges to Germany, Ontario, Canada, Europe)
Harvesting Knowledge from Web Data 117
SPARQL – Example
Example query: Find all actors from Ontario (that are in the knowledge base)
(Figure: matching subgraph: Jim_Carrey and Mike_Myers isA actor, bornIn Newmarket / Scarborough, both locatedIn Ontario, locatedIn Canada)
Find subgraphs of this form (?person, ?loc are variables; actor, Ontario are constants):
SELECT ?person WHERE {
  ?person isA actor .
  ?person bornIn ?loc .
  ?loc locatedIn Ontario . }
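As a rough illustration of how such select-project-join evaluation proceeds, here is a tiny hand-rolled triple-pattern matcher over the example KB. It is a sketch of the semantics only, not a real SPARQL engine, and the join strategy (patterns joined left to right over growing variable bindings) is a simplifying assumption:

```python
# Evaluating a graph-pattern query with a minimal triple-pattern matcher;
# the KB is a fragment of the slide's example graph.

kb = [
    ("Jim_Carrey", "isA", "actor"),
    ("Mike_Myers", "isA", "actor"),
    ("Albert_Einstein", "isA", "physicist"),
    ("Jim_Carrey", "bornIn", "Newmarket"),
    ("Mike_Myers", "bornIn", "Scarborough"),
    ("Albert_Einstein", "bornIn", "Ulm"),
    ("Newmarket", "locatedIn", "Ontario"),
    ("Scarborough", "locatedIn", "Ontario"),
    ("Ulm", "locatedIn", "Germany"),
]

def match(pattern, triple, binding):
    """Extend a variable binding if the triple matches the pattern, else None."""
    b = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if b.get(p, t) != t:
                return None
            b[p] = t
        elif p != t:
            return None
    return b

def query(patterns):
    bindings = [{}]
    for pat in patterns:  # join triple patterns one after the other
        bindings = [b2 for b in bindings for t in kb
                    if (b2 := match(pat, t, b)) is not None]
    return bindings

# SELECT ?person WHERE { ?person isA actor . ?person bornIn ?loc .
#                        ?loc locatedIn Ontario . }
rows = query([("?person", "isA", "actor"),
              ("?person", "bornIn", "?loc"),
              ("?loc", "locatedIn", "Ontario")])
print(sorted(b["?person"] for b in rows))
```

Running the query binds ?person to Jim_Carrey and Mike_Myers, mirroring the subgraph match in the figure.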
SPARQL – More Features
• Eliminate duplicates in results:
SELECT DISTINCT ?c WHERE {?person isA actor. ?person bornIn ?loc. ?loc locatedIn ?c}
• Return results in some order, with optional LIMIT n clause:
SELECT ?person WHERE {?person isA actor. ?person bornIn ?loc. ?loc locatedIn Ontario} ORDER BY DESC(?person)
• Optional matches and filters on bounded vars:
SELECT ?person WHERE {?person isA actor. OPTIONAL{?person bornIn ?loc}. FILTER (!BOUND(?loc))}
• More operators: ASK, DESCRIBE, CONSTRUCT
Harvesting Knowledge from Web Data
Harvesting Knowledge from Web Data
SPARQL: Extensions from W3C
W3C SPARQL 1.1 draft:
• Aggregations (COUNT, AVG, …)
• Subqueries
• Negation: syntactic sugar for OPTIONAL {?x … } FILTER(!BOUND(?x))
Harvesting Knowledge from Web Data
SPARQL: Extensions from Research (1)
More complex graph patterns:• Transitive paths [Anyanwu et al., WWW07]
SELECT ?p, ?c WHERE {
  ?p isA scientist . ?p ??r ?c . ?c isA Country . ?c locatedIn Europe .
  PathFilter(cost(??r) < 5) .
  PathFilter(containsAny(??r, ?t)) . ?t isA City . }
• Regular expressions [Kasneci et al., ICDE08]SELECT ?p, ?c WHERE { ?p isA ?s. ?s isA scientist. ?p (bornIn | livesIn | citizenOf) locatedIn* Europe.}
Harvesting Knowledge from Web Data 122
SPARQL: Extensions from Research (2)
Queries over federated RDF sources:
• Determine distribution of triple patterns as part of query (for example in ARQ from Jena)
• Automatically route triple predicates to useful sources
– Potentially requires mapping of identifiers from different sources
Harvesting Knowledge from Web Data 123
Harvesting Knowledge from Web Data
RDF+SPARQL: Systems
• BigOWLIM
• OpenLink Virtuoso
• Jena with different backends
• Sesame
• OntoBroker
• SW-Store, Hexastore, RDF-3X (no reasoning)
System deployments with >10^11 triples (see http://esw.w3.org/LargeTripleStores)
Harvesting Knowledge from Web Data
Outline for Part III
• Part III.1: Querying Knowledge Bases• Part III.2: Searching and Ranking Entities
– Entity Importance: Graph Analysis– Entity Search: Language Models
• Part III.3: Searching and Ranking Facts
Harvesting Knowledge from Web Data
Why ranking is essential
• Queries often have a huge number of results:– scientists from Canada– conferences in Toronto– publications in databases– actors from the U.S.
• Ranking as integral part of search• Huge number of app-specific ranking methods:
paper/citation count, impact, salary, …• Need for generic ranking
Harvesting Knowledge from Web Data
Extending Entities with Keywords
Remember: entities occur in facts in documents
→ associate entities with terms in those documents
(Example associated terms: chancellor, Germany, scientist, election, Stuttgart21, Guido Westerwelle, France, Nicolas Sarkozy)
Harvesting Knowledge from Web Data
Digression 1: Graph Authority Measures
Idea: incoming links are endorsements & increase page authority; authority is higher if links come from high-authority pages
Random walk: uniformly random choice of links + random jumps
Authority (page q) = stationary prob. of visiting q

PR(q) = (1−ε) · Σ_{p: (p,q) ∈ E} PR(p) / outdeg(p) + ε / |V|
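The stationary distribution of this random walk can be computed by power iteration. A minimal sketch of the formula above, on an invented three-node link graph (dangling nodes and convergence checks are left out for brevity):

```python
# Power-iteration PageRank:
# PR(q) = (1 - eps) * sum over in-links p of PR(p)/outdeg(p) + eps/|V|.
# The toy link graph is invented for illustration.

def pagerank(links, eps=0.15, iters=100):
    nodes = sorted(links)
    pr = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for q in nodes:
            new[q] = eps / len(nodes) + (1 - eps) * sum(
                pr[p] / len(links[p]) for p in nodes if q in links[p]
            )
        pr = new
    return pr

links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
pr = pagerank(links)
print(pr)
```

Node c collects endorsements from both a and b and therefore ends up with the highest authority; the scores remain a probability distribution over the nodes.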
Harvesting Knowledge from Web Data
Graph-Based Entity Importance
Combine several paradigms:
• Keyword search on associated terms to determine candidate entities
• Pagerank or similar measure to determine important entities
• Ranking can combine entity rank with keyword-based score
Harvesting Knowledge from Web Data 130
Digression 2: Language Models (LMs)
State-of-the-art model in text retrieval
(Figure: query q matched against LM(θ1) of document d1 and LM(θ2) of document d2)
• each document di has LM: generative probability distribution of terms with parameter θi
• query q viewed as sample from LM(θ1), LM(θ2), …
• estimate likelihood P[ q | LM(θi) ] that q is a sample of the LM of document di (q is „generated by“ di)
• rank by descending likelihoods (best „explanation“ of q)
Harvesting Knowledge from Web Data 131
Language Models for Text: Example
(Figure: document d, a bag of terms A, A, A, A, B, B, C, C, C, D, E, E, E, E, E, is a sample of model M used for parameter estimation)
estimate likelihood of observing query: P[ A A B C E E | M ]
Harvesting Knowledge from Web Data 132
Language Models for Text: Smoothing
(Figure: document d used for parameter estimation of model M, plus a background corpus and/or smoothing: Laplace smoothing, Jelinek-Mercer, Dirichlet smoothing, …)
estimate likelihood of observing query: P[ A B C E F | M ]
Harvesting Knowledge from Web Data 133
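The smoothing idea can be made concrete with a small Jelinek-Mercer mixture of document and corpus statistics; the two toy documents reuse the slide's A..F vocabulary, and the mixing weight 0.8 is an arbitrary assumption:

```python
# Jelinek-Mercer smoothed query likelihood, a sketch of the mixture model:
# P[q|d] = prod_i ( lam * P[q_i|d] + (1-lam) * P[q_i|corpus] ).
from collections import Counter

docs = {
    "d1": "A A A A B B C C C D E E E E E".split(),
    "d2": "A B C D E F F F".split(),
}
corpus = [w for d in docs.values() for w in d]
corpus_tf = Counter(corpus)

def score(query, doc, lam=0.8):
    tf = Counter(docs[doc])
    s = 1.0
    for w in query.split():
        p_doc = tf[w] / len(docs[doc])
        p_bg = corpus_tf[w] / len(corpus)
        s *= lam * p_doc + (1 - lam) * p_bg
    return s

# "F" never occurs in d1: without smoothing d1's score would be zero
ranked = sorted(docs, key=lambda d: score("A B C E F", d), reverse=True)
print(ranked, score("A B C E F", "d1"))
```

Smoothing keeps d1 in the ranking with a small but nonzero score even though it never contains F, while d2, which explains the whole query, ranks first.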
Some LM Basics

simple MLE (independence assumption; overfits):
s(q,d) = P[q|d] = Π_i P[q_i|d] ~ Σ_i log( tf(q_i,d) / Σ_k tf(k,d) )

mixture model for smoothing, with P[q] estimated from query log or corpus:
s(q,d) = Π_i ( λ·P[q_i|d] + (1−λ)·P[q_i] )
~ Σ_i log( λ · tf(q_i,d) / Σ_k tf(k,d) + (1−λ) · df(q_i) / Σ_k df(k) )
~ Σ_i log( 1 + (λ / (1−λ)) · (tf(q_i,d) / Σ_k tf(k,d)) · (Σ_k df(k) / df(q_i)) )

KL divergence (Kullback-Leibler div., aka relative entropy); rank by ascending „improbability“:
KL(q|d) ~ Σ_i P[i|q] · log( P[i|q] / P[i|d] )
Harvesting Knowledge from Web Data 134
Entity Search with LM Ranking [Z. Nie et al.: WWW’07]
query: keywords → answer: entities
LM (entity e) = prob. distr. of words seen in context of e (weighted by confidence)

s(e,q) = Π_i ( λ·P[q_i|e] + (1−λ)·P[q_i] ) ~ KL( LM(q) | LM(e) )

query q: „French player who won world championship“
candidate entities:
e1: David Beckham (context: played for ManU, Real, LA Galaxy; David Beckham champions league; England lost match against France; married to spice girl …)
e2: Ruud van Nistelroy
e3: Ronaldinho
e4: Zinedine Zidane (context: Zizou champions league 2002; Real Madrid won final; Zinedine Zidane best player; France world cup 1998 …)
e5: FC Barcelona
Harvesting Knowledge from Web Data
Outline for Part III
• Part III.1: Querying Knowledge Bases• Part III.2: Searching and Ranking Entities• Part III.3: Searching and Ranking Facts
– General ranking issues– NAGA-style ranking– Language Models for facts
What makes a fact „good“?
Confidence: Prefer results that are likely correct
• accuracy of info extraction
• trust in sources (authenticity, authority)
e.g. bornIn (Jim Gray, San Francisco) from „Jim Gray was born in San Francisco“ (en.wikipedia.org)
vs. livesIn (Michael Jackson, Tibet) from „Fans believe Jacko hides in Tibet“ (www.michaeljacksonsightings.com)
Informativeness: Prefer results with salient facts. Statistical estimation from:
• frequency in answer
• frequency on Web
• frequency in query log
e.g. q: Einstein isa ? → Einstein isa scientist vs. Einstein isa vegetarian
q: ?x isa vegetarian → Einstein isa vegetarian vs. Whocares isa vegetarian
Conciseness: Prefer results that are tightly connected
• size of answer graph
• cost of Steiner tree
e.g. Einstein won NobelPrize, Bohr won NobelPrize, Einstein isa vegetarian, Cruise isa vegetarian, Cruise born 1962, Bohr died 1962
Diversity: Prefer variety of facts
e.g. E won …, E discovered …, E played … rather than E won …, E won …, E won …, E won …
Harvesting Knowledge from Web Data 137
How can we implement this?
Confidence: Prefer results that are likely correct
→ empirical accuracy of IE; PR/HITS-style estimate of trust; combine into: max { accuracy (f,s) * trust(s) | s ∈ witnesses(f) }
Informativeness: Prefer results with salient facts
→ PR/HITS-style entity/fact ranking [V. Hristidis et al., S. Chakrabarti, …]; IR models: tf*idf … [K. Chang et al., …] or Statistical Language Models
Conciseness: Prefer results that are tightly connected
→ graph algorithms (BANKS, STAR, …) [J.X. Yu et al., S. Chakrabarti et al., B. Kimelfeld et al., A. Markovetz et al., B.C. Ooi et al., G. Kasneci et al., …]
Diversity: Prefer variety of facts
→ Statistical Language Models
Harvesting Knowledge from Web Data 138
LMs: From Entities to Facts
Document / Entity LM‘s:
• LM for doc/entity: prob. distr. of words
• LM for query: (prob. distr. of) words
• LM‘s: rich for docs/entities, super-sparse for queries
→ richer query LM with query expansion, etc.
Triple LM‘s:
• LM for facts: (degenerate prob. distr. of) triple
• LM for queries: (degenerate prob. distr. of) triple pattern
• LM‘s: apples and oranges →
– expand query variables by S,P,O values from DB/KB
– enhance with witness statistics
– query LM then is prob. distr. of triples!
Harvesting Knowledge from Web Data 139
LMs for Triples and Triple Patterns
triples (facts f) with witness statistics (Σ: 2600):
f1: Beckham p ManchesterU 200
f2: Beckham p RealMadrid 300
f3: Beckham p LAGalaxy 20
f4: Beckham p ACMilan 30
f5: Kaka p ACMilan 300
f6: Kaka p RealMadrid 150
f7: Zidane p ASCannes 20
f8: Zidane p Juventus 200
f9: Zidane p RealMadrid 350
f10: Tidjani p ASCannes 10
f11: Messi p FCBarcelona 400
f12: Henry p Arsenal 200
f13: Henry p FCBarcelona 150
f14: Ribery p BayernMunich 100
f15: Drogba p Chelsea 150
f16: Casillas p RealMadrid 20
triple patterns (queries q) → LM(q) + smoothing:
q: Beckham p ?y
→ Beckham p ManU 200/550, Beckham p Real 300/550, Beckham p Galaxy 20/550, Beckham p Milan 30/550
q: ?x p ASCannes
→ Zidane p ASCannes 20/30, Tidjani p ASCannes 10/30
q: Cruyff ?r FCBarcelona
→ Cruyff playedFor FCBarca 200/500, Cruyff playedAgainst FCBarca 50/500, Cruyff coached FCBarca 250/500
q: ?x p ?y
→ Messi p FCBarcelona 400/2600, Zidane p RealMadrid 350/2600, Kaka p ACMilan 300/2600, …
LM(q): { t → P[t | t matches q] ~ #witnesses(t) }
LM(answer f): { t → P[t | t matches f] ~ 1 for f }
smooth all LM‘s; rank results by ascending KL(LM(q) | LM(f))
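The KL-based ranking for a single triple pattern can be sketched directly from the witness counts; the smoothing constant below is an arbitrary assumption, and real systems smooth more carefully:

```python
# Ranking answers to a triple pattern by ascending KL(LM(q) | LM(f)), where
# LM(q) is proportional to witness counts and LM(f) is a smoothed point mass
# on fact f. Counts follow the slide; the smoothing constant is an assumption.
from math import log

witnesses = {
    ("Beckham", "p", "ManU"): 200,
    ("Beckham", "p", "Real"): 300,
    ("Beckham", "p", "Galaxy"): 20,
    ("Beckham", "p", "Milan"): 30,
}
total = sum(witnesses.values())
lm_q = {t: c / total for t, c in witnesses.items()}  # LM of "Beckham p ?y"

def kl(p, q):
    return sum(p[t] * log(p[t] / q[t]) for t in p)

def lm_answer(fact, mu=0.1):
    # smoothed degenerate distribution: mass mostly on the answer fact itself
    return {t: (1 - mu) * (t == fact) + mu / len(witnesses) for t in witnesses}

ranked = sorted(witnesses, key=lambda f: kl(lm_q, lm_answer(f)))
print(ranked[0])
```

The ordering reproduces the intuition behind the formula: answers whose point mass sits where the query LM puts most probability (here Beckham p Real, with the most witnesses) have the smallest divergence.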
Harvesting Knowledge from Web Data 140
LMs for Composite Queries
q: Select ?x,?c Where { ?x bornIn France . ?x playsFor ?c . ?c in UK . }
facts with witness counts:
f1: Beckham p ManU 200; f7: Zidane p ASCannes 20; f8: Zidane p Juventus 200; f9: Zidane p RealMadrid 300; f10: Tidjani p ASCannes 10; f12: Henry p Arsenal 200; f13: Henry p FCBarca 150; f14: Ribery p Bayern 100; f15: Drogba p Chelsea 150
f21: Zidane bI F 200; f22: Tidjani bI F 20; f23: Henry bI F 200; f24: Ribery bI F 200; f25: Drogba bI F 30; f26: Drogba bI IC 100; f27: Zidane bI ALG 50
f31: ManU in UK 200; f32: Arsenal in UK 160; f33: Chelsea in UK 140
queries q with subqueries q1 … qn; results are n-tuples of triples t1 … tn
LM(q): P[q1 … qn] = Π_i P[qi]
LM(answer): P[t1 … tn] = Π_i P[ti]
KL(LM(q)|LM(answer)) = Σ_i KL(LM(qi)|LM(ti))

P[ Henry bI F, Henry p Arsenal, Arsenal in UK ] ~ (200/650) · (200/2600) · (160/500)
P[ Drogba bI F, Drogba p Chelsea, Chelsea in UK ] ~ (30/650) · (150/2600) · (140/500)
Harvesting Knowledge from Web Data
Extensions: Keywords
• Consider witnesses/sources (provenance meta-facts)• Allow text predicates with each triple pattern (à la XQ-FT)
Problem: not everything is triplified
European composers who have won the Oscar,whose music appeared in dramatic western scenes,and who also wrote classical pieces ?
Select ?p Where { ?p instanceOf Composer . ?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe . ?p hasWon ?a .?a Name AcademyAward . ?p contributedTo ?movie [western, gunfight, duel, sunset] . ?p composed ?music [classical, orchestra, cantata, opera] . }
Semantics: triples match struct. pred.witnesses match text pred.
Harvesting Knowledge from Web Data 142
Grouping ofkeywords or phrasesboosts expressiveness
French politicians married to Italian singers? Select ?p1, ?p2 Where { ?p1 instanceOf ?c1 [France, politics] . ?p2 instanceOf ?c2 [Italy, singer] . ?p1 marriedTo ?p2 . }
CS researchers whose advisors worked on the Manhattan project?Select ?r, ?a Where {?r instOf researcher [“computer science“] . ?a workedOn ?x [“Manhattan project“] .?r hasAdvisor ?a . }
Select ?r, ?a Where {?r ?p1 ?o1 [“computer science“] . ?a ?p2 ?o2 [“Manhattan project“] .?r ?p3 ?a . }
Harvesting Knowledge from Web Data 143
LMs for Keyword-Augmented Queriesq: Select ?x, ?c Where { France ml ?x [goalgetter, “top scorer“] . ?x p ?c . ?c in UK [champion, “cup winner“, double] . }
subqueries qi with keywords w1 … wm
results are still n-tuples of triples ti
LM(qi): P[triple ti | w1 … wm] = λ Σk P[ti | wk] + (1−λ) P[ti]
LM(answer fi) analogous
KL(LM(q) | LM(answer fi)) = Σi KL(LM(qi) | LM(fi))
result ranking prefers (n-tuples of) triples whose witnesses score high on the subquery keywords
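The keyword mixture can be written as a small helper. The smoothing weight λ and the summation over the per-keyword probabilities are our reading of the slide's formula, and the numbers are made up for illustration:

```python
def keyword_lm(p_t_given_wk, p_t, lam=0.5):
    """Mixture LM for a triple t under keywords w1..wm:
    lam * (average of P[t | wk] over the keywords) + (1 - lam) * P[t],
    i.e. Jelinek-Mercer-style smoothing against the keyword-free LM."""
    per_keyword = sum(p_t_given_wk) / len(p_t_given_wk)
    return lam * per_keyword + (1.0 - lam) * p_t

# triple "Henry p Arsenal" under the keywords [goalgetter, "top scorer"]:
score = keyword_lm([0.4, 0.6], p_t=0.2, lam=0.5)
print(score)  # 0.35
```

A triple with strong witness support for the keywords keeps a high score even when its prior P[t] is modest, which is exactly the ranking preference stated above.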
Extensions: Query Relaxation
f1: Beckham p ManU 200
f7: Zidane p ASCannes 20
f9: Zidane p Real 300
f10: Tidjani p ASCannes 10
f12: Henry p Arsenal 200
f15: Drogba p Chelsea 150

f31: ManU in UK 200
f32: Arsenal in UK 160
f33: Chelsea in UK 140

f21: Zidane bI F 200
f22: Tidjani bI F 20
f23: Henry bI F 200
f24: Ribery bI F 200
f26: Drogba bI IC 100
f27: Zidane bI ALG 50
[ Zidane bI F, Zidane p Real, Real in ESP ]
[ Drogba bI IC, Drogba p Chelsea, Chelsea in UK ]
[ Drogba resOf F, Drogba p Chelsea, Chelsea in UK ]
q(2): … Where { ?x bornIn ?y . ?x p ?c . ?c in UK . }
q(4): … Where { ?x bornIn IC . ?x p ?c . ?c in UK . }
LM(q*) = λ0 LM(q) + λ1 LM(q(1)) + λ2 LM(q(2)) + …
replace e in q by e(i) in q(i):
precompute P := LM(e ?p ?o) and Q := LM(e(i) ?p ?o)
set λi ~ 1/2 (KL(P|Q) + KL(Q|P))
replace r in q by r(i) in q(i): LM(?s r(i) ?o)
replace e in q by ?x in q(i): LM(?x r ?o)
…
LM's of e, r, … are probability distributions over triples!
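The weight computation for a relaxed query can be sketched as follows; the `kl` helper and the toy distributions are our own illustration, and we only compute the symmetrized divergence that the slide bases λi on:

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence between two distributions given as dicts over triples."""
    return sum(pv * math.log(pv / max(q.get(t, 0.0), eps))
               for t, pv in p.items() if pv > 0)

def sym_divergence(P, Q):
    """1/2 (KL(P|Q) + KL(Q|P)) between the LM of an entity e and the LM of
    its candidate replacement e(i); a small value means a close relaxation,
    which should translate into a larger mixture weight."""
    return 0.5 * (kl(P, Q) + kl(Q, P))

# Illustrative LMs of triples with subject France vs. subject Ivory Coast:
P = {"Zidane bI F": 0.6, "Henry bI F": 0.4}
Q = {"Drogba bI IC": 0.9, "Zidane bI ALG": 0.1}
weight_input = sym_divergence(P, Q)
```

Identical distributions give divergence 0, and the measure is symmetric, so it can be precomputed once per (e, e(i)) pair.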
Extensions: Diversification
q: Select ?p, ?c Where { ?p isa SoccerPlayer . ?p playedFor ?c . }
Top-10 without diversification:
1 Beckham, ManchesterU
2 Beckham, RealMadrid
3 Beckham, LAGalaxy
4 Beckham, ACMilan
5 Zidane, RealMadrid
6 Kaka, RealMadrid
7 Cristiano Ronaldo, RealMadrid
8 Raul, RealMadrid
9 van Nistelrooy, RealMadrid
10 Casillas, RealMadrid

Top-10 with diversification:
1 Beckham, ManchesterU
2 Beckham, RealMadrid
3 Zidane, RealMadrid
4 Kaka, ACMilan
5 Cristiano Ronaldo, ManchesterU
6 Messi, FCBarcelona
7 Henry, Arsenal
8 Ribery, BayernMunich
9 Drogba, Chelsea
10 Luis Figo, Sporting Lissabon
rank results f1 … fk by ascending
KL(LM(q) | LM(fi)) − (1−λ) KL(LM(fi) | LM({f1..fk}\{fi}))
implemented by greedy re-ranking of the fi's in a candidate pool
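A greedy re-ranking along these lines can be sketched in Python. The mixing weight, the averaging of the already-selected LMs into a single "rest" model, and all toy distributions are our own simplifications for illustration:

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence for distributions given as dicts over triples."""
    return sum(pv * math.log(pv / max(q.get(t, 0.0), eps))
               for t, pv in p.items() if pv > 0)

def avg_lm(lms):
    """Average several LMs (dicts) into one distribution."""
    out = {}
    for lm in lms:
        for t, v in lm.items():
            out[t] = out.get(t, 0.0) + v / len(lms)
    return out

def diversify(lm_q, candidates, lam=0.5):
    """Greedy re-ranking: repeatedly pick the candidate f minimizing
    KL(LM(q)|LM(f)) - (1 - lam) * KL(LM(f)|LM(already selected)),
    trading relevance against redundancy with the picks so far."""
    selected, pool = [], dict(candidates)
    while pool:
        def score(name):
            relevance = kl(lm_q, pool[name])
            if not selected:
                return relevance
            rest = avg_lm([candidates[s] for s in selected])
            return relevance - (1.0 - lam) * kl(pool[name], rest)
        best = min(pool, key=score)
        selected.append(best)
        del pool[best]
    return selected

lm_q = {"plays soccer": 1.0}
cands = {
    "Beckham, RealMadrid": {"plays soccer": 0.9, "RealMadrid": 0.1},
    "Zidane, RealMadrid":  {"plays soccer": 0.9, "RealMadrid": 0.1},
    "Messi, FCBarcelona":  {"plays soccer": 0.8, "FCBarcelona": 0.2},
}
order = diversify(lm_q, cands)
```

With these toy LMs the slightly less relevant but novel Messi result is promoted over the redundant second RealMadrid result, mirroring the contrast between the two top-10 lists above.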
Searching and Ranking – Summary
• Don't re-invent the wheel:
  LMs are an elegant and expressive means for ranking;
  consider both data & workload statistics
• Extensions should be conceptually simple:
  LMs can capture informativeness, personalization,
  relaxation, diversity – all in the same framework
• Unified ranking model for the complete query language:
  still work to do
References for Part III
• SPARQL Query Language for RDF, W3C Recommendation, 15 January 2008, http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/
• SPARQL New Features and Rationale, W3C Working Draft, 2 July 2009, http://www.w3.org/TR/2009/WD-sparql-features-20090702/
• Kemafor Anyanwu, Angela Maduko, Amit P. Sheth: SPARQ2L: towards support for subgraph extraction queries in RDF databases. WWW Conference, 2007
• Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE, 2002
• Soumen Chakrabarti: Dynamic personalized pagerank in entity-relation graphs. WWW Conference, 2007
• Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang: EntityRank: searching entities directly and holistically. VLDB, 2007
• Shady Elbassuoni, Maya Ramanath, Ralf Schenkel, Marcin Sydow, Gerhard Weikum: Language-model-based ranking for queries on RDF-graphs. CIKM, 2009
• Djoerd Hiemstra: Language Models. Encyclopedia of Database Systems, 2009
• Vagelis Hristidis, Heasoo Hwang, Yannis Papakonstantinou: Authority-based keyword search in databases. ACM Transactions on Database Systems 33(1), 2008
• Gjergji Kasneci, Maya Ramanath, Mauro Sozio, Fabian M. Suchanek, Gerhard Weikum: STAR: Steiner-Tree Approximation in Relationship Graphs. ICDE, 2009
• Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, Gerhard Weikum: NAGA: Searching and Ranking Knowledge. ICDE, 2008
• Mounia Lalmas: XML Retrieval. Morgan & Claypool Publishers, 2009
• Thomas Neumann, Gerhard Weikum: The RDF-3X engine for scalable management of RDF data. VLDB Journal 19(1), 2010
• Zaiqing Nie, Yunxiao Ma, Shuming Shi, Ji-Rong Wen, Wei-Ying Ma: Web object retrieval. WWW Conference, 2007
• Desislava Petkova, W. Bruce Croft: Hierarchical Language Models for Expert Finding in Enterprise Corpora. ICTAI, 2006
• Nicoleta Preda, Gjergji Kasneci, Fabian M. Suchanek, Thomas Neumann, Wenjun Yuan, Gerhard Weikum: Active knowledge: dynamically enriching RDF knowledge bases by web services. SIGMOD Conference, 2010
• Pavel Serdyukov, Djoerd Hiemstra: Modeling Documents as Mixtures of Persons for Expert Finding. ECIR, 2008
• ChengXiang Zhai: Statistical Language Models for Information Retrieval. Morgan & Claypool Publishers, 2008
Outline
• Part I– What and Why ✔– Available Knowledge Bases ✔
• Part II– Extracting Knowledge ✔
• Part III– Ranking and Searching ✔
• Part IV– Conclusion and Outlook
But back to the original question...
Will there ever be a famous singer called Elvis again?

?x hasGivenName "Elvis" .
?x type singer .
But back to the original question...
http://mpii.de/yago
?x = Elvis_Costello
?singer = wordnet_singer_110599806
?d = 1954-08-25
We found him! Can we find out more about this guy?
But back to the original question...
http://mpii.de/yago
Alright, and even more?
Linking Open Data: Goal
[Figure: two knowledge graphs about Elvis – "Costellopedia" (born 1954, plays guitar) and YAGO]
Can we combine knowledge from different sources?
Linking Open Data: URIs
1. Define a name space:
   http://dbpedia.org/resource
   http://costello.org
2. Define entity names in that name space:
   http://dbpedia.org/resource/ElvisCostello
   http://costello.org/Elvis

Every entity has a worldwide unique identifier (a Uniform Resource Identifier, URI). There is a W3C standard for that.
[W3C URI]
Linking Open Data: Cool URIs
1. Define a name space
2. Define entity names in that name space
3. Make them accessible online
[Diagram: a client dereferences http://costello.org/Elvis and the server returns RDF data (born 1954)]
There is a W3C description for that. [W3C CoolURI]
Linking Open Data: Links
1. Define a name space
2. Define entity names in that name space
3. Make them accessible online
4. Define equivalence links

This is an entity resolution problem. Use
• keys (e.g., the ISBN)
• similar identifiers
• similar labels (names)
• common properties
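A toy sketch of these entity-resolution heuristics; all field names, the threshold, and the example records are made up for illustration, and real linkers weight and combine the signals far more carefully:

```python
def same_entity(e1, e2):
    """Tiny entity-resolution heuristic: a shared key (here an ISBN-style
    field) decides immediately; otherwise fall back to matching labels
    plus at least one common property."""
    if e1.get("isbn") and e1.get("isbn") == e2.get("isbn"):
        return True  # a key such as the ISBN is decisive on its own
    same_label = e1.get("label", "").lower() == e2.get("label", "").lower()
    shared = set(e1.get("props", [])) & set(e2.get("props", []))
    return same_label and len(shared) >= 1

a = {"label": "Elvis Costello", "props": ["born:1954", "plays:guitar"]}
b = {"label": "elvis costello", "props": ["born:1954"]}
# same_entity(a, b) -> True: matching labels and a shared property
```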
Goal of theW3C group
[Bizer JSWIS 2009]
Linking Open Data: Status so far
Currently (2010):
• 200 ontologies
• 25 billion triples
• 400 million links
http://richard.cyganiak.de/2007/10/lod/imagemap.html
Querying Semantic Data
Sindice is an index for the Semantic Web developed at DERI in Galway, Ireland.
Sindice exploits
• RDF dumps available on the Web
• RDF information embedded into HTML pages
• RDF data available via cool URIs
• inter-ontology links
http://sindice.com
[Tummarello ISWC 2007]
Querying Semantic Data
... far from perfect... but far from useless...
Conclusion
• We have seen the knowledge representation model of ontologies, RDF. In a nutshell, RDF is a kind of distributed entity-relationship model.
• We have seen numerous existing knowledge bases, manually constructed (Cyc and WordNet) and automatically constructed (YAGO, DBpedia, Freebase, TrueKnowledge, etc.)
• We have seen techniques for creating such knowledge bases (pattern-based extraction and reasoning-based extraction, with uncertainty)
• We have seen techniques for querying and ranking the knowledge (with SPARQL and language-model-based ranking)
• We have seen that many knowledge bases already exist and that there is ongoing work to interlink them
• We have seen that there is indeed a promising singer called Elvis
The End
Feel free to contact us with further questions
The slides are available at http://www.mpi-inf.mpg.de/yago-naga/CIKM10-tutorial/
Hady Lauw
Institute for Infocomm Research, Singapore
http://hadylauw.com

Martin Theobald
Max-Planck Institute for Informatics, Saarbrücken
http://mpii.de/~mtb

Fabian M. Suchanek
INRIA Saclay, Paris
http://suchanek.name

Ralf Schenkel
Saarland University
http://people.mmci.uni-saarland.de/~schenkel/
References for Part IV
• [W3C URI] W3C: "Architecture of the World Wide Web, Volume One". Recommendation, 15 December 2004, http://www.w3.org/TR/webarch/
• [W3C CoolURI] W3C: "Cool URIs for the Semantic Web". Interest Group Note, 3 December 2008, http://www.w3.org/TR/cooluris/
• [Bizer JSWIS 2009] C. Bizer, T. Heath, T. Berners-Lee: "Linked Data – The Story So Far". International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009
• [Tummarello ISWC 2007] G. Tummarello, R. Delbru, E. Oren: "Sindice.com: Weaving the Open Linked Data". ISWC/ASWC 2007