From Information to Knowledge: Harvesting Entities and Relationships From Web Sources
Gerhard Weikum, Max Planck Institute for Informatics, http://www.mpi-inf.mpg.de/~weikum/
Martin Theobald, Max Planck Institute for Informatics, http://www.mpi-inf.mpg.de/~mtb/
Goal: Turn Web into Knowledge Base
comprehensive DB of human knowledge
• everything that Wikipedia knows
• everything machine-readable
• capturing entities, classes, relationships
Source: DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009
Approach: Harvesting Facts from Web

Politician | Political Party
Angela Merkel | CDU
Karl-Theodor zu Guttenberg | CDU
Christoph Hartmann | FDP
… | …

Company | CEO
Google | Eric Schmidt
… | …

Movie | ReportedRevenue
Avatar | $ 2,718,444,933
The Reader | $ 108,709,522
… | …

PoliticalParty | Spokesperson
CDU | Philipp Wachholz
Die Grünen | Claudia Roth
… | …

Actor | Award
Christoph Waltz | Oscar
Sandra Bullock | Oscar
Sandra Bullock | Golden Raspberry
… | …

Politician | Position
Angela Merkel | Chancellor Germany
Karl-Theodor zu Guttenberg | Minister of Defense Germany
Christoph Hartmann | Minister of Economy Saarland
… | …

Company | AcquiredCompany
Google | YouTube
Yahoo | Overture
Facebook | FriendFeed
Software AG | IDS Scheer
… | …
YAGO-NAGA, IWP, Cyc, TextRunner, ReadTheWeb
Knowledge as Enabling Technology
• entity recognition & disambiguation
• understanding natural language & speech
• knowledge services & reasoning for semantic apps (e.g. deep QA)
• semantic search: precise answers to advanced queries (by scientists, students, journalists, analysts, etc.)
Indy 500 winners who are still alive?
Politicians who are also scientists?
Enzymes that inhibit HIV? Influenza drugs for teens with high blood pressure?
...
US president when Barack Obama was born?
Relationship between Angela Merkel, Jim Gray, Dalai Lama?
Knowledge Search (1)
http://www.wolframalpha.com
Who was mayor of Indianapolis when Barack Obama was born?
not enough facts in KB!
Knowledge Search (2)
http://www.google.com/squared/
Indy 500 winners from Europe?
no types, no inference!
Related Work
communities: (Semantic Web), (Statistical Web), (Social Web)
topics: ontologies, information extraction
systems: Kylin, KOG, Cyc, Freebase, Cimple, DBlife, UIMA, DBpedia, Yago-Naga, StatSnowball, EntityCube, Avatar, System T, Powerset, START, Answers, SWSE, Hakia, TextRunner, TrueKnowledge, WolframAlpha, Text2Onto, sig.ma, kosmix, KnowItAll, ReadTheWeb, GoogleSquared, IWP, WebTables, WorldWideTables, PSOX, EntityRank, Cazoodle
Framework: Types of Knowledge
• facts / assertions:
  bornIn (JohnDillinger, Indianapolis), hasWon (JimGray, TuringAward), …
• taxonomic:
  instanceOf (JohnDillinger, bankRobbers), subclassOf (bankRobbers, criminals), …
• lexical / terminology:
  means ("Big Apple", NewYorkCity), means ("Big Mike", MichaelStonebraker), means ("MS", Microsoft), means ("MS", MultipleSclerosis), …
• common-sense properties:
  apples are green, red, juicy, sweet, sour … but not fast, smart …
  balls are round, smooth, slippery … but not square, funny …
• common-sense axioms:
  $\forall x:\ human(x) \Rightarrow male(x) \lor female(x)$
  $\forall x:\ (male(x) \Rightarrow \lnot female(x)) \land (female(x) \Rightarrow \lnot male(x))$
  $\forall x:\ animal(x) \Rightarrow (hasLegs(x) \Rightarrow isEven(numberOfLegs(x)))$
  …
• procedural: how to fix/install/prepare/remove …
• epistemic / beliefs:
  believes (Ptolemy, shape(Earth, disc)), believes (Copernicus, shape(Earth, sphere)), …
Framework: Information Extraction (IE)
source-centric IE (one source): 1) recall! 2) precision

Source text: Surajit obtained his PhD in CS from Stanford University under the supervision of Prof. Jeff Ullman. He later joined HP and worked closely with Umesh Dayal …

Extracted facts:
instanceOf (Surajit, scientist)
inField (Surajit, computer science)
hasAdvisor (Surajit, Jeff Ullman)
almaMater (Surajit, Stanford U)
workedFor (Surajit, HP)
friendOf (Surajit, Umesh Dayal)
…

yield-centric harvesting (many sources): 1) precision! 2) recall; near-human quality!

hasAdvisor:
Student | Advisor
Surajit Chaudhuri | Jeffrey Ullman
Alon Halevy | Jeffrey Ullman
Jim Gray | Mike Harrison
… | …

almaMater:
Student | University
Surajit Chaudhuri | Stanford U
Alon Halevy | Stanford U
Jim Gray | UC Berkeley
… | …
Framework: Knowledge Representation
• RDF (Resource Description Framework, W3C): subject-property-object (SPO) triples, binary relations; structure, but no (prescriptive) schema
• Relations, frames
• Description logics: OWL, DL-lite
• Higher-order logics, epistemic logics

temporal & provenance annotations can refer to reified facts via fact identifiers (approx. equiv. to RDF quadruples: "Color" Sub Prop Obj)
facts (RDF triples):
1: (JimGray, hasAdvisor, MikeHarrison)
2: (SurajitChaudhuri, hasAdvisor, JeffUllman)
3: (Madonna, marriedTo, GuyRitchie)
4: (NicolasSarkozy, marriedTo, CarlaBruni)
facts about facts:
5: (1, inYear, 1968)
6: (2, inYear, 2006)
7: (3, validFrom, 22-Dec-2000)
8: (3, validUntil, Nov-2008)
9: (4, validFrom, 2-Feb-2008)
10: (2, source, SigmodRecord)
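The reification scheme above (facts about facts, addressed by fact identifier) can be sketched in a few lines of Python. This is only an illustration: the `Fact` type and the `annotations` helper are hypothetical names, not part of YAGO or any RDF library, and dates are kept as plain strings.

```python
from typing import NamedTuple, Union

class Fact(NamedTuple):
    subject: Union[str, int]   # an entity name, or the id of another fact
    prop: str
    obj: Union[str, int]

# Facts keyed by identifier; ids 1-4 are base facts, 5-10 are facts about facts.
facts = {
    1: Fact("JimGray", "hasAdvisor", "MikeHarrison"),
    2: Fact("SurajitChaudhuri", "hasAdvisor", "JeffUllman"),
    3: Fact("Madonna", "marriedTo", "GuyRitchie"),
    4: Fact("NicolasSarkozy", "marriedTo", "CarlaBruni"),
    5: Fact(1, "inYear", 1968),
    6: Fact(2, "inYear", 2006),
    7: Fact(3, "validFrom", "22-Dec-2000"),
    8: Fact(3, "validUntil", "Nov-2008"),
    9: Fact(4, "validFrom", "2-Feb-2008"),
    10: Fact(2, "source", "SigmodRecord"),
}

def annotations(fact_id: int) -> dict:
    """Collect {property: value} for every fact whose subject is the given fact id."""
    return {f.prop: f.obj for f in facts.values()
            if isinstance(f.subject, int) and f.subject == fact_id}

print(annotations(3))
```

Calling `annotations(3)` gathers the temporal scope attached to the Madonna marriage fact; the same lookup serves provenance (`source`) annotations.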
http://www.mpi-inf.mpg.de/yago-naga/
KB's: Example YAGO (Suchanek et al.: WWW'07)
[Figure: excerpt of the YAGO graph. Classes: Entity, Person, Scientist, Physicist, Biologist, Politician, Organization, Location, City, State, Country. Entities: Max_Planck, Erwin_Planck, Angela Merkel, Max_Planck Society, Kiel, Schleswig-Holstein, Germany, Nobel Prize. Names: "Max Planck", "Max Karl Ernst Ludwig Planck", "Angela Merkel", "Angela Dorothea Merkel". Edge labels: subclass, instanceOf, means (weighted, e.g. 0.9 and 0.1 for "Max Planck"), bornOn, diedOn, bornIn, hasWon, FatherOf, citizenOf, locatedIn.]
Accuracy 95%
2 Mio. entities, 20 Mio. facts, 40 Mio. RDF triples (entity1-relation-entity2, subject-predicate-object)
KB‘s: Example DBpedia (Auer, Bizer, et al.: ISWC‘07)
• 3 million entities
• 1 billion facts (RDF triples)
• 1.5 million entities mapped to hand-crafted taxonomy of 259 classes with 1200 properties
http://www.dbpedia.org
Entities & Classes
Which entity types (classes, unary predicates) are there?
  scientists, doctoral students, computer scientists, …
  female humans, male humans, married humans, …

Which subsumptions should hold (subclass/superclass, hyponym/hypernym, inclusion dependencies)?
  subclassOf (computer scientists, scientists),
  subclassOf (scientists, humans), …

Which individual entities belong to which classes?
  instanceOf (Surajit Chaudhuri, computer scientists),
  instanceOf (Barbara Liskov, computer scientists),
  instanceOf (Barbara Liskov, female humans), …

Which names denote which entities?
  means ("Lady Di", Diana Spencer),
  means ("Diana Frances Mountbatten-Windsor", Diana Spencer), …
  means ("Madonna", Madonna Louise Ciccone),
  means ("Madonna", Madonna (painting by Edvard Munch)), …
WordNet Thesaurus [Miller/Fellbaum 1998]
http://wordnet.princeton.edu/
3 concepts / classes & their synonyms (synset‘s)
subclasses(hyponyms)
superclasses(hypernyms)
scientist, man of science (a person with advanced knowledge)
  => cosmographer, cosmographist
  => biologist, life scientist
  => chemist
  => cognitive scientist
  => computer scientist
  ...
  => principal investigator, PI
  …
  HAS INSTANCE => Bacon, Roger Bacon
  …
but: only few individual entities (instances of classes)
> 100 000 classes and lexical relations; can be cast into
• description logics or
• graph, with weights for relation strengths (derived from co-occurrence statistics)
Mapping: Wikipedia → WordNet [Suchanek: WWW'07, Ponzetto & Strube: AAAI'07]
[Figure: the Wikipedia entity Jim Gray (computer specialist) with its categories American Computer Scientists, Database Researcher, Fellows of the ACM, People Lost at Sea, and their parent categories Computer Scientists by Nation, Databases, ACM, Members of Learned Societies, Engineering Societies. Candidate WordNet classes: Computer Scientist, Scientist, American, Chemist, Artist, Sailor/Crewman, Missing Person, Data-base, Fellow (1)/Comrade, Fellow (2)/Colleague, Fellow (3) (of Society), Member (1)/Fellow, Member (2)/Extremity. Open instanceOf and subclassOf links are marked "?".]
How to find the right links?
• name similarity (edit dist., n-gram overlap)?
• context similarity (word/phrase level)?
• machine learning?
Analyzing category names → noun group parser:
American Musicians of Italian Descent: pre-modifier "American", head "Musicians", post-modifier "of Italian Descent"
American Folk Music of the 20th Century: pre-modifier "American Folk", head "Music", post-modifier "of the 20th Century"
American Indy 500 Drivers on Pole Positions: pre-modifier "American Indy 500", head "Drivers", post-modifier "on Pole Positions"
Head word is key, should be in plural for instanceOf
Given: entity e in Wikipedia categories c1, …, ck
Wanted: instanceOf(e,c) and subclassOf(ci,c) for WN class c
Problem: vagueness & ambiguity of names c1, …, ck
Mapping Wikipedia Entities to WordNet Classes
Heuristic Method:
for each ci do
  if head word w of category name ci is plural {
    1) match w against synsets of WordNet classes
    2) choose best fitting class c and set $e \in c$
    3) expand w by pre-modifier and set $c_i \subseteq w^+ \subseteq c$
  }
• can also derive features this way
• feed into supervised classifier
[Suchanek: WWW'07, Ponzetto & Strube: AAAI'07]
tuned conservatively: high precision, reduced recall
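A minimal Python sketch of the heuristic, under loud assumptions: `TOY_WORDNET` is a hand-made stand-in for a real WordNet lookup, the head word is taken to be the last word before the first preposition, and "plural" is approximated by membership in that toy table. None of this is the authors' implementation.

```python
PREPOSITIONS = {"of", "on", "in", "by", "from", "at"}

# Hypothetical stand-in for WordNet: plural head word -> class name.
TOY_WORDNET = {"musicians": "musician", "drivers": "driver", "scientists": "scientist"}

def parse_category(name):
    """Split a category name into (pre_modifier, head, post_modifier)."""
    words = name.split()
    # Position of the first preposition, or end of the name if there is none.
    cut = next((i for i, w in enumerate(words) if w.lower() in PREPOSITIONS), len(words))
    cut = max(cut, 1)  # guard: a name starting with a preposition keeps its first word as head
    return " ".join(words[:cut - 1]), words[cut - 1], " ".join(words[cut:])

def map_entity(entity, categories):
    """Derive instanceOf / subclassOf triples for one entity from its category names."""
    triples = []
    for cat in categories:
        pre, head, _post = parse_category(cat)
        cls = TOY_WORDNET.get(head.lower())
        if cls is None:        # head is not a known plural class name: skip (conservative)
            continue
        expanded = f"{pre} {head}".strip()       # w+ = pre-modifier + head word
        triples.append(("instanceOf", entity, cls))
        triples.append(("subclassOf", cat, expanded))
        triples.append(("subclassOf", expanded, cls))
    return triples

print(parse_category("American Musicians of Italian Descent"))
print(map_entity("Madonna", ["American Musicians of Italian Descent",
                             "American Folk Music of the 20th Century"]))
```

The second category is skipped because its head word "Music" is not a plural class name, which mirrors the conservative tuning (precision before recall).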
Learning More Mappings [ Wu & Weld: WWW‘08 ]
Kylin Ontology Generator (KOG): learn classifier for subclassOf across Wikipedia & WordNet using
• YAGO as training data
• advanced ML methods (MLNs, SVMs)
• rich features from various sources:
  • category/class name similarity measures
  • category instances and their infobox templates: template names, attribute names (e.g. knownFor)
  • Wikipedia edit history: refinement of categories
  • Hearst patterns: C such as X, X and Y and other C's, …
  • other search-engine statistics: co-occurrence frequencies
Wikipedia: > 3 million entities, > 1 million with infoboxes, > 500 000 categories
Goal: Comprehensive & Consistent !
[Figure: entities (Jim Gray (computer specialist), Madonna (entertainer), Jeffrey Ullman, Bob Dylan, …) connected to a tangle of Wikipedia categories (American Computer Scientists, Database Researcher, Fellows of the ACM, Databases, Members of Learned Societies, Bell Labs, Princeton Alumni, Knuth Prize Laureate, American People by Occupation, World Record Holders, American Songwriters, Hall of Fame Inductees, U Michigan Alumni, Guitar Players, Americans of Italian Descent, People by Status, Computer Data, Telecomm. History), infobox attributes (Born, KnownFor, AlmaMater, NotableAwards, DoctoralStudents, Genres, YearsActive, AlsoKnownAs, Website), and classes (Artist, Singer, Musician, Italian, American, Award Winner, Scientist, Academic, Athlete, Fellow(1), Fellow(2)).]
Clean up the mess:
• graph algorithms?
  • random walk with restart
  • dense subgraphs …
• statistical machine learning?
• logical consistency reasoning?
• gigantic schema integration?
  • ontology merging
Long Tail of Class Instances [Etzioni et al. 2004, Cohen et al. 2008, Mitchell et al. 2010]
State-of-the-Art Approach (e.g. SEAL):
• Start with seeds: a few class instances
• Find lists, tables, text snippets ("for example: …"), … that contain one or more seeds
• Extract candidates: noun phrases from vicinity
• Gather co-occurrence stats (seed&cand, cand&className pairs)
• Rank candidates
  • point-wise mutual information, …
  • random walk (PR-style) on seed-cand graph
But:
• Precision drops for classes with sparse statistics (DB profs, …)
• Harvested items are names, not entities
• Canonicalization (de-duplication) unsolved
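The candidate-ranking step with point-wise mutual information can be sketched as follows. This is a toy illustration, not SEAL itself: `rank_by_pmi` is a hypothetical helper, each "context" is simply a set of names found together in one list or table, and a candidate is scored by its best PMI with any seed.

```python
import math
from collections import Counter
from itertools import combinations

def rank_by_pmi(contexts, seeds):
    """Rank non-seed names by their highest PMI with any seed.

    contexts: list of sets of names that co-occur (one set per list/table/snippet).
    PMI(s, c) = log( P(s, c) / (P(s) * P(c)) ), with probabilities estimated
    as relative frequencies over the contexts.
    """
    n = len(contexts)
    item_count = Counter(x for ctx in contexts for x in ctx)
    pair_count = Counter(frozenset(p) for ctx in contexts
                         for p in combinations(sorted(ctx), 2))
    scores = {}
    for cand in item_count:
        if cand in seeds:
            continue
        best = -math.inf
        for s in seeds:
            joint = pair_count[frozenset((s, cand))]
            if joint:
                best = max(best, math.log(joint * n / (item_count[s] * item_count[cand])))
        if best > -math.inf:       # keep only candidates that co-occur with a seed
            scores[cand] = best
    return sorted(scores, key=lambda c: (-scores[c], c))

contexts = [
    {"Jim Gray", "Jeff Ullman", "Alon Halevy"},
    {"Jim Gray", "Jeff Ullman"},
    {"Jeff Ullman", "Madonna", "Bob Dylan"},
    {"Madonna", "Bob Dylan"},
]
print(rank_by_pmi(contexts, seeds={"Jim Gray"}))
```

Note that PMI favors rare names (here the candidate seen only once ranks first), which is one way the sparse-statistics problem for long-tail classes shows up.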
Individual Entity Disambiguation
Names → Entities ?
Names: "Penn", "U Penn", "Penn State", "PSU"
Entities: University of Pennsylvania, Pennsylvania State University, Pennsylvania (US State), Sean Penn, Passenger Service Unit
• ill-defined with zero context
• known as record linkage for names in record fields
• Wikipedia offers rich candidate mappings: disambiguation pages, re-directs, inter-wiki links, anchor texts of href links
Collective Entity Disambiguation
• Consider a set of names {n1, n2, …} in same context and sets of candidate entities E1 = {e11, e12, …}, E2 = {e21, e22, …}, …
• Define joint objective function (e.g. likelihood for prob. model) that rewards coherence of mappings $n_i \mapsto e_{ij}$
• Solve optimization problem
[McCallum 2003, Doan 2005, Getoor 2006, Domingos 2007, Chakrabarti 2009, …]
Example: names "Stuart Russell" and "Michael Jordan" with candidates Stuart Russell (computer scientist), Stuart Russell (DJ), Michael Jordan (computer scientist), Michael Jordan (NBA)
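As a rough sketch of the joint objective, the following Python enumerates all joint assignments of names to candidate entities and keeps the one with the highest total pairwise coherence. The `TOPICS` tags and the overlap-based `coherence` function are invented for illustration; real systems use learned or link-based coherence measures and do not enumerate exhaustively.

```python
from itertools import product

def disambiguate(candidates, coherence):
    """candidates: {name: [entity, ...]}; coherence(e1, e2) -> number.
    Returns the joint assignment maximizing the sum of pairwise coherence."""
    names = list(candidates)
    best, best_score = None, float("-inf")
    for combo in product(*(candidates[n] for n in names)):
        score = sum(coherence(a, b)
                    for i, a in enumerate(combo) for b in combo[i + 1:])
        if score > best_score:
            best, best_score = dict(zip(names, combo)), score
    return best

# Hypothetical topic tags per entity, used only to define a toy coherence measure.
TOPICS = {
    "Stuart Russell (computer scientist)": {"AI", "Berkeley"},
    "Stuart Russell (DJ)": {"music"},
    "Michael Jordan (computer scientist)": {"AI", "Berkeley", "statistics"},
    "Michael Jordan (NBA)": {"basketball"},
}

def coherence(e1, e2):
    return len(TOPICS[e1] & TOPICS[e2])

candidates = {
    "Stuart Russell": ["Stuart Russell (computer scientist)", "Stuart Russell (DJ)"],
    "Michael Jordan": ["Michael Jordan (computer scientist)", "Michael Jordan (NBA)"],
}
print(disambiguate(candidates, coherence))
```

Neither name can be resolved on its own, but the two computer-scientist readings reinforce each other, which is the point of collective disambiguation.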
Problems and Challenges
Wikipedia categories reloaded: comprehensive & consistent instanceOf and subClassOf across Wikipedia and WordNet (via consistency reasoning?)
Tags, tables, topics: tap on other sources: Web 2.0, Web tables, directories, etc.
Robust disambiguation: near-real-time mapping of names to entities with near-human quality
Long tail of entities: discover new entities, detect new names for known entities; beyond Wikipedia: domain-specific entity catalogs
Relationships
Which instances (pairs of individual entities) are there for given binary relations with specific type signatures?
  hasAdvisor (JimGray, MikeHarrison)
  hasAdvisor (HectorGarcia-Molina, Gio Wiederhold)
  hasAdvisor (Susan Davidson, Hector Garcia-Molina)
  graduatedAt (JimGray, Berkeley)
  graduatedAt (HectorGarcia-Molina, Stanford)
  hasWonPrize (JimGray, TuringAward)
  bornOn (JohnLennon, 9Oct1940)
  diedOn (JohnLennon, 8Dec1980)
  marriedTo (JohnLennon, YokoOno)
Which additional & interesting relation types are there between given classes of entities?
  competedWith(x,y), nominatedForPrize(x,y), …
  divorcedFrom(x,y), affairWith(x,y), …
  assassinated(x,y), rescued(x,y), admired(x,y), …
Deterministic Pattern Matching
[Kushmerick 97, Califf & Mooney 99, Gottlob 01, …]
• Regular expressions matching
• Wrapper induction (grammar learning for restricted regular languages)
• Well understood
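A small, hand-written regular expression illustrates the deterministic style of extraction. The pattern below is an assumption made for this example (it is not induced by a wrapper learner) and only covers one phrasing of the hasAdvisor relation.

```python
import re

NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"   # one or more capitalized words

# Hand-written pattern for a single phrasing of hasAdvisor.
ADVISOR_PATTERN = re.compile(
    rf"(?P<student>{NAME}) obtained (?:his|her) PhD\b.*? "
    rf"under the supervision of (?:Prof\. )?(?P<advisor>{NAME})"
)

def extract_advisors(text):
    """Return hasAdvisor triples for every match of the pattern in the text."""
    return [("hasAdvisor", m.group("student"), m.group("advisor"))
            for m in ADVISOR_PATTERN.finditer(text)]

text = ("Surajit obtained his PhD in CS from Stanford University "
        "under the supervision of Prof. Jeff Ullman. "
        "He later joined HP and worked closely with Umesh Dayal.")
print(extract_advisors(text))
```

Such patterns are precise on the phrasing they were written for and silent on everything else, which is why the statistical and reasoning-based methods on the following slides are needed for recall.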
French Marriage Problem
facts in KB:
  married (Hillary, Bill)
  married (Carla, Nicolas)
  married (Angelina, Brad)
new facts or fact candidates:
  married (Cecilia, Nicolas)
  married (Carla, Benjamin)
  married (Carla, Mick)
  married (Michelle, Barack)
  married (Yoko, John)
  married (Kate, Leonardo)
  married (Carla, Sofie)
  married (Larry, Google)
1) for recall: pattern-based harvesting
2) for precision: consistency reasoning
Pattern-Based Harvesting
Facts & Fact Candidates ↔ Patterns
Facts: (Hillary, Bill), (Carla, Nicolas)
Patterns:
  X and her husband Y
  X and Y on their honeymoon
  X and Y and their children
  X has been dating with Y
  X loves Y
  …
Fact candidates: (Angelina, Brad), (Victoria, David), (Yoko, John), (Carla, Benjamin), (Larry, Google), (Kate, Pete)
• good for recall
• noisy, drifting
• not robust enough for high precision
(Hearst 92, Brin 98, Agichtein 00, Etzioni 04, …)
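One round of the fact-pattern bootstrapping loop can be sketched as below. This is a deliberately naive version in which a pattern is just the text between the two entity mentions; the sentences are invented for illustration and the helper names are hypothetical.

```python
import re

def learn_patterns(sentences, seeds):
    """A pattern is the text found between the two entities of a seed pair."""
    patterns = set()
    for s in sentences:
        for x, y in seeds:
            m = re.search(re.escape(x) + r"(.+?)" + re.escape(y), s)
            if m:
                patterns.add(m.group(1))
    return patterns

def apply_patterns(sentences, patterns):
    """Collect every (X, Y) word pair that is connected by a known pattern."""
    candidates = set()
    for s in sentences:
        for infix in patterns:
            for m in re.finditer(r"(\w+)" + re.escape(infix) + r"(\w+)", s):
                candidates.add((m.group(1), m.group(2)))
    return candidates

seeds = [("Hillary", "Bill"), ("Carla", "Nicolas")]
sentences = [
    "Hillary and her husband Bill attended the gala.",
    "Carla and her husband Nicolas visited Berlin.",
    "Victoria and her husband David moved abroad.",
    "Carla loves Nicolas.",
    "Larry loves Google.",
]
patterns = learn_patterns(sentences, seeds)
new_candidates = apply_patterns(sentences, patterns) - set(seeds)
print(sorted(patterns))
print(sorted(new_candidates))
```

The loose pattern " loves " is learned from a correct seed and then produces the false candidate (Larry, Google), a small-scale instance of the noise and drift listed above.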
Reasoning about Fact Candidates
Use consistency constraints to prune false candidates

FOL rules (restricted):
  $spouse(x,y) \land diff(y,z) \Rightarrow \lnot spouse(x,z)$
  $spouse(x,y) \land diff(w,y) \Rightarrow \lnot spouse(w,y)$
  $spouse(x,y) \Rightarrow f(x)$
  $spouse(x,y) \Rightarrow m(y)$
  $spouse(x,y) \Rightarrow (f(x) \land m(y)) \lor (m(x) \land f(y))$

ground atoms:
  spouse(Hillary,Bill), spouse(Carla,Nicolas), spouse(Cecilia,Nicolas), spouse(Carla,Ben), spouse(Carla,Mick), spouse(Carla,Sofie)
  f(Hillary), f(Carla), f(Cecilia), f(Sofie)
  m(Bill), m(Nicolas), m(Ben), m(Mick)

Rules reveal inconsistencies: find consistent subset(s) of atoms ("possible world(s)", "the truth")
Rules can be weighted (e.g. by fraction of ground atoms that satisfy a rule): uncertain / probabilistic data; compute prob. distr. of subset of atoms being the truth
Markov Logic Networks (MLN‘s) (M. Richardson / P. Domingos 2006)
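Map logical constraints & fact candidates into probabilistic graph model: Markov Random Field (MRF)

Rules:
  $s(x,y) \Rightarrow m(y)$
  $s(x,y) \Rightarrow f(x)$
  $s(x,y) \land diff(y,z) \Rightarrow \lnot s(x,z)$
  $s(x,y) \land diff(w,y) \Rightarrow \lnot s(w,y)$
  $f(x) \Rightarrow \lnot m(x)$
  $m(x) \Rightarrow \lnot f(x)$

Fact candidates: s(Carla,Nicolas), s(Cecilia,Nicolas), s(Carla,Ben), s(Carla,Sofie), …

Grounding (literal → Boolean variable; literal → binary RV):
  $\lnot s(Ca,Nic) \lor \lnot s(Ce,Nic)$
  $\lnot s(Ca,Nic) \lor \lnot s(Ca,Ben)$
  $\lnot s(Ca,Nic) \lor \lnot s(Ca,So)$
  $\lnot s(Ca,Ben) \lor \lnot s(Ca,So)$
  $\lnot s(Ca,Nic) \lor m(Nic)$
  $\lnot s(Ce,Nic) \lor m(Nic)$
  $\lnot s(Ca,Ben) \lor m(Ben)$
  $\lnot s(Ca,So) \lor m(So)$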
[Figure: MRF over the ground atoms m(Ben), m(Nic), m(So), s(Ca,Nic), s(Ce,Nic), s(Ca,Ben), s(Ca,So)]
RVs are coupled by an MRF edge if they appear in the same clause
MRF assumption: $P[X_i \mid X_1 \ldots X_n] = P[X_i \mid N(X_i)]$
joint distribution has product form over all cliques
Variety of algorithms for joint inference: Gibbs sampling, other MCMC, belief propagation, randomized MaxSat, …
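For a handful of ground atoms, the log-linear distribution of a Markov Logic Network can be computed by brute-force enumeration: each possible world gets weight exp(sum of the weights of its satisfied ground clauses). The sketch below assumes made-up clause weights and only three atoms; real MLN engines such as Alchemy use sampling or belief propagation instead.

```python
import math
from itertools import product

atoms = ["s(Ca,Nic)", "s(Ce,Nic)", "s(Ca,Ben)"]

# (weight, clause); a clause is a list of (atom, required truth value) literals.
# All weights are assumptions chosen for illustration.
clauses = [
    (2.0, [("s(Ca,Nic)", True)]),                        # evidence for the candidate
    (0.5, [("s(Ce,Nic)", True)]),
    (0.5, [("s(Ca,Ben)", True)]),
    (3.0, [("s(Ca,Nic)", False), ("s(Ce,Nic)", False)]),  # Nicolas has at most one spouse
    (3.0, [("s(Ca,Nic)", False), ("s(Ca,Ben)", False)]),  # Carla has at most one spouse
]

def world_weight(world):
    """Unnormalized probability: exp of the total weight of satisfied clauses."""
    total = sum(w for w, lits in clauses if any(world[a] == v for a, v in lits))
    return math.exp(total)

def marginals():
    """Exact marginal probability of each atom, by enumerating all worlds."""
    worlds = [dict(zip(atoms, vals))
              for vals in product([False, True], repeat=len(atoms))]
    z = sum(world_weight(w) for w in worlds)   # partition function
    return {a: sum(world_weight(w) for w in worlds if w[a]) / z for a in atoms}

for atom, p in marginals().items():
    print(f"{atom}: {p:.3f}")
```

Enumeration is exponential in the number of atoms, which is why the approximate inference algorithms listed above matter in practice.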
Related Alternative Probabilistic Models
Constrained Conditional Models [D. Roth et al. 2007]:
  log-linear classifiers with constraint-violation penalty, mapped into Integer Linear Programs
Factor Graphs with Imperative Variable Coordination [A. McCallum et al. 2008]:
  RVs share "factors" (joint feature functions); generalizes MRF, BN, CRF, …; inference via advanced MCMC; flexible coupling & constraining of RVs
software tools:
  alchemy.cs.washington.edu
  code.google.com/p/factorie/
  research.microsoft.com/en-us/um/cambridge/projects/infernet/
Reasoning for KB Growth: Direct Route
facts in KB:
  married (Hillary, Bill)
  married (Carla, Nicolas)
  married (Angelina, Brad)
new fact candidates:
  married (Cecilia, Nicolas)
  married (Carla, Benjamin)
  married (Carla, Mick)
  married (Carla, Sofie)
  married (Larry, Google)
patterns:
  X and her husband Y
  X and Y and their children
  X has been dating with Y
  X loves Y
Direct approach:
• facts are true; fact candidates & patterns are hypotheses
• grounded constraints become clauses with hypotheses as vars
• cast into Weighted Max-Sat with weights from pattern stats
• customized approximation algorithm
• unifies: fact cand consistency, pattern goodness, entity disambig.
(F. Suchanek et al.: WWW'09)
www.mpi-inf.mpg.de/yago-naga/sofie/
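To make the Weighted Max-Sat formulation concrete, here is a brute-force solver over a tiny instance. The weights are assumptions for illustration (a large weight for the known fact and for the functional-dependency clauses, small weights standing in for pattern statistics); the system described on this slide uses a customized approximation algorithm, not enumeration.

```python
from itertools import product

def weighted_max_sat(variables, clauses):
    """Return (assignment, weight) maximizing the total weight of satisfied clauses.
    clauses: list of (weight, [(variable, required truth value), ...])."""
    best, best_weight = None, float("-inf")
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        weight = sum(w for w, lits in clauses
                     if any(world[v] == pol for v, pol in lits))
        if weight > best_weight:
            best, best_weight = world, weight
    return best, best_weight

CN, CeN, CB = "married(Carla,Nicolas)", "married(Cecilia,Nicolas)", "married(Carla,Benjamin)"
variables = [CN, CeN, CB]
clauses = [
    (10.0, [(CN, True)]),                 # fact already in the KB (assumed weight)
    (1.0, [(CeN, True)]),                 # candidate backed by a strong pattern (assumed)
    (0.4, [(CB, True)]),                  # candidate backed by a weak pattern (assumed)
    (5.0, [(CN, False), (CeN, False)]),   # Nicolas: at most one spouse
    (5.0, [(CN, False), (CB, False)]),    # Carla: at most one spouse
]
best, weight = weighted_max_sat(variables, clauses)
print(best, weight)
```

With these weights the known fact survives and both conflicting candidates are rejected; without temporal scopes there is no way to accept an earlier marriage, which motivates the temporal extension later in the deck.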
Facts & Patterns Consistency
constraints to connect facts, fact candidates, patterns (F. Suchanek et al.: WWW'09)

functional dependencies:
  spouse(X,Y): $X \rightarrow Y$, $Y \rightarrow X$
relation properties:
  asymmetry, transitivity, acyclicity, …
type constraints, inclusion dependencies:
  $spouse \subseteq Person \times Person$
  $capitalOfCountry \subseteq cityOfCountry$
domain-specific constraints:
  $bornInYear(x) + 10\,years \le graduatedInYear(x)$
  $hasAdvisor(x,y) \land graduatedInYear(x,t) \land graduatedInYear(y,s) \Rightarrow s < t$
pattern-fact duality:
  $occurs(p,x,y) \land expresses(p,R) \Rightarrow R(x,y)$
  $occurs(p,x,y) \land R(x,y) \Rightarrow expresses(p,R)$
name(-in-context)-to-entity mapping:
  $\lnot means(n,e_1) \lor \lnot means(n,e_2) \lor \ldots$
Soft Rules vs. Hard Constraints
Enforce FDs (mutual exclusion) as hard constraints:
  $hasAdvisor(x,y) \land diff(y,z) \Rightarrow \lnot hasAdvisor(x,z)$
  combine with weighted constraints: no longer MaxSat, constrained MaxSat instead

Generalize to other forms of constraints:
  hard constraint:
    $hasAdvisor(x,y) \land graduatedInYear(x,t) \land graduatedInYear(y,s) \Rightarrow s < t$
  soft constraint:
    $firstPaper(x,p) \land firstPaper(y,q) \land author(p,x) \land author(p,y) \land inYear(p) > inYear(q) + 5\,years \Rightarrow hasAdvisor(x,y)$
  open issue for arbitrary constraints: rethink reasoning!
Problems and Challenges
High precision & high recall at affordable cost: robust pattern analysis & reasoning; parallel processing, lazy / lifted inference, …
Types and constraints: explore & understand different families of constraints; soft rules & hard constraints, rich DL, beyond CWA
Declarative, self-optimizing workflows: incorporate pattern & reasoning steps into IE queries/programs
Scale, dynamics, life-cycle: grow & maintain KB with near-human quality over long periods
Open-domain knowledge harvesting: turn names, phrases & table cells into entities & relations
Temporal Knowledge
Which facts for given relations hold at what time point or during which time intervals?
  marriedTo (Madonna, Guy) [ 22Dec2000, Dec2008 ]
  capitalOf (Berlin, Germany) [ 1990, now ]
  capitalOf (Bonn, Germany) [ 1949, 1989 ]
  hasWonPrize (JimGray, TuringAward) [ 1998 ]
  graduatedAt (HectorGarcia-Molina, Stanford) [ 1979 ]
  graduatedAt (SusanDavidson, Princeton) [ Oct 1982 ]
  hasAdvisor (SusanDavidson, HectorGarcia-Molina) [ Oct 1982, forever ]
How can we query & reason on entity-relationship facts in a "time-travel" manner, with uncertain/incomplete KB?
  US president when Barack Obama was born?
  students of Hector Garcia-Molina while he was at Princeton?
French Marriage Problem
facts in KB:
  1: married (Hillary, Bill)
  2: married (Carla, Nicolas)
  3: married (Angelina, Brad)
  validFrom (2, 2008)
new fact candidates:
  4: married (Cecilia, Nicolas)
  5: married (Carla, Benjamin)
  6: married (Carla, Mick)
  7: divorced (Madonna, Guy)
  8: domPartner (Angelina, Brad)
  validFrom (4, 1996), validUntil (4, 2007)
  validFrom (5, 2010)
  validFrom (6, 2006)
  validFrom (7, 2008)
[Figure: timeline with month axis JAN … DEC]
Challenge: Temporal Knowledge
for all people in Wikipedia (100,000's) gather all spouses, incl. divorced & widowed, and corresponding time periods!
>95% accuracy, >95% coverage, in one night

consistency constraints are potentially helpful:
• functional dependencies: $husband, time \rightarrow wife$
• inclusion dependencies: $marriedPerson \subseteq adultPerson$
• age/time/gender restrictions: $birthdate + \Delta < marriage < divorce$

1) recall: gather temporal scopes for base facts
2) precision: reason on mutual consistency
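The time-aware functional dependency (a person has at most one spouse at any time) can be checked with a simple interval-overlap test. The sketch below reuses the validFrom/validUntil years from the French Marriage Problem slide; treating a missing end date as open-ended (infinity) and treating marriage as symmetric are assumptions of this example.

```python
import math
from itertools import combinations

def overlaps(t1, t2):
    """True if the closed intervals t1 = (begin, end) and t2 share a time point."""
    return t1[0] <= t2[1] and t2[0] <= t1[1]

def temporal_conflicts(marriages):
    """marriages: {(person, person): (begin, end)}.
    Returns pairs of marriages that share exactly one person and overlap in time."""
    conflicts = []
    for (m1, t1), (m2, t2) in combinations(marriages.items(), 2):
        if len(set(m1) & set(m2)) == 1 and overlaps(t1, t2):
            conflicts.append((m1, m2))
    return conflicts

marriages = {
    ("Cecilia", "Nicolas"): (1996, 2007),
    ("Carla", "Nicolas"): (2008, math.inf),    # no validUntil given: open-ended
    ("Carla", "Benjamin"): (2010, math.inf),
    ("Carla", "Mick"): (2006, math.inf),
}
for pair in temporal_conflicts(marriages):
    print("conflict:", pair)
```

With temporal scopes, the two marriages of Nicolas no longer conflict, while the three open-ended candidates involving Carla still do and must be resolved by weighing evidence.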
(Even More Difficult) Relative Dating
vague dates, relative dates
narrative text, relative order
TARSQI: Extracting Time Annotations
Hong Kong is poised to hold the first election in more than half <TIMEX3 tid="t3" TYPE="DURATION" VAL="P100Y">a century</TIMEX3> that includes a democracy advocate seeking high office in territory controlled by the Chinese government in Beijing. A pro-democracy politician, Alan Leong, announced <TIMEX3 tid="t4" TYPE="DATE" VAL="20070131">Wednesday</TIMEX3> that he had obtained enough nominations to appear on the ballot to become the territory’s next chief executive. But he acknowledged that he had no chance of beating the Beijing-backed incumbent, Donald Tsang, who is seeking re-election. Under electoral rules imposed by Chinese officials, only 796 people on the election committee – the bulk of them with close ties to mainland China – will be allowed to vote in the <TIMEX3 tid="t5" TYPE="DATE" VAL="20070325">March 25</TIMEX3> election. It will be the first contested election for chief executive since Britain returned Hong Kong to China in <TIMEX3 tid="t6" TYPE="DATE" VAL="1997">1997</TIMEX3>. Mr. Tsang, an able administrator who took office during the early stages of a sharp economic upturn in <TIMEX3 tid="t7" TYPE="DATE" VAL="2005">2005</TIMEX3>, is popular with the general public. Polls consistently indicate that three-fifths of Hong Kong’s people approve of the job he has been doing. It is of course a foregone conclusion – Donald Tsang will be elected and will hold office for <TIMEX3 tid="t9" beginPoint="t0" endPoint="t8" TYPE="DURATION" VAL="P5Y">another five years </TIMEX3>, said Mr. Leong, the former chairman of the Hong Kong Bar Association.
(M. Verhagen et al.: ACL'05)
http://www.timeml.org/site/tarsqi/
Note: the annotations contain extraction errors
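Once text carries TIMEX3 annotations like the ones above, the time expressions can be pulled out with two regular expressions. This is a minimal sketch that assumes well-formed tags with straight double quotes around attribute values; it is not part of the TARSQI toolkit, and `extract_timex` is a hypothetical helper name.

```python
import re

TIMEX3 = re.compile(r"<TIMEX3\s+([^>]*)>(.*?)</TIMEX3>", re.DOTALL)
ATTR = re.compile(r'(\w+)="([^"]*)"')

def extract_timex(text):
    """Return one dict per TIMEX3 element: its attributes plus the annotated text."""
    return [{**dict(ATTR.findall(attrs)), "text": body.strip()}
            for attrs, body in TIMEX3.findall(text)]

sample = ('Alan Leong announced <TIMEX3 tid="t4" TYPE="DATE" VAL="20070131">'
          'Wednesday</TIMEX3> that he would run; Britain returned Hong Kong in '
          '<TIMEX3 tid="t6" TYPE="DATE" VAL="1997">1997</TIMEX3>.')
for annotation in extract_timex(sample):
    print(annotation)
```

The extracted `VAL` attributes are normalized values, so a relative expression such as "Wednesday" arrives already resolved to a calendar date.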
Representing Time: AI Perspective
• Instant: durationless piece of time
• Period: potentially unbounded continuum of instants
• Events:
  – time as a sequence of events E
  – precedence and overlap relations on $E \times E$
[Allen 1984, Allen & Hayes 1989, …]
Relations between Time Periods
A Before B, B After A
A Meets B, B MetBy A
A Overlaps B, B OverlappedBy A
A Starts B, B StartedBy A
A During B, B Contains A
A Finishes B, B FinishedBy A
A Equal B
[Figure: relative positions of intervals A and B for each relation]
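The thirteen interval relations can be decided from the four endpoints alone. The function below is a small sketch that assumes each interval is a `(start, end)` pair with start < end; the relation names follow the list above.

```python
def allen_relation(a, b):
    """Name the Allen relation that holds between intervals a and b.
    Each interval is a (start, end) pair with start < end."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "Before"
    if b2 < a1:
        return "After"
    if a2 == b1:
        return "Meets"
    if b2 == a1:
        return "MetBy"
    if a1 == b1 and a2 == b2:
        return "Equal"
    if a1 == b1:
        return "Starts" if a2 < b2 else "StartedBy"
    if a2 == b2:
        return "Finishes" if a1 > b1 else "FinishedBy"
    if b1 < a1 and a2 < b2:
        return "During"
    if a1 < b1 and b2 < a2:
        return "Contains"
    # Remaining case: proper overlap without shared endpoints or containment.
    return "Overlaps" if a1 < b1 else "OverlappedBy"

print(allen_relation((1949, 1989), (1990, 2010)))
print(allen_relation((1992, 2004), (1990, 2005)))
```

Exactly one relation holds for any pair of proper intervals, which is what makes the relations usable as constraints in temporal reasoning.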
Representing Time: DB Perspective
• Time point: smallest time unit of fixed duration/granularity (e.g., a day, a year, a second)
• Interval: finite set of time points
• State relation: fact holds at every time point within interval
  isCapitalOf (Bonn, Germany) [1949, 1989]
• Event relation: fact holds at exactly one time point within interval
  wonCup (United, ChampionsLeague) [1999, 1999]
intervals can also capture uncertainty of time points
Uncertainty and Time
• Point-probabilities for facts and intervals
  playsFor(Beckham, United) [1990, 2005]: 0.9
  – fact valid in interval [tb, te] with prob. p
  – fact not valid with prob. 1-p
• Continuous distributions
  playsFor(Beckham, United) [1990, 2005]: Gauss($\mu = 1996$, $\sigma^2 = 1$)
• Histograms
  playsFor(Beckham, United) [1990, 1992): 0.1, [1992, 2004): 0.6, [2004, 2005]: 0.2
Possible Worlds in Time
[Figure: time histograms for the base facts playsFor (Beckham, United) (state) and wonCup (United, ChampionsLeague) (event), and for the derived fact hasWon (Beckham, ChampionsLeague), whose per-bin probabilities are computed from the base facts]
• #P-complete per histogram bin
• linear in #bins
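For the simplest case, a derived fact that is the conjunction of two independent base facts, the per-bin computation reduces to aligning bins and multiplying probabilities. The sketch below makes that independence assumption explicit, uses half-open bins throughout (so the closed bin [2004, 2005] is written as [2004, 2006)), and uses an assumed probability of 0.9 for the wonCup event in 1999. General lineage formulas are what make the problem #P-complete per bin.

```python
def conjoin(h1, h2):
    """Histogram of (fact1 AND fact2), assuming the two facts are independent.
    h1, h2: lists of (begin, end, prob) with half-open bins [begin, end)."""
    out = []
    for b1, e1, p1 in h1:
        for b2, e2, p2 in h2:
            b, e = max(b1, b2), min(e1, e2)
            if b < e:                      # the two bins overlap on [b, e)
                out.append((b, e, p1 * p2))
    return sorted(out)

# playsFor(Beckham, United), bins as on the earlier slide.
plays_for = [(1990, 1992, 0.1), (1992, 2004, 0.6), (2004, 2006, 0.2)]
# wonCup(United, ChampionsLeague): event in 1999; the probability is an assumption.
won_cup = [(1999, 2000, 0.9)]

print(conjoin(plays_for, won_cup))
```

The number of output bins grows with the number of overlapping bin pairs, in line with the "linear in #bins" remark when one histogram has a single bin.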
Joint Reasoning on Facts & Time
Facts from KB (with confidence weights):
  marriedTo(Nicolas, Carla) 0.91
  marriedTo(Nicolas, Cecilia) 0.65
  divorcedFrom(Nicolas, Cecilia) 0.78
  bornIn(Nicolas, Paris) 0.77
  bornIn(Cecilia, Boulogne) 0.12
  bornIn(Carla, Turin) 0.43
  marriedTo(Carla, Ben) 0.18
  marriedTo(Carla, Mick) 0.25
Rules:
  $marriedTo(a,b,T_1) \land marriedTo(a,c,T_2) \land different(b,c) \Rightarrow disjoint(T_1,T_2)$
  $marriedTo(a,b,T_1) \land divorcedFrom(a,b,T_2) \Rightarrow before(T_1,T_2)$
  $marriedTo(a,b,T_1) \land bornIn(a,c,T_2) \Rightarrow before(T_2,T_1)$
[Figure: the same facts and rules, with the facts placed as intervals on a time axis]
+ more soft rules: $hasChild(a,c) \land hasChild(b,c) \land different(a,b) \Rightarrow marriedTo(a,b)$
+ recursive rules …
Compute most likely possible world!
Problems and Challenges
Temporal Querying (Revived): query language (T-SPARQL?), no schema; confidence weights & ranking
Consistency Reasoning: extended MaxSat, extended Datalog, prob. graph. models, etc. for resolving inconsistencies on uncertain facts & uncertain time
Incomplete and Uncertain Temporal Scopes: incorrect, incomplete, unknown begin/end; vague dating
Gathering Implicit and Relative Time Annotations: biographies & news, relative orderings; aggregate & reconcile observations
KB Building: Where Do We Stand?
Entities & Classes: strong success story, some problems left:
• large taxonomies of classes with individual entities
• long tail calls for new methods
• entity disambiguation remains grand challenge
Relationships: good progress, but many challenges left:
• recall & precision by patterns & reasoning
• efficiency & scalability
• soft rules, hard constraints, richer logics, …
• open-domain discovery of new relation types
Temporal Knowledge: widely open (fertile) research ground:
• uncertain / incomplete temporal scopes of facts
• joint reasoning on ER facts and time scopes
Overall Take-Home
Historic opportunity: revive Cyc vision, make it real & large-scale!
challenging & risky, but high pay-off

Explore & exploit synergies between semantic, statistical, & social Web methods:
statistical evidence + logical consistency!

For DB researchers (theoreticians & normal ones):
• efficiency & scalability
• constraints & reasoning
• killer app for uncertain data management
• knowledge-base life-cycle: growth & maintenance