Download - Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Transcript
Page 1:

Outline

1. Introduction
2. Harvesting Classes
3. Harvesting Facts
   • Goal
   • Extraction from text
   • Consistency reasoning
   • Extraction from Tables
   • Open IE
4. Common Sense Knowledge
5. Knowledge Consolidation
6. Web Content Analytics
7. Wrap-Up

Page 2:

Source-centric IE vs. Yield-centric IE

Source-centric IE (one source)
Priorities: 1) recall! 2) precision

Source text: "Surajit obtained his PhD in CS from Stanford ..."

Document 1:
instanceOf (Surajit, scientist)
inField (Surajit, c.science)
almaMater (Surajit, Stanford U)
…

Yield-centric IE (many sources, + (optional) targeted relations)
Priorities: 1) precision! 2) recall

worksAt
Student            University
Surajit Chaudhuri  Stanford U
Jim Gray           UC Berkeley
…                  …

hasAdvisor
Student            Advisor
Surajit Chaudhuri  Jeffrey Ullman
Jim Gray           Mike Harrison
…                  …

Page 3:

We focus on yield-centric IE

Page 4:

Goal: Find facts of given binary relations

Given binary relations with type signature
hasAdvisor: Person × Person
graduatedAt: Person × University
bornOn: Person × Date

...find instances of these relations
hasAdvisor (JimGray, MikeHarrison)
hasAdvisor (SusanDavidson, HectorGarcia-Molina)
graduatedAt (JimGray, Berkeley)
graduatedAt (HectorGarcia-Molina, Stanford)
bornOn (JohnLennon, 9-Oct-1940)
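A type signature can act as a cheap filter on candidate facts. The following is a minimal sketch of that idea, not the tutorial's implementation: the `ENTITY_TYPES` table and the helper names `type_of` and `is_well_typed` are assumptions made for illustration.

```python
from datetime import date

# Hypothetical type assignments for a few entities (illustration only).
ENTITY_TYPES = {
    "JimGray": "Person",
    "MikeHarrison": "Person",
    "JohnLennon": "Person",
    "Berkeley": "University",
}

# Type signatures as listed on the slide: relation -> (domain, range).
SIGNATURES = {
    "hasAdvisor": ("Person", "Person"),
    "graduatedAt": ("Person", "University"),
    "bornOn": ("Person", "Date"),
}

def type_of(value):
    """Return the type of an argument; Python dates are typed as 'Date'."""
    if isinstance(value, date):
        return "Date"
    return ENTITY_TYPES.get(value)

def is_well_typed(relation, subject, obj):
    """Check a candidate fact against the relation's type signature."""
    domain, range_ = SIGNATURES[relation]
    return type_of(subject) == domain and type_of(obj) == range_

candidates = [
    ("hasAdvisor", "JimGray", "MikeHarrison"),
    ("graduatedAt", "JimGray", "Berkeley"),
    ("bornOn", "JohnLennon", date(1940, 10, 9)),
    ("graduatedAt", "JimGray", "MikeHarrison"),  # ill-typed: range must be a University
]
accepted = [c for c in candidates if is_well_typed(*c)]
```

Here the last candidate is rejected because its second argument is a Person rather than a University.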

Page 5:

Facts yield patterns, and vice versa
[Brin@WebDB1998 "DIPRE"; Agichtein@SIGMOD2001 "Snowball"]

Facts:
(JimGray, MikeHarrison)
(BarbaraLiskov, JohnMcCarthy)

Patterns:
X and his advisor Y
X under the guidance of Y
X and Y in their paper
X co-authored with Y
X rarely met his advisor Y
…

Fact candidates:
(Surajit, Jeff)
(Sunita, Mike)
(Alon, Jeff)
(Renee, Yannis)
(Surajit, Microsoft)
(Sunita, Soumen)
(Surajit, Moshe)
(Alon, Larry)
(Soumen, Sunita)

• good for recall
• noisy, drifting
• not robust enough for high precision
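The fact-pattern loop can be sketched in a few lines. This is a toy illustration in the spirit of DIPRE and Snowball, not either system's actual algorithm: the corpus sentences, the whole-sentence matching, and the function names `induce_patterns` and `apply_patterns` are assumptions.

```python
import re

# Toy corpus; the sentences are made up for illustration.
CORPUS = [
    "JimGray and his advisor MikeHarrison",
    "BarbaraLiskov under the guidance of JohnMcCarthy",
    "Surajit and his advisor Jeff",
    "Renee under the guidance of Yannis",
    "Sunita co-authored with Soumen",
]

def induce_patterns(facts, corpus):
    """Facts -> patterns: keep the text between X and Y for sentences that state a known fact."""
    patterns = set()
    for x, y in facts:
        for sentence in corpus:
            if sentence.startswith(x + " ") and sentence.endswith(" " + y):
                middle = sentence[len(x):len(sentence) - len(y)].strip()
                patterns.add(middle)
    return patterns

def apply_patterns(patterns, corpus):
    """Patterns -> fact candidates: match 'X <pattern> Y' against every sentence."""
    found = set()
    for middle in patterns:
        regex = re.compile(r"^(\w+) " + re.escape(middle) + r" (\w+)$")
        for sentence in corpus:
            match = regex.match(sentence)
            if match:
                found.add((match.group(1), match.group(2)))
    return found

seeds = {("JimGray", "MikeHarrison"), ("BarbaraLiskov", "JohnMcCarthy")}
patterns = induce_patterns(seeds, CORPUS)
candidates = apply_patterns(patterns, CORPUS) - seeds
```

Feeding `candidates` back in as new seeds would start the next iteration, which is also where noise and drift accumulate.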

Page 6:

Extensions:
1. use statistics to estimate the trustworthiness of patterns
2. use counter examples to "punish" bad patterns
   [Ravichandran 2002; Suchanek 2006; ...]
3. use deep parsing to generalize patterns
   [Bunescu 2005; Suchanek 2006; …]
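Extensions 1 and 2 can be illustrated with a simple confidence score: the share of a pattern's matches that agree with known facts. This sketch assumes, purely for illustration, that each student has exactly one advisor, so a match pairing a known student with a different person counts as a counter example. The match list and the name `pattern_confidence` are hypothetical.

```python
from collections import defaultdict

# Known hasAdvisor facts (student -> advisor); assumed functional for this sketch.
KNOWN = {"Surajit": "Jeff", "Sunita": "Soumen", "Alon": "Jeff"}

# Hypothetical (pattern, X, Y) matches found in a corpus.
MATCHES = [
    ("X and his advisor Y", "Surajit", "Jeff"),
    ("X and his advisor Y", "Alon", "Jeff"),
    ("X and his advisor Y", "Renee", "Yannis"),  # unknown student: ignored
    ("X under the guidance of Y", "Sunita", "Soumen"),
    ("X co-authored with Y", "Surajit", "Moshe"),
    ("X co-authored with Y", "Alon", "Larry"),
    ("X co-authored with Y", "Sunita", "Soumen"),
]

def pattern_confidence(matches, known):
    """Confidence = positives / (positives + negatives) per pattern."""
    pos = defaultdict(int)
    neg = defaultdict(int)
    for pattern, x, y in matches:
        if x not in known:
            continue  # neither confirms nor contradicts anything we know
        if known[x] == y:
            pos[pattern] += 1
        else:
            neg[pattern] += 1  # counter example "punishes" the pattern
    return {p: pos[p] / (pos[p] + neg[p]) for p in set(pos) | set(neg)}

confidence = pattern_confidence(MATCHES, KNOWN)
```

With this data the advisor patterns keep full confidence, while "X co-authored with Y" drops to one third and could be pruned by a threshold.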

Page 7:

Outline

3. Harvesting Facts
   • Goal √
   • Extraction from text √
   • Consistency reasoning
   • Extraction from Tables
   • Open IE

Page 8:

Reasoning [Suchanek@WWW2009]

Example sentence: "Einstein died in 1955"

occurs("Elvis", "died in", 528)
occurs("Einstein", "died in", 1955)
died(Einstein, 1955), born(Elvis, 1935)
occurs(X', P, Y) & means(X', X) & R(X, Y) => pattern(P, R)
occurs(X', P, Y) & means(X', X) & pattern(P, R) => R(X, Y)
born(X, Y) & died(X, Z) => Z > Y
…

Page 9:

Reasoning [Suchanek@WWW2009]

Solving a weighted MAX SAT problem at scale
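To make the weighted MAX SAT view concrete, here is a brute-force sketch over three hypotheses. It is not the solver from the cited work: the clause weights and the extra hypothesis `died(Elvis,1977)` are assumptions for illustration, and exhaustive enumeration only works at toy size.

```python
from itertools import product

VARIABLES = ["born(Elvis,1935)", "died(Elvis,528)", "died(Elvis,1977)"]

# Each clause is (weight, literals); a literal is (variable, required truth value)
# and a clause is satisfied if at least one literal holds. Weights are illustrative.
CLAUSES = [
    (10.0, [("born(Elvis,1935)", True)]),   # trusted fact
    (1.0, [("died(Elvis,528)", True)]),     # weak evidence from one pattern occurrence
    (2.0, [("died(Elvis,1977)", True)]),    # hypothetical competing evidence
    # Grounding of born(X,Y) & died(X,Z) => Z>Y for Y=1935, Z=528 (528 > 1935 is false):
    (100.0, [("born(Elvis,1935)", False), ("died(Elvis,528)", False)]),
]

def satisfied_weight(assignment):
    """Total weight of the clauses satisfied by a truth assignment."""
    return sum(
        weight
        for weight, literals in CLAUSES
        if any(assignment[var] == value for var, value in literals)
    )

def solve_max_sat():
    """Enumerate all assignments and return (best weight, best assignment)."""
    best = None
    for values in product([False, True], repeat=len(VARIABLES)):
        assignment = dict(zip(VARIABLES, values))
        score = satisfied_weight(assignment)
        if best is None or score > best[0]:
            best = (score, assignment)
    return best

best_weight, best_world = solve_max_sat()
```

The best assignment keeps the birth year and the 1977 hypothesis and drops `died(Elvis,528)`, since accepting it would violate the heavily weighted ordering constraint.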

Page 11:

Reasoning [Suchanek@WWW2009]

Extensions:
1. parallelize the reasoning by performing a min cut on the dependency graph [Nakashole@WSDM2011 "Prospera"]
2. use Markov logic networks to represent the entire joint probability distribution [M. Richardson / P. Domingos 2006]

Page 12:

Using Markov Logic Networks

We can model/compute
• the marginal probabilities
• the joint distribution
• the MAP (= maximum a posteriori), i.e. the most likely world

[Figure: two possible worlds (World 1, World 2), each with its probability]

Application: Extracting facts at large scale [Zhu@WWW2009 "StatSnowball", "EntityCube"]
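In a Markov logic network, the probability of a world is proportional to the exponential of the summed weights of the ground formulas it satisfies. The sketch below enumerates all worlds over two atoms to obtain the joint distribution, one marginal, and the MAP world. The atoms and weights are illustrative assumptions, and real systems use approximate inference rather than enumeration.

```python
import math
from itertools import product

ATOMS = ["born(Elvis,1935)", "died(Elvis,528)"]

# Weighted ground formulas as (weight, predicate over a world). Weights are made up.
FORMULAS = [
    (2.0, lambda w: w["born(Elvis,1935)"]),
    (0.5, lambda w: w["died(Elvis,528)"]),
    # Soft version of "one cannot die before being born".
    (3.0, lambda w: not (w["born(Elvis,1935)"] and w["died(Elvis,528)"])),
]

def unnormalized(world):
    """exp(sum of weights of satisfied formulas)."""
    return math.exp(sum(weight for weight, holds in FORMULAS if holds(world)))

worlds = [dict(zip(ATOMS, values)) for values in product([False, True], repeat=len(ATOMS))]
Z = sum(unnormalized(w) for w in worlds)                    # partition function
joint = [(w, unnormalized(w) / Z) for w in worlds]          # joint distribution
map_world = max(joint, key=lambda pair: pair[1])[0]         # most likely world
marginal_died_528 = sum(p for w, p in joint if w["died(Elvis,528)"])
```

Unlike the MAX SAT view, the contradictory world is not excluded outright; it merely receives low probability.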

Page 13:

Outline

3. Harvesting Facts
   • Goal √
   • Extraction from text √
   • Consistency reasoning √
   • Extraction from Tables
   • Open IE

Page 14:

Web Tables provide relational information
[Cafarella et al: PVLDB 08; Sarawagi et al: PVLDB 09]

Page 15:

Web Tables can be annotated with YAGO
[Limaye, Sarawagi, Chakrabarti: PVLDB 10]

Goal: enable semantic search over Web tables

Idea:
• Map column headers to YAGO classes
• Map cell values to YAGO entities
• Use joint inference in a factor-graph learning model

Example table:

Title                    Author
A short history of time  S Hawkins
Hitchhiker's guide       D Adams

Annotations shown: classes Book and Person for the two columns (both under Entity), relation hasAuthor between the columns.
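As a rough illustration of annotating a column, the sketch below lets each cell vote for the classes of its candidate entities, picks the majority class for the column, and then resolves each cell to a candidate of that class. This two-step voting is a deliberate simplification and not the joint factor-graph inference of the cited work. The candidate dictionary and entity identifiers are hypothetical.

```python
from collections import Counter

# Hypothetical lookup: cell string -> candidate (entity, class) pairs.
CANDIDATES = {
    "A short history of time": [("A_Brief_History_of_Time", "Book")],
    "Hitchhiker's guide": [
        ("The_Hitchhiker's_Guide_to_the_Galaxy", "Book"),
        ("The_Hitchhiker's_Guide_(film)", "Movie"),
    ],
    "S Hawkins": [("Stephen_Hawking", "Person")],
    "D Adams": [("Douglas_Adams", "Person"), ("Adams_(town)", "City")],
}

def annotate_column(cells):
    """Return (column class, entity per cell) using majority voting over candidate classes."""
    votes = Counter()
    for cell in cells:
        # Each cell votes at most once per class.
        for cls in {c for _, c in CANDIDATES.get(cell, [])}:
            votes[cls] += 1
    column_class = votes.most_common(1)[0][0] if votes else None
    entities = [
        next((e for e, c in CANDIDATES.get(cell, []) if c == column_class), None)
        for cell in cells
    ]
    return column_class, entities

title_class, title_entities = annotate_column(["A short history of time", "Hitchhiker's guide"])
author_class, author_entities = annotate_column(["S Hawkins", "D Adams"])
```

The column-level vote is what disambiguates "D Adams" here: the Person reading wins because the other cell in the column only has a Person candidate.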

Page 16:

Statistics yield semantics of Web tables
[Venetis, Halevy et al: PVLDB 11]

Idea: Infer classes from co-occurrences, headers are class names

$$P(val_1, \ldots, val_n \mid class) \propto \prod_{i=1}^{n} \frac{P(class \mid val_i)}{P(class)}$$

Result from 12 million Web tables:
• 1.5 million labeled columns (= classes)
• 155 million instances (= values)

But: classes & entities are not canonicalized. Instances may include:
Google Inc., Google, NASDAQ GOOG, Google search engine, …
Jet Li, Li Lianjie, Ley Lin Git, Li Yangzhong, Nameless hero, …
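The class score for a column follows from Bayes' rule under the assumption that the values are independent given the class: each value contributes a factor P(class | val_i) / P(class). A minimal sketch in log space is shown below; the probability tables and the smoothing constant are invented for illustration.

```python
import math

# Hypothetical statistics: P(class | value) and class priors P(class).
P_CLASS_GIVEN_VALUE = {
    "Google": {"company": 0.7, "search engine": 0.3},
    "Microsoft": {"company": 0.9, "search engine": 0.1},
    "Yahoo": {"company": 0.6, "search engine": 0.4},
}
P_CLASS = {"company": 0.3, "search engine": 0.2}

def log_score(values, cls, smoothing=1e-6):
    """Sum over values of log(P(class | val_i) / P(class))."""
    total = 0.0
    for value in values:
        # Unseen (value, class) pairs get a small smoothing probability.
        p = P_CLASS_GIVEN_VALUE.get(value, {}).get(cls, smoothing)
        total += math.log(p) - math.log(P_CLASS[cls])
    return total

def best_class(values):
    """Pick the class label with the highest score for the column."""
    return max(P_CLASS, key=lambda cls: log_score(values, cls))

column = ["Google", "Microsoft", "Yahoo"]
label = best_class(column)
```

Working in log space avoids numeric underflow when a column has many values.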

Page 17:

ID-Based Extraction

887128476661

• Unique identifiers exist for books (ISBN), products (GTIN), companies (VAT), people (emails*), etc.

• Unique identifiers can be found by regular expression + check digit verification
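For GTINs, "regular expression + check digit verification" can be sketched as follows. The regular expression and function names are assumptions. The check uses the standard GTIN rule: weights 3, 1, 3, … starting from the digit next to the check digit.

```python
import re

# GTINs have 8, 12, 13, or 14 digits; match standalone digit runs of those lengths.
GTIN_PATTERN = re.compile(r"\b(\d{8}|\d{12,14})\b")

def gtin_check_digit_ok(code):
    """Verify the final digit using the GTIN mod-10 check digit rule."""
    digits = [int(ch) for ch in code]
    body, check = digits[:-1], digits[-1]
    # Weights alternate 3, 1, 3, ... starting from the rightmost body digit.
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check

def find_gtins(text):
    """Return digit runs that look like GTINs and pass the check digit test."""
    return [code for code in GTIN_PATTERN.findall(text) if gtin_check_digit_ok(code)]

# The first number is the one shown on the slide; the second differs in the last digit.
text = "Puma PowerTech, item 887128476661, order ref 887128476662"
found = find_gtins(text)
```

The check digit filters out most random digit strings of the right length, which is what makes the regular expression usable on noisy Web pages.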

Page 18:

ID-Based Extraction

id   Name             URL
123  Puma PowerTech   url1
123  Please choose    url1
123  Puma PowerTech   url2
123  Puma Power Shoe  url2
124  Puma Slow Cat    url3
779  Please choose    url3
779  Canon PowerShot  url3
…
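One plausible way to clean such id/name pairs is sketched below: drop names that appear with several different ids (such as "Please choose"), then keep the most frequent remaining name per id. This heuristic is an assumption for illustration and is not claimed to be the method of the IBEX system.

```python
from collections import Counter, defaultdict

# Rows from the slide: (id, name, url).
ROWS = [
    ("123", "Puma PowerTech", "url1"),
    ("123", "Please choose", "url1"),
    ("123", "Puma PowerTech", "url2"),
    ("123", "Puma Power Shoe", "url2"),
    ("124", "Puma Slow Cat", "url3"),
    ("779", "Please choose", "url3"),
    ("779", "Canon PowerShot", "url3"),
]

def canonical_names(rows):
    """Map each id to its most frequent name, ignoring names shared by several ids."""
    ids_per_name = defaultdict(set)
    for id_, name, _url in rows:
        ids_per_name[name].add(id_)
    names_per_id = defaultdict(Counter)
    for id_, name, _url in rows:
        if len(ids_per_name[name]) == 1:  # a name seen with many ids is likely page noise
            names_per_id[id_][name] += 1
    return {id_: counts.most_common(1)[0][0] for id_, counts in names_per_id.items()}

names = canonical_names(ROWS)
```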

Page 19:

ID-Based Extraction

[Talaika@WebDB2015 "IBEX"]

Page 20:

Outline

3. Harvesting Facts
   • Goal √
   • Extraction from text √
   • Consistency reasoning √
   • Extraction from Tables √
   • Open IE

Page 21:

Open Information Extraction

So far we assumed given relations with type signatures: <entity1, relation, entity2>

<CarlaBruni marriedTo NicolasSarkozy>    Person R Person
<NataliePortman wonAward AcademyAward>   Person R Prize

Open IE aims to discover new entities and new relation types: <name1, phrase, name2>

"Madame Bruni in her happy marriage with Sarko…"
<Madame Bruni, her happy marriage with, Sarko>

Page 22:

Open IE with ReVerb [A. Fader et al. 2011, T. Lin 2012, Mausam 2012]

Idea: Consider all subject-verb-object triples as facts.

Problem 1: uninformative extractions
“Gold has an atomic weight of 196”  <Gold, has, atomic weight>
“Faust made a deal with the devil”  <Faust, made, a deal>

Problem 2: over-specific extractions
“Elvis is the first and greatest rock and roll star of America”  <..., is the first and greatest rock and roll star of, …>

Solutions:
1. enforce regular expressions over POS tags, such as VB (N | ADJ | ADV | PRN | DET)* PREP
2. require that the relation phrase appear with many distinct argument pairs
3. intersect with Freebase
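The POS-tag constraint VB (N | ADJ | ADV | PRN | DET)* PREP can be applied by matching a regular expression against the sequence of tags. The sketch below assumes sentences that are already tagged by hand with the slide's coarse tag names (plus a hypothetical NUM tag for numbers). It is not ReVerb's implementation.

```python
import re

# The slide's pattern over space-separated tags: VB (N | ADJ | ADV | PRN | DET)* PREP
RELATION_TAGS = re.compile(r"VB(?: (?:N|ADJ|ADV|PRN|DET))* PREP")

def longest_relation_phrase(tagged):
    """tagged: list of (token, tag). Return the longest token span whose tags match."""
    best = None
    best_length = 0
    for start in range(len(tagged)):
        # Try the longest span first so the first hit per start is the longest one.
        for end in range(len(tagged), start, -1):
            tags = " ".join(tag for _, tag in tagged[start:end])
            if RELATION_TAGS.fullmatch(tags):
                if end - start > best_length:
                    best = " ".join(token for token, _ in tagged[start:end])
                    best_length = end - start
                break
    return best

gold = [("Gold", "N"), ("has", "VB"), ("an", "DET"), ("atomic", "ADJ"),
        ("weight", "N"), ("of", "PREP"), ("196", "NUM")]
faust = [("Faust", "N"), ("made", "VB"), ("a", "DET"), ("deal", "N"),
         ("with", "PREP"), ("the", "DET"), ("devil", "N")]

gold_phrase = longest_relation_phrase(gold)
faust_phrase = longest_relation_phrase(faust)
```

With the constraint, the relation phrases become "has an atomic weight of" and "made a deal with", rather than the uninformative "has" and "made".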

Page 23:

http://openie.cs.washington.edu/

Page 24:

Enhanced Patterns [Nakashole@EMNLP2012 "PATTY"]

Syntactic-Lexical-Ontological (SOL) patterns combine
1. ontological types
2. lexical surface form
3. syntactic properties

Amy Winehouse’s cosy voice in her song ‘Rehab’
Jim Morrison’s haunting voice and charisma in ‘The End’
Joan Baez’s angel-like voice in ‘Farewell Angelina’

SOL pattern: <singer> ’s ADJECTIVE voice * in <song>

Patterns can subsume each other: "wife of" => "spouse of"
… which means that we can create synsets of patterns and arrange them in a taxonomy.
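Subsumption between patterns can be approximated by comparing their support sets, i.e. the entity pairs each pattern occurs with. The sketch below assumes strict set inclusion and made-up support sets; a production system would need a softer, statistically robust test.

```python
# Hypothetical support sets: entity pairs observed with each pattern.
SUPPORT = {
    "wife of": {("Carla", "Nicolas"), ("Michelle", "Barack")},
    "spouse of": {("Carla", "Nicolas"), ("Michelle", "Barack"), ("Barack", "Michelle")},
    "married to": {("Carla", "Nicolas"), ("Michelle", "Barack"), ("Barack", "Michelle")},
}

def subsumes(general, specific):
    """'specific' => 'general' holds if every pair of the specific pattern also supports the general one."""
    return SUPPORT[specific] <= SUPPORT[general]

def synsets():
    """Group patterns that subsume each other in both directions."""
    groups = []
    for pattern in sorted(SUPPORT):
        for group in groups:
            representative = group[0]
            if subsumes(representative, pattern) and subsumes(pattern, representative):
                group.append(pattern)
                break
        else:
            groups.append([pattern])
    return groups

groups = synsets()
```

One-directional subsumption between the resulting synsets then gives the edges of the pattern taxonomy.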

Page 25:

Enhanced Patterns [Nakashole@EMNLP2012 "PATTY"]

350 000 SOL patterns with 4 million instances, accessible at: www.mpi-inf.mpg.de/yago-naga/patty

Page 26:

Open Problems and Grand Challenges

Real-time & incremental fact extraction
for continuous KB growth & maintenance
(life-cycle management over years and decades)

Extensions to ternary & higher-arity relations
events in context: who did what to/with whom when where why …?

Robust fact extraction with both high precision & recall
as highly automated (self-tuning) as possible

Extend the approaches to other languages