Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry...
Transcript of Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry...
![Page 1: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/1.jpg)
Tuffy
Scaling up Statistical Inference in Markov Logic using an RDBMS
Feng Niu, Chris Ré, AnHai Doan, and Jude Shavlik
University of Wisconsin-Madison
![Page 2: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/2.jpg)
One Slide Summary
2
Machine Reading is a DARPA program to capture knowledge expressed in free-form text
We use Markov Logic, a language that allows rules that are likely – but not certain – to be correct
Markov Logic yields high quality, but current implementations are confined to small scales
Tuffy scales up Markov Logic by orders of magnitude using an RDBMS
Similar challenges in enterprise applications
![Page 3: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/3.jpg)
Outline
v Markov Logic § Data model § Query language § Inference = grounding then search
v Tuffy the System § Scaling up grounding with RDBMS § Scaling up search with partitioning
3
![Page 4: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/4.jpg)
A Familiar Data Model
4
Relations with known facts
Relations to be predicted
Markov Logic program Datalog?
EDB IDB
Datalog + Weights ≈ Markov Logic
![Page 5: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/5.jpg)
Markov Logic*
5
v Syntax: a set of weighted logical rules § Weights: cost for rule violation
v Semantics: a distribution over possible worlds § Each possible world 𝐼 incurs total cost cost(𝐼) § Pr[𝐼] ∝ exp(−cost(𝐼)) § Thus most likely world has lowest cost
3 wrote(s,t) ∧ advisedBy(s,p) à wrote(p,t) // students’ papers tend to be co-authored by advisors
* [Richardson & Domingos 2006]
exponential models
![Page 6: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/6.jpg)
Markov Logic by Example
6
Rules
3 wrote(s,t) ∧ advisedBy(s,p) à wrote(p,t) // students’ papers tend to be co-authored by advisors
5 advisedBy(s,p) ∧ advisedBy(s,q) à p = q // students tend to have at most one advisor
∞ advisedBy(s,p) à professor(p) // advisors must be professors
Evidence wrote(Tom, Paper1) wrote(Tom, Paper2) wrote(Jerry, Paper1)
professor(John) …
Query
advisedBy(?, ?) // who advises whom
EDB IDB
![Page 7: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/7.jpg)
Inference
7
Rules
Evidence Relations
Query Relations
Inference
regular tuples
tuple probabilities
MAP
Marginal
![Page 8: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/8.jpg)
Inference
8
Rules
Evidence Relations
Query Relations
Grounding Search
1. Find tuples that are relevant (to the query)
2. Find tuples that are true (in most likely world)
![Page 9: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/9.jpg)
How to Perform Inference
v Step 1: Grounding § Instantiate the rules
9
3 wrote(s, t) ∧ advisedBy(s, p) à wrote(p, t)
3 wrote(Tom, P1) ∧ advisedBy(Tom, Jerry) à wrote (Jerry, P1) 3 wrote(Tom, P1) ∧ advisedBy(Tom, Chuck) à wrote (Chuck, P1) 3 wrote(Chuck, P1) ∧ advisedBy(Chuck, Jerry) à wrote (Jerry, P1) 3 wrote(Chuck, P2) ∧ advisedBy(Chuck, Jerry) à wrote (Jerry, P2)
…
Grounding
![Page 10: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/10.jpg)
How to Perform Inference
v Step 1: Grounding § Instantiated rules à Markov Random Field (MRF)
• A graphical structure of correlations
10
3 wrote(Tom, P1) ∧ advisedBy(Tom, Jerry) à wrote (Jerry, P1) 3 wrote(Tom, P1) ∧ advisedBy(Tom, Chuck) à wrote (Chuck, P1) 3 wrote(Chuck, P1) ∧ advisedBy(Chuck, Jerry) à wrote (Jerry, P1) 3 wrote(Chuck, P2) ∧ advisedBy(Chuck, Jerry) à wrote (Jerry, P2)
…
Nodes: Truth values of tuples
Edges: Instantiated rules
![Page 11: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/11.jpg)
How to Perform Inference
v Step 2: Search § Problem: Find most likely state of the MRF (NP-hard) § Algorithm: WalkSAT*, random walk with heuristics § Remember lowest-cost world ever seen
11
advisee advisor Tom Jerry
Tom Chuck Search
* [Kautz et al. 2006]
False
True
![Page 12: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/12.jpg)
Outline
v Markov Logic § Data model § Query language § Inference = grounding then search
v Tuffy the System § Scaling up grounding with RDBMS § Scaling up search with partitioning
12
![Page 13: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/13.jpg)
Challenge 1: Scaling Grounding
v Previous approaches § Store all data in RAM § Top-down evaluation
13
RAM size quickly becomes bottleneck
Even when runnable, grounding takes long time
[Singla and Domingos 2006] [Shavlik and Natarajan 2009]
![Page 14: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/14.jpg)
Grounding in Alchemy*
14
v Prolog-style top-down grounding with C++ loops § Hand-coded pruning, reordering strategies
3 wrote(s, t) ∧ advisedBy(s, p) à wrote(p, t)
For each person s: For each paper t: If !wrote(s, t) then continue For each person p: If wrote(p, t) then continue Emit grounding using <s, t, p>
Grounding sometimes accounts for over 90% of Alchemy’s run time
[*] reference system from UWash
![Page 15: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/15.jpg)
Grounding in Tuffy
Encode grounding as SQL queries
15
Executed and optimized by RDBMS
![Page 16: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/16.jpg)
Grounding Performance
Tuffy achieves orders of magnitude speed-up
16
Relational Classification
Entity Resolution
Alchemy [C++] 68 min 420 min
Tuffy [Java + PostgreSQL] 1 min 3 min Evidence tuples 430K 676
Query tuples 10K 16K
Rules 15 3.8K
Yes, join algorithms & optimizer are the key!
![Page 17: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/17.jpg)
Challenge 2: Scaling Search
17
Grounding Search
![Page 18: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/18.jpg)
Challenge 2: Scaling Search
v First attempt: pure RDBMS, search also in SQL § No-go: millions of random accesses
v Obvious fix: hybrid architecture
18
Problem: stuck if |MRF | > |RAM|!
RDBMS RAM
RAM RDBMS RAM
RDBMS Grounding
Search
Alchemy Tuffy-DB Tuffy
![Page 19: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/19.jpg)
Partition to Scale up Search
v Observation § MRF sometimes have multiple components
v Solution § Partition graph into components § Process in turn
19
![Page 20: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/20.jpg)
Effect of Partitioning
v Pro
v Con (?) § Motivated by scalability § Willing to sacrifice quality
20
Scalability Parallelism
What’s the effect on quality?
![Page 21: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/21.jpg)
Partitioning Hurts Quality?
21
0
1000
2000
3000
0 100 200 300
cost
time (sec)
Tuffy
Tuffy-no-part
Relational Classification
Goal: lower the cost quickly
Partitioning can actually improve quality!
Alchemy took over 1 hr. Quality similar to
Tuffy-no-part
![Page 22: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/22.jpg)
WalkSATiteration cost1 cost2 cost1 + cost2
1 5 20 25
min 5 20 25
Partitioning (Actually) Improves Quality
22
Reason:
Tuffy Tuffy-no-part
![Page 23: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/23.jpg)
WalkSATiteration cost1 cost2 cost1 + cost2
1 5 20 25
2 20 10 30
min 5 10 25
Partitioning (Actually) Improves Quality
23
Reason:
Tuffy Tuffy-no-part
![Page 24: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/24.jpg)
WalkSATiteration cost1 cost2 cost1 + cost2
1 5 20 25
2 20 10 30
3 20 5 25
min 5 5 25
Partitioning (Actually) Improves Quality
24
Reason:
Tuffy Tuffy-no-part
cost[Tuffy] = 10 cost[Tuffy-no-part] = 25
![Page 25: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/25.jpg)
100 components à 100 years of gap!
Under certain technical conditions, component-wise partitioning reduces expected time to hit an optimal state by (2 ^ #components) steps.
Partitioning (Actually) Improves Quality
25
Theorem (roughly):
![Page 26: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/26.jpg)
Further Partitioning
Partition one component further into pieces
26
Graph Scalability Quality
J Sparse
Dense
In the paper: cost-based trade-off model
J
J/L
![Page 27: Tuffy - Stanford Universityinfolab.stanford.edu/hazy/papers/tuffy-vldb2011-slides.pdfTom Jerry Search Tom Chuck * [Kautz et al. 2006] False True. Outline ! Markov Logic " Data model](https://reader033.fdocuments.us/reader033/viewer/2022060922/60ae01a648827b1c831d2734/html5/thumbnails/27.jpg)
Conclusion
v Markov Logic is a powerful framework for statistical inference § But existing implementations do not scale
v Tuffy scales up Markov Logic inference § RDBMS query processing is perfect fit for grounding § Partitioning improves search scalability and quality
v Try it out!
27
http://www.cs.wisc.edu/hazy/tuffy