1 Special Cases: Query Semantics: (“Marginal Probabilities”) Run query Q against each...

13
1 Special Cases: Query Semantics: (“Marginal Probabilities”) Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities of all instances D where t A probabilistic database D p (compactly) encodes a probability distribution over a finite set of deterministic database instances D i . WorksAt(Sub, Obj) Jeff Stanford Jeff Princeton WorksAt(Sub, Obj) Jeff Stanford WorksAt(Sub, Obj) Jeff Princeton WorksAt(Sub, Obj) D 1 : 0.42 D 2 : 0.18 D 3 : 0.28 D 4 : 0.12 WorksAt(Sub, Obj) p Jef f Stanford 0.6 Jef f Princeton 0.7 (1) D p tuple-independent (II) D p block-independent Note: (I) and (II) are not equivalent! Probabilistic Database WorksAt(Sub, Ob j) p Jef f Stanford 0.6 Princeton 0.4

Transcript of 1 Special Cases: Query Semantics: (“Marginal Probabilities”) Run query Q against each...

Page 1: 1  Special Cases:  Query Semantics: (“Marginal Probabilities”)  Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities.

1

Special Cases:

Query Semantics: (“Marginal Probabilities”) Run query Q against each instance Di; for each

answer tuple t, sum up the probabilities of all instances Di where t exists.

A probabilistic database Dp (compactly) encodes a probability distribution over a finite set of deterministic

database instances Di.

WorksAt(Sub, Obj)

Jeff Stanford

Jeff Princeton

WorksAt(Sub, Obj)

Jeff Stanford

WorksAt(Sub, Obj)

Jeff Princeton

WorksAt(Sub, Obj)

D1: 0.42 D2: 0.18 D3: 0.28 D4: 0.12

WorksAt(Sub, Obj)

p

Jeff Stanford 0.6

Jeff Princeton 0.7

(1) Dp tuple-independent (II) Dp block-independent

Note: (I) and (II)

are not equivalent!

Probabilistic Database

WorksAt(Sub, Obj)

p

Jeff Stanford 0.6

Princeton 0.4

Page 2: 1  Special Cases:  Query Semantics: (“Marginal Probabilities”)  Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities.

Soft Rules vs. Hard Rules

(Soft) Deduction Rules vs. (Hard) Consistency Constraints

People may live in more than one placelivesIn(x,y) marriedTo(x,z) livesIn(z,y)livesIn(x,y) hasChild(x,z) livesIn(z,y)

People are not born in different places/on different datesbornIn(x,y) bornIn(x,z) y=zbornOn(x,y) bornOn(x,z) y=z

People are not married to more than one person (at the same time, in most countries?)

marriedTo(x,y,t1) marriedTo(x,z,t2) y≠z

disjoint(t1,t2)

[0.8]

[0.5]

Page 3: 1  Special Cases:  Query Semantics: (“Marginal Probabilities”)  Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities.

Basic Types of Inference

MAP Inference

Find the most likely assignment to query variables y under a given evidence x.

Compute: arg max y P( y | x) (NP-complete

for MaxSAT)

Marginal/Success Probabilities

Probability that query y is true in a random world under a given evidence x.

Compute: ∑y P( y | x) (#P-complete already for conjunctive queries)

Page 4: 1  Special Cases:  Query Semantics: (“Marginal Probabilities”)  Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities.

[Yahya,Theobald: RuleML’11 Dylla,Miliaraki,Theobald: ICDE’13]

Deductive Grounding with Lineage (SLD Resolution in Datalog/Prolog)

\/

/\

graduatedFrom(Surajit,

Princeton)[0.7]

hasAdvisor(Surajit,Jeff)

[0.8]

worksAt(Jeff,Stanford)[0.9]

graduatedFrom(Surajit,

Stanford)[0.6]

Query graduatedFrom(Surajit, y)

C D

A B

A(B (CD)) A(B (CD))

graduatedFrom(Surajit, Princeton)

graduatedFrom(Surajit, Stanford)Q1 Q2

Rules hasAdvisor(x,y) worksAt(y,z) graduatedFrom(x,z)

[0.4]

graduatedFrom(x,y) graduatedFrom(x,z) y=zBase FactsgraduatedFrom(Surajit, Princeton)

[0.7]graduatedFrom(Surajit, Stanford)

[0.6]graduatedFrom(David, Princeton)

[0.9]hasAdvisor(Surajit, Jeff) [0.8]hasAdvisor(David, Jeff) [0.7]worksAt(Jeff, Stanford) [0.9]type(Princeton, University) [1.0]type(Stanford, University) [1.0]type(Jeff, Computer_Scientist) [1.0]type(Surajit, Computer_Scientist)

[1.0]type(David, Computer_Scientist)

[1.0]

Page 5: 1  Special Cases:  Query Semantics: (“Marginal Probabilities”)  Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities.

Lineage & Possible Worlds

1) Deductive Grounding Dependency graph of the query Trace lineage of individual query

answers

2) Lineage DAG (not in CNF),

consisting of Grounded soft & hard rules Probabilistic base facts

3) Probabilistic Inference Compute marginals:

P(Q): sum up the probabilities of all possible worlds that entail the query answers’ lineage

P(Q|H): drop “impossible worlds”

\/

/\

graduatedFrom(Surajit,

Princeton)[0.7]

hasAdvisor(Surajit,Jeff)

[0.8]

worksAt(Jeff,Stanford)[0.9]

graduatedFrom(Surajit,

Stanford)[0.6]

Query graduatedFrom(Surajit, y)

0.7x(1-0.888)=0.078 (1-0.7)x0.888=0.266

1-(1-0.72)x(1-0.6)=0.888

0.8x0.9=0.72

C D

A B

A(B (CD)) A(B (CD))

graduatedFrom(Surajit, Princeton)

graduatedFrom(Surajit, Stanford)Q1 Q2

[Das Sarma,Theobald,Widom: ICDE’08 Dylla,Miliaraki,Theobald: ICDE’13]

Page 6: 1  Special Cases:  Query Semantics: (“Marginal Probabilities”)  Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities.

Possible Worlds Semantics

A:0.7

B:0.6

C:0.8

D:0.9

Q2: A(B(CD))

P(W)

1 1 1 1 0 0.7x0.6x0.8x0.9 = 0.3024

1 1 1 0 0 0.7x0.6x0.8x0.1 = 0.0336

1 1 0 1 0 … = 0.0756

1 1 0 0 0 … = 0.0084

1 0 1 1 0 … = 0.2016

1 0 1 0 0 … = 0.0224

1 0 0 1 0 … = 0.0504

1 0 0 0 0 … = 0.0056

0 1 1 1 1 0.3x0.6x0.8x0.9 = 0.1296

0 1 1 0 1 0.3x0.6x0.8x0.1 = 0.0144

0 1 0 1 1 0.3x0.6x0.2x0.9 = 0.0324

0 1 0 0 1 0.3x0.6x0.2x0.1 = 0.0036

0 0 1 1 1 0.3x0.4x0.8x0.9 = 0.0864

0 0 1 0 0 … = 0.0096

0 0 0 1 0 … = 0.0216

0 0 0 0 0 … = 0.0024

1.0

0.2664

0.412

P(Q2)=0.2664

P(Q2|H)=0.2664 / 0.412 = 0.6466

P(Q1)=0.0784 P(Q1|H)=0.0784 / 0.412 = 0.1903

0.0784

Hard rule H: A (B (CD))

Page 7: 1  Special Cases:  Query Semantics: (“Marginal Probabilities”)  Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities.

Basic Complexity Issue

Theorem [Valiant:1979]For a Boolean expression E, computing P(E) is #P-complete.

NP = class of problems of the form “is there a witness ?” SAT#P = class of problems of the form “how many witnesses ?” #SAT

The decision problem for 2CNF is in PTIME.The counting problem for 2CNF is already #P-complete.

[Suciu & Dalvi: SIGMOD’05 Tutorial on "Foundations of Probabilistic Answers to Queries"]

Page 8: 1  Special Cases:  Query Semantics: (“Marginal Probabilities”)  Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities.

Tuple-Independent Probabilistic Database

Theorem: The query answering problem for the above join is

#P-hard.

A probabilistic database Dp (compactly) encodes a probability distribution over a finite set of deterministic

database instances Di.

WorksAt(Pers, Uni)

p

Jeff Stanford 0.6

Jeff Princeton 0.7

Probabilistic Database

IsProfessor(Pers)

P

Jeff 0.9

Located(Uni, State)

p

Stanford CA 0.8

Princeton

CA 0.5

Which professors work at a university that is located in CA?

Page 9: 1  Special Cases:  Query Semantics: (“Marginal Probabilities”)  Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities.

Probabilistic & Temporal Database

Sequenced Semantics & Snapshot Reducibility: Built-in semantics: reduce temporal-relational operators to

their non-temporal counterparts at each snapshot (i.e., time point) of the database.

Coalesce/split tuples with consecutive time intervals based on their lineages.

Non-Sequenced Semantics Queries can freely manipulate timestamps just like regular

attributes. Single temporal operator ≤T supports all of Allen’s 13 temporal relations.

Deduplicate tuples with overlapping time intervals based on their lineages.

A temporal-probabilistic database DTp (compactly) encodes a probability distribution over a finite set of

deterministic database instances Di at each time point of a finite time domain T.

BornIn(Sub,Obj)

T p

DeNiro Green- which

[1943, 1944)

0.9

DeNiro Tribeca [1998, 1999)

0.6[Dignös, Gamper, Böhlen: SIGMOD’12]

[Dylla,Miliaraki,Theobald: PVLDB’13

Wedding(Sub,Obj)

T p

DeNiro Abbott [1936, 1940)

0.3

DeNiro Abbott [1976, 1977)

0.7

Divorce(Sub,Obj)

T p

DeNiro Abbott [1988, 1989)

0.8

Page 10: 1  Special Cases:  Query Semantics: (“Marginal Probabilities”)  Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities.

Temporal Alignment & Deduplication Example

Non-Sequenced Semantics:

f1

1936 1976 1988

f2 ¬f3

f2 f3f1 ¬f3

f1 f3

(f1 f3) (f1 ¬f3)

(f1 f3) (f1 ¬f3) (f2 f3) (f2 ¬f3)

(f1 f3) (f2 ¬f3)

T

marriedTo(x,y)[tb1,Tmax) wedding(x,y)[tb1,te1) ¬divorce(x,y)[tb2,te2)

marriedTo(x,y)[tb1,te2) wedding(x,y)[tb1,te1) divorce(x,y)[tb2,te2) te1 ≤T tb2

BaseFacts

DerivedFacts

Dedupl.Facts

wedding(DeNiro,Abbott)

wedding(DeNiro,Abbott)

divorce(DeNiro,Abbott)

Tmax

f2

f3

Tmin

Page 11: 1  Special Cases:  Query Semantics: (“Marginal Probabilities”)  Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities.

0.08

0.120.16

0.40.6

‘03 ‘05 ‘07playsFor(Beckham, Real, T1)

Base Facts

DerivedFacts

0.20.20.1

0.4

‘05‘00 ‘02 ‘07playsFor(Ronaldo, Real, T2)

‘04

‘03 ‘04 ‘07‘05

playsFor(Beckham, Real, T1)Ù playsFor(Ronaldo, Real, T2)Ù overlaps(T1, T2, T3)

t3 teamMates(Beckham, Ronaldo, t3)

teamMates(Beckham, Ronaldo, T3)

Inference in Probabilistic-Temporal Databases

[Wang,Yahya,Theobald: MUD’10; Dylla,Miliaraki,Theobald: PVLDB’13]

Example using the Allen predicate overlaps

Page 12: 1  Special Cases:  Query Semantics: (“Marginal Probabilities”)  Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities.

0.40.6

‘03 ‘05 ‘07playsFor(Beckham, Real, T1)

Base Facts

DerivedFacts

playsFor(Ronaldo, Real, T2)

0.20.20.1

‘05‘00 ‘02 ‘07‘04

0.4

0.08

0.120.16

‘03 ‘04 ‘07‘05

playsFor(Zidane, Real, T3)

teamMates(Beckham, Zidane, T5)

teamMates(Ronaldo, Zidane, T6)

teamMates(Beckham, Ronaldo, T4)

Non-independent

Independent

Inference in Probabilistic-Temporal Databases

[Wang,Yahya,Theobald: MUD’10; Dylla,Miliaraki,Theobald: PVLDB’13]

Page 13: 1  Special Cases:  Query Semantics: (“Marginal Probabilities”)  Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities.

playsFor(Beckham, Real, T1)Base Facts

DerivedFacts

playsFor(Ronaldo, Real, T2)

playsFor(Zidane, Real, T3)

teamMates(Beckham, Zidane, T5)

teamMates(Ronaldo, Zidane, T6)

Non-independent

Independent

Closed and complete representation model (incl. lineage)

Temporal alignment is polyn. in the number of input intervals

Confidence computation per interval remains #P-hard

In general requires Monte Carlo approximations (Luby-Karp for DNF, MCMC-style sampling), decompositions, or top-k pruning

teamMates(Beckham, Ronaldo, T4)

Need

Lineage!

Inference in Probabilistic-Temporal Databases

[Wang,Yahya,Theobald: MUD’10; Dylla,Miliaraki,Theobald: PVLDB’13]