IPAW ’08 Salt Lake City June 17, 2008

Provenance for Database Transformations Val Tannen University of Pennsylvania Joint work with J.N. Foster T.J. Green G. Karvounarakis IPAW ’08 Salt Lake City June 17, 2008


Page 1: IPAW ’08 Salt Lake City              June 17, 2008

Provenance for Database Transformations

Val Tannen University of Pennsylvania

Joint work with J.N. Foster T.J. Green G. Karvounarakis

IPAW ’08 Salt Lake City June 17, 2008

Page 2: IPAW ’08 Salt Lake City              June 17, 2008

Motivation

Some of the work in IPAW!

Data integration [Wang, Madnick 1990; Lee, Bressan, Madnick 1998]

Data warehousing
– Lineage [Cui, Widom, Wiener 2000]

Scientific applications
– Why-provenance [Buneman, Khanna, Tan 2001]

Collaborative data sharing networks in the ORCHESTRA system (project headed by Zack Ives)
– Trust conditions based on provenance
– Deletion propagation
[Green, Ives, Karvounarakis, T. 2007; Karvounarakis, Ives 2008]

Page 3: IPAW ’08 Salt Lake City              June 17, 2008

Database transformations, e.g., views

R (columns 1 2 3):
a b c
d b e
f g e

View V = q(R):
a c
a e
d c
d e
f e

CREATE VIEW V AS
(SELECT u.1, v.3
 FROM R u, R v
 WHERE u.3 = v.3
 UNION
 SELECT u.1, v.3
 FROM R u, R v
 WHERE u.2 = v.2)

q: white box (Ludäscher)

Page 4: IPAW ’08 Salt Lake City              June 17, 2008

Database transformations, e.g., views

R (columns 1 2 3):
a b c
d b e
f g e

View V = q(R):
a c
a e
d c
d e
f e

Datalog without recursion

V(x,z) :- R(x,_,z), R(_,_,z)
V(x,z) :- R(x,y,_), R(_,y,z)

Relational algebra

V := π₁₂((π₁₃(R) ⋈ π₂₃(R)) ∪ (π₁₂(R) ⋈ π₂₃(R)))

Page 5: IPAW ’08 Salt Lake City              June 17, 2008

Provenance questions

[Figure: the table R and the view V = q(R), with one output tuple t of V marked “???”]

Which input tuples contributed in some way to t being in the output?

Which sets of input tuples support each way for t to be in the output?

What are all possible ways in which t was caused to be in the output?

Page 6: IPAW ’08 Salt Lake City              June 17, 2008

Provenance answers…

R, with tuple ids:
a b c   p
d b e   r
f g e   s

t = (d, e) in V

Which input tuples contributed in some way to t being in the output?

{r,s}   lineage [CWW 00]

Which sets of input tuples support each way for t to be in the output?

{{r},{r,s}}   proof why-provenance [BKT 01], see [PODS 08]

What are all possible ways in which t was caused to be in the output?

2r² + rs   provenance polynomials [Green, Karvounarakis, T. 2007]

Page 7: IPAW ’08 Salt Lake City              June 17, 2008

More generality: annotated relations

Provenance: an annotation on tuples

Other instances of relations with annotated tuples

– incomplete databases (conditional tables) [Imielinski, Lipski 1984]

– probabilistic databases (independent tuple tables) [Fuhr, Rölleke 1997; Zimányi 1997; others]

– bag semantics databases (tuples with multiplicities) […SQL!]

How do annotations combine as they propagate through queries? (Is there an algebra of annotations?)

Imielinski and Lipski already computed some form of provenance!!

Page 8: IPAW ’08 Salt Lake City              June 17, 2008

Incomplete databases: boolean c-tables [IL 84]

R, annotated with boolean variables:
a b c   p
d b e   r
f g e   s

Semantics: a set of instances

I(R) = { ∅, {(a,b,c)}, {(d,b,e)}, {(f,g,e)}, {(a,b,c),(d,b,e)}, {(a,b,c),(f,g,e)}, {(d,b,e),(f,g,e)}, {(a,b,c),(d,b,e),(f,g,e)} }

Page 9: IPAW ’08 Salt Lake City              June 17, 2008

Queries on c-tables

R:
a b c   p
d b e   r
f g e   s

V(x,z) :- R(x,_,z), R(_,_,z)
V(x,z) :- R(x,y,_), R(_,y,z)

V:
a c   (p ∧ p) ∨ (p ∧ p)             = p
a e   p ∧ r                         = p ∧ r
d c   r ∧ p                         = p ∧ r
d e   (r ∧ r) ∨ (r ∧ r) ∨ (r ∧ s)   = r
f e   (s ∧ s) ∨ (s ∧ s) ∨ (s ∧ r)   = s

But… simplifying like this misses the general idea!

Page 10: IPAW ’08 Salt Lake City              June 17, 2008

Probabilistic independent-tuple tables

R, with tuple probabilities:
a b c   0.6
d b e   0.5
f g e   0.1

Events “tuple in instance” are independent

R, with event variables:
a b c   X
d b e   Y
f g e   Z

V:
a c   X
a e   X ∩ Y
d c   X ∩ Y
d e   Y
f e   Z

View V may not be representable as an independent-tuple table

view computation: similar to c-tables, but for algebra of sets
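The event-table computation can be sketched by enumerating possible worlds. This is a minimal illustration in Python, not code from the talk; the helper names `event` and `prob` are assumptions introduced here. Join is set intersection and union is set union on events.

```python
from itertools import product

# Independent tuple probabilities from the slide: X, Y, Z are the events
# "tuple is in the instance" for (a,b,c), (d,b,e), (f,g,e).
P = {"X": 0.6, "Y": 0.5, "Z": 0.1}
names = sorted(P)

# A world is the frozenset of base tuples that are present; Omega has 2^3 worlds.
worlds = [frozenset(n for n, bit in zip(names, bits) if bit)
          for bits in product([False, True], repeat=len(names))]

def event(name):
    """Event for a base tuple: the set of worlds that contain it (hypothetical helper)."""
    return frozenset(w for w in worlds if name in w)

def prob(ev):
    """Probability of an event, using independence of the base tuples."""
    total = 0.0
    for w in ev:
        weight = 1.0
        for n in names:
            weight *= P[n] if n in w else 1.0 - P[n]
        total += weight
    return total

X, Y, Z = event("X"), event("Y"), event("Z")

# Same shape as the c-table calculations, with intersection for join and union for union.
V = {
    ("a", "c"): (X & X) | (X & X),
    ("a", "e"): X & Y,
    ("d", "c"): Y & X,
    ("d", "e"): (Y & Y) | (Y & Y) | (Y & Z),
    ("f", "e"): (Z & Z) | (Z & Z) | (Z & Y),
}
probs = {t: prob(ev) for t, ev in V.items()}
```

In this sketch (a, e) and (d, c) carry the same event, so they are perfectly correlated, which illustrates why V need not be an independent-tuple table.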

Page 11: IPAW ’08 Salt Lake City              June 17, 2008

C-tables vs. Lineage

c-table calculations:
a c   (p ∧ p) ∨ (p ∧ p)
a e   p ∧ r
d c   r ∧ p
d e   (r ∧ r) ∨ (r ∧ r) ∨ (r ∧ s)
f e   (s ∧ s) ∨ (s ∧ s) ∨ (s ∧ r)

lineage calculations [CWW 00]:
a c   ({p} ∪ {p}) ∪ ({p} ∪ {p})
a e   {p} ∪ {r}
d c   {r} ∪ {p}
d e   ({r} ∪ {r}) ∪ ({r} ∪ {r}) ∪ ({r} ∪ {s})
f e   ({s} ∪ {s}) ∪ ({s} ∪ {s}) ∪ ({s} ∪ {r})

The structure of the calculations is the same!

Page 12: IPAW ’08 Salt Lake City              June 17, 2008

Another analogy, with bag semantics

R, with tuple multiplicities:
a b c   2
d b e   5
f g e   1

V, multiplicity calculations:
a c   2 · 2 + 2 · 2           = 8
a e   2 · 5                   = 10
d c   5 · 2                   = 10
d e   5 · 5 + 5 · 5 + 5 · 1   = 55
f e   1 · 1 + 1 · 1 + 1 · 5   = 7

c-table calculations:
a c   (p ∧ p) ∨ (p ∧ p)
a e   p ∧ r
d c   r ∧ p
d e   (r ∧ r) ∨ (r ∧ r) ∨ (r ∧ s)
f e   (s ∧ s) ∨ (s ∧ s) ∨ (s ∧ r)

Again, the structure of the calculations is the same!

Page 13: IPAW ’08 Salt Lake City              June 17, 2008

Abstracting the structure of these calculations

These expressions capture the abstract structure of the calculations

We will end up using these expressions as provenance!

db ops   c-tables   bags   lineage   abstract
join     ∧          ·      ∪         ·
union    ∨          +      ∪         +

abstract calculations:
a c   (p · p) + (p · p)
a e   p · r
d c   r · p
d e   (r · r) + (r · r) + (r · s)
f e   (s · s) + (s · s) + (s · r)

Page 14: IPAW ’08 Salt Lake City              June 17, 2008

Technical Development: K-relations

Annotations are elements from an algebraic structure (K, +, ·, 0, 1)

Although the notation resembles arithmetic, these are abstract operations

If D is the domain of database values, an n-ary K-relation is a function R : Dⁿ → K, where Dⁿ is the set of all possible tuples

Page 15: IPAW ’08 Salt Lake City              June 17, 2008

K-relations, annotated tables

A K-relation R : Dⁿ → K corresponds to a table:

tuple1   k1
tuple2   k2
tuple3   k3
. . .

If R(t) = k, then t “is annotated by k”

For all but finitely many tuples t, R(t) = 0; we omit the tuples annotated with 0

Page 16: IPAW ’08 Salt Lake City              June 17, 2008

Positive K-relational algebra

We define an RA+ on K-relations:

The · corresponds to joint use (join)

The + corresponds to alternative use (union and projection)

0 and 1 are used for selection predicates

Page 17: IPAW ’08 Salt Lake City              June 17, 2008

Positive K-relational algebra: details

Natural join: [R1 ⋈ R2](t) = R1(t1) · R2(t2), where t1 = t on attrs(R1) and t2 = t on attrs(R2)

Union: [R1 ∪ R2](t) = R1(t) + R2(t)

Projection: [π_V R](t) = Σ R(t'), summing over all t' with t' = t on V and R(t') ≠ 0

Selection: [σ_P R](t) = R(t) · P(t), where P(t) = 0 or 1
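These definitions can be sketched in a few lines of Python. The sketch below is illustrative only: the `Semiring` class and the helper `join_project` (a join followed by a projection) are names assumed here, not part of the talk. It evaluates the running view V over the bag semiring N and the Boolean semiring B.

```python
from collections import defaultdict

class Semiring:
    """A commutative semiring (K, +, ·, 0, 1) given by its operations (illustrative)."""
    def __init__(self, add, mul, zero, one):
        self.add, self.mul, self.zero, self.one = add, mul, zero, one

NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)             # bags
BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)  # sets

def union(K, r1, r2):
    """[R1 ∪ R2](t) = R1(t) + R2(t); tuples annotated with 0 are omitted."""
    out = defaultdict(lambda: K.zero)
    for rel in (r1, r2):
        for t, k in rel.items():
            out[t] = K.add(out[t], k)
    return {t: k for t, k in out.items() if k != K.zero}

def join_project(K, r1, r2, match, result):
    """Join r1 with r2 where match(t1, t2) holds, then project with result(t1, t2).
    Joint use multiplies annotations; tuples that collapse under projection are added."""
    out = defaultdict(lambda: K.zero)
    for t1, k1 in r1.items():
        for t2, k2 in r2.items():
            if match(t1, t2):
                key = result(t1, t2)
                out[key] = K.add(out[key], K.mul(k1, k2))
    return {t: k for t, k in out.items() if k != K.zero}

def view(K, R):
    # V(x,z) :- R(x,_,z), R(_,_,z)   and   V(x,z) :- R(x,y,_), R(_,y,z)
    pick = lambda u, v: (u[0], v[2])
    q1 = join_project(K, R, R, lambda u, v: u[2] == v[2], pick)
    q2 = join_project(K, R, R, lambda u, v: u[1] == v[1], pick)
    return union(K, q1, q2)

R_bag = {("a", "b", "c"): 2, ("d", "b", "e"): 5, ("f", "g", "e"): 1}
V_bag = view(NAT, R_bag)
V_set = view(BOOL, {t: True for t in R_bag})
```

With the multiplicities 2, 5, 1 from the bag-semantics slide, `V_bag` reproduces the multiplicities 8, 10, 10, 55, 7, and `V_set` is the ordinary set-semantics view.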


Page 18: IPAW ’08 Salt Lake City              June 17, 2008

RA+ identities imply semiring structure!

Common RA+ identities

– Union and join are associative, commutative

– Join distributes over union

– etc. (but not idempotence!)

These identities hold for RA+ on K-relations iff (K, +, ·, 0, 1) is a commutative semiring

Page 19: IPAW ’08 Salt Lake City              June 17, 2008

Semiring Bestiary

• (B, ∨, ∧, ⊥, ⊤)   Usual rel. alg. (sets)

• (N, +, ·, 0, 1)   Bag semantics

• (PosBool(X), ∨, ∧, ⊥, ⊤)   Boolean c-tables, also minimal why-provenance [BKT 01]

• (P(Ω), ∪, ∩, ∅, Ω)   Event tables (prob. db)

• (P(P(X)), ∪, ⋓, ∅, {∅})   Proof why-provenance, where A ⋓ B := {a ∪ b : a ∈ A, b ∈ B}

• (P(X), ∪, ∪, ∅, ∅)★   Lineage

• (N[X], +, ·, 0, 1)   Provenance polynomials

Page 20: IPAW ’08 Salt Lake City              June 17, 2008

Provenance polynomials

X = {p, r, s, …}: indeterminates (provenance “tokens” for base tuples)

N[X]: multivariate polynomials with coefficients in N and indeterminates in X

(N[X], +, ·, 0, 1) is the free commutative semiring generated by X; its elements abstract calculations in all semirings

The polynomials capture the propagation of provenance through (positive) relational algebra in the most general way allowed by commutative semiring-based semantics
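One way to make this concrete is a tiny polynomial representation, sketched here in Python under assumed names (`var`, `add`, `mul`, `evaluate` are illustrative, not from the talk). A polynomial is a `Counter` from monomials to natural-number coefficients, and `evaluate` is the homomorphic extension of a valuation into another semiring.

```python
from collections import Counter

# A polynomial in N[X] is a Counter mapping monomials to coefficients in N;
# a monomial is a sorted tuple of variables, e.g. ("r", "r") stands for r^2.
def var(x):
    return Counter({(x,): 1})

def add(p, q):
    return p + q

def mul(p, q):
    out = Counter()
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

def evaluate(poly, valuation, plus, times, zero, one):
    """Homomorphic extension of `valuation` from X to all of N[X]."""
    total = zero
    for mono, coeff in poly.items():
        term = one
        for x in mono:
            term = times(term, valuation[x])
        for _ in range(coeff):      # a coefficient n means an n-fold sum
            total = plus(total, term)
    return total

r, s = var("r"), var("s")
# Annotation of (d, e) in V: (r·r) + (r·r) + (r·s) = 2r^2 + rs
de = add(add(mul(r, r), mul(r, r)), mul(r, s))

# Evaluating in N with the multiplicities r = 5, s = 1 from the bag example.
de_bag = evaluate(de, {"r": 5, "s": 1}, lambda a, b: a + b, lambda a, b: a * b, 0, 1)
```

Evaluating the same polynomial with Boolean operations instead would give the c-table condition for (d, e).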


Page 21: IPAW ’08 Salt Lake City              June 17, 2008

Provenance calculations

R:
a b c   p
d b e   r
f g e   s

V:
       lineage   proof why-prov.   minimal why-prov.   boolean c-table annot.   provenance polynomials
a c    {p}       {{p}}             {{p}}               p                        2p²
a e    {p,r}     {{p,r}}           {{p,r}}             p ∧ r                    pr
d c    {p,r}     {{p,r}}           {{p,r}}             p ∧ r                    pr
d e    {r,s}     {{r},{r,s}}       {{r}}               r                        2r² + rs
f e    {r,s}     {{r,s},{s}}       {{s}}               s                        2s² + rs

minimal why-prov. ≈ boolean c-table annot.

2r² + rs: three derivations; two of them use r twice, and the third uses r and s once each

Page 22: IPAW ’08 Salt Lake City              June 17, 2008

Trust assessment

p: certified by Moe
r: certified by Larry
s: certified by Curly

R:
a b c   p
d b e   r
f g e   s

V:
a c   2p²         2 alternatives, both need Moe, twice
a e   pr          Needs both Moe and Larry
d c   pr          Needs both Moe and Larry
d e   2r² + rs    One alternative needs Larry and Curly; two others only need Larry, twice
f e   2s² + rs

Which output tuples can be trusted after Larry is jailed?
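The question can be answered by evaluating each provenance polynomial in the Boolean semiring, with each token mapped to whether its certifier is still trusted. The Python sketch below is illustrative; the dictionary encoding of polynomials and the function name `trusted` are assumptions made here.

```python
# Provenance polynomials of V as {monomial: coefficient}; a monomial is a
# string of tokens, so "rr" stands for r^2.
V = {
    ("a", "c"): {"pp": 2},
    ("a", "e"): {"pr": 1},
    ("d", "c"): {"pr": 1},
    ("d", "e"): {"rr": 2, "rs": 1},
    ("f", "e"): {"ss": 2, "rs": 1},
}

def trusted(poly, trust):
    """Boolean evaluation: true if some monomial uses only trusted tokens."""
    return any(all(trust[x] for x in mono) for mono in poly)

# Larry (r) is jailed; Moe (p) and Curly (s) are still trusted.
trust = {"p": True, "r": False, "s": True}
still_trusted = sorted(t for t, poly in V.items() if trusted(poly, trust))
```

Under this valuation only (a, c) and (f, e) remain trusted: every derivation of the other tuples needs Larry.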

Page 23: IPAW ’08 Salt Lake City              June 17, 2008

A glimpse at work by T.J. Green: Provenance and Query Optimization

• Many kinds of semiring-based provenance annotations to choose from:
– Lineage
– Proof why-provenance
– Minimal why-provenance
– Provenance polynomials
– ...

• They keep track of more/less information

• A fundamental question, asked repeatedly by Peter Buneman: how does this affect query optimization?

Page 24: IPAW ’08 Salt Lake City              June 17, 2008

Choice of K Affects Query Optimization

K = N (bag semantics) differs from K = B (set semantics)

e.g., the conjunctive queries

Q1(x) :- R(x,y), R(x,z)
Q2(u) :- R(u,v)

are set-equivalent, but not bag-equivalent
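A small instance makes the difference visible. In this Python sketch (function names and the sample relation are chosen here for illustration), both queries are evaluated once with the bag operations of N and once with the Boolean operations of B.

```python
def q1(R, plus, times, zero):
    """Q1(x) :- R(x,y), R(x,z)"""
    out = {}
    for (x, y), k1 in R.items():
        for (x2, z), k2 in R.items():
            if x == x2:
                out[x] = plus(out.get(x, zero), times(k1, k2))
    return out

def q2(R, plus, times, zero):
    """Q2(u) :- R(u,v)"""
    out = {}
    for (u, v), k in R.items():
        out[u] = plus(out.get(u, zero), k)
    return out

bag = (lambda a, b: a + b, lambda a, b: a * b, 0)          # K = N
sets = (lambda a, b: a or b, lambda a, b: a and b, False)  # K = B

R_tuples = [("a", "b"), ("a", "c")]   # illustrative instance
R_bag = {t: 1 for t in R_tuples}
R_set = {t: True for t in R_tuples}

bag_answers = (q1(R_bag, *bag), q2(R_bag, *bag))
set_answers = (q1(R_set, *sets), q2(R_set, *sets))
```

On this instance the set answers coincide, while under bag semantics Q1 returns a with multiplicity 4 and Q2 returns it with multiplicity 2.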


Page 25: IPAW ’08 Salt Lake City              June 17, 2008

A Hierarchy of Semiring Provenance (1)

• Provenance polynomials (N[X], +, ·, 0, 1): tracks calculations abstractly; most general
e.g., 2p²r + 3ps + ps³

• Drop coefficients to get (B[X], +, ·, 0, 1)
p²r + ps + ps³

• Drop exponents to get proof why-prov. (P(P(X)), ∪, ⋓, ∅, {∅})
{{p,r}, {p,s}}

• Flatten set-of-sets to get lineage
{p,r,s}

• Drop, flatten, etc. correspond to surjective semiring homomorphisms
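The chain of "drop" and "flatten" steps can be sketched on the example polynomial. The encoding and function names in this Python sketch are assumptions for illustration; each function mirrors one step of the hierarchy.

```python
# 2p^2r + 3ps + ps^3 as {monomial: coefficient}; a monomial is a tuple of
# (variable, exponent) pairs.
poly = {
    (("p", 2), ("r", 1)): 2,
    (("p", 1), ("s", 1)): 3,
    (("p", 1), ("s", 3)): 1,
}

def drop_coefficients(poly):
    """N[X] -> B[X]: keep only which monomials occur."""
    return {mono for mono, coeff in poly.items() if coeff > 0}

def drop_exponents(monomials):
    """B[X] -> proof why-provenance: each monomial becomes its set of variables."""
    return {frozenset(x for x, _ in mono) for mono in monomials}

def flatten(witnesses):
    """Proof why-provenance -> lineage: union of all witness sets."""
    return frozenset().union(*witnesses)

bx = drop_coefficients(poly)   # p^2r + ps + ps^3
why = drop_exponents(bx)       # {{p,r}, {p,s}}
lineage = flatten(why)         # {p,r,s}
```

Note how ps and ps³ collapse to the same witness set {p,s} once exponents are dropped, so each step keeps strictly less information.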

Page 26: IPAW ’08 Salt Lake City              June 17, 2008

A Hierarchy of Semiring Provenance (2)

Definition: K1 ≼_L K2 means that for all queries P, Q in language L, P ≡_K2 Q implies P ≡_K1 Q

Languages of interest: CQ and UCQ (equivalent to RA+)

Definition: K1 ≈_L K2 means K1 ≼_L K2 and K2 ≼_L K1

Proposition: If there exists a surjective homomorphism h : K2 ↠ K1 then K1 ≼_UCQ K2

Proposition (from [GKT 07]): If K is a distributive lattice then B ≈_UCQ K

(In particular B ≈_UCQ PosBool(X))

Page 27: IPAW ’08 Salt Lake City              June 17, 2008

A Hierarchy of Semiring Provenance (3)

Definition: A semiring is positive if 0 ≠ 1, a + b = 0 implies a = 0 and b = 0, and a · b = 0 implies a = 0 or b = 0

All the semirings we consider are positive.

Proposition: For any positive K (and “big enough” X), B ≼_UCQ K ≼_UCQ N[X]

Moreover:

Proposition (Provenance Hierarchy): B ≼_UCQ lineage ≼_UCQ proof why-prov. ≼_UCQ B[X] ≼_UCQ N[X]

Page 28: IPAW ’08 Salt Lake City              June 17, 2008

Separating the Models for ≡ of CQs

B ≺_CQ lineage:
Q1(x,y) :- R(x,y), R(x,z)
Q2(x,y) :- R(x,y)
Q1 ≡_B Q2 but Q1 ≢_lin Q2

lineage ≺_CQ why:
Q1(x) :- R(x,y), R(x,z)
Q2(x) :- R(x,y)
Q1 ≡_lin Q2 but Q1 ≢_why Q2

Page 29: IPAW ’08 Salt Lake City              June 17, 2008

Summary: Provenance Hierarchy

CQs
⊑_K:   B ≈ PosB.(B) ≺ Lineage ≺ N ≺ Why-Pr. ≺ B[X] ≈ N[X]
≡_K:   B ≈ PosB.(B) ≺ Lineage ≺ N ≈ Why-Pr. ≈ B[X] ≈ N[X]

UCQs
⊑_K:   B ≈ PosB.(B) ≺ Lineage ≺ Why-Pr. ≺ B[X] ≺ N[X]
≡_K:   B ≈ PosB.(B) ≺ Lineage ≺ Why-Pr. ≺ B[X] ≺ N[X]

More importantly, Green’s results also show decidability for containment and equivalence of CQs and UCQs under the various provenance semantics

Page 30: IPAW ’08 Salt Lake City              June 17, 2008

Extension to annotated XML

• Data model: unordered XML data with semiring annotations (K-UXML)

• Query language: positive, unordered XQuery fragment (K-UXQuery)

• Sanity checks: agrees with encoded relational queries, bag semantics, probabilistic XML, ...

• Applications: security, incomplete XML databases, ...


Page 31: IPAW ’08 Salt Lake City              June 17, 2008

K-UXML

• No attributes, no text values, no repeated children (inessential); no order (essential!)

• Each subtree decorated with a value k from semiring K (1 “neutral,” 0 “not present”)

• K-collection: a finite set of elements annotated with values from K

• The child subtrees of a node form a K-collection


Page 32: IPAW ’08 Salt Lake City              June 17, 2008

K-UXML Example

[Figure: a K-UXML tree rooted at a, with elements a, b, c, d; subtrees are annotated with x1, x2, y1, y2, y3, and unannotated subtrees carry the annotation 1]

Annotations are on elements of K-collections. There are 5 K-collections in this tree (all colored differently).

To annotate whole tree, must include in singleton K-collection.

Page 33: IPAW ’08 Salt Lake City              June 17, 2008

K-UXQuery Semantics: for-Loops

(Notation: l^k[…] is an element l annotated with k, with its children in brackets.)

Source, $S: the K-collection { a^x[ d^u ], b^y[ d^v, e^w ], c^z[ f ] }

Query: for $t in $S return $t/*

Computation: x · { d^u } + y · { d^v, e^w } + z · { f }

Answer: { d^(x·u + y·v), e^(y·w), f^z }
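The for-loop computation can be sketched with K = N. This Python sketch is a simplification made for illustration: children are leaves identified by their label, the numeric values for x, y, z, u, v, w are invented here, and `for_return_children` is a hypothetical helper name.

```python
from collections import defaultdict

# Illustrative values in K = N (not from the slide).
x, y, z, u, v, w = 2, 5, 13, 3, 7, 11

# Each element of $S maps to (its annotation, K-collection of its leaf children).
S = {
    "a": (x, {"d": u}),
    "b": (y, {"d": v, "e": w}),
    "c": (z, {"f": 1}),   # the unannotated f carries the neutral annotation 1
}

def for_return_children(source):
    """for $t in $S return $t/* : multiply each child's annotation by its
    parent's annotation, and add the results for equal elements."""
    out = defaultdict(int)
    for parent_ann, children in source.values():
        for label, child_ann in children.items():
            out[label] += parent_ann * child_ann
    return dict(out)

answer = for_return_children(S)
```

With these values d receives x·u + y·v = 41, e receives y·w = 55, and f receives z = 13, matching the shape of the answer on the slide.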

Page 34: IPAW ’08 Salt Lake City              June 17, 2008

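K-UXQuery Semantics: // Operator

Source, $S: [Figure: the K-UXML example tree rooted at a, with subtrees annotated x1, x2, y1, y2, y3]

Query: <r> $S//c </r>

Answer: an r element with two children: a leaf c annotated x1·y3 + y1·y2, and the c subtree annotated y1 (with descendants d, a, c annotated y2, and b annotated x2)

• Annotation of result is a sum over products of annotations along paths to root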

Page 35: IPAW ’08 Salt Lake City              June 17, 2008

Application: Access Control

• Data annotated with clearance levels from total order C : P < C < S < T < 0

• Joint use of data (·) requires access to both (max of clearances); alternative use of data (+) requires access to either (min of clearances)

• (C, min, max, 0, P) is a commutative semiring

(Notation: l^k[…] is an element l annotated with k, with its children in brackets.)

Source, $S: a[ b^C[ d^C ], c^C[ d^S, e^T ] ]

Query: <p> $S/*/* </p>

Answer: p[ d^min(max(P,C,C), max(P,C,S)), e^max(P,C,T) ] = p[ d^C, e^T ]
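The clearance semiring is easy to sketch directly. In the Python sketch below, the names `LEVELS`, `plus`, `times`, and `path` are assumptions for illustration, and the semiring's 0 is spelled "ZERO"; the two annotations of the answer are recomputed from the paths in the source tree.

```python
# Clearance levels in increasing order; ZERO stands for the semiring's 0
# ("never available") and P is the semiring's 1.
LEVELS = ["P", "C", "S", "T", "ZERO"]
RANK = {name: i for i, name in enumerate(LEVELS)}

def plus(a, b):
    """Alternative use: the lower of the two clearances suffices (min)."""
    return min(a, b, key=RANK.__getitem__)

def times(a, b):
    """Joint use: the higher of the two clearances is required (max)."""
    return max(a, b, key=RANK.__getitem__)

ZERO, ONE = "ZERO", "P"

def path(*annotations):
    """Product of the annotations along one path from the root."""
    result = ONE
    for a in annotations:
        result = times(result, a)
    return result

# Query <p> $S/*/* </p>: d is reachable along two paths, e along one.
d = plus(path("P", "C", "C"), path("P", "C", "S"))
e = path("P", "C", "T")
```

Here `d` comes out as C and `e` as T, in agreement with the answer p[ d^C, e^T ].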

Page 36: IPAW ’08 Salt Lake City              June 17, 2008

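Security Condition: Non-Interference

• For any given clearance level (e.g., C), want the following diagram to commute:

[Diagram: running the query and then erasing everything above C gives the same result as erasing everything above C and then running the query. Example, writing l^k[…] for an element l annotated with k: the source a[ b^C[ d^C ], c^C[ d^S, e^T ] ] has query result p[ d^C, e^T ]; after erasing above C, the source a[ b^C[ d^C ], c^C ] has query result p[ d^C ].]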

Page 37: IPAW ’08 Salt Lake City              June 17, 2008

Application: Incomplete XML

• Data annotated with Boolean expressions; tree T represents set of possible worlds Rep(T)

[Figure: T is a tree rooted at a whose c subtrees are annotated with the Boolean variables x, y, z; Rep(T) is the resulting set of 7 possible worlds]

Page 38: IPAW ’08 Salt Lake City              June 17, 2008

Correctness: Possible Worlds

• For every incomplete tree T, and every UXQuery query q, want this diagram to commute:

T     --Rep-->  Rep(T)
| q               | q
v                 v
q(T)  --Rep-->  q(Rep(T)) = Rep(q(T))

Page 39: IPAW ’08 Salt Lake City              June 17, 2008

Commutation with Homomorphisms

• Ex: access control
h_c : C → C,   h_c(k) := if k ≤ c then k else 0

• Ex: incomplete databases
ν : Vars → B,   Eval_ν : PosBool(Vars) → B

• Ex: duplicate elimination
δ : N → B,   δ(k) := if k = 0 then ⊥ else ⊤

Theorem: Let h : K1 → K2 be a semiring homomorphism. Then for any RA+/NRC/UXQuery query q, and for any K1-instance D, we have

h(q(D)) = q(h(D)).
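The theorem can be checked on a small instance for duplicate elimination from N to B. This Python sketch uses an illustrative query and instance chosen here; `project_first`, `delta`, and `apply_hom` are assumed names, not part of the talk.

```python
def project_first(R, plus, zero):
    """Q(u) :- R(u, v), with annotations of collapsing tuples combined by `plus`."""
    out = {}
    for (u, v), k in R.items():
        out[u] = plus(out.get(u, zero), k)
    return out

def delta(k):
    """Duplicate elimination: maps 0 to false and every other multiplicity to true."""
    return k != 0

def apply_hom(h, R):
    """Apply a homomorphism to every annotation of an annotated relation."""
    return {t: h(k) for t, k in R.items()}

D = {("a", "b"): 2, ("a", "c"): 3, ("d", "b"): 1}   # an N-instance (a bag)

lhs = apply_hom(delta, project_first(D, lambda a, b: a + b, 0))       # h(q(D))
rhs = project_first(apply_hom(delta, D), lambda a, b: a or b, False)  # q(h(D))
```

Both sides give the same set-semantics answer, so in this case it does not matter whether duplicates are eliminated before or after the query.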

Page 40: IPAW ’08 Salt Lake City              June 17, 2008

Provenance Polynomials are Universal

Corollary: The semantics of RA+/NRC/UXQuery evaluation on K-instances for any commutative semiring K factors through evaluation using provenance polynomials N[X].

e.g., for any K-UXML document D, for any K-UXQuery q, we have

q(D) = Eval_ν(q(D’))

where
• D’ is obtained by replacing K-annotations in D with fresh variables from X
• ν : X → K is the corresponding valuation
• Eval_ν : N[X] → K is the unique semiring homomorphism such that for the one-variable monomials, Eval_ν(x) = ν(x).

Page 41: IPAW ’08 Salt Lake City              June 17, 2008

Datalog?

The semiring structure on annotations works out nicely for positive relational algebra, positive nested relational calculus (NRC), a large fragment of XQuery.

What more do we need to capture recursion, i.e., for Datalog queries?

ω-complete semirings with ω-continuous operations (so fixed points exist!): ω-continuous semirings

N is not, but N^∞ ≜ N ∪ {∞} is.

Page 42: IPAW ’08 Salt Lake City              June 17, 2008

Datalog may have infinite derivations!

Polynomials do not suffice, since they are finite!

Nonetheless, the calculations are finitely representable through a system of equations

The equations have a least solution in any ω-continuous semiring

For provenance, we must generalize from polynomials to formal power series (in general, infinitely many monomials)
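The role of infinite derivations can be sketched by iterating a recursive program over a cyclic graph. The transitive-closure program, the two-node graph, and the function names in this Python sketch are illustrative choices made here, not taken from the talk.

```python
def step(E, T, plus, times, zero):
    """One round of the immediate-consequence operator for
    T(x,y) :- E(x,y)   and   T(x,y) :- E(x,z), T(z,y)."""
    new = dict(E)
    for (x, z), k1 in E.items():
        for (z2, y), k2 in T.items():
            if z == z2:
                new[(x, y)] = plus(new.get((x, y), zero), times(k1, k2))
    return new

def iterate(E, plus, times, zero, rounds):
    """Apply `step` a fixed number of times, starting from the empty relation."""
    T = {}
    for _ in range(rounds):
        T = step(E, T, plus, times, zero)
    return T

edges = [("a", "b"), ("b", "a")]   # a two-node cycle
bool_ops = (lambda a, b: a or b, lambda a, b: a and b, False)
nat_ops = (lambda a, b: a + b, lambda a, b: a * b, 0)

E_bool = {e: True for e in edges}
E_nat = {e: 1 for e in edges}

# In B the iteration stabilizes after finitely many rounds.
T_bool = iterate(E_bool, *bool_ops, rounds=3)
# In N the number of derivations of T(a,a) keeps growing with the rounds.
counts = [iterate(E_nat, *nat_ops, rounds=n).get(("a", "a"), 0) for n in (2, 4, 6)]
```

Over B the third iterate is already a fixed point, whereas over N the count for T(a,a) grows without bound, which is why the least solution needs the extra element ∞.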


Page 43: IPAW ’08 Salt Lake City              June 17, 2008

Related Work

• Foundations: semirings/systems of equations/formal power series first used in CS in theory of formal languages [Chomsky, Schützenberger 1963]

• Our work is related to and shares similar goals with “Debugging schema mappings with routes” [Chiticariu, Tan VLDB 2006], where “routes” are like minimal finite portions of our provenance polynomials

Page 44: IPAW ’08 Salt Lake City              June 17, 2008

More Related Work

• Bag semantics for NRC [Libkin&Wong 97]

• Incomplete XML [Kanza+ 99, Abiteboul+ 06]

• Probabilistic XML [Nierman&Jagadish 02, van Keulen+ 05, Abit.&Senellart 06, Sen.&Abit. 07, Hung+ 07]

• XML provenance [Buneman+ 01]

• NRC provenance [Hidders+ 07]

• Soft CSPs [Bistarelli et al]

• Semiring-annotated XPath [Grahne+ 07]

• Negation, expressiveness of RAK [Geerts&Poggi 08]


Page 45: IPAW ’08 Salt Lake City              June 17, 2008

Related Work for T.J. Green

• Already mentioned
– Set-cont. and equiv. of CQs [Chandra&Merlin 77]
– Set-cont. and equiv. of UCQs [Sagiv&Yannakakis 80]
– Bag-cont. of UCQs [Ioannidis&Ramakrishnan 95]
– Bag-equiv. of CQs [Chaudhuri&Vardi 93]

• Containment of CQs with where-provenance [Tan 03]

• Bag-set semantics [CV 93], combined semantics [Cohen 06]
– For K-relations: support operator of [Geerts&Poggi 08] generalizes duplicate elimination

• Bag-containment of CQs [Jayram+ 06]

Page 46: IPAW ’08 Salt Lake City              June 17, 2008

Conclusion

• Annotations forming a commutative semiring seem to fit well with database transformations expressed in positive query languages, be they relational, even recursive, or for complex values or tree data.

• We obtained explanations for a number of puzzles related to why-provenance in a broad sense.

• Provenance polynomials also capture tuple multiplicity and serve well systems such as Orchestra.

• Big open questions: negation (although see work by Geerts, Poggi) and order

Page 47: IPAW ’08 Salt Lake City              June 17, 2008

Future Work

I have the feeling that we have only scratched the surface so far…

I am working on marrying this approach with data exchange, with a broader perspective on security, with integrity constraints, with a broader perspective on mapping/view maintenance and update…
