IPAW ’08 Salt Lake City June 17, 2008
Provenance for Database Transformations
Val Tannen University of Pennsylvania
Joint work with J.N. Foster, T.J. Green, G. Karvounarakis
Motivation
Some of the work in IPAW!
Data integration [Wang, Madnick 1990; Lee, Bressan, Madnick 1998]
Data warehousing: lineage [Cui, Widom, Wiener 2000]
Scientific applications: why-provenance [Buneman, Khanna, Tan 2001]
Collaborative data sharing networks in the ORCHESTRA system (project headed by Zack Ives): trust conditions based on provenance; deletion propagation [Green, Ives, Karvounarakis, T. 2007; Karvounarakis, Ives 2008]
Database transformations, e.g., views
R:
a b c
d b e
f g e
V:
a c
a e
d c
d e
f e
CREATE VIEW V AS
(SELECT u.1, v.3
FROM R u, R v
WHERE u.3 = v.3
UNION
SELECT u.1, v.3
FROM R u, R v
WHERE u.2 = v.2)
(the columns of R are numbered 1 2 3)
View V = q(R)
white box (Ludäscher)
Database transformations, e.g., views
(R and V as on the previous slide)
Datalog without recursion
V(x,z) :- R(x,_,z), R(_,_,z)
V(x,z) :- R(x,y,_), R(_,y,z)
View V = q(R)
Relational algebra
V := π₁₃((π₁₃(R) ⋈ π₂₃(R)) ∪ (π₁₂(R) ⋈ π₂₃(R)))
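As a minimal sketch (plain Python, set semantics only; the function name `view` is an illustrative choice, not something from the slides), the two Datalog rules can be evaluated on the three-tuple instance R to reproduce the five tuples of V:

```python
# Base relation R from the slide, as a set of 3-tuples (set semantics).
R = {("a", "b", "c"), ("d", "b", "e"), ("f", "g", "e")}

def view(rel):
    """Evaluate the two non-recursive Datalog rules for V over `rel`."""
    out = set()
    for u in rel:
        for v in rel:
            # Rule 1: V(x,z) :- R(x,_,z), R(_,_,z)   (u and v agree on column 3)
            if u[2] == v[2]:
                out.add((u[0], u[2]))
            # Rule 2: V(x,z) :- R(x,y,_), R(_,y,z)   (u and v agree on column 2)
            if u[1] == v[1]:
                out.add((u[0], v[2]))
    return out

V = view(R)
print(sorted(V))
# [('a', 'c'), ('a', 'e'), ('d', 'c'), ('d', 'e'), ('f', 'e')]
```

Under set semantics the result only records which tuples are present; how often and through which inputs each tuple was derived is exactly what the provenance annotations on the following slides keep track of.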
Provenance questions
[Figure: the tables R and V of the previous slides, with View V = q(R) and one output tuple t of V marked "???"]
Which input tuples contributed in some way to t being in the output?
Which sets of input tuples support each way for t to be in the output?
What are all possible ways in which t was caused to be in the output?
Provenance answers…
R with tuple ids:
a b c   p
d b e   r
f g e   s
Output tuple t = (d, e) of V
Which input tuples contributed in some way to t being in the output?
{r,s} lineage [CWW 00]
Which sets of input tuples support each way for t to be in the output?
{{r},{r,s}} proof why-provenance [BKT 01], see [PODS 08]
What are all possible ways in which t was caused to be in the output?
2r² + rs provenance polynomials [Green, Karvounarakis, T. 2007]
More generality: annotated relations
Provenance: an annotation on tuples
Other instances of relations with annotated tuples
– incomplete databases (conditional tables) [Imielinski, Lipski 1984]
– probabilistic databases (independent tuple tables) [Fuhr, Rölleke 1997; Zimányi 1997; others]
– bag semantics databases (tuples with multiplicities) […SQL!]
How do annotations combine as they propagate through queries?
(Is there an algebra of annotations?)
Incomplete databases: boolean c-tables [IL 84]
R:
a b c   p
d b e   r
f g e   s
(p, r, s: boolean variables)
semantics: a set of instances
I(R) = { ∅, {a b c}, {d b e}, {f g e}, {a b c, d b e}, {a b c, f g e}, {d b e, f g e}, {a b c, d b e, f g e} }
Imielinski and Lipski already computed some form of provenance!!
Queries on c-tables
R:
a b c   p
d b e   r
f g e   s
V(x,z) :- R(x,_,z), R(_,_,z)
V(x,z) :- R(x,y,_), R(_,y,z)
V:
a c   (p ∧ p) ∨ (p ∧ p)             = p
a e   p ∧ r                         = p ∧ r
d c   r ∧ p                         = p ∧ r
d e   (r ∧ r) ∨ (r ∧ r) ∨ (r ∧ s)   = r
f e   (s ∧ s) ∨ (s ∧ s) ∨ (s ∧ r)   = s
But… simplifying like this misses the general idea!
Probabilistic independent-tuple tables
R with tuple probabilities:
a b c   0.6
d b e   0.5
f g e   0.1
Events “tuple in instance” are independent
R with events:
a b c   X
d b e   Y
f g e   Z
V:
a c   X
a e   X ∩ Y
d c   X ∩ Y
d e   Y
f e   Z
View V may not be representable as an independent-tuple table
view computation: similar to c-tables, but for algebra of sets
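A small sketch of why V may fail to be an independent-tuple table, assuming the tuple probabilities 0.6, 0.5, 0.1 from the slide and enumerating all possible worlds of R (the helper names `view` and `probability` are illustrative). Exact fractions are used to avoid floating-point noise:

```python
from fractions import Fraction
from itertools import product

# Independent tuple probabilities of R, taken from the slide.
prob = {
    ("a", "b", "c"): Fraction(6, 10),
    ("d", "b", "e"): Fraction(5, 10),
    ("f", "g", "e"): Fraction(1, 10),
}

def view(rel):
    """Set-semantics evaluation of the two Datalog rules for V."""
    out = set()
    for u in rel:
        for v in rel:
            if u[2] == v[2]:
                out.add((u[0], u[2]))
            if u[1] == v[1]:
                out.add((u[0], v[2]))
    return out

def probability(event):
    """Sum the weights of all possible worlds of R in which event(V) holds."""
    tuples = list(prob)
    total = Fraction(0)
    for present in product([True, False], repeat=len(tuples)):
        world = {t for t, keep in zip(tuples, present) if keep}
        weight = Fraction(1)
        for t, keep in zip(tuples, present):
            weight *= prob[t] if keep else 1 - prob[t]
        if event(view(world)):
            total += weight
    return total

p_ae = probability(lambda V: ("a", "e") in V)
p_dc = probability(lambda V: ("d", "c") in V)
p_both = probability(lambda V: ("a", "e") in V and ("d", "c") in V)
print(p_ae, p_dc, p_both, p_ae * p_dc)   # 3/10 3/10 3/10 9/100
```

Both (a, e) and (d, c) carry the event X ∩ Y, so they always appear together: the joint probability is 3/10 rather than the product 9/100 that independence would require.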
C-tables vs. Lineage
c-table calculations:
a c   (p ∧ p) ∨ (p ∧ p)
a e   p ∧ r
d c   r ∧ p
d e   (r ∧ r) ∨ (r ∧ r) ∨ (r ∧ s)
f e   (s ∧ s) ∨ (s ∧ s) ∨ (s ∧ r)
lineage calculations [CWW 00]:
a c   ({p} ∪ {p}) ∪ ({p} ∪ {p})
a e   {p} ∪ {r}
d c   {r} ∪ {p}
d e   ({r} ∪ {r}) ∪ ({r} ∪ {r}) ∪ ({r} ∪ {s})
f e   ({s} ∪ {s}) ∪ ({s} ∪ {s}) ∪ ({s} ∪ {r})
The structure of the calculations is the same!
Another analogy, with bag semantics
R with tuple multiplicities:
a b c   2
d b e   5
f g e   1
V, multiplicity calculations:
a c   2 · 2 + 2 · 2             = 8
a e   2 · 5                     = 10
d c   5 · 2                     = 10
d e   5 · 5 + 5 · 5 + 5 · 1     = 55
f e   1 · 1 + 1 · 1 + 1 · 5     = 7
c-table calculations:
a c   (p ∧ p) ∨ (p ∧ p)
a e   p ∧ r
d c   r ∧ p
d e   (r ∧ r) ∨ (r ∧ r) ∨ (r ∧ s)
f e   (s ∧ s) ∨ (s ∧ s) ∨ (s ∧ r)
Again, the structure of the calculations is the same!
Abstracting the structure of these calculations
db ops   c-tables   bags   lineage   abstract
join     ∧          ·      ∪         ·
union    ∨          +      ∪         +
abstract calculations:
a c   (p · p) + (p · p)
a e   p · r
d c   r · p
d e   (r · r) + (r · r) + (r · s)
f e   (s · s) + (s · s) + (s · r)
These expressions capture the abstract structure of the calculations
We will end up using these expressions as provenance!
Technical Development: K-relations
Annotations are elements from an algebraic structure (K, +, ·, 0, 1)
If D is the domain of database values, an n-ary K-relation is a function R: Dⁿ → K (Dⁿ: all possible tuples)
Although the notation resembles arithmetic, these are abstract operations
K-relations, annotated tables
K-relation corresponds to table: R: Dⁿ → K
tuple1   k1
tuple2   k2
tuple3   k3
. . .
If R(t) = k, then t “is annotated by k”
For all but finitely many tuples t, R(t) = 0; we omit the tuples annotated with 0
Positive K-relational algebra
We define an RA+ on K-relations:
The · corresponds to joint use (join)
The + corresponds to alternative use (union and projection)
0 and 1 are used for selection predicates
Positive K-relational algebra: details
Natural join: [R1 ⋈ R2](t) = R1(t1) · R2(t2), where t1 = t on attrs(R1) and t2 = t on attrs(R2)
Union: [R1 ∪ R2](t) = R1(t) + R2(t)
Projection: [π_V R](t) = Σ R(t'), summing over all t' with t' = t on V and R(t') ≠ 0
Selection: [σ_P R](t) = R(t) · P(t), where P(t) = 0 or 1
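The definitions above can be sketched as a few generic functions parameterised by the semiring. This is an illustrative encoding, not the authors' implementation: the `Semiring` record, the `(schema, rows)` representation of a K-relation, and the attribute names "1", "2", "3" are all assumptions made for the example. The two branches are projected onto columns 1 and 3 before the union so that both sides have the same schema.

```python
from collections import namedtuple

# A commutative semiring (K, +, ·, 0, 1) packaged as plain functions.
Semiring = namedtuple("Semiring", "add mul zero one")
NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)             # bags
BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)  # sets

# A K-relation is (schema, rows): schema is a tuple of attribute names and
# rows maps each value tuple to its annotation; absent tuples are annotated 0.

def project(K, rel, attrs):
    schema, rows = rel
    idx = [schema.index(a) for a in attrs]
    out = {}
    for t, k in rows.items():
        key = tuple(t[i] for i in idx)
        out[key] = K.add(out.get(key, K.zero), k)   # + for alternative use
    return (tuple(attrs), out)

def union(K, r1, r2):
    (s1, rows1), (s2, rows2) = r1, r2
    assert s1 == s2, "union needs identical schemas"
    out = dict(rows1)
    for t, k in rows2.items():
        out[t] = K.add(out.get(t, K.zero), k)
    return (s1, out)

def join(K, r1, r2):
    (s1, rows1), (s2, rows2) = r1, r2
    shared = [a for a in s1 if a in s2]
    extra = [a for a in s2 if a not in s1]
    out = {}
    for t1, k1 in rows1.items():
        for t2, k2 in rows2.items():
            if all(t1[s1.index(a)] == t2[s2.index(a)] for a in shared):
                t = t1 + tuple(t2[s2.index(a)] for a in extra)
                out[t] = K.mul(k1, k2)              # · for joint use
    return (s1 + tuple(extra), out)

def view(K, R):
    """The view V of the running example, evaluated over semiring K."""
    left = project(K, join(K, project(K, R, "13"), project(K, R, "23")), "13")
    right = project(K, join(K, project(K, R, "12"), project(K, R, "23")), "13")
    return union(K, left, right)[1]

schema = ("1", "2", "3")
R_bag = (schema, {("a", "b", "c"): 2, ("d", "b", "e"): 5, ("f", "g", "e"): 1})
R_set = (schema, {t: True for t in R_bag[1]})

bag = view(NAT, R_bag)
sets = view(BOOL, R_set)
print(bag)   # multiplicities 8, 10, 10, 55, 7 as on the bag-semantics slide
```

The same query code runs unchanged for both semirings; only the interpretation of + and · differs.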
RA+ identities imply semiring structure!
Common RA+ identities
– Union and join are associative, commutative
– Join distributes over union
– etc. (but not idempotence!)
These identities hold for RA+ on K-relations iff (K, +, ·, 0, 1) is a commutative semiring
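For reference, the commutative semiring laws that these identities correspond to can be written out as follows (standard definitions, stated here for all a, b, c in K):

```latex
\begin{align*}
(a + b) + c &= a + (b + c), & a + b &= b + a, & a + 0 &= a, \\
(a \cdot b) \cdot c &= a \cdot (b \cdot c), & a \cdot b &= b \cdot a, & a \cdot 1 &= a, \\
a \cdot (b + c) &= a \cdot b + a \cdot c, & a \cdot 0 &= 0.
\end{align*}
```

Idempotence, a + a = a, is deliberately absent: it holds in the Boolean semiring but fails under bag semantics, where 1 + 1 = 2.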
Semiring Bestiary
• (B, ∨, ∧, ⊥, ⊤)  Usual rel. alg. (sets)
• (N, +, ·, 0, 1)  Bag semantics
• (PosBool(X), ∨, ∧, ⊥, ⊤)  Boolean c-tables, also minimal why-provenance [BKT 01]
• (P(Ω), ∪, ∩, ∅, Ω)  Event tables (prob. db)
• (P(P(X)), ∪, ⋓, ∅, {∅})  Proof why-provenance, where A ⋓ B := {a ∪ b : a ∈ A, b ∈ B}
• (P(X), ∪, ∪, ∅, ∅)★  Lineage
• (N[X], +, ·, 0, 1)  Provenance polynomials
Provenance polynomials
X = {p, r, s, …}: indeterminates (provenance “tokens” for base tuples)
N[X]: multivariate polynomials with coefficients in N and indeterminates in X
(N[X], +, ·, 0, 1) is the free commutative semiring generated by X; its elements abstract calculations in all semirings
The polynomials capture the propagation of provenance through (positive) relational algebra in the most general way allowed by commutative semiring-based semantics
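A minimal sketch of N[X] in Python, assuming a polynomial is stored as a `Counter` from monomials to natural-number coefficients, with a monomial written as a sorted tuple of tokens (so `("r", "r")` stands for r²). The representation and the helper names are illustrative choices. Evaluating the running view with tokens p, r, s on the base tuples yields the polynomials used throughout the talk:

```python
from collections import Counter
from itertools import product

def var(x):
    """The polynomial consisting of the single indeterminate x."""
    return Counter({(x,): 1})

def add(f, g):
    return f + g                      # Counter addition sums coefficients

def mul(f, g):
    out = Counter()
    for (m1, c1), (m2, c2) in product(f.items(), g.items()):
        out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

def show(f):
    """Render a polynomial, e.g. {('r','r'): 2, ('r','s'): 1} as '2r^2 + rs'."""
    terms = []
    for mono, c in sorted(f.items()):
        powers = sorted(Counter(mono).items())
        body = "".join(x if e == 1 else f"{x}^{e}" for x, e in powers)
        terms.append(("" if c == 1 and body else str(c)) + body)
    return " + ".join(terms) or "0"

# Base tuples annotated with provenance tokens p, r, s.
R = {("a", "b", "c"): var("p"), ("d", "b", "e"): var("r"), ("f", "g", "e"): var("s")}

def view(rel):
    """The two Datalog rules for V, with · for joins and + for alternatives."""
    out = {}
    for u, ku in rel.items():
        for v, kv in rel.items():
            if u[2] == v[2]:                       # rule 1
                key = (u[0], u[2])
                out[key] = add(out.get(key, Counter()), mul(ku, kv))
            if u[1] == v[1]:                       # rule 2
                key = (u[0], v[2])
                out[key] = add(out.get(key, Counter()), mul(ku, kv))
    return out

V = view(R)
for t in sorted(V):
    print(t, show(V[t]))
```

For the tuple (d, e) this prints `2r^2 + rs`: two derivations that use r twice and one that uses r and s once each.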
Provenance calculations
R:
a b c   p
d b e   r
f g e   s
V:
        lineage   proof why-prov.   minimal why-prov.   boolean c-table annot.   provenance polynomials
a c     {p}       {{p}}             {{p}}               p                        2p²
a e     {p,r}     {{p,r}}           {{p,r}}             p ∧ r                    pr
d c     {p,r}     {{p,r}}           {{p,r}}             p ∧ r                    pr
d e     {r,s}     {{r},{r,s}}       {{r}}               r                        2r² + rs
f e     {r,s}     {{r,s},{s}}       {{s}}               s                        2s² + rs
(≈ relates the minimal why-prov. and boolean c-table annot. columns)
2r² + rs: three derivations; two of them use r twice, and the third uses r and s, once each
Trust assessment
p: certified by Moe
r: certified by Larry
s: certified by Curly
R:
a b c   p
d b e   r
f g e   s
V:
a c   2p²         2 alternatives, both need Moe, twice
a e   pr          needs both Moe and Larry
d c   pr
d e   2r² + rs    one alternative needs Larry and Curly; two others only need Larry, twice
f e   2s² + rs
Which output tuples can be trusted after Larry is jailed?
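One way to read the question, sketched below under the assumption that "jailed" means Larry's token r is no longer trusted: evaluate each polynomial in the Boolean semiring with r mapped to false, so + becomes "or" and · becomes "and". The dictionary encoding of the polynomials and the function name `trusted` are illustrative.

```python
# Provenance polynomials of V from the slide, as {monomial: coefficient};
# a monomial is a tuple of tokens with repetition, e.g. ("r", "r") is r^2.
V = {
    ("a", "c"): {("p", "p"): 2},
    ("a", "e"): {("p", "r"): 1},
    ("d", "c"): {("p", "r"): 1},
    ("d", "e"): {("r", "r"): 2, ("r", "s"): 1},
    ("f", "e"): {("s", "s"): 2, ("r", "s"): 1},
}

def trusted(poly, trust):
    """Boolean evaluation: some monomial must have all of its tokens trusted."""
    return any(c > 0 and all(trust[x] for x in mono) for mono, c in poly.items())

# Assumption for the example: Moe (p) and Curly (s) stay trusted, Larry (r) does not.
trust = {"p": True, "r": False, "s": True}
survivors = sorted(t for t, poly in V.items() if trusted(poly, trust))
print(survivors)   # [('a', 'c'), ('f', 'e')]
```

Only (a, c) and (f, e) have a derivation that avoids r; every derivation of the other three tuples uses Larry's tuple at least once.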
A glimpse at work by T.J. Green: Provenance and Query Optimization
• Many kinds of semiring-based provenance annotations to choose from:
– Lineage
– Proof why-provenance
– Minimal why-provenance
– Provenance polynomials
– ...
• They keep track of more/less information
• A fundamental question, asked repeatedly by Peter Buneman: how does this affect query optimization?
Choice of K Affects Query Optimization
K = N (bag semantics) differs from K = B (set semantics)
e.g., the conjunctive queries
Q1(x) :- R(x,y), R(x,z)
Q2(u) :- R(u,v)
are set-equivalent, but not bag-equivalent
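The difference can be seen on a single small instance. The sketch below uses a hypothetical binary relation R with two tuples sharing the same first column (not an instance from the slides) and evaluates both queries under bag semantics; the set-semantics answers are then just the supports.

```python
from collections import Counter

def q1(rel):
    """Q1(x) :- R(x,y), R(x,z) under bag semantics (rel maps tuples to multiplicities)."""
    out = Counter()
    for (x1, _), m1 in rel.items():
        for (x2, _), m2 in rel.items():
            if x1 == x2:
                out[x1] += m1 * m2
    return out

def q2(rel):
    """Q2(u) :- R(u,v) under bag semantics."""
    out = Counter()
    for (u, _), m in rel.items():
        out[u] += m
    return out

# Hypothetical instance: two tuples with the same first column.
R = Counter({("a", 1): 1, ("a", 2): 1})

bag1, bag2 = q1(R), q2(R)
set1, set2 = set(bag1), set(bag2)
print(bag1["a"], bag2["a"])   # 4 2
print(set1 == set2)           # True
```

On this instance Q1 returns "a" with multiplicity 4 and Q2 with multiplicity 2, while both return the same set {a}. One instance suffices to refute bag-equivalence; set-equivalence is only illustrated here, not proved.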
A Hierarchy of Semiring Provenance (1)
• Provenance polynomials (N[X], +, ·, 0, 1): tracks calculations abstractly; most general
  e.g., 2p²r + 3ps + ps³
• Drop coefficients to get (B[X], +, ·, 0, 1)
  p²r + ps + ps³
• Drop exponents to get proof why-prov. (P(P(X)), ∪, ⋓, ∅, {∅})
  {{p,r}, {p,s}}
• Flatten set-of-sets to get lineage
  {p,r,s}
• Drop, flatten, etc. correspond to surjective semiring homomorphisms
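The three steps can be sketched on the slide's example polynomial. The encoding (a dict from monomials, written as tuples of tokens, to coefficients) and the function names are illustrative assumptions; the sketch shows the maps on elements only and does not verify that they are homomorphisms.

```python
# 2p^2 r + 3ps + ps^3 as {monomial: coefficient}; a monomial is a sorted
# tuple of tokens with repetition.
poly = {("p", "p", "r"): 2, ("p", "s"): 3, ("p", "s", "s", "s"): 1}

def drop_coefficients(f):
    """N[X] -> B[X]: keep only the set of monomials (exponents retained)."""
    return {m for m, c in f.items() if c > 0}

def drop_exponents(monomials):
    """B[X] -> proof why-provenance: each monomial becomes its set of tokens."""
    return {frozenset(m) for m in monomials}

def flatten(why):
    """Proof why-provenance -> lineage: union of all witness sets."""
    return set().union(*why)

bx = drop_coefficients(poly)      # p^2 r + ps + ps^3
why = drop_exponents(bx)          # {{p,r}, {p,s}}
lineage = flatten(why)            # {p, r, s}
print(sorted(map(sorted, why)), sorted(lineage))
# [['p', 'r'], ['p', 's']] ['p', 'r', 's']
```

Note how ps and ps³ collapse to the same witness set {p, s} once exponents are dropped: each step forgets information that the previous level still had.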
A Hierarchy of Semiring Provenance (2)
Definition: K1 ≼_L K2 means that for all queries P, Q in language L, P ≡_K2 Q implies P ≡_K1 Q
Languages of interest: CQ and UCQ (equivalent to RA+)
Definition: K1 ≈_L K2 means K1 ≼_L K2 and K2 ≼_L K1
Proposition: If there exists a surjective homomorphism h : K1 → K2 then K2 ≼_UCQ K1
Proposition (from [GKT 07]): If K is a distributive lattice then B ≈_UCQ K
(In particular B ≈_UCQ PosBool(X))
A Hierarchy of Semiring Provenance (3)
Definition: A semiring is positive if 0 ≠ 1, and a + b = 0 implies a = 0 and b = 0, and a · b = 0 implies a = 0 or b = 0
All the semirings we consider are positive.
Proposition: For any positive K (and “big enough” X), B ≼_UCQ K ≼_UCQ N[X]
Moreover:
Proposition (Provenance Hierarchy): B ≼_UCQ lineage ≼_UCQ proof why-prov. ≼_UCQ B[X] ≼_UCQ N[X]
Separating the Models for ≡ of CQs
B ≺_CQ lineage:
Q1(x,y) :- R(x,y), R(x,z)
Q2(x,y) :- R(x,y)
Q1 ≡_B Q2 but Q1 ≢_lin Q2
lineage ≺_CQ why:
Q1(x) :- R(x,y), R(x,z)
Q2(x) :- R(x,y)
Q1 ≡_lin Q2 but Q1 ≢_why Q2
Summary: Provenance Hierarchy
(each symbol relates the two neighbouring columns)
CQs     B    PosB.(B)    Lineage    N    Why-Pr.    B[X]    N[X]
 ⊑_K      ≈           ≺          ≺    ≺          ≺       ≈
 ≡_K      ≈           ≺          ≺    ≈          ≈       ≈
UCQs    B    PosB.(B)    Lineage    Why-Pr.    B[X]    N[X]
 ⊑_K      ≈           ≺          ≺          ≺       ≺
 ≡_K      ≈           ≺          ≺          ≺       ≺
More importantly, Green’s results also show decidability for containment and equivalence of CQs and UCQs under the various provenance semantics
Extension to annotated XML
• Data model: unordered XML data with semiring annotations (K-UXML)
• Query language: positive, unordered XQuery fragment (K-UXQuery)
• Sanity checks: agrees with encoded relational queries, bag semantics, probabilistic XML, ...
• Applications: security, incomplete XML databases, ...
K-UXML
• No attributes, no text values, no repeated children (inessential); no order (essential!)
• Each subtree decorated with a value k from semiring K (1 “neutral,” 0 “not present”)
• K-collection: a finite set of elements annotated with values from K
• The child subtrees of a node form a K-collection
K-UXML Example
[Figure: an example K-UXML tree with element labels a, b, c, d; subtrees carry annotations such as x1, x2, y1, y2, y3, and the remaining subtrees carry the neutral annotation 1]
Annotations are on elements of K-collections. There are 5 K-collections in this tree (all colored differently).
To annotate whole tree, must include in singleton K-collection.
K-UXQuery Semantics: for-Loops
(notation: label^annotation, children in brackets)
Source, $S: { a^x [ d^u ], b^y [ d^v, e^w ], c^z [ f ] }
Query: for $t in $S return $t/*
Computation: x·{ d^u } , y·{ d^v, e^w } , z·{ f }
Answer: { d^(x·u + y·v), e^(y·w), f^z }
• Annotation of result is a sum over products of annotations along paths to root
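A rough sketch of this one query over K-collections, assuming K = N with arbitrary numeric values standing in for x, y, z, u, v, w, and an ad hoc encoding of trees as `(label, children)` pairs. It illustrates the sum-over-products rule for this slide's example only and is not an implementation of K-UXQuery.

```python
# A K-collection is a dict {tree: annotation}. A tree is (label, children),
# where children is a frozenset of (tree, annotation) pairs so trees are hashable.
def tree(label, children=None):
    return (label, frozenset((children or {}).items()))

def for_return_children(source, add, mul):
    """for $t in $S return $t/*  evaluated over the K-collection `source`."""
    out = {}
    for (_, children), k in source.items():
        for child, kc in children:
            term = mul(k, kc)                      # product along the path
            out[child] = add(out[child], term) if child in out else term
    return out

# Hypothetical numeric stand-ins for the annotations on the slide.
x, y, z, u, v, w = 2, 3, 5, 7, 11, 13
S = {
    tree("a", {tree("d"): u}): x,
    tree("b", {tree("d"): v, tree("e"): w}): y,
    tree("c", {tree("f"): 1}): z,                  # f carries the neutral annotation 1
}

answer = for_return_children(S, lambda a, b: a + b, lambda a, b: a * b)
print(answer[tree("d")], answer[tree("e")], answer[tree("f")])   # 47 39 5
```

The element d is reachable through a and through b, so its annotation is x·u + y·v = 2·7 + 3·11 = 47; e and f each have a single path.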
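K-UXQuery Semantics: // Operator
Source, $S: [Figure: the example K-UXML tree, with annotations x1, x2, y1, y2, y3]
Query: <r> $S//c </r>
Answer: [Figure: a tree with root r whose children are the c-subtrees found in the source, including c annotated x1·y3 + y1·y2 and c annotated y1]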
Application: Access Control
• Data annotated with clearance levels from total order C : P < C < S < T < 0
• Joint use of data (·) requires access to both (max of clearances); alternative use of data (+) requires access to either (min of clearances)
• (C, min, max, 0, P) is a commutative semiring
(notation: label^annotation, children in brackets)
Source, $S: a [ b^C [ d^C ], c^C [ d^S, e^T ] ]
Query: <p> $S/*/* </p>
Answer: p [ d^min(max(P,C,C), max(P,C,S)), e^max(P,C,T) ] = p [ d^C, e^T ]
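The clearance semiring is small enough to check by brute force. The sketch below encodes the total order P < C < S < T < 0 as a list (the names `LEVELS`, `plus`, `times` are illustrative), verifies a few semiring laws over all triples, and recomputes the two annotations of the example answer:

```python
from itertools import product

LEVELS = ["P", "C", "S", "T", "0"]           # total order P < C < S < T < 0
rank = {k: i for i, k in enumerate(LEVELS)}

def plus(a, b):
    """Alternative use: access to either suffices, so take the min clearance."""
    return min(a, b, key=rank.get)

def times(a, b):
    """Joint use: access to both is needed, so take the max clearance."""
    return max(a, b, key=rank.get)

ZERO, ONE = "0", "P"                          # "0": nobody has access; P: neutral for max

# Brute-force check of identity, annihilation and distributivity laws.
for a, b, c in product(LEVELS, repeat=3):
    assert plus(a, ZERO) == a and times(a, ONE) == a and times(a, ZERO) == ZERO
    assert times(a, plus(b, c)) == plus(times(a, b), times(a, c))

# Annotations of the answer to <p> $S/*/* </p> on the slide's source tree.
d = plus(times(times("P", "C"), "C"), times(times("P", "C"), "S"))
e = times(times("P", "C"), "T")
print(d, e)   # C T
```

The element d is reachable through b (needing only clearance C) and through c (needing S), so the cheaper alternative wins; e has a single path and needs T.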
Security Condition: Non-Interference
• For any given clearance level (e.g., C), want the following diagram to commute:
a [ b^C [ d^C ], c^C [ d^S, e^T ] ]   --query-->   p [ d^C, e^T ]
        | erase > C                                    | erase > C
a [ b^C [ d^C ], c^C ]                --query-->   p [ d^C ]
Application: Incomplete XML
• Data annotated with Boolean expressions; tree T represents set of possible worlds Rep(T)
[Figure: T is the example tree with three c nodes annotated by the Boolean variables x, y, z; Rep(T) is the set of trees obtained from T under the valuations of x, y, z: 7 possible worlds]
Correctness: Possible Worlds
• For every incomplete tree T, and every UXQuery query q, want this diagram to commute:
T      --Rep-->   Rep(T)
| q               | q
q(T)   --Rep-->   q(Rep(T)) = Rep(q(T))
Commutation with Homomorphisms
• Ex: access control: h_c : C → C, h_c(k) := if k ≤ c then k else 0
• Ex: incomplete databases: ν : Vars → B, Eval_ν : PosBool(Vars) → B
• Ex: duplicate elimination: δ : N → B, δ(k) := if k = 0 then ⊥ else ⊤
Theorem: Let h : K1 → K2 be a semiring homomorphism. Then for any RA+/NRC/UXQuery query q, and for any K1-instance D, we have
h(q(D)) = q(h(D)).
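The theorem can be spot-checked for duplicate elimination δ : N → B on the running view. The instance below is hypothetical (multiplicities 2, 5 and 0, chosen so that one tuple is mapped to ⊥), and the `view` function hard-codes the two Datalog rules with the semiring operations passed in as parameters.

```python
def view(rel, add, mul):
    """The two Datalog rules for V over annotated tuples, generic in + and ·."""
    out = {}
    def emit(t, k):
        out[t] = add(out[t], k) if t in out else k
    for u, ku in rel.items():
        for v, kv in rel.items():
            if u[2] == v[2]:                 # rule 1
                emit((u[0], u[2]), mul(ku, kv))
            if u[1] == v[1]:                 # rule 2
                emit((u[0], v[2]), mul(ku, kv))
    return out

def delta(k):
    """Duplicate elimination N -> B: zero maps to False, anything else to True."""
    return k != 0

# Hypothetical N-instance; the last tuple has multiplicity 0.
D = {("a", "b", "c"): 2, ("d", "b", "e"): 5, ("f", "g", "e"): 0}

# h(q(D)): evaluate with bag semantics, then apply delta to every annotation.
lhs = {t: delta(k) for t, k in view(D, lambda a, b: a + b, lambda a, b: a * b).items()}
# q(h(D)): apply delta to the input, then evaluate with set semantics.
rhs = view({t: delta(k) for t, k in D.items()}, lambda a, b: a or b, lambda a, b: a and b)

print(lhs == rhs)   # True
```

Both routes agree, including on (f, e), which is annotated False either way because every derivation of it uses the absent tuple.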
Provenance Polynomials are Universal
Corollary: The semantics of RA+/NRC/UXQuery evaluation on K-instances for any commutative semiring K factors through evaluation using provenance polynomials N[X].
e.g., for any K-UXML document D, for any K-UXQuery q, we have
q(D) = Eval_ν(q(D’))
where
• D’ is obtained by replacing K-annotations in D with fresh variables from X
• ν : X → K is the corresponding valuation
• Eval_ν : N[X] → K is the unique semiring homomorphism such that for the one-variable monomials, Eval_ν(x) = ν(x).
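To illustrate the factoring on the relational running example (rather than on UXML), the sketch below takes the provenance polynomials of V as given and evaluates them under a valuation ν. The encoding of polynomials and the `evaluate` helper are assumptions made for the example. With ν(p) = 2, ν(r) = 5, ν(s) = 1 in N it recovers the bag-semantics multiplicities shown earlier in the talk.

```python
# Provenance polynomials of V as {monomial: coefficient}; a monomial is a
# tuple of tokens with repetition, e.g. ("r", "r") is r^2.
V = {
    ("a", "c"): {("p", "p"): 2},
    ("a", "e"): {("p", "r"): 1},
    ("d", "c"): {("p", "r"): 1},
    ("d", "e"): {("r", "r"): 2, ("r", "s"): 1},
    ("f", "e"): {("s", "s"): 2, ("r", "s"): 1},
}

def evaluate(poly, nu, add, mul, zero, one):
    """Eval_nu: interpret tokens via nu, and + and · in the target semiring."""
    total = zero
    for mono, c in poly.items():
        term = one
        for x in mono:
            term = mul(term, nu[x])
        for _ in range(c):            # coefficient c stands for c copies of the term
            total = add(total, term)
    return total

nat = dict(add=lambda a, b: a + b, mul=lambda a, b: a * b, zero=0, one=1)
boolean = dict(add=lambda a, b: a or b, mul=lambda a, b: a and b, zero=False, one=True)

bags = {t: evaluate(f, {"p": 2, "r": 5, "s": 1}, **nat) for t, f in V.items()}
sets = {t: evaluate(f, {"p": True, "r": True, "s": False}, **boolean) for t, f in V.items()}
print(bags)
```

The same polynomials, evaluated in B with s absent, say that (f, e) disappears while the other four tuples remain.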
Datalog?
The semiring structure on annotations works out nicely for positive relational algebra, positive nested relational calculus (NRC), and a large fragment of XQuery.
What more do we need to capture recursion, i.e., for Datalog queries?
ω-complete semirings with ω-continuous operations (so fixed points exist!): ω-continuous semirings
N is not, but N^∞ ≜ N ∪ {∞} is.
Datalog may have infinite derivations!
Polynomials do not suffice, since they are finite!
Nonetheless, the calculations are finitely representable through a system of equations
The equations have a least solution in any ω-continuous semiring
For provenance, we must generalize from polynomials to formal power series (in general, infinitely many monomials)
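A small worked illustration, using a hypothetical program that is not taken from the slides: transitive closure, T(x,y) :- E(x,y) and T(x,y) :- E(x,z), T(z,y), over a single self-loop edge E(a,a) annotated with a token e. Writing x for the annotation of T(a,a):

```latex
% Hypothetical example: one edge E(a,a) annotated with token e.
% The two rules give a single equation for x, the annotation of T(a,a):
\[ x = e + e \cdot x . \]
% Iterating from 0 gives x_0 = 0, x_1 = e, x_2 = e + e^2, x_3 = e + e^2 + e^3, ...
% so the least solution is the formal power series
\[ x = \sum_{n \ge 1} e^{n} = e + e^{2} + e^{3} + \cdots \]
% Evaluating e \mapsto 1 in N^\infty gives x = 1 + 1 + \cdots = \infty,
% while evaluating e \mapsto \top in B gives x = \top.
```

Each monomial eⁿ corresponds to the derivation that goes around the loop n times, which is why no finite polynomial can serve as the provenance of T(a,a).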
Related Work
• Foundations: semirings/systems of equations/formal power series first used in CS in theory of formal languages [Chomsky, Schützenberger 1963]
• Our work is related to and shares similar goals with “Debugging schema mappings with routes” [Chiticariu, Tan VLDB 2006], where “routes” are like minimal finite portions of our provenance polynomials
More Related Work
• Bag semantics for NRC [Libkin&Wong 97]
• Incomplete XML [Kanza+ 99, Abiteboul+ 06]
• Probabilistic XML [Nierman&Jagadish 02, van Keulen+ 05, Abit.&Senellart 06, Sen.&Abit. 07, Hung+ 07]
• XML provenance [Buneman+ 01]
• NRC provenance [Hidders+ 07]
• Soft CSPs [Bistarelli et al]
• Semiring-annotated XPath [Grahne+ 07]
• Negation, expressiveness of RAK [Geerts&Poggi 08]
Related Work for T.J. Green
• Already mentioned
– Set-cont. and equiv. of CQs [Chandra&Merlin 77]
– Set-cont. and equiv. of UCQs [Sagiv&Yannakakis 80]
– Bag-cont. of UCQs [Ioannidis&Ramakrishnan 95]
– Bag-equiv. of CQs [Chaudhuri&Vardi 93]
• Containment of CQs with where-provenance [Tan 03]
• Bag-set semantics [CV 93], combined semantics [Cohen 06]
– For K-relations: support operator of [Geerts&Poggi 08] generalizes duplicate elimination
• Bag-containment of CQs [Jayram+ 06]
Conclusion
• Annotations forming a commutative semiring seem to fit well with database transformations expressed in positive query languages, be they relational, even recursive, or for complex values or tree data.
• We obtained explanations for a number of puzzles related to why-provenance in a broad sense.
• Provenance polynomials also capture tuple multiplicity and serve well systems such as Orchestra.
• Big open questions: negation (although see work by Geerts, Poggi) and order
Future Work
I have the feeling that we have only scratched the surface so far…
I am working on marrying this approach with data exchange, with a broader perspective on security, with integrity constraints, with a broader perspective on mapping/view maintenance and update…