1 Provenance Semirings T.J. Green, G. Karvounarakis, V. Tannen University of Pennsylvania Principles...
-
date post
21-Dec-2015 -
Category
Documents
-
view
215 -
download
0
Transcript of 1 Provenance Semirings T.J. Green, G. Karvounarakis, V. Tannen University of Pennsylvania Principles...
1
Provenance Semirings
T.J. Green, G. Karvounarakis, V. TannenUniversity of Pennsylvania
Principles of Provenance (PrOPr)
Philadelphia, PA June 26, 2007
PROPR 2007 2
Provenance
●First studied in data warehousing
▪Lineage [Cui,Widom,Wiener 2000]
●Scientific applications (to assess quality of data)
▪Why-Provenance [Buneman,Khanna,Tan 2001]
●Our interest: P2P data sharing in the ORCHESTRA system (project headed by Zack Ives)
▪Trust conditions based on provenance
▪Deletion propagation
PROPR 2007 3
Annotated relations
●Provenance: an annotation on tuples
●Our observation: propagating provenance/lineage through views is similar to querying
▪Incomplete Databases (conditional tables)
▪Probabilistic Databases (independent tuple tables)
▪Bag Semantics Databases (tuples with multiplicities)
●Hence we look at queries on relations with annotated tuples
PROPR 2007 4
semantics: a set of instances
Incomplete databases: boolean C-tables
a b c p
d b e r
f g e s
a b c
f g e{ }I(R)= , , , , , , ,; d b ea b c f g ed b e
f g e
a b cd b e
f g e
R
a b c
d b e
boolean variables
PROPR 2007 5
Imielinski & Lipski (1984): queries on C -tables
R a b c p
d b e r
f g e s
a c (p Æ p) Ç (p Æ p)
a e p Æ r
d c r Æ p
d e (r Æ r) Ç (r Æ r) Ç (r Æ s)
f e (s Æ s) Ç (s Æ s) Ç (s Æ r)
p
p Æ r
p Æ r
r
s
=
q(R)
q(x,z) :- R(x, _,z), R(_, _,z)
q(x,z) :- R(x,y, _), R(_ ,y,z)
r r
r r
sunion of conjunctive queries (UCQ)
a c
f e
p=true
r=false
s=true
PROPR 2007 6
Why-provenance/lineage
a b c p
d b e r
f g e s
R
Which input tuples contribute to the presence of a tuple in the output?
q(R)tuple ids
same query
a c {p}
a e {p,r}
d c {p,r}
d e {r,s}
f e {r,s}
[Cui,Widom,Wiener 2000][Buneman,Khanna,Tan 2001]
PROPR 2007 7
C –tables vs. Why-provenance
a c ({p} {p}) ({p} {p})
a e {p} {r}
d c {r} {p}
d e ({r} {r}) ({r} {r}) ({r} {s})
f e ({s} {s}) ({s} {s}) ({s} {r})
a c (p Æ p) Ç (p Æ p)
a e p Æ r
d c r Æ p
d e (r Æ r) Ç (r Æ r) Ç (r Æ s)
f e (s Æ s) Ç (s Æ s) Ç (s Æ r)
c-table calculations
Why-provenance calculations
The structure of the calculations is the same!
PROPR 2007 8
Another analogy, with bag semantics
a b c 2
d b e 5
f g e 1
R tuple multiplicities
a c 2 ¢ 2 + 2 ¢ 2
a e 2 ¢ 5
d c 5 ¢ 2
d e 5 ¢ 5 + 5 ¢ 5 + 5 ¢ 1
f e 1 ¢ 1 + 1 ¢ 1 + 1 ¢ 5
q(R)a c 8
a e 10
d c 10
d e 55
f e 7
multiplicity calculation
s
same query
a c (p Æ p) Ç (p Æ p)
a e p Æ r
d c r Æ p
d e (r Æ r) Ç (r Æ r) Ç (r Æ s)
f e (s Æ s) Ç (s Æ s) Ç (s Æ r)
c-table calculations
The structure of the calculations is the same!
PROPR 2007 9
Abstracting the structure of these calculations
These expressions capture the abstract structure of the calculations, which encodes the logical derivation of the output tuples We shall use these expressions as provenance
C-tables
BagsWhy-provenance
Abstract
join Æ ¢ [ ¢union Ç + [ +
a c (p ¢ p) + (p ¢ p)
a e p ¢ r
d c r ¢ p
d e (r ¢ r) + (r ¢ r) + (r ¢ s)
f e (s ¢ s) + (s ¢ s) + (s ¢ r)
abstract calculation
s
PROPR 2007 10
Positive K-relational algebra
●We define an RA+ on K-relations:
▪The ¢ corresponds to join:
▪The + corresponds to union and projection
▪0 and 1 are used for selection predicates
▪Details in the paper (but recall how we evaluated the UCQ q earlier and we will see another example later)
PROPR 2007 11
RA+ identities imply semiring structure!
●Common RA+ identities
▪Union and join are associative, commutative
▪Join distributes over union
▪etc. (but not idempotence!)
These identities hold for RA+ on K-relations iff
(K, +, ¢, 0, 1) is a commutative semiring (K,+,0) is a commutative monoid (K, ¢,1) is a commutative monoid ¢ distributes over +, etc
PROPR 2007 12
Calculations on annotated tables are particular cases
(B, Ç, Æ, false, true) usual relational algebra
(N, +, ¢, 0, 1) bag semantics
(PosBool(B), Ç, Æ, false, true)
boolean C-tables
(P( ), [, Å, ;, )probabilistic event tables
(P(X), [, [, ;, ;)lineage/why-provenance
PROPR 2007 13
Provenance Semirings
●X = {p, r, s, …}: indeterminates (provenance “tokens” for base tuples)
●N[X] : multivariate polynomials with coefficients in N and indeterminates in X
●(N[X], +, ¢, 0, 1) is the most “general” commutative semiring: its elements abstract calculations in all semirings
●N[X] –relations are the relations with provenance!
▪The polynomials capture the propagation of provenance through (positive) relational algebra
PROPR 2007 14
A provenance calculation
a b c p
d b e r
f g e s
Ra c 2p2
a e pr
d c pr
d e 2r2 + rs
f e 2s2 + rs
q(R)a c {p}
a e {p,r}
d c {p,r}
d e {r,s}
f e {r,s}
Why-provenance
●Not just why- but also how-provenance (encodes derivations)!
●More informative than why-provenance
same why-provenance, different polynomials
q(x,z) :- R(x, _,z), R(_, _,z)q(x,z) :- R(x,y, _), R(_ ,y,z)
PROPR 2007 15
Further work
●Application: P2P data sharing in the ORCHESTRA system:
▪Need to express trust conditions based on provenance of tuples
▪Incremental propagation of deletions
▪Semiring provenance itself is incrementally maintainable
●Future extensions:
▪full relational algebra: For difference we need semirings with “proper subtraction”
▪richer data models: nested relations/complex values, XML