1 Provenance Semirings T.J. Green, G. Karvounarakis, V. Tannen University of Pennsylvania Principles...

15
1 Provenance Semirings T.J. Green, G. Karvounarakis, V. Tannen University of Pennsylvania Principles of Provenance (PrOPr) Philadelphia, PA June 26, 2007
  • date post

    21-Dec-2015
  • Category

    Documents

  • view

    215
  • download

    0

Transcript of 1 Provenance Semirings T.J. Green, G. Karvounarakis, V. Tannen University of Pennsylvania Principles...

1

Provenance Semirings

T.J. Green, G. Karvounarakis, V. TannenUniversity of Pennsylvania

Principles of Provenance (PrOPr)

Philadelphia, PA June 26, 2007

PROPR 2007 2

Provenance

●First studied in data warehousing

▪Lineage [Cui,Widom,Wiener 2000]

●Scientific applications (to assess quality of data)

▪Why-Provenance [Buneman,Khanna,Tan 2001]

●Our interest: P2P data sharing in the ORCHESTRA system (project headed by Zack Ives)

▪Trust conditions based on provenance

▪Deletion propagation

PROPR 2007 3

Annotated relations

●Provenance: an annotation on tuples

●Our observation: propagating provenance/lineage through views is similar to querying

▪Incomplete Databases (conditional tables)

▪Probabilistic Databases (independent tuple tables)

▪Bag Semantics Databases (tuples with multiplicities)

●Hence we look at queries on relations with annotated tuples

PROPR 2007 4

semantics: a set of instances

Incomplete databases: boolean C-tables

a b c p

d b e r

f g e s

a b c

f g e{ }I(R)= , , , , , , ,; d b ea b c f g ed b e

f g e

a b cd b e

f g e

R

a b c

d b e

boolean variables

PROPR 2007 5

Imielinski & Lipski (1984): queries on C -tables

R a b c p

d b e r

f g e s

a c (p Æ p) Ç (p Æ p)

a e p Æ r

d c r Æ p

d e (r Æ r) Ç (r Æ r) Ç (r Æ s)

f e (s Æ s) Ç (s Æ s) Ç (s Æ r)

p

p Æ r

p Æ r

r

s

=

q(R)

q(x,z) :- R(x, _,z), R(_, _,z)

q(x,z) :- R(x,y, _), R(_ ,y,z)

r r

r r

sunion of conjunctive queries (UCQ)

a c

f e

p=true

r=false

s=true

PROPR 2007 6

Why-provenance/lineage

a b c p

d b e r

f g e s

R

Which input tuples contribute to the presence of a tuple in the output?

q(R)tuple ids

same query

a c {p}

a e {p,r}

d c {p,r}

d e {r,s}

f e {r,s}

[Cui,Widom,Wiener 2000][Buneman,Khanna,Tan 2001]

PROPR 2007 7

C –tables vs. Why-provenance

a c ({p} {p}) ({p} {p})

a e {p} {r}

d c {r} {p}

d e ({r} {r}) ({r} {r}) ({r} {s})

f e ({s} {s}) ({s} {s}) ({s} {r})

a c (p Æ p) Ç (p Æ p)

a e p Æ r

d c r Æ p

d e (r Æ r) Ç (r Æ r) Ç (r Æ s)

f e (s Æ s) Ç (s Æ s) Ç (s Æ r)

c-table calculations

Why-provenance calculations

The structure of the calculations is the same!

PROPR 2007 8

Another analogy, with bag semantics

a b c 2

d b e 5

f g e 1

R tuple multiplicities

a c 2 ¢ 2 + 2 ¢ 2

a e 2 ¢ 5

d c 5 ¢ 2

d e 5 ¢ 5 + 5 ¢ 5 + 5 ¢ 1

f e 1 ¢ 1 + 1 ¢ 1 + 1 ¢ 5

q(R)a c 8

a e 10

d c 10

d e 55

f e 7

multiplicity calculation

s

same query

a c (p Æ p) Ç (p Æ p)

a e p Æ r

d c r Æ p

d e (r Æ r) Ç (r Æ r) Ç (r Æ s)

f e (s Æ s) Ç (s Æ s) Ç (s Æ r)

c-table calculations

The structure of the calculations is the same!

PROPR 2007 9

Abstracting the structure of these calculations

These expressions capture the abstract structure of the calculations, which encodes the logical derivation of the output tuples We shall use these expressions as provenance

C-tables

BagsWhy-provenance

Abstract

join Æ ¢ [ ¢union Ç + [ +

a c (p ¢ p) + (p ¢ p)

a e p ¢ r

d c r ¢ p

d e (r ¢ r) + (r ¢ r) + (r ¢ s)

f e (s ¢ s) + (s ¢ s) + (s ¢ r)

abstract calculation

s

PROPR 2007 10

Positive K-relational algebra

●We define an RA+ on K-relations:

▪The ¢ corresponds to join:

▪The + corresponds to union and projection

▪0 and 1 are used for selection predicates

▪Details in the paper (but recall how we evaluated the UCQ q earlier and we will see another example later)

PROPR 2007 11

RA+ identities imply semiring structure!

●Common RA+ identities

▪Union and join are associative, commutative

▪Join distributes over union

▪etc. (but not idempotence!)

These identities hold for RA+ on K-relations iff

(K, +, ¢, 0, 1) is a commutative semiring (K,+,0) is a commutative monoid (K, ¢,1) is a commutative monoid ¢ distributes over +, etc

PROPR 2007 12

Calculations on annotated tables are particular cases

(B, Ç, Æ, false, true) usual relational algebra

(N, +, ¢, 0, 1) bag semantics

(PosBool(B), Ç, Æ, false, true)

boolean C-tables

(P( ), [, Å, ;, )probabilistic event tables

(P(X), [, [, ;, ;)lineage/why-provenance

PROPR 2007 13

Provenance Semirings

●X = {p, r, s, …}: indeterminates (provenance “tokens” for base tuples)

●N[X] : multivariate polynomials with coefficients in N and indeterminates in X

●(N[X], +, ¢, 0, 1) is the most “general” commutative semiring: its elements abstract calculations in all semirings

●N[X] –relations are the relations with provenance!

▪The polynomials capture the propagation of provenance through (positive) relational algebra

PROPR 2007 14

A provenance calculation

a b c p

d b e r

f g e s

Ra c 2p2

a e pr

d c pr

d e 2r2 + rs

f e 2s2 + rs

q(R)a c {p}

a e {p,r}

d c {p,r}

d e {r,s}

f e {r,s}

Why-provenance

●Not just why- but also how-provenance (encodes derivations)!

●More informative than why-provenance

same why-provenance, different polynomials

q(x,z) :- R(x, _,z), R(_, _,z)q(x,z) :- R(x,y, _), R(_ ,y,z)

PROPR 2007 15

Further work

●Application: P2P data sharing in the ORCHESTRA system:

▪Need to express trust conditions based on provenance of tuples

▪Incremental propagation of deletions

▪Semiring provenance itself is incrementally maintainable

●Future extensions:

▪full relational algebra: For difference we need semirings with “proper subtraction”

▪richer data models: nested relations/complex values, XML