Language Design and Data Provenance
6/3/2019, GeCo Workshop, Como
Val Tannen, University of Pennsylvania
Collaborators
T of T award: TJ Green (RelationalAI), Grigoris Karvounarakis (RelationalAI)
G of PODS paper: TJ
ORCHESTRA: Zack Ives (University of Pennsylvania), TJ, Grigoris
Other core papers: Nate Foster (Cornell University), Yael Amsterdamer (Bar-Ilan University), Daniel Deutch (Tel Aviv University), Tova Milo (Tel Aviv University), Sudeepa Roy (Duke University), Yuval Moskovitch (Tel Aviv University)
Recent work: Erich Grädel (RWTH Aachen)
Much gratitude: Peter Buneman (University of Edinburgh)
Provenance?
• Provenance is about
– trust: propagate it from inputs to outputs
– diagnostics: faulty outputs come from where?
– (repairs): fix inputs to fix outputs (reverse provenance analysis).
(Binary) Trust with Cat Victims
Sue’s notes*: (cat, mouse), (cat, rat)
Val’s notes* (prey, color): (mouse, gray), (mouse, red), (rat, gray)
Zack’s** computation: (cat, gray), (cat, red)
[Figure: every tuple carries a Yes/No trust label; trust is propagated from Sue’s and Val’s notes to Zack’s output.]
* Sue and Val are noted zoologists. ** Zack is a noted computational zoologist.
Confidence Scores (non-binary trust)
Sue’s notes: (cat, mouse) 0.9; (cat, rat) 0.9
Val’s notes: (mouse, gray) 0.6; (mouse, red) 0.1; (rat, gray) 0.8
Zack’s computation: (cat, gray) 0.72; (cat, red) 0.09
0.72 = max(0.9 × 0.8, 0.9 × 0.6)
0.09 = 0.9 × 0.1
A Simple Model for Data Pricing
Sue’s notes: (cat, mouse) $10; (cat, rat) $10
Val’s notes: (mouse, gray) $6; (mouse, red) $1; (rat, gray) $8
Zack’s computation: (cat, gray) $16; (cat, red) $11
16 = min(10 + 8, 10 + 6)
11 = 10 + 1
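The pricing rule on this slide can be sketched in a few lines of Python: a derivation costs the sum of the inputs it uses jointly, and an output tuple costs as much as its cheapest alternative derivation. The dictionary layout and the function name `price` are illustrative assumptions, not something taken from the talk.

```python
# Prices of the input tuples, as on the slide.
sue = {("cat", "mouse"): 10, ("cat", "rat"): 10}
val = {("mouse", "gray"): 6, ("mouse", "red"): 1, ("rat", "gray"): 8}

def price(sue, val):
    """Price each output tuple: joint use adds, alternative use takes the min."""
    out = {}
    for (pred, prey), cost_s in sue.items():
        for (animal, color), cost_v in val.items():
            if prey == animal:
                total = cost_s + cost_v                     # both inputs are needed
                key = (pred, color)
                out[key] = min(out.get(key, total), total)  # keep cheapest derivation
    return out

zack = price(sue, val)
print(zack)  # {('cat', 'gray'): 16, ('cat', 'red'): 11}
```

Replacing `+` by `×` and `min` by `max` in the same loop gives the confidence scores of the previous slide.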
Computation? Expressed in a Query Language
Sue’s notes: (cat, mouse), (cat, rat)
Val’s notes: (mouse, gray), (mouse, red), (rat, gray)
Zack’s computation: (cat, gray), (cat, red)
Zack(x,z) :- Sue(x,y), Val(y,z)
Zack = PROJECT (JOIN (Sue, Val))
Zack = { (u.#pred, v.#color) | u ∈ Sue, v ∈ Val, u.#prey = v.#animal }
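As a rough illustration, the calculus form of the query translates almost literally into a Python set comprehension. The record types `SueRow` and `ValRow` are assumptions introduced here to give the attributes names; only the attribute names pred, prey, animal and color come from the slide.

```python
from collections import namedtuple

SueRow = namedtuple("SueRow", ["pred", "prey"])
ValRow = namedtuple("ValRow", ["animal", "color"])

Sue = {SueRow("cat", "mouse"), SueRow("cat", "rat")}
Val = {ValRow("mouse", "gray"), ValRow("mouse", "red"), ValRow("rat", "gray")}

# Zack = { (u.pred, v.color) | u in Sue, v in Val, u.prey = v.animal }
Zack = {(u.pred, v.color) for u in Sue for v in Val if u.prey == v.animal}

print(sorted(Zack))  # [('cat', 'gray'), ('cat', 'red')]
```

Under set semantics the two derivations of (cat, gray) collapse into one tuple; keeping track of them is exactly what provenance adds.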
Do it once and use it repeatedly: provenance
Label (annotate) input items abstractly with provenance tokens.
Provenance tracking: propagate expressions (involving tokens) (to annotate intermediate data and, finally, outputs).
Based on query language design, track two distinct ways of using data items by computation primitives:
• jointly (this alone is basically like keeping a log)
• alternatively (doing both is essential; think trust)
Input-output compositional; modular (in the primitives).
Later, we want to evaluate the provenance expressions to obtain binary trust, confidence scores, data prices, etc.
Algebraic interpretation for RDB
Set X of provenance tokens. Space of annotations, provenance expressions Prov(X).
Prov(X)-relations: every tuple is annotated with some element from Prov(X).
Binary operations on Prov(X):
• · corresponds to joint use (join, cartesian product)
• + corresponds to alternative use (union and projection)
Special annotations:
• “Absent” tuples are annotated with 0.
• 1 is a “neutral” annotation (data we do not track).
K-Relational algebra
Algebraic laws of (Prov(X), +, ·, 0, 1)? More generally, for annotations from a structure (K, +, ·, 0, 1)?
K-relations. Generalize RA+ to (positive) K-relational algebra.
Desired optimization equivalences of K-relational algebra iff (K, +, ·, 0, 1) is a commutative semiring.
Generalizes SPJU or UCQ or non-rec. Datalog:
• set semantics (B, ∨, ∧, ⊥, ⊤)
• bag semantics (N, +, ·, 0, 1)
• c-table semantics [IL84] (BoolExp(X), ∨, ∧, ⊥, ⊤)
• event table semantics [FR97, Z97] (P(Ω), ∪, ∩, ∅, Ω)
What is a commutative semiring?
An algebraic structure (K, +, ·, 0, 1) where:
• K is the domain
• + is associative, commutative, with 0 identity
• · is associative, with 1 identity
• · distributes over +
• a · 0 = 0 · a = 0
(the laws up to here define a semiring)
• · is also commutative
Unlike a ring, there is no requirement for inverses to +.
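The laws listed above can be spot-checked mechanically. The following is a minimal sketch, assuming a hypothetical `Semiring` record and a checker that tests the laws only on a finite list of sample elements; it can refute the laws but does not prove them.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]
    zero: Any
    one: Any

def is_commutative_semiring(k, samples):
    """Check the commutative-semiring laws on every triple drawn from samples."""
    for a, b, c in product(samples, repeat=3):
        ok = (
            k.add(k.add(a, b), c) == k.add(a, k.add(b, c))        # + associative
            and k.add(a, b) == k.add(b, a)                        # + commutative
            and k.add(a, k.zero) == a                             # 0 identity for +
            and k.mul(k.mul(a, b), c) == k.mul(a, k.mul(b, c))    # · associative
            and k.mul(a, b) == k.mul(b, a)                        # · commutative
            and k.mul(a, k.one) == a                              # 1 identity for ·
            and k.mul(a, k.add(b, c)) == k.add(k.mul(a, b), k.mul(a, c))  # distributivity
            and k.mul(a, k.zero) == k.zero                        # 0 annihilates
        )
        if not ok:
            return False
    return True

BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
# Using subtraction as "+" violates commutativity, so this is not a semiring.
BROKEN = Semiring(lambda a, b: a - b, lambda a, b: a * b, 0, 1)

print(is_commutative_semiring(BOOL, [False, True]))  # True
print(is_commutative_semiring(NAT, range(5)))        # True
print(is_commutative_semiring(BROKEN, range(5)))     # False
```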
Provenance: abstract semiring annotation
Sue’s notes: (cat, mouse) p; (cat, rat) q
Val’s notes: (mouse, gray) r; (mouse, red) s; (rat, gray) t
Zack(x,z) :- Sue(x,y), Val(y,z)
Zack: (cat, gray) p·r + q·t; (cat, red) p·s
Keep X = {p, q, r, s, t} abstract. Diagnostic for wrong answers; deletion propagation. E.g., r = s = 0.
Provenance polynomials: the semiring (N[X], +, ·, 0, 1)
Provenance propagation through language operations
Sue: (cat, mouse) p; (cat, rat) q
Val: (mouse, gray) r; (mouse, red) s; (rat, gray) t
JOIN: (cat, mouse, gray) p·r; (cat, mouse, red) p·s; (cat, rat, gray) q·t
PROJECT: (cat, gray) p·r + q·t; (cat, red) p·s
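The propagation through JOIN and PROJECT can be sketched as follows. This is only an illustration: polynomials are represented as a `Counter` from monomials (sorted tuples of tokens) to coefficients, and the helper names `token`, `p_add`, `p_mul` and `show` are assumptions made for this sketch.

```python
from collections import Counter

def token(x):
    """Polynomial consisting of the single token x."""
    return Counter({(x,): 1})

def p_add(a, b):
    """Sum of polynomials: alternative use."""
    return a + b

def p_mul(a, b):
    """Product of polynomials: joint use."""
    out = Counter()
    for m1, c1 in a.items():
        for m2, c2 in b.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

def show(poly):
    return " + ".join(
        (str(c) if c != 1 else "") + "·".join(m) for m, c in sorted(poly.items())
    )

sue = {("cat", "mouse"): token("p"), ("cat", "rat"): token("q")}
val = {("mouse", "gray"): token("r"), ("mouse", "red"): token("s"),
       ("rat", "gray"): token("t")}

# JOIN on prey = animal: annotations are multiplied.
joined = {}
for (pred, prey), a in sue.items():
    for (animal, color), b in val.items():
        if prey == animal:
            joined[(pred, prey, color)] = p_mul(a, b)

# PROJECT onto (pred, color): annotations of merged tuples are added.
zack = {}
for (pred, _, color), a in joined.items():
    zack[(pred, color)] = p_add(zack.get((pred, color), Counter()), a)

for tup, poly in zack.items():
    print(tup, show(poly))
# ('cat', 'gray') p·r + q·t
# ('cat', 'red') p·s
```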
Provenance polynomials
(N[X], +, ·, 0, 1) is the commutative semiring freely generated by X (universality property involving homomorphisms).
Provenance polynomials are PTIME-computable (data complexity). (Query complexity depends on language and representation.)
ORCHESTRA provenance (graph representation): about 30% overhead.
Monomials correspond to logical derivations (proof trees in non-rec. Datalog).
Provenance reading of polynomials:
output tuple has provenance 2r² + rs
three derivations of the tuple
- two of them use r, twice
- the third uses r and s, once each
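The reading of 2r² + rs as three derivations can be made concrete with a small sketch: the coefficient of a monomial counts derivations, and the exponents say how often each token is used. The encoding of the polynomial and the function name `derivations` are assumptions made for illustration.

```python
from collections import Counter

# 2r^2 + rs as a map from monomials (token, exponent pairs) to coefficients.
poly = {
    (("r", 2),): 2,
    (("r", 1), ("s", 1)): 1,
}

def derivations(poly):
    """List one entry per derivation, each a bag of the tokens it uses."""
    out = []
    for monomial, coeff in poly.items():
        bag = Counter(dict(monomial))
        out.extend([bag] * coeff)   # the coefficient counts derivations
    return out

ds = derivations(poly)
print(len(ds))  # 3
for d in ds:
    print(dict(d))
# {'r': 2}
# {'r': 2}
# {'r': 1, 's': 1}
```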
Specialize provenance for confidence scores
Sue’s notes: (cat, mouse) p, 0.9; (cat, rat) q, 0.9
Val’s notes: (mouse, gray) r, 0.6; (mouse, red) s, 0.1; (rat, gray) t, 0.8
Zack(x,z) :- Sue(x,y), Val(y,z)
Zack: (cat, gray) pr + qt, 0.72; (cat, red) ps, 0.09
V = ([0,1], max, ·, 0, 1), the Viterbi semiring
f : X → [0,1], f(p) = f(q) = 0.9, f(r) = 0.6, f(s) = 0.1, f(t) = 0.8
eval(f) : N[X] → V, eval(f)(pr + qt) = 0.72, eval(f)(ps) = 0.09
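A minimal sketch of eval(f) for the Viterbi semiring: the provenance polynomials are computed once, and the confidence scores are obtained afterwards by reading + as max and · as product. The list-of-monomials encoding and the name `eval_viterbi` are assumptions for this illustration.

```python
import math

# Provenance polynomials as lists of monomials; a monomial is a list of tokens.
prov = {
    ("cat", "gray"): [["p", "r"], ["q", "t"]],   # pr + qt
    ("cat", "red"): [["p", "s"]],                # ps
}

# The valuation f from the slide.
f = {"p": 0.9, "q": 0.9, "r": 0.6, "s": 0.1, "t": 0.8}

def eval_viterbi(poly, f):
    """Interpret + as max and · as product; the empty polynomial evaluates to 0."""
    return max((math.prod(f[x] for x in m) for m in poly), default=0.0)

scores = {tup: eval_viterbi(poly, f) for tup, poly in prov.items()}
for tup, score in scores.items():
    print(tup, round(score, 2))
# ('cat', 'gray') 0.72
# ('cat', 'red') 0.09
```

Changing f, for example after revising the confidence in one of Val's observations, only requires re-evaluating the polynomials, not re-running the query.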
Some application semirings
(B, ∧, ∨, ⊤, ⊥) binary trust
(N, +, ·, 0, 1) multiplicity (number of derivations)
(A, min, max, 0, Pub) access control
V = ([0,1], max, ·, 0, 1) Viterbi semiring (MPE), confidence scores
T = ([0,∞], min, +, ∞, 0) tropical semiring (shortest paths), data pricing
F = ([0,1], max, min, 0, 1) “fuzzy logic” semiring
Two kinds of semirings in this framework
Provenance semirings, e.g.,
(N[X], +, ·, 0, 1) provenance polynomials [GKT07]
(Why(X), ∪, ⋓, ∅, {∅}) witness why-provenance [BKT01]
Application semirings, e.g.,
(A, min, max, 0, Pub) access control [FGT08]
V = ([0,1], max, ·, 0, 1) Viterbi semiring (MPE) [GKIT07]
Provenance specialization relies on
- Provenance semirings are freely generated by provenance tokens
- Query commutation with semiring homomorphisms
Query commutation with homomorphisms
query in QL, homomorphism h : K1 → K2
K1-Rel --query--> K1-Rel
  | h               | h
  v                 v
K2-Rel --query--> K2-Rel
QL = RA+, Datalog [GKT07] and extensions [FGT08, GP10, ADT11a, T13, DMT15, GUKFC16, T17]
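The commutation square can be checked on a small instance. The sketch below assumes K1 = N (bag semantics), K2 = B (set semantics) and the homomorphism h(n) = (n > 0); the multiplicities and the helper names `query` and `apply_h` are made up for the example.

```python
def query(sue, val, add, mul, zero):
    """Zack(x,z) :- Sue(x,y), Val(y,z) over K-relations given as dicts."""
    out = {}
    for (x, y), a in sue.items():
        for (y2, z), b in val.items():
            if y == y2:
                out[(x, z)] = add(out.get((x, z), zero), mul(a, b))
    return out

def apply_h(h, rel):
    """Apply a homomorphism to every annotation of a K-relation."""
    return {tup: h(k) for tup, k in rel.items()}

# N-relations: annotations are multiplicities (0 means the tuple is absent).
sue = {("cat", "mouse"): 2, ("cat", "rat"): 1}
val = {("mouse", "gray"): 3, ("mouse", "red"): 0, ("rat", "gray"): 1}

nat = (lambda a, b: a + b, lambda a, b: a * b, 0)
boolean = (lambda a, b: a or b, lambda a, b: a and b, False)

h = lambda n: n > 0   # semiring homomorphism from N to B

left = apply_h(h, query(sue, val, *nat))                    # query, then h
right = query(apply_h(h, sue), apply_h(h, val), *boolean)   # h, then query
print(left == right)  # True
```

Both paths around the square give (cat, gray) annotated True and (cat, red) annotated False.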
K-Nested Relational Calculus
K-sets. Every element of the set is annotated with some k ∈ K, where (K, +, ·, 0, 1) is a commutative semiring.
Map f on S: { f(x) | x ∈ S }
If x is annotated by k then the annotation of f(x) is multiplied by k.
K-sets also form a commutative semiring. This gives annotations for
“FlatMap” g on S: ∪ { g(x) | x ∈ S }
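A hedged sketch of Map and FlatMap on K-sets, taking K = N for concreteness and representing a K-set as a dictionary from elements to annotations. The function names `k_map` and `k_flatmap` and the sample data are assumptions made for illustration.

```python
def k_map(f, s, add, zero):
    """Map f over a K-set: elements with the same image have annotations added."""
    out = {}
    for x, k in s.items():
        y = f(x)
        out[y] = add(out.get(y, zero), k)
    return out

def k_flatmap(g, s, add, mul, zero):
    """Union of the K-sets g(x), each scaled by the annotation of x."""
    out = {}
    for x, k in s.items():
        for y, k2 in g(x).items():
            out[y] = add(out.get(y, zero), mul(k, k2))
    return out

add = lambda a, b: a + b
mul = lambda a, b: a * b

s = {1: 2, 2: 3, 3: 1}   # N-set: element -> multiplicity

mapped = k_map(lambda x: x % 2, s, add, 0)
flat = k_flatmap(lambda x: {x: 1, x + 1: 2}, s, add, mul, 0)

print(mapped)  # {1: 3, 0: 3}
print(flat)    # {1: 2, 2: 7, 3: 7, 4: 2}
```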
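A Hierarchy of Provenance Semirings [G09, DMRT14]
[Figure: a diagram ordered from most informative (top) to least informative (bottom). Each arrow is a surjective semiring homomorphism, identity on X; arrows are labeled “+ idemp.”, “· idemp.”, or “absorption” (ab + a = a).]
Example: the image of 2x²y + xy + 5y² + xz in each semiring
N[X]: 2x²y + xy + 5y² + xz
B[X]: x²y + xy + y² + xz
Trio(X): 3xy + 5y + xz
Sorp(X): xy + y² + xz
Why(X): xy + y + xz
PosBool(X): y + xz
Which(X): xyz
Application semirings shown in the figure: A; T, V; N; B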
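The images of the example polynomial 2x²y + xy + 5y² + xz can be recomputed with a short sketch. The encoding (a `Counter` from monomials to coefficients) and all function names are assumptions made for this illustration; the three functions implement "+ idempotent", "· idempotent" and absorption (ab + a = a) respectively.

```python
from collections import Counter

# 2x^2y + xy + 5y^2 + xz; a monomial is a sorted tuple of (token, exponent).
poly = Counter({
    (("x", 2), ("y", 1)): 2,
    (("x", 1), ("y", 1)): 1,
    (("y", 2),): 5,
    (("x", 1), ("z", 1)): 1,
})

def drop_coefficients(p):
    """Make + idempotent: every coefficient becomes 1."""
    return Counter({m: 1 for m in p})

def drop_exponents(p):
    """Make · idempotent: exponents become 1, merged monomials add up."""
    out = Counter()
    for m, c in p.items():
        out[tuple((x, 1) for x, _ in m)] += c
    return out

def divides(m1, m2):
    """True if monomial m1 divides monomial m2."""
    e2 = dict(m2)
    return all(e2.get(x, 0) >= e for x, e in m1)

def absorb(p):
    """Apply ab + a = a: drop monomials divisible by another monomial."""
    p = drop_coefficients(p)
    return Counter({m: 1 for m in p
                    if not any(o != m and divides(o, m) for o in p)})

def show(p):
    def mono(m):
        return "".join(x + ("^%d" % e if e > 1 else "") for x, e in m)
    return " + ".join((str(c) if c > 1 else "") + mono(m)
                      for m, c in sorted(p.items()))

b_x = drop_coefficients(poly)        # B[X]
trio = drop_exponents(poly)          # Trio(X)
why = drop_coefficients(trio)        # Why(X)
sorp = absorb(poly)                  # Sorp(X)
posbool = absorb(trio)               # PosBool(X)
which = sorted({x for m in poly for x, _ in m})   # Which(X)

for name, p in [("B[X]", b_x), ("Trio(X)", trio), ("Why(X)", why),
                ("Sorp(X)", sorp), ("PosBool(X)", posbool)]:
    print(name, show(p))
print("Which(X)", "".join(which))
```

The printed monomials appear in sorted order, so they match the slide up to the order of the summands.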
A menagerie of provenance semirings
(Which(X), ∪, ∪*, ∅, ∅*) sets of contributing tuples. “Lineage” (1) [CWW00]
(Why(X), ∪, ⋓, ∅, {∅}) sets of sets of … Witness why-provenance [BKT01]
(PosBool(X), ∧, ∨, ⊤, ⊥) minimal sets of sets of … Minimal witness why-provenance [BKT01], also “Lineage” (2) used in probabilistic dbs [SORK11]
(Trio(X), +, ·, 0, 1) bags of sets of … “Lineage” (3) [BDHTW08, G09]
(B[X], +, ·, 0, 1) sets of bags of … Boolean coeff. polynomials [G09]
(Sorp(X), +, ·, 0, 1) minimal sets of bags of … absorptive polynomials [DMRT14]
(N[X], +, ·, 0, 1) bags of bags of … universal provenance polynomials [GKT07]
Further aspects of the framework
Extension to tree data (Nested Relational Calculus, structural recursion on trees, unordered XQuery) [FGT08]
Study of CQ/UCQ on provenance-annotated relations [G09]
Extension to aggregates (poly-size overhead) [ADT11a]
Poly-size provenance for Datalog (circuits; PosBool(X), Sorp(X) …) [DMRT14]
Extension to data-dependent finite state processes [DMT15]
Connections to semiring monad [FGT08, T13], to semimodules [ADT11a], to tensor products [ADT11a, DMT15]
Provenance for aggregation
9/2/16, Simons Institute
DS: (a, 20) x; (a, 10) y; (b, 15) q; (b, 10) r; (b, 25) s
DS-agg (SUM S GROUP BY D): (a, 20 + 10) ?; (b, 15 + 10 + 25) ?
Desiderata
1. Compatibility with set/bag semantics
2. Fundamental property (commutation with homomorphisms)
3. Poly-size overhead! 1 + 2 + 4 + … + 2^(n-1) => 2^n results
Solution inspired by (semi)linear algebra
DS: (a, 20) x; (a, 10) y; (b, 15) q; (b, 10) r; (b, 25) s
DS-agg: (a, x 20 + y 10) ?; (b, q 15 + r 10 + s 25) ?
(R, +, 0) is not a Prov(X)-semimodule, but …
(K-Rel, ∪, ∅) is a K-semimodule with the singletons as basis.
Relations are the result of ∪-aggregation! What if (R, +, 0) were a Prov(X)-semimodule?
Tensor product construction
DS-agg: (a, x⊗20 + y⊗10) x + y; (b, q⊗15 + r⊗10 + s⊗25) q + r + s
Embed a commutative monoid M (for sum, max or min) into a K-semimodule K⊗M (new values!)
Consistency: embedding should be faithful.
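A rough sketch of the idea for SUM, under simplifying assumptions: an aggregate value is kept as a formal sum of simple tensors token⊗value (linear size), and it is collapsed to a number only once the tokens are valuated in N (bag semantics), reading n⊗m as m added n times. The function names `sum_group_by` and `collapse` are hypothetical, and the sketch does not cover the general construction for arbitrary K.

```python
from collections import defaultdict

# DS rows: (group, value, provenance token), as on the slides.
rows = [("a", 20, "x"), ("a", 10, "y"),
        ("b", 15, "q"), ("b", 10, "r"), ("b", 25, "s")]

def sum_group_by(rows):
    """Symbolic SUM ... GROUP BY: one formal sum of token⊗value terms per group."""
    agg = defaultdict(list)
    for group, value, tok in rows:
        agg[group].append((tok, value))   # the simple tensor tok ⊗ value
    return dict(agg)

def collapse(tensor, multiplicity):
    """Valuate tokens in N and read n ⊗ m as m added n times."""
    return sum(multiplicity[tok] * value for tok, value in tensor)

agg = sum_group_by(rows)
print(agg["a"])  # [('x', 20), ('y', 10)]

ones = {t: 1 for t in "xyqrs"}
print(collapse(agg["a"], ones), collapse(agg["b"], ones))  # 30 50

# Deleting the tuple annotated x (x = 0) changes the sum without re-running the query.
print(collapse(agg["a"], {**ones, "x": 0}))  # 10
```

The symbolic result has one term per input tuple, in contrast to the 2^n possible sums that an explicit enumeration of sub-instances would produce.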
Negative information; non-monotone operations (difference)
Boolean expressions [IL84]. Limited.
Add a binary operation corresponding to difference:
• m-semirings (common gen. of set and bag difference) [GP10]
• spm-semirings (OPTIONAL in SPARQL) [GUKFC16]
Encode difference by aggregation [ADT11a]
Different equational theories, different algebraic optimizations [ADT11b]
Still not clear how to track negative information. Useful: non-answers (why not?), insertion propagation.
Logical model checking (“provenance of … truth?”): negation as duality (NNFs), logical games; ongoing work with Grädel [T16, T17]
Current targets
ANALYTICS COMPUTATIONS
“Fine-grained provenance for linear algebra operators”, Yan, T., Ives, TaPP 16
DISTRIBUTED SYSTEMS / NETWORK PROVENANCE
“Time-aware provenance for distributed systems”, Zhou, Ding, Haeberlen, Ives, Loo, TaPP 11
“Diagnosing missing events in distributed systems with negative provenance”, Wu, Zhao, Haeberlen, Zhou, Loo, SIGCOMM 14
STATIC ANALYSIS OF SOFTWARE
“On abstraction refinement for program analyses in Datalog”, Zhang, Mangal, Grigore, Naik, PLDI 14
Framework references (I)
[GKT07] “Provenance semirings”, Green, Karvounarakis, Tannen, PODS 07.
[GKIT07] “Update exchange with mappings and provenance”, Green, Karvounarakis, Ives, Tannen, VLDB 07.
[FGT08] “Annotated XML: queries and provenance”, Foster, Green, Tannen, PODS 08.
[G09] “Containment of conjunctive queries on annotated relations”, Green, ICDT 09.
[GP10] “On database query languages for K-relations”, Geerts, Poggi, J. Appl. Logic 2010.
Framework references (II)
[ADT11a] “Provenance for aggregate queries”, Amsterdamer, Deutch, Tannen, PODS 11.
[ADT11b] “On the limitations of provenance for queries with difference”, Amsterdamer, Deutch, Tannen, TaPP 11.
[T13] “Provenance propagation in complex queries”, Tannen, Buneman Festschrift 2013.
[DMRT14] “Circuits for Datalog provenance”, Deutch, Milo, Roy, T., ICDT 14.
[DMT15] “Provenance-based analysis of data-centric processes”, Deutch, Moskovitch, Tannen, VLDB J. 2015.
Framework references (III)
[GUKFC16] “Algebraic structures for capturing the provenance of SPARQL queries”, Geerts, Unger, Karvounarakis, Fundulaki, Christophides, JACM 2016.
[T16] “About the provenance of truth”, Tannen, Simons Inst. Website 16, https://simons.berkeley.edu/talks/val-tannen-2016-12-09
[T17] “Provenance analysis for FOL model checking”, Tannen, SIGLOG News 2017.
[GT17a] “The semiring framework for database provenance”, Green, Tannen, PODS 2017.
[GT17b] “Semiring provenance for first-order model checking”, Grädel, Tannen, CoRR abs/1712.01980 (2017).
Other references
[IL84] “Incomplete information in relational databases”, Imieliński, Lipski, JACM 1984.
[FR97] “A probabilistic relational algebra”, Fuhr, Rölleke, TOIS 1997.
[Z97] “Query evaluation in probabilistic relational databases”, Zimányi, DDS 1997.
[CWW00] “Tracing the lineage of view data in a warehousing environment”, Cui, Widom, Wiener, TODS 2000.
[BKT01] “Why and where: a characterization of data provenance”, Buneman, Khanna, Tan, ICDT 2001.
[BDHTW08] “Databases with uncertainty and lineage”, Benjelloun, Das Sarma, Halevy, Theobald, Widom, VLDB J. 2008.
[SORK11] “Probabilistic databases”, Suciu, Olteanu, Ré, Koch, SLDM 2011.
Thank you!