Ibis: A Provenance Manager for Multi-Layer Systems

Christopher Olston & Anish Das Sarma, Yahoo! Research


Page 1:

Ibis: A Provenance Manager for Multi-Layer Systems

Christopher Olston & Anish Das Sarma, Yahoo! Research

Page 2:

Motivation: Many Sub-Systems

[Figure labels:]

•  scalable file system, e.g. GFS
•  distributed sorting & hashing, e.g. Map-Reduce
•  dataflow programming framework, e.g. Pig
•  workflow manager, e.g. Oozie
•  low-latency processor
•  serving
•  ingestion
•  datum X
•  datum Y
•  metadata queries
•  provenance of X?

Page 3:

Ibis Project

•  Benefits:
    –  Provide uniform view to users
    –  Factor out metadata management code
    –  Decouple metadata lifetime from data/subsystem lifetime

•  Challenges:
    –  Overhead of shipping metadata
    –  Disparate data/processing granularities

[Figure labels: data processing sub-systems; metadata; Ibis (metadata manager, integrated metadata); metadata queries; answers; users; THIS PAPER]

Page 4:

Example Granularity Lattices

[Figure: two lattices. Recoverable labels:]

•  Data granularities: Table; Column group; Row; Column; Cell; Version; Web page
•  Process granularities: Workflow; MR program; Pig script; Pig job; Pig logical operation; MR job; Pig physical operation; MR job phase; MR task; Task attempt

Page 5:

Challenges

•  Inference: Given relationships expressed at one granularity, answer queries about other granularities (the semantics are tricky here!)

•  Efficiency: Implement inference without resorting to materializing everything in terms of finest granularity (e.g. cells)

Page 6:

Talk Outline

•  Informal overview
    –  Example data provenance graph
    –  Query language overview + examples

•  Touch on formal model (details in paper)

Page 7:

Example Workflow

[Figure labels:]

•  IMDB web page; IMDb Extract; Extracted IMDb
•  Yahoo! Movies web page; Y! Extract; Extracted Y!
•  Merge; Movie DB

Page 8:

Provenance Graph

[Figure: provenance graph for the example workflow. Recoverable labels:]

•  Data vertices: Yahoo! Movies web page; IMDB web page; Yahoo extracted table; IMDB extracted table; map output 1; map output 2; combined table
•  Process vertices: extract pig script; merge pig script; pig job 1; pig job 2; map task 1, attempt 1; map task 2, attempt 1; reduce task 1, attempt 1
•  Annotations: version=3 wrapper=yahoo; version=2 wrapper=imdb; license=yahoo auth.score=5; license=imdb auth.score=4

Tables shown in the figure:

title      year  lead actor
Avatar     2009  V1: Worthington, V2: Saldana
Inception  2010  DiCaprio

title      year  lead actor
Avatar     2009  Saldana
Inception  2010  DiCaprio

title      year  lead actor
Avatar     2009  Worthington
Inception  2010  DiCaprio

Page 9:

Meaning of Provenance Relationships

•  (P, D1, D2): Process P consumed PART OF datum D1 and emitted ALL OF datum D2

•  "part→all" semantics are a natural default

•  Upshot: if D1 and D2 are tables, cannot infer that a given row in D1 influenced D2

•  In query language, can still ask "part→part" questions: is there a d2 ∈ D2 such that D1 influenced d2?

Page 10:

Query Language: "IQL"

•  SQL-style language for querying the provenance graph

•  Special constructs:
    –  Under (containment): Is row R under table T?
    –  Influence: Does data D1 influence data D2?
    –  Feed: Does data D feed process P?
    –  Emit: Does process P emit data D?

Page 11:

IQL Examples

•  Find data items that influenced the combined extracted table:

    select d.id
    from AnyData d, Table t
    where d influences t and t.id = (combined extracted table);

•  Find data tables that are "contaminated" by version 3 of the extraction script (found to have a bug):

    select t.id
    from PigScript p, PigJob j, AnyData d1, AnyData d2, Table t
    where p.id = (extract pig script) and j under p and j.version = 3
      and j emits d1 and d1 influences d2 and d2 under t;

Page 12:

Implementation Status

•  We have a working storage/query engine based on rewriting over SQL/RDBMS (SQLite)

•  We're currently working on automatic provenance capture (from Pig, Hadoop, etc.)
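As a rough illustration of what evaluating a provenance query over SQLite could look like, the sketch below answers a coarse-grained version of the first IQL example ("find data items that influenced the combined extracted table") with a recursive query. The `prov` table, its columns, the vertex and process names, and the `influencers` helper are all hypothetical; this is not Ibis's actual schema or rewriting, and it ignores containment between granularities.

```python
import sqlite3


def build_example_db() -> sqlite3.Connection:
    """Create an in-memory database holding a toy provenance graph (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE prov (src TEXT, process TEXT, dst TEXT);
        INSERT INTO prov VALUES
            ('yahoo_page',      'yahoo_extract_job', 'yahoo_extracted'),
            ('imdb_page',       'imdb_extract_job',  'imdb_extracted'),
            ('yahoo_extracted', 'merge_job',         'combined'),
            ('imdb_extracted',  'merge_job',         'combined');
        """
    )
    return conn


# Walk provenance edges backwards from the target until no new vertices appear.
INFLUENCERS_SQL = """
WITH RECURSIVE upstream(id) AS (
    SELECT src FROM prov WHERE dst = :target
    UNION
    SELECT p.src FROM prov p JOIN upstream u ON p.dst = u.id
)
SELECT id FROM upstream ORDER BY id;
"""


def influencers(conn: sqlite3.Connection, target: str) -> list[str]:
    """Return every data vertex with a provenance path to `target` (excluding `target` itself)."""
    return [row[0] for row in conn.execute(INFLUENCERS_SQL, {"target": target})]


if __name__ == "__main__":
    print(influencers(build_example_db(), "combined"))
```

A fuller treatment would also join against a containment relation, so that rows and cells under a table are accounted for.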

Page 13:

Talk Outline

•  Informal overview
    –  Example data provenance graph
    –  Query language overview + examples

•  Touch on formal model (details in paper)
    –  Open-world semantics
    –  Transitive inference of containment & influence

Page 14:

Open-World Semantics

•  Metadata of Ibis encodes set F of facts

•  Open-world:
    –  Correctness: All facts in F are correct
    –  Incomplete: May be other facts unknown to Ibis

•  Extension, ext(F), of facts that can be derived from F

•  True world has set of facts F'

•  We have F ⊆ ext(F) ⊆ F'

Page 15:

Open-World Semantics: One Implication

•  Suppose F contains:
    –  Process p emitted row r1
    –  Currently r1 is the only row in table T

•  "Process p emitted table T" is a fact that may be in F' (true world) but cannot be inferred in ext(F)

Page 16:

Inferring "X is under Y"

•  Defined in terms of "granularization":
    1.  Resolve X and Y into finest-grain elements (e.g. cells)
    2.  Perform set containment check

•  Implemented via a shortcut that avoids enumerating sub-elements

•  Proof that implementation & definition are equivalent

Page 17:

Inferring "X is under Y"

Basic element b defined by granularity g, direct parents P (and an identifier).

Granularization of b to finest granularity g_min defined by:

    G(b) = { b' = (g_min, P') | b contains b' }

Containment obtained by recursive application of parent relation.

Complex element E defined by set of granularities {g1, …, gn}, and corresponding basic elements {b1, …, bn}.

Granularization of complex element E consisting of b1, …, bn is:

    G(E) = ∩_i G(b_i)
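To make the granularization definition concrete, here is a minimal sketch over a hypothetical two-row, three-column table in which cells are the finest granularity. The table contents, the `(granularity, identifier)` encoding of basic elements, and the function names are assumptions made for illustration. The sketch also assumes the combining operator for a complex element is an intersection, so that a cell is the intersection of its row and its column. It enumerates cells explicitly, which is the definition rather than the efficient implementation.

```python
from functools import reduce

# Hypothetical table: a cell is a (row, column) pair at the finest granularity.
ROWS = ("r1", "r2")
COLUMNS = ("title", "year", "lead_actor")
ALL_CELLS = frozenset((r, c) for r in ROWS for c in COLUMNS)


def granularize_basic(basic):
    """G(b): the set of finest-grain cells contained in one basic element."""
    granularity, ident = basic
    if granularity == "table":
        return ALL_CELLS
    if granularity == "row":
        return frozenset(cell for cell in ALL_CELLS if cell[0] == ident)
    if granularity == "column":
        return frozenset(cell for cell in ALL_CELLS if cell[1] == ident)
    raise ValueError(f"unknown granularity: {granularity}")


def granularize_complex(basics):
    """G(E): intersection of G(b_i) over the (non-empty) basic elements of E."""
    return reduce(frozenset.intersection, (granularize_basic(b) for b in basics))


if __name__ == "__main__":
    # The cell at row r1, column year, expressed as a complex element.
    print(granularize_complex([("row", "r1"), ("column", "year")]))
```

Under open-world semantics the true world may contain rows the system does not know about, so a real check cannot rely on enumerating a fixed cell set as this toy does.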

Page 18:

Inferring "X is under Y"

Under Check-1: Sets of complex elements E1, E2. E1 is under E2 iff there does not exist a true world with

    ∪_{e1 ∈ E1} G(e1) ⊈ ∪_{e2 ∈ E2} G(e2)

Efficient Under Check-2: Sets of complex elements E1, E2. E1 is under E2 iff for all e1 ∈ E1, there exists e2 ∈ E2 such that e1 is under e2.

Given complex elements e1 and e2 with basic element sets B(e1) and B(e2), e1 is under e2 iff for all b2 ∈ B(e2), there exists b1 ∈ B(e1) such that b2 contains b1.

Theorem: Check-1 is equivalent to Check-2.
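The efficient check can be sketched directly from its statement, without enumerating any finest-grain elements. In the sketch below, the `PARENTS` map, the element names, and the function names are hypothetical; containment between basic elements is taken to be reflexive and obtained by recursively following the parent relation, and a complex element is modeled as a set of basic element names.

```python
# Hypothetical parent relation: basic element -> list of direct parents.
PARENTS = {
    "T": [],
    "row1": ["T"],
    "col_year": ["T"],
}


def contains(outer, inner):
    """True if basic element `outer` is `inner` itself or one of its ancestors."""
    if outer == inner:
        return True
    return any(contains(outer, parent) for parent in PARENTS.get(inner, []))


def element_under(e1, e2):
    """e1 under e2 iff every b2 in B(e2) contains some b1 in B(e1)."""
    return all(any(contains(b2, b1) for b1 in e1) for b2 in e2)


def set_under(E1, E2):
    """E1 under E2 iff every e1 in E1 is under some e2 in E2."""
    return all(any(element_under(e1, e2) for e2 in E2) for e1 in E1)


# Complex elements as sets of basic elements.
CELL = frozenset({"row1", "col_year"})
ROW = frozenset({"row1"})
TABLE = frozenset({"T"})

if __name__ == "__main__":
    print(element_under(CELL, ROW))   # a cell of row1 is under row1
    print(element_under(ROW, CELL))   # the whole row is not under one of its cells
```

The cost depends only on the number of basic elements and the depth of the parent relation, not on how many cells a table holds.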

Page 19:

Inferring "X influences Y"

Given two data vertices d1 and d2:

(1) d1 influences(0) d2 iff d2 is under d1;

(2) d1 influences(1) d2 iff one of the following holds:
    (A) d1 influences(0) d2
    (B) there exists a provenance relationship (d1', p, d2') such that d1 influences(0) d1' and d2' influences(0) d2

(3) For any integer k > 1, d1 influences(k) d2 iff there exists d* such that d1 influences(1) d* and d* influences(k-1) d2
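The three rules translate almost line for line into a recursive function. The sketch below is an illustration under simplifying assumptions: each data vertex has at most one parent (the hypothetical `PARENT` map), "under" is reflexive, and the vertex names and provenance relationships in `PROV` are invented for the example, written as (source, process, destination).

```python
# Hypothetical containment: child data vertex -> its single parent.
PARENT = {"yahoo_row": "yahoo_table", "combined_row": "combined_table"}

# Hypothetical provenance relationships (d1', p, d2').
PROV = [
    ("yahoo_page", "extract", "yahoo_table"),
    ("yahoo_table", "merge", "combined_table"),
]

VERTICES = set(PARENT) | set(PARENT.values()) | {d for s, _, t in PROV for d in (s, t)}


def under(x, y):
    """True if x is y or a descendant of y in the containment hierarchy."""
    while x is not None:
        if x == y:
            return True
        x = PARENT.get(x)
    return False


def influences(d1, d2, k):
    """Does d1 influence d2 in at most k provenance steps, per rules (1) to (3)?"""
    if k == 0:
        return under(d2, d1)                                   # rule (1)
    if k == 1:
        return influences(d1, d2, 0) or any(                   # rules (2A), (2B)
            influences(d1, s, 0) and influences(t, d2, 0) for s, _, t in PROV
        )
    return any(                                                # rule (3)
        influences(d1, mid, 1) and influences(mid, d2, k - 1) for mid in VERTICES
    )


if __name__ == "__main__":
    print(influences("yahoo_table", "combined_row", 1))
    print(influences("yahoo_row", "combined_table", 1))
```

Because a process is only known to have consumed part of its input, the second call cannot conclude that a specific row of the input table influenced the output, while the first call can conclude that the whole input table influenced a specific output row.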

Page 20:

Related Work

•  Multi-layer system provenance:
    –  Harvard PASSv2

•  Nested collections in scientific workflow provenance:
    –  Kepler's COMAD nested collections
    –  ZOOM user views
    –  Open provenance model

•  Annotations on arbitrary sub-regions of relations:
    –  [Eltabakh et al.]
    –  [Srivastava et al.]

Page 21:

Summary

•  Many semi-independent data mgmt. layers + provenance query needs → integrated provenance

•  Diverse data & process granularities → careful semantics

•  Our contributions:
    –  Formal multi-granularity provenance semantics
    –  Query language
    –  Working prototype (see paper; work in progress)