Ibis: A Provenance Manager for Multi-Layer Systems

Post on 22-May-2020


Ibis: A Provenance Manager for Multi-Layer Systems

Christopher Olston & Anish Das Sarma, Yahoo! Research

Motivation: Many Sub-Systems

[Figure: sub-systems shown on the slide]

•  Scalable file system, e.g. GFS
•  Distributed sorting & hashing, e.g. Map-Reduce
•  Dataflow programming framework, e.g. Pig
•  Workflow manager, e.g. Oozie
•  Low-latency processor
•  Serving
•  Ingestion

Other figure labels: datum X, datum Y, metadata queries, "provenance of X?"

Ibis Project

•  Benefits:
   –  Provide uniform view to users
   –  Factor out metadata management code
   –  Decouple metadata lifetime from data/subsystem lifetime

•  Challenges:
   –  Overhead of shipping metadata
   –  Disparate data/processing granularities

[Figure labels: data processing sub-systems, metadata manager (Ibis), users; metadata, integrated metadata, metadata queries, answers; "THIS PAPER"]

Example Granularity Lattices

[Figure: two lattices, one per kind of granularity]

Data granularities: Table, Column group, Row, Column, Cell, Version, Web page

Process granularities: Workflow, Pig script, MR program, Pig job, Pig logical operation, Pig physical operation, MR job, MR job phase, MR task, Task attempt

Challenges

•  Inference: Given relationships expressed at one granularity, answer queries about other granularities (the semantics are tricky here!)

•  Efficiency: Implement inference without resorting to materializing everything in terms of finest granularity (e.g. cells)

Talk Outline

•  Informal overview
   –  Example data provenance graph
   –  Query language overview + examples

•  Touch on formal model (details in paper)

Example Workflow

[Figure: workflow diagram]

Data: IMDB web page, Yahoo! Movies web page, Extracted IMDb, Extracted Y!, Movie DB

Processes: IMDb Extract, Y! Extract, Merge

Provenance Graph

[Figure: provenance graph for the example workflow]

Data vertices: Yahoo! Movies web page, IMDB web page, map output 1, map output 2, Yahoo extracted table, IMDB extracted table, combined table

Process vertices: extract pig script, merge pig script, pig job 1, pig job 2, map task 1 (attempt 1), map task 2 (attempt 1), reduce task 1 (attempt 1)

Annotations shown: version=3, wrapper=yahoo; version=2, wrapper=imdb; license=yahoo, auth. score=5; license=imdb, auth. score=4

Tables shown (all with columns title, year, lead actor):

title      year  lead actor
Avatar     2009  V1: Worthington, V2: Saldana
Inception  2010  DiCaprio

title      year  lead actor
Avatar     2009  Saldana
Inception  2010  DiCaprio

title      year  lead actor
Avatar     2009  Worthington
Inception  2010  DiCaprio

Meaning of Provenance Relationships

•  (P, D1, D2): Process P consumed PART OF datum D1 and emitted ALL OF datum D2

•  "part→all" semantics are a natural default

•  Upshot: if D1 and D2 are tables, cannot infer that a given row in D1 influenced D2

•  In query language, can still ask "part→part" questions: d2 ∈ D2 such that D1 influenced d2?

Query Language: "IQL"

•  SQL-style language for querying the provenance graph

•  Special constructs:
   –  Under (containment): Is row R under table T?
   –  Influence: Does data D1 influence data D2?
   –  Feed: Does data D feed process P?
   –  Emit: Does process P emit data D?

IQL Examples

•  Find data items that influenced the combined extracted table:

select d.id from AnyData d, Table t where d influences t and t.id = (combined extracted table);

•  Find data tables that are "contaminated" by version 3 of the extraction script (found to have a bug):

select t.id from PigScript p, PigJob j, AnyData d1, AnyData d2, Table t where p.id = (extract pig script) and j under p and j.version = 3 and j emits d1 and d1 influences d2 and d2 under t;

Implementation Status

•  We have a working storage/query engine based on rewriting over SQL/RDBMS (SQLite)

•  We're currently working on automatic provenance capture (from Pig, Hadoop, etc.)
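As a rough illustration of what "rewriting over SQL" could look like, the sketch below answers a simplified form of the first IQL example with a recursive SQLite query. The `influence_edge` table, the vertex names, and the `influencers` helper are all hypothetical and are not the schema or rewriting used by Ibis. Containment ("under") is ignored here, so this only follows explicit edges.

```python
import sqlite3

def influencers(conn, target_id):
    """Return ids with a direct or transitive influence edge into target_id.

    Assumes a hypothetical table influence_edge(src, dst); this is an
    illustrative schema, not the one used by the Ibis prototype.
    """
    rows = conn.execute(
        """
        WITH RECURSIVE upstream(id) AS (
            SELECT src FROM influence_edge WHERE dst = ?
            UNION
            SELECT e.src FROM influence_edge e JOIN upstream u ON e.dst = u.id
        )
        SELECT id FROM upstream ORDER BY id
        """,
        (target_id,),
    ).fetchall()
    return [row[0] for row in rows]

# Toy graph loosely modelled on the example workflow (names are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE influence_edge (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO influence_edge VALUES (?, ?)",
    [
        ("yahoo_page", "yahoo_table"),
        ("imdb_page", "imdb_table"),
        ("yahoo_table", "combined_table"),
        ("imdb_table", "combined_table"),
    ],
)
result = influencers(conn, "combined_table")
print(result)
```

The `UNION` (rather than `UNION ALL`) in the recursive query removes duplicates, which also keeps the recursion finite if the edge table happens to contain a cycle.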

Talk Outline

•  Informal overview
   –  Example data provenance graph
   –  Query language overview + examples

•  Touch on formal model (details in paper)
   –  Open-world semantics
   –  Transitive inference of containment & influence

Open-World Semantics

•  Metadata of Ibis encodes set F of facts

•  Open-world:
   –  Correctness: All facts in F are correct
   –  Incomplete: May be other facts unknown to Ibis

•  Extension, ext(F), of facts that can be derived from F

•  True world has set of facts F'

•  We have $F \subseteq \mathrm{ext}(F) \subseteq F'$

Open-World Semantics: One Implication

•  Suppose F contains:

•  Process p emitted row r1

•  Currently r1 is the only row in table T

•  "Process p emitted table T" is a fact that may be in F' (true world) but cannot be inferred in ext(F)

Inferring "X is under Y"

•  Defined in terms of "granularization":
   1.  Resolve X and Y into finest-grain elements (e.g. cells)
   2.  Perform set containment check

•  Implemented via a shortcut that avoids enumerating sub-elements

•  Proof that implementation & definition are equivalent

Inferring "X is under Y"

Basic element b defined by granularity g, direct parents P (and an identifier).

Granularization of b to finest granularity g_min defined by:

$G(b) = \{\, b' = (g_{\min}, P') \mid b \text{ contains } b' \,\}$

Containment obtained by recursive application of parent relation.

Complex element E defined by set of granularities {g1, …, gn}, and corresponding basic elements {b1, …, bn}.

Granularization of complex element E consisting of b1, …, bn is:

$G(E) = \bigcap_i G(b_i)$

Inferring "X is under Y"

Under Check-1: Sets of complex elements E1, E2. E1 is under E2 iff there exists no true world with

$\bigcup_{e_1 \in E_1} G(e_1) \not\subseteq \bigcup_{e_2 \in E_2} G(e_2)$

Efficient Under Check-2: Sets of complex elements E1, E2. E1 is under E2 iff for all e1 ∈ E1, there exists e2 ∈ E2 such that e1 is under e2.

Given complex elements e1 and e2 with basic element sets B(e1) and B(e2), e1 is under e2 iff for all b2 ∈ B(e2), there exists b1 ∈ B(e1) such that b2 contains b1.

Theorem: Check-1 is equivalent to Check-2.
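Check-2 can be sketched in a few lines of Python. The `Basic` class, the function names, and the toy table (one table T, rows r1 and r2, columns c1 and c2, four cells) are illustrative assumptions, not the Ibis data model. `granularize` enumerates only the cells known to the program, so it approximates G over a single closed world and is included just for comparison.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Basic:
    """Basic element: identifier, granularity, and direct parents (hypothetical model)."""
    ident: str
    granularity: str
    parents: tuple = ()

def contains(b, b_prime):
    """True if b equals b_prime or is reachable from b_prime via the parent relation."""
    if b == b_prime:
        return True
    return any(contains(b, p) for p in b_prime.parents)

def under(e1, e2):
    """Check-2 for complex elements, each given as a set of basic elements."""
    return all(any(contains(b2, b1) for b1 in e1) for b2 in e2)

def under_sets(E1, E2):
    """Check-2 for sets of complex elements."""
    return all(any(under(e1, e2) for e2 in E2) for e1 in E1)

def granularize(e, finest):
    """Known finest-grain elements contained in every basic element of e."""
    return {c for c in finest if all(contains(b, c) for b in e)}

# Toy table: every cell has one row parent and one column parent.
T = Basic("T", "table")
r1, r2 = Basic("r1", "row", (T,)), Basic("r2", "row", (T,))
c1, c2 = Basic("c1", "column", (T,)), Basic("c2", "column", (T,))
cells = [
    Basic(f"x{i}{j}", "cell", (r, c))
    for i, r in enumerate((r1, r2), 1)
    for j, c in enumerate((c1, c2), 1)
]

print(under({r1}, {T}))       # row r1 is under table T
print(under({T}, {r1}))       # table T is not under row r1
print(under({r1, c1}, {r1}))  # the (r1, c1) intersection is under row r1
```

In this sketch, whenever `under(e1, e2)` holds, the known cells of e1 are a subset of the known cells of e2. The converse can fail when enumerating known cells only: if r1 were the only known row of T, the known cells of T and r1 would coincide even though T is not under r1, which is the situation described on the open-world slide.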

Inferring "X influences Y"

Given two data vertices d1 and d2:

(1) d1 influences(0) d2 iff d2 is under d1;

(2) d1 influences(1) d2 iff one of the following holds:
    (A) d1 influences(0) d2
    (B) there exists a provenance relationship (d1', p, d2') such that d1 influences(0) d1' and d2' influences(0) d2

(3) For any integer k > 1, d1 influences(k) d2 iff there exists d* such that d1 influences(1) d* and d* influences(k‐1) d2
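The three rules can be read as a direct recursion. The sketch below is a minimal transcription under simplifying assumptions: each data vertex has at most one parent, "under" is plain ancestor-or-self, and the vertex and process names are invented for illustration rather than taken from the Ibis prototype.

```python
# Hypothetical containment: child -> parent (each vertex has at most one parent here).
PARENT = {
    "yahoo_row_avatar": "yahoo_table",
    "combined_row_avatar": "combined_table",
}

# Hypothetical provenance relationships (d_in, p, d_out):
# process p consumed part of d_in and emitted all of d_out.
RELATIONSHIPS = [
    ("yahoo_page", "extract_job", "yahoo_table"),
    ("yahoo_table", "merge_job", "combined_table"),
]

VERTICES = (
    set(PARENT) | set(PARENT.values())
    | {d for d_in, _p, d_out in RELATIONSHIPS for d in (d_in, d_out)}
)

def under(x, y):
    """True if x equals y or y is an ancestor of x."""
    while x is not None:
        if x == y:
            return True
        x = PARENT.get(x)
    return False

def influences0(d1, d2):
    """Rule (1): d2 is under d1."""
    return under(d2, d1)

def influences1(d1, d2):
    """Rule (2): rule (1) holds, or one relationship bridges d1 to d2."""
    if influences0(d1, d2):
        return True
    return any(
        influences0(d1, d_in) and influences0(d_out, d2)
        for d_in, _p, d_out in RELATIONSHIPS
    )

def influences(k, d1, d2):
    """Rule (3): chain one influences(1) step with influences(k-1)."""
    if k == 0:
        return influences0(d1, d2)
    if k == 1:
        return influences1(d1, d2)
    return any(
        influences1(d1, mid) and influences(k - 1, mid, d2) for mid in VERTICES
    )

print(influences(2, "yahoo_page", "combined_row_avatar"))
print(influences(1, "yahoo_row_avatar", "combined_table"))
```

The first call succeeds in two steps via `yahoo_table`. The second fails, matching the "part→all" reading: a relationship that consumed part of a table does not let us conclude that a particular row of that table influenced the output.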

Related Work

•  Multi-layer system provenance:
   –  Harvard PASSv2

•  Nested collections in scientific workflow provenance:
   –  Kepler's COMAD nested collections
   –  ZOOM user views
   –  Open provenance model

•  Annotations on arbitrary sub-regions of relations:
   –  [Eltabakh et al.]
   –  [Srivastava et al.]

Summary

•  Many semi-independent data mgmt. layers + provenance query needs → integrated provenance

•  Diverse data & process granularities → careful semantics

•  Our contributions:
   –  Formal multi-granularity provenance semantics
   –  Query language
   –  Working prototype (see paper; work in progress)