CompSci 516: Data Intensive Computing Systems, Lecture 21: Datalog. Instructor: Sudeepa Roy. Duke CS, Fall 2016.


  • Announcement

    • HW3 due next Wednesday: 11/16

  • Today

    • Datalog: for recursion in database queries

    • A quick look at Incremental View Maintenance (IVM)

  • Reading Material: Datalog

    Optional:

    1. The Datalog chapters in the "Alice Book": Foundations of Databases, Abiteboul-Hull-Vianu. Available online: http://webdam.inria.fr/Alice/

    2. Datalog tutorial, SIGMOD 2011: "Datalog and Emerging Applications: An Interactive Tutorial"

  • Brief History of Datalog

    • Motivated by Prolog
      – started back in the 1970s-80s
      – then quiet for a long time

    • A long argument in the database community over whether recursion should be supported in query languages
      – "No practical applications of recursive query theory ... have been found to date" (Michael Stonebraker, 1998, Readings in Database Systems, 3rd Edition, Stonebraker and Hellerstein, eds.)
      – Recent work by Hellerstein et al. on Datalog extensions to build networking protocols and distributed systems. [Link]

  • Datalog is resurging!

    • Number of papers and tutorials in DB conferences

    • Applications in
      – data integration, declarative networking, program analysis, information extraction, network monitoring, security, and cloud computing

    • Systems supporting Datalog in both academia and industry:
      – Lixto (information extraction)
      – LogicBlox (enterprise decision automation)
      – Semmle (program analysis)
      – BOOM/Dedalus (Berkeley)
      – Coral
      – LDL++

  • Recall our drinker example in RC (Lecture 4)

    Likes(drinker, beer)   Frequents(drinker, bar)   Serves(bar, beer)

    Find drinkers that frequent some bar that serves some beer they like.

    Q(x) = ∃y. ∃z. Frequents(x, y) ∧ Serves(y, z) ∧ Likes(x, z)

    Drinker example is from slides by Profs. Balazinska and Suciu and the [GUW] book

  • Write it as a Datalog Rule

    Likes(drinker, beer)   Frequents(drinker, bar)   Serves(bar, beer)

    Find drinkers that frequent some bar that serves some beer they like.

    RC: Q(x) = ∃y. ∃z. Frequents(x, y) ∧ Serves(y, z) ∧ Likes(x, z)

    Datalog: Q(x) :- Frequents(x, y), Serves(y, z), Likes(x, z)

    • Quick differences:
      – Uses ":-" not "="
      – No need for ∃ (assumed by default)
      – Use "," on the right-hand side (RHS)
      – Anything on the RHS of ":-" is assumed to be combined with ∧ by default
      – ∀ and ⇒ are not allowed: they need to be expressed using negation ¬
      – Standard "Datalog" does not allow negation
      – Negation is allowed in Datalog with negation

    • How to specify disjunction (OR / ∨)?

  • Example: OR in Datalog

    Likes(drinker, beer)   Frequents(drinker, bar)   Serves(bar, beer)

    Find drinkers that (a) either frequent some bar that serves some beer they like, (b) or like beer "BestBeer"

    Datalog:
    Q(x) :- Frequents(x, y), Serves(y, z), Likes(x, z)
    Q(x) :- Likes(x, "BestBeer")

    RC: Q(x) = [∃y. ∃z. Frequents(x, y) ∧ Serves(y, z) ∧ Likes(x, z)] ∨ [Likes(x, "BestBeer")]

  • Example: OR in Datalog

    Likes(drinker, beer)   Frequents(drinker, bar)   Serves(bar, beer)

    Find drinkers that (a) either frequent some bar that serves some beer they like, (b) or like beer "BestBeer", (c) or frequent bars that "Joe" frequents

    Datalog:
    JoeFrequents(w) :- Frequents("Joe", w)
    Q(x) :- Frequents(x, y), Serves(y, z), Likes(x, z)
    Q(x) :- Likes(x, "BestBeer")
    Q(x) :- Frequents(x, w), JoeFrequents(w)

    RC: Q(x) = [∃y. ∃z. Frequents(x, y) ∧ Serves(y, z) ∧ Likes(x, z)] ∨ [Likes(x, "BestBeer")] ∨ [∃w. Frequents(x, w) ∧ Frequents("Joe", w)]

    • To specify "OR", write multiple rules with the same "Head"
    • Next: terminology for Datalog
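As a rough illustration of how rules that share a head combine by union, the four rules can be evaluated directly over sets of tuples. This is only a sketch: the sample facts and the Python names (`frequents`, `serves`, `likes`, `rule1`, ...) are hypothetical and do not come from the slides.

```python
# Hypothetical sample facts (not from the slides), one set of tuples per EDB.
frequents = {("Ann", "BarA"), ("Joe", "BarB"), ("Tom", "BarB")}
serves = {("BarA", "Ale"), ("BarB", "Stout")}
likes = {("Ann", "Ale"), ("Bob", "BestBeer"), ("Joe", "Lager")}

# JoeFrequents(w) :- Frequents("Joe", w)
joe_frequents = {bar for (drinker, bar) in frequents if drinker == "Joe"}

# Q(x) :- Frequents(x, y), Serves(y, z), Likes(x, z)
rule1 = {x for (x, y) in frequents
         for (y2, z) in serves
         if y == y2 and (x, z) in likes}

# Q(x) :- Likes(x, "BestBeer")
rule2 = {x for (x, beer) in likes if beer == "BestBeer"}

# Q(x) :- Frequents(x, w), JoeFrequents(w)
rule3 = {x for (x, w) in frequents if w in joe_frequents}

# Rules with the same head are combined by union, which gives the OR.
q = rule1 | rule2 | rule3
print(sorted(q))
```

With these made-up facts, each rule contributes different drinkers, and the answer is the union of the three rule results.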

  • Datalog Rules

    • Each rule is of the form Head :- Body

    • Each variable in the head of each rule must appear in the body of the rule

    Likes(drinker, beer)   Frequents(drinker, bar)   Serves(bar, beer)

    Four rules:
    JoeFrequents(w) :- Frequents("Joe", w)
    Q(x) :- Frequents(x, y), Serves(y, z), Likes(x, z)
    Q(x) :- Likes(x, "BestBeer")
    Q(x) :- Frequents(x, w), JoeFrequents(w)

    • Head: the part to the left of ":-"
    • Body: the part to the right of ":-"
    • Atom: e.g. Frequents(x, w)
    • Variable: e.g. x

  • EDBs and IDBs

    Likes(drinker, beer)   Frequents(drinker, bar)   Serves(bar, beer)

    JoeFrequents(w) :- Frequents("Joe", w)
    Q(x) :- Frequents(x, y), Serves(y, z), Likes(x, z)
    Q(x) :- Likes(x, "BestBeer")
    Q(x) :- Frequents(x, w), JoeFrequents(w)

    • Intensional DataBases (IDBs)
      – Relations that are derived
      – Can be intermediate or final output tables
      – e.g. JoeFrequents, Q
      – Can be on the LHS or RHS (e.g. JoeFrequents)

    • Extensional DataBases (EDBs)
      – Input relation names
      – e.g. Likes, Frequents, Serves
      – Can only be on the RHS of a rule

    • A tuple in an EDB or an IDB is a FACT: it either belongs to a given EDB relation, or is derived in an IDB relation

  • Graph Example

    Directed graph on vertices a, b, c, d, e

    E (edge relation)
    V1  V2
    a   c
    b   a
    b   d
    c   d
    d   a
    d   e

  • Example 1

    E (edge relation) = {(a,c), (b,a), (b,d), (c,d), (d,a), (d,e)}

    Write a Datalog program to find paths of length two (output start and finish vertices)

    P2(x, y) :- E(x, z), E(z, y)

  • Example 1: Execution

    P2 is the same as the self-join E ⨝ E on E.V2 = E.V1 (keeping the start and finish vertices)

    P2
    V1  V2
    a   d
    b   a
    b   c
    b   e
    c   a
    c   e
    d   c
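The single rule for P2 can be sketched as a self-join over a set of edge tuples. This is a minimal sketch; the function name `paths_of_length_two` is made up for illustration.

```python
# Edge relation E from the graph example, as a set of (V1, V2) tuples.
E = {("a", "c"), ("b", "a"), ("b", "d"), ("c", "d"), ("d", "a"), ("d", "e")}


def paths_of_length_two(edges):
    """P2(x, y) :- E(x, z), E(z, y): self-join on the middle vertex z."""
    return {(x, y) for (x, z) in edges for (z2, y) in edges if z == z2}


P2 = paths_of_length_two(E)
print(sorted(P2))
```

Because the result is a set, a pair reached through several middle vertices appears only once, which matches the set semantics of Datalog.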

  • Example 2

    E (edge relation) = {(a,c), (b,a), (b,d), (c,d), (d,a), (d,e)}

    Write a Datalog program to find all pairs of vertices (u, v) such that v is reachable from u

    • Can you write a SQL/RA/RC query for reachability?
    • NO: SQL (without recursive queries), RA, and RC cannot express reachability

  • Example 2

    E (edge relation) = {(a,c), (b,a), (b,d), (c,d), (d,a), (d,e)}

    Write a Datalog program to find all pairs of vertices (u, v) such that v is reachable from u

    Option 1 (linear):
    R(x, y) :- E(x, y)
    R(x, y) :- E(x, z), R(z, y)

    Option 2 (linear):
    R(x, y) :- E(x, y)
    R(x, y) :- R(x, z), E(z, y)

    Option 3 (non-linear):
    R(x, y) :- E(x, y)
    R(x, y) :- R(x, z), R(z, y)

  • Linear Datalog

    • Linear rule
      – at most one atom in the body that is recursive with the head of the rule
      – e.g. R(x, y) :- E(x, z), R(z, y)

    • Linear Datalog program
      – if all rules are linear
      – like linear recursion

    • Top-down and bottom-up evaluation are possible
      – we will focus on bottom-up

  • Example 2: Execution

    E = {(a,c), (b,a), (b,d), (c,d), (d,a), (d,e)}

    Option 1:
    R(x, y) :- E(x, y)
    R(x, y) :- E(x, z), R(z, y)

    • Iteration 1: R = E (vertices reachable in 1 hop by a direct edge)
      R = {(a,c), (b,a), (b,d), (c,d), (d,a), (d,e)}

    • Iteration 2 (vertices reachable within 2 hops), new tuples:
      (a,d), (b,c), (b,e), (c,a), (c,e), (d,c)

    • Iteration 3 (vertices reachable within 3 hops), new tuples:
      (a,e), (a,a), (c,c), (d,d)

    • Iteration 4: R unchanged, so stop

  • Examples 3 and 4

    E (edge relation) = {(a,c), (b,a), (b,d), (c,d), (d,a), (d,e)}

    Write a Datalog program to find all vertices reachable from b

    R(x, y) :- E(x, y)
    R(x, y) :- E(x, z), R(z, y)
    QB(y) :- R("b", y)

    Write a Datalog program to find all vertices u reachable from themselves, i.e. with R(u, u)

    R(x, y) :- E(x, y)
    R(x, y) :- E(x, z), R(z, y)
    Q(x) :- R(x, x)
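Both programs can be sketched by first computing the reachability relation R as a fixpoint and then applying the final rule as a filter. This is an illustrative sketch only; the helper `reachability` is a hypothetical name, and the loop is one simple way to reach the fixpoint, not a prescribed evaluation strategy.

```python
E = {("a", "c"), ("b", "a"), ("b", "d"), ("c", "d"), ("d", "a"), ("d", "e")}


def reachability(edges):
    """R(x, y) :- E(x, y).  R(x, y) :- E(x, z), R(z, y).  Iterate to a fixpoint."""
    r = set(edges)
    while True:
        new = {(x, y) for (x, z) in edges for (z2, y) in r if z == z2} - r
        if not new:
            return r
        r |= new


R = reachability(E)

# QB(y) :- R("b", y): vertices reachable from the constant vertex b.
QB = {y for (x, y) in R if x == "b"}

# Q(x) :- R(x, x): vertices reachable from themselves (they lie on a cycle).
Q = {x for (x, y) in R if x == y}

print(sorted(QB))  # ['a', 'c', 'd', 'e']
print(sorted(Q))   # ['a', 'c', 'd']
```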

  • Termination of a Datalog Program

    Q. A Datalog program always terminates. Why?

    • Because the values of the variables come from the "active domain" of the input relations (EDBs)

    • Active domain = the (finite) set of values from the (possibly infinite) domain that appear in the instance of a database
      – e.g. age can be any integer (infinite), but only finitely many ages appear in an instance of R(id, name, age)

    • Therefore the number of possible tuples in each of the IDBs is finite

    • e.g. in the reachability example R(x, y), the values of x and y come from {a, b, c, d, e}
      – at most 5 x 5 = 25 tuples possible in the IDB R(x, y)
      – in every iteration except the last, at least one new tuple is added to at least one IDB
      – Must stop after finitely many steps
      – e.g. in the reachability example, for any graph with five vertices, at most 25 iterations can add new tuples (it took only 4 iterations in our example)

  • Bottom-up Evaluation of a Datalog Program

    • Naïve evaluation

    • Semi-naïve evaluation

  • Naïve evaluation

    E = {(a,c), (b,a), (b,d), (c,d), (d,a), (d,e)}

    • Iteration 1: R = E = R1 (say)
      R1 = {(a,c), (b,a), (b,d), (c,d), (d,a), (d,e)}

    • In all subsequent iterations, check if any of the rules can be applied
      – Take the union of all the rules with the same head IDB

    • Iteration 2: R = E ∪ (E ⨝ R1) = R2 (say)
      R2 = R1 ∪ {(a,d), (b,c), (b,e), (c,a), (c,e), (d,c)}
      R1 ≠ R2, so continue

    • Iteration 3: R = E ∪ (E ⨝ R2) = R3 (say)
      R3 = R2 ∪ {(a,e), (a,a), (c,c), (d,d)}
      R2 ≠ R3, so continue

    • Iteration 4: R = E ∪ (E ⨝ R3) = R4 (say)
      R3 = R4, so STOP
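The naïve iteration above can be sketched as a loop that recomputes E ∪ (E ⨝ R) from scratch until R stops changing. This is a minimal sketch for the Option 1 rules only; `join_step` and `naive_eval` are hypothetical helper names, and the size list is recorded purely for illustration.

```python
E = {("a", "c"), ("b", "a"), ("b", "d"), ("c", "d"), ("d", "a"), ("d", "e")}


def join_step(edges, r):
    """Body of R(x, y) :- E(x, z), R(z, y): join E with R on z, keep (x, y)."""
    return {(x, y) for (x, z) in edges for (z2, y) in r if z == z2}


def naive_eval(edges):
    """Recompute R = E ∪ (E ⨝ R) in every iteration until nothing changes."""
    r = set()    # start from the empty IDB
    sizes = []   # |R| after each iteration
    while True:
        r_next = set(edges) | join_step(edges, r)
        sizes.append(len(r_next))
        if r_next == r:      # fixpoint reached
            return r, sizes
        r = r_next


R, sizes = naive_eval(E)
print(sizes)  # [6, 12, 16, 16]
```

The four entries correspond to R1 through R4; the last two are equal, which is the stopping condition. Every iteration re-derives all earlier tuples.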

  • Problem with Naïve Evaluation

    • The same IDB facts are discovered again and again
      – e.g. in each iteration all edges in E are included in R
      – In the 2nd-4th iterations, the first six tuples in R are computed repeatedly

    • Solution: Semi-Naïve Evaluation

    • Work only with the new tuples generated in the previous iteration

  • Semi-Naïve evaluation

    E = {(a,c), (b,a), (b,d), (c,d), (d,a), (d,e)}

    • Initially: R = ∅

    • Iteration 1: R = E = R1 (say)
      ΔR1 = R1 = {(a,c), (b,a), (b,d), (c,d), (d,a), (d,e)}

    • Iteration 2: R = R1 ∪ (E ⨝ ΔR1) = R2 (say)
      ΔR2 = R2 − R1 = {(a,d), (b,c), (b,e), (c,a), (c,e), (d,c)}
      ΔR2 ≠ ∅, so continue

    • Iteration 3: R = R2 ∪ (E ⨝ ΔR2) = R3 (say)
      ΔR3 = R3 − R2 = {(a,e), (a,a), (c,c), (d,d)}
      ΔR3 ≠ ∅, so continue

    • Iteration 4: R = R3 ∪ (E ⨝ ΔR3) = R4 (say)
      ΔR4 = R4 − R3 = ∅ (check!), so STOP
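The semi-naïve iteration can be sketched by joining E only with the tuples that were new in the previous round. This is a minimal sketch for the linear Option 1 rules; `semi_naive_eval` is a hypothetical name and the list of delta sizes is kept only to compare against the trace above.

```python
E = {("a", "c"), ("b", "a"), ("b", "d"), ("c", "d"), ("d", "a"), ("d", "e")}


def semi_naive_eval(edges):
    """Join E only with the delta (new tuples) of the previous iteration."""
    r = set(edges)       # iteration 1: R1 = E
    delta = set(edges)   # ΔR1 = R1
    delta_sizes = [len(delta)]
    while delta:
        # E ⨝ ΔR: apply the recursive rule to the new tuples only.
        derived = {(x, y) for (x, z) in edges for (z2, y) in delta if z == z2}
        delta = derived - r   # keep only tuples not seen before
        r |= delta
        delta_sizes.append(len(delta))
    return r, delta_sizes


R, delta_sizes = semi_naive_eval(E)
print(delta_sizes)  # [6, 6, 4, 0]
```

The delta sizes match ΔR1 through ΔR4, and the final R is the same 16-tuple relation that naïve evaluation produces, with less repeated work per iteration.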

  • Incremental View Maintenance (IVM)

    • Why did the semi-naïve algorithm work?
    • Because of the generic technique of Incremental View Maintenance (IVM)

    • Suppose you have
      – a database D = (R1, R2, R3)
      – a query Q that gives answer Q(D)
      – D = (R1, R2, R3) gets updated to D' = (R1', R2', R3')
      – e.g. R1' = R1 ∪ ΔR1 (insertion), R2' = R2 − ΔR2 (deletion), etc.

    • IVM: Can you compute Q(D') using Q(D) and ΔR1, ΔR2, ΔR3 without computing it from scratch (i.e. without rerunning the query Q)?

  • IVM Example: Selection

    R = {(a,c), (b,a), (d,a), (c,d)}
    ΔR = {(b,d), (d,e)}
    R' = R ∪ ΔR = {(a,c), (b,a), (d,a), (c,d), (b,d), (d,e)}

    σ_{V1=b}(R) = {(b,a)}
    σ_{V1=b}(ΔR) = {(b,d)}
    σ_{V1=b}(R') = {(b,a), (b,d)}

    • σ_{V1=b}(R ∪ ΔR) = σ_{V1=b}(R) ∪ σ_{V1=b}(ΔR)
    • It suffices to apply the selection condition only on ΔR
      – and include the result with the original solution

  • IVM Example: Projection

    R = {(a,c), (b,a), (d,a), (d,e)}
    ΔR = {(b,d), (c,d)}
    R' = R ∪ ΔR = {(a,c), (b,a), (d,a), (d,e), (b,d), (c,d)}

    π_{V1}(R) = {a, b, d}
    π_{V1}(ΔR) = {b, c}
    π_{V1}(R') = {a, b, d, c}

    • π_{V1}(R ∪ ΔR) = π_{V1}(R) ∪ π_{V1}(ΔR)
    • It suffices to apply the projection only on ΔR
      – and include the result with the original solution
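The selection and projection identities for insertions can be checked on the two slide examples with a few lines of set code. This is only a sketch under set semantics and insert-only updates; the function and variable names (`select_v1`, `project_v1`, `sel_R`, ...) are made up for illustration.

```python
def select_v1(rel, value):
    """σ_{V1=value}: keep tuples whose first column equals value."""
    return {t for t in rel if t[0] == value}


def project_v1(rel):
    """π_{V1}: keep the first column only (duplicates collapse in a set)."""
    return {t[0] for t in rel}


# Selection example: maintain σ_{V1=b}(R) under the insertion ΔR.
sel_R = {("a", "c"), ("b", "a"), ("d", "a"), ("c", "d")}
sel_dR = {("b", "d"), ("d", "e")}
sel_view = select_v1(sel_R, "b")                    # old answer
sel_view_new = sel_view | select_v1(sel_dR, "b")    # touch only ΔR

# Projection example: maintain π_{V1}(R) under the insertion ΔR.
proj_R = {("a", "c"), ("b", "a"), ("d", "a"), ("d", "e")}
proj_dR = {("b", "d"), ("c", "d")}
proj_view = project_v1(proj_R)                      # old answer
proj_view_new = proj_view | project_v1(proj_dR)     # touch only ΔR

# The incremental answers agree with recomputation from scratch.
print(sel_view_new == select_v1(sel_R | sel_dR, "b"))
print(proj_view_new == project_v1(proj_R | proj_dR))
```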

  • IVM Example: Join

    (R ∪ ΔR) ⨝ (S ∪ ΔS) = (R ⨝ S) ∪ (R ⨝ ΔS) ∪ (ΔR ⨝ S) ∪ (ΔR ⨝ ΔS)

    R(A, B) = {(a1,b1)},  ΔR = {(a2,b2), (a3,b1)},  R' = R ∪ ΔR
    S(B, C) = {(b1,c1)},  ΔS = {(b2,c2)},  S' = S ∪ ΔS

    R ⨝ S = {(a1,b1,c1)}
    R ⨝ ΔS = ∅
    ΔR ⨝ S = {(a3,b1,c1)}
    ΔR ⨝ ΔS = {(a2,b2,c2)}

    R' ⨝ S' = {(a1,b1,c1), (a3,b1,c1), (a2,b2,c2)}
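The four-term join identity can be checked on the slide's relations with a small natural-join helper. This is a sketch assuming set semantics and insert-only updates; `natural_join` and the variable names are hypothetical.

```python
def natural_join(r, s):
    """R(A, B) ⨝ S(B, C): match on the shared column B."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}


R = {("a1", "b1")}
dR = {("a2", "b2"), ("a3", "b1")}
S = {("b1", "c1")}
dS = {("b2", "c2")}

old_view = natural_join(R, S)   # answer before the update

# Incremental maintenance: add the three terms that involve a delta.
incremental = (old_view
               | natural_join(R, dS)
               | natural_join(dR, S)
               | natural_join(dR, dS))

# Recomputation from scratch on the updated relations.
full = natural_join(R | dR, S | dS)

print(incremental == full)  # True
```

When the deltas are small compared to R and S, the three delta joins are typically much cheaper than recomputing the full join.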

  • IVM for Linear Datalog Rule

    (R ∪ ΔR) ⨝ (S ∪ ΔS) = (R ⨝ S) ∪ (R ⨝ ΔS) ∪ (ΔR ⨝ S) ∪ (ΔR ⨝ ΔS)

    Example with S' = S (i.e. ΔS = ∅):
    R(A, B) = {(a1,b1)},  ΔR = {(a2,b2), (a3,b1)},  R' = R ∪ ΔR
    S(B, C) = {(b1,c1)}
    R ⨝ S = {(a1,b1,c1)}
    R' ⨝ S = {(a1,b1,c1), (a3,b1,c1)}

    • R(x, y) :- E(x, z), R(z, y)
      – i.e. Rnew = E ⨝ R

    • But E is an EDB
      – ΔE = ∅

    • Therefore, E ⨝ (R ∪ ΔR) = (E ⨝ R) ∪ (E ⨝ ΔR)
    • It suffices to join with the difference ΔR and include the result with E ⨝ R from the previous round

    • Advantage of having a "linear rule"

  • (Non-recursive) Datalog with Negation

    • Recursion and negation together make Datalog execution complicated
      – there is a notion called "stratified semantics" for this purpose
      – compute IDB relations in strata/layers before taking a negation
      – not covered in this class

    • We will only do negation for non-recursive Datalog

  • Unsafe/Safe Datalog Rules

    Likes(drinker, beer)   Frequents(drinker, bar)   Serves(bar, beer)

    Find drinkers who like beer "BestBeer"
    Q(x) :- Likes(x, "BestBeer")

    Find drinkers who DO NOT like beer "BestBeer"
    Q(x) :- ¬Likes(x, "BestBeer")

    • What is the problem with this rule?
    • What should this rule return?
      – names of all drinkers in the world?
      – names of all drinkers in the USA?
      – names of all drinkers in Durham?

  • Problem with Negation in Datalog Rules

    • What is the problem with the rule Q(x) :- ¬Likes(x, "BestBeer")?
    • It depends on the "domain" of drinkers
      – domain-dependent
      – infinite answers are possible too: keep generating "names"

    • Solution:
    • Restrict to the "active domain" of drinkers from the input Likes (or Frequents) relation
      – "domain-independence": same finite answer always

    • Becomes a "safe rule":

    Q(x) :- Likes(x, y), ¬Likes(x, "BestBeer")
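The safe rule can be sketched over a small set of facts: the positive atom Likes(x, y) supplies the candidate drinkers from the active domain, and the negated atom only filters them. The sample facts below are hypothetical and only serve to illustrate the evaluation.

```python
# Hypothetical Likes(drinker, beer) facts (not from the slides).
likes = {
    ("Ann", "BestBeer"),
    ("Ann", "Ale"),
    ("Bob", "Ale"),
    ("Tom", "Stout"),
}

# Q(x) :- Likes(x, y), ¬Likes(x, "BestBeer")
# x ranges only over drinkers that occur in Likes, so the answer is finite
# and does not depend on any outside notion of "all drinkers".
q = {x for (x, y) in likes if (x, "BestBeer") not in likes}

print(sorted(q))  # ['Bob', 'Tom']
```

The unsafe version has no positive atom to range over, so there is no finite set of candidates to filter; that is why it cannot be evaluated this way.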

  • RC → Datalog with negation → SQL

    Revisit example from Lecture 4 (Ack: slides by Profs. Balazinska and Suciu)

    Likes(drinker, beer)   Frequents(drinker, bar)   Serves(bar, beer)

    Query: Find drinkers that like some beer (so much) that they frequent all bars that serve it

    Q(x) = ∃y. Likes(x, y) ∧ ∀z. (Serves(z, y) ⇒ Frequents(x, z))

    Step 1: Replace ∀ with ∃ using de Morgan's Laws

      – P ⇒ Q is the same as ¬P ∨ Q
      – ∀x P(x) is the same as ¬∃x ¬P(x)
      – ¬(¬P ∨ Q) is the same as P ∧ ¬Q

    Q(x) = ∃y. Likes(x, y) ∧ ∀z. (Serves(z, y) ⇒ Frequents(x, z))
         ≡ ∃y. Likes(x, y) ∧ ∀z. (¬Serves(z, y) ∨ Frequents(x, z))
         ≡ ∃y. Likes(x, y) ∧ ¬∃z. (Serves(z, y) ∧ ¬Frequents(x, z))

    (new) Step 2: Make all subqueries domain independent

    Q(x) = ∃y. Likes(x, y) ∧ ¬∃z. (Likes(x, y) ∧ Serves(z, y) ∧ ¬Frequents(x, z))

  • RC → Datalog with negation → SQL (Step 3)

    (new) Step 3: Create a Datalog rule for some subexpressions of the form ∃x ∃y ... R(...) ∧ S(...) ∧ T(...) ∧ ...

    Q(x) = ∃y. Likes(x, y) ∧ ¬∃z. (Likes(x, y) ∧ Serves(z, y) ∧ ¬Frequents(x, z))

    The subexpression ∃z. (Likes(x, y) ∧ Serves(z, y) ∧ ¬Frequents(x, z)) becomes H(x, y):

    H(x, y) :- Likes(x, y), Serves(z, y), not Frequents(x, z)
    Q(x) :- Likes(x, y), not H(x, y)

  • RC → Datalog with negation → SQL (Step 4)

    Step 4: Write it in SQL

    H(x, y) :- Likes(x, y), Serves(z, y), not Frequents(x, z)
    Q(x) :- Likes(x, y), not H(x, y)

    SELECT DISTINCT L.drinker
    FROM Likes L
    WHERE not exists
      (SELECT *
       FROM Likes L2, Serves S
       WHERE L2.drinker = L.drinker and L2.beer = L.beer
         and L2.beer = S.beer
         and not exists (SELECT *
                         FROM Frequents F
                         WHERE F.drinker = L2.drinker
                           and F.bar = S.bar))

  • RC → Datalog with negation → SQL (simplification)

    Sometimes one can simplify the SQL query by using an unsafe Datalog rule. Correctness is ensured by the safe outermost rule.

    H(x, y) :- Serves(z, y), not Frequents(x, z)      (unsafe rule: Likes(x, y) dropped from the body)
    Q(x) :- Likes(x, y), not H(x, y)

    SELECT DISTINCT L.drinker
    FROM Likes L
    WHERE not exists
      (SELECT *
       FROM Serves S
       WHERE L.beer = S.beer
         and not exists (SELECT *
                         FROM Frequents F
                         WHERE F.drinker = L.drinker
                           and F.bar = S.bar))
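As a sanity check, the simplified SQL query and the two Datalog rules can be run side by side on a tiny database using Python's built-in `sqlite3` module. This is a sketch only: the sample facts are hypothetical, and the Python set expressions are one straightforward reading of the safe rules H and Q.

```python
import sqlite3

# Hypothetical sample facts (not from the slides).
likes = {("Ann", "Ale"), ("Bob", "Ale"), ("Bob", "Stout"),
         ("Tom", "Lager"), ("Joe", "Ale")}
serves = {("BarA", "Ale"), ("BarB", "Ale"), ("BarB", "Stout")}
frequents = {("Ann", "BarA"), ("Ann", "BarB"), ("Bob", "BarB"),
             ("Tom", "BarA"), ("Joe", "BarA")}

# Datalog reading:
# H(x, y) :- Likes(x, y), Serves(z, y), not Frequents(x, z)
h = {(x, y) for (x, y) in likes
     for (z, y2) in serves
     if y == y2 and (x, z) not in frequents}
# Q(x) :- Likes(x, y), not H(x, y)
datalog_answer = {x for (x, y) in likes if (x, y) not in h}

# SQL reading, on an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Likes(drinker TEXT, beer TEXT);
    CREATE TABLE Serves(bar TEXT, beer TEXT);
    CREATE TABLE Frequents(drinker TEXT, bar TEXT);
""")
conn.executemany("INSERT INTO Likes VALUES (?, ?)", sorted(likes))
conn.executemany("INSERT INTO Serves VALUES (?, ?)", sorted(serves))
conn.executemany("INSERT INTO Frequents VALUES (?, ?)", sorted(frequents))

rows = conn.execute("""
    SELECT DISTINCT L.drinker
    FROM Likes L
    WHERE NOT EXISTS
      (SELECT * FROM Serves S
       WHERE L.beer = S.beer
         AND NOT EXISTS (SELECT * FROM Frequents F
                         WHERE F.drinker = L.drinker
                           AND F.bar = S.bar))
""").fetchall()
sql_answer = {row[0] for row in rows}
conn.close()

print(sorted(datalog_answer))  # ['Ann', 'Bob', 'Tom']
print(sorted(sql_answer))      # ['Ann', 'Bob', 'Tom']
```

In this made-up instance, Tom qualifies vacuously because no bar serves the beer he likes, and Bob qualifies through Stout even though he misses one bar that serves Ale.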

  • An Overview of Data Provenance with Annotations

    Selected/adapted slides from the keynote by Prof. Val Tannen, EDBT 2010

    (optional material: full slide deck is available on Val's webpage)

    Optional/additional slides

  • Lineage

    • [Cui-Widom-Wiener '00]
    • Lineage:
      – Given a data item in the view
      – Determine the source data that produced it
      – The process by which it was produced

  • Applications

    • OLAP/OLAM (mining)
      – origin of anomalous data to verify reliability

    • Scientific Databases
      – how an answer was produced from raw data

    • Online Network Monitoring and Diagnosis system
      – identify faulty sensor from network monitors

    • Cleansed data feedback
      – Clean raw data and send report to sources

    • Materialized view schema evolution
      – if view schema is changed (new column added), recomputation may not be necessary

    • View update
      – translate view updates to base data updates


  • Provenance Example

    HasAsthma          Friend               Smoker
    Ann    x1          Ann  Joe    y1       Joe    z1
    Bob    x2          Ann  Tom    y2       Tom    z2
                       Bob  Tom    y3

    Boolean query Q() :- HasAsthma(x), Friend(x, y), Smoker(y)

    • Each tuple is annotated with a variable: x1, x2, y1, y2, y3, z1, z2 ∈ {0, 1}
    • 1 = present
    • 0 = absent
    • There are three alternative ways to derive the "true" answer

    Provenance F_{Q,D} = x1 y1 z1 + x1 y2 z2 + x2 y3 z2

  • Application to "Deletion Propagation"

    HasAsthma          Friend               Smoker
    Ann    x1          Ann  Joe    y1       Joe    z1
    Bob    x2          Ann  Tom    y2       Tom    z2
                       Bob  Tom    y3

    Boolean query Q: ∃x ∃y HasAsthma(x) ∧ Friend(x, y) ∧ Smoker(y)

    Provenance F_{Q,D} = x1 y1 z1 + x1 y2 z2 + x2 y3 z2   [Green et al. '07]

    • x1, x2, y1, y2, y3, z1, z2 ∈ {0, 1}

    • What happens if Ann is deleted?
      – Does the answer change from true to false?

    • No need to re-evaluate the query
      – just plug in x1 = 0 and evaluate: F_{Q,D} = 0 + 0 + x2 y3 z2 = x2 y3 z2
      – the answer stays true, because the derivation through Bob and Tom remains
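Deletion propagation with the provenance expression can be sketched as evaluating a Boolean formula under an assignment of 0/1 to the tuple annotations. This is an illustrative sketch; the function `provenance` and the dictionary layout are hypothetical, and only the formula itself comes from the slides.

```python
def provenance(ann):
    """F = x1*y1*z1 + x1*y2*z2 + x2*y3*z2, read over Booleans (AND / OR)."""
    return bool(
        (ann["x1"] and ann["y1"] and ann["z1"])
        or (ann["x1"] and ann["y2"] and ann["z2"])
        or (ann["x2"] and ann["y3"] and ann["z2"])
    )


# All tuples present: every annotation is 1.
all_present = {name: 1 for name in ("x1", "x2", "y1", "y2", "y3", "z1", "z2")}

# Delete HasAsthma(Ann): set x1 = 0, keep everything else.
ann_deleted = dict(all_present, x1=0)

# Delete both HasAsthma tuples: no derivation survives.
both_deleted = dict(all_present, x1=0, x2=0)

print(provenance(all_present))   # True
print(provenance(ann_deleted))   # True: the Bob-Tom derivation remains
print(provenance(both_deleted))  # False
```

The query is never re-run against the database; only the stored expression is re-evaluated under the new assignment.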