Lecture-23-DataCube-Datamining · Lecture 23 Data Cube and Data Mining Instructor: SudeepaRoy Duke...

10
11/27/18 1 CompSci 516 Database Systems Lecture 23 Data Cube and Data Mining Instructor: Sudeepa Roy Duke CS, Fall 2018 CompSci 516: Database Systems 1 Announcements HW3 due on Nov 30 (Fri), 5 pm Final report due on Dec 12, Wed Class presentation next lecture (Thurs, Nov 29) In the order of your group no. 4 min presentation See details from the announcements in the last lecture Please fill out the course evaluations! We need to hear from each of you We need your feedback/suggestions to keep improving this class Duke CS, Fall 2018 CompSci 516: Database Systems 2 Data Warehousing Duke CS, Fall 2018 CompSci 516: Database Systems 3 Reading Material [RG] Chapter 25 Gray-Chaudhuri-Bosworth-Layman-Reichart-Venkatrao-Pellow-Pirahesh, ICDE 1996 “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals” Harinarayan-Rajaraman-Ullman, SIGMOD 1996 “Implementing data cubes efficiently” Duke CS, Fall 2018 CompSci 516: Database Systems 4 Acknowledgement: The following slides have been created adapting the instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke. Some slides have been prepared by Prof. Shivnath Babu 5 Warehousing Growing industry: $8 billion way back in 1998 Data warehouse vendor like Teradata big “Petabyte scale”customers Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data 40PB, single table with 1 trillion rows), Verizon, AT&T, Bank of America supports data into and out of Hadoop Lots of buzzwords, hype slice & dice, rollup, MOLAP, pivot, ... Ack: Slide by Prof. Shivnath Babu https://gigaom.com/2013/03/27/why-apple-ebay-and-walmart- have-some-of-the-biggest-data-warehouses-youve-ever-seen/ Duke CS, Fall 2018 CompSci 516: Database Systems Introduction Organizations analyze current and historical data to identify useful patterns to support business strategies Emphasis is on complex, interactive, exploratory analysis of very large datasets Created by integrating data from across all parts of an enterprise Data is fairly static Relevant once again for the recent “Big Data analysisto figure out what we can reuse, what we cannot Duke CS, Fall 2018 CompSci 516: Database Systems 6

Transcript of Lecture-23-DataCube-Datamining · Lecture 23 Data Cube and Data Mining Instructor: SudeepaRoy Duke...

Page 1: Lecture-23-DataCube-Datamining · Lecture 23 Data Cube and Data Mining Instructor: SudeepaRoy Duke CS, Fall 2018 CompSci 516: Database Systems 1 Announcements • HW3 due on Nov 30

11/27/18

1

CompSci 516DatabaseSystems

Lecture23DataCube

andDataMining

Instructor:Sudeepa Roy

DukeCS,Fall2018 CompSci516:DatabaseSystems 1

Announcements

• HW3dueonNov30(Fri),5pm

• FinalreportdueonDec12,Wed

• Classpresentationnextlecture(Thurs,Nov29)– Intheorderofyourgroupno.– 4minpresentation– Seedetailsfromtheannouncementsinthelastlecture

• Pleasefilloutthecourseevaluations!– Weneedtohearfromeachofyou– Weneedyourfeedback/suggestionstokeepimprovingthisclass

DukeCS,Fall2018 CompSci516:DatabaseSystems 2

DataWarehousing

DukeCS,Fall2018 CompSci516:DatabaseSystems 3

ReadingMaterial

• [RG]– Chapter25

• Gray-Chaudhuri-Bosworth-Layman-Reichart-Venkatrao-Pellow-Pirahesh,ICDE1996“DataCube:ARelationalAggregationOperatorGeneralizingGroup-By,Cross-Tab,andSub-Totals”

• Harinarayan-Rajaraman-Ullman,SIGMOD1996“Implementingdatacubesefficiently”

DukeCS,Fall2018 CompSci516:DatabaseSystems 4

Acknowledgement:• Thefollowingslideshavebeencreatedadaptingtheinstructormaterialofthe[RG]bookprovidedbytheauthorsDr.RamakrishnanandDr.Gehrke.• SomeslideshavebeenpreparedbyProf.Shivnath Babu

5

Warehousing

• Growingindustry:$8billionwaybackin1998• DatawarehousevendorlikeTeradata

• big“Petabytescale”customers• Apple,Walmart(2008-2.5PB),eBay(2013-primaryDW9.2PB,otherbigdata40PB,singletablewith1trillionrows),Verizon,AT&T,BankofAmerica

• supportsdataintoandoutofHadoop• Lotsofbuzzwords,hype– slice&dice,rollup,MOLAP,pivot,...

Ack:SlidebyProf.Shivnath Babu

https://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen/

DukeCS,Fall2018 CompSci516:DatabaseSystems

Introduction• Organizationsanalyzecurrentandhistoricaldata

– toidentifyusefulpatterns– tosupportbusinessstrategies

• Emphasisisoncomplex,interactive,exploratoryanalysisofverylargedatasets

• Createdbyintegratingdatafromacrossallpartsofanenterprise

• Dataisfairlystatic

• Relevantonceagainfortherecent“BigDataanalysis”– tofigureoutwhatwecanreuse,whatwecannot

DukeCS,Fall2018 CompSci516:DatabaseSystems 6

Page 2: Lecture-23-DataCube-Datamining · Lecture 23 Data Cube and Data Mining Instructor: SudeepaRoy Duke CS, Fall 2018 CompSci 516: Database Systems 1 Announcements • HW3 due on Nov 30

11/27/18

2

OLTP DataWarehousing/OLAPMostlyupdates Mostlyreads

Applications:Orderentry,salesupdate,bankingtransactions

Applications:Decisionsupportinindustry/organization

Detailed,up-to-datedata Summarized, historicaldata(frommultipleoperationaldb,growsovertime)

Structured,repetitive,shorttasks Queryintensive,adhoc,complexqueries

Eachtransactionreads/updatesonlyafewtuples(tensof)

Eachquerycanaccesses manyrecords,andperformmanyjoins,scans,aggregates

MB-GBdata GB-TB data

Typicallyclericalusers Decisionmakers,analysts asusers

Important:Consistency,recoverability,Maximizingtr.throughput

Important:QuerythroughputResponse times

7DukeCS,Fall2018 CompSci516:DatabaseSystems

ThreeComplementaryTrends

• DataWarehousing(DW):– Consolidatedatafrommanysourcesinonelargerepository– Loading,periodicsynchronizationofreplicas– Semanticintegration

• OLAP:– ComplexSQLqueriesandviews.– Queriesbasedonspreadsheet-styleoperationsand“multidimensional”viewofdata.

– Interactiveand“online”queries.

• DataMining:– Exploratorysearchforinterestingtrendsandanomalies

DukeCS,Fall2018 CompSci516:DatabaseSystems 8

ROLAPandMOLAP

• RelationalOLAP(ROLAP)– OntopofstandardrelationalDBMS– DataisstoredinrelationalDBMS– SupportsextensionstoSQLtoaccessmulti-dimensionaldata

• MultidimensionalOLAP(MOLAP)– Directlystoresmultidimensionaldatainspecialdatastructures(e.g.arrays)

9DukeCS,Fall2018 CompSci516:DatabaseSystems

ROLAP:StarSchema

• Toreflectmulti-dimensionalviewsofdata

• Singlefacttable

• Singletableforeverydimension

• Eachtupleinthefacttableconsistsof– pointers(foreignkey)toeach

ofthedimensions(multi-dimensionalcoordinates)

– numericvalueforthosecoordinates

• Eachdimensiontablecontainsattributesofthatdimension 10

NosupportforattributehierarchiesDukeCS,Fall2018 CompSci516:DatabaseSystems

DimensionHierarchies

• Foreachdimension,thesetofvaluescanbeorganizedinahierarchy:

PRODUCT TIME LOCATION

categoryweekmonthstate

pnamedatecity

year

quartercountry

DukeCS,Fall2018 CompSci516:DatabaseSystems 11

ROLAP:SnowflakeSchema

12

• Refinesstar-schema• Dimensionalhierarchyis

explicitlyrepresented

• (+)Dimensiontableseasiertomaintain– supposethe“category

descriptionisbeingchanged

• (-)Needadditionaljoins

• FactConstellations– Multiplefacttablessharesome

dimensionaltables– e.g.ProjectedandActual

Expensesmaysharemanydimensions

DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 3: Lecture-23-DataCube-Datamining · Lecture 23 Data Cube and Data Mining Instructor: SudeepaRoy Duke CS, Fall 2018 CompSci 516: Database Systems 1 Announcements • HW3 due on Nov 30

11/27/18

3

OLAPand

DataCube

DukeCS,Fall2018 CompSci516:DatabaseSystems 13

Motivation:OLAPQueries• Dataanalystsareinterestedinexploringtrendsandanomalies– Possiblybyvisualization(Excel)- 2Dor3Dplots– “DimensionalityReduction”bysummarizingdataandcomputing

aggregates

• Findtotalunitsalesforeach1. Model2. Model,brokenintoyears3. Year,brokenintocolors4. Year5. Model,brokenintocolor,….

14

Sales(Model,Year,Color,Units)

DukeCS,Fall2018 CompSci516:DatabaseSystems

Roll-Ups• Analysisreportsstartatacoarselevel,

gotofinerlevels• Orderofattributematters• Notrelationaldata(emptycellsno

keys)

Model Year Color Model,Year,Color Model,Year Model

Chevy 1994 Black 50

Chevy 1994 White 40

90

Chevy 1995 Black 115

Chevy 1995 White 85

200

290

GROUPBY

Sales(Model,Year,Color,Units)

Roll-ups

Drill-downs

• Roll-upsareasymmetric

‘ALL’ ConstructEasiertovisualizeroll-upifallowALLtofillinthesuper-aggregates

SELECT Model, Year, Color, SUM(Units)

FROM Sales

WHERE Model = ‘Chevy’

GROUP BY Model, Year, Color

UNION

SELECT Model, Year, ‘ALL’, SUM(Units)

FROM Sales

WHERE Model = ‘Chevy’

GROUP BY Model, Year

UNION

UNION

SELECT ‘ALL’, ‘ALL’, ‘ALL’, SUM(Units)

FROM Sales

WHERE Model = ‘Chevy’;

Model Year Color Units

Chevy 1994 Black 50

Chevy 1994 White 40

Chevy 1994 ‘ALL’ 90

Chevy 1995 Black 85

Chevy 1995 White 115

Chevy 1995 ‘ALL’ 200

Chevy ‘ALL’ ‘ALL’ 290

Sales(Model,Year,Color,Units)

DataCube:Intuition

SELECT ‘ALL’, ‘ALL’, ‘ALL’, sum(units)

FROM Sales

UNIONSELECT ‘ALL’, ‘ALL’, Color, sum(units)

FROM Sales

GROUP BY Color

UNION

SELECT ‘ALL’, Year, ‘ALL’, sum(units)

FROM Sales

GROUP BY Year

UNION SELECT Model, Year, ‘ALL’, sum(units)

FROM Sales

GROUP BY Model, Year

UNION ….

17

Sales(Model,Year,Color,Units)

YearColor

Mod

el

TotalUnitsales

DukeCS,Fall2018 CompSci516:DatabaseSystems

• Howmanysub-queries?• Howmanysub-queriesfor

8attributes?

MorecomplextodothesewithGROUP-BY

DataCube• Computestheaggregateonallpossiblecombinationsof

groupbycolumns.

• IfthereareNattributes,thereare2N-1super-aggregates.

• IfthecardinalityoftheNattributesareC1,...,CN,thenthereareatotalof(C1+1)...(CN+1)valuesinthecube.

• ROLL-UPissimilarbutjustlooksatNaggregates

Page 4: Lecture-23-DataCube-Datamining · Lecture 23 Data Cube and Data Mining Instructor: SudeepaRoy Duke CS, Fall 2018 CompSci 516: Database Systems 1 Announcements • HW3 due on Nov 30

11/27/18

4

DataCube

Ack:fromslidesbyLaurelOrrandJeremyHyrkas,UW

DataCube Syntax

• SQLServer

SELECT Model, Year, Color, sum(units)

FROM Sales

GROUP BY Model, Year, Color

WITH CUBE

Sales(Model,Year,Color,Units)

ImplementingDataCube

DukeCS,Fall2018 CompSci516:DatabaseSystems 21

BasicIdeas

• Needtocomputeallgroup-by-s:– ABCD,ABC,ABD,BCD,AB,AC,AD,BC,BD,CD,A,B,C,D

• ComputeGROUP-BYsfrompreviouslycomputedGROUP-BYs– e.g.firstABCD– thenABCorACD– thenABorAC…

• WhichorderABCDissorted,mattersforsubsequentcomputations

– if(ABCD)isthesortedorder,ABCischeap,ACDorBCDisexpensive

Notations

• ABCD– group-byonattributesA,B,C,D– noguaranteeontheorderoftuples

• (ABCD)– sortedaccordingtoA->B->C->D

• ABCDand(ABCD)and(BCDA)– allcontainthesameresults– butindifferentsortedorder

Optimization1:SmallestParent

• ComputeGROUP-BYfromthesmallest(size)previouslycomputedGROUP-BYasaparent

– ABcanbecomputedfromABC,ABD,orABCD

– ABCorABDbetterthanABCD– EvenABCorABDmayhave

differentsizes,trytochoosethesmallerparent

24

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all

DukeCS,Fall2018 CompSci516:DatabaseSystems

LATTICESTRUCTUREofdatacube

Page 5: Lecture-23-DataCube-Datamining · Lecture 23 Data Cube and Data Mining Instructor: SudeepaRoy Duke CS, Fall 2018 CompSci 516: Database Systems 1 Announcements • HW3 due on Nov 30

11/27/18

5

Optimization2:CacheResults

• CacheresultofoneGROUP-BYinmemorytoreducediskI/O– ComputeABfromABCwhile

ABCisstillinmemory

25

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all

DukeCS,Fall2018 CompSci516:DatabaseSystems

Optimization3:AmortizeDiskScans

• AmortizediskreadsformultipleGROUP-BYs– SupposetheresultforABCD

isstoredondisk– ComputeallofABC,ABD,

ACD,BCDsimultaneouslyinonescanofABCD

26

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all

DukeCS,Fall2018 CompSci516:DatabaseSystems

Optimization4:Sharedpartition

• forhash-basedalgorithms– Useshashtablesto

computesmallerGROUP-Bys

– IfthehashtablesforABandACfitinmemory,computebothinonescanofABC

– OtherwisepartitiononA,andcomputeHTsofABandACindifferentpartitions

• “pipe-hash”algorithmnotcoveredinclass

27

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all

DukeCS,Fall2018 CompSci516:DatabaseSystems

(ABCD)

(ABC) ABD ACD BCD

AC AD BC BD CD(AB)

(A) B C D

all Level0

Level1

Level2

Level3

Level4

A B C sum

a1 b1 c1 5

a1 b1 c2 10

a1 b2 c3 8

a2 b2 c1 2

a2 b2 c3 11

NotSortedA B C sum

a2 b2 c3 11

a1 b1 c2 10

a2 b2 c1 2

a1 b1 c1 5

a1 b2 c3 8

Sorted

A B sum

a1 b1 15

a1 b2 8

a2 b2 13

Optimization5:Shared-sort

• Fromonesortedorderof(ABCD)

– Compute(ABC)– Compute(AB)– Compute(A)– Compute“all”

PipeSort:Shared-sortoptimization

• BUT,mayhaveaconflictwith“smallest-parent”optimization– (ABD)->(AB)couldbeabetterchoice– Figureoutthebestparentchoicebyrunning

aweighted-matchingalgorithmlayerbylayer(detailsnotcoveredinclass) (ABCD)

(ABC)

(AB)

(A)

Aftercomputingthe plan,executeallpipelines

PipeSort Algorithminpicture

1.Firstpipelineisexecutedbyonescanofthedata

2.Sort(CBAD)->(BADC),computethesecondpipeline

3.…..

A,S costs

ACBD

ACB ABD ACD BDC

AC AD BC BD CD

A B C D

all

AB

Page 6: Lecture-23-DataCube-Datamining · Lecture 23 Data Cube and Data Mining Instructor: SudeepaRoy Duke CS, Fall 2018 CompSci 516: Database Systems 1 Announcements • HW3 due on Nov 30

11/27/18

6

DataMining

ReadingMaterialOptionalReading:

1.[RG]:Chapter26

2.“FastAlgorithmsforMiningAssociationRules”Agrawal andSrikant,VLDB1994

23,863citationsonGoogleScholarinNovember2018• 23,038inNovember2017• 20,610inNovember2016• 19,496inApril2016

OneofthemostcitedpapersinCS!

32

• Acknowledgement:Thefollowingslideshavebeenpreparedadaptingtheslidesprovidedbytheauthorsof[RG]andusingseveralpresentationsofthispaperavailableontheinternet(esp.byOfer PasternakandBrianChase)

DukeCS,Fall2018 CompSci516:DatabaseSystems

FourMainStepsinKDandDM(KDD)

• DataSelection– Identifytargetsubsetofdataandattributesofinterest

• DataCleaning– Removenoiseandoutliers,unifyunits,createnewfields,usedenormalization ifneeded

• DataMining– extractinterestingpatterns

• Evaluation– presentthepatternstotheendusersinasuitableform,e.g.throughvisualization

DukeCS,Fall2018 CompSci516:DatabaseSystems 33

RememberHW1!

SeveralDM/KD(Research)Problems

• Discoveryofcausalrules• Learningoflogicaldefinitions• Fittingoffunctionstodata• Clustering• Classification• Inferringfunctionaldependenciesfromdata• Finding“usefulness”or“interestingness”ofarule

– SeethecitationsintheAgarwal-Srikant paper– Somediscussedin[RG]Chapter27

DukeCS,Fall2018 CompSci516:DatabaseSystems 34

• Retailerscollectandstoremassiveamountsofsalesdata– transactiondateandlistofitems

• Associationrules:– e.g.98%customerswhopurchase“tires”and“autoaccessories”alsoget“automotiveservices”done

– Customerswhobuymustardandketchupalsobuyburgers

– Goal:findtheserulesfromjusttransactionaldata(transactionid+listofitems)

MiningAssociationRules

35DukeCS,Fall2018 CompSci516:DatabaseSystems

• Canbeusedfor– marketingprogramandstrategies– cross-marketing(masse-mail,webpages)– catalogdesign– add-onsales– storelayout– customersegmentation

Applications

36DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 7: Lecture-23-DataCube-Datamining · Lecture 23 Data Cube and Data Mining Instructor: SudeepaRoy Duke CS, Fall 2018 CompSci 516: Database Systems 1 Announcements • HW3 due on Nov 30

11/27/18

7

Notations

• ItemsI={i1,i2,…,im}• D:asetoftransactions• EachtransactionT⊆ I– hasanidentifierTID

• AssociationRule– Xà Y(notFunctionalDependency!)– X,Y⊂ I– X∩Y=∅

37DukeCS,Fall2018 CompSci516:DatabaseSystems

ConfidenceandSupport

• AssociationruleXàY

• Confidence c=|Tr.withXandY|/|Tr.with|X|– c%oftransactionsinDthatcontainXalsocontainY

• Support s=|Tr.withXandY|/|allTr.|– s%oftransactionsinDcontainXandY.

38DukeCS,Fall2018 CompSci516:DatabaseSystems

SupportExample

TID Cereal Beer Bread Bananas Milk

1 X X X

2 X X X X

3 X X

4 X X

5 X X

6 X X

7 X X

8 X

• Support(Cereal)• 4/8=.5

• Support(CerealàMilk)• 3/8=.375

39DukeCS,Fall2018 CompSci516:DatabaseSystems

ConfidenceExample

TID Cereal Beer Bread Bananas Milk

1 X X X

2 X X X X

3 X X

4 X X

5 X X

6 X X

7 X X

8 X

• Confidence(CerealàMilk)• 3/4=.75

• Confidence(Bananasà Bread)• 1/3=.33333…

40DukeCS,Fall2018 CompSci516:DatabaseSystems

Xà YisnotaFunctionalDependencyForfunctionaldependencies• F.D.=twotupleswiththesamevalueofofXmusthavethe

samevalueofY– Xà Y =>XZà Y(concatenation)– Xà Y,Yà Z=>Xà Z(transitivity)

Forassociationrules• Xà AdoesnotmeanXYàA

– Maynothavetheminimumsupport– Assumeonetransaction{AX}

• Xà AandAà ZdonotmeanXà Z– Maynothave theminimumconfidence– Assumetwotransactions{XA},{AZ}

41DukeCS,Fall2018 CompSci516:DatabaseSystems

Checkyourself

ProblemDefinition

• Input– asetoftransactionsD

• Canbeinanyform– afile,relationaltable,etc.

– minsupport(minsup)– minconfidence(minconf)

• Goal:generateallassociationrulesthathave– support>=minsup and– confidence>=minconf

42DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 8: Lecture-23-DataCube-Datamining · Lecture 23 Data Cube and Data Mining Instructor: SudeepaRoy Duke CS, Fall 2018 CompSci 516: Database Systems 1 Announcements • HW3 due on Nov 30

11/27/18

8

Decompositionintotwosubproblems

• 1.Apriori– forfinding“large”itemsets withsupport>=minsup– allotheritemsets are“small”

• 2.ThenuseanotheralgorithmtofindrulesXà Ysuchthat– Bothitemsets X∪ YandXarelarge– Xà Yhasconfidence>=minconf

• Paperfocusesonsubproblem 1– ifsupportislow,confidencemaynotsaymuch– subproblem 2infullversionofthepaper

DukeCS,Fall2018 CompSci516:DatabaseSystems 43

BasicIdeas- 1

• Q.Whichitemset canpossiblyhavelargersupport:ABCDorAB– i.e.whenoneisasubsetoftheother?

• Ans:AB– anysubsetofalargeitemset mustbelarge– SoifABissmall,noneedtoinvestigateABC,ABCDetc.

DukeCS,Fall2018 CompSci516:DatabaseSystems 44

BasicIdeas- 2

• Startwithindividual(singleton)items{A},{B},…

• Insubsequentpasses,extendthe“largeitemsets”ofthepreviouspassas“seed”

• Generatenewpotentiallylargeitemsets (candidateitemsets)

• Thencounttheiractualsupportfromthedata

• Attheendofthepass,determinewhichofthecandidateitemsets areactuallylarge– becomesseedforthenextpass

• Continueuntilnonewlargeitemsets arefound

DukeCS,Fall2018 CompSci516:DatabaseSystems 45

Notations

ACTUAL

POTENTIALUsedinbothApriori andAprioriTID

46

• Assumethedatabaseisoftheform<TID,i1,i2,…>whereitemsarestoredinlexicographicorder

• TID = identifierofthetransaction• Alsoworkswhenthedatabaseis“normalized”:eachdatabaserecordis<TID,item>pair

DukeCS,Fall2018 CompSci516:DatabaseSystems

AlgorithmApriori

47

!k

k

kk

t

kt

k-k

k-1

;L Answer

minsup}|c.countC { c L

c.count Cc

,t)(C C Dt

)(L C); k 2; L( k

itemsets} {large 1- L

=

³Î=

++Î

=++¹=

=

end

endend

;do candidates forall

subset begin do ons transactiforall

;genapriori-begin do For

1

1

fCountindividual itemoccurrences

Generatenewk-itemsets candidates

count=0

Findthesupportofallthecandidates

Ct = candidatescontainedint

incrementcount

Takeonlythosewithsupport>= minsup

DukeCS,Fall2018 CompSci516:DatabaseSystems

Apriori-Gen

4848

l Joinstep

l Prunestep

1k1k2k2k11

1k1k

1k1k1

k

q.itemp.item,q.itemp.item,...,q.itemp.itemqp,LL

itemqitempitempp.itemC

----

--

--

<== where from

.,.,.,select intoinsert

2

k

k-1

k

c from C) L(s

ets s of c(k-1)-subs C itemsets c

deletethen if

do foralldo forall

Ï

Î

p andqaretwo large

(k-1)-itemsets identicalinallk-2firstitems.

Joinbyaddingthelastitemofqtop

Checkallthesubsets,removeallcandidatewithsome“small” subset

• TakesasargumentLk-1 (thesetofalllargek-1)-itemsets• Returnsasupersetofthesetofalllargek-itemsets byaugmentingLk-1

Lk-1⨝ Lk-1

DukeCS,Fall2018 CompSci516:DatabaseSystems

Page 9: Lecture-23-DataCube-Datamining · Lecture 23 Data Cube and Data Mining Instructor: SudeepaRoy Duke CS, Fall 2018 CompSci 516: Database Systems 1 Announcements • HW3 due on Nov 30

11/27/18

9

Apriori-GenExample- 1

• {1,2,3}• {1,2,4}{1,3,4}

• {1,3,5}• {2,3,4}

• {1,2,3,4}

Step1:Join(k=4)

Assumenumbers1-5correspondtoindividualitems

L3 C4

49DukeCS,Fall2018 CompSci516:DatabaseSystems

Apriori-GenExample- 2

• {1,2,3,4}• {1,3,4,5}

Step1:Join(k=4)

Assumenumbers1-5correspondtoindividualitems

L3 C4• {1,2,3}• {1,2,4}{1,3,4}

• {1,3,5}• {2,3,4}

50DukeCS,Fall2018 CompSci516:DatabaseSystems

Apriori-GenExample- 3

• {1,2,3,4}• {1,3,4,5}

Step2:Prune(k=4)• Removeitemsets thatcan’thavethe

requiredsupportbecausethereisasubsetinitwhichdoesn’thavethelevelofsupporti.e.notinthepreviouspass(k-1)

L3 C4

No{1,4,5}existsinL3Rulesout{1,3,4,5}

• {1,2,3}• {1,2,4}{1,3,4}

• {1,3,5}• {2,3,4}

51DukeCS,Fall2018 CompSci516:DatabaseSystems

CorrectnessofApriori

52

1k1k2k2k11

1k1k

1k1k1

k

q.itemp.item,q.itemp.item,...,q.itemp.itemqp,LL

itemqitempitempp.itemC

----

--

--

<== where from

.,.,.,select intoinsert

2

k

k-1

k

c from C) L(s

ets s of c(k-1)-subs C itemsets c

deletethen if

do foralldo forall

Ï

Î

ShowthatCk⊇ Lk• Anysubsetoflargeitemset mustalsobe

large

• foreachpinLk,ithasasubsetqinLk-1• WeareextendingthosesubsetsqinJoin

withanothersubsetq’ofp,whichmustalsobelarge

• equivalenttoextendingLk-1 withallitemsandremovingthose

whose(k-1)subsetsarenotinLk-1• PruneisnotdeletinganythingfromLk

prune

join

Checkyourself

DukeCS,Fall2018 CompSci516:DatabaseSystems

Conclusions(of516,Fall2017)

DukeCS,Fall2018 CompSci516:DatabaseSystems 53

Take-Aways

• DBMSBasics

• DBMSInternals

• OverviewofResearchAreas

• Hands-onExperienceinDBsystems

DukeCS,Fall2018 CompSci516:DatabaseSystems 54

Page 10: Lecture-23-DataCube-Datamining · Lecture 23 Data Cube and Data Mining Instructor: SudeepaRoy Duke CS, Fall 2018 CompSci 516: Database Systems 1 Announcements • HW3 due on Nov 30

11/27/18

10

DBSystems• TraditionalDBMS– PostGres,SQL

• Large-scaleDataProcessingSystems– Spark/Scala,AWS

• NewDBMS/NOSQL– MongoDB

• Inaddition– XML,JSON,JDBC,Python/Java

DukeCS,Fall2018 CompSci516:DatabaseSystems 55

DBBasics

• SQL• RA/LogicalPlans• RC• Datalog–Whyweneededeachoftheselanguages

• NormalForms

DukeCS,Fall2018 CompSci516:DatabaseSystems 56

DBInternalsandAlgorithms

• Storage• Indexing• OperatorAlgorithms– ExternalSort– JoinAlgorithms

• Cost-basedQueryOptimization• Transactions– ConcurrencyControl– Recovery

DukeCS,Fall2018 CompSci516:DatabaseSystems 57

Large-scaleProcessingandNewApproaches

• ParallelDBMS• DistributedDBMS• MapReduce• NOSQL

DukeCS,Fall2018 CompSci516:DatabaseSystems 58

OtherTopics

• DataWarehouse/OLAP/DataCube• AssociationRuleMining

• HopesomeofyouwillfurtherexploreDatabaseSystems/DataManagement/DataAnalysis/BigData!

DukeCS,Fall2018 CompSci516:DatabaseSystems 59