Lecture-23-DataCube-Datamining · Lecture 23 Data Cube and Data Mining Instructor: SudeepaRoy Duke...
Transcript of Lecture-23-DataCube-Datamining · Lecture 23 Data Cube and Data Mining Instructor: SudeepaRoy Duke...
11/27/18
1
CompSci 516DatabaseSystems
Lecture23DataCube
andDataMining
Instructor:Sudeepa Roy
DukeCS,Fall2018 CompSci516:DatabaseSystems 1
Announcements
• HW3dueonNov30(Fri),5pm
• FinalreportdueonDec12,Wed
• Classpresentationnextlecture(Thurs,Nov29)– Intheorderofyourgroupno.– 4minpresentation– Seedetailsfromtheannouncementsinthelastlecture
• Pleasefilloutthecourseevaluations!– Weneedtohearfromeachofyou– Weneedyourfeedback/suggestionstokeepimprovingthisclass
DukeCS,Fall2018 CompSci516:DatabaseSystems 2
DataWarehousing
DukeCS,Fall2018 CompSci516:DatabaseSystems 3
ReadingMaterial
• [RG]– Chapter25
• Gray-Chaudhuri-Bosworth-Layman-Reichart-Venkatrao-Pellow-Pirahesh,ICDE1996“DataCube:ARelationalAggregationOperatorGeneralizingGroup-By,Cross-Tab,andSub-Totals”
• Harinarayan-Rajaraman-Ullman,SIGMOD1996“Implementingdatacubesefficiently”
DukeCS,Fall2018 CompSci516:DatabaseSystems 4
Acknowledgement:• Thefollowingslideshavebeencreatedadaptingtheinstructormaterialofthe[RG]bookprovidedbytheauthorsDr.RamakrishnanandDr.Gehrke.• SomeslideshavebeenpreparedbyProf.Shivnath Babu
5
Warehousing
• Growingindustry:$8billionwaybackin1998• DatawarehousevendorlikeTeradata
• big“Petabytescale”customers• Apple,Walmart(2008-2.5PB),eBay(2013-primaryDW9.2PB,otherbigdata40PB,singletablewith1trillionrows),Verizon,AT&T,BankofAmerica
• supportsdataintoandoutofHadoop• Lotsofbuzzwords,hype– slice&dice,rollup,MOLAP,pivot,...
Ack:SlidebyProf.Shivnath Babu
https://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen/
DukeCS,Fall2018 CompSci516:DatabaseSystems
Introduction• Organizationsanalyzecurrentandhistoricaldata
– toidentifyusefulpatterns– tosupportbusinessstrategies
• Emphasisisoncomplex,interactive,exploratoryanalysisofverylargedatasets
• Createdbyintegratingdatafromacrossallpartsofanenterprise
• Dataisfairlystatic
• Relevantonceagainfortherecent“BigDataanalysis”– tofigureoutwhatwecanreuse,whatwecannot
DukeCS,Fall2018 CompSci516:DatabaseSystems 6
11/27/18
2
OLTP DataWarehousing/OLAPMostlyupdates Mostlyreads
Applications:Orderentry,salesupdate,bankingtransactions
Applications:Decisionsupportinindustry/organization
Detailed,up-to-datedata Summarized, historicaldata(frommultipleoperationaldb,growsovertime)
Structured,repetitive,shorttasks Queryintensive,adhoc,complexqueries
Eachtransactionreads/updatesonlyafewtuples(tensof)
Eachquerycanaccesses manyrecords,andperformmanyjoins,scans,aggregates
MB-GBdata GB-TB data
Typicallyclericalusers Decisionmakers,analysts asusers
Important:Consistency,recoverability,Maximizingtr.throughput
Important:QuerythroughputResponse times
7DukeCS,Fall2018 CompSci516:DatabaseSystems
ThreeComplementaryTrends
• DataWarehousing(DW):– Consolidatedatafrommanysourcesinonelargerepository– Loading,periodicsynchronizationofreplicas– Semanticintegration
• OLAP:– ComplexSQLqueriesandviews.– Queriesbasedonspreadsheet-styleoperationsand“multidimensional”viewofdata.
– Interactiveand“online”queries.
• DataMining:– Exploratorysearchforinterestingtrendsandanomalies
DukeCS,Fall2018 CompSci516:DatabaseSystems 8
ROLAPandMOLAP
• RelationalOLAP(ROLAP)– OntopofstandardrelationalDBMS– DataisstoredinrelationalDBMS– SupportsextensionstoSQLtoaccessmulti-dimensionaldata
• MultidimensionalOLAP(MOLAP)– Directlystoresmultidimensionaldatainspecialdatastructures(e.g.arrays)
9DukeCS,Fall2018 CompSci516:DatabaseSystems
ROLAP:StarSchema
• Toreflectmulti-dimensionalviewsofdata
• Singlefacttable
• Singletableforeverydimension
• Eachtupleinthefacttableconsistsof– pointers(foreignkey)toeach
ofthedimensions(multi-dimensionalcoordinates)
– numericvalueforthosecoordinates
• Eachdimensiontablecontainsattributesofthatdimension 10
NosupportforattributehierarchiesDukeCS,Fall2018 CompSci516:DatabaseSystems
DimensionHierarchies
• Foreachdimension,thesetofvaluescanbeorganizedinahierarchy:
PRODUCT TIME LOCATION
categoryweekmonthstate
pnamedatecity
year
quartercountry
DukeCS,Fall2018 CompSci516:DatabaseSystems 11
ROLAP:SnowflakeSchema
12
• Refinesstar-schema• Dimensionalhierarchyis
explicitlyrepresented
• (+)Dimensiontableseasiertomaintain– supposethe“category
descriptionisbeingchanged
• (-)Needadditionaljoins
• FactConstellations– Multiplefacttablessharesome
dimensionaltables– e.g.ProjectedandActual
Expensesmaysharemanydimensions
DukeCS,Fall2018 CompSci516:DatabaseSystems
11/27/18
3
OLAPand
DataCube
DukeCS,Fall2018 CompSci516:DatabaseSystems 13
Motivation:OLAPQueries• Dataanalystsareinterestedinexploringtrendsandanomalies– Possiblybyvisualization(Excel)- 2Dor3Dplots– “DimensionalityReduction”bysummarizingdataandcomputing
aggregates
• Findtotalunitsalesforeach1. Model2. Model,brokenintoyears3. Year,brokenintocolors4. Year5. Model,brokenintocolor,….
14
Sales(Model,Year,Color,Units)
DukeCS,Fall2018 CompSci516:DatabaseSystems
Roll-Ups• Analysisreportsstartatacoarselevel,
gotofinerlevels• Orderofattributematters• Notrelationaldata(emptycellsno
keys)
Model Year Color Model,Year,Color Model,Year Model
Chevy 1994 Black 50
Chevy 1994 White 40
90
Chevy 1995 Black 115
Chevy 1995 White 85
200
290
GROUPBY
Sales(Model,Year,Color,Units)
Roll-ups
Drill-downs
• Roll-upsareasymmetric
‘ALL’ ConstructEasiertovisualizeroll-upifallowALLtofillinthesuper-aggregates
SELECT Model, Year, Color, SUM(Units)
FROM Sales
WHERE Model = ‘Chevy’
GROUP BY Model, Year, Color
UNION
SELECT Model, Year, ‘ALL’, SUM(Units)
FROM Sales
WHERE Model = ‘Chevy’
GROUP BY Model, Year
UNION
…
UNION
SELECT ‘ALL’, ‘ALL’, ‘ALL’, SUM(Units)
FROM Sales
WHERE Model = ‘Chevy’;
Model Year Color Units
Chevy 1994 Black 50
Chevy 1994 White 40
Chevy 1994 ‘ALL’ 90
Chevy 1995 Black 85
Chevy 1995 White 115
Chevy 1995 ‘ALL’ 200
Chevy ‘ALL’ ‘ALL’ 290
Sales(Model,Year,Color,Units)
DataCube:Intuition
SELECT ‘ALL’, ‘ALL’, ‘ALL’, sum(units)
FROM Sales
UNIONSELECT ‘ALL’, ‘ALL’, Color, sum(units)
FROM Sales
GROUP BY Color
UNION
SELECT ‘ALL’, Year, ‘ALL’, sum(units)
FROM Sales
GROUP BY Year
UNION SELECT Model, Year, ‘ALL’, sum(units)
FROM Sales
GROUP BY Model, Year
UNION ….
17
Sales(Model,Year,Color,Units)
YearColor
Mod
el
TotalUnitsales
DukeCS,Fall2018 CompSci516:DatabaseSystems
• Howmanysub-queries?• Howmanysub-queriesfor
8attributes?
MorecomplextodothesewithGROUP-BY
DataCube• Computestheaggregateonallpossiblecombinationsof
groupbycolumns.
• IfthereareNattributes,thereare2N-1super-aggregates.
• IfthecardinalityoftheNattributesareC1,...,CN,thenthereareatotalof(C1+1)...(CN+1)valuesinthecube.
• ROLL-UPissimilarbutjustlooksatNaggregates
11/27/18
4
DataCube
Ack:fromslidesbyLaurelOrrandJeremyHyrkas,UW
DataCube Syntax
• SQLServer
SELECT Model, Year, Color, sum(units)
FROM Sales
GROUP BY Model, Year, Color
WITH CUBE
Sales(Model,Year,Color,Units)
ImplementingDataCube
DukeCS,Fall2018 CompSci516:DatabaseSystems 21
BasicIdeas
• Needtocomputeallgroup-by-s:– ABCD,ABC,ABD,BCD,AB,AC,AD,BC,BD,CD,A,B,C,D
• ComputeGROUP-BYsfrompreviouslycomputedGROUP-BYs– e.g.firstABCD– thenABCorACD– thenABorAC…
• WhichorderABCDissorted,mattersforsubsequentcomputations
– if(ABCD)isthesortedorder,ABCischeap,ACDorBCDisexpensive
Notations
• ABCD– group-byonattributesA,B,C,D– noguaranteeontheorderoftuples
• (ABCD)– sortedaccordingtoA->B->C->D
• ABCDand(ABCD)and(BCDA)– allcontainthesameresults– butindifferentsortedorder
Optimization1:SmallestParent
• ComputeGROUP-BYfromthesmallest(size)previouslycomputedGROUP-BYasaparent
– ABcanbecomputedfromABC,ABD,orABCD
– ABCorABDbetterthanABCD– EvenABCorABDmayhave
differentsizes,trytochoosethesmallerparent
24
ABCD
ABC ABD ACD BCD
AC AD BC BD CDAB
A B C D
all
DukeCS,Fall2018 CompSci516:DatabaseSystems
LATTICESTRUCTUREofdatacube
11/27/18
5
Optimization2:CacheResults
• CacheresultofoneGROUP-BYinmemorytoreducediskI/O– ComputeABfromABCwhile
ABCisstillinmemory
25
ABCD
ABC ABD ACD BCD
AC AD BC BD CDAB
A B C D
all
DukeCS,Fall2018 CompSci516:DatabaseSystems
Optimization3:AmortizeDiskScans
• AmortizediskreadsformultipleGROUP-BYs– SupposetheresultforABCD
isstoredondisk– ComputeallofABC,ABD,
ACD,BCDsimultaneouslyinonescanofABCD
26
ABCD
ABC ABD ACD BCD
AC AD BC BD CDAB
A B C D
all
DukeCS,Fall2018 CompSci516:DatabaseSystems
Optimization4:Sharedpartition
• forhash-basedalgorithms– Useshashtablesto
computesmallerGROUP-Bys
– IfthehashtablesforABandACfitinmemory,computebothinonescanofABC
– OtherwisepartitiononA,andcomputeHTsofABandACindifferentpartitions
• “pipe-hash”algorithmnotcoveredinclass
27
ABCD
ABC ABD ACD BCD
AC AD BC BD CDAB
A B C D
all
DukeCS,Fall2018 CompSci516:DatabaseSystems
(ABCD)
(ABC) ABD ACD BCD
AC AD BC BD CD(AB)
(A) B C D
all Level0
Level1
Level2
Level3
Level4
A B C sum
a1 b1 c1 5
a1 b1 c2 10
a1 b2 c3 8
a2 b2 c1 2
a2 b2 c3 11
NotSortedA B C sum
a2 b2 c3 11
a1 b1 c2 10
a2 b2 c1 2
a1 b1 c1 5
a1 b2 c3 8
Sorted
A B sum
a1 b1 15
a1 b2 8
a2 b2 13
Optimization5:Shared-sort
• Fromonesortedorderof(ABCD)
– Compute(ABC)– Compute(AB)– Compute(A)– Compute“all”
PipeSort:Shared-sortoptimization
• BUT,mayhaveaconflictwith“smallest-parent”optimization– (ABD)->(AB)couldbeabetterchoice– Figureoutthebestparentchoicebyrunning
aweighted-matchingalgorithmlayerbylayer(detailsnotcoveredinclass) (ABCD)
(ABC)
(AB)
(A)
Aftercomputingthe plan,executeallpipelines
PipeSort Algorithminpicture
1.Firstpipelineisexecutedbyonescanofthedata
2.Sort(CBAD)->(BADC),computethesecondpipeline
3.…..
A,S costs
ACBD
ACB ABD ACD BDC
AC AD BC BD CD
A B C D
all
AB
11/27/18
6
DataMining
ReadingMaterialOptionalReading:
1.[RG]:Chapter26
2.“FastAlgorithmsforMiningAssociationRules”Agrawal andSrikant,VLDB1994
23,863citationsonGoogleScholarinNovember2018• 23,038inNovember2017• 20,610inNovember2016• 19,496inApril2016
OneofthemostcitedpapersinCS!
32
• Acknowledgement:Thefollowingslideshavebeenpreparedadaptingtheslidesprovidedbytheauthorsof[RG]andusingseveralpresentationsofthispaperavailableontheinternet(esp.byOfer PasternakandBrianChase)
DukeCS,Fall2018 CompSci516:DatabaseSystems
FourMainStepsinKDandDM(KDD)
• DataSelection– Identifytargetsubsetofdataandattributesofinterest
• DataCleaning– Removenoiseandoutliers,unifyunits,createnewfields,usedenormalization ifneeded
• DataMining– extractinterestingpatterns
• Evaluation– presentthepatternstotheendusersinasuitableform,e.g.throughvisualization
DukeCS,Fall2018 CompSci516:DatabaseSystems 33
RememberHW1!
SeveralDM/KD(Research)Problems
• Discoveryofcausalrules• Learningoflogicaldefinitions• Fittingoffunctionstodata• Clustering• Classification• Inferringfunctionaldependenciesfromdata• Finding“usefulness”or“interestingness”ofarule
– SeethecitationsintheAgarwal-Srikant paper– Somediscussedin[RG]Chapter27
DukeCS,Fall2018 CompSci516:DatabaseSystems 34
• Retailerscollectandstoremassiveamountsofsalesdata– transactiondateandlistofitems
• Associationrules:– e.g.98%customerswhopurchase“tires”and“autoaccessories”alsoget“automotiveservices”done
– Customerswhobuymustardandketchupalsobuyburgers
– Goal:findtheserulesfromjusttransactionaldata(transactionid+listofitems)
MiningAssociationRules
35DukeCS,Fall2018 CompSci516:DatabaseSystems
• Canbeusedfor– marketingprogramandstrategies– cross-marketing(masse-mail,webpages)– catalogdesign– add-onsales– storelayout– customersegmentation
Applications
36DukeCS,Fall2018 CompSci516:DatabaseSystems
11/27/18
7
Notations
• ItemsI={i1,i2,…,im}• D:asetoftransactions• EachtransactionT⊆ I– hasanidentifierTID
• AssociationRule– Xà Y(notFunctionalDependency!)– X,Y⊂ I– X∩Y=∅
37DukeCS,Fall2018 CompSci516:DatabaseSystems
ConfidenceandSupport
• AssociationruleXàY
• Confidence c=|Tr.withXandY|/|Tr.with|X|– c%oftransactionsinDthatcontainXalsocontainY
• Support s=|Tr.withXandY|/|allTr.|– s%oftransactionsinDcontainXandY.
38DukeCS,Fall2018 CompSci516:DatabaseSystems
SupportExample
TID Cereal Beer Bread Bananas Milk
1 X X X
2 X X X X
3 X X
4 X X
5 X X
6 X X
7 X X
8 X
• Support(Cereal)• 4/8=.5
• Support(CerealàMilk)• 3/8=.375
39DukeCS,Fall2018 CompSci516:DatabaseSystems
ConfidenceExample
TID Cereal Beer Bread Bananas Milk
1 X X X
2 X X X X
3 X X
4 X X
5 X X
6 X X
7 X X
8 X
• Confidence(CerealàMilk)• 3/4=.75
• Confidence(Bananasà Bread)• 1/3=.33333…
40DukeCS,Fall2018 CompSci516:DatabaseSystems
Xà YisnotaFunctionalDependencyForfunctionaldependencies• F.D.=twotupleswiththesamevalueofofXmusthavethe
samevalueofY– Xà Y =>XZà Y(concatenation)– Xà Y,Yà Z=>Xà Z(transitivity)
Forassociationrules• Xà AdoesnotmeanXYàA
– Maynothavetheminimumsupport– Assumeonetransaction{AX}
• Xà AandAà ZdonotmeanXà Z– Maynothave theminimumconfidence– Assumetwotransactions{XA},{AZ}
41DukeCS,Fall2018 CompSci516:DatabaseSystems
Checkyourself
ProblemDefinition
• Input– asetoftransactionsD
• Canbeinanyform– afile,relationaltable,etc.
– minsupport(minsup)– minconfidence(minconf)
• Goal:generateallassociationrulesthathave– support>=minsup and– confidence>=minconf
42DukeCS,Fall2018 CompSci516:DatabaseSystems
11/27/18
8
Decompositionintotwosubproblems
• 1.Apriori– forfinding“large”itemsets withsupport>=minsup– allotheritemsets are“small”
• 2.ThenuseanotheralgorithmtofindrulesXà Ysuchthat– Bothitemsets X∪ YandXarelarge– Xà Yhasconfidence>=minconf
• Paperfocusesonsubproblem 1– ifsupportislow,confidencemaynotsaymuch– subproblem 2infullversionofthepaper
DukeCS,Fall2018 CompSci516:DatabaseSystems 43
BasicIdeas- 1
• Q.Whichitemset canpossiblyhavelargersupport:ABCDorAB– i.e.whenoneisasubsetoftheother?
• Ans:AB– anysubsetofalargeitemset mustbelarge– SoifABissmall,noneedtoinvestigateABC,ABCDetc.
DukeCS,Fall2018 CompSci516:DatabaseSystems 44
BasicIdeas- 2
• Startwithindividual(singleton)items{A},{B},…
• Insubsequentpasses,extendthe“largeitemsets”ofthepreviouspassas“seed”
• Generatenewpotentiallylargeitemsets (candidateitemsets)
• Thencounttheiractualsupportfromthedata
• Attheendofthepass,determinewhichofthecandidateitemsets areactuallylarge– becomesseedforthenextpass
• Continueuntilnonewlargeitemsets arefound
DukeCS,Fall2018 CompSci516:DatabaseSystems 45
Notations
ACTUAL
POTENTIALUsedinbothApriori andAprioriTID
46
• Assumethedatabaseisoftheform<TID,i1,i2,…>whereitemsarestoredinlexicographicorder
• TID = identifierofthetransaction• Alsoworkswhenthedatabaseis“normalized”:eachdatabaserecordis<TID,item>pair
DukeCS,Fall2018 CompSci516:DatabaseSystems
AlgorithmApriori
47
!k
k
kk
t
kt
k-k
k-1
;L Answer
minsup}|c.countC { c L
c.count Cc
,t)(C C Dt
)(L C); k 2; L( k
itemsets} {large 1- L
=
³Î=
++Î
=Î
=++¹=
=
end
endend
;do candidates forall
subset begin do ons transactiforall
;genapriori-begin do For
1
1
fCountindividual itemoccurrences
Generatenewk-itemsets candidates
count=0
Findthesupportofallthecandidates
Ct = candidatescontainedint
incrementcount
Takeonlythosewithsupport>= minsup
DukeCS,Fall2018 CompSci516:DatabaseSystems
Apriori-Gen
4848
l Joinstep
l Prunestep
1k1k2k2k11
1k1k
1k1k1
k
q.itemp.item,q.itemp.item,...,q.itemp.itemqp,LL
itemqitempitempp.itemC
----
--
--
<== where from
.,.,.,select intoinsert
2
k
k-1
k
c from C) L(s
ets s of c(k-1)-subs C itemsets c
deletethen if
do foralldo forall
Ï
Î
p andqaretwo large
(k-1)-itemsets identicalinallk-2firstitems.
Joinbyaddingthelastitemofqtop
Checkallthesubsets,removeallcandidatewithsome“small” subset
• TakesasargumentLk-1 (thesetofalllargek-1)-itemsets• Returnsasupersetofthesetofalllargek-itemsets byaugmentingLk-1
Lk-1⨝ Lk-1
DukeCS,Fall2018 CompSci516:DatabaseSystems
11/27/18
9
Apriori-GenExample- 1
• {1,2,3}• {1,2,4}{1,3,4}
• {1,3,5}• {2,3,4}
• {1,2,3,4}
Step1:Join(k=4)
Assumenumbers1-5correspondtoindividualitems
L3 C4
49DukeCS,Fall2018 CompSci516:DatabaseSystems
Apriori-GenExample- 2
• {1,2,3,4}• {1,3,4,5}
Step1:Join(k=4)
Assumenumbers1-5correspondtoindividualitems
L3 C4• {1,2,3}• {1,2,4}{1,3,4}
• {1,3,5}• {2,3,4}
50DukeCS,Fall2018 CompSci516:DatabaseSystems
Apriori-GenExample- 3
• {1,2,3,4}• {1,3,4,5}
Step2:Prune(k=4)• Removeitemsets thatcan’thavethe
requiredsupportbecausethereisasubsetinitwhichdoesn’thavethelevelofsupporti.e.notinthepreviouspass(k-1)
L3 C4
No{1,4,5}existsinL3Rulesout{1,3,4,5}
• {1,2,3}• {1,2,4}{1,3,4}
• {1,3,5}• {2,3,4}
51DukeCS,Fall2018 CompSci516:DatabaseSystems
CorrectnessofApriori
52
1k1k2k2k11
1k1k
1k1k1
k
q.itemp.item,q.itemp.item,...,q.itemp.itemqp,LL
itemqitempitempp.itemC
----
--
--
<== where from
.,.,.,select intoinsert
2
k
k-1
k
c from C) L(s
ets s of c(k-1)-subs C itemsets c
deletethen if
do foralldo forall
Ï
Î
ShowthatCk⊇ Lk• Anysubsetoflargeitemset mustalsobe
large
• foreachpinLk,ithasasubsetqinLk-1• WeareextendingthosesubsetsqinJoin
withanothersubsetq’ofp,whichmustalsobelarge
• equivalenttoextendingLk-1 withallitemsandremovingthose
whose(k-1)subsetsarenotinLk-1• PruneisnotdeletinganythingfromLk
prune
join
Checkyourself
DukeCS,Fall2018 CompSci516:DatabaseSystems
Conclusions(of516,Fall2017)
DukeCS,Fall2018 CompSci516:DatabaseSystems 53
Take-Aways
• DBMSBasics
• DBMSInternals
• OverviewofResearchAreas
• Hands-onExperienceinDBsystems
DukeCS,Fall2018 CompSci516:DatabaseSystems 54
11/27/18
10
DBSystems• TraditionalDBMS– PostGres,SQL
• Large-scaleDataProcessingSystems– Spark/Scala,AWS
• NewDBMS/NOSQL– MongoDB
• Inaddition– XML,JSON,JDBC,Python/Java
DukeCS,Fall2018 CompSci516:DatabaseSystems 55
DBBasics
• SQL• RA/LogicalPlans• RC• Datalog–Whyweneededeachoftheselanguages
• NormalForms
DukeCS,Fall2018 CompSci516:DatabaseSystems 56
DBInternalsandAlgorithms
• Storage• Indexing• OperatorAlgorithms– ExternalSort– JoinAlgorithms
• Cost-basedQueryOptimization• Transactions– ConcurrencyControl– Recovery
DukeCS,Fall2018 CompSci516:DatabaseSystems 57
Large-scaleProcessingandNewApproaches
• ParallelDBMS• DistributedDBMS• MapReduce• NOSQL
DukeCS,Fall2018 CompSci516:DatabaseSystems 58
OtherTopics
• DataWarehouse/OLAP/DataCube• AssociationRuleMining
• HopesomeofyouwillfurtherexploreDatabaseSystems/DataManagement/DataAnalysis/BigData!
DukeCS,Fall2018 CompSci516:DatabaseSystems 59