L21: Joins 2 - Northeastern University
208
L21: Joins 2
CS3200 Database design (sp18 s2)
https://course.ccs.neu.edu/cs3200sp18s2/
4/2/2018
209
Announcements!
• Please pick up your exam if you have not yet
• Changed class calendar
• Outline today
  - Joins
  - Relational algebra
• Next class
  - Query Optimizations
212
Group Projects: what is your experience?
Source: Found on the Web as variation of http://www.inquisitr.com/160288/graph-what-i-learned-from-group-projects/
214
BNLJ: Some quick facts.
• We use M buffer pages as:
  - 1 page for S
  - 1 page for output
  - M-2 pages for R
• If P(R) <= M-2
  - then we do one pass over S, and we run in time P(R) + P(S) + OUT.
  - Note: This is optimal for our cost model!
  - Thus, if min{P(R), P(S)} <= M-2 we should always use BNLJ
• We use this at the end of hash join. We define the end condition: one of the buckets is smaller than M-2!
Cost: P(R) + (P(R) / (M-2)) * P(S) + OUT
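The buffer layout above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: relations are lists of pages, each page is a list of tuples, the join key is the first field, and the function name `block_nested_loop_join` is made up here. It also counts page reads so the count can be compared with the cost formula.

```python
def block_nested_loop_join(r_pages, s_pages, m):
    """Equijoin on the first field; returns (output tuples, page reads).

    Assumes m >= 3: M-2 buffer pages hold a chunk of R, 1 page holds
    the current page of S, and 1 page is reserved for output.
    """
    block = m - 2
    out, reads = [], 0
    for start in range(0, len(r_pages), block):
        chunk = r_pages[start:start + block]      # load up to M-2 pages of R
        reads += len(chunk)
        r_tuples = [t for page in chunk for t in page]
        for s_page in s_pages:                    # one full pass over S per chunk
            reads += 1
            for s in s_page:
                for r in r_tuples:
                    if r[0] == s[0]:
                        out.append(r + s[1:])
    return out, reads


# Hypothetical data: one tuple per page.
r_pages = [[(0, "a")], [(3, "j")], [(5, "b")]]
s_pages = [[(7, "f")], [(0, "j")], [(3, "g")]]
print(block_nested_loop_join(r_pages, s_pages, m=5))
```

With M = 5 all three pages of R fit in the M-2 pages, so S is scanned once and the read count is P(R) + P(S). With M = 3 each R page needs its own pass over S.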
215
Smarter than Cross-Products: From Quadratic to Nearly Linear
• All joins that compute the full cross-product have some quadratic term
  - For example we saw:
    NLJ: P(R) + T(R) * P(S) + OUT
    BNLJ: P(R) + (P(R) / (M-2)) * P(S) + OUT
• Next we'll see some (nearly) linear joins:
  - ~O(P(R) + P(S) + OUT), where again OUT could be quadratic but is usually better
We get this gain by taking advantage of structure - moving to equality constraints (“equijoin”) only!
216
Index Nested Loop Join (INLJ)
Compute R ⋈ S on A:
  Given index idx on S.A:
  for r in R:
    for s in idx(r[A]):
      yield r, s
Cost: P(R) + T(R)*L + OUT
where L is the IO cost to access all the distinct values in the index; assuming these fit on one page, L ~ 3 is a good estimate.
→ We can use an index (e.g. B+ Tree) to avoid doing the full cross-product!
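A runnable version of the INLJ pseudocode might look like the sketch below. It assumes tuples whose join attribute A is the first field, and it uses an in-memory dictionary as a stand-in for the index (the slides suggest a B+ Tree). The names `build_index` and `index_nested_loop_join` are illustrative only.

```python
from collections import defaultdict


def build_index(s_tuples, attr=0):
    """Stand-in for an index on S.A: maps each key to the matching S tuples."""
    idx = defaultdict(list)
    for s in s_tuples:
        idx[s[attr]].append(s)
    return idx


def index_nested_loop_join(r_tuples, idx, attr=0):
    """For each r in R, probe the index instead of scanning all of S."""
    for r in r_tuples:
        for s in idx.get(r[attr], []):
            yield r, s


R = [(5, "b"), (3, "j"), (0, "a")]
S = [(7, "f"), (0, "j"), (3, "g")]
for pair in index_nested_loop_join(R, build_index(S)):
    print(pair)
```

Each probe here is a dictionary lookup. In the cost model above, each probe would instead cost roughly L IOs.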
217
Better Join Algorithms
• 2. Sort-Merge Join (SMJ)
• 3. Hash Join (HJ)
• Comparison: SMJ vs. HJ
218
2. Sort-Merge Join (SMJ)
219
What we will learn next
• Sort-Merge Join
• “Backup” & Total Cost
• Optimizations
220
Sort Merge Join (SMJ): Basic Procedure
• To compute R ⋈ S on A:
• Sort R, S on A using external merge sort
• Scan sorted files and “merge”
• [May need to “backup” - see next subsection]
Note that if R, S are already sorted on A, SMJ will be awesome!
Note that we are only considering equality join conditions here
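The basic procedure can be sketched in Python as follows. This is an in-memory illustration, not the external-memory algorithm: it assumes both relations fit in lists, sorts them with Python's built-in `sorted` instead of external merge sort, and joins on the first field. The function name `sort_merge_join` and the `mark` variable used for “backup” are choices made for this sketch.

```python
def sort_merge_join(r_tuples, s_tuples):
    """Equijoin on the first field of each tuple."""
    r = sorted(r_tuples, key=lambda t: t[0])   # sort phase (in memory here)
    s = sorted(s_tuples, key=lambda t: t[0])
    i = j = 0
    out = []
    while i < len(r) and j < len(s):           # scan and "merge"
        if r[i][0] < s[j][0]:
            i += 1
        elif r[i][0] > s[j][0]:
            j += 1
        else:
            key = r[i][0]
            mark = j                           # start of the matching run in S
            while i < len(r) and r[i][0] == key:
                j = mark                       # "backup" to the start of the run
                while j < len(s) and s[j][0] == key:
                    out.append(r[i] + s[j][1:])
                    j += 1
                i += 1
    return out


R = [(5, "b"), (3, "j"), (0, "a")]
S = [(7, "f"), (0, "j"), (3, "g")]
print(sort_merge_join(R, S))
```

When no join key repeats, the `mark` reset never rereads anything and the merge is a single pass over both sorted inputs.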
221
SMJ Example: R ⋈ S on A with 3 page buffer
• For simplicity: Let each page be one tuple, and let the first value be A
Disk:
  R: (5,b) (3,j) (0,a)
  S: (7,f) (0,j) (3,g)
Main Memory (Buffer): empty
We show the file HEAD, which is the next value to be read!
222
SMJ Example: R ⋈ S on A with 3 page buffer
• 1. Sort the relations R, S on the join key (first value)
Disk (unsorted):
  R: (5,b) (3,j) (0,a)
  S: (7,f) (0,j) (3,g)
Disk (sorted):
  R: (0,a) (3,j) (5,b)
  S: (0,j) (3,g) (7,f)
223
SMJ Example: R ⋈ S on A with 3 page buffer
• 2. Scan and “merge” on join key!
Disk:
  R: (3,j) (5,b)
  S: (3,g) (7,f)
Main Memory (Buffer): (0,a) (0,j)
Output: (none yet)
224
SMJ Example: R ⋈ S on A with 3 page buffer
• 2. Scan and “merge” on join key!
Disk:
  R: (3,j) (5,b)
  S: (3,g) (7,f)
Main Memory (Buffer): (0,a) (0,j)
Output: (0,a,j)
225
SMJ Example: R ⋈ S on A with 3 page buffer
• 2. Scan and “merge” on join key!
Disk (read in order as the scan advances):
  R: (3,j) (5,b)
  S: (3,g) (7,f)
Main Memory (Buffer): (3,j) and (3,g) match; (5,b) and (7,f) find no partner
Output: (0,a,j) (3,j,g)
226
SMJ Example: R ⋈ S on A with 3 page buffer
• 2. Done!
Disk: R and S fully scanned
Output: (0,a,j) (3,j,g)
227
What happens with duplicate join keys?
228
Multiple tuples with Same Join Key: “Backup”
• 1. Start with sorted relations, and begin scan / merge…
Disk:
  R: (0,a) (0,b)
  S: (0,j) (0,g)
  Also on disk: (7,f)
Main Memory (Buffer): (0,a) (0,j)
Output: (none yet)
229
Multiple tuples with Same Join Key: “Backup”
• 1. Start with sorted relations, and begin scan / merge…
Disk: (0,g) (0,b) (7,f) still to be read
Main Memory (Buffer): (0,a) (0,j)
Output: (0,a,j)
230
Multiple tuples with Same Join Key: “Backup”
• 1. Start with sorted relations, and begin scan / merge…
Disk: (0,b) (7,f) still to be read
Main Memory (Buffer): (0,a) (0,g)
Output: (0,a,j) (0,a,g)
231
Multiple tuples with Same Join Key: “Backup”
• 1. Start with sorted relations, and begin scan / merge…
Disk: (7,f) still to be read
Main Memory (Buffer): (0,b) (0,j)
Output: (0,a,j) (0,a,g)
Have to “backup” in the scan of S and re-read a tuple we've already read: (0,j)!
232
Backup
• At best, no backup → scan takes P(R) + P(S) reads
  - For ex: if no duplicate values in join attribute
• At worst (e.g. full backup each time), scan could take P(R) * P(S) reads!
  - For ex: if all duplicate values in join attribute, i.e. all tuples in R and S have the same value for the join attribute
  - Roughly: For each page of R, we'll have to back up and read each page of S…
• Often not that bad however, plus we can:
  - Leave more data in buffer (for larger buffers)
  - Can “zig-zag” (see animation)
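To see the best and worst cases in numbers, here is a small sketch that counts reads during the merge scan. It makes simplifying assumptions that are not in the slides' cost model: each page holds one tuple (as in the earlier example), every access to a tuple counts as one read, nothing is kept in the buffer between accesses, and leftover tuples are still scanned to the end. The function name `merge_scan_reads` is hypothetical.

```python
def merge_scan_reads(r_keys, s_keys):
    """Count tuple reads while merging two sorted lists of join keys."""
    reads = 0
    i = j = 0
    n, m = len(r_keys), len(s_keys)
    while i < n and j < m:
        if r_keys[i] < s_keys[j]:
            i += 1
            reads += 1                  # R tuple read, no partner in S
        elif r_keys[i] > s_keys[j]:
            j += 1
            reads += 1                  # S tuple read, no partner in R
        else:
            key = r_keys[i]
            mark = j
            while i < n and r_keys[i] == key:
                reads += 1              # read this R tuple
                j = mark                # back up to the start of the S run
                while j < m and s_keys[j] == key:
                    reads += 1          # read (or re-read) an S tuple
                    j += 1
                i += 1
    return reads + (n - i) + (m - j)    # scan whatever is left


print(merge_scan_reads([0, 1, 2, 3], [0, 1, 2, 3]))   # no duplicates
print(merge_scan_reads([0, 0, 0, 0], [0, 0, 0, 0]))   # all keys equal
```

With four distinct keys on each side the count is 4 + 4 reads. With all keys equal, every R tuple forces a full re-read of S, giving 4 + 4*4 reads, which is the quadratic behavior described above.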
233
SMJ: Total cost
• Cost of SMJ is cost of sorting R and S…
• Plus the cost of scanning: ~P(R) + P(S)
  - Because of backup: in worst case P(R) * P(S); but this would be very unlikely
• Plus the cost of writing out: ~P(R) + P(S) but in worst case T(R) * T(S)
Total: ~Sort(P(R)) + Sort(P(S)) + P(R) + P(S) + OUT
Recall: Sort(N) ≈ 2N * (⌈log_{M-1}(N / (2M))⌉ + 1)
Note: this is using repacking, where we estimate that we can create initial runs of length ~2M
External merge: slides p. 26
External merge sort: slides p. 43
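The cost formula can be evaluated directly. The sketch below assumes the reading Sort(N) ≈ 2N(⌈log_{M-1}(N / (2M))⌉ + 1), with the logarithm clamped at zero when the file already fits in a single initial run. The function names are made up for illustration, and OUT is left out because it depends on the data.

```python
import math


def sort_cost(n_pages, m):
    """Approximate IOs for external merge sort with repacking (runs of ~2M pages)."""
    initial_runs = n_pages / (2 * m)
    # Number of merge passes needed to combine the initial runs, M-1 at a time.
    merge_passes = max(0, math.ceil(math.log(initial_runs, m - 1)))
    return 2 * n_pages * (merge_passes + 1)   # each pass reads and writes every page


def smj_cost(p_r, p_s, m):
    """Sort both relations, then one merge scan over each (excluding OUT)."""
    return sort_cost(p_r, m) + sort_cost(p_s, m) + p_r + p_s


print(sort_cost(1000, 100), sort_cost(500, 100), smj_cost(1000, 500, 100))
```

For M = 100, P(R) = 1000 and P(S) = 500, both sorts take two passes, so the total is 4,000 + 2,000 + 1,500 IOs before OUT.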
234
SMJ Illustrated
[Figure: Given M buffer pages, the unsorted input relations R and S go through the Sort Phase (Ext. Merge Sort): each is split & sorted, then merged in successive merge passes. The Merge / Join Phase then creates the joined output file.]
235
SMJ vs. BNLJ: Comparison
• If we have M = 100 buffer pages, P(R) = 1000 pages and P(S) = 500 pages:
• Cost for SMJ:
  - Sort:
  - Merge:
  - Sum:
• What is BNLJ?
236
SMJ vs. BNLJ: Comparison
• If we have M = 100 buffer pages, P(R) = 1000 pages and P(S) = 500 pages:
• Cost for SMJ:
  - Sort: Sort both in two passes: 2*2*1000 + 2*2*500 = 6,000 IOs
  - Merge: Merge phase 1000 + 500 = 1,500 IOs
  - Sum: 7,500 IOs + OUT
• What is BNLJ?
  - 500 + 1000 * (500 / 98) ≈ 5,500 IOs + OUT
• But, if we have M = 35 buffer pages?
  - Sort Merge has same behavior (still 2 passes)
  - BNLJ? 15,500 IOs + OUT!
SMJ is ~linear vs. BNLJ is quadratic… But it's all about the memory.
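The comparison can be reproduced with a short script. It is a sketch under two assumptions: the SMJ sort takes exactly two passes (as stated on the slide), and the number of BNLJ chunks is rounded up with a ceiling. The slide's figures of 5,500 and 15,500 correspond to rounding 500/98 and 500/33 down to 5 and 15, so the ceiling version comes out somewhat higher. Function names are illustrative.

```python
import math


def bnlj_cost(p_outer, p_inner, m):
    """P(outer) + ceil(P(outer) / (M-2)) * P(inner), excluding OUT."""
    return p_outer + math.ceil(p_outer / (m - 2)) * p_inner


def smj_two_pass_cost(p_r, p_s):
    """Two sort passes (read + write each) plus one merge scan, excluding OUT."""
    sort = 2 * 2 * (p_r + p_s)
    merge = p_r + p_s
    return sort + merge


for m in (100, 35):
    # S (500 pages) is the outer relation, R (1000 pages) the inner one.
    print(m, bnlj_cost(500, 1000, m), smj_two_pass_cost(1000, 500))
```

The trend matches the slide: shrinking M from 100 to 35 leaves the SMJ estimate unchanged while the BNLJ estimate more than doubles.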
237
Takeaway points from SMJ
• If input already sorted on join key, skip the sorts.
  - SMJ is basically linear.
  - Nasty but unlikely case: Many duplicate join keys.
• SMJ needs to sort both relations
  - If max{P(R), P(S)} < M² then cost is 3(P(R) + P(S)) + OUT
239
L21: The Relational Model
240
Our next focus
• The Relational Model
• Relational Algebra
• Relational Algebra Pt. II [Optional: may skip]
241
1. The Relational Model & Relational Algebra
242
What you will learn about in this section
• The Relational Model
• Relational Algebra: Basic Operators
• Execution
243
Motivation
The Relational model is precise, implementable, and we can operate on it (query/update, etc.)
Database maps internally into this procedural language.
244
A Little History
• Relational model due to Edgar “Ted” Codd, a mathematician at IBM in 1970
  - "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377–387
  - Won Turing Award 1981
• IBM didn't want to use relational model (take money from IMS)
  - Apparently used in the moon landing…
245
The Relational Model: Schemata
• Relational Schema:
Students(sid: string, name: string, gpa: float)
  - Relation name: Students
  - Attributes: sid, name, gpa
  - String, float, int, etc. are the domains of the attributes
246
The Relational Model: Data
Student
sid   name    gpa
001   Bob     3.2
002   Joe     2.8
003   Mary    3.8
004   Alice   3.5
An attribute (or column) is a typed data entry present in each tuple in the relation
The number of attributes is the arity of the relation
247
The Relational Model: Data
Student
sid   name    gpa
001   Bob     3.2
002   Joe     2.8
003   Mary    3.8
004   Alice   3.5
A tuple or row (or record) is a single entry in the table having the attributes specified by the schema
The number of tuples is the cardinality of the relation
248
The Relational Model: Data
A relational instance is a set of tuples all conforming to the same schema
Recall: In practice DBMSs relax the set requirement, and use multisets (or bags).
Student
sid   name    gpa
001   Bob     3.2
002   Joe     2.8
003   Mary    3.8
004   Alice   3.5
249
To Reiterate
• A relational schema describes the data that is contained in a relational instance
Let R(f1: Dom1, …, fm: Domm) be a relational schema; then an instance of R is a subset of Dom1 x Dom2 x … x Domm
In this way, a relational schema R is a total function from attribute names to types
250
One More Time
• A relational schema describes the data that is contained in a relational instance
A relation R of arity t is a function: R: Dom1 x … x Domt → {0,1}
I.e. returns whether or not a tuple of matching types is a member of it
Then, the schema is simply the signature of the function
Note here that order matters, attribute name doesn't… We'll (mostly) work with the other model (last slide) in which attribute name matters, order doesn't!
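Both views of a relation can be mirrored in a few lines of Python. This is only an illustration: the schema is modeled as a dictionary from attribute names to Python types, the instance as a set of tuples taken from the Student table, and the names `conforms` and `member` are invented for this sketch.

```python
# Schema as a total function from attribute names to types.
schema = {"sid": str, "name": str, "gpa": float}

# Instance as a set of tuples (a subset of Dom1 x Dom2 x Dom3).
instance = {
    ("001", "Bob", 3.2),
    ("002", "Joe", 2.8),
    ("003", "Mary", 3.8),
    ("004", "Alice", 3.5),
}


def conforms(t, schema):
    """Check that a tuple has the right arity and the right domain per attribute."""
    return len(t) == len(schema) and all(
        isinstance(value, dom) for value, dom in zip(t, schema.values())
    )


def member(t):
    """The relation viewed as a function Dom1 x ... x Domt -> {0, 1}."""
    return 1 if t in instance else 0


print(all(conforms(t, schema) for t in instance))
print(member(("003", "Mary", 3.8)), member(("005", "Eve", 4.0)))
```

The arity is `len(schema)` and the cardinality is `len(instance)`. A tuple such as `("001", "Bob", "Apple")` fails `conforms` because its gpa is not a float.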
251
A relational database
• A relational database schema is a set of relational schemata, one for each relation
• A relational database instance is a set of relational instances, one for each relation
Two conventions:
1. We call relational database instances simply databases
2. We assume all instances are valid, i.e., satisfy the domain constraints
252
A Course Management System (CMS)
• Relation DB Schema
  - Students(sid: string, name: string, gpa: float)
  - Courses(cid: string, cname: string, credits: int)
  - Enrolled(sid: string, cid: string, grade: string)
Relation Instances
Students
sid   name   gpa
101   Bob    3.2
123   Mary   3.8
Courses
cid   cname   credits
564   564-2   4
308   417     2
Enrolled
sid   cid   grade
123   564   A
Note that the schemas impose effective domain / type constraints, i.e. Gpa can't be “Apple”
253
2nd Part of the Model: Querying
“Find names of all students with GPA > 3.5”
SELECT S.name
FROM Students S
WHERE S.gpa > 3.5;
We don't tell the system how or where to get the data - just what we want, i.e., Querying is declarative
Actually, I showed how to do this translation for a much richer language!
To make this happen, we need to translate the declarative query into a series of operators… we'll see this next!
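As a rough illustration of what "a series of operators" could mean for this query, the sketch below applies a selection followed by a projection to the Student table shown earlier. The row representation (a list of dictionaries) and the helper names `select` and `project` are assumptions made here, not how a DBMS is implemented.

```python
students = [
    {"sid": "001", "name": "Bob", "gpa": 3.2},
    {"sid": "002", "name": "Joe", "gpa": 2.8},
    {"sid": "003", "name": "Mary", "gpa": 3.8},
    {"sid": "004", "name": "Alice", "gpa": 3.5},
]


def select(rows, predicate):
    """Selection: keep only the rows that satisfy the predicate (WHERE)."""
    return [row for row in rows if predicate(row)]


def project(rows, attrs):
    """Projection: keep only the listed attributes (SELECT list)."""
    return [{a: row[a] for a in attrs} for row in rows]


# SELECT S.name FROM Students S WHERE S.gpa > 3.5;
result = project(select(students, lambda s: s["gpa"] > 3.5), ["name"])
print(result)
```

The query itself only states what is wanted. The order in which `select` and `project` run is a decision made when the query is translated.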
254
Virtues of the model
• Physical independence (logical too), Declarative
• Simple, elegant, clean: Everything is a relation
• Why did it take multiple years?
  - Doubted it could be done efficiently.
255
2. Relational Algebra
256
RDBMS Architecture
• How does a SQL engine work?
SQL Query → Relational Algebra (RA) Plan → Optimized RA Plan → Execution
  - SQL Query: Declarative query (from user)
  - Relational Algebra (RA) Plan: Translate to relational algebra expression
  - Optimized RA Plan: Find logically equivalent - but more efficient - RA expression
  - Execution: Execute each operator of the optimized plan!