CompSci 516: Data Intensive Computing Systems

Lecture 8: Index (B+-Tree and Hash)

Instructor: Sudeepa Roy

Duke CS, Fall 2017 (CompSci 516: Database Systems)

Announcements

• HW1 due tomorrow:
  – Due on 09/21 (Thurs), 11:55 pm, no late days
  – Submit early, even if not complete, then submit the final version by the deadline
• Informal project proposal feedback soon

Reading Material

• [RG]
  – Storage: Chapters 8.1, 8.2, 8.4, 9.4-9.7
  – Index: 8.3, 8.5
  – Tree-based index: Chapter 10.1-10.7
  – Hash-based index: Chapter 11

Additional reading
• [GUW]
  – Chapters 8.3, 14.1-14.4

Acknowledgement: The following slides have been created by adapting the instructor material of the [RG] book provided by the authors, Dr. Ramakrishnan and Dr. Gehrke.

Today

• How are pages stored in a file?
• How are records stored in a page?
  – Fixed length records
  – Variable length records
• How are fields stored in a record?
  – Fixed length fields/records
  – Variable length fields/records

Recap

• Storage:
  – Files -> Records -> Fields
  – Fixed and variable length
• Index
  – Search key k -> Data entry k* -> Record
  – Alternative 1/2/3 for k*
  – Primary/secondary, clustered/unclustered
• Today
  – B+ tree index
  – Hash-based index

Clustered vs. Unclustered Index

• Suppose that Alternative (2) is used for data entries, and that the data records are stored in a Heap file
• To build a clustered index, first sort the Heap file
  – with some free space on each page for future inserts
  – Overflow pages may be needed for inserts
  – Thus, data records are `close to', but not identical to, sorted

[Figure: CLUSTERED vs. UNCLUSTERED. In both panels, index entries in the index file direct the search for data entries, and the data entries point to data records in the data file.]

Tree-based Index and B+-Tree

Range Searches

• ``Find all students with gpa > 3.0''
  – If data is in a sorted file, do binary search to find the first such student, then scan to find others.
  – Cost of binary search can be quite high.

Index file format

• Simple idea: Create an "index file"
  – <first-key-on-page, pointer-to-page>, sorted on keys
• Can do binary search on the (smaller) index file, but this may still be expensive: apply this idea repeatedly

[Figure: a data file of pages Page 1, Page 2, Page 3, ..., Page N, with an index file holding keys k1, k2, ..., kN. An index page has the layout P0, K1, P1, K2, P2, ..., Km, Pm, where each (Ki, Pi) pair is an index entry.]
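The one-level index file idea can be sketched in a few lines. This is only an illustration under assumptions: pages are modeled as in-memory Python lists, and the names `data_pages`, `index_file`, `find_page`, and `lookup` are hypothetical, not from the slides.

```python
from bisect import bisect_right

# Hypothetical data file: each page is a sorted list of keys.
data_pages = [[2, 3, 5, 7], [14, 16], [19, 20, 22], [24, 27, 29], [33, 34, 38, 39]]

# Index file: one <first-key-on-page, page-number> entry per data page, sorted on keys.
index_file = [(page[0], page_no) for page_no, page in enumerate(data_pages)]
index_keys = [first_key for first_key, _ in index_file]


def find_page(key):
    """Binary-search the index file for the only page that could hold `key`."""
    pos = bisect_right(index_keys, key) - 1   # last entry whose first key is <= key
    return index_file[max(pos, 0)][1]         # keys below the first entry map to page 0


def lookup(key):
    """Return True if `key` is stored in the data file."""
    return key in data_pages[find_page(key)]
```

The binary search touches only the small index file, and then exactly one data page is read. Applying the same idea to the index file itself gives the multi-level tree structures on the following slides.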

Indexed Sequential Access Method (ISAM)

• Leaf pages contain data entries, plus some overflow pages
• The DBMS organizes the layout of the index as a static structure
• If a number of inserts go to the same leaf, a long overflow chain can be created
  – this affects performance

[Figure: ISAM structure. Non-leaf pages sit above the leaf pages; the leaf level consists of primary pages with overflow pages chained to them. Leaf pages contain data entries.]

B+ Tree

• Most widely used index
  – a dynamic structure
• Insert/delete at log_F N cost = height of the tree (cost = I/O)
  – F = fanout, N = no. of leaf pages
  – tree is maintained height-balanced
• Minimum 50% occupancy
  – Each node contains d <= m <= 2d entries
  – Root contains 1 <= m <= 2d entries
  – The parameter d is called the order of the tree
• Supports equality and range searches efficiently

[Figure: the index file. Index entries (direct search) sit above the data entries, which form the "sequence set".]

B+ Tree Indexes

• Leaf pages contain data entries, and are chained (prev & next)
• Non-leaf pages have index entries; only used to direct searches

[Figure: a non-leaf page with layout P0, K1, P1, K2, P2, ..., Km, Pm, where each (Ki, Pi) pair is an index entry. Non-leaf pages sit above the leaf pages, which are sorted by search key.]

Example B+ Tree

• Search begins at root, and key comparisons direct it to a leaf
• Search for 5*, 15*, all data entries >= 24* ...
  – Based on the search for 15*, we know it is not in the tree!

[Figure: example B+ tree.
  Root: 13 17 24 30
  Leaves: [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]]
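The three searches on this slide can be traced with a small in-memory sketch. It assumes the example tree has the root keys 13, 17, 24, 30 over five leaves; the `Node` class and the function names are hypothetical, and real systems work on disk pages rather than Python objects.

```python
from bisect import bisect_right


class Node:
    """Index node (has children) or leaf node (children is None)."""

    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children
        self.next = None          # right-sibling pointer, used by leaves only


def find_leaf(root, key):
    """Follow key comparisons from the root down to the leaf that could hold `key`."""
    node = root
    while node.children is not None:
        # Child i holds keys k with K_i <= k < K_(i+1).
        node = node.children[bisect_right(node.keys, key)]
    return node


def search(root, key):
    """Equality search: True if the data entry key* is in the tree."""
    return key in find_leaf(root, key).keys


def scan_from(root, low):
    """Range search: all data entries >= low, using the leaf chain."""
    leaf = find_leaf(root, low)
    result = []
    while leaf is not None:
        result.extend(k for k in leaf.keys if k >= low)
        leaf = leaf.next
    return result


# Build the example tree (assumed layout, see the lead-in).
leaves = [Node(ks) for ks in ([2, 3, 5, 7], [14, 16], [19, 20, 22],
                              [24, 27, 29], [33, 34, 38, 39])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Node([13, 17, 24, 30], leaves)
```

Searching for 15 ends at the leaf holding 14* and 16*, so one root-to-leaf path is enough to conclude that 15* is not in the tree.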

Example B+ Tree

• Find
  – 28*?
  – 29*?
  – All > 15* and < 30*
• Note how data entries in the leaf level are sorted

[Figure: three-level B+ tree.
  Root: 17 (entries < 17 to the left, entries >= 17 to the right)
  Index nodes: [5 13] [27 30]
  Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [22* 24*] [27* 29*] [33* 34* 38* 39*]]

B+ Trees in Practice

• Typical order: d = 100. Typical fill-factor: 67%
  – average fanout F = 133
• Typical capacities:
  – Height 4: 133^4 = 312,900,721 records
  – Height 3: 133^3 = 2,352,637 records
• Can often hold top levels in buffer pool:
  – Level 1 = 1 page = 8 Kbytes
  – Level 2 = 133 pages = 1 Mbyte
  – Level 3 = 17,689 pages = 133 MBytes
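The capacity figures follow directly from the average fanout, as this short calculation shows. The function names are hypothetical; the fanout of 133 and the 8 KB page size are taken from the slide.

```python
F = 133                   # average fanout, from the slide
PAGE_BYTES = 8 * 1024     # 8 KB pages, from the slide


def capacity(height, fanout=F):
    """Number of records reachable by a tree of the given height."""
    return fanout ** height


def level_pages(level, fanout=F):
    """Number of pages on a level, with the root as level 1."""
    return fanout ** (level - 1)


def level_bytes(level, fanout=F, page_bytes=PAGE_BYTES):
    """Buffer-pool space needed to hold one whole level."""
    return level_pages(level, fanout) * page_bytes
```

Here `capacity(4)` evaluates 133^4 exactly. `level_bytes(3)` is 17,689 × 8 KB, roughly 138 MiB, in the same range as the slide's rounded figure, which appears to treat Level 2 as exactly 1 Mbyte.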

Inserting a Data Entry into a B+ Tree

(See this slide later; first, see the examples on the next few slides.)

• Find correct leaf L
• Put data entry onto L
  – If L has enough space, done
  – Else, must split L
    • into L and a new node L2
    • Redistribute entries evenly, copy up middle key.
    • Insert index entry pointing to L2 into parent of L.
• This can happen recursively
  – To split index node, redistribute entries evenly, but push up middle key
    • Contrast with leaf splits
• Splits "grow" tree; root split increases height.
  – Tree growth: gets wider or one level taller at top.
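The contrast between copy-up (leaf split) and push-up (index split) can be shown with two small functions. This is a sketch under assumptions: nodes are plain sorted lists that already hold the one entry too many, and `split_leaf` and `split_index` are hypothetical names.

```python
def split_leaf(entries):
    """Split an overfull leaf (2d+1 sorted data entries).

    Returns (left_leaf, copied_up_key, right_leaf). The middle key is COPIED
    up: it goes to the parent and also stays in the right leaf, because every
    data entry must remain at the leaf level.
    """
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right[0], right


def split_index(keys, children):
    """Split an overfull index node (2d+1 keys, 2d+2 child pointers).

    Returns (left_node, pushed_up_key, right_node), where each node is a
    (keys, children) pair. The middle key is PUSHED up: it moves to the
    parent and appears in neither half.
    """
    mid = len(keys) // 2
    left = (keys[:mid], children[:mid + 1])
    right = (keys[mid + 1:], children[mid + 1:])
    return left, keys[mid], right
```

With d = 2, `split_leaf([2, 3, 5, 7, 8])` keeps 5 in the right leaf, while `split_index([5, 13, 17, 24, 30], ...)` removes 17 from both halves. These are the two steps traced on the next slides.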

Inserting 8* into Example B+ Tree

STEP-1

• Copy-up: 5 appears in leaf and the level above
• Observe how minimum occupancy is guaranteed

[Figure: the leaf [2* 3* 5* 7*] of the example tree (root: 13 17 24 30) is full, so inserting 8* splits it into [2* 3*] and [5* 7* 8*]. The entry to be inserted in the parent node is 5. Note that 5 is copied up and continues to appear in the leaf.]

Inserting 8* into Example B+ Tree

STEP-2

• Need to split parent
• Note difference between copy-up and push-up
• What is the reason for this difference?
• All data entries must appear as leaves
  – (for easy range search)
• No such requirement for indexes
  – (so avoid redundancy)

[Figure: the parent [13 17 24 30] is full, so inserting 5 splits it into [5 13] and [24 30]. The entry to be inserted in the parent node is 17. Note that 17 is pushed up and only appears once in the index. Contrast this with a leaf split. Leaves at this point: [2* 3*] [5* 7* 8*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*].]

Example B+ Tree After Inserting 8*

• Notice that root was split, leading to increase in height.
• In this example, we can avoid the split by re-distributing entries (insert 8 to the 2nd leaf node from left and copy it up instead of 13)
• However, this is usually not done in practice, since it always needs access to 1-2 extra pages (for two siblings), and average occupancy may remain unaffected as the file grows

[Figure: tree after inserting 8*.
  Root: 17
  Index nodes: [5 13] [24 30]
  Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]]

Deleting a Data Entry from a B+ Tree

(See this slide later; first, see the examples on the next few slides.)

Each non-root node contains d <= m <= 2d entries

• Start at root, find leaf L where entry belongs
• Remove the entry
  – If L is at least half-full, done!
  – If L has only d-1 entries,
    • Try to re-distribute, borrowing from sibling (adjacent node with same parent as L)
    • If re-distribution fails, merge L and sibling
• If merge occurred, must delete entry (pointing to L or sibling) from parent of L
• Merge could propagate to root, decreasing height
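The leaf-level decision (done, re-distribute, or merge) can be sketched as one function. It is a simplification under assumptions: only the right sibling is considered, leaves are plain sorted lists, and `delete_from_leaf` is a hypothetical name. Updating the parent is left to the caller.

```python
def delete_from_leaf(leaf, right_sibling, key, d):
    """Delete `key` from `leaf` in a B+ tree of order d.

    Returns (action, leaf, sibling, separator):
      "done"         - leaf is still at least half-full, nothing else changes
      "redistribute" - one entry was borrowed; `separator` is the new key to
                       copy up into the parent
      "merge"        - leaf and sibling were combined; the caller must delete
                       the parent entry that pointed to the sibling
    """
    leaf = [k for k in leaf if k != key]
    if len(leaf) >= d:
        return "done", leaf, right_sibling, None
    if len(right_sibling) > d:                    # sibling can spare an entry
        leaf = leaf + [right_sibling[0]]
        right_sibling = right_sibling[1:]
        return "redistribute", leaf, right_sibling, right_sibling[0]
    return "merge", leaf + right_sibling, None, None
```

With d = 2, deleting 19 from [19, 20, 22] needs nothing more, deleting 20 from [20, 22] borrows 24 from [24, 27, 29] and copies up 27, and deleting 24 from [22, 24] next to [27, 29] forces a merge. These are the three cases worked through on the following slides.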

Example Tree: Delete 19*

• We had inserted 8*
• Now delete 19*
• Easy

[Figure: before deleting 19*.
  Root: 17
  Index nodes: [5 13] [24 30]
  Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]]

[Figure: after deleting 19*.
  Root: 17
  Index nodes: [5 13] [24 30]
  Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [20* 22*] [24* 27* 29*] [33* 34* 38* 39*]]

Example Tree: Delete 20*

[Figure: before deleting 20*.
  Root: 17
  Index nodes: [5 13] [24 30]
  Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [20* 22*] [24* 27* 29*] [33* 34* 38* 39*]]

• < 2 entries in leaf node
• Redistribute

[Figure: after deleting 20*, step 1.
  Root: 17
  Index nodes: [5 13] [24 30]
  Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [22*] [24* 27* 29*] [33* 34* 38* 39*]]

• Notice how middle key is copied up

[Figure: after deleting 20*, step 2.
  Root: 17
  Index nodes: [5 13] [27 30]
  Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [22* 24*] [27* 29*] [33* 34* 38* 39*]]

Example Tree: ... And Then Delete 24*

[Figure: before deleting 24*.
  Root: 17
  Index nodes: [5 13] [27 30]
  Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [22* 24*] [27* 29*] [33* 34* 38* 39*]]

• Once again, imbalance at leaf
• Can we borrow from sibling(s)?
• No: d-1 and d entries (d = 2)
• Need to merge

[Figure: after deleting 24*, step 1.
  Root: 17
  Index nodes: [5 13] [27 30]
  Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [22*] [27* 29*] [33* 34* 38* 39*]]

• Imbalance at parent
• Merge again
• But need to "pull down" root index entry
  – because the merged node would have three index keys (5, 13, 30) but five pointers to leaves
• Observe `toss' of old index entry 27

[Figure: after deleting 24*, step 2.
  Root: 17
  Index nodes: [5 13] [30]
  Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [22* 27* 29*] [33* 34* 38* 39*]]

Final Example Tree

[Figure: final tree.
  Root: 5 13 17 30
  Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [22* 27* 29*] [33* 34* 38* 39*]]

Example of Non-leaf Re-distribution

• An intermediate tree is shown
• In contrast to previous example, can re-distribute entry from left child of root to right child

[Figure: intermediate tree.
  Root: 22
  Index nodes: [5 13 17 20] [30]
  Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [17* 18*] [20* 21*] [22* 27* 29*] [33* 34* 38* 39*]]

After Re-distribution

• Intuitively, entries are re-distributed by `pushing through' the splitting entry in the parent node.
  – It suffices to re-distribute index entry with key 20; we've re-distributed 17 as well for illustration.

[Figure: tree after re-distribution.
  Root: 17
  Index nodes: [5 13] [20 22 30]
  Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [17* 18*] [20* 21*] [22* 27* 29*] [33* 34* 38* 39*]]

Duplicates

• First Option:
  – The basic search algorithm assumes that all entries with the same key value reside on the same leaf page
  – If they do not fit, use overflow pages (like ISAM)
• Second Option:
  – Several leaf pages can contain entries with a given key value
  – Search for the leftmost entry with a key value, and follow the leaf-sequence pointers
  – Need modification in the search algorithm
• If k* = <k, rid>, several entries have to be searched
  – Or include rid in k: the index becomes unique, with no duplicates
  – If k* = <k, rid-list>, this offers some solution, but if the list is long, a single entry can again span multiple pages

A Note on `Order'

• Order (d) denotes minimum occupancy
• Replaced by a physical space criterion in practice (`at least half-full')
  – Index pages can typically hold many more entries than leaf pages
  – Variable-sized records and search keys mean different nodes will contain different numbers of entries.
  – Even with fixed-length fields, multiple records with the same search key value (duplicates) can lead to variable-sized data entries (if we use Alternative (3))

Summary

• Tree-structured indexes are ideal for range searches, also good for equality searches
• ISAM is a static structure
  – Only leaf pages modified; overflow pages needed
  – Overflow chains can degrade performance unless size of data set and data distribution stay constant
• B+ tree is a dynamic structure
  – Inserts/deletes leave tree height-balanced; log_F N cost
  – High fanout (F) means depth rarely more than 3 or 4
  – Almost always better than maintaining a sorted file
  – Most widely used index in database management systems because of its versatility.
  – One of the most optimized components of a DBMS
• Next: Hash-based index

Hash-based Index

Hash-Based Indexes

• Records are grouped into buckets
  – Bucket = primary page plus zero or more overflow pages
• Hashing function h:
  – h(r) = bucket in which (data entry for) record r belongs
  – h looks at the search key fields of r
  – No need for "index entries" in this scheme

Example: Hash-based index

Index-organized file hashed on AGE, with auxiliary index on SAL

[Figure: two hashed files.
  Employee file hashed on AGE with h1 (Alternative 1):
    h1(AGE)=00: Smith,44,3000; Jones,40,6003; Tracy,44,5004
    h1(AGE)=01: Ashby,25,3000; Basu,33,4003; Bristow,29,2007
    h1(AGE)=10: Cass,50,5004; Daniels,22,6003
  File of <SAL, rid> pairs hashed on SAL with h2 (Alternative 2):
    h2(SAL)=00: 3000, 3000, 5004, 5004
    h2(SAL)=01: 4003, 2007, 6003, 6003]

Introduction

• Hash-based indexes are best for equality selections
  – Find all records with name = "Joe"
  – Cannot support range searches
  – But useful in implementing relational operators like join (later)
• Static and dynamic hashing techniques exist
  – trade-offs similar to ISAM vs. B+ trees

Static Hashing

• Pages containing data = a collection of buckets
  – each bucket has one primary page, also possibly overflow pages
  – buckets contain data entries k*
• # primary pages fixed
  – allocated sequentially, never de-allocated, overflow pages if needed.
• h(k) mod N = bucket to which data entry with key k belongs
  – N = # of buckets

[Figure: a key is passed through h, and h(key) mod N selects one of the primary bucket pages 0 ... N-1; each primary bucket page may have a chain of overflow pages.]

Static Hashing

• Hash function works on search key field of record r
  – Must distribute values over range 0 ... N-1
  – h(key) = (a * key + b) usually works well
    • bucket = h(key) mod N
  – a and b are constants, chosen to tune h
• Advantage:
  – # buckets known, so pages can be allocated sequentially
  – search needs 1 I/O (if no overflow page)
  – insert/delete needs 2 I/O (if no overflow page) (why 2?)
• Disadvantage:
  – Long overflow chains can develop if file grows and degrade performance
  – Or waste of space if file shrinks
• Solutions:
  – keep pages, say, 80% full initially
  – Periodically rehash if overflow pages develop (can be expensive)
  – or use Dynamic Hashing
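A minimal sketch of static hashing makes the cost of overflow chains concrete. It assumes integer keys and models each bucket as a list of pages (primary page first); the class `StaticHashFile`, its parameters, and the default constants a = 1, b = 0 are hypothetical choices for illustration.

```python
class StaticHashFile:
    """Static hashing with a fixed number of primary pages and overflow chains."""

    def __init__(self, n_buckets, page_capacity, a=1, b=0):
        self.n = n_buckets
        self.capacity = page_capacity
        self.a, self.b = a, b              # constants chosen to tune h
        # Each bucket is a chain of pages: the primary page, then overflow pages.
        self.buckets = [[[]] for _ in range(n_buckets)]

    def bucket_of(self, key):
        """bucket = h(key) mod N, with h(key) = a * key + b."""
        return (self.a * key + self.b) % self.n

    def insert(self, key):
        chain = self.buckets[self.bucket_of(key)]
        if len(chain[-1]) == self.capacity:
            chain.append([])               # last page is full: add an overflow page
        chain[-1].append(key)

    def search(self, key):
        """Return (found, pages_read); pages_read models the I/O cost."""
        pages_read = 0
        for page in self.buckets[self.bucket_of(key)]:
            pages_read += 1
            if key in page:
                return True, pages_read
        return False, pages_read
```

With no overflow pages a search reads exactly one page. An insert reads that page and then writes it back, which is one way to account for the 2 I/Os on the slide. Every overflow page in the chain adds one more read to a search.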

Dynamic Hashing Techniques

• Extendible Hashing
• Linear Hashing

Extendible Hashing

• Consider static hashing
• Bucket (primary page) becomes full
• Why not re-organize file by doubling # of buckets?
  – Reading and writing all pages (double # pages) is expensive
• Idea: Use directory of pointers to buckets
  – double # of buckets by doubling the directory, splitting just the bucket that overflowed
  – Directory much smaller than file, so doubling it is much cheaper
  – Only one page of data entries is split
  – No overflow page (new bucket, no new overflow page)
  – Trick lies in how hash function is adjusted

Example

• Directory is array of size 4
  – each element points to a bucket
  – # bits to represent = log_2 4 = 2 = global depth
• To find bucket for search key r
  – take last global depth # bits of h(r)
  – assume h(r) = r
  – If h(r) = 5 = binary 101
  – it is in bucket pointed to by 01

[Figure: extendible hashing file with GLOBAL DEPTH 2. The DIRECTORY has entries 00, 01, 10, 11 pointing to DATA PAGES, each with LOCAL DEPTH 2:
  00 -> Bucket A: 4* 12* 32* 16*
  01 -> Bucket B: 1* 5* 21*
  10 -> Bucket C: 10*
  11 -> Bucket D: 15* 7* 19*]

Example Insert

• If bucket is full, split it
  – allocate new page
  – re-distribute

Suppose inserting 13*
• binary = 1101
• bucket 01
• Has space, insert

[Figure: the file before the insert (GLOBAL DEPTH 2, all LOCAL DEPTHs 2):
  00 -> Bucket A: 4* 12* 32* 16*
  01 -> Bucket B: 1* 5* 21*
  10 -> Bucket C: 10*
  11 -> Bucket D: 15* 7* 19*]

Example Insert

• If bucket is full, split it
  – allocate new page
  – re-distribute

Suppose inserting 20*
• binary = 10100
• bucket 00
• Already full
• To split, consider last three bits of 10100
• Last two bits the same (00): the data entry will belong to one of these buckets
• Third bit to distinguish them

[Figure: the file after inserting 13* (GLOBAL DEPTH 2, all LOCAL DEPTHs 2):
  00 -> Bucket A: 4* 12* 32* 16*
  01 -> Bucket B: 1* 5* 21* 13*
  10 -> Bucket C: 10*
  11 -> Bucket D: 15* 7* 19*]

Example

[Figure, left: Bucket A split while inserting 20*, directory not yet doubled (GLOBAL DEPTH 2, directory 00, 01, 10, 11):
  Bucket A: 32* 16*
  Bucket B: 1* 5* 21* 13*
  Bucket C: 10*
  Bucket D: 15* 7* 19*
  Bucket A2 (`split image' of Bucket A): 4* 12* 20*]

[Figure, right: after doubling the directory (GLOBAL DEPTH 3, directory 000, 001, 010, 011, 100, 101, 110, 111):
  Bucket A (local depth 3): 32* 16*
  Bucket B (local depth 2): 1* 5* 21* 13*
  Bucket C (local depth 2): 10*
  Bucket D (local depth 2): 15* 7* 19*
  Bucket A2, the new `split image' of Bucket A (local depth 3): 4* 12* 20*]

Global depth: Max # of bits needed to tell which bucket an entry belongs to

Local depth: # of bits used to determine if an entry belongs to this bucket
• also denotes whether a directory doubling is needed while splitting
• no directory doubling needed when 9* = 1001 is inserted (LD < GD)

When does bucket split cause directory doubling?

• Before insert, local depth of bucket = global depth
• Insert causes local depth to become > global depth
• Directory is doubled by copying it over and `fixing' pointer to split image page
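The splitting and doubling rules can be sketched as follows. This is an illustration under assumptions, not a definitive implementation: keys are integers with h(r) = r, each bucket is one page of 4 entries, and the names `Bucket` and `ExtendibleHash` are hypothetical. It does not handle more than one page of entries with the same hash value, which the slides flag as a problem case.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []


class ExtendibleHash:
    def __init__(self, capacity=4, global_depth=2):
        self.capacity = capacity
        self.global_depth = global_depth
        self.directory = [Bucket(global_depth) for _ in range(2 ** global_depth)]

    def _slot(self, key):
        # Last `global_depth` bits of h(key), assuming h(key) = key.
        return key & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._slot(key)].entries

    def insert(self, key):
        bucket = self.directory[self._slot(key)]
        if len(bucket.entries) < self.capacity:
            bucket.entries.append(key)
            return
        if bucket.local_depth == self.global_depth:
            # Double the directory by copying it over.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split the full bucket using one more bit.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = image       # `fix' pointer to the split image
        old_entries, bucket.entries = bucket.entries, []
        for k in old_entries:
            self.directory[self._slot(k)].entries.append(k)
        self.insert(key)                        # retry; may split again


# Rebuild the example file, then insert 13* and 20*.
eh = ExtendibleHash(capacity=4, global_depth=2)
for key in [4, 12, 32, 16, 1, 5, 21, 10, 15, 7, 19]:
    eh.insert(key)
eh.insert(13)   # bucket 01 has space
eh.insert(20)   # bucket 00 is full and local depth = global depth: directory doubles
```

After these inserts the global depth is 3, and directory slots 001 and 101 still share Bucket B. Inserting 9* afterwards splits Bucket B without doubling the directory, because its local depth (2) is below the global depth (3).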

Comments on Extendible Hashing

• If directory fits in memory, equality search answered with one disk access (to access the bucket); else two.
  – 100MB file, 100 bytes/rec, 4KB page size, contains 10^6 records (as data entries) and 25,000 directory elements; chances are high that directory will fit in memory.
  – Directory grows in spurts, and, if the distribution of hash values is skewed, directory can grow large
  – Multiple entries with same hash value cause problems
• Delete:
  – If removal of data entry makes bucket empty, can be merged with `split image'
  – If each directory element points to same bucket as its split image, can halve directory.
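The sizing example is simple arithmetic, sketched below. It assumes decimal units (1 MB = 10^6 bytes, 1 KB = 10^3 bytes) and one directory element per data page; both are assumptions made here to reproduce the slide's round numbers, and the variable names are hypothetical.

```python
FILE_BYTES = 100 * 10**6    # 100 MB file (decimal megabytes assumed)
RECORD_BYTES = 100          # 100 bytes per record
PAGE_BYTES = 4 * 10**3      # 4 KB pages (decimal kilobytes assumed)

records = FILE_BYTES // RECORD_BYTES             # data entries in the file
records_per_page = PAGE_BYTES // RECORD_BYTES    # entries that fit on one page
directory_elements = FILE_BYTES // PAGE_BYTES    # one element per data page (assumed)
```

A directory of 25,000 pointers is small compared with the 100 MB file, which is why it is likely to fit in memory.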

Linear Hashing

• This is another dynamic hashing scheme
  – an alternative to Extendible Hashing
• LH handles the problem of long overflow chains
  – without using a directory
  – handles duplicates and collisions
  – very flexible w.r.t. timing of bucket splits

Linear Hashing: Basic Idea

• Use a family of hash functions h_0, h_1, h_2, ...
  – h_i(key) = h(key) mod (2^i N)
  – N = initial # buckets
  – h is some hash function (range is not 0 to N-1)
  – If N = 2^d0, for some d0, h_i consists of applying h and looking at the last d_i bits, where d_i = d0 + i
• Note: h_i(key) = h(key) mod (2^(d0+i))
  – h_(i+1) doubles the range of h_i
    • if h_i maps to M buckets, h_(i+1) maps to 2M buckets
    • similar to directory doubling
  – Suppose N = 32, d0 = 5
    • h_0 = h mod 32 (last 5 bits)
    • h_1 = h mod 64 (last 6 bits)
    • h_2 = h mod 128 (last 7 bits) etc.
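The hash family can be written down directly. The sketch assumes integer keys and uses the identity function as the base hash h by default; `make_h` and `last_bits` are hypothetical helper names.

```python
def make_h(i, n_initial, h=lambda key: key):
    """Return h_i, where h_i(key) = h(key) mod (2^i * N)."""
    modulus = (2 ** i) * n_initial
    return lambda key: h(key) % modulus


def last_bits(value, bits):
    """The last `bits` bits of a non-negative integer."""
    return value & ((1 << bits) - 1)


# N = 32 = 2^5, so d0 = 5 and h_i looks at the last 5 + i bits.
h0 = make_h(0, 32)
h1 = make_h(1, 32)
h2 = make_h(2, 32)
```

For any key, `h1(key)` is either `h0(key)` or `h0(key) + 32`. That property is what lets a bucket be split into itself and a single split image.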

Linear Hashing: Rounds

• Directory avoided in LH by using overflow pages, and choosing bucket to split round-robin
• During round Level, only h_Level and h_(Level+1) are in use
• The buckets from start to last are split sequentially
  – this doubles the no. of buckets
• Therefore, at any point in a round, we have
  – buckets that have been split
  – buckets that are yet to be split
  – buckets created by splits in this round

Overview of LH File

• In the middle of a round Level, originally buckets 0 to N_Level - 1
• Buckets 0 to Next-1 have been split
• Next to N_Level - 1 yet to be split
• Round ends when all N_R initial (for round R) buckets are split

[Figure: layout of an LH file in the middle of a round.
  – Buckets that existed at the beginning of this round (0 to N_Level - 1): this is the range of h_Level
  – Buckets 0 to Next-1, split in this round: if h_Level(r) is in this range, must use h_(Level+1)(r) to decide if entry is in `split image' bucket
  – Next: bucket to be split
  – Buckets Next to N_Level - 1: if h_Level(r) is in this range, no need to apply h_(Level+1)
  – `Split image' buckets: created (through splitting of other buckets) in this round]

• Search: To find bucket for data entry r, find h_Level(r):
  – If h_Level(r) in range `Next to N_Level - 1', r belongs here.
  – Else, r could belong to bucket h_Level(r) or h_Level(r) + N_R
    • Apply h_(Level+1)(r) to find out
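The search rule reduces to a short address computation. The sketch assumes the hash value h(r) is already available as an integer and that N_Level = N * 2^Level; `bucket_address` and its parameter names are hypothetical.

```python
def bucket_address(hashed, level, next_split, n_initial):
    """Bucket number for a data entry whose hash value is `hashed`.

    level      - current round (Level)
    next_split - the Next pointer: buckets 0 .. Next-1 are already split
    n_initial  - N, the number of buckets at round 0
    """
    n_level = n_initial * 2 ** level       # buckets at the start of this round
    addr = hashed % n_level                # h_Level
    if addr < next_split:
        # Bucket was already split in this round: h_(Level+1) decides
        # between the original bucket and its split image (addr + n_level).
        addr = hashed % (2 * n_level)
    return addr
```

For instance, with Level = 0, N = 4 and Next = 1, the entry 18 maps to bucket 2 using h_0 alone, while 32 and 44 both give h_0 = 0 and need h_1, which sends them to buckets 0 and 4 respectively.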

Linear Hashing: Insert

• Insert: Find bucket by applying h_Level / h_(Level+1):
  – If bucket to insert into is full:
    1. Add overflow page and insert data entry
    2. Split Next bucket and increment Next
• Note: We are going to assume that a split is `triggered' whenever an insert causes the creation of an overflow page, but in general, we could impose additional conditions for better space utilization ([RG], p. 380)
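The insert rule, with a split triggered whenever an overflow page is created, can be sketched as below. Assumptions: integer keys with h(r) = r, 4 entries per page, and each bucket modeled as one flat list, so that entries beyond the page capacity stand for overflow pages. The class name `LinearHashFile` and its attributes are hypothetical.

```python
class LinearHashFile:
    def __init__(self, n_initial=4, capacity=4):
        self.n0 = n_initial
        self.capacity = capacity          # entries per page
        self.level = 0
        self.next = 0
        # A bucket is a flat list; entries beyond `capacity` model overflow pages.
        self.buckets = [[] for _ in range(n_initial)]

    def _address(self, key):
        n_level = self.n0 * 2 ** self.level
        addr = key % n_level              # h_Level
        if addr < self.next:
            addr = key % (2 * n_level)    # h_(Level+1) for already-split buckets
        return addr

    def insert(self, key):
        bucket = self.buckets[self._address(key)]
        # True when every page of the bucket is full, so a new overflow page is needed.
        needs_overflow = len(bucket) > 0 and len(bucket) % self.capacity == 0
        bucket.append(key)
        if needs_overflow:
            self._split()

    def _split(self):
        """Split the bucket at Next (not necessarily the one that overflowed)."""
        n_level = self.n0 * 2 ** self.level
        old = self.buckets[self.next]
        stay = [k for k in old if k % (2 * n_level) == self.next]
        move = [k for k in old if k % (2 * n_level) != self.next]
        self.buckets[self.next] = stay
        self.buckets.append(move)         # split image lands at index Next + N_Level
        self.next += 1
        if self.next == n_level:          # every bucket of this round has been split
            self.level += 1
            self.next = 0
```

Loading the buckets 32, 44, 36 / 9, 25, 5 / 14, 18, 10, 30 / 31, 35, 7, 11 and then inserting 43 puts 43 on an overflow page of bucket 11 while splitting bucket 00, not the bucket that overflowed. That is the defining behavior of the round-robin split order.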

Example of Linear Hashing

Level = 0, N_0 = 4 = 2^d0, d0 = 2, Next = 0

[Figure: the linear hashed file. The h_1 and h_0 columns are for illustration only; the primary pages are the actual contents of the file. A data entry r with h(r) = 5 is shown as 5* on its primary bucket page.
  h_1   h_0   PRIMARY PAGES
  000   00    32* 44* 36*         <- Next = 0
  001   01    9* 25* 5*
  010   10    14* 18* 10* 30*
  011   11    31* 35* 7* 11*]

• Insert 43* = 101011
• h_0(43) = 11
• Full
• Insert in an overflow page
• Need a split at Next (= 0)
• Entries in 00 are distributed to 000 and 100

Example of Linear Hashing

[Figure, before: Level = 0, N_0 = 4 = 2^d0, d0 = 2, Next = 0. The h_1 and h_0 columns are for illustration only.
  h_1   h_0   PRIMARY PAGES
  000   00    32* 44* 36*         <- Next = 0
  001   01    9* 25* 5*
  010   10    14* 18* 10* 30*
  011   11    31* 35* 7* 11*]

[Figure, after inserting 43*: Level = 0, N_0 = 4 = 2^d0, d0 = 2, Next = 1.
  h_1   h_0   PRIMARY PAGES       OVERFLOW PAGES
  000   00    32*
  001   01    9* 25* 5*           <- Next = 1
  010   10    14* 18* 10* 30*
  011   11    31* 35* 7* 11*      43*
  100   00    44* 36*]

• Next is incremented after split
• Note the difference between overflow page of 11 and split image of 00 (000 and 100)

Page 58: CompSci516 Data Intensive Computing Systems …x86.cs.duke.edu/courses/fall17/compsci516/Lectures/...•Splits “grow” tree; root split increases height. – Tree growth: gets wideror

Example of Linear Hashing

Level = 0, N0 = 4 = 2^d0, d0 = 2, Next = 1

h1    h0    Primary pages       Overflow pages
000   00    32*
001   01    9* 25* 5*
010   10    14* 18* 10* 30*
011   11    31* 35* 7* 11*      43*
100   00    44* 36*

• Search for 18* = 10010
  – h0(18) = 10, which is between Next (= 1) and N0 (= 4)
  – This bucket has not been split
  – 18 should be here
• Search for 32* = 100000 or 44* = 101100
  – h0 gives 00, which is between 0 and Next-1
  – This bucket has been split, so we need h1
• Not every insertion triggers a split
  – Insert 37* = 100101
  – Bucket 01 has space
• Splitting at Next?
  – No overflow bucket needed
  – Just redistribute the entries between the original bucket and its split image
• Next = N_Level - 1 and a split?
  – Start a new round
  – Increment Level
  – Next reset to 0
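The search rule on this slide can be written as a short function. This is a sketch under the same assumptions as the example (h(key) = key, N0 = 4); the name `bucket_for` is illustrative.

```python
def bucket_for(key, level, next_bucket, n0=4):
    """Pick the bucket to search in a linear hashed file."""
    n_level = n0 * 2 ** level
    b = key % n_level              # h_Level(key)
    if b < next_bucket:            # this bucket was already split in this round
        b = key % (2 * n_level)    # so use h_{Level+1}(key) instead
    return b

for key in (18, 32, 44):
    print(key, bucket_for(key, level=0, next_bucket=1))
# 18 2
# 32 0
# 44 4
```

The single comparison against Next is what lets Linear Hashing do without a directory: the pair (Level, Next) is enough to tell which hash function applies.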


Example of Linear Hashing

Level = 0, N0 = 4 = 2^d0, d0 = 2, Next = 1

• Not every insertion triggers a split
• Insert 37* = 100101
  – h0(37) = 01, and bucket 01 has space

After inserting 37* (only bucket 01 changes, from 9* 25* 5* to 9* 25* 5* 37*; Next stays at 1):

h1    h0    Primary pages       Overflow pages
000   00    32*
001   01    9* 25* 5* 37*
010   10    14* 18* 10* 30*
011   11    31* 35* 7* 11*      43*
100   00    44* 36*


Example of Linear Hashing

Level = 0, N0 = 4 = 2^d0, d0 = 2

• Splitting at Next?
  – No overflow bucket needed
  – Just redistribute the entries between the original bucket and its split image
• Insert 29* = 11101
  – h0(29) = 01; bucket 01 holds 9* 25* 5* 37* and is full
  – Next (= 1) points to this same bucket, so the split itself makes room for 29*

After inserting 29* (Next = 2):

h1    h0    Primary pages       Overflow pages
000   00    32*
001   01    9* 25*
010   10    14* 18* 10* 30*
011   11    31* 35* 7* 11*      43*
100   00    44* 36*
101   01    5* 37* 29*


Example: End of a Round

Before (after inserting 22*, 66*, 34*; check yourself): Level = 0, N0 = 4 = 2^d0, d0 = 2, Next = 3

h1    h0    Primary pages       Overflow pages
000   00    32*
001   01    9* 25*
010   10    66* 18* 10* 34*
011   11    31* 35* 7* 11*      43*
100   00    44* 36*
101   01    5* 37* 29*
110   10    14* 30* 22*

Insert 50* = 110010

After: Level = 1, N1 = 8 = 2^d1, d1 = 3, Next = 0

h2     h1    h0    Primary pages       Overflow pages
0000   000   00    32*
0001   001   01    9* 25*
0010   010   10    66* 18* 10* 34*     50*
0011   011   11    43* 35* 11*
0100   100   00    44* 36*
0101   101   01    5* 37* 29*
0110   110   10    14* 30* 22*
0111   111   11    31* 7*
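The whole running example, up to the end of the round, can be replayed with a toy in-memory simulation. This is a sketch, not the [RG] implementation: it assumes h(key) = key, four entries per page (as the figures suggest), and that a split is triggered exactly when an insert creates a new overflow page. The class and method names are hypothetical.

```python
class LinearHashFile:
    """Toy linear hashing; each bucket list stands for a primary page plus its overflow chain."""
    CAPACITY = 4  # entries per page (assumed from the example figures)

    def __init__(self, n0=4):
        self.n0, self.level, self.next = n0, 0, 0
        self.buckets = [[] for _ in range(n0)]

    def _addr(self, key):
        n_level = self.n0 * 2 ** self.level
        b = key % n_level                 # h_Level
        if b < self.next:                 # bucket already split in this round
            b = key % (2 * n_level)       # h_{Level+1}
        return b

    def insert(self, key):
        bucket = self.buckets[self._addr(key)]
        # A new overflow page is needed when every existing page of the bucket is full.
        needs_overflow = len(bucket) > 0 and len(bucket) % self.CAPACITY == 0
        bucket.append(key)
        if needs_overflow:
            self._split()

    def _split(self):
        n_level = self.n0 * 2 ** self.level
        old = self.buckets[self.next]
        self.buckets[self.next] = [k for k in old if k % (2 * n_level) == self.next]
        self.buckets.append([k for k in old if k % (2 * n_level) != self.next])  # split image
        self.next += 1
        if self.next == n_level:          # every bucket of this round has been split
            self.level += 1
            self.next = 0

f = LinearHashFile()
for k in [32, 44, 36, 9, 25, 5, 14, 18, 10, 30, 31, 35, 7, 11]:  # initial file
    f.insert(k)
for k in [43, 37, 29, 22, 66, 34, 50]:                            # inserts from the example
    f.insert(k)

print(f.level, f.next)  # 1 0
for i, b in enumerate(f.buckets):
    print(format(i, "03b"), b)
```

After the last insert the file has eight buckets, Level = 1 and Next = 0, with 50* sharing bucket 010 with 66*, 18*, 10*, 34* (one entry more than a page holds, hence the overflow page in the figure).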


LH vs. EH

• They are very similar
  – Going from h_i to h_{i+1} is like doubling the directory
  – LH: avoids the explicit directory through a clever choice of which bucket to split
  – EH: always splits the bucket that overflowed, giving higher bucket occupancy

• Uniform distribution: LH has lower average cost
  – No directory level

• Skewed distribution
  – Many empty or nearly empty buckets in LH
  – EH may be better


Summary

• Hash-based indexes: best for equality searches, cannot support range searches.

• Static Hashing can lead to long overflow chains.
• Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it
  – Duplicates may still require overflow pages
  – Directory to keep track of buckets, doubles periodically
  – Can get large with skewed data; additional I/O if this does not fit in main memory


Summary

• Linear Hashing avoids a directory by splitting buckets round-robin, and using overflow pages
  – Overflow chains are not likely to be long
  – Duplicates handled easily

• For hash-based indexes, a skewed data distribution is one in which the hash values of data entries are not uniformly distributed; this is bad for performance
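The effect of skew can be illustrated by counting how many entries land in each bucket. The two data sets below are made up for illustration, and the helper name `bucket_counts` is hypothetical; the bucket function is h(v) mod 4, as at Level 0 of the earlier example.

```python
from collections import Counter

def bucket_counts(hash_values, n_buckets=4):
    """Count how many data entries land in each bucket under v mod n_buckets."""
    counts = Counter(v % n_buckets for v in hash_values)
    return [counts.get(b, 0) for b in range(n_buckets)]

uniform = list(range(16))            # hash values spread evenly
skewed = [4 * i for i in range(16)]  # every hash value is a multiple of 4

print(bucket_counts(uniform))  # [4, 4, 4, 4]
print(bucket_counts(skewed))   # [16, 0, 0, 0]
```

With the skewed values, one bucket receives every entry while the others stay empty, which is the situation that produces long overflow chains in Linear Hashing or a large directory in Extendible Hashing.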
