CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no...

30
CompSci 516 Data Intensive Computing Systems Lecture 7 Storage and Index Instructor: Sudeepa Roy 1 Duke CS, Fall 2017 CompSci 516: Database Systems

Transcript of CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no...

Page 1: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

CompSci 516DataIntensiveComputingSystems

Lecture7Storageand

Index

Instructor:Sudeepa Roy

1DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 2: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

Announcements• HW1deadlinethisweek:– Dueon09/21(Thurs),11:55pm,nolatedays

• Projectproposaldeadline:– Preliminaryideaandteammembersduebytonight09/18(Mon),11:55pm byemailtotheinstructor

– Proposaldueonsakai by09/25(Mon),11:55pm

• Everyoneshouldbeinagroupnow– otherwiselettheinstructorknowasap

DukeCS,Fall2017 CompSci516:DatabaseSystems 2

Page 3: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

ReadingMaterial

• [RG]– Storage:Chapters8.1,8.2,8.4,9.4-9.7– Index:8.3,8.5– Tree-basedindex:Chapter10.1-10.7– Hash-basedindex:Chapter11

Additionalreading• [GUW]

– Chapters8.3,14.1-14.4

DukeCS,Fall2017 CompSci516:DatabaseSystems 3

Acknowledgement:Thefollowingslideshavebeencreatedadaptingtheinstructormaterialofthe[RG]bookprovidedbytheauthorsDr.Ramakrishnan andDr.Gehrke.

Page 4: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

Storage(contd.fromLecture6)

DukeCS,Fall2017 CompSci516:DatabaseSystems 4

Page 5: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

Recap

• TypicalDBMShierarchy• Diskandmainmemory/bufferpool• Unit=pageorblock– pagereplacementstrategies– dirtybit– pin

DukeCS,Fall2017 CompSci516:DatabaseSystems 5

Page 6: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

Today

• Howarepagesstoredinafile?• Howarerecordsstoredinapage?– Fixedlengthrecords– Variablelengthrecords

• Howarefieldsstoredinarecord?– Fixedlengthfields/records– Variablelengthfields/records

DukeCS,Fall2017 CompSci516:DatabaseSystems 6

Page 7: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

FilesofRecords

• PageorblockisOKwhendoingI/O,buthigherlevelsofDBMSoperateonrecords,andfilesofrecords

• FILE:Acollectionofpages,eachcontainingacollectionofrecords

• Mustsupport:– insert/delete/modifyrecord– readaparticularrecord(specifiedusingrecordid)– scanallrecords(possiblywithsomeconditionsontherecordstoberetrieved)

DukeCS,Fall2017 CompSci516:DatabaseSystems 7

Page 8: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

FileOrganization

• Fileorganization:Methodofarrangingafileofrecordsonexternalstorage– Onefilecanhavemultiplepages– Recordid(rid)issufficienttophysicallylocatethepagecontainingtherecordondisk

– Indexes aredatastructuresthatallowustofindtherecordidsofrecordswithgivenvaluesinindexsearchkeyfields

• NOTE:Severalusesof“keys”inadatabase– Primary/foreign/candidate/superkeys– Indexsearchkeys

DukeCS,Fall2017 CompSci516:DatabaseSystems 8

Page 9: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

AlternativeFileOrganizationsManyalternativesexist,eachidealforsomesituations,and

notsogoodinothers:• Heap(randomorder)files: Suitablewhentypicalaccessisa

filescanretrievingallrecords• SortedFiles:Bestifrecordsmustberetrievedinsome

order,oronlya“range”ofrecordsisneeded.• Indexes:Datastructurestoorganizerecordsviatreesor

hashing– Likesortedfiles,theyspeedupsearchesforasubsetofrecords,

basedonvaluesincertain(“searchkey”)fields– Updatesaremuchfasterthaninsortedfiles

DukeCS,Fall2017 CompSci516:DatabaseSystems 9

Page 10: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

Unordered(Heap)Files

• Simplestfilestructurecontainsrecordsinnoparticularorder

• Asfilegrowsandshrinks,diskpagesareallocatedandde-allocated

• Tosupportrecordleveloperations,wemust:– keeptrackofthepages inafile– keeptrackoffreespaceonpages– keeptrackoftherecords onapage

• Therearemanyalternativesforkeepingtrackofthis

DukeCS,Fall2017 CompSci516:DatabaseSystems 10

Page 11: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

HeapFileImplementedasaList

• TheheaderpageidandHeapfilenamemustbestoredsomeplace

• Eachpagecontains2`pointers’plusdata• Problem?

– toinsertanewrecord,wemayneedtoscanseveralpagesonthefreelisttofindonewithsufficientspace

HeaderPage

DataPage

DataPage

DataPage

DataPage

DataPage

DataPage Pages with

Free Space

Full Pages

DukeCS,Fall2017 CompSci516:DatabaseSystems 11

Page 12: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

HeapFileUsingaPageDirectory

• Theentryforapagecanincludethenumberoffreebytesonthepage.

• Thedirectoryisacollectionofpages– linkedlistimplementationofdirectoryisjustonealternative– Muchsmallerthanlinkedlistofallheapfilepages!

DataPage 1

DataPage 2

DataPage N

HeaderPage

DIRECTORY

DukeCS,Fall2017 CompSci516:DatabaseSystems 12

Page 13: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

Howdowearrangeacollectionofrecordsonapage?

• Eachpagecontainsseveralslots– oneforeachrecord

• Recordisidentifiedby<page-id,slot-number>

• Fixed-LengthRecords• Variable-LengthRecords

• Forboth,thereareoptionsfor– Recordformats(howtoorganizethefieldswithinarecord)– Pageformats(howtoorganizetherecordswithinapage)

DukeCS,Fall2017 CompSci516:DatabaseSystems 13

Page 14: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

PageFormats:FixedLengthRecords

• Recordid=<pageid,slot#>• Packed:movingrecordsforfreespacemanagementchangesrid;maynotbe

acceptable• Unpacked:useabitmap– scanthebitarraytofindanemptyslot• Eachpagealsomaycontainadditionalinfoliketheidofthenextpage(notshown)

Slot 1Slot 2

Slot N

. . . . . .

N M10. . .M ... 3 2 1

PACKED UNPACKED, BITMAP

Slot 1Slot 2

Slot N

FreeSpace

Slot M11

number of records

numberof slots

DukeCS,Fall2017 CompSci516:DatabaseSystems 14

Page 15: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

PageFormats:VariableLengthRecords

• Needtofindapagewiththerightamountofspace– Toosmall– cannotinsert– Toolarge– wasteofspace

• ifarecordisdeleted,needtomovetherecordssothatallfreespaceiscontiguous– needabilitytomoverecordswithinapage

• Canmaintainadirectoryofslots(nextslide)– Slotcontains<record-offset,record-length>– deletion=setrecord-offsetto-1

• Record-idrid=<page,slot-in-directory>remainsunchanged

DukeCS,Fall2017 CompSci516:DatabaseSystems 15

Page 16: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

PageFormats:VariableLengthRecords

• Canmoverecordsonpagewithoutchangingrid– so,attractiveforfixed-lengthrecordstoo

• Store(record-offset,record-length)ineachslot• rid-sunaffectedbyrearrangingrecordsinapage

Page iRid = (i,N)

Rid = (i,2)

Rid = (i,1)

Pointerto startof freespace

SLOT DIRECTORY

N . . . 2 120 16 24 N

# slots

DukeCS,Fall2017 CompSci516:DatabaseSystems 16

Page 17: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

RecordFormats:FixedLength

• Eachfieldhasafixedlength– forallrecords– thenumberoffieldsisalsofixed– fieldscanbestoredconsecutively

• Informationaboutfieldtypessameforallrecordsinafile– storedinsystemcatalogs

• Findingi-th fielddoesnotrequirescanofrecord– giventheaddressoftherecord,addressofafieldcanbeobtained

easily

Base address (B)

L1 L2 L3 L4

F1 F2 F3 F4

Address = B+L1+L2

DukeCS,Fall2017 CompSci516:DatabaseSystems 17

Page 18: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

RecordFormats:VariableLength• Cannotusefixed-lengthslotsforrecords• Twoalternativeformats(#fieldsisfixed):

• Second offers direct access to i-th field, efficient storage of nulls (special don’t know value); small directory overhead

• Modification may be costly (may grow the field and not fit in the page)

4 $ $ $ $

FieldCount

Fields Delimited by Special Symbols

F1 F2 F3 F4

F1 F2 F3 F4

Array of Field Offsets

1.usedelimiters

2.useoffsetsatthestartofeachrecord

DukeCS,Fall2017 CompSci516:DatabaseSystems 18

Page 19: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

Indexes

DukeCS,Fall2017 CompSci516:DatabaseSystems 19

Page 20: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

Indexes

• Anindexonafilespeedsupselectionsonthesearchkeyfieldsfortheindex– Anysubsetofthefieldsofarelationcanbethesearchkeyforan

indexontherelation.– “Searchkey”isnotthesameas“key”

key=minimalsetoffieldsthatuniquelyidentifyatuple

• Anindexcontainsacollectionofdataentries,andsupportsefficientretrievalofalldataentries k*withagivenkeyvaluek

DukeCS,Fall2017 CompSci516:DatabaseSystems 20

Page 21: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

RememberTerminology

• Indexsearchkey(key):k– Usedtosearcharecord

• Dataentry:k*– Pointedtobyk– Containsrecordid(s)orrecorditself

• Recordsordata– Actualtuples– Pointedtobyrecordids

DukeCS,Fall2017 CompSci516:DatabaseSystems 21

INDEXdoesthis

Page 22: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

AlternativesforDataEntryk*inIndexk

• Inadataentryk*wecanstore:1. (Alternative1)Theactualdatarecordwithkeyvalue k,

or2. (Alternative2)<k,rid>

• rid=recordofdatarecordwithsearchkeyvalue k,or

3. (Alternative3)<k,rid-list>• listofrecordidsofdatarecordswithsearchkeyk>

• Choiceofalternativefordataentriesisorthogonaltotheindexingtechniqueusedtolocatedataentrieswithagivenkeyvaluek

DukeCS,Fall2017 CompSci516:DatabaseSystems 22

Page 23: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

AlternativesforDataEntries:Alternative1

• Indexstructureisafileorganizationfordatarecords– insteadofaHeapfileorsortedfile

• HowmanydifferentindexescanuseAlternative1?• AtmostoneindexcanuseAlternative1

– Otherwise,datarecordsareduplicated,leadingtoredundantstorageandpotentialinconsistency

• Ifdatarecordsareverylarge,#pageswithdataentriesishigh– Impliessizeofauxiliaryinformationintheindexisalsolarge

• Inadataentryk*wecanstore:1. Theactualdatarecordwithkeyvalue k2. <k,rid>

• rid=recordofdatarecordwithsearchkeyvalue k3. <k,rid-list>

• listofrecordidsofdatarecordswithsearchkeyk>

DukeCS,Fall2017 CompSci516:DatabaseSystems 23

Advantages/Disadvantages?

Page 24: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

AlternativesforDataEntries:Alternative2,3

• Dataentriestypicallymuchsmallerthandatarecords– So,betterthanAlternative1withlargedatarecords– Especiallyifsearchkeysaresmall.

• Alternative3morecompactthanAlternative2– butleadstovariable-sizedataentriesevenifsearchkeyshavefixedlength.

• Inadataentryk*wecanstore:1. Theactualdatarecordwithkeyvalue k2. <k,rid>

• rid=recordofdatarecordwithsearchkeyvalue k3. <k,rid-list>

• listofrecordidsofdatarecordswithsearchkeyk>

DukeCS,Fall2017 CompSci516:DatabaseSystems 24

Advantages/Disadvantages?

Page 25: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

IndexClassification

• Primaryvs.secondary• Clusteredvs.unclustered• Tree-basedvs.Hash-based

DukeCS,Fall2017 CompSci516:DatabaseSystems 25

Page 26: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

Primaryvs.SecondaryIndex

• Ifsearchkeycontainsprimarykey,thencalledprimaryindex,otherwisesecondary– Unique index:Searchkeycontainsacandidatekey

• Duplicatedataentries:– iftheyhavethesamevalueofsearchkeyfieldk– Primary/uniqueindexneverhasaduplicate– Othersecondaryindexcanhaveduplicates

DukeCS,Fall2017 CompSci516:DatabaseSystems 26

Page 27: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

Clusteredvs.Unclustered Index

• Iforderofdatarecordsinafileisthesameas,or`closeto’,orderofdataentriesinanindex,thenclustered,otherwiseunclustered– Alternative1impliesclustered– Alternative2,3aretypicallyunclustered

• unlesssortedaccordingtothesearchkey

– Sometimes,clusteredalsoimpliesAlternative1• sincesortedfilesarerare

– Afilecanbeclusteredonatmostonesearchkey– Costofretrievingdatarecords(rangequeries)throughindexvaries

greatlybasedonwhetherindexisclusteredornot

DukeCS,Fall2017 CompSci516:DatabaseSystems 27

Page 28: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

• SupposethatAlternative(2)isusedfordataentries,andthatthedatarecordsarestoredinaHeapfile

• Tobuildclusteredindex,firstsorttheHeapfile– withsomefreespaceoneachpageforfutureinserts– Overflowpagesmaybeneededforinserts– Thus,datarecordsare`closeto’,butnotidenticalto,sorted

Index entries

Data entries

direct search for

(Index File)(Data file)

Data Records

data entries

Data entries

Data Records

CLUSTERED UNCLUSTERED

DukeCS,Fall2017 CompSci516:DatabaseSystems 28

Clusteredvs.Unclustered Index

Page 29: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

Methodsforindexing

• Tree-based• Hash-based

• (indetaillater)

DukeCS,Fall2017 CompSci516:DatabaseSystems 29

Page 30: CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no late days •Project proposal deadline: –Preliminary idea and team members due

SystemCatalogs

• Foreachindex:– structure(e.g.,B+tree)andsearchkeyfields

• Foreachrelation:– name,filename,filestructure(e.g.,Heapfile)– attributenameandtype,foreachattribute– indexname,foreachindex– integrityconstraints

• Foreachview:– viewnameanddefinition

• Plusstatistics,authorization,bufferpoolsize,etc.• (describedin[RG]12.1)

Catalogs are themselves stored as relations!DukeCS,Fall2017 CompSci516:DatabaseSystems 30