CompSci516 Data Intensive Computing Systems Lecture 7 ... · –Due on 09/21 (Thurs), 11:55 pm, no...

Post on 21-Jun-2020


CompSci 516 Data Intensive Computing Systems

Lecture 7: Storage and Index

Instructor: Sudeepa Roy

Duke CS, Fall 2017, CompSci 516: Database Systems

Announcements

• HW1 deadline this week:
– Due on 09/21 (Thurs), 11:55 pm, no late days
• Project proposal deadline:
– Preliminary idea and team members due by tonight 09/18 (Mon), 11:55 pm, by email to the instructor
– Proposal due on Sakai by 09/25 (Mon), 11:55 pm
• Everyone should be in a group now
– otherwise let the instructor know ASAP


Reading Material

• [RG]
– Storage: Chapters 8.1, 8.2, 8.4, 9.4-9.7
– Index: 8.3, 8.5
– Tree-based index: Chapter 10.1-10.7
– Hash-based index: Chapter 11

Additional reading
• [GUW]
– Chapters 8.3, 14.1-14.4


Acknowledgement: The following slides have been created adapting the instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke.

Storage (contd. from Lecture 6)


Recap

• Typical DBMS hierarchy
• Disk and main memory/buffer pool
• Unit = page or block
– page replacement strategies
– dirty bit
– pin


Today

• How are pages stored in a file?
• How are records stored in a page?
– Fixed length records
– Variable length records
• How are fields stored in a record?
– Fixed length fields/records
– Variable length fields/records


Files of Records

• Page or block is OK when doing I/O, but higher levels of DBMS operate on records, and files of records
• FILE: A collection of pages, each containing a collection of records
• Must support:
– insert/delete/modify record
– read a particular record (specified using record id)
– scan all records (possibly with some conditions on the records to be retrieved)


File Organization

• File organization: Method of arranging a file of records on external storage
– One file can have multiple pages
– Record id (rid) is sufficient to physically locate the page containing the record on disk
– Indexes are data structures that allow us to find the record ids of records with given values in index search key fields
• NOTE: Several uses of “keys” in a database
– Primary/foreign/candidate/super keys
– Index search keys


Alternative File Organizations

Many alternatives exist, each ideal for some situations, and not so good in others:
• Heap (random order) files: Suitable when typical access is a file scan retrieving all records
• Sorted Files: Best if records must be retrieved in some order, or only a “range” of records is needed.
• Indexes: Data structures to organize records via trees or hashing
– Like sorted files, they speed up searches for a subset of records, based on values in certain (“search key”) fields
– Updates are much faster than in sorted files


Unordered (Heap) Files

• Simplest file structure contains records in no particular order
• As file grows and shrinks, disk pages are allocated and de-allocated
• To support record level operations, we must:
– keep track of the pages in a file
– keep track of free space on pages
– keep track of the records on a page
• There are many alternatives for keeping track of this


Heap File Implemented as a List

• The header page id and Heap file name must be stored someplace
• Each page contains 2 `pointers’ plus data
• Problem?
– to insert a new record, we may need to scan several pages on the free list to find one with sufficient space

[Figure: a header page pointing to two lists of data pages: full pages, and pages with free space]


Heap File Using a Page Directory

• The entry for a page can include the number of free bytes on the page.
• The directory is a collection of pages
– linked list implementation of directory is just one alternative
– Much smaller than linked list of all heap file pages!

[Figure: a DIRECTORY, starting at the header page, whose entries point to Data Page 1, Data Page 2, ..., Data Page N]
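As a rough illustration of why the directory helps with inserts, here is a minimal in-memory sketch. The class and method names (`PageDirectory`, `find_page_for`, `record_insert`) are hypothetical and not from the slides; a real directory would itself live on disk pages.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    page_id: int
    free_bytes: int  # number of free bytes on that data page


@dataclass
class PageDirectory:
    """Hypothetical page directory: one small entry per data page."""
    entries: list = field(default_factory=list)

    def add_page(self, page_id, page_size):
        # A freshly allocated page is entirely free.
        self.entries.append(DirectoryEntry(page_id, page_size))

    def find_page_for(self, record_size):
        """Return the id of the first page with enough free space, or None.

        Only directory entries are scanned; no data page is read.
        """
        for entry in self.entries:
            if entry.free_bytes >= record_size:
                return entry.page_id
        return None

    def record_insert(self, page_id, record_size):
        # Keep the directory in sync after a record is placed on a page.
        for entry in self.entries:
            if entry.page_id == page_id:
                if entry.free_bytes < record_size:
                    raise ValueError("not enough free space on page")
                entry.free_bytes -= record_size
                return
        raise KeyError(page_id)


directory = PageDirectory()
directory.add_page(1, page_size=100)
directory.add_page(2, page_size=100)
directory.record_insert(1, 80)          # page 1 now has 20 free bytes
target = directory.find_page_for(50)    # page 1 is too full, so page 2
```

The point of the sketch is that the search for space touches only the compact directory, rather than walking a list of data pages as in the linked-list design.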


How do we arrange a collection of records on a page?

• Each page contains several slots
– one for each record
• Record is identified by <page-id, slot-number>
• Fixed-Length Records
• Variable-Length Records
• For both, there are options for
– Record formats (how to organize the fields within a record)
– Page formats (how to organize the records within a page)


Page Formats: Fixed Length Records

• Record id = <page id, slot #>
• Packed: moving records for free space management changes rid; may not be acceptable
• Unpacked: use a bitmap
– scan the bit array to find an empty slot
• Each page also may contain additional info like the id of the next page (not shown)

[Figure: two page layouts. PACKED: slots 1 to N filled contiguously, followed by free space, with the number of records (N) stored on the page. UNPACKED, BITMAP: slots 1 to M, with a bitmap of M bits marking which slots are occupied and the number of slots (M) stored on the page]
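The unpacked (bitmap) layout can be sketched in a few lines. This is only an in-memory illustration under assumed names (`BitmapPage`, `insert`, `delete`); slots are numbered from 0 here, whereas the slide numbers them from 1.

```python
class BitmapPage:
    """Hypothetical unpacked page for fixed-length records."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots      # record storage, one per slot
        self.in_use = [False] * num_slots    # the bitmap: True = occupied

    def insert(self, record):
        # Scan the bit array to find an empty slot.
        for slot_no, used in enumerate(self.in_use):
            if not used:
                self.slots[slot_no] = record
                self.in_use[slot_no] = True
                return slot_no
        raise RuntimeError("page full")

    def delete(self, slot_no):
        # Clearing the bit frees the slot; no other record moves,
        # so every other rid stays valid.
        self.in_use[slot_no] = False
        self.slots[slot_no] = None


page = BitmapPage(num_slots=4)
a = page.insert(b"rec-A")
b = page.insert(b"rec-B")
c = page.insert(b"rec-C")
page.delete(b)
d = page.insert(b"rec-D")   # reuses the slot freed by rec-B
```

Because deletion never moves a record, the slot number of `rec-C` is unchanged, which is exactly what the packed layout cannot guarantee.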


Page Formats: Variable Length Records

• Need to find a page with the right amount of space
– Too small: cannot insert
– Too large: waste of space
• If a record is deleted, need to move the records so that all free space is contiguous
– need ability to move records within a page
• Can maintain a directory of slots (next slide)
– Slot contains <record-offset, record-length>
– deletion = set record-offset to -1
• Record-id rid = <page, slot-in-directory> remains unchanged


Page Formats: Variable Length Records

• Can move records on page without changing rid
– so, attractive for fixed-length records too
• Store (record-offset, record-length) in each slot
• rids are unaffected by rearranging records in a page

[Figure: page i holding records with Rid = (i,N), ..., Rid = (i,2), Rid = (i,1). The SLOT DIRECTORY holds one entry per slot (example values 20, 16, 24), the number of slots (N), and a pointer to the start of free space]
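The slot directory idea can be sketched as follows. This is a simplified illustration, not the layout from the figure: the class name `SlottedPage` and its methods are hypothetical, the slot directory is kept as a Python list rather than inside the page bytes, and the space it would occupy is ignored.

```python
class SlottedPage:
    """Hypothetical slotted page for variable-length records."""

    def __init__(self, size=64):
        self.data = bytearray(size)
        self.free_start = 0     # pointer to start of free space
        self.slots = []         # slot directory: [record-offset, record-length]

    def insert(self, record: bytes) -> int:
        if self.free_start + len(record) > len(self.data):
            raise RuntimeError("not enough contiguous free space")
        offset = self.free_start
        self.data[offset:offset + len(record)] = record
        self.free_start += len(record)
        self.slots.append([offset, len(record)])
        return len(self.slots) - 1          # slot number, part of the rid

    def read(self, slot_no):
        offset, length = self.slots[slot_no]
        if offset == -1:                    # deleted record
            return None
        return bytes(self.data[offset:offset + length])

    def delete(self, slot_no):
        self.slots[slot_no][0] = -1         # deletion = set record-offset to -1

    def compact(self):
        """Move live records together so free space is contiguous.

        Only offsets in the slot directory change; slot numbers do not.
        """
        packed = bytearray(len(self.data))
        cursor = 0
        for slot in self.slots:
            offset, length = slot
            if offset == -1:
                continue
            packed[cursor:cursor + length] = self.data[offset:offset + length]
            slot[0] = cursor
            cursor += length
        self.data = packed
        self.free_start = cursor


page = SlottedPage()
s0 = page.insert(b"alice")
s1 = page.insert(b"bob")
s2 = page.insert(b"carol")
page.delete(s1)
page.compact()          # "carol" moves, but its slot number s2 is unchanged
```

After `compact()`, the record in slot `s2` sits at a different byte offset, yet a lookup by the same slot number still returns it, which is the property the slide describes.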


Record Formats: Fixed Length

• Each field has a fixed length
– for all records
– the number of fields is also fixed
– fields can be stored consecutively
• Information about field types same for all records in a file
– stored in system catalogs
• Finding i-th field does not require scan of record
– given the address of the record, address of a field can be obtained easily

[Figure: a record with fields F1, F2, F3, F4 of lengths L1, L2, L3, L4, starting at base address B. Address of F3 = B + L1 + L2]
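The address computation from the figure can be tried out with Python's standard `struct` module. The field lengths below (4, 8, 2, and 10 bytes) and the helper `field_address` are made-up values and names for illustration only.

```python
import struct
from itertools import accumulate

# Hypothetical field lengths L1..L4 in bytes; in a DBMS these would
# come from the system catalog, not from the record itself.
FIELD_LENGTHS = [4, 8, 2, 10]

# Offset of each field from the start of the record: 0, L1, L1+L2, ...
FIELD_OFFSETS = [0] + list(accumulate(FIELD_LENGTHS))[:-1]


def field_address(base, i):
    """Address of the i-th field (1-based): B + L1 + ... + L(i-1)."""
    return base + FIELD_OFFSETS[i - 1]


# "<" means little-endian with no padding, so the packed sizes are
# exactly 4 (i), 8 (q), 2 (h) and 10 (10s) bytes.
record = struct.pack("<iqh10s", 7, 123456789, 3, b"hello")

# Read the third field directly, without scanning fields 1 and 2.
third = struct.unpack_from("<h", record, field_address(0, 3))[0]
```

Here `field_address(0, 3)` is `0 + 4 + 8 = 12`, matching Address = B + L1 + L2 with B = 0.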


Record Formats: Variable Length

• Cannot use fixed-length slots for records
• Two alternative formats (# fields is fixed):
1. use delimiters
2. use offsets at the start of each record
• Second offers direct access to i-th field, efficient storage of nulls (special don’t know value); small directory overhead
• Modification may be costly (may grow the field and not fit in the page)

[Figure: format 1 stores a field count (4) followed by F1 $ F2 $ F3 $ F4 $, fields delimited by special symbols. Format 2 stores an array of field offsets followed by F1 F2 F3 F4]
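A minimal sketch of the second format (offsets at the start of the record) is given below. The exact encoding is an assumption made for illustration: 2-byte little-endian offsets, one more offset than there are fields, and a null represented as a zero-length field (so an empty string cannot be told apart from a null in this toy version). The function names are hypothetical.

```python
import struct


def encode_record(fields):
    """Encode a list of bytes-or-None values.

    Layout: (n + 1) two-byte offsets, then the field bytes.
    Offset i is where field i starts; offset n marks the record end.
    """
    n = len(fields)
    position = 2 * (n + 1)          # field data starts after the header
    offsets = []
    for value in fields:
        offsets.append(position)
        position += len(value) if value is not None else 0
    offsets.append(position)
    header = struct.pack(f"<{n + 1}H", *offsets)
    body = b"".join(v for v in fields if v is not None)
    return header + body


def get_field(record, i):
    """Return field i (0-based) by reading only two header offsets."""
    start, end = struct.unpack_from("<2H", record, 2 * i)
    if start == end:
        return None                 # zero length is treated as null here
    return record[start:end]


rec = encode_record([b"Ann", None, b"Durham", b"27708"])
city = get_field(rec, 2)            # direct access, no scan for delimiters
```

The null in position 1 costs no data bytes, only its slot in the offset array, which is the "efficient storage of nulls" and "small directory overhead" trade-off named on the slide.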


Indexes


Indexes

• An index on a file speeds up selections on the search key fields for the index
– Any subset of the fields of a relation can be the search key for an index on the relation.
– “Search key” is not the same as “key”
  key = minimal set of fields that uniquely identify a tuple
• An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k


Remember Terminology

• Index search key (key): k
– Used to search a record
• Data entry: k*
– Pointed to by k
– Contains record id(s) or record itself
• Records or data
– Actual tuples
– Pointed to by record ids

(Slide callout: the INDEX does this, i.e., it locates the data entry k* for a given k)

Alternatives for Data Entry k* in Index k

• In a data entry k* we can store:
1. (Alternative 1) The actual data record with key value k, or
2. (Alternative 2) <k, rid>
   • rid = record id of data record with search key value k, or
3. (Alternative 3) <k, rid-list>
   • list of record ids of data records with search key k
• Choice of alternative for data entries is orthogonal to the indexing technique used to locate data entries with a given key value k
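To make the three alternatives concrete, the sketch below builds each kind of data entry for a tiny made-up data file, with the search key being `age`. The records, rids, and variable names are all hypothetical, and plain sorted lists stand in for whatever tree or hash structure would actually locate the entries.

```python
from collections import defaultdict

# Hypothetical data file: rid -> record, where rid = (page_id, slot_no).
records = {
    (1, 0): {"name": "Ann", "age": 22},
    (1, 1): {"name": "Bob", "age": 25},
    (2, 0): {"name": "Cat", "age": 22},
}

# Alternative 1: the data entry is the actual data record.
alt1 = sorted(records.values(), key=lambda r: r["age"])

# Alternative 2: one <k, rid> pair per data record.
alt2 = sorted((rec["age"], rid) for rid, rec in records.items())

# Alternative 3: one <k, rid-list> entry per distinct key value.
grouped = defaultdict(list)
for k, rid in alt2:
    grouped[k].append(rid)
alt3 = sorted(grouped.items())
```

With duplicates on `age`, Alternative 2 repeats the key 22 twice, while Alternative 3 stores it once with a two-element rid list, at the price of variable-size entries.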


Alternatives for Data Entries: Alternative 1

• Index structure is a file organization for data records
– instead of a Heap file or sorted file
• How many different indexes can use Alternative 1?
• At most one index can use Alternative 1
– Otherwise, data records are duplicated, leading to redundant storage and potential inconsistency
• If data records are very large, # pages with data entries is high
– Implies size of auxiliary information in the index is also large


Advantages/Disadvantages?

Alternatives for Data Entries: Alternative 2, 3

• Data entries typically much smaller than data records
– So, better than Alternative 1 with large data records
– Especially if search keys are small.
• Alternative 3 more compact than Alternative 2
– but leads to variable-size data entries even if search keys have fixed length.


Advantages/Disadvantages?

Index Classification

• Primary vs. secondary
• Clustered vs. unclustered
• Tree-based vs. Hash-based


Primary vs. Secondary Index

• If search key contains primary key, then called primary index, otherwise secondary
– Unique index: Search key contains a candidate key
• Duplicate data entries:
– if they have the same value of search key field k
– Primary/unique index never has a duplicate
– Other secondary index can have duplicates


Clustered vs. Unclustered Index

• If order of data records in a file is the same as, or `close to’, order of data entries in an index, then clustered, otherwise unclustered
– Alternative 1 implies clustered
– Alternative 2, 3 are typically unclustered
  • unless sorted according to the search key
– Sometimes, clustered also implies Alternative 1
  • since sorted files are rare
– A file can be clustered on at most one search key
– Cost of retrieving data records (range queries) through index varies greatly based on whether index is clustered or not


Clustered vs. Unclustered Index

• Suppose that Alternative (2) is used for data entries, and that the data records are stored in a Heap file
• To build clustered index, first sort the Heap file
– with some free space on each page for future inserts
– Overflow pages may be needed for inserts
– Thus, data records are `close to’, but not identical to, sorted

[Figure: CLUSTERED vs. UNCLUSTERED. In both, index entries direct the search for data entries (index file), and data entries point to data records (data file)]
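The cost gap for range queries can be illustrated with a toy model. Everything here is an assumption for the sake of the example: 100 records with keys 0 to 99, 10 records per page, a buffer that holds only the most recently read data page, and an arbitrary scattering rule `(k * 37) % 100` for the unclustered layout. Only data page reads are counted, not the index itself.

```python
RECORDS_PER_PAGE = 10
keys = list(range(100))


def page_fetches(rids_in_key_order):
    """Count data page reads when following rids in key order,
    assuming only the most recently read page stays buffered."""
    fetches, last_page = 0, None
    for page_id, _slot in rids_in_key_order:
        if page_id != last_page:
            fetches += 1
            last_page = page_id
    return fetches


# Clustered: the data file is sorted on the key, so the record with
# key k lives at position k, i.e. on page k // 10.
clustered = [divmod(k, RECORDS_PER_PAGE) for k in keys]

# Unclustered: records are scattered; position (k * 37) % 100 is an
# arbitrary permutation chosen only to make the example deterministic.
unclustered = [divmod((k * 37) % 100, RECORDS_PER_PAGE) for k in keys]

lo, hi = 20, 49                     # range query: 20 <= key <= 49
clustered_cost = page_fetches(clustered[lo:hi + 1])      # 3 pages
unclustered_cost = page_fetches(unclustered[lo:hi + 1])  # 30 pages
```

In this model the clustered index reads 3 data pages for 30 matching records, while the unclustered one reads a page per record; real systems would see a smaller gap when more pages fit in the buffer pool.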

Methods for indexing

• Tree-based
• Hash-based
• (in detail later)


System Catalogs

• For each index:
– structure (e.g., B+ tree) and search key fields
• For each relation:
– name, file name, file structure (e.g., Heap file)
– attribute name and type, for each attribute
– index name, for each index
– integrity constraints
• For each view:
– view name and definition
• Plus statistics, authorization, buffer pool size, etc.
• (described in [RG] 12.1)

Catalogs are themselves stored as relations!
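One way to see a catalog stored as a relation is to query it with ordinary SQL. The sketch below uses SQLite through Python's standard `sqlite3` module purely as an example system; SQLite is not discussed in the lecture, and the table and index names are made up. SQLite exposes its schema catalog as a table named `sqlite_master`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (sid INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
)
conn.execute("CREATE INDEX student_age_idx ON student (age)")

# The catalog is itself a relation, so a plain SELECT works on it.
catalog_rows = conn.execute(
    "SELECT type, name, tbl_name FROM sqlite_master ORDER BY name"
).fetchall()

for row in catalog_rows:
    print(row)
```

Each row describes one schema object (its type, its name, and the table it belongs to), which corresponds to the per-relation and per-index catalog information listed above.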