Indexing and Hashing

Overview
• Basic Concepts
• Ordered Indices
• B+-Tree Index Files
• Static Hashing
• Dynamic Hashing
• Comparison of Ordered Indexing and Hashing
• Index Definition in SQL
• Multiple-Key Access

Basic Concepts
• Indexing mechanisms are used to speed up access to desired data.
  – E.g., author catalog in library
• Search key: an attribute or set of attributes used to look up records in a file.
• An index file consists of records (called index entries) of the form (search-key, pointer).
• Index files are typically much smaller than the original file.
• Two basic kinds of indices:
  – Ordered indices: search keys are stored in sorted order
  – Hash indices: search keys are distributed uniformly across “buckets” using a “hash function”.

Index Evaluation Metrics
• Access types supported efficiently. E.g.,
  – records with a specified value in the attribute
  – or records with an attribute value falling in a specified range of values.
• Access time
• Insertion time
• Deletion time
• Space overhead

Ordered Indices
• In an ordered index, index entries are stored sorted on the search key value. E.g., author catalog in library.
• Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.
  – Also called clustering index
  – The search key of a primary index is usually but not necessarily the primary key.
• Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called non-clustering index.
• Index-sequential file: ordered sequential file with a primary index.

Dense Index Files
• Dense index: an index record appears for every search-key value in the file.

Sparse Index Files
• Sparse index: contains index records for only some search-key values.
  – Applicable when records are sequentially ordered on search-key
• To locate a record with search-key value K we:
  – Find index record with largest search-key value ≤ K
  – Search file sequentially starting at the record to which the index record points
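
The two lookup steps can be sketched in Python. This is a minimal illustration, not a definitive implementation: it assumes the index holds one entry per block (the least key in that block), and the names sparse_lookup, index and blocks are hypothetical.

```python
from bisect import bisect_right

def sparse_lookup(index, blocks, key):
    """Return the record with search key `key`, or None.

    index  -- sorted list of (least_key_in_block, block_number) entries (assumed layout)
    blocks -- list of blocks; each block is a sorted list of (key, record) pairs
    """
    keys = [k for k, _ in index]
    # Step 1: index entry with the largest search-key value <= key.
    pos = bisect_right(keys, key) - 1
    if pos < 0:
        return None  # key is smaller than every key in the file
    # Step 2: scan the file sequentially from the block the entry points to.
    for block_no in range(index[pos][1], len(blocks)):
        for k, record in blocks[block_no]:
            if k == key:
                return record
            if k > key:
                return None  # passed the position where key would be stored
    return None

blocks = [[(10, "a"), (20, "b")], [(30, "c"), (40, "d")], [(50, "e")]]
index = [(10, 0), (30, 1), (50, 2)]
```

With this data, sparse_lookup(index, blocks, 40) probes the entry for key 30 and then scans block 1.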

Sparse Index Files (Cont.)
• Compared to dense indices:
  – Less space and less maintenance overhead for insertions and deletions.
  – Generally slower than dense index for locating records.
• Good tradeoff: sparse index with an index entry for every block in file, corresponding to least search-key value in the block.

Multilevel Index
• If primary index does not fit in memory, access becomes expensive.
• Solution: treat primary index kept on disk as a sequential file and construct a sparse index on it.
  – outer index: a sparse index of primary index
  – inner index: the primary index file
• If even outer index is too large to fit in main memory, yet another level of index can be created, and so on.
• Indices at all levels must be updated on insertion or deletion from the file.

Index Update: Deletion
• If deleted record was the only record in the file with its particular search-key value, the search-key is deleted from the index also.
• Single-level index deletion:
  – Dense indices: deletion of search-key is similar to file record deletion.
  – Sparse indices:
    • If an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next search-key value in the file (in search-key order).
    • If the next search-key value already has an index entry, the entry is deleted instead of being replaced.

Index Update: Insertion
• Single-level index insertion:
  – Perform a lookup using the search-key value appearing in the record to be inserted.
  – Dense indices: if the search-key value does not appear in the index, insert it.
  – Sparse indices: if index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created.
    • If a new block is created, the first search-key value appearing in the new block is inserted into the index.
• Multilevel insertion (as well as deletion) algorithms are simple extensions of the single-level algorithms

Secondary Indices
• Frequently, one wants to find all the records whose values in a certain field (which is not the search-key of the primary index) satisfy some condition.
  – Example 1: In the account relation stored sequentially by account number, we may want to find all accounts in a particular branch
  – Example 2: as above, but where we want to find all accounts with a specified balance or range of balances
• We can have a secondary index with an index record for each search-key value

Secondary Indices Example
• Index record points to a bucket that contains pointers to all the actual records with that particular search-key value.
• Secondary indices have to be dense

Primary and Secondary Indices
• Indices offer substantial benefits when searching for records.
• BUT: Updating indices imposes overhead on database modification: when a file is modified, every index on the file must be updated.
• Sequential scan using primary index is efficient, but a sequential scan using a secondary index is expensive
  – Each record access may fetch a new block from disk
  – Block fetch requires about 5 to 10 milliseconds
    • versus about 100 nanoseconds for memory access

B+-Tree Index Files
• Disadvantage of indexed-sequential files
  – performance degrades as file grows, since many overflow blocks get created.
  – Periodic reorganization of entire file is required.
• Advantage of B+-tree index files:
  – automatically reorganizes itself with small, local, changes, in the face of insertions and deletions.
  – Reorganization of entire file is not required to maintain performance.
• (Minor) disadvantage of B+-trees:
  – extra insertion and deletion overhead, space overhead.
• Advantages of B+-trees outweigh disadvantages
  – B+-trees are used extensively

B+-tree indices are an alternative to indexed-sequential files.

B+-Tree Index Files (Cont.)
• A B+-tree is a rooted tree satisfying the following properties:
  – All paths from root to leaf are of the same length
  – Each node that is not a root or a leaf has between ⌈n/2⌉ and n children.
  – A leaf node has between ⌈(n–1)/2⌉ and n–1 values
  – Special cases:
    • If the root is not a leaf, it has at least 2 children.
    • If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1) values.
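
As a quick check of these occupancy rules, the bounds can be computed for any n. The sketch below is illustrative only; the function name bplus_bounds and the dictionary keys are hypothetical.

```python
from math import ceil

def bplus_bounds(n):
    """Occupancy limits (min, max) for a B+-tree whose nodes hold up to n pointers."""
    return {
        "internal_children": (ceil(n / 2), n),        # non-root, non-leaf nodes
        "leaf_values": (ceil((n - 1) / 2), n - 1),    # leaf nodes
        "root_children_if_not_leaf": (2, n),          # special case: root with children
        "root_values_if_leaf": (0, n - 1),            # special case: root is the only node
    }
```

For n = 5 this gives 2 to 4 values per leaf and 3 to 5 children per internal node.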

B+-Tree Node Structure
• Typical node: P1 | K1 | P2 | … | Pn–1 | Kn–1 | Pn
  – Ki are the search-key values
  – Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
• The search-keys in a node are ordered
  – K1 < K2 < K3 < ... < Kn–1

Leaf Nodes in B+-Trees
• For i = 1, 2, ..., n–1, pointer Pi either points to a file record with search-key value Ki, or to a bucket of pointers to file records, each record having search-key value Ki. Only need bucket structure if search-key does not form a primary key.
• If Li, Lj are leaf nodes and i < j, Li’s search-key values are less than Lj’s search-key values
• Pn points to next leaf node in search-key order

Non-Leaf Nodes in B+-Trees
• Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m pointers:
  – All the search-keys in the subtree to which P1 points are less than K1
  – For 2 ≤ i ≤ m – 1, all the search-keys in the subtree to which Pi points have values greater than or equal to Ki–1 and less than Ki
  – All the search-keys in the subtree to which Pm points have values greater than or equal to Km–1

Example of a B+-tree

B+-tree for account file (n = 3)

Example of B+-tree
• Leaf nodes must have between 2 and 4 values (⌈(n–1)/2⌉ and n–1, with n = 5).
• Non-leaf nodes other than root must have between 3 and 5 children (⌈n/2⌉ and n with n = 5).
• Root must have at least 2 children.

B+-tree for account file (n = 5)

Observations about B+-trees
• Since the inter-node connections are done by pointers, “logically” close blocks need not be “physically” close.
• The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
• The B+-tree contains a relatively small number of levels
  – Level below root has at least 2*⌈n/2⌉ values
  – Next level has at least 2*⌈n/2⌉*⌈n/2⌉ values
  – If there are K search-key values in the file, the tree height is no more than ⌈log_{⌈n/2⌉}(K)⌉
  – thus searches can be conducted efficiently.
• Insertions and deletions to the main file can be handled efficiently, as the index can be restructured in logarithmic time (as we shall see).

Queries on B+-Trees
• Find all records with a search-key value of k.
  – N = root
  – Repeat
    • Examine N for the smallest search-key value > k.
    • If such a value exists, assume it is Ki. Then set N = Pi
    • Otherwise k ≥ Kn–1. Set N = Pn
  – Until N is a leaf node
  – If for some i, key Ki = k follow pointer Pi to the desired record or bucket.
  – Else no record with search-key value k exists.
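
The find procedure can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the Node class and its fields are hypothetical, leaf pointers are taken to be the records themselves, and the next-leaf pointer is omitted.

```python
from bisect import bisect_right

class Node:
    def __init__(self, keys, pointers, leaf):
        self.keys = keys          # sorted search-key values K1 < K2 < ...
        self.pointers = pointers  # children (non-leaf) or records (leaf)
        self.leaf = leaf

def find(root, k):
    """Return the record pointer stored under key k, or None if k is absent."""
    node = root
    while not node.leaf:
        # Index of the smallest key > k; if there is none this equals
        # len(keys), which selects the last pointer of the node.
        i = bisect_right(node.keys, k)
        node = node.pointers[i]
    for key, ptr in zip(node.keys, node.pointers):
        if key == k:
            return ptr
    return None

leaf1 = Node(["Brighton", "Downtown"], ["rec-B", "rec-D"], leaf=True)
leaf2 = Node(["Mianus", "Perryridge"], ["rec-M", "rec-P"], leaf=True)
leaf3 = Node(["Redwood", "Round Hill"], ["rec-R", "rec-RH"], leaf=True)
root = Node(["Mianus", "Redwood"], [leaf1, leaf2, leaf3], leaf=False)
```

A key equal to a separator (for example "Mianus") is routed to the right of that separator, matching the rule that the subtree under Pi holds keys greater than or equal to Ki–1.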

Queries on B+-Trees (Cont.)
• If there are K search-key values in the file, the height of the tree is no more than ⌈log_{⌈n/2⌉}(K)⌉.
• A node is generally the same size as a disk block, typically 4 kilobytes
  – and n is typically around 100 (40 bytes per index entry).
• With 1 million search key values and n = 100
  – at most ⌈log_50(1,000,000)⌉ = 4 nodes are accessed in a lookup.
• Contrast this with a balanced binary tree with 1 million search key values: around 20 nodes are accessed in a lookup
  – above difference is significant since every node access may need a disk I/O, costing around 20 milliseconds
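
The arithmetic behind the 4-versus-20 comparison can be reproduced with integer arithmetic, which avoids floating-point rounding in the logarithm. The function name max_lookup_nodes is hypothetical.

```python
from math import ceil

def max_lookup_nodes(num_keys, base):
    """Smallest h with base**h >= num_keys, i.e. ceil(log_base(num_keys))."""
    h, reach = 0, 1
    while reach < num_keys:
        reach *= base
        h += 1
    return h

n = 100
bplus_nodes = max_lookup_nodes(1_000_000, ceil(n / 2))  # base ceil(n/2) = 50
binary_nodes = max_lookup_nodes(1_000_000, 2)           # balanced binary tree
```

Here 50^3 = 125,000 falls short of one million while 50^4 exceeds it, and 2^20 is the first power of two above one million.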

Updates on B+-Trees: Insertion
• Find the leaf node in which the search-key value would appear
• If there is room in the leaf node, insert (key-value, pointer) pair in the leaf node
• Otherwise, split the node (along with the new (key-value, pointer) entry) as discussed in the next slide.

Updates on B+-Trees: Insertion (Cont.)
• Splitting a leaf node:
  – take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place the first ⌈n/2⌉ in the original node, and the rest in a new node.
  – let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split.
  – If the parent is full, split it and propagate the split further up.
• Splitting of nodes proceeds upwards till a node that is not full is found.
  – In the worst case the root node may be split increasing the height of the tree by 1.
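
The leaf-split step alone can be sketched as below. It is a simplified illustration: entries are plain (key, pointer) tuples, propagation into the parent is left to the caller, and the name split_leaf is hypothetical.

```python
from bisect import insort
from math import ceil

def split_leaf(entries, new_entry, n):
    """Split a full leaf (n - 1 entries) that must absorb one more entry.

    Returns (left, right, k): `left` stays in the original node, `right`
    goes into the new node p, and k is the least key in p, which the
    caller inserts into the parent together with a pointer to p.
    """
    assert len(entries) == n - 1, "leaf must be full before a split"
    pairs = list(entries)
    insort(pairs, new_entry)      # the n pairs in sorted order
    cut = ceil(n / 2)             # first ceil(n/2) pairs stay put
    left, right = pairs[:cut], pairs[cut:]
    return left, right, right[0][0]

left, right, k = split_leaf(
    [("Brighton", "r1"), ("Downtown", "r2")], ("Clearview", "r3"), n=3
)
```

With n = 3, inserting “Clearview” into a leaf holding Brighton and Downtown leaves Brighton and Clearview in the original node and moves Downtown to the new node.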

Updates on B+-Trees: Insertion (Cont.)

B+-Tree before and after insertion of “Clearview”

Updates on B+-Trees: Deletion
• Find the record to be deleted, and remove it from the main file and from the bucket (if present)
• Remove (search-key value, pointer) from the leaf node if there is no bucket or if the bucket has become empty
• If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a single node, then merge siblings

Updates on B+-Trees: Deletion (Cont.)
• Otherwise, if the node has too few entries due to the removal, but the entries in the node and a sibling do not fit into a single node, then redistribute pointers:
  – Redistribute the pointers between the node and a sibling such that both have at least the minimum number of entries.
  – Update the corresponding search-key value in the parent of the node.
• The node deletions may cascade upwards till a node which has ⌈n/2⌉ or more pointers is found.
• If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.

Examples of B+-Tree Deletion

Before and after deleting “Downtown”

Examples of B+-Tree Deletion

Deletion of “Perryridge” from result of previous example

Example of B+-tree Deletion

Before and after deletion of “Perryridge” from earlier example

Multiple-Key Access
• Use multiple indices for certain types of queries.
• Example:
    select account_number
    from account
    where branch_name = “Perryridge” and balance = 1000
• Possible strategies for processing query using indices on single attributes:
  1. Use index on branch_name to find accounts with branch name Perryridge; test balance = 1000
  2. Use index on balance to find accounts with balances of $1000; test branch_name = “Perryridge”.
  3. Use branch_name index to find pointers to all records pertaining to the Perryridge branch. Similarly use index on balance. Take intersection of both sets of pointers obtained.
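
Strategy 3 amounts to a set intersection over record pointers. The sketch below models each single-attribute index as a dictionary from search-key value to a list of pointers; the data and the names pointers_for, branch_index and balance_index are made up for illustration.

```python
def pointers_for(index, value):
    """Record pointers stored under `value` in a single-attribute index."""
    return set(index.get(value, ()))

# Hypothetical secondary indices: search-key value -> record pointers.
branch_index = {"Perryridge": [3, 7, 9], "Brighton": [1]}
balance_index = {1000: [2, 7, 9], 500: [3]}

# Only pointers present in both sets satisfy both conditions.
matches = pointers_for(branch_index, "Perryridge") & pointers_for(balance_index, 1000)
```

Only the records in the intersection need to be fetched from the file, which matters when each condition alone matches many records but few satisfy both.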

Indices on Multiple Keys
• Composite search keys are search keys containing more than one attribute
  – E.g. (branch_name, balance)
• Lexicographic ordering: (a1, a2) < (b1, b2) if either
  – a1 < b1, or
  – a1 = b1 and a2 < b2
• An index on the composite key can efficiently handle
    where branch_name = “Perryridge” and balance < 1000
• But cannot efficiently handle
    where branch_name < “Perryridge” and balance = 1000
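
Python tuples compare lexicographically, so a sorted list of (branch_name, balance) tuples can stand in for an ordered composite index. The entries and function names below are hypothetical; the point is that the first query maps to one contiguous slice while the second has to filter a whole prefix of the index.

```python
from bisect import bisect_left

# Hypothetical index entries on (branch_name, balance), in lexicographic order.
entries = sorted([
    ("Brighton", 750), ("Downtown", 500), ("Mianus", 700),
    ("Perryridge", 400), ("Perryridge", 900), ("Perryridge", 1000),
    ("Redwood", 700), ("Round Hill", 350),
])

def perryridge_balance_below(limit):
    """branch_name = 'Perryridge' and balance < limit: one contiguous range."""
    lo = bisect_left(entries, ("Perryridge",))        # first Perryridge entry
    hi = bisect_left(entries, ("Perryridge", limit))  # first with balance >= limit
    return entries[lo:hi]

def below_perryridge_with_balance(balance):
    """branch_name < 'Perryridge' and balance = X: matches are scattered,
    so every entry before Perryridge has to be examined."""
    lo = bisect_left(entries, ("Perryridge",))
    return [e for e in entries[:lo] if e[1] == balance]
```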

Non-Unique Search Keys
• Alternatives:
  – Make search key unique by adding a record-identifier
    • Extra storage overhead for keys
    • Simpler code for insertion/deletion
    • Widely used

Other Issues in Indexing
• Covering indices
  – Add extra attributes to index so (some) queries can avoid fetching the actual records
    • Particularly useful for secondary indices
  – Can store extra attributes only at leaf
• Record relocation and secondary indices
  – If a record moves, all secondary indices that store record pointers have to be updated
  – Node splits in B+-tree file organizations become very expensive
  – Solution: use primary-index search key instead of record pointer in secondary index
    • Extra traversal of primary index to locate record
      – Higher cost for queries, but node splits are cheap
    • Add record-id if primary-index search key is non-unique

Hashing

Static Hashing
• A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).
• In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function.
• Hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.
• Hash function is used to locate records for access, insertion as well as deletion.
• Records with different search-key values may be mapped to the same bucket; thus entire bucket has to be searched sequentially to locate a record.

Example of Hash File Organization
• Hash file organization of account file, using branch_name as key
  – There are 10 buckets
  – E.g. h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3
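
The slide does not state which hash function produces these values. One simple function that reproduces all three, shown here purely as an assumption, sums the alphabet positions of the letters (a = 1, ..., z = 26) and takes the result modulo the number of buckets.

```python
def h(branch_name, num_buckets=10):
    """Sum of alphabet positions (a=1 ... z=26) modulo the bucket count.

    Assumed for illustration; non-letters such as spaces are ignored.
    """
    total = sum(ord(c) - ord("a") + 1 for c in branch_name.lower() if c.isalpha())
    return total % num_buckets
```

“Round Hill” and “Brighton” land in the same bucket, so that bucket must be searched sequentially to tell them apart.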

Hash Functions
• Worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file.
• An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.
• Ideal hash function is random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.

Handling of Bucket Overflows
• Bucket overflow can occur because of
  – Insufficient buckets
  – Skew in distribution of records. This can occur due to two reasons:
    • multiple records have same search-key value
    • chosen hash function produces non-uniform distribution of key values
• Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets.

Handling of Bucket Overflows (Cont.)
• Overflow chaining: the overflow buckets of a given bucket are chained together in a linked list.
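
A static hash file with overflow chaining can be sketched in a few lines. This is an in-memory illustration under stated assumptions: the class names Bucket and StaticHashFile are hypothetical, a bucket holds a fixed number of records, and the hash function is supplied by the caller.

```python
class Bucket:
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []
        self.overflow = None  # next bucket in this bucket's overflow chain

    def insert(self, record):
        if len(self.records) < self.capacity:
            self.records.append(record)
            return
        if self.overflow is None:
            self.overflow = Bucket(self.capacity)  # extend the chain
        self.overflow.insert(record)

    def search(self, key):
        # The primary bucket and its whole chain are scanned sequentially.
        bucket, found = self, []
        while bucket is not None:
            found.extend(r for r in bucket.records if r[0] == key)
            bucket = bucket.overflow
        return found

class StaticHashFile:
    def __init__(self, num_buckets, capacity, hash_fn):
        self.buckets = [Bucket(capacity) for _ in range(num_buckets)]
        self.hash_fn = hash_fn

    def _bucket(self, key):
        return self.buckets[self.hash_fn(key) % len(self.buckets)]

    def insert(self, key, value):
        self._bucket(key).insert((key, value))

    def lookup(self, key):
        return [v for _, v in self._bucket(key).search(key)]

f = StaticHashFile(num_buckets=4, capacity=2, hash_fn=lambda k: k)
for key, value in [(1, "a"), (5, "b"), (9, "c"), (13, "d")]:
    f.insert(key, value)  # all four keys hash to bucket 1
```

The number of buckets never changes after construction, so a skewed or growing file only lengthens the chains.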

Hash Indices
• Hashing can be used not only for file organization, but also for index-structure creation.
• A hash index organizes the search keys, with their associated record pointers, into a hash file structure.
• Strictly speaking, hash indices are always secondary indices
  – if the file itself is organized using hashing, a separate primary hash index on it using the same search-key is unnecessary.
  – However, we use the term hash index to refer to both secondary index structures and hash organized files.

Example of Hash Index

Deficiencies of Static Hashing
• In static hashing, function h maps search-key values to a fixed set B of bucket addresses. Databases grow or shrink with time.
  – If initial number of buckets is too small, and file grows, performance will degrade due to too many overflows.
  – If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets will be underfull).
  – If database shrinks, again space will be wasted.
• One solution: periodic re-organization of the file with a new hash function
  – Expensive, disrupts normal operations
• Better solution: allow the number of buckets to be modified dynamically.

Initial Hash structure, bucket size = 2

Extendable Hashing vs. Other Schemes
• Benefits of extendable hashing:
  – Hash performance does not degrade with growth of file
  – Minimal space overhead
• Disadvantages of extendable hashing
  – Extra level of indirection to find desired record
  – Bucket address table may itself become very big (larger than memory)
    • Cannot allocate very large contiguous areas on disk either
    • Solution: B+-tree structure to locate desired record in bucket address table

Comparison of Ordered Indexing and Hashing
• Cost of periodic re-organization
• Relative frequency of insertions and deletions
• Is it desirable to optimize average access time at the expense of worst-case access time?
• Expected type of queries:
  – Hashing is generally better at retrieving records having a specified value of the key.
  – If range queries are common, ordered indices are to be preferred
• In practice:
  – PostgreSQL supports hash indices, but discourages use due to poor performance
  – Oracle supports static hash organization, but not hash indices
  – SQL Server supports only B+-trees

Bitmap Indices
• Bitmap indices are a special type of index designed for efficient querying on multiple keys
• Records in a relation are assumed to be numbered sequentially from, say, 0
  – Given a number n it must be easy to retrieve record n
    • Particularly easy if records are of fixed size
• Applicable on attributes that take on a relatively small number of distinct values
  – E.g. gender, country, state, …
  – E.g. income-level (income broken up into a small number of levels such as 0-9999, 10000-19999, 20000-50000, 50000-infinity)
• A bitmap is simply an array of bits

Bitmap Indices (Cont.)
• In its simplest form a bitmap index on an attribute has a bitmap for each value of the attribute
  – Bitmap has as many bits as records
  – In a bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute, and is 0 otherwise

Bitmap Indices (Cont.)
• Bitmap indices are useful for queries on multiple attributes
  – not particularly useful for single attribute queries
• Queries are answered using bitmap operations
  – Intersection (and)
  – Union (or)
  – Complementation (not)
• Intersection and union take two bitmaps of the same size and apply the operation on corresponding bits to get the result bitmap; complementation flips every bit of a single bitmap
  – E.g. 100110 AND 110011 = 100010
         100110 OR 110011 = 110111
         NOT 100110 = 011001
  – Males with income level L1: 10010 AND 10100 = 10000
    • Can then retrieve required tuples.
    • Counting number of matching tuples is even faster
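
The bitmap examples can be reproduced with a short sketch that keeps bitmaps as bit strings and converts to integers for the operations. The helper names and the sample gender and income columns are assumptions chosen to yield the bit patterns quoted on the slide.

```python
def build_bitmaps(values):
    """Map each distinct value to a bit string with one bit per record."""
    bitmaps = {}
    for i, v in enumerate(values):
        bits = bitmaps.setdefault(v, ["0"] * len(values))
        bits[i] = "1"
    return {v: "".join(bits) for v, bits in bitmaps.items()}

def bit_and(a, b):
    return format(int(a, 2) & int(b, 2), f"0{len(a)}b")

def bit_or(a, b):
    return format(int(a, 2) | int(b, 2), f"0{len(a)}b")

def bit_not(a):
    mask = (1 << len(a)) - 1  # keep only len(a) bits after complementing
    return format(~int(a, 2) & mask, f"0{len(a)}b")

# Hypothetical five-record relation.
gender = build_bitmaps(["m", "f", "f", "m", "f"])
income = build_bitmaps(["L1", "L2", "L1", "L4", "L3"])
males_with_l1 = bit_and(gender["m"], income["L1"])
```

Counting the matches is then just a count of the 1 bits in males_with_l1, without touching the records.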

Bitmap Indices (Cont.)
• Bitmap indices are generally very small compared with relation size
  – E.g. if record is 100 bytes, space for a single bitmap is 1/800 of space used by relation.
    • If number of distinct attribute values is 8, bitmap is only 1% of relation size
• Deletion needs to be handled properly
  – Existence bitmap to note if there is a valid record at a record location
  – Needed for complementation
    • not(A=v): (NOT bitmap-A-v) AND ExistenceBitmap
• Should keep bitmaps for all values, even null value
  – To correctly handle SQL null semantics for NOT(A=v):
    • intersect above result with (NOT bitmap-A-Null)
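
The complementation rule, including the existence bitmap and the null bitmap, can be written out with integers as bitmaps. The function name not_equals and the six-record sample bitmaps are hypothetical.

```python
def not_equals(bitmap_v, bitmap_null, existence, num_records):
    """Bitmap of live records for which A = v is false (not unknown)."""
    mask = (1 << num_records) - 1
    result = (~bitmap_v & mask) & existence   # (NOT bitmap-A-v) AND ExistenceBitmap
    return result & (~bitmap_null & mask)     # AND (NOT bitmap-A-Null)

# Six record slots, written left to right as slots 0..5.
bitmap_v = 0b100100     # records where A = v
bitmap_null = 0b000010  # records where A is null
existence = 0b111011    # slot 3 holds a deleted record
result = not_equals(bitmap_v, bitmap_null, existence, 6)
```

Without the existence bitmap, complementing bitmap_v would wrongly report deleted slots that merely have a 0 bit.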

Efficient Implementation of Bitmap Operations
• Bitmaps are packed into words; a single word and (a basic CPU instruction) computes and of 32 or 64 bits at once
  – E.g. 1-million-bit maps can be and-ed with just 31,250 instructions (using 32-bit words)
• Counting number of 1s can be done fast by a trick:
  – Use each byte to index into a precomputed array of 256 elements each storing the count of 1s in the binary representation
    • Can use pairs of bytes to speed up further at a higher memory cost
  – Add up the retrieved counts
• Bitmaps can be used instead of Tuple-ID lists at leaf levels of B+-trees, for values that have a large number of matching records
  – Worthwhile if > 1/64 of the records have that value, assuming a tuple-id is 64 bits
  – Above technique merges benefits of bitmap and B+-tree indices
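
The byte-indexed counting trick looks like this in Python. It is a sketch only: the names ONES and count_ones are hypothetical, and the bitmap is assumed to be stored as a bytes object.

```python
# Precomputed once: ONES[b] is the number of 1 bits in the byte value b.
ONES = [bin(b).count("1") for b in range(256)]

def count_ones(bitmap: bytes) -> int:
    """Count the 1 bits by using each byte as an index into the table."""
    return sum(ONES[b] for b in bitmap)

# Word count for and-ing two 1-million-bit maps with 32-bit words.
words_needed = 1_000_000 // 32
```

A table indexed by pairs of bytes would need 65,536 entries, trading memory for half as many lookups.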

Index Definition in SQL
• Create an index
  – create index <index-name> on <relation-name> (<attribute-list>)
  – E.g.: create index b-index on branch(branch_name)
• Use create unique index to indirectly specify and enforce the condition that the search key is a candidate key.
  – Not really required if SQL unique integrity constraint is supported
• To drop an index
  – drop index <index-name>
• Most database systems allow specification of type of index, and clustering.
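
These statements can be tried against an in-memory SQLite database through Python's standard sqlite3 module. The sketch makes a few assumptions: the columns of the branch table and the sample rows are invented, and the index is named b_index because an unquoted hyphen is not valid in an SQL identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema for the branch relation.
conn.execute("create table branch (branch_name text, branch_city text, assets integer)")

conn.execute("create index b_index on branch (branch_name)")
conn.execute("create unique index b_unique on branch (branch_name)")

conn.execute("insert into branch values ('Perryridge', 'Horseneck', 1700000)")
try:
    # A second row with the same branch_name violates the unique index.
    conn.execute("insert into branch values ('Perryridge', 'Brooklyn', 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("drop index b_index")
index_names = [row[0] for row in
               conn.execute("select name from sqlite_master where type = 'index'")]
```

After the drop only the unique index remains, and the duplicate insert has been rejected by it.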

End of Chapter
