Indexing and Hashing
Overview
• Basic Concepts
• Ordered Indices
• B+-Tree Index Files
• Static Hashing
• Dynamic Hashing
• Comparison of Ordered Indexing and Hashing
• Index Definition in SQL
• Multiple-Key Access
Basic Concepts
• Indexing mechanisms are used to speed up access to desired data.
  – E.g., author catalog in a library
• Search key: an attribute or set of attributes used to look up records in a file.
• An index file consists of records (called index entries) of the form (search-key, pointer).
• Index files are typically much smaller than the original file.
• Two basic kinds of indices:
  – Ordered indices: search keys are stored in sorted order
  – Hash indices: search keys are distributed uniformly across “buckets” using a “hash function”.
Index Evaluation Metrics
• Access types supported efficiently. E.g.,
  – records with a specified value in the attribute
  – or records with an attribute value falling in a specified range of values.
• Access time
• Insertion time
• Deletion time
• Space overhead
Ordered Indices
• In an ordered index, index entries are stored sorted on the search-key value. E.g., author catalog in a library.
• Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.
  – Also called clustering index
  – The search key of a primary index is usually but not necessarily the primary key.
• Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called non-clustering index.
• Index-sequential file: ordered sequential file with a primary index.
Dense Index Files
• Dense index: an index record appears for every search-key value in the file.
Sparse Index Files
• Sparse index: contains index records for only some search-key values.
  – Applicable when records are sequentially ordered on search-key
• To locate a record with search-key value K we:
  – Find the index record with the largest search-key value ≤ K
  – Search the file sequentially starting at the record to which the index record points
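The lookup steps above can be sketched in a few lines. This is a minimal illustration rather than a definitive implementation: the in-memory `blocks` list, the account numbers, and the name `sparse_lookup` are all hypothetical. It assumes one index entry per block (the least key in the block) and picks the entry with the largest key not exceeding K.

```python
from bisect import bisect_right

# Hypothetical data file: blocks of (search_key, payload) records,
# stored in search-key order.
blocks = [
    [("A-101", 500), ("A-110", 600)],
    [("A-215", 700), ("A-217", 750)],
    [("A-222", 700), ("A-305", 350)],
]

# Sparse index: one entry per block, holding the least key in that block.
index_keys = [block[0][0] for block in blocks]


def sparse_lookup(key):
    """Return the payload stored under `key`, or None if it is absent."""
    # Position of the last index entry whose key is <= the target key.
    pos = bisect_right(index_keys, key) - 1
    if pos < 0:
        return None  # key is smaller than every key in the file
    # Scan the file sequentially from the block the index entry points to.
    for block in blocks[pos:]:
        for k, payload in block:
            if k == key:
                return payload
            if k > key:
                return None  # passed the place where key would have been
    return None
```

Because the file is sorted, the scan can stop as soon as it sees a key larger than the target.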
Sparse Index Files (Cont.)
• Compared to dense indices:
  – Less space and less maintenance overhead for insertions and deletions.
  – Generally slower than dense index for locating records.
• Good tradeoff: sparse index with an index entry for every block in the file, corresponding to the least search-key value in the block.
Multilevel Index
• If the primary index does not fit in memory, access becomes expensive.
• Solution: treat the primary index kept on disk as a sequential file and construct a sparse index on it.
  – outer index: a sparse index of the primary index
  – inner index: the primary index file
• If even the outer index is too large to fit in main memory, yet another level of index can be created, and so on.
• Indices at all levels must be updated on insertion or deletion from the file.
Index Update: Deletion
• If the deleted record was the only record in the file with its particular search-key value, the search-key is deleted from the index also.
• Single-level index deletion:
  – Dense indices: deletion of the search-key is similar to file record deletion.
  – Sparse indices:
    • If an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next search-key value in the file (in search-key order).
    • If the next search-key value already has an index entry, the entry is deleted instead of being replaced.
Index Update: Insertion
• Single-level index insertion:
  – Perform a lookup using the search-key value appearing in the record to be inserted.
  – Dense indices: if the search-key value does not appear in the index, insert it.
  – Sparse indices: if the index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created.
    • If a new block is created, the first search-key value appearing in the new block is inserted into the index.
• Multilevel insertion (as well as deletion) algorithms are simple extensions of the single-level algorithms.
Secondary Indices
• Frequently, one wants to find all the records whose values in a certain field (which is not the search-key of the primary index) satisfy some condition.
  – Example 1: In the account relation stored sequentially by account number, we may want to find all accounts in a particular branch
  – Example 2: as above, but where we want to find all accounts with a specified balance or range of balances
• We can have a secondary index with an index record for each search-key value
Secondary Indices Example
• Index record points to a bucket that contains pointers to all the actual records with that particular search-key value.
• Secondary indices have to be dense, because the file is not ordered on their search key.
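A small sketch of the bucket idea, under stated assumptions: the `account_file` list, its contents, and the use of list positions as record pointers are hypothetical stand-ins for a file stored in account-number order.

```python
from collections import defaultdict

# Hypothetical account file, stored sequentially by account number.
# Each record is (account_number, branch_name, balance).
account_file = [
    ("A-101", "Downtown", 500),
    ("A-102", "Perryridge", 400),
    ("A-201", "Perryridge", 900),
    ("A-215", "Mianus", 700),
    ("A-217", "Brighton", 750),
]

# Dense secondary index on branch_name: every search-key value gets an
# entry, and the entry is a bucket of pointers (here, file positions).
branch_index = defaultdict(list)
for position, (_, branch, _) in enumerate(account_file):
    branch_index[branch].append(position)


def accounts_in_branch(branch):
    """Follow every pointer in the bucket for `branch`."""
    return [account_file[p][0] for p in branch_index.get(branch, [])]
```

Records of one branch are scattered through the file, so each pointer in a bucket may lead to a different block.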
Primary and Secondary Indices
• Indices offer substantial benefits when searching for records.
• BUT: updating indices imposes overhead on database modification: when a file is modified, every index on the file must be updated.
• Sequential scan using a primary index is efficient, but a sequential scan using a secondary index is expensive
  – Each record access may fetch a new block from disk
  – Block fetch requires about 5 to 10 milliseconds
    • versus about 100 nanoseconds for memory access
B+-Tree Index Files
B+-tree indices are an alternative to indexed-sequential files.
• Disadvantage of indexed-sequential files
  – performance degrades as file grows, since many overflow blocks get created.
  – Periodic reorganization of entire file is required.
• Advantage of B+-tree index files:
  – automatically reorganizes itself with small, local changes, in the face of insertions and deletions.
  – Reorganization of entire file is not required to maintain performance.
• (Minor) disadvantage of B+-trees:
  – extra insertion and deletion overhead, space overhead.
• Advantages of B+-trees outweigh disadvantages
  – B+-trees are used extensively
B+-Tree Index Files (Cont.)
• A B+-tree is a rooted tree satisfying the following properties:
  – All paths from root to leaf are of the same length
  – Each node that is not a root or a leaf has between ⌈n/2⌉ and n children.
  – A leaf node has between ⌈(n–1)/2⌉ and n–1 values
  – Special cases:
    • If the root is not a leaf, it has at least 2 children.
    • If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1) values.
B+-Tree Node Structure
• Typical node: P1 | K1 | P2 | ... | Pn–1 | Kn–1 | Pn
  – Ki are the search-key values
  – Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
• The search-keys in a node are ordered
  – K1 < K2 < K3 < ... < Kn–1
Leaf Nodes in B+-Trees
• For i = 1, 2, ..., n–1, pointer Pi either points to a file record with search-key value Ki, or to a bucket of pointers to file records, each record having search-key value Ki. The bucket structure is only needed if the search-key does not form a primary key.
• If Li, Lj are leaf nodes and i < j, Li’s search-key values are less than Lj’s search-key values
• Pn points to the next leaf node in search-key order
Non-Leaf Nodes in B+-Trees
• Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m pointers:
  – All the search-keys in the subtree to which P1 points are less than K1
  – For 2 ≤ i ≤ m–1, all the search-keys in the subtree to which Pi points have values greater than or equal to Ki–1 and less than Ki
  – All the search-keys in the subtree to which Pm points have values greater than or equal to Km–1
Example of a B+-tree
B+-tree for account file (n = 3)
Example of B+-tree
• Leaf nodes must have between 2 and 4 values (⌈(n–1)/2⌉ and n–1, with n = 5).
• Non-leaf nodes other than root must have between 3 and 5 children (⌈n/2⌉ and n, with n = 5).
• Root must have at least 2 children.
B+-tree for account file (n = 5)
Observations about B+-trees
• Since the inter-node connections are done by pointers, “logically” close blocks need not be “physically” close.
• The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
• The B+-tree contains a relatively small number of levels
  – Level below root has at least 2*⌈n/2⌉ values
  – Next level has at least 2*⌈n/2⌉*⌈n/2⌉ values
  – If there are K search-key values in the file, the tree height is no more than $\lceil \log_{\lceil n/2 \rceil}(K) \rceil$
  – thus searches can be conducted efficiently.
• Insertions and deletions to the main file can be handled efficiently, as the index can be restructured in logarithmic time (as we shall see).
Queries on B+-Trees
• Find all records with a search-key value of k.
  – N = root
  – Repeat
    • Examine N for the smallest search-key value > k.
    • If such a value exists, assume it is Ki. Then set N = Pi
    • Otherwise k ≥ Kn–1. Set N = Pn
  – Until N is a leaf node
  – If for some i, key Ki = k, follow pointer Pi to the desired record or bucket.
  – Else no record with search-key value k exists.
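The search procedure above can be sketched as follows. This is an illustrative sketch, not the slides' own code: the `Node` class, the two-level tree, and the account numbers used as records are hypothetical, and the next-leaf pointer Pn of leaf nodes is left out because a point lookup does not need it.

```python
from bisect import bisect_right


class Node:
    """Simplified B+-tree node (hypothetical layout for illustration)."""

    def __init__(self, keys, pointers, is_leaf):
        self.keys = keys          # K1 < K2 < ... in sorted order
        self.pointers = pointers  # children (non-leaf) or records (leaf)
        self.is_leaf = is_leaf


def find(root, k):
    """Return the record stored under key k, or None if k is absent."""
    node = root
    while not node.is_leaf:
        # Index of the smallest key > k. If no such key exists this equals
        # len(keys), which selects the last pointer of the node.
        i = bisect_right(node.keys, k)
        node = node.pointers[i]
    for key, record in zip(node.keys, node.pointers):
        if key == k:
            return record
    return None


# A small two-level tree over branch names.
leaf1 = Node(["Brighton", "Downtown"], ["A-217", "A-101"], True)
leaf2 = Node(["Mianus", "Perryridge"], ["A-215", "A-102"], True)
leaf3 = Node(["Redwood", "Round Hill"], ["A-222", "A-305"], True)
root = Node(["Mianus", "Redwood"], [leaf1, leaf2, leaf3], False)
```

A key equal to a separator, such as "Mianus", is routed to the right of that separator, which matches the rule that the subtree under Pi holds keys greater than or equal to Ki–1.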
Queries on B+-Trees (Cont.)
• If there are K search-key values in the file, the height of the tree is no more than $\lceil \log_{\lceil n/2 \rceil}(K) \rceil$.
• A node is generally the same size as a disk block, typically 4 kilobytes
  – and n is typically around 100 (40 bytes per index entry).
• With 1 million search key values and n = 100
  – at most $\lceil \log_{50}(1{,}000{,}000) \rceil = 4$ nodes are accessed in a lookup.
• Contrast this with a balanced binary tree with 1 million search key values: around 20 nodes are accessed in a lookup
  – above difference is significant since every node access may need a disk I/O, costing around 20 milliseconds
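The two node counts quoted above can be checked with a short calculation. The helper name `max_levels` is made up for this sketch; it uses integer arithmetic to find the smallest h with fanout^h ≥ K, which is the ceiling of the logarithm.

```python
def max_levels(num_keys, min_fanout):
    """Smallest h such that min_fanout ** h >= num_keys."""
    levels, reach = 0, 1
    while reach < num_keys:
        reach *= min_fanout
        levels += 1
    return levels


n = 100
half = (n + 1) // 2                      # ceil(n / 2) = 50
print(max_levels(1_000_000, half))       # B+-tree bound: 4
print(max_levels(1_000_000, 2))          # balanced binary tree: 20
```

At the slide's estimate of about 20 milliseconds per disk I/O, that is roughly 80 ms against 400 ms per lookup when every node access goes to disk.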
Updates on B+-Trees: Insertion
• Find the leaf node in which the search-key value would appear
• If there is room in the leaf node, insert the (key-value, pointer) pair in the leaf node
• Otherwise, split the node (along with the new (key-value, pointer) entry) as discussed in the next slide.
Updates on B+-Trees: Insertion (Cont.)
• Splitting a leaf node:
  – take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place the first ⌈n/2⌉ in the original node, and the rest in a new node.
  – let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split.
  – If the parent is full, split it and propagate the split further up.
• Splitting of nodes proceeds upwards till a node that is not full is found.
  – In the worst case the root node may be split, increasing the height of the tree by 1.
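The leaf-split step can be sketched on its own, leaving out the parent update and the upward propagation. The function name `split_leaf` and the record labels `r1`, `r2`, `r3` are hypothetical; a leaf is modelled as a sorted list of (key, pointer) pairs.

```python
from bisect import insort
from math import ceil


def split_leaf(entries, new_entry, n):
    """Split a full leaf holding n-1 sorted (key, pointer) pairs.

    Returns (original_node, new_node, k), where k is the least key in the
    new node and is the key to insert into the parent.
    """
    pairs = list(entries)
    insort(pairs, new_entry)          # the n pairs in sorted order
    keep = ceil(n / 2)                # first ceil(n/2) stay in the original
    original, new_node = pairs[:keep], pairs[keep:]
    return original, new_node, new_node[0][0]


# n = 3: a leaf holds at most 2 pairs, so inserting a third forces a split.
full_leaf = [("Brighton", "r1"), ("Downtown", "r2")]
original, new_node, k = split_leaf(full_leaf, ("Clearview", "r3"), n=3)
```

Here the original node keeps "Brighton" and "Clearview", the new node holds "Downtown", and "Downtown" is the key passed up to the parent.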
Updates on B+-Trees: Insertion (Cont.)
B+-Tree before and after insertion of “Clearview”
Updates on B+-Trees: Deletion
• Find the record to be deleted, and remove it from the main file and from the bucket (if present)
• Remove (search-key value, pointer) from the leaf node if there is no bucket or if the bucket has become empty
• If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a single node, then merge siblings
Updates on B+-Trees: Deletion
• Otherwise, if the node has too few entries due to the removal, but the entries in the node and a sibling do not fit into a single node, then redistribute pointers:
  – Redistribute the pointers between the node and a sibling such that both have at least the minimum number of entries.
  – Update the corresponding search-key value in the parent of the node.
• The node deletions may cascade upwards till a node which has ⌈n/2⌉ or more pointers is found.
• If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.
Examples of B+-Tree Deletion
Before and after deleting “Downtown”
Examples of B+-Tree Deletion
Deletion of “Perryridge” from result of previous example
Example of B+-tree Deletion
Before and after deletion of “Perryridge” from earlier example
Multiple-Key Access
• Use multiple indices for certain types of queries.
• Example:
  select account_number from account where branch_name = “Perryridge” and balance = 1000
• Possible strategies for processing query using indices on single attributes:
  1. Use index on branch_name to find accounts with branch name Perryridge; test balance = 1000
  2. Use index on balance to find accounts with balances of $1000; test branch_name = “Perryridge”.
  3. Use branch_name index to find pointers to all records pertaining to the Perryridge branch. Similarly use index on balance. Take intersection of both sets of pointers obtained.
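Strategy 3 amounts to a set intersection over record pointers. In the sketch below the two dictionaries stand in for single-attribute indices, and the integer record ids and their contents are made up for illustration.

```python
# Hypothetical single-attribute indices: value -> set of record pointers.
branch_index = {
    "Perryridge": {1, 4, 7},
    "Brighton": {2, 3},
}
balance_index = {
    1000: {3, 4, 9},
    750: {1},
}


def matching_pointers(branch, balance):
    """Pointers to records satisfying both equality conditions."""
    by_branch = branch_index.get(branch, set())
    by_balance = balance_index.get(balance, set())
    # Only pointers in the intersection need to be followed into the file.
    return by_branch & by_balance
```

Only record 4 is fetched for ("Perryridge", 1000) here, whereas strategies 1 and 2 would each fetch three records and discard two.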
Indices on Multiple Keys
• Composite search keys are search keys containing more than one attribute
  – E.g. (branch_name, balance)
• Lexicographic ordering: (a1, a2) < (b1, b2) if either
  – a1 < b1, or
  – a1 = b1 and a2 < b2
• An ordered index on (branch_name, balance) can efficiently handle where branch_name = “Perryridge” and balance < 1000
• But it cannot efficiently handle where branch_name < “Perryridge” and balance = 1000
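Python tuples compare lexicographically, so a sorted list of (branch_name, balance) tuples can stand in for an ordered composite-key index. The entries and function names below are hypothetical; the sketch shows why the first query maps to one contiguous range while the second has to filter everything below the branch bound.

```python
from bisect import bisect_left

# Sorted (branch_name, balance) entries, standing in for the leaf level
# of an ordered index on the composite search key.
entries = sorted([
    ("Brighton", 750), ("Downtown", 500), ("Perryridge", 400),
    ("Perryridge", 900), ("Perryridge", 1300), ("Round Hill", 350),
])


def branch_equals_balance_below(branch, limit):
    """branch_name = branch and balance < limit: one contiguous slice."""
    lo = bisect_left(entries, (branch,))        # first entry for branch
    hi = bisect_left(entries, (branch, limit))  # first entry >= (branch, limit)
    return entries[lo:hi]


def branch_below_balance_equals(branch, balance):
    """branch_name < branch and balance = balance: matches are scattered,
    so every entry below the branch bound has to be examined."""
    hi = bisect_left(entries, (branch,))
    return [e for e in entries[:hi] if e[1] == balance]
```

The one-element tuple `(branch,)` sorts before every `(branch, balance)` pair, which is what makes it usable as a lower bound.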
Non-Unique Search Keys
• Alternatives:
  – Make search key unique by adding a record-identifier
    • Extra storage overhead for keys
    • Simpler code for insertion/deletion
    • Widely used
Other Issues in Indexing
• Covering indices
  – Add extra attributes to index so (some) queries can avoid fetching the actual records
    • Particularly useful for secondary indices
  – Can store extra attributes only at leaf
• Record relocation and secondary indices
  – If a record moves, all secondary indices that store record pointers have to be updated
  – Node splits in B+-tree file organizations become very expensive
  – Solution: use primary-index search key instead of record pointer in secondary index
    • Extra traversal of primary index to locate record
      – Higher cost for queries, but node splits are cheap
    • Add record-id if primary-index search key is non-unique
Hashing
Static Hashing
• A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).
• In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function.
• Hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.
• Hash function is used to locate records for access, insertion as well as deletion.
• Records with different search-key values may be mapped to the same bucket; thus entire bucket has to be searched sequentially to locate a record.
Example of Hash File Organization
• Hash file organization of account file, using branch_name as key
  – There are 10 buckets
  – E.g. h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3
Hash Functions
• Worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file.
• An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.
• Ideal hash function is random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.
Handling of Bucket Overflows
• Bucket overflow can occur because of
  – Insufficient buckets
  – Skew in distribution of records. This can occur due to two reasons:
    • multiple records have same search-key value
    • chosen hash function produces non-uniform distribution of key values
• Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets.
Handling of Bucket Overflows (Cont.)
• Overflow chaining: the overflow buckets of a given bucket are chained together in a linked list.
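A minimal sketch of static hashing with overflow chaining follows. Everything in it is an assumption made for illustration: the bucket capacity of 2, the `Bucket` class, and the toy hash function `h`, which is not the function behind the h(Perryridge) = 5 example and will generally give different bucket numbers.

```python
NUM_BUCKETS = 10
BUCKET_CAPACITY = 2  # records per bucket; a real bucket is a disk block


class Bucket:
    def __init__(self):
        self.records = []     # (key, record) pairs stored in this bucket
        self.overflow = None  # next bucket in the overflow chain, if any


def h(key):
    # Toy hash function: sum of character codes modulo the bucket count.
    return sum(ord(c) for c in key) % NUM_BUCKETS


table = [Bucket() for _ in range(NUM_BUCKETS)]


def insert(key, record):
    bucket = table[h(key)]
    # Walk the chain until a bucket with free space is found,
    # appending a new overflow bucket when the chain is exhausted.
    while len(bucket.records) >= BUCKET_CAPACITY:
        if bucket.overflow is None:
            bucket.overflow = Bucket()
        bucket = bucket.overflow
    bucket.records.append((key, record))


def lookup(key):
    # The whole chain must be searched, since different keys share buckets.
    results = []
    bucket = table[h(key)]
    while bucket is not None:
        results.extend(rec for k, rec in bucket.records if k == key)
        bucket = bucket.overflow
    return results


# Three records with the same search key overflow a bucket of capacity 2.
for account in ("A-102", "A-201", "A-218"):
    insert("Perryridge", account)
```

The third insert shows the first cause of skew listed above: several records sharing one search-key value overflow the bucket no matter how good the hash function is.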
Hash Indices
• Hashing can be used not only for file organization, but also for index-structure creation.
• A hash index organizes the search keys, with their associated record pointers, into a hash file structure.
• Strictly speaking, hash indices are always secondary indices
  – if the file itself is organized using hashing, a separate primary hash index on it using the same search-key is unnecessary.
  – However, we use the term hash index to refer to both secondary index structures and hash organized files.
Example of Hash Index
Deficiencies of Static Hashing
• In static hashing, function h maps search-key values to a fixed set B of bucket addresses. Databases grow or shrink with time.
  – If initial number of buckets is too small, and file grows, performance will degrade due to too many overflows.
  – If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets will be underfull).
  – If database shrinks, again space will be wasted.
• One solution: periodic re-organization of the file with a new hash function
  – Expensive, disrupts normal operations
• Better solution: allow the number of buckets to be modified dynamically.
Initial Hash structure, bucket size = 2
Extendable Hashing vs. Other Schemes
• Benefits of extendable hashing:
  – Hash performance does not degrade with growth of file
  – Minimal space overhead
• Disadvantages of extendable hashing
  – Extra level of indirection to find desired record
  – Bucket address table may itself become very big (larger than memory)
    • Cannot allocate very large contiguous areas on disk either
    • Solution: B+-tree structure to locate desired record in bucket address table
Comparison of Ordered Indexing and Hashing
• Cost of periodic re-organization
• Relative frequency of insertions and deletions
• Is it desirable to optimize average access time at the expense of worst-case access time?
• Expected type of queries:
  – Hashing is generally better at retrieving records having a specified value of the key.
  – If range queries are common, ordered indices are to be preferred
• In practice:
  – PostgreSQL supports hash indices, but discourages use due to poor performance
  – Oracle supports static hash organization, but not hash indices
  – SQL Server supports only B+-trees
Bitmap Indices
• Bitmap indices are a special type of index designed for efficient querying on multiple keys
• Records in a relation are assumed to be numbered sequentially from, say, 0
  – Given a number n it must be easy to retrieve record n
    • Particularly easy if records are of fixed size
• Applicable on attributes that take on a relatively small number of distinct values
  – E.g. gender, country, state, …
  – E.g. income-level (income broken up into a small number of levels such as 0-9999, 10000-19999, 20000-50000, 50000-infinity)
• A bitmap is simply an array of bits
Bitmap Indices (Cont.)
• In its simplest form a bitmap index on an attribute has a bitmap for each value of the attribute
  – Bitmap has as many bits as records
  – In a bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute, and is 0 otherwise
Bitmap Indices (Cont.)
• Bitmap indices are useful for queries on multiple attributes
  – not particularly useful for single attribute queries
• Queries are answered using bitmap operations
  – Intersection (and)
  – Union (or)
  – Complementation (not)
• Each operation takes two bitmaps of the same size (one, for complementation) and applies the operation on corresponding bits to get the result bitmap
  – E.g. 100110 AND 110011 = 100010
  – 100110 OR 110011 = 110111
  – NOT 100110 = 011001
  – Males with income level L1: 10010 AND 10100 = 10000
    • Can then retrieve required tuples.
    • Counting number of matching tuples is even faster
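The bitmap arithmetic above can be reproduced with Python integers used as bit arrays. The helper names are made up for this sketch. Complementation needs an explicit width mask, because Python integers have no fixed size.

```python
def bits(text):
    """Parse a string of 0s and 1s into an integer bitmap."""
    return int(text, 2)


def show(bitmap, width):
    """Format a bitmap as a zero-padded bit string of the given width."""
    return format(bitmap, f"0{width}b")


def bitmap_not(bitmap, width):
    """Complement within `width` bits (mask off the infinite sign bits)."""
    return ~bitmap & ((1 << width) - 1)


a, b = bits("100110"), bits("110011")
print(show(a & b, 6))              # 100010
print(show(a | b, 6))              # 110111
print(show(bitmap_not(a, 6), 6))   # 011001

# Males with income level L1: intersect the two bitmaps, then count.
males, level_1 = bits("10010"), bits("10100")
both = males & level_1
print(show(both, 5), bin(both).count("1"))   # 10000 1
```

Counting is faster than retrieval because it needs only the result bitmap and never touches the records themselves.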
Bitmap Indices (Cont.)
• Bitmap indices are generally very small compared with relation size
  – E.g. if a record is 100 bytes, space for a single bitmap is 1/800 of the space used by the relation.
    • If number of distinct attribute values is 8, bitmap is only 1% of relation size
• Deletion needs to be handled properly
  – Existence bitmap to note if there is a valid record at a record location
  – Needed for complementation
    • not(A=v): (NOT bitmap-A-v) AND ExistenceBitmap
• Should keep bitmaps for all values, even null value
  – To correctly handle SQL null semantics for NOT(A=v):
    • intersect above result with (NOT bitmap-A-Null)
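The complementation rule with the existence and null bitmaps can be written out as below. The six-record bitmaps are made-up sample data and `not_equal` is a hypothetical helper name.

```python
def not_equal(bitmap_v, existence, bitmap_null, width):
    """Records satisfying NOT (A = v) under SQL null semantics.

    (NOT bitmap-A-v) AND ExistenceBitmap AND (NOT bitmap-A-Null)
    """
    mask = (1 << width) - 1
    result = ~bitmap_v & mask        # complement of the bitmap for value v
    result &= existence              # drop slots of deleted records
    result &= ~bitmap_null & mask    # drop records where A is null
    return result


WIDTH = 6
bitmap_v    = int("100100", 2)  # records with A = v
existence   = int("111101", 2)  # the fifth record slot has been deleted
bitmap_null = int("001000", 2)  # the third record has A = null

answer = not_equal(bitmap_v, existence, bitmap_null, WIDTH)
print(format(answer, "06b"))    # 010001
```

Without the existence bitmap the deleted slot would appear in the answer, and without the null bitmap the record with a null value would wrongly be reported as satisfying NOT (A = v).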
Efficient Implementation of Bitmap Operations
• Bitmaps are packed into words; a single word and (a basic CPU instruction) computes and of 32 or 64 bits at once
  – E.g. 1-million-bit maps can be and-ed with just 31,250 instructions (using 32-bit words)
• Counting number of 1s can be done fast by a trick:
  – Use each byte to index into a precomputed array of 256 elements each storing the count of 1s in the binary representation
    • Can use pairs of bytes to speed up further at a higher memory cost
  – Add up the retrieved counts
• Bitmaps can be used instead of Tuple-ID lists at leaf levels of B+-trees, for values that have a large number of matching records
  – Worthwhile if > 1/64 of the records have that value, assuming a tuple-id is 64 bits
  – Above technique merges benefits of bitmap and B+-tree indices
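The byte-indexed counting trick can be sketched as follows, with the bitmap held in a Python `bytes` object. The names `ONES` and `count_ones` are made up for this illustration.

```python
# Precomputed table: ONES[b] is the number of 1 bits in the byte value b.
ONES = [bin(b).count("1") for b in range(256)]


def count_ones(bitmap: bytes) -> int:
    """Count set bits by looking up each byte and adding the counts."""
    return sum(ONES[byte] for byte in bitmap)


sample = bytes([0b10011000, 0b11111111, 0b00000000])
print(count_ones(sample))        # 3 + 8 + 0 = 11

# Word count for and-ing two 1-million-bit bitmaps with 32-bit words.
print(1_000_000 // 32)           # 31250
```

Indexing by pairs of bytes would halve the number of lookups, but the table would then need 65,536 entries instead of 256.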
Index Definition in SQL
• Create an index
  – create index <index-name> on <relation-name> (<attribute-list>)
  – E.g.: create index b-index on branch(branch_name)
• Use create unique index to indirectly specify and enforce the condition that the search key is a candidate key.
  – Not really required if SQL unique integrity constraint is supported
• To drop an index
  – drop index <index-name>
• Most database systems allow specification of type of index, and clustering.
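These statements can be tried against SQLite through Python's standard `sqlite3` module. This is a sketch in SQLite's dialect, and other systems differ in details. The two-column `branch` table and its rows are made up, and the index is named `b_index` with an underscore because a hyphen is not allowed in an unquoted SQL identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table branch (branch_name text, branch_city text)")

# Ordinary index and unique index on the same attribute.
conn.execute("create index b_index on branch(branch_name)")
conn.execute("create unique index b_unique on branch(branch_name)")

conn.execute("insert into branch values ('Perryridge', 'Horseneck')")
try:
    # A second row with the same branch_name violates the unique index.
    conn.execute("insert into branch values ('Perryridge', 'Brooklyn')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("drop index b_index")
remaining = [row[0] for row in conn.execute(
    "select name from sqlite_master where type = 'index'")]
```

After the drop, only `b_unique` remains, and it is the unique index that caused the duplicate insert to be rejected.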
End of Chapter