Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?
-
Upload
perla-graydon -
Category
Documents
-
view
212 -
download
0
Transcript of Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?
Copyright 2003Curt Hill
Hash indexes
Are they better or worse than a B+Tree?
Copyright 2003Curt Hill
Tasks
• Consider the basics of hashing• Consider how this applies to
indexing schemes• Consider variations • Consider the Hash Join
Copyright 2003Curt Hill
Basics of hashing
• Internal hashing– A table in memory– Bucket usually holds one entry– Covered in a separate presentation:
• Hashing.ppt
• External hashing– On disk– Bucket is a page – holds multiple
entries
Copyright 2003Curt Hill
External Hashing
• Hash function takes the key and computes an integer
• How is this integer used?• Direct file
– Key is an integer• Directory of heap file
– Works well if one directory page can hold the correct number of page ids
• Lookup table– Converts integer to page id number
Copyright 2003Curt Hill
How does it work?
• Hash function takes the key and computes a page number
• Search the page for the correct data page
• Access the data• For very large indices the number of
accesses can still be quite small
Copyright 2003Curt Hill
Diagram
Data . . .
Hash buckets
. . .
0 1 2 N-1
Copyright 2003Curt Hill
Example• Assume 1 million records with 4 records
to a page– 250000 pages of data
• Assume the key and pointer is 24 bytes with a 512 byte page– 21 keys and pointers in a page
• Assume buckets are ¾ full– 67000 buckets with 15 keys
• Hash function computes a value in range 0 – 66999
• Without collisions it takes two accesses to get data
Copyright 2003Curt Hill
Static and Dynamic
• Static hashing works well until the buckets fill up– Then a bucket requires an overflow
bucket– Searching the original and overflow
pages increases the accesses and performance drops
• Dynamic hashing involves techniques where the sizes may grow gracefully
Copyright 2003Curt Hill
Extendible/Extensible Hashing
• Mechanism for altering the size of the hash table without the usual pain
• Common strategy for internal hashes is to double the hash table and rehash each entry
• This is too expensive for an index• Instead we do incremental doubling of
the buckets and index– Spreads the cost nicely
Copyright 2003Curt Hill
Scheme
• We generate a hash that is in a range much larger than we need
• Typically modulo some large prime number
• Use only the bottom so many bits of that result to select the bucket
• Start the process with just one bit• We also have the notion of global and
local depth
Copyright 2003Curt Hill
First example
01
11 2 1484
5 7 9 111
Bit pattern ends in 0
Bit pattern ends in 1
Numbers shown are hash function output.
Index Buckets
Copyright 2003Curt Hill
Splitting a bucket• The next insertion will overfill a
bucket• The exact action is dependent on
the local and global levels• If the local level = global level
– Add one to global level (number of bits)
– Double the index• Add one more bit
– Double the bucket• Distribute values between the two
• If the local level < global level only double bucket
Copyright 2003Curt Hill
Bucket and Index Split
0
1
11 2 1484
5 7 9 111
2
00011011
1
2
2
2 4 8 14
3 7
5 9
11
Add 3 – split 1 bucket into 01 and 11
Copyright 2003Curt Hill
Continued Insertions
• Notice that there were two pointers to the unsplit bucket
• Insertions to a bucket that has a lower level than the global level only splits the bucket not the index
• It separates the two pointers
Copyright 2003Curt Hill
Bucket Only Split 2
00011011
1
2
2
2 4 8 14
5 9
73 11
2
11
2
2
2
4 8
5 9
73 11
10
01
00
2 2 10 14
Add 10split bucket
Copyright 2003Curt Hill
Extensible Hashing
• When the index exceeds one page– The upper so many bits may be checked so
the entire index is not searched
• The mechanism is different than a tree• The net effect is not that much different• The index may grow smoothly without
changes to the hash function or drastic rewriting
Copyright 2003Curt Hill
Not without problems• When the index is doubled there is
work which is added to the insertion• When the index will not fit in
memory substantial I/Os occur• When number of records per block
is small we can end up with much larger global levels than needed– Suppose 2 records per block and 3
records have the same key for the last 20 bits
– Global level of 20, even when most local levels are in 1-5 range
Copyright 2003Curt Hill
Linear Hashing
• A different scheme with mechanism different than extensible hashing but some common properties– Splits are incrementally added– Some flexibility when they occur
• Like extensible hashing we use the bottom so many bits of a larger hash function
• Round robin bucket splitting• Overflow buckets are used but may be
later consumed
Copyright 2003Curt Hill
Linear Hashing Numbers
• N – the number of buckets– Not always a power of two
• I – the number of used bits in the hash function
• R – the number of records in the structure
• M – the hash result– 0 M 2i
– M may larger or smaller than N– If M > N we use M - 2i-1
Copyright 2003Curt Hill
Adding a bucket
• Any strategy can be used to determine when a bucket is added– Adding a bucket increases N
• When the ratio of records to buckets crosses a threshold
• When a bucket is forced to add an overflow bucket
Copyright 2003Curt Hill
Linear Example
I = 1R = 3N = 2
0000
10100
11111
R/N = 1.5Split > 1.7
Copyright 2003Curt Hill
Adding an Item• When an item is added it is put in the
proper bucket• If if does not fit add an overflow bucket• If the R/N threshold is crossed add a
new bucket– This causes the corresponding bucket to be
redistributed over the two buckets
• The number of hash bits used may be increased
Copyright 2003Curt Hill
Insert 0101I = 1R = 3N = 2
0000
10100
11111
Add 0101 to this
Increases R/N ratio
I = 2R = 4N = 3
000000
01 0101
101010
1111
Copyright 2003Curt Hill
Explanation
• The bucket that was added to was not split– It was not its turn– Buckets are split in round robin
fashion
Copyright 2003Curt Hill
Insert 0111I = 2R = 4N = 3
000000
01 0101
101010
1111
R/N = 1.3
I = 2R = 5N = 3
000000
01 0101
101010
1111
R/N = 1.67
0111
Add 0111No splitOverflow
Copyright 2003Curt Hill
Insertion and Searching• The hash function is now using two bits
which gives it four possibilities, but there are only three buckets
• If the hash result M < N just use that bucket
• If the hash result M N subtract from M 2i-1
• In last case 0111 was inserted – Last two bits are 11, but there is not yet a
11 bucket, so 10 was subtracted from it
• Searching uses same type of scheme
Copyright 2003Curt Hill
Insert 1100I = 2R = 6N = 40000
00
01 0101
101010
11110111
000000
01 0101
101010
1100
011111
1111
R/N = 1.5
Copyright 2003Curt Hill
Dynamic Hashing Summary
• Linear hashing lacks the index of extensible hashing
• There are similarities– Hash function where only the bottom
so many bits are used– Gradual splits– Quick lookups
Copyright 2003Curt Hill
Joins• If both files are sorted on the join
the previously mentioned zipper join is used
• However, if the join field is not the primary key sorting the relation on this field may be expensive – Especially so if the outer join is larger
than an inner join– The number of joined records is small
compared to either relation size
Copyright 2003Curt Hill
Hash Join• Recall that a Cartesian Product makes
all possible combinations of records from two relations– This could mean read numbering the
products of block– That is exactly what we want to avoid
• Hash join partitions two relations into pieces based on a hash function
• Then only joins partitions that reacted similarly to the hash function
• Of course only works on Equi-Joins
Copyright 2003Curt Hill
Hash Join Process• Hash the smaller of the two files on
the join field• Read in the other file• Hash each key into a bucket
– The only candidates for equality are here
• Produce the output• Smaller but still substantial