Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

31
Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Transcript of Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Page 1: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Hash indexes

Are they better or worse than a B+Tree?

Page 2: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Tasks

• Consider the basics of hashing• Consider how this applies to

indexing schemes• Consider variations • Consider the Hash Join

Page 3: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Basics of hashing

• Internal hashing– A table in memory– Bucket usually holds one entry– Covered in a separate presentation:

• Hashing.ppt

• External hashing– On disk– Bucket is a page – holds multiple

entries

Page 4: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

External Hashing

• Hash function takes the key and computes an integer

• How is this integer used?• Direct file

– Key is an integer• Directory of heap file

– Works well if one directory page can hold the correct number of page ids

• Lookup table– Converts integer to page id number

Page 5: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

How does it work?

• Hash function takes the key and computes a page number

• Search the page for the correct data page

• Access the data• For very large indices the number of

accesses can still be quite small

Page 6: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Diagram

Data . . .

Hash buckets

. . .

0 1 2 N-1

Page 7: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Example• Assume 1 million records with 4 records

to a page– 250000 pages of data

• Assume the key and pointer is 24 bytes with a 512 byte page– 21 keys and pointers in a page

• Assume buckets are ¾ full– 67000 buckets with 15 keys

• Hash function computes a value in range 0 – 66999

• Without collisions it takes two accesses to get data

Page 8: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Static and Dynamic

• Static hashing works well until the buckets fill up– Then a bucket requires an overflow

bucket– Searching the original and overflow

pages increases the accesses and performance drops

• Dynamic hashing involves techniques where the sizes may grow gracefully

Page 9: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Extendible/Extensible Hashing

• Mechanism for altering the size of the hash table without the usual pain

• Common strategy for internal hashes is to double the hash table and rehash each entry

• This is too expensive for an index• Instead we do incremental doubling of

the buckets and index– Spreads the cost nicely

Page 10: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Scheme

• We generate a hash that is in a range much larger than we need

• Typically modulo some large prime number

• Use only the bottom so many bits of that result to select the bucket

• Start the process with just one bit• We also have the notion of global and

local depth

Page 11: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

First example

01

11 2 1484

5 7 9 111

Bit pattern ends in 0

Bit pattern ends in 1

Numbers shown are hash function output.

Index Buckets

Page 12: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Splitting a bucket• The next insertion will overfill a

bucket• The exact action is dependent on

the local and global levels• If the local level = global level

– Add one to global level (number of bits)

– Double the index• Add one more bit

– Double the bucket• Distribute values between the two

• If the local level < global level only double bucket

Page 13: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Bucket and Index Split

0

1

11 2 1484

5 7 9 111

2

00011011

1

2

2

2 4 8 14

3 7

5 9

11

Add 3 – split 1 bucket into 01 and 11

Page 14: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Continued Insertions

• Notice that there were two pointers to the unsplit bucket

• Insertions to a bucket that has a lower level than the global level only splits the bucket not the index

• It separates the two pointers

Page 15: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Bucket Only Split 2

00011011

1

2

2

2 4 8 14

5 9

73 11

2

11

2

2

2

4 8

5 9

73 11

10

01

00

2 2 10 14

Add 10split bucket

Page 16: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Extensible Hashing

• When the index exceeds one page– The upper so many bits may be checked so

the entire index is not searched

• The mechanism is different than a tree• The net effect is not that much different• The index may grow smoothly without

changes to the hash function or drastic rewriting

Page 17: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Not without problems• When the index is doubled there is

work which is added to the insertion• When the index will not fit in

memory substantial I/Os occur• When number of records per block

is small we can end up with much larger global levels than needed– Suppose 2 records per block and 3

records have the same key for the last 20 bits

– Global level of 20, even when most local levels are in 1-5 range

Page 18: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Linear Hashing

• A different scheme with mechanism different than extensible hashing but some common properties– Splits are incrementally added– Some flexibility when they occur

• Like extensible hashing we use the bottom so many bits of a larger hash function

• Round robin bucket splitting• Overflow buckets are used but may be

later consumed

Page 19: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Linear Hashing Numbers

• N – the number of buckets– Not always a power of two

• I – the number of used bits in the hash function

• R – the number of records in the structure

• M – the hash result– 0 M 2i

– M may larger or smaller than N– If M > N we use M - 2i-1

Page 20: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Adding a bucket

• Any strategy can be used to determine when a bucket is added– Adding a bucket increases N

• When the ratio of records to buckets crosses a threshold

• When a bucket is forced to add an overflow bucket

Page 21: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Linear Example

I = 1R = 3N = 2

0000

10100

11111

R/N = 1.5Split > 1.7

Page 22: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Adding an Item• When an item is added it is put in the

proper bucket• If if does not fit add an overflow bucket• If the R/N threshold is crossed add a

new bucket– This causes the corresponding bucket to be

redistributed over the two buckets

• The number of hash bits used may be increased

Page 23: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Insert 0101I = 1R = 3N = 2

0000

10100

11111

Add 0101 to this

Increases R/N ratio

I = 2R = 4N = 3

000000

01 0101

101010

1111

Page 24: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Explanation

• The bucket that was added to was not split– It was not its turn– Buckets are split in round robin

fashion

Page 25: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Insert 0111I = 2R = 4N = 3

000000

01 0101

101010

1111

R/N = 1.3

I = 2R = 5N = 3

000000

01 0101

101010

1111

R/N = 1.67

0111

Add 0111No splitOverflow

Page 26: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Insertion and Searching• The hash function is now using two bits

which gives it four possibilities, but there are only three buckets

• If the hash result M < N just use that bucket

• If the hash result M N subtract from M 2i-1

• In last case 0111 was inserted – Last two bits are 11, but there is not yet a

11 bucket, so 10 was subtracted from it

• Searching uses same type of scheme

Page 27: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Insert 1100I = 2R = 6N = 40000

00

01 0101

101010

11110111

000000

01 0101

101010

1100

011111

1111

R/N = 1.5

Page 28: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Dynamic Hashing Summary

• Linear hashing lacks the index of extensible hashing

• There are similarities– Hash function where only the bottom

so many bits are used– Gradual splits– Quick lookups

Page 29: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Joins• If both files are sorted on the join

the previously mentioned zipper join is used

• However, if the join field is not the primary key sorting the relation on this field may be expensive – Especially so if the outer join is larger

than an inner join– The number of joined records is small

compared to either relation size

Page 30: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Hash Join• Recall that a Cartesian Product makes

all possible combinations of records from two relations– This could mean read numbering the

products of block– That is exactly what we want to avoid

• Hash join partitions two relations into pieces based on a hash function

• Then only joins partitions that reacted similarly to the hash function

• Of course only works on Equi-Joins

Page 31: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?

Copyright 2003Curt Hill

Hash Join Process• Hash the smaller of the two files on

the join field• Read in the other file• Hash each key into a bucket

– The only candidates for equality are here

• Produce the output• Smaller but still substantial