Hash-Based Indexing (Architecture and Implementation of DBMSs, Chapter 4, © 2001, 2002 T. Grust)


    4 Hash-Based Indexing

We now turn to a different family of index structures: hash indexes. Hash indexes are unbeatable when it comes to equality selections, e.g.

    SELECT *
    FROM   R
    WHERE  A = k .

    If we carefully maintain the hash index while the underlying data file (for relation R) grows or

    shrinks, we can answer such an equality query using a single I/O operation.

    Other query types, like joins, internally initiate a whole flood of such equality tests.

Hash indexes provide no support for range searches, however (hash indexes are also known as scatter storage).

    In a typical DBMS, you will find support for B+ trees as well as hash-based indexing structures.


In a B+ tree world, to locate a record with key k means to compare k with other keys organized in a (tree-shaped) search data structure.

Hash indexes use the bits of k itself (independent of all other stored records and their keys) to find the location of the associated record.

    We will now look into static hashing to illustrate the basic ideas behind hashing.

Static hashing does not handle updates well (much like ISAM).

Later, we will introduce extendible and linear hashing, which refine the hashing principle and adapt well to record insertions and deletions.

    4.1 Static Hashing

    To build a static hash index for an attribute A we need to

1. allocate an area of N (successive) disk pages, the so-called primary buckets (or hash table),
2. in each bucket, install a pointer to a chain of overflow pages (initially, set this pointer to nil),
3. define a hash function h with range [0 ... N − 1]
   (the domain of h is the type of A, e.g., h : INTEGER → [0 ... N − 1] if A has the SQL type INTEGER).


    The resulting setup looks like this:

[Figure: hash table with hash function h mapping a key k to one of the primary buckets 0, 1, 2, ..., N − 1; each primary bucket heads a chain of overflow pages.]

A primary bucket and its associated chain of overflow pages is referred to as a bucket (see the figure above).

Each bucket contains data entries k* (implemented using any of the variants a ... c, see Chapter 2).

To perform hsearch(k) (or hinsert(k)/hdelete(k)) for a record with key A = k,

1. apply hash function h to the key value, i.e., compute h(k),
2. access the primary bucket page with number h(k),
3. then search (insert/delete) the record on this page or, if necessary, access the overflow chain of bucket h(k).
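A minimal executable sketch of this static scheme (Python; the bucket capacity, the in-memory lists standing in for pages, and the concrete hash function h are assumptions made for illustration, not part of the slides):

    N = 8
    CAPACITY = 4                               # data entries per primary page (assumed)
    primary = [[] for _ in range(N)]           # primary buckets
    overflow = [[] for _ in range(N)]          # overflow chain per bucket (flattened to one list)

    def h(k):                                  # assumed hash function
        return k % N

    def hinsert(k):
        b = h(k)
        if len(primary[b]) < CAPACITY:
            primary[b].append(k)               # read the primary page, write it back
        else:
            overflow[b].append(k)              # primary page full: entry goes to the overflow chain

    def hsearch(k):
        b = h(k)
        return k in primary[b] or k in overflow[b]   # a single page access if no overflow chain is touched

    def hdelete(k):
        b = h(k)
        if k in primary[b]:
            primary[b].remove(k)
        elif k in overflow[b]:
            overflow[b].remove(k)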


    If we are lucky or (somehow) avoid chains of overflow pages altogether,

    hsearch(k) needs one I/O operation,

hinsert(k) and hdelete(k) need two I/O operations.

At least for static hashing, overflow chain management is important: Generally, we do not want hash function h to avoid collisions, i.e.,

    h(k) = h(k′)  even if  k ≠ k′

(otherwise we would need as many primary bucket pages as different key values in the data file).

However, we do want h to scatter the domain of the key attribute evenly across [0 ... N − 1] (to avoid the development of extremely long overflow chains for few buckets).

    Such good hash functions are hard to discover, unfortunately

    (see next slide).


The birthday paradox

If you consider the people in a group as the domain and use their birthday as hash function h (i.e., h : Person → [0 ... 364]), chances are already > 50% that two people share the same birthday (collision) if the group has 23 people.

    Check yourself:

1. Compute the probability that n people all have different birthdays (checked in the sketch below):

       different_birthday(n):
           if n = 1 then
               return 1
           else
               return different_birthday(n − 1)      // probability that n − 1 persons have different birthdays
                      × (365 − (n − 1)) / 365         // probability that the nth person has a birthday different from the first n − 1 persons

2. ... or try to find birthday mates at the next larger party.
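The recursion is easy to check (Python):

    # probability that n people all have different birthdays
    def different_birthday(n):
        if n == 1:
            return 1.0
        return different_birthday(n - 1) * (365 - (n - 1)) / 365

    print(1 - different_birthday(23))   # collision probability for 23 people: ~0.507, i.e. > 50%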


If key values were purely random, we could arbitrarily extract a few bits and use these for the hash function. Real key distributions found in DBMSs are far from random, though.

    Fairly good hash functions may be found using the following two simple approaches:

1. By division. Simply define

       h(k) = k mod N .

   This guarantees the range of h(k) to be [0 ... N − 1]. N.B.: If you choose N = 2^d for some d, you effectively consider the least d bits of k only. Prime numbers were found to work best for N.

2. By multiplication. Extract the fractional part of Z · k (for a specific Z)^10 and multiply by hash table size N (N is arbitrary here):

       h(k) = ⌊N · (Z · k − ⌊Z · k⌋)⌋ .

   However, for Z = z/2^w and N = 2^d (w: number of bits in a CPU word, z an integer) we simply have

       h(k) = msb_d(z · k)

   where msb_d(x) denotes the d most significant (leading) bits of x (e.g., msb_3(42) = 5) and the product z · k is computed in w-bit arithmetic.
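Both recipes in executable form (a Python sketch; the choices of N, w, d, and the integer z are illustrative assumptions):

    N = 97                                      # a prime works well for the division method

    def h_div(k):
        return k % N                            # division method, range [0 ... N-1]

    w, d = 32, 10                               # CPU word size and number of leading bits (assumed)
    z = int(((5 ** 0.5 - 1) / 2) * 2 ** w)      # z / 2^w ~ 0.6180339887 (see footnote 10)

    def h_mul(k):
        return ((z * k) % 2 ** w) >> (w - d)    # msb_d of the w-bit product z*k, range [0 ... 2^d - 1]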

    Non-numeric key domains?

How would you hash a non-numeric key domain (e.g., hashing over a CHAR() attribute)?

^10 Z = (√5 − 1)/2 ≈ 0.6180339887 is a good choice. See Don E. Knuth, Sorting and Searching.


    Clearly, if the underlying data file grows, the development of overflow chains spoils the otherwise

predictable hash I/O behaviour (1–2 I/Os).

    Similarly, if the file shrinks significantly, the static hash table may be a waste of space

    (data entry slots in the primary buckets remain unallocated).

    4.2 Extendible Hashing

    Extendible hashing is prepared to adapt to growing (or shrinking) data files.

To keep track of the actual primary buckets which are part of our current hash table, we hash via an in-memory bucket directory (ignore the 2 fields for now)^11:

[Figure: directory of global depth 2 with entries 00, 01, 10, 11 pointing to buckets A, B, C, D (each of local depth 2); bucket A holds 4*, 12*, 32*, 16*, bucket B holds 1*, 5*, 21*, bucket C holds 10*, bucket D holds 15*, 7*, 19*.]

^11 N.B.: In this figure we have depicted the data entries as h(k)* (not as k*).


To search for a record with key k:

1. apply h, i.e., compute h(k),
2. consider the last 2 bits of h(k) and follow the corresponding directory pointer to find the bucket.

Example:

To find a record with key k such that h(k) = 5 = 101₂, follow the second directory pointer (01) to bucket B, then use entry 5* to access the wanted record.

The meaning of the fields might now become clear:

n at the hash directory (global depth):
    Use the last n bits of h(k) to look up a bucket pointer in this directory. (The directory size is 2^n.)

d at a bucket (local depth):
    The hash values h(k) of all data entries in this bucket agree on their last d bits.
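In code, the directory lookup is just a mask of the last n bits (a minimal Python sketch):

    def directory_index(hk, n):
        # use the last n bits of h(k) to index the 2^n directory slots
        return hk & (2 ** n - 1)    # e.g. n = 2, h(k) = 5 = 101 (binary)  ->  01 (binary) = 1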

To insert a record with key k:

1. apply h, i.e., compute h(k),
2. use the last 2 bits of h(k) to look up the bucket pointer in the directory,
3. if the bucket still has capacity, store h(k)* in it, otherwise ... ?


Example (no bucket overflow):

To insert a record with key k such that h(k) = 13 = 1101₂, follow the second directory pointer (01) to bucket B (which still has empty slots) and place 13* there:

[Figure: as before, but bucket B now holds 1*, 5*, 21*, 13*; the global depth and all local depths remain 2.]

    Example (continued, bucket overflow):

Inserting a record with key k such that h(k) = 20 = 10100₂ lets bucket A overflow.

    We thus initiate a bucket split for bucket A.


1. Split bucket A (creating a new bucket A2) and use bit number 2 + 1 = 3 to redistribute the entries:

       4 = 100₂, 12 = 1100₂, 32 = 100000₂, 16 = 10000₂, 20 = 10100₂

   Entries whose third-last bit is 0 (32*, 16*) stay in bucket A; entries whose third-last bit is 1 (4*, 12*, 20*) go to bucket A2.

   N.B.: we now need 3 bits to discriminate between the old bucket A and the new split bucket A2.

2. In this case we double the directory by simply copying its original pages (we now use 2 + 1 = 3 bits to look up a bucket pointer).

3. Let the bucket pointer for 100₂ point to A2 (the directory pointer for 000₂ still points to bucket A):

[Figure: doubled directory (global depth 3) with entries 000 ... 111; bucket A (local depth 3) holds 32*, 16*; the new bucket A2 (local depth 3) holds 4*, 12*, 20*; buckets B (1*, 5*, 21*, 13*), C (10*), and D (15*, 7*, 19*) keep local depth 2 and are each referenced by two directory entries.]


If we split a bucket with local depth d < n (global depth), directory doubling is not necessary:

Consider the insertion of a record with key k and hash value h(k) = 9 = 1001₂.

The associated bucket B is split (creating a new bucket B2) and entries are redistributed. The new local depth of bucket B is 3 (and thus does not exceed the global depth of 3).

Modifying the directory's bucket pointer for 101₂ suffices:

[Figure: directory of global depth 3; entry 001 still points to bucket B (now holding 1*, 9*, local depth 3), entry 101 points to the new bucket B2 (holding 5*, 21*, 13*, local depth 3); buckets A, A2, C, D are unchanged.]


Algorithm: hsearch(k)
Input: search for hashed record with key value k
Output: pointer to hash bucket containing potential hit(s)

    n ← n̂;                        // global depth of hash directory
    b ← h(k) & (2^n − 1);
    return bucket[b];

Algorithm: hinsert(k)
Input: entry k* to be inserted
Output: new global depth of extendible hash directory

    n ← n̂;                        // global depth of hash directory
    b ← hsearch(k);
    if b has capacity then
        place h(k)* in bucket b;
        return n;
    else
        // bucket b overflows, we need to split
        d ← d̂;                    // local depth of bucket b
        create a new empty bucket b2;
        // redistribute entries of bucket b including h(k)*
        foreach h(k)* in bucket b do
            if h(k) & 2^d ≠ 0 then
                move h(k)* to bucket b2;
        d ← d + 1;                 // new local depth of buckets b and b2
        if n < d then
            // we need to double the directory
            allocate 2^n new directory entries bucket[2^n ... 2^(n+1) − 1];
            copy bucket[0 ... 2^n − 1] into bucket[2^n ... 2^(n+1) − 1];
            n ← n + 1;
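A self-contained Python sketch of the same split/doubling logic, including the directory pointer update for the new bucket (the class names, the bucket capacity, and storing the hash values themselves as data entries are assumptions made for illustration; re-splitting when all redistributed entries land in one half is omitted):

    CAPACITY = 4                                        # data entries per bucket (assumed)

    class Bucket:
        def __init__(self, d):
            self.d = d                                  # local depth
            self.entries = []                           # data entries (here: the hash values h(k))

    class ExtendibleHash:
        def __init__(self):
            self.n = 2                                  # global depth
            self.dir = [Bucket(2) for _ in range(4)]    # directory of 2^n bucket pointers

        def lookup(self, hk):
            return self.dir[hk & (2 ** self.n - 1)]     # last n bits of h(k)

        def insert(self, hk):
            b = self.lookup(hk)
            if len(b.entries) < CAPACITY:
                b.entries.append(hk)
                return
            # bucket b overflows: bit index b.d (counting from 0) now discriminates old vs. new bucket
            b2 = Bucket(b.d + 1)
            keep = [e for e in b.entries + [hk] if e & (1 << b.d) == 0]
            move = [e for e in b.entries + [hk] if e & (1 << b.d) != 0]
            b.entries, b2.entries, b.d = keep, move, b.d + 1
            if self.n < b.d:                            # local depth would exceed global depth:
                self.dir = self.dir + self.dir          # double the directory by copying it
                self.n += 1
            for i in range(2 ** self.n):                # re-aim the directory slots that now select b2
                if self.dir[i] is b and i & (1 << (b.d - 1)):
                    self.dir[i] = b2

Under these assumptions, inserting the hash values 4, 12, 32, 16, 1, 5, 21, 10, 15, 7, 19, 13 and then 20 reproduces the directory doubling shown in the example above (000 → {32*, 16*}, 100 → {4*, 12*, 20*}).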


    Overflow chains?

    Extendible hashing uses overflow chains hanging off a bucket only as a last resort. Under

    which circumstances will extendible hashing create an overflow chain?

Deleting an entry h(k)* from a bucket with local depth d may leave this bucket completely empty. Extendible hashing then merges the empty bucket and its associated partner bucket (note that 2^(n − d) directory entries point to a bucket of local depth d).

You should work out the details on your own.

    4.3 Linear Hashing

    Linear hashing can, just like extendible hashing, adapt its underlying data structure to record

    insertions and deletions:

Linear hashing does not need a hash directory in addition to the actual hash table buckets,

... but linear hashing may perform badly if the key distribution in the data file is skewed.

We will now investigate linear hashing in detail and come back to the two points above as we go along.


The core idea behind linear hashing is to use an ordered family of hash functions h0, h1, h2, ... (traditionally the subscript is called the hash function's level).

We design the family so that the range of h_level+1 is twice as large as the range of h_level (for level = 0, 1, 2, ...).

This relationship is depicted below. Here, h_level has range [0 ... N − 1] (i.e., a range of size N):

[Figure: the range [0 ... N − 1] of h_level forms the first half of the range [0 ... 2N − 1] of h_level+1, which in turn forms the first half of the range [0 ... 4N − 1] of h_level+2.]

Given an initial hash function h and an initial hash table size N, one approach to define such a family of hash functions h0, h1, h2, ... would be

    h_level(k) = h(k) mod (2^level · N)

(for level = 0, 1, 2, ...).
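This family is a one-liner in code (a Python sketch; the underlying hash function h and the table size N are placeholders, not prescribed by the slides):

    # hash function family for linear hashing: the range doubles with each level
    def h_level(level, k, h=hash, N=4):
        return h(k) % (2 ** level * N)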


The basic linear hashing scheme then goes like this:

Start with level = 0, next = 0.

The current hash function in use for searches (insertions/deletions) is h_level; active hash table buckets are those in h_level's range: [0 ... 2^level · N − 1].

Whenever we realize that the current hash table overflows, e.g.,

    insertions filled a primary bucket beyond c % capacity,
    or the overflow chain of a bucket grew longer than l pages,
    or ...

we split the bucket at hash table position next (not the bucket which triggered the split!):

1. allocate a new bucket, append it to the hash table (its position will be 2^level · N + next),
2. redistribute the entries in bucket next by rehashing them via h_level+1 (some entries remain in bucket next, some go to bucket 2^level · N + next. Which?),

   [Figure: h_level+1 sends each entry of bucket next either back to position next or to the new position 2^level · N + next at the end of the table.]

3. increment next by 1.


As we proceed, next walks down the table. Hashing via h_level has to take care of next's position:

    h_level(k) < next : we hit an already split bucket; rehash, i.e., find the record in bucket h_level+1(k).
    h_level(k) ≥ next : we hit a yet unsplit bucket; bucket found.

[Figure: buckets 0 ... next − 1 have already been split (hashed via h_level+1); buckets next ... 2^level · N − 1 are yet unsplit (hashed via h_level); the buckets appended after position 2^level · N − 1 are the images of the already split buckets (h_level+1); next marks the next bucket to be split.]


If next is incremented beyond the hash table size ...?

A bucket split increments next by 1 to mark the next bucket to be split. How would you propose to handle the situation if next is incremented beyond the last current hash table position, i.e., next > 2^level · N − 1?

Answer:

N.B.: If next > 2^level · N − 1, all buckets in the current hash table are hashed via h_level+1 (see previous slide).

Linear hashing thus proceeds in a round-robin fashion: if next > 2^level · N − 1, then

1. increment level by 1,
2. reset next to 0 (start splitting from the hash table top again).

Remark:

In general, an overflowing bucket is not split immediately, but (due to round-robin splitting) no later than in the following round.


Running example:

Linear hash table: primary bucket capacity of 4, initial hash table size N = 4, level = 0, next = 0:

[Figure: buckets 00 (32*, 44*, 36*), 01 (9*, 25*, 5*), 10 (14*, 18*, 10*, 30*), 11 (31*, 35*, 11*, 7*); the h1 column already shows the 3-bit addresses 000 ... 011; next points to bucket 00.]

Insert record with key k such that h0(k) = 43:

[Figure: 43 = 101011₂ hashes to bucket 11, which is full; 43* goes to a new overflow page of bucket 11. This triggers a split of bucket next = 00: its entries are rehashed via h1, 32* stays in bucket 000, 44* and 36* move to the new bucket 100 appended at the end of the table; next now points to bucket 01.]


Insert record with key k such that h0(k) = 37:

[Figure: 37 = 100101₂ hashes to bucket 01, which still has a free slot; 37* is placed there (9*, 25*, 5*, 37*); no split is triggered.]

Insert record with key k such that h0(k) = 29:

[Figure: 29 = 11101₂ hashes to bucket 01, which is now full; the insertion triggers a split of bucket next = 01: rehashing via h1 leaves 9*, 25* in bucket 001 and sends 5*, 37*, 29* to the new bucket 101; next now points to bucket 10.]


Insert 3 records with keys k such that h0(k) = 22 (66, 34):

[Figure: 22* overflows bucket 10 and triggers the split of bucket next = 10; rehashing via h1 leaves 18*, 10* in bucket 010 and sends 14*, 30*, 22* to the new bucket 110; the subsequent inserts of 66* and 34* land in bucket 010 (h0(66) = h0(34) = 2 < next, so they are rehashed via h1); next now points to bucket 11.]

Insert record with key k such that h0(k) = 50:

[Figure: h0(50) = 2 < next, so 50 is rehashed via h1 to bucket 010, which is full; 50* goes to an overflow page of bucket 010 and the split of bucket next = 11 is triggered: 35*, 11*, 43* stay in bucket 011, 31*, 7* move to the new bucket 111. Now next exceeds 2^level · N − 1 = 3, so level becomes 1 and next is reset to bucket 000.]


Algorithm: hsearch(k)
Input: search for hashed record with key value k
Output: pointer to hash bucket containing potential hit(s)

    b ← h_level(k);
    if b < next then
        // bucket b has already been split,
        // the record for key k may be in bucket b or in bucket 2^level · N + b,
        // rehash:
        b ← h_level+1(k);
    return bucket[b];

Algorithm: hinsert(k)
Input: entry k* to be inserted
Output: none

    b ← h_level(k);
    if b < next then
        // bucket b has already been split, rehash:
        b ← h_level+1(k);
    place h(k)* in bucket[b];
    if full(bucket[b]) then
        // the last insertion triggered a split of bucket next
        allocate a new bucket b;
        bucket[2^level · N + next] ← b;
        // rehash the entries of bucket next
        foreach entry with key k in bucket[next] do
            place entry in bucket[h_level+1(k)];
        next ← next + 1;
        // did we split every bucket in the original hash table?
        if next > 2^level · N − 1 then
            // round-robin: start a new splitting round
            level ← level + 1;
            next ← 0;
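A compact Python sketch of both algorithms (the class name, the primary bucket capacity, the initial N, the identity choice for h, and the trigger "split whenever an insertion overfills a primary bucket" are assumptions made for illustration):

    class LinearHash:
        def __init__(self, N=4, capacity=4):
            self.N, self.capacity = N, capacity
            self.level, self.next = 0, 0
            self.buckets = [[] for _ in range(N)]        # primary + overflow entries per bucket

        def _h(self, level, k):
            return k % (2 ** level * self.N)             # h_level(k) = h(k) mod (2^level * N), h = identity

        def _bucket_no(self, k):
            b = self._h(self.level, k)
            if b < self.next:                            # bucket b has already been split: rehash
                b = self._h(self.level + 1, k)
            return b

        def search(self, k):
            return k in self.buckets[self._bucket_no(k)]

        def insert(self, k):
            b = self._bucket_no(k)
            self.buckets[b].append(k)
            if len(self.buckets[b]) > self.capacity:     # overflow: split the bucket at position next
                self.buckets.append([])                  # new bucket at position 2^level * N + next
                old, self.buckets[self.next] = self.buckets[self.next], []
                for e in old:                            # rehash the entries of bucket next
                    self.buckets[self._h(self.level + 1, e)].append(e)
                self.next += 1
                if self.next > 2 ** self.level * self.N - 1:
                    self.level += 1                      # round-robin: start a new splitting round
                    self.next = 0

With these assumptions, replaying the inserts of the running example (43, 37, 29, 22, 66, 34, 50 into the initial table) ends with level = 1, next = 0, and the bucket contents shown on the previous slides.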


hdelete(k) for a linear hash table can essentially be implemented as the inverse of hinsert(k):

Algorithm: hdelete(k)
Input: key k of entry to be deleted
Output: none

    ...
    if empty(bucket[2^level · N + next]) then
        // the last bucket in the hash table is empty, remove it
        remove bucket[2^level · N + next] from hash table;
        next ← next − 1;
        if next < 0 then
            // undo one round of splitting
            level ← level − 1;
            next ← 2^level · N − 1;


    Chapter Contents

4.1 Static Hashing
4.2 Extendible Hashing
4.3 Linear Hashing
