Hashing

Static Hashing, Dynamic Hashing


Sungkyunkwan University, Hyoung-Kee Choi ©

Symbol table ADT

We define the symbol table as a set of name-attribute pairs.
- In a thesaurus, the name is a word, and the attribute is a list of synonyms for the word.
- In a symbol table for a compiler, the name is an identifier, and the attributes might include an initial value and a list of lines that use the identifier.

Generally we would want to perform the following operations on any symbol table:
- Determine if a particular name is in the table
- Retrieve the attributes of that name
- Modify the attributes of that name
- Insert a new name and its attributes
- Delete a name and its attributes



There are only three basic operations on symbol tables: searching, inserting, and deleting.

To implement these operations, we could use the binary search tree introduced in Section 5.7, whose worst-case complexity is O(n), or some other search tree with O(log n) complexity.

In this chapter we examine a technique for search, insert, and delete operations that has very good expected performance. This technique is referred to as hashing.

Unlike search tree methods which rely on identifier comparisons to perform a search, hashing relies on a formula called the hash function.



Static hashing

In static hashing, we store the identifiers in a fixed-size table called a hash table. We use an arithmetic function, f, to determine the address, or location, of an identifier, x, in the table. Thus, f(x) gives the hash, or home address, of x in the table.

The hash table ht is stored in sequential memory locations that are partitioned into b buckets, ht[0], …, ht[b-1]. Each bucket has s slots.



Definition: The identifier density of a hash table is the ratio n/T, where n is the number of identifiers in the table and T is the total number of possible identifiers. The loading density or loading factor of a hash table is α = n/(sb).

Two identifiers, i1 and i2, are synonyms with respect to f if f(i1) = f(i2).

An overflow occurs when we hash a new identifier, i, into a full bucket.

A collision occurs when we hash two nonidentical identifiers into the same bucket. When the bucket size is 1, collisions and overflows occur simultaneously.



When no overflow occurs, the time required to enter, delete, or search for identifiers does not depend on the number of identifiers n in use; it is O(1).

Since the ratio b/T is usually small, we cannot avoid collisions altogether.

Example 8.1: b = 26, s = 2, f(x) = the first character of x

Bucket   Slot 0   Slot 1
0        acos     atan
1
2        char     ceil
3        define
4        exp
5        float    floor
6
…
25

Hash table with 26 buckets and two slots per bucket
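As a rough illustration of Example 8.1, the following C sketch stores identifiers in b = 26 buckets of s = 2 slots each and reports an overflow when a third synonym arrives. The names ht, first_char_hash, and bucket_insert, as well as the MAX_CHAR limit, are assumptions made for this sketch; it also assumes identifiers start with a lowercase ASCII letter.

```c
#include <assert.h>
#include <string.h>

#define B 26          /* number of buckets, as in Example 8.1 */
#define S 2           /* slots per bucket */
#define MAX_CHAR 16   /* assumed maximum identifier length (incl. '\0') */

static char ht[B][S][MAX_CHAR];   /* an empty string marks a free slot */

/* f(x) = first character of x, mapped to 0..25 (assumes ASCII lowercase) */
int first_char_hash(const char *x)
{
    assert(x[0] >= 'a' && x[0] <= 'z');
    return x[0] - 'a';
}

/* Returns 1 if x was stored, 0 if x is already present,
 * and -1 on overflow (the home bucket is full of other synonyms). */
int bucket_insert(const char *x)
{
    int home = first_char_hash(x);
    for (int slot = 0; slot < S; slot++) {
        if (ht[home][slot][0] == '\0') {          /* free slot: store x */
            strncpy(ht[home][slot], x, MAX_CHAR - 1);
            ht[home][slot][MAX_CHAR - 1] = '\0';
            return 1;
        }
        if (strcmp(ht[home][slot], x) == 0)       /* x already stored */
            return 0;
    }
    return -1;                                    /* overflow */
}
```

For instance, once acos and atan have been inserted, bucket_insert("atoi") would return -1, because both slots of bucket 0 already hold synonyms of atoi.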



Hashing functions

A hash function, f, transforms an identifier, x, into a bucket address in the hash table.

We want a hash function that is easy to compute and that minimizes the number of collisions.
- To avoid collisions, the hash function should depend on all the characters in an identifier.
- Hashing functions should be unbiased. That is, if we randomly choose an identifier, x, from the identifier space, the probability that f(x) = i is 1/b for all buckets i.

We call a hash function that satisfies the unbiased property a uniform hash function.



Four types of uniform hash functions:
- Mid-square
- Division
- Folding
- Digit analysis

Mid-square

We compute the function f by squaring the identifier and then using an appropriate number of bits from the middle of the square to obtain the bucket address.

Since the middle bits of the square usually depend upon all the characters in an identifier, there is a high probability that different identifiers will produce different hash addresses.

The number of bits used to obtain the bucket address depends on the table size. If we use r bits, the range of the values is $2^r$.
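A minimal C sketch of the mid-square idea follows. It assumes the identifier has already been converted to a 32-bit unsigned key, and the function name mid_square_hash and the choice of centering the r bits in the 64-bit square are illustrative assumptions, not part of the original slides.

```c
#include <assert.h>
#include <stdint.h>

/* Mid-square hashing: square the key and take r bits from the
 * middle of the 64-bit square. The result lies in 0 .. 2^r - 1. */
uint32_t mid_square_hash(uint32_t key, unsigned r)
{
    assert(r >= 1 && r <= 32);
    uint64_t square = (uint64_t)key * key;     /* cannot overflow 64 bits */
    unsigned shift = (64 - r) / 2;             /* drop the low-order bits */
    uint64_t mask = (UINT64_C(1) << r) - 1;    /* keep exactly r bits */
    return (uint32_t)((square >> shift) & mask);
}
```

With r = 10 the function yields addresses for a table of 1024 buckets; a different r would be chosen to match another table size.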



Division

We divide the identifier x by some number M and use the remainder as the hash address for x:

f(x) = x % M

This gives bucket addresses that range from 0 to M - 1, where M is the table size.

The choice of M is critical.
- If M is divisible by 2, then odd keys are mapped to odd buckets and even keys are mapped to even buckets.
- When many identifiers are permutations of each other, a biased use of the table results.

A good choice for M is a prime number such that M does not divide $r^k \pm a$ for small k and a, where r is the radix used to convert identifiers to numbers. In practice, choose M such that it has no prime divisors less than 20.
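The division method can be sketched in C as below. Converting the identifier to a number by summing its character codes (an additive transformation) is only one possible choice, and the names string_to_number and division_hash are assumptions for this sketch; the character codes assume ASCII.

```c
#include <assert.h>

/* Additive transformation: convert an identifier to a number by
 * summing its character codes. */
unsigned string_to_number(const char *x)
{
    unsigned sum = 0;
    while (*x)
        sum += (unsigned char)*x++;
    return sum;
}

/* Division method: f(x) = x % M, giving an address in 0 .. M-1. */
unsigned division_hash(const char *x, unsigned m)
{
    assert(m > 0);
    return string_to_number(x) % m;
}
```

Note that this particular transformation maps identifiers that are permutations of each other (for example "ab" and "ba") to the same number, so they always collide regardless of M.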



Folding

We partition the identifier x into several parts. All parts, except for the last one, have the same length. We then add the parts together to obtain the hash address for x.

There are two ways of carrying out this addition.
- Shift folding: We shift all parts except for the last one, so that the least significant bit of each part lines up with the corresponding bit of the last part. We then add the parts together to obtain f(x).
  Ex: suppose that we have divided the identifier x into the following parts: x1 = 123, x2 = 203, x3 = 241, x4 = 112, and x5 = 20. We would align x1 through x4 with x5 and add. This gives us a hash address of 699.
- Folding at the boundaries: We reverse every other partition before adding.
  Ex: suppose the identifier x is divided into the same partitions as in shift folding. We would reverse the second and fourth partitions, that is, x2 = 302 and x4 = 211, and add the partitions. This gives us a hash address of 897.
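The two folding variants can be checked with a short C sketch. It works on decimal parts, as in the example above; the function names and the width parameter (the number of digits in a full part) are assumptions introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Reverse the decimal digits of a part that is `width` digits wide. */
static unsigned reverse_digits(unsigned part, unsigned width)
{
    unsigned reversed = 0;
    for (unsigned i = 0; i < width; i++) {
        reversed = reversed * 10 + part % 10;
        part /= 10;
    }
    return reversed;
}

/* Shift folding: add all parts as they are. */
unsigned shift_fold(const unsigned parts[], size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += parts[i];
    return sum;
}

/* Folding at the boundaries: reverse every other part
 * (the 2nd, 4th, ...) before adding. */
unsigned boundary_fold(const unsigned parts[], size_t n, unsigned width)
{
    assert(width > 0);
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (i % 2 == 1) ? reverse_digits(parts[i], width) : parts[i];
    return sum;
}
```

Called on the parts 123, 203, 241, 112, 20 with width 3, shift_fold gives 699 and boundary_fold gives 897, matching the two examples.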



Digit analysis

Digit analysis is used with static files. A static file is one in which all the identifiers are known in advance.

Using this method,
- We first transform the identifiers into numbers using some radix, r.
- We then examine the digits of each identifier, deleting those digits that have the most skewed distribution.
- We continue deleting digits until the number of remaining digits is small enough to give an address in the range of the hash table.

Of these methods, the one most suitable for general-purpose applications is the division method with a divisor, M, such that M has no prime factors less than 20.



Overflow Handling (1/8)

Linear open addressing (linear probing)
- Compute f(x) for identifier x.
- Examine the buckets ht[(f(x)+j)%TABLE_SIZE], 0 ≤ j ≤ TABLE_SIZE, until one of the following happens:
  - The bucket contains x.
  - The bucket contains the empty string (insert x into it).
  - The bucket contains a nonempty string other than x (examine the next bucket, wrapping around circularly).
  - We return to the home bucket ht[f(x)]: the table is full, so we report an error condition and exit.
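The probing steps above might be coded as follows. This is a sketch under stated assumptions: one slot per bucket, a 26-bucket table, a first-character hash (named home_bucket here), and return codes instead of exiting on a full table. All of these names and conventions are illustrative.

```c
#include <assert.h>
#include <string.h>

#define TABLE_SIZE 26
#define MAX_CHAR 16   /* assumed maximum identifier length (incl. '\0') */

static char ht[TABLE_SIZE][MAX_CHAR];   /* empty string = free bucket */

/* Assumed hash function: first character, lowercase ASCII only. */
static int home_bucket(const char *x)
{
    assert(x[0] >= 'a' && x[0] <= 'z');
    return (x[0] - 'a') % TABLE_SIZE;
}

/* Returns the number of buckets examined when x is stored,
 * 0 if x is already in the table, and -1 if the table is full. */
int linear_insert(const char *x)
{
    int home = home_bucket(x);
    for (int j = 0; j < TABLE_SIZE; j++) {
        int pos = (home + j) % TABLE_SIZE;       /* circular scan */
        if (ht[pos][0] == '\0') {                /* empty: insert here */
            strncpy(ht[pos], x, MAX_CHAR - 1);
            ht[pos][MAX_CHAR - 1] = '\0';
            return j + 1;
        }
        if (strcmp(ht[pos], x) == 0)             /* bucket contains x */
            return 0;
    }
    return -1;                                   /* back at home: full */
}

/* Returns the bucket index holding x, or -1 if x is absent. */
int linear_search(const char *x)
{
    int home = home_bucket(x);
    for (int j = 0; j < TABLE_SIZE; j++) {
        int pos = (home + j) % TABLE_SIZE;
        if (ht[pos][0] == '\0')
            return -1;                           /* hit an empty bucket */
        if (strcmp(ht[pos], x) == 0)
            return pos;
    }
    return -1;
}
```

Because a search stops at the first empty bucket, this sketch does not support deletion; removing an identifier would need a tombstone marker or re-insertion of the identifiers that follow it.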



Overflow Handling (2/8)

Additive transformation and division

[Figure: hash table with linear probing (13 buckets, 1 slot/bucket), showing insertion]



Overflow Handling (3/8)

Problem of linear probing
- Identifiers tend to cluster together.
- Adjacent clusters tend to coalesce.
- Both effects increase the search time.

Example: suppose we enter the C built-in functions into a 26-bucket hash table in order. The hash function uses the first character in each function name.

Enter sequence: acos, atoi, char, define, exp, ceil, cos, float, atol, floor, ctime

[Figure: hash table with linear probing (26 buckets, 1 slot/bucket)]

# of key comparisons = 41/11 = 3.73



Overflow Handling (4/8)

Alternative techniques to improve the open addressing approach:
- Quadratic probing
- Rehashing
- Random probing

Rehashing
- Try f1, f2, …, fm in sequence if a collision occurs.
- Disadvantage: comparison of identifiers with different hash values.
- Use chains to resolve collisions instead.



Overflow Handling (5/8)

Quadratic probing
- Linear probing searches buckets (f(x)+i)%b.
- Quadratic probing uses a quadratic function of i as the increment.
- Examine buckets f(x), (f(x)+$i^2$)%b, (f(x)-$i^2$)%b, for 1 <= i <= (b-1)/2.
- When b is a prime number of the form 4j+3, where j is an integer, the quadratic search examines every bucket in the table.

Prime   j      Prime   j
3       0      43      10
7       1      59      14
11      2      127     31
19      4      251     62
23      5      503     125
31      7      1019    254
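The coverage claim for primes of the form 4j+3 can be explored with a small C sketch that generates the probe sequence and counts how many distinct buckets it visits. The function names and the MAX_B bound are assumptions made for this illustration.

```c
#include <assert.h>
#include <string.h>

#define MAX_B 1024   /* assumed upper bound on the table size */

/* Fills probes[] with f, (f+1)%b, (f-1)%b, (f+4)%b, (f-4)%b, ...
 * for i = 1 .. (b-1)/2 and returns the number of probes written.
 * probes[] must have room for b entries. */
int quadratic_probe_sequence(int home, int b, int probes[])
{
    assert(b > 0 && home >= 0 && home < b);
    int count = 0;
    probes[count++] = home;
    for (int i = 1; i <= (b - 1) / 2; i++) {
        int sq = (int)(((long long)i * i) % b);
        probes[count++] = (home + sq) % b;
        probes[count++] = (home - sq + b) % b;   /* keep it non-negative */
    }
    return count;
}

/* Counts how many different buckets the sequence visits. */
int distinct_buckets_probed(int home, int b)
{
    int probes[MAX_B];
    char seen[MAX_B];
    int distinct = 0;

    assert(b > 0 && b <= MAX_B);
    memset(seen, 0, sizeof seen);
    int count = quadratic_probe_sequence(home, b, probes);
    for (int k = 0; k < count; k++)
        if (!seen[probes[k]]) {
            seen[probes[k]] = 1;
            distinct++;
        }
    return distinct;
}
```

For b = 11 (j = 2) the sequence visits all 11 buckets, whereas for b = 13, a prime that is not of the form 4j+3, it visits only 7 of them.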



Overflow Handling (6/8)

Chaining
- Linear probing and its variations perform poorly because inserting an identifier requires the comparison of identifiers with different hash values.
- In this approach we maintain a list of synonyms for each bucket.
- To insert a new element:
  - Compute the hash address f(x).
  - Examine the identifiers in the list for f(x).
- Since we would not know the sizes of the lists in advance, we should maintain them as linked chains.
- The experimental evaluation indicates that chaining performs better than linear open addressing.



Overflow Handling (7/8)

Results of hash chaining

Enter sequence: acos, atoi, char, define, exp, ceil, cos, float, atol, floor, ctime
f(x) = first character of x

# of key comparisons = 21/11 = 1.91



Overflow Handling (8/8)

Comparison
- In Figure 8.7, the values in each column give the average number of bucket accesses made in searching eight different tables with 33,575, 24,050, 4909, 3072, 2241, 930, 762, and 500 identifiers each.
- Chaining performs better than linear open addressing.
- We can see that division is generally superior to the other hash functions.

[Figure 8.7: average number of bucket accesses per identifier retrieved]



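#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* MAX_CHAR, TABLE_SIZE, and hash() are assumed to be defined elsewhere. */
#define IS_FULL(ptr) (!(ptr))

typedef struct {
    char key[MAX_CHAR];
    /* other fields */
} element;

typedef struct list *list_pointer;
typedef struct list {
    element item;
    list_pointer link;
} list;

list_pointer hash_table[TABLE_SIZE];

/* Insert item at the end of the chain for its bucket;
   report an error and exit if the key is already present. */
void chain_insert(element item, list_pointer ht[])
{
    int hash_value = hash(item.key);
    list_pointer ptr, trail = NULL, lead = ht[hash_value];

    /* walk the chain, checking for a duplicate key */
    for (; lead; trail = lead, lead = lead->link)
        if (!strcmp(lead->item.key, item.key)) {
            fprintf(stderr, "The key is in the table\n");
            exit(1);
        }

    ptr = (list_pointer)malloc(sizeof(list));
    if (IS_FULL(ptr)) {
        fprintf(stderr, "The memory is full\n");
        exit(1);
    }
    ptr->item = item;
    ptr->link = NULL;
    if (trail)
        trail->link = ptr;      /* append after the last node */
    else
        ht[hash_value] = ptr;   /* first node in this bucket */
}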
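To complement the insertion routine, the sketch below adds a chained search that also counts key comparisons. It is self-contained, so it repeats simplified versions of the declarations; the first-character hash, the return-code conventions, and the names first_char_hash and chain_search are assumptions made for this illustration rather than the original author's code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CHAR 16
#define TABLE_SIZE 26

typedef struct { char key[MAX_CHAR]; } element;
typedef struct list *list_pointer;
struct list { element item; list_pointer link; };

static list_pointer hash_table[TABLE_SIZE];

/* Assumed hash function: first character, lowercase ASCII only. */
static int first_char_hash(const char *key)
{
    assert(key[0] >= 'a' && key[0] <= 'z');
    return key[0] - 'a';
}

/* Appends key to the end of its chain. Returns 1 on success,
 * 0 if the key is already present, -1 if memory is exhausted. */
int chain_insert(const char *key)
{
    int h = first_char_hash(key);
    list_pointer trail = NULL, lead = hash_table[h];
    for (; lead; trail = lead, lead = lead->link)
        if (strcmp(lead->item.key, key) == 0)
            return 0;
    list_pointer ptr = malloc(sizeof *ptr);
    if (!ptr)
        return -1;
    strncpy(ptr->item.key, key, MAX_CHAR - 1);
    ptr->item.key[MAX_CHAR - 1] = '\0';
    ptr->link = NULL;
    if (trail)
        trail->link = ptr;
    else
        hash_table[h] = ptr;
    return 1;
}

/* Returns the node holding key, or NULL if it is absent. If
 * comparisons is non-NULL, stores the number of keys compared. */
list_pointer chain_search(const char *key, int *comparisons)
{
    int count = 0;
    list_pointer p = hash_table[first_char_hash(key)];
    for (; p; p = p->link) {
        count++;
        if (strcmp(p->item.key, key) == 0)
            break;
    }
    if (comparisons)
        *comparisons = count;
    return p;
}
```

Inserting the eleven function names in the order given earlier and then searching for each one once makes 21 key comparisons in total, which agrees with the 21/11 average reported for chaining.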



Dynamic hashing

Motivation for dynamic hashing
- One of the most important classes of software is the DBMS. A key characteristic of a DBMS is that the amount of information can vary a great deal over time.
- Various data structures have been suggested for storing the data in a DBMS. In this section, we examine an extension of hashing that permits the technique to be used by a DBMS.
- Traditional hashing schemes are not ideal because we must statically allocate a portion of memory to hold the hash table.
- Dynamic hashing, also referred to as extendible hashing, can accommodate dynamically increasing and decreasing file size without penalty.



Dynamic hashing using directories

Example: an identifier consists of two characters and each character is represented by 3 bits. We would like to place these identifiers into a table that has four pages. Each page can hold no more than two identifiers, and the pages are indexed by the 2-bit sequences 00, 01, 10, 11, respectively.

We use the two low-order bits of each identifier to determine the page address of the identifier.

Identifier   Binary representation
a0           100 000
a1           100 001
b0           101 000
b1           101 001
c0           110 000
c1           110 001
c2           110 010
c3           110 011



[Figure: three tries, with page contents listed for each]
(a) Two-level trie on four pages: a0, b0 | c2 | a1, b1 | c3
(b) Inserting c5 with overflow: a0, b0 | c2 | a1, b1 | c5 | c3
(c) Inserting c1 with overflow: a0, b0 | c2 | a1, c1 | b1 | c5 | c3

We use the term trie to denote a binary tree in which we locate an identifier by following its bit sequence. Notice that this trie has nodes that always branch in two directions corresponding to 0 or 1. Only the leaf nodes of the trie contain a pointer to a page.



From this example we can see that two major problems exist.
- The access time for a page depends on the number of bits needed to distinguish the identifiers.
- If the identifiers have a skewed distribution, the trie is also skewed.

How can we avoid these problems?
- To avoid the skewed distribution of identifiers, a hash function is used.
- To avoid the long search down the trie, the trie is mapped to a directory.

A directory is a table of page pointers. In case k bits are needed to distinguish the identifiers, the directory has $2^k$ entries indexed 0, 1, 2, 3, …, $2^k - 1$.
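A directory lookup can be sketched in C as indexing a table of page pointers with the k low-order bits of the identifier. The page and directory types and the function names below are hypothetical, chosen only to illustrate the idea; identifiers are represented by their 6-bit codes from the example (for instance c5 = 110 101).

```c
#include <assert.h>

#define PAGE_CAPACITY 2   /* each page holds at most two identifiers */

typedef struct {
    int count;
    unsigned ids[PAGE_CAPACITY];
} page;

typedef struct {
    unsigned k;       /* number of bits used to index the directory */
    page **entries;   /* table of 2^k page pointers */
} directory;

/* Directory index = the k low-order bits of the identifier. */
unsigned directory_index(unsigned bits, unsigned k)
{
    assert(k >= 1 && k < 32);
    return bits & ((1u << k) - 1u);
}

/* One table access replaces the walk down the trie. */
page *directory_lookup(const directory *dir, unsigned bits)
{
    return dir->entries[directory_index(bits, dir->k)];
}
```

With k = 2, the identifier c5 (binary 110 101) maps to entry 01; with k = 3 it maps to entry 101. Several directory entries may point to the same page, which is why the entries are pointers rather than the pages themselves.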



Directories for the three tries (index -page pointer-> page contents):

(a) 2 bits          (b) 3 bits          (c) 4 bits
00 -a-> a0, b0      000 -a-> a0, b0     0000 -a-> a0, b0
01 -c-> a1, b1      001 -c-> a1, b1     0001 -c-> a1, c1
10 -b-> c2          010 -b-> c2         0010 -b-> c2
11 -d-> c3          011 -e-> c3         0011 -f-> c3
                    100 -a->            0100 -a->
                    101 -d-> c5         0101 -e-> c5
                    110 -b->            0110 -b->
                    111 -e->            0111 -f->
                                        1000 -a->
                                        1001 -d-> b1
                                        1010 -b->
                                        1011 -f->
                                        1100 -a->
                                        1101 -e->
                                        1110 -b->
                                        1111 -f->



Advantage of a directory:
- Using a directory to represent a trie allows the table of identifiers to grow and shrink dynamically.
- Accessing any page requires only two steps. First, use the hash function to find the address of the directory entry. Then, retrieve the page associated with the address.

Disadvantage of a directory:
- If the keys are not uniformly divided among the pages, the directory can grow quite large. However, most of the entries point to the same pages.
- To prevent this from happening, we cannot use the bit sequence of the keys themselves. Instead we translate the bits into a random sequence using a uniform hash function as discussed in the previous section.
- We need a family of hash functions, because, at any point, we may require a different number of bits to distinguish the new key.



Our solution is the family of hash functions

$\mathrm{hash}_i : \mathit{key} \rightarrow \{0, \ldots, 2^i - 1\}, \quad 1 \le i \le d$

where $\mathrm{hash}_i$ is simply $\mathrm{hash}_{i-1}$ with either a zero or one appended as the new leading bit of the result. Thus hash(key, i) might be a function that produces a random number of i bits from the identifier key.

Some important twists are associated with this approach. For example, suppose a page identified by i bits overflows. We allocate a new page and rehash the identifiers into those two pages. The identifiers in both pages have their low-order i bits in common. We refer to these pages as buddies. When the number of identifiers in two buddy pages is no more than the capacity of a single page, we coalesce the two pages into one.
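One simple way to obtain such a family, sketched below in C, is to compute a single uniform hash value per key and let hash(key, i) return its i low-order bits. The names hash_i and buddy_index, and the convention that a page is addressed by `depth` bits, are assumptions for this illustration; the underlying uniform hash is assumed to be computed elsewhere.

```c
#include <assert.h>

/* hash(key, i): the i low-order bits of an already-hashed key.
 * hash_i equals hash_(i-1) with one extra leading bit (0 or 1). */
unsigned hash_i(unsigned hashed_key, unsigned i)
{
    assert(i >= 1 && i < 32);
    return hashed_key & ((1u << i) - 1u);
}

/* A page addressed by `depth` bits has a buddy whose address differs
 * only in the leading bit, so both share the low-order depth-1 bits. */
unsigned buddy_index(unsigned page_index, unsigned depth)
{
    assert(depth >= 1 && depth < 32);
    return page_index ^ (1u << (depth - 1));
}
```

In the 4-bit directory of the earlier example, the pages at 0001 (a1, c1) and 1001 (b1) are buddies: buddy_index(1, 4) yields 9 and buddy_index(9, 4) yields 1.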
