Hashing
Static Hashing, Dynamic Hashing
Sungkyunkwan University, Hyoung-Kee Choi ©
Symbol table ADT
We define the symbol table as a set of name-attribute pairs.
- In a thesaurus, the name is a word, and the attribute is a list of synonyms for the word.
- In a symbol table for a compiler, the name is an identifier, and the attributes might include an initial value and a list of lines that use the identifier.
Generally we would want to perform the following operations on any symbol table:
- Determine if a particular name is in the table
- Retrieve the attributes of that name
- Modify the attributes of that name
- Insert a new name and its attributes
- Delete a name and its attributes
There are only three basic operations on symbol tables: searching, inserting, and deleting.
To implement these operations, we could use the binary search tree introduced in Section 5.7, whose worst-case complexity is O(n), or some other (balanced) binary tree with O(log n) complexity.
In this chapter we examine a technique for search, insert, and delete operations that has very good expected performance. This technique is referred to as hashing.
Unlike search tree methods which rely on identifier comparisons to perform a search, hashing relies on a formula called the hash function.
Static hashing
In static hashing, we store the identifiers in a fixed-size table called a hash table. We use an arithmetic function, f, to determine the address, or location, of an identifier, x, in the table. Thus, f(x) gives the hash, or home address, of x in the table.
The hash table ht is stored in sequential memory locations that are partitioned into b buckets, ht[0], …, ht[b-1]. Each bucket has s slots.
Definitions:
- The identifier density of a hash table is the ratio n/T, where n is the number of identifiers in the table and T is the total number of possible identifiers.
- The loading density or loading factor of a hash table is α = n/(sb).
- Two identifiers, i1 and i2, are synonyms with respect to f if f(i1) = f(i2).
- An overflow occurs when we hash a new identifier, i, into a full bucket.
- A collision occurs when we hash two nonidentical identifiers into the same bucket. When the bucket size is 1, collisions and overflows occur simultaneously.
If no overflow occurs, the time required to enter, delete, or search for identifiers does not depend on the number of identifiers n in use; it is O(1).
Since the ratio b/T is usually small, we cannot avoid collisions altogether.
Example 8.1: b = 26, s = 2, f(x) = the first character of x

Bucket  Slot 0  Slot 1
0       acos    atan
1
2       char    ceil
3       define
4       exp
5       float   floor
6
…
25
Hash table with 26 buckets and two slots per bucket
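The table of Example 8.1 can be sketched in C as follows. This is only an illustration, assuming lowercase ASCII identifiers; the names insert and MAX_LEN and the return codes are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <string.h>

#define B 26          /* number of buckets, as in Example 8.1 */
#define S 2           /* slots per bucket, as in Example 8.1 */
#define MAX_LEN 16    /* assumed maximum identifier length (hypothetical) */

static char ht[B][S][MAX_LEN];   /* an empty string marks a free slot */

/* f(x): map the first character 'a'..'z' to bucket 0..25 (assumes ASCII) */
static int f(const char *x) {
    assert(x[0] >= 'a' && x[0] <= 'z');
    return x[0] - 'a';
}

/* Returns 1 if x was stored, 0 if x was already present,
   and -1 on overflow (the home bucket has no free slot). */
static int insert(const char *x) {
    int b = f(x);
    for (int slot = 0; slot < S; slot++) {
        if (strcmp(ht[b][slot], x) == 0)
            return 0;
        if (ht[b][slot][0] == '\0') {
            /* the table is zero-initialized, so the last byte stays '\0' */
            strncpy(ht[b][slot], x, MAX_LEN - 1);
            return 1;
        }
    }
    return -1;
}
```

With this sketch, acos and atan fill bucket 0, so a third identifier starting with 'a' (for example atoi) reports an overflow.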
Hashing functions
A hash function, f, transforms an identifier, x, into a bucket address in the hash table.
We want a hash function that is easy to compute and that minimizes the number of collisions.
- To avoid collisions, the hash function should depend on all the characters in an identifier.
- Hashing functions should be unbiased. That is, if we randomly choose an identifier, x, from the identifier space, the probability that f(x) = i is 1/b for all buckets i.
We call a hash function that satisfies the unbiased property a uniform hash function.
Four types of uniform hash functions:
- Mid-square
- Division
- Folding
- Digit analysis
Mid-square
We compute the function f by squaring the identifier and then using an appropriate number of bits from the middle of the square to obtain the bucket address.
Since the middle bits of the square usually depend upon all the characters in an identifier, there is a high probability that different identifiers will produce different hash addresses.
The number of bits used to obtain the bucket address depends on the table size. If we use r bits, the range of the values is 2^r, that is, bucket addresses 0 through 2^r - 1.
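A minimal sketch of the mid-square idea in C, assuming the identifier has already been converted to a 32-bit integer. The function name mid_square and the choice to centre the r-bit window on the full 64-bit square are assumptions of this sketch; for small keys one would rather centre the window on the bits the square actually occupies.

```c
#include <assert.h>
#include <stdint.h>

/* Mid-square hash: square a 32-bit key and return r bits taken from the
   middle of the 64-bit square. The result lies in 0 .. 2^r - 1. */
static uint32_t mid_square(uint32_t key, unsigned r) {
    assert(r >= 1 && r <= 32);
    uint64_t square = (uint64_t)key * key;
    unsigned shift = (64 - r) / 2;        /* centre the r-bit window */
    uint64_t mask = (1ULL << r) - 1;      /* r low-order one bits */
    return (uint32_t)((square >> shift) & mask);
}
```

For example, with key = 65536 = 2^16 the square is 2^32, and taking r = 8 bits after a shift of 28 gives bucket address 16.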
Division
We divide the identifier x by some number M and use the remainder as the hash address for x:
f(x) = x % M
This gives bucket addresses that range from 0 to M - 1, where M is the table size.
The choice of M is critical.
- If M is divisible by 2, then odd keys are mapped to odd buckets and even keys are mapped to even buckets.
- When many identifiers are permutations of each other, a biased use of the table results.
A good choice for M would be a prime number such that M does not divide r^k ± a for small k and a, where r is the radix of the character set. In practice, choose M such that it has no prime divisors less than 20.
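The division method can be sketched in C as below. The slides do not say how a character string becomes the number x, so this sketch assumes the identifier is read as a number in radix 256; the name hash_division is hypothetical.

```c
#include <assert.h>

/* Interpret the identifier as a number in radix 256 (an assumption of this
   sketch) and reduce it modulo M. Reducing after every character gives the
   same remainder as x % M while keeping the intermediate value small. */
static unsigned hash_division(const char *x, unsigned M) {
    assert(M > 0);
    unsigned long long h = 0;
    for (; *x; x++)
        h = (h * 256 + (unsigned char)*x) % M;
    return (unsigned)h;
}
```

With M = 23, the one-character identifier "a" (code 97) hashes to 97 % 23 = 5, and "ab" (97 * 256 + 98 = 24930) hashes to 24930 % 23 = 21.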
Folding
We partition the identifier x into several parts. All parts, except for the last one, have the same length. We then add the parts together to obtain the hash address for x.
There are two ways of carrying out this addition.
- Shift folding: We shift all parts except for the last one, so that the least significant bit of each part lines up with the corresponding bit of the last part. We then add the parts together to obtain f(x).
  Ex: suppose that we have divided the identifier x into the following parts: x1 = 123, x2 = 203, x3 = 241, x4 = 112, and x5 = 20. We would align x1 through x4 with x5 and add. This gives us a hash address of 123 + 203 + 241 + 112 + 20 = 699.
- Folding at the boundaries: We reverse every other partition before adding.
  Ex: suppose the identifier x is divided into the same partitions as in shift folding. We would reverse the second and fourth partitions, that is, x2 = 302 and x4 = 211, and add the partitions. This gives us a hash address of 123 + 302 + 241 + 211 + 20 = 897.
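Both folding variants can be checked with a short C sketch. It assumes the identifier has already been split into decimal parts, as in the example; the function names are illustrative only.

```c
#include <assert.h>

/* Reverse the decimal digits of p, e.g. 203 -> 302. */
static unsigned reverse_digits(unsigned p) {
    unsigned r = 0;
    for (; p > 0; p /= 10)
        r = r * 10 + p % 10;
    return r;
}

/* Shift folding: add the parts as they are. */
static unsigned fold_shift(const unsigned parts[], int n) {
    assert(n > 0);
    unsigned sum = 0;
    for (int i = 0; i < n; i++)
        sum += parts[i];
    return sum;
}

/* Folding at the boundaries: reverse every other part
   (the 2nd, 4th, ...) before adding. */
static unsigned fold_boundary(const unsigned parts[], int n) {
    assert(n > 0);
    unsigned sum = 0;
    for (int i = 0; i < n; i++)
        sum += (i % 2 == 1) ? reverse_digits(parts[i]) : parts[i];
    return sum;
}
```

Applied to the parts 123, 203, 241, 112, 20, fold_shift yields 699 and fold_boundary yields 897, matching the two examples.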
Digit Analysis
Digit analysis is used with static files. A static file is one in which all the identifiers are known in advance.
Using this method:
- We first transform the identifiers into numbers using some radix, r.
- We then examine the digits of each identifier, deleting those digits that have the most skewed distribution.
- We continue deleting digits until the number of remaining digits is small enough to give an address in the range of the hash table.
Of these methods, the one most suitable for general-purpose applications is the division method with a divisor, M, such that M has no prime factors less than 20.
Overflow Handling (1/8)
Linear open addressing (linear probing)
- Compute f(x) for identifier x.
- Examine the buckets ht[(f(x)+j) % TABLE_SIZE], 0 ≤ j ≤ TABLE_SIZE, in this order (circular rotation). Four cases arise:
  - The bucket contains x.
  - The bucket contains the empty string (insert x into it).
  - The bucket contains a nonempty string other than x (examine the next bucket).
  - We return to the home bucket ht[f(x)] (j = TABLE_SIZE): the table is full, so we report an error condition and exit.
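The probing steps above can be sketched in C as follows. The hash function f (sum of character codes modulo the table size), the constants, and the return conventions are assumptions made for this sketch; instead of exiting on a full table, it returns -1.

```c
#include <assert.h>
#include <string.h>

#define TABLE_SIZE 13   /* assumed number of buckets, one slot each */
#define MAX_CHAR 10     /* assumed key capacity, including the terminator */

static char ht[TABLE_SIZE][MAX_CHAR];   /* "" marks an empty bucket */

/* Assumed hash function: sum of the character codes modulo TABLE_SIZE. */
static int f(const char *x) {
    unsigned sum = 0;
    for (; *x; x++)
        sum += (unsigned char)*x;
    return (int)(sum % TABLE_SIZE);
}

/* Returns the bucket that holds x after the call, or -1 if the table is full. */
static int linear_insert(const char *x) {
    assert(strlen(x) < MAX_CHAR);
    int home = f(x);
    for (int j = 0; j < TABLE_SIZE; j++) {
        int i = (home + j) % TABLE_SIZE;
        if (ht[i][0] == '\0') {          /* empty bucket: insert here */
            strcpy(ht[i], x);
            return i;
        }
        if (strcmp(ht[i], x) == 0)       /* x is already in the table */
            return i;
    }
    return -1;                           /* wrapped around to the home bucket */
}

/* Returns the bucket that holds x, or -1 if x is not in the table. */
static int linear_search(const char *x) {
    int home = f(x);
    for (int j = 0; j < TABLE_SIZE; j++) {
        int i = (home + j) % TABLE_SIZE;
        if (ht[i][0] == '\0')
            return -1;                   /* an empty bucket ends the search */
        if (strcmp(ht[i], x) == 0)
            return i;
    }
    return -1;
}
```

With this f, "ab" and "ba" are synonyms (both sum to 195, and 195 % 13 = 0), so the second one inserted lands in bucket 1. Note that stopping a search at the first empty bucket is only valid as long as no identifier is ever deleted.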
Overflow Handling (2/8)
Additive transformation and division
[Figure: hash table with linear probing (13 buckets, 1 slot/bucket), insertion]
Overflow Handling (3/8)
Problem of linear probing
- Identifiers tend to cluster together.
- Adjacent clusters tend to coalesce.
- Both effects increase the search time.
Example: suppose we enter the C built-in functions acos, atoi, char, define, exp, ceil, cos, float, atol, floor, ctime into a 26-bucket hash table in that order. The hash function uses the first character in each function name.
[Figure: hash table with linear probing (26 buckets, 1 slot/bucket), entered in the sequence above]
Average number of key comparisons = 41/11 = 3.73 (per identifier: 1, 2, 1, 1, 1, 4, 5, 3, 9, 5, 9)
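The comparison count for this example can be reproduced with a small simulation. The sketch below assumes that the cost of an identifier is the number of buckets examined from its home bucket up to and including the bucket where it is stored; the name total_probes is hypothetical. Under that convention the eleven identifiers cost 1, 2, 1, 1, 1, 4, 5, 3, 9, 5 and 9 buckets, a total of 41.

```c
#include <assert.h>
#include <stddef.h>

#define B 26   /* buckets, one slot each */

/* Inserts the identifiers in order with linear probing and
   f(x) = first character; returns the total number of buckets examined. */
static int total_probes(const char *ids[], int n) {
    const char *ht[B] = {NULL};   /* NULL marks an empty bucket */
    int total = 0;
    assert(n <= B);               /* guarantees a free bucket exists */
    for (int k = 0; k < n; k++) {
        int i = ids[k][0] - 'a';  /* home bucket (assumes lowercase ASCII) */
        assert(i >= 0 && i < B);
        int probes = 1;
        while (ht[i] != NULL) {   /* occupied: move to the next bucket */
            i = (i + 1) % B;
            probes++;
        }
        ht[i] = ids[k];
        total += probes;
    }
    return total;
}
```

The long probe sequences for atol and ctime (9 buckets each) show how the clusters started by the 'a', 'c', and 'f' identifiers have merged into one run.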
Overflow Handling (4/8)
Alternative techniques to improve the open addressing approach:
- Quadratic probing
- Rehashing
- Random probing
Rehashing
- Try f1, f2, …, fm in sequence if a collision occurs.
- Disadvantage: identifiers with different hash values are compared with each other.
- Use chaining to resolve collisions instead.
Overflow Handling (5/8)
Quadratic probing
- Linear probing searches buckets (f(x)+i) % b.
- Quadratic probing uses a quadratic function of i as the increment.
- Examine buckets f(x), (f(x)+i^2) % b, (f(x)-i^2) % b, for 1 <= i <= (b-1)/2.
- When b is a prime number of the form 4j+3, where j is an integer, the quadratic search examines every bucket in the table.

Prime  j     Prime  j
3      0     43     10
7      1     59     14
11     2     127    31
19     4     251    62
23     5     503    125
31     7     1019   254
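The claim about primes of the form 4j+3 can be checked numerically. The sketch below counts the distinct buckets reached by the probe sequence; the function name quadratic_coverage and the bound MAX_B are assumptions of this illustration.

```c
#include <assert.h>

#define MAX_B 1024   /* assumed upper bound on the table size for this sketch */

/* Returns the number of distinct buckets examined by the sequence
   h, (h + i*i) % b, (h - i*i) % b for 1 <= i <= (b-1)/2. */
static int quadratic_coverage(int h, int b) {
    assert(b > 0 && b <= MAX_B && h >= 0 && h < b);
    char seen[MAX_B] = {0};
    int count = 1;
    seen[h] = 1;                              /* the home bucket */
    for (int i = 1; i <= (b - 1) / 2; i++) {
        int sq = (i * i) % b;
        int probe[2] = { (h + sq) % b,        /* f(x) + i^2 */
                         (h - sq + b) % b };  /* f(x) - i^2, kept non-negative */
        for (int k = 0; k < 2; k++)
            if (!seen[probe[k]]) {
                seen[probe[k]] = 1;
                count++;
            }
    }
    return count;
}
```

For b = 7, 11, and 19 (all of the form 4j+3) every bucket is reached, whereas for b = 13 (of the form 4j+1) only 7 of the 13 buckets are examined.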
Overflow Handling (6/8)
Chaining
Linear probing and its variations perform poorly because inserting an identifier requires the comparison of identifiers with different hash values.
In this approach we maintain a list of synonyms for each bucket.
To insert a new element:
- Compute the hash address f(x).
- Examine the identifiers in the list for f(x).
Since we would not know the sizes of the lists in advance, we should maintain them as linked chains.
The experimental evaluation indicates that chaining performs better than linear open addressing.
Overflow Handling (7/8)
Results of hash chaining
Identifiers: acos, atoi, char, define, exp, ceil, cos, float, atol, floor, ctime
f(x) = first character of x
Average number of key comparisons = 21/11 = 1.91
Overflow Handling (8/8)
Comparison:
In Figure 8.7, the values in each column give the average number of bucket accesses made in searching eight different tables with 33,575, 24,050, 4909, 3072, 2241, 930, 762, and 500 identifiers each.
- Chaining performs better than linear open addressing.
- We can see that division is generally superior to the other types of hash functions.
[Figure 8.7: average number of bucket accesses per identifier retrieved]
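#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CHAR 10            /* maximum key length (value assumed) */
#define TABLE_SIZE 13          /* number of buckets (value assumed) */
#define IS_FULL(ptr) (!(ptr))  /* true when malloc failed */

typedef struct {
    char key[MAX_CHAR];
    /* other fields */
} element;

typedef struct list *list_pointer;
typedef struct list {
    element item;
    list_pointer link;
} list;

list_pointer hash_table[TABLE_SIZE];

int hash(char *key);   /* hash function, defined elsewhere */

/* Append item to the chain of its bucket; exit if the key is already present. */
void chain_insert(element item, list_pointer ht[])
{
    int hash_value = hash(item.key);
    list_pointer ptr, trail = NULL, lead = ht[hash_value];

    /* walk the chain, checking for a duplicate key */
    for (; lead; trail = lead, lead = lead->link)
        if (!strcmp(lead->item.key, item.key)) {
            fprintf(stderr, "The key is in the table\n");
            exit(1);
        }
    ptr = (list_pointer)malloc(sizeof(list));
    if (IS_FULL(ptr)) {
        fprintf(stderr, "The memory is full\n");
        exit(1);
    }
    ptr->item = item;
    ptr->link = NULL;
    if (trail)
        trail->link = ptr;      /* append after the last node */
    else
        ht[hash_value] = ptr;   /* the chain was empty */
}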
Dynamic hashing
Motivation for dynamic hashing
- One of the most important classes of software is the database management system (DBMS). A key characteristic of a DBMS is that the amount of information can vary a great deal over time.
- Various data structures have been suggested for storing the data in a DBMS. In this section, we examine an extension of hashing that permits the technique to be used by a DBMS.
- Traditional hashing schemes are not ideal because we must statically allocate a portion of memory to hold the hash table.
- Dynamic hashing, also referred to as extendible hashing, can accommodate dynamically increasing and decreasing file size without penalty.
Dynamic hashing using directories
Example: an identifier consists of two characters and each character is represented by 3 bits.
We would like to place these identifiers into a table that has four pages. Each page can hold no more than two identifiers, and the pages are indexed by the 2-bit sequences 00, 01, 10, 11, respectively.
We use the two low-order bits of each identifier to determine the page address of the identifier.

Identifier  Binary representation
a0          100 000
a1          100 001
b0          101 000
b1          101 001
c0          110 000
c1          110 001
c2          110 010
c3          110 011
[Figure: (a) two-level trie on four pages, holding a0, b0 / c2 / a1, b1 / c3; (b) inserting c5 with overflow; (c) inserting c1 with overflow]
We use the term trie to denote a binary tree in which we locate an identifier by following its bit sequence. Notice that this trie has nodes that always branch in two
directions corresponding to 0 or 1. Only the leaf nodes of the trie contain a pointer to a page.
From this example we can see that two major problems exist.
- The access time for a page depends on the number of bits needed to distinguish the identifiers.
- If the identifiers have a skewed distribution, the tree is also skewed.
How to avoid these problems?
- To avoid the skewed distribution of identifiers, a hash function is used.
- To avoid the long search down the trie, the trie is mapped to a directory.
A directory is a table of page pointers. In case k bits are needed to distinguish the identifiers, the directory has 2^k entries indexed 0, 1, 2, 3, …, 2^k - 1.
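[Figure: the three tries represented as directories of 2, 3, and 4 bits; each entry shows its pointer label and, where given, the identifiers on the page]
2 bits              3 bits              4 bits
00 -a-> a0, b0      000 -a-> a0, b0     0000 -a-> a0, b0
01 -c-> a1, b1      001 -c-> a1, b1     0001 -c-> a1, c1
10 -b-> c2          010 -b-> c2         0010 -b-> c2
11 -d-> c3          011 -e-> c3         0011 -f-> c3
                    100 -a->            0100 -a->
                    101 -d-> c5         0101 -e-> c5
                    110 -b->            0110 -b->
                    111 -e->            0111 -f->
                                        1000 -a->
                                        1001 -d-> b1
                                        1010 -b->
                                        1011 -f->
                                        1100 -a->
                                        1101 -e->
                                        1110 -b->
                                        1111 -f->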
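Locating a directory entry amounts to keeping the k low-order bits of the identifier's bit sequence. The C sketch below illustrates this with the 3-bit character codes of the example (a = 100, b = 101, c = 110, digits 0 to 7 as 000 to 111); the function names encode and directory_index are hypothetical.

```c
#include <assert.h>

/* Encode a two-character identifier as in the example:
   3 bits for the letter (a = 100, b = 101, c = 110), then 3 bits for the digit. */
static unsigned encode(const char *id) {
    assert(id[0] >= 'a' && id[0] <= 'c');
    assert(id[1] >= '0' && id[1] <= '7');
    unsigned letter = 4u + (unsigned)(id[0] - 'a');   /* 'a' -> 100 in binary */
    unsigned digit = (unsigned)(id[1] - '0');
    return (letter << 3) | digit;
}

/* Index of the directory entry for id when the directory has 2^k entries:
   the k low-order bits of the encoded identifier. */
static unsigned directory_index(const char *id, unsigned k) {
    assert(k >= 1 && k <= 6);
    return encode(id) & ((1u << k) - 1u);
}
```

For instance, c2 (110 010) maps to entry 10 of the 2-bit directory, c5 (110 101) to entry 101 of the 3-bit directory, and b1 (101 001) to entry 1001 of the 4-bit directory, in agreement with the directories listed above.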
Advantage of a directory:
- Using a directory to represent a trie allows the table of identifiers to grow and shrink dynamically.
- Accessing any page requires only two steps. First, use the hash function to find the address of the directory entry. Then, retrieve the page associated with the address.
Disadvantage of a directory:
- If the keys are not uniformly divided among the pages, the directory can grow quite large. However, most of the entries point to the same pages.
- To prevent this from happening, we cannot use the bit sequence of the keys themselves. Instead we translate the bits into a random sequence using a uniform hash function as discussed in the previous section.
- We need a family of hash functions, because, at any point, we may require a different number of bits to distinguish the new key.
Our solution is the family of hash functions
hash_i : key → {0, …, 2^i - 1}, 1 ≤ i ≤ d
where hash_i is simply hash_(i-1) with either a zero or one appended as the new leading bit of the result. Thus hash(key, i) might be a function that produces a random number of i bits from the identifier key.
Some important twists are associated with this approach. For example, suppose a page identified by i bits overflows. We allocate a new page and rehash the identifiers into those two pages. The identifiers in both pages have their low-order i bits in common. We refer to these pages as buddies. When the number of identifiers in two buddy pages is no more than the capacity of a single page, then we coalesce the two pages into one.
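One way to obtain such a family is to compute a single wide hash value and keep its i low-order bits, so that going from i-1 to i bits only adds a new leading bit. The sketch below assumes this construction; base_hash uses the 32-bit FNV-1a function purely as a stand-in for any uniform hash function, and the names base_hash and are_buddies are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed base hash (32-bit FNV-1a); any uniform hash function would do. */
static uint32_t base_hash(const char *key) {
    uint32_t h = 2166136261u;
    for (; *key; key++) {
        h ^= (unsigned char)*key;
        h *= 16777619u;
    }
    return h;
}

/* hash(key, i): the i low-order bits of base_hash(key), a value in 0 .. 2^i - 1. */
static uint32_t hash(const char *key, unsigned i) {
    assert(i >= 1 && i <= 31);
    return base_hash(key) & ((1u << i) - 1u);
}

/* Two i-bit page addresses are buddies if they differ only in the
   leading (i-th) bit, i.e. they share their low-order i-1 bits. */
static int are_buddies(uint32_t p, uint32_t q, unsigned i) {
    assert(i >= 1 && i <= 31);
    return (p ^ q) == (1u << (i - 1));
}
```

By construction, dropping the leading bit of hash(key, i) gives hash(key, i-1), so the identifiers of an overflowing page can be redistributed between the two buddy pages by looking at one additional bit.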