Data structures and Algorithms R.Punithavathi Department of IT
UNIT III
DISJOINT SET ADT, Hashing
THE DISJOINT SET ADT
Introduction
It is an efficient data structure to solve the equivalence problem. Each routine
requires only a few lines of code, and a simple array can be used. The
implementation is also extremely fast, requiring constant average time per
operation.
1. Equivalence Relations
A relation R is defined on a set S if for every pair of elements (a, b), a, b ∈ S, a R
b is either true or false. If a R b is true, then we say that a is related to b.
An equivalence relation is a relation R that satisfies three properties:
1. (Reflexive) a R a, for all a ∈ S.
2. (Symmetric) a R b if and only if b R a.
3. (Transitive) a R b and b R c implies that a R c.
Examples.
1. The relationship ≤ is not an equivalence relationship.
Although it is reflexive, since a ≤ a, and
transitive, since a ≤ b and b ≤ c implies a ≤ c,
it is not symmetric, since a ≤ b does not imply b ≤ a.
2. Electrical connectivity, where all connections are by metal wires, is an
equivalence relation.
The relation is clearly reflexive, as any component is connected to itself. If a is
electrically connected to b, then b must be electrically connected to a, so the
relation is symmetric. Finally, if a is connected to b and b is connected to c, then
a is connected to c. Thus electrical connectivity is an equivalence relation.
3. Two cities are related if they are in the same country. It is easily verified that
this is an equivalence relation. Suppose town a is related to b if it is possible to
travel from a to b by taking roads. This relation is an equivalence relation if all
the roads are two-way.
2. The Dynamic Equivalence Problem
The input is initially a collection of n sets, each with one element. This initial
representation is that all relations (except reflexive relations) are false. Each set
has a different element, so that Si ∩ Sj = ∅; this makes the sets disjoint.
There are two permissible operations.
The first is find, which returns the name of the set (that is, the equivalence class)
containing a given element.
The second operation adds relations. If we want to add the relation a ~ b, then
we first see if a and b are already related. This is done by performing finds on
both a and b and checking whether they are in the same equivalence class. If
they are not, then we apply union. This operation merges the two equivalence
classes containing a and b into a new equivalence class. From a set point of
view, the result of ∪ is to create a new set Sk = Si ∪ Sj, destroying the originals
and preserving the disjointness of all the sets. The algorithm to do this is
frequently known as the disjoint set union/find algorithm for this reason.
This algorithm is dynamic because, during the course of the algorithm, the sets
can change via the union operation. The algorithm must also operate on-line:
When a find is performed, it must give an answer before continuing. Another
possibility would be an off-line algorithm. Such an algorithm would be allowed to
see the entire sequence of unions and finds. The answer it provides for each find
must still be consistent with all the unions that were performed up until the find,
but the algorithm can give all its answers after it has seen all the questions. The
difference is similar to taking a written exam (which is generally off-line--you only
have to give the answers before time expires), and an oral exam (which is on-
line, because you must answer the current question before proceeding to the
next question).
Notice that we do not perform any operations comparing the relative values of
elements, but merely require knowledge of their location. For this reason, we can
assume that all the elements have been numbered sequentially from 1 to n and
that the numbering can be determined easily by some hashing scheme. Thus,
initially we have Si = {i} for i = 1 through n.
Our second observation is that the name of the set returned by find is actually
fairly arbitrary. All that really matters is that find(x) = find(y) if and only if x and y
are in the same set.
These operations are important in many graph theory problems and also in
compilers which process equivalence (or type) declarations. We will see an
application later.
There are two strategies to solve this problem. One ensures that the find
instruction can be executed in constant worst-case time, and the other ensures
that the union instruction can be executed in constant worst-case time. It has
recently been shown that both cannot be done simultaneously in constant worst-
case time.
We will now briefly discuss the first approach. For the find operation to be fast,
we could maintain, in an array, the name of the equivalence class for each
element. Then find is just a simple O(1) lookup. Suppose we want to perform
union(a, b). Suppose that a is in equivalence class i and b is in equivalence class
j. Then we scan down the array, changing all is to j. Unfortunately, this scan
takes Θ(n). Thus, a sequence of n - 1 unions (the maximum, since then
everything is in one set) would take Θ(n²) time. If there are Ω(n²) find operations,
this performance is fine, since the total running time would then amount to O(1)
for each union or find operation over the course of the algorithm. If there are
fewer finds, this bound is not acceptable.
One idea is to keep all the elements that are in the same equivalence class in a
linked list. This saves time when updating, because we do not have to search
through the entire array. This by itself does not reduce the asymptotic running
time, because it is still possible to perform Θ(n²) equivalence class updates over
the course of the algorithm.
If we also keep track of the size of each equivalence class, and when performing
unions we change the name of the smaller equivalence class to the larger, then
the total time spent for n - 1 merges is O(n log n). The reason for this is that each
element can have its equivalence class changed at most log n times, since every
time its class is changed, its new equivalence class is at least twice as large as
its old. Using this strategy, any sequence of m finds and up to n - 1 unions takes
at most O(m + n log n) time.
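The name-array scheme with size tracking described above is not given in code in the text; a minimal sketch under the stated assumptions (a fixed element count, sequential numbering from 1; the identifiers qf_init, qf_find, qf_union and the constant QF_N are invented here) might look like:

```c
#include <assert.h>

#define QF_N 8   /* number of elements, chosen for this sketch */

static int class_name[ QF_N+1 ];   /* equivalence class name of each element */
static int class_size[ QF_N+1 ];   /* size of each class, indexed by name */

void qf_init( void )
{
    int i;
    for( i = 1; i <= QF_N; i++ )
    {
        class_name[i] = i;   /* initially Si = { i } */
        class_size[i] = 1;
    }
}

/* find is a simple O(1) array lookup */
int qf_find( int x )
{
    return class_name[x];
}

/* Rename the smaller class to the larger; each element's class can
   change at most log N times, since its class at least doubles. */
void qf_union( int a, int b )
{
    int i, ci = qf_find( a ), cj = qf_find( b );
    if( ci == cj )
        return;
    if( class_size[ci] < class_size[cj] )
    {
        int t = ci; ci = cj; cj = t;   /* ensure ci names the larger class */
    }
    for( i = 1; i <= QF_N; i++ )
        if( class_name[i] == cj )
            class_name[i] = ci;
    class_size[ci] += class_size[cj];
}
```

This makes find O(1) at the cost of a linear scan per union, matching the trade-off discussed above.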
In the remainder of this chapter, we will examine a solution to the union/find
problem that makes unions easy but finds hard. Even so, the running time for any
sequences of at most m finds and up to n - 1 unions will be only a little more than
O(m + n).
3. Basic Data Structure
Recall that the problem does not require that a find operation return any specific
name, just that finds on two elements return the same answer if and only if they
are in the same set. One idea might be to use a tree to represent each set, since
each element in a tree has the same root. Thus, the root can be used to name
the set. We will represent each set by a tree. (Recall that a collection of trees is
known as a forest.) Initially, each set contains one element. The trees we will use
are not necessarily binary trees, but their representation is easy, because the
only information we will need is a parent pointer. The name of a set is given by
the node at the root. Since only the name of the parent is required, we can
assume that this tree is stored implicitly in an array: each entry p[i] in the array
represents the parent of element i. If i is a root, then p[i] = 0. In the forest in
Figure 1, p[i] = 0 for 1 ≤ i ≤ 8. As with heaps, we will draw the trees explicitly, with the
understanding that an array is being used. Figure 1 shows the explicit
representation. We will draw the root's parent pointer vertically for convenience.
To perform a union of two sets, we merge the two trees by making the root of
one tree point to the root of the other. It should be clear that this operation takes
constant time. Figures 2, 3, and 4 represent the forest after each of union(5,6),
union(7,8), union(5,7), where we have adopted the convention that the new root
after the union(x,y) is x. The implicit representation of the last forest is shown in
Figure 5.
A find(x) on element x is performed by returning the root of the tree containing x.
The time to perform this operation is proportional to the depth of the node
representing x, assuming, of course, that we can find the node representing x in
constant time. Using the strategy above, it is possible to create a tree of depth n -
1, so the worst-case running time of a find is O(n). Typically, the running time is
computed for a sequence of m intermixed instructions. In this case, m
consecutive operations could take O(mn) time in the worst case.
The code in Figures 6 through 9 represents an implementation of the basic
algorithm, assuming that error checks have already been performed. In our
routine, unions are performed on the roots of the trees. Sometimes the operation
is performed by passing any two elements, and having the union perform two
finds to determine the roots.
Figure 1 Eight elements, initially in different sets
Figure 2 After union (5, 6)
Figure 3 After union (7, 8)
Figure 4 After union (5, 7)
Figure 5 Implicit representation of previous tree
typedef int DISJ_SET[ NUM_SETS+1 ];
typedef unsigned int set_type;
typedef unsigned int element_type;
Figure 6 Disjoint set type declaration
void
initialize( DISJ_SET S )
{
int i;
for( i = NUM_SETS; i > 0; i-- )
S[i] = 0;
}
Figure 7 Disjoint set initialization routine
/* Assumes root1 and root2 are roots. */
/* union is a C keyword, so this routine is named set_union. */
void
set_union( DISJ_SET S, set_type root1, set_type root2 )
{
S[root2] = root1;
}
Figure 8 Union (not the best way)
set_type
find( element_type x, DISJ_SET S )
{
if( S[x] <= 0 )
return x;
else
return( find( S[x], S ) );
}
Figure 9 A simple disjoint set find algorithm
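The routines of Figures 6 through 9 can be exercised together. The following self-contained sketch (with NUM_SETS fixed at 8 to match Figure 1, an assumption made here) reproduces the unions of Figures 2 through 4:

```c
#include <assert.h>

#define NUM_SETS 8   /* eight elements, as in Figure 1 */

typedef int DISJ_SET[ NUM_SETS+1 ];
typedef unsigned int set_type;
typedef unsigned int element_type;

/* Figure 7: every element starts as the root of its own one-node tree */
void initialize( DISJ_SET S )
{
    int i;
    for( i = NUM_SETS; i > 0; i-- )
        S[i] = 0;
}

/* Figure 8: make root2's tree a subtree of root1's */
void set_union( DISJ_SET S, set_type root1, set_type root2 )
{
    S[root2] = root1;
}

/* Figure 9: follow parent pointers up to the root */
set_type find( element_type x, DISJ_SET S )
{
    if( S[x] <= 0 )
        return x;
    else
        return find( S[x], S );
}
```

After union(5,6), union(7,8), and union(5,7), a find on element 8 follows the path 8 → 7 → 5 and returns 5, the name of the merged set.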
4. Smart Union Algorithms
The unions above were performed rather arbitrarily, by making the second tree a
subtree of the first. A simple improvement is always to make the smaller tree a
subtree of the larger, breaking ties by any method; we call this approach union-
by-size. The three unions in the preceding example were all ties, and so we can
consider that they were performed by size. If the next operation were union (4,
5), then the forest in Figure 10 would form. Had the size heuristic not been used,
a deeper forest would have been formed (Fig. 11).
Figure 10 Result of union-by-size
Figure 11 Result of an arbitrary union
Figure 12 Worst-case tree for n = 16
We can prove that if unions are done by size, the depth of any node is never
more than log n. To see this, note that a node is initially at depth 0. When its
depth increases as a result of a union, it is placed in a tree that is at least twice
as large as before. Thus, its depth can be increased at most log n times. (We
used this argument in the quick-find algorithm at the end of Section 2.) This
implies that the running time for a find operation is O(log n), and a sequence of m
operations takes O(m log n). The tree in Figure 12 shows the worst tree possible
after 16 unions and is obtained if all unions are between equal-sized trees.
To implement this strategy, we need to keep track of the size of each tree. Since
we are really just using an array, we can have the array entry of each root
contain the negative of the size of its tree. Thus, initially the array representation
of the tree is all -1s (and Fig 7 needs to be changed accordingly). When a union
is performed, check the sizes; the new size is the sum of the old. Thus, union-by-
size is not at all difficult to implement and requires no extra space. It is also fast,
on average. For virtually all reasonable models, it has been shown that a
sequence of m operations requires O(m) average time if union-by-size is used.
This is because when random unions are performed, generally very small
(usually one-element) sets are merged with large sets throughout the algorithm.
The following figures show a tree and its implicit representation for both union-by-
size and union-by-height. The code in Figure 13 implements union-by-height.
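The text gives code only for union-by-height (Figure 13). A hedged sketch of union-by-size, with the negative of the size stored at each root as described above (the names initialize_by_size, set_union_by_size, and find_size are invented here), could be:

```c
#include <assert.h>

#define NUM_SETS 8
typedef int DISJ_SET[ NUM_SETS+1 ];
typedef unsigned int set_type;
typedef unsigned int element_type;

/* Initialize every entry to -1: each root stores -(size of its tree) */
void initialize_by_size( DISJ_SET S )
{
    int i;
    for( i = NUM_SETS; i > 0; i-- )
        S[i] = -1;
}

/* Union-by-size: attach the smaller tree under the larger root;
   the new root's entry becomes the sum of the two (negative) sizes. */
void set_union_by_size( DISJ_SET S, set_type root1, set_type root2 )
{
    if( S[root2] < S[root1] )      /* root2's tree is strictly larger */
    {
        S[root2] += S[root1];
        S[root1] = root2;
    }
    else                           /* root1's tree is larger, or a tie */
    {
        S[root1] += S[root2];
        S[root2] = root1;
    }
}

/* Roots now hold negative sizes, so the root test is S[x] < 0 */
set_type find_size( element_type x, DISJ_SET S )
{
    if( S[x] < 0 )
        return x;
    else
        return find_size( S[x], S );
}
```

Note that the root test changes from S[x] <= 0 to S[x] < 0, since roots no longer hold zero.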
5. Path Compression
The union/find algorithm, as described so far, is quite acceptable for most cases.
It is very simple and linear on average for a sequence of m instructions (under all
models). However, the worst case of O(m log n ) can occur fairly easily and
naturally.
/* assume root1 and root2 are roots */
/* union is a C keyword, so this routine is named set_union */
void
set_union (DISJ_SET S, set_type root1, set_type root2 )
{
if( S[root2] < S[root1] ) /* root2 is deeper set */
S[root1] = root2; /* make root2 new root */
else
{
if( S[root2] == S[root1] ) /* same height, so update */
S[root1]--;
S[root2] = root1; /* make root1 new root */
}
}
Figure 13 Code for union-by-height (rank)
For instance, if we put all the sets on a queue and repeatedly dequeue the first
two sets and enqueue the union, the worst case occurs. If there are many more
finds than unions, this running time is worse than that of the quick-find algorithm.
Moreover, it should be clear that there are probably no more improvements
possible for the union algorithm. This is based on the observation that any
method to perform the unions will yield the same worst-case trees, since it must
break ties arbitrarily. Therefore, the only way to speed the algorithm up, without
reworking the data structure entirely, is to do something clever on the find
operation.
The clever operation is known as path compression. Path compression is
performed during a find operation and is independent of the strategy used to
perform unions. Suppose the operation is find(x). Then the effect of path
compression is that every node on the path from x to the root has its parent
changed to the root. Figure 14 shows the effect of path compression after find
(15) on the generic worst tree of Figure 12.
The effect of path compression is that with an extra two pointer moves, nodes 13
and 14 are now one position closer to the root and nodes 15 and 16 are now two
positions closer. Thus, the fast future accesses on these nodes will pay (we
hope) for the extra work to do the path compression.
As the code in Figure 15 shows, path compression is a trivial change to the basic
find algorithm. The only change to the find routine is that S[x] is made equal to
the value returned by find; thus after the root of the set is found recursively, x is
made to point directly to it. This occurs recursively to every node on the path to
the root, so this implements path compression. As we stated when we
implemented stacks and queues, modifying a parameter to a function is not
necessarily in line with current software engineering rules. Some languages
will not allow this, so this code may well need changes.
Figure 14 An example of path compression
set_type
find( element_type x, DISJ_SET S )
{
if( S[x] <= 0 )
return x;
else
return( S[x] = find( S[x], S ) );
}
Figure 15 Code for disjoint set find with path compression
When unions are done arbitrarily, path compression is a good idea, because
there is an abundance of deep nodes and these are brought near the root by
path compression. It has been proven that when path compression is done in this
case, a sequence of m operations requires at most O(m log n) time. It is still an
open problem to determine what the average-case behavior is in this situation.
Path compression is perfectly compatible with union-by-size, and thus both
routines can be implemented at the same time. Since doing union-by-size by
itself is expected to execute a sequence of m operations in linear time, it is not
clear that the extra pass involved in path compression is worthwhile on average.
Indeed, this problem is still open. However, as we shall see later, the
combination of path compression and a smart union rule guarantees a very
efficient algorithm in all cases.
Path compression is not entirely compatible with union-by-height, because path
compression can change the heights of the trees. It is not at all clear how to re-
compute them efficiently. The answer is do not!! Then the heights stored for each
tree become estimated heights (sometimes known as ranks), but it turns out that
union-by-rank (which is what this has now become) is just as efficient in theory
as union-by-size. Furthermore, heights are updated less often than sizes. As with
union-by-size, it is not clear whether path compression is worthwhile on average.
What we will show in the next section is that with either union heuristic, path
compression significantly reduces the worst-case running time.
6. Worst Case for Union-by-Rank and Path Compression
When both heuristics are used, the algorithm is almost linear in the worst case.
Specifically, the time required in the worst case is Θ(m α(m, n)) (provided m ≥ n),
where α(m, n) is a functional inverse of Ackermann's function, which is defined
below:*
A(1, j) = 2^j for j ≥ 1
A(i, 1) = A(i - 1, 2) for i ≥ 2
A(i, j) = A(i - 1, A(i, j - 1)) for i, j ≥ 2
*Ackermann's function is frequently defined with A(1, j) = j + 1 for j ≥ 1. The form in
this text grows faster; thus, the inverse grows more slowly.
From this, we define
α(m, n) = min{i ≥ 1 | A(i, ⌊m/n⌋) > log n}
You may want to compute some values, but for all practical purposes, α(m, n) ≤ 4,
which is all that is really important here. The single-variable inverse Ackermann
function, sometimes written as log* n, is the number of times the logarithm of n
needs to be applied until n ≤ 1. Thus, log* 65536 = 4, because log log log log
65536 = 1. log* 2^65536 = 5, but keep in mind that 2^65536 is a 20,000-digit number.
α(m, n) actually grows even slower than log* n. However, α(m, n) is not a constant,
so the running time is not linear.
In the remainder of this section, we will prove a slightly weaker result. We will
show that any sequence of m = Ω(n) union/find operations takes a total of O(m
log* n) running time. The same bound holds if union-by-rank is replaced with
union-by-size. This analysis is probably the most complex in the book and one of
the first truly complex worst-case analyses ever performed for an algorithm that
is essentially trivial to implement.
An Application
We have a network of computers and a list of bidirectional connections; each of these
connections allows a file transfer from one computer to another. Is it possible to send a file
from any computer on the network to any other? An extra restriction is that the problem must
be solved on-line. Thus, the list of connections is presented one at a time, and the algorithm
must be prepared to give an answer at any point.
An algorithm to solve this problem can initially put every computer in its own set. Our
invariant is that two computers can transfer files if and only if they are in the same set. We
can see that the ability to transfer files forms an equivalence relation. We then read
connections one at a time. When we read some connection, say (u, v), we test to see
whether u and v are in the same set and do nothing if they are. If they are in different sets,
we merge their sets. At the end of the algorithm, the graph is connected if and only if there is
exactly one set. If there are m connections and n computers, the space requirement is O(n).
Using union-by-size and path compression, we obtain a worst-case running time of O(m α(m,
n)), since there are 2m finds and at most n - 1 unions. This running time is linear for all
practical purposes.
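The algorithm above can be sketched directly with the disjoint set routines. This is a hedged, self-contained version (the names net_init, net_find, add_connection and the capacity MAX_COMPUTERS are invented for this sketch) that combines union-by-size with path compression, as the text suggests:

```c
#include <assert.h>

#define MAX_COMPUTERS 100   /* capacity chosen for this sketch */

static int parent[ MAX_COMPUTERS+1 ];   /* roots store -(set size) */

void net_init( int n )
{
    int i;
    for( i = 1; i <= n; i++ )
        parent[i] = -1;      /* each computer starts in its own set */
}

/* find with path compression: every node on the path is re-parented
   directly to the root */
int net_find( int x )
{
    if( parent[x] < 0 )
        return x;
    return parent[x] = net_find( parent[x] );
}

/* Process one connection (u, v); returns 1 if two sets were merged,
   0 if u and v could already transfer files. */
int add_connection( int u, int v )
{
    int ru = net_find( u ), rv = net_find( v );
    if( ru == rv )
        return 0;            /* already in the same equivalence class */
    if( parent[rv] < parent[ru] )   /* rv's set is larger */
    {
        parent[rv] += parent[ru];
        parent[ru] = rv;
    }
    else
    {
        parent[ru] += parent[rv];
        parent[rv] = ru;
    }
    return 1;
}
```

The network is fully connected exactly when n - 1 merges have occurred, i.e., one set remains.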
Summary
We have seen a very simple data structure to maintain disjoint sets. When the union
operation is performed, it does not matter, as far as correctness is concerned, which set
retains its name. A valuable lesson that should be learned here is that it can be very
important to consider the alternatives when a particular step is not totally specified. The
union step is flexible; by taking advantage of this, we are able to get a much more efficient
algorithm.
Path compression is one of the earliest forms of self-adjustment, which we have seen
elsewhere (splay trees, skew heaps). Its use is extremely interesting, especially from a
theoretical point of view, because it was one of the first examples of a simple algorithm with
a not-so-simple worst-case analysis.
HASHING
INTRODUCTION
Hashing is a method to store data in an array so that sorting, searching, inserting and
deleting data is fast. For this, every record needs a unique key.
The basic idea is not to search for the correct position of a record with comparisons but to
compute the position within the array. The function that returns the position is called the
'hash function' and the array is called a 'hash table'.
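For example, with an integer key and the division method (the table size of 1000 is chosen here to match the part-number examples later in this section), the position is computed directly from the key:

```c
#include <assert.h>

#define TABLE_SIZE 1000   /* table size chosen for this sketch */

/* Division-method hash function: position = key mod table size */
int hash( long key )
{
    return (int)( key % TABLE_SIZE );
}
```

A key such as 12345645 hashes to position 645 without any comparisons against other keys.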
WHY HASHING?
In the other types of searching, we have seen that the record is stored in a table and it is
necessary to pass through some number of keys before finding the desired one. We
know that an efficient search technique is one which minimizes these comparisons. Thus
we need a search technique in which there are no unnecessary comparisons.
If we want to access a key in a single retrieval, then the location of the record within the
table must depend only on the key, not on the location of other keys (as in other types of
searching, e.g. trees). The most efficient way to organize such a table is an array. This is
possible only with hashing.
HASH CLASH
Suppose two keys k1 and k2 are such that h(k1) equals h(k2). When records with these
two keys are to be stored, they can't both get the same position. Such a situation is called
a hash collision or hash clash.
METHODS OF DEALING WITH HASH CLASH
There are three basic methods of dealing with hash clash. They are:
1. Chaining
2. Rehashing
3. Separate chaining
Hashing Techniques
CHAINING
Data structures and Algorithms R.Punithavathi Department of IT
It builds a linked list of all items whose keys hash to the same value. During search, this
sorted linked list is traversed sequentially for the desired key. It involves adding an extra
link field to each table position. There are three types of chaining:
1. Standard coalesced hashing
2. General coalesced hashing
3. Varied-insertion coalesced hashing
Standard coalesced hashing is the simplest of the chaining methods. It reduces the
average number of probes for an unsuccessful search, and it does deletion efficiently
without affecting the overall efficiency. General coalesced hashing is a generalization of
the standard coalesced method. In this method, we add extra positions to the hash
table that can be used to list the nodes at the time of collision.
SOLVING HASH CLASHES BY LINEAR PROBING
The simplest method is that when a clash occurs, we insert the record in the next available
place in the table. For example, in the table shown later in this section, position 646 is
empty, so we can insert the record with key 012345645 in this place. Similarly, if another
record whose key mod 1000 equals 646 appears, it will be inserted in the next empty
space. This technique is called linear probing and is an example of a general approach to
resolving hash clashes called rehashing or open addressing.
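A minimal sketch of linear probing insertion, under the same division-method assumptions (the names lp_init and lp_insert are invented here; the keys below stand in for the part numbers in the text, written without their leading zeros since those would make them octal literals in C):

```c
#include <assert.h>

#define TABLE_SIZE 1000
#define EMPTY (-1L)

static long table[ TABLE_SIZE ];

void lp_init( void )
{
    int i;
    for( i = 0; i < TABLE_SIZE; i++ )
        table[i] = EMPTY;
}

/* Insert by linear probing: on a clash, try the next slot, wrapping
   around the end of the table; returns the position used, or -1 if
   the table is full. */
int lp_insert( long key )
{
    int i, pos = (int)( key % TABLE_SIZE );
    for( i = 0; i < TABLE_SIZE; i++ )
    {
        int p = ( pos + i ) % TABLE_SIZE;
        if( table[p] == EMPTY )
        {
            table[p] = key;
            return p;
        }
    }
    return -1;   /* table full */
}
```

Inserting 11345645 fills position 645; inserting 12345645 then clashes at 645 and lands in 646, as in the text's example.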
DISADVANTAGES OF LINEAR PROBING
It may happen, however, that the probing loop executes forever. There are two possible
reasons for this. First, the table may be full, so that it is impossible to insert any new
record. This situation can be detected by keeping a count of the number of records in the
table. When the count equals the table size, no further insertion should be done. The
other reason can arise even when the table is not full. Suppose all the odd positions are
empty and the even positions are full, and we try to insert at an even position with
rh(i) = (i + 2) % 1000 used as the rehash function; then only even positions are ever
probed, and the loop never ends. Of course, it is very unlikely that all the odd positions are
empty and all the even positions are full.
However, if the rehash function rh(i) = (i + 200) % 1000 is used, each key can be placed in
only one of five positions, and here the loop can run forever, too.
SEPARATE CHAINING
As we have seen earlier, we can't insert more items than the table size. In some cases, we
allocate much more space than required, resulting in wastage of space. In order to tackle
all these problems, we have a separate method of resolving clashes called separate
chaining. It keeps a distinct linked list for all records whose keys hash into a particular
value. In this method, the items that end with a particular number (units position) are placed
in a particular linked list, as shown in the figure. The 10's and 100's digits are not taken into
account. The pointer in each node points to the next node, and when there are no more
nodes, the pointer holds the NULL value.
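A hedged sketch of this scheme, hashing on the units digit as the text describes (the names chain_insert and chain_search are invented here):

```c
#include <assert.h>
#include <stdlib.h>

#define TABLE_SIZE 10   /* one list per units digit, as in the text */

struct node {
    long key;
    struct node *next;
};

static struct node *chains[ TABLE_SIZE ];   /* all lists start NULL */

/* Insert at the front of the list for the key's hash value */
void chain_insert( long key )
{
    int pos = (int)( key % TABLE_SIZE );
    struct node *n = malloc( sizeof *n );
    n->key = key;
    n->next = chains[pos];
    chains[pos] = n;
}

/* Search only the one list that can contain the key */
int chain_search( long key )
{
    struct node *p;
    for( p = chains[ key % TABLE_SIZE ]; p != NULL; p = p->next )
        if( p->key == key )
            return 1;
    return 0;
}
```

There is no limit on the number of items per list, which is the first advantage listed below.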
ADVANTAGES OF SEPARATE CHAINING
1. No worries of filling up the table, whatever the number of items.
2. The list items need not be in contiguous storage.
3. It allows traversal of items in hash key order.
SITUATION OF HASH CLASH
What would happen if we want to insert a new part number 012345645 in the table?
Using the hash function key % 1000 we get 645. Therefore the part belongs in position
645. However, that position is already occupied by the record with key 011345645.
Therefore the record with key 012345645 must be inserted somewhere else in the table,
resulting in a hash clash. This is illustrated in the given table:
POSITION   KEY         RECORD
0          258001201
2          698321903
3          986453204
.
.
450        256894450
451        158965451
.
.
647        214563647
648        782154648
649        325649649
.
.
997        011239997
998        231452998
999        011232999
DOUBLE HASHING
A method of open addressing for a hash table in which a collision is resolved by
searching the table for an empty place at intervals given by a different hash function, thus
minimizing clustering. Double Hashing is another method of collision resolution, but
unlike the linear collision resolution, double hashing uses a second hashing function that
normally limits multiple collisions. The idea is that if two values hash to the same spot in
the table, a constant can be calculated from the initial value using the second hashing
function that can then be used to change the sequence of locations in the table, but still
have access to the entire table.
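A minimal sketch of double hashing (the prime table size 11, the probe form h1(key) + i*h2(key), and the names dh_init, dh_insert are assumptions of this sketch, not taken from the text):

```c
#include <assert.h>

#define TABLE_SIZE 11   /* a prime table size, chosen for this sketch */
#define EMPTY (-1L)

static long dtable[ TABLE_SIZE ];

void dh_init( void )
{
    int i;
    for( i = 0; i < TABLE_SIZE; i++ )
        dtable[i] = EMPTY;
}

static int h1( long key ) { return (int)( key % TABLE_SIZE ); }

/* Second hash function: never zero, so the probe always advances */
static int h2( long key ) { return 1 + (int)( key % ( TABLE_SIZE - 1 ) ); }

/* Probe at intervals of h2(key) until an empty slot is found */
int dh_insert( long key )
{
    int i, step = h2( key );
    for( i = 0; i < TABLE_SIZE; i++ )
    {
        int p = ( h1( key ) + i * step ) % TABLE_SIZE;
        if( dtable[p] == EMPTY )
        {
            dtable[p] = key;
            return p;
        }
    }
    return -1;   /* no empty slot found */
}
```

Two keys that clash at the same initial position will generally have different step sizes, so their probe sequences diverge instead of clustering.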
CLUSTERING
There are mainly two types of clustering:
Primary clustering
When the entire array is empty, it is equally likely that a record is inserted at any position
in the array. However, once entries have been inserted and several hash clashes have been
resolved, it doesn't remain true. For example, in the table given above, it is five times as
likely for a record to be inserted at position 994 as at position 401. This is because
any record whose key hashes into 990, 991, 992, 993 or 994 will be placed in 994,
whereas only a record whose key hashes into 401 will be placed there. This phenomenon
where two keys that hash into different values compete with each other in successive
rehashes is called primary clustering.
CAUSE OF PRIMARY CLUSTERING
Any rehash function that depends solely on the index to be rehashed causes primary clustering.
WAYS OF ELIMINATING PRIMARY CLUSTERING
One way of eliminating primary clustering is to allow the rehash function to depend on
the number of times that the function is applied to a particular hash value. Another way is
to use a random permutation of the numbers between 1 and e, where e is table size - 1,
the largest index of the table. One more method is to allow the rehash to depend on the
hash value. All these methods allow keys that hash into different locations to follow
separate rehash paths.
SECONDARY CLUSTERING
In this type, different keys that hash to the same value follow the same rehash path.
WAYS TO ELIMINATE SECONDARY CLUSTERING
All types of clustering can be eliminated by double hashing, which involves the use of
two hash functions h1(key) and h2(key). h1 is known as the primary hash function and is
applied first to get the position where the key will be inserted. If that position is already
occupied, the rehash function rh(i, key) = (i + h2(key)) % tablesize is used successively
until an empty position is found. As long as h2(key1) doesn't equal h2(key2), records with
keys key1 and key2 don't compete for the same positions. Therefore one should choose
functions h1 and h2 that distribute the hashes and rehashes uniformly in the table and
also minimize clustering.
DELETING AN ITEM FROM THE HASH TABLE:
It is very difficult to delete an item from a hash table that uses rehashing for search and
insertion. Suppose that a record r is placed at some specific location. A second record r1
that hashes to the same location must then be inserted in the next empty location after
the original one. Suppose that the record r at the original location is then deleted.
Now, when we want to search for the record r1, the location where r was is empty, so the
search will erroneously conclude that the record r1 is absent from the table. One possible
solution to
this problem is that the deleted record must be marked "deleted" rather than "empty" and
the search must continue whenever a "deleted" position is encountered. But this is
possible only when there are small numbers of deletions otherwise an unsuccessful search
will have to search the entire table since most of the positions will be marked "deleted"
rather than "empty".
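The "deleted" marker scheme can be sketched as follows (the tombstone constant DELETED and the names t_init, t_insert, t_search, t_delete are invented for this sketch, which uses linear probing):

```c
#include <assert.h>

#define TABLE_SIZE 10
#define EMPTY   (-1L)
#define DELETED (-2L)   /* tombstone: the slot was occupied once */

static long ttable[ TABLE_SIZE ];

void t_init( void )
{
    int i;
    for( i = 0; i < TABLE_SIZE; i++ )
        ttable[i] = EMPTY;
}

/* Insert may reuse DELETED slots as well as EMPTY ones */
int t_insert( long key )
{
    int i, pos = (int)( key % TABLE_SIZE );
    for( i = 0; i < TABLE_SIZE; i++ )
    {
        int p = ( pos + i ) % TABLE_SIZE;
        if( ttable[p] == EMPTY || ttable[p] == DELETED )
        {
            ttable[p] = key;
            return p;
        }
    }
    return -1;
}

/* Search continues past DELETED slots and stops only at EMPTY */
int t_search( long key )
{
    int i, pos = (int)( key % TABLE_SIZE );
    for( i = 0; i < TABLE_SIZE; i++ )
    {
        int p = ( pos + i ) % TABLE_SIZE;
        if( ttable[p] == EMPTY )
            return 0;
        if( ttable[p] == key )
            return 1;
    }
    return 0;
}

void t_delete( long key )
{
    int i, pos = (int)( key % TABLE_SIZE );
    for( i = 0; i < TABLE_SIZE; i++ )
    {
        int p = ( pos + i ) % TABLE_SIZE;
        if( ttable[p] == key )
        {
            ttable[p] = DELETED;   /* mark, don't empty */
            return;
        }
        if( ttable[p] == EMPTY )
            return;
    }
}
```

Because search skips over tombstones instead of stopping, a record inserted past a later-deleted clash is still found.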
Part A Questions
1. "Chaining is a much better resolution technique than open addressing."
Justify the statement.
2. Briefly explain collision resolution by Open Addressing. Explain its
disadvantages.
3. Which problem occurs during Quadratic Rehash? How can it be avoided?
4. Write a short note on the Mid Square method adopted to calculate the address of
the record.
5. Explain the division method to obtain the address of the record.
6. Briefly explain the multiplicative method used to obtain the address of a
record.
7. Summarize the hash table organization in terms of advantages and
disadvantages.
8. What is Universal Hashing? What should be the correct table size m in the
Division Method used in Hashing?
9. a) Explain the hashing function "Digit analysis".
b) What should be done when a key is large?
10. Briefly explain the folding method used to calculate the address for a given
record. What is the major drawback of the hashing functions?
11. What is Direct Addressing? When is it used?
12. What is the advantage of using Hash Function over Direct Addressing?
Explain.
13. When does a 'collision' occur? What are the methods to resolve it? Explain
giving examples.
14. Explain the 'division method' for creating hash functions.
15. Consider a version of the division method in which h(k) = k mod m, where
m = 2^p - 1 and k is a character string interpreted in radix 2^p.
Show that if string x can be derived from string y by permuting its
characters, then x and y hash to the same value.
16. What is 'universal hashing' and how will you design a universal class of hash
functions?
17. Differentiate between linear probing and quadratic probing.
18. Insert the keys 10, 22, 31, 4, 15, 28, 17, 88, 59 into a hash table of length m = 11
using open addressing with the primary hash function h'(k) = k mod m.
Illustrate the result of inserting the keys using linear probing, using quadratic
probing with c1 = 1 and c2 = 3, and using double hashing with
h2(k) = 1 + (k mod (m - 1)).
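The three tables asked for in question 18 can be checked mechanically. The sketch below is illustrative (the helper `insert_all` and the lambda names are our own); each probe function follows the scheme named in the question.

```python
def insert_all(keys, m, probe):
    """Insert keys into a size-m open-addressing table; probe(k, i)
    gives the i-th slot tried for key k."""
    table = [None] * m
    for k in keys:
        for i in range(m):
            j = probe(k, i)
            if table[j] is None:
                table[j] = k
                break
    return table

m = 11
keys = [10, 22, 31, 4, 15, 28, 17, 88, 59]
h  = lambda k: k % m                      # primary hash h'(k) = k mod m
h2 = lambda k: 1 + (k % (m - 1))          # secondary hash for double hashing

linear    = insert_all(keys, m, lambda k, i: (h(k) + i) % m)
quadratic = insert_all(keys, m, lambda k, i: (h(k) + 1 * i + 3 * i * i) % m)
double    = insert_all(keys, m, lambda k, i: (h(k) + i * h2(k)) % m)
```

Under these conventions the linear-probing table comes out, by index 0..10, as 22, 88, -, -, 4, 15, 28, 17, 59, 31, 10.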
PART B Questions
1. Explain any two techniques to overcome hash collisions.
Separate chaining
Example
Explanation
Coding
Open addressing
Linear probing
Quadratic probing
2. Explain the implementation of different hashing techniques.
3. Given input {4371, 1323, 6173, 4199, 4344, 9679, 1989} and a hash function
h(x) = x mod 10, show the resulting:
(a) separate chaining table (4)
(b) open addressing hash table using linear probing (4)
(c) open addressing hash table using quadratic probing (4)
(d) open addressing hash table with second hash function h2(x) =7-(x mod 7).
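All four tables from question 3 can likewise be built programmatically. The sketch below uses our own helper names; note the textbook point it reveals: with table size 10, which is not prime, double hashing with h2(x) = 7 - (x mod 7) fails to place 1989 at all, because its probe sequence 9, 5, 1, 7, 3 cycles over already-filled slots.

```python
def chain_table(keys, m):
    """Separate chaining: one list per slot, keys appended in order."""
    chains = [[] for _ in range(m)]
    for k in keys:
        chains[k % m].append(k)
    return chains

def open_address(keys, m, probe):
    """Open addressing; returns the table plus any keys that could
    not be placed within m probes."""
    table = [None] * m
    failed = []
    for k in keys:
        for i in range(m):
            j = probe(k, i)
            if table[j] is None:
                table[j] = k
                break
        else:
            failed.append(k)   # probe sequence never found an empty slot
    return table, failed

keys = [4371, 1323, 6173, 4199, 4344, 9679, 1989]
m = 10
h  = lambda x: x % m
h2 = lambda x: 7 - (x % 7)

chains     = chain_table(keys, m)
linear, _  = open_address(keys, m, lambda x, i: (h(x) + i) % m)
quad, _    = open_address(keys, m, lambda x, i: (h(x) + i * i) % m)
dbl, stuck = open_address(keys, m, lambda x, i: (h(x) + i * h2(x)) % m)
```

The failure of double hashing here is the standard argument for choosing a prime table size: a prime m guarantees every probe step visits all slots.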
Objective Type Questions
1. Name the function which, when applied to the key, produces an integer which can be
used as an address in a hash table.
a) Greatest Integer Function
b) Modulus function
c) Signum Function
d) Hash Function
2) What is the phenomenon that occurs when a hash function
maps 2 different keys to the same table address?
a) Collision
b) Overlaps
c) Nothing Happens
d) Rebound
3) What is the name of the collision resolution technique in which the next slot in the table is
checked on a collision?
a) Chaining
b) Linear probing
c) Quadratic Probing
d) Clustering
4) Name the hashing scheme in which a second-order function of the hash index is used to
calculate the address.
a) Quadratic Probing
b) Sequential Search
c) Linear Probing
d) Chaining
5) What is the name of the collision resolution technique in which colliding records are
chained into a special overflow area which is separate from the prime area?
a) Chaining
b) Mid Square Method
c) Folding Method
d) Open Addressing
6) Which number corresponds to the mapping of 72963195 according to the folding method?
a) 455
b) 1455
c) 451
d) 456
7) What will be the address of the key 12345 according to the mid-square
method if the digits at positions 5 and 6 are taken into account?
a) 21
b) 91
c) 99
d) 19
8) Which type of address is obtained by a perfect hash function?
a) Odd
b) Unique
c) Even
d) No Address is obtained
9) What is the name of the collision sequences generated by addresses calculated with the
help of quadratic probing?
a) Rings
b) Stacks
c) Secondary Clustering
d) Linked Lists
10) Name the technique used to avoid secondary clustering.
a) Quadratic Probe
b) Double Hashing
c) Linear Probe
d) Open addressing
Answers:
1. Answer: d) Hash Function
Explanation: A hash function performs a key-to-address transformation, i.e. when applied to
a key it produces an integer which can be used as an address in a hash table.
2. Answer: a) Collision
Explanation: When a hash function maps 2 different keys to the same table address, a
collision is said to occur. Suppose that 2 keys K1 and K2 map to the same address. When
a record with the key K1 is entered, it is inserted at the hashed address. When
another record with key K2 is entered, K2 hashes to the same address, which is already
occupied by the record with key K1. This results in a hash collision.
3. Answer: b) Linear Probing
Explanation: Linear probing comes under the category of collision resolution techniques
called open addressing. The address in the hash table is calculated using the hash
function; if that position is not empty, the next slot of the table is checked,
and the record is inserted in the first empty position found.
4. Answer: a) Quadratic Probing
Explanation: Quadratic probing is a rehashing scheme in which a second-order
function of the hash index is used to calculate the address. If there is a collision at the
hash address h, then this method probes the array at locations h+1, h+4, h+9, and so on.
5. Answer: a) Chaining
Explanation: Chaining is a collision resolution technique in which colliding records are
chained into a special overflow area which is separate from the prime area. The prime
area contains the part of the array into which the records are initially
hashed. Each record in the prime area contains a pointer field used to maintain a separate
linked list for each set of colliding records in the overflow area.
6. Answer: a) 455
Explanation: In the folding method a key is broken into several parts. Each part has
the same number of digits as the required address, except possibly the last part. The parts
are added together and the final carry is ignored. Therefore, 72963195 maps to 729 + 631
+ 95 = 1455, which gives 455 after ignoring the final carry of 1.
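The folding computation above can be expressed as a short sketch (the function name is illustrative; conventions for folding vary slightly between texts, and this one splits the digits from the left into fixed-width groups):

```python
def fold_hash(key, width):
    """Folding method: split the key's digits into groups of `width`
    digits (from the left), add the groups, and drop any carry beyond
    `width` digits."""
    s = str(key)
    parts = [int(s[i:i + width]) for i in range(0, len(s), width)]
    return sum(parts) % (10 ** width)   # taking mod 10^width drops the carry
```

For the example in the answer, fold_hash(72963195, 3) splits the key into 729, 631 and 95, sums them to 1455, and drops the carry to give 455.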
7. Answer: c) 99
Explanation: In the mid-square method a key is multiplied by itself and the hash value is
obtained by selecting an appropriate number of digits from the middle of the square.
For the key 12345, the square of the key is 152399025. If we choose the digits at the 5th and
6th positions for the address, then the address is 99.
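The mid-square computation can be checked directly; note that 12345 squared is 152399025. The helper below is an illustrative sketch (texts differ on how the middle digits are selected; here positions are counted 1-based from the left):

```python
def mid_square(key, positions):
    """Mid-square method: square the key and pick the digits at the
    given 1-based positions, counted from the left of the square."""
    digits = str(key * key)
    return int("".join(digits[p - 1] for p in positions))
```

With this convention, mid_square(12345, (5, 6)) reads the 5th and 6th digits of 152399025, which are 9 and 9, giving the address 99.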
8. Answer: b) Unique
Explanation: A perfect hash function is a function which, when applied to all the
members of the set of items to be stored in a hash table, produces a unique set of
integers within some suitable range.
9. Answer: c) Secondary Clustering
Explanation: Secondary clustering refers to the collision sequences generated by addresses
calculated with quadratic probing: different keys that hash to the same
value follow exactly the same rehash sequence, and so collide with each other repeatedly.
10. Answer: b) Double Hashing
Explanation: Double hashing can be used to avoid secondary clustering. Suppose h1 is a
hash function which generates the same address i for 2 different
keys x1 and x2, and we have a second function h2 which generates different
increments for x1 and x2. We can then use h1 as the primary hash function and h2 as the
rehash function. The functions h1 and h2 should be chosen so as to distribute the hashes and
rehashes uniformly over the array indices and to minimize clustering.
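The point above can be demonstrated concretely: two keys that collide under h1 follow different probe sequences once h2 is applied. A minimal sketch, with an assumed table size m = 11 and the common choice h2(k) = 1 + (k mod (m - 1)):

```python
m = 11
h1 = lambda k: k % m
h2 = lambda k: 1 + (k % (m - 1))   # never 0, so the probe always advances

def probe_sequence(k, trials=4):
    """First few slots probed for key k under double hashing."""
    return [(h1(k) + i * h2(k)) % m for i in range(trials)]

# Keys 12 and 23 both collide under h1 (h1 maps each to 1), but
# h2(12) = 3 while h2(23) = 4, so their rehash sequences diverge
# immediately -- this is what breaks secondary clustering.
```

By contrast, under quadratic probing alone both keys would probe 1, 2, 5, 10, ... in lockstep.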
11.What is the best definition of a collision in a hash table?
o A. Two entries are identical except for their keys.
o B. Two entries with different data have the exact same key.
o C. Two entries with different keys have the same exact hash value.
o D. Two entries with the exact same key have different hash values.
12. Which guideline is NOT suggested by empirical or theoretical studies of
hash tables:
o A. Hash table size should be the product of two primes.
o B. Hash table size should be the upper of a pair of twin primes.
o C. Hash table size should have the form 4K+3 for some K.
o D. Hash table size should not be too near a power of two.
13. In an open-address hash table there is a difference between those spots which
have never been used and those spots which have previously been used but no
longer contain an item. Which function has a better implementation because of
this difference?
a. A. insert
b. B. is_present
c. C. remove
d. D. size
e. E. Two or more of the above functions
14. What kind of initialization needs to be done for an open-address hash table?
a. A. None.
b. B. The key at each array location must be initialized.
c. C. The head pointer of each chain must be set to NULL.
d. D. Both B and C must be carried out.
15. What kind of initialization needs to be done for a chained hash table?
a. A. None.
b. B. The key at each array location must be initialized.
c. C. The head pointer of each chain must be set to NULL.
d. D. Both B and C must be carried out.
16. A chained hash table has an array size of 512. What is the maximum number of
entries that can be placed in the table?
a. A. 256
b. B. 511
c. C. 512
d. D. 1024
e. E. There is no maximum.
17. Suppose you place m items in a hash table with an array size of s. What is the
correct formula for the load factor?
a. A. s + m
b. B. s - m
c. C. m - s
d. D. m * s
e. E. m / s
18. The constructor of the Hashtable class initializes data members and creates the
Hashtable.
o True
o False
19. Hash tables are very good if there are many insertions and deletions.
o True
o False