Data structures and Algorithms R.Punithavathi Department of IT
UNIT III
DISJOINT SET ADT, Hashing
THE DISJOINT SET ADT
Introduction
It is an efficient data structure to solve the equivalence problem. Each routine
requires only a few lines of code, and a simple array can be used. The
implementation is also extremely fast, requiring constant average time per
operation.
1. Equivalence Relations
A relation R is defined on a set S if for every pair of elements (a, b), a, b ∈ S, a R
b is either true or false. If a R b is true, then we say that a is related to b.
An equivalence relation is a relation R that satisfies three properties:
1. (Reflexive) a R a, for all a ∈ S.
2. (Symmetric) a R b if and only if b R a.
3. (Transitive) a R b and b R c implies that a R c.
Examples.
1. The relationship ≤ is not an equivalence relationship.
Although it is reflexive, since a ≤ a, and
transitive, since a ≤ b and b ≤ c implies a ≤ c,
it is not symmetric, since a ≤ b does not imply b ≤ a.
2. Electrical connectivity, where all connections are by metal wires, is an
equivalence relation.
The relation is clearly reflexive, as any component is connected to itself. If a is
electrically connected to b, then b must be electrically connected to a, so the
relation is symmetric. Finally, if a is connected to b and b is connected to c, then
a is connected to c. Thus electrical connectivity is an equivalence relation.
3. Two cities are related if they are in the same country. It is easily verified that
this is an equivalence relation. Suppose town a is related to b if it is possible to
travel from a to b by taking roads. This relation is an equivalence relation if all
the roads are two-way.
2. The Dynamic Equivalence Problem
The input is initially a collection of n sets, each with one element. This initial
representation is that all relations (except reflexive relations) are false. Each set
has a different element, so that Si ∩ Sj = ∅; this makes the sets disjoint.
There are two permissible operations.
The first is find, which returns the name of the set (that is, the equivalence class)
containing a given element.
The second operation adds relations. If we want to add the relation a ~ b, then
we first see if a and b are already related. This is done by performing finds on
both a and b and checking whether they are in the same equivalence class. If
they are not, then we apply union. This operation merges the two equivalence
classes containing a and b into a new equivalence class. From a set point of
view, the result of ∪ is to create a new set Sk = Si ∪ Sj, destroying the originals
and preserving the disjointness of all the sets. The algorithm to do this is
frequently known as the disjoint set union/find algorithm for this reason.
This algorithm is dynamic because, during the course of the algorithm, the sets
can change via the union operation. The algorithm must also operate on-line:
When a find is performed, it must give an answer before continuing. Another
possibility would be an off-line algorithm. Such an algorithm would be allowed to
see the entire sequence of unions and finds. The answer it provides for each find
must still be consistent with all the unions that were performed up until the find,
but the algorithm can give all its answers after it has seen all the questions. The
difference is similar to taking a written exam (which is generally off-line--you only
have to give the answers before time expires), and an oral exam (which is on-
line, because you must answer the current question before proceeding to the
next question).
Notice that we do not perform any operations comparing the relative values of
elements, but merely require knowledge of their location. For this reason, we can
assume that all the elements have been numbered sequentially from 1 to n and
that the numbering can be determined easily by some hashing scheme. Thus,
initially we have Si = {i} for i = 1 through n.
Our second observation is that the name of the set returned by find is actually
fairly arbitrary. All that really matters is that find(x) = find(y) if and only if x and y
are in the same set.
These operations are important in many graph theory problems and also in
compilers which process equivalence (or type) declarations. We will see an
application later.
There are two strategies to solve this problem. One ensures that the find
instruction can be executed in constant worst-case time, and the other ensures
that the union instruction can be executed in constant worst-case time. It has
recently been shown that both cannot be done simultaneously in constant worst-
case time.
We will now briefly discuss the first approach. For the find operation to be fast,
we could maintain, in an array, the name of the equivalence class for each
element. Then find is just a simple O(1) lookup. Suppose we want to perform
union(a, b). Suppose that a is in equivalence class i and b is in equivalence class
j. Then we scan down the array, changing all is to j. Unfortunately, this scan
takes Θ(n). Thus, a sequence of n - 1 unions (the maximum, since then
everything is in one set) would take Θ(n²) time. If there are Ω(n²) find operations,
this performance is fine, since the total running time would then amount to O(1)
for each union or find operation over the course of the algorithm. If there are
fewer finds, this bound is not acceptable.
One idea is to keep all the elements that are in the same equivalence class in a
linked list. This saves time when updating, because we do not have to search
through the entire array. This by itself does not reduce the asymptotic running
time, because it is still possible to perform Θ(n²) equivalence class updates over
the course of the algorithm.
If we also keep track of the size of each equivalence class, and when performing
unions we change the name of the smaller equivalence class to the larger, then
the total time spent for n - 1 merges is O(n log n). The reason for this is that each
element can have its equivalence class changed at most log n times, since every
time its class is changed, its new equivalence class is at least twice as large as
its old. Using this strategy, any sequence of m finds and up to n - 1 unions takes
at most O(m + n log n) time.
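The name-array scheme with size tracking described above is not given in code in the text; a minimal sketch under the stated assumptions (a fixed element count, sequential numbering from 1; the identifiers qf_init, qf_find, qf_union and the constant QF_N are invented here) might look like:

```c
#include <assert.h>

#define QF_N 8   /* number of elements, chosen for this sketch */

static int class_name[ QF_N+1 ];   /* equivalence class name of each element */
static int class_size[ QF_N+1 ];   /* size of each class, indexed by name */

void qf_init( void )
{
    int i;
    for( i = 1; i <= QF_N; i++ )
    {
        class_name[i] = i;   /* initially Si = { i } */
        class_size[i] = 1;
    }
}

/* find is a simple O(1) array lookup */
int qf_find( int x )
{
    return class_name[x];
}

/* Rename the smaller class to the larger; each element's class can
   change at most log N times, since its class at least doubles. */
void qf_union( int a, int b )
{
    int i, ci = qf_find( a ), cj = qf_find( b );
    if( ci == cj )
        return;
    if( class_size[ci] < class_size[cj] )
    {
        int t = ci; ci = cj; cj = t;   /* ensure ci names the larger class */
    }
    for( i = 1; i <= QF_N; i++ )
        if( class_name[i] == cj )
            class_name[i] = ci;
    class_size[ci] += class_size[cj];
}
```

This makes find O(1) at the cost of a linear scan per union, matching the trade-off discussed above.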
In the remainder of this chapter, we will examine a solution to the union/find
problem that makes unions easy but finds hard. Even so, the running time for any
sequences of at most m finds and up to n - 1 unions will be only a little more than
O(m + n).
3. Basic Data Structure
Recall that the problem does not require that a find operation return any specific
name, just that finds on two elements return the same answer if and only if they
are in the same set. One idea might be to use a tree to represent each set, since
each element in a tree has the same root. Thus, the root can be used to name
the set. We will represent each set by a tree. (Recall that a collection of trees is
known as a forest.) Initially, each set contains one element. The trees we will use
are not necessarily binary trees, but their representation is easy, because the
only information we will need is a parent pointer. The name of a set is given by
the node at the root. Since only the name of the parent is required, we can
assume that this tree is stored implicitly in an array: each entry p[i] in the array
represents the parent of element i. If i is a root, then p[i] = 0. In the forest in
Figure 1, p[i] = 0 for 1 ≤ i ≤ 8. As with heaps, we will draw the trees explicitly, with the
understanding that an array is being used. Figure 1 shows the explicit
representation. We will draw the root's parent pointer vertically for convenience.
To perform a union of two sets, we merge the two trees by making the root of
one tree point to the root of the other. It should be clear that this operation takes
constant time. Figures 2, 3, and 4 represent the forest after each of union(5,6),
union(7,8), union(5,7), where we have adopted the convention that the new root
after the union(x,y) is x. The implicit representation of the last forest is shown in
Figure 5.
A find(x) on element x is performed by returning the root of the tree containing x.
The time to perform this operation is proportional to the depth of the node
representing x, assuming, of course, that we can find the node representing x in
constant time. Using the strategy above, it is possible to create a tree of depth n -
1, so the worst-case running time of a find is O(n). Typically, the running time is
computed for a sequence of m intermixed instructions. In this case, m
consecutive operations could take O(mn) time in the worst case.
The code in Figures 6 through 9 represents an implementation of the basic
algorithm, assuming that error checks have already been performed. In our
routine, unions are performed on the roots of the trees. Sometimes the operation
is performed by passing any two elements, and having the union perform two
finds to determine the roots.
Figure 1 Eight elements, initially in different sets
Figure 2 After union (5, 6)
Figure 3 After union (7, 8)
Figure 4 After union (5, 7)
Figure 5 Implicit representation of previous tree
typedef int DISJ_SET[ NUM_SETS+1 ];
typedef unsigned int set_type;
typedef unsigned int element_type;
Figure 6 Disjoint set type declaration
void
initialize( DISJ_SET S )
{
int i;
for( i = NUM_SETS; i > 0; i-- )
S[i] = 0;
}
Figure 7 Disjoint set initialization routine
/* Assumes root1 and root2 are roots. */
/* union is a C keyword, so this routine is named set_union. */
void
set_union( DISJ_SET S, set_type root1, set_type root2 )
{
S[root2] = root1;
}
Figure 8 Union (not the best way)
set_type
find( element_type x, DISJ_SET S )
{
if( S[x] <= 0 )
return x;
else
return( find( S[x], S ) );
}
Figure 9 A simple disjoint set find algorithm
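The routines of Figures 6 through 9 can be exercised together. The following self-contained sketch (with NUM_SETS fixed at 8 to match Figure 1, an assumption made here) reproduces the unions of Figures 2 through 4:

```c
#include <assert.h>

#define NUM_SETS 8   /* eight elements, as in Figure 1 */

typedef int DISJ_SET[ NUM_SETS+1 ];
typedef unsigned int set_type;
typedef unsigned int element_type;

/* Figure 7: every element starts as the root of its own one-node tree */
void initialize( DISJ_SET S )
{
    int i;
    for( i = NUM_SETS; i > 0; i-- )
        S[i] = 0;
}

/* Figure 8: make root2's tree a subtree of root1's */
void set_union( DISJ_SET S, set_type root1, set_type root2 )
{
    S[root2] = root1;
}

/* Figure 9: follow parent pointers up to the root */
set_type find( element_type x, DISJ_SET S )
{
    if( S[x] <= 0 )
        return x;
    else
        return find( S[x], S );
}
```

After union(5,6), union(7,8), and union(5,7), a find on element 8 follows the path 8 → 7 → 5 and returns 5, the name of the merged set.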
4. Smart Union Algorithms
The unions above were performed rather arbitrarily, by making the second tree a
subtree of the first. A simple improvement is always to make the smaller tree a
subtree of the larger, breaking ties by any method; we call this approach union-
by-size. The three unions in the preceding example were all ties, and so we can
consider that they were performed by size. If the next operation were union (4,
5), then the forest in Figure 10 would form. Had the size heuristic not been used,
a deeper forest would have been formed (Fig. 11).
Figure 10 Result of union-by-size
Figure 11 Result of an arbitrary union
Figure 12 Worst-case tree for n = 16
We can prove that if unions are done by size, the depth of any node is never
more than log n. To see this, note that a node is initially at depth 0. When its
depth increases as a result of a union, it is placed in a tree that is at least twice
as large as before. Thus, its depth can be increased at most log n times. (We
used this argument in the quick-find algorithm at the end of Section 2.) This
implies that the running time for a find operation is O(log n), and a sequence of m
operations takes O(m log n). The tree in Figure 12 shows the worst tree possible
after 16 unions and is obtained if all unions are between equal-sized trees.
To implement this strategy, we need to keep track of the size of each tree. Since
we are really just using an array, we can have the array entry of each root
contain the negative of the size of its tree. Thus, initially the array representation
of the tree is all -1s (and Fig 7 needs to be changed accordingly). When a union
is performed, check the sizes; the new size is the sum of the old. Thus, union-by-
size is not at all difficult to implement and requires no extra space. It is also fast,
on average. For virtually all reasonable models, it has been shown that a
sequence of m operations requires O(m) average time if union-by-size is used.
This is because when random unions are performed, generally very small
(usually one-element) sets are merged with large sets throughout the algorithm.
The following figures show a tree and its implicit representation for both union-by-
size and union-by-height. The code in Figure 13 implements union-by-height.
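The text gives code only for union-by-height (Figure 13). A hedged sketch of union-by-size, with the negative of the size stored at each root as described above (the names initialize_by_size, set_union_by_size, and find_size are invented here), could be:

```c
#include <assert.h>

#define NUM_SETS 8
typedef int DISJ_SET[ NUM_SETS+1 ];
typedef unsigned int set_type;
typedef unsigned int element_type;

/* Initialize every entry to -1: each root stores -(size of its tree) */
void initialize_by_size( DISJ_SET S )
{
    int i;
    for( i = NUM_SETS; i > 0; i-- )
        S[i] = -1;
}

/* Union-by-size: attach the smaller tree under the larger root;
   the new root's entry becomes the sum of the two (negative) sizes. */
void set_union_by_size( DISJ_SET S, set_type root1, set_type root2 )
{
    if( S[root2] < S[root1] )      /* root2's tree is strictly larger */
    {
        S[root2] += S[root1];
        S[root1] = root2;
    }
    else                           /* root1's tree is larger, or a tie */
    {
        S[root1] += S[root2];
        S[root2] = root1;
    }
}

/* Roots now hold negative sizes, so the root test is S[x] < 0 */
set_type find_size( element_type x, DISJ_SET S )
{
    if( S[x] < 0 )
        return x;
    else
        return find_size( S[x], S );
}
```

Note that the root test changes from S[x] <= 0 to S[x] < 0, since roots no longer hold zero.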
5. Path Compression
The union/find algorithm, as described so far, is quite acceptable for most cases.
It is very simple and linear on average for a sequence of m instructions (under all
models). However, the worst case of O(m log n ) can occur fairly easily and
naturally.
/* assume root1 and root2 are roots */
/* union is a C keyword, so this routine is named set_union */
void
set_union (DISJ_SET S, set_type root1, set_type root2 )
{
if( S[root2] < S[root1] ) /* root2 is deeper set */
S[root1] = root2; /* make root2 new root */
else
{
if( S[root2] == S[root1] ) /* same height, so update */
S[root1]--;
S[root2] = root1; /* make root1 new root */
}
}
Figure 13 Code for union-by-height (rank)
For instance, if we put all the sets on a queue and repeatedly dequeue the first
two sets and enqueue the union, the worst case occurs. If there are many more
finds than unions, this running time is worse than that of the quick-find algorithm.
Moreover, it should be clear that there are probably no more improvements
possible for the union algorithm. This is based on the observation that any
method to perform the unions will yield the same worst-case trees, since it must
break ties arbitrarily. Therefore, the only way to speed the algorithm up, without
reworking the data structure entirely, is to do something clever on the find
operation.
The clever operation is known as path compression. Path compression is
performed during a find operation and is independent of the strategy used to
perform unions. Suppose the operation is find(x). Then the effect of path
compression is that every node on the path from x to the root has its parent
changed to the root. Figure 14 shows the effect of path compression after find
(15) on the generic worst tree of Figure 12.
The effect of path compression is that with an extra two pointer moves, nodes 13
and 14 are now one position closer to the root and nodes 15 and 16 are now two
positions closer. Thus, the fast future accesses on these nodes will pay (we
hope) for the extra work to do the path compression.
As the code in Figure 15 shows, path compression is a trivial change to the basic
find algorithm. The only change to the find routine is that S[x] is made equal to
the value returned by find; thus after the root of the set is found recursively, x is
made to point directly to it. This occurs recursively to every node on the path to
the root, so this implements path compression. As we stated when we
implemented stacks and queues, modifying a parameter to a function is not
necessarily in line with current software engineering rules. Some languages
will not allow this, so this code may well need changes.
Figure 14 An example of path compression
set_type
find( element_type x, DISJ_SET S )
{
if( S[x] <= 0 )
return x;
else
return( S[x] = find( S[x], S ) );
}
Figure 15 Code for disjoint set find with path compression
When unions are done arbitrarily, path compression is a good idea, because
there is an abundance of deep nodes and these are brought near the root by
path compression. It has been proven that when path compression is done in this
case, a sequence of m operations requires at most O(m log n) time. It is still an
open problem to determine what the average-case behavior is in this situation.
Path compression is perfectly compatible with union-by-size, and thus both
routines can be implemented at the same time. Since doing union-by-size by
itself is expected to execute a sequence of m operations in linear time, it is not
clear that the extra pass involved in path compression is worthwhile on average.
Indeed, this problem is still open. However, as we shall see later, the
combination of path compression and a smart union rule guarantees a very
efficient algorithm in all cases.
Path compression is not entirely compatible with union-by-height, because path
compression can change the heights of the trees. It is not at all clear how to re-
compute them efficiently. The answer is do not!! Then the heights stored for each
tree become estimated heights (sometimes known as ranks), but it turns out that
union-by-rank (which is what this has now become) is just as efficient in theory
as union-by-size. Furthermore, heights are updated less often than sizes. As with
union-by-size, it is not clear whether path compression is worthwhile on average.
What we will show in the next section is that with either union heuristic, path
compression significantly reduces the worst-case running time.
6. Worst Case for Union-by-Rank and Path Compression
When both heuristics are used, the algorithm is almost linear in the worst case.
Specifically, the time required in the worst case is Θ(m α(m, n)) (provided m ≥ n),
where α(m, n) is a functional inverse of Ackermann's function, which is defined
below:*
A(1, j) = 2^j for j ≥ 1
A(i, 1) = A(i - 1, 2) for i ≥ 2
A(i, j) = A(i - 1, A(i, j - 1)) for i, j ≥ 2
*Ackermann's function is frequently defined with A(1, j) = j + 1 for j ≥ 1. The form in
this text grows faster; thus, the inverse grows more slowly.
From this, we define
α(m, n) = min{i ≥ 1 | A(i, ⌊m/n⌋) > log n}
You may want to compute some values, but for all practical purposes, α(m, n) ≤ 4,
which is all that is really important here. The single-variable inverse Ackermann
function, sometimes written as log* n, is the number of times the logarithm of n
needs to be applied until n ≤ 1. Thus, log* 65536 = 4, because log log log log
65536 = 1. log* 2^65536 = 5, but keep in mind that 2^65536 is a 20,000-digit number.
α(m, n) actually grows even slower than log* n. However, α(m, n) is not a constant,
so the running time is not linear.
In the remainder of this section, we will prove a slightly weaker result. We will
show that any sequence of m = Ω(n) union/find operations takes a total of O(m
log* n) running time. The same bound holds if union-by-rank is replaced with
union-by-size. This analysis is probably the most complex in the book and one of
the first truly complex worst-case analyses ever performed for an algorithm that
is essentially trivial to implement.
An Application
We have a network of computers and a list of bidirectional connections; each of these
connections allows a file transfer from one computer to another. Is it possible to send a file
from any computer on the network to any other? An extra restriction is that the problem must
be solved on-line. Thus, the list of connections is presented one at a time, and the algorithm
must be prepared to give an answer at any point.
An algorithm to solve this problem can initially put every computer in its own set. Our
invariant is that two computers can transfer files if and only if they are in the same set. We
can see that the ability to transfer files forms an equivalence relation. We then read
connections one at a time. When we read some connection, say (u, v), we test to see
whether u and v are in the same set and do nothing if they are. If they are in different sets,
we merge their sets. At the end of the algorithm, the graph is connected if and only if there is
exactly one set. If there are m connections and n computers, the space requirement is O(n).
Using union-by-size and path compression, we obtain a worst-case running time of O(m α(m,
n)), since there are 2m finds and at most n - 1 unions. This running time is linear for all
practical purposes.
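The algorithm above can be sketched directly with the disjoint set routines. This is a hedged, self-contained version (the names net_init, net_find, add_connection and the capacity MAX_COMPUTERS are invented for this sketch) that combines union-by-size with path compression, as the text suggests:

```c
#include <assert.h>

#define MAX_COMPUTERS 100   /* capacity chosen for this sketch */

static int parent[ MAX_COMPUTERS+1 ];   /* roots store -(set size) */

void net_init( int n )
{
    int i;
    for( i = 1; i <= n; i++ )
        parent[i] = -1;      /* each computer starts in its own set */
}

/* find with path compression: every node on the path is re-parented
   directly to the root */
int net_find( int x )
{
    if( parent[x] < 0 )
        return x;
    return parent[x] = net_find( parent[x] );
}

/* Process one connection (u, v); returns 1 if two sets were merged,
   0 if u and v could already transfer files. */
int add_connection( int u, int v )
{
    int ru = net_find( u ), rv = net_find( v );
    if( ru == rv )
        return 0;            /* already in the same equivalence class */
    if( parent[rv] < parent[ru] )   /* rv's set is larger */
    {
        parent[rv] += parent[ru];
        parent[ru] = rv;
    }
    else
    {
        parent[ru] += parent[rv];
        parent[rv] = ru;
    }
    return 1;
}
```

The network is fully connected exactly when n - 1 merges have occurred, i.e., one set remains.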
Summary
We have seen a very simple data structure to maintain disjoint sets. When the union
operation is performed, it does not matter, as far as correctness is concerned, which set
retains its name. A valuable lesson that should be learned here is that it can be very
important to consider the alternatives when a particular step is not totally specified. The
union step is flexible; by taking advantage of this, we are able to get a much more efficient
algorithm.
Path compression is one of the earliest forms of self-adjustment, which we have seen
elsewhere (splay trees, skew heaps). Its use is extremely interesting, especially from a
theoretical point of view, because it was one of the first examples of a simple algorithm with
a not-so-simple worst-case analysis.
HASHING
INTRODUCTION
Hashing is a method to store data in an array so that sorting, searching, inserting and
deleting data is fast. For this, every record needs a unique key.
The basic idea is not to search for the correct position of a record with comparisons but to
compute the position within the array. The function that returns the position is called the
'hash function' and the array is called a 'hash table'.
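For example, with an integer key and the division method (the table size of 1000 is chosen here to match the part-number examples later in this section), the position is computed directly from the key:

```c
#include <assert.h>

#define TABLE_SIZE 1000   /* table size chosen for this sketch */

/* Division-method hash function: position = key mod table size */
int hash( long key )
{
    return (int)( key % TABLE_SIZE );
}
```

A key such as 12345645 hashes to position 645 without any comparisons against other keys.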
WHY HASHING?
In the other types of searching, we have seen that the record is stored in a table and it is
necessary to pass through some number of keys before finding the desired one. We
know that an efficient search technique is one which minimizes these comparisons. Thus
we need a search technique in which there are no unnecessary comparisons.
If we want to access a key in a single retrieval, then the location of the record within the
table must depend only on the key, not on the location of other keys (as in other types of
searching, e.g. trees). The most efficient way to organize such a table is an array. This is
possible only with hashing.
HASH CLASH
Suppose two keys k1 and k2 are such that h(k1) equals h(k2). When records with these
two keys are to be stored, they can't both get the same position. Such a situation is called
a hash collision or hash clash.
METHODS OF DEALING WITH HASH CLASH
There are three basic methods of dealing with hash clash. They are:
1. Chaining
2. Rehashing
3. Separate chaining
Hashing Techniques
CHAINING
Data structures and Algorithms R.Punithavathi Department of IT
It builds a linked list of all items whose keys hash to the same value. During search, this
sorted linked list is traversed sequentially for the desired key. It involves adding an extra
link field to each table position. There are three types of chaining:
1. Standard coalesced hashing
2. General coalesced hashing
3. Varied-insertion coalesced hashing
Standard coalesced hashing is the simplest of the chaining methods. It reduces the
average number of probes for an unsuccessful search, and it does deletion efficiently
without affecting the overall efficiency. General coalesced hashing is a generalization of
the standard coalesced method. In this method, we add extra positions to the hash
table that can be used to list the nodes at the time of collision.
SOLVING HASH CLASHES BY LINEAR PROBING
The simplest method is that when a clash occurs, we insert the record in the next available
place in the table. For example, in the table shown later in this section, position 646 is
empty, so we can insert the record with key 012345645 in this place. Similarly, if another
record whose key mod 1000 equals 646 appears, it will be inserted in the next empty
space. This technique is called linear probing and is an example of a general approach to
resolving hash clashes called rehashing or open addressing.
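A minimal sketch of linear probing insertion, under the same division-method assumptions (the names lp_init and lp_insert are invented here; the keys below stand in for the part numbers in the text, written without their leading zeros since those would make them octal literals in C):

```c
#include <assert.h>

#define TABLE_SIZE 1000
#define EMPTY (-1L)

static long table[ TABLE_SIZE ];

void lp_init( void )
{
    int i;
    for( i = 0; i < TABLE_SIZE; i++ )
        table[i] = EMPTY;
}

/* Insert by linear probing: on a clash, try the next slot, wrapping
   around the end of the table; returns the position used, or -1 if
   the table is full. */
int lp_insert( long key )
{
    int i, pos = (int)( key % TABLE_SIZE );
    for( i = 0; i < TABLE_SIZE; i++ )
    {
        int p = ( pos + i ) % TABLE_SIZE;
        if( table[p] == EMPTY )
        {
            table[p] = key;
            return p;
        }
    }
    return -1;   /* table full */
}
```

Inserting 11345645 fills position 645; inserting 12345645 then clashes at 645 and lands in 646, as in the text's example.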
DISADVANTAGES OF LINEAR PROBING
It may happen, however, that the probing loop executes forever. There are two possible
reasons for this. First, the table may be full, so that it is impossible to insert any new
record. This situation can be detected by keeping a count of the number of records in the
table. When the count equals the table size, no further insertion should be done. The
other reason can arise even when the table is not full. Suppose all the odd positions are
empty and the even positions are full, and we try to insert at an even position with
rh(i) = (i + 2) % 1000 used as the rehash function; then only even positions are ever
probed, and the loop never ends. Of course, it is very unlikely that all the odd positions are
empty and all the even positions are full.
However, if the rehash function rh(i) = (i + 200) % 1000 is used, each key can be placed in
only one of five positions, and here the loop can run forever, too.
SEPARATE CHAINING
As we have seen earlier, we can't insert more items than the table size. In some cases, we
allocate much more space than required, resulting in wastage of space. In order to tackle
all these problems, we have a separate method of resolving clashes called separate
chaining. It keeps a distinct linked list for all records whose keys hash into a particular
value. In this method, the items that end with a particular number (units position) are placed
in a particular linked list, as shown in the figure. The 10's and 100's digits are not taken into
account. The pointer in each node points to the next node, and when there are no more
nodes, the pointer holds the NULL value.
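A hedged sketch of this scheme, hashing on the units digit as the text describes (the names chain_insert and chain_search are invented here):

```c
#include <assert.h>
#include <stdlib.h>

#define TABLE_SIZE 10   /* one list per units digit, as in the text */

struct node {
    long key;
    struct node *next;
};

static struct node *chains[ TABLE_SIZE ];   /* all lists start NULL */

/* Insert at the front of the list for the key's hash value */
void chain_insert( long key )
{
    int pos = (int)( key % TABLE_SIZE );
    struct node *n = malloc( sizeof *n );
    n->key = key;
    n->next = chains[pos];
    chains[pos] = n;
}

/* Search only the one list that can contain the key */
int chain_search( long key )
{
    struct node *p;
    for( p = chains[ key % TABLE_SIZE ]; p != NULL; p = p->next )
        if( p->key == key )
            return 1;
    return 0;
}
```

There is no limit on the number of items per list, which is the first advantage listed below.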
ADVANTAGES OF SEPARATE CHAINING
1. No worries of filling up the table, whatever the number of items.
2. The list items need not be in contiguous storage.
3. It allows traversal of items in hash key order.
SITUATION OF HASH CLASH
What would happen if we want to insert a new part number 012345645 in the table?
Using the hash function key % 1000 we get 645. Therefore the part belongs in position
645. However, that position is already occupied by the record with key 011345645.
Therefore the record with key 012345645 must be inserted somewhere else in the table,
resulting in a hash clash. This is illustrated in the given table:
POSITION   KEY         RECORD
0          258001201
2          698321903
3          986453204
.
.
450        256894450
451        158965451
.
.
647        214563647
648        782154648
649        325649649
.
.
997        011239997
998        231452998
999        011232999
DOUBLE HASHING
A method of open addressing for a hash table in which a collision is resolved by
searching the table for an empty place at intervals given by a different hash function, thus
minimizing clustering. Double Hashing is another method of collision resolution, but
unlike the linear collision resolution, double hashing uses a second hashing function that
normally limits multiple collisions. The idea is that if two values hash to the same spot in
the table, a constant can be calculated from the initial value using the second hashing
function that can then be used to change the sequence of locations in the table, but still
have access to the entire table.
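A minimal sketch of double hashing (the prime table size 11, the probe form h1(key) + i*h2(key), and the names dh_init, dh_insert are assumptions of this sketch, not taken from the text):

```c
#include <assert.h>

#define TABLE_SIZE 11   /* a prime table size, chosen for this sketch */
#define EMPTY (-1L)

static long dtable[ TABLE_SIZE ];

void dh_init( void )
{
    int i;
    for( i = 0; i < TABLE_SIZE; i++ )
        dtable[i] = EMPTY;
}

static int h1( long key ) { return (int)( key % TABLE_SIZE ); }

/* Second hash function: never zero, so the probe always advances */
static int h2( long key ) { return 1 + (int)( key % ( TABLE_SIZE - 1 ) ); }

/* Probe at intervals of h2(key) until an empty slot is found */
int dh_insert( long key )
{
    int i, step = h2( key );
    for( i = 0; i < TABLE_SIZE; i++ )
    {
        int p = ( h1( key ) + i * step ) % TABLE_SIZE;
        if( dtable[p] == EMPTY )
        {
            dtable[p] = key;
            return p;
        }
    }
    return -1;   /* no empty slot found */
}
```

Two keys that clash at the same initial position will generally have different step sizes, so their probe sequences diverge instead of clustering.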
CLUSTERING
There are mainly two types of clustering:
Primary clustering
When the entire array is empty, it is equally likely that a record is inserted at any position
in the array. However, once entries have been inserted and several hash clashes have been
resolved, it doesn't remain true. For example, in the table given above, it is five times as
likely for a record to be inserted at position 994 as at position 401. This is because
any record whose key hashes into 990, 991, 992, 993 or 994 will be placed in 994,
whereas only a record whose key hashes into 401 will be placed there. This phenomenon
where two keys that hash into different values compete with each other in successive
rehashes is called primary clustering.
CAUSE OF PRIMARY CLUSTERING
Any rehash function that depends solely on the index to be rehashed causes primary clustering.
WAYS OF ELIMINATING PRIMARY CLUSTERING
One way of eliminating primary clustering is to allow the rehash function to depend on
the number of times that the function is applied to a particular hash value. Another way is
to use a random permutation of the numbers between 1 and e, where e is table size - 1,
the largest index of the table. One more method is to allow the rehash to depend on the
hash value. All these methods allow keys that hash into different locations to follow
separate rehash paths.
SECONDARY CLUSTERING
In this type, different keys that hash to the same value follow the same rehash path.
WAYS TO ELIMINATE SECONDARY CLUSTERING
All types of clustering can be eliminated by double hashing, which involves the use of
two hash functions h1(key) and h2(key). h1 is known as the primary hash function and is
applied first to get the position where the key will be inserted. If that position is already
occupied, the rehash function rh(i, key) = (i + h2(key)) % tablesize is used successively
until an empty position is found. As long as h2(key1) doesn't equal h2(key2), records with
keys key1 and key2 don't compete for the same positions. Therefore one should choose
functions h1 and h2 that distribute the hashes and rehashes uniformly in the table and
also minimize clustering.
DELETING AN ITEM FROM THE HASH TABLE:
It is very difficult to delete an item from a hash table that uses rehashing for search and
insertion. Suppose that a record r is placed at some specific location. A second record r1
that hashes to the same location must then be inserted in the next empty location after
the original one. Suppose that the record r at the original location is then deleted.
Now, when we want to search for the record r1, the location where r was is empty, so the
search will erroneously conclude that the record r1 is absent from the table. One possible
solution to
this problem is that the deleted record must be marked "deleted" rather than "empty" and
the search must continue whenever a "deleted" position is encountered. But this is
possible only when there are small numbers of deletions otherwise an unsuccessful search
will have to search the entire table since most of the positions will be marked "deleted"
rather than "empty".
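The "deleted" marker scheme can be sketched as follows (the tombstone constant DELETED and the names t_init, t_insert, t_search, t_delete are invented for this sketch, which uses linear probing):

```c
#include <assert.h>

#define TABLE_SIZE 10
#define EMPTY   (-1L)
#define DELETED (-2L)   /* tombstone: the slot was occupied once */

static long ttable[ TABLE_SIZE ];

void t_init( void )
{
    int i;
    for( i = 0; i < TABLE_SIZE; i++ )
        ttable[i] = EMPTY;
}

/* Insert may reuse DELETED slots as well as EMPTY ones */
int t_insert( long key )
{
    int i, pos = (int)( key % TABLE_SIZE );
    for( i = 0; i < TABLE_SIZE; i++ )
    {
        int p = ( pos + i ) % TABLE_SIZE;
        if( ttable[p] == EMPTY || ttable[p] == DELETED )
        {
            ttable[p] = key;
            return p;
        }
    }
    return -1;
}

/* Search continues past DELETED slots and stops only at EMPTY */
int t_search( long key )
{
    int i, pos = (int)( key % TABLE_SIZE );
    for( i = 0; i < TABLE_SIZE; i++ )
    {
        int p = ( pos + i ) % TABLE_SIZE;
        if( ttable[p] == EMPTY )
            return 0;
        if( ttable[p] == key )
            return 1;
    }
    return 0;
}

void t_delete( long key )
{
    int i, pos = (int)( key % TABLE_SIZE );
    for( i = 0; i < TABLE_SIZE; i++ )
    {
        int p = ( pos + i ) % TABLE_SIZE;
        if( ttable[p] == key )
        {
            ttable[p] = DELETED;   /* mark, don't empty */
            return;
        }
        if( ttable[p] == EMPTY )
            return;
    }
}
```

Because search skips over tombstones instead of stopping, a record inserted past a later-deleted clash is still found.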
Part A Questions
1. "Chaining is a much better resolution technique than open addressing."
Justify the statement.
2. Briefly explain collision resolution by Open Addressing. Explain its
disadvantages.
3. Which problem occurs during Quadratic Rehash? How can it be avoided?
4. Write a short note on the Mid Square method adopted to calculate the address of
the record.
5. Explain the division method to obtain the address of the record.
6. Briefly explain the multiplicative method used to obtain the address of a
record.
7. Summarize the hash table organization in terms of advantages and
disadvantages.
8. What is Universal Hashing? What should be the correct table size m in the
Division Method used in Hashing?
9. a) Explain the hashing function "Digit analysis".
b) What should be done when a key is large?
10. Briefly explain the folding method used to calculate the address for a given
record. What is the major drawback of the hashing functions?
11. What is Direct Addressing? When is it used?
12. What is the advantage of using Hash Function over Direct Addressing?
Explain.
13. When does a 'collision' occur? What are the methods to resolve it? Explain
giving examples.
14. Explain the 'division method' for creating hash functions.
15. Consider a version of the division method in which h(k) = k mod m, where
m = 2^p - 1 and k is a character string interpreted in radix 2^p.
Show that if string x can be derived from string y by permuting its
characters, then x and y hash to the same value.
16. What is 'universal hashing' and how will you design a universal class of hash
functions?
17. Differentiate between linear probing and quadratic probing.
18. Insert the keys 10, 22, 31, 4, 15, 28, 17, 88, 59 into a hash table of length m = 11
using open addressing with the primary hash function h'(k) = k mod m.
Illustrate the result of inserting the keys using linear probing, using quadratic
probing with c1 = 1 and c2 = 3, and using double hashing with
h2(k) = 1 + (k mod (m - 1)).
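The three tables asked for in question 18 can be checked mechanically. The sketch below is illustrative (the helper `insert_all` and the lambda names are our own); each probe function follows the scheme named in the question.

```python
def insert_all(keys, m, probe):
    """Insert keys into a size-m open-addressing table; probe(k, i)
    gives the i-th slot tried for key k."""
    table = [None] * m
    for k in keys:
        for i in range(m):
            j = probe(k, i)
            if table[j] is None:
                table[j] = k
                break
    return table

m = 11
keys = [10, 22, 31, 4, 15, 28, 17, 88, 59]
h  = lambda k: k % m                      # primary hash h'(k) = k mod m
h2 = lambda k: 1 + (k % (m - 1))          # secondary hash for double hashing

linear    = insert_all(keys, m, lambda k, i: (h(k) + i) % m)
quadratic = insert_all(keys, m, lambda k, i: (h(k) + 1 * i + 3 * i * i) % m)
double    = insert_all(keys, m, lambda k, i: (h(k) + i * h2(k)) % m)
```

Under these conventions the linear-probing table comes out, by index 0..10, as 22, 88, -, -, 4, 15, 28, 17, 59, 31, 10.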
PART B Questions
1. Explain any two techniques to overcome hash collisions.
Separate chaining
Example
Explanation
Coding
Open addressing
Linear probing
Quadratic probing
2. Explain the implementation of different hashing techniques.
3. Given input {4371, 1323, 6173, 4199, 4344, 9679, 1989} and a hash function
h(x) = x mod 10, show the resulting:
(a) separate chaining table (4)
(b) open addressing hash table using linear probing (4)
(c) open addressing hash table using quadratic probing (4)
(d) open addressing hash table with second hash function h2(x) =7-(x mod 7).
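All four tables from question 3 can likewise be built programmatically. The sketch below uses our own helper names; note the textbook point it reveals: with table size 10, which is not prime, double hashing with h2(x) = 7 - (x mod 7) fails to place 1989 at all, because its probe sequence 9, 5, 1, 7, 3 cycles over already-filled slots.

```python
def chain_table(keys, m):
    """Separate chaining: one list per slot, keys appended in order."""
    chains = [[] for _ in range(m)]
    for k in keys:
        chains[k % m].append(k)
    return chains

def open_address(keys, m, probe):
    """Open addressing; returns the table plus any keys that could
    not be placed within m probes."""
    table = [None] * m
    failed = []
    for k in keys:
        for i in range(m):
            j = probe(k, i)
            if table[j] is None:
                table[j] = k
                break
        else:
            failed.append(k)   # probe sequence never found an empty slot
    return table, failed

keys = [4371, 1323, 6173, 4199, 4344, 9679, 1989]
m = 10
h  = lambda x: x % m
h2 = lambda x: 7 - (x % 7)

chains     = chain_table(keys, m)
linear, _  = open_address(keys, m, lambda x, i: (h(x) + i) % m)
quad, _    = open_address(keys, m, lambda x, i: (h(x) + i * i) % m)
dbl, stuck = open_address(keys, m, lambda x, i: (h(x) + i * h2(x)) % m)
```

The failure of double hashing here is the standard argument for choosing a prime table size: a prime m guarantees every probe step visits all slots.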
Objective Type Questions
1. Name the function which, when applied to the key, produces an integer which can be
used as an address in a hash table.
a) Greatest Integer Function
b) Modulus function
c) Signum Function
d) Hash Function
2) What is the phenomenon that occurs when a hash function
maps 2 different keys to the same table address?
a) Collision
b) Overlaps
c) Nothing Happens
d) Rebound
3) What is the name of the collision resolution technique in which the next slot in the table is
checked on a collision?
a) Chaining
b) Linear probing
c) Quadratic Probing
d) Clustering
4) Name the hashing scheme in which a second-order function of the hash index is used to
calculate the address.
a) Quadratic Probing
b) Sequential Search
c) Linear Probing
d) Chaining
5) What is the name of the collision resolution technique in which colliding records are
chained into a special overflow area which is separate from the prime area?
a) Chaining
b) Mid Square Method
c) Folding Method
d) Open Addressing
6) Which number corresponds to the mapping of 72963195 according to the folding method?
a) 455
b) 1455
c) 451
d) 456
7) What will be the address of the key 12345 according to the mid-square
method if the digits at positions 5 and 6 are taken into account?
a) 21
b) 91
c) 99
d) 19
8) Which type of address is obtained by a perfect hash function?
a) Odd
b) Unique
c) Even
d) No Address is obtained
9) What is the name of the collision sequences generated by addresses calculated with the
help of quadratic probing?
a) Rings
b) Stacks
c) Secondary Clustering
d) Linked Lists
10) Name the technique used to avoid secondary clustering.
a) Quadratic Probe
b) Double Hashing
c) Linear Probe
d) Open addressing
Answers:
1. Answer: d) Hash Function
Explanation: A hash function performs a key-to-address transformation, i.e. when applied to
a key it produces an integer which can be used as an address in a hash table.
2. Answer: a) Collision
Explanation: When a hash function maps 2 different keys to the same table address, a
collision is said to occur. Suppose that 2 keys K1 and K2 map to the same address. When
a record with the key K1 is entered, it is inserted at the hashed address. When
another record with key K2 is entered, K2 hashes to the same address, which is already
occupied by the record with key K1. This results in a hash collision.
3. Answer: b) Linear Probing
Explanation: Linear probing comes under the category of collision resolution techniques
called open addressing. The address in the hash table is calculated using the hash
function; if that position is not empty, the next slot of the table is checked,
and the record is inserted in the first empty position found.
4. Answer: a) Quadratic Probing
Explanation: Quadratic probing is a rehashing scheme in which a second-order
function of the hash index is used to calculate the address. If there is a collision at the
hash address h, then this method probes the array at locations h+1, h+4, h+9, and so on.
5. Answer: a) Chaining
Explanation: Chaining is a collision resolution technique in which colliding records are
chained into a special overflow area which is separate from the prime area. The prime
area contains the part of the array into which the records are initially
hashed. Each record in the prime area contains a pointer field used to maintain a separate
linked list for each set of colliding records in the overflow area.
6. Answer: a) 455
Explanation: In the folding method a key is broken into several parts. Each part has
the same number of digits as the required address, except possibly the last part. The parts
are added together and the final carry is ignored. Therefore, 72963195 maps to 729 + 631
+ 95 = 1455, which gives 455 after ignoring the final carry of 1.
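The folding computation above can be expressed as a short sketch (the function name is illustrative; conventions for folding vary slightly between texts, and this one splits the digits from the left into fixed-width groups):

```python
def fold_hash(key, width):
    """Folding method: split the key's digits into groups of `width`
    digits (from the left), add the groups, and drop any carry beyond
    `width` digits."""
    s = str(key)
    parts = [int(s[i:i + width]) for i in range(0, len(s), width)]
    return sum(parts) % (10 ** width)   # taking mod 10^width drops the carry
```

For the example in the answer, fold_hash(72963195, 3) splits the key into 729, 631 and 95, sums them to 1455, and drops the carry to give 455.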
7. Answer: c) 99
Explanation: In the mid-square method a key is multiplied by itself and the hash value is
obtained by selecting an appropriate number of digits from the middle of the square.
For the key 12345, the square of the key is 152399025. If we choose the digits at the 5th and
6th positions for the address, then the address is 99.
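The mid-square computation can be checked directly; note that 12345 squared is 152399025. The helper below is an illustrative sketch (texts differ on how the middle digits are selected; here positions are counted 1-based from the left):

```python
def mid_square(key, positions):
    """Mid-square method: square the key and pick the digits at the
    given 1-based positions, counted from the left of the square."""
    digits = str(key * key)
    return int("".join(digits[p - 1] for p in positions))
```

With this convention, mid_square(12345, (5, 6)) reads the 5th and 6th digits of 152399025, which are 9 and 9, giving the address 99.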
8. Answer: b) Unique
Explanation: A perfect hash function is a function which, when applied to all the
members of the set of items to be stored in a hash table, produces a unique set of
integers within some suitable range.
9. Answer: c) Secondary Clustering
Explanation: Secondary clustering refers to the collision sequences generated by addresses
calculated with quadratic probing: different keys that hash to the same
value follow exactly the same rehash sequence, and so collide with each other repeatedly.
10. Answer: b) Double Hashing
Explanation: Double hashing can be used to avoid secondary clustering. Suppose h1 is a
hash function which generates the same address i for 2 different
keys x1 and x2, and we have a second function h2 which generates different
increments for x1 and x2. We can then use h1 as the primary hash function and h2 as the
rehash function. The functions h1 and h2 should be chosen so as to distribute the hashes and
rehashes uniformly over the array indices and to minimize clustering.
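The point above can be demonstrated concretely: two keys that collide under h1 follow different probe sequences once h2 is applied. A minimal sketch, with an assumed table size m = 11 and the common choice h2(k) = 1 + (k mod (m - 1)):

```python
m = 11
h1 = lambda k: k % m
h2 = lambda k: 1 + (k % (m - 1))   # never 0, so the probe always advances

def probe_sequence(k, trials=4):
    """First few slots probed for key k under double hashing."""
    return [(h1(k) + i * h2(k)) % m for i in range(trials)]

# Keys 12 and 23 both collide under h1 (h1 maps each to 1), but
# h2(12) = 3 while h2(23) = 4, so their rehash sequences diverge
# immediately -- this is what breaks secondary clustering.
```

By contrast, under quadratic probing alone both keys would probe 1, 2, 5, 10, ... in lockstep.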
11.What is the best definition of a collision in a hash table?
o A. Two entries are identical except for their keys.
o B. Two entries with different data have the exact same key.
o C. Two entries with different keys have the same exact hash value.
o D. Two entries with the exact same key have different hash values.
12. Which guideline is NOT suggested by empirical or theoretical studies of
hash tables:
o A. Hash table size should be the product of two primes.
o B. Hash table size should be the upper of a pair of twin primes.
o C. Hash table size should have the form 4K+3 for some K.
o D. Hash table size should not be too near a power of two.
13. In an open-address hash table there is a difference between those spots which
have never been used and those spots which have previously been used but no
longer contain an item. Which function has a better implementation because of
this difference?
a. A. insert
b. B. is_present
c. C. remove
d. D. size
e. E. Two or more of the above functions
14. What kind of initialization needs to be done for an open-address hash table?
a. A. None.
b. B. The key at each array location must be initialized.
c. C. The head pointer of each chain must be set to NULL.
d. D. Both B and C must be carried out.
15. What kind of initialization needs to be done for a chained hash table?
a. A. None.
b. B. The key at each array location must be initialized.
c. C. The head pointer of each chain must be set to NULL.
d. D. Both B and C must be carried out.
16. A chained hash table has an array size of 512. What is the maximum number of
entries that can be placed in the table?
a. A. 256
b. B. 511
c. C. 512
d. D. 1024
e. E. There is no maximum.
17. Suppose you place m items in a hash table with an array size of s. What is the
correct formula for the load factor?
a. A. s + m
b. B. s - m
c. C. m - s
d. D. m * s
e. E. m / s
18. The constructor of the Hashtable class initializes data members and creates the
Hashtable.
o True
o False
19. Hash tables are very good if there are many insertions and deletions.
o True
o False