Cellular partition by using hash org.

8/8/2019

Bharati Vidyapeeth's Institute of Computer Applications and Management, New Delhi-63

DATA STRUCTURE FILES

UNIT IV


Learning Objectives

Hashing

Indexing Techniques

File Organization


Hashing




Hashing

Technique for performing insertion, deletion and search in constant average time

Ordering of elements is not supported efficiently

Keys are mapped onto a number between 0 and TableSize-1

Mapping is done on the basis of a function called the hash function


Hash Function

Transforms a key into a cell/bucket address

Must be simple to compute

Should ensure that distinct keys get distinct cells

Not possible in all cases as the number of keys increases

Leads to collisions (multiple keys map to the same hash value)

So choose a function that leads to an even distribution of keys


Considerations

Which hash function to use

How to respond to collisions




Common hash functions

Mod

Mid Square

Folding

Digit Analysis
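A minimal sketch of three of the functions listed above, assuming integer keys. The function names, the default of two middle digits, and the part length used for folding are illustrative assumptions, not taken from the slides. Digit analysis is left out because it depends on statistics of the whole key set.

```python
def mod_hash(key: int, table_size: int) -> int:
    """Mod (division) method: remainder of the key divided by the table size."""
    return key % table_size


def mid_square_hash(key: int, digits: int = 2) -> int:
    """Mid-square method: square the key and keep `digits` middle digits.

    The result lies between 0 and 10**digits - 1, so the table is assumed
    to have 10**digits cells.
    """
    squared = str(key * key)
    start = max((len(squared) - digits) // 2, 0)
    return int(squared[start:start + digits])


def folding_hash(key: int, part_len: int, table_size: int) -> int:
    """Folding method: cut the key's digits into parts, add the parts,
    then reduce the sum modulo the table size."""
    text = str(key)
    parts = [int(text[i:i + part_len]) for i in range(0, len(text), part_len)]
    return sum(parts) % table_size


print(mod_hash(54, 10))              # 54 % 10
print(mid_square_hash(1234))         # middle digits of 1234 * 1234 = 1522756
print(folding_hash(123456, 2, 100))  # (12 + 34 + 56) % 100
```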


Collision Resolution

Open address hashing

Linear Probing

Quadratic Probing

Double Hashing

Separate Chaining

Rehashing


Open Addressing

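In case of a collision, alternate cells are tried until an empty cell is found.

Cell $h_i(X) = (\mathrm{Hash}(X) + F(i)) \bmod \mathrm{TableSize}$, given $F(0) = 0$

For linear probing, $F(i)$ is a linear function of $i$, typically $F(i) = i$

For quadratic probing, $F(i)$ is a quadratic function of $i$, typically $F(i) = i^2$

For double hashing, $F(i) = i \cdot \mathrm{Hash}_2(X)$, where $\mathrm{Hash}_2$ is a second hash function other than the one chosen originally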
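The three choices of F(i) can be compared by listing the cells each one would try for the same key. This is only a sketch: the primary hash is assumed to be key mod table size, and the second hash for double hashing, 7 - (key mod 7), is one common textbook choice rather than anything fixed by the slides.

```python
def probe_sequence(key: int, table_size: int, strategy: str, attempts: int = 4):
    """Return the first `attempts` cells tried for `key` under open addressing."""
    home = key % table_size   # assumed primary hash: Hash(X) = X mod TableSize
    step = 7 - (key % 7)      # assumed second hash, never zero
    offsets = {
        "linear": lambda i: i,          # F(i) = i
        "quadratic": lambda i: i * i,   # F(i) = i^2
        "double": lambda i: i * step,   # F(i) = i * Hash2(X)
    }
    f = offsets[strategy]
    return [(home + f(i)) % table_size for i in range(attempts)]


for name in ("linear", "quadratic", "double"):
    print(name, probe_sequence(32, 10, name))
```

In every case the first cell tried is the home cell, because F(0) = 0.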




Linear Probing

For a table large enough in size to hold all the keys, free space will always be found

Though the time required may be large

Drawback: blocks of occupied cells might get formed (PRIMARY CLUSTERING)

i.e. a key that hashes into a cluster will require several attempts to resolve the collision


Linear Probing

Consider a hash table with 10 slots.

Say the keys to be inserted are 12, 30, 11, 32, 34, 54, 50

The hash function is mod 10

This divisor is chosen just for illustration and is not a good choice: at most 10 distinct cells get generated, thus collisions will be frequent.

The divisor should preferably be a prime number

Stages of insertion are illustrated on the following slides


Linear Probing: Illustration

Add 12 on cell 12%10 = 2

Add 30 on cell 30%10 = 0

Add 11 on cell 11%10 = 1

Try to add 32 on cell 32%10 = 2; not available; try next: cell 3

Add 34 on cell 34%10 = 4

Try to add 54 on cell 54%10 = 4; not available; try next: cell 5

Try to add 50 on cell 50%10 = 0; not available; try next cells until an empty cell is found: cell 6

Resulting table:

Cell  0   1   2   3   4   5   6   7   8   9
Key   30  11  12  32  34  54  50  -   -   -
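The insertion stages above can be replayed with a short sketch. The function name `insert_linear` and the use of `None` for an empty cell are assumptions made for illustration.

```python
def insert_linear(table: list, key: int) -> int:
    """Insert `key` with linear probing (F(i) = i); return the cell used."""
    size = len(table)
    home = key % size                 # hash function: key mod table size
    for i in range(size):
        cell = (home + i) % size      # wrap around at the end of the table
        if table[cell] is None:       # None marks an empty cell
            table[cell] = key
            return cell
    raise RuntimeError("hash table is full")


table = [None] * 10
placed = [insert_linear(table, k) for k in [12, 30, 11, 32, 34, 54, 50]]
print(placed)  # cell chosen for each key, in insertion order
print(table)   # final contents of cells 0 to 9
```

Key 50 needs seven attempts because cells 0 to 5 already form one cluster, which is the primary clustering effect described earlier.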




Quadratic Probing

Similar treatment can be given when collisions occur in case of quadratic probing.

Here, instead of choosing the next cell that lies after the ideal cell (or a cell given by a linear function of i), a new cell number given by some quadratic function of i is chosen.
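As a sketch, the same keys used in the linear probing example can be inserted with the quadratic function assumed to be F(i) = i^2. The function name and the `None` marker for empty cells are illustrative choices. Note that with a non-prime table size such as 10, a quadratic probe sequence does not visit every cell, so an insertion can fail even though empty cells remain.

```python
def insert_quadratic(table: list, key: int) -> int:
    """Insert `key` with quadratic probing (assumed F(i) = i*i); return the cell used."""
    size = len(table)
    home = key % size
    for i in range(size):
        cell = (home + i * i) % size
        if table[cell] is None:
            table[cell] = key
            return cell
    raise RuntimeError("no empty cell found on the probe sequence")


table = [None] * 10
placed = [insert_quadratic(table, k) for k in [12, 30, 11, 32, 34, 54, 50]]
print(placed)
print(table)
```

Here key 50 tries cells 0, 1 and 4 before landing in cell 9, jumping over the cluster instead of walking through it.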


Separate Chaining

Maintains a list of all the keys that hash to the same value

To insert:

Calculate the hash function

Access the corresponding list

Add a link to the list

i.e. A link is added in case of a collision

The new key might be added at either end of the list

Better for large-sized records; handles collisions and overflow efficiently.

Not as efficient when the record size is small or the domain of key values is limited to a small number of entries


Separate Chaining: Illustration

Insert sequence: 22, 42, 30, 43, 10

Keys are placed by key mod 10, as in the linear probing example; only cells 0 to 3 are shown.

Cell  Chain
0     30, 10
1     (empty)
2     22, 42
3     43
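A minimal sketch of separate chaining for the insert sequence above. It assumes a table of 10 cells with key mod 10 as the hash function, Python lists standing in for linked lists, and new keys appended at the tail of a chain; none of these details are fixed by the slides.

```python
def insert_chained(table: list, key: int) -> None:
    """Append `key` to the chain of the cell it hashes to."""
    table[key % len(table)].append(key)


def search_chained(table: list, key: int) -> bool:
    """Only the chain of the key's own cell needs to be scanned."""
    return key in table[key % len(table)]


table = [[] for _ in range(10)]   # one empty chain per cell
for k in [22, 42, 30, 43, 10]:
    insert_chained(table, k)

print(table[:4])                  # chains of cells 0 to 3
print(search_chained(table, 42), search_chained(table, 52))
```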




Rehashing

When the table gets too full, the number of collisions increases, resulting in a degradation in performance while inserting as well as searching

Build another hash table with size ~ 2*OldSize

Scan the original table; for each entry, compute the new hash value

Insert it in the new hash table

Rehashing is costly and thus should not be done very frequently.


Rehashing: Illustration

Consider the hash table as given in the figure (the table produced by the linear probing example, hash function mod 10):

Cell  0   1   2   3   4   5   6   7   8   9
Key   30  11  12  32  34  54  50  -   -   -

The keys to be inserted are 12, 30, 11, 32, 34, 54, 50



Rehashing

New table size 19


[Figure: new table with cells 0 to 18 holding the rehashed keys 12, 30, 11, 32, 34, 54, 50]
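The rehashing steps can be sketched as follows. This assumes the old table is the linear probing result for the keys above, that the new table also uses linear probing with key mod 19, and that the old table is scanned from cell 0 upwards. The exact cell positions in the original figure are not recoverable, so the layout printed here is what this sketch produces, not a transcription of the slide.

```python
def insert_linear(table: list, key: int) -> int:
    """Linear probing insert; returns the cell used."""
    size = len(table)
    for i in range(size):
        cell = (key % size + i) % size
        if table[cell] is None:
            table[cell] = key
            return cell
    raise RuntimeError("hash table is full")


def rehash(old_table: list, new_size: int) -> list:
    """Build a new table and re-insert every key using the new table size."""
    new_table = [None] * new_size
    for key in old_table:          # scan the original table cell by cell
        if key is not None:
            insert_linear(new_table, key)
    return new_table


old_table = [30, 11, 12, 32, 34, 54, 50, None, None, None]
new_table = rehash(old_table, 19)
print(new_table)
```

Every key is hashed again with the new divisor, which is why rehashing costs time proportional to the number of stored keys.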




Indexing Techniques


Indexing Techniques

Cylinder Surface Indexing

Hashed Indexing

Tree Indexing



Cylinder Surface Indexing

Used for the primary key index in sequential file organization

Assumes records are stored in increasing order of primary key

Index consists of a CYLINDER INDEX plus a SURFACE INDEX for each cylinder





Cylinder Surface Indexing

If a data file takes up c cylinders, the cylinder index (CI) has c entries

Each CI entry contains

{CYLINDER_NO, largest key on the cylinder}

Each entry of the surface index (SI) of the i-th cylinder contains:

{SURFACE_NO, largest key on this surface of the i-th cylinder}



Searching a record (ISAM)

Read the cylinder index into memory

Locate the cylinder number that possibly contains the record

Read the surface index of the corresponding cylinder

Find the surface (reduced to a track) that may contain the record

Search the track sequentially
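The search steps above can be sketched with in-memory lists standing in for the disk. The layout is an assumption made for illustration: `cylinder_index[c]` holds the largest key on cylinder c, `surface_indexes[c][s]` the largest key on surface s of cylinder c, and `tracks[c][s]` the sorted keys stored on that track.

```python
from bisect import bisect_left


def isam_search(cylinder_index, surface_indexes, tracks, key):
    """Return (cylinder, surface) of the track holding `key`, or None."""
    # Step 1-2: first cylinder whose largest key is >= key
    cyl = bisect_left(cylinder_index, key)
    if cyl == len(cylinder_index):
        return None                      # key is larger than every stored key
    # Step 3-4: first surface of that cylinder whose largest key is >= key
    surface_index = surface_indexes[cyl]
    surf = bisect_left(surface_index, key)
    if surf == len(surface_index):
        return None
    # Step 5: sequential search of the track
    for record_key in tracks[cyl][surf]:
        if record_key == key:
            return (cyl, surf)
    return None


# Hypothetical file: 2 cylinders, 2 surfaces each, 2 records per track
tracks = [[[3, 8], [12, 15]], [[21, 27], [30, 42]]]
surface_indexes = [[8, 15], [27, 42]]
cylinder_index = [15, 42]

print(isam_search(cylinder_index, surface_indexes, tracks, 27))
print(isam_search(cylinder_index, surface_indexes, tracks, 16))
```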


Hashed Indexing

Maintains a hash table of key values along with the corresponding record addresses

The set of hash functions and overflow handling techniques: as discussed in hashing

In case of linear probing, seek time is less as overflow buckets/cells are adjacent

In case of separate chaining, special buffer space is allocated for expansion of buckets; thus little or no additional seek time is required

Seek time is maximum in case of random or quadratic probing




Tree Indexing

Indexing using balanced trees of order $m$

Discussed before as B-trees and B+ trees

Maximum number of keys in a tree with $l$ levels: $m^l - 1$

Let number of keys = $N$

Number of failure nodes (nodes that one could reach while looking for a key that doesn't exist in the tree) = $N + 1$

= number of nodes at level $l + 1$

$\ge 2 \lceil m/2 \rceil^{l-1}$

Thus, $N \ge 2 \lceil m/2 \rceil^{l-1} - 1$


Tree indexing

Consider a B-tree of order $m = 200$

Say $N = 2 \times 10^6$; then $N \ge 2 \lceil m/2 \rceil^{l-1} - 1$

i.e. $2 \times 10^6 \ge 2 \lceil 200/2 \rceil^{l-1} - 1$

We get

$10^6 \ge 100^{l-1}$

$6 \ge 2(l-1)$

$l \le 4$
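The bound can also be checked numerically. The sketch below finds the largest number of levels l for which the minimum key count of an l-level B-tree, 2 * ceil(m/2)^(l-1) - 1, does not exceed N. The function name `max_levels` is an illustrative assumption.

```python
from math import ceil


def max_levels(order: int, num_keys: int) -> int:
    """Largest l with 2 * ceil(order/2)**(l-1) - 1 <= num_keys (at least 1)."""
    half = ceil(order / 2)
    levels = 1
    # Move to the next level while a tree with one more level could still
    # be built from num_keys keys.
    while 2 * half ** levels - 1 <= num_keys:
        levels += 1
    return levels


print(max_levels(200, 2 * 10**6))      # N = 2 * 10^6
print(max_levels(200, 2 * 10**6 - 2))  # two keys fewer
```

With two keys fewer than 2 * 10^6, a fourth level is no longer possible, which shows how sharp the inequality is.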




FILE ORGANIZATION


File Organization

Sequential File Organization

Random File Organization

Inverted Files

Cellular Files


Sequential File Organization

ISAM is the most popular sequential file organization

Cylinder surface index is maintained for the primary key (PK).

Makes search based on PK efficient

Search based on other attributes requires use of an alternate indexing technique

Insertion and deletion are time consuming

Batch processes and Range queries are executed efficiently




Random File Organization

Records are stored at random locations

Techniques used for randomization

Direct Addressing

Directory Lookup

Hashed File Organization


Direct Addressing

Available disk space is divided into nodes large enough to hold a record

Numeric value of the PK determines the node number where the insertion is to be made (1 disk access for read)

Good for fixed-length records and high identifier density (current keys / size of key domain).

In case of variable-length records, pointers to the actual locations on disk are maintained (2 disk accesses for read)


Directory Lookup

Like DA for variable-length records, the index maintains key values and pointers to disk addresses

Unlike DA for variable-length records, available space is utilized efficiently as the existing keys are stored contiguously

Searching requires multiple disk accesses as the index needs to be searched first




Hashed File Organization

Uses same principle as hashed indexes

Available file space is divided into cells/buckets/slots

Some space is set aside for overflow in case of chaining


Inverted Files

Index contains the link information

Index structure is most important

Stores index values and related record addresses

Records may be stored using any organization

Actual records may do away with storage of key values.


Inverted Files

E# Index

E#    Record address
100   A
101   D
110   E
200   B
220   C
340   F

Occupation Index

Programmer   A, B, D
Analyst      C, E

Gender Index

Female   B, C, D
Male     A, E




Inverted Files

Searching becomes efficient as the addresses associated with a key value are available as a list

Combinations of conditions can be carried out using simple list operations like union, intersection, subtraction etc.
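As a sketch, the occupation and gender indexes from the illustration can be held as dictionaries of address sets, so that combined conditions become set operations. Using Python sets in place of address lists is an assumption made for brevity.

```python
# Inverted indexes: key value -> set of record addresses (from the illustration)
occupation_index = {"Programmer": {"A", "B", "D"}, "Analyst": {"C", "E"}}
gender_index = {"Female": {"B", "C", "D"}, "Male": {"A", "E"}}

# Programmer AND Female: intersection
female_programmers = occupation_index["Programmer"] & gender_index["Female"]

# Analyst OR Male: union
analysts_or_males = occupation_index["Analyst"] | gender_index["Male"]

# Programmer AND NOT Female: subtraction
programmers_not_female = occupation_index["Programmer"] - gender_index["Female"]

print(sorted(female_programmers))
print(sorted(analysts_or_males))
print(sorted(programmers_not_female))
```

Only the indexes are touched while evaluating the conditions; the records themselves are read once the final address list is known.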


Cellular Partitions

Storage media is divided into cells

A cell could be

A disk pack; or

A cylinder

Lists of a given key value are divided into sub-lists such that each sub-list occupies a single cell.

The index entries now contain the starting address of each sub-list and the number of records in this list.


Cellular Partition

In case a cell is a cylinder, all the records placed in one cell can be accessed without moving the read/write head

In case a cell is a disk pack, several cells can be searched in parallel.
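A minimal sketch of how the list of one key value could be split into per-cell sub-lists. The representation is hypothetical: each record address is a (cell, offset) pair, and an index entry stores the cell, the starting address of the sub-list, and the number of records in it.

```python
from itertools import groupby


def cellular_index(addresses):
    """Split the address list of one key value into one sub-list per cell.

    Returns one index entry per cell: the starting address of the sub-list
    and the number of records in it.
    """
    entries = []
    # Sorting brings addresses of the same cell together, in offset order.
    for cell, group in groupby(sorted(addresses), key=lambda addr: addr[0]):
        sublist = list(group)
        entries.append({"cell": cell, "start": sublist[0], "count": len(sublist)})
    return entries


# Hypothetical addresses of all records with one key value
addresses = [(2, 5), (1, 3), (2, 1), (1, 7), (3, 0)]
entries = cellular_index(addresses)
for entry in entries:
    print(entry)
```

Because each entry refers to a single cell, sub-lists that live on different disk packs could be handed to separate searches that run in parallel.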




What we Studied

Hashing

Indexing Techniques

File Organization


Review Questions

1. What are the criteria behind the design of a hash function?

2. What are the various ways to store graphs in memory?

3. Discuss the applications of hash tables. Write a short note on the symbol table.

4. Compare sequential and random file organization.

5. What are the advantages of using inverted files?

6. Would you use quadratic probing for resolving collisions in hashed index files? State reasons.

7. Write a short note on the structure of a direct file.

8. Give a comparison between sequential file, indexed sequential file and random access file.

9. Write a short note on open address hashing and separate chaining.

10. Discuss random file organization and the various techniques used for randomization.

11. Explain various techniques for overflow/collision resolution in case of hashing.


