cellular partition by using hash org.

  • 8/8/2019 cellular partition by using hash org.

    Bharati Vidyapeeths Institute of Computer Applications and Management, New Delhi-63 1

    DATA STRUCTURE FILES

    UNIT IV

    Learning Objectives

    Hashing

    Indexing Techniques

    File Organization

    Hashing

    Hashing

    Technique for performing insertion, deletion, and search in constant average time

    Ordering of elements is not supported efficiently

    Keys are mapped onto a number between 0 and TableSize-1

    Mapping is done by a function called the hash function

    Hash Function

    Transforms a key into a cell/bucket address

    Must be simple to compute

    Should ensure that distinct keys get distinct cells

    Not possible in all cases as the number of keys increases

    Leads to collisions (multiple keys map to the same hash value)

    So choose a function that distributes keys evenly

    Considerations

    Which hash function to use

    How to respond to collisions

    Common hash functions

    Mod

    Mid Square

    Folding

    Digit Analysis
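
A minimal Python sketch of three of the functions listed above (mod, mid square, folding), for non-negative integer keys. The function names, the two-digit middle window, and the two-digit folding groups are illustrative assumptions, not taken from the slides. Digit analysis is left out because it depends on statistics of the whole key set.

```python
def mod_hash(key: int, table_size: int) -> int:
    """Mod (division) method: remainder of the key divided by the table size."""
    return key % table_size


def mid_square_hash(key: int, table_size: int, digits: int = 2) -> int:
    """Mid square method: square the key and keep `digits` middle digits."""
    squared = str(key * key)
    start = max((len(squared) - digits) // 2, 0)
    middle = squared[start:start + digits]
    # Reduce once more so the result always fits the table.
    return int(middle) % table_size


def folding_hash(key: int, table_size: int, group: int = 2) -> int:
    """Folding method: split the decimal digits into groups and add the groups."""
    text = str(key)
    parts = [int(text[i:i + group]) for i in range(0, len(text), group)]
    return sum(parts) % table_size
```

For example, `folding_hash(123456, 100)` adds 12 + 34 + 56 = 102 and reduces it to cell 2.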

    Collision Resolution

    Open address hashing

    Linear Probing

    Quadratic Probing

    Double Hashing

    Separate Chaining

    Rehashing

    Open Addressing

    In case of a collision, alternate cells are tried until an empty cell is found.

    Cell $h_i(X) = (\mathrm{Hash}(X) + F(i)) \bmod \mathrm{TableSize}$, given $F(0) = 0$

    For linear probing, $F(i)$ is a linear function of $i$;

    For quadratic probing, $F(i)$ is a quadratic function of $i$, e.g. $F(i) = i^2$;

    For double hashing, $F(i)$ depends on a second hash function, different from the one chosen originally, e.g. $F(i) = i \cdot \mathrm{Hash}_2(X)$.
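
The three choices of F(i) can be compared side by side with a short Python sketch. It assumes Hash(X) = X mod TableSize and wraps each probe around with mod TableSize, which is the usual convention. The concrete F(i) = i, F(i) = i^2, and second hash 7 - (X mod 7) are illustrative assumptions; the slides only fix the general form.

```python
def probe_sequence(key, table_size, f, attempts):
    """Return the first `attempts` cells h_i = (key % size + f(i, key)) % size."""
    home = key % table_size
    return [(home + f(i, key)) % table_size for i in range(attempts)]


def linear(i, key):
    return i                      # F(i) = i


def quadratic(i, key):
    return i * i                  # F(i) = i^2


def double(i, key):
    return i * (7 - key % 7)      # F(i) = i * Hash2(key), Hash2 assumed


linear_cells = probe_sequence(32, 10, linear, 4)
quadratic_cells = probe_sequence(32, 10, quadratic, 4)
double_cells = probe_sequence(32, 10, double, 4)
```

For key 32 in a 10-cell table, the three sequences start at the same home cell 2 and then diverge: 2, 3, 4, 5 for linear, 2, 3, 6, 1 for quadratic, and 2, 5, 8, 1 for this double hashing choice.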

    Linear Probing

    For a table large enough to hold all the keys, a free cell will always be found,

    though the time required to find it may be large

    Drawback: blocks of occupied cells might form (PRIMARY CLUSTERING),

    i.e. a key that hashes into a cluster will require several attempts to resolve the collision

    Linear Probing

    Consider a hash table with 10 slots.

    Say the keys to be inserted are 12, 30, 11, 32, 34, 54, 50

    The hash function is mod 10

    This divisor is chosen just for illustration and is not a good choice: at most 10 distinct cells are generated, each determined only by the last digit of the key, so collisions will be frequent.

    The divisor should preferably be a prime number

    Stages of insertion are illustrated on the following slides

    Linear Probing: Illustration

    Add 12 at cell 12 % 10 = 2

    Add 30 at cell 30 % 10 = 0

    Add 11 at cell 11 % 10 = 1

    Try to add 32 at cell 32 % 10 = 2; not available; try the next cell: 32 goes to cell 3

    Add 34 at cell 34 % 10 = 4

    Try to add 54 at cell 54 % 10 = 4; not available; try the next cell: 54 goes to cell 5

    Try to add 50 at cell 50 % 10 = 0; not available; keep trying the next cell until an empty one is found: 50 goes to cell 6

    Cell:  0   1   2   3   4   5   6   7   8   9
    Key:   30  11  12  32  34  54  50  -   -   -
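
The insertion stages above can be replayed with a small Python sketch. The function name and the use of `None` for an empty cell are assumptions made for illustration.

```python
def linear_probe_insert(table, key):
    """Insert `key` with hash key % len(table), stepping to the next cell on collision."""
    size = len(table)
    home = key % size
    for i in range(size):
        cell = (home + i) % size      # F(i) = i, wrapping around the table
        if table[cell] is None:
            table[cell] = key
            return cell
    raise OverflowError("hash table is full")


linear_table = [None] * 10
linear_cells_used = [linear_probe_insert(linear_table, k)
                     for k in (12, 30, 11, 32, 34, 54, 50)]
```

Key 50 is the costly one: it hashes to cell 0 and has to walk past the whole cluster in cells 0 to 5 before landing in cell 6.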

    Quadratic Probing

    Collisions can be treated similarly in quadratic probing.

    Here, instead of trying the next cell after the ideal cell (a cell given by a linear function of i),

    the i-th attempt tries a cell whose offset from the ideal cell is given by some quadratic function of i
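
A hedged Python sketch of quadratic probing, reusing the keys and the mod 10 hash from the linear probing example. The choice F(i) = i^2 is an assumption; the slides only say "some quadratic function".

```python
def quadratic_probe_insert(table, key):
    """Insert `key`, trying cells home, home + 1, home + 4, home + 9, ... (mod size)."""
    size = len(table)
    home = key % size
    for i in range(size):
        cell = (home + i * i) % size  # F(i) = i^2 (assumed)
        if table[cell] is None:
            table[cell] = key
            return cell
    # Quadratic probing does not visit every cell, so this can happen
    # even when the table still has free cells.
    raise OverflowError("no empty cell found on the probe sequence")


quad_table = [None] * 10
quad_cells_used = [quadratic_probe_insert(quad_table, k)
                   for k in (12, 30, 11, 32, 34, 54, 50)]
```

Here key 50 tries cells 0, 1, and 4 before settling in cell 9, so it jumps out of the cluster instead of walking through it.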

    Separate Chaining

    Maintains a list of all the keys that hash to the same value

    To insert:

    Calculate the hash function

    Access the corresponding list

    Add a link to the list

    i.e. A link is added in case of a collision

    The new key might be added at either end of the list

    Better for large records; handles collisions and overflow efficiently.

    Not as efficient when the record size is small or the domain of key values is limited to a small number of entries

    Separate Chaining: Illustration

    Insert sequence: 22, 42, 30, 43, 10

    [Figure: chained hash table with cells 0 to 3 shown, holding the keys 30, 22, 43, 10, and 42]
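
A minimal Python sketch of separate chaining for the insert sequence above. It assumes the hash function is mod 10 (the slide does not state it), appends new keys at the tail of the chain, and uses Python lists in place of linked lists.

```python
def chain_insert(table, key):
    """Compute the hash, access the corresponding list, and add the key to it."""
    table[key % len(table)].append(key)


def chain_search(table, key):
    """Only the one chain the key hashes to has to be scanned."""
    return key in table[key % len(table)]


chains = [[] for _ in range(10)]
for k in (22, 42, 30, 43, 10):
    chain_insert(chains, k)
```

Under these assumptions 30 and 10 share the chain at cell 0, 22 and 42 share the chain at cell 2, and 43 sits alone at cell 3.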

    Rehashing

    When the table gets too full, the number of collisions increases,

    resulting in degraded performance while inserting as well as searching

    Build another hash table with size ~ 2*OldSize

    Scan the original table; for each entry, compute the new hash value

    and insert it in the new hash table

    Rehashing is costly and thus should not be done very frequently.

    Rehashing: Illustration

    Consider the hash table given in the figure:

    [Figure: hash table with cells 0 to 9 holding the keys 12, 30, 11, 32, 34, 54, 50]

    The keys to be inserted are 12, 30, 11, 32, 34, 54, 50

    The hash function is mod 10

    Rehashing

    New table size 19

    The hash function is mod 19

    [Figure: new hash table with cells 0 to 18 holding the rehashed keys 12, 30, 11, 32, 34, 54, 50]
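
The rehashing steps can be sketched in Python as follows. The sketch assumes linear probing, an old table laid out as linear probing with mod 10 would leave it, and a new hash of key mod 19 so that every hash value falls inside the 19-cell table. The function names are illustrative.

```python
def probe_insert(table, key):
    """Linear probing insert with hash key % len(table)."""
    size = len(table)
    for i in range(size):
        cell = (key % size + i) % size
        if table[cell] is None:
            table[cell] = key
            return cell
    raise OverflowError("hash table is full")


def rehash(old_table, new_size):
    """Scan the old table cell by cell and re-insert every key into a new table."""
    new_table = [None] * new_size
    for key in old_table:
        if key is not None:
            probe_insert(new_table, key)
    return new_table


old_table = [30, 11, 12, 32, 34, 54, 50, None, None, None]
new_table = rehash(old_table, 19)
```

Every key is hashed again with the new divisor, which is why the cost grows with the table and the operation should be rare.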

    Indexing Techniques

    Indexing Techniques

    Cylinder Surface Indexing

    Hashed Indexing

    Tree Indexing

    Cylinder Surface Indexing

    Used for the primary key index in sequential file organization

    Assumes records are stored in increasing order of primary key

    Index consists of a CYLINDER INDEX plus a SURFACE INDEX for each cylinder

    Cylinder Surface Indexing

    If a data file takes up c cylinders, the cylinder index (CI) has c entries

    Each CI entry contains

    {CYLINDER_NO, Largest key on cylinder}

    Each entry of the surface index (SI) of the i-th cylinder contains:

    {SURFACE_NO, Largest key on this surface of the i-th cylinder}

    Cylinder Surface Indexing

    Searching a record (ISAM)

    Read Cylinder Index in memory

    Locate the cylinder number that possibly contains the record

    Read the surface index of the corresponding cylinder

    Find the surface (reduced to track) that may contain the record

    Search the track sequentially
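
The search steps above can be sketched in Python with in-memory stand-ins for the disk structures. The data layout (lists of `(number, largest_key)` pairs and a `tracks` dictionary keyed by cylinder and surface) and all sample values are assumptions made for illustration.

```python
from bisect import bisect_left


def isam_search(key, cylinder_index, surface_indexes, tracks):
    """Cylinder index -> surface index -> sequential scan of one track."""
    # First cylinder whose largest key is >= the search key.
    pos = bisect_left([largest for _, largest in cylinder_index], key)
    if pos == len(cylinder_index):
        return None
    cylinder_no = cylinder_index[pos][0]

    # First surface of that cylinder whose largest key is >= the search key.
    surface_index = surface_indexes[cylinder_no]
    pos = bisect_left([largest for _, largest in surface_index], key)
    if pos == len(surface_index):
        return None
    surface_no = surface_index[pos][0]

    # Search the selected track sequentially.
    for record_key, record in tracks[(cylinder_no, surface_no)]:
        if record_key == key:
            return record
    return None


cylinder_index = [(0, 40), (1, 90)]
surface_indexes = {0: [(0, 20), (1, 40)], 1: [(0, 60), (1, 90)]}
tracks = {
    (0, 0): [(5, "r5"), (20, "r20")],
    (0, 1): [(30, "r30"), (40, "r40")],
    (1, 0): [(55, "r55"), (60, "r60")],
    (1, 1): [(75, "r75"), (90, "r90")],
}
```

Because each index entry stores the largest key of its cylinder or surface, a binary search for the first entry not smaller than the key picks the only place the record can be.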

    Hashed Indexing

    Maintains a hash table of key values along with the corresponding record addresses

    The hash functions and overflow handling techniques are those discussed under hashing

    In case of linear probing, seek time is low as overflow buckets/cells are adjacent

    In case of separate chaining, special buffer space is allocated for expansion of buckets; thus little or no additional seek time is required

    Seek time is highest in case of random or quadratic probing

    Tree Indexing

    Indexing using balanced trees of order $m$

    Discussed before as B-trees and B+ trees

    Maximum number of keys in a tree with $l$ levels: $m^l - 1$

    Let the number of keys be $N$

    Number of failure nodes (nodes that one could reach while looking for a key that does not exist in the tree) $= N + 1$

    $=$ number of nodes at level $l + 1$

    $\ge 2 \lceil m/2 \rceil^{l-1}$

    Thus, $N \ge 2 \lceil m/2 \rceil^{l-1} - 1$

    Tree indexing

    Consider a B-tree of order $m = 200$

    Say $N = 2 \times 10^6$; then $N \ge 2 \lceil m/2 \rceil^{l-1} - 1$,

    i.e. $2 \times 10^6 \ge 2 \lceil 200/2 \rceil^{l-1} - 1$

    We get

    $10^6 \ge 100^{l-1}$

    $6 \ge 2(l - 1)$

    $l \le 4$
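
The same bound can be checked numerically. This Python sketch returns the largest number of levels l that still satisfies N >= 2 * ceil(m/2)^(l-1) - 1; the function name is an assumption, and the sketch presumes at least one key.

```python
def max_levels(order: int, keys: int) -> int:
    """Largest l with 2 * ceil(order / 2) ** (l - 1) - 1 <= keys (keys >= 1)."""
    half = (order + 1) // 2          # integer ceil(order / 2)
    levels = 1
    # Keep adding a level while the minimum key count for it still fits.
    while 2 * half ** levels - 1 <= keys:
        levels += 1
    return levels


levels_for_two_million = max_levels(200, 2 * 10**6)
```

With order 200 and two million keys the loop stops at 4 levels, so any key is found within four node accesses.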

    FILE ORGANIZATION

    File Organization

    Sequential File Organization

    Random File Organization

    Inverted Files

    Cellular Files

    Sequential File Organization

    ISAM is the most popular sequential file organization

    Cylinder surface index is maintained for primary key.

    Makes search based on PK efficient

    Search based on other attributes requires the use of an alternate indexing technique

    Insertion, Deletion are time consuming

    Batch processes and Range queries are executed efficiently

    Random File Organization

    Records are stored at random locations

    Techniques used for randomization

    Direct Addressing

    Directory Lookup

    Hashed File Organization

    Direct Addressing

    Available disk space is divided into nodes large enough to hold a record

    The numeric value of the PK determines the node number where the insertion is to be made (1 disk access for read)

    Good for fixed-length records and high identifier density (keys currently in use / size of the key domain).

    In case of variable-length records, pointers to the actual locations on disk are maintained (2 disk accesses for read)

    Directory Lookup

    Like DA with variable-length records, the index maintains key values and pointers to disk addresses

    Unlike DA with variable-length records, the available space is utilized efficiently, as the existing keys are stored contiguously

    Searching requires multiple disk accesses, as the index needs to be searched first
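
To contrast the two techniques, here is a small Python sketch. The function names, the `base_key` parameter, and the sample directory are hypothetical; a sorted list of (key, address) pairs stands in for the on-disk directory.

```python
from bisect import bisect_left


def direct_address(key, base_key):
    """Direct addressing: the node number is computed from the key itself."""
    return key - base_key


def directory_lookup(directory, key):
    """Directory lookup: search a sorted (key, address) index for the key."""
    keys = [k for k, _ in directory]
    pos = bisect_left(keys, key)
    if pos < len(directory) and directory[pos][0] == key:
        return directory[pos][1]
    return None


# Three existing keys spread over the domain 100..990.
directory = [(100, 0), (205, 1), (990, 2)]
```

Direct addressing would reserve a node for every key from 100 to 990 even though only three exist, while the directory stores the three records contiguously at addresses 0 to 2 at the price of searching the index first.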

    Hashed File Organization

    Uses same principle as hashed indexes

    Available file space is divided into cells/buckets/slots

    Some space is set aside for overflow in case of chaining

    Inverted Files

    Index contains the link information

    Index structure is most important

    Stores index values and related record addresses

    Records may be stored using any organization

    Actual records may do away with storage of key values.

    Inverted Files

    E# Index
    E#     Record
    100    A
    101    D
    110    E
    200    B
    220    C
    340    F

    Occupation Index
    Occupation    Records
    Programmer    A, B, D
    Analyst       C, E

    Gender Index
    Gender    Records
    Female    B, C, D
    Male      A, E

    Inverted Files

    Searching becomes efficient, as the addresses associated with a key value are available as a list

    Combinations of conditions can be evaluated using simple list operations such as union, intersection, and subtraction.
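
A short Python sketch of such combined queries, using the occupation and gender indexes from the illustration. Representing each address list as a Python `set` is an assumption made so that union, intersection, and subtraction map onto built-in operators.

```python
occupation_index = {
    "Programmer": {"A", "B", "D"},
    "Analyst": {"C", "E"},
}
gender_index = {
    "Female": {"B", "C", "D"},
    "Male": {"A", "E"},
}

# Occupation = Programmer AND Gender = Female  -> intersection
female_programmers = occupation_index["Programmer"] & gender_index["Female"]

# Occupation = Analyst OR Gender = Male  -> union
analyst_or_male = occupation_index["Analyst"] | gender_index["Male"]

# Gender = Female AND NOT Occupation = Programmer  -> subtraction
female_non_programmers = gender_index["Female"] - occupation_index["Programmer"]
```

Each query is answered from the indexes alone; the records themselves are read only for the addresses that survive the list operations.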

    Cellular Partitions

    Storage media is divided into cells

    A cell could be

    A disk pack; or

    A cylinder

    Lists of a given key value are divided into sub-lists such that each sub-list occupies a single cell.

    The index entries now contain the starting address of each sub-list and the number of records in that sub-list.

    Cellular Partition

    In case a cell is a cylinder, all the records placed in one cell can be accessed without moving the read/write head

    In case a cell is a disk pack, several cells can be searched in parallel.
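
A minimal Python sketch of splitting one key value's record list into per-cell sub-lists. It assumes record addresses are integers and that a record's cell number is `address // cell_size`; both the addressing scheme and the function name are illustrative, not taken from the slides.

```python
from itertools import groupby


def partition_into_cells(addresses, cell_size):
    """Return index entries (cell_no, starting_address, record_count), one per cell."""
    entries = []
    ordered = sorted(addresses)
    for cell_no, group in groupby(ordered, key=lambda a: a // cell_size):
        sublist = list(group)
        entries.append((cell_no, sublist[0], len(sublist)))
    return entries


# Addresses of all records sharing one key value, with 100 addresses per cell.
index_entries = partition_into_cells([5, 130, 17, 260, 140], 100)
```

Each entry describes a sub-list that lives entirely in one cell, so the three sub-lists here could be read without head movement (cylinder cells) or searched in parallel (disk pack cells).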

    What we Studied

    Hashing

    Indexing Techniques

    File Organization

    Review Questions

    1. What are the criteria behind the design of a hash function?

    2. What are the various ways to store graphs in memory?

    3. Discuss the applications of hash tables. Write a short note on symbol tables.

    4. Compare sequential and random file organization.

    5. What are the advantages of using inverted files?

    6. Would you use quadratic probing for resolving collisions in hashed index files? State reasons.

    7. Write a short note on the structure of a direct file.

    8. Give a comparison between sequential files, indexed sequential files, and random access files.

    9. Write a short note on open address hashing and separate chaining.

    10. Discuss random file organization and the various techniques used for randomization.

    11. Explain the various techniques for overflow / collision resolution in case of hashing.
