Data Structure Unit 5

Post on 16-Apr-2015











Files – Queries and sequential organizations – Index techniques. File organizations – Sequential, Random, Linked organizations – Inverted files, Cellular partitions.

FILE ORGANIZATIONS

File: A file is a collection of records. A record is a collection of related fields. Each field is a data item. Ex. Employee file, student file, etc.

The primary objective of file organization is to provide a means for record retrieval and update. The update of a record could involve its deletion, changes in some of its fields, or the insertion of an entirely new record. Certain fields in a record may be designated as key fields. Records may be retrieved by specifying values for some or all of these keys. A combination of key values specified for retrieval is called a query.

Query types

The following are the different types of queries:

1. Sex = M: simple query (the value of the key is specified).
2. Salary > 9000: range query (a range of values for a single key is specified).
3. Salary > average salary of all employees: functional query (some function of the key values in the file is specified).
4. (sex = M and occupation = programmer) or (employee number > 700 and sex = F): Boolean query (Boolean operators are used).
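The four query types can be tried out on the EMPLOYEE records of Table 1. The following is only a minimal sketch: the dictionary layout and the helper name `select` are assumptions made for illustration, and every query is answered by a plain linear scan rather than by any of the file organizations discussed in this unit.

```python
# EMPLOYEE records copied from Table 1; the dictionary layout is an assumption.
EMPLOYEES = [
    {"id": "A", "empno": 800, "occupation": "programmer", "sex": "M", "salary": 10000},
    {"id": "B", "empno": 510, "occupation": "analyst", "sex": "F", "salary": 15000},
    {"id": "C", "empno": 950, "occupation": "analyst", "sex": "F", "salary": 12000},
    {"id": "D", "empno": 750, "occupation": "programmer", "sex": "F", "salary": 12000},
    {"id": "E", "empno": 620, "occupation": "programmer", "sex": "M", "salary": 9000},
]


def select(records, predicate):
    """Hypothetical helper: ids of all records satisfying the predicate."""
    return [r["id"] for r in records if predicate(r)]


# 1. Simple query: the value of one key is specified.
simple_query = select(EMPLOYEES, lambda r: r["sex"] == "M")

# 2. Range query: a range of values for a single key is specified.
range_query = select(EMPLOYEES, lambda r: r["salary"] > 9000)

# 3. Functional query: a function of the key values in the file is used.
average = sum(r["salary"] for r in EMPLOYEES) / len(EMPLOYEES)
functional_query = select(EMPLOYEES, lambda r: r["salary"] > average)

# 4. Boolean query: conditions combined with Boolean operators.
boolean_query = select(
    EMPLOYEES,
    lambda r: (r["sex"] == "M" and r["occupation"] == "programmer")
    or (r["empno"] > 700 and r["sex"] == "F"),
)

print(simple_query, range_query, functional_query, boolean_query)
```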

The different types of file organizations are:

1. Sequential file organization
2. Random file organization
3. Linked organization
4. Inverted files
5. Cellular partition



In sequential organization the records are placed sequentially on the storage medium, i.e. they occupy consecutive locations; in the case of a tape this means placing records adjacent to each other. If, in addition, the physical sequence of records is ordered on some key, called the primary key, the file is an ordered sequential file.

Consider the following EMPLOYEE file (Table 1):

Record  Empno  Name  Occupation  Sex  Salary
A       800    xxx   programmer  M    10,000
B       510    yyy   analyst     F    15,000
C       950    zzz   analyst     F    12,000
D       750    kkk   programmer  F    12,000
E       620    rrr   programmer  M    9,000

If the records of the above table are stored on a tape in the sequence A, B, C, D, E, then it is a sequential file. This file is unordered. If the primary key is empno, then physical storage of the file in the sequence B, E, D, A, C would lead to an ordered file.

Mode of retrieval

The mode of retrieval may be either batched or real time.

In real time retrieval the response to any query is immediate. Example: in an airline reservation system, one must be able to determine the status of a flight in a matter of seconds.

In a batch processing system the response time is not significant. Requests for retrieval are batched together on a ‘transaction’ file until either enough requests have been received or a suitable amount of time has passed. Then the requests in the transaction file are processed.

Mode of update

The mode of update can either be batched or real time.

In a real time system, the update is made immediately. For example, in a reservation system, as soon as a seat on a flight is reserved, the file must be updated immediately to reflect the change.

In a batch system, the updates are made when the transaction file is processed. For example, in a banking system, all deposits and withdrawals made on a particular day are collected on a transaction file and the updates are made at the end of the day. The system contains two types of files: the ‘master file’ and the ‘transaction file’.

For a batch processing system, magnetic tape is an adequate storage medium. The master file represents the file status. The transaction file contains all update requests that have not yet been reflected in the master file. So the master file is always ‘out of date’ to the extent that update requests have been batched on the transaction file. The master file contains records which are sorted on the primary key of the file. The requests for retrieval and update are on the transaction file. When it is time to process the transaction file, the transactions are sorted on the key and an update process is carried out to create a new master file. All the records in the old master file are examined, changed if necessary and then written on to the new master file. For a master file of n records and a transaction file of m requests, the time required for this process is O(n + m log m): O(m log m) to sort the transactions and O(n + m) to merge them with the master file.
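The batch update can be sketched as a sort followed by a single merge pass. This is an illustrative sketch only: the function name `batch_update`, the `(key, record)` and `(key, operation, record)` tuple layouts and the three operation names are assumptions, not part of the notes. Sorting the m transactions costs O(m log m) and the merge touches each master record and each transaction once, giving O(n + m log m) overall.

```python
def batch_update(old_master, transactions):
    """Create a new master file from a sorted old master and a batch of requests.

    old_master:   list of (key, record), sorted on the primary key.
    transactions: list of (key, op, record), op is 'insert', 'update' or 'delete'.
    """
    # Python's sort is stable, so requests on the same key keep arrival order.
    pending = sorted(transactions, key=lambda t: t[0])
    new_master = []
    i = j = 0
    while i < len(pending) or j < len(old_master):
        # Pick the smaller of the next master key and the next transaction key.
        if j < len(old_master) and (
            i >= len(pending) or old_master[j][0] <= pending[i][0]
        ):
            key, current = old_master[j]
            j += 1
        else:
            key, current = pending[i][0], None  # key not in the old master
        # Apply every transaction for this key, in arrival order.
        while i < len(pending) and pending[i][0] == key:
            _, op, record = pending[i]
            current = None if op == "delete" else record
            i += 1
        if current is not None:
            new_master.append((key, current))
    return new_master


master = [(510, "B"), (620, "E"), (750, "D"), (800, "A"), (950, "C")]
requests = [(800, "update", "A2"), (700, "insert", "F"), (620, "delete", None)]
print(batch_update(master, requests))
```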

Sequential organization is also possible on direct access storage devices (DASD). Even though disk storage is really two-dimensional (cylinder × surface), it can be mapped into a one-dimensional memory. If a disk contains c cylinders and s surfaces, one way is to view the disk memory sequentially as shown in figure 1.

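[Figure 1: sequential interpretation of disk memory. Sequence for cylinders: cylinder 1, cylinder 2, ..., cylinder c. Sequence within a cylinder: surface 1, surface 2, ..., surface s.]

Using the notation tij to represent the jth track of the ith surface, the sequence is t11, t21, ..., ts1, t12, t22, ..., ts2 and so on.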

The other way of representing sequential file organization is to access the tracks in surface order: all tracks of surface 1, then all tracks of surface 2, and so on up to surface s. For each surface the tracks are taken in the order track 1, track 2, ..., track c.
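The two ways of laying a sequential file over a disk can be written down as orderings of (surface, track) pairs. The sketch below is illustrative; the function names are assumptions, and a track tij is represented by the pair (i, j).

```python
def cylinder_order(c, s):
    """Cylinder by cylinder: t11, t21, ..., ts1, t12, ..., ts2, ..."""
    return [(i, j) for j in range(1, c + 1) for i in range(1, s + 1)]


def surface_order(c, s):
    """Surface by surface: all tracks of surface 1, then surface 2, ..."""
    return [(i, j) for i in range(1, s + 1) for j in range(1, c + 1)]


def position_in_cylinder_order(i, j, s):
    """0-based position of track tij in the cylinder-by-cylinder sequence."""
    return (j - 1) * s + (i - 1)


print(cylinder_order(2, 3))
print(surface_order(2, 3))
```

With the first ordering, consecutive tracks of the file lie on the same cylinder, so they can be read without moving the read/write heads.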

If the records are of the same size, binary search can be used to search for a record with the required key. For a file containing n records, about log2 n accesses are needed in the worst case.
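Because every record has the same size, the position of the kth record can be computed directly, which is what makes binary search possible. The sketch below simulates the file with a byte string; the record layout (a 4-byte key followed by a 12-byte name field) and the names `build_file` and `find` are assumptions chosen for illustration.

```python
import struct

# Hypothetical fixed-size record: 4-byte integer key + 12-byte name field.
RECORD = struct.Struct("<i12s")


def build_file(rows):
    """Pack (key, name) rows, sorted on the key, into one byte string."""
    return b"".join(RECORD.pack(key, name.encode()) for key, name in sorted(rows))


def find(data, key):
    """Binary search; returns (name or None, number of record accesses)."""
    lo, hi = 0, len(data) // RECORD.size - 1
    accesses = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        # Record number mid starts at byte offset mid * RECORD.size.
        k, payload = RECORD.unpack_from(data, mid * RECORD.size)
        accesses += 1
        if k == key:
            return payload.rstrip(b"\0").decode(), accesses
        if k < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, accesses


DATA = build_file([(800, "A"), (510, "B"), (950, "C"), (750, "D"), (620, "E")])
print(find(DATA, 950))
```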

If the records are of variable size, binary search cannot be used and sequential search has to be applied. The retrieval time can, however, be reduced by maintaining an index. An index contains (address, key) pairs. To retrieve a record, the index is consulted first and the record is then read directly from its address on the storage medium. For example, for the file given in Table 1, one can maintain an index for the key empno as given below.

Address  Key
A1       510
A2       620
A3       750
A4       800
A5       950

A1, A2, A3, A4 and A5 are the addresses of the records on the storage medium.

Disadvantages of sequential file organization

1. Updates are not easily accommodated.
2. By definition, random access is not possible.
3. All records must be structurally identical. If a new field is to be added, then every record must be rewritten to provide space for the new field.
4. Continuous access may not be possible because both the primary data file and the transaction file must be locked during merging.

Area of use

Sequential files are most frequently used in commercial batch oriented data processing applications where there is the concept of a master file to which details are added periodically. Ex. Payroll applications


One of the important components of a file is the directory. A directory is a collection of indexes. The directory may contain one index for every key or an index for only some of the keys. Some of the indexes may be dense (i.e. they contain an entry for every record) while others may be non-dense (they contain an entry for only some of the records).

An index is a collection of pairs of the form (key value, address). If the records of Table 1 are stored at addresses a1, a2, ..., a5 respectively, then an index for the key empno would have the entries (800, a1), (510, a2), (950, a3), (750, a4) and (620, a5). This index is dense since it contains an entry for each record. For the occupation key, three records have ‘occupation = programmer’ and two have ‘occupation = analyst’, so an index with one (key value, address) pair per distinct key value could refer to only some of the records. This difficulty can be overcome by keeping, in the address field of each distinct key value, a pointer to another address at which a list of the addresses of all records having this value is maintained. If at address b1 we store the list of addresses of all programmer records, i.e. a1, a4 and a5, and at b2 the addresses of all analyst records, i.e. a2 and a3, then the index for the occupation field is (‘programmer’, b1) and (‘analyst’, b2). Another method is to change the format of the entries in an index to (key value, address 1, address 2, ..., address n). With the second method the index entries are of variable size.
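The dense empno index and the occupation index with address lists can be sketched as follows. This is an illustration under assumptions: the addresses a1 to a5 are plain strings, and the second method, (key value, address 1, ..., address n), is modelled by a dictionary that maps each key value to a list of addresses.

```python
from collections import defaultdict

# Records of Table 1 at addresses a1..a5: address -> (empno, occupation).
RECORDS = {
    "a1": (800, "programmer"),
    "a2": (510, "analyst"),
    "a3": (950, "analyst"),
    "a4": (750, "programmer"),
    "a5": (620, "programmer"),
}

# Dense index on the primary key: exactly one (key value, address) pair per record.
empno_index = {empno: addr for addr, (empno, _) in RECORDS.items()}


def build_occupation_index(records):
    """Index entries of the form (key value, address 1, ..., address n)."""
    index = defaultdict(list)
    for addr, (_, occupation) in records.items():
        index[occupation].append(addr)
    return dict(index)


occupation_index = build_occupation_index(RECORDS)
print(empno_index[750], occupation_index["programmer"])
```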

An index differs from a table essentially in its size. While a table is small enough to fit into the available internal memory, an index is too large for this and has to be maintained on external storage devices (floppy disk, hard disk, etc.).

Accessing a word of information from internal memory takes about 10^-8 seconds, while accessing the same word from a disk could take about 10^-1 seconds.


The simplest of all indexing techniques is cylinder-surface indexing. It is useful only for the primary key index of a sequentially ordered file. The sequential interpretation of the disk memory is shown in figure 1.

It is assumed that the records are stored sequentially in increasing order of the primary key. The index consists of a cylinder index and several surface indexes. If the file requires c cylinders (1 through c) for storage, then the cylinder index contains c entries, one per cylinder, each corresponding to the largest key value in that cylinder. Associated with each of the c cylinders is a surface index. If the disk has s usable surfaces, then each surface index has s entries. The ith entry in the surface index for cylinder j is the value of the largest key on the jth track of the ith surface. The total number of surface index entries is therefore s × c.

A search for a record with a particular key value X is carried out by first reading the cylinder index into memory. Since the number of cylinders in a disk is only a few hundred, the cylinder index typically occupies only one track. The cylinder index is searched to determine which cylinder possibly contains the desired record. The search can be carried out by binary search when each entry requires a fixed number of words. If this is not feasible, the cylinder index can consist of an array of pointers to the starting points of the individual key values. In either case the search can be carried out in O(log c) time.

Once the cylinder index has been searched and the appropriate cylinder determined, the surface index corresponding to that cylinder is retrieved from the disk. The number of surfaces on a disk is usually very small, so the best way to search a surface index is a sequential search. Having determined which surface and cylinder are to be accessed, the corresponding track is read in and searched for the record with the desired key. So the total number of disk accesses is three (one to access the cylinder index, one for the surface index and one to read the track of records). When track sizes are very large it may not be feasible to read in the whole track. In this case the disk will usually be sector addressable, so an extra level of indexing is needed: the sector index. The number of accesses needed to retrieve a record is then four. When the file extends over several disks, a disk index is also maintained.
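The two index levels can be sketched as a binary search over the cylinder index followed by a sequential search of one surface index. The index contents below are invented for illustration, cylinders and surfaces are numbered from 0, and the function name `locate` is an assumption.

```python
from bisect import bisect_left

# Hypothetical indexes for a file on 3 cylinders with 3 surfaces each.
# CYLINDER_INDEX[j]    = largest key stored on cylinder j
# SURFACE_INDEXES[j][i] = largest key on the track of surface i in cylinder j
CYLINDER_INDEX = [300, 600, 950]
SURFACE_INDEXES = [[100, 200, 300], [400, 500, 600], [700, 800, 950]]


def locate(key):
    """Return (cylinder, surface) of the track that may hold key, or None."""
    # Binary search of the cylinder index: O(log c).
    j = bisect_left(CYLINDER_INDEX, key)
    if j == len(CYLINDER_INDEX):
        return None  # key is larger than every key in the file
    # Sequential search of the (short) surface index for cylinder j.
    for i, largest in enumerate(SURFACE_INDEXES[j]):
        if key <= largest:
            return j, i
    return None


print(locate(620))
```

The track found this way still has to be read and searched for the record itself, which is the third disk access mentioned above.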

This method of maintaining a file and index is referred to as ISAM (Indexed Sequential Access Method). It is probably the most popular and simplest file organization in use for single key values. When the file contains more than one key, it is not possible to use this index organization for the remaining keys.



The principles involved in maintaining hashed indexes are essentially the same as those of hash tables.

All the hash functions and overflow techniques of hash tables are applicable to hashed indexes also.

The overflow techniques are:

1. Rehashing
2. Open addressing
   a. Random
   b. Quadratic
   c. Linear
3. Chaining (refer to unit 4)
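As one example of these techniques, a hashed index with linear open addressing might look like the sketch below. The table size of 7, the division hash `key % size` and the function names are assumptions made for illustration; the index stores (key, address) pairs as described earlier.

```python
def insert(table, key, address):
    """Store (key, address) using linear open addressing; returns the slot used."""
    size = len(table)
    for probe in range(size):
        slot = (key % size + probe) % size  # home slot, then the next ones
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, address)
            return slot
    raise RuntimeError("index is full")


def lookup(table, key):
    """Return the address stored for key, or None if key is absent."""
    size = len(table)
    for probe in range(size):
        slot = (key % size + probe) % size
        if table[slot] is None:
            return None  # an empty slot ends the probe sequence
        if table[slot][0] == key:
            return table[slot][1]
    return None


index = [None] * 7
home = insert(index, 510, "B")      # 510 % 7 == 6
overflow = insert(index, 517, "X")  # 517 % 7 == 6 as well, so it moves on to slot 0
print(home, overflow, lookup(index, 517))
```

This sketch does not handle deletions, which need extra care with open addressing because an emptied slot would cut a probe sequence short.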


AVL trees can be used to search, insert and delete entries in a table of size n in at most O(log n) time. Suppose, however, that the AVL tree resides on a disk. If nodes are retrieved from the disk one at a time, then a search of an index with n entries would require at most 1.4 log n disk accesses (the maximum depth of an AVL tree is about 1.4 log n, with logarithms to base 2). This is a lot worse than the cylinder-surface index. Therefore a balanced tree based upon an m-way search tree is used, which for this purpose is better than a binary search tree.

Definition: An m-way search tree, T, is a tree in which all nodes are of degree ≤ m. If T is empty (T = nil) then T is an m-way search tree. When T is not empty it has the following properties:

(i) T is a node of the type

n, A0, (K1, A1), (K2, A2), ..., (Kn, An)

where the Ai, 0 ≤ i ≤ n, are pointers to the subtrees of T and the Ki, 1 ≤ i ≤ n, are key values; and 1 ≤ n < m.

(ii) Ki < Ki+1, 1 ≤ i < n.

(iii) All key values in the subtree Ai are less than the key value Ki+1, 0 ≤ i < n.

(iv) All key values in the subtree An are greater than Kn.

(v) The subtrees Ai, 0 ≤ i ≤ n, are also m-way search trees.

A B-tree is a balanced m-way tree. A node of the tree contains many records or keys and pointers to children. A B-tree is also known as the balanced sort tree. It finds its use in external sorting. It is not a binary tree.


To reduce disk accesses, several conditions of the tree must be true:

1. The height of the tree must be kept to a minimum.
2. There must be no empty subtrees above the leaves of the tree.
3. The leaves of the tree must all be on the same level.
4. All nodes except the leaves must have at least some minimum number of children.

[Figure A: a 3-way search tree with root node a and child nodes b, c and d.]

The procedure to search an m-way search tree is given below:

procedure msearch(t : mtree; x : integer; var p : mtree; var i, j : integer);
{ Search the m-way search tree t residing on disk for the key value x.
  Individual node format is n, A0, (K1,A1), ..., (Kn,An), n < m.
  A triple (p, i, j) is returned. j = 1 implies x is found at node location p with key Ki.
  Else j = 0 and p is the location of the node into which x can be inserted. }
label 99;
begin
  p := t; K0 := -maxint; q := nil; j := 1;
  while p <> nil do
  begin
    input node located at p from disk;
    let this node define n, A0, (K1,A1), ..., (Kn,An);
    Kn+1 := maxint;
    let i be such that Ki <= x < Ki+1;
    if x = Ki then goto 99;    { found: the result is (p, i, 1) }
    q := p; p := Ai;
  end;
  p := q; j := 0;              { not found: the result is (q, i, 0) }
99: end;

In figure A, if we want to search for 35, a search in the root node indicates that the appropriate subtree to be searched is the one with root A1 at address c. A search of this node indicates that the next node to search is at address e. The key value 35 is found in this node and the search terminates. So if this search tree resides on a disk, the search for x = 35 requires accessing the nodes at addresses a, c and e, for a total of three disk accesses. The maximum number of disk accesses made is equal to the height of the tree. To minimize the number of disk accesses, we must therefore minimize the height of the search tree.
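The search can also be sketched in runnable form. The following is an illustration, not the procedure of the notes: the disk is simulated by a dictionary from node address to node, and the tree is only an assumed reading of figure A, chosen so that a search for 35 visits the nodes a, c and e as described above. `bisect_right` finds the i with Ki <= x < Ki+1.

```python
from bisect import bisect_right

# Simulated disk: address -> node. children[i] plays the role of pointer Ai.
# The exact shape of figure A is an assumption.
DISK = {
    "a": {"keys": [20, 40], "children": ["b", "c", "d"]},
    "b": {"keys": [10, 15], "children": [None, None, None]},
    "c": {"keys": [25, 30], "children": [None, None, "e"]},
    "d": {"keys": [45, 50], "children": [None, None, None]},
    "e": {"keys": [35], "children": [None, None]},
}


def msearch(disk, root, x):
    """Return (p, i, found, accesses).

    If found, x is key Ki (1-based) of the node at address p. Otherwise p is
    the last node visited, into which x could be inserted after Ki.
    """
    p, q, i, accesses = root, None, 0, 0
    while p is not None:
        node = disk[p]  # one disk access per node visited
        accesses += 1
        i = bisect_right(node["keys"], x)  # number of keys <= x
        if i > 0 and node["keys"][i - 1] == x:
            return p, i, True, accesses
        q, p = p, node["children"][i]
    return q, i, False, accesses


print(msearch(DISK, "a", 35))
```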


[Figure A, node keys: the root holds 20 and 40; the nodes on the next level hold 10 and 15, 25 and 30, and 45 and 50.]



In random organization the records are stored at random locations on disk. The following techniques are used for randomization:

1. Direct addressing
2. Directory lookup
3. Hashing

Direct addressing

In direct addressing with equal size records, the available disk space is divided into nodes large enough to hold a record. The numeric value of the primary key is used to determine the node into which a particular record is stored. No index on this key is required. With primary key = empno, the record for empno = 259 will be stored in node 259. In this organization, searching for or deleting a record given its primary key value requires only one disk access, whereas updating requires two disk accesses (one to read and another to write back the modified record).

In the case of variable size records, an index is set up with pointers to the actual records on the disk, as shown in the following figure. The number of disk accesses is one more than for fixed size records.






The space efficiency of direct addressing depends on the identifier density n/T (n = number of distinct primary key values in the file, T = total number of possible primary key values).

Directory lookup

This is similar to the addressing technique for variable size records. The index is not of direct access type but a dense index maintained using a structure suitable for index operations. Retrieving a record involves searching the index for the record address and then accessing the record itself. This technique makes a more efficient utilization of space than direct addressing, but requires more accesses for retrieval and update, since index searching will generally require more than one access.

[Figure: an index whose entries point to the variable size records B, E, D, A and C on the disk.]



Hashing

The principle of hashed file organization is the same as that of a hash table. The available space is divided into buckets and slots. Some space may have to be set aside for an overflow area in case chaining is used to handle overflows. When variable size records are present, the number of slots per bucket is only a rough indicator of the number of records a bucket can hold. The actual number will vary dynamically with the size of the records in a particular bucket.

Random organization on the primary key using any of the above three techniques overcomes the difficulties of sequential organization. One of the main disadvantages of random organization is that batch processing of queries becomes inefficient, as the records are not maintained in order of the primary key. For example, the range query ‘retrieve the records of all employees with employee number > 800’ needs to examine every node.


Linked organization differs from the sequential organizations essentially in that the logical sequence of records is generally different from the physical sequence. In a sequential organization if the ith record of the file is at Li then the (i+1) th record will be at Li+c, where c is the size of the ith record.

In linked organization the next record is obtained by following the link value from the present record. Linking records together in order of the primary key facilitates the deletion and insertion of records once the place at which the insertion or deletion is to be made is known. An index with ranges of employee numbers can be maintained to facilitate searching based on employee numbers.

For example, the ranges 501-700, 701-900 and 901-1100 of employee numbers can be created for the EMPLOYEE file given in Table 1. All records having empno in the same range are linked together as shown in the following figure.

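[Figure: index of upper values 700, 900 and 1100, each entry pointing to the linked list of records whose empno falls in that range.]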
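The range index over linked records can be sketched as follows. The order in which the records of one range are linked is an assumption (increasing empno), as are the field and function names; the upper values 700, 900 and 1100 come from the ranges given above.

```python
from bisect import bisect_left

# Records of Table 1, linked within each empno range (assumed link order).
RECORDS = {
    "B": {"empno": 510, "next": "E"},
    "E": {"empno": 620, "next": None},
    "D": {"empno": 750, "next": "A"},
    "A": {"empno": 800, "next": None},
    "C": {"empno": 950, "next": None},
}
UPPER_VALUES = [700, 900, 1100]  # ranges 501-700, 701-900, 901-1100
FIRST_NODES = ["B", "D", "C"]    # pointer to the first record of each list


def find(empno):
    """Return (record name or None, number of records examined)."""
    slot = bisect_left(UPPER_VALUES, empno)  # first range with upper value >= empno
    if slot == len(UPPER_VALUES):
        return None, 0
    node, examined = FIRST_NODES[slot], 0
    while node is not None:  # follow the links of this one list only
        examined += 1
        if RECORDS[node]["empno"] == empno:
            return node, examined
        node = RECORDS[node]["next"]
    return None, examined


print(find(800))
```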

Using an index in this way reduces the length of the lists and thus the search time. This idea can be generalized to allow for easy secondary key retrieval. We set up indexes for each key and allow records to be in more than one list. This leads to multi list representation.

Retaining the list lengths in the indexes enables us to reduce search time in the case of Boolean queries by allowing us to search the smaller list.






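[Figure: multi list representation of the EMPLOYEE file. The empno index (by maximum empno), the occupation index and the salary index each hold a pointer to the first node of every list; each record carries an empno link, an occupation link and a salary link.]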




Inserting a new record into a multi list structure is easy so long as the individual lists do not have to be maintained in some order. A record may be inserted at the front of the appropriate list. Deletion of a record is difficult since there are no back pointers. Deletion may be simplified if each list is maintained as a doubly linked list.


The coral ring structure is an adaptation of the doubly linked multi list structure. Each list is structured as a circular list with a head node. The head node of the list for the key value Ki = x has an information field with value x. In each record the field for the key Ki is replaced by link fields. Thus for each key Ki, a record y in a coral ring contains two link fields: y↑.alink[i] and y↑.blink[i].

The alink is used to link together all records with the same key Ki. The alinks form a circular list with a head node whose information field retains the value of Ki for the records in the ring.

The blink is used for some records as a back pointer and for other records as a pointer to the head node. To distinguish between these two cases y↑.field[i] is used. If y↑.field[i] = 1 then y↑.blink[i] is a back pointer; if y↑.field[i] = 0 it is a pointer to the head node. When the blink of a record y is a back pointer, it points to the nearest record z preceding y in its circular list for Ki whose z↑.blink[i] is also a back pointer.

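[Figure data: indexes of the multi list representation of the EMPLOYEE file.]

Empno index
Upper value   700   900   1100
List length   2     2     1

Occupation index
Value         analyst   programmer
List length   2         3

Salary index
Upper value   <= 9000   <= 12000   <= 15000
List length   1         3          1
First node    E         A          B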


In any given circular list, all records with back pointers form another circular list in the reverse direction.

The presence of these back pointers makes it possible to carry out the deletion without having to start from the beginning of the list.

Forward circular list contains nodes α, A, B, C, and D

Reverse circular list contains nodes α, C, A


Inverted files are similar to multi lists. The difference is that in multi lists the records with the same key value are linked together, with the link information kept in the individual records, whereas in inverted files this information is kept in the index itself. The following are the indexes of the file given in Table 1.

Empno index
510   B
620   E
750   D
800   A
950   C

Occupation index
Analyst      B, C
Programmer   A, D, E

Sex index
Female   B, C, D
Male     A, E

Salary index
9,000    E
10,000   A
12,000   C, D
15,000   B

The index for every key is dense and contains a value entry for each distinct value in the file. Since the index entries are of variable length (the number of records with the same key value is variable), index maintenance becomes more complex than for multi lists.

Benefits of inverted file organization

Boolean queries require only one access per record satisfying the query. Queries of the type K1 = XX or K2 = XY can be processed by first accessing the indexes and obtaining the address lists for all the records with K1 = XX and with K2 = XY. These two lists are merged to obtain a list of all records satisfying the query. K1 = XX and K2 = XY can be handled by intersecting the two lists. Similarly, K1 = .not. XX can be obtained by taking the difference between the universal list (the list of all records) and the list for K1 = XX.
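The list operations can be sketched with Python sets standing in for the address lists of the occupation and sex indexes of Table 1. Using sets is a simplification made for illustration; on disk the lists would be merged and intersected as sorted address lists.

```python
# Inverted indexes for the EMPLOYEE file: key value -> set of record names.
OCCUPATION_INDEX = {"analyst": {"B", "C"}, "programmer": {"A", "D", "E"}}
SEX_INDEX = {"F": {"B", "C", "D"}, "M": {"A", "E"}}
ALL_RECORDS = {"A", "B", "C", "D", "E"}  # the universal list

# occupation = analyst OR sex = M: merge (union) of the two lists.
either = OCCUPATION_INDEX["analyst"] | SEX_INDEX["M"]

# occupation = programmer AND sex = F: intersection of the two lists.
both = OCCUPATION_INDEX["programmer"] & SEX_INDEX["F"]

# NOT sex = M: difference between the universal list and the list for M.
negated = ALL_RECORDS - SEX_INDEX["M"]

print(sorted(either), sorted(both), sorted(negated))
```

Only the records named in the final list are then read from disk, which is why each record satisfying the query costs a single access.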

The retrieval works in two steps.

1. The indexes are processed to obtain a list of records satisfying the query.
2. The records are retrieved using the list.

The number of disk accesses = number of records being retrieved + the number to process the indexes.

Inverted files result in space saving compared to other file structures when record retrieval does not require retrieval of key fields. In this case the key fields may be deleted from the records.

Insertion and deletion of records requires only the ability to insert and delete within indexes.

CELLULAR PARTITIONS:

In order to reduce file search times, the storage media may be divided into cells. A cell may be an entire disk pack or it may simply be a cylinder. Lists are localized to lie within a cell. If we have a multi list organization in which the list for key1 = prog includes records on several different cylinders, then we can break the list into several smaller lists, where each prog list includes only those records in the same cylinder. By doing this, all the records in the same cell (i.e. the same cylinder) may be accessed without moving the read/write heads. In case a cell is a disk pack, then using cellular partitions it is possible to search different cells in parallel.

Note: (The information given below is not related to cellular partitions.)

Traversing a binary tree can be done as follows:

        A
       / \
      B   D
     /   / \
    C   E   G
         \
          F

Preorder : ABCDEFG
Inorder : CBAEFDG
Post order : CBFEGDA
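The three traversal orders listed for this tree can be reproduced with a short recursive sketch. The nested-tuple representation (label, left subtree, right subtree) is an assumption chosen for brevity; the tree shape is the one implied by the preorder and inorder sequences.

```python
# Each node is (label, left subtree, right subtree); None is an empty subtree.
TREE = (
    "A",
    ("B", ("C", None, None), None),
    ("D", ("E", None, ("F", None, None)), ("G", None, None)),
)


def preorder(t):
    """Root, then left subtree, then right subtree."""
    return "" if t is None else t[0] + preorder(t[1]) + preorder(t[2])


def inorder(t):
    """Left subtree, then root, then right subtree."""
    return "" if t is None else inorder(t[1]) + t[0] + inorder(t[2])


def postorder(t):
    """Left subtree, then right subtree, then root."""
    return "" if t is None else postorder(t[1]) + postorder(t[2]) + t[0]


print(preorder(TREE), inorder(TREE), postorder(TREE))
```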


Preorder:

1. Traverse the root node
2. Traverse the left subtree in preorder
3. Traverse the right subtree in preorder

Inorder:

1. Traverse the left subtree in inorder
2. Traverse the root node
3. Traverse the right subtree in inorder

Postorder:

1. Traverse the left subtree in postorder
2. Traverse the right subtree in postorder
3. Traverse the root node

QUESTION BANK


SECTION A

1. The method of maintaining a file and index is referred to as __
2. Name an application that makes use of an inverted file.
3. Define Query.
4. Define cellular partitions.
5. What is meant by Random Organization?
6. If the biscuits packed in a cylindrical paper pack are to be taken out from one end, then the arrangement of the biscuits is _____.
7. Indexing methods are generally used for files stored in _____.

SECTION B

1. What is a file? Discuss.
2. Discuss inverted files.
3. Explain the advantages of sequential file organizations.
4. Explain any five file based operations.
5. Explain sequential file.
6. Write down various index techniques.
7. Explain different types of query with examples.
8. How does linked organization differ from sequential organization?
9. What are the factors that affect the I/O time for disks?

SECTION C

1. Discuss index techniques.
2. Describe random file organization.
3. Discuss cellular partitioned structures.
4. Explain linked organization with an example.
5. Explain inverted file structures.
6. Explain tree traversals with examples.

SECTION D

1. Develop an algorithm to create a sequential file for sorting the given records. Each record contains the name and age of a person, along with monthly income, number of dependents and their names.
2. Define and discuss the advantages of Random file organization.
3. Explain the structure and processing of indexed sequential files.
4. Explain random access file organization and discuss the technique for deleting and inserting records.
5. Compare and contrast sequential, indexed sequential and random access file organization.
6. Explain with an example where an inverted file organization is preferred over an indexed sequential file.
7. Explain how specific file organizations can favour certain types of request for data.
8. Define an m-way search tree and write a procedure for searching a key value.