4 File & Index

8/12/2019 4 File & Index

File Organizations and Indexes

Definition

A file of records is a collection of records that may reside on several pages.

Each record has a unique identifier called a record id, or rid.

Indexes are data structures intended to help find the record ids of records that meet a selection condition.

Every index has an associated search key, which is a collection of one or more fields of the file of records for which we are building the index.

Any subset of the fields can be a search key.

We sometimes refer to the file of records as the indexed file.

File Organization

A file organization is a method of arranging the records in a file on external storage.

The best choice depends on how the file is accessed. For example:

What if we want to retrieve employee records in alphabetical order?

What if we want to retrieve all employees whose salary is in a given range?

What if we want to retrieve all employees who are 55 years old (along with their names)?

Three Basic File Organizations

1. Heap (random order) files: Suitable when typical access is a file scan retrieving all records.

2. Sorted files: Best if records must be retrieved in some order, or only a range of records is needed.

3. Indexes: Data structures to organize records via trees or hashing.

Like sorted files, indexes speed up searches for a subset of records, based on values in certain (search key) fields.

Updates are much faster than in sorted files.

Each is ideal for some situations, and not so good in others.

FILE ORGANIZATIONS

For sorted and hashed files, the sequence of fields on which the file is sorted or hashed is called the search key.

The search key for an index can be any sequence of one or more fields; it need not uniquely identify records.

COST MODEL

A cost model is used to estimate the cost (in terms of execution time) of different database operations.

We ignore CPU costs, for simplicity.

Notation:

B: the number of data pages

R: the number of records per page

D: the average time to read or write a disk page

C: the average time to process a record

Hash function: maps a record into a range of numbers

H: the time required to apply the hash function to a record

Typical values today are [Ramakrishnan, 2003]: D = 15 milliseconds; C and H = 100 nanoseconds.

The cost of I/O is dominant. This is supported by current hardware trends, in which CPU speeds are steadily rising, whereas disk speeds are not increasing at a similar pace.

On the other hand, as main memory sizes increase, a much larger fraction of the needed pages are likely to fit in memory, leading to fewer I/O requests.
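
To make the claim that I/O dominates concrete, the following sketch compares the time to read one page with the time to process every record on it. D and C are the values quoted above; the page capacity R = 100 is a hypothetical figure chosen only for illustration.

```python
# Values quoted on the slide.
D = 15e-3    # seconds to read or write one disk page
C = 100e-9   # seconds to process one record

# Hypothetical page capacity, chosen only for illustration.
R = 100

io_time_per_page = D          # one page read
cpu_time_per_page = R * C     # process every record on that page
ratio = io_time_per_page / cpu_time_per_page

print(f"I/O per page: {io_time_per_page * 1e3:.3f} ms")
print(f"CPU per page: {cpu_time_per_page * 1e3:.3f} ms")
print(f"I/O time is roughly {ratio:.0f} times the CPU time per page")
```

Under these assumptions the page read costs about 1500 times as much as processing its records, which is why the D terms dominate the formulas that follow.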

Heap Files

Scan: Retrieve each of the B pages, taking time D per page, and for each page process R records, taking time C per record.

The cost: BD + BRC = B(D + RC)

Search with equality selection: Suppose that exactly one record matches the desired equality selection, that is, the selection is specified on a candidate key.

For each retrieved data page, check all records on the page to see if any is the desired record. On average, half the file must be scanned.

The cost: 0.5B(D + RC)

Search with range selection: The entire file must be scanned because qualifying records could appear anywhere in the file.

The cost: B(D + RC)

Heap Files

Insert: Assume that records are always inserted at the end of the file.

Fetch the last page in the file, add the record, and write the page back.

The cost: 2D + C

Delete: Find the record, remove the record from the page, and write the modified page.

Assume that no attempt is made to compact the file to reclaim the free space created by deletions.

The cost: (the cost of searching) + C + D

The cost of searching is D if the record is identified by its rid; otherwise it is the cost of a search with an equality or range condition.

The cost of deletion is also affected by the number of qualifying records, since all pages containing such records must be modified.
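
The heap file formulas can be written down as small functions, which makes it easy to plug in numbers. This is only a sketch of the cost model: the function names are made up for this example, and B = 1000 and R = 100 are hypothetical values (D and C are the typical values quoted earlier).

```python
def heap_scan(B, R, D, C):
    """Read all B pages and process the R records on each one."""
    return B * (D + R * C)

def heap_equality_search(B, R, D, C):
    """One match on a candidate key: on average half the file is read."""
    return 0.5 * B * (D + R * C)

def heap_range_search(B, R, D, C):
    """Matches can be anywhere, so the whole file is scanned."""
    return B * (D + R * C)

def heap_insert(D, C):
    """Read the last page, add the record, write the page back."""
    return 2 * D + C

def heap_delete(search_cost, D, C):
    """Locate the record, remove it, write the modified page."""
    return search_cost + C + D

# Hypothetical file size and page capacity; D and C as quoted on the slides.
B, R, D, C = 1000, 100, 15e-3, 100e-9

print(f"scan:            {heap_scan(B, R, D, C):.4f} s")
print(f"equality search: {heap_equality_search(B, R, D, C):.4f} s")
print(f"insert:          {heap_insert(D, C):.7f} s")
print(f"delete by rid:   {heap_delete(D, D, C):.7f} s")
```

Deleting by rid passes D as the search cost, so it costs the same as an insert; deleting by an equality condition would pass the equality search cost instead.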

Sorted Files (1)

Scan: The cost: B(D + RC)

Search with equality selection: Assume that the equality selection is specified on the field by which the file is sorted; if not, the cost is identical to that for a heap file.

Locate the first page containing the desired records, should any qualifying records exist, with a binary search in log2 B steps.

Once the page is known, the first qualifying record can again be located by a binary search of the page, at a cost of C log2 R.

The cost: D log2 B + C log2 R

Sorted Files (2)

Search with range selection: Assume that the range selection is on the sort field. The first record that satisfies the selection is located as it is for search with equality.

The cost is the cost of search plus the cost of retrieving the set of records that satisfy the search.

The cost of the search includes the cost of fetching the first page containing qualifying, or matching, records.

The cost: D log2 B + (the cost of retrieving the matching records, which grows with #matches)

Sorted Files (3)

Insert: Find the correct position in the file, add the record, and then fetch and rewrite all subsequent pages.

Assume that the inserted record belongs in the middle of the file.

The cost: the cost of searching to find the position of the new record plus 2 * (0.5B(D + RC)), that is, (search cost) + B(D + RC)

Delete: Search for the record, remove the record from the page, and write the modified page back.

Read and write all subsequent pages, because all records that follow the deleted record must be moved up to compact the free space.

The cost: (search cost) + B(D + RC)
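
The sorted file formulas for equality search, insert, and delete can be sketched the same way. The function names are invented for this example, the search cost used for insert and delete is assumed to be an equality search on the sort field, and B = 1024 and R = 128 are hypothetical values picked so that the logarithms come out as whole numbers.

```python
import math

def sorted_equality_search(B, R, D, C):
    """Binary search over the pages, then binary search within the page."""
    return D * math.log2(B) + C * math.log2(R)

def sorted_insert(B, R, D, C):
    """Find the position, then read and rewrite the second half of the file."""
    search_cost = sorted_equality_search(B, R, D, C)
    return search_cost + 2 * (0.5 * B * (D + R * C))

def sorted_delete(B, R, D, C):
    """Find the record, then read and rewrite the pages that follow it."""
    search_cost = sorted_equality_search(B, R, D, C)
    return search_cost + B * (D + R * C)

# Hypothetical file size and page capacity; D and C as quoted on the slides.
B, R, D, C = 1024, 128, 15e-3, 100e-9

print(f"equality search: {sorted_equality_search(B, R, D, C):.4f} s")
print(f"insert:          {sorted_insert(B, R, D, C):.4f} s")
print(f"delete:          {sorted_delete(B, R, D, C):.4f} s")
```

The search needs only about ten page reads, but an insert or delete touches on the order of B pages, which is the trade-off that the slides attribute to sorted files.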

Hashed Files

A hashed file enables us to locate records with a given search key value quickly.

For example, if the file is hashed on the name field: Find the Students record for Joe.

The pages in a hashed file are grouped into buckets.

The bucket to which a record belongs can be determined by applying a special function, called a hash function, to the search field(s).

Hashed Files

On inserts, a record is inserted into the appropriate bucket, with additional 'overflow' pages allocated if the primary page for the bucket becomes full.

The overflow pages for each bucket are maintained in a linked list.

To search for a record with a given search key value, we simply apply the hash function to identify the bucket to which such records belong and look at all pages in that bucket.

For analysis, assume that there are no overflow pages.
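
The insert and search steps can be sketched with a toy in-memory structure. Everything here is an illustration rather than a real storage layer: the class name, the tiny page capacity, and the character-sum hash function are assumptions, and each bucket's chain of pages is kept in a Python list where a real system would link the pages on disk.

```python
class HashedFile:
    """Toy hashed file: each bucket is a primary page plus overflow pages."""

    def __init__(self, num_buckets=4, page_capacity=2):
        self.num_buckets = num_buckets
        self.page_capacity = page_capacity
        # Every bucket starts with one empty primary page.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def _bucket_of(self, key):
        # Hypothetical hash function: sum of character codes modulo bucket count.
        return sum(ord(ch) for ch in key) % self.num_buckets

    def insert(self, key, record):
        pages = self.buckets[self._bucket_of(key)]
        if len(pages[-1]) == self.page_capacity:
            pages.append([])          # allocate an overflow page
        pages[-1].append((key, record))

    def search(self, key):
        # Apply the hash function, then look at all pages in that bucket.
        pages = self.buckets[self._bucket_of(key)]
        return [rec for page in pages for k, rec in page if k == key]


students = HashedFile(num_buckets=4, page_capacity=2)
for name, gpa in [("Joe", 3.2), ("Ann", 3.8), ("Bob", 2.9)]:
    students.insert(name, {"name": name, "gpa": gpa})

print(students.search("Joe"))
```

A search never touches any bucket other than the one the hash function selects, which is where the speed of equality lookups comes from.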

Hashed Files

Scan: Pages are kept at about 80 percent occupancy.

This is achieved by adding a new page to a bucket when each existing page is 80 percent full; there are no overflow pages.

The number of pages, and therefore the cost of scanning all the data pages, is about 1.25 times the cost of scanning an unordered file.

The cost: 1.25B(D + RC)

Search with equality selection: The cost of identifying the page that contains qualifying records is H.

Assume that the bucket consists of just one page (no overflow pages); retrieving it costs D.

The cost: H + D + 0.5RC, assuming the record is found after scanning half the records on the page.

Hashed Files

Search with range selection: The entire file must be scanned.

The cost: 1.25B(D + RC)

Insert: The appropriate page must be located, modified, and then written back.

The cost: (the cost of search) + C + D

Delete: Search for the record, remove it from the page, and write the modified page back.

The cost: (the cost of search) + C + D
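
As with the other organizations, the hashed file formulas translate directly into functions. This is a sketch under the slides' assumptions (80 percent occupancy, no overflow pages); the function names are made up, the search cost for insert and delete is assumed to be an equality search, and B = 1000 and R = 100 are hypothetical values.

```python
def hashed_scan(B, R, D, C):
    """Pages are 80 percent full, so there are about 1.25 * B pages to read."""
    return 1.25 * B * (D + R * C)

def hashed_equality_search(R, D, C, H):
    """Hash the key, read the one-page bucket, scan half of its records."""
    return H + D + 0.5 * R * C

def hashed_range_search(B, R, D, C):
    """Hashing does not help with ranges: the whole file is scanned."""
    return 1.25 * B * (D + R * C)

def hashed_insert(R, D, C, H):
    """Locate the page, modify it, write it back."""
    return hashed_equality_search(R, D, C, H) + C + D

def hashed_delete(R, D, C, H):
    """Locate the record, remove it, write the page back."""
    return hashed_equality_search(R, D, C, H) + C + D

# Hypothetical file size and page capacity; D, C, and H as quoted on the slides.
B, R, D, C, H = 1000, 100, 15e-3, 100e-9, 100e-9

print(f"scan:            {hashed_scan(B, R, D, C):.4f} s")
print(f"equality search: {hashed_equality_search(R, D, C, H):.6f} s")
print(f"insert:          {hashed_insert(R, D, C, H):.6f} s")
```

Equality search, insert, and delete do not depend on B at all, while scans and range searches are 25 percent more expensive than on a heap file.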

Choosing a File Organization

INDEXES

An index is a data structure that organizes records via trees or hashing.

An index on a file is an auxiliary structure designed to speed up operations that are not efficiently supported by the basic organization of records in that file.

An index contains a collection of data entries, k*, with an efficient way to locate all data entries with search key value k.

A data entry, k*, contains enough information to retrieve (one or more) data records with search key value k.

File hashed on age, with an index on sal that contains (sal, rid) pairs as data entries

Hash-Based Indexes

A simple hashed file organization enables us to locate records with a given search key value quickly.

Example: Find the Students record for Joe.

The pages in a hashed file are grouped into buckets.

Bucket = primary page plus zero or more overflow pages.

Buckets contain data entries.

Given a bucket number, the hashed file structure allows us to find the primary page for that bucket.

The bucket to which a record belongs can be determined by applying a special function, called a hash function, to the search field(s).

Alternatives for Data Entries k* in an Index

1. A data entry k* is an actual data record (with search key value k).

2. A data entry is a (k, rid) pair, where rid is the record id of a data record with search key value k.

3. A data entry is a (k, rid-list) pair, where rid-list is a list of record ids of data records with search key value k.
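
The three alternatives can be illustrated on a tiny example with age as the search key. The records, and the representation of a rid as a (page number, slot number) pair, are hypothetical choices made for this sketch.

```python
from collections import defaultdict

# Hypothetical data records, keyed by rid = (page number, slot number).
records = {
    (0, 0): {"name": "Ann", "age": 22},
    (0, 1): {"name": "Bob", "age": 25},
    (1, 0): {"name": "Cal", "age": 22},
}

# Alternative 1: the data entries are the data records themselves.
alt1 = sorted(records.values(), key=lambda rec: rec["age"])

# Alternative 2: one (k, rid) pair per data record.
alt2 = sorted((rec["age"], rid) for rid, rec in records.items())

# Alternative 3: one (k, rid-list) pair per distinct search key value.
rid_lists = defaultdict(list)
for k, rid in alt2:
    rid_lists[k].append(rid)
alt3 = sorted(rid_lists.items())

print(alt2)
print(alt3)
```

With Alternative 2 the key 22 is stored twice, once per matching record; with Alternative 3 it is stored once, followed by a list of two rids whose length varies from key to key.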

Alternatives for Data Entries (Contd.)

Alternative 1: There is no need to store the data records separately.

If this is used, the index structure is a file organization for data records (instead of a heap file or sorted file).

At most one index on a given collection of data records can use Alternative 1. (Otherwise, data records are duplicated, leading to redundant storage and potential inconsistency.)

If data records are very large, the number of pages containing data entries is high. This implies that the size of the auxiliary information in the index is also large.

Alternatives for Data Entries (Contd.)

Alternatives 2 and 3: Data entries are typically much smaller than data records. So, these are better than Alternative 1 with large data records, especially if search keys are small. (The portion of the index structure used to direct the search, which depends on the size of the data entries, is much smaller than with Alternative 1.)

Alternative 3 is more compact than Alternative 2 and offers better space utilization, but its data entries are variable in length, depending on the number of data records with a given search key value.

PROPERTIES OF INDEXES: Clustered

Clustered: A file is organized so that the ordering of data records is the same as, or close to, the ordering of data entries in some index.

An index that uses Alternative (1) is clustered. Alternative 1 implies clustered; in practice, clustered also implies Alternative 1.

A file can be clustered on at most one search key, so there is at most one clustered index on a data file.

The cost of retrieving data records through an index varies greatly based on whether the index is clustered or not!


Clustered

An index that uses Alternative (2) or Alternative (3) can be a clustered index only if the data records are sorted on the search key field. Otherwise, the order of the data records is random.

Indexes that maintain data entries in sorted order by search key use a collection of index entries, organized into a tree structure, to guide searches for data entries, which are stored at the leaf level of the tree in sorted order.

Suppose that Alternative (2) is used for data entries, and that the data records are stored in a heap file.

To build a clustered index, first sort the heap file (with some free space on each page for future inserts).

Overflow pages may be needed for inserts. (Thus, the order of data records is 'close to', but not identical to, the sort order.)

Clustered Tree Index Using Alternative (2)


Unclustered

Unclustered index: An index that is not clustered.

We can have several unclustered indexes on a data file.

If the index is clustered, the rids in qualifying data entries point to a contiguous collection of records, so we need to retrieve only a few data pages.

If the index is unclustered, each qualifying data entry could contain a rid that points to a distinct data page, leading to as many data page I/Os as the number of data entries that match the range selection!
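
A small simulation shows the difference for a range selection. It is a sketch under simplifying assumptions: 200 records with integer keys, a hypothetical page capacity of 10, a randomly shuffled file standing in for the unclustered case, and a one-page buffer, so a page is fetched again whenever the next rid points somewhere other than the page just read.

```python
import random

R = 10                      # hypothetical number of records per page
keys = list(range(200))     # search key values, one record each

def page_of_each_key(layout):
    """Map each key to the page it lands on when stored in the given order."""
    return {k: pos // R for pos, k in enumerate(layout)}

# Clustered: data records are stored in search key order.
clustered = page_of_each_key(keys)

# Unclustered: data records are stored in an unrelated (shuffled) order.
shuffled = keys[:]
random.Random(0).shuffle(shuffled)
unclustered = page_of_each_key(shuffled)

def page_fetches(page_of, lo, hi):
    """Follow the rids in key order, fetching a page whenever it changes."""
    fetches, last_page = 0, None
    for k in range(lo, hi + 1):
        if page_of[k] != last_page:
            fetches += 1
            last_page = page_of[k]
    return fetches

print("clustered:  ", page_fetches(clustered, 50, 99))
print("unclustered:", page_fetches(unclustered, 50, 99))
```

For the 50 matching entries, the clustered layout reads exactly the 5 pages that hold them, while the unclustered layout can need up to one fetch per matching entry.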

Unclustered Tree Index Using Alternative (2)

Dense versus Sparse Indexes

An index is said to be dense if it contains (at least) one data entry for every search key value that appears in a record in the indexed file.

A sparse index contains one entry for each page of records in the data file.

Alternative (1) for data entries always leads to a dense index.

Alternative (2) can be used to build either dense or sparse indexes.

Alternative (3) is typically only used to build a dense index.
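
The sketch below builds both kinds of index over a small data file and looks a record up through the sparse one. The names and the three-records-per-page layout are hypothetical, and the data file is assumed to be sorted on the search key, which a sparse index relies on.

```python
from bisect import bisect_right

# Hypothetical data file: three pages of names, sorted on the search key.
pages = [
    ["Ashby", "Basu", "Bristow"],
    ["Cass", "Daniels", "Jones"],
    ["Smith", "Tracy", "Yates"],
]

# Dense index: one (k, rid) entry per record, rid = (page number, slot number).
dense = [(name, (p, s)) for p, page in enumerate(pages) for s, name in enumerate(page)]

# Sparse index: one entry per page, holding the first key on that page.
sparse = [page[0] for page in pages]

def sparse_lookup(name):
    """Find the last page whose first key is <= name, then search that page."""
    p = bisect_right(sparse, name) - 1
    if p < 0 or name not in pages[p]:
        return None
    return (p, pages[p].index(name))

print(len(dense), "dense entries,", len(sparse), "sparse entries")
print(sparse_lookup("Daniels"))
```

The sparse index is much smaller, but it only narrows the search down to a page; the page itself still has to be searched for the record.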

A data file of records with three fields (name, age, and sal), with two simple indexes on it, both of which use Alternative (2) for data entry format.

Primary and Secondary Indexes

Primary index: An index on a set of fields that includes the primary key.

Secondary index: An index that is not a primary index.

The terms are sometimes used with a different meaning: an index that uses Alternative (1) is called a primary index, and one that uses Alternatives (2) or (3) is called a secondary index.

Two data entries are said to be duplicates if they have the same value for the search key field associated with the index.

A primary index is guaranteed not to contain duplicates, but an index on other (collections of) fields can contain duplicates.

Unique index: No duplicates exist and the search key contains some candidate key.

Indexes Using Composite Search Keys

Composite search keys, or concatenated keys: The search key for an index contains several fields.

If the search key is composite, an equality query is one in which each field in the search key is bound to a constant.

A range query is one in which not all fields in the search key are bound to constants.
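
The two kinds of query can be sketched over a sorted list of data entries with the composite key (age, sal). The entries and rid labels are hypothetical, ages are assumed to be integers, and a sorted Python list with bisect stands in for a tree-structured index.

```python
from bisect import bisect_left, bisect_right

# Hypothetical data entries ((age, sal), rid), sorted by the composite key.
entries = sorted([
    ((11, 80), "r1"),
    ((12, 10), "r2"),
    ((12, 20), "r3"),
    ((13, 75), "r4"),
])
keys = [k for k, _ in entries]

def equality_query(age, sal):
    """Every field of the search key is bound to a constant."""
    lo = bisect_left(keys, (age, sal))
    hi = bisect_right(keys, (age, sal))
    return [rid for _, rid in entries[lo:hi]]

def range_query_on_age(age):
    """Only age is bound; sal is left free, so a run of entries qualifies."""
    lo = bisect_left(keys, (age,))        # (age,) sorts before every (age, sal)
    hi = bisect_left(keys, (age + 1,))    # first key with a larger age
    return [rid for _, rid in entries[lo:hi]]

print(equality_query(12, 20))
print(range_query_on_age(12))
```

Because the entries are ordered by age first and sal second, all entries that share an age are adjacent, so binding only the leading field still maps to one contiguous run of the index.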

INDEX SPECIFICATION IN SQL

INDEX SPECIFICATION IN SQL-92

This statement specifies that a B+ tree index is to be created on the Students table using the concatenation of the age and gpa columns as the key.

Key values are pairs of the form (age, gpa), and there is a distinct entry for each such pair.

Once the index is created, it is automatically maintained by the DBMS, which adds or removes data entries in response to inserts and deletes of records on the Students relation.
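
As a runnable illustration of a composite index on (age, gpa), the sketch below uses Python's built-in sqlite3 module. This is SQLite's CREATE INDEX syntax, not the statement from the slide; SQLite builds a B-tree-based index without a clause for choosing the structure, and the index name IndAgeGpa and the extra Students columns are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid INTEGER PRIMARY KEY, name TEXT, age INTEGER, gpa REAL)"
)

# Hypothetical index name; the key is the concatenation of age and gpa.
conn.execute("CREATE INDEX IndAgeGpa ON Students (age, gpa)")

# Inserts after index creation: the DBMS maintains the index automatically.
conn.executemany(
    "INSERT INTO Students (name, age, gpa) VALUES (?, ?, ?)",
    [("Joe", 19, 3.2), ("Ann", 20, 3.8)],
)

# Columns that make up the index key, in key order.
index_columns = [row[2] for row in conn.execute("PRAGMA index_info(IndAgeGpa)")]

# An equality query that binds both fields of the composite key.
matches = conn.execute(
    "SELECT name FROM Students WHERE age = ? AND gpa = ?", (19, 3.2)
).fetchall()

print(index_columns)
print(matches)
```

No code touches the index after it is created; the rows inserted afterwards are found through the same table, with the index kept in step by the database engine.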
