4 File & Index

  • 8/12/2019 4 File & Index

    File Organizations and Indexes

    Definition

    A file of records is a collection of records that may reside on several pages.

    Each record has a unique identifier called a record id, or rid.

    Indexes are data structures intended to help find the record ids of records that meet a selection condition.

    Every index has an associated search key, which is a collection of one or more fields of the file of records for which we are building the index.

    Any subset of the fields can be a search key.

    We sometimes refer to the file of records as the indexed file.

    File Organization

    A file organization is a method of arranging the records in a file on external storage.

    For example, which arrangement suits each of these requests?
    - Retrieve employee records in alphabetical order.
    - Retrieve all employees whose salary is in a given range.
    - Retrieve all employees who are 55 years old (with the employees' names).

    Three Basic File Organizations

    1. Heap (random order) files: suitable when typical access is a file scan retrieving all records.
    2. Sorted files: best if records must be retrieved in some order, or only a range of records is needed.
    3. Indexes: data structures to organize records via trees or hashing.
       Like sorted files, they speed up searches for a subset of records, based on values in certain (search key) fields.
       Updates are much faster than in sorted files.

    Each is ideal for some situations and not so good in others.

    FILE ORGANIZATIONS

    For sorted and hashed files, the sequence of fields on which the file is sorted or hashed is called the search key.

    The search key for an index can be any sequence of one or more fields; it need not uniquely identify records.

    COST MODEL

    A cost model estimates the cost (in terms of execution time) of different database operations.

    Ignore CPU costs, for simplicity.

    Notation:
    B: the number of data pages
    R: the number of records per page
    D: the average time to read or write a disk page
    C: the average time to process a record
    H: the time required to apply the hash function to a record

    A hash function maps a record into a range of numbers.

    Typical values today are [Ramakrishnan, 2003]:
    D = 15 milliseconds
    C and H = 100 nanoseconds

    The cost of I/O therefore dominates. This is supported by current hardware trends, in which CPU speeds are steadily rising, whereas disk speeds are not increasing at a similar pace.

    On the other hand, as main memory sizes increase, a much larger fraction of the needed pages is likely to fit in memory, leading to fewer I/O requests.
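
    To make the claim about I/O dominance concrete, the following minimal sketch compares the time to fetch one page with the time to process every record on it. D, C, and H are the typical values quoted above; R = 100 records per page is an illustrative assumption, not a figure from the slides.

```python
# Cost-model parameters. D, C and H are the typical values quoted in the slides;
# R is an assumed, illustrative value.
D = 15e-3    # seconds to read or write one disk page
C = 100e-9   # seconds to process one record
H = 100e-9   # seconds to apply the hash function to one record
R = 100      # records per page (assumption, not from the source)

io_time_per_page = D        # one page fetch
cpu_time_per_page = R * C   # processing every record on that page

ratio = io_time_per_page / cpu_time_per_page
print(f"I/O per page: {io_time_per_page:.6f} s")
print(f"CPU per page: {cpu_time_per_page:.6f} s")
print(f"I/O costs roughly {ratio:.0f} times as much as CPU per page")
```

    Under these assumptions the page fetch outweighs the record processing by about three orders of magnitude, which is why the formulas that follow are driven almost entirely by the D terms.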

    Heap Files

    Scan:
    Retrieve each of the B pages, taking time D per page, and for each page process R records, taking time C per record.
    The cost: BD + BRC = B(D + RC)

    Search with equality selection:
    Suppose that exactly one record matches the desired equality selection, that is, the selection is specified on a candidate key.
    For each retrieved data page, check all records on the page to see if any is the desired record.
    On average, half the file is scanned before the record is found.
    The cost: 0.5B(D + RC)

    Search with range selection:
    The entire file must be scanned because qualifying records could appear anywhere in the file.
    The cost: B(D + RC)

    Heap Files (Contd.)

    Insert:
    Assume that records are always inserted at the end of the file.
    Fetch the last page in the file, add the record, and write the page back.
    The cost: 2D + C

    Delete:
    Find the record, remove the record from the page, and write the modified page back.
    Assume that no attempt is made to compact the file to reclaim the free space created by deletions.
    The cost: (the cost of searching) + C + D
    The cost of searching is D if the record is identified by its rid; otherwise it is the cost of a search with an equality or range condition.
    The cost of deletion is also affected by the number of qualifying records, since all pages containing such records must be modified.
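
    The factor 0.5 in the heap-file equality search can be checked by averaging over every page on which the single matching record might sit. The sketch below does that with hypothetical helper names; it models a front-to-back scan that stops at the matching page and counts page fetches only.

```python
def pages_read_until_found(match_page: int) -> int:
    """Pages fetched by a front-to-back scan that stops at match_page (1-based)."""
    return match_page


def average_pages_read(num_pages: int) -> float:
    """Average over every possible location of the single matching record."""
    total = sum(pages_read_until_found(p) for p in range(1, num_pages + 1))
    return total / num_pages


def heap_equality_cost(B: int, R: int, D: float, C: float) -> float:
    """Cost formula from the slides: 0.5B(D + RC)."""
    return 0.5 * B * (D + R * C)


B = 1000  # illustrative file size in pages (assumption)
avg = average_pages_read(B)
print(f"average pages read: {avg} of {B}")
print(f"estimated cost: {heap_equality_cost(B, R=100, D=15e-3, C=100e-9):.3f} s")
```

    The average is (B + 1) / 2 pages, which the cost model rounds to 0.5B.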

    Sorted Files (1)

    Scan:
    The cost: B(D + RC)

    Search with equality selection:
    Assume that the equality selection is specified on the field by which the file is sorted; if not, the cost is identical to that for a heap file.
    Locate the first page containing the desired records, should any qualifying records exist, with a binary search in log2 B steps.
    Once the page is known, the first qualifying record can again be located by a binary search of the page, at a cost of C log2 R.
    The cost: D log2 B + C log2 R

    Sorted Files (2)

    Search with range selection:
    Assume that the range selection is on the sort field. The first record that satisfies the selection is located as it is for search with equality.
    The cost is the cost of the search plus the cost of retrieving the set of records that satisfy the search.
    The cost of the search includes the cost of fetching the first page containing qualifying, or matching, records.
    The cost: D log2 B + (cost of retrieving the #matches qualifying records)

    Sorted Files (3)

    Insert:
    Find the correct position in the file, add the record, and then fetch and rewrite all subsequent pages.
    Assume that the inserted record belongs in the middle of the file, so half the pages are read and written back.
    The cost: (search cost) + 2 * (0.5B(D + RC)) = (search cost) + B(D + RC)

    Delete:
    Search for the record, remove the record from the page, and write the modified page back.
    Read and write all subsequent pages, because all records that follow the deleted record must be moved up to compact the free space.
    The cost: (search cost) + B(D + RC)
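
    The D log2 B term for sorted files comes from a binary search over pages. The following sketch simulates that search on an in-memory list of pages and counts simulated page fetches; the file layout and the function names are assumptions made for illustration.

```python
import math


def make_sorted_file(num_pages, records_per_page):
    """Build pages of consecutive even integers (odd keys are guaranteed misses)."""
    keys = [2 * i for i in range(num_pages * records_per_page)]
    return [keys[i:i + records_per_page]
            for i in range(0, len(keys), records_per_page)]


def find_page(pages, key):
    """Binary search over pages; returns (page index or None, pages read)."""
    lo, hi = 0, len(pages) - 1
    pages_read = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        page = pages[mid]          # simulated page fetch, cost D
        pages_read += 1
        if key < page[0]:
            hi = mid - 1
        elif key > page[-1]:
            lo = mid + 1
        else:
            return mid, pages_read  # key falls within this page's key range
    return None, pages_read


pages = make_sorted_file(num_pages=1024, records_per_page=100)
worst = max(find_page(pages, page[0])[1] for page in pages)
bound = math.ceil(math.log2(len(pages))) + 1
print(f"worst-case page reads: {worst}, bound: {bound}")
```

    The search inside the located page (the C log2 R term) could be done the same way with the standard bisect module; it is left out here because its cost is negligible next to the page fetches.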

    Hashed Files

    A hashed file enables us to locate records with a given search key value quickly.

    For example, if the file is hashed on the name field: find the Students record for Joe.

    The pages in a hashed file are grouped into buckets.

    The bucket to which a record belongs can be determined by applying a special function, called a hash function, to the search field(s).

    Hashed Files (Contd.)

    On inserts, a record is inserted into the appropriate bucket, with additional "overflow" pages allocated if the primary page for the bucket becomes full.

    The overflow pages for each bucket are maintained in a linked list.

    To search for a record with a given search key value, we simply apply the hash function to identify the bucket to which such records belong and look at all pages in that bucket.

    For analysis, assume that there are no overflow pages.
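
    The bucket and overflow-chain behaviour described above can be sketched as follows. This is a toy in-memory model, not a disk-based implementation: the class names, the tiny page capacity, and the byte-sum hash function are all assumptions chosen to keep the example deterministic.

```python
from dataclasses import dataclass, field
from typing import Optional

PAGE_CAPACITY = 2  # records per page; deliberately tiny (assumption)


@dataclass
class Page:
    records: list = field(default_factory=list)
    overflow: Optional["Page"] = None  # next page in this bucket's linked list


def toy_hash(key: str, num_buckets: int) -> int:
    """Hypothetical hash function: byte sum modulo the number of buckets."""
    return sum(key.encode("utf-8")) % num_buckets


class HashedFile:
    def __init__(self, num_buckets: int):
        # One primary page per bucket.
        self.buckets = [Page() for _ in range(num_buckets)]

    def _primary_page(self, key: str) -> Page:
        return self.buckets[toy_hash(key, len(self.buckets))]

    def insert(self, key: str, record) -> None:
        page = self._primary_page(key)
        # Walk the chain, allocating an overflow page when every page is full.
        while len(page.records) >= PAGE_CAPACITY:
            if page.overflow is None:
                page.overflow = Page()
            page = page.overflow
        page.records.append((key, record))

    def search(self, key: str):
        """Look at all pages in the bucket; return (matches, pages read)."""
        page, pages_read, matches = self._primary_page(key), 0, []
        while page is not None:
            pages_read += 1
            matches.extend(rec for k, rec in page.records if k == key)
            page = page.overflow
        return matches, pages_read


students = HashedFile(num_buckets=4)
students.insert("Joe", {"name": "Joe", "age": 19})
print(students.search("Joe"))
```

    With no overflow pages a search reads exactly one page, which is the assumption the cost analysis relies on.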

    Hashed Files (Contd.)

    Scan:
    Pages are kept at about 80 percent occupancy.
    This is achieved by adding a new page to a bucket when each existing page is 80 percent full, so no overflow pages are needed.
    The number of pages, and hence the cost of scanning all the data pages, is about 1.25 times the cost of scanning an unordered file.
    The cost: 1.25B(D + RC)

    Search with equality selection:
    The cost of identifying the page that contains qualifying records is H.
    Assume that the bucket consists of just one page (no overflow pages); retrieving it costs D.
    On average, the record is found after scanning half the records on the page.
    The cost: H + D + 0.5RC

    Hashed Files (Contd.)

    Search with range selection:
    The entire file must be scanned.
    The cost: 1.25B(D + RC)

    Insert:
    The appropriate page must be located, modified, and then written back.
    The cost: (the cost of search) + (C + D)

    Delete:
    Search for the record, remove it from the page, and write the modified page back.
    The cost: (the cost of search) + (C + D)

    Choosing a File Organization
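
    One way to compare the three organizations is to evaluate the cost formulas from the preceding slides side by side. The sketch below does so for an illustrative file; B = 1000 and R = 100 are assumptions, delete on a heap file assumes the record is identified by its rid, and the sorted-file range cost omits the cost of retrieving the matches.

```python
import math


def costs(B, R, D, C, H):
    """Cost formulas from the preceding slides, in seconds."""
    scan = B * (D + R * C)
    sorted_search = D * math.log2(B) + C * math.log2(R)
    hashed_search = H + D + 0.5 * R * C
    return {
        "heap": {
            "scan": scan,
            "equality": 0.5 * scan,
            "range": scan,
            "insert": 2 * D + C,
            "delete": D + C + D,  # search by rid (D) + C + D
        },
        "sorted": {
            "scan": scan,
            "equality": sorted_search,
            "range": sorted_search,  # plus the cost of retrieving the matches
            "insert": sorted_search + scan,
            "delete": sorted_search + scan,
        },
        "hashed": {
            "scan": 1.25 * scan,
            "equality": hashed_search,
            "range": 1.25 * scan,
            "insert": hashed_search + C + D,
            "delete": hashed_search + C + D,
        },
    }


table = costs(B=1000, R=100, D=15e-3, C=100e-9, H=100e-9)
operations = ["scan", "equality", "range", "insert", "delete"]
print(f"{'':8}" + "".join(f"{op:>12}" for op in operations))
for organization, row in table.items():
    print(f"{organization:8}" + "".join(f"{row[op]:12.4f}" for op in operations))
```

    Under these assumptions no organization wins every column: hashing is cheapest for equality search, sorting for range search, and the heap file for scans and inserts.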

    INDEXES

    An index is a data structure that organizes records via trees or hashing.

    An index on a file is an auxiliary structure designed to speed up operations that are not efficiently supported by the basic organization of records in that file.

    An index contains a collection of data entries, k*, with an efficient way to locate all data entries with search key value k.

    A data entry, k*, contains enough information to retrieve (one or more) data records with search key value k.

    File hashed on age, with an index on sal that contains (sal, rid) pairs as data entries.

    Hash-Based Indexes

    A simple hashed file organization enables us to locate records with a given search key value quickly.

    Example: find the Students record for Joe.

    The pages in a hashed file are grouped into buckets. A bucket consists of a primary page plus zero or more overflow pages. Buckets contain data entries.

    Given a bucket number, the hashed file structure allows us to find the primary page for that bucket.

    The bucket to which a record belongs can be determined by applying a special function, called a hash function, to the search key.

    Alternatives for Data Entries k* in an Index

    1. A data entry k* is an actual data record (with search key value k).
    2. A data entry is a (k, rid) pair, where rid is the record id of a data record with search key value k.
    3. A data entry is a (k, rid-list) pair, where rid-list is a list of record ids of data records with search key value k.
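
    The three alternatives differ only in what each data entry stores. The sketch below builds all three over the same small set of records; the records, the (page, slot) form of a rid, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical data records, keyed by rid = (page number, slot number).
records = {
    (0, 0): {"name": "Ann", "age": 25, "sal": 3000},
    (0, 1): {"name": "Bob", "age": 33, "sal": 4000},
    (1, 0): {"name": "Cal", "age": 25, "sal": 5000},
}


def alternative_1(records, key):
    """Data entries are the data records themselves, ordered by search key."""
    return sorted(records.values(), key=lambda rec: rec[key])


def alternative_2(records, key):
    """One (k, rid) pair per data record."""
    return sorted((rec[key], rid) for rid, rec in records.items())


def alternative_3(records, key):
    """One (k, rid-list) pair per distinct search key value."""
    rid_lists = defaultdict(list)
    for rid, rec in sorted(records.items()):
        rid_lists[rec[key]].append(rid)
    return sorted(rid_lists.items())


print(alternative_1(records, "age"))
print(alternative_2(records, "age"))
print(alternative_3(records, "age"))
```

    Note that the Alternative 3 entry for age 25 grows with the number of matching records, which is the variable-length property discussed next.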

    Alternatives for Data Entries (Contd.)

    Alternative 1:
    There is no need to store the data records separately.
    If this is used, the index structure is a file organization for data records (instead of a heap file or sorted file).
    At most one index on a given collection of data records can use Alternative 1. (Otherwise, data records are duplicated, leading to redundant storage and potential inconsistency.)
    If data records are very large, the number of pages containing data entries is high. This implies that the size of the auxiliary information in the index is also large.

    Alternatives for Data Entries (Contd.)

    Alternatives 2 and 3:
    Data entries are typically much smaller than data records. So, these are better than Alternative 1 with large data records, especially if search keys are small. (The portion of the index structure used to direct the search, which depends on the size of data entries, is much smaller than with Alternative 1.)
    Alternative 3 is more compact than Alternative 2 and offers better space utilization, but data entries are variable in length, depending on the number of data records with a given search key value.

    PROPERTIES OF INDEXES: Clustered

    Clustered: a file is organized so that the ordering of data records is the same as, or close to, the ordering of data entries in some index.

    An index that uses Alternative (1) is clustered. Alternative (1) implies clustered; in practice, clustered also implies Alternative (1).

    A file can be clustered on at most one search key, so there is at most one clustered index on a data file.

    The cost of retrieving data records through an index varies greatly based on whether the index is clustered or not!

    PROPERTIES OF INDEXES: Clustered (Contd.)

    An index that uses Alternative (2) or Alternative (3) can be a clustered index only if the data records are sorted on the search key field. Otherwise, the order of the data records is random.

    Indexes that maintain data entries in sorted order by search key use a collection of index entries, organized into a tree structure, to guide searches for data entries, which are stored at the leaf level of the tree in sorted order.

    Suppose that Alternative (2) is used for data entries, and that the data records are stored in a heap file.
    To build a clustered index, first sort the heap file (with some free space on each page for future inserts).
    Overflow pages may be needed for inserts. (Thus, the order of data records is "close to", but not identical to, the sort order.)

    Clustered Tree Index Using Alternative (2)

    PROPERTIES OF INDEXES: Unclustered

    Unclustered index: an index that is not clustered.

    We can have several unclustered indexes on a data file.

    If the index is clustered, the rids in qualifying data entries point to a contiguous collection of records, so we need to retrieve only a few data pages.

    If the index is unclustered, each qualifying data entry could contain a rid that points to a distinct data page, leading to as many data page I/Os as the number of data entries that match the range selection!
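
    The difference in data page I/Os can be illustrated by counting the distinct pages that a range query touches. In this sketch the file size, the page capacity, and the random placement used for the unclustered case are assumptions; rids are (page, slot) pairs.

```python
import random

RECORDS_PER_PAGE = 10  # assumption


def build_file(keys):
    """Place records on pages in the given order; return {key: rid}."""
    return {k: (i // RECORDS_PER_PAGE, i % RECORDS_PER_PAGE)
            for i, k in enumerate(keys)}


def range_query_pages(file, lo, hi):
    """Number of distinct data pages holding records with lo <= key <= hi."""
    rids = [file[k] for k in range(lo, hi + 1)]
    return len({page for page, _slot in rids})


keys = list(range(1000))

# Clustered: data records are stored in search key order.
clustered = build_file(keys)

# Unclustered: data records are stored in an unrelated (here random) order.
shuffled = keys[:]
random.Random(0).shuffle(shuffled)
unclustered = build_file(shuffled)

clustered_pages = range_query_pages(clustered, 100, 149)
unclustered_pages = range_query_pages(unclustered, 100, 149)
print(f"clustered: {clustered_pages} pages, unclustered: {unclustered_pages} pages")
```

    The 50 matching records fit on 5 contiguous pages in the clustered file, while in the unclustered file they can be spread over as many as 50 pages.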

    Unclustered Tree Index Using Alternative (2)

    Dense versus Sparse Indexes

    An index is said to be dense if it contains (at least) one data entry for every search key value that appears in a record in the indexed file.

    A sparse index contains one entry for each page of records in the data file.

    Alternative (1) for data entries always leads to a dense index.

    Alternative (2) can be used to build either dense or sparse indexes.

    Alternative (3) is typically only used to build a dense index.
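
    The following sketch contrasts the two kinds of index over the same data file. It assumes the file is sorted on the search key, which a sparse index needs in order to work; the page contents and the function names are hypothetical.

```python
from bisect import bisect_right

# Data file sorted on the search key: each page is a list of key values.
pages = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]

# Dense index: one (key, rid) data entry per record, rid = (page, slot).
dense = [(k, (p, s)) for p, page in enumerate(pages) for s, k in enumerate(page)]

# Sparse index: one (first key on page, page number) entry per page.
sparse = [(page[0], p) for p, page in enumerate(pages)]


def dense_lookup(key):
    """Direct lookup of the rid through the dense index."""
    return dict(dense).get(key)


def sparse_lookup(key):
    """Find the last entry whose first key is <= key, then scan that page."""
    first_keys = [k for k, _ in sparse]
    i = bisect_right(first_keys, key) - 1
    if i < 0:
        return None  # key is smaller than every key in the file
    page_no = sparse[i][1]
    page = pages[page_no]
    return (page_no, page.index(key)) if key in page else None


print(len(dense), len(sparse))
print(dense_lookup(50), sparse_lookup(50))
```

    The sparse index is a third of the size here, at the price of scanning one data page per lookup.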

    A data file of records with three fields (name, age, and sal), with two simple indexes on it, both of which use Alternative (2) for data entry format.

    Primary and Secondary Indexes

    Primary index: an index on a set of fields that includes the primary key.

    Secondary index: an index that is not a primary index.

    The terms are sometimes used with a different meaning: an index that uses Alternative (1) is called a primary index, and one that uses Alternative (2) or (3) is called a secondary index.

    Two data entries are said to be duplicates if they have the same value for the search key field associated with the index.

    A primary index is guaranteed not to contain duplicates, but an index on other (collections of) fields can contain duplicates.

    Unique index: an index for which no duplicates exist, that is, the search key contains some candidate key.

    Indexes Using Composite Search Keys

    Composite search keys, or concatenated keys: search keys for an index that contain several fields.

    If the search key is composite, an equality query is one in which each field in the search key is bound to a constant.

    A range query is one in which not all fields in the search key are bound to constants.
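
    The two kinds of query can be sketched over a sorted list of data entries with a composite key. This is a minimal illustration using the standard bisect module, not a B+ tree; the (age, sal) entries and the rid strings are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Data entries for a composite index on (age, sal), kept in sorted order.
entries = sorted([
    ((11, 80), "rid1"),
    ((12, 10), "rid2"),
    ((12, 20), "rid3"),
    ((13, 75), "rid4"),
])
keys = [k for k, _ in entries]


def equality_query(age, sal):
    """Every field of the search key is bound to a constant."""
    lo = bisect_left(keys, (age, sal))
    hi = bisect_right(keys, (age, sal))
    return [rid for _, rid in entries[lo:hi]]


def range_query_on_age(age):
    """Only age is bound; sal may take any value."""
    lo = bisect_left(keys, (age, float("-inf")))
    hi = bisect_right(keys, (age, float("inf")))
    return [rid for _, rid in entries[lo:hi]]


print(equality_query(12, 20))
print(range_query_on_age(12))
```

    Because the entries are ordered by age first, all entries with the same age are adjacent, so binding only the leading field still selects one contiguous run.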

    INDEX SPECIFICATION IN SQL-92

    The statement specifies that a B+ tree index is to be created on the Students table using the concatenation of the age and gpa columns as the key.

    Key values are pairs of the form (age, gpa), and there is a distinct entry for each such pair.

    Once the index is created, it is automatically maintained by the DBMS, which adds and removes data entries in response to inserts and deletes of records on the Students relation.
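
    As a hedged illustration of a composite index on (age, gpa), the sketch below uses SQLite through Python's sqlite3 module. This is SQLite's CREATE INDEX syntax, not necessarily the syntax the slide refers to; the index name IndAgeGpa, the table schema, and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid INTEGER PRIMARY KEY, name TEXT, age INTEGER, gpa REAL)"
)

# Composite index whose key is the concatenation of the age and gpa columns.
conn.execute("CREATE INDEX IndAgeGpa ON Students (age, gpa)")

# Rows inserted after the index exists; the DBMS maintains the index itself.
conn.executemany(
    "INSERT INTO Students (name, age, gpa) VALUES (?, ?, ?)",
    [("Joe", 19, 3.2), ("Ann", 20, 3.8), ("Bob", 19, 3.5)],
)

# PRAGMA index_info returns one row (seqno, cid, name) per indexed column.
index_columns = [row[2] for row in conn.execute("PRAGMA index_info('IndAgeGpa')")]
print(index_columns)

# Equality query: both fields of the composite key are bound to constants.
rows = conn.execute(
    "SELECT name FROM Students WHERE age = ? AND gpa = ?", (19, 3.5)
).fetchall()
print(rows)
```

    No code touches the index after it is created; inserts and deletes on Students update it automatically.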