4 File & Index
-
Upload
abdul-azies-kurniawan -
Transcript of 4 File & Index
-
8/12/2019 4 File & Index
File Organizations and Indexes
-
Definitions
A file of records is a collection of records that may reside on several pages.
Each record has a unique identifier called a record id, or rid.
Indexes are data structures intended to help find the record ids of records that meet a selection condition.
Every index has an associated search key, which is a collection of one or more fields of the file of records for which we are building the index.
Any subset of the fields can be a search key.
We sometimes refer to the file of records as the indexed file.
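As a rough illustration of these definitions, the sketch below models a file as a list of pages and builds an index on one field. The representation of a rid as a (page number, slot number) pair, the sample records, and the helper names build_index and fetch are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical toy file: a list of pages, each page a list of records.
pages = [
    [{"name": "Ann", "age": 30}, {"name": "Bob", "age": 55}],
    [{"name": "Cid", "age": 55}, {"name": "Dee", "age": 41}],
]

def build_index(pages, field):
    """Map each search key value to the rids of the records that hold it."""
    index = defaultdict(list)
    for page_no, page in enumerate(pages):
        for slot_no, record in enumerate(page):
            # Assumed rid layout: (page number, slot number within the page).
            index[record[field]].append((page_no, slot_no))
    return index

def fetch(pages, rid):
    """Follow a rid back to the data record it identifies."""
    page_no, slot_no = rid
    return pages[page_no][slot_no]

age_index = build_index(pages, "age")      # search key: the age field
matches = [fetch(pages, rid) for rid in age_index[55]]
names = [record["name"] for record in matches]
print(names)
```

The index answers the selection condition age = 55 by returning rids, and only then are the matching records fetched from the indexed file.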
-
File Organization
A file organization is a method of arranging the records in a file on external storage.
For example:
  What if we want to retrieve employee records in alphabetical order?
  What if we want to retrieve all employees whose salary is in a given range?
  What if we want to retrieve all employees who are 55 years old (together with their names)?
-
Three Basic File Organizations
1. Heap (random order) files: suitable when typical access is a file scan retrieving all records.
2. Sorted files: best if records must be retrieved in some order, or only a range of records is needed.
3. Indexes: data structures to organize records via trees or hashing.
   Like sorted files, they speed up searches for a subset of records, based on values in certain (search key) fields.
   Updates are much faster than in sorted files.
Each is ideal for some situations, and not so good in others.
-
FILE ORGANIZATIONS
For sorted and hashed files, the sequence of fields on which the file is sorted or hashed is called the search key.
The search key for an index can be any sequence of one or more fields; it need not uniquely identify records.
-
COST MODEL
A cost model is used to estimate the cost (in terms of execution time) of different database operations.
For simplicity, CPU costs other than the per-record processing time C and the hashing time H are ignored.
Notation:
  B: the number of data pages
  R: the number of records per page
  D: the average time to read or write a disk page
  C: the average time to process a record
  H: the time required to apply the hash function to a record
A hash function maps a record into a range of numbers.
-
Typical values today are [Ramakrishnan, 2003]:
  D = 15 milliseconds
  C and H = 100 nanoseconds
The cost of I/O dominates. This is supported by current hardware trends, in which CPU speeds are steadily rising, whereas disk speeds are not increasing at a similar pace.
On the other hand, as main memory sizes increase, a much larger fraction of the needed pages are likely to fit in memory, leading to fewer I/O requests.
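A quick calculation shows how lopsided these values are. The sketch below uses the quoted D and C; the page capacity R = 100 is an assumed figure chosen only for illustration.

```python
D = 15e-3    # seconds to read or write one disk page (value quoted above)
C = 100e-9   # seconds to process one record (value quoted above)
R = 100      # records per page: an assumed value, not taken from the slides

io_to_cpu = D / C                 # records that could be processed during one page I/O
page_cpu = R * C                  # CPU time spent on the records of one page
io_share = D / (D + page_cpu)     # fraction of the per-page cost D + RC that is I/O

print(f"one page I/O costs as much as processing {io_to_cpu:,.0f} records")
print(f"I/O share of the per-page cost D + RC: {io_share:.4%}")
```

Under these assumptions, the RC term is a small correction to D, which is why the formulas that follow are usually read as counts of page I/Os.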
-
Heap Files
Scan: retrieve each of the B pages, taking time D per page, and for each page process R records, taking time C per record.
  The cost: BD + BRC = B(D + RC)
Search with equality selection:
  Suppose that exactly one record matches the desired equality selection, that is, the selection is specified on a candidate key.
  For each retrieved data page, check all records on the page to see if one of them is the desired record.
  On average, half of the file is scanned before the record is found.
  The cost: 0.5B(D + RC)
Search with range selection:
  The entire file must be scanned because qualifying records could appear anywhere in the file.
  The cost: B(D + RC)
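The factor 0.5 in the equality-search cost can be checked with a small simulation. This is a sketch under the same assumption as the text (exactly one matching record); the toy file of 100 pages with 4 records each and the helper name pages_read_for_equality are made up for the example.

```python
def pages_read_for_equality(pages, key):
    """Scan pages in file order and stop at the first page holding the key."""
    for reads, page in enumerate(pages, start=1):
        if key in page:
            return reads
    return len(pages)  # key absent: the whole file was read

# Hypothetical heap file: B = 100 pages, R = 4 records per page, unique keys.
pages = [[p * 4 + s for s in range(4)] for p in range(100)]
all_keys = [key for page in pages for key in page]

# Average number of pages read when each record is equally likely to be the target.
average_reads = sum(pages_read_for_equality(pages, k) for k in all_keys) / len(all_keys)
print(average_reads, "pages read on average out of", len(pages))
```

The average comes out at about half the number of pages, which matches the 0.5B factor; a range selection cannot stop early and always reads all B pages.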
-
Heap Files
Insert:
  Assume that records are always inserted at the end of the file.
  Fetch the last page in the file, add the record, and write the page back.
  The cost: 2D + C
Delete:
  Find the record, remove the record from the page, and write the modified page back.
  Assume that no attempt is made to compact the file to reclaim the free space created by deletions.
  The cost: (the cost of searching) + C + D
  The cost of searching is D if the record is identified by its rid; otherwise it is the cost of a search with an equality or range condition.
  The cost of deletion is also affected by the number of qualifying records, since all pages containing such records must be modified.
-
Sorted Files (1)
Scan:
  The cost: B(D + RC)
Search with equality selection:
  Assume that the equality selection is specified on the field by which the file is sorted; if not, the cost is identical to that for a heap file.
  Locate the first page containing the desired records, should any qualifying records exist, with a binary search in log2 B steps.
  Once the page is known, the first qualifying record can again be located by a binary search of the page, at a cost of C log2 R.
  The cost: D log2 B + C log2 R
-
Sorted Files (2)
Search with range selection:
  Assume that the range selection is on the sort field. The first record that satisfies the selection is located as it is for search with equality.
  The cost is the cost of the search plus the cost of retrieving the set of records that satisfy the search.
  The cost of the search includes the cost of fetching the first page containing qualifying, or matching, records.
  The cost: D log2 B + #matches, where #matches stands for the time to retrieve the matching records (about D for each additional page of matches).
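The binary search over pages can be sketched as follows. The example assumes unique keys and a toy file of 1024 pages with 4 records each; the helper name find_page is hypothetical. Each loop iteration stands for one page read, so the number of reads stays close to log2 B.

```python
import math

def find_page(pages, key):
    """Binary search over sorted pages; returns (page number, pages read)."""
    lo, hi = 0, len(pages) - 1
    reads = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        page = pages[mid]        # one page I/O, costing D in the model
        reads += 1
        if key < page[0]:
            hi = mid - 1         # key lies in an earlier page
        elif key > page[-1]:
            lo = mid + 1         # key lies in a later page
        else:
            return mid, reads    # key falls within this page's key range
    return None, reads

# Hypothetical sorted file: B = 1024 pages, R = 4 records per page, unique keys.
pages = [[p * 4 + s for s in range(4)] for p in range(1024)]
page_no, reads = find_page(pages, 2049)
print(page_no, reads, math.log2(len(pages)))
```

For a range selection, the same search locates the first matching page, after which the following pages are read sequentially until the range is exhausted.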
-
Sorted Files (3)
Insert:
  Find the correct position in the file, add the record, and then fetch and rewrite all subsequent pages.
  Assume that the inserted record belongs in the middle of the file.
  The cost is the cost of searching to find the position of the new record plus 2 * (0.5B(D + RC)), that is:
  The cost: (search cost) + B(D + RC)
Delete:
  Search for the record, remove the record from the page, and write the modified page back.
  Read and write all subsequent pages, because all records that follow the deleted record must be moved up to compact the free space.
  The cost: (search cost) + B(D + RC)
-
Hashed Files
A hashed file enables us to locate records with a given search key value quickly.
For example, if the file is hashed on the name field: find the Students record for Joe.
The pages in a hashed file are grouped into buckets.
The bucket to which a record belongs can be determined by applying a special function, called a hash function, to the search field(s).
-
Hashed Files
On inserts, a record is inserted into the appropriate bucket, with additional "overflow" pages allocated if the primary page for the bucket becomes full.
The overflow pages for each bucket are maintained in a linked list.
To search for a record with a given search key value, we simply apply the hash function to identify the bucket to which such records belong and look at all pages in that bucket.
For analysis, assume that there are no overflow pages.
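A minimal sketch of this bucket structure is shown below. The page capacity, the number of buckets, and the hash function (sum of character codes modulo the bucket count) are assumptions for the example, and a Python list of pages stands in for the linked list of overflow pages.

```python
PAGE_CAPACITY = 2   # records per page (assumed, kept tiny to force overflow)
NUM_BUCKETS = 4     # number of buckets (assumed)

def bucket_of(name):
    """Hypothetical hash function on the search field: character codes mod bucket count."""
    return sum(ord(ch) for ch in name) % NUM_BUCKETS

# Each bucket is a chain of pages: the primary page first, then any overflow pages.
buckets = [[[]] for _ in range(NUM_BUCKETS)]

def insert(record):
    chain = buckets[bucket_of(record["name"])]
    if len(chain[-1]) == PAGE_CAPACITY:
        chain.append([])            # last page is full: allocate an overflow page
    chain[-1].append(record)

def search(name):
    """Hash to the bucket, then look at every page in that bucket."""
    chain = buckets[bucket_of(name)]
    return [rec for page in chain for rec in page if rec["name"] == name]

for student in ({"name": "Joe", "gpa": 3.2}, {"name": "Ann", "gpa": 3.8}, {"name": "Bob", "gpa": 3.5}):
    insert(student)

print(search("Joe"))
```

Only the pages of one bucket are examined during a search, which is where the speed-up over a heap file scan comes from.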
-
Hashed Files
Scan:
  Pages are kept at about 80 percent occupancy.
  A new page is added to a bucket when each existing page is 80 percent full, so there are no overflow pages.
  The number of pages, and therefore the cost of scanning all the data pages, is about 1.25 times the cost of scanning an unordered file.
  The cost: 1.25B(D + RC)
Search with equality selection:
  The cost of identifying the page that contains qualifying records is H.
  Assume that the bucket consists of just one page (no overflow pages); retrieving it costs D.
  On average, half of the records on the page are examined before the match is found.
  The cost: H + D + 0.5RC
-
Hashed Files
Search with range selection:
  The entire file must be scanned.
  The cost: 1.25B(D + RC)
Insert:
  The appropriate page must be located, modified, and then written back.
  The cost: (the cost of search) + (C + D)
Delete:
  Search for the record, remove it from the page, and write the modified page back.
  The cost: (the cost of search) + (C + D)
-
Choosing a File Organization
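One way to compare the three organizations is to evaluate the cost formulas from the preceding slides for a concrete file. The sketch below does this for an assumed file of B = 1000 pages and R = 100 records per page; the function name costs is hypothetical, and the search cost inside insert and delete is assumed to be an equality search. Range selections are left out because their cost depends on the number of matches.

```python
import math

def costs(B, R, D=15e-3, C=100e-9, H=100e-9):
    """Evaluate the slide formulas (in seconds) for heap, sorted and hashed files."""
    page = D + R * C                                   # cost to read and process one page
    heap_eq = 0.5 * B * page
    sorted_eq = D * math.log2(B) + C * math.log2(R)
    hashed_eq = H + D + 0.5 * R * C
    return {
        "heap": {"scan": B * page, "equality": heap_eq,
                 "insert": 2 * D + C, "delete": heap_eq + C + D},
        "sorted": {"scan": B * page, "equality": sorted_eq,
                   "insert": sorted_eq + B * page, "delete": sorted_eq + B * page},
        "hashed": {"scan": 1.25 * B * page, "equality": hashed_eq,
                   "insert": hashed_eq + C + D, "delete": hashed_eq + C + D},
    }

table = costs(B=1000, R=100)   # assumed file size, for illustration only
for organization, row in table.items():
    print(organization, {op: round(seconds, 4) for op, seconds in row.items()})
```

With these inputs, no single organization wins every column: the heap file has the cheapest insert, the hashed file the cheapest equality search, and the sorted file pays for its ordering on every insert and delete.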
-
INDEXES
An index is a data structure that organizes records via trees or hashing.
An index on a file is an auxiliary structure designed to speed up operations that are not efficiently supported by the basic organization of records in that file.
An index contains a collection of data entries, with an efficient way to locate all data entries k* with search key value k.
A data entry k* contains enough information to retrieve (one or more) data records with search key value k.
-
Figure: File hashed on age, with an index on sal that contains (sal, rid) pairs as data entries.
-
Hash-Based Indexes
A simple hashed file organization enables us to locate records with a given search key value quickly.
  Example: find the Students record for Joe.
The pages in a hashed file are grouped into buckets.
  Bucket = primary page plus zero or more overflow pages.
  Buckets contain data entries.
Given a bucket number, the hashed file structure allows us to find the primary page for that bucket.
The bucket to which a record belongs can be determined by applying a special function, called a hash function, to the search field(s).
-
Alternatives for Data Entries k* in an Index
1. A data entry k* is an actual data record (with search key value k).
2. A data entry is a (k, rid) pair, where rid is the record id of a data record with search key value k.
3. A data entry is a (k, rid-list) pair, where rid-list is a list of record ids of data records with search key value k.
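The three alternatives can be compared on a tiny example. In the sketch below the search key is age, rids are assumed to be (page, slot) pairs, and the sample records are invented for illustration.

```python
# Hypothetical data records keyed by rid; a rid is assumed to be a (page, slot) pair.
records = {
    (0, 0): {"name": "Ann", "age": 30},
    (0, 1): {"name": "Bob", "age": 55},
    (1, 0): {"name": "Cid", "age": 55},
}

# Alternative 1: the data entries are the data records themselves, ordered by search key.
alt1 = sorted(records.values(), key=lambda rec: rec["age"])

# Alternative 2: one (k, rid) pair per data record.
alt2 = sorted((rec["age"], rid) for rid, rec in records.items())

# Alternative 3: one (k, rid-list) pair per distinct search key value.
alt3 = {}
for age, rid in alt2:
    alt3.setdefault(age, []).append(rid)

print(alt2)
print(alt3)
```

Alternative 2 repeats the key 55 once per matching record, while Alternative 3 stores it once with a longer rid-list, which is why its entries vary in length.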
-
Alternatives for Data Entries (Contd.)
Alternative 1:
  There is no need to store the data records separately.
  If this is used, the index structure is a file organization for data records (instead of a heap file or sorted file).
  At most one index on a given collection of data records can use Alternative 1. (Otherwise, data records are duplicated, leading to redundant storage and potential inconsistency.)
  If data records are very large, the number of pages containing data entries is high. This implies that the size of the auxiliary information in the index is also large.
-
Alternatives for Data Entries (Contd.)
Alternatives 2 and 3:
  Data entries are typically much smaller than data records. So these alternatives are better than Alternative 1 with large data records, especially if search keys are small. (The portion of the index structure used to direct the search, which depends on the size of data entries, is much smaller than with Alternative 1.)
  Alternative 3 is more compact than Alternative 2 and offers better space utilization, but data entries are variable in length, depending on the number of data records with a given search key value.
-
PROPERTIES OF INDEXES: Clustered
Clustered: the file is organized so that the ordering of data records is the same as, or close to, the ordering of data entries in some index.
An index that uses Alternative (1) is clustered. Alternative 1 implies clustered; in practice, clustered also implies Alternative 1.
A file can be clustered on at most one search key, so there is at most one clustered index on a data file.
The cost of retrieving data records through an index varies greatly based on whether the index is clustered or not!
-
PROPERTIES OF INDEXES: Clustered
An index that uses Alternative (2) or Alternative (3) can be a clustered index only if the data records are sorted on the search key field. Otherwise, the order of the data records is random.
Indexes that maintain data entries in sorted order by search key use a collection of index entries, organized into a tree structure, to guide searches for data entries, which are stored at the leaf level of the tree in sorted order.
Suppose that Alternative (2) is used for data entries, and that the data records are stored in a heap file.
  To build a clustered index, first sort the heap file (with some free space on each page for future inserts).
  Overflow pages may be needed for inserts. (Thus, the order of data records is "close to", but not identical to, the sort order.)
-
Clustered Tree Index Using Alternative (2)
-
PROPERTIES OF INDEXES: Unclustered
Unclustered index: an index that is not clustered.
We can have several unclustered indexes on a data file.
If the index is clustered, the rids in qualifying data entries point to a contiguous collection of records, so we need to retrieve only a few data pages.
If the index is unclustered, each qualifying data entry could contain a rid that points to a distinct data page, leading to as many data page I/Os as the number of data entries that match the range selection!
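The difference can be made concrete by counting the data pages that a range selection touches. The sketch below is a simulation under assumed sizes (1000 records, 10 per page, a range matching 100 keys) with a fixed random seed; it counts distinct pages, which presumes a page stays in memory once read, so the unclustered worst case of one I/O per data entry is an upper bound on what it reports.

```python
import random

def pages_touched(rids):
    """Number of distinct data pages referenced by a collection of (page, slot) rids."""
    return len({page_no for page_no, _slot in rids})

R = 10                      # records per page (assumed)
keys = list(range(1000))    # 1000 records with unique search key values (assumed)

# Clustered: data records are stored in search key order.
clustered = {key: (pos // R, pos % R) for pos, key in enumerate(keys)}

# Unclustered: data records are stored in an order unrelated to the search key.
shuffled = keys[:]
random.Random(0).shuffle(shuffled)
unclustered = {key: (pos // R, pos % R) for pos, key in enumerate(shuffled)}

wanted = range(200, 300)    # range selection matching 100 data entries
clustered_pages = pages_touched(clustered[k] for k in wanted)
unclustered_pages = pages_touched(unclustered[k] for k in wanted)
print(clustered_pages, unclustered_pages)
```

In the clustered layout the 100 matches sit on 10 adjacent pages, whereas in the shuffled layout they are spread over many more pages.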
-
Unclustered Tree Index Using Alternative (2)
-
Dense versus Sparse Indexes
An index is said to be dense if it contains (at least) one data entry for every search key value that appears in a record in the indexed file.
A sparse index contains one entry for each page of records in the data file.
Alternative (1) for data entries always leads to a dense index.
Alternative (2) can be used to build either dense or sparse indexes.
Alternative (3) is typically only used to build a dense index.
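The sketch below builds both kinds of index over a small file sorted on name. The file contents, the (page, slot) rid layout, and the helper name sparse_lookup are assumptions for the example; the sparse index keeps only the first key of each page, so a lookup finds the right page through the index and then searches inside that page.

```python
import bisect

# Hypothetical data file, sorted on name, with three records per page.
pages = [["Ann", "Bob", "Cid"], ["Dee", "Eve", "Fay"], ["Gus", "Hal", "Ida"]]

# Dense index: one (key, rid) entry for every record.
dense = [(name, (p, s)) for p, page in enumerate(pages) for s, name in enumerate(page)]

# Sparse index: one (first key on page, page number) entry per page.
sparse = [(page[0], p) for p, page in enumerate(pages)]

def sparse_lookup(name):
    """Find the last page whose first key is <= name, then search within that page."""
    first_keys = [key for key, _page_no in sparse]
    i = bisect.bisect_right(first_keys, name) - 1
    if i < 0:
        return None                       # name sorts before every key in the file
    page_no = sparse[i][1]
    if name in pages[page_no]:
        return (page_no, pages[page_no].index(name))
    return None

print(len(dense), len(sparse), sparse_lookup("Eve"))
```

The sparse index is much smaller, but it only works because the data file itself is sorted on the search key.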
-
Figure: A data file of records with three fields (name, age, and sal), with two simple indexes on it, both of which use Alternative (2) for data entry format.
-
Primary and Secondary Indexes
Primary index: an index on a set of fields that includes the primary key.
Secondary index: an index that is not a primary index.
The two terms are sometimes used with a different meaning: an index that uses Alternative (1) is called a primary index, and one that uses Alternatives (2) or (3) is called a secondary index.
Two data entries are said to be duplicates if they have the same value for the search key field associated with the index.
A primary index is guaranteed not to contain duplicates, but an index on other (collections of) fields can contain duplicates.
Unique index: no duplicates exist and the search key contains some candidate key.
-
Indexes Using Composite Search Keys
Composite search keys, or concatenated keys: the search key for an index that contains several fields.
If the search key is composite, an equality query is one in which each field in the search key is bound to a constant.
A range query is one in which not all fields in the search key are bound to constants.
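The two query types can be sketched over a sorted list of data entries with a composite key. The key (age, sal), the sample entries, the rid labels, and the function names are assumptions for the example; Python's tuple ordering plays the role of the index's sort order on the concatenated key.

```python
import bisect

# Hypothetical data entries ((age, sal), rid), kept in sorted order on the composite key.
entries = sorted([
    ((12, 10), "rid-a"),
    ((12, 20), "rid-b"),
    ((11, 80), "rid-c"),
    ((13, 75), "rid-d"),
])
keys = [key for key, _rid in entries]

def equality_query(age, sal):
    """Every field of the search key is bound to a constant."""
    lo = bisect.bisect_left(keys, (age, sal))
    hi = bisect.bisect_right(keys, (age, sal))
    return [rid for _key, rid in entries[lo:hi]]

def range_query_on_age(age):
    """Only age is bound; sal is left free, so a range of entries qualifies."""
    lo = bisect.bisect_left(keys, (age, float("-inf")))
    hi = bisect.bisect_right(keys, (age, float("inf")))
    return [rid for _key, rid in entries[lo:hi]]

print(equality_query(12, 20))
print(range_query_on_age(12))
```

Because entries are ordered first on age and then on sal, all entries with age = 12 are adjacent, so the partially bound query reduces to a contiguous range of the index.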
-
INDEX SPECIFICATION IN SQL
-
INDEX SPECIFICATION IN SQL-92
This specifies that a B+ tree index is to be created on the Students table using the concatenation of the age and gpa columns as the key.
Key values are pairs of the form (age, gpa), and there is a distinct entry for each such pair.
Once the index is created, it is automatically maintained by the DBMS, which adds and removes data entries in response to inserts and deletes of records on the Students relation.
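The statement described here is not reproduced in this transcript. As a stand-in, the sketch below creates a composite index on (age, gpa) using SQLite through Python's sqlite3 module. The index name IndAgeGpa, the column types, and the sample rows are assumptions; SQLite's CREATE INDEX has no clause for choosing the index structure, so this shows the general idea rather than the syntax from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (sid INTEGER PRIMARY KEY, name TEXT, age INTEGER, gpa REAL)")

# Composite search key: the concatenation of the age and gpa columns.
conn.execute("CREATE INDEX IndAgeGpa ON Students (age, gpa)")

# Rows inserted after the index exists; the DBMS maintains the index automatically.
conn.executemany(
    "INSERT INTO Students (name, age, gpa) VALUES (?, ?, ?)",
    [("Joe", 19, 3.2), ("Ann", 20, 3.8), ("Bob", 19, 3.5)],
)

index_names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'Students'")]

# Equality query: both fields of the composite key are bound to constants.
rows = conn.execute("SELECT name FROM Students WHERE age = 19 AND gpa = 3.5").fetchall()
print(index_names, rows)
```

No application code updates the index on insert; that bookkeeping is done by the database engine, as the last point above states.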