Secondary Storage and Indexing

Secondary Storage and Indexing CSCI 4380 Database Systems Friday, April 20, 12

Transcript of Secondary Storage and Indexing

Page 1: Secondary Storage and Indexing

Secondary Storage and Indexing

CSCI 4380 Database Systems

Friday, April 20, 12

Page 2: Secondary Storage and Indexing

Disk access

• Databases are generally large data stores, much larger than available memory.

• Data is stored on disk, brought to memory on demand.

• A disk page or block is the smallest unit of access, to read and write.

• A disk page typically is 1K - 8K.


Page 3: Secondary Storage and Indexing

Disk organization

• A disk contains

• multiple platters (usually 2 surfaces per platter)

• usually, the disk contains read/write heads that allow it to read/write from all surfaces simultaneously


Page 4: Secondary Storage and Indexing

Disk organization

• A disk surface contains

• multiple concentric tracks

• the same track on different surfaces can be read by different heads at the same time; this unit is called a cylinder


Page 5: Secondary Storage and Indexing

Disk organization

• A track is broken down into sectors; sectors are separated from each other by blank spaces

• A sector is the smallest unit of operation (read/write) possible on a disk

• A disk block is usually composed of a number of consecutive sectors (determined by the operating system)

• Data are read/written in units of a disk block/page

• A disk block is the same size as a memory block or page.


Page 6: Secondary Storage and Indexing

Reading a disk page

• Reading a page from disk requires the disk to start


• Disk arm has to be moved to the correct track of the disk -> seek operation

• The disk head must wait until the right location on the track is found -> rotational latency

• Then, the disk page can be read from disk and copied to memory -> transfer time.


Page 7: Secondary Storage and Indexing

Reading a disk page

• The cost of reading a disk page:

• seek time + rotational latency time + transfer time

• Multiple pages on the same track/cylinder can be read with a single seek/latency. Reading M pages on the same track/cylinder:

• seek time + rotational latency time + transfer time * (percentage of disk circumference to be scanned)


Page 8: Secondary Storage and Indexing

A high end disk example

• Consider a disk with 16 surfaces, 2^16 tracks per surface (approx. 65K), 2^8 = 256 sectors per track, and 2^12 bytes per sector.

• Each track has 2^12 * 2^8 = 2^20 bytes (1 MB)

• Each surface has 2^20 * 2^16 = 2^36 bytes

• The disk has 2^4 * 2^36 = 2^40 bytes = 1 TB


Page 9: Secondary Storage and Indexing

Reading a page

• Typical times:

• 7200 rpm means one rotation takes 8.33 ms (on average, half a rotation is needed before the correct location is found: 4.17 ms)

• seek time between 0 and 17.38 ms (on average, 1/3 of the disk surface is traversed: 6.46 ms)

• transfer time for one sector: 8.33/256 = 0.03 ms


Page 10: Secondary Storage and Indexing

Reading a page

• Reading a page of 8K (2 sectors):

• 1 seek + 1 rotational latency + 2 sector transfer times

• 6.46 + 4.17 + 0.03 * 2 = 10.69 ms

• Reading 100 consecutive pages (200 sectors) on the same track:

• 6.46 + 4.17 + 0.03 * 200 = 16.63 ms

• The lesson: Put blocks that are accessed together on the same track/cylinder as much as possible
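The cost formula can be checked with a few lines of Python. This is only a sketch using the slide's average timings; the function name `read_cost_ms` and the `contiguous` flag are illustrative, and each 8K page is assumed to span two 4K sectors.

```python
SEEK_MS = 6.46         # average seek time quoted in the slides
LATENCY_MS = 4.17      # average rotational latency at 7200 rpm
SECTOR_MS = 0.03       # transfer time per sector (8.33 ms / 256, rounded)
SECTORS_PER_PAGE = 2   # assumption: an 8K page on 4K sectors


def read_cost_ms(pages: int, contiguous: bool = True) -> float:
    """Estimated time to read `pages` pages.

    contiguous=True:  one seek and one rotational latency for the whole run.
    contiguous=False: every page pays its own seek and latency.
    """
    transfer = pages * SECTORS_PER_PAGE * SECTOR_MS
    if contiguous:
        return SEEK_MS + LATENCY_MS + transfer
    return pages * (SEEK_MS + LATENCY_MS) + transfer


print(round(read_cost_ms(1), 2))
print(round(read_cost_ms(100, contiguous=False), 2))
```

Under these assumptions, 100 scattered pages cost roughly a full second, while the same pages laid out on one track cost a small multiple of a single page read.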


Page 11: Secondary Storage and Indexing

Disk scheduling

• The disk controller can order the requests to minimize seeks

• When the controller is moving from low tracks to high tracks, serve the next track request in the direction of the movement, queue the rest

• The method is called the elevator algorithm
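A minimal sketch of the elevator ordering, assuming a fixed snapshot of pending track requests (a real controller also has to handle requests that arrive while the arm is moving). The function name and the sample track numbers are made up for illustration.

```python
def elevator_order(requests, head, moving_up=True):
    """Order pending track requests the way an elevator would serve floors.

    Requests in the current direction of travel are served first, in the
    order the arm passes them; the rest are served on the way back.
    """
    here = [t for t in requests if t == head]
    above = sorted(t for t in requests if t > head)
    below = sorted((t for t in requests if t < head), reverse=True)
    return here + (above + below if moving_up else below + above)


pending = [98, 183, 37, 122, 14, 124, 65, 67]
print(elevator_order(pending, head=53))
```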


Page 12: Secondary Storage and Indexing

Checksums

• For each sector, store a number of error checking bits called a checksum

• With a single parity bit, the checksum is 1 if the number of 1’s in the given sector is odd, and 0 if the number of 1’s is even.

• When reading a sector, check that the checksum is correct.

• Detects all 1-bit errors.

• For errors of more than 1 bit, the checksum catches the error about 50% of the time.

• For better error detection, use multiple bits (e.g. 8 parity bits, where bit i stores the parity of the ith bit of each byte).
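Both parity schemes are short enough to sketch directly. The helper names `parity_bit` and `parity_byte` are illustrative, and the sector is modelled as a Python `bytes` value.

```python
def parity_bit(sector: bytes) -> int:
    """Single parity bit: 1 if the sector holds an odd number of 1 bits."""
    ones = sum(bin(byte).count("1") for byte in sector)
    return ones % 2


def parity_byte(sector: bytes) -> int:
    """Eight parity bits: bit i is the parity of bit i across all bytes."""
    result = 0
    for byte in sector:
        result ^= byte  # XOR accumulates the parity of each bit position
    return result


good = bytes([0b10110000, 0b00000001])
one_flip = bytes([0b10110001, 0b00000001])   # one bit corrupted
two_flips = bytes([0b10110011, 0b00000001])  # two bits corrupted

print(parity_bit(good), parity_bit(one_flip), parity_bit(two_flips))
print(parity_byte(good) == parity_byte(two_flips))
```

The two-bit corruption leaves the single parity bit unchanged, but because the flipped bits sit in different bit positions the 8-bit parity still differs.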


Page 13: Secondary Storage and Indexing

Stable storage

• When we are writing a sector, if the write fails, then we lose the data on that sector.

• Use two sectors for each sector, XL and XR.

• First write XL, check the checksum. If XL is written correctly, then write XR.

• If XL is written incorrectly, then the old version of X is still stored in XR.

• If XR is written incorrectly, then the new version of X is stored in XL.
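The XL/XR protocol can be simulated in memory. This is a sketch under stated assumptions: the class `StableSector` and the `fail_left` / `fail_right` flags are hypothetical, a failed write is modelled as a copy whose checksum does not match, and CRC-32 from `zlib` stands in for the sector checksum instead of the parity bits described earlier.

```python
import zlib


class StableSector:
    """One logical sector X kept as two physical copies, XL and XR."""

    def __init__(self, data: bytes = b""):
        self.xl = self._good(data)
        self.xr = self._good(data)

    @staticmethod
    def _good(data):
        return data, zlib.crc32(data)

    @staticmethod
    def _bad(data):
        # Simulated failed write: the stored checksum cannot match the data.
        return data, zlib.crc32(data) ^ 1

    @staticmethod
    def _valid(copy):
        data, checksum = copy
        return zlib.crc32(data) == checksum

    def write(self, data, fail_left=False, fail_right=False):
        self.xl = self._bad(data) if fail_left else self._good(data)
        if not self._valid(self.xl):
            return False  # XR is untouched and still holds the old X
        self.xr = self._bad(data) if fail_right else self._good(data)
        return self._valid(self.xr)  # if False, XL already holds the new X

    def read(self):
        return self.xl[0] if self._valid(self.xl) else self.xr[0]


x = StableSector(b"old")
x.write(b"new", fail_left=True)
print(x.read())
```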


Page 14: Secondary Storage and Indexing

Multiple disks

• RAID (redundant array of inexpensive disks) is a family of methods for improving access time and reducing the possibility of data loss by using multiple disks.


Page 15: Secondary Storage and Indexing


• RAID-0, striping

• Distribute the data into multiple disks

• Example with 4 disks:

• Disk 1 has pages 1,5,9

• Disk 2 has pages 2,6,10

• Disk 3 has pages 3,7,11

• Disk 4 has pages 4,8,12
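The page-to-disk assignment in this example is a round-robin mapping. A small sketch, with 1-based page and disk numbers as on the slide (the function names are illustrative):

```python
def disk_for_page(page: int, num_disks: int = 4) -> int:
    """Disk (1-based) that holds the given page (1-based) under striping."""
    return (page - 1) % num_disks + 1


def layout(num_pages: int, num_disks: int = 4) -> dict:
    """Map each disk number to the list of pages stored on it."""
    disks = {d: [] for d in range(1, num_disks + 1)}
    for page in range(1, num_pages + 1):
        disks[disk_for_page(page, num_disks)].append(page)
    return disks


print(layout(12))
```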


Page 16: Secondary Storage and Indexing


• RAID-0, striping

• Reads are faster (read from all disks simultaneously)

• Writes are the same

• No redundancy in case a disk fails


Page 17: Secondary Storage and Indexing

RAID-1

• RAID-1, mirroring

• Mirror each disk onto another disk

• Reads are up to twice as fast, read from any disk available

• Writes are slower, each write requires writing to two disks

• If one of the disks fails, the other one contains all the data (no data loss)


Page 18: Secondary Storage and Indexing

RAID-4

• One disk contains the parity of the remaining disks

• Block i in the parity disk contains the parity of the ith block in all the remaining disks

• Reads are unchanged

• Writes are slower, each write requires a write to the parity disk as well

• If a disk fails, the lost data can be constructed from the remaining disks
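Parity-based recovery relies on XOR: the parity block is the XOR of the data blocks, so XOR-ing the surviving blocks with the parity yields the missing one. A minimal sketch with made-up two-byte blocks (function names are illustrative):

```python
from functools import reduce


def parity_block(blocks):
    """Byte-wise XOR of equally sized blocks (block i of every data disk)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity):
    """Rebuild the block of the failed disk from the survivors and parity."""
    return parity_block(list(surviving_blocks) + [parity])


data = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]  # block i on three data disks
parity = parity_block(data)

# Suppose the second disk fails: recover its block from the other two.
print(reconstruct([data[0], data[2]], parity) == data[1])
```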


Page 19: Secondary Storage and Indexing

RAID-5

• Similar to RAID-4, but the parity blocks are distributed across all the disks

• Example: Given 5 disks (4 regular and 1 parity):

• Use disk 1 for parity of block 1

• Use disk 2 for parity of block 2

• etc.

• Reads are the same

• Writes are faster than in RAID-4, as a single parity disk is no longer a bottleneck


Page 20: Secondary Storage and Indexing

Tuple organization

• A disk page typically stores multiple tuples. Many different organizations exist.

• The number of tuples that can fit in a page is determined by the number of attributes and the types of attributes the relation has.

[Figure: page layout with header info, a row directory (1, 2, ..., N), free space, and data rows (Row N, Row N-1, ..., Row 1)]


Page 21: Secondary Storage and Indexing

Tuple addressing

• Tuples have a physical address which contains the relevant subset of:

• Host name / Disk number / Surface No / Track No / Sector No

• Physical addresses tend to be long

• Tuples are also given a logical address in the relation

• A map table stored on disk contains the mapping from the logical address to the physical address


Page 22: Secondary Storage and Indexing

Tuple addressing

• When a tuple is brought from disk to memory, its current address becomes a memory address

• Pointer swizzling is the act of replacing the physical address with the memory address in the map table for pages in memory


Page 23: Secondary Storage and Indexing


• An index is a lookup structure built on a search key

• the search key can consist of multiple attributes

• the index contains pointers to tuples (logical address)

• The index itself is also packed into pages and stored on disk.


Page 24: Secondary Storage and Indexing

Dense vs. sparse

• The index is called dense if it contains an entry for each tuple in the relation.

• An index is called sparse if it does not contain an entry for each tuple.

• A sparse index is possible if the addressed relation is sorted with respect to the index key.
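A sparse index lookup can be sketched with a binary search over the index followed by a scan of one block. Everything here is illustrative: the relation is a list of sorted blocks, the index keeps only the first key of each block, and the tuple names are made up.

```python
from bisect import bisect_right


def sparse_lookup(index_keys, blocks, key):
    """Find `key` using a sparse index holding the first key of each block.

    index_keys[i] is the smallest key stored in blocks[i]; the relation
    must be sorted on the key for this to work.
    """
    pos = bisect_right(index_keys, key) - 1  # last index entry <= key
    if pos < 0:
        return None  # key is smaller than everything in the relation
    for k, row in blocks[pos]:  # scan only the one candidate block
        if k == key:
            return row
    return None


blocks = [[(1, "t1"), (3, "t2"), (5, "t3")], [(8, "t4"), (9, "t5")]]
index_keys = [1, 8]
print(sparse_lookup(index_keys, blocks, 5))
```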


Page 25: Secondary Storage and Indexing

Dense Index Example

[Figure: dense index; the first entry shown is (1, t1)]


Page 26: Secondary Storage and Indexing

Sparse Index Example

[Figure: sparse index with entries (1, t1) and (8, t7)]

• 1,t1 points to all values between 1 and 5

• 8,t7 points to all values greater than 5


Page 27: Secondary Storage and Indexing

Index types

• An index can be

• primary, i.e. determines where the tuples are stored

• secondary, i.e. points to the tuples

• There can be many secondary indices.

• An index can be multi-level, i.e. a tree index, where each level is an index on the level below.


Page 28: Secondary Storage and Indexing

B-trees

• B-trees (called B+ trees in some books) are constructed on a list of attributes (also called the index key)

• Each node on a B-tree is mapped to a disk page

• Leaf nodes:

• A leaf node can contain at most n tuples (key values and pointers) and 1 additional pointer to the sibling node.

• A leaf node must contain at least floor((n+1)/2) tuples (plus one additional pointer to the next sibling node).


Page 29: Secondary Storage and Indexing

B-trees

• Internal nodes:

• An internal node can contain at most n + 1 pointers and n key values.

• An internal node must contain at least floor((n+1)/2) pointers (and one less key value), except the root which can contain a single key value and 2 pointers.


Page 30: Secondary Storage and Indexing

B-tree example

• Suppose n = 3

• Each leaf node will have at least 2 and at most 3 tuples.

• Each internal node will point to at least 2 and at most 4 nodes below (and hence will have between 1 and 3 key values).

• Suppose n = 99

• Each leaf node will have at least 50 and at most 99 tuples.

• Each internal node will point to at least 50 and at most 100 nodes below (and hence will have between 49 and 99 key values).

• The root can have as few as 2 pointers and 1 key value.
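The occupancy rules reduce to a little integer arithmetic. A sketch that evaluates them for any n (the function names are illustrative; the root exception is not modelled):

```python
def leaf_bounds(n: int):
    """(min, max) number of tuples in a leaf node."""
    return (n + 1) // 2, n


def internal_pointer_bounds(n: int):
    """(min, max) number of child pointers in a non-root internal node."""
    return (n + 1) // 2, n + 1


def internal_key_bounds(n: int):
    """(min, max) number of key values: always one fewer than the pointers."""
    low, high = internal_pointer_bounds(n)
    return low - 1, high - 1


for n in (3, 99):
    print(n, leaf_bounds(n), internal_pointer_bounds(n), internal_key_bounds(n))
```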


Page 31: Secondary Storage and Indexing

Sibling nodes

• Each leaf node points to the next node at the leaf level, called a sibling node.


Page 32: Secondary Storage and Indexing

B-trees

• Leaf nodes contain pairs of

• key values

• pointers to the tuple

• If the B-tree is a secondary index, then there is an entry in the leaf level for each tuple in the relation.

• The leaf nodes also contain a pointer to the next (sibling) leaf node.


Page 33: Secondary Storage and Indexing

B-trees

• Internal nodes contain up to n key values and n+1 pointers

• The pointers point to the nodes at the level below

[Figure: internal node with key values 10, 25, 32]


Page 34: Secondary Storage and Indexing

Example B-tree

Assume at most 4 key values per node

[Figure: B-tree with leaf keys 2 7 11 15 22 30 41 53 54 63 66 69 71 76 78 84 93 and internal nodes (11, 30) and (66, 78); leaf entries hold pointers to tuples]


Page 35: Secondary Storage and Indexing

B-trees with duplicate values

• If the B-tree is built on a key value that may contain duplicates, build the index in an identical way, except:

• The non-leaf node pointing to a leaf node contains the first key value in that leaf that is not repeated from the previous sibling

• If there is no such key, then a null value is stored at this location.


Page 36: Secondary Storage and Indexing

Example B-tree with duplicates

Assume at most 4 key values per node

[Figure: B-tree with leaf keys 2 7 11 15 15 15 18 18 22 41 41 41 41 41 55 63; the parent node holds 11, 18, - (null), 55]



Page 37: Secondary Storage and Indexing

B-tree equality search

• Given select * from R where A = x and an index on R.A (assume no duplicate values for R.A):

• While not at leaf level:

• Starting from the root, find the address for the node below that may contain this value (the pointer to the left of the first key value that is greater than x or the last pointer if no such value exists)

• Read the node from disk

• If the leaf level contains a tuple with the searched value, read the matched tuples from disk and return
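The descent described above can be sketched on an in-memory tree. The `Node` class and the tiny example tree are hypothetical; a real implementation would read each node from a disk page. `bisect_right(keys, x)` gives the position of the first key greater than x, which is also the index of the pointer to its left.

```python
from bisect import bisect_right


class Node:
    """Internal nodes have `children`; leaf nodes have `rows` instead."""

    def __init__(self, keys, children=None, rows=None):
        self.keys = keys
        self.children = children
        self.rows = rows


def search(root, x):
    node = root
    while node.children is not None:  # while not at leaf level
        # Follow the pointer left of the first key greater than x,
        # or the last pointer if no key is greater.
        node = node.children[bisect_right(node.keys, x)]
    if x in node.keys:
        return node.rows[node.keys.index(x)]
    return None


leaves = [
    Node([2, 7], rows=["t2", "t7"]),
    Node([11, 15, 22], rows=["t11", "t15", "t22"]),
    Node([30, 41], rows=["t30", "t41"]),
]
root = Node([11, 30], children=leaves)
print(search(root, 15))
```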


Page 38: Secondary Storage and Indexing

B-tree equality search

• Given select * from R where A = x and an index on R.A (assume R.A may contain duplicate values):

• While not at leaf level:

• Starting from the root, find the address for the node below that may contain this value (the pointer to the left of the first key value that is greater than x or the last pointer if no such value exists)

• Read the node from disk

• If the leaf level contains a tuple with the searched value, follow the sibling pointers until a value different from x is found. Read the matched tuples from disk and return


Page 39: Secondary Storage and Indexing

B-tree range search

• Given select * from R where A < y and A > x and an index on R.A:

• Using the same algorithm from before, find the first leaf node containing a value > x

• Traverse the sibling pointers from left to right until all tuples in the range are read

• Read all the matching tuples from the disk
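The leaf-level part of a range search can be sketched as a walk along sibling pointers. The `Leaf` class is hypothetical, and the starting leaf is assumed to have been located already by the usual root-to-leaf descent for x.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys, next_leaf=None):
        self.keys = keys            # sorted key values in this leaf
        self.next_leaf = next_leaf  # sibling pointer


def range_scan(start_leaf, x, y):
    """Collect keys k with x < k < y, starting at the leaf found for x."""
    result = []
    leaf = start_leaf
    pos = bisect_right(leaf.keys, x)  # skip keys <= x in the first leaf
    while leaf is not None:
        for key in leaf.keys[pos:]:
            if key >= y:
                return result  # past the end of the range: stop
            result.append(key)
        leaf, pos = leaf.next_leaf, 0  # follow the sibling pointer
    return result


third = Leaf([30, 41])
second = Leaf([11, 15, 22], third)
first = Leaf([2, 7], second)
print(range_scan(second, 11, 41))
```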


Page 40: Secondary Storage and Indexing

Index only search

• Given select A from R where A < 120 and A > 10 and an index on R.A:

• Scan the index for matching tuples as before and return the found A values (no need to read the tuples from disk)


Page 41: Secondary Storage and Indexing

Index partial match

• Given an index on R.A, R.B (index is sorted on A first and then on B)

• Select * from R where A > 10 and A < 100 and B=2

• Scan index for the range A > 10 and A < 100, and for each matching tuple check the B value, read matched tuples from disk

• Select * from R where B > 10 and B < 100

• Scan the leaf level of the index completely to find the matching B tuples, read matched tuples from disk


Page 42: Secondary Storage and Indexing

Insertion

1. Given a new entry A to be inserted

1.1. Search the tree for the leaf node X where the new entry belongs

1.2. If X has space for the new entry, insert.

1.3. Otherwise

1.3.1. Create a new leaf node Y and distribute the entries in X and the entry A between X and Y

1.3.2. Create a new entry B with the address of Y and the lowest entry in Y

1.3.3. Insert B into the parent of X recursively (go to step 1.2)
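The leaf-level step of insertion (steps 1.2 and 1.3.1 to 1.3.2) can be sketched on plain lists of keys. The function name and return convention are illustrative, and the choice to keep the larger half in the original node is an assumption; propagating the new entry into the parent is left out.

```python
from bisect import insort


def insert_into_leaf(keys, new_key, max_keys=4):
    """Insert into a leaf; split if it overflows.

    Returns (left_keys, right_keys, separator). When no split is needed,
    right_keys and separator are None. The separator is the lowest key of
    the new right node, which must then be inserted into the parent.
    """
    keys = list(keys)
    insort(keys, new_key)  # keep the leaf sorted
    if len(keys) <= max_keys:
        return keys, None, None
    mid = (len(keys) + 1) // 2  # left node keeps the larger half
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]


print(insert_into_leaf([53, 54, 63], 57))
print(insert_into_leaf([53, 54, 57, 63], 65))
```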


Page 43: Secondary Storage and Indexing


Insert Example

Insert record with key 57 (at most 4 key values)

[Figure: leaf keys 2 7 11 15 22 30 41 53 54 63 66 69 71 76 78 84 93; internal nodes (11, 30) and (66, 78)]



Page 44: Secondary Storage and Indexing


Insert Example

Insert record with key 57 (at most 4 key values)

[Figure: leaf keys 2 7 11 15 22 30 41 53 54 57 63 66 69 71 76 78 84 93; internal nodes (11, 30) and (66, 78)]


We are done! No rebalancing necessary


Page 45: Secondary Storage and Indexing


Another Insert Example

Insert 65

[Figure: leaf keys 2 7 11 15 22 30 41 53 54 57 63 66 69 71 76 78 84 93; internal nodes (11, 30) and (66, 78)]



Page 46: Secondary Storage and Indexing


Another Insert Example

The overflowing node is split

[Figure: leaf keys 2 7 11 15 22 30 41 53 54 57 63 65 66 69 71 76 78 84 93, with 63 and 65 in a new leaf; internal nodes (11, 30) and (63, 66, 78)]


Page 47: Secondary Storage and Indexing


Another Insert Example

Insert 70 and 94, one more node split


[Figure: tree after inserting 70 and 94; leaf keys 2 7 11 15 22 30 41 53 54 57 63 65 66 69 70 71 76 78 84 93 94, with 71 and 76 in a new leaf]


Page 48: Secondary Storage and Indexing


Another Insert Example

Finally, insert 90 (which will cause the parent to split)


[Figure: the tree before inserting 90; leaf keys 2 7 11 15 22 30 41 53 54 57 63 65 66 69 70 71 76 78 84 93 94]


Page 49: Secondary Storage and Indexing


Another Insert Example

Finally, insert 90 (which will cause the parent to split)

[Figure: tree after inserting 90; root (53, 71); internal nodes (11, 30), (63, 66), and (78, 93); leaf keys 2 7 11 15 22 30 41 53 54 57 63 65 66 69 70 71 76 78 84 90 93 94]


Page 50: Secondary Storage and Indexing


• Suppose we would like to delete entry A

• Locate leaf node X containing entry A and delete A

• If X still has n/2 or more pointers, then, if we deleted the smallest entry in the node, adjust the parent node entry pointing to X (recursively up the tree if necessary)


Page 51: Secondary Storage and Indexing

Deletion

• Otherwise, the node X has too few pointers.

• If a sibling node with the same parent has more than n/2 pointers, then redistribute entries with the sibling and adjust the parent pointers

• Else

• insert all the remaining entries of X into a sibling B

• delete node X

• adjust the parent entry for B

• delete the entry Y in the parent that points to X, recursively (go to the first step of this algorithm)


Page 52: Secondary Storage and Indexing


Deletion Example

Delete key 30

[Figure: leaf keys 2 7 11 15 17 22 30 53 54 78 84 93; internal node (11, 22)]




Page 53: Secondary Storage and Indexing


Deletion Example

Delete key 30

Borrow from neighbor

Adjust the internal node

[Figure: leaf keys 2 7 11 15 17 22 53 54 78 84 93; internal node (11, 17)]




Page 54: Secondary Storage and Indexing


Deletion Example

Delete key 30

Borrow from neighbor

Redistribute between the second and third leaf nodes.

Adjust the internal node

[Figure: leaf keys 2 7 11 15 17 22 53 54 78 84 93; internal node (11, 17)]




Page 55: Secondary Storage and Indexing


Another Deletion Example

Delete key 7

Cannot borrow from neighbor, merge with neighbor

[Figure: leaf keys 2 7 11 15 17 22 53 54 78 84 93; internal node (11, 17)]




Page 56: Secondary Storage and Indexing


Another Deletion Example

[Figure: leaf keys 2 11 15 17 22 53 54 78 84 93]

Delete the corresponding pointer


Page 57: Secondary Storage and Indexing


Another Deletion Example

Delete 53, must merge with a sibling

[Figure: leaf keys 2 11 15 17 22 53 54 78 84]





Page 58: Secondary Storage and Indexing


Another Deletion Example

[Figure: leaf keys 2 11 15 17 22 54 78 84; node key 53]

Node too empty, cannot borrow from sibling, must merge with sibling


Page 59: Secondary Storage and Indexing


Another Deletion Example

[Figure: leaf keys 2 11 15 17 22 54 78 84; node keys 17, 54]




Page 60: Secondary Storage and Indexing


Another Deletion Example

[Figure: leaf keys 2 11 15 17 22 54 78 84; node keys 17, 54]

The final tree.


Page 61: Secondary Storage and Indexing


A B-Tree Example


disk page has capacity of 4K bytes

each tuple address takes 6 bytes and each key value takes 2 bytes

each node is 70% full

need to store 1 million tuples


Page 62: Secondary Storage and Indexing


A B+-Tree Example

Leaf node capacity

• each (key value, tuple address) pair takes 8 bytes

• disk page capacity is 4K, so (4*1024)/8 = 512 (key value, rowid) pairs per leaf page

in reality there are extra headers and pointers that we will ignore

• Hence, the maximum number of pointers per node is about 512 (and 511 key values)


Page 63: Secondary Storage and Indexing


Example Continued

• If all pages are 70% full, each page has about 512*0.7 = 359 pointers

• Storing 1 million tuples requires

1,000,000 / 359 = 2786 pages at the leaf level

2786 / 359 = 8 pages at next level up

1 root page pointing to those 8 pages

Hence, we have a B-tree with 3 levels
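The same estimate can be reproduced programmatically. This sketch follows the slide's simplifications (headers and sibling pointers ignored, 8 bytes per entry at every level) and rounds up at each step, as the slide does; the function name and parameters are illustrative.

```python
import math


def btree_levels(num_tuples, page_bytes=4096, entry_bytes=8, fill=0.7):
    """Return (entries per page, pages per level from the leaves upward)."""
    per_page = math.ceil(page_bytes // entry_bytes * fill)
    levels = [math.ceil(num_tuples / per_page)]  # leaf level
    while levels[-1] > 1:
        # Each level up needs one entry per page of the level below.
        levels.append(math.ceil(levels[-1] / per_page))
    return per_page, levels


per_page, levels = btree_levels(1_000_000)
print(per_page, levels, len(levels))
```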


Page 64: Secondary Storage and Indexing

Hashing

• Given a hash table of K buckets

• Allocate a number of disk blocks M to each bucket

• For each tuple t, apply the hash function. Suppose we hash on attribute A: if h(t.A) = x, then store t in the blocks allocated for bucket x.

• Search on attribute A (select * from r where r.a=c)

• Cost: M/2 (search half the pages for that bucket on average)


Page 65: Secondary Storage and Indexing


• Search on another attribute

• Cost: N, where N is the total number of pages (every bucket must be scanned)

• Insertion cost: 1 read and 1 write (find the last page in the appropriate bucket and store)

• Deletion/Update cost: M/2 (search cost) + 1 to update
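A toy hash file makes the page-count costs concrete. All names here are hypothetical: keys are integers, the hash function is assumed to be `key % K`, each bucket is a list of pages, and the search returns the number of pages it had to read.

```python
class HashFile:
    """Static hashing: K buckets, each a list of fixed-capacity pages."""

    def __init__(self, num_buckets, tuples_per_page=2):
        self.buckets = [[[]] for _ in range(num_buckets)]
        self.tuples_per_page = tuples_per_page

    def _bucket(self, key):
        return self.buckets[key % len(self.buckets)]  # assumed h(key)

    def insert(self, key, row):
        pages = self._bucket(key)
        if len(pages[-1]) == self.tuples_per_page:
            pages.append([])  # last page is full: start a new one
        pages[-1].append((key, row))

    def search(self, key):
        """Return (row or None, number of pages read)."""
        pages_read = 0
        for page in self._bucket(key):
            pages_read += 1
            for k, row in page:
                if k == key:
                    return row, pages_read
        return None, pages_read


f = HashFile(num_buckets=2)
for k in (0, 2, 4, 6):
    f.insert(k, f"r{k}")
print(f.search(4))
```

An unsuccessful search has to read every page of the bucket, which is why long buckets erode the benefit of hashing.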


Page 66: Secondary Storage and Indexing

Hashing - collisions

• If a bucket has too many tuples, then the allocated M pages may not be sufficient

• Allocate additional overflow area

• If the overflow area is large, the benefit of the hash is lost


Page 67: Secondary Storage and Indexing

Extensible hashing

• The address space of the hash (K) can be adjusted to the number of tuples in the relation

• Use a hash function h

• But, use only first z bits of the hashed value to address the tuples

• If a bucket overflows, split the hash directory and use z+1 bits to address
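A compact sketch of extensible hashing, under several assumptions that are not in the slides: keys are already-hashed non-negative integers, the directory is addressed by the low-order z bits (so a bucket labelled 0 splits into 00 and 10), and each bucket remembers how many bits it actually uses (`depth`). A bucket whose depth is smaller than z can be split without doubling the directory. Duplicate keys that exceed a page are not handled.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth  # number of address bits this bucket uses
        self.keys = []


class ExtensibleHash:
    def __init__(self, page_capacity=2):
        self.z = 1  # number of bits used by the directory
        self.capacity = page_capacity
        self.directory = [Bucket(1), Bucket(1)]

    def insert(self, key):
        while True:
            bucket = self.directory[key & ((1 << self.z) - 1)]
            if len(bucket.keys) < self.capacity:
                bucket.keys.append(key)
                return
            if bucket.depth == self.z:
                # Double the directory: the new half is a copy of the old.
                self.directory = self.directory + self.directory
                self.z += 1
            self._split(bucket)

    def _split(self, bucket):
        bit = 1 << bucket.depth  # the extra bit that now distinguishes keys
        bucket.depth += 1
        new = Bucket(bucket.depth)
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            (new if k & bit else bucket).keys.append(k)
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = new  # update the link for the new node


h = ExtensibleHash()
for k in (0, 2, 4):
    h.insert(k)
print(h.z, [b.keys for b in h.directory])
```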


Page 68: Secondary Storage and Indexing

Extensible hashing

• Using a single bit to address

[Figure: Page 0 and Page 1; a new point causes Page 0 to overflow]


Page 69: Secondary Storage and Indexing

Extensible hashing

• Double the directory

[Figure: Page 0 and Page 1; the contents of Page 0 are distributed to 00 and 10]


Page 70: Secondary Storage and Indexing

Extensible hashing

• Double the directory

[Figure: Pages 0, 1, 2, and 3]

Make a copy of the directory


Page 71: Secondary Storage and Indexing

Extensible hashing

[Figure: Pages 0, 1, 2, and 3]

Update the link for the new node


Page 72: Secondary Storage and Indexing

Extensible hashing

[Figure: Pages 0, 1, 2, and 3]

How do we know which nodes can be split without splitting the directory?


Page 73: Secondary Storage and Indexing

Linear hashing

• The addressing is the same, but we allow overflows

• We decide to split based on a global rule

• If number of pages/number of tuples > k %

• Split one bucket at a time




Page 74: Secondary Storage and Indexing

Linear hashing

[Figure: a new point arrives in bucket 0 and we decide to split; the contents of bucket 0 are split into 00 and 10; bucket 1 still contains all entries for 01 and 11]


Page 75: Secondary Storage and Indexing

Linear hashing

• The bucket split is the next one in sequence

• it may not be the one that has overflow pages

• eventually all buckets will be split
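A sketch of linear hashing under stated assumptions: keys are already-hashed non-negative integers addressed by their low-order bits, overflow pages are modelled by letting a bucket's list grow past `page_capacity`, and the global split rule is taken to be "tuples stored / total page capacity exceeds `max_load`" (one possible reading of the rule above). The class and attribute names are illustrative.

```python
class LinearHash:
    def __init__(self, page_capacity=2, max_load=0.8):
        self.level = 1       # bits used by buckets not yet split this round
        self.next = 0        # the next bucket to split, in sequence
        self.buckets = [[], []]
        self.capacity = page_capacity
        self.max_load = max_load
        self.count = 0

    def _address(self, key):
        b = key % (1 << self.level)
        if b < self.next:  # bucket already split: use one more bit
            b = key % (1 << (self.level + 1))
        return b

    def insert(self, key):
        self.buckets[self._address(key)].append(key)  # overflow is allowed
        self.count += 1
        if self.count / (len(self.buckets) * self.capacity) > self.max_load:
            self._split()

    def _split(self):
        old = self.buckets[self.next]
        self.buckets[self.next] = []
        self.buckets.append([])  # new bucket at index next + 2**level
        for k in old:
            self.buckets[k % (1 << (self.level + 1))].append(k)
        self.next += 1
        if self.next == 1 << self.level:  # every bucket of this round split
            self.level += 1
            self.next = 0

    def contains(self, key):
        return key in self.buckets[self._address(key)]


lh = LinearHash()
for k in (0, 2, 4, 1):
    lh.insert(k)
print(lh.buckets, lh.next)
```

The split is triggered by the insert of key 1 into bucket 1, yet it is bucket 0 that gets split, because bucket 0 is next in sequence.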
