Why Concerning Storage and Indexing?
Part III: Storage and Indexing (Chapters 8-11)
Part V: Transactions, Concurrency Control, Scheduling, and Recovery (Chapters 16-18)
Why Concerning Storage and Indexing?
Performance: another major factor in user satisfaction
Depends on
• Efficient data structures for data representation
• Efficiency of system operation on those structures
Disks contain data files and system files, including dictionary and index files
Disk access: one of the most critical factors in performance.
Why Not Store Everything in Main Memory?
Cost and size
Main memory is volatile: What’s the problem?
Typical storage hierarchy:
• Factors: access speed, cost per unit, reliability
• Cache and main memory (RAM) for currently used data: fast but costly
• Flash memory: limited number of writes (and slow), non-volatile, disk-substitute in embedded systems
• Disk for the main database (secondary storage).
• Tapes for archiving older versions of the data (tertiary storage).
Buffer Management in a DBMS
CC & Recovery may require additional I/O when a frame is chosen for replacement. Why?
[Figure: Page requests from higher levels arrive at the BUFFER POOL in MAIN MEMORY. Disk pages are read from the DB on DISK into free frames; the choice of frame is dictated by the replacement policy.]
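One reason replacement can cost extra I/O is that a frame holding a modified page has to be written back before it can be reused. The following is a minimal sketch of that behavior, not any particular DBMS's buffer manager. The class name `BufferPool`, the LRU policy, and the use of a dict as the "disk" are all assumptions made for illustration.

```python
from collections import OrderedDict

class BufferPool:
    """Toy buffer pool: fixed number of frames, LRU replacement, dirty bits."""

    def __init__(self, disk, num_frames):
        self.disk = disk                  # dict page_id -> contents (stands in for the DB on disk)
        self.num_frames = num_frames
        self.frames = OrderedDict()       # page_id -> frame info, least recently used first
        self.disk_writes = 0

    def pin(self, page_id):
        """Bring the page into a frame if needed and return its contents."""
        if page_id not in self.frames:
            if len(self.frames) >= self.num_frames:
                self._replace_frame()
            self.frames[page_id] = {"data": self.disk[page_id], "dirty": False, "pins": 0}
        self.frames.move_to_end(page_id)  # mark as most recently used
        self.frames[page_id]["pins"] += 1
        return self.frames[page_id]["data"]

    def unpin(self, page_id, new_data=None):
        """Release the page; passing new_data marks the frame dirty."""
        frame = self.frames[page_id]
        frame["pins"] -= 1
        if new_data is not None:
            frame["data"] = new_data
            frame["dirty"] = True

    def _replace_frame(self):
        # Pick the least recently used frame that nobody has pinned.
        victim = next((p for p, f in self.frames.items() if f["pins"] == 0), None)
        if victim is None:
            raise RuntimeError("all frames are pinned")
        frame = self.frames.pop(victim)
        if frame["dirty"]:                # the additional I/O: write the page back first
            self.disk[victim] = frame["data"]
            self.disk_writes += 1

disk = {"P1": "a", "P2": "b", "P3": "c"}
pool = BufferPool(disk, num_frames=2)
pool.pin("P1"); pool.unpin("P1", new_data="a2")   # P1 is now dirty
pool.pin("P2"); pool.unpin("P2")
pool.pin("P3"); pool.unpin("P3")                  # pool is full: P1 is replaced and written back
```

A recovery scheme based on logging would typically also force the relevant log records before the write-back, which this sketch leaves out.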
Indexes
An index on a file speeds up selections on the search key fields for the index.
• Any subset of the fields of a relation can be the search key for an index on the relation.
• Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation).
An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k.
• Given data entry k*, we can find record with key k quickly.
Classes: dense/sparse index, primary/secondary, clustered/un-clustered
Dense vs Sparse Index
Dense index: one index entry per search key value.
Sparse index: index records for only some of the records
• Every sparse index is clustered!
• Sparse indexes are smaller
Which one is faster? Which one has less overhead?
[Figure: Data File with a Sparse Index on Name and a Dense Index on Age]
Data File (sorted by name):
Ashby, 25, 3000
Basu, 33, 4003
Bristow, 30, 2007
Cass, 50, 5004
Daniels, 22, 6003
Jones, 40, 6003
Smith, 44, 3000
Tracy, 44, 5004
Sparse Index on Name: Ashby, Cass, Smith
Dense Index on Age: 22, 25, 30, 33, 40, 44, 44, 50
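The two lookups in the figure can be sketched as follows. This is an illustrative sketch only: the page layout (three records per page, file sorted by name) and the helper names `lookup_by_name` and `lookup_by_age` are assumptions, chosen to match the index entries shown.

```python
from bisect import bisect_left, bisect_right

# Data file sorted by name; three records per page is an assumed layout.
pages = [
    [("Ashby", 25, 3000), ("Basu", 33, 4003), ("Bristow", 30, 2007)],
    [("Cass", 50, 5004), ("Daniels", 22, 6003), ("Jones", 40, 6003)],
    [("Smith", 44, 3000), ("Tracy", 44, 5004)],
]

# Sparse index on name: one entry per page (the first name on that page).
sparse_keys = [page[0][0] for page in pages]

def lookup_by_name(name):
    """Find the page via the sparse index, then scan inside that page."""
    pos = bisect_right(sparse_keys, name) - 1
    if pos < 0:
        return None
    for record in pages[pos]:
        if record[0] == name:
            return record
    return None

# Dense index on age: one data entry per record, as (age, (page_no, slot)).
dense_age = sorted(
    (record[1], (p, s))
    for p, page in enumerate(pages)
    for s, record in enumerate(page)
)
dense_keys = [age for age, _ in dense_age]

def lookup_by_age(age):
    """Every matching record is reachable directly from the dense index."""
    lo, hi = bisect_left(dense_keys, age), bisect_right(dense_keys, age)
    return [pages[p][s] for _, (p, s) in dense_age[lo:hi]]
```

The sparse index works only because the file is sorted on name. The dense index on age makes no such assumption, but it needs one entry per record.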
Clustered vs. Unclustered Index
Suppose that Alternative (2) is used for data entries, and that the data records are stored in a Heap file.
• To build clustered index, first sort the Heap file (with some free space on each page for future inserts).
• Overflow pages may be needed for inserts. (Thus, order of data recs is `close to’, but not identical to, the sort order.)
[Figure: CLUSTERED vs. UNCLUSTERED. In both panels, index entries (Index File) direct the search for data entries, and the data entries point to Data Records (Data file).]
Index Trees
As for any index, 3 alternatives for data entries k*:
• Data record with key value k
• <k, rid of data record with search key value k>
• <k, list of rids of data records with search key k>
Choice is orthogonal to the indexing technique used to locate data entries k*.
Tree-structured indexing techniques support both range searches and equality searches.
• ISAM (indexed sequential access method): static structure (data entries reside in leaf pages and overflow pages)
• B+ tree: dynamic, adjusts gracefully under inserts and deletes.
ISAM
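Index file may still be quite large. But we can apply the idea repeatedly, making a tree
Leaf pages contain data entries.
[Figure: an index entry is a <key, pointer> pair; a non-leaf page holds P0, K1, P1, K2, P2, ..., Km, Pm. Non-leaf pages sit above the leaf pages, which consist of primary pages with overflow pages chained to them.]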
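Example ISAM Tree
Each node can hold 2 entries; no need for `next-leaf-page’ pointers. Why? Sequential allocation of leaf pages.
Insert 23. Insert 48, 41, 42.
[Figure: initial tree]
Root: 40
Index pages: 20 33 | 51 63
Leaf pages: 10* 15* | 20* 27* | 33* 37* | 40* 46* | 51* 55* | 63* 97*
After Inserting 23*, 48*, 41*, 42* ...
[Figure: tree after the inserts]
Root: 40
Index pages: 20 33 | 51 63
Primary leaf pages: 10* 15* | 20* 27* | 33* 37* | 40* 46* | 51* 55* | 63* 97*
Overflow pages: 23* (chained to leaf 20* 27*); 48* 41* (chained to leaf 40* 46*), followed by a second overflow page holding 42*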
B+ Tree: Most Widely Used Index
Insert/delete at log_F N cost; keep tree height-balanced. (F = fanout, N = # leaf pages; F is typically >> 2. Why?)
Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries. The parameter d is called the order of the tree.
Supports equality and range-searches efficiently.
[Figure: Index Entries (direct search) above the Data Entries ("Sequence set").]
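To see why a large fanout matters, the log_F N cost can be worked out for concrete numbers. The figures below (fanout 100, one million leaf pages) are illustrative assumptions, and `index_levels` is a hypothetical helper, not part of any library.

```python
def index_levels(fanout, num_leaf_pages):
    """Smallest number of index levels h such that fanout**h >= num_leaf_pages."""
    levels, reachable = 0, 1
    while reachable < num_leaf_pages:
        reachable *= fanout
        levels += 1
    return levels

wide = index_levels(100, 1_000_000)    # a typical-looking large fanout
binary = index_levels(2, 1_000_000)    # what a binary tree would need
```

With these numbers the wide tree needs 3 index levels and the binary tree needs 20, so each search touches far fewer pages when F is large.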
Extendible Hashing
Situation: Bucket (primary page) becomes full. Why not re-organize file by doubling # of buckets?
• Reading and writing all pages for reorganizing a data file is expensive
Idea: Use directory of pointers to buckets, double # of buckets by doubling the directory, splitting just the bucket that overflowed!
• Directory much smaller than file, so doubling it is much cheaper. Only one page of data entries is split. No overflow page!
• Trick lies in how hash function is adjusted!
Example
Directory is array of size 4.
To find bucket for r, take last `global depth’ # bits of h(r); we denote r by h(r).
• If h(r) = 5 = binary 101, it is in bucket pointed to by 01.
Insert: If bucket is full, split it (allocate new page, re-distribute).
If necessary, double the directory. (When does splitting a bucket not require doubling? We can tell by comparing global depth with local depth for the split bucket.)
[Figure: DIRECTORY with GLOBAL DEPTH 2 pointing to DATA PAGES; every bucket has LOCAL DEPTH 2]
00 -> Bucket A: 4* 12* 32* 16*
01 -> Bucket B: 1* 5* 21* 13*
10 -> Bucket C: 10*
11 -> Bucket D: 15* 7* 19*
Insert h(r)=20 (Causes Doubling)
How to determine keys in A and A2?
[Figure: Bucket A has been split, directory not yet doubled (GLOBAL DEPTH 2, all depth labels still 2)]
Bucket A: 32* 16*
Bucket B: 1* 5* 21* 13*
Bucket C: 10*
Bucket D: 15* 7* 19*
Bucket A2 (`split image' of Bucket A): 4* 12* 20*
[Figure: after doubling the DIRECTORY (GLOBAL DEPTH 3)]
000 -> Bucket A (local depth 3): 32* 16*
001 -> Bucket B (local depth 2): 1* 5* 21* 13*
010 -> Bucket C (local depth 2): 10*
011 -> Bucket D (local depth 2): 15* 7* 19*
100 -> Bucket A2, `split image' of Bucket A (local depth 3): 4* 12* 20*
101 -> Bucket B
110 -> Bucket C
111 -> Bucket D
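The insert that causes doubling can be replayed with a small sketch. It assumes buckets hold 4 entries and that the stored values are the hash values themselves (the slides denote r by h(r)). The class names are hypothetical, and the sketch does not handle the case where more than a bucket's worth of identical hash values arrive.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []

class ExtendibleHash:
    def __init__(self, bucket_capacity=4, global_depth=2):
        self.capacity = bucket_capacity
        self.global_depth = global_depth
        self.directory = [Bucket(global_depth) for _ in range(2 ** global_depth)]

    def _slot(self, h):
        # Directory slot = last `global_depth` bits of the hash value.
        return h & ((1 << self.global_depth) - 1)

    def insert(self, h):
        bucket = self.directory[self._slot(h)]
        if len(bucket.entries) < self.capacity:
            bucket.entries.append(h)
            return
        # Bucket is full. Doubling is needed only if local depth == global depth.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split: redistribute on the newly significant bit.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        new_bit = 1 << (bucket.local_depth - 1)
        old_entries = bucket.entries
        bucket.entries = [e for e in old_entries if not e & new_bit]
        image.entries = [e for e in old_entries if e & new_bit]
        for i, b in enumerate(self.directory):
            if b is bucket and i & new_bit:
                self.directory[i] = image   # these slots now point to the split image
        self.insert(h)                      # retry; the target bucket now has room

# Rebuild the starting state of the example, then insert 20.
table = ExtendibleHash()
for value in [4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19]:
    table.insert(value)
table.insert(20)
```

After the last insert the directory has 8 slots, slot 000 holds 32 and 16, and slot 100 (the split image) holds 4, 12 and 20. Slots 001 and 101 still share Bucket B, whose local depth stays at 2.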
Linear Hashing
This is another dynamic hashing scheme, an alternative to Extendible Hashing.
LH handles the problem of long overflow chains without using a directory, and handles duplicates.
Idea: Use a family of hash functions h_0, h_1, h_2, ...
• h_i(key) = h(key) mod (2^i N); N = initial # buckets
• h is some hash function (range is not 0 to N-1)
• If N = 2^(d_0), for some d_0, h_i consists of applying h and looking at the last d_i bits, where d_i = d_0 + i.
• h_(i+1) doubles the range of h_i (similar to directory doubling)
Overview of LH File
In the middle of a round.
[Figure: buckets of an LH file in the middle of a round]
• Buckets that existed at the beginning of this round: this is the range of h_Level
• Next: bucket to be split
• Buckets split in this round: if h_Level(search key value) is in this range, must use h_Level+1(search key value) to decide if entry is in `split image' bucket.
• `split image' buckets: created (through splitting of other buckets) in this round
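The lookup rule in the figure can be written out as a short sketch. The function name `lh_bucket` and the parameter names are assumptions; `next_to_split` plays the role of Next, and buckets numbered below it are taken to be the ones already split in this round.

```python
def lh_bucket(h_value, level, next_to_split, n_initial):
    """Bucket number for a hash value in the middle of a round of linear hashing."""
    bucket = h_value % (2 ** level * n_initial)              # h_Level
    if bucket < next_to_split:
        # This bucket was already split in this round, so h_Level+1 decides
        # between the original bucket and its split image.
        bucket = h_value % (2 ** (level + 1) * n_initial)    # h_Level+1
    return bucket

# Assumed state: N = 4 initial buckets, Level = 0, bucket 0 already split (Next = 1).
stays = lh_bucket(32, level=0, next_to_split=1, n_initial=4)
moved = lh_bucket(44, level=0, next_to_split=1, n_initial=4)
untouched = lh_bucket(9, level=0, next_to_split=1, n_initial=4)
```

In this state 32 stays in bucket 0, 44 is found in split image bucket 4, and 9 is located with h_Level alone in bucket 1.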
Lawyers
Recently reported in the Massachusetts Bar Association Lawyers Journal, the following are questions actually asked of witnesses by attorneys during trials:
"Now doctor, isn't it true that when a person dies in his sleep, he doesn't know about it until the next morning?"
"Were you alone or by yourself?"
"Was it you or your younger brother who was killed in the war?"
"How far apart were the vehicles at the time of the collision?"
Q: "How was your first marriage terminated?"
A: "By death."
Q: "And by whose death was it terminated?"
Part V: Transactions, Concurrency Control, Scheduling, and Recovery
Ch. 16 - 18
Overview
Transactions and ACID properties
Serial execution and serializable execution
Serializability (dependency) graph
Serializability theorem
Conflict equivalence and view equivalence
Properties of schedules: SR, RC, ACA, ST
Schedulers
• 3 options to handle the request from TM
• optimistic (aggressive) vs pessimistic (conservative)
2PL and Strict 2PL
Timestamp ordering and Strict TO
Atomicity of Transactions
A transaction might commit after completing all its actions, or it could abort (or be aborted by the DBMS) after executing some actions.
A very important property guaranteed by the DBMS for all transactions is that they are atomic.
Atomicity: a transaction is assumed to be executing all its actions in one step, or not executing any actions at all. Not easy to achieve. Why?
Consistency, Isolation, Durability
A transaction executed in isolation must preserve DB consistency.
Even if multiple transactions are executed concurrently, each should be unaware that other transactions are being executed concurrently.
When a transaction completes successfully, the changes it made must persist, even with failures afterwards.
Conflicts and Equivalence
When do two operations conflict?
• They are issued by different transactions
• They operate on the same data object
• At least one of them is a write operation
Conflict equivalent
• Two executions are conflict equivalent if in both executions, all conflicting operations have the same order.
Serializability
Correctness criterion
• Serializability is the correctness definition of DB
• All serializable schedules are equally correct
• Scheduling algorithms enforce certain ordering
• In distributed DBMS, variable delays may disturb any particular ordering which is supposed to occur
Serialization graph (dependency graph) shows dependency relationship among transactions
Serialization Theorem
• For a schedule H, if SG(H) is acyclic, then H is serializable.
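The theorem suggests a direct test: build the serialization graph from the conflicting operations and look for a cycle. The sketch below is one way to do that. The schedule encoding as `(transaction, operation, item)` tuples and the function names are assumptions, and commits and aborts are left out for brevity.

```python
def serialization_graph(schedule):
    """Edges Ti -> Tj for each pair of conflicting operations where Ti's comes first."""
    edges = set()
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            # Conflict: different transactions, same object, at least one write.
            if ti != tj and x == y and "w" in (op_i, op_j):
                edges.add((ti, tj))
    return edges

def equivalent_serial_order(schedule):
    """Return a serial order if SG is acyclic, or None if it contains a cycle."""
    edges = serialization_graph(schedule)
    remaining = {t for t, _, _ in schedule}
    order = []
    while remaining:
        # Transactions with no incoming edge from a transaction still remaining.
        ready = sorted(
            t for t in remaining
            if not any(b == t and a in remaining for a, b in edges)
        )
        if not ready:
            return None                  # every remaining node has a predecessor: cycle
        order.append(ready[0])
        remaining.remove(ready[0])
    return order

# r2(x) w3(x) w1(y) r2(y) w2(y): edges T2 -> T3 and T1 -> T2, no cycle.
acyclic = equivalent_serial_order(
    [(2, "r", "x"), (3, "w", "x"), (1, "w", "y"), (2, "r", "y"), (2, "w", "y")]
)
# r1(x) w2(x) w1(x): edges T1 -> T2 and T2 -> T1 form a cycle.
cyclic = equivalent_serial_order([(1, "r", "x"), (2, "w", "x"), (1, "w", "x")])
```

The first schedule yields the serial order T1, T2, T3. The second returns None, because its graph contains a cycle and no equivalent serial order exists.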
Properties of Schedules
Recoverability
• To ensure that aborting a transaction does not change the semantics of committed transactions
  w1(x) r2(x) w2(y) C2
• Is it recoverable? What if T1 aborts?
• Recoverable execution depends on commit order
• A transaction cannot commit until all values it read are guaranteed not to be aborted.
How to do it?
• Delaying commit: T2 cannot commit until T1 commits
Cascaded abort is sometimes necessary. Why?
  w1(x) r2(x) w2(x) A1
Properties of Schedules
Recoverability
• Cascaded abort is sometimes necessary
  w1(x) r2(x) w2(x) A1
Avoiding cascaded aborts
• Achieved if every transaction reads only the values written by committed transactions
• Must delay each r(x) until all transactions that issued w(x) are either committed or aborted
  w1(x) ... C1 r2(x) w2(y) ...
Properties of Schedules
Restoring before images
• Implementing transaction abort by simply restoring before images of all writes is very convenient
  w0(x) w1(x) w2(x) A1 A2
• Value of x must be restored to the initial value, not the value written by T1
• Solution: delay w(x) until all transactions that have written x are either committed or aborted
Strictness
• Executions that satisfy both requirements
• Delay both r(x) and w(x) until all transactions that have written x are either committed or aborted
  r1(x) w1(x) w1(y) w2(z) w2(x) C1 --- is it strict?
Relationships among Properties
Recoverability (RC)
• RC if Ti reads from Tj and Ci is in H, then Ci follows Cj
Avoiding cascaded aborts (ACA)
• ACA if Ti reads x from Tj then ri(x) follows Cj
Strictness (ST)
• ST if whenever Oi(x) follows wj(x), then Oi(x) follows either Aj or Cj
What is the relationship among ST, ACA, and RC?
• ST ⊂ ACA ⊂ RC
What about with SR and Serial execution?
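The three definitions can be checked mechanically on small schedules, which makes the containment easier to see. The following is a sketch under stated assumptions: schedules are lists of `(transaction, operation, item)` tuples with `"c"` and `"a"` for commit and abort, and a read is taken to read from the latest earlier write on the same item whose transaction had not aborted by then. The function name is hypothetical.

```python
def schedule_properties(schedule):
    """Return (RC, ACA, ST) for a list of (txn, op, item) steps."""
    # Position and kind of each transaction's terminating operation, if any.
    end = {t: (pos, op) for pos, (t, op, _) in enumerate(schedule) if op in ("c", "a")}

    def ended_before(t, pos, kinds):
        return t in end and end[t][0] < pos and end[t][1] in kinds

    rc = aca = st = True
    for i, (ti, op, x) in enumerate(schedule):
        if op not in ("r", "w"):
            continue
        source = None   # transaction whose write a read at step i would see
        for tj, op_j, y in reversed(schedule[:i]):
            if op_j != "w" or y != x:
                continue
            if tj != ti and not ended_before(tj, i, ("c", "a")):
                st = False                      # Oi(x) follows wj(x) but neither Aj nor Cj
            if source is None and not ended_before(tj, i, ("a",)):
                source = tj
        if op == "r" and source is not None and source != ti:
            if not ended_before(source, i, ("c",)):
                aca = False                     # ri(x) does not follow Cj
            if ti in end and end[ti][1] == "c":
                if not ended_before(source, end[ti][0], ("c",)):
                    rc = False                  # Ci does not follow Cj
    return rc, aca, st

not_rc = schedule_properties([(1, "w", "x"), (2, "r", "x"), (2, "w", "y"), (2, "c", None)])
rc_only = schedule_properties(
    [(1, "w", "x"), (2, "r", "x"), (2, "w", "y"), (1, "c", None), (2, "c", None)]
)
aca_not_st = schedule_properties([(1, "w", "x"), (2, "w", "x"), (1, "c", None), (2, "c", None)])
strict = schedule_properties(
    [(1, "w", "x"), (1, "c", None), (2, "r", "x"), (2, "w", "y"), (2, "c", None)]
)
```

The four schedules come out as not RC, RC but not ACA, ACA but not ST, and strict, which shows that each containment in the chain is proper.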
Two-Phase Locking (2PL)
Two-Phase Locking Protocol
• Each Xact must obtain a S (shared) lock on object before reading, and an X (exclusive) lock on object before writing.
• A transaction cannot request additional locks once it releases any locks.
• If an Xact holds an X lock on an object, no other Xact can get a lock (S or X) on that object.
Multiple-Granularity Locks
Why consider it?
• Database consists of tables, pages, tuples (records)
Hard to decide what granularity to lock (tuples vs. pages vs. tables).
Shouldn’t have to decide. How?
Data “containers” are nested:
[Figure: Database contains Tables, which contain Pages, which contain Tuples]
The Phantom Problem
T1 implicitly assumes that it has locked the set of all sailor records with rating = 1.
• Assumption only holds if no sailor records are added while T1 is executing!
• Why did this problem happen?
Example shows that conflict serializability guarantees serializability only if the set of objects is fixed!
Need some mechanism to enforce this assumption -- index locking and predicate locking
Index Locking
If there is a dense index on the rating field using Alternative (2), T1 should lock the index page containing the data entries with rating = 1.
• If there are no records with rating = 1, T1 must lock the index page where such a data entry would be, if it existed!
What if there is no suitable index?
• T1 must lock all pages, and lock the file/table to prevent new pages from being added, to ensure that no new records with rating = 1 are added.
[Figure: Index pointing into Data, with the r=1 entries marked]
Predicate Locking
Grant lock on all records that satisfy some logical predicate, e.g. age > 2*salary.
Index locking is a special case of predicate locking for which an index supports efficient implementation of the predicate lock.
• What is the predicate in the sailor example? rating=1
Why not use predicate locks in commercial DBMS?
• In general, predicate locking has a significant locking overhead.
Locking in B+ Trees
How can we efficiently lock a particular leaf node?
• Don’t confuse this with multiple granularity locking -- How are they different?
One solution: Ignore the tree structure, just lock pages while traversing the tree, following 2PL -- What’s wrong?
This has terrible performance!
• Root node (and many higher level nodes) becomes a bottleneck. Why? Because every tree access begins at the root.
B+ Tree Locking
Higher levels of the tree only direct searches for leaf pages.
For inserts, a node on a path from root to modified leaf must be locked (in X mode, of course), only if a split can propagate up to it from the modified leaf. (Similar point holds w.r.t. deletes.)
We can exploit these observations to design efficient locking protocols that guarantee serializability even though they violate 2PL.
A Simple Tree Locking Algorithm
Search: Start at root and go down; repeatedly, S lock child then unlock parent.
Insert/Delete: Start at root and go down, obtaining X locks as needed. Once child is locked, check if it is safe:
• If child is safe, release all locks on ancestors.
Safe node: Node such that changes will not propagate up beyond this node.
When is a node safe for inserts?
• Node is not full.
When is a node safe for deletes?
• Node is not half-empty.
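The insert case can be traced with a few lines of code that track which X locks are still held when the leaf is reached. This is a sketch only: the tree is reduced to the entry counts along one root-to-leaf path, `d` is the order of the tree (so a full node has 2d entries), and the function name is made up for illustration.

```python
def x_locks_held_for_insert(path_entry_counts, d):
    """Indices (0 = root) of nodes still X-locked once the leaf has been locked."""
    held = []
    for depth, count in enumerate(path_entry_counts):
        held.append(depth)       # X-lock the next node on the way down
        if count < 2 * d:        # safe for inserts: the node is not full
            held = [depth]       # release all locks on ancestors
    return held

all_locked = x_locks_held_for_insert([3, 4, 4], d=2)   # only the root is safe
partial = x_locks_held_for_insert([4, 2, 4], d=2)      # middle node is safe
leaf_only = x_locks_held_for_insert([4, 4, 1], d=2)    # leaf itself is safe
```

Only the last case, where the leaf has room, ends with a single lock held. In the first case a leaf split could propagate up to the root, so the locks on the whole path are kept, although nothing above the root needs protecting.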
Timestamp Ordering
Idea: Any conflicting operations are executed in their timestamp order
Simple and aggressive
• Schedule immediately and reject requests that arrive too late
How do you know a request has arrived too late?
• Give each object a read-timestamp (RTS) and a write-timestamp (WTS), give each transaction a timestamp (TS) when it begins
Timestamp Ordering
Timestamp ordering rule:
• If Oi(x) and Oj(x) are conflicting operations, Oi(x) is processed before Oj(x) if and only if TS(Ti) < TS(Tj).
Request arriving too late:
• Oi(x) arrives after the scheduler has sent conflicting operation Oj(x) with TS(Tj) > TS(Ti)
Basic Timestamp Ordering
Ri(x): if TS(Ti) < WTS(x), reject it; otherwise (TS(Ti) >= WTS(x)), schedule it and set RTS(x) to max(RTS(x), TS(Ti))
Wi(x): if TS(Ti) < RTS(x) or TS(Ti) < WTS(x), reject it; otherwise (TS(Ti) >= RTS(x) and TS(Ti) >= WTS(x)), schedule it and set WTS(x) to max(WTS(x), TS(Ti))
When restarted, Ti is assigned a new timestamp
Thomas Write Rule:
• For wi(x), if TS(Ti) < WTS(x) and TS(Ti) >= RTS(x), then wi(x) can be ignored, rather than being rejected.
• Why is it correct? Ignoring obsolete write
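The read and write rules translate almost line for line into code. The sketch below assumes timestamps are positive integers and that RTS and WTS start at 0 for every object. The class name `BasicTO` and the return values `"ok"`, `"reject"` and `"ignore"` are illustrative choices.

```python
class BasicTO:
    """Basic timestamp ordering, optionally with the Thomas Write Rule."""

    def __init__(self, thomas_write_rule=False):
        self.rts = {}     # object -> largest timestamp that has read it
        self.wts = {}     # object -> largest timestamp that has written it
        self.thomas = thomas_write_rule

    def read(self, ts, x):
        if ts < self.wts.get(x, 0):
            return "reject"                       # a younger transaction already wrote x
        self.rts[x] = max(self.rts.get(x, 0), ts)
        return "ok"

    def write(self, ts, x):
        if ts < self.rts.get(x, 0):
            return "reject"                       # a younger transaction already read x
        if ts < self.wts.get(x, 0):
            # Obsolete write: reject it, or skip it under the Thomas Write Rule.
            return "ignore" if self.thomas else "reject"
        self.wts[x] = ts
        return "ok"

plain = BasicTO()
plain_results = [plain.write(3, "x"), plain.write(2, "x"), plain.read(1, "x")]

thomas = BasicTO(thomas_write_rule=True)
thomas_results = [thomas.write(3, "x"), thomas.write(2, "x"), thomas.read(1, "x")]
```

The late write by timestamp 2 is rejected by the plain scheduler and ignored under the Thomas Write Rule. The late read by timestamp 1 is rejected in both, since a read cannot be skipped.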
Exercise: Non-equivalence of 2PL and TO
H1 = r2(x) w3(x) C3 w1(y) C1 r2(y) w2(y) C2
1. Is this schedule possible with timestamp ordering?
2. Is it possible with 2PL?
H1 is legal with strict timestamp ordering.
• What is the equivalent serial schedule? T1 T2 T3
It is not possible with 2PL
• T2 must release lock on x for T3, but then gets lock on y: violation of two-phaseness
Relationship between 2PL and TO
Schedules generated by 2PL and TO
• They are all correct (serializable)
• They are not the same set: H1 shows that
Is the relationship inclusive?
• S {schedules by 2PL} subset of S {schedules by TO}?
• S {schedules by TO} subset of S {schedules by 2PL}?
Consider w3(x) C3 w2(x) C2 r1(x)
• Is it legal with TO?
• Is it legal with 2PL?
Two sets of schedules are intersecting, but neither is a subset of the other
[Figure: Venn diagram with overlapping sets 2PL and TO, both inside SR]
Failure and Recovery
Failure and consistency
• Transaction failures
• System failures
• Media failures
Principle of recovery
• Redundancy
• Database can be protected by ensuring that its correct state can be reconstructed from information stored redundantly in the system
Recovering database: restart operation
• Bringing the stable DB to a consistent state by removing effects of uncommitted transactions and applying missing effects of committed transactions.
Recovery and Restart
Types of storage media
• Volatile storage: fast, but not surviving system failures
• Non-volatile storage
• Stable storage: information never lost (practically)
Recovery
• Ideally, stable DB should contain, for each data item, the last value written by committed transaction
• Practically, stable DB may contain values written by uncommitted transactions, or may not contain the last committed values.
Why?
1) Updating of uncommitted T
2) Buffering of committed values in the cache
Function of Recovery Manager
Atomicity: Transactions may abort (“Rollback”).
Durability: What if DBMS stops running? (Causes?)
Desired Behavior after system restarts:
– T1, T2 & T3 should be durable.
– T4 & T5 should be aborted (effects not seen).
[Figure: timelines of T1, T2, T3, T4 and T5 drawn against the point of the crash]
Recovery Management
Design rules for recovery manager
• Undo rule: committed values must be saved before they are overwritten by uncommitted values in the stable DB
• Redo rule: before a transaction commits, the new values it wrote must be in stable storage (DB or log)
Restart activity
• Preparation: during normal operation
• Actual recovery: after failure
Preparation
• Logging
• Checkpointing
Cache Manager
Two operations: fetch and flush
• Use dirty bit for deciding flushing operation
• Flush: if the slot in cache is not dirty, do nothing; otherwise, copy the value into stable storage
• Fetch: select a slot, using replacement algorithm if full (and flush if necessary), copy the value into slot, reset dirty bit, update cache directory
When to flush?
• Depends on recovery strategy of the system
• Different recovery algorithms use different strategies
Idempotence of restart
• Any sequence of incomplete executions of restart, followed by a complete execution of restart, has the same effect as just one complete execution
Handling the Memory Pool
Write to disk: force/no-force
Cache page: steal/no-steal
Force every write to disk?
• Poor response time.
• But provides durability.
Steal buffer-pool frames from uncommitted Xacts?
• If not, poor throughput.
• If so, how can we ensure atomicity?

            No Steal    Steal
Force       Trivial
No Force                Desired
More on Steal and Force
STEAL (why enforcing Atomicity is hard)
• To steal frame F: Current page in F (say P) is written to disk; some Xact holds lock on P.
  • What if the Xact with the lock on P aborts?
  • Must remember the old value of P at steal time (to support UNDOing the write to page P).
NO FORCE (why enforcing Durability is hard)
• What if system crashes before a modified page is written to disk?
• Write as little as possible, in a convenient place, at commit time, to support REDOing modifications.
Recovery Algorithms
Undo/redo algorithm
• Most complicated of the four recovery algorithms
• Flexible in deciding when to flush (no-force)
• Maximize efficiency during normal operation at the expense of less efficient recovery
Comparison with other recovery algorithms
• Issues: disk I/O, log space, recovery time
• No-redo requires more frequent flush (force)
• Uncommitted transaction is allowed to replace dirty slot for in-place update; undo might be necessary
Restart procedure
• Process log forward and backward for redo and undo
Undo/Redo Recovery
A transaction T writes value V to data object X. What will happen?
• System fetches X if it is not already in cache
• Record V in the log and in X’s slot C
• No need for the cache manager to flush C
If cache manager replaces C (steal), and either T aborts or system fails before T commits, undo is required
If T commits and system fails before C is flushed (no force), redo is required
Restart Procedure for Undo/Redo Recovery
1. Discard all cache slots
2. Scan the log to analyze which transactions were committed, aborted, or active, to determine data for redo/undo
3. Redo all actions that were committed but not recorded in the stable DB
4. Undo all actions of transactions that were aborted or active at the time of failure
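A restart along these lines can be sketched over a toy log. Several things here are assumptions made for illustration: the log record layout `("write", txn, item, before, after)` and `("commit", txn)`, the use of a dict as the stable DB, and strict executions. The sketch also applies the undo pass before the redo pass, an ordering chosen here so that a redone committed value is never overwritten by a before image.

```python
def restart(stable_db, log):
    """Return the DB state after an undo/redo restart over the given log."""
    db = dict(stable_db)        # cache contents are discarded; start from the stable DB
    # Analysis: a transaction without a commit record was aborted or still active.
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    # Undo, newest first: restore before images of uncommitted writes.
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] not in committed:
            _, _, item, before, _ = rec
            db[item] = before
    # Redo, oldest first: reapply after images of committed writes.
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, _, item, _, after = rec
            db[item] = after
    return db

log = [
    ("write", "T1", "x", 0, 10),
    ("commit", "T1"),
    ("write", "T2", "y", 0, 20),
    ("write", "T2", "x", 10, 30),
]
# At the crash, T1's write to x had not been flushed (no force), while
# T2's uncommitted write to y had already reached the stable DB (steal).
stable_db = {"x": 0, "y": 20}
recovered = restart(stable_db, log)
recovered_again = restart(recovered, log)   # restarting twice should change nothing
```

The recovered state has x = 10 from committed T1 and y back at 0, and running the restart again on that state leaves it unchanged, which is the idempotence property.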