G.Vadivu & P.Subha, Assistant Professor, SRM University, Kattankulathur 8/22/2011School of...

G.Vadivu & P.Subha,Assistant Professor,

SRM University, Kattankulathur

8/22/2011, School of Computing, Department of IT

The contents of the slides are solely for the purpose of teaching students at SRM University. All copyrights and Trademarks of organizations/persons apply even if not specified explicitly.

Speed with which data can be accessed

Cost per unit of data

Reliability: data loss on power failure or system crash; physical failure of the storage device

Can differentiate storage into:

volatile storage: loses contents when power is switched off

non-volatile storage: contents persist even when power is switched off. Includes secondary and tertiary storage, as well as battery-backed-up main memory.

Cache: fastest and most costly form of storage; volatile; managed by the computer system hardware.

Main memory: fast access (10s to 100s of nanoseconds; 1 nanosecond = 10^-9 seconds).

Generally too small (or too expensive) to store the entire database. Capacities of up to a few gigabytes are widely used currently. Capacities have gone up and per-byte costs have decreased steadily and rapidly (roughly a factor of 2 every 2 to 3 years).

Volatile: contents of main memory are usually lost if a power failure or system crash occurs.

Flash memory

Data survives power failure.

Data can be written at a location only once, but the location can be erased and written to again.

Can support only a limited number (10K to 1M) of write/erase cycles.

Erasing of memory has to be done to an entire bank of memory.

Reads are roughly as fast as main memory, but writes are slow (a few microseconds) and erase is slower.

NOR flash: fast reads, very slow erase, lower capacity. Used to store program code in many embedded devices.

NAND flash: page-at-a-time read/write, multi-page erase. High capacity (several GB). Widely used as a data storage mechanism in portable devices.

Magnetic disk

Data is stored on a spinning disk, and read/written magnetically.

Primary medium for the long-term storage of data; typically stores the entire database.

Data must be moved from disk to main memory for access, and written back for storage.

Direct access: possible to read data on disk in any order, unlike magnetic tape.

Survives power failures and system crashes. Disk failure can destroy data; this is rare but does happen.

Optical storage

Non-volatile; data is read optically from a spinning disk using a laser.

CD-ROM (640 MB) and DVD (4.7 to 17 GB) are the most popular forms.

Write-once, read-many (WORM) optical disks are used for archival storage (CD-R, DVD-R, DVD+R).

Multiple-write versions are also available (CD-RW, DVD-RW, DVD+RW, and DVD-RAM).

Reads and writes are slower than with magnetic disk.

Jukebox systems, with large numbers of removable disks, a few drives, and a mechanism for automatic loading/unloading of disks, are available for storing large volumes of data.

Tape storage

Non-volatile; used primarily for backup (to recover from disk failure) and for archival data.

Sequential access: much slower than disk.

Very high capacity (40 to 300 GB tapes available).

Tape can be removed from the drive, so storage costs are much cheaper than disk, but drives are expensive.

Tape jukeboxes are available for storing massive amounts of data: hundreds of terabytes (1 terabyte = 10^12 bytes) to even a petabyte (1 petabyte = 10^15 bytes).

RAID: Redundant Arrays of Independent Disks

Disk organization techniques that manage a large number of disks, providing a view of a single disk of:

high capacity and high speed, by using multiple disks in parallel, and

high reliability, by storing data redundantly, so that data can be recovered even if a disk fails.

The chance that some disk out of a set of N disks will fail is much higher than the chance that a specific single disk will fail.

E.g., a system with 100 disks, each with MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of 1000 hours (approx. 41 days).
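The arithmetic behind this example can be checked with a short sketch. It assumes independent disk failures at a constant rate, so that the array's failure rate is the sum of the per-disk rates; the function name `system_mttf_hours` is made up for this illustration.

```python
def system_mttf_hours(disk_mttf_hours: float, num_disks: int) -> float:
    """Approximate mean time until some disk in the array fails.

    Assumes independent failures at a constant rate (an assumption made
    here for illustration), so the per-disk failure rates simply add up.
    """
    return disk_mttf_hours / num_disks


HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365

array_mttf = system_mttf_hours(100_000, 100)
print(array_mttf)                    # hours until the first expected failure
print(array_mttf / HOURS_PER_DAY)    # the same figure in days
print(100_000 / HOURS_PER_YEAR)      # single-disk MTTF in years
```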

Two main goals of parallelism in a disk system:

1. Load balance multiple small accesses to increase throughput.

2. Parallelize large accesses to reduce response time.

Improve transfer rate by striping data across multiple disks.

Bit-level striping: split the bits of each byte across multiple disks. But seek/access time is worse than for a single disk. Bit-level striping is not used much any more.

Block-level striping: with n disks, block i of a file goes to disk (i mod n) + 1. Requests for different blocks can run in parallel if the blocks reside on different disks. A request for a long sequence of blocks can utilize all disks in parallel.
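The block-to-disk mapping for block-level striping can be sketched in a few lines. The function name `disk_for_block` and the choice of 4 disks and 8 blocks are illustrative assumptions; disks are numbered from 1, as in the formula (i mod n) + 1.

```python
def disk_for_block(i: int, n: int) -> int:
    """Disk (numbered 1..n) that holds block i under block-level striping."""
    return (i % n) + 1


n = 4  # number of disks, chosen only for this example
layout = {}
for block in range(8):
    layout.setdefault(disk_for_block(block, n), []).append(block)

for disk in sorted(layout):
    print(f"disk {disk}: blocks {layout[disk]}")
```

Consecutive blocks land on different disks, which is why a request for a long run of blocks can keep every disk busy at once.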

RAID organizations, or RAID levels, have differing cost, performance and reliability characteristics.

RAID Level 0: Block striping; non-redundant. Used in high-performance applications where data loss is not critical.

RAID Level 1: Mirrored disks with block striping. Offers best write performance. Popular for applications such as storing log files in a database system.

RAID Level 2: Memory-style error-correcting codes (ECC) with bit striping.

RAID Level 3: Bit-interleaved parity.

A single parity bit is enough for error correction, not just detection, because the failed disk is known.

When writing data, corresponding parity bits must also be computed and written to a parity-bit disk.

To recover data in a damaged disk, compute the XOR of bits from the other disks (including the parity-bit disk).

Faster data transfer than with a single disk, but fewer I/Os per second since every disk has to participate in every I/O.

RAID Level 4: Block-interleaved parity; uses block-level striping, and keeps a parity block on a separate disk for corresponding blocks from N other disks.

When writing a data block, the corresponding block of parity bits must also be computed and written to the parity disk.

To find the value of a damaged block, compute the XOR of bits from corresponding blocks (including the parity block) from other disks.

RAID Level 4 (Cont.)

Provides higher I/O rates for independent block reads than Level 3: a block read goes to a single disk, so blocks stored on different disks can be read in parallel.

Before writing a block, parity data must be computed. This can be done by using the old parity block, the old value of the current block and the new value of the current block (2 block reads + 2 block writes), or by recomputing the parity value using the new values of the blocks corresponding to the parity block, which is more efficient for writing large amounts of data sequentially.

The parity block becomes a bottleneck for independent block writes, since every block write also writes to the parity disk.
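A minimal sketch of the XOR parity idea follows: it computes a parity block, rebuilds a lost block from the survivors, and shows that the small-write update (old parity, old block, new block) agrees with a full recomputation. The block contents and the helper name `xor_blocks` are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks (made-up contents) and their parity block.
data = [b"\x0f\xa0", b"\x33\x55", b"\xff\x00"]
parity = xor_blocks(data)

# Recovery: the disk holding data[1] is lost; XOR the survivors with parity.
recovered = xor_blocks([data[0], data[2], parity])

# Small write: update parity from old parity, old block and new block.
new_block = b"\x12\x34"
new_parity = xor_blocks([parity, data[1], new_block])

# Full recomputation from the new block values gives the same parity.
recomputed = xor_blocks([data[0], new_block, data[2]])

print(recovered == data[1], new_parity == recomputed)
```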

RAID Level 5: Block-interleaved distributed parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk.

E.g., with 5 disks, the parity block for the nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks.
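The rotation of the parity block in the 5-disk example can be sketched as follows. The function names are hypothetical; only the formula (n mod 5) + 1 comes from the text.

```python
def parity_disk(n: int, num_disks: int = 5) -> int:
    """Disk (numbered 1..num_disks) holding parity for the nth set of blocks."""
    return (n % num_disks) + 1


def data_disks(n: int, num_disks: int = 5) -> list:
    """Disks that hold the data blocks of the nth set (all but the parity disk)."""
    p = parity_disk(n, num_disks)
    return [d for d in range(1, num_disks + 1) if d != p]


for n in range(6):
    print(n, parity_disk(n), data_disks(n))
```

Because the parity disk changes from one set of blocks to the next, no single disk receives every parity write.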

RAID Level 5 (Cont.)

Higher I/O rates than Level 4. Block writes occur in parallel if the blocks and their parity blocks are on different disks.

Subsumes Level 4: provides the same benefits, but avoids the bottleneck of the parity disk.

RAID Level 6: P+Q redundancy scheme; similar to Level 5, but stores extra redundant information to guard against multiple disk failures. Better reliability than Level 5 at a higher cost; not used as widely.

A transaction is a unit of program execution that accesses and possibly updates various data items.

A transaction must see a consistent database.

During transaction execution the database may be inconsistent.

When the transaction is committed, the database must be consistent.

To ensure integrity of data, the database system must maintain:

Atomicity. Either all operations of the transaction are properly reflected in the database or none are.

Consistency. Execution of a transaction in isolation preserves the consistency of the database.

Isolation. Although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions. That is, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti started, or Tj started execution after Ti finished.

Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.

Transaction to transfer $100 from checking account A to savings account B:

1. read(A)

2. A := A - 100

3. write(A)

4. read(B)

5. B := B + 100

6. write(B)

Consistency requirement – the sum of A and B is unchanged by the execution of the transaction.

Atomicity requirement — if the transaction fails after step 3 and before step 6, the system should ensure that its updates are not reflected in the database, else an inconsistency will result.

Durability requirement — once the user has been notified that the transaction has completed (i.e., the transfer of the $100 has taken place), the updates to the database by the transaction must persist despite failures.

Isolation requirement — if between steps 3 and 6, another transaction is allowed to access the partially updated database, it will see an inconsistent database.
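The atomicity and consistency requirements for the transfer can be illustrated with Python's built-in `sqlite3` module. This is only a sketch under assumptions: the table name `account`, the starting balances, and the `fail_midway` switch that simulates a failure after step 3 are all invented for the example.

```python
import sqlite3


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move `amount` from src to dst; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            if fail_midway:
                raise RuntimeError("simulated failure after step 3")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except RuntimeError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM account"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 500), ("B", 200)])
conn.commit()

failed = transfer(conn, "A", "B", 100, fail_midway=True)
after_failure = balances(conn)   # the debit of A has been rolled back
ok = transfer(conn, "A", "B", 100)
after_success = balances(conn)   # both updates are visible; A + B is unchanged
print(after_failure, after_success)
```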

Active: the initial state; the transaction stays in this state while it is executing.

Partially committed: after the final statement has been executed.

Failed: after the discovery that normal execution can no longer proceed.

Aborted: after the transaction has been rolled back and the database restored to its state prior to the start of the transaction. Two options after it has been aborted:

1) Restart the transaction, only if there was no internal logical error

2) Kill the transaction

Committed: after successful completion.

The shadow-database scheme:

Assume that only one transaction is active at a time.

A pointer called db_pointer always points to the current consistent copy of the database.

All updates are made on a shadow copy of the database, and db_pointer is made to point to the updated shadow copy only after the transaction reaches partial commit and all updated pages have been flushed to disk.

If the transaction fails, the old consistent copy pointed to by db_pointer can be used, and the shadow copy can be deleted.
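The pointer switch at the heart of the shadow-database scheme can be sketched in memory. This is a simplified illustration, not a real implementation: the class name `ShadowDatabase` is invented, the "database" is a Python dictionary, and flushing pages to disk is omitted.

```python
import copy


class ShadowDatabase:
    """In-memory sketch: db_pointer refers to the current consistent copy."""

    def __init__(self, data):
        self.db_pointer = dict(data)

    def run(self, transaction):
        """Apply `transaction` to a shadow copy; switch the pointer on success."""
        shadow = copy.deepcopy(self.db_pointer)
        try:
            transaction(shadow)       # all updates go to the shadow copy
        except Exception:
            return False              # shadow copy is simply discarded
        self.db_pointer = shadow      # single pointer update "commits"
        return True


def good_transfer(db):
    db["A"] -= 100
    db["B"] += 100


def failing_transfer(db):
    db["A"] -= 100
    raise RuntimeError("simulated failure before B is updated")


db = ShadowDatabase({"A": 500, "B": 200})
failed = db.run(failing_transfer)
state_after_failure = dict(db.db_pointer)
committed = db.run(good_transfer)
print(state_after_failure, db.db_pointer)
```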

Multiple transactions are allowed to run concurrently in the system. Advantages are:

Increased processor and disk utilization, leading to better transaction throughput: one transaction can be using the CPU while another is reading from or writing to the disk.

Reduced waiting time for transactions: short transactions need not wait behind long ones.

Schedules: sequences that indicate the chronological order in which instructions of concurrent transactions are executed.

A schedule for a set of transactions must consist of all instructions of those transactions, and must preserve the order in which the instructions appear in each individual transaction.

Let T1 transfer $50 from A to B, and T2 transfer 10% of the balance from A to B. The following is a serial schedule (Schedule 1 in the text), in which T1 is followed by T2.

Let T1 and T2 be the transactions defined previously. The following schedule is not a serial schedule, but it is equivalent to Schedule 1.

The following concurrent schedule does not preserve the value of the sum A + B.

Basic Assumption – Each transaction preserves database consistency.

Thus serial execution of a set of transactions preserves database consistency.

A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Different forms of schedule equivalence give rise to the notions of:

1. conflict serializability

2. view serializability

Instructions li and lj of transactions Ti and Tj respectively, conflict if and only if there exists some item Q accessed by both li and lj, and at least one of these instructions wrote Q.

1. li = read(Q), lj = read(Q). li and lj don't conflict.

2. li = read(Q), lj = write(Q). They conflict.

3. li = write(Q), lj = read(Q). They conflict.

4. li = write(Q), lj = write(Q). They conflict.

If a schedule S can be transformed into a schedule S´ by a series of swaps of non-conflicting instructions, we say that S and S´ are conflict equivalent.

We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule

Example of a schedule that is not conflict serializable:

T3              T4
read(Q)
                write(Q)
write(Q)

We are unable to swap instructions in the above schedule to obtain either the serial schedule < T3, T4 >, or the serial schedule < T4, T3 >.
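One way to check this mechanically is the precedence-graph test, a standard textbook technique that these slides do not describe: draw an edge Ti → Tj whenever an instruction of Ti conflicts with and precedes an instruction of Tj, and report the schedule as conflict serializable exactly when the graph has no cycle. The sketch below assumes a schedule is given as a list of (transaction, operation, item) tuples; all function names are illustrative.

```python
def conflicts(a, b):
    """Two instructions conflict if they belong to different transactions,
    access the same item, and at least one of them is a write."""
    (ta, op_a, qa), (tb, op_b, qb) = a, b
    return ta != tb and qa == qb and "write" in (op_a, op_b)


def precedence_edges(schedule):
    """Edges Ti -> Tj for every conflicting pair where Ti's instruction is first."""
    edges = set()
    for i, a in enumerate(schedule):
        for b in schedule[i + 1:]:
            if conflicts(a, b):
                edges.add((a[0], b[0]))
    return edges


def has_cycle(nodes, edges):
    """Repeatedly remove nodes with no incoming edge; leftovers form a cycle."""
    nodes, edges = set(nodes), set(edges)
    while nodes:
        sources = {n for n in nodes if not any(dst == n for _, dst in edges)}
        if not sources:
            return True
        nodes -= sources
        edges = {(s, d) for s, d in edges if s in nodes and d in nodes}
    return False


def is_conflict_serializable(schedule):
    transactions = {t for t, _, _ in schedule}
    return not has_cycle(transactions, precedence_edges(schedule))


# The T3/T4 schedule from the example above.
interleaved = [("T3", "read", "Q"), ("T4", "write", "Q"), ("T3", "write", "Q")]
# The serial schedule <T3, T4> for comparison.
serial = [("T3", "read", "Q"), ("T3", "write", "Q"), ("T4", "write", "Q")]

print(is_conflict_serializable(interleaved), is_conflict_serializable(serial))
```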

Schedule 3 below can be transformed into Schedule 1, a serial schedule where T2 follows T1, by series of swaps of non-conflicting instructions. Therefore Schedule 3 is conflict serializable.

Let S and S´ be two schedules with the same set of transactions. S and S´ are view equivalent if the following three conditions are met:

1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction Ti must, in schedule S´, also read the initial value of Q.

2. For each data item Q if transaction Ti executes read(Q) in schedule S, and that value was produced by transaction Tj (if any), then transaction Ti must in schedule S´ also read the value of Q that was produced by transaction Tj .

3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in schedule S must perform the final write(Q) operation in schedule S´.

As can be seen, view equivalence is also based purely on reads and writes.

A schedule S is view serializable if it is view equivalent to a serial schedule.

Every conflict serializable schedule is also view serializable.

Schedule 9 (from text): a schedule which is view serializable but not conflict serializable.

Every view serializable schedule that is not conflict serializable has blind writes.

Serializable: the default.

Repeatable read: only committed records can be read, and repeated reads of the same record must return the same value. However, a transaction may not be serializable: it may find some records inserted by a transaction but not find others.

Read committed: only committed records can be read, but successive reads of a record may return different (but committed) values.

Read uncommitted: even uncommitted records may be read.

A lock is a mechanism to control concurrent access to a data item.

Data items can be locked in two modes:

1. exclusive (X) mode. Data item can be both read as well as written. X-lock is requested using lock-X instruction.

2. shared (S) mode. Data item can only be read. S-lock is requested using lock-S instruction.

Lock requests are made to the concurrency-control manager. A transaction can proceed only after the request is granted.

Lock-compatibility matrix

A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions

Any number of transactions can hold shared locks on an item, but if any transaction holds an exclusive lock on the item, no other transaction may hold any lock on the item.

If a lock cannot be granted, the requesting transaction is made to wait till all incompatible locks held by other transactions have been released. The lock is then granted.
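The compatibility rule just described (S is compatible only with S; X is compatible with nothing) can be written down as a small lookup table. The names `COMPATIBLE` and `can_grant` are invented for this sketch.

```python
# Lock-compatibility matrix: (requested mode, mode held by another transaction).
COMPATIBLE = {
    ("S", "S"): True,
    ("S", "X"): False,
    ("X", "S"): False,
    ("X", "X"): False,
}


def can_grant(requested, held_by_others):
    """A lock is granted only if it is compatible with every lock that
    other transactions currently hold on the same item."""
    return all(COMPATIBLE[(requested, held)] for held in held_by_others)


print(can_grant("S", ["S", "S"]))  # shared locks can coexist
print(can_grant("X", ["S"]))       # exclusive request must wait
print(can_grant("X", []))          # no other holders, so it is granted
```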

Example of a transaction performing locking:

T2: lock-S(A);
    read(A);
    unlock(A);
    lock-S(B);
    read(B);
    unlock(B);
    display(A+B)

Locking as above is not sufficient to guarantee serializability: if A and B get updated in between the read of A and the read of B, the displayed sum would be wrong.

A locking protocol is a set of rules followed by all transactions while requesting and releasing locks. Locking protocols restrict the set of possible schedules.

Consider the partial schedule

Neither T3 nor T4 can make progress — executing lock-S(B) causes T4 to wait for T3 to release its lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock on A.

Such a situation is called a deadlock. To handle a deadlock one of T3 or T4 must be rolled back

and its locks released.

The potential for deadlock exists in most locking protocols. Deadlocks are a necessary evil.

Starvation is also possible if the concurrency control manager is badly designed. For example:

A transaction may be waiting for an X-lock on an item, while a sequence of other transactions request and are granted an S-lock on the same item.

The same transaction is repeatedly rolled back due to deadlocks.

Concurrency control manager can be designed to prevent starvation.

The two-phase locking protocol ensures conflict-serializable schedules.

Phase 1: Growing phase. The transaction may obtain locks; it may not release locks.

Phase 2: Shrinking phase. The transaction may release locks; it may not obtain locks.

The protocol assures serializability. It can be proved that the transactions can be serialized in the order of their lock points (i.e., the point where a transaction acquired its final lock).
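Whether a single transaction obeys the two phases can be checked by scanning its instructions: once any unlock has been seen, no further lock may be requested. The sketch below assumes instructions are given as (action, item) pairs with actions such as "lock-S", "lock-X", "unlock" and "read"; the function name is hypothetical.

```python
def follows_two_phase_locking(instructions):
    """Return True if no lock is acquired after the first unlock."""
    shrinking = False
    for action, _item in instructions:
        if action == "unlock":
            shrinking = True          # the shrinking phase has begun
        elif action.startswith("lock") and shrinking:
            return False              # a lock request in the shrinking phase
    return True


# The display(A+B) transaction from the earlier locking example.
not_two_phase = [
    ("lock-S", "A"), ("read", "A"), ("unlock", "A"),
    ("lock-S", "B"), ("read", "B"), ("unlock", "B"),
]
# The same work with all locks acquired before any is released.
two_phase = [
    ("lock-S", "A"), ("read", "A"),
    ("lock-S", "B"), ("read", "B"),
    ("unlock", "A"), ("unlock", "B"),
]

print(follows_two_phase_locking(not_two_phase), follows_two_phase_locking(two_phase))
```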

Two-phase locking does not ensure freedom from deadlocks

Cascading roll-back is possible under two-phase locking. To avoid this, follow a modified protocol called strict two-phase locking. Here a transaction must hold all its exclusive locks till it commits/aborts.

Rigorous two-phase locking is even stricter: here all locks are held till commit/abort. In this protocol transactions can be serialized in the order in which they commit.

There can be conflict serializable schedules that cannot be obtained if two-phase locking is used.

However, in the absence of extra information (e.g., ordering of access to data), two-phase locking is needed for conflict serializability in the following sense:

Given a transaction Ti that does not follow two-phase locking, we can find a transaction Tj that uses two-phase locking, and a schedule for Ti and Tj that is not conflict serializable.

Two-phase locking with lock conversions:

First phase: can acquire a lock-S on an item; can acquire a lock-X on an item; can convert a lock-S to a lock-X (upgrade).

Second phase: can release a lock-S; can release a lock-X; can convert a lock-X to a lock-S (downgrade).

This protocol assures serializability, but still relies on the programmer to insert the various locking instructions.

Consider the following two transactions:

T1: write(X)            T2: write(Y)
    write(Y)                write(X)

Schedule with deadlock:

T1                      T2
lock-X on X
write(X)
                        lock-X on Y
                        write(Y)
                        wait for lock-X on X
wait for lock-X on Y

System is deadlocked if there is a set of transactions such that every transaction in the set is waiting for another transaction in the set.

Deadlock prevention protocols ensure that the system will never enter a deadlock state. Some prevention strategies:

Require that each transaction locks all its data items before it begins execution (predeclaration).

Impose a partial ordering of all data items and require that a transaction can lock data items only in the order specified by the partial order (graph-based protocol).

Following schemes use transaction timestamps for the sake of deadlock prevention alone.

wait-die scheme (non-preemptive):

An older transaction may wait for a younger one to release a data item. Younger transactions never wait for older ones; they are rolled back instead.

A transaction may die several times before acquiring the needed data item.

wound-wait scheme (preemptive):

An older transaction wounds (forces rollback of) a younger transaction instead of waiting for it. Younger transactions may wait for older ones.

There may be fewer rollbacks than in the wait-die scheme.

In both the wait-die and wound-wait schemes, a rolled-back transaction is restarted with its original timestamp. Older transactions thus have precedence over newer ones, and starvation is hence avoided.
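The two timestamp rules can be summarized as decision functions. In this sketch a smaller timestamp means an older transaction; the function names and the returned strings are assumptions made for illustration.

```python
def wait_die(requester_ts, holder_ts):
    """Non-preemptive: an older requester waits, a younger requester dies."""
    if requester_ts < holder_ts:
        return "requester waits"
    return "requester is rolled back"


def wound_wait(requester_ts, holder_ts):
    """Preemptive: an older requester wounds the holder, a younger one waits."""
    if requester_ts < holder_ts:
        return "holder is rolled back"
    return "requester waits"


# Transaction with timestamp 5 is older than the one with timestamp 9.
print(wait_die(5, 9), "|", wait_die(9, 5))
print(wound_wait(5, 9), "|", wound_wait(9, 5))
```

In both schemes it is always the younger transaction that is rolled back; they differ only in which request triggers the rollback.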

Timeout-based schemes:

A transaction waits for a lock only for a specified amount of time. After that, the wait times out and the transaction is rolled back.

Thus deadlocks are not possible.

Simple to implement, but starvation is possible. It is also difficult to determine a good value for the timeout interval.

Deadlocks can be described as a wait-for graph, which consists of a pair G = (V, E):

V is a set of vertices (all the transactions in the system).

E is a set of edges; each element is an ordered pair Ti → Tj.

If Ti → Tj is in E, then there is a directed edge from Ti to Tj, implying that Ti is waiting for Tj to release a data item.

When Ti requests a data item currently being held by Tj, the edge Ti → Tj is inserted in the wait-for graph. This edge is removed only when Tj is no longer holding a data item needed by Ti.

The system is in a deadlock state if and only if the wait-for graph has a cycle. Must invoke a deadlock-detection algorithm periodically to look for cycles.
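A deadlock-detection pass amounts to a cycle search in the wait-for graph. The sketch below uses a depth-first search and assumes the graph is stored as a dictionary mapping each transaction to the set of transactions it is waiting for; the representation and the function name are illustrative choices.

```python
def has_deadlock(wait_for):
    """Return True if the wait-for graph contains a directed cycle."""
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on current path, finished
    color = {}

    def visit(t):
        color[t] = GREY
        for u in wait_for.get(t, ()):
            state = color.get(u, WHITE)
            if state == GREY:         # back edge: u is on the current path
                return True
            if state == WHITE and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(color.get(t, WHITE) == WHITE and visit(t) for t in list(wait_for))


no_cycle = {"T1": {"T2"}, "T2": {"T3"}, "T3": set()}
with_cycle = {"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}}

print(has_deadlock(no_cycle), has_deadlock(with_cycle))
```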

[Figures: a wait-for graph without a cycle; a wait-for graph with a cycle]

When a deadlock is detected:

Some transaction will have to be rolled back (made a victim) to break the deadlock. Select as victim the transaction that will incur minimum cost.

Rollback: determine how far to roll back the transaction. Total rollback: abort the transaction and then restart it. It is more effective to roll back the transaction only as far as necessary to break the deadlock.

Starvation happens if the same transaction is always chosen as victim. Include the number of rollbacks in the cost factor to avoid starvation.

Transaction failure:

Logical errors: the transaction cannot complete due to some internal error condition.

System errors: the database system must terminate an active transaction due to an error condition (e.g., deadlock).

System crash: a power failure or other hardware or software failure causes the system to crash.

Fail-stop assumption: non-volatile storage contents are assumed not to be corrupted by a system crash. Database systems have numerous integrity checks to prevent corruption of disk data.

Disk failure: a head crash or similar disk failure destroys all or part of disk storage.

Destruction is assumed to be detectable: disk drives use checksums to detect failures.

Recovery algorithms are techniques to ensure database consistency and transaction atomicity and durability despite failures.

Recovery algorithms have two parts

1. Actions taken during normal transaction processing to ensure enough information exists to recover from failures

2. Actions taken after a failure to recover the database contents to a state that ensures atomicity, consistency and durability

Parallel machines are becoming quite common and affordable.

Prices of microprocessors, memory and disks have dropped sharply.

Recent desktop computers feature multiple processors, and this trend is projected to accelerate.

Databases are growing increasingly large:

large volumes of transaction data are collected and stored for later analysis;

multimedia objects like images are increasingly stored in databases.

Large-scale parallel database systems are increasingly used for:

storing large volumes of data;

processing time-consuming decision-support queries;

providing high throughput for transaction processing.

Data can be partitioned across multiple disks for parallel I/O.

Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned and each processor can work independently on its own partition.

Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.

Different queries can be run in parallel with each other. Concurrency control takes care of conflicts.

Thus, databases naturally lend themselves to parallelism.

Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.

Horizontal partitioning: tuples of a relation are divided among many disks such that each tuple resides on one disk.

Partitioning techniques (number of disks = n):

Round-robin: send the ith tuple inserted in the relation to disk i mod n.

Hash partitioning: choose one or more attributes as the partitioning attributes. Choose a hash function h with range 0 … n - 1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple. Send the tuple to disk i.

Partitioning techniques (cont.):

Range partitioning: choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., vn-2] is chosen.

Let v be the partitioning attribute value of a tuple. Tuples such that vi ≤ v < vi+1 go to disk i + 1. Tuples with v < v0 go to disk 0 and tuples with v ≥ vn-2 go to disk n-1.

E.g., with a partitioning vector [5,11], a tuple with partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
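The three partitioning techniques can be sketched as functions that map a tuple to a disk number in 0 … n-1. The function names are invented, and the CRC-32 checksum is used as the hash function h only because it gives the same result on every run; any hash function with the right range would do.

```python
import zlib
from bisect import bisect_right


def round_robin_disk(i: int, n: int) -> int:
    """Disk for the ith inserted tuple."""
    return i % n


def hash_disk(value, n: int) -> int:
    """Disk chosen by hashing the partitioning attribute value.

    CRC-32 stands in for the hash function h (an assumption of this sketch).
    """
    return zlib.crc32(str(value).encode("utf-8")) % n


def range_disk(v, vector) -> int:
    """Disk chosen by a sorted partitioning vector [v0, ..., vn-2].

    bisect_right counts the vector entries that are <= v, which yields
    disk 0 for v < v0, disk i + 1 for vi <= v < vi+1, and disk n-1 otherwise.
    """
    return bisect_right(vector, v)


vector = [5, 11]
print([range_disk(v, vector) for v in (2, 8, 20)])
print([round_robin_disk(i, 3) for i in range(6)])
print(hash_disk("Hillside", 3))
```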

DDBS: Multiple logically interrelated databases distributed over a computer network

A distributed database system consists of loosely coupled sites that share no physical component

Database systems that run on each site are independent of each other.

Transactions may access data at one or more sites.

In a homogeneous distributed database:

All sites have identical software.

Sites are aware of each other and agree to cooperate in processing user requests.

Each site surrenders part of its autonomy in terms of the right to change schemas or software.

The system appears to the user as a single system.

In a heterogeneous distributed database:

Different sites may use different schemas and software. Difference in schema is a major problem for query processing; difference in software is a major problem for transaction processing.

Sites may not be aware of each other and may provide only limited facilities for cooperation in transaction processing.

System maintains multiple copies of data, stored in different sites, for faster retrieval and fault tolerance

A relation or fragment of a relation is replicated if it is stored redundantly in two or more sites.

Full replication of a relation is the case where the relation is stored at all sites.

Fully redundant databases are those in which every site contains a copy of the entire database.

Advantages of replication:

Availability: failure of a site containing relation r does not result in unavailability of r if replicas exist.

Parallelism: queries on r may be processed by several nodes in parallel.

Reduced data transfer: relation r is available locally at each site containing a replica of r.

Disadvantages of replication:

Increased cost of updates: each replica of relation r must be updated.

Increased complexity of concurrency control: concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented.

One solution: choose one copy as the primary copy and apply concurrency control operations on the primary copy.

Division of relation r into fragments r1, r2, …, rn which contain sufficient information to reconstruct relation r.

Horizontal fragmentation: each tuple of r is assigned to one or more fragments

Vertical fragmentation: the schema for relation r is split into several smaller schemas.

All schemas must contain a common candidate key (or superkey) to ensure the lossless-join property.

A special attribute, the tuple-id attribute, may be added to each schema to serve as a candidate key.

Example (horizontal fragmentation): relation account with the following schema:

Account = (account_number, branch_name, balance)

account1 = σ branch_name = “Hillside” (account)

branch_name   account_number   balance
Hillside      A-305            500
Hillside      A-226            336
Hillside      A-155            62

account2 = σ branch_name = “Valleyview” (account)

branch_name   account_number   balance
Valleyview    A-177            205
Valleyview    A-402            10000
Valleyview    A-408            1123
Valleyview    A-639            750

Example (vertical fragmentation): relation employee_info

deposit1 = Π branch_name, customer_name, tuple_id (employee_info)

branch_name   customer_name   tuple_id
Hillside      Lowman          1
Hillside      Camp            2
Valleyview    Camp            3
Valleyview    Kahn            4
Hillside      Kahn            5
Valleyview    Kahn            6
Valleyview    Green           7

deposit2 = Π account_number, balance, tuple_id (employee_info)

account_number   balance   tuple_id
A-305            500       1
A-226            336       2
A-177            205       3
A-402            10000     4
A-155            62        5
A-408            1123      6
A-639            750       7
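Reconstruction from fragments can be sketched with plain Python lists: horizontal fragments are recombined by union, and vertical fragments by a join on tuple_id. The data are the values of the example fragments above; the column order chosen for the rebuilt employee_info rows is an assumption of this sketch.

```python
# Horizontal fragments of account: (branch_name, account_number, balance).
account1 = [("Hillside", "A-305", 500), ("Hillside", "A-226", 336),
            ("Hillside", "A-155", 62)]
account2 = [("Valleyview", "A-177", 205), ("Valleyview", "A-402", 10000),
            ("Valleyview", "A-408", 1123), ("Valleyview", "A-639", 750)]

# Reconstruct account as the union of its disjoint horizontal fragments.
account = account1 + account2

# Vertical fragments of employee_info, both carrying tuple_id.
deposit1 = [("Hillside", "Lowman", 1), ("Hillside", "Camp", 2),
            ("Valleyview", "Camp", 3), ("Valleyview", "Kahn", 4),
            ("Hillside", "Kahn", 5), ("Valleyview", "Kahn", 6),
            ("Valleyview", "Green", 7)]
deposit2 = [("A-305", 500, 1), ("A-226", 336, 2), ("A-177", 205, 3),
            ("A-402", 10000, 4), ("A-155", 62, 5), ("A-408", 1123, 6),
            ("A-639", 750, 7)]

# Reconstruct employee_info by joining the fragments on tuple_id.
by_tuple_id = {tid: (acct, bal) for acct, bal, tid in deposit2}
employee_info = [
    (branch, customer, *by_tuple_id[tid], tid)
    for branch, customer, tid in deposit1
]

for row in employee_info:
    print(row)
```

Because tuple_id is a candidate key in both vertical fragments, each row of deposit1 matches exactly one row of deposit2, so the join loses nothing and adds nothing.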

Horizontal:

allows parallel processing on fragments of a relation;

allows a relation to be split so that tuples are located where they are most frequently accessed.

Vertical:

allows tuples to be split so that each part of the tuple is stored where it is most frequently accessed;

the tuple-id attribute allows efficient joining of vertical fragments.

Vertical and horizontal fragmentation can be mixed. Fragments may be successively fragmented to an arbitrary depth.

Replication and fragmentation can be combined: a relation is partitioned into several fragments, and the system maintains several identical replicas of each such fragment.

1. Raghu Ramakrishnan, Johannes Gehrke, Database Management Systems, McGraw-Hill, 3rd Edition, 2003.

2. Elmasri & Navathe, Fundamentals of Database Systems, Addison-Wesley, 3rd Edition, 2000.

3. C. J. Date, An Introduction to Database Systems, Addison-Wesley, 7th Edition, 2001.

4. Jeffrey D. Ullman, Jennifer Widom, A First Course in Database Systems, Prentice Hall, AWL, 1st Edition, 2001.

5. Peter Rob, Carlos Coronel, Database Systems: Design, Implementation, and Management, 4th Edition, Thomson Learning, 2001.

Define flash memory.

Define log disk.

What is meant by log-based file system?

Define a transaction.

List the required properties of a transaction to ensure integrity of the data.

What is meant by cascading rollback?

Define concurrency control.

Define locking protocol.

Define cache coherency.

Define parallel aggregation.

Define query optimization.

Define fuzzy checkpoint.

What is meant by the write-ahead logging (WAL) rule?
