G.Vadivu & P.Subha, Assistant Professor, SRM University, Kattankulathur 8/22/2011School of...
-
Upload
jesus-gilbert -
Category
Documents
-
view
215 -
download
0
Transcript of G.Vadivu & P.Subha, Assistant Professor, SRM University, Kattankulathur 8/22/2011School of...
G.Vadivu & P.Subha,Assistant Professor,
SRM University, Kattankulathur
8/22/2011School of Computing, Department of IT
The contents of the slides are solely for the purpose of teaching students at SRM University. All copyrights and Trademarks of organizations/persons apply even if not specified explicitly.
Speed with which data can be accessed Cost per unit of data Reliability
data loss on power failure or system crash physical failure of the storage device
Can differentiate storage into: volatile storage: loses contents when power is switched off non-volatile storage:
Contents persist even when power is switched off. Includes secondary and tertiary storage, as well as battery-backed up
main-memory.
Cache – fastest and most costly form of storage; volatile; managed by the computer system hardware
Main memory: fast access (10s to 100s of nanoseconds; 1 nanosecond = 10–9 seconds) generally too small (or too expensive) to store the entire database
capacities of up to a few Gigabytes widely used currently Capacities have gone up and per-byte costs have decreased steadily and
rapidly (roughly factor of 2 every 2 to 3 years) Volatile — contents of main memory are usually lost if a power failure or
system crash occurs.
Flash memory Data survives power failure Data can be written at a location only once, but location can be erased
and written to again Can support only a limited number (10K – 1M) of write/erase
cycles. Erasing of memory has to be done to an entire bank of memory
Reads are roughly as fast as main memory But writes are slow (few microseconds), erase is slower
Flash memory NOR Flash
Fast reads, very slow erase, lower capacity Used to store program code in many embedded devices
NAND Flash Page-at-a-time read/write, multi-page erase High capacity (several GB) Widely used as data storage mechanism in portable devices
Magnetic-disk Data is stored on spinning disk, and read/written magnetically Primary medium for the long-term storage of data; typically stores
entire database. Data must be moved from disk to main memory for access, and
written back for storage direct-access – possible to read data on disk in any order, unlike
magnetic tape Survives power failures and system crashes
disk failure can destroy data: is rare but does happen
Optical storage non-volatile, data is read optically from a spinning disk using a
laser CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular
forms Write-one, read-many (WORM) optical disks used for archival
storage (CD-R, DVD-R, DVD+R) Multiple write versions also available (CD-RW, DVD-RW,
DVD+RW, and DVD-RAM) Reads and writes are slower than with magnetic disk Juke-box systems, with large numbers of removable disks, a few
drives, and a mechanism for automatic loading/unloading of disks available for storing large volumes of data
Tape storage non-volatile, used primarily for backup (to recover from disk
failure), and for archival data sequential-access – much slower than disk very high capacity (40 to 300 GB tapes available) tape can be removed from drive storage costs much cheaper
than disk, but drives are expensive Tape jukeboxes available for storing massive amounts of data
hundreds of terabytes (1 terabyte = 109 bytes) to even a petabyte (1 petabyte = 1012 bytes)
RAID: Redundant Arrays of Independent Disks disk organization techniques that manage a large numbers of
disks, providing a view of a single disk of high capacity and high speed by using multiple disks in
parallel, and high reliability by storing data redundantly, so that data can be
recovered even if a disk fails
The chance that some disk out of a set of N disks will fail is much higher than the chance that a specific single disk will fail. E.g., a system with 100 disks, each with MTTF of 100,000
hours (approx. 11 years), will have a system MTTF of 1000 hours (approx. 41 days)
Two main goals of parallelism in a disk system:
1. Load balance multiple small accesses to increase throughput
2. Parallelize large accesses to reduce response time. Improve transfer rate by striping data across multiple disks. Bit-level striping – split the bits of each byte across multiple disks
But seek/access time worse than for a single disk Bit level striping is not used much any more
Block-level striping – with n disks, block i of a file goes to disk (i mod n) + 1 Requests for different blocks can run in parallel if the blocks
reside on different disks A request for a long sequence of blocks can utilize all disks in
parallel
RAID organizations, or RAID levels, have differing cost, performance and reliability characteristics
RAID Level 0: Block striping; non-redundant. Used in high-performance applications where data lost is not
critical. RAID Level 1: Mirrored disks with block striping
Offers best write performance. Popular for applications such as storing log files in a database
system.
RAID Level 2: Memory-Style Error-Correcting-Codes (ECC) with bit striping.
RAID Level 3: Bit-Interleaved Parity a single parity bit is enough for error correction, not just
detection When writing data, corresponding parity bits must also be
computed and written to a parity bit disk To recover data in a damaged disk, compute XOR of bits from
other disks (including parity bit disk)
RAID Level 3 (Cont.) Faster data transfer than with a single disk, but fewer I/Os per
second since every disk has to participate in every I/O. RAID Level 4: Block-Interleaved Parity; uses block-level striping,
and keeps a parity block on a separate disk for corresponding blocks from N other disks. When writing data block, corresponding block of parity bits
must also be computed and written to parity disk To find value of a damaged block, compute XOR of bits from
corresponding blocks (including parity block) from other disks.
RAID Level 4 (Cont.) Provides higher I/O rates for independent block reads than Level 3
block read goes to a single disk, so blocks stored on different disks can be read in parallel
Before writing a block, parity data must be computed Can be done by using old parity block, old value of current block
and new value of current block (2 block reads + 2 block writes) Or by recomputing the parity value using the new values of
blocks corresponding to the parity blockMore efficient for writing large amounts of data sequentially
Parity block becomes a bottleneck for independent block writes since every block write also writes to parity disk
RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk. E.g., with 5 disks, parity block for nth set of blocks is stored on
disk (n mod 5) + 1, with the data blocks stored on the other 4 disks.
RAID Level 5 (Cont.) Higher I/O rates than Level 4.
Block writes occur in parallel if the blocks and their parity blocks are on different disks.
Subsumes Level 4: provides same benefits, but avoids bottleneck of parity disk.
RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores extra redundant information to guard against multiple disk failures. Better reliability than Level 5 at a higher cost; not used as
widely.
A transaction is a unit of program execution that
accesses and possibly updates various data items. A transaction must see a consistent database. During transaction execution the database may be inconsistent. When the transaction is committed, the database must be
consistent.
To ensure integrity of data, the database system must maintain: Atomicity. Either all operations of the transaction are properly
reflected in the database or none are. Consistency. Execution of a transaction in isolation preserves the
consistency of the database. Isolation. Although multiple transactions may execute
concurrently, each transaction must be unaware of other concurrently executing transactions. That is, for every pair of transactions Ti and Tj, it appears to Ti
that either Tj, finished execution before Ti started, or Tj started execution after Ti finished.
Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.
Transaction to transfer $100 from Checking account A to Saving account B:1. read(A)2. A := A – 1003. write(A)4. read(B)5. B := B + 1006. write(B)
Consistency requirement – the sum of A and B is unchanged by the execution of the transaction.
Atomicity requirement — if the transaction fails after step 3 and before step 6, the system should ensure that its updates are not reflected in the database, else an inconsistency will result.
Durability requirement — once the user has been notified that the transaction has completed (i.e., the transfer of the $100 has taken place), the updates to the database by the transaction must persist despite failures.
Isolation requirement — if between steps 3 and 6, another transaction is allowed to access the partially updated database, it will see an inconsistent database.
Active, the initial state; the transaction stays in this state while it is executing
Partially committed, after the final statement has been executed. Failed, after the discovery that normal execution can no longer
proceed. Aborted, after the transaction has been rolled back and the
database restored to its state prior to the start of the transaction.
1) Restart the transaction – only if no internal logical error
2) kill the transaction
Committed, after successful completion..
The shadow-database scheme:assume that only one transaction is active at a time.a pointer called db_pointer always points to the
current consistent copy of the database.all updates are made on a shadow copy of the
database, and db_pointer is made to point to the updated shadow copy only after the transaction reaches partial commit and all updated pages have been flushed to disk.
in case transaction fails, old consistent copy pointed to by db_pointer can be used, and the shadow copy can be deleted.
Multiple transactions are allowed to run concurrently in the system. Advantages are: increased processor and disk utilization, leading to
better transaction throughput: one transaction can be using the CPU while another is reading from or writing to the disk
reduced waiting time for transactions: short transactions need not wait behind long ones.
Schedules – sequences that indicate the chronological order in which instructions of concurrent transactions are executed a schedule for a set of transactions must consist of all
instructions of those transactions must preserve the order in which the instructions appear in each
individual transaction.
Let T1 transfer $50 from A to B, and T2 transfer 10% of the balance from A to B. The following is a serial schedule (Schedule 1 in the text), in which T1 is followed by T2.
Let T1 and T2 be the transactions defined previously. The following schedule is not a serial schedule, but it is equivalent to Schedule 1.
The following concurrent schedule does not preserve the value of the the sum A + B.
Basic Assumption – Each transaction preserves database consistency.
Thus serial execution of a set of transactions preserves database consistency.
A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Different forms of schedule equivalence give rise to the notions of:
1. conflict serializability
2. view serializability
Instructions li and lj of transactions Ti and Tj respectively, conflict if and only if there exists some item Q accessed by both li and lj, and at least one of these instructions wrote Q.
1. li = read(Q), lj = read(Q). li and lj don’t conflict.2. li = read(Q), lj = write(Q). They conflict.3. li = write(Q), lj = read(Q). They conflict4. li = write(Q), lj = write(Q). They conflict
If a schedule S can be transformed into a schedule S´ by a series of swaps of non-conflicting instructions, we say that S and S´ are conflict equivalent.
We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule
Example of a schedule that is not conflict serializable:
T3 T4
read(Q)write(Q)
write(Q)
We are unable to swap instructions in the above schedule to obtain either the serial schedule < T3, T4 >, or the serial schedule < T4, T3 >.
Schedule 3 below can be transformed into Schedule 1, a serial schedule where T2 follows T1, by series of swaps of non-conflicting instructions. Therefore Schedule 3 is conflict serializable.
Let S and S´ be two schedules with the same set of transactions. S and S´ are view equivalent if the following three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction Ti must, in schedule S´, also read the initial value of Q.
2. For each data item Q if transaction Ti executes read(Q) in schedule S, and that value was produced by transaction Tj (if any), then transaction Ti must in schedule S´ also read the value of Q that was produced by transaction Tj .
3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in schedule S must perform the final write(Q) operation in schedule S´.
As can be seen, view equivalence is also based purely on readsand writes alone.
A schedule S is view serializable it is view equivalent to a serial schedule.
Every conflict serializable schedule is also view serializable. Schedule 9 (from text) — a schedule which is view-serializable
but not conflict serializable.Every view serializable schedule that is not conflict serializable has blind writes.
Serializable — default Repeatable read — only committed records to be read, repeated reads of same
record must return same value. However, a transaction may not be serializable – it may find some records inserted by a transaction but not find others.
Read committed — only committed records can be read, but successive reads of record may return different (but committed) values.
Read uncommitted — even uncommitted records may be read.
A lock is a mechanism to control concurrent access to a data item Data items can be locked in two modes :
1. exclusive (X) mode. Data item can be both read as well as written. X-lock is requested using lock-X instruction.
2. shared (S) mode. Data item can only be read. S-lock is requested using lock-S instruction. Lock requests are made to concurrency-control manager. Transaction can
proceed only after request is granted.
Lock-compatibility matrix
A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions
Any number of transactions can hold shared locks on an item, but if any transaction holds an exclusive on the item no other
transaction may hold any lock on the item. If a lock cannot be granted, the requesting transaction is made to wait
till all incompatible locks held by other transactions have been released. The lock is then granted.
Example of a transaction performing locking:
T2: lock-S(A);
read (A);
unlock(A);
lock-S(B);
read (B);
unlock(B);
display(A+B) Locking as above is not sufficient to guarantee serializability — if A and B
get updated in-between the read of A and B, the displayed sum would be wrong.
A locking protocol is a set of rules followed by all transactions while requesting and releasing locks. Locking protocols restrict the set of possible schedules.
Consider the partial schedule
Neither T3 nor T4 can make progress — executing lock-S(B) causes T4 to wait for T3 to release its lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock on A.
Such a situation is called a deadlock. To handle a deadlock one of T3 or T4 must be rolled back
and its locks released.
The potential for deadlock exists in most locking protocols. Deadlocks are a necessary evil.
Starvation is also possible if concurrency control manager is badly designed. For example: A transaction may be waiting for an X-lock on an item, while a sequence
of other transactions request and are granted an S-lock on the same item. The same transaction is repeatedly rolled back due to deadlocks.
Concurrency control manager can be designed to prevent starvation.
This is a protocol which ensures conflict-serializable schedules. Phase 1: Growing Phase
transaction may obtain locks transaction may not release locks
Phase 2: Shrinking Phase transaction may release locks transaction may not obtain locks
The protocol assures serializability. It can be proved that the transactions
can be serialized in the order of their lock points (i.e. the point where a
transaction acquired its final lock).
Two-phase locking does not ensure freedom from deadlocks
Cascading roll-back is possible under two-phase locking. To avoid this, follow a modified protocol called strict two-phase locking. Here a transaction must hold all its exclusive locks till it commits/aborts.
Rigorous two-phase locking is even stricter: here all locks are held till commit/abort. In this protocol transactions can be serialized in the order in which they commit.
There can be conflict serializable schedules that cannot be obtained if two-phase locking is used.
However, in the absence of extra information (e.g., ordering of access to data), two-phase locking is needed for conflict serializability in the following sense:
Given a transaction Ti that does not follow two-phase locking, we can find a transaction Tj that uses two-phase locking, and a schedule for Ti and Tj that is not conflict serializable.
Two-phase locking with lock conversions:
– First Phase: can acquire a lock-S on item can acquire a lock-X on item can convert a lock-S to a lock-X (upgrade)
– Second Phase: can release a lock-S can release a lock-X can convert a lock-X to a lock-S (downgrade)
This protocol assures serializability. But still relies on the programmer to insert the various locking instructions.
Consider the following two transactions:
T1: write (X) T2: write(Y)
write(Y) write(X) Schedule with deadlock
T1 T2
lock-X on Xwrite (X)
lock-X on Ywrite (X) wait for lock-X on X
wait for lock-X on Y
System is deadlocked if there is a set of transactions such that every transaction in the set is waiting for another transaction in the set.
Deadlock prevention protocols ensure that the system will never enter into a deadlock state. Some prevention strategies : Require that each transaction locks all its data items before it begins
execution (predeclaration). Impose partial ordering of all data items and require that a transaction
can lock data items only in the order specified by the partial order (graph-based protocol).
Following schemes use transaction timestamps for the sake of deadlock prevention alone.
wait-die scheme — non-preemptive older transaction may wait for younger one to release data item.
Younger transactions never wait for older ones; they are rolled back instead.
a transaction may die several times before acquiring needed data item wound-wait scheme — preemptive
older transaction wounds (forces rollback) of younger transaction instead of waiting for it. Younger transactions may wait for older ones.
may be fewer rollbacks than wait-die scheme.
Both in wait-die and in wound-wait schemes, a rolled back transactions is restarted with its original timestamp. Older transactions thus have precedence over newer ones, and starvation is hence avoided.
Timeout-Based Schemes : a transaction waits for a lock only for a specified amount of time.
After that, the wait times out and the transaction is rolled back. thus deadlocks are not possible simple to implement; but starvation is possible. Also difficult to
determine good value of the timeout interval.
Deadlocks can be described as a wait-for graph, which consists of a pair G = (V,E), V is a set of vertices (all the transactions in the system) E is a set of edges; each element is an ordered pair Ti Tj.
If Ti Tj is in E, then there is a directed edge from Ti to Tj, implying that Ti is waiting for Tj to release a data item.
When Ti requests a data item currently being held by Tj, then the edge Ti Tj is inserted in the wait-for graph. This edge is removed only when Tj is no longer holding a data item needed by Ti.
The system is in a deadlock state if and only if the wait-for graph has a cycle. Must invoke a deadlock-detection algorithm periodically to look for cycles.
Wait-for graph without a cycle Wait-for graph with a cycle
When deadlock is detected : Some transaction will have to rolled back (made a victim) to break
deadlock. Select that transaction as victim that will incur minimum cost.
Rollback -- determine how far to roll back transaction Total rollback: Abort the transaction and then restart it. More effective to roll back transaction only as far as necessary to
break deadlock. Starvation happens if same transaction is always chosen as victim.
Include the number of rollbacks in the cost factor to avoid starvation
Transaction failure : Logical errors: transaction cannot complete due to some internal error
condition System errors: the database system must terminate an active
transaction due to an error condition (e.g., deadlock) System crash: a power failure or other hardware or software failure
causes the system to crash. Fail-stop assumption: non-volatile storage contents are assumed to
not be corrupted by system crash Database systems have numerous integrity checks to prevent
corruption of disk data Disk failure: a head crash or similar disk failure destroys all or part of
disk storage Destruction is assumed to be detectable: disk drives use checksums to
detect failures
Recovery algorithms are techniques to ensure database consistency and transaction atomicity and durability despite failures Focus of this chapter
Recovery algorithms have two parts
1. Actions taken during normal transaction processing to ensure enough information exists to recover from failures
2. Actions taken after a failure to recover the database contents to a state that ensures atomicity, consistency and durability
Parallel machines are becoming quite common and affordable Prices of microprocessors, memory and disks have dropped sharply Recent desktop computers feature multiple processors and this trend is
projected to accelerate Databases are growing increasingly large
large volumes of transaction data are collected and stored for later analysis.
multimedia objects like images are increasingly stored in databases Large-scale parallel database systems increasingly used for:
storing large volumes of data processing time-consuming decision-support queries providing high throughput for transaction processing
Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be
executed in parallel data can be partitioned and each processor can work independently on
its own partition. Queries are expressed in high level language (SQL, translated to
relational algebra) makes parallelization easier.
Different queries can be run in parallel with each other. Concurrency control takes care of conflicts.
Thus, databases naturally lend themselves to parallelism.
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks. Horizontal partitioning – tuples of a relation are divided among many disks
such that each tuple resides on one disk. Partitioning techniques (number of disks = n):
Round-robin:
Send the ith tuple inserted in the relation to disk i mod n.
Hash partitioning: Choose one or more attributes as the partitioning attributes. Choose hash function h with range 0…n - 1 Let i denote result of hash function h applied to the partitioning
attribute value of a tuple. Send tuple to disk i.
Partitioning techniques (cont.): Range partitioning:
Choose an attribute as the partitioning attribute. A partitioning vector [vo, v1, ..., vn-2] is chosen.
Let v be the partitioning attribute value of a tuple. Tuples such that vi vi+1 go to disk I + 1. Tuples with v < v0 go to disk 0 and tuples with v vn-2 go to disk n-1.
E.g., with a partitioning vector [5,11], a tuple with partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk2.
DDBS: Multiple logically interrelated databases distributed over a computer network
A distributed database system consists of loosely coupled sites that share no physical component
Database systems that run on each site are independent of each other Transactions may access data at one or more sites
In a homogeneous distributed database All sites have identical software Are aware of each other and agree to cooperate in processing user requests. Each site surrenders part of its autonomy in terms of right to change
schemas or software Appears to user as a single system
In a heterogeneous distributed database Different sites may use different schemas and software
Difference in schema is a major problem for query processing Difference in software is a major problem for transaction processing
Sites may not be aware of each other and may provide only limited facilities for cooperation in transaction processing
System maintains multiple copies of data, stored in different sites, for faster retrieval and fault tolerance
A relation or fragment of a relation is replicated if it is stored redundantly in two or more sites.
Full replication of a relation is the case where the relation is stored at all sites.
Fully redundant databases are those in which every site contains a copy of the entire database.
Advantages of Replication Availability: failure of site containing relation r does not result in
unavailability of r is replicas exist. Parallelism: queries on r may be processed by several nodes in parallel. Reduced data transfer: relation r is available locally at each site
containing a replica of r. Disadvantages of Replication
Increased cost of updates: each replica of relation r must be updated. Increased complexity of concurrency control: concurrent updates to distinct
replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented. One solution: choose one copy as primary copy and apply concurrency
control operations on primary copy
Division of relation r into fragments r1, r2, …, rn which contain sufficient information to reconstruct relation r.
Horizontal fragmentation: each tuple of r is assigned to one or more fragments
Vertical fragmentation: the schema for relation r is split into several smaller schemas All schemas must contain a common candidate key (or superkey) to ensure
lossless join property. A special attribute, the tuple-id attribute may be added to each schema to
serve as a candidate key. Example : relation account with following schema Account = (account_number, branch_name , balance )
branch_nameaccount_number balance
HillsideHillsideHillside
A-305A-226A-155
50033662
account1 = branch_name=“Hillside” (account )
branch_nameaccount_number balance
ValleyviewValleyviewValleyviewValleyview
A-177A-402A-408A-639
205100001123750
account2 = branch_name=“Valleyview” (account )
branch_name customer_name tuple_id
HillsideHillsideValleyviewValleyviewHillsideValleyviewValleyview
LowmanCampCampKahnKahnKahnGreen
deposit1 = branch_name, customer_name, tuple_id (employee_info )
1234567
account_number balance tuple_id
50033620510000621123750
1234567
A-305A-226A-177A-402A-155A-408A-639
deposit2 = account_number, balance, tuple_id (employee_info )
Horizontal: allows parallel processing on fragments of a relation allows a relation to be split so that tuples are located where they are
most frequently accessed Vertical:
allows tuples to be split so that each part of the tuple is stored where it is most frequently accessed
tuple-id attribute allows efficient joining of vertical fragments Vertical and horizontal fragmentation can be mixed.
Fragments may be successively fragmented to an arbitrary depth. Replication and fragmentation can be combined
Relation is partitioned into several fragments: system maintains several identical replicas of each such fragment.
1. Raghu Ramakrishnan, Johannes Gehrke, Database Management System, McGraw Hill., 3rd Edition 2003.
2. Elmashri & Navathe, Fundamentals of Database System, Addison-Wesley Publishing, 3rd Edition,2000.
3. Date C.J, An Introduction to Database, Addison-Wesley Pub Co, 7th Edition , 2001.
4. Jeffrey D. Ullman, Jennifer Widom, A First Course in Database System, Prentice Hall, AWL 1st Edition ,2001.
5. Peter rob, Carlos Coronel, Database Systems – Design, Implementation, and Management, 4th Edition, Thomson Learning, 2001.
Define flash memory. Define log disk. What is meant by log-based file system? Define a transaction. List the required properties of a transaction to ensure integrity of the data. What is meant by cascading rollback? Define concurrency control. Define locking protocol. Define cache coherency. Define parallel aggregation. Define query optimization Define locking protocol. Define fuzzy checkpoint. What is meant by Write-ahead-logging (WAL) rule?