
http://csiweb.ucd.ie/Staff/acater/comp30150.html

§1. Recovery

Recovery (and concurrency) is tied to the notion of transaction processing.

A transaction is a logical unit of work.

Specifically in DBMS context, it is a sequence of (usually) several database operations that transform a consistent state of the database into another consistent state (or do not change it).

Consistency is not necessarily preserved at intermediate points during processing.


Example: consider the relational-database tables:

Supplier (S)      S#    Sname   City
                  S3    Acme    Cork
                  …     …       …

Shipments (SP)    S#    P#      Qty
                  S3    P117    0144
                  …     …       …

and the user request: “Change supplier no S3 to no S9”

=> must change both tables above -- “cascade” the update.


TRANEX1: PROC OPTIONS ( MAIN ) ;

/* declarations omitted */

GET LIST ( SX , SY ); /* get values from user */

EXEC SQL UPDATE S

SET S# = :SY

WHERE S# = :SX ;

EXEC SQL UPDATE SP

SET S# = :SY

WHERE S# = :SX ;

RETURN ;

END /* TRANEX1 */ ;


NB: This “simple update” involves 2 actual updates to the database. In between, the database is not consistent: SP records might exist with no corresponding S record.

Must guard against doing 1 update but not the other, which would leave database inconsistent.

Ideally one wants a guarantee that both updates will be carried out - but that is impossible.

Many possible sources of failure:

OS errors; DBMS errors; program errors; operator mistakes; hardware failures; power cuts; fire; sabotage; etc.


But all is not lost - you can be guaranteed recovery

The key to recovery is redundancy : redundant information is stored in dumps and logs, and provides the ability to partially or completely reconstruct the database.

Outline of commonest recovery mechanism:

Dump total database regularly; Record all changes in log;

If failure occurs, either

• Restore DB from dump and redo completed txns if possible, or

• Undo uncompleted txns

Duplexing is also possible: keep 2 copies of database and update both simultaneously. But ensure that they have different failure modes.


With "transaction processing" you are guaranteed that if a failure occurs during updates, the work of transactions will either be redone (if properly finished) or undone.

Transactions must appear atomic to the end-user -- executed in their entirety or not at all.

Achieve this by commit and rollback operations.

• Commit signals successful end-of-transaction: database is believed consistent, update can be made permanent -- “committed”.

• Rollback = unsuccessful end-of-transaction; tells system that all updates within the transaction must be “rolled back” i.e. undone.


TRANEX2: PROC OPTIONS ( MAIN ) ;

/* declarations omitted */

EXEC SQL WHENEVER SQLERROR GO TO UNDO ;

GET LIST ( SX , SY ); /* get values from user */

EXEC SQL UPDATE S

SET S# = :SY

WHERE S# = :SX ;

EXEC SQL UPDATE SP

SET S# = :SY

WHERE S# = :SX ;

EXEC SQL COMMIT ;

GO TO FINISH ;

UNDO: EXEC SQL ROLLBACK ;

FINISH: RETURN ;

END /* TRANEX2 */ ;


In the “TRANEX2” example we issue a Commit if we get through both updates successfully. But if either update fails, it “raises an error condition” and a program-initiated Rollback is issued to undo the changes.
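The same commit-or-rollback pattern can be sketched in Python with the standard sqlite3 module. This is only an illustration, not part of the original PL/I example: the column name sno stands in for S# (which SQLite does not accept unquoted), and the fail_midway flag is a hypothetical switch used to simulate an error between the two updates.

```python
import sqlite3

def renumber_supplier(conn, old, new, fail_midway=False):
    """Change a supplier number in S and SP as a single transaction."""
    try:
        conn.execute("UPDATE S SET sno = ? WHERE sno = ?", (new, old))
        if fail_midway:
            # Hypothetical: simulate an error after the first update only.
            raise sqlite3.OperationalError("simulated failure")
        conn.execute("UPDATE SP SET sno = ? WHERE sno = ?", (new, old))
        conn.commit()        # both updates become permanent together
        return True
    except sqlite3.Error:
        conn.rollback()      # undo whichever updates were already made
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE S (sno TEXT PRIMARY KEY, sname TEXT, city TEXT)")
conn.execute("CREATE TABLE SP (sno TEXT, pno TEXT, qty INTEGER)")
conn.execute("INSERT INTO S VALUES ('S3', 'Acme', 'Cork')")
conn.execute("INSERT INTO SP VALUES ('S3', 'P117', 144)")
conn.commit()
```

Calling renumber_supplier(conn, "S3", "S9", fail_midway=True) should leave both tables exactly as they were, which is the "all or nothing" behaviour the slides describe.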

Commit/Rollback operations may not appear in code but may be implicit - depends on implementation.

Rollback of transactions in progress at the time of a crash should also happen automatically upon DBMS restart.

Messages are also an issue (see later).


Undoing updates

System keeps log ( = journal ) of all update operations.

Log records values of items before and after any change, identifying the item changed/deleted/inserted, and the id of the transaction. This can be used to restore database to consistent state.

When a transaction commits, log also records that fact.

Log may be stored online, or as an online/offline combination.

For busy multi-user systems the log can quickly become very large.
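A minimal sketch of such a log, assuming an in-memory dictionary as the "database" and a Python list as the log. The names LogRecord, write, commit and undo_uncommitted are invented for this illustration; a real DBMS log is a persistent file with a far richer record format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LogRecord:
    txn: str                  # id of the transaction
    key: Optional[str]        # item changed (None for a commit record)
    before: Optional[str]     # value before the change (None = did not exist)
    after: Optional[str]      # value after the change (None = deleted)
    kind: str = "update"      # "update" or "commit"

def write(db, log, txn, key, value):
    """Log the before/after values, then apply the change."""
    log.append(LogRecord(txn, key, db.get(key), value))
    if value is None:
        db.pop(key, None)
    else:
        db[key] = value

def commit(log, txn):
    """Record in the log that the transaction committed."""
    log.append(LogRecord(txn, None, None, None, kind="commit"))

def undo_uncommitted(db, log):
    """Scan the log backwards, restoring before-values of uncommitted work."""
    committed = {r.txn for r in log if r.kind == "commit"}
    for r in reversed(log):
        if r.kind == "update" and r.txn not in committed:
            if r.before is None:
                db.pop(r.key, None)
            else:
                db[r.key] = r.before
```

The backward scan matters: if one transaction changed the same item twice, the earliest before-value must be the one that ends up in the database.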


Synchronisation points (SYNCH points)

Executing a Commit or Rollback operation establishes a synchpoint: it represents the boundary between two consecutive transactions - a point where the database is consistent.

Synchpoints are only established by Commit, Rollback, and normal “program termination”.

Once established:

- all updates since previous synchpoint are committed or undone;

- all database positioning is lost (see below - “cursors” are closed)

- all record locks are released (see later - “concurrency”)


Cursors

When SQL is embedded in a host language like COBOL (or C or C++ or Java or …), there is a need to bridge between the set-at-a-time nature of SQL and the record-at-a-time nature of COBOL (…).

This is done using cursors - pointers that allow you to run through a set of records, pointing to each one in turn.

Cursors allow the host language to process the records in the way natural for it; SQL on its own does not need them.


The process is illustrated in outline in the example, which is intended to retrieve supplier details (S#, SNAME, and STATUS) for all suppliers in the city given by the host variable Y.

EXEC SQL DECLARE X CURSOR FOR /* define cursor X */

SELECT S#, SNAME, STATUS

FROM S

WHERE CITY = :Y;

EXEC SQL OPEN X; /* execute the query */

DO WHILE (more-records-to-come);

EXEC SQL FETCH X INTO :S#,:SNAME,:STATUS;

/* fetch next supplier */

........ /* and then do something! */

END;

EXEC SQL CLOSE X; /*deactivate the cursor*/
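For comparison, the same open / fetch / close pattern can be sketched with Python's sqlite3 module, whose cursor objects play a similar bridging role. The table layout, the column name sno (standing in for S#) and the sample rows are made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE S (sno TEXT, sname TEXT, status INTEGER, city TEXT)")
conn.executemany("INSERT INTO S VALUES (?, ?, ?, ?)", [
    ("S1", "Smith", 20, "London"),   # made-up sample rows
    ("S2", "Jones", 10, "Paris"),
    ("S4", "Clark", 20, "London"),
])

def suppliers_in(conn, city):
    """Run through the matching rows one at a time, as a cursor loop does."""
    cur = conn.cursor()                       # corresponds to DECLARE
    cur.execute(                              # corresponds to OPEN
        "SELECT sno, sname, status FROM S WHERE city = ? ORDER BY sno",
        (city,),
    )
    rows = []
    while True:
        row = cur.fetchone()                  # corresponds to FETCH
        if row is None:                       # no more records to come
            break
        rows.append(row)                      # "and then do something"
    cur.close()                               # corresponds to CLOSE
    return rows
```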


Commit / Rollback terminate transactions, not the program. A given program may carry out numerous consecutive transactions.

A transaction is a unit of work; it is also a unit of recovery.

If a program commits, then its updates must be guaranteed even if a crash occurs before the updates are flushed to disk.

System restart after a crash should install updates in the database from entries in the log.

Implication: one must write the log before Commit operation finishes - so-called “Write-Ahead Log Protocol”
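The write-ahead rule can be sketched as follows, assuming a simple line-per-record JSON log file. The class WriteAheadLog and the functions commit and redo_from_log are hypothetical names for this illustration; the one essential point is that the log is forced to disk (flush plus os.fsync) before commit returns.

```python
import json
import os

class WriteAheadLog:
    def __init__(self, path):
        self.f = open(path, "a", encoding="utf-8")

    def append(self, record):
        self.f.write(json.dumps(record) + "\n")   # may still be buffered

    def force(self):
        self.f.flush()                 # push Python's buffer to the OS
        os.fsync(self.f.fileno())      # ask the OS to write it to disk

def commit(log, txn, changes):
    """Log every change and the commit record, and force them out first."""
    for key, value in changes.items():
        log.append({"txn": txn, "key": key, "after": value})
    log.append({"txn": txn, "commit": True})
    log.force()      # the log is on disk before Commit finishes
    return True

def redo_from_log(path):
    """At restart, reinstall the updates of committed transactions only."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    committed = {r["txn"] for r in records if r.get("commit")}
    db = {}
    for r in records:
        if "key" in r and r["txn"] in committed:
            db[r["key"]] = r["after"]
    return db
```

In this sketch the database pages themselves are never written; that is deliberate, to show that the log alone is enough to reconstruct committed updates.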


System and Media Recovery

Local failure -- affects only current transaction, e.g. arithmetic overflow error

Global failure -- affects all transactions in progress at time of failure. These include:

• System failure ("soft crash") – no physical damage

• Media failure ("hard crash") – database is physically damaged, e.g. disk head crash


Media failure - physical damage.

• Restore database from backup, redo transactions that had completed.

• No need to undo.

• Can use standard dump/restore software.

System failure -- no damage, but

• The contents of memory are lost;

– so the state of transactions is lost;

– so transactions cannot be completed;

– so they must be undone at restart time.

• May also have to redo transactions that had been finished but not flushed-to-disk.

• If the log is very large, restart can be very expensive.


System uses checkpoints to help identify rapidly which transactions to undo, which transactions to redo.

System "takes a checkpoint" at regular intervals: it flushes buffers to disk, and writes a checkpoint record to the physical log. The checkpoint record lists all transactions in progress at the time of the checkpoint.

T1 … T5 here are meant to be classes of transactions


Classes of transactions to be undone:

those in checkpoint record (like T3) or begun after it (like T5) without Commit in log

Classes of transactions to be redone:

those in checkpoint record (like T2) or begun after it (like T4) that do have a Commit in log (but their changes are perhaps not flushed to disk)

System restart will use log to undo, and then redo, appropriate transactions. Only then will normal activity begin.
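The classification can be sketched as a single forward pass over the log from the most recent checkpoint record. The log layout used here (tuples of a record kind and a transaction id, with the checkpoint record carrying a list of active transactions) is an assumption for this illustration. It also assumes, since the original figure is not available, that T1 is a class that finished before the checkpoint and so needs neither undo nor redo.

```python
def classify(log):
    """Return (undo, redo) sets of transaction ids for restart."""
    last_cp = max(i for i, rec in enumerate(log) if rec[0] == "checkpoint")
    undo = set(log[last_cp][1])     # in progress at the checkpoint
    redo = set()
    for kind, txn in log[last_cp + 1:]:
        if kind == "begin":
            undo.add(txn)           # begun after the checkpoint
        elif kind == "commit":
            undo.discard(txn)       # has a Commit in the log
            redo.add(txn)
    return undo, redo

example_log = [
    ("begin", "T1"), ("commit", "T1"),
    ("begin", "T2"), ("begin", "T3"),
    ("checkpoint", ["T2", "T3"]),
    ("begin", "T4"), ("commit", "T2"),
    ("begin", "T5"), ("commit", "T4"),
]   # the failure happens after the last record
```

With example_log, T3 and T5 end up in the undo set and T2 and T4 in the redo set, matching the classes listed above.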


Message handling

Non-trivial in transaction processing.

Eg: Transfer $100 from 68224 to 97636

"Transaction" should update database and issue message to user.

If it does a Commit (or a voluntary Rollback), then an appropriate message should be sent.

But in the event of a system failure, neither message should be displayed - just as if the transaction had never started. (One could display system-generated failure message.)

So messages should not be output until end-of-transaction.


TRANSFER: PROC;
   GET (FROM, TO, AMOUNT);                 /* input message */
   FIND UNIQUE (ACCOUNT WHERE ACCOUNT# = FROM);
   /* now decrement the FROM balance */
   ASSIGN (BALANCE - AMOUNT) TO BALANCE;
   IF BALANCE < 0
   THEN DO;
      PUT ('INSUFFICIENT FUNDS');          /* output msg */
      /* undo update & terminate transaction */
      ROLLBACK;
   END;
   ELSE DO;
      FIND UNIQUE (ACCOUNT WHERE ACCOUNT# = TO);
      /* now increment the TO balance */
      ASSIGN (BALANCE + AMOUNT) TO BALANCE;
      PUT ('TRANSFER COMPLETE');           /* output msg */
      /* commit update and terminate transaction */
      COMMIT;
   END;
END /* TRANSFER */ ;


Place messages in pending queue, to be delivered on termination or discarded on failure.

[In the case of a cash dispensing terminal, the delivery of your money is one message]

Messages are handled by Data Communications Manager (DCM): when it receives input it places it in a queue.

GET retrieves a copy from queue & logs it.

PUT outputs to queue.

Commit/Rollback cause DCM to log messages, transmit them or clear input queues.

DCM cancels output messages on failure. Log is used for redo.
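The pending-queue idea can be sketched as below. MessageQueue is a hypothetical stand-in for the DCM's output queue, the snapshot dictionary stands in for the log's before-values, and the crash_before_end flag simulates a system failure just before end-of-transaction; none of these names come from the original.

```python
class MessageQueue:
    def __init__(self):
        self.pending = []      # output held back until end-of-transaction
        self.delivered = []    # what the user actually gets to see

    def put(self, msg):
        self.pending.append(msg)

    def end_of_transaction(self):      # Commit or voluntary Rollback
        self.delivered.extend(self.pending)
        self.pending.clear()

    def failure(self):                 # system failure: cancel output
        self.pending.clear()

def transfer(accounts, queue, src, dst, amount, crash_before_end=False):
    snapshot = dict(accounts)          # before-values, used for undo
    accounts[src] -= amount
    if accounts[src] < 0:
        queue.put("INSUFFICIENT FUNDS")
        accounts.clear()
        accounts.update(snapshot)      # ROLLBACK: undo the update
        queue.end_of_transaction()     # voluntary rollback still reports
        return
    accounts[dst] += amount
    queue.put("TRANSFER COMPLETE")
    if crash_before_end:
        accounts.clear()
        accounts.update(snapshot)      # undone at restart
        queue.failure()                # message is never displayed
        return
    queue.end_of_transaction()         # COMMIT
```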


Transaction structure - general format

• accept input

• perform database processing

• send output


Undo & Redo must be idempotent: the system can fail during Undo/Redo, so either may be repeated on restart.

Must ensure that

• Undo(Undo(Undo...(x))) = Undo(x)

• Redo(Redo(Redo...(x))) = Redo(x)

i.e. that Undoing a change any number of times has the same effect as Undoing it once; and similarly, Redoing it any number of times has the same effect as Redoing it once.
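One common way to get this property, sketched here under the assumption that each log record holds the full before and after values, is to have Undo and Redo install a recorded value rather than apply a difference. The function undo_by_delta is a deliberately wrong variant, included only to show what goes wrong when an undo is repeated.

```python
def undo(db, rec):
    db[rec["key"]] = rec["before"]     # install the before-value

def redo(db, rec):
    db[rec["key"]] = rec["after"]      # install the after-value

def undo_by_delta(db, rec):
    # NOT idempotent: each repetition shifts the value again.
    db[rec["key"]] += rec["before"] - rec["after"]

rec = {"key": "BALANCE", "before": 150, "after": 50}

db = {"BALANCE": 50}
undo(db, rec)
undo(db, rec)          # repeated, e.g. after a crash during restart
safe_result = db["BALANCE"]

db2 = {"BALANCE": 50}
undo_by_delta(db2, rec)
undo_by_delta(db2, rec)
unsafe_result = db2["BALANCE"]
```

Here safe_result is 150 however many times undo runs, whereas unsafe_result has drifted to 250 after the second application.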


3 types of System Startup

• Cold start: start from scratch. Normally only when 1st installed, but also possible after disasters (esp media failure)

• Warm start: Startup after controlled shutdown. No need for Redo/Undo.

• Emergency restart: a process invoked by operator after failure. Involves Redo/Undo and - perhaps - reloading the database.


Two phase Commit

When a transaction commits, you are guaranteed that a recovery manager will be able to redo the transaction in the event of a failure (because the log is force-written at commit).

If transaction involves 2 (or more) systems - eg in a distributed DBMS - recovery is more difficult: there are 2 (or more) independent recovery mechanisms. The aim is still to preserve the "all or nothing" principle of transaction processing. This leads to two phase commit.

Need to be able to exercise control over different systems, so need one system to act as coordinator component.


Individual transactions now send Commit/Rollback to coordinator, which operates in 2 phases:

1. request all participants in the transaction to get ready to go, and send an acknowledgement to coordinator - OK or NOT OK

2. coordinator then broadcasts Commit to all participants, if all replies were OK; or Rollback if not all OK or if timeout occurred.


IN COORDINATOR:

for each participant
    send “get ready to commit” to participant;
    wait for reply or timeout;

if all participants replied ‘OK’
then forcewrite “broadcasting COMMIT” to coordinator log
     for each participant
         until acknowledgement received
             send “COMMIT” to participant
             wait for acknowledgement or timeout
else forcewrite “broadcasting ROLLBACK” to coordinator log
     for each participant
         until acknowledgement received
             send “ROLLBACK” to participant
             wait for acknowledgement or timeout


IN PARTICIPANT:

wait for “get ready to commit” message;
force outstanding change records to local log;
force “agree to commit” to local log;
if errors occurred
then send ‘NOT OK’ to coordinator
else send ‘OK’ to coordinator
wait for broadcast command from coordinator
if command is “COMMIT”
then commit changes to local resources
if command is “ROLLBACK”
then undo changes to local resources
send acknowledgement to coordinator


Note 1) Timeout 2) Resources are held until global termination - termination on all systems.

In the event of failure:

• in coordinator

– before broadcasting: restart should issue Rollback

– after: restart procedure issues Commit or Rollback as appropriate

• in participant

– before "agree to commit": restart issues "NOT OK" (note that timeout will have occurred anyway)

– after: ask coordinator to rebroadcast message, and undo/redo transaction locally
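The two phases can be simulated in a single process as a rough sketch. The Participant class, its healthy flag and the function two_phase_commit are invented for this illustration, and lists stand in for the force-written logs. Timeouts and restart handling are not modelled, and the participant here logs "agree to commit" only when it is going to reply OK, a small reordering of the pseudocode above.

```python
class Participant:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy     # hypothetical: False means errors occurred
        self.log = []              # stands in for the local log
        self.state = "active"

    def prepare(self):
        """Phase 1: reply OK or NOT OK to 'get ready to commit'."""
        self.log.append("change records")
        if not self.healthy:
            return "NOT OK"
        self.log.append("agree to commit")
        return "OK"

    def finish(self, command):
        """Phase 2: obey the coordinator's broadcast, then acknowledge."""
        self.state = "committed" if command == "COMMIT" else "rolled back"
        return "ACK"

def two_phase_commit(participants, coordinator_log):
    # Phase 1: ask every participant to get ready.
    replies = [p.prepare() for p in participants]
    command = "COMMIT" if all(r == "OK" for r in replies) else "ROLLBACK"
    # The decision is logged before anything is broadcast.
    coordinator_log.append("broadcasting " + command)
    # Phase 2: repeat the broadcast until each participant acknowledges.
    for p in participants:
        while p.finish(command) != "ACK":
            pass
    return command
```

A single NOT OK reply is enough to make every participant roll back, which preserves the "all or nothing" principle across systems.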
