Reliability of Disk Systems

Reliability
• So far, we looked at ways to improve the performance of disk systems.
• Next, we will look at ways to improve the reliability of disk systems.
• What is reliability?
– Essentially, it is the availability of data when there is a disk “failure” of some sort.
• This is achieved at the cost of some redundancy: extra data and/or extra disks.

Intermittent Failures
• In an intermittent failure, we may get several “bad” reads, for example, but with repeated attempts we may eventually get a “good” one.
• Disk sectors are stored with some redundant bits that can be used to tell us whether an I/O operation was successful.
• For writes, we may want to check the status again.
– We can, of course, re-read the sector and compare it to the original.
– But this is expensive.
– Instead, we simply re-read the sector and check the status bits.

Checksums for failure detection
• A useful tool for status validation is the checksum.
– One or more bits that, with high probability, verify the correctness of the operation.
– The checksum is written by the disk controller.
• A simple form of checksum is the parity bit:
– Here, a bit is added to the data so that the number of 1’s amongst the data bits + the parity bit is always even.
– A disk read (per sector) would return a status value of “good” if the bit string has an even number of 1’s; otherwise, status = “bad”.

Data bits   Parity bit   Status
11101110    0            Good
00101010    0            Bad
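The even-parity check can be sketched in a few lines of Python. This is only an illustration: the function names `parity_bit` and `read_status` are made up for this sketch, and bit strings stand in for what a disk controller would do in hardware.

```python
def parity_bit(data_bits: str) -> str:
    """Return the bit that makes the total number of 1s (data + parity) even."""
    return "1" if data_bits.count("1") % 2 else "0"


def read_status(data_bits: str, stored_parity: str) -> str:
    """Report 'good' when data bits plus parity bit hold an even number of 1s."""
    ones = data_bits.count("1") + stored_parity.count("1")
    return "good" if ones % 2 == 0 else "bad"


# The two rows from the table above.
print(read_status("11101110", "0"))  # good
print(read_status("00101010", "0"))  # bad
```
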

(Interleaved) Parity bits
• It is possible that more than one bit in a sector is corrupted.
– Error(s) may not be detected.
• Suppose bits are corrupted at random: the probability of an undetected error (i.e. an even number of 1’s) is thus 50%. (Why?)
• Let’s have 8 parity bits, one for each bit position within a byte:

01110110   Byte 1
11001101   Byte 2
00001111   Byte 3
10110100   Byte of parity bits

• The probability of an undetected error is $1/2^8 = 1/256$.
• With $n$ parity bits, the probability of an undetected error is $1/2^n$.
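As a rough sketch of how the byte of parity bits is obtained: XOR-ing all data bytes together yields, in each bit position, the even-parity bit for that position. The helper name `interleaved_parity` is an assumption made for this example.

```python
from functools import reduce


def interleaved_parity(data: bytes) -> int:
    """XOR all bytes: bit i of the result is the even-parity bit for bit position i."""
    return reduce(lambda acc, byte: acc ^ byte, data, 0)


# The three data bytes from the slide.
sector = bytes([0b01110110, 0b11001101, 0b00001111])
print(format(interleaved_parity(sector), "08b"))  # 10110100
```
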

Recovery from disk crashes
• Mean time to failure (MTTF) = the time by which 50% of the disks have crashed, typically 10 years.
• Simplified (assuming this happens linearly):
– In the 1st year = 5%,
– In the 2nd year = 5%,
– …
– In the 20th year = 5%
• However, the mean time to a disk crash doesn’t have to be the same as the mean time to data loss; there are solutions.

Redundant Array of Independent Disks, RAID
• RAID 1: Mirror each disk (data/redundant disks).
• If a disk fails, restore using the mirror.

Assume:
• 5% failure per year; MTTF = 10 years (for disks).
• 3 hours to replace and restore a failed disk.

If one disk fails, then the other had better not fail in the next three hours.
• Probability of that second failure = 5% × 3/(24 × 365) = 1/58,400.
• If one disk fails every 10 years (10 × 5% = 50%), then one of two will fail every 5 years (5 × (5% + 5%) = 50%).
• One in 58,400 of those failures results in data loss; MTTF = 292,000 years (5 × 58,400 = 292,000). This is the mean time to failure for data.

Drawback: We need one redundant disk for each data disk.
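The arithmetic above can be reproduced with exact fractions. This is a sketch under the slide's simplified linear-failure assumptions; the variable names are chosen for this example only.

```python
from fractions import Fraction

HOURS_PER_YEAR = 24 * 365
annual_failure_rate = Fraction(5, 100)  # 5% of disks fail per year
repair_hours = 3                        # time to replace and restore a disk

# Chance that the mirror also fails during the 3-hour repair window.
p_second_failure = annual_failure_rate * Fraction(repair_hours, HOURS_PER_YEAR)

# With two disks, 50% is reached after 5 years: 5 x (5% + 5%) = 50%.
years_to_first_failure = Fraction(1, 2) / (2 * annual_failure_rate)

# Only one in 58,400 such failures loses data.
mean_time_to_data_loss = years_to_first_failure / p_second_failure

print(p_second_failure)        # 1/58400
print(years_to_first_failure)  # 5
print(mean_time_to_data_loss)  # 292000
```
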

RAID 4
• RAID 4: One redundant disk only.
• $n$ data disks & 1 redundant disk (for any $n$).
• We’ll refer to the expression $x \oplus y$ as the modulo-2 sum of $x$ and $y$ (XOR).
– E.g. $11110000 \oplus 10101010 = 01011010$
• Now, each block in the redundant disk has the modulo-2 sum for the corresponding blocks in the other disks.

i-th block of Disk 1:    11110000
i-th block of Disk 2:    10101010
i-th block of Disk 3:    00111000
i-th block of red. disk: 01100010

• In effect this is just a distributed form of the block-interleaved parity discussed earlier.
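A minimal sketch of how the redundant block could be computed, treating each one-byte block from the example as an integer. The function name `redundant_block` is hypothetical.

```python
from functools import reduce
from operator import xor


def redundant_block(data_blocks: list[int]) -> int:
    """Modulo-2 sum (XOR) of the corresponding blocks on all data disks."""
    return reduce(xor, data_blocks, 0)


# i-th blocks of Disks 1, 2 and 3 from the slide.
blocks = [0b11110000, 0b10101010, 0b00111000]
print(format(redundant_block(blocks), "08b"))  # 01100010
```
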

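Properties of XOR:
• Commutativity: $x \oplus y = y \oplus x$
• Associativity: $x \oplus (y \oplus z) = (x \oplus y) \oplus z$
• Identity: $x \oplus 0 = 0 \oplus x = x$ (0 is the vector 00…0)
• Self-inverse: $x \oplus x = 0$
– As a useful consequence, if $x \oplus y = z$, then we can “add” $x$ to both sides and get $y = x \oplus z$.
– More generally, if
$$0 = x_1 \oplus \dots \oplus x_{n+1}$$
then “adding” $x_i$ to both sides, we get:
$$x_i = x_1 \oplus \dots \oplus x_{i-1} \oplus x_{i+1} \oplus \dots \oplus x_{n+1}$$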

Failure recovery in RAID 4
We must be able to restore whichever disk crashes.
• Just compute the modulo-2 sum of the corresponding blocks of the other disks.
• Use the equation
$$x_j = x_1 \oplus \dots \oplus x_{j-1} \oplus x_{j+1} \oplus \dots \oplus x_n \oplus x_{\mathrm{red}}$$
• Example:

i-th block of Disk 1:    11110000
i-th block of Disk 2:    10101010
i-th block of Disk 3:    00111000
i-th block of red. disk: 01100010

Disk 2 crashes. Compute it by taking the modulo-2 sum of the rest: $11110000 \oplus 00111000 \oplus 01100010 = 10101010$.
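The recovery step can be sketched on byte strings, so that it would also work for blocks longer than one byte. The name `recover_block` and the use of `bytes` for blocks are assumptions of this sketch, not part of the slides.

```python
def recover_block(surviving_blocks: list[bytes]) -> bytes:
    """Rebuild the lost block as the byte-wise XOR of all surviving blocks."""
    recovered = bytearray(len(surviving_blocks[0]))
    for block in surviving_blocks:
        for position, byte in enumerate(block):
            recovered[position] ^= byte
    return bytes(recovered)


# Disk 2 has crashed; Disk 1, Disk 3 and the redundant disk survive.
survivors = [bytes([0b11110000]), bytes([0b00111000]), bytes([0b01100010])]
print(format(recover_block(survivors)[0], "08b"))  # 10101010
```

The same function rebuilds the redundant disk if that is the one that failed, since it is simply the XOR of the data disks.
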

RAID 4 (Cont’d)
Maintaining RAID 4 is relatively easy:
• Reading: as usual.
– Interesting possibility: if we want to read from disk $i$, but it is busy and all other disks are free, then instead we can read the corresponding blocks from all other disks and take their modulo-2 sum.
• Writing:
– Write the block.
– Update the redundant block.

How do we get the value for the redundant block?
$$x_{\mathrm{red}} = x_1 \oplus \dots \oplus x_i \oplus \dots \oplus x_n$$
• Naively: read the $n-1$ corresponding blocks on the other data disks and recompute the sum. This costs $n+1$ disk I/O’s:
– $n-1$ block reads,
– 1 data block write,
– 1 redundant block write.
• Better: How?

How do we get the value for the redundant block? (Cont’d)
• Better writing: to write block $j$ of data disk $i$ (new value = $v$):
– Read the old value of that block, say $o$.
– Read the $j$-th block of the redundant disk, value = $r$.
– Compute $w = v \oplus o \oplus r$.
– Write $v$ in block $j$ of disk $i$.
– Write $w$ in block $j$ of the redundant disk.
• Total: 4 disk I/O’s (true for any number of data disks).
• Why does this work?
– Intuition: $v \oplus o$ is the “change” to the parity.
– The redundant disk must change to compensate.

Example
i-th block of Disk 1:    11110000   $x_1$
i-th block of Disk 2:    10101010   $x_2 = o$
i-th block of Disk 3:    00111000   $x_3$
i-th block of red. disk: 01100010   $r$

Suppose we change 10101010 into 01101110:

10101010   $o$
01101110   $v$
01100010   $r$
--------
10100110   $w$

If done the naïve way:

11110000   $x_1$
01101110   $x_2 = v$
00111000   $x_3$
--------
10100110   $w$ = new $r$
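The 4-I/O update rule and the naive recomputation can be compared directly. This is a small sketch using the values from the example; `updated_parity` is a hypothetical helper name.

```python
def updated_parity(old_value: int, new_value: int, old_parity: int) -> int:
    """Parity after a small write: w = v XOR o XOR r."""
    return new_value ^ old_value ^ old_parity


o, v, r = 0b10101010, 0b01101110, 0b01100010
w = updated_parity(o, v, r)

# Naive way: XOR the new value with the unchanged blocks of Disks 1 and 3.
naive = 0b11110000 ^ v ^ 0b00111000

print(format(w, "08b"))  # 10100110
print(w == naive)        # True
```
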

RAID 5
• RAID 4 problem: the redundant disk is involved in every write. Bottleneck!
• Solution is RAID 5: vary the redundant disk for different blocks.
– Example: $n+1$ disks, numbered 0 through $n$.
– Cylinder $j$ is redundant on disk $i$ if $i = j \bmod (n+1)$.
• Example: $n = 3$. So, there are 4 disks.
– The first disk, numbered 0, would be the “redundant” one when considering cylinders numbered 0, 4, 8, 12, etc. (because they leave remainder 0 when divided by 4).
– The disk numbered 1 would be the “redundant” one for cylinders numbered 1, 5, 9, 13, and so on.

[Figure: four disks, each labelled with three cylinder numbers; in each case the missing cylinder is the one for which that disk is redundant.]
Disk 0: Cylinders 1, 2, 3
Disk 1: Cylinders 0, 2, 3
Disk 2: Cylinders 0, 1, 3
Disk 3: Cylinders 0, 1, 2
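The rule $i = j \bmod (n+1)$ for placing the redundant blocks can be sketched as follows. The function names and the cylinder range are assumptions made for this illustration.

```python
def redundant_disk(cylinder: int, num_disks: int) -> int:
    """Disk that holds the redundant blocks for this cylinder: i = j mod (n+1)."""
    return cylinder % num_disks


def redundant_cylinders(disk: int, num_disks: int, num_cylinders: int) -> list[int]:
    """All cylinders (below num_cylinders) for which `disk` is the redundant one."""
    return [c for c in range(num_cylinders) if redundant_disk(c, num_disks) == disk]


# n = 3 data disks per cylinder, so 4 disks in total.
print(redundant_cylinders(0, 4, 16))  # [0, 4, 8, 12]
print(redundant_cylinders(1, 4, 16))  # [1, 5, 9, 13]
```
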

RAID 5 (Cont’d)
• The reading/writing load for each disk is the same.
• In one block write, what’s the probability that a disk is involved?
– Each disk has a $1/(n+1)$ probability of holding the block.
– If not, i.e. with probability $n/(n+1)$, then it has a $1/n$ chance of holding the redundant block for that block number.
– So, each of the $n+1$ disks (four in the example) is involved in
$$\frac{1}{n+1} \cdot 1 + \frac{n}{n+1} \cdot \frac{1}{n} \cdot 1 = \frac{2}{n+1}$$
of the writes.
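One way to check the $2/(n+1)$ figure is to enumerate every equally likely (data disk, redundant disk) pair for a single block write and count how often a given disk appears. This sketch assumes writes are spread uniformly, as the slide does; `write_involvement` is a hypothetical name.

```python
from fractions import Fraction
from itertools import permutations


def write_involvement(n: int, disk: int = 0) -> Fraction:
    """Fraction of single-block writes that touch `disk` in an (n+1)-disk RAID 5."""
    # Each write touches one data disk and one different redundant disk.
    pairs = list(permutations(range(n + 1), 2))
    hits = sum(1 for pair in pairs if disk in pair)
    return Fraction(hits, len(pairs))


print(write_involvement(3))  # 1/2
```
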

RAID 6 - for multiple disk crashes
Let’s focus on recovering from two disk crashes.

Setup:
• 7 disks, numbered 1 through 7.
• The first 4 are data disks, and disks 5 through 7 are redundant.
• The relationship between data and redundant disks is summarized by a 3 x 7 matrix of 0's and 1's:

         Data disks | Redundant disks
Disk:    1 2 3 4    | 5 6 7
Row 1:   1 1 1 0    | 1 0 0
Row 2:   1 1 0 1    | 0 1 0
Row 3:   1 0 1 1    | 0 0 1

• The columns for the redundant disks have a single 1.
• All columns are different. No all-0’s column.
• The disks with 1 in a given row of the matrix are treated as if they were the entire set of disks in a RAID level 4 scheme.

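RAID 6 - example
1) 11110000   data
2) 10101010   data
3) 00111000   data
4) 01000001   data
5) 01100010   redundant
6) 00011011   redundant
7) 10001001   redundant

         Data disks | Redundant disks
Disk:    1 2 3 4    | 5 6 7
Row 1:   1 1 1 0    | 1 0 0
Row 2:   1 1 0 1    | 0 1 0
Row 3:   1 0 1 1    | 0 0 1

• Disk 5 is the modulo-2 sum of disks 1, 2, 3.
• Disk 6 is the modulo-2 sum of disks 1, 2, 4.
• Disk 7 is the modulo-2 sum of disks 1, 3, 4.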



Why is it possible to recover from two disk crashes?
[Figure: the 3 x 7 redundancy matrix, with the columns for disks a and b and the row r marked.]
• Let the failed disks be $a$ and $b$.
• Since all columns of the redundancy matrix are different, we must be able to find some row $r$ in which the columns for $a$ and $b$ are different.
– Suppose that $a$ has 0 in row $r$, while $b$ has 1 there.
• Then we can compute the correct $b$ by taking the modulo-2 sum of corresponding bits from all the disks other than $b$ that have 1 in row $r$.
– Note that $a$ is not among these, so none of them have failed.
• Having done so, we recompute $a$, now with all other disks available.
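The two-step argument can be turned into a short sketch for the 7-disk scheme. The function names and the dictionary representation (disk number to one-byte block) are assumptions for illustration; the block values are the ones from the RAID 6 example.

```python
MATRIX = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]


def row_sum_without(row, skip, blocks):
    """XOR the blocks of every disk with a 1 in `row`, except disk `skip`."""
    value = 0
    for disk, bit in enumerate(row, start=1):
        if bit and disk != skip:
            value ^= blocks[disk]
    return value


def recover_two(surviving, a, b):
    """Return all seven blocks, given the five surviving ones and failed disks a, b."""
    blocks = dict(surviving)
    # Find a row r in which the columns for a and b differ.
    row = next(r for r in MATRIX if r[a - 1] != r[b - 1])
    # `first` is the failed disk with a 1 in that row; the other has a 0 there.
    first, second = (a, b) if row[a - 1] == 1 else (b, a)
    blocks[first] = row_sum_without(row, first, blocks)
    # With `first` restored, any row covering `second` has all other disks available.
    row = next(r for r in MATRIX if r[second - 1] == 1)
    blocks[second] = row_sum_without(row, second, blocks)
    return blocks


full = {1: 0b11110000, 2: 0b10101010, 3: 0b00111000, 4: 0b01000001,
        5: 0b01100010, 6: 0b00011011, 7: 0b10001001}
surviving = {d: v for d, v in full.items() if d not in (3, 6)}
print(recover_two(surviving, 3, 6) == full)  # True
```
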

RAID 6 Failure Recovery
• Example: [Figure: the disks before failure and after failure, shown with the 3 x 7 redundancy matrix.]

RAID 6 - How many redundant disks?
• The number of disks can be one less than any power of 2, say $2^k - 1$.
• Of these disks, $k$ are redundant, and the remaining $2^k - 1 - k$ are data disks, so the redundancy grows roughly as the logarithm of the number of data disks.
• For any $k$, we can construct the redundancy matrix by writing all possible columns of $k$ 0's and 1's, except the all-0's column.
– The columns with a single 1 correspond to the redundant disks, and the columns with more than one 1 are the data disks.

Note finally that we can combine RAID 6 with RAID 5 to reduce the performance bottleneck on the redundant disks.
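The matrix construction for a general $k$ can be sketched as below. The ordering of the data-disk columns is an arbitrary choice of this sketch (any order of distinct non-zero columns works), and `redundancy_matrix` is a hypothetical name.

```python
from itertools import product


def redundancy_matrix(k: int) -> list[list[int]]:
    """Build a k x (2^k - 1) matrix: data-disk columns first, redundant-disk columns last."""
    columns = [c for c in product((0, 1), repeat=k) if any(c)]  # skip the all-0's column
    data = [c for c in columns if sum(c) > 1]                   # more than one 1
    redundant = [tuple(int(r == i) for r in range(k)) for i in range(k)]  # a single 1
    ordered = data + redundant
    # Transpose the column list into rows.
    return [[column[r] for column in ordered] for r in range(k)]


matrix = redundancy_matrix(4)
print(len(matrix), len(matrix[0]))  # 4 15
for row in matrix:
    print(" ".join(map(str, row)))
```
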

Exercises

RAID 4
i-th block of Disk 1:    11110000
i-th block of Disk 2:    10101010
i-th block of red. disk:

Now suppose that Disk 1 crashed. Recover it.

RAID 6
1) 11110000
2) 10101010
3) 00111000
4) 01000001
5)
6)
7)

         Data disks | Redundant disks
Disk:    1 2 3 4    | 5 6 7
Row 1:   1 1 1 0    | 1 0 0
Row 2:   1 1 0 1    | 0 1 0
Row 3:   1 0 1 1    | 0 0 1

Now suppose that Disk 2 and Disk 5 crash. Recover them.

RAID 6 - exercise
• Find a RAID level 6 scheme using 15 disks, 4 of which are redundant.

Disk:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Row 1:   1  0  1  1  1  0  0  0  1  1  1  1  0  0  0
Row 2:   1  1  0  1  1  0  1  1  0  1  0  0  1  0  0
Row 3:   1  1  1  0  1  1  0  1  0  0  1  0  0  1  0
Row 4:   1  1  1  1  0  1  1  0  1  0  0  0  0  0  1

In-Class exercise
• Suppose we have four disks: 1 and 2 are data disks, 3 and 4 are redundant.
• Disk 3 is a mirror of 1. Disk 4 holds parity check bits for disks 2 and 3.
• Which combinations of simultaneous 2-disk failures can we recover from?
