
Solving Data Loss in Massive Storage Systems

Jason Resch, Cleversafe

1


In the beginning

There was replication
  Long before advanced data protection techniques were known, data was copied
Replication is wasteful
  To survive N faults, N+1 copies were needed
  Applied to disks, (N+1) times the hardware, power, floor space, and cooling are required
Not Cheap, Not Green, Not Performant

2

Presenter
Presentation Notes
Replication is the most natural and obvious path to achieve protection of data. It follows the adage of not keeping all of one's eggs in one basket, because accidents happen, drives fail, natural disasters strike. Nature follows the paradigm of replication to preserve copies of genes. Everyone could intuitively grasp the protection that replication affords. However, replication comes at a cost. If someone wanted protection from 2 simultaneous faults, they would have to make 3 copies, thus tripling capital and operational costs. There is also a performance impact. One would have 3 times the number of disks, and so in theory could support 3 times the I/O; however, replication requires that for every byte written, 3 bytes of I/O are generated, and thus no performance gain is realized from having those extra disks. Another performance-related issue is synchronized replication. To have the highest level of protection, data should be replicated before acknowledging the write as successful, but this can hurt performance and availability, especially if the other sites are remote. If one decides to take the opposite approach and asynchronously replicate, then some data at all times exists as only one copy, and is vulnerable to failure at the primary location.

Enter RAID

In the 1980s, RAID was invented
  By storing a little extra information regarding a larger set of information, errors can be corrected
RAID 5 stores parity information:
  Parity is the property denoting even or odd
  If the number of 1's across a set of drives is even, the parity bit is set to 0; if odd, it is set to 1
  If any disk is lost, the parity along with the bits on the surviving disks will yield the content of the lost disk
Example RAID 5 recovery:
  0 0 1 P = 1
  0 X 1 P = 1

3

Presenter
Presentation Notes
RAID, or Redundant Array of Independent Disks, represented a huge technological leap for data protection. It enabled data to be protected against faults, in a similar manner to replication; however, no copies had to be made. It also improved performance, as writes could be spread across disks. The only thing that had to be given up was a small amount of storage space and a little processing of the information before storage. Here is an example of RAID 5 storage. Each of these bits represents data stored across different disks in an array. The first 3 bits are raw data the user requested to be stored, while the final bit is calculated from the others. It is called the parity bit, because it denotes whether there is an even or odd number of 1's across the other 3 disks. When it is even, this bit will be set to 0; when it is odd, it will be set to 1. Should any of these 4 disks fail, the original bit can be computed based on the content of the other 3 remaining disks. For example, say the second disk failed. Based on the parity bit, we know there was an odd number of 1's to start. Therefore the second disk cannot have had a 1, as that would imply an even number of 1's. It must be 0. This logical process continues for every bit stored in each disk, until all bits in the drive are recovered. Note that the process to replace any missing bit for a drive is equivalent to the Exclusive-OR operation of the bits on the other drives.
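
As a quick illustration of the recovery process described above, here is a minimal Python sketch (not part of the original deck) showing that the parity strip is simply the XOR of the data strips, and that any single lost strip is the XOR of everything that survives:

    # Minimal RAID 5 parity sketch: parity is the XOR of the data strips,
    # and any single lost strip equals the XOR of all surviving strips.
    from functools import reduce

    def parity(strips):
        """XOR the data strips together to produce the parity strip."""
        return reduce(lambda a, b: a ^ b, strips)

    def recover(survivors):
        """Rebuild one missing strip from the remaining data + parity strips."""
        return reduce(lambda a, b: a ^ b, survivors)

    data = [0b0, 0b0, 0b1]                    # bits on the three data disks
    p = parity(data)                          # parity bit = 1 (odd number of 1's)
    rebuilt = recover([data[0], data[2], p])  # pretend the second disk failed
    assert rebuilt == data[1]                 # XOR of the survivors = the lost bit

The same XOR logic applies unchanged to whole sectors or strips, not just single bits.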

Paradise Lost

RAID 5 was great
  Gave similar protection to making 1 copy, yet overhead was significantly less
  For example: using 3 disks for data and 1 for parity, the overhead was only 33%
However, two factors would conspire to destroy the practical usefulness of RAID 5
  Disk capacity outpacing performance
  The growing chance of Latent Sector Errors (LSEs), which increased with disk capacity

4

Presenter
Presentation Notes
RAID 5 was extremely useful: it provided similar protection to making a copy, because it could tolerate a single failure and still recover. Making a copy would normally mean two times the amount of storage hardware, or a 100% expansion in size. However, with RAID 5, only 1 disk's capacity is sacrificed. Therefore in a 3 disk array, there is 66% utilization, in a 4 disk array, 75%, a 5 disk array, 80%, and so on. One could make RAID 5 arbitrarily efficient, however putting too many disks in the same array is risky, as it increases the chance of encountering a secondary failure. As useful as RAID 5 was, it would not last. There were two primary reasons for this. The first is that while disk capacity has roughly doubled every year, disk performance only doubles every 2.3 years. This exponential divergence has imposed serious difficulties for rebuilding RAID arrays. The other issue is that of Latent Sector Errors, which were negligible when disks were small, but now represent a serious risk.

Hard drive capacity growth

Typical hard drive in 1991 was 40 MB and took 57 seconds to read entirely
In 2006, a typical hard drive was 750 GB
  A 19,200-fold increase!
  Took 3.27 hours to read [1]
Today's 2 TB drives can take up to 8 hours

5

Presenter
Presentation Notes
As an example, look at this chart of hard drive capacity over a 15-year time. In 1991, drives were very small, a typical hard drive was 40 MB, yet reading at the blistering rate of 0.7 MB/s it took less than a minute to read every bit on the disk.   Now jump forward to 2006. Drives can now read at 65 MB/s, 93 times faster. However drives are now almost 20,000 times the capacity! This means instead of taking 57 seconds to read, it takes almost 12,000 seconds, or a little over 3 hours to read fully.   Things haven't gotten any better lately, today's 2 TB disks, which can read at 80 MB/s can take about 8 hours to read.

Impact on RAID 5

RAID 5 can tolerate only one error at a time
After first failure, data is in a vulnerable state
  No additional redundancy exists
  Secondary disk failure causes irrecoverable loss
This was exceedingly unlikely when a disk could be rebuilt in minutes (as was the case in 1991)
Today, disks can take hours or days to rebuild
  Longer rebuild time means the chance of a secondary failure is ~500 times greater

6

Presenter
Presentation Notes
So what impact has this increased read time had on the reliability of RAID? After a disk failure in a RAID 5 array, the data on that array exists in a highly vulnerable state, it is essentially in the same configuration as a RAID 0 array, which has no redundancy. Should any other disk fail before the rebuild completes, there will be irrecoverable data loss. One cannot use a single parity bit to solve the problem when two bits are unknown.   For this reason, it is critical that data be rebuilt as fast as possible after a failure, to minimize the window where a secondary failure would be fatal. Since a 2 TB disk takes some 8 hours to rebuild while a disk in 1991 took less than a minute, this represents about a 500-fold increase in risk, or in other words, the chance of losing data to a secondary disk failure is 500 times greater today than it was in 1991.   Note that some vendors have developed strategies to reduce the span of this vulnerability window. IBM's System XIV, claims to have reduced rebuild times to 30 minutes, by spreading the rebuild work across disks. While this does decrease the vulnerability window by some factor, it increases the chance of a secondary failure by the same factor, because more disks are involved in the rebuilding process. There is no net increase in reliability through this approach.   It seems there is no way to rescue RAID 5 from the reduced reliability imposed by the increasing gap between performance and capacity.

Disks can fail in many ways

Outright disk failure is just one possibility

More commonly, one or more sectors may be found unreadable at some future time

A latent failure while rebuilding RAID 5 will cause data loss

[Diagram: paths that lead to different types of drive errors, with the right branch showing causes of Latent Sector Errors (Jon Elerath, 2007 [2])]

7

Presenter
Presentation Notes
Secondary disk failures during rebuilds are only one of RAID 5's problems. More likely to occur are Latent Sector Errors. These errors occur when data is written improperly, or when it becomes corrupted at some later time, preventing it from being read in the future. Every disk employs some form of error correcting code at the sector level so that it can tolerate a certain number of bit flips or other errors. However, there is a maximum number of bit flips that can occur in each sector while remaining recoverable. If this number is exceeded, the drive will recognize that a particular sector cannot be read, resulting in a Latent Sector Error (or LSE). This diagram shows some of the paths that can lead to different types of errors on a drive; the right branch in particular shows some of the causes for LSEs. When a LSE occurs, most disks will try re-reading the track several times with the read head in varying positions, slightly askew to one side or another. In RAID arrays, this prevents the invalid or corrupted data from being used in reassembling the data during reads or rebuilds, though at the cost of a delay. However, if a LSE occurs during a rebuild in which there is no additional redundancy, there will be data loss.

Chance of a LSE during a rebuild

Drive manufacturers often report LSE rates of 1 per every 10^14 to 10^15 bits (11 – 113 TB) read
When disks were only a couple of MB or GB, this probability was negligible
Consider a RAID 5 array using 2 TB disks:
  After a disk failure, all other disks need to be read flawlessly, without encountering a LSE
  For a 4 disk array, 6 TB of data must be read
  This works out to a 41% chance of a LSE during rebuild, assuming a LSE rate of 10^-14 (5% if 10^-15)

8

Presenter
Presentation Notes
The likelihood of encountering a LSE during rebuild is today, quite significant. When disks were only a few MB or even GB, the risk was remote. Perhaps one would be expected every hundred, or every thousand rebuilds. However, with disks in the TB range, the risk is now significant. For an array with 6 TB usable, there is around a 40% chance of encountering one during a rebuild. This means about half the time a disk fails, there will be data loss.

Impact of a LSE during rebuild

A disk sector is corrupted (usually 512 bytes)
  Effect may be minor, even unnoticed
  Other times it may lead to corruption of a file
  If the sector contained critical metadata, it may result in severe file system corruption
In some cases, especially with desktop-class drives, the drive may spend many minutes in a recovery mode, causing it to be kicked from the array and thus failing the whole rebuild

9

Presenter
Presentation Notes
So what effect does a LSE have? There is actually a wide degree of variation in how it could manifest. If the drive is mostly empty, there is a decent chance the corrupt sector will affect a block which is considered free, or unused by the file system. In this case, there may be no noticeable impact. If the drive is mostly utilized, there is a good chance the LSE will result in file corruption, with the risk of a particular file being affected proportional to its size. Larger files are at a higher risk of being affected than smaller ones. If you are really unlucky, the LSE will corrupt some important piece of file system metadata, perhaps a directory or inode list, possibly causing loss of a large number of files or directories. The worst case, however, is when the delay in trying to recover from the error causes the RAID controller to consider that drive failed. When this happens the disk may be ejected from the RAID array. This then counts as a secondary failure, causing the entire rebuild to fail. This is more common with desktop-class drives; enterprise drives are often configured to respond within a fixed amount of time no matter what, so as not to be ejected from the array. Western Digital actually allows this time to be configured, referring to the feature as Time-Limited Error Recovery. Typically, arrays give a drive 8 - 10 seconds to respond to a request before considering it failed.

Quantifying Risk

We now know: bigger disks = increased risks
  But how significant is this risk?
  How much data is expected to be lost?
Fortunately, there are techniques for calculating these risks if one knows the disk's:
  Mean Time To Failure (MTTF)
  Capacity and performance
  Rate of Latent Sector Errors

10

Presenter
Presentation Notes
It should now be clear that the bigger the disk, the bigger the risk, but is that risk significant enough to warrant action? To answer that question requires a little bit of calculation. Knowing a few things about the drives and the configuration in which they are used allows the reliability to be estimated.

Mean Time To Failure (MTTF)

Average time between failures
  Over useful life of a component
  Not to be confused with expected life
A 30-year old human has a MTTF of 900 years [3]
  This doesn't imply they will live another 900 years
  It implies a 1 in 900 chance of failing over 1 year
Example application of MTTF:
  Assume a drive has a MTTF of 20 years
  We operate 1,000 such drives over 6 months
  This works out to 500 drive-years
  We should therefore expect (500 / 20) = 25 failures

11

Presenter
Presentation Notes
The first thing one should know about a disk drive is its mean time to failure, which gives the average amount of time before a failure is observed. Sometimes this statistic is given as an annual failure rate, but the two carry equivalent information assuming the failure rate is constant. If a drive has a 4% Annual Failure Rate, it is equivalent to a 25 year Mean Time To Failure. It is important to note that the MTTF says nothing about a drive's expected or useful life. It is common for manufacturers to report a MTTF over 100 years, but this certainly doesn't mean your drive will last that long. Rather it means that over a year (within the drive's normal lifespan) there is a 1 in 100 chance of the drive failing. Here is an example calculation using MTTF. If you have 1,000 drives with a 20-year MTTF, how many failures would you expect over a 6-month period? You have 500 drive-years; divide this by the 20-year MTTF and you get 25 failures.
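
The slide's arithmetic is simple enough to capture in a small helper; this is only a sketch, assuming the constant failure rate discussed above:

    # Expected number of failures over an operating period, given the MTTF.
    # Assumes a constant failure rate over the drives' useful life.
    def expected_failures(num_drives, years_of_operation, mttf_years):
        drive_years = num_drives * years_of_operation
        return drive_years / mttf_years

    print(expected_failures(1000, 0.5, 20))   # -> 25.0, as in the slide's example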

Mean Time To Repair (MTTR)

Average time to fully repair a failed component
Includes:
  Time for operator to replace failed drive
  Time to rebuild lost data on the new drive
Time to replace can vary significantly
  May be hours or days, or zero with hot spares
Time to rebuild is often estimated
  Take a drive's capacity and divide by its throughput
  This is a best case scenario: in practice rebuilds may compete with normal I/O requests
  Rebuilding at roughly 1/3 of raw throughput, i.e. Capacity / (Throughput / 3), is more realistic [4]

12

Presenter
Presentation Notes
Another value that must be known is the disk's Mean Time to Repair, or MTTR. It includes not only the time for someone to physically service the failed disk, but also the time for the lost data to be rebuilt. This time can be estimated based on the disk's performance and capacity. Oftentimes the system must remain online during a rebuild, and so the rebuild must contend with ongoing I/O requests. This seeking back and forth can significantly reduce performance below a pure sequential read/write on the disks, and so the effective rebuild rate is often only a fraction of the disk's raw performance, making the time to repair correspondingly longer.
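
A rough sketch of such an estimate in Python; the 90 MB/s raw throughput, the binary-TB capacity convention, and the one-third rule of thumb are assumptions chosen so that the effective rate works out to roughly the 30 MB/s and ~19.4-hour MTTR used in the worked example later in the deck:

    # Rough MTTR estimate: replacement time plus capacity divided by the
    # effective rebuild rate. The rebuild competes with normal I/O, so the
    # effective rate is assumed to be a fraction of raw throughput (see [4]).
    def mttr_hours(capacity_tb, raw_throughput_mb_s, rebuild_fraction=1/3,
                   replacement_hours=0.0):
        capacity_mb = capacity_tb * 1024 * 1024            # binary-TB convention
        effective_rate = raw_throughput_mb_s * rebuild_fraction
        return replacement_hours + capacity_mb / effective_rate / 3600

    print(mttr_hours(2, 90))   # ~19.4 hours for a 2 TB disk at ~90 MB/s raw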

Estimating time to data loss

The MTTFs of sub-components can be combined to yield the MTTF for the system as a whole:

  MTTF_computer = (MTTF_cpu^-1 + MTTF_mem^-1 + MTTF_psu^-1)^-1

  Essentially, the inverse of the sum of the inverses
  Also known as the Harmonic Sum
When the MTTFs are identical, a shortcut exists:

  MTTF_sys = MTTF_sc / N

  Where N is the number of sub-components
This explains why RAID 0 is so unreliable
  Has only a fraction of the reliability of an individual disk

  MTTDL_RAID0 = MTTF_disk / NumDisks

13

Presenter
Presentation Notes
When you have a system, there are a number of sub components, all of which may be critical to the proper functioning of the system. Consider a computer, there is a CPU, RAM, motherboard, power supply, hard drive, etc., and every component may have a different MTTF. If you wanted to know the MTTF of your computer, how would you calculate it?   There is a rather simple way, if you take the inverse of each MTTF, essentially 1 over the MTTF, for each component, and sum them all together you arrive at the system's failure rate. If you take the inverse of the failure rate, you get the MTTF.   There is a handy shortcut available when you have a lot of components with the same MTTF, for example hard drives in a RAID array. The mean time for any disk to fail is the MTTF of a disk, divided by the number of disks.   Consider the example where we had 1,000 drives with a MTTF of 20 years. The average time before the first failure would be observed is 20 years divided by 1,000, or 7.3 days. You would be replacing a drive every week.
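
The harmonic sum is easy to sanity-check in code; this small sketch reproduces the 1,000-drive example from the notes:

    # Harmonic-sum combination of sub-component MTTFs: the inverse of the
    # sum of the inverses. With identical components it equals MTTF / N.
    def combined_mttf(mttfs):
        return 1.0 / sum(1.0 / m for m in mttfs)

    # 1,000 drives, each with a 20-year MTTF: first failure after ~7.3 days.
    print(combined_mttf([20.0] * 1000) * 365.25)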

Estimating time to loss in RAID 5

There are two paths to data loss in RAID 5:
  Disk Failure followed by another during rebuild
  Disk Failure followed by a LSE while rebuilding
We know how to predict the time to the first failure:

  MTTFirstFailure = MTTF_disk / NumDisks

  This doesn't imply data loss, only that a rebuild must occur
  We must estimate the likelihood of a secondary failure
Assume the array had N disks to start
  After the first failure, N − 1 disks remain
  One of these must fail during the rebuild to cause data loss

14

Presenter
Presentation Notes
There are two primary paths to data loss in RAID 5: a disk failure, followed by either a secondary disk failure or a latent sector error. While it is true that two latent sector errors can also cause data loss, the chance of this is extremely small, because it would have to be the same sector on two different disks. Going by the 10^-14 error rate, and assuming 512-byte sectors, there is about a 10^-11 chance that any given sector will be unreadable. For the same sector to be bad on two different drives there is roughly a 10^-21 chance. However, even a 2 TB disk only has on the order of 10^9 sectors, so the chances of seeing a double-LSE are around one in 100 billion. We just saw how we can estimate the time before the first disk failure occurs. This is useful, but it only gets us halfway there. In a RAID 5 array, this is just the mean time before a rebuild must occur. We must also estimate the odds of a LSE or secondary failure occurring.

Chance of Secondary Disk Failure

Disk Failure followed by another during rebuild
  Second failure must happen within the rebuild time

  MTTFirstFailure = MTTF_disk / NumDisks
  MTTSecondFailure = MTTF_disk / (NumDisks − 1)

Therefore chance of second failure during rebuild is:

  SecondaryFailureChance = MTTR / MTTSecondFailure

Putting it all together [5]:

  MTTDL_RAID5_DF = MTTFirstFailure / SecondaryFailureChance
  MTTDL_RAID5_DF = MTTF_disk^2 / (MTTR × N × (N − 1))

15

Presenter
Presentation Notes
After the first disk failure, there are N minus 1 disks left. For a secondary disk failure, one of these disks must fail. We can calculate the MTTF of the remaining disks as follows, but remember, the error must occur before the rebuild finishes. The chances of this occurring are equal to MTTR over the MTTF of the remaining drives. Let's say the chance of a secondary failure during rebuild was 1 in 10. This means we would expect to recover from 10 failures of a single drive for each double-disk failure. Therefore to find the mean time to a double-disk failure, we multiply the MTTF of the first disk failing by the inverse of the chance of seeing a failure during the rebuild window. This gives us the mean time to failure of the RAID array, but only for double disk failures. We still need to consider the chances of a Latent Sector Error.
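
A small sketch of this double-disk-failure path using the quantities defined on the slide; the 8,766 hours-per-year conversion is an assumption that appears to match the figures quoted later in the deck:

    # MTTDL for the RAID 5 double-disk-failure path: mean time to the first
    # failure divided by the chance of a second failure within the rebuild.
    def mttdl_raid5_df(mttf_hours, mttr_hours, num_disks):
        mt_first = mttf_hours / num_disks
        mt_second = mttf_hours / (num_disks - 1)
        secondary_chance = mttr_hours / mt_second
        return mt_first / secondary_chance   # = mttf^2 / (mttr * N * (N - 1))

    print(mttdl_raid5_df(220_000, 19.41, 4) / 8_766)   # ~23,700 years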

Chance of LSE During Rebuild

Disk Failure followed by a Latent Sector Error
  (N − 1) disks remain and must be read entirely

  MTTFirstFailure = MTTF_disk / N

Therefore chance of a LSE during rebuild is:

  ErrorDuringRebuild = 1 − (1 − LSErate)^(BitsPerDisk × (N − 1))

Putting it all together:

  MTTDL_RAID5_LSE = MTTFirstFailure / ErrorDuringRebuild
  MTTDL_RAID5_LSE = MTTF_disk / (N × ErrorDuringRebuild)

16

Presenter
Presentation Notes
Disk manufacturers report LSE rates in terms of the amount of data read. If they report a rate of 10^-14 per bit, that means for every bit read there is a 1 minus 10^-14 chance of reading the bit without an error. During a rebuild all remaining disks must be read, so if we count the number of bits, we can determine the probability of success by raising the per-bit success rate to the power of the number of bits to be read. This is the probability that no LSE occurs. By subtracting it from 1, we get the chance of encountering an error. Once this chance of a LSE is known, we apply the same strategy we used for determining the MTTF due to disk failures. We multiply the MTTF of the first disk failure by the inverse of the chance of a LSE.
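
The same idea for the LSE path, sketched below; the binary-TB disk size is an assumption that happens to reproduce the ~41% rebuild-error chance quoted earlier:

    # MTTDL for the RAID 5 disk-failure-then-LSE path: the probability of
    # hitting at least one latent sector error while reading the N-1
    # surviving disks, applied to the mean time to the first failure.
    def error_during_rebuild(lse_rate_per_bit, bits_per_disk, disks_to_read):
        return 1.0 - (1.0 - lse_rate_per_bit) ** (bits_per_disk * disks_to_read)

    def mttdl_raid5_lse(mttf_hours, num_disks, lse_rate_per_bit, bits_per_disk):
        p = error_during_rebuild(lse_rate_per_bit, bits_per_disk, num_disks - 1)
        return (mttf_hours / num_disks) / p

    bits_per_2tb_disk = 2 * 2**40 * 8                      # binary-TB assumption
    print(error_during_rebuild(1e-14, bits_per_2tb_disk, 3))              # ~0.41
    print(mttdl_raid5_lse(220_000, 4, 1e-14, bits_per_2tb_disk) / 8_766)  # ~15.3 years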

Combining paths to loss

There are two paths to data loss in RAID 5:
  Disk Failure followed by another during rebuild
  Disk Failure followed by a LSE while rebuilding
We can now calculate the MTTF for each path, but how can they be combined into a single estimate?
We simply use the Harmonic Sum, as we learned before:

  MTTDL_RAID5 = (MTTDL_RAID5_DF^-1 + MTTDL_RAID5_LSE^-1)^-1

17

Presenter
Presentation Notes
We now know how to calculate the MTTF for both paths to data loss. But often we want to know the MTTF for any data loss, rather than having two separate statistics.   Fortunately, it is simple to combine the two MTTFs, using the same harmonic sum which we learned about earlier. Simply add the inverse of each MTTF, take the inverse of that, and you will have the MTTF for encountering either type of data loss in the array.

What good is a MTTF number?

The MTTF statistic on its own is not very meaningful
  However, it can be used to generate actionable information, such as chance of data loss or expected amount of data loss over a period of time
Failures can be assumed to be random processes
  Constant failure rates imply a Poisson distribution

  FailureChanceOverTime(t) = 1 − e^(−t / MTTDL)

  Where e is Euler's number ~= 2.71828182845904523536…

18

Presenter
Presentation Notes
Say you do all the calculation and you find out the MTTF is 37 years, what does this mean? Does it mean everything will be fine for 37 years, or that half the components will fail by then? Actually it means neither. Without some additional work, there is not much meaning at all in the MTTF number, at least no information which can be used to quantify risk directly. However, assuming that failures are random and evenly distributed over time, it implies they follow a Poisson distribution. This allows straightforward calculation of the chance of a failure over a period of time, using the following formula. e to the negative t over MTTF yields the system reliability over time t. The chance for failure over that same time is 1 minus the reliability. Note that hard disks don't follow a truly constant failure rate; instead, the rate may follow a bathtub curve, or slightly increase over time. For a slightly more accurate calculation one could use a Weibull distribution which takes into account changing failure rates over time. However, studies have found that over a disk's useful life, the failure rate, while variable, usually stays beneath some upper bound. Using this upper bound provides a more conservative estimate of the system's reliability. While it doesn't hurt to slightly underestimate a system's reliability, the opposite error can have catastrophic consequences, therefore it's best to be on the safe side.
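
Sketched as a one-liner, this is the conversion from a MTTDL figure to a chance of loss over a chosen period:

    import math

    # Chance of at least one data-loss event over a period, assuming a
    # constant failure rate (the Poisson / exponential model on this slide).
    def failure_chance(years, mttdl_years):
        return 1.0 - math.exp(-years / mttdl_years)

    print(failure_chance(10, 15.30))   # ~0.48 for the RAID 5 example that follows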

Estimating amount of Data Lost

Another useful statistic is Expected Data Loss (EDL):

  EDL(t) = Z × (t / MTTDL)

Z is the amount of data lost in a data loss event
  For Disk Failures, it is the usable capacity of the RAID array
  For LSEs, it is theoretically the sector size, but more practically it may be the average file size, as often a whole file can become unusable
  Depends on data format and associated application resiliency
Assume an array with 6 TB usable and 500 MB files:

  EDL_DF(t) = 6 TB × (t / MTTDL_DF)
  EDL_LSE(t) = 500 MB × (t / MTTDL_LSE)
  EDL_total(t) = EDL_DF(t) + EDL_LSE(t)

19

Presenter
Presentation Notes
Remember that the mean time to data loss includes two different types of loss: a LSE, which may affect only a sector or a file, and a double-disk failure, which destroys the whole array. While latent sector errors are more common, far more data will be lost on average from double-disk failures. Therefore when some data loss is tolerable, a more important statistic may be the expected amount of data loss. The expected amount of data loss is calculated much like the expected value of a bet on a still-spinning roulette wheel. One would take the chance of winning multiplied by the payoff should one win. With expected data loss, we take the chance of data loss multiplied by the amount of data loss, should such a data loss event occur. Over a long enough period of time, many data loss events will occur, and the actual amount of data loss should closely match expectations. The number of data loss events over time t is simply t over MTTDL. We then multiply by the amount of data lost should that event occur, perhaps the average file size in the case of a LSE, or the usable size of the array in the case of a double-disk failure. These two results are computed separately for double disk failures and LSEs, and added together to yield the expected amount of data loss.
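
Sketched in code with the slide's 6 TB array and 500 MB average file size as inputs (binary units assumed, as elsewhere in these examples):

    # Expected data loss: events expected in the window (t / MTTDL) times the
    # data lost per event (array capacity for disk failures, roughly one file
    # for latent sector errors), summed over the two loss paths.
    def expected_data_loss_mb(years, mttdl_df_years, mttdl_lse_years,
                              array_usable_mb, avg_file_mb):
        edl_df = array_usable_mb * (years / mttdl_df_years)
        edl_lse = avg_file_mb * (years / mttdl_lse_years)
        return edl_df + edl_lse

    # ~2,980 MB over 10 years for the RAID 5 example worked out on the next slides
    print(expected_data_loss_mb(10, 23_705, 15.31, 6 * 2**20, 500))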

Example RAID 5 Calculation

Let's calculate expected data loss and chance of loss for a RAID 5 configuration using 2 TB disks. Assume:
  RAID 5 configuration of 3 data disks, 1 parity disk
  MTTF of disk is 220,000 hours (~4% AFR) [6]
  LSE rate is 10^-14 per bit
  Disks rebuild at 30 MB/s → MTTR is 19.41 hours
  Average weighted file size is 500 MB
    Picking a random sector on the disk, what is the average size of the file that contains that sector?

20

Presenter
Presentation Notes
Taking all we've learned, we can now use that knowledge to do some calculations for an actual system. Assume a system with these attributes: RAID 5 in a (3+1) configuration, 2 TB disks, a LSE rate of 10^-14, a 19-hour rebuild time, and an average file size of 500 MB.
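
For reference, here is a short script that chains the preceding formulas together for this configuration; the binary-TB capacity and 8,766 hours per year are assumptions that appear to land very close to the figures on the next slide:

    import math

    # End-to-end sketch for the RAID 5 (3+1) example with 2 TB disks.
    MTTF, MTTR, N = 220_000.0, 19.41, 4               # hours, hours, disks
    HOURS_PER_YEAR = 8_766
    BITS_PER_DISK = 2 * 2**40 * 8                     # 2 TB disk, binary units
    LSE_RATE = 1e-14

    mttdl_df = MTTF**2 / (MTTR * N * (N - 1)) / HOURS_PER_YEAR
    p_lse = 1 - (1 - LSE_RATE) ** (BITS_PER_DISK * (N - 1))
    mttdl_lse = (MTTF / N) / p_lse / HOURS_PER_YEAR
    mttdl_total = 1 / (1 / mttdl_df + 1 / mttdl_lse)  # harmonic sum

    chance_10y = 1 - math.exp(-10 / mttdl_total)
    edl_10y = 6 * 2**20 * (10 / mttdl_df) + 500 * (10 / mttdl_lse)   # in MB

    print(f"MTTDL_DF  ~ {mttdl_df:,.0f} years")           # ~23,700
    print(f"MTTDL_LSE ~ {mttdl_lse:.1f} years")           # ~15.3
    print(f"10-year chance of loss ~ {chance_10y:.1%}")   # ~48%
    print(f"10-year expected loss ~ {edl_10y:,.0f} MB")   # ~2,980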

Why RAID 5 is dead

Using the same formulae described previously:

  FailureChanceOverTime(10 years) = 47.98%
  MTTF_total = 15.30 years
    MTTF_DF = 23,704.84 years
    MTTF_LSE = 15.31 years
  EDL_total(10 years) = 2,980.68 MB
    EDL_DF(10 years) = 2,654.08 MB
    EDL_LSE(10 years) = 326.60 MB

21

Presenter
Presentation Notes
Using the formulae we learned, we might be surprised to see how little RAID 5 does to protect against data loss. Over a 10 year period, RAID 5 would have nearly a 50% chance of losing data. Note that while the chance of a double disk failure remains very remote, LSEs are comparatively very likely. Despite the increased likelihood of LSEs, most of the expected data loss still comes from double disk failures. You might wonder why I chose 10 years, when drives and systems don't last that long. One can assume that over these 10 years, there may have been a number of hardware replacements, upgrades, migrations; it has no effect on the calculation. Assume a brand new RAID 5 array is constructed each year over those 10 years, and the data is migrated to the fresh system; the end result of the reliability calculation is still the same. When considering the chance of data loss, or the amount of data loss, the lifetime of the system is of little consequence. What should matter is how long does that data need to remain protected, or how long will that data remain useful? For some documents it may be a year or less, for others, it may be many centuries or longer. The lifetime of the data and the impact of its loss should be the drivers in deciding how reliable a system is needed.

The successor, RAID 6

RAID 6 can recover from two simultaneous failures
Loss requires one of:
  3 disk failures during a rebuild window
  2 disk failures during a rebuild window plus a LSE
Reliability formulas for RAID systems:

  MTTDL_RAID0_DF = MTTF_disk^1 / (MTTR^0 × N)
  MTTDL_RAID5_DF = MTTF_disk^2 / (MTTR^1 × N × (N − 1))
  MTTDL_RAID6_DF = 2 × MTTF_disk^3 / (MTTR^2 × N × (N − 1) × (N − 2))

Notice a pattern?

22

Presenter
Presentation Notes
Given how poorly RAID 5 works on today's hardware, it is little wonder why the industry has made a move towards systems offering dual parity, also known as RAID 6. RAID 6 can recover and rebuild from two simultaneous disk failures, or a disk failure followed by a LSE. While two disks are wasted for storing redundant information, this doesn't mean RAID 6 has to be less storage efficient. By doubling the dimensions of any RAID 5 array, the same level of efficiency can be had. For example, a 3+1 or 4+1 RAID 5 array has the same efficiency as a 6+2 or 8+2 RAID 6 array, respectively. Looking at the reliability formulae for different levels of RAID, a pattern begins to emerge.

Why RAID 6 is so much better

Every additional tolerated failure increases MTTF by a factor of: MTTF / (MTTR × N)
  MTTF is usually many years, while MTTR is a time in hours
  With current disk MTTF and MTTR times, each additional tolerated failure increases reliability by a factor of several hundred to a few thousand!
Reliability metrics for a RAID 6 array (6+2):
  FailureChanceOverTime(10 years) = 0.13%
  EDL_total(10 years) = 7.20 MB

23

Presenter
Presentation Notes
Essentially, for every additional tolerated failure, the calculated mean time to failure increases by another MTTF over MTTR. Since the MTTF is usually much larger than the MTTR, this has the effect of increasing the MTTF of the RAID array by a factor of a few hundred to a few thousand.   If we keep all the other properties of the disk the same as they were for the RAID 5 array we looked at, but change the configuration from a 3+1 to a 6+2, the chance of data loss over 10 years drops from 48% to 0.13%. Or, from 1 in 2, to 1 in 780. A 340-fold increase in reliability!   The expected data loss over this time also drops, from about 3 GB to 7.2 MB.
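
The same sketch can be applied to a 6+2 RAID 6 array of the same disks, using the dual-parity formula from the previous slide (including its leading factor of 2); under the same unit assumptions it lands close to the 0.13% and 7.20 MB figures quoted above:

    import math

    # 6+2 RAID 6 array of the same 2 TB disks as the RAID 5 example.
    MTTF, MTTR, N, HOURS_PER_YEAR = 220_000.0, 19.41, 8, 8_766
    BITS_PER_DISK, LSE_RATE = 2 * 2**40 * 8, 1e-14

    mttdl_df = 2 * MTTF**3 / (MTTR**2 * N * (N - 1) * (N - 2)) / HOURS_PER_YEAR
    p_lse = 1 - (1 - LSE_RATE) ** (BITS_PER_DISK * (N - 2))   # 6 disks re-read
    mttdl_lse = MTTF**2 / (MTTR * N * (N - 1) * p_lse) / HOURS_PER_YEAR
    mttdl_total = 1 / (1 / mttdl_df + 1 / mttdl_lse)

    print(1 - math.exp(-10 / mttdl_total))                    # ~0.0013 (0.13%)
    print(12 * 2**20 * 10 / mttdl_df + 500 * 10 / mttdl_lse)  # ~7.2 MB over 10 years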

Problem Solved?

For that RAID 6 system, the chance of data loss over 10 years is about 1 in 780
  It would seem the data loss daemon has been slain
However, there are two factors not accounted for:
  Some storage systems are massive (in the PB scale)
  Disk capacities keep doubling

24

Presenter
Presentation Notes
It would seem the problem of data loss has once and for all been solved. For the time being, and for small systems, this is true. But as we shall see on the following slides, problems remain for systems in the Petabyte scale, and as disks continue to grow in capacity, the same issues which haunted RAID 5 will visit upon RAID 6.

Issues of Scale

Large systems require a large number of arrays
  One cannot create a 998+2 RAID 6 array
  Too many disks would have to be touched for each update
  The chance of tertiary failures would be too great
Each array has its own independent chance of failure
  Recall that MTTF_sys = MTTF_sc / N
  It's true whether the component is a disk or an array
Consider a 5 PB storage system
  This requires 427 individual RAID 6 arrays
  Assuming 2 TB disks in a 6+2 configuration
Failure of any array causes irrecoverable data loss

25

Presenter
Presentation Notes
When one has many PB to store, they can't get around having many hundreds of individual RAID 6 arrays. Each array has its own chance for failure, and should any raid array fail, the system experiences data loss. It is as if the massive storage system is a big RAID 0 array, only each sub-component is a RAID 6 array instead of a single disk.   Consider a system which needed to store 5 PB. Using RAID 6 arrays of 2 TB disks, one would need 427 independent RAID 6 arrays just to provide sufficient usable storage.
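
Carrying the per-array figure from the earlier 6+2 sketch forward, the scale-out penalty is a one-line calculation:

    # 427 independent 6+2 arrays behave like a RAID 0 of arrays, so the system
    # MTTDL is the per-array MTTDL divided by the number of arrays.
    per_array_mttdl_years = 7_788    # approximate combined MTTDL from the 6+2 sketch
    arrays = 427                     # ~5 PB usable with 2 TB disks in 6+2 arrays
    print(per_array_mttdl_years / arrays)   # ~18.2 years, as the next slide shows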

Why RAID 6 is dead (in big systems)

For a 5 Petabyte RAID 6 system:
  FailureChanceOverTime(10 years) = 42.19%
  MTTF_total = 18.24 years
    MTTF_DF = 44,944.69 years
    MTTF_LSE = 18.25 years
  EDL_total(10 years) = 3,073.56 MB
    EDL_DF(10 years) = 2,799.64 MB
    EDL_LSE(10 years) = 273.91 MB
This is essentially as bad as the single RAID 5 array…

26

Presenter
Presentation Notes
If we run the same reliability calculations for the 5 PB system, composed of RAID 6 arrays, we get the following reliability numbers. A 42% chance of failure over 10 years, and 3 GB of expected data loss. This is nearly identical to the single RAID 5 array.

Why RAID 6 is dead (for big disks)

[Chart: Annual Chance of Data Loss in a 1,000 Disk System (linear scale), plotted against Disk Size in Gigabytes from 1,000 to 32,000, for RAID 5 (3+1) and RAID 6 (6+2)]

27

Presenter
Presentation Notes
If disks continue their trend of doubling every year, it isn't long before even moderately sized RAID 6 systems become impractical. This graph shows two systems with a usable capacity equal to 1,000 disks. As disk sizes grow, the annual chance of data loss rapidly approaches 100%. RAID 5 is nearing the end of its S-curve as we are at 2 TB disks, while RAID 6 is just beginning its S-curve.   Note that in this graph it is assumed that disk performance remains constant over the next few years. While performance may increase slightly, there are physical limits to how fast platters can spin without falling apart, which is why disk RPM speeds have remained around 7200 rpm for the past decade.

Is Replication The Answer?

When spending millions of dollars for a storage system, who wants to double or triple that cost?
Instead, we can take the same path that was taken from RAID 5 to RAID 6
  Scale out fault tolerance
  Maintain same level of storage efficiency
  Only additional cost: increased processing

28

Presenter
Presentation Notes
So what is the solution, do we have to return to replication?   This would be both very costly and quite unnecessary. Others have suggested moving to triple-parity RAID, but this is only kicking the can, in a few short years disks will make triple parity RAID obsolete.   Instead what is needed are systems which can adapt their reliability on the fly as required, remaining backwards compatible with previous formats. There should be open standards to support any arbitrary dimension of parity, using standardized implementations and a common on-disk format.   This would allow maintenance of a small fixed overhead for storage, with the only cost over time being increased processing. However CPUs are doubling in power as regularly as hard drives, and so this cost will be amortized over time to near nothing.

Reliability for arbitrary K-of-N

Where K is the number of data disks, and N is the total number of disks in the array

System tolerates N − K failures without loss [7]

  MTTF_DF = (N − K)! × MTTF_disk^(N − K + 1) / (MTTR^(N − K) × N × (N − 1) × … × K)

  MTTF_LSE = (N − K − 1)! × MTTF_disk^(N − K) / (MTTR^(N − K − 1) × N × (N − 1) × … × (K + 1) × ErrorDuringRebuild)

29

Presenter
Presentation Notes
A system which requires K components to function, out of N total, is known as a K-of-N system. RAID 5 and RAID 6 are specific cases of a K-of-N system: for RAID 5, K must always be N−1, and for RAID 6, it must always be N−2. However, there are techniques known as erasure codes which allow any arbitrary value to be selected for K and N, so long as K is less than or equal to N. These two formulas generalize the reliability formulas we looked at for RAID, and apply to any combination of K and N. For example, assume RAID 0, and therefore N=K. You get the familiar MTTF / N. The latent sector error formula is essentially identical to the one above, except notice that 1 has been added to K. This is because we are concerned with the mean time to a maximum number of failures. We then multiply by the inverse of the probability of a latent sector error.
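
The slide's formulas did not survive extraction cleanly, so the sketch below is a reconstruction rather than the author's exact formulation: it generalizes the RAID 0/5/6 formulas shown earlier (the factorial term reflects parallel repair of failed disks and reduces to the factor of 2 in the RAID 6 case) and follows the notes' description of the LSE variant:

    import math

    # General K-of-N MTTDL sketch. The product N*(N-1)*...*K counts the ways
    # successive disks can fail; the factorial reflects parallel repairs.
    def mttdl_df(mttf, mttr, n, k):
        prod = math.prod(range(k, n + 1))                   # N*(N-1)*...*K
        return (math.factorial(n - k) * mttf**(n - k + 1)
                / (mttr**(n - k) * prod))

    def mttdl_lse(mttf, mttr, n, k, lse_rate, bits_per_disk):
        # Mean time to (N-K) disk failures, then the chance of hitting a LSE
        # while reading the K surviving disks during the final rebuild.
        prod = math.prod(range(k + 1, n + 1))               # N*(N-1)*...*(K+1)
        to_last_rebuild = (math.factorial(n - k - 1) * mttf**(n - k)
                           / (mttr**(n - k - 1) * prod))
        p = 1 - (1 - lse_rate) ** (bits_per_disk * k)
        return to_last_rebuild / p

    # Reduces to the earlier special cases, e.g. the 6+2 array's DF path:
    print(mttdl_df(220_000, 19.41, 8, 6) / 8_766)           # ~1.9e7 years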

Solution: Scale Fault Tolerance

[Chart: Annual Chance of Data Loss in a 1,000 Disk System (log scale), plotted against Disk Size in Gigabytes from 1,000 to 32,000, for RAID 5 (3+1), RAID 6 (6+2), RAID (12+4), RAID (18+6), and RAID (24+8)]

30

Presenter
Presentation Notes
Using the generalized reliability formula, we may estimate the chance of data loss for as-of-yet non-existent dimensions of RAID. This is the same graph as before, but on a log scale. Notice how RAID 5 and RAID 6 approach a 100% chance of error. However, if one increased the redundancy to tolerate 4, 6, or 8 errors, for example, look at how much more reliable a system can be formed. Also note that all these estimates are best-case scenarios. They don't factor in operator errors, natural disasters, batch-correlated failures, etc. So in reality the true reliability of the system is worse than we would guess from the idealized calculations. Therefore, one should plan for an extra margin of safety by using a configuration that seems more reliable than is needed. Also, there are approaches for reducing the risks of correlated failures. One recent technique is called RAIN, redundant array of independent nodes. This places different disks for the same array in different boxes, so that a bad memory or bad power supply in one box doesn't end up corrupting data across all the disks. Taking it one step further, there is geographic dispersion, which places these nodes at different sites so no one natural disaster or site failure will cause data loss. By using different "vintages" of hard drives, or mixing manufacturers, one also achieves a level of protection against bad batches of disks. An interesting note is that each of these lines, while having a different slope, seems to be converging on a common point. This point lies several more doublings away, perhaps when disks approach a quarter PB in size or so. It is the time when disks are so large that the mean time to repair a disk approaches its mean time to failure. In other words, a disk would be expected to fail before one could finish filling it. Obviously by then, some new paradigm of storage is required, but whatever technology ultimately replaces hard drives, K-of-N systems retain their usefulness, as they offer very high reliability with minimal overhead.

References

[1] http://www.tomshardware.com/reviews/15-years-of-hard-drive-history,1368.html

[2] http://queue.acm.org/detail.cfm?id=1317403

[3] http://www.faqs.org/faqs/arch-storage/part2/section-151.html

[4] http://storageadvisors.adaptec.com/2005/11/01/raid-reliability-calculations/

[5] Chen et al., "RAID: High-Performance, Reliable Secondary Storage" (1994)

[6] Pinheiro et al., "Failure Trends in a Large Disk Drive Population" (2007)

[7] On Computing MTBF for a k-out-of-n:G Repairable System

31