

    RAID: High-Performance, Reliable Secondary Storage

    PETER M. CHEN

    Department of Electrical Engineering and Computer Science, 1301 Beal Avenue,

    University of Michigan, Ann Arbor, Michigan, 48109-2122

    EDWARD K. LEE

    DEC Systems Research Center, 130 Lytton Avenue, Palo Alto, California 94301-1044

    GARTH A. GIBSON

    School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue,

    Pittsburgh, Pennsylvania 15213-3891

    RANDY H. KATZ

    Department of Electrical Engineering and Computer Science, 571 Evans Hall,

    University of California, Berkeley, California 94720

    DAVID A. PATTERSON

    Department of Electrical Engineering and Computer Science, 571 Evans Hall,

    University of California, Berkeley, California 94720

    Disk arrays were proposed in the 1980s as a way to use parallelism between multiple disks to improve aggregate I/O performance. Today they appear in the product lines of most major computer manufacturers. This article gives a comprehensive overview of disk arrays and provides a framework in which to organize current and future work. First, the article introduces disk technology and reviews the driving forces that have popularized disk arrays: performance and reliability. It discusses the two architectural techniques used in disk arrays: striping across multiple disks to improve performance and redundancy to improve reliability. Next, the article describes seven disk array architectures, called RAID (Redundant Arrays of Inexpensive Disks) levels 0-6, and compares their performance, cost, and reliability. It goes on to discuss advanced research and implementation topics such as refining the basic RAID levels to improve performance and designing algorithms to maintain data consistency. Last, the article describes six disk array prototypes or products and discusses future opportunities for research, with an annotated bibliography of disk array-related literature.

    Categories and Subject Descriptors: B.4.2 [Input/Output and Data Communications]: Input/Output Devices; B.4.5 [Input/Output and Data Communications]: Reliability, Testing, and Fault-Tolerance; D.4.2 [Operating Systems]: Storage Management; E.4 [Data]: Coding and Information Theory

    General Terms: Design, Performance, Reliability

    Additional Key Words and Phrases: Disk array, parallel I/O, RAID, redundancy,

    storage, striping

    Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1994 ACM 0360-0300/94/0600-0145 $03.50

    ACM Computing Surveys, Vol. 26, No. 2, June 1994


    CONTENTS

    1. INTRODUCTION
    2. BACKGROUND
       2.1 Disk Terminology
       2.2 Data Paths
       2.3 Technology Trends
    3. DISK ARRAY BASICS
       3.1 Data Striping and Redundancy
       3.2 Basic RAID Organizations
       3.3 Performance and Cost Comparisons
       3.4 Reliability
       3.5 Implementation Considerations
    4. ADVANCED TOPICS
       4.1 Improving Small Write Performance for RAID Level 5
       4.2 Declustered Parity
       4.3 Exploiting On-Line Spare Disks
       4.4 Data Striping in Disk Arrays
       4.5 Performance and Reliability Modeling
    5. CASE STUDIES
       5.1 Thinking Machines Corporation ScaleArray
       5.2 StorageTek Iceberg 9200 Disk Array Subsystem
       5.3 NCR 6298
       5.4 TickerTAIP/DataMesh
       5.5 The RAID-II Storage Server
       5.6 IBM Hagar Disk Array Controller
    6. OPPORTUNITIES FOR FUTURE RESEARCH
       6.1 Experience with Disk Arrays
       6.2 Interaction among New Organizations
       6.3 Scalability, Massively Parallel Computers, and Small Disks
       6.4 Latency
    7. CONCLUSIONS
    ACKNOWLEDGMENTS
    ANNOTATED BIBLIOGRAPHY

    1. INTRODUCTION

    In recent years, interest in RAID, Redundant Arrays of Inexpensive Disks,¹ has grown explosively. The driving force behind this phenomenon is the sustained exponential improvement in the performance and density of semiconductor technology. Improvements in semiconductor technology make possible faster microprocessors and larger primary memory systems, which in turn require larger, higher-performance secondary storage systems. These improvements have both quantitative and qualitative consequences.

    ¹ Because of the restrictiveness of "Inexpensive," sometimes RAID is said to stand for "Redundant Arrays of Independent Disks."

    On the quantitative side, Amdahl's Law [Amdahl 1967] predicts that large improvements in microprocessors will result in only marginal improvements in overall system performance unless accompanied by corresponding improvements in secondary storage systems. Unfortunately, while RISC microprocessor performance has been improving 50% or more per year [Patterson and Hennessy 1994, p. 27], disk access times, which depend on improvements of mechanical systems, have been improving less than 10% per year. Disk transfer rates, which track improvements in both mechanical systems and magnetic-media densities, have improved at the faster rate of approximately 20% per year, but this is still far slower than the rate of processor improvement. Assuming that semiconductor and disk technologies continue their current trends, we must conclude that the performance gap between microprocessors and magnetic disks will continue to widen.
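    The effect Amdahl's Law describes can be made concrete with a small calculation. The sketch below is ours and purely illustrative: the 10% I/O fraction and the function name are assumptions, not figures from the paper.

```python
# Amdahl's Law: overall speedup when only part of a workload is accelerated.
# The 10% I/O fraction below is an illustrative assumption, not a figure from the paper.

def overall_speedup(io_fraction: float, cpu_speedup: float, io_speedup: float = 1.0) -> float:
    """Whole-workload speedup when CPU-bound work speeds up by cpu_speedup
    and I/O-bound work by io_speedup."""
    cpu_fraction = 1.0 - io_fraction
    return 1.0 / (cpu_fraction / cpu_speedup + io_fraction / io_speedup)

# A workload that spends 10% of its time on I/O gains only ~5.3x from a 10x faster CPU...
print(overall_speedup(io_fraction=0.10, cpu_speedup=10))
# ...and can never gain more than 10x, no matter how fast the CPU becomes.
print(overall_speedup(io_fraction=0.10, cpu_speedup=1e9))
```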

    In addition to the quantitative effect, a second, perhaps more important, qualitative effect is driving the need for higher-performance secondary storage systems. As microprocessors become faster, they make possible new applications and greatly expand the scope of existing applications. In particular, image-intensive applications such as video, hypertext, and multimedia are becoming common. Even in existing application areas such as computer-aided design and scientific computing, faster microprocessors make it possible to tackle new problems requiring faster access to larger data sets. This shift in applications, along with a trend toward large, shared, high-performance, network-based storage systems, is causing us to reevaluate the way we design and use secondary storage systems [Katz 1992].

    Disk arrays, which organize multiple, independent disks into a large, high-performance logical disk, are a natural solution to the problem.

    Disk arrays stripe data across multiple disks and access them in parallel to achieve both higher data transfer rates on large data accesses and higher I/O rates on small data accesses [Salem and Garcia-Molina 1986; Livny et al. 1987]. Data striping also results in uniform load balancing across all of the disks, eliminating hot spots that otherwise saturate a small number of disks while the majority of disks sit idle.

    Large disk arrays are highly vulnerable to disk failures, however. A disk array with 100 disks is 100 times more likely to fail than a single-disk array. An MTTF (mean-time-to-failure) of 200,000 hours, or approximately 23 years, for a single disk implies an MTTF of 2,000 hours, or approximately three months, for a disk array with 100 disks. The obvious solution is to employ redundancy in the form of error-correcting codes to tolerate disk failures. This allows a redundant disk array to avoid losing data for much longer than an unprotected single disk. However, redundancy has negative consequences. Since all write operations must update the redundant information, the performance of writes in redundant disk arrays can be significantly worse than the performance of writes in nonredundant disk arrays. Also, keeping the redundant information consistent in the face of concurrent I/O operations and system crashes can be difficult.
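    A back-of-envelope sketch of this scaling, assuming independent disk failures; the function and constants simply restate the numbers in the paragraph above.

```python
# With independent failures, an array's mean time to first disk failure is the
# single-disk MTTF divided by the number of disks.

def array_mttf(disk_mttf_hours: float, n_disks: int) -> float:
    return disk_mttf_hours / n_disks

DISK_MTTF = 200_000                            # hours, roughly 23 years
print(DISK_MTTF / (24 * 365))                  # ~22.8 years for one disk
print(array_mttf(DISK_MTTF, 100))              # 2,000 hours for 100 disks
print(array_mttf(DISK_MTTF, 100) / (24 * 30))  # ~2.8 months
```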

    A number of different data-striping and redundancy schemes have been developed. The combinations and arrangements of these schemes lead to a bewildering set of options for users and designers of disk arrays. Each option presents subtle tradeoffs among reliability, performance, and cost that are difficult to evaluate without understanding the alternatives. To address this problem, this article presents a systematic tutorial and survey of disk arrays. We describe seven basic disk array organizations along with their advantages and disadvantages and compare their reliability, performance, and cost. We draw attention to the general principles governing the design and configuration of disk arrays as well as practical issues that must be addressed in the implementation of disk arrays. A later section describes optimizations and variations to the seven basic disk array organizations. Finally, we discuss existing research in the modeling of disk arrays and fruitful avenues for future research. This article should be of value to anyone interested in disk arrays, including students, researchers, designers, and users of disk arrays.

    2. BACKGROUND

    This section provides basic background material on disks, I/O data paths, and disk technology trends for readers who are unfamiliar with secondary storage systems.

    2.1 Disk Terminology

    Figure 1 illustrates the basic components of a simplified magnetic disk drive. A disk consists of a set of platters coated with a magnetic medium rotating at a constant angular velocity and a set of disk arms with magnetic read/write heads that are moved radially across the platters' surfaces by an actuator. Once the heads are correctly positioned, data is read and written in small arcs called sectors on the platters' surfaces as the platters rotate relative to the heads. Although all heads are moved collectively, in almost every disk drive, only a single head can read or write data at any given time. A complete circular swath of data is referred to as a track, and each platter's surface consists of concentric rings of tracks. A vertical collection of tracks at the same radial position is logically referred to as a cylinder. Sectors are numbered so that a sequential scan of all sectors traverses the entire disk in the minimal possible time.

    Given the simplified disk described above, disk service times can be broken into three primary components: seek time, rotational latency, and data transfer time.


    Figure 1. Disk terminology. Heads reside on arms which are positioned by actuators. Tracks are concentric rings on a platter. A sector is the basic unit of reads and writes. A cylinder is a stack of tracks at one actuator position. An HDA (head-disk assembly) is everything in the figure plus the air-tight casing. In some devices it is possible to transfer data from multiple surfaces simultaneously, but this is both rare and expensive. The collection of heads that participate in a single logical transfer that is spread over multiple surfaces is called a head group. [The drawing itself, showing platter, inner track, head, arm, and actuator, is not reproduced in this transcript.]

    Seek time is the amount of time needed to move a head to the correct radial position and typically ranges from 1 to 30 milliseconds depending on the seek distance and the particular disk. Rotational latency is the amount of time needed for the desired sector to rotate under the disk head. Full rotation times for disks currently vary from 8 to 28 milliseconds. The data transfer time is dependent on the rate at which data can be transferred to/from a platter's surface and is a function of the platter's rate of rotation, the density of the magnetic media, and the radial distance of the head from the center of the platter; some disks use a technique called zone-bit-recording to store more data on the longer outside tracks than the shorter inside tracks. Typical data transfer rates range from 1 to 5 MB per second. The seek time and rotational latency are sometimes collectively referred to as the head-positioning time. Table 1 tabulates the statistics for a typical high-end disk available in 1993.

    The slow head-positioning time and fast data transfer rate of disks lead to very different performance for a sequence of accesses depending on the size and relative location of each access. Suppose we need to transfer 1 MB from the disk in Table 1, and the data is laid out in two ways: sequential within a single cylinder or randomly placed in 8 KB blocks.

    Table 1. Specifications for the Seagate ST43401N Elite-3 SCSI Disk Drive.

    Form Factor/Disk Diameter                     5.25 inch
    Capacity                                      2.8 GB
    Cylinders                                     2627
    Tracks Per Cylinder                           21
    Sectors Per Track                             ~99
    Bytes Per Sector                              512
    Full Rotation Time                            11.1 ms
    Minimum Seek (single cylinder)                1.7 ms
    Average Seek (random cylinder to cylinder)    11.0 ms

    Average seek in this table is calculated assuming a uniform distribution of accesses. This is the standard way manufacturers report average seek times. In reality, measurements of production systems show that spatial locality significantly lowers the effective average seek distance [Hennessy and Patterson 1990, p. 559].

    In either case the time for the actual data transfer of 1 MB is about 200 ms. But the time for positioning the head goes from about 16 ms in the sequential layout to about 2000 ms in the random layout. This sensitivity to the workload is why I/O-intensive applications are categorized as high data rate, meaning minimal head positioning via large, sequential accesses, or high I/O rate, meaning lots of head positioning via small, more random accesses.


    For example, scientific programs that manipulate large arrays of data fall in the high data rate category, while transaction-processing programs fall in the high I/O rate category.
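    The numbers in the example above can be approximated from the Table 1 specifications with a rough model that charges an average seek plus half a rotation per positioning and transfers one track per revolution. The sketch below is our approximation, not the authors' exact arithmetic.

```python
# Rough service-time model for the Table 1 drive: 1 MB read sequentially versus
# 1 MB read as 128 randomly placed 8 KB blocks.
SECTOR_BYTES = 512
SECTORS_PER_TRACK = 99
ROTATION_MS = 11.1
AVG_SEEK_MS = 11.0

track_bytes = SECTOR_BYTES * SECTORS_PER_TRACK
transfer_1mb_ms = (2**20 / track_bytes) * ROTATION_MS   # ~230 ms (the text rounds to ~200 ms)
positioning_ms = AVG_SEEK_MS + ROTATION_MS / 2          # ~16.6 ms per positioning

print(f"transfer of 1 MB:                    ~{transfer_1mb_ms:.0f} ms")
print(f"positioning, sequential layout:      ~{positioning_ms:.0f} ms")
print(f"positioning, 128 random 8 KB blocks: ~{128 * positioning_ms:.0f} ms")   # ~2100 ms
```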

    2.2 Data Paths

    A hierarchy of industry standard interfaces has been defined for transferring data recorded on a disk platter's surface to or from a host computer. In this section we review the complete data path, from a disk to a user's application (Figure 2). We assume a read operation for the purposes of this discussion.

    On the disk platter's surface, information is represented as reversals in the direction of stored magnetic fields. These "flux reversals" are sensed, amplified, and digitized into pulses by the lowest-level read electronics. The protocol ST506/412 is one standard that defines an interface to disk systems at this lowest, most inflexible, and technology-dependent level. Above this level of the read electronics path, pulses are decoded to separate data bits from timing-related flux reversals. The bit-level ESDI (Enhanced Small Device Interface) and SMD (Storage Module Interface) standards define an interface at this more flexible, encoding-independent level. Then, to be transformed into the highest, most flexible packet-level, these bits are aligned into bytes, error-correcting codes applied, and the extracted data delivered to the host as data blocks over a peripheral bus interface such as SCSI (Small Computer Standard Interface) or IPI-3 (the third level of the Intelligent Peripheral Interface). These steps are performed today by intelligent on-disk controllers, which often include speed matching and caching "track buffers." SCSI and IPI-3 also include a level of data mapping: the computer specifies a logical block number, and the controller embedded on the disk maps that block number to a physical cylinder, track, and sector. This mapping allows the embedded disk controller to avoid bad areas of the disk by remapping logical blocks that are affected to new areas of the disk.

    Topologies and devices on the data path between disk and host computer vary widely depending on the size and type of I/O system. Mainframes have the richest I/O systems, with many devices and complex interconnection schemes to access them. An IBM channel path, which encompasses the set of cables and associated electronics that transfer data and control information between an I/O device and main memory, consists of a channel, a storage director, and a head of string. The collection of disks that share the same pathway to the head of string is called a string. In the workstation/file server world, the channel processor is usually called an I/O controller or host-bus adaptor (HBA), and the functionality of the storage director and head of string is contained in an embedded controller on the disk drive. As in the mainframe world, the use of high-level peripheral interfaces such as SCSI and IPI-3 allows multiple disks to share a single peripheral bus or string.

    From the HBA, data is transferred via direct memory access, over a system bus, such as VME (Versa Module Eurocard), S-Bus, MicroChannel, EISA (Extended Industry Standard Architecture), or PCI (Peripheral Component Interconnect), to the host operating system's buffers. Then, in most operating systems, the CPU performs a memory-to-memory copy over a high-speed memory bus from the operating system buffers to buffers in the application's address space.

    2.3 Technology Trends

    Much of the motivation for disk arrays comes from the current trends in disk technology. As Table 2 shows, magnetic disk drives have been improving rapidly by some metrics and hardly at all by other metrics. Smaller distances between the magnetic read/write head and the disk surface, more accurate positioning


    [Figure 2, the I/O data path between the host (channel processor or host-bus adaptor) and the disk controller/storage director, and the remainder of this page of the original article are not legible in this transcript.]

    Table 2. Trends in Disk Technology.

                                      1993                        Historical Rate of Improvement
    Areal Density                     50-150 Mbits/sq. inch       27% per year
    Linear Density                    40,000-60,000 bits/inch     13% per year
    Inter-Track Density
    Capacity (3.5" form factor)
    Transfer Rate
    Seek Time

    Magnetic disks are improving rapidly in density and capacity, but more slowly in performance. Areal density is the recording density per square inch of magnetic media. In 1989, IBM demonstrated a 1 Gbit/sq.-inch density in a laboratory environment. Linear density is the number of bits written along a track. Inter-track density refers to the number of concentric tracks on a single platter.

    3. DISK ARRAY BASICS

    This section examines the basic issues in the design and organization of disk arrays. In particular, we examine the concepts of data striping and redundancy; basic RAID organizations; performance and cost comparisons between the basic RAID organizations; reliability of RAID-based systems in the face of system crashes, uncorrectable bit errors, and correlated disk failures; and finally, issues in the implementation of block-interleaved, redundant disk arrays.

    3.1 Data Striping and Redundancy

    Redundant disk arrays employ two orthogonal concepts: data striping for improved performance and redundancy for improved reliability. Data striping distributes data transparently over multiple disks to make them appear as a single fast, large disk. Striping improves aggregate I/O performance by allowing multiple I/Os to be serviced in parallel. There are two aspects to this parallelism. First, multiple independent requests can be serviced in parallel by separate disks. This decreases the queuing time seen by I/O requests. Second, single multiple-block requests can be serviced by multiple disks acting in coordination. This increases the effective transfer rate seen by a single request. The more disks in the disk array, the larger the potential performance benefits. Unfortunately, a large number of disks lowers the overall reliability of the disk array, as mentioned before. Assuming independent failures, 100 disks collectively have only 1/100th the reliability of a single disk. Thus, redundancy is necessary to tolerate disk failures and allow continuous operation without data loss.

    We will see that the majority of redundant disk array organizations can be distinguished based on two features: (1) the granularity of data interleaving and (2) the method and pattern in which the redundant information is computed and distributed across the disk array. Data interleaving can be characterized as either fine grained or coarse grained. Fine-grained disk arrays conceptually interleave data in relatively small units so that all I/O requests, regardless of their size, access all of the disks in the disk array. This results in very high data transfer rates for all I/O requests but has the disadvantages that (1) only one logical I/O request can be in service at any given time and (2) all disks must waste time positioning for every request.


    Coarse-grained disk arrays interleave data in relatively large units so that small I/O requests need access only a small number of disks while large requests can access all the disks in the disk array. This allows multiple small requests to be serviced simultaneously while still allowing large requests to see the higher transfer rates afforded by using multiple disks.
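    A minimal sketch of the coarse-grained case, showing how logical blocks might be laid out round-robin across the disks; the function and its parameters are our illustration, not an interface from the paper.

```python
# Coarse-grained (block-interleaved) striping without redundancy: logical blocks
# are distributed round-robin across the disks in units of one striping unit.

def locate(logical_block: int, n_disks: int, stripe_unit_blocks: int = 1):
    """Map a logical block number to (disk index, block offset on that disk)."""
    unit, within = divmod(logical_block, stripe_unit_blocks)
    disk = unit % n_disks
    offset = (unit // n_disks) * stripe_unit_blocks + within
    return disk, offset

# With 4 disks and a 1-block striping unit, blocks 0..7 land on disks 0,1,2,3,0,1,2,3.
print([locate(b, n_disks=4) for b in range(8)])
```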

    The incorporation of redundancy in disk arrays brings up two somewhat orthogonal problems. The first problem is to select the method for computing the redundant information. Most redundant disk arrays today use parity, though some use Hamming or Reed-Solomon codes. The second problem is that of selecting a method for distributing the redundant information across the disk array. Although there are an unlimited number of patterns in which redundant information can be distributed, we classify these patterns roughly into two different distribution schemes: those that concentrate redundant information on a small number of disks and those that distribute redundant information uniformly across all of the disks. Schemes that uniformly distribute redundant information are generally more desirable because they avoid hot spots and other load-balancing problems suffered by schemes that do not distribute redundant information uniformly. Although the basic concepts of data striping and redundancy are conceptually simple, selecting between the many possible data striping and redundancy schemes involves complex tradeoffs between reliability, performance, and cost.

    3.2 Basic RAID Organizations

    This section describes the basic RAID organizations that will be used as the basis for further examinations of the performance, cost, and reliability of disk arrays. In addition to presenting RAID levels 1 through 5 that first appeared in the landmark paper by Patterson, Gibson, and Katz [Patterson et al. 1988], we present two other RAID organizations, RAID levels 0 and 6, that have since become generally accepted.² For the benefit of those unfamiliar with the original numerical classification of RAID, we will use English phrases in preference to the numerical classifications. It should come as no surprise to the reader that even the original authors have sometimes been confused with regard to the disk array organization referred to by a particular RAID level! Figure 3 illustrates the seven RAID organizations schematically.

    3.2.1 Nonredundant (RAID Level 0)

    A nonredundant disk array, or RAID level 0, has the lowest cost of any RAID organization because it does not employ redundancy at all. This scheme offers the best write performance since it does not need to update redundant information. Surprisingly, it does not have the best read performance. Redundancy schemes that duplicate data, such as mirroring, can perform better on reads by selectively scheduling requests on the disk with the shortest expected seek and rotational delays [Bitton and Gray 1988]. Without redundancy, any single disk failure will result in data loss. Nonredundant disk arrays are widely used in supercomputing environments where performance and capacity, rather than reliability, are the primary concerns.

    3.2.2 Mirrored (RAID Level 1)

    The traditional solution, called mirroring or shadowing, uses twice as many disks as a nonredundant disk array [Bitton and Gray 1988]. Whenever data is written to a disk the same data is also written to a redundant disk, so that there are always two copies of the information. When data is read, it can be retrieved from the disk with the shorter queuing, seek, and rotational delays [Chen et al. 1990]. If a disk fails, the other copy is used to service requests.

    ² Strictly speaking, RAID level 0 is not a type of redundant array of inexpensive disks since it stores no error-correcting codes.


    [Figure 3 illustrates the seven organizations schematically: Non-Redundant (RAID Level 0), Mirrored (RAID Level 1), Memory-Style ECC (RAID Level 2), Bit-Interleaved Parity (RAID Level 3), Block-Interleaved Parity (RAID Level 4), Block-Interleaved Distributed-Parity (RAID Level 5), and P+Q Redundancy (RAID Level 6); the drawings are not reproduced in this transcript.]

    Figure 3. RAID levels 0 through 6. All RAID levels are illustrated at a user capacity of four disks. Disks with multiple platters indicate block-level striping while disks without multiple platters indicate bit-level striping. The shaded platters represent redundant information.

    Mirroring is frequently used in database applications where availability and transaction rate are more important than storage efficiency [Gray et al. 1990].

    3.2.3 Memory-Style ECC (RAID Level 2)

    Memory systems have provided recovery from failed components with much less cost than mirroring by using Hamming codes [Peterson and Weldon 1972]. Hamming codes contain parity for distinct overlapping subsets of components. In one version of this scheme, four data disks require three redundant disks, one less than mirroring.


    Since the number of redundant disks is proportional to the log of the total number of disks in the system, storage efficiency increases as the number of data disks increases.

    If a single component fails, several of the parity components will have inconsistent values, and the failed component is the one held in common by each incorrect subset. The lost information is recovered by reading the other components in a subset, including the parity component, and setting the missing bit to 0 or 1 to create the proper parity value for that subset. Thus, multiple redundant disks are needed to identify the failed disk, but only one is needed to recover the lost information.

    Readers unfamiliar with parity can think of the redundant disk as having the sum of all the data in the other disks. When a disk fails, you can subtract all the data on the good disks from the parity disk; the remaining information must be the missing information. Parity is simply this sum modulo two.
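    The "sum modulo two" intuition is exactly the exclusive-or. A small sketch of parity computation and single-disk recovery; the block contents and disk count are made up for illustration.

```python
# Parity as a sum modulo two: the parity block is the XOR of the data blocks, and
# any single lost block is the XOR of the surviving blocks and the parity block.
from functools import reduce

def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"\x11\x22", b"\x0f\xf0", b"\xaa\x55"]    # three data "disks" (illustrative)
parity = xor_blocks(data)

lost = 1                                          # pretend disk 1 failed
survivors = [blk for i, blk in enumerate(data) if i != lost] + [parity]
assert xor_blocks(survivors) == data[lost]        # the missing data is recovered
```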

    3.2.4 Bit-Interleaved Parity (RAID Level 3)

    One can improve upon memory-style ECC disk arrays by noting that, unlike memory component failures, disk controllers can easily identify which disk has failed. Thus, one can use a single parity disk rather than a set of parity disks to recover lost information.

    In a bit-interleaved parity disk array, data is conceptually interleaved bit-wise over the data disks, and a single parity disk is added to tolerate any single disk failure. Each read request accesses all data disks, and each write request accesses all data disks and the parity disk. Thus, only one request can be serviced at a time. Because the parity disk contains only parity and no data, the parity disk cannot participate on reads, resulting in slightly lower read performance than for redundancy schemes that distribute the parity and data over all disks. Bit-interleaved parity disk arrays are frequently used in applications that require high bandwidth but not high I/O rates. Also, they are simpler to implement than RAID levels 4, 5, and 6.

    3.2.5 Block-Interleaved Parity (RAID Level 4)

    The block-interleaved parity disk array is similar to the bit-interleaved parity disk array except that data is interleaved across disks in blocks of arbitrary size rather than in bits. The size of these blocks is called the striping unit [Chen and Patterson 1990]. Read requests smaller than the striping unit access only a single data disk. Write requests must update the requested data blocks and must compute and update the parity block. For large writes that touch blocks on all disks, parity is easily computed by exclusive-oring the new data for each disk. For small write requests that update only one data disk, parity is computed by noting how the new data differs from the old data and applying those differences to the parity block. Small write requests thus require four disk I/Os: one to write the new data, two to read the old data and old parity for computing the new parity, and one to write the new parity. This is referred to as a read-modify-write procedure. Because a block-interleaved parity disk array has only one parity disk, which must be updated on all write operations, the parity disk can easily become a bottleneck. Because of this limitation, the block-interleaved distributed-parity disk array is universally preferred over the block-interleaved parity disk array.
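    A sketch of the four-I/O read-modify-write procedure. The in-memory Disk class is our stand-in for a real device, and the layout (one data disk and one parity disk per call) is simplified for illustration.

```python
# Small-write (read-modify-write) parity update in a block-interleaved parity array:
# new_parity = old_parity XOR old_data XOR new_data, for a total of four disk I/Os.

class Disk:
    def __init__(self, nblocks: int, block_size: int = 4):
        self.blocks = [bytes(block_size) for _ in range(nblocks)]
    def read(self, i: int) -> bytes:
        return self.blocks[i]
    def write(self, i: int, data: bytes) -> None:
        self.blocks[i] = data

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(data_disk: Disk, parity_disk: Disk, blk: int, new_data: bytes) -> None:
    old_data = data_disk.read(blk)            # I/O 1: read old data
    old_parity = parity_disk.read(blk)        # I/O 2: read old parity
    new_parity = xor_bytes(xor_bytes(old_parity, old_data), new_data)
    data_disk.write(blk, new_data)            # I/O 3: write new data
    parity_disk.write(blk, new_parity)        # I/O 4: write new parity

d0, d1, p = Disk(8), Disk(8), Disk(8)
small_write(d0, p, 0, b"\x0f\x0f\x0f\x0f")
small_write(d1, p, 0, b"\xf0\x00\xf0\x00")
assert p.read(0) == xor_bytes(d0.read(0), d1.read(0))   # parity stays consistent
```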

    3.2.6 Block-Interleaved Distributed-Parity (RAID Level 5)

    The block-interleaved distributed-parity disk array eliminates the parity disk bottleneck present in the block-interleaved parity disk array by distributing the parity uniformly over all of the disks. An additional, frequently overlooked advantage to distributing the parity is that it also distributes data over all of the disks rather than over all but one. This allows all disks to participate in servicing read operations, in contrast to redundancy schemes with dedicated parity disks in which the parity disk cannot participate in servicing read requests.


    Block-interleaved distributed-parity disk arrays have the best small read, large read, and large write performance of any redundant disk array. Small write requests are somewhat inefficient compared with redundancy schemes such as mirroring, however, due to the need to perform read-modify-write operations to update parity. This is the major performance weakness of RAID level-5 disk arrays and has been the subject of intensive research [Menon et al. 1993; Stodolsky and Gibson 1993].

    The exact method used to distribute parity in block-interleaved distributed-parity disk arrays can affect performance. Figure 4 illustrates the best parity distribution of those investigated in [Lee and Katz 1991b], called the left-symmetric parity distribution. A useful property of the left-symmetric parity distribution is that whenever you traverse the striping units sequentially, you will access each disk once before accessing any disk twice. This property reduces disk conflicts when servicing large requests.
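    One way to express the left-symmetric placement as a mapping from stripe units to disks, consistent with the Figure 4 layout; the function name and signature are ours, not the paper's.

```python
# Left-symmetric placement: parity for stripe row r sits on disk (n-1-r) mod n, and
# the row's data stripe units follow it, wrapping around the array.

def left_symmetric(stripe_unit: int, n_disks: int):
    """Return (disk, row) for a data stripe unit under left-symmetric placement."""
    row, i = divmod(stripe_unit, n_disks - 1)
    parity_disk = (n_disks - 1 - row) % n_disks
    return (parity_disk + 1 + i) % n_disks, row

# Sequential stripe units 0..4 on five disks touch disks 0,1,2,3,4: every disk once
# before any disk repeats, which is the property described above.
print([left_symmetric(s, 5)[0] for s in range(5)])
```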

    3.2.7 P + Q Redundancy (RAID Level 6)

    Parity is a redundancy code capable of correcting any single self-identifying failure. As larger disk arrays are considered, multiple failures are possible, and stronger codes are needed [Burkhard and Menon 1993]. Moreover, when a disk fails in a parity-protected disk array, recovering the contents of the failed disk requires a successful reading of the contents of all nonfailed disks. As we will see in Section 3.4, the probability of encountering an uncorrectable read error during recovery can be significant. Thus, applications with more stringent reliability requirements require stronger error-correcting codes.

    One such scheme, called P + Q redundancy, uses Reed-Solomon codes to protect against up to two disk failures using the bare minimum of two redundant disks. The P + Q redundant disk arrays are structurally very similar to the block-interleaved distributed-parity disk arrays and operate in much the same manner.

    0    1    2    3    P0
    5    6    7    P1   4
    10   11   P2   8    9
    15   P3   12   13   14
    P4   16   17   18   19

    (Left-Symmetric)

    Figure 4. RAID level-5 left-symmetric parity placement. Each square corresponds to a stripe unit. Each column of squares corresponds to a disk. P0 computes the parity over stripe units 0, 1, 2, and 3; P1 computes parity over stripe units 4, 5, 6, and 7; etc. Lee and Katz [1991b] show that the left-symmetric parity distribution has the best performance. Only the minimum repeating pattern is shown.

    In particular, P + Q redundant disk arrays also perform small write operations using a read-modify-write procedure, except that instead of four disk accesses per write request, P + Q redundant disk arrays require six disk accesses due to the need to update both the "P" and "Q" information.

    3.3 Performance and Cost Comparisons

    The three primary metrics in the evaluation of disk arrays are reliability, performance, and cost. RAID levels 0 through 6 cover a wide range of tradeoffs among these metrics. It is important to consider all three metrics to understand fully the value and cost of each disk array organization. In this section, we compare RAID levels 0 through 6 based on performance and cost. The following section examines reliability.

    3.3.1 Ground Rules and Observations

    While there are only three primary metrics in the evaluation of disk arrays (reliability, performance, and cost), there are many different ways to measure each metric and an even larger number of ways of using them. For example, should performance be measured in I/Os per second, bytes per second, or response time? Would a hybrid metric such as I/Os per second per dollar be more appropriate? Once a metric is agreed upon, should we compare systems at the same cost, the same total user capacity, the same performance, or the same reliability? The method one uses depends largely on the purpose of the comparison and the intended use of the system. In time-sharing applications, the primary metric may be user capacity per dollar; in transaction-processing applications the primary metric may be I/Os per second per dollar; and in scientific applications, the primary metric may be bytes per second per dollar. In certain heterogeneous systems, such as file servers, both I/Os per second and bytes per second may be important. In many cases, these metrics may all be conditioned on meeting a reliability threshold.

    Most large secondary storage systems, and disk arrays in particular, are throughput oriented. That is, generally we are more concerned with the aggregate throughput of the system than, for example, its response time on individual requests (as long as requests are satisfied within a specified time limit). Such a bias has a sound technical basis: as techniques such as asynchronous I/O, prefetching, read caching, and write buffering become more widely used, fast response time depends on sustaining a high throughput.

    In throughput-oriented systems, performance can potentially increase linearly as additional components are added; if one disk provides 30 I/Os per second, two should provide 60 I/Os per second. Thus, in comparing the performance of disk arrays, we will normalize the performance of the system by its cost. In other words, we will use performance metrics such as I/Os per second per dollar rather than the absolute number of I/Os per second.

    Even after the metrics are agreed upon, one must decide whether to compare systems of equivalent capacity, cost, or some other metric. We chose to compare systems of equivalent file capacity, where file capacity is the amount of information the file system can store on the device and excludes the storage used for redundancy. Comparing systems with the same file capacity makes it easy to choose equivalent workloads for two different redundancy schemes. Were we to compare systems with different file capacities, we would be confronted with tough choices such as how a workload on a system with user capacity X maps onto a system with user capacity 2X.

    Finally, there is currently much confusion in comparisons of RAID levels 1 through 5. The confusion arises because a RAID level sometimes specifies not a specific implementation of a system but rather its configuration and use. For example, a RAID level-5 disk array (block-interleaved distributed parity) with a parity group size of two is comparable to RAID level 1 (mirroring), with the exception that in a mirrored disk array, certain disk-scheduling and data layout optimizations can be performed that, generally, are not implemented for RAID level-5 disk arrays [Hsiao and DeWitt 1990; Orji and Solworth 1993]. Analogously, a RAID level-5 disk array can be configured to operate equivalently to a RAID level-3 disk array by choosing a unit of data striping such that the smallest unit of array access always accesses a full parity stripe of data. In other words, RAID level-1 and RAID level-3 disk arrays can be viewed as a subclass of RAID level-5 disk arrays. Since RAID level-2 and RAID level-4 disk arrays are, practically speaking, in all ways inferior to RAID level-5 disk arrays, the problem of selecting among RAID levels 1 through 5 is a subset of the more general problem of choosing an appropriate parity group size and striping unit size for RAID level-5 disk arrays. A parity group size close to two may indicate the use of RAID level-1 disk arrays; a striping unit much smaller than the size of an average request may indicate the use of a RAID level-3 disk array.


    Table 3. Throughput per Dollar Relative to RAID Level 0.

                    Small Read    Small Write      Large Read    Large Write    Storage Efficiency
    RAID level 0    1             1                1             1              1
    RAID level 1    1             1/2              1             1/2            1/2
    RAID level 3    1/G           1/G              (G-1)/G       (G-1)/G        (G-1)/G
    RAID level 5    1             max(1/G, 1/4)    1             (G-1)/G        (G-1)/G
    RAID level 6    1             max(1/G, 1/6)    1             (G-2)/G        (G-2)/G

    This table compares the throughputs of various redundancy schemes for four types of I/O requests. Small here refers to I/O requests of one striping unit; large refers to I/O requests of one full stripe (one stripe unit from each disk in an error correction group). G refers to the number of disks in an error correction group. In all cases, the higher the number the better. The entries in this table account for the major performance effects but not some of the second-order effects. For instance, since RAID level 1 stores two copies of the data, a common optimization is to read dynamically the disk whose positioning time to the data is smaller.


    3.3.2 Comparisons

    Table 3 tabulates the maximum throughput per dollar relative to RAID level 0 for RAID levels 0, 1, 3, 5, and 6. The cost of each system is assumed to be proportional to the total number of disks in the disk array. Thus, the table illustrates that given equivalent-cost RAID level-0 and RAID level-1 systems, the RAID level-1 system can sustain half the number of small writes per second that a RAID level-0 system can sustain. Equivalently, we can say that the cost of small writes is twice as expensive in a RAID level-1 system as in a RAID level-0 system. In addition to performance, the table shows the storage efficiency of each disk array organization. The storage efficiency is approximately inverse to the cost of each unit of user capacity relative to a RAID level-0 system. For the above disk array organizations, the storage efficiency is equal to the performance/cost metric for large writes.
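    The table's entries can be written directly as functions of the parity group size G. The sketch below simply restates the Table 3 rows for RAID levels 5 and 6; it is a transcription, not an independent derivation.

```python
# Throughput per dollar relative to RAID level 0, per Table 3, as functions of G.

def raid5_relative_throughput(G: int) -> dict:
    return {"small read": 1.0, "small write": max(1 / G, 1 / 4),
            "large read": 1.0, "large write": (G - 1) / G,
            "storage efficiency": (G - 1) / G}

def raid6_relative_throughput(G: int) -> dict:
    return {"small read": 1.0, "small write": max(1 / G, 1 / 6),
            "large read": 1.0, "large write": (G - 2) / G,
            "storage efficiency": (G - 2) / G}

# With G = 2 the RAID level-5 column matches RAID level 1 (mirroring): small writes
# at 1/2 the RAID level-0 rate and a storage efficiency of 1/2.
print(raid5_relative_throughput(2))
```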

    Figure 5 graphs the performance/cost metrics from Table 3 for RAID levels 1, 3, 5, and 6 over a range of parity group sizes. The performance/cost of RAID level-1 systems is equivalent to the performance/cost of RAID level-5 systems when the parity group size is equal to two. The performance/cost of RAID level-3 systems is always less than or equal to the performance/cost of RAID level-5 systems. This is expected given that a RAID level-3 system is a subclass of RAID level-5 systems derived by restricting the striping unit size such that all requests access exactly a parity stripe of data. Since the configuration of RAID level-5 systems is not subject to such a restriction, the performance/cost of RAID level-5 systems can never be less than that of an equivalent RAID level-3 system. It is important to stress that these performance/cost observations apply only to the abstract models of disk arrays for which we have formulated performance/cost metrics. In reality, a specific implementation of a RAID level-3 system can have better performance/cost than a specific implementation of a RAID level-5 system.

    As previously mentioned, the question of which RAID level to use is often better expressed as more general configuration questions concerning the size of the parity group and striping unit. If a parity group size of two is indicated, then mirroring is desirable. If a very small striping unit is indicated, then a RAID level-3 system may be sufficient. To aid the reader in evaluating such decisions, Figure 6 plots the four performance/cost metrics from Table 3 on the same graph for each of RAID levels 3, 5, and 6.


    [Figure 5 contains four graphs of throughput per dollar versus parity group size, one each for small reads, small writes, large reads, and large writes, comparing RAID levels 1, 3, 5, and 6; the graphs are not reproduced in this transcript.]

    Figure 5. Throughput per dollar relative to RAID level 0. RAID level-1 performance is approximately equal to RAID level-5 performance with a group size of two. Note that for small writes, RAID levels 3, 5, and 6 are equally cost effective at small group sizes, but as group size increases, RAID levels 5 and 6 become better alternatives.

    This makes the performance/cost tradeoffs explicit in choosing an appropriate parity group size. Section 4.4 addresses how to choose the unit of striping.

    3.4 Reliability

    Reliability is as important a metric to many I/O systems as performance and cost, and it is perhaps the main reason for the popularity of redundant disk arrays. This section starts by reviewing the basic reliability provided by a block-interleaved parity disk array and then lists three factors that can undermine the potential reliability of disk arrays.

    3.4.1 Basic Reliability

    Redundancy in disk arrays is motivated by the need to overcome disk failures. When only independent disk failures are considered, a simple parity scheme works admirably. Patterson et al. [1988] derive the mean time between failures for a RAID level 5 to be

        MTTF(disk)² / (N × (G − 1) × MTTR(disk)),

    where MTTF(disk) is the mean-time-to-failure (MTTF) of a single disk, MTTR(disk) is the mean-time-to-repair (MTTR) of a single disk, N is the total number of disks in the disk array, and G is the number of disks in a parity group.


    [Figure 6, which plots the four performance/cost metrics from Table 3 against parity group size for each of RAID levels 3, 5, and 6, occupied this page of the original article; only the panel label "RAID Level 3" is legible in this transcript, and the graphs are not reproduced.]


    3.4.2 System Crashes

    A system crash refers to any event, such as a power failure, operator error, hardware breakdown, or software crash, that can interrupt an I/O operation to a disk array. Such crashes can interrupt write operations, resulting in states where the data is updated and the parity is not, or vice versa. In either case, the parity is inconsistent and cannot be used in the event of a disk failure. Techniques such as redundant hardware and power supplies can be applied to make such crashes less frequent [Menon and Cortney 1993], but no technique can prevent system crashes 100% of the time.

    System crashes can cause parity inconsistencies in both bit-interleaved and block-interleaved disk arrays, but the problem is of practical concern only in block-interleaved disk arrays. This is because in bit-interleaved disk arrays the inconsistent parity can only affect the data that is currently being written. If writes do not have to be atomic, applications cannot assume either that the write during a system crash completed or did not complete, and thus it is generally permissible for the bit-interleaved disk array to store arbitrary data on the updated sectors. In a block-interleaved disk array, however, an interrupted write operation can affect the parity of other data blocks in that stripe that were not being written. Thus, for reliability purposes, system crashes in block-interleaved disk arrays are similar to disk failures in that they may result in the loss of the correct parity for stripes that were being modified during the crash.

    In actuality, system crashes can be much worse than disk failures for two reasons. First, they may occur more frequently than disk failures. Second, a system crash in disk arrays using P + Q redundancy is analogous to a double disk failure because both the "P" and "Q" information is made inconsistent. To avoid the loss of parity on system crashes, information sufficient to recover the parity must be logged to nonvolatile storage before executing each write operation. The information need only be saved until the corresponding write completes. Hardware implementations of RAID systems can implement such logging efficiently using nonvolatile RAM. In software implementations that do not have access to fast nonvolatile storage, it is generally not possible to protect against system crashes without significantly sacrificing performance.
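    A sketch of the write ordering this implies. What is logged here (the old data and old parity) is one sufficient choice we picked for illustration, not a prescription from the paper; nvram_log stands in for nonvolatile RAM, and the disk objects are assumed to follow the simple read/write interface sketched earlier.

```python
# Keep parity recoverable across a system crash: persist recovery information
# before touching the disks, and discard it once both writes have completed.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def crash_safe_small_write(nvram_log: list, data_disk, parity_disk, blk: int,
                           new_data: bytes) -> None:
    old_data = data_disk.read(blk)
    old_parity = parity_disk.read(blk)
    new_parity = xor_bytes(xor_bytes(old_parity, old_data), new_data)

    nvram_log.append((blk, old_data, old_parity))   # logged before any disk write
    data_disk.write(blk, new_data)
    parity_disk.write(blk, new_parity)
    nvram_log.pop()   # needed only until the corresponding writes complete
```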

    3.4.3 Uncorrectable Bit Errors

    Although modern disks are highly reliable devices that can withstand significant amounts of abuse, they occasionally fail to read or write small bits of data. Currently, most disks cite uncorrectable bit error rates of one error in 10^14 bits read. Unfortunately, the exact interpretation of what is meant by an uncorrectable bit error is unclear. For example, does the act of reading disks actually generate errors, or do the errors occur during writes and become evident during reads?

    Generally, disk manufacturers agree that reading a disk is very unlikely to cause permanent errors. Most uncorrectable errors are generated because data is incorrectly written or gradually damaged as the magnetic media ages. These errors are detected only when we attempt to read the data. Our interpretation of uncorrectable bit error rates is that they represent the rate at which errors are detected during reads from the disk during the normal operation of the disk drive. It is important to stress that there is no generally agreed upon interpretation of bit error rates.

    The primary ramification of an uncorrectable bit error is felt when a disk fails and the contents of the failed disk must be reconstructed by reading data from the nonfailed disks. For example, the reconstruction of a failed disk in a 100 GB disk array requires the successful reading of approximately 200 million sectors of information. A bit error rate of one in 10^14 bits implies that one 512-byte sector in 24 billion sectors cannot be correctly read. Thus, if we assume that the probability of reading sectors is independent of each other, the probability of reading all 200 million sectors successfully is approximately (1 − 1/(2.4 × 10^10))^(2.0 × 10^8) = 99.2%. This means that, on average, 0.8% of disk failures would result in data loss due to an uncorrectable bit error.
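    The arithmetic behind these figures, restated as a short calculation using the numbers quoted in the text:

```python
# Probability that reconstructing a failed disk reads every surviving sector
# successfully, given the bit error rate and array size above.

ARRAY_BYTES = 100 * 10**9         # 100 GB of data to read back
SECTOR_BYTES = 512
BIT_ERROR_RATE = 1e-14            # one uncorrectable error per 10^14 bits read

sectors = ARRAY_BYTES // SECTOR_BYTES             # ~200 million sectors
p_sector_bad = BIT_ERROR_RATE * SECTOR_BYTES * 8  # ~1 in 24 billion sectors
p_all_good = (1 - p_sector_bad) ** sectors

print(f"{p_all_good:.1%} of reconstructions succeed; "
      f"{1 - p_all_good:.1%} hit an uncorrectable bit error")   # ~99.2% / ~0.8%
```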

    The above example indicates that unrecoverable bit errors can be a significant factor in designing large, highly reliable disk arrays. This conclusion is heavily dependent on our particular interpretation of what is meant by an unrecoverable bit error and the guaranteed unrecoverable bit error rates as supplied by the disk manufacturers; actual error rates may be much better.

    One approach that can be used with or without redundancy is to try to protect against bit errors by predicting when a disk is about to fail. VAXsimPLUS, a product from Digital Equipment Corporation, monitors the warnings given by disks and notifies an operator when it feels the disk is about to fail. Such predictions can significantly lower the incidence of data loss [Emlich and Polich 1989; Malhotra and Trivedi 1993].

    3.4.4 Correlated Disk Failures

    The simplest model of the reliability of disk arrays [Patterson et al. 1988] assumes that all disk failures are independent when calculating mean time to data loss. This resulted in very high mean-time-to-data-loss estimates, up to millions of years. In reality, common environmental and manufacturing factors can frequently cause correlated disk failures. For example, an earthquake might sharply increase the failure rate for all disks in a disk array for a short period of time. More commonly, power surges, power failures, and simply the act of powering disks on and off can place simultaneous stress on the electrical components of all affected disks. Disks also share common support hardware; when this hardware fails, it can lead to multiple, simultaneous disk failures.

    Aside from environmental factors, the disks themselves have certain correlated failure modes built into them. For example, disks are generally more likely to fail either very early or very late in their lifetimes. Early failures are frequently caused by transient defects which may not have been detected during the manufacturer's burn-in process; late failures occur when a disk wears out. A systematic manufacturing defect can also produce bad batches of disks that fail close together in time. Correlated disk failures greatly reduce the reliability of disk arrays by making it much more likely that an initial disk failure will be closely followed by additional disk failures before the failed disk can be reconstructed.

    3.4.5 Reliability Revisited

    The previous sections have described how system crashes, uncorrectable bit errors, and correlated disk failures can decrease the reliability of redundant disk arrays. In this section, we will calculate mean-time-to-data-loss statistics after incorporating these factors.

    The new failure modes imply that there are now three relatively common ways to lose data in a block-interleaved parity-protected disk array:

    • double disk failure,
    • system crash followed by a disk failure, and
    • disk failure followed by an uncorrectable bit error during reconstruction.

    As mentioned above, a system crash followed by a disk failure can be protected against in most hardware disk array implementations with the help of nonvolatile storage, but such protection is unlikely in software disk arrays. The above three failure modes are the hardest failure combinations, in that we are currently unaware of any techniques to protect against them without significantly degrading performance. To construct a simple model of correlated disk failures, we will assume that each successive disk failure is 10 times more likely than the previous failure (until the failed disk has been reconstructed). Table 4 tabulates the values of the reliability parameters we will use for calculating numeric reliability estimates in this section.


    Table 4. Reliability Parameters.

    Total User Capacity                               100 disks (500 GB)
    Disk Size                                         5 GB
    Sector Size                                       512 bytes
    Bit Error Rate (BER)                              1 in 10^14 bits (1 in 2.4 x 10^10 sectors)
    p(disk): the probability of reading all sectors
      on a disk (derived from disk size, sector
      size, and BER)                                  99.96%
    Parity Group Size                                 16 disks
    MTTF(disk)                                        200,000 hours
    MTTF(disk2)                                       20,000 hours
    MTTF(disk3)                                       2,000 hours
    MTTR(disk)                                        1 hour
    MTTF(sys)                                         1 month
    MTTR(sys)                                         1 hour

    This table lists parameters used for reliability calculations in this section.

    Note that the reliability estimates will be given per a constant user capacity of 100 disks, consisting of independent, 16-disk parity groups.

    Table 5, which tabulates reliability metrics for RAID level-5 disk arrays, shows that the frequencies of the three failure combinations are within an order of magnitude of each other. This means that none of the three failure modes can be ignored in determining reliability. This makes it difficult to improve the overall reliability of the system without improving the reliability of several components of the system; a more reliable disk will greatly reduce the frequency of double disk failures, but its protection against the other two failure combinations is less pronounced. Frequencies of both system crashes and bit error rates also must be reduced before significant improvements in overall system reliability can be achieved. Also, note the deceptively reassuring MTTDL numbers. Even with an MTTDL of 285 years, there is a 3.4% chance of losing data in the first 10 years.

    Table 6 tabulates the reliability metrics for P + Q redundant disk arrays. As can be seen, system crashes are the Achilles' heel of P + Q redundancy schemes. Since system crashes invalidate both the P and Q information, their effect is similar to a double disk failure. Thus, unless the system provides protection against system crashes, as is assumed in the calculation of the reliability for hardware RAID systems, P + Q redundancy does not provide a significant advantage over parity-protected disk arrays. In general, P + Q redundancy is most useful for protecting against unrecoverable bit errors that occur during reconstruction and against multiple correlated disk failures.

    3.4.6 Summary and Conclusions

    This section has examined the reliability of block-interleaved redundant disk arrays when factors other than independent disk failures are taken into account. We see that system crashes and unrecoverable bit errors can significantly reduce the reliability of block-interleaved parity-protected disk arrays. We have shown that P + Q redundant disk arrays are very effective in protecting against both double disk failures and unrecoverable bit errors but are susceptible to system crashes. In order to realize the full reliability advantages of P + Q redundant disk arrays, nonvolatile storage must be used to protect against system crashes.

Numeric reliability calculations serve as useful guidelines and bounds for the actual reliability of disk arrays. However, it is infeasible to compare the reliability of real systems based on such numbers. Frequently, reliability calculations ignore important implementation-specific factors that are difficult to quantify, such as the reliability of software components. What is useful to know, however, and what we have presented here, are the types of common failures that a disk array can tolerate, how they limit the reliability of the system, and


Table 5. Failure Characteristics for RAID Level-5 Disk Arrays

    Failure Mode               MTTDL Formula                      MTTDL      Probability of Data
                                                                             Loss over 10 Years

    Double Disk Failure        MTTF(disk) x MTTF(disk2) /         285 yrs.   3.4%
                               (N x (G-1) x MTTR(disk))

    Sys Crash + Disk Failure   MTTF(sys) x MTTF(disk) /           154 yrs.   6.3%
                               (N x MTTR(sys))

    Disk Failure + Bit Error   MTTF(disk) /                        36 yrs.   24.4%
                               (N x (1 - p(disk)^(G-1)))

    Software RAID              harmonic sum of the above           26 yrs.   31.6%

    Hardware RAID (NVRAM)      harmonic sum excluding              32 yrs.   26.8%
                               sys crash + disk failure

MTTDL is the mean time to data loss. The 10-year failure rate is the percent chance of data loss in a 10-year period. For numeric calculations, the parity group size, G, is equal to 16, and the user data capacity is equal to 100 data disks. Note that the total number of disks in the system, N, is equal to the number of data disks times G/(G - 1).

Table 6. Failure Characteristics for a P + Q Disk Array

    Failure Mode               MTTDL Formula                      MTTDL        Probability of Data
                                                                               Loss over 10 Years

    Triple Disk Failure        MTTF(disk) x MTTF(disk2)           38052 yrs.   0.03%
                               x MTTF(disk3) /
                               (N x (G-1) x (G-2) x MTTR^2(disk))

    Sys Crash + Disk Failure   MTTF(sys) x MTTF(disk) /             144 yrs.   7.7%
                               (N x MTTR(sys))

    Double Disk Failure        MTTF(disk) x MTTF(disk2) /         47697 yrs.   0.029%
    + Bit Error                (N x (G-1) x (1 - p(disk)^(G-2))
                               x MTTR(disk))

    Software RAID              harmonic sum of the above            143 yrs.   6.8%

    Hardware RAID (NVRAM)      harmonic sum excluding             21166 yrs.   0.05%
                               sys crash + disk failure

MTTDL is the mean time to data loss. The 10-year failure rate is the percent chance of data loss in a 10-year period. For numeric calculations, the parity group size, G, is equal to 16, and the user data capacity is equal to 100 data disks. Note that the total number of disks in the system, N, is equal to the number of data disks times G/(G - 2).


thus its approximate reliability in comparison to other disk array organizations of similar complexity.

    3.5 Implementation Considerations

Although the operation of block-interleaved redundant disk arrays is conceptually simple, a disk array implementer must address many practical considerations for the system to function correctly and reliably at an acceptable level of performance. One problem is that the necessary state information for a disk array consists of more than just the data and parity stored on the disks. Information such as which disks are failed, how much of a failed disk has been reconstructed, and which sectors are currently being updated must be accurately maintained in the face of system crashes. We will refer to such state information that is neither user data nor parity as metastate information. Another problem, addressed in Section 3.5.4, is that multiple disks are usually connected to the host computer via a common bus or string.

    3.5.1 Avoiding Stale Data

The only piece of metastate information that must be maintained in redundant disk arrays is the validity of each sector of data and parity in a disk array. The following restrictions must be observed in maintaining this information.

•   When a disk fails, the logical sectors corresponding to the failed disk must be marked invalid before any request that would normally access the failed disk can be attempted. This invalid mark prevents users from reading corrupted data on the failed disk.

•   When an invalid logical sector is reconstructed to a spare disk, the logical sector must be marked valid before any write request that would normally write to the failed disk can be serviced. This ensures that ensuing writes update the reconstructed data on the spare disk.

Both restrictions are needed to ensure that users do not receive stale data from the disk array. Without the first restriction, it would be possible for users to read stale data from a disk that is considered to have failed but works intermittently. Without the second restriction, successive write operations would fail to update the newly reconstructed sector, resulting in stale data. The valid/invalid state information can be maintained as a bit-vector either on a separate device or by reserving a small amount of storage on the disks currently configured into the disk array. If one keeps track of which disks are failed/operational, one needs only to keep a bit-vector for the failed disks. Generally, it is more convenient to maintain the valid/invalid state information on a per-striping-unit rather than a per-sector basis, since most implementations will tend to reconstruct an entire striping unit of data at a time rather than a single sector. Because disk failures are relatively rare events and large groups of striping units can be invalidated at a time, updating the valid/invalid metastate information to stable storage does not present a significant performance overhead.
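As an illustration of this bookkeeping, the following C sketch keeps the per-striping-unit valid/invalid metastate as an in-memory bit-vector. The sizes, names, and the points at which the bit-vector would be forced to stable storage (shown only as comments) are assumptions made for the example, not details of any particular implementation.

```c
#include <stdint.h>
#include <stdio.h>

#define STRIPE_UNITS 1024                    /* striping units on the failed disk */
static uint8_t invalid[STRIPE_UNITS / 8];    /* 1 bit per striping unit           */

static void mark_invalid(int u) { invalid[u / 8] |=  (uint8_t)(1u << (u % 8)); }
static void mark_valid(int u)   { invalid[u / 8] &= (uint8_t)~(1u << (u % 8)); }
static int  is_invalid(int u)   { return (invalid[u / 8] >> (u % 8)) & 1; }

/* On a disk failure, every striping unit of that disk is marked invalid
 * before any request touching the disk is serviced. */
static void disk_failed(void) {
    for (int u = 0; u < STRIPE_UNITS; u++) mark_invalid(u);
    /* ...force the bit-vector to stable storage before proceeding... */
}

/* After a striping unit has been reconstructed onto a spare, it is marked
 * valid before later writes to it are serviced. */
static void unit_reconstructed(int u) {
    mark_valid(u);
    /* ...force the bit-vector to stable storage before proceeding... */
}

int main(void) {
    disk_failed();
    unit_reconstructed(42);
    printf("unit 42 invalid? %d  unit 43 invalid? %d\n",
           is_invalid(42), is_invalid(43));
    return 0;
}
```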

    3.5.2 Regenerating Parity after a System Crash

System crashes can result in inconsistent parity by interrupting write operations. Thus, unless it is known which parity sectors were being updated, all parity sectors must be regenerated whenever a disk array comes up from a system crash. This is an expensive operation that requires scanning the contents of the entire disk array. To avoid this overhead, information concerning the consistent/inconsistent state of each parity sector must be logged to stable storage. The following restrictions must be observed.

•   Before servicing any write request, the corresponding parity sectors must be marked inconsistent.

•   When bringing a system up from a system crash, all inconsistent parity sectors must be regenerated.


Note that because regenerating a consistent parity sector does no harm, it is not absolutely necessary to mark a parity sector as consistent. To avoid having to regenerate a large number of parity sectors after each crash, however, it is clearly desirable to periodically mark parity sectors as consistent.

Unlike updating valid/invalid information, the updating of consistent/inconsistent state information is a potential performance problem in software RAID systems, which usually do not have access to fast, nonvolatile storage. A simplistic implementation would require a disk write to mark a parity sector as inconsistent before each write operation and a corresponding disk write to mark the parity sector as consistent after each write operation. A more palatable solution is to maintain a most-recently-used pool that keeps track of a fixed number of inconsistent parity sectors on stable storage. By keeping a copy of the pool in main memory, one can avoid accessing stable storage to mark parity sectors that are already marked as inconsistent. By varying the size of the pool, one can trade off the hit rate of the pool against the amount of parity information that needs to be regenerated when recovering from a system crash.
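One possible shape for such a pool is sketched below in C: a fixed-size list of inconsistent parity sectors, mirrored in main memory so that marking an already-marked sector costs no access to stable storage. The pool size, the simple eviction rule, and the function names are illustrative assumptions; the stable-storage write itself is indicated only by a comment.

```c
#include <stdio.h>

#define POOL_SIZE 64              /* tradeoff: hit rate vs. recovery work */
static long pool[POOL_SIZE];      /* parity sectors currently marked
                                     inconsistent on stable storage       */
static int  pool_len = 0;

/* Mark a parity sector inconsistent.  Returns 1 if stable storage had to
 * be accessed, 0 on a hit in the in-memory copy of the pool. */
static int mark_inconsistent(long parity_sector) {
    for (int i = 0; i < pool_len; i++)
        if (pool[i] == parity_sector)
            return 0;                      /* already marked: no extra I/O */
    if (pool_len == POOL_SIZE) {
        /* Evict the oldest entry.  A real implementation must first make
         * that parity sector consistent (or verify that it already is). */
        for (int i = 1; i < POOL_SIZE; i++) pool[i - 1] = pool[i];
        pool_len--;
    }
    pool[pool_len++] = parity_sector;
    /* ...write the updated pool to stable storage here... */
    return 1;
}

int main(void) {
    printf("%d %d %d\n",
           mark_inconsistent(7),           /* 1: pool written out          */
           mark_inconsistent(7),           /* 0: hit in the in-memory copy */
           mark_inconsistent(9));          /* 1: pool written out again    */
    return 0;
}
```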

The above method should work efficiently for requests that exhibit good locality of reference. If the disk array must service a large number of random write requests, as in transaction-processing environments, we recommend incorporating a group commit mechanism so that a large number of parity sectors can be marked inconsistent with a single access to stable storage. This solves the throughput problem but results in higher latencies for random write requests, since the parity sectors must be marked inconsistent before the writes can proceed.

    3.5.3 Operating with a Failed Disk

A system crash in a block-interleaved redundant disk array is similar to a disk failure in that it can result in the loss of parity information. This means that a disk array operating with a failed disk can potentially lose data in the event of a system crash. Because system crashes are significantly more common in most systems than disk failures, operating with a failed disk can be risky.

While operating with a failed disk, a user must perform some form of logging on every write operation to prevent the loss of information in the event of a system crash. We describe two elegant methods to perform this logging. The first method, called demand reconstruction, is the easiest and most efficient but requires stand-by spare disks. With demand reconstruction, accesses to a parity stripe with an invalid sector trigger reconstruction of the appropriate data immediately onto a spare disk. Write operations, thus, never deal with invalid sectors created by disk failures. A background process scans the entire disk array to ensure that all the contents of the failed disk are eventually reconstructed within an acceptable time period.

The second method, called parity sparing [Chandy and Reddy 1993], can be applied to systems without stand-by spares but requires additional metastate information. Before servicing a write request that would access a parity stripe with an invalid sector, the invalid sector is reconstructed and relocated to overwrite its corresponding parity sector. Then the sector is marked as relocated. Since the corresponding parity stripe no longer has parity, a system crash can only affect the data being written. When the failed disk is eventually replaced, (1) the relocated sector is copied to the spare disk, (2) the parity is regenerated, and (3) the sector is no longer marked as relocated.
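The sketch below illustrates this write path in C on a single in-memory stripe. The stripe size, block size, and all names are assumptions made for the example, and the sketch ignores the stable-storage update of the relocated mark that a real array would have to perform.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define G   4                     /* blocks per parity stripe (incl. parity) */
#define BLK 512                   /* sector size in bytes                    */

static uint8_t stripe[G][BLK];    /* stripe[G-1] holds the parity block      */
static int failed_block = 1;      /* block resident on the failed disk       */
static int relocated = 0;         /* metastate: has the block been relocated? */

/* Reconstruct the block on the failed disk by XORing the survivors
 * (including the parity), write it over the parity block, and mark the
 * stripe as relocated.  Afterward the stripe carries no parity, so a
 * crash can only affect the data being written, as described above. */
static void relocate_invalid_block(void) {
    uint8_t buf[BLK] = {0};
    for (int d = 0; d < G; d++)
        if (d != failed_block)
            for (int i = 0; i < BLK; i++)
                buf[i] ^= stripe[d][i];
    memcpy(stripe[G - 1], buf, BLK);    /* data overwrites the parity block  */
    relocated = 1;                      /* must also reach stable storage    */
}

/* Service a small write to block b of this stripe. */
static void stripe_write(int b, const uint8_t *data) {
    if (!relocated)
        relocate_invalid_block();
    int target = (b == failed_block) ? G - 1 : b;   /* follow the relocation */
    memcpy(stripe[target], data, BLK);
}

int main(void) {
    uint8_t newdata[BLK];
    memset(newdata, 0xAB, BLK);
    stripe_write(1, newdata);           /* a write aimed at the failed block */
    printf("relocated=%d, first byte=0x%02X\n", relocated, stripe[G - 1][0]);
    return 0;
}
```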

    3.5.4 Orthogonal RAID

To this point, we have ignored the issue of how to connect disks to the host computer. In fact, how one does this can drastically affect performance and reliability. Most computers connect multiple disks via some smaller number of strings. Thus, a string failure causes multiple,


any other disk system. If system crashes are a significant problem, nonvolatile memory is necessary to prevent loss of data that are buffered but not yet committed. This technique may also improve throughput in two ways: (1) by giving future updates the opportunity to overwrite previous updates, thus eliminating the need to write the first update [Menon and Cortney 1993], and (2) by lengthening the queue of requests seen by a disk scheduler and allowing more efficient scheduling [Seltzer et al. 1990].

Barring these overwrites, however, this technique does nothing to improve throughput. So under high load, the write buffer space will fill more quickly than it empties, and the response time of a RAID level 5 will still be four times worse than a RAID level 0.

An extension of write buffering is to group sequential writes together. This technique can make writes to all types of disk systems faster, but it has a particular appeal for RAID level-5 disk arrays. By writing larger units, small writes can be turned into full-stripe writes, thus eliminating altogether the Achilles' heel of RAID level-5 workloads [Rosenblum and Ousterhout 1991; Menon and Cortney 1993].

Read caching is normally used in disk systems to improve the response time and throughput when reading data. In a RAID level-5 disk array, however, it can serve a secondary purpose. If the old data required for computing the new parity is in the cache, read caching reduces the number of disk accesses required for small writes from four to three. This is very likely, for example, in transaction-processing systems where records are frequently updated by reading the old value, changing it, and writing back the new value to the same location.

Also, by caching recently written parity, the read of the old parity can sometimes be eliminated, further reducing the number of disk accesses for small writes from three to two. Because parity is computed over many logically consecutive disk sectors, the caching of parity exploits both temporal and spatial locality. This is in contrast to the caching of data, which, for the purposes of reducing disk operations on small writes, relies on the assumption that recently read sectors are likely to be written rather than on the principle of spatial locality. Of course, caching parity blocks reduces the space available for caching data, which may increase the number of data misses.
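The arithmetic behind these reductions can be made explicit with a small C fragment: the new parity of a small write is the old parity XOR the old data XOR the new data, so each cache hit on the old data or the old parity removes one of the four disk accesses. The cache itself is elided; the two flags below simply stand in for cache lookups and are not taken from any described implementation.

```c
#include <stdio.h>

/* Count the disk accesses of a RAID level-5 small write, given whether the
 * old data and old parity were found in the read cache. */
static int small_write_accesses(int old_data_cached, int old_parity_cached) {
    int accesses = 2;                     /* write new data + write new parity */
    if (!old_data_cached)   accesses++;   /* read old data                     */
    if (!old_parity_cached) accesses++;   /* read old parity                   */
    return accesses;
}

int main(void) {
    printf("no cache hits:            %d accesses\n", small_write_accesses(0, 0));
    printf("old data cached:          %d accesses\n", small_write_accesses(1, 0));
    printf("old data + parity cached: %d accesses\n", small_write_accesses(1, 1));
    return 0;
}
```

Running the fragment prints 4, 3, and 2 accesses, matching the reductions described in the text.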

    4.1.2 Floating Parity

Menon et al. [1993] proposed a variation on the organization of parity in RAID level-5 disk arrays, called floating parity, that shortens the read-modify-write of parity updated by small, random writes to little more than a single disk access time on average. Floating parity clusters parity into cylinders, each containing a track of free blocks. Whenever a parity block needs to be updated, the new parity block can be written on the rotationally nearest unallocated block following the old parity block. Menon et al. show that for disks with 16 tracks per cylinder, the nearest unallocated block immediately follows the parity block being read 65% of the time, and the average number of blocks that must be skipped to get to the nearest unallocated block is small, between 0.7 and 0.8. Thus, the writing of the new parity block can usually occur immediately after the old parity block is read, making the entire read-modify-write access only about a millisecond longer than a read access.

To implement floating parity efficiently, directories for the locations of unallocated blocks and parity blocks must be stored in primary memory. These tables are about 1 MB in size for each disk array containing four to ten 500 MB disks. To exploit unallocated blocks immediately following the parity data being read, the data must be modified and a disk head switched to the track containing the unallocated block before the disk rotates through an intersector gap. Because of these constraints, and because only a disk controller can have exact knowledge of its geometry, floating parity is most likely to be implemented in the disk controller.
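A minimal sketch of the placement decision, assuming an in-memory free-block map per cylinder and an invented geometry of 128 blocks per cylinder, might look like the following in C. It only chooses where the updated parity block should land; head switching and directory updates are ignored.

```c
#include <stdio.h>

#define BLOCKS_PER_CYL 128      /* assumed geometry, not from the paper */

/* free_map[i] != 0 means block i of the cylinder is unallocated.
 * Return the rotationally nearest free block after the old parity block,
 * or -1 if the cylinder has no free block. */
static int nearest_free_after(const unsigned char *free_map, int old_parity_blk) {
    for (int skip = 1; skip <= BLOCKS_PER_CYL; skip++) {
        int candidate = (old_parity_blk + skip) % BLOCKS_PER_CYL;
        if (free_map[candidate])
            return candidate;   /* new home for the updated parity block */
    }
    return -1;
}

int main(void) {
    unsigned char free_map[BLOCKS_PER_CYL] = {0};
    free_map[7] = 1;            /* one free block a few sectors downstream */
    printf("write new parity at block %d\n", nearest_free_after(free_map, 3));
    return 0;
}
```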


Menon et al. [1993] also propose floating data as well as parity. This makes the overhead for small writes in RAID level-5 disk arrays comparable to mirroring. The main disadvantage of floating data is that logically sequential data may end up discontinuous on disk. Also, floating data requires much more free disk space than floating only the parity, since there are many more data blocks than parity blocks.

    4.1.3 Parity Logging

Stodolsky and Gibson [1993] propose an approach, called parity logging, to reduce the penalty of small writes in RAID level-5 disk arrays [Bhide and Dias 1992]. Parity logging reduces the overhead for small writes by delaying the read of the old parity and the write of the new parity. Instead of updating the parity immediately, an update image, which is the difference between the old and new parity, is temporarily written to a log. Delaying the update allows the parity to be grouped together in large contiguous blocks that can be updated more efficiently.

This delay takes place in two parts. First, the parity update image is stored temporarily in nonvolatile memory. When this memory, which could be a few tens of KB, fills up, the parity update image is written to a log region on disk. When the log fills up, the parity update image is read into memory and added to the old parity. The resulting new parity is then written to disk. Although this scheme transfers more data to and from disk, the transfers are in much larger units and are hence more efficient; large sequential disk accesses are an order of magnitude more efficient than small random accesses (Section 2.1). Parity logging reduces the small write overhead from four disk accesses to a little more than two disk accesses, the same overhead incurred by mirrored disk arrays. The costs of parity logging are the memory used for temporarily storing update images, the extra disk space used for the log of update images, and the additional memory used when applying the parity update image to the old parity. This technique can also be applied to the second copy of data in mirrored disk arrays to reduce the cost of writes in mirrored disk arrays from two to a little more than one disk access [Orji and Solworth 1993].
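The following C sketch shows the data path this describes, under simplifying assumptions: parity deltas (old data XOR new data, which is exactly the change that must eventually be applied to the parity) are appended to a small buffer standing in for nonvolatile memory, and a later flush applies all buffered deltas to an in-memory parity region in one pass. Buffer size, structure layout, and names are illustrative only, and the on-disk log region is not modeled.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define BLK 512
#define LOG_SLOTS 64                 /* "a few tens of KB" of NVRAM, assumed */
#define PARITY_BLOCKS 8

struct update_image {
    int parity_block;                /* parity block the delta applies to */
    uint8_t delta[BLK];              /* old data XOR new data             */
};

static struct update_image nvram[LOG_SLOTS];
static int nvram_len = 0;

/* Small write: overwrite the data block and log the parity delta instead
 * of reading and rewriting the parity immediately. */
static void small_write(uint8_t *data_block, const uint8_t *new_data,
                        int parity_block) {
    struct update_image *u = &nvram[nvram_len++];
    u->parity_block = parity_block;
    for (int i = 0; i < BLK; i++)
        u->delta[i] = data_block[i] ^ new_data[i];
    memcpy(data_block, new_data, BLK);
}

/* Later: apply all buffered deltas to the parity region in one large,
 * sequential pass (an in-memory array stands in for the parity on disk). */
static void apply_deltas(uint8_t parity[][BLK]) {
    for (int j = 0; j < nvram_len; j++)
        for (int i = 0; i < BLK; i++)
            parity[nvram[j].parity_block][i] ^= nvram[j].delta[i];
    nvram_len = 0;
}

int main(void) {
    static uint8_t data[BLK], parity[PARITY_BLOCKS][BLK], newval[BLK];
    memset(newval, 0x5A, BLK);
    small_write(data, newval, 3);    /* data block maps to parity block 3 */
    apply_deltas(parity);
    printf("parity[3][0] = 0x%02X\n", parity[3][0]);
    return 0;
}
```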

    4.2 Declustered Parity

Many applications, notably database and transaction processing, require both high throughput and high data availability from their storage systems. The most demanding of these applications require continuous operation: the ability to satisfy requests for data in the presence of disk failures while simultaneously reconstructing the contents of failed disks onto replacement disks. It is unacceptable to fulfill this requirement with arbitrarily degraded performance, especially in long-lived real-time applications such as video service; customers are unlikely to tolerate movies played at a slower speed or having their viewing terminated prematurely.

Unfortunately, disk failures cause large performance degradations in standard RAID level-5 disk arrays. In the worst case, a workload consisting entirely of small reads will double the effective load at the nonfailed disks due to extra disk accesses needed to reconstruct data for reads to the failed disk. The additional disk accesses needed for complete reconstruction of the failed disk increase the load even further.

In storage systems that stripe data across several RAIDs, the average increase in load is significantly less than in RAIDs with one large parity group, but the RAID with the failed disk still experiences a 100% increase in load in the worst case. The failed RAID creates a hot spot that degrades the performance of the entire system. The basic problem in these large systems is that although inter-RAID striping distributes load uniformly when no disk is failed, it nonuniformly distributes the increased load that results from a failed disk; the small set of disks in the same parity group as the


[Figure 8 appears here, showing two eight-disk layouts with blocks labeled g0 through g7 to indicate parity group membership: a standard, multiple RAID and a declustered-parity RAID.]

Figure 8. Standard versus declustered-parity RAID. This figure illustrates examples of standard and declustered-parity RAID with eight disks and a parity group size of four. Identically labeled blocks belong to the same parity group. In the standard RAID organization, parity groups are composed of disks from one of two nonoverlapping subsets of disks. In the declustered-parity RAID, parity groups span many overlapping subsets of disks.

failed disk bear the entire weight of the increased load. The declustered-parity RAID organization solves this problem by distributing the increased load uniformly over all disks [Muntz and Lui 1990; Merchant and Yu 1992; Holland and Gibson 1992; Holland et al. 1993; Ng and Mattson 1992].

Figure 8 illustrates examples of standard and declustered-parity RAIDs for systems with an array size of eight disks and a parity group size of four. In this case, a multiple-RAID system is constructed by striping data over two RAIDs of four disks each with nonoverlapping parity groups. The declustered-parity RAID is constructed by overlapping parity groups. If Disk 2 fails, each read to Disk 2 in the standard, multiple RAID generates a single disk access to Disks 0, 1, and 3 and no disk access to Disks 4, 5, 6, and 7. In the declustered-parity RAID, a random read to Disk 2 generates an access to Disks 4, 5, and 7 one-quarter of the time; to Disks 0, 1, and 3 half of the time; and to Disk 6 three-quarters of the time. Although the increased load is not uniform, it is more balanced than in the standard RAID. Slightly more complex declustered-parity RAIDs exist that distribute the load uniformly such that each read to Disk 2 generates an average of 0.429 disk accesses to all nonfailed disks.

The simplest way to create a declustered-parity RAID that distributes load uniformly is to create a set of parity groups including every possible mapping of parity group members to disks. In our example, this would result in (8 choose 4) = 70 distinct mappings of parity groups to disks. For nearly all practical array and parity group sizes, declustered-parity RAID organizations are possible that distribute reconstruction load uniformly with far fewer than the combinatorial number of parity groups. Such organizations can be devised using the theory of


balanced incomplete block designs [Hall 1986]. In practice, the load does not need to be absolutely balanced, and a close approximation is sufficient.
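A few lines of C can illustrate the simplest scheme for the running example of an 8-disk array and a parity group size of four: enumerating every 4-disk subset yields the 70 groups, and counting how often a pair of disks appears together also previews the reliability drawback discussed below (every pair of disks shares some parity group). The constants come from the example above; the code is only an illustration, not a practical layout generator.

```c
#include <stdio.h>

#define C_DISKS 8   /* disks in the array */
#define G       4   /* parity group size  */

/* Portable population count for the small bit masks used here. */
static int popcount(unsigned m) {
    int n = 0;
    while (m) { n += m & 1u; m >>= 1; }
    return n;
}

int main(void) {
    int groups = 0, shared01 = 0;
    /* Enumerate every subset of the disks; keep those of size G. */
    for (unsigned m = 0; m < (1u << C_DISKS); m++) {
        if (popcount(m) != G) continue;
        groups++;
        if ((m & 1u) && (m & 2u))      /* disks 0 and 1 both in this group */
            shared01++;
    }
    printf("distinct parity-group mappings: %d\n", groups);       /* 70 */
    printf("groups shared by disks 0 and 1: %d\n", shared01);     /* 15 */
    return 0;
}
```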

To summarize, a declustered-parity RAID is often preferable to a standard, multiple RAID because it distributes load uniformly during both the normal and failed modes of operation. This allows a more graceful degradation in performance when a disk fails and allows the failed disk to be reconstructed more quickly, since all disks in the disk array can participate in its reconstruction. Additionally, unlike the example in Figure 8, the disk array size in a declustered-parity RAID does not have to be a multiple of the parity group size. Any combination of array and parity group sizes such that the array size is greater than the parity group size is feasible. Declustered-parity RAID has two main disadvantages. First, it can be somewhat less reliable than a standard, multiple RAID; any two disk failures will result in data loss, since each pair of disks has a parity group in common. In a standard, multiple RAID, the parity groups are disjoint, so it is possible to have more than one disk failure without losing data as long as each failure is in a different parity group. Second, the more complex parity groups could disrupt the sequential placement of data across the disks. Thus, large requests are more likely to encounter disk contention in a declustered-parity RAID than in a standard, multiple RAID. In practice, it is difficult to construct workloads where this effect is significant.

    4.3 Exploiting On-Line Spare Disks

On-line spare disks allow reconstruction of failed disks to start immediately, reducing the window of vulnerability during which an additional disk failure would result in data loss. Unfortunately, they are idle most of the time and do not contribute to the normal operation of the system. This section describes two techniques, distributed sparing and parity sparing, that exploit on-line spare disks to enhance performance during the normal operation of the system.

Figure 9. Distributed sparing. Distributed sparing distributes the capacity of the spare disk throughout the array. This allows all disks, including the disk that would otherwise have been a dedicated spare, to service requests. This figure illustrates a RAID level-5 disk array with distributed sparing. The 'P's denote parity blocks, and the 'S's denote spare blocks.

As Figure 9 illustrates, distributed sparing distributes the capacity of a spare disk across all the disks in the disk array [Menon et al. 1991]. The distribution of spare capacity is similar to the distribution of parity in RAID level-5 disk arrays. Instead of N data disks and one spare disk, distributed sparing uses N + 1 data disks that each have 1/(N + 1)th spare capacity. When a disk fails, the blocks on the failed disk are reconstructed onto the corresponding spare blocks. Distributed sparing obviates dedicated spare disks, allowing all disks to participate in servicing requests, and thereby improving performance during the normal operation of the disk array. Additionally, because each disk is partially empty, each disk failure requires less work to reconstruct the contents of the failed disk. Distributed sparing has a few disadvantages. First, the reconstructed data must eventually be copied onto a permanent replacement for the failed disk. This creates extra work for the disk array, but, since the copying can be done leisurely, it does not significantly affect performance. Second, because the reconstructed data is distributed across many disks whereas it was formerly on a single disk, reconstruction disturbs the original data placement, which can be a concern for some I/O-intensive applications. In disk arrays with dedicated spares, the data placement after reconstruction is identical to the data placement before reconstruction.
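One plausible way to lay out such an array, sketched below in C, rotates the parity block and the spare block across the disks from stripe to stripe, much as RAID level 5 rotates parity. The five-disk geometry and the particular rotation rule are assumptions for the illustration and do not correspond to Figure 9 or to any specific product.

```c
#include <stdio.h>

#define DISKS   5    /* assumed width: 3 data + 1 parity + 1 spare per stripe */
#define STRIPES 5

int main(void) {
    for (int s = 0; s < STRIPES; s++) {
        int spare  = s % DISKS;             /* the spare block rotates...   */
        int parity = (s + 1) % DISKS;       /* ...and so does the parity    */
        printf("stripe %d:", s);
        for (int d = 0; d < DISKS; d++)
            printf(d == spare ? "  S" : d == parity ? "  P" : "  D");
        printf("\n");
    }
    return 0;
}
```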


Figure 10. Parity sparing. Parity sparing is similar to distributed sparing except that the spare space is used to store a second set of parity information.

Parity sparing is similar to distributed sparing, except that it uses the spare capacity to store parity information [Chandy and Reddy 1993]. As with distributed sparing, this eliminates dedicated spare disks, improving performance during normal operation. The second set of parity blocks can be used in a variety of ways. First, they can be used to partition the disk array logically into two separate disk arrays, resulting in higher reliability. In Figure 10, for example, P0a might compute the parity over blocks 1 and 2 while P0b computes the parity over blocks 3 and 4. Second, the additional parity blocks can be used to augment the original parity groups. In Figure 10, if one assumes that the parity of blocks 1, 2, 3, 4, P0a, and P0b is always zero, write operations need to update only one of P0a or P0b. This has the benefit of improving small-write performance by allowing each small write to choose the parity block it will update based on information such as the queue length and disk arm position at the two alternative disks. Third, the extra parity blocks can be used to implement P + Q redundancy. When a disk fails, the disk array converts to simple parity. By logical extension, a second disk failure would result in a RAID level-0 disk array.

Both distributed sparing and parity sparing offer interesting ways to exploit on-line spares for improved performance. They are most effective for disk arrays with a small number of disks, where the fraction of spare disks to nonspare disks is likely to be large. As disk arrays become larger, a smaller fraction of spare disks is needed to achieve the same level of reliability [Gibson 1991].


    4.4 Data Striping in Disk Arrays

Distributing data across the disk array speeds up I/Os by allowing a single I/O to transfer data in parallel from multiple disks or by allowing multiple I/Os to occur in parallel. The disk array designer must keep in mind several tradeoffs when deciding how to distribute data over the disks in the disk array to maximize performance, balancing two conflicting goals:

•   Maximize the amount of useful data that each disk transfers with each logical I/O. Typically, a disk must spend some time seeking and rotating between each logical I/O that it services. This positioning time represents wasted work; no data is transferred during this time. Hence it is beneficial to maximize the amount of useful work done in between these positioning times.

•   Utilize all disks. Idle times are similar to positioning times in that during idle times, no useful work is done. Idle times can arise in two different situations. First, hot spots can exist, where certain disks (the hot disks) are more heavily used than other disks (the cold disks) [Friedman 1993; Wilmot 1989]. Second, it is possible that all disks could be used evenly when viewed over a long period of time but not evenly at every instant. For example, if there is only one request to the disk array and that request only uses one disk, then all other disks will remain idle.

These goals are in conflict because the schemes that guarantee use of all disks spread data widely among more disks and hence cause each disk to transfer less data per logical I/O. On the other hand, schemes that maximize the amount of data transferred per logical I/O may leave some disks idle. Finding the right balance between these two goals is the main tradeoff in deciding how to distribute data among multiple disks and is heavily workload dependent.

Data striping, or interleaving, is the most common way to distribute data among multiple disks. In this scheme,

    ACM Computing Surveys, Vol. 26, No. 2, June 1994

  • 172 ● Peter M. Chen et al,

logically contiguous pieces of data are stored on each disk in turn. We refer to the size of each piece of data as the striping unit. The main design parameter in data striping is the size of this striping unit. Smaller striping units cause logical data to be spread over more disks; larger striping units cause logical data to be grouped, or clustered, together on fewer disks. Consequently, the size of the striping unit determines how many disks each logical I/O uses.

Because the interaction between workload and striping unit can have a substantial effect on the performance of a disk array with block-interleaved striping, Chen and Patterson [1990] developed rules of thumb for selecting a striping unit. Their simulation-based model evaluated a spindle-synchronized disk array of 16 disks. The stochastic workload applied to the disk array had two main parameters: average request size (varied from 4–1500 KB) and the number of concurrent, independent logical requests (varied from 1–20). Their goal was to find the size of a striping unit that gave the largest throughput for an incompletely specified workload. They found that the most important workload parameter was concurrency. When the concurrency of the workload was known, a simple formula specified a striping unit that provided 95% of the maximum throughput possible for any particular request distribution:

1 sector + 1/4 * average positioning time * data transfer rate * (concurrency - 1)

where the average positioning time is the disk's average seek time for the workload plus an average rotational delay. A striping unit selected by this expression is small when the concurrency is low, so that every access can utilize all disks, and larger when the concurrency is high, so that more different accesses can be serviced in parallel. Intuitively, the product of average positioning time and data transfer rate balances the benefits and the costs of striping data. The benefit of striping is the decreased transfer time of a single request, which saves approximately the transfer time of a stripe unit. The cost of striping is the increased disk utilization that arises from an additional disk positioning itself to access the data. The constant, 1/4, is sensitive to the number of disks in the array [Chen and Lee 1993].

If nothing is known about a workload's concurrency, Chen and Patterson [1990] found that a good compromise size for a striping unit is

2/3 * average positioning time * data transfer rate.

The constant, 2/3, is sensitive to the number of disks in the array; research needs to be done quantifying this relationship.

Lee and Katz [1991a] use an analytic model of nonredundant disk arrays to derive an equation for the optimal size of data striping. The disk array system they model is similar to that used by Chen and Patterson [1990] described above. They show that the optimal size of data striping is equal to

sqrt( P * X * (L - 1) * Z / N )

where P is the average disk positioning time, X the average disk transfer rate, L the concurrency, Z the request size, and N the array size in disks. Their results agree closely with those of Chen and Patterson. In particular, note that their equation also predicts that the optimal size of data striping depends only on the relative rate at which a disk positions and transfers data, PX, rather than on P or X individually. Lee and Katz show that the optimal striping unit depends on request size; Chen and Patterson show that this dependency can be ignored without significantly affecting performance.
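These rules of thumb are easy to evaluate numerically. The short C program below plugs in assumed disk parameters (a 15 ms average positioning time and a 2 MB/s transfer rate, chosen only for illustration; none of the numbers come from the cited studies) and prints the striping unit suggested by each expression.

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* Assumed disk and workload parameters, for illustration only. */
    double pos_time    = 0.015;         /* avg seek + rotational delay, s */
    double xfer_rate   = 2.0e6;         /* bytes per second               */
    double sector      = 512.0;         /* bytes                          */
    double concurrency = 4.0;           /* L                              */
    double request     = 64.0 * 1024;   /* Z, bytes                       */
    double disks       = 16.0;          /* N                              */

    /* Chen and Patterson [1990]: concurrency known. */
    double cp = sector + 0.25 * pos_time * xfer_rate * (concurrency - 1.0);
    /* Chen and Patterson [1990]: concurrency unknown. */
    double cp_any = (2.0 / 3.0) * pos_time * xfer_rate;
    /* Lee and Katz [1991a]: analytic optimum. */
    double lk = sqrt(pos_time * xfer_rate * (concurrency - 1.0) * request / disks);

    printf("striping unit, concurrency known:   %.0f bytes\n", cp);
    printf("striping unit, concurrency unknown: %.0f bytes\n", cp_any);
    printf("striping unit, Lee-Katz optimum:    %.0f bytes\n", lk);
    return 0;
}
```

For these assumed parameters all three expressions suggest striping units of roughly 20 KB, in line with the intuition that the product of positioning time and transfer rate sets the scale.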

Chen and Lee [1993] conducted a follow-up study to Chen and Patterson [1990] to determine the striping unit for


RAID level-5 disk arrays. Reads in a RAID level 5 are similar to reads (and writes) in a RAID level 0, causing the optimal striping unit for a read-intensive workload in a RAID level 5 to be identical to the optimal striping unit in a RAID level 0. For write-intensive workloads, however, the overhead of maintaining parity causes full-stripe writes (writes that span the entire parity group) to be more efficient than read-modify writes or reconstruct writes (writes that do not span an entire parity group). This additional factor causes the optimal striping unit for RAID level 5 to be smaller for write-intensive workloads than the striping unit for RAID level 0 by a factor of four for a 16-disk array. They also explored the relationship between the optimal striping unit and the number of disks and found that the optimal striping unit for reads varies inversely with the number of disks, but that the optimal striping unit for writes varies with the number of disks. Overall, they found that the optimal striping unit for workloads with an unspecified mix of reads and writes was independent of the number of disks and recommended (in the absence of specific workload information) that the striping unit for RAID level-5 disk arrays with any number of disks be set to

1/2 * average positioning time * data transfer rate.

Currently, researchers are investigating ways to distribute data other than a simple round-robin scheme. Some variations are: choosing a different striping unit for each file and distributing data by hashing or heat-balancing [Weikum and Zabback 1992; Scheuermann et al. 1991; Cope