
The TickerTAIP parallel RAID architecture

Pei Cao
Princeton University, Dept. of Computer Science
Princeton, NJ 08540
[email protected]

Swee Boon Lim
University of Illinois, Dept. of Computer Science
1304 W. Springfield Ave., Urbana, IL 61801
[email protected]

Shivakumar Venkataraman
University of Wisconsin, Dept. of Computer Science
1210 West Dayton St., Madison, WI 53706
[email protected]

John Wilkes
Hewlett-Packard Laboratories
P.O. Box 10490, 1U13, Palo Alto, CA 94303-0969
[email protected]

Traditional disk arrays have a centralized architecture, with a single controller through which all requests flow. Such a controller is a single point of failure, and its performance limits the maximum size that the array can grow to. We describe here TickerTAIP, a parallel architecture for disk arrays that distributes the controller functions across several loosely-coupled processors. The result is better scalability, fault tolerance, and flexibility.

This paper presents the TickerTAIP architecture and an evaluation of its behavior. We demonstrate the feasibility by an existence proof; describe a family of distributed algorithms for calculating RAID parity; discuss techniques for establishing request atomicity, sequencing and recovery; and evaluate the performance of the TickerTAIP design in both absolute terms and by comparison to a centralized RAID implementation. We conclude that the TickerTAIP architectural approach is feasible, useful, and effective.

1 Introduction

A disk array is a structure that connects several disks together to extend the cost, power and space advantages of small disks to higher capacity configurations. By providing partial redundancy such as parity, availability can be increased as well. Such RAIDs (for Redundant Arrays of Inexpensive Disks) were first described in the early 1980s [Lawlor81, Park86], and popularized by the work of a group at UC Berkeley [Patterson88, Patterson89].

The traditional architecture of a RAID array, shown in Figure 1, has a central controller, one or more disks, and multiple head-of-string disk interfaces. The RAID controller interfaces to the host, processes read and write requests, and carries out parity calculations, block placement, and data recovery after a disk failure.

Figure 1: traditional RAID array architecture (host, RAID controller, disk controllers and their disks).

The RAID controller is crucial to the performance and availability of the system. If its bandwidth, processing power, or capacity are inadequate, the performance of the array as a whole will suffer. (This is increasingly likely to happen: parity calculation is memory-bound, and memory speeds have not kept pace with recent CPU performance improvements [Ousterhout90].) Latency through the controller can reduce the performance of small requests. The single point of failure that the controller represents can also be a concern: published failure rates for disk drives and packaged electronics are now similar. Although some commercial products include spare RAID controllers, they are not normally simultaneously active: one typically acts as a backup for the other, and is held in reserve until it is needed because of failure of the primary. This is expensive: the backup has to have all the capacity of the primary controller.

To address these concerns, we have developed the TickerTAIP architecture for parallel RAIDs (Figure 2). In this architecture, there is no central controller: it has been replaced by a cooperating set of array controller nodes that together provide all the functions needed. The TickerTAIP architecture offers fault-tolerance (no central controller to break), performance scalability (no central bottleneck), smooth incremental growth (by simply adding another node), and flexibility (it is easy to mix and match components).

This paper provides an evaluation of the TickerTAIP architecture.

Figure 2: TickerTAIP array architecture (array controller nodes, each with local disks, connected by small-area network interconnects).



1.1 Related work

Many papers have been published on RAID reliability, performance, and on design variations for parity placement and recovery schemes, such as [Clark88a, Gibson88, Menon89, Schulze89, Dunphy90, Gray90, Lee90, Muntz90, Holland92]. Our work builds on these studies: we concentrate here on the architectural issues of parallelizing the techniques used in a centralized RAID array, so we take such work as a given—and assume familiarity with basic RAID concepts in what follows.

Something similar to the TickerTAIP physical architecture was realized in the HP7937 family of disks [HPdisks89]. These disks can be connected together by a 10MB/s bus, which allows access to "remote" disks as well as fast switch-over between attached hosts in the event of system failure. No multi-disk functions (such as a disk array) were provided, however.

A proposal was made to connect networks of processors to form a widely-distributed RAID controller in [Stonebraker89]. This approach was called RADD—Redundant Arrays of Distributed Disks. It proposed using disks spread across a wide-area network to improve availability in the face of a site failure. In contrast to the RADD study, we emphasize the use of parallelism inside a single RAID server; we assume the kind of fast, reliable interconnect that is easily constructed inside a single server cabinet; we closely couple processors and disks, so that a node failure is treated as (one or more) disk failures; and we provide much improved performance analyses—[Stonebraker89] used "all disk operations take 30ms". The result is a new, detailed characterization of the parallel RAID design approach in a significantly different environment.

1.2 Paper outline

We begin this paper by describing the TickerTAIP architecture, including descriptions and evaluations of algorithms for parity calculation, recovery from controller failure, and extensions to provide sequencing of concurrent requests from multiple hosts.

To evaluate TickerTAIP we constructed a working prototype as existence proof and functional testbed, and then built a detailed event-based simulation that we calibrated against the prototype. These tools are presented as background material for the performance analysis of TickerTAIP that follows, with particular emphasis on comparing it against a centralized RAID implementation. We conclude the paper with a summary of our results.

Figure 3: alternative TickerTAIP host interconnection architecture with dedicated originator nodes (worker nodes separate from originator nodes).

2 The TickerTAIP architecture

A TickerTAIP array is composed of a number of worker nodes, which are nodes with one or more local disks connected through a SCSI bus. Originator nodes provide connections to host computer clients. The nodes are connected to one another by a high-performance, small-area network with sufficient internal redundancy to survive single failures. (Mesh-based switching fabrics can achieve this with reasonable cost and complexity. A design that would meet the performance, scalability and fault-tolerance needs of a TickerTAIP array is described in [Wilkes91].)

In Figure 2, the nodes are shown as being both workers and originators: that is, they have both host and disk connections. A second design, shown in Figure 3, dedicates separate nodes to the worker and originator functions. Each node might then be of a different type (e.g., SCSI-originator, FDDI-originator, IPI-worker, SCSI-worker). This makes it easy to support a TickerTAIP array that has multiple different kinds of host interface. Since each node is plug-compatible from the point of view of the internal interconnect, it is also easy to configure an array with any desired ratio of worker and originator nodes, a flexibility less easily achieved in the traditional centralized architecture.

Figure 4 shows the environment in which we envision a TickerTAIP array operating. The array provides disk services to one or more host computers through the originator nodes. There may be several originator nodes, each connected to a different host; alternatively, a single host can be connected to multiple originators for higher performance and greater failure resilience. For simplicity, we require that all data for a request be returned to the host along the path used to issue the request.

In the context of this model, a traditional RAID array looks like a TickerTAIP array with several unintelligent worker nodes, a single originator node on which all the parity calculations take place, and shared-memory communication between the components.

3 Design issues

This section describes the TickerTAIP design issues in greater detail. It begins with an examination of normal-mode operation (i.e., fault-free) and then examines the support needed to cope with failures.

Figure 4: TickerTAIP system environment (client processes on host computers issue I/O to the TickerTAIP array through its originator nodes).

3.1 Normal-mode reads

In normal mode, read requests are straightforward since no parity computation is required: the data is read at the workers and forwarded to the originator. For multi-stripe requests, we found it beneficial to perform sequential reads of both data and parity, and then to discard the parity blocks, rather than to generate separate requests that omitted reading the parity blocks.

3.2 Normal-mode writes

In a RAID array, writes require calculation or modification of stored parity to maintain the partial data redundancy. Each stripe is considered separately in determining the method and site for parity computation.

3.2.1 How to calculate new parity

The first design choice is how to calculate the new parity. There are three alternatives, which depend on the amount of the stripe being updated (Figure 5):

● full stripe: all of the data blocks in the stripe have to be written, and parity can be calculated entirely from the new data;

● small stripe: less than half of the data blocks in a stripe are to be written, and parity is calculated by first reading the old data of the blocks which will be written, XORing them with the new data, and then XORing the results with the old parity block data;

● large stripe: more than half of the data blocks in the stripe are to be written; the new parity block can be computed by reading the data blocks in the stripe that are not being written and XORing them with the new data (i.e., reducing this to the full-stripe case) [Chen90].

Depending on its size, a single I/O request will involve one or more stripe size types. The effect on performance of enabling the large-stripe mode is discussed later.
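To make the three update rules concrete, here is a minimal C sketch of the corresponding XOR computations. It is ours, not the paper's code: block_t, BLOCK_SIZE and the calling conventions are hypothetical stand-ins for the prototype's 4KB stripe units.

```c
/* Sketch of the three parity-update rules from section 3.2.1. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096                 /* 4KB stripe unit, as in the prototype */
typedef uint8_t block_t[BLOCK_SIZE];

static void xor_into(block_t dst, const block_t src)
{
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        dst[i] ^= src[i];
}

/* Full stripe: parity is the XOR of all the new data blocks. */
void parity_full_stripe(block_t parity, const block_t *new_data, int n)
{
    memset(parity, 0, BLOCK_SIZE);
    for (int i = 0; i < n; i++)
        xor_into(parity, new_data[i]);
}

/* Large stripe: XOR the new data with the old contents of the blocks that
 * are NOT being written, reducing the problem to the full-stripe case. */
void parity_large_stripe(block_t parity, const block_t *new_data, int n_new,
                         const block_t *old_unwritten, int n_old)
{
    parity_full_stripe(parity, new_data, n_new);
    for (int i = 0; i < n_old; i++)
        xor_into(parity, old_unwritten[i]);
}

/* Small stripe: read-modify-write.  XOR each written block's old and new
 * data into the old parity to obtain the new parity. */
void parity_small_stripe(block_t parity /* in: old parity, out: new parity */,
                         const block_t *old_data, const block_t *new_data, int n)
{
    for (int i = 0; i < n; i++) {
        xor_into(parity, old_data[i]);
        xor_into(parity, new_data[i]);
    }
}
```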

3.2.2 Where to calculate new parity

The second design consideration is where the parity is to be calculated. Traditional centralized RAID architectures calculate all parity at the originator node, since only it has the necessary processing capability. In TickerTAIP, every node has a processor, so there are several choices. The key design goal is to load balance the work amongst the nodes—in particular, to spread out the parity calculations over as many nodes as possible. Here are three possibilities (shown in Figure 6):

● at originator: all parity calculations are done at the originator;

● solely-parity: all parity calculations for a stripe take place at the parity node for that stripe;

● at parity: as for solely-parity, except that partial results during a small-stripe write are calculated at the worker nodes and shipped to the parity node.

The solely-parity scheme always uses more messages than the at-parity one, so we did not pursue it further. We provide performance comparisons between the other two later in the paper.

Figure 5: the three different stripe update size policies: (a) full stripe, (b) large stripe, (c) small stripe (showing data blocks, the parity block, the data blocks being read, and the read-modify-write cycles).

Figure 6: three different places to calculate parity: (a) at originator, (b) solely parity, (c) at parity.

3.3 Single failures—request atomicity

We begin with a discussion of single-point failures. One goal of a regular RAID array is to survive single-disk failures. TickerTAIP extends the single-fault-masking semantics to include failure of a part of its distributed controller; we do not make the simplifying assumption that the controller is not a possible failure point. (We have legislated internal single-fault-tolerance for the interconnect fabric, so this is not a concern.)

3.3.1 Disk failure

Disk failures are treated in just the same way as in a traditional RAID array: the array continues operation in degraded (failed) mode until the disk is repaired or replaced; the contents of the new disk are reconstructed; and execution resumes in normal mode. From the outside of the array, the effect is as if nothing has happened. Inside, appropriate data reconstructions occur on reads, and I/O operations to the failed disk are suppressed.
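For completeness (the paper assumes familiarity with basic RAID concepts), the reconstruction performed on such a degraded-mode read is the standard parity identity: with stripe parity $P = D_1 \oplus D_2 \oplus \cdots \oplus D_n$ over the stripe's data blocks, a block lost to a failed disk is recovered from the surviving blocks as

$$D_k \;=\; P \,\oplus \bigoplus_{i \neq k} D_i .$$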

3.3.2 Worker failure

A TickerTAIP worker failure is treated just like a disk failure, and is masked in just the same way. (Just as with a regular RAID controller with multiple disks per head-of-string controller, a failing worker means that an entire column of disks is lost at once, but the same recovery algorithms apply.) We assume fail-silent nodes: the isolation offered by the networking protocols used to communicate between nodes is likely to make this assumption realistic in practice for all but the most extreme cases—for which RAID arrays are probably not appropriate choices. A node is suspected to have failed if it does not respond within a reasonable time; the node that detects this initiates a distributed consensus protocol much like two-phase commit, taking the role of coordinator of the consensus protocol. All the remaining nodes reach agreement by this means on the number and identity of the failed node(s). This protocol ensures that all the remaining nodes enter failure mode at the same time. Multiple failures cause the array to shut itself down safely to prevent possible data corruption.

3.3.3 Originator failure and request atomicity

Failure of a node with an originator on it brings new concerns: a channel to a host is lost, any worker on the same node will be lost as well, and the fate of requests that arrived through this node needs to be determined, since the failed originator was responsible for coordinating their execution.

Originator failures during reads are fairly simple: the read operation is aborted since there is no longer a route to communicate its results back to the host.

Failures during write operations are more complicated, because different portions of the write could be at different stages unless extra steps are taken to avoid compromising the consistency of the stripes involved in the write. Worst is failure of a node that is both a worker and an originator, since it will be the only one with a copy of the data destined for its own disk. (For example, if such a node fails during a partial-stripe write after some of the blocks in the stripe have been written, it may not be possible to reconstruct the state of the entire stripe, violating the rule that a single failure must be masked.)

Our solution to both these concerns is to ensure write atomicity: that is, either a write operation completes successfully or it makes no changes. To achieve this, we added a two-phase commit protocol to write operations. Before a write can proceed, sufficient data must be replicated in more than one node's memory to let the operation restart and complete—even if an arbitrary node fails. If this cannot be achieved, the request is aborted before it can make any changes.

We identified two approaches to implementing the two-phase commit: early commit tries to make the decision as quickly as possible; late commit delays its commit decision until all that is left to do is the writes.

In late commit, the commit point (when it decides whether to continue or not) is reached only after the parity has been computed, and this provides the needed partial redundancy; by this stage, all that is left to do is the writes.

In early commit, the goal is for the array to get to its commit point as quickly as possible during the execution of the request. This requires that the new data destined for the originator/worker node has to be replicated elsewhere, in case the originator fails after commit. The same must be done for old data being read as part of a large-stripe write, in case the reading node fails before it can provide the data. We duplicate this data on the parity nodes of the affected stripes—this involves sending no additional data in the case of parity calculations at the parity node (which we will see below is the preferred policy). The commit point is reached as soon as the necessary redundancy has been achieved.

Late commit is much easier to implement, but has lower concurrency and higher request latency. We explore the size of this cost later.

A write operation is aborted if any involved worker does not reach its commit point. When a worker node fails, the originator is responsible for restarting the operation. In the event of an originator failure, a temporary originator is elected to complete or abort the processing of the request, chosen from amongst those nodes that were already participating in the request, since this simplifies recovery. To implement the two-phase commit protocols, new states were added to the worker and originator node state tables. The placement of these additional states determines the commit policy: for early commit, as soon as possible; for late commit, just before the writes.
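The difference between the two policies is essentially where the commit barrier sits in a write's pipeline. The following C sketch is ours, under simplifying assumptions: the phase names and the all_workers_ready() stub are hypothetical, and the real protocol is a message exchange coordinated by the originator. It only illustrates the placement described above: early commit votes once the necessary data has been replicated, before parity computation; late commit votes when the disk writes are all that remain.

```c
/* Where the commit point falls in a write pipeline for the two policies. */
#include <stdbool.h>
#include <stdio.h>

enum commit_policy { EARLY_COMMIT, LATE_COMMIT };

enum write_phase {
    RECEIVE_AND_REPLICATE,  /* new data arrives and is replicated elsewhere  */
    READ_OLD_DATA,          /* old data/parity reads for small/large stripes */
    COMPUTE_PARITY,         /* XOR work at the stripe's parity node          */
    WRITE_DATA_AND_PARITY,  /* the only work left after a late commit        */
    N_PHASES
};

static bool all_workers_ready(void) { return true; }    /* stand-in vote */

static bool at_commit_point(enum commit_policy p, int next_phase)
{
    if (p == EARLY_COMMIT)                       /* enough state replicated to */
        return next_phase == COMPUTE_PARITY;     /* survive any single failure */
    return next_phase == WRITE_DATA_AND_PARITY;  /* parity done; only writes left */
}

static void run_write(enum commit_policy p)
{
    bool committed = false;
    for (int ph = RECEIVE_AND_REPLICATE; ph < N_PHASES; ph++) {
        if (!committed && at_commit_point(p, ph)) {
            if (!all_workers_ready()) {          /* abort: nothing written yet */
                puts("  aborted with no changes made");
                return;
            }
            committed = true;
            puts("  -- commit point --");
        }
        printf("  phase %d\n", ph);              /* do the phase's work */
    }
}

int main(void)
{
    puts("late commit:");  run_write(LATE_COMMIT);
    puts("early commit:"); run_write(EARLY_COMMIT);
    return 0;
}
```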

3.4 Multiple failures—request sequencing

This section discusses measures designed to help limit the effects of multiple concurrent failures. The RAID architecture tolerates any single disk failure. However, it provides no behavior guarantees in the event of multiple failures (especially powerfail), and it does not ensure the independence of overlapping requests that are executing simultaneously. In the terminology of [Lampson81], multiple failures are catastrophes: events outside the covered fault set for RAID. TickerTAIP introduces coverage for partial controller failures; and it goes beyond this by using request sequencing to limit the effects of multiple failures in a way that is useful to file system designers. Like a regular RAID, however, a powerfail during a write can corrupt the stripe being written to unless more extensive recovery techniques (such as logging) are used.

3.4.1 Requirements

File system designers typically rely on the presence of ordering invariants to allow them to recover from crashes or power failure. For example, in 4.2BSD-based file systems, metadata (inode and directory) writes must occur before the data to which they refer is allowed to reach the disk [McKusick84]. The simplest way to achieve this is to defer queueing the data write until the metadata write has completed. Unfortunately, this can severely limit concurrency: for example, parity calculations can no longer be overlapped with the execution of the previous request. A better way to achieve the desired invariant is to provide—and preserve—partial write orderings in the I/O subsystem. This technique can significantly improve file system performance. From our perspective as RAID array designers, it also allows the RAID array to make more intelligent decisions about request scheduling, using rotation position information that is not readily available in the host [Jacobson91].

TickerTAIP can support multiple hosts. As a result, some mechanism needs to be provided to let requests from different hosts be serialized without recourse to either sending all requests through a single host, or requiring one request to complete before the next can be issued.

Finally, multiple overlapping requests from a single or multiple hosts can be in flight simultaneously. This could lead to parts of one write replacing parts of another in a non-serializable fashion, which clearly should be prevented. (Our write commit protocols provide atomicity for each request, but no serializability guarantees.)

3.4.2 Request sequencing

To address these requirements, we introduced a request sequencing mechanism using partial orderings for both reads and writes. Internally, these are represented in the form of directed acyclic graphs (DAGs): each request is represented by a node in the DAG, while the edges of the DAG represent dependencies between requests.

To express the DAG, each request is given a unique identifier. A request is allowed to list one or more requests on which it explicitly depends; TickerTAIP guarantees that the effect is as if no request begins until the requests on which it depends complete (this allows freedom to perform eager evaluation, some of which we exploit in our current implementation). If a request is aborted, all requests that explicitly depend on it are also aborted (and so on, transitively). In addition, TickerTAIP will arbitrarily assign sufficient implicit dependencies to prevent overlapping requests from executing concurrently. Aborts are not propagated across implicit dependencies. In the absence of explicit dependencies, the order in which requests are serviced is some arbitrary serializable schedule.
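A possible shape for this per-request bookkeeping, sketched in C. The structure layout and helper names are hypothetical; only the two rules come from the paper: a request waits for every request it explicitly depends on, and an abort propagates transitively along explicit dependencies (but not along implicit ones).

```c
/* Sketch of the request-dependency DAG described in section 3.4.2. */
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPS 8

struct request {
    unsigned id;               /* unique, per-host sequential identifier    */
    unsigned deps[MAX_DEPS];   /* explicit dependencies (request ids)       */
    size_t   n_deps;
    bool     completed;
    bool     aborted;
};

/* A request may start only when every request it explicitly depends on has
 * completed (and was not aborted).  A dependency that is absent from the
 * table is treated as already satisfied; see the garbage-collection rule in
 * section 3.4.3. */
bool may_start(const struct request *r, const struct request reqs[], size_t n)
{
    for (size_t d = 0; d < r->n_deps; d++)
        for (size_t j = 0; j < n; j++)
            if (reqs[j].id == r->deps[d] &&
                (!reqs[j].completed || reqs[j].aborted))
                return false;
    return true;
}

/* Abort a request and, transitively, everything that explicitly depends on
 * it.  Implicit dependencies, added only to serialize overlapping requests,
 * do not propagate aborts and are not recorded here. */
void abort_request(unsigned id, struct request reqs[], size_t n)
{
    for (size_t j = 0; j < n; j++) {
        if (reqs[j].id == id && !reqs[j].aborted) {
            reqs[j].aborted = true;
            for (size_t k = 0; k < n; k++)
                for (size_t d = 0; d < reqs[k].n_deps; d++)
                    if (reqs[k].deps[d] == id)
                        abort_request(reqs[k].id, reqs, n);
        }
    }
}
```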

3.4.3 Sequencer states

The management of sequencing is performed through a high-level state table, with the following states (diagrammed, with their transitions, in Figure 7):

● NotIssued: the request itself has not yet reached TickerTAIP, but another request has referred to this request in its dependency list.

● Unresolved: the request has been issued, but it depends on a request that has not yet reached the TickerTAIP array.

● Resolved: all of the requests that this one depends on have arrived at the array, but at least one has yet to complete.

● InProgress: all of a request's dependencies have been satisfied, so it has begun executing.

● Completed: a request has successfully finished.

● Aborted: a request was aborted, or a request on which this request explicitly depended has been aborted.

Old request state has to be garbage collected. We do this by requiring that the hosts number their requests sequentially, and by keeping track of the oldest outstanding incomplete request from each host. When this request completes, any older completed requests can be deleted. Any request that depends on an older request than the oldest recorded one can consider the dependency immediately satisfied.

Figure 7: states of a request (NotIssued, Unresolved, Resolved, InProgress, Completed, Aborted). An "anti-dependent" is a request that this request is waiting for.

Aborts are an important exception to this mechanism, since a request that depends on an aborted request should itself be aborted, whenever it occurred. Unfortunately, there is a race condition between the request being aborted and the host issuing requests that depend on it. Our solution is to maintain state about aborted requests for a guaranteed minimum time—10 seconds in our prototype. Similarly, a time-out on the NotIssued state can be used to detect errors such as a host that never issues a request other requests are waiting for.
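The state names and the two housekeeping rules above can be summarized in a small C sketch; the identifiers are ours, and the 10-second retention value is the prototype's.

```c
/* Sketch of the sequencer's per-request states (Figure 7) and the
 * garbage-collection and abort-retention rules of section 3.4.3. */
#include <stdbool.h>

enum seq_state {
    NOT_ISSUED,    /* referenced by another request, not yet issued       */
    UNRESOLVED,    /* issued, but depends on a request not yet seen       */
    RESOLVED,      /* all anti-dependents have arrived, not all complete  */
    IN_PROGRESS,   /* every dependency satisfied, request executing       */
    COMPLETED,
    ABORTED
};

/* Hosts number their requests sequentially; the sequencer tracks, per host,
 * the oldest outstanding (incomplete) request id.  Older completed requests
 * can be forgotten, and a dependency on anything older than this id is
 * considered immediately satisfied. */
struct host_window {
    unsigned oldest_outstanding;
};

bool dep_satisfied_by_age(const struct host_window *w, unsigned dep_id)
{
    return dep_id < w->oldest_outstanding;
}

/* Aborted-request ids are remembered for a guaranteed minimum time
 * (10 seconds in the prototype) so that a late-arriving dependent request
 * can still be aborted rather than waiting forever. */
bool abort_record_may_be_dropped(double now_s, double aborted_at_s)
{
    return (now_s - aborted_at_s) > 10.0;
}
```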

3.4.4 Sequencer design alternatives

We considered four designs for the sequencer mechanism:

1. Fully centralized: a single, central sequencer manages the state table and its transitions. (A primary and a backup are used to eliminate a single point of failure.) In the absence of contention, each request suffers two additional round-trip message latency times: between the originator and the sequencer, and between the sequencer and its backup. One of these trips is unneeded if the originator is co-located with the sequencer.

2. Partially centralized: a centralized sequencer handles the state table until all the dependencies have been resolved, at which point the responsibility is shifted to the worker nodes involved in the request. This requires that the status of every request be sent to all the workers, to allow them to do the resolution of subsequent transitions. This has more concurrency, but requires a broadcast on every request completion.

3. Originator-driven: in place of a central sequencer, the originator nodes (since there will typically be fewer of these than the workers) conduct a distributed-consensus protocol to determine the overlaps and sequence constraints, after which the partially-centralized approach is used. This always generates more messages than the centralized schemes.

4. Worker-driven: the workers are responsible for all the states and their transitions. This widens the distributed-consensus protocol to every node in the array, and still requires the end-of-request broadcast.

Although the higher-numbered of the above designs potentially increase concurrency, they do so at the cost of increased message traffic and complexity.

We chose to implement the fully centralized model in our prototype, largely because of the complexity of the failure-recovery protocols required for the alternatives. As expected, we measured the resulting latency to be that of two round-trip messages (440µs) plus a few tens of microseconds of state table management. We find the additional overhead to provide request sequencing acceptable for the benefits that it provides, although we also made sequencing optional.
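As a cross-check (our observation, not stated in the paper), the 440µs figure is exactly what the prototype's message latency predicts: section 4.1 reports a 110µs one-way latency between directly-connected nodes, so

$$2\ \text{round trips} \times 2 \times 110\,\mu\text{s} = 440\,\mu\text{s},$$

to which the few tens of microseconds of state-table management are then added.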

3.5 The RAIDmap

Previous sections have presented the policy issues; this one discusses an implementation technique we found useful. Our first design for sequencing and coordinating requests retained a great deal of centralized authority: the originator tried to coordinate the actions taking place at each of the different nodes (reads and writes, parity calculations). We soon found ourselves faced with the messy problem of coping with a complex set of interdependent actions taking place on multiple remote nodes.

We then developed a new approach: rather than assume that each of the workers needs to be told what to do, assume that they would do whatever was necessary without further interaction. Once the workers are told about the original request (its starting point and length, and whether it is a read or a write) and given any data they needed, they can proceed on their own. We call this approach collaborative execution: it is characterized by each node assuming that other nodes are already doing their part of the request. For example, if node A needs data from node B, A can rely on B to generate and ship the data to A with no further prompting.

To orchestrate the work, we developed a structure known as a RAIDmap, a rectangular array with an entry for each column (worker) and each stripe. Each worker builds its column of the RAIDmap, filling in the blanks as a function of the operation (read or write), the layout policy and the execution policy. The layout policy determines where data and parity blocks are placed in the RAID array (e.g., mirroring, or RAID 4 or 5). The execution policy determines the algorithm used to service the request (e.g., where parity is to be calculated). A simplified RAIDmap is shown in Figure 8.

One component of the RAIDmap is a state table for each block (the states are described in Table 1). It is this table that the layout and execution policies fill out. A request enters the states in order, leaving each one when the associated function has been completed, or immediately if the state is marked as "not needed" for this request. For example, a read request will enter state 1 (to read the data), skip through state 2 to state 3 (to send it to the originator), and then skip through the remaining states.

The RAIDmap proved to be a flexible mechanism, allowing us to test out several different policy alternatives (e.g., whether to calculate partial parity results locally, or whether to send the data to the originator or parity nodes). In addition, the same technique is used in failure mode: the RAIDmap indicates to each node how it is to behave, but now it is filled out in such a way as to reflect the failure mode operations. Finally, the RAIDmap simplified configuring a centralized implementation using the same policies and assumptions as the distributed case.

The goal of any RAID implementation is to maximize disk utilization and minimize request latency. To help achieve this, the RAIDmap computation is overlapped with other operations such as moving data or accessing disks. For the same reason, workers send data needed elsewhere before servicing their own parity computations or local disk transfers.
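One possible shape for a single worker's RAIDmap column, sketched in C. The field names are ours; the per-block state progression mirrors Table 1 below, and the layout and execution policies described above are what fill the structure in.

```c
/* Sketch of one worker's RAIDmap column (section 3.5). */
#include <stdbool.h>
#include <stdint.h>

enum block_state {            /* per-block progression, as in Table 1 */
    READ_OLD_DATA = 1,        /* needs: disk address                  */
    XOR_OLD_WITH_NEW,
    SEND_OLD_OR_PARTIAL,      /* needs: node to send it to            */
    AWAIT_INCOMING,           /* needs: number of blocks to wait for  */
    XOR_INCOMING,
    WRITE_NEW_DATA_OR_PARITY  /* needs: disk address                  */
};

enum block_role { UNUSED, DATA, PARITY };

struct raidmap_entry {            /* one stripe row in this worker's column  */
    unsigned  stripe;
    enum block_role role;
    uint32_t  disk_addr;          /* physical block number on the local disk */
    int       send_parity_to;     /* node id, or -1 if nothing to ship       */
    int       blocks_to_await;    /* incoming partial results expected       */
    bool      state_needed[WRITE_NEW_DATA_OR_PARITY + 1]; /* skip unneeded states */
    enum block_state current;
};

/* The layout policy (e.g. RAID 5 placement) fills in role and disk_addr; the
 * execution policy (e.g. parity computed at the parity node) fills in
 * send_parity_to, blocks_to_await and which states are needed.  The request
 * then walks the states in order, skipping any marked "not needed". */
```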


Table 1: state-space for each block at a worker node.

State | Function | Other information
1 | Read old data | disk address
2 | XOR old with new data |
3 | Send old data (or XOR'd old data) to another node | node to send it to
4 | Await incoming data (or XOR'd data) | number of blocks to wait for
5 | XOR incoming data with local old data/parity |
6 | Write new data or parity | disk address

Figure 8: RAIDmap example for a write request spanning logical blocks 2 through 7, with one column per node (0–3) and one row per stripe (0–2, marked as small, full and large stripe respectively). Each tuple contains: logical and physical block number, node to send parity data to, and a block type (unused, data or parity). The rightmost column is discussed in section 3.2.1.

It also proved important to optimize the disk accesses themselves. When we delayed writes in our prototype implementation until the parity data was available, throughput increased by 25–30% because the data and parity writes were coalesced together, reducing the number of disk seeks needed.

4 Evaluating TickerTAIP

This section presents the vehicles we used to evaluate the design choices and performance of the TickerTAIP architecture.

4.1 The prototype

We first constructed a working prototype implementation of the TickerTAIP design, including all the fault tolerance features described above. The intent of this implementation was a functional testbed, to help ensure that we had made our design complete. (For example, we were able to test our fault recovery code by telling nodes to "die".) The prototype also let us measure path lengths and get some early performance data.

We implemented the designs on an array of seven storage nodes, each comprised of a Parsytec MSC card with a T800 transputer, 4MB of RAM, and a local SCSI interface, connected to a local SCSI disk drive. The disks were spin-synchronized for these experiments. A stripe unit (the block-level interleave unit) was 4KB. Each node had a local HP79560 SCSI disk with the properties shown in Table 2.

Table 2: characteristics of the HP79560 disk drive.

property | value
diameter | 5.25"
surfaces/heads | 19 data, 1 servo
formatted capacity | 1.3GB
track size | 72 sectors
sector size | 512 bytes
rotation speed | 4002 RPM
disk transfer rate | 2.2MB/s
SCSI bus transfer rate | 5MB/s
controller overhead | 1ms
track-to-track switch time | 1.67ms
seeks ≤ 12 cylinders | 1.28 + 1.15√d ms
seeks > 12 cylinders | 4.84 + 0.193√d + 0.00494d ms
(d = seek distance in cylinders)

The prototype was built in C to run on Helios [Perihelion91]: a small, lightweight operating system nucleus. We measured the one-way message latency for short messages between directly-connected nodes to be 110µs, and the peak inter-node bandwidth to be 1.6MB/s. Performance was limited by the relatively slow processors, and because the design of the Parsytec cards means that they cannot overlap computation and data transfer across their SCSI bus. Nevertheless, the prototype provided useful comparative performance data for design choices, and served as the calibration point of our simulator.

Our prototype comprised a total of 13.3k lines of code, including comments and test routines. About 12k lines of this was directly associated with the RAID functionality.

4.2 The simulator

We also built a detailed event-driven simulator using the AT&T C++ tasking library [ATT89]. This enabled us to explore the effects of changing link and processor speeds, and to experiment with larger configurations than our prototype. Our model encompassed the following components:

● Workloads: both fixed (all requests of a single type) and imitative (patterns that simulate existing workload patterns); we used a closed queueing model, and the method of batched means [Pawlikowski90] to obtain steady-state measurements.

● Hosts: a collection of workloads sharing an access port to the TickerTAIP array; disk driver path lengths were estimated from measurements made on our local HP-UX systems.

● TickerTAIP nodes (workers and originators): code path lengths were derived from measurements of the algorithms running on the working prototype and our HP-UX workstations (we assumed that the Parsytec MSC limitations would not occur in a real design).


● Disks: we modelled the HP79560 disks as used on the prototype implementation, using data taken from measurements of the real hardware. The disk model included: the seek time profile from Table 2 (a small sketch of it follows this list), plus different settling times for reads and writes; track- and cylinder-skews, including track switch times during a data transfer; rotation position; and SCSI bus and controller overheads, including overlapped data transfers from the mechanism into a disk track buffer and transmissions across the SCSI bus. The granularity used was 4KB.

● Links: represent communication channels such as the small-area network and the SCSI busses. For simplicity, we report here data from a complete point-to-point interconnect design with a DMA engine per link; however, preliminary studies suggest that similar results would be obtained from mesh-based switching fabrics.
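As referenced in the Disks item above, here is a direct transcription of Table 2's seek-time profile into C. This is our illustration only, covering just the seek component; the simulator's surrounding structure (settling, skews, rotation, SCSI overheads) is not shown.

```c
/* Seek-time profile of the prototype's disks, as given in Table 2:
 *   1.28 + 1.15*sqrt(d) ms          for seeks of at most 12 cylinders
 *   4.84 + 0.193*sqrt(d) + 0.00494*d ms  for longer seeks
 * where d is the seek distance in cylinders. */
#include <math.h>
#include <stdio.h>

double seek_time_ms(unsigned d)
{
    if (d == 0)
        return 0.0;
    if (d <= 12)
        return 1.28 + 1.15 * sqrt((double)d);
    return 4.84 + 0.193 * sqrt((double)d) + 0.00494 * (double)d;
}

int main(void)
{
    unsigned distances[] = { 1, 12, 100, 1000 };
    for (int i = 0; i < 4; i++)
        printf("seek of %4u cylinders: %.2f ms\n",
               distances[i], seek_time_ms(distances[i]));
    return 0;
}
```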

Under the same design choices and performance parameters, our simulation results agreed with the prototype (real) implementation within 3% most of the time, and always within 6%. This gave us confidence in the predictive abilities of the simulator.

The system we evaluate here is a RAID5 disk array with left-symmetric parity [Lee90], 4KB block-level interleave, spin-synchronized disks, FIFO disk scheduling, and with no data replication, spare blocks, floating parity or indirect-writes. The configuration has 4 hosts and 11 worker nodes, each worker with a single HP79560 disk attached to it via a 5MB/s SCSI bus. Four of the nodes were both originators and workers.

The throughput numbers reported here were obtained when the system was driven to saturation; the response times with only one request in the system at a time. For the throughput measurements we timed 10000 requests in each run; for latency we timed 2000 requests. Each data point on a graph represents the average of two independent runs, with all variances less than 1.5%, and most less than 0.5%. Each value in a table is the average of five such runs; the variances are reported with each table.

4.3 Read performance

Table 3 shows the performance of our simulated 11-disk array for random read requests across a range of link speeds. There is no significant difference in throughput for any link speed above 1MB/s, but 10MB/s or more is needed to minimize request latencies for the larger transfers.

4.4 Design alternatives: writes

We first consider the effect of the large-stripe policy. Figure 9 shows the result: in general, enabling the large-write policy for large stripes gives slightly better throughput, so that is what we used for the remainder of the experiments.

Table 3: read performance for fixed-size workloads, with varying link speeds. (All variances less than 2%.)

Request size | Throughput (MB/s) | Latency (ms) at 1MB/s | at 10MB/s | at 100MB/s
4KB | 0.94 | 33 | 31 | 30
40KB | 1.79 | 38 | 34 | 33
1MB | 15.2 | 178 | 86 | 76
10MB | 21.1 | 1520 | 610 | 520

Next, we compare the at-originator and at-parity policies for parity calculation. Figure 10 gives the results. At-parity works best because it spreads the work most evenly, so we used it for the remainder of our experiments. As expected, the difference is greatest for large write requests and slow processors.

Figure 9: effect of enabling the large-stripe parity computation policy for writes larger than one half the stripe (throughput and response time versus request size, with and without the policy, at 10 MIPS and 10MB/s).

The effect of the late commit protocol on performance is shown in Figure 11. (Although not shown, the performance of early commit is slightly better than that of late commit, but not as good as no commit.) The effect of the commit protocol on throughput is small (< 2%), but the effect on response time is more marked: the commit point is acting as a synchronization barrier that prevents some of the usual overlaps between disk accesses and other operations. For example, a disk that is only doing writes for a request will not start seeking until the commit message is given. (This could be avoided by having it do a seek in parallel with the other pre-commit work.)

Nonetheless, we recommend late commit overall: its throughput is almost as good as no commit protocol at all, and it is much easier to implement than early commit.

4.5 Comparison with centralized RAID array

How would TickerTAIP compare with a traditional centralized RAID array? This section answers that question. We simulated both the same 11-node TickerTAIP system as before and an 11-disk centralized RAID. The simulation components and algorithms used in the two cases were the same: our goal was to provide a direct comparison of the two architectures, as much uncontaminated by other factors as possible. For amusement, we also provide data for a 10-disk non-fault-tolerant striping array implemented using the TickerTAIP architecture.

The results for 10MIPS processors with 10MB/s links are shown in Figure 12. Clearly a non-disk bottleneck is limiting the throughput of the centralized system for request sizes larger than 256KB, and its response time for requests larger than 32KB. The obvious candidate is a CPU bottleneck from parity calculations, and this is indeed what we found. To show this, we plot performance as a function of CPU and link speed (Figure 13), and both varying together (Figure 14).

Figure 10: effect of parity calculation policy (at originator versus at parity) on throughput and response time for 1MB random writes, as a function of link speed (MB/s) and CPU speed (MIPS).

These graphs show that the TickerTAIP architecture is successfully exploiting load balancing to achieve similar (or better) throughput and response times with less powerful processors than the centralized architecture. For 1MB write requests, TickerTAIP's 5MIPS processors and 2–5MB/s links give comparable throughput to 25MIPS and 5MB/s for the centralized array. The centralized array needs a 50MIPS processor to get similar response times as TickerTAIP.

Finally, we looked at the effect of scaling the number of workers in the array, with both constant request size (400KB) and a varying one with a fixed amount of data per disk (ten full stripes). In these experiments, 4 of the worker nodes were also originators. The results are shown in Figure 15. With varying request size, the TickerTAIP architecture continues to scale linearly over the range we investigated; with constant request size, it degrades somewhat as the disks are kept less busy doing useful work.

4.6 Synthetic workloads

The results reported so far have been from fixed, constant-sized workloads. To test our hypothesis that TickerTAIP performance would scale as well over some other workloads, we tested a number of additional workload mixtures, designed to model "real world" applications:

● OLTP: based on the TPC-A database benchmark;

● timeshare: based on measurements of a local UNIX timesharing system [Ruemmler93];

● scientific: based on measurements taken from supercomputer applications running on a Cray [Miller91]; "large" has a mean request size of about 0.3MB, "small" has a mean around 30KB.

Figure 11: effect of the late commit protocol on write throughput and response time (throughput and response time versus request size, with and without commit, at 10 MIPS and 10MB/s).

Table 4 gives the throughputs of the disk arrays under these workloads for a range of processor and link speeds. As expected, TickerTAIP outperforms the centralized architecture at lower CPU speeds, although both are eventually able to drive the disks to saturation—mostly because the request sizes are quite small. TickerTAIP's availability is still higher, of course.

5 Conclusions

TickerTAIP is a new parallel architecture for RAID arrays. Our experience is that it is eminently practical (for example, our prototype implementation took only 12k lines of commented code). The TickerTAIP architecture exploits its physical redundancy to tolerate any single point of failure, including single failures in its distributed controller; is scalable across a range of system sizes, with smooth incremental growth; and the mixed worker/originator node model provides considerable configuration flexibility. Further, we showed how to provide—and prototyped—partial write ordering to support clean semantics in the face of multiple outstanding requests and multiple faults, such as power failures. We have also demonstrated that—at least in this application—eleven 5MIPS processors are just as good as a single 50MIPS one, and provided quantitative data on how central and parallel RAID array implementations compare.

Figure 12: write throughput and response time for three different array architectures (centralized, TickerTAIP, striping) as a function of request size, with 10 MIPS CPUs and 10MB/s links.

Figure 13: throughput and response time as a function of CPU and link speed, for 1MB random writes (centralized versus TickerTAIP).

Most of the performance differences result from the cost of doing parity calculations, and this turns out to be the main thing that changes with the processor speed; most of the other CPU-intensive work is hidden by disk delays. We suggest that processors are, in fact, cost-effective XOR engines—much of the cost is in the memory system, and microprocessors are commodity parts. We believe that designs using dedicated XOR hardware would exhibit similar performance and cost structures to the models presented here.

With small request sizes, it is easy for either architecture to saturate the disks. With larger ones, the difference is more marked as parity calculations become more significant. The difference is also likely to increase as multiple disks are added to each worker, and with better disk scheduling algorithms, which let each disk spend more time doing data transfers [Seltzer90, Jacobson91]. These improvements are obvious upgrade paths for TickerTAIP. And both will make its architecture even more attractive than the centralized model, as the cost of processors increases faster than linearly with their performance, and the centralized scheme already requires an expensive CPU.

Figure 14: throughput and response time as a function of both CPU and link speeds, for 1MB random writes (centralized, TickerTAIP, striping).

We commend the TickerTAIP parallel RAID architecture to future disk array implementers. We also suggest its consideration for multicomputers with locally attached disks.

Table 4: throughputs, in MB/s, of the three array architectures under different workloads. The shading highlights comparable total-MIPS configurations. (Variances are shown in parentheses after the numbers.)

Workload | Link (MB/s) | CPU (MIPS) | Central RAID | TickerTAIP | Striping
large scientific | 10 | 10 | 8.23 (4.8%) | 8.39 (3.3%) | 9.81 (2.1%)
… | | | | |

Figure 15: effect of TickerTAIP array size on performance (throughput versus number of nodes, both with write sizes scaled to the array size and with constant 400KB random write requests).


Acknowledgments

The TickerTAIP work was done as part of the DataMesh research project at Hewlett-Packard Laboratories [Wilkes92]. The implementation is based loosely on a centralized version written by Janet Wiener, and uses the SCSI disk driver developed by Chia Chao. Marjan Wilkes helped edit the paper. Federico Malucelli provided significant input into our understanding of the sequencing options described here, and Chris Ruemmler helped us improve our disk models.

Finally, why the name? Because tickertaip is used in all the best pa(rallel)raids!

References

[ATT89] AT&T. UNIX System V AT&T C++ language system release 2.0 selected readings, Select Code 307-144, 1989.

[Chen90] Peter M. Chen, Garth A. Gibson, Randy H. Katz, and David A. Patterson. An evaluation of redundant arrays of disks using an Amdahl 5890. Proc. of SIGMETRICS (Boulder, CO), pp 74–85, May 1990.

[Dunphy90] Robert H. Dunphy, Jr, Robert Walsh, and John H. Bowers. Disk drive memory. United States patent 4914656, filed 28 June 1988, granted 3 April 1990.

[Gray90] Jim Gray, Bob Horst, and Mark Walker. Parity striping of disc arrays: low-cost reliable storage with acceptable throughput. Proc. of the 16th VLDB (Brisbane, Australia), pp 148–159, August 1990.

[Holland92] Mark Holland and Garth A. Gibson. Parity declustering for continuous operation in redundant disk arrays. Proc. of 5th ASPLOS (Boston, MA). In Computer Architecture News 20 (special issue):23–35, October 1992.

[HPdisks89] Hewlett-Packard Company. HP 7936 and HP 7937 disc drives hardware support manual, part number 07937-90903, July 1989.

[Jacobson91] David M. Jacobson and John Wilkes. Disk scheduling algorithms based on rotational position. Technical report HPL-CSP-91-7. Hewlett-Packard Laboratories, 24 February 1991.

[Lampson81] B. W. Lampson and H. E. Sturgis. Atomic transactions. In Distributed Systems—Architecture and Implementation: an Advanced Course, volume 105 of Lecture Notes in Computer Science, pp 246–265. Springer-Verlag, New York, 1981.

[Lawlor81] F. D. Lawlor. Efficient mass storage parity recovery mechanism. IBM Technical Disclosure Bulletin 24(2):986–987, July 1981.

[Lee90] Edward K. Lee. Software and performance issues in the implementation of a RAID prototype. UCB/CSD 90/573. University of California at Berkeley, May 1990.

[McKusick84] Marshall K. McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems 2(3):181–197, August 1984.

[Miller91] Ethan L. Miller and Randy H. Katz. Analyzing the I/O behavior of supercomputer applications. Digest of papers, 11th IEEE Symposium on Mass Storage Systems (Monterey, CA), pp 51–59, October 1991.

[Muntz90] Richard R. Muntz and John C. S. Lui. Performance analysis of disk arrays under failure. Proc. of 16th VLDB (Brisbane, Australia), pp 162–173, August 1990.

[Ousterhout90] John K. Ousterhout. Why aren't operating systems getting faster as fast as hardware? USENIX Summer Conference (Anaheim, CA), pp 247–256, June 1990.

[Park86] Arvin Park and K. Balasubramanian. Providing fault tolerance in parallel secondary storage systems. Technical report CS-TR-057-86. Dept. of Computer Science, Princeton University, November 1986.

[Patterson88] David A. Patterson, Garth Gibson, and Randy H. Katz. A case for redundant arrays of inexpensive disks (RAID). Proc. of SIGMOD (Chicago, IL), June 1988.

[Patterson89] David A. Patterson, Peter Chen, Garth Gibson, and Randy H. Katz. Introduction to redundant arrays of inexpensive disks (RAID). Spring COMPCON 89 (San Francisco, CA), pp 112–117. IEEE, March 1989.

[Pawlikowski90] K. Pawlikowski. Steady-state simulation of queueing processes: a survey of problems and solutions. Computing Surveys 22(2):123–170, June 1990.

[Perihelion91] Perihelion Software. The Helios parallel operating system. Prentice-Hall, London, 1991.

[Rosenblum92] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems 10(1):26–52, February 1992.

[Ruemmler93] Chris Ruemmler and John Wilkes. UNIX disk access patterns. USENIX Winter Technical Conference (San Diego, CA), pp 405–420, January 1993.

[Seltzer90] Margo Seltzer, Peter Chen, and John Ousterhout. Disk scheduling revisited. USENIX Winter Technical Conference (Washington, DC), pp 313–323, January 1990.

[Stonebraker89] Michael Stonebraker. Distributed RAID - a new multiple copy algorithm. UCB/ERL M89/56. University of California at Berkeley, 15 May 1989.

[Wilkes91] John Wilkes. The DataMesh research project. Transputing '91 (Sunnyvale, CA), volume 2, pp 547–553. IOS Press, Amsterdam, April 1991.

[Wilkes92] John Wilkes. DataMesh research project, phase 1. USENIX Workshop on File Systems (Ann Arbor, MI), pp 63–96, May 1992.
