
Slide 1

ISTORE: A Platform for Scalable, Available, Maintainable Storage-Intensive Applications

Aaron Brown, David Oppenheimer, Jim Beck, Rich Martin, Randi Thomas, David Patterson, and Kathy Yelick

Computer Science Division, University of California, Berkeley

http://iram.cs.berkeley.edu/istore/

Slide 2

ISTORE Philosophy: SAM

• The ISTORE project is researching techniques for bringing scalability, availability, and maintainability (SAM) to large server systems

• ISTORE vision: a self-testing HW/SW platform that automatically reacts to situations requiring an administrative response
– brings self-maintenance to applications and storage

• ISTORE target: high-end servers for data-intensive infrastructure services
– single-purpose systems managing large amounts of data for large numbers of active network users
– e.g. TB of data, 10,000s requests/sec, millions of users

Slide 3

Motivation: Service Demands

• Emergence of a true information infrastructure
– today: e-commerce, online database services, online backup, search engines, and web servers
– tomorrow: more of the above (with ever-growing datasets), plus thin-client/PDA infrastructure support
– these services have different needs than traditionally fault-tolerant services (ATMs, telephone switches, ...)
» rapid software evolution
» unpredictable, wildly fluctuating demand and user base
» often must incorporate low-cost, off-the-shelf HW and SW components

Slide 4

Service Demands (2)

• Infrastructure users expect “always-on” service and constant quality of service
– infrastructure must provide scalable fault-tolerance and performance-tolerance
» to a rapidly growing and evolving application base
– failures and slowdowns have major business impact
» e.g., recent eBay, E*Trade, Schwab outages

Slide 5

The Need for 24x7 Availability

• Today’s widely deployed systems can’t provide 24x7 fault- and performance-tolerance
– they rely on manual administration
» static data and application partitioning
» human detection of and response to most anomalous behaviors and changes in system environment
– human administrators are too expensive, too slow, too prone to mistakes
» Jim Gray reports 42% of Tandem failures due to administrator error (in 1985)

• Tomorrow’s ever-growing infrastructure systems need to be self-maintaining
– self-maintaining systems anticipate problems and handle them as they arise, automatically

Slide 6

Self-Maintaining Systems

• Self-maintaining systems require:
– a robust platform that provides online self-testing of its hardware and software
– easy incremental scalability when existing resources stop providing desired quality of service
– rapid detection of anomalous behavior and changes in system environment
» failures, load spikes, changing access patterns, ...
– fast and flexible reaction to detected conditions
– flexible specification of conditions that trigger adaptation

• Systems deployed on the ISTORE platform will be self-maintaining

Slide 7

Target Application Model

• Scalable applications for data storage and access
– e.g., bottom (data) tier of three-tier systems

• Desired properties:
– ability to manage replicated/distributed state
» including distribution of workload across replicas
– ability to create and destroy replicas on the fly
– persistence model that can tolerate node failure without loss of data
» logging of writes, soft-state, etc.
– ability to migrate service between nodes
» e.g., checkpoint and restore, or kill and restart

– built-in application self-testing

Slide 8

Target Application Model (2)

• What existing application architectures come close to fitting this model?
– parallel shared-nothing DBMSs
» IBM DB2, Teradata, Tandem SQL/MX
– distributed server applications
» Lotus Notes/Domino
» traditional distributed filesystems/fileservers
– cluster-aware applications (with small mods?)
» LARD cluster web server (Rice)
» Microsoft Cluster Server Phase 2 (?)

• What doesn’t fit?
– simple 2-node “hot standby” failover clusters
» Microsoft Cluster Server Phase 1

Slide 9

The ISTORE Approach

• Divides self-maintenance into two components:
1) reactive self-maintenance: dynamic reaction to exceptional system events
» self-diagnosing, self-monitoring hardware
» software monitoring and problem detection
» automatic reaction to detected problems
2) proactive self-maintenance: continuous online self-testing and self-analysis
» automatic characterization of system components
» in situ fault injection, self-testing, and scrubbing to detect flaky hardware components and to exercise rarely-taken application code paths before they’re used

Slide 10

Reactive Self-Maintenance

• ISTORE defines a layered system model for monitoring and reaction:

[Diagram: layered stack — self-monitoring hardware, SW monitoring, problem detection, and coordination of reaction, provided by the ISTORE runtime system and governed by policies; reaction mechanisms, provided by the application; the runtime and application halves meet at the ISTORE API]

• ISTORE API defines interface between runtime system and app. reaction mechanisms

• Policies define system’s monitoring, detection, and reaction behavior

Slide 11

• Hardware architecture: plug-and-play intelligent devices with integrated self-monitoring, diagnostics, and fault injection hardware
– intelligence used to collect and filter monitoring data
– diagnostics and fault injection enhance robustness
– networked to create a scalable shared-nothing cluster

[Diagram: a disk plus CPU, memory, diagnostic processor, and redundant NICs form an intelligent disk “brick”; 64 bricks plug into an intelligent chassis with scalable redundant switching, power, and environmental monitoring]

Slide 12

ISTORE-II Hardware Vision

• System-on-a-chip enables computer, memory, and redundant network interfaces without significantly increasing size of disk

• Target for +5-7 years:

• 1999 IBM MicroDrive:
– 1.7” x 1.4” x 0.2” (43 mm x 36 mm x 5 mm)
– 340 MB, 5400 RPM, 5 MB/s, 15 ms seek

• 2006 MicroDrive?
– 9 GB, 50 MB/s (1.6X/yr capacity, 1.4X/yr BW)
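A quick sanity check of the extrapolation, using the growth rates given on the slide over the 7-year 1999-to-2006 horizon:

```python
# Extrapolate the 1999 MicroDrive specs at the slide's growth rates:
# 1.6X/yr for capacity, 1.4X/yr for bandwidth, over 7 years.
capacity_1999_mb = 340
bw_1999_mb_s = 5
years = 7

capacity_2006_gb = capacity_1999_mb * 1.6 ** years / 1000
bw_2006_mb_s = bw_1999_mb_s * 1.4 ** years

print(round(capacity_2006_gb, 1), round(bw_2006_mb_s, 1))  # 9.1 52.7
```

Both figures land on the slide's projected ~9 GB and ~50 MB/s.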

Slide 13

2006 ISTORE

• ISTORE node
– Add 20% pad to MicroDrive size for packaging, connectors
– Then double thickness to add IRAM
– 2.0” x 1.7” x 0.5” (51 mm x 43 mm x 13 mm)

• Crossbar switches growing by Moore’s Law
– 2X/1.5 yrs → 4X transistors/3 yrs
– Crossbars grow by N² → 2X switch size/3 yrs
– 16 x 16 in 1999 → 64 x 64 in 2005

• ISTORE rack (19” x 33” x 84”) (480 mm x 840 mm x 2130 mm)
– 1 tray (3” high) → 16 x 32 → 512 ISTORE nodes
– 20 trays + switches + UPS → 10,240 ISTORE nodes(!)
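The rack arithmetic checks out:

```python
# One 3"-high tray holds a 16 x 32 grid of nodes; a rack stacks 20 trays
# (plus switches and UPS).
nodes_per_tray = 16 * 32
nodes_per_rack = 20 * nodes_per_tray
print(nodes_per_tray, nodes_per_rack)  # 512 10240
```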

Slide 14

• Each node includes extra diagnostic support
– diagnostic processor: independent hardware running monitoring and control software
» monitors hardware and environmental state not normally visible to system software
» control:
• reboot/power-cycle main CPU
• inject simulated faults: power, bus transients, memory errors, network interface failure, ...
– separate “diagnostic network” connects the diagnostic processors of each brick
» provides independent network path to diagnostic CPU
• works when brick CPU is powered off or has failed

(Layer: self-monitoring hardware)

Slide 15

• Software collects and filters monitoring data
– hardware monitors device “health”, environmental conditions, and indicators that software is working
» some information processed locally to provide fail-fast behavior when higher-level software is deemed potentially untrustworthy
» most information passed on to software monitoring
– software monitoring layer also collects higher-level performance data, access patterns, app. heartbeats

(Layer: SW monitoring)

Slide 16

• The data is collected in a virtual “database”
– desired monitoring data is selected and aggregated by specifying “views” over the database
» database schema + views hide differences in monitoring implementation on heterogeneous HW and SW

• Running example
– If ambient temperature of a shelf is rising significantly faster than that of other shelves,
» reduce power consumption on those nodes, then
» if necessary, migrate non-redundant data replicas off some nodes on that shelf and shut them down
– view: for each shelf, average temperature across all temperature sensors on that shelf

(Layer: SW monitoring)
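As a concrete sketch, the running example's view (per-shelf average temperature) might be computed like this; the data layout and function name are illustrative assumptions, not the actual ISTORE schema:

```python
# Sketch of the running-example "view": for each shelf, average the readings
# of that shelf's temperature sensors.
from collections import defaultdict

def shelf_avg_temperature(sensor_readings):
    """sensor_readings: iterable of (shelf_id, sensor_id, temp_celsius)."""
    totals = defaultdict(lambda: [0.0, 0])  # shelf_id -> [sum, count]
    for shelf, _sensor, temp in sensor_readings:
        totals[shelf][0] += temp
        totals[shelf][1] += 1
    return {shelf: s / n for shelf, (s, n) in totals.items()}

readings = [("A", 1, 24.0), ("A", 2, 26.0), ("B", 1, 22.0)]
print(shelf_avg_temperature(readings))  # {'A': 25.0, 'B': 22.0}
```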

Slide 17

• Conditions requiring administrative response are detected by observing values and/or patterns in the monitoring data
– triggers specify these patterns and invoke appropriate adaptation algorithms
» input to a trigger is a view of the monitoring data
» views and triggers can be specified separately to allow
• easy selection of desired reaction algorithm
• easy redefinition of conditions that invoke a particular reaction

• Running example
– trigger: change in temperature of one shelf > 0 and more than twice the change in temperature of any other shelf, averaged over a one-minute period

(Layer: problem detection)
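The running-example trigger can be sketched as a predicate over per-shelf temperature changes; the representation and names are assumed for illustration:

```python
# Fire when one shelf's temperature change is positive and more than twice
# the change of every other shelf, over the one-minute averaging period.
def trigger_fires(delta_by_shelf):
    """delta_by_shelf: dict of shelf_id -> temperature change over one minute."""
    for shelf, delta in delta_by_shelf.items():
        others = [d for s, d in delta_by_shelf.items() if s != shelf]
        if delta > 0 and others and all(delta > 2 * d for d in others):
            return shelf  # this shelf is heating anomalously fast
    return None

print(trigger_fires({"A": 3.0, "B": 0.5, "C": 0.4}))  # A
```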

Slide 18

• Adaptation algorithms coordinate application-level reaction mechanisms
– adaptation algorithms define a sequence of operations that address the anomaly detected by the associated trigger
– adaptation algorithms call application-implemented mechanisms via a standard API
» but are independent of application mechanism details

• Running example: coordination of reaction
1) identify nodes with non-redundant data
2) invoke application mechanism to migrate that data off n of those nodes
3) reduce power consumption by those n nodes
4) install trigger to monitor temperature change and shut down nodes if power reduction is ineffective

(Layer: coordination of reaction)
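The four coordination steps above can be sketched against stub objects; every class and method name here is an illustrative assumption, not the real ISTORE API:

```python
# Stubs standing in for the application (reaction mechanisms) and the
# ISTORE runtime; all names are hypothetical.
class StubApp:
    def __init__(self, non_redundant):
        self.non_redundant = set(non_redundant)
        self.migrated = []
    def has_non_redundant_data(self, node):
        return node in self.non_redundant
    def migrate_data_off(self, node):
        self.migrated.append(node)

class StubRuntime:
    def __init__(self, shelf_nodes):
        self.shelf_nodes = shelf_nodes
        self.powered_down = []
        self.triggers = []
    def nodes_on_shelf(self, shelf):
        return self.shelf_nodes[shelf]
    def reduce_power(self, node):
        self.powered_down.append(node)
    def install_trigger(self, condition, action):
        self.triggers.append((condition, action))

def overheating_shelf_reaction(app, runtime, shelf, n):
    # 1) identify nodes on the hot shelf holding non-redundant data
    at_risk = [m for m in runtime.nodes_on_shelf(shelf)
               if app.has_non_redundant_data(m)]
    for node in at_risk[:n]:
        app.migrate_data_off(node)   # 2) move data to safety first
        runtime.reduce_power(node)   # 3) then cut power on that node
    # 4) install a follow-up trigger to shut nodes down if heating continues
    runtime.install_trigger(condition="shelf temp still rising",
                            action="shut down the n nodes")

app = StubApp(non_redundant={"n1", "n3"})
rt = StubRuntime({"shelfA": ["n1", "n2", "n3"]})
overheating_shelf_reaction(app, rt, "shelfA", n=1)
print(app.migrated, rt.powered_down)  # ['n1'] ['n1']
```

Note the ordering: data migration precedes the power reduction, so no non-redundant replica is ever on a node being powered down.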

Slide 19

• ISTORE expects reaction mechanisms to be implemented by the application
– these reaction mechanisms are application-specific
» e.g., moving data requires knowledge of data semantics, consistency policies, ...
– a research goal of ISTORE is to provide a standard API to these mechanisms
» initially, try to leverage and extend existing mechanisms to avoid wholesale rewriting of applications
• many data-intensive applications already support functionality similar to the needed mechanisms
» eventually, generalize and extend the API to encompass mechanisms and needs of future applications

(Layer: reaction mechanisms)

Slide 20

• Programmer or administrator specifies policies to control the system’s adaptive behavior
– the policy compiler turns a high-level declarative specification of desired behavior into the appropriate:
» adaptation algorithms (that invoke application mechanisms through the ISTORE API)
» triggers (to invoke the adaptation algorithms when the appropriate conditions are detected)
» views (that enable monitoring needed by the triggers)

• Running example
– policy: if ambient temperature of a shelf is rising significantly faster than that of other shelves, reduce power and prepare to shut down nodes

(Policies)
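A minimal sketch of what the policy compiler's output might look like for the running example, with the mapping hard-wired rather than parsed; all names and strings are illustrative, not the actual ISTORE policy language:

```python
# One declarative policy decomposed into the three artifacts the slide names:
# a view, a trigger, and an adaptation algorithm.
POLICY = ("if a shelf's ambient temperature rises significantly faster than "
          "other shelves, reduce power and prepare to shut down nodes")

def compile_policy(policy_text):
    # A real compiler would parse the declarative spec; this stub just
    # returns the three artifacts for the running example.
    return {
        "view": "per-shelf average of all temperature sensors",
        "trigger": "shelf delta-T > 0 and > 2x any other shelf (1-min avg)",
        "adaptation": "migrate non-redundant data, reduce power, arm shutdown",
    }

artifacts = compile_policy(POLICY)
print(sorted(artifacts))  # ['adaptation', 'trigger', 'view']
```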

Slide 21

Summary: Layered System Model

• Layered system model for monitoring and reaction provides reactive self-maintenance

[Diagram repeated from Slide 10: self-monitoring hardware, SW monitoring, problem detection, and coordination of reaction, provided by the ISTORE runtime system and governed by policies; reaction mechanisms, provided by the application; the two halves meet at the ISTORE API]

• Self-maintenance in ISTORE also consists of proactive, continuous self-testing and analysis

Slide 22

The ISTORE Approach

• Divides self-maintenance into two components:
1) reactive self-maintenance: dynamic reaction to exceptional system events
» self-diagnosing, self-monitoring hardware
» software monitoring and problem detection
» automatic reaction to detected problems
2) proactive self-maintenance: continuous online self-testing and self-analysis
» in situ fault injection, self-testing, and scrubbing to detect flaky hardware components and to exercise rarely-taken application code paths before they’re used
» automatic characterization of system components

Slide 23

Continuous Online Self-Testing

• Self-maintaining systems should automatically carry out preventative maintenance
– need aggressive in situ component testing via
» fault injection: triggering hardware and software error handling paths to verify their integrity/existence
» stress testing: pushing HW/SW components past normal operating parameters
» scrubbing: periodic restoration of potentially “decaying” hardware or software state

• ISTORE periodically isolates nodes from the system and performs extensive self-tests
– nodes can be easily isolated due to ISTORE’s built-in redundancy
» even in a deployed, running system

Slide 24

Self-Testing: Hardware

• The goal of hardware self-testing is to detect flaky components and preserve data integrity

• Examples:
– fault injection: power-cycle disk to check for stiction
– stress testing: run disk controller at 100% utilization to test behavior under load
– scrubbing: read all disk sectors and rewrite any that suffer soft errors; “fire” the disk if too many errors occur
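The scrubbing example can be sketched as follows; the `Disk` class is a stand-in for a real device interface, and the firing threshold is an assumed parameter:

```python
# Read every sector, rewrite those with correctable (soft) errors, and
# retire ("fire") the disk if the error count crosses a threshold.
class Disk:
    def __init__(self, sectors):
        self.sectors = sectors          # list of "ok" | "soft_error" flags
    def read(self, i):
        return self.sectors[i]
    def rewrite(self, i):
        self.sectors[i] = "ok"          # rewriting refreshes the sector

def scrub(disk, fire_threshold=3):
    errors = 0
    for i in range(len(disk.sectors)):
        if disk.read(i) == "soft_error":
            errors += 1
            disk.rewrite(i)
    return "fire" if errors >= fire_threshold else "keep"

d = Disk(["ok", "soft_error", "ok", "soft_error"])
print(scrub(d), d.sectors)  # keep ['ok', 'ok', 'ok', 'ok']
```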

Slide 25

Self-Testing: Software

• Software self-testing proactively identifies weaknesses in software before they cause a visible failure
– helps prevent failure due to bugs that only appear in certain hardware/software configurations
– helps identify bugs that occur when software is driven into an untested state only reachable in a live system
» e.g., long uptimes, heavy load, unexpected requests

• Examples
– fault injection (includes HW- and SW-induced faults that the SW is expected to handle): SCSI parity error, invalid return codes from operating system
– stress testing: heavy load, pathological requests
– scrubbing: restart/reboot long-running software
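The software fault-injection idea can be sketched as a wrapper that forces an error return so the caller's error-handling path is exercised; all function names here are assumptions, not ISTORE code:

```python
# Wrap an OS-level call so a test harness can force an I/O error and verify
# the caller's error-handling path actually exists and degrades gracefully.
import errno

INJECT_FAULT = {"enabled": False}

def read_block(path):
    if INJECT_FAULT["enabled"]:
        raise OSError(errno.EIO, "injected I/O error")  # simulated fault
    with open(path, "rb") as f:
        return f.read()

def safe_read(path):
    try:
        return read_block(path)
    except OSError:
        return b""  # error path under test: degrade gracefully

INJECT_FAULT["enabled"] = True
print(safe_read("/some/path"))  # b''  (error path exercised, no crash)
```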

Slide 26

Online Self-Analysis

• Self-maintaining systems require knowledge of their components’ dynamic runtime behavior
– current “plug-and-play” hardware approaches are not sufficient
» need more than just discovery of new devices’ functional capabilities and supported APIs
– also need dynamic component characterization

Slide 27

Characterizing HW/SW Behavior

• An ISTORE may contain black-box components
– heterogeneous hardware devices
– application-supplied reaction mechanisms whose implementations are hidden

• To select and tune adaptation algorithms, the ISTORE system needs to understand the behavior of these components
– in the context of a complex, live system
– examples:
» characterize performance of disks in system, use that data to select destination disks for replica creation
» isolate two nodes, invoke replication from one to the other, monitor actions taken by application (e.g., how long it takes, how much data is moved)
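The first characterization example might look like this in outline; the bandwidth figures are made-up stand-ins for benchmark measurements:

```python
# Rank disks by measured bandwidth and pick the fastest k as replica
# destinations, as in the slide's disk-characterization example.
def pick_replica_destinations(bandwidth_mb_s, k):
    """bandwidth_mb_s: dict disk_id -> measured MB/s; return the top k."""
    ranked = sorted(bandwidth_mb_s, key=bandwidth_mb_s.get, reverse=True)
    return ranked[:k]

measured = {"disk0": 38.5, "disk1": 51.2, "disk2": 44.0}
print(pick_replica_destinations(measured, 2))  # ['disk1', 'disk2']
```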

Slide 28

Support for Application Self-tuning

• ISTORE’s characterization mechanisms can also help applications tune themselves
– current systems require manual tuning to meet scalability and performance goals
» especially true for shared-nothing systems in which computational and storage resources aren’t pooled
– a possible research direction is to expose characterization information to the application via an extension of the ISTORE API
– this would allow “aware” applications to automatically adapt their behavior based on system conditions

Slide 29

ISTORE API

• The ISTORE API defines interfaces for
– adaptation algorithms to invoke application reaction mechanisms
» e.g., migrate data, replicate data, checkpoint, shutdown, ...
– applications to provide hints to the runtime system so it can optimize adaptation algorithms & data storage
» e.g., application tags data whose unavailability can be temporarily tolerated
– the runtime system to invoke application self-testing and fault injection, and for the application to report results
– the runtime system to inform the application about current state of system, hardware capabilities, ...
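Based only on the operations this slide lists, the application-facing half of the API could be sketched as an abstract interface; the method names and signatures are assumptions for illustration, not the actual ISTORE API:

```python
# A hypothetical application-side interface the ISTORE runtime would call.
from abc import ABC, abstractmethod

class ReactionMechanisms(ABC):
    """Interface an application implements for the runtime to invoke."""
    @abstractmethod
    def migrate_data(self, src_node, dst_node): ...
    @abstractmethod
    def replicate_data(self, src_node, dst_node): ...
    @abstractmethod
    def checkpoint(self, node): ...
    @abstractmethod
    def shutdown(self, node): ...
    @abstractmethod
    def run_self_test(self, node):
        """Invoked by the runtime; returns a result the app reports back."""

class TolerantApp(ReactionMechanisms):
    def migrate_data(self, s, d):   return f"moved {s}->{d}"
    def replicate_data(self, s, d): return f"copied {s}->{d}"
    def checkpoint(self, n):        return f"checkpointed {n}"
    def shutdown(self, n):          return f"stopped {n}"
    def run_self_test(self, n):     return "pass"

print(TolerantApp().migrate_data("n1", "n2"))  # moved n1->n2
```

An abstract base class makes the contract explicit: the runtime stays independent of mechanism details, as the earlier slides require.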

Slide 30

Summary

• ISTORE focuses on Scalability, Availability, and Maintainability for emerging data-intensive network applications

• ISTORE provides a platform for deploying self-maintaining systems that are up 24x7

• ISTORE will achieve self-maintenance via:
– a hardware platform with integrated diagnostic support
– reactive self-maintenance: a layered, policy-driven runtime system that provides a framework for monitoring and reaction
– proactive self-maintenance: support for continuous online self-testing and component characterization
– and a standard API for interfacing applications to the runtime system