
Pond: The OceanStore Prototype

Introduction

Problem: Rising cost of storage management

Observations:
• Universal connectivity via Internet
• $100 terabyte storage within three years

Solution: OceanStore

OceanStore

• Internet-scale
• Cooperative file system
• High durability
• Universal availability
• Two-tier storage system
  • Upper tier: powerful servers
  • Lower tier: less powerful hosts

More on OceanStore

• Unit of storage: data object
• Applications: email, UNIX file system
• Requirements for the object interface
  • Information universally accessible
  • Balance between privacy and sharing
  • Simple and usable consistency model
  • Data integrity

OceanStore Assumptions

• Infrastructure untrusted except in aggregate
  • Most nodes are not faulty or malicious
• Infrastructure constantly changing
  • Resources enter and exit the network without prior warning
• Self-organizing, self-repairing, self-tuning

OceanStore Challenges

• Expressive storage interface
• High durability on an untrusted and changing base

Data Model

The view of the system that is presented to client applications

Storage Organization

• OceanStore data object ~= file
• Ordered sequence of read-only versions
  • Every version of every object kept forever
  • Can be used as backup
• An object contains metadata, data, and references to previous versions

Storage Organization

• A stream of versions identified by an AGUID
  • Active globally-unique identifier
  • Cryptographically-secure hash of an application-specific name and the owner's public key (see the sketch below)
  • Prevents namespace collisions
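A minimal sketch of that derivation, assuming SHA-1 as the secure hash; the class name and encoding choices are illustrative, not Pond's actual code:

    import java.nio.charset.StandardCharsets;
    import java.security.KeyPairGenerator;
    import java.security.MessageDigest;
    import java.security.PublicKey;

    // Illustrative AGUID derivation: hash(name || owner's public key).
    public class AguidSketch {
        static byte[] aguid(String appSpecificName, PublicKey ownerKey) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            md.update(appSpecificName.getBytes(StandardCharsets.UTF_8));
            md.update(ownerKey.getEncoded());   // owner's public key bytes
            return md.digest();                 // 160-bit AGUID
        }

        public static void main(String[] args) throws Exception {
            KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
            PublicKey owner = kpg.generateKeyPair().getPublic();
            System.out.println(aguid("/home/alice/mbox", owner).length + "-byte AGUID");
        }
    }

Because the owner's key is part of the hash, two owners using the same application-specific name still get different AGUIDs, which is what prevents namespace collisions.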

Storage Organization

• Each version of a data object is stored in a B-tree-like data structure
• Each block has a BGUID
  • Cryptographically-secure hash of the block content
• Each version has a VGUID
• Two versions may share blocks (illustrated below)
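Because blocks are named by a secure hash of their content, two versions that contain an identical block automatically refer to one stored copy of it. A small illustration (names are illustrative, not Pond's data structures):

    import java.security.MessageDigest;
    import java.util.Base64;
    import java.util.List;

    // Content-addressed blocks: versions that share a block share its BGUID.
    public class VersionSketch {
        static String bguid(byte[] block) throws Exception {
            byte[] h = MessageDigest.getInstance("SHA-1").digest(block);
            return Base64.getEncoder().encodeToString(h);
        }

        public static void main(String[] args) throws Exception {
            byte[] a = "block A".getBytes(), b = "block B".getBytes();
            List<String> v1 = List.of(bguid(a), bguid(b));     // version 1
            byte[] b2 = "block B, modified".getBytes();
            List<String> v2 = List.of(bguid(a), bguid(b2));    // version 2
            // The two versions share the unmodified block's BGUID:
            System.out.println(v1.get(0).equals(v2.get(0)));   // true
        }
    }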

Application-Specific Consistency

• An update is the operation of adding a new version to the head of a version stream
• Updates are applied atomically
  • Represented as an array of potential actions
  • Each guarded by a predicate

Application-Specific Consistency

• Example actions
  • Replacing some bytes
  • Appending new data to an object
  • Truncating an object
• Example predicates (see the sketch after this list)
  • Check for the latest version number
  • Compare bytes
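A minimal sketch of a predicate-guarded update in this spirit, with hypothetical types; Pond's real update format is richer:

    import java.util.List;
    import java.util.Optional;
    import java.util.function.Predicate;
    import java.util.function.UnaryOperator;

    // An update is a list of actions, each guarded by a predicate over the
    // current version. The first action whose guard holds is applied.
    public class GuardedUpdate {
        record Version(long number, byte[] data) {}
        record Action(Predicate<Version> guard, UnaryOperator<Version> action) {}

        static Optional<Version> applyUpdate(Version current, List<Action> update) {
            for (Action a : update)
                if (a.guard().test(current))
                    return Optional.of(a.action().apply(current));
            return Optional.empty();   // every guard failed: the update aborts
        }

        public static void main(String[] args) {
            Version v5 = new Version(5, "hello".getBytes());
            // "Append, but only if the object is still at version 5":
            List<Action> update = List.of(new Action(
                v -> v.number() == 5,
                v -> new Version(v.number() + 1,
                                 (new String(v.data()) + ", world").getBytes())));
            System.out.println(applyUpdate(v5, update).isPresent());  // true: guard held
        }
    }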

Application-Specific Consistency

• To implement ACID semantics
  • Check for readers
  • If none, update
• Append to a mailbox
  • No checking
• No explicit locks or leases

Application-Specific Consistency

• Predicates for reads
• Examples
  • Can't read something older than 30 seconds
  • Can only read data from a specific time frame

System Architecture

• Unit of synchronization: data object
  • Changes to different objects are independent

Virtualization through Tapestry

• Resources are virtual and not tied to particular hardware
• A virtual resource has a GUID, a globally unique identifier
• Use Tapestry, a decentralized object location and routing system
  • Scalable overlay network, built on TCP/IP

Virtualization through Tapestry

• Use GUIDs to address hosts and resources
• Hosts publish the GUIDs of their resources in Tapestry
• Hosts can also unpublish GUIDs and leave the network (interface sketched below)
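A hypothetical interface capturing the publish/unpublish/locate operations described here (method names are illustrative, not Tapestry's actual API), plus a trivial in-memory stand-in so the flow can be exercised:

    import java.util.*;

    public class TapestrySketch {
        interface ObjectLocation {
            void publish(String guid, String host);    // advertise a resource
            void unpublish(String guid, String host);  // withdraw before leaving
            Optional<String> locate(String guid);      // find some host serving guid
        }

        static class InMemoryLocation implements ObjectLocation {
            private final Map<String, Set<String>> hosts = new HashMap<>();
            public void publish(String guid, String host) {
                hosts.computeIfAbsent(guid, g -> new HashSet<>()).add(host);
            }
            public void unpublish(String guid, String host) {
                Set<String> s = hosts.get(guid);
                if (s != null) s.remove(host);
            }
            public Optional<String> locate(String guid) {
                return hosts.getOrDefault(guid, Set.of()).stream().findAny();
            }
        }

        public static void main(String[] args) {
            ObjectLocation tapestry = new InMemoryLocation();
            tapestry.publish("bguid-1234", "hostA");            // hostA serves the block
            System.out.println(tapestry.locate("bguid-1234"));  // Optional[hostA]
            tapestry.unpublish("bguid-1234", "hostA");          // hostA leaves
            System.out.println(tapestry.locate("bguid-1234"));  // Optional.empty
        }
    }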

Replication and Consistency

• A data object is a sequence of read-only versions, consisting of read-only blocks named by BGUIDs
  • No issues for replication
• The mapping from AGUID to the latest VGUID may change
  • Use primary-copy replication

Replication and Consistency

• The primary copy
  • Enforces access control
  • Serializes concurrent updates

Archival Storage

• Replication: 2x storage to tolerate one failure
• Erasure coding is much better
  • A block is divided into m fragments
  • The m fragments are encoded into n > m fragments
  • Any m fragments can restore the original block (see the sketch below)
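The recovery property can be shown with the simplest possible erasure code: m data fragments plus one XOR parity fragment, so n = m + 1 and any m of the n fragments suffice. Pond's archive uses a Reed-Solomon-style code with a larger n; this sketch only illustrates the m-of-n idea:

    import java.util.Arrays;

    // Single-parity erasure code: tolerates the loss of any one fragment.
    public class XorErasure {
        // Split block into m fragments and append their XOR as fragment m.
        static byte[][] encode(byte[] block, int m) {
            int flen = block.length / m;          // assumes m divides block length
            byte[][] frags = new byte[m + 1][flen];
            for (int i = 0; i < m; i++)
                frags[i] = Arrays.copyOfRange(block, i * flen, (i + 1) * flen);
            for (int i = 0; i < m; i++)
                for (int j = 0; j < flen; j++)
                    frags[m][j] ^= frags[i][j];   // parity = XOR of all data fragments
            return frags;
        }

        // Decode with at most one missing fragment (a null entry).
        static byte[] decode(byte[][] frags, int m) {
            int flen = frags[m] == null ? frags[0].length : frags[m].length;
            byte[] block = new byte[m * flen];
            for (int i = 0; i < m; i++) {
                byte[] f = frags[i];
                if (f == null) {                  // rebuild from parity + the others
                    f = frags[m].clone();
                    for (int k = 0; k < m; k++)
                        if (k != i)
                            for (int j = 0; j < flen; j++)
                                f[j] ^= frags[k][j];
                }
                System.arraycopy(f, 0, block, i * flen, flen);
            }
            return block;
        }

        public static void main(String[] args) {
            byte[][] frags = encode("sixteen byte blk".getBytes(), 4);
            frags[2] = null;                      // lose one fragment
            System.out.println(new String(decode(frags, 4)));  // original block
        }
    }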

Caching of Data Objects

• Reconstructing a block from its erasure-coded fragments is an expensive process
  • Need to locate m fragments from m machines
• Use whole-block caching for frequently-read objects

Caching of Data Objects

• To read a block, look for a cached copy of the block first
• If not available (flow sketched below):
  • Find the block's fragments
  • Decode the fragments
  • Publish that the host now caches the block
• Amortizes the cost of erasure encoding/decoding
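A self-contained sketch of that read path, with in-memory maps standing in for Tapestry lookup and the fragment archive (all names hypothetical):

    import java.util.*;

    public class ReadPathSketch {
        final Map<String, byte[]> blockCache = new HashMap<>();  // whole-block copies
        final Map<String, byte[][]> archive = new HashMap<>();   // erasure fragments

        byte[] readBlock(String bguid) {
            byte[] block = blockCache.get(bguid);    // 1. look for the block itself
            if (block == null) {                     // 2. not cached anywhere:
                byte[][] frags = archive.get(bguid); //    locate the m fragments
                block = decode(frags);               //    expensive reconstruction
                blockCache.put(bguid, block);        //    "publish" the cached copy
            }
            return block;                            // later reads skip decoding
        }

        // Placeholder decoder: a real one would run the erasure decode shown earlier.
        byte[] decode(byte[][] frags) {
            int len = 0;
            for (byte[] f : frags) len += f.length;
            byte[] out = new byte[len];
            int off = 0;
            for (byte[] f : frags) { System.arraycopy(f, 0, out, off, f.length); off += f.length; }
            return out;
        }

        public static void main(String[] args) {
            ReadPathSketch node = new ReadPathSketch();
            node.archive.put("b1", new byte[][]{ "hel".getBytes(), "lo!".getBytes() });
            System.out.println(new String(node.readBlock("b1")));  // decodes once
            System.out.println(new String(node.readBlock("b1")));  // served from cache
        }
    }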

Caching of Data Objects

• Updates are pushed to secondary replicas via an application-level multicast tree

The Full Update Path

• Serialized updates are disseminated via the multicast tree for an object
• At the same time, updates are encoded and fragmented for long-term storage

The Primary Replica

Primary servers run a Byzantine agreement protocol
• Needs more than 2/3 of the participants to be nonfaulty
• The number of messages required grows quadratically in the number of participants (worked out below)
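A quick sizing sketch: the standard Byzantine agreement bound requires n = 3f + 1 servers to tolerate f faults, and agreement traffic grows roughly with n^2 (the quadratic growth the slide refers to, not an exact protocol cost):

    public class BftSizing {
        public static void main(String[] args) {
            for (int f = 1; f <= 4; f++) {
                int n = 3 * f + 1;   // more than 2/3 of n must be nonfaulty
                System.out.printf("f=%d faults -> n=%d servers, ~%d messages per round%n",
                                  f, n, n * n);
            }
            // Pond's experimental inner ring: n = 4, tolerating f = 1 fault.
        }
    }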

Public-Key Cryptography

• Public-key signatures are too expensive
• Use symmetric-key message authentication codes (MACs) instead
  • Two to three orders of magnitude faster
  • Downside: can't prove the authenticity of a message to a third party
  • Used only for the inner ring
• Public-key cryptography for the outer ring (contrast sketched below)
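A sketch of the contrast using the standard Java crypto APIs; the algorithm names here are illustrative choices, not necessarily the ones Pond uses:

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;
    import java.security.*;

    public class MacVsSignature {
        public static void main(String[] args) throws Exception {
            byte[] msg = "update #42".getBytes();

            // Symmetric MAC: fast, but only holders of the shared key can verify,
            // so it cannot convince a third party (as the slide notes).
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec("shared-ring-key".getBytes(), "HmacSHA1"));
            byte[] tag = mac.doFinal(msg);

            // Public-key signature: far slower, but anyone with the public key
            // can verify, which is what the outer ring needs.
            KeyPair kp = KeyPairGenerator.getInstance("RSA").generateKeyPair();
            Signature signer = Signature.getInstance("SHA1withRSA");
            signer.initSign(kp.getPrivate());
            signer.update(msg);
            byte[] sig = signer.sign();

            Signature verifier = Signature.getInstance("SHA1withRSA");
            verifier.initVerify(kp.getPublic());
            verifier.update(msg);
            System.out.println("MAC " + tag.length + " bytes, signature valid: "
                               + verifier.verify(sig));
        }
    }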

Proactive Threshold Signatures

• Byzantine agreement guarantees correctness if no more than 1/3 of the servers fail during the life of the system
  • Not practical for a long-lived system
  • Need to reboot servers at regular intervals
• But key holders are fixed

Proactive Threshold Signatures

• Proactive threshold signatures
  • More flexibility in choosing the membership of the inner ring
• A public key is paired with a number of private keys
• Each server uses its key to generate a signature share

Proactive Threshold Signatures

• Any k shares may be combined to produce a full signature (the k-of-n idea is sketched below)
• To change the membership of an inner ring:
  • Regenerate the key shares
  • No need to change the public key
  • Transparent to secondary hosts
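The any-k-of-n reconstruction idea can be illustrated with Shamir secret sharing over a prime field. Real proactive threshold signature schemes (e.g., threshold RSA) are considerably more involved; this sketch only demonstrates the k-of-n share-combination property:

    import java.math.BigInteger;
    import java.security.SecureRandom;
    import java.util.*;

    public class ShamirSketch {
        // A Mersenne prime as the field modulus.
        static final BigInteger P = BigInteger.valueOf(2).pow(127).subtract(BigInteger.ONE);

        // Split secret into n shares, any k of which reconstruct it:
        // share i is (i, f(i)) for a random degree-(k-1) polynomial with f(0) = secret.
        static Map<Integer, BigInteger> split(BigInteger secret, int k, int n) {
            SecureRandom rnd = new SecureRandom();
            BigInteger[] coeff = new BigInteger[k];
            coeff[0] = secret;
            for (int i = 1; i < k; i++) coeff[i] = new BigInteger(126, rnd);
            Map<Integer, BigInteger> shares = new HashMap<>();
            for (int x = 1; x <= n; x++) {
                BigInteger y = BigInteger.ZERO;      // Horner evaluation of f(x)
                for (int i = k - 1; i >= 0; i--)
                    y = y.multiply(BigInteger.valueOf(x)).add(coeff[i]).mod(P);
                shares.put(x, y);
            }
            return shares;
        }

        // Lagrange interpolation at x = 0 recovers f(0) = secret.
        static BigInteger combine(Map<Integer, BigInteger> shares) {
            BigInteger secret = BigInteger.ZERO;
            for (var e : shares.entrySet()) {
                BigInteger num = BigInteger.ONE, den = BigInteger.ONE;
                for (var o : shares.entrySet()) {
                    if (o.getKey().equals(e.getKey())) continue;
                    num = num.multiply(BigInteger.valueOf(-o.getKey())).mod(P);
                    den = den.multiply(BigInteger.valueOf(e.getKey() - o.getKey())).mod(P);
                }
                secret = secret.add(e.getValue().multiply(num).multiply(den.modInverse(P))).mod(P);
            }
            return secret;
        }

        public static void main(String[] args) {
            BigInteger secret = BigInteger.valueOf(123456789);
            Map<Integer, BigInteger> shares = split(secret, 3, 7);  // k=3 of n=7
            Map<Integer, BigInteger> any3 = new HashMap<>();
            for (int x : new int[]{2, 5, 7}) any3.put(x, shares.get(x));
            System.out.println(combine(any3).equals(secret));       // true
        }
    }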

The Responsible Party

• Who chooses the inner ring?
• The responsible party:
  • A server that publishes sets of failure-independent nodes
  • Determined through offline measurement and analysis

Software Architecture

Java atop the Staged Event Driven Architecture (SEDA)

• Each subsystem is implemented as a stage (sketched below)
  • With its own state and thread pool
• Stages communicate through events
• 50,000 semicolons, written by five graduate students and many undergrad interns
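A minimal sketch of a SEDA-style stage, assuming the shape described above: a private event queue plus a dedicated thread pool (class and method names are illustrative):

    import java.util.concurrent.*;

    public class StageSketch {
        private final BlockingQueue<Runnable> events = new LinkedBlockingQueue<>();
        private final ExecutorService threads;      // this stage's own thread pool

        StageSketch(int nThreads) {
            threads = Executors.newFixedThreadPool(nThreads);
            for (int i = 0; i < nThreads; i++)
                threads.submit(() -> {
                    try {
                        while (true) events.take().run();    // handle events forever
                    } catch (InterruptedException e) { /* stage shut down */ }
                });
        }

        // Other stages communicate with this one by enqueueing events.
        void enqueue(Runnable event) { events.add(event); }

        void shutdown() { threads.shutdownNow(); }  // interrupts blocked take()

        public static void main(String[] args) throws Exception {
            StageSketch archive = new StageSketch(2);
            archive.enqueue(() -> System.out.println("fragmenting update..."));
            Thread.sleep(100);                      // let the event run
            archive.shutdown();
        }
    }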

Language Choice

• Java: speed of development
  • Strongly typed
  • Garbage collected
  • Reduced debugging time
• Support for events
• Easy to port multithreaded Java code
  • Ported to Windows 2000 in one week

Language Choice

• Problems with Java:
  • Unpredictability introduced by garbage collection
  • Every thread in the system is halted while the garbage collector runs
  • Any ongoing process stalls for ~100 milliseconds
  • May add several seconds to requests that travel across machines

Experimental Setup

• Two test beds
• Local cluster of 42 machines at Berkeley
  • Each with two 1.0 GHz Pentium III CPUs
  • 1.5 GB PC133 SDRAM
  • Two 36 GB hard drives, RAID 0
  • Gigabit Ethernet adaptor
  • Linux 2.4.18 SMP

Experimental Setup

• PlanetLab, ~100 nodes across ~40 sites
  • 1.2 GHz Pentium III, 1 GB RAM
  • ~1000 virtual nodes

Storage Overhead

• For 16-of-32 erasure encoding: 2.7x for data > 8 kB
• For 16-of-64 erasure encoding: 4.8x for data > 8 kB

The Latency Benchmark

• A single client submits updates of various sizes to a four-node inner ring
• Metric: time from just before the request is signed until the signature over the result is checked
• Updates 40 MB of data over 1000 updates, with 100 ms between updates

The Latency Benchmark

Update Latency (ms)

Key Size | Update Size | 5% Time | Median Time | 95% Time
512b     | 4 kB        |      39 |          40 |       41
512b     | 2 MB        |    1037 |        1086 |     1348
1024b    | 4 kB        |      98 |          99 |      100
1024b    | 2 MB        |    1098 |        1150 |     1448

Latency Breakdown

Phase     | Time (ms)
Check     |       0.3
Serialize |       6.1
Apply     |       1.5
Archive   |       4.5
Sign      |      77.8

The Throughput Microbenchmark

• A number of clients submit updates of various sizes to disjoint objects on a four-node inner ring
• The clients:
  • Create their objects
  • Synchronize themselves
  • Update the objects as many times as possible for 100 seconds

Archive Retrieval Performance

• Populate the archive by submitting updates of various sizes to a four-node inner ring
• Delete all copies of the data in its reconstructed form
• A single client then submits reads

Archive Retrieval Performance

• Throughput:
  • 1.19 MB/s (PlanetLab)
  • 2.59 MB/s (local cluster)
• Latency: ~30-70 milliseconds

The Stream Benchmark

• Ran 500 virtual nodes on PlanetLab
  • Inner ring in the SF Bay Area
  • Replicas clustered in the 7 largest PlanetLab sites
• Streams updates to all replicas
  • One writer (the content creator) repeatedly appends to the data object
  • Others read new versions as they arrive
• Measures network resource consumption

The Tag Benchmark

• Measures the latency of token passing
• OceanStore is 2.2 times slower than TCP/IP

The Andrew Benchmark

• File system benchmark
• 4.6x slower than NFS in read-intensive phases
• 7.3x slower in write-intensive phases