OceanStore: An Architecture for Global-Scale Persistent Storage


Transcript of OceanStore: An Architecture for Global-Scale Persistent Storage

Page 1: OceanStore: An Architecture for Global-Scale Persistent Storage

OceanStore: An Architecture for Global-Scale Persistent Storage
Professor John Kubiatowicz, University of California at Berkeley

http://oceanstore.cs.berkeley.edu

Overview

• OceanStore is a global-scale data utility for Internet services
• How OceanStore is used
  - Application/user data is stored in objects
  - Objects are placed in the global OceanStore infrastructure
  - Objects are accessed via Globally Unique Identifiers (GUIDs)
  - Objects are modified via action/predicate pairs (see the update sketch after this list)
  - Each operation creates a new version of the object
  - Internet services (applications) define object format and content
• Potential Internet services
  - Web caches, global file systems, Hotmail-like mail portals, etc.
• Goals
  - Global scale
  - Extreme durability of data
  - Use of untrusted infrastructure
  - Maintenance-free operation
  - Privacy of data
  - Automatic performance tuning
• Enabling technologies
  - Peer-to-peer and overlay networks
  - Erasure encoding and replication
  - Byzantine agreements
  - Repair and automatic node failover
  - Encryption and access control
  - Introspection and data clustering
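As a concrete illustration of the object model above, here is a minimal Java sketch of an update as a list of predicate/action pairs against a GUID-named object; the type and method names are hypothetical, not the actual Pond API.

    import java.util.List;
    import java.util.UUID;

    // Hypothetical sketch: an update is a list of predicate/action pairs applied
    // to the object named by a GUID; a committed update yields a new version.
    public class UpdateSketch {
        record Predicate(String check) {}                  // e.g. "latest version == 7"
        record Action(String mutation) {}                  // e.g. "replace bytes 0..4095"
        record Pair(Predicate predicate, Action action) {}
        record Update(UUID guid, List<Pair> pairs) {}

        public static void main(String[] args) {
            UUID guid = UUID.randomUUID();                 // stand-in for a real object GUID
            Update u = new Update(guid, List.of(
                new Pair(new Predicate("latest version == 7"),
                         new Action("replace bytes 0..4095"))));
            // Submitting u to the object's inner ring would check every predicate,
            // serialize the update, and on success create version 8 of the object.
            System.out.println("update on " + u.guid() + " with "
                + u.pairs().size() + " predicate/action pair(s)");
        }
    }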

Key components: Tapestry and Inner Ring

• Tapestry
  - Decentralized Object Location and Routing (DOLR)
  - Provides routing to an object independent of its location
  - Automatically reroutes to backup nodes when failures occur
  - Based on the Plaxton algorithm (see the prefix-routing sketch after this list)
  - Overlay network that scales to systems with large numbers of nodes
  - See the Tapestry poster for more information

• Inner Ring
  - A set of nodes per object, chosen by the Responsible Party
  - Applies updates/writes requested by users
  - Checks all predicates and access control lists
  - Byzantine agreement is used to check and serialize updates
  - Based on the algorithm by Castro and Liskov
  - Ensures correctness even with f of 3f+1 nodes compromised (see the quorum-sizing sketch after this list)
  - Threshold signatures are used (the signing cost appears in the latency breakdown below)
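The Plaxton-style routing that Tapestry builds on can be sketched in a few lines of Java: each hop forwards to a known node whose identifier shares a longer prefix with the destination GUID. This is only an illustration of the idea, not Tapestry's implementation, which adds per-digit routing tables, surrogate routing, and backup links for fault tolerance.

    // Illustrative prefix routing over hex node identifiers.
    public class PrefixRoutingSketch {
        // Length of the longest common prefix of two identifiers.
        static int sharedPrefix(String a, String b) {
            int i = 0;
            while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
            return i;
        }

        // Pick a neighbor that matches at least one more digit of the destination
        // than the current node does; if none exists, this node is the root.
        static String nextHop(String self, String dest, String[] neighbors) {
            int have = sharedPrefix(self, dest);
            for (String n : neighbors)
                if (sharedPrefix(n, dest) > have) return n;
            return self;
        }

        public static void main(String[] args) {
            String[] neighbors = { "4BC3", "7E11", "4A9F" };
            System.out.println(nextHop("4A21", "4BF0", neighbors)); // prints 4BC3
        }
    }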
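The Castro-Liskov bound quoted above is easy to check numerically: tolerating f compromised ring members requires n = 3f + 1 members in total, and decisions need a quorum of 2f + 1 matching replies. A small sketch:

    // Ring and quorum sizes for the standard Byzantine fault-tolerance bound.
    public class BftSizing {
        public static void main(String[] args) {
            for (int f = 1; f <= 3; f++) {
                int n = 3 * f + 1;      // total inner-ring members
                int quorum = 2 * f + 1; // matching replies needed to decide
                System.out.printf("f=%d -> ring size n=%d, quorum=%d%n", f, n, quorum);
            }
        }
    }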

Key components: Archival Storage and Replicas

• Archival Storage
  - Provides extreme durability of data objects
  - Disseminates archival fragments throughout the infrastructure
  - Fragment replication and repair ensure durability
  - Utilizes erasure codes: redundancy without the overhead of complete replication
  - Data objects are coded at a rate r = m/n, producing n fragments, any m of which can reconstruct the object
  - Storage overhead is n/m (see the worked example after this list)

• Replicas
  - Full copies of data objects stored in the peer-to-peer infrastructure
  - Enable fast access
  - Introspection allows replicas to self-organize; replicas migrate toward client accesses
  - Encryption of objects ensures data privacy
  - A dissemination tree is used to alert replicas of object updates
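To make the storage-overhead arithmetic above concrete, here is a small Java example; the values m = 16 and n = 64 are illustrative choices, not necessarily the prototype's actual coding parameters.

    // Erasure-coding arithmetic: rate r = m/n, any m of n fragments rebuild the
    // object, and the on-disk overhead is n/m = 1/r.
    public class ErasureOverhead {
        public static void main(String[] args) {
            int m = 16, n = 64;                // illustrative: rate r = 16/64 = 1/4
            double rate = (double) m / n;
            double overhead = (double) n / m;  // 4x the object size stored as fragments
            System.out.printf("rate=%.2f, %d fragments, any %d rebuild, overhead=%.1fx%n",
                              rate, n, m, overhead);
        }
    }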

Andrew benchmark run times by phase (seconds):

Phase    Linux NFS   Pond-512   Pond-1024
I            0.9        2.8        6.6
II           9.4       16.8       40.4
III          8.3        1.8        1.9
IV           6.9        1.5        1.5
V           21.5       32.0       70.0
Total       47.0       54.9      120.3
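The 4.6x and 7.3x figures quoted in the application benchmarks below can be recovered from this table, taking phase III as a representative read-intensive phase and phase I as a write-intensive one:

    // Ratios derived from the Andrew benchmark table above.
    public class AndrewRatios {
        public static void main(String[] args) {
            double nfsPhaseIII = 8.3, pond512PhaseIII = 1.8;   // read-intensive phase
            double nfsPhaseI = 0.9, pond1024PhaseI = 6.6;      // write-intensive phase
            System.out.printf("read speedup over NFS: %.1fx%n", nfsPhaseIII / pond512PhaseIII); // ~4.6x
            System.out.printf("write slowdown vs. NFS: %.1fx%n", pond1024PhaseI / nfsPhaseI);   // ~7.3x
        }
    }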

Pond prototype benchmarks

Update latency (median):

Key Size   Update Size   Median Time (ms)
512 b      4 kB            40
512 b      2 MB          1086
1024 b     4 kB            99
1024 b     2 MB          1150

Latency breakdown:

Phase       Time (ms)
Check          0.3
Serialize      6.1
Apply          1.5
Archive        4.5
Sign          77.8
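A quick calculation over the breakdown above shows why signing is singled out as the costly step: it accounts for roughly 86% of the total.

    // Share of update latency spent computing the threshold signature.
    public class LatencyShare {
        public static void main(String[] args) {
            double check = 0.3, serialize = 6.1, apply = 1.5, archive = 4.5, sign = 77.8;
            double total = check + serialize + apply + archive + sign;  // 90.2 ms
            System.out.printf("total=%.1f ms, sign share=%.0f%%%n", total, 100 * sign / total);
        }
    }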

Conclusions and future directions

• OceanStore’s accomplishments
  - Major prototype completed
  - Several fully functional Internet services built and deployed
  - Demonstrated feasibility of the approach
  - Published results on the system’s performance
  - Collaborating with other global-scale research initiatives
• Current research directions
  - Investigate new introspective data placement strategies
  - Finish adding features: tentative update sharing between sessions, archival repair, replica management
  - Improve existing performance and deploy to larger networks: examine bottlenecks, improve stability, data structure improvements
  - Develop more applications

Current status: Pond implementation complete

• Pond implementation
  - All major subsystems completed: fault-tolerant inner ring, erasure-coding archive
  - Software released to the developer community outside Berkeley
  - 280K lines of Java, plus JNI libraries for crypto and the archive
  - Several applications implemented
  - See the FAST paper on the Pond prototype and benchmarking
• Deployed on PlanetLab
  - An initiative to provide researchers with a wide-area testbed: http://www.planet-lab.org
  - ~100 hosts, ~40 sites, multiple continents
  - Allows Pond to run up to 1000 virtual nodes
  - Applications have been run successfully in the wide area
  - Created tools to allow quick deployment to PlanetLab

Internet services built on OceanStore

• MINNO
  - Global-scale e-mail system built on OceanStore
  - Enables e-mail storage and access to user accounts
  - Send e-mail via an SMTP proxy; read and organize via IMAP
  - MINNO stores data in four types of OceanStore objects: folder list, folder, message, and maildrop (see the sketch after this list)
  - Relaxed consistency model enables fast wide-area access
• Riptide
  - Web caching infrastructure
  - Uses data migration to move web objects closer to users
  - Verifies the integrity of web content
• NFS
  - Provides traditional file system support
  - Enables time travel (reverting files/directories) through OceanStore’s versioning primitives
• Many others
  - Palm Pilot synchronizer, AFS, etc.
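A minimal sketch of the four MINNO object types named above; the field layout is an illustrative guess, not MINNO's actual schema.

    import java.util.List;
    import java.util.UUID;

    // Illustrative shapes for MINNO's four OceanStore object types.
    public class MinnoObjects {
        record FolderList(UUID guid, List<UUID> folders) {}           // one per account
        record Folder(UUID guid, String name, List<UUID> messages) {} // ordered message refs
        record Message(UUID guid, byte[] rfc822) {}                   // immutable message body
        record Maildrop(UUID guid, List<UUID> incoming) {}            // where the SMTP proxy appends

        public static void main(String[] args) {
            Maildrop drop = new Maildrop(UUID.randomUUID(), List.of());
            System.out.println("maildrop " + drop.guid() + " holds "
                + drop.incoming().size() + " undelivered message(s)");
        }
    }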

Application benchmarks

• Object update latency
  - Measures the inner ring's Byzantine-agreement commit latency
  - Shows the threshold signature is costly: roughly 100 ms latency on object writes
• Object update throughput
  - Measures object write throughput
  - The base system provides 8 MB/s
  - Batch updates to get good performance (see the batching sketch below)
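A minimal sketch of client-side update batching, assuming a hypothetical submitBatch call: the point is that one agreement round, and therefore one expensive signature, is paid per batch rather than per update.

    import java.util.ArrayList;
    import java.util.List;

    // Group small updates so each inner-ring agreement amortizes its fixed cost.
    public class UpdateBatcher {
        static final int BATCH_SIZE = 32;
        private final List<byte[]> pending = new ArrayList<>();

        void enqueue(byte[] update) {
            pending.add(update);
            if (pending.size() >= BATCH_SIZE) flush();
        }

        void flush() {
            if (pending.isEmpty()) return;
            submitBatch(new ArrayList<>(pending)); // one agreement round per batch
            pending.clear();
        }

        private void submitBatch(List<byte[]> batch) {  // stand-in, not a real Pond call
            System.out.println("committing " + batch.size() + " update(s) in one round");
        }

        public static void main(String[] args) {
            UpdateBatcher b = new UpdateBatcher();
            for (int i = 0; i < 100; i++) b.enqueue(new byte[4096]);
            b.flush(); // flush the final partial batch
        }
    }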

[Figure: clients, the inner ring, replicas, and archival storage in the OceanStore infrastructure.]

• NFS: Andrew benchmark
  - Client in Berkeley, server in Seattle
  - 4.6x faster than NFS in read-intensive phases
  - 7.3x slower in write-intensive phases
  - Reasonable total time with a key size of 512 bits
  - Signature time is the bottleneck
• MINNO: Login time
  - Client cache sync time with new message retrieval
  - Measured time vs. latency to the inner ring
  - Simulates mobile clients
  - MINNO adapts well with data migration and tentative commits enabled
  - Outperforms a traditional IMAP server with no processing overhead