Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill


description

As the amount of unstructured data has greatly exceeded a single computer's ability to process it, data has become increasingly isolated from the compute elements. The resulting haul from stores of record (e.g., SAN, NAS, S3) to transient compute (e.g., Hadoop, EC2) creates needless mechanical work and human labor. Is there a better way? In this talk, we'll explore the coming convergence of data and compute in the cloud, focusing in particular on Joyent's Manta, a new internet-facing object storage facility that features compute. We will describe the design principles for Manta, the engineering challenges in building it, and more generally, the opportunities presented by the convergence of compute and data.

Transcript of Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Page 1: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

ZFS & Zones: Your Compute fell into My Data!

SVP, Engineering

[email protected]

Bryan Cantrill

@bcantrill

Page 2: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

The filesystem: Some prehistory

• When they were originally developed in the 1970s, filesystems were designed as an abstraction over a disk

• Over time, it became increasingly expensive to make bigger disks — and reliability suffered

• In the 1980s, both problems were solved by using many hard drives instead of just larger and larger drives: a redundant array of inexpensive disks (RAID)

• Even though filesystems were still relatively young at the time, it was deemed too complicated to rewrite them to accommodate the (new) notion of many disks

• This software problem was solved by introducing a new layer of software: the volume manager

Page 3: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

The volume management divide

• Volume management abstracts many physical devices into single logical volumes, allowing filesystems to retain a one-to-one mapping with a device (a logical one)

• This gave rise to a problematic divide:

• The volume manager understands multiple disks, but nothing of the higher level semantics of the filesystem

• The filesystem understands the higher semantics of the data, but has no physical device understanding

• This divide became entrenched over the 1990s, and had devastating ramifications for reliability, performance and manageability

Page 4: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Volume management deficiencies

• Because the volume management layer had no notion of the transactional semantics of the filesystem, system failure induced excruciating file system checks

• Worse, the system was left with no protection against many variants of device-level data corruption:

• The only failure the volume manager can reasonably detect is media failure that results in incorrect data on disk

• This doesn’t account for phantom reads (i.e., the wrong disk block is read from), phantom writes (i.e., the wrong disk block is written to) or driver pathologies (e.g. memory errors)

• And because filesystems did not understand more than one device, device failure often meant filesystem failure

Page 5: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Volume management deficiencies

• Lacking visibility into the hardware layer, the filesystem could not effectively use the parallelism inherent in multiple disks — and could not effectively schedule I/O

• Spindles were underutilized (leaving bandwidth and/or IOPS on the table) or overutilized (thrashing the device and yielding pathological performance)

• Management was a nightmare: filesystems could not be expanded or shrunk — requiring every filesystem to know in advance its intended capacity

Page 6: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

The ZFS revolution

• Starting in 2001, Sun began a revolutionary new software effort: to unify storage and eliminate the divide

• In this model, filesystems would lose their one-to-one association with devices: many filesystems would be multiplexed on many devices

• By starting with a clean sheet of paper, ZFS opened up vistas of innovation — and by its architecture was able to solve many otherwise intractable problems

• Sun shipped ZFS in 2005, and used it as the foundation of its enterprise storage products starting in 2008

• ZFS was open sourced in 2005; it remains the only open source enterprise-grade filesystem

Page 7: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

ZFS advantages

• Copy-on-write design allows on-disk consistency to be always assured (eliminating file system check)

• Copy-on-write design allows constant-time snapshots in unlimited quantity — and writable clones!

• Filesystem architecture allows filesystems to be created instantly and expanded — or shrunk! — on-the-fly

• Integrated volume management allows for intelligent device behavior with respect to disk failure and recovery

• Adaptive replacement cache (ARC) allows for optimal use of DRAM — especially on high DRAM systems

• Support for dedicated log and cache devices allows for optimal use of flash-based SSDs
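Most of the features above map onto a handful of `zfs` and `zpool` subcommands. The following is a dry-run sketch rather than anything shown in the talk: the pool name `tank`, the dataset names, and the device names are placeholders, and treating "expand or shrink" as a quota change is one reading of that bullet. The script only prints the commands unless `RUN=1` is set.

```shell
#!/bin/sh
# Dry-run sketch of the ZFS features listed above.
# Assumptions (not from the talk): pool "tank", dataset "tank/projects",
# and device names "c1t0d0"/"c1t1d0" are placeholders.

# Print each command; execute it only when RUN=1 (needs root and a real pool).
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

zfs_tour() {
    # Filesystems are cheap: creating one draws on the shared pool.
    run zfs create tank/projects
    # Constant-time snapshot, then a writable clone of that snapshot.
    run zfs snapshot tank/projects@before-upgrade
    run zfs clone tank/projects@before-upgrade tank/projects-test
    # Growing or shrinking a filesystem's limit is a property change.
    run zfs set quota=10G tank/projects
    run zfs set quota=5G tank/projects
    # Dedicated log and cache devices, typically SSDs.
    run zpool add tank log c1t0d0
    run zpool add tank cache c1t1d0
}

zfs_tour
```

Because every filesystem allocates from the same pool, none of these steps involves repartitioning or a separate volume manager.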

Page 8: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

ZFS at Joyent

• Joyent was the earliest ZFS adopter: becoming (in 2005) the first production user of ZFS outside of Sun

• ZFS is one of the four foundational technologies of Joyent’s SmartOS, our illumos derivative

• The other three foundational technologies in SmartOS are DTrace, Zones and KVM

• Search “fork yeah illumos” for the (uncensored) history of OpenSolaris, illumos, SmartOS and derivatives

• Joyent has extended ZFS to provide better support for multi-tenant operation with I/O throttling

Page 9: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

ZFS as the basis for object storage?

• We view ZFS as our most foundational differentiator...

• As we began to think about building our own internet facing object store in the fall of 2011, we naturally gravitated to ZFS...

• Could we extend ZFS in some important way that would offer something interesting and compelling?

• Short answer: meh

Page 10: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Aside: Virtualization in the cloud

• Operating a public cloud has significant technological and business challenges:

• From a technological perspective, must deliver highly elastic infrastructure with acceptable quality of service across a broad class of users and applications

• From a business perspective, must drive utilization as high as possible while still satisfying customer expectations for quality of service

• These aspirations are in tension: multi-tenancy can significantly degrade quality of service

• The key enabling technology for multi-tenancy is virtualization — but where in the stack to virtualize?

Page 11: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Hardware-level virtualization?

• The historical answer — since the 1960s — has been to virtualize at the level of the hardware:

• A virtual machine is presented upon which each tenant runs an operating system of their choosing

• There are as many operating systems as tenants

• The historical motivation for hardware virtualization remains its advantage today: it can run entire legacy stacks unmodified

• However, hardware virtualization exacts a heavy toll: operating systems are not designed to share resources like DRAM, CPU, I/O devices or the network

• Hardware virtualization limits tenancy and inhibits performance!

Page 12: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Platform-level virtualization?

• Virtualizing at the application platform layer addresses the tenancy challenges of hardware virtualization…

• ...but at the cost of dictating abstraction to the developer

• This creates the “Google App Engine problem”: developers are in a straightjacket where toy programs are easy — but sophisticated apps are impossible

• Virtualizing at the application platform layer poses many other challenges:

• Security, resource containment, language specificity, environment-specific engineering costs

Page 13: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Joyent’s solution: OS-level virtualization

• Virtualizing at the OS level hits the sweet spot:

• Single OS (single kernel) allows for efficient use of hardware resources, and therefore allows load factors to be high

• Disjoint instances are securely compartmentalized by the operating system

• Gives customers what appears to be a virtual machine (albeit a very fast one) on which to run higher-level software

• Gives customers PaaS when the abstractions work for them, IaaS when they need more generality

• OS-level virtualization allows for high levels of tenancy without dictating abstraction or sacrificing efficiency

• Zones is a bullet-proof implementation of OS-level virtualization — and is the core abstraction in Joyent’s SmartOS

Page 14: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Idea: ZFS + Zones?

Page 15: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Manta: ZFS + Zones!

• Building a sophisticated distributed system on top of ZFS and zones, we have built Manta, an internet-facing object storage system offering in situ compute

• That is, the description of compute can be brought to where objects reside instead of having to backhaul objects to transient compute

• The abstractions made available for computation are anything that can run on the OS...

• ...and as a reminder, the OS — Unix — was built around the notion of ad hoc unstructured data processing, and allows for remarkably terse expressions of computation

Page 16: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Aside: Unix

• When Unix appeared in the early 1970s, it was not just a new system, but a new way of thinking about systems

• Instead of a sealed monolith, the operating system was a collection of small, easily understood programs

• First Edition Unix (1971) contained many programs that we still use today (ls, rm, cat, mv)

• Its very name conveyed this minimalist aesthetic: Unix is a homophone of “eunuchs” — a castrated Multics

We were a bit oppressed by the big system mentality. Ken wanted to do something simple. — Dennis Ritchie

Page 17: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Unix: Let there be light

• In 1969, Doug McIlroy had the idea of connecting different components:

At the same time that Thompson and Ritchie were sketching out a file system, I was sketching out how to do data processing on the blackboard by connecting together cascades of processes

• This was the primordial pipe, but it took three years to persuade Thompson to adopt it:

And one day I came up with a syntax for the shell that went along with the piping, and Ken said, “I’m going to do it!”

Page 18: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Unix: ...and there was light

And the next morning we had this orgy of one-liners. — Doug McIlroy

Page 19: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

The Unix philosophy

• The pipe — coupled with the small-system aesthetic — gave rise to the Unix philosophy, as articulated by Doug McIlroy:

• Write programs that do one thing and do it well

• Write programs to work together

• Write programs that handle text streams, because that is a universal interface

• Four decades later, this philosophy remains the single most important revolution in software systems thinking!

Page 20: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Doug McIlroy v. Don Knuth: FIGHT!

• In 1986, Jon Bentley posed the challenge that became the Epic Rap Battle of computer science history:

Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.

• Don Knuth’s solution: an elaborate program in WEB, a Pascal-like literate programming system of his own invention, using a purpose-built algorithm

• Doug McIlroy’s solution shows the power of the Unix philosophy:

tr -cs A-Za-z '\n' | tr A-Z a-z | \
    sort | uniq -c | sort -rn | sed ${1}q
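To make each stage of McIlroy's pipeline explicit, here is a small sketch that wraps it in a shell function and annotates what every command contributes. The function name `top_words` and the sample sentence are illustrative and do not come from the talk.

```shell
#!/bin/sh
# McIlroy's pipeline wrapped in a function: reads text on stdin and
# prints the N most frequent words with their counts.
# The function name and the sample text are illustrative only.
top_words() {
    tr -cs A-Za-z '\n' |   # split into one word per line
        tr A-Z a-z |       # fold to lower case
        sort |             # bring identical words together
        uniq -c |          # count each run of identical words
        sort -rn |         # most frequent first
        sed "${1}q"        # stop after N lines
}

printf 'the cat and the dog and the bird\n' | top_words 2
```

For the sample sentence this reports `the` with a count of 3, then `and` with a count of 2. The exact column padding comes from `uniq -c` and varies between systems.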

Page 21: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Big Data: History repeats itself?

• The original Google MapReduce paper (Dean et al., OSDI ’04) poses a problem disturbingly similar to Bentley’s challenge nearly two decades prior:

Count of URL Access Frequency: The map function processes logs of web page requests and outputs ⟨URL, 1⟩. The reduce function adds together all values for the same URL and emits a ⟨URL, total count⟩ pair.

• But the solutions do not adhere to the Unix philosophy...

• ...and nor do they make use of the substantial Unix foundation for data processing

• e.g., Appendix A of the OSDI ’04 paper has a 71 line word count in C++ — with nary a wc in sight
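For comparison, the URL access frequency problem can be written with the same Unix building blocks. This is a sketch under one assumption that the paper does not make: the logs are in Common Log Format, so the requested path is the seventh whitespace-separated field. The function name and the three log lines are invented for illustration.

```shell
#!/bin/sh
# URL access frequency as a pipeline.
# Assumption: Common Log Format, where the requested path is field 7.
url_counts() {
    awk '{ print $7 }' |   # "map": emit one URL per request
        sort |             # group identical URLs together
        uniq -c |          # "reduce": count each group
        sort -rn           # order by total count
}

# Three invented requests for illustration.
url_counts <<'EOF'
10.0.0.1 - - [03/Jul/2013:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512
10.0.0.2 - - [03/Jul/2013:10:00:01 +0000] "GET /about.html HTTP/1.1" 200 128
10.0.0.1 - - [03/Jul/2013:10:00:02 +0000] "GET /index.html HTTP/1.1" 200 512
EOF
```

On this input, `/index.html` is listed first with a count of 2, followed by `/about.html` with a count of 1.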

Page 22: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Manta: Unix for Big Data

• Manta allows for an arbitrarily scalable variant of McIlroy’s solution to Bentley’s challenge:

mfind -t o /bcantrill/public/v7/usr/man | \
    mjob create -o -m "tr -cs A-Za-z '\n' | \
    tr A-Z a-z | sort | uniq -c" -r \
    "awk '{ x[\$2] += \$1 } END { for (w in x) { print x[w] \" \" w } }' | \
    sort -rn | sed ${1}q"

• This description is not only terse, it is high performing: data is left at rest — with the “map” phase doing heavy reduction of the data stream

• As such, Manta — like Unix — is not merely syntactic sugar; it converges compute and data in a new way
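The reduce command uses `awk` to sum counts rather than a second `uniq -c`, because its input already consists of partial counts, one set per object. The sketch below simulates the two phases on local files to show this. It does not talk to Manta, and it assumes the map command runs once per input object. The function names and file contents are made up for illustration.

```shell
#!/bin/sh
# Local simulation of the two-phase job above; no Manta involved.
# Assumption: each local file stands in for one stored object.

# Map phase: per-object word counts (same command as the -m argument).
map_phase() {
    tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c
}

# Reduce phase: sum the partial counts per word, then rank them
# (same command as the -r argument, with N passed as $1).
reduce_phase() {
    awk '{ x[$2] += $1 } END { for (w in x) { print x[w] " " w } }' |
        sort -rn | sed "${1}q"
}

# Run the map phase once per object and feed all results to one reducer.
word_freq() {
    n=$1; shift
    for obj in "$@"; do
        map_phase < "$obj"
    done | reduce_phase "$n"
}

dir=$(mktemp -d)
printf 'the pipe and the shell\n' > "$dir/a.txt"
printf 'the kernel and the pipe and the file\n' > "$dir/b.txt"
word_freq 2 "$dir/a.txt" "$dir/b.txt"
rm -r "$dir"
```

The two files contribute 2 and 3 occurrences of `the`, so the reducer prints `5 the` followed by `3 and`. Only the short count lines travel between the phases, which is the sense in which the map phase reduces the data stream.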

Page 23: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Manta: CAP tradeoffs

• Eventual consistency represents the wrong CAP tradeoffs for most; we prefer consistency over availability for writes (but still availability for reads)

• Many more details: http://dtrace.org/blogs/dap/2013/07/03/fault-tolerance-in-manta/

• Celebrity endorsement:

Page 24: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Manta: Other design principles

• Hierarchical storage is an excellent idea (ht: Multics); Manta implements proper directories, delimited with a forward slash

• Manta implements a snapshot/link hybrid dubbed a snaplink; can be used to effect versioning

• Manta has full support for CORS headers

• Manta uses SSH-based HTTP auth for client-side tooling (IETF draft-cavage-http-signatures-00)

• Manta SDKs exist for node.js, Java, Ruby, Python

• “npm install manta” for command line interface

Page 25: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Manta and the future of big data

• We believe compute/data convergence to be the future of big data: stores of record must support computation as a first-class, in situ operation

• We believe that Unix is a natural way of expressing this computation — and that the OS is the right level at which to virtualize to support this securely

• We believe that ZFS is the only sane storage underpinning for such a system

• Manta will surely not be the only system to represent the confluence of these — but it is the first

• We are actively retooling our software stack in terms of Manta — Manta is changing the way we develop software!

Page 26: Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill

Manta: More information

• Product page:

http://joyent.com/products/manta

• node.js module:

https://github.com/joyent/node-manta

• Manta documentation:

http://apidocs.joyent.com/manta/

• IRC, e-mail, Twitter, etc.:

#manta on freenode, [email protected], @mcavage, @dapsays, @yunongx, @joyent