Erasure codes and storage tiers on gluster

Post on 07-Aug-2015


Erasure Codes and Storage Tiers on Gluster

Dan Lambright
SA summit
Sep 23, 2014

AGENDA

● Why erasure codes (ec) in Gluster
● How ec works
● Brief peek at underlying mathematics
● Storage tiering in gluster
● Demo
● “One more thing”

Why erasure codes in gluster?

● Desire protection from double failure

● RAID6 controllers are expensive

● Imagine a 64 node volume
  ● Each brick on a separate bare metal machine
  ● Cost is 64 x $ (price of one LSI MegaRaid controller) = 20K

Why erasure codes in gluster?

● Triplication (3 way replication) is expensive

● Two redundant disks for every data disk

● 200% overhead! :(

Erasure codes

● Store m disks worth of data on k disks (k>m)

● n redundant disks (n = k-m)
  ● can pick n to choose failure tolerance

● A generalization of RAID6

● Distributed across nodes

Overhead analysis

● Can also consider mean time before failure

k (total disks)   n (failures admitted)   m (data disks)   Capacity overhead (n/k)   RAID level
 3                1                       2                33.33%                    5
 5                1                       4                20%                       5
 6                2                       4                33.33%                    6
 7                3                       4                42.86%                    E
 9                1                       8                11.11%                    5
10                2                       8                20%                       6
11                3                       8                27.27%                    E
12                4                       8                33.33%                    E
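
The overhead column can be recomputed from k and n alone. The following is a minimal Python sketch, not part of the original slides; the function name `capacity_overhead` is made up for illustration. It also expresses 3-way replication in the same n/k terms, where two of every three disks are redundant. The "200%" figure quoted earlier is the same configuration measured against the data disks (n/m) instead of the total.

```python
def capacity_overhead(k: int, n: int) -> float:
    """Fraction of raw capacity spent on redundancy: n / k."""
    if not 0 < n < k:
        raise ValueError("need 0 < n < k")
    return n / k


# (k, n) pairs taken from the table; m = k - n data disks.
configs = [(3, 1), (5, 1), (6, 2), (7, 3), (9, 1), (10, 2), (11, 3), (12, 4)]

rows = [(k, n, k - n, round(100 * capacity_overhead(k, n), 2)) for k, n in configs]

for k, n, m, pct in rows:
    print(f"k={k:2d}  n={n}  m={m}  overhead={pct:.2f}%")

# 3-way replication: one disk of data stored on three disks (k=3, n=2).
triplication = capacity_overhead(3, 2)
print(f"triplication: {100 * triplication:.2f}% of raw capacity is redundancy")
```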

ERASURE CODES PRIMER

ERASURE CODE TERMS

● m data disks

● n parity disks

● k total number of disks = m+n

● Symbol – Smallest data unit. w bits.
  ● Typically w = 8 = a byte

● Chunk (aka fragment) – r symbols per disk

● Stripe – collection of m+n chunks across k disks
  ● Unit of manipulation for recovery
  ● Also known as a “slice”

ERASURE CODE TERMS

r = 6, m = 4, n = 2, k = 6, w = 1

[Figure: a “stripe” of 6 fragments; each fragment holds r = 6 one-bit symbols, e.g. 011010]
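
To make the terms concrete, here is a small Python sketch (an illustration only; the class name `StripeLayout` is hypothetical) that derives the fragment and stripe sizes from m, n, r and w, using the parameter values of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StripeLayout:
    m: int  # data disks
    n: int  # parity disks
    r: int  # symbols per fragment (chunk)
    w: int  # bits per symbol

    @property
    def k(self) -> int:
        """Total number of disks."""
        return self.m + self.n

    @property
    def fragment_bits(self) -> int:
        """Size of one fragment: r symbols of w bits each."""
        return self.r * self.w

    @property
    def stripe_bits(self) -> int:
        """Size of a whole stripe: one fragment on each of the k disks."""
        return self.k * self.fragment_bits

    @property
    def data_bits(self) -> int:
        """Portion of the stripe that corresponds to user data."""
        return self.m * self.fragment_bits


example = StripeLayout(m=4, n=2, r=6, w=1)
print(example.k, example.fragment_bits, example.stripe_bits, example.data_bits)
# prints: 6 6 36 24
```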

Systematic

● m data chunks, n coding chunks

● (can stripe parity and data chunks on the same disk)
● Reads are simple, only decode on repairs

[Figure: slices 1, 2 and 3]
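
The "reads are simple, only decode on repairs" property can be shown with the simplest systematic code: m data chunks plus a single XOR parity chunk (n = 1, as in RAID5). This Python sketch is an illustration under that assumption, not the Reed-Solomon scheme discussed later, and all function names are hypothetical. Chunks are assumed to have equal length.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_systematic(data_chunks: List[bytes]) -> List[bytes]:
    """Return the m data chunks unchanged, followed by one XOR parity chunk."""
    parity = reduce(xor_bytes, data_chunks)
    return list(data_chunks) + [parity]


def read_chunk(stripe: List[Optional[bytes]], index: int) -> bytes:
    """Plain read when the chunk is present; rebuild it only when it is lost."""
    chunk = stripe[index]
    if chunk is not None:
        return chunk  # no decoding needed on the normal read path
    survivors = [c for i, c in enumerate(stripe) if i != index]
    if any(c is None for c in survivors):
        raise ValueError("single parity tolerates only one lost chunk")
    return reduce(xor_bytes, survivors)


stripe = encode_systematic([b"ab", b"cd", b"ef"])
damaged: List[Optional[bytes]] = list(stripe)
damaged[1] = None  # simulate a failed disk
print(read_chunk(damaged, 1))  # prints: b'cd'
```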

Non-Systematic

● All k chunks in a stripe are coded

● Does not distinguish data servers from code servers

● Encode/decode on writes and reads

[Figure: slices 1, 2 and 3]

Encoding / Decoding Overhead

● Network RTT dominates the encode/decode overhead

● Packages exist to implement the math
  ● Intel has fast routines for inverse, dot product, encoding, decoding, etc.
  ● Jerasure library from academia
  ● Gluster's is purpose-built and fast

GLUSTER IMPLEMENTATION

GLUSTERFS “Disperse Volumes”

● Done at Datalab corp. by Xavier Hernandez
● Use case: archiving medical records
● Developed over last 2 years
● Now part of gluster upstream

CLI

Two new options have been added to the 'create' command of the CLI:

gluster volume create <name> disperse <count> redundancy <count>

Disperse is “k” (total number of bricks)

Redundancy is “n”
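
As a sketch of how the two options fit together, the Python helper below assembles the command line from k and n without running it. The helper name, the volume name and the server names are hypothetical. The trailing brick list is an assumption: the command shown on the slide stops after the redundancy count.

```python
import shlex
from typing import List


def disperse_create_argv(name: str, disperse: int, redundancy: int,
                         bricks: List[str]) -> List[str]:
    """Build (but do not execute) a 'gluster volume create' command line."""
    if not 0 < redundancy < disperse:
        raise ValueError("redundancy (n) must be between 1 and disperse (k) - 1")
    return [
        "gluster", "volume", "create", name,
        "disperse", str(disperse),
        "redundancy", str(redundancy),
        *bricks,  # assumption: bricks follow the options as server:/path
    ]


# Hypothetical servers and brick paths, k = 6 and n = 2.
bricks = [f"server{i}:/bricks/b1" for i in range(1, 7)]
argv = disperse_create_argv("ecvol", 6, 2, bricks)
print(shlex.join(argv))
```

Building an argument list rather than a single string keeps the values separate if the command is later handed to `subprocess.run`.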

“Disperse volumes” design choices

● The “symbols” are bytes: w = 8

● The fragment size r = 128

● Algorithm: Reed-Solomon

● Generator matrix: Vandermonde

● Non–systematic

● Encoding / decoding done on client side

● Modeled after AFR
  ● Concurrent writes must be processed in order
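
The combination of a Vandermonde generator matrix and a non-systematic layout can be sketched in a few lines of Python. This is a teaching sketch, not Gluster's code: it works over the prime field of integers modulo 257 (an assumption made for readability) instead of the byte-sized field GF(2^8) that w = 8 implies, so a coded symbol may take the value 256 and would not fit in a byte. All function names are hypothetical.

```python
from typing import Dict, List

P = 257  # prime modulus; assumption for readability, not GF(2^8)


def vandermonde(k: int, m: int) -> List[List[int]]:
    """k x m matrix whose row i is [x^0, x^1, ..., x^(m-1)] for x = i + 1."""
    return [[pow(x, j, P) for j in range(m)] for x in range(1, k + 1)]


def encode(data: List[int], k: int) -> List[int]:
    """Multiply the generator matrix by m data symbols to get k coded symbols."""
    rows = vandermonde(k, len(data))
    return [sum(c * d for c, d in zip(row, data)) % P for row in rows]


def decode(fragments: Dict[int, int], m: int, k: int) -> List[int]:
    """Recover the m data symbols from any m surviving {index: symbol} pairs."""
    idx = sorted(fragments)[:m]
    if len(idx) < m:
        raise ValueError("need at least m fragments")
    g = vandermonde(k, m)
    # Augmented system [A | b], solved by Gauss-Jordan elimination modulo P.
    aug = [g[i] + [fragments[i]] for i in idx]
    for col in range(m):
        pivot = next(r for r in range(col, m) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, P)  # modular inverse (Python 3.8+)
        aug[col] = [v * inv % P for v in aug[col]]
        for r in range(m):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(a - f * b) % P for a, b in zip(aug[r], aug[col])]
    return [row[m] for row in aug]


data = [72, 105, 33, 7]          # m = 4 data symbols
coded = encode(data, k=6)        # k = 6 coded symbols, none stored verbatim by design
survivors = {i: coded[i] for i in (1, 2, 4, 5)}  # fragments 0 and 3 lost
recovered = decode(survivors, m=4, k=6)
print(recovered == data)  # prints: True
```

Because every stored symbol is a mix of all m data symbols, reads must decode as well as writes encode, which matches the non-systematic trade-off described earlier.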

STORAGE TIERS

Storage Tiers

● Different “subvolume” tiers presented as a single volume

● HDD, SSD, tape, “persistent memory”, etc.

● Plug-in policy describes how data moves between tiers

● V1 policy: Cache

● slow and fast tiers

● CLI to add/remove cache tier from existing volume

Example: Erasure codes + SSD

● User sees one volume

● SSD “caches” ec data

[Figure: tiered volume with a hot “cache” tier on SSD and a cold ec tier on HDD; data is promoted to the hot tier and demoted to the cold tier]
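
The promote/demote behaviour of a cache policy can be illustrated with a toy model in Python. Everything here is an assumption made for the sake of the example: the class name, the read-count threshold, and the choice to demote the least-read hot file. The slides do not describe the actual policy at this level of detail.

```python
from collections import Counter
from typing import Dict


class TieredVolume:
    """Toy model: one volume backed by a small hot tier and a cold tier."""

    def __init__(self, hot_capacity: int, promote_after: int = 2):
        if hot_capacity < 1:
            raise ValueError("hot tier must hold at least one file")
        self.hot: Dict[str, bytes] = {}    # e.g. SSD
        self.cold: Dict[str, bytes] = {}   # e.g. ec volume on HDD
        self.hits: Counter = Counter()
        self.hot_capacity = hot_capacity
        self.promote_after = promote_after

    def write(self, name: str, data: bytes) -> None:
        # In this sketch new data always lands on the cold tier.
        self.hot.pop(name, None)
        self.cold[name] = data
        self.hits[name] = 0

    def read(self, name: str) -> bytes:
        if name in self.hot:
            self.hits[name] += 1
            return self.hot[name]
        data = self.cold[name]  # KeyError if the file does not exist
        self.hits[name] += 1
        if self.hits[name] >= self.promote_after:
            self._promote(name)
        return data

    def _promote(self, name: str) -> None:
        if len(self.hot) >= self.hot_capacity:
            # Demote the least-read hot file to make room.
            victim = min(self.hot, key=lambda f: self.hits[f])
            self.cold[victim] = self.hot.pop(victim)
        self.hot[name] = self.cold.pop(name)


vol = TieredVolume(hot_capacity=1)
vol.write("a", b"alpha")
vol.write("b", b"beta")
vol.read("a"); vol.read("a")   # second read promotes "a"
vol.read("b"); vol.read("b")   # promotes "b", demoting "a"
print(sorted(vol.hot), sorted(vol.cold))  # prints: ['b'] ['a']
```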

Future : Data classification (DC)

● Add rules to storage graph

● Rule determines subvolume

  ● File name
  ● Attribute (size, content)
  ● Etc.

[Figure: decision node “Filename = *.lock ?” with Yes / No branches leading to a “Secure / Encrypted” subvolume and an HDD subvolume]
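
A rule of the "file name determines subvolume" kind could look like the Python sketch below. It is only an illustration of the idea: the `Rule` type, the subvolume labels, and the choice of which branch matching files take are assumptions, since data classification is described here as future work.

```python
from fnmatch import fnmatchcase
from typing import List, NamedTuple


class Rule(NamedTuple):
    pattern: str    # glob matched against the file name
    subvolume: str  # hypothetical subvolume label


def classify(filename: str, rules: List[Rule], default: str) -> str:
    """Return the subvolume of the first matching rule, else the default."""
    for rule in rules:
        if fnmatchcase(filename, rule.pattern):
            return rule.subvolume
    return default


# Assumption: files matching *.lock go to the secure subvolume, the rest to HDD.
rules = [Rule("*.lock", "secure-encrypted")]
print(classify("db.lock", rules, default="hdd"))     # prints: secure-encrypted
print(classify("report.pdf", rules, default="hdd"))  # prints: hdd
```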

Future flexibility

● Many use cases
  ● Compliance
  ● Multi-tenancy
  ● Rack-aware placement (for performance)

● Policies described by language
  ● Arbitrary number of tiers, rules, subvolumes ..
  ● Template based

DEMO

ONE MORE THING..

Bitrot

● A daemon that scans gluster volumes
  ● Finds corrupted data
  ● Digest associated with each file
  ● Alert / recover on mismatch

● “Plug-ins” to daemon may do other things..
  ● Tuning parameters to be non-intrusive to performance
  ● Encryption
  ● Compression
  ● Etc.
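
The "digest per file, alert on mismatch" idea can be sketched with Python's standard `hashlib`. The choice of SHA-256, the in-memory `known` mapping and the function names are assumptions for illustration; the slides do not say which digest the daemon uses or where it is stored.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files need little memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def scan(root: Path, known: Dict[str, str]) -> List[str]:
    """Return the relative paths whose current digest differs from the recorded one."""
    corrupted = []
    for rel, expected in known.items():
        if file_digest(root / rel) != expected:
            corrupted.append(rel)  # a real daemon would alert or repair here
    return corrupted
```

A caller would record `file_digest` for each file when it is written and run `scan` periodically against those recorded values.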

Do it!

● Learn the math:
  ● http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-Tutorial.html

● Get the bits:
  ● https://forge.gluster.org/disperse

Thank You!

● dlambright@redhat.com

● RHS: www.redhat.com/storage/

● GlusterFS: www.gluster.org

@Glusterorg

@RedHatStorage

Gluster

Red Hat Storage

Slides Available on Mojo