Erasure codes and storage tiers on gluster

Dan Lambright, SA summit, Sep 23, 2014

Page 1: Erasure codes and storage tiers on gluster

Page 2: Erasure codes and storage tiers on gluster

AGENDA

● Why erasure codes (ec) in Gluster

● How ec works

● Brief peek at underlying mathematics

● Storage tiering in gluster

● Demo

● “One more thing”

Page 3: Erasure codes and storage tiers on gluster

Why erasure codes in gluster?

● Desire protection from double failure

● RAID6 controllers are expensive

● Imagine a 64 node volume

● Each brick on a separate bare metal machine

● Cost is 64 x $ for LSI MegaRaid controller = 20K

Page 4: Erasure codes and storage tiers on gluster

Why erasure codes in gluster?

● Triplication (3 way replication) is expensive

● Two redundant disks for every data disk

● 200% overhead! :(

Page 5: Erasure codes and storage tiers on gluster

Erasure codes

● Store m disks worth of data on k disks (k>m)

● n redundant disks (n = k-m)

● Can pick n to choose failure tolerance

● A generalization of RAID6

● Distributed across nodes

Page 6: Erasure codes and storage tiers on gluster

Overhead analysis

● Can also consider mean time before failure

k (total disks)  n (failures admitted)  m (data disks)  Capacity overhead (n/k)  RAID level
3                1                      2               33.33%                   5
5                1                      4               20%                      5
6                2                      4               33.33%                   6
7                3                      4               42.86%                   E
9                1                      8               11.11%                   5
10               2                      8               20%                      6
11               3                      8               27.27%                   E
12               4                      8               33.33%                   E

(E = erasure code)
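The overhead column can be recomputed from k and n alone. The sketch below is only an illustration; the function name `overhead` and the second ratio n/m are additions, not part of the slides. The n/m ratio is the one behind the "200% overhead" quoted for triplication (k=3, n=2, m=1), while the table uses n/k.

```python
def overhead(k: int, n: int) -> dict:
    """Capacity figures for k total disks, n of which are redundant."""
    if not 0 < n < k:
        raise ValueError("need 0 < n < k")
    m = k - n  # number of data disks
    return {
        "m": m,
        "of_total": n / k,  # the table's "capacity overhead (n/k)"
        "of_data": n / m,   # redundant capacity per unit of data
    }

for k, n in [(3, 1), (6, 2), (12, 4)]:
    o = overhead(k, n)
    print(f"k={k} n={n} m={o['m']} n/k={o['of_total']:.2%} n/m={o['of_data']:.2%}")
```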

Page 7: Erasure codes and storage tiers on gluster

ERASURE CODES PRIMER

Page 8: Erasure codes and storage tiers on gluster

ERASURE CODE TERMS

● m data disks

● n parity disks

● k total number of disks = m+n

● Symbol – Smallest data unit. w bits.

● Typically w = 8 = a byte

● Chunk (aka fragment) – r symbols per disk

● Stripe – collection of m+n chunks across k disks

● Unit of manipulation for recovery

● Also known as a “slice”

Page 9: Erasure codes and storage tiers on gluster

ERASURE CODE TERMS

r=6, m=4, n=2, k=6, w=1

[Figure: a “stripe” of 6 fragments; each fragment holds r=6 one-bit symbols, e.g. 011010]
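To make the terms concrete, here is a minimal sketch that cuts a byte buffer into stripes of m data fragments with r symbols each, assuming w = 8 (one symbol per byte). The function name `split_into_stripes` and the zero padding of the last stripe are assumptions for illustration; the n parity fragments are not computed here.

```python
def split_into_stripes(data: bytes, m: int, r: int) -> list[list[bytes]]:
    """Return stripes; each stripe is a list of m fragments of r bytes."""
    stripe_bytes = m * r                      # data payload of one stripe
    padded_len = -(-len(data) // stripe_bytes) * stripe_bytes  # round up
    padded = data.ljust(padded_len, b"\0")    # assumption: pad with zero bytes
    stripes = []
    for off in range(0, padded_len, stripe_bytes):
        stripe = padded[off:off + stripe_bytes]
        # fragment i of the stripe would be stored on data disk i
        stripes.append([stripe[i * r:(i + 1) * r] for i in range(m)])
    return stripes
```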

Page 10: Erasure codes and storage tiers on gluster

Systematic

● m data chunks, n coding chunks

● (can stripe parity and data chunks on the same disk)

● Reads are simple, only decode on repairs

[Figure: Slice 1, Slice 2, Slice 3]
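A systematic layout can be illustrated with the simplest possible code: one XOR parity chunk (n = 1, as in RAID5). This is a stand-in chosen for brevity, not the Reed-Solomon code discussed later, and all function names are hypothetical. Note that a healthy read returns the data chunks untouched, and only a repair does any arithmetic.

```python
from functools import reduce

def xor_chunks(chunks):
    """Byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

def encode(data_chunks):
    """Systematic: data chunks are stored unchanged, followed by one parity chunk."""
    return list(data_chunks) + [xor_chunks(data_chunks)]

def read(stored, m):
    """Healthy read: return the first m chunks as-is, no decoding."""
    return stored[:m]

def repair(stored, lost):
    """Rebuild the chunk at index `lost` by XOR-ing all surviving chunks."""
    survivors = [c for i, c in enumerate(stored) if i != lost]
    return xor_chunks(survivors)
```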

Page 11: Erasure codes and storage tiers on gluster

Non-Systematic

● All k chunks in a stripe are coded

● Does not distinguish data servers from code servers

● Encode/decode on writes and reads

[Figure: Slice 1, Slice 2, Slice 3]

Page 12: Erasure codes and storage tiers on gluster

Encoding / Decoding Overhead

● Network RTT dominates the encode/decode overhead

● Packages exist to implement the math

● Intel has fast routines for inverse, dot product, encoding, decoding, etc.

● Jerasure library from academia

● Gluster's is purpose built and fast

Page 13: Erasure codes and storage tiers on gluster

GLUSTER IMPLEMENTATION

Page 14: Erasure codes and storage tiers on gluster

GLUSTERFS “Disperse Volumes”

● Done by Xavier Hernandez at Datalab corp.

● Use case: archiving medical records

● Developed over last 2 years

● Now part of gluster upstream

Page 15: Erasure codes and storage tiers on gluster

CLI

Two new options have been added to the 'create' command of the CLI:

gluster volume create <name> disperse <count> redundancy <count>

Disperse is “k” (total number of bricks)

Redundancy is “n”
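As an illustration of how k and n map onto the command line, the sketch below assembles the argument list for the command shown above. The helper name, the volume name, the host names and the brick paths are all hypothetical, and appending one brick per disperse member after the options is an assumption carried over from how other volume types are created.

```python
def disperse_create_args(name, disperse, redundancy, bricks):
    """Build argv for `gluster volume create ... disperse k redundancy n`."""
    if not 0 < redundancy < disperse:
        raise ValueError("need 0 < redundancy < disperse (n < k)")
    if len(bricks) != disperse:
        raise ValueError("this sketch expects exactly k bricks")
    return ["gluster", "volume", "create", name,
            "disperse", str(disperse),
            "redundancy", str(redundancy), *bricks]

# Hypothetical hosts and brick paths: k=6, n=2, so m=4 bricks' worth of data.
bricks = [f"server{i}:/bricks/ec" for i in range(1, 7)]
print(" ".join(disperse_create_args("ecvol", 6, 2, bricks)))
```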

Page 16: Erasure codes and storage tiers on gluster

“Disperse volumes” design choices

● The “symbols” are bytes: w = 8

● The fragment size r = 128

● Algorithm: Reed-Solomon

● Generator matrix: Vandermonde

● Non–systematic

● Encoding / decoding done on client side

● Modeled after AFR

● Concurrent writes must be processed in order
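The combination of byte symbols, a Vandermonde generator matrix and a non-systematic code can be sketched in a few lines. This is a toy model, not Gluster's implementation: it encodes one symbol per fragment instead of 128, the field polynomial 0x11d and the evaluation points 1..k are assumptions, and every name is hypothetical. Any m of the k coded symbols are enough to recover the data.

```python
# GF(2^8) log/antilog tables, using the primitive polynomial 0x11d
# (an assumption; the slides do not say which polynomial is used).
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def gf_inv(a):
    return EXP[255 - LOG[a]]  # a must be nonzero

def vandermonde_row(x, m):
    """[x^0, x^1, ..., x^(m-1)] in GF(2^8)."""
    row, p = [], 1
    for _ in range(m):
        row.append(p)
        p = gf_mul(p, x)
    return row

def encode(data, k):
    """Map m data symbols to k coded symbols (non-systematic)."""
    m = len(data)
    out = []
    for x in range(1, k + 1):  # k distinct evaluation points
        acc = 0
        for coef, d in zip(vandermonde_row(x, m), data):
            acc ^= gf_mul(coef, d)  # addition in GF(2^8) is XOR
        out.append(acc)
    return out

def decode(fragments, m):
    """Recover m data symbols from any m (index, symbol) pairs."""
    rows = [vandermonde_row(i + 1, m) + [s] for i, s in fragments[:m]]
    for col in range(m):  # Gauss-Jordan elimination over GF(2^8)
        piv = next(r for r in range(col, m) if rows[r][col])
        rows[col], rows[piv] = rows[piv], rows[col]
        inv = gf_inv(rows[col][col])
        rows[col] = [gf_mul(inv, v) for v in rows[col]]
        for r in range(m):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [a ^ gf_mul(f, b) for a, b in zip(rows[r], rows[col])]
    return [rows[i][m] for i in range(m)]
```

With m=4 and k=6, dropping any two coded symbols (n=2) still leaves enough to call `decode`. Because no row of the matrix is a unit vector, none of the stored symbols equals a data symbol by construction, which is why reads as well as writes go through the math.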

Page 17: Erasure codes and storage tiers on gluster

STORAGE TIERS

Page 18: Erasure codes and storage tiers on gluster

Storage Tiers

● Different “subvolume” tiers presented as a single volume

● HDD, SSD, tape, “persistent memory”, etc.

● Plug-in policy describes how data moves between tiers

● V1 policy: Cache

● slow and fast tiers

● CLI to add/remove cache tier from existing volume

Page 19: Erasure codes and storage tiers on gluster

Example: Erasure codes + SSD

● User sees one volume

● SSD “caches” ec data

[Figure: tiered volume with a hot “cache” tier on SSD and a cold ec tier on HDD; files are promoted to the hot tier and demoted to the cold tier]
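A cache policy of this kind could look roughly like the toy model below. The slides do not describe the actual promotion rules, so the access-count threshold, the fixed hot-tier capacity and the class name are all assumptions made for illustration.

```python
from collections import Counter

class CacheTierPolicy:
    """Toy promote/demote policy for a hot (SSD) and a cold (HDD) tier."""

    def __init__(self, promote_at=3, hot_capacity=2):
        self.promote_at = promote_at      # hypothetical access-count threshold
        self.hot_capacity = hot_capacity  # hypothetical number of files on SSD
        self.hits = Counter()
        self.hot = set()

    def access(self, name):
        """Record one access and return the tier that now holds the file."""
        self.hits[name] += 1
        if name not in self.hot and self.hits[name] >= self.promote_at:
            if len(self.hot) >= self.hot_capacity:
                coldest = min(self.hot, key=lambda f: self.hits[f])
                self.hot.remove(coldest)  # demote: back to the ec/HDD tier
            self.hot.add(name)            # promote: move onto the SSD tier
        return "hot" if name in self.hot else "cold"
```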

Page 20: Erasure codes and storage tiers on gluster

Future : Data classification (DC)

● Add rules to storage graph

● Rule determines subvolume

● File name

● Attribute (size, content)

● Etc.

[Figure: decision node “Filename = *.lock?”; Yes leads to the Secure / Encrypted subvolume, No leads to the HDD subvolume]
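A file-name rule like the one in the figure can be expressed as an ordered list of patterns. This is only a sketch of the idea: the rule format, the subvolume names and the mapping of `*.lock` to the secure subvolume are read off the figure or invented here, not taken from any Gluster interface.

```python
from fnmatch import fnmatch

# Ordered (pattern, subvolume) rules; the subvolume names are hypothetical.
RULES = [
    ("*.lock", "secure-encrypted"),
]
DEFAULT_SUBVOLUME = "hdd"

def classify(filename, rules=RULES, default=DEFAULT_SUBVOLUME):
    """Return the subvolume chosen by the first matching rule."""
    for pattern, subvolume in rules:
        if fnmatch(filename, pattern):
            return subvolume
    return default
```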

Page 21: Erasure codes and storage tiers on gluster

Future flexibility

● Many use cases

● Compliance

● Multi-tenancy

● Rack-aware placement (for performance)

● Policies described by language

● Arbitrary number of tiers, rules, subvolumes ..

● Template based

Page 22: Erasure codes and storage tiers on gluster

DEMO

Page 23: Erasure codes and storage tiers on gluster

ONE MORE THING..

Page 24: Erasure codes and storage tiers on gluster

Bitrot

● A daemon that scans gluster volumes

● Finds corrupted data

● Digest associated with each file

● Alert / recover on mismatch

● “Plug-ins” to daemon may do other things..

● Tuning parameters to be non-intrusive to performance

● Encryption

● Compression

● Etc.
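The digest check at the heart of such a daemon can be sketched as follows. SHA-256, the in-memory `known` dictionary and the function names are assumptions; the slides do not say which digest is used or where it is stored. The sketch also ignores legitimate writes, which a real scanner would have to tell apart from corruption.

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks to keep memory use bounded."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def scrub(paths, known):
    """Return paths whose current digest differs from the recorded one."""
    corrupted = []
    for path in paths:
        digest = file_digest(path)
        if path not in known:
            known[path] = digest      # first scan: record the digest
        elif known[path] != digest:
            corrupted.append(path)    # mismatch: alert / trigger recovery
    return corrupted
```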

Page 25: Erasure codes and storage tiers on gluster

Do it!

● Learn the math:

● http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-Tutorial.html

● Get the bits:

● https://forge.gluster.org/disperse

Page 26: Erasure codes and storage tiers on gluster

Thank You!

[email protected]

● RHS: www.redhat.com/storage/

● GlusterFS: www.gluster.org

@Glusterorg

@RedHatStorage

Gluster

Red Hat Storage

Slides Available on Mojo