Transcript of sosp087-hendricks

  • Slide 1/31

    Low-Overhead Byzantine Fault-Tolerant Storage

    James Hendricks, Gregory R. Ganger (Carnegie Mellon University)
    Michael K. Reiter (University of North Carolina at Chapel Hill)

  • Slide 2/31

    Motivation

    As systems grow in size and complexity, they must tolerate more faults and more types of faults.
    - Modern storage systems take an ad-hoc approach: it is not clear which faults to tolerate
    - Instead: tolerate arbitrary (Byzantine) faults
    - But: is Byzantine fault tolerance expensive? Fast reads, slow large writes

  • Slide 3/31

    [Figure: Write bandwidth (MB/s) vs. number of faults tolerated (f). Curves: crash fault-tolerant erasure-coded storage (non-Byzantine), replicated Byzantine fault-tolerant storage, and low-overhead erasure-coded Byzantine fault-tolerant storage. Annotations: f+1-of-2f+1 erasure-coded, f+1 replicas, 3f+1 replicas.]

  • Slide 4/31

    Summary of Results

    We present a low-overhead Byzantine fault-tolerant erasure-coded block storage protocol.
    - Write overhead: 2 rounds + cryptographic checksum
    - Read overhead: cryptographic checksum
    - Performance of our Byzantine-tolerant protocol nearly matches that of a protocol that tolerates only crashes (within 10% for large enough requests)

  • Slide 5/31

    Erasure codes

    An m-of-n erasure code encodes block B into n fragments, each of size |B|/m, such that any m fragments can be used to reconstruct block B.

    Example (3-of-5 erasure code):
    - d1, ..., d5 ← encode(B)
    - Read d1, d2, d3: B ← decode(d1, d2, d3)
    - Read d1, d3, d5: B ← decode(d1, d3, d5)
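    To make the m-of-n idea concrete, here is a toy 2-of-3 erasure code built from XOR parity (a minimal sketch, not the coding library used in the paper): each fragment is half the size of B, and any two fragments reconstruct B.

      def encode_2of3(block: bytes) -> list[bytes]:
          """Split B into halves d1, d2 and add parity d3 = d1 XOR d2.
          Each fragment has size |B|/2; any 2 of the 3 reconstruct B."""
          assert len(block) % 2 == 0
          half = len(block) // 2
          d1, d2 = block[:half], block[half:]
          d3 = bytes(a ^ b for a, b in zip(d1, d2))
          return [d1, d2, d3]

      def decode_2of3(frags: dict[int, bytes]) -> bytes:
          """Reconstruct B from any two fragments, indexed 1..3."""
          if 1 in frags and 2 in frags:
              return frags[1] + frags[2]
          if 1 in frags and 3 in frags:
              return frags[1] + bytes(a ^ b for a, b in zip(frags[1], frags[3]))
          return bytes(a ^ b for a, b in zip(frags[2], frags[3])) + frags[2]

      B = b"hello world!"                              # toy block (even length)
      d1, d2, d3 = encode_2of3(B)
      assert decode_2of3({1: d1, 3: d3}) == B          # any two fragments suffice
      assert decode_2of3({2: d2, 3: d3}) == B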

  • Slide 6/31

    Design of Our Protocol

  • Slide 7/31

    Parameters and interface

    Parameters:
    - f: number of faulty servers tolerated
    - m ≥ f + 1: fragments needed to decode a block
    - n = m + 2f ≥ 3f + 1: number of servers

    Interface: read and write fixed-size blocks
    - Not a filesystem: no metadata, no locking, no access control, no support for variable-sized reads/writes
    - A building block for a filesystem
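    A small helper makes the parameter relationships explicit (illustrative code only; the names are not from the paper's implementation):

      def params(f, m=None):
          """Return (f, m, n): m >= f + 1 fragments to decode, n = m + 2f servers."""
          m = f + 1 if m is None else m          # smallest allowed m
          assert m >= f + 1, "need m >= f + 1"
          n = m + 2 * f                          # hence n >= 3f + 1
          return f, m, n

      print(params(1))    # (1, 2, 4): tolerate one Byzantine fault with 4 servers
      print(params(10))   # (10, 11, 31)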

  • Slide 8/31

    Write protocol: prepare & commit

    - Prepare: encode B (e.g., with a 2-of-4 erasure code) and send one fragment (d1, d2, d3, d4) to each server; each server responds with a cryptographic token
    - Commit: forward the tokens to the servers; servers return status

  • Slide 9/31

  • Slide 10/31

    Read protocol: common case

    - Client requests timestamps and optimistically reads fragments
    - Servers return fragments (e.g., d1, d2) and the timestamp
    - Client decodes B

  • Slide 11/31

    Issue 1 of 3: Wasteful encoding

    - Erasure code m + 2f fragments, hash m + 2f fragments, send one to each server
    - But only wait for m + f responses
    - This is wasteful!

    (m: fragments needed to decode; f: number of faults tolerated)

  • Slide 12/31

    Solution 1: Partial encoding

    Instead:
    - Erasure code m + f fragments
    - Hash m + f fragments
    - Hear from m + f servers

    Pro: compute f fewer fragments
    Con: client may need to send the entire block on failure (should happen rarely)
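    A quick back-of-the-envelope check of the savings (illustrative numbers only):

      f, m = 10, 11                 # m = f + 1, as in the evaluation
      full = m + 2 * f              # fragments encoded/hashed without the optimization: 31
      partial = m + f               # fragments encoded/hashed with partial encoding: 21
      print(full - partial)         # 10 fewer fragments to encode and hash per write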

  • Slide 13/31

    Issue 2: Block must be unique

    Fragments must comprise a unique block; if not, different readers read different blocks.

    Challenge: servers don't see the entire block
    - Servers can't verify the hash of the block
    - Servers can't verify the encoding of the block given hashes of the fragments

  • Slide 14/31

    Solution 2: Homomorphic fingerprinting

    A fragment is consistent with the checksum if its hash and homomorphic fingerprint [PODC07] match.

    Key property: the block decoded from consistent fragments is unique.

    (Diagram: fragment d4 is checked against hash4 and fp4 within the cross-checksum of hashes hash1..hash4 and fingerprints fp1..fp4.)
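    The sketch below illustrates the property that makes this work, using a Rabin-style polynomial-evaluation fingerprint over a prime field (an assumption for illustration; the PODC07 construction differs in its exact field and parameters). Because the fingerprint is linear in the data, it commutes with the linear combinations computed by a linear erasure code, which is what lets per-fragment consistency checks imply a unique decoded block.

      # Illustrative parameters, not the paper's: a Mersenne prime field and a
      # fixed evaluation point (in practice the point would be chosen randomly).
      P = (1 << 61) - 1
      R = 123456789123456789 % P

      def fingerprint(words):
          """Evaluate the fragment, viewed as a polynomial with coefficients
          `words`, at the point R modulo P (Horner's rule)."""
          acc = 0
          for w in reversed(words):
              acc = (acc * R + w) % P
          return acc

      # Homomorphism: the fingerprint of a linear combination of fragments
      # equals the same linear combination of their fingerprints (mod P).
      x = [3, 1, 4, 1, 5]
      y = [2, 7, 1, 8, 2]
      a, b = 6, 9
      combo = [(a * xi + b * yi) % P for xi, yi in zip(x, y)]
      assert fingerprint(combo) == (a * fingerprint(x) + b * fingerprint(y)) % P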

  • Slide 15/31

    Issue 3: Write ordering

    Reads must return the most recently written block
    - Required for linearizability (atomicity)
    - A faulty server may propose an uncommitted write; this must be prevented
    - Prior approaches: 4f+1 servers, signatures, or 3+ round writes
    - Our approach: 3f+1 servers, MACs, 2-round writes

  • Slide 16/31

    Solution 3: Hashes of nonces

    Write:
    - Prepare: each server stores hash(nonce) and returns its nonce to the client
    - Client collects the nonces
    - Commit: servers store the nonces

    Read:
    - Client requests timestamps; servers find and return the timestamp and nonces
    - Client reads at that timestamp; servers return nonce_hash with each fragment
    - Client compares hash(nonce) with nonce_hash
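    A minimal sketch of the nonce bookkeeping described above (assumptions: server-generated random nonces and SHA-256; fragments, per-block timestamps, and the MACs from the next slide are omitted):

      import hashlib, os

      def h(data: bytes) -> bytes:
          return hashlib.sha256(data).digest()

      class Server:
          def prepare(self, ts):
              nonce = os.urandom(16)          # server picks a fresh nonce
              self.nonce_hash = h(nonce)      # store hash(nonce)
              return nonce                    # return the nonce to the client

          def commit(self, ts, nonces):
              self.timestamp, self.nonces = ts, nonces   # store the nonce set

          def latest(self):
              return self.timestamp, self.nonces         # timestamp query on read

          def read_fragment(self):
              return self.nonce_hash          # returned alongside the fragment

      # Write: prepare at each server, collect the nonces, then commit them.
      servers = [Server() for _ in range(4)]
      nonces = [s.prepare(ts=1) for s in servers]
      for s in servers:
          s.commit(1, nonces)

      # Read: the committed nonce set must contain a nonce whose hash matches
      # each fragment server's stored nonce_hash.
      ts, nonce_set = servers[0].latest()
      hashes = {h(n) for n in nonce_set}
      assert all(s.read_fragment() in hashes for s in servers)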

  • Slide 17/31

    Bringing it all together: Write

    Client:
    - Erasure code m+f fragments
    - Hash & fingerprint the fragments
    - Send d_i to each of the first m+f servers

    Servers (prepare):
    - Verify hash, fingerprint
    - Choose nonce, generate MAC
    - Store fragment

    Client:
    - Forward MACs to servers

    Servers (commit):
    - Verify MAC
    - Free older fragments

    Client: write completed

    (Overhead: steps not in the crash-only protocol.)
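    A client-side sketch of the two-round write (hedged: erasure_encode, fingerprint, and the prepare/commit RPCs are illustrative stand-ins, not the paper's actual API, and error/timeout handling is omitted):

      import hashlib

      def write_block(block, servers, m, f, erasure_encode, fingerprint):
          """Two-round write: prepare with fragments and checksums, then commit."""
          # Round 1 (prepare): encode only m+f fragments (partial encoding),
          # checksum them with hash + homomorphic fingerprint, and send
          # fragment i plus the full cross-checksum to server i.
          frags = erasure_encode(block, m, m + f)
          checksum = [(hashlib.sha256(d).digest(), fingerprint(d)) for d in frags]
          tokens = [servers[i].prepare(frags[i], checksum) for i in range(m + f)]

          # Round 2 (commit): forward the collected tokens (nonces/MACs); each
          # server verifies them and frees fragments of older writes.
          for i in range(m + f):
              servers[i].commit(tokens)
          # Write completed.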

  • Slide 18/31

    Bringing it all together: Read

    Client:
    - Request fragments from the first m servers
    - Request the latest nonce, timestamp, and checksum

    Servers:
    - Return fragment (if requested)
    - Return latest nonce, timestamp, checksum

    Client:
    - Verify the provided checksum matches each fragment's hash & fingerprint
    - Verify timestamps match
    - Verify nonces
    - Read complete

    (Overhead: steps not in the crash-only protocol.)
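    A matching client-side sketch of the common-case read (again illustrative: decode, fingerprint, the RPCs, and the metadata quorum size are assumptions, and nonce verification is elided):

      import hashlib

      def read_block(servers, m, f, decode, fingerprint):
          """Common-case read: optimistic fragment fetch plus metadata checks."""
          frag_replies = [servers[i].read_fragment() for i in range(m)]
          meta_replies = [servers[i].read_latest() for i in range(m + f)]

          # All metadata replies must agree on the latest timestamp.
          ts = meta_replies[0].timestamp
          assert all(r.timestamp == ts for r in meta_replies), "timestamps disagree"

          # Each fragment must match the (hash, fingerprint) cross-checksum.
          checksum = meta_replies[0].checksum
          frags = []
          for i, reply in enumerate(frag_replies):
              expected_hash, expected_fp = checksum[i]
              assert hashlib.sha256(reply.fragment).digest() == expected_hash
              assert fingerprint(reply.fragment) == expected_fp
              frags.append(reply.fragment)

          # (Nonce verification from the previous slides is omitted here.)
          return decode(frags, m)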

  • Slide 19/31

    Evaluation

  • Slide 20/31

    Experimental setup

    - m = f + 1 (m: fragments needed to decode a block; f: number of faults tolerated)
    - Single client, NVRAM at servers
    - Write or read 64 kB blocks (fragment size decreases as f increases)
    - 3 GHz Pentium D, Intel PRO/1000

  • Slide 21/31

    Prototype implementation

    Four protocols implemented:
    - Our protocol (Byzantine-tolerant, erasure-coded)
    - Crash-only erasure-coded
    - Crash-only replication-based
    - PASIS [Goodson04] emulation (Byzantine-tolerant, erasure-coded)
      - Read validation: decode, encode, hash 4f+1 fragments
      - 4f+1 servers, versioning, garbage collection

    All use the same hashing and erasure-coding libraries.

  • Slide 22/31

    [Figure: Write throughput; bandwidth (MB/s) vs. number of faults tolerated (f). Curves: crash-only replication-based (f+1 = m replicas), PASIS-emulation, our protocol, crash-only erasure-coded.]

  • Slide 23/31

    [Figure: Write throughput; bandwidth (MB/s) vs. number of faults tolerated (f) for our protocol and crash-only erasure-coded, comparing 64 kB/(f+1) fragments with fixed 16 kB fragments.]

  • Slide 24/31

    [Figure: Write response time; latency (ms) vs. number of faults tolerated (f). Curves: A. crash-only erasure-coded, B. our protocol, C. PASIS-emulation, D. crash-only replication-based. Inset: overhead breakdown at f=10 (RPC, hash, encode, fingerprint) for A-D.]

  • Slide 25/31

    [Figure: Read throughput; bandwidth (MB/s) vs. number of faults tolerated (f). Curves: crash-only replication-based, PASIS-emulation (compute bound), our protocol, crash-only erasure-coded.]

  • Slide 26/31

    [Figure: Read response time; latency (ms) vs. number of faults tolerated (f). Curves: A. erasure-coded, B. our protocol, C. PASIS-emulation, D. replicated. Inset: overhead breakdown at f=10 (RPC, hash, encode, fingerprint) for A-D.]

  • Slide 27/31

    Conclusions

    Byzantine fault-tolerant storage can rival crash-only storage performance.

    We present a low-overhead Byzantine fault-tolerant erasure-coded block storage protocol and prototype:
    - Write overhead: 2 rounds, hash and fingerprint
    - Read overhead: hash and fingerprint
    - Close to the performance of crash-only systems for reads and large writes

  • Slide 28/31

    Backup slides

  • Slide 29/31

    Why not multicast?

    - May be unstable or unavailable (UDP)
    - More data means more work at the server (network, disk, hash)
    - Doesn't scale with the number of clients!

  • Slide 30/31

    Cryptographic hash overhead

    Byzantine storage requires cryptographic hashing. Does this matter?
    - Systems must tolerate non-crash faults (e.g., a misdirected write)
    - Many modern systems already checksum data (e.g., the Google File System; ZFS supports the SHA-256 cryptographic hash function)
    - Systems may also hash data for authentication

    Conclusion: BFT may not introduce new hashing

  • Slide 31/31

    Is 3f+1 servers expensive?

    Consider a typical storage cluster:
    - Usually more primary drives than parity drives
    - Usually several hot spares

    Conclusion: such a cluster may already use 3f+1 servers

    (Diagram: primary drives, parity drives, hot spares; f = number of faults tolerated.)