Transcript of sosp087-hendricks

  • Slide 1/31

    Low-Overhead Byzantine Fault-Tolerant Storage

    James Hendricks, Gregory R. Ganger (Carnegie Mellon University)
    Michael K. Reiter (University of North Carolina at Chapel Hill)

  • Slide 2/31

    Motivation

    As systems grow in size and complexity, they must tolerate more faults and more types of faults.
    - Modern storage systems take an ad-hoc approach: it is not clear which faults to tolerate
    - Instead: tolerate arbitrary (Byzantine) faults
    - But: is Byzantine fault tolerance expensive? Fast reads, slow large writes

  • Slide 3/31

    [Figure: Write bandwidth (MB/s) vs. number of faults tolerated (f). Curves: crash fault-tolerant erasure-coded storage (non-Byzantine), replicated Byzantine fault-tolerant storage, and low-overhead erasure-coded Byzantine fault-tolerant storage. Annotations: f+1-of-2f+1 erasure-coded, f+1 replicas, 3f+1 replicas.]

  • Slide 4/31

    Summary of Results

    We present a low-overhead Byzantine fault-tolerant erasure-coded block storage protocol.
    - Write overhead: 2 rounds + cryptographic checksum
    - Read overhead: cryptographic checksum
    - Performance of our Byzantine-tolerant protocol nearly matches that of a protocol that tolerates only crashes (within 10% for large enough requests)

  • Slide 5/31

    Erasure codes

    An m-of-n erasure code encodes block B into n fragments, each of size |B|/m, such that any m fragments can be used to reconstruct block B.

    Example (3-of-5 erasure code):
    - d1, ..., d5 ← encode(B)
    - Read d1, d2, d3: B ← decode(d1, d2, d3)
    - Read d1, d3, d5: B ← decode(d1, d3, d5)
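    To make the m-of-n idea concrete, here is a toy 2-of-3 erasure code built from XOR parity (a minimal sketch, not the coding library used in the paper): each fragment is half the size of B, and any two fragments reconstruct B.

      def encode_2of3(block: bytes) -> list[bytes]:
          """Split B into halves d1, d2 and add parity d3 = d1 XOR d2.
          Each fragment has size |B|/2; any 2 of the 3 reconstruct B."""
          assert len(block) % 2 == 0
          half = len(block) // 2
          d1, d2 = block[:half], block[half:]
          d3 = bytes(a ^ b for a, b in zip(d1, d2))
          return [d1, d2, d3]

      def decode_2of3(frags: dict[int, bytes]) -> bytes:
          """Reconstruct B from any two fragments, indexed 1..3."""
          if 1 in frags and 2 in frags:
              return frags[1] + frags[2]
          if 1 in frags and 3 in frags:
              return frags[1] + bytes(a ^ b for a, b in zip(frags[1], frags[3]))
          return bytes(a ^ b for a, b in zip(frags[2], frags[3])) + frags[2]

      B = b"hello world!"                              # toy block (even length)
      d1, d2, d3 = encode_2of3(B)
      assert decode_2of3({1: d1, 3: d3}) == B          # any two fragments suffice
      assert decode_2of3({2: d2, 3: d3}) == B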

  • Slide 6/31

    Design of Our Protocol

  • Slide 7/31

    Parameters and interface

    Parameters:
    - f: number of faulty servers tolerated
    - m ≥ f + 1: fragments needed to decode a block
    - n = m + 2f ≥ 3f + 1: number of servers

    Interface: read and write fixed-size blocks
    - Not a filesystem: no metadata, no locking, no access control, no support for variable-sized reads/writes
    - A building block for a filesystem
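    A small helper makes the parameter relationships explicit (illustrative code only; the names are not from the paper's implementation):

      def params(f, m=None):
          """Return (f, m, n): m >= f + 1 fragments to decode, n = m + 2f servers."""
          m = f + 1 if m is None else m          # smallest allowed m
          assert m >= f + 1, "need m >= f + 1"
          n = m + 2 * f                          # hence n >= 3f + 1
          return f, m, n

      print(params(1))    # (1, 2, 4): tolerate one Byzantine fault with 4 servers
      print(params(10))   # (10, 11, 31)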

  • Slide 8/31

    Write protocol: prepare & commit

    - Prepare: encode B (e.g., with a 2-of-4 erasure code) and send one fragment (d1, d2, d3, d4) to each server; each server responds with a cryptographic token
    - Commit: forward the tokens to the servers; servers return status

  • Slide 9/31

  • Slide 10/31

    Read protocol: common case

    - Client requests timestamps and optimistically reads fragments
    - Servers return fragments (e.g., d1, d2) and the timestamp
    - Client decodes B

  • Slide 11/31

    Issue 1 of 3: Wasteful encoding

    - Erasure code m + 2f fragments, hash m + 2f fragments, send one to each server
    - But only wait for m + f responses
    - This is wasteful!

    (m: fragments needed to decode; f: number of faults tolerated)

  • Slide 12/31

    Solution 1: Partial encoding

    Instead:
    - Erasure code m + f fragments
    - Hash m + f fragments
    - Hear from m + f servers

    Pro: compute f fewer fragments
    Con: client may need to send the entire block on failure (should happen rarely)
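    A quick back-of-the-envelope check of the savings (illustrative numbers only):

      f, m = 10, 11                 # m = f + 1, as in the evaluation
      full = m + 2 * f              # fragments encoded/hashed without the optimization: 31
      partial = m + f               # fragments encoded/hashed with partial encoding: 21
      print(full - partial)         # 10 fewer fragments to encode and hash per write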

  • Slide 13/31

    Issue 2: Block must be unique

    Fragments must comprise a unique block; if not, different readers read different blocks.

    Challenge: servers don't see the entire block
    - Servers can't verify the hash of the block
    - Servers can't verify the encoding of the block given hashes of the fragments

  • Slide 14/31

    Solution 2: Homomorphic fingerprinting

    A fragment is consistent with the checksum if its hash and homomorphic fingerprint [PODC07] match.

    Key property: the block decoded from consistent fragments is unique.

    (Diagram: fragment d4 is checked against hash4 and fp4 within the cross-checksum of hashes hash1..hash4 and fingerprints fp1..fp4.)
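    The sketch below illustrates the property that makes this work, using a Rabin-style polynomial-evaluation fingerprint over a prime field (an assumption for illustration; the PODC07 construction differs in its exact field and parameters). Because the fingerprint is linear in the data, it commutes with the linear combinations computed by a linear erasure code, which is what lets per-fragment consistency checks imply a unique decoded block.

      # Illustrative parameters, not the paper's: a Mersenne prime field and a
      # fixed evaluation point (in practice the point would be chosen randomly).
      P = (1 << 61) - 1
      R = 123456789123456789 % P

      def fingerprint(words):
          """Evaluate the fragment, viewed as a polynomial with coefficients
          `words`, at the point R modulo P (Horner's rule)."""
          acc = 0
          for w in reversed(words):
              acc = (acc * R + w) % P
          return acc

      # Homomorphism: the fingerprint of a linear combination of fragments
      # equals the same linear combination of their fingerprints (mod P).
      x = [3, 1, 4, 1, 5]
      y = [2, 7, 1, 8, 2]
      a, b = 6, 9
      combo = [(a * xi + b * yi) % P for xi, yi in zip(x, y)]
      assert fingerprint(combo) == (a * fingerprint(x) + b * fingerprint(y)) % P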

  • Slide 15/31

    Issue 3: Write ordering

    Reads must return the most recently written block
    - Required for linearizability (atomicity)
    - A faulty server may propose an uncommitted write; this must be prevented
    - Prior approaches: 4f+1 servers, signatures, or 3+ round writes
    - Our approach: 3f+1 servers, MACs, 2-round writes

  • Slide 16/31

    Solution 3: Hashes of nonces

    Write:
    - Prepare: each server stores hash(nonce) and returns its nonce to the client
    - Client collects the nonces
    - Commit: servers store the nonces

    Read:
    - Client requests timestamps; servers find and return the timestamp and nonces
    - Client reads at that timestamp; servers return nonce_hash with each fragment
    - Client compares hash(nonce) with nonce_hash
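    A minimal sketch of the nonce bookkeeping described above (assumptions: server-generated random nonces and SHA-256; fragments, per-block timestamps, and the MACs from the next slide are omitted):

      import hashlib, os

      def h(data: bytes) -> bytes:
          return hashlib.sha256(data).digest()

      class Server:
          def prepare(self, ts):
              nonce = os.urandom(16)          # server picks a fresh nonce
              self.nonce_hash = h(nonce)      # store hash(nonce)
              return nonce                    # return the nonce to the client

          def commit(self, ts, nonces):
              self.timestamp, self.nonces = ts, nonces   # store the nonce set

          def latest(self):
              return self.timestamp, self.nonces         # timestamp query on read

          def read_fragment(self):
              return self.nonce_hash          # returned alongside the fragment

      # Write: prepare at each server, collect the nonces, then commit them.
      servers = [Server() for _ in range(4)]
      nonces = [s.prepare(ts=1) for s in servers]
      for s in servers:
          s.commit(1, nonces)

      # Read: the committed nonce set must contain a nonce whose hash matches
      # each fragment server's stored nonce_hash.
      ts, nonce_set = servers[0].latest()
      hashes = {h(n) for n in nonce_set}
      assert all(s.read_fragment() in hashes for s in servers)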

  • Slide 17/31

    Bringing it all together: Write

    Client:
    - Erasure code m+f fragments
    - Hash & fingerprint the fragments
    - Send d_i to each of the first m+f servers

    Servers (prepare):
    - Verify hash, fingerprint
    - Choose nonce, generate MAC
    - Store fragment

    Client:
    - Forward MACs to servers

    Servers (commit):
    - Verify MAC
    - Free older fragments

    Client: write completed

    (Overhead: steps not in the crash-only protocol.)
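    A client-side sketch of the two-round write (hedged: erasure_encode, fingerprint, and the prepare/commit RPCs are illustrative stand-ins, not the paper's actual API, and error/timeout handling is omitted):

      import hashlib

      def write_block(block, servers, m, f, erasure_encode, fingerprint):
          """Two-round write: prepare with fragments and checksums, then commit."""
          # Round 1 (prepare): encode only m+f fragments (partial encoding),
          # checksum them with hash + homomorphic fingerprint, and send
          # fragment i plus the full cross-checksum to server i.
          frags = erasure_encode(block, m, m + f)
          checksum = [(hashlib.sha256(d).digest(), fingerprint(d)) for d in frags]
          tokens = [servers[i].prepare(frags[i], checksum) for i in range(m + f)]

          # Round 2 (commit): forward the collected tokens (nonces/MACs); each
          # server verifies them and frees fragments of older writes.
          for i in range(m + f):
              servers[i].commit(tokens)
          # Write completed.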

  • Slide 18/31

    Bringing it all together: Read

    Client:
    - Request fragments from the first m servers
    - Request the latest nonce, timestamp, and checksum

    Servers:
    - Return fragment (if requested)
    - Return latest nonce, timestamp, checksum

    Client:
    - Verify the provided checksum matches each fragment's hash & fingerprint
    - Verify timestamps match
    - Verify nonces
    - Read complete

    (Overhead: steps not in the crash-only protocol.)
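    A matching client-side sketch of the common-case read (again illustrative: decode, fingerprint, the RPCs, and the metadata quorum size are assumptions, and nonce verification is elided):

      import hashlib

      def read_block(servers, m, f, decode, fingerprint):
          """Common-case read: optimistic fragment fetch plus metadata checks."""
          frag_replies = [servers[i].read_fragment() for i in range(m)]
          meta_replies = [servers[i].read_latest() for i in range(m + f)]

          # All metadata replies must agree on the latest timestamp.
          ts = meta_replies[0].timestamp
          assert all(r.timestamp == ts for r in meta_replies), "timestamps disagree"

          # Each fragment must match the (hash, fingerprint) cross-checksum.
          checksum = meta_replies[0].checksum
          frags = []
          for i, reply in enumerate(frag_replies):
              expected_hash, expected_fp = checksum[i]
              assert hashlib.sha256(reply.fragment).digest() == expected_hash
              assert fingerprint(reply.fragment) == expected_fp
              frags.append(reply.fragment)

          # (Nonce verification from the previous slides is omitted here.)
          return decode(frags, m)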

  • Slide 19/31

    Evaluation

  • Slide 20/31

    Experimental setup

    - m = f + 1 (m: fragments needed to decode a block; f: number of faults tolerated)
    - Single client, NVRAM at servers
    - Write or read 64 kB blocks (fragment size decreases as f increases)
    - 3 GHz Pentium D, Intel PRO/1000

  • Slide 21/31

    Prototype implementation

    Four protocols implemented:
    - Our protocol (Byzantine-tolerant, erasure-coded)
    - Crash-only erasure-coded
    - Crash-only replication-based
    - PASIS [Goodson04] emulation (Byzantine-tolerant, erasure-coded)
      - Read validation: decode, encode, hash 4f+1 fragments
      - 4f+1 servers, versioning, garbage collection

    All use the same hashing and erasure-coding libraries.

  • Slide 22/31

    [Figure: Write throughput; bandwidth (MB/s) vs. number of faults tolerated (f). Curves: crash-only replication-based (f+1 = m replicas), PASIS-emulation, our protocol, crash-only erasure-coded.]

  • Slide 23/31

    [Figure: Write throughput; bandwidth (MB/s) vs. number of faults tolerated (f) for our protocol and crash-only erasure-coded, comparing 64 kB/(f+1) fragments with fixed 16 kB fragments.]

  • Slide 24/31

    [Figure: Write response time; latency (ms) vs. number of faults tolerated (f). Curves: A. crash-only erasure-coded, B. our protocol, C. PASIS-emulation, D. crash-only replication-based. Inset: overhead breakdown at f=10 (RPC, hash, encode, fingerprint) for A-D.]

  • Slide 25/31

    [Figure: Read throughput; bandwidth (MB/s) vs. number of faults tolerated (f). Curves: crash-only replication-based, PASIS-emulation (compute bound), our protocol, crash-only erasure-coded.]

  • Slide 26/31

    [Figure: Read response time; latency (ms) vs. number of faults tolerated (f). Curves: A. erasure-coded, B. our protocol, C. PASIS-emulation, D. replicated. Inset: overhead breakdown at f=10 (RPC, hash, encode, fingerprint) for A-D.]

  • Slide 27/31

    Conclusions

    Byzantine fault-tolerant storage can rival crash-only storage performance.

    We present a low-overhead Byzantine fault-tolerant erasure-coded block storage protocol and prototype:
    - Write overhead: 2 rounds, hash and fingerprint
    - Read overhead: hash and fingerprint
    - Close to the performance of crash-only systems for reads and large writes

  • Slide 28/31

    Backup slides

  • Slide 29/31

    Why not multicast?

    - May be unstable or unavailable (UDP)
    - More data means more work at the server (network, disk, hash)
    - Doesn't scale with the number of clients!

  • Slide 30/31

    Cryptographic hash overhead

    Byzantine storage requires cryptographic hashing. Does this matter?
    - Systems must tolerate non-crash faults (e.g., a misdirected write)
    - Many modern systems already checksum data (e.g., the Google File System; ZFS supports the SHA-256 cryptographic hash function)
    - Systems may also hash data for authentication

    Conclusion: BFT may not introduce new hashing

  • Slide 31/31

    Is 3f+1 servers expensive?

    Consider a typical storage cluster:
    - Usually more primary drives than parity drives
    - Usually several hot spares

    Conclusion: such a cluster may already use 3f+1 servers

    (Diagram: primary drives, parity drives, hot spares; f = number of faults tolerated.)