sosp087-hendricks
Transcript of sosp087-hendricks
-
8/9/2019 sosp087-hendricks
1/31
Low-Overhead Byzantine Fault-Tolerant Storage
James Hendricks, Gregory R. Ganger (Carnegie Mellon University)
Michael K. Reiter (University of North Carolina at Chapel Hill)
-
Motivation
As systems grow in size and complexity:
- Must tolerate more faults, more types of faults
- Modern storage systems take ad-hoc approach
- Not clear which faults to tolerate
Instead: tolerate arbitrary (Byzantine) faults
But, Byzantine fault-tolerance = expensive?
- Fast reads, slow large writes
-
[Figure: Write bandwidth. x-axis: Number of faults tolerated (f), 1 to 10; y-axis: Bandwidth (MB/s), 0 to 80. Labels: Crash fault-tolerant erasure-coded storage (non-Byzantine); Replicated Byzantine fault-tolerant storage; Low-overhead erasure-coded Byzantine fault-tolerant storage; (f+1)-of-(2f+1) erasure-coded; f+1 replicas; 3f+1 replicas.]
-
Summary of Results
We present a low-overhead Byzantine fault-tolerant erasure-coded block storage protocol
- Write overhead: 2-round + crypto. checksum
- Read overhead: cryptographic checksum
Performance of our Byzantine-tolerant protocol nearly matches that of a protocol that tolerates only crashes
- Within 10% for large enough requests
-
Erasure codes
An m-of-n erasure code encodes block B into n fragments, each of size |B|/m, such that any m fragments can be used to reconstruct block B
[Figure: 3-of-5 erasure code; block B is encoded into fragments d1, d2, d3, d4, d5]
d1, …, d5 ← encode(B)
Read d1, d2, d3: B ← decode(d1, d2, d3)
Read d1, d3, d5: B ← decode(d1, d3, d5)
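The m-of-n property can be tried out with a toy code. The sketch below is only an illustration, not the erasure code used in the talk: it assumes a polynomial (Reed-Solomon style) code over the prime field GF(257), and the names `encode`, `decode`, and `_invert` are made up for this example. Each group of m bytes is read as the coefficients of a polynomial, fragment i holds that polynomial's value at x = i + 1, and any m fragments recover the coefficients by solving a Vandermonde system.

```python
P = 257  # prime just above 255, so every byte value is a field element

def encode(block: bytes, m: int, n: int):
    """Split block into n fragments of len(block)/m symbols each."""
    assert len(block) % m == 0, "toy code: block length must be a multiple of m"
    frags = [[] for _ in range(n)]
    for off in range(0, len(block), m):
        coeffs = block[off:off + m]          # m bytes = polynomial coefficients
        for i in range(n):
            x, y = i + 1, 0
            for c in reversed(coeffs):       # Horner evaluation at x
                y = (y * x + c) % P
            frags[i].append(y)
    return frags

def _invert(matrix):
    """Invert a square matrix over GF(P) by Gauss-Jordan elimination."""
    size = len(matrix)
    aug = [row[:] + [int(i == j) for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] % P)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, P)      # modular inverse (Python 3.8+)
        aug[col] = [v * inv % P for v in aug[col]]
        for r in range(size):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(a - factor * b) % P
                          for a, b in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]

def decode(frags: dict, m: int) -> bytes:
    """Rebuild the block from any m fragments, given as {index: fragment}."""
    idx = sorted(frags)[:m]
    vandermonde = [[pow(i + 1, k, P) for k in range(m)] for i in idx]
    inverse = _invert(vandermonde)
    out = bytearray()
    for pos in range(len(frags[idx[0]])):
        ys = [frags[i][pos] for i in idx]
        for row in inverse:
            out.append(sum(a * y for a, y in zip(row, ys)) % P)
    return bytes(out)

block = b"hello world!"
d = encode(block, m=3, n=5)                  # 3-of-5, as on the slide
assert decode({0: d[0], 1: d[1], 2: d[2]}, 3) == block
assert decode({0: d[0], 2: d[2], 4: d[4]}, 3) == block
```

Fragment symbols can take the value 256, so they do not fit in one byte each; practical codes work over GF(2^k) to avoid that.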
-
Design of Our Protocol
-
Parameters and interface
Parameters:
- f: Number of faulty servers tolerated
- m ≥ f + 1: Fragments needed to decode block
- n = m + 2f ≥ 3f + 1: Number of servers
Interface: Read and write fixed-size blocks
- Not a filesystem. No metadata. No locking. No access control. No support for variable-sized reads/writes.
- A building block for a filesystem
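The relationship between the parameters can be checked with a few lines of code. This is a minimal sketch; the `Params` class is a hypothetical helper for illustration and is not part of the protocol's interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Params:
    f: int  # number of faulty servers tolerated
    m: int  # fragments needed to decode a block

    def __post_init__(self):
        # The slide requires m >= f + 1.
        if self.f < 0 or self.m < self.f + 1:
            raise ValueError("need f >= 0 and m >= f + 1")

    @property
    def n(self) -> int:
        """Number of servers: n = m + 2f, which is at least 3f + 1."""
        return self.m + 2 * self.f

p = Params(f=1, m=2)
print(p.n)  # 4
```

With f = 1 and m = 2 this gives n = 4, the smallest configuration, which uses a 2-of-4 erasure code.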
-
Write protocol: prepare & commit
[Figure: client encodes block B with a 2-of-4 erasure code into fragments d1, d2, d3, d4]
Prepare:
- Send fragment to each server
- Respond with cryptographic token
Commit:
- Forward tokens
- Return status
-
Read protocol: common case
- Request timestamps and optimistically read fragments
- Return fragments and timestamp
[Figure: client reads fragments d1 and d2 and decodes block B]
-
Issue 1 of 3: Wasteful encoding
- Erasure code m + 2f
- Hash m + 2f
- Send to each server
- But only wait for m + f responses
This is wasteful!
[Figure: block B encoded into fragments d1, d2, d3, d4]
m: Fragments needed to decode
f: Number of faults tolerated
-
Solution 1: Partial encoding
Instead:
- Erasure code m + f
- Hash m + f
- Hear from m + f
Pro: Compute f fewer frags
Con: Client may need to send entire block on failure
- Should happen rarely
[Figure: block B encoded into fragments d1, d2, d3]
-
Issue 2: Block must be unique
Fragments must comprise a unique block
- If not, different readers read different blocks
Challenge: servers don't see entire block
- Servers can't verify hash of block
- Servers can't verify encoding of block given hashes of fragments
-
Soln 2: Homomorphic fingerprinting
Fragment is consistent with checksum if hash and homomorphic fingerprint [PODC07] match
Key property: Block decoded from consistent fragments is unique
[Figure: checksum made up of hash1, hash2, hash3, hash4 and fp1, fp2; fragment d4 shown with hash4 and fp4]
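The property that makes a fingerprint "homomorphic" can be shown with a toy example. The sketch below is an assumption-laden illustration, not the construction from [PODC07]: it fingerprints a fragment by evaluating it as a polynomial at a point r in a prime field, and the names `fingerprint` and `combine` are hypothetical. How r is chosen is left out entirely, and the sketch is not secure. Because the fingerprint is linear, the fingerprint of a coded fragment equals the same linear combination of the data fragments' fingerprints, which lets a fragment be checked against fingerprints of the data fragments alone.

```python
P = 2**61 - 1  # a Mersenne prime; toy field chosen for this sketch

def fingerprint(r: int, frag) -> int:
    """Evaluate the fragment, read as polynomial coefficients, at point r."""
    acc = 0
    for sym in reversed(frag):
        acc = (acc * r + sym) % P
    return acc

def combine(coeffs, vectors):
    """Symbol-wise linear combination, as a linear erasure code would compute."""
    return [sum(c * v[j] for c, v in zip(coeffs, vectors)) % P
            for j in range(len(vectors[0]))]

r = 123456789                      # evaluation point (assumed, not derived)
d1 = [10, 20, 30, 40]              # data fragment 1
d2 = [5, 6, 7, 8]                  # data fragment 2
d3 = combine([3, 5], [d1, d2])     # coded fragment: 3*d1 + 5*d2

# Homomorphism: fp(3*d1 + 5*d2) == 3*fp(d1) + 5*fp(d2)
expected = (3 * fingerprint(r, d1) + 5 * fingerprint(r, d2)) % P
assert fingerprint(r, d3) == expected
```

A fragment that was not produced by the code from d1 and d2 would fail this check for almost every choice of r.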
-
Issue 3: Write ordering
Reads must return most recently written block
- Required for linearizability (atomic)
- Faulty server may propose uncommitted write
- Must be prevented. Prior approaches: 4f+1 servers, signatures, or 3+ round writes
Our approach:
- 3f+1 servers, MACs, 2 round writes
-
Solution 3: Hashes of nonces
[Diagram: messages between Client and Servers; nonces flow from servers to client during a write]
Write:
- Prepare: Store hash(nonce), return nonce
- Collect nonces
- Commit: store nonces
Read:
- Find timestamps: Return timestamp, nonces
- Read at timestamp: Return nonce_hash with fragment
- Compare hash(nonce) with nonce_hash
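The nonce exchange can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the protocol itself: it leaves out MACs, quorums, and failure handling, uses SHA-256 as the hash, and the `Server` class and its method names are hypothetical. Each server commits to a fresh nonce at prepare time by storing its hash. A reader can then check that the nonces reported for a timestamp hash to what the fragment holder stored, so a faulty server cannot vouch for a write that never reached commit.

```python
import hashlib
import secrets

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class Server:
    def __init__(self):
        self.prepared = {}   # timestamp -> (fragment, nonce_hash)
        self.committed = {}  # timestamp -> nonces collected from all servers

    def prepare(self, ts, fragment):
        nonce = secrets.token_bytes(16)           # fresh random nonce
        self.prepared[ts] = (fragment, h(nonce))  # store only its hash
        return nonce                              # nonce goes to the client

    def commit(self, ts, nonces):
        self.committed[ts] = list(nonces)

    def find_timestamp(self):
        ts = max(self.committed)
        return ts, self.committed[ts]

    def read_at(self, ts):
        return self.prepared[ts]                  # (fragment, nonce_hash)

def write(servers, ts, fragments):
    # Prepare round: collect one nonce per server.
    nonces = [s.prepare(ts, frag) for s, frag in zip(servers, fragments)]
    # Commit round: hand the collected nonces to every server.
    for s in servers:
        s.commit(ts, nonces)

def nonce_matches(servers, reporter, holder):
    """Check the reporter's nonce for `holder` against holder's stored hash."""
    ts, nonces = servers[reporter].find_timestamp()
    _fragment, nonce_hash = servers[holder].read_at(ts)
    return h(nonces[holder]) == nonce_hash

servers = [Server() for _ in range(4)]
write(servers, 1, [b"d1", b"d2", b"d3", b"d4"])
assert nonce_matches(servers, reporter=0, holder=3)
```

In the sketch, a server that invents nonces for a timestamp that was only prepared fails the check, because it never learned the nonce whose hash the holder stored.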
-
Bringing it all together: Write
Client:
- Erasure code m+f fragments
- Hash & fingerprint fragments
- Send to first m+f servers
Servers:
- Verify hash, fingerprint
- Choose nonce
- Generate MAC
- Store fragment
Client:
- Forward MACs to servers
Servers:
- Verify MAC
- Free older fragments
Client:
- Write completed
Legend: Overhead = not in crash-only protocol
[Figure: fragment d_i sent from client to server]
-
Bringing it all together: Read
Client:
- Request fragments from first m servers
- Request latest nonce, timestamp, checksum
Servers:
- Return fragment (if requested)
- Return latest nonce, timestamp, checksum
Client:
- Verify provided checksum matches fragment hash & fp
- Verify timestamps match
- Verify nonces
- Read complete
Legend: Overhead = not in crash-only protocol
[Figure: fragment d_i returned from server to client]
-
Evaluation
-
Experimental setup
- m = f + 1 (m: fragments needed to decode block; f: number of faults tolerated)
- Single client, NVRAM at servers
- Write or read 64 kB blocks
  - Fragment size decreases as f increases
- 3 GHz Pentium D, Intel PRO/1000
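How quickly the fragments shrink follows from |B|/m with m = f + 1. The short calculation below is a sketch that assumes 64 kB means 65,536 bytes; `fragment_size` is a hypothetical helper name.

```python
BLOCK_SIZE = 64 * 1024  # assumption: 64 kB = 65,536 bytes

def fragment_size(f: int, block_size: int = BLOCK_SIZE) -> float:
    """Fragment size |B|/m for the evaluated setting m = f + 1."""
    m = f + 1
    return block_size / m

for f in range(1, 11):
    print(f"f={f:2d}  fragment = {fragment_size(f) / 1024:.1f} kB")
```

At f = 3, for example, each fragment is 16 kB, a quarter of the block.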
-
Prototype implementation
Four protocols implemented:
- Our protocol
- Crash-only erasure-coded
- Crash-only replication-based
- PASIS [Goodson04] emulation
  - Read validation: Decode, encode, hash 4f+1 fragments
  - 4f+1 servers, versioning, garbage collection
All use same hashing and erasure coding libraries
[Figure side labels: Byzantine tolerant; Erasure coded]
-
[Figure: Write throughput. x-axis: Number of faults tolerated (f), 1 to 10; y-axis: Bandwidth (MB/s), 0 to 80. Curves: Crash-only replication-based (f+1 = m replicas); PASIS-emulation; Our protocol; Crash-only erasure-coded.]
-
[Figure: Write throughput. x-axis: Number of faults tolerated (f), 1 to 10; y-axis: Bandwidth (MB/s), 0 to 80. Curves: Our protocol; Crash-only erasure-coded. Annotations: 64 kB / (f+1); 16 kB fragments.]
-
[Figure: Write response time. x-axis: Number of faults tolerated (f), 1 to 10; y-axis: Latency (ms), 0 to 7. Curves: A. Crash-only erasure-coded; B. Our protocol; C. PASIS-emulation; D. Crash-only replication-based. Inset: Overhead breakdown (f=10) for A, B, C, D into RPC, Hash, Encode, Fingerprint.]
-
[Figure: Read throughput. x-axis: Number of faults tolerated (f), 1 to 10; y-axis: Bandwidth (MB/s). Curves: Crash-only replication-based; PASIS-emulation (compute bound); Our protocol; Crash-only erasure-coded.]
-
[Figure: Read response time. x-axis: Number of faults tolerated (f), 1 to 10; y-axis: Latency (ms), 0 to 5. Curves: A. Erasure-coded; B. Our protocol; C. PASIS-emulation; D. Replicated. Inset: Overhead breakdown (f=10) for A, B, C, D into RPC, Hash, Encode, Fingerprint.]
-
Conclusions
Byzantine fault-tolerant storage can rival crash-only storage performance
We present a low-overhead Byzantine fault-tolerant erasure-coded block storage protocol and prototype
- Write overhead: 2-round, hash and fingerprint
- Read overhead: hash and fingerprint
- Close to performance of systems that tolerate only crashes for reads and large writes
-
Backup slides
-
Why not multicast?
- May be unstable or unavailable (UDP)
- More data means more work at server (network, disk, hash)
- Doesn't scale with clients!
-
Cryptographic hash overhead
Byzantine storage requires cryptographic hashing. Does this matter?
- Systems must tolerate non-crash faults
  - E.g., misdirected write
- Many modern systems checksum data
  - E.g., Google File System
  - ZFS supports SHA-256 cryptographic hash function
- May hash data for authentication
Conclusion: BFT may not introduce new hashing
-
Is 3f+1 servers expensive?
Consider a typical storage cluster
- Usually more primary drives than parity drives
- Usually several hot spares
Conclusion: May already use 3f+1 servers
[Figure: Primary drives, Parity drives, Hot spares; f = number of faults tolerated]