Post on 15-Mar-2018
Harendra Kumar, Yuvraj Patel, Ram Kesavan, Sumith Makam
High-Performance Metadata Integrity Protection in the WAFL Copy-on-Write File System
NetApp, Inc., University of Wisconsin-Madison
Example
Customer Data Center
“Freeing free block” panic
Support checklist
• Start recovery run (fsck-like tool)
• Seek Engineering help to root-cause the panic
Recovery Run?
Example
Scribble bug or Logic bug?
H/W fault or S/W bug?
India USA
How long will the recovery run take?
Engineering
When did the corruption happen?
Customer
Summary
• Bugs keep coming
  – Hardware faults
  – Software bugs
• Important to protect metadata for correctness
• Need of the hour
  – Simple techniques for strong data integrity
  – No/negligible performance impact (deployable)
  – Diagnostic capability
Our Solution
• Separate solutions for separate problems
  – Deployed in production
    • Incremental checksum for scribble bugs
    • Digest-based transaction auditing for logic bugs
  – In house
    • Page-level protection for diagnostics
Key Results
• Techniques protect metadata
  – Negligible performance impact
  – More than 3x reduction in recovery runs
  – Deployed in > 250K systems worldwide
• Field data (~5 years)
  – 33 systems protected from 8 unique scribble bugs
  – 50 systems protected from 9 unique logic bugs
Outline
• Introduction
• Scribble protection
• Page-level protection
• Digest-based transaction audit
• Evaluation
• Conclusion
Scribble protection
• Aim: Avoid scribbles corrupting metadata
• Rolling incremental checksum on every metadata update
Incremental checksum example (successful verification)
Timeline:
1. Indirect block loaded in memory: P Q R S T …
   • Adler 32-bit checksum of full block = C
   • Incremental checksum initialized to C
2. Indirect block modified (R → R’): P Q R’ S T …
   • Incremental checksum = C’
3. Indirect block modified (Q → Q’): P Q’ R’ S T …
   • Incremental checksum = C’’
4. Just before persisting
   • Compute Adler 32-bit checksum of the block = C’’
   • Compare full checksum and incremental checksum
5. On successful verification: persist to RAID/Storage
Notes:
• Metadata updates are small in size and frequent
• Incremental checksum computation depends on the amount of data modified and is cache-line friendly
• Concurrent incremental checksum computation is possible without locks
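The slides do not spell out how the checksum is patched, so the following is only a sketch of why the cost can depend on the bytes modified rather than on the block size. It assumes the standard Adler-32 definition over block bytes d_0 … d_{n-1} and a single byte at index i changing from d_i to d_i'; the symbols A, B, n and i are introduced here for illustration and are not from the slides.

```latex
\begin{aligned}
A &= \Bigl(1 + \sum_{k=0}^{n-1} d_k\Bigr) \bmod 65521, &
B &= \Bigl(n + \sum_{k=0}^{n-1} (n-k)\, d_k\Bigr) \bmod 65521, &
C &= 2^{16} B + A \\
A' &= \bigl(A + (d_i' - d_i)\bigr) \bmod 65521, &
B' &= \bigl(B + (n-i)(d_i' - d_i)\bigr) \bmod 65521, &
C' &= 2^{16} B' + A'
\end{aligned}
```

Each patch is a modular addition, so patches applied by different writers commute. That is consistent with the point that concurrent computation is possible without locks, although a real implementation would still need atomic updates of the stored value.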
Incremental checksum example (verification failure)
Timeline:
1. Indirect block loaded in memory: P Q R S T …
   • Adler 32-bit checksum of full block = C
   • Incremental checksum = C
2. Indirect block modified (R → R’): P Q R’ S T …
   • Incremental checksum = C’
3. Scribble bug corrupts the indirect block (Q → Q’): P Q’ R’ S T …
   • Incremental checksum = C’ (unchanged, because the scribble bypasses the update path)
4. Just before persisting
   • Compute Adler 32-bit checksum of the block = C’’
   • Compare full checksum and incremental checksum (C’’ ≠ C’)
5. On verification failure: panic the system, as other metadata may also be corrupted
Without incremental checksum, this scribble bug can lead to a “Freeing free block” panic
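The load, modify, scribble, and verify steps of this example can be mimicked in a few lines. This is a minimal sketch, not WAFL code: the class `ChecksummedBlock` and its methods are hypothetical names, the per-byte patch rule is the standard Adler-32 identity rather than anything stated in the slides, and a Python exception stands in for the system panic.

```python
import zlib

MOD = 65521  # Adler-32 modulus (largest prime below 2**16)


class ChecksummedBlock:
    """Hypothetical in-memory metadata block with a rolling Adler-32."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        # On load: full checksum C, incremental checksum initialized to C.
        self.incremental = zlib.adler32(bytes(self.data))

    def write(self, offset: int, new: bytes) -> None:
        """Legitimate update path: patch the checksum from the changed bytes only."""
        a = self.incremental & 0xFFFF
        b = self.incremental >> 16
        n = len(self.data)
        for k, value in enumerate(new):
            delta = value - self.data[offset + k]
            a = (a + delta) % MOD
            b = (b + (n - offset - k) * delta) % MOD
            self.data[offset + k] = value
        self.incremental = (b << 16) | a

    def verify_before_persist(self) -> None:
        """Recompute the full checksum and compare it with the incremental one."""
        if zlib.adler32(bytes(self.data)) != self.incremental:
            raise RuntimeError("checksum mismatch: possible scribble")


block = ChecksummedBlock(bytes(4096))
block.write(16, b"R'")          # tracked modification: checksum is patched
block.verify_before_persist()   # full and incremental checksums agree

block.data[8] ^= 0xFF           # scribble: memory changed outside write()
try:
    block.verify_before_persist()
    detected = False
except RuntimeError:
    detected = True             # mismatch caught before the block is persisted
```

The tracked write touches two bytes of a 4096-byte block and does work proportional to those two bytes; only the final verification reads the whole block.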
Page-level protection
• Scribble bugs are only caught at the end of a consistency point (CP)
• Difficult to root-cause scribble bugs
• Page permissions + Write Protect Enable (WP) bit
  – Keep pages read-only by default
  – Flip WP bit before and after modification
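The read-only-by-default discipline can be illustrated in user space. The sketch below is only an analogy under stated assumptions: `ProtectedPage` is a hypothetical class, a boolean flag stands in for the hardware page permission and WP bit, and a `PermissionError` stands in for the fault that would identify the offending writer at the moment of the write.

```python
from contextlib import contextmanager


class ProtectedPage:
    """Hypothetical page that rejects writes unless explicitly unlocked."""

    def __init__(self, size: int):
        self._data = bytearray(size)
        self._writable = False  # read-only by default

    @contextmanager
    def writable(self):
        """Flip the simulated write-protect state around a modification."""
        self._writable = True
        try:
            yield self
        finally:
            self._writable = False  # always restore protection afterwards

    def __getitem__(self, index):
        return self._data[index]

    def __setitem__(self, index, value):
        if not self._writable:
            # Stand-in for the fault a real system would take on a stray write.
            raise PermissionError(f"write to read-only page at index {index}")
        self._data[index] = value


page = ProtectedPage(4096)
with page.writable():
    page[0] = 0x42      # legitimate, bracketed update

try:
    page[1] = 0xFF      # stray write outside the bracket
    caught = False
except PermissionError:
    caught = True       # the culprit is identified at the time of the write
```

The stray write is stopped immediately instead of being noticed later at the end of the CP, which is what makes this approach useful for diagnosis.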
Digest-based transaction auditing
• Logic bugs and their nature
• Distributed invariants → Consistency equations
• Lightweight digest (transaction checksum)
  – Maintain different digests for different invariants
Digest-based transaction auditing
[Figure: an inode referencing blocks A (XYZ) and B; bitmap for blocks (A) (B) (C) (D) (E) = 1 1 0 0 0]
Client modifies inode A
• Adds new block
In-memory state of inode
[Figure: the inode referencing A (XYZ), B, and a new block (PQR)]
Digest-based transaction auditing
[Figure: in-memory inode referencing A (XYZ), B, and the new block (PQR); the inode written during the CP references D (XYZ), B, and C (PQR). A is the freed block; C and D are the allocated blocks. Bitmap for (A) (B) (C) (D) (E) changes from 1 1 0 0 0 to 0 1 1 1 0.]
During CP
• During indirect block updates
  – Maintain blocks allocated digest D1 = C + D
  – Maintain blocks freed digest D2 = A
• During bitmap updates
  – Maintain blocks allocated digest D3 = C + D
  – Maintain blocks freed digest D4 = A
End of CP
• Compare digests
  1. D1 == D3
  2. D2 == D4
Digest-based transaction auditing
• Digests are easy to maintain
• Lightweight: strong one-to-one audit avoided
Race scenario: D not updated due to race
• During bitmap updates
  – Maintain blocks allocated digest D3 = C
  – Maintain blocks freed digest D4 = A
Digest-based transaction auditing (race detected)
[Figure: same inode update, but the bitmap for (A) (B) (C) (D) (E) ends as 0 1 1 0 0 instead of 0 1 1 1 0: D not updated due to race]
During CP
• During indirect block updates
  – Maintain blocks allocated digest D1 = C + D
  – Maintain blocks freed digest D2 = A
• During bitmap updates
  – Maintain blocks allocated digest D3 = C
  – Maintain blocks freed digest D4 = A
End of CP
• Compare digests
  1. D1 != D3
  2. D2 == D4
Without digest-based transaction auditing, this race can lead to a “Freeing free block” panic
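The two consistency equations and the race case can be sketched as follows. This is an illustration under assumptions, not the production design: the slides write the digests as sums such as C + D, so each digest here is a running sum of block numbers; the class `CpDigests`, its field names, and the numeric block numbers chosen for A, C and D are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CpDigests:
    """Hypothetical per-CP digests; each is a running sum of block numbers."""
    indirect_alloc: int = 0   # D1: allocations seen by indirect block updates
    indirect_freed: int = 0   # D2: frees seen by indirect block updates
    bitmap_alloc: int = 0     # D3: allocations seen by bitmap updates
    bitmap_freed: int = 0     # D4: frees seen by bitmap updates

    def audit(self) -> List[str]:
        """Return the consistency equations that fail at the end of the CP."""
        failures = []
        if self.indirect_alloc != self.bitmap_alloc:
            failures.append("D1 != D3")
        if self.indirect_freed != self.bitmap_freed:
            failures.append("D2 != D4")
        return failures


# Assumed block numbers standing in for the slide's blocks A, C and D.
A, C, D = 100, 102, 103

good = CpDigests()
for blk in (C, D):
    good.indirect_alloc += blk   # indirect block update records the allocation
    good.bitmap_alloc += blk     # bitmap update records the same allocation
good.indirect_freed += A
good.bitmap_freed += A

racy = CpDigests()
racy.indirect_alloc += C + D
racy.bitmap_alloc += C           # bitmap update for D lost to a race
racy.indirect_freed += A
racy.bitmap_freed += A
```

Here `good.audit()` returns an empty list, while `racy.audit()` reports the failed equation `D1 != D3`. A sum costs one addition per update and needs no per-block bookkeeping. The trade-off is that errors which happen to cancel out would go unnoticed, which a strict one-to-one audit would catch.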
Evaluation
• Running on >250K systems for 5+ years
• Negligible regression on file server benchmarks (e.g., SPEC SFS)
• Heavy metadata updates by DB workloads
  – Database/OLTP benchmark (similar to SPC-1) built in-house
Performance Evaluation
Incremental checksum + digest-based transaction auditing performance (20+ audit equations)
[Figure: observed latency (ms, 0 to 30) vs. achieved throughput (IOPS, 80K to 128K); series “all off” and “all on”]
• Negligible throughput and latency impact until 120K ops
• 25% increase in latency thereafter
Test system: high range, 20 cores, 128 GB DRAM, 8 GB NVRAM
Performance evaluation
• Page-level protection
  – 20% performance penalty
  – Used in-house (debug-only kernels)
  – Only used once in the field to catch a recurring scribble bug
Protection from corruption bugs
• 5 years of data during in-house development
  – Unit test data hard to gather
  – 75 scribble bugs found by page-level protection
  – 32 scribble bugs found by incremental checksum
  – 23 logic bugs found by transaction auditing
• More than 3x reduction in number of recovery runs across ONTAP 8.0 → 8.3
Conclusion
• Introduced two techniques to enforce data integrity with minimal performance impact
• Disprove common belief: “Strong data integrity requires high performance penalty”
• End-to-end protection applicable to databases, distributed applications
• Concentrate more on innovation than worrying about data integrity
Thank you!
Questions?