Transcript of 5 Local Fault Tolerance — wlloyd/classes/599s15/... · 2017-11-27
Crash Recovery
Wyatt Lloyd
Assignment 1 Posted Saturday
• On GitHub, instructions in readme.md
  – https://github.com/USC657/username-Assignment1
• Posted later than I intended
  – => You get lots of late days
• Please start ASAP
  – Let me know of any issues with:
    • Environment
    • Versions of Go later than 1.2
Assignment Late Days
• 8 late days for the semester
• Use on any assignment
  – (Save for later, harder assignments)
• Use in 1-day increments
  – e.g., 1 second late = 1 hour late = 1 day late
• Based on last handin annotated tag time
Assignment 1 Progress
Paper Presentations
• Due 4 days early (new, was 1 week…)
  – i.e., Thursday at 11:59:59pm for Tuesday
  – Or Saturday at 11:59:59pm for Thursday
• 3 things due:
  – The required paper summary for that day
  – Supplemental paper summary
  – Slides
• Slides
  – Email them to Bailan and me in a PowerPoint-compatible format (Google Slides are fine)
Local Fault Tolerance
• ¬Distributed systems: this lecture is about a single machine
• A building block for building reliable distributed systems
Local Faults
• Power crash fault
  – Focus of today
  – Lose power, regain power, want to keep working
  – Kernel panic!
  – Common!
• Corruption (bit flips)
  – Cosmic radiation, …
  – Use Error Correcting Codes (ECC)
    • The norm in datacenters
Power Faults
• What happens when you pull the plug on a server?
• Disk state is maintained
  – Hard drive
  – SSD
  – “Nonvolatile”
• Memory state is lost
  – DRAM
  – “Volatile”
• Future memory will not be lost…
  – NVRAM
  – Research: get to rethink system fault tolerance
Aside: Memory Not Lost Immediately
• “Lest We Remember: Cold-Boot Attacks on Encryption Keys”
  – J. Alex Halderman et al. (Princeton)
  – USENIX Security ’08
• “Contrary to popular assumption, DRAMs used in most modern computers retain their contents for seconds to minutes after power is lost, even at operating temperatures and even if removed from a motherboard.”
  – Enables attackers to steal encryption keys from memory…
Aside
• [Images of DRAM contents decaying after 5s, 30s, 60s, 5min]
Crash Faults
• Reasonable assumptions for us:
  – State saved to disk is still there
  – In-memory state is gone
Synchronous Logging
• Essentially use the disk as our memory
  – Wait for the synchronous disk write before continuing
• Why not do this?
  – Disks are *very* slow
  – Fundamental tension

int fd = open("journal.log", O_RDWR | O_SYNC);
Disk Drive Performance Primer
• Random reads/writes
  – In-place updates (e.g., O_SYNC)
  – Seek time + rotational latency
  – ~10ms + ~5ms
  – ~80 IOPS/drive from the f4 paper
• Sequential reads/writes
  – Read/write to contiguous blocks
  – Much faster (100–200 MBps)
  – Todo: experiment to see read/write latency for different block sizes
• Q: Why is this interesting?
• Takeaway:
  – Random is slow for disks
  – Sequential is fast for disks
    • If writes are big enough…
Aside: SSD Performance Primer
• All reads are fast and have high throughput
  – No disk head to seek or disk to rotate
• Random writes are still slow and have low throughput
  – Eventually (once the SSD is “full”)
  – Also due to how SSDs physically work
    • Must “erase” blocks together on flash chips
    • Many parallel flash chips in an SSD
    • Max throughput requires 256 MiB or 512 MiB writes in the 3 modern SSDs from the RIPQ paper
• Sequential writes are still fast and have high throughput
  – Higher than disks, e.g., 600 MBps vs 200 MBps
Write-Ahead Logging
• Store everything that matters to disk before we do it
  – LOG: will do Zahaib.status = “Presenting today”
  – FILE: Zahaib.status = “Presenting today”
  – LOG: did Zahaib.status = “Presenting today”
• Typically use a dedicated disk
  – Much faster, but still rotational latency
Recovery
• Replay the log
  – Wait for replay to complete before continuing
  – Updates should be idempotent
    • i.e., Zahaib.friends = 500, not += 1
• Remaining issues?
  – Slow recovery
  – Atomicity
Speeding Up Recovery
• How can we make recovery faster?
  – Remove the completed prefix of the log
    • i.e., the part of the log where every “will do” has a matching “did”
Atomicity
• All or nothing
  – Maintains invariants
• Banking, money transfer:
  – Minlan.account -= $100;
  – Ethan.account += $100;
• Social network, friend addition:
  – Minlan.friends += Ethan;
  – Ethan.friends += Minlan;
• Filesystem, rename:
  – Create new directory entry
  – Erase old directory entry
Atomicity & Logging
• Write-Ahead Logging
  – Will do transaction 1
    • Minlan.friends += Ethan
    • Ethan.friends += Minlan
  – Did transaction 1
    • Now actually do updates…
Atomicity & Recovery
• Write-Ahead Logging
  – Will do transaction 1
    • Minlan.friends += Ethan
    • Ethan.friends += Minlan
  – Did transaction 1
    • Now actually do updates…
• What happens when a failure happens?
  – The “did transaction” record identifies the commit point
Pretty Simple, Right?
• Unless things you thought were atomic aren’t actually atomic
• Unless things you didn’t think were written to disk were
• Unless you also get a failure during recovery!
ARIES
• Write-Ahead Logging for Databases
  – Used in many commercial DBs
  – 1992, Transactions on Database Systems
  – C. Mohan et al. (IBM Research)
  – Considered the gold standard
• More complicated than we’ve discussed
  – Failure during recovery
  – Aborted transactions
  – Only commits transactions that really committed
Recent Research
• “From ARIES to MARS: Transaction Support for Next-Generation, Solid-State Drives”
  – Joel Coburn, Trevor Bunker, Rajesh K. Gupta, and Steven Swanson (UCSD)
  – SOSP ’13
• Write-ahead logging scheme for non-volatile memory
  – e.g., phase-change memory, spin-transfer torque MRAMs, and the memristor
  – WAL without restricting to append-only logs
Should Real Systems Care?
• Yes!
  – (Battery backups don’t stop kernel panics)
Should Research Prototypes Care?
• No?
  – Not the focus of most prototypes
  – Can be done properly
• Yes?
  – Can affect design and/or results
  – Improves accuracy of results
  – Could become a real system…
Takeaways
• Lecture: use write-ahead logging
• Papers: very hard to handle local faults properly
• Critical for real systems
  – Debatably important for research prototypes
• Distributed fault tolerance is even harder
  – Assuming failing nodes fit a specific fault model
Intermission
EXPLODE
Jamie Tsao
EXPLODE
A Lightweight, General System for Finding Serious Storage System Errors
Junfeng Yang, Can Sar, and Dawson Engler
OSDI 2006
(also BUGS 2005 workshop)
Torturing Storage Systems...
• Must recover correctly from crashes at any point in the program
  – Modifications, flushing
• Current testing methods are terrible
  – Manual inspection
  – Bug reports from angry users
  – Power-cord yanking to simulate power failures
  – Unit testing of undocumented kernel methods?
• Uses model checking, but in situ: running live systems mounted from a lightweight device driver in a stock kernel
  – Alternatives include running in a third protection ring, partial system checks, or modeling
...and Torturing Databases
• The new paper in class is a bit different
  – Based mostly on database reliability issues
    • ACID, versus just any storage system invariant
• Uses power faults as the fault model
  – EXPLODE creates corruptions before propagation to disk
  – EXPLODE targets OS-and-below issues, while this one is about higher-level software
Model-checking in EXPLODE
• Set up with a storage component
  – init(), mount(), unmount(), recover(), threads()
• Checking by exploring choices
  – mutate() calls choose(N) to branch off to different possible states from calling system-specific methods
  – Calls check_crash_*() to create crash disk images
    • Permutes over possible write sets
  – Calls check() to verify conditions, logging error cases
• Uses a scheduler to pick states and get_sig() to eliminate duplicates
  – Checkpoints states and reruns them deterministically from the choice sequence
  – Controls threads to eliminate non-deterministic behavior
• All of this runs on some extra RAM disk, with EKM in a modified kernel
Different Checks on Databases
• Does not force deterministic thread scheduling
  – Allows finding concurrency bugs in databases
• Includes suggestions for workloads and tracing of errors when looking for bugs
  – EXPLODE makes users create test situations themselves
  – Record/replay is like the logging and checkpoints/states in EXPLODE
• Pattern-based ranking for finding the most problematic areas
  – Exhaustive fault-injection policy similar to model checking
• EXPLODE cannot run on Windows databases
  – EXPLODE is implemented in the Linux kernel
  – This paper intercepts at the iSCSI layer, so it can run on any OS
• Black box, white box?
EXPLODE's Results
• A lot of bugs found (36 in total), just from writing a little bit of code
...and the Other Results
• A bit hard to compare (FS vs. database)
  – ext3 and XFS on Linux to check for FS failures
• Also did some analysis of pattern-based ranking for vulnerability (EXPLODE doesn't have this)
• Found concurrency problems
• Durability issues were most prevalent (7 of 8 databases; the last one hung)
  – Like syncing: commits not persistent after recovery
  – Also note that all of these databases have issues, despite extensive testing
Take-away
• Model checking: expanding all possible states and checking all choices from them
  – Corner cases as easy to find as the common case
  – Useful for doing some interesting state-space searches
• Combine systems together to check
  – The sum of parts is different from the whole
• All those file systems have bugs!
  – A bug-free system would somehow be surprising?
Alice & Bob
Zahaib Aktar
All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications
Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnathan Alagappan, Samer Al-Kiswany,
Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau
(University of Wisconsin–Madison)
OSDI 2014
Problem at Hand
● Crash recovery is essential but hard to get right
  o Reason: applications are built atop unreliable file systems
● What makes file systems unreliable:
  o File-system guarantees are unclear
    ▪ Disk-state mutation in case of crashes is largely non-deterministic
    ▪ Different FSs such as Linux ext3 and ext4 have different robustness
  o Building high-performance crash-consistency protocols is hard
    ▪ Non-deterministic state => large number of corner cases
    ▪ Application-level crash-consistency protocols are big and complex
Comparison with Torturing DBs
● How is the problem different?
  o Overall, the two papers are quite complementary
  o The required paper considers a specific failure instance: power loss
  o It checks ACID consistency; A&B is more general
● Focus is slightly different
  o Req. paper focuses on applications' (DBs') inability to cope with a crash
  o Alice and Bob is more general, covering a range of apps (Hadoop, DBs)
  o Alice and Bob examine the shortcomings of the underlying FS
● Similarities
  o Similar overall goal: expose bugs in case of a crash
  o Techniques quite similar; both acknowledge the shortcomings of the FS
Techniques and Insights
● What FS behaviour is necessary for building crash-consistent apps?
  o Persistence properties
● Are modern application-level crash-consistency protocols correct?
● Propose BOB: Block Order Breaker
  o Reorders block traces and finds sequences which break consistency
● Propose ALICE: Application-Level Intelligent Crash Explorer
  o Application updates are a series of sys calls (e.g., append, write)
  o Permutes sys-call workloads and analyzes the permutations
  o Finds the persistence properties assumed by applications
Comparison with Torturing DBs
● How does this build upon the required paper?
  o A&B present a much more general framework for analyzing crashes
  o While the req. paper looks at apps, A&B looks at both FS and apps
● Techniques
  o Pretty similar ideas! Both collect block-level traces
  o Req. paper injects failures at different points and checks consistency
  o A&B permutes different block orderings and checks consistency
● Study depth
  o While the req. paper acknowledges that the FS can cause problems,
  o A&B empirically demonstrate and quantify the extent of the problem
Key Findings
● BOB studied six different Linux file systems
  o Persistence properties vary both between and within file systems
● App-level consistency depends on the underlying FS's persistence properties
  o This dependency is undesirable: it's a crash vulnerability
  o Finds a total of 60 vulnerabilities across 11 apps such as SQLite, Git, HDFS
● Many apps expect ordering among sys calls
  o When ordering is broken: 7/11 apps do not recover from crashes
  o 10/11 apps also expect atomicity of file-system updates
    ▪ Not so bad: 512-byte (a disk sector) writes/rename ops are mostly atomic
    ▪ But may break in the future with smaller sectors
  o 7/11 apps do not meet durability guarantees
Comparison with Torturing DBs
● Developer assumptions:
  o Req. paper identifies 5 low-level vulnerability patterns
  o Establishes developer ignorance/wrong assumptions
  o A&B also finds wrong developer assumptions to be a major cause of failure
    ▪ A&B also puts blame on ambiguous FS specifications
● Results
  o Req. paper finds that 7/8 DBs violate atomicity constraints
  o Similar findings by A&B w.r.t. appends
  o Both papers reveal a great extent of vulnerabilities in the target systems
  o A&B: a single vulnerability in PostgreSQL and LMDB was already known (validation!)
  o A&B: 31/60 vulnerabilities violate a user expectation, not a documented spec
Things to Remember
● Years of research on file-system consistency, but
  o Techniques like logging, copy-on-write, and similar approaches fall short
  o Plenty of bugs still remain
● App developers need to be careful on the following accounts
  o Must not assume FS guarantees
  o Different FSs vary greatly; must make apps independent of the FS
● Alice and Bob: but not your everyday Alice and Bob!
  o BOB analyzes block-level traces and finds persistence-property violations
  o ALICE permutes sys calls and analyzes the persistence properties assumed by apps