Transcript of 5 Local Fault Tolerance — wlloyd/classes/599s15/... · 2017-11-27
Crash Recovery
Wyatt Lloyd
Assignment 1 Posted Saturday
• On GitHub, instructions in readme.md
  – https://github.com/USC657/username-Assignment1
• Posted later than I intended
  – => You get lots of late days
• Please start ASAP
  – Let me know of any issues with:
    • Environment
    • Versions of Go later than 1.2
Assignment Late Days
• 8 late days for the semester
• Use on any assignment
  – (Save for later, harder assignments)
• Use in 1-day increments
  – e.g., 1 second late = 1 hour late = 1 day late
• Based on last handin annotated tag time
Assignment 1 Progress
Paper Presentations
• Due 4 days early (new, was 1 week…)
  – i.e., Thursday at 11:59:59pm for Tuesday
  – Or Saturday at 11:59:59pm for Thursday
• 3 things due:
  – The required paper summary for that day
  – Supplemental paper summary
  – Slides
• Slides
  – Email them to Bailan and me in a PowerPoint-compatible format (Google Slides are fine)
Local Fault Tolerance
• ¬Distributed systems: this lecture is about a single machine
• A building block for building reliable distributed systems
Local Faults
• Power crash fault
  – Focus of today
  – Lose power, regain power, want to keep working
  – Kernel panic!
  – Common!
• Corruption (bit flips)
  – Cosmic radiation, …
  – Use Error Correcting Codes (ECC)
    • The norm in datacenters
Power Faults
• What happens when you pull the plug on a server?
• Disk state is maintained
  – Hard drive
  – SSD
  – “Nonvolatile”
• Memory state is lost
  – DRAM
  – “Volatile”
• Future memory will not be lost…
  – NVRAM
  – Research: get to rethink system fault tolerance
Aside: Memory Not Lost Immediately
• “Lest We Remember: Cold-Boot Attacks on Encryption Keys”
  – J. Alex Halderman et al. (Princeton)
  – USENIX Security ’08
• “Contrary to popular assumption, DRAMs used in most modern computers retain their contents for seconds to minutes after power is lost, even at operating temperatures and even if removed from a motherboard.”
  – Enables attackers to steal encryption keys from memory…
Aside
• [Images of DRAM contents decaying after 5s, 30s, 60s, 5min]
Crash Faults
• Reasonable assumptions for us:
  – State saved to disk is still there
  – In-memory state is gone
Synchronous Logging
• Essentially use the disk as our memory
  – Wait for the synchronous disk write before continuing
• Why not do this?
  – Disks are *very* slow
  – Fundamental tension

int fd = open("journal.log", O_RDWR | O_SYNC);
Disk Drive Performance Primer
• Random reads/writes
  – In-place updates (e.g., O_SYNC)
  – Seek time + rotational latency
  – ~10ms + ~5ms
  – ~80 IOPS/drive from the f4 paper
• Sequential reads/writes
  – Read/write to contiguous blocks
  – Much faster (100–200 MBps)
  – Todo: experiment to see read/write latency for different block sizes
• Q: Why is this interesting?
• Takeaway:
  – Random is slow for disks
  – Sequential is fast for disks
    • If writes are big enough…
Aside: SSD Performance Primer
• All reads are fast and have high throughput
  – No disk head to seek or disk to rotate
• Random writes are still slow and have low throughput
  – Eventually (once the SSD is “full”)
  – Also due to how SSDs physically work
    • Must “erase” blocks together on flash chips
    • Many parallel flash chips in an SSD
    • Max throughput requires 256 MiB or 512 MiB writes in the 3 modern SSDs from the RIPQ paper
• Sequential writes are still fast and have high throughput
  – Higher than disks, e.g., 600 MBps vs 200 MBps
Write-Ahead Logging
• Store everything that matters to disk before we do it
  – LOG: will do Zahaib.status = “Presenting today”
  – FILE: Zahaib.status = “Presenting today”
  – LOG: did Zahaib.status = “Presenting today”
• Typically use a dedicated disk
  – Much faster, but still rotational latency
Recovery
• Replay the log
  – Wait for replay to complete before continuing
  – Updates should be idempotent
    • i.e., Zahaib.friends = 500, not += 1
• Remaining issues?
  – Slow recovery
  – Atomicity
Speeding Up Recovery
• How can we make recovery faster?
  – Remove the completed prefix of the log
    • i.e., the part of the log where every “will do” has a matching “did”
Atomicity
• All or nothing
  – Maintains invariants
• Banking, money transfer:
  – Minlan.account -= $100;
  – Ethan.account += $100;
• Social network, friend addition:
  – Minlan.friends += Ethan;
  – Ethan.friends += Minlan;
• Filesystem, rename:
  – Create new directory entry
  – Erase old directory entry
Atomicity & Logging
• Write-Ahead Logging
  – Will do transaction 1
    • Minlan.friends += Ethan
    • Ethan.friends += Minlan
  – Did transaction 1
    • Now actually do updates…
Atomicity & Recovery
• Write-Ahead Logging
  – Will do transaction 1
    • Minlan.friends += Ethan
    • Ethan.friends += Minlan
  – Did transaction 1
    • Now actually do updates…
• What happens when a failure happens?
  – The “did transaction” record identifies the commit point
Pretty Simple, Right?
• Unless things you thought were atomic aren’t actually atomic
• Unless things you didn’t think were written to disk were
• Unless you also get a failure during recovery!
ARIES
• Write-Ahead Logging for Databases
  – Used in many commercial DBs
  – 1992, Transactions on Database Systems
  – C. Mohan et al. (IBM Research)
  – Considered the gold standard
• More complicated than we’ve discussed
  – Failure during recovery
  – Aborted transactions
  – Only commits transactions that really committed
Recent Research
• “From ARIES to MARS: Transaction Support for Next-Generation, Solid-State Drives”
  – Joel Coburn, Trevor Bunker, Rajesh K. Gupta, and Steven Swanson (UCSD)
  – SOSP ’13
• Write-ahead logging scheme for non-volatile memory
  – e.g., phase-change memory, spin-transfer torque MRAMs, and the memristor
  – WAL without restricting to append-only logs
Should Real Systems Care?
• Yes!
  – (Battery backups don’t stop kernel panics)
Should Research Prototypes Care?
• No?
  – Not the focus of most prototypes
  – Can be done properly
• Yes?
  – Can affect design and/or results
  – Improves accuracy of results
  – Could become a real system…
Takeaways
• Lecture: use write-ahead logging
• Papers: very hard to handle local faults properly
• Critical for real systems
  – Debatably important for research prototypes
• Distributed fault tolerance is even harder
  – Assuming failing nodes fit a specific fault model
Intermission
EXPLODE
Jamie Tsao
EXPLODE
A Lightweight, General System for Finding Serious Storage System Errors
Junfeng Yang, Can Sar, and Dawson Engler
OSDI 2006
(also BUGS 2005 workshop)
Torturing Storage Systems...
• Must recover correctly from crashes at any point in the program
  – Modifications, flushing
• Current testing methods are terrible
  – Manual inspection
  – Bug reports from angry users
  – Power-cord yanking to simulate power failures
  – Unit testing of undocumented kernel methods?
• Uses model checking, but in situ: running live systems mounted from a lightweight device driver in a stock kernel
  – Alternatives include running in a third protection ring, partial system checks, or modeling
...and Torturing Databases
• The new paper in class is a bit different
  – Based mostly on database reliability issues
    • ACID, versus just any storage system invariant
• Uses power faults as the fault model
  – EXPLODE creates corruptions before propagation to disk
  – EXPLODE targets OS-and-below issues, while this one is about higher-level software
Model-checking in EXPLODE
• Set up with a storage component
  – init(), mount(), unmount(), recover(), threads()
• Checking by exploring choices
  – mutate() calls choose(N) to branch off to different possible states from calling system-specific methods
  – Calls check_crash_*() to create crash disk images
    • Permutes over possible write sets
  – Calls check() to verify conditions, logging error cases
• Uses a scheduler to pick states and get_sig() to eliminate duplicates
  – Checkpoints states and reruns them deterministically from the choice sequence
  – Controls threads to eliminate non-deterministic behavior
• All of this runs on some extra RAM disk, with EKM in a modified kernel
Different Checks on Databases
• Does not force deterministic thread scheduling
  – Allows finding concurrency bugs in databases
• Includes suggestions for workloads and tracing of errors when looking for bugs
  – EXPLODE makes users create test situations themselves
  – Record/replay is like the logging and checkpoints/states in EXPLODE
• Pattern-based ranking for finding the most problematic areas
  – Exhaustive fault-injection policy similar to model checking
• EXPLODE cannot run on Windows databases
  – EXPLODE is implemented in the Linux kernel
  – This paper intercepts at the iSCSI layer, so it can run on any OS
• Black box, white box?
EXPLODE's Results
• A lot of bugs found (36 in total), just from writing a little bit of code
...and the Other Results
• A bit hard to compare (FS vs. database)
  – ext3 and XFS on Linux to check for FS failures
• Also did some analysis of pattern-based ranking for vulnerability (EXPLODE doesn't have this)
• Found concurrency problems
• Durability issues were most prevalent (7 of 8 databases; the last one hung)
  – Like syncing: commits not persistent after recovery
  – Also note that all of these databases have issues, despite extensive testing
Take-away
• Model checking: expanding all possible states and checking all choices from them
  – Corner cases as easy to find as the common case
  – Useful for doing some interesting state-space searches
• Combine systems together to check
  – The sum of parts is different from the whole
• All those file systems have bugs!
  – A bug-free system would somehow be surprising?
Alice & Bob
Zahaib Aktar
All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications
Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnathan Alagappan, Samer Al-Kiswany,
Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau
(University of Wisconsin–Madison)
OSDI 2014
Problem at Hand
● Crash recovery is essential but hard to get right
  o Reason: applications are built atop unreliable file systems
● What makes file systems unreliable:
  o File-system guarantees are unclear
    ▪ Disk-state mutation in case of crashes is largely non-deterministic
    ▪ Different FSs such as Linux ext3 and ext4 have different robustness
  o Building high-performance crash-consistency protocols is hard
    ▪ Non-deterministic state => large number of corner cases
    ▪ Application-level crash-consistency protocols are big and complex
Comparison with Torturing DBs
● How is the problem different?
  o Overall, the two papers are quite complementary
  o The required paper considers a specific failure instance: power loss
  o It checks ACID consistency; A&B is more general
● Focus is slightly different
  o Req. paper focuses on applications' (DBs') inability to cope with a crash
  o Alice and Bob is more general, covering a range of apps (Hadoop, DBs)
  o Alice and Bob examine the shortcomings of the underlying FS
● Similarities
  o Similar overall goal: expose bugs in case of a crash
  o Techniques quite similar; both acknowledge the shortcomings of the FS
Techniques and Insights
● What FS behaviour is necessary for building crash-consistent apps?
  o Persistence properties
● Are modern application-level crash-consistency protocols correct?
● Propose BOB: Block Order Breaker
  o Reorders block traces and finds sequences which break consistency
● Propose ALICE: Application-Level Intelligent Crash Explorer
  o Application updates are a series of sys calls (e.g., append, write)
  o Permutes sys-call workloads and analyzes the permutations
  o Finds the persistence properties assumed by applications
Comparison with Torturing DBs
● How does this build upon the required paper?
  o A&B present a much more general framework for analyzing crashes
  o While the req. paper looks at apps, A&B looks at both FS and apps
● Techniques
  o Pretty similar ideas! Both collect block-level traces
  o Req. paper injects failures at different points and checks consistency
  o A&B permutes different block orderings and checks consistency
● Study depth
  o While the req. paper acknowledges that the FS can cause problems,
  o A&B empirically demonstrate and quantify the extent of the problem
Key Findings
● BOB studied six different Linux file systems
  o Persistence properties vary both between and within file systems
● App-level consistency depends on the underlying FS's persistence properties
  o This dependency is undesirable: it's a crash vulnerability
  o Finds a total of 60 vulnerabilities across 11 apps such as SQLite, Git, HDFS
● Many apps expect ordering among sys calls
  o When ordering is broken: 7/11 apps do not recover from crashes
  o 10/11 apps also expect atomicity of file-system updates
    ▪ Not so bad: 512-byte (a disk sector) writes/rename ops are mostly atomic
    ▪ But may break in the future with smaller sectors
  o 7/11 apps do not meet durability guarantees
Comparison with Torturing DBs
● Developer assumptions:
  o Req. paper identifies 5 low-level vulnerability patterns
  o Establishes developer ignorance/wrong assumptions
  o A&B also finds wrong developer assumptions to be a major cause of failure
    ▪ A&B also puts blame on ambiguous FS specifications
● Results
  o Req. paper finds that 7/8 DBs violate atomicity constraints
  o Similar findings by A&B w.r.t. appends
  o Both papers reveal a great extent of vulnerabilities in the target systems
  o A&B: a single vulnerability in PostgreSQL and LMDB was already known (validation!)
  o A&B: 31/60 vulnerabilities violate a user expectation, not a documented spec
Things to Remember
● Years of research on file-system consistency, but
  o Techniques like logging, copy-on-write, and similar approaches fall short
  o Plenty of bugs still remain
● App developers need to be careful on the following accounts
  o Must not assume FS guarantees
  o Different FSs vary greatly; must make apps independent of the FS
● Alice and Bob: but not your everyday Alice and Bob!
  o BOB analyzes block-level traces and finds persistence-property violations
  o ALICE permutes sys calls and analyzes the persistence properties assumed by apps