FIG: A Prototype Tool for On-Line Verification of Recovery Mechanisms
Naveen Sastry, Pete Broadwell, Jonathan Traupman, David Patterson
University of California, Berkeley
Presentation Outline
1. Introduction
– Objective/Motivation
– Background
2. Methods
– Implementation
– Test setup
3. Evaluation
– Test results
– Conclusions
The Berkeley/Stanford ROC Project
• Purpose: investigating novel techniques for building highly-dependable Internet services
• Example techniques:
– Advanced support for operator undo
– Stability through targeted restarts
– Integrated root cause analysis
– Online verification of recovery mechanisms
FIG Project Objective/Motivation
Objective:
• Develop a lightweight, extensible tool for injecting errors to test recovery code/mechanisms
Motivation:
• Testing and production environments are always different
• Large systems will require recovery code, which should be tested as part of normal operation
“Software’s Invisible Users”
[Figure labels: Application; User Input; User interface; Other libraries; Other apps; System libraries (libc); OS]
Concept: Jim Whittaker, Florida Institute of Technology
Related Testing Methods
1. Ballista (DeVale, Koopman, Siewiorek)
• “Top-down” testing of POSIX-compliant OS and library interfaces
2. Fuzz (Miller, Fredriksen, So)
• Tested UNIX applications by feeding them random input streams
3. Holodeck (Whittaker et al.)
• Similar approach to ours, but only for Windows 2000/XP
FIG Implementation
• Thin stub library between app & libraries
• Traps API calls
– Logs them
– Inserts faults
• Can be inserted into any app without modification
– Uses LD_PRELOAD
[Figure: call stack of Application, libfig.so, libc.so and other libs, and OS, marking the normal call path and an injected fault]
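The LD_PRELOAD mechanism mentioned above asks the dynamic linker to load a given shared object before the application's other libraries, which is how a stub library can sit in front of libc without modifying the app. The following Python sketch only prepares the environment for such a launch. The helper name `fig_environment` and the library path used in the note below are hypothetical and are not part of FIG.

```python
import os


def fig_environment(lib_path, base_env=None):
    """Return a copy of the environment with lib_path first in LD_PRELOAD.

    lib_path is assumed to point at a stub library such as libfig.so;
    the real location depends on how the tool was built and installed.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # LD_PRELOAD takes a colon- or space-separated list of shared objects,
    # so keep anything that was already being preloaded.
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env
```

Passing the result as `env=` to `subprocess.run(["ls", "-l"], env=fig_environment("/path/to/libfig.so"))` would start an unmodified `ls` with the stub library preloaded, assuming a library actually exists at that placeholder path.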
Extensibility
• API stubs are automatically generated
• Very easy to add new APIs to log
• Fault injection is under script control
• Can simulate multiple fault models (e.g., memory pressure)
Sample control file:
    MALLOC_INDEX
    interval 82 to infinity return 0 errno ENOMEM probability 0.03
    OPEN_INDEX
    // device out of space.
    interval 100 to infinity return -1 errno ENOSPC probability 0.001
    // kernel out of memory.
    interval 100 to 120 return -1 errno ENOMEM probability 0.1
    // too many files open.
    callnumber 108 return -1 errno EMFILE probability 1.0
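The slide does not define the control file semantics, so the following Python sketch is one plausible reading rather than FIG's actual parser. It assumes that `interval A to B` selects calls by their sequence number, that `callnumber N` selects a single call, and that `probability` is the chance of injecting the fault on each selected call. The names `Rule`, `parse_rule` and `injected_fault` are hypothetical, and section headers such as MALLOC_INDEX and `//` comments are not handled.

```python
import random
from collections import namedtuple

# One fault rule; field meanings are assumptions based on the sample file.
Rule = namedtuple("Rule", "first_call last_call return_value errno_name probability")


def parse_rule(line):
    """Parse one 'interval A to B ...' or 'callnumber N ...' rule line."""
    tokens = line.split()
    if tokens[0] == "interval":
        first = int(tokens[1])
        last = float("inf") if tokens[3] == "infinity" else int(tokens[3])
        rest = tokens[4:]
    elif tokens[0] == "callnumber":
        first = last = int(tokens[1])
        rest = tokens[2:]
    else:
        raise ValueError(f"unrecognized rule: {line!r}")
    # Remaining tokens are keyword/value pairs: return, errno, probability.
    fields = dict(zip(rest[0::2], rest[1::2]))
    return Rule(first, last, int(fields["return"]), fields["errno"],
                float(fields["probability"]))


def injected_fault(rules, call_number, rng=random):
    """Return (return_value, errno_name) if a rule fires, else None."""
    for rule in rules:
        in_range = rule.first_call <= call_number <= rule.last_call
        if in_range and rng.random() < rule.probability:
            return rule.return_value, rule.errno_name
    return None
```

Under these assumptions, a stub for open() would consult `injected_fault` with its running call count and, when a rule fires, return the configured value and set errno instead of calling the real function.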
Test Setup: Applications
• GNU file utilities (ls, mv, etc.)
• Emacs 20.7.1 – with and without X
• Apache 1.3.22
• Berkeley DB 4.0.14
• Netscape Navigator 4.76
• MySQL server 3.23.36
Test Setup: Instrumented Calls & Their Errors
• malloc() – memory exhaustion
• read() – I/O error, system call was interrupted
• write() – I/O error, no space left on device, call interrupted
• open() – memory exhaustion, no space on device, too many files open
• select() – memory exhaustion
Test Results: Client Apps
                read()   read()   write()   write()   select()     malloc()
                EINTR    EIO      ENOSPC    EIO       ENOMEM       ENOMEM
Emacs – no X    o.k.     exit     warn      warn      o.k.         crash
Emacs – w/X     o.k.     crash    o.k.      crash     crash/exit   crash
Netscape        warn     exit     exit      exit      n/a          exit
Test Results: Server Apps
                        read()       read()        write()      write()      select()   malloc()
                        EINTR        EIO           ENOSPC       EIO          ENOMEM     ENOMEM
Berkeley DB – Xact      retry        detect        Xact abort   Xact abort   n/a        Xact abort
Berkeley DB – no Xact   retry        detect        data loss    data loss    n/a        detect, or data loss
MySQL Server            Xact abort   retry, warn   Xact abort   Xact abort   retry      restart process
Apache                  o.k.         req. drop     req. drop    req. drop    o.k.       n/a
Netscape Reacts
Test Results: Overhead
                          Time (s)   Overhead
No FIG                    33.46      N/A
FIG, no logging           34.28      2.5%
Logging w/o timestamps    47.83      42.9%
Logging w/ timestamps     61.74      84.5%
strace (all syscalls)     112.85     237.3%
Timing using Berkeley DB (non-transactional) to read, sort and write one million words.
• Note: FIG communicates with a separate logging daemon through shared memory to reduce logging overhead.
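The overhead column can be checked from the reported times. This short Python sketch assumes overhead is defined as the relative increase in run time over the "No FIG" baseline, which is consistent with the figures on the slide; the function name is illustrative only.

```python
def overhead_percent(baseline_s, measured_s):
    """Relative slowdown of measured_s over baseline_s, in percent."""
    return (measured_s / baseline_s - 1.0) * 100.0


BASELINE_S = 33.46  # "No FIG" run time from the slide

# Run times in seconds, as reported on the slide.
timings_s = {
    "FIG, no logging": 34.28,
    "Logging w/o timestamps": 47.83,
    "Logging w/ timestamps": 61.74,
    "strace (all syscalls)": 112.85,
}

for name, seconds in timings_s.items():
    print(f"{name}: {overhead_percent(BASELINE_S, seconds):.1f}%")
```

Rounded to one decimal place, the computed values match the table: 2.5%, 42.9%, 84.5% and 237.3%.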
Strategies for Reliable Services:
• Intelligent retry
– ls: “bounded retry” of malloc()
• Resource preallocation
– Apache: allocates buffer pool at startup
• Degraded service
– Apache: deactivates logging if disk full
• Process pools
– Apache and MySQL
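As an illustration of the "bounded retry" strategy listed above, here is a generic Python sketch. It is not the code used by ls. The helper name `bounded_retry`, the default attempt count, and the use of `MemoryError` as a stand-in for a failed malloc() are all assumptions made for the example.

```python
import time


def bounded_retry(operation, max_attempts=3, delay_s=0.0, retry_on=(MemoryError,)):
    """Call operation() until it succeeds or max_attempts is reached.

    Only exceptions listed in retry_on are retried; the last one is
    re-raised once the attempt budget is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on as exc:
            last_error = exc
            if attempt < max_attempts:
                # Give the environment a moment to recover before retrying.
                time.sleep(delay_s)
    raise last_error
```

The bound matters: an unbounded loop would hang forever under a persistent fault, such as a control-file rule that fails every call from some point onward, whereas a bounded one eventually reports the error to the caller.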
FIG as a Prototype for Online Error Injection
• Low run-time overhead
• Easy to enable/disable
• Easy to configure
• Extensible
• Can simulate multiple fault models
A Case for Online Error Injection
• Recovery code is not usually exercised during normal operation
• Deployed environments tend to differ from testing environments
• Can run error injection tests on a subset of deployed systems
• FIG can simulate common environmental errors
Conclusions
• FIG exposed a variety of deficiencies in how our test applications handled environmental errors
• Server apps are generally more robust than client applications
• FIG exhibits low overhead
• FIG is suitable for online error injection
Future Directions
• Limitations of FIG:
– Only for UNIX-like OSes
– Limited to app/library interface (proxy for app/OS interaction)
• Make FIG part of a larger test suite
• Include clock-time and event-based error triggers
• Greater flexibility in configuration file
Other Related Work
1. Xept (Vo et al.)
• Instruments object code to ensure that error handling code exists
2. Processor & memory errors
• DOCTOR, HYBRID, DEFINE
3. Process memory corruption
• FERRARI, DEFINE