
FIG: A Prototype Tool for On-Line Verification of Recovery Mechanisms

Naveen Sastry, Pete Broadwell, Jonathan Traupman, David Patterson

University of California, Berkeley


Presentation Outline

1. Introduction
  – Objective/Motivation
  – Background

2. Methods
  – Implementation
  – Test setup

3. Evaluation
  – Test results
  – Conclusions


The Berkeley/Stanford ROC Project

• Purpose: investigating novel techniques for building highly-dependable Internet services

• Example techniques:
  – Advanced support for operator undo
  – Stability through targeted restarts
  – Integrated root cause analysis
  – Online verification of recovery mechanisms


FIG Project Objective/Motivation

Objective:
• Develop a lightweight, extensible tool for injecting errors to test recovery code/mechanisms

Motivation:
• Testing and production environments are always different
• Large systems will require recovery code, which should be tested as part of normal operation


“Software’s Invisible Users”

[Diagram: the application and the components it interacts with. Labels: Application, Other libraries, Other apps, System libraries (libc), OS, User interface, User input]

Concept: Jim Whittaker, Florida Institute of Technology


Related Testing Methods

1. Ballista (DeVale, Koopman, Siewiorek)

• “Top-down” testing of POSIX-compliant OS and library interfaces

2. Fuzz (Miller, Fredriksen, So)

• Tested UNIX applications by feeding them random input streams

3. Holodeck (Whittaker et al.)

• Similar approach to ours, but only for Windows 2000/XP


FIG Implementation

• Thin stub library between app & libraries

• Traps API calls
  – Logs them
  – Inserts faults

• Can be inserted into any app without modification
  – Uses LD_PRELOAD

[Diagram: call stack, top to bottom: Application, libfig.so, libc.so and other libs, OS. Legend: normal call path, injected fault]
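The trap, log, and inject sequence can be sketched in a few lines. FIG itself is a shared library loaded through LD_PRELOAD; the Python below only mirrors the idea at the level of a wrapped function. The names `interpose`, `call_log`, `real_read`, and `faulty_read` are hypothetical and do not come from FIG.

```python
import errno
import functools
import os

# Hypothetical in-memory log; the slides say FIG hands log records to a daemon.
call_log = []


def interpose(name, should_fail, fail_errno):
    """Wrap a callable so every call is logged and may be replaced by an error."""
    def decorator(real_func):
        count = 0

        @functools.wraps(real_func)
        def stub(*args, **kwargs):
            nonlocal count
            count += 1
            if should_fail(count):
                # Injected fault: the real function is never reached.
                call_log.append((name, count, "injected " + errno.errorcode[fail_errno]))
                raise OSError(fail_errno, os.strerror(fail_errno))
            # Normal call path: log, then forward to the real implementation.
            call_log.append((name, count, "passed through"))
            return real_func(*args, **kwargs)

        return stub

    return decorator


def real_read(data, n):
    """Stand-in for a library call such as read()."""
    return data[:n]


# Fail only the second call with EIO (an assumed policy for illustration).
faulty_read = interpose("read", lambda call_no: call_no == 2, errno.EIO)(real_read)
```

The caller keeps using the same function signature, which is the property that lets a stub library sit between an unmodified application and libc.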


Extensibility

• API stubs are automatically generated

• Very easy to add new APIs to log

• Fault injection is under script control

• Can simulate multiple fault models (e.g., memory pressure)

Sample control file:

MALLOC_INDEX
interval 82 to infinity return 0 errno ENOMEM probability 0.03

OPEN_INDEX
// device out of space.
interval 100 to infinity return -1 errno ENOSPC probability 0.001
// kernel out of memory.
interval 100 to 120 return -1 errno ENOMEM probability 0.1
// too many files open.
callnumber 108 return -1 errno EMFILE probability 1.0
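A minimal sketch of how a rule from the sample control file might be parsed and evaluated. The grammar is inferred from the sample alone, and treating the interval as inclusive at both ends is an assumption. `Rule`, `parse_rule`, and `should_inject` are hypothetical names, not part of FIG.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Rule:
    first_call: int
    last_call: float      # float so that "infinity" can be stored as math.inf
    return_value: int
    errno_name: str
    probability: float


def parse_rule(line):
    """Parse one 'interval ...' or 'callnumber ...' line from the sample file."""
    tokens = line.split()
    if tokens[0] == "interval":
        # interval <first> to <last|infinity> <key> <value> ...
        first = int(tokens[1])
        last = math.inf if tokens[3] == "infinity" else int(tokens[3])
        rest = tokens[4:]
    elif tokens[0] == "callnumber":
        # callnumber <n> <key> <value> ...
        first = last = int(tokens[1])
        rest = tokens[2:]
    else:
        raise ValueError("unrecognized rule: " + line)
    # The remaining tokens are key/value pairs: return, errno, probability.
    fields = dict(zip(rest[0::2], rest[1::2]))
    return Rule(first, last, int(fields["return"]), fields["errno"],
                float(fields["probability"]))


def should_inject(rule, call_number, rng=random):
    """Inject when the call number is in range and a random draw succeeds."""
    in_range = rule.first_call <= call_number <= rule.last_call
    return in_range and rng.random() < rule.probability
```

Under this reading, the `callnumber 108 ... probability 1.0` rule fails exactly the 108th call to open(), while the `interval` rules fail a random fraction of calls within their range.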


Test Setup: Applications

• GNU file utilities (ls, mv, etc.)
• Emacs 20.7.1 (with and without X)
• Apache 1.3.22
• Berkeley DB 4.0.14
• Netscape Navigator 4.76
• MySQL server 3.23.36


Test Setup: Instrumented Calls & Their Errors

• malloc(): memory exhaustion (ENOMEM)
• read(): I/O error (EIO), system call was interrupted (EINTR)
• write(): I/O error (EIO), no space left on device (ENOSPC), call interrupted (EINTR)
• open(): memory exhaustion (ENOMEM), no space on device (ENOSPC), too many files open (EMFILE)
• select(): memory exhaustion (ENOMEM)


Test Results: Client Apps

App            read() EINTR   read() EIO   write() ENOSPC   write() EIO   select() ENOMEM   malloc() ENOMEM
Emacs, no X    o.k.           exit         warn             warn          o.k.              crash
Emacs, w/X     o.k.           crash        o.k.             crash         crash/exit        crash
Netscape       warn           exit         exit             exit          n/a               exit


Test Results: Server Apps

App                    read() EINTR   read() EIO    write() ENOSPC   write() EIO   select() ENOMEM   malloc() ENOMEM
Berkeley DB, Xact      retry          detect        Xact abort       Xact abort    n/a               Xact abort
Berkeley DB, no Xact   retry          detect        data loss        data loss     n/a               detect, or data loss
MySQL Server           Xact abort     retry, warn   Xact abort       Xact abort    retry             restart process
Apache                 o.k.           req. drop     req. drop        req. drop     o.k.              n/a


Netscape Reacts


Test Results: Overhead

Configuration            Time (s)   Overhead
No FIG                   33.46      N/A
FIG, no logging          34.28      2.5%
Logging w/o timestamps   47.83      42.9%
Logging w/timestamps     61.74      84.5%
strace (all syscalls)    112.85     237.3%

Timing using Berkeley DB (non-transactional) to read, sort and write one million words.

• Note: FIG communicates with a separate logging daemon through shared memory to reduce logging overhead.
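The overhead column follows from the timings as (time - baseline) / baseline. A short check of that arithmetic, using only the numbers on the slide (the function and variable names are illustrative):

```python
# Run times in seconds, copied from the overhead slide.
baseline_seconds = 33.46  # "No FIG"
timings = {
    "FIG, no logging": 34.28,
    "Logging w/o timestamps": 47.83,
    "Logging w/timestamps": 61.74,
    "strace (all syscalls)": 112.85,
}


def overhead_percent(seconds, baseline):
    """Relative slowdown over the baseline run, in percent."""
    return (seconds - baseline) / baseline * 100.0


for label, seconds in timings.items():
    print(f"{label}: {overhead_percent(seconds, baseline_seconds):.1f}%")
```

Rounded to one decimal place, this reproduces the slide's figures of 2.5%, 42.9%, 84.5%, and 237.3%.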


Strategies for Reliable Services:

• Intelligent retry
  – ls: “bounded retry” of malloc()

• Resource preallocation
  – Apache: allocates buffer pool at startup

• Degraded service
  – Apache: deactivates logging if disk full

• Process pools
  – Apache and MySQL
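The slides say only that ls performs a “bounded retry” of malloc(). As a rough illustration of that strategy, the sketch below retries a failing operation a fixed number of times before giving up. The attempt limit, the delay, and the name `bounded_retry` are assumptions, not details taken from ls.

```python
import time


def bounded_retry(operation, max_attempts=3, delay_seconds=0.0,
                  retry_on=(MemoryError,)):
    """Call operation(); on a retryable error, try again up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: surface the error instead of looping forever.
                raise
            # Give a transient condition (e.g., memory pressure) time to clear.
            time.sleep(delay_seconds)
```

Bounding the retries matters under fault injection: a transient ENOMEM is absorbed, while a persistent one still produces a visible failure rather than a hang.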


FIG as a Prototype for Online Error Injection

• Low run-time overhead
• Easy to enable/disable
• Easy to configure
• Extensible
• Can simulate multiple fault models


A Case for Online Error Injection

• Recovery code is not usually exercised during normal operation

• Deployed environments tend to differ from testing environments

• Can run error injection tests on a subset of deployed systems

• FIG can simulate common environmental errors


Conclusions

• FIG exposed a variety of deficiencies in how our test applications handled environmental errors

• Server apps are generally more robust than client applications

• FIG exhibits low overhead (2.5% with logging disabled)

• FIG is suitable for online error injection


Future Directions

• Limitations of FIG:
  – Only for UNIX-like OSes
  – Limited to app/library interface (proxy for app/OS interaction)

• Make FIG part of a larger test suite
• Include clock-time and event-based error triggers
• Greater flexibility in configuration file


Other Related Work

1. Xept (Vo et al.)

• Instruments object code to ensure that error handling code exists

2. Processor & memory errors

• DOCTOR, HYBRID, DEFINE

3. Process memory corruption

• FERRARI, DEFINE