Post on 11-Jan-2016
1
OS II: Dependability & Trust
SWIFI-based OS Evaluations
Dependable Embedded Systems & SW Group www.deeds.informatik.tu-darmstadt.de
Prof. Neeraj Suri
Stefan Winter
Dept. of Computer Science
TU Darmstadt, Germany
2
Fault Detection: Software Testing
• So far: Verification & Validation
  – Testing techniques
  – Static vs. dynamic
  – Black-box vs. white-box
• Last time: Fault Injection (FI)
  – Applications
  – Techniques
  – Some FI tools
• Today: Testing (SWIFI) of operating systems
  – WHERE: Error propagation in OSs [Johansson’05]
  – WHAT: Error selection for testing [Johansson’07]
  – WHEN: Injection trigger selection [Johansson’07]
• Next lecture: Profiling the OS extensions (state change @ runtime)
3
FI Recap
• Fault Injection (FI) is the process of either inserting bugs into your system or exposing your system to operational perturbations
• FI applications for dependable system development
  – Defect count estimation (fault seeding)
  – Test suite evaluation (mutation testing)
  – Security testing
  – Experimental dependability evaluations
• FI techniques
  – Physical FI
  – HW FI
  – Simulated FI
  – SWIFI
4
FI Recap (cont.)
• Where to apply change (location, abstraction/system level)
• What to inject (what should be injected/corrupted?)
• Which trigger to use (event, instruction, timeout, exception, …?)
• When to inject (on first/second/… trigger event)
• How often to inject (Heisen-/Bohrbugs)
• …
• What to record & interpret? For what purpose?
• How is the system loaded at the time of the injection?
  – Applications running and their load (workload)
  – System resources
  – Real, realistic, or synthetic workload
5
Outline for today’s lecture
• Drivers: a major dependability issue in commodity OSs
  – An error propagation view
• FI-based robustness evaluations of the kernel
  – Black-box assumption
  – Fault representativeness vs. failure relevance
• Design and implementation issues of a suitable FI framework
  – Fault modeling
  – Failure modeling
  – Workloads
6
The problem: Drivers!
• Device drivers
  – Numerous: 250 installed (100 active) drivers in XP/Vista
  – Large & complex: 70% of Linux code base
  – Immature: every day 25 new / 100 revised versions of Vista drivers
  – Access rights: kernel-mode operation in monolithic OSs
• Device drivers are the dominant cause of OS failures despite sustained testing efforts
[Figures: Causes of WinXP outages; Causes of Win2k outages]
7
The problem (cont.)
• Problem statement: Driver failures lead to OS API failures
• Mitigation approaches
  1. Harden OS robustness
  2. Improve driver reliability
8
The problem (cont.)
The problem in terms of error propagation
The effect of testing in terms of error propagation
The effect of robustness hardening in terms of error propagation
9
Issues with the driver testing approach
What if the driver is not the root cause?
What if we cannot remove defects (e.g. commercial OSs)?
10
Issues with the hardening approach
What if we cannot remove robustness vulnerabilities?
More issues with the hardening approach in next week’s lecture...
11
FI-based robustness evaluations
• Fault containment wrappers are expensive
  – Additional code is an additional source of bugs
  – Runtime overhead for error checks
• Where should we add fault containment wrappers?
  – Where errors with critical effects are likely to occur
  – Where propagation is likely
  – Where critical errors propagate
• How do we know where which errors propagate?
  – Propagation analysis (cf. PROPANE)
12
Robustness Evaluations
[Figure: component graph with nodes A to F, and a ranking of the same components along an axis labeled "Increasingly bad"]
13
Robustness Evaluations
• Experimental technique to ascertain “vulnerabilities”
  – Identify (potential) sources, error propagation & hot spots, etc.
  – Estimate their “effects” on applications
  – Component enhancement with “wrappers”
    • if (X > 100 && Y < 30) then Exception();
    • Location of wrappers
• Aspects
  – Metrics for error propagation profiles
  – Experimental analysis
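The wrapper check on this slide can be sketched in C as follows. This is a minimal illustration, not the lecture's implementation: `service`, `wrapped_service`, and the status codes are hypothetical names, and since C has no exceptions the slide's `Exception()` is modeled as an error status.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical protected service; stands in for any wrapped component. */
static int service(int x, int y) {
    return x + y;
}

/* Status codes returned by the wrapper (illustrative names). */
enum wrap_status { WRAP_OK = 0, WRAP_REJECTED = 1 };

/* Predicate from the slide: X > 100 && Y < 30 is treated as erroneous. */
static bool is_erroneous(int x, int y) {
    return x > 100 && y < 30;
}

/* Fault containment wrapper: vet the inputs, then forward the call.
 * A rejected call never reaches the service, so the error is contained. */
enum wrap_status wrapped_service(int x, int y, int *result) {
    assert(result != NULL);
    if (is_erroneous(x, y)) {
        return WRAP_REJECTED;
    }
    *result = service(x, y);
    return WRAP_OK;
}
```

The extra check is exactly the cost named above: more code that can itself be wrong, and a runtime test on every call.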
14
System Model
Applications
Operating System
Drivers
?
15
Device Driver
• Model the interfaces (defined in C)
  – Export (functions provided by the driver)
  – Import (functions used by the driver)
[Figure: Driver X between the OS and the hardware, with exported services ds_x.1 … ds_x.m and imported services os_x.1 … os_x.n]
16
Metrics
• Three metrics for profiling
  1. Propagation: how errors flow through the OS
  2. Exposure: which OS services are affected
  3. Diffusion: which drivers are the sources
• Impact analysis
  – Metrics
  – Case study (WinCE)
  – Results
17
Service Error Permeability
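1. Service Error Permeability:
  – Measures one driver’s influence on one OS service
  – Used to study service-driver relations

$$POS_i^{x.z} = \Pr(\text{error in } s_i \mid \text{error in } os_{x.z})$$
$$PDS_i^{x.y} = \Pr(\text{error in } s_i \mid \text{error in } ds_{x.y})$$

(Driver $D_x$, OS service $s_i$)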
18
OS Service Error Exposure
2. OS Service Error Exposure:
  – An application uses certain services
  – How are these services influenced by driver errors?
  – Used to compare services

$$E^{i} = \sum_{D_x} \sum_{os_{x.j}} POS_i^{x.j} + \sum_{D_x} \sum_{ds_{x.j}} PDS_i^{x.j}$$

(Driver $D_x$, OS service $s_i$)
19
Driver Error Diffusion
3. Driver Error Diffusion:
  – Which driver affects the system the most?
  – Used to compare drivers

$$D^{x} = \sum_{s_i} \sum_{os_{x.j}} POS_i^{x.j} + \sum_{s_i} \sum_{ds_{x.j}} PDS_i^{x.j}$$

(Driver $D_x$, OS service $s_i$)
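The three metrics can be sketched in C as sums over a table of permeability estimates. This is a sketch under stated assumptions: the array sizes, the flat layout, and every function name are illustrative. Imported and exported driver services are pooled into one index `j` for brevity, and permeability is estimated as the ratio of observed errors to injections, one plausible estimator rather than one taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define N_DRIVERS 2  /* drivers D_x (illustrative size) */
#define N_DSVC    3  /* interface services per driver (os_x.j and ds_x.j pooled) */
#define N_OSSVC   2  /* OS services s_i (illustrative size) */

/* Position of the estimate for (driver x, driver service j, OS service i). */
static size_t idx(size_t x, size_t j, size_t i) {
    return (x * N_DSVC + j) * N_OSSVC + i;
}

/* Assumed estimator: fraction of injections that produced an error in s_i. */
double permeability(unsigned n_errors, unsigned n_injections) {
    return n_injections == 0 ? 0.0 : (double)n_errors / (double)n_injections;
}

/* Exposure of OS service s_i: sum over all drivers and their services. */
double exposure(const double *perm, size_t i) {
    assert(perm != NULL && i < N_OSSVC);
    double sum = 0.0;
    for (size_t x = 0; x < N_DRIVERS; ++x)
        for (size_t j = 0; j < N_DSVC; ++j)
            sum += perm[idx(x, j, i)];
    return sum;
}

/* Diffusion of driver D_x: sum over all OS services and the driver's services. */
double diffusion(const double *perm, size_t x) {
    assert(perm != NULL && x < N_DRIVERS);
    double sum = 0.0;
    for (size_t i = 0; i < N_OSSVC; ++i)
        for (size_t j = 0; j < N_DSVC; ++j)
            sum += perm[idx(x, j, i)];
    return sum;
}
```

Both metrics sum the same table along different axes, which is why exposure compares services and diffusion compares drivers.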
20
Case Study: Windows CE
• Targeted drivers
  – Serial
  – Ethernet
• FI at interface
  – Data-level errors
• Effects on OS services
  – 4 test applications
[Figure: experimental setup with Host, Manager, Test App, OS, Interceptor, Target Driver, and other Drivers]
21
Error Model
• Data-level errors in the OS-Driver interface
  – Wrong values
  – Based on the C type
    • Boundary
    • Special values
    • Offsets
  – Transient
  – First occurrence
22
Impact Analysis
• Impact ascertained via failure mode analysis
• Failure classes:
  – Class NF: No visible effect
  – Class 1: Error, no violation
  – Class 2: Error, violation
  – Class 3: OS crash/hang
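The failure classes map naturally onto a small classification routine. The sketch below assumes a hypothetical per-experiment record with three observations; the field and function names are illustrative and not taken from the experimental framework.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum failure_class { CLASS_NF = 0, CLASS_1 = 1, CLASS_2 = 2, CLASS_3 = 3 };

/* Hypothetical record of what was observed after one injection. */
struct observation {
    bool error_propagated;    /* error visible at the OS service level */
    bool spec_violated;       /* OS service specification violated */
    bool os_crashed_or_hung;  /* OS crash or hang */
};

/* Maps an observation to the most severe applicable failure class. */
enum failure_class classify(const struct observation *o) {
    assert(o != NULL);
    if (o->os_crashed_or_hung)
        return CLASS_3;
    if (o->error_propagated && o->spec_violated)
        return CLASS_2;
    if (o->error_propagated)
        return CLASS_1;
    return CLASS_NF;
}
```

Checking the most severe class first keeps the classes mutually exclusive: a crash counts as Class 3 regardless of what else was observed.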
23
Error Model
Error type   C type            #cases
Integers     int               7
             unsigned int      5
             long              7
             unsigned long     5
             short             7
             unsigned short    5
             LARGE_INTEGER     7
Void         void *            3
Chars        char              7
             unsigned char     5
             wchar_t           5
Boolean      bool              1
Enums        multiple          #ident’s
Structs      multiple          1

Case #   New value
1        previous – 1
2        previous + 1
3        1
4        0
5        -1
6        INT_MIN
7        INT_MAX

LONG RegQueryValueEx([in] HKEY hKey,
                     [in] LPCWSTR lpValueName,
                     [in] LPDWORD lpReserved,
                     [out] LPDWORD lpType,
                     [out] LPBYTE lpData,
                     [in/out] LPDWORD lpcbData);
24
Service Error Permeability
• Ethernet driver
  – 42 imported svcs
  – 12 exported svcs
• Most Class 1
• 3 crashes (Class 3)
25
OS Service Error Exposure
• Serial driver
  – 50 imported svcs
  – 10 exported svcs
• Clustering of failures
26
Driver Error Diffusion
• Higher diffusion for Ethernet
• Most Class NF
• Failures at boot-up

                       Ethernet     Serial
#Experiments           414          411
#Injections            228          187
#Class NF              330 (80%)    377 (92%)
#Class 1               80 (19%)     25 (7%)
#Class 2               1            9
#Class 3               3            0
Diffusion, Class 1     0.616        0.460
Diffusion, Class 2     0.002        0.022
Diffusion, Class 3     0.007        0
27
Error Models: “What to Inject?”
• FI’s effectiveness depends on the chosen error model being (a) representative of actual errors, and (b) effective at triggering “vulnerabilities”.
• Comparative evaluation of the “effectiveness” of different error models:
  – Fewest injections?
  – Most failures?
  – Best “coverage”?
• Propose a composite error model for enhancing FI effectiveness
28
Chosen Drivers & Error Models
• Error models:
  – Data-type (DT)
  – Bit-flips (BF)
  – Fuzzing (FZ)

Driver          Description    #Injection cases
                               DT     BF     FZ
cerfio_serial   Serial port    397    2362   1410
91C111          Ethernet       255    1722   1050
atadisk         CompactFlash   294    1658   1035
29
Error Models – Data-Type (DT) Errors
int foo(int a, int b) {…}
int ret = foo(0x45a209f1, 0x00000000);
30
Error Models – Data-Type (DT) Errors
Case   New value
1      Previous – 1
2      Previous + 1
3      1
4      0
5      -1
6      INT_MIN (0x80000000)
7      INT_MAX
int foo(int a, int b) {…}
int ret = foo(0x45a209f1, 0x00000000);
31
Error Models – Data-Type (DT) Errors
• Varied #cases depending on the data type
• Requires tracking of the types for correct injection
• Complex implementation but scales well
int foo(int a, int b) {…}
int ret = foo(0x80000000, 0x00000000);
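The seven `int` cases of the data-type model can be generated with a few lines of C. This is only a sketch: `dt_int_cases` and `DT_INT_CASES` are hypothetical names, and the wrap-around for `previous – 1` and `previous + 1` at the type's limits is an assumption made here to avoid signed overflow.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DT_INT_CASES 7

/* Fills out[0..6] with the seven data-type test cases for an int parameter
 * whose original (intercepted) value is prev. */
void dt_int_cases(int prev, int out[DT_INT_CASES]) {
    assert(out != NULL);
    /* Offsets around the original value; wrap explicitly at the limits
     * because signed overflow is undefined behaviour in C. */
    out[0] = (prev == INT_MIN) ? INT_MAX : prev - 1;
    out[1] = (prev == INT_MAX) ? INT_MIN : prev + 1;
    /* Special values */
    out[2] = 1;
    out[3] = 0;
    out[4] = -1;
    /* Boundary values */
    out[5] = INT_MIN;
    out[6] = INT_MAX;
}
```

Each C type needs its own case list, which is why the injector has to track parameter types.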
32
Error Models – Data-Type (DT) Errors
Data type    C type            #Cases
Integers     int               7
             unsigned int      5
             long              7
             unsigned long     5
             short             7
             unsigned short    5
             LARGE_INTEGER     7
Misc.        void *            3
             HKEY              6
             struct {…}        multiple
             Strings           4
Characters   char              7
             unsigned char     5
             wchar_t           5
Boolean      bool              1
Enums        multiple          multiple cases
33
Error Models – Bit-Flip (BF) Errors
int foo(int a, int b) {…}
int ret = foo(0x45a209f1, 0x00000000);
34
Error Models – Bit-Flip (BF) Errors
int foo(int a, int b) {…}
int ret = foo(0x45a209f1, 0x00000000);
1000101101000100000100111110001
35
Error Models – Bit-Flip (BF) Errors
int foo(int a, int b) {…}
int ret = foo(0x45a209f1, 0x00000000);
1000101101000101000100111110001
1000101101000100000100111110001
36
Error Models – Bit-Flip (BF) Errors
int foo(int a, int b) {…}
int ret = foo(0x45a289f1, 0x00000000);
• Typically 32 cases per parameter
• Easy to implement
1000101101000101000100111110001
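A bit-flip injector for a 32-bit parameter is a single XOR. The sketch below uses hypothetical names (`flip_bit`, `bf_cases`) and assumes the parameter is handled as a `uint32_t`. Flipping bit 15 of `0x45a209f1` gives the `0x45a289f1` shown on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BF_CASES 32  /* one case per bit of a 32-bit parameter */

/* Returns value with exactly one bit inverted (bit 0 is the LSB). */
uint32_t flip_bit(uint32_t value, unsigned bit) {
    assert(bit < 32);
    return value ^ (UINT32_C(1) << bit);
}

/* Fills out[b] with value after flipping bit b, for every bit position. */
void bf_cases(uint32_t value, uint32_t out[BF_CASES]) {
    assert(out != NULL);
    for (unsigned b = 0; b < BF_CASES; ++b) {
        out[b] = flip_bit(value, b);
    }
}
```

No type information is needed beyond the parameter width, which is what makes this model easy to implement.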
37
Error Models – Fuzzing (FZ) Errors
int foo(int a, int b) {…}
int ret = foo(0x45a209f1, 0x00000000);
38
Error Models – Fuzzing (FZ) Errors
int foo(int a, int b) {…}
int ret = foo(0x45a209f1, 0x00000000);
0x17af34c2
39
Error Models – Fuzzing (FZ) Errors
int foo(int a, int b) {…}
int ret = foo(0x17af34c2, 0x00000000);
• Selective #cases
• Simple implementation
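Fuzzing replaces the parameter with a pseudo-random value. The sketch below uses a xorshift32 generator as a stand-in so that runs are reproducible; the generator choice and the names `xorshift32` and `fz_cases` are assumptions for illustration, not what the study used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Marsaglia's xorshift32: a small deterministic PRNG. The state must be
 * nonzero; a nonzero state never produces zero. */
uint32_t xorshift32(uint32_t *state) {
    assert(state != NULL && *state != 0);
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fills out[0..n-1] with n fuzz values that replace the original parameter.
 * The same seed always yields the same cases, so experiments can be replayed. */
void fz_cases(uint32_t seed, uint32_t *out, size_t n) {
    assert(out != NULL && seed != 0);
    uint32_t state = seed;
    for (size_t k = 0; k < n; ++k) {
        out[k] = xorshift32(&state);
    }
}
```

The number of cases `n` is the tunable knob: unlike bit-flips, nothing in the model fixes how many values to try per parameter.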
40
Comparison
• Compare error models on:
  – Number of failures
  – Effectiveness
  – Experimentation time
  – Identifying services
  – Error propagation
42
Failure Classes & Driver Diffusion
Failure Class   Description
No Failure      No observable effect
Class 1         Error propagated, but still satisfied the OS service specification
Class 2         Error propagated and violated the service specification
Class 3         The OS hung or crashed

Driver Diffusion: a measure of a driver’s ability to spread errors:

$$D^{x} = \sum_{s_i} \sum_{ds_{x.y}} PDS_i^{x.y}$$

(Driver $D_x$, OS service $s_i$)
43
Number of Failures (Class 3)
[Figure: bar chart of the number of Class 3 failures (axis 0 to 80) for the DT, BF, and FZ error models on cerfio_serial, 91C111, and atadisk]
44
Failure Classes & Driver Diffusion
Driver Diffusion (Class 3)

Drivers         DT     BF     FZ
cerfio_serial   1.50   1.05   1.56
91C111          0.73   0.98   0.69
atadisk         0.63   1.86   0.29

[Figure: stacked bars (0% to 100%) showing the share of No failure, Class 1, Class 2, and Class 3 outcomes for DT, BF, and FZ on atadisk, 91C111, and cerfio_serial]
45
Experimentation Time
Driver          Error model   Exec. time
                              h     min
cerfio_serial   DT            5     15
                BF            38    14
                FZ            20    44
91C111          DT            1     56
                BF            17    20
                FZ            7     48
atadisk         DT            2     56
                BF            20    51
                FZ            11    55
46
Identifying Services (Class 3)
Which OS services can cause Class 3 failures?
Which error model identifies most services (coverage)?
Is some model consistently better/worse?
Can we combine models?
Service DT BF FZ
1 X
2 X X
3 X
4 X X
5 X
6 X X
7 X X
8 X X
9 X X X
10 X X X
11 X X X
12 X
13 X
14 X X X
15 X
16 X X X
17 X
18 X
47
Identifying Services (Class 3 + 2)
Which OS services can cause Class 3 failures?
Which error model identifies most services (coverage)?
Is some model consistently better/worse?
Can we combine models?
Service DT BF FZ
1 O X O
2 X X O
3 X O
4 X X
5 X
6 X X
7 X X O
8 X X
9 X X X
10 X X X
11 X X X
12 O X
13 X
14 X X X
15 X
16 X X X
17 X
18 X
48
Bit-Flips: Sensitivity to Bit Position?
[Figure: number of services identified (axis 0 to 10) per flipped bit position, from LSB to MSB]
49
Bit-Flips: Bit Position Profile
[Figure: cumulative number of services identified (axis 0 to 18) vs. bit position]
50
Fuzzing – Number of injections?
[Figure: diffusion (axis 0.2 to 2.0) vs. number of injections (1 to 15) for 91C111, cerfio_serial, and atadisk]
51
Composite Error Model
• Let’s take the best of bit-flips and fuzzing
  – Bit-flips: bits 0-9 and 31
  – Fuzzing: 10 cases
• ~50% fewer injections
• Identifies the same service set
[Figure: number of injections (axis 500 to 3500) per driver (cerfio_serial, 91C111, atadisk), all BF & FZ vs. composite]
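A composite case list for one 32-bit parameter can be sketched as below: bit-flips on bits 0-9 and 31, followed by 10 fuzz values, for 21 cases in total. The function and macro names are hypothetical, and the xorshift32 generator is an assumed stand-in for whatever random source the study used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CM_BF_CASES 11  /* bits 0..9 plus bit 31 */
#define CM_FZ_CASES 10  /* fuzz values */
#define CM_CASES    (CM_BF_CASES + CM_FZ_CASES)

/* xorshift32 PRNG (assumed stand-in); the state must be nonzero. */
static uint32_t next_random(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fills out[] with the composite cases for the original parameter value:
 * first the selected bit-flips, then the fuzz values. */
void cm_cases(uint32_t value, uint32_t seed, uint32_t out[CM_CASES]) {
    assert(out != NULL && seed != 0);
    size_t n = 0;
    for (unsigned bit = 0; bit <= 9; ++bit) {
        out[n++] = value ^ (UINT32_C(1) << bit);
    }
    out[n++] = value ^ (UINT32_C(1) << 31);
    uint32_t state = seed;
    for (size_t k = 0; k < CM_FZ_CASES; ++k) {
        out[n++] = next_random(&state);
    }
    assert(n == CM_CASES);
}
```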
52
Composite Error Model – Results
[Figure: stacked bars (0% to 100%) showing the share of No failure, Class 1, Class 2, and Class 3 outcomes for DT, BF, FZ, and CM on atadisk, 91C111, and cerfio_serial]
53
Summary
• Comparison across three well-established error models + CM
  – Data-type
  – Bit-flips
  – Fuzzing
• Models compared on implementation, coverage, and execution
  – DT: requires tracking data types; requires few experiments
  – BF: found the most Class 3 failures; requires many experiments
  – FZ: finds additional services
  – CM: profiling gives combined BF & FZ with high coverage
• Outlook:
  – When to do the injection?
  – More drivers, OSs, models?
59
On the Impact of Injection Triggers for OS Robustness Evaluation
Andréas Johansson, Neeraj Suri
Department of Computer Science
Technische Universität Darmstadt
Germany
DEEDS: Dependable Embedded Systems & SW Group www.deeds.informatik.tu-darmstadt.de
Brendan Murphy
Microsoft Research
Cambridge
UK
Presented at ISSRE 2007
60
Operating System Robustness
• Operating system
  – Key operational element
  – Used in virtually all environments → robustness!
  – Drivers are a major source of failures [1] [2]
[1] Ganapathi et al., LISA’06
[2] Chou et al., SOSP’01
61
Operating System Robustness
• External faults
  – Robustness
  – Drivers
  – Interfaces
• Experimental
  – Fault injection
  – Run-time
• Interface
  – OS-Driver
  – No source code
• Goal
  – Identify services with robustness issues
  – Identify drivers spreading errors
[Figure: layered system model with Applications, OS, and Drivers]
62
Operating System Robustness
• The issues behind FI-based OS robustness
  – Where to inject? [3]
  – What to inject? [4]
  – When to inject? [today]
• Outline
  – Problem definition
  – Call strings and call blocks
  – System and error model
  – Experimental setup and method
  – Results
[3] Johansson et al., DSN’05
[4] Johansson et al., DSN’07
63
Fault Injection
• Target: OS-Driver interface
• Each call is a potential injection
• Problem: too many calls
  – First-occurrence
  – Sample (uniform?)
[Figure: Service invocations]
64
Fault Injection
• Observation: calls are not made randomly
  – Repeating sequences of calls
• Idea: select calls based on “operations”
  – Identify subsequences, select services
65
Call Strings & Call Blocks
• Call string
  – List of tokens (invocations) to a specific driver
• Call block
  – Subsequence of a call string
  – May be repeating
  – Corresponds to a higher-level “operation”
  – Used as trigger for injection
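Spotting a repeating call block in a recorded call string can be sketched as follows. The sketch assumes each token is a single character and that the candidate block's position and length are already known; `count_repeats` is a hypothetical helper, not part of the described tooling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts how many times the block of block_len tokens starting at pos occurs
 * back-to-back in the call string. Returns 0 if the block does not fit. */
size_t count_repeats(const char *calls, size_t pos, size_t block_len) {
    assert(calls != NULL && block_len > 0);
    size_t len = strlen(calls);
    if (pos > len || block_len > len - pos) {
        return 0;
    }
    size_t count = 1;
    size_t next = pos + block_len;
    /* Keep advancing while the next block_len tokens equal the first block. */
    while (block_len <= len - next &&
           memcmp(calls + pos, calls + next, block_len) == 0) {
        ++count;
        next += block_len;
    }
    return count;
}
```

For the call string `D02775747747747732`, the block `747` starting at index 6 repeats three times, which a compressed notation could write as `(747){3}`.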
66
System and Error Model
• Error model: bit-flips
  – Shown to be effective
  – Simple to implement
• Injection
  – Function parameter values
67
Experimental Process
• Execute workload
  – Record call string
• Extract call blocks
  – Select service targets (1 per call block)
• Define triggers
  – Based on tracking call blocks
• Perform injections
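A trigger that tracks a call block at run time can be sketched as a small state machine. It is fed every intercepted call and fires once, when the target call inside the tracked block is reached. The struct, the function names, and the one-character-per-token encoding are assumptions for illustration. Matching restarts naively on a mismatch, so self-overlapping blocks may be missed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct block_trigger {
    const char *block;  /* tokens of the tracked call block */
    size_t len;         /* number of tokens in the block */
    size_t target;      /* index in the block of the call to inject into */
    size_t pos;         /* tokens of the block matched so far */
    bool fired;         /* inject only once per experiment */
};

void trigger_init(struct block_trigger *t, const char *block, size_t target) {
    assert(t != NULL && block != NULL);
    t->block = block;
    t->len = strlen(block);
    assert(t->len > 0 && target < t->len);
    t->target = target;
    t->pos = 0;
    t->fired = false;
}

/* Feed one intercepted call; returns true if this call should be injected. */
bool trigger_on_call(struct block_trigger *t, char token) {
    assert(t != NULL);
    if (token != t->block[t->pos]) {
        t->pos = 0;  /* mismatch: restart matching from the block start */
    }
    if (token != t->block[t->pos]) {
        return false;  /* token neither continues nor starts the block */
    }
    bool hit = (t->pos == t->target) && !t->fired;
    t->pos = (t->pos + 1) % t->len;
    if (hit) {
        t->fired = true;
    }
    return hit;
}
```

The trigger fires as soon as the block's prefix up to the target call has been seen, because at run time the injection must happen before the rest of the block executes.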
68
Injection Setup
• Target OS: Windows CE .Net
• Target HW: XScale 255
69
Failure Classes
Failure Class   Description
No Failure      No observable effect
Class 1         Error propagated, but still satisfied the OS service specification
Class 2         Error propagated and violated the service specification
Class 3         The OS hung or crashed
70
Selected Drivers
• Serial port driver
• Ethernet card driver
• Workload/driver phases:
71
Serial Driver Call String and Call Blocks
• Call string:
  D02775(747){23}732775(747){23}23
• Phases: Init, Working, Clean up
72
Ethernet Driver Call String and Call Blocks
73
Driver Profiles
• Driver invocation patterns differ
• Impact of call block injection efficiency
[Figures: Serial; Ethernet]
74
Serial Driver Results
75
Serial Driver Service Identification
FO δ α β1 γ1 ω1 β2 γ2 ω2
CreateThread x x x
DisableThreadLibraryCalls x x
EventModify x x
FreeLibrary x x
HalTranslateBusAddress x
InitializeCriticalSection x
InterlockedDecrement x
LoadLibrary x x
LocalAlloc x x
memcpy x x x
memset x x x
SetProcPermissions x x x
TransBusAddrToStatic x
76
Ethernet Driver Results
Trigger       Serial               Ethernet
              #Injections   #C3    #Injections   #C3
First Occ.    2436          8      1820          12
Call Blocks   8408          13     2356          12
77
Summary
• Where, What & When?
  – New timing model for interface fault injection
  – Faults in device drivers
  – Based on call strings & call blocks
• Results
  – Significant difference
  – More services
  – Driver dependent
  – Driver profiling
  – More injections (2436 vs. 8408)
  – Focus on init/clean up?
78
Discussion & Outlook
• Call block identification
  – Scalability?
  – New data structures (suffix trees)
• Call block selection
  – Working phase vs. initial/clean up
• Determinism & concurrency
• Workload selection
• Error models