Operating Systems Meet Fault Tolerance
-
Upload
vasily-sartakov -
Category
Education
-
view
140 -
download
0
description
Transcript of Operating Systems Meet Fault Tolerance
Operating Systems Meet Fault Tolerance
Bjorn Dobel
06.08.2013
Bjorn Dobel () OS Resilience 06.08.2013 1 / 58
Introduction
“If there’s more than one possible outcome of a job or task,
and one of those outcome will result in disaster or an
undesirable consequence, then somebody will do it that way.”
(Edward Murphy jr.)
Bjorn Dobel () OS Resilience 06.08.2013 2 / 58
Introduction
Outline
Murphy and the OS: Is it really that bad?
Fault-Tolerant Operating SystemsMinix3CuriOSL4ReAnimator
Dealing with Hardware ErrorsTransparent replication as an OS service
Bjorn Dobel () OS Resilience 06.08.2013 3 / 58
Introduction
Why Things go Wrong
Programming in C:
This pointer is certainly never going to be NULL!
Layering vs. responsibility:
Of course, someone in the higher layers will already have checkedthis return value.
Concurrency:
This struct is shared between an IRQ handler and a kernel thread.But they will never execute in parallel.
Hardware interaction:But the device spec said, this was not allowed to happen!
Hypocrisy:
I’m a cool OS hacker. I won’t make mistakes, so I don’t need totest my code!
Bjorn Dobel () OS Resilience 06.08.2013 4 / 58
Introduction
Why Things go Wrong
Programming in C:
This pointer is certainly never going to be NULL!
Layering vs. responsibility:
Of course, someone in the higher layers will already have checkedthis return value.
Concurrency:
This struct is shared between an IRQ handler and a kernel thread.But they will never execute in parallel.
Hardware interaction:But the device spec said, this was not allowed to happen!
Hypocrisy:
I’m a cool OS hacker. I won’t make mistakes, so I don’t need totest my code!
Bjorn Dobel () OS Resilience 06.08.2013 4 / 58
Introduction
Why Things go Wrong
Programming in C:
This pointer is certainly never going to be NULL!
Layering vs. responsibility:
Of course, someone in the higher layers will already have checkedthis return value.
Concurrency:
This struct is shared between an IRQ handler and a kernel thread.But they will never execute in parallel.
Hardware interaction:But the device spec said, this was not allowed to happen!
Hypocrisy:
I’m a cool OS hacker. I won’t make mistakes, so I don’t need totest my code!
Bjorn Dobel () OS Resilience 06.08.2013 4 / 58
Introduction
Why Things go Wrong
Programming in C:
This pointer is certainly never going to be NULL!
Layering vs. responsibility:
Of course, someone in the higher layers will already have checkedthis return value.
Concurrency:
This struct is shared between an IRQ handler and a kernel thread.But they will never execute in parallel.
Hardware interaction:But the device spec said, this was not allowed to happen!
Hypocrisy:
I’m a cool OS hacker. I won’t make mistakes, so I don’t need totest my code!
Bjorn Dobel () OS Resilience 06.08.2013 4 / 58
Introduction
Why Things go Wrong
Programming in C:
This pointer is certainly never going to be NULL!
Layering vs. responsibility:
Of course, someone in the higher layers will already have checkedthis return value.
Concurrency:
This struct is shared between an IRQ handler and a kernel thread.But they will never execute in parallel.
Hardware interaction:But the device spec said, this was not allowed to happen!
Hypocrisy:
I’m a cool OS hacker. I won’t make mistakes, so I don’t need totest my code!
Bjorn Dobel () OS Resilience 06.08.2013 4 / 58
Introduction
A Classic Study
A. Chou et al.: An empirical study of operating system errors, SOSP 2001
Automated software error detection (today: http://www.coverity.com)
Target: Linux (1.0 - 2.4)Where are the errors?How are they distributed?How long do they survive?Du bugs cluster in certain locations?
Bjorn Dobel () OS Resilience 06.08.2013 5 / 58
Introduction
Revalidation of Chou’s Results
N. Palix et al.: Faults in Linux: Ten years later, ASPLOS 2011
10 years of work on tools to decrease error counts - has it worked?
Repeated Chou’s analysis until Linux 2.6.34
Bjorn Dobel () OS Resilience 06.08.2013 6 / 58
Introduction
Linux: Lines of Code
Bjorn Dobel () OS Resilience 06.08.2013 7 / 58
Introduction
Fault Rate per Subdirectory (2001)
Bjorn Dobel () OS Resilience 06.08.2013 8 / 58
Introduction
Fault Rate per Subdirectory (2011)
Bjorn Dobel () OS Resilience 06.08.2013 9 / 58
Introduction
Bug Lifetimes (2011)
Bjorn Dobel () OS Resilience 06.08.2013 10 / 58
Introduction
Bug Distribution
Bjorn Dobel () OS Resilience 06.08.2013 11 / 58
Introduction
Break
Faults are an issue.
Hardware-related stu↵ is worst.
Now what can the OS do about it?
Bjorn Dobel () OS Resilience 06.08.2013 12 / 58
Introduction
Minix3 – A Fault-tolerant OS
Userprocesses User Processes
Server Processes
Device Processes Disk TTY Net Printer Other
File PM Reinc ... Other
Shell Make User ... Other
Kernel
Kernel Clock Task System Task
Bjorn Dobel () OS Resilience 06.08.2013 13 / 58
Introduction
Minix3: Fault Tolerance
Address Space IsolationApplications only access private memoryFaults do not spread to other components
User-level OS servicesPrinciple of Least PrivilegeFine-grain control over resource access
e.g., DMA only for specific drivers
Small componentsEasy to replace (micro-reboot)
Bjorn Dobel () OS Resilience 06.08.2013 14 / 58
Introduction
Minix3: Fault Detection
Fault model: transient errors caused by software bugs
Fix: Component restart
Reincarnation server monitors componentsProgram termination (crash)CPU exception (div by 0)Heartbeat messages
Users may also indicate that something is wrong
Bjorn Dobel () OS Resilience 06.08.2013 15 / 58
Introduction
Repair
Restarting a component is insu�cient:Applications may depend on restarted componentAfter restart, component state is lost
Minix3: explicit mechanismsReincarnation server signals applications about restartApplications store state at data store serverIn any case: program interaction needed
Restarted app: store/recover stateUser apps: recover server connection
Bjorn Dobel () OS Resilience 06.08.2013 16 / 58
Introduction
Break
Minix3 fault tolerance:Architectural IsolationExplicit monitoring and notifications
Other approaches:CuriOS: smart session state handlingL4ReAnimator: semi-transparent restart in a capability-based system
Bjorn Dobel () OS Resilience 06.08.2013 17 / 58
Introduction
CuriOS: Servers and Sessions
State recovery is trickyMinix3: Data Store for application dataBut: applications interact
Servers store session-specific stateServer restart requires potential rollback for every participant
ServerState
ClientAState
ClientBStateServer
Client A
Client B
Bjorn Dobel () OS Resilience 06.08.2013 18 / 58
Introduction
CuriOS: Server State Regions
CuriOS kernel (CuiK) manages dedicated session memory: Server StateRegions
SSRs are managed by the kernel and attached to a client-server connection
ServerState
Server
Client A
Client State A
Client B
Client State B
Bjorn Dobel () OS Resilience 06.08.2013 19 / 58
Introduction
CuriOS: Protecting Sessions
SSR gets mapped only when a client actually invokes the server
Solves another problem: failure while handling A’s request will never corruptB’s session state
ServerState
Server
Client A
Client State A
Client B
Client State B
Bjorn Dobel () OS Resilience 06.08.2013 20 / 58
Introduction
CuriOS: Protecting Sessions
SSR gets mapped only when a client actually invokes the server
Solves another problem: failure while handling A’s request will never corruptB’s session state
ServerState
Server
Client A
Client State A
Client B
Client State B
c
a
l
l
(
)
Bjorn Dobel () OS Resilience 06.08.2013 20 / 58
Introduction
CuriOS: Protecting Sessions
SSR gets mapped only when a client actually invokes the server
Solves another problem: failure while handling A’s request will never corruptB’s session state
ServerState
Server
Client A
Client State A
Client B
Client State B
ClientAState
c
a
l
l
(
)
Bjorn Dobel () OS Resilience 06.08.2013 20 / 58
Introduction
CuriOS: Protecting Sessions
SSR gets mapped only when a client actually invokes the server
Solves another problem: failure while handling A’s request will never corruptB’s session state
ServerState
Server
Client A
Client State A
Client B
Client State B
r
e
p
l
y
(
)
Bjorn Dobel () OS Resilience 06.08.2013 20 / 58
Introduction
CuriOS: Protecting Sessions
SSR gets mapped only when a client actually invokes the server
Solves another problem: failure while handling A’s request will never corruptB’s session state
ServerState
Server
Client A
Client State A
Client B
Client State Bc
a
l
l
(
)
Bjorn Dobel () OS Resilience 06.08.2013 20 / 58
Introduction
CuriOS: Protecting Sessions
SSR gets mapped only when a client actually invokes the server
Solves another problem: failure while handling A’s request will never corruptB’s session state
ServerState
Server
Client A
Client State A
Client B
Client State BClientBState
c
a
l
l
(
)
Bjorn Dobel () OS Resilience 06.08.2013 20 / 58
Introduction
CuriOS: Protecting Sessions
SSR gets mapped only when a client actually invokes the server
Solves another problem: failure while handling A’s request will never corruptB’s session state
ServerState
Server
Client A
Client State A
Client B
Client State Br
e
p
l
y
(
)
Bjorn Dobel () OS Resilience 06.08.2013 20 / 58
Introduction
CuriOS: Transparent Restart
CuriOS is a Single-Address-Space OS:Every application runs on the same page table (with modified access rights)
OS A A B BShared Mem
OS A A B BB Running
OS A A B BA Running
OS A A B BVirt. Memory
Bjorn Dobel () OS Resilience 06.08.2013 21 / 58
Introduction
Transparent Restart
Single Address SpaceEach object has unique addressIdentical in all programsServer := C++ object
RestartReplace old C++ object with new oneReuse previous memory locationReferences in other applications remain validOS blocks access during restart
Bjorn Dobel () OS Resilience 06.08.2013 22 / 58
Introduction
L4ReAnimator: Restart on L4Re
L4Re ApplicationsLoader component: nedDetects application termination: parent signalRestart: re-execute Lua init script (or parts of it)
Problem after restart: capabilitiesNo single component knows everyone owning a capability to an objectMinix3 signals won’t work
Bjorn Dobel () OS Resilience 06.08.2013 23 / 58
Introduction
L4Re: Session Creation
Client Server
Loader
SessionCreationCapability
Bjorn Dobel () OS Resilience 06.08.2013 24 / 58
Introduction
L4Re: Session Creation
Client Server
Loader
(1)create
Bjorn Dobel () OS Resilience 06.08.2013 24 / 58
Introduction
L4Re: Session Creation
Client Server
Loader
(2) Mappedduring startup
Bjorn Dobel () OS Resilience 06.08.2013 24 / 58
Introduction
L4Re: Session Creation
Client Server
Loader
(3) factory.create()
Bjorn Dobel () OS Resilience 06.08.2013 24 / 58
Introduction
L4Re: Session Creation
Client Server
Loader
Session Capability
Bjorn Dobel () OS Resilience 06.08.2013 24 / 58
Introduction
L4Re: Session Creation
Client Server
Loader
(4) create
Bjorn Dobel () OS Resilience 06.08.2013 24 / 58
Introduction
L4Re: Session Creation
Client Server
Loader
(5) use
Bjorn Dobel () OS Resilience 06.08.2013 24 / 58
Introduction
L4Re: Server Crash
Client Server
Loader
Session Capability
(6)exit
Kernel destroys mem-ory, server objects(channels...)
Bjorn Dobel () OS Resilience 06.08.2013 25 / 58
Introduction
L4Re: Server Crash
Client Server
Loader
Session Capability
(7)restart
Bjorn Dobel () OS Resilience 06.08.2013 25 / 58
Introduction
L4Re: Restarted Server
Client Server
Loader
Session Capability
Bjorn Dobel () OS Resilience 06.08.2013 26 / 58
Introduction
L4Re: Restarted Server
Client Server
Loader
Session Capability
use
Bjorn Dobel () OS Resilience 06.08.2013 26 / 58
Introduction
L4Re: Restarted Server
Client Server
Loader
Session Capability
use
Error!
Bjorn Dobel () OS Resilience 06.08.2013 26 / 58
Introduction
L4ReAnimator
Only the application itself can detect that a capability vanished
Kernel raises Capability fault
Application needs to re-obtain the capability: execute capability fault handler
Capfault handler: application-specificCreate new communication channelRestore session state
Programming model:Capfault handler provided by server implementorHandling transparent for application developerSemi-transparency
Bjorn Dobel () OS Resilience 06.08.2013 27 / 58
Introduction
L4ReAnimator: Cleanup
Some channels have resources attached (e.g., frame bu↵er for graphicalconsole)
Resource may come from a di↵erent resource (e.g., frame bu↵er frommemory manager)
Resources remain intact (stale) upon crash
Client ends up using old version of the resource
Requires additional app-specific knowledge
Unmap handler
Bjorn Dobel () OS Resilience 06.08.2013 28 / 58
Introduction
Summary
L4ReAnimatorCapfault: Clients detect server restarts lazilyCapfault Handler: application-specific knowledge on how to regain access tothe serverUnmap handler: clean up old resources after restart
All these frameworks only deal with software errors.
What about hardware faults? ! Next session!
Bjorn Dobel () OS Resilience 06.08.2013 29 / 58