Operating Systems Meet Fault Tolerance

48
Operating Systems Meet Fault Tolerance Bj¨ orn D¨ obel 06.08.2013 Bj¨ornD¨ obel () OS Resilience 06.08.2013 1 / 58

description

Lecture by Bjoern Doebel for Summer Systems School'13 (http://sss.ksyslabs.org)

Transcript of Operating Systems Meet Fault Tolerance

Page 1: Operating Systems Meet Fault Tolerance

Operating Systems Meet Fault Tolerance

Bjorn Dobel

06.08.2013

Bjorn Dobel () OS Resilience 06.08.2013 1 / 58

Page 2: Operating Systems Meet Fault Tolerance

Introduction

“If there’s more than one possible outcome of a job or task,

and one of those outcome will result in disaster or an

undesirable consequence, then somebody will do it that way.”

(Edward Murphy jr.)

Bjorn Dobel () OS Resilience 06.08.2013 2 / 58

Page 3: Operating Systems Meet Fault Tolerance

Introduction

Outline

Murphy and the OS: Is it really that bad?

Fault-Tolerant Operating SystemsMinix3CuriOSL4ReAnimator

Dealing with Hardware ErrorsTransparent replication as an OS service

Bjorn Dobel () OS Resilience 06.08.2013 3 / 58

Page 4: Operating Systems Meet Fault Tolerance

Introduction

Why Things go Wrong

Programming in C:

This pointer is certainly never going to be NULL!

Layering vs. responsibility:

Of course, someone in the higher layers will already have checkedthis return value.

Concurrency:

This struct is shared between an IRQ handler and a kernel thread.But they will never execute in parallel.

Hardware interaction:But the device spec said, this was not allowed to happen!

Hypocrisy:

I’m a cool OS hacker. I won’t make mistakes, so I don’t need totest my code!

Bjorn Dobel () OS Resilience 06.08.2013 4 / 58

Page 5: Operating Systems Meet Fault Tolerance

Introduction

Why Things go Wrong

Programming in C:

This pointer is certainly never going to be NULL!

Layering vs. responsibility:

Of course, someone in the higher layers will already have checkedthis return value.

Concurrency:

This struct is shared between an IRQ handler and a kernel thread.But they will never execute in parallel.

Hardware interaction:But the device spec said, this was not allowed to happen!

Hypocrisy:

I’m a cool OS hacker. I won’t make mistakes, so I don’t need totest my code!

Bjorn Dobel () OS Resilience 06.08.2013 4 / 58

Page 6: Operating Systems Meet Fault Tolerance

Introduction

Why Things go Wrong

Programming in C:

This pointer is certainly never going to be NULL!

Layering vs. responsibility:

Of course, someone in the higher layers will already have checkedthis return value.

Concurrency:

This struct is shared between an IRQ handler and a kernel thread.But they will never execute in parallel.

Hardware interaction:But the device spec said, this was not allowed to happen!

Hypocrisy:

I’m a cool OS hacker. I won’t make mistakes, so I don’t need totest my code!

Bjorn Dobel () OS Resilience 06.08.2013 4 / 58

Page 7: Operating Systems Meet Fault Tolerance

Introduction

Why Things go Wrong

Programming in C:

This pointer is certainly never going to be NULL!

Layering vs. responsibility:

Of course, someone in the higher layers will already have checkedthis return value.

Concurrency:

This struct is shared between an IRQ handler and a kernel thread.But they will never execute in parallel.

Hardware interaction:But the device spec said, this was not allowed to happen!

Hypocrisy:

I’m a cool OS hacker. I won’t make mistakes, so I don’t need totest my code!

Bjorn Dobel () OS Resilience 06.08.2013 4 / 58

Page 8: Operating Systems Meet Fault Tolerance

Introduction

Why Things go Wrong

Programming in C:

This pointer is certainly never going to be NULL!

Layering vs. responsibility:

Of course, someone in the higher layers will already have checkedthis return value.

Concurrency:

This struct is shared between an IRQ handler and a kernel thread.But they will never execute in parallel.

Hardware interaction:But the device spec said, this was not allowed to happen!

Hypocrisy:

I’m a cool OS hacker. I won’t make mistakes, so I don’t need totest my code!

Bjorn Dobel () OS Resilience 06.08.2013 4 / 58

Page 9: Operating Systems Meet Fault Tolerance

Introduction

A Classic Study

A. Chou et al.: An empirical study of operating system errors, SOSP 2001

Automated software error detection (today: http://www.coverity.com)

Target: Linux (1.0 - 2.4)Where are the errors?How are they distributed?How long do they survive?Du bugs cluster in certain locations?

Bjorn Dobel () OS Resilience 06.08.2013 5 / 58

Page 10: Operating Systems Meet Fault Tolerance

Introduction

Revalidation of Chou’s Results

N. Palix et al.: Faults in Linux: Ten years later, ASPLOS 2011

10 years of work on tools to decrease error counts - has it worked?

Repeated Chou’s analysis until Linux 2.6.34

Bjorn Dobel () OS Resilience 06.08.2013 6 / 58

Page 11: Operating Systems Meet Fault Tolerance

Introduction

Linux: Lines of Code

Bjorn Dobel () OS Resilience 06.08.2013 7 / 58

Page 12: Operating Systems Meet Fault Tolerance

Introduction

Fault Rate per Subdirectory (2001)

Bjorn Dobel () OS Resilience 06.08.2013 8 / 58

Page 13: Operating Systems Meet Fault Tolerance

Introduction

Fault Rate per Subdirectory (2011)

Bjorn Dobel () OS Resilience 06.08.2013 9 / 58

Page 14: Operating Systems Meet Fault Tolerance

Introduction

Bug Lifetimes (2011)

Bjorn Dobel () OS Resilience 06.08.2013 10 / 58

Page 15: Operating Systems Meet Fault Tolerance

Introduction

Bug Distribution

Bjorn Dobel () OS Resilience 06.08.2013 11 / 58

Page 16: Operating Systems Meet Fault Tolerance

Introduction

Break

Faults are an issue.

Hardware-related stu↵ is worst.

Now what can the OS do about it?

Bjorn Dobel () OS Resilience 06.08.2013 12 / 58

Page 17: Operating Systems Meet Fault Tolerance

Introduction

Minix3 – A Fault-tolerant OS

Userprocesses User Processes

Server Processes

Device Processes Disk TTY Net Printer Other

File PM Reinc ... Other

Shell Make User ... Other

Kernel

Kernel Clock Task System Task

Bjorn Dobel () OS Resilience 06.08.2013 13 / 58

Page 18: Operating Systems Meet Fault Tolerance

Introduction

Minix3: Fault Tolerance

Address Space IsolationApplications only access private memoryFaults do not spread to other components

User-level OS servicesPrinciple of Least PrivilegeFine-grain control over resource access

e.g., DMA only for specific drivers

Small componentsEasy to replace (micro-reboot)

Bjorn Dobel () OS Resilience 06.08.2013 14 / 58

Page 19: Operating Systems Meet Fault Tolerance

Introduction

Minix3: Fault Detection

Fault model: transient errors caused by software bugs

Fix: Component restart

Reincarnation server monitors componentsProgram termination (crash)CPU exception (div by 0)Heartbeat messages

Users may also indicate that something is wrong

Bjorn Dobel () OS Resilience 06.08.2013 15 / 58

Page 20: Operating Systems Meet Fault Tolerance

Introduction

Repair

Restarting a component is insu�cient:Applications may depend on restarted componentAfter restart, component state is lost

Minix3: explicit mechanismsReincarnation server signals applications about restartApplications store state at data store serverIn any case: program interaction needed

Restarted app: store/recover stateUser apps: recover server connection

Bjorn Dobel () OS Resilience 06.08.2013 16 / 58

Page 21: Operating Systems Meet Fault Tolerance

Introduction

Break

Minix3 fault tolerance:Architectural IsolationExplicit monitoring and notifications

Other approaches:CuriOS: smart session state handlingL4ReAnimator: semi-transparent restart in a capability-based system

Bjorn Dobel () OS Resilience 06.08.2013 17 / 58

Page 22: Operating Systems Meet Fault Tolerance

Introduction

CuriOS: Servers and Sessions

State recovery is trickyMinix3: Data Store for application dataBut: applications interact

Servers store session-specific stateServer restart requires potential rollback for every participant

ServerState

ClientAState

ClientBStateServer

Client A

Client B

Bjorn Dobel () OS Resilience 06.08.2013 18 / 58

Page 23: Operating Systems Meet Fault Tolerance

Introduction

CuriOS: Server State Regions

CuriOS kernel (CuiK) manages dedicated session memory: Server StateRegions

SSRs are managed by the kernel and attached to a client-server connection

ServerState

Server

Client A

Client State A

Client B

Client State B

Bjorn Dobel () OS Resilience 06.08.2013 19 / 58

Page 24: Operating Systems Meet Fault Tolerance

Introduction

CuriOS: Protecting Sessions

SSR gets mapped only when a client actually invokes the server

Solves another problem: failure while handling A’s request will never corruptB’s session state

ServerState

Server

Client A

Client State A

Client B

Client State B

Bjorn Dobel () OS Resilience 06.08.2013 20 / 58

Page 25: Operating Systems Meet Fault Tolerance

Introduction

CuriOS: Protecting Sessions

SSR gets mapped only when a client actually invokes the server

Solves another problem: failure while handling A’s request will never corruptB’s session state

ServerState

Server

Client A

Client State A

Client B

Client State B

c

a

l

l

(

)

Bjorn Dobel () OS Resilience 06.08.2013 20 / 58

Page 26: Operating Systems Meet Fault Tolerance

Introduction

CuriOS: Protecting Sessions

SSR gets mapped only when a client actually invokes the server

Solves another problem: failure while handling A’s request will never corruptB’s session state

ServerState

Server

Client A

Client State A

Client B

Client State B

ClientAState

c

a

l

l

(

)

Bjorn Dobel () OS Resilience 06.08.2013 20 / 58

Page 27: Operating Systems Meet Fault Tolerance

Introduction

CuriOS: Protecting Sessions

SSR gets mapped only when a client actually invokes the server

Solves another problem: failure while handling A’s request will never corruptB’s session state

ServerState

Server

Client A

Client State A

Client B

Client State B

r

e

p

l

y

(

)

Bjorn Dobel () OS Resilience 06.08.2013 20 / 58

Page 28: Operating Systems Meet Fault Tolerance

Introduction

CuriOS: Protecting Sessions

SSR gets mapped only when a client actually invokes the server

Solves another problem: failure while handling A’s request will never corruptB’s session state

ServerState

Server

Client A

Client State A

Client B

Client State Bc

a

l

l

(

)

Bjorn Dobel () OS Resilience 06.08.2013 20 / 58

Page 29: Operating Systems Meet Fault Tolerance

Introduction

CuriOS: Protecting Sessions

SSR gets mapped only when a client actually invokes the server

Solves another problem: failure while handling A’s request will never corruptB’s session state

ServerState

Server

Client A

Client State A

Client B

Client State BClientBState

c

a

l

l

(

)

Bjorn Dobel () OS Resilience 06.08.2013 20 / 58

Page 30: Operating Systems Meet Fault Tolerance

Introduction

CuriOS: Protecting Sessions

SSR gets mapped only when a client actually invokes the server

Solves another problem: failure while handling A’s request will never corruptB’s session state

ServerState

Server

Client A

Client State A

Client B

Client State Br

e

p

l

y

(

)

Bjorn Dobel () OS Resilience 06.08.2013 20 / 58

Page 31: Operating Systems Meet Fault Tolerance

Introduction

CuriOS: Transparent Restart

CuriOS is a Single-Address-Space OS:Every application runs on the same page table (with modified access rights)

OS A A B BShared Mem

OS A A B BB Running

OS A A B BA Running

OS A A B BVirt. Memory

Bjorn Dobel () OS Resilience 06.08.2013 21 / 58

Page 32: Operating Systems Meet Fault Tolerance

Introduction

Transparent Restart

Single Address SpaceEach object has unique addressIdentical in all programsServer := C++ object

RestartReplace old C++ object with new oneReuse previous memory locationReferences in other applications remain validOS blocks access during restart

Bjorn Dobel () OS Resilience 06.08.2013 22 / 58

Page 33: Operating Systems Meet Fault Tolerance

Introduction

L4ReAnimator: Restart on L4Re

L4Re ApplicationsLoader component: nedDetects application termination: parent signalRestart: re-execute Lua init script (or parts of it)

Problem after restart: capabilitiesNo single component knows everyone owning a capability to an objectMinix3 signals won’t work

Bjorn Dobel () OS Resilience 06.08.2013 23 / 58

Page 34: Operating Systems Meet Fault Tolerance

Introduction

L4Re: Session Creation

Client Server

Loader

SessionCreationCapability

Bjorn Dobel () OS Resilience 06.08.2013 24 / 58

Page 35: Operating Systems Meet Fault Tolerance

Introduction

L4Re: Session Creation

Client Server

Loader

(1)create

Bjorn Dobel () OS Resilience 06.08.2013 24 / 58

Page 36: Operating Systems Meet Fault Tolerance

Introduction

L4Re: Session Creation

Client Server

Loader

(2) Mappedduring startup

Bjorn Dobel () OS Resilience 06.08.2013 24 / 58

Page 37: Operating Systems Meet Fault Tolerance

Introduction

L4Re: Session Creation

Client Server

Loader

(3) factory.create()

Bjorn Dobel () OS Resilience 06.08.2013 24 / 58

Page 38: Operating Systems Meet Fault Tolerance

Introduction

L4Re: Session Creation

Client Server

Loader

Session Capability

Bjorn Dobel () OS Resilience 06.08.2013 24 / 58

Page 39: Operating Systems Meet Fault Tolerance

Introduction

L4Re: Session Creation

Client Server

Loader

(4) create

Bjorn Dobel () OS Resilience 06.08.2013 24 / 58

Page 40: Operating Systems Meet Fault Tolerance

Introduction

L4Re: Session Creation

Client Server

Loader

(5) use

Bjorn Dobel () OS Resilience 06.08.2013 24 / 58

Page 41: Operating Systems Meet Fault Tolerance

Introduction

L4Re: Server Crash

Client Server

Loader

Session Capability

(6)exit

Kernel destroys mem-ory, server objects(channels...)

Bjorn Dobel () OS Resilience 06.08.2013 25 / 58

Page 42: Operating Systems Meet Fault Tolerance

Introduction

L4Re: Server Crash

Client Server

Loader

Session Capability

(7)restart

Bjorn Dobel () OS Resilience 06.08.2013 25 / 58

Page 43: Operating Systems Meet Fault Tolerance

Introduction

L4Re: Restarted Server

Client Server

Loader

Session Capability

Bjorn Dobel () OS Resilience 06.08.2013 26 / 58

Page 44: Operating Systems Meet Fault Tolerance

Introduction

L4Re: Restarted Server

Client Server

Loader

Session Capability

use

Bjorn Dobel () OS Resilience 06.08.2013 26 / 58

Page 45: Operating Systems Meet Fault Tolerance

Introduction

L4Re: Restarted Server

Client Server

Loader

Session Capability

use

Error!

Bjorn Dobel () OS Resilience 06.08.2013 26 / 58

Page 46: Operating Systems Meet Fault Tolerance

Introduction

L4ReAnimator

Only the application itself can detect that a capability vanished

Kernel raises Capability fault

Application needs to re-obtain the capability: execute capability fault handler

Capfault handler: application-specificCreate new communication channelRestore session state

Programming model:Capfault handler provided by server implementorHandling transparent for application developerSemi-transparency

Bjorn Dobel () OS Resilience 06.08.2013 27 / 58

Page 47: Operating Systems Meet Fault Tolerance

Introduction

L4ReAnimator: Cleanup

Some channels have resources attached (e.g., frame bu↵er for graphicalconsole)

Resource may come from a di↵erent resource (e.g., frame bu↵er frommemory manager)

Resources remain intact (stale) upon crash

Client ends up using old version of the resource

Requires additional app-specific knowledge

Unmap handler

Bjorn Dobel () OS Resilience 06.08.2013 28 / 58

Page 48: Operating Systems Meet Fault Tolerance

Introduction

Summary

L4ReAnimatorCapfault: Clients detect server restarts lazilyCapfault Handler: application-specific knowledge on how to regain access tothe serverUnmap handler: clean up old resources after restart

All these frameworks only deal with software errors.

What about hardware faults? ! Next session!

Bjorn Dobel () OS Resilience 06.08.2013 29 / 58