ExtraVirt: Detecting and recovering from transient processor faults


Page 1: ExtraVirt:  Detecting and recovering from transient processor faults


ExtraVirt: Detecting and recovering from transient processor faults

Dominic Lucchetti, Steve Reinhardt, Peter Chen

University of Michigan

Page 2: ExtraVirt:  Detecting and recovering from transient processor faults

Flips Happen

Similar die area + Decreasing transition energy = Increasing risk of transient failure

Page 3: ExtraVirt:  Detecting and recovering from transient processor faults

Multi-Processors & Virtual Machine

Multi-Processor
    Ensure error independence
    Enable fault detection
    Efficient resource sharing

Virtual Machine
    No changes to OS or applications
    VM replay
    Synchronize replicas
    Recover correct state

[Figure labels: Replica 1, Replica 2, Hypervisor, Device Drivers, Replication Management Layer (RML)]
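The fault-detection idea on this slide can be illustrated with a toy model. This is only a sketch under stated assumptions, not the ExtraVirt implementation: two replicas consume the same logged inputs deterministically (standing in for VM replay), and a comparison of their state digests reveals a divergence. The names `run_replica`, `digest`, and `replicas_agree`, and the injected bit flip, are hypothetical.

```python
import hashlib


def run_replica(inputs, flip_at=None):
    """Fold logged inputs into a 32-bit state; optionally inject one bit flip.

    Hypothetical stand-in for a replica replaying the same input log.
    """
    state = 0
    for i, value in enumerate(inputs):
        state = (state * 31 + value) & 0xFFFFFFFF
        if i == flip_at:
            state ^= 1 << 7  # simulated transient fault in one replica
    return state


def digest(state):
    """Hash the replica state so only a short summary needs comparing."""
    return hashlib.sha256(state.to_bytes(4, "little")).hexdigest()


def replicas_agree(inputs, flip_at=None):
    """Run a clean replica and a possibly faulty one, then compare digests."""
    clean = run_replica(inputs)
    other = run_replica(inputs, flip_at=flip_at)
    return digest(clean) == digest(other)
```

With identical inputs and no injected fault the digests match; a single flipped bit in one replica makes them differ, which is the signal that recovery is needed.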

Page 4: ExtraVirt:  Detecting and recovering from transient processor faults

Example: Memory

Copy on write
    Reduces overhead
    Protects checkpoints

Merge on checkpoint
    Verify correctness
    Re-execute on deviation

Memory Fault Protection
    ECC against RAM faults
    MMU against CPU faults

[Figure labels: Memory Checkpoint, Replica 1, Replica 2, Replica 3, Verify. Page columns as extracted: A B C D E; A B C X E; A B C E; A B C D E]
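The copy-on-write and merge-on-checkpoint steps on this slide can be sketched as follows. This is a minimal model, assuming pages are dictionary entries and a fault shows up as one replica writing a wrong value. `CowReplica`, `run_epoch`, and `merge_on_checkpoint` are hypothetical names, and accepting the third run only when it matches one of the first two is an assumption of this sketch, not something the slides state.

```python
class CowReplica:
    """Replica view over a shared checkpoint that it never modifies."""

    def __init__(self, checkpoint):
        self.checkpoint = checkpoint  # read-only from the replica's side
        self.dirty = {}               # private copies of written pages

    def write(self, page, value):
        self.dirty[page] = value      # copy on write: checkpoint stays intact

    def read(self, page):
        return self.dirty.get(page, self.checkpoint.get(page))


def run_epoch(checkpoint, work, faulty=False):
    """Run one replica from the checkpoint and return the pages it wrote."""
    replica = CowReplica(checkpoint)
    work(replica)
    if faulty:
        replica.write("D", "X")       # simulated transient fault
    return replica.dirty


def merge_on_checkpoint(checkpoint, work, faults=(False, False)):
    """Verify two replicas; on deviation, re-execute and accept a matching run."""
    first = run_epoch(checkpoint, work, faults[0])
    second = run_epoch(checkpoint, work, faults[1])
    if first == second:
        return {**checkpoint, **first}, False
    third = run_epoch(checkpoint, work)   # re-execute from the protected checkpoint
    if third not in (first, second):
        raise RuntimeError("no two executions agree")
    return {**checkpoint, **third}, True
```

For example, with a checkpoint holding pages A, B, C, and E and a workload that writes page D, a fault in the second replica triggers re-execution, and the merged checkpoint contains the value that two executions agree on. The original checkpoint dictionary is left unmodified throughout.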

Page 5: ExtraVirt:  Detecting and recovering from transient processor faults

Status

Present
    VM Replay
    Beginnings of Replication Management Layer (RML)
    Still much to do…

Future
    Replicate the un-replicated
    Handle faults in device drivers
    Expanded fault model

[Figure labels: Replica 1, Replica 2, Hypervisor/RML, Device Drivers]

Page 6: ExtraVirt:  Detecting and recovering from transient processor faults


Questions?