ExtraVirt: Detecting and recovering from transient processor faults
1
ExtraVirt: Detecting and recovering from transient processor faults
Dominic Lucchetti, Steve Reinhardt, Peter Chen
University of Michigan
2
Flips Happen
Similar die area + decreasing transition energy = increasing risk of transient failure
3
Multi-Processors & Virtual Machine
Multi-Processor
  Ensure error independence
  Enable fault detection
  Efficient resource sharing
Virtual Machine
  No changes to OS or applications
  VM replay
  Synchronize replicas
  Recover correct state
[Diagram labels: Replica 1, Replica 2, Hypervisor, Device Drivers, Replication Management Layer (RML)]
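The fault-detection idea on this slide (run two replicas on the same inputs and compare them) can be sketched roughly as follows. This is only an illustration, not ExtraVirt's implementation: the names `run_epoch`, `digest`, and `replicas_agree` are hypothetical, and modeling a transient fault as a single flipped bit is an assumption.

```python
import hashlib


def run_epoch(state, inputs, flip_bit_at=None):
    """Deterministically apply logged inputs to a copy of `state`.

    `flip_bit_at` optionally injects a single-bit flip into the result
    (an assumed, simplified model of a transient processor fault).
    """
    out = bytearray(state)
    for i, value in enumerate(inputs):
        out[i % len(out)] = (out[i % len(out)] + value) % 256
    if flip_bit_at is not None:
        byte, bit = divmod(flip_bit_at, 8)
        out[byte] ^= 1 << bit
    return bytes(out)


def digest(state):
    """Compact fingerprint of a replica's state, cheap to compare."""
    return hashlib.sha256(state).hexdigest()


def replicas_agree(state_a, state_b):
    """Two replicas fed the same inputs should end in identical states."""
    return digest(state_a) == digest(state_b)


start = bytes(8)
inputs = [3, 1, 4, 1, 5, 9, 2, 6]

replica_1 = run_epoch(start, inputs)
replica_2 = run_epoch(start, inputs)                        # fault-free run
replica_2_faulty = run_epoch(start, inputs, flip_bit_at=10)  # injected bit flip
```

Because both replicas replay the same inputs deterministically, any mismatch between their digests signals that one of them deviated.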
4
Example: Memory
Copy on write
  Reduces overhead
  Protects checkpoints
Merge on checkpoint
  Verify correctness
  Re-execute on deviation
Memory Fault Protection
  ECC against RAM faults
  MMU against CPU faults
[Figure: Memory. Labels: Checkpoint, Replica 1, Replica 2, Verify, Replica 3. Page columns shown: A B CD E; A B CX E; A B C E; and, for Replica 3, A B CD E]
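The memory scheme on this slide (copy on write, merge on checkpoint, re-execute on deviation) can be sketched as below. It is a minimal illustration under stated assumptions, not the authors' code: `CowMemory`, `run_replica`, and `advance` are hypothetical names, pages are modeled as dictionary entries, and reading the figure as "page C is correctly rewritten to D, while X is a faulty value" is an interpretation.

```python
class CowMemory:
    """A replica's view of memory over a shared checkpoint (copy on write)."""

    def __init__(self, checkpoint):
        self.checkpoint = checkpoint  # never modified by the replica
        self.dirty = {}               # pages written since the checkpoint

    def write(self, page, value):
        self.dirty[page] = value

    def read(self, page):
        return self.dirty.get(page, self.checkpoint[page])


def run_replica(checkpoint, work, fault=None):
    """Run `work` against a fresh copy-on-write view; optionally inject a fault."""
    mem = CowMemory(checkpoint)
    work(mem)
    if fault is not None:
        page, bad_value = fault
        mem.write(page, bad_value)
    return mem


def advance(checkpoint, work, fault_in_second=None):
    """Verify two replicas, re-execute on deviation, then merge into a new checkpoint."""
    r1 = run_replica(checkpoint, work)
    r2 = run_replica(checkpoint, work, fault_in_second)
    good = r1
    if r1.dirty != r2.dirty:
        # Deviation: re-execute from the untouched checkpoint as a third replica.
        r3 = run_replica(checkpoint, work)
        if r3.dirty == r1.dirty:
            good = r1
        elif r3.dirty == r2.dirty:
            good = r2
        else:
            raise RuntimeError("no two replicas agree")
    merged = dict(checkpoint)
    merged.update(good.dirty)  # merge only the verified pages
    return merged


checkpoint = {0: "A", 1: "B", 2: "C", 3: "E"}


def work(mem):
    mem.write(2, "D")  # the intended update to page 2


new_checkpoint = advance(checkpoint, work, fault_in_second=(2, "X"))
```

In this sketch the old checkpoint is never written, so a faulty replica cannot corrupt it, and only pages that two replicas agree on are merged forward.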
5
Status
Present
  VM Replay
  Beginnings of Replication Management Layer (RML)
  Still much to do…
Future
  Replicate the un-replicated
  Handle faults in device drivers
  Expanded fault model
[Diagram labels: Replica 1, Replica 2, Hypervisor/RML, Device Drivers]
6
Questions?