Transient Fault Detection and Recovery via Simultaneous Multithreading
Nevroz ŞEN
26/04/2007
AGENDA
Introduction & Motivation
SMT, SRT & SRTR
Fault Detection via SMT (SRT)
Fault Recovery via SMT (SRTR)
Conclusion
INTRODUCTION
Transient faults:
Faults that persist for only a "short" duration
Caused by cosmic rays (e.g., neutrons) that charge or discharge internal nodes of logic or SRAM cells
Also caused by high-frequency crosstalk
Solution?
No practical way to absorb cosmic rays
Estimated fault rate: 1 fault per 1,000 computers per year
The future is worse: smaller feature sizes, reduced voltages, higher transistor counts, and reduced noise margins
INTRODUCTION
Fault-tolerant systems use redundancy to improve reliability:
Time redundancy: separate executions
Space redundancy: separate physical copies of resources (DMR/TMR)
Data redundancy: ECC, parity
MOTIVATION
Simultaneous Multithreading improves the performance of a processor by allowing multiple independent threads to execute simultaneously (same cycle) in different functional units
Use the replication already provided by multithreading to run two copies of the same program, so that errors can be detected by comparing them
MOTIVATION
[Figure: Replicated microprocessors + cycle-by-cycle lockstepping. Inputs (R1, R2) are replicated to two microprocessors and their outputs are compared at the sphere boundary; memory is covered by ECC, the RAID array by parity, and ServerNet by CRC]
MOTIVATION
[Figure: the same diagram with the two microprocessors replaced by two threads: replicated threads + cycle-by-cycle lockstepping??? Inputs (R1, R2) replicated, outputs compared; memory covered by ECC, RAID array by parity, ServerNet by CRC]
MOTIVATION
Less hardware than replicated microprocessors:
SMT needs only ~5% more hardware than a uniprocessor
SRT adds very little hardware on top of an existing SMT core
Better performance than complete replication: better use of shared resources
Lower cost: avoids complete replication, and benefits from the market volume of SMT parts
MOTIVATION - CHALLENGES
Cycle-by-cycle output comparison and input replication (lockstepping) do not map onto SMT:
Equivalent instructions from the two threads may execute in different cycles
Equivalent instructions from the two threads may execute in a different order with respect to other instructions in the same thread
Precise scheduling of the threads is crucial (branch mispredictions, cache misses)
SMT – SRT - SRTR
[Figure: Simultaneous Multithreading (SMT). An instruction scheduler issues instructions from Thread 1 and Thread 2 to shared functional units in the same cycle]
SMT – SRT - SRTR
SRT: Simultaneous and Redundantly Threaded processor
SRT = SMT + fault detection
SRTR: Simultaneous and Redundantly Threaded processor with Recovery
SRTR = SRT + fault recovery
Fault Detection via SMT - SRT
Sphere of Replication (SoR)
Output comparison
Input replication
Performance optimizations for SRT
Simulation results
SRT - Sphere of Replication (SoR)
Logical boundary of redundant execution within a system
Components inside the sphere are protected against faults by replication
Components outside the sphere must use other means of fault tolerance (parity, ECC, etc.)
Its size matters: it determines error detection latency and stored-state size
SRT - Sphere of Replication (SoR) for SRT
Excludes the instruction and data caches
Alternate SoRs are possible (e.g., excluding the register file)
OUTPUT COMPARISON
Compare and validate outputs before sending them outside the SoR, catching faults before they propagate to the rest of the system
No need to compare every instruction: an incorrect value caused by a fault propagates through computations and is eventually consumed by a store, so checking only stores suffices
Check:
1. Address and data of stores from the redundant threads (both comparison and validation at commit time)
2. Address of uncached loads from the redundant threads
3. Address of cached loads from the redundant threads: not required
Other output comparisons depend on the boundary of the SoR
OUTPUT COMPARISON – Store Queue
[Figure: a store queue holding stores from both threads (R1, R2); matching store pairs are compared before being forwarded to the data cache]
• Bottleneck if the store queue is shared between threads
• Separate per-thread store queues boost performance
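The store-based output comparison above can be sketched in a few lines of Python. This is a minimal model, not the paper's hardware: the class and function names are illustrative, and the "cache" is just a dictionary standing in for everything outside the SoR.

```python
# Illustrative sketch of SRT output comparison at the store queues.
from collections import deque

class StoreQueue:
    """Per-thread FIFO of (address, value) pairs awaiting comparison."""
    def __init__(self):
        self.entries = deque()

    def push(self, addr, value):
        self.entries.append((addr, value))

def compare_and_commit(leading_sq, trailing_sq, data_cache):
    """Pop matching store pairs from both queues; forward a store to the
    cache only if address and data agree, otherwise count a detected fault."""
    faults = 0
    while leading_sq.entries and trailing_sq.entries:
        lead = leading_sq.entries.popleft()
        trail = trailing_sq.entries.popleft()
        if lead == trail:
            data_cache[lead[0]] = lead[1]   # store leaves the SoR
        else:
            faults += 1                      # mismatch: transient fault detected
    return faults

lead, trail = StoreQueue(), StoreQueue()
cache = {}
lead.push(0x100, 42); trail.push(0x100, 42)  # redundant copies agree
lead.push(0x104, 7);  trail.push(0x104, 99)  # a bit flip corrupted one copy
detected = compare_and_commit(lead, trail, cache)
```

Only the agreeing store reaches the cache; the corrupted pair is caught at the SoR boundary instead of propagating.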
INPUT REPLICATION
Replicate and deliver the same inputs (coming from outside the SoR) to both redundant copies:
Instructions: assume no self-modifying code, so no check is needed
Cached load data: Active Load Address Buffer (ALAB) or Load Value Queue (LVQ)
Uncached load data: synchronize the threads when the addresses leave the SoR for comparison; when the data returns, replicate the value for the two threads
External interrupts: either stall the leading thread and deliver the interrupt synchronously, or record the interrupt delivery point and deliver it later
INPUT REPLICATION – Active Load Address Buffer (ALAB)
Delays a cache block's replacement or invalidation until after the retirement of the trailing thread's load
A per-entry counter tracks the trailing thread's outstanding loads
When a cache block is about to be replaced:
The ALAB is searched for an entry matching the block's address
If the counter != 0: do not replace or invalidate until the trailing thread is done; set the entry's pending-invalidate bit
Otherwise: replace/invalidate the block
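The ALAB protocol above can be modeled compactly. This is a behavioral sketch under stated assumptions: entry fields and method names (`leading_load`, `trailing_load`, `try_replace`) are invented for illustration and do not come from the paper.

```python
# Illustrative model of the Active Load Address Buffer (ALAB).
class ALABEntry:
    def __init__(self, addr):
        self.addr = addr
        self.outstanding = 0          # leading loads the trailer hasn't retired
        self.pending_invalidate = False

class ALAB:
    def __init__(self):
        self.entries = {}             # block address -> ALABEntry

    def leading_load(self, addr):
        e = self.entries.setdefault(addr, ALABEntry(addr))
        e.outstanding += 1            # trailing thread still owes this load

    def trailing_load(self, addr):
        e = self.entries[addr]
        e.outstanding -= 1
        if e.outstanding == 0 and e.pending_invalidate:
            del self.entries[addr]    # deferred replacement can now proceed
            return True               # block may be evicted
        return False

    def try_replace(self, addr):
        """Cache asks permission before replacing/invalidating a block."""
        e = self.entries.get(addr)
        if e is None or e.outstanding == 0:
            self.entries.pop(addr, None)
            return True               # safe to evict
        e.pending_invalidate = True   # defer until trailing loads retire
        return False

alab = ALAB()
alab.leading_load(0x40)               # leading thread loads block 0x40
blocked = alab.try_replace(0x40)      # eviction must be deferred
released = alab.trailing_load(0x40)   # trailer retires; deferred eviction fires
```

The key invariant: a block with outstanding trailing loads is never evicted, so both threads are guaranteed to read the same value.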
INPUT REPLICATION – Load Value Queue (LVQ)
An alternative to the ALAB, and simpler
Pre-designated leading and trailing threads
Protected by ECC
[Figure: the leading thread's load probes the cache and deposits its value into the LVQ; the trailing thread's corresponding load reads the value from the LVQ instead of the cache]
INPUT REPLICATION – Load Value Queue (LVQ)
Advantages over the ALAB:
Reduces pressure on the data cache ports
Accelerates detection of faulty load addresses
Simpler design
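A minimal LVQ sketch follows, assuming a pre-designated leading thread as on the slide. The names are illustrative; a dictionary stands in for the data cache, and the deque stands in for the ECC-protected hardware FIFO.

```python
# Minimal Load Value Queue sketch: only the leading thread probes the
# cache; the trailing thread consumes (address, value) pairs in order.
from collections import deque

class LoadValueQueue:
    def __init__(self):
        self.fifo = deque()           # would be ECC-protected in hardware

    def leading_load(self, addr, cache):
        value = cache[addr]           # only the leading thread touches the cache
        self.fifo.append((addr, value))
        return value

    def trailing_load(self, addr):
        qaddr, value = self.fifo.popleft()
        if qaddr != addr:             # address mismatch caught immediately
            raise RuntimeError("transient fault: load address mismatch")
        return value

cache = {0x10: 5}
lvq = LoadValueQueue()
v_lead = lvq.leading_load(0x10, cache)
v_trail = lvq.trailing_load(0x10)
```

Because the trailing load's address is compared on the spot, a corrupted address is detected at the load itself rather than much later at a store, which is the "accelerated fault detection" advantage above.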
Performance Optimizations for SRT
Idea: use one thread to improve cache and branch-prediction behavior for the other. Two techniques:
Slack fetch:
Maintains a constant slack of instructions between the two threads
Prevents the trailing thread from seeing mispredictions and cache misses
Branch Outcome Queue (BOQ)
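The slack fetch policy can be illustrated with a toy fetch scheduler. This is a sketch under obvious simplifications: one fetch per cycle, a tiny slack of 4 (the paper evaluates slacks like 32, 128, and 256), and invented names.

```python
# Toy fetch scheduler illustrating slack fetch: the trailing thread is
# fetched only when the leading thread is at least SLACK instructions
# ahead, so the trailer runs behind resolved branches and warmed caches.
SLACK = 4  # illustrative; real evaluations use much larger slacks

def schedule_fetch(lead_fetched, trail_fetched, slack=SLACK):
    """Decide which thread gets the fetch slot this cycle."""
    if lead_fetched - trail_fetched >= slack:
        return "trailing"
    return "leading"

trace = []
lead = trail = 0
for _ in range(10):
    who = schedule_fetch(lead, trail)
    trace.append(who)
    if who == "leading":
        lead += 1
    else:
        trail += 1
```

After the initial ramp-up the scheduler alternates, holding the distance between the threads near the target slack.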
Performance Optimizations for SRT - Branch Outcome Queue (BOQ)
[Figure: pipeline (fetch, decode, dispatch, execute, commit) with the data cache; the BOQ is fed from the leading thread's commit stage into the trailing thread's fetch stage]
• The leading thread sends its committed branch outcomes (branch PCs and targets) to the trailing thread
• In the fetch stage, the trailing thread uses the head of the queue like a branch target buffer
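The BOQ behaves like a FIFO of resolved branches, as in this hedged sketch (class and method names are assumptions for illustration):

```python
# Sketch of a branch outcome queue: the leading thread pushes committed
# branch (PC, target) pairs; the trailing thread's fetch stage pops the
# head instead of consulting a branch predictor.
from collections import deque

class BranchOutcomeQueue:
    def __init__(self):
        self.fifo = deque()

    def commit_branch(self, pc, target):
        self.fifo.append((pc, target))   # leading thread, at commit time

    def next_fetch_target(self, pc):
        """Trailing fetch: a perfect 'prediction' from the queue head."""
        head_pc, target = self.fifo.popleft()
        assert head_pc == pc             # both threads fetch the same branches
        return target

boq = BranchOutcomeQueue()
boq.commit_branch(0x400, 0x480)          # leading thread resolved: taken
target = boq.next_fetch_target(0x400)    # trailing thread never mispredicts
```

Since the trailing thread only consumes committed outcomes, it never fetches down a wrong path, which is exactly why the BOQ saves it from mispredictions.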
Simulation Results
Simulation environment:
Modified SimpleScalar "sim-outorder"
Long front-end pipeline, reflecting the out-of-order core and SMT
Simple approximation of a trace cache
11 SPEC95 benchmarks
Simulation Results
ORH: On-Chip Replicated Hardware
ORH-Dual: two pipelines, each with half the resources
SMT-Dual: replicated threads with no detection hardware
Simulation Results - Slack Fetch & Branch Outcome Queue
Up to 27% performance improvement for SF, BOQ, and SF + BOQ
Performance is better with a slack of 256 instructions than with 32 or 128
The slack prevents the trailing thread from wasting resources on speculation
Simulation Results - Input Replication
Very low performance degradation with a 64-entry ALAB or LVQ
On average, a 16-entry ALAB and a 16-entry LVQ degrade performance by 8% and 5%, respectively
Simulation Results - Overall
Comparison with ORH-Dual
SRT processor: slack of 256, 128-entry BOQ, 64-entry store buffer, and 64-entry LVQ
On average 16% (maximum 29%) better performance than a lockstepping processor with the "same" hardware
Fault Recovery via SMT (SRTR)
What is wrong with SRT:
A leading non-store instruction may commit before the check for faults occurs
SRT relies on the trailing thread to trigger detection
This works well in a fail-fast architecture, but a faulty instruction cannot be undone once it commits
Fault Recovery via SMT (SRTR) - Motivation
In SRT, a leading instruction may commit before the check for faults occurs, relying on the trailing thread to trigger detection.
In contrast, SRTR must not allow any leading instruction to commit before it is checked.
SRTR exploits the time between the completion and the commit of a leading instruction, checking the result as soon as the trailing counterpart completes.
In SPEC95, complete-to-commit takes about 29 cycles. This short slack has implications:
The leading thread provides branch predictions
The store buffer (StB), LVQ, and BOQ must handle mispredictions
Fault Recovery via SMT (SRTR) - Motivation
The leading thread provides the trailing thread with branch predictions instead of outcomes (as in SRT).
A register value queue (RVQ) stores register values and other information needed to check instructions, avoiding bandwidth pressure on the register file.
Dependence-based checking elision (DBCE) is developed to reduce the number of checks.
Recovery uses the conventional rollback ability of modern pipelines.
SRTR Additions to SMT
[Figure: SRTR additions to SMT]
PredQ: prediction queue; LVQ: load value queue; CVs: commit vectors; AL: active list; RVQ: register value queue
SRTR – AL & LVQ
Leading and trailing instructions occupy the same positions in their active lists (private to each thread)
They may enter their ALs, and become ready to commit, at different times
The LVQ has to be modified to allow speculative loads
The Shadow Active List (SAL) holds pointers to LVQ entries
A trailing load might issue before the corresponding leading load
Branches place the LVQ tail pointer in the SAL, so the LVQ's tail pointer can be rolled back on a misprediction
SRTR – PREDQ
The leading thread places its predicted branch PCs in the predQ
Similar to the BOQ, but holds predictions instead of outcomes
Using the predQ, the two threads fetch essentially the same instructions
On detecting a misprediction, the leading thread clears the predQ
ECC protected
SRTR – RVQ & CV
SRTR checks an instruction when its trailing copy completes
The register value queue (RVQ) stores register values for checking, avoiding pressure on the register file
RVQ entries are allocated when instructions enter the AL
Pointers to the RVQ entries are placed in the SAL to facilitate their lookup
If the check succeeds, the entries in the commit vectors (CVs) are set to checked-ok and the instructions can commit
If the check fails, the entries in the CVs are set to failed
Rollback is triggered when a failed entry reaches the head of the AL
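The check-at-trailing-completion step can be sketched as follows. The state names and function signatures are assumptions for this illustration; the real CVs are per-thread bit vectors indexed by AL position.

```python
# Illustrative commit-vector update: when a trailing instruction
# completes, its result is compared against the leading value saved in
# the RVQ; the CV entry then gates commit vs. rollback.
CHECKED_OK, FAILED, UNCHECKED = "ok", "failed", "unchecked"

def check_on_trailing_complete(rvq_value, trailing_value, cv, index):
    """Compare leading (from the RVQ) and trailing values; mark the CV."""
    cv[index] = CHECKED_OK if rvq_value == trailing_value else FAILED
    return cv[index]

def may_commit(cv, head):
    """An AL-head instruction commits only once its CV entry is
    checked-ok; a failed entry at the head triggers rollback instead."""
    return cv[head] == CHECKED_OK

cv = [UNCHECKED, UNCHECKED]
check_on_trailing_complete(7, 7, cv, 0)   # values agree
check_on_trailing_complete(3, 5, cv, 1)   # transient fault in one copy
```

Because commit is gated on the CV, no faulty leading instruction can retire, which is the property SRT lacked.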
SRTR - Pipeline
• After the leading instruction writes back its result, it enters the fault-check stage
• The leading instruction puts its value in the RVQ, using the pointer from the SAL
• Trailing instructions also use the SAL to obtain their RVQ pointers and find their leading counterparts' values
SRTR – DBCE
Checking every instruction puts bandwidth pressure on the RVQ
The DBCE (dependence-based checking elision) scheme reduces the number of checks, and thereby the RVQ bandwidth demand
SRTR – DBCE
Idea: faults propagate through dependent instructions
DBCE exploits register dependence chains so that only the last instruction in a chain uses the RVQ and has its leading and trailing values checked
SRTR – DBCE
• If the last instruction's check succeeds, all previous instructions in the chain commit
• If the check fails, all instructions in the chain are marked as having failed, and the earliest instruction in the chain triggers a rollback
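The elision rule can be demonstrated on a single dependence chain. Chain formation is heavily simplified here (the chain is given, not extracted from register dependences), so treat this as a sketch of the checking rule only:

```python
# Toy DBCE check: for a register dependence chain, compare only the
# last instruction's leading and trailing values; that verdict covers
# every elided check earlier in the chain.
def check_chain(leading_vals, trailing_vals):
    """Return per-instruction verdicts for one dependence chain."""
    if leading_vals[-1] == trailing_vals[-1]:
        return ["ok"] * len(leading_vals)    # elided checks succeed too
    # Last check failed: the whole chain is suspect, and the earliest
    # instruction in the chain is the one that triggers rollback.
    return ["failed"] * len(leading_vals)

# Chain r1 = r0 + 1; r2 = r1 * 2; r3 = r2 - 3: only r3 is compared.
clean = check_chain([1, 2, -1], [1, 2, -1])
faulty = check_chain([1, 2, -1], [1, 2, 0])
```

The sketch shows why elision is sound: a fault in any earlier chain value propagates to the final value, so comparing that one value suffices, at the cost of marking the whole chain failed on a mismatch.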
SRTR - Performance
Detection performance of SRT vs. SRTR:
SRTR shows better interaction between branch mispredictions and slack
SRTR outperforms SRT by 1% to 7%
SRTR - Performance
• SRTR’s average performance peaks at a slack of 32
CONCLUSION
A more efficient way to detect transient faults is presented
The trailing thread repeats the computation performed by the leading thread, and the values produced by the two threads are compared
Key structures were defined: LVQ, ALAB, slack fetch, and BOQ
An SRT processor can provide higher performance than an equivalently sized on-chip hardware-replicated solution
SRT can be extended for fault recovery (SRTR)
REFERENCES
T. N. Vijaykumar, I. Pomeranz, and K. Cheng, "Transient-Fault Recovery Using Simultaneous Multithreading," Proc. 29th Annual Int'l Symposium on Computer Architecture (ISCA), May 2002.
S. K. Reinhardt and S. S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading," Proc. 27th Annual Int'l Symposium on Computer Architecture (ISCA), pages 25-36, June 2000.
E. Rotenberg, "AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors," Proc. 29th Fault-Tolerant Computing Symposium (FTCS), 1999.
S. S. Mukherjee, M. Kontz, and S. K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives," Proc. 29th Annual Int'l Symposium on Computer Architecture (ISCA), 2002.