Cost-Effective Register File Soft Error Reduction

Pablo Montesinos, Wei Liu and Josep Torrellas, University of Illinois at Urbana-Champaign
Date posted: 21-Dec-2015


Overview

Study of register file vulnerability to SDC (Silent Data Corruption)

Shield: cost-effective protection for register files

Highlights of the policies and techniques used in Shield

Experiments - Results

Register File AVF

RF-AVF (the register file's Architectural Vulnerability Factor) is the probability that a fault occurring in the register file leads to an error.

A register's lifetime is divided into PreWrite, Useful, and PostLastRead parts.

Based on the AVF calculation, the lifetime of a bit can be divided into ACE (required for Architecturally Correct Execution) and un-ACE cycles.

Register File AVF

During the PreWrite period, the register is un-ACE.

If the register is read at least once after the write, it switches to the ACE state.

After the last read, the register switches back to un-ACE for the PostLastRead period.
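The state transitions above can be turned into a small AVF calculation. The sketch below is only an illustration: the `RegVersion` record, its field names, and the formula (ACE register-cycles divided by total register-cycles, treating every bit of a Useful register as ACE) are assumptions, not the authors' tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RegVersion:
    """One register version, from allocation to deallocation (hypothetical record)."""
    alloc: int        # cycle the physical register is allocated (PreWrite starts)
    write: int        # cycle the value is written (PreWrite ends)
    reads: List[int]  # cycles at which the value is read
    dealloc: int      # cycle the physical register is freed (PostLastRead ends)


def ace_cycles(v: RegVersion) -> int:
    """Cycles in the Useful (ACE) state: from the write to the last read."""
    if not v.reads:
        return 0  # never read, so the whole lifetime is un-ACE
    return max(v.reads) - v.write


def rf_avf(versions: List[RegVersion], num_regs: int, total_cycles: int) -> float:
    """Assumed AVF estimate: ACE register-cycles over all register-cycles."""
    return sum(ace_cycles(v) for v in versions) / (num_regs * total_cycles)


versions = [
    RegVersion(alloc=0, write=2, reads=[5, 10], dealloc=12),  # Useful for 8 cycles
    RegVersion(alloc=3, write=4, reads=[], dealloc=9),        # written, never read
]
avf = rf_avf(versions, num_regs=2, total_cycles=20)
```

With these two made-up versions, only the first contributes ACE cycles, so the estimate is 8 out of 40 register-cycles.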

Highlighting Insights (1)

The combined percentage of Useful time across all registers is small.

Highlighting Insights (1)

The average number of useful (live) registers is less than 20 (SPECint) and 17 (SPECfp).

It is thus possible to reduce the vulnerability of the register file by protecting only a subset of carefully chosen registers at a time.

Highlighting Insights (2)

A few long-lived register versions contribute a large share of the total useful time.

On average, less than 10% of register versions are long-lived.

Highlighting Insights (2)

On average, 40% of the useful time comes from the few long-lived versions.

In SPECfp, 5% of long-lived versions account for 46% of the useful time.
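To make the kind of measurement behind these percentages concrete, here is a minimal sketch. The slides do not say how "long-lived" is defined, so the `threshold` parameter and the sample lifetimes are purely hypothetical.

```python
from typing import List, Tuple


def long_lived_share(useful_lengths: List[int], threshold: int) -> Tuple[float, float]:
    """Return (fraction of versions that are long-lived,
               fraction of total useful time they account for).

    `threshold` is an assumed cutoff in cycles; the source gives no definition.
    """
    if not useful_lengths:
        return 0.0, 0.0
    total = sum(useful_lengths)
    long_lived = [u for u in useful_lengths if u >= threshold]
    frac_versions = len(long_lived) / len(useful_lengths)
    frac_useful = sum(long_lived) / total if total else 0.0
    return frac_versions, frac_useful


# Made-up useful times (in cycles) for ten register versions.
sample = [1, 2, 1, 3, 2, 1, 1, 2, 1, 86]
frac_versions, frac_useful = long_lived_share(sample, threshold=50)
```

In this invented sample a single version out of ten dominates the useful time, which is the shape of distribution the slides describe.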

Motivation

Register files have a very high access rate.

The resulting high temperature lowers the Qcrit (critical charge) of the devices.

An error in the RF can propagate with a high probability of failure.

If we isolate a few register versions, predict their lifetimes, and protect only those versions, high reliability can be achieved with limited overhead.

Shield - Architecture

Life-Time Prediction

Shielding Decision

Register Error Check

Error Recovery

Reg-Version Lifetime Prediction

P12 => Used(1), Renamed(1)

P7 => Used(0), Renamed(1)

Shielding Decision

These prediction bits are stored as status in the ECC table.

The decision to shield an incoming register version being written depends on:

Availability of a free ECC-table entry.

An entry with the same register# already present in the ECC table is replaced with the new entry.

An existing reg-version with a shorter predicted lifetime than the incoming reg-version is replaced.

Replacement policy:
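One possible reading of the shielding rules above, as a sketch rather than the authors' hardware logic. The `EccEntry` record, the integer `lifetime` rank (higher means longer predicted lifetime), and checking the same-register case before looking for a free entry are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EccEntry:
    reg: int       # physical register number used as the tag
    lifetime: int  # predicted lifetime rank; the encoding is assumed


def shield_decision(table: List[Optional[EccEntry]], reg: int, lifetime: int) -> Optional[int]:
    """Return the ECC-table index that shields the incoming version, or None."""
    if not table:
        return None
    # 1. An entry tagged with the same register number is overwritten.
    for i, entry in enumerate(table):
        if entry is not None and entry.reg == reg:
            table[i] = EccEntry(reg, lifetime)
            return i
    # 2. Otherwise take a free entry if one is available.
    for i, entry in enumerate(table):
        if entry is None:
            table[i] = EccEntry(reg, lifetime)
            return i
    # 3. Otherwise evict the entry with the shortest predicted lifetime,
    #    but only if it is shorter than the incoming version's.
    victim = min(range(len(table)), key=lambda i: table[i].lifetime)
    if table[victim].lifetime < lifetime:
        table[victim] = EccEntry(reg, lifetime)
        return victim
    return None  # the incoming version is left unprotected


table: List[Optional[EccEntry]] = [None, None]
first = shield_decision(table, reg=12, lifetime=2)   # takes free entry 0
second = shield_decision(table, reg=7, lifetime=1)   # takes free entry 1
third = shield_decision(table, reg=3, lifetime=1)    # no shorter-lived victim
fourth = shield_decision(table, reg=3, lifetime=3)   # evicts the shortest-lived entry
```

Checking the same-register case first keeps at most one entry per register in the table; a real design might order the checks differently.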

Register Error Check & Recovery

On a read request, the register data is sent to both the original datapath and Shield.

If the reg# matches a tag entry, the reg-data is checked for errors at the ECC checker.

If an error is detected:

The processor stalls the instruction I that is reading reg P.

The reg-data is corrected and written back into the RF.

The oldest instruction in the ROB that reads reg P, and all succeeding instructions, are flushed.

The processor resumes from the flushed instruction.
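The flush step of the recovery sequence can be sketched as below. This models only the ROB bookkeeping; the stall and the ECC correction are omitted, and the `Instr` record, its `pc` and `src_regs` fields, and the oldest-first list representation of the ROB are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    pc: int                    # address of the instruction
    src_regs: Tuple[int, ...]  # physical registers it reads


def flush_for_recovery(rob: List[Instr], reg: int) -> Tuple[List[Instr], Optional[int]]:
    """Flush the oldest reader of `reg` and everything younger.

    `rob` is ordered oldest first. Returns (surviving entries, pc to resume from);
    the pc is None if no instruction in the ROB reads `reg`.
    """
    for i, instr in enumerate(rob):
        if reg in instr.src_regs:
            return rob[:i], instr.pc
    return rob, None


rob = [
    Instr(pc=100, src_regs=(1, 2)),
    Instr(pc=104, src_regs=(12,)),   # oldest reader of P12
    Instr(pc=108, src_regs=(3,)),
    Instr(pc=112, src_regs=(12, 4)),
]
kept, resume_pc = flush_for_recovery(rob, reg=12)
```

Here only the instruction at pc 100 survives, and execution would resume at pc 104 once the corrected value is back in the RF.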

Experiments - Results

AVF computation for RF with Shield

Experiments - Results

AVF reduction for intREG under different replacement policies:

LRU = 31%

Effective = 63%

OptEffective = 84% (Effective + pinning of global pointers to particular ECC entries)

AVF for fpREG can be reduced by up to 100%, because fewer fp registers are in the useful state.

Power and Area Impact

Shield uses only 3 ECC generators and 3 ECC checkers.

Shield has a 45% power overhead over a plain register file (full ECC has 2X).

Shield introduces an overall 10% area overhead.

Conclusion

A cost-effective architectural technique has been proposed that reduces the vulnerability of the RF by 84% (integer registers, OptEffective policy).

The area and power overheads are a modest price for the reliability achieved.