Recent Challenges. 2 Soft Errors Scaling: SEU (Single-event upset): −Ionizing radiation corrupts...

Post on 14-Dec-2015


Recent Challenges


Soft Errors

• Scaling: SEU (Single-event upset):
  − Ionizing radiation corrupts stored data
  Cause:
  − Radioactive impurities in device packages
  − Recently: cosmic radiation
  Scaling worsens SEU:
  1. Voltage scaling + reduced node capacitances
     − lower the charge threshold necessary to corrupt the data
  2. Greater level of integration
     − increases the likelihood that soft errors will affect the device
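As a rough illustration of why voltage scaling and smaller node capacitances lower the charge threshold, the critical charge of a storage node is often approximated to first order as Q_crit ≈ C_node · V_dd. The sketch below uses that approximation; the capacitance and voltage values are hypothetical and are not taken from the slides.

```python
def critical_charge(node_capacitance_f: float, supply_voltage_v: float) -> float:
    """First-order estimate of critical charge in coulombs: Q_crit ~= C_node * V_dd."""
    return node_capacitance_f * supply_voltage_v

# Hypothetical values for an older and a scaled process node (assumptions).
older = critical_charge(2.0e-15, 1.8)   # 2 fF node at 1.8 V
scaled = critical_charge(1.0e-15, 1.0)  # 1 fF node at 1.0 V
print(f"older: {older:.2e} C, scaled: {scaled:.2e} C")
```

Under these assumed numbers the scaled node needs less than a third of the charge to flip, so lower-energy particle strikes become sufficient to cause an upset.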


SEU

• Sources:
  − Configuration memory
  − Flip-flops
  − Memory blocks
  − Combinational circuits (transient error → permanent)


SEU in Configuration Memory

• SEU in configuration bits (SRAM-based):
  − In Virtex FPGAs, ~91% of the bits sensitive to soft errors are configuration bits
    − flash- or antifuse-based FPGAs do not suffer from this
  − Any change to the configuration memory may alter the functionality
  − The error persists until the FPGA is reprogrammed


SEU Mitigation Techniques

• Mitigation techniques:
  1. Circuit and technology-level:
     − Adding metal capacitors to nodes in the memory increases the amount of charge necessary to cause an SEU
  2. System-level:
     − Ensures that the system can detect and recover.
     − Devices regularly verify their configuration memory by comparing the current values with the desired configuration state using cyclic redundancy checks (Altera Stratix III)
  3. User-level:
     a) TMR (triple modular redundancy):
        − Replicating a design three times and voting among the outputs
     b) Reduce the sensitivity of the design to soft errors by careful selection of the resources used
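The system-level CRC check and the user-level TMR voter can be sketched in a few lines. This is a minimal illustration only: `zlib.crc32` stands in for whatever CRC a real device uses, and the function names and the 4-byte "bitstream" are hypothetical.

```python
import zlib

def config_is_intact(current: bytes, golden_crc: int) -> bool:
    """Readback check: compare the CRC-32 of the current configuration with the stored value."""
    return zlib.crc32(current) == golden_crc

def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over the outputs of three replicas."""
    return (a & b) | (a & c) | (b & c)

# Hypothetical 4-byte configuration image (assumption, for illustration only).
golden = bytes([0b10110010, 0x00, 0xFF, 0x5A])
golden_crc = zlib.crc32(golden)

# Flip one bit to model a single-event upset in configuration memory.
upset = bytearray(golden)
upset[0] ^= 0b00000100

print(config_is_intact(golden, golden_crc))        # unmodified configuration
print(config_is_intact(bytes(upset), golden_crc))  # configuration after the upset

# One replica output is corrupted; the voter masks the error.
print(bin(tmr_vote(0b1010, 0b1010, 0b1110)))
```

The CRC check only detects the upset, so recovery still requires rewriting the configuration, whereas the voter masks a single faulty replica without any reconfiguration.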


Circuit Level

• [Ebrahimi]: Reduce the number of SRAM cells in a switch box (6 → 5), (6 → 4)

[Figure: switch box with tracks numbered 0 to 3 on each axis]


[Figure: switch box with sides N, E, S and W, and labels a to f and w to z]

User Design Level

• Care bits [Golshan07]:
  − Only a subset of the configuration bits affects the design when hit by an SEU.
• Resource A is used for net A.
  − The A-B SRAM bit is not a care bit if B is not used by other nets.
  − The A-C SRAM bit is a care bit (a change to ‘1’ hurts net A).
  − The A-D SRAM bit is not a care bit (w.r.t. net A) if D is not used.


User Design Level

• Soft Error Routing Problem [Golshan07]: Given a routing graph and a set of multi-terminal nets, route each net with the least care-cost, where care-cost is the number of routing care bits.
• Experiments: 14% reduction in the number of care bits
  − ~80% of soft errors in the FPGA occur in configuration memory [Kuon07]
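To make the care-cost objective concrete, the sketch below routes a single two-terminal net with Dijkstra's algorithm, using the number of care bits on each routing edge as the edge weight. It is a simplification and not the algorithm from [Golshan07]: real nets are multi-terminal and care-costs depend on how neighboring resources are used. The graph, node names, and care-bit counts are hypothetical.

```python
import heapq

def least_care_cost_route(graph, source, sink):
    """Dijkstra search; graph[u] is a list of (v, care_bits) pairs.

    Returns (total_care_cost, path), or (None, []) if the sink is unreachable.
    """
    best = {source: 0}
    parent = {}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == sink:
            # Walk the parent links back to the source to recover the path.
            path = [sink]
            while path[-1] != source:
                path.append(parent[path[-1]])
            return cost, path[::-1]
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, care_bits in graph.get(node, []):
            new_cost = cost + care_bits
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                parent[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    return None, []

# Hypothetical routing graph (assumption): the two-hop route has more care bits.
routing_graph = {
    "src": [("A", 3), ("B", 1)],
    "A": [("dst", 1)],
    "B": [("C", 1)],
    "C": [("dst", 1)],
}
cost, path = least_care_cost_route(routing_graph, "src", "dst")
print(cost, path)
```

In this toy graph the longer route through B and C is preferred because it exposes fewer care bits, which is the trade-off the routing problem formalizes.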

Recent Challenges

Process Variation


Process Variation Sources

[Figure: Leff (axis ticks 1.8 to 2.3, × 10^-7) plotted against wafer X and wafer Y position]

[IBM, Intel and TSMC]


Variation Variations

• Variation of variation over years
• Variation from mean value
  − Gate oxides are so thin that a change of one atom can cause a 25 percent difference in substrate current (EE Times, 04/11/2006).

ILD: inter-layer dielectric


Statistical Description

The underlying deterministic and random contributions are lumped together into a single “random” statistical description.

The distribution (mean and variance) of L across the devices on one wafer can differ from the distribution across the devices within a single die.


Inter-die vs. Intra-die Variations

• Figures are courtesy of IBM, Intel and TSMC

[Figure: Leff variation in two panels, "Intra-die spatial correlation" and "Inter-die global correlation"]


Impact of Variation

• Importance of variation:
  − Timing violations
    − Yield loss


Impact of Variation

• Process variations can cause up to 2000% variation in leakage current and 30% variation in frequency in 180nm CMOS

− Borkar, S., Karnik, T., Narendra, S., Tschanz, J., Keshavarzi, A., De, V. Parameter Variations and Impact on Circuits and Microarchitecture. In Proc. of DAC (2003), 338-342.


Impact of Variation

Die-to-die frequency variation


Variation in FPGA

• Binning:
  − Historically: most of the variation was between dies
    − FPGA manufacturers test the speed of each FPGA after manufacturing and bin each device according to its speed.
    − Higher speeds: more expensive
    − Unacceptable leakage power: discard the device
  − More recently: significant within-die variation
    − Cannot be leveraged in the same manner
    − Operating speeds must be reduced to maintain functionality
    − 90nm: speed reduction of 5.7%
    − 22nm: speed reduction of 22.4%
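The binning flow for die-to-die variation can be sketched as a simple classification step. The grade names, frequency thresholds, and leakage limit below are hypothetical and do not come from any vendor.

```python
def bin_device(fmax_mhz, leakage_ma, speed_grades, leakage_limit_ma):
    """Return the fastest speed grade the device meets, or None if it is discarded.

    speed_grades: list of (grade_name, min_fmax_mhz) pairs in any order.
    """
    if leakage_ma > leakage_limit_ma:
        return None  # unacceptable leakage power: discard the device
    # Try the most demanding grade first.
    for name, min_fmax in sorted(speed_grades, key=lambda g: g[1], reverse=True):
        if fmax_mhz >= min_fmax:
            return name
    return None  # too slow for every grade

# Hypothetical grades and limits (assumptions, not vendor data).
grades = [("-1", 400.0), ("-2", 450.0), ("-3", 500.0)]
print(bin_device(470.0, 80.0, grades, leakage_limit_ma=100.0))
print(bin_device(520.0, 130.0, grades, leakage_limit_ma=100.0))
```

A single measured speed per device is what makes this work for between-die variation; within-die variation has no such per-device number, which is why it cannot be exploited the same way.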


Solutions

• Architectural solution:
  1. Select the logic block architecture parameters to minimize this variation
     − LUT size is particularly important [Wong05]
     − LUT size = 4: highest leakage yield
     − LUT size = 7: highest timing yield
     − LUT size = 5: maximum combined leakage and timing yield
  2. Adaptively compensate for any variation through body-biasing [Nabaa06]:
     − Slow blocks: apply a body bias → decrease Vt → increase the block’s speed
     − Fast blocks: increase threshold voltage → reduce leakage power
     Experiments:
     − Area penalty: 1%–2%
     − Delay variability reduction: 30%
     − Leakage variability reduction: 78%


Solutions

• CAD-Level:

1. Statistical static timing analysis (SSTA) in FPGA CAD tools
   − Improves delays by avoiding the margins that are necessary for traditional STA

2. Testing multiple logically equivalent configurations of the FPGA to find one that is functional at the desired speed [Sedcole07]

3. Generating critical paths that will be more robust in the face of variation [Matsumoto07]
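The margin that SSTA avoids can be illustrated with a small calculation. The sketch assumes independent Gaussian stage delays, so that variances add along a path; real SSTA also has to handle correlated variation and the max operation at converging paths. The stage count, means, and sigmas are hypothetical.

```python
import math

def worst_case_path_delay(stages, k=3.0):
    """Traditional STA margin: every stage sits at its own mean + k*sigma."""
    return sum(mu + k * sigma for mu, sigma in stages)

def statistical_path_delay(stages, k=3.0):
    """Mean + k*sigma of the summed path delay, assuming independent
    Gaussian stage delays (so the variances add)."""
    mean = sum(mu for mu, _ in stages)
    sigma = math.sqrt(sum(s * s for _, s in stages))
    return mean + k * sigma

# Hypothetical path: 10 stages, each 100 ps mean and 5 ps sigma (assumptions).
path = [(100.0, 5.0)] * 10
print(worst_case_path_delay(path))   # 1150.0
print(statistical_path_delay(path))  # about 1047.4
```

Under these assumptions the statistical estimate is roughly 9% tighter than the stacked worst case, because it is unlikely that all ten stages are slow at once.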


Inter-die vs. Intra-die Variations

P0 = nominal design value

ΔPintradie = intra-die variation (within a given chip)

ΔPinterdie = inter-die variation (from one chip to another)

ΔPe = remaining “random” or unexplained variation

P: a structural or electrical parameter, e.g.
− W,
− tox,
− Vth,
− channel mobility,
− coupling capacitances,
− line resistances.
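The terms defined above are usually combined in an additive model. The slide's own equation did not survive in this transcript, so the form below is an assumption based on those definitions rather than a quotation:

```latex
P = P_0 + \Delta P_{\text{interdie}} + \Delta P_{\text{intradie}} + \Delta P_{e}
```

Read this way, every device on a chip shares the same inter-die term, the intra-die term changes with position on the die, and the last term absorbs whatever the first two do not explain.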


Corner Analysis

• PRCA (Process Corner Analysis):
  − Takes
    1. nominal values of process parameters
    2. and a delta for each parameter by which it varies.
  − Finds
    − performance as max and min values.
• Pros: simple
• Cons: conservative, inaccurate


Corner Analysis

• PRCA shortcoming: Process corners are believed to coincide with performance corners.
  − Fact: the best-case corner may not depend on Pmin or Pmax for a particular interconnect parameter but on a value within that range.
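A toy example shows how a performance extreme can fall inside a parameter range rather than at its corners. The sketch models the delay of a driver plus wire as (R_driver + r/W)(c_area·W + c_fringe), where widening the wire lowers its resistance but raises its capacitance. The model and all constants are hypothetical and chosen only to make the effect visible.

```python
from itertools import product

def wire_delay(width, driver_res=1.0, sheet_res=4.0, area_cap=1.0, fringe_cap=1.0):
    """Toy RC delay: (R_driver + r/W) * (c_area*W + c_fringe). Constants are hypothetical."""
    return (driver_res + sheet_res / width) * (area_cap * width + fringe_cap)

def corner_extremes(func, ranges):
    """Corner analysis: evaluate func at every combination of (min, max) values."""
    values = [func(*corner) for corner in product(*ranges)]
    return min(values), max(values)

def sweep_minimum(func, lo, hi, steps=300):
    """Brute-force sweep of a single parameter across its full range."""
    points = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    best = min(points, key=func)
    return best, func(best)

w_min, w_max = 1.0, 4.0  # assumed width range, arbitrary units
corner_best, corner_worst = corner_extremes(wire_delay, [(w_min, w_max)])
best_w, true_best = sweep_minimum(wire_delay, w_min, w_max)
print(corner_best, corner_worst)  # both corners give the same delay
print(best_w, true_best)          # the true minimum lies between them
```

With these constants both width corners give a delay of 10 while the minimum of 9 occurs at W = 2, so a corner-only analysis would report no spread at all and miss the real best case.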

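[Figure: interconnect cross-section over metal layers M1 to M3, showing dimensions H, W and T, capacitance Cg, and the (Hmax, Wmax, Tmax) and (Hmin, Wmin, Tmin) corners]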

SSTA


References

• [Kuon07] Kuon, Tessier, “FPGA Architecture: Survey and Challenges,” Foundations and Trends in Electronic Design Automation, Vol. 2, No. 2 (2007) 135–253.

• [Lin07] Yan Lin and Lei He, “Device and Architecture Concurrent Optimization for FPGA Transient Soft Error Rate,” in ICCAD, 2007.

• [Golshan07] S. Golshan and E. Bozorgzadeh, “Single-event-upset (SEU) awareness in FPGA routing,” in DAC, 2007.

• [Xilinx] www.xilinx.com

• [Altera] www.altera.com

• [Wong05] H.-Y. Wong, L. Cheng, Y. Lin, and L. He, “FPGA device and architecture evaluation considering process variations,” in ICCAD, 2005.

• [Nabaa06] G. Nabaa, N. Azizi, and F. N. Najm, “An adaptive FPGA architecture with process variation compensation and reduced leakage,” in DAC, 2006.


• [Sedcole07] P. Sedcole and P. Y. K. Cheung, “Parametric yield in FPGAs due to within-die delay variations: A quantitative analysis,” in FPGA, 2007.