Enabling System-Level Modeling of Variation-Induced Faults in Networks-on-Chips Konstantinos Aisopos (Princeton, MIT) Chia-Hsin Owen Chen (MIT) Li-Shiuan Peh (MIT)

Page 1

Enabling System-Level Modeling of Variation-Induced Faults in Networks-on-Chips

Konstantinos Aisopos (Princeton, MIT)

Chia-Hsin Owen Chen (MIT)

Li-Shiuan Peh (MIT)

Page 2

The Tale of Resilient NoCs

Silicon technologies move into the nanometer regime

Devices become unreliable due to Process Variation (PV)

System designers propose resilient NoC architectures

From 1994 to 2011: Dally’s Reliable Router (1994), RoCo (ISCA’06), BulletProof (HPCA’06), Vicis (DAC’09)

What fault model are these proposals evaluated with?

uniform random fault distribution across gates

>50% inaccuracy in capturing fault locations

Can we do better?

Page 3

Methodology for Accurate Fault Modeling

What is the golden reference of the expected PV maps?

The SPICE models of Standard Cells of the technology

How do we use them to capture variation-induced faults?

[Flow diagram]

Layouts of Standard Cells → (extraction) → SPICE models of Standard Cells

router RTL → (synthesis) → synthesized netlist (list of standard cells and their interconnections)

SPICE models of Standard Cells + synthesized netlist → SPICE model netlist → Monte Carlo simulations

Page 4

Methodology for Accurate Fault Modeling

Challenge: duration of simulation

Solution: hybrid timing / circuit-level simulation

Step 1. Find the critical paths and the inputs that result in their longest delays (with Static Timing Analysis)

Step 2. Perform Monte Carlo circuit-level simulations only for these paths / input permutations, to capture variation-induced timing violations

Step 3. Map circuit-level timing violations back to system-level faults
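
Step 2 can be illustrated with a toy Monte Carlo loop. The sketch below is not the SPICE-based flow from the slides: it assumes, purely for illustration, that each gate delay on a critical path is an independent Gaussian sample, and it counts how often the summed path delay exceeds the clock period. The function name `estimate_violation_rate` and all numbers (10 gates, 30 ps mean delay, 3 ps sigma, 310 ps clock period) are hypothetical.

```python
import random


def estimate_violation_rate(num_gates, mean_delay_ps, sigma_ps,
                            clock_period_ps, num_runs=100, seed=0):
    """Fraction of Monte Carlo runs in which a path misses timing.

    Toy model (an assumption, not the slides' SPICE flow): every gate
    delay is an independent Gaussian sample, clamped at zero.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    violations = 0
    for _ in range(num_runs):
        # One Monte Carlo run: draw a delay per gate and sum along the path.
        path_delay = sum(max(0.0, rng.gauss(mean_delay_ps, sigma_ps))
                         for _ in range(num_gates))
        if path_delay > clock_period_ps:
            violations += 1
    return violations / num_runs


# Hypothetical critical path: 10 gates, 30 ps nominal delay each.
rate = estimate_violation_rate(num_gates=10, mean_delay_ps=30.0,
                               sigma_ps=3.0, clock_period_ps=310.0)
print(f"estimated timing-violation rate: {rate:.2f}")
```

Restricting the loop to the paths and inputs found in Step 1 is what keeps the number of circuit-level runs manageable; the per-path violation rate is the quantity that Step 3 maps to system-level faults.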

Page 5

Methodology for Accurate Fault Modeling

Step 3: mapping circuit-level violations → system-level faults

Each Verilog signal piggybacks a vector of system-level faults

[Figure: critical path 1 and critical path 2, each marked (X) with the system-level faults it can cause. Fault labels: unfair arbitration, data corruption, packet loss. 100 Monte Carlo simulations give the number of timing violations per path: 1/100 and 3/100.]

P(fault type = packet loss) = (1/100) ∪ (3/100)
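
The union of the two per-path violation rates (1/100 and 3/100) can be worked out numerically. The sketch below assumes the two critical paths violate timing independently; the slides do not state this, so treat it as an illustrative assumption. The helper name `union_probability` is hypothetical.

```python
def union_probability(path_probabilities):
    """P(at least one path violates timing), assuming independent paths."""
    p_none = 1.0
    for p in path_probabilities:
        p_none *= (1.0 - p)  # probability that this path does NOT violate
    return 1.0 - p_none


# Violation rates of the two critical paths that can cause packet loss.
p_packet_loss = union_probability([1 / 100, 3 / 100])
print(f"P(packet loss) = {p_packet_loss:.4f}")  # prints "P(packet loss) = 0.0397"
```

Under independence the result is slightly below the simple sum 1/100 + 3/100 = 0.04, because runs in which both paths violate timing are counted only once.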

Page 6

Probability / System Impact of Faults?

(1) for fixed configuration and fixed runtime conditions

configuration: 5-input / 5-output router, 4-stage pipeline, 4 private VCs, 3 buffers/VC, 64-bit wires

runtime conditions: 2.8GHz, 27°C

[Figure: probability of occurrence per fault type. Fault types: data corruption, packet loss, misrouting, credit generation, credit loss, erroneous allocation, unfair arbitration, packet duplication. Further labels: packet conservation, flow control.]

Page 7

Probability / System Impact of Faults?

(2) for dynamic runtime conditions

router configuration

  data vnet: num VCs 2, num buff/VC 3
  control vnet: num VCs 2, num buff/VC 1
  channel width (bits): 64
  num inputs: 5 (4 directions, network interface)
  num outputs: 5 (4 directions, network interface)
  frequency: 75% synthesis frequency (2.85GHz)
  temperature: not fixed (input argument)

system and network configuration

  core power: 1 watt
  topology: 8x8 mesh, 4 memory controllers at corners
  floorplan: 256mm2, 2mmx2mm cores, 0.2mmx0.2mm routers
  L1 cache: 32KB/node, private unified, 2W, MESI
  L2 cache: 1MB/node, shared distributed, 16W
  workload: uniform random traffic, PARSEC suite

[Diagram: multi-core simulator and Garnet (network simulator) → Orion 2.0 (power model) → power → Hotspot 5.0 (thermal model, with the floorplan) → temperature → Fault Model → probability of faults. The Fault Model also takes the process parameters: threshold voltage (μ,σ), transistor width (μ,σ), transistor length (μ,σ), oxide thickness (μ,σ). The computed temperature takes the place of "temperature = fixed °C".]

Page 8

Probability / System Impact of Faults?

(2) for dynamic runtime conditions

8%-10% fault probabilities for high traffic

up to 1% fault probabilities for real workloads

Page 9

Conclusions

Presented a fault modeling tool for system-level simulators

Accurate and easy to integrate into any network simulator

(already available in GEMS and GARNET)
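
As a rough illustration of what integration into a network simulator could look like, the sketch below draws at most one system-level fault per router traversal from a table of fault probabilities. It is not the tool's actual interface: the function `sample_fault`, the one-fault-per-traversal assumption, and the probability values are all hypothetical; only the fault-type names come from the slides.

```python
import random


def sample_fault(fault_probabilities, rng):
    """Return the name of one injected fault, or None for fault-free.

    Assumes (for illustration) that at most one fault type fires per
    router traversal and that the probabilities sum to at most 1.
    """
    draw = rng.random()  # uniform in [0, 1)
    cumulative = 0.0
    for fault_type, probability in fault_probabilities.items():
        cumulative += probability
        if draw < cumulative:
            return fault_type
    return None


# Hypothetical per-traversal probabilities; not values from the slides.
probabilities = {
    "data corruption": 0.004,
    "packet loss": 0.003,
    "misrouting": 0.001,
}
rng = random.Random(42)
injected = [sample_fault(probabilities, rng) for _ in range(10_000)]
faulty = sum(f is not None for f in injected)
print(f"{faulty} of {len(injected)} traversals were faulty")
```

A simulator would then act on the returned fault type, for example by dropping the packet on "packet loss" or flipping payload bits on "data corruption".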

Do you need a fault model to accurately evaluate…

…a resilient coherence protocol (tolerating lost messages)?

…a resilient routing algorithm (tolerating misrouted packets)?

…an Error Correction Code (protecting data bits)?

…then consider integrating our tool into your simulator to accurately model faults!

Download here: www.mit.edu/~kaisopos/FaultModel