Enabling System-Level Modeling of Variation-Induced Faults
in Networks-on-Chips
Konstantinos Aisopos (Princeton, MIT)
Chia-Hsin Owen Chen (MIT)
Li-Shiuan Peh (MIT)
The Tale of Resilient NoCs
Silicon technologies move into the nanometer regime
Devices become unreliable due to Process Variation (PV)
System designers propose resilient NoC architectures
From 1994 to 2011: Dally’s Reliable Router (1994), RoCo (ISCA’06), BulletProof (HPCA’06), Vicis (DAC’09)
What fault model are these proposals evaluated with?
uniform random fault distribution across gates
>50% inaccuracy in capturing fault locations
Can we do better?
Methodology for Accurate Fault Modeling
What is the golden reference of the expected PV maps?
The SPICE models of Standard Cells of the technology
How do we use them to capture variation-induced faults?
[Figure: flow diagram. Layouts of Standard Cells -> (extraction) -> SPICE models of Standard Cells. Router RTL -> (synthesis) -> synthesized netlist (list of standard cells and their interconnections). These are combined into a SPICE model netlist, which is used for Monte Carlo simulations.]
Methodology for Accurate Fault Modeling
Challenge: duration of simulation
Solution: hybrid timing / circuit-level simulation
Step 1. Find the critical paths and the inputs that result in
their longest delays (with Static Timing Analysis)
Step 2. Perform Monte Carlo circuit-level simulations
only for these paths / input permutations
to capture variation-induced timing violations
Step 3. Map circuit-level timing violations back to
system-level faults
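Steps 1 to 3 can be illustrated with a small sketch. Everything in it is assumed for illustration: the gate delays, the Gaussian variation model, the 5% sigma, the fault labels attached to each path, and the function names are hypothetical stand-ins, not the authors' SPICE-based flow. The critical paths are taken as given (as if produced by Static Timing Analysis), and each Monte Carlo trial perturbs every gate delay and checks the path against the clock period.

```python
import random

def count_timing_violations(gate_delays_ps, clock_period_ps, sigma_frac=0.05,
                            trials=100, seed=0):
    """Count Monte Carlo trials in which a path's total delay exceeds the clock period.

    gate_delays_ps: nominal delays of the gates on one critical path (assumed values).
    sigma_frac: assumed relative standard deviation of each gate delay.
    """
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        # Perturb each gate independently (a simplification of real process variation).
        delay = sum(rng.gauss(d, sigma_frac * d) for d in gate_delays_ps)
        if delay > clock_period_ps:
            violations += 1
    return violations

# Hypothetical critical paths, each tagged with the system-level fault it would cause.
critical_paths = {
    "path1": {"delays": [60, 80, 75, 90, 45], "fault": "packet loss"},
    "path2": {"delays": [70, 85, 95, 100], "fault": "data corruption"},
}
clock_period_ps = 357  # roughly 1 / 2.8 GHz
trials = 100

# Map each path's violation rate back to the fault type it is tagged with.
violation_rates = {}
for name, path in critical_paths.items():
    v = count_timing_violations(path["delays"], clock_period_ps, trials=trials)
    violation_rates[name] = (path["fault"], v / trials)
```

Restricting the expensive trials to the paths found in Step 1 is what keeps the simulation time manageable; the sketch mirrors that by iterating only over `critical_paths` rather than over every gate in the design.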
Methodology for Accurate Fault Modeling
Step 3: mapping circuit-level violations to system-level faults
Each Verilog signal piggybacks a vector of system-level faults
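[Figure: two critical paths (critical path 1, critical path 2) annotated with the system-level faults their timing violations cause: unfair arbitration, data corruption, packet loss. Over 100 Monte Carlo simulations, the number of timing violations on the paths is 1/100 and 3/100.]
P(fault type = packet loss) = (1/100) ∪ (3/100)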
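The slide combines the two violation rates (1/100 and 3/100) with a union. One way to evaluate that union, assuming the two paths fail independently (an assumption not stated in the slides), is 1 - (1 - p1)(1 - p2). The function name below is hypothetical.

```python
def union_probability(probabilities):
    """P(at least one event occurs), assuming the events are independent."""
    none_occur = 1.0
    for p in probabilities:
        none_occur *= (1.0 - p)
    return 1.0 - none_occur

# Violation rates of the two critical paths that map to packet loss.
p_packet_loss = union_probability([1 / 100, 3 / 100])
```

With these inputs the result is 1 - 0.99 × 0.97 = 0.0397, slightly below the simple sum 0.04, which remains an upper bound when independence does not hold.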
Probability / System Impact of Faults?
(1) for fixed configuration and fixed runtime conditions
[Figure: probability of occurrence per fault type: data corruption, packet loss, misrouting, credit generation, credit loss, erroneous allocation, unfair arbitration, packet duplication. Other labels in the figure: packet conservation, flow control.]
configuration: 5-input / 5-output router, 4-stage pipeline, 4 private VCs, 3 buffers/VC, 64-bit wires
runtime conditions: 2.8GHz, 27C
Probability / System Impact of Faults?
Parameter                  Value
data vnet, num VCs         2
data vnet, num buff/VC     3
control vnet, num VCs      2
control vnet, num buff/VC  1
channel width (bits)       64
num inputs                 5 (4 directions, network interface)
num outputs                5 (4 directions, network interface)
frequency                  75% synthesis frequency (2.85GHz)
temperature                not fixed (input argument)
core power                 1 watt
topology                   8x8 mesh, 4 memory controllers at corners
floorplan                  256 mm², 2mm x 2mm cores, 0.2mm x 0.2mm routers
L1 cache                   32KB/node, private unified, 2W, MESI
L2 cache                   1MB/node, shared distributed, 16W
workload                   uniform random traffic, PARSEC suite
(2) for dynamic runtime conditions
[Figure: tool flow. Inputs: system and network configuration, router configuration, and process parameters (threshold voltage, transistor width, transistor length, oxide thickness, each given as (μ,σ)). Components: multi-core simulator, Garnet (network simulator), Orion 2.0 (power model), Hotspot 5.0 (thermal model), Fault Model. Labels on the connections: floorplan, power, temperature. Output: probability of faults. A "temperature = fixed °C" label also appears in the figure.]
Probability / System Impact of Faults?
(2) for dynamic runtime conditions
8%-10% fault probabilities for high traffic
up to 1% fault probabilities for real workloads
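As an illustration of how per-fault-type probabilities might be consumed inside a network simulator, the sketch below samples faults for each packet traversal of a router. The probability values, the subset of fault types, and the function names are hypothetical; this is not the interface of the authors' tool or of GARNET.

```python
import random

# Hypothetical per-traversal fault probabilities (illustrative values only).
FAULT_PROBABILITIES = {
    "data corruption": 0.004,
    "packet loss": 0.002,
    "misrouting": 0.001,
}

def sample_faults(fault_probabilities, rng):
    """Return the fault types that occur for one packet traversing one router."""
    return [fault for fault, p in fault_probabilities.items() if rng.random() < p]

rng = random.Random(42)  # fixed seed so repeated runs give the same fault pattern
counts = {fault: 0 for fault in FAULT_PROBABILITIES}
traversals = 100_000
for _ in range(traversals):
    for fault in sample_faults(FAULT_PROBABILITIES, rng):
        counts[fault] += 1

# Fraction of traversals in which each fault type was injected.
observed = {fault: n / traversals for fault, n in counts.items()}
```

In a real integration the probabilities would come from the fault model for the current temperature and router configuration rather than from a constant table.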
Conclusions
Presented a fault modeling tool for system-level simulators
Accurate + easy-to-integrate into any network simulator
(already available in GEMS and GARNET)
Do you need a fault model to accurately evaluate…
…a resilient coherence protocol (tolerating lost messages)?
…a resilient routing algorithm (tolerating misrouted packets)?
…an Error Correction Code (protecting data bits)?
…then consider integrating our tool into your simulator to accurately model faults!
Download here: www.mit.edu/~kaisopos/FaultModel