AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs
Presenter: Lin Huang
Lin Huang and Qiang Xu
CUhk REliable computing laboratory (CURE)
The Chinese University of Hong Kong
Lifetime Reliability Becomes a Serious Concern
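[Figure: bathtub curve of failure rate versus time, with infant mortality, useful life, and wearout regions, drawn for the 180nm, 130nm, and 90nm technology nodes; service-life annotations: ~ 7 years [T. M. Mak], < 7 years, ~ 10 years]
Failure mechanisms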
Electromigration
NBTI
TDDB
Reliability-related factors
Temperature
Supply voltage
Frequency
Design-Stage Decisions Affect Lifetime Reliability
SPECIFICATION: Functionality, Power consumption, Area constraint, Thermal issue, Expected service life, …
IC
DPM / DTM: DVFS, Timeout, Thermal throttling, Power gating, …
Redundancy: Level, Quantity, …
Task Allocation: Round-robin, Optimized, …
Without an efficient yet accurate lifetime reliability simulation framework, making good decisions is extremely difficult, if not impossible!
The Challenges in Simulation-Based Lifetime Reliability Analysis
Increasing failure rate
Exponential distribution assumption in previous work
[Figure: bathtub curve of failure rate versus time, with infant mortality, useful life, and wearout regions]
The Challenges in Simulation-Based Lifetime Reliability Analysis
Operational temperature varies significantly and rapidly
Obtained with HotSpot 4.0 [Huang-ieeetc08]
How to achieve efficient yet accurate lifetime reliability simulation with such limited information, when failure mechanisms follow arbitrary failure distributions?
Key Idea
General failure distribution with a general scale parameter by which time is divided
Example: Weibull failure distribution
Suppose we can express the reliability function as [equation not preserved in the transcript], and [symbol not preserved in the transcript] can be computed according to limited tracing information
Example: reliability function [equation not preserved in the transcript]
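The equations on this slide did not survive, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes the textbook Weibull reliability function, in which time appears only divided by the scale parameter, and it assumes that a time-varying scale can be handled by accumulating duration / scale over piecewise-constant intervals. The function names and the interval representation are hypothetical.

```python
import math


def weibull_reliability(t, scale, shape):
    """Textbook Weibull reliability R(t) = exp(-(t / scale) ** shape).

    Time enters only through the ratio t / scale.
    """
    return math.exp(-((t / scale) ** shape))


def reliability_over_intervals(intervals, shape):
    """Reliability after a sequence of (duration, scale) intervals.

    Assumption: each interval contributes duration / scale to a
    normalized age, and the Weibull form is applied to the total.
    """
    normalized_age = sum(duration / scale for duration, scale in intervals)
    return math.exp(-(normalized_age ** shape))
```

With a constant scale the two functions agree, for example `reliability_over_intervals([(1.0, 2.0), (1.0, 2.0)], 2.0)` matches `weibull_reliability(2.0, 2.0, 2.0)`.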
Key Idea
Aging rate: captures the impact of a certain usage strategy
Reliability-related usage strategy: a combination of …
Dynamic power/thermal management
Trigger mechanism
Load-sharing strategy
… given the application flow with certain characteristics
Key Idea
[Figure: the usage strategy (temperature, supply voltage, frequency) and a representative workload determine the aging rate; time axis labeled Past and Future]
Key Idea
[Figure: representative workload on a time axis labeled Past and Future]
Proposed Simulation Framework: AgeSim – Step One: Simulation and Tracing
[Figure: block diagram with inputs (power state machine, trigger mechanism, application flow, load-sharing strategy, redundancy scheme), a power/thermal manager, a power simulator, and a temperature simulator, exchanging execution mode, power data, and temperature data at each time step]
Proposed Simulation Framework: AgeSim – Step One: Simulation and Tracing
[Figure: the same block diagram (power state machine, trigger mechanism, application flow, load-sharing strategy, redundancy scheme, power/thermal manager, temperature simulator, power simulator, execution mode, power data, temperature data), now with two outputs: reliability-related factors and a trace file]
Proposed Simulation Framework: AgeSim – Step Two: Aging Rate Calculation
[Figure: the reliability-related factors and the trace file are combined to compute the aging rate]
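The slide does not give the formula for this step, so the sketch below only illustrates the kind of computation involved. It assumes the trace is a list of (duration, temperature in kelvin) samples, that the scale parameter follows an Arrhenius temperature dependence, and that the aging rate is the time-weighted mean of 1 / scale. The activation energy, the prefactor, and all function names are hypothetical choices, not values from the talk.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_scale(temp_k, ea_ev=0.7, prefactor=1.0):
    """Hypothetical scale parameter: larger (slower aging) when cooler."""
    return prefactor * math.exp(ea_ev / (BOLTZMANN_EV * temp_k))


def aging_rate(trace, ea_ev=0.7):
    """Time-weighted mean of 1 / scale over (duration, temp_k) samples."""
    total_time = sum(duration for duration, _ in trace)
    aged = sum(duration / arrhenius_scale(temp_k, ea_ev)
               for duration, temp_k in trace)
    return aged / total_time


def weibull_mttf(rate, shape):
    """MTTF of a Weibull distribution whose scale is 1 / rate."""
    return math.gamma(1.0 + 1.0 / shape) / rate
```

Under these assumptions a hotter trace gives a larger aging rate and therefore a shorter MTTF.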
Model Validation
By average temperature: 28.3% error in MTTF
By AgeSim: almost identical results
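One plausible reason the average-temperature estimate is off is that wearout rates are strongly nonlinear in temperature. The sketch below illustrates this with an assumed Arrhenius rate and made-up temperatures sampled at equal intervals; it does not reproduce the experiment behind the 28.3% figure, and the function names and activation energy are hypothetical.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def relative_rate(temp_k, ea_ev=0.7):
    """Assumed Arrhenius wearout rate, up to a constant factor."""
    return math.exp(-ea_ev / (BOLTZMANN_EV * temp_k))


def rate_from_mean_temperature(temps_k):
    """Evaluate the rate once, at the mean temperature."""
    return relative_rate(sum(temps_k) / len(temps_k))


def rate_from_trace(temps_k):
    """Average the rate over every sample of the trace."""
    return sum(relative_rate(t) for t in temps_k) / len(temps_k)


temps = [320.0, 360.0]  # made-up samples, equal time per sample
underestimates = rate_from_mean_temperature(temps) < rate_from_trace(temps)
```

Because the assumed rate is convex in temperature over this range, using the mean temperature underestimates the rate and so overestimates MTTF.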
Case Study I: Dynamic Voltage and Frequency Scaling
DVFS1: low voltage = 90% Vdd
DVFS2: low voltage = 80% Vdd
No DVFS
[Figure: power state machine with states HVIdle, HVRun, LVIdle, and LVRun; transitions labeled task arrival, task departure, T>TH, and T<TL]
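The state machine on this slide can be sketched as a transition table. The state names and the task arrival / task departure labels come from the slide; the direction of the thermal transitions (high voltage to low voltage when T > TH, back when T < TL) is an assumption, as are the event names `too_hot` and `cool_enough` and the simplification that one departure returns the core to idle.

```python
# (state, event) -> next state; unknown pairs leave the state unchanged.
TRANSITIONS = {
    ("HVIdle", "task_arrival"): "HVRun",
    ("HVRun", "task_departure"): "HVIdle",
    ("LVIdle", "task_arrival"): "LVRun",
    ("LVRun", "task_departure"): "LVIdle",
    # Assumed thermal transitions: drop voltage when hot, restore when cool.
    ("HVIdle", "too_hot"): "LVIdle",
    ("HVRun", "too_hot"): "LVRun",
    ("LVIdle", "cool_enough"): "HVIdle",
    ("LVRun", "cool_enough"): "HVRun",
}


def next_state(state, event):
    """Return the next power state, or the same state if no rule applies."""
    return TRANSITIONS.get((state, event), state)


def thermal_event(temperature, t_low, t_high):
    """Map a temperature reading to a thermal event name, or None."""
    if temperature > t_high:
        return "too_hot"
    if temperature < t_low:
        return "cool_enough"
    return None
```

Keeping two thresholds rather than one gives hysteresis, so the core does not flip between voltages on every small temperature change.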
Case Study I: Dynamic Voltage and Frequency Scaling
System load: the ratio between task arrival rate and service rate
Case Study II: Task Allocation on Multi-Core Processors
Random allocation
Performance-aware allocation: always choose the available core with the highest frequency
[Sarangi-ieeetsm08]
Example Chip Frequency Map
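The two allocation policies compared in this case study can be sketched as follows. The representation is an assumption: cores are identified by integers, `available` is the set of idle cores, and `freq_map` is a hypothetical dictionary from core id to frequency standing in for the chip frequency map.

```python
import random


def random_allocation(available, rng=random):
    """Pick any available core uniformly at random."""
    # sorted() gives a stable sequence so a seeded rng is reproducible.
    return rng.choice(sorted(available))


def performance_aware_allocation(available, freq_map):
    """Pick the available core with the highest frequency."""
    return max(available, key=lambda core: freq_map[core])
```

For example, with `freq_map = {0: 2.0, 1: 3.0, 2: 2.5}` and cores 0 and 2 available, the performance-aware policy picks core 2, since core 1 is busy.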
Case Study II: Task Allocation on Multi-Core Processors
System load: the ratio between task arrival rate and service rate
Discussion on the Flexibility of AgeSim
Task allocation and scheduling for MPSoC under lifetime reliability constraint
Multiprocessor with different redundancy schemes
Example: gracefully degrading redundancy, standby redundancy
Conclusion
Lifetime reliability has become a serious concern for high-performance ICs
Design stage decisions significantly affect system reliability
We propose an efficient yet accurate simulation framework to evaluate system reliability under various usage strategies
Arbitrary failure distribution
Fine-grained tracing for representative workloads
AgeSim is effective and flexible
Thank you for your attention!
Backup Slides
Multiple representative workloads
Aging rate
Accuracy
Key idea
Multiple Representative Workloads
The proposed method could be easily extended to analyze the system with multiple representative workloads
We can organize the workloads into a hyper-workload with their occurrence probabilities
We can extract the aging rate and occurrence probability for each workload and then compute the unified aging rate by [equation not preserved in the transcript]
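The combining formula is missing from the transcript. One natural reading, offered here only as an assumption, is that the unified aging rate is the occurrence-probability-weighted sum of the per-workload aging rates. The function name and the (probability, rate) pair representation are hypothetical.

```python
import math


def unified_aging_rate(workloads):
    """Probability-weighted sum over (occurrence_probability, aging_rate).

    Assumption: the hyper-workload's aging rate is the expectation of
    the per-workload aging rates.
    """
    total_probability = sum(p for p, _ in workloads)
    if not math.isclose(total_probability, 1.0):
        raise ValueError("occurrence probabilities must sum to 1")
    return sum(p * rate for p, rate in workloads)
```

For example, workloads with aging rates 4.0 and 2.0 occurring with probabilities 0.25 and 0.75 would combine to 2.5 under this reading.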
Aging Rate
[Figure: failure rate versus time]
Aging rate is independent of time
Accuracy
Key Idea
[Figure: the processor usage strategy (power state machine, trigger mechanism, application flow, load-sharing strategy, redundancy scheme) determines the aging rate, which in turn determines the reliability function]