AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs
![Page 1: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/1.jpg)
AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs
Presenter: Lin Huang
Lin Huang and Qiang Xu
CUhk REliable computing laboratory (CURE)
The Chinese University of Hong Kong
![Page 2: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/2.jpg)
Lifetime Reliability Becomes A Serious Concern
[Figure: bathtub curve of failure rate versus time, with infant mortality, useful life, and wearout regions; curves for 180nm, 130nm, and 90nm with useful-life annotations (~ 7 year, < 7 year, ~ 10 year) [T. M. Mak]]
Failure mechanisms
Electromigration
NBTI
TDDB
Reliability-related factors
Temperature
Supply voltage
Frequency
![Page 3: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/3.jpg)
Design-Stage Decisions Affect Lifetime Reliability
SPECIFICATION: Functionality, Power consumption, Area constraint, Thermal issue, Expected service life, …
IC
DPM / DTM: DVFS, Timeout, Thermal throttling, Power gating, …
Redundancy: Level, Quantity, …
Task Allocation: Round-robin, Optimized, …
Without an efficient yet accurate lifetime reliability simulation framework, making good design decisions is extremely difficult, if not impossible!
![Page 4: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/4.jpg)
The Challenges in Simulation-Based Lifetime Reliability Analysis
Increasing failure rate
Exponential distribution assumption in previous work
[Figure: bathtub curve of failure rate versus time, with infant mortality, useful life, and wearout regions]
![Page 5: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/5.jpg)
The Challenges in Simulation-Based Lifetime Reliability Analysis
Operational temperature varies significantly and rapidly
Obtained with HotSpot 4.0 [Huang-ieeetc08]
How to achieve efficient yet accurate lifetime reliability simulation with such limited information, when failure mechanisms follow arbitrary failure distributions?
![Page 6: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/6.jpg)
Key Idea
General failure distribution with a general scale parameter, by which time is divided
Example: Weibull failure distribution
Suppose we can express the reliability function as
and can be computed according to limited tracing information
Example: reliability function
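For reference, the standard two-parameter Weibull reliability function is given below. The symbols θ (scale) and β (shape) are the conventional ones and are not necessarily the slide's notation. The second expression is only a hedged sketch of how a time-varying scale parameter could be handled by accumulating scaled time; it is an assumption for illustration, not the slide's own equation.

```latex
R(t) = \exp\!\left[-\left(\frac{t}{\theta}\right)^{\beta}\right]
\qquad\text{and, assuming a time-varying } \theta(\tau):\qquad
R(t) = \exp\!\left[-\left(\int_{0}^{t}\frac{d\tau}{\theta(\tau)}\right)^{\beta}\right]
```

With a constant θ the integral reduces to t/θ, so the second form agrees with the first.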
![Page 7: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/7.jpg)
Key Idea
Aging rate: captures the impact of a certain usage strategy
Reliability-related usage strategy: a combination of …
Dynamic power/thermal management
Trigger mechanism
Load-sharing strategy
… given the application flow with certain characteristics
![Page 8: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/8.jpg)
Key Idea
[Figure labels: Usage strategy; Temperature, Supply voltage, Frequency; Representative workload; Past, Future; Aging rate]
![Page 9: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/9.jpg)
Key Idea
[Figure labels: Past, Future; Representative workload]
![Page 10: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/10.jpg)
Proposed Simulation Framework: AgeSim - Step One: Simulation and Tracing
[Figure: simulation flow with blocks Power State Machine, Trigger Mechanism, Application Flow, Load-sharing Strategy, Redundancy Scheme, Power / Thermal Manager, Power Simulator, and Temperature Simulator; Execution Mode, Power (Data), and Temperature (Data) are exchanged at each time step]
![Page 11: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/11.jpg)
Proposed Simulation Framework: AgeSim - Step One: Simulation and Tracing
[Figure: the same simulation flow (Power State Machine, Trigger Mechanism, Application Flow, Load-sharing Strategy, Redundancy Scheme, Power / Thermal Manager, Power Simulator, Temperature Simulator; Execution Mode, Power (Data), Temperature (Data)), now with Reliability-Related Factors written to a Trace File]
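The tracing step can be pictured with a small sketch. It is not the AgeSim implementation: the per-mode power values, the one-node RC thermal model (a stand-in for a real temperature simulator such as HotSpot), and all names such as `TraceRecord` and `simulate_trace` are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class TraceRecord:
    time_s: float   # end time of the step
    mode: str       # execution mode during the step
    power_w: float  # power drawn during the step
    temp_k: float   # temperature at the end of the step


# Hypothetical power per execution mode, in watts.
MODE_POWER_W = {"run": 20.0, "idle": 5.0}


def simulate_trace(modes, dt_s=0.01, t_amb_k=318.0, r_th=1.0, c_th=0.05):
    """Emit one trace record per time step.

    Temperature follows an assumed lumped RC model,
    C * dT/dt = P - (T - T_amb) / R, solved exactly over each step
    with the power held constant.
    """
    temp_k = t_amb_k
    decay = math.exp(-dt_s / (r_th * c_th))
    trace = []
    for step, mode in enumerate(modes):
        power_w = MODE_POWER_W[mode]
        t_steady = t_amb_k + power_w * r_th  # steady-state temperature for this power
        temp_k = t_steady + (temp_k - t_steady) * decay
        trace.append(TraceRecord((step + 1) * dt_s, mode, power_w, temp_k))
    return trace


if __name__ == "__main__":
    modes = ["run"] * 50 + ["idle"] * 50
    for rec in simulate_trace(modes)[::25]:
        print(f"{rec.time_s:.2f}s {rec.mode:>4} {rec.power_w:5.1f}W {rec.temp_k:7.2f}K")
```

A real flow would replace `MODE_POWER_W` with a power simulator and the RC update with a full thermal simulator; the point here is only the shape of the per-time-step trace.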
![Page 12: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/12.jpg)
Proposed Simulation Framework: AgeSim - Step Two: Aging Rate Calculation
[Figure: Reliability-Related Factors and Trace File are combined to obtain the Aging rate]
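A minimal sketch of how an aging rate might be extracted from a trace follows. It assumes an Arrhenius-type temperature dependence for the inverse scale parameter and a Weibull lifetime; the activation energy, the prefactor, and the function names are hypothetical and are not taken from the slides.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def inverse_scale(temp_k, ea_ev=0.7, prefactor=1.0):
    """1/theta(T) under an assumed Arrhenius dependence (arbitrary units)."""
    return prefactor * math.exp(-ea_ev / (BOLTZMANN_EV_PER_K * temp_k))


def aging_rate(trace, ea_ev=0.7):
    """Time-weighted average of 1/theta over (duration_s, temp_k) records."""
    total = sum(duration for duration, _ in trace)
    weighted = sum(duration * inverse_scale(temp, ea_ev) for duration, temp in trace)
    return weighted / total


def weibull_mttf(rate, beta=2.0):
    """Mean time to failure of R(t) = exp(-(rate * t) ** beta)."""
    return math.gamma(1.0 + 1.0 / beta) / rate


if __name__ == "__main__":
    # Hypothetical trace: equal time at 330 K and at 370 K.
    trace = [(1.0, 330.0), (1.0, 370.0)]
    traced = aging_rate(trace)
    averaged = aging_rate([(2.0, 350.0)])  # same mean temperature, no variation
    print("rate ratio (traced / average-temperature):", traced / averaged)
    print("MTTF ratio (traced / average-temperature):",
          weibull_mttf(traced) / weibull_mttf(averaged))
```

Under this assumed model the rate is convex in temperature over the operating range, so a fluctuating trace ages faster than a constant trace at the same mean temperature.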
![Page 13: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/13.jpg)
Model Validation
By average temperature: 28.3% error in MTTF
By AgeSim: almost identical results
![Page 14: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/14.jpg)
Case Study I: Dynamic Voltage and Frequency Scaling
DVFS1: low voltage is 90% Vdd
DVFS2: low voltage is 80% Vdd
No DVFS
[Figure: power state machine with states HVRun, HVIdle, LVRun, and LVIdle; transitions on task arrival, task departure, T>TH, and T<TL]
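The four-state machine named on this slide can be sketched as follows. The transition rules are inferred from the diagram labels (task arrival and departure switch between Idle and Run; T>TH and T<TL switch between high and low voltage) and are an assumption, as are the threshold values and the names `State` and `next_state`.

```python
from enum import Enum


class State(Enum):
    HV_IDLE = "HVIdle"
    HV_RUN = "HVRun"
    LV_IDLE = "LVIdle"
    LV_RUN = "LVRun"


def next_state(state, event=None, temp=None, t_high=85.0, t_low=75.0):
    """Advance the assumed DVFS state machine by one step.

    event: "arrival", "departure", or None.
    temp:  current temperature, or None to skip the thermal check.
    t_high / t_low are hypothetical thresholds standing in for TH and TL.
    """
    high_voltage = state in (State.HV_IDLE, State.HV_RUN)
    running = state in (State.HV_RUN, State.LV_RUN)

    if event == "arrival":
        running = True
    elif event == "departure":
        running = False

    if temp is not None:
        if high_voltage and temp > t_high:
            high_voltage = False   # too hot: drop to the low voltage
        elif not high_voltage and temp < t_low:
            high_voltage = True    # cooled down: restore the high voltage

    if high_voltage:
        return State.HV_RUN if running else State.HV_IDLE
    return State.LV_RUN if running else State.LV_IDLE


if __name__ == "__main__":
    state = State.HV_IDLE
    for event, temp in [("arrival", 70.0), (None, 90.0), (None, 80.0), ("departure", 70.0)]:
        state = next_state(state, event, temp)
        print(event, temp, "->", state.value)
```

The gap between the two thresholds gives hysteresis, so the voltage does not toggle on every small temperature change.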
![Page 15: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/15.jpg)
Case Study I: Dynamic Voltage and Frequency Scaling
System load: the ratio between task arrival rate and service rate
![Page 16: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/16.jpg)
Case Study I: Dynamic Voltage and Frequency Scaling
System load: the ratio between task arrival rate and service rate
![Page 17: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/17.jpg)
Case Study II: Task Allocation on Multi-Core Processors
Random allocation
Performance-aware allocation: always choose the available core with the highest frequency
[Sarangi-ieeetsm08]
Example Chip Frequency Map
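The two allocation policies compared here can be sketched in a few lines. The frequency values and the function names are hypothetical; only the selection rules (random choice versus the available core with the highest frequency) come from the slide.

```python
import random


def idle_cores(frequencies_ghz, busy):
    """Indices of cores that are not currently busy."""
    return [i for i in range(len(frequencies_ghz)) if i not in busy]


def performance_aware(frequencies_ghz, busy):
    """Pick the available core with the highest frequency, or None if all are busy."""
    idle = idle_cores(frequencies_ghz, busy)
    if not idle:
        return None
    return max(idle, key=lambda i: frequencies_ghz[i])


def random_allocation(frequencies_ghz, busy, rng=random):
    """Pick any available core uniformly at random, or None if all are busy."""
    idle = idle_cores(frequencies_ghz, busy)
    return rng.choice(idle) if idle else None


if __name__ == "__main__":
    freq_map = [2.0, 2.6, 2.3, 2.4]  # hypothetical per-core frequencies in GHz
    busy = {1}
    print("performance-aware:", performance_aware(freq_map, busy))
    print("random:", random_allocation(freq_map, busy, random.Random(0)))
```

Performance-aware allocation concentrates work on the fastest cores, which is why its effect on per-core aging is worth comparing against random allocation.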
![Page 18: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/18.jpg)
Case Study II: Task Allocation on Multi-Core Processors
System load: the ratio between task arrival rate and service rate
![Page 19: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/19.jpg)
Discussion on the Flexibility of AgeSim
Task allocation and scheduling for MPSoC under lifetime reliability constraint
Multiprocessor with different redundancy schemes
Example: gracefully degrading redundancy, standby redundancy
![Page 20: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/20.jpg)
Conclusion
Lifetime reliability has become a serious concern for high-performance ICs
Design-stage decisions significantly affect system reliability
We propose an efficient yet accurate simulation framework to evaluate system reliability under various usage strategies
Arbitrary failure distribution
Fine-grained tracing for representative workloads
AgeSim is effective and flexible
![Page 21: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/21.jpg)
AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs
Thank you for your attention!
![Page 22: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/22.jpg)
Backup Slides
Multiple representative workloads
Aging rate
Accuracy
Key idea
![Page 23: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/23.jpg)
Multiple Representative Workloads
The proposed method can easily be extended to analyze a system with multiple representative workloads
We can organize the workloads into a hyper-workload with their occurrence probabilities
We can extract the aging rate and occurrence probability for each workload and then compute the unified aging rate by
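The combining formula itself is not preserved in this transcript. One plausible reading, stated here purely as an assumption, is a probability-weighted sum of the per-workload aging rates; the sketch below illustrates that reading only, and `unified_aging_rate` is a hypothetical name.

```python
def unified_aging_rate(workloads):
    """Combine per-workload aging rates under an assumed weighted-sum rule.

    workloads: iterable of (occurrence_probability, aging_rate) pairs.
    The weighted sum is an illustrative assumption, not the slide's formula.
    """
    workloads = list(workloads)
    total_probability = sum(p for p, _ in workloads)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("occurrence probabilities must sum to 1")
    return sum(p * rate for p, rate in workloads)


if __name__ == "__main__":
    # Hypothetical hyper-workload: a light workload 25% of the time, a heavy one 75%.
    print(unified_aging_rate([(0.25, 4.0), (0.75, 8.0)]))
```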
![Page 24: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/24.jpg)
Aging Rate
[Figure: failure rate versus time]
Aging rate is independent of time
![Page 25: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/25.jpg)
Accuracy
![Page 26: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs](https://reader038.fdocuments.us/reader038/viewer/2022103101/56814967550346895db6bb3a/html5/thumbnails/26.jpg)
Key Idea
[Figure labels: Processor usage strategy (Power State Machine, Trigger Mechanism, Application Flow, Load-sharing Strategy, Redundancy Scheme); Aging rate; Reliability function]