On Modeling the Lifetime Reliability of Homogeneous Manycore Systems
Lin Huang and Qiang Xu
CUhk REliable computing laboratory (CURE)
The Chinese University of Hong Kong
Integrated Circuit (IC) Product Reliability
IC errors can be broadly classified into two categories
● Soft errors
• Do not fundamentally damage the circuits
● Hard errors
• Permanent once manifest
• E.g., time-dependent dielectric breakdown (TDDB) in the gate oxides, electromigration (EM) and stress migration (SM) in the interconnects, and thermal cycling (TC)
Manycore Systems
State-of-the-art computing systems have started to employ multiple cores on a single die
● General-purpose processors, multi-digital signal processor systems
● Power efficiency
● Short time-to-market
[Figures. Sources: Intel, Nvidia]
Problem Formulation
To model the lifetime reliability of homogeneous manycore systems using a load-sharing nonrepairable k-out-of-n: G system with general failure distributions
Key features
● k-out-of-n: G system: provides fault tolerance
● Load-sharing: each embedded core carries only part of the load assigned by the operating system
● Nonrepairable: embedded cores are integrated on a single silicon die
● General failure distribution: embedded cores age in operation
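As a point of reference for the k-out-of-n: G structure, the sketch below computes the reliability of the simplest case, in which the n cores fail independently and each is good with the same probability p. It deliberately ignores load sharing and aging, so it is only a baseline and not the model proposed in these slides; the function name and the parameter `p` are illustrative assumptions.

```python
from math import comb


def k_out_of_n_reliability(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent cores are good.

    Assumes every core is good with the same probability p and that
    failures are independent (no load sharing, no aging).
    """
    if not 0 <= k <= n:
        raise ValueError("require 0 <= k <= n")
    # Sum the binomial probabilities of having exactly i good cores, i = k..n.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Example: a system that needs 2 good cores out of 3.
print(k_out_of_n_reliability(2, 3, 0.9))
```

Raising n while keeping k fixed increases this baseline reliability, which is the fault-tolerance benefit the k-out-of-n: G structure is meant to capture.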
Queueing Model for Task Allocation
Embedded cores execute tasks independently, and one core can perform at most one task at a time
Consider a manycore system composed of a set of identical embedded cores
● The cores are divided into the set of active cores, the set of spare cores, and the set of faulty cores
[Figure: applications feed tasks (λa) into a central task allocation queue that serves the processor cores, grouped into Set S1, Set S2, and S3]
Queueing Model for Task Allocation
A general-purpose parallel processing system with a central queue and bulk arrivals is modeled as a queueing system
The utilization of an active core is the probability that it is occupied by tasks
Target system
● Gracefully degrading systems
● Standby redundant systems
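The utilization formula from the slide is not available here, so the following is only a sketch based on the standard steady-state result for a stable queue served by identical cores: each core is busy for a fraction of time equal to the offered load divided by the number of active cores. The argument names (`arrival_rate`, `mean_batch_size`, `service_rate`, `active_cores`) are hypothetical and not taken from the slides.

```python
def core_utilization(arrival_rate, mean_batch_size, service_rate, active_cores):
    """Steady-state fraction of time one active core is busy.

    Assumes batches arrive at `arrival_rate`, each batch carries
    `mean_batch_size` tasks on average, and every core completes tasks
    at `service_rate` (all names are hypothetical).
    """
    if active_cores <= 0:
        raise ValueError("need at least one active core")
    rho = arrival_rate * mean_batch_size / (active_cores * service_rate)
    if rho >= 1.0:
        raise ValueError("queue is unstable: offered load exceeds capacity")
    return rho


# Utilization of each surviving core rises as cores fail and the load is re-shared.
for cores in (8, 7, 6):
    print(cores, round(core_utilization(2.0, 3.0, 1.5, cores), 4))
```

In a gracefully degrading system the number of active cores shrinks with each failure, so under this sketch the per-core utilization grows over the system's lifetime.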
Lifetime Reliability of Entire System – Gracefully Degrading System
A functioning manycore system may contain any number of good cores from k to n
Consider the probability that the system has a given number of active cores at time t
The system reliability is the sum of these probabilities over all functioning states
Thus, the Mean Time to Failure (MTTF) of the system is the integral of the system reliability over time
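For a non-negative lifetime, the MTTF equals the integral of the reliability function from zero to infinity. The sketch below approximates that integral with the trapezoidal rule over a finite horizon; the helper name `mttf`, the horizon, and the exponential example are illustrative assumptions rather than anything specified in the slides.

```python
import math


def mttf(reliability, horizon, steps=100_000):
    """Trapezoidal approximation of the integral of reliability(t) on [0, horizon].

    The horizon must be large enough that reliability(horizon) is negligible.
    """
    dt = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for i in range(1, steps):
        total += reliability(i * dt)
    return total * dt


# Sanity check with an exponential lifetime of rate 2, whose exact MTTF is 1/2.
print(mttf(lambda t: math.exp(-2.0 * t), horizon=50.0))
```

The same helper can be applied to any system reliability function once the state probabilities are available.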
Lifetime Reliability of Entire System – Gracefully Degrading System
To determine the state probabilities, each case is expressed through a conditional probability
The remaining question is how to compute the conditional survival probability of a single core
Behavior of Single Processor Core
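States of cores
● Spare mode – cold standby
● Active mode
• Processing state
• Wait state – warm standby
The lifetime distributions in the different states have the same shape but different scale parameters
[Figure: core state diagram with Spare (Cold Standby), and Active comprising Wait (Warm Standby) and Processing]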
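To illustrate what "same shape but different scale parameter" means, the sketch below uses a Weibull reliability function with one shape value and two scale values. The Weibull choice, the numeric values, and the state names in the dictionary are assumptions for illustration only.

```python
import math


def weibull_reliability(t, scale, shape):
    """Weibull reliability R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


SHAPE = 2.0  # shared by both states (assumed value)
SCALE = {"processing": 1.0, "wait": 4.0}  # assumed: a waiting core ages more slowly

for t in (0.5, 1.0, 2.0):
    row = {state: round(weibull_reliability(t, s, SHAPE), 4) for state, s in SCALE.items()}
    print(t, row)
```

Because only the scale differs, time spent in one state can be converted into an equivalent time in the other by the ratio of the two scale parameters: here 2.0 time units in the wait state age the core as much as 0.5 time units of processing.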
Lifetime Reliability of A Single Core – Gracefully Degrading System
Define a core's accumulated time in a certain state at time t as how long the core has spent in that state up to time t
Calculation
Lifetime Reliability of A Single Core – Gracefully Degrading System
Theorem 1 Suppose a manycore system with a gracefully degrading scheme has experienced a sequence of core failures, ordered by occurrence time. For any core that has survived until time t, the theorem gives
● its accumulated time in the processing state up to time t
● its accumulated time as warm standby up to time t
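The formulas of Theorem 1 are not reproduced here, so the sketch below shows one plausible way to obtain the two accumulated times, under a stated assumption: between consecutive failures the utilization of a surviving core is constant, and the core spends that fraction of the interval in the processing state and the rest in the wait state. The function name and arguments are hypothetical.

```python
def accumulated_times(failure_times, utilizations, t):
    """Return (processing_time, wait_time) accumulated by a surviving core up to t.

    failure_times: sorted occurrence times of past core failures, all <= t.
    utilizations:  utilizations[j] is the assumed per-core utilization while
                   exactly j failures have occurred, so it needs one more
                   entry than failure_times.
    """
    if len(utilizations) != len(failure_times) + 1:
        raise ValueError("need one utilization per interval between failures")
    boundaries = [0.0, *failure_times, t]
    # Weight the length of each interval by the utilization that held during it.
    processing = sum(
        rho * (end - start)
        for rho, start, end in zip(utilizations, boundaries, boundaries[1:])
    )
    return processing, t - processing


# Two failures at t = 1 and t = 3; utilization rises after each one.
print(accumulated_times([1.0, 3.0], [0.5, 0.6, 0.75], t=4.0))
```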
Lifetime Reliability of A Single Core – Gracefully Degrading System
Recall that the reliability functions in the wait and processing states have the same shape but different scale parameters
● General reliability function
● Reliability function in the processing state
● Reliability function in the wait state
● Relationships between them
Lifetime Reliability of A Single Core – Gracefully Degrading System
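Consider a subdivision of the time axis into alternating wait and processing intervals (wait, processing, wait, and so on)
By the continuity of the reliability function, the core's reliability depends on these intervals only through
● the accumulated time in the processing state
● the accumulated time in the wait state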
Lifetime Reliability of A Single Core – Gracefully Degrading System
Theorem 2 Given a gracefully degrading manycore system that has experienced a sequence of core failures at known times, the theorem gives the probability that a certain core survives at a given time, provided that it has survived until an earlier time
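The closed form of Theorem 2 is not shown here. As a hedged illustration of the idea, the sketch below assumes a cumulative-exposure model: time in each state is divided by that state's scale parameter to obtain an effective age, and the conditional survival probability is the ratio of a unit-scale Weibull reliability at the two effective ages. The Weibull baseline and all function and parameter names are assumptions, not the theorem's own notation.

```python
import math


def effective_age(processing_time, wait_time, scale_processing, scale_wait):
    """Scale-normalized age accumulated in the processing and wait states."""
    return processing_time / scale_processing + wait_time / scale_wait


def conditional_survival(age_now, age_then, shape):
    """P(survive to age_now | survived to age_then) for a unit-scale Weibull.

    Equals R(age_now) / R(age_then) with R(a) = exp(-a ** shape).
    """
    if age_now < age_then:
        raise ValueError("age_now must not be smaller than age_then")
    return math.exp(age_then**shape - age_now**shape)


# Effective ages at two time points, then the conditional survival between them.
then = effective_age(1.0, 1.0, scale_processing=1.0, scale_wait=4.0)
now = effective_age(2.45, 1.55, scale_processing=1.0, scale_wait=4.0)
print(conditional_survival(now, then, shape=2.0))
```

With shape 1 the baseline is exponential and the result depends only on the difference of the two ages, which is the memoryless special case.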
Lifetime Reliability of Entire System – Standby Redundant System
A standby redundant system is functioning if it contains at least k good cores, among which k are configured as active ones and the remaining are spares
To determine the state probabilities, the key point is again to compute the conditional survival probability of a single core
Lifetime Reliability of A Single Core – Standby Redundant System
Define a core’s birth time as the time point when it is configured as an active one
Theorem 3 In a standby redundant manycore system, for any core with a given birth time that has survived until time t, the theorem gives
● its accumulated time in the processing state up to time t
● its accumulated time as warm standby up to time t
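The expressions of Theorem 3 are not reproduced here. A minimal sketch, assuming that the number of active cores stays constant while spares remain (so per-core utilization is a single constant) and that a core accumulates no processing or wait time before its birth time, could look as follows; the function name and arguments are hypothetical.

```python
def accumulated_times_standby(birth_time, t, utilization):
    """Return (processing_time, wait_time) for a core activated at birth_time.

    Assumes a constant per-core utilization from birth_time to t and no
    accumulation while the core was still a cold-standby spare.
    """
    if t < birth_time:
        raise ValueError("the core is not yet active at time t")
    active_duration = t - birth_time
    processing = utilization * active_duration
    return processing, active_duration - processing


# A spare activated at time 1.0, observed at time 5.0, busy 25% of the time.
print(accumulated_times_standby(birth_time=1.0, t=5.0, utilization=0.25))
```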
Lifetime Reliability of A Single Core – Standby Redundant System
Theorem 4 In a manycore system with a standby redundant scheme, the theorem gives the probability that a certain core with a given birth time survives at time t
Experimental Setup
Lifetime distributions
● Exponential
● Weibull
● Linear failure rate
System parameters
Consider a manycore system consisting of cores
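The three lifetime distributions differ in how the failure (hazard) rate evolves with age, which is what makes the exponential case special. The sketch below writes out the textbook hazard functions; the parameter names and example values are illustrative and are not the settings used in the experiments.

```python
def exponential_hazard(t, rate):
    """Constant hazard: the core does not age."""
    return rate


def weibull_hazard(t, scale, shape):
    """Hazard (shape / scale) * (t / scale) ** (shape - 1); increasing when shape > 1."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def linear_failure_rate_hazard(t, a, b):
    """Hazard a + b * t; grows linearly with age when b > 0."""
    return a + b * t


for t in (0.5, 1.0, 2.0):
    print(
        t,
        exponential_hazard(t, rate=2.0),
        weibull_hazard(t, scale=1.0, shape=2.0),
        linear_failure_rate_hazard(t, a=1.0, b=0.5),
    )
```

Setting the Weibull shape to 1, or the linear coefficient b to 0, recovers the constant-hazard exponential case.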
Misleading Results Caused by the Exponential Assumption
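Sojourn time (years) in each failure state, and the expected lifetime of the system:

| Redundancy | Scheme | 0-Failure State | 1-Failure State | 2-Failure State | 3-Failure State | 4-Failure State | Expected Lifetime |
|---|---|---|---|---|---|---|---|
| 0 | - | 0.2188 | - | - | - | - | 0.2188 |
| 1 | Degrading | 0.2121 | 0.2188 | - | - | - | 0.4309 |
| 1 | Standby | 0.2188 | 0.2188 | - | - | - | 0.4376 |
| 2 | Degrading | 0.2059 | 0.2121 | 0.2188 | - | - | 0.6368 |
| 2 | Standby | 0.2188 | 0.2188 | 0.2188 | - | - | 0.6564 |
| 3 | Degrading | 0.2000 | 0.2059 | 0.2121 | 0.2188 | - | 0.8368 |
| 3 | Standby | 0.2188 | 0.2188 | 0.2188 | 0.2188 | - | 0.8752 |
| 4 | Degrading | 0.1944 | 0.2000 | 0.2059 | 0.2121 | 0.2188 | 1.0312 |
| 4 | Standby | 0.2188 | 0.2188 | 0.2188 | 0.2188 | 0.2188 | 1.0940 |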
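In the table for the exponential case, the last value in each row equals the sum of that row's sojourn times, that is, the expected lifetime is the total time spent across the failure states the system passes through. The short check below reproduces this from the tabulated values; the dictionary names are just for this sketch.

```python
# Sojourn times (years) per failure state, copied from the exponential-case table,
# keyed by the redundancy value in the first column.
SOJOURN_DEGRADING = {
    0: [0.2188],
    1: [0.2121, 0.2188],
    2: [0.2059, 0.2121, 0.2188],
    3: [0.2000, 0.2059, 0.2121, 0.2188],
    4: [0.1944, 0.2000, 0.2059, 0.2121, 0.2188],
}
REPORTED_DEGRADING = {0: 0.2188, 1: 0.4309, 2: 0.6368, 3: 0.8368, 4: 1.0312}
REPORTED_STANDBY = {0: 0.2188, 1: 0.4376, 2: 0.6564, 3: 0.8752, 4: 1.0940}


def expected_lifetime(sojourn_times):
    """Expected lifetime as the sum of the sojourn times in each failure state."""
    return sum(sojourn_times)


for redundancy, times in SOJOURN_DEGRADING.items():
    standby = expected_lifetime([0.2188] * (redundancy + 1))
    print(redundancy, round(expected_lifetime(times), 4), round(standby, 4))
```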
Lifetime Reliability for Non-Exponential Lifetime Distribution
[Figure: (a) Weibull Distribution; (b) Linear Failure Rate Distribution]
Detailed Results for Gracefully Degrading System
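Sojourn time (years) in each failure state, and the expected lifetime of the system:

| Distribution | Redundancy | 0-Failure State | 1-Failure State | 2-Failure State | 3-Failure State | 4-Failure State | Expected Lifetime |
|---|---|---|---|---|---|---|---|
| Weibull | 0 | 2.2039 | - | - | - | - | 2.2039 |
| Weibull | 1 | 2.2153 | 0.5573 | - | - | - | 2.7726 |
| Weibull | 2 | 2.2260 | 0.5600 | 0.3055 | - | - | 3.0915 |
| Weibull | 3 | 2.2359 | 0.5626 | 0.3142 | 0.1040 | - | 3.2167 |
| Weibull | 4 | 2.2452 | 0.5649 | 0.2988 | 0.0955 | 0.0820 | 3.2864 |
| Linear Failure Rate | 0 | 1.8572 | - | - | - | - | 1.8572 |
| Linear Failure Rate | 1 | 1.8463 | 1.1367 | - | - | - | 2.9830 |
| Linear Failure Rate | 2 | 1.8354 | 1.1325 | 0.8926 | - | - | 3.8605 |
| Linear Failure Rate | 3 | 1.8243 | 1.1282 | 0.8798 | 0.6941 | - | 4.5264 |
| Linear Failure Rate | 4 | 1.8133 | 1.1237 | 0.8762 | 0.7055 | 0.6269 | 5.1456 |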
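One way to read these results against the exponential case is to compare how much each additional redundant core extends the expected lifetime of the gracefully degrading system. The sketch below does this with the tabulated expected lifetimes only; the variable names are assumptions, and the interpretation in the closing note is a reading of the numbers rather than a claim made on the slides.

```python
# Expected lifetimes (years) of the gracefully degrading system for
# redundancy 0..4, copied from the result tables.
EXPECTED_LIFETIME = {
    "Exponential": [0.2188, 0.4309, 0.6368, 0.8368, 1.0312],
    "Weibull": [2.2039, 2.7726, 3.0915, 3.2167, 3.2864],
    "Linear Failure Rate": [1.8572, 2.9830, 3.8605, 4.5264, 5.1456],
}


def marginal_gains(lifetimes):
    """Lifetime added by each extra redundant core."""
    return [later - earlier for earlier, later in zip(lifetimes, lifetimes[1:])]


def improvement_ratio(lifetimes):
    """Lifetime with maximum redundancy relative to no redundancy."""
    return lifetimes[-1] / lifetimes[0]


for name, lifetimes in EXPECTED_LIFETIME.items():
    gains = [round(g, 4) for g in marginal_gains(lifetimes)]
    print(name, gains, round(improvement_ratio(lifetimes), 2))
```

Under the exponential assumption, four redundant cores multiply the expected lifetime by roughly 4.7, whereas the Weibull figures give roughly 1.5 with clearly diminishing gains per added core, which appears to be the sense in which the exponential assumption is misleading.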
The Impact of Workload
Comparison Between Gracefully Degrading System and Standby Redundant System
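| Distribution | Redundancy | Scheme | Hot Standby | Warm Standby | | | | Cold Standby |
|---|---|---|---|---|---|---|---|---|
| Weibull | 2 | Degrading | 1.5039 | 1.8232 | 2.1497 | 2.2930 | 2.4265 | 2.6258 |
| Weibull | 2 | Standby | 1.5314 | 1.8227 | 2.1133 | 2.2488 | 2.3484 | 2.5309 |
| Weibull | 4 | Degrading | 1.5046 | 1.8521 | 2.2305 | 2.4432 | 2.5771 | 2.8376 |
| Weibull | 4 | Standby | 1.5577 | 1.8545 | 2.1715 | 2.3103 | 2.4266 | 2.6261 |
| Linear Failure Rate | 2 | Degrading | 1.9115 | 2.3197 | 2.7070 | 2.8697 | 3.0105 | 3.2424 |
| Linear Failure Rate | 2 | Standby | 1.9608 | 2.3314 | 2.7330 | 2.8851 | 3.0091 | 3.2146 |
| Linear Failure Rate | 4 | Degrading | 2.1348 | 2.7122 | 3.3642 | 3.6529 | 3.9385 | 4.3590 |
| Linear Failure Rate | 4 | Standby | 2.3008 | 2.7899 | 3.4307 | 3.6015 | 3.8588 | 4.1881 |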
Conclusion
State-of-the-art CMOS technology enables chip-level manycore processors
The lifetime reliability of such large circuits is a major concern
We propose a comprehensive analytical model to estimate the lifetime reliability of manycore systems
Some experimental results are shown to demonstrate the effectiveness of the proposed model
Thank You for Your Attention!