Dark Silicon Phenomenon
This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 318693
Gulay Yalcin, Anita Sobe, Alexey Voronin, Jons-Tobias Wamhoff, Derin Harmanci, Adrián Cristal, Osman Unsal, Pascal Felber, Christof Fetzer
PDP2014, Turin, Italy
13 February 2014
Combining Error Detection and Transactional Memory for Energy-Efficient Computing below Safe Operation Margin
Dark Silicon Phenomenon
The number of transistors on a chip can keep increasing. To stay within the chip's power budget, however, some of them must remain "dark" (unpowered).
One solution: downscale the supply voltage (Vdd).
How about Reliability?
When Vdd is reduced, the error rate increases exponentially [1].
[1] Dan Ernst et al. “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation.” In Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture, pages 7–18, 2003
Our goal: investigate how far the voltage can be reduced while error recovery still leads to a net reduction in energy consumption.
Agenda / Overview
Motivation
Experiment: Scaling Vdd in a Real System
Basics of Reliability
Error Recovery with TM
Error Detection Schemes
Analysis
Conclusion
Reducing Vdd in a Real System
AMD FX-6100, 6-core CPU
CPU-heavy execution
Every 10 seconds, reduce Vdd by 12.5 mV
Monitor: incorrect results, system crashes, Machine Check Architecture (MCA)
The system encounters errors that MCA cannot correct after only a 10% reduction in Vdd.
Errors occur in the instruction cache (37%), the execution unit (61%), and other components (less than 2%).
Basics of Reliability
Transactional Memory can provide lightweight Coordinated Local Checkpointing [2].
[2] Gulay Yalcin et al. “FaulTM: Fault Tolerance Using Hardware Transactional Memory.” DATE 2013
TM provides checkpointing/rollback
[Figure: processors P1, P2, P3, P4, ..., Pn, each with its own checkpoint (log area)]
TM write-sets log the tentative memory updates.
Synchronize checkpoints
Data versioning provides a synchronization mechanism between checkpoints.
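The checkpoint/rollback idea above can be sketched in software. This is only an illustration under stated assumptions, not the hardware mechanism from the talk: the `Transaction` class, the `run_protected` helper, and the dictionary standing in for memory are hypothetical names invented for this example. Tentative updates live only in the write-set, so an abort discards them and memory stays at the last checkpoint.

```python
class Transaction:
    """Toy transaction (hypothetical): buffers writes in a write-set until commit."""

    def __init__(self, memory):
        self.memory = memory      # shared state, i.e. the last checkpoint
        self.write_set = {}       # tentative updates, not yet visible

    def read(self, addr):
        # Reads see the transaction's own tentative writes first.
        if addr in self.write_set:
            return self.write_set[addr]
        return self.memory[addr]

    def write(self, addr, value):
        self.write_set[addr] = value

    def commit(self):
        # Publish tentative updates; this state becomes the new checkpoint.
        self.memory.update(self.write_set)
        self.write_set.clear()

    def abort(self):
        # Roll back: drop tentative updates, memory is untouched.
        self.write_set.clear()


def run_protected(memory, body, error_detected, max_retries=3):
    """Run body in a transaction; roll back and retry if an error is detected."""
    for _ in range(max_retries):
        tx = Transaction(memory)
        body(tx)
        if error_detected(tx):
            tx.abort()
            continue
        tx.commit()
        return True
    return False
```

Any of the detection schemes on the following slides could play the role of `error_detected` in this sketch.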
Error Detection Schemes - Replication
Execute instruction streams multiple times
Compare the results of the executions
Fewer comparisons with TM
Dual/Triple Modular Redundancy
+ High error detection rate
- High energy overhead
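A minimal sketch of replication-based detection, assuming the replicated work can be expressed as a deterministic function. The names `run_dmr` and `run_tmr` are hypothetical. Comparing once per call, rather than after every operation, stands in for comparing results only at a transaction boundary.

```python
def run_dmr(compute, inputs):
    """Execute compute twice and compare (dual modular redundancy).

    Returns (result, ok). ok is False when the two executions disagree,
    which signals a detected error; DMR alone cannot tell which copy is wrong.
    """
    first = compute(*inputs)
    second = compute(*inputs)
    return first, first == second


def run_tmr(compute, inputs):
    """Execute three times and take a majority vote (triple modular redundancy)."""
    results = [compute(*inputs) for _ in range(3)]
    for candidate in results:
        if results.count(candidate) >= 2:
            return candidate, True
    return None, False   # all three disagree: detected but not correctable
```

Every input is processed two or three times, which is where the high energy overhead noted above comes from.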
Error Detection Schemes - Assertions/Invariants
Assertions: conditions referring to the current and previous state of the program
Check the state
Can be added manually or automatically
TM facilitates inserting invariants
[Example shown on slide]
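As a hypothetical illustration (not the example from the original slide), the sketch below checks invariants that relate the tentative state to the previous state before committing. The `transfer` function and the account dictionary are invented for this example; the copy of the state plays the role of a transaction's tentative updates.

```python
def transfer(accounts, src, dst, amount):
    """Move amount between accounts, validating invariants before committing."""
    total_before = sum(accounts.values())   # previous state
    tentative = dict(accounts)              # tentative state (work on a copy)
    tentative[src] -= amount
    tentative[dst] += amount

    # Invariants relating the current state to the previous state.
    invariants_hold = (
        sum(tentative.values()) == total_before
        and all(balance >= 0 for balance in tentative.values())
    )
    if not invariants_hold:
        return False                        # discard tentative state (rollback)
    accounts.update(tentative)              # commit
    return True
```

Because the checks run once, just before commit, a violated invariant leaves the committed state untouched.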
Error Detection Schemes - Symptoms
Monitor program execution for symptoms of hardware faults.
Symptoms:
Mispredictions in high-confidence branches
High OS activity
Fatal traps (e.g. undefined instruction code)
Reliability at a low cost
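A rough sketch of symptom-based detection, assuming per-interval counters are available from the hardware or OS. The `IntervalStats` fields and both threshold values are hypothetical placeholders, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass
class IntervalStats:
    """Counters sampled over one monitoring interval (hypothetical fields)."""
    high_conf_mispredictions: int   # mispredicted high-confidence branches
    os_cycles: int                  # cycles spent in the OS
    total_cycles: int
    fatal_trap: bool                # e.g. undefined instruction code


def symptoms(stats, max_mispredictions=2, max_os_fraction=0.5):
    """Return the list of symptoms observed in this interval."""
    found = []
    if stats.fatal_trap:
        found.append("fatal trap")
    if stats.high_conf_mispredictions > max_mispredictions:
        found.append("high-confidence branch mispredictions")
    if stats.total_cycles and stats.os_cycles / stats.total_cycles > max_os_fraction:
        found.append("high OS activity")
    return found
```

The checks only read counters and never re-execute work, which is why this scheme is cheap; the trade-off is that faults producing no visible symptom go undetected.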
Error Detection Schemes - Encoded Processing
Apply software coding (ECC-like) techniques
Redundancy is added by applying arithmetic codes to the values
Arithmetic codes: AN, ANBDmem, etc.
With TM, the validation of a code word can be deferred until a TX commits
[Example shown on slide]
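A minimal sketch of AN encoding, offered as an illustration rather than the example from the original slide. A value x is stored as A * x, so every valid code word is a multiple of A, and addition preserves the code. The constant `A` and all function names here are assumptions made for the example.

```python
A = 58659  # hypothetical odd constant chosen for this sketch


def encode(x):
    return x * A


def is_valid(code_word):
    # A valid code word is a multiple of A. Flipping a single bit changes the
    # word by a power of two, which an odd A > 1 never divides.
    return code_word % A == 0


def decode(code_word):
    if not is_valid(code_word):
        raise ValueError("invalid code word: error detected")
    return code_word // A


def add(c1, c2):
    # A*x + A*y == A*(x + y), so the sum is again a valid code word.
    return c1 + c2


def encoded_sum(code_words):
    """Add encoded values, validating only once at the end (as at TX commit)."""
    total = 0
    for word in code_words:
        total = add(total, word)   # no per-operation check
    return decode(total)           # single validation at commit time
```

`encoded_sum` mirrors the deferred validation mentioned above: intermediate results are never checked, and a corrupted operand surfaces as a `ValueError` at the final check.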
Comparing Error Detection Schemes
Analysis
gem5 full-system simulator
1 GHz in-order cores
4 cores
x86 ISA
64 KB L1 data and instruction caches
Unified 2 MB L2 cache
SPLASH-2 benchmark suite
Energy Analysis
$E \approx C \times V_{dd}^{2}$
Vdd
Error-free Overhead
Recovery Overhead
Fault Injection
TX size
Error Detection Rate
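To make the trade-off concrete, here is a back-of-the-envelope sketch built on E ≈ C × Vdd². It is an assumed toy model, not the model used in the talk: it supposes that a transaction hit by a detected error is simply re-executed, so with per-transaction error probability p the expected number of executions is 1 / (1 - p). The function name and all parameters are hypothetical.

```python
def relative_energy(vdd, vdd_nominal, error_free_overhead, tx_error_prob):
    """Energy relative to unprotected execution at nominal Vdd (toy model).

    error_free_overhead: fractional cost of detection when no error occurs.
    tx_error_prob: assumed probability that a transaction must be re-executed.
    """
    if not 0.0 <= tx_error_prob < 1.0:
        raise ValueError("tx_error_prob must be in [0, 1)")
    voltage_factor = (vdd / vdd_nominal) ** 2            # from E ~ C * Vdd^2
    expected_executions = 1.0 / (1.0 - tx_error_prob)    # recovery overhead
    return voltage_factor * (1.0 + error_free_overhead) * expected_executions
```

A returned value below 1.0 means the voltage reduction outweighs the detection and recovery costs under these assumptions. In this model, larger transactions would raise `tx_error_prob`, and a costlier detection scheme would raise `error_free_overhead`.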
Energy Reduction
Reliability of the System
Conclusion
The energy consumption of CPUs can be reduced if we have efficient hardware support for Transactional Memory and for Error Detection.
Future Work: Combining DMR and Symptoms
Thanks!