COOPERATIVE CROSS-LAYER PROTECTION FOR RESOURCE CONSTRAINED MOBILE MULTIMEDIA SYSTEMS
Kyoungwoo Lee (final defense)
Prof. Nikil Dutt, Prof. Nalini Venkatasubramanian, Prof. Lichun Bao
Nov. 26, 2008
Contents
Thesis Motivation
Thesis Proposal – Cooperative, Cross-layer Methods
PPC (Partially Protected Caches)
EAVE (Error-Aware Video Encoding)
CC-PROTECT (Cooperative, Cross-layer Protection)
Thesis Contribution and Future Direction
Mobile Multimedia Embedded Systems
Web Browsing
Image Browsing
Satellite TV
Video Streaming
Animation
Video Conferencing
Map Routing
Mobile TV
3D Graphics
Resource-limited mobile devices!
The main problem is to achieve low power together with high performance, high QoS, and high reliability
Reliability
Reliability is an emerging and critical concern in mobile devices
New enhanced technology makes devices vulnerable to errors due to high complexity and high integration
Exponential increase of soft error rate as technology scales [Baumann, 05]
Mobile applications are running close to humans
In pervasive computing, failures of healthcare mobile devices cause serious results
Redundancy techniques incur high overheads of power and performance
TMR (Triple Modular Redundancy) may exceed 200% overheads without optimization [Nieuwland, 06]
Challenging to optimize multiple properties (e.g., performance, power, QoS, and reliability) in mobile embedded systems
Soft error is becoming an every-second concern!
Soft Error Rate (SER) – FIT (Failures in Time) = number of errors in 10^9 hours
Configuration | SER (FIT) | MTTF | Reason
1 Mbit @ 0.13 µm | 1000 | 104 years |
64 MB @ 0.13 µm | 64x8x1000 | 81 days | High integration
128 MB @ 65 nm | 2x1000x64x8x1000 | 1 hour | Technology scaling and twice the integration
A system @ 65 nm | 2x2x1000x64x8x1000 | 30 minutes | Memory takes up 50% of soft errors in a system
A system with voltage scaling @ 65 nm | 100x2x2x1000x64x8x1000 | 18 seconds | Exponential relationship between SER and supply voltage
A system with voltage scaling @ flight (35,000 ft) @ 65 nm | 800x100x2x2x1000x64x8x1000 | 0.02 seconds | High intensity of neutron flux in flight (high altitude)
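The FIT-to-MTTF conversion behind the table can be sketched in a few lines of Python. This is a minimal illustration, assuming MTTF = 10^9 hours / FIT; the names `mttf_hours`, `humanize`, and `SCENARIOS` are hypothetical and not from the slides, and the computed values are approximate, so they need not match the rounded figures on the slide exactly.

```python
HOURS_PER_FIT_WINDOW = 1e9  # FIT counts failures per 10^9 device-hours


def mttf_hours(fit: float) -> float:
    """Mean time to failure in hours, assuming MTTF = 10^9 / FIT."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return HOURS_PER_FIT_WINDOW / fit


def humanize(hours: float) -> str:
    """Render a duration in the largest unit whose value is at least 1."""
    seconds = hours * 3600.0
    units = (("years", 365.25 * 86400), ("days", 86400),
             ("hours", 3600), ("minutes", 60))
    for name, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:.2f} seconds"


# FIT products copied from the table rows (labels shortened for illustration).
SCENARIOS = {
    "64 MB @ 0.13 um": 64 * 8 * 1000,
    "128 MB @ 65 nm": 2 * 1000 * 64 * 8 * 1000,
    "system @ 65 nm": 2 * 2 * 1000 * 64 * 8 * 1000,
    "system + voltage scaling": 100 * 2 * 2 * 1000 * 64 * 8 * 1000,
    "system + voltage scaling in flight": 800 * 100 * 2 * 2 * 1000 * 64 * 8 * 1000,
}

if __name__ == "__main__":
    for label, fit in SCENARIOS.items():
        print(f"{label}: {fit:.3e} FIT -> MTTF {humanize(mttf_hours(fit))}")
```

Each multiplicative factor in the FIT column shortens the MTTF by the same factor, which is why the last row ends up in the sub-second range.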
Errors and Failures in Mobile Embedded Systems
Faults or Errors can cause Failures
[Figure: system layers (Application, Middleware/OS, Hardware, Network) with example error sources: soft error, packet loss, bug, exception]
Errors and Error Control Schemes at Hardware
Failures: Soft Errors, Hard Failures, System Crash
Causes: External Radiations, Thermal Effects, Power Loss, Poor Design, Aging
Metrics: FIT, MTTF, MTBF
Traditional Approaches: Spatial Redundancy (TMR, Duplex, RAID-1, etc.) and Data Redundancy (EDC, ECC, RAID-5, etc.)
•FIT: Failures in Time (10^9 hours)
•MTTF: Mean Time To Failure
•MTBF: Mean Time b/w Failures
•TMR: Triple Modular Redundancy
•EDC: Error Detection Codes
•ECC: Error Correction Codes
•RAID: Redundant Array of Inexpensive Drives
Hardware failures are increasing as technology scales
(e.g.) SER increases by up to 1000 times [Mastipuram, 04]
Redundancy techniques are expensive
(e.g.) ECC-based protection in caches can incur 95% performance penalty [Li, 05]
Errors and Error Control Schemes at Software
Failures: Wrong outputs, Infinite loops, Crash
Causes: Incomplete Specification, Poor software design, Bugs, Unhandled Exception
Metrics: Number of Bugs/Klines, QoS, MTTF, MTBF
Traditional Approaches: Spatial Redundancy (N-version Programming, etc.), Temporal Redundancy (Checkpoints and Backward Recovery, etc.)
•QoS: Quality of Service
Software errors become dominant as system complexity increases
(e.g.) Several bugs per thousand lines of code
Hard to debug, and redundancy techniques are expensive
(e.g.) Backward recovery with checkpoints is inappropriate for real-time applications
Errors and Error Control Schemes in Networks
Failures: Data Losses, Deadline Misses, Node (Link) Failure, System Down
Causes: Network Congestion, Noise/Interference, Malicious Attacks
Metrics: Packet Loss Rate, Deadline Miss Rate, SNR, MTTF, MTBF, MTTR
Traditional Approaches: Resource Reservation, Data Redundancy (CRC, etc.), Temporal Redundancy (Retransmission, etc.), Spatial Redundancy (Replicated Nodes, MIMO, etc.)
•SNR: Signal to Noise Ratio
•MTTR: Mean Time To Recovery
•CRC: Cyclic Redundancy Check
•MIMO: Multiple-In Multiple-Out
Network is unreliable (especially wireless networks)
Joint approaches across OSI layers have been investigated for minimal costs [Vuran, 06][Schaar, 07]
Conventional Approaches
Most redundancy techniques incur overheads in terms of performance, power, area, etc.
Conventional TMR (Triple Modular Redundancy) can incur 200% overheads without optimization.
Backward Recovery with Checkpoints cannot guarantee the completion time of a task.
Recently proposed techniques have focused on reducing cost without losing reliability
However, they still incur overheads
Thesis Problem Statement
Study tradeoffs among system properties
(e.g.) Redundancy incurs energy overheads while DVS increases SER significantly
Examine errors and error control schemes across system abstraction layers
(e.g.) network errors & error-resilient video encoding, soft errors & ECC or EDC, etc.
Maximize reliability with minimal costs of power and performance for mobile embedded systems
Cross-Layer Methods
Cross-layer approaches:
Aim at system-level optimization
Integrate and coordinate techniques across system layers
Classification [Srivastava, 05]
Top-down, Bottom-up, or Both directions
Top-down – PPC, PDVS [GRACE], etc.
Bottom-up – EAVE, etc.
Both directions – CC-PROTECT, etc.
Coupling or Merging layers
Dynamo [Mohapatra], xTune [Kim], etc.
[Figure: classification of cross-layer methods: top-down, bottom-up, coupling, merging]
Cross-Layer Approaches – GRACE
GRACE project @ UIUC [W. Yuan Ph.D. thesis in ’04 and A. F. Harris III, Ph.D. thesis in ’06]
QoS/Power tradeoffs
Primarily OS adaptation for power management in multimedia mobile devices
Network adaptation for power management in multimedia communications
[Figure: GRACE adaptation across the Application, Operating System, and Hardware layers] [GRACE, 05]
Cross-Layer Approaches – DYNAMO & FORGE
DYNAMO middleware for FORGE project @ UCI [S. Mohapatra Ph.D. thesis in ’05 and R. Cornea Ph.D. thesis in ’07]
QoS/Power tradeoffs for mobile embedded systems
Middleware-driven coordination and proxy-based cooperation
1. Content transcoding at the application layer
2. Network traffic shaping at the network layer
3. Backlight (LCD display) setting at the hardware layer
4. NIC shutdown, CPU DVS/DFS at the hardware layer
[Figure: proxy server (NW & MW) and device layers (Application, Middleware/OS, Hardware), annotated with steps 1 to 4]
Cross-Layer Approaches – xTune
xTune framework @ UCI and SRI [M. Kim Ph.D. thesis in ’08]
QoS/Power/Timeliness adaptation for distributed real-time embedded systems
A formal methodology for cross-layer tuning and verifiable timeliness of mobile embedded systems
[Figure: handheld, server, and proxy server, each with Application, Middleware/OS, and Hardware layers]
Thesis Proposed Contribution
Thesis proposes a cross-layer design methodology for mobile multimedia embedded systems with minimal costs
Reliability/QoS/Power/Performance system optimization for mobile multimedia systems
Cooperative, Cross-Layer Protection
PPC, EAVE, & CC-PROTECT
Low-cost reliability
Overview of Thesis Proposals
[Figure: overview of the three proposals across layers. Hardware: unprotected cache and protected cache (ECC, EDC). Application: EAVE, with an error-resilient encoder (e.g., PBPAIR) and an error controller (e.g., frame drop), turning original video into error-aware video for a mobile video application over error-prone networks. MW/OS: monitor and translate SER. Other labels: packet loss, frame drop, error detection, correction, QoS.]
PPC (Partially Protected Caches)
EAVE (Error-Aware Video Encoding)
CC-PROTECT (Cooperative, Cross-layer Protection)
Contents
Thesis Motivation
Thesis Proposal – Cooperative, Cross-layer Methods
PPC (Partially Protected Caches)
EAVE
CC-PROTECT
Thesis Contribution and Future Direction
[Figure: layer stack: Application, Middleware/OS, Hardware, Network]
Conventional Protection for Caches
Conventional Protected Caches
Unaware of fault tolerance at applications
Implement a redundancy technique such as ECC to protect all data for every access
Overkill for multimedia applications
ECC (e.g., a Hamming code) incurs a performance penalty of up to 95%, a power overhead of up to 22%, and an area cost of up to 25%
[Figure: cache fully protected by ECC; labels: high cost, unaware of application]
PPC (Partially Protected Caches)
Observation
Not all data are equally failure critical
Multimedia data vs. control variables
Propose PPC architectures to provide unequal protection for mobile multimedia systems [Lee, CASES06][Lee, TVLSI08]
Unprotected cache and Protected cache at the same level of memory hierarchy
Protected cache is typically smaller, to keep its power and delay the same as or less than those of the Unprotected cache
[Figure: PPC, an unprotected cache and a protected cache side by side above memory]
PPC for Multimedia Applications
Propose selective data protection [Lee, CASES06]
Unequal protection at hardware layer exploiting error-tolerance of multimedia data at application layer
Simple data partitioning for multimedia applications
Multimedia data is failure non-critical
All other data is failure critical
[Figure: PPC (unprotected cache and protected cache above memory), annotated with fault tolerance and power/delay reduction]
PPC for General Applications
DPExplore [Lee, PPCDIPES08]
Explore the partitioning space by exploiting awareness of the vulnerability of each data page
Vulnerable time
Data in the cache is vulnerable during any interval that ends with a read by the CPU or a write-back to memory
Pages causing high vulnerable time are failure critical
Vulnerable time closely estimates failure rate
[Figure: timeline of a cached datum with the events incoming data, read, write, and eviction marked at times t0, t1, t2, t3]
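The vulnerable-time idea can be sketched as a walk over one cache block's event trace. This is a simplified model, not the DPExplore implementation: the event names (`fill`, `read`, `write`, `evict`) and the function `vulnerable_time` are hypothetical, and it assumes an interval counts as vulnerable only if it ends in a CPU read or in the eviction of dirty data (a write-back), since an error that is overwritten or discarded never propagates.

```python
from typing import Iterable, Tuple

Event = Tuple[float, str]  # (timestamp, kind); kind is fill, read, write, or evict


def vulnerable_time(events: Iterable[Event]) -> float:
    """Sum the intervals that end in a read or in a write-back of dirty data."""
    total = 0.0
    last_time = None   # time of the previous event while the block is resident
    dirty = False      # True once the block has been written since the fill
    for time, kind in events:
        if kind == "fill":
            last_time, dirty = time, False
            continue
        if last_time is None:
            raise ValueError("block accessed before it was filled")
        interval = time - last_time
        if kind == "read":
            total += interval          # a corrupted value would reach the CPU
        elif kind == "write":
            dirty = True               # old value overwritten: interval not vulnerable
        elif kind == "evict":
            if dirty:
                total += interval      # a corrupted value would reach memory
            last_time, dirty = None, False
            continue
        else:
            raise ValueError(f"unknown event kind: {kind}")
        last_time = time
    return total


if __name__ == "__main__":
    trace = [(0, "fill"), (10, "read"), (25, "write"), (40, "evict")]
    print(vulnerable_time(trace))
```

In this trace the intervals ending in the read (10 time units) and in the dirty eviction (15 time units) count, while the interval ending in the write does not. Summing such values per page gives one plausible ranking of pages by failure criticality.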
Summary – PPC
Not all data are equally failure critical
Propose a PPC architecture to provide unequal protection
Support unequal protection at the hardware layer by exploiting error-tolerance and vulnerability at the application
Present cost-efficient reliability
Related Publications
[Lee, CASES06] – PPC for multimedia embedded systems
[Lee, PPCDIPES08] – PPC for general applications
[Lee, TVLSI08] – PPC and design space exploration
Under submission
[Lee, TODAES??] – PPC for general applications and instruction caches
[Figure: page partitioning algorithms use the error-tolerance of multimedia data and the vulnerability of data and code to split application data and code into failure non-critical (FNC) and failure critical (FC) parts; FNC and FC are mapped into the unprotected and protected caches of the PPC]
Contents
Thesis Motivation
Thesis Proposal – Cooperative, Cross-layer Methods
PPC
EAVE (Error-Aware Video Encoding)
CC-PROTECT
Thesis Contribution and Future Direction
[Figure: layer stack: Application, Middleware/OS, Network]
Active Error Exploitation – Intentional Frame Drop
[Figure: mobile video application over error-prone networks with packet loss; one side with Enc (CPU) and Tx (WNI), the other with Rx (WNI) and Dec (CPU); frame drop points FDT-1, FDT-2, FDT-3]
•FDT: Frame Drop Type
•Enc: Encoding, Dec: Decoding
•WNI: Wireless Network Interface
Intentional Frame Drop (one way to actively exploit errors) can result in energy reduction for each operation
FDT-1 affects the following components with respect to power, performance, and QoS in mobile video applications
Error-Aware Video Encoding
Propose EE-PBPAIR [Lee, DIPES08]
Intentionally drop frames at video encoding
Reduce the energy consumption for video encoding
Maintain the video quality by exploiting error-resilience of PBPAIR
[Figure: Error-Aware Video Encoder (EAVE), combining an error-resilient encoder (e.g., PBPAIR) with an error controller (e.g., frame dropping); labels: original video, error-resilient video, error-aware video, intentional frame drop, packet loss, error-prone networks]
•EIR: Error Injection Rate
Summary – EAVE
Intentional Frame Drop is one way to exploit errors actively
Propose an error-aware video encoding (EE-PBPAIR)
Present a knob (EIR) to adjust the amount of errors considering the QoS feedback
Maintain the video quality using error-resilience of PBPAIR
Related Publication
[Lee, DIPES08] – EE-PBPAIR
Considering Submission
[Lee, TECS??] – Generalized idea for error-resilient video encodings
•EIR: Error Injection Rate
•PLR: Packet Loss Rate
[Figure: at the application layer, the error controller and the error-resilient video encoder produce error-aware video data; the middleware passes EIR and receives PLR and QoS feedback from the network or decoding side; energy reduction at the hardware (CPU, memory, and WNIC)]
Error Rate = PLR + EIR
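A minimal sketch of the EIR knob, under stated assumptions: the encoder is told an error rate of PLR + EIR, frames are dropped at evenly spaced positions, and EIR is nudged up or down from PSNR feedback. The function names, the even-spacing policy, the step size, and the cap `max_eir` are all hypothetical choices for illustration, not the EE-PBPAIR algorithm itself.

```python
import math
from typing import List


def effective_error_rate(plr: float, eir: float) -> float:
    """Error rate presented to the error-resilient encoder: PLR + EIR, capped at 1."""
    return min(1.0, plr + eir)


def select_dropped_frames(num_frames: int, eir: float) -> List[int]:
    """Pick evenly spaced frame indices so that about eir * num_frames are dropped."""
    dropped = []
    for i in range(num_frames):
        # Rounding guards against floating-point noise such as 0.29 * 100 = 28.999...
        before = math.floor(round(i * eir, 9))
        after = math.floor(round((i + 1) * eir, 9))
        if after > before:
            dropped.append(i)
    return dropped


def adapt_eir(eir: float, measured_psnr: float, target_psnr: float,
              step: float = 0.01, max_eir: float = 0.5) -> float:
    """Lower EIR when quality misses the target, raise it when there is headroom."""
    eir += -step if measured_psnr < target_psnr else step
    return min(max_eir, max(0.0, eir))


if __name__ == "__main__":
    print(select_dropped_frames(20, 0.10))
    print(effective_error_rate(plr=0.10, eir=0.10))
```

Every frame selected for dropping skips encoding and transmission on the sender and reception and decoding on the receiver, which is where the energy saving comes from; the feedback step keeps the quality loss bounded.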
Contents
Thesis Motivation
Thesis Proposal – Cooperative, Cross-layer Methods
PPC
EAVE
CC-PROTECT (Cooperative Cross-layer Protection)
Thesis Contribution and Future Direction
[Figure: layer stack: Application, Middleware/OS, Hardware, Network]
Errors and Error Control Schemes – No Coupling
Different errors and their protection techniques have not been considered jointly
No coupling and no cooperation
Cooperating control schemes in a cross-layer manner can open a new avenue
[Figure: mobile video application over error-prone networks; system layers (Application, Middleware/OS, Hardware, Network) with soft errors at the hardware and packet loss in the network]
PPC still incurs overheads due to ECC-protection
[Figure: PPC, an unprotected cache and a protected cache side by side above memory]
Propose PPC architectures to provide unequal protection for mobile multimedia systems [Lee, TVLSI08]
Unprotected cache and Protected cache at the same level of memory hierarchy
PPC still incurs overheads due to the expensive ECC protection at the protected cache
29% energy reduction compared to the protected cache
10% energy overhead compared to the unprotected cache
PBPAIR is energy-inefficient in error-free network
PBPAIR is error-resilient and energy-efficient in general
PBPAIR may not be energy-efficient in an error-free network
[Figure: PBPAIR takes the network PLR (packet loss) and Intra_Threshold as inputs]
•PBPAIR: Probability-Based Power Aware Intra Refresh [Kim, 06]
Outline of CC-PROTECT
[Figure: outline of CC-PROTECT while encoding frame K and frame K+1. Hardware: PPC (unprotected cache and protected cache) with EDC. Application: Error-Aware Video Encoder (EAVE), with an error-resilient encoder (e.g., PBPAIR) and an error controller (e.g., frame drop), turning original video into error-aware video for a mobile video application over error-prone networks. MW/OS: monitor and translate SER, trigger selective DFR, support EAVE and PPC through parameters and feedback. Other labels: DFR (Drop & Forward Recovery), BER (Backward Error Recovery), packet loss, frame drop, error detection, QoS loss.]
Energy Saving
BASE = Error-prone video encoding + unprotected cache
HW-PROTECT = Error-prone video encoding + PPC with ECC
APP-PROTECT = Error-resilient video encoding + unprotected cache
MULTI-PROTECT = Error-resilient video encoding + PPC with ECC
CC-PROTECT1 = Error-prone video encoding + PPC with EDC
CC-PROTECT2 = Error-prone video encoding + PPC with EDC + DFR
CC-PROTECT = Error-resilient video encoding + PPC with EDC + DFR
EDC impact
17% reduction compared to HW-PROTECT
4% reduction compared to BASE
EDC + DFR impact
36% reduction compared to HW-PROTECT
26% reduction compared to BASE
EDC + DFR + PBPAIR (CC-PROTECT) impact
56% reduction compared to HW-PROTECT
49% reduction compared to BASE
Summary – CC-PROTECT
Propose the CC-PROTECT approach, which coordinates existing schemes across layers to mitigate the impact of soft errors on the failure rate and video quality in mobile video encoding systems
PPC (Partially Protected Caches) with EDC (Error Detection Codes) at hardware layer
DFR (Drop and Forward Recovery) at middleware
PBPAIR (Probability-Based Power Aware Intra Refresh) at application layer
Demonstrate the effectiveness of low-cost (about 50%) reliability (1,000x) at a minimal cost of QoS (less than 1%)
Related Publication
[Lee, ACMMM08] – CC-PROTECT
Considering Submission
[Lee, ACMTOMCCAP??] – Tradeoff space exploration with CC-PROTECT
[Figure: CC-PROTECT across layers: PBPAIR (error resilience) at the application, DFR (error correction) at the middleware/OS, and PPC (unprotected cache and protected cache, labeled ECC and EDC) at the hardware]
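How detection-only hardware protection can pair with drop-and-forward recovery is sketched below. This is a toy model under loud assumptions: even parity stands in for the EDC, a simple sum stands in for frame encoding, and the fault-injection map `flips` is a hypothetical test hook. None of these names or choices come from the CC-PROTECT implementation.

```python
from typing import Dict, List, Optional, Tuple


class SoftErrorDetected(Exception):
    """Raised when the stored parity (the EDC here) does not match the data."""


def parity(word: int) -> int:
    """Even-parity bit of a non-negative integer word."""
    return bin(word).count("1") % 2


def checked_read(word: int, stored_parity: int) -> int:
    """Read a word from the protected cache, checking its parity first."""
    if parity(word) != stored_parity:
        raise SoftErrorDetected
    return word


def encode_stream(frames: List[List[int]],
                  flips: Dict[int, Tuple[int, int]]) -> List[Optional[int]]:
    """Encode frames; flips maps frame index -> (word index, bit) to corrupt."""
    encoded: List[Optional[int]] = []
    for k, frame in enumerate(frames):
        cached = [(w, parity(w)) for w in frame]      # fill the EDC-protected cache
        if k in flips:
            idx, bit = flips[k]
            word, p = cached[idx]
            cached[idx] = (word ^ (1 << bit), p)      # inject a single-bit soft error
        try:
            encoded.append(sum(checked_read(w, p) for w, p in cached))
        except SoftErrorDetected:
            encoded.append(None)   # DFR: drop frame k and move on to frame k+1
    return encoded


if __name__ == "__main__":
    print(encode_stream([[1, 2], [3, 4], [5, 6]], flips={1: (0, 2)}))
```

The dropped frame looks like a lost packet to the decoder, so an error-resilient encoding such as PBPAIR can absorb it; that is the reason detection (EDC) plus DFR can stand in for the costlier correction (ECC) in this scheme.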
Contents
Thesis Motivation
Thesis Proposal – Cooperative, Cross-layer Methods
PPC
EAVE
CC-PROTECT
Thesis Contribution and Future Direction
[Figure: layer stack: Application, Middleware/OS, Hardware, Network]
Overall Thesis Contribution
Cross-layer methodology to design mobile multimedia embedded systems with minimal costs
[Figure: system layers (Application, Middleware/OS, Hardware, Network) with the error sources soft error and packet loss]
1. Effective cross-layer approaches for reliability
2. Low-cost reliability
3. Expanded trade-off space
4. Extended applicability of existing techniques
Effectiveness of Thesis Proposals (Energy Saving)
PPC: 25% energy reduction, as compared to a conventional protected cache with ECC
EAVE: 30% energy reduction, as compared to a conventional video encoding
CC-PROTECT: 56% energy reduction, as compared to a conventional composition of protections
Publications
[Lee, ACMMM08] K. Lee, A. Shrivastava, M. Kim, N. Dutt, and N. Venkatasubramanian, “Mitigating the impact of hardware defects on multimedia applications – A cross-layer approach”, In ACM International Conference on Multimedia, Oct. 2008.
[Lee, TVLSI08] K. Lee, A. Shrivastava, I. Issenin, N. Dutt, and N. Venkatasubramanian, “Partially protected caches to reduce failures due to soft errors in multimedia applications”, In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), 2008, to appear.
[Lee, DIPES08] K. Lee, M. Kim, N. Dutt, and N. Venkatasubramanian, “Error exploiting video encoder to extend energy/QoS tradeoffs for mobile embedded systems”, In 6th IFIP Working Conference on Distributed and Parallel Embedded Systems (DIPES), Sep. 2008.
[Lee, PPCDIPES08] K. Lee, A. Shrivastava, N. Dutt, and N. Venkatasubramanian, “Data partitioning techniques for partially protected caches to reduce soft error induced failures”, In 6th IFIP Working Conference on Distributed and Parallel Embedded Systems (DIPES), Sep. 2008.
[Lee, CASES06] K. Lee, A. Shrivastava, I. Issenin, N. Dutt, and N. Venkatasubramanian, “Mitigating soft error failures for multimedia applications by selective data protection”, In Int. Conference on Compilers, Architecture, & Synthesis for Embedded Systems (CASES), Oct. 2006.
[Lee, ICME05] K. Lee, N. Dutt, and N. Venkatasubramanian, “Experimental Study on Energy Consumption of Video Encryption for Mobile Handheld Devices”, In IEEE International Conference on Multimedia and Expo (ICME 05), Poster Session, July 2005.
[Mohapatra, IPDPS05] S. Mohapatra, R. Cornea, H. Oh, K. Lee, M. Kim, N. Dutt, R. Gupta, A. Nicolau, S. Shukla, and N. Venkatasubramanian, “A cross-layer approach for power-performance optimization in distributed mobile systems”, In Next Generation Software Program in conjunction with IEEE International Parallel and Distributed Processing Symposium (IPDPS), April 2005.
[Figure: publications mapped onto the system layers (Application, Middleware/OS, Hardware, Network), in three groups: [Lee, TVLSI08], [Lee, PPCDIPES08], [Lee, CASES06]; [Lee, DIPES08]; [Lee, ACMMM08], [Mohapatra, IPDPS05], [Lee, ICME05]]
Future Direction
Error Rate Translation/Integration
Different types of errors
Different components across system layers
Cross-layer methods for distributed embedded systems (Horizontal Expansion)
Network-aware methods
Context-aware approaches
[Figure: mobile video application over error-prone networks; system layers (Application, Middleware/OS, Hardware, Network) with the error sources soft error, packet loss, bug, exception]
Thank you! Any Questions or Comments?
Backup Slides
Why Cross-Layer Approach?
Cross-layer interactions and conflicts arise between system properties
DVS increases SER exponentially
Over protection or under protection
ECC for all multimedia data is overkill
Cross-layer approaches can maximize the reliability with minimal power and performance overheads
Benefits of Cross-layer approaches
Global system view
Coordination for intelligent selection
Adaptation
Cross-layer approaches have shown promise in saving resources at the cost of QoS [Mohapatra, 05][Yuan, 04]
•DVS: Dynamic Voltage Scaling
•SER: Soft Error Rate
•ECC: Error Correction Codes
•QoS: Quality of Service
Thesis Proposed Contribution: CC-PROTECT
Cooperative Cross-layer Protection (CC-PROTECT) by exploiting error-awareness and error control schemes across system abstraction layers
Contribution
Present cost-efficient reliability methods (cooperative cross-layer protection)
Open expanded tradeoff spaces and operating points
Rediscover applicability of existing approaches for other purposes
Performance vs. Capacity
Total energy available from a battery is a design issue and is fixed at design time, along with its weight and size
Stark contrast between the linear growth rate of battery capacity and the exponential improvement rate of system component technology
[Udani] Sanjay Udani and Jonathan Smith, “Power management in mobile computing”
Generalized Fault Tolerance Techniques
1) Modular Redundancy
2) N-Version Programming
3) Error-Control Coding
4) Checkpoints and Rollbacks
5) Recovery Blocks
[Chetan, SPC04] S. Chetan, A. Ranganathan, and R. Campbell, “Towards Fault Tolerant Pervasive Computing”, in SPC ’04
[Somani, IEEECom97] A. K. Somani and N. H. Vaidya, “Understanding Fault Tolerance and Reliability”, in IEEE Computer ’97 vol. 30 issue 4
1) Modular Redundancy
Modular Redundancy
Multiple identical replicas of hardware modules
Voter mechanism
Compare outputs and select the correct output
Tolerate most hardware faults
Effective but expensive
[Figure: Producer A (with a fault) and Producer B feed a voter, which passes data to the consumer]
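The voter mechanism can be sketched as a strict-majority vote over replica outputs. This is an illustrative software model only (a hardware voter would compare signals in logic); the function name `majority_vote` is hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the replicas."""
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no majority: too many replicas disagree")
    return value


if __name__ == "__main__":
    # Three replicas (TMR); the third one is hit by a fault.
    print(majority_vote([42, 42, 7]))
```

With three replicas, one faulty output is outvoted, at the cost of running every computation three times. The same voter applies to N-version programming, where the replicas are independently written programs instead of identical hardware modules.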
2) N-version Programming
N-version Programming
Different versions by different teams
Different versions may not contain the same bugs
Voter mechanism
Tolerate some software bugs
[Figure: Program i (Programmer K) and Program j (Programmer L), one of them with a fault, feed a voter that passes data from Producer A to the consumer]
3) Error-Control Coding
Error-Control Coding
Replication is effective but expensive
Error-Detection Coding and Error-Correction Coding
(example) Parity Bit, Hamming Code, CRC
Much less redundancy than replication
[Figure: data from Producer A passes through an error-control stage before reaching the consumer; a fault corrupts the data in between]
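To make the detection versus correction distinction concrete, here is a small sketch of a parity bit (detects a single-bit error) and a Hamming(7,4) code (corrects a single-bit error). The bit layout follows the common textbook convention with parity bits at positions 1, 2, and 4; the function names are hypothetical.

```python
from typing import List, Tuple


def parity_bit(bits: List[int]) -> int:
    """Even-parity bit: detects any single-bit error but cannot locate it."""
    return sum(bits) % 2


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into 7 bits laid out as p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code: List[int]) -> Tuple[List[int], int]:
    """Correct up to one flipped bit; return (data bits, error position or 0)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1                       # soft error: flip one stored bit
    print(hamming74_decode(word))
```

Three check bits protect four data bits here, compared with the two extra full copies that triple replication needs, which illustrates the "much less redundancy" point above.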
4) Checkpoints & Rollbacks
Checkpoints and Rollbacks
Checkpoint
A copy of an application’s state
Save it in storage immune to the failures
Rollback
Restart the execution from a previously saved checkpoint
Recover from transient and permanent hardware and software failures
[Figure: an application between Producer A and the consumer moves from state (K-1) to state K; a checkpoint is taken, a fault occurs, and a rollback restores state K]
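A minimal sketch of checkpoint and rollback, assuming the application state is a plain Python object that can be deep-copied and that faults surface as `RuntimeError`. The class `CheckpointedTask`, the helper `run_with_rollback`, and the retry limit are hypothetical; in a real system the checkpoint would live in storage that is immune to the failure.

```python
import copy
from typing import Any, Callable, Iterable


class CheckpointedTask:
    """Holds application state plus one saved copy to roll back to."""

    def __init__(self, state: Any) -> None:
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self) -> None:
        self._saved = copy.deepcopy(self.state)

    def rollback(self) -> None:
        self.state = copy.deepcopy(self._saved)


def run_with_rollback(task: CheckpointedTask,
                      steps: Iterable[Callable[[Any], None]],
                      max_retries: int = 3) -> Any:
    """Run each step; on a fault, restore the checkpoint and re-execute the step."""
    for step in steps:
        task.checkpoint()
        for _ in range(max_retries + 1):
            try:
                step(task.state)
                break
            except RuntimeError:
                task.rollback()      # discard the partially updated state
        else:
            raise RuntimeError("step kept failing after rollback")
    return task.state


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky_increment(state: dict) -> None:
        state["x"] += 1
        attempts["n"] += 1
        if attempts["n"] == 1:
            raise RuntimeError("transient fault")

    print(run_with_rollback(CheckpointedTask({"x": 0}), [flaky_increment, flaky_increment]))
```

The retry loop also shows why backward recovery is awkward for real-time work: every rollback re-executes the step, so the completion time depends on how many faults occur.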
5) Recovery Blocks
Recovery Blocks
Multiple alternates to perform the same functionality
One Primary module and Secondary modules
Different approaches
1) Select a module with output satisfying acceptance test
2) Recovery Blocks and Rollbacks
Restart the execution from a previously saved checkpoint with secondary module
Tolerate software failures
[Figure: an application between Producer A and the consumer, built from Block X, Block Y, Block Z, with alternate Block X2; a checkpoint between state (K-1) and state K, a fault, and a rollback]
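The first approach (pick the first alternate whose output passes the acceptance test) can be sketched as follows. The helper `recovery_block` and the example modules are hypothetical; the inputs are passed by value here, so no explicit rollback is needed before trying the next alternate.

```python
from typing import Any, Callable, Sequence


def recovery_block(alternates: Sequence[Callable[..., Any]],
                   acceptance_test: Callable[[Any], bool],
                   *args: Any) -> Any:
    """Try the primary module first, then the secondaries, until one is accepted."""
    for module in alternates:
        try:
            result = module(*args)
        except Exception:
            continue                  # a crash counts as a failed alternate
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


if __name__ == "__main__":
    def buggy_sort(values):           # primary module with a software bug
        return list(values)

    def safe_sort(values):            # secondary module
        return sorted(values)

    def is_sorted(result):
        return all(a <= b for a, b in zip(result, result[1:]))

    print(recovery_block([buggy_sort, safe_sort], is_sorted, [3, 1, 2]))
```

Unlike the voter in modular redundancy, only one alternate runs at a time, so the extra cost is paid only when the primary fails the acceptance test.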
Soft Errors (Transient Faults)
SER increases exponentially as technology scales
Integration, voltage scaling, altitude, latitude
Caches are hit hardest due to:
Larger portion in processors (more than 50%)
No masking effects (e.g., logical masking)
[Figure: a bit flip (0 to 1) in a transistor; an Intel Itanium II processor annotated with 5 hours MTTF and 1 month MTTF] [Baumann, 05]
•MTTF: Mean Time To Failure
Related Work
Process Technology Solutions
Hardening [Baze, IEEE Trans. on Nuclear Science 00]
SOI [O. Musseau, IEEE Trans. on Nuclear Science 96]
Process complexity, yield loss, and substrate cost
Microarchitectural Solutions for Caches
Cache Scrubbing [Mukherjee, PRDC04]
Low Power Cache [Li, ISLPED04]
Area Efficient Protection [Kim, DATE06]
Multiple Bit Correction [Neuberger, TODAES 03]
Cache Size Selection [Cai, ASP-DAC06]
In-Cache Replication [Zhang, DSN03]
Replication Cache [Zhang, IEEE Computers 05]
High overheads in terms of power, performance, and area
Our Solution
- Protects caches from failures due to soft errors by exploiting the error-tolerance of applications
- Protection can be used in conjunction with any of these techniques
Unequal Data Protection
Not all pages are equally failure critical
Multimedia data is failure non-critical
Program variables are failure critical
Failures: system crash, infinite loop, segmentation faults, etc.
QoS degradation is not a failure
Only 9 pages out of 83 are failure critical
Failure Critical and Failure Non-Critical Data
Soft Errors on Increase
Increase exponentially due to technology scaling
0.18 µm: 1,000 FIT per Mbit of SRAM
0.13 µm: 10,000 to 100,000 FIT per Mbit of SRAM
Voltage Scaling
Voltage scaling increases SER significantly
$$\mathrm{SER} \propto N_{\mathrm{flux}} \times CS \times \exp\left(-\frac{Q_{\mathrm{critical}}}{Q_s}\right)$$
where $Q_{\mathrm{critical}} = C \times V$
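The exponential dependence on supply voltage follows directly from the relation above: with the flux, CS, C, and Q_s held fixed, the ratio of SER at two voltages is exp(C (V_nominal - V_scaled) / Q_s). The sketch below evaluates that ratio; the function name and all parameter values are hypothetical placeholders for illustration, not measured device data.

```python
import math


def relative_ser(v_nominal: float, v_scaled: float,
                 capacitance: float, q_s: float) -> float:
    """SER(v_scaled) / SER(v_nominal), assuming SER ~ exp(-C * V / Q_s).

    Flux, CS, capacitance, and Q_s are assumed identical at both voltages,
    so they cancel out of the ratio.
    """
    if q_s <= 0:
        raise ValueError("Q_s must be positive")
    return math.exp(capacitance * (v_nominal - v_scaled) / q_s)


if __name__ == "__main__":
    # Hypothetical values: C in fF, V in volts, Q_s in fC.
    for v in (1.0, 0.9, 0.8, 0.7):
        ratio = relative_ser(1.0, v, capacitance=2.0, q_s=0.5)
        print(f"V = {v:.1f} V -> SER x {ratio:.2f}")
```

Because the voltage sits in the exponent, each equal step down in supply voltage multiplies SER by the same factor, which is the conflict between DVS and reliability that the thesis addresses.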
Experimental Setup for Page Failure Rates
Experimental Framework
Experimental Results – Failure Rate
Failure rate of PPC is close to that of Safe (Safe is a cache configuration that protects all data with ECC; Unsafe is an unprotected cache)
Experimental Results – Performance
Runtime of PPC is close to that of Unsafe
Experimental Results – Power
Energy consumption of PPC is close to that of Unsafe
Experimental Setup for DPExplore
DPExplore Results
Video Encoding
Error-Resilient Video Encoding
Error-resilient video encodings have been developed to combat errors in networks
PBPAIR – energy-efficient and error-resilient video encoding [Kim, 06]
Passive Error Exploitation
It compresses video data according to PLR
[Figure: mobile video application over error-prone networks with packet loss; the encoder takes network PLR and resilience parameters, embeds error-resilience against packet losses, and maintains the QoS]
•PBPAIR: Probability-Based Power Aware Intra Refresh
Related Work
Energy/QoS-aware video encoding
Video encoding parameters [Mohapatra, IPDPS05]
Motion estimation algorithm [Tourapis, VCIP00]
Integrated power management [Mohapatra, ACM MM03]
Global cross-layer adaptation [Yuan, MMCN04]
Transmission power and QoS [Eisenberg, IEEE Trans. on CSVT 02]
Do not consider error-resilience
Error-resilient video encoding
Error-resilient GOP [Yang, JVCIP07]
AIR (Adaptive Intra Refreshing) [Worrall, ICASSP01]
PGOP (Progressive GOP) [Cheng, PCS04]
PBPAIR (Probability-Based Power Aware Intra Refresh) [Kim, MCCR06]
Passive error exploitation
Our Solution
- Error-aware video encoding: exploits errors actively to minimize energy consumption
EE-PBPAIR
Experimental Setup
Experimental Results – Energy Reduction
Energy saving occurs at every component in a path from encoding to decoding in mobile video applications
EC = Energy Consumption
Enc EC = EC for Encoding
Tx EC = EC for Transmission
Dec EC = EC for Decoding
Rx EC = EC for Receiving
•PSNR: Peak Signal to Noise Ratio
PLR = 10% and EIR = 10%
Experimental Results – Expanded Tradeoff Space
Experimental Energy Saving
•Source EC = Enc EC + Tx EC
•Destination EC = Rx EC + Dec EC
Experimental Results – Adaptive EIR
Feedback-based approach (Adaptive EE-PBPAIR) maintains the required video quality compared to Static EE-PBPAIR
Adaptive EIR
Conclusion
Studied two main cross-layer approaches
PPC
EAVE
Demonstrated the effectiveness of our cooperative cross-layer approaches by exploiting error tolerance and error control schemes
[Figure: interaction between the approaches; labels: network, EIR, FLR, resilience, PLR feedback, tolerance, unequal protection]
Failure Rate
Video Quality
Memory Access Time (performance)
Future Direction
Cooperative approaches combining PPC and EAVE
Middleware-driven cross-layer approach manages error control schemes
Translate errors to exploit existing approaches at other abstraction layers
PPC
Apply our approach to other components
Instruction caches and logic
EAVE
Intelligent frame dropping techniques
To maximize the energy saving while minimizing the quality degradation
[Figure: interaction between the approaches; labels: EIR, FLR, resilience, PLR feedback, tolerance, unequal protection, SER]
Thesis Outline
Thesis proposes a cross-layer method
Exploit errors and error control schemes across layers to maximize reliability with minimal costs for mobile embedded systems
Topic 1 – Approach at hardware and application layers
PPC (unequal data protection at hardware exploiting error tolerance at application) [Lee, CASES06][Lee, DIPES08][Lee, TVLSI08]
Topic 2 – Approach at application, middleware, and network layers
EAVE (intentional exploitation of errors at application, incorporating error resilience in networks) [Lee, DIPES08]
Topic 3 – Approach across application/middleware-OS/HW
CC-PROTECT (middleware-driven cooperative exploitation of errors and error control schemes across layers) [Lee, ACM MM 08]
[Figure: layer stack: Application, Middleware/OS, Hardware, Network]
References (cross-layers and tools)
[Bajic, 07] I. V. Bajic. Efficient cross-layer error control for wireless video multicast. 53(1):276–285, Mar 2007.
[Dynamo] DYNAMO. Power Aware Middleware for Distributed Mobile Computing. University of California at Irvine, http://dynamo.ics.uci.edu/.
[Forge] FORGE Project. A Framework for Optimization of Distributed Embedded Systems Software. University of California at Irvine, http://www.ics.uci.edu/~forge/.
[Grace] GRACE Project. Global Resource Adaptation through CoopEration. University of Illinois at Urbana-Champaign, http://rsim.cs.uiuc.edu/grace/.
[Kim, 08] M. Kim, N. Dutt, N. Venkatasubramanian, and C. Talcott. xTune: Online verifiable cross-layer adaptation for distributed real-time embedded systems. ACM SIGBED Review: Special Issue on the RTSS Forum on Deeply Embedded Real-Time Computing, 5(1), Jan 2008.
[Mohapatra, 03] S. Mohapatra, R. Cornea, N. Dutt, A. Nicolau, and N. Venkatasubramanian. Integrated power management for video streaming to mobile handheld devices. In ACM international conference on Multimedia, 2003.
[Mohapatra, 05] S. Mohapatra, R. Cornea, H. Oh, K. Lee, M. Kim, N. Dutt, R. Gupta, A. Nicolau, S. Shukla, and N. Venkatasubramanian. A cross-layer approach for power-performance optimization in distributed mobile systems. In Next Generation Software Program in conjunction with IPDPS, page218.1, April 2005.
[Shivakumar, 01] P. Shivakumar and N. Jouppi. CACTI 3.0: An Integrated Cache Timing, Power, and Area Model. In WRL Technical Report 2001/2, 2001.
[Synopsys] Synopsys Inc., Mountain View, CA, USA. Design Compiler Reference Manual, 2001.
[Schaar, 07] M. van der Schaar and D. S. Turaga. Cross-layer packetization and retransmission strategies for delay-sensitive wireless multimedia transmission. IEEE Transactions on Multimedia, 9(1):185–197, Jan. 2007.
[Vuran, 06] M. C. Vuran and I. F. Akyildiz. Cross-layer analysis of error control in wireless sensor networks. In IEEE Communications Society on Sensor and Ad Hoc Communications and Networks (SECON), pages 585–594, Sep 2006.
[Yuan, 03] W. Yuan and K. Nahrstedt. Energy-efficient soft real-time CPU scheduling for mobile multimedia systems. 37(5):149–163, Dec 2003.
[Yuan, 04] W. Yuan and K. Nahrstedt. Practical voltage scaling for mobile multimedia devices. In ACM international conference on Multimedia, pages 924–931, 2004.
References (soft errors and reliability)
[Baumann, 05] R. Baumann. Soft errors in advanced computer systems. IEEE Design and Test of Computers, pages 258–266, 2005.
[Hazucha, 00] P. Hazucha and C. Svensson. Impact of CMOS technology scaling on the atmospheric neutron soft error rate. IEEE Trans. on Nuclear Science, 47(6):2586–2594, 2000.
[Li, 05] J.-F. Li and Y.-J. Huang. An error detection and correction scheme for RAMs with partial-write function. In IEEE International Workshop on Memory Technology, Design and Testing (MTDT), pages 115–120, 2005.
[Li, 04] L. Li, V. Degalahal, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin. Soft error and energy consumption interactions: A data cache perspective. In ISLPED, Aug 2004.
[Mastipuram, 04] R. Mastipuram and E. C. Wee. Soft Errors’ Impact on System Reliability. http://www.edn.com/article/CA454636, Sep 2004.
[Phelan, 03] R. Phelan. Addressing soft errors in arm core-based designs. Technical report, ARM, 2003.
[Pradhan, 96] D. K. Pradhan. Fault-Tolerant Computer System Design. Prentice Hall, 1996. ISBN 0-1305-7887-8.
[Shrivastava, 05] A. Shrivastava, I. Issenin, and N. Dutt. Compilation techniques for energy reduction in horizontally partitioned cache architectures. In CASES, pages 90–96, 2005.
[Wrobel, 01] F. Wrobel, J. M. Palau, M. C. Calvet, O. Bersillon, and H. Duarte. Simulation of nucleon-induced nuclear reactions in a simplified SRAM structure: Scaling effects on SEU and MBU cross sections. IEEE Trans. on Nuclear Science, 48(6), 2001.
[Xu, 96] J. Xu and B. Randell. Roll-forward error recovery in embedded real-time systems. In ICPADS, page 414, 1996.
[Nieuwland, 06] A. K. Nieuwland and S. Jasarevic and G. Jerin. Combinational Logic Soft Error Analysis and Protection. In IOLTS06, 2006.
References (error-resilient encoding, etc.)
[Cheng, 04] L. Cheng and M. E. Zarki. PGOP: An error resilient techniques for low bit rate and low latency video communications. In Picture Coding Symposium (PCS), Dec 2004.
[Kim, 06] M. Kim, H. Oh, N. Dutt, A. Nicolau, and N. Venkatasubramanian. PBPAIR: An energy-efficient error-resilient encoding using probability based power aware intra refresh. ACM SIGMOBILE Mobile Computing and Communications Review, 10(3):58–69, July 2006.
[Wang, 98] Y.Wang and Q.-F. Zhu. Error control and concealment for video communication: A review. 86(5):974–997, May 1998.
[Worrall, 01] S. Worrall, A. Sadka, P. Sweeney, and A. Kondoz. Motion adaptive error resilient encoding for MPEG-4. In ICASSP, May 2001.