Intrinsic Heterogeneity in Multicores Due to Process Variation and Core Aging
Josep Torrellas
Department of Computer Science
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
Multiprocessor Architectures for Speculative Multithreading Josep Torrellas, University of Illinois
Supported by NSF, Intel and IBM
-
Motivation
• Chips continue to integrate more and lower-quality transistors
• Process variation: deviation of transistor parameters from nominal specifications
• Impact on multicores:
– Spatial variation: different cores on chip run at different frequencies and consume different power
– Temporal variation: cores age depending on the load
• Result: homogeneous multicores are really heterogeneous
• This talk:
– Understand these effects
– How to exploit/minimize this heterogeneity
Josep TorrellasIntrinsic Heterogeneity in Multicores 2
-
Executive Summary
• Spatial variation: chip has spatial localities
• Temporal variation: aging is exponential on V and T
• To exploit spatial variation in a multicore:
– Variation-aware job scheduling and power management
• To minimize temporal variation:
– Hide and slow down aging with load and voltage changes
• What will happen in the next few years:
– Multicores will become more dynamic (sensors/actuators)
– Homogeneous multicores can be turned heterogeneous
– Designs will become more aggressive (less provision for worst-case use)
-
Roadmap
• Spatial variation
• Exploiting spatial variation
• Temporal variation
• Minimizing temporal variation
• The road ahead
-
Technology Scaling Continues
[Figure: number of transistors and transistor size across processor generations (Pentium 3, Pentium 4, Core 2 Duo, Quad Opteron)]
-
Spatial Variation in Transistor Parameters
[Figure: pdf of switching delay around its nominal value, with the resulting variation in chip frequency and chip static power (no-load power)]
-
Spatial Variation Components
• Die-to-die (D2D)
• Within die (WID)
– Random
– Systematic (spatial correlation)
-
Result: Large Multicores Become Heterogeneous
[Teodorescu ISCA08]
• Large multicores have significant core-to-core variation
• For a 20-core multicore at 32nm (year 2010):
– Static power (no-load power), max/min: 1.9X avg
– Total power, max/min: 1.4X avg
– Frequency, max/min: 1.3X avg
-
Can This Variation be Exploited?
• Traditionally: run all cores at the same frequency (slowest core)
• New multicores: each core can run at a different frequency (and voltage)
– Heterogeneous system
[Figure: 20-core chip floorplan (cores C1 to C20, L2 cache)]
-
Roadmap
• Spatial variation
• Exploiting spatial variation
– Variation-aware job scheduling and power management [Teodorescu ISCA08]
• Temporal variation
• Minimizing temporal variation
• The road ahead
-
Variation-Aware Job Scheduling
Additional information to guide scheduling decisions:
• Per-core frequency and static power
• Application behavior
– Dynamic power consumption
– Compute intensity (IPC)
Multiple possible goals
[Figure: applications mapped onto a 20-core chip (cores C1 to C20, L2 cache)]
-
Examples of Possible Algorithms
• When the goal is to reduce power consumption:
• Assign applications with high dynamic power to low static power cores
• When the goal is to improve throughput:
• Assign high IPC applications to high frequency cores
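The two example policies above can be sketched as a pair of sort-and-pair mappings. This is a minimal illustration under stated assumptions, not the scheduler from [Teodorescu ISCA08]: the `Core` and `App` records, their field names, and all sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Core:
    name: str
    freq_ghz: float        # per-core frequency (hypothetical profiled value)
    static_power_w: float  # per-core no-load power (hypothetical profiled value)


@dataclass
class App:
    name: str
    dynamic_power_w: float  # dynamic power consumption (hypothetical value)
    ipc: float              # compute intensity


def map_for_low_power(apps, cores):
    """Pair the highest dynamic power apps with the lowest static power cores."""
    apps_sorted = sorted(apps, key=lambda a: a.dynamic_power_w, reverse=True)
    cores_sorted = sorted(cores, key=lambda c: c.static_power_w)
    return {a.name: c.name for a, c in zip(apps_sorted, cores_sorted)}


def map_for_throughput(apps, cores):
    """Pair the highest IPC apps with the highest frequency cores."""
    apps_sorted = sorted(apps, key=lambda a: a.ipc, reverse=True)
    cores_sorted = sorted(cores, key=lambda c: c.freq_ghz, reverse=True)
    return {a.name: c.name for a, c in zip(apps_sorted, cores_sorted)}


# Hypothetical three-core, three-application example.
cores = [Core("C1", 4.0, 3.0), Core("C2", 3.4, 1.8), Core("C3", 3.7, 2.4)]
apps = [App("A", 6.0, 0.6), App("B", 9.0, 1.4), App("C", 4.0, 2.0)]

print(map_for_low_power(apps, cores))
print(map_for_throughput(apps, cores))
```

The same applications land on different cores depending on the goal, which is why the scheduler needs both the per-core profile and the per-application behavior.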
-
Variation-Aware Global Power Management
• Per-core Dynamic Voltage and Frequency Scaling (DVFS)
• Challenge: find best (V,f) for each core
• Global (multicore-wide) power management solution is needed
[Figure: 20-core chip (cores C1 to C20, L2 cache) with a separate (V,f) setting per core]
-
DVFS under Variation
[Figure: DVFS operating points of individual cores under variation, labeled by supply voltage from Vdd=1V down to 0.6V (1V, 0.9V, 0.85V, 0.8V, 0.7V, 0.6V)]
-
DVFS under variation (All 20 Cores)
-
Optimization Problem
Given a mapping of threads to cores (variation-aware):
• Find: best (Vi,fi) of each core
• Goal: maximize system throughput
• Constraint: keep total power below budget
[Figure: 20-core chip (cores C1 to C20, L2 cache) with an unknown (V,f) per core; example power budgets of 50W, 75W, and 100W]
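To make the problem statement concrete, the sketch below solves a toy instance by brute force. It assumes each core offers a small table of discrete (V,f) levels with a known power and throughput at each level, and that system throughput is the sum over cores. The table values, the 3-core size, and the name `best_settings` are hypothetical.

```python
from itertools import product


def best_settings(levels, power_budget_w):
    """Try every combination of per-core levels; keep the best one within budget.

    levels[i] is a list of (vdd_volts, freq_ghz, power_w, throughput) tuples for core i.
    """
    best_choice, best_tput = None, float("-inf")
    for choice in product(*levels):
        power = sum(level[2] for level in choice)
        tput = sum(level[3] for level in choice)
        if power <= power_budget_w and tput > best_tput:
            best_choice, best_tput = choice, tput
    return best_choice, best_tput


# Hypothetical per-core tables: variation makes the cores differ at the same Vdd.
levels = [
    [(0.6, 2.0, 2.0, 2.0), (0.8, 3.0, 4.0, 3.0), (1.0, 4.0, 7.0, 4.0)],
    [(0.6, 1.8, 2.5, 1.5), (0.8, 2.7, 5.0, 2.4), (1.0, 3.6, 8.5, 3.0)],
    [(0.6, 1.9, 2.2, 1.9), (0.8, 2.9, 4.5, 2.6), (1.0, 3.8, 7.5, 3.5)],
]

choice, tput = best_settings(levels, power_budget_w=15.0)
print([level[0] for level in choice], round(tput, 2))
```

With L levels per core and N cores this loop visits L**N combinations, so it is only usable as a reference for very small instances.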
-
Possible Solutions
• Exhaustive search: too expensive
• Simulated annealing:
– Not practical at runtime
• Linear programming (LinOpt)
– Simpler, faster
– Requires some approximations
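A sketch of why a linear formulation is attractive. Assume, purely for illustration, that within the allowed voltage range each core's throughput and power are linear in its supply voltage: throughput = a*v and power = b*v + c. These linearizations, the coefficient values, and the name `linear_power_allocation` are assumptions, not the formulation used by LinOpt. With a single budget constraint and per-core voltage bounds, this particular linear program can be solved exactly by raising voltage first on the cores that return the most throughput per extra watt; a more general linear program would need a simplex or interior-point solver.

```python
def linear_power_allocation(cores, power_budget_w, v_min=0.6, v_max=1.0):
    """Choose per-core voltages maximizing sum(a*v) subject to sum(b*v + c) <= budget.

    cores is a list of (a, b, c) tuples with a > 0 and b > 0 (hypothetical coefficients).
    """
    # Start every core at v_min and check that the budget covers that floor.
    volts = [v_min] * len(cores)
    spare = power_budget_w - sum(b * v_min + c for _, b, c in cores)
    if spare < 0:
        raise ValueError("power budget too small even with all cores at v_min")

    # Visit cores in order of throughput gained per extra watt (a / b).
    order = sorted(range(len(cores)),
                   key=lambda i: cores[i][0] / cores[i][1],
                   reverse=True)
    for i in order:
        _, b, _ = cores[i]
        # Raise this core as far as its bound and the remaining budget allow.
        dv = max(0.0, min(v_max - v_min, spare / b))
        volts[i] += dv
        spare -= b * dv
    return volts


# Hypothetical coefficients for three cores and a 30 W budget.
cores = [(4.0, 10.0, 1.0), (3.0, 12.0, 2.0), (5.0, 10.0, 1.5)]
volts = linear_power_allocation(cores, power_budget_w=30.0)
print([round(v, 3) for v in volts])
```

The cost is one sort plus one pass over the cores, which is what makes a linear approximation plausible for periodic use at runtime.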
-
LinOpt Implementation
• LinOpt works together with the OS scheduler
• OS scheduler maps applications to cores
• LinOpt then finds (V,f) settings for each core
• LinOpt runs periodically:
– As a system process on a core, or…
– Power management unit (PMU), e.g., Foxton
[Figure: 20-core chip (cores C1 to C20, L2 cache) with a PMU]
-
LinOpt Implementation
Inputs to LinOpt:
• Post-manufacturing profiling: each core's frequency, static power
• Dynamic profiling: each app's dynamic power, IPC
• Power target
• Goal
Output: (Vi,fi) of each core
[Figure: timeline showing LinOpt runs, a 10ms mark, and the OS scheduling interval]
-
Roadmap
• Spatial variation
• Exploiting spatial variation
• Temporal variation
• Minimizing temporal variation
• The road ahead
-
Temporal Variation: Aging
• Transistor delay gradually increases with time under normal use
• Processor critical paths take longer
[Figure: critical path delay vs. time (years); delay rises from T0 toward T0 + G0 at year Y0, where G0 is the guardband. Annotation: in 7 years, 5-20%??]
-
Aging: Two Key Mechanisms
• PMOS: Negative bias temperature instability (NBTI)
[Figure: ΔVt vs. t (clock cycles), with alternating stress and recovery phases]
• NMOS: Hot Carrier Injection (HCI)
[Figure: ΔVt vs. t (years)]
-
What Affects Aging Rate?
[Tiwari MICRO 08]
Factor            NBTI          HCI
Vdd               exponential   exponential
T                 exponential   linear
f, activity (α)   ---           linear
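The qualitative dependences in the table can be written down as a toy relative-rate model. This sketch encodes only the functional forms listed above (exponential or linear). The sign and size of every constant, the reference operating point, and the function names are hypothetical placeholders, not fitted values from [Tiwari MICRO 08].

```python
import math

# Reference operating point at which both relative rates equal 1 (hypothetical).
V_REF = 1.0    # volts
T_REF = 60.0   # degrees C

# Sensitivity constants (hypothetical placeholders).
K_V = 8.0          # exponential sensitivity to Vdd, per volt
K_T_NBTI = 0.02    # exponential sensitivity of NBTI to T, per degree
K_T_HCI = 0.005    # linear sensitivity of HCI to T, per degree


def nbti_rate(vdd, temp_c):
    """Relative NBTI rate: exponential in Vdd and T, independent of f and activity."""
    return math.exp(K_V * (vdd - V_REF)) * math.exp(K_T_NBTI * (temp_c - T_REF))


def hci_rate(vdd, temp_c, freq_rel, activity_rel):
    """Relative HCI rate: exponential in Vdd, linear in T, linear in f and activity."""
    return (math.exp(K_V * (vdd - V_REF))
            * (1.0 + K_T_HCI * (temp_c - T_REF))
            * freq_rel * activity_rel)


# Compare a lower-voltage, cooler operating point against the reference.
print(nbti_rate(0.9, 50.0), hci_rate(0.9, 50.0, 1.0, 1.0))
```

Because Vdd enters both mechanisms exponentially, a small voltage change moves both rates far more than a comparable relative change in frequency or activity.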
-
Roadmap
• Spatial variation
• Exploiting spatial variation
• Temporal variation
• Minimizing temporal variation
– Aging-driven job scheduling and voltage changes [Tiwari MICRO08]
• The road ahead
-
Proposed Approaches
Factor   NBTI          HCI
Vdd      exponential   exponential
T        exponential   linear
α, f     ---           linear

• Aging-driven job scheduling:
– Temperature, activity (α)
• Voltage changes at key times of processor service life
– Changes Vdd
– High impact on temperature
-
Aging Driven Job Scheduling
• Different policies are possible
• A possible one:
– Send aging-intensive applications to the faster cores
  • High-T, high activity apps
  • Fast aging of cores that have more room to age
– Send less aging-intensive applications to slower cores
  • Low-T, low activity, memory bound apps
  • Slow aging of cores that have less room to age
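One way to sketch the policy above is to rank applications by an aging-intensity score and pair them with cores ranked by frequency. The score used here (temperature multiplied by activity) and all sample values are hypothetical stand-ins; the slides do not say how aging intensity is measured.

```python
def aging_driven_mapping(apps, core_freqs):
    """Send the most aging-intensive apps to the fastest cores.

    apps: {app name: (temperature in degrees C, activity factor)}
    core_freqs: {core name: maximum frequency in GHz}
    """
    # Hypothetical aging-intensity score: hotter and more active means more aging.
    by_intensity = sorted(apps, key=lambda n: apps[n][0] * apps[n][1], reverse=True)
    # Faster cores have more room to age before they limit the chip frequency.
    by_speed = sorted(core_freqs, key=core_freqs.get, reverse=True)
    return dict(zip(by_intensity, by_speed))


apps = {"compute": (85.0, 0.9), "mixed": (70.0, 0.5), "memory_bound": (55.0, 0.2)}
core_freqs = {"C1": 3.4, "C2": 4.0, "C3": 3.7}

mapping = aging_driven_mapping(apps, core_freqs)
print(mapping)
```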
-
Voltage Changes (Vdd)
[Figure: critical path delay increase (%) vs. time (years) up to Y0, showing curves for Vdd- and Vdd+ against the guardband S0; no frequency change]
-
Big Picture
• Vdd-
– Increases delay of logic paths
– Slows down aging rate
• Vdd+
– Reduces delay of logic paths
– Increases aging rate
-
When to Apply These Techniques?
• Aging rate is higher in beginning and lower at end
• Apply Vdd- in beginning
– Slow down aging the most
– Have room to slow down paths
• Apply Vdd+ at end
– Little impact on aging rate anyway
– Reduce logic path delays
[Figure: delay increase vs. time (years), with S0 and Y0 marked]
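A short calculation illustrates the first bullet. It assumes, as is common in NBTI models but not stated on the slide, that the threshold-voltage shift grows as a power law in time, proportional to t**n with n around 1/6. The exponent, the 7-year service life used as the end point, and the function name are assumptions for illustration.

```python
def fraction_of_lifetime_shift(t_years, lifetime_years=7.0, n=1.0 / 6.0):
    """Fraction of the end-of-life shift already accumulated at time t, if shift ~ t**n."""
    return (t_years / lifetime_years) ** n


for t in (0.5, 1.0, 3.5, 7.0):
    share = fraction_of_lifetime_shift(t)
    print(f"{t:>4} years: {share:.0%} of the end-of-life shift")
```

Under this assumption most of the shift accumulates in the first year or so, which is consistent with applying Vdd- at the beginning, when slowing the aging rate pays off the most.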
-
Impact on the Aging Curve
[Figure: delay increase (%) vs. time (years); Vdd- is applied until ttrans and Vdd+ from ttrans to Y0; guardband levels S and S0 are marked]
• Need less guardband (S% rather than S0%)
• Can cycle it at higher frequency from the beginning
-
Roadmap
• Spatial variation
• Exploiting spatial variation
• Temporal variation
• Minimizing temporal variation
• The road ahead
-
The Road Ahead (I)
• Multicores will become more dynamic
– Chips will come with many sensors (power, temp, activity)
– Multiple f domains common (AMD), multiple V domains likely
– Control can be by a HW controller (Itanium Foxton) or in SW
• Homogeneous multicores appear as heterogeneous
– Driven by single-thread performance
– Intel’s Turbo Mode
– Perhaps forms of Core-fusion
-
The Road Ahead (II)
• Designs will become more aggressive
– Use shorter guardbands and target the common usage
– Rely on on-chip aging sensors to seamlessly lower performance
– Example: Turbo Mode
• Inexpensive opportunities for system software to control the heterogeneity of homogeneous platforms
– Variation-aware job scheduling and power management
– Changes in Vdd to slow down aging
-
Can Have Multiple V Changes
[Figure: delay increase (%) vs. time (years) up to Y0, with successive voltage settings V1, V2, V3 applied at transition times ttrans]
• Proposal:
– Manufacturer programs a schedule of V increases with time based on typical usage (no f change)
– On-chip age sensors can modify the schedule
-
Current Results
• Variation-aware job scheduling and power management
– Improves multicore throughput for a given power budget by 12-17%
• Slowing down aging
– Improves frequency of multicore by 12%
– Regains > 50% of performance losses due to aging
• Low hardware overhead
– Change V, job scheduling
-
Example
-
Example
[Figure: four cores (1 to 4) and four applications (A to D), ordered from coolest (A) to hottest (D)]
-
Example
[Figure: four cores (1 to 4) and four applications (A to D)]
-
What Affects Aging Rate?
[Tiwari MICRO 08]
Factor            NBTI          HCI
Vdd - Vt          exponential   exponential
T                 exponential   linear
f, activity (α)   ---           linear

• Change aging rate with
– Adaptive Supply Voltage (ASV):
  • Changes Vdd
  • Exponential impact on T
– Adaptive Body Bias (ABB):
  • Changes Vt
  • Exponential impact on T