Intrinsic Heterogeneity in MulticoresIntrinsic ... · C1 C2 C3 C4 C5 then finds (V,f) settings for...

41
Intrinsic Heterogeneity in Multicores Intrinsic Heterogeneity in Multicores Due to Process Variation and Core Aging Josep Torrellas D t t fC t Si Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Multiprocessor Architectures for Speculative Multithreading Josep Torrellas, University of Illinois Supported by NSF, Intel and IBM

Transcript of Intrinsic Heterogeneity in MulticoresIntrinsic ... · C1 C2 C3 C4 C5 then finds (V,f) settings for...

  • Intrinsic Heterogeneity in MulticoresIntrinsic Heterogeneity in MulticoresDue to Process Variation and Core Aging

    Josep TorrellasD t t f C t S iDepartment of Computer Science

    University of Illinois at Urbana-Champaignhttp://iacoma.cs.uiuc.edu

    Multiprocessor Architectures for Speculative Multithreading Josep Torrellas, University of Illinois

    Supported by NSF, Intel and IBM

  • Motivation

    • Chips continue to integrate more and lower-quality transistors• Process variation: deviation of transistor parameters fromProcess variation: deviation of transistor parameters from

    nominal specifications• Impact on multicores:

    – Spatial variation: different cores on chip run at different frequencies and consume different power

    – Temporal variation: cores age depending on the loadp g p g• Result: homogeneous multicores are really heterogeneous• This talk:

    – Understand these effects– How to exploit/minimize this heterogeneity

    Josep TorrellasIntrinsic Heterogeneity in Multicores 2

  • Executive Summary

    • Spatial variation: chip has spatial localities• Temporal variation: aging is exponential on V and T• To exploit spatial variation in a multicore:

    – Variation-aware job scheduling and power managementVariation aware job scheduling and power management• To minimize temporal variation:

    – Hide and slow down aging with load and voltage changes• What will happen in the next few years:

    – Multicores will become more dynamic (sensors/actuators)Homogeneous multicores can be turned heterogeneous– Homogeneous multicores can be turned heterogeneous

    – Designs will become more aggressive (less provision for worst-case use)

    Josep TorrellasIntrinsic Heterogeneity in Multicores 3

  • Roadmap

    • Spatial variationSpatial variation• Exploiting spatial variation• Temporal variation• Minimizing temporal variation• The road ahead

    Josep TorrellasIntrinsic Heterogeneity in Multicores 4

  • Roadmap

    • Spatial variationSpatial variation• Exploiting spatial variation• Temporal variation• Minimizing temporal variation• The road ahead

    Josep TorrellasIntrinsic Heterogeneity in Multicores 5

  • Technology Scaling Continues

    Pentium 4

    Core 2 Duo

    Quad Opteron

    Pentium 3

    Pentium 4

    number of transistors

    Josep TorrellasIntrinsic Heterogeneity in Multicores 6

    transistorsize

  • Spatial Variation in Transistor Parameters

    Chip Frequencypdf

    Chip Static Power (No-Load Power)

    Switching delayNominal

    Josep TorrellasIntrinsic Heterogeneity in Multicores 7

  • Spatial Variation Components

    die-to-die (D2D)

    within die (WID)

    randomsystematic

    spatialcorrelation

    Josep TorrellasIntrinsic Heterogeneity in Multicores 8

  • Result: Large Multicores Become Heterogeneous

    [Teodorescu ISCA08]

    • Large multicores have significant core-to-core variation• For a 20-core multicore at 32nm (year 2010)

    Static power (no-load power):• (max/min): 1.9X avg

    Total power:• (max/min): 1.4X avg

    Frequency:Frequency: • (max/min) 1.3X avg

    Josep TorrellasIntrinsic Heterogeneity in Multicores 9

  • Can This Variation be Exploited?

    • Traditionally: run all cores at the f ( l t )

    C1 C2 C3 C4 C5

    same frequency (slowest core)

    C1 C2 C3 C4 C5C1 C2 C3 C4 C5C1 C2 C3 C4 C5C1 C2 C3 C4 C5

    C6 C7 C8 C9C9 C10

    L2 Cache

    C6 C7 C8 C9C9 C10

    L2 Cache

    C6 C7 C8 C9 C10

    L2 Cache

    C6 C7 C8 C9C9 C10

    L2 Cache

    C6 C7 C8 C9 C10

    L2 Cache• New multicores: Each core can run at different frequency (and

    C11 C12 C13 C14 C15

    L2 Cache

    C11 C12 C13 C14 C15

    L2 Cache

    C11 C12 C13 C14 C15

    L2 Cache

    C11 C12 C13 C14 C15

    L2 Cache

    C11 C12 C13 C14 C15

    L2 Cache

    voltage)

    • heterogeneous systemC16 C17 C18 C19 C20C16 C17 C18 C19 C20C16 C17 C18 C19 C20C16 C17 C18 C19 C20C16 C17 C18 C19 C20

    Josep TorrellasIntrinsic Heterogeneity in Multicores 10

  • Roadmap

    • Spatial variationSpatial variation• Exploiting spatial variation

    – Variation-aware job scheduling and power management

    • Temporal variation

    Variation aware job scheduling and power management [Teodorescu ISCA08]

    • Temporal variation • Minimizing temporal variation• The road ahead

    Josep TorrellasIntrinsic Heterogeneity in Multicores 11

  • Variation-Aware Job Scheduling

    ApplicationsAdditional information to guide scheduling decisions:

    • Per core freq and static power• Application behavior

    Additional information to guide scheduling decisions:

    C1 C2 C3 C4 C5C1 C2 C3 C4 C5

    • Application behavior– Dynamic power consumption– Compute intensity (IPC)

    C6 C7 C8 C9C9 C10

    C11 C12 C13 C14 C15

    L2 Cache

    C6 C7 C8 C9 C10

    C11 C12 C13 C14 C15

    L2 Cachep y ( )

    Multiple possible goals

    C11 C12 C13 C14 C15

    C16 C17 C18 C19 C20

    L2 Cache

    C11 C12 C13 C14 C15

    C16 C17 C18 C19 C20

    L2 Cache

    Josep TorrellasIntrinsic Heterogeneity in Multicores 12

    C16 C17 C18 C19 C20C16 C17 C18 C19 C20

  • Examples of Possible Algorithms

    • When the goal is to reduce power consumption:

    • Assign applications with high dynamic power to low static power cores

    • When the goal is to improve throughput:

    • Assign high IPC applications to high frequency cores

    Josep TorrellasIntrinsic Heterogeneity in Multicores 13

  • Variation-Aware Global Power Management

    • Per core Dynamic Voltage and Frequency Scaling (DVFS)

    • Challenge: find best (V,f) for each core

    C1 C2 C3 C4 C5C1 C2 C3 C4 C5V,f V,f V,f V,f V,f

    • Global (multicore-wide) power management solution is needed C6 C7 C8 C9C9 C10

    L2 Cache

    C6 C7 C8 C9 C10

    L2 Cache

    V,f V,f V,f V,f V,f

    V f V f V f V f V fC11 C12 C13 C14 C15

    C16 C17 C18 C19 C20

    L2 Cache

    C11 C12 C13 C14 C15

    C16 C17 C18 C19 C20

    L2 Cache

    V,f V,f V,f V,f V,f

    V f V f V f V f V fC16 C17 C18 C19 C20C16 C17 C18 C19 C20V,f V,f V,f V,f V,f

    Josep TorrellasIntrinsic Heterogeneity in Multicores 14

  • DVFS under Variation

    Vdd=1V

    1V

    Vdd=1V

    0.85V

    0.9V

    0.8V0.85V

    0.7V0.6V

    0.6V

    Josep TorrellasIntrinsic Heterogeneity in Multicores 15

  • DVFS under variation (All 20 Cores)

    Josep TorrellasIntrinsic Heterogeneity in Multicores 16

  • Optimization Problem

    V

    best (Vi,fi) of each coreGiven a mapping of threads to cores (variation-aware):V,f V,f V,f V,f V,f

    V f V f V f V f V f

    C1 C2 C3 C4 C5

    C6 C7 C8 C9C9 C10

    L2 Cache

    C1 C2 C3 C4 C5

    C6 C7 C8 C9 C10

    L2 Cache

    FIND!V,f V,f V,f V,f V,f

    V,f V,f V,f V,f V,f

    V f V f V f V f V f

    ?C6 C7 C8 C9C9 C10C11 C12 C13 C14 C15L2 Cache

    C6 C7 C8 C9 C10

    C11 C12 C13 C14 C15

    L2 Cache

    V,f V,f V,f V,f V,f

    • Goal: maximize system throughput 100WC16 C17 C18 C19 C20C16 C17 C18 C19 C20

    • Constraint: keep total power below budget50W

    75W

    Josep TorrellasIntrinsic Heterogeneity in Multicores 17

  • Possible Solutions

    • Exhaustive search: too expensiveExhaustive search: too expensive• Simulated annealing:

    – Not practical at runtime• Linear programming (LinOpt)

    – Simpler, fasterRequires some approximations– Requires some approximations

    Josep TorrellasIntrinsic Heterogeneity in Multicores 18

  • LinOpt Implementation

    • LinOpt works together with the OS schedulerp g• OS scheduler maps applications to cores • LinOpt then finds (V f) settings for each core

    C1 C2 C3 C4 C5

    LinOpt then finds (V,f) settings for each core

    • LinOpt runs periodically:C6 C7 C8 C9C9 C10

    C11 C12 C13 C14 C15

    L2 Cache• As a system process on a core, or… • Power management unit (PMU) - e.g., Foxton

    C16 C17 C18 C19 C20

    L2 CachePMU

    Josep TorrellasIntrinsic Heterogeneity in Multicores 19

  • LinOpt Implementation

    Post-manufacturing profiling Powertarget V f V f V f V f V f

    (Vi,fi) of each core

    Each core: frequency, static power

    Dynamic profilingLinOpt

    g V,f V,f V,f V,f V,f

    V,f V,f V,f V,f V,f

    V,f V,f V,f V,f V,fDynamic profiling

    Each app: dynamic power, IPCGoal

    V,f V,f V,f V,f V,f

    V,f V,f V,f V,f V,f

    LinOpt

    10ms TimeOS scheduling

    interval

    Josep TorrellasIntrinsic Heterogeneity in Multicores 20

    interval

  • Roadmap

    • Spatial variationSpatial variation• Exploiting spatial variation• Temporal variation• Minimizing temporal variation• The road ahead

    Josep TorrellasIntrinsic Heterogeneity in Multicores 21

  • Temporal Variation: Aging

    • Transistor delay gradually increases with time under normal use• Processor critical paths take longerProcessor critical paths take longer

    elay Guardband (G0)

    T0 + G0

    cal Path De

    T0

    Criti In 7 years, 5-20%??

    Josep TorrellasIntrinsic Heterogeneity in Multicores 2222

    Time(Years) Y0

  • Aging: Two Key Mechanisms

    • PMOS: Negative bias temperature instability (NBTI)

    ΔVtRecovery

    Stressy

    • NMOS: Hot Carrier Injection (HCI) t (clock cycles)

    ΔVt

    Josep TorrellasIntrinsic Heterogeneity in Multicores 2323

    t (years)

  • What Affects Aging Rate?

    [Tiwari MICRO 08]

    Factor NBTI HCIVdd exponential exponential

    T exponential linearf, activity (α) --- linear

    Josep TorrellasIntrinsic Heterogeneity in Multicores 2424

  • Roadmap

    • Spatial and variationSpatial and variation• Exploiting spatial variation• Temporal variation• Minimizing temporal variation

    – Aging-driven job scheduling and voltage changes

    • The road ahead

    g g j g g g[Tiwari MICRO08]

    Josep TorrellasIntrinsic Heterogeneity in Multicores 25

  • Proposed Approaches

    Factor NBTI HCIV exponential exponentialVdd exponential exponential

    T exponential linearα f linear

    • Aging driven job scheduling:

    α , f --- linear

    – Temperature, activity (α)• Voltage changes at key times of processor service life

    Changes V– Changes Vdd– High impact on temperature

    Josep TorrellasIntrinsic Heterogeneity in Multicores 2626

  • Aging Driven Job Scheduling

    • Different policies are possible• A possible one:

    – Send aging-intensive applications to the faster cores• High-T high activity appsHigh T, high activity apps• Fast aging of cores that have more room to age

    – Send less aging-intensive applications to slower cores• Low-T, low activity, memory bound apps• Slow aging of cores that have less room to age

    Josep TorrellasIntrinsic Heterogeneity in Multicores 2727

  • Voltage Changes (Vdd)

    %)

    S0

    Vdd‐

    Increase (%

    Path Delay 

    Vdd+

    Time (years)Critical P

    Y0

    No frequency change

    Josep TorrellasIntrinsic Heterogeneity in Multicores 2828

  • Big Picture

    • Vdd -Increases dela of logic paths– Increases delay of logic paths

    – Slows down aging rate

    • Vdd +– Reduces delay of logic paths

    I i– Increases aging rate

    Josep TorrellasIntrinsic Heterogeneity in Multicores 2929

  • When to Apply These Techniques?

    • Aging rate is higher in beginning and lower at end• Apply V - in beginning• Apply Vdd- in beginning

    – Slow down aging the most– Have room to slow down paths

    • Apply Vdd+ at end– Little impact on aging rate anyway– Reduce logic path delaysg p y

    S0

    Josep TorrellasIntrinsic Heterogeneity in Multicores 3030

    Time (years)Y0

  • Impact on the Aging Curve

    S

    e (%

    )

    S0

    S

    lay Increase S

    dd

    Del

    t YVdd ‐ Vdd + Time (years)

    • Need less guardband (S% rather than S0%) C l it t hi h f f th b i i

    ttrans Y0

    Josep TorrellasIntrinsic Heterogeneity in Multicores 3131

    • Can cycle it at higher frequency from the beginning

  • Roadmap

    • Spatial VariationSpatial Variation• Exploiting spatial variation• Temporal variation • Minimizing temporal variation• The road ahead

    Josep TorrellasIntrinsic Heterogeneity in Multicores 32

  • The Road Ahead (I)

    • Multicores will become more dynamic – Chips will come with many sensors (power temp activity)Chips will come with many sensors (power, temp, activity)– Multiple f domains common (AMD), multiple V domains likely– Control can be by a HW controller (Itanium Foxton) or in SW

    • Homogeneous multicores appear as heterogeneous Driven by single thread performance– Driven by single-thread performance

    – Intel’s Turbo Mode– Perhaps forms of Core-fusion

    Josep TorrellasIntrinsic Heterogeneity in Multicores 33

  • The Road Ahead (II)

    • Designs will become more aggressive– Use shorter guardbands and target the common usageUse shorter guardbands and target the common usage– Rely on on-chip aging sensors to seamlessly lower

    performance– Example: Turbo Mode

    • Inexpensive opportunities for system software to control theInexpensive opportunities for system software to control the heterogeneity of homogeneous platforms– Variation-aware job scheduling and power management– Changes in Vdd to slow down aging

    Josep TorrellasIntrinsic Heterogeneity in Multicores 34

  • Intrinsic Heterogeneity in MulticoresIntrinsic Heterogeneity in MulticoresDue to Process Variation and Core Aging

    Josep TorrellasD t t f C t S iDepartment of Computer Science

    University of Illinois at Urbana-Champaignhttp://iacoma.cs.uiuc.edu

    Multiprocessor Architectures for Speculative Multithreading Josep Torrellas, University of Illinois

    Supported by NSF, Intel and IBM

  • Can Have Multiple V Changes

    %)

    Increase (%

    Delay I

    Y

    V1 V2 V3

    Time (years)ttrans Y0ttrans

    • Proposal:– Manufacturer programs a schedule of V increases

    with time based on typical usage (no f change)– On-chip age sensors can modify the schedule

    Josep TorrellasIntrinsic Heterogeneity in Multicores 3636

    On chip age sensors can modify the schedule

  • Current Results

    • Variation-aware job scheduling and power management Impro es m lticore thro ghp t for a gi en po er b dget– Improves multicore throughput for a given power budget by 12-17%

    • Slowing down aging– Improves frequency of multicore by 12%– Regains > 50% of performance losses due to aging

    • Low hardware overhead• Low hardware overhead– Change V, job scheduling

    Josep TorrellasIntrinsic Heterogeneity in Multicores 3737

  • Example

    Josep TorrellasIntrinsic Heterogeneity in Multicores 3838

  • Example

    App

    1 2DApp

    hottest

    C

    3 4B

    3 4A coolest

    Josep TorrellasIntrinsic Heterogeneity in Multicores 3939

  • Example

    App

    1 2DApp

    C

    3 4B

    3 4A

    Josep TorrellasIntrinsic Heterogeneity in Multicores 4040

  • What Affects Aging Rate?[Tiwari MICRO 08]

    Factor NBTI HCIV V exponential exponentialVdd - Vt exponential exponential

    T exponential linearf activity (α) linearf, activity (α) --- linear

    • Change aging rate withAdaptive Supply Voltage (ASV):– Adaptive Supply Voltage (ASV):

    • Changes Vdd• Exponential impact on T

    Adaptive Body Bias (ABB):– Adaptive Body Bias (ABB): • Changes Vt• Exponential impact on T

    Josep TorrellasIntrinsic Heterogeneity in Multicores 4141