The Future Is Heterogeneous Computing


Transcript of The Future Is Heterogeneous Computing

  • 8/13/2019 The Future Is Heterogeneous Computing

    1/26


Page 2 | The Future Is Heterogeneous Computing | Oct 27, 2010

    Workload Example: Changing Consumer Behavior

    • 20 hours of video uploaded to YouTube every minute

    • 50 million+ digital media files added to personal content libraries every day

    • Approximately 9 billion owned video files are high-definition

    • 1,000 images uploaded to Facebook every second


    Page 3

    Challenges for Next Generation Systems

    • The Power Wall: even more broadly constraining in the future!

    • Complexity Management (HW and SW): principles for managing exponential growth

    • Parallelism, Programmability and Efficiency

    • Optimized SW for System-level Solutions

    • System Balance: memory technologies and system design; interconnect design


    Page 4

    The Power Wall

    Easy prediction: power will continue to be the #1 design constraint for computer systems design.

    Why? Vmin will not continue tracking Moore's Law.

    • Integration of system-level components consumes chip power: a well-utilized 100 GB/sec DDR memory interface consumes ~15W for the I/O alone!

    • 2nd-order effects of power: thermal, packaging & cooling (node-level & datacenter-level); electrical stability in the face of rising variability

    • Thermal Design Points (TDPs) in all market segments continue to drop

    • Lightly loaded and idle power characteristics are key parameters in the Operational Expense (OpEx) equation

    • Percent of total world energy consumed by computing devices continues to grow year-on-year
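    As a back-of-the-envelope check on the memory-interface figure above, the quoted ~15W at 100 GB/sec works out to roughly 19 picojoules per transferred bit. This small sketch uses only the two numbers on the slide:

```python
def io_energy_per_bit(power_watts, bandwidth_bytes_per_sec):
    """Energy spent per transferred bit, in picojoules."""
    bits_per_sec = bandwidth_bytes_per_sec * 8
    joules_per_bit = power_watts / bits_per_sec
    return joules_per_bit * 1e12  # J -> pJ

# ~15 W for a fully utilized 100 GB/sec DDR interface
print(io_energy_per_bit(15, 100e9))  # ~18.75 pJ/bit
```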


    Page 5

    Optimized SW for System-level Solutions

    Long history of SW optimizations for HW characteristics:

    • Optimizing compilers: cache / TLB blocking

    • Multi-processor coordination: communication & synchronization

    • Non-uniform memory characteristics: process and memory affinity

    The scarcity/abundance principle favors increased use of abstractions. Abstraction leads to increased productivity but costs performance; still allows experts to burrow down into lower-level, on-the-metal details.

    The System-level Integration Era will demand even more:

    • Many Core: user-mode and/or managed-runtime scheduling?

    • Heterogeneous Many Core: capability-aware scheduling?

    SW productivity versus optimization dichotomy: exposed HW leads to better performance but requires a platform-characteristics-aware programming model
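    Cache/TLB blocking, the first optimization named above, can be illustrated with a small sketch. The tile size of 64 is a hypothetical choice; real compilers and libraries tune it to the target cache:

```python
def blocked_transpose(a, n, tile=64):
    """Transpose an n x n matrix (list of lists) tile by tile.

    Visiting the matrix in tile x tile blocks keeps each block resident
    in cache while it is read and written, instead of striding across
    the whole matrix on every row.
    """
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    out[j][i] = a[i][j]
    return out

assert blocked_transpose([[1, 2], [3, 4]], 2, tile=1) == [[1, 3], [2, 4]]
```

    The interpreter hides the cache effect, but the traversal order is exactly what a blocking compiler emits for large matrices.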


    Page 6

    The Memory Wall: getting thicker

    There has always been a critical balance between data availability and processing.

    Situation | When? | Implication | Industry Solutions
    DRAM vs CPU cycle-time gap | Early 1990s | Memory wait time dominates computing | Non-blocking caches; O-o-O machines
    SW productivity crisis (object-oriented languages; managed runtime environments) | Mid 1990s | Larger working sets; more diverse data types | Larger caches; cache hierarchies; elaborate prefetch
    Single thread to CMP focus | 2005 and beyond | Multiple working sets! Virtual machines! More memory accesses | Huge caches; multiple memory controllers; extreme PHYs
    New & emerging abstractions (browser-based runtimes; image/video as basic data types; throughput-based designs) | 2009 and beyond | Even larger working sets; larger data types | Accelerated Parallel Processing; chip stacking; TBD


    Page 7

    Interconnect Challenges

    • Coherence domain: knowing when to stop. Interesting implications for on-chip interconnect networks.

    • Industry mantra: never bet against Ethernet. But current Ethernet is not well suited for lossless transmission, which is troublesome for storage, messaging and more.

    • The more subtle and trickier problems: adaptive routing, congestion management, QoS, end-to-end characteristics, and more.

    • Data centers of tomorrow are going to take great interest in this area.


    Page 8

    Single-thread Performance

    [Figure: five trend sketches, each marked "we are here": IPC vs issue width (the IPC Complexity Wall); integration vs time on a log scale (Moore's Law); power budget (TDP) vs time (the Power Wall); frequency vs time (the Frequency Wall); and the resulting single-thread performance vs time, now flattening. Single-thread performance is limited by DFM, variability, reliability and wire delay; power matters in every segment (server: power = $$; desktop: eliminate fans; mobile: battery). A final sketch plots performance vs cache size, with locality yielding diminishing returns.]


    Page 9

    Parallel Programs and Amdahl's Law

    [Figure: two plots of speed-up vs number of CPU cores (1 to 128). With 0% serial work, speed-up is linear in the core count; with 10%, 35% or 100% serial work it flattens out quickly.]

    Speed-up = 1 / (SW + (1 - SW) / N)

    SW: % serial work
    N: number of processors

    Assume a 100W TDP socket:

    • 10W for global clocking

    • 20W for on-chip network/caches

    • 15W for I/O (memory, PCIe, etc.)

    This leaves 55W for all the cores: ~850mW per core!
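    The formula and the power budget above can be checked in a few lines. The 64-core divisor is an assumption; the slide's ~850mW figure is consistent with dividing the 55W residual across 64 cores:

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Speed-up = 1 / (SW + (1 - SW) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# With 0% serial work, speed-up scales linearly with cores...
print(amdahl_speedup(0.00, 128))            # 128.0
# ...but even 10% serial work caps 128 cores below 10x
print(round(amdahl_speedup(0.10, 128), 2))  # 9.34

# Power budget: 100 W TDP minus clocking, network/caches and I/O
core_budget_w = 100 - 10 - 20 - 15  # 55 W left for all cores
print(core_budget_w / 64)           # ~0.86 W (~850 mW) per core
```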


    Page 10

    35 Years of Microprocessor Trend Data

    [Figure: log-scale scatter plot over time of transistors (thousands), single-thread performance (SpecINT), frequency (MHz), typical power (Watts), and number of cores]

    Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten. Dotted-line extrapolations by C. Moore.


    Page 11

    The Power Wall, Again!

    Escalating multi-core designs will crash into the power wall just as single cores did due to escalating frequency.

    Why? To maintain a reasonable balance, core additions must be accompanied by increases in other resources that consume power (on-chip network, caches, memory and I/O BW, ...), an upward spiral effect on power.

    • The use of multiple cores forces each core to actually slow down

    • At some point, the power limits will not even allow you to activate all of the cores at the same time

    • Small, low-power cores tend to be very weak on single-threaded general-purpose workloads

    • The customer value proposition will continue to demand excellent performance on general-purpose workloads

    • The transition to compelling general-purpose parallel workloads will not be a fast one


    Page 13

    Three Eras of Processor Performance

    Single-Core Era
    [Chart: single-thread performance vs time, flattening; "we are here"]
    • Enabled by: Moore's Law, voltage scaling, microarchitecture
    • Constrained by: power, complexity

    Multi-Core Era
    [Chart: throughput performance vs time (# of processors); "we are here"]
    • Enabled by: Moore's Law, desire for throughput, 20 years of SMP architecture
    • Constrained by: power, parallel SW availability, scalability

    Heterogeneous Systems Era
    [Chart: targeted application performance vs time (data-parallel exploitation); "we are here"]
    • Enabled by: Moore's Law, abundant data parallelism, power-efficient GPUs
    • Currently constrained by: programming models, communication overheads


    Page 14

    AMD x86 64-bit CMP Evolution

    Year            2003           2005           2007           2008             2009           2010
    Product         AMD Opteron    Dual-Core      Quad-Core      45nm Quad-Core   Six-Core       AMD Opteron
                                   AMD Opteron    AMD Opteron    AMD Opteron      AMD Opteron    6100 Series
    Mfg. Process    90nm SOI       90nm SOI       65nm SOI       45nm SOI         45nm SOI       45nm SOI
    CPU Core        K8             K8             Greyhound      Greyhound+       Greyhound+     Greyhound+
    L2/L3           1MB/0          1MB/0          512kB/2MB      512kB/6MB        512kB/6MB      512kB/12MB
    HyperTransport  3x 1.6GT/s     3x 1.6GT/s     3x 2GT/s       3x 4.0GT/s       3x 4.8GT/s     4x 6.4GT/s
    Memory          2x DDR1 300    2x DDR1 400    2x DDR2 667    2x DDR2 800      2x DDR2 1066   4x DDR3 1333

    Max power budget remains consistent
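    The memory row alone shows roughly a 9x growth in peak bandwidth across the table. A quick sketch, assuming standard 64-bit (8-byte) channels, which the slide does not state:

```python
def peak_mem_bw_gb_s(channels, mt_per_s, bytes_per_beat=8):
    """Peak memory bandwidth in GB/s for a given channel count and data rate.

    Assumes 64-bit (8-byte) channels; MT/s is megatransfers per second.
    """
    return channels * mt_per_s * bytes_per_beat / 1000.0

print(peak_mem_bw_gb_s(2, 300))   # 4.8 GB/s   (2003: 2x DDR1 300)
print(peak_mem_bw_gb_s(4, 1333))  # 42.656 GB/s (2010: 4x DDR3 1333)
```

    About a 9x bandwidth increase within the same power budget, which is the slide's point.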


    Page 15

    AMD Opteron 6100 Series Silicon and Package

    [Die photo: two six-core dies, each with cores 1-6 and an L3 cache]

    • 12 AMD64 x86 cores

    • 18 MB on-chip cache

    • 4 memory channels @ 1333 MHz

    • 4 HT links @ 6.4 GT/sec


    Page 16

    AMD Radeon HD5870 GPU Architecture


    Page 17

    GPU Processing Performance Trend

    [Chart: peak single-precision GigaFLOPS from Sep-05 to Jul-09, rising from near 0 to roughly 2700, across R520 (ATI Radeon X1800; ATI FireGL V7200/V7300/V7350), R580(+) (ATI Radeon X19xx; ATI FireStream), R600 (ATI Radeon HD 2900; ATI FireGL V7600/V8600/V8650), RV670 (ATI Radeon HD 3800; ATI FireGL V7700; AMD FireStream 9170), RV770 (ATI Radeon HD 4800; ATI FirePro V8700; AMD FireStream 9250/9270) and Cypress (ATI Radeon HD 5870). Milestones annotated along the curve: unified shaders; double-precision floating point; GPGPU via CTM; Stream SDK with CAL+IL/Brook+; 2.5x ALU increase; OpenCL 1.1+ and DirectX 11 with 2.25x perf.]

    * Peak single-precision performance; for RV670, RV770 & Cypress, divide by 5 for peak double-precision performance


    Page 18

    GPU Efficiency

    [Chart: GFLOPS/W and GFLOPS/mm² from Nov-05 to Oct-09 for the ATI Radeon X1800 XT, X1900 XTX, HD 2900 PRO, HD 3870, HD 4870 and HD 5870, culminating in 14.47 GFLOPS/W and 7.90 GFLOPS/mm² for the HD 5870; intermediate data points shown include 7.50, 4.56, 4.50, 2.24, 2.21, 2.01, 1.07, 1.06, 0.92 and 0.42]
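    The headline 14.47 GFLOPS/W figure is consistent with the HD 5870's commonly cited peak of 2720 single-precision GFLOPS and a board power of roughly 188 W (both numbers are assumptions pulled from public specs, not from this slide); the divide-by-5 rule from the previous slide then gives the double-precision peak:

```python
# Assumed public specs for the ATI Radeon HD 5870 (Cypress)
peak_sp_gflops = 2720.0  # peak single-precision GFLOPS
board_power_w = 188.0    # maximum board power in watts

print(round(peak_sp_gflops / board_power_w, 2))  # 14.47 GFLOPS/W

# Previous slide: divide by 5 for peak double-precision on Cypress
print(peak_sp_gflops / 5)  # 544.0 double-precision GFLOPS
```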


    Page 19

    AMD Accelerated Parallel Processing (APP) Technology

    [Application segments: digital content creation, engineering sciences, government, gaming, productivity]

    • Heterogeneous: developers leverage AMD GPUs and CPUs for optimal application performance and user experience

    • High performance: massively parallel, programmable GPU architecture delivers unprecedented performance and power efficiency

    • Industry standards: OpenCL enables cross-platform development


    Page 21

    Heterogeneous Computing: Next-Generation Software Ecosystem

    [Stack diagram, bottom to top:
    • Hardware & drivers: AMD Fusion, discrete CPUs/GPUs
    • OpenCL & DirectCompute
    • Tools: HLL compilers, debuggers, profilers
    • Middleware/libraries: video, imaging, math/sciences, physics
    • High-level frameworks
    • End-user applications]

    Advanced optimizations & load balancing: load balance across CPUs and GPUs; leverage AMD Fusion performance advantages. Drive new features into industry standards. Increase ease of application development.
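    Load balancing across CPUs and GPUs, as the ecosystem layers above call for, can be reduced to a proportional split of work. This sketch is purely illustrative; the device names and throughput numbers are hypothetical, and a real runtime would measure throughput and rebalance dynamically:

```python
def split_work(total_items, device_throughput):
    """Assign work items to devices in proportion to their throughput.

    device_throughput: dict of device name -> relative items/sec.
    Remainder items after the integer split go to the fastest device.
    """
    total_rate = sum(device_throughput.values())
    shares = {dev: int(total_items * rate / total_rate)
              for dev, rate in device_throughput.items()}
    leftover = total_items - sum(shares.values())
    fastest = max(device_throughput, key=device_throughput.get)
    shares[fastest] += leftover
    return shares

# Hypothetical: the GPU processes items 4x faster than the CPU
print(split_work(100, {"cpu": 1.0, "gpu": 4.0}))  # {'cpu': 20, 'gpu': 80}
```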


    Page 22

    AMD Balanced Platform Advantage

    Delivers advanced performance for a wide range of platform configurations.

    Serial/task-parallel workloads: the CPU is excellent for running some algorithms; it is the ideal place to process if the GPU is fully loaded; a great use for additional CPU cores.

    Graphics workloads and other highly parallel workloads: the GPU is ideal for data-parallel algorithms like image processing, CAE, etc.; a great use for AMD Accelerated Parallel Processing (APP) technology; a great use for additional GPUs.


    Page 23

    Challenges: Extracting Parallelism

    Coarse-grain data-parallel code: loop 1M times for 1M pieces of data. Maps very well to throughput-oriented data-parallel engines.

        i = 0
        loop: load x(i); fmul; store
              i++; cmp i, 1000000; bc loop

    Fine-grain data-parallel code: loop 16 times for 16 pieces of data. Maps very well to integrated SIMD dataflow (i.e., SSE).

        i = 0
        loop: load x(i); fmul; store
              i++; cmp i, 16; bc loop

    Nested data-parallel code: a doubly nested loop over a 2D array representing a very large dataset. Lots of conditional data parallelism; benefits from closer coupling between CPU & GPU.

        i, j = 0
        outer: inner: load x(i, j); fmul; store
                      j++; cmp j, 100000; bc inner
               i++; cmp i, 100000; bc outer
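    The three loop shapes above can be sketched in a few lines of Python. This is a rough illustration only: the data and scale factor are made up, and real code would use SIMD intrinsics or a GPU kernel rather than interpreter loops:

```python
# Coarse-grain: one huge flat loop of independent multiplies,
# one per element -- the shape a throughput engine wants
def scale_coarse(x, c):
    return [c * v for v in x]

# Fine-grain: short fixed-width chunks -- the shape SSE-style SIMD wants
def scale_fine(x, c, width=16):
    out = []
    for i in range(0, len(x), width):  # each 16-wide chunk is one
        chunk = x[i:i + width]         # SIMD-sized unit of work
        out.extend(c * v for v in chunk)
    return out

# Nested + conditional: data-dependent work per element -- the case
# that benefits from tight CPU/GPU coupling
def scale_nested(grid, c):
    return [[c * v if v > 0 else v for v in row] for row in grid]

data = list(range(32))
assert scale_coarse(data, 2.0) == scale_fine(data, 2.0)
print(scale_nested([[1, -2], [-3, 4]], 10))  # [[10, -2], [-3, 40]]
```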


    Page 24

    A New Era of Processor Performance

    [Chart: throughput performance vs programmability. Microprocessor advancement (CPU) traces the Single-Core Era into the Multi-Core Era of homogeneous computing; GPU advancement moves from graphics driver-based programs to OpenCL/DX driver-based programs. The two converge in the Heterogeneous Systems Era: system-level programmable heterogeneous computing.]


    Page 25

    Now the AMD Fusion Era of Computing Begins

    Page 26

    DISCLAIMER

    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

    The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

    AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

    This presentation contains forward-looking statements concerning AMD and technology partner product offerings which are made pursuant to the safe harbor provisions of the Private Securities Litigation Reform Act of 1995. Forward-looking statements are commonly identified by words such as "would," "may," "expects," "believes," "plans," "intends," "strategy," "roadmaps," "projects" and other terms with similar meaning. Investors are cautioned that the forward-looking statements in this presentation are based on current beliefs, assumptions and expectations, speak only as of the date of this presentation and involve risks and uncertainties that could cause actual results to differ materially from current expectations.

    ATTRIBUTION

    © 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD Opteron, ATI, the ATI logo, Radeon and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows Vista are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc. Other names are for informational purposes only and may be trademarks of their respective owners.