The Future Is Heterogeneous Computing


Transcript of The Future Is Heterogeneous Computing

  • 8/13/2019 The Future Is Heterogeneous Computing

    1/26


Page 2 | The Future Is Heterogeneous Computing | Oct 27, 2010

    Workload Example: Changing Consumer Behavior

    • 20 hours of video uploaded to YouTube every minute

    • 50 million+ digital media files added to personal content libraries every day

    • Approximately 9 billion owned video files are high-definition

    • 1,000 images uploaded to Facebook every second


    Page 3

    Challenges for Next Generation Systems

    • The Power Wall: even more broadly constraining in the future!

    • Complexity Management (HW and SW): principles for managing exponential growth

    • Parallelism, Programmability and Efficiency

    • Optimized SW for System-level Solutions

    • System Balance: memory technologies and system design; interconnect design


    Page 4

    The Power Wall

    Easy prediction: power will continue to be the #1 design constraint for computer systems design.

    Why? Vmin will not continue tracking Moore's Law.

    • Integration of system-level components consumes chip power: a well-utilized 100 GB/sec DDR memory interface consumes ~15W for the I/O alone!

    • 2nd-order effects of power: thermal, packaging & cooling (node-level & datacenter-level); electrical stability in the face of rising variability

    • Thermal Design Points (TDPs) in all market segments continue to drop

    • Lightly loaded and idle power characteristics are key parameters in the Operational Expense (OpEx) equation

    • Percent of total world energy consumed by computing devices continues to grow year-on-year
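    As a back-of-the-envelope check on the memory-interface figure above, the quoted ~15W at 100 GB/sec works out to roughly 19 picojoules per transferred bit. This small sketch uses only the two numbers on the slide:

```python
def io_energy_per_bit(power_watts, bandwidth_bytes_per_sec):
    """Energy spent per transferred bit, in picojoules."""
    bits_per_sec = bandwidth_bytes_per_sec * 8
    joules_per_bit = power_watts / bits_per_sec
    return joules_per_bit * 1e12  # J -> pJ

# ~15 W for a fully utilized 100 GB/sec DDR interface
print(io_energy_per_bit(15, 100e9))  # ~18.75 pJ/bit
```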


    Page 5

    Optimized SW for System-level Solutions

    Long history of SW optimizations for HW characteristics:

    • Optimizing compilers: cache / TLB blocking

    • Multi-processor coordination: communication & synchronization

    • Non-uniform memory characteristics: process and memory affinity

    The scarcity/abundance principle favors increased use of abstractions. Abstraction leads to increased productivity but costs performance; still allows experts to burrow down into lower-level, on-the-metal details.

    The System-level Integration Era will demand even more:

    • Many Core: user-mode and/or managed-runtime scheduling?

    • Heterogeneous Many Core: capability-aware scheduling?

    SW productivity versus optimization dichotomy: exposed HW leads to better performance but requires a platform-characteristics-aware programming model
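    Cache/TLB blocking, the first optimization named above, can be illustrated with a small sketch. The tile size of 64 is a hypothetical choice; real compilers and libraries tune it to the target cache:

```python
def blocked_transpose(a, n, tile=64):
    """Transpose an n x n matrix (list of lists) tile by tile.

    Visiting the matrix in tile x tile blocks keeps each block resident
    in cache while it is read and written, instead of striding across
    the whole matrix on every row.
    """
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    out[j][i] = a[i][j]
    return out

assert blocked_transpose([[1, 2], [3, 4]], 2, tile=1) == [[1, 3], [2, 4]]
```

    The interpreter hides the cache effect, but the traversal order is exactly what a blocking compiler emits for large matrices.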


    Page 6

    The Memory Wall: getting thicker

    There has always been a critical balance between data availability and processing.

    Situation | When? | Implication | Industry Solutions
    DRAM vs CPU cycle-time gap | Early 1990s | Memory wait time dominates computing | Non-blocking caches; O-o-O machines
    SW productivity crisis (object-oriented languages; managed runtime environments) | Mid 1990s | Larger working sets; more diverse data types | Larger caches; cache hierarchies; elaborate prefetch
    Single thread to CMP focus | 2005 and beyond | Multiple working sets! Virtual machines! More memory accesses | Huge caches; multiple memory controllers; extreme PHYs
    New & emerging abstractions (browser-based runtimes; image/video as basic data types; throughput-based designs) | 2009 and beyond | Even larger working sets; larger data types | Accelerated Parallel Processing; chip stacking; TBD


    Page 7

    Interconnect Challenges

    • Coherence domain: knowing when to stop. Interesting implications for on-chip interconnect networks.

    • Industry mantra: never bet against Ethernet. But current Ethernet is not well suited for lossless transmission, which is troublesome for storage, messaging and more.

    • The more subtle and trickier problems: adaptive routing, congestion management, QoS, end-to-end characteristics, and more.

    • Data centers of tomorrow are going to take great interest in this area.


    Page 8

    Single-thread Performance

    [Figure: five trend sketches, each marked "we are here": IPC vs issue width (the IPC Complexity Wall); integration vs time on a log scale (Moore's Law); power budget (TDP) vs time (the Power Wall); frequency vs time (the Frequency Wall); and the resulting single-thread performance vs time, now flattening. Single-thread performance is limited by DFM, variability, reliability and wire delay; power matters in every segment (server: power = $$; desktop: eliminate fans; mobile: battery). A final sketch plots performance vs cache size, with locality yielding diminishing returns.]


    Page 9

    Parallel Programs and Amdahl's Law

    [Figure: two plots of speed-up vs number of CPU cores (1 to 128). With 0% serial work, speed-up is linear in the core count; with 10%, 35% or 100% serial work it flattens out quickly.]

    Speed-up = 1 / (SW + (1 - SW) / N)

    SW: % serial work
    N: number of processors

    Assume a 100W TDP socket:

    • 10W for global clocking

    • 20W for on-chip network/caches

    • 15W for I/O (memory, PCIe, etc.)

    This leaves 55W for all the cores: ~850mW per core!
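    The formula and the power budget above can be checked in a few lines. The 64-core divisor is an assumption; the slide's ~850mW figure is consistent with dividing the 55W residual across 64 cores:

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Speed-up = 1 / (SW + (1 - SW) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# With 0% serial work, speed-up scales linearly with cores...
print(amdahl_speedup(0.00, 128))            # 128.0
# ...but even 10% serial work caps 128 cores below 10x
print(round(amdahl_speedup(0.10, 128), 2))  # 9.34

# Power budget: 100 W TDP minus clocking, network/caches and I/O
core_budget_w = 100 - 10 - 20 - 15  # 55 W left for all cores
print(core_budget_w / 64)           # ~0.86 W (~850 mW) per core
```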


    Page 10

    35 Years of Microprocessor Trend Data

    [Figure: log-scale scatter plot over time of transistors (thousands), single-thread performance (SpecINT), frequency (MHz), typical power (Watts), and number of cores]

    Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten. Dotted-line extrapolations by C. Moore.


    Page 11

    The Power Wall, Again!

    Escalating multi-core designs will crash into the power wall just as single cores did due to escalating frequency.

    Why? To maintain a reasonable balance, core additions must be accompanied by increases in other resources that consume power (on-chip network, caches, memory and I/O BW, ...), an upward spiral effect on power.

    • The use of multiple cores forces each core to actually slow down

    • At some point, the power limits will not even allow you to activate all of the cores at the same time

    • Small, low-power cores tend to be very weak on single-threaded general-purpose workloads

    • The customer value proposition will continue to demand excellent performance on general-purpose workloads

    • The transition to compelling general-purpose parallel workloads will not be a fast one


    Page 13

    Three Eras of Processor Performance

    Single-Core Era
    [Chart: single-thread performance vs time, flattening; "we are here"]
    • Enabled by: Moore's Law, voltage scaling, microarchitecture
    • Constrained by: power, complexity

    Multi-Core Era
    [Chart: throughput performance vs time (# of processors); "we are here"]
    • Enabled by: Moore's Law, desire for throughput, 20 years of SMP architecture
    • Constrained by: power, parallel SW availability, scalability

    Heterogeneous Systems Era
    [Chart: targeted application performance vs time (data-parallel exploitation); "we are here"]
    • Enabled by: Moore's Law, abundant data parallelism, power-efficient GPUs
    • Currently constrained by: programming models, communication overheads


    Page 14

    AMD x86 64-bit CMP Evolution

    Year            2003           2005           2007           2008             2009           2010
    Product         AMD Opteron    Dual-Core      Quad-Core      45nm Quad-Core   Six-Core       AMD Opteron
                                   AMD Opteron    AMD Opteron    AMD Opteron      AMD Opteron    6100 Series
    Mfg. Process    90nm SOI       90nm SOI       65nm SOI       45nm SOI         45nm SOI       45nm SOI
    CPU Core        K8             K8             Greyhound      Greyhound+       Greyhound+     Greyhound+
    L2/L3           1MB/0          1MB/0          512kB/2MB      512kB/6MB        512kB/6MB      512kB/12MB
    HyperTransport  3x 1.6GT/s     3x 1.6GT/s     3x 2GT/s       3x 4.0GT/s       3x 4.8GT/s     4x 6.4GT/s
    Memory          2x DDR1 300    2x DDR1 400    2x DDR2 667    2x DDR2 800      2x DDR2 1066   4x DDR3 1333

    Max power budget remains consistent
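    The memory row alone shows roughly a 9x growth in peak bandwidth across the table. A quick sketch, assuming standard 64-bit (8-byte) channels, which the slide does not state:

```python
def peak_mem_bw_gb_s(channels, mt_per_s, bytes_per_beat=8):
    """Peak memory bandwidth in GB/s for a given channel count and data rate.

    Assumes 64-bit (8-byte) channels; MT/s is megatransfers per second.
    """
    return channels * mt_per_s * bytes_per_beat / 1000.0

print(peak_mem_bw_gb_s(2, 300))   # 4.8 GB/s   (2003: 2x DDR1 300)
print(peak_mem_bw_gb_s(4, 1333))  # 42.656 GB/s (2010: 4x DDR3 1333)
```

    About a 9x bandwidth increase within the same power budget, which is the slide's point.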


    Page 15

    AMD Opteron 6100 Series Silicon and Package

    [Die photo: two six-core dies, each with cores 1-6 and an L3 cache]

    • 12 AMD64 x86 cores

    • 18 MB on-chip cache

    • 4 memory channels @ 1333 MHz

    • 4 HT links @ 6.4 GT/sec


    Page 16

    AMD Radeon HD5870 GPU Architecture


    Page 17

    GPU Processing Performance Trend

    [Chart: peak single-precision GigaFLOPS from Sep-05 to Jul-09, rising from near 0 to roughly 2700, across R520 (ATI Radeon X1800; ATI FireGL V7200/V7300/V7350), R580(+) (ATI Radeon X19xx; ATI FireStream), R600 (ATI Radeon HD 2900; ATI FireGL V7600/V8600/V8650), RV670 (ATI Radeon HD 3800; ATI FireGL V7700; AMD FireStream 9170), RV770 (ATI Radeon HD 4800; ATI FirePro V8700; AMD FireStream 9250/9270) and Cypress (ATI Radeon HD 5870). Milestones annotated along the curve: unified shaders; double-precision floating point; GPGPU via CTM; Stream SDK with CAL+IL/Brook+; 2.5x ALU increase; OpenCL 1.1+ and DirectX 11 with 2.25x perf.]

    * Peak single-precision performance; for RV670, RV770 & Cypress, divide by 5 for peak double-precision performance


    Page 18

    GPU Efficiency

    [Chart: GFLOPS/W and GFLOPS/mm² from Nov-05 to Oct-09 for the ATI Radeon X1800 XT, X1900 XTX, HD 2900 PRO, HD 3870, HD 4870 and HD 5870, culminating in 14.47 GFLOPS/W and 7.90 GFLOPS/mm² for the HD 5870; intermediate data points shown include 7.50, 4.56, 4.50, 2.24, 2.21, 2.01, 1.07, 1.06, 0.92 and 0.42]
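    The headline 14.47 GFLOPS/W figure is consistent with the HD 5870's commonly cited peak of 2720 single-precision GFLOPS and a board power of roughly 188 W (both numbers are assumptions pulled from public specs, not from this slide); the divide-by-5 rule from the previous slide then gives the double-precision peak:

```python
# Assumed public specs for the ATI Radeon HD 5870 (Cypress)
peak_sp_gflops = 2720.0  # peak single-precision GFLOPS
board_power_w = 188.0    # maximum board power in watts

print(round(peak_sp_gflops / board_power_w, 2))  # 14.47 GFLOPS/W

# Previous slide: divide by 5 for peak double-precision on Cypress
print(peak_sp_gflops / 5)  # 544.0 double-precision GFLOPS
```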


    Page 19

    AMD Accelerated Parallel Processing (APP) Technology

    [Application segments: digital content creation, engineering sciences, government, gaming, productivity]

    • Heterogeneous: developers leverage AMD GPUs and CPUs for optimal application performance and user experience

    • High performance: massively parallel, programmable GPU architecture delivers unprecedented performance and power efficiency

    • Industry standards: OpenCL enables cross-platform development


    Page 21

    Heterogeneous Computing: Next-Generation Software Ecosystem

    [Stack diagram, bottom to top:
    • Hardware & drivers: AMD Fusion, discrete CPUs/GPUs
    • OpenCL & DirectCompute
    • Tools: HLL compilers, debuggers, profilers
    • Middleware/libraries: video, imaging, math/sciences, physics
    • High-level frameworks
    • End-user applications]

    Advanced optimizations & load balancing: load balance across CPUs and GPUs; leverage AMD Fusion performance advantages. Drive new features into industry standards. Increase ease of application development.
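    Load balancing across CPUs and GPUs, as the ecosystem layers above call for, can be reduced to a proportional split of work. This sketch is purely illustrative; the device names and throughput numbers are hypothetical, and a real runtime would measure throughput and rebalance dynamically:

```python
def split_work(total_items, device_throughput):
    """Assign work items to devices in proportion to their throughput.

    device_throughput: dict of device name -> relative items/sec.
    Remainder items after the integer split go to the fastest device.
    """
    total_rate = sum(device_throughput.values())
    shares = {dev: int(total_items * rate / total_rate)
              for dev, rate in device_throughput.items()}
    leftover = total_items - sum(shares.values())
    fastest = max(device_throughput, key=device_throughput.get)
    shares[fastest] += leftover
    return shares

# Hypothetical: the GPU processes items 4x faster than the CPU
print(split_work(100, {"cpu": 1.0, "gpu": 4.0}))  # {'cpu': 20, 'gpu': 80}
```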


    Page 22

    AMD Balanced Platform Advantage

    Delivers advanced performance for a wide range of platform configurations.

    Serial/task-parallel workloads: the CPU is excellent for running some algorithms; it is the ideal place to process if the GPU is fully loaded; a great use for additional CPU cores.

    Graphics workloads and other highly parallel workloads: the GPU is ideal for data-parallel algorithms like image processing, CAE, etc.; a great use for AMD Accelerated Parallel Processing (APP) technology; a great use for additional GPUs.


    Page 23

    Challenges: Extracting Parallelism

    Coarse-grain data-parallel code: loop 1M times for 1M pieces of data. Maps very well to throughput-oriented data-parallel engines.

        i = 0
        loop: load x(i); fmul; store
              i++; cmp i, 1000000; bc loop

    Fine-grain data-parallel code: loop 16 times for 16 pieces of data. Maps very well to integrated SIMD dataflow (i.e., SSE).

        i = 0
        loop: load x(i); fmul; store
              i++; cmp i, 16; bc loop

    Nested data-parallel code: a doubly nested loop over a 2D array representing a very large dataset. Lots of conditional data parallelism; benefits from closer coupling between CPU & GPU.

        i, j = 0
        outer: inner: load x(i, j); fmul; store
                      j++; cmp j, 100000; bc inner
               i++; cmp i, 100000; bc outer
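    The three loop shapes above can be sketched in a few lines of Python. This is a rough illustration only: the data and scale factor are made up, and real code would use SIMD intrinsics or a GPU kernel rather than interpreter loops:

```python
# Coarse-grain: one huge flat loop of independent multiplies,
# one per element -- the shape a throughput engine wants
def scale_coarse(x, c):
    return [c * v for v in x]

# Fine-grain: short fixed-width chunks -- the shape SSE-style SIMD wants
def scale_fine(x, c, width=16):
    out = []
    for i in range(0, len(x), width):  # each 16-wide chunk is one
        chunk = x[i:i + width]         # SIMD-sized unit of work
        out.extend(c * v for v in chunk)
    return out

# Nested + conditional: data-dependent work per element -- the case
# that benefits from tight CPU/GPU coupling
def scale_nested(grid, c):
    return [[c * v if v > 0 else v for v in row] for row in grid]

data = list(range(32))
assert scale_coarse(data, 2.0) == scale_fine(data, 2.0)
print(scale_nested([[1, -2], [-3, 4]], 10))  # [[10, -2], [-3, 40]]
```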


    Page 24

    A New Era of Processor Performance

    [Chart: throughput performance vs programmability. Microprocessor advancement (CPU) traces the Single-Core Era into the Multi-Core Era of homogeneous computing; GPU advancement moves from graphics driver-based programs to OpenCL/DX driver-based programs. The two converge in the Heterogeneous Systems Era: system-level programmable heterogeneous computing.]


    Page 25

    Now the AMD Fusion Era of Computing Begins

    Page 26

    DISCLAIMER

    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

    The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

    AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

    This presentation contains forward-looking statements concerning AMD and technology partner product offerings which are made pursuant to the safe harbor provisions of the Private Securities Litigation Reform Act of 1995. Forward-looking statements are commonly identified by words such as "would," "may," "expects," "believes," "plans," "intends," "strategy," "roadmaps," "projects" and other terms with similar meaning. Investors are cautioned that the forward-looking statements in this presentation are based on current beliefs, assumptions and expectations, speak only as of the date of this presentation and involve risks and uncertainties that could cause actual results to differ materially from current expectations.

    ATTRIBUTION

    © 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD Opteron, ATI, the ATI logo, Radeon and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows Vista are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc. Other names are for informational purposes only and may be trademarks of their respective owners.