Hardware Evolution Trends of Extreme Scale Computing
TeraFlop Embedded
PetaFlop Departmental
ExaFlop Data Center
Peter M. Kogge
McCourtney Chair in Computer Science & Engineering, Univ. of Notre Dame
IBM Fellow (retired)
CalTech AyBi 199b, April 26, 2011
Topics
• What Is Exascale
• Moore’s Law, Multicore, & Energy 101
• HPC Computing: How We Got To Today
• The DARPA Exascale Study
• Exascale Strawmen
• A Deep Dive into Memory Energies
• Rockster: The Problem Isn’t Just at the High End
Materials presented here are based on an 18-month DARPA-sponsored study on "Exascale Computing" and later studies
Background-Computing
• Classical "Measures of Performance"
  Flop/s: floating point operations per second
  Bytes per Flop/s: memory capacity per unit of performance
    Classical goal: 1 byte per flop/s
  Bytes/s per Flop/s: memory bandwidth per unit of performance
    Classical goal: 1 byte/s per flop/s
• Last 30 years of HPC: focus on flop/s
  1980s: "Gigaflop" (10^9/s) first in a single vector processor
  1994: "Teraflop" (10^12/s) first via thousands of microprocessors
  2009: "Petaflop" (10^15/s) first via several hundred thousand cores
• Let’s call technology to reach “xyz flop” as “xyz scale”
Background-Technology
• Commercial technology to date:
  Riding Moore's Law to denser, "faster" chips
  Always shrunk prior "XXX"-scale systems to smaller form factors
  Shrink, with clock speedup, enabled the next "XXX" scale
• But microprocessor clock rates flat-lined in 2004
  Driven by power dissipation limits
• "Exascale" now on the horizon (Exaflop = 10^18/s)
  But technology change is radically changing our path to get there
  Especially Energy/Power
Exascale
• Exascale = 1,000X the capability of today
• Exascale != Exaflops, but:
  Exascale at the data center size => Exaflops
  Exascale at the "rack" size => Petaflops for departmental systems
  Exascale embedded => Teraflops in a cube
• It took us 14+ years to get from the 1st Petaflops workshop (1994), through NSF studies, HTMT, HPCS, ... to Petaflops in 2009
• Study questions: Can we ride silicon to Exa? What will such systems look like? Where are the challenges?
For Those of You Who Want an Exaflop…
You should be OVERJOYED if all you will need is:
• JUST a Million cores
• ONLY 1 Nuclear Power Plant
• MINIMAL programming support
Moore's Law, Energy 101, & Multicore
Today’s CMOS Silicon Structures
[Figure: cross-sections of a single transistor, multiple layers of wire, and a trench capacitor for memory, showing the cell plate, capacitor insulator, storage-node poly, field oxide, and Si substrate]
"Feature Size:" minimum linear dimension
• Width of transistor gate
• ½ pitch between two wires at the lowest levels of metal
Moore's Law
• 1965, Gordon Moore: the number of transistors that can be integrated on a die would double every 18 to 24 months
  Equivalent to feature size decreasing by a factor of √2
  Memory density 4X every 3 years
[Chart: logic transistor density (T/cm²) and DRAM chip capacity (Kb) vs. year, 1980–2025, growing 4X every 3 years, annotated with feature sizes from 2.4 µm down to 0.07 µm and with capacity milestones from a page of text, through 2 hrs of CD audio and 30 sec of HDTV, up to an encyclopedia, human DNA, and human memory]
Smaller Feature Sizes Also Enable Higher Clock Rates
[Chart: historical and ITRS-projected max clock rate (12 inverters, MHz) and feature size vs. year, 1975–2020; clock rates flatten at "The 3 GHz Wall" around 2004]
Why the Clock Flattening? POWER
[Charts: watts per die and watts per square cm, 1976–2006; power density climbs past a light bulb and an iron toward a rocket nozzle. Hot, Hot, Hot!]
CMOS Energy 101
[Figure: a CMOS gate driving a load capacitance C through an effective resistance R, with voltage waveforms for Vin and the gate output]
• Charging the load dissipates CV²/2 in the resistance and stores CV²/2 on the capacitance
• Discharging dissipates the stored CV²/2
• One clock cycle therefore dissipates C·V²
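To make the C·V² arithmetic concrete, here is a minimal sketch; the switched capacitance, supply voltage, and clock rate are illustrative assumptions, not numbers from the slides:

```c
/* Dynamic switching energy and power from E = C*V^2 per clock cycle.
   All numeric values below are assumed for illustration only. */
#include <stdio.h>

int main(void) {
    double C = 1e-15;   /* switched capacitance: 1 fF (assumed) */
    double V = 1.0;     /* supply voltage: 1.0 V (assumed) */
    double f = 3e9;     /* clock rate: 3 GHz (assumed) */

    double e_cycle = C * V * V;    /* energy per full charge/discharge cycle */
    double p_dyn   = e_cycle * f;  /* dynamic power if switched every cycle */

    printf("energy/cycle = %.3g J, dynamic power = %.3g W\n", e_cycle, p_dyn);
    return 0;
}
```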
How Did CV² Improve With Time?
• Assume the capacitance of a circuit scales as feature size
[Charts: ITRS feature size and Vdd vs. year (1985–2025) for high-performance logic, low-operating-power logic, and DRAM processes; and CV² relative to 90 nm high-performance logic, roughly a 330X improvement before 90 nm but only about 15X after. 90 nm is picked as the breakpoint because that is when Vdd, and thus clocks, flattened]
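As a rough illustration of that scaling assumption (energy per cycle ∝ feature size × Vdd²), the sketch below normalizes CV² to a 90 nm point; the (feature size, Vdd) pairs are round illustrative numbers, not the ITRS values plotted on the slide:

```c
/* CV^2 scaling sketch: capacitance taken proportional to feature size F,
   so relative energy per cycle is F * Vdd^2 normalized to 90 nm.
   The (F, Vdd) pairs are assumed example points, not slide data. */
#include <stdio.h>

int main(void) {
    double F[]   = { 500, 250, 130, 90, 65, 45, 32 };   /* nm (assumed) */
    double Vdd[] = { 5.0, 2.5, 1.5, 1.2, 1.1, 1.0, 0.9 }; /* V (assumed) */
    double ref   = 90 * 1.2 * 1.2;                       /* 90 nm reference */

    for (int i = 0; i < 7; i++) {
        double cv2 = F[i] * Vdd[i] * Vdd[i];
        printf("%3.0f nm @ %.1f V: CV^2 = %6.2fx relative to 90 nm\n",
               F[i], Vdd[i], cv2 / ref);
    }
    return 0;   /* most of the gain comes before 90 nm, little after */
}
```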
Loss of Clock Increase Left Brute Parallelism as the Only Performance Option
[Figure: "the same straw" — up to ~2002, a single core drawing from memory; now, many cores sharing the same memory straw]
The Multi-core Architectural Spectrum
[Figure: cores, caches/memories, and interconnect arranged three ways]
(a) Hierarchical designs: Intel Core Duo, IBM Power5, AMD Opteron, SUN Niagara, …
(b) Pipelined designs: IBM Cell, most router chips, many video chips
(c) Array designs: Tilera, Terasys, Execube, Yukon, Intel Teraflop
And Then There's "Heterogeneous" Multi-Core
NVIDIA GeForce Heterogeneous Architecture (not shown: explicit graphics engines)
[Figure: a conventional computer (host processor + host memory), a.k.a. the "Host," attached to the GPU, a.k.a. the "Device": a kernel scheduler feeds many multithreaded Streaming Multiprocessors (SMs), each with an instruction unit, stream processors, and local memory, all sharing L2 & memory channel interfaces to the graphics memory chips, a.k.a. the "Device Global Memory"]
Die Photos
[Die photos: AMD 12-core (hierarchical), IBM Execube (tiled), Tile64 (tiled), Nvidia (heterogeneous), IBM Cell (pipelined/heterogeneous)]
Modern High Performance SuperComputers
A Modern "Heavyweight" HPC System
[Pie charts, approximate breakdowns:
• Silicon area: Memory 86%, Random 8%, Processors 3%, Routers 3%
• Power: Processors 56%, Routers 33%, Memory 9%, Random 2%
• Board area: White space 50%, Processors 24%, Memory 10%, Routers 8%, Random 8%]
A "Lightweight" Node Alternative
• 2 nodes per "Compute Card." Each node: a low-power compute chip, some memory chips, "nothing else"
• Compute chip: 2 simple dual-issue cores, each with dual FPUs; memory controller; large eDRAM L3; 3D message interface; collective interface; all at a sub-GHz clock
• System architecture: multiple identical boards per rack; each board holds multiple Compute Cards; "nothing else"
"Packaging the Blue Gene/L supercomputer," IBM J. R&D, March/May 2005
"Blue Gene/L compute chip: Synthesis, timing, and physical design," IBM J. R&D, March/May 2005
Heterogeneous Systems
• Mix of heavyweight masters and GPU compute engines
• GPUs: a hierarchy of cores: host microprocessors, streaming multiprocessors, multiple thread engines, specialized function engines
http://www.nvidia.com/object/fermi_architecture.html
Where Are We Now? A Tour through TOP500 Land
Top500 & The LinPack Benchmark
• TOP500: www.top500.org
  Ranking of the top 500 supercomputers in the world
  Updated every 6 months
• Benchmark used: LinPack
  Solves a dense system of linear equations
  Assumes LU matrix factorization with partial pivoting
  Key parameter: floating point ops/second
• LinPack parameters:
  Rpeak: maximum possible # of flops per second
  Rmax: max observed flops/sec for any size LinPack
  Nmax: size of problem when achieving Rmax
  N1/2: problem size when ½ of Rmax is achieved
• Key kernel: DAXPY: Y[i] = Y[i] + αX[i] (a minimal sketch follows)
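For reference, a minimal sketch of the DAXPY kernel named above; this is the generic BLAS-1 operation, not the actual HPL/LinPack source:

```c
/* DAXPY sketch: y[i] = y[i] + alpha * x[i]. */
#include <stddef.h>

void daxpy(size_t n, double alpha, const double *x, double *y) {
    for (size_t i = 0; i < n; i++) {
        y[i] += alpha * x[i];   /* one multiply + one add = 2 flops per element */
    }
}
```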
Overall Metrics
[Chart: Rpeak, Rmax, and memory capacity of TOP500 systems vs. year, 1992–2016, on a log scale, with trend lines for each]
What Kind of Systems Are These?
[Chart: Rpeak vs. year, 2004–2012, by system class (heavyweight, lightweight, heterogeneous), annotated with the 1st BlueGene and with Roadrunner, the 1st heterogeneous system]
Flop Efficiency
[Chart: Rmax/Rpeak vs. year, 1987–2007, by system class (heavyweight, lightweight, heterogeneous)]
But 80% is NOT typical of more complex applications
Bytes per Flop (Peak)
[Chart: Bytes/Flop (relative to Rpeak) vs. year, 1987–2007, by system class]
Systems are getting less memory rich
Bytes per Flop (Rmax)
[Chart: Bytes/Flop (relative to Rmax) vs. year, 1987–2007, by system class]
Even on sustained performance, heterogeneous systems are memory poor
How Has "Processor Count" Changed?
[Chart: total socket and core counts vs. year, 1992–2012, on a log scale]
What About Clock Rate?
[Chart: clock rate (MHz) vs. year, 1992–2012, by system class (heavyweight, lightweight, heterogeneous)]
And Thread Level Concurrency?
[Chart: thread-level concurrency, TLC (flops/core/cycle), vs. year, 1987–2007, by system class]
Finally, What About Total Concurrency?
[Chart: total concurrency (TC) vs. year, 1992–2012, on a log scale, by system class]
Total Power
[Chart: total system power (MW) vs. year, 1987–2007, by system class]
Energy per Flop
[Chart: energy per flop (pJ) vs. year, 1987–2007, by system class, against a historical CMOS projection]
The DARPA Exascale Study
http://www.darpa.mil/tcto/docs/ExaScale_Study_Initial.pdf
The Exascale Study
Study directive: can we ride silicon to Exa by 2015?
1. Baseline today's: commodity technology, architectures, performance (Linpack)
2. Articulate scaling of potential application classes
3. Extrapolate roadmaps for: "mainstream" technologies; possible offshoots of mainstream technologies; alternative and emerging technologies
4. Use technology roadmaps to extrapolate use in "strawman" designs
5. Analyze results & identify "Challenges"
Architectures Considered
• Evolutionary strawmen
  "Heavyweight" strawman based on commodity-derived microprocessors
  "Lightweight" strawman based on custom microprocessors
• "Aggressive" strawman: "clean sheet of paper" CMOS silicon
In the process of adding Heterogeneous
1 EFlop/s "Clean Sheet of Paper" Strawman
Sizing done by "balancing" power budgets with achievable capabilities:
• 4 FPUs + RegFiles per core (= 6 GF @ 1.5 GHz)
• 1 chip = 742 cores (= 4.5 TF/s); 213 MB of L1 I&D; 93 MB of L2
• 1 node = 1 processor chip + 16 DRAMs (16 GB)
• 1 group = 12 nodes + 12 routers (= 54 TF/s)
• 1 rack = 32 groups (= 1.7 PF/s); 384 nodes/rack; 3.6 EB of disk storage included
• 1 system = 583 racks (= 1 EF/s)
• 166 MILLION cores; 680 MILLION FPUs; 3.6 PB = 0.0036 bytes/flop
• 68 MW with aggressive assumptions
Largely due to Bill Dally, Stanford (a back-of-envelope check of this sizing appears below)
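A back-of-envelope check of the strawman sizing, using only the per-core, per-chip, and per-rack numbers quoted above; the assumption that each FPU retires one flop per cycle is mine:

```c
/* Strawman sizing check: all counts come from the slide; the arithmetic
   (and the 1 flop/FPU/cycle assumption) is the only thing added. */
#include <stdio.h>

int main(void) {
    double clock_hz   = 1.5e9;            /* 1.5 GHz */
    double flops_core = 4 * clock_hz;     /* 4 FPUs/core -> 6 GF (assumed 1 flop/FPU/cycle) */
    double cores_chip = 742;
    double nodes_grp  = 12, grps_rack = 32, racks = 583;

    double flops_chip = flops_core * cores_chip;            /* ~4.45 TF */
    double flops_rack = flops_chip * nodes_grp * grps_rack; /* ~1.7 PF  */
    double flops_sys  = flops_rack * racks;                 /* ~1 EF    */
    double cores_sys  = cores_chip * nodes_grp * grps_rack * racks;
    double mem_bytes  = 16e9 * nodes_grp * grps_rack * racks; /* 16 GB per node */

    printf("chip %.2f TF, rack %.2f PF, system %.3f EF\n",
           flops_chip / 1e12, flops_rack / 1e15, flops_sys / 1e18);
    printf("cores %.0f M, memory %.2f PB, %.4f bytes/flop\n",
           cores_sys / 1e6, mem_bytes / 1e15, mem_bytes / flops_sys);
    return 0;
}
```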
A Single Node (No Router)
[Figure: a processor chip with "stacked" memory, via (a) quilt packaging or (b) through-via chip stack]
Characteristics: 742 cores; 4 FPUs/core; 16 GB DRAM; 290 Watts; 1.08 TF peak @ 1.5 GHz; ~3000 flops per cycle
Node power breakdown: Leakage 28%, FPU 21%, Cache access 20%, Off-chip 13%, Reg file 11%, On-chip 6%, DRAM access 1%
Strawman Power Distribution
[Pie charts: (a) overall system power, split among processors, memory, interconnect, router/network, and disk; (b) memory power, split among DRAM, L1, L2/L3, register file, and leakage; (c) interconnect power]
• Assumed a 60X per-bit decrease in access energy
• Growing DRAM capacity 30X at least doubles overall memory power
• 12 nodes per group; 32 groups per rack; 583 racks
• 1 EFlops / 3.6 PB; 166 million cores; 67 MWatts
Data Center Performance Projections
[Chart: GFlops vs. year, 2000–2020; projections made in mid-2007 for heavyweight and lightweight evolutionary paths, with Exascale marked at 10^9 GFlops around 2020]
But not at 20 MW!
Data Center Core Parallelism
[Chart: processor parallelism vs. year, 1996–2020: historical, Top 10, top system, Exa strawman, evolutionary light node, evolutionary heavy node; the Exa strawman reaches ~170 million cores]
AND we will need 10-100X more threading for latency management
Data Center Total Concurrency
[Chart: total concurrency vs. year, 1996–2020: historical, Top 10, top system, Top 1 trend, Exa strawman, evolutionary light and heavy nodes; rising from thousand-way through million-way to billion-way concurrency]
"Business as Usual" Energy Projections
[Chart: energy per flop (pJ) vs. year, 1980–2020: historical, Top 10, Green 500, top-system trend line, CMOS technology, UHPC targets (57 KW, 50 GF/W, 80 GF/W), and Exa heavyweight/lightweight optimistic and pessimistic projections; note the "bend" when voltage stopped dropping]
But At What Memory Capacity?
[Chart: bytes per flop-per-second vs. year, 1993–2015, Top 10 systems vs. the Exa strawman, with a 300 TB/Petaflop reference level. How many useful apps are down here?]
Key Take-Aways
• Developing Exascale systems will be really tough: in any time frame, for any of the 3 classes
• Evolutionary progression is at best 2020ish, with limited memory
• 4 key challenge areas: Power!!!!!!!!, Concurrency, Memory Capacity, Resiliency
• Will require coordinated, cross-disciplinary efforts
A Deep Dive into Interconnect to Deliver Operands
Assumed Technology Numbers
Operation                    Energy (pJ/bit)
Register file access         0.16
SRAM access                  0.23
DRAM access                  1
Thru-silicon vias (TSV)      0.011
On-chip movement             0.0187
Chip-to-board                2
Chip-to-optical              10
Router on-chip               2
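To get a feel for these numbers at operand granularity, the sketch below scales each per-bit figure to a 64-bit word; the 64-bit word width is an assumption, not something stated in the table:

```c
/* Per-operand energy from the per-bit figures above,
   assuming a 64-bit operand (the word width is an assumption). */
#include <stdio.h>

int main(void) {
    const char  *op[] = { "RegFile", "SRAM", "DRAM", "TSV",
                          "On-chip wire", "Chip-to-board", "Chip-to-optical", "Router" };
    const double pj_per_bit[] = { 0.16, 0.23, 1.0, 0.011, 0.0187, 2.0, 10.0, 2.0 };

    for (int i = 0; i < 8; i++)
        printf("%-16s %7.2f pJ per 64-bit word\n", op[i], 64.0 * pj_per_bit[i]);
    return 0;
}
```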
Aggressive Strawman Access Path: Interconnect-Driven
[Figure: a core on one microprocessor generates a request, which travels across the chip, through one or more routers, to the node holding the actual memory (some sort of memory structure), and back]
Tracking the Energy
Access        Total pJ   RF    Router  Tag   32KB SRAM  DRAM   On-chip  TSV   Chip-to-board  Chip-to-optical
L1 hit              39   30%      0%   28%        42%     0%       0%    0%             0%               0%
L2/L3 hit          385    7%      0%    6%        34%     0%      53%    0%             0%               0%
Local            1,380    1%      0%    2%         2%    68%      26%    1%             0%               0%
Global          13,819    0%     24%    0%         0%     1%      13%    0%            10%              52%
Ave read           706    2%     19%    2%         4%     8%      16%    0%             8%              41%
Write to L1         39   30%      0%   28%        42%     0%       0%    0%             0%               0%
Flush L1           891    0%      0%    1%        26%    32%      39%    1%             0%               0%
(Columns give the share of each access's energy spent in register file access, router logic, tag access, 32 KB SRAM access, DRAM access, on-chip transport, through-silicon vias, chip-to-board links, and chip-to-optical links.)
Linpack Energy
• Linpack: standard benchmark for TOP500 rankings
• Solve Ax = b by LU matrix decomposition & back solve
• Computationally intensive step: update rows after pivot
• Basic function: Y[j,k] = α·Y[i,k] + Y[j,k] (a sketch of this update loop follows)
  Read in C αs (C = "size" of cache line in dwords) (remote memory)
  Read in C Y's from row i (remote memory)
  Repeat for all elements of each row k:
    Read in C Y's from row k (local memory)
    Do processing
    Write C Y's (to cache)
    At some time flush C Y's back to local memory from cache
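A minimal sketch of the row-update loop described above, processed one cache line (C doubles) at a time; the names and the value of C are illustrative assumptions, not the HPL source:

```c
/* Row update: Y[j][k] = alpha * Y[i][k] + Y[j][k], one cache line at a time. */
#include <stddef.h>

#define C 8   /* doubles per cache line (assumed) */

/* Update row j of an n-column matrix against pivot row i. */
void update_row(double *Yj, const double *Yi, double alpha, size_t n) {
    for (size_t k = 0; k < n; k += C) {          /* one cache line at a time */
        size_t end = (k + C < n) ? k + C : n;
        for (size_t kk = k; kk < end; kk++)
            Yj[kk] = alpha * Yi[kk] + Yj[kk];    /* 2 flops per element */
        /* the C updated Y's sit in cache and are flushed back to local memory later */
    }
}
```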
Resulting Energy per Flop
Step              Target   pJ       # Occurrences   Total pJ   % of Total
Read alphas       Remote   13,819         4           55,276     16.5%
Read pivot row    Remote   13,819         4           55,276     16.5%
Read 1st Y[i]     Local     1,380        88          121,400     36.3%
Read other Y[i]s  L1           39       264           10,425      3.1%
Write Y's         L1           39       352           13,900      4.2%
Flush Y's         Local       891        88           78,380     23.4%
Total                                                334,656
Ave per flop                                             475
And a core flop in 2015 is only 10 pJ!!!
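A quick arithmetic check of the table: summing the per-step totals reproduces the quoted total, and dividing by the 475 pJ/flop average implies roughly 704 flops are being amortized (about 2 flops per Y element updated). The implied flop count is an inference from these numbers, not something stated on the slide:

```c
/* Sum the per-step totals above and back out the implied flop count. */
#include <stdio.h>

int main(void) {
    double step_pj[] = { 55276, 55276, 121400, 10425, 13900, 78380 };
    double total = 0;
    for (int i = 0; i < 6; i++) total += step_pj[i];

    printf("total   = %.0f pJ\n", total);   /* ~334,657, vs. 334,656 quoted (rounding) */
    printf("implied = %.1f flops at 475 pJ/flop\n", total / 475.0);  /* ~704.5 */
    return 0;
}
```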
Rockster on a Tiled Processor
"Extreme computing" in another dimension
The Problem: Take a Picture – Find the Rocks
• NxN pixel image (1Kx1K)
• Grey-scale pixels (8 bits)
• Intermediate "Rock Mask"
• Output: "Rock Descriptions"
• Image starts in memory (streamed from sensors)
The Rockster Algorithm
1. Move the image from memory to the tiles
2. 5x5 Gaussian smoothing (a generic sketch of this step appears below)
3. 3x3 Canny edge sharpener, including magnitude/angle computations per pixel
4. Horizon determination (sky fill): homogeneous regions touching the "top"
5. Edge following, including gap fix
6. Background fill to generate binary "Rock Mask" (1 bit/pixel: rock/not rock); write mask back to memory
7. Connected component to trace & report "Edge List" of rocks
Algorithm from JPL
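A generic sketch of the 5x5 Gaussian smoothing step; the binomial kernel weights and the clamped border handling are conventional choices, not taken from the JPL code:

```c
/* Generic 5x5 Gaussian smoothing of an 8-bit 1Kx1K image. */
#include <stdint.h>

#define N 1024

/* Binomial (1 4 6 4 1) outer product; weights sum to 256. */
static const int K[5][5] = {
    { 1,  4,  6,  4, 1 },
    { 4, 16, 24, 16, 4 },
    { 6, 24, 36, 24, 6 },
    { 4, 16, 24, 16, 4 },
    { 1,  4,  6,  4, 1 },
};

void gaussian5x5(const uint8_t in[N][N], uint8_t out[N][N]) {
    for (int r = 0; r < N; r++) {
        for (int c = 0; c < N; c++) {
            int acc = 0;
            for (int dr = -2; dr <= 2; dr++) {
                for (int dc = -2; dc <= 2; dc++) {
                    int rr = r + dr, cc = c + dc;
                    /* clamp at the image borders */
                    if (rr < 0) rr = 0; else if (rr >= N) rr = N - 1;
                    if (cc < 0) cc = 0; else if (cc >= N) cc = N - 1;
                    acc += K[dr + 2][dc + 2] * in[rr][cc];
                }
            }
            out[r][c] = (uint8_t)(acc / 256);   /* normalize by the kernel sum */
        }
    }
}
```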
Simplified Tiled Multicore Architecture for Study
[Figure: a grid of P rows by Q columns of tiles, Tile(1,1) … Tile(P,Q), with a memory controller attached at Tile(1,1)]
• PxQ "Tiles"; all tiles identical
• Each tile has 64 KB of L2 today
• Only X-Y neighbor routing
• Memory access only from Tile(1,1)
What Does All This Mean?
• Under some reasonable assumptions, with no off-tile memory references, the energy of a "compute cycle" breaks down as: core logic 44%, L2 21%, L1 20%, register file 11%, instruction fetch 4%
• Each off-tile memory reference costs the equivalent of an additional 225 compute cycles of energy
How Much Energy to Move the Image?
• Assume PxP tiles
• Each tile accesses N²/P² pixels = 2^14/P² cache lines
• Each access: 1 read from DRAM + 1 write to L2; transport through (i + j - 1) tiles
• Total: 3.2 + 0.15*P million compute cycles (see the sketch below)
• Note: more tiles – more energy!
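The slide's image-move cost is a simple linear model in P; the sketch below just evaluates 3.2 + 0.15·P million compute cycles for a few PxP configurations (the constants come from the slide, the choice of P values is mine):

```c
/* Evaluate the slide's image-move cost model in equivalent compute cycles. */
#include <stdio.h>

int main(void) {
    for (int P = 2; P <= 32; P *= 2) {
        double mcycles = 3.2 + 0.15 * P;   /* millions of compute cycles */
        printf("P = %2d (%4d tiles): ~%.2f M compute cycles to move the image\n",
               P, P * P, mcycles);
    }
    return 0;   /* the cost grows with P: more tiles, more energy */
}
```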
Partitioning the Image onto PxQ Tiles
(a) Subimage per tile: p = N/P; q = N/Q
(b) Slice per tile: p = N/PQ; q = N
• Subimages of size "p" rows of pixels by "q" columns of pixels
• Equivalently: p×q = N²/(PQ) pixels
• Partitioning works only if 2N²/(PQ) < L2 cache size (a quick feasibility check follows)
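A quick feasibility check of that constraint, assuming the per-tile footprint is 2·N²/(PQ) bytes for the 1Kx1K 8-bit image (reading the factor of 2 as input plus working data is my assumption) against a 64 KB L2:

```c
/* Does each tile's share of the image fit in a 64 KB L2?
   The 2x footprint interpretation and the example (P,Q) pairs are assumptions. */
#include <stdio.h>

int main(void) {
    const long N = 1024, L2_BYTES = 64 * 1024;
    long dims[][2] = { {2,2}, {4,4}, {8,8}, {4,16}, {8,16} };  /* example P,Q */

    for (int i = 0; i < 5; i++) {
        long P = dims[i][0], Q = dims[i][1];
        long per_tile = 2 * N * N / (P * Q);   /* bytes per tile */
        printf("P=%ld Q=%ld: %ld bytes per tile -> %s\n",
               P, Q, per_tile, per_tile < L2_BYTES ? "fits" : "does NOT fit");
    }
    return 0;
}
```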
Ghost Cells
• Many steps need "neighboring cell values"
• Ghost cells: copies of neighboring cells from neighboring tiles
• Data exchanges cost energy
Total energy vs. P:
Partitioning   P=2       P=4       P=8       P=16        P=32
Subimage       149,760   299,520   599,040   1,198,080   2,396,160
Sliced          78,336    92,160   147,456     368,640   1,253,376
Explosive growth in energy as we grow the # of tiles!
Cache Access Energy for Filters
Filter energy   P=2 (4 tiles)   P=4 (16)   P=8 (64)   P=16 (256)   P=32 (1024)
Subimage        3.9E+07         3.9E+07    3.9E+07    4.0E+07      4.0E+07
Sliced          3.2E+07         3.2E+07    3.2E+07    3.2E+07      3.2E+07
• Above #s are in units of equivalent compute cycles, summed over all tiles
• Equivalent to about 30-40 compute cycles of energy per pixel
• Remember: need to add in the ghost-cell energy
Rockster Conclusions
• Computation is cheap energy-wise
• Accessing local memory is expensive
• Accessing DRAM global memory is huge!
• The way you partition data affects energy
• Some odd, counter-intuitive effects: some energy grows as the # of tiles increases
New DARPA Program: UHPC
• UHPC = Ubiquitous High Performance Computing
• Foci on "Extreme Scale" (1000X) technology: power consumption, cyber resiliency, productivity
• Target research machines by 2018: Petascale in a rack; Terascale module
• Challenge problems: streaming sensor data; a large dynamic graph-based informatics problem; a decision-class problem that encompasses search, hypothesis testing, and planning; 2 problems from the HPCMP or CREATE suites
• 4 teams: Intel, Nvidia, Tilera, Sandia National Labs
Overall Conclusions
• The world has gone to multi-core to continue Moore's Law
• Pushing performance another 1000X will be tough
• The major problem is in energy
• And that energy is in memory & interconnect: for both scientific and embedded applications, and even for data search applications
• We need to begin rearchitecting to reflect this: we need alternative metrics to "flops"; DON'T MOVE THE DATA!
• We need to begin redesigning our architectures & codes for energy minimization – now!