Major Challenges to Achieve Exascale Performance · 1 Major Challenges to Achieve Exascale...

1

Major Challenges to Achieve

Exascale Performance

Shekhar Borkar

Intel Corp.

April 29, 2009

Acknowledgment: Exascale WG sponsored by Dr. Bill Harrod, DARPA (IPTO)

2

Outline

Exascale performance goals

Major challenges

Potential solutions

Paradigm shift

Summary

3

Performance Roadmap

1.E-04

1.E-02

1.E+00

1.E+02

1.E+04

1.E+06

1.E+08

1960 1970 1980 1990 2000 2010 2020

GF

LO

P

MFLOP

GFLOP

TFLOP

PFLOP

EFLOP

12 Years 11 Years 10 Years

4

From Giga to Exa, via Tera & Peta

1

10

100

1000

1986 1996 2006 2016

Re

lati

ve

Tr

Pe

rfo

rma

nc

e

G

Tera

Peta

30X 250X

0.001

0.01

0.1

1

1986 1996 2006 2016

Re

lati

ve

En

erg

y/O

p G

Tera

Peta

5V

Vcc scaling

1.E+00

1.E+02

1.E+04

1.E+06

1.E+08

1986 1996 2006 2016

G

Tera

Peta

36X

Exa

4,000X

Concurrency

2.5M X

Transistor Performance

1.E+00

1.E+02

1.E+04

1.E+06

1.E+08

1986 1996 2006 2016

G

Tera

Peta

80X

Exa

4,000X

1M X

Power

5

Building with Today’s Technology

200pJ per FLOP200W

150W

100W

100W

4450W

Decode and controlTranslations…etcPower supply lossesCooling…etc

5KW

Compute

Memory

Com

Disk

TFLOP Machine today

10TB disk @ 1TB/disk @10W

0.1B/FLOP @ 1.5nJ per Byte

100pJ com per FLOP

KW Tera, MW Peta, GW Exa?

6

The Power & Energy Challenge

200W

150W

100W

100W

4550W

5KW

Compute

Memory

Com

Disk

TFLOP Machine today

5W2W

~5W~3W5W

TFLOP Machine thenWith Exa Technology

~20W

7

1.5mm

2.0

mm

FPMAC0

Router

IMEM

DMEMRF

RIB

CLK

FPMAC1

MSINT

Glo

ba

l clk

sp

ine

+ c

lk b

uff

ers

1.5mm

2.0

mm

2.0

mm

FPMAC0

Router

IMEM

DMEMRF

RIB

CLK

FPMAC1

FPMAC0

Router

IMEM

DMEMRF

RIB

CLK

FPMAC1

MSINT

Glo

ba

l clk

sp

ine

+ c

lk b

uff

ers

Starting Point: Optimistic yet Realistic2

1.7

2m

m

12.64mmI/O Area

I/O Area

PLL

single tile

1.5mm

2.0mm

TAP

21

.72

mm

12.64mmI/O Area

I/O Area

PLL

single tile

1.5mm

2.0mm

TAP

100 MillionTransistors

275mm2Die Area

3mm2Tile area

1248 pin LGA, 14 layers, 343 signal pins

Package

1 poly, 8 metal (Cu)Interconnect

65nm CMOS ProcessTechnology

100 MillionTransistors

275mm2Die Area

3mm2Tile area

1248 pin LGA, 14 layers, 343 signal pins

Package

1 poly, 8 metal (Cu)Interconnect

65nm CMOS ProcessTechnology

80 Core TFLOP Chip

8

Scaling Assumptions

Technology

(High Volume)

45nm

(2008)

32nm

(2010)

22nm

(2012)

16nm

(2014)

11nm

(2016)

8nm

(2018)

5nm

(2020)

Transistor density 1.75 1.75 1.75 1.75 1.75 1.75 1.75

Frequency scaling 15% 10% 8% 5% 4% 3% 2%

Vdd scaling -10% -7.5% -5% -2.5% -1.5% -1% -0.5%

Dimension & Capacitance 0.75 0.75 0.75 0.75 0.75 0.75 0.75

SD Leakage scaling/micron 1X Optimistic to 1.43X Pessimistic

65nm Core + Local Memory

Memory 0.35MB

5mm2 (50%)

DP FP Add, MultiplyInteger Core, RF

Router

5mm2 (50%)

10mm2, 3GHz, 6GF, 1.8W

8nm Core + Local Memory

Memory 0.35MB0.17mm2 (50%)

DP FP Add, MultiplyInteger Core, RF

Router0.17mm2 (50%)

0.34mm2, 4.6GHz, 9.2GF, 0.24 to 0.46W

~0.6mm

9

Processor Chip

2018, 8nm technology node20mm

400mm2

20mm

Cores/Module 1150

Total Local Memory 400 MB

Frequency 4.61 GHz

Peak performance 10.6 TF

Power 300 - 600W

Energy efficiency 34 - 18 GF/Watt

30-60 MW for Exascale

0

5000

10000

15000

20000

65nm 45nm 32nm 22nm 16nm 11nm 8nm 5nm

Ch

ip P

erf

orm

an

ce (

GF

)

GFLOPs

0

100

200

300

400

500


Ch

ip P

ow

er

(W)

Power(W)

10

Processor Node

128 GB 128 GB

256GB/s 64b

128 GB 128 GB

Peak performance 10.6 TF

Total DRAM Capacity 512GB

Total DRAM BW 1TB/s (0.1B/FLOP)

DRAM Power 800 W*

Total Power 1100 - 1400W

Energy efficiency 9.5 - 8 GF/Watt

*Assumes 5% Vdd scaling each technology generation140 pJ energy consumed per accessed bit

110-140 MW for Exascale

11

Node Power Breakdown

DRAM

Fabric

Compute

10 TF, ~ 1KWAggressive voltage scaling

Hierarchical heterogeneous topologies

Efficient signalingRepartitioning

12

Voltage Scaling

0

0.2

0.4

0.6

0.8

1

0.3 0.5 0.7 0.9

Vdd (Normal)

No

rma

lize

d

0

2

4

6

8

10

Freq

Total Power

Leakage

Energy Efficiency

When designed to voltage scale

13

Energy Efficiency with Vdd Scaling

0

20

40

60

80

100

120

140

160


En

erg

y E

ffic

ien

cy (

GF

/W)

Vdd

0.7x

0.5x

~3X Compute energy efficiency with Vdd Scaling

14

On-die Mesh Interconnect

On-die network (mesh) power is highWorse if link width scales up each generation

20mm45nm

20mm32nm

20mm22nm

20mm16nm

70 Cores 123 Cores 214 Cores 375 Cores

0

100

200

300

400

500


Ch

ip P

ow

er

(W)

Network

Compute

15

Mesh—RetrospectiveBus: Good at board level, does not extend well

• Transmission line issues: loss and signal integrity, limited frequency

• Width is limited by pins and board area

• Broadcast, simple to implement

Point to point busses: fast signaling over longer distance

• Board level, between boards, and racks

• High frequency, narrow links

• 1D Ring, 2D Mesh and Torus to reduce latency

• Higher complexity and latency in each node

Hence, emergence of packet switched network

But, pt-to-pt packet switched network on a chip?

16

Interconnect Delay & Energy

1

10

100

1000

10000

0 5 10 15 20

Length (mm)

Dela

y (

ps)

0

0.5

1

1.5

2

pJ/B

it

65nm, 3GHz

Router Delay

17

Bus—The Other Extreme…Issues:Slow, < 300MHzShared, limited scalability?

Solutions:Repeaters to increase freqWide busses for bandwidthMultiple busses for scalability

Benefits:Power?Simpler cache coherency

Move away from frequency, embrace parallelism

18

Hierarchical & Heterogeneous

Bus

C C

C C

Bus to connect over short distances

Bus

C C

C C

Bus

C C

C C

Bus

C C

C C

Bus

C C

C C

2nd Level Bus

Hierarchy of BussesOr hierarchical circuit and packet switched networks

R R

R R

19

Revise DRAM Architecture

Page Page Page

RAS

CAS

Activates many pagesLots of reads and writes (refresh)Small amount of read data is usedRequires small number of pins

Traditional DRAM New DRAM architecture

Page Page PageAddr

Addr

Activates few pagesRead and write (refresh) what is neededAll read data is usedRequires large number of IO’s (3D)

Energy cost today:~175 pJ/bit

Signaling

DRAM

Array

M Control

20

Data Locality

Core-to-coreCommunication on the chip:~10pJ per Byte

Chip to memoryCommunication:~1.5nJ per Byte~150pJ per Byte

Chip to chipCommunication:~100pJ per Byte

Data movement is expensive—keep it local(1) Core to core, (2) Chip-to-chip, (3) Memory

21

Impact of Exploding Parallelism

100

150

200

250

300

350

400

450


Millio

n C

ore

s/E

FL

OP

1x Vdd

0.7x Vdd

0.5x Vdd

Almost flat because Vdd close to Vt

4X increase in the number of cores (Parallelism)

Increased communication and related energy

Increased HW, and unreliability

1. Strike a balance between Com & Computation2. Resiliency (Gradual, Intermittent, Permanent faults)

22

Road to Unreliability?

From Peta to Exa Reliability Issues

1,000X parallelism More hardware for something to go wrong

>1,000X intermittent faults due to soft errors

Aggressive Vcc scaling

to reduce power/energy

Gradual faults due to increased variations

More susceptible to Vcc droops (noise)

More susceptible to dynamic temp variations

Exacerbates intermittent faults—soft errors

Deeply scaled

technologies

Aging related faults

Lack of burn-in?

Variability increases dramatically

Resiliency will be the corner-stone

23

ResiliencyFaults Example

Permanent faults Stuck-at 0 & 1

Gradual faults Variability

Temperature

Intermittent faults Soft errors

Voltage droops

Aging faults Degradation

Faults cause errors (data & control)

Datapath errors Detected by parity/ECC

Silent data corruption Need HW hooks

Control errors Control lost (Blue screen)

Minimal overhead for resiliency

Circuit & Design

Microarchitecture

Microcode, Platform

Programming system

Applications Error detectionFault isolationFault confinementReconfigurationRecovery & Adapt

System Software

24

Needs a Paradigm Shift

Evaluate each (old) architecture feature with new priorities

Single thread performance Frequency

Programming productivity Legacy, compatibility

Architecture features for productivity

Constraints (1) Cost

(2) Reasonable Power/Energy

Throughput performance Parallelism

Power/Energy Architecture features for energy

Simplicity

Constraints (1) Programming productivity

(2) Cost

Past and present priorities—

Future priorities—

25

Summary

Von-Neumann computing & CMOS technology

(nothing else in sight)

Voltage scaling to reduce power and energy

• Explodes parallelism

• Cost of communication vs computation—critical balance

• Resiliency to combat side-effects and unreliability

Programming system for extreme parallelism

System software to harmonize all of the above

Major Challenges to Achieve Exascale Performance · 1 Major Challenges to Achieve Exascale...

Documents

Transcript of Major Challenges to Achieve Exascale Performance · 1 Major Challenges to Achieve Exascale...