Energy at Exaflops

SC09: Exa Panel. Energy at Exaflops. Peter M. Kogge, McCourtney Prof. of CSE, Univ. of Notre Dame; IBM Fellow (retired)

Transcript of Energy at Exaflops

Page 1: Energy at Exaflops

SC09: Exa Panel

Energy at Exaflops

Peter M. Kogge
McCourtney Prof. of CSE, Univ. of Notre Dame
IBM Fellow (retired)

Page 2: Energy at Exaflops

The Key Take Away

You can hide the latency… But You Can’t Hide the Energy

Page 3: Energy at Exaflops

“Business as Usual” Energy Projections

[Figure: projected energy per flop over time. Annotations on the chart:]
•  “Bend” when Voltage stopped dropping
•  Tech As Usual
•  1 exaflop @ 30MW = 33pJ/Flop

Page 4: Energy at Exaflops

Exascale Study Strawman

734 “Simple” cores per chip in 2013 technology

Memory Hierarchy
•  Register File
•  Level 1 Cache
•  Level 2/3 Cache
•  On-Module DRAM Memory
    –  16x1GB
•  Off-Module Memory

[Figure panel (b): Thru via chip stack]

Page 5: Energy at Exaflops

But At What Memory Capacity?

How many useful apps are down here?

Page 6: Energy at Exaflops

Strawman Power Distribution

[Figure panels: (a) Overall System Power, (b) Memory Power, (c) Interconnect Power]

Growing DRAM capacity 30X at least doubles overall Memory Power

Assumed a 60X per bit decrease in access energy

Page 7: Energy at Exaflops

A Budget to Match Panel Goals

•  30MW for 1 Exaflop => 33pJ/Flop
•  20% lost to leakage => ~26pJ/flop
•  Using the report’s 2013 technology base:
    –  ~10.6pJ for FPU operation
    –  ~14.6pJ for L1 Instr cache access
    –  ~1.8pJ for each Register File access
•  Strawman dataflow per flop op: ~19.7pJ
    –  1 instruction fetch amortized over 4 FPU ops
    –  Each FPU op takes 2 RF reads & 1 RF write
    –  (An FMA would require another RF read)
•  This leaves ~6 pJ for memory access/flop
•  Moving to 2015 technology reduces per op to ~10pJ
    –  Leaving 16pJ for memory
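The budget arithmetic on this slide can be reproduced in a few lines. The sketch below is only an illustration: the constants are the figures quoted above, while the function and variable names are made up for this example. It takes the slide's 33 pJ/flop as given (dividing 30 MW by 10^18 flop/s directly gives 30 pJ/flop) and does not round intermediate values, so its results differ slightly from the slide's rounded ~26 pJ and ~6 pJ.

```python
# Per-flop energy budget, using the figures quoted on the slide (all in pJ).
# Names are illustrative; only the numeric constants come from the slide.
TOTAL_PJ_PER_FLOP = 33.0   # slide's figure for 1 exaflop at 30 MW
LEAKAGE_FRACTION = 0.20    # share of the budget lost to leakage

FPU_OP_PJ = 10.6           # one FPU operation (2013 technology base)
L1_IFETCH_PJ = 14.6        # one L1 instruction cache access
RF_ACCESS_PJ = 1.8         # one register file access

FLOPS_PER_IFETCH = 4       # one instruction fetch amortized over 4 FPU ops
RF_ACCESSES_PER_FLOP = 3   # 2 RF reads + 1 RF write per FPU op


def dataflow_pj_per_flop() -> float:
    """Energy of the strawman dataflow for one flop, excluding memory access."""
    return (FPU_OP_PJ
            + L1_IFETCH_PJ / FLOPS_PER_IFETCH
            + RF_ACCESSES_PER_FLOP * RF_ACCESS_PJ)


def memory_budget_pj(per_op_pj: float) -> float:
    """What is left per flop for memory access after leakage and the op itself."""
    usable = TOTAL_PJ_PER_FLOP * (1.0 - LEAKAGE_FRACTION)
    return usable - per_op_pj


if __name__ == "__main__":
    core = dataflow_pj_per_flop()
    print(f"dataflow per flop:             {core:.2f} pJ")
    print(f"memory budget, 2013 base:      {memory_budget_pj(core):.2f} pJ")
    print(f"memory budget, 2015 (~10 pJ):  {memory_budget_pj(10.0):.2f} pJ")
```

The dataflow term works out to 10.6 + 14.6/4 + 3 × 1.8 = 19.65 pJ, which matches the slide's ~19.7 pJ.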

Page 8: Energy at Exaflops

The (Energy) Path to Memory Access

[Flowchart: the steps an instruction's memory reference passes through, branching on Hit/Miss at each cache level and on Local/Non-local at memory. Box labels, grouped by branch:]
•  Instruction: Access TLB, L1 Tag
    –  Hit: Read L1 (1W), Write RF
    –  Miss: Move to L2/L3, Access Tag
•  L2/L3:
    –  Hit: Read 4W, Move 4W to L1, Write 4W to L1
    –  Miss: Access Directory
•  Local: Move to Port, Chip-Chip (TSV), Move to DRAM Block, Read 4W DRAM, Move to Port, Chip-Chip (TSV)
•  Non-local: Move to Port, On Module Chip-Chip, Thru Router, Module-Module, Rack-Rack, Thru Router, Thru Router, Rack-Rack, Thru Router, Module-Module, On Module Chip-Chip, Move to Port, Chip-Chip (TSV), Move to DRAM Block

Page 9: Energy at Exaflops

The Path from (Global) Memory

[Flowchart: the return path of one word from global memory to the register file. Box labels:]
•  Access Directory, Move to Port, Read 1W DRAM, Move to Port, Chip-Chip (TSV)
•  Move to Port, On Module Chip-Chip, Thru Router, Module-Module, Rack-Rack, Thru Router
•  Thru Router, Rack-Rack, Thru Router, Module-Module, On Module Chip-Chip
•  Move to RF, Write RF

AND THIS DOESN’T ACCOUNT FOR TLB MISSES!!!

Page 10: Energy at Exaflops

Tracking the Energy

Remember we have only 6-16pJ per flop instruction

Page 11: Energy at Exaflops

What’s an “Average” Memory Reference?

And Interconnect is 85% of Energy!!!

By reducing DRAM power by 60X, DRAM becomes 8% of the average, but that is still 10X the 6pJ budget

706 pJ per access, assuming:
•  80% L1 hit
•  90% L2 hit
•  On L2 Miss:
    –  60% Local
    –  40% Global
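The hit rates above determine how often a reference is serviced at each level, and the "average" reference is the probability-weighted sum of the per-level path energies. The sketch below shows that weighting. The function names are invented for this example, and the per-level energies in the usage section are placeholders that do NOT come from the slides, which only give the 706 pJ total.

```python
def service_probabilities(l1_hit=0.80, l2_hit=0.90, local=0.60):
    """Fraction of references serviced at each level, from the slide's hit rates."""
    l1_miss = 1.0 - l1_hit
    l2_miss = l1_miss * (1.0 - l2_hit)
    return {
        "L1": l1_hit,
        "L2/L3": l1_miss * l2_hit,
        "local": l2_miss * local,
        "global": l2_miss * (1.0 - local),
    }


def average_energy_pj(probs, path_energy_pj):
    """Expected energy per reference.

    path_energy_pj[level] is the full cost of a reference that ends up being
    serviced at that level, including every step it took to get there.
    """
    return sum(probs[level] * path_energy_pj[level] for level in probs)


if __name__ == "__main__":
    probs = service_probabilities()
    for level, p in probs.items():
        print(f"{level:7s} {p:.3f}")

    # Placeholder energies for illustration only; NOT values from the slides.
    placeholder_pj = {"L1": 1.0, "L2/L3": 10.0, "local": 100.0, "global": 1000.0}
    print(f"average: {average_energy_pj(probs, placeholder_pj):.1f} pJ")
```

With the slide's rates, only 0.8% of references go global and 1.2% go to local DRAM, so the rare long paths have to be very expensive to dominate the average the way the slide reports.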

Page 12: Energy at Exaflops

Access vs Reach

Curve Fit = 626 × GB^0.2
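To get a feel for the fitted curve, it can be evaluated at a few capacities. This is a minimal sketch that assumes the fit gives energy per access in pJ as a function of reachable memory in GB; the transcript does not state the units, and the function name is made up for this example.

```python
def fitted_access_energy(reach_gb: float) -> float:
    """Evaluate the slide's curve fit, 626 * GB^0.2.

    Assumption: the result is pJ per access and reach_gb is the amount of
    memory reachable, in GB. The units are not stated in the transcript.
    """
    if reach_gb <= 0:
        raise ValueError("reach_gb must be positive")
    return 626.0 * reach_gb ** 0.2


if __name__ == "__main__":
    for gb in (1, 32, 1024):
        print(f"{gb:5d} GB -> {fitted_access_energy(gb):8.1f}")
```

Because the exponent is 0.2, the fitted value doubles each time the reach grows by 32X (2^5), so it rises slowly but never flattens.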

Page 13: Energy at Exaflops

What Does This Tell Us?

•  Cannot afford ANY memory references
    –  One out of every 100+ instructions can access memory
    –  Even with 2015 Technology only have a 16pJ budget
        •  Make it only 1 out of every 30+ instructions
•  There are a lot more energy sinks than you think
•  Cost of Interconnect Dominates
•  Must design for on-board or stacked DRAM
•  We need to redesign the entire access path:
    –  Alternative memory technologies: reduce access cost
    –  Alternative packaging costs: reduce bit movement cost
    –  Alternative transport protocols: reduce # bits moved
    –  Alternative execution models: reduce # of movements

Page 14: Energy at Exaflops

Our Study Wasn’t Creative Enough!

Page 15: Energy at Exaflops

And This Means You Can Reference Memory Every ….

• For 100% L1 Hit, once every 6.5 Flops

• For an 80% L1 Hit and 100% L2/L3 Hit, once every 18 Flops

• For an 80% L1 Hit, 90% L2/L3 Hit, and all the rest is local, once every 35 Flops

• For an 80% L1 Hit, 90% L2/L3 Hit, 60% local, and 40% Global, once every 118 Flops
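These intervals appear consistent with dividing the average energy of a memory reference by the per-flop memory budget: 706 pJ against a 6 pJ budget gives roughly one reference every 118 flops. The sketch below shows that relationship under that assumption. The function names are invented here, and the per-case energies it back-computes are inferences from the slide's rounded figures, not numbers stated in the talk.

```python
def flops_per_reference(avg_access_pj: float, budget_pj_per_flop: float) -> float:
    """Flops that must execute per memory reference to stay within budget."""
    return avg_access_pj / budget_pj_per_flop


def implied_access_energy_pj(flops_between_refs: float,
                             budget_pj_per_flop: float) -> float:
    """Average access energy implied by a given reference interval."""
    return flops_between_refs * budget_pj_per_flop


if __name__ == "__main__":
    # 706 pJ average reference against the 6 pJ (2013) and 16 pJ (2015) budgets.
    print(f"6 pJ budget:  every {flops_per_reference(706.0, 6.0):.1f} flops")
    print(f"16 pJ budget: every {flops_per_reference(706.0, 16.0):.1f} flops")

    # Back-computed from the slide's intervals, assuming the 6 pJ budget.
    for flops in (6.5, 18, 35, 118):
        pj = implied_access_energy_pj(flops, 6.0)
        print(f"once every {flops:>5} flops -> ~{pj:.0f} pJ per reference")
```

Read this way, even a reference that always hits in L1 costs several flops' worth of memory budget.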

Page 16: Energy at Exaflops

Color Code

•  Red: FPU

•  Orange: other logic, tag access, RF

•  Dark Green: Chip-chip (TSV)

•  Light Green: on chip transport

•  Purple: rack-rack

•  Light Purple: Chip-Chip (non TSV)

•  Dark blue: DRAM access

•  Light Blue: cache data access