POWER TM Processor Design & Methodology Directions

35
© 2012 IBM Corporation POWER TM Processor Design & Methodology Directions Joshua Friedrich – Senior Technical Staff Member, IBM Server & Technology Group June 6, 2012

Transcript of POWER TM Processor Design & Methodology Directions

Page 1: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation

POWERTM Processor Design & Methodology Directions

Joshua Friedrich – Senior Technical Staff Member, IBM Server & Technology Group

June 6, 2012

Page 2: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation2

CMOS Microprocessor Trends, The First ~25 Years ( Good old days )Log(P

erf

orm

an

ce)

Single Thread

More active transistors, higher frequency

20051980

Page 3: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation3

Characteristics of Single Thread Era

0

1

2

3

4

5

6

POWER4 POWER5 POWER6

Fre

qu

en

cy

(G

Hz

)

0

100

200

300

400

500

POWER4 POWER5 POWER6

TX

s p

er

co

re

Dennard Scaling

Exponential Frequency Growth

Optical Scaling / Node Migration

Expanding uArch Complexity

Page 4: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation4

Single Thread Era Icon: POWER6

� 5+ GHz operation, >790M transistors, 341mm2 die

� 65nm SOI with 10 levels of Cu interconnect

� 2 superscalar, SMT cores with 7 instruction dispatch & 9 execution units

� 8 MB Level-2 cache + support for 32MB L3

� 2 memory controllers, Two-tier SMP Fabric

� Same pipeline depth & power at 2x frequency versus POWER5

2 MB L2

2 MB L2

2 MB L2

2 MB L2

L2 Dir

L2 Dir

Mem.Cntl.

Mem.Cntl.

SMP Fabric

L

3

CO

NTROL

LER

LSU

IFU /IDU

VMX

FXU

BF

U

DFU

RU

Core 1

Page 5: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation5

Single Thread Era EDA: Transistor Analysis & Optimization

EinsTLT: Static timing analysis of complex circuits

EinsTuner: Transistor Level timing optimization

clkin

w2

w3

w0

fbk

Cycle

fbk = 0

evaluate precharge

fbk = 1

w2 = 11st Timing

2nd Timing

w3_int

0

200

400

600

800

1000

1200

-6 -4 -2 0 2 4 6 8

Slack (ps)# o

f p

ath

s

Pre-Tuning

Post-Tuning

Page 6: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation6

End of an Era: The Power Wall

0.010.11

0.001

0.01

0.1

1

10

100

1000

Gate Length (microns)

Active Power

Passive Power

1994 2004

Po

wer

Den

sit

y (

W/c

m2) Air Cooling limit

Source: B. Meyerson (IBM) Source: S. Borkar (Intel)

Inability to scale Tox & lower voltage resulted in a power wall for single thread performance

Page 7: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation7

Microprocessor TrendsLog(P

erf

orm

ance)

More active transistors, higher frequency

Single Thread

More active transistors, higher frequency

20051980

Page 8: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation8

CMOS Scaling Continued

0%

20%

40%

60%

80%

100%

180nm 130nm 90nm 65nm 45nm 32nm

Gain by Traditional Scaling Gain by Innovation

0%

20%

40%

60%

80%

100%

180nm 130nm 90nm 65nm 45nm 32nm

Gain by Traditional Scaling Gain by Innovation

WL

BOX

DeepTrench

Cap

BL

Node

PassingWL

4.0

um

18fFStorage Node

WL

3fF BL (32 Cells)

Node

High-K Metal Gate

Page 9: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation9

Microprocessor TrendsLog(P

erf

orm

ance)

Multi-Core

More active transistors, higher frequency

Single Thread

More active transistors, higher frequency

20051980

Page 10: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation10

POWER Processors Began the Multi-Core / Multi-Thread Era

Power 4

2001

Introduces Dual core

Power 5

2004

Dual Core

Introduces SMT (4 threads)

Power 6

2007

Dual Core – 4 threads

Enhances SMT Efficiency

FX0

FX1

FP0

FP1

LS0

LS1

BRX

CRL

Superscalar, out of order

Cycles

FX0

FX1

FP0

FP1

LS0

LS1

BRX

CRL

2 Way SMT

FX0

FX1

FP0

FP1

LS0

LS1

BRX

CRL

2 Way SMT (Enhanced)

Thread 1 Executing

Thread 0 Executing

No Thread Executing

Page 11: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation11

POWER7 Processor Chip

� 567mm2, 45nm SOI w/ eDRAM

� 1.2B transistors– Equivalent function of 2.7B

(eDRAM)

� Eight processor cores w/ 4 way SMT– 32 threads / chip

� 32MB on chip eDRAM shared L3

� Dual DDR3 Memory Controllers w/ 100GB/s sustained Memory bandwidth

� Scalability up to 32 Sockets– 360GB/s SMP bandwidth/chip– 20,000 coherent operations in flight

FX0FX1FP0FP1LS0LS1BRXCRL

4 Way SMT

Thread 1 Executing

Thread 0 Executing

No Thread Executing

Thread 3 Executing

Thread 2 Executing

Page 12: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation12

Multi-core Era EDA: Power Analysis & Reduction

Unstructured Clock Buffer & Latch

Structured Clocking Construct

15% clock power reduction

s0

s1

s2

<0.14 , 0.75>

<0.13 , 0.19>

<0.14 , 0.25>

<0.14 , 0.75> <0.14 , 0.25>

BEFORE

Format: <SF,SP>

s0

s1

s2

< 0.06 , 0.06 >

<0.14 , 0.25>

<0.14 , 0.75><0.14 , 0.75>

AFTER

<0.14 , 0.75>∆Power < 0

5% data power reduction

Page 13: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation13

Multi-core Era EDA: Cycle Simulation

0

50

100

150

200

250

300

350

Runtime Memory Use

CycleSim

EventSim

25x

10x

H a r d w a r e A c c e le r a t o r T h r o u g h p u t

0

5

1 0

1 5

2 0

2 5

3 0

3 5

4 0

4 5

5 0

W e e k s

Tri

llio

n S

im C

ycle

s

0

2 0

4 0

6 0

8 0

1 0 0

1 2 0

Min

ute

s r

eal

tim

e @

3.5

GH

z

Mostly core only model Larger chip Models

(� slower sim speed)

Introduction of next

Generation AcceleratorPrior Gen Project

High water mark

Page 14: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation14

Multi-Core Era Limiters`

1

10

100

90 65 45 32 22 14 10

Technology Node

log

(p

erf

orm

an

ce

)

Ideal Growth

Likely Multi-Core Path

Technology complexity & rising costs

Power

SW parallelism

Socket BW

Page 15: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation15

Multi-core evolution

Compute Throughput Potential

Multi-core evolution

Socket Throughput Limitation(Power, memory bandwidth)

Need to Amplify Effective

Socket BandwidthTo Achieve Potential

POWER’s Multi-Core Advantage

Page 16: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation16 Multi-core evolution

Compute Throughput Potential

Socket Throughput Limitation(Power, memory bandwidth)

EDRAM = large, low power cache

High bandwidth memory buffer

Low-Power Off-Chip SignalingTechnology

Coherence Innovation to

minimize socket-to-socket communication

POWER Architecture & Technology Will Extend Multi-Core Gains

Page 17: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation17

Microprocessor TrendsLog(P

erf

orm

ance)

More active transistors, higher frequency

Multi-Core

More active transistors, higher frequency

Single Thread

More active transistors, higher frequency

2015(?)20051980

Page 18: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation18

Microprocessor Trends: The Next EraLog(P

erf

orm

ance) Designer-Driven Innovation

(System-level tech, Heterogeneity, HW-SW Opt.)

More active transistors, higher frequency

Multi-Core

More active transistors, higher frequency

Single Thread

More active transistors, higher frequency

2015(?)20051980

Page 19: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation19

Innovation Era: System Level Technologies

3D Stacking with Through Silicon Vias

FPGA Accelerators Flash Memory / SSDSilicon Photonics

Single Processor–Memory Socket

Page 20: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation20

Innovation Era: Heterogeneous Integration

uP

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

IO Hub

1/10GbE

Service Processor

PCIE

Slots

PCIE

Slots

PCIE

Slots

PCIE

Slots

PCIE

Slots

RAID Controller

Power ManagementuController

VRMs

Simple Server (2010)

uP

Single VRM

Disk

Simple Server (Future)

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

PCIE

Slots

PCIE

Slots

PCIE

Slots

PCIE

Slots

PCIE

Slots

10/40GbE

Internal VRMs

IO Subsystem

Disk

Service Subsystem

Page 21: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation21

Innovation Era: Hardware-Software Co-Optimization

Benefit of Specialized HW

Acceleration

0

20

40

60

80

100

Base w/ HW Accel

Encryption

Application

Profiling Hardware

Original program

Dynamically Optimized program

Software

Optimizer

ExecutionEngine

Dynamic Re-compilation with Profile Data

Specialized HW acceleration viable in many areas

•Graphics•Compression•Cryptography•More….

Page 22: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation22

Designer Time: Technology-Driven Eras vs. Innovation Era

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Designer

Time

ValidationTools

Implement

Plan

Innovate

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Designer

Time

ValidationTools

Implement

Plan

Innovate

Technology Focused Innovation Focused

Page 23: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation23

Manage Technology Complexity: Double Patterning

Need for Double / Triple Patterning

0

50

100

150

32nm 20nm Future

Technology Node

Pit

ch

(n

m)

Device Pitch

Single Exposure Limit

Metal Pitch

Double Patterning Limit

EUV?

Page 24: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation24

Manage Technology Complexity: Extreme Variability

% Delay Variation vs. Device Size

0

10

20

30

40

0 2.5 5

% Delay Variation vs. Operating Voltage

0

10

20

30

40

50

60

0.5 0.7 0.9 1.1 1.3

3

Single Patterned RC Distribution

Double Patterned RC Distribution

Variability grows rapidly w/ each technology generation.

Increasing variation as devices shrink

Increasing variation as voltages drop

Page 25: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation25

Productivity & TAT: Merge ASIC Productivity with Custom Performance

� Custom Processor Methodology

–Manual, custom design

–Tool support for Array / analog

–Transistor level analysis

–Clock mesh (low skew)

–Able to engineer wires & placement

–Small macros

� ASIC Methodology

–Synthesis Dominated

–Blackbox array / analog

–Gate level analysis

–Clock trees (high skew)

–Auto-place / auto-route only

–Large blocks

Custom performance w/ ASIC productivity

Page 26: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation26

Productivity: Reduce Custom Design

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

# of Customs over Time

custom-like data flow alignment

91% density

Each color indicates cells associated w/ a different logical bit.

>10x reduction over 5 designs

Datapath Synthesis Results

Page 27: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation27

Productivity: Reduce # of Design Partitions

60 logic macros, 25 customs, 14 unique arrays/RFs

1 macro, 0 customs, 9 unique arrays / RFsReduced area & power; equal cycle time

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

# of Macros over Time

Target

Page 28: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation28

TAT: Gate-Level Analysis, Hierarchical Abstraction & Multi-Threading

0

50

hrs.

Base Cleanup Coarse

Parallelism

Hierarchical

abstracts

Multi-

threading

Projected Chip Timing Runtime

�Fast macro and global analysis tools allow designers to iterate more quickly resulting in reduced implementation time

�Gate level analysis provides orders-of-magnitude speedups over transistor level analysis

�Hierarchical abstraction & multi-threading can provide 5-10x gains in chip level analysis

Arb

itra

ry U

nit

s

Runtime Accuracy

Macro Signoff Runtime

TX Level

Gate Level

Page 29: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation29

Productivity: Retiming

�Significant designer effort spent in optimizing cycle boundaries

�Retiming enables synthesis to optimally locate latches to balance

timing/area/power

� Invention is required to seamlessly handle verification implications

.

.

.latch

Doesn’t meet cycle time

Area/Power too high

Optimal

Page 30: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation30

Productivity: Traditional Design Flow

HDL Specification•Function, test, pervasive, power, etc.

Synthesis

Physical Netlist

Model Build

Verification (all types)

Logic Designer

Subject Experts

requirements

Page 31: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation31

Productivity: Aspect-Oriented Design Flow

Functional HDL

Power Spec

Pervasive Spec

Synthesis

Physical Netlist

Test Spec.

Model Build

Functional Sim

Model Build

Pervasive Verification

DFT Analysis

Expert designers

Page 32: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation32

Productivity: Aspect-Oriented Design Flow

Functional HDL

Power Spec

Pervasive Spec

Synthesis

Physical Netlist

Test Spec.

Model Build

Functional Sim

Model Build

Pervasive Verification

DFT Analysis

Expert designers

Automation

Page 33: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation33

Multi-Core Era Integration: Homogeneous IP

Standard Input Spec(HDL, ABS, assertions)

Construction Tools

Methodology Compliant IP

Macro Signoff Tools

Abstraction Formats

Global Analysis

Homogenous Design

Limited # of development groups

Strictly enforced design styles / rules with full methodology support

Page 34: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation34

Innovation Era Integration: Heterogeneous IP

Standard Input Spec(HDL, ABS, assertions)

Construction Tools

Methodology Compliant IP

Signoff Tools

Abstraction Formats

Global Analysis

Heterogeneous Design

Multiple IP sources to obtain best-of-breed component designs

Limited control on design styles

Multiple entry points into design flow

Alternate Input Formats

Alternate Input Formats

Foreign (non-compliant) IP

Foreign (non-compliant) IP

Foreign abstractions

Foreign abstractions

Page 35: POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation35

Summary

�Moore’s law continues…..

�However, we will need MORE INNOVATION, not just more cores to leverage it

�EDA tools need to provide MORE PRODUCTIVITY to enable designer innovation