UW-Madison Computer Sciences Multifacet Group © 2010
Scalable Cores in Chip Multiprocessors
Thesis Defense, 2 November 2010
Dan Gibson
Executive Summary (1/2)
• “Walls & Laws” suggest future CMPs will need Scalable Cores– Scale Up for Performance (e.g., one thread)
– Scale Down for per-core energy conservation (e.g., many threads)
• Area 1: How to build efficient scalable cores.– Forwardflow, one scalable core– Overprovision rather than borrow
Executive Summary (2/2)
• Area 2: How to use scalable cores:
– Scale at fine granularity:
• Discover the most-efficient configuration
– Scale for multi-threaded workloads:
• Scale up for sequential bottlenecks, improving performance
• Scale down for unimportant executions, improving efficiency
– Use DVFS as a scalable-core proxy
Document Outline
1. Introduction
2. Extended Motivation
   1. Scalable Cores
   2. Background
   3. Related Work
3. Methods
4. Serialized Successor Representation
5. Forwardflow
6. Scalable Cores for CMPs
   1. Scalable Forwardflow
   2. Overprovisioning vs. Borrowing
   3. Power-Awareness
   4. Single-Thread Scaling
   5. Multi-Thread Scaling
   6. DVFS as a Proxy for Scalable Cores
7. Conclusions/Future Work/Reflections
A/B. Supplements
Of Course
Mostly Old Material: Recap
Mostly New Material: Talk Focus
TALK Outline…
If there’s time and interest
Hello, Software. I am a single x86 processor.
'80s–'00s: Single-Core Heyday
Core and Chip Microarchitecture Changed Enormously
386, 1985, 20MHz
486, 1989, 50MHz
P6, 1995, 166MHz
PIV, 2004, 3000MHz
[Plot: clock frequency (MHz, log scale, 10–10000) vs. year (1989–2004) for Intel, IBM, and AMD parts]
Clock Frequency Increased Dramatically
Hello, Software. I am still a single x86 processor.
Hitting the Power Wall
[Plot: thermal design power (W, log scale, 0.15–150) by product year (1971–2009), from the 4004 through the 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, Pentium MMX, Pentium II, Pentium III, Pentium 4, Pentium D, Core 2, and Core i7]
• Kneejerk Reactions:
• Reduce Clock Frequency (e.g., 3.0 GHz to 2.4-ish GHz)
• De-Emphasize Pipeline Depth (e.g., Pentium M)
• What about Performance?
Resource borrowed from Yasuko's WiDGET ISCA 2010 talk. One example data point represents a range of actual products.
Chip Multiprocessors (CMPs)
1. Can’t clock (much) faster…
2. Hard to make uArch faster…
Use Die Area for More Cores!
Hello, Software. I am TWO x86 processors.(And my descendants will have more…)
• “Fundamental Turn Toward Concurrency” [Sutter2005]
• Software must now change to see continued performance gains.
This Won’t Be Easy.
In 1965, Gordon Moore sketched out his prediction of the pace of silicon technology. Decades later, Moore’s Law remains true, driven largely by Intel’s unparalleled silicon expertise.
Copyright © 2005 Intel Corporation.
• Cost per Device Falls Predictably
– Density rises (devices/mm²)
– Device size shrinks
Rock, 65nm [JSSC2009] Rock16, 16nm [ITRS2007]
Moore’s Law in the Multicore Era
(If you want 1024 threads)
Or “Fell”
Amdahl’s Law
Parallel Runtime = (1 - f) + f/N
f = Parallel Fraction
N = Number of Cores
[Plot: runtime normalized to one core vs. parallel fraction f (0.00 to 0.99), for N = 8]
Sequential: Not Good
Partially-Parallel: OK
Highly-Parallel: Very Good
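Stated as code, the model above is just a few lines; a minimal sketch (the f values mirror the plot's axis):

#include <cstdio>

// Normalized runtime under Amdahl's Law: the serial fraction (1 - f)
// runs at unit speed; the parallel fraction f is split across N cores.
double amdahl_runtime(double f, int n) {
    return (1.0 - f) + f / n;
}

int main() {
    const int n = 8; // N = 8, as in the plot above
    for (double f : {0.00, 0.10, 0.25, 0.50, 0.75, 0.85, 0.90, 0.95, 0.99})
        std::printf("f = %.2f -> normalized runtime = %.3f\n",
                    f, amdahl_runtime(f, n));
    return 0;
}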
Utilization Wall (aka SAF)
• Simultaneously Active Fraction (SAF): Fraction of devices in a fixed-area design that can be active at the same time, while still remaining within a fixed power budget.
[Venkatesh2009]
[Plot: dynamic SAF vs. technology node (90nm, 65nm, 45nm, 32nm) for LP and HP devices; SAF falls at each node] [Chakraborty2008]
UTILIZATION
Architects Boxed In: Walls and Laws
• Power Wall (PW): Cannot clock much faster.
• Utilization Wall (UW): Cannot use all devices.
• Amdahl's Law (AL): Single threads need help.
– Not all code is parallel.
→ Scalable CMPs
POWER
AMDAHL
Scalable CMPs→Scalable Cores
• Scale UP for Performance
– Use more resources for more performance
– (i.e., 2 strong oxen)
• Scale DOWN to Conserve Energy
– Exploit TLP with many small cores
– (i.e., 1024 chickens)
If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?
-Attributed to Seymour Cray
Scalable Cores in CMPs
1. How to build a Scalable Core?
– Should be efficient
– Should offer a wide power/performance range
2. How to use Scalable Cores?
– Optimize single-thread efficiency
– Detect and ameliorate bottlenecks
This Thesis:
Area 1: Efficient Scalable Cores
Fear leads to Anger, Anger leads to Hate, Hate leads to Suffering
Naming Association
Broadcast Inefficiency
• Forwardflow Core Architecture
– Raise average ILP/MLP, not peak
– Efficient SRAMs, no CAMs
• Serialized Successor Representation (SSR)
– Use pointers instead of names
Basis for a Scalable Core Design
Area 2: Scalable Cores in CMPs
• How to scale cores:
– Overprovision each core?
– Borrow/merge cores?
• When to scale cores:
– For one thread?
– For many threads?
• How to continue:
– DVFS as a proxy for a scalable core
Outline
• Introduction: Scalable Cores
– Motivation (Why scale in the first place?)
– Definition
• Scalable Cores for CMPs
– How to scale:
• Dynamically-Scalable Core (Forwardflow)
• Overprovision or Borrow Resources?
– When to scale: Hardware Scaling Policies
• For single-thread efficiency
• For multi-threaded workloads
• Conclusions/Wrap-Up
Forwardflow (FF): A Scalable Core
• Forwardflow Core =
Frontend (L1-I, Decode, ARF) +
Distributed Execution Logic/Window (DQ) +
L1-D Cache
Scale Down: Use a Smaller Window
Scale Up: Use a Bigger Window
Window Scaling vs. Core Scaling
• FF: Only scales the instruction window
– Not width,
– Not registers,
– etc.
• How does window scaling scale the core?
– By affecting demand on the unscaled components (FE, L1-D)
– Analogous to Bernoulli's Principle
• Dynamic power of an unscaled component: P = α · C · V² · f
– The activity factor α captures demand; window scaling changes α, not C, V, or f
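To make the equation concrete, a minimal sketch (the function and the example numbers are illustrative, not measurements from the thesis):

// Dynamic power P = alpha * C * V^2 * f. Window scaling changes only
// the activity factor alpha of unscaled components like the frontend
// and L1-D; capacitance, voltage, and frequency stay fixed.
double dynamic_power(double alpha, double capacitance_f,
                     double vdd_volts, double freq_hz) {
    return alpha * capacitance_f * vdd_volts * vdd_volts * freq_hz;
}

// Example: scaling up raises frontend demand (say, alpha 0.20 -> 0.26),
// raising frontend dynamic power ~30% with C, V, and f unchanged.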
FF Dynamic Configuration Space
CONFIG | SOBRIQUET           | DESCRIPTION
F-32   | "Fully scaled down" | 32-entry instruction window, single issue (1/4 of a DQ bank group)
F-64   |                     | 64-entry instruction window, dual issue (1/2 of a DQ bank group)
F-128  | "Nominal"           | 128-entry instruction window, quad issue (one full DQ bank group)
F-256  |                     | 256-entry instruction window, "quad" issue (2 BGs)
F-512  |                     | 512-entry instruction window, "quad" issue (4 BGs)
F-1024 | "Fully scaled up"   | 1K-entry instruction window, "quad" issue (8 BGs)
Configuration
Component | Configuration
Mem. Cons. Model | Sequential Consistency
Coherence Protocol | MOESI Directory (single chip)
Store Issue Policy | Permissions prefetch at X
Frequency | 3.0 GHz
Technology | 32nm
Window Size | Varied by experiment
Disambiguation | NoSQ
Branch Prediction | TAGE + 16-entry RAS + 256-entry BTB
Frontend | 7-cycle predict-to-dispatch
L1-I Caches | 32KB, 4-way, 64B, 4-cycle, 2 proc. ports
L1-D Caches | 32KB, 4-way, 64B, 4-cycle LTU, 4 proc. ports, WI/WT, included by L2
L2 Caches | 1MB, 8-way, 64B, 11-cycle, WB/WA, private
L3 Cache | 8MB, 16-way, 64B, 24-cycle, shared
Main Memory | 8GB, 2 DDR2-like controllers (64 GB/s peak BW), 300-cycle latency
Interconnect | 2D mesh, 16B links
FF Scalable Core Performance
[Plot: runtime normalized to F-128 for Gmean, h264ref, and libquantum across F-32 through F-1024; one clipped bar reaches 3.2]
Mostly compute-bound (h264ref): not much scaling
Mostly memory-bound (libquantum): great scaling
FF Scalable Core Power
[Stacked-bar plot: power normalized to F-128, split into FE, DQ/ALU, MEM, and Static components, for F-32 through F-1024]
W.r.t. nominal (F-128):
Scale Up (8x window): +27% MEM power, +91% DQ/ALU power, +28% FE power
Scale Down (1/4 window): -32% MEM power, -54% DQ/ALU power, -39% FE power
FF Recap
• Forwardflow Scalable Core
– Scale Down: Use a Smaller Window
– Scale Up: Use a Bigger Window
More details on Forwardflow
Outline
• Introduction: Scalable Cores
– Motivation (Why scale in the first place?)
– Definition
• Scalable Cores for CMPs
– How to scale:
• Dynamically-Scalable Core (Forwardflow)
• Overprovision or Borrow Resources?
– When to scale: Hardware Scaling Policies
• For single-thread efficiency
• For multi-threaded workloads
• Conclusions/Wrap-Up
Overprovisioning vs. Borrowing
• Scaling core performance means scaling core resources
– From where can a scaled core acquire more resources?
• Option 1: Overprovision All Cores
– Every core can scale up fully using a core-private resource pool
• Option 2: Borrow From Other Cores
– Cores share resources with neighbors
What Resources?
• Forwardflow:
– Resources = DQ bank groups (i.e., window space, functional units)
• Simple Experiment:
– Overprovision: each core has 8 BGs, enough for F-1024. What is the area cost?
– Borrow: each core has 4 BGs, enough to scale to F-512, and borrows neighbors' BGs to reach F-1024. What is the performance cost?
Per-Core Overprovisioning
[Floorplan at 32nm: an overprovisioned tile (FE, L1-D, L2, L3 bank) measures 8.96mm x 4.15mm = 37.2mm²; the overprovisioned CMP measures 17.9mm x 16.6mm = 298mm²]
Scale Up: Activate More Resources
Resource Borrowing
[Floorplan at 32nm: a borrowing tile measures 8.31mm x 4.15mm = 34.5mm²; the borrowing CMP measures 16.6mm x 16.6mm = 276mm²]
Scale Up: Borrow Resources from Neighbor
Area Cost (32nm), borrowing vs. overprovisioning:
Per-Core: 12.3mm² vs. 15.6mm² (+27%)
Per-Tile: 34.5mm² vs. 37.2mm² (+8%)
Per-CMP: 276mm² vs. 298mm² (+7%)
Performance Cost of Borrowing
• Borrowing slower?
– Maybe not: comparable wire delay (in this case)
– Maybe: crosses a physical core boundary
• Global vs. local wiring?
• Cross a clock domain?
• Simple Experiment (32nm):
– 2-cycle lag crossing the core boundary
– Slows inter-BG communication
– Slows dispatch
A Loose Loop
[Plot: runtime normalized to overprovisioned F-1024 (F-1024O), comparing F-512 and borrowed F-1024B]
• With 2 cycles of lag:
– 9% performance loss from borrowing (F-1024B) w.r.t. overprovisioning (F-1024O)
– Essentially no performance improvement from scaling up!
Overprovisioning vs. Borrowing
• Overprovisioning CAN be cheap
– FF: 7% CMP area
– CoreFusion: 12.5% area from borrowing [Ipek2007]
• If borrowing introduces even small delays, it may no longer be worthwhile to scale at all.
– This effect is worse if borrowing occurs at smaller design points.
Outline
• Introduction: Scalable Cores
– Motivation (Why scale in the first place?)
– Definition
• Scalable Cores for CMPs
– How to scale:
• Dynamically-Scalable Core (Forwardflow)
• Overprovision or Borrow Resources?
– When to scale: Hardware Scaling Policies
• For single-thread efficiency
• For multi-threaded workloads
• Conclusions/Wrap-Up
What to do for f=0.00
• What is important?
– Performance: Just scale up (done)
– Efficiency: Pick the most efficient configuration?
• How to find the right configuration?
• Can we do better?
[Plot: normalized runtime vs. parallel fraction f; f = 0.00 is the all-sequential case]
What about local efficiency? (i.e., phases)
• Applications may exhibit phases at "micro-scale"
– Not all phases are equal
# Sum an array
l_array: load [R1+0] -> R2
add R2 R3 -> R3
add R1 64 -> R1
brnz l_array
...
# Sum a list
l_list: load [R1+8] -> R2
add R2 R3 -> R3
load [R1+0] -> R1
brnz l_list
Array sum: great for big windows (Scale Up?)
List sum: a big window makes no difference (Scale Down?)
[Plot: E*D² normalized to the best static design]
Prior Art (some of it)
• POS (Positional Adaptation) [Huang03]:
– Maps code regions to configurations
– Static profiling; measures efficiency
• PAMRS (Power-Aware uArch Resource Scaling) [Iyer01]:
– Detects "hot spots"
– Measures every configuration's efficiency, picks the best
Want: the efficiency of POS, but the dynamic response of PAMRS
MLP-based Window Size Estimation
• Play to the strengths of the uArch:
– FF: pursue and measure MLP
– A different core would measure something else
• Find the smallest window that will expose as much MLP as the largest window
• Hardware:
– Poison bits
– Register names
– Load miss bit
– Counter
– LFSR
Results
Explain window size estimation in detail with a gory example
FG Scaling Results
• MLP:
– No profiling needed
– Safe: hurts efficiency by >10% for only 1 benchmark
– Compare to: POS (8 benchmarks), PAMRS (20 benchmarks)
[Plot: normalized E*D² for POS, PAMRS, and MLP across benchmarks; MLP shows fewer bad cases than PAMRS, and fewer than POS]
Recap: What to do for f=0.00
• Profiling (POS):
– Can help, might hurt
• Dynamic response (seek MLP):
– Seldom hurts; usually finds the most efficient configuration
[Plot: normalized runtime vs. parallel fraction f]
Outline
• Introduction: Scalable Cores
– Motivation (Why scale in the first place?)
– Definition
• Scalable Cores for CMPs
– How to scale:
• Dynamically-Scalable Core (Forwardflow)
• Overprovision or Borrow Resources?
– When to scale: Hardware Scaling Policies
• For single-thread efficiency
• For multi-threaded workloads
• Conclusions/Wrap-Up
What to do for f ≈ 0.25–0.85
• Two opportunities:
1. Sequential bottlenecks
• Detect and fix, i.e., scale up
• Better performance
2. Useless executions
• Detect and fix, i.e., scale down
• Better efficiency
[Plot: normalized runtime vs. parallel fraction f]
What if the OS Knows?
• The OS knows about bottlenecks
– Can scale up a core
• The OS knows about useless work
– Can scale down, or
– Can shut off unneeded cores (e.g., OPMS)
• Result: Amdahl's Law in the Multicore Era [Hill/Marty08]
Runtime = (1 - f)/k + f/N, where k is the speedup of the scaled-up sequential core
[Plot: normalized runtime vs. parallel fraction f, with the sequential fraction sped up by a factor k]
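A minimal sketch of that model as reconstructed above, where k is the sequential core's speedup (the function name is illustrative):

// Normalized runtime when the OS runs the sequential fraction on a
// core scaled up by factor k and the parallel fraction on N cores.
// With k = 1 this reduces to plain Amdahl's Law.
double scaled_runtime(double f, double k, int n) {
    return (1.0 - f) / k + f / n;
}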
If the OS doesn’t know
• Maybe the programmer knows? (Prog)
• An SLE-like lock detector to identify critical sections? (Crit)
• Hardware spin detection? [Wells2006] (Spin)
• Holding a lock… except when spinning? (CSpin)
• Every thread spinning except one? (ASpin)
(a limit study: pretend global communication is OK)
Amdahl Microbenchmark
• Benchmarks × Policies × Configs, plus opacity
• Runtime = (1 - f)/k + f/N: each iteration has a sequential phase of weight (1 - f) and a parallel phase of weight f
[Plot: normalized runtime vs. parallel fraction f; sequential behavior at left, parallel at right, unclear behavior in between]
Runs on real HW
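A hypothetical skeleton of such a microbenchmark, using std::thread; busy_work and the phase weights are illustrative stand-ins, not the thesis's actual harness:

#include <thread>
#include <vector>

// Stand-in for the real kernel: spin through a fixed amount of work.
void busy_work(long units) {
    volatile long sink = 0;
    for (long i = 0; i < units; ++i) sink += i;
}

// One microbenchmark iteration: a sequential phase of weight (1 - f),
// then a parallel phase of weight f split evenly across n threads.
void amdahl_iteration(double f, int n, long total_units) {
    busy_work(static_cast<long>((1.0 - f) * total_units)); // bottleneck
    std::vector<std::thread> workers;
    for (int t = 0; t < n; ++t)
        workers.emplace_back(busy_work,
                             static_cast<long>(f * total_units / n));
    for (auto& w : workers) w.join();  // barrier ends the parallel phase
}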
Prog (Programmer-Guided Scaling)
[Plots: per-thread configuration over time (P0–P7, F-32 through F-1024), driven by sc_hint(slow)/sc_hint(fast), and runtime normalized to F-128 vs. parallel fraction f for Prog against static F-128 and F-1024]
Crit (SLE-Style Lock Detector for Scale-Up)
[Plots: per-thread configuration over time and runtime normalized to F-128 vs. parallel fraction f for Crit against Prog, F-128, and F-1024]
Barrier::Arrive() { l.Lock(); … l.Unlock(); }
Lock::Lock() { CAS(myValue); … }
Lock::Unlock() { CAS(myValue); … }
WTH?
Crit: What goes wrong
• Intuition Mismatch:
– The lock detector implementer's expectations don't match the pthread library implementer's.
1. Critical section != sequential bottleneck
2. Lock+Unlock != CAS + temporal silent store
• More general lesson:
– SW is really flexible; programmers do strange things.
• HW designers: be careful, SW may not be doing what you think.
Spin (Spin Detector for Scale-Down)
[Plot: per-thread configuration over time (P0–P7, F-32 through F-1024)]
Spinning threads: scale down
The one thread that seldom/never spins performs like CSpin (next)
CSpin (Lock Detector for Scale-Up, Spin Detector for Scale-Down)
[Plots: per-thread configuration over time and runtime normalized to F-128 vs. parallel fraction f for CSpin against Prog, Crit, F-128, and F-1024]
The lock detector thinks a lock is held, but the thread is also spinning
ASpin (Spin, but Scale Up if all others Scaled Down)
[Plots: per-thread configuration over time and runtime normalized to F-128 vs. parallel fraction f for ASpin against CSpin, Crit, Prog, F-128, and F-1024]
All spinning: scale up (better late than never)
Amdahl Efficiency
[Plot: normalized E*D² vs. parallel fraction f for F-128, Prog, Spin, and ASpin]
1. The hope of SW parallelism for efficiency seems sound.
2. The "programmer" can help. (Psychology? Difficulty for non-toy programs?)
3a. Spin detection helps, by scaling down.
3b. Can scale up when all others spin (but who counts as "others"?)
Real Workloads?
Workload Behavior
• f ≈ 0.90+ by design: graduate students spend a lot of time making this so
• No Prog scaling policy (no programmer hints available)
• Apache: spin detection helps. Synchronization heavy.
• JBB: synchronization heavy.
• OLTP: Spin alone hurts a little; ASpin helps. Synchronization heavy.
• Zeus: Spin alone hurts a little; ASpin helps. Synchronization heavy.
[Plot: normalized E*D² for F-128, Spin, CSpin, ASpin, and F-1024 across the commercial workloads]
Outline
• Introduction: Scalable Cores
– Motivation (Why scale in the first place?)
– Definition
• Scalable Cores for CMPs
– How to scale:
• Dynamically-Scalable Core (Forwardflow)
• Overprovision or Borrow Resources?
– When to scale: Hardware Scaling Policies
• For single-thread efficiency
• For multi-threaded workloads
– How to continue: DVFS/Models for Future Software Evaluations
• Conclusions/Wrap-Up
Conclusions (1/2)
• How to scale cores:
– Forwardflow: an energy-proportional scalable-window core architecture
• Scale up for performance
• Scale down for energy conservation
– Overprovision resources when cheap
• Borrow only when necessary
• Avoid loose loops
Conclusions (2/2)
• When to scale cores:
– For single-thread efficiency:
• Seek efficient operation intrinsically (FF: MLP)
• Profiling can help, if possible.
– For threaded workloads:
• Scale up for sequential bottlenecks (if you can find them)
• Scale down for useless work
• How to emulate scalable cores:
– Proxy with DVFS, with caveats
Other Contributions
• Side Projects with Collaborators
– Deconstructing Scalable Cores, coming soon
– "Diamonds are an Architect's Best Friend", ISCA 2009
– To CMP or Not to CMP, TR & ANCS poster
• Parallel Programming at Wisconsin
– CS 838, CS 758
• Various Infrastructure Work
– Ruby, Tourmaline, Lapis, GEM5
Fun Facts About This Thesis
• Simulator:
– C++: 135kl (101kl), Python: 16.7kl
– 1188 revisions, 17,476 builds: ~15 builds per day since 5 July 2007
• Forwardflow used to be Fiberflow
– Watch out, Metamucil
• Est. simulation time:
– 2.9B CPU-seconds = 95 cluster-days (just in support of data in this thesis)
Questions/Pointers
Overp./Borrowing
FG Uniproc. Scaling
Multiproc. Scaling
DVFS vs. W. Scaling
SSR
All about FF
Estimating Power
LBUS/RBUS
Scalable Scheduling
Seeking MLP
Other Scalable Cores
Related Work
Backward Ptrs. in the Document
Always in motion is the future.
DVFS vs. Scaling
DVFS Instead of Simulation
• So far:
– "Benchmark" = 1ms–10ms of target time
– Scaling "in the micro" (i.e., much faster than software can react)
• What about longer runs?
– "Benchmark" = minutes or more
– Scaling "in the macro" (i.e., at the scale of systems)
• No real hardware scalable core exists
– Use DVFS instead, as a proxy.
You must unlearn what you have learned.
DVFS Effects
[Diagram: FE, L1-D, L2, L3, DRAM; the DVFS domain covers the core and its private caches]
+Freq: compute operations are faster
+Freq: memory seems slower
+Freq, +Volt: dynamic power higher (~cubic)
+Pdyn: higher temperature leads to higher static power
Example: F-128 @ 3.0 GHz vs. F-128 @ 3.6 GHz
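A first-order sketch of that ~cubic effect (the constants and names are illustrative):

// To first order, voltage scales with frequency, so dynamic power
// P ~ C * V^2 * f grows roughly with the cube of the frequency ratio.
double dvfs_dynamic_power(double base_power_w, double freq_ratio) {
    return base_power_w * freq_ratio * freq_ratio * freq_ratio;
}

// Example: 3.0 GHz -> 3.6 GHz is a 1.2x ratio, so ~1.2^3 = 1.73x
// dynamic power in the DVFS domain (consistent with the ~70%
// DVFS-domain increase reported below).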
HW Scaling Effects
[Diagram: FE, L1-D, L2, L3, DRAM; window scaling grows only the DQ]
+Window: compute operations are not (much) faster
+Window: memory seems faster
+Window: dynamic power higher (~log)
Example: Scale up from F-128 @ 3.0 GHz to F-256 @ 3.0 GHz
How do they compare quantitatively?
DVFS/HW Scaling Performance
[Plot: runtime normalized to F-128 for F-256 and 3.6GHz, a DVFS/scaling configuration pair with comparable performance]
More CPU-bound: prefer DVFS
More memory-bound: prefer window scaling
DVFS/HW Scaling Power
• DVFS: +~38% chip power
– +~70% DVFS-domain dynamic power
– +~20% temperature-induced leakage
• FF scaling: +~10% chip power
– +~2% temperature-induced leakage
[Stacked-bar plot: power normalized to F-128 (FE, DQ/ALU, MEM, Static components) for 3.6GHz and F-256]
DVFS Proxying Scalable Cores
• Performance: OK, with caveats
– CPU-bound workloads: DVFS overestimates scalable-core performance
– Memory-bound workloads: DVFS underestimates scalable-core performance
• Power: not OK.
– DVFS follows the E*D² curve
– An FF/scalable core should beat the E*D² curve
– Use a model instead.
SSR
• Per-Value Distributed Linked List
– Starts at the producer
– Visits each successor
– NULL pointer at the last successor
• Amenable to simple hardware
– Serializes wakeup
ld R4 4 R1
add R1 R3 R3
sub R4 16 R4
st R3 R8
breq R4 R3
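A minimal software sketch of the representation; the pointer encoding and type names are illustrative, not the thesis's hardware:

#include <cstdint>

constexpr uint16_t NULL_PTR = 0xFFFF;   // end-of-chain marker

// Each operand carries one forward pointer: the producer points at its
// first successor, each successor at the next, NULL at the last.
struct Operand { uint16_t next_use = NULL_PTR; };
struct DQEntry { Operand op1, op2, dest; };

// For the code above: the R1 value produced by the ld (entry 1) is
// consumed by the add (entry 2, op1), which is the last successor.
void link_r1_chain(DQEntry* dq) {
    dq[1].dest.next_use = 2 * 2 + 0;    // encode (entry 2, slot op1)
    dq[2].op1.next_use  = NULL_PTR;     // last successor
}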
Effect of Serialized Wakeup
[Plot: normalized runtime for astar, bzip2, gcc, libquantum, and gmean]
• Compared to an idealized window:
– Low mean performance loss from serialized wakeup (+2% runtime)
– Occasionally noticeable (i.e., bzip2, 50%+)
SSR Compiler Optimization
[Plot: normalized runtime for RUU, OoO, and SSR on the long, split, and crit microbenchmarks]
• long: the compiler cannot identify dynamically repeated registers
• split: the compiler can identify dynamically repeated registers, but cannot identify the critical path
• crit: the compiler knows both the dynamically repeated registers and the critical path
Power-Awareness
• How much energy is used by a computation?
– Measure (e.g., with a multimeter)
– Detailed simulation (e.g., SPICE)
– Simple simulation (e.g., WATTCH)
– Simple model (e.g., 10W/core)
E = Σi Ni · Ei, where Ni is the number of activations of element i and Ei is the energy per activation of element i
Measuring Energy Online
Σi Ni · Ei ≈ Σj Cj · Eest,j
Activations Ni: "hard" to measure. Events Cj: "easy" to measure, and correlated with activations.
[Iyer01]: multiply-accumulate in hardware. [Joseph01]: HW perf. counters, works for Pentium-era cores. This work: a scalable core; use the core's own resources to do the computation.
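A minimal sketch of that multiply-accumulate estimate, assuming per-event energy coefficients were fit offline (names are illustrative):

#include <cstddef>

// Online energy estimate: E ~= sum_j C_j * E_est_j, a multiply-
// accumulate over easy-to-measure event counts, with per-event
// coefficients standing in for true per-activation energies.
double estimate_energy_joules(const unsigned long long* event_counts,
                              const double* joules_per_event,
                              std::size_t num_events) {
    double total = 0.0;
    for (std::size_t j = 0; j < num_events; ++j)
        total += event_counts[j] * joules_per_event[j];
    return total;
}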
DVFS Won’t Cut It
• Near saturation in voltage scaling
• Subthreshold DVFS is never energy-efficient [Zhai04]
• A microarchitectural alternative is needed
[Plot: operating voltage range (Vmin to Vmax) by product year, 1996–2010, for the IBM PowerPC 405LP, TransMeta Crusoe TM5800, Intel XScale 80200, Intel Itanium Montecito, and Atom Silverthorne; the usable range shrinks from ~80% to ~33%]
Resource borrowed from David’s “Two Cores” Talk
Scalable Interconnect
Logically: a ring. Scale Down: a ring with fewer elements.
• Not straightforward
• Overprovisioning won't work well: the wrap-around link is ugly
• Needs to support 1-, 2-, 4-, and 8-BG operation
Solution: Two Unidirectional Busses (gasp!)
[Diagram: LBUS/RBUS steering-bit patterns for the F-1024 and F-512 configurations]
Window Estimation Example
# Sum an array (three iterations in flight)
l_array: load [R1+0] -> R2
add R2 R3 -> R3
add R1 64 -> R1
brnz l_array
load [R1+0] -> R2
add R2 R3 -> R3
add R1 64 -> R1
brnz l_array
load [R1+0] -> R2
add R2 R3 -> R3
add R1 64 -> R1
brnz l_array
• First load misses (M): start profiling, poison R2
• add R2: poison propagates to R3 (W); add R1: antidote R1 (independent of the miss)
• Second load: independent miss, set an ELMR bit, poison R2; then poison R3, antidote R1
• Third load: independent miss, set an ELMR bit, poison R2
• ELMR bit positions correspond to window sizes (4, 8, 16, …); MSb(ELMR) = 16 → a 16-entry window is needed to capture the observed MLP.
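A minimal software sketch of the mechanism, assuming a log2-bucketed ELMR and omitting the LFSR-based sampling; field names and widths are illustrative:

#include <bitset>
#include <cstdint>

// MLP-based window estimation: profile a load miss, poison its dest
// register, and let poison flow through dependents. Each independent
// miss (no poisoned source) at dispatch distance d sets ELMR bit
// floor(log2(d)); the most significant set bit names the smallest
// window that exposes the observed MLP.
struct WindowEstimator {
    std::bitset<64> poison;        // poison bit per architectural register
    std::bitset<11> elmr;          // bit i ~ window of 2^i entries
    uint32_t profile_start = 0;    // DQ index of the profiled miss

    void start_profile(uint32_t dq_index, unsigned dest_reg) {
        profile_start = dq_index;
        poison.set(dest_reg);
    }
    void on_load_miss(uint32_t dq_index, bool src_poisoned,
                      unsigned dest_reg) {
        if (!src_poisoned) {              // independent miss: genuine MLP
            uint32_t d = dq_index - profile_start;
            unsigned bit = 0;
            while ((2u << bit) <= d) ++bit;  // log2 bucket of distance
            elmr.set(bit);
        }
        poison.set(dest_reg);             // the miss's output is poisoned
    }
    unsigned needed_window() const {      // MSb(ELMR) -> window size
        for (int b = 10; b >= 0; --b)
            if (elmr.test(b)) return 1u << b;
        return 32;                        // floor: smallest config (F-32)
    }
};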
Adding Hysteresis (1/2)
[Plots: runtime, E*D, E*D², MLP, and the chosen configuration over time for libquantum and astar, vs. F-1024]
1. Many reconfigurations
2. Too small most of the time. Must anticipate, not react.
Adding Hysteresis (2/2)
• Scale Down only "occasionally"
– On a full squash
[Plots: runtime, E*D, and E*D² for the UpDown and UpOnly policies on astar, vs. F-1024]
• Intuition:
– Assume a big window is not useful
– Show, occasionally, that a big window IS useful
Leakage Trends
• Leakage starts to dominate
• SOI & DG technology helps (ca. 2010/2013)
• Tradeoffs possible:
– Low-leakage devices (slower access time)
[Plots: 1MB cache dynamic & leakage power [HP2008, ITRS2007]; normalized leakage power by circuit variant (DG and LSP devices) [ITRS2007]]
Forwardflow Overview
• Design Philosophy:
– Avoid 'broadcast' accesses (e.g., no CAMs)
• Avoid 'search' operations (via pointers)
– Prefer short wires, tolerate long wires
– Decouple the frontend from backend details
• Abstract the backend as a pipeline
Forwardflow – Scalable Core Design
• Use pointers to explicitly define data movement
– Every operand has a Next Use Pointer
– Pointers specify where data moves (in log(N) space)
– Pointers are agnostic of:
• Implementation
• Structure sizes
• Distance
– No search operation
ld R4 4 R1
add R1 R3 R3
sub R4 16 R4
st R3 R8
breq R4 R3
Forwardflow – Dataflow Queue
• Table of in-flight instructions
• Combination scheduler, ROB, and PRF
– Manages OOO dependencies
– Performs scheduling
– Holds data values for all operands
• Each operand maintains a next use pointer (hence the log(N))
• Implemented as banked RAMs → scalable
Dataflow Queue (Op1, Op2, Dest):
1 ld R4 4 R1
2 add R1 R3 R3
3 sub R4 16 R4
4 st R3 R8
5 breq R4 R5
Bird's-eye view of FF / detailed view of FF
Forwardflow – DQ +/-'s
+ Explicit, persistent dependencies
+ No searching of any kind
- Multi-cycle wakeup per value *
* The average number of successors is small [Ramirez04, Sassone07]
DQ: Banks, Groups, and ALUs
[Diagram: logical vs. physical organization; the DQ bank group is the fundamental unit of scaling]
Forwardflow: Pipeline Tour
• RCT: identifies successors
• ARF: provides architected values
• DQ: chases pointers
[Pipeline diagram: PRED → FETCH (I$) → DECODE (RCT) → DISPATCH → EXECUTE (DQ, D$) → COMMIT (ARF)]
Scalable, Decoupled Backend
RCT: Summarizing Pointers
• Want to dispatch: breq R4 R5
• Need to know:
– Where to get R4? → the result of DQ entry 3
– Where to get R5? → from the ARF
• The Register Consumer Table summarizes where the most recent version of each register can be found
Dataflow Queue (Op1, Op2, Dest):
1 ld R4 4 R1
2 add R1 R3 R3
3 sub R4 16 R4
4 st R3 R8
5 (empty)
RCT: Summarizing Pointers
Dataflow Queue (Op1, Op2, Dest) after dispatching breq:
1 ld R4 4 R1
2 add R1 R3 R3
3 sub R4 16 R4
4 st R3 R8
5 breq [R4] [R5]
Register Consumer Table (REF, WR):
R1: 2-S1, 1-D
R2: (none)
R3: 4-S1, 2-D
R4: 3-D → updated to 5-S1, 3-D
R5: (none)
R4 comes from DQ entry 3-D; R5 comes from the ARF
Wakeup/Issue: Walking Pointers
• Follow the Dest pointer when a new result is produced
– Continue following pointers to subsequent successors
– At each successor, read the 'other' value & try to issue
• A NULL pointer marks the last successor
Dataflow Queue (Op1, Op2, Dest):
1 ld R4 4 R1
2 add R1 R3 R3
3 sub R4 16 R4
4 st R3 R8
5 breq R4 7
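A minimal software sketch of the walk; the (entry, slot) pointer encoding is illustrative:

#include <cstdint>
#include <vector>

constexpr uint16_t NULL_PTR = 0xFFFF;

struct Operand { uint64_t value = 0; bool ready = false;
                 uint16_t next_use = NULL_PTR; };
struct DQEntry { Operand op1, op2, dest; };

// Serialized wakeup: when `producer` completes, walk its dest chain,
// writing the value into each successor operand; an entry can issue
// once both operands are ready. Encoding: ptr = entry * 2 + slot.
void wakeup(std::vector<DQEntry>& dq, uint16_t producer, uint64_t value) {
    uint16_t ptr = dq[producer].dest.next_use;
    while (ptr != NULL_PTR) {
        DQEntry& e = dq[ptr / 2];
        Operand& slot = (ptr % 2 == 0) ? e.op1 : e.op2;
        slot.value = value;
        slot.ready = true;
        if (e.op1.ready && e.op2.ready) {
            // select: read the 'other' value and send the entry to an ALU
        }
        ptr = slot.next_use;   // next successor, or NULL_PTR at the end
    }
}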
DQ: Fields and Banks
• Independent fields → independent RAMs
– i.e., accessed independently, with independent ports, etc.
• Multi-issue ≠ multi-port
– Multi-issue → multi-bank
– Dispatch and commit access contiguous DQ regions
• Bank on low-order bits for dispatch/commit bandwidth
• Port contention + wire delay → more banks
– Dispatch and commit share a port
• Bank on a high-order bit to reduce contention
DQ: Banks, Groups, and ALUs
[Diagram: logical vs. physical organization; the DQ bank group is the fundamental unit of scaling]
Related Work
• Scalable Schedulers
– Direct Instruction Wakeup [Ramirez04]:
• Scheduler has a pointer to the first successor
• Secondary table holds a matrix of further successors
– Hybrid Wakeup [Huang02]:
• Scheduler has a pointer to the first successor
• Each entry has a broadcast bit for multiple successors
– Half Price [Kim02]:
• Slice the scheduler in half
• The second operand is often unneeded
Related Work
• Dataflow & Distributed Machines– Tagged-Token [Arvind90]
• Values (tokens) flow to successors– TRIPS [Sankaralingam03]:
• Discrete Execution Tiles: X, RF, $, etc.• EDGE ISA
– Clustered Designs [e.g. Palacharla97]• Independent execution queues
RW: Scaling, etc.
• CoreFusion [Ipek07]
– Fuses individual core structures into bigger cores
• Power-aware microarchitecture resource scaling [Iyer01]
– Varies RUU size & width
• Positional Adaptation [Huang03]
– Adaptively applies low-power techniques:
• Instruction filtering, sequential cache, reduced ALUs
RW: Scalable Cores
• CoreFusion [Ipek07]
– Fuse individual core structures into bigger cores
• Composable Lightweight Processors [Kim07]
– Many very small cores operate collectively, ala TRIPS
• WiDGET [Watanabe10]
– Scale window via smart steering
RW: Seeking MLP
• Big Windows [Many]
• Runahead Execution [Dundas97][Mutlu06]
– "Just keep executing"
• WIB [Lebeck02]
– Defer, re-schedule later
• Continual Flow [Srinivasan04]
– & friends [Hilton09][Chaudhry09]
– Defer, re-dispatch later
Operand Networks: Pointer Span
[CDF of pointer span for astar, sjeng, and jbb]
SPAN = 5: ~85% of pointers designate near successors.
Intuition: most of these pointers yield IB traffic, some IBG-N, none IBG-D.
SPAN = 16: nearly all pointers (>95%) designate successors 16 or fewer entries away.
Intuition: there will be very little IBG-D traffic.
Is It Correct?
• Impossible to tell
– Experiments do not prove; they support or refute
• What support has been observed for the hypothesis "this is correct"?
– Reasonable agreement with published observations (e.g., consumer fanouts)
– Few timing-first functional violations
– Predictable uBenchmark behavior
• Linked list: no parallelism
• Streaming: much parallelism
CoreFusion
• Borrow Everything
– Merges multiple discrete elements in multiple discrete cores into larger components
– Troublesome for N > 2
[Diagram: two cores' BPRED, Decode, Scheduler, PRF, and I$ fused into one]
“Vanilla” CMOS
[Device cross-section: planar transistor with N+ source/drain in a P- body]
Double-Gate, Tri-Gate, Multigate
ITRS-HP vs. ITRS-LSP Device
[Device cross-sections: HP vs. LSP planar transistors]
LSP: ~2x Thicker Gate Oxides
LSP: ~2x Longer Gates
LSP: ~4x Vth
OoO Scaling
[Diagram: decode width 2 vs. decode width 4; each new instruction's (dest, src1, src2) is compared against every other in-flight instruction]
Number of comparators ~ O(N²); bypassing complexity ~ O(N²)
Two-way fully bypassed shown; four-way fully bypassed is beyond my PowerPoint skill
OoO Scaling
• ROB complexity: O(N), O(I^~3/2)
• PRF complexity: O(ROB), O(I^~3/2)
• Scheduler complexity:
– CAM: O(N·log(N)) (the size of a register tag increases as log(N))
– Matrix: O(N²) (in fairness, the constant in front is small)
Flavors of “Off”
Flavor | Dynamic Power | Static Power | Response Lag Time
Active (not off) | U% | 100% | 0 cycles
Drowsy (Vdd scaled) | 1-5% | 40% | 1-2 cycles
Clock-gated | 1-5% | 100% | ~0 cycles
Vdd-gated | <1% | <1% | 100s of cycles
Freq. scaled | F% | 100% | ~0 cycles
Forwardflow – Resolving Branches
Dataflow Queue (Op1, Op2, Dest):
1 ld R4 4 R1
2 add R1 R3 R3
3 sub R4 16 R4
4 st R3 R8
5 breq R4 R5
6 ld R4 4 R1
7 add R1 R3 R3
• On a branch prediction:
– Checkpoint the RCT
– Checkpoint the pointer valid bits
• On checkpoint restore:
– Restore the RCT
– Invalidate bad pointers
A Day in the Life of a Forwardflow Instruction: Decode
add R1 R3 R3 (becomes DQ entry 8)
Register Consumer History: R1 → 7-D
• Op1 (R1): last referenced at 7-D, so the add carries pointer "R1@7D"; RCT[R1] becomes 8-S1
• Op2 (R3): value available from the ARF (R3 = 0)
• Dest (R3): RCT[R3] becomes 8-D
A Day in the Life of a Forwardflow Instruction: Dispatch
Dataflow Queue (Op1, Op2, Dest):
7 ld R4 4 R1
8 add R1@7D R3=0 R3
9 (empty)
(Implicit fields are not actually written.)
A Day in the Life of a Forwardflow Instruction: Wakeup
Dataflow Queue (Op1, Op2, Dest):
7 ld R4 4 R1
8 add R1 0 R3
9 sub R4 16 R4
10 st R3 R8
DQ entry 7's result is 0!
DestPtr.Read(7) returns next pointer 8-S1; DestVal.Write(7, 0); the value 0 is forwarded to 8-S1.
A Day in the Life of a Forwardflow Instruction: Issue (…and Execute)
Dataflow Queue (Op1, Op2, Dest):
7 ld R4 4 R1
8 add R1 0 R3
9 sub R4 16 R4
10 st R3 R8
S1Val.Write(8, 0); S1Ptr.Read(8) returns the next pointer; S2Val.Read(8) and Meta.Read(8) supply the other operand and opcode.
Issue: add 0 + 0 → DQ8
A Day in the Life of a Forwardflow Instruction: Writeback
Dataflow Queue (Op1, Op2, Dest):
7 ld R4 4 R1
8 add R1 0 R3
9 sub R4 16 R4
10 st R3 R8
DestPtr.Read(8) returns next pointer 10-S1; DestVal.Write(8, 0); R3:0 is forwarded to 10-S1.
A Day in the Life of a Forwardflow Instruction: Commit
Dataflow Queue (Op1, Op2, Dest):
7 ld R4 4 R1
8 add R1 0 R3 (R3:0)
9 sub R4 16 R4
10 st R3 R8
Commit logic: Meta.Read(8) and DestVal.Read(8), then ARF.Write(R3, 0).
DQ Q&A
Dataflow Queue (Op1, Op2, Dest):
1 ld R4 4 R1
2 add R1 R3 R3
3 sub R4 16 R4
4 st R3 R8
5 breq R4 R5
6 ld R4 4 R1
7 add R1 R3 R3
8 sub R4 16 R4
9 st R3 R8
Register Consumer History after the first iteration: R4 → 5-S1, R3 → 4-S1, R1 → 2-S1
Register Consumer History after the second iteration: R4 → 8-D, R3 → 9-S1, R1 → 7-S1
Forwardflow – Wakeup
Dataflow Queue (Op1, Op2, Dest):
1 ld R4 4 R1
2 add R1 R3 R3
3 sub R4 16 R4
4 st R3 R8
5 breq R4 0
DQ entry 1's result is 7!
DestPtr.Read(1) returns next pointer 2-S1; DestVal.Write(1, 7); the value 7 is forwarded to 2-S1.
Forwardflow – Selection
Dataflow Queue (Op1, Op2, Dest):
1 ld R4 4 R1
2 add R1 R3 R3
3 sub R4 16 R4
4 st R3 R8
5 breq R4 0
S1Val.Write(2, 7); S1Ptr.Read(2) returns the next pointer; S2Val.Read(2) and Meta.Read(2) supply the other operand and opcode.
DQ2 issues: add 7 + 44
Forwardflow – Building Pointer Chains: Decode
• Decode must determine, for each operand, where the operand's value will originate
– Vanilla OOO: register renaming
– Forwardflow OOO: the Register Consumer Table (RCT)
• The RCT records the last instruction to reference each architectural register
– A RAM-based table, analogous to a renamer
Decode Example
Dataflow Queue (Op1, Op2, Dest):
5 ld R4 4 R4
6 add R4 R1 R4
7 ld R4 16 R1
8 (empty)
9 (empty)
Register Consumer History: R4 → 7-S1, R1 → 7-D
8: add R1 R3 R3
• Op1 (R1): comes from 7-D, so the add carries "R1@7D"; RCT[R1] becomes 8-S1
• Op2 (R3): value available from the ARF (R3 = 0)
• Dest (R3): RCT[R3] becomes 8-D
add → R3
D. Gibson Thesis Defense - 115
Forwardflow – Dispatch
• Dispatch into the DQ:
– Writes metadata and available operands
– Appends the instruction to the forward pointer chains
Dataflow Queue (Op1, Op2, Dest):
5 ld R4 4 R4
6 add R4 R1 R4
7 ld R4 16 R1
8 add R1@7D R3=0 R3
9 (empty)