WiDGET: Wisconsin Decoupled Grid Execution Tiles
Yasuko Watanabe*, John D. Davis†, David A. Wood*
*University of Wisconsin †Microsoft Research
ISCA 2010, Saint-Malo, France
Power vs. Performance
• A full range of operating points on a single chip
[Figure: normalized chip power vs. normalized performance for a single core, spanning Atom-like to Xeon-like operating points]
Executive Summary
• WiDGET framework
  – Sea of resources
  – In-order Execution Units (EUs)
    • ALU & FIFO instruction buffers
    • Distributed in-order buffers → OoO execution
  – Simple instruction steering
  – Core scaling through resource allocation
    • ↓ EUs → slower with less power
    • ↑ EUs → turbo speed with more power
[Figure: Intel CPU thermal design power trend, 1971–2009 — from the 4004, 8008, 8080, 8085, 8086, 286, 386, and 486 through the Pentium family, Pentium 4, Pentium D, Core 2, and Core i7, with TDP rising from roughly 0.15 W toward 150 W]
Conflicting Goal: Low Power & High Performance
• Need for high single-thread performance
  – Amdahl's Law [Hill08]
  – Service-level agreements [Reddi10]
• Challenge: energy-proportional computing [Barroso07]
• Prior approach: dynamic voltage and frequency scaling (DVFS)
Diminishing Returns of DVFS
• Near saturation in voltage scaling
• Innovations in microarchitecture needed
[Figure: minimum operating voltage (V) vs. year, 1997–2009, for IBM PowerPC 405LP, Intel XScale 80200, Transmeta Crusoe TM5800, Intel Itanium Montecito, and Atom Silverthorne, approaching Vmin]
Outline
• High-level design overview
• Microarchitecture
• Single-thread evaluation
• Conclusions
High-Level Design
• Sea of resources
[Figure: chip layout — a grid of in-order Execution Units (EUs) and Instruction Engines (thread context management, front-end + back-end), each with L1I/L1D caches, around a shared L2]
WiDGET Vision
• Just 5 examples; much more can be done.
[Figure: five example resource allocations trading off TLP, ILP, and power]
In-Order EU
• Executes 1 instruction/cycle
• EU aggregation for OoO-like performance
  – Increases both issue BW & buffering
  – Prevents stalled instructions from blocking ready instructions
  – Extracts MLP & ILP
[Figure: EU internals — in-order instruction buffers, an operand buffer, and routers linking the EU to its neighbors]
EU Cluster
• Full bypass within a cluster
• 1-cycle inter-cluster link
[Figure: EUs grouped into clusters, attached to Instruction Engines (IE 0, IE 1) with their L1I/L1D caches and a shared L2]
Instruction Engine (IE)
• Thread-specific structures
• Front-end + back-end, similar to a conventional OoO pipe
• Steering logic for distributed EUs
  – Achieves OoO performance with in-order EUs
  – Exposes independent instruction chains
[Figure: IE pipeline — front-end (Fetch, Decode, Rename, branch predictor, register file) feeding the steering stage, and back-end (ROB, Commit)]
Coarse-Grain OoO Execution
[Figure: issue order of instructions 1–8 under conventional OoO issue vs. WiDGET — WiDGET steers independent chains to separate in-order buffers, approximating the OoO order]
Methodology
• Goal: power proportionality
  – Wide performance & power ranges
• Full-system execution-driven simulator
  – Based on GEMS
  – Integrated Wattch and CACTI
• SPEC CPU2006 benchmark suite
• 2 comparison points
  – Neon: aggressive processor for high ILP
  – Mite: simple, low-power processor
• Configurations: 1–8 EUs per IE; 1–4 instruction buffers per EU
Power Proportionality
• 21% power savings to match Neon's performance
• 8% power savings for 26% better performance than Neon
• Power scaling of 54% to approximate Mite
• Covers both Neon and Mite on a single chip
[Figure: normalized chip power vs. normalized performance for Neon, Mite, and WiDGET configurations from 1 EU + 1 instruction buffer up to 8 EUs + 4 instruction buffers]
Power Breakdown
• Less than ⅓ of Neon's execution power
  – Due to no OoO scheduler and limited bypass
• Increase in WiDGET's power caused by:
  – More EUs and instruction buffers
  – Higher utilization of other resources
[Figure: normalized power broken down into L3, L2, L1D, L1I, fetch/decode/rename, backend, ALU, and execution for Neon, Mite, and WiDGET with 1–8 EUs and 1–4 instruction buffers each]
Conclusions
• WiDGET
  – EU provisioning for a power-performance target
  – Trades complexity for power
  – OoO approximation using in-order EUs
  – Distributed buffering to extract MLP & ILP
• Single-thread performance
  – Scales from close to Mite to better than Neon
  – Power-proportional computing
How is this different from the next talk?
[Table: WiDGET [Watanabe10] vs. Forwardflow [Gibson10] — both share the scalable-cores vision; WiDGET approximates OoO with in-order EUs and a steering mechanism]
Thank you! Questions?
Backup Slides
• DVFS
• Design choices of WiDGET
• Steering heuristic
• Memory disambiguation
• Comparison to related work
  – Vs. clustered architectures
  – Vs. complexity-effective superscalars
• Steering cost model
• Steering mechanism
• Machine configuration
• Area model
• Power efficiency
• Vs. dynamic HW resizing
• Vs. heterogeneous CMPs
• Vs. dynamic multi-cores
• Vs. thread-level speculation
• Vs. Braid architecture
• Vs. ILDP
Dynamic Voltage/Frequency Scaling (DVFS)
• Dynamically trades power for performance
  – Changes voltage and frequency at runtime
  – Often regulated by the OS
• Slow response time
• Linear reduction of V & F
  – Cubic reduction in dynamic power
  – Linear reduction in performance
  – Quadratic reduction in dynamic energy
• Effective for thermal management
• Challenges
  – Controlling DVFS
  – Diminishing returns of DVFS
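The power/performance/energy ratios above follow directly from the CMOS dynamic-power equation. A minimal sketch (not from the slides; the function names are invented) that scales voltage and frequency together by a factor s:

```python
# Dynamic (switching) power of CMOS logic: P = C * V^2 * f.
# Scaling V and f together by s gives s^3 power, s performance,
# and s^2 energy per unit of work -- the ratios cited on the slide.

def dynamic_power(c: float, v: float, f: float) -> float:
    """P = C * V^2 * f."""
    return c * v * v * f

def dvfs_ratios(s: float) -> dict:
    """Scale both V and f by s; report power, performance, and energy ratios."""
    base = dynamic_power(1.0, 1.0, 1.0)
    scaled = dynamic_power(1.0, s, s)
    perf = s                      # performance tracks frequency (linear)
    power = scaled / base         # = s**3 (cubic)
    energy = power / perf         # energy per unit of work = s**2 (quadratic)
    return {"power": power, "performance": perf, "energy": energy}

r = dvfs_ratios(0.8)   # e.g., power drops to 0.8**3 = 51.2% of baseline
```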
Service-Level Agreements (SLAs)
• Expectations between consumer and provider
  – QoS, boundaries, conditions, penalties
• Example: web server SLA
  – Combination of latency, throughput, and QoS (minimum % performed successfully)
• Guaranteeing SLAs on WiDGET
  1. Set deadline & throughput goals (adjust EU provisioning)
  2. Set target machine specs ("processor X-like with X GB memory")
Design Choice of WiDGET (1/2)
• Nehalem-like CMP of conventional OoO cores (Fetch, Decode, Rename, OoO IQ, ROB, Commit; BR Pred, I$, D$, RF) sharing an L2
• OoO issue queue + full bypass = 35% of processor power
[Figure: Nehalem-like CMP floorplan — many identical OoO cores around a shared L2, with one core's pipeline expanded]
Design Choice of WiDGET (2/2)
• OoO issue queue + full bypass = 35% of processor power
• Replace them with simple building blocks
• Decouple in-order execution from the rest of the pipeline
[Figure: the same Nehalem-like CMP with the OoO issue logic replaced by decoupled in-order execution units]
Steering Heuristic
• Based on dependence-based steering [Palacharla97]
  – Expose independent instruction chains
  – Place the consumer directly behind its producer
  – Stall steering when no empty buffer is found
• WiDGET: power-performance goal
  – Emphasize locality & scalability
• Consumer-push operand transfers
  – Send the steered EU ID to the producer EU
  – Multi-cast the result to all consumers
[Figure: steering decision tree — 0 outstanding ops: any empty buffer; 1 outstanding op: the producer's buffer if a slot is available behind the producer, else an empty buffer within the cluster; 2 outstanding ops: behind either producer if available, else an empty buffer in either producer's cluster]
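The decision tree above can be sketched in a few lines. This is a simplified software model, not the actual hardware; the function and parameter names are invented for illustration:

```python
# Dependence-based steering sketch: prefer a slot directly behind a
# producer, then an empty buffer in a producer's cluster, then stall.

def steer(producers, empty, behind_free, cluster_of):
    """
    producers:   buffer IDs of outstanding producers (0, 1, or 2 of them)
    empty:       set of currently empty buffer IDs
    behind_free: producer buffers whose tail slot can take the consumer
    cluster_of:  maps buffer ID -> cluster ID
    Returns the chosen buffer ID, or None (stall steering).
    """
    if not producers:                 # 0 outstanding ops: any empty buffer
        return min(empty) if empty else None
    for p in producers:               # 1-2 outstanding: directly behind a producer?
        if p in behind_free:
            return p
    clusters = {cluster_of[p] for p in producers}
    local = [b for b in empty if cluster_of[b] in clusters]
    if local:                         # empty buffer within a producer's cluster
        return min(local)
    return None                       # no empty buffer found: stall
```

The cluster-affined fallback is what gives WiDGET its locality emphasis relative to plain dependence-based steering.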
Opportunities / Challenges
• Decoupled design + modularity
  – Reconfiguration by turning components on/off
  – Core scaling by EU provisioning rather than fusing cores
  – Core customization for ILP, TLP, DLP, MLP & power
• Challenges
  – Parallelism-communication trade-off
  – Varying communication demands
Memory Disambiguation on WiDGET
• Challenges arising from modularity
  – Less communication between modules
  – Fewer centralized structures
• Benefits of NoSQ
  – Memory dependency → register dependency
  – Reduced communication
  – No centralized structure
  – Only register-dependency relations between EUs
  – Faster execution of loads
Memory Instructions?
• No LSQ, thanks to NoSQ [Sha06]
• Instead:
  – Exploit in-window store-load forwarding
  – Loads: predict at Rename whether a dependent store is in flight
    • If so, read from the store's source register, not from the cache
    • Else, read from the cache
    • At Commit, re-execute if necessary
  – Stores: write at Commit
  – Prediction: dynamic distance in stores; path-sensitive
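The rename-time decision can be illustrated with a toy distance predictor. This is a deliberately simplified sketch of the NoSQ idea (no path sensitivity, invented names), not the mechanism from [Sha06] itself:

```python
# Store-distance prediction sketch: a load predicted to depend on the
# d-th most recent in-flight store reads that store's source register
# instead of the cache; mispredictions are caught by re-execution at commit.

class DistancePredictor:
    def __init__(self):
        self.table = {}                   # load PC -> predicted store distance

    def predict(self, load_pc):
        return self.table.get(load_pc)    # None = no forwarding predicted

    def train(self, load_pc, distance):
        self.table[load_pc] = distance

def rename_load(load_pc, inflight_stores, predictor):
    """At Rename: if a dependent store is predicted in flight, bypass the
    cache and name the store's source register as the load's value source."""
    d = predictor.predict(load_pc)
    if d is not None and d <= len(inflight_stores):
        store = inflight_stores[-d]       # d-th most recent in-flight store
        return ("forward", store["src_reg"])
    return ("cache", None)
```

Turning the memory dependence into a register dependence is what lets the EUs communicate only through register names.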
Comparison to Related Work

Design               | Scale Up & Down? | Symmetric? | Decoupled Exec? | In-Order? | Wire Delays? | Data Driven? | ISA Compatibility?
WiDGET               | √   | √   | √     | √     | √ | √     | √
Adaptive Cores       | X   | -   | √ / X | √ / X | - | -     | √
Heterogeneous CMPs   | X   | X   | X     | √ / X | - | -     | √
Core Fusion          | √   | √   | X     | X     | √ | -     | √
CLP                  | √   | √   | √     | √     | √ | √     | X
TLS                  | X   | √   | X     | √ / X | - | X     | √
Multiscalar          | X   | √   | X     | X     | - | X     | X
Complexity-Effective | X   | √   | √     | √     | X | √     | √
Salverda & Zilles    | √   | √   | X     | √     | X | √     | √
ILDP & Braid         | X   | √   | √     | √     | - | √     | X
Quad-Cluster         | X   | √   | √     | X     | √ | √ / X | √
Access/Execute       | X   | X   | X     | √     | - | √     | X
Vs. OoO Clusters
• OoO clusters
  – Goal: superscalar ILP without impacting cycle time
  – Decentralize the deep, wide structures of superscalars
  – Cluster: a smaller OoO design
  – Steering goal: high performance through communication hiding & load balance
• WiDGET
  – Goal: power & performance
  – Designed to scale cores up & down
  – Cluster: 4 in-order EUs
  – Steering goal: localization to reduce communication latency & power
EX: Steering for Locality
[Figure: a 6-instruction dependence chain steered under clustered architectures (e.g., advanced RMBS, modulo) vs. WiDGET — clustered steering maintains load balance across clusters at the cost of inter-cluster delay, while WiDGET exploits locality by keeping the chain within one cluster]
Vs. Complexity-Effective Superscalars
• Palacharla et al.
  – Goal: performance
  – Consider all buffers for steering and issuing (more buffers → more options)
• WiDGET
  – Goal: power-performance
  – Requirements: shorter wires, power gating, core scaling
  – Differences: localization & scalability
    • Cluster-affined steering
    • Keep dependent chains nearby
    • Issue selection only from a subset of buffers
  – New question: which empty buffer to steer to?
Steering Cost Model [Salverda08]
• Steer to distributed IQs
• The steering policy determines issue time
  – Constrained by dependencies, structural hazards, and issue policy
• Ideal steering issues an instruction:
  – As soon as it becomes ready (horizon)
  – Without blocking others (frontier — the constraint of in-order issue)
• Steering cost = horizon − frontier
  – Good steering minimizes |steering cost|
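The cost computation itself is tiny; what is expensive is evaluating it for every IQ. A sketch under the slide's definitions (function names are mine, not from [Salverda08]):

```python
# Steering-cost sketch: cost = horizon - frontier per candidate in-order IQ;
# the best candidate minimizes the absolute cost, i.e., the instruction
# issues as soon as it is ready without blocking later instructions.

def steering_cost(horizon: int, frontier: int) -> int:
    """horizon: earliest cycle the instruction's operands are ready;
    frontier: earliest cycle this in-order IQ can issue another instruction."""
    return horizon - frontier

def best_queue(horizon: int, frontiers: list) -> int:
    """Index of the IQ whose next issue slot best matches readiness."""
    costs = [abs(steering_cost(horizon, f)) for f in frontiers]
    return costs.index(min(costs))
```

Note the model still assumes the steering logic can see every IQ's frontier and knows execution latencies in advance, which is exactly the complexity argument raised on the next slide.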
EX: Application of Cost Model
• Steer instruction 3 to in-order IQs
• Challenges
  – Must check all IQs to find an optimal steering
  – Actual execution times are unknown in advance
• Argument of Salverda & Zilles
  – Too complex to build, or
  – Too many execution resources needed to match OoO
[Figure: steering instruction 3 into IQ 0–3 yields costs (horizon H minus frontier F) of −1, −1, 0, and 1]
Impact of Comm Delays
[Figure: a 5-instruction example — spread across IQs, execution latency is 3 cycles with free communication but 5 cycles once a 1-cycle inter-IQ latency is added; steering the dependent chain into one IQ instead trades parallelism for communication and finishes in 4 cycles]
Observation Under Comm Delays
• Not beneficial to spread instructions
  – Reduced pressure for more execution resources
• Not much need to consider distant IQs
  – Reduced problem space
  – Simplified steering
Instruction Steering Mechanism
• Goals
  – Expose independent instruction chains
  – Achieve OoO performance with multiple in-order EUs
  – Keep dependent instructions nearby
• 3 things to keep track of
  – The producer's location → Last Producer Table
  – Whether the producer already has a consumer → full bit vector
  – Empty buffers → empty bit vector
Steering Example
[Figure: worked example steering instructions 1–8 across four buffers; the Last Producer Table maps each register to its producer's buffer ID and a has-consumer bit, while empty/full bit vectors track the state of the four buffers]
Instruction Buffers
• Small FIFO buffer (configured: 16 entries)
• 1 straight instruction chain per buffer
• Entry: instruction, operands, consumer-EU bit vector
  – Set when a consumer is steered to a different EU
  – Read after computation
  – Multi-cast the result to consumers
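The entry layout above can be sketched as a small data structure. Field and method names here are invented for illustration; the real entry is a hardware structure, not an object:

```python
# Instruction-buffer entry sketch: besides the instruction and operands,
# each entry carries a consumer-EU bit vector so the result can be
# multi-cast to remote consumers after computation.

from dataclasses import dataclass
from collections import deque

@dataclass
class BufferEntry:
    instr: str
    op1: object = None
    op2: object = None
    consumer_eus: int = 0          # bit vector: one bit per EU

    def add_consumer(self, eu_id: int):
        """Set when a consumer is steered to a different EU."""
        self.consumer_eus |= 1 << eu_id

    def consumers(self):
        """Read after computation: the EUs to multi-cast the result to."""
        return [i for i in range(self.consumer_eus.bit_length())
                if self.consumer_eus >> i & 1]

buf = deque(maxlen=16)             # configured as a 16-entry FIFO
e = BufferEntry("add r1, r2, r3")
e.add_consumer(2)
e.add_consumer(5)
buf.append(e)
```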
Machine Configurations (Atom [Gerosa08], Xeon [Tam06], WiDGET)
• L1 I / D*: 32 KB, 4-way, 1 cycle (all)
• BR predictor†: TAGE predictor; 16-entry RAS; 64-entry, 4-way BTB (all)
• Instr engine: Atom: 2-way front-end & back-end; Xeon & WiDGET: 4-way front-end & back-end, 128-entry ROB
• Exec core:
  – Atom: 16-entry unified instr queue; 2 INT, 2 FP, 2 AG
  – Xeon: 32-entry unified instr queue; 3 INT, 3 FP, 2 AG; 0-cycle operand bypass to anywhere in the core
  – WiDGET: 16-entry instr buffer per EU; 1 INT, 1 FP, 1 AG per EU; 0-cycle operand bypass within a cluster
• Disambiguation†: No Store Queue (NoSQ) [Sha06] (all)
• L2 / L3 / DRAM*: 1 MB, 8-way, 12 cycles / 4 MB, 16-way, 24 cycles / ~300 cycles
• Process: 45 nm
* Based on Xeon  † Configuration choice
Area Model (45 nm)
• Assumptions
  – Single-threaded uniprocessor
  – On-chip 1 MB L2
  – Atom chip ≈ WiDGET (2 EUs, 1 buffer per EU)
• WiDGET area: 10% larger than Mite, 19% smaller than Neon
[Figure: area (mm²) of Mite, WiDGET, and Neon]
Harmonic Mean IPCs
• Best case: 26% better than Neon
• Dynamic performance range: 3.8×
[Figure: normalized IPC vs. instruction buffers per EU (1–4) for Neon, Mite, and WiDGET with 1–8 EUs]
Harmonic Mean Power
• 8–58% power savings compared to Neon
• 21% power savings to match Neon's performance
• Dynamic power range: 2.2×
[Figure: normalized power vs. instruction buffers per EU (1–4) for Neon, Mite, and WiDGET with 1–8 EUs]
Geometric Mean Power Efficiency (BIPS³/W)
• Best case: 2× Neon, 21× Mite
• 1.5× the efficiency of Xeon at the same performance
[Figure: BIPS³/W for Neon, Mite, and WiDGET configurations]
Energy-Proportional Computing for Servers [Barroso07]
• Servers
  – 10–50% utilization most of the time, yet availability is crucial
  – Common energy-saving techniques inapplicable
  – 50% of full power even during low utilization
• Solution: energy proportionality
  – Energy consumption in proportion to work done
• Key features
  – Wide dynamic power range
  – Active low-power modes (better than sleep states with wake-up penalties)
PowerNap [Meisner09]
• Goals
  – Reduce server idle power
  – Exploit frequent idle periods
• Mechanisms
  – System level
  – Reduce transition time into & out of the nap state
  – Ease power-performance trade-offs
  – Modify hardware subsystems with high idle power (e.g., DRAM self-refresh, variable-speed fans)
Thread Motion [Rangan09]
• Goals
  – Fine-grained power management for CMPs
  – An alternative to per-core DVFS
  – High system throughput within a power budget
• Mechanisms
  – Migrate threads rather than adjusting voltage
  – Homogeneous cores in multiple static voltage/frequency domains
  – 2 migration policies: time-driven & miss-driven
Dynamic HW Resizing
• Resize under-utilized structures for power savings
  – Eliminates transistor switching
  – e.g., IQs, LSQs, ROBs, caches
• Mechanisms
  – Physical: enable/disable segments (or associativity)
  – Logical: limit usable space
  – Wire partitioning with tri-state buffers
• Policies: performance (e.g., IPC, ILP), occupancy, usefulness
Vs. Heterogeneous CMPs
• Their way
  – Equip the chip with both small and powerful cores
  – Migrate a thread to a powerful core for higher ILP
• Shortcomings
  – More design and verification time
  – Bound to static design choices
  – Poor performance for non-targeted apps
  – Difficult resource scheduling
• My way
  – Get high ILP by aggregating many in-order EUs
[Figure: heterogeneous CMP — small cores and a large OoO core sharing an L2]
Vs. Dynamic Multi-Cores
• Their way
  – Deploy small cores for TLP
  – Dynamically fuse cores for higher ILP
• Shortcomings
  – Large centralized structures [Ipek07]
  – Non-traditional ISA [Kim07]
• My way
  – Only "fuse" EUs
  – No recompilation or binary translation
[Figure: small cores fused around a shared L2]
Vs. Thread-Level Speculation
• Their way
  – SW divides the program into contiguous segments
  – HW runs speculative threads in parallel
• Shortcomings
  – Only successful for regular program structures
  – Load imbalance
  – Squash propagation
• My way
  – No SW reliance
  – Supports a wider range of programs
[Figure: cores with speculation support sharing an L2]
Vs. Braid Architecture [Tseng08]
• Their way
  – ISA extension
  – SW re-orders instructions based on dependencies
  – HW sends a group of instructions to FIFO issue queues
• Shortcomings
  – Re-ordering limited to basic blocks
• My way
  – No SW reliance
  – Exploit dynamic dependencies
Vs. Instruction-Level Distributed Processing (ILDP) [Kim02]
• Their way
  – New ISA or binary translation
  – SW identifies instruction dependencies
  – HW sends a group of instructions to FIFO issue queues
• Shortcomings
  – Loses binary compatibility
• My way
  – No SW reliance
  – Exploit dynamic dependencies
Vs. Multiscalar
• Similarities
  – ILP extraction from sequential threads
  – Parallel execution resources
• Differences
  – Divide by data dependency, not control dependency
    • Less communication between resources
    • No imposed resource ordering
  – Communication via multi-cast rather than a ring
  – Higher resource utilization
  – No load-balancing issue
Approach
• Simple building blocks for low power
  – In-order EUs
• Sea of resources
  – Power-performance trade-off through resource allocation
• Distributed buffering for:
  – Latency tolerance
  – Coarse-grain out-of-order (OoO) execution