WiDGET: Wisconsin Decoupled Grid Execution Tiles
Yasuko Watanabe*, John D. Davis†, David A. Wood*
*University of Wisconsin †Microsoft Research
ISCA 2010, Saint-Malo, France
Power vs. Performance
• A full range of operating points on a single chip
[Figure: normalized chip power vs. normalized performance for a single core, spanning Atom-like to Xeon-like operating points]
Executive Summary
• WiDGET framework
  – Sea of resources
  – In-order Execution Units (EUs)
    • ALU & FIFO instruction buffers
    • Distributed in-order buffers → OoO execution
  – Simple instruction steering
  – Core scaling through resource allocation
    • ↓ EUs → slower with less power
    • ↑ EUs → turbo speed with more power
[Figure: Intel CPU thermal design power trend, 1971–2009 — from the 4004, 8008, 8080, 8085, 8086, 286, 386, and 486 through the Pentium family, Pentium 4, Pentium D, Core 2, and Core i7, with TDP rising from roughly 0.15 W toward 150 W]
Conflicting Goal: Low Power & High Performance
• Need for high single-thread performance
  – Amdahl's Law [Hill08]
  – Service-level agreements [Reddi10]
• Challenge: energy-proportional computing [Barroso07]
• Prior approach: dynamic voltage and frequency scaling (DVFS)
Diminishing Returns of DVFS
• Near saturation in voltage scaling
• Innovations in microarchitecture needed
[Figure: minimum operating voltage (V) vs. year, 1997–2009, for IBM PowerPC 405LP, Intel XScale 80200, Transmeta Crusoe TM5800, Intel Itanium Montecito, and Atom Silverthorne, approaching Vmin]
Outline
• High-level design overview
• Microarchitecture
• Single-thread evaluation
• Conclusions
High-Level Design
• Sea of resources
[Figure: chip layout — a grid of in-order Execution Units (EUs) and Instruction Engines (thread context management, front-end + back-end), each with L1I/L1D caches, around a shared L2]
WiDGET Vision
• Just 5 examples; much more can be done.
[Figure: five example resource allocations trading off TLP, ILP, and power]
In-Order EU
• Executes 1 instruction/cycle
• EU aggregation for OoO-like performance
  – Increases both issue BW & buffering
  – Prevents stalled instructions from blocking ready instructions
  – Extracts MLP & ILP
[Figure: EU internals — in-order instruction buffers, an operand buffer, and routers linking the EU to its neighbors]
EU Cluster
• Full bypass within a cluster
• 1-cycle inter-cluster link
[Figure: EUs grouped into clusters, attached to Instruction Engines (IE 0, IE 1) with their L1I/L1D caches and a shared L2]
Instruction Engine (IE)
• Thread-specific structures
• Front-end + back-end, similar to a conventional OoO pipe
• Steering logic for distributed EUs
  – Achieves OoO performance with in-order EUs
  – Exposes independent instruction chains
[Figure: IE pipeline — front-end (Fetch, Decode, Rename, branch predictor, register file) feeding the steering stage, and back-end (ROB, Commit)]
Coarse-Grain OoO Execution
[Figure: issue order of instructions 1–8 under conventional OoO issue vs. WiDGET — WiDGET steers independent chains to separate in-order buffers, approximating the OoO order]
Methodology
• Goal: power proportionality
  – Wide performance & power ranges
• Full-system execution-driven simulator
  – Based on GEMS
  – Integrated Wattch and CACTI
• SPEC CPU2006 benchmark suite
• 2 comparison points
  – Neon: aggressive processor for high ILP
  – Mite: simple, low-power processor
• Configurations: 1–8 EUs per IE; 1–4 instruction buffers per EU
Power Proportionality
• 21% power savings to match Neon's performance
• 8% power savings for 26% better performance than Neon
• Power scaling of 54% to approximate Mite
• Covers both Neon and Mite on a single chip
[Figure: normalized chip power vs. normalized performance for Neon, Mite, and WiDGET configurations from 1 EU + 1 instruction buffer up to 8 EUs + 4 instruction buffers]
Power Breakdown
• Less than ⅓ of Neon's execution power
  – Due to no OoO scheduler and limited bypass
• Increase in WiDGET's power caused by:
  – More EUs and instruction buffers
  – Higher utilization of other resources
[Figure: normalized power broken down into L3, L2, L1D, L1I, fetch/decode/rename, backend, ALU, and execution for Neon, Mite, and WiDGET with 1–8 EUs and 1–4 instruction buffers each]
Conclusions
• WiDGET
  – EU provisioning for a power-performance target
  – Trades complexity for power
  – OoO approximation using in-order EUs
  – Distributed buffering to extract MLP & ILP
• Single-thread performance
  – Scales from close to Mite to better than Neon
  – Power-proportional computing
How is this different from the next talk?
[Table: WiDGET [Watanabe10] vs. Forwardflow [Gibson10] — both share the scalable-cores vision; WiDGET approximates OoO with in-order EUs and a steering mechanism]
Thank you! Questions?
Backup Slides
• DVFS
• Design choices of WiDGET
• Steering heuristic
• Memory disambiguation
• Comparison to related work
  – Vs. clustered architectures
  – Vs. complexity-effective superscalars
• Steering cost model
• Steering mechanism
• Machine configuration
• Area model
• Power efficiency
• Vs. dynamic HW resizing
• Vs. heterogeneous CMPs
• Vs. dynamic multi-cores
• Vs. thread-level speculation
• Vs. Braid architecture
• Vs. ILDP
Dynamic Voltage/Frequency Scaling (DVFS)
• Dynamically trades power for performance
  – Changes voltage and frequency at runtime
  – Often regulated by the OS
• Slow response time
• Linear reduction of V & F
  – Cubic reduction in dynamic power
  – Linear reduction in performance
  – Quadratic reduction in dynamic energy
• Effective for thermal management
• Challenges
  – Controlling DVFS
  – Diminishing returns of DVFS
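The power/performance/energy ratios above follow directly from the CMOS dynamic-power equation. A minimal sketch (not from the slides; the function names are invented) that scales voltage and frequency together by a factor s:

```python
# Dynamic (switching) power of CMOS logic: P = C * V^2 * f.
# Scaling V and f together by s gives s^3 power, s performance,
# and s^2 energy per unit of work -- the ratios cited on the slide.

def dynamic_power(c: float, v: float, f: float) -> float:
    """P = C * V^2 * f."""
    return c * v * v * f

def dvfs_ratios(s: float) -> dict:
    """Scale both V and f by s; report power, performance, and energy ratios."""
    base = dynamic_power(1.0, 1.0, 1.0)
    scaled = dynamic_power(1.0, s, s)
    perf = s                      # performance tracks frequency (linear)
    power = scaled / base         # = s**3 (cubic)
    energy = power / perf         # energy per unit of work = s**2 (quadratic)
    return {"power": power, "performance": perf, "energy": energy}

r = dvfs_ratios(0.8)   # e.g., power drops to 0.8**3 = 51.2% of baseline
```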
Service-Level Agreements (SLAs)
• Expectations between consumer and provider
  – QoS, boundaries, conditions, penalties
• Example: web server SLA
  – Combination of latency, throughput, and QoS (minimum % performed successfully)
• Guaranteeing SLAs on WiDGET
  1. Set deadline & throughput goals (adjust EU provisioning)
  2. Set target machine specs ("processor X-like with X GB memory")
Design Choice of WiDGET (1/2)
• Nehalem-like CMP of conventional OoO cores (Fetch, Decode, Rename, OoO IQ, ROB, Commit; BR Pred, I$, D$, RF) sharing an L2
• OoO issue queue + full bypass = 35% of processor power
[Figure: Nehalem-like CMP floorplan — many identical OoO cores around a shared L2, with one core's pipeline expanded]
Design Choice of WiDGET (2/2)
• OoO issue queue + full bypass = 35% of processor power
• Replace them with simple building blocks
• Decouple in-order execution from the rest of the pipeline
[Figure: the same Nehalem-like CMP with the OoO issue logic replaced by decoupled in-order execution units]
Steering Heuristic
• Based on dependence-based steering [Palacharla97]
  – Expose independent instruction chains
  – Place the consumer directly behind its producer
  – Stall steering when no empty buffer is found
• WiDGET: power-performance goal
  – Emphasize locality & scalability
• Consumer-push operand transfers
  – Send the steered EU ID to the producer EU
  – Multi-cast the result to all consumers
[Figure: steering decision tree — 0 outstanding ops: any empty buffer; 1 outstanding op: the producer's buffer if a slot is available behind the producer, else an empty buffer within the cluster; 2 outstanding ops: behind either producer if available, else an empty buffer in either producer's cluster]
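The decision tree above can be sketched in a few lines. This is a simplified software model, not the actual hardware; the function and parameter names are invented for illustration:

```python
# Dependence-based steering sketch: prefer a slot directly behind a
# producer, then an empty buffer in a producer's cluster, then stall.

def steer(producers, empty, behind_free, cluster_of):
    """
    producers:   buffer IDs of outstanding producers (0, 1, or 2 of them)
    empty:       set of currently empty buffer IDs
    behind_free: producer buffers whose tail slot can take the consumer
    cluster_of:  maps buffer ID -> cluster ID
    Returns the chosen buffer ID, or None (stall steering).
    """
    if not producers:                 # 0 outstanding ops: any empty buffer
        return min(empty) if empty else None
    for p in producers:               # 1-2 outstanding: directly behind a producer?
        if p in behind_free:
            return p
    clusters = {cluster_of[p] for p in producers}
    local = [b for b in empty if cluster_of[b] in clusters]
    if local:                         # empty buffer within a producer's cluster
        return min(local)
    return None                       # no empty buffer found: stall
```

The cluster-affined fallback is what gives WiDGET its locality emphasis relative to plain dependence-based steering.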
Opportunities / Challenges
• Decoupled design + modularity
  – Reconfiguration by turning components on/off
  – Core scaling by EU provisioning rather than fusing cores
  – Core customization for ILP, TLP, DLP, MLP & power
• Challenges
  – Parallelism-communication trade-off
  – Varying communication demands
Memory Disambiguation on WiDGET
• Challenges arising from modularity
  – Less communication between modules
  – Fewer centralized structures
• Benefits of NoSQ
  – Memory dependency → register dependency
  – Reduced communication
  – No centralized structure
  – Only register-dependency relations between EUs
  – Faster execution of loads
Memory Instructions?
• No LSQ, thanks to NoSQ [Sha06]
• Instead:
  – Exploit in-window store-load forwarding
  – Loads: predict at Rename whether a dependent store is in flight
    • If so, read from the store's source register, not from the cache
    • Else, read from the cache
    • At Commit, re-execute if necessary
  – Stores: write at Commit
  – Prediction: dynamic distance in stores; path-sensitive
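The rename-time decision can be illustrated with a toy distance predictor. This is a deliberately simplified sketch of the NoSQ idea (no path sensitivity, invented names), not the mechanism from [Sha06] itself:

```python
# Store-distance prediction sketch: a load predicted to depend on the
# d-th most recent in-flight store reads that store's source register
# instead of the cache; mispredictions are caught by re-execution at commit.

class DistancePredictor:
    def __init__(self):
        self.table = {}                   # load PC -> predicted store distance

    def predict(self, load_pc):
        return self.table.get(load_pc)    # None = no forwarding predicted

    def train(self, load_pc, distance):
        self.table[load_pc] = distance

def rename_load(load_pc, inflight_stores, predictor):
    """At Rename: if a dependent store is predicted in flight, bypass the
    cache and name the store's source register as the load's value source."""
    d = predictor.predict(load_pc)
    if d is not None and d <= len(inflight_stores):
        store = inflight_stores[-d]       # d-th most recent in-flight store
        return ("forward", store["src_reg"])
    return ("cache", None)
```

Turning the memory dependence into a register dependence is what lets the EUs communicate only through register names.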
Comparison to Related Work

Design               | Scale Up & Down? | Symmetric? | Decoupled Exec? | In-Order? | Wire Delays? | Data Driven? | ISA Compatibility?
WiDGET               | √   | √   | √     | √     | √ | √     | √
Adaptive Cores       | X   | -   | √ / X | √ / X | - | -     | √
Heterogeneous CMPs   | X   | X   | X     | √ / X | - | -     | √
Core Fusion          | √   | √   | X     | X     | √ | -     | √
CLP                  | √   | √   | √     | √     | √ | √     | X
TLS                  | X   | √   | X     | √ / X | - | X     | √
Multiscalar          | X   | √   | X     | X     | - | X     | X
Complexity-Effective | X   | √   | √     | √     | X | √     | √
Salverda & Zilles    | √   | √   | X     | √     | X | √     | √
ILDP & Braid         | X   | √   | √     | √     | - | √     | X
Quad-Cluster         | X   | √   | √     | X     | √ | √ / X | √
Access/Execute       | X   | X   | X     | √     | - | √     | X
Vs. OoO Clusters
• OoO clusters
  – Goal: superscalar ILP without impacting cycle time
  – Decentralize the deep, wide structures of superscalars
  – Cluster: a smaller OoO design
  – Steering goal: high performance through communication hiding & load balance
• WiDGET
  – Goal: power & performance
  – Designed to scale cores up & down
  – Cluster: 4 in-order EUs
  – Steering goal: localization to reduce communication latency & power
EX: Steering for Locality
[Figure: a 6-instruction dependence chain steered under clustered architectures (e.g., advanced RMBS, modulo) vs. WiDGET — clustered steering maintains load balance across clusters at the cost of inter-cluster delay, while WiDGET exploits locality by keeping the chain within one cluster]
Vs. Complexity-Effective Superscalars
• Palacharla et al.
  – Goal: performance
  – Consider all buffers for steering and issuing (more buffers → more options)
• WiDGET
  – Goal: power-performance
  – Requirements: shorter wires, power gating, core scaling
  – Differences: localization & scalability
    • Cluster-affined steering
    • Keep dependent chains nearby
    • Issue selection only from a subset of buffers
  – New question: which empty buffer to steer to?
Steering Cost Model [Salverda08]
• Steer to distributed IQs
• The steering policy determines issue time
  – Constrained by dependencies, structural hazards, and issue policy
• Ideal steering issues an instruction:
  – As soon as it becomes ready (horizon)
  – Without blocking others (frontier — the constraint of in-order issue)
• Steering cost = horizon − frontier
  – Good steering minimizes |steering cost|
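The cost computation itself is tiny; what is expensive is evaluating it for every IQ. A sketch under the slide's definitions (function names are mine, not from [Salverda08]):

```python
# Steering-cost sketch: cost = horizon - frontier per candidate in-order IQ;
# the best candidate minimizes the absolute cost, i.e., the instruction
# issues as soon as it is ready without blocking later instructions.

def steering_cost(horizon: int, frontier: int) -> int:
    """horizon: earliest cycle the instruction's operands are ready;
    frontier: earliest cycle this in-order IQ can issue another instruction."""
    return horizon - frontier

def best_queue(horizon: int, frontiers: list) -> int:
    """Index of the IQ whose next issue slot best matches readiness."""
    costs = [abs(steering_cost(horizon, f)) for f in frontiers]
    return costs.index(min(costs))
```

Note the model still assumes the steering logic can see every IQ's frontier and knows execution latencies in advance, which is exactly the complexity argument raised on the next slide.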
EX: Application of Cost Model
• Steer instruction 3 to in-order IQs
• Challenges
  – Must check all IQs to find an optimal steering
  – Actual execution times are unknown in advance
• Argument of Salverda & Zilles
  – Too complex to build, or
  – Too many execution resources needed to match OoO
[Figure: steering instruction 3 into IQ 0–3 yields costs (horizon H minus frontier F) of −1, −1, 0, and 1]
Impact of Comm Delays
[Figure: a 5-instruction example — spread across IQs, execution latency is 3 cycles with free communication but 5 cycles once a 1-cycle inter-IQ latency is added; steering the dependent chain into one IQ instead trades parallelism for communication and finishes in 4 cycles]
Observation Under Comm Delays
• Not beneficial to spread instructions
  – Reduced pressure for more execution resources
• Not much need to consider distant IQs
  – Reduced problem space
  – Simplified steering
Instruction Steering Mechanism
• Goals
  – Expose independent instruction chains
  – Achieve OoO performance with multiple in-order EUs
  – Keep dependent instructions nearby
• 3 things to keep track of
  – The producer's location → Last Producer Table
  – Whether the producer already has a consumer → full bit vector
  – Empty buffers → empty bit vector
Steering Example
[Figure: worked example steering instructions 1–8 across four buffers; the Last Producer Table maps each register to its producer's buffer ID and a has-consumer bit, while empty/full bit vectors track the state of the four buffers]
Instruction Buffers
• Small FIFO buffer (configured: 16 entries)
• 1 straight instruction chain per buffer
• Entry: instruction, operands, consumer-EU bit vector
  – Set when a consumer is steered to a different EU
  – Read after computation
  – Multi-cast the result to consumers
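The entry layout above can be sketched as a small data structure. Field and method names here are invented for illustration; the real entry is a hardware structure, not an object:

```python
# Instruction-buffer entry sketch: besides the instruction and operands,
# each entry carries a consumer-EU bit vector so the result can be
# multi-cast to remote consumers after computation.

from dataclasses import dataclass
from collections import deque

@dataclass
class BufferEntry:
    instr: str
    op1: object = None
    op2: object = None
    consumer_eus: int = 0          # bit vector: one bit per EU

    def add_consumer(self, eu_id: int):
        """Set when a consumer is steered to a different EU."""
        self.consumer_eus |= 1 << eu_id

    def consumers(self):
        """Read after computation: the EUs to multi-cast the result to."""
        return [i for i in range(self.consumer_eus.bit_length())
                if self.consumer_eus >> i & 1]

buf = deque(maxlen=16)             # configured as a 16-entry FIFO
e = BufferEntry("add r1, r2, r3")
e.add_consumer(2)
e.add_consumer(5)
buf.append(e)
```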
Machine Configurations (Atom [Gerosa08], Xeon [Tam06], WiDGET)
• L1 I / D*: 32 KB, 4-way, 1 cycle (all)
• BR predictor†: TAGE predictor; 16-entry RAS; 64-entry, 4-way BTB (all)
• Instr engine: Atom: 2-way front-end & back-end; Xeon & WiDGET: 4-way front-end & back-end, 128-entry ROB
• Exec core:
  – Atom: 16-entry unified instr queue; 2 INT, 2 FP, 2 AG
  – Xeon: 32-entry unified instr queue; 3 INT, 3 FP, 2 AG; 0-cycle operand bypass to anywhere in the core
  – WiDGET: 16-entry instr buffer per EU; 1 INT, 1 FP, 1 AG per EU; 0-cycle operand bypass within a cluster
• Disambiguation†: No Store Queue (NoSQ) [Sha06] (all)
• L2 / L3 / DRAM*: 1 MB, 8-way, 12 cycles / 4 MB, 16-way, 24 cycles / ~300 cycles
• Process: 45 nm
* Based on Xeon  † Configuration choice
Area Model (45 nm)
• Assumptions
  – Single-threaded uniprocessor
  – On-chip 1 MB L2
  – Atom chip ≈ WiDGET (2 EUs, 1 buffer per EU)
• WiDGET area: 10% larger than Mite, 19% smaller than Neon
[Figure: area (mm²) of Mite, WiDGET, and Neon]
Harmonic Mean IPCs
• Best case: 26% better than Neon
• Dynamic performance range: 3.8×
[Figure: normalized IPC vs. instruction buffers per EU (1–4) for Neon, Mite, and WiDGET with 1–8 EUs]
Harmonic Mean Power
• 8–58% power savings compared to Neon
• 21% power savings to match Neon's performance
• Dynamic power range: 2.2×
[Figure: normalized power vs. instruction buffers per EU (1–4) for Neon, Mite, and WiDGET with 1–8 EUs]
Geometric Mean Power Efficiency (BIPS³/W)
• Best case: 2× Neon, 21× Mite
• 1.5× the efficiency of Xeon at the same performance
[Figure: BIPS³/W for Neon, Mite, and WiDGET configurations]
Energy-Proportional Computing for Servers [Barroso07]
• Servers
  – 10–50% utilization most of the time, yet availability is crucial
  – Common energy-saving techniques inapplicable
  – 50% of full power even during low utilization
• Solution: energy proportionality
  – Energy consumption in proportion to work done
• Key features
  – Wide dynamic power range
  – Active low-power modes (better than sleep states with wake-up penalties)
PowerNap [Meisner09]
• Goals
  – Reduce server idle power
  – Exploit frequent idle periods
• Mechanisms
  – System level
  – Reduce transition time into & out of the nap state
  – Ease power-performance trade-offs
  – Modify hardware subsystems with high idle power (e.g., DRAM self-refresh, variable-speed fans)
Thread Motion [Rangan09]
• Goals
  – Fine-grained power management for CMPs
  – An alternative to per-core DVFS
  – High system throughput within a power budget
• Mechanisms
  – Migrate threads rather than adjusting voltage
  – Homogeneous cores in multiple static voltage/frequency domains
  – 2 migration policies: time-driven & miss-driven
Dynamic HW Resizing
• Resize under-utilized structures for power savings
  – Eliminates transistor switching
  – e.g., IQs, LSQs, ROBs, caches
• Mechanisms
  – Physical: enable/disable segments (or associativity)
  – Logical: limit usable space
  – Wire partitioning with tri-state buffers
• Policies: performance (e.g., IPC, ILP), occupancy, usefulness
Vs. Heterogeneous CMPs
• Their way
  – Equip the chip with both small and powerful cores
  – Migrate a thread to a powerful core for higher ILP
• Shortcomings
  – More design and verification time
  – Bound to static design choices
  – Poor performance for non-targeted apps
  – Difficult resource scheduling
• My way
  – Get high ILP by aggregating many in-order EUs
[Figure: heterogeneous CMP — small cores and a large OoO core sharing an L2]
Vs. Dynamic Multi-Cores
• Their way
  – Deploy small cores for TLP
  – Dynamically fuse cores for higher ILP
• Shortcomings
  – Large centralized structures [Ipek07]
  – Non-traditional ISA [Kim07]
• My way
  – Only "fuse" EUs
  – No recompilation or binary translation
[Figure: small cores fused around a shared L2]
Vs. Thread-Level Speculation
• Their way
  – SW divides the program into contiguous segments
  – HW runs speculative threads in parallel
• Shortcomings
  – Only successful for regular program structures
  – Load imbalance
  – Squash propagation
• My way
  – No SW reliance
  – Supports a wider range of programs
[Figure: cores with speculation support sharing an L2]
Vs. Braid Architecture [Tseng08]
• Their way
  – ISA extension
  – SW re-orders instructions based on dependencies
  – HW sends a group of instructions to FIFO issue queues
• Shortcomings
  – Re-ordering limited to basic blocks
• My way
  – No SW reliance
  – Exploit dynamic dependencies
Vs. Instruction-Level Distributed Processing (ILDP) [Kim02]
• Their way
  – New ISA or binary translation
  – SW identifies instruction dependencies
  – HW sends a group of instructions to FIFO issue queues
• Shortcomings
  – Loses binary compatibility
• My way
  – No SW reliance
  – Exploit dynamic dependencies
Vs. Multiscalar
• Similarities
  – ILP extraction from sequential threads
  – Parallel execution resources
• Differences
  – Divide by data dependency, not control dependency
    • Less communication between resources
    • No imposed resource ordering
  – Communication via multi-cast rather than a ring
  – Higher resource utilization
  – No load-balancing issue
Approach
• Simple building blocks for low power
  – In-order EUs
• Sea of resources
  – Power-performance trade-off through resource allocation
• Distributed buffering for:
  – Latency tolerance
  – Coarse-grain out-of-order (OoO) execution