UW-Madison Computer Sciences Multifacet Group © 2010
Scalable Cores in Chip Multiprocessors
Thesis Defense, 2 November 2010
Dan Gibson
Executive Summary (1/2)
• “Walls & Laws” suggest future CMPs will need Scalable Cores– Scale Up for Performance (e.g., one thread)
– Scale Down for per-core energy conservation (e.g., many threads)
• Area 1: How to build efficient scalable cores.– Forwardflow, one scalable core– Overprovision rather than borrow
Executive Summary (2/2)
• Area 2: How to use scalable cores:
– Scale at fine granularity:
• Discover the most-efficient configuration
– Scale for multi-threaded workloads:
• Scale up for sequential bottlenecks, improving performance
• Scale down for unimportant executions, improving efficiency
– Use DVFS as a scalable-core proxy
Document Outline
1. Introduction
2. Extended Motivation
   1. Scalable Cores
   2. Background
   3. Related Work
3. Methods
4. Serialized Successor Representation
5. Forwardflow
6. Scalable Cores for CMPs
   1. Scalable Forwardflow
   2. Overprovisioning vs. Borrowing
   3. Power-Awareness
   4. Single-Thread Scaling
   5. Multi-Thread Scaling
   6. DVFS as a Proxy for Scalable Cores
7. Conclusions/Future Work/Reflections
A/B. Supplements
Of Course
Mostly Old Material: Recap
Mostly New Material: Talk Focus
TALK Outline…
If there’s time and interest
Hello, Software. I am a single x86 processor.
'80s–'00s: Single-Core Heyday
Core and Chip Microarchitecture Changed Enormously
386, 1985, 20MHz
486, 1989, 50MHz
P6, 1995, 166MHz
PIV, 2004, 3000MHz
[Plot: clock frequency (MHz, log scale, 10–10000) vs. year (1989–2004) for Intel, IBM, and AMD parts]
Clock Frequency Increased Dramatically
Hello, Software. I am still a single x86 processor.
Hitting the Power Wall
[Plot: thermal design power (W, log scale, 0.15–150) by product year (1971–2009), from the 4004 through the 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, Pentium MMX, Pentium II, Pentium III, Pentium 4, Pentium D, Core 2, and Core i7]
• Kneejerk Reactions:
• Reduce Clock Frequency (e.g., 3.0 GHz to 2.4-ish GHz)
• De-Emphasize Pipeline Depth (e.g., Pentium M)
• What about Performance?
Resource borrowed from Yasuko's WiDGET ISCA 2010 talk. One example data point represents a range of actual products.
Chip Multiprocessors (CMPs)
1. Can’t clock (much) faster…
2. Hard to make uArch faster…
Use Die Area for More Cores!
Hello, Software. I am TWO x86 processors.(And my descendants will have more…)
• “Fundamental Turn Toward Concurrency” [Sutter2005]
• Software must now change to see continued performance gains.
This Won’t Be Easy.
In 1965, Gordon Moore sketched out his prediction of the pace of silicon technology. Decades later, Moore’s Law remains true, driven largely by Intel’s unparalleled silicon expertise.
Copyright © 2005 Intel Corporation.
• Cost per Device Falls Predictably
– Density rises (devices/mm²)
– Device size shrinks
Rock, 65nm [JSSC2009] Rock16, 16nm [ITRS2007]
Moore’s Law in the Multicore Era
(If you want 1024 threads)
Or “Fell”
Amdahl’s Law
Parallel Runtime = (1 - f) + f/N
f = Parallel Fraction
N = Number of Cores
[Plot: runtime normalized to one core vs. parallel fraction f (0.00 to 0.99), for N = 8]
Sequential: Not Good
Partially-Parallel: OK
Highly-Parallel: Very Good
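Stated as code, the model above is just a few lines; a minimal sketch (the f values mirror the plot's axis):

#include <cstdio>

// Normalized runtime under Amdahl's Law: the serial fraction (1 - f)
// runs at unit speed; the parallel fraction f is split across N cores.
double amdahl_runtime(double f, int n) {
    return (1.0 - f) + f / n;
}

int main() {
    const int n = 8; // N = 8, as in the plot above
    for (double f : {0.00, 0.10, 0.25, 0.50, 0.75, 0.85, 0.90, 0.95, 0.99})
        std::printf("f = %.2f -> normalized runtime = %.3f\n",
                    f, amdahl_runtime(f, n));
    return 0;
}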
Utilization Wall (aka SAF)
• Simultaneously Active Fraction (SAF): Fraction of devices in a fixed-area design that can be active at the same time, while still remaining within a fixed power budget.
[Venkatesh2009]
[Plot: dynamic SAF vs. technology node (90nm, 65nm, 45nm, 32nm) for LP and HP devices; SAF falls at each node] [Chakraborty2008]
UTILIZATION
Architects Boxed In: Walls and Laws
• Power Wall (PW): Cannot clock much faster.
• Utilization Wall (UW): Cannot use all devices.
• Amdahl's Law (AL): Single threads need help.
– Not all code is parallel.
→ Scalable CMPs
POWER
AMDAHL
Scalable CMPs→Scalable Cores
• Scale UP for Performance
– Use more resources for more performance
– (i.e., 2 strong oxen)
• Scale DOWN to Conserve Energy
– Exploit TLP with many small cores
– (i.e., 1024 chickens)
If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?
-Attributed to Seymour Cray
Scalable Cores in CMPs
1. How to build a Scalable Core?
– Should be efficient
– Should offer a wide power/performance range
2. How to use Scalable Cores?
– Optimize single-thread efficiency
– Detect and ameliorate bottlenecks
This Thesis:
Area 1: Efficient Scalable Cores
Fear leads to Anger, Anger leads to Hate, Hate leads to Suffering
Naming Association
Broadcast Inefficiency
• Forwardflow Core Architecture
– Raise average ILP/MLP, not peak
– Efficient SRAMs, no CAMs
• Serialized Successor Representation (SSR)
– Use pointers instead of names
Basis for a Scalable Core Design
Area 2: Scalable Cores in CMPs
• How to scale cores:
– Overprovision each core?
– Borrow/merge cores?
• When to scale cores:
– For one thread?
– For many threads?
• How to continue:
– DVFS as a proxy for a scalable core
Outline
• Introduction: Scalable Cores
– Motivation (Why scale in the first place?)
– Definition
• Scalable Cores for CMPs
– How to scale:
• Dynamically-Scalable Core (Forwardflow)
• Overprovision or Borrow Resources?
– When to scale: Hardware Scaling Policies
• For single-thread efficiency
• For multi-threaded workloads
• Conclusions/Wrap-Up
Forwardflow (FF): A Scalable Core
• Forwardflow Core =
Frontend (L1-I, Decode, ARF) +
Distributed Execution Logic/Window (DQ) +
L1-D Cache
Scale Down: Use a Smaller Window
Scale Up: Use a Bigger Window
Window Scaling vs. Core Scaling
• FF: Only scales the instruction window
– Not width,
– Not registers,
– etc.
• How does window scaling scale the core?
– By affecting demand on the unscaled components (FE, L1-D)
– Analogous to Bernoulli's Principle
• Dynamic power of an unscaled component: P = α · C · V² · f
– The activity factor α captures demand; window scaling changes α, not C, V, or f
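To make the equation concrete, a minimal sketch (the function and the example numbers are illustrative, not measurements from the thesis):

// Dynamic power P = alpha * C * V^2 * f. Window scaling changes only
// the activity factor alpha of unscaled components like the frontend
// and L1-D; capacitance, voltage, and frequency stay fixed.
double dynamic_power(double alpha, double capacitance_f,
                     double vdd_volts, double freq_hz) {
    return alpha * capacitance_f * vdd_volts * vdd_volts * freq_hz;
}

// Example: scaling up raises frontend demand (say, alpha 0.20 -> 0.26),
// raising frontend dynamic power ~30% with C, V, and f unchanged.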
FF Dynamic Configuration Space
CONFIG | SOBRIQUET           | DESCRIPTION
F-32   | "Fully scaled down" | 32-entry instruction window, single issue (1/4 of a DQ bank group)
F-64   |                     | 64-entry instruction window, dual issue (1/2 of a DQ bank group)
F-128  | "Nominal"           | 128-entry instruction window, quad issue (one full DQ bank group)
F-256  |                     | 256-entry instruction window, "quad" issue (2 BGs)
F-512  |                     | 512-entry instruction window, "quad" issue (4 BGs)
F-1024 | "Fully scaled up"   | 1K-entry instruction window, "quad" issue (8 BGs)
Configuration
Component | Configuration
Mem. Cons. Model | Sequential Consistency
Coherence Protocol | MOESI Directory (single chip)
Store Issue Policy | Permissions prefetch at X
Frequency | 3.0 GHz
Technology | 32nm
Window Size | Varied by experiment
Disambiguation | NoSQ
Branch Prediction | TAGE + 16-entry RAS + 256-entry BTB
Frontend | 7-cycle predict-to-dispatch
L1-I Caches | 32KB, 4-way, 64B, 4-cycle, 2 proc. ports
L1-D Caches | 32KB, 4-way, 64B, 4-cycle LTU, 4 proc. ports, WI/WT, included by L2
L2 Caches | 1MB, 8-way, 64B, 11-cycle, WB/WA, private
L3 Cache | 8MB, 16-way, 64B, 24-cycle, shared
Main Memory | 8GB, 2 DDR2-like controllers (64 GB/s peak BW), 300-cycle latency
Interconnect | 2D mesh, 16B links
FF Scalable Core Performance
[Plot: runtime normalized to F-128 for Gmean, h264ref, and libquantum across F-32 through F-1024; one clipped bar reaches 3.2]
Mostly compute-bound (h264ref): not much scaling
Mostly memory-bound (libquantum): great scaling
FF Scalable Core Power
[Stacked-bar plot: power normalized to F-128, split into FE, DQ/ALU, MEM, and Static components, for F-32 through F-1024]
W.r.t. nominal (F-128):
Scale Up (8x window): +27% MEM power, +91% DQ/ALU power, +28% FE power
Scale Down (1/4 window): -32% MEM power, -54% DQ/ALU power, -39% FE power
FF Recap
• Forwardflow Scalable Core
– Scale Down: Use a Smaller Window
– Scale Up: Use a Bigger Window
More details on Forwardflow
Outline
• Introduction: Scalable Cores
– Motivation (Why scale in the first place?)
– Definition
• Scalable Cores for CMPs
– How to scale:
• Dynamically-Scalable Core (Forwardflow)
• Overprovision or Borrow Resources?
– When to scale: Hardware Scaling Policies
• For single-thread efficiency
• For multi-threaded workloads
• Conclusions/Wrap-Up
Overprovisioning vs. Borrowing
• Scaling core performance means scaling core resources
– From where can a scaled core acquire more resources?
• Option 1: Overprovision All Cores
– Every core can scale up fully using a core-private resource pool
• Option 2: Borrow From Other Cores
– Cores share resources with neighbors
What Resources?
• Forwardflow:
– Resources = DQ bank groups (i.e., window space, functional units)
• Simple Experiment:
– Overprovision: each core has 8 BGs, enough for F-1024. What is the area cost?
– Borrow: each core has 4 BGs, enough to scale to F-512, and borrows neighbors' BGs to reach F-1024. What is the performance cost?
Per-Core Overprovisioning
[Floorplan at 32nm: an overprovisioned tile (FE, L1-D, L2, L3 bank) measures 8.96mm x 4.15mm = 37.2mm²; the overprovisioned CMP measures 17.9mm x 16.6mm = 298mm²]
Scale Up: Activate More Resources
Resource Borrowing
[Floorplan at 32nm: a borrowing tile measures 8.31mm x 4.15mm = 34.5mm²; the borrowing CMP measures 16.6mm x 16.6mm = 276mm²]
Scale Up: Borrow Resources from Neighbor
Area Cost (32nm), borrowing vs. overprovisioning:
Per-Core: 12.3mm² vs. 15.6mm² (+27%)
Per-Tile: 34.5mm² vs. 37.2mm² (+8%)
Per-CMP: 276mm² vs. 298mm² (+7%)
Performance Cost of Borrowing
• Borrowing slower?
– Maybe not: comparable wire delay (in this case)
– Maybe: crosses a physical core boundary
• Global vs. local wiring?
• Cross a clock domain?
• Simple Experiment (32nm):
– 2-cycle lag crossing the core boundary
– Slows inter-BG communication
– Slows dispatch
A Loose Loop
[Plot: runtime normalized to overprovisioned F-1024 (F-1024O), comparing F-512 and borrowed F-1024B]
• With 2 cycles of lag:
– 9% performance loss from borrowing (F-1024B) w.r.t. overprovisioning (F-1024O)
– Essentially no performance improvement from scaling up!
Overprovisioning vs. Borrowing
• Overprovisioning CAN be cheap
– FF: 7% CMP area
– CoreFusion: 12.5% area from borrowing [Ipek2007]
• If borrowing introduces even small delays, it may no longer be worthwhile to scale at all.
– This effect is worse if borrowing occurs at smaller design points.
Outline
• Introduction: Scalable Cores
– Motivation (Why scale in the first place?)
– Definition
• Scalable Cores for CMPs
– How to scale:
• Dynamically-Scalable Core (Forwardflow)
• Overprovision or Borrow Resources?
– When to scale: Hardware Scaling Policies
• For single-thread efficiency
• For multi-threaded workloads
• Conclusions/Wrap-Up
What to do for f=0.00
• What is important?
– Performance: Just scale up (done)
– Efficiency: Pick the most efficient configuration?
• How to find the right configuration?
• Can we do better?
[Plot: normalized runtime vs. parallel fraction f; f = 0.00 is the all-sequential case]
What about local efficiency? (i.e., phases)
• Applications may exhibit phases at "micro-scale"
– Not all phases are equal
# Sum an array
l_array: load [R1+0] -> R2
add R2 R3 -> R3
add R1 64 -> R1
brnz l_array
...
# Sum a list
l_list: load [R1+8] -> R2
add R2 R3 -> R3
load [R1+0] -> R1
brnz l_list
Array sum: great for big windows (Scale Up?)
List sum: a big window makes no difference (Scale Down?)
[Plot: E*D² normalized to the best static design]
Prior Art (some of it)
• POS (Positional Adaptation) [Huang03]:
– Maps code regions to configurations
– Static profiling; measures efficiency
• PAMRS (Power-Aware uArch Resource Scaling) [Iyer01]:
– Detects "hot spots"
– Measures every configuration's efficiency, picks the best
Want: the efficiency of POS, but the dynamic response of PAMRS
MLP-based Window Size Estimation
• Play to the strengths of the uArch:
– FF: pursue and measure MLP
– A different core would measure something else
• Find the smallest window that will expose as much MLP as the largest window
• Hardware:
– Poison bits
– Register names
– Load miss bit
– Counter
– LFSR
Results
Explain window size estimation in detail with a gory example
FG Scaling Results
• MLP:
– No profiling needed
– Safe: hurts efficiency by >10% for only 1 benchmark
– Compare to: POS (8 benchmarks), PAMRS (20 benchmarks)
[Plot: normalized E*D² for POS, PAMRS, and MLP across benchmarks; MLP shows fewer bad cases than PAMRS, and fewer than POS]
Recap: What to do for f=0.00
• Profiling (POS):
– Can help, might hurt
• Dynamic response (seek MLP):
– Seldom hurts; usually finds the most efficient configuration
[Plot: normalized runtime vs. parallel fraction f]
Outline
• Introduction: Scalable Cores
– Motivation (Why scale in the first place?)
– Definition
• Scalable Cores for CMPs
– How to scale:
• Dynamically-Scalable Core (Forwardflow)
• Overprovision or Borrow Resources?
– When to scale: Hardware Scaling Policies
• For single-thread efficiency
• For multi-threaded workloads
• Conclusions/Wrap-Up
What to do for f ≈ 0.25–0.85
• Two opportunities:
1. Sequential bottlenecks
• Detect and fix, i.e., scale up
• Better performance
2. Useless executions
• Detect and fix, i.e., scale down
• Better efficiency
[Plot: normalized runtime vs. parallel fraction f]
What if the OS Knows?
• The OS knows about bottlenecks
– Can scale up a core
• The OS knows about useless work
– Can scale down, or
– Can shut off unneeded cores (e.g., OPMS)
• Result: Amdahl's Law in the Multicore Era [Hill/Marty08]
Runtime = (1 - f)/k + f/N, where k is the speedup of the scaled-up sequential core
[Plot: normalized runtime vs. parallel fraction f, with the sequential fraction sped up by a factor k]
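A minimal sketch of that model as reconstructed above, where k is the sequential core's speedup (the function name is illustrative):

// Normalized runtime when the OS runs the sequential fraction on a
// core scaled up by factor k and the parallel fraction on N cores.
// With k = 1 this reduces to plain Amdahl's Law.
double scaled_runtime(double f, double k, int n) {
    return (1.0 - f) / k + f / n;
}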
If the OS doesn’t know
• Maybe the programmer knows? (Prog)
• An SLE-like lock detector to identify critical sections? (Crit)
• Hardware spin detection? [Wells2006] (Spin)
• Holding a lock… except when spinning? (CSpin)
• Every thread spinning except one? (ASpin)
(a limit study: pretend global communication is OK)
Amdahl Microbenchmark
• Benchmarks × Policies × Configs, plus opacity
• Runtime = (1 - f)/k + f/N: each iteration has a sequential phase of weight (1 - f) and a parallel phase of weight f
[Plot: normalized runtime vs. parallel fraction f; sequential behavior at left, parallel at right, unclear behavior in between]
Runs on real HW
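A hypothetical skeleton of such a microbenchmark, using std::thread; busy_work and the phase weights are illustrative stand-ins, not the thesis's actual harness:

#include <thread>
#include <vector>

// Stand-in for the real kernel: spin through a fixed amount of work.
void busy_work(long units) {
    volatile long sink = 0;
    for (long i = 0; i < units; ++i) sink += i;
}

// One microbenchmark iteration: a sequential phase of weight (1 - f),
// then a parallel phase of weight f split evenly across n threads.
void amdahl_iteration(double f, int n, long total_units) {
    busy_work(static_cast<long>((1.0 - f) * total_units)); // bottleneck
    std::vector<std::thread> workers;
    for (int t = 0; t < n; ++t)
        workers.emplace_back(busy_work,
                             static_cast<long>(f * total_units / n));
    for (auto& w : workers) w.join();  // barrier ends the parallel phase
}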
Prog (Programmer-Guided Scaling)
[Plots: per-thread configuration over time (P0–P7, F-32 through F-1024), driven by sc_hint(slow)/sc_hint(fast), and runtime normalized to F-128 vs. parallel fraction f for Prog against static F-128 and F-1024]
Crit (SLE-Style Lock Detector for Scale-Up)
[Plots: per-thread configuration over time and runtime normalized to F-128 vs. parallel fraction f for Crit against Prog, F-128, and F-1024]
Barrier::Arrive() { l.Lock(); … l.Unlock(); }
Lock::Lock() { CAS(myValue); … }
Lock::Unlock() { CAS(myValue); … }
WTH?
Crit: What goes wrong
• Intuition Mismatch:
– The lock detector implementer's expectations don't match the pthread library implementer's.
1. Critical section != sequential bottleneck
2. Lock+Unlock != CAS + temporal silent store
• More general lesson:
– SW is really flexible; programmers do strange things.
• HW designers: be careful, SW may not be doing what you think.
Spin (Spin Detector for Scale-Down)
[Plot: per-thread configuration over time (P0–P7, F-32 through F-1024)]
Spinning threads: scale down
The one thread that seldom/never spins performs like CSpin (next)
CSpin (Lock Detector for Scale-Up, Spin Detector for Scale-Down)
[Plots: per-thread configuration over time and runtime normalized to F-128 vs. parallel fraction f for CSpin against Prog, Crit, F-128, and F-1024]
The lock detector thinks a lock is held, but the thread is also spinning
ASpin (Spin, but Scale Up if all others Scaled Down)
[Plots: per-thread configuration over time and runtime normalized to F-128 vs. parallel fraction f for ASpin against CSpin, Crit, Prog, F-128, and F-1024]
All spinning: scale up (better late than never)
Amdahl Efficiency
[Plot: normalized E*D² vs. parallel fraction f for F-128, Prog, Spin, and ASpin]
1. The hope of SW parallelism for efficiency seems sound.
2. The "programmer" can help. (Psychology? Difficulty for non-toy programs?)
3a. Spin detection helps, by scaling down.
3b. Can scale up when all others spin (but who counts as "others"?)
Real Workloads?
Workload Behavior
• f ≈ 0.90+ by design: graduate students spend a lot of time making this so
• No Prog scaling policy (no programmer hints available)
• Apache: spin detection helps. Synchronization heavy.
• JBB: synchronization heavy.
• OLTP: Spin alone hurts a little; ASpin helps. Synchronization heavy.
• Zeus: Spin alone hurts a little; ASpin helps. Synchronization heavy.
[Plot: normalized E*D² for F-128, Spin, CSpin, ASpin, and F-1024 across the commercial workloads]
Outline
• Introduction: Scalable Cores
– Motivation (Why scale in the first place?)
– Definition
• Scalable Cores for CMPs
– How to scale:
• Dynamically-Scalable Core (Forwardflow)
• Overprovision or Borrow Resources?
– When to scale: Hardware Scaling Policies
• For single-thread efficiency
• For multi-threaded workloads
– How to continue: DVFS/Models for Future Software Evaluations
• Conclusions/Wrap-Up
Conclusions (1/2)
• How to scale cores:
– Forwardflow: an energy-proportional scalable-window core architecture
• Scale up for performance
• Scale down for energy conservation
– Overprovision resources when cheap
• Borrow only when necessary
• Avoid loose loops
Conclusions (2/2)
• When to scale cores:
– For single-thread efficiency:
• Seek efficient operation intrinsically (FF: MLP)
• Profiling can help, if possible.
– For threaded workloads:
• Scale up for sequential bottlenecks (if you can find them)
• Scale down for useless work
• How to emulate scalable cores:
– Proxy with DVFS, with caveats
Other Contributions
• Side Projects with Collaborators
– Deconstructing Scalable Cores, coming soon
– "Diamonds are an Architect's Best Friend", ISCA 2009
– To CMP or Not to CMP, TR & ANCS poster
• Parallel Programming at Wisconsin
– CS 838, CS 758
• Various Infrastructure Work
– Ruby, Tourmaline, Lapis, GEM5
Fun Facts About This Thesis
• Simulator:
– C++: 135kl (101kl), Python: 16.7kl
– 1188 revisions, 17,476 builds: ~15 builds per day since 5 July 2007
• Forwardflow used to be Fiberflow
– Watch out, Metamucil
• Est. simulation time:
– 2.9B CPU-seconds = 95 cluster-days (just in support of data in this thesis)
Questions/Pointers
Overp./Borrowing
FG Uniproc. Scaling
Multiproc. Scaling
DVFS vs. W. Scaling
SSR
All about FF
Estimating Power
LBUS/RBUS
Scalable Scheduling
Seeking MLP
Other Scalable Cores
Related Work
Backward Ptrs. in the Document
Always in motion is the future.
DVFS vs. Scaling
DVFS Instead of Simulation
• So far:
– "Benchmark" = 1ms–10ms of target time
– Scaling "in the micro" (i.e., much faster than software can react)
• What about longer runs?
– "Benchmark" = minutes or more
– Scaling "in the macro" (i.e., at the scale of systems)
• No real hardware scalable core exists
– Use DVFS instead, as a proxy.
You must unlearn what you have learned.
DVFS Effects
[Diagram: FE, L1-D, L2, L3, DRAM; the DVFS domain covers the core and its private caches]
+Freq: compute operations are faster
+Freq: memory seems slower
+Freq, +Volt: dynamic power higher (~cubic)
+Pdyn: higher temperature leads to higher static power
Example: F-128 @ 3.0 GHz vs. F-128 @ 3.6 GHz
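A first-order sketch of that ~cubic effect (the constants and names are illustrative):

// To first order, voltage scales with frequency, so dynamic power
// P ~ C * V^2 * f grows roughly with the cube of the frequency ratio.
double dvfs_dynamic_power(double base_power_w, double freq_ratio) {
    return base_power_w * freq_ratio * freq_ratio * freq_ratio;
}

// Example: 3.0 GHz -> 3.6 GHz is a 1.2x ratio, so ~1.2^3 = 1.73x
// dynamic power in the DVFS domain (consistent with the ~70%
// DVFS-domain increase reported below).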
HW Scaling Effects
[Diagram: FE, L1-D, L2, L3, DRAM; window scaling grows only the DQ]
+Window: compute operations are not (much) faster
+Window: memory seems faster
+Window: dynamic power higher (~log)
Example: Scale up from F-128 @ 3.0 GHz to F-256 @ 3.0 GHz
How do they compare quantitatively?
DVFS/HW Scaling Performance
[Plot: runtime normalized to F-128 for F-256 and 3.6GHz, a DVFS/scaling configuration pair with comparable performance]
More CPU-bound: prefer DVFS
More memory-bound: prefer window scaling
DVFS/HW Scaling Power
• DVFS: +~38% chip power
– +~70% DVFS-domain dynamic power
– +~20% temperature-induced leakage
• FF scaling: +~10% chip power
– +~2% temperature-induced leakage
[Stacked-bar plot: power normalized to F-128 (FE, DQ/ALU, MEM, Static components) for 3.6GHz and F-256]
DVFS Proxying Scalable Cores
• Performance: OK, with caveats
– CPU-bound workloads: DVFS overestimates scalable-core performance
– Memory-bound workloads: DVFS underestimates scalable-core performance
• Power: not OK.
– DVFS follows the E*D² curve
– An FF/scalable core should beat the E*D² curve
– Use a model instead.
SSR
• Per-Value Distributed Linked List
– Starts at the producer
– Visits each successor
– NULL pointer at the last successor
• Amenable to simple hardware
– Serializes wakeup
ld R4 4 R1
add R1 R3 R3
sub R4 16 R4
st R3 R8
breq R4 R3
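A minimal software sketch of the representation; the pointer encoding and type names are illustrative, not the thesis's hardware:

#include <cstdint>

constexpr uint16_t NULL_PTR = 0xFFFF;   // end-of-chain marker

// Each operand carries one forward pointer: the producer points at its
// first successor, each successor at the next, NULL at the last.
struct Operand { uint16_t next_use = NULL_PTR; };
struct DQEntry { Operand op1, op2, dest; };

// For the code above: the R1 value produced by the ld (entry 1) is
// consumed by the add (entry 2, op1), which is the last successor.
void link_r1_chain(DQEntry* dq) {
    dq[1].dest.next_use = 2 * 2 + 0;    // encode (entry 2, slot op1)
    dq[2].op1.next_use  = NULL_PTR;     // last successor
}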
Effect of Serialized Wakeup
[Plot: normalized runtime for astar, bzip2, gcc, libquantum, and gmean]
• Compared to an idealized window:
– Low mean performance loss from serialized wakeup (+2% runtime)
– Occasionally noticeable (i.e., bzip2, 50%+)
SSR Compiler Optimization
[Plot: normalized runtime for RUU, OoO, and SSR on the long, split, and crit microbenchmarks]
• long: the compiler cannot identify dynamically repeated registers
• split: the compiler can identify dynamically repeated registers, but cannot identify the critical path
• crit: the compiler knows both the dynamically repeated registers and the critical path
Power-Awareness
• How much energy is used by a computation?
– Measure (e.g., with a multimeter)
– Detailed simulation (e.g., SPICE)
– Simple simulation (e.g., WATTCH)
– Simple model (e.g., 10W/core)
E = Σi Ni · Ei, where Ni is the number of activations of element i and Ei is the energy per activation of element i
Measuring Energy Online
Σi Ni · Ei ≈ Σj Cj · Eest,j
Activations Ni: "hard" to measure. Events Cj: "easy" to measure, and correlated with activations.
[Iyer01]: multiply-accumulate in hardware. [Joseph01]: HW perf. counters, works for Pentium-era cores. This work: a scalable core; use the core's own resources to do the computation.
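A minimal sketch of that multiply-accumulate estimate, assuming per-event energy coefficients were fit offline (names are illustrative):

#include <cstddef>

// Online energy estimate: E ~= sum_j C_j * E_est_j, a multiply-
// accumulate over easy-to-measure event counts, with per-event
// coefficients standing in for true per-activation energies.
double estimate_energy_joules(const unsigned long long* event_counts,
                              const double* joules_per_event,
                              std::size_t num_events) {
    double total = 0.0;
    for (std::size_t j = 0; j < num_events; ++j)
        total += event_counts[j] * joules_per_event[j];
    return total;
}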
DVFS Won’t Cut It
• Near saturation in voltage scaling
• Subthreshold DVFS is never energy-efficient [Zhai04]
• A microarchitectural alternative is needed
[Plot: operating voltage range (Vmin to Vmax) by product year, 1996–2010, for the IBM PowerPC 405LP, TransMeta Crusoe TM5800, Intel XScale 80200, Intel Itanium Montecito, and Atom Silverthorne; the usable range shrinks from ~80% to ~33%]
Resource borrowed from David’s “Two Cores” Talk
Scalable Interconnect
Logically: a ring. Scale Down: a ring with fewer elements.
• Not straightforward
• Overprovisioning won't work well: the wrap-around link is ugly
• Needs to support 1-, 2-, 4-, and 8-BG operation
Solution: Two Unidirectional Busses (gasp!)
[Diagram: LBUS/RBUS steering-bit patterns for the F-1024 and F-512 configurations]
Window Estimation Example
# Sum an array (three iterations in flight)
l_array: load [R1+0] -> R2
add R2 R3 -> R3
add R1 64 -> R1
brnz l_array
load [R1+0] -> R2
add R2 R3 -> R3
add R1 64 -> R1
brnz l_array
load [R1+0] -> R2
add R2 R3 -> R3
add R1 64 -> R1
brnz l_array
• First load misses (M): start profiling, poison R2
• add R2: poison propagates to R3 (W); add R1: antidote R1 (independent of the miss)
• Second load: independent miss, set an ELMR bit, poison R2; then poison R3, antidote R1
• Third load: independent miss, set an ELMR bit, poison R2
• ELMR bit positions correspond to window sizes (4, 8, 16, …); MSb(ELMR) = 16 → a 16-entry window is needed to capture the observed MLP.
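A minimal software sketch of the mechanism, assuming a log2-bucketed ELMR and omitting the LFSR-based sampling; field names and widths are illustrative:

#include <bitset>
#include <cstdint>

// MLP-based window estimation: profile a load miss, poison its dest
// register, and let poison flow through dependents. Each independent
// miss (no poisoned source) at dispatch distance d sets ELMR bit
// floor(log2(d)); the most significant set bit names the smallest
// window that exposes the observed MLP.
struct WindowEstimator {
    std::bitset<64> poison;        // poison bit per architectural register
    std::bitset<11> elmr;          // bit i ~ window of 2^i entries
    uint32_t profile_start = 0;    // DQ index of the profiled miss

    void start_profile(uint32_t dq_index, unsigned dest_reg) {
        profile_start = dq_index;
        poison.set(dest_reg);
    }
    void on_load_miss(uint32_t dq_index, bool src_poisoned,
                      unsigned dest_reg) {
        if (!src_poisoned) {              // independent miss: genuine MLP
            uint32_t d = dq_index - profile_start;
            unsigned bit = 0;
            while ((2u << bit) <= d) ++bit;  // log2 bucket of distance
            elmr.set(bit);
        }
        poison.set(dest_reg);             // the miss's output is poisoned
    }
    unsigned needed_window() const {      // MSb(ELMR) -> window size
        for (int b = 10; b >= 0; --b)
            if (elmr.test(b)) return 1u << b;
        return 32;                        // floor: smallest config (F-32)
    }
};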
Adding Hysteresis (1/2)
[Plots: runtime, E*D, E*D², MLP, and the chosen configuration over time for libquantum and astar, vs. F-1024]
1. Many reconfigurations
2. Too small most of the time. Must anticipate, not react.
Adding Hysteresis (2/2)
• Scale Down only "occasionally"
– On a full squash
[Plots: runtime, E*D, and E*D² for the UpDown and UpOnly policies on astar, vs. F-1024]
• Intuition:
– Assume a big window is not useful
– Show, occasionally, that a big window IS useful
Leakage Trends
• Leakage starts to dominate
• SOI & DG technology helps (ca. 2010/2013)
• Tradeoffs possible:
– Low-leakage devices (slower access time)
[Plots: 1MB cache dynamic & leakage power [HP2008, ITRS2007]; normalized leakage power by circuit variant (DG and LSP devices) [ITRS2007]]
Forwardflow Overview
• Design Philosophy:
– Avoid 'broadcast' accesses (e.g., no CAMs)
• Avoid 'search' operations (via pointers)
– Prefer short wires, tolerate long wires
– Decouple the frontend from backend details
• Abstract the backend as a pipeline
Forwardflow – Scalable Core Design
• Use pointers to explicitly define data movement
– Every operand has a Next Use Pointer
– Pointers specify where data moves (in log(N) space)
– Pointers are agnostic of:
• Implementation
• Structure sizes
• Distance
– No search operation
ld R4 4 R1
add R1 R3 R3
sub R4 16 R4
st R3 R8
breq R4 R3
Forwardflow – Dataflow Queue
• Table of in-flight instructions
• Combination scheduler, ROB, and PRF
– Manages OOO dependencies
– Performs scheduling
– Holds data values for all operands
• Each operand maintains a next use pointer (hence the log(N))
• Implemented as banked RAMs → scalable
Dataflow Queue (Op1, Op2, Dest):
1 ld R4 4 R1
2 add R1 R3 R3
3 sub R4 16 R4
4 st R3 R8
5 breq R4 R5
Bird's-eye view of FF / detailed view of FF
Forwardflow – DQ +/-'s
+ Explicit, persistent dependencies
+ No searching of any kind
- Multi-cycle wakeup per value *
* The average number of successors is small [Ramirez04, Sassone07]
DQ: Banks, Groups, and ALUs
[Diagram: logical vs. physical organization; the DQ bank group is the fundamental unit of scaling]
Forwardflow: Pipeline Tour
• RCT: identifies successors
• ARF: provides architected values
• DQ: chases pointers
[Pipeline diagram: PRED → FETCH (I$) → DECODE (RCT) → DISPATCH → EXECUTE (DQ, D$) → COMMIT (ARF)]
Scalable, Decoupled Backend
RCT: Summarizing Pointers
• Want to dispatch: breq R4 R5
• Need to know:
– Where to get R4? → the result of DQ entry 3
– Where to get R5? → from the ARF
• The Register Consumer Table summarizes where the most recent version of each register can be found
Dataflow Queue (Op1, Op2, Dest):
1 ld R4 4 R1
2 add R1 R3 R3
3 sub R4 16 R4
4 st R3 R8
5 (empty)
RCT: Summarizing Pointers
Dataflow Queue (Op1, Op2, Dest) after dispatching breq:
1 ld R4 4 R1
2 add R1 R3 R3
3 sub R4 16 R4
4 st R3 R8
5 breq [R4] [R5]
Register Consumer Table (REF, WR):
R1: 2-S1, 1-D
R2: (none)
R3: 4-S1, 2-D
R4: 3-D → updated to 5-S1, 3-D
R5: (none)
R4 comes from DQ entry 3-D; R5 comes from the ARF
Wakeup/Issue: Walking Pointers
• Follow the Dest pointer when a new result is produced
– Continue following pointers to subsequent successors
– At each successor, read the 'other' value & try to issue
• A NULL pointer marks the last successor
Dataflow Queue (Op1, Op2, Dest):
1 ld R4 4 R1
2 add R1 R3 R3
3 sub R4 16 R4
4 st R3 R8
5 breq R4 7
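A minimal software sketch of the walk; the (entry, slot) pointer encoding is illustrative:

#include <cstdint>
#include <vector>

constexpr uint16_t NULL_PTR = 0xFFFF;

struct Operand { uint64_t value = 0; bool ready = false;
                 uint16_t next_use = NULL_PTR; };
struct DQEntry { Operand op1, op2, dest; };

// Serialized wakeup: when `producer` completes, walk its dest chain,
// writing the value into each successor operand; an entry can issue
// once both operands are ready. Encoding: ptr = entry * 2 + slot.
void wakeup(std::vector<DQEntry>& dq, uint16_t producer, uint64_t value) {
    uint16_t ptr = dq[producer].dest.next_use;
    while (ptr != NULL_PTR) {
        DQEntry& e = dq[ptr / 2];
        Operand& slot = (ptr % 2 == 0) ? e.op1 : e.op2;
        slot.value = value;
        slot.ready = true;
        if (e.op1.ready && e.op2.ready) {
            // select: read the 'other' value and send the entry to an ALU
        }
        ptr = slot.next_use;   // next successor, or NULL_PTR at the end
    }
}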
DQ: Fields and Banks
• Independent fields → independent RAMs
– i.e., accessed independently, with independent ports, etc.
• Multi-issue ≠ multi-port
– Multi-issue → multi-bank
– Dispatch and commit access contiguous DQ regions
• Bank on low-order bits for dispatch/commit bandwidth
• Port contention + wire delay → more banks
– Dispatch and commit share a port
• Bank on a high-order bit to reduce contention
DQ: Banks, Groups, and ALUs
[Diagram: logical vs. physical organization; the DQ bank group is the fundamental unit of scaling]
Related Work
• Scalable Schedulers
– Direct Instruction Wakeup [Ramirez04]:
• Scheduler has a pointer to the first successor
• Secondary table holds a matrix of further successors
– Hybrid Wakeup [Huang02]:
• Scheduler has a pointer to the first successor
• Each entry has a broadcast bit for multiple successors
– Half Price [Kim02]:
• Slice the scheduler in half
• The second operand is often unneeded
Related Work
• Dataflow & Distributed Machines– Tagged-Token [Arvind90]
• Values (tokens) flow to successors– TRIPS [Sankaralingam03]:
• Discrete Execution Tiles: X, RF, $, etc.• EDGE ISA
– Clustered Designs [e.g. Palacharla97]• Independent execution queues
RW: Scaling, etc.
• CoreFusion [Ipek07]
– Fuses individual core structures into bigger cores
• Power-aware microarchitecture resource scaling [Iyer01]
– Varies RUU size & width
• Positional Adaptation [Huang03]
– Adaptively applies low-power techniques:
• Instruction filtering, sequential cache, reduced ALUs
RW: Scalable Cores
• CoreFusion [Ipek07]
– Fuse individual core structures into bigger cores
• Composable Lightweight Processors [Kim07]
– Many very small cores operate collectively, ala TRIPS
• WiDGET [Watanabe10]
– Scale window via smart steering
RW: Seeking MLP
• Big Windows [Many]
• Runahead Execution [Dundas97][Mutlu06]
– "Just keep executing"
• WIB [Lebeck02]
– Defer, re-schedule later
• Continual Flow [Srinivasan04]
– & friends [Hilton09][Chaudhry09]
– Defer, re-dispatch later
Operand Networks: Pointer Span
[CDF of pointer span for astar, sjeng, and jbb]
SPAN = 5: ~85% of pointers designate near successors.
Intuition: most of these pointers yield IB traffic, some IBG-N, none IBG-D.
SPAN = 16: nearly all pointers (>95%) designate successors 16 or fewer entries away.
Intuition: there will be very little IBG-D traffic.
Is It Correct?
• Impossible to tell
– Experiments do not prove; they support or refute
• What support has been observed for the hypothesis "this is correct"?
– Reasonable agreement with published observations (e.g., consumer fanouts)
– Few timing-first functional violations
– Predictable uBenchmark behavior
• Linked list: no parallelism
• Streaming: much parallelism
CoreFusion
• Borrow Everything
– Merges multiple discrete elements in multiple discrete cores into larger components
– Troublesome for N > 2
[Diagram: two cores' BPRED, Decode, Scheduler, PRF, and I$ fused into one]
“Vanilla” CMOS
[Device cross-section: planar transistor with N+ source/drain in a P- body]
Double-Gate, Tri-Gate, Multigate
ITRS-HP vs. ITRS-LSP Device
[Device cross-sections: HP vs. LSP planar transistors]
LSP: ~2x Thicker Gate Oxides
LSP: ~2x Longer Gates
LSP: ~4x Vth
OoO Scaling
[Diagram: decode width 2 vs. decode width 4; each new instruction's (dest, src1, src2) is compared against every other in-flight instruction]
Number of comparators ~ O(N²); bypassing complexity ~ O(N²)
Two-way fully bypassed shown; four-way fully bypassed is beyond my PowerPoint skill
OoO Scaling
• ROB complexity: O(N), O(I^~3/2)
• PRF complexity: O(ROB), O(I^~3/2)
• Scheduler complexity:
– CAM: O(N·log(N)) (the size of a register tag increases as log(N))
– Matrix: O(N²) (in fairness, the constant in front is small)
Flavors of “Off”
Flavor | Dynamic Power | Static Power | Response Lag Time
Active (not off) | U% | 100% | 0 cycles
Drowsy (Vdd scaled) | 1-5% | 40% | 1-2 cycles
Clock-gated | 1-5% | 100% | ~0 cycles
Vdd-gated | <1% | <1% | 100s of cycles
Freq. scaled | F% | 100% | ~0 cycles
Forwardflow – Resolving Branches
Dataflow Queue (Op1, Op2, Dest):
1 ld R4 4 R1
2 add R1 R3 R3
3 sub R4 16 R4
4 st R3 R8
5 breq R4 R5
6 ld R4 4 R1
7 add R1 R3 R3
• On a branch prediction:
– Checkpoint the RCT
– Checkpoint the pointer valid bits
• On checkpoint restore:
– Restore the RCT
– Invalidate bad pointers
A Day in the Life of a Forwardflow Instruction: Decode
add R1 R3 R3 (becomes DQ entry 8)
Register Consumer History: R1 → 7-D
• Op1 (R1): last referenced at 7-D, so the add carries pointer "R1@7D"; RCT[R1] becomes 8-S1
• Op2 (R3): value available from the ARF (R3 = 0)
• Dest (R3): RCT[R3] becomes 8-D
A Day in the Life of a Forwardflow Instruction: Dispatch
Dataflow Queue (Op1, Op2, Dest):
7 ld R4 4 R1
8 add R1@7D R3=0 R3
9 (empty)
(Implicit fields are not actually written.)
A Day in the Life of a Forwardflow Instruction: Wakeup
Dataflow Queue (Op1, Op2, Dest):
7 ld R4 4 R1
8 add R1 0 R3
9 sub R4 16 R4
10 st R3 R8
DQ entry 7's result is 0!
DestPtr.Read(7) returns next pointer 8-S1; DestVal.Write(7, 0); the value 0 is forwarded to 8-S1.
A Day in the Life of a Forwardflow Instruction: Issue (…and Execute)
Dataflow Queue (Op1, Op2, Dest):
7 ld R4 4 R1
8 add R1 0 R3
9 sub R4 16 R4
10 st R3 R8
S1Val.Write(8, 0); S1Ptr.Read(8) returns the next pointer; S2Val.Read(8) and Meta.Read(8) supply the other operand and opcode.
Issue: add 0 + 0 → DQ8
A Day in the Life of a Forwardflow Instruction: Writeback
Dataflow Queue (Op1, Op2, Dest):
7 ld R4 4 R1
8 add R1 0 R3
9 sub R4 16 R4
10 st R3 R8
DestPtr.Read(8) returns next pointer 10-S1; DestVal.Write(8, 0); R3:0 is forwarded to 10-S1.
A Day in the Life of a Forwardflow Instruction: Commit
Dataflow Queue (Op1, Op2, Dest):
7 ld R4 4 R1
8 add R1 0 R3 (R3:0)
9 sub R4 16 R4
10 st R3 R8
Commit logic: Meta.Read(8) and DestVal.Read(8), then ARF.Write(R3, 0).
DQ Q&A
Dataflow Queue (Op1, Op2, Dest):
1 ld R4 4 R1
2 add R1 R3 R3
3 sub R4 16 R4
4 st R3 R8
5 breq R4 R5
6 ld R4 4 R1
7 add R1 R3 R3
8 sub R4 16 R4
9 st R3 R8
Register Consumer History after the first iteration: R4 → 5-S1, R3 → 4-S1, R1 → 2-S1
Register Consumer History after the second iteration: R4 → 8-D, R3 → 9-S1, R1 → 7-S1
Forwardflow – Wakeup
Dataflow Queue (Op1, Op2, Dest):
1 ld R4 4 R1
2 add R1 R3 R3
3 sub R4 16 R4
4 st R3 R8
5 breq R4 0
DQ entry 1's result is 7!
DestPtr.Read(1) returns next pointer 2-S1; DestVal.Write(1, 7); the value 7 is forwarded to 2-S1.
Forwardflow – Selection
Dataflow Queue (Op1, Op2, Dest):
1 ld R4 4 R1
2 add R1 R3 R3
3 sub R4 16 R4
4 st R3 R8
5 breq R4 0
S1Val.Write(2, 7); S1Ptr.Read(2) returns the next pointer; S2Val.Read(2) and Meta.Read(2) supply the other operand and opcode.
DQ2 issues: add 7 + 44
Forwardflow – Building Pointer Chains: Decode
• Decode must determine, for each operand, where the operand's value will originate
– Vanilla OOO: register renaming
– Forwardflow OOO: the Register Consumer Table (RCT)
• The RCT records the last instruction to reference each architectural register
– A RAM-based table, analogous to a renamer
Decode Example
Dataflow Queue (Op1, Op2, Dest):
5 ld R4 4 R4
6 add R4 R1 R4
7 ld R4 16 R1
8 (empty)
9 (empty)
Register Consumer History: R4 → 7-S1, R1 → 7-D
8: add R1 R3 R3
• Op1 (R1): comes from 7-D, so the add carries "R1@7D"; RCT[R1] becomes 8-S1
• Op2 (R3): value available from the ARF (R3 = 0)
• Dest (R3): RCT[R3] becomes 8-D
add → R3
D. Gibson Thesis Defense - 115
Forwardflow – Dispatch
• Dispatch into the DQ:
– Writes metadata and available operands
– Appends the instruction to the forward pointer chains
Dataflow Queue (Op1, Op2, Dest):
5 ld R4 4 R4
6 add R4 R1 R4
7 ld R4 16 R1
8 add R1@7D R3=0 R3
9 (empty)