Dean Tullsen ACACES 2008
Parallelism – Use multiple contexts to achieve better performance than possible on a single context.
Traditional parallelism – we use extra threads/processors to offload computation; threads divide up the execution stream.
Non-traditional parallelism – extra threads are used to speed up computation without necessarily offloading any of the original computation.
Primary advantage – nearly any code, no matter how inherently serial, can benefit from parallelization.
Another advantage – threads can be added or subtracted without significant disruption.
Dean Tullsen ACACES 2008
[Figure: four thread contexts (Thread 1 through Thread 4) dividing up the execution stream]
Dean Tullsen ACACES 2008
[Figure: four thread contexts (Thread 1 through Thread 4), with the extra threads supporting the main thread]
Speculative precomputation, dynamic speculative precomputation, many others.
Most commonly – prefetching, possibly branch pre-calculation.
Dean Tullsen ACACES 2008
Chappell, Stark, Kim, Reinhardt, Patt, "Simultaneous Subordinate Micro-threading," 1999 – use microcoded threads to manipulate the microarchitecture to improve the performance of the main thread.
Zilles 2001, Collins 2001, Luk 2001 – use a regular SMT thread, with code distilled from the main thread, to support the main thread.
Dean Tullsen ACACES 2008
Speculative Precomputation [Collins, et al. 2001 – Intel/UCSD]
Dynamic Speculative Precomputation
Event-Driven Simultaneous Optimization
  Value Specialization
  Inline Prefetching
  Thread Prefetching
Dean Tullsen ACACES 2008

[Chart: speedup (y-axis 0 to 8) for art, equake, gzip, mcf, health, and mst under Perfect Memory versus Perfect Delinquent Loads (10); two bars run off the scale at 32.6 and 27.9.]
Dean Tullsen ACACES 2008
In SP, a p-slice is a thread derived from a trace of execution between a trigger instruction and the delinquent load.
All instructions upon which the load’s address is not dependent are removed (often 90-95%).
Live-in register values (typically 2-6) must be explicitly copied from main thread to helper thread.
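The idea can be illustrated with a hypothetical C sketch (not code from the paper; elem, p_slice, and the loop are mine). Everything in the main loop that does not feed the delinquent load's address is stripped from the slice:

    typedef struct elem { int field; } elem;

    /* Main thread: a[i]->field is the delinquent load. */
    long sum_fields(elem **a, int n)
    {
        long total = 0;
        for (int i = 0; i < n; i++)
            total += a[i]->field;           /* misses frequently */
        return total;
    }

    /* Distilled p-slice: spawned at the trigger with live-ins a and i
       copied from the main thread; all other work has been removed. */
    void p_slice(elem **a, int i)
    {
        elem *p = a[i];                     /* address computation kept   */
        __builtin_prefetch(&p->field);      /* prefetch replaces the load */
    }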
Dean Tullsen ACACES 2008

[Figure: timeline of the main thread and a helper thread. At the trigger instruction the helper is spawned; it issues the prefetch, overlapping the memory latency so the delinquent load hits when the main thread reaches it.]
Dean Tullsen ACACES 2008
Because SP uses actual program code, it can precompute addresses that fit no predictable pattern.
Because SP runs in a separate thread, it can interfere with the main thread much less than software prefetching. When it isn’t working, it can be killed.
Because it is decoupled from the main thread, the prefetcher is not constrained by the control flow of the main thread.
All the applications in this study already had very aggressive software prefetching applied, when possible.
Dean Tullsen ACACES 2008

On-chip memory for transfer of live-in values.
Chaining triggers – for delinquent loads in loops, a speculative thread can trigger the next p-slice (think of this as a looping prefetcher which targets a load within a loop). Chaining minimizes live-in copy overhead and enables SP threads to get arbitrarily far ahead, but it necessitates a mechanism to stop the chaining prefetcher (see the sketch below).
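A chaining slice might look like the following hedged sketch; spawn() and stop_requested() are stand-ins for the hardware spawn and chain-stop mechanisms, not real primitives:

    typedef struct { int field; } node_t;

    /* hypothetical primitives for hardware thread spawn / chain kill */
    extern int  stop_requested(void);
    extern void spawn(void (*fn)(node_t **, int, int), node_t **, int, int);

    /* Chaining p-slice: each instance prefetches one iteration and then
       spawns its successor, so live-ins are copied only at the first
       (basic-trigger) spawn. */
    void chain_slice(node_t **a, int i, int n)
    {
        if (i >= n || stop_requested())     /* mechanism to stop the chain  */
            return;
        __builtin_prefetch(&a[i]->field);   /* cover the delinquent load    */
        spawn(chain_slice, a, i + 1, n);    /* chaining trigger: next slice */
    }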
Dean Tullsen ACACES 2008

Chaining triggers are executed without impacting the main thread.
They target delinquent loads arbitrarily far ahead of the non-speculative thread; speculative threads make progress independent of the main thread.
Basic triggers initiate precomputation, but chaining triggers sustain it.
Dean Tullsen ACACES 2008

[Chart: speedup over baseline (y-axis 0.8 to 2.8) for art, equake, gzip, mcf, health, mst, and the average, with 2, 4, and 8 thread contexts.]
Dean Tullsen ACACES 2008

Speculative precomputation uses otherwise idle hardware thread contexts. It pre-computes future memory accesses, targeting the worst-behaving static loads in a program. Chaining triggers enable speculative threads to spawn additional speculative threads. The result is tremendous performance gains, even with conservative hardware assumptions.
Dean Tullsen ACACES 2008
Speculative Precomputation
Dynamic Speculative Precomputation [Collins, et al. – UCSD/Intel]
Event-Driven Simultaneous Optimization
  Value Specialization
  Inline Prefetching
  Thread Prefetching
Dean Tullsen ACACES 2008

SP, and similar techniques proposed around the same time, require profile support and heavy user or compiler interaction. They are thus susceptible to profile mismatch, require recompilation for each machine architecture, and if they require user interaction...
(Or, a bit more accurately, we just wanted to see if we could do it all in hardware.)
Dean Tullsen ACACES 2008

Relies on the hardware to:
- identify delinquent loads
- create speculative threads
- optimize the threads when they aren't working quite well enough
- eliminate the threads when they aren't working at all
- destroy threads when they are no longer useful...
Dean Tullsen ACACES 2008
Like hardware prefetching, works without software support or recompilation, regardless of the machine architecture.
Like SP, works with minimal interference on main thread.
Like SP, works on highly irregular memory access patterns.
Dean Tullsen ACACES 2008

Identify delinquent loads: Delinquent Load Identification Table (DLIT)
Construct p-slices and apply optimizations: Retired Instruction Buffer (RIB)
Spawn and manage p-slices: Slice Information Table (SIT)
These are implemented as back-end instruction analyzers.
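One plausible shape for these three structures, sketched in C (field names and sizes are illustrative guesses, not the hardware's exact layout):

    #include <stdint.h>

    /* DLIT: tracks candidate loads and their miss behavior */
    struct dlit_entry {
        uint64_t load_pc;
        uint32_t executions;
        uint32_t misses;            /* high miss rate => delinquent       */
    };

    /* RIB: FIFO of retired instructions used for slice construction */
    struct rib_entry {
        uint64_t pc;
        int8_t   dest_reg;          /* -1 if the instruction writes none  */
        int8_t   src_regs[2];
    };

    /* SIT: one entry per constructed p-slice */
    struct sit_entry {
        uint64_t trigger_pc;        /* spawn a slice when this PC retires */
        uint64_t slice_addr;        /* where the p-slice code lives       */
        uint8_t  live_in_regs[6];   /* typically 2-6 live-in registers    */
    };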
Dean Tullsen ACACES 2008

[Figure: SMT pipeline with four PCs, ICache, register renaming, a centralized instruction queue, per-thread re-order buffers, a monolithic register file, execution units, and a data cache.]
Dean Tullsen ACACES 2008

[Figure: the same SMT pipeline augmented with the Delinquent Load Identification Table (DLIT), the Retired Instruction Buffer (RIB), and the Slice Information Table (SIT) at the back end.]
Dean Tullsen ACACES 2008

Once a delinquent load is identified, the RIB buffers instructions until the delinquent load appears as the newest instruction in the buffer.
Dependence analysis easily identifies the load's antecedents, a trigger instruction, and the live-ins needed by the slice. This is similar to register live-range analysis, but much easier.
Dean Tullsen ACACES 2008

The RIB constructs p-slices to prefetch delinquent loads. It buffers information on an in-order run of committed instructions (comparable to a trace cache fill unit). It is a FIFO structure, and is normally idle.
Dean Tullsen ACACES 2008

Analyze the instructions between two instances of the delinquent load, from most recent to oldest. Maintain a partial p-slice and a register live-in set: add to the p-slice each instruction which produces a register in the live-in set, and update the live-in set accordingly. When the analysis terminates, the p-slice has been constructed and the live-in registers identified.
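The backward pass is simple enough to sketch in C (a minimal model, assuming each instruction has one destination and up to two sources; the encoding and names are mine):

    #include <stdbool.h>

    #define NREGS 32
    typedef struct { int dest; int src[2]; int nsrc; } instr;

    /* rib[0..n-1] holds committed instructions, oldest first; rib[n-1]
       is the newest instance of the delinquent load. On return,
       included[] marks the p-slice and live_in[] its live-in registers. */
    void build_pslice(const instr *rib, int n, bool *included, bool *live_in)
    {
        for (int r = 0; r < NREGS; r++)
            live_in[r] = false;

        /* seed with the delinquent load itself */
        included[n - 1] = true;
        for (int s = 0; s < rib[n - 1].nsrc; s++)
            live_in[rib[n - 1].src[s]] = true;

        /* walk from most recent to oldest */
        for (int i = n - 2; i >= 0; i--) {
            included[i] = false;
            if (rib[i].dest >= 0 && live_in[rib[i].dest]) {
                included[i] = true;              /* produces a needed value */
                live_in[rib[i].dest] = false;    /* satisfied by this instr */
                for (int s = 0; s < rib[i].nsrc; s++)
                    live_in[rib[i].src[s]] = true;
            }
        }
    }

On the example loop below, this yields exactly the slice derived in the walkthrough: the two loads and the add, with live-ins r2 and r4.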
Dean Tullsen ACACES 2008

    struct DATATYPE { int val[10]; };
    DATATYPE *data[100];

    for (j = 0; j < 10; j++) {
        for (i = 0; i < 100; i++) {
            data[i]->val[j]++;
        }
    }

    loop: I1  load  r1 = [r2]
          I2  add   r3 = r3+1
          I3  add   r6 = r3-100
          I4  add   r2 = r2+8
          I5  add   r1 = r4+r1
          I6  load  r5 = [r1]
          I7  add   r5 = r5+1
          I8  store [r1] = r5
          I9  blt   r6, loop
Dean Tullsen ACACES 2008

The RIB analyzes the buffered instructions from the most recent to the oldest, starting at the newest instance of the delinquent load. An instruction is included in the p-slice only if it produces a register currently in the live-in set; including it removes its destination from the set and adds its source registers. For the example loop, the analysis proceeds as follows (live-in set shown after each step):

    Instruction          Included   Live-in Set
    load  r5 = [r1]      yes        r1            (newest: analysis starts here)
    add   r1 = r4+r1     yes        r1, r4
    add   r2 = r2+8      no         r1, r4
    add   r6 = r3-100    no         r1, r4
    add   r3 = r3+1      no         r1, r4
    load  r1 = [r2]      yes        r2, r4
    blt   r6, loop       no         r2, r4
    store [r1] = r5      no         r2, r4
    add   r5 = r5+1      no         r2, r4
    load  r5 = [r1]      (end)      r2, r4        (oldest: analysis ends here)

Resulting p-slice: load r1 = [r2]; add r1 = r4+r1; load r5 = [r1]
Live-in set: r2, r4
The delinquent load itself is the trigger.
Dean Tullsen ACACES 2008
If two occurrences of the load are in the buffer (the common case), we've identified a loop that can be exploited for better slices.
Can perform additional analysis passes and optimizations: retain the live-in set from the previous pass. This increases construction latency but keeps the RIB simple.
Optimizations:
- Advanced trigger placement (if dependences allow, move the trigger earlier in the loop)
- Induction unrolling (prefetch multiple iterations ahead; see the sketch below)
- Chaining (looping) slices – prefetch many loads with a single thread
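Induction unrolling, sketched against the running example's p-slice (a hedged illustration; the function and parameter names are mine): advance the induction variable k times, then run the address computation, so the prefetch lands k iterations ahead of the main thread.

    /* live-ins r2 and r4 are copied from the main thread at spawn */
    void unrolled_slice(char *r2, long r4, int k)
    {
        r2 += 8L * k;                       /* k applications of add r2 = r2+8 */
        long r1 = *(long *)r2;              /* load r1 = [r2]                  */
        r1 = r4 + r1;                       /* add  r1 = r4+r1                 */
        __builtin_prefetch((void *)r1);     /* covers load r5 = [r1]           */
    }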
Dean Tullsen ACACES 2008
[Chart: speedup over no Dynamic SP (y-axis 0.9 to 1.6) for mcf, vpr, art, equake, mgrid, swim, em3d, mst, perimeter, treeadd, and the average; series: Basic Dynamic SP, Improved Trigger, Induction Unrolling, and Chaining; bars off the scale at 1.72, 1.97, and 1.93.]
Dean Tullsen ACACES 2008

Dynamic Speculative Precomputation aggressively targets delinquent loads. It is a thread-based prefetching scheme that uses back-end (off the critical path) instruction analyzers. P-slices are constructed with no external software support, and multi-pass RIB analysis enables aggressive p-slice optimizations.
Dean Tullsen ACACES 2008
Speculative Precomputation
Dynamic Speculative Precomputation
Event-Driven Simultaneous Optimization
  Value Specialization
  Inline Prefetching
  Thread Prefetching
Dean Tullsen ACACES 2008
With Weifeng Zhang and Brad Calder
Dean Tullsen ACACES 2008
Use “helper threads” to recompile/optimize the main thread.
Optimization is triggered by interesting events that are identified in hardware (event-driven).
Dean Tullsen ACACES 2008
[Figure: four thread contexts, Thread 1 through Thread 4]
Execution and compilation take place in parallel!
Dean Tullsen ACACES 2008
A new model of optimization: computation and optimization occur in parallel, and optimizations are triggered by the program's runtime behavior.
Advantages:
- Low-overhead profiling of runtime behavior
- Low-overhead optimization by exploiting an additional hardware context
- Quick response to the program's changing behavior
- Aggressive optimizations
Dean Tullsen ACACES 2008
[Figure: the main thread runs the original code; on an event, a helper thread produces base optimized code; on later events, helper threads re-optimize the already optimized code.]
- Maintaining only one copy of the optimized code
- Recurrent optimization on already optimized code when the behavior changes
- Gradually enabling aggressive optimizations
Hardware event-driven:
- Hardware monitors the program's behavior with no software overhead
- Optimization threads are triggered to respond to particular events
- Optimization events are handled ASAP to quickly adapt to the program's changing behavior

Hardware multithreaded:
- Concurrent, low-overhead helper threads
- Gradual re-optimization upon new events

[Figure: Trident – the main thread generates events that trigger optimization threads.]
Dean Tullsen ACACES 2008
[Figure: Trident's software components (runtime support, event manager, helper threads, user application, OS loader) and hardware components (helper registration, thread trigger, helper priority, hardware events, code cache, event queue), arranged around the main thread.]
1. Register a given thread to be monitored, and create helper thread contexts.
2. Monitor the main thread to generate events (into the queue).
3. A helper thread is triggered to perform the optimization.
4. Update the code cache and patch the main thread.
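A helper thread's main loop might look like the following hedged sketch; every name below (dequeue_event, optimize, and so on) is illustrative, not Trident's actual interface:

    #include <stdint.h>

    typedef struct { uint64_t pc; int kind; } event;

    extern event dequeue_event(void);             /* blocks on the event queue */
    extern void *form_or_lookup_trace(uint64_t pc);
    extern void  optimize(void *trace, int kind);
    extern void  install_in_code_cache(void *trace);
    extern void  patch_main_thread(void *trace);  /* redirect into the trace   */

    void helper_thread_main(void)
    {
        for (;;) {
            event e = dequeue_event();            /* posted by hardware monitors */
            void *t = form_or_lookup_trace(e.pc);
            optimize(t, e.kind);                  /* runs in parallel with main  */
            install_in_code_cache(t);
            patch_main_thread(t);
        }
    }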
Dean Tullsen ACACES 2008
Events: occurrences of a particular type of runtime behavior.
- Generic events: hot branch events, trace invalidation
- Optimization-specific events: hot value events, delinquent load events
- Other events?
Dean Tullsen ACACES 2008
The Trident framework is built around a fairly traditional dynamic optimization system: hot trace formation and a code cache. Trident captures hot traces in hardware (details omitted). However, even with its basic optimizations, Trident has key advantages over previous systems:
- Hardware hot branch events identify hot traces
- Zero-overhead monitoring
- Low-overhead optimization in another thread
- No context switches between these functions
Dean Tullsen ACACES 2008
Definitions:
- Hot trace: a number of basic blocks frequently running together
- Trace formation: streamlining these blocks for better execution locality
- Code cache: a memory buffer to store hot traces

[Figure: control-flow graph of basic blocks A through K with call/return edges and branch probabilities, from which a hot trace is formed.]
Dean Tullsen ACACES 2008
Streamlining the instruction sequence: redundant branch/load removal, constant propagation, instruction re-association, code elimination.
Architecture-aware optimizations:
- Reduction of RAS (return address stack) mispredictions (by orders of magnitude)
- I-cache-conscious placement of traces within the code cache
- Trace invalidation
Dean Tullsen ACACES 2008
Value specialization: make a special version of the code corresponding to likely live-in values.
Advantages over hardware value prediction:
- Value predictions are made in the background and less frequently
- No limits on how many predictions can be made
- Allows more sophisticated prediction techniques
- Propagates predicted values along the trace
- Triggers other optimizations such as strength reduction
Dean Tullsen ACACES 2008
Value specialization: make a special version of the code corresponding to likely live-in values.
Advantages over software value specialization:
- Can adapt to semi-invariant runtime values (e.g., values that change, but slowly)
- Adapts to actual dynamic runtime values
- Detects optimizations that are no longer working
Dean Tullsen ACACES 2008
Value specialization handles semi-invariant "constants" and strided values (details omitted), with dynamic verification:
- Perform the original load into a scratch register
- Move the predicted value into the load destination
- Check the predicted value; branch to recovery if not equal
- Perform constant propagation and strength reduction
The compensation block copies the scratch register into the load destination and jumps to the next instruction after the load in the original binary.

Original code:

    LDQ 0(R2) -> R1
    ADD R6, R4 -> R3
    MUL R1, R3 -> R2
    ...

Specialized trace (predicted load value 0):

    LDQ 0(R2) -> R3          ; original load into a scratch register
    MOV 0 -> R1              ; predicted value into the load destination
    BNE R1, R3, recovery     ; verify; branch to the compensation block if wrong
    ADD R6, R4 -> R3
    MOV 0 -> R2              ; MUL constant-folded: no dependency on the load!
    ...
Dean Tullsen ACACES 2008
Evaluating the helper threads' impact on the main thread:
- Exercising the full optimization flow, but not using the optimized traces, costs ~0.6% of the main thread's IPC.
- With concurrent execution of the main thread and helpers, the helpers take ≤ 2% of total execution time (running concurrently with the main thread).
Dean Tullsen ACACES 2008
[Chart: percent speedups (y-axis -5% to 35%) for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, and the average; series: H/W value prediction, Trace formation, Value specialization; bars off the scale at 169%, 236%, and 238%.]
Dean Tullsen ACACES 2008
Speculative Precomputation
Dynamic Speculative Precomputation
Event-Driven Simultaneous Optimization
  Value Specialization
  Inline Prefetching
  Thread Prefetching
Dean Tullsen ACACES 2008
Limitations of existing prefetching techniques:
- Compiler-based static prefetching: address/aliasing resolution, timeliness, difficulty identifying delinquent loads, and variation due to data input or architecture
- Hardware prefetching: cannot follow complicated load behaviors
Dean Tullsen ACACES 2008
Goal: provide an efficient way to perform flexible software prefetching, and find prefetching opportunities in legacy code.
Effective prefetching must be accurate:
- Target the loads which actually miss in the cache
- Prefetch far enough ahead to cover the miss latency
- Keep the overhead of computing prefetch addresses low
Dean Tullsen ACACES 2008
It is intrinsically difficult to get the prefetch distance right. Trident enables adaptive discovery of the optimal prefetch distances; conventional systems often make the decision only once because of the high overhead.
[Figure: original execution trace over time, showing the miss latencies of Load 1, Load 2, and Load 3.]
Dean Tullsen ACACES 2008
[Figure: the trace annotated with hot branch events and delinquent load events.]
Dean Tullsen ACACES 2008
Determine how far ahead to prefetch a delinquent load:

    prefetch distance = (average load miss latency) / (average cycles per iteration)
    prefetch address  = base + offset + stride * distance

Most prior prefetching systems keep the prefetch distance fixed after the initial estimate. Trident reuses the first two steps, except that the low overhead of monitoring and optimization allows it to adapt this distance, as well as the stride.
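In code, the initial estimate might look like this sketch (the variable and function names are mine):

    /* round the distance up so the prefetch fully covers the miss latency */
    void issue_prefetch(char *base, long offset, long stride,
                        long avg_miss_latency, long avg_cycles_per_iter)
    {
        long distance = (avg_miss_latency + avg_cycles_per_iter - 1)
                      / avg_cycles_per_iter;
        __builtin_prefetch(base + offset + stride * distance);
    }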
Dean Tullsen ACACES 2008
1. Object prefetching: identifies loads within the same object, and clusters them to minimize prefetch overhead (sketched below).
2. Adaptive determination of the prefetch distance.
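Object prefetching might look like this sketch (prefetch_object is my name, and a 64-byte cache line is assumed): one trigger pulls in every line of the object, so the clustered member loads share a single address computation.

    #include <stddef.h>

    void prefetch_object(const void *obj, size_t bytes)
    {
        for (size_t off = 0; off < bytes; off += 64)    /* 64B lines assumed */
            __builtin_prefetch((const char *)obj + off);
    }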
Dean Tullsen ACACES 2008
Heavy interaction between neighboring loads (especially other loads we are also prefetching) makes static, or even dynamic, determination of the correct prefetch distance difficult. Because of the low cost of optimization, Trident uses trial and error to discover the right distances.
Dean Tullsen ACACES 2008
All stride-based prefetch instructions are inserted with an initial distance of 1. These loads are continuously monitored in the DLT, and the optimizer increases or decreases the distance until the load is no longer delinquent or the load has matured. Stabilization is achieved quickly.

[Figure: feedback loop – a delinquent load event indicates the prefetch is not hiding enough latency, and the distance is adjusted.]
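The trial-and-error loop can be sketched as follows (a hedged model; the names and the exact adjustment policy are mine, not Trident's):

    #include <stdbool.h>

    typedef struct { int distance; bool matured; } prefetch_site;

    /* invoked on each delinquent-load event for this prefetch site */
    void tune_distance(prefetch_site *p, bool still_delinquent, bool too_early)
    {
        if (p->matured)
            return;                   /* mature loads are no longer adjusted */
        if (still_delinquent)
            p->distance++;            /* prefetch not hiding enough latency  */
        else if (too_early)
            p->distance--;            /* overshooting: data arrives too soon */
        else
            p->matured = true;        /* stabilized                          */
    }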
Dean Tullsen ACACES 2008
In many cases, pointer-chasing loads actually have strided patterns. These patterns can be identified by Trident's hardware monitors, which gives Trident two advantages over software prefetchers: low-overhead address computation, and the ability to prefetch multiple iterations ahead.
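For example (a hypothetical sketch): if the allocator happened to lay the nodes out with a detected stride S, the prefetcher can reach K iterations ahead with one add, instead of chasing K pointers.

    typedef struct node { struct node *next; int val; } node;

    void prefetch_ahead(const node *p, long S, int K)
    {
        __builtin_prefetch((const char *)p + S * K);    /* K iterations ahead */
    }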
Baseline: hardware stride-based prefetching with stream buffers. Self-repairing prefetching achieves a 23% speedup, 12% better than software prefetching without repairing.
[Chart: percent speedups (y-axis 0% to 80%) for applu, art, dot, equake, facerec, fma3d, galgel, gap, mcf, mgrid, parser, swim, vis, wupwise, and the average; series: S/W prefetching - basic, S/W prefetching - whole object, S/W prefetching with self-repairing.]
Dean Tullsen ACACES 2008
Speculative Precomputation
Dynamic Speculative Precomputation
Event-Driven Simultaneous Optimization
  Value Specialization
  Inline Prefetching
  Thread Prefetching (speculative precomputation)
Dean Tullsen ACACES 2008
Thread prefetching can potentially be more effective than inline prefetching. However, it is more complex, with more things to get right or wrong: trigger points, termination points, and synchronization between the helper and the main thread. These vary not just with load latencies, but also with control flow, etc. Again, Trident's ability to continuously adapt is key.
Dean Tullsen ACACES 2008
It is critical in any thread-based prefetching scheme that the prefetch thread stay ahead of the main thread. Trident's optimizations:
- Jump-start the p-thread multiple iterations ahead
- Use dynamically detected strides to replace complex recurrences
- Same-object prefetching
- P-thread placement optimizations for I-cache performance
- Low-overhead software synchronization (sketched below)
- Quick repair of off-track prefetching
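The synchronization can be as cheap as a shared progress counter, as in this hedged sketch (RUNAHEAD, prefetch_iteration, and the spin policy are mine):

    #define RUNAHEAD 10

    extern volatile long main_iter;          /* published by the main thread */
    extern void prefetch_iteration(long i);  /* the p-slice body             */

    void p_thread_loop(long jump_start, long n)
    {
        for (long i = main_iter + jump_start; i < n; i++) {
            while (i > main_iter + RUNAHEAD)
                ;                            /* too far ahead: wait briefly  */
            prefetch_iteration(i);
        }
    }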
Dean Tullsen ACACES 2008
[Chart: percent speedups (y-axis 0% to 40%) for applu, art, dot, equake, facerec, fma3d, galgel, gap, mcf, mgrid, parser, swim, vis, wupwise, and the average; series: basic p-slice prefetching (runahead=10), jump start (J=5) with runahead (R=5), and specialization with speculative stride <J=5,R=5>; bars off the scale at 109%, 129%, and 133%.]
Trident’s acceleration techniques achieve 7% better performance than existing pre-computation techniques
Dean Tullsen ACACES 2008
[Chart: percent speedups (y-axis 0% to 60%); series: inlined prefetching, precomputation + inlined (non-looped); bars off the scale at 142% and 79%.]
Adaptive inlined prefetching plus pre-computation achieves 10% better performance than previous aggressive inlined prefetching.
Dean Tullsen ACACES 2008
Event-driven multithreaded optimization:
- Hardware event-driven optimization means low-overhead profiling; monitoring of the code need never stop.
- Allowing compilation to take place in parallel with execution provides low-overhead optimization: it allows more aggressive optimizations, gradual improvement via recurrent optimization, and self-adaptive (e.g., search-based) optimization.
What else can we do with this technology?
Dean Tullsen ACACES 2008
Works on serial code (and parallel).
Provides parallel speedup by allowing the main thread to run faster.
Is not limited by traditional theoretical limits to parallel speedup.
Adapts easily to changes in available parallelism.

Other types of non-traditional parallelism??