Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads
Carnegie Mellon
Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan† and Todd C. Mowry
School of Computer Science, Carnegie Mellon University
†Dept. Elec. & Comp. Engineering, University of Toronto
Compiler Optimization of Memory-Resident Value Communication… - 2 - Zhai, Colohan, Steffan and
Mowry
Carnegie Mellon
Motivation
Chip-level multiprocessing is becoming commonplace
We need parallel programs
Examples: UltraSPARC IV (2 UltraSPARC III cores), IBM Power 4, Sun MAJC, SiByte SB-1250
Can multithreaded processors improve the performance of a single application?
Why Is Automatic Parallelization Difficult?
One solution: Thread-Level Speculation
Automatic parallelization today
• Must statically prove threads are independent
• Constructing proofs is difficult due to ambiguous data dependences: complex control flow, pointers and indirect references, runtime inputs
Optimistic compiler?
• Limited only by true dependences
Example
while (...) { … x = hash[index1]; … hash[index2] = y; … }
Threads run on Processors 1 to 4 over time; each thread ends with check_dep():
Thread 1: … = hash[3] … hash[10] = …
Thread 2: … = hash[19] … hash[21] = …
Thread 3: … = hash[33] … hash[30] = …
Thread 4: … = hash[10] … hash[25] = …
Thread 5: … = hash[31] … hash[12] = …
Thread 6: … = hash[9] … hash[44] = …
Thread 7: … = hash[27] … hash[32] = …
Thread 4 Retry: … = hash[10] … hash[25] = …
(Thread 4 loads hash[10], which Thread 1 stores, so Thread 4 is retried.)
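The retry in this example can be reproduced with a small model. The sketch below is illustrative only: the `Epoch` type, the `count_retries` function, and the conservative rule it applies (retry a thread whenever one of the threads running alongside it stores to the location it loads) are assumptions made for this example, not the check_dep() mechanism from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* One speculative thread (one loop iteration): it loads hash[load_idx]
 * and later stores hash[store_idx], as in the loop on the slide. */
typedef struct {
    int load_idx;
    int store_idx;
} Epoch;

/* Conservative model (an assumption of this sketch): thread i is retried
 * if any of the (window - 1) threads logically before it stores to the
 * location that thread i loads. Real TLS hardware signals a violation only
 * when the load actually executed before the store. */
size_t count_retries(const Epoch *epochs, size_t n, size_t window)
{
    size_t retries = 0;
    for (size_t i = 0; i < n; i++) {
        /* index of the oldest thread that can overlap with thread i */
        size_t first = (i + 1 >= window) ? i + 1 - window : 0;
        for (size_t j = first; j < i; j++) {
            if (epochs[j].store_idx == epochs[i].load_idx) {
                retries++;
                break; /* one violation is enough to squash thread i */
            }
        }
    }
    return retries;
}
```

With the seven threads from the slide and a window of four processors, only the fourth thread matches an earlier store (hash[10]).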
Frequently Dependent Scalars
Can identify scalars that always cause dependences
[Diagram: over time, the producer thread and the consumer thread each execute … = a followed by a = …]
Frequently Dependent Scalars
Dependent scalars should be synchronized [ASPLOS’02]
[Diagram: producer and consumer timelines, each with … = a and a = …; the producer executes Signal(a) and the consumer executes Wait(a)]
Frequently Dependent Scalars
Dataflow analysis allows us to deal with complex control flow [ASPLOS’02]
[Diagram: producer and consumer timelines with … = a and a = … under complex control flow]
Communicating Memory-Resident Values
Synchronize?
Speculate?
Will speculation succeed?
[Diagram: over time, the producer and the consumer each execute Load *p followed by Store *q]
Speculation vs. Synchronization
Speculation succeeds: efficient
[Diagram: sequential execution (Load *p and Store *q repeated four times in series) next to speculative parallel execution (the four Load *p / Store *q pairs overlapped in time)]
Speculation vs. Synchronization
Speculation fails: inefficient
[Diagram: sequential execution next to speculative parallel execution in which a violation forces Load *p / Store *q pairs to re-execute]
Speculation vs. Synchronization
Frequent dependences: Synchronize
Infrequent dependences: Speculate
[Diagram: sequential execution next to speculative parallel execution of the four Load *p / Store *q pairs]
Performance Potential
Reducing failed speculation improves performance
Detailed simulation:
• TLS support
• 4-processor CMP
• 4-way issue, out-of-order superscalar
• 10-cycle communication latency
[Chart: normalized regional execution time (0 to 100), Original vs. Perfect memory value prediction, for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp]
Hardware vs. Compiler Inserted Synchronization
[Diagram: three producer/consumer timelines, each with Store *q, Load *p, and Memory:
• Speculation
• Hardware-inserted synchronization [HPCA’02], showing a (stall)
• Compiler-inserted synchronization [CGO’04], showing Signal() and Wait()]
Issues in Synchronizing Memory-Resident Values
Static analysis
• Which instructions to synchronize?
• Inter-procedural dependences
Runtime
• Detecting and recovering from improper synchronization
[Diagram: producer/consumer timeline with Store *q and Load *p]
Outline
Static analysis
Runtime checks
Results
Conclusions
Compiler Passes
[Diagram: compiler passes between foo.c and foo.exe: Front End, Create Threads, Profile Data Dependences, Decide what to Synchronize, Insert Synchronization, Schedule Instructions, Back End]
Example
work()
push (head, entry)
do { push (&set, element); work(); } while (test);
Example
work() { if (condition(&set)) push (&set, element);}
push (head, entry)
do { push (&set, element); work(); } while (test);
Example
do { push (&set, element); work(); } while (test);
work() { if (condition(&set)) push (&set, element); }
push(head,entry) { entry->next = *head; *head = entry; }
Memory accesses in push, labeled by call path:
• push called from the loop: Load *head(push), Store *head(push)
• push called from work(): Load *head(work, push), Store *head(work, push)
Compiler Passes
[Diagram: compiler passes between foo.c and foo.exe: Front End, Thread Creating, Profile Data Dependences, Decide what to Synchronize, Insert Synchronization, Instruction Scheduling, Back End]
Example
work() { if (condition(&set)) push (&set, element);}
do { push (&set, element); work(); } while (test);
push(head,entry) { entry->next = *head; *head = entry; }
Memory accesses: Load *head(push), Store *head(push), Load *head(work, push), Store *head(work, push)
Profile Information
Source                     Destination                Frequency
Store *head(push)          Load *head(push)           990
Store *head(push)          Load *head(work, push)     10
Store *head(work, push)    Load *head(push)           10
Compiler Passes
[Diagram: compiler passes between foo.c and foo.exe: Front End, Thread Creating, Profile Data Dependences, Decide what to Synchronize, Insert Synchronization, Instruction Scheduling, Back End]
Dependence Graph
[Diagram: dependence graph over Load *head(push), Store *head(push), Load *head(work, push), and Store *head(work, push), with edge weights 990, 10, and 10 taken from the profile]
Pairs that need to be synchronized can be extracted from the dependence graph
Infrequent dependences: occur in less than 5% of iterations
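A minimal sketch of how the pairs to synchronize might be picked from the profiled graph. The `DepEdge` type and both function names are hypothetical, and the iteration count passed in is an assumption of the example, since the slides give only the edge frequencies. An edge is treated as frequent when it occurs in at least 5% of iterations, the complement of the slide's definition of infrequent.

```c
#include <assert.h>
#include <stddef.h>

/* One edge of the profiled dependence graph: a store that fed a load in a
 * later thread, and in how many iterations that dependence was observed. */
typedef struct {
    const char *store;
    const char *load;
    unsigned    count;
} DepEdge;

/* Frequent means count / iterations >= threshold_percent / 100.
 * Cross-multiplying keeps the comparison in integer arithmetic. */
int is_frequent(const DepEdge *e, unsigned iterations, unsigned threshold_percent)
{
    return (unsigned long long)e->count * 100ULL
           >= (unsigned long long)threshold_percent * iterations;
}

/* Copies pointers to the frequent edges into out (which must hold n
 * entries) and returns how many were kept. */
size_t select_pairs(const DepEdge *edges, size_t n, unsigned iterations,
                    unsigned threshold_percent, const DepEdge **out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_frequent(&edges[i], iterations, threshold_percent))
            out[kept++] = &edges[i];
    }
    return kept;
}
```

If the profile covered 1000 iterations (an assumed figure), the 990-count edge would be selected for synchronization and the two 10-count edges would be left to speculation.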
Compiler Passes
[Diagram: compiler passes between foo.c and foo.exe: Front End, Thread Creating, Profile Data Dependences, Decide what to Synchronize, Insert Synchronization, Instruction Scheduling, Back End]
Example
work() { if (condition(&set)) push (&set, element);}
do { push (&set, element); work(); } while (test);
push(head,entry) { entry->next = *head; *head = entry; }
Store *head(push) → Load *head(push): 990. Synchronize these.
push_clone(head,entry) { wait(); entry->next = *head; *head = entry; signal(head, *head); }
The call in the loop becomes: push_clone(&set, element);
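The cloning step can be written out as compilable C. This is a single-threaded sketch: `tls_wait` and `tls_signal` are hypothetical stand-ins for the wait() and signal() primitives on the slide and only record what would be communicated, and the `Entry` type is assumed for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Entry {
    struct Entry *next;
} Entry;

/* Hypothetical stand-ins for the TLS primitives. In this single-threaded
 * sketch they only record what would be sent between threads. */
static Entry **forwarded_addr;   /* address sent by the last signal */
static Entry  *forwarded_value;  /* value sent by the last signal   */
static int     waits_issued;     /* number of waits executed        */

static void tls_wait(void) { waits_issued++; }

static void tls_signal(Entry **addr, Entry *value)
{
    forwarded_addr  = addr;
    forwarded_value = value;
}

/* Original push: the load and store of *head are left to speculation. */
void push(Entry **head, Entry *entry)
{
    entry->next = *head;
    *head = entry;
}

/* Clone used only on the frequently dependent call path from the loop. */
void push_clone(Entry **head, Entry *entry)
{
    tls_wait();               /* wait for the previous thread's signal  */
    entry->next = *head;
    *head = entry;
    tls_signal(head, *head);  /* forward the address and the new value  */
}
```

Cloning keeps the call from work() on the unsynchronized push, so only the call path with the frequent dependence pays for synchronization.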
Outline
• Static analysis
Runtime checks
Results
Conclusions
Runtime Checks
Store *q and Load *p access the same memory address
No store modifies the forwarded address between Store *q and Load *p
Signal(q, *q);
Producer forwards the address to ensure a match between the load and the store
[Diagram: producer/consumer timeline with Store *q and Load *p]
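The two checks can be sketched as a consumer-side decision. The `Forwarded` type, the `consumer_load` function, and the boolean flag are hypothetical names for this example; how the intervening-store condition is actually detected is left to the hardware support described in the talk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the producer sends with Signal(q, *q): the store address and value. */
typedef struct {
    const int *addr;
    int        value;
} Forwarded;

/* Value the consumer's Load *p should observe.
 * overwritten_since_signal is an assumed input: true if some store has
 * modified the forwarded address between Store *q and Load *p. */
int consumer_load(const int *p, Forwarded f, bool overwritten_since_signal)
{
    if (f.addr == p && !overwritten_since_signal)
        return f.value;  /* the synchronized pair really is dependent */
    return *p;           /* otherwise fall back to the memory value   */
}
```

In a TLS system the fallback load would still be speculative, so a wrongly synchronized pair costs a stall rather than correctness.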
Ensuring Correctness
• Store *q and Load *p access the same memory address
No store modifies the forwarded address between Store *q and Load *p
Hardware support: similar to memory conflict buffer [Gallagher et al., ASPLOS’94]
[Diagram: producer/consumer timeline with Store *q, an additional Store *x, and Load *p]
Ensuring Correctness
• Store *q and Load *p access the same memory address
No store modifies the forwarded address between Store *q and Load *p
Hardware support: TLS hardware already knows which locations are stored to
[Diagram: producer/consumer timeline with Store *q, an additional Store *y, and Load *p]
Outline
• Static analysis
• Runtime checks
Results
Conclusions
Experimental Framework
Benchmarks
• SPECint95 and SPECint2000, -O3 optimization
Underlying architecture
• 4-processor, single-chip multiprocessor
• speculation supported through coherence
Simulator
• superscalar, similar to MIPS R14K
• 10-cycle communication latency
• models all bandwidth and contention
• detailed simulation
[Diagram: processors (P) and caches (C) connected by a crossbar]
Parallel Region Coverage
Coverage is significant
Average coverage: 54%
[Chart: parallel region coverage (0 to 100) for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp]
Compiler-Inserted Synchronization
Seven benchmarks speed up by 5% to 46%
U = No synchronization inserted
C = Compiler-inserted synchronization
[Chart: normalized regional execution time (0 to 100), broken down into Busy, Other, Synchronization Stall, and Failed Speculation, with U and C bars for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp. Speedup labels on the chart: 10%, 46%, 13%, 5%, 8%, 5%, 21%]
Compiler- vs. Hardware-Inserted Synchronization
Compiler and hardware [HPCA’02] each benefit different benchmarks
C = Compiler-inserted synchronization
H = Hardware-inserted synchronization
[Chart: normalized regional execution time (0 to 100), broken down into Busy, Other, Synchronization Stall, and Failed Speculation, with C and H bars for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp. Annotations mark the groups where "Hardware does better" and where "Compiler does better"]
Combining Hardware and Compiler Synchronization
The combination is more robust than either technique individually
C = Compiler-inserted synchronization
H = Hardware-inserted synchronization
B = Combining both
[Chart: normalized regional execution time (0 to 100), broken down into Busy, Other, Synchronization Stall, and Failed Speculation, with C, H, and B bars for go, m88ksim, gzip_comp, gzip_decomp, perlbmk, gap]
Related Work
Compiler-inserted: Zhai et al., CGO’04; Cytron, ICPP’86
Hardware-inserted (Centralized Table, Distributed Table): Moshovos et al., ISCA’97; Cintra & Torrellas, HPCA’02; Steffan et al., HPCA’02
Tsai & Yew, PACT’96
Conclusions
Compiler-inserted synchronization for memory-resident value communication:
• Effective in reducing speculation failure
• Half of the benchmarks speed up by 5% to 46% (regional)
Combining hardware and compiler techniques is more robust
• Neither consistently outperforms the other
• Can be combined to track the best performer
Memory-resident value communication should be addressed with the combined efforts of the compiler and the hardware
Questions?
The Potential of Instruction Scheduling
Scheduling instructions has additional benefit for some benchmarks
E = Early
C = Compiler-inserted synchronization
L = Late
[Chart: 0 to 100 scale, broken down into Busy, Other, Synchronization Stall, and Failed Speculation, with E, C, and L bars for go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, mcf, crafty, parser, perlbmk, gap, gzip_comp, gcc, bzip2_comp]
Program Performance
U = Un-optimized
C = Compiler-inserted synchronization
H = Hardware-inserted synchronization
B = Both compiler and hardware
[Chart: 0 to 100 scale, broken down into Busy, Other, Synchronization Stall, and Failed Speculation, with U, C, H, and B bars for go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, bzip2_decomp, twolf, gzip_comp]
Which Technique Synchronizes This Load?
U = Un-optimized
C = Compiler-inserted synchronization
H = Hardware-inserted synchronization
B = Both compiler and hardware
[Chart: 0 to 100 scale, with each bar divided into Synchronized by neither technique, Synchronized by compiler, Synchronized by hardware, and Synchronized by both; U, C, H, and B bars for go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, twolf, gzip_comp]
Ensuring Correctness
• Store *q and Load *p access the same memory address
No store modifies the forwarded address between Store *q and Load *p
Hardware support: similar to memory conflict buffer [Gallagher et al., ASPLOS’94]
[Diagram: producer/consumer timeline with Store *q, an additional Store *x, and Load *p]
Ensuring Correctness
• Store *q and Load *p access the same memory address
No store modifies the forwarded address between Store *q and Load *p
Hardware support: use the forwarded value only if the synchronized pair is dependent
[Flowchart: if q == p is NO, use the memory value. If YES, check for a local store to *p: if YES, use the memory value; if NO, use the forwarded value]
[Diagram: producer/consumer timeline with Store *q, Store *x, Signal(q); Signal(*q), and Load *p]
Issues in Synchronizing Memory-Resident Values
• Inserting synchronization using compilers
• Ensuring correctness
Reducing synchronization cost
[Diagram: producer/consumer timeline with Store *q and Load *p]
Reducing Cost of Synchronization
Instruction scheduling algorithms are described in [ASPLOS’02]
[Diagram: producer/consumer timelines before instruction scheduling and after instruction scheduling]
The Potential of Instruction Scheduling
Scheduling instructions could offer additional benefit
E = Perfectly predicting synchronized memory-resident values
C = Compiler-inserted synchronization
L = Consumer stalls until previous thread commits
[Chart: normalized regional execution time (0 to 100), broken down into Busy, Other, Synchronization Stall, and Failed Speculation, with E, C, and L bars for m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gap]
Using More Accurate Profiling Information
gzip_comp is the only benchmark sensitive to profiling input
U = No Instruction Scheduling
C = Compiler-inserted synchronization
R = Compiler-inserted synchronization (profiled with the ref input set)
[Chart: normalized regional execution time (0 to 100), broken down into Busy, Other, Synchronization Stall, and Failed Speculation, with U, C, and R bars for gzip_comp]