Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads

Carnegie Mellon


Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan† and Todd C. Mowry

School of Computer Science, Carnegie Mellon University

†Dept. Elec. & Comp. Engineering, University of Toronto


Motivation

Chip-level multiprocessing is becoming commonplace

We need parallel programs

Examples: UltraSPARC IV (2 UltraSPARC III cores), IBM Power 4, Sun MAJC, SiByte SB-1250

Can multithreaded processors improve the performance of a single application?



Why Is Automatic Parallelization Difficult?

Automatic parallelization today
• Must statically prove threads are independent
• Constructing proofs is difficult due to ambiguous data dependences: complex control flow, pointers and indirect references, runtime inputs

Optimistic compiler?
• Limited only by true dependences

One solution: Thread-Level Speculation



Example

while (...) {
  …
  x = hash[index1];
  …
  hash[index2] = y;
  ...
}

[Figure: timeline of speculative threads running loop iterations on Processors 1 to 4. Each thread loads from hash (hash[19], hash[33], hash[10], …), stores to hash (hash[21], hash[30], hash[25], …), then calls check_dep(). Thread 4 fails its dependence check and is retried.]
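The retry in this example can be illustrated with a small sequential model of the dependence check. This is only a sketch under stated assumptions, not the TLS hardware mechanism: the names `epoch_t` and `find_violations` are hypothetical, each thread is reduced to one load index and one store index, and every load in the window is assumed to run before any store becomes visible.

```c
#include <assert.h>
#include <stddef.h>

/* One speculative thread (epoch): the hash index it loads and the one it
   stores. Hypothetical type, for illustration only. */
typedef struct {
    int load_index;
    int store_index;
} epoch_t;

/* Mark epoch j as violated when a logically earlier epoch i < j in the same
   window stores to the index that j loads. Assumes all loads in the window
   execute before any store is visible (the worst case for speculation).
   Returns the number of violated epochs. */
size_t find_violations(const epoch_t *epochs, size_t n, int *violated)
{
    assert(epochs != NULL && violated != NULL);
    size_t count = 0;
    for (size_t j = 0; j < n; j++) {
        violated[j] = 0;
        for (size_t i = 0; i < j; i++) {
            if (epochs[i].store_index == epochs[j].load_index) {
                violated[j] = 1;  /* read-after-write dependence was broken */
                count++;
                break;
            }
        }
    }
    return count;
}
```

With four epochs whose (load, store) indices are (3, 10), (19, 21), (33, 30) and (10, 25), only the fourth epoch is flagged, because it loads hash[10] after an earlier epoch stores to it. The first pair is an assumed value chosen to create the conflict; the other three pairs appear on the slide.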



Frequently Dependent Scalars

Can identify scalars that always cause dependences

[Figure: timeline of a producer thread and a consumer thread; each thread reads a (…=a) and writes a (a=…).]


Dependent scalars should be synchronized [ASPLOS’02]

[Figure: the same producer and consumer timelines, with Signal(a) in the producer and Wait(a) in the consumer.]


Dataflow analysis allows us to deal with complex control flow [ASPLOS’02]

[Figure: producer and consumer timelines with the accesses …=a and a=….]



Communicating Memory-Resident Values

Synchronize?

Speculate?

Will speculation succeed?

[Figure: producer and consumer timelines; each thread executes Load *p and Store *q.]



Speculation vs. Synchronization

Speculation succeeds: efficient

[Figure: sequential execution versus speculative parallel execution of four iterations, each containing Load *p and Store *q. In the parallel case the four iterations overlap in time.]


Speculation fails: inefficient

[Figure: sequential execution versus speculative parallel execution of the same Load *p / Store *q iterations. A violation forces iterations to be re-executed in the parallel case.]


Frequent dependences: synchronize
Infrequent dependences: speculate

[Figure: sequential execution versus parallel execution of the same Load *p / Store *q iterations.]



Performance Potential

Reducing failed speculation improves performance

Detailed simulation:
• TLS support
• 4-processor CMP
• 4-way issue, out-of-order superscalar
• 10-cycle communication latency

[Figure: normalized regional execution time (0 to 100), original versus perfect memory value prediction, for m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp and go.]



Hardware vs. Compiler Inserted Synchronization

[Figure: three producer/consumer timelines in which Store *q and Load *p communicate through memory:
• Speculation
• Hardware-inserted synchronization [HPCA’02]
• Compiler-inserted synchronization [CGO’04]
Labels in the figure: Signal(), Wait(), (stall).]



Issues in Synchronizing Memory-Resident Values

Static analysis
• Which instructions to synchronize?
• Inter-procedural dependences

Runtime
• Detecting and recovering from improper synchronization

[Figure: producer Store *q and consumer Load *p over time.]



Outline

• Static analysis
• Runtime checks
• Results
• Conclusions



Compiler Passes

[Figure: compiler pipeline from foo.c to foo.exe. Between the Front End and the Back End are the passes Create Threads, Profile Data Dependences, Decide What to Synchronize, Insert Synchronization and Schedule Instructions.]



Example

work()

push (head, entry)

do {
  push(&set, element);
  work();
} while (test);



Example

work() {
  if (condition(&set))
    push(&set, element);
}

push (head, entry)




Example


push(head, entry) {
  entry->next = *head;
  *head = entry;
}

Memory accesses to *head, labeled by call path:
• Load *head(work, push)
• Load *head(push)
• Store *head(work, push)
• Store *head(push)



Compiler Passes

[Figure: compiler pipeline from foo.c to foo.exe; visible stages are Front End, Thread Creating, Instruction Scheduling and Back End.]



Example





Load *head(push)

Store *head(push)



Profile Information
========================================================
Source                    Destination               Frequency
Store *head(push)         Load *head(push)          990
Store *head(push)         Load *head(work, push)    10
Store *head(work, push)   Load *head(push)          10



Compiler Passes

[Figure: compiler pipeline from foo.c to foo.exe; visible stages are Front End, Thread Creating and Back End.]



Dependence Graph



[Figure: dependence graph over the Load *head and Store *head accesses, with edge weights 990, 10 and 10.]

Pairs that need to be synchronized can be extracted from the dependence graph

Infrequent dependences: occur in less than 5% of iterations
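The 5% rule can be written as a simple filter over profiled dependence edges. The following is a minimal sketch, not the compiler's actual pass: `dep_edge_t`, `is_frequent` and `select_pairs` are hypothetical names, and the caller supplies the number of profiled iterations, which the slides do not state.

```c
#include <assert.h>
#include <stddef.h>

/* One profiled dependence edge. Hypothetical type, for illustration only. */
typedef struct {
    const char *store;      /* source of the dependence */
    const char *load;       /* destination of the dependence */
    unsigned    frequency;  /* iterations in which the dependence occurred */
} dep_edge_t;

/* Returns 1 if the edge occurs in at least threshold_percent of the profiled
   iterations, i.e. it is not "infrequent" and should be synchronized. */
int is_frequent(const dep_edge_t *edge, unsigned iterations,
                unsigned threshold_percent)
{
    assert(edge != NULL && iterations > 0);
    /* Integer form of frequency / iterations >= threshold_percent / 100. */
    return (unsigned long long)edge->frequency * 100ULL
           >= (unsigned long long)threshold_percent * iterations;
}

/* Copies pointers to the frequent edges into selected (which must have room
   for n entries) and returns how many were selected. */
size_t select_pairs(const dep_edge_t *edges, size_t n, unsigned iterations,
                    unsigned threshold_percent, const dep_edge_t **selected)
{
    assert(edges != NULL && selected != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (is_frequent(&edges[i], iterations, threshold_percent))
            selected[count++] = &edges[i];
    return count;
}
```

If the three profiled edges (990, 10, 10) came from an assumed 1000 iterations, a 5% threshold would select only the Store *head(push) to Load *head(push) edge, leaving the two edges through work to speculation.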



Compiler Passes

[Figure: compiler pipeline from foo.c to foo.exe; visible stages are Front End, Thread Creating and Back End.]



Example





[Figure: dependence graph; the edge from Store *head(push) to Load *head(push), with frequency 990, is marked "Synchronize these".]

push_clone(head, entry) {
  wait();
  entry->next = *head;
  *head = entry;
  signal(head, *head);
}

push_clone(&set, element);



Outline

• Static analysis
• Runtime checks
• Results
• Conclusions



Runtime Checks

• Store *q and Load *p access the same memory address
• No store modifies the forwarded address between Store *q and Load *p

Signal(q, *q);

Producer forwards the address to ensure a match between the load and the store

[Figure: producer Store *q and consumer Load *p over time.]



Ensuring Correctness

• Store *q and Load *p access the same memory address
• No store modifies the forwarded address between Store *q and Load *p

Hardware support: similar to memory conflict buffer [Gallagher et al., ASPLOS’94]

[Figure: producer and consumer timelines with Store *q, Load *p and an additional Store *x.]


Hardware support: TLS hardware already knows which locations are stored to

[Figure: producer and consumer timelines with Store *q, Load *p and an additional Store *y.]
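The second correctness condition, that no store modifies the forwarded address between Store *q and Load *p, amounts to checking each later store against the addresses already forwarded. The sketch below is a software illustration only, not the hardware design from the talk: `forwarded_set_t`, `record_signal`, `store_hits_forwarded` and the capacity `MAX_FORWARDED` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FORWARDED 8  /* assumed capacity, chosen only for illustration */

/* Addresses whose values have already been forwarded with a signal. */
typedef struct {
    uintptr_t addr[MAX_FORWARDED];
    size_t    count;
} forwarded_set_t;

/* Record that the value at addr was forwarded. Returns 0 if the set is full. */
int record_signal(forwarded_set_t *set, const void *addr)
{
    assert(set != NULL);
    if (set->count == MAX_FORWARDED)
        return 0;
    set->addr[set->count++] = (uintptr_t)addr;
    return 1;
}

/* Check a later store: returns 1 when it writes an address that was already
   forwarded, meaning the forwarded value can no longer be trusted. */
int store_hits_forwarded(const forwarded_set_t *set, const void *addr)
{
    assert(set != NULL);
    for (size_t i = 0; i < set->count; i++)
        if (set->addr[i] == (uintptr_t)addr)
            return 1;
    return 0;
}
```

A store to an address that was never forwarded passes the check; a store to a forwarded address is reported so that the system can recover, for example by falling back to the normal violation mechanism.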



Outline

• Static analysis
• Runtime checks
• Results
• Conclusions



Experimental Framework

Underlying architecture
• 4-processor, single-chip multiprocessor
• Speculation supported through coherence

Simulator
• Superscalar, similar to MIPS R14K
• 10-cycle communication latency
• Models all bandwidth and contention (detailed simulation)

Benchmarks
• SPECint95 and SPECint2000, -O3 optimization

[Figure: processors (P) with caches (C) connected by a crossbar.]



Parallel Region Coverage

[Figure: parallel region coverage (0 to 100) for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap and bzip2_comp.]

Coverage is significant
Average coverage: 54%



Compiler-Inserted Synchronization

[Figure: normalized regional execution time (0 to 100), broken down into Failed Speculation, Synchronization Stall, Other and Busy, for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap and bzip2_comp. U = no synchronization inserted; C = compiler-inserted synchronization. Speedups annotated on the chart: 10%, 46%, 13%, 5%, 8%, 5%, 21%.]

Seven benchmarks speed up by 5% to 46%



Compiler- vs. Hardware-Inserted Synchronization

[Figure: normalized regional execution time (0 to 100) for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap and bzip2_comp. C = compiler-inserted synchronization; H = hardware-inserted synchronization. Annotations mark the benchmarks where hardware does better and those where the compiler does better.]

Compiler and hardware [HPCA’02] each benefit different benchmarks



Combining Hardware and Compiler Synchronization

[Figure: normalized regional execution time (0 to 100) for go, m88ksim, gzip_comp, gzip_decomp, perlbmk and gap. C = compiler-inserted synchronization; H = hardware-inserted synchronization; B = combining both.]

The combination is more robust than each technique individually




Related Work

Compiler-inserted: Zhai et al., CGO’04; Cytron, ICPP’86

Hardware-inserted: Moshovos et al., ISCA’97; Cintra & Torrellas, HPCA’02; Steffan et al., HPCA’02

Centralized Table / Distributed Table

Tsai & Yew, PACT’96



Conclusions

Compiler-inserted synchronization for memory-resident value communication:
• Effective in reducing speculation failure
  Half of the benchmarks speed up by 5% to 46% (regional)
• Combining hardware and compiler techniques is more robust
  Neither consistently outperforms the other
  Can be combined to track the best performer

Memory-resident value communication should be addressed with the combined efforts of the compiler and the hardware



Questions?



The Potential of Instruction Scheduling

[Figure: bar chart (0 to 100) for go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, mcf, crafty, parser, perlbmk, gap, gzip_comp, gcc and bzip2_comp. E = early; C = compiler-inserted synchronization; L = late.]

Scheduling instructions has additional benefit for some benchmarks



Program Performance

[Figure: bar chart (0 to 100) for go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, bzip2_decomp, twolf and gzip_comp. U = un-optimized; C = compiler-inserted synchronization; H = hardware-inserted synchronization; B = both compiler and hardware.]



Which Technique Synchronizes This Load?

[Figure: bar chart (0 to 100) for go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, twolf and gzip_comp. U = un-optimized; C = compiler-inserted synchronization; H = hardware-inserted synchronization; B = both compiler and hardware. Each bar is divided into loads synchronized by neither technique, by the compiler, by the hardware, and by both.]


Hardware support: similar to memory conflict buffer [Gallagher et al., ASPLOS’94]

[Figure: producer and consumer timelines with Store *q, Load *p and Store *x.]



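Hardware support: use the forwarded value only if the synchronized pair is dependent

[Figure: consumer-side decision flow for Load *p. The producer executes Store *q, then Signal(q) and Signal(*q); the figure also shows a Store *x. Decision boxes: "Local store to *p" and "q == p", each with YES and NO branches. Outcomes: use the forwarded value, or use the memory value.]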
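The decision flow can be expressed as a small function. This is a sketch of one reading of the flowchart, since the branch labels did not survive extraction: it assumes a local store to *p takes priority and selects the memory value, and that otherwise the forwarded value is used only when q == p. The names `forwarded_t` and `consumer_load` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Message forwarded by the producer: the store address q and the value *q.
   Hypothetical layout, for illustration only. */
typedef struct {
    const int *addr;
    int        value;
} forwarded_t;

/* Value that the consumer's Load *p should observe.
   local_store_to_p is nonzero when the consumer thread itself has already
   stored to *p before this load. */
int consumer_load(const int *p, const forwarded_t *fwd, int local_store_to_p)
{
    assert(p != NULL && fwd != NULL);
    if (local_store_to_p)
        return *p;           /* the thread's own store is newer: memory value */
    if (fwd->addr == p)
        return fwd->value;   /* q == p: the pair is dependent, use forwarded value */
    return *p;               /* q != p: the pair is independent, use memory value */
}
```

In this model "memory value" simply means reading through p; in a TLS system that read would go through the speculative thread's normal memory path.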



Issues in Synchronizing Memory-Resident Values

• Inserting synchronization using compilers
• Ensuring correctness
• Reducing synchronization cost

[Figure: producer Store *q and consumer Load *p.]



Reducing Cost of Synchronization

[Figure: producer and consumer timelines before instruction scheduling and after instruction scheduling.]

Instruction scheduling algorithms are described in [ASPLOS’02]



The Potential of Instruction Scheduling

[Figure: normalized regional execution time (0 to 100) for m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place and gap. E = perfectly predicting synchronized memory-resident values; C = compiler-inserted synchronization; L = consumer stalls until previous thread commits.]

Scheduling instructions could offer additional benefit


Using More Accurate Profiling Information

[Figure: normalized regional execution time (0 to 100) for gzip_comp. U = no instruction scheduling; C = compiler-inserted synchronization; R = compiler-inserted synchronization (profiled with the ref input set).]

gzip_comp is the only benchmark sensitive to profiling input
