Master/Slave Speculative Parallelization


Transcript of Master/Slave Speculative Parallelization

Page 1: Master/Slave Speculative Parallelization

Master/Slave Speculative Parallelization

Craig Zilles (U. Illinois), Guri Sohi (U. Wisconsin)

Page 2: Master/Slave Speculative Parallelization

The Basics

#1: a well-known problem: On-chip Communication

#2: a well-known opportunity: Program Predictability

#3: our novel approach to #1 using #2

Page 3: Master/Slave Speculative Parallelization

Problem: Communication

Cores are becoming “communication limited” rather than “capacity limited”

Many, many transistors on a chip, but we can’t bring them all to bear on one thread

Control/data dependences = frequent communication

Page 4: Master/Slave Speculative Parallelization

Best core << chip size

There is a sweet spot for core size: further size increases hurt either MHz or IPC

How can we maximize core’s efficiency?

[Figure: a small core occupying a fraction of the whole chip]

Page 5: Master/Slave Speculative Parallelization

Opportunity: Predictability

Many program behaviors are predictable: control flow, dependences, values, stalls, etc.

Widely exploited by processors/compilers, but not to help increase effective core size: core resources are used to make and validate predictions

Example: perfectly-biased branch

[Figure: a bne branch, taken 100% / not-taken 0%]
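As a hypothetical C illustration (mine, not from the slides), a null check that never fires in practice is exactly such a perfectly-biased branch: trivially predictable, yet the core still spends fetch, predictor, and execution resources on it every time.

    #include <stddef.h>

    /* The check below is taken 0% of the time in practice, yet it,
       and its backwards slice, occupies core resources on every call. */
    int deref(const int *p) {
        if (p == NULL)    /* perfectly-biased branch: taken 0% */
            return -1;    /* cold path, never executed in practice */
        return *p;        /* the 100% path */
    }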

Page 6: Master/Slave Speculative Parallelization

Speculative Execution

Execute code before/after the branch in parallel: the branch is fetched, predicted, executed, retired

All of this occurs in the core:

Uses the branch predictor

Uses space in the I-cache

Uses execution resources

Not just the branch, but its backwards slice

Page 7: Master/Slave Speculative Parallelization

Trace/Superblock Formation

Optimize code assuming the predicted path: reduces the cost of the branch and surrounding code; the prediction is implicitly encoded in the executable

The code still verifies the prediction: the branch & its slice are still fetched, executed, committed, etc.

All of this still occurs on the core

[Figure: superblock formed around the 100%-biased bne branch]
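A rough C analogy of superblock formation (my sketch, not the authors’ code): the hot path is laid out straight-line assuming the branch resolves the predicted way, but the guard stays in the trace, so the prediction is still verified by instructions executed on the core.

    /* Original: cold path interleaved with the hot path. */
    int original_layout(int x, int rare) {
        if (rare)                /* ~0% taken */
            x = -x;              /* cold path */
        return x * 2 + 1;        /* hot path */
    }

    /* Superblock: assume !rare and fall straight through; the
       check remains (it verifies the prediction), and the cold
       path becomes out-of-line compensation code. */
    int superblock_layout(int x, int rare) {
        if (rare)
            goto side_exit;      /* guard still fetched & executed */
        return x * 2 + 1;        /* straight-line hot trace */
    side_exit:
        return -x * 2 + 1;       /* compensation code */
    }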

Page 8: Master/Slave Speculative Parallelization

Why waste core resources?

The branch is perfectly predictable!

The core should only execute instructions that are not statically predictable!

Page 9: Master/Slave Speculative Parallelization

If not in the core, where?

Anywhere else on chip! Because it is predictable:

It doesn’t prevent forward progress

We can tolerate the latency to verify the prediction

[Figure: off-core instruction storage supplies a prediction; the core later verifies the prediction]

Page 10: Master/Slave Speculative Parallelization

A concrete example: Master/Slave Speculative Parallelization

Execute a “distilled program” on one processor: a version of the program with predictable instructions removed; faster than the original, but not guaranteed to be correct

Verify the predictions by executing the original program

Parallelize verification by splitting it into “tasks”

Master core: executes the distilled program

Slave cores: parallel execution of the original program

Page 11: Master/Slave Speculative Parallelization

Talk Outline

Removing predictability from programs: “Approximation”

Externally verifying distilled programs: Master/Slave Speculative Parallelization (MSSP)

Results Summary

Summary

Page 12: Master/Slave Speculative Parallelization

Approximation Transformations

Pretend you’ve proven the common case: preserve correctness in the common case; break correctness in the uncommon case

Use a profile to know the common case

[Figure: CFG with a 99%/1% bne in block A leading to B and C; approximation removes the rare path to C, merging A and B]
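As a hedged C sketch of this transformation (my illustration; the actual distiller works on Alpha binaries): the profiled 1% path is deleted outright from the distilled version, merging A and B. The result is cheaper in the common case and simply wrong in the rare case, which verification against the original program will catch.

    /* Stand-in for block C, the rare path's work. */
    static int fixup(int v) { return -v; }

    /* Original: block A branches 99%/1% between B and C. */
    int original_fn(int v) {
        if (v < 0)               /* bne: taken 1% of the time */
            return fixup(v);     /* block C, the rare path */
        return v + 10;           /* block B, the common path */
    }

    /* Distilled: the branch and rare path are approximated away,
       so A and B merge. Incorrect whenever v < 0. */
    int distilled_fn(int v) {
        return v + 10;
    }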

Page 13: Master/Slave Speculative Parallelization

Not just for branches

Values: ld r13, 0(X), where the load is highly invariant (usually gets 7); approximate it as addi $zero, 7, r13

Memory dependences: st r12, 0(A) followed by ld r11, 0(B), where A and B may alias. If they rarely alias in practice, assume they never do. If they almost always alias, assume they always do and forward the value: mv r12, r11
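The same two transformations sketched in C (a hypothetical illustration of the Alpha snippets above): an invariant load becomes a constant, and a store-load pair that almost always aliases becomes a value forward.

    /* Value approximation: *X is profiled to be 7 nearly always,
       so the load is replaced by the constant
       (cf. "ld r13, 0(X)" becoming "addi $zero, 7, r13"). */
    int value_original(const int *X)  { return *X * 3; }
    int value_distilled(const int *X) { (void)X; return 7 * 3; }

    /* Dependence approximation: if A and B almost always alias,
       forward the stored value instead of reloading it
       (cf. the load becoming "mv r12, r11"). */
    int dep_original(int *A, const int *B, int r12) {
        *A = r12;
        return *B;            /* may or may not alias *A */
    }
    int dep_distilled(int *A, const int *B, int r12) {
        (void)B;
        *A = r12;
        return r12;           /* assume the load aliases the store */
    }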

Page 14: Master/Slave Speculative Parallelization

Enables Traditional Optimizations

[Figure: control-flow graph from bzip2, with calls to printf, fprintf, fwrite, and exit]

From bzip2: many static paths, but two dominant paths

Approximate away the unimportant paths

What remains has a very straightforward structure: easy for the compiler to optimize

Page 18: Master/Slave Speculative Parallelization

Effect of Approximation

Equivalent 99.999% of the time, with better execution characteristics:

Fewer dynamic instructions: ~1/3 of the original code

Smaller static size: ~2/5 of the original code

Fewer taken branches: ~1/4 of the original code

Smaller fraction of loads/stores

Shorter than the best non-speculative code: removing checks leaves the code incorrect .001% of the time

[Figure: distilled code vs. original code]

Page 19: Master/Slave Speculative Parallelization

Talk Outline

Removing predictability from programs: “Approximation”

Externally verifying distilled programs: Master/Slave Speculative Parallelization (MSSP)

Results Summary

Summary

Page 20: Master/Slave Speculative Parallelization

Goal

Achieve performance of distilled program

Retain correctness of original program

Approach: use the distilled code to speed up the original program

Page 21: Master/Slave Speculative Parallelization

Checkpoint parallelization

Cut the original program into “tasks”; assign tasks to processors

Provide each task a checkpoint of registers & memory: this completely decouples task execution; tasks retrieve all live-ins from the checkpoint

Checkpoints are taken from the distilled program: captured in hardware, stored as a “diff” from architected state
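A minimal C sketch of what such a checkpoint might carry (the concrete layout is my assumption, guided by the slide): a snapshot of the registers plus only the memory words that differ from architected state, i.e., a “diff”.

    #include <stdint.h>
    #include <stddef.h>

    #define NREGS    32
    #define MAX_DIFF 256   /* hypothetical cap on dirty words per task */

    /* One memory word the distilled program has written since the
       last architected state. */
    struct mem_diff {
        uintptr_t addr;
        uint64_t  value;
    };

    /* Checkpoint = register snapshot + memory diff. A slave task
       reads all of its live-ins from here, so tasks are fully
       decoupled from one another. */
    struct checkpoint {
        uint64_t        regs[NREGS];
        struct mem_diff diff[MAX_DIFF];
        size_t          ndiff;
    };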

Page 22: Master/Slave Speculative Parallelization

Master core: executes the distilled program

Slave cores: parallel execution of the original program

Page 23: Master/Slave Speculative Parallelization

Example Execution

[Figure: timeline across the Master and Slaves 1–3; the Master runs distilled tasks A’, B’, C’ while the slaves run original tasks A, B, C]

Start the Master and the first Slave from architected state (A’, A)

Take a checkpoint; use it to start the next task (B’, B)

Verify B’s inputs against A’s outputs; commit state

A bad checkpoint at C’ is detected at the end of B: squash C and restart from architected state
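A schematic C loop for the verification the figure shows (my sketch, serialized for clarity; the real slaves execute their tasks in parallel): each task’s checkpointed live-ins are compared against the previous task’s actual outputs, committing on a match and squashing on a mismatch.

    #include <stdbool.h>
    #include <string.h>

    struct state { unsigned long regs[32]; };   /* simplified */

    /* Hypothetical helpers, declarations only for this sketch. */
    struct state run_original_task(int task, struct state in);
    void squash_from(int task);                 /* discard tasks >= task */

    static bool checkpoint_ok(const struct state *ckpt,
                              const struct state *actual) {
        return memcmp(ckpt, actual, sizeof *ckpt) == 0;
    }

    void verify_and_commit(struct state architected,
                           const struct state *ckpt, int ntasks) {
        struct state actual = architected;
        for (int t = 0; t < ntasks; t++) {
            actual = run_original_task(t, actual);      /* ground truth */
            if (t + 1 < ntasks && !checkpoint_ok(&ckpt[t + 1], &actual)) {
                squash_from(t + 1);   /* bad checkpoint: squash, then   */
                break;                /* restart from architected state */
            }
            /* checkpoint verified: commit task t's outputs */
        }
    }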

Page 24: Master/Slave Speculative Parallelization

MSSP Critical Path

[Figure: the same timeline; tasks A’/A, B’/B, C’/C across the Master and Slaves 1–3]

If checkpoints are correct: the critical path runs through the distilled program; no communication latency; verification happens in the background

If a checkpoint is bad: the critical path runs through the original program and incurs interprocessor communication

If bad checkpoints are rare: we get the performance of the distilled program and tolerance of communication latency
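As a back-of-the-envelope model (my own, not from the talk): if a fraction $r$ of tasks start from a bad checkpoint, the expected per-task time is roughly

    $T_{task} \approx (1 - r)\,T_{distilled} + r\,(T_{original} + T_{comm} + T_{restart})$

and with $r$ on the order of one misspeculation per 10,000 instructions (see the results), the second term is negligible, so per-task time approaches that of the distilled program.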

Page 25: Master/Slave Speculative Parallelization

Talk Outline

Removing predictability from programs: “Approximation”

Externally verifying distilled programs: Master/Slave Speculative Parallelization (MSSP)

Results Summary

Summary

Page 26: Master/Slave Speculative Parallelization

Methodology

First-cut distiller: a static binary-to-binary translator with simple control-flow approximations, plus DCE, inlining, register re-allocation, save/restore elimination, code layout…

HW model: 8-way CMP of Alpha 21264 cores, with 10-cycle interconnect latency to a shared L2

SPEC2000 integer benchmarks on Alpha

Page 27: Master/Slave Speculative Parallelization

Results Summary

Distilled programs can be accurate: 1 task misspeculation per 10,000 instructions

Speedup depends on distillation: 1.25 harmonic mean, ranging from 1.0 to 1.7 (gcc, vortex), relative to uniprocessor execution

Modest storage requirements: tens of kB at the L2 for speculation buffering

Decent latency tolerance: raising the interconnect latency from 5 to 20 cycles costs only ~10%

Page 28: Master/Slave Speculative Parallelization

Distilled Program Accuracy

[Figure: per-benchmark misspeculation distance, log scale from 1,000 to 100,000]

Average distance between task misspeculations: > 10,000 original program instructions

Page 29: Master/Slave Speculative Parallelization

Distillation Effectiveness

[Figure: per benchmark, instructions retired by the Master (distilled program) as a fraction of those retired by the Slaves (original program), not counting nops; scale 0% to 100%]

Up to a two-thirds reduction

Page 30: Master/Slave Speculative Parallelization

Performance

Performance scales with distillation effectiveness

[Figure: per-benchmark speedup (1.0 to 1.6) alongside distillation effectiveness and checkpoint accuracy]

Page 31: Master/Slave Speculative Parallelization

Related Work

Slipstream

Speculative Multithreading

Pre-execution

Feedback-directed Optimization

Dynamic Optimizers

Page 32: Master/Slave Speculative Parallelization

Summary

Don’t waste the core on predictable things: “distill” the predictability out of programs

Verify predictions with the original program: split it into tasks for parallel validation, achieving the throughput to keep up

Has some nice attributes (ask offline): supports legacy binaries, is latency tolerant, has low verification cost, and complements explicit parallelism