iCFP: Tolerating All Level Cache Misses in In-Order Processors


Page 1: iCFP: Tolerating All Level Cache Misses in In-Order Processors

HPCA-15 :: Feb 18, 2009

iCFP: Tolerating All Level Cache Misses in In-Order Processors

Andrew Hilton, Santosh Nagarakatte, Amir Roth
University of Pennsylvania

{adhilton,santoshn,amir}@cis.upenn.edu

Page 2: iCFP: Tolerating All Level Cache Misses in In-Order Processors

A Brief History …

[Chart: processor evolution plotted against performance and power]
• Pentium (in-order)
• Pentium II (out-of-order)
• Core 2 Duo (out-of-order, 2 cores)
• Nehalem (out-of-order, 4 cores, 8 threads)
• Niagara 2 (in-order, 16 cores, 64 threads)

POWER!

Page 3: iCFP: Tolerating All Level Cache Misses in In-Order Processors

In-order vs. Out-of-Order

In-order cores
• Power efficiency
• More cores

Out-of-order cores
• Single-thread IPC (+63%)

Is there a compromise?

Key idea
• Main benefit of out-of-order: data cache miss tolerance
• Can we add it to in-order in a simple way?

Page 4: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Runahead

Runahead execution [Dundas+, ICS'97]
• In-order + memory-level parallelism (MLP)
• Checkpoint and "advance" under miss
• Restore checkpoint when miss returns

Additional hardware?
• Regfile checkpoint-restore
• Per-register "poison" bits
• Forwarding cache

[Diagram: in-order pipeline with I$, D$, regfile RF0, poison bits, and forwarding cache]

Can we do better?
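The following is a minimal behavioral sketch (in Python, with illustrative names, not the authors' hardware) of the runahead mechanism described above: checkpoint the regfile on a miss, poison the miss's destination, propagate poison through dependences while advancing, and restore the checkpoint when the miss returns.

    class RunaheadCore:
        def __init__(self, num_regs=32):
            self.rf = [0] * num_regs           # architectural register file (RF0)
            self.poison = [False] * num_regs   # per-register poison bits
            self.checkpoint = None

        def enter_advance(self, miss_dst):
            # a load misses: checkpoint the regfile and poison its output
            self.checkpoint = list(self.rf)
            self.poison[miss_dst] = True

        def advance_execute(self, srcs, dst, compute):
            # any poisoned input poisons the output; otherwise execute normally
            if any(self.poison[s] for s in srcs):
                self.poison[dst] = True
                return
            self.rf[dst] = compute(*(self.rf[s] for s in srcs))
            self.poison[dst] = False

        def miss_returns(self):
            # runahead throws the advance work away and restarts from the checkpoint
            self.rf = list(self.checkpoint)
            self.poison = [False] * len(self.rf)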

Page 5: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Yes We Can! (Sorry)

iCFP: in-order Continual Flow Pipeline
• Runahead, but …
• Save miss-independent work
• Re-execute only the miss forward slice

In-order adaptation of CFP [Srinivasan+, ASPLOS'04]
• Unblock pipeline latches, not the issue queue and regfile
• Apply to misses at all cache levels, not just the L2

Additional hardware?
• Slice buffer
• Replace the forwarding cache with a store buffer
• Hijack the additional regfile (RF1) used for multi-threading

[Diagram: pipeline with I$, D$, RF0, RF1, poison bits, store buffer, and slice buffer]

Page 6: iCFP: Tolerating All Level Cache Misses in In-Order Processors

iCFP Roadmap

Motivation and overview

(Not fully) working example

Correctness features
• Register communication for miss-dependent instructions
• Store-load forwarding
• Multiprocessor safety

Performance features

Evaluation

Page 7: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Example

A1: load [r1] -> r2
B1: load [r2] -> r3
C1: add r3, r4 -> r5
D1: store r5 -> [r6]
E1: add r1, #4 -> r1
F1: branch r1, #40, A
A2: load [r1] -> r2
B2: load [r2] -> r3
C2: add r3, r4 -> r5
D2: store r5 -> [r6]

[Diagram: pipeline with I$, D$, RF0 (Tail), RF1, store buffer, slice buffer, and per-register poison bits; A1–C1 flowing through the pipeline; bold paths are active; Tail marks the last completed instruction in RF0]

Page 8: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Load A1 misses, transition to "advance" mode
• Checkpoint regfile
• Poison A1's output register r2

[Diagram: A1 misses in the D$; r2's poison bit is set]

Page 9: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Load A1 misses, transition to "advance" mode
• Checkpoint regfile
• Poison A1's output register r2
• Divert A1 to slice buffer

[Diagram: A1 sits in the slice buffer as a pending miss (red); poisoned registers: r2]

Page 10: iCFP: Tolerating All Level Cache Misses in In-Order Processors

• Propagate poison through data dependences

[Diagram: advance mode; poisoned registers: r2]

Page 11: iCFP: Tolerating All Level Cache Misses in In-Order Processors

• Propagate poison through data dependences
• Divert miss-dependent instructions to slice buffer

[Diagram: advance mode; miss-dependent instructions highlighted; A1 and B1 in the slice buffer; poisoned registers: r2, r3]

Page 12: iCFP: Tolerating All Level Cache Misses in In-Order Processors

• Propagate poison through data dependences
• Divert miss-dependent instructions to slice buffer
• Buffer stores in store buffer

[Diagram: advance mode; poisoned registers: r2, r3, r5]

Page 13: iCFP: Tolerating All Level Cache Misses in In-Order Processors

• Propagate poison through data dependences
• Divert miss-dependent instructions to slice buffer
• Buffer stores in store buffer
• Miss-independent instructions execute as usual

[Diagram: advance mode; poisoned registers: r2, r3, r5]

Page 14: iCFP: Tolerating All Level Cache Misses in In-Order Processors

• Propagate poison through data dependences
• Divert miss-dependent instructions to slice buffer
• Buffer stores in store buffer
• Miss-independent instructions execute as usual, update regfile

[Diagram: advance mode; miss-independent instructions shown in green; poisoned registers: r2, r3, r5]

Page 15: iCFP: Tolerating All Level Cache Misses in In-Order Processors

• Propagate poison through data dependences
• Divert miss-dependent instructions to slice buffer
• Buffer stores in store buffer
• Miss-independent instructions execute as usual, update regfile

[Diagram: advance continues; poisoned registers: r2, r3, r5]

Page 16: iCFP: Tolerating All Level Cache Misses in In-Order Processors

• Propagate poison through data dependences
• Divert miss-dependent instructions to slice buffer
• Buffer stores in store buffer
• Miss-independent instructions execute as usual, update regfile
• Can "un-poison" tail registers

[Diagram: A2 overwrites r2 and clears its poison bit; poisoned registers: r3, r5]
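Below is a hedged sketch (illustrative record fields and helpers, not the paper's implementation) of one advance-mode step as walked through above: a miss-dependent instruction is diverted to the slice buffer and poisons its output, while a miss-independent instruction executes, updates the tail regfile, and un-poisons the register it overwrites.

    def advance_step(inst, rf, poison, slice_buf, store_buf):
        # inst is a simple record: srcs, dst, is_store, addr, and an execute() helper
        if any(poison[s] for s in inst.srcs):
            # miss-dependent: defer to the slice buffer, poison the output
            slice_buf.append(inst)
            if inst.dst is not None:
                poison[inst.dst] = True
            if inst.is_store:
                store_buf.append((inst.addr, None, True))    # poisoned store entry
            return
        # miss-independent: execute as usual
        value = inst.execute(rf)
        if inst.is_store:
            store_buf.append((inst.addr, value, False))      # buffered, not written to the D$
        elif inst.dst is not None:
            rf[inst.dst] = value
            poison[inst.dst] = False     # overwriting a register "un-poisons" it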

Page 17: iCFP: Tolerating All Level Cache Misses in In-Order Processors

When the A1 miss returns, transition to "rally"
• Stall fetch
• Pipe in contents of slice buffer

[Diagram: the returning miss fills the D$; slice-buffer contents begin re-entering the pipeline; poisoned registers: r5]

Page 18: iCFP: Tolerating All Level Cache Misses in In-Order Processors

• Drain advance instructions already in the pipeline (C2–D2)

[Diagram: drain; A1 re-enters the pipeline from the slice buffer and accesses the D$]

Page 19: iCFP: Tolerating All Level Cache Misses in In-Order Processors

• Drain advance instructions already in the pipeline (C2–D2)

[Diagram: drain continues; B1 accesses the D$]

Page 20: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Rally
• Complete deferred instructions from slice buffer

[Diagram: rally; deferred slice instructions re-execute]

Page 21: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Rally
• Execute deferred instructions from slice buffer
• When slice buffer is empty, un-block fetch

[Diagram: rally; fetch resumes]

Page 22: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Rally

Wait for deferred instructions to complete

[Diagram: rally]

Page 23: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Back To Normal

When the last deferred instruction completes …

Page 24: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Back To Normal

When the last deferred instruction completes
• Release register checkpoint

Page 25: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Back To Normal

When the last deferred instruction completes
• Release register checkpoint
• Resume normal execution at the tail

Page 26: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Back To Normal

When the last deferred instruction completes
• Release register checkpoint
• Resume normal execution at the tail
• Drain stores from store buffer to D$

Page 27: iCFP: Tolerating All Level Cache Misses in In-Order Processors

One Way Or The Other

If a rally hits a mis-predicted branch, exception, etc.
• Flush pipeline
• Discard store buffer contents
• Restore regfile from checkpoint
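As a rough sketch (the core object and its methods are hypothetical, for illustration only), the recovery path simply discards everything produced since the checkpoint:

    def recover(core):
        # rally hit a mis-predicted branch or exception: throw away advance/rally work
        core.flush_pipeline()                  # squash in-flight instructions
        core.store_buffer.clear()              # buffered stores never reached the D$
        core.slice_buffer.clear()
        core.rf = list(core.checkpoint)        # restore regfile from the checkpoint
        core.poison = [False] * len(core.rf)
        core.mode = "normal"                   # resume execution from the checkpointed point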

Page 28: iCFP: Tolerating All Level Cache Misses in In-Order Processors

iCFP Roadmap

Motivation and overview

(Not fully) working example

Correctness features
• Register communication for miss-dependent instructions
• Store-load forwarding
• Multiprocessor safety

Performance features

Evaluation

Page 29: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Rally Register Communication

Where do A1–C1 write r2, r3, r5 during rally?
• Not in Tail RF0
• Already written by logically younger A2–C2

[Diagram: rally mode]

Page 30: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Rally Register Communication

Use RF1 as a rally scratch-pad
• Update Tail RF0 if youngest writer (not in this example)

[Diagram: rally; A1's result is written to RF1 (Rally)]

Page 31: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Rally Register Communication

Use RF1 as a rally scratch-pad
• Update Tail RF0 if youngest writer (not in this example)

[Diagram: rally continues; A1's and B1's results are in RF1]

Page 32: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Store-Load Forwarding

iCFP is in-order, but …
• Rally loads execute out-of-order with respect to advance stores (possible WAR hazards)

The store-load forwarding mechanism should
• Avoid WAR hazards
• Avoid redoing stores

Forwarding cache? D$ with speculative writes?
• Not what we want

What we really want is a large (64-entry+) store queue
• Like in an out-of-order processor
  – Associative search doesn't scale nicely

Page 33: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Chained Store Buffer

Replace associative search with iterative indexed search
• Exploit the fact that stores enter the store buffer in order
• Address must be known: otherwise stall
• Overlay the store buffer with an address-based hash table

Store buffer contents (Tail = younger, Head = older):

  SSN:      86    85    84    83    82    81    80
  address:  7B0   2AC   388   1B4   384   1AC   380
  value:    90    78    ??    56    ??    34    12
  poison:   0     0     1     0     1     0     0
  link:     44    81    0     15    0     77    0

Root table (64 entries, indexed by address): AC→85, B0→86, B4→83, B8→21
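Here is a hedged software sketch of the chained store buffer (illustrative data structures; the real design is a hardware table): stores append in SSN order and link to the previous store that hashed to the same root entry, and loads walk that chain from youngest to oldest.

    class ChainedStoreBuffer:
        def __init__(self, root_entries=64):
            self.entries = {}               # SSN -> (addr, value, poison, link)
            self.root = [0] * root_entries  # youngest SSN per address hash bucket
            self.next_ssn = 1

        def _bucket(self, addr):
            return addr % len(self.root)    # hash on low address bits (illustrative)

        def insert(self, addr, value, poison=False):
            ssn = self.next_ssn
            self.next_ssn += 1
            b = self._bucket(addr)
            self.entries[ssn] = (addr, value, poison, self.root[b])  # link to prior store
            self.root[b] = ssn
            return ssn

        def forward(self, addr, newest_older_ssn):
            # walk the chain from youngest to oldest; a rally load passes the SSN
            # of the youngest store older than itself (noted during advance), so
            # younger stores are ignored and WAR hazards are avoided
            ssn = self.root[self._bucket(addr)]
            while ssn != 0 and ssn in self.entries:
                e_addr, value, poison, link = self.entries[ssn]
                if ssn <= newest_older_ssn and e_addr == addr:
                    return value            # match: forward (caller handles poison)
                ssn = link
            return None                     # no older matching store: go to the D$

Called with the load's address and the SSN it noted during advance, forward() reproduces the walks on the next two slides: skip younger stores, forward on an older match, otherwise go to the D$.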

Page 34: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Loads follow the chain starting at the appropriate root table entry
• For example, a load to address 1AC
• root[AC] → SSN 85 (address 2AC: no match) → link → SSN 81 (address 1AC): match, forward

[Store buffer figure as on the previous slide]

Page 35: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Loads follow the chain starting at the appropriate root table entry

Rally loads ignore younger stores, avoiding WAR hazards
• For example, a rally load to address 1B4 …
• … which noted its immediately older store (SSN 81) during advance
• root[B4] → SSN 83 (address 1B4, but 83 > 81: younger store, ignore) → link 15: no older match in the buffer, go to the D$

[Store buffer figure as on the previous slide]

Page 36: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Chained Store Buffer

+ Non-speculative (including no WAR hazards)
+ Scalable
+ Average number of excess hops < 0.05 with a 64-entry root table
– Must stall on (miss-dependent) stores with unknown addresses
  • These are rare

[Store buffer figure as on the previous slides]

Page 37: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Multi-Processor Safety

iCFP is in-order, but … (yeah, again)
• Advance loads are vulnerable to stores from other threads
• Just like in an out-of-order processor

Must snoop/verify these
• An associative load queue is too expensive for an in-order processor
• The paper describes a scheme based on local signatures
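One common way to implement such signatures (a hedged sketch; the paper's exact scheme may differ) is a small Bloom-filter-style summary of the addresses read by advance loads, checked against incoming invalidations:

    class LoadSignature:
        def __init__(self, bits=64):
            self.bits = bits
            self.sig = 0                      # bit-vector summary of advance-load addresses

        def _hash(self, addr):
            return (addr >> 6) % self.bits    # hash the cache-block address (illustrative)

        def record_advance_load(self, addr):
            self.sig |= 1 << self._hash(addr)

        def snoop_hit(self, addr):
            # conservative check: false positives possible, false negatives not;
            # a hit means an advance load may have read data another thread wrote,
            # so the advance/rally work must be squashed (restore the checkpoint)
            return bool(self.sig & (1 << self._hash(addr)))

        def clear(self):
            self.sig = 0                      # e.g., once the advance work commits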

Page 38: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Methodology

Cycle-level simulation
• 2-way issue, 9-stage in-order pipeline
• 32KByte D$
• 20-cycle, 1MByte, 8-way L2 (8 8-entry stream buffers)
• 400-cycle main memory, 4 Bytes/cycle, 32 outstanding misses
• 128-entry chained store buffer, 128-entry slice buffer

SPEC2000 benchmarks
• Alpha AXP ISA
• DEC OSF compiler, -O4 optimization
• 2% sampling with warm-up

Page 39: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Initial Evaluation

iCFP vs. Runahead: advance on L2 misses
• Roughly the same performance: +10%
• Dominated by MLP
• iCFP's ability to reuse work rarely significant (vortex)

[Chart: % speedup over 2-way in-order (0–50) for applu, mgrid, swim (SpecFP) and bzip2, vortex, vpr (SpecINT); bars: Runahead-L2, Runahead-D$, iCFP*-L2, iCFP*-D$]

Page 40: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Initial Evaluation

Runahead advance on D$ misses too: performance drops
• Chance for MLP is low, and work can't be reused
• Overhead of restoring the checkpoint is high
• Especially because the baseline stalls on use, not on miss

[Chart: same configuration as the previous slide]

Page 41: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Initial Evaluation

iCFP advance under D$ misses too
• Can reuse work without restoring the checkpoint, but …
• iCFP* executes rallies to completion in blocking fashion
• No efficient way to handle D$ misses under L2 misses

[Chart: same configuration as the previous slide]

Page 42: iCFP: Tolerating All Level Cache Misses in In-Order Processors

iCFP Performance Features

Non-blocking rallies
• Miss during a rally (dependent or just pending)? Don't stall, slice it out

Fine-grain multi-threaded rallies
• Proceed in parallel with advance execution at the tail
• Rallies process dependence chains, so they can't exploit superscalar width

These need incremental updates of tail register state
• Both values and poison bits
• Note: the store buffer is not a tail snapshot, so it needs no additional support

Page 43: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Incremental Tail Updates

Question: should the current rally instruction update the Tail RF?
• A1? B1? C1?
• No, no, yes

[Diagram: rally; poisoned registers: r2, r3, r5]

Page 44: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Incremental Tail Updates

Advance execution tags registers with sequence numbers
• Distance of the writing instruction from the checkpoint

[Diagram: instructions numbered 1–8 from the checkpoint; tail register seq tags: r2 = 7, r3 = 8, r5 = 3; poisoned registers: r2, r3, r5]

Page 45: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Incremental Tail Updates

Rally updates the Tail RF if the seqnum matches
• A1's is 1, so no

[Diagram: rally; A1 re-executes and writes only RF1]

Page 46: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Incremental Tail Updates

Rally updates the Tail RF if the seqnum matches
• B1's is 2, so no

[Diagram: rally; B1 re-executes and writes only RF1]

Page 47: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Incremental Tail Updates

Rally updates the Tail RF if the seqnum matches
• C1's is 3, so yes

[Diagram: rally; C1 re-executes and updates Tail RF0 as well as RF1]
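The rule these slides walk through can be summarized in a short hedged sketch (illustrative names): each tail register remembers the sequence number of its youngest advance-mode writer, and a rallied instruction updates the tail copy only if it still holds that title.

    def advance_write(tail_rf, tail_seq, tail_poison, dst, value, seqnum, poisoned):
        # advance mode: every writer stamps its destination with its sequence number
        tail_rf[dst] = value
        tail_seq[dst] = seqnum
        tail_poison[dst] = poisoned

    def rally_write(tail_rf, tail_seq, tail_poison, rally_rf, dst, value, seqnum):
        rally_rf[dst] = value                # RF1 scratch-pad feeds the rest of the slice
        if tail_seq[dst] == seqnum:          # still the youngest writer of dst?
            tail_rf[dst] = value             # A1 (seq 1) and B1 (seq 2) fail this test;
            tail_poison[dst] = False         # C1 (seq 3) passes and un-poisons r5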

Page 48: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Incremental Tail Updates

Rally updates the Tail RF if the seqnum matches

[Diagram: C1's update clears r5's poison bit; poisoned registers: r2, r3]

Page 49: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Incremental Tail Updates

Proper slicing can continue at the tail
• C2 is sliced because r3's poison is preserved

[Diagram: advance continues at the tail through sequence number 9; poisoned registers: r2, r3, r5]

Page 50: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Another iCFP Performance Feature

Minimal rallies
• Only traverse the slice of the returned miss, not the entire slice buffer

Implementation: borrow a trick from TCI [AlZawawi+, ISCA'07]
• Replace poison bits with bitvectors
• Re-organize the slice buffer to support sparse access
• See the paper for details
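A hedged sketch of the bitvector idea (one bit per outstanding miss; a later slide uses 8-bit vectors): dependences are tracked as a union of miss bits, so a rally for a returned miss only touches slice-buffer entries whose vectors include that miss's bit.

    NUM_MISS_TAGS = 8                            # matches the talk's 8-bit poison vectors

    def propagate_vector(poison_vec, srcs, dst, own_miss_bit=None):
        # replace a single poison bit with the union of source dependence vectors
        vec = 0
        for s in srcs:
            vec |= poison_vec[s]
        if own_miss_bit is not None:
            vec |= 1 << own_miss_bit             # this instruction itself missed
        poison_vec[dst] = vec
        return vec

    def minimal_rally(slice_buffer, returned_bit):
        # traverse only the forward slice of the miss that just returned
        return [inst for inst, vec in slice_buffer if vec & (1 << returned_bit)]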

Page 51: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Tolerating All Level Cache Misses

iCFP performance features?

[Chart: % speedup over 2-way in-order (0–50) for applu, mgrid, swim (SpecFP) and bzip2, vortex, vpr (SpecINT); bars: Runahead-L2, iCFP*-L2, iCFP-L2, iCFP-D$]

Page 52: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Tolerating All Level Cache Misses

iCFP performance features?
• Help iCFP-L2 (now better than Runahead-L2)

[Chart: same configuration as the previous slide]

Page 53: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Tolerating All Level Cache Misses

iCFP performance features?
• Help iCFP-L2 (now better than Runahead-L2)
• Help iCFP-D$ even more (now better than iCFP-L2)

[Chart: same configuration as the previous slide]

Page 54: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Feature Contribution Analysis

iCFP*-D$: no "performance" features

[Chart: % speedup over 2-way in-order (0–50) for applu, mgrid, swim (SpecFP) and bzip2, vortex, vpr (SpecINT); bars: iCFP*, + non-blocking, + multi-threading, + minimal]

Page 55: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Feature Contribution Analysis

Non-blocking rallies
• Most significant performance feature
• Helps programs with dependent misses (vpr, mcf)
• Helps programs with D$ misses under L2 misses (applu)

[Chart: same configuration as the previous slide]

Page 56: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Feature Contribution Analysis

Multi-threaded rallies: one slot of the 2-way superscalar
• "Free" with support for non-blocking rallies
• Helps uniformly

[Chart: same configuration as the previous slide]

Page 57: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Feature Contribution Analysis

Minimal rallies: 8-bit poison vectors
• Helps uniformly (most misses are independent)

[Chart: same configuration as the previous slide]

Page 58: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Out of Slice Buffer?

iCFP defaults to runahead when out of slice buffer or store buffer
• Not overly sensitive to slice buffer size

[Chart: % speedup over 2-way in-order (0–50) for applu, mgrid, swim (SpecFP) and bzip2, vortex, vpr (SpecINT); slice buffer sizes: 0 (Runahead), 32, 64, 128]


Page 60: iCFP: Tolerating All Level Cache Misses in In-Order Processors

What About Store Buffer?

• A little more sensitive to store buffer size

[Chart: % speedup over 2-way in-order (0–50) for applu, mgrid, swim (SpecFP) and bzip2, vortex, vpr (SpecINT); store buffer sizes: 32, 64, 128, 128-assoc]

Page 61: iCFP: Tolerating All Level Cache Misses in In-Order Processors

What About Store Buffer?

• A little more sensitive to store buffer size
• Chaining is essentially performance-equivalent to associative search

[Chart: same configuration as the previous slide]

Page 62: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Performance vs. Hardware Cost

• Runahead: +11% for checkpoints, poison bits, forwarding cache
• iCFP: +17% for checkpoints, poison bits, store buffer, slice buffer
• Basically: Runahead + 6% for a 128-entry slice buffer

[Chart: % speedup over 2-way in-order (0–100) for applu, mgrid, swim (SpecFP) and bzip2, vortex, vpr (SpecINT); bars: Runahead, iCFP, OoO, CFP]

Page 63: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Performance vs. Hardware Cost

• OoO: +63% for a 128-entry window, 32-entry issue queue, etc.
• CFP: +75% for OoO plus a 128-entry slice buffer

[Chart: same configuration as the previous slide]

Page 64: iCFP: Tolerating All Level Cache Misses in In-Order Processors


Related Work

Multipass pipelining [Barnes+, MICRO’05]

• Rallies re-execute everything, but with higher ILP

Simple Latency Tolerant Processor [Nekkalapu+, ICCD’08]

• Similar, but … single, blocking rallies, speculative cache writes

Rock [Tremblay+, ISSCC’08]

• “Upon encountering a long latency instruction, the pipeline takes a checkpoint … creates future state and only reruns dependent instructions accumulated since the original checkpoint …. While one thread is completing the future created by the ahead thread, it continues execution to create the next future version of the architected state … This leapfrogging continues …”

• Sounds similar, what does it really do?

Page 65: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Conclusion

iCFP: in-order Continual Flow Pipeline
• In-order + the ability to flow around cache misses at all levels
• Minimal hardware: runahead + slice buffer

Key features, not present elsewhere (as far as we know)
• Non-blocking, multi-threaded, minimal rallies

Supporting technologies
• Chained store buffer
• Incremental tail register state updates

Incremental is a good thing!


Page 67: iCFP: Tolerating All Level Cache Misses in In-Order Processors

Comparative Performance

[Chart: % speedup over 2-way in-order (0–50) for applu, swim, SpecFP, bzip2, vpr, SpecInt; bars: Runahead, Multipass, SLTP, iCFP]