iCFP: Tolerating All Level Cache Misses in In-Order Processors


HPCA-15 :: Feb 18, 2009


Andrew Hilton, Santosh Nagarakatte, Amir Roth
University of Pennsylvania

{adhilton,santoshn,amir}@cis.upenn.edu

A Brief History …

[Chart: the performance/power trajectory of successive cores — Pentium (in-order), PentiumII (out-of-order), Core2Duo (out-of-order, 2 cores), Nehalem (out-of-order, 4 cores, 8 threads) — contrasted with Niagara2 (in-order, 16 cores, 64 threads); out-of-order performance comes at growing power.]

In-order vs. Out-of-Order

Out-of-order cores
• Single-thread IPC (+63%)

In-order cores
• Power efficiency
• More cores

Is there a compromise?

Key idea
• The main benefit of out-of-order is data cache miss tolerance
• Can we add it to in-order in a simple way?

Runahead

Runahead execution [Dundas+, ICS’97]
• In-order + memory-level parallelism (MLP)
• Checkpoint and “advance” under a miss
• Restore the checkpoint when the miss returns

Additional hardware?
• Regfile checkpoint-restore
• Per-register “poison” bits
• A forwarding cache

Can we do better?

Yes We Can! (Sorry)

iCFP: in-order Continual Flow Pipeline
• Runahead, but …
• Save miss-independent work
• Re-execute only the miss forward slice

In-order adaptation of CFP [Srinivasan+, ASPLOS’04]
• Unblock pipeline latches, not the issue queue and regfile
• Apply to misses at all cache levels, not just the L2

Additional hardware?
• A slice buffer
• Replace the forwarding cache with a store buffer
• Hijack the additional regfile (RF1) used for multi-threading

iCFP Roadmap

Motivation and overview

(Not fully) working example

Correctness features
• Register communication for miss-dependent instructions
• Store-load forwarding
• Multiprocessor safety

Performance features

Evaluation

Example

    A1: load [r1] -> r2
    B1: load [r2] -> r3
    C1: add r3, r4 -> r5
    D1: store r5 -> [r6]
    E1: add r1, #4 -> r1
    F1: branch r1, #40, A
    A2: load [r1] -> r2
    B2: load [r2] -> r3
    C2: add r3, r4 -> r5
    D2: store r5 -> [r6]

[Diagram, repeated with updated state on the following slides: instructions flow from the I$ through the pipeline to the D$, alongside RF0 (Tail), RF1, per-register poison bits, a slice buffer (indexed by PC/instance), and a store buffer. Bold paths are active; Tail marks the last completed instruction in RF0.]

Load A1 misses; transition to “advance” mode
• Checkpoint the regfile
• Poison A1’s output register r2
• Divert A1 to the slice buffer as a pending miss

While the miss is outstanding, “advance” past it:
• Propagate poison through data dependences (B1 reads poisoned r2, so its output r3 is poisoned too)
• Divert miss-dependent instructions (B1, C1, D1) to the slice buffer
• Buffer stores in the store buffer
• Miss-independent instructions (E1, F1, A2, …) execute as usual and update the regfile
• Miss-independent writes can “un-poison” tail registers (A2 re-defines r2, leaving only r3 and r5 poisoned)
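The advance-mode rules above fit in a few lines. Below is a minimal behavioral sketch in Python; the Insn fields and the advance() helper are names chosen for illustration, not the paper’s hardware interface.

    # Behavioral sketch of iCFP advance mode (illustrative names).
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Insn:
        name: str
        srcs: List[str]
        dst: Optional[str] = None
        is_load: bool = False
        is_store: bool = False

    poison = set()        # registers that depend on a pending miss
    slice_buffer = []     # deferred miss-forward-slice instructions
    store_buffer = []     # all stores are buffered during advance

    def advance(insn, missed=False):
        """Process one instruction while advancing under a cache miss."""
        dependent = (insn.is_load and missed) or any(s in poison for s in insn.srcs)
        if insn.is_store:
            # every store gets a buffer slot; its value is unknown (poisoned)
            # if the store is miss-dependent
            store_buffer.append((insn.name, dependent))
        if dependent:
            if insn.dst:
                poison.add(insn.dst)      # poison output, propagating poison
            slice_buffer.append(insn)     # divert to the slice buffer
        else:
            if insn.dst:
                poison.discard(insn.dst)  # a re-definition "un-poisons" dst
            # miss-independent: execute as usual, update the regfile (elided)

    # The walkthrough's first steps:
    advance(Insn("A1", ["r1"], "r2", is_load=True), missed=True)  # poison r2
    advance(Insn("B1", ["r2"], "r3", is_load=True))  # r2 poisoned: divert, poison r3
    advance(Insn("E1", ["r1"], "r1"))                # independent: executes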

When the A1 miss returns, transition to “rally” mode
• Stall fetch
• Pipe the contents of the slice buffer into the pipeline
• Drain advance instructions already in the pipeline (C2, D2)
• Execute the deferred instructions (A1, B1, C1, D1) from the slice buffer
• When the slice buffer is empty, un-block fetch
• Wait for the deferred instructions to complete

Back To Normal

When the last deferred instruction completes
• Release the register checkpoint
• Resume normal execution at the tail
• Drain stores from the store buffer to the D$

One Way Or The Other

If a rally hits a mis-predicted branch, exception, etc.
• Flush the pipeline
• Discard the store buffer contents
• Restore the regfile from the checkpoint
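Taken together, the walkthrough amounts to a small mode state machine. A minimal sketch, with hypothetical stubs standing in for the pipeline actions on the slides:

    # Sketch of iCFP mode transitions; the helper functions are stubs.
    NORMAL, ADVANCE, RALLY = "normal", "advance", "rally"
    mode = NORMAL

    def checkpoint_regfile(): pass
    def release_checkpoint(): pass
    def restore_checkpoint(): pass
    def flush_pipeline(): pass
    def drain_store_buffer_to_dcache(): pass
    def discard_store_buffer(): pass

    def on_cache_miss():                  # a load misses (at any level)
        global mode
        if mode == NORMAL:
            checkpoint_regfile()
            mode = ADVANCE                # keep fetching past the miss

    def on_miss_return():
        global mode
        mode = RALLY                      # stall fetch, replay the slice buffer

    def on_rally_complete():              # slice buffer drained, slices done
        global mode
        release_checkpoint()
        drain_store_buffer_to_dcache()
        mode = NORMAL                     # resume normal execution at the tail

    def on_rally_misspeculation():        # mis-predicted branch, exception, ...
        global mode
        flush_pipeline()
        discard_store_buffer()
        restore_checkpoint()
        mode = NORMAL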

iCFP Roadmap

Motivation and overview

(Not fully) working example

Correctness features
• Register communication for miss-dependent instructions
• Store-load forwarding
• Multiprocessor safety

Performance features

Evaluation

Rally Register Communication

Where do A1–C1 write r2, r3, r5 during the rally?
• Not in the Tail RF0
• Those registers were already written by the logically younger A2–C2

Use RF1 as a rally scratch-pad
• Update the Tail RF0 only if the rally instruction is the youngest writer (not the case in this example)

Store-Load Forwarding

iCFP is in-order but …
• Rally loads execute out-of-order with respect to advance stores (possible WAR hazards)

The store-load forwarding mechanism should
• Avoid WAR hazards
• Avoid redoing stores

A forwarding cache? The D$ with speculative writes?
• Not what we want

What we really want is a large (64-entry+) store queue
• Like in an out-of-order processor
– But associative search doesn’t scale nicely

Chained Store Buffer

Replace associative search with iterative indexed search
• Exploit the fact that stores enter the store buffer in order
• The address must be known: otherwise, stall
• Overlay the store buffer with an address-based hash table

    (Tail, younger)                           (Head, older)
    SSN      86    85    84    83    82    81    80
    address  7B0   2AC   388   1B4   384   1AC   380
    value    90    78    ??    56    ??    34    12
    poison   0     0     1     0     1     0     0
    link     44    81    0     15    0     77    0

    Root table (64 entries): … AC→85, B0→86, B4→83, B8→21 …

Loads follow the chain starting at the appropriate root table entry
• For example, a load to address 1AC: root entry AC → store 85 (address 2AC, no match) → link → store 81 (address 1AC): match, forward

Rally loads ignore younger stores, avoiding WAR hazards
• For example, a rally load to address 1B4 whose immediately older store is 81 (noted during advance)
• Root entry B4 → store 83 (address 1B4, but younger than 81: ignore) → follow the link toward store 15 → no older match, so go to the D$

Chained Store Buffer

+ Non-speculative (in particular, no WAR hazards)
+ Scalable: the average number of excess hops is < 0.05 with a 64-entry root table
– Must stall on (miss-dependent) stores with unknown addresses
  • These are rare
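A rough functional model of the chained organization (the class and method names are illustrative, and the hash is simplified to the low-order address bits):

    # Functional model of a chained store buffer (illustrative, simplified).
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class StoreEntry:
        ssn: int               # store sequence number
        addr: Optional[int]    # None would model an unknown address (stall)
        value: Optional[int]   # None while the store is miss-dependent
        link: Optional[int]    # SSN of the next-older store in this chain

    class ChainedStoreBuffer:
        def __init__(self, root_entries=64):
            self.root = [None] * root_entries  # bucket -> youngest SSN
            self.entries = {}                  # SSN -> StoreEntry

        def bucket(self, addr):
            return addr % len(self.root)       # simplified address hash

        def insert(self, ssn, addr, value):
            b = self.bucket(addr)
            self.entries[ssn] = StoreEntry(ssn, addr, value, self.root[b])
            self.root[b] = ssn                 # stores arrive in program order

        def forward(self, addr, older_than):
            """older_than: SSN of the youngest store older than the load.
            Returns the forwarded value, or None meaning 'go to the D$'."""
            ssn = self.root[self.bucket(addr)]
            while ssn is not None:
                e = self.entries[ssn]
                if e.ssn <= older_than and e.addr == addr:
                    return e.value             # youngest older match: forward
                ssn = e.link                   # younger store or different
                                               # address: keep walking
            return None

    # The slide's two lookups:
    sb = ChainedStoreBuffer()
    for ssn, addr, val in [(80, 0x380, 0x12), (81, 0x1AC, 0x34),
                           (83, 0x1B4, 0x56), (85, 0x2AC, 0x78),
                           (86, 0x7B0, 0x90)]:
        sb.insert(ssn, addr, val)
    assert sb.forward(0x1AC, older_than=86) == 0x34   # match, forward
    assert sb.forward(0x1B4, older_than=81) is None   # 83 is younger: D$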

Multi-Processor Safety

iCFP is in-order but … (yeah, again)
• Advance loads are vulnerable to stores from other threads
• Just like in an out-of-order processor

These loads must be snooped/verified
• An associative load queue is too expensive for an in-order processor
• The paper describes a scheme based on local signatures
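The slide does not spell out the signature scheme; purely as an illustration of the idea, a Bloom-filter-style signature could work as sketched below (the structure and all parameters here are assumptions, not the paper’s design):

    # Illustration only: Bloom-filter-style verification of advance loads.
    SIG_BITS = 256
    signature = [False] * SIG_BITS

    def _hashes(addr):
        return [hash((addr, i)) % SIG_BITS for i in range(2)]

    def record_advance_load(addr):
        for h in _hashes(addr):           # advance loads join the signature
            signature[h] = True

    def snoop_remote_store(addr):
        """Called on a coherence invalidation from another core or thread."""
        if all(signature[h] for h in _hashes(addr)):
            recover()                     # possible race with an advance load:
                                          # conservatively squash and re-execute

    def recover():
        global signature
        signature = [False] * SIG_BITS    # flush + restore checkpoint (elided)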

Methodology

Cycle-level simulation
• 2-way issue, 9-stage in-order pipeline
• 32KByte D$
• 20-cycle, 1MByte, 8-way L2 (8 8-entry stream buffers)
• 400-cycle main memory, 4Bytes/cycle, 32 outstanding misses
• 128-entry chained store buffer, 128-entry slice buffer

Spec2000 benchmarks
• Alpha AXP ISA
• DEC OSF compiler, -O4 optimization
• 2% sampling with warm-up

Initial Evaluation

iCFP vs. Runahead, advancing on L2 misses
• Roughly the same performance: +10%
• Dominated by MLP
• iCFP’s ability to reuse work is rarely significant (vortex)

[Chart: % speedup over 2-way in-order (0–50%) on applu, mgrid, swim (SpecFP) and bzip2, vortex, vpr (SpecINT); bars for Runahead-L2, Runahead-D$, iCFP*-L2, iCFP*-D$. The same chart accompanies the next two slides.]

Initial Evaluation

Runahead advancing on D$ misses too: performance drops
• The chance for MLP is low, and runahead can’t reuse work
• The overhead of restoring the checkpoint is high
• Especially because the baseline stalls on use, not on miss


Initial Evaluation

iCFP advancing under D$ misses too
• Can reuse work without restoring the checkpoint, but …
• iCFP* executes rallies to completion in blocking fashion
• There is no efficient way to handle D$ misses under L2 misses


iCFP Performance Features

Non-blocking rallies
• A miss during a rally (dependent or just pending)? Don’t stall, slice it out

Fine-grain multi-threaded rallies
• Proceed in parallel with advance execution at the tail
• Rallies process dependence chains, so they can’t exploit superscalar width

Both need incremental updates of tail register state
• Both values and poison bits
• Note: the store buffer is not a tail snapshot, so it needs no additional support

Incremental Tail Updates

Question: should the current rally instruction update the Tail RF?
• A1? B1? C1?
• No, no, yes: r2 and r3 have since been re-written by the logically younger A2 and B2, but C1 is still the youngest writer of r5

Advance execution tags registers with sequence numbers
• A register’s tag is the distance of its writing instruction from the checkpoint (here r2, r3, r5 are tagged 7, 8, 3)

A rally instruction updates the Tail RF only if its seqnum matches
• A1’s seqnum is 1 (r2 is tagged 7), so no
• B1’s seqnum is 2 (r3 is tagged 8), so no
• C1’s seqnum is 3, matching r5’s tag, so yes

Proper slicing can continue at the tail
• C2 is sliced out because r3’s poison is preserved
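In code, the check is tiny. A sketch with dictionary-based state and illustrative names:

    # Sketch of incremental tail updates: each tail register carries the
    # seqnum (distance from the checkpoint) of its youngest advance writer.
    tail_value = {}     # reg -> value in the tail regfile (RF0)
    tail_seq = {}       # reg -> seqnum of the youngest writer
    tail_poison = set()

    def advance_write(reg, seq, value=None, poisoned=False):
        tail_seq[reg] = seq               # always record the youngest writer
        if poisoned:
            tail_poison.add(reg)
        else:
            tail_value[reg] = value
            tail_poison.discard(reg)

    def rally_write(reg, seq, value, rally_rf):
        rally_rf[reg] = value             # the RF1 scratch-pad always updates
        if tail_seq.get(reg) == seq:      # youngest writer? then repair the
            tail_value[reg] = value       # tail copy ...
            tail_poison.discard(reg)      # ... and un-poison it

    # In the walkthrough, r2/r3/r5 are tagged 7/8/3: C1 (seq 3) updates r5
    # at the tail, while A1 (seq 1) and B1 (seq 2) update only RF1.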

Another iCFP Performance Feature

Minimal rallies
• Traverse only the slice of the returned miss, not the entire slice buffer

Implementation: borrow a trick from TCI [AlZawawi+, ISCA’07]
• Replace poison bits with bitvectors
• Re-organize the slice buffer to support sparse access
• See the paper for details
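A sketch of how per-miss poison bitvectors could support minimal rallies (the exact encoding is an assumption, modeled loosely on TCI’s bitvectors):

    # Per-miss poison bitvectors (assumed encoding): one poison bit per
    # outstanding miss lets a returning miss rally only its own slice.
    N_MISSES = 8                    # matches the evaluated 8-bit vectors
    poison = {}                     # reg -> bitmask of misses it depends on

    def on_miss(dst_reg, miss_id):
        poison[dst_reg] = poison.get(dst_reg, 0) | (1 << miss_id)

    def propagate(dst_reg, src_regs):
        mask = 0
        for s in src_regs:          # union of the source dependences
            mask |= poison.get(s, 0)
        poison[dst_reg] = mask
        return mask                 # nonzero: divert to the slice buffer

    def minimal_rally_slice(slice_buffer, miss_id):
        # slice_buffer holds (insn, mask) pairs; traverse only the entries
        # in the returned miss's slice, not the whole buffer
        return [i for i, m in slice_buffer if m & (1 << miss_id)]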

Tolerating All Level Cache Misses

Do the iCFP performance features pay off?
• They help iCFP-L2 (now better than Runahead-L2)
• They help iCFP-D$ even more (now better than iCFP-L2)

[Chart: % speedup over 2-way in-order (0–50%) on applu, mgrid, swim (SpecFP) and bzip2, vortex, vpr (SpecINT); bars for Runahead-L2, iCFP*-L2, iCFP-L2, iCFP-D$.]

Feature Contribution Analysis

iCFP*-D$: no “performance” features

Non-blocking rallies
• The most significant performance feature
• Helps programs with dependent misses (vpr, mcf)
• Helps programs with D$ misses under L2 misses (applu)

Multi-threaded rallies: use one slot of the 2-way superscalar
• “Free” with the support for non-blocking rallies
• Help uniformly

Minimal rallies: 8-bit poison vectors
• Help uniformly (most misses are independent)

[Chart: % speedup over 2-way in-order (0–50%) on the same six benchmarks; bars for iCFP*, + non-blocking, + multi-threading, + minimal.]

Out of Slice Buffer?

iCFP defaults to runahead when out of slice buffer or store buffer entries
• Not overly sensitive to slice buffer size

[Chart: % speedup over 2-way in-order (0–50%) with slice buffer sizes of 0 (Runahead), 32, 64, and 128 entries.]

What About Store Buffer?

• A little more sensitive to store buffer size
• Chaining is essentially performance-equivalent to associative search

[Chart: % speedup over 2-way in-order (0–50%) with store buffer sizes of 32, 64, and 128 entries, plus a 128-entry associative design (128-assoc).]

Performance vs. Hardware Cost

• Runahead: +11%, for checkpoints, poison bits, and a forwarding cache
• iCFP: +17%, for checkpoints, poison bits, a store buffer, and a slice buffer
• Basically Runahead + 6%, for a 128-entry slice buffer
• OoO: +63%, for a 128-entry window, 32-entry issue queue, etc.
• CFP: +75%, for OoO plus a 128-entry slice buffer

[Chart: % speedup over 2-way in-order (0–100%) on the same six benchmarks; bars for Runahead, iCFP, OoO, and CFP.]

Related Work

Multipass pipelining [Barnes+, MICRO’05]

• Rallies re-execute everything, but with higher ILP

Simple Latency Tolerant Processor [Nekkalapu+, ICCD’08]

• Similar, but … single, blocking rallies, speculative cache writes

Rock [Tremblay+, ISSCC’08]

• “Upon encountering a long latency instruction, the pipeline takes a checkpoint … creates future state and only reruns dependent instructions accumulated since the original checkpoint …. While one thread is completing the future created by the ahead thread, it continues execution to create the next future version of the architected state … This leapfrogging continues …”

• Sounds similar, what does it really do?


Conclusion

iCFP: in-order Continual Flow Pipeline
• In-order + the ability to flow around cache misses at all levels
• Minimal hardware: runahead + a slice buffer

Key features, not present elsewhere (as far as we know)
• Non-blocking, multi-threaded, minimal rallies

Supporting technologies
• Chained store buffer
• Incremental tail register state updates

Incremental is a good thing!

Comparative Performance

[Chart: % speedup over 2-way in-order (0–50%) for applu, swim, SpecFP, bzip2, vpr, SpecInt; bars for Runahead, Multipass, SLTP, and iCFP.]