Topic 9: Cache Coherency in Multiprocessors (MPs) / Multi-cores
Cache Coherency Problem
Multiple processors/multi-cores with local caches and shared main memory.
Necessary condition for the problem to arise: the information in the cache memories and in main memory (MM) is not the same.
Cache Coherence
The goal is to make sure that READ(X) returns
the most recent value of the shared variable X,
i.e. all valid copies of a shared variable are
identical.
Origin:
Sharing of Writable Data
Process Migration (Processes move from one
processor to another during their execution)
I/O activity (Reading and writing to MM)
Cache Coherence
Sharing of writable data (2 processors/cores): before any update, all 3 copies of X are the same.
Write-through (cache & MM agree at all times): suppose P1 writes X through to MM while P2 still reads X from its local cache; 2 different copies now exist.
Write-back: all 3 copies may be different.
Process migration:
Write-back: a process updates X on P1, then migrates to P2.
Write-through: a process running on P1 updates X, then moves to P2 while the corresponding (stale) cache block exists in P2.
Cache Coherence
I/O activity: with write-through, a read from an I/O device changes the MM copy, but the cached copies remain the same (now stale).
With write-back, a write to an I/O device may send the old value from MM to the device.
One solution may be to share a local cache between a processor and an IOP (I/O Processor). This simplifies the problem, so that it exists only between the MM and cache copies, but it degrades performance by hampering the locality of reference (LOR) of the processors.
Cache Coherence
Solutions: Software based or Hardware based
S/W based:
The compiler tags data as cacheable or non-cacheable.
Only read-only data is considered cacheable and put in a private cache. All other data are non-cacheable, and can be put in a global cache, if available.
Cache Coherence
H/W based: Snooping/snoopy protocols:
- MM and caches are connected by a BUS (a broadcast medium).
- Whatever communication takes place, all others can listen (each cache controller constantly monitors the bus).
- 2 main categories: cache block invalidation and cache block update.
Cache Coherence
Cache Block Update & Cache Block Invalidation protocols:
- In update, updates are sent over the inter-core/inter-processor BUS (broadcast medium), and all caches holding a matching block perform the update. Subsequent updates to the same variable generate extra traffic.
- In invalidation, an invalidation signal is sent over the bus for the updated cache block, and all caches holding a matching block invalidate their copy. Subsequent updates to the same variable generate no traffic.
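As an illustration of the difference, here is a minimal sketch (Python, with made-up names; not from any real implementation) of how a snooping cache could react to a write observed on the bus under each scheme:

# Sketch: reaction of a snooping cache to a write seen on the bus,
# under an update protocol vs. an invalidation protocol.
class SnoopyCache:
    def __init__(self):
        self.blocks = {}  # address -> value, for blocks held locally

    def snoop_write_update(self, addr, value):
        # Update: matching caches overwrite their copy.
        # Every later write to addr generates more bus traffic.
        if addr in self.blocks:
            self.blocks[addr] = value

    def snoop_write_invalidate(self, addr):
        # Invalidation: matching caches drop their copy.
        # Later writes to addr by the same writer cause no traffic.
        self.blocks.pop(addr, None)

caches = [SnoopyCache() for _ in range(3)]
caches[1].blocks[0x40] = 7            # P1 holds a copy of block 0x40
for c in caches[1:]:                  # P0 writes: broadcast to the others
    c.snoop_write_invalidate(0x40)    # or: c.snoop_write_update(0x40, 8)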
Cache Coherence
Write Invalidate: (a) Write-Through policy
Notation: W(i), W(j) = write to block by Pi, Pj; R(i), R(j) = read of block by Pi, Pj; Z(i), Z(j) = replace/invalidate block in Pi, Pj.
State diagram for the copy in Pi's cache:
Valid(i)   -> Valid(i)   on R(i), W(i), R(j), Z(j)
Valid(i)   -> Invalid(i) on W(j), Z(i)
Invalid(i) -> Valid(i)   on R(i), W(i)
Invalid(i) -> Invalid(i) on R(j), W(j), Z(i), Z(j)
Cache Coherence
Write Invalidate: (b) Write-Back policy (also called MSI Protocol)
States for the copy in Pi's cache: RW (read-write), RO (read-only), Invalid.
RW      -> RW      on R(i), W(i), Z(j)
RW      -> RO      on R(j)
RW      -> Invalid on W(j), Z(i)
RO      -> RO      on R(i), R(j), Z(j)
RO      -> RW      on W(i)
RO      -> Invalid on W(j), Z(i)
Invalid -> RW      on W(i)
Invalid -> RO      on R(i)
Invalid -> Invalid on R(j), W(j), Z(i), Z(j)
Cache Coherence
MSI Protocol: the basic cache coherence protocol used in MPs.
Each cache block can be in one of three possible states:
MODIFIED: the block has been modified, so it is inconsistent with the MM copy. A cache block in M state must be written back to MM when it is evicted.
INVALID: not valid; the block must be fetched from MM or from another cache.
SHARED: multiple caches may hold valid copies (identical to the copy in MM).
These states are maintained through communication between the caches and MM.
Cache Coherence
MSI Protocol: for any given pair of local caches, the permitted state combinations are:
      M    S    I
M     no   no   yes
S     no   yes  yes
I     yes  yes  yes
This protocol was mainly used in the SGI 4D.
Variants: most modern shared-memory MPs use a variant to reduce the amount of traffic, e.g. the Exclusive state in MESI or the Owned state in MOSI.
Cache Coherence
MESI Protocol (Papamarcos & Patel 1984):
A version of the snooping cache protocol.
Each cache block can be in one of four states:
INVALID: not valid.
SHARED: multiple caches may hold valid copies.
EXCLUSIVE: no other cache has this block, and the MM copy is valid. Therefore, writing a block in E state need not send a Cache Invalidate signal to the others (simply change the state to M).
MODIFIED: present only in the current cache as a valid block, but the MM copy is not valid. Writing it back to MM simply changes its state to E.
Cache Coherence
MESI Protocol (Papamarcos & Patel 1984):
Event        Local                Remote
Read Hit     Use local copy       No action
Read Miss    I to S, or I to E    (S,E,M) to S
Write Hit    (S,E,M) to M         (S,I) to I
Write Miss   I to M               (S,E,M) to I
When a cache block changes its status from M, it first updates the main memory.
Cache Coherence
MESI Protocol (Papamarcos & Patel 1984):
Following a read miss, the holder of the modified copy signals the initiator to try again. Meanwhile, it seizes the bus and writes the updated copy back to main memory.
Cache Coherence
MESI Protocol: for any given pair of local caches, the permitted state combinations are:
      M    E    S    I
M     no   no   no   yes
E     no   no   no   yes
S     no   no   yes  yes
I     yes  yes  yes  yes
This protocol was mainly used in Intel's Pentium processors.
Cache Coherence
MOSI Protocol:
A version of the snooping cache protocol.
Each cache block can be in one of four states:
INVALID: not valid.
SHARED: multiple caches may hold valid copies.
OWNED: the current cache owns this block, and therefore serves requests for it (by snooping the bus) from other processors whose copy is in I state. The other caches receive the value from the owning cache much faster than they could retrieve it from MM.
MODIFIED: present only in the current cache as a valid block, but the MM copy is not valid.
Cache Coherence
MOSI Protocol: for any given pair of local caches, the permitted state combinations are:
      M    O    S    I
M     no   no   no   yes
O     no   no   yes  yes
S     no   yes  yes  yes
I     yes  yes  yes  yes
Cache Coherence
MOESI Protocol:
A full cache coherency protocol encompassing all possible states commonly used in other protocols.
Each cache block can be in one of five states:
INVALID: not valid (a valid copy may be in MM or in another cache).
SHARED: multiple caches may hold valid copies. These may be dirty with respect to MM (if some cache holds the block in Owned state) or clean.
Cache Coherence
MOESI Protocol:
EXCLUSIVE: contains the most recent correct copy of the data; no other cache has this block, and the MM copy is valid. Therefore, writing a block in E state need not send a Cache Invalidate signal to the others (simply change the state to M or O). May respond to a snoop request with data.
MODIFIED: present only in the current cache as a valid block, but the MM copy is not valid. Writing it back to MM simply changes its state to E. Must respond to a snoop request with data.
OWNED: holds a valid (most recent) copy of the block, possibly along with other caches as in the Shared state, but the MM copy can be stale. Only one cache can be in O state. Writing the block back to MM simply changes its state to S. Can change to M state after invalidating all shared copies. Must respond to a snoop request with data.
Cache Coherence
MOESI Protocol: for any given pair of local caches, the permitted state combinations are:
      M    O    E    S    I
M     no   no   no   no   yes
O     no   no   no   yes  yes
E     no   no   no   no   yes
S     no   yes  no   yes  yes
I     yes  yes  yes  yes  yes
Cache Coherence
MESIF Protocol:
A full cache coherency protocol developed by Intel for ccNUMA (Cache-coherent Non-Uniform Memory Architecture) processors.
Each cache block can be in one of five states: M, E, S, I and F, where M, E, S and I are the same as in the MESI protocol.
F (Forward): a block in F state means the designated cache is the responder for any cache request for that block. Whenever any caches hold a block in S state, exactly one cache holds the block in F state.
Cache Coherence
MESIF Protocol: for any given pair of local caches, the permitted state combinations are:
      M    E    S    I    F
M     no   no   no   yes  no
E     no   no   no   yes  no
S     no   no   yes  yes  yes
I     yes  yes  yes  yes  yes
F     no   no   yes  yes  no
Cache Coherence
H/W based: Directory-based protocols
Snooping cannot be used over point-to-point links (which are reliable and efficient).
MM and processors/caches are connected by a switching network.
The directory records which block is present in which cache (processor), plus one dirty/clean bit.
Cache Coherence
H/W based: Directory-based protocols may be
Full-Map directory (maintains pointers for all processors)
Limited directory (only a limited number of pointers, with replacement; the average number of pointers required can be estimated statistically -- see the sketch after this list)
Chained directory (the last entry holds a Chain Terminator, CT; the chain starts from shared memory)
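A toy illustration of pointer replacement in a limited directory (Python; hypothetical names, assuming 2 pointers per entry):

# Sketch: a limited directory keeps only K sharer pointers and must
# evict (invalidate) one sharer when a new one arrives.
K = 2  # pointers per directory entry

def add_sharer(ptrs, p):
    """Return the new pointer list and the sharer to invalidate, if any."""
    if p in ptrs:
        return ptrs, None
    if len(ptrs) < K:
        return ptrs + [p], None
    victim = ptrs[0]                 # replacement: drop one existing sharer
    return ptrs[1:] + [p], victim

ptrs = []
for p in (0, 3, 5):
    ptrs, victim = add_sharer(ptrs, p)
print(ptrs, victim)                  # [3, 5] 0 -> cache 0 gets invalidated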
Cache Coherence Problem
[Figure: P0 and P1 each load A (initially 0) into their local caches; one of them then stores to A, leaving the other's cached copy stale.]
Possible Causes of Incoherence
Sharing of writeable data (the cause most commonly considered)
Process migration (can occur even if independent jobs are executing)
I/O (often fixed via O/S cache flushes)
Cache Coherence
Informally, with coherent caches, accesses to a memory location appear to occur simultaneously in all copies of the memory location ("copies" here means caches).
Cache coherence suggests an absolute time scale -- this is not necessary.
What is required is the appearance of coherence, not absolute coherence.
E.g. temporary incoherence between memory and a write-back cache may be OK.
Update vs. Invalidation Protocols
Coherent shared memory: all processors see the effects of others' writes.
How and when writes are propagated is determined by the coherence protocol.
Global Coherence States
A memory line can be present (valid) in any of the caches and/or memory.
Represent the global state with an N+1 element vector:
First N components: cache states (valid/invalid)
N+1st component: memory state (valid/invalid)
[Figure: example with three caches and memory, showing lines A, B and C valid in different subsets of the caches and memory.]
Local Coherence States
Individual caches can maintain a summary of the state of memory lines, from a local perspective.
Reduces storage for maintaining state.
May have only partial information.
Invalid (I): local cache does not have a valid copy (cache miss).
Don't confuse the invalid state with an empty frame.
Shared (S): local cache has a valid copy, main memory has a valid copy, other caches may or may not.
Modified (M): local cache has the only valid copy.
Exclusive (E): local cache has a valid copy, no other caches do, main memory has a valid copy.
Owned (O): local cache has a valid copy; other caches and memory may have valid copies. Only one cache can be in O state. Responsibility for updating memory is included in O, but not included in any of the others.
Example
Memory (global) state: line A: V, line B: I, line C: V
Local states:
         Cache 0   Cache 1   Cache 2
line A:  S         S         I
line B:  M         I         I
line C:  I         I         I
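A small sketch (Python, illustrative names only) of the N+1 element global state vector for the example above:

# Global coherence state as an N+1 element vector of valid bits:
# one per cache, plus one for memory.
N = 3  # number of caches

def global_state(cache_valid, memory_valid):
    """cache_valid: list of N booleans; memory_valid: one boolean."""
    assert len(cache_valid) == N
    return cache_valid + [memory_valid]

# The example above: line A is valid in caches 0 and 1 and in memory;
# line B is valid (modified) only in cache 0; line C only in memory.
line_A = global_state([True, True, False], True)
line_B = global_state([True, False, False], False)
line_C = global_state([False, False, False], True)
print(line_A, line_B, line_C)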
Snoopy Cache Coherence
All requests are broadcast on the bus; all processors and memory snoop and respond.
Cache blocks are writeable at one processor or read-only at several (single-writer protocol).
Snoops that hit dirty lines?
Flush the modified data out of the cache: either write back to memory and then satisfy the remote miss from memory, or provide the dirty data directly to the requestor.
A big problem in MP systems: dirty/coherence/sharing misses.
Bus-Based Protocols
A protocol consists of states and actions (state transitions).
Actions can be invoked from the processor or from the bus.
[Figure: the processor issues processor actions to its cache controller; the controller maintains state tags and cache data and issues/observes bus actions on the shared bus.]
Minimal Coherence Protocol FSM
[Figure: minimal coherence protocol FSM. Source: Patterson/Hennessy, Comp. Org. & Design]
MSI Protocol
Action and next state, by current state and incoming event:

State I:
  Processor Read:   Cache Read; acquire copy -> S
  Processor Write:  Cache Read&M; acquire copy -> M
  Snooped Cache Read / Cache Read&M / Cache Upgrade: No action -> I

State S:
  Processor Read:   No action -> S
  Processor Write:  Cache Upgrade -> M
  Eviction:         No action -> I
  Cache Read:       No action -> S
  Cache Read&M:     Invalidate frame -> I
  Cache Upgrade:    Invalidate frame -> I

State M:
  Processor Read:   No action -> M
  Processor Write:  No action -> M
  Eviction:         Cache write-back -> I
  Cache Read:       Memory inhibit; supply data -> S
  Cache Read&M:     Invalidate frame; memory inhibit; supply data -> I
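The table can be read as a lookup from (state, event) to (action, next state). A minimal sketch (Python, illustrative encoding; a real controller is event-driven hardware):

# The MSI action/next-state table above as a dictionary.
MSI = {
    # (state, event):      (action, next_state)
    ('I', 'ProcRead'):     ('Cache Read; acquire copy', 'S'),
    ('I', 'ProcWrite'):    ('Cache Read&M; acquire copy', 'M'),
    ('S', 'ProcWrite'):    ('Cache Upgrade', 'M'),
    ('S', 'Eviction'):     ('none', 'I'),
    ('S', 'SnoopReadM'):   ('invalidate frame', 'I'),
    ('S', 'SnoopUpgrade'): ('invalidate frame', 'I'),
    ('M', 'Eviction'):     ('cache write-back', 'I'),
    ('M', 'SnoopRead'):    ('memory inhibit; supply data', 'S'),
    ('M', 'SnoopReadM'):   ('invalidate; memory inhibit; supply data', 'I'),
}

def step(state, event):
    # Events absent from the table leave the line untouched.
    return MSI.get((state, event), ('none', state))

state = 'I'
for ev in ('ProcRead', 'ProcWrite', 'SnoopRead'):
    action, state = step(state, ev)
    print(f'{ev}: {action} -> {state}')   # S, then M, then back to S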
MSI Example
If the line is in no cache, a read-modify-write requires 2 bus transactions.
Optimization: add an Exclusive state.
Thread event      Bus action   Data from   Local states (C0 C1 C2)
0. Initially:                               I  I  I
1. T0 read        CR           Memory       S  I  I
2. T0 write       CU                        M  I  I
3. T2 read        CR           C0           S  I  S
4. T1 write       CRM          Memory       I  M  I
Invalidate Protocol Optimizations
Observation: data can be read shared. Add the S (shared) state to the protocol: MSI.
State transitions:
Local read: I->S, fetch shared
Local write: I->M, fetch modified; S->M, invalidate other copies
Remote read: M->S, supply data
Remote write: M->I, supply data; S->I, invalidate local copy
Observation: data can be write-private (e.g. a stack frame); avoid invalidate messages in that case. Add the E (exclusive) state to the protocol: MESI.
State transitions:
Local read: I->E if only copy, I->S if other copies exist
Local write: E->M silently; S->M, invalidate other copies
MESI Protocol
Variation used in the Pentium Pro/II/III (cache-to-cache transfer).
4-state protocol: Modified, Exclusive, Shared, Invalid.
Bus/processor actions: same as MSI.
Adds a shared signal to indicate whether other caches have a copy.
MESI Protocol
Action and next state, by current state and incoming event:

State I:
  Processor Read:   Cache Read; if no sharers -> E, if sharers -> S
  Processor Write:  Cache Read&M -> M
  Snooped Cache Read / Cache Read&M / Cache Upgrade: No action -> I

State S:
  Processor Read:   No action -> S
  Processor Write:  Cache Upgrade -> M
  Eviction:         No action -> I
  Cache Read:       Respond shared -> S
  Cache Read&M:     No action -> I
  Cache Upgrade:    No action -> I

State E:
  Processor Read:   No action -> E
  Processor Write:  No action -> M
  Eviction:         No action -> I
  Cache Read:       Respond shared -> S
  Cache Read&M:     No action -> I

State M:
  Processor Read:   No action -> M
  Processor Write:  No action -> M
  Eviction:         Cache write-back -> I
  Cache Read:       Respond dirty; write back data -> S
  Cache Read&M:     Respond dirty; write back data -> I
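The key MESI refinement is the read-miss decision driven by the bus shared signal. A tiny sketch (Python, illustrative; not a real controller):

# On a read miss, remote caches assert a shared signal if they hold
# the block; the requestor enters E when the line is unshared, so a
# later local write is a silent E -> M with no bus transaction.
def read_miss_next_state(remote_states):
    """remote_states: this block's state in each other cache."""
    shared = any(s in ('S', 'E', 'M') for s in remote_states)
    return 'S' if shared else 'E'

print(read_miss_next_state(['I', 'I']))   # 'E': a later write costs nothing
print(read_miss_next_state(['S', 'I']))   # 'S': a write needs Cache Upgrade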
MESI Example
Optimization: cache-to-cache transfer for shared data (but only one sharer can respond).
If the line is modified in some cache and another reads it, then write back to memory (with snarf).
Optimization: add an Owned state (and perform the cache-to-cache transfer).
Thread event      Bus action   Data from   Local states (C0 C1 C2)
0. Initially:                               I  I  I
1. T0 read        CR           Memory       E  I  I
2. T0 write       none                      M  I  I
MOESI Optimization
Observation: shared ownership complicates/delays sourcing data. Make one owner responsible for sourcing data to any requestor.
Add the O (owner) state to the protocol: MOSI/MOESI.
The last requestor becomes the owner.
Ownership can be on a per-node basis in a hierarchically structured system.
Avoids write-back (to memory) of dirty data.
O is also called the shared-dirty state, since memory is stale.
MOESI Protocol
Used in the AMD Opteron.
5-state protocol: Modified, Exclusive, Shared, Invalid, Owned (only one owner).
The owner can supply data, so memory does not have to; this avoids a lengthy memory access.
MOESI Protocol
Action and next state, by current state and incoming event:

State I:
  Processor Read:   Cache Read; if no sharers -> E, if sharers -> S
  Processor Write:  Cache Read&M -> M
  Snooped Cache Read / Cache Read&M / Cache Upgrade: No action -> I

State S:
  Processor Read:   No action -> S
  Processor Write:  Cache Upgrade -> M
  Eviction:         No action -> I
  Cache Read:       Respond shared -> S
  Cache Read&M:     No action -> I
  Cache Upgrade:    No action -> I

State E:
  Processor Read:   No action -> E
  Processor Write:  No action -> M
  Eviction:         No action -> I
  Cache Read:       Respond shared; supply data -> S
  Cache Read&M:     Respond shared; supply data -> I

State O:
  Processor Read:   No action -> O
  Processor Write:  Cache Upgrade -> M
  Eviction:         Cache write-back -> I
  Cache Read:       Respond shared; supply data -> O
  Cache Read&M:     Respond shared; supply data -> I

State M:
  Processor Read:   No action -> M
  Processor Write:  No action -> M
  Eviction:         Cache write-back -> I
  Cache Read:       Respond shared; supply data -> O
  Cache Read&M:     Respond shared; supply data -> I
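A small sketch (Python, illustrative) of who supplies data on a snooped read under the MOESI table above:

# A cache in M or O must respond with data; one in E may respond;
# otherwise memory supplies the block.
def supplier(cache_states):
    """cache_states: dict cache_id -> state of the requested block."""
    for cid, s in cache_states.items():
        if s in ('M', 'O'):          # must respond with data
            return cid
    for cid, s in cache_states.items():
        if s == 'E':                 # may respond with data
            return cid
    return 'memory'

print(supplier({0: 'O', 1: 'S', 2: 'I'}))   # 0: the owner, not memory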
MOESI Example
Thread event      Bus action   Data from   Local states (C0 C1 C2)
0. Initially:                               I  I  I
1. T0 read        CR           Memory       E  I  I
2. T0 write       none                      M  I  I
3. T2 read        CR           C0           O  I  S
4. T1 write       CRM          C0           I  M  I
Update Protocols
Basic idea: all writes (updates) are made visible to all caches: (address, value) tuples are sent everywhere.
Similar to a write-through protocol for uniprocessor caches.
Obviously not scalable beyond a few processors; no one actually builds machines this way.
Simple optimization: send updates to the memory/directory, and let the directory propagate them to all known copies: less bandwidth.
Further optimizations: combine & delay.
Write-combining of adjacent updates (if the consistency model allows); send the write-combined data; delay sending the write-combined data until requested.
Logical end result: writes are combined into larger units and updates are delayed until needed -- effectively the same as an invalidate protocol.
Update Protocol: Dragon
Dragon (developed at Xerox PARC): 5-state protocol.
Invalid: (some say there is no invalid state, due to the confusion regarding empty-frame versus invalid-line state)
Exclusive:
Shared-Clean (Sc): memory may not be up-to-date
Shared-Modified (Sm): memory not up-to-date; only one copy in Sm
Modified:
Includes a Cache Update action and a Cache Writeback action.
The bus includes a Shared flag; it appears to also require a memory-inhibit signal, to distinguish the shared case where a cache (not memory) supplies the data.
Dragon State Diagram
Action and next state, by current state and incoming event:

State I:
  Processor Read:   Cache Read; if no sharers -> E, if sharers -> Sc
  Processor Write:  Cache Read; if no sharers -> M, if sharers: Cache Update -> Sm
  Snooped events:   No action -> I

State Sc:
  Processor Read:   No action -> Sc
  Processor Write:  Cache Update; if no sharers -> M, if sharers -> Sm
  Eviction:         No action -> I
  Cache Read:       Respond shared -> Sc
  Cache Update:     Respond shared; update copy -> Sc

State E:
  Processor Read:   No action -> E
  Processor Write:  No action -> M
  Eviction:         No action -> I
  Cache Read:       Respond shared; supply data -> Sc

State Sm:
  Processor Read:   No action -> Sm
  Processor Write:  Cache Update; if no sharers -> M, if sharers -> Sm
  Eviction:         Cache write-back -> I
  Cache Read:       Respond shared; supply data -> Sm
  Cache Update:     Respond shared; update copy -> Sc

State M:
  Processor Read:   No action -> M
  Processor Write:  No action -> M
  Eviction:         Cache write-back -> I
  Cache Read:       Respond shared; supply data -> Sm
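A condensed sketch (Python, illustrative) of the write-hit behavior in the Dragon table above:

# A Dragon write hit to a shared line broadcasts a Cache Update; the
# shared signal then decides the writer's next state.
def dragon_write_hit(state, sharers_exist):
    if state in ('E', 'M'):
        return 'M', 'no bus action'
    if state in ('Sc', 'Sm'):
        nxt = 'Sm' if sharers_exist else 'M'
        return nxt, 'Cache Update broadcast'
    raise ValueError('write hit impossible from I')

print(dragon_write_hit('Sc', True))    # ('Sm', 'Cache Update broadcast')
print(dragon_write_hit('Sm', False))   # ('M', 'Cache Update broadcast')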
Example
Thread event      Bus action   Data from   Local states (C0 C1 C2)
0. Initially:                               I   I   I
1. T0 read        CR           Memory       E   I   I
2. T0 write       none                      M   I   I
3. T2 read        CR           C0           Sm  I   Sc
4. T1 write       CR,CU        C0           Sc  Sm  Sc
5. T0 read        none (hit)   C0           Sc  Sm  Sc
Appears to require atomic bus cycles CR,CU on a write to an invalid line.
Update Protocol: Firefly
Developed at DEC by ex-Xerox people. 5-state protocol.
Similar to Dragon, with different state naming based on shared/exclusive and clean/dirty:
Invalid:
Ec (Exclusive-Clean):
Sc (Shared-Clean): memory may not be up-to-date
Em (Exclusive-Modified):
Sm (Shared-Modified): memory not up-to-date; only one copy in Sm
Performs write-through updates (different from Dragon).
Firefly State Diagram
Action and next state, by current state and incoming event:

State I:
  Processor Read:   Cache Read; if no sharers -> Ec, if sharers -> Sc
  Processor Write:  Cache Read; if no sharers -> Em, if sharers: Cache Update -> Sm
  Snooped Cache Read / Cache Read&M: No action -> I

State Sc:
  Processor Read:   No action -> Sc
  Processor Write:  Cache Read&M; if no sharers -> Ec, if sharers -> Sc
  Eviction:         No action -> I
  Cache Read:       Respond shared -> Sc
  Cache Read&M:     Respond shared -> Sc

State Ec:
  Processor Read:   No action -> Ec
  Processor Write:  No action -> Em
  Eviction:         No action -> I
  Cache Read:       Respond shared -> Sc
  Cache Read&M:     Respond shared -> Sc

State Sm:
  Processor Read:   No action -> Sm
  Processor Write:  Cache Read&M; if no sharers -> Ec, if sharers -> Sc
  Eviction:         Cache write-back -> I
  Cache Read:       Respond shared; supply data -> Sm
  Cache Read&M:     Respond shared; supply data -> Sc

State Em:
  Processor Read:   No action -> Em
  Processor Write:  No action -> Em
  Eviction:         Cache write-back -> I
  Cache Read:       Respond shared; supply data -> Sm
  Cache Read&M:     Respond shared; supply data -> Sc
Update vs Invalidate [Weber & Gupta, ASPLOS-3]
Consider sharing patterns:
No sharing
Independent threads; coherence arises only from thread migration.
An update protocol performs many wasteful updates.
Read-only
No significant coherence issues; most protocols work well.
Migratory objects
Manipulated by one processor at a time, often protected by a lock.
Usually a write causes only a single invalidation.
The E state is useful for read-modify-write patterns.
An update protocol could proliferate copies.
Update vs Invalidate, cont'd.
Synchronization objects
Locks: update could reduce spin-traffic invalidations; Test&Test&Set with an invalidate protocol would work well.
Many readers, one writer
An update protocol may work well, but writes are relatively rare.
Many writers/readers
Invalidate probably works better; update will proliferate copies.
What is used today? Invalidate is dominant.
CMPs may change this assessment: more on-chip bandwidth.
Nasty Realities
The state diagram is for the (ideal) protocol, assuming instantaneous actions.
In reality the controller implements a more complex diagram.
A protocol state transition may already have been started by the controller when bus activity changes the local state.
Example: an upgrade is pending (waiting for the bus) when an invalidate for the same line arrives.
Partial Implementation State Table
Action and next state, by current state and incoming event:

State I:
  Processor Read:   Request bus -> IR
  Processor Write:  Request bus -> IW
  Snooped Cache Read / Cache Read&M / Cache Upgrade: No action -> I

State S:
  Processor Read:   No action -> S
  Processor Write:  Request bus -> SW
  Cache Read:       Respond shared -> S
  Cache Read&M:     No action -> I
  Cache Upgrade:    No action -> I

State E:
  Processor Read:   No action -> E
  Processor Write:  No action -> M
  Cache Read:       Respond shared -> S
  Cache Read&M:     No action -> I

State M:
  Processor Read:   No action -> M
  Processor Write:  No action -> M
  Cache Read:       Respond dirty; write back data -> S
  Cache Read&M:     Respond dirty; write back data -> I

State IR (awaiting bus for read):
  Bus Grant:        Issue Cache Read -> IRR

State IW (awaiting bus for write):
  Bus Grant:        Issue Cache Read&M -> IWR

State IRR (read issued, awaiting response):
  Bus Response:     Load line; if no sharers -> E, if sharers -> S

State IWR (read&M issued, awaiting response):
  Bus Response:     Load line -> M

State SW (upgrade awaiting bus):
  Bus Grant:        Cache Upgrade -> M
  Cache Read:       Respond shared -> SW
  Cache Read&M:     No action -> IW
  Cache Upgrade:    No action -> IW
Further Optimizations
Observation: shared blocks should only be fetched from memory once. If a shared block is found on chip, forward the block.
Problem: multiple shared copies are possible; who forwards? Everyone? Power/bandwidth wasted.
A single forwarder, but who? The last one to receive the block: the F state.
I->F for the requestor, F->S for the forwarder.
What if the F block is evicted? Favor F blocks in replacement?
Don't allow silent eviction (force some other node to become F).
Fall back on the memory copy if no F copy can be found.
A very old idea (IBM machines have done this for a long time), but a recent Intel patent issued anyway [Hum/Goodman].
Further Optimizations
Observation: migratory data often flies by. Add a T (transition) state to the protocol:
The tag is still valid, the data isn't.
Data can be snarfed as it flies by.
Only works with certain kinds of interconnect networks.
Replacement policy issues.
Many other optimizations are possible; the literature extends 25 years back, and there are many unpublished (but implemented) techniques as well.
Implementing Cache Coherence
Snooping implementation
Origins in shared-memory-bus systems: all CPUs could observe all other CPUs' requests on the bus, hence "snooping".
Bus Read, Bus Write, Bus Upgrade.
React appropriately to snooped commands:
Invalidate shared copies.
Provide up-to-date copies of dirty lines: flush (write back) to memory, or direct intervention (modified intervention or dirty miss).
Snooping suffers from:
Scalability (shared busses are not practical) and ordering of requests without a shared bus.
Lots of recent and ongoing work on scaling snoop-based systems.
Snooping Cache Coherence
Basic idea: broadcast a snoop to all caches to find the owner. Not scalable?
Address traffic is roughly proportional to the square of the number of processors.
Current implementations scale to 64/128-way (Sun/IBM) with multiple address-interleaved broadcast networks.
Inbound snoop bandwidth: a big problem.
OutboundSnoopRate: s_o = CacheMissRate + BusUpgradeRate
InboundSnoopRate:  s_i = n * s_o
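A worked example of these relations with made-up numbers (Python):

# Each node's outbound snoop rate is its miss rate plus its upgrade
# rate; every node must absorb the snoops of all n nodes.
n = 64                       # processors
cache_miss_rate = 2.0e6      # bus transactions/s from misses, per node
bus_upgrade_rate = 0.5e6     # bus transactions/s from upgrades, per node

s_o = cache_miss_rate + bus_upgrade_rate   # outbound snoop rate per node
s_i = n * s_o                              # inbound snoop rate per node
print(f's_o = {s_o:.1e}/s, s_i = {s_i:.1e}/s')
# Inbound rate grows linearly per node, so total traffic grows as n^2.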
Snoop Bandwidth
Snoop filtering of various kinds is possible.
Filter snoops at the sink: Jetty filter [Moshovos et al., HPCA 2001].
Check a small filter cache that summarizes the contents of the local cache; avoid power-hungry lookups in each tag array.
Filter snoops at the source: multicast snooping [Bilir et al., ISCA 1999].
Predict the likely sharing set and snoop only the predicted sharers; double-check at the directory to make sure.
Filter snoops at the source: region coherence (concurrent work: [Cantin/Smith/Lipasti, ISCA 2005; Moshovos, ISCA 2005]).
Check a larger region of memory on every snoop and remember when there are no sharers; snoop only on the first reference to a region, or when the region is shared. Eliminates 60%+ of all snoops.
Snoop Latency
Snoop latency: the snoop must reach all nodes, return, and have its responses combined. Probably the bigger scalability issue in the future.
Topology matters: ring, mesh, torus, hypercube. No obvious solutions.
Parallelism is the fundamental advantage of snooping: broadcast exposes parallelism and enables speculative latency reduction.
[Figure: snoop request/response timeline across nodes (XSnp, XRsp, CRsp overlapped with XRd, RDat, XDat, UDat at each node).]
Scalable Cache Coherence
Eschew the physical bus but still snoop:
Point-to-point tree structure (indirect) or ring; the root of the tree or the ring provides the ordering point.
Use some scalable network for the data (ordering is less important there).
Or, use a level of indirection through a directory:
The directory at memory remembers which processor is the single writer and which processors are shared readers.
The level of indirection has a price: dirty misses require 3 hops instead of two.
Snoop: Requestor -> Owner
Directory: Requestor -> Directory -> Owner
Implementing Cache Coherence
Directory implementation
Extra bits stored in memory (the directory) record the state of each line; the memory controller maintains coherence based on the current state.
Other CPUs' commands are not snooped; instead, the directory forwards the relevant commands.
Powerful filtering effect: you only observe the commands you need to observe.
Meanwhile, bandwidth at the directory scales by adding memory controllers as you increase the size of the system, leading to very scalable designs (100s to 1000s of CPUs).
Directory shortcomings: indirection through the directory has a latency penalty.
If a shared line is dirty in another CPU's cache, the directory must forward the request, adding latency.
This can severely impact the performance of applications with heavy sharing (e.g. relational databases).
Directory Protocol Implementation
Basic idea: a centralized directory keeps track of data location(s).
Scalable: address traffic is roughly proportional to the number of processors, and the directory and its traffic can be distributed with (interleaved) memory banks. Directory cost (SRAM) or latency (DRAM) can be prohibitive.
Presence bits track sharers:
Full map (N processors, N bits): cost/scalability
Limited map (limits the number of sharers)
Coarse map (identifies board/node/cluster; must use broadcast)
Vectors track sharers: point to the shared copies; fixed number, linked lists (SCI), caches chained together. Latency vs. cost vs. scalability.
Directory Protocol Latency
Access to non-shared data: overlap the directory read with the data read; the best possible latency given the NUMA arrangement.
Access to shared data: dirty miss, modified intervention. Shared intervention?
With a DRAM directory, no gain; with a directory cache, a possible gain (use the F state).
No inherent parallelism; the indirection adds latency: minimum 3 hops, often 4 hops.
[Figure: directory access timeline (LDir, XSnp, RDir, XRd, RDat, XDat, UDat).]
Directory-based Cache Coherence
An alternative for large, scalable MPs.
Can be based on any of the protocols discussed thus far; we will use MSI.
The memory controller becomes an active participant.
Sharing info is held in a memory directory; the directory may be distributed.
Uses point-to-point messages; the network is not totally ordered.
[Figure: processors with private caches and memory modules with per-module directories, connected by an interconnection network.]
Example: Simple Directory Protocol
Local cache controller states: M, S, I as before.
Local directory states:
Shared: one or more processors have a copy; memory is up-to-date.
Modified: one processor has a copy; memory does not have a valid copy.
Uncached: none of the processors has a valid copy.
The directory also keeps track of the sharers; it can keep the global state vector in full, e.g. via a bit vector (see the sketch below).
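A minimal sketch (Python, illustrative names; no transient states or message handling) of a full-map directory entry as described above:

# Full-map directory entry: one presence bit per processor plus a
# directory state (U = Uncached, S = Shared, M = Modified).
class DirectoryEntry:
    def __init__(self, n_procs):
        self.present = [False] * n_procs
        self.state = 'U'

    def read(self, p):
        # If modified elsewhere, the owner's data would first be
        # fetched and written back; then the line becomes shared.
        self.present[p] = True
        self.state = 'S'

    def write(self, p):
        # Invalidate all other sharers; p becomes the single writer.
        self.present = [False] * len(self.present)
        self.present[p] = True
        self.state = 'M'

d = DirectoryEntry(4)
d.read(0); d.read(2); d.write(1)
print(d.state, d.present)    # M [False, True, False, False]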
Example
A local cache suffers a load miss, and the line is in a remote cache in M state (that cache is the owner).
Four messages are sent over the network:
1. Cache read, from the local controller to the home memory controller
2. Memory read, to the remote (owner) cache controller
3. Owner data, back to the memory controller; the owner changes its state to S
4. Memory data, back to the local cache, which fills the line in S state
[Figure: local, owner and remote controllers with their caches, memory banks and directories on the interconnect; the four messages above flow between them.]
Cache Controller State Diagram (with intermediate states for clarity)
Actions and next states; events arrive from the processor side and from the memory side:

State I:
  Processor Read:    Cache Read -> I'
  Processor Write:   Cache Read&M -> I''
  Eviction:          No action -> I

State S:
  Processor Read:    No action -> S
  Processor Write:   Cache Upgrade -> S'
  Eviction:          No action -> I
  Memory Invalidate: Invalidate frame; Cache ACK -> I

State M:
  Processor Read:    No action -> M
  Processor Write:   No action -> M
  Eviction:          Cache write-back -> I
  Memory Read:       Owner data -> S
  Memory Read&M:     Owner data -> I
  Memory Invalidate: Invalidate frame; Cache ACK -> I

State I' (read pending):
  Memory Data:       Fill cache -> S

State I'' (write pending):
  Memory Data:       Fill cache -> M

State S' (upgrade pending):
  Memory Upgrade:    No action -> M
Memory Controller State Diagram
Actions and next states; commands arrive from the local cache controller, responses from remote cache controllers:

Directory state U:
  Cache Read:       Memory data; add requestor to sharers -> S
  Cache Read&M:     Memory data; add requestor to sharers -> M

Directory state S:
  Cache Read:       Memory data; add requestor to sharers -> S
  Cache Read&M:     Memory Invalidate to all sharers -> M'
  Cache Upgrade:    Memory Upgrade to all sharers -> M''
  Data Write-back:  No action -> I

Directory state M:
  Cache Read:       Memory Read to owner -> S'
  Cache Read&M:     Memory Read&M to owner -> M'
  Data Write-back:  Make sharers empty -> U

State S' (awaiting owner data):
  Owner Data:       Memory data to requestor; write memory; add requestor to sharers -> S

State M' (awaiting ACKs / owner data):
  Cache ACK:        When all ACKs received: memory data -> M
  Owner Data:       Memory data to requestor -> M

State M'' (awaiting upgrade ACKs):
  Cache ACK:        When all ACKs received -> M
Another Example
A local write (miss) to a shared line requires invalidations and acks: the local controller sends a cache Read&M to the home memory controller; the directory sends memory invalidates to the remote sharers; each remote controller replies with a cache ack; the memory controller then sends the memory data response to the local cache.
[Figure: local controller, home memory controller and two remote controllers on the interconnect, exchanging the messages above.]
Example Sequence
Similar to the earlier sequences.
Thread event      Controller actions    Data from   Local states (C0 C1 C2)
0. Initially:                                        I  I  I
1. T0 read        CR, MD                Memory       S  I  I
2. T0 write       CU, MU*, MD                        M  I  I
3. T2 read        CR, MR, MD            C0           S  I  S
4. T1 write       CRM, MI, CA, MD       Memory       I  M  I
Variation: Three Hop Protocol
Have the owner send the data directly to the local controller; the owner acks to the memory controller in parallel.
[Figure: (a) four-hop version: cache read (1), memory read (2), owner data (3), memory data (4); (b) three-hop version: cache read (1), memory read (2), owner data to the requestor and owner ack to the memory controller in parallel (3).]
Directory Protocol Optimizations
Remove dead blocks from the cache, to eliminate the 3- or 4-hop latency:
Dynamic Self-Invalidation [Lebeck/Wood, ISCA 1995]
Last-touch prediction [Lai/Falsafi, ISCA 2000]
Dead block prediction [Lai/Fide/Falsafi, ISCA 2001]
Predict sharers:
Prediction in coherence protocols [Mukherjee/Hill, ISCA 1998]
Instruction-based prediction [Kaxiras/Goodman, ISCA 1999]
Sharing prediction [Lai/Falsafi, ISCA 1999]
Hybrid snooping/directory protocols (improve latency by snooping, conserve bandwidth with a directory):
Multicast snooping [Bilir et al., ISCA 1999; Martin et al., ISCA 2003]
Bandwidth-adaptive hybrid [Martin et al., HPCA 2002]
Token Coherence [Martin et al., ISCA 2003]
VCT multicasting [Enright Jerger thesis, 2008]
Protocol Races
Atomic bus:
Only stable states in the protocol (e.g. M, S, I); all state transitions are atomic (I->M).
No conflicting requests can interfere, since the bus is held until the transaction completes.
Distinguish the coherence transaction from the data transfer; the data transfer can still occur much later, and that case is easier to handle.
Atomic buses don't scale:
At minimum, separate the bus request and response.
Large systems have broadly variable delays: req/resp can be separated by dozens of cycles.
Conflicting requests can and do get issued, and messages may get reordered in the interconnect.
How do we resolve them?
Resolving Protocol Races
Req/resp decoupling introduces transient states:
E.g. I->S is now I->ItoX->ItoS_nodata->S.
Conflicting requests to blocks in transient states:
NAK is ugly (livelock and starvation potential), so keep adding more transient states.
A directory protocol makes this a bit easier: requests can be ordered at the directory, which has the full state info.
Even so, messages may get reordered.
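A tiny sketch (Python, illustrative state names) of how a decoupled request/response pair stretches one atomic transition into transient states:

# I -> S on a split-transaction bus: the controller parks in
# intermediate states until the bus grant and the data response arrive.
TRANSITIONS = {
    ('I',             'proc_read'):     'ItoS_wait_bus',
    ('ItoS_wait_bus', 'bus_grant'):     'ItoS_nodata',   # request issued
    ('ItoS_nodata',   'data_response'): 'S',             # fill completes
}

state = 'I'
for ev in ('proc_read', 'bus_grant', 'data_response'):
    state = TRANSITIONS[(state, ev)]
print(state)   # 'S' only after the decoupled response returns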
Common Protocol Races
Read strings: P0 read, P1 read, P2 read.
Easy, since reads are nondestructive; can rely on the F state to reduce DRAM accesses (forward reads to the previous requestor, F).
Write strings: P0 write, P1 write, P2 write.
Forward P1's write request to P0 (M); P0 completes its write, then forwards the M block to P1. Build a string of writes (write string forwarding).
Read after write: similar to the preceding WAW case.
Writeback race: P0 evicts a dirty block, P1 reads.
The dirty block is in the network (no copy at P0 or at the directory): NAK P1, or force P0 to keep its copy until the directory acks the writeback.
Many others crop up, especially with optimizations.
Summary
Coherence States
Snoopy bus-based Invalidate Protocols
Invalidate protocol optimizations
Update Protocols (Dragon/Firefly)
Directory protocols
Implementation issues