CS 201 Computer Systems Programming

Chapter 19 Cache Coherence in Multiprocessor Architecture

Herbert G. Mayer, PSU

Status 5/25/2015

Syllabus

Definitions
Write Once Policy and the MESI Protocol
MOESI Extension
MESIF Extension
Life and Fate of a Cache Line: 7 MESI Scenarios
Scenario 1: A reads line, then B reads same line
Scenario 2: A writes line once, then B reads same line
Scenario 3: A writes line multiple times, then B reads same line
Scenario 4: A reads line, then B writes same line w/o reading
Scenario 5: A reads + writes line, then B writes same line
Scenario 6: A writes line repeatedly, then B writes w/o reading
Scenario 7: A and B have read the same line, then B writes to it
Differences in Pentium Pro processor L2 cache management
Bibliography

Context and Goal

We discuss shared-memory MP systems, in which each processor has a second level (L2) cache in addition to an internal first level (L1) cache

Many discussions are specific to the cache implementation on the Intel® Pentium® processor family, with focus on the MESI protocol

Reference point is a text by MindShare Inc., literature reference [1]

The 7 case studies are taken from [1]. MESI is used on the Pentium Pro processor family in a modified way. MESI = modified, exclusive, shared, invalid

Cache Coherence Problem

The problem addressed and solved with the MESI protocol is the coherence of memory and caches on shared-memory MP systems

Even in a UP system, in which the processor has a data cache, memory and cache must have coherent data; if not, there must be a mechanism to ensure that at critical moments differing copies will match again

Stale memory should be short-term and must be handled safely

Cache Coherence Problem

On shared-memory MP architectures this problem of data consistency (cache coherence) is magnified by the number of processors sharing memory

. . . and by the fact that without an additional L2 cache performance would drop noticeably

In an MP system with N processors and 2 levels of cache there can be N*2+1 copies of the same data

The +1 stems from the original copy of the data in memory. These copies must all be coherent
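The copy count can be checked with a short sketch. The function name max_copies and its cache_levels parameter are illustrative assumptions, not taken from the slides.

```python
def max_copies(n_processors: int, cache_levels: int = 2) -> int:
    """Upper bound on simultaneous copies of one datum:
    one copy per cache of every processor, plus the original in memory."""
    return n_processors * cache_levels + 1


# Example: a hypothetical 4-processor system with L1 and L2 caches
print(max_copies(4))
```

For N = 4 this gives 4*2+1 = 9 copies that must all be kept coherent.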

Definitions

Definitions

Allocate-on-Write

If a store instruction experiences a cache miss, and as a result a cache line is filled, then the allocate-on-write cache policy is used

If the write miss causes the paragraph from memory to be streamed into a data cache line, we say the cache uses allocate-on-write

Pentium processors, for example, do not use allocate-on-write

Antonym: write by
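The difference between allocate-on-write and its antonym, write by, can be illustrated with a toy model. Everything here is an assumption made for illustration: the dict-based cache and memory, the name store_on_miss, the 4-word line size, and the choice that the allocating cache keeps memory stale as under write back.

```python
WORDS_PER_LINE = 4  # assumed line size in words, for illustration only


def store_on_miss(cache, memory, line_addr, offset, value, allocate_on_write):
    """Handle a store whose line is not present in `cache` (toy model)."""
    if allocate_on_write:
        # Stream the whole paragraph into a cache line, then modify the copy.
        cache[line_addr] = list(memory[line_addr])
        cache[line_addr][offset] = value
    else:
        # Write by: leave the cache untouched and update memory directly.
        memory[line_addr][offset] = value


memory = {0x40: [1, 2, 3, 4]}
cache = {}
store_on_miss(cache, memory, 0x40, 2, 99, allocate_on_write=True)
print(cache, memory)
```

With allocate_on_write=False the same call would leave the cache empty and place the new word in memory only.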

Definitions

Back-Off

If processor P1 issues a store to a data address shared with another processor P2, and P2 has cached and modified the same data, a chance for data inconsistency arises

To avoid this, the cache with the modified P2 line must snoop for all accesses, read or write, to guarantee delivery of the newest data

Once the snoop detects the access request from P1, P1 must be prevented from getting ownership of the data; this is accomplished by temporarily preventing access to the system bus

This denial for the purpose of preserving data integrity is called back-off

Definitions

Blocking Cache

Let a cache miss result in streaming-in of a line

If during that stream-in no more accesses can be made to this cache until the data transfer is complete, then this cache is called blocking

Antonym: non-blocking

Generally, a blocking cache yields lower performance than a non-blocking cache

Definitions

Bus Master

Only one of the devices connected to a system bus has the right to send signals across the bus; this ownership is called being the bus master

Initially the Memory and IO Controller (MIOC) is the bus master; it is also possible that a chipset includes a special-purpose bus arbiter

Over time, all processors, and for the processors their caches, request to become bus master for some number of bus cycles

The MIOC can grant this right, yet each of the processors (more specifically: its cache) can request a back-off, even if otherwise the right to be bus master would be granted

Definitions

Directory

The collection of all tags is referred to as the cache directory

In addition to the directory and the actual data there may be further overhead bits in a data cache

Dirty Bit

State bit associated with a cache line. This bit expresses whether a write hit has occurred on a system applying write back. Synonym: Modified bit

Definitions

Invalid

State in the MESI protocol

This I state (possibly implemented via special purpose bit) indicates that the associated cache line is invalid, and consequently holds no valid data

Invalid (I) state is always set after a system reset

Definitions

Exclusive

State in MESI protocol. The E state indicates that the current cache is not aware of any other cache sharing the same information, and that the line is unmodified

E allows that in the future another line will contain the same information, in which case the E state must be changed

It is possible that a higher-level cache (L1 for example, viewed from an L2) may actually have a shared copy of the line in exclusive state; however, that level of sharing is transparent to other potentially sharing agents outside the current processor

Definitions

MESI

Acronym for Modified, Exclusive, Shared and Invalid

This is an ancient protocol to ensure cache coherence on the Pentium processor. A protocol is necessary, since more than one processor may hold a copy of common data with the right to modify it

Through the MESI protocol data coherence is ensured no matter which of the processors performs writes

AKA Illinois protocol due to its origin at the University of Illinois at Urbana-Champaign

Definitions

Modified

State in MESI protocol

M state implies that the cache line found by a write hit was exclusive, and that the current processor has modified the data.

The modified state expresses: Currently not shared, exclusively owned data have been modified

In a UP system, this is generally expressed by the dirty bit

Definitions

Paragraph

Conceptual, aligned, fixed-size area of the logical address space that can be loaded in toto into the cache

The holding area in the cache of paragraph-size is called a line

In addition to the actual data, a line in cache has further information, including the dirty and valid bit (in UP systems), the tag, LRU information, and in MP systems the MESI bits

The MESI I state and the valid bit in UP architectures perform the same function

Also, the MESI M state corresponds to the dirty bit in a UP system
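The per-line bookkeeping listed above might be modeled as follows. This is only a sketch: the class and field names, the 32-byte line size, and the integer LRU counter are assumptions chosen for illustration, not a description of any real cache.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mesi(Enum):
    M = "modified"
    E = "exclusive"
    S = "shared"
    I = "invalid"


@dataclass
class CacheLine:
    tag: int = 0
    data: bytearray = field(default_factory=lambda: bytearray(32))  # assumed line size
    lru: int = 0                 # placeholder for LRU information
    state: Mesi = Mesi.I         # the I state is set after a system reset

    @property
    def valid(self) -> bool:
        """UP-style valid bit: same function as 'not in the I state'."""
        return self.state is not Mesi.I

    @property
    def dirty(self) -> bool:
        """UP-style dirty bit: corresponds to the M state."""
        return self.state is Mesi.M


line = CacheLine()
print(line.valid, line.dirty)
line.state = Mesi.M
print(line.valid, line.dirty)
```

Deriving valid and dirty from the MESI state, rather than storing them separately, mirrors the correspondence stated in the definition.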

Definitions

Shared

State in the MESI protocol

S state expresses that the hit line is present in more than one cache. Moreover, the current cache (with the shared state) has not modified the line after stream-in

Another cache of the same processor may be such a sharing agent. For example, in a two level cache, the L2 cache will hold all data present in the L1 cache

Similarly, another processor’s L2 cache may share data with the current processor’s L2 cache

Definitions

Snarfing 1: c has a modified line and a wants to read

A snooping cache c detects that another bus agent a wants to read a paragraph into a’s cache line, of which c has a modified M copy

M of the MESI protocol implies exclusivity; no other cache will have a copy at this time

Instead of 1.) c causing a to back-off, 2.) then c streaming-out the line, 3.) a streaming-in that written paragraph, and 4.) both a and c ending up in S state, snarfing does the following:

c streams-out the line and switches to S, but does not cause a to back-off. Instead, a reads the line from the bus during the stream-out process. This saves a full memory access and saves the back-off delay

Definitions

Snarfing 2: c has a modified line and a wants to write

A snooping cache c detects that another bus agent a wants to stream-in a paragraph due to allocate-on-write, of which c has a modified M copy. Note: a does not have a copy, but uses allocate-on-write, hence makes it known that the line will be streamed-in and then modified once present

Instead of c 1.) causing a to back-off, 2.) then streaming-out the line, 3.) switching to I invalid, and 4.) letting a stream-in the line and modify it, snarfing does the following:

c streams-out the line, switches to I, but does not cause a to back-off. Instead, a reads the line directly from the bus during the stream-out process. This saves a complete memory access, and saves the back-off delay. Now a has the modified copy (modified by c), as does memory, and E is the proper state for a. Now a can further modify the line, resulting in a state transition from E to M. c no longer holds the line
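The two snarfing cases can be summarized as state outcomes. This is a minimal sketch under stated assumptions: the function name snarf and its return convention are hypothetical, and the S state for a in the read case is inferred from the non-snarfing sequence, where both a and c end up in S.

```python
def snarf(c_state: str, a_wants_write: bool):
    """Return (new state of c, new state of a, memory_updated) after a snarf.

    Assumes c holds the line in M and a holds no copy beforehand.
    """
    if c_state != "M":
        raise ValueError("the snarfing cases described apply only when c holds M")
    if a_wants_write:
        # Case 2: c streams out and invalidates its copy; a takes the line
        # off the bus in E. A later write by a moves its line from E to M.
        return "I", "E", True
    # Case 1: c streams out and keeps a shared copy; a takes the line off the bus.
    return "S", "S", True


print(snarf("M", a_wants_write=False))
print(snarf("M", a_wants_write=True))
```

In both cases memory is updated by the stream-out, and a obtains the data without a back-off and without a separate memory read.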

Definitions

Snooping

After a line write-hit in a cache using write back, the data in cache and memory are no longer identical. In accordance with the write back policy, memory will be written eventually, but until then memory is stale

The modifier (the cache that wrote) must pay attention to other bus masters trying to access the same line. If this is detected, action must be taken to ensure data integrity

This paying attention is called snooping. The right action may be forcing a back-off, or snarfing, or yet something else that ensures data coherence

Snooping starts with the lowest-order cache, here the L2 cache. If appropriate, L2 lets L1 snoop for the same address, because L1 may have further modified the line

Definitions

Squashing

In a non-blocking cache, a subsequent memory access may be issued even if a previous miss resulted in a slow stream-in to the addressed cache line

That subsequent memory access will be a miss again, which is being queued. Whenever an access references an address for which a request is already outstanding, the duplicate request to stream-in can be skipped

Not entering this in the queue is called squashing

The second and any further outstanding memory access can be resolved, once the first stream-in results in the line being present in the cache
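Squashing can be sketched as a queue of outstanding stream-in requests keyed by line address. The class MissQueue, its method names, and the 32-byte line size are assumptions for illustration; real hardware tracks outstanding misses in dedicated registers rather than in software.

```python
from collections import OrderedDict


class MissQueue:
    """Toy queue of outstanding stream-in requests of a non-blocking cache."""

    def __init__(self, line_size: int = 32):
        self.line_size = line_size
        self.pending = OrderedDict()  # line address -> byte addresses waiting on it
        self.squashed = 0

    def miss(self, addr: int) -> bool:
        """Record a miss; return True only if a new stream-in was queued."""
        line = addr - addr % self.line_size
        if line in self.pending:
            # A request for this line is already outstanding: squash the duplicate.
            self.pending[line].append(addr)
            self.squashed += 1
            return False
        self.pending[line] = [addr]
        return True

    def stream_in_done(self, line: int):
        """The line arrived; all accesses waiting on it can now be resolved."""
        return self.pending.pop(line)


q = MissQueue()
q.miss(0x100)   # queued
q.miss(0x104)   # same line, squashed
print(q.squashed, q.stream_in_done(0x100))
```

The waiting list per line is what allows the second and any further access to be resolved once the first stream-in completes.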

Definitions

Strong Write Order

A policy ensuring that memory writes occur in the same order as the store operations in the executing object code

Antonym: Weak order

The advantage of weak ordering can be speed gain, allowing a compiler or cache policy to schedule instructions out of order; this requires some other policy to ensure data integrity

Definitions

Stream-In

The movement of a paragraph from memory into a cache line

Since line length generally exceeds the bus width (i.e. exceeds the number of bytes that can be moved in a single bus transaction), a stream-in process requires multiple bus transactions in a row

It is thus possible that the byte actually needed arrives last in a cache line during a sequence of bus transactions

Antonym: Stream-out
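The number of bus transactions per stream-in follows from line length and bus width. The sketch below uses assumed example sizes (a 32-byte line and an 8-byte bus) and assumes the line is transferred in ascending address order; both function names are hypothetical.

```python
def bus_transactions(line_bytes: int, bus_bytes: int) -> int:
    """Number of bus transactions needed to stream in one full line."""
    return -(-line_bytes // bus_bytes)  # ceiling division


def arrival_slot(byte_offset: int, bus_bytes: int) -> int:
    """0-based transaction in which the byte at `byte_offset` arrives,
    assuming the line is transferred in ascending address order."""
    return byte_offset // bus_bytes


LINE_BYTES, BUS_BYTES = 32, 8  # assumed example sizes
print(bus_transactions(LINE_BYTES, BUS_BYTES))
print(arrival_slot(31, BUS_BYTES))
```

With these sizes a stream-in takes 4 transactions, and a load of the byte at offset 31 is satisfied only by the last of them.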

Definitions

Stream-Out

The movement of one line of modified data from cache into a memory paragraph

Antonym: Stream-in

Definitions

Weak Write Order

A policy giving a compiler or cache the right to let memory writes occur in a different order than their associated store operations

Antonym: Strong Write Order

The advantage of weak ordering is potential speed gain

Definitions

Write Back

Cache write policy that keeps a line of data (a paragraph) in the cache even after a write

The changed state must be remembered via the dirty bit, AKA Modified state

Upon retirement, any dirty line must be copied back into memory

Advantage: only one stream-out, no matter how many write hits occurred to that same line

Definitions

Write By

Cache write policy in which the cache is not accessed on a write miss, even if there are cache lines in I state

A cache using write by “hopes” that soon there may be a load, which will result in a miss and then stream-in the appropriate line. And if not, it was not necessary to stream-in the line in the first place

Antonym: allocate-on-write

Definitions

Write Once

Cache write policy that starts out as write through and changes to write back after the first write hit to a line

Typical policy imposed onto a higher level L1 cache by the L2 cache

Advantage: The L1 cache places no unnecessary traffic onto the system bus upon a cache-write hit

Lower level L2 cache can remember that a write has occurred by setting the MESI state to modified

Definitions

Write Through

Cache write policy that writes data to memory upon a write hit. Thus, cache and main memory are always in synch

Disadvantage: repeated memory access traffic on the bus
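The traffic difference between write through and write back can be made concrete by counting memory writes for a series of write hits to one cached line. This is a rough sketch: the function name memory_writes is hypothetical, and it counts transfers only, ignoring that a write-back stream-out moves a whole line while a write-through access moves a single datum.

```python
def memory_writes(write_hits: int, policy: str) -> int:
    """Count memory writes caused by `write_hits` write hits to one cached line."""
    if policy == "write through":
        return write_hits                      # every write hit also goes to memory
    if policy == "write back":
        return 1 if write_hits > 0 else 0      # a single stream-out upon retirement
    raise ValueError(f"unknown policy: {policy}")


for policy in ("write through", "write back"):
    print(policy, memory_writes(10, policy))
```

For ten write hits this gives ten memory writes under write through and a single stream-out under write back.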

Write Once Policy in MESI Protocol

Introduction to Write Once

The MESI protocol is one implementation of enforcing data integrity among caches sharing data; the write once write policy is a method to keep the protocol performing efficiently by avoiding superfluous data traffic on the system bus

First we’ll discuss write once, then the MESI protocol

We’ll also mention the MOESI and MESIF protocols, but the focus is MESI

Write Once Policy

Write through has the advantage of keeping cache and memory continually consistent. The draw-back is the added traffic placed on the system bus

Write back has the advantage of postponing unnecessary bus traffic until the last possible moment, and of doing so just once, even if many writes to a shared cache line occurred. The draw-back is the temporary inconsistency between cache line and memory

To avoid catastrophe, a dirty bit must mark the fact that at least one write happened to an exclusively owned line

Write once combines the advantages of both. For efficiency, multi-level caches generally use write back for L2. Write once is a refinement used for L1 caches. Both use the MESI protocol to preserve data consistency

Write Once Policy

In write once, L1 starts out using write through, and any line shared with an L2 cache is marked S, for shared. The corresponding copy in L2 is also marked S

If a write hit occurs, the modified data are written through to the L2 cache, which in turn marks its line as modified, M

This transition is used by L2 to cause L1’s write policy to change from write through to write back. Also, the L1 line is marked E, for exclusive

This may look strange, but it is safe, since the M information is recorded in the L2 cache, the first to initiate snooping

Write Once Policy

Subsequently, when the same processor modifies the same data further, the L1 cache experiences a write hit

This time, however, the L1 cache is in write back mode, and changes from E to M

L1 does not change the L2 cache again. Further writes keep the state in M

Of the two lines with the same paragraph address, the one in the L1 cache is more current than the one in L2

Both record the M state
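The sequence of local writes under write once can be traced with a minimal Python sketch. The class `WriteOncePair` and its field names are hypothetical, introduced here only for illustration; the model follows just the states and the policy switch described above and ignores snooping, misses, and evictions.

```python
from dataclasses import dataclass


@dataclass
class WriteOncePair:
    """One line as seen by a processor's L1 and L2 caches (hypothetical model)."""
    l1: str = "S"                      # L1 starts in S, using write through
    l2: str = "S"                      # the L2 copy is marked S as well
    l1_policy: str = "write through"
    l2_updates: int = 0                # how often a CPU write was passed on to L2

    def cpu_write(self) -> None:
        """Apply one write hit by the owning processor."""
        if self.l1_policy == "write through":
            # First write: written through to L2, which marks its line M
            # and switches L1 to write back; the L1 line is marked E.
            self.l2_updates += 1
            self.l2 = "M"
            self.l1 = "E"
            self.l1_policy = "write back"
        else:
            # Later writes stay in L1: E -> M on the second write, then M.
            self.l1 = "M"


pair = WriteOncePair()
for _ in range(3):
    pair.cpu_write()
    print(pair.l1, pair.l2, pair.l1_policy, pair.l2_updates)
```

However many writes follow, `l2_updates` stays at 1 in this model, which is the point of the policy: only the first write reaches L2.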


Write Once Policy

When another processor issues a read from the same paragraph address, the L2 cache with a modified copy of that line snoops and asks the other processor to back off. As a result, the line will be written into memory

First, however, L2 must check if L1 has modified the data as well

This is detectable by the M state of L1; in that case the data are first flushed from L1 to L2, and then to memory

Finally, both L1 and L2 change to S. Otherwise, if the L1 cache has not further modified the data, indicated by E, L1 and L2 are already in synch, only the L2 line needs to be written to memory, and both L1 and L2 transition to S


MESI Protocol Detail


MESI Protocol

On a Pentium family processor, each cache line may have its own write policy, independent of other lines even in the same set

The complete state of a cache line is therefore expressed in the write policy used and the MESI state bits associated with each line

These bits are: M for modified, E for exclusive, S for shared, and I for invalid. Initially a line holds no information, so its state is I


MESI Protocol

During system reset, the MESI bits for the Pentium’s L1 and L2 caches are set to I

This will mark all lines as empty and will force any cache read or write access to miss. At the first read, lines will be streamed into L2 and a portion of those will be streamed into L1

Since L1 has a copy of what is in L2, L1 will be set to S and L2 to E

E holds, as long as no other processor’s cache shares the same data. This transition is shown in the figure below

Note in the coming example: the other processor B has not yet made any data accesses, hence its cache lines remain I


MESI Protocol Generic Diagram

[Figure: Initial Read of cache line into A after reset, B inactive. Processor A’s L1 cache: I -> S (shared with L2, prepare for write-once). Processor A’s L2 cache: I -> E (no other processor uses this line). Processor B’s L1 and L2 caches: I (no activity, invalid after reset). The L2 caches of A and B attach to the system bus.]


MESI States

The table below explains the 4 MESI states. Note that both caches have MESI bits, and both use write back most of the time on the Pentium family

MESI State | Description
Modified | The line has been changed by the cache; a store has taken place. It is implied that the data are exclusive. M alerts the cache to take action on a snoop hit. In that case the line is written back and the state is adjusted, generally to Shared. Alternatively, the cache can snarf; this is done on the Pentium® Pro, not on the Pentium.
Exclusive | The line is owned by just this processor. Except for memory, a shared copy may only exist in a higher order cache of the same processor. For example, L2 may label a line E that exists also in its L1. But no other processor’s cache holds a copy of this same line.
Shared | The line is present in at least one other cache, maybe in several. However, all of these are identical copies. No cache holding these data has performed a write. Shared implies unmodified.
Invalid | The line is not valid in this cache. Typical state after system reset, and thus the line is ready for receiving new data.


MESI States

Every cache line will be in one of these 4 states

State is influenced by the owning processor’s actions (loads and stores) or by a bus snoop, when another processor addresses the same line

The latter is supported by special pins and connections (lines) between L2 caches on the Pentium processor

These lines are HIT# (for: cache hit) and HITM# (for: cache hit modified)
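As a rough illustration, the four states and the two snoop lines can be sketched in Python. The enum `MESI` and the function `snoop_signals` are names assumed here, not taken from any Pentium documentation. The sketch further assumes that HIT# is asserted for any valid line and HITM# additionally for a modified line, and it models the signals as booleans rather than as active-low pins.

```python
from enum import Enum


class MESI(Enum):
    M = "modified"
    E = "exclusive"
    S = "shared"
    I = "invalid"


def snoop_signals(state: MESI) -> dict:
    """Return which snoop lines a cache would assert for a snooped address."""
    return {
        "HIT#": state is not MESI.I,   # assumption: any valid copy counts as a hit
        "HITM#": state is MESI.M,      # hit on a modified line
    }


for s in MESI:
    print(s.name, s.value, snoop_signals(s))
```

Under these assumptions only a line in M raises both signals, which is the case in which the snooping cache has to intervene before the other processor can proceed.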


MOESI Extension

MOESI protocol is based on MESI, but offers a 5th state: O for Owned

O implies modified and shared; goal of the processor that has modified the line is to defer the eventual, mandatory write-back

To ensure data integrity with the sharing processor, direct cache to cache data transfer is initiated

This is done on AMD64; see detail below from AMD


MOESI Extension

Owned

A cache with a line in state O is a sharing cache with a valid copy, but has the exclusive right to modify

It broadcasts those changes to all other sharing caches

Owned state allows dirty sharing of data, i.e., a modified cache block can be moved around various caches without updating main memory

Cache line state may be changed to M after invalidating all shared copies, implying exclusiveness

Or it may transition to S by writing the modifications back to main memory

A line in O must respond to snoop requests
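The transitions around the Owned state can be sketched as follows. This is a simplified model, not AMD’s implementation: the function names and the dictionary of per-cache states are hypothetical, and the M -> O step on a remote read is an assumption based on the usual description of MOESI. Each function returns the number of main memory writes it causes.

```python
def remote_read(states, owner, reader):
    """Another cache reads a line that `owner` holds dirty (M or O)."""
    assert states[owner] in ("M", "O")
    states[owner] = "O"     # owner keeps the dirty data, which is now shared
    states[reader] = "S"    # reader receives the data cache to cache
    return 0                # no write to main memory


def upgrade_to_modified(states, owner):
    """O -> M: invalidate every other copy, then hold the line exclusively."""
    assert states[owner] == "O"
    for cache in states:
        if cache != owner:
            states[cache] = "I"
    states[owner] = "M"
    return 0


def write_back(states, owner):
    """O -> S: write the modifications back to main memory."""
    assert states[owner] == "O"
    states[owner] = "S"
    return 1


states = {"A": "M", "B": "I", "C": "I"}
writes = remote_read(states, "A", "B")
writes += remote_read(states, "A", "C")
print(states, writes)       # A is O, B and C are S, memory not yet written
writes += write_back(states, "A")
print(states, writes)
```

In this sketch two other caches obtain the dirty line before main memory is written once, which is the deferral that the O state is meant to allow.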


MOESI Extension, References

1. http://en.wikipedia.org/wiki/MOESI_protocol

2. http://www.revolvy.com/main/index.php?s=MOESI%20protocol

3. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0425/ch03s12s01.html


MESIF Extension

The MESIF protocol was developed by Intel for NUMA architectures, with the usual 4 states of MESI, but one added state F, for forward

F is a shared S state variation, stating this cache must act as the designated responder for line requests

Protocol ensures that, if any cache holds a line in the S state, at most one other cache holds it in F

But multiple others may hold the line in S


MESIF Extension Detail

With the MESI protocol, a cache line request received by multiple caches holding an S line is serviced inefficiently. It may either be satisfied from slow main memory, or all sharing caches could respond, flooding the requestor

With MESIF, a cache line request is serviced only by the cache in the F state. This allows the requestor to receive a copy at cache-to-cache speeds, like in the MOESI protocol, while minimizing multicast packets

Because a cache may unilaterally invalidate a line in S or F, it is possible that no cache has a copy in the F state, even though copies in the S state exist

In that case, a request for the line is resolved by streaming in

F can be viewed as a virtual token to be passed around: To minimize the chance of an F line being discarded, the most recent requestor of a line receives F; when a cache in state F responds, it hands over the F token to the new cache, saving the stream in


MESIF Extension Detail

Key difference from the MESI protocol is that a line streamed in for reading results in F. The only way to enter the S state is to satisfy a read request from another cache

There are other techniques for satisfying read requests from shared caches while suppressing redundant replies, but having only a single designated cache respond makes it easier to invalidate all copies

And if only one cache is left, it transitions to E
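The passing of the F token can be illustrated with a small Python sketch. The helper names `read_line` and `drop_line` are hypothetical, and the model is deliberately restricted to the states I, S, and F; it leaves out M and E as well as all write traffic.

```python
def read_line(states, requester):
    """Serve a read miss by `requester`; return who supplied the data."""
    forwarders = [c for c, s in states.items() if s == "F"]
    if forwarders:
        source = forwarders[0]
        states[source] = "S"    # old forwarder keeps a plain shared copy
    else:
        source = "memory"       # no F copy anywhere: stream in from memory
    states[requester] = "F"     # most recent requestor holds the F token
    return source


def drop_line(states, cache):
    """A cache may unilaterally invalidate a line it holds in S or F."""
    if states[cache] in ("S", "F"):
        states[cache] = "I"


states = {"A": "I", "B": "I", "C": "I"}
print(read_line(states, "A"), states)
print(read_line(states, "B"), states)
drop_line(states, "B")          # the F token is lost, A still holds S
print(read_line(states, "C"), states)
```

The last read is served from memory although cache A still holds the line in S, which matches the case described above in which no cache holds a copy in F.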


MESIF Extension, References

1. http://en.wikipedia.org/wiki/MESIF_protocol

2. http://www.revolvy.com/main/index.php?s=MESIF%20protocol&item_id=3737582


Life & Fate of a Cache Line: 7 MESI Scenarios


MESI Scenarios

This section shows typical state transitions of the L1 and L2 caches, each transition characterized by the respective title

A bulleted list explains the initial state, the figure shows the transition, and a trailing bulleted list highlights the key points as a result of the transition

First we show 3 situations, in which processor B wishes to read a line of which processor A has a copy. Processor A has performed 0, 1, or more writes to that line


MESI Scenarios

Table of 3 Scenarios: B Reads some Line That is Also in A

Scenario | Processor A Status | A L1 State | A L2 State
1 | A holds line from memory but has not modified it. | S | E
2 | A has written a line once, using write through. | E | M
3 | A has written the same line more than once, uses write back policy after first write. | M | M


MESI Scenarios

Next follow 4 situations, in which processor B writes to a line of which A has a copy

Again, A has modified the line various times, and B has read the shared data in one case before it attempts the write

Assume here the caches use write-by, NOT allocate-on-write


MESI Scenarios

Table of 4 more Scenarios: B Writes some Line That is Also in A

Scenario | Processor A Status | A L1 State | A L2 State
4 | A holds a memory paragraph but has not modified it. Then B writes to that same line. | S | E
5 | A has read a line then writes the same line once, L1 using write through. Then B writes to that same line. | E | M
6 | A has written some line more than once, L1 uses write back policy after first write. Then B writes to same address. | M | M
7 | A and B have read the same paragraph, then B writes to that same line. | S | S


Scenario 1

Scenario 1: A reads line, then B reads same line

Initial state and actions taken:

1. A has read a paragraph, placed it into a line, but not modified it

2. A’s L1 is in S state

3. A’s L2 is in E state, i.e. no other processor has a copy, yet its own L1 cache does have a copy; but that is transparent to other processors

4. See figure: “Initial Read of cache line into ...”

5. B has not read any data at all, thus B’s L1 and L2 are both in I

6. B next intends to read the same line


Scenario 1

[Figure: A Reads Line, then B Reads Same Line. State changes: A’s L1 S -> S, A’s L2 E -> S, B’s L1 I -> S, B’s L2 I -> S. Annotations: remains S, no snoop; A L1 writes through; snoop detects read by other master; CHIT# asserted; copy in L2. The L2 caches of A and B attach to the system bus.]


Scenario 1

A’s L2 snoops and detects a snoop hit on read

A’s L2 does not request back-off, since the request by the other bus master is for read and its own state is E, not M

A’s L1 is in S state; according to write once policy: it stays S

A’s L2 transitions from E to S. It is aware another copy will soon exist, in processor B

The whole line is streamed into B’s L2 and then into L1 cache

B’s L1 transitions from I to S

B’s L2 transitions from I to S; since A holds a copy of that same paragraph, B’s L2 state cannot be E

4 lines have copies of the memory paragraph, none are modified


Scenario 2

Scenario 2: A writes line once, then B reads same line

Initial state and actions taken, snarfing is NOT yet used here:

1. A has read a line, but not modified it

2. A’s L1 is in S and L2 in E state, since no other processor’s cache has a copy

3. B has not read any data at all, thus B’s L1 and L2 are in I

4. A now writes the line; experiencing a write-hit!

5. A’s L1 transitions to E, switches to write back due to write once

6. A’s L2 transitions from E to M; new data are in L2, not in memory

7. B now intends to read the same line, is still in I state


Scenario 2

[Figure: A Writes Line Once, then B Reads Same Line. State changes: A’s L1 E -> S, A’s L2 M -> S, B’s L1 I -> S, B’s L2 I -> S. Annotations: L1 writes back; INV sampled low; snoop detects read by other bus master; CHIT# and CHITM# asserted; copy in L2. The L2 caches of A and B attach to the system bus.]


Scenario 2

B’s L1 and L2 experience read miss, L2 sends read request to bus

A’s L2 snoops, sees a read snoop hit, and forces B to back off

A’s L1 notices due to the E state that it already has the newest data, and has not subsequently modified them

A’s L1 transitions from E to S, since the L2 cache already has a copy

A’s L2 writes back data to memory, transitions from M to S, releases back-off

B’s L2 streams in the whole line, transitions from I to S

B’s L1 gets copy of line, transitions from I to S


Scenario 3

Scenario 3: A reads, writes line multiple times, then B reads line

Initial state and actions taken, snarfing NOT used here, assuming write-by:

1. A’s L1 has read, then written line using write through, transitions to E

2. A’s L2 transitions to M; note that the new data are in L2, not in memory

3. A’s L1 changes to write back due to write once policy

4. A performs a write again, hits L1 cache

5. A’s L1 cache transitions from E to M, A’s L2 cache remains in M mode

6. The modified data did not get copied to memory, as L2 uses write back; now 3 different copies of the same paragraph exist!!!

7. B intends to read the same line, is in I state; what happens with caches?


Scenario 3

[Figure: A Writes Line Multiple Times, then B Reads Same Line. State changes: A’s L1 M -> S, A’s L2 M -> S, B’s L1 I -> S, B’s L2 I -> S. Annotations: L1 write back; INV sampled low; L2 snoop detects read by other bus master; CHIT# and CHITM# asserted; BOFF#; copy in L2. The L2 caches of A and B attach to the system bus.]


Scenario 3

B’s L1 and L2 see read miss, B sends read request to system bus

A’s L2 snoops, experiences a read snoop hit, realizes that B would get stale data, and forces B to back off

A’s L1 has the newest data, visible through M state

A’s L1 writes the modified data back into L2

A’s L2 writes data back to memory

A and memory are in synch; instead of 3 copies, now there exists 1!

A’s L1 state transitions from M to S, L2 from M to S, releases back-off

B’s L2 streams in the whole line, second try; transitions I to S

B’s L1 gets copy of line, transitions I to S; all 4 are in state S
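The three read scenarios differ only in how much work A’s caches must do before B may stream in the line. The sketch below condenses them into one Python function; the name `snoop_read` and the wording of the step strings are illustrative assumptions, and snarfing is not modeled.

```python
def snoop_read(a_l1, a_l2):
    """B reads a line held by A; return (steps, final states)."""
    steps = []
    if a_l2 == "M":
        # Memory is stale, so B must wait until A has written back.
        steps.append("A's L2 forces B to back off")
        if a_l1 == "M":
            steps.append("A's L1 writes line back to L2")
        steps.append("A's L2 writes line back to memory")
        steps.append("A's L2 releases back-off")
    steps.append("B's L2 streams in line, then B's L1")
    final = {"A L1": "S", "A L2": "S", "B L1": "S", "B L2": "S"}
    return steps, final


scenarios = {1: ("S", "E"), 2: ("E", "M"), 3: ("M", "M")}
for number, (l1, l2) in scenarios.items():
    steps, final = snoop_read(l1, l2)
    print(f"Scenario {number}: final states {final}")
    for step in steps:
        print("  -", step)
```

All three scenarios end with the four lines in S; only the number of steps before B receives the line changes.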


Scenario 4

Scenario 4: B writes line also present, unmodified in A

Initial state and actions taken, snarfing not used, assume write-by:

1. A has read a line from memory, but not modified it

2. A’s L1 is in S state

3. A’s L2 is in E mode, as L2 believes no other processor has a copy

4. B has not read any data at all, thus B’s L1 and L2 are both in I

5. B next intends to write to that same line; note use of write-by!


Scenario 4

[Figure: B Writes Line of Which A Has Copy. State changes: A’s L1 S -> I, A’s L2 E -> I, B’s L1 I -> I, B’s L2 I -> I. Annotations: INV samples high; L2 snoop detects write by other bus master; memory write miss; write miss. Signal labels: BOFF#, CHIT#, HIT#. The L2 caches of A and B attach to the system bus.]


Scenario 4

B’s L1 and L2 experience data cache write miss, L2 initiates memory write bus cycle on system bus to update memory; we use write-by, not allocate-on-write

B’s L1 remains in I, L2 similarly stays in I

A’s L2 detects memory write bus cycle, snoops the address, which hits

A’s L2 is in E state, which says that a write in B is about to update memory of which A has an exclusive copy; but A no longer has a valid copy, thus it transitions to I

A’s L1 also transitions to I

Note that Pentium does not use allocate-on-write, hence all lines are I

Snarfing could be used for a much better policy; see Pentium Pro


Scenario 5

Scenario 5: A reads + writes line, then B writes same line

Initial state and actions taken, snarfing not used, assume write-by:

1. A has read a line from memory, L1 is S and L2 in E state

2. A writes line: A’s L1 writes through, updates L2

3. A’s L1 transitions to E, switches policy to write back, L2 transitions to M

4. B has not read any data at all, thus B’s L1 and L2 are both in I

5. B next intends to write to that same line, using write-by


Scenario 5

[Figure: A Reads + Writes Line then B Writes Same Line. State changes: A’s L1 E -> I, A’s L2 M -> I, B’s L1 I -> I, B’s L2 I -> I. Annotations: INV samples high; L2 snoop detects write by other bus master; memory write miss. Signal labels: BOFF#, CHIT#, HIT#. The L2 caches of A and B attach to the system bus.]


Scenario 5

B’s L1 and L2 experience cache write miss, initiate write to memory

B’s L1 remains in I, L2 similarly

A’s L2 detects memory write bus cycle, snoops address, which matches

A’s L2 is in M state, which says that write in B would create stale memory

A’s L2 causes B to back off from writing, checks if L1 made further writes

A’s L1 is not in M; L2 writes back data to memory

A’s L2 transitions from M to I, knowing that B will write data

A’s L2 releases back-off, forces L1 to transition from E to I as well

B completes write-by, does not fill cache line, L1 and L2 end up in I

Note: If B used allocate-on-write, it could snarf while A writes back, then modify the snarfed line, mark it as M; A would end up in I
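The difference between write-by and allocate-on-write in Scenarios 4 and 5 can be sketched as below. This is a simplified model under stated assumptions: `snoop_write` is a hypothetical name, B’s L1 and L2 are collapsed into a single "B line" entry, and the allocate-on-write variant is modeled only for the case in which A writes back, as in the note above.

```python
def snoop_write(a_l1, a_l2, allocate_on_write=False):
    """B writes a line it does not hold while A holds a copy."""
    steps = []
    b_state = "I"                   # write-by: B never fills its line
    if a_l2 == "M":
        steps.append("A's L2 forces B to back off")
        if a_l1 == "M":
            steps.append("A's L1 writes back to L2")
        steps.append("A's L2 writes back to memory")
        if allocate_on_write:
            # Assumed variant: B snarfs the line during A's write-back,
            # modifies the snarfed copy, and marks it M.
            steps.append("B snarfs the line during A's write-back")
            b_state = "M"
        steps.append("A's L2 releases back-off")
    if b_state == "I":
        steps.append("B writes to memory")
    final = {"A L1": "I", "A L2": "I", "B line": b_state}
    return steps, final


for number, (l1, l2) in {4: ("S", "E"), 5: ("E", "M")}.items():
    steps, final = snoop_write(l1, l2)
    print(f"Scenario {number}: {steps} -> {final}")

print(snoop_write("E", "M", allocate_on_write=True))
```

With write-by, A’s copies end in I and B holds nothing, so the next access by either processor misses; in the snarfing variant B ends with the line in M instead.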


Scenario 6

Scenario 6: A reads then writes line repeatedly, B writes same without prior read

A has written a line repeatedly, switched from write through to write back

1. A’s L1 is in M state, and L2 is in M mode as well

2. B next intends to write to that same line


Scenario 6

[Figure: A Writes Line Repeatedly then B Writes Same Line. Processor A: L1 cache M to I, L2 cache M to I. Processor B: L1 cache I to I, L2 cache I to I. System bus annotations: memory write miss (B’s L1 and L2), L2 snoop detects write by other bus master, INV samples high, BOFF#, CHIT#, HIT#]


Scenario 6

B’s L1 experiences a write miss and initiates a write
B’s L2 similarly experiences a cache write miss and initiates a write on the bus
B’s L1 and L2 are invalid (I)
A’s L2 detects the write bus cycle initiated by B; the snooped address matches
A’s L2 is in M state, i.e. a write by B would access stale memory
A’s L2 causes B to back off from writing, and checks whether L1 made further writes
A’s L1 is in M state, thus has newer data than L2
A’s L1 writes back data to L2 and transitions to I; L2 writes to memory
A’s L2 transitions to I, knowing that B will write data. Note that memory is now NOT stale after the current write by A, and B has NOT yet written
A’s L2 releases back-off, so B can resume (retry) the write to memory
B completes the write according to write-by and still has no line in cache; B’s L1 and L2 end up in I
Note: in some other protocol B could snarf the line, using allocate-on-write. After snarfing, B could modify the line, transition to M, leave A in I, and memory would be safe due to B’s M state
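The ordering of the Scenario 6 steps can be traced with a small toy model. This is only a sketch under assumptions: the class `Cpu`, the functions `snoop_foreign_write`, `write_by` and `run_scenario_6`, the four-word line and the data values are all hypothetical and chosen for illustration; real hardware does this with bus signals, not function calls.

```python
class Cpu:
    """One processor with a single line tracked in L1 and L2 (toy model)."""
    def __init__(self, name):
        self.name = name
        self.l1_state = "I"
        self.l2_state = "I"
        self.l1_line = None   # copy of the line held in L1, if any
        self.l2_line = None   # copy of the line held in L2, if any


def snoop_foreign_write(owner, memory, log):
    """Owner's L2 snoops a memory write issued by another bus master."""
    if owner.l2_state == "M":
        log.append("back-off asserted")
        if owner.l1_state == "M":
            # L1 holds newer data than L2: flush L1 into L2 first.
            owner.l2_line = list(owner.l1_line)
            log.append("L1 written back to L2")
        memory[:] = owner.l2_line
        log.append("L2 written back to memory")
        log.append("back-off released")
    # In every case the owner gives up its copies of the line.
    owner.l1_state = owner.l2_state = "I"
    owner.l1_line = owner.l2_line = None


def write_by(writer, owner, memory, word, value, log):
    """Writer misses in L1 and L2 and writes one word straight to memory."""
    assert writer.l1_state == "I" and writer.l2_state == "I"
    snoop_foreign_write(owner, memory, log)
    memory[word] = value      # the retried write; no line is allocated
    log.append("write completed")


def run_scenario_6():
    memory = [0, 0, 0, 0]     # one four-word line, stale while A owns it
    a, b = Cpu("A"), Cpu("B")
    a.l1_state = a.l2_state = "M"
    a.l2_line = [1, 0, 0, 0]  # older modified copy in L2
    a.l1_line = [1, 2, 0, 0]  # newest copy, held only in L1
    log = []
    write_by(b, a, memory, word=3, value=9, log=log)
    return a, b, memory, log
```

Calling `run_scenario_6()` leaves all four caches in I, and memory holds both A's modified words and B's new word. If B's write were allowed before A's write-back, the later write-back of the whole line would overwrite B's word, which is why the back-off comes first.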


Scenario 7

Scenario 7: A and B have read the same line, then B writes to line

A and B have read the same paragraph:

1. A’s L1 is in state S, and L2 is in S mode as well, memory is up to date

2. B’s L1 and L2 are in S, next B intends to write to that same line


Scenario 7

[Figure: A and B read same line, then B writes to that line. Processor A: L1 cache S to I, L2 cache S to I. Processor B: L1 cache S to S, L2 cache S to E. System bus annotations: memory write miss, L2 snoop detects write by other bus master, INV samples high, BOFF#, CHIT#, HIT#]


Scenario 7

B’s L2 experiences a write hit and initiates a memory write, because the line is in S; note that L1 would not write through if it were in state E
B’s L2 transitions to E, knowing that the other snooping caches transition to I
B’s L2 actually writes to memory, though it generally uses write-back; the line was in S, and write-back is only used in state E
B’s L1 remains in S, so that L2 will "know" of subsequent writes
A’s L2 snoops the address, sees the write by B, and transitions to I
A’s L2 instructs L1 to snoop, which also hits and causes a transition to I
Since neither A’s L1 nor L2 has modified the line, the write in B can proceed
The resulting states: A’s L1 and L2 are in I; B’s L2 is in E and L1 in S
You probably expected B’s write to be held back and B to end up in state E for L1 and M for L2; but the write does take place in the MESI protocol
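The write-once behavior behind Scenario 7 can be sketched as a small transition table. This is a sketch under assumptions: the table entries are inferred from the scenarios in this chapter (write-by on a miss, one write-through from S, write-back from E onward), and the names `LOCAL_WRITE`, `local_write`, `snooped_write` and `scenario_7` are hypothetical.

```python
# L2 reaction to a write by its own processor, as inferred from the scenarios:
# state -> (next state, does the write appear on the bus?)
LOCAL_WRITE = {
    "I": ("I", True),    # write miss: write-by, no line is allocated
    "S": ("E", True),    # first write to a shared line: written through once
    "E": ("M", False),   # later write: absorbed by the cache (write-back)
    "M": ("M", False),   # already modified: stays in the cache
}


def local_write(state):
    """Return (next_state, goes_on_bus) for a write by the local processor."""
    return LOCAL_WRITE[state]


def snooped_write(state):
    """A clean copy is invalidated when another bus master writes the line.

    A line in M would first need the back-off and write-back of Scenario 6,
    which this sketch deliberately does not model.
    """
    assert state in ("I", "S", "E")
    return "I"


def scenario_7():
    a = {"L1": "S", "L2": "S"}
    b = {"L1": "S", "L2": "S"}
    # B's L1 stays in S and passes the write through to L2.
    b["L2"], on_bus = local_write(b["L2"])
    if on_bus:
        # A's L2 snoops the write, then has its L1 snoop as well.
        a["L2"] = snooped_write(a["L2"])
        a["L1"] = snooped_write(a["L1"])
    return a, b, on_bus
```

Running `scenario_7()` yields A in I/I and B with L1 in S and L2 in E, with the write visible on the bus. A second write by B would then take the E entry of the table and move L2 to M without any bus traffic.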


New Pentium Pro L2 Cache

Pentium® Xeon™ is designed for 4-processor MP configuration
L2 cache is in separate cavity on Pentium Pro, but on same chip, wire-bonded
L1 and L2 cache snoop simultaneously, hence L1 and L2 can be in E state simultaneously!!
L1 cache on Pentium Pro (before Klamath) is twice the size, 8 KB each, code and data cache
L2 in Pentium Pro performs snarfing
On Pentium Pro, the L2 unified cache is 4-way set-associative, the L1 data cache is 2-way, and the instruction cache 4-way set-associative
Streaming into cache uses toggle-mode, or critical-quad-first mode. This resolves the access that caused the miss in the shortest possible time
Pentium Pro has no instruction delimiter bit per instruction byte in the I-cache
Pentium Pro squashes


New Pentium Pro L2 Cache

Table Showing Toggle Mode in Pentium Pro, Critical-Quad-First Mode

Address       First Quad  Second Quad  Third Quad  Fourth Quad
0x0..0x7      0x0         0x8          0x10        0x18
0x8..0xf      0x8         0x0          0x18        0x10
0x10..0x17    0x10        0x18         0x0         0x8
0x18..0x1f    0x18        0x10         0x8         0x0
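Each row of the table equals the offset of the critical quad XORed with 0x0, 0x8, 0x10 and 0x18 in turn. That is an observation about the table above, not a statement taken from Intel documentation, and the function name `toggle_order` and its parameters are hypothetical. A minimal sketch, assuming 8-byte quads and a 32-byte line as in the table:

```python
def toggle_order(address, quad_bytes=8, quads_per_line=4):
    """Return the quad offsets within a line in critical-quad-first order.

    The quad containing `address` comes first; the remaining quads follow
    in the toggle pattern obtained by XORing with 0, 1, 2, 3 quad widths.
    """
    line_bytes = quad_bytes * quads_per_line
    offset_in_line = address % line_bytes
    critical = (offset_in_line // quad_bytes) * quad_bytes
    return [critical ^ (i * quad_bytes) for i in range(quads_per_line)]


if __name__ == "__main__":
    for addr in (0x00, 0x0C, 0x10, 0x1F):
        print(hex(addr), [hex(q) for q in toggle_order(addr)])
```

For example, a miss on address 0x0C falls into the quad at 0x8, so the line is streamed in the order 0x8, 0x0, 0x18, 0x10, matching the second row.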


Bibliography

1. Don Anderson and Shanley, T., MindShare [1995]. Pentium™ Processor System Architecture, Addison-Wesley Publishing Company, Reading MA, PC System Architecture Series. ISBN 0-201-40992-5

2. Pentium Pro Developer’s Manual, Volume 1: Specifications, 1996, one of a set of 3 volumes

3. Pentium Pro Developer’s Manual, Volume 2: Programmer's Reference Manual, Intel document, 1996, one of a set of 3 volumes

4. Pentium Pro Developer’s Manual, Volume 3: Operating Systems Writer’s Manual, Intel document, 1996, one of a set of 3 volumes

5. Y. Sheffer: http://webee.technion.ac.il/courses/044800/lectures/MESI.pdf

6. MOESI protocol: http://en.wikipedia.org/wiki/MOESI_protocol

7. MESIF protocol: http://en.wikipedia.org/wiki/MESIF_protocol