Lecture 9 Outline
CSC/ECE 506: Architecture of Parallel Computers
Lecture 12: Other Bus-Based Coherence Protocols
Spring 2010
E. F. Gehringer, based on slides by Yan Solihin
Lecture 9 Outline
• MESI protocol
• Dragon update-based protocol
• Impact of protocol optimizations
Lower-Level Protocol Choice
• What transition should occur when a BusRd is observed in state M?
  • Change to S? Assume I’ll reread it soon
    • Good for mostly read data
    • But what about “migratory” data? …
  • Change to I? Assume another will write it (Synapse)
    • I read and write, then you read and write, then X reads and writes...
• Sequent Symmetry and MIT Alewife use adaptive protocols
MESI (4-state) Invalidation Protocol
• Here’s a problem with the MSI protocol:
  • A {Rd, Wr} sequence causes two bus transactions
    • even when no one is sharing (e.g., serial program!)
    • BusRd (I → S) followed by BusRdX or BusUpgr (S → M)
  • In general, coherence traffic from serial programs is unacceptable
• To avoid this, add a fourth state, Exclusive:
  • Invalid
  • Modified (dirty)
  • Shared (two or more caches may have copies)
  • Exclusive (only this cache has clean copy, same value as in memory)
• How does the protocol decide whether I → E or I → S?
  • Need to check whether someone else has a copy
  • “Shared” signal on bus: wired-or line asserted in response to BusRd
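The saving from the Exclusive state can be checked with a small count. The sketch below is illustrative only: the function name `bus_transactions` and the string encoding of states and accesses are assumptions, and it models a single cache whose block no other cache holds.

```python
def bus_transactions(protocol, accesses):
    """Count bus transactions issued by one cache when no other cache holds the block."""
    state = "I"
    count = 0
    for op in accesses:
        if op == "Rd":
            if state == "I":
                count += 1  # BusRd
                # With no sharers the shared line stays low, so MESI loads in E.
                state = "E" if protocol == "MESI" else "S"
        elif op == "Wr":
            if state in ("I", "S"):
                count += 1  # BusRdX from I; BusRdX or BusUpgr from S
            # E -> M and M -> M need no bus transaction.
            state = "M"
        else:
            raise ValueError(f"unknown access {op!r}")
    return count


print(bus_transactions("MSI", ["Rd", "Wr"]))   # 2
print(bus_transactions("MESI", ["Rd", "Wr"]))  # 1
```

Under these assumptions MSI needs two transactions for a {Rd, Wr} pair and MESI needs one.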
MESI: Processor-Initiated Transactions
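Transitions (processor event / bus transaction issued):
• I → E: PrRd/BusRd(~S)
• I → S: PrRd/BusRd(S)
• I → M: PrWr/BusRdX
• S → S: PrRd/–
• S → M: PrWr/BusRdX
• E → E: PrRd/–
• E → M: PrWr/–
• M → M: PrRd/–, PrWr/–
Fill in the last two transitions here.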
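The processor-initiated transitions can be written as a small lookup. This is a minimal sketch, not a definitive implementation: the function name `mesi_processor` and the `shared` flag (standing for the wired-or Shared line) are assumptions made for illustration, and S → M uses BusRdX as on this slide rather than BusUpgr.

```python
def mesi_processor(state, event, shared=False):
    """Return (next_state, bus_transaction) for a processor event in MESI.

    `shared` models the Shared line sampled during a BusRd;
    bus_transaction is None when the access completes silently.
    """
    if event == "PrRd":
        if state == "I":
            return ("S", "BusRd") if shared else ("E", "BusRd")
        return (state, None)          # read hit in S, E, or M
    if event == "PrWr":
        if state in ("I", "S"):
            return ("M", "BusRdX")    # other copies must be invalidated
        return ("M", None)            # E -> M and M -> M are silent
    raise ValueError(f"unknown event {event!r}")


print(mesi_processor("I", "PrRd", shared=False))
print(mesi_processor("E", "PrWr"))
```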
MESI: Bus-Initiated Transactions
Transitions (bus transaction observed / action):
• M → S: BusRd/Flush
• M → I: BusRdX/Flush
• E → S: BusRd/Flush
• E → I: BusRdX/Flush
• S → S: BusRd/Flush'
• S → I: BusRdX/Flush'
• I → I: BusRd/–, BusRdX/–
Flush' means flush only if cache-to-cache sharing is used; only the cache responsible for supplying the data will do a flush.
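The snooper side can be sketched the same way. The function name `mesi_snoop` and the two flags `cache_to_cache` and `supplier` are assumptions for illustration: together they stand for the Flush' condition, that is, cache-to-cache sharing is in use and this cache is the one chosen to supply the data.

```python
def mesi_snoop(state, bus_event, cache_to_cache=True, supplier=True):
    """Return (next_state, action) when a cache snoops a bus transaction in MESI."""
    if bus_event not in ("BusRd", "BusRdX"):
        raise ValueError(f"unknown bus event {bus_event!r}")
    if state == "I":
        return ("I", None)                      # nothing cached, nothing to do
    next_state = "S" if bus_event == "BusRd" else "I"
    if state in ("M", "E"):
        return (next_state, "Flush")            # mandatory flush on this slide
    # state == "S": Flush', only when this cache is responsible for the data
    action = "Flush" if (cache_to_cache and supplier) else None
    return (next_state, action)


print(mesi_snoop("M", "BusRd"))
print(mesi_snoop("S", "BusRdX", supplier=False))
```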
MESI State Transition Diagram
[Figure: combined MESI state diagram over states M, E, S, and I, merging the processor-initiated and bus-initiated transitions.]
• BusRd(S) means shared line asserted on BusRd transaction
Flush vs. Flush'
• Flush: mandatory
• Flush' happens only when
  • cache-to-cache sharing is used, and
  • only one cache flushes data
MESI Visualization
Start state. All caches empty and main memory has A = 1.
• P1 cache: empty
• P2 cache: empty
• P3 cache: empty
• Main memory: A = 1
Trace:
• P1 Read A
• P1 Write A = 2
• P3 Read A
• P3 Write A = 3
• P1 Read A
• P3 Read A
• P2 Read A
Processor P1 Reads A
Events: P1 PrRd A; P1 BusRd A; Mem returns data
• Processor P1 attempts to read A from its cache.
• Processor P1 issues a BusRd.
• Main memory returns data to processor P1, which updates its cache.
• Read operation completes.
State afterwards: P1 A = 1 (E); P2 empty; P3 empty; main memory A = 1
Processor P1 Writes A = 2
Events: P1 PrWr A
• Processor P1 writes to its cache; the block goes from E to M.
• One less bus request due to Exclusive state, esp. for serial programs
• Write operation completes.
State afterwards: P1 A = 2 (M); P2 empty; P3 empty; main memory A = 1
Processor P3 Reads A
Events: P3 PrRd A; P3 BusRd A; P1 snoops BusRd; P1 Flush
• Processor P3 attempts to read A from its cache.
• Processor P3 issues a BusRd.
• Processor P1 snoops the BusRd from processor P3 and changes its copy from M to S.
• Processor P1 flushes, sending updated data to P3 and main memory.
• Read operation completes.
State afterwards: P1 A = 2 (S); P2 empty; P3 A = 2 (S); main memory A = 2
Processor P3 Writes A = 3
Events: P3 PrWr A; P3 BusUpgr; P1 snoops BusUpgr
• Processor P3 writes to its cache.
• Processor P3 issues a BusUpgr request. Note: BusUpgr instead of BusRdX
• Processor P1 snoops the BusUpgr and invalidates its copy.
• Write operation completes.
State afterwards: P1 A = 2 (I); P2 empty; P3 A = 3 (M); main memory A = 2
Processor P1 Reads A
Events: P1 PrRd A; P1 BusRd A; P3 snoops BusRd; P3 Flush
• Processor P1 attempts to read A from its cache; its copy is invalid.
• Processor P1 issues a BusRd request.
• Processor P3 snoops the BusRd.
• Processor P3 flushes, updating processor P1, main memory and its own cache state.
• Read operation completes.
State afterwards: P1 A = 3 (S); P2 empty; P3 A = 3 (S); main memory A = 3
Processor P3 Reads A
Events: P3 PrRd A; P3 returns data
• Processor P3 reads from its cache.
• Processor P3 returns valid data from its cache.
• Read operation completes.
State afterwards: P1 A = 3 (S); P2 empty; P3 A = 3 (S); main memory A = 3
Processor P2 Reads A
Events: P2 PrRd A; P2 BusRd A; P1 Flush
• Processor P2 attempts to read A from its cache and misses.
• Processor P2 issues a BusRd request.
• Main memory controller observes the BusRd, but a cache supplies the block instead of memory; P2 loads A = 3 in state S.
• Referred to as cache-to-cache transfer in Illinois MESI protocol
• Operation completes.
State afterwards: P1 A = 3 (S); P2 A = 3 (S); P3 A = 3 (S); main memory A = 3
MESI Example (Cache-to-Cache Transfer)
| Proc Action | State P1 | State P2 | State P3 | Bus Action | Data From |
|---|---|---|---|---|---|
| R1 | E | – | – | BusRd | Mem |
| W1 | M | – | – | – | Own cache |
| R3 | S | – | S | BusRd/Flush | P1 cache |
| W3 | I | – | M | BusRdX | Mem |
| R1 | S | – | S | BusRd/Flush | P3 cache |
| R3 | S | – | S | – | Own cache |
| R2 | S | S | S | BusRd/Flush' | P1/P3 cache* |

* Data from memory if no cache-to-cache transfer, BusRd/–
MESI Example (Cache-to-Cache Transfer+BusUpgr)
| Proc Action | State P1 | State P2 | State P3 | Bus Action | Data From |
|---|---|---|---|---|---|
| R1 | E | – | – | BusRd | Mem |
| W1 | M | – | – | – | Own cache |
| R3 | S | – | S | BusRd/Flush | P1 cache |
| W3 | I | – | M | BusUpgr | Own cache |
| R1 | S | – | S | BusRd/Flush | P3 cache |
| R3 | S | – | S | – | Own cache |
| R2 | S | S | S | BusRd/Flush' | P1/P3 cache* |

* Data from memory if no cache-to-cache transfer, BusRd/–
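The state columns of the table with BusUpgr can be reproduced by replaying the trace in a small model. This is a sketch under stated assumptions: `simulate_mesi` and `TRACE` are hypothetical names, a block that is not cached is shown as "I" rather than "–", and only the requester's bus transaction is recorded (flushes and data sources are not modeled).

```python
def simulate_mesi(trace, n_procs=3):
    """Replay (op, proc) pairs; return (action, states, bus transaction) rows."""
    state = {p: "I" for p in range(1, n_procs + 1)}
    rows = []
    for op, p in trace:
        # Other caches that currently hold a valid copy.
        sharers = [q for q in state if q != p and state[q] != "I"]
        bus = "-"
        if op == "R":
            if state[p] == "I":
                bus = "BusRd"
                for q in sharers:
                    state[q] = "S"            # M, E, or S holders end up in S
                state[p] = "S" if sharers else "E"
        elif op == "W":
            if state[p] == "I":
                bus = "BusRdX"
            elif state[p] == "S":
                bus = "BusUpgr"               # already has the data
            for q in sharers:
                state[q] = "I"
            state[p] = "M"
        else:
            raise ValueError(f"unknown op {op!r}")
        rows.append((f"{op}{p}", tuple(state[q] for q in sorted(state)), bus))
    return rows


TRACE = [("R", 1), ("W", 1), ("R", 3), ("W", 3), ("R", 1), ("R", 3), ("R", 2)]
for row in simulate_mesi(TRACE):
    print(row)
```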
Lower-Level Protocol Choices
• Who supplies data on miss when not in M state: memory or cache?
• Original, Illinois MESI: cache
  • assume cache faster than memory (cache-to-cache transfer)
  • Not necessarily true
• Adds complexity
  • How does memory know it should supply data? (must wait for caches)
  • A selection algorithm is needed if multiple caches have valid data.
• Useful in a distributed-memory system
  • May be cheaper to obtain from nearby cache than distant memory
  • Especially when constructed out of SMP nodes (Stanford DASH)
Lecture 9 Outline
• MESI protocol
• Dragon update-based protocol
• Impact of protocol optimizations
Dragon Writeback Update Protocol
• Four states
  • Exclusive-clean (E): Memory and I have it
  • Shared clean (Sc): I, others, and maybe memory, but I’m not owner
  • Shared modified (Sm): I and others but not memory, and I’m the owner
    • Sm and Sc can coexist in different caches, with at most one Sm
  • Modified or dirty (M): I and no one else
  • On replacement: Sc can silently drop, Sm has to flush
• No invalid state
  • If in cache, cannot be invalid
  • If not present in cache, can view as being in not-present or invalid state
• New processor events: PrRdMiss, PrWrMiss
  • Introduced to specify actions when block not present in cache
• New bus transaction: BusUpd
  • Broadcasts single word written on bus; updates other relevant caches
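The replacement rule can be stated as a one-line predicate. This is a minimal sketch: the name `dragon_needs_writeback` is an assumption, and treating M as also requiring a write-back follows from M being dirty, which the slide implies but does not list under replacement.

```python
def dragon_needs_writeback(state):
    """True if evicting a block in this Dragon state must flush it to memory."""
    if state not in ("E", "Sc", "Sm", "M"):
        raise ValueError(f"unknown state {state!r}")
    # Sm and M hold the only up-to-date copy that memory lacks; E and Sc are clean
    # from this cache's point of view and can be dropped silently.
    return state in ("Sm", "M")


for s in ("E", "Sc", "Sm", "M"):
    print(s, dragon_needs_writeback(s))
```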
Dragon: Processor-Initiated Transactions
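Transitions (processor event / bus transaction issued):
• not present → E: PrRdMiss/BusRd(~S)
• not present → Sc: PrRdMiss/BusRd(S)
• not present → M: PrWrMiss/BusRd(~S)
• not present → Sm: PrWrMiss/(BusRd(S); BusUpd)
• E → E: PrRd/–
• E → M: PrWr/–
• Sc → Sc: PrRd/–
• Sc → Sm: PrWr/BusUpd(S)
• Sc → M: PrWr/BusUpd(~S)
• Sm → Sm: PrRd/–, PrWr/BusUpd(S)
• Sm → M: PrWr/BusUpd(~S)
• M → M: PrRd/–, PrWr/–
Fill in the last two transitions here.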
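The Dragon processor-side transitions can be sketched as a function. The name `dragon_processor`, the use of `None` for a block that is not present, and the `shared` flag (the Shared line sampled during the bus transaction) are assumptions made for illustration.

```python
def dragon_processor(state, event, shared=False):
    """Return (next_state, bus_transactions) for a processor event in Dragon.

    `state` is None when the block is not present; bus_transactions is None
    when the access completes without using the bus.
    """
    if event == "PrRdMiss":
        return ("Sc" if shared else "E", "BusRd")
    if event == "PrWrMiss":
        # With sharers, the miss is followed by an update broadcast.
        return ("Sm", "BusRd; BusUpd") if shared else ("M", "BusRd")
    if event == "PrRd":
        return (state, None)              # read hit in any state
    if event == "PrWr":
        if state in ("E", "M"):
            return ("M", None)            # no other copies, write silently
        # Sc or Sm: broadcast the written word, then check for remaining sharers.
        return ("Sm" if shared else "M", "BusUpd")
    raise ValueError(f"unknown event {event!r}")


print(dragon_processor(None, "PrWrMiss", shared=True))
print(dragon_processor("Sc", "PrWr", shared=False))
```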
Dragon: Bus-Initiated Transactions
Transitions (bus transaction observed / action):
• E → Sc: BusRd/–
• Sc → Sc: BusRd/–, BusUpd/Update
• Sm → Sm: BusRd/Flush
• Sm → Sc: BusUpd/Update
• M → Sm: BusRd/Flush
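A matching sketch for the Dragon snooper follows. The name `dragon_snoop` is an assumption; a BusUpd seen in E or M is rejected because another cache can only issue an update for a block it holds, which contradicts those states.

```python
def dragon_snoop(state, bus_event):
    """Return (next_state, action) when a cache snoops a bus transaction in Dragon."""
    if bus_event == "BusRd":
        table = {
            "E": ("Sc", None),       # another reader appears; block stays clean
            "Sc": ("Sc", None),
            "Sm": ("Sm", "Flush"),   # owner supplies the data
            "M": ("Sm", "Flush"),    # becomes owner of a now-shared block
        }
        return table[state]
    if bus_event == "BusUpd":
        if state in ("Sc", "Sm"):
            return ("Sc", "Update")  # the writer takes over ownership
        raise ValueError(f"BusUpd cannot be observed in state {state!r}")
    raise ValueError(f"unknown bus event {bus_event!r}")


print(dragon_snoop("M", "BusRd"))
print(dragon_snoop("Sm", "BusUpd"))
```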
Dragon State Transition Diagram
[Figure: combined Dragon state diagram over states E, Sc, Sm, and M, merging the processor-initiated and bus-initiated transitions.]
Dragon Visualization
Start state. All caches empty and main memory has A = 1.
• P1 cache: empty
• P2 cache: empty
• P3 cache: empty
• Main memory: A = 1
Trace:
• P1 Read A
• P1 Write A = 2
• P3 Read A
• P3 Write A = 3
• P1 Read A
• P3 Read A
• P2 Read A
Processor P1 Reads A
Events: P1 PrRd A; P1 BusRd A; Mem returns data
• Processor P1 attempts to read A from its cache.
• Processor P1 issues a BusRd.
• Main memory returns data to processor P1, which updates its cache.
• Read operation completes.
State afterwards: P1 A = 1 (E); P2 empty; P3 empty; main memory A = 1
Processor P1 Writes A = 2
Events: P1 PrWr A
• Processor P1 writes to its cache; the block goes from E to M.
• Write operation completes.
State afterwards: P1 A = 2 (M); P2 empty; P3 empty; main memory A = 1
Processor P3 Reads A
Events: P3 PrRd A; P3 BusRd A; P1 snoops BusRd; P1 Flush
• Processor P3 attempts to read A from its cache.
• Processor P3 issues a BusRd.
• Processor P1 snoops the BusRd from processor P3 and changes its copy from M to Sm.
• Processor P1 flushes, sending updated data to P3; main memory is not updated, and P1 remains the owner.
• Read operation completes.
State afterwards: P1 A = 2 (Sm); P2 empty; P3 A = 2 (Sc); main memory A = 1
Processor P3 Writes A = 3
Events: P3 PrWr A; P3 BusUpd; P1 snoops BusUpd
• Processor P3 writes to its cache.
• Processor P3 issues a BusUpd request. Note: BusUpd instead of BusUpgr (no invalidation is performed)
• Processor P1 snoops the BusUpd and updates its cache; P1 moves to Sc and P3 becomes the owner in Sm.
• Write operation completes.
State afterwards: P1 A = 3 (Sc); P2 empty; P3 A = 3 (Sm); main memory A = 1
Processor P1 Reads A
Events: P1 PrRd A
• Processor P1 reads from its cache.
• This is a miss in the MESI and MSI protocols
• Read operation completes.
State afterwards: P1 A = 3 (Sc); P2 empty; P3 A = 3 (Sm); main memory A = 1
Processor P3 Reads A
Events: P3 PrRd A
• Processor P3 reads from its cache.
• Read operation completes.
State afterwards: P1 A = 3 (Sc); P2 empty; P3 A = 3 (Sm); main memory A = 1
Processor P2 Reads A
Processor P2 reads from its cache.
P1
CacheA = 3 Sc
Snooper
P2
Cache
Snooper
P3
CacheA = 3 Sm
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P2 PrRd AP2 BusRd AP1 Flush
![Page 68: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/68.jpg)
CSC/ECE 506: Architecture of Parallel Computers
Processor P2 Reads A
Processor P2 issues a BusRd request.
P1
CacheA = 3 Sc
Snooper
P2
Cache
Snooper
P3
CacheA = 3 Sm
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P2 PrRd AP2 BusRd AP1 Flush
![Page 69: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/69.jpg)
CSC/ECE 506: Architecture of Parallel Computers
Processor P2 Reads A
Main memory controller observes the BusRd.
P1
CacheA = 3 Sc
Snooper
P2
Cache
Snooper
P3
CacheA = 3 Sm
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P2 PrRd AP2 BusRd AP1 Flush
A = 3 Sc
Note: Only the cache inState Sm is responsiblefor cache-to-cache transfer
![Page 70: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/70.jpg)
Processor P2 Reads A
Operation completes.
P1 cache: A = 3 (Sc)
P2 cache: A = 3 (Sc)
P3 cache: A = 3 (Sm)
Main memory: A = 1
![Page 71: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/71.jpg)
P1 Replaces A
A is evicted from P1.
P1 cache: A = 3 (Sc)
P2 cache: A = 3 (Sc)
P3 cache: A = 3 (Sm)
Main memory: A = 1
A = 3 Sc
![Page 72: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/72.jpg)
P3 Replaces A
A is evicted from P3.
P1 cache: empty
P2 cache: A = 3 (Sc)
P3 cache: A = 3 (Sm)
Main memory: A = 3
A = 3 Sc
P3 replaces X. Owner responsible for writing back to memory,
vs. MSI or MESI, where write-back happens only when the line is in M state.
![Page 73: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/73.jpg)
Dragon Example

| Proc Action | State P1 | State P2 | State P3 | Bus Action | Data from |
|---|---|---|---|---|---|
| R1 | E | – | – | BusRd | Mem |
| W1 | M | – | – | – | Own cache |
| R3 | Sm | – | Sc | BusRd/Flush | P1 cache |
| W3 | Sc | – | Sm | BusUpd/Upd | Own cache |
| R1 | Sc | – | Sm | – | Own cache |
| R3 | Sc | – | Sm | – | Own cache |
| R2 | Sc | Sc | Sm | BusRd/Flush | P3 cache |
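
The Dragon trace can be replayed with a small state machine. What follows is a minimal Python sketch, not code from the lecture: it tracks a single block across three caches and ignores data values and replacement. The names `dragon_step` and `run_dragon_trace` are made up for illustration. It assumes the usual Dragon rules: a read miss loads Sc if the shared line is asserted and E otherwise, and a write to a possibly shared block issues BusUpd and makes the writer the Sm owner.

```python
# Minimal sketch of the Dragon protocol for ONE block shared by three caches.
# States: "E", "Sc", "Sm", "M"; None means the block is not cached.
# Helper names are hypothetical; data values and evictions are not modeled.

def dragon_step(states, proc, op):
    """Apply a read ("R") or write ("W") by `proc`; return (bus_action, data_source)."""
    others = [p for p in states if p != proc and states[p] is not None]
    shared = bool(others)              # wired-OR "shared" line on the bus
    mine = states[proc]
    bus, source = "-", "own cache"

    if mine is None:                   # miss: issue BusRd
        owner = next((p for p in others if states[p] in ("M", "Sm")), None)
        for p in others:               # snoopers react to the BusRd
            if states[p] == "E":
                states[p] = "Sc"
            elif states[p] == "M":
                states[p] = "Sm"       # remains owner and flushes the block
        bus = "BusRd/Flush" if owner else "BusRd"
        source = f"{owner} cache" if owner else "memory"
        mine = "Sc" if shared else "E"

    if op == "W":
        if mine in ("Sc", "Sm"):       # possibly shared: broadcast the new word
            bus = "BusUpd" if bus == "-" else bus + "+BusUpd"
            for p in others:
                if states[p] == "Sm":
                    states[p] = "Sc"   # previous owner gives up ownership
            mine = "Sm" if shared else "M"
        else:                          # E or M: silent local write, no bus traffic
            mine = "M"

    states[proc] = mine
    return bus, source


def run_dragon_trace(trace):
    """Replay accesses such as "R1" or "W3"; return one row per access."""
    states = {"P1": None, "P2": None, "P3": None}
    rows = []
    for access in trace:
        op, proc = access[0], "P" + access[1]
        bus, source = dragon_step(states, proc, op)
        rows.append((access, *[states[p] or "-" for p in states], bus, source))
    return rows


if __name__ == "__main__":
    for row in run_dragon_trace(["R1", "W1", "R3", "W3", "R1", "R3", "R2"]):
        print(row)
```

Replaying R1, W1, R3, W3, R1, R3, R2 gives the same state columns as the Dragon example; the sketch reports the W3 bus action simply as BusUpd.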
![Page 74: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/74.jpg)
Lower-Level Protocol Choices
• Can shared-modified state be eliminated?
  • Yes, if memory is updated as well on BusUpd transactions (DEC Firefly)
  • Dragon protocol doesn’t (assumes DRAM memory slow to update)
• Should replacement of an Sc block be broadcast?
  • Would allow last copy to go to Exclusive state and not generate updates
  • Replacement bus transaction is not in critical path, but later update may be
• Shouldn’t update local copy on write hit before controller gets bus
  • Can mess up serialization
• Coherence, consistency considerations much like write-through case
• In general, many subtle race conditions in protocols
• But first, let’s illustrate quantitative assessment at logical level
![Page 75: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/75.jpg)
Lecture 9 Outline
• MESI protocol
• Dragon update-based protocol
• Impact of protocol optimizations
![Page 76: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/76.jpg)
Assessing Protocol Tradeoffs
• Methodology:
  • Use simulator; default 1 MB, 4-way cache, 64-byte block, 16 processors. Some runs use a 64 KB cache.
  • Focus on frequencies, not end performance for now
    • transcends architectural details, but not what we’re really after
  • Use idealized memory performance model to avoid changes of reference interleaving across processors with machine parameters
    • Cheap simulation: no need to model contention
![Page 77: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/77.jpg)
Impact of Protocol Optimizations
MESI vs. MSI (w/ BusUpgr) vs. MSI (w/ BusRdX)
• MSI ≈ MESI in bus traffic for these workloads
• Upgrades instead of read-exclusive help
• Same story when working sets don’t fit for Ocean, Radix, Raytrace
[Figure: two bar charts of bus traffic (MB/s), each bar split into data-bus and address-bus traffic. Bars are labeled workload/protocol, with protocols III, 3St, and 3St-RdEx corresponding to the three protocols in the title. One chart covers Barnes, LU, Ocean, Radiosity, Radix, and Raytrace; the other covers Appl-Code, Appl-Data, OS-Code, and OS-Data.]
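
To make the comparison concrete, the sketch below counts bus traffic for a single {Rd, Wr} pair to a block that no other cache holds, under each of the three protocols. It is an illustration only: the 6-byte address/command size is an assumed value, and the names `READ_THEN_WRITE` and `traffic` are hypothetical. The 64-byte block matches the default on the methodology slide.

```python
BLOCK_BYTES = 64   # default block size from the methodology slide
ADDR_BYTES = 6     # ASSUMED address + command bytes per bus transaction

# Bus transactions issued for a {Rd, Wr} pair to an uncached, unshared block.
# The boolean says whether the transaction carries a data block.
READ_THEN_WRITE = {
    "MESI": [("BusRd", True)],                              # I -> E, then E -> M silently
    "MSI (BusUpgr)": [("BusRd", True), ("BusUpgr", False)], # I -> S, then S -> M
    "MSI (BusRdX)": [("BusRd", True), ("BusRdX", True)],    # S -> M refetches the block
}


def traffic(transactions):
    """Return (address_bus_bytes, data_bus_bytes) for a list of transactions."""
    addr = ADDR_BYTES * len(transactions)
    data = BLOCK_BYTES * sum(1 for _, has_data in transactions if has_data)
    return addr, data


if __name__ == "__main__":
    for name, txns in READ_THEN_WRITE.items():
        print(name, traffic(txns))
```

Under these assumptions BusUpgr adds only an address-bus transaction over MESI, while BusRdX doubles the data-bus bytes, which is the pattern the three-way comparison is probing.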
![Page 78: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/78.jpg)
Impact of Cache-Line Size
• Multiprocessors add a new kind of miss to cold, capacity, conflict
  • Coherence misses: due to invalidations
    • True sharing: write to the same word
    • False sharing: write to different words in the same block
• How to reduce each kind of miss with an invalidation protocol
  • Capacity: enlarge cache; increase block size (if spatial locality)
  • Conflict: increase associativity
  • Cold and coherence: only block size
• Increasing block size has advantages and disadvantages
  • Can reduce misses if spatial locality is good
  • Can hurt too
    • increase misses due to false sharing if spatial locality not good
    • increase misses due to conflicts in fixed-size cache
    • increase traffic due to fetching unnecessary data and due to false sharing
    • can increase miss penalty and perhaps hit cost
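
False sharing can be shown with a toy model. The sketch below assumes a simplified write-invalidate protocol with infinite caches, so only cold and coherence misses occur; the function name `count_misses` and the addresses are invented for illustration. Two processors write different words, so every coherence miss counted here is false sharing.

```python
def count_misses(accesses, block_bytes):
    """Count (cold, coherence) misses for a list of (proc, byte_address) writes.

    Simplified write-invalidate model with infinite caches: a write
    invalidates every other processor's copy of the same block.
    """
    valid = set()   # (proc, block) pairs currently valid
    seen = set()    # (proc, block) pairs that were loaded at least once
    cold = coherence = 0
    for proc, addr in accesses:
        block = addr // block_bytes
        key = (proc, block)
        if key not in valid:
            if key in seen:
                coherence += 1      # lost the block to another writer
            else:
                cold += 1           # first touch by this processor
                seen.add(key)
            valid.add(key)
        # invalidate all other copies of this block
        valid = {k for k in valid if k[1] != block or k[0] == proc}
    return cold, coherence


if __name__ == "__main__":
    # P0 writes the word at byte 0, P1 writes the word at byte 32, alternating.
    pingpong = [(0, 0), (1, 32)] * 4
    for size in (16, 32, 64):
        print(size, count_misses(pingpong, size))
```

With 16- or 32-byte blocks the two words sit in different blocks and only the two cold misses remain; with 64-byte blocks they share a block and every later write misses.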
![Page 79: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/79.jpg)
Impact of Block Size on Miss Rate
• For default problem size: vary block/line size from 8 to 256 bytes
• Decreases with larger lines: cold, capacity (due to spatial locality), true sharing (due to spatial locality)
• Increases with larger lines: false sharing
• Working set doesn’t fit: impact of capacity misses large in Ocean and Radix
[Figure: two stacked-bar charts of miss rate (%) versus block size (8, 16, 32, 64, 128, 256 bytes), each bar split into cold, capacity, true-sharing, false-sharing, and upgrade misses. One panel covers Barnes, LU, and Radiosity (axis 0 to 0.6%); the other covers Ocean, Radix, and Raytrace (axis 0 to 12%).]
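
The link between spatial locality and cold misses can be illustrated with a short sketch. It assumes that a cold miss occurs once per distinct block touched; the function name `cold_misses` and both access patterns are hypothetical examples, not the lecture's workloads.

```python
def cold_misses(addresses, block_bytes):
    """Number of distinct blocks touched, i.e. cold misses with an infinite cache."""
    return len({addr // block_bytes for addr in addresses})


if __name__ == "__main__":
    sequential = list(range(0, 4096, 8))        # 512 consecutive 8-byte words
    strided = list(range(0, 64 * 256, 256))     # 64 words, 256 bytes apart
    for size in (8, 16, 32, 64, 128, 256):
        print(size, cold_misses(sequential, size), cold_misses(strided, size))
```

The sequential pattern halves its cold misses each time the block size doubles, while the strided pattern, which has no spatial locality at these block sizes, gains nothing.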
![Page 80: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/80.jpg)
Impact of Block Size on Traffic
• Results different than for miss rate: traffic almost always increases
• When working set fits, overall traffic still small, except for Radix
• Fixed overhead is a significant component
  • So total traffic often minimized with 16–32 byte block, not smaller
• Working set doesn’t fit: even 128-byte good for Ocean due to capacity
• Address bus traffic behaves in the opposite way from the data bus traffic
Traffic (bytes/inst) affects performance indirectly through contention.
[Figure: bar charts of bus traffic versus block size (8 to 256 bytes), each bar split into data-bus and address-bus traffic. Radix, Barnes, Radiosity, and Raytrace are plotted in bytes/instruction; LU and Ocean are plotted in bytes/FLOP.]
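
The role of fixed overhead can be worked out with a simple model. The sketch assumes that every miss moves one block of B data bytes plus h bytes of address/command overhead, so total traffic is proportional to misses × (B + h). Doubling the block size then reduces traffic only if the miss count falls below (B + h) / (2B + h) of its previous value. The function names and the 6-byte overhead are assumptions made for illustration.

```python
def total_traffic(misses, block_bytes, overhead_bytes):
    """Bytes on the bus if each miss carries one block plus fixed overhead."""
    return misses * (block_bytes + overhead_bytes)


def breakeven_miss_ratio(block_bytes, overhead_bytes):
    """Largest m(2B)/m(B) at which doubling block size B does not add traffic."""
    return (block_bytes + overhead_bytes) / (2 * block_bytes + overhead_bytes)


if __name__ == "__main__":
    OVERHEAD = 6  # ASSUMED address + command bytes per transaction
    for size in (8, 16, 32, 64, 128):
        print(size, round(breakeven_miss_ratio(size, OVERHEAD), 3))
```

With no overhead the break-even ratio is exactly 0.5, so misses would have to halve. With overhead, the threshold is easier to meet at small block sizes, which is consistent with total traffic bottoming out above the smallest block size.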
![Page 81: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/81.jpg)
Firefly Visualization
Start state. All caches empty and main memory has A = 1.
P1 cache: empty
P2 cache: empty
P3 cache: empty
Main memory: A = 1
Trace: P1 Read A; P1 Write A = 2; P3 Read A; P3 Write A = 3; P1 Read A; P3 Read A; P2 Read A
![Page 82: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/82.jpg)
Processor P1 Reads A
Processor P1 attempts to read A from its cache.
P1 cache: empty
P2 cache: empty
P3 cache: empty
Main memory: A = 1
Bus: P1 PrRdMiss(~S); P1 BusRd A; Mem returns data
![Page 83: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/83.jpg)
Processor P1 Reads A
Processor P1 issues a BusRd.
P1 cache: empty
P2 cache: empty
P3 cache: empty
Main memory: A = 1
Bus: P1 PrRdMiss(~S); P1 BusRd A; Mem returns data
![Page 84: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/84.jpg)
Processor P1 Reads A
Main memory returns data to processor P1, which updates its cache.
P1 cache: A = 1 (V)
P2 cache: empty
P3 cache: empty
Main memory: A = 1
Bus: P1 PrRdMiss(~S); P1 BusRd A; Mem returns data
![Page 85: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/85.jpg)
Processor P1 Reads A
Read operation completes.
P1 cache: A = 1 (V)
P2 cache: empty
P3 cache: empty
Main memory: A = 1
![Page 86: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/86.jpg)
Processor P1 Writes A = 2
Processor P1 writes to its cache.
P1 cache: A = 2 (D)
P2 cache: empty
P3 cache: empty
Main memory: A = 1
Bus: P1 PrWr A
![Page 87: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/87.jpg)
Processor P1 Writes A = 2
Write operation completes.
P1 cache: A = 2 (D)
P2 cache: empty
P3 cache: empty
Main memory: A = 1
![Page 88: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/88.jpg)
Processor P3 Reads A
Processor P3 attempts to read A from its cache.
P1 cache: A = 2 (D)
P2 cache: empty
P3 cache: empty
Main memory: A = 1
Bus: P3 PrRdMiss(S); P3 BusRd A; P1 snoops BusRd; P1 Flush
![Page 89: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/89.jpg)
Processor P3 Reads A
Processor P3 issues a BusRd.
P1 cache: A = 2 (D)
P2 cache: empty
P3 cache: empty
Main memory: A = 1
Bus: P3 PrRdMiss(S); P3 BusRd A; P1 snoops BusRd; P1 Flush
![Page 90: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/90.jpg)
Processor P3 Reads A
Processor P1 snoops the BusRd from processor P3.
P1 cache: A = 2 (S)
P2 cache: empty
P3 cache: empty
Main memory: A = 1
Bus: P3 PrRdMiss(S); P3 BusRd A; P1 snoops BusRd; P1 Flush
![Page 91: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/91.jpg)
Processor P3 Reads A
Processor P1 flushes, sending updated data to P3 and main memory.
P1 cache: A = 2 (S)
P2 cache: empty
P3 cache: A = 2 (S)
Main memory: A = 2
Bus: P3 PrRdMiss(S); P3 BusRd A; P1 snoops BusRd; P1 Flush
![Page 92: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/92.jpg)
Processor P3 Reads A
Read operation completes.
P1 cache: A = 2 (S)
P2 cache: empty
P3 cache: A = 2 (S)
Main memory: A = 2
![Page 93: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/93.jpg)
Processor P3 Writes A = 3
Processor P3 writes to its cache.
P1 cache: A = 2 (S)
P2 cache: empty
P3 cache: A = 2 (S)
Main memory: A = 2
Bus: P3 PrWr A; P3 BusUpd; P1 snoops BusUpd
![Page 94: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/94.jpg)
Processor P3 Writes A = 3
Processor P3 issues a BusUpd request.
P1 cache: A = 2 (S)
P2 cache: empty
P3 cache: A = 2 (S)
Main memory: A = 2
Bus: P3 PrWr A; P3 BusUpd; P1 snoops BusUpd
Note: BusUpd
![Page 95: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/95.jpg)
Processor P3 Writes A = 3
Processor P1 snoops the BusUpd and updates its cache; main memory is updated as well.
P1 cache: A = 3 (S)
P2 cache: empty
P3 cache: A = 3 (S)
Main memory: A = 3
Bus: P3 PrWr A; P3 BusUpd; P1 snoops BusUpd
![Page 96: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/96.jpg)
Processor P3 Writes A = 3
Write operation completes.
P1 cache: A = 3 (S)
P2 cache: empty
P3 cache: A = 3 (S)
Main memory: A = 3
![Page 97: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/97.jpg)
Processor P1 Reads A
Processor P1 reads from its cache.
P1 cache: A = 3 (S)
P2 cache: empty
P3 cache: A = 3 (S)
Main memory: A = 3
Bus: P1 PrRdHit; P1 returns data
![Page 98: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/98.jpg)
Processor P1 Reads A
Processor P1 reads from its cache.
P1 cache: A = 3 (S)
P2 cache: empty
P3 cache: A = 3 (S)
Main memory: A = 3
Bus: P1 PrRdHit; P1 returns data
![Page 99: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/99.jpg)
Processor P3 Reads A
Processor P3 reads from its cache.
P1 cache: A = 3 (S)
P2 cache: empty
P3 cache: A = 3 (S)
Main memory: A = 3
Bus: P3 PrRdHit; P3 returns data
![Page 100: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/100.jpg)
Processor P3 Reads A
Processor P3 returns valid data from its cache.
P1 cache: A = 3 (S)
P2 cache: empty
P3 cache: A = 3 (S)
Main memory: A = 3
Bus: P3 PrRdHit; P3 returns data
![Page 101: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/101.jpg)
Processor P2 Reads A
Processor P2 attempts to read A from its cache.
P1 cache: A = 3 (S)
P2 cache: empty
P3 cache: A = 3 (S)
Main memory: A = 3
Bus: P2 PrRdMiss(S); P2 BusRd A; P1 Flush
![Page 102: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/102.jpg)
Processor P2 Reads A
Processor P2 issues a BusRd request.
P1 cache: A = 3 (S)
P2 cache: empty
P3 cache: A = 3 (S)
Main memory: A = 3
Bus: P2 PrRdMiss(S); P2 BusRd A; P1 Flush
![Page 103: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/103.jpg)
Processor P2 Reads A
Main memory controller observes the BusRd.
P1 cache: A = 3 (S)
P2 cache: empty
P3 cache: A = 3 (S)
Main memory: A = 3
Bus: P2 PrRdMiss(S); P2 BusRd A; P1 Flush
A = 3 S
![Page 104: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/104.jpg)
Processor P2 Reads A
Operation completes.
P1 cache: A = 3 (S)
P2 cache: A = 3 (S)
P3 cache: A = 3 (S)
Main memory: A = 3
![Page 105: Lecture 9 Outline](https://reader031.fdocuments.us/reader031/viewer/2022020309/56813cfe550346895da69efb/html5/thumbnails/105.jpg)
Firefly Example

| Proc Action | State P1 | State P2 | State P3 | Bus Action | Data from |
|---|---|---|---|---|---|
| R1 | V | – | – | BusRd | Mem |
| W1 | D | – | – | – | Own cache |
| R3 | S | – | S | BusRd/Flush | P1 cache |
| W3 | S | – | S | BusUpd | P3 cache |
| R1 | S | – | S | – | Own cache |
| R3 | S | – | S | – | Own cache |
| R2 | S | S | S | BusRd/Flush | P1 cache |
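
The Firefly trace can also be replayed in code. The following is a minimal Python sketch under stated assumptions, not the lecture's implementation: it models one block with states V, S, and D, tracks the cached values and the memory value, and ignores replacement. The class name `FireflyBlock` and its method are hypothetical. When several caches hold the block, the sketch simply names the first holder as the data source, which matches the R2 row of the example.

```python
# Minimal sketch of the Firefly protocol for ONE block (states V, S, D).
# None means "not cached". Class and method names are hypothetical.

class FireflyBlock:
    def __init__(self, procs, memory_value):
        self.state = {p: None for p in procs}
        self.value = {p: None for p in procs}
        self.memory = memory_value

    def access(self, proc, op, new_value=None):
        """Apply a read ("R") or write ("W"); return (bus_action, data_source)."""
        holders = [p for p in self.state
                   if p != proc and self.state[p] is not None]
        shared = bool(holders)              # wired-OR "shared" line
        bus, source = "-", "own cache"

        if self.state[proc] is None:        # miss: issue BusRd
            if shared:
                supplier = holders[0]       # simplification: name the first holder
                for p in holders:
                    if self.state[p] == "D":
                        self.memory = self.value[p]  # flush also updates memory
                    self.state[p] = "S"
                self.value[proc] = self.value[supplier]
                self.state[proc] = "S"
                bus, source = "BusRd/Flush", f"{supplier} cache"
            else:
                self.value[proc] = self.memory
                self.state[proc] = "V"
                bus, source = "BusRd", "memory"

        if op == "W":
            if self.state[proc] == "S":     # shared: write through with BusUpd
                self.memory = new_value     # memory is updated, so no Sm state
                for p in holders:
                    self.value[p] = new_value
                self.state[proc] = "S" if shared else "V"
                bus = "BusUpd" if bus == "-" else bus + "+BusUpd"
            else:                           # V or D: silent local write
                self.state[proc] = "D"
            self.value[proc] = new_value
        return bus, source


if __name__ == "__main__":
    block = FireflyBlock(["P1", "P2", "P3"], memory_value=1)
    trace = [("P1", "R", None), ("P1", "W", 2), ("P3", "R", None),
             ("P3", "W", 3), ("P1", "R", None), ("P3", "R", None),
             ("P2", "R", None)]
    for proc, op, val in trace:
        print(proc, op, block.access(proc, op, val), block.state, block.memory)
```

Unlike Dragon, memory holds A = 3 after P3's write, which is why Firefly needs no shared-modified state.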