Cache Coherence “Can we do a better job of supporting cache coherence ?”
Parallel Processing Problems Cache Coherence False Sharing Synchronization.
Cache Coherence
[Diagram: P1 and P2, each with its own cache ($$$), share one DRAM]
P1, P2 are write-back caches

Current value of a in (* = not cached):
  P1$   P2$   DRAM
  *     *     7

1. P2: Rd a
2. P2: Wr a, 5
3. P1: Rd a
4. P2: Wr a, 3
5. P1: Rd a
Write Invalidate
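[Diagram: P1 and P2, each with its own cache ($$$), share one DRAM; step labels on the diagram: 1, 3, 2, 4]
P1, P2 are write-back caches

Current value of a in (* = not cached):
  Step              P1$   P2$   DRAM
  (initial)         *     *     7
  1. P2: Rd a       *     7     7
  2. P2: Wr a, 5    *     5     7
  3. P1: Rd a       5     5     7
  4. P2: Wr a, 3
  5. P1: Rd a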
Write Update
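[Diagram: P1 and P2, each with its own cache ($$$), share one DRAM; step labels on the diagram: 1, 3, 4, 2, 4]
P1, P2 are write-back caches

Current value of a in (* = not cached):
  Step              P1$   P2$   DRAM
  (initial)         *     *     7
  1. P2: Rd a       *     7     7
  2. P2: Wr a, 5    *     5     7
  3. P1: Rd a       5     5     7
  4. P2: Wr a, 3
  5. P1: Rd a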
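The two traces above can be replayed with a small model. This is only a sketch under stated assumptions: one variable, two caches, a read miss served by the other cache when it holds a copy (otherwise by DRAM), and no write-back to DRAM during the trace, which matches the filled-in rows. The names `system_state`, `cache_read`, `cache_write`, and `run_trace` are made up for illustration. Processor index 0 stands for P1 and index 1 for P2.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical two-cache model for a single variable a. */
enum policy { WRITE_INVALIDATE, WRITE_UPDATE };

struct system_state {
    bool valid[2];   /* valid[p]: does cache p hold a copy of a? */
    int  cached[2];  /* value held by cache p (meaningful only if valid) */
    int  dram;       /* value in DRAM; never written back in this model */
};

/* Read by processor p. A miss is served by the other cache if it holds
   a copy (assumed cache-to-cache transfer), otherwise by DRAM. */
int cache_read(struct system_state *s, int p)
{
    assert(p == 0 || p == 1);
    int other = 1 - p;
    if (!s->valid[p]) {
        s->cached[p] = s->valid[other] ? s->cached[other] : s->dram;
        s->valid[p] = true;
    }
    return s->cached[p];
}

/* Write by processor p. The writer's copy changes; the other copy is
   either invalidated or updated in place, depending on the policy. */
void cache_write(struct system_state *s, int p, int value, enum policy pol)
{
    assert(p == 0 || p == 1);
    int other = 1 - p;
    s->cached[p] = value;
    s->valid[p] = true;
    if (s->valid[other]) {
        if (pol == WRITE_INVALIDATE)
            s->valid[other] = false;
        else
            s->cached[other] = value;
    }
}

/* Replays the five-step trace. Returns the value P1 reads in step 5 and
   reports whether that read missed in P1's cache. */
int run_trace(enum policy pol, bool *final_read_missed)
{
    struct system_state s = { {false, false}, {0, 0}, 7 };
    cache_read(&s, 1);              /* 1. P2: Rd a    */
    cache_write(&s, 1, 5, pol);     /* 2. P2: Wr a, 5 */
    cache_read(&s, 0);              /* 3. P1: Rd a    */
    cache_write(&s, 1, 3, pol);     /* 4. P2: Wr a, 3 */
    *final_read_missed = !s.valid[0];
    return cache_read(&s, 0);       /* 5. P1: Rd a    */
}
```

In this model both policies let P1 read the same final value. The difference is that under write invalidate the last read has to miss and fetch the line again, while under write update it hits.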
Performance Considerations
Invalidate
• Writing makes data exclusive
• Receiving changed data slower

Update
• Once shared, always shared
• Once shared, writes always on bus
• Get changed data very quickly
Cache Coherence: False Sharing
[Diagram: P1 and P2, each with its own cache ($$$), share one DRAM]
P1, P2 cache line size: 4 words

Current contents in (* = not cached):
  P1$   P2$
  *     *

1. P2: Rd A[0]
2. P1: Rd A[1]
3. P2: Wr A[0], 5
4. P1: Wr A[1], 3
Look closely at example
• P1 and P2 do not access the same element
• A[0] and A[1] are in the same cache block, so whenever one of them is in a cache, the other is in that cache too.
False Sharing
• Different/same processors access different/same items in different/same cache block
• Leads to ___________ misses
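The four-access example can be checked mechanically by tracking ownership per cache line instead of per element. The sketch below assumes a write-invalidate protocol and that A starts on a line boundary. `struct access`, `count_invalidations`, and `slide_trace` are illustrative names, not part of the slides. Processor 0 stands for P1 and processor 1 for P2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NUM_PROCS = 2, MAX_LINES = 64 };

struct access { int proc; size_t index; bool is_write; };

/* Counts how often a write by one processor invalidates a line held by
   another processor, assuming write-invalidate and that A[0] sits at
   the start of a cache line. */
int count_invalidations(const struct access *trace, size_t n,
                        size_t words_per_line)
{
    assert(words_per_line > 0);
    bool holds[NUM_PROCS][MAX_LINES] = {{false}};
    int invalidations = 0;
    for (size_t k = 0; k < n; k++) {
        int p = trace[k].proc;
        size_t line = trace[k].index / words_per_line;
        assert(p >= 0 && p < NUM_PROCS && line < MAX_LINES);
        holds[p][line] = true;              /* the access brings the line in */
        if (trace[k].is_write) {
            for (int q = 0; q < NUM_PROCS; q++) {
                if (q != p && holds[q][line]) {
                    holds[q][line] = false; /* other copy is invalidated */
                    invalidations++;
                }
            }
        }
    }
    return invalidations;
}

/* The four accesses from the example above. */
const struct access slide_trace[4] = {
    {1, 0, false},  /* 1. P2: Rd A[0]    */
    {0, 1, false},  /* 2. P1: Rd A[1]    */
    {1, 0, true},   /* 3. P2: Wr A[0], 5 */
    {0, 1, true},   /* 4. P1: Wr A[1], 3 */
};
```

With 4-word lines, both writes invalidate the other processor's copy even though no element is shared. With 1-word lines, the same trace causes no invalidations.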
Cache Performance
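// Pn = my processor number (rank)
// NumProcs = total active processors
// N = total number of elements
// NElem = N / NumProcs

// Interleaved: processor Pn writes every NumProcs-th element
for (i = 0; i < NElem; i++)
    A[NumProcs*i + Pn] = f(i);

Vs

// Blocked: processor Pn writes one contiguous chunk
for (i = Pn*NElem; i < (Pn+1)*NElem; i++)
    A[i] = f(i);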
Which is worse?
• Both access the same number of elements
• No processors access the same elements as each other
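One way to compare the two loops is to count how many cache lines end up being written by more than one processor. The sketch below does that for the strided loop (stride NumProcs) and the contiguous-chunk loop. It assumes A starts on a line boundary and that N divides evenly by NumProcs. The function name `shared_lines` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of cache lines whose elements are written by more than one
   processor. interleaved == true models A[NumProcs*i + Pn];
   interleaved == false models the contiguous chunk per processor. */
size_t shared_lines(size_t n, size_t num_procs, size_t words_per_line,
                    bool interleaved)
{
    assert(num_procs > 0 && words_per_line > 0 && n % num_procs == 0);
    size_t n_elem = n / num_procs;   /* elements per processor */
    size_t shared = 0;
    for (size_t start = 0; start < n; start += words_per_line) {
        /* owner of the first element in this line */
        size_t first = interleaved ? start % num_procs : start / n_elem;
        for (size_t idx = start + 1;
             idx < start + words_per_line && idx < n; idx++) {
            size_t owner = interleaved ? idx % num_procs : idx / n_elem;
            if (owner != first) {    /* a second processor writes this line */
                shared++;
                break;
            }
        }
    }
    return shared;
}
```

For example, with N = 64, NumProcs = 4, and 4-word lines, every one of the 16 lines is shared in the strided layout and none is shared in the contiguous layout.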
Synchronization
• Sum += A[i];
• Two processors, i = 0, i = 50
• Before the action:
  – Sum = 5
  – A[0] = 10
  – A[50] = 33
• What is the proper result?
Synchronization
• Sum = Sum + A[i];
• Assembly for this equation, assuming:
  – A[i] is already in $t0
  – &Sum is already in $s0
Synchronization: Ordering #1
P1 inst Effect P2 inst Effect
Given $t0 = 10 Given $t0 = 33
lw $t1 =
lw $t1 =
add $t1 = add $t1 =
sw Sum =
sw Sum =

Code run by each processor:
lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)
Synchronization: Ordering #2
P1 inst Effect P2 inst Effect
Given $t0 = 10 Given $t0 = 33
lw $t1 =
lw $t1 =
add $t1 = add $t1 =
sw Sum =
sw Sum =

Code run by each processor:
lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)
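Any interleaving of the two three-instruction sequences can be replayed deterministically. The sketch below is an illustration, not part of the slides: `replay` and its `schedule` argument are hypothetical names, and processor 0 stands for P1, processor 1 for P2. Each entry of `schedule` says which processor executes its next instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Replays Sum = Sum + A[i] on two processors, each as three separate
   steps (lw, add, sw). schedule[k] names the processor (0 or 1) that
   executes its next instruction at step k. Returns the final Sum. */
int replay(int sum, const int addend[2], const int schedule[6])
{
    int t1[2] = {0, 0};   /* each processor's private $t1 */
    int pc[2] = {0, 0};   /* next instruction: 0 = lw, 1 = add, 2 = sw */
    for (size_t k = 0; k < 6; k++) {
        int p = schedule[k];
        assert((p == 0 || p == 1) && pc[p] < 3);
        switch (pc[p]++) {
        case 0: t1[p] = sum;         break;  /* lw  $t1, 0($s0)   */
        case 1: t1[p] += addend[p];  break;  /* add $t1, $t1, $t0 */
        case 2: sum = t1[p];         break;  /* sw  $t1, 0($s0)   */
        }
    }
    return sum;
}
```

With Sum = 5 and addends 10 and 33, the schedule {0,0,0,1,1,1} runs the two updates one after the other. A schedule such as {0,1,0,1,0,1} lets both processors load Sum before either stores, so one of the two additions is lost.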
Does Cache Coherence solve it?
• Did load bring in an old value?
• Sum += A[i] is ___________
  – Atomic: the operation occurs in one unit, and nothing may interrupt it.
Synchronization Problem
• Reading and writing memory is a non-atomic operation
  – You cannot read and write a memory location in a single operation
• We need __________________ that allow us to read and write without interruption
Solution
• Software Solution
  – “lock” –
  – “unlock” –
• Hardware
  – Provide primitives that read & write in order to implement lock and unlock
Hardware: Implementing lock & unlock with swap
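Lock:   li   $t0, 1          # value to store: 1 = held
Loop:   swap $t0, 0($a0)     # atomically exchange $t0 with the lock word
        bne  $t0, $0, Loop   # old value was not 0: lock was held, retry
Unlock: sw   $0, 0($a0)      # store 0 to release the lock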
• If lock has 0, it is free
• If lock has 1, it is held
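The same idea can be sketched in C11, where `atomic_exchange` from `<stdatomic.h>` plays the role of the `swap` instruction: it stores a new value and returns the old one as a single atomic step. The type name `swap_lock` and the helper `locked_add` are made up for this illustration, and the busy-wait loop is a minimal sketch rather than a production lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* 0 = free, 1 = held, as in the slide. */
typedef atomic_int swap_lock;

void spin_lock(swap_lock *lock)
{
    assert(lock != NULL);
    /* Store 1 and get the old value back in one atomic step.
       An old value of 0 means the lock was free and is now ours. */
    while (atomic_exchange(lock, 1) != 0) {
        /* the lock was already held: retry */
    }
}

void spin_unlock(swap_lock *lock)
{
    assert(lock != NULL);
    atomic_store(lock, 0);   /* mark the lock free again */
}

/* Sum += value as a critical section protected by the lock. */
void locked_add(swap_lock *lock, int *sum, int value)
{
    spin_lock(lock);
    *sum += value;
    spin_unlock(lock);
}
```

If every processor updates Sum through `locked_add` with the same lock, the load, add, and store of one update cannot interleave with those of another.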