Parallel Processing Problems: Cache Coherence, False Sharing, Synchronization

Parallel Processing Problems

• Cache Coherence

• False Sharing

• Synchronization

Cache Coherence

[Figure: P1 and P2, each with its own cache ($), sharing DRAM]

P1, P2 are write-back caches.

Current value of a in:   P1$   P2$   DRAM
                          *     *     7

1. P2: Rd a
2. P2: Wr a, 5
3. P1: Rd a
4. P2: Wr a, 3
5. P1: Rd a

Whatever are we to do?

• Write-Invalidate

• Write-Update

Write Invalidate

[Figure: P1 and P2, each with its own cache ($), sharing DRAM]

P1, P2 are write-back caches.

Current value of a in:   P1$   P2$   DRAM
                          *     *     7
1. P2: Rd a               *     7     7
2. P2: Wr a, 5            *     5     7
3. P1: Rd a               5     5     7
4. P2: Wr a, 3
5. P1: Rd a

Write Update

[Figure: P1 and P2, each with its own cache ($), sharing DRAM]

P1, P2 are write-back caches.

Current value of a in:   P1$   P2$   DRAM
                          *     *     7
1. P2: Rd a               *     7     7
2. P2: Wr a, 5            *     5     7
3. P1: Rd a               5     5     7
4. P2: Wr a, 3
5. P1: Rd a
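The five-access example can be traced mechanically under both policies. The following is a minimal sketch, not the protocol of any particular machine: it assumes the caches never evict, that a read miss is served by another cache holding the value when one exists, and that DRAM is written only on eviction (so it keeps 7 throughout, which matches the first three rows shown on the slides). The function name `simulate` and the use of `None` for the slides' `*` are illustrative choices.

```python
def simulate(policy, accesses, dram_value=7, procs=("P1", "P2")):
    """Trace cache contents under a 'invalidate' or 'update' write policy.

    Assumptions (not from the slides): no evictions, a read miss is served
    by another cache that holds the value if there is one, and DRAM is only
    written on eviction, so it keeps its initial value here.
    """
    if policy not in ("invalidate", "update"):
        raise ValueError("policy must be 'invalidate' or 'update'")
    cache = {p: None for p in procs}  # None = no copy (our reading of the * on the slides)
    dram = dram_value
    trace = []
    for proc, op, value in accesses:
        others = [p for p in procs if p != proc]
        if op == "Rd":
            if cache[proc] is None:  # read miss
                copies = [cache[p] for p in others if cache[p] is not None]
                cache[proc] = copies[0] if copies else dram
        elif op == "Wr":
            cache[proc] = value
            for p in others:
                if policy == "invalidate":
                    cache[p] = None       # other copies become invalid
                elif cache[p] is not None:
                    cache[p] = value      # existing copies are refreshed
        else:
            raise ValueError("op must be 'Rd' or 'Wr'")
        trace.append(tuple(cache[p] for p in procs) + (dram,))
    return trace


# The access sequence from the slides: (processor, operation, value written)
SEQUENCE = [("P2", "Rd", None), ("P2", "Wr", 5), ("P1", "Rd", None),
            ("P2", "Wr", 3), ("P1", "Rd", None)]

for policy in ("invalidate", "update"):
    print(policy)
    for step, row in enumerate(simulate(policy, SEQUENCE), start=1):
        print(" ", step, row)  # (P1$, P2$, DRAM)
```

Under these assumptions the two policies differ only from step 4 on: with invalidate, P1 loses its copy and misses again at step 5; with update, P1's copy is refreshed and step 5 hits.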

Performance Considerations

Invalidate

• Writing makes data exclusive
• Receiving changed data is slower

Update

• Once shared, always shared
• Once shared, writes always go on the bus
• Get changed data very quickly
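The trade-off can be made concrete with a rough count of bus transactions. This is a deliberately simplified model, assumed for illustration only: both caches start with a copy of the word, P2 writes it several times, then P1 reads it once, and each invalidation, update broadcast, or read miss costs one bus transaction. The function `bus_transactions` is hypothetical.

```python
def bus_transactions(policy, writes_before_read):
    """Bus transactions when P2 writes a shared word n times, then P1 reads it.

    Simplified cost model (an assumption): one transaction per invalidation,
    per update broadcast, and per read miss.
    """
    n = writes_before_read
    if policy == "invalidate":
        # The first write invalidates P1's copy; later writes hit an
        # exclusive line. P1's read then misses and fetches the new value.
        invalidations = 1 if n > 0 else 0
        read_misses = 1 if n > 0 else 0
        return invalidations + read_misses
    if policy == "update":
        # Every write is broadcast; P1's read hits its refreshed copy.
        return n
    raise ValueError("policy must be 'invalidate' or 'update'")


for n in (1, 2, 10):
    print(n, bus_transactions("invalidate", n), bus_transactions("update", n))
```

In this model invalidate costs a fixed two transactions however many writes occur, while update costs one per write, so update only wins when the reader consumes each new value almost immediately.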

Cache Coherence: False Sharing

[Figure: P1 and P2, each with its own cache ($), sharing DRAM]

P1, P2 cache line size: 4 words

Current contents in:   P1$   P2$
                        *     *

1. P2: Rd A[0]
2. P1: Rd A[1]
3. P2: Wr A[0], 5
4. P1: Wr A[1], 3

Look closely at example

• P1 and P2 do not access the same element

• A[0] and A[1] are in the same cache block, so whenever one of them is in a cache, the other is in that cache too.

False Sharing

• Different/same processors access different/same items in different/same cache block

• Leads to ___________ misses
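The four-access example can be replayed with coherence tracked per cache line rather than per word. This is a sketch under stated assumptions: a write-invalidate policy, 4-word lines, and an array A that starts on a line boundary. The function `run` and the event strings are illustrative.

```python
WORDS_PER_LINE = 4  # cache line size from the example


def run(accesses, words_per_line=WORDS_PER_LINE):
    """Replay accesses under write-invalidate at cache-line granularity.

    Returns a log of (processor, op, index, line, events). Assumes A[0]
    sits at the start of a cache line.
    """
    holders = {}  # line number -> set of processors with a valid copy
    log = []
    for proc, op, index in accesses:
        line = index // words_per_line
        valid = holders.setdefault(line, set())
        events = []
        if proc not in valid:
            events.append("miss")
            valid.add(proc)
        if op == "Wr":
            victims = sorted(valid - {proc})
            if victims:
                events.append("invalidates " + ",".join(victims))
            valid.intersection_update({proc})  # the writer keeps the only copy
        log.append((proc, op, index, line, events))
    return log


EXAMPLE = [("P2", "Rd", 0), ("P1", "Rd", 1), ("P2", "Wr", 0), ("P1", "Wr", 1)]

for entry in run(EXAMPLE):
    print(entry)
```

The last access is the interesting one: P1 misses on A[1] although no other processor ever touched A[1], because P2's write to A[0] invalidated the whole line.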

Cache Performance

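// Pn = my processor number (rank)
// NumProcs = total active processors
// N = total number of elements
// NElem = N / NumProcs

// Interleaved: each processor writes every NumProcs-th element
for (i = 0; i < NElem; i++)
    A[NumProcs*i + Pn] = f(i);

Vs

// Blocked: each processor writes one contiguous chunk
for (i = Pn*NElem; i < (Pn+1)*NElem; i++)
    A[i] = f(i);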

Which is worse?

• Both access the same number of elements

• No processors access the same elements as each other
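One way to compare the two loops is to count how many cache lines end up being written by more than one processor. The sketch below assumes 4-word lines, an array aligned to a line boundary, and that the interleaved loop has each processor write NElem elements with stride NumProcs; the helper names are hypothetical.

```python
def interleaved(pn, i, num_procs, n_elem):
    """Index written by processor pn on iteration i: A[NumProcs*i + Pn]."""
    return num_procs * i + pn


def blocked(pn, i, num_procs, n_elem):
    """Index written by processor pn on iteration i of its contiguous chunk."""
    return pn * n_elem + i


def shared_lines(index_fn, num_procs, n_elem, words_per_line=4):
    """Count cache lines written by more than one processor.

    Assumes A starts on a cache-line boundary.
    """
    writers = {}  # line number -> set of processors writing into it
    for pn in range(num_procs):
        for i in range(n_elem):
            line = index_fn(pn, i, num_procs, n_elem) // words_per_line
            writers.setdefault(line, set()).add(pn)
    return sum(1 for procs in writers.values() if len(procs) > 1)


# N = 16 elements, 4 processors, so NElem = 4
print("interleaved:", shared_lines(interleaved, num_procs=4, n_elem=4))
print("blocked:    ", shared_lines(blocked, num_procs=4, n_elem=4))
```

With these sizes every line is written by all four processors in the interleaved layout and by exactly one in the blocked layout. If NElem is not a multiple of the line size, the blocked layout still shares the lines at chunk boundaries.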

Synchronization

• Sum += A[i];

• Two processors, i = 0, i = 50

• Before the action:
  – Sum = 5
  – A[0] = 10
  – A[50] = 33

• What is the proper result?

Synchronization

• Sum = Sum + A[i];

• Assembly for this equation, assuming
  – A[i] is already in $t0
  – &Sum is already in $s0

Synchronization: Ordering #1

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)

P1 inst   Effect       P2 inst   Effect
Given     $t0 = 10     Given     $t0 = 33
lw        $t1 =        lw        $t1 =
add       $t1 =        add       $t1 =
sw        Sum =        sw        Sum =

Synchronization: Ordering #2

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)

P1 inst   Effect       P2 inst   Effect
Given     $t0 = 10     Given     $t0 = 33
lw        $t1 =        lw        $t1 =
add       $t1 =        add       $t1 =
sw        Sum =        sw        Sum =
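The possible outcomes can be enumerated by running every interleaving of the two three-instruction sequences. This is a minimal sketch that models memory as a single variable and treats each lw, add, and sw as one indivisible step; the function name `final_sums` is illustrative.

```python
from itertools import combinations


def final_sums(initial_sum=5, addends=(10, 33)):
    """Return every final value of Sum over all interleavings of two
    lw/add/sw sequences, one per processor."""
    steps = 3  # lw, add, sw
    results = set()
    # Choose which of the 6 time slots belong to processor 0.
    for slots in combinations(range(2 * steps), steps):
        order = [0 if t in slots else 1 for t in range(2 * steps)]
        mem = initial_sum     # the shared variable Sum
        t1 = [None, None]     # each processor's private $t1
        pc = [0, 0]           # next instruction of each processor
        for p in order:
            if pc[p] == 0:
                t1[p] = mem              # lw  $t1, 0($s0)
            elif pc[p] == 1:
                t1[p] += addends[p]      # add $t1, $t1, $t0
            else:
                mem = t1[p]              # sw  $t1, 0($s0)
            pc[p] += 1
        results.add(mem)
    return results


print(sorted(final_sums()))  # [15, 38, 48]
```

Only the interleavings in which one processor finishes its sw before the other starts its lw give 48; whenever both loads see the old value 5, the later store overwrites the earlier one.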

Does Cache Coherence solve it?

• Did load bring in an old value?

• Sum += A[i] is ___________
  – Atomic: the operation occurs as one unit, and nothing may interrupt it.

Synchronization Problem

• Reading and writing memory is a non-atomic operation
  – You cannot read and write a memory location in a single operation

• We need __________________ that allow us to read and write without interruption

Solution

• Software Solution
  – “lock” –
  – “unlock” –

• Hardware
  – Provide primitives that read & write in order to implement lock and unlock

Software: Using lock and unlock

Sum += A[i]
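In a threaded program the update is wrapped between lock and unlock. The sketch below uses Python's `threading.Lock` as a stand-in for the slides' lock and unlock; the array contents and the split of the index range at 50 are assumptions chosen for illustration.

```python
import threading

A = list(range(100))            # illustrative data, not the slides' values
total = 0                       # plays the role of Sum
total_lock = threading.Lock()


def add_range(start, stop):
    """Add A[start:stop] into the shared total, one element at a time."""
    global total
    for i in range(start, stop):
        with total_lock:        # lock
            total += A[i]       # critical section: read, add, write
        # leaving the with-block releases the lock (unlock)


def parallel_sum():
    """Sum A with two threads that start at i = 0 and i = 50."""
    global total
    total = 0
    threads = [threading.Thread(target=add_range, args=(0, 50)),
               threading.Thread(target=add_range, args=(50, 100))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


print(parallel_sum())
```

Holding the lock around the whole read-add-write means no other thread can load the old total between this thread's load and store.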

Hardware: Implementing lock & unlock

• swap $1, 100($2)
  – Swap the contents of $1 and M[$2+100]

Hardware: Implementing lock & unlock with swap

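Lock:   li   $t0, 1          # 1 means "held"
Loop:   swap $t0, 0($a0)     # exchange $t0 with the lock word in one step
        bne  $t0, $0, Loop   # old value was 1: already held, try again

Unlock: sw   $0, 0($a0)      # store 0 to free the lock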

• If lock has 0, it is free

• If lock has 1, it is held
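The same spin lock can be sketched in a high-level language. Python has no atomic swap instruction, so the hypothetical `SwapCell` class below uses an internal `threading.Lock` purely to stand in for the hardware guarantee that swap is indivisible; everything else follows the 0 = free, 1 = held convention.

```python
import threading
import time


class SwapCell:
    """One memory word with an atomic swap.

    The internal lock only imitates the indivisibility that the hardware
    swap instruction provides; it is not part of the algorithm.
    """

    def __init__(self, value=0):
        self._value = value
        self._hw = threading.Lock()

    def swap(self, new):
        """Store new and return the old value as one indivisible step."""
        with self._hw:
            old, self._value = self._value, new
            return old

    def store(self, new):
        with self._hw:
            self._value = new


def lock(cell):
    while cell.swap(1) != 0:   # old value 1: someone else holds it, retry
        time.sleep(0)          # yield so the holder can make progress


def unlock(cell):
    cell.store(0)              # 0 marks the lock as free


def run_counter(num_threads=2, increments=1000):
    """Increment a shared counter from several threads under the spin lock."""
    cell = SwapCell()
    counter = [0]

    def worker():
        for _ in range(increments):
            lock(cell)
            counter[0] += 1    # critical section
            unlock(cell)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]


print(run_counter())
```

A thread that swaps in 1 and gets 0 back knows it was the one that changed the lock from free to held; any thread that gets 1 back changed nothing and simply tries again.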

Summary

• Cache coherence must be implemented for shared memory to work

• False sharing causes bad cache performance

• Hardware primitives necessary for synchronizing shared data