Parallel Processing
Transcript of Parallel Processing
Chapter 9
• Problem
– Branches, cache misses, and dependencies limit the instruction-level parallelism (ILP) available
• Solution
– Divide the program into parts
– Run each part on a separate CPU of a larger machine
Motivations
• Desktops are incredibly cheap
– Rather than building a custom high-performance uniprocessor, hook up 100 desktops
• Squeezing out more ILP is difficult
– More complexity/power required each time
– Would require a change in cooling technology
Challenges
• Parallelizing code is not easy
– Languages, software engineering, and software verification issues – beyond the scope of this class
• Communication can be costly
– Our performance analysis ignores caches – these costs are much higher
• Requires HW support
– Multiple processes modifying the same data cause race conditions, and out-of-order processors arbitrarily reorder operations
Performance – Speedup
• Amdahl's Law!
• 70% of the program is parallelizable
• What is the highest speedup possible?
– 1 / (0.30 + 0.70/∞) = 1 / 0.30 = 3.33
• What is the speedup with 100 processors?
– 1 / (0.30 + 0.70/100) = 1 / 0.307 = 3.26
Taxonomy
• SISD – single instruction, single data
– uniprocessor
• SIMD – single instruction, multiple data
– vector, MMX extensions, graphics cards
• MISD – multiple instruction, single data
• MIMD – multiple instruction, multiple data
SIMD
[Diagram: a single controller broadcasting to an array of processing elements (P), each paired with its own data (D)]
• Controller fetches instructions
• All processors execute the same instruction
• Conditional instructions are the only way to get variation
Taxonomy
• SISD – single instruction, single data
– uniprocessor
• SIMD – single instruction, multiple data
– vector, MMX extensions, graphics cards
• MISD – multiple instruction, single data
– Never built – pipeline architectures? streaming apps?
• MIMD – multiple instruction, multiple data
– Most multiprocessors
– Cheap, flexible
Example
• Sum the elements in A[] and place the result in sum

int sum = 0;
int i;
for (i = 0; i < n; i++)
    sum = sum + A[i];
Parallel Version – Shared Memory

int A[NUM];
int numProcs;
int sum;
int sumArray[numProcs];

myFunction( /* input arguments */ )
{
    int myNum = ...;   /* this processor's rank */
    int mySum = 0;
    for (i = (NUM/numProcs)*myNum; i < (NUM/numProcs)*(myNum+1); i++)
        mySum += A[i];
    sumArray[myNum] = mySum;
    barrier();
    if (myNum == 0) {
        for (i = 0; i < numProcs; i++)
            sum += sumArray[i];
    }
}
Why Synchronization?
• Why can't you figure out when proc x will finish its work?
– Cache misses
– Different control flow
– Context switches
Supporting Parallel Programs
• Synchronization
• Cache Coherence
• False Sharing
Synchronization
• Sum += A[i];
• Two processors, i = 0, i = 50
• Before the action:
– Sum = 5
– A[0] = 10
– A[50] = 33
• What is the proper result?
Synchronization
• Sum = Sum + A[i];
• Assembly for this statement, assuming:
– A[i] is already in $t0
– &Sum is already in $s0

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)
Synchronization – Ordering #1

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)

Given: Sum = 5, P1's $t0 = 10, P2's $t0 = 33

P1 inst   Effect       P2 inst   Effect
lw        $t1 = 5
                       lw        $t1 = 5
add       $t1 = 15     add       $t1 = 38
sw        Sum = 15
                       sw        Sum = 38

Synchronization – Ordering #2

Same instructions, but now P2's store completes before P1's:

P1 inst   Effect       P2 inst   Effect
lw        $t1 = 5
                       lw        $t1 = 5
add       $t1 = 15     add       $t1 = 38
                       sw        Sum = 38
sw        Sum = 15

Either way one processor's update is lost: the final Sum is 38 or 15, never the correct 48.
Synchronization Problem
• Reading and writing memory is a non-atomic operation
– You cannot read and then write a memory location in a single operation
• We need hardware primitives that allow us to read and write without interruption
Solution
• Software
– "lock" – a function that allows one processor to leave, all others to loop
– "unlock" – releases the next looping processor (or resets to allow the next arriving proc to leave)
• Hardware
– Provide primitives that read & write atomically in order to implement lock and unlock
Software: Using lock and unlock

lock(&balancelock)
Sum += A[i]
unlock(&balancelock)
Hardware: Implementing lock & unlock
• swap $1, 100($2)
– Swaps the contents of $1 and M[$2+100]
Hardware: Implementing lock & unlock with swap

Lock:  li   $t0, 1
Loop:  swap $t0, 0($a0)
       bne  $t0, $0, Loop

Unlock: sw  $0, 0($a0)

• If the lock holds 0, it is free
• If the lock holds 1, it is held
Outline
• Synchronization
• Cache Coherence
• False Sharing
Cache Coherence

[Diagram: P1 and P2, each with its own cache ($$$), sharing one DRAM. P1 and P2 have write-back caches.]

Value of a in:     P1$   P2$   DRAM
(initially)         *     *     7
1. P2: Rd a         *     7     7
2. P2: Wr a, 5      *     5     7
3. P1: Rd a         5     5     5
4. P2: Wr a, 3      5     3     5
5. P1: Rd a         ?

AAAAAAAAAAAAAAAAAAAAAH! Inconsistency!
What will P1 receive from its load? 5 – its cached copy, now stale
What should P1 receive from its load? 3
Whatever are we to do?
• Write-Invalidate
– Invalidate that value in all others' caches
– Set the valid bit to 0
• Write-Update
– Update the value in all others' caches
Write Invalidate

Value of a in:     P1$   P2$   DRAM
1. P2: Rd a         *     7     7
2. P2: Wr a, 5      *     5     7
3. P1: Rd a         5     5     5
4. P2: Wr a, 3      *     3     5    (P1's copy is invalidated)
5. P1: Rd a         3     3     3    (P1 misses; P2 writes back 3)
Write Update

Value of a in:     P1$   P2$   DRAM
1. P2: Rd a         *     7     7
2. P2: Wr a, 5      *     5     7
3. P1: Rd a         5     5     5
4. P2: Wr a, 3      3     3     3    (the write is broadcast, updating every copy)
5. P1: Rd a         3     3     3
Outline
• Synchronization
• Cache Coherence
• False Sharing
Cache Coherence: False Sharing w/ Invalidate
P1 and P2 have a cache line size of 4 words.
Look closely at the example
• P1 and P2 do not access the same element
• A[0] and A[1] are in the same cache block, so whenever the block is in a cache, both elements are
Cache line contents:     P1$      P2$
1. P2: Rd A[0]            *       A[0-3]
2. P1: Rd A[1]           A[0-3]   A[0-3]
3. P2: Wr A[0], 5         *       A[0-3]   (P1's line invalidated)
4. P1: Wr A[1], 3        A[0-3]    *       (P2's line invalidated)
False Sharing
• Different processors access different items in the same cache block
• Leads to coherence cache misses
Cache Performance
// Pn = my processor number (rank)// NumProcs = total active processors// N = total number of elements// NElem = N / NumProcs
For(i=0;i<N;i++) A[NumProcs*i+Pn] = f(i);
Vs
For(i=(Pn*NElem);i<(Pn+1)*NElem;i++) A[i] = f(i);
Which is better?
• Both access the same number of elements
• No two processors access the same elements

Why is the second better?
• Better spatial locality
• Less false sharing