Parallel Processing Chapter 9


• Problem:
– Branches, cache misses, and dependencies limit the instruction-level parallelism (ILP) available
• Solution:
– Divide program into parts
– Run each part on separate CPUs of a larger machine


Motivations

• Desktops are incredibly cheap
– Custom high-performance uniprocessor
– Hook up 100 desktops
• Squeezing out more ILP is difficult
– More complexity/power required each time
– Would require change in cooling technology


Challenges

• Parallelizing code is not easy
– Languages, software engineering, software verification issue – beyond scope of class
• Communication can be costly
– Performance analysis ignores caches - these costs are much higher
• Requires HW support
– Multiple processes modifying the same data cause race conditions, and out-of-order processors arbitrarily reorder things


Speedup

• Amdahl’s Law!
• 70% of the program is parallelizable
• What is the highest speedup possible?
– 1 / (0.30 + 0.70/∞) = 1 / 0.30 = 3.33
• What is the speedup with 100 processors?
– 1 / (0.30 + 0.70/100) = 1 / 0.307 = 3.26
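The two speedup calculations above can be checked with a small sketch. The function names `amdahl_speedup` and `amdahl_limit` are illustrative, not from the slides; the formula is the one used on the slide, 1 / ((1 - p) + p / n).

```c
/* Amdahl's Law: speedup when a fraction p of the program runs on n processors. */
double amdahl_speedup(double parallel_fraction, double processors)
{
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / processors);
}

/* Limit as the processor count goes to infinity: only the serial part remains. */
double amdahl_limit(double parallel_fraction)
{
    return 1.0 / (1.0 - parallel_fraction);
}
```

With p = 0.70, `amdahl_limit` gives about 3.33 and `amdahl_speedup(0.70, 100)` gives about 3.26, so 100 processors already sit close to the ceiling set by the 30% serial part.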


Taxonomy

• SISD – single instruction, single data
– uniprocessor
• SIMD – single instruction, multiple data
– vector, MMX extensions, graphics cards
• MISD – multiple instruction, single data
• MIMD – multiple instruction, multiple data

SIMD

Figure: a single controller drives an array of processors (P), each paired with its own data memory (D).

• Controller fetches instructions
• All processors execute the same instruction
• Conditional instructions are the only way to get variation


Taxonomy

• SISD – single instruction, single data
– uniprocessor
• SIMD – single instruction, multiple data
– vector, MMX extensions, graphics cards
• MISD – multiple instruction, single data
– Never built – pipeline architectures?
– Streaming apps?
• MIMD – multiple instruction, multiple data
– Most multiprocessors
– Cheap, flexible

Example

• Sum the elements in A[] and place result in sum

int sum = 0;
int i;
for (i = 0; i < n; i++)
    sum = sum + A[i];


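Parallel version: Shared Memory

int A[NUM];
int numProcs;
int sum;
int sumArray[numProcs];

myFunction( (input arguments) )
{
    int myNum = …….;   /* this processor's number */
    int mySum = 0;
    int i;

    /* each processor sums its own contiguous slice of A */
    for (i = (NUM/numProcs)*myNum; i < (NUM/numProcs)*(myNum+1); i++)
        mySum += A[i];
    sumArray[myNum] = mySum;

    barrier();   /* wait until every processor has stored its partial sum */

    if (myNum == 0) {
        for (i = 0; i < numProcs; i++)
            sum += sumArray[i];
    }
}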


Why Synchronization?

• Why can’t you figure out when proc x will finish work?
– Cache misses
– Different control flow
– Context switches

Supporting Parallel Programs

• Synchronization
• Cache Coherence
• False Sharing

Synchronization

• Sum += A[i];
• Two processors, i = 0, i = 50
• Before the action:
– Sum = 5
– A[0] = 10
– A[50] = 33
• What is the proper result?

Synchronization

• Sum = Sum + A[i];
• Assembly for this equation, assuming
– A[i] is already in $t0
– &Sum is already in $s0

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)

Synchronization: Orderings #1 and #2

Each processor runs the same three instructions:

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)

P1 inst   Effect      P2 inst   Effect
Given     $t0 = 10    Given     $t0 = 33
lw        $t1 = 5     lw        $t1 = 5
add       $t1 = 15    add       $t1 = 38
sw        Sum = 15    sw        Sum = 38

• Both processors load Sum = 5 before either one stores
• Sum ends as 15 or 38, depending on which sw executes last; the proper result, 48, is lost
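The lost update can be reproduced deterministically with a tiny single-threaded simulation. This is only a sketch: `run_schedule` and its schedule encoding (an array naming which processor, 0 for P1 or 1 for P2, executes its next instruction) are hypothetical helpers introduced here, using the slide's starting values Sum = 5, $t0 = 10 for P1 and $t0 = 33 for P2.

```c
/* Simulate two processors each running: lw $t1,Sum ; add $t1,$t1,$t0 ; sw $t1,Sum.
 * schedule[k] says which processor (0 = P1, 1 = P2) executes its next instruction. */
int run_schedule(const int *schedule, int steps)
{
    int sum = 5;                 /* shared memory location Sum */
    int t0[2] = {10, 33};        /* A[0] for P1, A[50] for P2 */
    int t1[2] = {0, 0};          /* each processor's private $t1 */
    int pc[2] = {0, 0};          /* next instruction index per processor */

    for (int k = 0; k < steps; k++) {
        int p = schedule[k];
        switch (pc[p]) {
        case 0: t1[p] = sum;     break;   /* lw  */
        case 1: t1[p] += t0[p];  break;   /* add */
        case 2: sum = t1[p];     break;   /* sw  */
        default:                 break;   /* processor already finished */
        }
        if (pc[p] < 3)
            pc[p]++;
    }
    return sum;
}
```

Running P1 to completion and then P2 (schedule 0,0,0,1,1,1) yields 48. Alternating the two processors yields 38 or 15 depending on who stores last, which is the race the slide illustrates.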

Synchronization Problem

• Reading and then writing memory is not an atomic operation
– You cannot read and write a memory location in a single operation
• We need hardware primitives that allow us to read and write without interruption

Solution

• Software Solution
– “lock” – function that allows one processor to leave, all others to loop
– “unlock” – releases the next looping processor (or resets to allow next arriving proc to leave)
• Hardware
– Provide primitives that read and write atomically, used to implement lock and unlock

Software: Using lock and unlock

lock(&balancelock);
Sum += A[i];
unlock(&balancelock);

Hardware: Implementing lock & unlock

• swap $1, 100($2)
– Swap the contents of $1 and M[$2+100]


Hardware: Implementing lock & unlock with swap

Lock:   li   $t0, 1
Loop:   swap $t0, 0($a0)
        bne  $t0, $0, Loop

Unlock: sw   $0, 0($a0)

• If lock has 0, it is free
• If lock has 1, it is held
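The same swap-based lock can be sketched in C, assuming C11 atomics are available: `atomic_exchange` stands in for the `swap` instruction. The names `swap_lock_t`, `spin_lock`, `spin_trylock`, `spin_unlock`, `shared_sum`, and `add_to_sum` are illustrative and do not come from the slides.

```c
#include <stdatomic.h>

typedef atomic_int swap_lock_t;      /* 0 = free, 1 = held */

void spin_lock(swap_lock_t *lock)
{
    /* Atomically store 1 and fetch the old value, like swap $t0, 0($a0). */
    while (atomic_exchange(lock, 1) != 0)
        ;                            /* old value was 1: lock is held, loop again */
}

/* Single attempt: returns 1 if the lock was acquired, 0 if it was already held. */
int spin_trylock(swap_lock_t *lock)
{
    return atomic_exchange(lock, 1) == 0;
}

void spin_unlock(swap_lock_t *lock)
{
    atomic_store(lock, 0);           /* like sw $0, 0($a0) */
}

int shared_sum = 5;                  /* Sum from the earlier example */
swap_lock_t sum_lock = 0;

/* The critical section: lock, Sum += value, unlock. */
void add_to_sum(int value)
{
    spin_lock(&sum_lock);
    shared_sum += value;
    spin_unlock(&sum_lock);
}
```

Because only one caller at a time can see the old value 0, the load, add, and store inside `add_to_sum` cannot interleave with another processor's, so adding 10 and 33 to 5 always gives 48.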



Cache Coherence

Figure: P1 and P2, each with its own cache ($), share one DRAM. P1, P2 are write-back caches.

Current a value in:
Step              P1$   P2$   DRAM
(initially)       *     *     7
1. P2: Rd a       *     7     7
2. P2: Wr a, 5    *     5     7
3. P1: Rd a       5     5     5
4. P2: Wr a, 3    5     3     5
5. P1: Rd a

• Aaah! Inconsistency!
• What will P1 receive from its load? 5
• What should P1 receive from its load? 3

Whatever are we to do?

• Write-Invalidate
– Invalidate that value in all others’ caches
– Set the valid bit to 0
• Write-Update
– Update the value in all others’ caches


Write Invalidate

Figure: P1 and P2, each with its own cache ($), share one DRAM. P1, P2 are write-back caches.

Current a value in:
Step              P1$   P2$   DRAM
(initially)       *     *     7
1. P2: Rd a       *     7     7
2. P2: Wr a, 5    *     5     7
3. P1: Rd a       5     5     5
4. P2: Wr a, 3    *     3     5
5. P1: Rd a       3     3     3


Write Update

Figure: P1 and P2, each with its own cache ($), share one DRAM. P1, P2 are write-back caches.

Current a value in:
Step              P1$   P2$   DRAM
(initially)       *     *     7
1. P2: Rd a       *     7     7
2. P2: Wr a, 5    *     5     7
3. P1: Rd a       5     5     5
4. P2: Wr a, 3    3     3     3
5. P1: Rd a       3     3     3
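The three tables (no coherence, write-invalidate, write-update) can be replayed with a small simulation of two one-word write-back caches. This is a simplified sketch, not a real protocol: the type and function names are invented here, a read miss is assumed to be serviced by the other cache's dirty copy (which also updates DRAM, as in step 3), and write-update is assumed to refresh DRAM only when another cache holds the line, which is what the slide's tables show.

```c
enum policy { NO_COHERENCE, WRITE_INVALIDATE, WRITE_UPDATE };

struct line { int valid; int dirty; int value; };

struct machine {
    struct line cache[2];        /* cache[0] is P1$, cache[1] is P2$ */
    int dram;
    enum policy policy;
};

static void machine_init(struct machine *m, enum policy p, int initial)
{
    m->cache[0] = (struct line){0, 0, 0};
    m->cache[1] = (struct line){0, 0, 0};
    m->dram = initial;
    m->policy = p;
}

static int machine_read(struct machine *m, int p)
{
    struct line *mine = &m->cache[p], *other = &m->cache[1 - p];
    if (!mine->valid) {                       /* miss */
        if (other->valid && other->dirty) {   /* dirty owner writes back first */
            m->dram = other->value;
            other->dirty = 0;
        }
        mine->value = m->dram;
        mine->valid = 1;
        mine->dirty = 0;
    }
    return mine->value;                       /* a hit returns the cached copy */
}

static void machine_write(struct machine *m, int p, int v)
{
    struct line *mine = &m->cache[p], *other = &m->cache[1 - p];
    mine->valid = 1;
    mine->value = v;
    mine->dirty = 1;
    if (m->policy == WRITE_INVALIDATE) {
        other->valid = 0;                     /* set the other copy's valid bit to 0 */
        other->dirty = 0;
    } else if (m->policy == WRITE_UPDATE && other->valid) {
        other->value = v;                     /* broadcast the new value */
        m->dram = v;
        mine->dirty = 0;
    }
}

/* Replays steps 1-5 from the slides and returns what P1's final load sees. */
int final_read(enum policy p)
{
    struct machine m;
    machine_init(&m, p, 7);
    machine_read(&m, 1);          /* 1. P2: Rd a    */
    machine_write(&m, 1, 5);      /* 2. P2: Wr a, 5 */
    machine_read(&m, 0);          /* 3. P1: Rd a    */
    machine_write(&m, 1, 3);      /* 4. P2: Wr a, 3 */
    return machine_read(&m, 0);   /* 5. P1: Rd a    */
}
```

Without coherence P1's last load hits its stale copy and returns 5; under either write-invalidate or write-update it returns 3.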


Cache Coherence: False Sharing w/ Invalidate

Figure: P1 and P2, each with its own cache ($), share one DRAM. P1, P2 cacheline size: 4 words.

Current contents in:
Step                 P1$      P2$
(initially)          *        *
1. P2: Rd A[0]       *        A[0-3]
2. P1: Rd A[1]       A[0-3]   A[0-3]
3. P2: Wr A[0], 5    *        A[0-3]
4. P1: Wr A[1], 3    A[0-3]   *

Look closely at example

• P1 and P2 do not access the same element
• A[0] and A[1] are in the same cache block, so whenever one of them is in a cache, the other is in that cache too


False Sharing

• Different processors access different items in same cache block

• Leads to coherence cache misses

Cache Performance

// Pn = my processor number (rank)
// NumProcs = total active processors
// N = total number of elements
// NElem = N / NumProcs

for (i = 0; i < NElem; i++)
    A[NumProcs*i + Pn] = f(i);

vs.

for (i = Pn*NElem; i < (Pn+1)*NElem; i++)
    A[i] = f(i);


Why is the second better?

• Both access the same number of elements
• No processors access the same elements as each other
• Better Spatial Locality
• Less False Sharing
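The difference in false sharing between the two loops can be made concrete by counting how many cache lines are written by more than one processor under each index pattern. The sketch below is illustrative only: `count_shared_lines` is a hypothetical helper, the array is assumed to start on a cache-line boundary, N is assumed to be a multiple of NumProcs, and each processor is assumed to write NElem elements.

```c
#include <stdlib.h>

/* Returns the number of cache lines written by two or more processors,
 * or -1 if memory allocation fails.
 * interleaved != 0: processor pn writes A[num_procs*i + pn]   (first loop)
 * interleaved == 0: processor pn writes A[pn*n_elem + i]      (second loop) */
int count_shared_lines(int n, int num_procs, int words_per_line, int interleaved)
{
    int n_elem = n / num_procs;
    int n_lines = (n + words_per_line - 1) / words_per_line;
    int *owner = malloc((size_t)n_lines * sizeof *owner);
    int *shared = calloc((size_t)n_lines, sizeof *shared);
    int count = 0;

    if (owner == NULL || shared == NULL) {
        free(owner);
        free(shared);
        return -1;
    }
    for (int l = 0; l < n_lines; l++)
        owner[l] = -1;                        /* no writer seen yet */

    for (int pn = 0; pn < num_procs; pn++) {
        for (int i = 0; i < n_elem; i++) {
            int idx = interleaved ? num_procs * i + pn : pn * n_elem + i;
            int line = idx / words_per_line;
            if (owner[line] == -1)
                owner[line] = pn;             /* first writer of this line */
            else if (owner[line] != pn)
                shared[line] = 1;             /* a second processor writes it too */
        }
    }
    for (int l = 0; l < n_lines; l++)
        count += shared[l];

    free(owner);
    free(shared);
    return count;
}
```

For 16 elements, 4 processors, and 4-word lines, the interleaved pattern leaves all 4 lines shared, while the blocked pattern gives each processor its own line and leaves none shared.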