ECE 454 Computer Systems Programming
Parallel Architectures and Performance Implications (II)
Ding Yuan, ECE Dept., University of Toronto
http://www.eecg.toronto.edu/~yuan
Ding Yuan, ECE454
What we already learnt
• How to benefit from multi-cores by parallelizing a sequential program into a multi-threaded program
• Watch out for locks: atomic regions are serialized
  • Use fine-grained locks, and avoid locking where possible
• But is that all? As long as you do the above, will your multi-threaded program run Nx faster on an N-core machine?
Putting it all together [1]
• Performance implications of parallel architectures
• Background: the architecture of the two test machines
• Cache-coherence performance and its implications for parallel software design
[1] "Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask", David et al., SOSP '13
Two case studies
• 48-core AMD Opteron
• 80-core Intel Xeon
Question to keep in mind: which machine would you use?
48-core AMD Opteron
• LLC NOT shared
• Directory-based cache coherence
[Diagram: 8 sockets on the motherboard share the RAM; each socket contains 2 dies of 6 cores each; each core has a private L1, and each die has its own last-level cache, so LLC traffic within a socket is cross-die]
80-core Intel Xeon
• LLC shared
• Snooping-based cache coherence
[Diagram: 8 sockets on the motherboard share the RAM; each socket is a die of 10 cores; each core has a private L1, and all cores on a die share the last-level cache; traffic between sockets is cross-socket]
Interconnect between sockets
Cross-socket communication can take 2 hops.
Performance of memory operations
Local caches and memory latencies
• Memory access to a line cached locally (cycles)
  • Best case: L1, < 10 cycles (remember this)
  • Worst case: RAM, 136–355 cycles (remember this)
Latency of remote access: read (cycles)
"State" is the MESI state of the cache line in the remote cache (the local state is Invalid).
• Cross-socket communication is expensive!
  • Xeon: loading from the Shared state is 7.5x more expensive over two hops than within a socket
  • Opteron: cross-socket latency is even larger than RAM latency
• Opteron: uniform latency regardless of the cache state
  • Directory-based protocol (the directory is distributed across all LLCs; here we assume the directory lookup stays in the same die)
• Xeon: a load from the Shared state is much faster than from the Modified or Exclusive states
  • A Shared-state read is served from the LLC instead of from the remote cache
Latency of remote access: write (cycles)
"State" is the MESI state of the cache line in the remote cache.
• Cross-socket communication is expensive!
• Opteron: a store to a "Shared" cache line is much more expensive
  • The directory-based protocol is incomplete: it does not keep track of the sharers
  • A store is therefore equivalent to a broadcast and must wait for all invalidations to complete
• Xeon: store latency is similar regardless of the previous cache-line state
  • Snooping-based coherence
How about synchronization?
Synchronization implementation
• Hardware support is required to implement synchronization primitives
  • In the form of atomic instructions
  • Common examples include test-and-set, compare-and-swap, etc.
• These are used to implement high-level synchronization primitives
  • e.g., lock/unlock, semaphores, barriers, condition variables, etc.
• We will only discuss test-and-set
Test-And-Set
• The semantics of test-and-set:
  • Record the old value
  • Set the value to TRUE (this is a write!)
  • Return the old value
• Hardware executes it atomically!

  bool test_and_set(bool *flag) {
      bool old = *flag;
      *flag = true;
      return old;
  }

• When executing test-and-set on "flag":
  • What is the value of flag afterwards if it was initially False? True?
  • What is the return result if flag was initially False? True?
Hardware implementation (atomic):
• Read-exclusive (invalidations)
• Modify (change state)
• Memory barrier
  • completes all memory operations before this TAS
  • cancels all memory operations issued after this TAS
Using Test-And-Set
• Here is our lock implementation with test-and-set:

  struct lock {
      int held;   /* initialized to 0 */
  };
  void acquire(struct lock *lock) {
      while (test_and_set(&lock->held))
          ;   /* spin */
  }
  void release(struct lock *lock) {
      lock->held = 0;
  }

• When will the while loop return? What is the value of held?
• Does it work? What about multiprocessors?
TAS and cache coherence
[Diagram sequence: memory initially holds lock->held = 0 and both caches are empty. Thread A executes acquire(lock): its TAS issues a Read-Exclusive request, the line is filled into A's cache, and A's copy becomes lock->held = 1 in the Dirty state. Thread B then executes acquire(lock): its Read-Exclusive request invalidates A's copy (memory is updated to lock->held = 1), and the line is filled into B's cache, which now holds lock->held = 1 in the Dirty state. Each TAS moves the cache line exclusively from one cache to the other.]
What if there are contentions?
[Diagram: memory holds lock->held = 1; Thread A and Thread B both spin in while (TAS(lock)) ;. Every TAS is a write, so the two spinners keep stealing the cache line from each other, generating coherence traffic on every iteration.]
How bad can it be?
[Chart: latency of TAS vs. a plain Store as the number of contending cores grows]
Recall: a TAS is essentially a Store + Memory Barrier.
Takeaway: heavy lock contention may lead to worse performance than serializing the execution!
How to optimize?
• When the lock is held, a contending "acquire" keeps modifying the lock variable to 1
  • Not necessary!

  void test_and_test_and_set(struct lock *lock) {
      do {
          while (lock->held == 1)
              ;   /* spin on a local read */
      } while (test_and_set(&lock->held));
  }
  void release(struct lock *lock) {
      lock->held = 0;
  }
What if there are contentions? (with test-and-test-and-set)
[Diagram sequence: Thread A holds the lock; its cache has lock->held = 1 in the Dirty state. A spinning thread executes while (lock->held == 1) ; and issues a plain read request. A's line is downgraded to the Shared state (memory is updated), and the spinner receives a Shared copy of lock->held = 1. A second spinner likewise ends up with its own Shared copy.]
Repeated reads to a "Shared" cache line: no cache-coherence traffic!
Let’s put everything together
[Chart: latency of Load, Write, and TAS operations compared with a local access]
Implications for programmers
• Cache coherence is expensive (more than you thought)
  • Avoid unnecessary sharing (e.g., false sharing)
  • Avoid unnecessary coherence traffic (e.g., TAS -> TATAS)
  • Requires a clear understanding of the performance
• Crossing sockets is a killer
  • Can be slower than running the same program on a single core!
  • pthreads provide a CPU affinity mask
  • Pin cooperative threads on cores within the same die
• Loads and stores can be as expensive as atomic operations
• Programming gurus understand the hardware
  • So do you now!
  • Have fun hacking!
More details in "Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask", David et al., SOSP '13.