Hardware Multithreading
Increasing CPU Performance
• By increasing clock frequency
• By increasing instructions per clock (IPC)
• Minimising memory access impact: data cache
• Maximising instruction issue rate: branch prediction
• Maximising instruction issue rate: superscalar
• Maximising pipeline utilisation by avoiding instruction dependencies: out-of-order execution
Increasing Parallelism
• The amount of parallelism that we can exploit is limited by the programs
  – Some areas exhibit great parallelism
  – Some others are essentially sequential
• In the latter case, where can we find additional independent instructions?
  – In a different program!
Hardware Multithreading
• Allow multiple threads to share a single processor
• Requires replicating the independent state of each thread
• Virtual memory can be used to share memory among threads
CPU Support for Multithreading
[Figure: pipeline of Fetch, Decode, Exec, Mem and Write logic with a shared Inst Cache and Data Cache. Per-thread state is duplicated for threads A and B: PCA and PCB, RegA and RegB, VA MappingA and VA MappingB (feeding Address Translation).]
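The replicated state in the figure can be sketched in a few lines of Python. This is only an illustration: the class names `HardwareContext` and `Core`, the 32-entry register file and the 4096-byte page size are assumptions, not details from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class HardwareContext:
    """Per-thread state that the core replicates (hypothetical names)."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 32)  # assumed register count
    page_table: dict = field(default_factory=dict)  # virtual page -> physical frame


@dataclass
class Core:
    """One pipeline shared by several contexts; caches are not modelled."""
    contexts: list
    active: int = 0

    def switch_to(self, thread_id: int) -> None:
        # Nothing is saved or restored: each thread keeps its own PC,
        # registers and address mapping in its own context.
        self.active = thread_id

    def translate(self, vaddr: int, page_size: int = 4096) -> int:
        # Address translation uses the mapping of the active thread.
        ctx = self.contexts[self.active]
        frame = ctx.page_table[vaddr // page_size]
        return frame * page_size + vaddr % page_size


core = Core(contexts=[HardwareContext(), HardwareContext()])
core.contexts[0].page_table[0] = 7
core.contexts[1].page_table[0] = 7  # both threads map page 0 to the same frame
core.contexts[0].pc = 0x100
core.switch_to(1)
```

Mapping the same physical frame in both page tables is how the two threads end up sharing memory through virtual memory.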
Hardware Multithreading
• Different ways to exploit this new source of parallelism
  – Coarse-grain multithreading
  – Fine-grain multithreading
  – Simultaneous multithreading (SMT)
Coarse-Grain Multithreading
• Issue instructions from a single thread
• Operate like a simple pipeline
• Switch thread on an “expensive” operation:
  – E.g. I-cache miss
  – E.g. D-cache miss
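Switch Threads on I-cache miss

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Inst a | IF | ID | EX | MEM | WB | | |
| Inst b | | IF | ID | EX | MEM | WB | |
| Inst c | | | IF MISS | - | - | - | - |
| Inst X | | | | IF | ID | EX | MEM |
| Inst Y | | | | | IF | ID | EX |
| Inst Z | | | | | | IF | ID |

Inst X, Y and Z come from the other thread and take the slots that Inst d, e and f would have used.

• Remove Inst c and switch to the other thread
• The next thread will continue its execution until there is another I-cache or D-cache miss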
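The switch-on-miss policy above can be sketched as a small simulation. It is a toy model under stated assumptions: the function name `coarse_grain_order` is made up, each instruction is a `(name, misses)` pair, and a miss is assumed to have been serviced by the time its thread runs again.

```python
def coarse_grain_order(threads):
    """Return the order in which instructions complete.

    threads: list of instruction lists; each instruction is (name, misses).
    """
    pcs = [0] * len(threads)   # one program counter per thread
    serviced = set()           # misses already handled (assumption: one retry suffices)
    order = []
    current = 0
    while any(pc < len(t) for pc, t in zip(pcs, threads)):
        if pcs[current] >= len(threads[current]):
            current = (current + 1) % len(threads)  # this thread has finished
            continue
        name, misses = threads[current][pcs[current]]
        if misses and (current, pcs[current]) not in serviced:
            # Expensive operation: leave the PC pointing at the missing
            # instruction and switch to the next thread.
            serviced.add((current, pcs[current]))
            current = (current + 1) % len(threads)
            continue
        order.append(name)
        pcs[current] += 1
    return order


blue = [("a", False), ("b", False), ("c", True), ("d", False)]
orange = [("X", False), ("Y", False), ("Z", False)]
print(coarse_grain_order([blue, orange]))
```

With these inputs the blue thread runs a and b, the miss on c hands the pipeline to the orange thread, and c and d complete once the orange thread has nothing left to run.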
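Switch Threads on D-cache miss

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Inst a | IF | ID | EX | M-Miss | MISS | MISS | MISS |
| Inst b | | IF | ID | EX | - | - | - |
| Inst c | | | IF | ID | - | - | - |
| Inst d | | | | IF | - | - | - |
| Inst X | | | | | IF | ID | EX |
| Inst Y | | | | | | IF | ID |

Abort these: Inst b, c and d. Inst X and Y come from the other thread.

• Remove Inst a and switch to the other thread
  – Remove the rest of the instructions from the ‘blue’ thread
  – Roll back the ‘blue’ PC to point to Inst a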
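The abort-and-roll-back step can be sketched as follows. This is a minimal illustration, assuming the pipeline contents are held as a list of `(thread, pc)` pairs, oldest first; the function name and the example addresses are hypothetical.

```python
def squash_on_dcache_miss(in_flight, miss_index):
    """Drop the missing instruction and everything younger from its thread.

    in_flight: list of (thread, pc) pairs, oldest instruction first.
    Returns the instructions that stay in the pipeline and the PC value
    the missing thread must resume from.
    """
    thread, miss_pc = in_flight[miss_index]
    kept = [
        entry
        for i, entry in enumerate(in_flight)
        if i < miss_index or entry[0] != thread
    ]
    return kept, miss_pc


# Inst a (pc 100) misses; b, c and d are in its shadow.
pipeline = [("blue", 100), ("blue", 104), ("blue", 108), ("blue", 112)]
kept, resume_pc = squash_on_dcache_miss(pipeline, 0)
print(kept, resume_pc)
```

The returned PC is the address of the missing instruction itself, so the thread re-executes it after the miss has been serviced.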
Coarse Grain Multithreading
• Good to compensate for infrequent, but expensive, pipeline disruptions
• Minimal pipeline changes
  – Need to abort all the instructions in the “shadow” of a D-cache miss (overhead)
  – Resume the instruction stream to recover
• Short stalls (data/control hazards) are not solved
Fine-Grain Multithreading
• Overlap in time the execution of several threads
• Usually using Round Robin among all the threads in a ‘ready’ state
• Requires instantaneous thread switching
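Round-robin selection among ‘ready’ threads can be sketched like this. The function name `next_ready` and the boolean ready list are illustrative assumptions, not part of the slides.

```python
def next_ready(ready, last):
    """Pick the next thread after `last`, in round-robin order, that is ready.

    ready: list of booleans, one per hardware thread.
    Returns the chosen thread index, or None if no thread is ready.
    """
    n = len(ready)
    for step in range(1, n + 1):
        candidate = (last + step) % n
        if ready[candidate]:
            return candidate
    return None


print(next_ready([True, False, True], last=0))
```

Because the search starts one position after the last thread chosen, a thread is only picked twice in a row when every other thread is not ready.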
Fine-Grain Multithreading
• Multithreading helps alleviate fine-grain dependencies (e.g. forwarding?)
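| | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Inst a | IF | ID | EX | MEM | WB | | |
| Inst M | | IF | ID | EX | MEM | WB | |
| Inst b | | | IF | ID | EX | MEM | WB |
| Inst N | | | | IF | ID | EX | MEM |
| Inst c | | | | | IF | ID | EX |
| Inst P | | | | | | IF | ID |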
I-cache misses in Fine Grain Multithreading
• An I-cache miss is overcome transparently
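| | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Inst a | IF | ID | EX | MEM | WB | | |
| Inst M | | IF | ID | EX | MEM | WB | |
| Inst b | | | IF-MISS | - | - | - | - |
| Inst N | | | | IF | ID | EX | MEM |
| Inst P | | | | | IF | ID | EX |
| Inst Q | | | | | | IF | ID |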
Inst b is removed and the thread is marked as not ‘ready’
‘Blue’ thread is not ready so ‘orange’ is executed
D-cache misses in Fine Grain Multithreading
• Mark the thread as not ‘ready’ and issue only from the other thread
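| | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Inst a | IF | ID | EX | M-MISS | Miss | Miss | WB |
| Inst M | | IF | ID | EX | MEM | WB | |
| Inst b | | | IF | ID | - | - | - |
| Inst N | | | | IF | ID | EX | MEM |
| Inst P | | | | | IF | ID | EX |
| Inst Q | | | | | | IF | ID |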
Thread marked as not ‘ready’. Remove Inst b. Update PC.
‘Blue’ thread is not ready so ‘orange’ is executed
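The fetch pattern in this example can be reproduced with a short trace generator. It is a sketch only: the function name `fetch_trace`, the `blocked` mapping and the choice of cycles 4 to 7 for the not-ready window are assumptions made to mirror the example.

```python
def fetch_trace(threads, cycles, blocked):
    """Return which thread fetches in each cycle (1-based cycle numbers).

    threads: list of thread names, in round-robin order.
    blocked: dict mapping a thread name to the cycles in which it is not ready.
    """
    trace = []
    last = -1
    for cycle in range(1, cycles + 1):
        chosen = None
        for step in range(1, len(threads) + 1):
            idx = (last + step) % len(threads)
            if cycle not in blocked.get(threads[idx], ()):
                chosen = idx
                break
        if chosen is None:
            trace.append(None)  # bubble: no thread is ready this cycle
        else:
            trace.append(threads[chosen])
            last = chosen
    return trace


# The blue thread's miss is detected in cycle 4, so it is not ready from then on.
print(fetch_trace(["blue", "orange"], 6, {"blue": range(4, 8)}))
```

The first three cycles alternate between the two threads; from cycle 4 onwards every fetch slot goes to the orange thread, which is where Inst N, P and Q come from.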
D-cache misses in Fine Grain Multithreading
• In an out-of-order processor we may continue issuing instructions from both threads

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Inst a | IF | RO | EX | M MISS | Miss | Miss | WB |
| Inst M | | IF | RO | EX | MEM | WB | |
| Inst b | | | IF | RO | (RO) | (RO) | EX |
| Inst N | | | | IF | RO | EX | MEM |
| Inst c | | | | | IF | (RO) | (RO) |
| Inst P | | | | | | IF | RO |
Fine Grain Multithreading
• Improves the utilisation of pipeline resources
• Impact of short stalls is alleviated by executing instructions from other threads
• Single thread execution is slowed
• Requires an instantaneous thread switching mechanism
Simultaneous Multi-Threading
• The main idea is to exploit instruction-level parallelism and thread-level parallelism at the same time
• In a superscalar processor, issue instructions from different threads in the same cycle
• Instructions from different threads can be using the same stage of the pipeline
Simultaneous MultiThreading
• Let’s look simply at instruction issue:

[Figure: pipeline diagram over cycles 1 to 10. Instructions a, b, M, N, c, P, Q, d, e and R, drawn from two threads, each pass through IF, ID, EX, MEM and WB, with more than one instruction issued per cycle.]
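Filling several issue slots per cycle from more than one thread can be sketched as below. This is a simplified illustration: the function name `smt_issue` is made up, every queued instruction is assumed to be ready (dependencies are ignored), and taking one instruction from each thread in turn is just one possible policy.

```python
from collections import deque


def smt_issue(queues, width):
    """Return a list of issue groups, one per cycle.

    queues: dict mapping a thread name to its instructions in program order.
    width: number of issue slots available per cycle.
    """
    pending = {thread: deque(insts) for thread, insts in queues.items()}
    schedule = []
    while any(pending.values()):
        slots = []
        progress = True
        while len(slots) < width and progress:
            progress = False
            # Take at most one instruction per thread on each pass.
            for queue in pending.values():
                if queue and len(slots) < width:
                    slots.append(queue.popleft())
                    progress = True
        schedule.append(slots)
    return schedule


print(smt_issue({"blue": ["a", "b", "c"], "orange": ["M", "N"]}, width=2))
```

Unlike fine-grain multithreading, a single cycle here can contain instructions from both threads, and a thread with spare instructions fills slots the other thread cannot use.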
SMT issues
• Asymmetric pipeline stall
  – One part of the pipeline stalls; we want the other pipeline to continue
• Overtaking: we want the unstalled thread to make progress
• Existing implementations are on O-o-O, register-renamed architectures (similar to Tomasulo)
SMT: Glimpse Into The Future?
• Scout threads?
  – A thread to prefetch memory: reduce cache miss overhead
• Speculative threads?
  – Allow a thread to execute speculatively way past branch/jump/call/miss/etc.
  – Needs revised O-o-O logic
  – Needs extra memory support
  – See Transactional Memory
Simultaneous Multi Threading
• Extracts the most parallelism from instructions and threads
• Mostly implemented in out-of-order processors, because they are best placed to exploit that much parallelism
• Has a significant hardware overhead
Benefits of Hardware Multithreading
• All multithreading techniques improve the utilisation of processor resources and, hence, the performance
• If the different threads are accessing the same input data they may be using the same regions of memory
  – Cache efficiency improves in these cases
Disadvantages of Hardware Multithreading
• The perceived performance may be degraded compared with a single-thread CPU
  – Multiple threads interfere with each other
• The cache has to be shared among several threads, so each thread effectively uses a smaller cache
• Thread scheduling at hardware level adds high complexity to processor design
  – Thread state, managing priorities, OS-level information, …
Comparison of Multithreading Techniques
[Figure: issue-slot diagrams over time for Coarse-grain, Fine Grain and SMT, showing numbered instructions from Thread A, Thread B, Thread C and Thread D.]
Multithreading Summary
• A cost-effective way of finding additional parallelism for the CPU pipeline
• Available in x86, Itanium, Power and SPARC
• (Most architectures) Present each additional CPU thread as an additional CPU to the Operating System
• Operating Systems Beware!!! (why?)
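One way to see hardware threads presented as CPUs is to ask the operating system how many CPUs it reports. The snippet below uses Python's standard `os.cpu_count()`, which returns the number of logical CPUs (or `None` if that cannot be determined); it does not distinguish hardware threads from physical cores.

```python
import os

# Logical CPUs as reported by the operating system. On a machine with
# hardware multithreading enabled, this typically counts hardware threads,
# not only physical cores.
logical_cpus = os.cpu_count()
print(f"Logical CPUs visible to the OS: {logical_cpus}")
```

A scheduler that treats each of these as a full, independent CPU may place two busy threads on the same physical core while another core sits idle.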