IMPROVING THE PREFETCHING PERFORMANCE
THROUGH CODE REGION PROFILING
Martí Torrents, Raúl Martínez, and Carlos Molina
Computer Architecture Department, UPC – BarcelonaTech
Outline
• Motivation
– Prefetching
– Prefetching in CMPs
– Prefetch adverse behaviors
• Objective
– Proposal
– Code region granularity
– Switch the prefetcher off
– Switch the prefetcher on
• Experimental framework
• Expected results
Motivation
• The number of cores on a single chip grows every year
– Nehalem: 4~6 cores
– Tilera: 64~100 cores
– Intel Polaris: 80 cores
– Nvidia GeForce: up to 256 cores
Prefetching
• Reduces memory latency
• Brings the data the CPU will need next into a nearer cache
• Increases the hit ratio
• Implemented in most commercial processors
• Erroneous prefetching may produce
– Cache pollution
– Resource consumption (queues, bandwidth, etc.)
– Power consumption
Prefetch in CMPs
• Useful prefetching implies more performance
– Avoids network latency
– Reduces memory access latency
• Useless prefetching implies less performance
– More power consumption
– More NoC congestion
– Interference with other cores' requests
Prefetch adverse behaviors
M. Torrents, R. Martínez, C. Molina. “Network Aware Performance Evaluation of Prefetching Techniques in CMPs”. Simulation Modeling Practice and Theory (SIMPAT), 2014.
Prefetch in shared memories
• The prefetcher is distributed
• This entails challenges
– Distributed memory streams
– Distributed prefetch queue
– Statistics are generated and collected at different points
• It makes the prefetcher's task more difficult
• It is harder to prefetch accurately
M. Torrents, et al. “Prefetching Challenges in Distributed Memories for CMPs”, In Proceedings of the International Conference on Computational Science (ICCS'15), Reykjavík (Iceland), June 2015.
Objective
• Maximize the prefetching effect
• By using it only when it is working properly
• Minimizing its adverse effects
Proposal
• Identify when the prefetcher generates slowdown
– Identify code regions at several granularities
– Analyze the prefetcher performance in these regions
– Tag these code regions with statistics
• Switch the prefetcher off
– Save power
– Avoid network contention
– Avoid cache pollution
• Switch it on again
– When it generates speedup
Code Region Granularity
• Divide the code into code regions
– Single instructions, basic blocks, etc., or all the code
    mov ebx, 0                  ; index = 0
    mov eax, 0                  ; sum = 0
    mov ecx, 0
_Label_1:
    mov ecx, [esi + ebx * 4]    ; load the next 4-byte element
    add eax, ecx                ; accumulate it
    inc ebx
    cmp ebx, 100                ; loop over 100 elements
    jne _Label_1
Instruction level
Basic block level
All the code
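The three granularities can be illustrated with a minimal Python sketch that partitions the loop from the slide into regions. The function names, the `BRANCHES` tuple and the rule "a label or a branch ends a basic block" are assumptions made for illustration; the slides do not specify how regions are delimited in hardware.

```python
# Hypothetical sketch: the instruction list mirrors the loop on the slide.
PROGRAM = [
    "mov ebx, 0",
    "mov eax, 0",
    "mov ecx, 0",
    "_Label_1:",
    "mov ecx, [esi + ebx * 4]",
    "add eax, ecx",
    "inc ebx",
    "cmp ebx, 100",
    "jne _Label_1",
]

# Assumed (incomplete) set of control-flow mnemonics that end a basic block.
BRANCHES = ("jmp", "jne", "je", "jz", "jnz", "call", "ret")


def is_label(line):
    return line.endswith(":")


def instructions(program):
    """Return the instructions only, dropping label lines."""
    return [line for line in program if not is_label(line)]


def instruction_regions(program):
    """Instruction level: every instruction is its own region."""
    return [[ins] for ins in instructions(program)]


def basic_block_regions(program):
    """Basic block level: a label starts a block, a branch ends one."""
    blocks, current = [], []
    for line in program:
        if is_label(line):
            if current:
                blocks.append(current)
                current = []
            continue
        current.append(line)
        if line.split()[0] in BRANCHES:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def whole_code_region(program):
    """All the code: a single region covering every instruction."""
    return [instructions(program)]
```

For this loop the sketch yields eight instruction-level regions, two basic blocks (the three initializing `mov` instructions and the five-instruction loop body) and one whole-code region.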
Code Region Granularity
• Divide the code into code regions
– Single instructions, basic blocks, etc., or all the code
• Identify and tag the regions
– Statically (profiling execution)
– Dynamically (during the warm-up)
• Regions are tagged with statistics
– Accuracy / miss ratio
• Activate or deactivate the prefetcher at every new code region
– According to the statistics of the current code region
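A minimal sketch of the tagging idea, assuming the tags are kept in a table keyed by a region identifier and that the decision compares the recorded accuracy against a fixed threshold. `RegionTag`, `ACCURACY_THRESHOLD`, the region names and the "enabled by default for untagged regions" rule are all hypothetical; the slides only state that regions carry accuracy or miss-ratio statistics.

```python
from dataclasses import dataclass


@dataclass
class RegionTag:
    accuracy: float    # useful prefetches / total prefetches, from profiling
    miss_ratio: float  # misses / accesses, from profiling


ACCURACY_THRESHOLD = 0.5  # assumed value, not taken from the slides


def prefetcher_enabled(region_id, tags, default=True):
    """Decide the prefetcher state when execution enters a region."""
    tag = tags.get(region_id)
    if tag is None:
        # Untagged region: fall back to the default state (an assumption).
        return default
    return tag.accuracy >= ACCURACY_THRESHOLD


def schedule(region_trace, tags):
    """Return (region, enabled) for each region entered, in order."""
    return [(region, prefetcher_enabled(region, tags)) for region in region_trace]
```

With a tag table such as `{"init": RegionTag(0.2, 0.1), "loop": RegionTag(0.9, 0.3)}`, the prefetcher would be off in `init` and on in `loop`.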
Switching off the prefetcher
• Detect when the prefetcher is not being useful
• Accuracy
– Useful prefetches / total number of prefetches
– Switch off when the accuracy decreases
• Miss ratio
– Based on the number of misses
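The accuracy metric can be sketched as a pair of counters. The switch-off rule used here (accuracy below a fixed `min_accuracy` after at least `min_samples` prefetches) is an assumption; the slides only say the prefetcher is switched off when the accuracy decreases, without giving a threshold.

```python
class PrefetchMonitor:
    """Counts prefetches and reports accuracy = useful / total."""

    def __init__(self, min_accuracy=0.4, min_samples=4):
        self.min_accuracy = min_accuracy  # assumed threshold
        self.min_samples = min_samples    # assumed warm-up before deciding
        self.useful = 0
        self.total = 0

    def record_prefetch(self, useful):
        """Record one issued prefetch and whether its data was later used."""
        self.total += 1
        if useful:
            self.useful += 1

    def accuracy(self):
        """Return useful / total, or None if no prefetch was issued yet."""
        if self.total == 0:
            return None
        return self.useful / self.total

    def should_switch_off(self):
        """True once enough prefetches were seen and accuracy is too low."""
        if self.total < self.min_samples:
            return False
        return self.accuracy() < self.min_accuracy


def miss_ratio(misses, accesses):
    """Miss ratio = misses / accesses (0.0 when there were no accesses)."""
    return misses / accesses if accesses else 0.0
```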
Switching on the prefetcher
• A switched-off prefetcher does not generate statistics
• It cannot be reactivated based on an accuracy increase
• Reactivate when?
– Based on the miss ratio
– After a certain timeout
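Both reactivation options can be combined in a small sketch. Because a disabled prefetcher produces no accuracy figures, the policy below only observes cache accesses. Measuring the timeout in accesses, the `miss_ratio_threshold` value and the `min_accesses` guard are assumptions made for illustration.

```python
class ReactivationPolicy:
    """Decides when a switched-off prefetcher should be switched on again."""

    def __init__(self, miss_ratio_threshold=0.2, timeout=1000, min_accesses=16):
        self.miss_ratio_threshold = miss_ratio_threshold  # assumed value
        self.timeout = timeout            # assumed: counted in cache accesses
        self.min_accesses = min_accesses  # assumed: avoid deciding too early
        self.accesses = 0
        self.misses = 0

    def on_access(self, miss):
        """Record one access made while the prefetcher is off.

        Returns True when the prefetcher should be reactivated.
        """
        self.accesses += 1
        if miss:
            self.misses += 1
        # Option 1: the timeout has expired.
        if self.accesses >= self.timeout:
            return True
        # Option 2: the miss ratio observed without prefetching is high.
        if self.accesses >= self.min_accesses:
            return self.misses / self.accesses > self.miss_ratio_threshold
        return False
```

A new `ReactivationPolicy` would be created each time the prefetcher is switched off, so the counters only cover the current off period.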
Experimental framework
• gem5
– 16 x86 CPUs
– Ruby memory system
– L1 prefetchers
– MOESI coherence protocol
– Garnet network simulator
• PARSEC 2.1 benchmarks
Simulation environment
Expected Results
• Power savings without losing performance
• Smaller granularity gives more accuracy
– Blocks or super blocks are better than the whole code
– Single instructions are more accurate than blocks or super blocks
• Smaller granularity also requires
– More resources
– More complexity
• Basic block granularity should provide good results with realistic complexity
Q & A
Backup slides
Prefetch Distributed Memory Systems
• Increases the complexity of prefetching
• Challenges without trivial solutions
[Figure, built up over several slides: four CPUs, each with an L1 cache and a prefetcher, on top of a distributed L2 memory. The build steps show addresses @, @+2 and @+4 and carry the following labels.]
• L1 miss for @: distributed patterns
• @, @+2 and @+4: queue filtering
• L1 miss for @+2: dynamic profiling