Pierre Michaud 2nd data prefetching championship, june...
![Page 1: Pierre Michaud 2nd data prefetching championship, june 2015comparch-conf.gatech.edu/dpc2/resource/dpc2_michaud_slides.pdf• Sandbox prefetcher (Pugsley et al., HPCA 2014) - first](https://reader033.fdocuments.us/reader033/viewer/2022041505/5e24a10e177fe64c7724db53/html5/thumbnails/1.jpg)
A best-offset prefetcher
Pierre Michaud
2nd Data Prefetching Championship, June 2015
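DPC2 rules
[Figure: core, L1, L2, L3 and DRAM in the DPC2 simulator, with the prefetcher attached to the L2 and its MSHR]
• Inputs to the prefetcher on an L2 access: physical address, L2 hit/miss, IP, time
• Inputs to the prefetcher on an L2 fill: L2 fill line, L2 victim line, time
• The prefetcher can read the MSHR occupancy
• Prefetch address must lie in the same 4KB page as the demand address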
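The same-page rule can be checked by comparing page numbers. A minimal sketch, assuming byte addresses and 4KB pages; the function name is hypothetical and is not part of the DPC2 simulator:

```python
PAGE_SHIFT = 12  # 4KB pages: 2**12 bytes


def same_page(demand_addr: int, prefetch_addr: int) -> bool:
    """Return True if both byte addresses fall in the same 4KB page."""
    return (demand_addr >> PAGE_SHIFT) == (prefetch_addr >> PAGE_SHIFT)
```

A prefetcher would discard any candidate address for which this check fails.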
Offset prefetching
• For a demand access to line X, prefetch line X+O, where O is the prefetch offset
• Next-line prefetching: O=1
• Full-fledged offset prefetcher: varying offset
• Sandbox prefetcher (Pugsley et al., HPCA 2014)
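As an illustration of offset prefetching, here is a small sketch working on line addresses. It assumes 64-byte lines (as in the slides' examples) and applies the 4KB same-page rule; the function name is hypothetical:

```python
LINE_BYTES = 64                      # line size assumed from the slides' examples
LINES_PER_PAGE = 4096 // LINE_BYTES  # 64 lines per 4KB page


def offset_prefetch(demand_line: int, offset: int):
    """Return the line X+O to prefetch, or None if it leaves the demand line's page."""
    target = demand_line + offset
    if target // LINES_PER_PAGE != demand_line // LINES_PER_PAGE:
        return None  # crossing a page boundary is not allowed
    return target
```

With `offset=1` this is next-line prefetching; a full-fledged offset prefetcher changes `offset` over time.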
Proposed Best-Offset (BO) prefetcher
• New method for setting the offset automatically
  - different from Sandbox
  - first implementation in an in-house simulator in 2011
• Prefetch throttling method (bandwidth & cache pollution)
  - somewhat specific to DPC2
  - DPC2 rules limit what can be done
Sequential stream
• if the offset is too small, prefetches may not be timely
(neglect page boundary effect)
[Figure: sequential stream over 64-byte lines at byte addresses 0 to 448, accessed in order 1 to 8, with offset=2]
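A rough way to see why a small offset can be late. This sketch rests on simplifying assumptions that are not in the slides: the stream touches one new line every `access_interval` cycles, and a prefetch always takes `latency` cycles to complete.

```python
def cycles_late(offset: int, access_interval: int, latency: int) -> int:
    """Cycles a demand access still waits for a prefetch issued `offset` lines ahead."""
    lead_time = offset * access_interval  # time between prefetch issue and demand use
    return max(0, latency - lead_time)
```

With hypothetical values of 20 cycles per access and a 200-cycle latency, an offset of 2 leaves the demand access waiting, whereas an offset of 10 or more hides the latency entirely.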
Strided stream
• example: stride=+96 bytes
• a constant byte-stride gives a periodic sequence of line-strides (1,2,1,2,...)
• offset = sum of line-strides in a period (offset=1+2=3)
• ...or multiple of that sum (6,9,...)
[Figure: 64-byte lines at byte addresses 0 to 448, accesses 1 to 6 with a +96-byte stride, offset=3]
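The stride example can be worked through numerically. A small sketch, assuming 64-byte lines and a positive byte stride; both function names are hypothetical:

```python
from math import gcd

LINE_BYTES = 64  # line size assumed from the slides' examples


def line_strides(byte_stride: int, count: int, start: int = 0):
    """Line-address differences between consecutive accesses of a constant-stride stream."""
    lines = [(start + i * byte_stride) // LINE_BYTES for i in range(count)]
    return [b - a for a, b in zip(lines, lines[1:])]


def period_sum(byte_stride: int) -> int:
    """Sum of the line-strides over one period of the pattern."""
    # The pattern repeats once the byte position realigns with a line boundary.
    accesses_per_period = LINE_BYTES // gcd(byte_stride, LINE_BYTES)
    return byte_stride * accesses_per_period // LINE_BYTES
```

For the +96-byte stride, `line_strides(96, 6)` gives `[1, 2, 1, 2, 1]` and `period_sum(96)` gives 3, matching the offset on the slide.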
Interleaved streams
• 1st stream alone: offset = multiple of 3
• 2nd stream alone: offset = multiple of 2
• Both streams: offset = multiple of 6
[Figure: two interleaved streams, the first with accesses 1 to 6 and the second with accesses 1 to 4, both covered by offset=6]
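The offset that suits both streams is a common multiple of the two per-stream values, and the smallest one is their least common multiple. A one-function sketch (the name is hypothetical):

```python
from math import gcd


def common_offset(period_a: int, period_b: int) -> int:
    """Smallest offset that is a multiple of both streams' period sums."""
    return period_a * period_b // gcd(period_a, period_b)
```

For the two streams on the slide, `common_offset(3, 2)` is 6.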
BO prefetcher: main idea
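• On a demand access to line X (miss or prefetched hit), prefetch line X+O, where O is the current prefetch offset
• When a prefetched line Y is filled, write Y-O into the recent requests table
• On a demand access to line X, best-offset learning tests one candidate offset O' by looking up X-O' in the recent requests table (hit/miss?)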
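The data flow on this slide can be modeled in a few lines. This is only a toy sketch: the recent requests table is an unbounded Python set rather than a small hardware table, and the class and method names are hypothetical.

```python
class BestOffsetCore:
    """Toy model of the slide's data flow; the RR table is an unbounded set here."""

    def __init__(self, offset: int):
        self.offset = offset          # current prefetch offset O
        self.recent_requests = set()  # stands in for the recent requests table

    def on_demand(self, line_x: int) -> int:
        """Demand access (miss or prefetched hit) to line X: prefetch X+O."""
        return line_x + self.offset

    def on_prefetch_fill(self, line_y: int) -> None:
        """Prefetched line Y is filled: record the base address Y-O."""
        self.recent_requests.add(line_y - self.offset)

    def test_offset(self, line_x: int, candidate: int) -> bool:
        """Look up X-O' for a candidate offset O' in the recent requests."""
        return (line_x - candidate) in self.recent_requests
```

A hit on X-O' means that a prefetch issued at X-O' with offset O' would have targeted X and would already have completed, which is why a hit counts in favor of O'.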
Recent Requests (RR) Table
• in 2011: 64-entry fully-associative FIFO
• for DPC2: two direct-mapped banks with different hashing
  - resembles 2-way skewed-associative
  - 2 x 64 x 12-bit tags = 1536 bits
• Write same tag redundantly in both banks
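A sketch of the two-bank organization described above. The bank count, entry count and tag width come from the slide; the two index hash functions and the choice of tag bits are assumptions made for illustration, since the slide does not give them.

```python
RR_ENTRIES = 64  # entries per bank
TAG_BITS = 12    # tag width per entry


class RecentRequests:
    """Two direct-mapped banks indexed with different hash functions."""

    def __init__(self):
        self.banks = [[None] * RR_ENTRIES, [None] * RR_ENTRIES]

    @staticmethod
    def _index(line: int, bank: int) -> int:
        # Hypothetical hashes: each bank folds a different group of address bits.
        shift = 6 if bank == 0 else 12
        return (line ^ (line >> shift)) % RR_ENTRIES

    @staticmethod
    def _tag(line: int) -> int:
        # Hypothetical tag: the 12 bits above the low 6 line-address bits.
        return (line >> 6) & ((1 << TAG_BITS) - 1)

    def insert(self, line: int) -> None:
        """Write the same tag redundantly in both banks."""
        tag = self._tag(line)
        for bank in (0, 1):
            self.banks[bank][self._index(line, bank)] = tag

    def hit(self, line: int) -> bool:
        """A lookup hits if either bank holds the tag at its index."""
        tag = self._tag(line)
        return any(self.banks[b][self._index(line, b)] == tag for b in (0, 1))
```

Because the banks use different hashes, two lines that collide in one bank usually land in different entries of the other, so the older line can survive there.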
Learning the best offset
• 46 different offsets evaluated
  - 23 positive + 23 negative
  - 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,20,24,30,32,36,40
• Each offset has a 5-bit score
  - 46 x 5 = 230 bits
• Test the 46 offsets successively (46 L2 accesses) = one round
  - if hit in RR table for an offset, increment its score
• Learning phase finishes after 100 rounds, or if one of the scores reaches 31
  - select the offset with the greatest score: this is the new prefetch offset
  - new learning phase starts: reset scores
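The learning procedure above can be sketched as a small state machine. The offset list, score limit and round limit come from the slide; the initial offset of 1, the tie-breaking rule (first candidate with the greatest score) and all names are assumptions. The RR lookup is passed in as a callable so the sketch stays independent of any particular table.

```python
OFFSETS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
           18, 20, 24, 30, 32, 36, 40]
CANDIDATES = OFFSETS + [-o for o in OFFSETS]  # 23 positive + 23 negative
SCORE_MAX = 31   # 5-bit score
ROUND_MAX = 100


class OffsetLearner:
    def __init__(self, initial_offset: int = 1):  # initial value is an assumption
        self.prefetch_offset = initial_offset
        self._reset()

    def _reset(self) -> None:
        self.scores = [0] * len(CANDIDATES)
        self.position = 0  # next candidate to test
        self.round = 0

    def on_access(self, line_x: int, rr_hit) -> None:
        """Test one candidate offset per L2 access; rr_hit(line) -> bool."""
        i = self.position
        if rr_hit(line_x - CANDIDATES[i]):
            self.scores[i] += 1
        self.position += 1
        if self.position == len(CANDIDATES):  # all 46 tested: one round done
            self.position = 0
            self.round += 1
        if self.scores[i] >= SCORE_MAX or self.round >= ROUND_MAX:
            # End of the learning phase: adopt the best-scoring offset.
            best = max(range(len(CANDIDATES)), key=self.scores.__getitem__)
            self.prefetch_offset = CANDIDATES[best]
            self._reset()
```

In a full prefetcher, `rr_hit` would be the recent requests table lookup, and the best score would also feed the throttling decision.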
Prefetch timeliness vs. prefetch accuracy
• BO prefetcher tries to do timely prefetches
• However...
• Sometimes it is better to choose a smaller offset, even if it generates late prefetches
  - Example: short sequential streams
• Imperfect solution: delay queue
BO prefetcher with a delay queue
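• On a demand access to line X (miss or prefetched hit), prefetch line X+O
• When a prefetched line Y is filled, write Y-O into the recent requests table, shown here as its two banks (RR left, RR right)
• Demand line X is also written into the recent requests table, after passing through a delay queue (delay 60 cycles)
• Best-offset learning tests a candidate offset O' by looking up X-O' in the recent requests table (hit/miss?)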
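A sketch of a delay queue with the parameters given in the slides (60-cycle delay, 15 slots). The policy of dropping a new entry when the queue is full is an assumption, as are the class and method names.

```python
from collections import deque

DELAY_CYCLES = 60
DQ_SLOTS = 15


class DelayQueue:
    def __init__(self):
        self.slots = deque()  # (ready_time, line) pairs, oldest first

    def push(self, line_x: int, now: int) -> bool:
        """Queue demand line X; returns False if dropped because the queue is full."""
        if len(self.slots) >= DQ_SLOTS:
            return False  # overflow policy assumed, not taken from the slides
        self.slots.append((now + DELAY_CYCLES, line_x))
        return True

    def pop_ready(self, now: int):
        """Return the lines whose delay has elapsed, oldest first."""
        ready = []
        while self.slots and self.slots[0][0] <= now:
            ready.append(self.slots.popleft()[1])
        return ready
```

Each line returned by `pop_ready` would then be written into the recent requests table, so that a demand access only becomes visible to the learning logic some time after it occurred.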
Prefetch throttling (DPC2)
• Turn prefetch on only if BO score > BADSCORE
  - DPC2: BADSCORE=1 (10 for small L3 config)
  - best-offset learning continues while prefetch is off
• Drop prefetch request if MSHR occupancy is above a threshold
  - Vary MSHR threshold depending on BO score and L3 access rate
[Figure: MSHR threshold setting (LOW or HIGH) as a function of L3 access rate (marks at 0, 50% BW and DRAM BW) and BO score (marks at 20 and 31)]
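The two throttling rules can be sketched as follows. The slides do not give the numeric MSHR thresholds, so the threshold is left as a parameter for the caller to choose; the function names are hypothetical.

```python
BAD_SCORE = 1  # DPC2 value from the slide (10 for the small L3 configuration)


def prefetch_enabled(best_score: int, bad_score: int = BAD_SCORE) -> bool:
    """Prefetch is on only if the best-offset score exceeds BADSCORE."""
    return best_score > bad_score


def should_issue(best_score: int, mshr_occupancy: int, mshr_threshold: int,
                 bad_score: int = BAD_SCORE) -> bool:
    """Issue a prefetch only if prefetch is on and the MSHR is not too full."""
    if not prefetch_enabled(best_score, bad_score):
        return False
    # Drop the request when MSHR occupancy is above the threshold.
    return mshr_occupancy <= mshr_threshold
```

A caller would pick `mshr_threshold` from the BO score and the measured L3 access rate.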
State (number of bits)
• prefetch bits (1 bit per L2 line): 2048
• recent requests (2x64x12): 1536
• scores (46x5): 230
• delay queue (15 slots): 473
• miscellaneous: 74
• TOTAL: 4361 bits
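The total can be checked against the individual entries. In this short sketch, the delay queue and miscellaneous figures are taken as given from the slide, since their breakdown is not shown there:

```python
budget_bits = {
    "prefetch bits": 2048,          # 1 bit per L2 line
    "recent requests": 2 * 64 * 12,
    "scores": 46 * 5,
    "delay queue": 473,             # 15 slots, figure as given
    "miscellaneous": 74,            # figure as given
}
total_bits = sum(budget_bits.values())
```

The sum is 4361 bits, in agreement with the slide.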
Fixed vs. adaptive offset (437.leslie3d)
[Figure: speedup (y-axis, 0.9 to 1.35) versus fixed offset (x-axis, 1 to 40), compared with BOP and BOP w/o DQ]
Fixed vs. adaptive offset (433.milc)
[Figure: speedup (y-axis, 0.9 to 1.35) versus fixed offset (x-axis, 1 to 40), compared with BOP and BOP w/o DQ]
Fixed vs. adaptive offset (434.zeusmp)
[Figure: speedup (y-axis, 0.9 to 1.4) versus fixed offset (x-axis, 1 to 40), compared with BOP and BOP w/o DQ]
BO prefetcher vs. Sandbox prefetcher
• Sandbox prefetcher (Pugsley et al., HPCA 2014)
  - first published full-fledged offset prefetcher
  - fake prefetches: evaluate an offset by setting bits in a Bloom filter
  - if a demand access hits in the Bloom filter, the fake prefetch was successful
  - prefetch timeliness not considered
  - Sandbox method is orthogonal to offset prefetching
• BO prefetcher
  - no fake prefetches
  - strives for prefetch timeliness
End