Meeting Midway: Improving CMP Performance with Memory-Side Prefetching
![Page 1: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/1.jpg)
Meeting Midway: Improving CMP Performance with Memory-Side Prefetching
Praveen Yedlapalli, Jagadish Kotra, Emre Kultursay, Mahmut Kandemir, Chita R. Das and Anand Sivasubramaniam
The Pennsylvania State University
![Page 2: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/2.jpg)
Summary
• In modern multi-core systems, an increasing number of cores share common resources
– “Memory Wall”
• Application/core contention leads to interference
• Proposal: a novel memory-side prefetching scheme
– Mitigates interference while exploiting row buffer locality
• Average 10% improvement in application performance
![Page 3: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/3.jpg)
Outline
• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion
![Page 4: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/4.jpg)
Network On-Chip based CMP
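[Figure: tiled CMP connected by a network-on-chip. Each tile has a core (C), L1 caches, an L2 bank and a router (R). Four memory controllers (MC0 to MC3) attach to the network. Request messages and response messages travel between the tiles and the memory controllers.]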
![Page 5: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/5.jpg)
Memory Controller
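[Figure: a memory controller (MC) between the CPU and DRAM (Bank 0, Bank 1), with a request queue holding F21, G12, C41, B5, H22, B4. Row A is open when request B4 arrives.
Row Buffer Conflict: precharge row A, activate row B.
Row Buffer Hit: the requested row B is already in the row buffer.]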
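The hit/conflict distinction above can be sketched in a few lines. This is a minimal illustration, assuming an open-page policy and one row buffer per bank; the function name `classify_requests` and the `"miss"` label for an idle bank are hypothetical and do not come from the slides.

```python
from typing import Dict, List, Optional, Tuple


def classify_requests(
    requests: List[Tuple[int, str]],
    open_rows: Optional[Dict[int, str]] = None,
) -> List[str]:
    """Classify each (bank, row) request against the bank's open row.

    Assumption: open-page policy, so the accessed row stays open afterwards.
    """
    state = dict(open_rows or {})  # bank -> currently open row
    outcomes = []
    for bank, row in requests:
        current = state.get(bank)
        if current == row:
            outcomes.append("hit")        # data is already in the row buffer
        elif current is None:
            outcomes.append("miss")       # idle bank: activate only
        else:
            outcomes.append("conflict")   # precharge old row, activate new row
        state[bank] = row
    return outcomes


# Row A is open in bank 0; B4 conflicts, then B5 hits the newly opened row B.
print(classify_requests([(0, "B"), (0, "B")], {0: "A"}))
```

A conflict pays for a precharge and an activate before the column access, which is why the order in which requests reach the bank matters.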
![Page 6: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/6.jpg)
Outline
• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion
![Page 7: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/7.jpg)
Impact of Interference
[Figure: row buffer hit rate (%, 0 to 100) for bzip2, GemsFDTD, lbm, libquantum, mcf, milc, sphinx3 and zeusmp, comparing each application run individually against the same application in Mix-8.]
How to handle this negative interference?
![Page 8: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/8.jpg)
Latency Breakdown of L2 Miss
| Workload class | On-chip | Off-chip queueing | Off-chip access |
|---|---|---|---|
| High MPKI | 18% | 60% | 22% |
| Moderate MPKI | 35% | 19% | 46% |
| Low MPKI | 43% | 8% | 49% |

Off-chip latency (queueing plus access) is the majority part
![Page 9: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/9.jpg)
Observations
• Memory requests from multiple cores interleave at the memory controllers
– Row buffer locality of individual apps is lost
• Off-chip latency makes up the majority of memory access latency
• On-chip network and caches are critical
– Cannot afford to pollute them
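The first observation can be illustrated with a toy model. The sketch below assumes a single bank with an open-page policy and two hypothetical applications that each stream through their own row; the names and the round-robin interleaving are illustrative assumptions, not the workloads measured in the talk.

```python
from typing import List


def row_buffer_hit_rate(rows: List[str]) -> float:
    """Fraction of accesses that find their row already open (single bank)."""
    if not rows:
        return 0.0
    open_row = None
    hits = 0
    for row in rows:
        if row == open_row:
            hits += 1
        open_row = row  # open-page policy: the accessed row stays open
    return hits / len(rows)


app_a = ["A"] * 8  # application A streams through row A
app_b = ["B"] * 8  # application B streams through row B

# Requests from both applications arrive interleaved at the memory controller.
interleaved = [row for pair in zip(app_a, app_b) for row in pair]

print(row_buffer_hit_rate(app_a))        # 0.875 when run alone
print(row_buffer_hit_rate(interleaved))  # 0.0 when interleaved
```

Each stream has high locality on its own, yet strict interleaving turns every access into a conflict in this model.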
![Page 10: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/10.jpg)
What about Cache Prefetching?
• Not effective for large CMPs
• Agnostic to memory state
– Gap between caches and memory (62% latency increase)
• On-chip resource pollution
– Both caches and network (22% network latency increase)
• Difficulty of stream detection in S-NUCA
– Each L2 bank caters to only a portion of the address space
– Each L2 bank gets requests from multiple L1s
• Our memory-side prefetching scheme can work along with core-side prefetching
![Page 11: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/11.jpg)
Outline
• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion
![Page 12: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/12.jpg)
Memory-Side Prefetching
• Objective 1
– Reduce off-chip access latency
• Objective 2
– Without increasing on-chip resource contention
![Page 13: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/13.jpg)
Memory-Side Prefetching
What to Prefetch?
When to Prefetch?
Where to Prefetch?
![Page 14: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/14.jpg)
What to Prefetch?
• Prefetch from an open row
– Minimizes overhead
• Looked at the line access patterns within a row
![Page 15: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/15.jpg)
What to Prefetch?
[Figure: milc. Distribution of accesses (% of accesses, 0 to 45) over the cache lines of a row; axes are labeled by line index within the row (Line 0 to Line 60).]
![Page 16: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/16.jpg)
What to Prefetch?
[Figure: libquantum. Distribution of accesses (% of accesses, 0 to 100) over the cache lines of a row; axes are labeled by line index within the row (Line 0 to Line 56).]
[Figure: omnetpp. Distribution of accesses (% of accesses, 0 to 20) over the cache lines of a row; axes are labeled by line index within the row (Line 0 to Line 60).]
In general, next-line locality is good
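One way to quantify next-line locality is the fraction of consecutive accesses within a row that go to the line immediately after the previous one. The sketch below is only an illustration of that idea; the metric, the function name `next_line_fraction` and the sample trace are assumptions, not the measurement used for the plots.

```python
from typing import List


def next_line_fraction(lines: List[int]) -> float:
    """Fraction of consecutive access pairs where the second line is previous + 1."""
    if len(lines) < 2:
        return 0.0
    transitions = len(lines) - 1
    next_line = sum(1 for prev, cur in zip(lines, lines[1:]) if cur == prev + 1)
    return next_line / transitions


# Hypothetical trace of line indices accessed within one open row.
trace = [0, 1, 2, 3, 10, 11, 12, 40]
print(next_line_fraction(trace))  # 5 of the 7 transitions are next-line
```

A high value suggests that fetching the next few lines of an already open row is likely to be useful.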
![Page 17: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/17.jpg)
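When to Prefetch?

| Prefetch trigger | Critical Path | Locality | # of Prefetches |
|---|---|---|---|
| At row buffer conflict (RBC) | Yes | No | High |
| At row buffer hit (RBH) | No | Yes | Low |
| At row activation (ACT) | No | No | High |
| At idle | No | Yes | High |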
[Figure: histogram of idle periods by length in cycles (1 to 33+); y-axis: number of idle periods (0 to 1,000,000), with one off-scale bar labeled 5,618,579.]
![Page 18: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/18.jpg)
Where to Prefetch?
• Should be stored on-chip
• Prefetch buffers in the memory controllers
– To avoid on-chip resource pollution
• Organization
– Per-core
– Shared
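A per-core organization can be sketched as one small bounded buffer per core inside the memory controller, checked before a demand request is sent to DRAM. The class below is a hypothetical illustration: the FIFO eviction policy, the class and method names, and the choice to remove a line on a hit are all assumptions rather than details from the talk.

```python
from collections import OrderedDict
from typing import Any, List, Tuple


class PerCorePrefetchBuffer:
    """Hypothetical per-core prefetch buffers held in a memory controller."""

    def __init__(self, num_cores: int, lines_per_core: int) -> None:
        self.lines_per_core = lines_per_core
        self.buffers: List[OrderedDict] = [OrderedDict() for _ in range(num_cores)]

    def insert(self, core: int, line_addr: str, data: Any = None) -> None:
        buf = self.buffers[core]
        buf[line_addr] = data
        buf.move_to_end(line_addr)
        if len(buf) > self.lines_per_core:
            buf.popitem(last=False)  # evict the oldest line of this core only

    def lookup(self, core: int, line_addr: str) -> Tuple[bool, Any]:
        """Return (hit, data). A hit removes the line, since it now moves on-chip."""
        buf = self.buffers[core]
        if line_addr in buf:
            return True, buf.pop(line_addr)
        return False, None


pb = PerCorePrefetchBuffer(num_cores=2, lines_per_core=2)
for addr in ("A11", "A12", "A13"):
    pb.insert(0, addr, data=f"data-{addr}")
print(pb.lookup(0, "A11"))  # evicted by A13: (False, None)
print(pb.lookup(0, "A12"))  # (True, 'data-A12')
```

With per-core buffers, one core's prefetches cannot evict another core's lines; a shared buffer would trade that isolation for better use of the total capacity.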
![Page 19: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/19.jpg)
Memory-Side Prefetching Optimizations
• Applications vary in memory behavior
• Prefetch Throttling
– Feedback
• Precharge on Prefetch
– Less likely to get a request
• Avert Costly Prefetches
– Waiting demand requests
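Two of these optimizations can be sketched as small policy functions. The slides only name the ideas (feedback-driven throttling, and avoiding prefetches while demand requests wait), so everything concrete below is an assumption: the accuracy thresholds, the degree limits and both function names are hypothetical.

```python
def next_prefetch_degree(
    degree: int,
    useful: int,
    issued: int,
    low: float = 0.25,
    high: float = 0.75,
    max_degree: int = 4,
) -> int:
    """Feedback throttling: adapt how many lines are prefetched per trigger.

    Assumption: accuracy = useful / issued over the last interval. The degree
    never drops below 1 here, so feedback keeps arriving in later intervals.
    """
    if issued == 0:
        return degree  # no feedback collected yet
    accuracy = useful / issued
    if accuracy < low:
        return max(1, degree - 1)
    if accuracy > high:
        return min(max_degree, degree + 1)
    return degree


def should_issue_prefetch(waiting_demands: int) -> bool:
    """Avert costly prefetches: skip prefetching while demand requests wait."""
    return waiting_demands == 0


print(next_prefetch_degree(2, useful=1, issued=10))  # low accuracy: 1
print(next_prefetch_degree(2, useful=9, issued=10))  # high accuracy: 3
print(should_issue_prefetch(waiting_demands=2))      # False
```

Applications with poor next-line locality would settle at a low degree in this scheme, while streaming applications would climb toward the maximum.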
![Page 20: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/20.jpg)
Memory-Side Prefetching: Example
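[Figure: memory controller (MC) between the CPU and DRAM (Bank 0, Bank 1), with a request queue holding F21, G12, C41, C26, H22, A10. Per-core prefetch buffers in the MC hold:
Core 0: A11, A12, A13, A14
Core 1: C32, C33, C34, C36
Core 2: R12, R13, R14, R15
Core 3: F20, F21, F22, F23
Steps shown: request A10 is a row buffer hit on row A; lines A11 to A14 are prefetched from row A; a later request for A11 is answered with A11.]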
![Page 21: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/21.jpg)
Memory-Side Prefetching: Comparison

| | Cache Prefetcher [Lui et al. ILP ‘11] | Existing Memory Prefetchers [Lin HPCA ‘01] | Our Memory-side Prefetcher |
|---|---|---|---|
| Memory State Aware | No | Yes | Yes |
| On-chip resource pollution | Yes | Yes | No |
| Accuracy | Yes | No | Yes |
![Page 22: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/22.jpg)
Implementation
• Prefetch Buffer Implementation
– Organized as n per-core prefetch buffers
– 256 KB per Memory Controller (<3% compared to LLC)
– < 1% area and power overhead
• Prefetch Request Timing
– Requests are generated internally by the memory controller along with a read row buffer hit request
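The request timing described above can be sketched for a single bank: when a demand read hits the open row, the controller also fetches the next few lines of that row into the requesting core's prefetch buffer. This is a simplified illustration, not the authors' implementation; the function name `serve_read`, the prefetch degree of 4, the 64 lines per row and the unbounded buffers are all assumptions.

```python
from typing import Dict, Optional, Set, Tuple

Line = Tuple[str, int]  # (row, line index within the row)


def serve_read(
    core: int,
    row: str,
    line: int,
    open_row: Optional[str],
    prefetch_buffers: Dict[int, Set[Line]],
    degree: int = 4,
    lines_per_row: int = 64,
) -> Tuple[str, Optional[str]]:
    """Serve one demand read; return (where it was served from, new open row)."""
    buf = prefetch_buffers.setdefault(core, set())
    if (row, line) in buf:
        buf.discard((row, line))
        return "prefetch_buffer", open_row  # no DRAM access needed
    if row == open_row:
        # Read row buffer hit: generate next-line prefetches from the open row.
        last = min(line + degree, lines_per_row - 1)
        for nxt in range(line + 1, last + 1):
            buf.add((row, nxt))
        return "row_buffer_hit", open_row
    return "row_buffer_miss", row  # the row is activated; no prefetch is issued


buffers: Dict[int, Set[Line]] = {}
print(serve_read(0, "A", 10, "A", buffers))  # ('row_buffer_hit', 'A')
print(sorted(buffers[0]))                    # lines 11 to 14 of row A
print(serve_read(0, "A", 11, "A", buffers))  # ('prefetch_buffer', 'A')
```

Because the prefetched lines come from a row that is already open, they need no extra precharge or activate, and they stay in the controller until a core actually asks for them.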
![Page 23: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/23.jpg)
Outline
• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion
![Page 24: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/24.jpg)
Evaluation Platform
• Cores: 32 at 2.4 GHz
• Network: 8x4 2D mesh
• Caches: 32KB L1I; 32KB L1D; 1MB L2 per core
• Memory: 16GB DDR3-1600 with 4 memory channels
• GEMS simulator with GARNET
![Page 25: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/25.jpg)
Evaluation Methodology
• Benchmarks:
– Multi-programmed: SPEC 2006 (WL1 to WL5)
– Multi-threaded: SPECOMP 2001 (WL6 & WL7)
• Metrics:
– Harmonic IPC
– Off-chip and on-chip latencies
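The slides do not define "Harmonic IPC". A common reading is the harmonic mean of the per-application IPC values, which penalizes workloads where one application is slowed down much more than the others. The sketch below assumes that reading; the function name and the sample values are hypothetical.

```python
from typing import Sequence


def harmonic_mean_ipc(ipcs: Sequence[float]) -> float:
    """Harmonic mean of per-application IPCs: N / sum(1 / IPC_i)."""
    if not ipcs:
        raise ValueError("need at least one IPC value")
    if any(ipc <= 0 for ipc in ipcs):
        raise ValueError("IPC values must be positive")
    return len(ipcs) / sum(1.0 / ipc for ipc in ipcs)


# Hypothetical two-application mix: the slower application dominates the result.
print(harmonic_mean_ipc([2.0, 1.0]))  # about 1.33, below the arithmetic mean of 1.5
```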
![Page 26: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/26.jpg)
IPC
[Figure: IPC improvement (%, axis from -10 to 20) for WL1 to WL7 and AVG under CSP, MSP, MSP-PUSH, IDLE-PUSH and CSP+MSP. Annotated values: 33.2 (off-scale bar) and 10%.]
![Page 27: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/27.jpg)
Latency
[Figure: latency in cycles (0 to 600) for WL1 to WL7 and AVG under No Pref, CSP, MSP, IDLE-PUSH and CSP+MSP. Annotated value: -48.5%.]
![Page 28: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/28.jpg)
Latency
[Figure: latency in cycles (0 to 600) for WL1 to WL7 and AVG under No Pref, CSP, MSP, MSP-PUSH, IDLE-PUSH and CSP+MSP. Annotated value: -48.5%.]
![Page 29: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/29.jpg)
L2 Hitrate
[Figure: L2 hit rate (%, 0 to 100) for WL1 to WL7 and AVG under No Pref, CSP, MSP and CSP+MSP.]
![Page 30: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/30.jpg)
Row Buffer Hitrate
[Figure: row buffer hit rate (%, 0 to 80) for WL1 to WL7 and AVG under No Pref, CSP, MSP and CSP+MSP.]
![Page 31: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/31.jpg)
Outline
• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion
![Page 32: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/32.jpg)
Conclusion
• Proposed a new memory-side prefetcher
– Opportunistic
– Instantaneous knowledge of memory state
• Prefetching Midway
– Doesn’t pollute on-chip resources
• Reduces the off-chip latency by 48.5% and improves performance by 6.2% on average
• Our technique can be combined with core-side prefetching to amplify the benefits
![Page 33: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching](https://reader035.fdocuments.us/reader035/viewer/2022062520/56816108550346895dd0517e/html5/thumbnails/33.jpg)
Thank You
• Questions?