PageForge: A Near-Memory Content-Aware Page-Merging Architecture
Dimitrios Skarlatos, Nam Sung Kim, and Josep Torrellas
University of Illinois at Urbana-Champaign
MICRO-50 @ Boston
Motivation: Server Consolidation in the Cloud
How to consolidate Main Memory?
Content-Aware Page Merging or Page Deduplication
• Hypervisors search the address space and merge identical pages
[Figure: Guest Virtual, Guest Physical, and Host Physical pages (Page 0, Page 1) for VM-0 and VM-1, before and after merging. After merging, the two identical pages share a single host physical page, Page 0.]
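A minimal Python sketch of the merging step described above. It assumes host memory is modeled as a dictionary from host page number to page contents, and each VM as a dictionary from guest physical page to host page. The names `merge_identical_pages`, `host_pages`, and `vm_maps` are illustrative and do not come from the talk; write protection of the shared page is omitted.

```python
def merge_identical_pages(host_pages, vm_maps):
    """Remap guest pages with identical contents to one host page.

    host_pages: dict host_ppn -> bytes (page contents)
    vm_maps:    dict vm_name -> dict guest_ppn -> host_ppn
    Returns the number of host pages freed.
    """
    first_seen = {}   # page contents -> host_ppn that is kept
    remap = {}        # duplicate host_ppn -> kept host_ppn
    for ppn in sorted(host_pages):
        content = host_pages[ppn]
        if content in first_seen:
            remap[ppn] = first_seen[content]
        else:
            first_seen[content] = ppn
    # Point every guest mapping at the surviving copy.
    for mapping in vm_maps.values():
        for guest_ppn, host_ppn in list(mapping.items()):
            mapping[guest_ppn] = remap.get(host_ppn, host_ppn)
    # Free the duplicate host pages.
    for ppn in remap:
        del host_pages[ppn]
    return len(remap)


PAGE = 4096
host_pages = {
    0: b"A" * PAGE,  # backs a page of VM-0
    1: b"B" * PAGE,  # backs a page of VM-0
    2: b"A" * PAGE,  # backs a page of VM-1, same contents as host page 0
    3: b"C" * PAGE,  # backs a page of VM-1
}
vm_maps = {"VM-0": {0: 0, 1: 1}, "VM-1": {0: 2, 1: 3}}
freed = merge_identical_pages(host_pages, vm_maps)
```

In this toy example one host page is freed, and guest page 0 of both VMs ends up backed by host page 0.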
Problem: Overhead of Software Page Merging
• Search hundreds of millions of pages
• Latency-sensitive applications get disrupted
• Red Hat’s SW page merging (KSM) has execution overhead:
• Average mean latency overhead: 68%
• Average tail latency overhead: 136%
Contribution: PageForge
• First solution for hardware-assisted content-aware page merging
• General, effective, minimal hypervisor involvement & hardware mods
• Reduced overhead vs state-of-the-art software:
• Mean latency overhead: 68% → 10%
• Tail latency overhead: 136% → 11%
• Same memory savings as software: 48% → Twice #VMs
• Novel use of ECC for page content characterization
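The "Twice #VMs" figure follows from simple arithmetic, under the assumption (not stated on the slide) that memory capacity is the limiting resource and that every VM's footprint shrinks by the same 48%:

```latex
\[
\frac{\text{VMs with merging}}{\text{VMs without merging}}
  = \frac{1}{1 - 0.48} \approx 1.92 \approx 2
\]
```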
Content Duplication in the Cloud
Background: Cloud
[Figure: software stack of a virtualized server, built up layer by layer: Infrastructure, Host Operating System, Hypervisor, and three guest VMs, each with its own Guest OS, Libs, and Data]
Background: Content Duplication
[Figure: Infrastructure, Host Operating System, Hypervisor, and three guest VMs, each with its own Guest OS, Libs, and Data; the final build adds the labels "Page Merging" and "Content Duplication"]
Our Proposal: Content Deduplication in HW
[Figure: the same stack, again labeled "Page Merging" and "Content Duplication"]
Software-Based Content-Aware Page Merging
[Figure: a Pool of Pages to Scan and a Pool of Stable Pages, each made of 4KB pages. A Candidate Page is taken from the pages to scan and a Compare Page from the stable pages.]
[Figure: the 4KB Candidate Page and Compare Page travel from Main Memory through the Memory Ctrl, L3 $, L2 $, and L1 $ to the Core, where they are compared (==). Labels show 64B transfers through the memory hierarchy and 1B accesses between the L1 $ and the Core.]
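The comparison loop pictured above can be sketched in Python as follows. This is only an illustration: the linear scan over a list is an assumption made for brevity (KSM itself keeps its pages in trees), and the names `pages_equal` and `find_match` are hypothetical.

```python
PAGE_SIZE = 4096
LINE_SIZE = 64  # the slides show 64B lines moving through the caches


def pages_equal(a: bytes, b: bytes) -> bool:
    """Compare two pages one 64B line at a time, stopping at the first difference."""
    for off in range(0, PAGE_SIZE, LINE_SIZE):
        if a[off:off + LINE_SIZE] != b[off:off + LINE_SIZE]:
            return False
    return True


def find_match(candidate: bytes, stable_pool: list) -> int:
    """Return the index of the first stable page equal to the candidate, or -1."""
    for i, page in enumerate(stable_pool):
        if pages_equal(candidate, page):
            return i
    return -1


stable_pool = [b"\x00" * PAGE_SIZE, b"\x11" * PAGE_SIZE, b"\x22" * PAGE_SIZE]
hit = find_match(b"\x11" * PAGE_SIZE, stable_pool)    # index of the matching page
miss = find_match(b"\x33" * PAGE_SIZE, stable_pool)   # no stable page matches
```

Every line that is compared has to be pulled through the cache hierarchy to the core, which is where the cycle and cache-pollution costs come from.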
Why Is Page Merging Expensive?
• Search hundreds of millions of pages
• Many core cycles consumed
• Caches get polluted
• Latency-sensitive applications get disrupted
Software-Based Content-Aware Page Merging
• Optimization: Identify if the candidate page has recently changed
• If a page changes too often → Not a good candidate
[Figure: JHash is computed over 1KB of the 4KB Candidate Page to produce the Hash Key]
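A minimal sketch of this change-detection check. Two things here are assumptions, not taken from the slides: `zlib.crc32` is used as a stand-in for JHash, and the hashed 1KB is taken to be the first 1KB of the page. The function names are hypothetical.

```python
import zlib

HASHED_BYTES = 1024  # the slide shows the key computed over 1KB of the 4KB page


def page_key(page: bytes) -> int:
    """Stand-in for JHash: CRC32 over the first 1KB of the page (assumed region)."""
    return zlib.crc32(page[:HASHED_BYTES])


def is_good_candidate(page: bytes, old_key):
    """Return (unchanged, new_key); a page whose key changed since the last scan is skipped."""
    new_key = page_key(page)
    return (old_key is not None and new_key == old_key), new_key


page = bytes(4096)
first_seen, key = is_good_candidate(page, None)       # no old key yet
stable, key2 = is_good_candidate(page, key)           # unchanged since last scan
dirty = b"\x01" + page[1:]
changed, _ = is_good_candidate(dirty, key)            # write inside the hashed 1KB
far = page[:2000] + b"\x01" + page[2001:]
missed, _ = is_good_candidate(far, key)               # write outside the hashed 1KB
```

Because only part of the page is hashed, a write outside the hashed region leaves the key unchanged, so the key is a hint and the full comparison still decides whether two pages match.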
PageForge: Hardware Assisted Page-Merging
PageForge Main Idea
• Hardware engine in the memory controller
• Compares pages
• Generates hashes
• Advantages:
✓ No core utilization
✓ No cache pollution
[Figure: multicore chip in which each CPU has its own L2 and L3, connected through an Interconnect to two Memory Controllers, each attached to Memory; PageForge sits at a Memory Controller]
PageForge @ The Memory Controller
[Figure: memory controller block diagram. Between the On-Chip Network and the Off-Chip Memory Interface sit the Read Req Buffer, Write Req Buffer, Scheduler (FR-FCFS), Command Generation, Memory Interface, Read Data Buffer, Write Data Buffer, ECC Enc, and ECC Dec. The added PageForge module contains Control, a Comparator, and the Scan Table.]
PageForge Operation
Operating System
• Loads in Scan Table:
• Candidate page PPN
• PPNs of pages to compare to candidate
PageForge Hardware
• Autonomously performs:
• Sequence of comparisons
• Generation of the hash key for candidate page
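The division of labor above can be modeled with a small Python sketch. The `ScanTable` class, its field names, and `run_scan` are hypothetical and only mirror what the slide lists; the actual table layout and software interface are described in the paper, not here. `zlib.crc32` is a placeholder for the key function.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class ScanTable:
    """Hypothetical model of the Scan Table; field names are illustrative."""
    candidate_ppn: int
    compare_ppns: list = field(default_factory=list)


def run_scan(table: ScanTable, memory: dict, key_fn):
    """Model of the hardware's job: compare the candidate with each listed page
    in order and generate the candidate's hash key.

    Returns (matching PPN or None, hash key of the candidate).
    """
    candidate = memory[table.candidate_ppn]
    match = None
    for ppn in table.compare_ppns:
        if memory[ppn] == candidate:
            match = ppn
            break
    return match, key_fn(candidate)


# The OS side: fill the table, then read back the result.
memory = {10: b"x" * 4096, 11: b"y" * 4096, 12: b"x" * 4096}
table = ScanTable(candidate_ppn=12, compare_ppns=[11, 10])
match, key = run_scan(table, memory, zlib.crc32)
```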
[Figure: flowchart of one scan pass; an "In Hardware" label marks the steps performed by PageForge]
• Pick Candidate Page
• Search Pool of Stable Pages. Match Found? Yes → OS Merges Pages
• No → Old Key = New Key? No → Pick Candidate Page
• Yes → Search Pool of Unprotected Pages. Match Found? Yes → OS Merges Pages
• No → Insert Candidate in Unprotected Pool
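The flowchart above can be walked through in a short Python sketch. It is a software model only: the pools are plain lists of PPNs, the merge itself is reduced to a returned string, and `zlib.crc32` stands in for the hash key. All names are hypothetical.

```python
import zlib


def process_candidate(ppn, memory, stable, unprotected, old_keys, key_fn):
    """One pass of the flowchart for a single candidate page.

    stable / unprotected: lists of PPNs; old_keys: dict ppn -> key from the last pass.
    Returns a string naming the action taken.
    """
    page = memory[ppn]
    # Search the pool of stable pages.
    for other in stable:
        if memory[other] == page:
            return f"merge with stable {other}"
    # No match: has the page changed since the last pass?
    new_key = key_fn(page)
    unchanged = old_keys.get(ppn) == new_key
    old_keys[ppn] = new_key
    if not unchanged:
        return "skip (key changed)"
    # Key unchanged: search the pool of unprotected pages.
    for other in unprotected:
        if other != ppn and memory[other] == page:
            return f"merge with unprotected {other}"
    if ppn not in unprotected:
        unprotected.append(ppn)
    return "inserted in unprotected pool"


memory = {0: b"A" * 4096, 1: b"B" * 4096, 2: b"A" * 4096, 3: b"B" * 4096}
stable, unprotected, old_keys = [0], [], {}
steps = [
    process_candidate(2, memory, stable, unprotected, old_keys, zlib.crc32),
    process_candidate(1, memory, stable, unprotected, old_keys, zlib.crc32),
    process_candidate(1, memory, stable, unprotected, old_keys, zlib.crc32),
    process_candidate(3, memory, stable, unprotected, old_keys, zlib.crc32),
    process_candidate(3, memory, stable, unprotected, old_keys, zlib.crc32),
]
```

In this model a page seen for the first time has no old key, so it is skipped once and only considered for the unprotected pool on a later pass.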
Eliminating The Cost of Hash Keys
Hash Key for Page Content Characterization
• If page changed recently, key may be different
• KSM: JHash → Serial computation + 1KB of data
• PageForge: ECC → Parallel computation + 256B of data
PageForge @ The Memory Controller
[Figure: the memory controller block diagram with the PageForge module, shown again; it includes the ECC Enc and ECC Dec blocks]
Proposal: ECC for Page Content Characterization
• Break a 4KB page into 4 segments
• Pick a random cache line within each segment
• Select the 8-bit ECC of the least significant QWORD
[Figure: a 4K Page split into 4 segments of 16 cache lines each. One 64B Data line with its 8B of ECC is selected per segment, and the four selected 8-bit ECC codes form a 4B Hash Key.]
• Software JHash = 1KB of page data
• ECC Hash = 256 Bytes of page data (4 cache lines × 64B)
• 75% Memory Footprint Reduction for Hash Key Generation!
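The three steps above can be sketched in Python. Several details are assumptions made for illustration: the 8-bit code below is a simple Hamming-style code plus overall parity, not necessarily the code a real memory controller uses; the "least significant QWORD" is taken to be the first 8 bytes of the line in little-endian order; and the random line offsets are assumed to be chosen once and reused so that old and new keys are comparable. All names are hypothetical.

```python
import random

PAGE_SIZE = 4096
LINE_SIZE = 64
LINES_PER_SEGMENT = 16   # 4 segments x 16 lines x 64B = 4KB

# Hamming-style layout: 64 data bits occupy codeword positions 1..71,
# skipping the powers of two, which are reserved for check bits.
DATA_POSITIONS = [p for p in range(1, 72) if p & (p - 1)]


def ecc8(qword: int) -> int:
    """Illustrative 8-bit code for a 64-bit word: 7 Hamming check bits plus overall parity."""
    check = 0
    for bit, pos in enumerate(DATA_POSITIONS):
        if (qword >> bit) & 1:
            check ^= pos
    overall = (bin(qword).count("1") + bin(check).count("1")) & 1
    return (overall << 7) | check


def ecc_hash_key(page: bytes, line_offsets) -> int:
    """Concatenate the ECC of the first QWORD of one line per segment into a 32-bit key."""
    assert len(page) == PAGE_SIZE and len(line_offsets) == 4
    key = 0
    for segment, offset in enumerate(line_offsets):
        start = (segment * LINES_PER_SEGMENT + offset) * LINE_SIZE
        qword = int.from_bytes(page[start:start + 8], "little")
        key = (key << 8) | ecc8(qword)
    return key


rng = random.Random(0)
offsets = [rng.randrange(LINES_PER_SEGMENT) for _ in range(4)]  # chosen once, then reused
page = bytes(PAGE_SIZE)
old_key = ecc_hash_key(page, offsets)

# Flip one bit in the sampled QWORD of segment 0: the key changes.
pos = offsets[0] * LINE_SIZE
changed = page[:pos] + b"\x01" + page[pos + 1:]
new_key = ecc_hash_key(changed, offsets)
```

Only four 64B lines are touched per key, which is the 256 bytes quoted on the slide; relative to JHash's 1KB that is 1 − 256/1024 = 75% less data.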
Evaluation
Simulation Setup
• 2GHz, 10-core, 16GB DRAM
• One VM per core
• Ubuntu 16.04 Host, Ubuntu Cloud Guest
• In-house simulator: Simics + SST + DRAMSim2
• TailBench Suite benchmarks
Configurations
• Baseline: Page-merging is disabled
• KSM: Red Hat’s implementation shipped with current Linux
• PageForge: Hardware-assisted page-merging
Memory Savings – w/o vs w/ Page Merging
[Figure: bar chart of Pages (%), axis 0 to 100, w/o Page Merging vs w/ Page Merging, for Img-Dnn, Masstree, Moses, Silo, Sphinx, and Average; the Average is annotated 48%]
• PageForge Achieves Memory Savings of 48%
• Twice #VMs!!
[Figure: the same chart with pages broken down into Unmergeable, Mergeable Zero, and Mergeable Non-Zero; the Average is annotated 48%]
Mean Latency of Requests
[Figure: bar chart of Normalized Mean Latency, axis 0 to 3, for Baseline, KSM, and PageForge across Img-Dnn, Masstree, Moses, Silo, Sphinx, and Average; the Average is annotated 68% for KSM and 10% for PageForge]
• PageForge Reduces Mean Latency Overhead From 68% Down to 10%!
Tail Latency of Requests
[Figure: bar chart of 95th Percentile Latency, axis 0 to 3 with one off-scale bar labeled 5.18, for Baseline, KSM, and PageForge across Img-Dnn, Masstree, Moses, Silo, Sphinx, and Average; the Average is annotated 136% for KSM and 11% for PageForge]
• PageForge Reduces Tail Latency Overhead From 136% Down to 11%!
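For readers who want to relate the normalized bars to the overhead percentages, here is a small Python sketch of the two metrics. It assumes a nearest-rank definition of the 95th percentile, which the slides do not specify, and the sample values are synthetic, not measurements from the talk.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct: the smallest sample such that
    at least pct% of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100   # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def overhead_pct(value, baseline):
    """Overhead relative to the baseline, in percent (normalized value minus 1)."""
    return (value / baseline - 1) * 100


# Synthetic request latencies in milliseconds, for illustration only.
baseline = [float(x) for x in range(1, 101)]
with_merging = [2 * x for x in baseline]

mean_overhead = overhead_pct(sum(with_merging) / len(with_merging),
                             sum(baseline) / len(baseline))
tail_overhead = overhead_pct(percentile(with_merging, 95),
                             percentile(baseline, 95))
```

With this definition, a normalized latency of 1.68 corresponds to the 68% overhead quoted for KSM's mean latency.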
Also in the Paper
• Software interface to PageForge
• Interaction with the cache coherence protocol
• Alternative designs
• Bandwidth analysis
• ECC vs JHash → ECC keys add negligible collisions
• Power and area at 22nm are negligible
Takeaway: PageForge
• First solution for hardware-assisted content-aware page merging
• General, effective, minimal hypervisor involvement & hardware mods
• Reduced overhead vs state-of-the-art software:
• Mean latency overhead: 68% → 10%
• Tail latency overhead: 136% → 11%
• Same memory savings as software: 48% → Twice #VMs
• Novel use of ECC for content characterization → 75% less memory footprint