Improving Virtual Machine Scheduling in NUMA Multicore Systems
Jia Rao, Xiaobo Zhou (University of Colorado, Colorado Springs)
Kun Wang, Cheng-Zhong Xu (Wayne State University)
http://cs.uccs.edu/~jrao/
Monday, February 25, 13
Multicore Systems
Fundamental platform for datacenters and HPC
- power efficiency & parallelism

Performance degradation and unpredictability
- contention on shared resources: last-level cache, memory controller, ...

NUMA architecture further complicates scheduling
- low NUMA factor; remote access is not the only/main concern

Virtualization
- app- or OS-level optimizations ineffective due to inaccurate virtual topology

Objective: improve performance and reduce variability
Approach: add NUMA and contention awareness to virtual machine scheduling
Motivation
Enumerate different thread-to-core assignments
- 4 threads on a two-socket Intel Westmere NUMA machine
Calculate the worst-to-best degradation

[Figure: % degradation relative to optimal (0-120%) for parallel workloads (lu, sp, ft, ua, bt, cg, mg, ep) and multiprogrammed workloads (sphinx3, mcf, soplex, milc)]

- Scheduling plays an important role for NUMA-sensitive workloads
- Some workloads are NUMA-insensitive
The NUMA Architecture

Two-socket Intel Nehalem NUMA machine: Processor-0 and Processor-1, each with a local memory node (node-0, node-1), connected by a cross-socket interconnect.

- Clustering threads co-locates data and threads, but causes shared-resource contention
- Distributing threads relieves contention, but incurs remote memory access and inter-socket communication overhead

Performance depends on the complex interplay.
Scheduling is hard due to conflicting policies.
Micro-benchmark
- Each of Thread-1 ... Thread-N traverses a random-linked list over its private data (64B nodes); the total private data is the working set size
- Threads update shared data (64B cachelines, totaling the sharing size) with __sync_add_and_fetch
- Configurable parameters: working set size, number of threads, sharing size
Experiment Testbed: Intel Xeon E5620

Number of cores: 4 cores (2 sockets)
Clock frequency: 2.40 GHz
L1 cache:        32KB ICache, 32KB DCache
L2 cache:        256KB unified
L3 cache:        12MB unified, inclusive, shared by 4 cores
IMC:             32GB/s bandwidth, 2 memory nodes, each with 8GB
QPI:             5.86 GT/s, 2 links
Single Factor on Performance

[Figure: normalized runtime.
 (a) Data locality: local vs. remote, WSS 4M-128M (1 thread, sharing disabled).
 (b) LLC contention: sharing LLC vs. separate LLC, WSS 4M-128M (4 threads, sharing disabled, data co-located with threads).
 (c) Sharing overhead: within socket vs. across socket, 1-12 threads (private data disabled, sharing 1 cacheline).]

Data locality:
1. When the WSS is smaller than the LLC, remote access does not hurt performance.
2. When the WSS is beyond LLC capacity, the larger the WSS, the harder the remote-access penalty hits performance.

LLC contention:
1. Scheduling does not affect performance when the WSS fits in the LLC.
2. As the WSS increases, LLC contention has a diminishing effect on performance.

Sharing overhead:
1. Inter-socket sharing overhead increases initially but decreases as more threads are used.

Memory footprint, thread-level parallelism, and inter-thread sharing pattern determine how much each factor affects performance.
Multiple Factors on Performance

Intra-S: clustering threads on node-0. Inter-S: distributing threads.

[Figure: normalized runtime, Intra-S vs. Inter-S.
 (a) Data locality & LLC contention, WSS 4M-128M.
 (b) LLC contention & sharing overhead, sharing size 4K-128K.
 (c) Locality & LLC contention & sharing overhead, WSS 4M-128M.]

1. The dominant factor determines performance.
2. The dominant factor switches as program characteristics change.
Virtualization
Application and guest OS see a virtual topology.
Virtual topology ≠ physical topology:
- flat topology
- inaccurate topology
  • Static Resource Affinity Table (SRAT)
  • inaccurate due to load balancing

[Figure: normalized runtime, accurate vs. inaccurate topology; 4 threads, 32MB WSS, 128KB sharing size; memory policy set with set_mempolicy in libnuma]
Related Work
Optimization via scheduling
- Contention management: [TCS'10], [SIGMETRICS'11], [ISCA'11], [ASPLOS'10]
- Thread clustering: [EuroSys'07]
- NUMA management: [ATC'11], [ASPLOS'13]
Program- and system-level optimizations
- Program transformation: [PPoPP'10], [CGO'12]
- System support: libnuma, page replication and migration

Our work:
1. requires no offline profiling; works online
2. addresses the complex interplays
3. assumes no knowledge of the virtual topology
Uncore Penalty as a Performance Index

The core memory subsystem is distinguished from the uncore memory subsystem. Shared resource contention, remote memory access, and inter-socket communication overhead all surface in the uncore, as uncore stall cycles.

Uncore penalty = average uncore latency. By Little's law, with the number of outstanding parallel uncore requests roughly constant:

    uncore penalty ∝ uncore stall cycles = cycles with at least one outstanding "critical" uncore request

where "critical" uncore request = demand data/instruction load + RFO request.
Effectiveness of the Uncore Penalty Metric

[Figure: runtime, LLC miss rate, and uncore penalty, plotted as relative differences to Intra-S.
 (a) Data locality & LLC contention, WSS 4M-128M.
 (b) LLC contention & sharing overhead, sharing size 4K-128K.
 (c) Locality & LLC contention & sharing overhead, WSS 4M-128M.]

• LLC miss rate only agrees with runtime in a subset of runs
• Strong linear relationship between uncore penalty and runtime

Linear correlation coefficient r (against runtime): uncore penalty r = 0.91, LLC miss rate r = 0.61
NUMA-aware Virtual Machine Scheduling
Monitoring uncore penalty
- calculate each vCPU's penalty based on PMU readings
- update the penalty when performing periodic scheduler bookkeeping
Identifying NUMA scheduling candidates
- rely on the application or guest OS to identify NUMA-sensitive vCPUs
vCPU migration
- Bias Random Migration (BRM)
Bias Random Migration

Per-socket view: each memory node has an uncore and cores; the PMU yields a per-vCPU uncore penalty, aggregated into a global penalty per node.

struct vcpu {
    ...
    int c;          // current core
    int bias;       // migration bias
    int prob;       // migration probability
    int unc[NODE];  // array of global penalties
    ...
}

Procedure Update(v)
    n = NUMA_CPU_TO_NODE(v.c)
    update global penalty and save it in v.unc[n]
    min = argmin_x v.unc[x], x = 0, ..., NODE-1
    if v.unc[n] > v.unc[min] then
        v.prob++
        v.bias = min
    else
        v.prob--
    end if
end procedure

Procedure BiasRandomPick(v)
    rand = random() mod 100
    if rand < v.prob then
        select a core in node v.bias
        reset v.prob
    else
        select current core
    end if
end procedure

Procedure Migrate(v)
    Update(v)
    if v is a migration candidate then
        new_core = BiasRandomPick(v)
    else
        new_core = DefaultPick(v)
    end if
end procedure

Push migration targets a node with the minimum global penalty; pull migration (stealing) is left untouched for load balancing.
Implementation
Xen 4.0.2
- patched with Perfctr-Xen to read PMU counters
- update uncore penalty every 10ms when Xen burns credits
- lightweight random number generator using the last 2 digits of the TSC, in range [0, 99]
Guest OS, Linux 2.6.32
- two new hypercalls: tag and clear
- tag a vCPU as a candidate if cpus_allowed is a subset of the online CPUs
Workload
Micro-benchmark
- 4 threads, 128KB sharing size, WSS varied from 4MB to 128MB
Parallel workload
- NAS Parallel Benchmarks, except is
- compiled with OpenMP, busy-waiting synchronization
Multiprogrammed workload
- SPEC CPU2006: 4 identical copies of mcf, milc, soplex, sphinx3
- mixed workload = mcf + milc + soplex + sphinx3
Improving Performance

Hand-optimized: offline-determined best policy.

[Figure: normalized runtime for Xen, BRM, and Hand-optimized. Micro-benchmark (WSS 4M-128M), parallel workloads (lu, sp, ft, ua, bt, cg, ep, mg), multiprogrammed workloads (soplex, mcf, milc, sphinx3, mixed).]

1. BRM outperforms Xen in most experiments and improves performance by up to 31.7%
2. BRM performs close to Hand-optimized and even outperforms it in some cases
3. BRM adds overhead for NUMA-insensitive workloads
Reducing Variation

[Figure: relative standard deviation (%) for Xen, BRM, and Hand-optimized across the micro-benchmark (WSS 4M-128M), parallel workloads, and multiprogrammed workloads.]

1. BRM reduces runtime variation significantly, to no more than 2% on average
Overhead

[Figure: runtime overhead (%) with migration disabled, for 2-, 4-, 8-, and 16-thread workloads (ep, ft, mg, bt, cg, ua, lu, sp).]

1. Less than 2% overhead for 2- and 4-thread workloads
2. BRM incurs up to 6.4% overhead for 8 threads, but is still useful; Amazon EC2's High-CPU Extra Large instance has 8 vCPUs
3. Running 16-thread workloads is problematic
Conclusion and Future Work
Problem
- sub-optimal scheduling and unpredictable performance
Our approach
- uncore penalty as a performance index
- Bias Random Migration for online performance optimization
- improves performance and reduces variability
Future work
- inferring NUMA-sensitive vCPUs
- improving scalability and considering simultaneous multithreading
Backup Slides begin here...
Why Is IPC or CPI Harmful?*

IPC or CPI does not reflect performance, not even for the straggler thread.

[Figure: cycles per instruction over time (0-100 sec) for (a) bt and (b) lu, each running alone, with a busy sibling hyperthread, and with a busy core]

* Wood, MICRO'06
Questions
Why not Cycles per Instruction (CPI)?
- CPI is not useful for multiprocessor workloads
How to improve scalability?
- Li et al., PPoPP'09: relax the consistency requirement on the global update
For which workloads is BRM not useful?
- short-lived jobs
Does BRM affect fairness and priorities?
- No