Evaluation of CPU cgroup
Jun 7th, 2012
Taku Izumi
Fujitsu Limited
LinuxCon Japan 2012
Table of Contents
Outline of CPU cgroup
Evaluation of CFS bandwidth control
CPU bandwidth control of KVM guests
Motivation
Fujitsu sells a cloud computing service for mission-critical customers; in that service, QoS and accounting are very important
The CPU bandwidth control feature provides CPU QoS
CPU resource provisioning according to the service level
The CPU cgroup has been enhanced to support the CFS bandwidth control feature
merged in the Linux 3.2 kernel
RHEL 6.2 bundles a kernel with CFS bandwidth control
CPU cgroup subsystem
The CPU cgroup is one of the cgroup subsystems and provides a control interface to the scheduler
Facilities
“share”
• provisioning of proportional CPU resource through weight
• lower-bound provisioning of CPU resource
bandwidth control for the completely fair scheduler (main topic of this talk)
• aka CFS bandwidth control
• provides an upper limit on CPU resource
• very new feature
bandwidth control for real-time scheduler
“share” facility
cpu.shares specifies the weight used to distribute CPU time
e.g. configure the “Gold” group to receive 1.5x the CPU bandwidth of the “Silver” group (worked out below the figure)
• # echo 3072 > Gold/cpu.shares
• # echo 2048 > Silver/cpu.shares
[Figure: cgroup hierarchy. ROOT (100%) has Gold (cpu.shares=3072, 60%) and Silver (2048, 40%); tasks A and B under Gold and C and D under Silver, each with cpu.shares=1024, receive 30%, 30%, 20% and 20% of the CPU respectively.]
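Worked out from the figure: Gold receives 3072 / (3072 + 2048) = 60% of the CPU and Silver 2048 / 5120 = 40%. Within Gold, A and B have equal shares (1024 each), so each gets half of 60% = 30%; likewise C and D each get half of 40% = 20%.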
Known issue with share facility
CPU time is shared among groups with runnable tasks
The amount of CPU time a group receives depends on the state of neighboring groups
This makes performance hard to predict, which is not good for selling CPU time in an enterprise system
[Figure: e.g. if B and C go idle, A's CPU share rises from 30% to 60% and D's from 20% to 40%, while B and C drop to 0%; the Gold/Silver 60%/40% split is unchanged.]
What we ask for is CFS bandwidth control!
The “share” facility is not suitable for our service
Our requirement is a feature to limit CPU usage according to service class
The service level of low-class users must not exceed that of high-class users, no matter what happens.
CFS Bandwidth control
The bandwidth allowed for a group is specified by quota and period.
Within each given "period" (microseconds), a group is allowed to consume only up to "quota" microseconds of CPU time.
“quota” means maximum run-time in a specified “period”
[Figure: analogy: a fuel allowance of 10 L/week is to a car what a CPU allowance (“quota/period”) of 500 ms per second is to a group.]
CFS Bandwidth control (cont.)
When a group exhausts its quota of CPU time within a period, tasks in that CPU cgroup will not be scheduled again until the next period.
How CFS bandwidth control works
Example
If period is 250ms and quota is also 250ms, the group will get 1 CPU worth of runtime every 250ms.
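(A quota larger than the period grants more than one CPU worth of runtime; for example, a 500 ms quota with a 250 ms period allows 2 CPUs worth.)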
[Figure: timeline over CPU#0-#3 with period = 250 ms; once the group has consumed its quota (summed across the CPUs), it is throttled until the next period.]
cgroup interface
Quota and period are managed within the cpu subsystem of cgroupfs.
cpu.cfs_quota_us:
• The total available run-time within a period (in microseconds; minimum 1 ms)
• “-1” (no restriction) is the default
cpu.cfs_period_us:
• The length of a period (in microseconds; from 1 ms up to 1 s)
• “100000” (100 ms) is the default
e.g. 50% bandwidth (quota/period):
# echo 125000 > cpu.cfs_quota_us /* quota = 125 ms */
# echo 250000 > cpu.cfs_period_us /* period = 250 ms */
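A fuller sketch of the interface (assumptions: the cgroup v1 cpu controller is mounted at /sys/fs/cgroup/cpu; the group name “gold” is just an example):
# mkdir /sys/fs/cgroup/cpu/gold /* create the group */
# echo 250000 > /sys/fs/cgroup/cpu/gold/cpu.cfs_period_us /* period = 250 ms */
# echo 125000 > /sys/fs/cgroup/cpu/gold/cpu.cfs_quota_us /* quota = 125 ms, i.e. 50% of one CPU */
# echo $$ > /sys/fs/cgroup/cpu/gold/tasks /* move the current shell (and its children) into the group */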
Hierarchical considerations
The interface enforces that a child cgroup's quota/period ratio never exceeds its parent's
However, the aggregate quota of the children may exceed the parent's quota, to support work-conserving semantics (see the shell sketch after the figure)
[Figure: a child's bandwidth cannot exceed the parent's bandwidth (a parent of 1000 with a child of 1200 is NG), but the total of the children's bandwidths may exceed the parent's (a parent of 1000 with children of 600, 300 and 400 is OK).]
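A minimal shell sketch of these rules (assumptions: cgroup v1 cpu controller mounted at /sys/fs/cgroup/cpu; the group names are hypothetical; every group keeps the default 100 ms period):
# mkdir -p /sys/fs/cgroup/cpu/parent/child1 /sys/fs/cgroup/cpu/parent/child2
# cd /sys/fs/cgroup/cpu/parent
# echo 50000 > cpu.cfs_quota_us /* parent: 50% of one CPU */
# echo 30000 > child1/cpu.cfs_quota_us /* child1: 30%, within the parent's 50%, accepted */
# echo 80000 > child1/cpu.cfs_quota_us /* child1: 80% > parent's 50%, the write is rejected */
# echo 40000 > child2/cpu.cfs_quota_us /* child2: 40%; the siblings now total 70% > 50%, which is allowed (work-conserving) */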
Hierarchical considerations (cont.)
There are two ways in which a group is throttled:
a. it fully consumes its own quota within a period
b. a parent's quota is fully consumed within its period
In case b) above, even though the child may have runtime remaining, it will not be allowed to run until the parent goes to the next period.
[Figure: a parent with quota 1000 and children with quotas 600, 300 and 400. Case a): the first child has consumed its full 600 while its siblings have used 100 and 150, so only that child is throttled. Case b): the children have consumed 450, 350 and 200; since 450 + 350 + 200 = 1000 exhausts the parent's quota, all of them are throttled.]
Experimental setup
Hardware: Fujitsu PRIMEQUEST 1800 E2
• CPU: Intel Xeon E7-8870 (10 cores / 2.4 GHz) x 2
• Memory: 128 GB
• NIC: Intel 82576NS
OS / software
• kernel: 3.1.0-rc7-tip with CONFIG_CFS_BANDWIDTH=y
• qemu-kvm: 0.14.0-7
• libvirt: 0.9.4-1
About benchmarks
Himeno Benchmark M (256x128x128)
A benchmark for measuring FLOPS, popular in Japan
Unixbench version 5.1.2
Measures the performance of a UNIX-based system
Hackbench
Estimates the time of a chat-like workload
• # hackbench 150 process 500
SysBench (0.4.12-5) oltp test
Benchmarks real database performance
• # sysbench --test=oltp --num-threads=10 --db-driver=mysql --mysql-table-engine=innodb --oltp-table-size=1000000
• using mysqld on localhost
About benchmarks (cont.)
Super Pi benchmark ver. 2.0
Estimates the time of a π calculation up to 1 million (2^20) digits
• # time superpi 20
Original sleep-work-wakeup benchmark
Measures process wake-up latency on a not-busy host
Many pairs of threads wake each other up, each doing a random sleep and a fixed amount of work; because of the sleeps, the CPUs are not fully used
Used to emulate the latency of a customer's job-queuing pipeline (waiting for events and queuing jobs)
[Figure: Thread 1 and Thread 2 alternate WORK and SLEEP and wake each other up; the benchmark measures this wake-up latency.]
“quota” test
When “quota” is changed, how does the benchmark score change?
Procedure
Create the CPU cgroup named “benchmark”
Specify cpu.cfs_quota_us (cpu.cfs_period_us is left at the default)
Run the benchmark under the “benchmark” cgroup
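A minimal sketch of this procedure (assumptions: cgroup v1 cpu controller mounted at /sys/fs/cgroup/cpu; "<benchmark command>" stands in for the actual Himeno / Unixbench / hackbench / SysBench invocation):
# mkdir /sys/fs/cgroup/cpu/benchmark
# echo 50000 > /sys/fs/cgroup/cpu/benchmark/cpu.cfs_quota_us /* e.g. x = 50 ms; cpu.cfs_period_us stays at 100 ms */
# echo $$ > /sys/fs/cgroup/cpu/benchmark/tasks /* a benchmark started from this shell runs inside the group */
# <benchmark command>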
Benchmarks
Himeno Benchmark
Unixbench
Hackbench
SysBench oltp test
Sleep-work-wakeup benchmark
[Figure: CPU cgroup hierarchy with a “benchmark” group under the root; the benchmark process runs in it with cpu.cfs_quota_us / cpu.cfs_period_us = x / 100000.]
Himeno Benchmark result
[Chart: Himeno Benchmark score (MFLOPS) vs. quota (ms), period = 100 ms; the highest plotted score is 2151 MFLOPS. Score when unlimited: 2148.3 MFLOPS.]
Unixbench result
[Chart: Unixbench index vs. quota (ms), period = 100 ms. Plotted series: Dhrystone 2 using register variable; Double-Precision Whetstone; Execl Throughput; File Copy 1024 bufsize 2000 maxblocks; File Copy 256 bufsize 500 maxblocks; File Copy 4096 bufsize 8000 maxblocks; Pipe Throughput; Pipe-based Context Switching; Process Creation; Shell Scripts (1 concurrent); Shell Scripts (8 concurrent); System Call Overhead; System Benchmarks Index Score.]
Hackbench result
[Chart: elapsed time (s) and tasks per second vs. quota (ms), period = 100 ms; the elapsed-time curve is fitted by y = 71495x^-1.29, with labeled points at 3377.899 s and 82.612 s. Elapsed time when unlimited: 3.578 s.]
SysBench oltp test result
[Chart: transactions per second (TPS) and average execution time (ms) vs. quota (ms), period = 100 ms; labeled points include 29.71 and 174.41 TPS and 335.5776 and 57.3293 ms. When unlimited: 183.03 TPS, 54.61 ms average execution time.]
Sleep-work-wakeup benchmark result
[Chart: wake-up latency (ms) vs. quota (ms), period = 100 ms; labeled points at 519.547914, 8.078458 and 0.078542 ms. Latency when unlimited: 0.0697 ms.]
CPU bandwidth control of KVM guests
Virtual CPU bandwidth control of KVM guests can be achieved by using CFS bandwidth control
libvirt already supports this feature
Specify the “vcpu_period” and “vcpu_quota” parameters for each guest
libvirt converts them into cpu.cfs_period_us and cpu.cfs_quota_us
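A hedged sketch of the configuration (the <cputune> elements below come from the libvirt domain XML; the guest name “VM1” is just an example). Editing the guest with "virsh edit VM1" and adding:
<cputune>
  <period>100000</period> <!-- vcpu_period = 100000 us -->
  <quota>25000</quota> <!-- vcpu_quota = 25000 us, i.e. 25% of one CPU per vcpu -->
</cputune>
makes libvirt write the corresponding cpu.cfs_period_us / cpu.cfs_quota_us values into the guest's CPU cgroups; the same parameters can also be changed at run time with "virsh schedinfo".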
[Figure: a virtual guest with period (vcpu_period) and quota (vcpu_quota) applied to its vCPUs.]
CPU Bandwidth control of KVM guests (cont.)
vcpu_period, vcpu_quota
[Figure: CPU cgroup hierarchy: / -> libvirt -> qemu -> {VM1, VM2, VM3}. With vcpu_period = 100000 and vcpu_quota = 25000 set for VM1, its vcpu0 and vcpu1 groups each get cpu.cfs_quota_us / cpu.cfs_period_us = 25000 / 100000, and the VM1 group itself gets 50000 / 100000, because VM1's quota value is the sum of each vcpu's quota; the other groups stay unlimited.]
Comparison with “share” facility
Preconditions
2 virtual guests (1 vcpu each)
Each vcpu is pinned to the same physical CPU (see the pinning sketch after the case list)
Run the Super Pi benchmark on VM1 in the following cases:
A) Without CPU bandwidth control
i. VM2 is shut-off
ii. VM2 is running but idle
iii. VM2 is running and busy
B) With CPU bandwidth control
i. VM2 is shut-off
ii. VM2 is running but idle
iii. VM2 is running and busy
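A minimal sketch of the pinning step (assuming libvirt guests named VM1 and VM2, each with a single vCPU #0, both pinned to physical CPU #0):
# virsh vcpupin VM1 0 0 /* pin vCPU#0 of VM1 to pCPU#0 */
# virsh vcpupin VM2 0 0 /* pin vCPU#0 of VM2 to pCPU#0 */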
Comparison with “share” facility (cont.)
A) Without CPU bandwidth control on guests <“share”>
CPU time is allocated proportionally according to the “share” value
B) With CPU bandwidth control
CPU time is limited according to each quota value
[Chart: Super Pi calculation time (s) at VM1 for each state of VM2 (BUSY, IDLING, SHUT OFF); this shows elapsed time, so shorter is better. Without bandwidth control, the time is 15.656-15.721 s with VM2 shut off or idle (A-i, A-ii) and 31.457 s with VM2 busy (A-iii). With bandwidth control (50% quota), the time stays around 37.4-37.8 s regardless of VM2's state (B-i, B-ii, B-iii).]
“quota” test on KVM guest
Preconditions
KVM guest (1 vcpu, 2 GB memory, RHEL 6.1 x86_64)
Procedure
Specify vcpu_quota for the guest (vcpu_period is left at the default)
Run the benchmark on the guest
Benchmark
Super Pi benchmark (1 million digits)
SysBench oltp test
Sleep-work-wakeup benchmark
Super Pi benchmark result
[Chart: Super Pi calculation time (s) and calculation speed (digits/ms) vs. vcpu_quota (ms), period = 100 ms; the calculation-time curve is fitted by y = 5632.6x^-1.271, with labeled points at 341.301 s and 17.453 s. Calculation time when unlimited: 15.722 s.]
SysBench oltp test result
[Chart: transactions per second (TPS) and average execution time (ms) vs. vcpu_quota (ms), period = 100 ms; labeled points include 4.26 and 85.36 TPS and 2348.2446 and 117.12 ms. When unlimited: 105.71 TPS, 94.57 ms average execution time.]
Sleep-work-wakeup benchmark result
[Chart: wake-up latency (ms) vs. vcpu_quota (ms), period = 100 ms; labeled points at 130.825721 and 3.892596 ms. Latency when unlimited: 2.611 ms.]
Consideration of vcpu_quota
When vcpu_quota is specified, not only the vcpu but the whole virtual machine is throttled
Even if the vcpu's CPU time is less than the quota value, it may be throttled
e.g. in the case of heavy I/O
[Figure: CPU cgroup hierarchy for a 1-vcpu VM. libvirt applies vcpu_quota / vcpu_period = q / p to the vcpu0 group, so cpu.cfs_quota_us / cpu.cfs_period_us = q / p there, while the enclosing VM group, which also contains the qemu main thread and the qemu I/O threads, gets cpu.cfs_quota_us / cpu.cfs_period_us = (q * 1) / p, i.e. the same total budget.]
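For example, with vcpu_quota = 50000 and vcpu_period = 100000 on a 1-vcpu guest, both the vcpu0 group and the VM group get 50000 / 100000, so the vcpu, the qemu main thread and the qemu I/O threads together can never use more than 50% of a CPU within each period.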
Consideration of vcpu_quota (cont.)
Even if vcpu_quota is set to 50% of vcpu_period and vcpu0 is placed under full load in the guest, the CPU usage of vcpu0 does not reach 50%
[Figure: guest vCPU#0 pinned to pCPU#0 on the host; mpstat of pCPU#0 over time (%usr, %nice, %sys, %iowait, %irq, %soft, %steal, %guest, %idle) shows guest CPU usage staying below 50%.]
Consideration of vcpu_quota (cont.)
When the CPU usage of the qemu process (vcpu and main thread) and the qemu I/O threads is combined, the sum reaches 50%
[Figure: guest vCPU#0 pinned to pCPU#0, with the qemu main thread and I/O threads also running on the host; mpstat of pCPU#0 over time (%usr, %nice, %sys, %iowait, %irq, %soft, %steal, %guest, %idle) shows the combined usage reaching 50%.]
Impact investigation on network I/O
Preconditions
KVM guest (1 vcpu pinned to pCPU#0, 2 GB memory, RHEL 6.1 x86_64, NIC: vhost_net using macvtap)
Place a network load on the KVM guest by running netperf between the guest and the host
# netperf -t TCP_STREAM -- -m 32768 -s 57344 -S 57344
Impact investigation on network I/O (cont.)
CPU usage of pCPU#0 during netperf (w/o CPU bandwidth control of the guest)
[Figure: mpstat of pCPU#0 during netperf (%usr, %nice, %sys, %iowait, %irq, %soft, %steal, %guest, %idle); the VM uses more than 80% of the CPU while the vcpu itself uses just over 25%.]
Impact investigation on network I/O (cont.)
[Chart: network throughput on the guest (Mbps) vs. vcpu_quota (ms), period = 100 ms.]
Impact investigation on disk I/O
Preconditions
KVM guest (1 vcpu pinned to pCPU#0, 4 GB memory, RHEL 6.1 x86_64, HDD: file-based virtual HDD, 16 GB)
Place a disk load on the KVM guest by using the dd command
# dd if=/dev/zero of=testfile bs=1024M count=2
Impact investigation on disk I/O (cont.)
CPU usage of pCPU#0 during dd command (w/o CPU bandwidth control of the guest)
[Figure: mpstat of pCPU#0 during the dd command (%usr, %nice, %sys, %iowait, %irq, %soft, %steal, %guest, %idle); the VM uses more than 30% of the CPU while the vcpu itself uses about 20%.]
Impact investigation on disk I/O (cont.)
[Chart: disk I/O throughput on the guest (MB/s) vs. vcpu_quota (ms), period = 100 ms.]
Comparison with other hypervisors
vCPU bandwidth control of other hypervisors (Xen, VMware)
[virtual CPU usage] <= [limit value]
• Limit virtual CPU usage
• Do NOT limit hypervisor CPU usage
vCPU bandwidth control of KVM
[virtual CPU usage] + [hypervisor CPU usage] <= [limit value]
• Limit total of virtual CPU usage and hypervisor CPU usage
Both methods have pros and cons.
We (Fujitsu) are now enhancing libvirt to support limiting “virtual CPU usage only” on KVM
Summary
Outline of CPU cgroup
CFS bandwidth control provides an upper limit on CPU resource
Evaluation of CFS bandwidth control
The effect of bandwidth control depends on workload:
• CPU-bound tasks, which use the CPU all the time, are affected directly, while I/O-bound tasks are less sensitive
CPU bandwidth control of KVM guests
useful when building a KVM-based cloud system
necessary to take the impact on guests' I/O into account