© 2003 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Xen Network I/O: Performance Analysis and Opportunities for Improvement
J. Renato Santos, G. (John) Janakiraman, Yoshio Turner
HP Labs
Xen Summit, April 17-18, 2007
page 2, April 20, 2007
Motivation
• Network I/O has a high CPU cost
  – TX: 350% of the Linux cost
  – RX: 310% of the Linux cost
CPU cost for a TCP connection at 1 Gbps (xen-unstable 03/16/2007; PV Linux guest; x86, 32-bit)
[Figure: CPU utilization (%) for RX and TX, Linux vs. Xen]
Outline
• Performance analysis for the network I/O RX path (netfront/netback)
• Network I/O RX optimizations
• Network I/O TX optimization
Performance Analysis for the Network I/O RX Path
Experimental Setup
• Machines: HP ProLiant DL580 (client and server)
  – P4 Xeon 2.8 GHz, 4 CPUs (MT disabled), 64 GB RAM, 512 KB L2, 2 MB L3
  – NIC: Intel E1000 (1 Gbps)
• Network configuration
  – Single switch connecting client and server
• Server configuration
  – Xen unstable (c.s. 14415, March 16, 2007), default xen0/xenU configs
  – Single guest (512 MB); dom0 also with 512 MB
• Benchmark
  – Simple UDP micro-benchmark (1 Gbps, 1500-byte packets)
CPU Profile for the Network RX Path
• The cost of the data copy is significant in both Linux and Xen
  – Xen has the cost of an additional data copy
• The Xen guest kernel alone uses more CPU than Linux
• Most of the Xen code cost is in dom0
[Figure: CPU utilization (%) for Linux vs. Xen, broken down by app, usercopy, kernel, xenU, grantcopy, kernel0, xen0]
Xen Code Cost
• The major Xen overheads are the grant table and domain page map/unmap
• Copy grant: several expensive atomic operations (lock instruction prefix):
  – Atomic cmpxchg operation for updating the status field (grant in use)
  – Increment/decrement of grant usage counters (multiple spinlock operations)
  – Increment/decrement of src and dst page ref counts (get/put_page())
[Figure: Xen CPU utilization (%) in xen0 and xenU, broken down by grant_table, domain_page, hypercall, schedule, time, event, interrupt, mm, other]
Dom0 Kernel Cost
• Bridge/network is a large component of the dom0 cost (Xen Summit 2006)
  – Can be reduced if the netfilter bridge config option is disabled
• New Xen code: netback, hypercall, swiotlb
• Higher interrupt overhead in Xen: extra code in evtchn.c
• Additional high-cost functions in Xen (accounted in "other"):
  – spin_unlock_irqrestore(), spin_trylock()
[Figure: CPU utilization (%) for Linux vs. dom0, broken down by driver, netback, network, bridge, tcp, mm, schedule, syscall, interrupt, hypercall, dma+swiotlb, other]
Bridge netfilter cost
• What do we need to do to disable the bridge netfilter by default?
  – Should we add a netfilter hook in netback?
[Figure: CPU utilization (%) for Linux vs. dom0 with bridge netfilter enabled (nf_br) and disabled (no nf_br), same breakdown as the previous figure]
Overhead in guest
• Netfront: 5 times more expensive than the e1000 driver in Linux
• Memory ops (mm): 2 times more expensive in Xen (?)
• Grant table:
  – High cost of the atomic cmpxchg operation used to revoke grant access
• "Other": spin_unlock_irqrestore(), spin_trylock() (same as dom0)
[Figure: CPU utilization (%) for Linux vs. domU, broken down by driver, network, tcp, mm, schedule, syscall, interrupt, hypercall, grant_table, other]
Source of Netfront Cost
• Netback copies packet data into netfront page fragments
• Netfront copies the first 200 bytes of each packet from the fragment into the main socket buffer data area
• The large netfront cost is due to this extra data copy
[Figure: CPU utilization (%) for Linux vs. domU, with the netfront head copy (headcopy) shown as a separate component]
Opportunities for Improvement on the RX Path
Reduce RX head copy size
• No need to have all protocol headers in the main SKB data area
• Copy only the Ethernet header (14 bytes)
• The network stack copies more data as needed
[Figure: CPU utilization (%) for Linux vs. a 200-byte head copy vs. a 14-byte head copy]
Move the grant data copy to the guest CPU
• Dom0 grants access to the data, and the guest copies it using copy grant
• The cost of the second copy to the user buffer is reduced, since the data is already in the guest CPU cache (assuming the cache is not evicted because the user process delays its read)
• Additional benefit: improves dom0 (driver domain) scalability, as more work is done on the guest side
• The data copy is more expensive in the guest (alignment problem)
[Figure: CPU utilization (%) for grant copy in dom0 vs. grant copy in the guest, broken down by app, usercopy, kernel, xenU, grantcopy, kernel0, xen0]
Grant copy alignment problem: the packet starts at a half word
[Diagram: the packet starts at a half-word boundary (first 16 bytes shown); grant copy from the netback SKB into the netfront fragment page, comparing copy in dom0 with copy in the guest]
• The copy is expensive when the destination does not start at a word boundary
• Fix: copy 2 prefix bytes as well, so that source and destination are both word-aligned
Fixing grant copy alignment
• Grant copy in the guest becomes more efficient than in current Xen
• Grant copy in dom0: the destination is word-aligned, but
  – The source is not word-aligned
  – It can also be improved by copying additional prefix data
    • Either 2 or (2+16) bytes
[Figure: CPU utilization (%) for copy in dom0 vs. aligned copy in the guest, same breakdown as the previous figures]
Fixing alignment in current Xen
• Source alignment reduces the copy cost
• Source and destination at the same buffer offset perform better
  – Reason (?): maybe because they share the same offset in the cache?
• The copy in dom0 is still more expensive than the copy in the guest
  – Different cache behavior
[Figure: CPU utilization (%) for original vs. word-aligned vs. buffer-aligned copy in dom0, and copy in the guest]
Copy in guest has better cache locality
• The dom0 copy has more L2 cache misses than the guest copy
  – The dom0 copy has lower cache locality
  – The guest posts multiple pages on the I/O ring
    • All pages in the ring must be used before the same page can be reused
  – For the guest copy, pages are allocated on demand and reused more often, improving cache locality
[Figures: normalized cycles, L3 misses, and L2 misses for grant copy in the guest (left) and grant copy in dom0 (right), broken down by app, usercopy, kernelU, xenU, grantcopy, kernel0, xen0]
Possible grant optimizations
• Define a new simple copy grant:
  – Allow only one copy operation at a time
  – No need to keep grant usage counters (removes the locking)
    • Avoids the cost of atomic cmpxchg operations
  – Separate the fields used for enabling the grant and for its usage status
• Avoid incrementing/decrementing page ref counters
  – Use an RCU scheme for page deallocation (lazy deallocation)
Potential savings in grant modifications
• Results are optimistic
  – The grant modifications still need to be implemented
  – Results are based on eliminating the current operations
[Figure: Xen CPU utilization (%) for original vs. no status update vs. no src page pin vs. no dst page pin, broken down by grant_table, domain_page, hypercall, schedule, time, mm, other]
Coalescing netfront RX interrupts
• The NIC (e1000) already coalesces HW interrupts (~10 packets/interrupt)
• Batching packets can provide additional benefit
  – 10% for 32 packets
  – But it adds extra latency
  – A dynamic coalescing scheme should be beneficial
[Figure: CPU utilization (%) vs. batch size (no batch, 8, 16, 32, 64 packets), broken down by app, usercopy, kernelU, xenU, grantcopy, kernel0, xen0]
Coalescing effect on Xen cost
• Except for the grant and domain page map/unmap, all other Xen costs are amortized by larger batches
  – An additional reason for optimizing grants
[Figure: Xen CPU utilization (%) in guest context vs. batch size (no batch, 8, 16, 32, 64 packets), broken down by grant_table, domain_page, hypercall, schedule, time, mm, other]
Combining all RX optimizations
• The CPU cost of RX network I/O can be significantly reduced
  – From ~250% to ~70% overhead (compared to Linux)
• The largest improvement comes from moving the grant copy to the guest CPU
[Figure: CPU utilization (%) for Linux vs. current Xen vs. optimized Xen, broken down by app, usercopy, kernelU, xenU, grantcopy, kernel0, xen0]
Optimization for the TX Path
Lazy page mapping on TX
• Dom0 only needs to access packet headers
• No need to map guest pages holding the packet payload
  – The NIC accesses that memory directly through DMA
• Avoid mapping the guest page on packet TX
  – Copy packet headers through the I/O ring
  – A modified grant operation returns the machine address (for DMA) but does not map the page in dom0
• Provide a page fault handler for the cases in which dom0 does need to access the payload
  – Packets destined to dom0/domU; netfilter rules
Benefit of lazy TX page mapping
• Performance improvement from the TX optimization
  – ~10% for large TX
  – ~8% for TCP RX, due to ACKs
• Some additional improvement may be possible with the grant optimizations
[Figures: CPU utilization (%) for original vs. TX optimization; left: 1 Gbps TCP TX (64 KB messages), right: 1 Gbps TCP RX; breakdown: app, usercopy, kernelU, xenU, grantcopy, kernel0, xen0]
Questions?