© 2003 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Xen Network I/O: Performance Analysis and Opportunities for Improvement
J. Renato Santos, G. (John) Janakiraman, Yoshio Turner
HP Labs
Xen Summit, April 17-18, 2007
page 2, April 20, 2007
Motivation
• Network I/O has a high CPU cost
  – TX: 350% of the Linux cost
  – RX: 310% of the Linux cost
CPU cost for a TCP connection at 1 Gbps (xen-unstable 03/16/2007; PV Linux guest; x86, 32-bit)
[Figure: CPU utilization (%) for RX and TX, Linux vs. Xen]
Outline
• Performance analysis for the network I/O RX path (netfront/netback)
• Network I/O RX optimizations
• Network I/O TX optimization
Performance Analysis for the Network I/O RX Path
Experimental Setup
• Machines: HP ProLiant DL580 (client and server)
  – P4 Xeon 2.8 GHz, 4 CPUs (MT disabled), 64 GB RAM, 512 KB L2, 2 MB L3
  – NIC: Intel E1000 (1 Gbps)
• Network configuration
  – Single switch connecting client and server
• Server configuration
  – Xen unstable (c.s. 14415, March 16, 2007), default xen0/xenU configs
  – Single guest (512 MB); dom0 also with 512 MB
• Benchmark
  – Simple UDP micro-benchmark (1 Gbps, 1500-byte packets)
CPU Profile for the Network RX Path
• The cost of the data copy is significant in both Linux and Xen
  – Xen has the cost of an additional data copy
• The Xen guest kernel alone uses more CPU than Linux
• Most of the Xen code cost is in dom0
[Figure: CPU utilization (%) for Linux vs. Xen, broken down by app, usercopy, kernel, xenU, grantcopy, kernel0, xen0]
Xen Code Cost
• The major Xen overheads are the grant table and domain page map/unmap
• Copy grant: several expensive atomic operations (lock instruction prefix):
  – Atomic cmpxchg operation for updating the status field (grant in use)
  – Increment/decrement of grant usage counters (multiple spinlock operations)
  – Increment/decrement of src and dst page ref counts (get/put_page())
[Figure: Xen CPU utilization (%) in xen0 and xenU, broken down by grant_table, domain_page, hypercall, schedule, time, event, interrupt, mm, other]
Dom0 Kernel Cost
• Bridge/network is a large component of the dom0 cost (Xen Summit 2006)
  – Can be reduced if the netfilter bridge config option is disabled
• New Xen code: netback, hypercall, swiotlb
• Higher interrupt overhead in Xen: extra code in evtchn.c
• Additional high-cost functions in Xen (accounted in "other"):
  – spin_unlock_irqrestore(), spin_trylock()
[Figure: CPU utilization (%) for Linux vs. dom0, broken down by driver, netback, network, bridge, tcp, mm, schedule, syscall, interrupt, hypercall, dma+swiotlb, other]
Bridge netfilter cost
• What do we need to do to disable the bridge netfilter by default?
  – Should we add a netfilter hook in netback?
[Figure: CPU utilization (%) for Linux vs. dom0 with bridge netfilter enabled (nf_br) and disabled (no nf_br), same breakdown as the previous figure]
Overhead in guest
• Netfront: 5 times more expensive than the e1000 driver in Linux
• Memory ops (mm): 2 times more expensive in Xen (?)
• Grant table:
  – High cost of the atomic cmpxchg operation used to revoke grant access
• "Other": spin_unlock_irqrestore(), spin_trylock() (same as dom0)
[Figure: CPU utilization (%) for Linux vs. domU, broken down by driver, network, tcp, mm, schedule, syscall, interrupt, hypercall, grant_table, other]
Source of Netfront Cost
• Netback copies packet data into netfront page fragments
• Netfront copies the first 200 bytes of each packet from the fragment into the main socket buffer data area
• The large netfront cost is due to this extra data copy
[Figure: CPU utilization (%) for Linux vs. domU, with the netfront head copy (headcopy) shown as a separate component]
Opportunities for Improvement on the RX Path
Reduce RX head copy size
• No need to have all protocol headers in the main SKB data area
• Copy only the Ethernet header (14 bytes)
• The network stack copies more data as needed
[Figure: CPU utilization (%) for Linux vs. a 200-byte head copy vs. a 14-byte head copy]
Move the grant data copy to the guest CPU
• Dom0 grants access to the data, and the guest copies it using copy grant
• The cost of the second copy to the user buffer is reduced, since the data is already in the guest CPU cache (assuming the cache is not evicted because the user process delays its read)
• Additional benefit: improves dom0 (driver domain) scalability, as more work is done on the guest side
• The data copy is more expensive in the guest (alignment problem)
[Figure: CPU utilization (%) for grant copy in dom0 vs. grant copy in the guest, broken down by app, usercopy, kernel, xenU, grantcopy, kernel0, xen0]
Grant copy alignment problem: the packet starts at a half word
[Diagram: the packet starts at a half-word boundary (first 16 bytes shown); grant copy from the netback SKB into the netfront fragment page, comparing copy in dom0 with copy in the guest]
• The copy is expensive when the destination does not start at a word boundary
• Fix: copy 2 prefix bytes as well, so that source and destination are both word-aligned
Fixing grant copy alignment
• Grant copy in the guest becomes more efficient than in current Xen
• Grant copy in dom0: the destination is word-aligned, but
  – The source is not word-aligned
  – It can also be improved by copying additional prefix data
    • Either 2 or (2+16) bytes
[Figure: CPU utilization (%) for copy in dom0 vs. aligned copy in the guest, same breakdown as the previous figures]
Fixing alignment in current Xen
• Source alignment reduces the copy cost
• Source and destination at the same buffer offset perform better
  – Reason (?): maybe because they share the same offset in the cache?
• The copy in dom0 is still more expensive than the copy in the guest
  – Different cache behavior
[Figure: CPU utilization (%) for original vs. word-aligned vs. buffer-aligned copy in dom0, and copy in the guest]
Copy in guest has better cache locality
• The dom0 copy has more L2 cache misses than the guest copy
  – The dom0 copy has lower cache locality
  – The guest posts multiple pages on the I/O ring
    • All pages in the ring must be used before the same page can be reused
  – For the guest copy, pages are allocated on demand and reused more often, improving cache locality
[Figures: normalized cycles, L3 misses, and L2 misses for grant copy in the guest (left) and grant copy in dom0 (right), broken down by app, usercopy, kernelU, xenU, grantcopy, kernel0, xen0]
Possible grant optimizations
• Define a new simple copy grant:
  – Allow only one copy operation at a time
  – No need to keep grant usage counters (removes the locking)
    • Avoids the cost of atomic cmpxchg operations
  – Separate the fields used for enabling the grant and for its usage status
• Avoid incrementing/decrementing page ref counters
  – Use an RCU scheme for page deallocation (lazy deallocation)
Potential savings in grant modifications
• Results are optimistic
  – The grant modifications still need to be implemented
  – Results are based on eliminating the current operations
[Figure: Xen CPU utilization (%) for original vs. no status update vs. no src page pin vs. no dst page pin, broken down by grant_table, domain_page, hypercall, schedule, time, mm, other]
Coalescing netfront RX interrupts
• The NIC (e1000) already coalesces HW interrupts (~10 packets/interrupt)
• Batching packets can provide additional benefit
  – 10% for 32 packets
  – But it adds extra latency
  – A dynamic coalescing scheme should be beneficial
[Figure: CPU utilization (%) vs. batch size (no batch, 8, 16, 32, 64 packets), broken down by app, usercopy, kernelU, xenU, grantcopy, kernel0, xen0]
Coalescing effect on Xen cost
• Except for the grant and domain page map/unmap, all other Xen costs are amortized by larger batches
  – An additional reason for optimizing grants
[Figure: Xen CPU utilization (%) in guest context vs. batch size (no batch, 8, 16, 32, 64 packets), broken down by grant_table, domain_page, hypercall, schedule, time, mm, other]
Combining all RX optimizations
• The CPU cost of RX network I/O can be significantly reduced
  – From ~250% to ~70% overhead (compared to Linux)
• The largest improvement comes from moving the grant copy to the guest CPU
[Figure: CPU utilization (%) for Linux vs. current Xen vs. optimized Xen, broken down by app, usercopy, kernelU, xenU, grantcopy, kernel0, xen0]
Optimization for the TX Path
Lazy page mapping on TX
• Dom0 only needs to access packet headers
• No need to map guest pages holding the packet payload
  – The NIC accesses that memory directly through DMA
• Avoid mapping the guest page on packet TX
  – Copy packet headers through the I/O ring
  – A modified grant operation returns the machine address (for DMA) but does not map the page in dom0
• Provide a page fault handler for the cases in which dom0 does need to access the payload
  – Packets destined to dom0/domU; netfilter rules
Benefit of lazy TX page mapping
• Performance improvement from the TX optimization
  – ~10% for large TX
  – ~8% for TCP RX, due to ACKs
• Some additional improvement may be possible with the grant optimizations
[Figures: CPU utilization (%) for original vs. TX optimization; left: 1 Gbps TCP TX (64 KB messages), right: 1 Gbps TCP RX; breakdown: app, usercopy, kernelU, xenU, grantcopy, kernel0, xen0]
Questions?