Transcript of “On ParaVirtualizing TCP: Congestion Control in Xen Virtual Machines”
On ParaVirtualizing TCP: Congestion Control in Xen Virtual Machines
Luwei Cheng, Cho-Li Wang, Francis C.M. Lau Department of Computer Science
The University of Hong Kong
Xen Project Developer Summit 2013 Edinburgh, UK, October 24-25, 2013
Outline
Motivation – Physical datacenter vs. Virtualized datacenter – Incast congestion
Understand the Problem – Pseudo-congestion – Sender-side vs. Receiver-side
PVTCP – A ParaVirtualized TCP – Design, Implementation, Evaluation
Questions & Comments
Physical datacenter
A set of physical machines. Network delays: propagation delays of the physical network/switches.
[Diagram: a core switch and ToR switches connecting servers in racks]

Virtualized datacenter
A set of virtual machines. Network delays: additional delays due to virtualization overhead.
[Diagram: the same topology, with each server hosting multiple VMs]
Virtualization brings “delays”
1. I/O virtualization overhead (PV or HVM)
– Guest VMs are unable to directly access the hardware.
– Additional data movement between dom0 and domUs.
– HVM: passthrough I/O can avoid it.
2. VM scheduling delays
– Multiple VMs share one physical core.
[Diagram: several VMs multiplexed onto each pCPU by the hypervisor, introducing scheduling delay]
Virtualization brings “delays” (cont’d)
[RTT measurements: PM→PM vs. 1VM→1VM; 1VM→2VMs: peak 30ms, avg 0.147ms; 1VM→3VMs: peak 60ms, avg 0.374ms]
Delays of I/O virtualization (PV guests): < 1ms.
VM scheduling delays: tens of ms – queuing delays in the VM scheduling queue.
VM scheduling delays are the dominant factor in network RTT.
Network delays in public clouds have been reported before [HPDC’10] [INFOCOM’10].
Incast network congestion
• A special form of network congestion, typically seen in distributed processing applications (scatter-gather).
– Barrier-synchronized request workloads.
– The limited buffer space of the switch output port can be easily overfilled by simultaneous transmissions.
• Application-level throughput (goodput) can be orders of magnitude lower than the link capacity. [SIGCOMM’09]
Solutions for physical clusters
The dominant factor: once packet loss happens, whether the sender can learn of it as soon as possible.
– In case of “tail loss”, the sender can only count on the retransmit timer’s firing.
Two representative papers: Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems [FAST’08]; Understanding TCP Incast Throughput Collapse in Datacenter Networks [WREN’09].
Prior works – none of them can fully eliminate the throughput collapse:
– Increase switch buffer size
– Limited transmit
– Reduce duplicate ACK threshold
– Disable slow-start
– Randomize timeout value
– Reno, NewReno, SACK
Solutions for physical clusters (cont’d)
Significantly reducing RTOmin has been shown to be a safe and effective approach. [SIGCOMM’09]
Even with ECN support in the hardware switch, a small RTOmin still shows apparent advantages. [DCTCP, SIGCOMM’10]
RTOmin in a virtual cluster? Not well studied.
Outline
Motivation – Physical datacenter vs. Virtualized datacenter – Incast congestion
Understand the Problem – Pseudo-congestion – Sender-side vs. Receiver-side
PVTCP – A ParaVirtualized TCP – Design, Implementation, Evaluation
Questions & Comments
Pseudo-congestion
A small RTOmin → frequent spurious RTOs.
[RTT/RTO traces with RTOmin=200ms, 100ms, 10ms, 1ms; 3VMs per core, each VM running 30ms at a time. Red points: measured RTTs; blue points: calculated RTO values.]
NO network congestion, yet RTTs still spike.
RTO = SRTT + 4 * RTTVAR, lower-bounded by RTOmin (TCP’s low-pass filter for the Retransmit TimeOut).
Pseudo-congestion (cont’d)
A small RTOmin: serious spurious RTOs with largely varying RTTs.
A big RTOmin: throughput collapse under heavy network congestion.
“adjusting RTOmin is a tradeoff between timely response and premature timeouts, and there is NO optimal balance between the two.” -- Allman and Paxson [SIGCOMM’99]
Virtualized datacenters are a new instantiation of this tradeoff.
Sender-side vs. Receiver-side
To transmit 4000 1MB data blocks:

Freq.     3VMs→1VM (sender VM delayed)   1VM→3VMs (receiver VM delayed)
1× RTOs   1086                           677
2× RTOs   0                              673
3× RTOs   0                              196
4× RTOs   0                              30

Sender-side: an RTO only happens once at a time. Receiver-side: successive RTOs are normal.
A micro-view with tcpdump
[Plots: time (ms) vs. sequence number (from the sender VM), and time (ms) vs. ACK number (from the receiver VM). snd.una: the first sent but unacknowledged byte; snd.nxt: the next byte that will be sent.]

When the receiver VM is preempted: the receiver VM has been stopped, and RTO happens twice before it wakes up. The generation and the return of the ACKs are delayed, so the RTOs must happen on the sender’s side.

When the sender VM is preempted: the sender VM has been stopped; an ACK arrives before it wakes up, yet RTO happens just after it wakes up. The ACK’s arrival time is not delayed – only its processing time inside the guest is too late. From TCP’s perspective, RTO should not be triggered.
The sender-side problem: OS reasons
[Timeline diagram: VM1/VM2/VM3 share a pCPU. While VM1 waits in the scheduling queue, its retransmit timer expires and the returning ACKs are buffered in the driver domain. After VM1 wakes up, both TIMER and NET IRQs are pending: (1) Timer IRQ – RTO happens! (2) Network IRQ – the ACK is delivered. Spurious RTO!]
The reasons are due to common OS design:
– The timer interrupt is executed before other interrupts.
– Network processing happens a little later (in the bottom half).
So after the VM wakes up, the RTO fires just before the ACK enters the VM.
To detect spurious RTOs
Two well-known detection algorithms: F-RTO and Eifel.
– Eifel performs much worse than F-RTO in some situations, e.g. with bursty packet loss [CCR’03].
– F-RTO is implemented in Linux.
[Detection-rate plots for 3VMs→1VM and 1VM→3VMs: low detection rate in both cases.]
F-RTO interacts badly with delayed ACK (ACK coalescing).
– Reducing the delayed ACK timeout value does NOT help.
– Disabling delayed ACK seems to be helpful.
Delayed ACK vs. CPU overhead
Disabling delayed ACK → significant CPU overhead.
[CPU utilization plots for sender VM and receiver VM in both scenarios.]

Total ACKs   delack-200ms   delack-1ms   w/o delack
3VMs→1VM     229,650        244,757      2,832,260
1VM→3VMs     252,278        262,274      2,832,179

Disabling delayed ACK: 11–13× more ACKs are sent.
Outline
Motivation – Physical datacenter vs. Virtualized datacenter – Incast congestion
Understand the Problem – Pseudo-congestion – Sender-side vs. Receiver-side
PVTCP – A ParaVirtualized TCP – Design, Implementation, Evaluation
Questions & Comments
PVTCP – A ParaVirtualized TCP
Main idea – if we can detect such moments, and make the guest OS aware of them, there is a chance to handle the problem.
Observation – spurious RTOs only happen when the sender/receiver VM has just experienced a scheduling delay.
“the more information about current network conditions available to a transport protocol, the more efficiently it can use the network to transfer its data.” -- Allman and Paxson [SIGCOMM’99]
Detect the VM’s wakeup moment
[Setup: 3VMs per core; each VM runs for 30ms, then waits while the others run.]
The hypervisor delivers virtual timer IRQs to the guest (every 1ms with HZ=1000), and each IRQ advances the guest’s system clock: jiffies++. While the VM is NOT running, no timer IRQs are delivered; right after it is rescheduled, the clock catches up at once, e.g. jiffies += 60.
An acute increase of the system clock (jiffies) → the VM just woke up. (With a one-shot timer, the catch-up appears as a single large jump.)
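The detection rule above can be sketched as a small user-space model in C (a sketch, not the talk’s actual guest-kernel code; `detect_wakeup` and `WAKEUP_JUMP_TICKS` are illustrative names):

```c
#include <assert.h>
#include <stdint.h>

/* With HZ=1000 one jiffy is 1 ms. While the VM is descheduled no virtual
 * timer IRQs arrive, so on the next tick the guest clock jumps by the whole
 * scheduling delay (e.g. jiffies += 60 after a 60 ms preemption). */
#define WAKEUP_JUMP_TICKS 2   /* more than one tick between IRQs => was off-CPU */

/* Called from the (virtual) timer interrupt with the current jiffies value.
 * Returns 1 if the clock jumped, i.e. the VM has just been rescheduled. */
static int detect_wakeup(uint64_t *last_jiffies, uint64_t now)
{
    uint64_t delta = now - *last_jiffies;
    *last_jiffies = now;
    return delta >= WAKEUP_JUMP_TICKS;
}
```

A normal 1 ms tick advances jiffies by one and reports nothing; a jump of 60 (as in the 3VMs-per-core trace above) reports a wakeup.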
PVTCP – the sender VM is preempted
Spurious RTOs can be avoided – no need to detect them at all!
[Timeline diagram: while VM1 waits in the scheduling queue, its retransmit timer expires and the returning ACKs are buffered in the driver domain.]
Solution: after the VM wakes up, extend the TCP retransmit timer’s expiry time by 1ms. The Net IRQ then runs first: the ACK enters the VM and resets the timer, so no spurious RTO fires.
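The 1ms-extension rule can be sketched as a tiny simulation in C (illustrative only; `adjust_expiry_on_wakeup` is a hypothetical name, and real PVTCP does this on the kernel’s retransmit timer):

```c
#include <assert.h>

/* On wakeup: if the retransmit timer already expired (or is about to) while
 * the VM was off-CPU, push its expiry ~1 ms into the future so that pending
 * network IRQs can deliver the buffered ACK first and cancel the timer.
 * All times are in milliseconds. */
static long adjust_expiry_on_wakeup(long now_ms, long expiry_ms)
{
    if (expiry_ms <= now_ms + 1)   /* would fire before the net IRQ runs */
        return now_ms + 1;         /* grant a 1 ms grace period */
    return expiry_ms;              /* plenty of time left: leave it alone */
}
```

If the ACK really was lost, the timer still fires only 1ms later, so genuine congestion is handled almost as before.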
TCP’s low-pass filter to estimate RTT/RTO:
  Smoothed RTT:       SRTT_i = 7/8 * SRTT_{i-1} + 1/8 * MRTT_i
  RTT variance:       RTTVAR_i = 3/4 * RTTVAR_{i-1} + 1/4 * |SRTT_i - MRTT_i|
  Expected RTO value: RTO_{i+1} = SRTT_i + 4 * RTTVAR_i
After a wakeup, Measured RTT (MRTT) = TrueRTT + VMSchedDelay.
Solution: substitute the polluted sample, MRTT_i ← SRTT_{i-1}.
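The filter and the substitution can be sketched in C (a sketch with illustrative names `rtt_state` / `rto_update`; units in milliseconds, RTOmin clamping omitted):

```c
#include <assert.h>

/* TCP's low-pass RTT filter, as in the formulas above. */
struct rtt_state { double srtt, rttvar; };

/* Feed one measured RTT sample; returns the next RTO (before RTOmin clamp). */
static double rto_update(struct rtt_state *s, double mrtt)
{
    double dev;
    s->srtt = 7.0 / 8.0 * s->srtt + 1.0 / 8.0 * mrtt;        /* SRTT_i */
    dev = s->srtt > mrtt ? s->srtt - mrtt : mrtt - s->srtt;  /* |SRTT_i - MRTT_i| */
    s->rttvar = 3.0 / 4.0 * s->rttvar + 1.0 / 4.0 * dev;     /* RTTVAR_i */
    return s->srtt + 4.0 * s->rttvar;                        /* RTO_{i+1} */
}
```

With SRTT = 1ms and RTTVAR = 0.25ms, a single 60ms sample polluted by VMSchedDelay pushes the next RTO above 60ms, while substituting MRTT_i ← SRTT_{i-1} keeps it under 2ms.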
PVTCP – the receiver VM is preempted
Spurious RTOs cannot be avoided, so we have to let the sender detect them.
Detection algorithms require a deterministic return of future ACKs from the receiver:
– Enable delayed ACK → retransmission ambiguity.
– Disable delayed ACK → significant CPU overhead.
Solution: temporarily disable delayed ACK when the receiver VM has just woken up.
– Eifel: check the timestamp of the first ACK.
– F-RTO: check the ACK numbers of the first two ACKs.
– Just-in-time: do not delay the ACKs for the first three segments.
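PVTCP applies this inside the guest kernel. From user space, Linux exposes a related knob, TCP_QUICKACK, which temporarily switches a socket out of delayed-ACK mode – an analogous effect, not PVTCP itself. A minimal Linux-only sketch (`request_quick_acks` is an illustrative name):

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Ask the kernel for immediate (non-delayed) ACKs on this socket.
 * The flag is not permanent: the kernel may re-enter delayed-ACK mode
 * later, which matches the "temporarily disable" idea above. */
static int request_quick_acks(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
}
```

Because the flag decays, a receiver would re-arm it at each detected wakeup moment rather than once at startup.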
PVTCP evaluation: throughput
Experimental setup: 20 sender VMs → 1 receiver VM.
[Goodput vs. RTOmin for TCP-200ms, TCP-1ms and PVTCP-1ms.]
TCP’s dilemma: pseudo-congestion vs. real congestion.
PVTCP avoids throughput collapse over the whole range.
PVTCP evaluation: CPU overhead
[CPU utilization plots for sender VM and receiver VM.]
With delayed ACK enabled: PVTCP (RTOmin=1ms) ≈ TCP (RTOmin=200ms).

Total ACKs   TCP-200ms   TCP-1ms   PVTCP-1ms
3VMs→1VM     192,587     244,757   192,863 (+0%)
1VM→3VMs     194,384     262,274   208,688 (+7.4%)

Sender side: spurious RTOs are avoided. Receiver side: delayed ACK is temporarily disabled to help the sender detect spurious RTOs.
One concern: the buffer of the netback
The vif’s buffer temporarily stores incoming packets while the VM is preempted.
– ifconfig vifX.Y txqueuelen [value]
The default value is too small → intensive packet loss:
– #define XENVIF_QUEUE_LENGTH 32
This parameter should be set much bigger (> 10,000 perhaps).
[Backup – timeline diagram: the scheduling delays to the receiver VM. Data packets wait for ACKs while the receiver VM sits in the hypervisor’s scheduling queue; ACKing is deferred and the RTO happens at the sender.]
[Backup – timeline diagram: the scheduling delays to the sender VM. ACKs buffer up in the driver domain while the sender VM waits to be scheduled; the RTO happens right after it wakes up. The buffer size matters!]
Summary
Problem: VM scheduling delays cause spurious RTOs.
Proposed solution: a ParaVirtualized TCP (PVTCP), with a method to detect a VM’s wakeup moment.
Sender-side problem (OS reasons): spurious RTOs can be avoided – slightly extend the retransmit timer’s expiry time after the sender VM wakes up.
Receiver-side problem (a networking problem): spurious RTOs can be detected – temporarily disable delayed ACK (just-in-time) after the receiver VM wakes up.
Future work: your inputs.

Thank you for listening.
Comments & Questions
Email: [email protected] URL: http://www.cs.hku.hk/~lwcheng