Transcript of “On ParaVirtualizing TCP: Congestion Control in Xen Virtual Machines”
On ParaVirtualizing TCP: Congestion Control in Xen Virtual Machines
Luwei Cheng, Cho-Li Wang, Francis C.M. Lau Department of Computer Science
The University of Hong Kong
Xen Project Developer Summit 2013 Edinburgh, UK, October 24-25, 2013
Outline
Motivation – Physical datacenter vs. Virtualized datacenter – Incast congestion
Understand the Problem – Pseudo-congestion – Sender-side vs. Receiver-side
PVTCP – A ParaVirtualized TCP – Design, Implementation, Evaluation
Questions & Comments
Physical datacenter
A set of physical machines. Network delays: propagation delays of the physical network/switches.
[Diagram: a core switch and ToR switches connecting servers in racks]

Virtualized datacenter
A set of virtual machines. Network delays: additional delays due to virtualization overhead.
[Diagram: the same topology, with each server hosting multiple VMs]
Virtualization brings “delays”
1. I/O virtualization overhead (PV or HVM)
– Guest VMs are unable to directly access the hardware.
– Additional data movement between dom0 and domUs.
– HVM: passthrough I/O can avoid it.
2. VM scheduling delays
– Multiple VMs share one physical core.
[Diagram: several VMs multiplexed onto each pCPU by the hypervisor, introducing scheduling delay]
Virtualization brings “delays” (cont’d)
[RTT measurements: PM→PM vs. 1VM→1VM; 1VM→2VMs: peak 30ms, avg 0.147ms; 1VM→3VMs: peak 60ms, avg 0.374ms]
Delays of I/O virtualization (PV guests): < 1ms.
VM scheduling delays: tens of ms – queuing delays in the VM scheduling queue.
VM scheduling delays are the dominant factor in network RTT.
Network delays in public clouds have been reported before [HPDC’10] [INFOCOM’10].
Incast network congestion
• A special form of network congestion, typically seen in distributed processing applications (scatter-gather).
– Barrier-synchronized request workloads.
– The limited buffer space of the switch output port can be easily overfilled by simultaneous transmissions.
• Application-level throughput (goodput) can be orders of magnitude lower than the link capacity. [SIGCOMM’09]
Solutions for physical clusters
The dominant factor: once packet loss happens, whether the sender can learn of it as soon as possible.
– In case of “tail loss”, the sender can only count on the retransmit timer’s firing.
Two representative papers: Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems [FAST’08]; Understanding TCP Incast Throughput Collapse in Datacenter Networks [WREN’09].
Prior works – none of them can fully eliminate the throughput collapse:
– Increase switch buffer size
– Limited transmit
– Reduce duplicate ACK threshold
– Disable slow-start
– Randomize timeout value
– Reno, NewReno, SACK
Solutions for physical clusters (cont’d)
Significantly reducing RTOmin has been shown to be a safe and effective approach. [SIGCOMM’09]
Even with ECN support in the hardware switch, a small RTOmin still shows apparent advantages. [DCTCP, SIGCOMM’10]
RTOmin in a virtual cluster? Not well studied.
Outline
Motivation – Physical datacenter vs. Virtualized datacenter – Incast congestion
Understand the Problem – Pseudo-congestion – Sender-side vs. Receiver-side
PVTCP – A ParaVirtualized TCP – Design, Implementation, Evaluation
Questions & Comments
Pseudo-congestion
A small RTOmin → frequent spurious RTOs.
[RTT/RTO traces with RTOmin=200ms, 100ms, 10ms, 1ms; 3VMs per core, each VM running 30ms at a time. Red points: measured RTTs; blue points: calculated RTO values.]
NO network congestion, yet RTTs still spike.
RTO = SRTT + 4 * RTTVAR, lower-bounded by RTOmin (TCP’s low-pass filter for the Retransmit TimeOut).
Pseudo-congestion (cont’d)
A small RTOmin: serious spurious RTOs with largely varying RTTs.
A big RTOmin: throughput collapse under heavy network congestion.
“adjusting RTOmin is a tradeoff between timely response and premature timeouts, and there is NO optimal balance between the two.” -- Allman and Paxson [SIGCOMM’99]
Virtualized datacenters are a new instantiation of this tradeoff.
Sender-side vs. Receiver-side
To transmit 4000 1MB data blocks:

Freq.     3VMs→1VM (sender VM delayed)   1VM→3VMs (receiver VM delayed)
1× RTOs   1086                           677
2× RTOs   0                              673
3× RTOs   0                              196
4× RTOs   0                              30

Sender-side: an RTO only happens once at a time. Receiver-side: successive RTOs are normal.
A micro-view with tcpdump
[Plots: time (ms) vs. sequence number (from the sender VM), and time (ms) vs. ACK number (from the receiver VM). snd.una: the first sent but unacknowledged byte; snd.nxt: the next byte that will be sent.]

When the receiver VM is preempted: the receiver VM has been stopped, and RTO happens twice before it wakes up. The generation and the return of the ACKs are delayed, so the RTOs must happen on the sender’s side.

When the sender VM is preempted: the sender VM has been stopped; an ACK arrives before it wakes up, yet RTO happens just after it wakes up. The ACK’s arrival time is not delayed – only its processing time inside the guest is too late. From TCP’s perspective, RTO should not be triggered.
The sender-side problem: OS reasons
[Timeline diagram: VM1/VM2/VM3 share a pCPU. While VM1 waits in the scheduling queue, its retransmit timer expires and the returning ACKs are buffered in the driver domain. After VM1 wakes up, both TIMER and NET IRQs are pending: (1) Timer IRQ – RTO happens! (2) Network IRQ – the ACK is delivered. Spurious RTO!]
The reasons are due to common OS design:
– The timer interrupt is executed before other interrupts.
– Network processing happens a little later (in the bottom half).
So after the VM wakes up, the RTO fires just before the ACK enters the VM.
To detect spurious RTOs
Two well-known detection algorithms: F-RTO and Eifel.
– Eifel performs much worse than F-RTO in some situations, e.g. with bursty packet loss [CCR’03].
– F-RTO is implemented in Linux.
[Detection-rate plots for 3VMs→1VM and 1VM→3VMs: low detection rate in both cases.]
F-RTO interacts badly with delayed ACK (ACK coalescing).
– Reducing the delayed ACK timeout value does NOT help.
– Disabling delayed ACK seems to be helpful.
Delayed ACK vs. CPU overhead
Disabling delayed ACK → significant CPU overhead.
[CPU utilization plots for sender VM and receiver VM in both scenarios.]

Total ACKs   delack-200ms   delack-1ms   w/o delack
3VMs→1VM     229,650        244,757      2,832,260
1VM→3VMs     252,278        262,274      2,832,179

Disabling delayed ACK: 11–13× more ACKs are sent.
Outline
Motivation – Physical datacenter vs. Virtualized datacenter – Incast congestion
Understand the Problem – Pseudo-congestion – Sender-side vs. Receiver-side
PVTCP – A ParaVirtualized TCP – Design, Implementation, Evaluation
Questions & Comments
PVTCP – A ParaVirtualized TCP
Main idea – if we can detect such moments, and make the guest OS aware of them, there is a chance to handle the problem.
Observation – spurious RTOs only happen when the sender/receiver VM has just experienced a scheduling delay.
“the more information about current network conditions available to a transport protocol, the more efficiently it can use the network to transfer its data.” -- Allman and Paxson [SIGCOMM’99]
Detect the VM’s wakeup moment
[Setup: 3VMs per core; each VM runs for 30ms, then waits while the others run.]
The hypervisor delivers virtual timer IRQs to the guest (every 1ms with HZ=1000), and each IRQ advances the guest’s system clock: jiffies++. While the VM is NOT running, no timer IRQs are delivered; right after it is rescheduled, the clock catches up at once, e.g. jiffies += 60.
An acute increase of the system clock (jiffies) → the VM just woke up. (With a one-shot timer, the catch-up appears as a single large jump.)
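The detection rule above can be sketched as a small user-space model in C (a sketch, not the talk’s actual guest-kernel code; `detect_wakeup` and `WAKEUP_JUMP_TICKS` are illustrative names):

```c
#include <assert.h>
#include <stdint.h>

/* With HZ=1000 one jiffy is 1 ms. While the VM is descheduled no virtual
 * timer IRQs arrive, so on the next tick the guest clock jumps by the whole
 * scheduling delay (e.g. jiffies += 60 after a 60 ms preemption). */
#define WAKEUP_JUMP_TICKS 2   /* more than one tick between IRQs => was off-CPU */

/* Called from the (virtual) timer interrupt with the current jiffies value.
 * Returns 1 if the clock jumped, i.e. the VM has just been rescheduled. */
static int detect_wakeup(uint64_t *last_jiffies, uint64_t now)
{
    uint64_t delta = now - *last_jiffies;
    *last_jiffies = now;
    return delta >= WAKEUP_JUMP_TICKS;
}
```

A normal 1 ms tick advances jiffies by one and reports nothing; a jump of 60 (as in the 3VMs-per-core trace above) reports a wakeup.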
PVTCP – the sender VM is preempted
Spurious RTOs can be avoided – no need to detect them at all!
[Timeline diagram: while VM1 waits in the scheduling queue, its retransmit timer expires and the returning ACKs are buffered in the driver domain.]
Solution: after the VM wakes up, extend the TCP retransmit timer’s expiry time by 1ms. The Net IRQ then runs first: the ACK enters the VM and resets the timer, so no spurious RTO fires.
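The 1ms-extension rule can be sketched as a tiny simulation in C (illustrative only; `adjust_expiry_on_wakeup` is a hypothetical name, and real PVTCP does this on the kernel’s retransmit timer):

```c
#include <assert.h>

/* On wakeup: if the retransmit timer already expired (or is about to) while
 * the VM was off-CPU, push its expiry ~1 ms into the future so that pending
 * network IRQs can deliver the buffered ACK first and cancel the timer.
 * All times are in milliseconds. */
static long adjust_expiry_on_wakeup(long now_ms, long expiry_ms)
{
    if (expiry_ms <= now_ms + 1)   /* would fire before the net IRQ runs */
        return now_ms + 1;         /* grant a 1 ms grace period */
    return expiry_ms;              /* plenty of time left: leave it alone */
}
```

If the ACK really was lost, the timer still fires only 1ms later, so genuine congestion is handled almost as before.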
TCP’s low-pass filter to estimate RTT/RTO:
  Smoothed RTT:       SRTT_i = 7/8 * SRTT_{i-1} + 1/8 * MRTT_i
  RTT variance:       RTTVAR_i = 3/4 * RTTVAR_{i-1} + 1/4 * |SRTT_i - MRTT_i|
  Expected RTO value: RTO_{i+1} = SRTT_i + 4 * RTTVAR_i
After a wakeup, Measured RTT (MRTT) = TrueRTT + VMSchedDelay.
Solution: substitute the polluted sample, MRTT_i ← SRTT_{i-1}.
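The filter and the substitution can be sketched in C (a sketch with illustrative names `rtt_state` / `rto_update`; units in milliseconds, RTOmin clamping omitted):

```c
#include <assert.h>

/* TCP's low-pass RTT filter, as in the formulas above. */
struct rtt_state { double srtt, rttvar; };

/* Feed one measured RTT sample; returns the next RTO (before RTOmin clamp). */
static double rto_update(struct rtt_state *s, double mrtt)
{
    double dev;
    s->srtt = 7.0 / 8.0 * s->srtt + 1.0 / 8.0 * mrtt;        /* SRTT_i */
    dev = s->srtt > mrtt ? s->srtt - mrtt : mrtt - s->srtt;  /* |SRTT_i - MRTT_i| */
    s->rttvar = 3.0 / 4.0 * s->rttvar + 1.0 / 4.0 * dev;     /* RTTVAR_i */
    return s->srtt + 4.0 * s->rttvar;                        /* RTO_{i+1} */
}
```

With SRTT = 1ms and RTTVAR = 0.25ms, a single 60ms sample polluted by VMSchedDelay pushes the next RTO above 60ms, while substituting MRTT_i ← SRTT_{i-1} keeps it under 2ms.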
PVTCP – the receiver VM is preempted
Spurious RTOs cannot be avoided, so we have to let the sender detect them.
Detection algorithms require a deterministic return of future ACKs from the receiver:
– Enable delayed ACK → retransmission ambiguity.
– Disable delayed ACK → significant CPU overhead.
Solution: temporarily disable delayed ACK when the receiver VM has just woken up.
– Eifel: check the timestamp of the first ACK.
– F-RTO: check the ACK numbers of the first two ACKs.
– Just-in-time: do not delay the ACKs for the first three segments.
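PVTCP applies this inside the guest kernel. From user space, Linux exposes a related knob, TCP_QUICKACK, which temporarily switches a socket out of delayed-ACK mode – an analogous effect, not PVTCP itself. A minimal Linux-only sketch (`request_quick_acks` is an illustrative name):

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Ask the kernel for immediate (non-delayed) ACKs on this socket.
 * The flag is not permanent: the kernel may re-enter delayed-ACK mode
 * later, which matches the "temporarily disable" idea above. */
static int request_quick_acks(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
}
```

Because the flag decays, a receiver would re-arm it at each detected wakeup moment rather than once at startup.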
PVTCP evaluation: throughput
Experimental setup: 20 sender VMs → 1 receiver VM.
[Goodput vs. RTOmin for TCP-200ms, TCP-1ms and PVTCP-1ms.]
TCP’s dilemma: pseudo-congestion vs. real congestion.
PVTCP avoids throughput collapse over the whole range.
PVTCP evaluation: CPU overhead
[CPU utilization plots for sender VM and receiver VM.]
With delayed ACK enabled: PVTCP (RTOmin=1ms) ≈ TCP (RTOmin=200ms).

Total ACKs   TCP-200ms   TCP-1ms   PVTCP-1ms
3VMs→1VM     192,587     244,757   192,863 (+0%)
1VM→3VMs     194,384     262,274   208,688 (+7.4%)

Sender side: spurious RTOs are avoided. Receiver side: delayed ACK is temporarily disabled to help the sender detect spurious RTOs.
One concern: the buffer of the netback
The vif’s buffer temporarily stores incoming packets while the VM is preempted.
– ifconfig vifX.Y txqueuelen [value]
The default value is too small → intensive packet loss:
– #define XENVIF_QUEUE_LENGTH 32
This parameter should be set much bigger (> 10,000 perhaps).
[Backup – timeline diagram: the scheduling delays to the receiver VM. Data packets wait for ACKs while the receiver VM sits in the hypervisor’s scheduling queue; ACKing is deferred and the RTO happens at the sender.]
[Backup – timeline diagram: the scheduling delays to the sender VM. ACKs buffer up in the driver domain while the sender VM waits to be scheduled; the RTO happens right after it wakes up. The buffer size matters!]
Summary
Problem: VM scheduling delays cause spurious RTOs.
Proposed solution: a ParaVirtualized TCP (PVTCP), with a method to detect a VM’s wakeup moment.
Sender-side problem (OS reasons): spurious RTOs can be avoided – slightly extend the retransmit timer’s expiry time after the sender VM wakes up.
Receiver-side problem (a networking problem): spurious RTOs can be detected – temporarily disable delayed ACK (just-in-time) after the receiver VM wakes up.
Future work: your inputs.

Thank you for listening.
Comments & Questions
Email: [email protected] URL: http://www.cs.hku.hk/~lwcheng