Scheduling Mix-flows in Commodity Datacenters with Karuna · Karuna completes deadline flow just...
Transcript of Scheduling Mix-flows in Commodity Datacenters with Karuna · Karuna completes deadline flow just...
Scheduling Mix-flows in Commodity Datacenterswith Karuna
Li Chen, Kai Chen, Wei Bai, Mohammad Alizadeh (MIT)SING Group, CSE Department
Hong Kong University of Science and Technology
Datacenter Transport
• Deadline flows• Meeting deadlines• D3, D2TCP, …
• General (non-deadline) flows• Reduce flow completion time (FCT).• pFabric, PDQ, PASE, PIAS, …
We investigate a practical, yet neglected, problem: Coexistence of deadline and non-deadline flowsMix-flow Scheduling
2
Prior solutions do not work for mix-flowsShortest Job First (SJF) Scheduling – pFabric, PASE, PIAS, PDQ
Deadline flows
Non-deadline flows
End-host End-host
t0
0.1
0.2
0.3
0.4
0.5
0 5 10 15 20
Frac
tion
Percentage of non-deadline flows smaller than deadline flows.
Deadline Miss Rate
Scheduling only with sizes hurts deadline flowsProblem: unawareness of deadlines.
3
Flow Priority = Remaining size
Prior solutions do not work for mix-flowsEarliest Deadline First Scheduling – pFabric, PASE, PIAS, PDQ
Deadline flows
Non-deadline flows
End-host End-host
t
Prioritizing deadline flows hurts non-deadline flows, especially short ones.Problem: Existing transports for deadline flows unnecessarily takes all bandwidth.
0
5
10
15
20
0 1 2 3 4 5 6 7 8
ms
Percentage of deadline flows in overall traffic
99 Percentile FCT
Non-deadline: Overall Non-deadline: Size<10KB
4
Flow Priority = Time till Deadline
DeadlineDeadline
How to schedule mix-flows?
5
Deadline Flows•Meet deadlines•Flow deadline à Priority
Non-deadline Flows•Reduce FCT•Flow Size à Priority
Karuna
• Deadline flows • High priority with minimal bandwidth to complete just before
deadlines.• Non-deadline flows
• Low priority but take all available bandwidth to reduce FCT.
Deadline Flows
Non-deadline Flows
MCP
Work Conserv.Transport
Highest Priority
Priority 2
Priority 3
Priority K
End-host Network Fabric 6
SJF
Key Insight: Deadline flows should minimally impact non-deadline flows.
MCP for deadline flows: Completing deadlines with minimal bandwidthMinimal-impact Congestion control Protocol
7
Deadline flows Non-deadline flows Implementation Evaluation
MCP: Formulation and solution
• Objective à Minimal impact• Per-packet latency
• Constraints:• Meet deadlines• Network capacity
Stochastic Optimization
LyapunovOptimization
Framework [1]Convex
Optimization Primal SolutionPer-flow
congestion window update
function
[1] M. J. Neely. Stochastic Network Optimization with Application to Communication and Queueing Systems, Morgan & Claypool, 2010.8
MCP: Formulation and solution
• Objective à Minimal impact• Per-packet latency
• Constraints • Meet deadlines• Network capacity
• Solutionà Near-deadline completion
9
Rate
t
Link CapRate
tDeadline
Stochastic Optimization
LyapunovOptimization
Framework [1]Convex
Optimization Primal SolutionPer-flow
congestion window update
function
Reducing FCT for non-deadline flowsMimicking SJFNon-deadline flows with/out known sizes
10
Deadline flows Non-deadline flows Implementation Evaluation
Non-deadline flows with unknown size
• PIAS [2] is best known scheme.
[2] Wei Bai, et. al., Information-Agnostic Flow Scheduling for Commodity Data Centers, USENIX NSDI 2015
Highest Priority
2nd
Highest Priority
…
Lowest Priority
Send packets tagged with the highest priority untilα#bytes sent.
Send packets tagged with 2nd highest priority untilα$bytes sent.
Send packets tagged with the lowest priority.
Flows
11
Karuna for non-deadline flows
• Non-deadline flows with unknown size ç PIAS• Non-deadline flows with known size
• Karuna extends PIAS to schedule flows with/out known sizes.
Sum of Linear Ratios Problem
(PIAS)
Reformulation to include flows with known
sizes
Quadratic Sum of Ratios Problem(Karuna)
Demotion Thresholds: {𝛼'} Demotion Thresholds: {𝛼'}Splitting Thresholds: {𝛽'}
12
Karuna for non-deadline flows: mimicking SJF
PIASPriority 2
Priority 3
Priority KSize ≤ 𝛽#
𝛽# < Size ≤ 𝛽$
𝛽,-# < Size
Flow with known sizes
Flow without known sizes
13
…End-host Network Fabric
Implementation
Flow size
Socket
Deadline
SO_MARK
setsockopt()
Information passing
Pass flow information (deadline, size) to the kernel using SO_MARK
End-host Network Fabric15
Implementation
Flow size
Socket
Deadline
SO_MARK
setsockopt()
Tc module
pkt
Tagged pkt
Information PassingPacket tagging
TC module at the sender-side.Tag DSCP fields in packet headers based on thresholds.
End-host Network Fabric16
Implementation
Flow size
Socket
Deadline
SO_MARK
setsockopt()
Tc module
pkt
Tagged pkt
Modulate Congestion
Window with MCP
Information PassingPacket taggingRate control
TC module.Non deadline flows use DCTCPModifies window size using MCP.
End-host Network Fabric17
DCTCP
Implementation
Flow size
Socket
Deadline
SO_MARK
setsockopt()
Tc module
pkt
Tagged pkt
Strict Priority QueueingModulate Congestion
Window with MCP
Information PassingPacket taggingRate controlSwitch configuration
ECN marking.Priority Queueing(priorities mapped to DSCP fields).
End-host Network Fabric18
Strict Priority Queueing
Strict Priority Queueing
DCTCP
EvaluationTestbed ExperimentsSimulations
19
Deadline flows Non-deadline flows Implementation Evaluation
Evaluation: Testbed Experiments
• Setup• 16 servers• A Gigabit Pronto-3295 switch• 8 Priority queues mapped to DSCP• RTT ~100us• Karuna kernel module
• Traffic trace• Web search (DCTCP [3])• Data mining (VL2 [4])
20
[3] Alizadeh, Mohammad, et al. "Data center tcp (dctcp)." ACM SIGCOMM computer communication review. Vol. 40. No. 4. ACM, 2010.[4] Greenberg, Albert, et al. "VL2: a scalable and flexible data center network."ACM SIGCOMM computer communication review. Vol. 39. No. 4. ACM, 2009.
Testbed Experiments: Deadline FlowsFlow Size Deadline Start Time1 14.4MB 20ms 0ms2 48MB 120ms 0ms3 3MB 5ms 50ms4 0.5MB 10ms 80ms
0
200
400
600
800
1000
1200
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99 101
103
105
107
109
111
113
115
117
119
121
Mbp
s
Time (ms)
DCTCP
Flow 1 Flow 2 Flow 3 Flow 4Deadline Missed
for Flow 1
21
Testbed Experiments: Deadline FlowsFlow Size Deadline Start Time1 14.4MB 20ms 0ms2 48MB 120ms 0ms3 3MB 5ms 50ms4 0.5MB 10ms 80ms
0
200
400
600
800
1000
1200
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99 101
103
105
107
109
111
113
115
117
119
121
Mbp
s
Time (ms)
pFabric – Earliest Deadline First
Flow 1 Flow 2 Flow 3 Flow 4
22
Flow 1 deadline Flow 2 deadlineFlow 4 deadlineFlow 3 deadline
Testbed Experiments: Deadline FlowsFlow Size Deadline Start Time1 14.4MB 20ms 0ms2 48MB 120ms 0ms3 3MB 5ms 50ms4 0.5MB 10ms 80ms
0
200
400
600
800
1000
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99 101
103
105
107
109
111
113
115
117
119
121
Mbp
s
Time (ms)
Karuna
Flow 1 Flow 2 Flow 3 Flow 4
23
Testbed Experiments: Deadline FlowsFlow Size Deadline Start Time1 14.4MB 20ms 0ms2 48MB 120ms 0ms3 3MB 5ms 50ms4 0.5MB 10ms 80ms
Karuna completes deadline flow just before deadline, leaving bandwidth for non-deadline flows.
0100200300400500600700800900
1000
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106
113
120
Karuna
0
200
400
600
800
1000
1200
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106
113
120
pFabric – Earliest Deadline First
0
200
400
600
800
1000
1200
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106
113
120
DCTCP
24
Testbed Experiments: Non-deadline Flows
Mimics shortest job first scheduling for non-deadline flows.
0.9162.7211.716
5.5548.04
28.725
0
5
10
15
20
25
30
35
0-100KB (Avg) 0-100KB (99th)
ms
Flow Size
FCT
Karuna DCTCP TCP
81.729
104.629
120.286
0
20
40
60
80
100
120
140
100KB-10MB (Avg)
ms
Flow Size
FCT
Karuna DCTCP TCP
851.72 718.3
69 608.04
0100200300400500600700800900
1000
>10MB (Avg)
ms
Flow Size
FCT
Karuna DCTCP TCP
61.133
67.019
73.634
0
10
20
30
40
50
60
70
80
Overall
ms
FCT
Karuna DCTCP TCP
25
Evaluation: Simulations
• Simulation Setup• Spine-leaf with 144 servers• 10G Server-ToR links• 40G ToR-Spine links
• Compare with:• D3• D2TCP• pFabric - EDF
26
Large-scale Simulations: Key Benefit of Karuna
0
2
4
6
8
10
0.8 0.85 0.9 0.95
%
Deadline traffic load
Deadline Miss Rate
D3 D2TCPpFabric (EDF) Karuna
0
5
10
15
20
0.8 0.85 0.9 0.95m
s
Deadline traffic load
95th Percentile FCT
D3 D2TCPpFabric Karuna
0100002000030000400005000060000
0.8 0.85 0.9 0.95
#
Deadline traffic load
# Completed non-deadline flows
D3 D2TCP pFabric Karuna
Reducing completion times of non-deadline flows while completing deadline flows.27
>100x