Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet...
Transcript of Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet...
![Page 1: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/1.jpg)
Capacity planning for the Google backbone networkChristoph Albrecht, Ajay Bangla, Emilie Danna, Alireza Ghaffarkhah, Joe Jiang, Bikash Koley, Ben Preskill, Xiaoxue Zhao
July 13, 2015ISMP
![Page 2: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/2.jpg)
Multiple large backbone networks
B2: Internet facing backbone70+ locations in 33 countries
B4: Global software-defined inter-datacenter backbone
20,000+ circuits in operation40,000+ submarine fiber-pair miles
![Page 3: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/3.jpg)
Multiple layers
A
C
B
A
C
B
Logical Topology (L3):router adjacencies
Physical Topology (L1):fiber paths
A
C
B
Flows routed onlogical links
![Page 4: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/4.jpg)
Multiple layers
A
C
B
A
C
B
Logical Topology (L3):router adjacencies
Physical Topology (L1):fiber paths
A
C
B
Flows routed onlogical links
Failures propagate from layer to layer
![Page 5: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/5.jpg)
Multiple time horizons
O(months)
• Demand variation
• Capacity changes
• Risk assessment
O(seconds)
• Failure events
• Fast protection/restoration
• Routing changes
• Definitive failure repair: O(hours or days)
O(years)
• Long-term demand forecast
• Topology optimization and simulation
• What-if business case analysis
![Page 6: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/6.jpg)
Multiple objectivesStrategic objectives: minimize cost, ensure scalability
Service level objectives (SLO): latency, availability➔ Failures are modeled probabilistically➔ The objective is defined as points on the probability distribution➔ Latency example: the 95th percentile of latency from A to B is at most 17 ms➔ Availability example: 10 Gbps of bandwidth is available from A to B at least
99.9% of the time
![Page 7: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/7.jpg)
Multiple practical constraintsExample: Routing of flows on the logical graph➔ A flow can take a limited number of paths➔ Routing is sometimes not deterministic➔ There is a time delay to modify routing after a failure happens
![Page 8: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/8.jpg)
Deterministic optimization
![Page 9: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/9.jpg)
ProblemWhat is the cheapest network that can route flows during a given set of failure scenarios?
➔ L3-only version: physical topology and logical/physical mapping are fixed, decide logical capacity
➔ Cross-layer version: decide physical and logical topology, mapping between the two, and logical capacity
![Page 10: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/10.jpg)
Mixed integer program: building blockEach flow is satisfied during each failure scenario in the given set.
For each failure scenario:Variables: how the flow is routedConstraints: for each link, utilization is less than capacity under failure
Matrix represented as:capacity variables
routing variables
![Page 11: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/11.jpg)
Mixed integer program: L3-only
. . .
L1 variables
L3 link capacity variables
Equipment placementLine system capacity...
L3 routing for each L1 failure scenario
![Page 12: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/12.jpg)
Mixed integer program: cross-layer
. . .
Extended L1 variables
L3 link capacity variables
L3 link capacity under failure variables
L3/L1 mapping variables
How the L3/L1 mapping affects the L3 capacity available under L1 failure
L3 routing for each L1 failure scenario
L3/L1 constraints
![Page 13: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/13.jpg)
Routing for each failure scenario
For each src-dst flow: . One variable for the amount of traffic for each link and each direction. At each node: flow conservation constraint
. All flows of the same source are combined into one flow with multiple destinations.. For each src-multiple dst flow: (link, direction) variables and flow conservation constraints
Path formulation Edge formulation Edge formulationfor single source to multiple destinations flow
For each src-dst flow:. Multiple paths are generated from src to dst. One variable per path for the amount of traffic along the path.
![Page 14: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/14.jpg)
Latency constraints
Strict version: Edge formulation for single source to multiple destinations on the shortest path tree
Challenges:
. How to make the constraint less strict?. How to make it probabilistic?
![Page 15: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/15.jpg)
ResultsPotential cost reduction:
Cross-layer optimization can reduce cost 2x more than L3-only optimization
![Page 16: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/16.jpg)
Stochastic simulation
![Page 17: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/17.jpg)
ProblemDoes a given network meet availability and latency SLOs?➔ Current network: risk assessment➔ Hypothetical future networks: what-if analysis
![Page 18: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/18.jpg)
Monte Carlo simulation
TE simulator
Simulation
Input:. Demand. Cross-layer topology. Failure data
Output:Whether each flow meets SLO
Draw many samples from the failure distribution
For each L3 topology:Evaluate the satisfied demand and latency with the traffic engineering (TE) simulator
TE results
Combine results into a satisfied demand and latency distribution
Unique L3 topology
Derive corresponding L3 topologies with link capacity and deduplicate
Random combination of failures
![Page 19: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/19.jpg)
Parallel implementation
Logical topologies
Random seeds
Deduplicated logical topologies with number of samples
Flows with satisfied demand and number of samples
Flow Availability
1 billion samples
74 million unique topologies
10,000 threads2 hours> 1000x speedup
memory bottleneck
simulation: speed bottleneck
memory bottleneck
![Page 20: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/20.jpg)
Results
Data and automation are transforming ➔ our decision making➔ the definition of our business: measurable service quality and guarantees
Time
Availability
![Page 21: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/21.jpg)
Stochastic optimization
![Page 22: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/22.jpg)
ProblemWhat is the cheapest network that can meet SLO?➔ Probabilistic modeling of failures➔ SLO = chance constraints
◆ Probability (latency from A to B <= 17 ms) >= 0.95◆ Probability (satisfied demand from A to B >= 10 Gbps) >= 0.999
![Page 23: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/23.jpg)
Simulation / Optimization loop with scenario-based approach
Deterministic optimization
Stochastic simulation
optimized topology for the current set of failure scenarios
new set of failure scenarios
Greedy approach to meet SLO by optimizing with the smallest number of failure scenarios
Add failure scenarios with. highest probability. highest number or volume of flows that miss SLO and are not satisfied during that failure scenario
![Page 24: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/24.jpg)
ChallengesTradeoff between accuracy, optimality, scalability, complexity and speed
Examples➔ Accuracy: routing convergence➔ Optimality: better stochastic optimization➔ Scalability: more failure scenarios➔ Complexity: explanation of solutions to our users➔ Speed: repair the topology on the fly (transport SDN)
![Page 25: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/25.jpg)
Thank you
![Page 26: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/26.jpg)
Scenario-based approach
Probability
Failure scenario
0.001
Demand must be satisfied during all these failures
Probability (satisfied demand from A to B >= 10 Gbps) >= 0.999
Choose subset of failure scenarios where demand is satisfied such that Sum Probability( ) >= 0.999
...
Demand must be satisfied duringsome of these failures
![Page 27: Google backbone network ISMP July 13, 2015 Jiang, Bikash ... · Does a given network meet availability and latency SLOs? ... (TE) simulator TE results Combine results into a satisfied](https://reader034.fdocuments.us/reader034/viewer/2022051601/5ae335577f8b9a5b348d2662/html5/thumbnails/27.jpg)
Accuracy
Conclusive Flow Inconclusive Flow
The availability calculation is a statistical estimationFlow is conclusive if confidence interval lies entirely on one side of its target availabilityThis is used to determine the necessary number of samples
Target availability
Estimated availabilityConfidence interval for availability estimator
Target availabilityEstimated availability
Confidence interval for availability estimator