Surviving Failures in Bandwidth Constrained Datacenters
Authors:
Peter Bodik
Ishai Menache
Mosharaf Chowdhury
Pradeepkumar Mani
David A. Maltz
Ion Stoica
Presented By,
Sneha Arvind Mani
Outline
◦ Introduction
◦ Motivation and Background
◦ Problem Statement
◦ Algorithmic Solutions
◦ Evaluation of the Algorithms
◦ Related Work
◦ Conclusion
Introduction
The main goals of this paper:
◦ To improve the fault tolerance of the deployed applications.
◦ To reduce bandwidth usage in the network core.
How? By optimizing the allocation of applications to physical machines.
• Both of the above problems are NP-hard.
• The authors therefore formulated a related convex optimization problem that:
  ◦ Incentivizes spreading the machines of individual services across fault domains.
  ◦ Adds a penalty term for machine reallocations that increase bandwidth usage.
Introduction (2)
◦ Their algorithm achieved a 20%-50% reduction in bandwidth usage while improving worst-case survival by 40%-120%.
◦ Improvement in fault tolerance: it reduced the fraction of services affected by potential hardware failures by up to a factor of 14.
The contribution of this paper is three-fold:
◦ Measurement study
◦ Algorithms
◦ Methodology
Motivation and Background
Bing.com: a large-scale Web application running in multiple datacenters around the world.
Some definitions used in this paper:
◦ Logical Machine: Smallest logical component of a web application.
◦ Service: Consists of many logical machines executing the same code.
◦ Environment: Consists of many services.
◦ Physical Machine: Physical server that can run a single logical machine.
◦ Fault Domain: Set of physical machines that share a single point of failure.
Communication Patterns
◦ Communication was traced between all pairs of servers and aggregated for each pair of services i and j.
◦ The datacenter network core was observed to be highly utilized:
  Link utilization                      >50%     >60%     >70%     >80%
  Aggregate months above utilization    115.7    47.5     18.3     6.2
◦ The traffic matrix is very sparse: only 2% of service pairs communicate at all.
Communication Patterns (2)
◦ The communication pattern is very skewed: 0.1% of the service pairs that communicate generate 60% of all traffic, and 4.8% of service pairs generate 99% of traffic.
◦ Services that do not require a lot of bandwidth can be spread out across the datacenter, improving their fault tolerance.
Communication Patterns (3)
◦ The largest share of the traffic, 45%, stays within the same service; 23% leaves the service but stays within the same environment; and 23% crosses environments.
◦ The median service talks to nine other services.
◦ Communicating services form small and large components.
Failure Characteristics
◦ Networking hardware failures cause significant outages.
◦ Redundancy reduces the impact of failures on lost bytes by only 40%.
◦ Power fault domains create non-trivial patterns.
Implications for Optimization Framework:
◦ The framework has to consider the complex patterns of the power and networking fault domains, instead of simply spreading the services across several racks to achieve good fault tolerance.
Problem Statement
Metrics:
◦ Bandwidth (BW): The sum of the rates on the core links, used as the overall measure of bandwidth usage at the core of the network.
◦ Fault Tolerance (FT): The average of Worst-Case Survival (WCS) across all services.
◦ Number of Moves (NM): The number of servers that have to be re-imaged to get from the initial datacenter allocation to the proposed allocation.
Optimization:
Maximize FT − α BW
Subject to NM ≤ N0
α – tunable positive parameter
N0 – upper limit on the number of moves.
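
The metrics above can be illustrated with a minimal Python sketch. It assumes that WCS is the fraction of a service's machines that survive the single worst fault (the slides do not spell this out), and all function names and data layouts here are hypothetical, chosen only for illustration.

```python
from collections import Counter

def worst_case_survival(machine_faults):
    """WCS for one service (assumed definition): the fraction of its machines
    that survive the single worst fault.
    machine_faults: non-empty list with one set of fault-domain ids per machine."""
    hits = Counter(f for faults in machine_faults for f in faults)
    worst = max(hits.values(), default=0)
    return 1.0 - worst / len(machine_faults)

def fault_tolerance(services):
    """FT: average WCS across all services.
    services: dict mapping service name -> list of fault-domain sets."""
    return sum(worst_case_survival(m) for m in services.values()) / len(services)

def num_moves(initial, proposed):
    """NM: number of servers whose assigned service changes.
    Both arguments: dict mapping physical machine -> service."""
    return sum(1 for m in initial if initial[m] != proposed.get(m))

def objective(ft, bw, alpha):
    """The quantity to maximize: FT - alpha * BW."""
    return ft - alpha * bw

# Hypothetical example: 'web' has two of its four machines on rack1.
services = {
    "web": [{"rack1"}, {"rack1"}, {"rack2"}, {"rack3"}],
    "db": [{"rack1"}, {"rack2"}],
}
ft = fault_tolerance(services)
moves = num_moves({"p1": "web", "p2": "web", "p3": "db"},
                  {"p1": "web", "p2": "db", "p3": "web"})
```

A candidate allocation would be acceptable only if `num_moves` does not exceed N0; among acceptable allocations, the one with the largest `objective` value is preferred.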
Algorithmic Solutions
The solution roadmap is as follows:
◦ Cells: a cell is a subset of physical machines that belong to exactly the same fault domains. Working with cells reduces the size of the optimization problem.
◦ Fault Tolerance Cost (FTC) is a convex function; minimizing FTC improves FT.
◦ BW is optimized by performing a minimum k-way cut on the communication graph.
◦ CUT + FT + BW consists of two phases:
  • Phase 1: A minimum k-way cut computes an initial assignment that minimizes bandwidth at the network core.
  • Phase 2: Machines are iteratively moved to improve FT.
◦ FT + BW does not perform a graph cut; it starts with the current allocation and improves it by greedy moves that reduce a weighted sum of BW and FTC.
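
The idea of cells can be sketched in a few lines of Python: machines are grouped by the exact set of fault domains they belong to. This is only an illustration; the function name and the input layout (a dict from machine to its fault-domain ids) are assumptions, not the authors' implementation.

```python
from collections import defaultdict

def build_cells(machine_faults):
    """Group machines that belong to exactly the same fault domains.
    machine_faults: dict mapping machine -> iterable of fault-domain ids.
    Returns a dict mapping frozenset of fault domains -> list of machines."""
    cells = defaultdict(list)
    for machine, faults in machine_faults.items():
        cells[frozenset(faults)].append(machine)
    return dict(cells)

# Hypothetical example: m1 and m2 share both a rack and a power unit.
cells = build_cells({
    "m1": ["rackA", "pdu1"],
    "m2": ["pdu1", "rackA"],
    "m3": ["rackA", "pdu2"],
    "m4": ["rackB", "pdu2"],
})
```

Machines inside one cell are interchangeable with respect to faults, so an optimizer only needs to decide how many machines of each service go into each cell.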
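Formal Definitions
I is the indicator function: I(n1,n2) = 1 if traffic from n1 to n2 traverses a core link, and I(n1,n2) = 0 otherwise.
Bandwidth is given by:
$$BW = \sum_{n_1, n_2} I(n_1, n_2)\, r_{k_1, k_2}$$
where r_{k1,k2} is the required BW between a pair of machines n1 and n2 from services k1 and k2.
To define FT, let z_{k,j} be the total number of machines allocated to service k that are affected by fault j. FT is given by:
$$FT = \frac{1}{K} \sum_{k=1}^{K} \left( 1 - \frac{\max_j z_{k,j}}{m_k} \right)$$
where m_k is the number of machines of service k and K is the total number of services.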
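
As an illustration of how the indicator function enters the bandwidth metric, here is a small Python sketch. It assumes that traffic crosses the core exactly when the two machines sit in different pods; the `pod_of` mapping, the `rate` table, and the function names are hypothetical.

```python
def crosses_core(n1, n2, pod_of):
    """Indicator I(n1, n2): 1 if traffic from n1 to n2 traverses a core link.
    Assumption for this sketch: that happens when the pods differ."""
    return 1 if pod_of[n1] != pod_of[n2] else 0

def core_bandwidth(placement, pod_of, rate):
    """Sum the required bandwidth over all ordered machine pairs that cross the core.
    placement: dict mapping machine -> service.
    rate: dict mapping (service1, service2) -> required BW per machine pair."""
    total = 0.0
    for n1, k1 in placement.items():
        for n2, k2 in placement.items():
            if n1 != n2:
                total += crosses_core(n1, n2, pod_of) * rate.get((k1, k2), 0.0)
    return total

# Hypothetical example: two web machines in pod p1 talk to one db machine.
placement = {"a": "web", "b": "web", "c": "db"}
rate = {("web", "db"): 2.0}
bw_far = core_bandwidth(placement, {"a": "p1", "b": "p1", "c": "p2"}, rate)
bw_near = core_bandwidth(placement, {"a": "p1", "b": "p1", "c": "p1"}, rate)
```

Placing the db machine in the same pod as the web machines removes all core traffic in this toy case, which is the effect the graph cut tries to achieve at scale.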
Formal Definitions (2)
Fault Tolerance Cost (FTC) is given by:
$$FTC = \sum_{j} w_j \sum_{k} b_k\, z_{k,j}^2$$
where b_k and w_j are positive weights assigned to services and faults, respectively.
A decrease in FTC should increase FT: squaring the z_{k,j} variables incentivizes keeping their values small, which is achieved by spreading the machine assignment across multiple fault domains.
Minimization of BW is based on a minimum k-way cut, which partitions the logical machines into a given number of clusters.
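
A short Python sketch shows why squaring rewards spreading. It assumes FTC is the weighted sum of the squared z values (weights b for services, w for faults); the function name and data layout are hypothetical.

```python
def ftc(z, b, w):
    """Fault Tolerance Cost under the assumed form: sum of b[k] * w[j] * z[k, j]**2.
    z: dict mapping (service, fault) -> number of the service's machines hit by the fault.
    b: dict mapping service -> positive weight.
    w: dict mapping fault -> positive weight."""
    return sum(b[k] * w[j] * n ** 2 for (k, j), n in z.items())

b = {"web": 1.0}
w = {"r1": 1.0, "r2": 1.0}

# Four web machines, all in one rack versus split evenly over two racks.
concentrated = ftc({("web", "r1"): 4}, b, w)
spread = ftc({("web", "r1"): 2, ("web", "r2"): 2}, b, w)
```

With unit weights the concentrated placement costs 16 and the even split costs 8, so lowering FTC pushes machines of the same service into different fault domains.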
Algorithms to improve both BW & FT
◦ CUT + FT: Apply CUT in the first phase, then minimize FTC in the second phase using machine swaps.
◦ CUT + FT + BW: As above, but in the second phase a penalty term for bandwidth is added, i.e., each swap is evaluated by ΔFTC + αΔBW, where α is the weighting factor.
NM-aware algorithm:
◦ FT + BW: Start with the initial allocation and run only the second phase of CUT + FT + BW.
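
The second phase can be sketched as a steepest-descent loop over machine swaps. This is a minimal illustration, not the authors' code: the cost function is supplied by the caller (for example FTC + α·BW), every pair of machines is tried, and `max_swaps` is a hypothetical knob for bounding the number of moves. The toy cost used in the example is FTC with unit weights and no bandwidth term.

```python
from itertools import combinations

def best_swap(assignment, cost):
    """Return (delta, m1, m2) for the swap that lowers cost the most, or None."""
    base = cost(assignment)
    best = None
    for m1, m2 in combinations(assignment, 2):
        if assignment[m1] == assignment[m2]:
            continue  # swapping two machines of the same service changes nothing
        trial = dict(assignment)
        trial[m1], trial[m2] = trial[m2], trial[m1]
        delta = cost(trial) - base
        if delta < 0 and (best is None or delta < best[0]):
            best = (delta, m1, m2)
    return best

def greedy_descent(assignment, cost, max_swaps=None):
    """Apply the best swap repeatedly until no swap improves the cost
    or max_swaps is reached. Returns (final assignment, swaps performed)."""
    current = dict(assignment)
    swaps = 0
    while max_swaps is None or swaps < max_swaps:
        step = best_swap(current, cost)
        if step is None:
            break
        _, m1, m2 = step
        current[m1], current[m2] = current[m2], current[m1]
        swaps += 1
    return current, swaps

# Toy setup (hypothetical): four machines in two racks, two services.
RACK_OF = {"m1": "r1", "m2": "r1", "m3": "r2", "m4": "r2"}

def toy_cost(assignment):
    """FTC with unit weights: sum of squared per-(service, rack) machine counts."""
    counts = {}
    for machine, service in assignment.items():
        key = (service, RACK_OF[machine])
        counts[key] = counts.get(key, 0) + 1
    return sum(n ** 2 for n in counts.values())

start = {"m1": "web", "m2": "web", "m3": "db", "m4": "db"}
final, swaps = greedy_descent(start, toy_cost)
```

Each swap re-images two servers, so starting from the current allocation and capping `max_swaps` is one way to respect a limit on the number of moves.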
Scaling to large Datacenters
An algorithm that directly exploits the skewness of the communication matrix:
◦ CUT + RandLow: Apply CUT in the first phase. Then determine the subset of services whose aggregate BW is lower than that of the others, and randomly permute the machine allocation of all services belonging to that subset.
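
The randomization step of CUT+RandLow might look roughly like the following Python sketch. It covers only the shuffle, not the graph cut, and it assumes a fixed bandwidth `threshold` for deciding which services count as low-bandwidth; how that subset is actually chosen is not stated in the slides. All names are hypothetical.

```python
import random

def rand_low(assignment, service_bw, threshold, rng=None):
    """Randomly permute the placement of machines that run low-bandwidth services.
    assignment: dict mapping machine -> service.
    service_bw: dict mapping service -> aggregate bandwidth.
    threshold: services with aggregate bandwidth below this value are shuffled."""
    rng = rng or random.Random()
    low = {s for s, bw in service_bw.items() if bw < threshold}
    slots = [m for m, s in assignment.items() if s in low]
    shuffled = [assignment[m] for m in slots]
    rng.shuffle(shuffled)
    result = dict(assignment)
    for machine, service in zip(slots, shuffled):
        result[machine] = service
    return result

assignment = {"m1": "search", "m2": "log", "m3": "log", "m4": "cache", "m5": "search"}
service_bw = {"search": 900.0, "log": 5.0, "cache": 2.0}
shuffled = rand_low(assignment, service_bw, threshold=10.0, rng=random.Random(0))
```

High-bandwidth services keep the placement chosen by the cut, while the low-bandwidth ones are scattered, which tends to spread them over fault domains at little bandwidth cost.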
Scaling to large datacenters:
◦ To scale to large datacenters, we sample a large number of candidate swaps and choose the one that most improves FTC, instead of evaluating every possible swap.
◦ During the graph cut, the logical machines of the same service are grouped into a smaller number of representative nodes.
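
Sampling candidate swaps can be sketched as follows. This is an assumption-laden illustration: pairs are drawn uniformly at random, the cost function is passed in by the caller, and the toy cost and rack layout are hypothetical.

```python
import random

def sampled_best_swap(assignment, cost, num_samples, rng=None):
    """Evaluate num_samples random machine pairs and return (delta, m1, m2)
    for the sampled swap that lowers cost the most, or None if none helps."""
    rng = rng or random.Random()
    machines = list(assignment)
    base = cost(assignment)
    best = None
    for _ in range(num_samples):
        m1, m2 = rng.sample(machines, 2)
        if assignment[m1] == assignment[m2]:
            continue  # same service on both machines: the swap is a no-op
        trial = dict(assignment)
        trial[m1], trial[m2] = trial[m2], trial[m1]
        delta = cost(trial) - base
        if delta < 0 and (best is None or delta < best[0]):
            best = (delta, m1, m2)
    return best

# Toy setup (hypothetical): two racks, two services, unit-weight FTC as the cost.
RACKS = {"m1": "r1", "m2": "r1", "m3": "r2", "m4": "r2"}

def rack_ftc(assignment):
    counts = {}
    for machine, service in assignment.items():
        key = (service, RACKS[machine])
        counts[key] = counts.get(key, 0) + 1
    return sum(n ** 2 for n in counts.values())

candidate = sampled_best_swap(
    {"m1": "web", "m2": "web", "m3": "db", "m4": "db"},
    rack_ftc, num_samples=200, rng=random.Random(0))
```

The work per step is then proportional to the number of samples rather than to the number of machine pairs, at the price of possibly missing the single best swap.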
Evaluation of Algorithms
◦ CUT + FT + BW: When server moves are ignored, it achieves a 30%-60% reduction in BW usage while improving FT by 40%-120%.
◦ FT + BW performs close to CUT + FT + BW, even though it performs only steepest-descent moves. It can be used in scenarios where the number of concurrent server moves is limited.
◦ Random allocation in CUT + RandLow works well because many services transfer relatively little data and can be spread randomly across the datacenter.
Methodology to Evaluate
The following information is needed to perform the evaluation:
◦ Network topology of a cluster
◦ Services running in the cluster and the list of machines required for each service
◦ List of fault domains and the machines in each fault domain
◦ Traffic matrix for the services in the cluster
The algorithms are compared on the entire achievable tradeoff boundary rather than on their performance at a single point.
Comparing Different Algorithms
The solid circles represent the FT and BW at the starting allocation (at the origin), after BW-only optimization (bottom-left corner), and after FT-only optimization (top-right corner).
Optimizing for both BW and FT
◦ Artificially partitioning each service into several subgroups did not lead to satisfactory results.
◦ Augmenting the cut procedure with "spreading" requirements for services did not scale to large applications.
◦ CUT + FT:
  • The graph is plotted for an increasing number of server swaps.
  • By changing the number of swaps, the tradeoff between FT and BW can be controlled.
  • The formulation is convex, so performing steepest descent until convergence leads to the global minimum w.r.t. fault tolerance.
Optimizing for both BW and FT (2)
◦ CUT + FT + BW: Depends on α. The higher the value of α, the more weight is placed on improving BW at the cost of not improving FT.
  • It does not optimize a convex function, so it is not guaranteed to reach the global optimum.
◦ CUT + RandLow: Performs close to CUT + FT + BW, but optimizes neither the BW of low-talking services nor the FT of high-talking ones.
These graphs show the tradeoff boundary between FT and BW for the different algorithms across 3 more DCs.
Optimizing for BW, FT and NM
◦ We notice significant improvements by moving just 5% of the cluster. Moving 29% of the cluster achieves results similar to moving most of the machines using CUT + FT + BW.
◦ When FT + BW is run until convergence, it achieves results close to CUT + FT + BW even without the graph cut.
◦ This is significant because it means FT + BW can be used incrementally and still reach performance similar to CUT + FT + BW, which reshuffles the whole datacenter.
Improvements in FT & BW
◦ For α = 0.1, FT + BW reduced BW usage by 26% and improved FT by 140%. FT was reduced for only 2.7% of services, far fewer than for α = 1.0.
◦ For α = 1.0, FT + BW reduced core BW usage by 47% and improved average FT by 121%.
Additional Scenarios
◦ Optimization of bandwidth across multiple layers.
◦ Preparing for maintenance and online recovery.
◦ Adapting to changes in traffic patterns.
◦ Hard constraints on fault tolerance and placement.
◦ Multiple logical machines on a server.
Related Work
◦ Datacenter traffic analysis
◦ Datacenter resource allocation
◦ Virtual network embedding
◦ High availability in distributed systems
◦ VPN and network testbed allocation
Conclusion
◦ Analysis shows that the communication volume between pairs of services has a long tail, with the majority of traffic being generated by a small fraction of service pairs.
◦ This allowed the optimization algorithm to spread most of the services across fault domains without significantly increasing BW usage in the core.
Thank You!