Overload - Brown University
Overload
Qingyi Lu & Ke Ding
Outline
● Overview of overload
● Motivation (mono → micro)
● Common Problem
● Difference
● WeChat
● Azure
● ATOM
Overview of Overload
● Overload: the system’s workload exceeds its maximum processing capacity.
● Reasons: - Excessive visits
- Bottlenecks and failures within the system
- Backend failure and delay
Overview of Overload
● Problems: - CPU and memory could become bottlenecks
- System’s ability to respond could slow down
- System processing capacity could fall sharply
● Common solutions: load balancing, flow control, monitoring, etc.
Overload Control of Monolith and Microservices
● Monolith:
A small number of service components with trivial dependencies.
● Microservices:
Increasingly complex architecture and dependencies.
- All microservices must be monitored
- Hard to handle overload independently
- Need to adapt to service changes, workload dynamics, and the external environment
Overload Control in Practice
● Overload Control for Scaling WeChat Microservices
Complex core Service architecture
● Azure responses for COVID-19
Emergency solution
● ATOM: Model-Driven Autoscaling for Microservices
CPU & Replica
Overload Control for Scaling WeChat Microservices
Observation
1. WeChat’s microservice architecture
Complex dependency of services
2. Deployment of WeChat Services
Centralized or SLA-based overload control mechanisms cannot keep up with rapid service changes at large scale
3. Dynamic Workload
Overload control mechanism should adaptively tolerate the workload fluctuation
Overload Scenarios
Subsequent Overload
Challenge & Insight
● No single entry point for service requests, and call paths are complex
● Excessive request aborts waste computational resources
● Excessive requests hurt user experience (due to the high latency of service responses)
● Service Agnostic: Decoupling and Dynamic Development
● Independent but Collaborative: Granule and Subsequent Overload
● Efficient and Fair: Partial Failures
Design
1. How to detect overload: Overload Detection
2. How to control the overload: Service Admission Control
Overload Detection
By average waiting time of requests in the pending queue (queuing time).
Why not use response time?
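A minimal sketch of queuing-time-based detection, assuming queuing time is the gap between a request's arrival and the start of its processing. The class name, the 20 ms threshold, and the per-window request count are assumptions for illustration, not values taken from the slides. Unlike response time, this measurement excludes time spent waiting on downstream services, so it reflects only the local server's backlog.

```python
from dataclasses import dataclass, field


@dataclass
class QueuingTimeDetector:
    """Flags overload when a window's average queuing time exceeds a threshold."""

    threshold_ms: float = 20.0  # assumed threshold, illustrative only
    window_size: int = 5        # assumed number of requests per window
    overloaded: bool = False
    _samples: list = field(default_factory=list)

    def record(self, enqueued_at_ms: float, started_at_ms: float) -> bool:
        """Record one request and return the current overload status."""
        # Queuing time: how long the request sat in the pending queue.
        self._samples.append(started_at_ms - enqueued_at_ms)
        if len(self._samples) >= self.window_size:
            average = sum(self._samples) / len(self._samples)
            self.overloaded = average > self.threshold_ms
            self._samples.clear()  # start a fresh window
        return self.overloaded
```

The status only changes at window boundaries, so a single slow request does not flip the server into the overloaded state.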
Service Admission Control
1. Business-oriented Admission Control
2. User-oriented Admission Control
3. Session-oriented Admission Control
4. Adaptive Admission Control
5. Collaborative Admission Control
Business-oriented Admission Control
● Requests are prioritized based on their business significance
● Subsequent requests inherit the same business priority
● Advantages:
- Service agnostic: business priority is independent of the business logic of any service
- Easy to maintain: business priority is assigned in the entry services and reflects changes in basic and leap services
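The two properties above (priority assigned once at the entry service, then inherited by subsequent requests) can be sketched as follows. The action names, the priority table, and the convention that a smaller number means higher priority are all hypothetical choices for this example.

```python
from dataclasses import dataclass

# Hypothetical action-to-priority table; smaller number = more important.
BUSINESS_PRIORITY = {"login": 1, "payment": 2, "messaging": 3, "moments": 4}
DEFAULT_PRIORITY = 5  # assumed lowest priority for unlisted actions


@dataclass(frozen=True)
class Request:
    action: str
    business_priority: int


def enter(action: str) -> Request:
    """Entry service: look up the priority once, when the request enters."""
    return Request(action, BUSINESS_PRIORITY.get(action, DEFAULT_PRIORITY))


def call_downstream(parent: Request, sub_action: str) -> Request:
    """Downstream call: inherit the parent's priority, whatever sub_action is."""
    return Request(sub_action, parent.business_priority)
```

Because downstream services only read the inherited number, they need no knowledge of which business action triggered the call.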
User-oriented Admission Control
Example:
Current admission level is T
Overload detected
Level changes to T-1 (partial failure)
System underloaded
Level set back to T
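The level changes in the example above can be sketched with a compound admission level made of a business priority and a user priority. Everything here is an assumption for illustration: the hash-based user bucket, the bucket count, the one-step tighten and relax moves, and the convention that smaller numbers mean higher priority.

```python
import hashlib

USER_LEVELS = 128   # assumed number of user-priority buckets
MAX_BUSINESS = 5    # assumed lowest business priority


def user_priority(user_id: str, salt: str) -> int:
    """Map a user to a bucket; changing the salt reshuffles users."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % USER_LEVELS


def admit(business: int, user: int, level: tuple) -> bool:
    """Admit when (business, user) is at or above the admission level."""
    return (business, user) <= level


def tighten(level: tuple) -> tuple:
    """Overload detected: move the level one step toward higher priority."""
    business, user = level
    if user > 0:
        return (business, user - 1)
    if business > 1:
        return (business - 1, USER_LEVELS - 1)
    return level  # already at the strictest level


def relax(level: tuple) -> tuple:
    """System underloaded: move the level one step back."""
    business, user = level
    if user < USER_LEVELS - 1:
        return (business, user + 1)
    if business < MAX_BUSINESS:
        return (business + 1, 0)
    return level  # already admitting everything
```

With a level of (3, 0), one tighten step gives (2, 127), so requests of business priority 3 start failing while priority 2 is still fully admitted; one relax step restores (3, 0).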
Other Admission Control
● Session-oriented Admission Control:
Based on Session ID
● Adaptive Admission Control:
Adapt to the load status changes to minimize impact on the quality of the overall service
● Collaborative Admission Control:
Learn the latest admission level of the downstream server
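A minimal sketch of the collaborative idea, assuming the downstream server attaches its current admission level to every response and the upstream caller caches it. The class names and the single-integer priority (smaller means more important) are hypothetical simplifications.

```python
class Downstream:
    """A server that admits requests at or above its admission level."""

    def __init__(self, name: str, level: int):
        self.name = name
        self.level = level

    def handle(self, priority: int) -> tuple:
        # The current level is piggybacked on every response.
        return (priority <= self.level, self.level)


class Upstream:
    """A caller that remembers each downstream server's last known level."""

    def __init__(self):
        self.known_levels = {}  # server name -> last level seen
        self.sent = 0           # requests actually sent downstream

    def call(self, server: Downstream, priority: int) -> bool:
        known = self.known_levels.get(server.name)
        if known is not None and priority > known:
            return False  # rejected locally, no network round trip
        self.sent += 1
        admitted, level = server.handle(priority)
        self.known_levels[server.name] = level
        return admitted
```

In this sketch the cached level is only refreshed when a request actually reaches the downstream server, so a real system would also need some way to learn that the level has been relaxed.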
Service Admission Control Workflow
Evaluation - Queuing time vs. Response time.
Evaluation - Different Types
Lessons learned
● Overload control in the large-scale microservice architecture must be decentralized and autonomous in each service
● The algorithmic design of overload control should take into account a variety of feedback mechanisms
● An effective design of overload control is always derived from the comprehensive profiling of the processing behavior in the actual workload
Azure responses for COVID-19
Observations & Insight
Observation:
- A rapidly growing number of people working from home, learning remotely, and staying connected with friends online
- Impact on healthcare: huge amounts of data are used to analyze the virus
Insight:
- Help people adapt to this new world
- Prioritize critical customers: doctors and nurses in hospitals, emergency management services, critical government infrastructure
For Goods Program
● Guiding principles: do no harm, outcome driven, unique value to affect outcomes, open and collaborative
● Example:
Response Framework
● Meet Demand
Address capacity in the hardest-hit regions & scale up
● Forecast
Be well prepared for potential scenarios
● Optimize
Optimize services
Network
Incredible growth in VPN and WAN usage
- WAN scaling:
12 new edge sites
25% increased peering capacity
100+ terabits
Network
- WAN traffic optimization: load-balance the traffic
Services on Azure - Teams
Services on Azure - Windows Virtual Desktop
Service scale out:
- More gateways & front ends per cluster
- Additional clusters per region
- Deployed to more regions for best performance
- More regions coming for data residency
Optimization:
- Fine-tuned database indexes
- Created client-side cache + read-only replicas
- Rebalanced traffic routing for nearby regions
Azure Security
● Azure Active Directory
● Application Proxy
- Adjusted capacity: monitoring and alerting
- Increased scale unit: availability across regions
- Provided higher throttling: limits to customers
Confidential computing on Azure
Trusted Execution Environment
Example:
ATOM: Model-Driven Autoscaling for Microservices
Observation and Challenge
1. Rule-based autoscaling yields different performance gains depending on the current workload. Vertical scaling changes CPU allocation; horizontal scaling changes the number of replicas.
2. Previous methods focus on either vertical or horizontal scaling rather than both
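To make the two dimensions concrete, here is a minimal sketch of what a utilization-threshold rule might look like when it scales only one dimension at a time. The thresholds, the step size, and the names are assumptions for illustration; they are not taken from the ATOM paper or from any particular autoscaler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    replicas: int     # horizontal dimension
    cpu_share: float  # vertical dimension, CPU cores per replica


def scale_horizontal(alloc, utilization, high=0.8, low=0.3):
    """Add or remove one replica based on CPU utilization."""
    if utilization > high:
        return Allocation(alloc.replicas + 1, alloc.cpu_share)
    if utilization < low and alloc.replicas > 1:
        return Allocation(alloc.replicas - 1, alloc.cpu_share)
    return alloc


def scale_vertical(alloc, utilization, high=0.8, low=0.3,
                   step=0.25, min_share=0.25):
    """Grow or shrink the per-replica CPU share based on CPU utilization."""
    if utilization > high:
        return Allocation(alloc.replicas, alloc.cpu_share + step)
    if utilization < low and alloc.cpu_share - step >= min_share:
        return Allocation(alloc.replicas, alloc.cpu_share - step)
    return alloc
```

Each rule reacts to the same signal but touches a different knob, which is why the gain from a fixed rule depends on which resource the workload is actually short of.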
Insight
1. Estimate performance
2. Auto-scale by changing CPU and replicas, which combines horizontal and vertical scaling
Layered Queueing Network
1. Previous work applies the utilization-based technique, using least squares to estimate service demand D from utilization U and throughput X. U = XD
2. ATOM instead uses the queue-length-based technique, which relates response time R and queue length L. R = LD
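As a worked example of the first technique, the utilization law U = XD can be fitted by least squares through the origin, which gives D = Σ(X·U) / Σ(X²). The function name and the sample measurements below are made up for illustration.

```python
def estimate_demand(throughputs, utilizations):
    """Least-squares estimate of service demand D in the model U = X * D.

    throughputs:  observed X values (requests per second)
    utilizations: observed U values (fraction of CPU busy, 0..1)
    """
    if len(throughputs) != len(utilizations) or not throughputs:
        raise ValueError("need equally sized, non-empty samples")
    numerator = sum(x * u for x, u in zip(throughputs, utilizations))
    denominator = sum(x * x for x in throughputs)
    if denominator == 0:
        raise ValueError("throughput samples are all zero")
    return numerator / denominator


# Hypothetical measurements: utilization grows linearly with throughput.
demand = estimate_demand([10, 20, 40], [0.1, 0.2, 0.4])
```

For these samples the fit is exact: each request costs about 0.01 seconds of CPU time.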
Layered Queueing Network
ATOM: Autoscaling Microservice
1. Maximize the revenue of transactions
2. MAPE-K (monitor, analyse, plan and execute with a shared knowledge base)
ATOM Algo Detail
ATOM: Autoscaling Microservice
1. UH: Horizontal scaling
2. UV: Vertical scaling
Lesson learned & future direction
Lesson: Machine learning helps extract useful information and arrive at better strategies with high likelihood. Finding the features relevant to autoscaling decisions is helpful.
Future direction: Can we move the offline training to online sequential training?
Comparisons - Similarities
● Same high-level design
- Detect overload or measure performance
- Derive the solution
Comparisons - Differences
● DAGOR (WeChat)
Relies more on a rule-based method
● ATOM
Relies more on machine learning and feature-based methods
● Azure
Relies more on meeting demand first, then forecasting, then optimizing
Questions from students
● Some requests also don't need everything to succeed (e.g. can send an OK, then resolve later). Is that something that can be handled?
● Is it possible to obtain the collection of services required to complete certain requests? In case a critical service is overloaded and thus has an admission level higher than the priority of the request, can we reject it at the entry service?
● Why are the default timeout and queueing time not even changed or optimized?
Q&A