Overload - Brown University


Transcript of Overload - Brown University

Page 1: Overload - Brown University

Overload

Qingyi Lu & Ke Ding

Page 2: Overload - Brown University

Outline

● Overview of overload

● Motivation (mono → micro)

● Common Problem

● Difference ()

● WeChat

● Azure

● ATOM

Page 3: Overload - Brown University

Overview of Overload

● Overload: System’s workload exceeds the maximum processing capacity of the system.

● Reasons:

- Excessive visits

- Bottlenecks and failures within the system

- Backend failure and delay

Page 4: Overload - Brown University

Overview of Overload

● Problems:

- CPU and memory could become the bottleneck

- The system's ability to respond could slow down

- System processing capacity could drop sharply

● Common solutions: load balancing, flow control, monitoring, etc.

Page 5: Overload - Brown University

Overload Control of Monolith and Microservices

● Monolith:

A small number of service components with trivial dependencies.

● Microservices:

Increasingly complex architecture and dependencies.

- All microservices must be monitored

- Hard to handle overload independently

- Need to adapt to service changes, workload dynamics, and the external environment

Page 6: Overload - Brown University

Overload Control in Practice

● Overload Control for Scaling WeChat Microservices

Complex core service architecture

● Azure response to COVID-19

Emergency solution

● ATOM: Model-Driven Autoscaling for Microservices

CPU & Replica

Page 7: Overload - Brown University

Overload Control for Scaling WeChat Microservices

Page 8: Overload - Brown University

Observation

1. WeChat’s microservice architecture

Complex dependency of services

2. Deployment of WeChat Services

Centralized or SLA-based overload control mechanisms cannot keep up with rapid service changes at large scale

3. Dynamic Workload

Overload control mechanism should adaptively tolerate the workload fluctuation

Page 9: Overload - Brown University

Overload Scenarios

Subsequent Overload

Page 10: Overload - Brown University

Challenge & Insight

● No single entry point for service requests, and call paths are complex

● Excessive request aborts waste computational resources

● Excessive requests hurt user experience (due to the high latency of service responses)

● Service Agnostic: Decoupling and Dynamic Development

● Independent but Collaborative: Granule and Subsequent Overload

● Efficient and Fair: Partial Failures

Page 11: Overload - Brown University

Design

1. How to detect overload: Overload Detection

2. How to control the overload: Service Admission Control

Page 12: Overload - Brown University

Overload Detection

Overload is detected from the average waiting time of requests in the pending queue (queuing time).

Why not use response time?
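The detection rule above can be sketched in a few lines. This is a minimal illustration, not the WeChat implementation: the class name, the 20 ms threshold, and the 2000-request window are assumptions chosen for the example. Queuing time depends only on the local server, whereas response time also includes time spent waiting on downstream services, which is one reason to prefer it as the signal.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QueuingTimeMonitor:
    """Flags overload when the average queuing time in a window is too high."""

    threshold_ms: float = 20.0  # assumed threshold, for illustration only
    window_size: int = 2000     # assumed number of requests per window
    _samples: List[float] = field(default_factory=list)

    def record(self, enqueue_ms: float, dequeue_ms: float) -> Optional[bool]:
        """Record one request.

        Returns None while the window is still filling, otherwise True
        (overloaded) or False (not overloaded) for the window just closed.
        """
        self._samples.append(dequeue_ms - enqueue_ms)
        if len(self._samples) < self.window_size:
            return None
        average = sum(self._samples) / len(self._samples)
        self._samples.clear()  # start a fresh window
        return average > self.threshold_ms
```

A caller would invoke `record` each time a request leaves the pending queue and react only when the return value is not `None`.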

Page 13: Overload - Brown University

Service Admission Control

1. Business-oriented Admission Control

2. User-oriented Admission Control

3. Session-oriented Admission Control

4. Adaptive Admission Control

5. Collaborative Admission Control

Page 14: Overload - Brown University

Business-oriented Admission Control

● Requests are prioritized based on their business significance

● Subsequent requests inherit the same business priority

● Advantages:

- Service agnostic: business priority is independent of the business logic of any service

- Easy to maintain: business priority is assigned in the entry services and reflects the changes of basic and leap services

Page 15: Overload - Brown University

User-oriented Admission Control

Example:

Current business admission level is T

Overload detected

Level changes to T-1, causing partial failure

System becomes underloaded

Level set back to T
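The level changes in this example can be sketched as a small state update. This is an illustrative sketch only: the function names, the step size of one level, and the convention that a smaller number means a higher priority are assumptions, not details taken from the slides.

```python
def admits(level: int, priority: int) -> bool:
    """A request is admitted if its priority value does not exceed the level.

    Assumption: smaller numbers mean higher priority.
    """
    return priority <= level


def adjust_level(level: int, overloaded: bool, max_level: int) -> int:
    """Tighten the admission level by one on overload, relax by one otherwise."""
    if overloaded:
        return max(level - 1, 0)
    return min(level + 1, max_level)
```

Starting from level T, one overloaded window moves the level to T-1, so requests with priority T are shed; once the system is underloaded, the next adjustment restores level T.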

Page 16: Overload - Brown University

Other Admission Control

● Session-oriented Admission Control:

Based on Session ID

● Adaptive Admission Control:

Adapt to the load status changes to minimize impact on the quality of the overall service

● Collaborative Admission Control:

Learn the latest admission level of the downstream server
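Collaborative admission control can be sketched from the upstream side as follows. This is a hypothetical sketch: the class and method names are invented for illustration, and it assumes that the downstream server reports its current admission level with each response and that smaller numbers mean higher priority.

```python
from typing import Dict


class UpstreamClient:
    """Remembers each downstream service's last reported admission level."""

    def __init__(self) -> None:
        self.downstream_levels: Dict[str, int] = {}

    def on_response(self, service: str, admission_level: int) -> None:
        # Assumed: the level is piggybacked on every downstream response.
        self.downstream_levels[service] = admission_level

    def should_send(self, service: str, priority: int) -> bool:
        """Reject locally any request the downstream server would drop anyway."""
        level = self.downstream_levels.get(service)
        if level is None:
            return True  # nothing learned yet, so send optimistically
        return priority <= level
```

Rejecting at the upstream side avoids spending network and queue capacity on requests that the overloaded downstream server would discard.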

Page 17: Overload - Brown University

Service Admission Control Workflow

Page 18: Overload - Brown University

Evaluation - Queuing time vs. Response time.

Page 19: Overload - Brown University

Evaluation - Different Types

Page 20: Overload - Brown University

Lesson learned

● Overload control in the large-scale microservice architecture must be decentralized and autonomous in each service

● The algorithmic design of overload control should take into account a variety of feedback mechanisms

● An effective design of overload control is always derived from the comprehensive profiling of the processing behavior in the actual workload

Page 21: Overload - Brown University

Azure response to COVID-19

Page 22: Overload - Brown University

Observations & Insight

Observation:

- Rapidly growing numbers of people working from home, learning remotely, and staying connected with friends online

- Impact on healthcare: using huge amounts of data to analyze the virus

Insight:

- Help people adapt to this new world

- Prioritize critical customers: doctors and nurses in hospitals, emergency management services, critical government infrastructure

Page 23: Overload - Brown University

For Goods Program

● Guiding principles: do no harm, outcome driven, unique value to affect outcomes, open collaboration

● Example:

Page 24: Overload - Brown University

Response Framework

● Meet Demand

Address capacity in the hardest-hit regions & scale up

● Forecast

Be well prepared for potential cases

● Optimize

Optimize services

Page 25: Overload - Brown University

Network

Incredible growth in VPN and WAN usage

- WAN scaling:

12 new edge sites

25% increased peering capacity

100+ terabits

Page 26: Overload - Brown University

Network

- WAN traffic optimization: load-balance the traffic

Page 27: Overload - Brown University

Services on Azure - Teams

Page 28: Overload - Brown University

Services on Azure - Windows Virtual Desktop

Service scale out:

- More gateways & front ends per cluster

- Additional clusters per region

- Deployed to more regions for best performance

- More regions coming for data residency

Optimization:

- Fine-tuned database indexes

- Created client-side cache + read-only replicas

- Rebalanced traffic routing for nearby regions

Page 29: Overload - Brown University

Azure Security

● Azure Active Directory

● Application Proxy

- Adjusted capacity: monitoring and alerting

- Increased scale unit: availability across regions

- Provided higher throttling: limits to customers

Page 30: Overload - Brown University

Confidential computing on Azure

Trusted Execution Environment

Example:

Page 31: Overload - Brown University

ATOM: Model-Driven Autoscaling for Microservices

Page 32: Overload - Brown University

Observation and Challenge

1. Rule-based autoscaling provides different performance gains depending on the current workload. Vertical scaling changes CPU; horizontal scaling changes the number of replicas.

2. Previous methods focus on either vertical or horizontal scaling rather than both

Page 33: Overload - Brown University

Insight

1. Estimate performance

2. Auto-scale by changing CPU and replicas, which combines horizontal and vertical scaling

Page 34: Overload - Brown University

Layered Queueing Network

Page 35: Overload - Brown University

Layered Queueing Network

1. Previous work applies the utilization technique: utilization is the feature for a least-squares estimate of service demand. U = XD

2. ATOM instead takes the queue-length technique, which relates response time and queue length. R = LD
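Both relations above are linear in the service demand D, so a least-squares fit through the origin gives an estimate of D from measured samples. The following is a minimal sketch under that assumption; the function name and the sample values in the usage note are illustrative and not taken from the paper.

```python
from typing import Sequence


def estimate_demand(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares estimate of D in y ≈ D * x (a line through the origin).

    Utilization technique:  x = throughput X,   y = utilization U.
    Queue-length technique: x = queue length L, y = response time R.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    denominator = sum(x * x for x in xs)
    if denominator == 0:
        raise ValueError("need at least one non-zero x sample")
    numerator = sum(x * y for x, y in zip(xs, ys))
    return numerator / denominator
```

For example, throughputs of 10, 20, and 30 requests per second with utilizations of 0.1, 0.2, and 0.3 give an estimated demand of about 0.01 seconds per request.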

Page 36: Overload - Brown University

Layered Queueing Network

Page 37: Overload - Brown University

Layered Queueing Network

Page 38: Overload - Brown University

ATOM: Autoscaling Microservice

1. Maximize the revenue of transactions

2. MAPE-K (monitor, analyse, plan and execute with a shared knowledge base)
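One iteration of a MAPE-K loop can be sketched as below. This is a generic skeleton, not ATOM's algorithm: `Config`, `mape_k_step`, and all callback names are hypothetical, and an exhaustive search over a small candidate list stands in for whatever optimizer the real system uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class Config:
    replicas: int     # horizontal dimension
    cpu_share: float  # vertical dimension


def mape_k_step(
    knowledge: Dict[str, object],
    observe: Callable[[], dict],
    predict_revenue: Callable[[Config, dict], float],
    candidates: Sequence[Config],
    apply: Callable[[Config], None],
) -> Config:
    """Run one monitor, analyse, plan, execute cycle over shared knowledge."""
    metrics = observe()                                # Monitor
    knowledge["metrics"] = metrics
    scored = [(predict_revenue(c, metrics), c) for c in candidates]  # Analyse
    best = max(scored, key=lambda pair: pair[0])[1]    # Plan
    apply(best)                                        # Execute
    knowledge["config"] = best
    return best
```

In this sketch `predict_revenue` plays the role of the performance model: it maps a candidate CPU and replica setting plus the observed workload to an expected revenue.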

Page 39: Overload - Brown University

ATOM Algo Detail

Page 40: Overload - Brown University

ATOM: Autoscaling Microservice

1. UH: Horizontal scaling

2. UV: Vertical scaling

Page 41: Overload - Brown University

Lesson learned & future direction

Lesson: Machine learning helps extract useful information and is likely to yield better strategies. Finding the relevant features for autoscaling decisions is helpful.

Future direction: Can we move the offline training to online sequential training?

Page 42: Overload - Brown University

Comparisons - Similarities

● Same overall design

- Detect overload or performance

- Derive the solution

Page 43: Overload - Brown University

Comparisons - Differences

● DAGOR (WeChat)

Relies more on a rule-based method

● ATOM

Relies more on machine learning and feature-based methods

● Azure

Based more on meeting demand first, forecast and then optimize

Page 44: Overload - Brown University

Questions from students

● Some requests also don't need everything to succeed (e.g. can send an OK, then resolve later). Is that something that can be handled?

● Is it possible to obtain the collection of services required to complete certain requests? In case a critical service is overloaded and thus has an admission level higher than the priority of the request, can we reject it at the entry service?

● Why are the default timeout and queueing time not even changed or optimized?

Page 45: Overload - Brown University

Q&A