Applied Math at Microsoft Azure - Rohit Pandey


Page 1: Applied Math at Microsoft Azure - Rohit Pandey

Applied Math at Microsoft Azure

Page 2: Applied Math at Microsoft Azure - Rohit Pandey

What to expect

• I will talk about two interesting use cases of Applied Math in Azure.

• Unfortunately, I can’t go into details of Azure or the numbers, but I’m hoping the gist will be clear.

Page 3: Applied Math at Microsoft Azure - Rohit Pandey

What is Azure?

• Azure is a cloud service.

• Competitor to AWS

• Basic Architecture

Page 4: Applied Math at Microsoft Azure - Rohit Pandey

Topic 1: Dirichlet Entropy for anomaly detection

Contributors:

• Rohit Pandey

• Gil Lapid Shafriri

Page 5: Applied Math at Microsoft Azure - Rohit Pandey

Background

• At Azure, we keep track of various causes and components associated with downtimes of customer VMs (categorical histograms).

• We use this data to prioritize fixes for top downtime reasons and components.

• But what about patterns that manage to stay out of sight?

Page 6: Applied Math at Microsoft Azure - Rohit Pandey

Background (continued)

• There is a tendency to confuse “small” with “ambient”. And over a large timeframe, “small” becomes “large”.

• Ambient noise should be like a fair die.

• Truly ambient noise won’t unduly favor any component (e.g., a rack).

• We need one measure of how “skewed” our histogram is, and we need to trend it over time.

Page 7: Applied Math at Microsoft Azure - Rohit Pandey

Approach

• Categorical histograms are like rolls of a die, and the canonical distribution for the parameters of a die is the Dirichlet.

• A great metric for determining skewness is Entropy (for a random variable).

[Figure: two histograms over categories 1 to 6, labeled “Low Entropy” and “High Entropy”.]
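A minimal sketch of how such a skew measure could be computed for one categorical histogram. The slides do not say how the Dirichlet and the entropy are combined, so this assumes a symmetric Dirichlet prior, takes the posterior mean of the category probabilities, and divides the Shannon entropy by its maximum, log(k). The function names and the `prior` pseudo-count are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def dirichlet_posterior_mean(counts, prior=1.0):
    """Posterior mean of the category probabilities under a symmetric
    Dirichlet(prior, ..., prior) prior (an assumption, not from the slides)."""
    total = sum(counts) + prior * len(counts)
    return [(c + prior) / total for c in counts]


def normalized_entropy(counts, prior=1.0):
    """Entropy scaled to [0, 1]; 1 means perfectly uniform ("fair die").
    Requires at least two categories."""
    k = len(counts)
    return entropy(dirichlet_posterior_mean(counts, prior)) / math.log(k)


# Hypothetical downtime counts per component for two histograms.
skewed = [35, 1, 1, 1, 1, 1]   # one component dominates: low entropy
ambient = [6, 7, 8, 7, 6, 7]   # close to uniform: high entropy
print(normalized_entropy(skewed), normalized_entropy(ambient))
```

Normalizing by log(k) makes histograms with different numbers of categories comparable, which matters if the values are trended or ranked side by side.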

Page 8: Applied Math at Microsoft Azure - Rohit Pandey

Implementation and Results

• Set up a portal that shows a list of categorical histograms, sorted in descending order of Entropy.

• Caught multiple instances of rack failures.

• Nodes stuck in reboot loop due to incorrect configuration.

• And more.

Page 9: Applied Math at Microsoft Azure - Rohit Pandey

Topic 2: To reboot or not to reboot

Contributors:

• Rohit Pandey

• Durmus Karatay

• Gil Lapid Shafriri

• Randolph Yao

Page 10: Applied Math at Microsoft Azure - Rohit Pandey

The Problem

• Machines in Azure can be in various “states”. For example, “Healthy” and “Unwell”.

• When a machine becomes unwell, we wait a certain amount of time (τ₀) to give it a chance to organically recover.

• How do we optimize this τ₀ so as to minimize the downtime?

Page 11: Applied Math at Microsoft Azure - Rohit Pandey

Toy Transition Diagram

[Diagram: states Unwell, Healthy, Rebooting, State 1, and State 2.]

Page 12: Applied Math at Microsoft Azure - Rohit Pandey

Transition Matrices

• Transition probabilities matrix

• Transition times matrix

Page 13: Applied Math at Microsoft Azure - Rohit Pandey

Formulation

• In our estimate of Y, we consider both the happy and the sad paths.

• We can find the threshold (τ) that minimizes the expected downtime by setting the derivative of the expected downtime with respect to τ to zero.

[Diagram: states Unwell, Healthy, and Rebooting, annotated with X : f_X(x), τ, and Y.]
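One way to make the formulation concrete is sketched below. It assumes (this is not spelled out on the slides) that a machine which has not recovered by τ is rebooted and then needs an expected time Y to become healthy, so that E[T](τ) = ∫₀^τ P(X > x) dx + P(X > τ)·Y. For a Lomax-distributed X with shape `alpha` and scale `lam`, setting the derivative to zero reduces to "hazard rate = 1/Y", which gives τ = alpha·Y − lam. All parameter values are hypothetical.

```python
def lomax_survival(x, alpha, lam):
    """P(X > x) for a Lomax distribution with shape alpha and scale lam."""
    return (1.0 + x / lam) ** (-alpha)


def expected_downtime(tau, alpha, lam, y):
    """E[T] = integral_0^tau S(x) dx + S(tau) * y under the assumed model.
    Uses the closed-form integral of the Lomax survival (needs alpha != 1)."""
    integral = lam / (alpha - 1.0) * (1.0 - (1.0 + tau / lam) ** (1.0 - alpha))
    return integral + lomax_survival(tau, alpha, lam) * y


def optimal_threshold(alpha, lam, y):
    """Stationary point of E[T]: hazard alpha / (lam + tau) equals 1 / y.
    If alpha * y <= lam, waiting never pays off and the threshold is 0."""
    return max(0.0, alpha * y - lam)


# Hypothetical numbers (say, minutes): shape 2, scale 1, reboot path Y = 5.
tau_hat = optimal_threshold(2.0, 1.0, 5.0)          # 9.0
print(tau_hat, expected_downtime(tau_hat, 2.0, 1.0, 5.0))
```

Because the Lomax hazard rate decreases over time, the stationary point is a minimum: early on, waiting is cheaper than rebooting, and past τ the chance of organic recovery is too low to justify further waiting.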

Page 14: Applied Math at Microsoft Azure - Rohit Pandey

Choice of X

• We considered 7-8 distributions and settled on the Lomax because it models extreme values best.

• To estimate the parameters, we used:

    • All samples that we saw from Unwell to Ready.

    • The instances of Unwell to Rebooting, all of which are cases where recovery would certainly have taken more than τ₀.
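The two kinds of samples above correspond to exact observations and right-censored observations. A minimal sketch of a censored maximum-likelihood fit for a Lomax distribution follows; the slides do not describe the fitting procedure, so the profile-likelihood grid search, the function names, and the sample values are all assumptions for illustration.

```python
import math


def lomax_censored_loglik(alpha, lam, observed, censored):
    """Log-likelihood with exact recovery times and right-censored waits."""
    if alpha <= 0 or lam <= 0:
        return float("-inf")
    ll = 0.0
    for x in observed:   # density: (alpha/lam) * (1 + x/lam)^-(alpha+1)
        ll += math.log(alpha / lam) - (alpha + 1.0) * math.log1p(x / lam)
    for c in censored:   # survival: (1 + c/lam)^-alpha
        ll -= alpha * math.log1p(c / lam)
    return ll


def profile_alpha(lam, observed, censored):
    """For a fixed scale, the likelihood-maximizing shape has a closed form.
    Assumes at least one strictly positive time among the samples."""
    s = sum(math.log1p(v / lam) for v in list(observed) + list(censored))
    return len(observed) / s


def fit_lomax(observed, censored, lam_grid):
    """Grid search over the scale, using the closed-form shape at each point."""
    best = None
    for lam in lam_grid:
        alpha = profile_alpha(lam, observed, censored)
        ll = lomax_censored_loglik(alpha, lam, observed, censored)
        if best is None or ll > best[0]:
            best = (ll, alpha, lam)
    return best[1], best[2]


# Hypothetical data: three organic recoveries, two machines rebooted at 3.0.
observed = [0.5, 1.0, 2.0]
censored = [3.0, 3.0]
print(fit_lomax(observed, censored, [0.25 * i for i in range(1, 41)]))
```

Dropping the censored samples would bias the fit toward short recovery times, since exactly the slow recoveries are the ones cut off by the reboot.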

Page 15: Applied Math at Microsoft Azure - Rohit Pandey

Choice of Y

• We think of “Healthy” as the absorbing state and the others as transient.

• For each transient state, we consider the time taken to get from that state to the absorbing state.
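A sketch of how an expected time to the absorbing state can be computed from a transition probabilities matrix and a transition times matrix. It assumes the standard first-step relation t_i = Σ_j P[i][j]·(D[i][j] + t_j) with t = 0 at the absorbing state. The states, probabilities, and times below are made up, and the slides do not confirm that Y was computed exactly this way.

```python
def expected_time_to_absorption(P, D, absorbing, iters=10_000, tol=1e-12):
    """Iterate t_i = sum_j P[i][j] * (D[i][j] + t_j) until it converges.
    P[i][j]: probability of moving i -> j; D[i][j]: mean time of that move."""
    n = len(P)
    t = [0.0] * n
    for _ in range(iters):
        new_t = [
            0.0 if i == absorbing
            else sum(P[i][j] * (D[i][j] + t[j]) for j in range(n))
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_t, t)) < tol:
            return new_t
        t = new_t
    return t


# Hypothetical toy chain: 0 = Healthy (absorbing), 1 = Rebooting, 2 = State 1.
P = [
    [1.0, 0.0, 0.0],
    [0.9, 0.0, 0.1],   # happy path: straight to Healthy; sad path: State 1
    [0.5, 0.5, 0.0],
]
D = [
    [0.0, 0.0, 0.0],
    [10.0, 0.0, 5.0],
    [20.0, 2.0, 0.0],
]
times = expected_time_to_absorption(P, D, absorbing=0)
print(times[1])  # expected time from Rebooting to Healthy in this toy chain
```

The iteration converges whenever every transient state eventually reaches the absorbing state; for larger chains, solving the equivalent linear system directly would be the more usual choice.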

Page 16: Applied Math at Microsoft Azure - Rohit Pandey

Result

[Figure: plot annotated with Y, E[X], E[T], the thresholds τ̂ and τ₀, and a region labeled “Savings”.]