Applied Math at Microsoft Azure - Rohit Pandey
Applied Math at Microsoft Azure
What to expect
• I will talk about two interesting use cases of applied math in Azure.
• Unfortunately, I can’t go into the details of Azure or the numbers, but I’m hoping the gist will be clear.
What is Azure?
• Azure is a cloud service.
• Competitor to AWS
• Basic Architecture
Topic 1: Dirichlet Entropy for anomaly detection
Contributors:
• Rohit Pandey
• Gil Lapid Shafriri
Background
• At Azure, we keep track of the various causes and components associated with downtimes of customer VMs (categorical histograms).
• We use this data to prioritize fixes for the top downtime reasons and components.
• But what about patterns that manage to stay out of sight?
• There is a tendency to confuse “small” with “ambient”. And over a large timeframe, “small” becomes “large”.
• Ambient noise should be like a fair die.
• Truly ambient noise won’t unduly favor any component (e.g., a rack).
• We need one measure for how “skewed” our histogram is, and we need to trend it over time.
Background (continued)
Approach
• Categorical histograms are like rolls of a die, and the canonical distribution for the parameters of a die is the Dirichlet.
• A great metric for determining skewness is the entropy of a random variable.
[Figure: two histograms over categories 1 to 6, labeled "Low Entropy" and "High Entropy".]
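A minimal sketch of how a single "skewness" number could be computed for one categorical histogram. It is not the portal's actual code: the function name, the symmetric Dirichlet prior (`prior=1.0`), and the example counts are all assumptions made for illustration. The counts are smoothed with the Dirichlet posterior mean, and the Shannon entropy is divided by log(K) so that a perfectly even histogram scores 1 and a heavily skewed one scores close to 0.

```python
import math

def normalized_entropy(counts, prior=1.0):
    """Entropy of a categorical histogram, scaled to [0, 1].

    counts: occurrences per category (e.g. downtimes per rack).
    prior:  assumed symmetric Dirichlet pseudo-count added to every category.
    Needs at least two categories and a positive total after smoothing.
    """
    k = len(counts)
    total = sum(counts) + prior * k
    # Posterior-mean estimate of the category probabilities.
    probs = [(c + prior) / total for c in counts]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(k)

# Made-up histograms over six components.
balanced = [5, 6, 5, 4, 6, 5]
skewed = [30, 1, 0, 2, 1, 0]

print("balanced:", round(normalized_entropy(balanced), 3))
print("skewed:  ", round(normalized_entropy(skewed), 3))
```

Trending this value per histogram over time gives one series to watch: a drop means some category has started to dominate.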
Implementation and Results
• Set up a portal that shows a list of categorical histograms sorted in descending order by entropy.
• Caught multiple instances of rack failures.
• Nodes stuck in a reboot loop due to incorrect configuration.
• And more.
Topic 2: To reboot or not to reboot
Contributors:
• Rohit Pandey
• Durmus Karatay
• Gil Lapid Shafriri
• Randolph Yao
The Problem
• Machines in Azure can be in various “states”, for example “Healthy” and “Unwell”.
• When a machine becomes unwell, we wait a certain amount of time ($\tau_0$) to give it a chance to recover organically.
• How do we optimize this $\tau_0$ so as to minimize the downtime?
Toy Transition Diagram
[Diagram: states Unwell, Healthy, Rebooting, State 1, and State 2.]
Transition Matrices
• Transition probabilities matrix
• Transition times matrix
Formulation
• In our estimate of $Y$, we consider both the happy and the sad paths.
• We can find the threshold ($\tau$) that minimizes the expected downtime by setting the derivative of the expected downtime with respect to $\tau$ to zero.
[Diagram: states Unwell, Healthy, and Rebooting, annotated with $X : f_X(x)$, $\tau$, and $Y$.]
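One way the formulation could be written out, as a sketch under assumptions that the transcript does not state explicitly: $X$ is the organic recovery time with density $f_X$ and CDF $F_X$, $Y$ is treated as a fixed expected time to reach Healthy once a reboot is triggered, and a machine that has not recovered by $\tau$ is rebooted at that point.

```latex
\begin{align*}
E[T](\tau) &= \int_0^{\tau} x\, f_X(x)\, dx
              + \bigl(1 - F_X(\tau)\bigr)\,(\tau + Y) \\[4pt]
\frac{dE[T]}{d\tau} &= \bigl(1 - F_X(\tau)\bigr) - Y\, f_X(\tau) = 0
\quad\Longrightarrow\quad
\frac{f_X(\hat{\tau})}{1 - F_X(\hat{\tau})} = \frac{1}{Y}
\end{align*}
```

Under these assumptions the optimum sits where the hazard rate of $X$ equals $1/Y$: keep waiting while the chance of an imminent organic recovery is worth more than the cost of a reboot, and reboot once it is not.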
Choice of X
• Considered 7-8 distributions and settled on Lomax because it models extreme values the best.
• To estimate the parameters, we used:
  • All samples that we saw going from Unwell to Ready.
  • The instances of Unwell to Rebooting, which were all cases where recovery would have taken longer than the waiting time for sure.
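The two kinds of samples above amount to fitting with right-censored data: exact recovery times contribute the density, and machines that were rebooted contribute only "longer than the wait". Below is a minimal sketch of such a fit for a Lomax distribution with shape `alpha` and scale `lam`. The function names, the grid of candidate scales, and the sample values are all hypothetical, and a simple grid search stands in for whatever optimizer was actually used.

```python
import math

def lomax_loglik(alpha, lam, observed, censored):
    """Log-likelihood of Lomax(alpha, lam) with right-censored samples."""
    ll = 0.0
    for x in observed:   # exact recovery times: log of the density
        ll += math.log(alpha) - math.log(lam) - (alpha + 1.0) * math.log1p(x / lam)
    for c in censored:   # only known to exceed c: log of the survival function
        ll += -alpha * math.log1p(c / lam)
    return ll

def profile_alpha(lam, observed, censored):
    """Shape that maximizes the likelihood for a fixed scale (closed form)."""
    total = sum(math.log1p(t / lam) for t in list(observed) + list(censored))
    return len(observed) / total

def fit_lomax(observed, censored, lam_grid=None):
    """Grid search over the scale, with the shape profiled out.

    Returns (alpha, lam, log_likelihood). The default grid is an assumed
    search range, not a recommendation.
    """
    if lam_grid is None:
        lam_grid = [10 ** (k / 10.0) for k in range(-30, 41)]
    best = None
    for lam in lam_grid:
        alpha = profile_alpha(lam, observed, censored)
        ll = lomax_loglik(alpha, lam, observed, censored)
        if best is None or ll > best[2]:
            best = (alpha, lam, ll)
    return best

# Made-up data: six organic recoveries, three machines rebooted after waiting 10.
observed = [0.5, 1.2, 0.3, 2.0, 4.5, 0.8]
censored = [10.0, 10.0, 10.0]
alpha_hat, lam_hat, ll_hat = fit_lomax(observed, censored)
print(alpha_hat, lam_hat, ll_hat)
```

Dropping the censored samples would bias the fit toward short recovery times, because exactly the slowest machines are the ones that never got the chance to recover on their own.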
Choice of Y
• We think of “Healthy” as the absorbing state and the others as transient.
• We track the time taken to get to the absorbing state from each transient state.
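A minimal sketch of how the expected time to reach the absorbing state could be computed from a transition probabilities matrix and a transition times matrix. It assumes the times matrix holds the mean duration of each transition, so that the expected time from transient state i satisfies t_i = sum_j P[i][j] * (T[i][j] + t_j), with t = 0 at the absorbing state. The state ordering and all numbers are made up.

```python
def expected_time_to_absorption(P, T, absorbing):
    """Solve (I - Q) t = r for the transient states by Gauss-Jordan elimination.

    P[i][j]: probability of moving from state i to state j.
    T[i][j]: assumed mean time spent on the transition i -> j.
    absorbing: index of the single absorbing state.
    Returns {transient_state_index: expected time to absorption}.
    """
    n = len(P)
    transient = [i for i in range(n) if i != absorbing]
    idx = {s: k for k, s in enumerate(transient)}
    m = len(transient)
    # Augmented matrix [I - Q | r], where r_i = sum_j P[i][j] * T[i][j].
    A = [[0.0] * (m + 1) for _ in range(m)]
    for s in transient:
        row = A[idx[s]]
        row[idx[s]] += 1.0
        for j in range(n):
            row[m] += P[s][j] * T[s][j]
            if j != absorbing:
                row[idx[j]] -= P[s][j]
    for col in range(m):
        # Partial pivoting, then normalize the pivot row and clear the column.
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pv = A[col][col]
        A[col] = [v / pv for v in A[col]]
        for r in range(m):
            if r != col and A[r][col] != 0.0:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    return {s: A[idx[s]][m] for s in transient}

# Hypothetical toy chain: 0 = Rebooting, 1 = State 1, 2 = State 2, 3 = Healthy.
P = [[0.0, 0.1, 0.0, 0.9],
     [0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0, 1.0]]
T = [[0.0, 2.0, 0.0, 5.0],
     [0.0, 0.0, 3.0, 10.0],
     [0.0, 0.0, 0.0, 20.0],
     [0.0, 0.0, 0.0, 0.0]]
times = expected_time_to_absorption(P, T, absorbing=3)
print(times[0])  # expected time from Rebooting to Healthy, one candidate for Y
```

In this toy chain the rare sad path through State 1 and State 2 raises the expected time from Rebooting above the 5 units of the happy path, which is the effect of counting both kinds of paths in the estimate.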
Result
[Figure: plot annotated with $Y$, $E[X]$, $E[T]$, $\tau$, $\hat{\tau}$, $\tau_0$, and "Savings".]
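A minimal sketch of how the optimal threshold and the savings could be computed once X and Y are chosen. It assumes a Lomax X with shape `alpha` and scale `lam`, a fixed reboot-path time `y`, and an expected downtime of E[min(X, tau)] + y * P(X > tau), meaning a machine is rebooted if it has not recovered by tau. Under those assumptions the Lomax hazard alpha / (lam + tau) equals 1 / y at tau = alpha * y - lam. All parameter values below are made up.

```python
import math

def expected_downtime(tau, alpha, lam, y):
    """E[T] when waiting up to tau for organic recovery, then rebooting."""
    z = 1.0 + tau / lam
    survival = z ** (-alpha)                  # P(X > tau) for a Lomax X
    if alpha == 1.0:
        mean_capped = lam * math.log(z)       # E[min(X, tau)], special case
    else:
        mean_capped = lam / (alpha - 1.0) * (1.0 - z ** (1.0 - alpha))
    return mean_capped + y * survival

def optimal_threshold(alpha, lam, y):
    """Threshold where the Lomax hazard equals 1 / y, clipped at zero."""
    return max(0.0, alpha * y - lam)

# Hypothetical parameters and current threshold.
alpha, lam, y, tau_0 = 2.0, 3.0, 10.0, 5.0
tau_hat = optimal_threshold(alpha, lam, y)
savings = expected_downtime(tau_0, alpha, lam, y) - expected_downtime(tau_hat, alpha, lam, y)
print("tau_hat:", tau_hat)
print("savings per incident:", savings)
```

A threshold of zero (reboot immediately) comes out whenever alpha * y is below lam, that is, when organic recovery is too slow relative to the reboot path to be worth waiting for.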