
Machine Learning in Production

Krishna Sridhar (@krishna_srd), Data Scientist, Dato Inc.


About Me

• Background
  - Machine Learning (ML) research.
  - Ph.D. in Numerical Optimization @Wisconsin.
• Now
  - Build ML tools for data scientists & developers @Dato.
  - Help deploy ML algorithms.

@krishna_srd, @DatoInc

Overview

• Lots of fundamental problems to tackle.
• A blend of statistics, applied ML, and software engineering.
• The space is new, so there is lots of room for innovation!
• Understanding production helps you make better modeling decisions.

What is an ML app?

[Diagram: an ML app = ML + an app.]

Why production?


• Share: make your predictions available to everyone.
• Review: measure the quality of the predictions over time.
• React: improve prediction quality with feedback.

ML in Production - 101

[Diagram: Creation (historical data → trained model) feeds Production (deployed model scoring live data → predictions).]

What is Production?

• Deployment: making model predictions easily available.
• Evaluation: measuring the quality of deployed models.
• Monitoring: tracking model quality over time.
• Management: improving deployed models with feedback.


Deployment


What is Deployment?

[Diagram: the production quadrant (deployment, evaluation, monitoring, management) with Deployment highlighted.]

ML in Production - 101

[Diagram: the Creation/Production pipeline again; deployment is the step that turns the trained model into the deployed model.]

What are we deploying?

13

def predict(data):
    data['is_good'] = data['rating'] > 3
    return model.predict(data)

Advantages

• Flexibility: no need for complicated abstractions.
• Software deployment is a very mature field.
• Rapid model updating with continuous deployment.

Treat model deployment the same way as code deployment!

What are we deploying?

Python:

def predict(data):
    data['is_good'] = data['rating'] > 3
    return model.predict(data)

Scala:

def predict(data): Double = {
  data("is_good") = data("rating") > 3
  model.predict(data)
}

R:

predict <- function(data) {
  data$is_good <- data$rating > 3
  predict(model, data)
}

What’s the challenge?

The “wall of confusion”:

Data Scientist: “Beat baseline by 15%. Time to deploy!”
Deployment Engineer: “What the **** is alpha, and beta?”

What’s the solution?

[Diagram: the wall comes down when both sides share the same goal. Data Scientist and Deployment Engineer, together: “Beat baseline by 15%!”]

Deployment - Demo


Deploying ML: Requirements

1. Ease of integration: any code, any language.
2. Low-latency predictions: cache frequent predictions (a small sketch follows below).
3. Fault tolerant: replicate models, run on many machines.
4. Scalable: elastically scale nodes up or down.
5. Maintainable: easily update with newer models.
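To make requirement 2 concrete, here is a minimal caching sketch in Python, assuming any model object with a predict method; the Model class, feature names, and cache size here are illustrative, not Dato's API.

from functools import lru_cache

class Model:
    """Stand-in for a trained model; swap in your real model object."""
    def predict(self, features):
        return 1.0 if features["is_good"] else 0.0  # toy scoring rule

model = Model()

@lru_cache(maxsize=100_000)
def cached_predict(rating):
    # Same feature engineering as the deployed predict() above, so the
    # cached and uncached paths cannot drift apart.
    return model.predict({"rating": rating, "is_good": rating > 3})

print(cached_predict(4))  # computed once
print(cached_predict(4))  # served from the cache thereafter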

Deploying ML

[Diagram: clients send requests to a load balancer, which routes them across Node 1, Node 2, and Node 3; each node runs a web service, a prediction cache, and the model.]

Evaluation

What is Evaluation?

Evaluation = predictions + metric.

Which metric? The model evaluation metric != the business metric:
• Model metrics: precision-recall, DCG, NDCG.
• Business metrics: user engagement, click-through rate.

Track both ML and business metrics to see if they correlate! (A small sketch follows below.)

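As an illustration of tracking both, one might log the two metrics side by side per day and check their correlation; the metric values below are made up, and statistics.correlation needs Python 3.10+.

import statistics

# Hypothetical daily logs: offline NDCG of the ranker vs. observed CTR.
ndcg = [0.71, 0.72, 0.70, 0.75, 0.74, 0.69, 0.73]
ctr  = [0.031, 0.033, 0.030, 0.036, 0.035, 0.029, 0.034]

# A Pearson correlation near 1 suggests the model metric is a good proxy
# for the business metric, so offline wins should translate online.
print(statistics.correlation(ndcg, ctr))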

Evaluating Models

[Diagram: offline evaluation scores the trained model on historical data; online evaluation scores the deployed model's predictions on live data.]

Monitoring & Management

Tracking metrics over time (monitoring) and reacting to feedback from deployed models (management).

[Diagram: the production pipeline with a feedback loop from predictions back into the deployed model.]

Important for software engineering:
- Versioning
- Logging
- Provenance
- Dashboards
- Reports

Interesting for applied-ML researchers:
- Updating models

Updating Models

When to update?
• Trends and user tastes change over time.
  - I liked R in the past, but now I like Python!
  - Tip: track statistics about the data over time (a minimal sketch follows below).
• Model performance drops.
  - CTR was down 20% last month.
  - Tip: monitor both offline and online metrics, and track their correlation!

How to update?
• A/B testing
• Multi-armed bandits
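Here is one way the data-statistics tip could look in practice; the feature, the numbers, and the 3-sigma threshold are all illustrative.

from statistics import mean, stdev

# Daily mean of one input feature ('rating'), logged over time.
baseline = [3.1, 3.0, 3.2, 3.1, 2.9, 3.0, 3.1]  # training era
recent   = [2.4, 2.5, 2.3, 2.6, 2.4, 2.5, 2.4]  # last week

# Flag drift when the recent mean sits more than 3 baseline standard
# deviations from the baseline mean.
z = abs(mean(recent) - mean(baseline)) / stdev(baseline)
if z > 3:
    print(f"Data drift suspected (z = {z:.1f}); consider retraining.")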

A/B Testing

Is model V2 significantly better than model V1?

[Diagram: traffic is split between Model V1 (2000 visits, 10% CTR) and Model V2 (2000 visits, 30% CTR).]

Be really careful with A/B testing.

[Diagram: after the test, the world gets V2, the winner from group B.]

Multi-armed Bandits

[Diagram: instead of a fixed split, the bandit exploits the better model (V2) 90% of the time and explores the alternative (V1) 10% of the time, ending with 36k visits at 30% CTR.]
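The 90/10 split in the diagram is essentially epsilon-greedy with epsilon = 0.1. A minimal sketch, with made-up counts and model names:

import random

# Observed clicks and visits per model so far (illustrative counts).
stats = {"V1": {"clicks": 200, "visits": 2000},
         "V2": {"clicks": 600, "visits": 2000}}

EPSILON = 0.10  # explore 10% of the time, exploit 90%

def choose_model():
    if random.random() < EPSILON:
        return random.choice(list(stats))  # exploration: any arm
    # Exploitation: the arm with the best empirical CTR so far.
    return max(stats, key=lambda m: stats[m]["clicks"] / stats[m]["visits"])

def record(model, clicked):
    stats[model]["visits"] += 1
    stats[model]["clicks"] += int(clicked)

model = choose_model()        # pick a model for this visit
record(model, clicked=True)   # then log the outcome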

MAB vs A/B Testing

Why MAB?
• A "set and forget" approach for continuous optimization.
• Minimize your losses.
• Good MAB algorithms converge very quickly!

Why A/B testing?
• Easy and quick to set up!
• Answers relevant business questions.
• Caveat: sometimes it can take a while before you observe results.

Conclusion

@krishna_srd, @DatoInc

• ML in production can be fun! Lots of new challenges in deployment, evaluation, monitoring, and management.

• Summary of tips:
  - Try to run the same code in modeling and deployment.
  - Business metric != model metric.
  - Monitor offline and online behavior; track their correlation.
  - Be really careful with A/B testing.
  - Minimize your losses with multi-armed bandits!

Thanks!

Download: pip install graphlab-create
Docs: https://dato.com/learn/
Source: https://github.com/dato-code/tutorials

Thank you!

Backup


When/how to evaluate ML

• Offline evaluation
  - Evaluate on historical labeled data.
  - Make sure you evaluate on a held-out test set! (A small sketch follows below.)
• Online evaluation
  - A/B testing: split off a portion of incoming requests (B) to evaluate the new deployment, and use the rest as the control group (A).
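A minimal holdout sketch; the toy data and the 80/20 split are illustrative.

import random

# Hypothetical labeled historical data: (features, label) pairs.
data = [({"rating": r % 5 + 1}, (r % 5 + 1) > 3) for r in range(100)]

random.seed(0)
random.shuffle(data)
cut = int(0.8 * len(data))
train, test = data[:cut], data[cut:]

# Fit on train only; report metrics on test only.
print(len(train), "training rows,", len(test), "held-out test rows")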

ML Deployment - 2

[Diagram: historical data produces a prototype model; the deployed model serves predictions for new requests while an online adaptive model keeps updating it.]

Online Learning

• Benefits
  - Computationally faster and more efficient.
  - Deployment and training are the same! (Sketch below.)
• Key challenges
  - How do we maintain distributed state?
  - Do standard algorithms need to change in order to be more deployment friendly?
  - How much should the model "forget"?
  - Tricky to evaluate.
• Simple ideas that work
  - Split the model space so the state of each model can live on a single machine.
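For intuition, here is a minimal serve-and-learn sketch: a one-feature logistic model updated one observation at a time with SGD. The learning rate and data are illustrative.

import math

w, b = 0.0, 0.0          # model state, updated in place as requests arrive
LEARNING_RATE = 0.1

def predict(x):
    return 1 / (1 + math.exp(-(w * x + b)))  # probability of a click

def update(x, y):
    """One SGD step on a single (feature, label) observation."""
    global w, b
    error = predict(x) - y
    w -= LEARNING_RATE * error * x
    b -= LEARNING_RATE * error

# Each live request both serves a prediction and trains the model.
for x, y in [(1.0, 1), (2.0, 1), (-1.0, 0), (-2.0, 0)]:
    p = predict(x)   # serve
    update(x, y)     # learn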

A/B testing

[Figure: click-through rates for groups A and B drawn as two Gaussians ("I'm a happy Gaussian", "I'm another happy Gaussian"), each with its own variance.]

Running an A/B test

As easy as alpha, beta, gamma, delta.

• Procedure (a worked sketch follows below)
  - Pick a significance level α.
  - Compute the test statistic.
  - Compute the p-value (the probability of the test statistic under the null hypothesis).
  - Reject the null hypothesis if the p-value is less than α.
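One concrete instance of this procedure for comparing two CTRs is a two-proportion z-test (my choice here, not named on the slide); the counts reuse the illustrative numbers from the earlier diagram.

import math

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """z statistic and two-sided p-value for CTR_A vs CTR_B."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p = (clicks_a + clicks_b) / (n_a + n_b)           # pooled rate under H0
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))        # two-sided normal tail
    return z, p_value

z, p_value = two_proportion_z(200, 2000, 600, 2000)
alpha = 0.05
print(f"z = {z:.2f}, p = {p_value:.3g}, reject H0: {p_value < alpha}")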

How long to run the test?

• Run the test until you see a significant difference?
  - Wrong! Don't do this.
• Statistical tests directly control the false positive rate (significance).
  - With probability 1 - α, Population 1 is different from Population 0.
• The statistical power of a test controls the false negative rate.
  - How many observations do I need to discern a difference of δ between the means, with power 0.8 and significance 0.05?
• Determine how many observations you need before you start the test.
  - Pick the power β, significance α, and magnitude of difference δ.
  - Calculate n, the number of observations needed (a sketch follows below).
  - Don't stop the test until you've made this many observations.
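A minimal sketch of that sample-size calculation for comparing two means, using the standard normal approximation; the σ and δ values are illustrative.

# n per group for a two-sided two-sample test on means:
#     n = 2 * (z_{1-alpha/2} + z_{power})**2 * sigma**2 / delta**2
z_alpha = 1.96   # z quantile for alpha = 0.05, two-sided
z_power = 0.84   # z quantile for power = 0.80
sigma = 1.0      # assumed standard deviation of the metric
delta = 0.1      # smallest difference in means worth detecting

n = 2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2
print(f"Need about {n:.0f} observations per group.")   # ~1570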

Separation of experiences

How well did you split off group B?

[Diagram: group A sees the old homepage and group B the new homepage, but both groups then land on the same second page with the same buttons. Unclean separation of experiences!]

Shock of newness

• People hate change.
  - Why is my button now blue??
• Wait until the "shock of newness" wears off, then measure.
• Some populations of users are forever wedded to old ways.
  - Consider obtaining a fresh population.

[Figure: click-through rate over time; it dips at t0 when the change ships, then recovers as the shock of newness wears off.]
