Machine Learning in Production

Post on 16-Apr-2017

Machine Learning

Real-Life Data & ML in Production

@benfreu Ben Freundorfer

Costs

What’s a model

Many algorithms are a bunch of matrix calculations.

• Costly to train models

• Cheap to apply models (predict)
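To illustrate why applying a model is cheap, here is a minimal sketch of a logistic-regression prediction in plain Python. The weights and bias are hypothetical values standing in for the result of an earlier, costly training run; the function name `predict` is an assumption for illustration.

```python
import math

def predict(weights, bias, features):
    """Apply an already-trained logistic-regression model to one example."""
    # One dot product plus a sigmoid: cheap compared to training.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights, as if produced by an earlier training run.
weights = [0.8, -1.2]
bias = 0.1

p = predict(weights, bias, [1.0, 0.5])  # probability between 0 and 1
```

Training has to find these weights by iterating over many examples many times; prediction only has to multiply and add.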

Human work

Real-Life Data

Transformation

Transform relational data into vectors

All algos need: matrices of numbers

Some need 0.0 ≤ x ≤ 1.0 or mean = 0, σ = 1

Look out for algos requiring "normalized" or "standardized" values → feature scaling
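The two kinds of feature scaling mentioned above can be sketched as follows. This is a dependency-free illustration using Python's `statistics` module; the function names and the sample values are assumptions, and a numerical library would normally do this on whole matrices.

```python
import statistics

def min_max_scale(values):
    """Rescale values so that 0.0 <= x <= 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

days = [2.0, 4.0, 6.0, 8.0]  # made-up feature column
scaled = min_max_scale(days)
standardized = standardize(days)
```

The minimum, maximum, mean and σ should come from the training data and be reused unchanged when scaling new examples at prediction time.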

Categories

• Features with no numerical relation

• Category 5 doesn’t have 5x the y of category 1

• Fix: Dummy variables

• cat_1, cat_2, … cat_5 with values 0 or 1
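A minimal sketch of the dummy-variable fix, assuming five categories numbered 1 to 5 as in the bullets above. The helper name `one_hot` is hypothetical.

```python
def one_hot(category, categories):
    """Turn one categorical value into 0/1 dummy variables."""
    return {f"cat_{c}": 1 if c == category else 0 for c in categories}

categories = [1, 2, 3, 4, 5]

# A user whose category is 3 gets a 1 in cat_3 and 0 everywhere else.
row = one_hot(3, categories)
```

Each category gets its own column, so the model no longer treats category 5 as "five times" category 1.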

Missing Values

• days_since_last_purchase = null

How to deal with this? 0 or 999?

• Often intuitively clear from the data domain

One solution: max(days_since_last_purchase of other users)

• HAS to be addressed
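The "max of other users" solution can be sketched like this, with `None` standing in for null. The function name and the sample values are assumptions for illustration.

```python
def fill_missing_with_max(values):
    """Replace missing entries with the largest known value."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot impute: every value is missing")
    fill = max(known)
    return [fill if v is None else v for v in values]

# Made-up column; None means the user has never purchased.
days_since_last_purchase = [3, None, 41, 12, None]
filled = fill_missing_with_max(days_since_last_purchase)
```

Here a user without any purchase is treated like the user whose last purchase was longest ago, which is the domain intuition behind this choice.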

Outliers

• days_since_last_purchase = 2837 for a legacy customer

• If it’s irrelevant, get rid of the whole example (legacy customer)

• Or cap at a max/min value
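Both ways of handling outliers can be sketched in a few lines. The cap of 365 days is an assumed limit chosen only for this example, and the helper names are hypothetical.

```python
def cap(values, lower, upper):
    """Clamp every value into the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

def drop_outliers(values, lower, upper):
    """Remove examples whose value falls outside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]

days = [3, 41, 2837, 12]  # 2837 is the legacy customer

capped = cap(days, 0, 365)            # keep the example, limit its value
dropped = drop_outliers(days, 0, 365)  # discard the example entirely
```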

Reduce Features

• Check for correlation between features and drop the redundant one of each highly correlated pair

• Get rid of intuitively useless features
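One way to check for correlated features is the Pearson correlation coefficient. The sketch below is a plain-Python illustration; the feature names, the data and the 0.95 threshold are assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (sx * sy)

def correlated_pairs(features, threshold=0.95):
    """List feature pairs whose absolute correlation reaches the threshold."""
    names = list(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(features[a], features[b])) >= threshold:
                pairs.append((a, b))
    return pairs

# Made-up feature columns; page_views is exactly 2x visits.
features = {
    "visits": [1, 2, 3, 4, 5],
    "page_views": [2, 4, 6, 8, 10],
    "days_since_signup": [9, 1, 7, 3, 5],
}
pairs = correlated_pairs(features)
```

For each reported pair, one of the two features can usually be dropped without losing much information.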

A Better Model

• Fewer features, i.e. simpler

• Trained on more training examples

Moving to Production

Online vs Offline

OFFLINE: From time to time, retrain the whole model and upload it

ONLINE: The algorithm runs each time a new example is added and adapts the model a bit

Examples should be randomized
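A minimal sketch of the online idea, assuming logistic regression updated by one stochastic-gradient step per example. The learning rate, the toy data and the function names are assumptions; the first feature is a constant 1.0 that acts as the bias.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_update(weights, features, label, learning_rate=0.1):
    """Adapt the model a bit using a single new example."""
    prediction = sigmoid(sum(w * x for w, x in zip(weights, features)))
    error = prediction - label
    return [w - learning_rate * error * x for w, x in zip(weights, features)]

# Toy examples: label is 1 when the second feature is positive.
examples = [([1.0, 2.0], 1), ([1.0, -1.5], 0), ([1.0, 1.0], 1), ([1.0, -2.0], 0)]

rng = random.Random(0)  # fixed seed so the run is repeatable
weights = [0.0, 0.0]
for _ in range(50):
    rng.shuffle(examples)  # randomize the order of examples
    for features, label in examples:
        weights = online_update(weights, features, label)
```

An offline setup would instead rerun the full training over all collected examples from time to time and replace the stored weights.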

Example

Predict which category a user will buy from after newsletter signup

Build Model

• Collect data

Traffic source, categories looked at prior to signup, etc. and y = category of purchase after signup

• Analyze

Try to make predictions using e.g. logistic regression

• Train final model

• Save weights to DB or JSON or file
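Saving the weights to a JSON file could look like the sketch below. The weight names, their values and the file name are hypothetical; a database row would work the same way.

```python
import json
import os
import tempfile

def save_weights(weights, path):
    """Write trained weights to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(weights, f)

def load_weights(path):
    """Read the weights back, e.g. in the system that makes predictions."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Hypothetical weights from a finished training run.
weights = {"bias": 0.1, "traffic_source_search": 0.8, "viewed_category_shoes": 1.4}

path = os.path.join(tempfile.gettempdir(), "newsletter_model_weights.json")
save_weights(weights, path)
restored = load_weights(path)
```

Because the model is only a set of numbers, the system that predicts does not need the training code, just the stored weights.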

Predict

• User signs up

• Load weights and predict probabilities of categories.

• If P(category X) > threshold, classify user as "interested in category X"

• Send out newsletters
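The prediction steps above can be sketched as follows, assuming a multinomial logistic regression with one weight vector per category and a softmax to turn scores into probabilities. The categories, weights, feature names and the 0.5 threshold are all made up for illustration.

```python
import math

def softmax(scores):
    """Turn per-category scores into probabilities that sum to 1."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def predict_interests(weights, features, threshold=0.5):
    """Return category probabilities and the categories above the threshold."""
    scores = {
        category: sum(w.get(name, 0.0) * value for name, value in features.items())
        for category, w in weights.items()
    }
    probabilities = softmax(scores)
    interested = [c for c, p in probabilities.items() if p > threshold]
    return probabilities, interested

# Hypothetical weights, as they might be loaded from a DB or JSON file.
weights = {
    "shoes": {"bias": 0.0, "source_search": 0.5, "viewed_shoes": 2.0},
    "books": {"bias": 0.0, "source_search": 0.2, "viewed_books": 2.0},
    "toys": {"bias": -0.5},
}

# A new signup who came from search and looked at shoes.
new_user = {"bias": 1.0, "source_search": 1.0, "viewed_shoes": 1.0}
probabilities, interested = predict_interests(weights, new_user)
```

The `interested` list is what would decide which newsletters the user receives.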

Tips

• Use R or Python/Jupyter/Pandas to analyze data

• Test if you need a separate system for predictions or just for training

• Try not to implement algos yourself. If you do, use numerical computation libraries (probably wrappers for C or Fortran code)

• Be sure the past predicts the future

Ethics

• Your model might turn into a racially profiling sexist.

• Be aware of what your input features mean & what you actually base your predictions on

• Relatively harmless when predicting product categories - questionable for credit ratings

Thank you

Ben Freundorfer

@benfreu