Post on 16-Apr-2017
Machine Learning
Real-Life Data & ML in Production
@benfreu Ben Freundorfer
Costs
What’s a model
Many algorithms are a bunch of matrix calculations.
• Costly to train models
• Cheap to apply models (predict)
Human work
Real-Life Data
Transformation
Transform relational data into vectors
All algos need: matrices of numbers
Some need values in a fixed range (0.0 ≤ x ≤ 1.0) or with mean = 0 and σ = 1
Look out for algos requiring “normalized” or “standardized” values → feature scaling
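A minimal sketch of the two scalings named on this slide, using only the Python standard library. The function names and the sample values are illustrative assumptions, not part of the talk; in practice a library such as scikit-learn would usually do this.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so they fall in the range 0.0 to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale values so they have mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical feature column
days = [2, 4, 4, 4, 5, 5, 7, 9]
normalized = min_max_scale(days)
standardized = standardize(days)
```

The scaling parameters (min/max or mean/σ) would be computed on the training data and reused unchanged at prediction time.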
Categories
• Features with no numerical relation
• Category 5 doesn’t have 5x the effect on y that category 1 has
• Fix: Dummy variables
• cat_1, cat_2, … cat_5 with values 0 or 1
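As a rough illustration of the dummy-variable fix, the sketch below turns one category ID into the cat_1 … cat_5 columns from the slide. The helper name `dummy_variables` is an assumption made for this example.

```python
def dummy_variables(category, categories):
    """Return one 0/1 column per known category for a single example."""
    return {f"cat_{c}": int(category == c) for c in categories}

categories = [1, 2, 3, 4, 5]

# An example that belongs to category 3
row = dummy_variables(3, categories)
```

Each example ends up with exactly one column set to 1, so the model no longer sees the category ID as a quantity.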
Missing Values
• days_since_last_purchase = null
How to deal with this? 0 or 999?
• Often intuitively clear from the data domain
• One solution: max(days_since_last_purchase of other users)
• HAS to be addressed
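The "max of other users" idea could look roughly like this. The function name and the sample column are assumptions for illustration; whether the maximum is the right fill value depends on the data domain, as noted above.

```python
def fill_missing_with_max(values):
    """Replace None entries with the largest observed value."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot fill missing values: no observed values")
    fallback = max(observed)
    return [fallback if v is None else v for v in values]

# Hypothetical column; None stands for null (user never purchased)
days_since_last_purchase = [3, None, 41, 12, None]
filled = fill_missing_with_max(days_since_last_purchase)
```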
Outliers
• days_since_last_purchase = 2837 for a legacy customer
• If it’s irrelevant, get rid of the whole example (legacy customer)
• Or cap at a max/min value
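Both options from this slide, sketched under the assumption that 365 days is a sensible upper bound. That limit is made up for the example and would have to come from the data domain.

```python
def cap(values, lower, upper):
    """Clamp every value into the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

def drop_outliers(values, lower, upper):
    """Remove examples whose value lies outside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]

# Hypothetical column containing one legacy customer
days_since_last_purchase = [3, 41, 2837, 12]

capped = cap(days_since_last_purchase, 0, 365)
dropped = drop_outliers(days_since_last_purchase, 0, 365)
```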
Reduce Features
• Check for correlation between features. Of each highly correlated group, keep only one
• Get rid of intuitively useless features
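One possible way to check for correlated features is the Pearson correlation coefficient. The sketch below keeps a feature only if it is not strongly correlated with one already kept. The 0.95 threshold, the function names, and the sample features are all assumptions for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: correlation is undefined, treat as 0
    return cov / sqrt(vx * vy)

def drop_correlated(features, threshold=0.95):
    """Keep a feature only if |r| with every kept feature is below threshold."""
    kept = {}
    for name, column in features.items():
        if all(abs(pearson(column, other)) < threshold for other in kept.values()):
            kept[name] = column
    return kept

# Hypothetical features: the two price columns carry the same information
features = {
    "price_eur": [10.0, 20.0, 30.0, 40.0],
    "price_cents": [1000.0, 2000.0, 3000.0, 4000.0],
    "visits": [5.0, 1.0, 4.0, 2.0],
}
reduced = drop_correlated(features)
```

Which feature of a correlated pair survives depends on iteration order here; a real pipeline might prefer the one that is cheaper to collect.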
A Better Model
• Fewer features, i.e. simpler
• Trained on more training examples
Moving to Production
Online vs Offline
OFFLINE: From time to time, retrain the whole model and upload it
ONLINE: The algorithm runs each time a new example is added and adapts the model a bit
Examples should be randomized
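To make the online variant concrete, here is a sketch of a single stochastic-gradient step for logistic regression, applied to one example at a time. The choice of algorithm, the learning rate, and the toy data are assumptions; the talk does not prescribe a specific online method.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_update(weights, x, y, learning_rate=0.1):
    """Adapt the weights a bit using one new example (x, y), with y in {0, 1}."""
    prediction = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
    error = prediction - y
    return [w - learning_rate * error * xi for w, xi in zip(weights, x)]

# Hypothetical stream of examples; the first feature is a constant 1.0 (bias)
examples = [
    ([1.0, 0.0], 0), ([1.0, 1.0], 1),
    ([1.0, 0.1], 0), ([1.0, 0.9], 1),
] * 50

# Randomize the order so the model does not see long runs of one class
random.seed(0)
random.shuffle(examples)

weights = [0.0, 0.0]
for x, y in examples:
    weights = online_update(weights, x, y)
```

An offline setup would instead rerun a full training job over all stored examples and replace the stored weights in one step.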
Example
Predict which category a user will buy from after newsletter signup
Build Model
• Collect data: traffic source, categories looked at prior to signup, etc., and y = category of purchase after signup
• Analyze: try to make predictions using e.g. logistic regression
• Train final model
• Save weights to DB or JSON or file
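Saving the trained weights can be as simple as writing a JSON file. The sketch below assumes one weight dictionary per category; the category names, feature names, weight values, and file name are invented for illustration.

```python
import json
import os
import tempfile

# Hypothetical weights of a trained model: one set per product category
weights = {
    "shoes": {"bias": -0.4, "source_ad": 1.2, "viewed_shoes": 2.0},
    "books": {"bias": 0.1, "source_ad": -0.3, "viewed_books": 1.7},
}

def save_weights(weights, path):
    """Write the weights to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(weights, f, indent=2)

def load_weights(path):
    """Read the weights back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

path = os.path.join(tempfile.gettempdir(), "newsletter_model_weights.json")
save_weights(weights, path)
restored = load_weights(path)
```

Because the stored artifact is just numbers, the system that predicts does not need the training code or its libraries.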
Predict
• User signs up
• Load weights and predict probabilities of categories.
• If P(category X) > threshold, classify user as “interested in category X”
• Send out newsletters
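The prediction steps above can be sketched as follows, assuming a multinomial logistic regression (softmax over one linear score per category). The weights, feature names, and the 0.6 threshold are hypothetical values chosen for the example.

```python
import math

def softmax(scores):
    """Turn per-category scores into probabilities that sum to 1."""
    highest = max(scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - highest) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def predict_probabilities(weights, features):
    """Compute P(category) for one user from stored weights."""
    scores = {
        category: w["bias"]
        + sum(w.get(name, 0.0) * value for name, value in features.items())
        for category, w in weights.items()
    }
    return softmax(scores)

def interests(probabilities, threshold):
    """Categories whose probability exceeds the threshold."""
    return [c for c, p in probabilities.items() if p > threshold]

# Hypothetical weights, as they might be loaded from a DB or JSON file
weights = {
    "shoes": {"bias": 0.0, "viewed_shoes": 2.0},
    "books": {"bias": 0.0, "viewed_books": 2.0},
}

# A new signup who looked at shoes but not at books
new_user = {"viewed_shoes": 1.0, "viewed_books": 0.0}

probabilities = predict_probabilities(weights, new_user)
interested_in = interests(probabilities, threshold=0.6)
```

Only a few multiplications and additions happen per signup, which is why applying a model is cheap compared to training it.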
Tips
• Use R or Python/Jupyter/Pandas to analyze data
• Test if you need a separate system for predictions or just for training
• Try not to implement algos yourself
• If you do, use numerical computation libraries (probably wrappers for C or Fortran code)
• Be sure the past predicts the future
Ethics
• Your model might turn into a racially profiling sexist.
• Be aware of what your input features mean & what you actually base your predictions on
• Relatively harmless when predicting product categories - questionable for credit ratings
Thank you
Ben Freundorfer
@benfreu