Transcript of "Mastering Predictive Modeling" - Matt Berseth (Meetup presentation)
Mastering Predictive Modeling
Matt Berseth
About Me
● Matt Berseth - [email protected]
● Co-founder of NLP Logix - Jacksonville-based data science startup.
● Not a statistician. 10 years as an application and web developer.
● Fargo, ND; North Dakota State University
● NLP Logix Team
Outline
● Introduction to Predictive Modeling
○ Simple Example
○ Terms (can be confusing)
○ Gradient Descent
○ Examples of Successes and Failures
● Tips / Common Pitfalls
● Kaggle Competitions and other Projects
● State of the Art
● Resources
What is Predictive Modeling?
● There are two goals in analyzing the data [1]:
○ Prediction: Given a fresh set of unseen input variables, predict the response variable
○ Information: Given the historical data, gain a better understanding of how the model associates the response variable with the input variables
[1]: Leo Breiman - Statistical Modeling: The Two Cultures
Algorithmic Modeling Culture (Machine Learning)
● Of the two goals, the algorithmic modeling culture treats what is inside the box as unknown and judges a model by its predictive accuracy on fresh, unseen data
[1]: Leo Breiman - Statistical Modeling: The Two Cultures
What is Predictive Modeling?
● Everything starts with data! (i.e. learning from data)
● Imagine the data is generated by a black box: a set of input variables, X, goes in one side and the response variable, y, comes out the other side, i.e. y <-- X
● Predictive Modeling is the process of estimating the function that lives inside of the box [1].
[1]: Leo Breiman - Statistical Modeling: The Two Cultures
What is Predictive Modeling?
● Credit Approval Model
● Approve Credit?
age: 23 years
gender: male
annual salary: $30,000
years in residence: 1 year
years in job: 1 year
current debt: $15,000
... ...
What is Predictive Modeling?
● Formalization
○ Input: X (customer loan application)
○ Output: y (approved / not approved)
○ Target function: y = f(X) (i.e. Nature's Box)
○ Data: [(x1, y1), (x2, y2), (x3, y3) ... (xn, yn)]
○ Hypothesis: y = g(X) (i.e. learned model)
Yaser S. Abu-Mostafa: https://work.caltech.edu/lectures.html
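The formalization can be made concrete with a toy version of the credit example. Everything below is invented for illustration: the target function f plays the role of Nature's Box (known here only so we can generate data), and the hypothesis g is a crude one-parameter rule fit to the observed (x, y) pairs.

```python
# Toy illustration of the formalization (all values are made up).
import random

def f(salary, debt):
    """The target function -- Nature's Box. Unknown in practice; defined
    here only so we can generate example data."""
    return 1 if salary > 2 * debt else 0

# Data: [(x1, y1), (x2, y2), ... (xn, yn)]
random.seed(0)
data = []
for _ in range(200):
    x = (random.uniform(20_000, 120_000), random.uniform(0, 60_000))
    data.append((x, f(*x)))

# Hypothesis g: approve when salary > t * debt, for a learned threshold t.
# "Learning" here is just picking the t that best matches the labels.
def accuracy(t):
    return sum((1 if s > t * d else 0) == y for (s, d), y in data) / len(data)

best = max([t / 10 for t in range(1, 50)], key=accuracy)
print(best, accuracy(best))
```

With noise-free labels the learned threshold lands on the hidden value of 2, so g recovers f on this data; real applications only ever approximate f.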
What is Predictive Modeling?
Yaser S. Abu-Mostafa: https://work.caltech.edu/lectures.html
Housing Data Demo
● Fictional data set that includes 4 different measurements for 1000 different houses:
○ Square Feet
○ Number of Bedrooms
○ Number of Bathrooms
○ Number of Garage Stalls
● Define a formula that assigns weights to each of these values to make up the housing price - 'Nature's Box' - unknown by the modeler or anyone else
● See how we can recover the function
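The demo can be sketched with NumPy. The hidden weights below are an arbitrary stand-in for 'Nature's Box'; because this fictional data is noise-free, ordinary least squares recovers them exactly.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
X = np.column_stack([
    rng.uniform(800, 4000, n),   # square feet
    rng.integers(1, 6, n),       # number of bedrooms
    rng.integers(1, 4, n),       # number of bathrooms
    rng.integers(0, 4, n),       # number of garage stalls
])

# 'Nature's Box': hidden weights, unknown to the modeler in the real demo
true_w = np.array([120.0, 10_000.0, 15_000.0, 8_000.0])
y = X @ true_w                   # the resulting housing prices

# Recover the function with ordinary least squares
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w_hat, 2))        # matches the hidden weights
```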
Not that simple
● Higher dimensions
○ Think about the number of tables / columns in your data warehouse.
○ Types of data: Web, Graph, Text, Time Series, Image and Audio
● Noisy
○ Errors in data collection. Could be introduced by software or the end users. Business processes also change.
○ Missing Values, Outliers
● Not labeled - makes supervised learning difficult or impossible
● Data is spread across multiple systems
● You don't have access to all of the variables, or even know what all of them are
Not that simple
● What problem are you going to solve?
○ You have order / customer / part data - now what?
● How do you evaluate the results?
○ Example: Fraud detection - a classifier that always predicts 'False' can achieve 95% accuracy when only 5% of cases are fraud
○ What do you want to optimize?
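The fraud-detection pitfall in code, with an assumed 5% fraud rate:

```python
# 1000 cases, 5% of them fraud (assumed rate for illustration)
labels = [True] * 50 + [False] * 950

# A useless classifier that always predicts 'not fraud'
predictions = [False] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)   # → 0.95

# ... yet it catches zero fraud: recall on the positive class is 0
recall = sum(p and y for p, y in zip(predictions, labels)) / sum(labels)
print(recall)     # → 0.0
```

This is why the evaluation metric has to be chosen up front: precision, recall, or expected cost would all expose this classifier immediately.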
Some Terms
● labels, ground truth, dependent variable
● weights, parameters, theta, coefficients
● variables, attributes, independent variables, features, columns
● regression versus classification
○ Logistic Regression - a special case
● supervised versus unsupervised
Some More Terms
● fitting, training
● train / validation / test data sets
● hyper-parameters, model parameters
● batch, mini-batch, epoch, iteration
● ensemble, blending, model averaging
● machine learning, statistical modeling, artificial intelligence
Gradient Descent
● Very important concept in Machine Learning
● Each model has an associated learning algorithm - many are iterative
○ Many learning procedures rely on gradient descent
○ Also, some tricks
■ Mini-batch
■ Momentum
■ Decaying Learning Rates
Gradient Descent
● Basic Algorithm
○ Step 1: Compute the derivative of the optimization objective w.r.t. the model parameters
○ Step 2: Take a step in the direction that points downhill
● Close your eyes and imagine you are on a hill ...
● Drop a ball on the error surface ...
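The two steps can be sketched in a few lines of plain Python: gradient descent on a one-parameter least-squares fit, y = w·x. The data and learning rate are made up for illustration.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated with w = 2, so descent should find w ≈ 2

w = 0.0                      # model parameter, starting anywhere on the hill
lr = 0.01                    # learning rate (step size)
for _ in range(500):
    # Step 1: derivative of the mean squared error w.r.t. w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    # Step 2: take a step in the downhill direction
    w -= lr * grad

print(round(w, 4))           # → 2.0
```

Mini-batches change only how many (x, y) pairs go into the gradient; momentum and decaying learning rates change how the step in Step 2 is taken.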
Gradient Descent
Andrew Ng: Machine Learning Coursera Course
Predictive Modeling is Big Business
● Target: Increased revenue 15 to 30 percent with predictive models
● Cox Communications: Tripled direct mail response rates by predicting propensity to buy
● Amazon.com: 35 percent of sales come from product recommendations
Eric Siegel: Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die
Predictive Modeling is Big Business
● Netflix: Used predictive models to improve recommendations by 10%
● Google: Everything! What search results are most relevant. What ads should be served. What email is spam. Self-driving cars. 96% of 2011 revenue came from advertising. Where would Google be without PM?
● Allstate: Tripled accuracy of predicting bodily injury based solely on the vehicle. Worth an estimated $40 million annually.
● Siri: 2 steps: understand what was said, then retrieve the relevant information
Eric Siegel: Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die
Predictive Modeling is Big Business
Nokia / Sprint, Microsoft / Google, Facebook / LinkedIn, Match.com / OkCupid, Target / Walmart, Fingerhut, FedEx, Chase / PREMIER ...
IBM / Hewlett-Packard, Sun Microsystems, U.S. Bank, MTV, DTE Energy, Reed Elsevier, Pfizer / FICO, Mastercard ...
Predictive Modeling is Big Business
● Recent Headlines
○ NSA - PRISM
■ Microsoft, YouTube, Google, Yahoo, Facebook, Skype, Apple and many others allegedly contributing
○ IBM
■ Watson - being deployed into healthcare
Lots of success stories - must be easy ...
● Not really. Lots of pitfalls.
● What about some notable failures?
Failures - #1 FICO Score
● The score supposedly predicts the default probability of a borrower based on vast historical credit, default and personal data
● During the 2008 mortgage default wave, the FICO score dramatically underestimated the default rate of almost every credit score category, from the prime group (the highest credit score group) to the subprime group (the lowest credit score group).
http://en.wikipedia.org/wiki/Predictive_modelling#Notable_failures_of_predictive_modeling
Failures - #2 LTCM
● Long Term Capital Management. Ran a fund based on the Black-Scholes Option Pricing Model (Nobel Prize in Economics)
● The models produced impressive profits until a spectacular debacle that forced the then Federal Reserve chairman Alan Greenspan to step in to broker a rescue plan by the Wall Street broker-dealers in order to prevent a meltdown of the bond market.
● Lost 2 billion dollars in 3 weeks
When Genius Failed: The Rise and Fall of Long-Term Capital Management
http://en.wikipedia.org/wiki/Predictive_modelling#Notable_failures_of_predictive_modeling
Fundamental Limitations of Predictive Modeling
● History cannot always predict the future. Systems are complex and change over time. Predictive models have difficulty adapting to the change. Regime change.
● Unknown unknowns. The variables being collected may not be the ones that are most critical to predicting an outcome.
http://en.wikipedia.org/wiki/Predictive_modelling#Notable_failures_of_predictive_modeling
Predictive Modeling Projects - Tips and Pitfalls
Generalization vs Memorization
● AKA - Overfitting
● Very expressive and powerful models will start to treat the noise in the training data as signal and account for it in their estimations
● Extreme case - memorize each example
Generalization vs Memorization
● Extreme example: 5 data points, fit to a 4th order polynomial
● No errors on training
● But no chance on xval
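The extreme example in NumPy: five hand-picked points scattered around the line y = x, fit exactly by a 4th-order polynomial. Training error is zero, but a held-out point from the same line is missed badly.

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([0.2, 0.8, 2.3, 2.9, 4.1])   # roughly y = x, plus noise

coeffs = np.polyfit(xs, ys, deg=4)         # 5 points, 5 coefficients: exact fit
train_error = np.max(np.abs(np.polyval(coeffs, xs) - ys))
print(train_error)                          # ~0: the model memorized the data

# A held-out ("xval") point at x = 5, where the underlying line gives y = 5
xval_error = abs(np.polyval(coeffs, 5.0) - 5.0)
print(round(xval_error, 2))                 # → 5.7: wildly off the line
```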
Generalization vs Memorization
● Techniques for combating overfitting
○ Regularization (i.e. penalize the model for being too complex)
■ L1 and L2 norm penalties
■ Weight Sharing
○ Early stopping
○ N-Fold Cross Validation
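An L2 norm penalty (ridge regression) is easy to see in closed form: w = (XᵀX + λI)⁻¹Xᵀy. The data below is synthetic; the point is only that a larger λ shrinks the weights, trading a little training fit for a less complex model.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + rng.normal(0, 0.1, size=50)

def ridge(X, y, lam):
    """Least squares with an L2 penalty of strength lam."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_plain = ridge(X, y, lam=0.0)    # ordinary least squares
w_reg = ridge(X, y, lam=100.0)    # heavy penalty shrinks the weights
print(np.linalg.norm(w_plain), np.linalg.norm(w_reg))
```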
Leverage External Data
● US Census and ACS data have a wealth of information
○ Show the Spreadsheet
● Social Data: LinkedIn, Facebook, Twitter
● Data Augmentation Services
● Google Public Data Directory
Project Expectations
● Results are not guaranteed
● Just because you want to predict something doesn't mean you can!
● Not just about having enough data, but also requires having the right data
Information Leakage
● Refers to a piece of information your model is using to make its predictions, but that would not actually be available at prediction time
● Can be very subtle to detect!
● Need to be very careful to make sure only the information that is known at the time the model is executed is made available during training
Leakage - Example #1
● IBM modeling project: Put together a dataset that consisted of IBM customers. The dataset included text/html data from each candidate customer's web site
● The goal was to predict which ones had purchased the 'websphere' product
● Best model determined the probability of purchase was much higher if the site included the text 'websphere'
● Leak - they pulled the most recent version of the website!
Leakage - Example #2
● US Census Kaggle Competition
● Task - predict the response rate of each 2010 census tract given the tract's demographic information
● Leak - the demographic information that was provided was not known until the census was already completed (and the rates would have already been known)
Leakage - Example #3
● Build a predictive model that estimates the total points scored in a sporting event
● Inputs: starting lineups, weather conditions at game time, teams' historical performance
● Where is the leakage? Be sure to use the forecasted weather and the forecasted lineup
Leakage
● Powerful modeling tools will quickly detect a leak and exploit it - oftentimes it carries the most signal
● Need to be very cautious and make sure you only provide the model with the information that would have been known at that moment in time
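One practical defense is to build every training feature "as of" the prediction cutoff, so nothing dated after it can sneak in. The event records below are hypothetical:

```python
from datetime import date

events = [
    {"customer": "a", "day": date(2013, 1, 5), "spend": 100},
    {"customer": "a", "day": date(2013, 2, 20), "spend": 250},
    {"customer": "a", "day": date(2013, 3, 2), "spend": 400},  # after the cutoff
]

def total_spend_asof(events, customer, cutoff):
    """Feature computed as-of the cutoff: later events are excluded,
    because they would not have been known at prediction time."""
    return sum(e["spend"] for e in events
               if e["customer"] == customer and e["day"] < cutoff)

print(total_spend_asof(events, "a", date(2013, 3, 1)))  # → 350, not 750
```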
Technology, Tools and Considerations
● What is the right technology to use for a PM project?
○ SPSS, SAS, SSAS, Matlab, Python, R, Weka, Julia
○ Each has pros and cons
● How do you integrate a model into a production system?
○ Java, .Net, Ruby etc ...
○ SQL / Oracle?
NLP Logix Technology & Tools
● NLP Logix Modeling Technology Stack
○ Python for predictive modeling
■ PyCharm, Sublime Text, scikits-learn, pandas
■ matplotlib, theano
○ R for statistical analysis, visualization and modeling
■ RStudio, ggplot2
○ Linux OS for modeling, Windows OS for data processing
○ git / github for version control (atypical for data analysis?)
○ ipython notebook for sharing / reproducing research
○ SqlServer for data shaping and preprocessing
Quick aside on Python
● Powerful tool set for scientific computing
○ Links to BLAS / ATLAS (optimized C code for executing linear algebra operations)
○ Links to CUDA for incredible parallelism via the GPU
○ Active ecosystem of data science packages
■ Pandas (like an R data frame)
■ numpy / scipy (like Matlab - adds vectorized operations to Python)
■ Matplotlib (like Matlab)
■ scikits-learn, PyBrain, PyLearn2, etc ...
Quick aside on NumPy
● Powerful library for scientific computing
○ Extension to Python
○ Adds N-dimensional arrays
○ Supports matrix algebra operations
■ np.dot(X1, X2)
○ Open source and has many contributors
○ Adds MATLAB-type functionality to Python
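A quick taste of the features listed above - N-dimensional arrays, vectorized arithmetic and matrix algebra via np.dot:

```python
import numpy as np

X1 = np.array([[1.0, 2.0],
               [3.0, 4.0]])
X2 = np.array([[5.0, 6.0],
               [7.0, 8.0]])

product = np.dot(X1, X2)   # matrix multiplication
print(product)             # [[19. 22.] [43. 50.]]

doubled = X1 * 2           # vectorized: no explicit Python loop
print(doubled)             # [[2. 4.] [6. 8.]]
```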
Quick aside on GPUs
● Graphical Processing Units
○ Created for gaming, but also used for scientific computing
○ Latest cards have nearly 1,000 cores and 2 GB of RAM
○ Cores are not as fast as CPU cores, but the parallelism makes up for it
○ Handle linear algebra operations very efficiently (i.e. matrix multiply and matrix transpose)
● Cheaper than a CPU cluster - and avoids marshalling the data out to a cluster
Quick aside on GPUs
● Andrew Ng: 'Deep learning with COTS HPC systems'
○ HPC == High Performance Computing
○ 3 computers, each with 2 GPU cards
○ Fit a neural network with 1 billion parameters in 3 days
● NLP Logix RBM Implementation
○ Ported from NumPy to theano
○ Improved epoch performance by 50% on a benchmark test
○ 8-core i7 versus 120-core GPU (4 years old)
Steps for a Successful Project
● Clearly defined problem
● Agreed upon way to measure and evaluate a successful model
○ Need the ability to compare candidate models
● Deployment strategy
○ Might dictate modeling technologies and tools
Use 3 Data Sets!
● Train
○ The data the model is fit on
● Xval
○ The data the model's performance is evaluated on
○ Hyperparameter selection
● Test
○ Only evaluated once (ideally)
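A minimal sketch of the three-way split in plain Python; the 60/20/20 proportions are an assumption, not a recommendation from the talk.

```python
import random

data = list(range(1000))   # stand-in for 1000 labeled examples

random.seed(0)
random.shuffle(data)       # shuffle before splitting

train = data[:600]         # fit the model here
xval = data[600:800]       # evaluate candidates / pick hyper-parameters here
test = data[800:]          # touch once, at the very end

print(len(train), len(xval), len(test))   # → 600 200 200
```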
Predictive Modeling Projects
Kaggle
● Kaggle.com
● Hosts data science competitions: 'data science as a sport'
○ Netflix Challenge - 1 Million Dollars
○ Heritage Health Prize - 3 Million Dollars
○ GE FlightQuest - $250,000
○ KDD Cup
Kaggle
● Also a number of other competitions hosted by commercial partners
○ Facebook, Microsoft, Pfizer, Allstate, etc ...
● A few I have competed in ...
○ Predict Probability of a Patient having Type 2 Diabetes
○ Predict US Census Return Rates
○ Predict Runway and Gate Arrival Times
Flight Quest
● Two different prediction problems
○ Given the following information: flight plan, last estimated arrival time (onboard computer system), the other flights in the air, weather, waypoints, aircraft type, etc ... -- predict the number of minutes from a cutoff time that it will take the craft to land on the runway and arrive at the gate
○ Very challenging - over 1 TB of data
Flight Quest
● Data
○ Trained on all data between 11/2012 and 2/2013
○ Final evaluation data was from the 2nd half of Feb 2013
○ Final evaluation data was collected after the models had been frozen
● Great way to combat over-fitting: the final evaluation data had not even been generated when the models were finalized
Flight Quest
● Cutoff Time:
○ For each day in the 2-week test period, a random time between 9AM EST and 9PM EST was picked
○ Every flight that was in the air at the cutoff time was included in the evaluation set
○ Each of these flights required 2 predictions: time of runway arrival and time of gate arrival
● Allowed us to generate an almost infinite amount of training data!
Flight Quest
● Key points from my model (4th best scoring model)
○ Primarily used 2 different types of models - Random Forest and Gradient Boosted Trees
○ Removed outlier days (holidays) and outlier flights (redirects)
○ Over 500 derived features
○ ~200 individual models were combined (ensembled) to create the final predictions
■ Ensembling was done using another layer of boosted trees
Flight Quest
● Launching FlightQuest2 at the end of June - 6/30/2013
● http://www.gequest.com/flight2
MNIST - Handwritten Digits
● A competition of sorts - between academics
● Database of 70,000 handwritten digits
● Classification task - assign a class label to each digit. See who can get the most right
● Permutation Invariant constraint usually applied
MNIST - Handwritten Digits
● Hinton's Page
○ Generative and Discriminative
● http://www.cs.toronto.edu/~hinton/adi/index.htm
MNIST - Handwritten Digits
● Some Scores
○ SVM: 140 errors
○ Neural Network: 140 errors
○ Boosted Trees: 153 errors
○ Neural Network with Dropout: 110 errors
○ Neural Network with Dropout and Maxout: 94 errors
○ Deep Boltzmann Machine plus pretraining: 79 errors
State of the Art
State of the Art
● Lots of excitement around 'deep learning'
○ Generative pre-training
○ GPUs over clusters
■ Deep Learning with COTS HPC systems
● Train large networks (1 billion parameters) on a small cluster of GPUs instead of a large cluster of CPUs
○ No more feature generation - the features themselves are learned
● Success in areas of extremely high dimension like image, text and audio
State of the Art
● Lots of success with stochastic, tree-based models
○ Random Forest
■ Bagging: Induce each new decision tree using a sampled (with replacement) subset of the training data
○ Boosted Trees
■ Boosting: Induce new decision trees, focusing on the cases the previous trees did poorly on
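The bagging idea reduces to a few lines. Each "model" below is deliberately trivial - just the mean of a bootstrap sample - so the resampling step stands out; a real random forest fits a decision tree on each sample instead.

```python
import random

random.seed(0)
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]     # toy training targets

def fit_on_bootstrap(data):
    # Sample with replacement, then "fit" the simplest possible model
    sample = [random.choice(data) for _ in data]
    return sum(sample) / len(sample)

models = [fit_on_bootstrap(data) for _ in range(200)]
ensemble_prediction = sum(models) / len(models)
print(round(ensemble_prediction, 2))       # close to the overall mean of 3.5
```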
State of the Art
● Random Forest: Visualization
State of the Art
● Ensembling
○ Combine - either linearly, by simple average, or fed into another, more complex model type as features
○ Simple example: an ensemble of Neural Networks
■ Using the exact same set of hyper-parameters, fit 5 different neural networks, each using different random initializations. Take the product of the 5 predictions - the highest score gets the class label
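The product rule described above, in code. The five probability rows are invented - stand-ins for five differently initialized networks scoring one example over three classes:

```python
predictions = [
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
    [0.3, 0.4, 0.3],
    [0.2, 0.5, 0.3],
    [0.4, 0.3, 0.3],
]

# Multiply the five predictions class-by-class ...
scores = [1.0] * len(predictions[0])
for p in predictions:
    scores = [s * q for s, q in zip(scores, p)]

# ... and the highest score gets the class label
label = max(range(len(scores)), key=lambda c: scores[c])
print(label)   # → 1
```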
State of the Art
● Machine Learning moves fast
● I am personally most excited about Neural Networks, Deep Belief Networks and Boltzmann Machines
○ They use unlabelled data and can learn/detect features
● GPUs
○ My Boltzmann implementation improved by 50% using a 4-year-old GPU versus my latest i7
Resources
● Coursera
○ Probabilistic Graphical Models
○ Machine Learning (Andrew Ng)
○ Neural Networks (Hinton)
● Books
○ Learning from Data: Yaser S. Abu-Mostafa
● Academics
○ Stanford, University of Toronto and University of Montreal
Questions?
● Contact me: [email protected]