Transcript of "Mastering Predictive Modeling" - Matt Berseth (Meetup presentation)
Mastering Predictive Modeling
Matt Berseth
About Me
● Matt Berseth - [email protected]
● Co-founder of NLP Logix - Jacksonville-based data science startup.
● Not a statistician. 10 years as an application and web developer.
● Fargo, ND; North Dakota State University
● NLP Logix Team
Outline
● Introduction to Predictive Modeling
○ Simple Example
○ Terms (can be confusing)
○ Gradient Descent
○ Examples of Successes and Failures
● Tips / Common Pitfalls
● Kaggle Competitions and other Projects
● State of the Art
● Resources
What is Predictive Modeling?
● There are two goals in analyzing the data [1]:
○ Prediction: Given a fresh set of unseen input variables, predict the response variable
○ Information: Given the historical data, gain a better understanding of how the model associates the response variable with the input variables
[1]: Leo Breiman - Statistical Modeling: The Two Cultures
Algorithmic Modeling Culture (Machine Learning)
● Of the two goals, the algorithmic modeling culture treats what is inside the box as unknown and judges a model by its predictive accuracy on fresh, unseen data
[1]: Leo Breiman - Statistical Modeling: The Two Cultures
What is Predictive Modeling?
● Everything starts with data! (i.e. learning from data)
● Imagine the data is generated by a black box: a set of input variables, X, goes in one side and the response variable, y, comes out the other side, i.e. y <-- X
● Predictive Modeling is the process of estimating the function that lives inside of the box [1].
[1]: Leo Breiman - Statistical Modeling: The Two Cultures
What is Predictive Modeling?
● Credit Approval Model
● Approve Credit?
age: 23 years
gender: male
annual salary: $30,000
years in residence: 1 year
years in job: 1 year
current debt: $15,000
... ...
What is Predictive Modeling?
● Formalization
○ Input: X (customer loan application)
○ Output: y (approved / not approved)
○ Target function: y = f(X) (i.e. Nature's Box)
○ Data: [(x1, y1), (x2, y2), (x3, y3) ... (xn, yn)]
○ Hypothesis: y = g(X) (i.e. learned model)
Yaser S. Abu-Mostafa: https://work.caltech.edu/lectures.html
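The formalization can be made concrete with a toy version of the credit example. Everything below is invented for illustration: the target function f plays the role of Nature's Box (known here only so we can generate data), and the hypothesis g is a crude one-parameter rule fit to the observed (x, y) pairs.

```python
# Toy illustration of the formalization (all values are made up).
import random

def f(salary, debt):
    """The target function -- Nature's Box. Unknown in practice; defined
    here only so we can generate example data."""
    return 1 if salary > 2 * debt else 0

# Data: [(x1, y1), (x2, y2), ... (xn, yn)]
random.seed(0)
data = []
for _ in range(200):
    x = (random.uniform(20_000, 120_000), random.uniform(0, 60_000))
    data.append((x, f(*x)))

# Hypothesis g: approve when salary > t * debt, for a learned threshold t.
# "Learning" here is just picking the t that best matches the labels.
def accuracy(t):
    return sum((1 if s > t * d else 0) == y for (s, d), y in data) / len(data)

best = max([t / 10 for t in range(1, 50)], key=accuracy)
print(best, accuracy(best))
```

With noise-free labels the learned threshold lands on the hidden value of 2, so g recovers f on this data; real applications only ever approximate f.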
What is Predictive Modeling?
Yaser S. Abu-Mostafa: https://work.caltech.edu/lectures.html
Housing Data Demo
● Fictional data set that includes 4 different measurements for 1000 different houses:
○ Square Feet
○ Number of Bedrooms
○ Number of Bathrooms
○ Number of Garage Stalls
● Define a formula that assigns weights to each of these values to make up the housing price - 'Nature's Box' - unknown by the modeler or anyone else
● See how we can recover the function
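The demo can be sketched with NumPy. The hidden weights below are an arbitrary stand-in for 'Nature's Box'; because this fictional data is noise-free, ordinary least squares recovers them exactly.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
X = np.column_stack([
    rng.uniform(800, 4000, n),   # square feet
    rng.integers(1, 6, n),       # number of bedrooms
    rng.integers(1, 4, n),       # number of bathrooms
    rng.integers(0, 4, n),       # number of garage stalls
])

# 'Nature's Box': hidden weights, unknown to the modeler in the real demo
true_w = np.array([120.0, 10_000.0, 15_000.0, 8_000.0])
y = X @ true_w                   # the resulting housing prices

# Recover the function with ordinary least squares
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w_hat, 2))        # matches the hidden weights
```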
Not that simple
● Higher dimensions
○ Think about the number of tables / columns in your data warehouse.
○ Types of data: Web, Graph, Text, Time Series, Image and Audio
● Noisy
○ Errors in data collection. Could be introduced by software or the end users. Business processes also change.
○ Missing Values, Outliers
● Not labeled - makes supervised learning difficult or impossible
● Data is spread across multiple systems
● You don't have access to all of the variables, or even know what all of them are
Not that simple
● What problem are you going to solve?
○ You have order / customer / part data - now what?
● How do you evaluate the results?
○ Example: Fraud detection - a classifier that always predicts 'False' can achieve 95% accuracy when only 5% of cases are fraud
○ What do you want to optimize?
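The fraud-detection pitfall in code, with an assumed 5% fraud rate:

```python
# 1000 cases, 5% of them fraud (assumed rate for illustration)
labels = [True] * 50 + [False] * 950

# A useless classifier that always predicts 'not fraud'
predictions = [False] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)   # → 0.95

# ... yet it catches zero fraud: recall on the positive class is 0
recall = sum(p and y for p, y in zip(predictions, labels)) / sum(labels)
print(recall)     # → 0.0
```

This is why the evaluation metric has to be chosen up front: precision, recall, or expected cost would all expose this classifier immediately.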
Some Terms
● labels, ground truth, dependent variable
● weights, parameters, theta, coefficients
● variables, attributes, independent variables, features, columns
● regression versus classification
○ Logistic Regression - a special case
● supervised versus unsupervised
Some More Terms
● fitting, training
● train / validation / test data sets
● hyper-parameters, model parameters
● batch, mini-batch, epoch, iteration
● ensemble, blending, model averaging
● machine learning, statistical modeling, artificial intelligence
Gradient Descent
● Very important concept in Machine Learning
● Each model has an associated learning algorithm - many are iterative
○ Many learning procedures rely on gradient descent
○ Also, some tricks
■ Mini-batch
■ Momentum
■ Decaying Learning Rates
Gradient Descent
● Basic Algorithm
○ Step 1: Compute the derivative of the optimization objective w.r.t. the model parameters
○ Step 2: Take a step in the direction that points downhill
● Close your eyes and imagine you are on a hill ...
● Drop a ball on the error surface ...
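The two steps can be sketched in a few lines of plain Python: gradient descent on a one-parameter least-squares fit, y = w·x. The data and learning rate are made up for illustration.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated with w = 2, so descent should find w ≈ 2

w = 0.0                      # model parameter, starting anywhere on the hill
lr = 0.01                    # learning rate (step size)
for _ in range(500):
    # Step 1: derivative of the mean squared error w.r.t. w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    # Step 2: take a step in the downhill direction
    w -= lr * grad

print(round(w, 4))           # → 2.0
```

Mini-batches change only how many (x, y) pairs go into the gradient; momentum and decaying learning rates change how the step in Step 2 is taken.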
Gradient Descent
Andrew Ng: Machine Learning Coursera Course
Predictive Modeling is Big Business
● Target: Increased revenue 15 to 30 percent with predictive models
● Cox Communications: Tripled direct mail response rates by predicting propensity to buy
● Amazon.com: 35 percent of sales come from product recommendations
Eric Siegel: Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die
Predictive Modeling is Big Business
● Netflix: Used predictive models to improve recommendations by 10%
● Google: Everything! What search results are most relevant. What ads should be served. What email is spam. Self-driving cars. 96% of 2011 revenue came from advertising. Where would Google be without PM?
● Allstate: Tripled accuracy of predicting bodily injury based solely on the vehicle. Worth an estimated $40 million annually.
● Siri: 2 steps: understand what was said, then retrieve the relevant information
Eric Siegel: Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die
Predictive Modeling is Big Business
Nokia / Sprint, Microsoft / Google, Facebook / LinkedIn, Match.com / OkCupid, Target / Walmart, Fingerhut, FedEx, Chase / PREMIER ...
IBM / Hewlett-Packard, Sun Microsystems, U.S. Bank, MTV, DTE Energy, Reed Elsevier, Pfizer / FICO, Mastercard ...
Predictive Modeling is Big Business
● Recent Headlines
○ NSA - PRISM
■ Microsoft, YouTube, Google, Yahoo, Facebook, Skype, Apple and many others allegedly contributing
○ IBM
■ Watson - being deployed into healthcare
Lots of success stories - must be easy ...
● Not really. Lots of pitfalls.
● What about some notable failures?
Failures - #1 FICO Score
● The score supposedly predicts the default probability of a borrower based on vast historical credit, default and personal data
● During the 2008 mortgage default wave, the FICO score dramatically underestimated the default rate of almost every credit score category, from the prime group (the highest credit score group) to the subprime group (the lowest credit score group).
http://en.wikipedia.org/wiki/Predictive_modelling#Notable_failures_of_predictive_modeling
Failures - #2 LTCM
● Long Term Capital Management. Ran a fund based on the Black-Scholes Option Pricing Model (Nobel Prize in Economics)
● The models produced impressive profits until a spectacular debacle that forced the then Federal Reserve chairman Alan Greenspan to step in to broker a rescue plan by the Wall Street broker-dealers in order to prevent a meltdown of the bond market.
● Lost 2 billion dollars in 3 weeks
When Genius Failed: The Rise and Fall of Long-Term Capital Management
http://en.wikipedia.org/wiki/Predictive_modelling#Notable_failures_of_predictive_modeling
Fundamental Limitations of Predictive Modeling
● History cannot always predict the future. Systems are complex and change over time. Predictive models have difficulty adapting to the change. Regime change.
● Unknown unknowns. The variables being collected may not be the ones that are most critical to predicting an outcome.
http://en.wikipedia.org/wiki/Predictive_modelling#Notable_failures_of_predictive_modeling
Predictive Modeling Projects - Tips and Pitfalls
Generalization vs Memorization
● AKA - Overfitting
● Very expressive and powerful models will start to treat the noise in the training data as signal and account for it in their estimations
● Extreme case - memorize each example
Generalization vs Memorization
● Extreme example: 5 data points, fit to a 4th order polynomial
● No errors on training
● But no chance on xval
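The extreme example in NumPy: five hand-picked points scattered around the line y = x, fit exactly by a 4th-order polynomial. Training error is zero, but a held-out point from the same line is missed badly.

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([0.2, 0.8, 2.3, 2.9, 4.1])   # roughly y = x, plus noise

coeffs = np.polyfit(xs, ys, deg=4)         # 5 points, 5 coefficients: exact fit
train_error = np.max(np.abs(np.polyval(coeffs, xs) - ys))
print(train_error)                          # ~0: the model memorized the data

# A held-out ("xval") point at x = 5, where the underlying line gives y = 5
xval_error = abs(np.polyval(coeffs, 5.0) - 5.0)
print(round(xval_error, 2))                 # → 5.7: wildly off the line
```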
Generalization vs Memorization
● Techniques for combating overfitting
○ Regularization (i.e. penalize the model for being too complex)
■ L1 and L2 norm penalties
■ Weight Sharing
○ Early stopping
○ N-Fold Cross Validation
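An L2 norm penalty (ridge regression) is easy to see in closed form: w = (XᵀX + λI)⁻¹Xᵀy. The data below is synthetic; the point is only that a larger λ shrinks the weights, trading a little training fit for a less complex model.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + rng.normal(0, 0.1, size=50)

def ridge(X, y, lam):
    """Least squares with an L2 penalty of strength lam."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_plain = ridge(X, y, lam=0.0)    # ordinary least squares
w_reg = ridge(X, y, lam=100.0)    # heavy penalty shrinks the weights
print(np.linalg.norm(w_plain), np.linalg.norm(w_reg))
```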
Leverage External Data
● US Census and ACS data have a wealth of information
○ Show the Spreadsheet
● Social Data: LinkedIn, Facebook, Twitter
● Data Augmentation Services
● Google Public Data Directory
Project Expectations
● Results are not guaranteed
● Just because you want to predict something doesn't mean you can!
● Not just about having enough data, but also requires having the right data
Information Leakage
● Refers to a piece of information your model is using to make its predictions, but that would not actually be available at prediction time
● Can be very subtle to detect!
● Need to be very careful to make sure only the information that is known at the time the model is executed is made available during training
Leakage - Example #1
● IBM modeling project: Put together a dataset that consisted of IBM customers. The dataset included text/html data from each candidate customer's web site
● The goal was to predict which ones had purchased the 'websphere' product
● Best model determined the probability of purchase was much higher if the site included the text 'websphere'
● Leak - they pulled the most recent version of the website!
Leakage - Example #2
● US Census Kaggle Competition
● Task - predict the response rate of each 2010 census tract given the tract's demographic information
● Leak - the demographic information that was provided was not known until the census was already completed (and the rates would have already been known)
Leakage - Example #3
● Build a predictive model that estimates the total points scored in a sporting event
● Inputs: starting lineups, weather conditions at game time, teams' historical performance
● Where is the leakage? Be sure to use the forecasted weather and the forecasted lineup
Leakage
● Powerful modeling tools will quickly detect a leak and exploit it - oftentimes it carries the most signal
● Need to be very cautious and make sure you only provide the model with the information that would have been known at that moment in time
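One practical defense is to build every training feature "as of" the prediction cutoff, so nothing dated after it can sneak in. The event records below are hypothetical:

```python
from datetime import date

events = [
    {"customer": "a", "day": date(2013, 1, 5), "spend": 100},
    {"customer": "a", "day": date(2013, 2, 20), "spend": 250},
    {"customer": "a", "day": date(2013, 3, 2), "spend": 400},  # after the cutoff
]

def total_spend_asof(events, customer, cutoff):
    """Feature computed as-of the cutoff: later events are excluded,
    because they would not have been known at prediction time."""
    return sum(e["spend"] for e in events
               if e["customer"] == customer and e["day"] < cutoff)

print(total_spend_asof(events, "a", date(2013, 3, 1)))  # → 350, not 750
```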
Technology, Tools and Considerations
● What is the right technology to use for a PM project?
○ SPSS, SAS, SSAS, Matlab, Python, R, Weka, Julia
○ Each has pros and cons
● How do you integrate a model into a production system?
○ Java, .Net, Ruby etc ...
○ SQL / Oracle?
NLP Logix Technology & Tools
● NLP Logix Modeling Technology Stack
○ Python for predictive modeling
■ PyCharm, Sublime Text, scikits-learn, pandas
■ matplotlib, theano
○ R for statistical analysis, visualization and modeling
■ RStudio, ggplot2
○ Linux OS for modeling, Windows OS for data processing
○ git / github for version control (atypical for data analysis?)
○ ipython notebook for sharing / reproducing research
○ SqlServer for data shaping and preprocessing
Quick aside on Python
● Powerful tool set for scientific computing
○ Links to BLAS / ATLAS (optimized C code for executing linear algebra operations)
○ Links to CUDA for incredible parallelism via the GPU
○ Active ecosystem of data science packages
■ Pandas (like an R data frame)
■ numpy / scipy (like Matlab - adds vectorized operations to Python)
■ Matplotlib (like Matlab)
■ scikits-learn, PyBrain, PyLearn2, etc ...
Quick aside on NumPy
● Powerful library for scientific computing
○ Extension to Python
○ Adds N-dimensional arrays
○ Supports matrix algebra operations
■ np.dot(X1, X2)
○ Open source and has many contributors
○ Adds MATLAB-type functionality to Python
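A quick taste of the features listed above - N-dimensional arrays, vectorized arithmetic and matrix algebra via np.dot:

```python
import numpy as np

X1 = np.array([[1.0, 2.0],
               [3.0, 4.0]])
X2 = np.array([[5.0, 6.0],
               [7.0, 8.0]])

product = np.dot(X1, X2)   # matrix multiplication
print(product)             # [[19. 22.] [43. 50.]]

doubled = X1 * 2           # vectorized: no explicit Python loop
print(doubled)             # [[2. 4.] [6. 8.]]
```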
Quick aside on GPUs
● Graphical Processing Units
○ Created for gaming, but also used for scientific computing
○ Latest cards have nearly 1,000 cores and 2 GB of RAM
○ Cores are not as fast as CPU cores, but the parallelism makes up for it
○ Handle linear algebra operations very efficiently (i.e. matrix multiply and matrix transpose)
● Cheaper than a CPU cluster - and avoids marshalling the data out to a cluster
Quick aside on GPUs
● Andrew Ng: 'Deep learning with COTS HPC systems'
○ HPC == High Performance Computing
○ 3 computers, each with 2 GPU cards
○ Fit a neural network with 1 billion parameters in 3 days
● NLP Logix RBM Implementation
○ Ported from NumPy to theano
○ Improved epoch performance by 50% on a benchmark test
○ 8-core i7 versus 120-core GPU (4 years old)
Steps for a Successful Project
● Clearly defined problem
● Agreed upon way to measure and evaluate a successful model
○ Need the ability to compare candidate models
● Deployment strategy
○ Might dictate modeling technologies and tools
Use 3 Data Sets!
● Train
○ The data the model is fit on
● Xval
○ The data the model's performance is evaluated on
○ Hyperparameter selection
● Test
○ Only evaluated once (ideally)
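A minimal sketch of the three-way split in plain Python; the 60/20/20 proportions are an assumption, not a recommendation from the talk.

```python
import random

data = list(range(1000))   # stand-in for 1000 labeled examples

random.seed(0)
random.shuffle(data)       # shuffle before splitting

train = data[:600]         # fit the model here
xval = data[600:800]       # evaluate candidates / pick hyper-parameters here
test = data[800:]          # touch once, at the very end

print(len(train), len(xval), len(test))   # → 600 200 200
```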
Predictive Modeling Projects
Kaggle
● Kaggle.com
● Hosts data science competitions: 'data science as a sport'
○ Netflix Challenge - 1 Million Dollars
○ Heritage Health Prize - 3 Million Dollars
○ GE FlightQuest - $250,000
○ KDD Cup
Kaggle
● Also a number of other competitions hosted by commercial partners
○ Facebook, Microsoft, Pfizer, Allstate, etc ...
● A few I have competed in ...
○ Predict Probability of a Patient having Type 2 Diabetes
○ Predict US Census Return Rates
○ Predict Runway and Gate Arrival Times
Flight Quest
● Two different prediction problems
○ Given the following information: flight plan, last estimated arrival time (onboard computer system), the other flights in the air, weather, waypoints, aircraft type, etc ... -- predict the number of minutes from a cutoff time that it will take the craft to land on the runway and arrive at the gate
○ Very challenging - over 1 TB of data
Flight Quest
● Data
○ Trained on all data between 11/2012 and 2/2013
○ Final evaluation data was from the 2nd half of Feb 2013
○ Final evaluation data was collected after the models had been frozen
● Great way to combat over-fitting: the final evaluation data had not even been generated when the models were finalized
Flight Quest
● Cutoff Time:
○ For each day in the 2-week test period, a random time between 9AM EST and 9PM EST was picked
○ Every flight that was in the air at the cutoff time was included in the evaluation set
○ Each of these flights required 2 predictions: time of runway arrival and time of gate arrival
● Allowed us to generate an almost infinite amount of training data!
Flight Quest
● Key points from my model (4th best scoring model)
○ Primarily used 2 different types of models - Random Forest and Gradient Boosted Trees
○ Removed outlier days (holidays) and outlier flights (redirects)
○ Over 500 derived features
○ ~200 individual models were combined (ensembled) to create the final predictions
■ Ensembling was done using another layer of boosted trees
Flight Quest
● Launching FlightQuest2 at the end of June - 6/30/2013
● http://www.gequest.com/flight2
MNIST - Handwritten Digits
● A competition of sorts - between academics
● Database of 70,000 handwritten digits
● Classification task - assign a class label to each digit. See who can get the most right
● Permutation Invariant constraint usually applied
MNIST - Handwritten Digits
● Hinton's Page
○ Generative and Discriminative
● http://www.cs.toronto.edu/~hinton/adi/index.htm
MNIST - Handwritten Digits
● Some Scores
○ SVM: 140 errors
○ Neural Network: 140 errors
○ Boosted Trees: 153 errors
○ Neural Network with Dropout: 110 errors
○ Neural Network with Dropout and Maxout: 94 errors
○ Deep Boltzmann Machine plus pretraining: 79 errors
State of the Art
State of the Art
● Lots of excitement around 'deep learning'
○ Generative pre-training
○ GPUs over clusters
■ Deep Learning with COTS HPC systems
● Train large networks (1 billion parameters) on a small cluster of GPUs instead of a large cluster of CPUs
○ No more feature generation - the features themselves are learned
● Success in areas of extremely high dimension like image, text and audio
State of the Art
● Lots of success with stochastic, tree-based models
○ Random Forest
■ Bagging: Induce each new decision tree using a sampled (with replacement) subset of the training data
○ Boosted Trees
■ Boosting: Induce new decision trees, focusing on the cases the previous trees did poorly on
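The bagging idea reduces to a few lines. Each "model" below is deliberately trivial - just the mean of a bootstrap sample - so the resampling step stands out; a real random forest fits a decision tree on each sample instead.

```python
import random

random.seed(0)
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]     # toy training targets

def fit_on_bootstrap(data):
    # Sample with replacement, then "fit" the simplest possible model
    sample = [random.choice(data) for _ in data]
    return sum(sample) / len(sample)

models = [fit_on_bootstrap(data) for _ in range(200)]
ensemble_prediction = sum(models) / len(models)
print(round(ensemble_prediction, 2))       # close to the overall mean of 3.5
```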
State of the Art
● Random Forest: Visualization
State of the Art
● Ensembling
○ Combine - either linearly, by simple average, or fed into another, more complex model type as features
○ Simple example: an ensemble of Neural Networks
■ Using the exact same set of hyper-parameters, fit 5 different neural networks, each using different random initializations. Take the product of the 5 predictions - the highest score gets the class label
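The product rule described above, in code. The five probability rows are invented - stand-ins for five differently initialized networks scoring one example over three classes:

```python
predictions = [
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
    [0.3, 0.4, 0.3],
    [0.2, 0.5, 0.3],
    [0.4, 0.3, 0.3],
]

# Multiply the five predictions class-by-class ...
scores = [1.0] * len(predictions[0])
for p in predictions:
    scores = [s * q for s, q in zip(scores, p)]

# ... and the highest score gets the class label
label = max(range(len(scores)), key=lambda c: scores[c])
print(label)   # → 1
```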
State of the Art
● Machine Learning moves fast
● I am personally most excited about Neural Networks, Deep Belief Networks and Boltzmann Machines
○ They use unlabelled data and can learn/detect features
● GPUs
○ My Boltzmann implementation improved by 50% using a 4-year-old GPU versus my latest i7
Resources
● Coursera
○ Probabilistic Graphical Models
○ Machine Learning (Andrew Ng)
○ Neural Networks (Hinton)
● Books
○ Learning from Data: Yaser S. Abu-Mostafa
● Academics
○ Stanford, University of Toronto and University of Montreal
Questions?
● Contact me: [email protected]