Predictive Analytics - What It Really Is and What It Really Does

Post on 17-Aug-2015

PREDICTIVE ANALYTICS

WHAT IT REALLY IS AND WHAT IT REALLY DOES

Kevin Gray
Cannon Gray LLC

http://www.cannongray.com/
kevin@cannongray.com

"There falls the words of fools about my ears...

Answers everywhere, promising solutions to my fears,

leading through halls with no doors in the walls,

and leave me in the darkness."

The Silence of a Candle (Ralph Towner)
https://www.youtube.com/watch?v=6Do_P7R9tQE

This Presentation…

…will not be a sales pitch promising that Predictive Analytics is The Answer to everything

Instead, it will be a snapshot of a complex topic

But first…what is “Predictive Analytics?”

Two Examples

A Data Scientist at a university develops a model to identify students at high risk of defaulting on their student loans

A Data Scientist at a financial services company builds a model to identify customers most likely to invest in certain new retirement funds

Confession…

In each case the Data Scientist was me…

…and this was the 1980s!

“New” Can Be Old

“Data Scientist,” “Big Data,” and “CRM” were not used at the time

“Predictive Analytics” was called Predictive Modeling

“New” Can Be Old

The origins of Predictive Modeling can be traced back centuries to epidemiology, actuarial science, astronomy and other fields

Banks had developed credit scoring systems based on statistical models at least as far back as the 1950s

Plus ça change, plus c'est la même chose?

"Sometimes things don’t change as much as all the terminology changes!"

- veteran US analytics recruiter (personal communication)

Data, Data, Everywhere…

Companies store data for many reasons:
    Legal (e.g., Sarbanes-Oxley)
    HR
    Operations
    Supply chain management
    Customer service
    Sales
    ...
    Marketing

Not Just For Marketing

Most Predictive Analytics has no connection with marketing:
    Medical and pharmaceutical research
    Fraud detection
    Finance
    Human Resource Management
    Oil and gas exploration
    Network security
    Military and National security
    Seismology

Marketing Applications

Marketing applications are most common in industries with detailed consumer data:
    Retailing
    Banking
    Insurance
    Travel and hospitality
    Medical/Pharmaceutical
    Telecommunications

Marketing Applications

A few examples are:
    Customer Relationship Management
    Customer retention
    Retail recommender systems
    Direct marketing
    Cross selling
    Targeted ads
    Analysis of website traffic

“Big” Comes In Many Sizes

There is absolutely no requirement that Predictive Analytics data must be “big”

Sometimes just a few hundred observations
    My student loan model
    Segmentation typing tools

“Big” Comes In Many Sizes

Massive, high-velocity, streaming data also used but not required

Data can be structured or unstructured

Real-time analytics the exception, not the rule

No Escaping The Basics

Understanding the fundamentals of research and statistics is now more important than ever!!

Sampling

Sampling is an essential part of most research, not just market surveys

Predictive Analytics is typically based on a sample then deployed on new data

Sampling

Seldom need zillions of records to develop and evaluate the model

Complex sampling and weighting sometimes used
    e.g., when predicting rare events such as fraud
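
One common form of sampling and weighting for rare events can be sketched as follows. This is a minimal illustration, not necessarily the scheme used in any of the projects described here: every rare case is kept, the common cases are randomly down-sampled, and each kept common case carries a weight so the sample still represents the full file. The function name, the `keep_fraction` value and the made-up records are all assumptions.

```python
import random


def downsample_with_weights(records, is_rare, keep_fraction=0.1, seed=0):
    """Keep every rare case; keep a random fraction of the common cases.

    Returns (record, weight) pairs. Each kept common case gets weight
    1 / keep_fraction so that weighted totals approximate the full data.
    """
    rng = random.Random(seed)
    sample = []
    for record in records:
        if is_rare(record):
            sample.append((record, 1.0))
        elif rng.random() < keep_fraction:
            sample.append((record, 1.0 / keep_fraction))
    return sample


# Made-up data: 10,000 transactions, 1% of them flagged as fraud.
records = [{"id": i, "fraud": i % 100 == 0} for i in range(10_000)]
sample = downsample_with_weights(records, lambda r: r["fraud"])

print(len(sample))                            # far fewer rows than 10,000
print(round(sum(w for _, w in sample)))       # weighted total, close to 10,000
```

The model is then developed on the much smaller weighted sample rather than on every record.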

Design And Inference

Knowledge of experimental and quasi-experimental designs essential
    e.g., when running different campaigns among different customer groups

Sound grasp of causal inference also needed
    e.g., “Why isn’t our campaign working?”

“Trad” Stats Are Not Dead

Descriptive Statistics
Principal Components Analysis
Multiple Regression
Ridge Regression, LASSO
Partial Least Squares Regression
Logistic Regression
Discriminant Analysis
CHAID, CART
Survival Analysis
Time-series Analysis
Mixture Modeling/Latent Class

“Janitorial Work”

Big Data is often small data repeated many times and can be substantially reduced at the pre-processing stage
    e.g., may only need monthly spend on one food category in the past year (P1Y), not every transaction in the past five years (P5Y)

“Janitorial Work”

Many data fields are “exhaust”
    By-products of transactional or operational processes and not useful in Predictive Analytics

Most data have little or no marketing value

Predictive Analytics Is A Process

CRISP-DM: Cross Industry Standard Process for Data Mining

Core Concepts Of Predictive Analytics

Existing data are used to develop a model that scores new data

By score I mean any of the following:
    Classifying into the most probable group (will purchase / will not purchase)
    Assigning a probability score (probability of purchase)
    Predicting a quantity (how much a customer will spend)
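
The three kinds of score can be illustrated with a minimal sketch. The coefficients, the 0.5 cut-off and the average basket value below are made-up numbers for illustration only. They are not a fitted model, and the function names are hypothetical.

```python
import math

# Hypothetical coefficients; a real model would be estimated from
# existing (training) data.
INTERCEPT = -2.0
COEF_PAST_PURCHASES = 0.8


def purchase_probability(past_purchases: int) -> float:
    """Probability score from a logistic equation."""
    z = INTERCEPT + COEF_PAST_PURCHASES * past_purchases
    return 1.0 / (1.0 + math.exp(-z))


def purchase_class(past_purchases: int, cutoff: float = 0.5) -> str:
    """Classify into the most probable group."""
    p = purchase_probability(past_purchases)
    return "will purchase" if p >= cutoff else "will not purchase"


def expected_spend(past_purchases: int, average_basket: float = 40.0) -> float:
    """Predict a quantity: probability times an assumed average basket."""
    return purchase_probability(past_purchases) * average_basket


new_customer = 4  # a case that was not used to build the model
print(purchase_class(new_customer))
print(round(purchase_probability(new_customer), 3))
print(round(expected_spend(new_customer), 2))
```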

Core Concepts Of Predictive Analytics

Existing data are used to develop a model that scores new data

By new I mean data not used to build the model, for example:
    Data that do not yet exist (e.g., future customers)
    Data deliberately set aside (held out) when the model was being built

Core Concepts Of Predictive Analytics

Existing data are used to develop a model that scores new data

By model I mean either of the following:
    An equation or system of equations used to represent the process that generated the data - a statistical model
    A computer algorithm designed for pattern recognition - a machine learner

These are not official definitions!

Overfitting

Any sample has its idiosyncrasies

We want to develop a Predictive Model that generalizes well to new data

Must try to avoid modelling noise in our data
    Known as overfitting

Models will nearly always be less accurate on new data (“shrinkage”)

Overfitting

Overfitting is related to bias-variance trade-off

Overfitting is nearly always a concern, but especially when there are few cases (e.g., B2B customers) and many independent variables (predictors): “small n, large p”

Overfitting and Bias-Variance Trade-off

Systematically wrong (biased) but by about the same amount from sample to sample

Less biased but accuracy of predictions will vary a lot from sample to sample
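
The trade-off between these two situations is usually written as the standard decomposition of expected squared prediction error. The notation is an assumption, not taken from the slides: the outcome is y = f(x) + ε with noise variance σ², the fitted model is f̂, and expectations are over repeated training samples.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The first situation above has a large bias term and a small variance term. The second has a small bias term and a large variance term, which is the pattern associated with overfitting.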

Model Validation

We use a Training Sample to develop our model

We use a Validation Sample to estimate the accuracy of our candidate models
    Which method and parameter settings will work best?
    How well will it predict new data?

Cross Validation

One of the simplest ways is Cross Validation with a single hold-out split

We randomly split data into two parts - a training sample and a validation sample
    70/30 splits are common

Build the model on the training sample and observe how accurately it predicts on the validation (hold out) sample
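
A random 70/30 split can be sketched in a few lines. The function name, the fixed seed and the stand-in records are assumptions for illustration.

```python
import random


def holdout_split(records, train_fraction=0.7, seed=42):
    """Randomly split records into a training and a validation sample."""
    shuffled = list(records)             # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


customers = list(range(100))             # stand-in for 100 customer records
training, validation = holdout_split(customers)
print(len(training), len(validation))    # 70 30
```

The model would be built on `training` only, and its accuracy then measured on `validation`.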

Cross Validation

[Figure: data split into a 70% Training sample and a 30% Validation sample]

K-Fold Validation

K-Fold Validation is generally preferred

Many variations but a popular way is to randomly divide data into 5 subsamples

Build model on 4 subsamples combined and validate on the 5th

Repeat this 4 more times so that model performance is assessed in each subsample

Run this whole process 5-10 times on different random subsamples and use the average/modal result
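
The procedure can be sketched as below. This is a minimal illustration under stated assumptions: `score` is a hypothetical callback that builds a model on the training indices and returns its accuracy on the validation indices, and the results are averaged (the modal result is not shown).

```python
import random


def k_fold_splits(n_cases, k=5, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]    # k roughly equal subsamples
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, validation


def repeated_k_fold(n_cases, score, k=5, repeats=5):
    """Average a validation score over repeated k-fold runs."""
    results = []
    for repeat in range(repeats):
        # A different seed gives different random subsamples on each repeat.
        for training, validation in k_fold_splits(n_cases, k, seed=repeat):
            results.append(score(training, validation))
    return sum(results) / len(results)


# Each of the 5 runs trains on 80 cases and validates on the other 20.
print([(len(t), len(v)) for t, v in k_fold_splits(100)])
```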

5-Fold Validation

[Figure: 5-fold validation chart. Runs 1 to 5 each span 0% to 100% of the data, divided into Subsamples 1 to 5; in each run a different subsample is marked Validate]

Repeat process 5-10 times on different random subsamples

Model Selection

Most Machine Learners have several tuning parameters and Statistical modelling usually requires many decisions

Often many methods are tried, tuned and tested since even small differences in accuracy can translate into Big $

Jackknife and Bootstrap are two other procedures sometimes used
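
Both procedures re-use the sample to see how much a result varies. A minimal sketch, assuming the statistic of interest is a simple mean and using made-up spend figures; in practice the statistic might be a model's accuracy.

```python
import random
import statistics


def bootstrap_standard_error(values, statistic=statistics.mean,
                             n_resamples=1000, seed=0):
    """Bootstrap: resample with replacement and see how the statistic varies."""
    rng = random.Random(seed)
    n = len(values)
    estimates = [
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    ]
    return statistics.stdev(estimates)


def jackknife_estimates(values, statistic=statistics.mean):
    """Jackknife: recompute the statistic leaving out one case at a time."""
    return [
        statistic(values[:i] + values[i + 1:])
        for i in range(len(values))
    ]


spend = [12.0, 15.5, 9.0, 30.0, 18.5, 22.0, 11.0, 27.5]   # made-up data
print(bootstrap_standard_error(spend))
print(jackknife_estimates(spend)[:3])
```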

Final Model

When the type of predictive model and its parameters have been decided the model is re-run on the entire sample

This will be the model that is actually deployed

Sometimes many models/versions are “stacked” and the results averaged
    Usually a Plan B option because of cost and complexity

Updating Model

The predictive accuracy of any model will decline over time

Should be periodically updated on new data

Some Machine Learners Used In Predictive Analytics

Naive Bayes
KNN
Apriori
Artificial Neural Networks
Support Vector Machines
Random Forests
Stochastic Gradient Boosting
MARS
Cubist, C5.0 (J. Ross Quinlan)
Latent Dirichlet Allocation
…many more…

K-Nearest Neighbors (KNN)

[Figure: two KNN panels, Nearest 3 Neighbors and Nearest 5 Neighbors; the new (Green) point is predicted to be Red in one panel and Blue in the other]

Artificial Neural Networks (ANN)

[Figure: network diagram with an Input Layer, a Hidden Layer and an Output Layer, each made up of Nodes]

Support Vector Machines (SVM)

[Figure: SVM diagram showing the Margin and the Support Vectors]

Random Forests (RF)

Hundreds or Thousands of Trees

Variables and Cases Randomly Selected

Prediction And Explanation

An ideal model will predict new cases and be informative

Compared to Statistical models, Machine Learners often slightly better at prediction but are usually hard to interpret

Sometimes can use a Machine Learner and Statistical model in tandem - if each is adequate their predictions will be highly correlated
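
One simple way to check this is to correlate the two sets of scores for the same hold-out cases. A minimal sketch with made-up probability scores; the variable names and numbers are assumptions.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up probability scores for the same six hold-out customers.
statistical_model_scores = [0.10, 0.25, 0.40, 0.55, 0.70, 0.90]
machine_learner_scores = [0.08, 0.30, 0.35, 0.60, 0.75, 0.95]

r = pearson_correlation(statistical_model_scores, machine_learner_scores)
print(round(r, 3))
```

A low correlation would suggest that at least one of the two models needs a closer look.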

Humans Are Not Yet Obsolete

Applied Logistic Regression (Hosmer and Lemeshow):

"[statistical] methods...are not to be used as a substitute, but rather as an addition to clear and careful thought. Successful modeling of a complex data set is part science, part statistical methods, and part experience and common sense."

Humans Are Not Yet Obsolete

Leo Breiman and Adele Cutler (developers of RF algorithm):

“[random forests] is an example of a tool that is useful in doing analyses of scientific data. But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem. Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem.”

Humans Are Not Yet Obsolete

The resurgence of Bayesian statistics is further evidence that human judgment cannot be purged from analytics

We must also avoid finding a great answer…to the wrong question!

Humans Are Not Yet Obsolete

More analytic options also mean higher risk and greater need for well-trained and experienced researchers

Paradoxically, technology has made it easier to be an incompetent data scientist and harder to be a good one!
    e.g., “abuser-friendly” stats software
    Beautiful data visualizations…that tell us nothing

Humans Are Not Yet Obsolete

With bigger and messier data, understanding people will become more critical, not less

Demand will rise for data scientists able to see beyond math and programming who truly understand marketing and consumers

The Why

Understanding The Why is critical

Marketing isn’t only about predicting behavior, it’s also about changing behavior

Understanding The Why can help us do this

The Why

Reverse-engineering The Why from past behavior isn’t always easy
    Two people can do the same things for the same reasons
    Two people can do the same things for different reasons
    Two people can do different things for the same reasons
    Two people can do different things for different reasons

The Why

There is also the “multiple me”…on different occasions
    I can do the same things for the same reasons
    I can do the same things for different reasons
    I can do different things for the same reasons
    I can do different things for different reasons

Also…some of our behavior is random…

“How do I know it will pay off?”

Reasons To Be Skeptical

Data infrastructure may involve considerable direct costs

The necessary skills may not be available in-house

To work well, Predictive Analytics is a team effort and time costs can be substantial

Reasons To Be Skeptical

Managers and staff often already overstretched
    “This is the last thing we need…”

Can be seen as threat, not opportunity

Reasons To Be Skeptical

Anonymity, confidentiality and privacy are common (and legitimate) concerns

    Potential legal issues
    Risk of data breaches
    Customers can become worried and irritated!

Reasons To Be Skeptical

Still only a handful of universities offer degree programs in Data Science - not as established as Finance or Accounting

Predictive Analytics is not guaranteed to pay back

C-Level buy-in critical!

Big Expectations

On the other hand, some clients think Predictive Analytics is easy
    “I have all this data, can you find me something?”

Managing client expectations critical - may only be a small, rusty needle in that Big Expensive Haystack!

Some Books

The Data Warehouse Toolkit (Kimball and Ross)
Data Architecture: A Primer for the Data Scientist (Inmon and Linstedt)
Hadoop: The Definitive Guide (White)
Sampling: Design and Analysis (Lohr)
Experimental and Quasi-Experimental Designs (Shadish et al.)
Categorical Data Analysis (Agresti)
Propensity Score Analysis (Guo and Fraser)
Time Series Analysis (Wei)
Data Mining Techniques (Linoff and Berry)
Regression Modeling Strategies (Harrell)
Data Mining (Witten et al.)
Applied Predictive Modeling (Kuhn and Johnson)
An Introduction to Statistical Learning (James et al.)
Elements of Statistical Learning (Hastie et al.)
Data Mining: The Textbook (Aggarwal)

Some Online Resources

KDNuggets - http://www.kdnuggets.com/
Data Science Central - http://www.datasciencecentral.com/
About Data Analysis - https://www.linkedin.com/grp/home?gid=8156839
CRISP-DM - https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
Cannon Gray company library - http://cannongray.com/methods

Key Points To Remember

Predictive Analytics uses existing data to build a model that will accurately predict new data (e.g., customer behavior)

It does not require Big Data, Machine Learning or Real-Time Analytics

Key Points To Remember

It is not a recent development but has become much more sophisticated in past decade

It is now much more widely accepted - not so exotic (or wacko) anymore
    Very quizzical looks used to be routine!

Key Points To Remember

Big Data has not miraculously made data clean and easy to analyze - just the opposite!

Only 10% - 20% of total analyst time spent on modeling

Key Points To Remember

Overfitting frequently a problem - use K-fold validation when possible

Aim for parsimony - as simple as possible but not too simple

Key Points To Remember

Implementation is the graveyard of many good ideas!

After deployment, sales and customer behavior should be tracked
    The actual effects must be assessed - e.g., are we just making customers more price sensitive and eroding our brand equity?

Even successful implementations must be refined, modified or discarded with passage of time

Key Points To Remember

Predictive Analytics is a process that requires a team - technical skill is only one aspect

Marketing researchers can’t do it all but there is often an important role for us

Vital that end users be active team participants

Key Points To Remember

Marketing researchers, marketers, statisticians and computer scientists often use different jargon or same jargon differently!

Don’t assume - communicate!

Key Points To Remember

Predictive Analytics is not a substitute for marketing research

Synergizes well with “trad” marketing research to provide richer insights into what consumers do and why

Key Points To Remember

Marketing researchers have a natural advantage over others working in this space - we are more than technicians

Some criticisms of MR by “Data Scientists” betray lack of understanding of marketing and research

Science Fiction Is Not Science

THANK YOU!!

Kevin Gray
Cannon Gray LLC

http://www.cannongray.com/
kevin@cannongray.com