Predictive Analytics - What It Really Is and What It Really Does
PREDICTIVE ANALYTICS
WHAT IT REALLY IS AND WHAT IT REALLY DOES
Kevin Gray, Cannon Gray LLC
http://www.cannongray.com/ [email protected]
Cannon Gray LLC
"There falls the words of fools about my ears...
Answers everywhere, promising solutions to my fears,
leading through halls with no doors in the walls,
and leave me in the darkness."
The Silence of a Candle (Ralph Towner)
https://www.youtube.com/watch?v=6Do_P7R9tQE
This Presentation…
…will not be a sales pitch promising that Predictive Analytics is The Answer to everything
Instead, it will be a snapshot of a complex topic
But first…what is “Predictive Analytics?”
Two Examples
A Data Scientist at a university develops a model to identify students at high risk of defaulting on their student loans
A Data Scientist at a financial services company builds a model to identify customers most likely to invest in certain new retirement funds
Confession…
In each case the Data Scientist was me…
…and this was the 1980s!
“New” Can Be Old
The terms “Data Scientist,” “Big Data,” and “CRM” were not in use at the time
“Predictive Analytics” was called Predictive Modeling
“New” Can Be Old
The origins of Predictive Modeling can be traced back centuries to epidemiology, actuarial science, astronomy and other fields
Banks had developed credit scoring systems based on statistical models at least as far back as the 1950s
Plus ça change, plus c'est la même chose?
"Sometimes things don’t change as much as all the
terminology changes!"
- veteran US analytics recruiter (personal communication)
Data, Data, Everywhere…
Companies store data for many reasons:
Legal (e.g., Sarbanes-Oxley)
HR
Operations
Supply chain management
Customer service
Sales
...
Marketing
Not Just For Marketing
Most Predictive Analytics has no connection with marketing:
Medical and pharmaceutical research
Fraud detection
Finance
Human Resource Management
Oil and gas exploration
Network security
Military and National security
Seismology
Marketing Applications
Marketing applications are most common in industries with detailed consumer data:
Retailing
Banking
Insurance
Travel and hospitality
Medical/Pharmaceutical
Telecommunications
Marketing Applications
A few examples are:
Customer Relationship Management
Customer retention
Retail recommender systems
Direct marketing
Cross selling
Targeted ads
Analysis of website traffic
“Big” Comes In Many Sizes
There is absolutely no requirement that Predictive Analytics data must be “big”
Sometimes just a few hundred observations, e.g.:
My student loan model
Segmentation typing tools
“Big” Comes In Many Sizes
Massive, high-velocity, streaming data are also used but are not required
Data can be structured or unstructured
Real-time analytics is the exception, not the rule
No Escaping The Basics
Understanding the fundamentals of research and statistics is now more important than ever!!
Sampling
Sampling is an essential part of most research, not just market surveys
A Predictive Analytics model is typically developed on a sample and then deployed on new data
Sampling
We seldom need zillions of records to develop and evaluate the model
Complex sampling and weighting are sometimes used, e.g., when predicting rare events such as fraud
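One common way to handle a rare event such as fraud is to keep every rare case, keep only a random fraction of the common cases, and weight the kept common cases so the sample still represents the full data. The sketch below is illustrative only: the function name, the 5% keep fraction and the toy data are assumptions, not part of the presentation.

```python
import random

def downsample_with_weights(cases, is_rare, keep_fraction, seed=0):
    """Keep every rare case and a random fraction of the common cases.

    Each kept common case gets weight 1 / keep_fraction, so weighted
    counts still approximate the full data set.
    """
    rng = random.Random(seed)
    sample = []
    for case in cases:
        if is_rare(case):
            sample.append((case, 1.0))
        elif rng.random() < keep_fraction:
            sample.append((case, 1.0 / keep_fraction))
    return sample

# Hypothetical data: 1 = fraud (rare), 0 = legitimate (common).
population = [1] * 20 + [0] * 9980
sample = downsample_with_weights(population, lambda c: c == 1, keep_fraction=0.05)

n_rare = sum(1 for case, _ in sample if case == 1)
weighted_total = sum(weight for _, weight in sample)
print(len(sample), n_rare, round(weighted_total))
```

The weighted total will be close to, but not exactly, the original 10,000 records because the common cases are sampled at random.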
Design And Inference
Knowledge of experimental and quasi-experimental designs is essential, e.g., when running different campaigns among different customer groups
A sound grasp of causal inference is also needed, e.g., “Why isn’t our campaign working?”
“Trad” Stats Are Not Dead
Descriptive Statistics
Principal Components Analysis
Multiple Regression
Ridge Regression, LASSO
Partial Least Squares Regression
Logistic Regression
Discriminant Analysis
CHAID, CART
Survival Analysis
Time-series Analysis
Mixture Modeling/Latent Class
“Janitorial Work”
Big Data is often small data repeated many times and can be substantially reduced at the pre-processing stage
e.g., we may only need monthly spend on one food category over the past year, not every transaction over the past 5 years
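The reduction described above can be sketched as a simple aggregation: raw transactions are collapsed into monthly spend for a single category within a chosen time window. This is a minimal illustration; the record layout, category names and dates are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction log: (purchase date, food category, amount spent).
transactions = [
    (date(2015, 1, 5), "snacks", 4.50),
    (date(2015, 1, 19), "snacks", 3.25),
    (date(2015, 1, 20), "dairy", 2.10),
    (date(2015, 2, 2), "snacks", 5.00),
    (date(2013, 6, 1), "snacks", 9.99),  # outside the window of interest
]

def monthly_spend(records, category, start, end):
    """Total spend per (year, month) for one category within [start, end]."""
    totals = defaultdict(float)
    for day, cat, amount in records:
        if cat == category and start <= day <= end:
            totals[(day.year, day.month)] += amount
    return dict(totals)

summary = monthly_spend(transactions, "snacks", date(2015, 1, 1), date(2015, 12, 31))
print(summary)
```

Five raw records become two monthly figures, which is the kind of shrinkage that makes "big" data manageable before any modeling starts.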
“Janitorial Work”
Many data fields are “exhaust”: by-products of transactional or operational processes that are not useful in Predictive Analytics
Most data have little or no marketing value
Predictive Analytics Is A Process
CRISP-DM: Cross Industry Standard Process for Data Mining
Core Concepts Of Predictive Analytics
Existing data are used to develop a model that scores new data
By score I mean any of the following:
Classifying into the most probable group (will purchase/will not purchase)
Assigning a probability score (probability of purchase)
Predicting a quantity (how much a customer will spend)
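The three kinds of score can be illustrated with a small sketch. It assumes a logistic model for purchase probability and a separate linear model for spend; all coefficients and variable names are made up for illustration and stand in for models already fitted on existing data.

```python
import math

# Hypothetical coefficients standing in for models fitted on existing data.
INTERCEPT, B_VISITS, B_PAST_SPEND = -2.0, 0.5, 0.01

def purchase_probability(visits, past_spend):
    """Probability score from a logistic model."""
    z = INTERCEPT + B_VISITS * visits + B_PAST_SPEND * past_spend
    return 1.0 / (1.0 + math.exp(-z))

def classify(visits, past_spend, threshold=0.5):
    """Most probable group, using an assumed 0.5 cut-off."""
    if purchase_probability(visits, past_spend) >= threshold:
        return "will purchase"
    return "will not purchase"

def predicted_spend(visits, past_spend):
    """Quantity prediction from a separate, equally hypothetical, linear model."""
    return 5.0 + 2.0 * visits + 0.3 * past_spend

# Score a new customer who was not in the data used to build the models.
new_customer = {"visits": 4, "past_spend": 100}
print(classify(**new_customer))
print(round(purchase_probability(**new_customer), 3))
print(predicted_spend(**new_customer))
```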
Core Concepts Of Predictive Analytics
Existing data are used to develop a model that scores new data
By new I mean data not used to build the model, for example:
Data that do not yet exist (e.g., future customers)
Data deliberately set aside (held out) when the model was being built
Core Concepts Of Predictive Analytics
Existing data are used to develop a model that scores new data
By model I mean either of the following:
An equation or system of equations used to represent the process that generated the data - a statistical model
A computer algorithm designed for pattern recognition - a machine learner
These are not official definitions!
Overfitting
Any sample has its idiosyncrasies
We want to develop a Predictive Model that generalizes well to new data
We must try to avoid modelling noise in our data, which is known as overfitting
Models will almost always be less accurate on new data (“shrinkage”)
Overfitting
Overfitting is related to the bias-variance trade-off
Overfitting is nearly always a concern, but especially when there are few cases (e.g., B2B customers) and many independent variables (predictors): “small n, large p”
Overfitting and Bias-Variance Trade-off
Systematically wrong (biased) but by about the same amount from sample to sample
Less biased but accuracy of predictions will vary a lot from sample to sample
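The two cases above correspond to the two reducible terms in the standard decomposition of expected squared prediction error. The notation is an assumption introduced here, not the presentation's: the outcome is y = f(x) + ε with mean-zero noise of variance σ², and f̂ is the model estimated from a random training sample.

```latex
% Assumes y = f(x) + \varepsilon, E[\varepsilon] = 0, Var(\varepsilon) = \sigma^2,
% with expectations taken over training samples and noise.
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

A very simple model tends to have a large first term and a small second term; a very flexible model tends to show the reverse, which is where overfitting appears.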
Model Validation
We use a Training Sample to develop our model
We use a Validation Sample to estimate the accuracy of our candidate models
Which method and parameter settings will work best?
How well will it predict new data?
Cross Validation
One of the simplest ways to validate a model is Cross Validation
We randomly split data into two parts - a training sample and a validation sample
70/30 splits are common
Build the model on the training sample and observe how accurately it predicts on the validation (hold out) sample
Cross Validation
[Figure: data split into a 70% Training sample and a 30% Validation sample]
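A random 70/30 split of this kind (often called a hold-out split) takes only a few lines. The sketch below uses only the Python standard library; the function name, the seed and the stand-in records are illustrative assumptions.

```python
import random

def train_validation_split(rows, train_fraction=0.7, seed=42):
    """Randomly split rows into a training sample and a validation (hold-out) sample."""
    shuffled = list(rows)                      # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))  # stand-in for 100 customer records
training, validation = train_validation_split(records)
print(len(training), len(validation))
```

The model would then be built on `training` only, with its accuracy measured on `validation`.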
K-Fold Validation
K-Fold Validation is generally preferred
There are many variations, but a popular way is to randomly divide the data into 5 subsamples
Build the model on 4 subsamples combined and validate on the 5th
Repeat this 4 more times, so that model performance is assessed in each subsample
Run this whole process 5-10 times on different random subsamples and use the average/modal result
5-Fold Validation
[Figure: Runs 1 to 5 plotted against Subsamples 1 to 5 (0% to 100% of the data); each run validates on a different subsample]
Repeat process 5-10 times on different random subsamples
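The repeated 5-fold procedure can be sketched as follows. This is one possible arrangement, not a reference implementation: `score_fn` is a hypothetical user-supplied function that would fit a model on the training indices and return its accuracy on the validation indices, and `held_out_share` is only a placeholder so the example runs on its own.

```python
import random

def k_fold_splits(n_cases, k=5, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds.

    Every case lands in the validation set exactly once.
    """
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal subsamples
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

def repeated_k_fold_score(score_fn, n_cases, k=5, repeats=5):
    """Average a score over `repeats` different random partitions."""
    scores = []
    for repeat in range(repeats):
        for training, validation in k_fold_splits(n_cases, k, seed=repeat):
            scores.append(score_fn(training, validation))
    return sum(scores) / len(scores)

def held_out_share(training, validation):
    """Placeholder score: the share of cases held out in each run."""
    return len(validation) / (len(training) + len(validation))

print(repeated_k_fold_score(held_out_share, n_cases=100))
```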
Model Selection
Most Machine Learners have several tuning parameters, and Statistical modelling usually requires many decisions
Often many methods are tried, tuned and tested since even small differences in accuracy can translate into Big $
Jackknife and Bootstrap are two other procedures sometimes used
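Both procedures reuse the sample itself to gauge how stable an estimate is. The sketch below shows the basic mechanics for a simple statistic (the mean): the bootstrap resamples cases with replacement, and the jackknife leaves out one case at a time. The function names and toy data are illustrative assumptions.

```python
import random
import statistics

def bootstrap_standard_error(values, stat=statistics.mean, n_resamples=1000, seed=0):
    """Bootstrap: resample with replacement, recompute the statistic each time,
    and report the spread of those recomputed values."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(n_resamples):
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(stat(resample))
    return statistics.stdev(estimates)

def jackknife_estimates(values, stat=statistics.mean):
    """Jackknife: recompute the statistic with each case left out in turn."""
    return [stat(values[:i] + values[i + 1:]) for i in range(len(values))]

spend = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # hypothetical customer spend
print(bootstrap_standard_error(spend))
print(jackknife_estimates([1.0, 2.0, 3.0]))
```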
Final Model
When the type of predictive model and its parameters have been decided, the model is re-run on the entire sample
This will be the model that is actually deployed
Sometimes many models/versions are “stacked” and the results averaged
This is usually a Plan B option because of cost and complexity
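Averaging the results of several models is the simplest form of this idea. The sketch below averages the scores case by case; the three sets of purchase probabilities are invented for illustration. Note that stacking in the stricter sense fits a second-stage model to combine the predictions, which this sketch does not do.

```python
def average_predictions(*model_predictions):
    """Average the scores from several models, case by case."""
    return [sum(scores) / len(scores) for scores in zip(*model_predictions)]

# Hypothetical purchase probabilities from three models for four customers.
logistic = [0.20, 0.80, 0.50, 0.10]
forest = [0.30, 0.70, 0.60, 0.20]
boosting = [0.10, 0.90, 0.40, 0.30]

blended = average_predictions(logistic, forest, boosting)
print([round(p, 2) for p in blended])
```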
Updating Model
The predictive accuracy of any model will decline over time
It should be periodically updated on new data
Some Machine Learners Used In Predictive Analytics
Naive Bayes
KNN
Apriori
Artificial Neural Networks
Support Vector Machines
Random Forests
Stochastic Gradient Boosting
MARS
Cubist, C5.0 (J. Ross Quinlan)
Latent Dirichlet Allocation
…many more…
K-Nearest Neighbors (KNN)
[Figure: with the Nearest 3 Neighbors the new (Green) point is predicted to be Red; with the Nearest 5 Neighbors it is predicted to be Blue]
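The way the prediction can flip with the number of neighbors is easy to reproduce. The sketch below is a bare-bones KNN classifier using Euclidean distance and majority vote; the coordinates are made up so that 3 neighbors give Red and 5 neighbors give Blue, and are not taken from the slide.

```python
import math
from collections import Counter

def knn_predict(training, new_point, k):
    """Predict the majority label among the k training points nearest to new_point."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled points: ((x, y), label).
points = [
    ((1.0, 0.0), "Red"),
    ((0.0, 1.1), "Red"),
    ((-1.2, 0.0), "Blue"),
    ((0.0, -1.5), "Blue"),
    ((1.2, 1.2), "Blue"),
    ((3.0, 3.0), "Red"),
    ((-3.0, 2.0), "Blue"),
]

new_point = (0.0, 0.0)  # the "green" point to be classified
print(knn_predict(points, new_point, k=3))
print(knn_predict(points, new_point, k=5))
```

The choice of k is therefore a tuning parameter that should be settled by validation rather than by default.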
Artificial Neural Networks (ANN)
[Figure: network diagram showing an Input Layer, a Hidden Layer and an Output Layer, each made up of Nodes]
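How values flow from the input layer through the hidden layer to the output layer can be shown with a tiny forward pass. This is a sketch under stated assumptions: two inputs, two hidden nodes, one output node, sigmoid activations, and weights that are invented rather than learned from data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: each node takes a weighted sum of its inputs,
    adds a bias and passes the result through a sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]

# Hypothetical weights; in practice these are estimated from training data.
HIDDEN_WEIGHTS = [[0.5, -0.4], [0.3, 0.8]]  # two hidden nodes, two inputs each
HIDDEN_BIASES = [0.0, -0.1]
OUTPUT_WEIGHTS = [[1.2, -0.7]]              # one output node
OUTPUT_BIASES = [0.05]

def forward(inputs):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, HIDDEN_WEIGHTS, HIDDEN_BIASES)
    return layer(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES)[0]

print(forward([1.0, 2.0]))
```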
Support Vector Machines (SVM)
[Figure: Support Vector Machine diagram labelling the Margin and the Support Vectors]
Random Forests (RF)
Hundreds or Thousands of Trees
Variables and Cases Randomly Selected
Prediction And Explanation
An ideal model will predict new cases and be informative
Compared to Statistical models, Machine Learners are often slightly better at prediction but are usually hard to interpret
Sometimes we can use a Machine Learner and a Statistical model in tandem - if each is adequate, their predictions will be highly correlated
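One simple check on whether two models agree is the Pearson correlation between their predictions on the same cases. The sketch below computes it from scratch; both lists of predicted probabilities are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of predictions."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical predicted purchase probabilities for the same five customers.
machine_learner = [0.10, 0.35, 0.60, 0.85, 0.90]
statistical_model = [0.15, 0.30, 0.55, 0.80, 0.95]

print(round(pearson_r(machine_learner, statistical_model), 3))
```

A high correlation suggests the interpretable model can be used to explain what the less transparent one is picking up.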
Humans Are Not Yet Obsolete
Applied Logistic Regression (Hosmer and Lemeshow):
“[statistical] methods...are not to be used as a substitute, but rather as an addition to clear and careful thought. Successful modeling of a complex data set is part science, part statistical methods, and part experience and common sense.”
Humans Are Not Yet Obsolete
Leo Breiman and Adele Cutler (developers of RF algorithm):
“[random forests] is an example of a tool that is useful in doing analyses of scientific data. But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem. Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem.”
Humans Are Not Yet Obsolete
The resurgence of Bayesian statistics is further evidence that human judgment cannot be purged from analytics
We must also avoid finding a great answer…to the wrong question!
Humans Are Not Yet Obsolete
More analytic options also mean higher risk and greater need for well-trained and experienced researchers
Paradoxically, technology has made it easier to be an incompetent data scientist and harder to be a good one! For example:
“Abuser-friendly” stats software
Beautiful data visualizations…that tell us nothing
Humans Are Not Yet Obsolete
With bigger and messier data, understanding people will become more critical, not less
Demand will rise for data scientists able to see beyond math and programming who truly understand marketing and consumers
The Why
Understanding The Why is critical
Marketing isn’t only about predicting behavior; it’s also about changing behavior
Understanding The Why can help us do this
The Why
Reverse-engineering The Why from past behavior isn’t always easy
Two people can do the same things for the same reasons
Two people can do the same things for different reasons
Two people can do different things for the same reasons
Two people can do different things for different reasons
The Why
There is also the “multiple me”…on different occasions
I can do the same things for the same reasons
I can do the same things for different reasons
I can do different things for the same reasons
I can do different things for different reasons
Also…some of our behavior is random…
“How do I know it will pay off?”
Reasons To Be Skeptical
Data infrastructure may involve considerable direct costs
The necessary skills may not be available in-house
To work well, Predictive Analytics must be a team effort, and time costs can be substantial
Reasons To Be Skeptical
Managers and staff are often already overstretched
“This is the last thing we need…”
It can be seen as a threat, not an opportunity
Reasons To Be Skeptical
Anonymity, confidentiality and privacy are common (and legitimate) concerns
Potential legal issues
Risk of data breaches
Customers can become worried and irritated!
Reasons To Be Skeptical
Still only a handful of universities offer degree programs in Data Science - not as established as Finance or Accounting
Predictive Analytics is not guaranteed to pay back
C-Level buy-in critical!
Big Expectations
On the other hand, some clients think Predictive Analytics is easy
“I have all this data, can you find me something?”
Managing client expectations is critical - there may only be a small, rusty needle in that Big Expensive Haystack!
Some Books
The Data Warehouse Toolkit (Kimball and Ross)
Data Architecture: A Primer for the Data Scientist (Inmon and Linstedt)
Hadoop: The Definitive Guide (White)
Sampling: Design and Analysis (Lohr)
Experimental and Quasi-Experimental Designs (Shadish et al.)
Categorical Data Analysis (Agresti)
Propensity Score Analysis (Guo and Fraser)
Time Series Analysis (Wei)
Data Mining Techniques (Linoff and Berry)
Regression Modeling Strategies (Harrell)
Data Mining (Witten et al.)
Applied Predictive Modeling (Kuhn and Johnson)
An Introduction to Statistical Learning (James et al.)
Elements of Statistical Learning (Hastie et al.)
Data Mining: The Textbook (Aggarwal)
Some Online Resources
KDNuggets - http://www.kdnuggets.com/
Data Science Central - http://www.datasciencecentral.com/
About Data Analysis - https://www.linkedin.com/grp/home?gid=8156839
CRISP-DM - https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
Cannon Gray company library - http://cannongray.com/methods
Key Points To Remember
Predictive Analytics uses existing data to build a model that will accurately predict new data (e.g., customer behavior)
It does not require Big Data, Machine Learning or Real-Time Analytics
Key Points To Remember
It is not a recent development but has become much more sophisticated in the past decade
It is now much more widely accepted - not so exotic (or wacko) anymore
Very quizzical looks used to be routine!
Key Points To Remember
Big Data has not miraculously made data clean and easy to analyze - just the opposite!
Only 10% - 20% of total analyst time spent on modeling
Key Points To Remember
Overfitting frequently a problem - use K-fold validation when possible
Aim for parsimony - as simple as possible but not too simple
Key Points To Remember
Implementation is the graveyard of many good ideas!
After deployment, sales and customer behavior should be tracked
The actual effects must be assessed - e.g., are we just making customers more price sensitive and eroding our brand equity?
Even successful implementations must be refined, modified or discarded with the passage of time
Key Points To Remember
Predictive Analytics is a process that requires a team - technical skill is only one aspect
Marketing researchers can’t do it all but there is often an important role for us
Vital that end users be active team participants
Key Points To Remember
Marketing researchers, marketers, statisticians and computer scientists often use different jargon or the same jargon differently!
Don’t assume - communicate!
Key Points To Remember
Predictive Analytics is not a substitute for marketing research
Synergizes well with “trad” marketing research to provide richer insights into what consumers do and why
Key Points To Remember
Marketing researchers have a natural advantage over others working in this space - we are more than technicians
Some criticisms of MR by “Data Scientists” betray lack of understanding of marketing and research
Science Fiction Is Not Science