Predictive Analytics with Microsoft Big Data
Predictive Analytics with Microsoft Big Data
Val Fontama, PhD
Saptak Sen
DBI-B339
Agenda
Introducing predictive analytics
Microsoft data mining tools
Demos
Data mining on Hadoop.
Data mining in Excel.
Real business problem
Pitfalls
Competing on analytics
What percent of analytic applications will use predictive capabilities in 2014?
a. 10%
b. 30%
c. 50%
d. 67.8%
Source: Gartner Business Intelligence Summit 2012
Why the resurgence in predictive analytics?
More data, more accurate models.
More and cheaper compute power.
New technologies.
Increased awareness and customer demand.
What is predictive analytics?
Data analysis with mathematical techniques from statistics, data mining, and machine learning. Used to uncover hidden patterns that yield competitive advantage.
Diagnostic analysis
What happened and why?
Used for customer segmentation.
Diagnose with clustering or classification techniques.
Predictive analysis
What will happen in the future?
Forecasting and propensity to buy.
Predict with time series, neural networks, regression, etc.
Prescriptive analysis
What is the next best action?
Channel or portfolio optimization.
Linear programming, Monte Carlo simulation, or game theory.
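As a small illustration of the "predict with time series" idea, the sketch below uses simple exponential smoothing to produce a one-step-ahead forecast. This is not a technique the speakers demonstrate; the function name, the smoothing factor, and the sales figures are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
def exponential_smoothing_forecast(series, alpha=0.5):
    """One-step-ahead forecast: the smoothed level after the last observation."""
    if not series:
        raise ValueError("series must not be empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for value in series[1:]:
        # Blend the newest observation with the running level.
        level = alpha * value + (1.0 - alpha) * level
    return level


# Hypothetical units sold per month (not from the presentation).
monthly_sales = [10, 12, 14]
forecast = exponential_smoothing_forecast(monthly_sales)
print(forecast)  # 12.5
```

A larger alpha weights recent observations more heavily; a smaller one gives a smoother, slower-moving forecast.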
Common customer scenarios for predictive analytics
Weather forecasting
Credit scoring
Targeted advertising
Life sciences research
Fraud detection
Predicting disease outbreaks
Social network analysis
Churn analysis
Predictive analytics workflow: generic process
1. Define business problem.
2. Collect and prepare data.
3. Train and test model.
4. Deploy model.
5. Monitor model’s performance.
Predictive analytics workflow example: credit scorecards
1. Business problem
A retail bank uses a credit scorecard every day to issue new loans or monitor performance of existing loans.
To be competitive the bank needs to aggressively acquire new customers, but limit the risk of default.
A credit scorecard is used to maximize profit by accepting the largest pool of customers who will pay their debt.
A credit scorecard is a predictive model used to predict likelihood of default.
2. Data collection and preparation
2a) Identify and acquire data.
Bank’s credit data.
Payment history data from credit bureau.
Demographic data from third party.
2b) Clean, shape, pre-process data.
Deal with outliers and missing values.
Prepare continuous and categorical variables.
Identify and resolve highly correlated variables.
3. Model development
3a) Author or select model.
Use Microsoft or third party solution (e.g., SAS). Or…
Program your own algorithm.
3b) Train and test model.
Model trained on large subset of data and tested on smaller subset.
Logistic regression is de facto standard.
4. Model deployment
4) Scorecard deployment.
Represent the model as a formula.
Implement the model in bank’s scoring tool.
Integrate model outputs into apps for use by bank’s staff or partners.
5. Monitoring
5) Scorecard monitoring.
Continuously monitor scorecard’s performance on new customers.
Retrain the model once it starts underperforming.
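The train-and-test step of the scorecard workflow can be sketched in a few lines of plain Python. This is a minimal illustration, not the bank's or Microsoft's implementation: the applicant data is synthetic, the two features (debt_ratio, late_payments) and every function name are hypothetical, and the model is a hand-rolled logistic regression fitted by batch gradient descent on an 80/20 split.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_applicants(n, seed=0):
    """Synthetic rows: ([debt_ratio, late_payments], defaulted 0/1)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        debt_ratio = rng.random()
        late_payments = rng.randint(0, 5)
        # Assumed relationship, used only to generate labels.
        p_default = sigmoid(-3.0 + 3.0 * debt_ratio + 0.6 * late_payments)
        rows.append(([debt_ratio, late_payments], 1 if rng.random() < p_default else 0))
    return rows


def train_logistic(rows, epochs=200, lr=0.1):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    n_features = len(rows[0][0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in rows:
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return w, b


def predict_default_probability(model, x):
    w, b = model
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


def accuracy(model, rows, threshold=0.5):
    hits = sum((predict_default_probability(model, x) >= threshold) == bool(y)
               for x, y in rows)
    return hits / len(rows)


data = make_synthetic_applicants(1000)
split = int(0.8 * len(data))            # large subset for training
train_rows, test_rows = data[:split], data[split:]
model = train_logistic(train_rows)
print("weights:", model[0], "intercept:", model[1])
print("test accuracy:", round(accuracy(model, test_rows), 3))
```

The fitted weights and intercept are the "formula" that the deployment step would carry into a scoring tool; the held-out accuracy is one of the figures a monitoring step could track on new customers.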
Predictive analysis tools from Microsoft
Data mining tool in SQL Server Analysis Services.
Rich library of data mining algorithms for diagnostic and predictive analytics: clustering, time series, neural nets, etc.
Can be integrated into full data lifecycle from ETL, OLAP cubes, or KPIs on dashboards.
Programmable and extensible through DMX.
Data Mining add-in for Excel.
Data mining with a familiar tool: Microsoft Excel.
Simplicity and ease of use: can build powerful predictive models with no deep data mining skills.
Includes rich toolkit for data pre-processing, clustering, forecasting, basket analysis, etc.
Solving real business problems with Microsoft’s predictive analysis tools: the problem
1. Identify the most likely customers for Product X.
Which customers should we target to sell Product X and why?
2. Win/loss analysis: which customers did we lose that we should have won?
Did we lose any customers who are very similar to those we won?
Solving real business problems with Microsoft’s predictive analysis tools: the solution
1. Identifying the most likely customers for Product X.
Built customer targeting models with decision trees and Naïve Bayes algorithms.
Input data collected from several sources: sales data warehouse, Dun & Bradstreet, country statistics data.
Used data mining tools in SSAS, the Data Mining add-in for Excel, and Data Explorer.
Decision trees and Naïve Bayes offered similar levels of accuracy.
Both models identified most influential variables for targeting customers.
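To make the Naïve Bayes idea concrete, here is a minimal sketch of a categorical Naïve Bayes classifier with add-one (Laplace) smoothing, written in plain Python. It is not the SSAS algorithm used in the talk; the six opportunity records, the field names (partner, competitor, won), and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(records, target):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(r[target] for r in records)
    value_counts = defaultdict(Counter)   # (feature, class) -> value counts
    feature_values = defaultdict(set)     # feature -> distinct values seen
    for r in records:
        for feature, value in r.items():
            if feature == target:
                continue
            value_counts[(feature, r[target])][value] += 1
            feature_values[feature].add(value)
    return class_counts, value_counts, feature_values


def predict_proba(model, record):
    """Posterior probability of each class for one record."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    log_scores = {}
    for cls, n_cls in class_counts.items():
        score = math.log(n_cls / total)   # class prior
        for feature, value in record.items():
            count = value_counts[(feature, cls)][value]
            # Add-one smoothing keeps unseen values from zeroing the product.
            score += math.log((count + 1) / (n_cls + len(feature_values[feature])))
        log_scores[cls] = score
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}


# Hypothetical past opportunities (not data from the presentation).
opportunities = [
    {"partner": "yes", "competitor": "no",  "won": "yes"},
    {"partner": "yes", "competitor": "no",  "won": "yes"},
    {"partner": "yes", "competitor": "yes", "won": "yes"},
    {"partner": "no",  "competitor": "yes", "won": "no"},
    {"partner": "no",  "competitor": "yes", "won": "no"},
    {"partner": "no",  "competitor": "no",  "won": "no"},
]

nb_model = train_naive_bayes(opportunities, target="won")
posterior = predict_proba(nb_model, {"partner": "yes", "competitor": "no"})
print(posterior)
```

For this toy data the smoothed likelihoods work out to 0.5 × 0.8 × 0.6 for a win and 0.5 × 0.2 × 0.4 for a loss, so the posterior probability of winning is 6/7.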
Customer targeting models: decision trees
Highest influencers of sales: partner involvement, competitor, license type, country GDP, education spending, etc.
Figure: decision tree for the customer targeting model. Split variables shown in the tree: partner subsegment, license program category, subsidiary name (Australia or not), vertical name (IT services or not), competitor flags (Competitor 1, Competitor 4, Competitor 7), partner group name (group 1, group 2), partner engagement type (no partner or not), primary workload type (core infrastructure… or not), GNP per cap bucket (medium or not), education public spending (thresholds 6.225 and 7.483), and Population 0 14 (threshold 20.662).
Customer targeting models: results
Both decision trees and Naïve Bayes outperform the random guess model.
Figure: Data mining lift chart for mining structure: V Opportunity V2 Conf. Horizontal axis: overall population %. Vertical axis: target population (yes) %.
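The points behind a lift chart like this one can be computed by ranking customers by model score and counting how many of the actual buyers fall in the top slice of the population. The sketch below shows that calculation in plain Python; the scores and labels are hypothetical, and the function name is an assumption rather than anything from the SSAS tooling.

```python
def cumulative_gains(scores, labels, steps=10):
    """Return (population %, captured target %) points for a lift chart."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    points = [(0.0, 0.0)]
    for step in range(1, steps + 1):
        cutoff = round(len(ranked) * step / steps)
        captured = sum(label for _, label in ranked[:cutoff])
        points.append((100.0 * step / steps, 100.0 * captured / total_positives))
    return points


# Hypothetical model scores and actual outcomes (1 = bought Product X).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

gains = cumulative_gains(scores, labels, steps=5)
for population_pct, target_pct in gains:
    print(population_pct, target_pct)
```

A random guess model would capture buyers in proportion to the population contacted (20% of the population yields 20% of the buyers), so any point above that diagonal indicates lift. In this toy data the top 20% of ranked customers contains 50% of the buyers.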
Solution #2: win/loss analysis
2. Win/loss analysis: which customers did we lose that we should have won?
Built customer segmentation model with clustering algorithm.
Input data collected from several sources: sales data warehouse, Dun & Bradstreet, country statistics data.
Used data mining tools in SSAS, the Data Mining add-in for Excel, and Data Explorer.
Clustering model identified nine customer segments from the data.
Customer segmentation model with clustering
Figure: cluster diagram showing Clusters 1 through 9.
Six of the clusters had high propensity to buy (Clusters 1, 2, 3, 6, 8, and 9).
Model showed segments with similar profile but different purchase outcomes (e.g., Cluster 3 and Cluster 4).
Cluster 3 has a 51% likelihood to buy vs. only 16% for Cluster 4!
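One way to surface "similar profile, different outcome" segments programmatically is to compare every pair of clusters on profile distance and on buy rate. The sketch below assumes the clustering has already been done; only the 51% and 16% buy rates come from the slide, while the profile coordinates, the third cluster's figures, the thresholds, and the function name are hypothetical.

```python
import math
from itertools import combinations

# Each profile is a hypothetical pair of normalized attributes
# (for example, company size score and IT spend score).
# Buy rates for Cluster 3 and Cluster 4 are from the slide; all else is made up.
clusters = {
    "Cluster 3": {"profile": (0.70, 0.40), "buy_rate": 0.51},
    "Cluster 4": {"profile": (0.65, 0.45), "buy_rate": 0.16},
    "Cluster 5": {"profile": (0.10, 0.90), "buy_rate": 0.10},
}


def similar_but_different(clusters, max_distance=0.15, min_rate_gap=0.25):
    """Pairs of clusters that are close in profile but far apart in buy rate."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(clusters.items(), 2):
        distance = math.dist(a["profile"], b["profile"])
        gap = abs(a["buy_rate"] - b["buy_rate"])
        if distance <= max_distance and gap >= min_rate_gap:
            pairs.append((name_a, name_b))
    return pairs


flagged = similar_but_different(clusters)
print(flagged)  # [('Cluster 3', 'Cluster 4')]
```

Pairs flagged this way are candidates for the win/loss question: customers in the low-rate cluster look like customers who bought, so the lost deals there deserve a closer look.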
Customer segmentation model
Understanding why some customers did not buy Product X
It is clear that the presence of partners or competitors influences the customer’s propensity to buy!
Demo #1: Data Mining on HDInsight
Demo #2: Data Mining add-in for Excel
Common pitfalls in predictive analytics
Sample bias
Insufficient sample size.
Unrepresentative data sample.
Over-fitting the model
Your model may perform very well on training data, but poorly on new datasets!
Poor interpretation
Confusing correlation with causality.
Confusing precision and accuracy.
Statistical significance.
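The over-fitting pitfall is easy to reproduce on synthetic data. In the sketch below, a 1-nearest-neighbor model memorizes noisy training labels and therefore scores perfectly on the data it was trained on, while a simple threshold rule is evaluated on the same held-out set for comparison. The data generator, the 20% label noise, and all names are hypothetical choices for illustration.

```python
import random


def make_noisy_data(n, rng, noise=0.2):
    """Synthetic 1-D data: label is 1 when x > 0.5, flipped with probability `noise`."""
    rows = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        rows.append((x, y))
    return rows


def nearest_neighbor_predict(train, x):
    """1-nearest-neighbor: copy the label of the closest training point."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def accuracy(predict, rows):
    return sum(predict(x) == y for x, y in rows) / len(rows)


rng = random.Random(42)
train = make_noisy_data(200, rng)
test = make_noisy_data(200, rng)

# The memorizing model is perfect on data it has already seen...
train_acc = accuracy(lambda x: nearest_neighbor_predict(train, x), train)
# ...but is judged fairly only on data it has not seen.
test_acc = accuracy(lambda x: nearest_neighbor_predict(train, x), test)
# A much simpler rule, evaluated on the same held-out data.
simple_rule_test_acc = accuracy(lambda x: int(x > 0.5), test)

print("1-NN train accuracy:", train_acc)
print("1-NN test accuracy:", test_acc)
print("threshold rule test accuracy:", simple_rule_test_acc)
```

The gap between training and test accuracy is the signal to watch: a model evaluated only on its own training data will always look better than it is.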
Summary
Introduced predictive analytics
Data mining tools from Microsoft
Data mining tools in SQL Server Analysis Services.
Data mining add-ins for Excel.
Solving real business problem
Customer targeting with segmentation and prediction.
Demos and pitfalls
What is predictive analytics?
Analytical approach: Diagnostic analysis
Description: Helps you understand what happened and why.
Uses: Customer segmentation, market basket analysis.
Commonly used techniques: Clustering, classification, neural networks, decision trees or content analysis from statistics, data mining, and machine learning.
Analytical approach: Predictive analysis
Description: Helps you predict what will happen in the future.
Uses: Forecasting, predicting propensity to buy, or risk of default.
Commonly used techniques: Time series analysis, neural networks, decision trees, Monte Carlo simulation, and regression from statistics, data mining, and machine learning.
Analytical approach: Prescriptive analysis
Description: Identifies the best course of action.
Uses: Channel optimization, portfolio optimization, or traffic optimization.
Commonly used techniques: Linear and non-linear programming, Monte Carlo simulation, or game theory from statistics and data mining.
“Software and/or hardware solutions that allow firms to discover, evaluate, optimize, and deploy predictive models by analyzing Big Data sources to improve business performance or mitigate risk.”—Mike Gualtieri, Forrester, 2013
Track Resources
@sqlservermva
Microsoft Virtual Academy
SQL Server Website
Get Certified!
Hands-On Labs
Download Data Explorer
Download Geoflow
Windows Azure
Resources
msdn: Resources for Developers, http://microsoft.com/msdn
Learning: Microsoft Certification & Training Resources, www.microsoft.com/learning
TechNet: Resources for IT Professionals, http://microsoft.com/technet
Sessions on Demand: http://channel9.msdn.com/Events/TechEd
© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.