Predictive Analytics with Microsoft Big Data

Transcript of Predictive Analytics with Microsoft Big Data

Page 1: Predictive Analytics with Microsoft Big Data
Page 2: Predictive Analytics with Microsoft Big Data

Predictive Analytics with Microsoft Big Data
Val Fontama, PhD
Saptak Sen

DBI-B339

Page 3: Predictive Analytics with Microsoft Big Data

Agenda

Introducing predictive analytics

Microsoft data mining tools

Demos
Data mining on Hadoop.
Data mining in Excel.

Real business problem

Pitfalls

Page 4: Predictive Analytics with Microsoft Big Data

Competing on analytics
What percent of analytic applications will use predictive capabilities in 2014?
a. 10%
b. 30%
c. 50%
d. 67.8%
Source: Gartner Business Intelligence Summit 2012

Page 6: Predictive Analytics with Microsoft Big Data

Why the resurgence in predictive analytics?
1. More data, more accurate models.
2. More and cheaper compute power.
3. New technologies.
4. Increased awareness and customer demand.

Page 7: Predictive Analytics with Microsoft Big Data

What is predictive analytics?
Data analysis with mathematical techniques from statistics, data mining, and machine learning. Used to uncover hidden patterns that yield competitive advantage.

Diagnostic analysis
What happened and why?
Used for customer segmentation.
Diagnose with clustering or classification techniques.

Predictive analysis
What will happen in the future?
Forecasting and propensity to buy.
Predict with time series, neural networks, regression, etc.

Prescriptive analysis
What is the next best action?
Channel or portfolio optimization.
Linear programming, Monte Carlo simulation, or game theory.
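As a minimal sketch of the predictive case (forecasting with regression), the following fits a straight-line trend by ordinary least squares and extrapolates one period ahead. The function names and the monthly sales figures are made up for illustration; a real forecast would normally use a proper time series method.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def forecast(intercept, slope, x):
    """Extrapolate the fitted trend to a new x value."""
    return intercept + slope * x


# Hypothetical monthly sales figures (invented for this example).
months = [1, 2, 3, 4, 5, 6]
sales = [102.0, 108.0, 115.0, 119.0, 127.0, 131.0]

intercept, slope = fit_line(months, sales)
next_month = forecast(intercept, slope, 7)
print(f"trend: {slope:.2f} per month, forecast for month 7: {next_month:.1f}")
```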

Page 8: Predictive Analytics with Microsoft Big Data

Common customer scenarios for predictive analytics

Weather forecastingCredit Scoring

Targeted AdvertisingLife sciences research

Fraud detection

Predicting disease outbreaksSocial network analysis

Churn analysis

Page 9: Predictive Analytics with Microsoft Big Data

Predictive analytics workflow: generic process

1. Define business problem.
2. Collect and prepare data.
3. Train and test model.
4. Deploy model.
5. Monitor model’s performance.

Page 10: Predictive Analytics with Microsoft Big Data

Predictive analytics workflow example: credit scorecards

1. Business problem
A retail bank uses a credit scorecard every day to issue new loans or monitor performance of existing loans.
To be competitive, the bank needs to aggressively acquire new customers but limit the risk of default.
A credit scorecard is used to maximize profit by accepting the largest pool of customers who will pay their debt.
A credit scorecard is a predictive model used to predict likelihood of default.

2. Data collection and preparation
2a) Identify and acquire data.
Bank’s credit data.
Payment history data from credit bureau.
Demographic data from third party.
2b) Clean, shape, pre-process data.
Deal with outliers and missing values.
Prepare continuous and categorical variables.
Identify and resolve highly correlated variables.

3. Model development
3a) Author or select model.
Use Microsoft or third party solution (e.g., SAS). Or…
Program your own algorithm.
3b) Train and test model.
Model trained on large subset of data and tested on smaller subset.
Logistic regression is de facto standard.

4. Model deployment
4) Scorecard deployment.
Represent the model as a formula.
Implement the model in bank’s scoring tool.
Integrate model outputs into apps for use by bank’s staff or partners.

5. Monitoring
5) Scorecard monitoring.
Continuously monitor scorecard’s performance on new customers.
Retrain the model once it starts underperforming.
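Steps 3 and 4 of the scorecard workflow can be sketched in a few lines. This is an illustrative toy, not the bank's or Microsoft's method: the applicants are synthetic, the two features (`debt_ratio`, `late_payment_rate`) and the rule that generates defaults are assumptions, and the logistic regression is trained with plain gradient descent rather than a production solver.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_applicants(n, rng):
    """Synthetic applicants: (debt_ratio, late_payment_rate) -> defaulted (0/1)."""
    rows = []
    for _ in range(n):
        debt_ratio = rng.random()
        late_rate = rng.random()
        # Assumed generating rule, invented for this example.
        risk = 3.0 * debt_ratio + 2.0 * late_rate + rng.gauss(0.0, 0.3)
        rows.append(((debt_ratio, late_rate), 1 if risk > 2.5 else 0))
    return rows


def train_logistic(rows, epochs=500, lr=1.0):
    """Full-batch gradient descent on the logistic log-loss."""
    weights = [0.0, 0.0]
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for features, label in rows:
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
            err = p - label
            grad_b += err
            for j, x in enumerate(features):
                grad_w[j] += err * x
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return bias, weights


def predict_default_probability(bias, weights, features):
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


rng = random.Random(42)
data = make_applicants(600, rng)
# Step 3b: train on the larger subset, test on the smaller one.
train_rows, test_rows = data[:420], data[420:]
bias, weights = train_logistic(train_rows)

correct = sum(
    (predict_default_probability(bias, weights, f) >= 0.5) == bool(y)
    for f, y in test_rows
)
test_accuracy = correct / len(test_rows)

# Step 4: represent the model as a formula a scoring tool could implement.
formula = (f"P(default) = sigmoid({bias:.2f} + {weights[0]:.2f}*debt_ratio "
           f"+ {weights[1]:.2f}*late_payment_rate)")
print(formula)
print(f"held-out accuracy: {test_accuracy:.2f}")
```

For step 5, the same accuracy calculation could be rerun periodically on newly observed customers, with a drop below an agreed threshold triggering retraining.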

Page 16: Predictive Analytics with Microsoft Big Data

Predictive analysis tools from Microsoft

Data mining tool in SQL Server Analysis Services.
Rich library of data mining algorithms for diagnostic and predictive analytics: clustering, time series, neural nets, etc.
Can be integrated into full data lifecycle from ETL, OLAP cubes, or KPIs on dashboards.
Programmable and extensible through DMX.

Data Mining add-in for Excel.
Data mining with a familiar tool: Microsoft Excel.
Simplicity and ease of use: can build powerful predictive models with no deep data mining skills.
Includes rich toolkit for data pre-processing, clustering, forecasting, basket analysis, etc.

Page 17: Predictive Analytics with Microsoft Big Data

Solving real business problems with Microsoft’s predictive analysis tools: the problem

1. Identify the most likely customers for Product X.
Which customers should we target to sell Product X and why?

2. Win/loss analysis: which customers did we lose that we should have won?
Did we lose any customers who are very similar to those we won?

Page 18: Predictive Analytics with Microsoft Big Data

Solving real business problems with Microsoft’s predictive analysis tools: the solution

1. Identifying the most likely customers for Product X.

Built customer targeting models with decision trees and Naïve Bayes algorithms.
Input data collected from several sources:
Sales data warehouse.
Dun & Bradstreet.
Country statistics data.
Used data mining tools in SSAS, the Data Mining add-in for Excel, and Data Explorer.
Decision trees and Naïve Bayes offered similar levels of accuracy.
Both models identified most influential variables for targeting customers.
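To make the Naïve Bayes side of this concrete, here is a minimal categorical Naïve Bayes classifier with add-one smoothing. It is a sketch of the general technique, not the SSAS implementation; the attribute names (`partner`, `competitor`) and the eight training rows are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows):
    """rows: list of (feature_dict, label). Returns the counts needed to predict."""
    label_counts = Counter(label for _, label in rows)
    # value_counts[label][feature][value] = number of training rows
    value_counts = defaultdict(lambda: defaultdict(Counter))
    feature_values = defaultdict(set)
    for features, label in rows:
        for name, value in features.items():
            value_counts[label][name][value] += 1
            feature_values[name].add(value)
    return label_counts, value_counts, feature_values


def predict_proba(model, features):
    """Posterior probability of each label, assuming independent features."""
    label_counts, value_counts, feature_values = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = count / total  # class prior
        for name, value in features.items():
            seen = value_counts[label][name][value]
            # Laplace (add-one) smoothing avoids zero probabilities.
            score *= (seen + 1) / (count + len(feature_values[name]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}


# Hypothetical sales opportunities (invented for this example).
rows = [
    ({"partner": "yes", "competitor": "no"}, "won"),
    ({"partner": "yes", "competitor": "no"}, "won"),
    ({"partner": "yes", "competitor": "yes"}, "won"),
    ({"partner": "no", "competitor": "no"}, "won"),
    ({"partner": "no", "competitor": "yes"}, "lost"),
    ({"partner": "no", "competitor": "yes"}, "lost"),
    ({"partner": "yes", "competitor": "yes"}, "lost"),
    ({"partner": "no", "competitor": "no"}, "lost"),
]

model = train_naive_bayes(rows)
probs = predict_proba(model, {"partner": "yes", "competitor": "no"})
print(probs)
```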

Page 19: Predictive Analytics with Microsoft Big Data

Customer targeting models: decision trees

Highest influencers of sales:
Partner involvement, competitor, license type, country GDP, education spending, etc.

[Figure: decision tree diagram. Nodes split on partner subsegment, Competitor 1, Competitor 4, Competitor 7, license program category, subsidiary name (Australia or not), vertical name (IT services or not), partner group name (group 1, group 2), GNP per cap bucket (medium or not), education public spending (thresholds 6.225 and 7.483), primary workload type, partner engagement type (no partner or not), and Population 0 14 (threshold 20.662).]
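Each node in such a tree is a binary test of the form "attribute = value" versus "attribute not = value". The sketch below shows one common way to pick that test: choose the one that minimizes the weighted Gini impurity of the two branches. Gini is an assumption here, swapped in for simplicity; the slides do not say which split score the tool uses, and the six rows are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_binary_split(rows):
    """Return (impurity, attribute, value) for the best 'attribute = value' test."""
    best = None
    n = len(rows)
    candidates = sorted({(a, v) for features, _ in rows for a, v in features.items()})
    for attribute, value in candidates:
        left = [label for features, label in rows if features[attribute] == value]
        right = [label for features, label in rows if features[attribute] != value]
        if not left or not right:
            continue  # a test that sends every row one way is not a split
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or impurity < best[0]:
            best = (impurity, attribute, value)
    return best


# Hypothetical opportunities (invented for this example).
rows = [
    ({"partner_engagement": "reseller", "vertical": "IT services"}, "won"),
    ({"partner_engagement": "reseller", "vertical": "retail"}, "won"),
    ({"partner_engagement": "reseller", "vertical": "retail"}, "won"),
    ({"partner_engagement": "(no partner)", "vertical": "IT services"}, "lost"),
    ({"partner_engagement": "(no partner)", "vertical": "retail"}, "lost"),
    ({"partner_engagement": "(no partner)", "vertical": "IT services"}, "won"),
]

best = best_binary_split(rows)
print(best)
```

Applying the same selection recursively to each branch would grow the full tree.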

Page 20: Predictive Analytics with Microsoft Big Data

Customer targeting models: results
Both decision trees and Naïve Bayes outperform a random-guess model.

[Figure: data mining lift chart for mining structure "V Opportunity V2 Conf". X-axis: overall population %. Y-axis: target population (yes) %.]
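The points on a lift chart of this kind can be computed by ranking customers by model score and counting how many of the actual targets fall in each top slice of the population. The sketch below is one way to do that; the ten scores and labels are hypothetical, and a random-guess model would capture targets in proportion to the population contacted (the diagonal).

```python
def cumulative_gains(scores, labels, steps=10):
    """Return [(population_fraction, fraction_of_targets_captured), ...]."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_targets = sum(labels)
    points = []
    for step in range(1, steps + 1):
        cutoff = round(len(ranked) * step / steps)
        captured = sum(label for _, label in ranked[:cutoff])
        points.append((step / steps, captured / total_targets))
    return points


# Hypothetical model scores and actual outcomes (1 = bought).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

points = cumulative_gains(scores, labels, steps=5)
for population, captured in points:
    # Lift is the ratio of targets captured to what random selection would give.
    print(f"top {population:.0%}: {captured:.0%} of targets, lift {captured / population:.2f}")
```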

Page 21: Predictive Analytics with Microsoft Big Data

Solution #2: win/loss analysis

2. Win/loss analysis: which customers did we lose that we should have won?

Built customer segmentation model with clustering algorithm.
Input data collected from several sources:
Sales data warehouse.
Dun & Bradstreet.
Country statistics data.
Used data mining tools in SSAS, the Data Mining add-in for Excel, and Data Explorer.
Clustering model identified nine customer segments from the data.

Page 22: Predictive Analytics with Microsoft Big Data

Customer segmentation model with clustering

[Figure: cluster diagram showing Clusters 1 through 9.]

Six of the clusters had high propensity to buy (Clusters 1, 2, 3, 6, 8, and 9).
Model showed segments with similar profile but different purchase outcomes (e.g., Cluster 3 and Cluster 4).
Cluster 3 has a 51% likelihood to buy vs. only 16% for Cluster 4!
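The segmentation-then-propensity idea can be sketched with plain k-means followed by a per-cluster purchase rate. This is an illustration only: k-means stands in for whatever clustering algorithm the tool applies, and the six customers, their two numeric features, and the starting centroids are invented.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) on tuples of floats."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignment = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids


def propensity_by_cluster(assignment, bought):
    """Share of customers in each cluster who bought."""
    totals, buyers = {}, {}
    for cluster, flag in zip(assignment, bought):
        totals[cluster] = totals.get(cluster, 0) + 1
        buyers[cluster] = buyers.get(cluster, 0) + flag
    return {cluster: buyers[cluster] / totals[cluster] for cluster in totals}


# Hypothetical customers described by two numeric features, plus a bought flag.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
bought = [1, 1, 0, 0, 0, 1]

assignment, centroids = kmeans(points, centroids=[points[0], points[3]])
propensity = propensity_by_cluster(assignment, bought)
print(assignment, propensity)
```

Comparing clusters whose centroids are close but whose propensities differ is the kind of check that surfaces pairs like Cluster 3 and Cluster 4.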

Page 23: Predictive Analytics with Microsoft Big Data

Customer segmentation model
Understanding why some customers did not buy Product X.
It is clear that the presence of partners or competitors influences the customer’s propensity to buy!

Page 24: Predictive Analytics with Microsoft Big Data

Demo #1: Data Mining on HDInsight

Page 25: Predictive Analytics with Microsoft Big Data

Demo #2: Data Mining add-in for Excel

Page 26: Predictive Analytics with Microsoft Big Data

Common pitfalls in predictive analytics

Sample bias
Insufficient sample size.
Unrepresentative data sample.

Over-fitting the model
Your model may perform very well on training data, but poorly on new datasets!

Poor interpretation
Confusing correlation with causality.
Confusing precision and accuracy.
Statistical significance.
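The over-fitting pitfall is easy to reproduce with a model that simply memorizes its training data. In this sketch a 1-nearest-neighbor classifier is scored on its own training set and then on new data. All values are made up: the assumed true rule is "label is 1 when x is at least 5", and two training labels are deliberately noisy.

```python
def nearest_neighbor_predict(train, x):
    """Predict the label of the closest training point (1-nearest-neighbor)."""
    _, label = min(train, key=lambda row: abs(row[0] - x))
    return label


def accuracy(train, rows):
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in rows)
    return hits / len(rows)


# Hypothetical training data; (3, 1) and (7, 0) are noisy labels.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 0), (8, 1), (9, 1)]
# Hypothetical new data, labeled by the assumed true rule (1 when x >= 5).
new_data = [(1.5, 0), (2.9, 0), (4.4, 0), (5.5, 1), (6.8, 1), (8.6, 1)]

train_accuracy = accuracy(train, train)
new_accuracy = accuracy(train, new_data)
print(f"training accuracy: {train_accuracy:.2f}, new-data accuracy: {new_accuracy:.2f}")
```

The memorizer scores perfectly on the data it was fitted to because it reproduces the noise, which is why the model should be judged on held-out data instead.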

Page 27: Predictive Analytics with Microsoft Big Data

Summary

Introduced predictive analytics

Data mining tools from Microsoft
Data mining tools in SQL Server Analysis Services.
Data mining add-ins for Excel.

Solving a real business problem
Customer targeting with segmentation and prediction.

Demos and pitfalls

Page 28: Predictive Analytics with Microsoft Big Data

What is predictive analytics?

Analytical approach: Diagnostic analysis
Description: Helps you understand what happened and why.
Uses: Customer segmentation, market basket analysis.
Commonly used techniques: Clustering, classification, neural networks, decision trees, or content analysis from statistics, data mining, and machine learning.

Analytical approach: Predictive analysis
Description: Helps you predict what will happen in the future.
Uses: Forecasting, predicting propensity to buy, or risk of default.
Commonly used techniques: Time series analysis, neural networks, decision trees, Monte Carlo simulation, and regression from statistics, data mining, and machine learning.

Analytical approach: Prescriptive analysis
Description: Identifies the best course of action.
Uses: Channel optimization, portfolio optimization, or traffic optimization.
Commonly used techniques: Linear and non-linear programming, Monte Carlo simulation, or game theory from statistics and data mining.

“Software and/or hardware solutions that allow firms to discover, evaluate, optimize, and deploy predictive models by analyzing Big Data sources to improve business performance or mitigate risk.”—Mike Gualtieri, Forrester, 2013

Page 30: Predictive Analytics with Microsoft Big Data

Resources

Sessions on Demand: http://channel9.msdn.com/Events/TechEd
Learning (Microsoft Certification & Training Resources): www.microsoft.com/learning
TechNet (Resources for IT Professionals): http://microsoft.com/technet
msdn (Resources for Developers): http://microsoft.com/msdn

Page 33: Predictive Analytics with Microsoft Big Data

© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.