Egypt Hackathon 2014: Analytics & SPSS Session


Transcript of Egypt Hackathon 2014: Analytics & SPSS Session

Page 1

© 2014 IBM Corporation

Predictive Analytics and Modeling Using IBM SPSS Modeler

26 Nov 2014

Mohamed Baddar, Software Engineering Researcher, IBM Center of Advanced Studies in Cairo

Nesreen Sharaby, Information Management & Analytics Architect, IBM Egypt Global Delivery Center

Page 2

Agenda

Time Contents

3:00 – 3:10 Introduction to Predictive Analytics

3:10 – 3:20 Predictive Analytics Processes

3:20 – 3:30 Introduction to IBM SPSS

3:30 – 3:45 Data Analysis and Preprocessing

3:45 – 4:10 Modeling and Evaluation: Classification

4:10 – 4:20 Modeling and Evaluation: Clustering

4:20 – 4:30 Questions

Page 3

Introduction to Predictive Analytics

Page 4

Introduction to Predictive Analytics: What is Analytics?

Performing statistical and mathematical analysis over large sets of data to discover the patterns, relationships, and trends hidden in the available data.

Analytics can be descriptive or predictive.

Descriptive analytics is used to gain multidimensional insights from historical data, with reporting, scorecards, segmentation, etc.

Predictive analytics uses statistical techniques from modeling, machine learning, and data mining to analyze current and historical facts and make predictions about unknown events.


Page 5

In predictive analytics, we use predictive models to explore patterns found in historical and transactional data to identify risks and opportunities.

Most common applications are: Customer retention, Direct marketing, Credit scoring, Fraud detection, Risk management, etc.

Many techniques are used, e.g. decision trees, regression, clustering, time series, neural networks, etc.

Remember: After building a model, we should answer two main questions: 1) "Does it work?" 2) "How accurate is it?"


(1) http://www.ibm.com/developerworks/library/ba-predictive-analytics1/

Introduction to Predictive Analytics: What is Predictive Analytics?

Page 6

Remember: The ultimate goal of any predictive analytics project is to automate the decision-making process based on a data-driven approach.

Example, churn rate analysis: given the demographic and behavioral data for a customer, will he still be a customer of our company next year (1)?

Some common predictive analytics applications are listed below.

(1) http://www.predictiveanalyticsworld.com/businessapplications.php

#  Business Application     What is Predicted?                               Decision to Take
1  Customer Retention       Rate of defection/churn/attrition                Actions to decrease churn rate for specific customer segments
2  Direct Marketing         Customer response                                Which customers to target with the campaign
3  Credit Scoring           Credit score: will the customer pay his debts?   Which customers to give credit, and how much
4  Product Recommendation   Customer response to the recommended product     What product to recommend for each customer

Introduction to Predictive Analytics: Real-World Applications for Predictive Analytics

Page 7

Predictive Analytics Processes

Page 8

To apply predictive analytics to industry problems, we need a formal process that defines the needed actions, from obtaining the raw data to reaching a decision recommendation.

The standard processes are: KDD, SEMMA, and CRISP-DM.

Predictive Analytics Processes: What Are the Common Steps?

Data Preprocessing

• Data cleaning, missing-data handling, and outlier detection and handling.

Data Analysis

• Plotting and analyzing trends and patterns in the data, which helps identify the best model to use.

Modeling

• Creating statistical and machine learning models to predict some factor given other factors.
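One common cleaning task from the preprocessing step above is filling missing values. A minimal sketch in plain Python of mean imputation (the `AGE` field and the records are hypothetical; in SPSS Modeler this is done through nodes, not code):

```python
from statistics import mean

def fill_missing(records, field):
    """Replace None values of `field` with the mean of the observed values
    (a simple imputation strategy; more robust options exist)."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = mean(observed)
    for r in records:
        if r[field] is None:
            r[field] = fill
    return records

# Hypothetical customer records with a missing AGE value
data = [{"AGE": 30}, {"AGE": None}, {"AGE": 50}]
fill_missing(data, "AGE")
print(data[1]["AGE"])  # 40, the mean of the observed ages
```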

Page 9

Data preprocessing and model evaluation are very important steps, yet they are often omitted by practitioners. Data preprocessing aims to clean the data, fill N/A fields, and remove outliers; without it, modeling can yield poor, inaccurate models. Evaluation is a crucial step after building the model, for two reasons:

– Measure accuracy
• It is not enough to predict something; we also need to say how accurate the prediction is.
• For example, we predict that customer X will keep using the services of company Y, with 75% confidence.
• This is important for decision makers to know to what extent they should rely on the model.

– Measure generalization
• We need the model not only to fit the data in hand, but also to achieve nearly the same accuracy on new data.
• We use several data partitioning schemes to avoid over-fitting (more on that later).

Predictive Analytics Processes: Why Are These Steps Important?

Page 10

The three standard processes are:
– KDD: Knowledge Discovery in Databases
– SEMMA: Sample, Explore, Modify, Model, Assess
– CRISP-DM: CRoss-Industry Standard Process for Data Mining

There is a strong correspondence between the steps of each process.

The steps are performed in an iterative manner. KDD and SEMMA are usually used in academia; CRISP-DM is usually used in industry.

Predictive Analytics Processes: KDD, SEMMA, CRISP-DM

CRISP-DM Process

Page 11

Introduction to IBM SPSS

Page 12

IBM SPSS Modeler is an advanced analytics and data mining workbench.

It is used to build predictive models and conduct other analytic tasks.

It has a visual interface which allows users to leverage statistical and data mining algorithms without programming.

Introduction to IBM SPSS: IBM SPSS Modeler Basics

Page 13

The main element in SPSS Modeler is the node.

Nodes are categorized based on the activities of the CRISP-DM process.

Nodes are combined into a “stream”. It is advisable to put streams in the corresponding CRISP-DM action folder.

Introduction to IBM SPSS: IBM SPSS Modeler Basics (2)

Page 14

Data Analysis and Preprocessing

Page 15

It is crucial to understand the data and perform any processing needed to make it adequate for modeling. The techniques we will use are:

– Data balancing
– Data partitioning
– Anomaly detection

Usually, the data model for each record in a predictive modeling dataset is as follows.


Predictor #1 | Predictor #2 | … | Target #1 | Target #2 | …

Data Analysis and Preprocessing: Overview

Page 16

One problem arises when the distribution of records over the target field (labels) is highly skewed (unbalanced); this can cause a biased model.

Balancing can be done in either of two ways: boosting the number of records with less frequent targets, or reducing the number of records with more frequent targets.

Balancing is done for the training data only, to preserve the distribution of the test data.
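The reduction approach can be sketched in plain Python as downsampling every class to the size of the rarest one (SPSS Modeler provides a Balance node for this; the `CHURN` field and counts below are hypothetical):

```python
import random
from collections import Counter

def downsample(records, label_field, seed=0):
    """Balance a training set by randomly reducing every class
    to the size of the rarest class."""
    rng = random.Random(seed)
    by_label = {}
    for r in records:
        by_label.setdefault(r[label_field], []).append(r)
    n = min(len(rows) for rows in by_label.values())
    balanced = []
    for rows in by_label.values():
        balanced.extend(rng.sample(rows, n))
    return balanced

# Hypothetical skewed training set: 10 churners vs. 90 non-churners
train = [{"CHURN": "Y"}] * 10 + [{"CHURN": "N"}] * 90
balanced = downsample(train, "CHURN")
print(Counter(r["CHURN"] for r in balanced))  # both classes now have 10 records
```

Note this is applied to the training partition only, exactly as the slide states.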

- Distribution for churn field before data balancing -
- Distribution for churn field after data balancing -

Data Analysis and Preprocessing: Data Balancing

Page 17

Modeling and evaluation activities are performed iteratively. Evaluation is performed by applying the model to a test data set and comparing the results to reference data. Usually, the data is partitioned into three partitions:

– Training: used to build the model. This partition affects the modeling process directly.

– Test: used to fine-tune the modeling parameters by iteratively building and evaluating the model. It affects the modeling process indirectly.

– Validation: used to test the generalization of the model. This partition does not affect the modeling process.

Partitioning is based on a random number generator, to avoid any bias from the order in which the data is fed into the system.
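The three-way split described above can be sketched as follows (SPSS Modeler's Partition node does this; the 60/20/20 weights and the seed are illustrative choices, not from the slides):

```python
import random

def partition(records, weights=(0.6, 0.2, 0.2), seed=42):
    """Randomly split records into training / test / validation partitions,
    using a seeded RNG so the split is reproducible."""
    rng = random.Random(seed)
    rows = list(records)
    rng.shuffle(rows)  # removes any bias from the original record order
    n = len(rows)
    n_train = int(weights[0] * n)
    n_test = int(weights[1] * n)
    return (rows[:n_train],
            rows[n_train:n_train + n_test],
            rows[n_train + n_test:])

train, test, validation = partition(list(range(100)))
print(len(train), len(test), len(validation))  # 60 20 20
```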

Data Analysis and Preprocessing: Data Partitioning

Page 18

In typical cases, we apply predictive modeling to large datasets.

Some records have field values that place them far from the common distribution of most records in the dataset.

Such records are called anomalies; working with them requires two main activities: detection and handling.

For example, in churn data the long-distance-calls field should follow a common distribution across records. If some users have a VERY HIGH or VERY LOW value in this field, how should we handle them?

Data Analysis and Preprocessing: Anomaly Detection and Handling

Page 19

Technically, including these anomalies with the normal data in the same model will distort the model. Detection of anomalies is based on clustering (more on that later): we cluster the data based on specific fields and mark the records far from the cluster centers as anomalies. From a business point of view, we can handle anomalies in two ways:

– Ignoring them: consider these records irrelevant.
– Separate modeling: consider these records a special business case; for example, consider customers with VERY HIGH long-distance calls as golden customers.
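The distance-from-center idea can be sketched for the single-cluster case: take a robust cluster center and flag records whose distance from it is unusually large. SPSS Modeler's Anomaly node clusters first and scores distance per cluster; the median center, the 1.5-sigma cutoff, and the usage values below are illustrative assumptions:

```python
from math import dist
from statistics import median, mean, pstdev

def flag_anomalies(points, k_sigma=1.5):
    """Mark records whose distance from the (median) cluster center exceeds
    mean + k_sigma * std of all distances. A one-cluster simplification of
    the cluster-based detection described in the slides."""
    center = [median(axis) for axis in zip(*points)]
    distances = [dist(p, center) for p in points]
    cutoff = mean(distances) + k_sigma * pstdev(distances)
    return [d > cutoff for d in distances]

# Hypothetical (LOCAL, LONGDIST) usage; the last record has a VERY HIGH
# long-distance value, as in the churn example above
usage = [(10, 5), (12, 6), (11, 4), (9, 5), (10, 500)]
print(flag_anomalies(usage))  # [False, False, False, False, True]
```

A median center is used instead of the mean so that the extreme record does not drag the center toward itself.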

- Anomaly detection settings -
- Anomaly detection results -

Data Analysis and Preprocessing: Anomaly Detection and Handling (2)

Page 20

Modeling and Evaluation

Page 21

After preprocessing the data, the next step is to build a predictive model. Usually, we divide the fields of the input records into predictors (X) and targets (Y). Predictive analytics problems usually fall into three categories:

– Classification: given a set of predictors, we want to estimate the target values, i.e. estimate Y = F(X), where F is the classification function.

– Clustering: given a set of predictors, we need to find the densest grouping of records based on the predictors' values.

– Association: given the set of predictors and targets, we want to find an association (a two-way function) between predictors and targets, i.e. Y = F(X) and X = G(Y).

Modeling and Evaluation: Predictive Model Techniques

Page 22

Choosing the modeling technique is controlled by many factors:

– Business problem type
• We want to predict churn rate from customer behavioral data -> Classification/Prediction
• We want to group similar customers to analyze / act on each group -> Clustering for customer segmentation
• We want to predict future sales given the historical sales data -> Time Series Forecasting

– Data type
• Continuous predictors and continuous targets: Regression / Neural Networks
• Continuous predictors and nominal targets: Logistic Regression, Decision Trees, SVMs

Modeling and Evaluation: Predictive Model Techniques (2)

Page 23

In this session, we will focus on classification and clustering. For classification, we will focus on:

– Regression / Logistic Regression
– Decision Trees

For clustering, we will focus on:
– K-Means

While illustrating each modeling technique, we will show how to assess the accuracy and generalization of the model.

The evaluation techniques we will use are:
– Data Partitioning
– Cross-Validation

Modeling and Evaluation: Predictive Model Techniques (3)

Page 24

The goal of regression is to fit a (linear/non-linear) function between predictors and targets. There are many variants of regression, based on the estimated function and the targets:

– If we have a single target, we call it univariate regression; if we have multiple targets, we call it multivariate regression.
– The estimated function can be linear or non-linear.
– If the targets are continuous, we call it "regression"; if we have nominal or ordinal targets, we call it "logistic regression".

We will focus on classical univariate linear regression. The linear regression equation is:

Y = A + B1*X1 + B2*X2 + … + Bk*Xk + e

Where:
Y is the target
Xi are the predictors
k is the number of predictors
A and Bi are the estimated model parameters
e is the error/residual term, following a normal distribution
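For the single-predictor case (k = 1), A and B have a closed-form least-squares solution; a minimal sketch with hypothetical data (SPSS Modeler's Regression node estimates the same quantities):

```python
from statistics import mean

def fit_linear(xs, ys):
    """Ordinary least squares for Y = A + B*X (one predictor, k = 1):
    B = cov(X, Y) / var(X), A = mean(Y) - B * mean(X)."""
    xbar, ybar = mean(xs), mean(ys)
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    return a, b

# Noise-free data generated from Y = 1 + 2*X, so the fit recovers A=1, B=2
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
a, b = fit_linear(xs, ys)
print(a, b)  # 1.0 2.0
```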

Modeling and Evaluation: Regression

Page 25

Given a dataset of patients with healthcare claims, with the following schema:

– ASG: severity level
– AGE: patient age
– LOS: length of stay at the hospital
– CLAIM: amount of the claim in dollars

What we need is to detect patients who over-claim the hospital cost.

Modeling and Evaluation: Regression – Example for Fraud Detection

Page 26

Assuming a linear relationship, we model CLAIM as a linear function of AGE, LOS, and ASG. The next step is to compare the estimated claim value against the real one; if the estimate is far less than the real value, we report this record as "FRAUD".
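The comparison step can be sketched as a residual check: flag records whose actual claim exceeds the regression estimate by an unusually large margin. The claim amounts and the 1.5-sigma cutoff below are illustrative assumptions, not values from the slides:

```python
from statistics import mean, pstdev

def flag_fraud(actual, predicted, k_sigma=1.5):
    """Flag claims whose actual amount exceeds the model's estimate by more
    than k_sigma standard deviations of the residuals."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    cutoff = mean(residuals) + k_sigma * pstdev(residuals)
    return ["FRAUD" if r > cutoff else "OK" for r in residuals]

# Hypothetical CLAIM amounts vs. regression estimates; the last claim is
# far above what AGE, LOS, and ASG would predict
actual = [1000, 1200, 900, 1100, 5000]
predicted = [980, 1150, 950, 1080, 1100]
print(flag_fraud(actual, predicted))  # ['OK', 'OK', 'OK', 'OK', 'FRAUD']
```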

SPSS Stream for linear regression

Modeling and Evaluation: Regression – Example for Fraud Detection (2)

Page 27

Model Analysis:

- Estimated model coefficients -
- Model evaluation -

Fraud Detection:

- Output of regression -
- Visual detection of frauds -

Modeling and Evaluation: Regression – Example for Fraud Detection (3)

Page 28

Decision trees are among the most important techniques for classification. In general, all decision tree methods share common concepts:

– Scan all predictors.

– Determine the most important predictor, i.e. the one that best splits the records based on the target value.

– After splitting, remove this predictor from the list and repeat for the subsequent predictors.
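The "most important predictor" in the split step is typically chosen by information gain, the entropy-based criterion behind methods such as C5.0. A toy sketch with hypothetical churn fields (in this made-up data, `CAR_OWNER` perfectly predicts `CHURN` while `SEX` is uninformative):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(records, predictor, target):
    """How much knowing `predictor` reduces uncertainty about `target`:
    base entropy minus the weighted entropy of each split subset."""
    base = entropy([r[target] for r in records])
    split = 0.0
    for value in {r[predictor] for r in records}:
        subset = [r[target] for r in records if r[predictor] == value]
        split += len(subset) / len(records) * entropy(subset)
    return base - split

data = [
    {"SEX": "M", "CAR_OWNER": "Y", "CHURN": "N"},
    {"SEX": "F", "CAR_OWNER": "Y", "CHURN": "N"},
    {"SEX": "M", "CAR_OWNER": "N", "CHURN": "Y"},
    {"SEX": "F", "CAR_OWNER": "N", "CHURN": "Y"},
]
print(information_gain(data, "CAR_OWNER", "CHURN"))  # 1.0 (perfect split)
print(information_gain(data, "SEX", "CHURN"))        # 0.0 (no information)
```

The tree picks the predictor with the highest gain at each node, which is also why predictor importance falls out of tree building for free.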

Example of a decision tree for a weekend activity decision (1)

(1) http://www.doc.ic.ac.uk/~sgc/teaching/pre2012/v231/lecture11.html

Modeling and Evaluation: Decision Trees

Page 29

Advantages of decision trees:
– One of their outputs is the importance of each predictor, which has great business value.
– They are self-explanatory and can be easily understood by business stakeholders.

SPSS contains four methods for Decision Trees, as in the table on the left. (1)

(1) Predictive Modeling with IBM SPSS Modeler Student Guide Course Code: 0A032

Modeling and Evaluation: Decision Trees (2)

Page 30

Given a churn dataset with the following schema.

The problem is to predict the churn value from customer data. The data contains:

– Behavioral data: LONGDIST, International, LOCAL
– Demographic data: SEX, AGE, EST_Income, Car_Owner

As the predictors are a mix of nominal and continuous fields, and the target is ordinal, we use C5.0 trees.

Stream for building a C5.0 model with an analysis node

Modeling and Evaluation: Decision Trees – Churn Analysis with C5.0 Trees

Page 31

- Model Accuracy Analysis with partitioning -

- Generated decision tree -
- Model accuracy analysis without partitioning -

Modeling and Evaluation: Decision Trees – Browsing and Analyzing Accuracy

Page 32

Clustering (cluster analysis) is used to group records with similar feature values. K-Means is one of many clustering techniques. Other clustering techniques are:

– Kohonen Networks
– Two-Step approach

We select the number of clusters (k) to be created, and the model selects k well-spaced data records as starting cluster centers. Each data record is then assigned to the nearest of the k clusters, and the cluster centers (the means of the fields used in the clustering) are updated to accommodate the new members.

Two main issues should be considered when using clustering:
– Distance measure
– Number of clusters

Non-numeric data should be converted to numeric using some mapping scheme.
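The assign-then-update loop described above can be sketched in plain Python. For reproducibility this sketch seeds the centers with the first k records, whereas SPSS Modeler picks k well-spaced ones; the usage data is hypothetical:

```python
from math import dist
from statistics import mean

def kmeans(points, k, iterations=20):
    """Plain k-means: repeatedly assign each point to its nearest center,
    then recompute each center as the mean of its members."""
    centers = list(points[:k])  # simplified deterministic initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [tuple(mean(axis) for axis in zip(*cluster)) if cluster
                   else centers[i] for i, cluster in enumerate(clusters)]
    return centers, clusters

# Two well-separated groups of hypothetical (LOCAL, LONGDIST) usage
points = [(1, 1), (2, 1), (1, 2), (10, 10), (11, 10), (10, 11)]
centers, clusters = kmeans(points, 2)
print(sorted(centers))
```

The loop converges here in a few iterations to one center near each group, illustrating both issues listed above: the result depends on the distance measure (Euclidean here) and on the chosen k.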

Modeling and Evaluation: K-Means

Page 33

Problem: we need to segment telecom customers based on usage data, as follows:

– Local calls
– Long-distance calls
– International calls

Since we have a large data set, we will focus on reducing the space into smaller sub-spaces, and we will use K-Means. K-Means is sensitive to outliers and noise, and is thus not recommended for small data sets.

SPSS telecom data clustering stream

Modeling and Evaluation: K-Means – Example for Customer Segmentation

Page 34

- K-means model summary -
- K-means clustering results -

Modeling and Evaluation: K-Means – Example for Customer Segmentation (2)

Page 35

Thank You!

Questions?