Egypt Hackathon 2014: Analytics & SPSS Session


Transcript of Egypt Hackathon 2014: Analytics & SPSS Session

Page 1

© 2014 IBM Corporation

Predictive Analytics and Modeling Using IBM SPSS Modeler

26 Nov 2014

Mohamed Baddar, Software Engineering Researcher, IBM Center of Advanced Studies in Cairo

Nesreen Sharaby, Information Management & Analytics Architect, IBM Egypt Global Delivery Center

Page 2

Agenda

Time Contents

3:00 – 3:10 Introduction to Predictive Analytics

3:10 – 3:20 Predictive Analytics Processes

3:20 – 3:30 Introduction to IBM SPSS

3:30 – 3:45 Data Analysis and Preprocessing

3:45 – 4:10 Modeling and Evaluation: Classification

4:10 – 4:20 Modeling and Evaluation: Clustering

4:20 – 4:30 Questions

Page 3

Introduction to Predictive Analytics

Page 4

Introduction to Predictive Analytics: What is Analytics?

Performing statistical and mathematical analysis over large sets of data to discover the patterns, relationships, and trends hidden in the available data.

Analytics can be descriptive or predictive.

Descriptive analytics is used to gain multidimensional insights from historical data, with reporting, scorecards, segmentation, etc.

Predictive analytics uses statistical techniques from modeling, machine learning, and data mining to analyze current and historical facts and make predictions about unknown events.


Page 5

In predictive analytics, we use predictive models to explore patterns found in historical and transactional data to identify risks and opportunities.

Most common applications are: Customer retention, Direct marketing, Credit scoring, Fraud detection, Risk management, etc.

Many techniques are used, e.g. decision trees, regression, clustering, time series, neural networks, etc.

Remember: After building a model, we should answer two main questions: 1) "Does it work?" 2) "How accurate is it?"


(1) http://www.ibm.com/developerworks/library/ba-predictive-analytics1/

Introduction to Predictive Analytics: What is Predictive Analytics?

Page 6

Remember: The ultimate goal of any predictive analytics project is to automate the decision-making process based on a data-driven approach.

Example, churn rate analysis: given the demographic and behavioral data for a customer, will he still be a customer of our company next year (1)?

Some common predictive analytics applications are listed below.

(1) http://www.predictiveanalyticsworld.com/businessapplications.php

#  Business Application     What is Predicted?                               Decision to Take
1  Customer Retention       Rate of defection/churn/attrition                Actions to decrease churn rate for specific customer segments
2  Direct Marketing         Customer response                                Which customers to target with the campaign
3  Credit Scoring           Credit score: will the customer pay his debts?   Which customers to give credit, and how much
4  Product Recommendation   Customer response to the recommended product     What product to recommend for each customer

Introduction to Predictive Analytics: Real-World Applications for Predictive Analytics

Page 7

Predictive Analytics Processes

Page 8

To apply predictive analytics to industry problems, we need a formal process that defines the needed actions, from obtaining the raw data to reaching a decision recommendation.

The standard processes are: KDD, SEMMA, and CRISP-DM.

Predictive Analytics Processes: What Are the Common Steps?

Data Preprocessing

• Data cleaning, missing-data handling, and outlier detection and handling.

Data Analysis

• Plotting and analyzing trends and patterns in the data, which helps identify the best model to use.

Modeling

• Creating statistical and machine learning models to predict some factor given other factors.
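One common cleaning task from the preprocessing step above is filling missing values. A minimal sketch in plain Python of mean imputation (the `AGE` field and the records are hypothetical; in SPSS Modeler this is done through nodes, not code):

```python
from statistics import mean

def fill_missing(records, field):
    """Replace None values of `field` with the mean of the observed values
    (a simple imputation strategy; more robust options exist)."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = mean(observed)
    for r in records:
        if r[field] is None:
            r[field] = fill
    return records

# Hypothetical customer records with a missing AGE value
data = [{"AGE": 30}, {"AGE": None}, {"AGE": 50}]
fill_missing(data, "AGE")
print(data[1]["AGE"])  # 40, the mean of the observed ages
```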

Page 9

Data preprocessing and model evaluation are very important steps, yet they are often omitted by practitioners. Data preprocessing aims to clean the data, fill N/A fields, and remove outliers; without it, modeling can yield poor, inaccurate models. Evaluation is a crucial step after building the model, for two reasons:

– Measure accuracy
• It is not enough to predict something; we also need to say how accurate the prediction is.
• For example, we predict that customer X will keep using the services of company Y, with 75% confidence.
• This is important for decision makers to know to what extent they should rely on the model.

– Measure generalization
• We need the model not only to fit the data in hand, but also to achieve nearly the same accuracy on new data.
• We use several data partitioning schemes to avoid over-fitting (more on that later).

Predictive Analytics Processes: Why Are These Steps Important?

Page 10

The three standard processes are:
– KDD: Knowledge Discovery in Databases
– SEMMA: Sample, Explore, Modify, Model, Assess
– CRISP-DM: CRoss-Industry Standard Process for Data Mining

There is a strong correspondence between the steps of each process.

The steps are performed in an iterative manner. KDD and SEMMA are usually used in academia; CRISP-DM is usually used in industry.

Predictive Analytics Processes: KDD, SEMMA, CRISP-DM

CRISP-DM Process

Page 11

Introduction to IBM SPSS

Page 12

IBM SPSS Modeler is an advanced analytics and data mining workbench.

It is used to build predictive models and conduct other analytic tasks.

It has a visual interface which allows users to leverage statistical and data mining algorithms without programming.

Introduction to IBM SPSS: IBM SPSS Modeler Basics

Page 13

The main element in SPSS Modeler is the node.

Nodes are categorized based on the activities of the CRISP-DM process.

Nodes are combined into a “stream”. It is advisable to put streams in the corresponding CRISP-DM action folder.

Introduction to IBM SPSS: IBM SPSS Modeler Basics (2)

Page 14

Data Analysis and Preprocessing

Page 15

It is crucial to understand the data and perform any processing needed to make it adequate for modeling. The techniques we will use are:

– Data balancing
– Data partitioning
– Anomaly detection

Usually, the data model for each record in a predictive modeling dataset is as follows.


Predictor #1 | Predictor #2 | … | Target #1 | Target #2 | …

Data Analysis and Preprocessing: Overview

Page 16

One problem arises when the distribution of records over the target field (labels) is highly skewed (unbalanced); this can cause a biased model.

Balancing can be done in either of two ways: boosting the number of records with less frequent targets, or reducing the number of records with more frequent targets.

Balancing is done for the training data only, to preserve the distribution of the test data.
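The reduction approach can be sketched in plain Python as downsampling every class to the size of the rarest one (SPSS Modeler provides a Balance node for this; the `CHURN` field and counts below are hypothetical):

```python
import random
from collections import Counter

def downsample(records, label_field, seed=0):
    """Balance a training set by randomly reducing every class
    to the size of the rarest class."""
    rng = random.Random(seed)
    by_label = {}
    for r in records:
        by_label.setdefault(r[label_field], []).append(r)
    n = min(len(rows) for rows in by_label.values())
    balanced = []
    for rows in by_label.values():
        balanced.extend(rng.sample(rows, n))
    return balanced

# Hypothetical skewed training set: 10 churners vs. 90 non-churners
train = [{"CHURN": "Y"}] * 10 + [{"CHURN": "N"}] * 90
balanced = downsample(train, "CHURN")
print(Counter(r["CHURN"] for r in balanced))  # both classes now have 10 records
```

Note this is applied to the training partition only, exactly as the slide states.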

- Distribution for churn field before data balancing -
- Distribution for churn field after data balancing -

Data Analysis and Preprocessing: Data Balancing

Page 17

Modeling and evaluation activities are performed iteratively. Evaluation is performed by applying the model to a test data set and comparing the results to reference data. Usually, the data is partitioned into three partitions:

– Training: used to build the model. This partition affects the modeling process directly.

– Test: used to fine-tune the modeling parameters by iteratively building and evaluating the model. It affects the modeling process indirectly.

– Validation: used to test the generalization of the model. This partition does not affect the modeling process.

Partitioning is based on a random number generator, to avoid any bias from the order in which the data is fed into the system.
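The three-way split described above can be sketched as follows (SPSS Modeler's Partition node does this; the 60/20/20 weights and the seed are illustrative choices, not from the slides):

```python
import random

def partition(records, weights=(0.6, 0.2, 0.2), seed=42):
    """Randomly split records into training / test / validation partitions,
    using a seeded RNG so the split is reproducible."""
    rng = random.Random(seed)
    rows = list(records)
    rng.shuffle(rows)  # removes any bias from the original record order
    n = len(rows)
    n_train = int(weights[0] * n)
    n_test = int(weights[1] * n)
    return (rows[:n_train],
            rows[n_train:n_train + n_test],
            rows[n_train + n_test:])

train, test, validation = partition(list(range(100)))
print(len(train), len(test), len(validation))  # 60 20 20
```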

Data Analysis and Preprocessing: Data Partitioning

Page 18

In typical cases, we apply predictive modeling to large datasets.

Some records have field values that place them far from the common distribution of most records in the dataset.

Such records are called anomalies; working with them requires two main activities: detection and handling.

For example, in churn data the long-distance-calls field should follow a common distribution across records. If some users have a VERY HIGH or VERY LOW value in this field, how should we handle them?

Data Analysis and Preprocessing: Anomaly Detection and Handling

Page 19

Technically, including these anomalies with the normal data in the same model will distort the model. Detection of anomalies is based on clustering (more on that later): we cluster the data based on specific fields and mark the records far from the cluster centers as anomalies. From a business point of view, we can handle anomalies in two ways:

– Ignoring them: consider these records irrelevant.
– Separate modeling: consider these records a special business case; for example, consider customers with VERY HIGH long-distance calls as golden customers.
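The distance-from-center idea can be sketched for the single-cluster case: take a robust cluster center and flag records whose distance from it is unusually large. SPSS Modeler's Anomaly node clusters first and scores distance per cluster; the median center, the 1.5-sigma cutoff, and the usage values below are illustrative assumptions:

```python
from math import dist
from statistics import median, mean, pstdev

def flag_anomalies(points, k_sigma=1.5):
    """Mark records whose distance from the (median) cluster center exceeds
    mean + k_sigma * std of all distances. A one-cluster simplification of
    the cluster-based detection described in the slides."""
    center = [median(axis) for axis in zip(*points)]
    distances = [dist(p, center) for p in points]
    cutoff = mean(distances) + k_sigma * pstdev(distances)
    return [d > cutoff for d in distances]

# Hypothetical (LOCAL, LONGDIST) usage; the last record has a VERY HIGH
# long-distance value, as in the churn example above
usage = [(10, 5), (12, 6), (11, 4), (9, 5), (10, 500)]
print(flag_anomalies(usage))  # [False, False, False, False, True]
```

A median center is used instead of the mean so that the extreme record does not drag the center toward itself.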

- Anomaly detection settings -
- Anomaly detection results -

Data Analysis and Preprocessing: Anomaly Detection and Handling (2)

Page 20

Modeling and Evaluation

Page 21

After preprocessing the data, the next step is to build a predictive model. Usually, we divide the fields of the input records into predictors (X) and targets (Y). Predictive analytics problems usually fall into three categories:

– Classification: given a set of predictors, we want to estimate the target values, i.e. estimate Y = F(X), where F is the classification function.

– Clustering: given a set of predictors, we need to find the densest grouping of records based on the predictors' values.

– Association: given the set of predictors and targets, we want to find an association (a two-way function) between predictors and targets, i.e. Y = F(X) and X = G(Y).

Modeling and Evaluation: Predictive Model Techniques

Page 22

Choosing the modeling technique is controlled by many factors:

– Business problem type
• We want to predict churn rate from customer behavioral data -> Classification/Prediction
• We want to group similar customers to analyze / act on each group -> Clustering for customer segmentation
• We want to predict future sales given the historical sales data -> Time Series Forecasting

– Data type
• Continuous predictors and continuous targets: Regression / Neural Networks
• Continuous predictors and nominal targets: Logistic Regression, Decision Trees, SVMs

Modeling and Evaluation: Predictive Model Techniques (2)

Page 23

In this session, we will focus on classification and clustering. For classification, we will focus on:

– Regression / Logistic Regression
– Decision Trees

For clustering, we will focus on:
– K-Means

While illustrating each modeling technique, we will show how to assess the accuracy and generalization of the model.

The evaluation techniques we will use are:
– Data Partitioning
– Cross-Validation

Modeling and Evaluation: Predictive Model Techniques (3)

Page 24

The goal of regression is to fit a (linear/non-linear) function between predictors and targets. There are many variants of regression, based on the estimated function and the targets:

– If we have a single target, we call it univariate regression; if we have multiple targets, we call it multivariate regression.
– The estimated function can be linear or non-linear.
– If the targets are continuous, we call it "regression"; if we have nominal or ordinal targets, we call it "logistic regression".

We will focus on classical univariate linear regression. The linear regression equation is:

Y = A + B1*X1 + B2*X2 + … + Bk*Xk + e

Where:
Y is the target
Xi are the predictors
k is the number of predictors
A and Bi are the estimated model parameters
e is the error/residual term, following a normal distribution
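For the single-predictor case (k = 1), A and B have a closed-form least-squares solution; a minimal sketch with hypothetical data (SPSS Modeler's Regression node estimates the same quantities):

```python
from statistics import mean

def fit_linear(xs, ys):
    """Ordinary least squares for Y = A + B*X (one predictor, k = 1):
    B = cov(X, Y) / var(X), A = mean(Y) - B * mean(X)."""
    xbar, ybar = mean(xs), mean(ys)
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    return a, b

# Noise-free data generated from Y = 1 + 2*X, so the fit recovers A=1, B=2
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
a, b = fit_linear(xs, ys)
print(a, b)  # 1.0 2.0
```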

Modeling and Evaluation: Regression

Page 25

Given a dataset of patients with healthcare claims, with the following schema:

– ASG: severity level
– AGE: patient age
– LOS: length of stay at the hospital
– CLAIM: amount of the claim in dollars

What we need is to detect patients who over-claim the hospital cost.

Modeling and Evaluation: Regression – Example for Fraud Detection

Page 26

Assuming a linear relationship, we model CLAIM as a linear function of AGE, LOS, and ASG. The next step is to compare the estimated claim value against the real one; if the estimate is far less than the real value, we report this record as "FRAUD".
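The comparison step can be sketched as a residual check: flag records whose actual claim exceeds the regression estimate by an unusually large margin. The claim amounts and the 1.5-sigma cutoff below are illustrative assumptions, not values from the slides:

```python
from statistics import mean, pstdev

def flag_fraud(actual, predicted, k_sigma=1.5):
    """Flag claims whose actual amount exceeds the model's estimate by more
    than k_sigma standard deviations of the residuals."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    cutoff = mean(residuals) + k_sigma * pstdev(residuals)
    return ["FRAUD" if r > cutoff else "OK" for r in residuals]

# Hypothetical CLAIM amounts vs. regression estimates; the last claim is
# far above what AGE, LOS, and ASG would predict
actual = [1000, 1200, 900, 1100, 5000]
predicted = [980, 1150, 950, 1080, 1100]
print(flag_fraud(actual, predicted))  # ['OK', 'OK', 'OK', 'OK', 'FRAUD']
```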

SPSS Stream for linear regression

Modeling and Evaluation: Regression – Example for Fraud Detection (2)

Page 27

Model Analysis:

- Estimated model coefficients -
- Model evaluation -

Fraud Detection:

- Output of regression -
- Visual detection of frauds -

Modeling and Evaluation: Regression – Example for Fraud Detection (3)

Page 28

Decision trees are among the most important techniques for classification. In general, all decision tree methods share common concepts:

– Scan all predictors.

– Determine the most important predictor, i.e. the one that best splits the records based on the target value.

– After splitting, remove this predictor from the list and repeat for the subsequent predictors.
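The "most important predictor" in the split step is typically chosen by information gain, the entropy-based criterion behind methods such as C5.0. A toy sketch with hypothetical churn fields (in this made-up data, `CAR_OWNER` perfectly predicts `CHURN` while `SEX` is uninformative):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(records, predictor, target):
    """How much knowing `predictor` reduces uncertainty about `target`:
    base entropy minus the weighted entropy of each split subset."""
    base = entropy([r[target] for r in records])
    split = 0.0
    for value in {r[predictor] for r in records}:
        subset = [r[target] for r in records if r[predictor] == value]
        split += len(subset) / len(records) * entropy(subset)
    return base - split

data = [
    {"SEX": "M", "CAR_OWNER": "Y", "CHURN": "N"},
    {"SEX": "F", "CAR_OWNER": "Y", "CHURN": "N"},
    {"SEX": "M", "CAR_OWNER": "N", "CHURN": "Y"},
    {"SEX": "F", "CAR_OWNER": "N", "CHURN": "Y"},
]
print(information_gain(data, "CAR_OWNER", "CHURN"))  # 1.0 (perfect split)
print(information_gain(data, "SEX", "CHURN"))        # 0.0 (no information)
```

The tree picks the predictor with the highest gain at each node, which is also why predictor importance falls out of tree building for free.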

Example of a decision tree for a weekend activity decision (1)

(1) http://www.doc.ic.ac.uk/~sgc/teaching/pre2012/v231/lecture11.html

Modeling and Evaluation: Decision Trees

Page 29

Advantages of decision trees:
– One of their outputs is the importance of each predictor, which has great business value.
– They are self-explanatory and can be easily understood by business stakeholders.

SPSS contains four methods for Decision Trees, as in the table on the left. (1)

(1) Predictive Modeling with IBM SPSS Modeler Student Guide Course Code: 0A032

Modeling and Evaluation: Decision Trees (2)

Page 30

Given a churn dataset with the following schema.

The problem is to predict the churn value from customer data. The data contains:

– Behavioral data: LONGDIST, International, LOCAL
– Demographic data: SEX, AGE, EST_Income, Car_Owner

As the predictors are a mix of nominal and continuous fields, and the target is ordinal, we use C5.0 trees.

Stream for building a C5.0 model with an analysis node

Modeling and Evaluation: Decision Trees – Churn Analysis with C5.0 Trees

Page 31

- Model Accuracy Analysis with partitioning -

- Generated decision tree -
- Model accuracy analysis without partitioning -

Modeling and Evaluation: Decision Trees – Browsing and Analyzing Accuracy

Page 32

Clustering (cluster analysis) is used to group records with similar feature values. K-Means is one of many clustering techniques. Other clustering techniques are:

– Kohonen Networks
– Two-Step approach

We select the number of clusters (k) to be created, and the model selects k well-spaced data records as starting cluster centers. Each data record is then assigned to the nearest of the k clusters, and the cluster centers (the means of the fields used in the clustering) are updated to accommodate the new members.

Two main issues should be considered when using clustering:
– Distance measure
– Number of clusters

Non-numeric data should be converted to numeric using some mapping scheme.
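The assign-then-update loop described above can be sketched in plain Python. For reproducibility this sketch seeds the centers with the first k records, whereas SPSS Modeler picks k well-spaced ones; the usage data is hypothetical:

```python
from math import dist
from statistics import mean

def kmeans(points, k, iterations=20):
    """Plain k-means: repeatedly assign each point to its nearest center,
    then recompute each center as the mean of its members."""
    centers = list(points[:k])  # simplified deterministic initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [tuple(mean(axis) for axis in zip(*cluster)) if cluster
                   else centers[i] for i, cluster in enumerate(clusters)]
    return centers, clusters

# Two well-separated groups of hypothetical (LOCAL, LONGDIST) usage
points = [(1, 1), (2, 1), (1, 2), (10, 10), (11, 10), (10, 11)]
centers, clusters = kmeans(points, 2)
print(sorted(centers))
```

The loop converges here in a few iterations to one center near each group, illustrating both issues listed above: the result depends on the distance measure (Euclidean here) and on the chosen k.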

Modeling and Evaluation: K-Means

Page 33

Problem: we need to segment telecom customers based on usage data, as follows:

– Local calls
– Long-distance calls
– International calls

Since we have a large data set, we will focus on reducing the space into smaller sub-spaces, and we will use K-Means. K-Means is sensitive to outliers and noise, and is thus not recommended for small data sets.

SPSS telecom data clustering stream

Modeling and Evaluation: K-Means – Example for Customer Segmentation

Page 34

- K-means model summary -
- K-means clustering results -

Modeling and Evaluation: K-Means – Example for Customer Segmentation (2)

Page 35

Thank You!

Questions?