
  • Data Mining: Modeling

    18749-001

    SPSS v11.5; Clementine v7.0; AnswerTree 3.1; DecisionTime 1.1 Revised 9/26/2002 ss/mr

  • For more information about SPSS software products, please visit our Web site at http://www.spss.com or contact

    SPSS Inc. 233 South Wacker Drive, 11th Floor Chicago, IL 60606-6412 Tel: (312) 651-3000 Fax: (312) 651-3668 SPSS is a registered trademark and its other product names are the trademarks of SPSS Inc. for its proprietary computer software. No material describing such software may be produced or distributed without the written permission of the owners of the trademark and license rights in the software and the copyrights in the published materials.

    The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subdivision (c)(1)(ii) of The Rights in Technical Data and Computer Software clause at 52.227-7013. Contractor/manufacturer is SPSS Inc., 233 South Wacker Drive, 11th Floor, Chicago, IL 60606-6412.

TableLook is a trademark of SPSS Inc. Windows is a registered trademark of Microsoft Corporation. DataDirect, DataDirect Connect, INTERSOLV, and SequeLink are registered trademarks of MERANT Solutions Inc. Portions of this product were created using LEADTOOLS 1991-2000, LEAD Technologies, Inc. ALL RIGHTS RESERVED. LEAD, LEADTOOLS, and LEADVIEW are registered trademarks of LEAD Technologies, Inc. Portions of this product were based on the work of the FreeType Team (http://www.freetype.org). General notice: Other product names mentioned herein are used for identification purposes only and may be trademarks or registered trademarks of their respective companies in the United States and other countries.

    Data Mining: Modeling Copyright 2002 by SPSS Inc. All rights reserved. Printed in the United States of America.

    No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.


    Data Mining: Modeling Table of Contents

CHAPTER 1
INTRODUCTION
    INTRODUCTION .......... 1-2
    MODEL OVERVIEW .......... 1-3
    VALIDATION .......... 1-6

CHAPTER 2
STATISTICAL DATA MINING TECHNIQUES
    INTRODUCTION .......... 2-2
    STATISTICAL TECHNIQUES .......... 2-3
    LINEAR REGRESSION .......... 2-4
    DISCRIMINANT ANALYSIS .......... 2-21
    LOGISTIC AND MULTINOMIAL REGRESSION .......... 2-31
    APPENDIX: GAINS TABLES .......... 2-38

CHAPTER 3
MARKET BASKET OR ASSOCIATION ANALYSIS
    INTRODUCTION .......... 3-2
    TECHNICAL CONSIDERATIONS .......... 3-3
    RULE GENERATION .......... 3-4
    APRIORI EXAMPLE: GROCERY PURCHASES .......... 3-5
    USING THE ASSOCIATIONS .......... 3-12
    APRIORI EXAMPLE: TRAINING COURSE PURCHASES .......... 3-15

CHAPTER 4
NEURAL NETWORKS
    INTRODUCTION .......... 4-2
    BASIC PRINCIPLES OF SUPERVISED NEURAL NETWORKS .......... 4-3
    A NEURAL NETWORK EXAMPLE: PREDICTING CREDIT RISK .......... 4-11


CHAPTER 5
RULE INDUCTION AND DECISION TREE METHODS
    INTRODUCTION .......... 5-2
    WHY SO MANY METHODS? .......... 5-4
    CHAID ANALYSIS .......... 5-6
    A CHAID EXAMPLE: CREDIT RISK .......... 5-7
    RULE INDUCTION (C5.0) .......... 5-22
    A C5.0 EXAMPLE: CREDIT RISK .......... 5-22

CHAPTER 6
CLUSTER ANALYSIS
    INTRODUCTION .......... 6-2
    WHAT TO LOOK AT WHEN CLUSTERING .......... 6-3
    A K-MEANS EXAMPLE: CLUSTERING SOFTWARE USAGE DATA .......... 6-5
    CLUSTERING WITH KOHONEN NETWORKS .......... 6-14
    A KOHONEN EXAMPLE: CLUSTERING PURCHASE DATA .......... 6-15

CHAPTER 7
TIME SERIES ANALYSIS
    INTRODUCTION .......... 7-2
    DATA ORGANIZATION FOR TIME SERIES ANALYSIS .......... 7-3
    INTRODUCTION TO EXPONENTIAL SMOOTHING .......... 7-4
    A DECISIONTIME FORECASTING EXAMPLE: DAILY PARCEL DELIVERIES .......... 7-5

CHAPTER 8
SEQUENCE DETECTION
    INTRODUCTION TO SEQUENCE DETECTION .......... 8-2
    TECHNICAL CONSIDERATIONS .......... 8-3
    DATA ORGANIZATION FOR SEQUENCE DETECTION .......... 8-4
    SEQUENCE DETECTION ALGORITHMS IN CLEMENTINE .......... 8-5
    A SEQUENCE DETECTION EXAMPLE: REPAIR DIAGNOSTICS .......... 8-6

    REFERENCES..........................................................................................R-1


    Chapter 1 Introduction

    Topics: INTRODUCTION

    MODEL OVERVIEW

    VALIDATION


INTRODUCTION

This course focuses on the modeling stage of the data mining process. It will compare and review the analytic methods commonly used for data mining. In addition, it will illustrate these methods using SPSS software (SPSS, AnswerTree, DecisionTime, and Clementine). The course assumes that a business question has been formulated and that relevant data have been collected, organized, checked, and prepared. In short, it assumes that all the time-consuming preparatory work has been completed and you are at the modeling stage of your project. For more details concerning what should be done during the earlier stages in a data mining project, see the SPSS Data Mining: Overview and Data Mining: Data Understanding and Data Preparation courses.

This chapter serves as a road map for the rest of the course. We try to place the various methods discussed within a framework and give you a sense of when to use which methods. The unifying theme is data mining, and we discuss in detail the analytic techniques most often used to support these efforts. The course emphasizes the practical issues of setting up, running, and interpreting the results of statistical and machine learning analyses. It assumes you have, or will have, some business questions that require analysis, and that you know what to do with the results once you have them.

There are choices regarding specific methods with several of these techniques, and the recommendations we make are based on what is known from properties of the methods, Monte Carlo simulations, or empirical work. You should be aware from the start that in most cases there is not a single method that will definitely yield the best results. However, in the chapters that follow detailing the specific methods, we have sections that list research projects for which the method is appropriate, features and limitations of the method, and comments concerning model deployment. These should prove of some use when you must decide on the method to apply to your problem.

Finally, the approach is practical, not mathematical. Relatively few equations are presented, and references are given for those who would like a more rigorous review of the techniques. Also, our goal is to provide you with a good sense of the properties of each method and how it is used and interpreted. The course does not strive for exhaustive detail. Entire books have been written on topics we cover in a single chapter, and we are trying to present the main issues a practitioner will face. Analyses are run using different SPSS products. However, the emphasis in this course is on understanding the characteristics of the methods and being able to interpret the results. Thus we will not discuss data definition and general program operation issues. We do present instructions to perform the analyses, but more information is needed than is presented here to master the software programs used. To provide this depth, SPSS offers operational courses for the products used in this course.


MODEL OVERVIEW

In this section we provide brief descriptions and comparisons of the data mining analysis and modeling methods that will be discussed in this course. Recall from your statistics courses that inferential statistics have two key features. They require that you specify a hypothesis to test (such as that more satisfied customers will be more likely to make additional purchases) and they allow you to make inferences back to the population from the particular sample data you are studying. Because of these features, it isn't formally necessary to create training and validation (test) data sets when using inferential statistics. The validation portion of the analysis is done with standard test statistics, such as F, t, or chi-square, providing a probability of the hypothesis under test being correct. However, given the accepted data-mining methodology, you may decide to create a validation data set even when using inferential techniques. There is generally no harm in doing so, especially with a sufficient amount of data where the training and validation sets can both be reasonably large.

Here is a listing of some inferential statistical methods commonly used in data mining projects. We will not define them here but leave that for a later section. The type of variables each requires is also listed.

GENERAL TECHNIQUE (Inferential Statistics)    PREDICTOR VARIABLES       OUTCOME VARIABLE
Discriminant Analysis                         Continuous or dummies*    Categorical
Linear Regression (and ANOVA)                 Continuous or dummies     Continuous
Logistic and Multinomial Regression           Continuous or dummies     Categorical
Time Series Analysis                          Continuous or dummies     Continuous

(*Dummies refers to transformed variables coded 1 or 0, representing the presence or absence of a characteristic. Thus a field such as region [north, south, east and west], when used as a predictor variable in several inferential methods, would be represented by dummy variables. For example, one dummy field might be named North and coded 1 if the record's region code were north and 0 otherwise.)

As is common for inferential statistics, all of these techniques are used to make predictions of a dependent variable. Some have been used for many years, such as linear regression or discriminant analysis. Inferential statistical techniques often make stringent assumptions about the data, such as normality, uncorrelated errors, or homogeneity of variance. They are more restrictive


than non-inferential techniques, which can be a disadvantage. However, they provide rigorous tests of hypotheses unavailable with more automated methods of analysis. Although these methods are not always mentioned in data mining books and articles, you need to be aware of them because they are often exactly what is necessary to answer a particular question. For instance, to predict the amount of revenue, in dollars, that a new customer is likely to provide in the next two years, linear regression could be a natural choice, depending on the available predictor variables and the nature of the relationships.
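To make the dummy-coding idea above concrete, here is a minimal sketch in Python using the pandas library. It is purely illustrative and not part of the SPSS software used in this course; the region field and its values are hypothetical, chosen to mirror the footnote above.

    import pandas as pd

    # A hypothetical region field with four categories
    df = pd.DataFrame({"region": ["north", "south", "east", "west", "north"]})

    # One 0/1 dummy field per category; each dummy is 1 when the record's
    # region matches that category and 0 otherwise
    dummies = pd.get_dummies(df["region"], prefix="region")
    print(dummies)

In practice one dummy is usually dropped (for example with drop_first=True), since the omitted category is implied when all the other dummies are 0.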

GENERAL TECHNIQUE (Data Mining)    PREDICTOR VARIABLES       OUTCOME VARIABLE
Decision Trees (Rule Induction)    Continuous or dummies*    Categorical (some allow Continuous)
Neural Networks                    Continuous or dummies     Categorical or Continuous

The key difference for most users between inferential and non-inferential techniques is in whether hypotheses need to be specified beforehand. In the latter methods, this is not normally required, as each is semi- or completely automated as it searches for a model. Nonetheless, in all non-inferential techniques, you clearly need to specify a list of variables as inputs to the procedure, and you may have to specify other details, depending on the exact method. As we discussed in the previous courses in the SPSS Data Mining sequence, data mining is not a mindless activity; even here, you need a plan of approach (a research design) to use these techniques wisely.

Notice that the inferential statistical methods are not distinguished from the data mining methods in terms of the types of variables they allow. Instead, data mining methods, such as decision trees and neural networks, are distinguished by making fewer assumptions about the data (for example, normality of errors). In many instances both classes of methods can be applied to a given prediction problem. Some data mining methods do not involve prediction, but instead search for groupings or associations in the data. Several of these methods are listed below along with the types of analysis you can do with them.


GENERAL TECHNIQUE                     Analysis
Cluster Analysis                      Uses continuous or categorical variables to create cluster memberships; no predefined outcome variable.
Market Basket/Association Analysis    Uses categorical variables to create associations between categories; no outcome variable required.
Sequence Detection                    Uses categorical variables in data sorted in time order to discover sequences in data; no outcome variable required, but there may be interest in specific outcomes.

Finally, discussions of data mining mention the tasks of classification, affinity analysis, prediction or segmentation. Below we group the data mining techniques within these categories.

Affinity/Association: These methods attempt to find items that are closely associated in a data file, with the archetypal case being shopping patterns of consumers. Market basket analysis and sequence detection fall into this category.

Classification/Segmentation: These methods attempt to classify customers into discrete categories that have already been defined (for example, customers who stay and those who leave), based on a set of predictors. Several methods are available, including decision trees, neural networks, and sequence detection (when data are time structured). Note that logistic regression and discriminant analysis are inferential techniques that accomplish this same task.

Clustering/Segmentation: Notice that we have repeated the word segmentation. This is because segmentation is used in two senses in data mining. Its second meaning is to create natural clusters of objects (without using an outcome variable) that are similar on various characteristics. Cluster analysis and Kohonen networks accomplish this task.

Prediction/Estimation: These methods predict a continuous outcome variable, as opposed to classification methods, which work with discrete outcomes. Neural networks fall into this group. Decision tree methods can work with continuous predictors, but they split them into discrete ranges as the tree is built. Memory-based reasoning techniques (not covered in this course) can also predict continuous outcomes. Regression is the inferential method likely to be used for this purpose.

The descriptions above are quite simple and hide a wealth of detail that we will consider as we review the techniques. More than one specific method is usually available for a general technique. So, to cluster data, K-means clustering, Two-step clustering, and Kohonen networks (a form of neural network) could be used, with the choice of which to use depending on the type of data, the availability of software, the ease of understanding desired, the speed of processing, and so forth.


VALIDATION

Since most data-mining methods do not depend on specific data distribution assumptions (for example, normality of errors) to draw inferences from the sample to the population, validation is strongly recommended. It is usually done by fitting the model to a portion of the data (called the Training data) and then applying the predictions to, and evaluating the results with, the other portion of the data (called the Validation data; note that some authors refer to this as Test data, but, as we will see, Test data has a specific meaning in neural network estimation). In this way, the validity of the model is established by demonstrating that it applies to (fits) data independent of that used to derive the model. Statisticians often recommend such validation for statistical models, but it is crucial for more general (less distribution-bound) data mining techniques. There are several methods of performing validation.

Holdout Sample

This method was described above. The data set is split into two parts: training and validation files. For large files it might be a 50/50 split, while for smaller files more records are typically placed in the training set. Modeling is performed on the training data, but fit evaluation is done on the separate validation data.
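As an illustration of the holdout idea outside the SPSS tools used in this course, the following Python sketch (scikit-learn) splits a synthetic file 50/50, fits on the training half, and evaluates on the validation half. The data and model here are stand-ins, not the course data.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Synthetic data standing in for a prepared mining file:
    # 1,000 records, 3 predictors, one continuous outcome
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    y = X @ [1.0, 0.5, -0.3] + rng.normal(scale=0.5, size=1000)

    # 50/50 holdout split; smaller files usually keep more records in training
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.5, random_state=1)

    # Fit on the training data only, evaluate on the held-out validation data
    model = LinearRegression().fit(X_train, y_train)
    print("validation r-square:", round(model.score(X_valid, y_valid), 3))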

N-Fold Validation

If the data file is small, reserving a holdout sample may not be feasible (the training sample may be too small to obtain stable results). In this case n-fold validation may be done. Here the data set is divided into a number of groups of equal sample size. Let's use 10 groups for the example. The first group is held out from the analysis, which is based on the other 9 groups (or 9/10ths of the data), and is used as the validation sample. Next the second group is held out from the analysis, again based on the other 9 groups, and is used as the validation sample. This continues until each of the 10 groups has served as a validation sample. The validation results from each of these samples are then pooled. This has the advantage of providing a form of validation in the presence of small samples, but since any given data record is used in 9 of the 10 models, there is less than complete independence. A second problem is that since 10 models are run there is no single model result (there are 10). For this reason, n-fold validation is generally used to estimate the fit or accuracy of a model with small data files and not to produce the model coefficients or rules. Some procedures extend this principle to base the model on all but one observation (using fast algorithms), keeping a single record as the hold-out. Generally speaking, only closed-form models that involve no iteration (like regression or discriminant analysis) can afford the computational cost of this approach.
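The same idea can be sketched for n-fold validation. The snippet below, again an illustration in Python/scikit-learn rather than part of the course software, runs a 10-fold validation on a small synthetic file and pools the ten fit estimates.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # A small synthetic file, where reserving a large holdout sample would be costly
    rng = np.random.default_rng(0)
    X = rng.normal(size=(120, 3))
    y = X @ [1.0, 0.5, -0.3] + rng.normal(scale=0.5, size=120)

    # 10-fold validation: each tenth of the data is held out once while the model
    # is fit on the other nine tenths; the ten r-square values are then pooled
    scores = cross_val_score(LinearRegression(), X, y, cv=10, scoring="r2")
    print("per-fold r-square:", np.round(scores, 2))
    print("pooled estimate:", round(scores.mean(), 2))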


Validate with Other Models

Since different data mining models often can be applied to the same data, you would have greater confidence in your results if different methods led to the same conclusions. This is not to say that the results should be identical, since the models do differ in their assumptions and approach. But you would expect that important predictors repeat across methods and have the same general relationship to the outcome.

Validate with Different Starting Values

Neural networks usually begin with randomly assigned weights and then, hopefully, converge to the optimum solution. If analyses run with different starting values for the weights produce the same solution, then you would have greater confidence in it.
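A rough illustration of this check, using a scikit-learn network in Python as a stand-in for the neural network tools covered later in the course (the data are synthetic and the settings arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    # Synthetic classification data standing in for a prepared training file
    X, y = make_classification(n_samples=500, n_features=6, random_state=0)

    # Train the same network twice from different random starting weights;
    # similar fit (and similar predictions) raises confidence in the solution
    for seed in (1, 2):
        net = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=seed)
        net.fit(X, y)
        print("starting seed", seed, "training accuracy:", round(net.score(X, y), 3))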

Domain Validation

Do the model results make sense within the business area being studied? Here a domain expert (someone who understands the business and data) examines the model results to determine if they make sense, and to decide if they are interesting and useful, as opposed to obvious and trivial.


    Chapter 2 Statistical Data Mining Techniques

    Topics:

    STATISTICAL TECHNIQUES

    LINEAR REGRESSION

    DISCRIMINANT ANALYSIS

    LOGISTIC AND MULTINOMIAL REGRESSION

    APPENDIX: GAINS TABLES


INTRODUCTION

In this chapter we consider the various inferential statistical techniques that are commonly used in data mining. We include a detailed example of each, as well as discussions about typical sample sizes, whether the method can be automated, and how easily the model can be understood and deployed.

As you work on the tasks we've cited in the last chapter, you should also be thinking about what data mining techniques to use to answer those questions. Research isn't done step-by-step, in some predefined order, as we are taught in textbooks. Instead, all phases of a data mining project should be under review early in the process. This is especially critical for the data mining techniques you plan to employ, for at least three reasons. First, each data mining technique is suitable for only some types of analysis, not all. Thus the research question you have defined can't necessarily be answered by just any technique. So if you want to answer a question that requires, say, market basket analysis (discussed in Chapter 3), and you have little expertise in this procedure, you'll need to prepare ahead of time, conceivably even acquire additional software, so you are ready to begin analysis when the data are ready. Second, some techniques require more data than others do, or data of a particular kind, so you will need to have these conditions in mind when you collect the data. And third, some techniques are more easily understandable than others and the models more readily retrained if the environment changes rapidly, both of which might affect your choice of which technique to use.

In this chapter we provide several different frameworks or classification schemes by which to understand and conceptualize the various inferential data mining techniques available in SPSS and other software. Examples of each technique will be given, including research questions or projects suitable for that type of analysis. Although details for running various analyses are given in the chapter, the emphasis is on setting up the basic analysis and interpreting the results. For this reason, not all available options and variations will be covered in this class. Also, such steps as data definition and data exploration are assumed to be completed prior to the modeling stage. In short, the goal of the chapter is not to exhaustively cover each data mining procedure in SPSS, but to present and discuss the core features needed for most analyses. (For more details on specific procedures, you may attend separate SPSS, AnswerTree, DecisionTime, and Clementine application courses.) Instead, we provide an overview of these methods with enough detail for you to begin to make an informed choice about which method will be appropriate for your own data mining projects, to set up a typical analysis, and interpret the results.


STATISTICAL TECHNIQUES

Recall that inferential statistics have two key features. They require that you specify a hypothesis to test (such as that more satisfied customers will be more likely to make additional purchases) and they allow you to make inferences back to the population from the particular sample data you are studying. Below is the listing, from Chapter 1, of the inferential methods commonly used in data mining projects. We will define them in later sections of this chapter. The type of variables each requires is also listed.

GENERAL TECHNIQUE                        PREDICTOR VARIABLES       OUTCOME VARIABLE
Discriminant Analysis                    Continuous or dummies*    Categorical
Linear Regression (and ANOVA)            Continuous or dummies     Continuous
Logistic and Multinomial Regression      Continuous or dummies     Categorical
Time Series Analysis                     Continuous or dummies     Continuous

(*Dummies refers to transformed variables coded 1 or 0, representing the presence or absence of a characteristic. Thus a field such as region (north, south, east and west), when used as a predictor variable in several inferential methods, would be represented by dummy variables. For example, one dummy field might be named North and coded 1 if the record's region code was north and 0 otherwise.)

As we discuss the techniques, we also provide information on whether they can be automated or not, their ease of understanding and typical size of data files, plus other important traits. After this brief glimpse at the various techniques, we turn next to a short discussion of each, including examples of research questions it can answer and where each can be found, if available, in SPSS software. You are probably already familiar with several of the inferential statistics methods we consider here. Our emphasis is on practical use of the techniques, not on the theory underlying each one.


LINEAR REGRESSION

Linear regression is a method familiar to just about everyone these days. It is the classic linear model technique, and is used to predict an outcome variable that is interval or ratio with a set of predictors that are also interval or ratio. In addition, categorical predictor variables can be included by creating dummy variables. Linear regression is available in SPSS under the Analyze..Regression menu and is available in SPSS Clementine.

Linear regression, of course, assumes that the data can be modeled with a linear relationship. As illustration, Figure 2.1 exhibits a scatterplot depicting the relationship between the number of previous late payments for bills and the credit risk of defaulting on a new loan. Superimposed on the plot is the best-fit regression line. The plot may look a bit unusual because of the use of sunflowers, which are used to represent the number of cases at a point. Since credit risk and late payments are measured as whole integers, the number of discrete points here is relatively limited given the large file size (over 2,000 cases).

Figure 2.1 Scatterplot of Late Payments and Credit Risk

    Although there is a lot of spread around the regression line, it is clear that there is a trend in the data such that more late payments are associated with a greater credit risk. Of course, linear regression is normally used with several predictors; this makes it impossible to display the complete solution with all predictors in convenient graphical form. Thus most users of linear regression use the numeric output.


Basic Concepts of Regression

Earlier we pointed out that to the eye there seems to be a positive relation between credit risk and the number of late payments. However, it would be more useful in practice to have some form of prediction equation. Specifically, if some simple function can approximate the pattern shown in the plot, then the equation for the function would concisely describe the relation, and could be used to predict values of one variable given knowledge of the other. A straight line is a very simple function, and is usually what researchers start with, unless there are reasons (theory, previous findings, or a poor linear fit) to suggest another. Also, since the point of much research involves prediction, a prediction equation is valuable. However, the value of the equation would be linked to how well it actually describes or fits the data, and so part of the regression output includes fit measures.

The Regression Equation and Fit Measure

In the plot above, credit risk is placed on the Y (vertical) axis and the number of late payments appears along the X (horizontal) axis. If we are interested in credit risk as a function of the number of late payments, we consider credit risk to be the dependent variable and number of late payments the independent or predictor variable. A straight line is superimposed on the scatterplot along with the general form of the equation:

Y = B*X + A

Here, B is the slope (the change in Y per one unit change in X) and A is the intercept (the value of Y when X is zero). Given this, how would one go about finding a best-fitting straight line? In principle, there are various criteria that might be used: minimizing the mean deviation, mean absolute deviation, or median deviation. Due to technical considerations, and with a dose of tradition, the best-fitting straight line is the one that minimizes the sum of the squared deviations of each point about the line.

Returning to the plot of credit risk and number of late payments, we might wish to quantify the extent to which the straight line fits the data. The fit measure most often used, the r-square measure, has the dual advantages of falling on a standardized scale and having a practical interpretation. The r-square measure (which is the squared correlation, r², when there is a single predictor variable, and thus its name) is on a scale from 0 (no linear association) to 1 (perfect prediction). Also, the r-square value can be interpreted as the proportion of variation in one variable that can be predicted from the other. Thus an r-square of .50 indicates that we can account for 50% of the variation in one variable if we know values of the other. You can think of this value as a measure of the improvement in your ability to predict one variable from the other (or others if there is more than one independent variable).
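For readers who want to see these quantities computed outside SPSS, here is a minimal Python sketch (scipy). The x and y values are made up for illustration and only stand in for late payments and credit risk; they are not the course data.

    from scipy import stats

    # Hypothetical values standing in for number of late payments (x)
    # and credit risk score (y)
    x = [0, 1, 1, 2, 3, 3, 4, 5, 6, 7]
    y = [10, 12, 15, 14, 20, 22, 25, 24, 30, 33]

    # Least-squares fit: slope B, intercept A, and the correlation r
    result = stats.linregress(x, y)
    print("Y = %.2f*X + %.2f" % (result.slope, result.intercept))
    print("r-square:", round(result.rvalue ** 2, 3))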


Multiple regression represents a direct extension of simple regression. Instead of a single predictor variable (Y = B*X + A), multiple regression allows for more than one independent variable in the prediction equation:

Y = B1*X1 + B2*X2 + B3*X3 + . . . + A

While we are limited in the number of dimensions we can view in a single plot (SPSS can build a 3-dimensional scatterplot), the regression equation allows for many independent variables. When we run multiple regression we will again be concerned with how well the equation fits the data, whether there are any significant linear relations, and estimating the coefficients for the best-fitting prediction equation. In addition, we are interested in the relative importance of the independent variables in predicting the dependent measure.
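A sketch of a multiple regression fit in Python (statsmodels), with synthetic data standing in for a real file; it is shown only to make the equation above concrete and is not part of the course software.

    import numpy as np
    import statsmodels.api as sm

    # Synthetic data: three predictors and one continuous outcome
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = 3.0 + X @ [1.5, 0.8, -0.4] + rng.normal(scale=1.0, size=200)

    # Y = B1*X1 + B2*X2 + B3*X3 + A; add_constant supplies the intercept A
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    print(fit.params)                          # A, B1, B2, B3
    print("r-square:", round(fit.rsquared, 3))
    # fit.summary() also reports the t tests, the overall F test,
    # and the adjusted r-square discussed later in this chapter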

Residuals and Outliers

Viewing the plot, we see that many points fall near the line, but some are more distant from it. For each point, the difference between the value of the dependent variable and the value predicted by the equation (value on the line) is called the residual. Points above the line have positive residuals (they were under-predicted), those below the line have negative residuals (they were over-predicted), and a point falling on the line has a residual of zero (perfect prediction). Points having relatively large residuals are of interest because they represent instances where the prediction line did poorly. As we will see shortly in our detailed example, large residuals (gross deviations from the model) have been used to identify data errors or possible instances of fraud (in application areas such as insurance claims, invoice submission, telephone and credit card usage). In SPSS, the Regression procedure can provide information about large residuals, and also present them in standardized form.

Outliers, or points far from the mass of the others, are of interest in regression because they can exert considerable influence on the equation (especially if the sample size is small, which is rarely the case in data mining). Also, outliers can have large residuals and would be of interest for this reason as well. While not covered in this class, SPSS can provide influence statistics to aid in judging whether the equation was strongly affected by an observation and, if so, to identify the observation.

Assumptions

Regression is usually performed on data for which the dependent and independent variables are interval scale. In addition, when statistical significance tests are performed, it is assumed that the deviations of points around the line (residuals) follow the normal bell-shaped curve. Also, the residuals are assumed to be independent of the predicted values (values on the line), which implies that the variation of the residuals around the line is homogeneous (homogeneity of variance). SPSS can provide summaries and plots useful in evaluating these latter issues. One special case of the assumptions involves the interval scale nature of the independent variable(s). A variable coded as a dichotomy (say


0 and 1) can technically be considered as an interval scale. An interval scale assumes that a one-unit change has the same meaning throughout the range of the scale. If a variable's only possible codes are 0 and 1 (or 1 and 2, etc.), then a one-unit change does mean the same change throughout the scale. Thus dichotomous variables (e.g., gender) can be used as predictor variables in regression. This also permits the use of categorical predictor variables if they are converted into a series of dichotomous variables; this technique is called dummy coding and is considered in most regression texts (Draper and Smith (1998), Cohen, Cohen, West and Aiken (2002)).

An Example: Error or Fraud Detection in Claims

To illustrate linear regression we turn to a data set containing insurance claims for a single medical treatment performed in a hospital (in the US, for a single DRG, or diagnostic related group). In addition to the claim amount, the data file also contains patient age (Age), length of hospital stay (Los) and a severity of illness category (Asg). This last field is based on several health measures, and higher category scores indicate greater severity of the illness. The plan is to build a regression model that predicts the total claims amount for a patient on the basis of length of stay, severity of illness and patient age. Assuming the model fits, we are then interested in those patients that the model predicts poorly. Such cases can simply be instances of poor model fit, or the result of predictors not included in the model, but they also might be due to errors on the claims form or fraudulent entries. Thus we are approaching the problem of error or fraud detection by identifying exceptions to the prediction model. Such exceptions are not necessarily instances of fraud, but since they are inconsistent with the model, they may be more likely to be fraudulent or contain errors.

Some organizations perform random audits on claims applications and then classify them as fraudulent or not. Under these circumstances, predictive models can be constructed that attempt to correctly classify new claims applications (logistic, discriminant, rule induction and neural networks have been used for this purpose). However, when such an outcome field is not available, fraud detection then involves searching for and identifying exceptional instances. Here, an exceptional instance is one that the model predicts poorly. We use regression to build the model; if there were reason to believe the model were more complex (for example, contained nonlinear relations) then neural networks could be applied.

A Note Concerning Data Files, Variable Names and Labels in SPSS

In this course guide, variable names (not labels) in alphabetic order are displayed in SPSS dialog boxes. To set your machine to match this display, click as follows within SPSS.


Click Edit..Options
Click the General tab
Click the Display names option button in the Variable Lists section
Click the Alphabetical option button in the Variable Lists section
Click OK

Also, files are assumed to be located in the c:\Train\DM_Model directory. They can be copied from the floppy accompanying this guide (or from the CD-ROM containing this guide). If you are running SPSS Server (you can check by clicking File..Switch Server from within SPSS), then files used with SPSS should be copied to a directory that can be accessed from (is mapped to) the server. To develop a regression equation predicting claims amount based on hospital length of stay, severity of illness group and age using SPSS:

Click File..Open..Data (switch to the c:\Train\DM_Model directory if necessary)
Double click on InsClaims
Click Analyze..Regression

This chapter will discuss two choices: Linear regression, which performs simple and multiple linear regression, and Logistic regression (Binary). Curve Estimation will invoke the Curvefit procedure, which can apply up to 16 different functions relating two variables. Binary logistic regression is used when the dependent variable is a dichotomy (for example, when predicting whether a prospective customer makes a purchase or not). Multinomial logistic regression is appropriate when you have a categorical dependent variable with more than two possible values. Ordinal regression is appropriate if the outcome variable is ordinal (rank ordered). Probit analysis, nonlinear regression, weight estimation (used for weighted least squares analysis), 2-Stage least squares, and optimal scaling are not generally used for data mining and so will not be discussed further here.


    Figure 2.2 Regression Menu

    We will select Linear to perform multiple linear regression, then specify claim as the dependent variable and age, asg (severity level) and length of stay (los) as the independent variables.

Click Linear from the Regression menu
Move claim to the Dependent: list box
Move age, asg and los to the Independent(s): list box


    Figure 2.3 Linear Regression Dialog Box

    Since our goal is to identify exceptions to the regression model, we will ask for residual plots and information about cases with large residuals. Also, the Regression dialog box allows many specifications; here we will discuss the most important features.

Note on Stepwise Regression

With such a small number of predictor variables, we will simply add them all into the model. However, in the more common situation of many predictor variables (most insurance claims forms would contain far more information) a mechanism to select the most promising predictors is desirable. This could be based on the domain knowledge of the business expert (here perhaps a medical expert). In addition, an option may be chosen to select, from a larger set of independent variables, those that in some statistical sense are the best predictors (Stepwise method).

The Selection Variable option permits cross-validation of regression results. Only cases whose values meet the rule specified for a selection variable will be used in the regression analysis, yet the resulting prediction equation will be applied to the other cases. Thus you can evaluate the regression on cases not used in the analysis, or apply the equation derived from one subgroup of your data to other groups. The importance of such validation in data mining is a repeated theme in this course.

While SPSS will present standard regression output by default, many additional (and some of them quite technical) statistics can be requested via the Statistics dialog box. The


Plots dialog box is used to generate various diagnostic plots used in regression, including the residual plot we are interested in. The Save dialog box permits you to add new variables to the data file containing such statistics as the predicted values from the regression equation, various residuals and influence measures. We will create these in order to calculate our own percentage deviation field. Finally, the Options dialog box controls the criteria used when running stepwise regression and the choices for handling missing data (the SPSS Missing Values option provides more sophisticated methods of handling missing values). Note that by default, SPSS excludes a case from regression if it has one or more values missing for the variables used in the analysis.

Residual Plots

While we can run the multiple regression at this point, we will request some diagnostic plots involving residuals and information about outliers. A residual is the difference (signed) between the actual value of the dependent variable and the value predicted by the model. Residuals can be used to identify large errors in prediction or cases poorly fit by the model. By default no residual plots will appear. These options are explained below.

Click the Plots pushbutton
Within the Plots dialog box:
Check Histogram in the Standardized Residual Plots area

Figure 2.4 Regression Plots Dialog Box


The options in the Standardized Residual Plots area of the dialog box all involve plots of standardized residuals. Ordinary residuals are useful if the scale of the dependent variable is meaningful, as it is here (claim amount in dollars). Standardized residuals are helpful if the scale of the dependent variable is not familiar (say a 1 to 10 customer satisfaction scale). That is, it may not be clear to the analyst just what constitutes a large residual: is an over-prediction of 1.5 units a large miss on a 1 to 10 scale? In such situations, standardized residuals (residuals expressed in standard deviation units) are very useful because large prediction errors can be easily identified. If the errors follow a normal distribution, then standardized residuals greater than 2 (in absolute value) should occur in about 5% of the cases, and those greater than 3 (in absolute value) should happen in less than 1% of the cases. Thus standardized residuals provide a norm against which one can judge what constitutes a large residual. Recall that the F and t tests in regression assume that the residuals follow a normal distribution.
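SPSS saves these standardized residuals for us below (as zre_1, alongside pre_1 and res_1); the following Python/pandas sketch only shows the flagging logic in another form. The column names are hypothetical, the residual is standardized simply by dividing by its standard deviation (an approximation of the SPSS value), and the 3.0 cutoff matches the casewise diagnostics criterion described next.

    import pandas as pd

    def flag_large_residuals(df, actual="claim", predicted="pre_1", cutoff=3.0):
        # Return records whose standardized residual exceeds the cutoff
        resid = df[actual] - df[predicted]              # raw residual
        zresid = resid / resid.std()                    # roughly standardized residual
        return df[zresid.abs() > cutoff]                # e.g. more than 3 SDs from the model

    # Usage with a hypothetical scored file containing the saved prediction pre_1:
    # claims = pd.read_csv("InsClaims_scored.csv")
    # print(flag_large_residuals(claims))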

Click Continue

Next we will look at the Statistics dialog box, which contains options concerning Casewise Diagnostics. When this option is checked, Regression will list information about all cases whose standardized residuals are more than 3 standard deviations from the line. This outlier criterion is under your control.

Click the Statistics pushbutton
Click the Casewise diagnostics check box in the Residuals area

    Figure 2.5 Regression Statistics Dialog Box

    By requesting this option we will obtain a listing of those records that the model predicts poorly. When dealing with a very large data file, which may have many outliers, such a


    list is cumbersome. It would be more efficient to save the residual value (standardized or not) as a new field, then select the large residuals and write these cases to a new file or add a flag field to the main database. We create these new fields below.

Click Continue
Click the Save pushbutton
Click the check boxes for Unstandardized Predicted Values, Unstandardized Residuals, and Standardized Residuals

Figure 2.6 Saving Predicted Values and Errors

Click Continue, then click OK

Now we examine the results.


    Figure 2.7 Model Summary and Overall Significance Tests

After listing the dependent and independent variables (not shown), Regression provides several measures of how well the model fits the data. First is the multiple R, which is a generalization of the correlation coefficient. If there are several independent variables (our situation) then the multiple R represents the unsigned (positive) correlation between the dependent measure and the optimal linear combination of the independent variables. Thus the closer the multiple R is to 1, the better the fit.

As mentioned earlier, the r-square measure can be interpreted as the proportion of variance of the dependent measure that can be predicted from the independent variable(s). Here it is about 32%, which is far from perfect prediction, but still substantial. The adjusted r-square represents a technical improvement over the r-square in that it explicitly adjusts for the number of predictor variables, and as such is preferred by many analysts. However, it is a more recently developed statistic and so is not as well known as the r-square. Generally, they are very close in value; in fact, if they differ dramatically in multiple regression, it is a sign that you have used too many predictor variables relative to your sample size, and the adjusted r-square value should be more trusted. In our results, they are very close.

While the fit measures indicate how well we can expect to predict the dependent variable or how well the line fits the data, they do not tell whether there is a statistically significant relationship between the dependent and independent variables. The analysis of variance table presents technical summaries (sums of squares and mean square statistics), but here we refer to variation accounted for by the prediction equation. We are interested in determining whether there is a statistically significant (non-zero) linear relation between the dependent variable and the independent variable(s) in the population. Since our analysis contains three predictor variables, we test whether any linear relation differs


    from zero. The significance value accompanying the F test gives us the probability that we could obtain one or more sample slope coefficients (which measure the straight-line relationships) as far from zero as what we obtained if there were no linear relations in the population. The result is highly significant (significance probability less than .0005 (the table value is rounded to .000) or 5 chances in 10,000). Now that we have established there is a significant relationship between the claims amount and one or more predictor variables, and obtained fit measures, we turn to interpret the regression coefficients. Here we are interested in verifying that several expected relationships hold: (1) claims will increase with length of stay, (2) claims will increase with increasing severity of illness, and (3) claims will increase with age. Strictly speaking, this step is not necessary in order to identify cases that are exceptional. However, in order to be confident in the model, it should make sense to a domain expert. Since interpretation of regression models can be made directly from the estimated regression coefficients, we turn to those next. Figure 2.8 Estimated Regression Coefficients

The first column contains a list of the independent variables plus the intercept (constant). Although the estimated B coefficients are important for prediction and interpretive purposes, analysts usually look first to the t test at the end of each line to determine which independent variables are significantly related to the outcome measure. Since three variables are in the equation, we are testing if there is a linear relationship between each independent variable and the dependent measure after adjusting for the effects of the two other independent variables. Looking at the significance values we see that all three predictors are highly significant (significance values are .004 or less). If any of the variables were not significant, you would typically rerun the regression after removing them.

The column labeled B contains the estimated regression coefficients we would use to deploy the model via a prediction equation. The coefficient for length of stay indicates that, on average, each additional day spent in the hospital was associated with a claims increase of about $1,106. The coefficient for admission severity group tells us that each one-unit increase in the severity code is associated with a claims increase of $417. Finally, the age coefficient of -33 suggests that claims decrease, on average, by $33 as age increases one year. This is counterintuitive and should be examined by a domain


expert (here a physician). Perhaps the youngest patients are at greater risk. If there isn't a convincing reason for this negative association, the data values for age and claims should be examined more carefully (perhaps data errors or outliers are influencing the results). Such oddities may have shown up in the original data exploration. We will not pursue this issue here, but it certainly would be done in practice.

The constant or intercept of $3,027 indicates that someone with 0 days in the hospital, in the least severe illness category (0), and at age 0 would be expected to file a claim of $3,027. This is clearly impossible. This odd result stems in part from the fact that no one in the sample had less than 1 day in the hospital (it was an inpatient procedure) and the patients were adults (no ages of 0), so the intercept projects well beyond where there are any data. Thus the intercept cannot represent an actual patient, but still may be needed to fit the data. Also, note that when using regression it can be risky to extrapolate beyond where the data are observed; the assumption is that the same pattern continues. Here it clearly cannot!

The Standard Error (of B) column contains standard errors of the estimated regression coefficients. These provide a measure of the precision with which we estimate the B coefficients. The standard errors can be used to create a 95% confidence band around the B coefficients (available as a Statistics option). In our example, the regression coefficient for length of stay is $1,106 and the standard error is about $104. Thus we would not be surprised if in the population the true regression coefficient were $1,000 or $1,200 (within two standard errors of our sample estimate), but it is very unlikely that the true population coefficient would be $300 or $2,000.

Betas are standardized regression coefficients and are used to judge the relative importance of each of several independent variables. They are important because the values of the regression coefficients (Bs) are influenced by the standard deviations of the independent variables, and the beta coefficients adjust for this. Here, not surprisingly, length of stay is the most important predictor of claims amount, followed by severity group and age. Betas typically range from -1 to 1, and the further from 0, the more influential the predictor variable.

Thus if we wish to predict claims based on length of stay, severity code and age, the formula would use the B coefficients:

Predicted Claims = $1,106*(length of stay) + $417*(severity code) - $33*(age) + $3,027
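Because the model reduces to a single equation, deployment can be sketched in a few lines of any language. The Python function below simply applies the rounded coefficients reported above; the example inputs are made up.

    def predicted_claim(length_of_stay, severity_code, age):
        # Apply the estimated regression equation (coefficients rounded as reported)
        return 1106.0 * length_of_stay + 417.0 * severity_code - 33.0 * age + 3027.0

    # Example: a hypothetical 4-day stay, severity group 2, 55-year-old patient
    print(round(predicted_claim(4, 2, 55), 2))   # predicted claim in dollars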

Points Poorly Fit by Model

The motivation for this analysis is to detect errors or possible fraud by identifying cases that deviate substantially from the model. As mentioned earlier, these need not be the result of errors or fraud, but they are inconsistent with the majority of cases and thus merit scrutiny. We first turn to a list of cases whose residuals are more than three standard deviations from 0 (a residual of 0 indicates the model perfectly predicts the outcome).


    Figure 2.9 Outliers

There are two cases for which the claims value is more than three standard deviations from the regression prediction. Both are about $6,000 more than expected from the model. Note that they are 5.5 and 6.1 standard deviations away from the model predictions. These would be the claims to examine more carefully. The case sequence numbers for these records appear in the listing; an identification field could be substituted (through the Case Labels box within the Linear Regression dialog).

Figure 2.10 Histogram of Residuals

    This histogram of the standardized residuals presents the overall distribution of the errors. It is clear that all large residuals are positive (meaning the model under-predicted the claims value). Case (record) identification is not available in the histogram, but since the standardized residuals were added to the data file, they can be easily selected and examined.


Calculating a Percent Error

Instead of standardizing the residuals, analysts may prefer to express the residual as a percent deviation from the prediction. Such a measure may be easier to communicate to a wider audience.

Move to the Data Editor window
Click Transform..Compute
Type presid in the Target text box
Enter the following into the Expression text box: 100 * (claim - pre_1) / pre_1

    Figure 2.11 Compute Dialog Box with Percentage Deviation from Model

Each case's deviation from the model (claim - pre_1) is divided by the model prediction (pre_1) and converted to a percent (multiplied by 100).

Click OK
Scroll to the right in the Data Editor window


    Figure 2.12 Percent Deviation Field

Extreme values on this percent deviation field can also be used to identify exceptional claims. While we won't pursue it here, a histogram would display the distribution of the deviations, and cases with extreme values could be selected for closer examination. Unusual values could appear at both the high and low ends, with low values indicating the claim was much less than predicted by the model. These might be examined as well, since they might reflect errors or suggest less expensive variations on the treatment.

In this section, we offered the search for deviations from a model as a method to identify data errors or possible fraud. It would not detect, of course, fraudulent claims consistent with the model prediction. In actual practice, such models are usually based on a much greater number of predictor variables, but the principles, whether using regression or more complex models such as neural networks, are largely the same.
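The same percent-deviation calculation and the selection of extreme cases can be sketched in Python/pandas; the values below are made up, and the 30% threshold is arbitrary, chosen only to show the selection step.

    import pandas as pd

    # Hypothetical scored records containing the claim and the saved prediction pre_1
    claims = pd.DataFrame({
        "claim": [5200.0, 9800.0, 4100.0, 15200.0],
        "pre_1": [5000.0, 6100.0, 4600.0,  9000.0],
    })

    # Percent deviation from the model: 100 * (claim - pre_1) / pre_1
    claims["presid"] = 100 * (claims["claim"] - claims["pre_1"]) / claims["pre_1"]

    # Select claims far above or below the model prediction
    exceptional = claims[claims["presid"].abs() > 30]
    print(exceptional)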

Appropriate Research Projects

Other examples of questions for which linear regression is appropriate are:
• Predict expected revenue in dollars from a new customer based on customer characteristics.
• Predict sales revenue for a store (with a sufficiently large number of stores in the database).
• Predict waiting time on hold for callers to an 800 number.


Other Features

There is no limit to the size of data files used with linear regression, but just as with discriminant, most uses of regression limit the number of predictors to a manageable number, say under 50 or so. As before, there is then no reason for extremely large file sizes. The use of stepwise regression is quite common. Since this involves selection of a few predictors from a larger set, it is recommended that you validate the results with a validation data set when you use a stepwise method.

Although this technique is called linear regression, with the use of suitable transformations of the predictors, it is possible to model non-linear relationships. However, more in-depth knowledge is needed to do this correctly, so if you expect non-linear relationships to occur in your data, you might consider using neural networks or classification and regression trees, which handle these more readily, if differently.

Model Understanding

Linear regression produces very easily understood models, as we can see from the table in Figure 2.8. As noted, graphical results are less helpful with more than a few predictors, although graphing the prediction error against other variables can lead to insights about where the model fails.

Model Deployment

Predictions for new cases are made from one equation using the unstandardized regression coefficient estimates. Any convenient software for doing this calculation can be employed, and regression equations can therefore be applied directly to data warehouses, not only to extracted datasets. This makes the model easily deployable.

DISCRIMINANT ANALYSIS

Discriminant analysis, a technique used in market research and credit analysis for many years, is a general linear model method, like linear regression. It is used when you want to build a predictive model of group or category membership, based on linear combinations of predictor variables that are either continuous (age) or categorical variables represented by dummy variables (type of customer). Most of the predictors should be truly interval scale, or else the multivariate normality assumption will be violated. Discriminant is available in SPSS under the Analyze..Classify menu. Discriminant follows from a view that the domain of interest is composed of separate populations, each of which is measured on variables that follow a multivariate normal distribution. Discriminant attempts to find the linear combinations of these measures that best separate the populations. This is represented in Figure 2.13, which shows one discriminant function derived from two input variables, X and Y, that can be used to predict membership in a dependent variable: Group. The score on the discriminant function separates cases in group 1 from group 2, using the midpoint of the discriminant function (the short line segment).

Figure 2.13 Discriminant Function Derived From Two Predictors

A Discriminant Example: Predicting Purchases

To demonstrate discriminant analysis we take data from a study in which respondents answered, hypothetically, whether they would accept an interactive news subscription service (via cable). There was interest in identifying those segments most likely to adopt the service. Several demographic variables were available: education, gender, age, income category, number of children, number of organizations the respondent belonged to, and the number of hours of TV watched per day. The outcome measure was whether they would accept the offering. Most of the predictor variables are interval scale, the exceptions being gender (a dichotomy) and income (an ordered categorical variable). We would expect few if any of these variables to follow a normal distribution, but will proceed with discriminant. As in our other examples, we will move directly to the analysis, although ordinarily you would run data checks and exploratory data analysis first.

Click File..Open..Data
Move to the c:\Train\DM_Model directory (if necessary)
Double click on Newschan (respond No if asked to save Data Editor contents)
Click Analyze..Classify..Discriminant
Click newschan, then click the upper arrow to move it into the Grouping Variable list box

Notice that two question marks appear beside newschan in the Grouping Variable list box. This is because Discriminant can be applied to more than two outcome groups and expects a minimum and maximum group code. The news channel acceptance variable is coded 0 (no) and 1 (yes), and we use the Define Range pushbutton to supply this information.

Click Define Range pushbutton (not shown)
Type 0 in the Minimum text box
Click in the Maximum text box, and type 1
Click Continue to process the range

    The default Method within Discriminant is to run the analysis using all the predictor variables. For the typical data mining application, you would probably invoke a stepwise option that will enter predictor variables into the equation based on statistical criteria instead of forcing all predictors into the model.

Click and drag from age to tvday to select them
Click the lower arrow to place the selected variables in the Independents: list box
Click the Use stepwise method option button

    Figure 2.14 Discriminant Analysis Dialog Box

Click the Classify pushbutton
Click the Summary table checkbox
Click the Leave-one-out classification checkbox

    Figure 2.15 Classification Dialog Box

The Classification dialog box controls the results displayed when the discriminant model is applied to the data. The most useful table does not print out by default (because misclassification summaries require a second data pass), but you can easily request a summary classification table, which reports how well the model predicts the outcome measure. Without this table you cannot effectively evaluate the discriminant analysis, so you should make a point of asking for it. The "leave-one-out" variation classifies each case based on discriminant coefficients calculated while that case is excluded from the analysis. This method is a form of n-fold validation and provides a classification table that should generalize at least slightly better to other samples. Since we have a relatively small data file, rather than splitting it into training and validation samples, we will use the leave-one-out classification for validation purposes.
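The leave-one-out idea itself is not specific to SPSS: each case is classified by a model estimated with that case held out, and the hit rate is tallied across all cases. A minimal sketch using scikit-learn (placeholder data; shown only to make the procedure concrete):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))                            # placeholder predictors
    y = (X[:, 0] + rng.normal(size=100) > 0).astype(int)     # placeholder 0/1 outcome

    # Fit n models, each with one case left out, and score the held-out case.
    scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
    print("leave-one-out hit rate:", scores.mean())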

You can use the Prior Probabilities area to provide Discriminant with information about the distribution of the outcome in the population. By default, before examining the data, Discriminant assumes an observation is equally likely to belong to each outcome group. If you know that the sample proportions reflect the distribution of the outcome in the population, you can instruct Discriminant to make use of this information. For example, if an outcome category is very rare, Discriminant can use this fact in its prediction equation. Using the dialog box, priors can be set to the sample sizes, and with syntax you can directly specify the population proportions. In our instance, we don't know what the proportions would be, so we retain the default.

Click Continue to process Classification choices
Click the Statistics pushbutton
Click the Fisher's checkbox in the Function Coefficients area
Click the Unstandardized checkbox in the Function Coefficients area

    Figure 2.16 Statistics Dialog Box

Either Fisher's coefficients or the unstandardized discriminant coefficients can be used to deploy the model for future observations (customers). Both sets of coefficients produce the same predictions. If there are only two outcome categories (as is our situation), either is easy to use. If you want to try what-if scenarios using a spreadsheet, the unstandardized coefficients (since they involve a single equation in the two-outcome case) would be more convenient to work with. If you run discriminant with more than two outcome categories, then Fisher's coefficients are easier to apply as prediction rules. If you suspect some of the predictors are highly related, you might view the within-groups correlations among the predictor variables to identify highly correlated predictors.

Click Continue to process Statistics requests

Now we are ready to run the stepwise discriminant analysis. The Select pushbutton can be used to have SPSS select part of the data to estimate the discriminant function, and then apply the predictions to the other part (cross-validation). We would use this method of validation in place of the leave-one-out method if our data set were larger. The Save pushbutton will create new variables that contain the group membership predicted from the discriminant function and the associated probabilities. To retain predictions for the training data set, you would use the Save dialog to create these variables.

Click OK to run the analysis
Scroll to the Classification Results table at the bottom of the Viewer window

Figure 2.17 Classification Results Table

Although this table appears at the end of the discriminant output, we turn to it first. It is an important summary since it tells us how well we can expect to predict the outcome. There are two subtables, with Original referring to the training data and Cross-Validated supplying the leave-one-out results. The actual (known) groups constitute the rows and the predicted groups make up the columns of the table. Looking at the Original section, of the 227 people surveyed who said they would not accept the offering, the discriminant model correctly predicted 157 of them, so its accuracy is 69.2%. For the 214 respondents who said they would accept the offering, 66.4% were correctly predicted. Thus overall, the discriminant model was accurate in 67.8% of the cases. The Cross-Validated summary is very close (67.3% accurate overall). Is this performance good? If we simply guess the larger group 100% of the time, we would be correct 227 times out of 441 (227 + 214), or about 51.5% of the time. The 67.8% and 67.3% correct figures, while certainly far from perfect accuracy, do far better than guessing. Whether you would accept this figure and review the remaining output, or go back to the drawing board, is largely a function of the level of predictive accuracy required. Since we are interested in discovering which characteristics are associated with someone who accepts the news channel offer, we proceed.
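The percentages in the classification table are simple arithmetic on the cell counts. A worked check using the figures quoted above (the 142 correctly classified acceptors is implied by the 66.4% figure):

    no_total, no_correct = 227, 157     # refusers: total and correctly classified
    yes_total, yes_correct = 214, 142   # acceptors: total and correctly classified

    print(no_correct / no_total)                                  # about 0.692
    print(yes_correct / yes_total)                                # about 0.664
    print((no_correct + yes_correct) / (no_total + yes_total))    # about 0.678
    print(max(no_total, yes_total) / (no_total + yes_total))      # baseline guess: about 0.515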

Stepwise Results

Age is entered first, followed by gender and education. A significance test (Wilks' lambda) of between-group differences is performed for the variables at each step. None of the other variables made a significant difference after adjusting for the first three. As an exercise you might rerun the analysis with the additional variables entered and compare the classification results.

Figure 2.18 Stepwise Results

This summary is followed by one entitled "Variables in the Analysis" (not shown), which lists the variables included in the discriminant analysis at each step. For the variables selected, tolerance is shown. It measures the proportion of variance in each predictor variable that is independent of the other predictors in the equation at this step. As tolerance values approach 0 (say below .1 or so) the data approach multicollinearity, meaning the predictor variables are highly interrelated, and interpretation of individual coefficients can be compromised. Note that discriminant coefficients are only calculated after the stepwise phase is complete.

    Figure 2.19 Standardized Coefficients and Structure Matrix

The standardized discriminant coefficients can be used as you would regression Beta coefficients, in that they attempt to quantify the relative importance of each predictor in the discriminant function. Not surprisingly, age is the dominant factor. The signs of the coefficients can be interpreted with respect to the group means on the discriminant function (see Figure 2.20). An older individual will have a higher discriminant score, since the age coefficient is positive. The outcome group accepting the offering has a positive mean (see Figure 2.20), and so older people are more likely to accept the offering. Notice the coefficient for gender is negative. Other things being equal, shifting from a man (code 0) to a woman (code 1) results in a one-unit change, which when multiplied by the negative coefficient lowers the discriminant score and moves the individual toward the group with a negative mean (those that don't accept the offering). Thus women are less likely to accept the offering, adjusting for the other predictors.

    Figure 2.20 Unstandardized Coefficients and Group Means (Centroids)

Back in Figure 2.13 we saw a scatterplot of two separate groups and the axis along which they could be best separated. Unstandardized discriminant coefficients, when multiplied by the values of an observation, project an individual onto this discriminant axis (or function) that separates the groups. If you wish to use the unstandardized coefficient estimates for prediction purposes, you simply multiply a prospective customer's education, gender and age values by the corresponding unstandardized coefficients and add the constant. Then you compare this value to the cut point (by default the midpoint) between the two group means (centroids) along the discriminant function (the means appear in Figure 2.20). If the prospective customer's value is greater than the cut point you predict the customer will accept; if the score is below the cut point, you predict the customer will not accept. This prediction rule is easy to implement with two groups, but involves much more complex calculations when more than two groups are involved. It is in a convenient form for what-if scenarios: for example, if we have a male with 16 years of education, at what age would such an individual be a good prospect? To answer this we determine the age value that moves the discriminant score above the cut point.
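As a sketch of that scoring rule (the coefficient and centroid values below are placeholders, not the numbers in Figure 2.20; substitute the values from your own output):

    # Hypothetical unstandardized coefficients and group centroids.
    B_EDUC, B_GENDER, B_AGE, CONSTANT = 0.10, -0.70, 0.06, -3.0
    CENTROID_NO, CENTROID_YES = -0.40, 0.42

    def discriminant_score(educate, gender, age):
        return B_EDUC * educate + B_GENDER * gender + B_AGE * age + CONSTANT

    cut_point = (CENTROID_NO + CENTROID_YES) / 2   # midpoint rule for two groups
    score = discriminant_score(educate=16, gender=0, age=45)
    print("accept" if score > cut_point else "not accept", round(score, 2))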

    Figure 2.21 Fisher Classification Coefficients

The Fisher function coefficients can be used to classify new observations (customers). If we know a prospective customer's education (say 16 years), gender (female = 1) and age (30), we multiply these values by the set of Fisher coefficients for the No (no acceptance) group (2.07*16 + 1.98*1 + .32*30 - 20.85), which yields a numeric score. We repeat the process using the coefficients for the Yes group and obtain another score. The customer is then placed in the outcome group for which she has the higher score. Thus the Fisher coefficients are easy to incorporate later into other software (spreadsheets, databases) for predictive purposes. We did not test for the assumptions of discriminant analysis (normality, equality of within-group covariance matrices) in this example. In general, normality does not make a great deal of difference, but heterogeneity of the covariance matrices can, especially if the sample group sizes are very different. Here the sample sizes were about the same. For a more detailed discussion of problems with assumption violation in discriminant analysis see Lachenbruch (1975) or Huberty (1994). As mentioned earlier, whether you consider the hit rate here to be adequate really depends on the costs of errors, the benefits of a correct prediction and what your alternatives are. Here, although the prediction was far from perfect, we were able to identify the relations between the demographic variables and the choice outcome.
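A sketch of the Fisher scoring rule follows. The No-group coefficients are the ones quoted above; the Yes-group values are placeholders, so in practice both rows would be taken from Figure 2.21:

    FISHER = {
        "No":  {"educate": 2.07, "gender": 1.98, "age": 0.32, "constant": -20.85},
        "Yes": {"educate": 2.10, "gender": 1.40, "age": 0.41, "constant": -24.00},  # hypothetical
    }

    def classify(educate, gender, age):
        scores = {group: c["educate"] * educate + c["gender"] * gender
                         + c["age"] * age + c["constant"]
                  for group, c in FISHER.items()}
        return max(scores, key=scores.get), scores

    # Education 16, female (1), age 30: the No-group score works out to 23.85.
    print(classify(educate=16, gender=1, age=30))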

Appropriate Research Projects

Examples of questions for which discriminant analysis is appropriate are:

Predict instances of fraud in all types of situations, including credit card, insurance, and telephone usage.
Predict whether customers will remain or leave (churn or not).
Predict which customers will respond to a new product or offer.
Predict outcomes of various medical procedures.

Other Features

In theory, there is no limit to the size of data files for discriminant analysis, either in terms of records or variables. However, practically speaking, most applications of discriminant limit the number of predictors to a few dozen at most. With that number of predictors, there is usually no reason to use more than a few thousand records. It is possible to use stepwise methods with discriminant, so that the software can select the best set of predictors from a larger potential group. In this sense, stepwise discriminant can be considered an automated procedure like decision trees. As a result, if you use a stepwise method, you should use a validation dataset on which to check the model derived by discriminant.

Model Understanding

Discriminant analysis produces easily understood results. We have already seen the classification table in Figure 2.17. In addition, the procedure calculates the relative importance of each variable as a predictor (standardized coefficients; see Figure 2.19). Graphical output is produced by discriminant, but with more than a few predictors it becomes less useful.

Model Deployment

Predictions for new cases are made from simple equations using the classification function coefficients (especially the Fisher coefficients). This means that any statistical program, or even a spreadsheet program, could be used to generate new predictions, and that the model can be applied directly to data warehouses, not only extracted data sets. This makes the model easily deployable.

LOGISTIC AND MULTINOMIAL REGRESSION

Logistic regression is similar to discriminant analysis in that it attempts to predict a categorical dependent variable. And it is similar to linear regression in that it uses the general linear model as its theoretical underpinning, and so calculates regression coefficients and tries to fit the data to a line, although not a straight one. A common application would be predicting whether someone renews an insurance policy. The outcome variable should be categorical. SPSS has three procedures that can be used to build logistic regression models. The Binary Logistic procedure and the Multinomial Logistic procedure are both found in the Analyze..Regression menu. The former is used only for dichotomous dependent variables; the latter can handle dependent variables with two or more categories. See the manual SPSS Regression Models, Chapter 1, for a discussion of when to use each for a dichotomous outcome. In addition, the Ordinal Regression procedure models an ordinal outcome variable. Logistic regression follows from a view that the world is truly continuous, and so the procedure actually predicts a continuous function that represents the probability associated with being in a particular category of the dependent variable. This is represented in Figure 2.22, which displays the predicted relationship between household income and the probability of purchase of a home. The S-shaped curve is the logistic curve, hence the name for this technique. The idea is that at low income, the probability of purchasing a home is small and rises only slightly with increasing income. But at a certain point, the chance of buying a home begins to increase in almost a linear fashion, until eventually most people with substantial incomes have bought homes, at which point the function levels off again. Thus the outcome variable varies from 0 to 1 because it is measured in probability.

    Figure 2.22 The Logistic Function

    After the procedure calculates the outcome probability, it simply assigns a case to a predicted category based on whether its probability is above .50 or not. The same basic approach is used when the dependent variable has three or more categories.

    In Figure 2.22, we see that the logistic model is a nonlinear model relating predictor variables to the probability of a choice or event (for example, a purchase). If there are two predictor variables (X1, X2), then the logistic prediction equation can be expressed as:

prob(event) = exp(B1*X1 + B2*X2 + A) / (1 + exp(B1*X1 + B2*X2 + A))

    where exp() represents the exponential function. The conceptual problem is that the probability of the event is not linearly related to the predictors. However, if a little math is done you can establish that the odds of the event occurring are equal to:

exp(B1*X1 + B2*X2 + A), which equals exp(B1*X1) * exp(B2*X2) * exp(A)

Although not obviously simpler to the eye, the second formulation (and SPSS displays the logistic coefficients in both the original form and raised to the exponential power) allows you to state how much the odds of the event change with a one-unit change in a predictor. For example, if I stated that the odds of making a sale double if a resource is given to me, everyone would know what I meant. With this in mind we will look at the coefficients in the logistic regression equation and try to interpret them. Recall that logistic regression assumes that the predictor variables are interval scale, and, as with regression, dummy coding of predictors can be performed. As such, its assumptions are less restrictive than discriminant.
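To make the odds interpretation concrete, here is a small sketch with made-up coefficients (B1, B2 and A are illustrative, not estimates from any analysis in this chapter):

    import math

    B1, B2, A = 0.8, -0.3, -1.0   # illustrative coefficients

    def probability(x1, x2):
        z = B1 * x1 + B2 * x2 + A
        return math.exp(z) / (1 + math.exp(z))

    def odds(x1, x2):
        p = probability(x1, x2)
        return p / (1 - p)

    # A one-unit increase in X1 multiplies the odds by exp(B1), whatever the
    # starting values -- the "odds shift" reading of a logistic coefficient.
    print(odds(2, 1) / odds(1, 1), math.exp(B1))   # both about 2.23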

A Logistic Regression Example: Predicting Purchases

We will apply logistic regression to the same problem of discovering which demographics are related to acceptance of an interactive news service. However, instead of running a stepwise method we will apply the variables selected by our discriminant analysis and compare the results.

Click Analyze..Regression..Binary Logistic
Move newschan into the Dependent list box
Move age, educate and gender into the Covariates: list box

    Figure 2.23 Logistic Regression Dialog Box

This is all we need in order to run a standard logistic regression analysis. Notice the Interaction button. You can create interaction terms by clicking on two or more predictor variables in the original list, then clicking on the Interaction button. Also, you can use the Categorical pushbutton to have Logistic Regression create dummy coded (or contrast) variables to substitute for your categorical predictor variables (note that Clementine performs such operations automatically in its modeling nodes). The Save pushbutton allows you to create new variables containing the predicted probability of the event, and various residual and influence measures. As in Discriminant, the Select pushbutton will estimate the model from part of your sample (you provide a selection rule) and apply the prediction equation to the other part of the data (cross-validation). The Options pushbutton provides control over the criteria used in stepwise analysis.

Click the Save pushbutton
Click the Probabilities check box
Click the Group membership check box

    Figure 2.24 Logistic Regression: Save Dialog Box

The Logistic Regression procedure can create new variables to store various types of information. Influence statistics, which measure the influence of each point on the logistic analysis, can be saved. A variety of residual measures, which identify poorly fit data, can also be retained. When scoring, the predicted probability provides a score for each observation that is used to classify it into an outcome category. We save the probabilities and predicted group here in order to demonstrate how they can be used in a gains table to evaluate the effectiveness of the model.

Click Continue
Click OK to run the analysis
Scroll down to the Classification Table in the Block 1 section of the Viewer window

    Figure 2.25 Classification Table

The classification results table indicates that those refusing the offer were predicted with 70.5% accuracy and those accepting with 61.7% accuracy, for an overall correct classification of 66.2%. The logistic model predicted slightly better for the refusals, and about 4 percentage points worse for the acceptances, so overall it does slightly worse (about 2 percentage points) than discriminant on the training sample. The default classification rule is that if a case's predicted probability of belonging in the outcome group with the higher value (here 1) is greater than or equal to .5, then predict membership in that group. Otherwise, predict membership in the group with the lower outcome value (here 0). We will examine these predicted probabilities in more detail later.

Figure 2.26 Significance Tests and Model Summary

    The Model Chi-square test provides a significance test for the entire model (three variables) similar to the overall F test in regression. We would say there is a significant relation between the three predictors and the outcome. The Step Chi-square records the change in chi-square from one step to the next and is useful when running stepwise methods.

    Figure 2.27 Model Summary

The pseudo r-square is a statistic modeled after the r-square in regression (discussed earlier in this chapter). It measures how much of the initial lack-of-fit chi-square is accounted for by the variables in the model. Both variants indicate the model only accounts for a modest amount of the initial unexplained chi-square. Now let's move to the variables in the equation.

Figure 2.28 Variables in the Equation

The B coefficients are the actual logistic regression coefficients, but recall they bear a nonlinear relationship to the probability of accepting the offer. Although they do relate linearly to the log odds of accepting, most people do not find this metric helpful for interpretation. The second column (S.E.) contains the standard errors for the B coefficients. The Wald statistic is used to test whether the predictor is significantly related to the outcome measure, adjusting for the other variables in the equation (all three are highly significant). The last column presents the B coefficient exponentiated using the e (exponential) function, and we can interpret these coefficients in terms of an odds shift in the outcome. For example, the exponentiated coefficients of age and education are above 1, meaning that the odds of accepting the offer increase with increasing age and education. The coefficient for age indicates that the odds increase by a factor of 1.06 per year, which seems rather small. However, recall that age can range from 18 to almost 90 years old, and a 20-year age difference would have a substantial impact on the odds of accepting the offering (the odds more than triple). The exponentiated coefficient for gender is about .5, indicating that with the other factors held constant, moving from male to female reduces the odds of accepting the offering by about half. In this way you can express the effect of a predictor variable in terms of an odds shift. You can use the B coefficients for your prediction equation:

prob of accepting = exp(.107*educate - .738*gender + .060*age - 3.5) / (1 + exp(.107*educate - .738*gender + .060*age - 3.5))

The results of the logistic regression confirm that age, gender and education are related to the probability of a potential customer accepting the offering. Although not shown, we ran logistic using a stepwise method and obtained the same model.
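A sketch of deploying this equation outside SPSS, using the rounded coefficients shown above (so results are approximate), together with the odds-shift check for a 20-year age difference:

    import math

    def prob_accept(educate, gender, age):
        z = 0.107 * educate - 0.738 * gender + 0.060 * age - 3.5
        return math.exp(z) / (1 + math.exp(z))

    p = prob_accept(educate=16, gender=1, age=30)
    print(round(p, 3), "accept" if p >= 0.5 else "not accept")

    # Twenty extra years of age multiplies the odds by exp(20 * .060),
    # i.e. roughly 1.06 ** 20 -- a bit more than a tripling of the odds.
    print(math.exp(20 * 0.060), 1.06 ** 20)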

Appropriate Research Projects

Examples of questions for which logistic regression is appropriate are:

Predicting whether a person will renew an insurance policy.
Predicting instances of fraud.
Predicting which product someone will buy, or a response to a direct mail offer.
Predicting that a product is likely to fail.

Other Features

As with the other techniques we have discussed, there is no limit to the size of data files. As previously, usually a limited number of predictors are used for any problem, so file sizes can be reasonably small. Stepwise logistic regression is available in the Binary Logistic procedure, but not in the Multinomial Logistic procedure, and the standard caveats apply about using a validation data set when stepwise variable selection is done.

Model Understanding

Although the logistic model is inherently more complex, it is not unduly so compared to linear regression. When the results are translated into odds on the dependent variable, they become quite helpful to decision makers. As before, graphical representations of the solution are less helpful with more than a few predictors.

Model Deployment

For predictions with binary logistic regression, only one equation is involved, so the model is easily deployable. For multinomial regression, more than one equation is involved, which requires more calculations, but this doesn't make prediction for new cases that much more difficult.