Mlr Article Digest

  • 8/2/2019 Mlr Article Digest

    Multiple Linear Regression Modeling

    An eBook: Understand, build and use MLR models using RapidMiner for predicting sales

    Bala Deshpande, Ph.D., MBA

    MLR Digest: How to build and use multiple linear regression for business analytics, an eBook by SimaFore

    SimaFore LLC Page 1

    Table of Contents

    Chapter 1: Multiple Linear Regression Business Problem and Data
    Chapter 2: Setting up MLR using RapidMiner
    Chapter 3: Identifying most important variables: Feature Selection
    Chapter 4: Checkpoints to ensure regression model validity

    Chapter 1: Multiple Linear Regression Business Problem and Data

    We will describe one of the most commonly used data mining techniques, multiple linear
    regression, in this eBook. According to the Rexer Analytics Survey, regression models are one of
    the three most common analytics tools used today by practitioners. In the first chapter we will
    discuss the problem we are trying to address and the data, and give a quick introduction to using
    regression models. In chapters 2 and 3 we will dig into the mechanics of using RapidMiner
    to do the data preparation, model building, and validation. Finally, in chapter 4 we will describe
    some checkpoints to ensure that MLR is used correctly.

    The business problem

    The fundamental issue for all businesses and specifically small and medium enterprises (SME) is

    the need to grow revenues. Understanding and increasing the likelihood that someone will buy

    again from the company is critical. Another important question that would help strategically is

    predicting how much money a customer is likely to spend given data about their previous
    purchase habits. The business problem we are looking at here is the second issue.

    About predictive vs. explanatory models

    An important distinction needs to be made here: understanding why someone purchased
    from the company falls into the realm of "explanatory modeling", whereas predicting how
    much someone is likely to spend falls into the realm of "predictive analytics". Addressing the
    second problem is predicated on the availability of large data volumes.

    http://www.rexeranalytics.com/Data-Miner-Survey-Results-2010.html

    Multiple linear regression can be applied in a variety of situations ranging from predicting

    customer activity based on demographics and historical patterns to predicting time to failure of

    machinery based on usage and operating conditions. Any situation where a numerical

    prediction, such as "how much will someone spend", is required warrants the use of regression

    models. This is in contrast to making categorical predictions such as "will buy/will not buy", "will
    fail/will not fail", where we can use either decision trees or logistic regression models.

    The main task is to find a linear equation that relates the predictors (independent variables or

    factors) to the response (dependent variable or target). If there are two or more predictors, we

    are effectively doing "multiple" regression. A note of caution before using the equation for
    prediction: we have to ensure regression models are not arbitrarily deployed and must perform
    checks to ensure regression models are valid. This is discussed in more detail in Chapter 4.
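
    Written out, the linear equation described above takes the following general form. The
    symbols used here (beta for the coefficients, epsilon for the error term, p for the number of
    predictors, n for the number of records) are notation assumed for illustration; the eBook
    itself does not define them. The second expression states the usual least-squares criterion
    by which the coefficients are estimated.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
```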

    Data

    The data consists of six predictors and one response variable. The predictors are as follows:
    historical transactions, days since last transaction, online order (y/n), gender (m/f), customer

    type (b2b/b2c), and region (domestic/international). The response variable is of course the

    amount of spend.

    Due to the small number of factors, we may not need to employ any data reduction schemes

    and in chapter 2 we will be using RapidMiner to directly build the model and explore the

    weakest/strongest predictors, most likely customer profile, and predictive accuracy.

    http://www.simafore.com/blog/?Tag=decision+trees
    http://www.simafore.com/blog/?Tag=logistic+regression+models
    http://www.simafore.com/blog/bid/56752/3-checks-to-prevent-abuse-of-regression-models
    http://www.simafore.com/blog/bid/54963/6-checkpoints-to-ensure-regression-model-validity-for-analytics

    Chapter 2: Setting up MLR using RapidMiner

    In this chapter, we will show how to set up a RapidMiner process to build a multiple linear

    regression model for the sales prediction business analytics problem described in Chapter 1.

    Before we do the actual modeling, let us do some initial analysis. This includes summarizing the
    data by using Excel pivot tables. An excellent introduction to building pivot tables in Excel can
    be found here. Additionally, we will check for correlations among the predictors, and between
    the predictors and the response variable, to avoid multicollinearity issues later on.

    Here is the data that was introduced in chapter 1, shown in a table.

    We have 2000 records (rows) and 7 predictors. The response or label (in RapidMiner

    terminology) is Column F - Purchase Amount which is what needs to be predicted. Note that the

    data is "coded", which means that instead of a column for "Online order" with rows reading
    either "Yes" or "No", we have a variable called "online order (Yes=0)" and the rows are either
    0 (implying an online order) or 1. The same coding has been applied to the other three
    non-numeric variables: gender, geographic region, type of customer. This coding will help in

    interpreting the regression coefficients later on.
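
    The 0/1 coding described above can be sketched in a few lines of Python. This is only an
    illustration: the helper name `code_binary`, the sample rows, and the choice of which gender
    level maps to 0 are assumptions; only the "online order (Yes=0)" convention comes from the
    data set described here.

```python
def code_binary(value, zero_label):
    """Return 0 when value matches zero_label (case-insensitive), else 1."""
    return 0 if value.strip().lower() == zero_label.lower() else 1

# Two made-up records, for illustration only.
records = [
    {"online_order": "Yes", "gender": "F"},
    {"online_order": "no", "gender": "M"},
]

coded = [
    {
        "online order (Yes=0)": code_binary(r["online_order"], "yes"),
        # Which gender level is coded as 0 is an assumption of this sketch.
        "gender (M=0)": code_binary(r["gender"], "m"),
    }
    for r in records
]
```

    With this coding, a regression coefficient on a coded column can be read as the change in
    predicted spend when the column moves from the level coded 0 to the level coded 1.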

    Summarizing data using pivot tables in Excel allows us to gain some early insight into the data and

    will help us understand the final model better. As seen in the summary table below, Online

    orders are a bit more than 50% whereas there is not much difference between Male and

    Female customers in terms of number of orders. However, B2B customers make up 82% of the

    orders and Domestic customers are nearly 78% of all orders.
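
    The same kind of share-of-orders summary can be produced without a pivot table. The sketch
    below is a hypothetical Python equivalent; the function name `share_by_level` and the five
    sample rows are made up for illustration and are not the eBook's data.

```python
from collections import Counter

def share_by_level(records, column):
    """Fraction of records falling into each level of a categorical column."""
    counts = Counter(r[column] for r in records)
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}

# Made-up rows: four B2B orders and one B2C order.
orders = [{"customer_type": "B2B"}] * 4 + [{"customer_type": "B2C"}]
shares = share_by_level(orders, "customer_type")
```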

    RapidMiner model setup requires 3 steps: click and drag Read Excel into the main window in
    Step 1, connect it to the Split Validation operator in Step 2, and use the "Linear Regression"
    operator in the Training Window, which opens when the nested window icon within Split
    Validation is double-clicked. Full details of these steps are described here (and in our other free
    ebook on Decision Trees here).

    http://www.timeatlas.com/5_minute_tips/chunkers/learn_to_use_pivot_tables_in_excel_2007_to_organize_data
    http://www.simafore.com/blog/bid/56752/3-checks-to-prevent-abuse-of-regression-models

    When the above model is run, RapidMiner will provide two main outputs: the actual model in

    the form of a text (linear equation) and a table. The text output is useful to interpret the model

    while the table form helps to explain the confidence level in each of the regression coefficients.

    The graphics below illustrate this.
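
    For readers who want to see the same idea outside RapidMiner, here is a minimal Python
    sketch of a split validation wrapped around a least-squares fit. It is not RapidMiner's
    implementation: the function names (`fit_linear`, `predict`, `split_validation`), the 70/30
    split ratio, and the synthetic, noise-free data are all assumptions made for illustration.

```python
import random

def fit_linear(X, y):
    """Least-squares fit via the normal equations; returns [intercept, b1, ..., bp]."""
    rows = [[1.0] + list(r) for r in X]            # prepend the intercept column
    k = len(rows[0])
    # Augmented system (A^T A | A^T y).
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]

def predict(coef, row):
    """Apply the linear equation to one row of predictor values."""
    return coef[0] + sum(b * x for b, x in zip(coef[1:], row))

def split_validation(X, y, train_ratio=0.7, seed=0):
    """Fit on a random training share, report RMSE on the held-out rest."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(train_ratio * len(idx))
    train, test = idx[:cut], idx[cut:]
    coef = fit_linear([X[i] for i in train], [y[i] for i in train])
    sq_err = [(y[i] - predict(coef, X[i])) ** 2 for i in test]
    return coef, (sum(sq_err) / len(sq_err)) ** 0.5

# Synthetic, noise-free data: spend = 50 + 10 * frequency - 20 * online
X = [[f, o] for f in range(1, 11) for o in (0, 1)]
y = [50 + 10 * f - 20 * o for f, o in X]
coef, rmse = split_validation(X, y)
```

    Because the synthetic data contains no noise, the fitted coefficients recover the generating
    equation and the hold-out error is essentially zero; real sales data will not behave this way.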

    http://www.simafore.com/blog/bid/56588/how-to-use-decision-trees-for-credit-scoring-using-rapidminer-part-2
    http://www.simafore.com/Download-ebook-Decision-Tree-Articles-Digest/

    So who is most likely to become a customer and how much are they likely to spend? Finally,

    which are the most important and least important variables? (Why can we not use the model
    coefficients to do this?) We will explore this and the model accuracy questions in the next

    chapter.

    Chapter 3: Identifying most important variables: Feature Selection

    In this chapter, we will explore two additional questions which were raised at the end of

    Chapter 2:

    1. Eliminating the least important variables from the model
    2. Identifying the characteristics of the most valuable prospect based on historical data

    We will show how to use RapidMiner to run a feature selection operator which will answer 1

    and interpret the model to answer 2.

    Feature selection or data dimension reduction or variable screening in predictive analytics
    refers to the process of identifying the few most important variables or parameters which help

    in predicting the outcome. In today's charged up world of high speed computing, one might be

    forgiven for asking, why bother? The most important reasons all come from practicality.

    Reason 1: If two or more of the independent variables (or predictors) are strongly correlated
    with each other, as well as with the dependent (or predicted) variable, then the estimates of
    coefficients in a regression model tend to be unstable or counter-intuitive.

    Example: y = 45 + 0.8x1 and y = 45 + 0.1x2 are two linear regression models which predict y.

    Both clearly indicate that if the x's increase, y also increases. If x1 and x2 are strongly
    correlated with y and with each other, then a multiple regression model might look like
    y = 45 + 0.02 x1 - 0.4 x2. In this case, because the three (x1, x2 and y) are strongly correlated,
    the overlap in information between x1 and x2 leads to a situation where x2 is in a negative
    relationship with y, meaning y will decrease with an increase in x2. This is not only the reverse
    of what was seen in the simple model, but is also counter-intuitive.
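
    One simple way to spot this situation before modeling is to measure how strongly the
    predictors are correlated with each other. The Python sketch below computes a Pearson
    correlation and, for the two-predictor case only, the variance inflation factor
    VIF = 1 / (1 - r^2). The function names and the three small data columns are made up for
    illustration, and the common "VIF above 10" warning level mentioned in the comment is a
    rule of thumb, not something this eBook prescribes.

```python
def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - r^2); this closed form holds only for exactly two predictors."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)

# Made-up columns: x2 is nearly a copy of x1, x3 is only weakly related to x1.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2]
x3 = [5, 1, 4, 8, 2, 7, 3, 6]

vif_collinear = vif_two_predictors(x1, x2)    # very large: well above the usual warning level of 10
vif_independent = vif_two_predictors(x1, x3)  # close to 1: little shared information
```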

    Reason 2: The larger the set of predictors, the higher the probability of having missing
    values in the data. If we choose to delete cases which have missing values for some
    predictors, we may end up with a shortage of samples.

    Example: A practical rule of thumb used by data miners is to have at least 5(p+2) samples,
    where p is the number of predictors. If your data set is sufficiently large and this rule is easily
    satisfied, then you may not be risking much by deleting cases. But if your data is from an
    expensive market survey, for example, a systematic procedure to reduce the number of
    variables may mean that you don't have to face this problem of losing samples. It is
    better to lose variables which don't impact your prediction than to lose somewhat more
    expensive samples.
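
    The rule of thumb is easy to turn into a quick check. In the sketch below the function names
    are hypothetical; the formula 5(p+2) is the one quoted above.

```python
def min_samples(p):
    """Minimum sample count suggested by the 5(p+2) rule of thumb."""
    return 5 * (p + 2)

def enough_after_deletion(n_complete_rows, p):
    """True if the rows left after deleting incomplete cases still satisfy the rule."""
    return n_complete_rows >= min_samples(p)

# With the six predictors of this eBook's data set, the rule asks for at least 40 samples.
required = min_samples(6)
```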

    There are several other more technical reasons for reducing data dimensionality which will be

    explored in subsequent articles. In a later article, we will discuss some common techniques for
    actually implementing this process.

    Backward Elimination to reduce dataset

    The process logic which RapidMiner uses is not "linear", but recursive. We don't apply
    operators linearly, one after another. The graphic below explains how this nesting was used in

    setting up the training and testing of Linear Regression operator for the analysis we did in

    chapter 2. The red arrow indicates that the training and testing process was nested within the

    "Split Validation" operator.

    In order to introduce the feature selection method, we need to tuck the training and testing

    process inside another sub-process called the Learning Process. The learning process is nested

    inside the "Backward Elimination" operator. We now have two nestings as schematically shown

    below.

    Finally, the image below shows how to access the Backward Elimination operator. Double-
    clicking on the Backward Elim operator opens up the Learning Process, which will now contain

    the Split Validation operator used earlier.

    There is one more step to complete before running this model. Simply connecting the Backward
    Elim operator's ports to the output will not show us the final regression model equation. To be
    able to see that, we need to connect the "exa" port of the Backward Elim operator to another
    "Linear Regression" operator in the main process. The output of this operator will contain the

    model which can be examined in the Results perspective. The graphic below shows this.
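
    The nesting described above (a validation step inside a feature-elimination loop, followed by
    a final fit on the reduced data) can be sketched in plain Python. This is a generic
    backward-elimination sketch, not RapidMiner's operator: the stopping rule (stop when removing
    any remaining variable raises hold-out RMSE by more than `tol`), the 70/30 split, all function
    names, and the synthetic data are assumptions made for illustration.

```python
import random

def fit_linear(X, y):
    """Least-squares fit via the normal equations; returns [intercept, b1, ..., bp]."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    for c in range(k):                               # Gauss-Jordan with partial pivoting
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]

def holdout_rmse(features, X, y, train, test):
    """Inner step: fit on the training rows using only `features`, score on the test rows."""
    def sub(i):
        return [X[i][j] for j in features]
    coef = fit_linear([sub(i) for i in train], [y[i] for i in train])
    err = [(y[i] - coef[0] - sum(b * v for b, v in zip(coef[1:], sub(i)))) ** 2
           for i in test]
    return (sum(err) / len(err)) ** 0.5

def backward_elimination(X, y, names, tol=1e-6, seed=0):
    """Outer loop: drop one variable at a time while the hold-out error does not get worse."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(0.7 * len(idx))
    train, test = idx[:cut], idx[cut:]
    kept = list(range(len(names)))
    best = holdout_rmse(kept, X, y, train, test)
    while len(kept) > 1:
        trials = [(holdout_rmse([f for f in kept if f != d], X, y, train, test), d)
                  for d in kept]
        score, drop = min(trials)
        if score > best + tol:
            break                                    # every removal hurts: stop
        kept.remove(drop)
        best = score
    return [names[f] for f in kept], best

# Synthetic data: spend depends on frequency and online, but not on region.
names = ["frequency", "online", "region"]
X = [[f, o, g] for f in range(1, 11) for o in (0, 1) for g in (0, 1)]
y = [50 + 10 * f - 20 * o for f, o, g in X]

kept, score = backward_elimination(X, y, names)
# Final fit on the reduced data, analogous to the second Linear Regression operator.
cols = [names.index(k) for k in kept]
final_coef = fit_linear([[r[j] for j in cols] for r in X], y)
```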

    What variables have been eliminated by Backward Elimination?

    Comparing the two regression equations (above and in chapter 2) we can see that the variables

    B2C=1 and Domestic=1 have been removed. What are the advantages of this, if any? It implies
    that in the future, it may not be necessary to collect these two pieces of data to predict
    spending amount.

    What characteristics does a high spending customer have?

    Referring back to the regression model shown in the graphic above, we see that amount of

    spend increases with Purchase frequency. Also, customers who order online tend to spend less.
    More recent purchasers tend to spend more, and finally, if a contact was made with a prospect
    recently (Last Update), they tend to spend more.

    In the final chapter of this ebook, we will discuss some tips to make sure that regression

    modeling is applied correctly.

    Chapter 4: Checkpoints to ensure regression model validity

    While acknowledging the general overall risk in using models, it is important to know how to

    mitigate some of these risks. In this article, we will specifically focus on 6 checkpoints to ensure

    that bivariate analyses used to develop models (such as simple regression models), or to verify

    if two parameters are related, are valid. Finally, we will briefly mention some advantages of

    using mutual information over simple regression models for bivariate analysis.

    Checkpoint 1: The first check point to consider before accepting any simple regression model is

    of course to quantify the r-squared, which is also known as the "coefficient of determination".

    R-squared measures how much of the variability in the dependent parameter is explained
    by the independent parameter.

    Addendum to 1: In most cases of linear regression the r-squared value lies between 0 and 1.
    The ideal range for r-squared varies across applications; for example, in social and behavioral
    science models, low values are typically acceptable. Generally, very low values (roughly < 0.2)
    indicate that the variables in your model do not explain the outcome satisfactorily. Similarly,
    very high values (> 0.8) can indicate too high a dependency, which may lower the predictive
    ability of the model.
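
    For reference, r-squared can be computed directly from actual and predicted values as
    1 - SS_res / SS_tot. The function name and the tiny example values below are assumptions
    for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

perfect = r_squared([1, 2, 3], [1, 2, 3])     # predictions match exactly
mean_only = r_squared([1, 2, 3], [2, 2, 2])   # predicting the mean explains nothing
```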

    Checkpoint 2: Once a regression model is fit through the sample data points, the t-statistic

    must be used to check if the slope of the model is different from zero. But why not simply check

    the slope (even visually) of the model? The t-statistic check ensures that the population slope

    (not just the sample slope) is different from zero. This of course requires the assumption of

    normal distribution of all sample slopes that make up the population.
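
    As a rough illustration of this check for a simple (one-predictor) regression, the sketch
    below computes the slope, its standard error, and the resulting t-statistic. The function
    name and the six data points are made up; the value 2.776 in the comment is the standard
    two-sided 5% critical value of the t-distribution with n - 2 = 4 degrees of freedom.

```python
def slope_t_statistic(x, y):
    """Return (slope, t) for a simple least-squares line fitted through (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = (ss_res / (n - 2) / sxx) ** 0.5      # standard error of the slope
    return slope, slope / std_err

# Made-up sample: y is roughly 2x with a little scatter.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
slope, t = slope_t_statistic(x, y)
# Compare |t| with 2.776 (two-sided 5% level, 4 degrees of freedom).
slope_is_significant = abs(t) > 2.776
```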

    Checkpoint 3: This brings us to the next check - which is to ensure that all error terms in the

    model are normally distributed. Fortunately most standard statistical packages do this

    automatically, but it is good to know that this check has been performed.

    http://www.simafore.com/blog/?Tag=risk+management

    Checkpoint 4: Make sure that if you are using the model to predict, the domain of the predictor

    is within the range of the sample data used to build the model.
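
    A simple guard against this kind of extrapolation is to compare each new predictor value
    with the minimum and maximum seen in the training data. The helper below is a hypothetical
    sketch with made-up numbers.

```python
def within_training_range(train_rows, new_row):
    """Per-predictor flags: True if the new value lies inside the [min, max] seen in training."""
    columns = list(zip(*train_rows))
    return [min(col) <= v <= max(col) for col, v in zip(columns, new_row)]

# Training data spans 1..5 for the first predictor and 10..30 for the second.
flags = within_training_range([[1, 10], [5, 30]], [3, 40])
# The second value (40) lies outside the sampled range, so a prediction for this
# row would be an extrapolation.
```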

    Checkpoint 5: Passing checks 1 and 2 will ensure that the independent and dependent variables

    are related. However this does not imply that the independent variable is the cause and the

    dependent is the effect. Remember that correlation is not causation!

    Checkpoint 6: Highly non-linear relationships will result in simple regression models failing

    checks 1 through 3. However this does not mean that the two variables are not related. In such

    cases it may become necessary to resort to somewhat more advanced bivariate analysis

    methods. The use of mutual information for testing if two variables are related is highly

    effective in such cases.

    Mutual information will very simply tell you if variable X is related to variable Y, and by how
    much the uncertainty in predicting Y is reduced once X is known. Furthermore,

    mutual information can handle jumps or discontinuities within the sample data - for example

    the X data may not be uniformly spaced. Such jumps in data are well captured by mutual

    information, as are non-linearities.
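
    To make the contrast with simple regression concrete, the sketch below estimates mutual
    information from a two-dimensional histogram and compares it with the Pearson correlation
    for the relationship y = x^2. The equal-width binning with 10 bins, the function names, and
    the synthetic data are assumptions of this sketch; a histogram estimate is only one of
    several ways to estimate mutual information, and its value depends on the number of bins.

```python
import math
from collections import Counter

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def mutual_information(x, y, bins=10):
    """Histogram estimate of mutual information, in nats."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0           # guard against a constant column
        return [min(int((v - lo) / width), bins - 1) for v in values]
    bx, by = to_bins(x), to_bins(y)
    n = len(x)
    pxy, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    # Sum of p(i, j) * log( p(i, j) / (p(i) * p(j)) ) over the occupied cells.
    return sum((c / n) * math.log(c * n / (px[i] * py[j]))
               for (i, j), c in pxy.items())

# A strongly non-linear, symmetric relationship: y = x^2 on [-1, 1].
x = [i / 500 - 1 for i in range(1001)]
y = [v * v for v in x]

r = pearson(x, y)                # close to zero: a straight line sees no relationship
mi = mutual_information(x, y)    # clearly positive: knowing X reduces uncertainty about Y
```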

    If you liked this ebook tutorial on analytics, sign up for visTASC, "a visual thesaurus of analytics,
    statistics and complex systems", for more like these. Sign-up is FREE and allows you to search for
    techniques for other common business problems.

    http://www.simafore.com/blog/?Tag=mutual+information
    http://vistasc.simafore.com/create-account