Correlation and regression

Transcript of Correlation and regression

1. Aaker, Kumar, Day (5th Edition): Correlation & Regression

2. Regression: What is Regression?

3. Regression: What is Regression?
A statistical technique used to relate two or more variables: the independent variable(s) are used to predict the value of the dependent variable.
Objective examples: For a given value of advertising expenditure, how much sales will be generated? With a given diet plan, how much weight will an individual be able to reduce?

4. Regression: Understanding
A layman's question: Suppose we want to find out how much the age of a car helps you determine the price of the car.
A layman's answer: The older the car, the ______ will be the price.
Regression in simple words: As the age of the car increases by one year, the price of the car is estimated to decrease by a certain amount.
Regression in statistical terms: Y(Estimated) = b0 + b1 X

5. Regression: Understanding
Data set: age and price of the cars.
Age:    1   2   1   2   3   4   3   4   3
Price: 90  85  93  84  80  74  81  76  79
What relation do you see? A negative relationship.
A convenient way to look at it (what is this tool called?): a scatter plot of Price against Age.

6. How to Show it Statistically
[Scatter plot of Price against Age]
Y(E) = b0 + b1 X; here Y(E) = 97 - 5X, so Y = 97 - 5X + E.
Y(E): dependent variable whose behaviour is to be determined.
X: independent variable whose effect is to be determined.
b0: intercept, the value of Y(E) when X = 0.
b1: estimated change in Y in response to a unit change in X.
E: difference between the actual and the estimated value.

7. Assessing the Goodness of Fit: Graphical Way
Goodness of fit means how well the model fits the actual data: smaller residuals mean a good fit, larger residuals mean a bad fit.
[Panels: Bad Fit, Good Fit, Perfect Fit]

8. Assessing the Goodness of Fit: Statistical Way
[Diagram showing actual Y, estimated Y and expected (mean) Y]

9. SSR
SSR = sum of (Estimated - Expected)^2, the variation explained by the regression ("Expected" here is the mean of Y).

10. SST
SST = sum of (Actual - Expected)^2, the total variation in Y.

11. SSE
SSE = sum of (Actual - Estimated)^2, the unexplained (residual) variation.

12. Assessing the Goodness of Fit: Statistical Way - R2
SST = sum of (Actual - Expected)^2; SSR = sum of (Estimated - Expected)^2; SSE = sum of (Actual - Estimated)^2.
A good model is one in which SSE is lowest; in a perfect fit SSE = 0.
SST = SSR + SSE, so R2 = SSR/SST = 1 - SSE/SST.
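As a quick illustration of slides 5-12, the following sketch (Python with numpy; the variable names are illustrative, not from the slides) fits the least-squares line to the age/price data and computes SST, SSR, SSE and R2 exactly as defined above.

```python
import numpy as np

# Age and price of the cars (data from slide 5)
age = np.array([1, 2, 1, 2, 3, 4, 3, 4, 3], dtype=float)
price = np.array([90, 85, 93, 84, 80, 74, 81, 76, 79], dtype=float)

# Least-squares estimates of the slope b1 and intercept b0 in Y(E) = b0 + b1 X
b1 = np.sum((age - age.mean()) * (price - price.mean())) / np.sum((age - age.mean()) ** 2)
b0 = price.mean() - b1 * age.mean()
estimated = b0 + b1 * age       # Y(E), the estimated price for each car
residuals = price - estimated   # E, the difference between actual and estimated

# Sums of squares as defined on slides 9-12 ("Expected" = mean of Y)
sst = np.sum((price - price.mean()) ** 2)      # total variation
ssr = np.sum((estimated - price.mean()) ** 2)  # variation explained by the regression
sse = np.sum(residuals ** 2)                   # unexplained (residual) variation
r_squared = ssr / sst                          # equivalently 1 - sse / sst

print(f"Y(E) = {b0:.1f} + ({b1:.1f}) X")
print(f"SST = {sst:.1f}, SSR = {ssr:.1f}, SSE = {sse:.1f}, R2 = {r_squared:.3f}")
```

The estimated coefficients come out close to the rounded line Y(E) = 97 - 5X quoted on slide 6, and for these nine points R2 should come out at roughly 0.96, i.e. most of the variation in price is explained by age.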
13. Inferring About the Population: Assumptions
Expected value of residual: E(ei) = 0; check: no apparent pattern in the residual plot.
Variance of residual: constant (var(e1) = var(e2) = ... = var(ei)); check: the residual plot has a consistent spread.
Distribution of residual: normal; check: the histogram is symmetric or normal (histogram and probability plot of the residuals).
Dependency of residuals: independent; check: no pattern when the residuals are plotted over time.
Relationship between the independent and dependent variable: linear; check: a linear scatter plot.

14. The Three Conditions Shown Together
As the distribution is symmetric, the mean of the error term will be zero.
The distribution of the error term is shown to be normally distributed.
The variance of the error term appears to be the same for different values of x.

15. Analysis of Residuals
If the assumptions of regression are met, the following two conditions hold:
Condition 1: the plot of the residuals (e) against the predictor (x) should fall roughly in a horizontal band, symmetric about the x-axis.
Condition 2: a normal probability plot of the residuals should be roughly linear.

16. Residual Analysis
Examining the residuals (or standardized residuals) helps detect violations of the required conditions.
Example continued: non-normality. Use Excel to obtain the standardized residual histogram, then examine the histogram and look for a bell-shaped diagram with a mean close to zero.

17. Residual Analysis
For each residual we calculate its standard deviation as follows:
s_ri = s_e * sqrt(1 - h_i), where h_i = 1/n + (x_i - x_bar)^2 / ((n - 1) s_x^2)
Standardized residual_i = residual_i / s_ri
A partial list of standardized residuals:
Observation   Predicted Price   Residual   Standard Residual
1             14736.91          -100.91    -0.33
2             14277.65          -155.65    -0.52
3             14210.66          -194.66    -0.65
4             15143.59           446.41     1.48
5             15091.05           476.95     1.58

18. Residual Analysis
[Histogram of the standardized residuals]
It seems the residuals are normally distributed with mean zero.

19. Heteroscedasticity
When the requirement of a constant variance is violated we have a condition of heteroscedasticity.
Diagnose heteroscedasticity by plotting the residuals against the predicted y.
[Residual versus predicted-y plot: the spread increases with the predicted y]

20. Homoscedasticity
When the requirement of a constant variance is not violated we have a condition of homoscedasticity.
[Example continued: plot of the residuals against the predicted price]

21. Non-Independence of Error Variables
If the data were collected over time they constitute a time series.
Examining the residuals over time, no pattern should be observed if the errors are independent.
When a pattern is detected, the errors are said to be autocorrelated.
Autocorrelation can be detected by graphing the residuals against time.

22. Non-Independence of Error Variables
Patterns in the appearance of the residuals over time indicate that autocorrelation exists.
[Two residual-versus-time plots: one with runs of positive residuals followed by runs of negative residuals, the other with the residuals oscillating around zero]

23. Outliers
An outlier is an observation that is unusually small or large.
Several possibilities need to be investigated when an outlier is observed:
There was an error in recording the value.
The point does not belong in the sample.
The observation is valid.
Identify outliers from the scatter diagram. It is customary to suspect an observation is an outlier if its |standard residual| > 2.
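A minimal sketch of these residual diagnostics, again in plain numpy and again on the slide-5 age/price data (the worked car-price example in slides 17-20 uses a different data set that is not reproduced here): it applies the leverage formula from slide 17 to obtain standardized residuals and flags any case with |standard residual| > 2, the rule of thumb from slide 23.

```python
import numpy as np

# Same age/price data as the earlier sketch (slide 5)
age = np.array([1, 2, 1, 2, 3, 4, 3, 4, 3], dtype=float)
price = np.array([90, 85, 93, 84, 80, 74, 81, 76, 79], dtype=float)
n = len(age)

b1 = np.sum((age - age.mean()) * (price - price.mean())) / np.sum((age - age.mean()) ** 2)
b0 = price.mean() - b1 * age.mean()
residuals = price - (b0 + b1 * age)

# Standard error of the estimate (n - 2 degrees of freedom for one predictor)
s_e = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Slide 17: h_i = 1/n + (x_i - x_bar)^2 / ((n - 1) s_x^2) and s_ri = s_e * sqrt(1 - h_i)
h = 1 / n + (age - age.mean()) ** 2 / ((n - 1) * age.var(ddof=1))
s_r = s_e * np.sqrt(1 - h)

# Standardized residual_i = residual_i / s_ri; slide 23: suspect an outlier if |SR| > 2
standardized = residuals / s_r
for i, (res, sr) in enumerate(zip(residuals, standardized), start=1):
    flag = "  <- suspect outlier" if abs(sr) > 2 else ""
    print(f"obs {i}: residual = {res:6.2f}, standardized residual = {sr:5.2f}{flag}")
```

Plotting these standardized residuals against the fitted values or against time would give the heteroscedasticity and autocorrelation checks of slides 19-22.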
24. Regression Using SPSS

25. Sequence of Entering Variables
Which variables to enter first: the one which is theoretically more important.
If the variables are uncorrelated, the sequence of entering variables does not have any effect; but real life has more correlated than uncorrelated variables.
Some methods:
Hierarchical: first the known variables, then the unknown.
Forced entry (Enter): all together; the only method for testing theory.
Stepwise: the order is selected mathematically by the software.

26. Stepwise Methods
Forward: start with the constant and then add the predictor that explains the most variation.
Backward: start with all predictors and then remove the one with the least significance.
Suppression effect: a variable has a significant effect only when other variables are held constant. Forward selection is more prone to exclude a variable because of the suppression effect; backward elimination does not suffer from it.
Cross-validation: when stepwise methods are used, it is advised to divide the sample into two groups; one is used to develop the model and the other is used to test it.

27. Accuracy of the Regression Model
Diagnostics: outliers and residuals; influential cases.
Assumptions: variable type; positive variance; no perfect multicollinearity; homoscedasticity; independent errors; predictors are uncorrelated with external variables.

28. Diagnostics: Outliers
An outlier, its effect, and how to identify it: residuals.
Unstandardized residuals.
Standardized residuals (SR): there is an outlier if SR > 3.29; it is a cause for concern if more than 1% of the sample cases have SR > 2.58, or more than 5% of the sample cases have SR > 1.96.
Studentized residuals: the unstandardized residual divided by a standard deviation that changes from case to case.

29. Diagnostics: Influential Cases
An influential case exerts undue influence on the coefficients. Measuring the effect on a case:
Adjusted Predicted Value (APV): the predicted value when that particular case is excluded while developing the model.
DFFIT: the difference between the APV and the original predicted value.
Deleted residual: the difference between the APV and the original observed value.
Studentized deleted residual: the deleted residual divided by its standard deviation.

30. Diagnostics: Effect on the Model
Cook's Distance: the effect of a case on the model; cause for concern if CD > 1.
Leverage: the influence of the observed value of the outcome variable over the predicted values; values range from 0 to 1, with an average of (K+1)/n, where K = number of predictors and n = sample size; cause for concern if L > 2(K+1)/n or L > 3(K+1)/n.
Mahalanobis Distance: the distance of a case from the mean(s) of the predictor variable(s); cause for concern with N = 500 and 5 predictors for values above 25, and with N = 100 and 3 predictors for values above 15; otherwise use the Barnett & Lewis table.

31. Assumptions
Variable type: quantitative or categorical.
Variance: positive (variance > 0).
No perfect multicollinearity: the predictor variables should not correlate highly.
Homoscedasticity: the variance of the residual terms should be constant.
Independent errors.
Predictors are uncorrelated with external variables.

32. Multicollinearity
Perfect collinearity exists when at least one predictor is a perfect linear combination of the others (the simplest example being two predictors that are perfectly correlated: they have a correlation coefficient of 1).
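The diagnostics on slides 28-32 are described in SPSS terms. For readers working outside SPSS, the sketch below is a rough equivalent using Python's statsmodels on simulated, purely illustrative data; the data, model and variable names are assumptions, and only the cut-offs 3.29 / 2.58 / 1.96, CD > 1 and 2(K+1)/n come from the slides.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Purely illustrative data: 100 cases, two predictors, one outcome
rng = np.random.default_rng(0)
n, k = 100, 2
X = rng.normal(size=(n, k))
y = 3 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()
influence = model.get_influence()

# Standardized residuals (slide 28): outlier if SR > 3.29; concern if more than
# 1% of cases exceed 2.58 or more than 5% exceed 1.96
sr = influence.resid_studentized_internal
print("cases with |SR| > 3.29:", int(np.sum(np.abs(sr) > 3.29)))
print("% of cases with |SR| > 2.58:", 100 * np.mean(np.abs(sr) > 2.58))
print("% of cases with |SR| > 1.96:", 100 * np.mean(np.abs(sr) > 1.96))

# Cook's distance (slide 30): cause for concern if CD > 1
cooks_d = influence.cooks_distance[0]
print("cases with Cook's distance > 1:", int(np.sum(cooks_d > 1)))

# Leverage (slide 30): average is (K+1)/n; concern above 2(K+1)/n or 3(K+1)/n
leverage = influence.hat_matrix_diag
print("cases with leverage > 2(K+1)/n:", int(np.sum(leverage > 2 * (k + 1) / n)))

# Multicollinearity (slides 31-32): VIF per predictor (constant column excluded)
vifs = [variance_inflation_factor(X_const, i) for i in range(1, X_const.shape[1])]
print("VIF per predictor:", np.round(vifs, 2))
```

Mahalanobis distance and the Barnett & Lewis table are not reproduced here; high VIF values would flag the strong correlation between predictors that slides 31-32 warn against.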