Posted on 15-Jul-2015
Regression Analysis:
NBA Points
By:
Matthew Adkins
John Michael Croft
Chima Iheme
Anthony Podolak
Table of Contents
Regression Analysis of NBA ppg
I.) Table of Contents
II.) Abstract
III.) Introduction
IV.) Model Specifications
    A.) Explanation of Coefficients
    B.) Test for Non-Linearity
    C.) Test for Heteroskedasticity
    D.) Normality Test
        i.) Histogram, PP Plot, other graphical methods
        ii.) Quantitative Analysis of Normality
    E.) Outlier Control
        i.) Decision to keep/disregard outliers
        ii.) Explanation for decision
V.) Conclusion
    A.) Final Model
    B.) Limitations of Model
    C.) Possible Improvements
Appendix: Includes relevant graphs, plots, and non-essential output reports.
Abstract
In this project, we analyzed a set of data and determined its relationship to NBA points per game (ppg). We first created a baseline model against which we could compare all of our later results. We then tested for non-linearity in our independent variables to find the forms of the variables that explained the most variation in our model. After building a preliminary model, we tested it for multicollinearity, heteroskedasticity, and normality. Once we adjusted the model to deal with these issues, we decided to retain the outliers, as they did not appear to significantly affect our results. Having done this, we feel confident that we have created a model that accurately assesses the information we were given. However, the data available to us were limited; to build a truly explanatory model, we would need more data.
Model Introduction
This project is a required assignment for our Econometrics course. It involves building a model and testing that model for problems, and it gives us an avenue to practice the concepts we learned this semester. In what follows, we walk through our process step by step so that the reader can understand how we arrived at our final model.
Initially, we were given a data set that included our dependent variable, PointsPerGame, and a number of possible explanatory factors and statistics for NBA players over one completed season. From this, we created a baseline model with the following summary:
Model Summary

  Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
  1       .928a   .861       .858                2.215

  a. Predictors: (Constant), coll, assists, rebounds, allstar, wage, avgmin
This baseline model is sub-optimal and is useful only as a starting point against which to compare our later models. It does, however, give us a benchmark for judging whether we are progressing or regressing as we analyze the data.
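The adjusted R² in the summary above penalizes R² for the number of predictors, which is why we compare models on it rather than raw R². A minimal sketch of the computation, using synthetic data in place of the NBA set (we only have the SPSS output, not the raw file, so the data and coefficients below are illustrative only):

```python
import numpy as np

def adjusted_r2(y, X):
    """Fit OLS via least squares and return (R^2, adjusted R^2).

    X should NOT include a constant column; one is added here.
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])   # add intercept
    k = A.shape[1] - 1                     # number of predictors
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

# Synthetic stand-in: 269 "players", 6 predictors (mirroring the table above)
rng = np.random.default_rng(0)
X = rng.normal(size=(269, 6))
y = X @ np.array([1.5, 0.8, -0.4, 0.3, 0.2, 0.1]) + rng.normal(scale=2.0, size=269)
r2, adj = adjusted_r2(y, X)
print(round(r2, 3), round(adj, 3))
```

Note that adjusted R² is always at most R², and the gap widens as predictors are added without explanatory payoff.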
Model Specifications
After many regressions, we were able to create what we believe to be the optimal model,
given the data we are working with and the limitations placed upon us. In order to achieve this,
we regressed pointspergame against every possible variation of our available independent variables (in this case, the inverse, square root, natural log, quadratic, and cubic forms).
We worked to eliminate variables, first using the stepwise function. After that initial round of variable elimination, we focused on eliminating variables with a VIF over ten. Lastly, once all remaining variables had VIFs under ten, we eliminated variables that we could not show to be statistically significant. This process slightly increased our adjusted R², improving the explanatory power of our model. The final model is summarized below.
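The VIF screening step described above can be sketched in code. This is a generic illustration rather than the SPSS procedure we actually ran: VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing column j on the remaining regressors (with an intercept), and values over ten signal problematic multicollinearity.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (no constant column)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid @ resid / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Independent synthetic columns: all VIFs should sit near 1
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
print(np.round(vif(X), 2))
```

Appending a near-copy of an existing column sends its VIF far above ten, which is exactly the situation the screening rule is meant to catch.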
Explanation of the Coefficients
Now that we have our model, we can move on to testing it for reliability; but first, we explain the meaning of the independent variables we incorporated.
• Constant: 1.259.
• Avgmin: as average minutes per game increase by 1, LN(PPG) increases by .065.
• Wage: as average annual salary increases by $1,000, LN(PPG) increases by 4.286E-5.
• Coll: as years of college played increase by 1, LN(PPG) decreases by .068.
• Assists: as average assists per game increase by 1, LN(PPG) decreases by .063.
• Allstar: if the player was ever an all-star, LN(PPG) increases by .220.
• Invminutes: as the inverse of minutes per year increases by 1, LN(PPG) decreases by .001.
• Invassists: as the inverse of assists per game increases by 1, LN(PPG) decreases by .058.
• Cubeminutes: as minutes-cubed increases by 1, LN(PPG) decreases by .008.
• Cuberebounds: as rebounds-cubed increases by 1, LN(PPG) changes by a coefficient that rounds to .000.
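Because the dependent variable is LN(PPG), each coefficient also has a percentage interpretation: a one-unit change in a regressor multiplies PPG itself by exp(b), which is roughly a 100·b percent change when b is small. For example, taking the avgmin coefficient from the list above:

```python
import math

# In a log-linear model, a one-unit change in a regressor multiplies the
# untransformed dependent variable by exp(b); for small b this is about
# a 100*b percent change.
b_avgmin = 0.065  # coefficient on avgmin from the list above
pct = (math.exp(b_avgmin) - 1) * 100
print(f"One extra average minute -> about {pct:.1f}% more points per game")
```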
Testing for Non-Linearity
One of the main problems that can be found in regression analysis comes from the non-
linearity of variables. It is not always easy to tell whether a variable is linear or not, but there are
a couple of methods we can use to account for possible non-linearity in our variables. The first is to visually inspect the data by plotting each independent variable against pointspergame. We did this and found a few variables that were questionable. We then
incorporated all of the possible variations that we could think of into our regression and
eliminated the ones that did not increase the explanatory power of our model. These two
methods allowed us to account for the non-linearity in our variables, which in turn allowed us to
further optimize our model.
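The transformation search described above can be sketched as follows. The data here are synthetic (generated so that the true relationship is logarithmic), and each candidate form of a single predictor is scored by its adjusted R² in a simple regression; this illustrates the idea, not our actual SPSS workflow.

```python
import numpy as np

def adj_r2(y, x_col):
    """Adjusted R^2 of a simple regression of y on one transformed column."""
    n = len(y)
    A = np.column_stack([np.ones(n), x_col])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid @ resid / ((y - y.mean()) ** 2).sum()
    return 1 - (1 - r2) * (n - 1) / (n - 2)

# The transformations the text names, plus the untransformed variable
transforms = {
    "linear":  lambda x: x,
    "inverse": lambda x: 1.0 / x,
    "sqrt":    np.sqrt,
    "log":     np.log,
    "square":  lambda x: x ** 2,
    "cube":    lambda x: x ** 3,
}

# Synthetic stand-in: y truly depends on log(x)
rng = np.random.default_rng(2)
x = rng.uniform(1.0, 40.0, size=269)   # e.g. minutes per game
y = 2.0 * np.log(x) + rng.normal(scale=0.3, size=269)

scores = {name: adj_r2(y, f(x)) for name, f in transforms.items()}
best = max(scores, key=scores.get)
print(best)
```

Keeping only the forms that raise adjusted R² mirrors our elimination of variations that did not increase the explanatory power of the model.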
Testing for Heteroskedasticity
Inspecting the residual output charts and graphs, our analysis shows no heteroskedasticity. The residual histogram shows that the residuals are approximately normally distributed, and the residual scatter plot shows no obvious trend. The observations are consistent with the Empirical Rule, with 99.7% of the data falling within three standard deviations of the expected value. Based on the Breusch-Pagan test, we decided that no GLS correction was necessary.
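A minimal sketch of the Breusch-Pagan idea, on synthetic homoskedastic data rather than our actual residuals: regress the squared residuals on the regressors and compare LM = n·R² with a chi-squared critical value (3.84 at the 5% level for one regressor). A small LM is consistent with homoskedasticity.

```python
import numpy as np

def breusch_pagan_lm(resid, X):
    """Breusch-Pagan LM statistic: n * R^2 from regressing squared
    residuals on the regressors (df = number of regressors)."""
    n = len(resid)
    u2 = resid ** 2
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, u2, rcond=None)
    e = u2 - A @ beta
    r2 = 1 - e @ e / ((u2 - u2.mean()) ** 2).sum()
    return n * r2

# Homoskedastic toy data: LM typically falls below 3.84 (5% level, df = 1)
rng = np.random.default_rng(3)
x = rng.normal(size=(300, 1))
y = 1.0 + 2.0 * x[:, 0] + rng.normal(size=300)
A = np.column_stack([np.ones(300), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
lm = breusch_pagan_lm(y - A @ beta, x)
print(round(lm, 2))
```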
Testing for Normality
There are two ways to assess whether a model's residuals are normally distributed: graphically and quantitatively. We use two different types of graphs to assess the normality of our regression. When we graph our standardized residuals in a PP plot and a histogram, we see a distribution that looks very close to normal. These visual aids are helpful, but they cannot by themselves determine whether our regression is normally distributed; we also need to test it quantitatively. Thus, we use the Kolmogorov-Smirnov and Shapiro-Wilk tests, from which we conclude that our residuals are approximately normal.
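The Kolmogorov-Smirnov statistic SPSS reports (with the Lilliefors correction, since the normal parameters are estimated from the sample) measures the largest gap between the empirical CDF of the residuals and a fitted normal CDF. A rough reimplementation, applied to synthetic samples rather than our actual residuals:

```python
import math
import numpy as np

def ks_normal_stat(x):
    """KS distance between the empirical CDF of x and a normal CDF fitted
    to x's mean and standard deviation (the Lilliefors statistic).
    Small values mean the sample looks normal."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    z = (x - x.mean()) / x.std(ddof=1)
    cdf = np.array([0.5 * (1 + math.erf(v / math.sqrt(2))) for v in z])
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)
    d_minus = np.max(cdf - np.arange(0, n) / n)
    return max(d_plus, d_minus)

rng = np.random.default_rng(5)
normal_sample = rng.normal(size=269)        # same n as the appendix table
skewed_sample = rng.exponential(size=269)   # clearly non-normal comparison
print(round(ks_normal_stat(normal_sample), 3),
      round(ks_normal_stat(skewed_sample), 3))
```

SPSS converts this statistic to the significance values shown in the appendix table; here we only compute the distance itself.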
Outlier Control
In order to obtain a better perspective on our outliers, we created a box plot to see where they fall relative to the data as a whole. We also ran descriptive statistics on our standardized residuals, which let us examine the outliers numerically. We found that most of our outliers lay at or within three standard deviations of the mean, and even the few beyond three standard deviations were within four.
Because of this and the size of our sample, we decided it would be best to leave the outliers in the model. We do not believe they significantly skew our data, and keeping them may help explain the variation in our model.
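The screening rule described above (flagging standardized residuals beyond three or four standard deviations of the mean) can be sketched as follows, again on synthetic residuals standing in for our actual standardized residuals:

```python
import numpy as np

def flag_outliers(resid, threshold=3.0):
    """Standardize residuals and flag points more than `threshold`
    standard deviations from the mean."""
    z = (resid - resid.mean()) / resid.std(ddof=1)
    return np.abs(z) > threshold

# Toy residuals with three planted extreme points
rng = np.random.default_rng(6)
resid = rng.normal(size=269)
resid[:3] = [6.0, -6.0, 7.0]
flags = flag_outliers(resid)
print(int(flags.sum()))
```

Rerunning with `threshold=4.0` would implement the looser "within four standard deviations" check mentioned above.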
Conclusion
After analyzing the data, we constructed a model that, as far as we can tell, explains the most variation of any possible model. As we tested the model, we were able to refine it further, eliminating a great deal of multicollinearity and many insignificant factors. This was achieved primarily by applying natural log, inverse, square, cube, and square root transformations to our independent and dependent variables. Our final model is:
LN(PPG) = 1.295 + .065(AvgMin) + 4.286E-5(Wage) - .068(Coll) - .063(Assists) + .220(Allstar) - .001(Minutes)⁻¹ - .058(Assists)⁻¹ - .008(Minutes)³ + .000(Rebounds)³
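Because the response is LN(PPG), a fitted value must be exponentiated to recover points per game, and a dummy like Allstar acts multiplicatively: being an all-star multiplies predicted PPG by exp(.220), about 25%, holding everything else fixed. A sketch with placeholder inputs (the report does not state how the cubed and inverse terms were scaled in SPSS, so the input values below are illustrative only, not realistic player statistics):

```python
import math

# Linear predictor assembled from the final model above (response is LN(PPG)).
def ln_ppg(avgmin, wage, coll, assists, allstar, minutes, rebounds):
    return (1.295
            + 0.065 * avgmin
            + 4.286e-5 * wage
            - 0.068 * coll
            - 0.063 * assists
            + 0.220 * allstar
            - 0.001 / minutes          # invminutes term
            - 0.058 / assists          # invassists term
            - 0.008 * minutes ** 3     # cubeminutes term
            + 0.000 * rebounds ** 3)   # cuberebounds (coefficient rounds to zero)

# Multiplicative effect of the all-star dummy on predicted PPG,
# computed with arbitrary placeholder values for the other inputs
base = dict(avgmin=1.0, wage=100.0, coll=2, assists=1.5, minutes=1.0, rebounds=1.0)
ratio = math.exp(ln_ppg(allstar=1, **base)) / math.exp(ln_ppg(allstar=0, **base))
print(round(ratio, 3))
```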
Model Limitations
Due to the nature of this model, there are certain inherent limitations. Foremost among these is its limited ability to predict future values; a model like this one is more explanatory than predictive. It lacks multiple years of historical data, which would allow it to predict future values from trends in older data. In addition, at least one of the variables (draft) was entered in such a way that the field was blank if the player was undrafted. There is probably a better way to enter this information, perhaps using dummy variables, which might add to the model's effectiveness. The model is also limited by our exposure to the tools and techniques presented in an undergraduate class; we are certain there are countless other tests and methods that could be applied to improve it further.
Possible Improvements
The major improvement that could be made to this model would be the inclusion of more variables. If we were able to incorporate variables such as the number of players at the same position per team, points per game from the prior season or from college, average team points scored per game, average opponent points allowed per game, injuries, motivation level, win/loss record, and contract-year status, we could build a considerably more accurate model. This would allow us to account for a great deal more variation than we currently can under the scope of this data.
Appendix
III. Final Model
III.D.
Tests of Normality

                          Kolmogorov-Smirnov(a)      Shapiro-Wilk
                          Statistic   df    Sig.     Statistic   df    Sig.
  Standardized Residual   .056        269   .040     .981        269   .001

  a. Lilliefors Significance Correction
III.E.