Smith1
THE EVOLUTION OF MODERN BASEBALL:
HOME RUNS HIT AND ITS IMPACT ON SALARY THROUGHOUT BASEBALL’S STEROID ERA
WILLIAM SMITH
ECONOMETRICS II: APPLIED ECONOMETRICS
DR. LAURA BOYD
DECEMBER 18, 2015
I. OVERVIEW
The main objective of this project is to understand the changing factors in Major
League Baseball during what is known as the “steroid era” of baseball. This era defines a
period where professional baseball players were taking advantage of MLB’s lack of league
wide performance enhancing drug, or PED, testing. While it is hard to pinpoint the exact
dates, or even years, in which the steroid era occurred, it is generally accepted to have lasted
from the late 80’s through the late 2000’s. MLB prohibited the usage of steroids in 1991, but
no formal drug testing was established until 2003.1
Because of baseball’s absence of a drug testing policy, the late 90’s and early 2000’s
had some of the most impressive offensive seasons ever seen in baseball. The 6 individual
seasons with the most home runs by a player in MLB history all occurred between 1998 and
2001 (Sosa 3, McGwire 2, Bonds 1). This time period produced some of the most memorable
players in baseball’s history due to all of the long balls that were being hit. Players like Barry
Bonds, Mark McGwire, Sammy Sosa, and Alex Rodriguez were the faces of MLB and had
immense popularity at the peak of the steroid era, so it is not surprising that every one
of these players has since been linked to PED use. The steroid era defined a
period where spectators paid to see their favorite players hit home runs; therefore I argue that
in the steroid era, teams were more willing to invest in players who hit the most home runs
than in any other era in baseball.
For the sake of my research, I would like to estimate the effect of home runs hit by a
player on his salary for the following season, and the economic and statistical
significance of that effect, before, during, and after the steroid era. Hence, my dependent variable is
salary and my key independent variable is home runs. Using individual seasons to represent
each era, I will define 2000 as the peak year of the steroid era as it was also the season with
1 Opinion of timeline of the steroid era, year of steroid ban and mandatory drug testing from http://espn.go.com/mlb/topics/_/page/the-steroids-era
Graph 1
Data obtained from Baseball-Almanac.com. Note: 1994 had a shortened season due to the players’ strike.
the highest home run total for the league in baseball’s history, as shown in Graph 1. To
analyze the time before and after the steroid era, I will use data from the 1990 and 2010
baseball seasons, respectively. I chose 1990 to represent the pre-steroid era because this was
before baseball experienced the abnormally high home run totals that it did in the later 90’s
and early 2000’s. The 2010 season represents the post-steroid era as MLB had established
PED testing and teams began to look for defense, speed, and pitching more and home runs
less in their players. I hypothesize home runs will have a positive impact on player salary in
every era, but that the impact will be greater and more significant in 2000 (steroid era) than
in 2010 (post-steroid era), which in turn will have a greater and more significant impact than 1990
(pre-steroid era).
II. ESTIMATION
All data are cross-sectional and come from Sean Lahman’s Baseball Database, in
particular from its Batting and Salary files. I took individual offensive player
[Graph 1: Total Home Runs in MLB from 1982-2013. Vertical axis: Home Runs; horizontal axis: Year.]
statistics from the three years identified above, and associated those individual seasons with
the salary the player made the following season, 1991, 2001, and 2011 respectively. I did this
as a way to demonstrate how a player’s performance in a season impacts their salary for the
following season. I also analyzed various team statistics from 1990, 2000, and 2010, and
associated those stats with a player based on their current team, the team that was paying him
salary in 1991, 2001, and 2011. Including team statistics in my model is a way to control for
team success as a determinant of the salary teams can pay their players. Essentially, every statistic
I used was lagged a year relative to the dependent variable, salary. I adjusted salary in 1991 and
2001 for inflation to 2011 levels. The regression I will be
examining combines all of the observations from these 3 different years. I will be using the
following variables to help describe salary in my research:
Table 2
Variable         Predicted Sign   Description
AdjSal1000       (dependent)      Annual salary adjusted for inflation to 2011 levels, in thousands
HR               +                The total home runs an individual hit in a season
HRpre            -                An interactive variable for HR hit in the 1990 season (pre-steroid era)
HRpost           -                An interactive variable for HR hit in the 2010 season (post-steroid era)
Team Stats
DivWin5y         +                Total Division Championships over previous 5 seasons
LgWin5y          +                Total League Championships over previous 5 seasons
WSWin5y          +                Total World Series Championships over previous 5 seasons
AL               +                A dummy variable equal to 1 if team is in American League
TeamAttby1000    +                Total annual home stadium attendance in thousands from previous season
NewTeam          -                A dummy variable equal to 1 if a team was an expansion franchise in the last 10 years
Player Stats
BA*100           +                A player’s batting average (hits/at-bats) * 100
SLG*100          +                A player’s slugging percentage (total bases/at-bats) * 100
R                +                Runs scored (total times a player touched home plate)
RBI              +                Runs batted in (runs scored a batter is directly responsible for)
SB               +                Stolen bases
PreStEra         -                A dummy variable equal to 1 if pre steroid era (1990)
PostStEra        +                A dummy variable equal to 1 if post steroid era (2010)
The key independent variable in this model is home runs. Since observations from
three different years are combined in the model, two interactive terms that solely looked at
home runs before the steroid era, HRpre, (1990) and after the steroid era, HRpost, (2010)
were used in order to differentiate the impact of home runs on salary for each era. As
predicted in the hypothesis, home runs will have a positive impact on player salary for the
following season because a home run can greatly influence the outcome of a game, but the
two interactive home run terms will both be negative to demonstrate that home runs had a
greater impact on salary during the steroid era.
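The construction of the two interactive terms can be illustrated with a minimal Python sketch. It assumes, as the variable descriptions suggest but the paper does not state outright, that HRpre and HRpost are simply HR multiplied by the matching era dummy. The rows, the `season` field, and the function name `add_era_terms` are hypothetical and only serve the illustration.

```python
# Hypothetical player-season rows; the numbers are illustrative, not the paper's data.
rows = [
    {"season": 1990, "HR": 20},
    {"season": 2000, "HR": 35},
    {"season": 2010, "HR": 12},
]


def add_era_terms(row):
    """Return a copy of the row with era dummies and HR interaction terms added."""
    out = dict(row)
    out["PreStEra"] = 1 if row["season"] == 1990 else 0
    out["PostStEra"] = 1 if row["season"] == 2010 else 0
    # Assumption: each interaction equals HR in its own era and 0 otherwise.
    out["HRpre"] = row["HR"] * out["PreStEra"]
    out["HRpost"] = row["HR"] * out["PostStEra"]
    return out


augmented = [add_era_terms(r) for r in rows]
```

With this coding, an observation from 2000 has both interaction terms equal to zero, so the coefficient on HR alone describes the steroid era and the two interaction coefficients measure departures from it.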
As for team statistics, I predict that the number of Division Championships
(DivWin5y), League Championships (LgWin5y), and World Series Championships
(WSWin5y) a team has won over the previous five seasons will all have a positive impact on
the salary they pay their players for the following season because teams that do well generate
greater revenue and can afford to pay the most sought-after players the most money. A
problem with multicollinearity is expected with these three variables, and the appropriate
corrections will be made based on the regression results. The American League dummy
variable should have a positive impact on salary because in the American League (unlike in
the National League), the pitcher does not have to bat and is replaced by a designated hitter, a
player who is generally a successful power hitter. It is expected team attendance measured in
thousands of spectators, TeamAttby1000, will have a positive impact on player salary for the
following season as ticket sales are one of the main sources of revenue for MLB teams, giving
teams the ability to spend more on their players. Since the only recent MLB expansion
franchises were the Florida Marlins and the Colorado Rockies in 1993 and the Arizona
Diamondbacks and the Tampa Bay Devil Rays in 1998,2 the new team variable only applies
to players who were on these 4 teams in the 2001 season. Theoretically, if a player was on a
2 Franchise data obtained from http://www.baseball-reference.com/
new team, it would negatively influence salary because expansion franchises generally have
to start from scratch and do not have the funds that more established franchises might have.
The player stats included in my model reflect some of the most well-known and used
offensive statistics to define a player’s success. Since these statistics reflect either yearly
percentages or cumulative statistics that only increase with a player’s success, a positive
impact on salary is predicted for these variables. Batting average, represented as BA*100,
should have a positive impact because it is essentially the proportion of at-bats in which a player
gets a hit, and the more hits a team gets in a game, the more runs a team is likely to score, and
the more likely a team is to win. Slugging percentage, represented as SLG*100, should also
have a positive impact for reasons similar to those for batting average, because it
effectively combines batting average with power statistics (i.e. doubles, triples, and home runs
hit). Some issues with multicollinearity between batting average and slugging percentage are
expected, as these statistics are closely related to each other, and the appropriate corrections
will be made if necessary. Runs scored, or R, is predicted to have a positive impact on salary
because the more runs a team can score, the more likely that team will win. Similarly, runs
batted in, or RBI, is predicted to be positively related with salary because this statistic reflects
the number of runs a batter is directly responsible for creating when in the batter’s box. Stolen
bases, or SB, is a different kind of measurement than the other offensive statistics listed
above; it represents speed and not necessarily performance as a batter. Regardless, stolen
bases will also have a positive impact because the ability to advance to the next base, and thus
closer to home plate, when on base increases the likelihood that runner will score.
The effects and conditions that lead to salary differences over time in the MLB are
also controlled for with two dummy variables that define era. The last 25 years in baseball
have seen major increases in salaries. Controlling for each of the different seasons should
reflect this change. The pre-steroid era dummy variable, PreStEra (equal to 1 if observation
from before steroid era), should be negative and the post-steroid era dummy variable,
PostStEra (equal to 1 if observation from after the steroid era), should be positive to reflect
this increase in salaries over time, even when adjusted for inflation.
The base model in this research represents a linear relationship of the combined
observations from the three MLB seasons of salary adjusted for inflation and the independent
variables listed above:
AdjSal1000 = β0(Constant) + β1HR + β2HRpre + β3HRpost + β4DivWin5y + β5LgWin5y + β6WSWin5y + β7AL + β8TeamAttby1000 + β9NewTeam + β10BA*100 + β11SLG*100 + β12R + β13RBI + β14SB + β15PreStEra + β16PostStEra
The effectiveness of a logarithmic relationship will also be explored, either by using logged
salary, logged home runs, or by logging both.
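The role of the two interactive terms in the base model can be made explicit with a short derivation. It assumes HRpre = HR × PreStEra and HRpost = HR × PostStEra, which the variable descriptions imply but the text does not state outright. Under that assumption the effect of one more home run on salary is:

```latex
\[
\frac{\partial\,\mathrm{AdjSal1000}}{\partial\,\mathrm{HR}} =
\begin{cases}
\beta_1 + \beta_2 & \text{1990 (pre-steroid era)} \\
\beta_1 & \text{2000 (steroid era)} \\
\beta_1 + \beta_3 & \text{2010 (post-steroid era)}
\end{cases}
\]
```

In these terms, the prediction that home runs mattered most in 2000 corresponds to β2 < 0 and β3 < 0, and the further prediction from the Overview that 2010 exceeds 1990 corresponds to β2 < β3.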
III. DATA
Sean Lahman’s baseball database was used for all data represented in the statistics and
regressions seen below. Generally, a player who played in all 162 games in a season will have
over 600 at-bats; observations where a player did not have over 200 at-bats (roughly a third of a full season) in the
given year were eliminated. This prevents low totals and possibly erratic averages due
to limited at-bats from being taken into account when considering salary for the
following season. In the 1990 season, there were only 26 teams, whereas in the 2000 and
2010 seasons there were 30 teams. This is reflected in the lower number of observations from
the 1990 season seen in the Descriptive Statistics table below.
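The sample construction described above (dropping seasons with 200 or fewer at-bats and attaching the following season's salary) can be sketched in Python. This is an illustration under stated assumptions, not the procedure actually run in STATA: the records are made up, the field names `playerID`, `yearID`, `AB`, `HR`, and `salary` are assumed to mirror the database's columns, and `pair_with_next_salary` is a hypothetical helper.

```python
# Hypothetical records standing in for rows of the Batting and Salary files;
# player IDs and values are invented for illustration.
batting = [
    {"playerID": "a01", "yearID": 2000, "AB": 550, "HR": 30},
    {"playerID": "b02", "yearID": 2000, "AB": 150, "HR": 4},  # dropped: 200 or fewer at-bats
    {"playerID": "c03", "yearID": 2000, "AB": 410, "HR": 9},  # dropped: no salary on file for 2001
]
salaries = [
    {"playerID": "a01", "yearID": 2001, "salary": 5_000_000},
    {"playerID": "b02", "yearID": 2001, "salary": 400_000},
]

MIN_AT_BATS = 200  # threshold stated in the text


def pair_with_next_salary(batting, salaries):
    """Keep seasons with more than MIN_AT_BATS at-bats and attach next-season salary."""
    salary_by_key = {(s["playerID"], s["yearID"]): s["salary"] for s in salaries}
    paired = []
    for b in batting:
        if b["AB"] <= MIN_AT_BATS:
            continue
        key = (b["playerID"], b["yearID"] + 1)  # salary is taken one season later
        if key in salary_by_key:
            paired.append({**b, "salary_next": salary_by_key[key]})
    return paired


sample = pair_with_next_salary(batting, salaries)
```

The sketch ignores complications such as players with several stints in one season, which the paper does not discuss.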
Even though baseball had fewer teams in 1990 than in 2000 and 2010, a team was more
likely to win its division and reach the playoffs in 2000 and 2010 than in 1990
because of MLB’s changing playoff structure. Winning the division guarantees a team a spot
in the playoffs, the place where the team can then compete for the pennant (League
Championship) and the World Series. There were only 2 divisions per league in 1990 (4 in
total), which expanded to 3 divisions per league in 1994 (6 in total). Therefore, there were only
20 (4 divisions * 5 years) division champions in MLB from 1986-1990, but
30 (6 divisions * 5 years) division champions from 1996-2000 and from 2006-2010. Thus, a
team was more likely to win a division during the steroid and post-steroid eras than in the
pre-steroid era. The expansion of the number of divisions in baseball from 4 to 6, coupled with the
introduction in 1994 of a guaranteed playoff spot in each league for a Wild-Card team (the team with the most wins
in the league that did not win its division), doubled the number of
teams that made the postseason in a given year from 4 to 8. After 1994, each league had 4 teams,
compared to 2, that were able to compete for the pennant, and 8 teams in total, compared to 4,
were guaranteed an opportunity to win the World Series. These changes to MLB’s
divisional and playoff structure are important in considering the likelihood of a team winning
their division, league, and the World Series before and after 1994.
The other major changes that occurred in baseball during this time were the shifting
sizes of both the American and National Leagues. They are as follows:
• 1977-1992 - 14 teams in American League and 12 in National League, 26 teams in the league in total
• 1993 - National League expands by adding the Florida Marlins and Colorado Rockies. Both leagues at 14 teams (1993-1998)
• 1998 - MLB expands to 30 teams by adding Tampa Bay Devil Rays (A.L.) and Arizona Diamondbacks (N.L.). Milwaukee Brewers move to National League. American League at 14 teams, National League at 16 (1998-2013)3
The NewTeam dummy variable being observed is equal to 1 if a team was new to
baseball in the past 10 years. Since the only expansions the MLB made during the period
observed was in 1993 with the Florida Marlins and Colorado Rockies and in 1998 with
the Tampa Bay Devil Rays and the Arizona Diamondbacks, the NewTeam dummy
variable will only be equal to 1 for players who were on one of these 4 teams in 2001.
League size or the difference between leagues in size is not considered for this research,
though it should be noted that the amount of teams in the American League could impact
the American League dummy variable meant to control for the designated hitter in the
American League.
3 Information on changing divisional, playoff, and league structures and expansion franchises of MLB found on http://www.baseball-reference.com/

Descriptive Statistics
                                  1990      2000      2010     Total
Real Salary adjusted to 2011 in thousands (AdjSal1000)
  mean                            1732      3925      4355      3245
  maximum                         6277     27962     32000     32000
  minimum                          165       254       414       165
  range                           6112     27707     31586     31834
  95th percentile                 5231     12604     15000     12167
  Std. Deviation                  1635      4271      5263      4093
Home Runs (HR)
  mean                            10.8      16.0      12.9      13.4
  maximum                           51        50        54        54
  minimum                            0         0         0         0
  range                             51        50        54        54
  95th percentile                   28        39        31        33
  Std. Deviation                   8.9      11.2       9.5      10.2
Batting Average (BA)
  mean                           0.266     0.279     0.265     0.270
  maximum                        0.340     0.372     0.359     0.372
  minimum                        0.192     0.200     0.146     0.146
  range                          0.147     0.172     0.213     0.226
  95th percentile                0.309     0.333     0.312     0.321
  Std. Deviation                 0.028     0.031     0.029     0.030
Slugging Percentage (SLG)
  mean                           0.401     0.457     0.419     0.427
  maximum                        0.592     0.746     0.633     0.746
  minimum                        0.254     0.253     0.208     0.208
  range                          0.338     0.493     0.425     0.538
  95th percentile                0.523     0.606     0.534     0.570
  Std. Deviation                 0.065     0.085     0.071     0.078
Runs Scored (R)
  mean                            56.3      66.3      56.7      60.0
  maximum                          119       152       115       152
  minimum                           11        17        12        11
  range                            108       135       103       141
  95th percentile                   96       114       100       106
  Std. Deviation                  22.4      28.8      24.6      26.0
Runs Batted In (RBI)
  mean                            53.2      64.1      54.2      57.5
  maximum                          132       147       126       147
  minimum                           12        11        13        11
  range                            120       136       113       136
  95th percentile                   95       121       103       111
  Std. Deviation                  23.4      31.3      26.0      27.8
Stolen Bases (SB)
  mean                            11.0       8.0       8.2       8.9
  maximum                           77        62        68        77
  minimum                            0         0         0         0
  range                             77        62        68        77
  95th percentile                   36        28        32        32
  Std. Deviation                  12.7       9.9      10.5      11.0
Observations (N)                   245       304       305       854
The descriptive statistics table for the observed offensive player statistics appears
above. It includes the mean, maximum, minimum, range, value of 95th percentile,
and the standard deviation for salary, home runs, batting average, slugging percentage, runs
scored, runs batted in, and stolen bases for each era and the eras combined. The 95th percentile
statistic is included to demonstrate what the top performers of a single statistic were
accomplishing, rather than just a maximum that represents a single player who could have
been an outlier from the rest of the data. Descriptive statistics of the other variables in my
model are omitted because those variables do not reflect an individual’s
performance, but rather the era or team the player happened to be on the year after
these statistics were recorded. The data is broken up into three columns based on the year of
the observation, which represents the three different eras being compared and a fourth column
with all observations combined. The data for salary is from a year after the year listed at the
top of each column, so 1991, 2001, and 2011 respectively. From the first three columns, the
changing trends in offensive statistics can be observed throughout the beginning and end of
the steroid era.
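The summary measures reported in the table can be sketched with Python's standard library. The percentile rule used by the statistical software is not stated in the paper, so the nearest-rank rule below is an assumption, and the home run totals are hypothetical values chosen to show why the 95th percentile is less sensitive to a single outlier than the maximum.

```python
import statistics


def describe(values):
    """Summary measures matching the rows of the descriptive statistics table."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed nearest-rank rule: rank = ceil(0.95 * n), computed in integers.
    rank = (95 * n + 99) // 100
    return {
        "mean": statistics.mean(ordered),
        "maximum": ordered[-1],
        "minimum": ordered[0],
        "range": ordered[-1] - ordered[0],
        "p95": ordered[rank - 1],
        "std_dev": statistics.stdev(ordered),  # sample standard deviation
    }


# Hypothetical home run totals for 20 players (not the paper's data).
hr_totals = [0, 2, 3, 5, 5, 7, 8, 9, 10, 11, 12, 13, 15, 16, 18, 20, 22, 25, 31, 50]
summary = describe(hr_totals)
```

In this made-up sample the maximum is driven by one player with 50 home runs, while the 95th percentile sits at 31, much closer to what the other top hitters achieved.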
Expanding upon the hypothesis of this project, the peak of the steroid era should see
increased offensive performance, not only reflected in increased home runs, but also in the
other offensive statistics compared to the other two eras. The one variable that is not expected
to increase with steroid usage is stolen bases. The ability to steal a base is not necessarily
reflected in muscle growth, the result of steroid usage. A theoretical prediction can be made
that stolen bases were actually less abundant during the steroid era than before or after the
steroid era because the perceived importance of being a power hitter was much greater than
being a quick base runner during the steroid era, and players might choose to take steroids and
gain muscle at the cost of speed on the base paths. There are noticeably fewer
observations in 1990 (245) than in 2000 (304) and 2010 (305), but this can be explained by
the fact that there were fewer players in 1990 than in 2000 or 2010 because there were four fewer
teams in MLB.
Even when adjusted for inflation, salary appears to grow steadily as time
moves forward. While there is not a substantial difference in the mean, maximum, and 95th
percentile of salary between 2000 and 2010, salary in 1990 was significantly lower. The higher salary
numbers in 2000 and 2010 could be a result of the 1994 players’ strike, as players’ salary
demands might have been taken more seriously by the league and teams for fear of another
strike.
For home runs, there is a clear difference in the three seasons observed. While the
highest totals from each year are similar, the mean home run total in 2000 (16.0) is over 3
home runs greater than the mean total in 2010 (12.9) and over 5 home runs greater than the
mean total in 1990 (10.8). 2000 also was the only year where the mean value was greater than
the mean value of all observations (13.4). The 95th percentile for home runs further proves
home runs were much more prevalent in 2000 than in the other years. A player in the top 5%
of home runs hit in 1990 and 2010 would need to hit 11 and 8 more home runs, respectively,
in order to be in the top 5% of home run hitters in 2000. From the descriptive statistics, it is
clear that home runs were much more abundant in 2000 than in the other years.
Batting average, slugging percentage, runs scored and runs batted in have the highest
mean, maximum, and 95th percentile in 2000 by a significant margin. It is especially
staggering to observe how much greater the 95th percentile is in 2000 than in the other years.
The 95th percentile for slugging percentage is over .07 higher in 2000 than in 2010 and over
.08 higher than in 1990. These numbers may seem small, but a difference of .07 in slugging
percentage is economically large. The higher totals of runs scored in 2000 are also
noteworthy. The maximum value in 2000 of 152 is much higher than the
maximum totals of 119 and 115 in 1990 and 2010. Based on the means obtained, it can also
be inferred that the average player was scoring about 10 more runs in 2000 than in 1990 and
2010. The abnormally large number of runs scored in 2000 represents the surge of offense in
the steroid era, especially in the late 90’s and early 2000’s, as batters were creating more
opportunities for themselves and their teammates to score.
Unlike the other batting statistics, which were highest in 2000, stolen bases had their
lowest mean value, maximum value, and 95th percentile in 2000.
Interestingly, stolen bases appeared to be most prevalent in 1990, as 1990 had the
highest mean, maximum, and 95th percentile for stolen bases. This can be explained by the same
mechanism that made the batting statistics highest in 2000. Steroids were used without
regulation during the steroid era, and players tended to sacrifice speed for power in years such
as 2000. In 1990, steroids were arguably not nearly as prevalent as they were in 2000. Players
were not as focused on gaining muscle and hitting for power in 1990, and rather directed more
attention to other aspects of the game.
To define the steroid era in baseball as a period when hitting and hitting for power was
at its peak would require a demonstration of decreased hitting statistics for the periods
surrounding this time. Descriptive statistics obtained from the years 1990, 2000, and 2010
show that hitting and hitting for power was indeed at its peak in 2000, the steroid era. It would
be imprudent to rely on only one aspect of the game to show changing eras, which is why stolen
bases, a measure of speed, is included in the data. The plunge in stolen bases from 1990 to
2000 and the subsequent increase ten years later in 2010 further suggest the steroid era was a
unique time in baseball’s history when the dominant aspect of the game that teams, players, and
fans alike cared for was the home run, while other parts of the game were forgotten.
IV. RESULTS
The statistical program STATA was used for all results and manipulation of data in this research.
After observing the effects of including the key independent variables (HR, HRpre,
and HRpost) squared, it was clear that none of these squared terms significantly influenced
salary, so they were not used or considered in any of the following regressions.
The original linear regression from the base model above was run first. The adjusted R2 for
this model is .403, meaning that, after adjusting for the number of regressors, about 40.3% of the variation of the dependent variable is explained
by this model. As expected, there are 854 observations across the three years being used. To
test for heteroskedasticity, a Breusch-Pagan test was implemented. After estimating the
model, I obtained the predicted values and from them the residuals. I then regressed the
squared residuals on the independent variables
from the base model for the Breusch-Pagan test. The test results are as follows:
H0: Homoskedastic Errors
HA: Heteroskedastic Errors
F (16, 837) = 6.53
Prob > F = 0.0000
Therefore the null hypothesis is rejected, as there are severe signs of heteroskedasticity.
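The auxiliary-regression form of the Breusch-Pagan test described above can be sketched in plain Python. This is an illustrative reimplementation, not the STATA routine used for the reported results: the helper names `solve`, `ols`, and `breusch_pagan_f` are hypothetical, and the single-regressor data are invented so that the error spread grows with x. The reported degrees of freedom, F(16, 837), are consistent with k = 16 regressors and n = 854 observations, since n - k - 1 = 837.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(y, xs):
    """Fit y on a constant plus the columns in xs; return (fitted values, R-squared)."""
    rows = [[1.0] + [col[i] for col in xs] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return fitted, 1.0 - ss_res / ss_tot


def breusch_pagan_f(y, xs):
    """F statistic from regressing squared residuals on the original regressors."""
    fitted, _ = ols(y, xs)
    resid_sq = [(yi - fi) ** 2 for yi, fi in zip(y, fitted)]
    _, r2_aux = ols(resid_sq, xs)
    n, k = len(y), len(xs)
    return (r2_aux / k) / ((1.0 - r2_aux) / (n - k - 1))


# Invented data: deviations from the line y = 2x grow in proportion to x.
x = [float(i) for i in range(1, 21)]
y = [2.0 * xi + (0.5 * xi if i % 2 == 0 else -0.5 * xi) for i, xi in enumerate(x, start=1)]
f_stat = breusch_pagan_f(y, [x])
```

A large F statistic from the auxiliary regression indicates that the squared residuals move with the regressors, which is the pattern the test in the text detects.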
Therefore, robust standard errors are reported in all subsequent models. Variables
that immediately proved to be very insignificant were DivWin5y, LgWin5y, NewTeam,
and SB, all of which had p-values above 0.50. WSWin5y is still fairly
significant, with a p-value of about 0.20, even though DivWin5y and LgWin5y were not. For the next
model, LgWin5y will be left in to observe whether its significance increases with the
removal of DivWin5y. Hence, a team’s recent success is controlled for with LgWin5y and
WSWin5y. The variables omitted in the next model were DivWin5y, NewTeam, and
SB, as they do not appear to have a significant impact on salary.
Multiple variables showed signs of multicollinearity, as observed in
the test for multicollinearity section below. The key independent variable, home runs,
showed multicollinearity, but this is expected given the interactive home run terms.
Slugging percentage and runs scored also showed multicollinearity. Since slugging
percentage reflects both power numbers and batting average, it is taken out
of the model because power numbers (HR) and batting average are already controlled for in the
equation. Runs scored and runs batted in are also collinear with each other, which is
to be expected, as they are somewhat similar statistics. Since runs batted in was more
significant, and is more closely tied to power numbers than runs scored, it was
left in while runs scored was omitted. Therefore the next model observed is as follows:
AdjSal1000 = β0(Constant) + β1HR + β2HRpre + β3HRpost + β4LgWin5y + β5WSWin5y + β6AL + β7TeamAttby1000 + β8BA*100 + β9RBI + β10PreStEra + β11PostStEra
The adjusted R2 for this second model, .397, is lower than for the base model. Runs batted in
still shows signs of multicollinearity, as its VIF = 5.01. In the subsequent model, RBI will be
replaced with R to see if runs scored fits better. LgWin5y appears to have become far less
significant in this model, even after DivWin5y is removed, therefore LgWin5y is to be
removed and WSWin5y is to become the only control for a team’s success. The third model is
as follows:
AdjSal1000 = β0(Constant) + β1HR + β2HRpre + β3HRpost + β4WSWin5y + β5AL + β6TeamAttby1000 + β7BA*100 + β8R + β9PreStEra + β10PostStEra
The adjusted R2 for this third model, .402, is higher than that of the second model.
There is no issue of multicollinearity in this model. All variables are significant at the 10%
level except for the interactive variable for home runs after the steroid era, HRpost, which is
significant at the 25% level, and the dummy variables for each era, PreStEra and PostStEra,
which were both significant at the 20% level.
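The VIF of 5.01 reported for RBI in the second model can be read through the standard definition of the variance inflation factor, where R_j^2 is the R-squared from regressing regressor j on the other regressors. The cutoff applied in the paper is not stated; a threshold near 5 is assumed here only because 5.01 is treated as a sign of multicollinearity.

```latex
\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\quad\Longrightarrow\quad
R_j^2 = 1 - \frac{1}{\mathrm{VIF}_j},
\qquad
R_{\mathrm{RBI}}^2 = 1 - \frac{1}{5.01} \approx 0.80
\]
```

Read this way, roughly 80% of the variation in RBI in the second model is accounted for by the other regressors, which is in line with the decision to try runs scored in its place.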
The next step is to find the best lin-log model, with the key independent variables
logged, to see if it fits the data better than the linear specification. The fourth model:
AdjSal1000 = β0(Constant) + β1ln_HR + β2ln_HRpre + β3ln_HRpost + β4DivWin5y + β5LgWin5y + β6WSWin5y + β7AL + β8TeamAttby1000 + β9NewTeam + β10BA*100 + β11SLG*100 + β12R + β13RBI + β14SB + β15PreStEra + β16PostStEra
The adjusted R2 for this model is .388, much lower than any of the models without using the
natural log of home runs. The key independent variable ln_HR is not significant in this model.
Other highly insignificant variables were DivWin5y, LgWin5y, NewTeam, BA*100, SB, and
PostStEra, therefore they were all removed. RBI experienced multicollinearity, so it was also
removed. The following was the fifth model estimated:
AdjSal1000 = β0(Constant) + β1ln_HR + β2ln_HRpre + β3ln_HRpost + β4WSWin5y + β5AL + β6TeamAttby1000 + β7SLG*100 + β8R + β9PreStEra
All variables became much more significant in this model, including ln_HR. But the adjusted
R2 in this model, .378, was less than the best model without taking the natural log of home
runs, therefore the third model is the best fit linear model, one that used AdjSal1000 as the
dependent variable.
The next step is to find the best model for when the natural log of adjusted salary in
thousands of dollars is the dependent variable. The sixth model used is as follows:
ln_AdjSal1000 = β0(Constant) + β1HR + β2HRpre + β3HRpost + β4DivWin5y + β5LgWin5y + β6WSWin5y + β7AL + β8TeamAttby1000 + β9NewTeam + β10BA*100 + β11SLG*100 + β12R + β13RBI + β14SB + β15PreStEra + β16PostStEra
This model explains 38.8% of the variation of the dependent variable. Upon observing the
results of this output, there were a few very insignificant variables. Once again, DivWin5y
and LgWin5y were both insignificant and are removed. The NewTeam variable is also
removed because it is insignificant.
SLG, RBI, and R all have signs of multicollinearity, but for the same reasons as in an
earlier model, R will be left in as removing RBI might fix this problem. SLG is removed.
Therefore the seventh model observed is as follows:
ln_AdjSal1000 = β0(Constant) + β1HR + β2HRpre + β3HRpost + β4WSWin5y + β5AL + β6TeamAttby1000 + β7BA*100 + β8R + β9SB + β10PreStEra + β11PostStEra
The adjusted R2 of this model is .379. There are no signs of multicollinearity, but SB and the
dummy variable PreStEra are both highly insignificant, so they are removed. The interactive
term for home runs after the steroid era, HRpost, is insignificant in this model, but it will be
used in the next model to see if it will have a greater significance with these two variables
eliminated. The eighth model is as follows:
ln_AdjSal1000 = β0(Constant) + β1HR + β2HRpre + β3HRpost + β4WSWin5y + β5AL + β6TeamAttby1000 + β7BA*100 + β8R + β9PostStEra
The adjusted R2 for this model is .381. Every variable is significant in this model besides
HRpost, therefore it is removed. Since the dummy variable for post steroid era will still be in
the equation, there is a control for the level of salary after the steroid era, but the
impact of home runs on salary is no longer allowed to differ in the post steroid era. The ninth
model is as follows:
ln_AdjSal1000 = β0(Constant) + β1HR + β2HRpre + β3WSWin5y + β4AL + β5TeamAttby1000 + β6BA*100 + β7R + β8PostStEra
The adjusted R2 for this model is .381. All of the variables are significant at the 10% level,
except for WSWin5y, which is significant at the 11.5% level. This is the best fit log-lin model.
The final step is analyzing what the best fit of the data is when the natural log of salary
and home runs is taken. The tenth model analyzed is as follows:
ln_AdjSal1000 = β0(Constant) + β1ln_HR + β2ln_HRpre + β3ln_HRpost + β4DivWin5y + β5LgWin5y + β6WSWin5y + β7AL + β8TeamAttby1000 + β9NewTeam + β10BA*100 + β11SLG*100 + β12R + β13RBI + β14SB + β15PreStEra + β16PostStEra
The adjusted R2 for this model is .383. The variables that weren’t significant were ln_HRpost,
DivWin5y, LgWin5y, and PreStEra and thus will be removed. The variables that experienced
multicollinearity were the three HR terms and the two dummy variables that describe era, but
only the dummy variable for PreStEra will be removed as leaving PostStEra in the model will
be the only control for post steroid era because ln_HRpost is removed. Also SLG*100 and
RBI experienced signs of multicollinearity, so they are both removed. The eleventh equation
is as follows:
ln_AdjSal1000 = β0(Constant) + β1ln_HR + β2ln_HRpre + β3WSWin5y + β4AL + β5TeamAttby1000 + β6NewTeam + β7BA*100 + β8R + β9SB + β10PostStEra
In this model, NewTeam and SB are not significant, and therefore are removed. Therefore the
following model represents the best fit log-log model:
ln_AdjSal1000 = β0(Constant) + β1ln_HR + β2ln_HRpre + β3WSWin5y + β4AL + β5TeamAttby1000 + β6BA*100 + β7R + β8PostStEra
The adjusted R2 is .370, which is lower than the adjusted R2 for the best-fit log-lin model, the
ninth model, therefore the best log-lin model describes the data better than the best log-log
model.
An MWD test is required to determine whether salary is better modeled in levels
or in natural logs. The two equations used for this test are the
best fit linear model, model 3, and the best fit log-lin model, model 9. From the
predicted values for AdjSal1000 in the best fit linear model (Yhat) and the predicted values of
ln_AdjSal1000 in the best fit log-lin model (lnYhat), the values Z1 (=ln(Yhat) - lnYhat) and Z2
(=e^(lnYhat) - Yhat) were created. The hypothesis for which functional form fits the data best
is as follows:
H0: linear model is best
HA: log-lin model is best
When the best-fit linear model is regressed with Z1 added, Z1 appears to be statistically significant at
the .1% level, therefore we reject the null hypothesis that the linear model is the best fit for
Smith18
the data. When the best-fit log-lin model is regressed with Z2, Z2 shows no signs of statistical
significance, therefore we fall to reject the alternative hypothesis that the log-lin model is the
best fit model. Therefore it can be concluded the log-lin regression is the best-fit model for the
data:
ln_AdjSal1000 = β0(Constant) + β1HR + β2HRpre + β3WSWin5y + β4AL + β5TeamAttby1000 + β6BA*100 + β7R + β8PostStEra
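The mechanics of the MWD test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic (generated from a log-lin process with a single regressor standing in for home runs), the helper names `ols` and `mwd_test` are hypothetical, and the standard errors are the classical ones rather than the robust standard errors used in this paper.

```python
import math
import random

def ols(X, y):
    """OLS via the normal equations. Rows of X must include a leading 1.0.

    Returns (coefficients, classical standard errors, fitted values).
    """
    n, k = len(X), len(X[0])
    # Augmented matrix [X'X | I]; Gauss-Jordan turns it into [I | (X'X)^-1].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        piv = A[c][c]
        A[c] = [v / piv for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    inv = [row[k:] for row in A]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = [sum(inv[i][j] * Xty[j] for j in range(k)) for i in range(k)]
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in X]
    s2 = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - k)
    se = [math.sqrt(s2 * inv[i][i]) for i in range(k)]
    return beta, se, fitted

def mwd_test(x, y):
    """Return (t-ratio of Z1 in the linear model, t-ratio of Z2 in the log-lin model)."""
    X = [[1.0, xi] for xi in x]
    ln_y = [math.log(yi) for yi in y]
    _, _, y_hat = ols(X, y)        # fitted values from the linear model (Yhat)
    _, _, ln_y_hat = ols(X, ln_y)  # fitted values from the log-lin model (lnYhat)
    if min(y_hat) <= 0:
        raise ValueError("ln(Yhat) is undefined for non-positive fitted values")
    z1 = [math.log(a) - b for a, b in zip(y_hat, ln_y_hat)]  # Z1 = ln(Yhat) - lnYhat
    z2 = [math.exp(b) - a for a, b in zip(y_hat, ln_y_hat)]  # Z2 = e^(lnYhat) - Yhat
    b1, se1, _ = ols([r + [z] for r, z in zip(X, z1)], y)     # linear model plus Z1
    b2, se2, _ = ols([r + [z] for r, z in zip(X, z2)], ln_y)  # log-lin model plus Z2
    return b1[-1] / se1[-1], b2[-1] / se2[-1]

# Synthetic data only: the true process is log-lin, so Z1 should matter in the
# linear model while Z2 should usually not matter in the log-lin model.
random.seed(0)
x = [random.uniform(0, 50) for _ in range(400)]
y = [math.exp(5.0 + 0.035 * xi + random.gauss(0, 0.2)) for xi in x]
t_z1, t_z2 = mwd_test(x, y)
print(f"t(Z1) in linear model:  {t_z1:.2f}")
print(f"t(Z2) in log-lin model: {t_z2:.2f}")
```

A large t-ratio on Z1 is evidence against the linear form, and a small t-ratio on Z2 means the log-lin form is not rejected, which is the pattern reported for models 3 and 9.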
This model of player salary controls for a player's home runs in the previous season, his home
runs hit before the steroid era, each World Series a player's current team had won in the past
5 years, the league the player is in, a team's home attendance, a player's batting average,
runs scored, and whether the season was after the steroid era. These variables all had the
predicted sign. Home runs, home runs before the steroid era, team attendance, runs scored, and
the post steroid era dummy variable were all significant at the 1% level, the American League
dummy variable and batting average were significant at the 5% level, and World Series
Championships from the previous 5 seasons was significant at the 20% level.
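The significance levels above can be checked roughly from the coefficients and robust standard errors reported for model 9 in the regression table. The sketch below assumes a normal approximation to the t distribution (reasonable with 854 observations) and uses the standard errors as printed, which are rounded to three decimals, so the ratios are approximate. TeamAttby1000 is left out because its standard error rounds to 0.000 in the table. The function name `two_sided_p` is hypothetical.

```python
import math

# (coefficient, robust standard error) from column (9) of the regression table.
MODEL_9 = {
    "HR": (0.0353, 0.004),
    "HRpre": (-0.0126, 0.004),
    "WSWin5y": (0.0819, 0.052),
    "AL": (0.137, 0.064),
    "BA*100": (0.0252, 0.012),
    "R": (0.0128, 0.002),
    "PostStEra": (0.263, 0.076),
}

def two_sided_p(coef, se):
    """Two-sided p-value for H0: coefficient = 0 under a normal approximation."""
    t = coef / se
    return math.erfc(abs(t) / math.sqrt(2))  # equals 2 * (1 - Phi(|t|))

for name, (coef, se) in MODEL_9.items():
    print(f"{name:10s} t = {coef / se:6.2f}   p = {two_sided_p(coef, se):.4f}")
```

Under these assumptions the World Series variable has a t-ratio of about 1.6, which falls between the 10% and 20% thresholds, consistent with the statement above.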
With all of the other variables in the model controlled for (home runs, home runs before the
steroid era, World Series Championships a team has won in the previous 5 seasons, the league a
player is in, team attendance, batting average, runs scored, and whether the player is playing
in the post steroid era season, 2010), the impact of a change in each variable can be
interpreted. Each home run a player hits during the steroid era increases his salary by 3.53%
for the next season, with all other variables remaining constant. Theoretically, extrapolating
this per-home-run effect linearly, the player who hit the maximum number of home runs in 2000
(50) had a salary that was 176.5% (50 × 3.53%) higher than a player who hit the minimum number
of home runs (0) for that year. For every home run a player hit in 1990, salary increases by
2.27% for the following season, with all other variables held constant. Even though the
coefficient on the interactive term for home runs before the steroid era is negative, the
impact on salary of a
home run hit before the steroid era is still positive, as this negative coefficient only
reflects the difference between the impact on salary of a home run hit before the steroid era
and one hit during the steroid era (.0353 - .0126 = .0227). Another way to phrase this is that
the impact on salary of hitting a home run before the steroid era is 1.26 percentage points
less than the impact of hitting a home run in the steroid era, with all other variables held
constant. There is no significant difference between hitting a home run during the steroid era
and after the steroid era, as the post steroid era home run interactive variable was not
statistically significant when observing a log-lin relationship.

MLB Salary Function Measured in Thousands of Dollars Adjusted for Inflation

Variable      Sign  (1) Base Linear        (3) Best-fit Linear    (6) Base Log-Lin       (9) Best-fit Log-Lin
Constant            -5501.9*** (1335.106)  -5909.9*** (1253.243)  4.950*** (0.355)       4.828*** (0.326)
HR            +     148.4*** (47.62)       155.0*** (24.648)      0.0289** (0.012)       0.0353*** (0.004)
HRpre         -     -122.4*** (25.193)     -125.1*** (24.960)     -0.0107* (0.006)       -0.0126*** (0.004)
HRpost        -     47.52 (38.008)         45.45 (38.047)         0.00553 (0.007)
DivWin5y      +     99.11 (180.067)                               0.0000510 (0.042)
LgWin5y       +     -113.0 (363.232)                              -0.0252 (0.088)
WSWin5y       +     476.5 (374.796)        411.3* (226.270)       0.109 (0.094)          0.0819 (0.052)
AL            +     624.5*** (237.211)     647.1*** (236.579)     0.134** (0.066)        0.137** (0.064)
TeamAttby1000 +     1.193*** (0.201)       1.253*** (0.172)       0.000330*** (0.000)    0.000324*** (0.000)
NewTeam       -     286.3 (592.384)                               0.114 (0.163)
BAper         +     125.8* (70.506)        83.97* (44.413)        0.0420* (0.022)        0.0252** (0.012)
SLGper        +     -43.41 (46.879)                               -0.0172 (0.013)
R             +     17.87* (10.762)        32.32*** (6.468)       0.00659** (0.003)      0.0128*** (0.002)
RBI           +     22.39* (12.207)                               0.0101*** (0.003)
SB            +     9.459 (14.449)                                0.00397 (0.004)
PreS_Era      -     347.6 (361.608)        475.2 (342.135)        -0.0603 (0.126)
PostS_Era     +     666.9 (512.409)        693.5 (502.386)        0.191 (0.141)          0.263*** (0.076)

N. of Cases         854                    854                    854                    854
Adj R-Squared       0.403                  0.402                  0.388                  0.381
SEE                 3391.7                 3393.4                 0.933                  0.938
F-ratio             26.61                  41.60                  49.61                  88.71
SSR                 9.62864e+09            9.70719e+09            728.2                  743.1

Robust standard errors in parentheses. * sig. at 10%, ** sig. at 5%, *** sig. at 1%.
Table includes base lin-lin model, best lin-lin model, base log-lin model, and best log-lin model.
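The home run percentages quoted in this discussion read the log-lin coefficients directly as percent changes. That reading is a close approximation for a single home run, but it understates the implied effect when many home runs are added together, because effects in a log-lin model compound. The sketch below converts changes in ln(salary) to exact percentage changes using the model 9 coefficients; the function name `pct_change` is hypothetical.

```python
import math

B_HR = 0.0353      # home run coefficient (steroid era baseline), model 9
B_HRPRE = -0.0126  # interaction term: home runs hit before the steroid era, model 9

def pct_change(log_points):
    """Exact percentage change in salary implied by a change in ln(salary)."""
    return (math.exp(log_points) - 1.0) * 100.0

per_hr_steroid = pct_change(B_HR)        # about 3.6, close to the 3.53 approximation
per_hr_pre = pct_change(B_HR + B_HRPRE)  # about 2.3, close to the 2.27 approximation
fifty_hr = pct_change(50 * B_HR)         # about 484, far above the linear 176.5

print(f"One home run, steroid era:     {per_hr_steroid:.2f}%")
print(f"One home run, pre steroid era: {per_hr_pre:.2f}%")
print(f"50 versus 0 home runs in 2000: {fifty_hr:.1f}%")
```

Under the exact conversion, the gap between a 50 home run hitter and a 0 home run hitter in 2000 is considerably larger than the linear extrapolation suggests, which strengthens rather than weakens the point about the economic significance of home runs.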
The impacts of the team-level statistics from the year before salary is taken (salaries are
from 1991, 2001, and 2011) are as follows. Each World Series a team has won in the previous 5
seasons increases the salary of every hitter on the team by 8.19%, with all other variables
held constant. Baseball players who play in the American League have a salary 13.7% higher
than the salary of players who play in the National League, with all other variables held
constant. For every thousand-person increase in a team's home attendance, the salary of the
hitters on that team increases by .0324% for the following season, with all other variables
held constant. This number may seem small, but in practical terms it still has economic
significance. Baseball stadiums can fit tens of thousands of spectators for each game, and a
team plays 81 home games each season. If a team averages a thousand more spectators a game
than another team, the team with more spectators per game pays its hitters 2.62%
(81*.0324%) more than the team with fewer spectators pays its hitters, with all other
variables held constant. Since teams like the New York Yankees can sometimes average ten
thousand or more additional spectators per game than a smaller market team, it is clear that
team attendance has considerable economic significance for player salary.
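The attendance arithmetic above can be written out as a short calculation. This sketch follows the same linear reading of the coefficient used in the text and assumes, as the text does, 81 home games per season; the function name `salary_effect_pct` is hypothetical.

```python
B_ATT = 0.000324   # coefficient on TeamAttby1000 (season home attendance, in thousands)
HOME_GAMES = 81    # home games per regular season

def salary_effect_pct(extra_fans_per_game):
    """Approximate percent salary difference implied by extra fans per home game."""
    extra_season_thousands = extra_fans_per_game * HOME_GAMES / 1000.0
    return B_ATT * extra_season_thousands * 100.0

print(f"{salary_effect_pct(1_000):.2f}%")   # one thousand more fans per game
print(f"{salary_effect_pct(10_000):.2f}%")  # ten thousand more fans per game
```

With ten thousand additional spectators per game, the same calculation implies a salary difference roughly ten times the 2.62% figure.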
The only other player statistics besides home runs that appear to influence salary are
batting average and runs scored. Since batting average, in this case, is measured as the
percentage (BA*100) of at-bats in which a player gets a hit, it can be inferred that for every
one percentage point increase in the percentage of at-bats where a player gets a hit, their
salary increases by 2.52% for the following season, with all other variables held constant.
Every run a player scores increases their salary for the following season by 1.28%, with all
other variables held constant. The last variable of the best model is the dummy variable for
the post steroid era. The interpretation of this variable is that, on average, hitters from
the post steroid era were paid 26.3% more than hitters in the steroid era, with all other
variables held constant.
To see the practical application of this model, the predicted salaries in 1991, 2001, and
2011 of the MVPs from both the American and National Leagues in 1990, 2000, and 2010,
respectively, will be observed. The 1990 American League MVP was Rickey Henderson, who
happens to be baseball's all-time stolen base leader. Henderson played for the Oakland
Athletics in 1991, and in 1990 he hit 28 home runs with a BA*100 of 32.52 and led both leagues
with 119 runs scored, as seen in the Descriptive Statistics table. Henderson's salary in 1991
adjusted for inflation to 2011 is predicted to be $7,829,877 when in actuality it was
$5,369,000. This is not a large difference, and it could be explained by the fact that there
is no control for speed in this model, and Henderson was one of the fastest players in
baseball at this time. The 1990 National League MVP was Barry Bonds, who happens to be the
all-time home run leader in baseball's history. Bonds played for the Pittsburgh Pirates in
1991, and in 1990 hit 33 home runs with a BA*100 of 30.06 and scored 104 runs. This model
predicts Bonds' adjusted 1991 salary to be $4,148,335 when in actuality it was $3,799,600, a
close estimation.
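A prediction of this kind can be sketched by plugging a player's values into the best-fit log-lin equation and exponentiating. The coefficients below are the model 9 values from the regression table. Two of Henderson's inputs are assumptions that do not appear in this section: a 1990 Oakland home attendance of roughly 2.9 million (2,900 in thousands) and one World Series win for Oakland in the previous five seasons. The function name `predicted_salary` is hypothetical, and the sketch simply exponentiates the fitted log value with no retransformation correction.

```python
import math

# Coefficients of the best-fit log-lin model, column (9) of the regression table.
COEF = {
    "Constant": 4.828,
    "HR": 0.0353,
    "HRpre": -0.0126,
    "WSWin5y": 0.0819,
    "AL": 0.137,
    "TeamAttby1000": 0.000324,
    "BA*100": 0.0252,
    "R": 0.0128,
    "PostStEra": 0.263,
}

def predicted_salary(player):
    """Predicted inflation-adjusted salary in dollars from the log-lin model."""
    ln_salary_thousands = COEF["Constant"] + sum(
        COEF[name] * value for name, value in player.items()
    )
    return math.exp(ln_salary_thousands) * 1000.0  # model is in thousands of dollars

# Rickey Henderson's 1990 season. HRpre equals HR because 1990 is before the
# steroid era. TeamAttby1000 and WSWin5y are assumed values, not from the paper.
henderson_1990 = {
    "HR": 28, "HRpre": 28, "WSWin5y": 1, "AL": 1,
    "TeamAttby1000": 2900.0, "BA*100": 32.52, "R": 119, "PostStEra": 0,
}
print(f"${predicted_salary(henderson_1990):,.0f}")
```

With these assumed inputs the sketch lands in the neighborhood of $7.8 million, close to the prediction reported above; small differences are expected because the table coefficients are rounded.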
In 2000, the American League MVP was Jason Giambi and the National League MVP was Jeff Kent.
Giambi played for Oakland in 2001, and in 2000 hit 43 home runs with a BA*100 of 33.33 and
108 runs scored. The model predicted Giambi to have an adjusted salary of $10,147,180 in 2001
when it was actually $5,215,336. Giambi was not yet an established veteran hitter in MLB at
this time, which could explain why he was not being paid as much as the model predicts. In
2001, Jeff Kent played for the San Francisco Giants. In 2000, he hit 33 home runs with a
BA*100 of 33.39 and 114 runs scored. His salary adjusted for inflation was predicted to be
$11,722,990 in 2001 when it was actually $7,626,000. This can be explained by the fact that
Kent was bound to a contract he signed with the Giants and was paid this fixed amount of $7.6
million from 2000-2002, therefore his salary in 2001 did not have a chance to increase because
of his MVP season in 2000.
In 2010, the MVP from the American League was Josh Hamilton and the MVP from the National
League was Joey Votto. Hamilton played for the Texas Rangers in 2011, and in 2010 hit 32 home
runs, led both leagues with a BA*100 of 35.9 and scored 95 runs. His salary is predicted to
be $10,831,960 in 2011 based on his performance in 2010, but his actual salary was
$8,750,000. Hamilton signed a new 2-year, $24 million contract with the Rangers after his
2010 MVP season, the bulk of which he was paid in the 2012 season. Since his contract
averaged $12 million a year, his salary is somewhat reflected in this model. Votto was signed
with the Cincinnati Reds for the 2011 season, and compiled 37 home runs and 106 runs scored
with a BA*100 of 32.36 in his 2010 MVP campaign. The model predicts Votto's salary to be
$10,268,240 in 2011 when it was actually $7,410,655. Considering the fact that Votto was paid
less than a million dollars for the 2010 season and signed a 3-year, $38 million contract
before the 2011 season, the model does a fairly good job predicting his salary. A pitcher
will sometimes be awarded the MVP, but it is a rare occurrence and did not happen in any of
these three years. It is interesting to point out that none of the MVPs in these three years
was the top home run hitter for that year.
The 3 players who led the league in home runs in 1990, 2000, and 2010 were Cecil Fielder,
Sammy Sosa, and Jose Bautista with 51, 50, and 54 home runs respectively. Fielder's predicted
salary adjusted for inflation was $5,642,801 when it was actually $2,891,000 in 1991. This
difference is explained by the fact that Fielder had little success in MLB up to 1988 and
decided to play baseball in Japan in 1989. Therefore 1990 was his first season back in MLB,
when he was not yet an established player. It would be hard for a team to commit too much
money to a player who just had his first good season in MLB at the age of 26. Sosa's
predicted adjusted 2001 salary is $15,659,430 from this model, and he was actually paid
$15,887,500. The model was extremely effective in predicting Sosa's salary in 2001. In 2011,
Bautista was paid $8 million, yet the model predicted his salary to be $16,498,840, more than
double the actual value. Bautista was a veteran baseball player in 2010, as he had already
completed 6 MLB seasons with little success. 2010 was his first productive MLB season, and
his salary jumped from $2 million in 2010 to $8 million in 2011 to $14 million in 2012.4
While it is clear that there are some challenges in predicting MLB salary for a given year
based solely on the statistics from the previous season, the best-fit model from the data
demonstrates some merit when predicting MLB salary.
V. CONCLUSION
The results from the data lend some credibility to the hypothesis of this research. As
expected, home runs hit had a positive and strongly statistically significant relationship
with salary in every era. With 99% confidence, it can be concluded that the impact on salary
of hitting a home run before the steroid era was less than the impact of hitting a home run
during the steroid era. The part of the hypothesis that is not supported is that there is a
significant difference in the impact on salary of home runs hit during and after the steroid
era, as the interactive term for home runs after the steroid era showed no statistical
significance in the model.
4 Statistics, contracts, and MVPs for the predictions above are from http://www.baseball-reference.com/
The variables used in the model all controlled for important aspects of MLB salaries.
The values for a team’s success over the past 5 seasons were relevant because the teams that
win more have the ability and are more willing to pay for better players to continue the
success of the franchise. Even though World Series Championships over the past 5 seasons
was the only significant variable for team success in the final model, it still served an
important purpose to control for a team’s success. The American League variable proved to
be significant in describing the data, and it is surprising to see that when all other variables are
controlled for, players in the American League earn about 13.7% more than players in the
National League. Unlike the team success variables, team attendance controls for the revenue of
a team, as ticket sales are an important source of income for every baseball team. It is not
surprising to see a statistically significant impact of team revenue, as represented by
attendance, on the salary teams give their players for the next season. The only offensive
player variables that proved to be statistically significant with salary besides home runs hit
were batting average and runs scored. While multicollinearity was expected with slugging
percentage for reasons mentioned earlier in this paper, it was surprising to see that RBI did
not have a significant impact on salary, and that runs scored proved to be more significant.
RBI is constantly used in baseball to measure success. It is one of the three statistics, along
with batting average and home runs, that are always seen next to a player’s name when a
player is up to bat, therefore one would imagine it to have a significant impact on salary.
Unfortunately, stolen bases, the control for speed, did not have a significant impact on
salary. The period before the steroid era was controlled for through the home run interactive
variable, even though the dummy variable for before the steroid era did not have an impact on
salary. While there was no difference between the impact on salary of home runs hit during
the steroid era and after the steroid era, players after the steroid era were shown to have
higher salaries, as reflected in the post steroid era dummy variable, a result that was
expected.
There were a few variables that could have been considered as a control for different
aspects of salary, but were not available to include in this model. Such variables include a
player’s age or time of service in the MLB, a control for a player’s recognition, defensive
ability, position, and country of birth. Variables that could have been used to control for a
player's recognition include the number of MVP awards, Silver Sluggers and Gold Gloves a
player has won along with the number of All-Star games a player has been elected to. A
control for defense could be fielding percentage ((defensive attempts - errors)/defensive
attempts), but this is tricky, as it does not necessarily reflect the range a player can cover
when playing defense or a player's ability to make outstanding defensive plays that an average
player would not be able to make. The Gold Glove award often reflects this ability to make
spectacular defensive plays, and a variable for Gold Glove Awards won could have a
significant impact on salary. The position a player generally plays could also serve as an
important control for salary as some positions on the field might have more value than others
when determining salary. A simplified way to look at this could be to have a dummy variable
if a player is an outfielder or a dummy variable if a player is an infielder. This idea could also
have some problems, as there are a number of baseball players who play multiple positions in
a given season. Country of birth could also be shown to have a significant impact on salary, as
players from other countries such as the Dominican Republic, Venezuela, and Cuba have,
especially recently, become some of the most highly desired players in baseball. An
improvement of this research could include some of these variables when analyzing player
salary. Regardless, even without all of these controls, the best model is still successful at
roughly predicting player salary.
Because there are many different reasons why a player's success in a season may not be
reflected in their salary the next season, it can be difficult to have one model that does a
good job at predicting salary. It would be ideal if every player signed a new contract or
experienced free agency every year, but this often does not happen, even though there are many
free agents after each season who are free to sign the best contract they are offered. Many
players are bound by long-term contracts, and even though there are bonuses a player can
obtain based on performing well, this fact undermines the idea that salary is a fluid value
that changes every year. A way baseball has tried to fix this problem is through salary
arbitration. This is a process where a player can argue for a higher salary from his team
based on his performance from the previous season, but a player is only eligible for it from
their 3rd season (sometimes 2nd season) until their 6th season.5 Players who do not have the
opportunity for arbitration, therefore, are bound to their contracts, for better or worse.
Even with all of the limitations that inhibit changing salaries in baseball on a yearly basis,
this model does a fairly good job of explaining salary, as it explains 38.1% of the variation
in player salary.
The steroid era changed the structure of baseball in ways that are still seen today. In the
late 90's, the home run became a glorified statistic and the new measure of a player's
success. The home run chase between Mark McGwire and Sammy Sosa in 1998 and Barry Bonds' run
at Hank Aaron's all-time home run record will go down as two of the most memorable moments in
baseball's history. The excitement of the home run during the peak of the steroid era of
baseball may never be matched again. While baseball players still get caught using PEDs
today, it is generally accepted that the number of players using steroids in baseball now is
far less than 15 years ago, yet the scars of the steroid era are still felt by MLB as it
becomes known that the heroes of a generation had an advantage over the other all-time greats
of the game.
5 Arbitration rules obtained from http://www.fangraphs.com/library/business/mlb-salary-arbitration-rules/
Sources
Data of player statistics and salary information: http://www.seanlahman.com/baseball-archive/statistics/
Supplemental statistics: http://www.baseball-reference.com/
Arbitration rules: http://www.fangraphs.com/library/business/mlb-salary-arbitration-rules/
Opinion of start and end of Steroid Era: http://espn.go.com/mlb/topics/_/page/the-steroids-era