FP FINAL ECON 407

THE EVOLUTION OF MODERN BASEBALL: HOME RUNS HIT AND ITS IMPACT ON SALARY THROUGHOUT BASEBALL’S STEROID ERA

WILLIAM SMITH

ECONOMETRICS II: APPLIED ECONOMETRICS

DR. LAURA BOYD

DECEMBER 18, 2015

I. OVERVIEW

The main objective of this project is to understand the changing factors in Major

League Baseball during what is known as the “steroid era” of baseball. This era defines a

period where professional baseball players were taking advantage of MLB’s lack of league

wide performance enhancing drug, or PED, testing. While it is hard to pinpoint the exact

dates, or even years, in which the steroid era occurred, it is generally accepted to have lasted

from the late 80’s through the late 2000’s. MLB prohibited the usage of steroids in 1991, but

no formal drug testing was established until 2003.¹

Because of baseball’s absence of a drug testing policy, the late 90’s and early 2000’s

had some of the most impressive offensive seasons ever seen in baseball. The 6 individual

seasons with the most home runs by a player in MLB history all occurred between 1998 and

2001 (Sosa 3, McGwire 2, Bonds 1). This time period produced some of the most memorable

players in baseball’s history due to all of the long balls that were being hit. Players like Barry

Bonds, Mark McGwire, Sammy Sosa, and Alex Rodriguez were the faces of MLB and had

immense popularity at the peak of the steroid era, and every one

of these players has since been linked to PED use. The steroid era defined a

period where spectators paid to see their favorite players hit home runs; therefore, I argue that

teams in the steroid era were more willing to invest in players who hit the most home runs

than teams in any other era of baseball.

For the sake of my research, I would like to estimate the relationship between home runs

hit by a player and their salary for the following season, and to assess its economic and statistical

significance, before, during, and after the steroid era. Hence, my dependent variable is

salary and my key independent variable is home runs. Using individual seasons to represent

each era, I will define 2000 as the peak year of the steroid era as it was also the season with

¹ Opinion of timeline of the steroid era, year of steroid ban and mandatory drug testing from http://espn.go.com/mlb/topics/_/page/the-steroids-era

Graph 1

Data obtained from Baseball-Almanac.com. Note: 1994 had a shortened season due to the players’ strike.

the highest home run total for the league in baseball’s history, as shown in Graph 1. To

analyze the time before and after the steroid era, I will use data from the 1990 and 2010

baseball seasons, respectively. I chose 1990 to represent the pre-steroid era because this was

before baseball experienced the abnormally high home run totals that it did in the later 90’s

and early 2000’s. The 2010 season represents the post-steroid era as MLB had established

PED testing and teams began to look for defense, speed, and pitching more and home runs

less in their players. I hypothesize home runs will have a positive impact on player salary for

every era, but that its impact will be greater and more significant in 2000 (steroid era) than

in 2010 (post-steroid era), which will have a greater and more significant impact than in 1990

(pre-steroid era).

II. ESTIMATION

All data obtained is cross-sectional data from Sean Lahman’s Baseball Database, in

particular the data from his Batting and Salary files. I took individual offensive player

[Graph 1: Total Home Runs in MLB from 1982-2013 (x-axis: Year; y-axis: Home Runs)]

statistics from the three years identified above, and associated those individual seasons with

the salary the player made the following season, 1991, 2001, and 2011 respectively. I did this

as a way to demonstrate how a player’s performance in a season impacts their salary for the

following season. I also analyzed various team statistics from 1990, 2000, and 2010, and

associated those stats with a player based on their current team, the team that was paying him

salary in 1991, 2001, and 2011. Including team statistics in my model is a way to control for

team success as a determinant of the salary teams can pay their players. Essentially, every statistic

I used was lagged a year relative to the dependent variable, salary. I converted salaries from 1991 and

2001 to 2011 dollars in order to correct for inflation. The regression I will be

examining combines all of the observations from these 3 different years. I will be using the

following variables to help describe salary in my research:

Table 2

Variable                  Predicted Sign   Description

AdjSal1000 (dependent)                     Annual salary adjusted for inflation to 2011 levels, in thousands
HR                        +                The total home runs an individual hit in a season
HRpre                     -                An interactive variable for HR hit in the 1990 season (pre-steroid era)
HRpost                    -                An interactive variable for HR hit in the 2010 season (post-steroid era)

Team Stats
DivWin5y                  +                Total Division Championships over previous 5 seasons
LgWin5y                   +                Total League Championships over previous 5 seasons
WSWin5y                   +                Total World Series Championships over previous 5 seasons
AL                        +                A dummy variable equal to 1 if team is in American League
TeamAttby1000             +                Total annual home stadium attendance in thousands from previous season
NewTeam                   -                A dummy variable equal to 1 if a team was an expansion franchise in the last 10 years

Player Stats
BA*100                    +                A player’s batting average (hits/at-bats) * 100
SLG*100                   +                A player’s slugging percentage (total bases/at-bats) * 100
R                         +                Runs scored (total times a player touched home plate)
RBI                       +                Runs batted in (runs scored a batter is directly responsible for)
SB                        +                Stolen bases
PreStEra                  -                A dummy variable equal to 1 if pre-steroid era (1990)
PostStEra                 +                A dummy variable equal to 1 if post-steroid era (2010)
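The data construction described above, pairing a player's statistics from season t with his salary in season t+1 and expressing that salary in thousands of 2011 dollars, can be sketched as follows. This is a minimal illustration and not the procedure actually run for this paper: the function names, the record layout, the sample player, and the CPI figures (approximate U.S. CPI-U annual averages) are all assumptions introduced here and should be checked against an official source.

```python
# Approximate CPI-U annual averages; assumed values for illustration only.
CPI = {1991: 136.2, 2001: 177.1, 2011: 224.9}
BASE_YEAR = 2011


def adjust_salary_to_2011_thousands(nominal_salary, salary_year):
    """Convert a nominal salary (in dollars) to thousands of 2011 dollars."""
    factor = CPI[BASE_YEAR] / CPI[salary_year]
    return nominal_salary * factor / 1000.0


def build_observation(stats_row, salary_by_player_year):
    """Pair season-t statistics with the season-(t+1) salary (hypothetical layout)."""
    salary_year = stats_row["year"] + 1
    nominal = salary_by_player_year[(stats_row["player"], salary_year)]
    obs = dict(stats_row)  # copy so the input record is left unchanged
    obs["AdjSal1000"] = adjust_salary_to_2011_thousands(nominal, salary_year)
    return obs


# Hypothetical player record, not taken from the Lahman database.
stats = {"player": "example01", "year": 2000, "HR": 30}
salaries = {("example01", 2001): 5_000_000}
row = build_observation(stats, salaries)
print(round(row["AdjSal1000"], 1))
```

Keeping the salary year as the statistics year plus one makes the one-year lag explicit in a single place, so the same function covers the 1990, 2000, and 2010 seasons.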

The key independent variable in this model is home runs. Since observations from

three different years are combined in the model, two interactive terms that solely looked at

home runs before the steroid era, HRpre, (1990) and after the steroid era, HRpost, (2010)

were used in order to differentiate the impact of home runs on salary for each era. As

predicted in the hypothesis, home runs will have a positive impact on player salary for the

following season because a home run can greatly influence the outcome of a game, but the

two interactive home run terms will both be negative to demonstrate that home runs had a

greater impact on salary during the steroid era.
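One way to read the interactive terms is through the implied effect of one additional home run in each season. The sketch below assumes that HRpre equals HR multiplied by PreStEra and HRpost equals HR multiplied by PostStEra, so that the 2000 season is the reference category, and it writes β1, β2, and β3 for the coefficients on HR, HRpre, and HRpost:

```latex
\frac{\partial\,\text{AdjSal1000}}{\partial\,\text{HR}} =
\begin{cases}
\beta_1 + \beta_2 & \text{1990 (pre-steroid era)} \\
\beta_1           & \text{2000 (steroid era)} \\
\beta_1 + \beta_3 & \text{2010 (post-steroid era)}
\end{cases}
```

Under this reading, the hypothesis that home runs mattered most in 2000, less in 2010, and least in 1990 corresponds to β1 > 0 and β2 < β3 < 0.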

As for team statistics, I predict that the number of Division Championships

(DivWin5y), League Championships (LgWin5y), and World Series Championships

(WSWin5y) a team has won over the previous five seasons will all have a positive impact on

the salary they pay their players for the following season because teams that do well generate

greater revenue and can afford to pay the most sought after players the most money. A

problem with multicollinearity is expected with these three variables, and the appropriate

corrections will be made based on the regression results. The American League dummy

variable should have a positive impact on salary because in the American League (unlike in

the National League), the pitcher does not have to bat and is replaced by a designated hitter, a

player who is generally a successful power hitter. It is expected team attendance measured in

thousands of spectators, TeamAttby1000, will have a positive impact on player salary for the

following season as ticket sales are one of the main sources of revenue for MLB teams, giving

teams the ability to spend more on their players. Since the only recent MLB expansion

franchises were the Florida Marlins and the Colorado Rockies in 1993 and the Arizona

Diamondbacks and the Tampa Bay Devil Rays in 1998,² the NewTeam variable only applies

to players who were on these 4 teams in the 2001 season. Theoretically, if a player was on a

² Franchise data obtained from http://www.baseball-reference.com/

new team, it would negatively influence salary because expansion franchises generally have

to start from scratch and do not have the funds that more established franchises might have.

The player stats included in my model reflect some of the most well-known and used

offensive statistics to define a player’s success. Since these statistics reflect either yearly

percentages or cumulative statistics that only increase with a player’s success, a positive

impact on salary is predicted with these variables. Batting average, represented as BA*100,

should have a positive impact because it is essentially the proportion of at-bats in which a player

gets a hit, and the more hits a team gets in a game, the more runs a team is likely to score, and

the more likely a team is to win. Slugging percentage, represented as SLG*100, should also

have a positive impact for reasons similar to those given for batting average, because it

effectively combines batting average with power statistics (i.e. doubles, triples, and home runs

hit). Some issues with multicollinearity between batting average and slugging percentage are

expected, as these statistics are closely related to each other, and the appropriate corrections

will be made if necessary. Runs scored, or R, is predicted to have a positive impact on salary

because the more runs a team can score, the more likely that team will win. Similarly, runs

batted in, or RBI, is predicted to be positively related with salary because this statistic reflects

the number of runs a batter is directly responsible for creating when in the batter’s box. Stolen

bases, or SB, is a different kind of measurement than the other offensive statistics listed

above; it represents speed and not necessarily performance as a batter. Regardless, stolen

bases are also predicted to have a positive impact because the ability to advance to the next base, and thus

closer to home plate, when on base increases the likelihood that the runner will score.

The effects and conditions that lead to salary differences over time in the MLB are

also controlled for with two dummy variables that define era. The last 25 years in baseball

have seen major increases in salaries. Controlling for each of the different seasons should

reflect this change. The pre-steroid era dummy variable, PreStEra (equal to 1 if observation

from before steroid era), should be negative and the post-steroid era dummy variable,

PostStEra (equal to 1 if observation from after the steroid era), should be positive to reflect

this increase in salaries over time, even when adjusted for inflation.

The base model in this research represents a linear relationship of the combined

observations from the three MLB seasons of salary adjusted for inflation and the independent

variables listed above:

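$$\text{AdjSal1000} = \beta_0 + \beta_1 \text{HR} + \beta_2 \text{HRpre} + \beta_3 \text{HRpost} + \beta_4 \text{DivWin5y} + \beta_5 \text{LgWin5y} + \beta_6 \text{WSWin5y} + \beta_7 \text{AL} + \beta_8 \text{TeamAttby1000} + \beta_9 \text{NewTeam} + \beta_{10} \text{BA*100} + \beta_{11} \text{SLG*100} + \beta_{12} \text{R} + \beta_{13} \text{RBI} + \beta_{14} \text{SB} + \beta_{15} \text{PreStEra} + \beta_{16} \text{PostStEra}$$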

The effectiveness of a logarithmic relationship will also be explored, either by using logged

salary, logged home runs, or by logging both.
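For reference, the four candidate specifications differ in how the home run coefficient is read. The summary below gives the standard textbook interpretation, written with a generic coefficient β1 and with "..." standing in for the remaining regressors; it is notation introduced here for illustration, not a result from this paper:

```latex
\begin{aligned}
\text{lin-lin:} \quad & \text{AdjSal1000} = \beta_0 + \beta_1 \text{HR} + \dots
  && \Delta \text{AdjSal1000} \approx \beta_1 \, \Delta \text{HR} \\
\text{lin-log:} \quad & \text{AdjSal1000} = \beta_0 + \beta_1 \ln(\text{HR}) + \dots
  && \Delta \text{AdjSal1000} \approx (\beta_1 / 100) \, \%\Delta \text{HR} \\
\text{log-lin:} \quad & \ln(\text{AdjSal1000}) = \beta_0 + \beta_1 \text{HR} + \dots
  && \%\Delta \text{AdjSal1000} \approx 100 \, \beta_1 \, \Delta \text{HR} \\
\text{log-log:} \quad & \ln(\text{AdjSal1000}) = \beta_0 + \beta_1 \ln(\text{HR}) + \dots
  && \%\Delta \text{AdjSal1000} \approx \beta_1 \, \%\Delta \text{HR}
\end{aligned}
```

Note that ln(HR) is undefined for a player with zero home runs, so any specification with logged home runs has to treat those observations in some way.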

III. DATA

Sean Lahman’s baseball database was used for all data represented in the statistics and

regressions seen below. For reference, a player who plays in all 162 games of a season will generally have

over 600 at-bats; observations where a player did not have over 200 at-bats in the

given year were therefore eliminated as too limited a sample. This prevents low totals and possibly erratic averages due

to limited at-bats for a player from being taken into account when considering salary for the

following season. In the 1990 season, there were only 26 teams, whereas in the 2000 and

2010 seasons there were 30 teams. This is reflected in the lower number of observations from

the 1990 season seen in the Descriptive Statistics table below.

Even though baseball had fewer teams in 1990 than in 2000 and 2010, a team was more

likely to win its division, and thus to earn the chance to play for the league title and the World Series, in 2000 and 2010 than in 1990

because of MLB’s changing playoff structure. Winning the division guarantees a team a spot

in the playoffs, the place where the team can then compete for the pennant (League

Championship) and the World Series. There were only 2 divisions per league in 1990 (4 in

total), which expanded to 3 divisions per league in 1994 (6 in total). Therefore, there were only a

total of 20 (4 divisions * 5 years) division champions in MLB from 1986-1990, but a total of

30 (6 divisions * 5 years) division champions from 1996-2000 and from 2006-2010. Thus, a

team was more likely to win a division during the steroid and post-steroid eras than in the pre-

steroid era. The expansion of the number of divisions in baseball from 4 to 6 coupled with the

introduction of a guaranteed playoff spot for a Wild-Card team, the team with the most wins

in the league who did not win their division, for each league in 1994, doubled the number of

teams that made it to the postseason in a given year, 8 vs. 4. After 1994, each league had 4 teams,

compared to 2, that were able to compete for the pennant, and 8 teams in total, compared to 4,

were guaranteed an opportunity to win the World Series. These changes to MLB’s

divisional and playoff structure are important in considering the likelihood of a team winning

their division, league, and the World Series before and after 1994.
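As a rough illustration of the size of this change, the share of teams that win a division or reach the postseason in a single season can be computed from the counts above. The calculation assumes 26 teams under the pre-1994 format and 30 teams under the later format; since MLB did not reach 30 teams until 1998, the later figures are only approximate for the first seasons of the 1996-2000 window:

```latex
\begin{aligned}
\text{Division winners per season:} \quad & \frac{4}{26} \approx 15.4\% \ \text{(1990 format)}
  & \qquad & \frac{6}{30} = 20.0\% \ \text{(2000 and 2010 format)} \\
\text{Playoff teams per season:} \quad & \frac{4}{26} \approx 15.4\% \ \text{(1990 format)}
  & \qquad & \frac{8}{30} \approx 26.7\% \ \text{(2000 and 2010 format)}
\end{aligned}
```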

The other major changes that occurred in baseball during this time were the shifting

sizes of both the American and National Leagues. They are as follows:

• 1977-1992: 14 teams in American League and 12 in National League, 26 teams in the league in total

• 1993: National League expands by adding the Florida Marlins and Colorado Rockies. Both leagues at 14 teams (1993-1998)

• 1998: MLB expands to 30 teams by adding Tampa Bay Devil Rays (A.L.) and Arizona Diamondbacks (N.L.). Milwaukee Brewers move to National League. American League at 14 teams, National League at 16 (1998-2013)³

The NewTeam dummy variable being observed is equal to 1 if a team was new to baseball in the past 10 years. Since the only expansions the MLB made during the period observed was in 1993 with the Florida Marlins and Colorado Rockies and in 1998 with the Tampa Bay Devil Rays and the Arizona Diamondbacks, the NewTeam dummy variable will only be equal to 1 for players who were on one of these 4 teams in 2001. League size or the difference between leagues in size is not considered for this research,

³ Information on changing divisional, playoff, and league structures and expansion franchises of MLB found on http://www.baseball-reference.com/

Descriptive Statistics

                                                  1990      2000      2010     Total

Real Salary adjusted to 2011 in thousands (AdjSal1000)
  mean                                            1732      3925      4355      3245
  maximum                                         6277     27962     32000     32000
  minimum                                          165       254       414       165
  range                                           6112     27707     31586     31834
  95th percentile                                 5231     12604     15000     12167
  Std. Deviation                                  1635      4271      5263      4093

Home Runs (HR)
  mean                                            10.8      16.0      12.9      13.4
  maximum                                           51        50        54        54
  minimum                                            0         0         0         0
  range                                             51        50        54        54
  95th percentile                                   28        39        31        33
  Std. Deviation                                   8.9      11.2       9.5      10.2

Batting Average (BA)
  mean                                           0.266     0.279     0.265     0.270
  maximum                                        0.340     0.372     0.359     0.372
  minimum                                        0.192     0.200     0.146     0.146
  range                                          0.147     0.172     0.213     0.226
  95th percentile                                0.309     0.333     0.312     0.321
  Std. Deviation                                 0.028     0.031     0.029     0.030

Slugging Percentage (SLG)
  mean                                           0.401     0.457     0.419     0.427
  maximum                                        0.592     0.746     0.633     0.746
  minimum                                        0.254     0.253     0.208     0.208
  range                                          0.338     0.493     0.425     0.538
  95th percentile                                0.523     0.606     0.534     0.570
  Std. Deviation                                 0.065     0.085     0.071     0.078

Runs Scored (R)
  mean                                            56.3      66.3      56.7      60.0
  maximum                                          119       152       115       152
  minimum                                           11        17        12        11
  range                                            108       135       103       141
  95th percentile                                   96       114       100       106
  Std. Deviation                                  22.4      28.8      24.6      26.0

Runs Batted In (RBI)
  mean                                            53.2      64.1      54.2      57.5
  maximum                                          132       147       126       147
  minimum                                           12        11        13        11
  range                                            120       136       113       136
  95th percentile                                   95       121       103       111
  Std. Deviation                                  23.4      31.3      26.0      27.8

Stolen Bases (SB)
  mean                                            11.0       8.0       8.2       8.9
  maximum                                           77        62        68        77
  minimum                                            0         0         0         0
  range                                             77        62        68        77
  95th percentile                                   36        28        32        32
  Std. Deviation                                  12.7       9.9      10.5      11.0

Observations (N)                                   245       304       305       854

though it should be noted that the amount of teams in the American League could impact the American League dummy variable meant to control for the designated hitter in the American League.

The descriptive statistics table for the observed offensive player statistics appears

above. It includes the mean, maximum, minimum, range, value of 95th percentile,

and the standard deviation for salary, home runs, batting average, slugging percentage, runs

scored, runs batted in, and stolen bases for each era and the eras combined. The 95th percentile

statistic is included to demonstrate what the top performers of a single statistic were

accomplishing, rather than just a maximum that represents a single player who could have

been an outlier from the rest of the data. Descriptive statistics of the other variables in my

model are not reported because those variables do not reflect an individual’s

performance, but rather the era or the team the player happened to be on the year after

these statistics were recorded. The data is broken up into three columns based on the year of

the observation, which represent the three different eras being compared, and a fourth column

with all observations combined. The data for salary is from a year after the year listed at the

top of each column, so 1991, 2001, and 2011 respectively. From the first three columns, the

changing trends in offensive statistics can be observed throughout the beginning and end of

the steroid era.
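The six summary measures in the table can be computed for any one statistic and season along the lines of the sketch below. It uses only the Python standard library; the sample values are hypothetical, and the percentile convention (linear interpolation between closest ranks) and the use of the sample standard deviation are assumptions, since the conventions used for the table are not stated.

```python
import math
import statistics


def percentile(values, pct):
    """Percentile by linear interpolation between closest ranks (one of several conventions)."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return float(ordered[0])
    position = (len(ordered) - 1) * pct / 100.0
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def describe(values):
    """Return the six summary measures reported in the descriptive statistics table."""
    return {
        "mean": statistics.mean(values),
        "maximum": max(values),
        "minimum": min(values),
        "range": max(values) - min(values),
        "p95": percentile(values, 95),
        "std_dev": statistics.stdev(values),  # sample standard deviation (n - 1)
    }


# Hypothetical home run totals, not drawn from the paper's data.
sample_hr = [0, 4, 9, 12, 15, 18, 22, 27, 31, 40]
summary = describe(sample_hr)
print(summary)
```

Reporting the 95th percentile alongside the maximum, as the table does, guards against a single extreme season driving the comparison between years.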

Expanding upon the hypothesis of this project, the peak of the steroid era should see

increased offensive performance, not only reflected in increased home runs, but also in the

other offensive statistics compared to the other two eras. The one variable that is not expected

to increase with steroid usage is stolen bases. The ability to steal a base is not necessarily

reflected in muscle growth, the result of steroid usage. A theoretical prediction can be made

that stolen bases were actually less abundant during the steroid era than before or after the

steroid era because the perceived importance of being a power hitter was much greater than

being a quick base runner during the steroid era, and players might choose to take steroids and

gain muscle at the cost of speed on the base paths. There are noticeably fewer

observations in 1990 (245) than in 2000 (304) and 2010 (305), but this can be explained by

the fact there were fewer players in 1990 than in 2000 or 2010 because there were four fewer

teams in MLB.

Even when adjusted for inflation, salary appears to be constantly growing as time

moves forward. While there is not a substantial difference in the mean, maximum, and 95th

percentile of salary between 2000 and 2010, salary in 1990 was significantly lower. The higher salary

numbers in 2000 and 2010 could be a result of the 1994 players’ strike, as players’ salary

demands might have been taken more seriously by the league and teams for fears of another

strike.

For home runs, there is a clear difference in the three seasons observed. While the

highest totals from each year are similar, the mean home run total in 2000 (16.0) is over 3

home runs greater than the mean total in 2010 (12.9) and over 5 home runs greater than the

mean total in 1990 (10.8). 2000 also was the only year where the mean value was greater than

the mean value of all observations (13.4). The 95th percentile for home runs further proves

home runs were much more prevalent in 2000 than in the other years. A player in the top 5%

of home runs hit in 1990 and 2010 would need to hit 11 and 8 more home runs, respectively,

in order to be in the top 5% of home run hitters in 2000. From the descriptive statistics, it is

clear that home runs were much more abundant in 2000 than in the other years.

Batting average, slugging percentage, runs scored and runs batted in have the highest

mean, maximum, and 95th percentile in 2000 by a significant margin. It is especially

staggering to observe how much greater the 95th percentile is in 2000 than in the other years.

The 95th percentile for slugging percentage is over .07 higher in 2000 than in 2010 and over

.08 higher than in 1990. These numbers may seem small, but in practice a difference of .07 is large

for slugging percentage. It is also worth noting the higher

totals of runs scored in 2000. The maximum value in 2000 of 152 is much higher than the

maximum totals of 119 and 115 in 1990 and 2010. Based on the means obtained, it can also

be inferred that the average player was scoring about 10 more runs in 2000 than in 1990 and

2010. The abnormally large number of runs scored in 2000 represents the surge of offense in

the steroid era, especially in the late 90’s and early 2000’s, as batters were creating more

opportunities for themselves and their teammates to score.

Unlike the other batting statistics that were highest in 2000, stolen bases had their

lowest mean value, maximum value, and 95th percentile in 2000.

Interestingly, stolen bases appeared to be most prevalent in 1990, as 1990 had the

highest mean, maximum, and 95th percentile stolen bases. This can be explained by the same

mechanism that made the batting statistics highest in 2000. Steroids were used without

regulation during the steroid era, and players tended to sacrifice speed for power in years such

as 2000. In 1990, steroids were arguably not nearly as prevalent as they were in 2000. Players

were not as focused on gaining muscle and hitting for power in 1990, and rather directed more

attention to other aspects of the game.

To define the steroid era in baseball as a period when hitting and hitting for power were

at their peak would require a demonstration of decreased hitting statistics for the periods

surrounding this time. Descriptive statistics obtained from the years 1990, 2000, and 2010

show that hitting and hitting for power were indeed at their peak in 2000, the steroid era. It would

be imprudent to rely on only one aspect of the game to show changing eras, which is why stolen

bases, a measure of speed, are included in the data. The plunge in stolen bases from 1990 to

2000 and subsequent increase ten years later in 2010 further prove the steroid era was a very

unique time in baseball’s history when the dominant aspect of the game teams, players, and

fans alike cared for was the home run, while other parts of the game were forgotten.

IV. RESULTS

The statistical program Stata was used for all results and manipulation of data in this research.

After observing the effects of including the key independent variables (HR, HRpre,

and HRpost) squared, it was clear that none of these squared terms significantly influenced

salary, so they were not used or considered in any of the following regressions.

The original linear regression from the base model above was run first. The adjusted R² for

this model is .403, meaning that roughly 40.3% of the variation of the dependent variable is explained

by this model. As expected, there are 854 observations between the three years being used. To

test for heteroskedasticity, a Breusch-Pagan test was implemented. After estimating the

model, I obtained the fitted values and from them the residuals. I used the

squared residuals as the dependent variable against the independent variables

from the base model for the Breusch-Pagan test. The test results are as follows:

H0: Homoskedastic Errors

HA: Heteroskedastic Errors

F(16, 837) = 6.53, Prob > F = 0.0000

Therefore the null hypothesis is rejected, as there are severe signs of heteroskedasticity.
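The F-version of the Breusch-Pagan test described above can be written out as follows. Here û denotes the residuals from the base model, x1 through xk its k = 16 regressors, and R²(û²) the R² of the auxiliary regression. This notation is introduced here for illustration, and the implied auxiliary R² is backed out from the reported F statistic rather than taken from the regression output:

```latex
\hat{u}_i^2 = \delta_0 + \delta_1 x_{i1} + \dots + \delta_k x_{ik} + v_i,
\qquad
F = \frac{R^2_{\hat{u}^2} / k}{\left(1 - R^2_{\hat{u}^2}\right) / (n - k - 1)}

n = 854,\ k = 16 \ \Rightarrow\ F \sim F(16,\ 837),
\qquad
F = 6.53 \ \Rightarrow\ R^2_{\hat{u}^2} = \frac{kF}{kF + (n - k - 1)} = \frac{104.48}{941.48} \approx 0.111
```

The degrees of freedom match the reported F(16, 837): 16 regressors in the auxiliary regression and 854 - 16 - 1 = 837 residual degrees of freedom.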

Therefore, robust standard errors are reported in all subsequent models. Variables

that immediately proved to be very insignificant were DivWin5y, LgWin5y, NewTeam,

and SB, none of which were significant even at the 50% level. WSWin5y comes closer to

significance, at about the 20% level, even though DivWin5y and LgWin5y do not. For the next

model, LgWin5y will be left in to observe whether its significance increases with the

removal of DivWin5y. Hence, a team’s recent success is controlled for with LgWin5y and

WSWin5y. The variables omitted in the next model were DivWin5y, NewTeam, and

SB, as they do not appear to have a significant impact on salary.

There were multiple variables that experienced multicollinearity, as observed in

the test for multicollinearity section below. The key independent variable, home runs,

experienced multicollinearity, but this is expected with the interactive home run terms.

Slugging percentage and runs scored both experienced multicollinearity. Since slugging

percentage is a term that both reflects power numbers and batting average, it is taken out

of the model because power numbers, HR, and batting average are controlled for in the

equation. Runs scored and runs batted in also experienced multicollinearity, and this is

to be expected, as they are somewhat similar statistics. Since runs batted in was more

significant, and is more directly tied to power numbers than runs scored, it was

left in while runs scored was omitted. Therefore the next model observed is as follows:

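$$\text{AdjSal1000} = \beta_0 + \beta_1 \text{HR} + \beta_2 \text{HRpre} + \beta_3 \text{HRpost} + \beta_4 \text{LgWin5y} + \beta_5 \text{WSWin5y} + \beta_6 \text{AL} + \beta_7 \text{TeamAttby1000} + \beta_8 \text{BA*100} + \beta_9 \text{RBI} + \beta_{10} \text{PreStEra} + \beta_{11} \text{PostStEra}$$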

The adjusted R² for this second model, .397, is lower than for the base model. Runs batted in

still shows signs of multicollinearity, as its VIF = 5.01. In the subsequent model, RBI will be

replaced with R to see if runs scored fits better. LgWin5y appears to have become far less

significant in this model, even after DivWin5y is removed; therefore, LgWin5y is

removed and WSWin5y becomes the only control for a team’s success. The third model is

as follows:

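$$\text{AdjSal1000} = \beta_0 + \beta_1 \text{HR} + \beta_2 \text{HRpre} + \beta_3 \text{HRpost} + \beta_4 \text{WSWin5y} + \beta_5 \text{AL} + \beta_6 \text{TeamAttby1000} + \beta_7 \text{BA*100} + \beta_8 \text{R} + \beta_9 \text{PreStEra} + \beta_{10} \text{PostStEra}$$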
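For context on the VIF of 5.01 reported for RBI in the second model: the variance inflation factor for regressor j is defined from R_j², the R² obtained by regressing that regressor on all of the other regressors. Under that standard definition, a VIF of 5.01 corresponds to roughly 80% of the variation in RBI being explained by the other independent variables:

```latex
\text{VIF}_j = \frac{1}{1 - R_j^2}
\qquad \Longrightarrow \qquad
R_j^2 = 1 - \frac{1}{\text{VIF}_j} = 1 - \frac{1}{5.01} \approx 0.80
```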

The adjusted R² for this third model, .402, is higher than for the second model.

There is not an issue of multicollinearity in this model. All variables are significant at the 10%

level except for the interactive variable for home runs after the steroid era, HRpost, which is

significant at the 25% level, and my dummy variables for each era, PreStEra and PostStEra,

which are both significant at the 20% level.

The next attempt is to find the best lin-log model, with the key independent variables

logged, to see if it fits the data better. The fourth model:

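$$\text{AdjSal1000} = \beta_0 + \beta_1 \text{ln\_HR} + \beta_2 \text{ln\_HRpre} + \beta_3 \text{ln\_HRpost} + \beta_4 \text{DivWin5y} + \beta_5 \text{LgWin5y} + \beta_6 \text{WSWin5y} + \beta_7 \text{AL} + \beta_8 \text{TeamAttby1000} + \beta_9 \text{NewTeam} + \beta_{10} \text{BA*100} + \beta_{11} \text{SLG*100} + \beta_{12} \text{R} + \beta_{13} \text{RBI} + \beta_{14} \text{SB} + \beta_{15} \text{PreStEra} + \beta_{16} \text{PostStEra}$$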

The adjusted R² for this model is .388, lower than any of the models that do not use the

natural log of home runs. The key independent variable ln_HR is not significant in this model.

Other highly insignificant variables were DivWin5y, LgWin5y, NewTeam, BA*100, SB, and

PostStEra, therefore they were all removed. RBI experienced multicollinearity, so it was also

removed. The following was the fifth model estimated:

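$$\text{AdjSal1000} = \beta_0 + \beta_1 \text{ln\_HR} + \beta_2 \text{ln\_HRpre} + \beta_3 \text{ln\_HRpost} + \beta_4 \text{WSWin5y} + \beta_5 \text{AL} + \beta_6 \text{TeamAttby1000} + \beta_7 \text{SLG*100} + \beta_8 \text{R} + \beta_9 \text{PreStEra}$$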

All variables became much more significant in this model, including ln_HR. But the adjusted

R² in this model, .378, is lower than that of the best model without the natural log of home

runs; therefore, the third model is the best fit among the models that use AdjSal1000 as the

dependent variable.

The next step is to find the best model for when the natural log of adjusted salary in

thousands of dollars is the dependent variable. The sixth model used is as follows:

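$$\text{ln\_AdjSal1000} = \beta_0 + \beta_1 \text{HR} + \beta_2 \text{HRpre} + \beta_3 \text{HRpost} + \beta_4 \text{DivWin5y} + \beta_5 \text{LgWin5y} + \beta_6 \text{WSWin5y} + \beta_7 \text{AL} + \beta_8 \text{TeamAttby1000} + \beta_9 \text{NewTeam} + \beta_{10} \text{BA*100} + \beta_{11} \text{SLG*100} + \beta_{12} \text{R} + \beta_{13} \text{RBI} + \beta_{14} \text{SB} + \beta_{15} \text{PreStEra} + \beta_{16} \text{PostStEra}$$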

This model explains 38.8% of the variation of the dependent variable. Upon observing the

results of this output, there were a few very insignificant variables. Once again, DivWin5y

and LgWin5y were both insignificant and are removed. The NewTeam variable is also

removed because it is insignificant.

SLG, RBI, and R all show signs of multicollinearity. For the same reasons as in an

earlier model, R will be left in, as removing RBI might fix this problem. SLG is also removed.

Therefore the seventh model observed is as follows:

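$$\text{ln\_AdjSal1000} = \beta_0 + \beta_1 \text{HR} + \beta_2 \text{HRpre} + \beta_3 \text{HRpost} + \beta_4 \text{WSWin5y} + \beta_5 \text{AL} + \beta_6 \text{TeamAttby1000} + \beta_7 \text{BA*100} + \beta_8 \text{R} + \beta_9 \text{SB} + \beta_{10} \text{PreStEra} + \beta_{11} \text{PostStEra}$$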

The adjusted R² of this model is .379. There are no signs of multicollinearity, but SB and the

dummy variable PreStEra are both highly insignificant, so they are removed. The interactive

term for home runs after the steroid era, HRpost, is insignificant in this model, but it will be

used in the next model to see if it will have a greater significance with these two variables

eliminated. The eighth model is as follows:

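$$\text{ln\_AdjSal1000} = \beta_0 + \beta_1 \text{HR} + \beta_2 \text{HRpre} + \beta_3 \text{HRpost} + \beta_4 \text{WSWin5y} + \beta_5 \text{AL} + \beta_6 \text{TeamAttby1000} + \beta_7 \text{BA*100} + \beta_8 \text{R} + \beta_9 \text{PostStEra}$$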

The adjusted R² for this model is .381. Every variable is significant in this model besides

HRpost; therefore, it is removed. Since the dummy variable for the post-steroid era will still be in

the equation, there is a control for the level of salary after the steroid era, but the

impact of home runs on salary is no longer allowed to differ for the post-steroid era. The ninth

model is as follows:

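$$\text{ln\_AdjSal1000} = \beta_0 + \beta_1 \text{HR} + \beta_2 \text{HRpre} + \beta_3 \text{WSWin5y} + \beta_4 \text{AL} + \beta_5 \text{TeamAttby1000} + \beta_6 \text{BA*100} + \beta_7 \text{R} + \beta_8 \text{PostStEra}$$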

The adjusted R² on this model is .381. All of the variables are significant at the 10% level,

except for WSWin5y, which is significant at the 11.5% level. This is the best fit log-lin model.
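Because the dependent variable in the log-lin model is the natural log of salary, its home run coefficients read as approximate percentage effects. The sketch below shows the exact conversion, 100 * (exp(β * Δ) - 1), for a change of Δ home runs. The coefficient values are hypothetical placeholders chosen only to illustrate the arithmetic; they are not estimates from this paper.

```python
import math


def pct_change_from_loglin(beta, delta_x=1.0):
    """Exact percent change in Y implied by a log-lin coefficient for a change of delta_x in X."""
    return 100.0 * (math.exp(beta * delta_x) - 1.0)


# Hypothetical coefficients for illustration only; not estimates from this paper.
beta_hr = 0.03       # assumed coefficient on HR (slope in 2000 and 2010)
beta_hrpre = -0.01   # assumed coefficient on HRpre (shift in the 1990 slope)

# Implied effect of 10 additional home runs on the following season's salary.
effect_2000 = pct_change_from_loglin(beta_hr, 10)
effect_1990 = pct_change_from_loglin(beta_hr + beta_hrpre, 10)
print(f"2000: {effect_2000:.1f}%  1990: {effect_1990:.1f}%")
```

With HRpost dropped from the ninth model, the 2010 slope is the same as the 2000 slope, so only the 1990 season gets its own home run effect in this specification.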

The final step is analyzing what the best fit of the data is when the natural log of salary

and home runs is taken. The tenth model analyzed is as follows:

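$$\text{ln\_AdjSal1000} = \beta_0 + \beta_1 \text{ln\_HR} + \beta_2 \text{ln\_HRpre} + \beta_3 \text{ln\_HRpost} + \beta_4 \text{DivWin5y} + \beta_5 \text{LgWin5y} + \beta_6 \text{WSWin5y} + \beta_7 \text{AL} + \beta_8 \text{TeamAttby1000} + \beta_9 \text{NewTeam} + \beta_{10} \text{BA*100} + \beta_{11} \text{SLG*100} + \beta_{12} \text{R} + \beta_{13} \text{RBI} + \beta_{14} \text{SB} + \beta_{15} \text{PreStEra} + \beta_{16} \text{PostStEra}$$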

The adjusted R² for this model is .383. The variables that were not significant were ln_HRpost,

DivWin5y, LgWin5y, and PreStEra, and thus they will be removed. The variables that experienced

multicollinearity were the three HR terms and the two dummy variables that describe era, but

only PreStEra will be removed; PostStEra stays in the model as

the only control for the post-steroid era once ln_HRpost is removed. Also, SLG*100 and

RBI showed signs of multicollinearity, so they are both removed. The eleventh equation

is as follows:

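$$\text{ln\_AdjSal1000} = \beta_0 + \beta_1 \text{ln\_HR} + \beta_2 \text{ln\_HRpre} + \beta_3 \text{WSWin5y} + \beta_4 \text{AL} + \beta_5 \text{TeamAttby1000} + \beta_6 \text{NewTeam} + \beta_7 \text{BA*100} + \beta_8 \text{R} + \beta_9 \text{SB} + \beta_{10} \text{PostStEra}$$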

In this model, NewTeam and SB are not significant and are therefore removed. The

following model represents the best fit log-log model:

ln_AdjSal1000 = β0(Constant) + β1ln_HR + β2ln_HRpre + β3WSWin5y + β4AL + β5TeamAttby1000 + β6BA*100 + β7R + β8PostStEra

The adjusted R2 is .370, which is lower than the adjusted R2 for the best-fit log-lin model, the

ninth model; the best log-lin model therefore describes the data better than the best log-log

model.

An MWD test is required to see whether it is best to model salary linearly

or to take the natural log of salary. The two equations used for this test are the

best fit linear model, model 3, and the best fit log-lin model, model 9. From the

predicted values for AdjSal1000 in the best fit linear model (Yhat) and the predicted values of

ln_AdjSal1000 in the best fit log-lin model (lnYhat), the values Z1 (=ln(Yhat) - lnYhat) and Z2

(=e^(lnYhat) - Yhat) were created. The hypothesis for which functional form fits the data best

is as follows:

H0: linear model is best

HA: log-lin model is best

When the best-fit linear model is regressed with Z1 added, Z1 is statistically significant at

the .1% level, therefore we reject the null hypothesis that the linear model is the best fit for

the data. When the best-fit log-lin model is regressed with Z2 added, Z2 shows no sign of statistical

significance, therefore we fail to reject the hypothesis that the log-lin model is the

best fit model. It can thus be concluded that the log-lin regression is the best-fit model for the

data:

ln_AdjSal1000 = β0(Constant) + β1HR + β2HRpre + β3WSWin5y + β4AL + β5TeamAttby1000 + β6BA*100 + β7R + β8PostStEra
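
The two auxiliary terms used in the MWD test can be built directly from the fitted values of the competing models. The sketch below follows the standard MWD construction, Z1 = ln(Yhat) - lnYhat and Z2 = e^(lnYhat) - Yhat. The function name mwd_auxiliary_terms and the sample numbers are hypothetical. The paper does not say how non-positive fitted values from the linear model were handled, so this sketch simply raises an error in that case.

```python
import math
from typing import List, Sequence, Tuple


def mwd_auxiliary_terms(
    y_hat: Sequence[float], ln_y_hat: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Build the two MWD auxiliary regressors.

    y_hat    : fitted values from the linear model (salary in levels)
    ln_y_hat : fitted values from the log-lin model (log of salary)
    """
    if len(y_hat) != len(ln_y_hat):
        raise ValueError("both fitted-value series must have the same length")
    if any(v <= 0 for v in y_hat):
        # ln(Yhat) is undefined for non-positive fitted values.
        raise ValueError("linear fitted values must be positive to take logs")
    z1 = [math.log(y) - ln_y for y, ln_y in zip(y_hat, ln_y_hat)]
    z2 = [math.exp(ln_y) - y for y, ln_y in zip(y_hat, ln_y_hat)]
    return z1, z2


# Made-up fitted values (thousands of dollars), for illustration only.
z1, z2 = mwd_auxiliary_terms([100.0, 200.0], [math.log(100.0), math.log(150.0)])
print(z1, z2)
```

Z1 is then added as a regressor to the linear model and Z2 to the log-lin model, and the significance of each added term is judged with an ordinary t test.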

This best-fit log-lin model of player salary controls for a player’s home runs from the previous

season, home runs hit before the steroid era, each World Series a

player’s current team had won in the past 5 years, the league the player is in, the team’s home

attendance, the player’s batting average, runs scored, and whether the season was after the steroid era.

These variables all had the predicted sign. Home runs, home runs before the

steroid era, team attendance, runs scored, and the post steroid era dummy variable were all

significant at the 1% level, the American League dummy variable and batting average were

significant at the 5% level, and World Series Championships from the previous 5 seasons was

significant at the 20% level.

With these variables controlled for, the impact of a change in each variable in the

model can be interpreted. Each home run a player hits during the steroid era increases his

salary by approximately 3.53% for the next season, with all other variables remaining constant.

Theoretically, this means the player who hit the maximum number of home runs in 2000 (50)

had a salary roughly 176.5% (50 x 3.53%) higher than a player who hit the minimum number of home runs

(0) for that year. For every home run a player hit in 1990, salary increases by 2.27% for the

following season, with all other variables held constant. Even though the coefficient on the

interactive term for home runs before the steroid era is negative, the impact on salary of a

MLB Salary Function Measured in Thousands of Dollars Adjusted for Inflation

               Predicted  (1)                    (3)                    (6)                    (9)
Variable       Sign       Base Linear            Best-fit Linear        Base Log-Lin           Best-fit Model (Log-Lin)
Constant                  -5501.9*** (1335.106)  -5909.9*** (1253.243)  4.950*** (0.355)       4.828*** (0.326)
HR             +          148.4*** (47.62)       155.0*** (24.648)      0.0289** (0.012)       0.0353*** (0.004)
HRpre          -          -122.4*** (25.193)     -125.1*** (24.960)     -0.0107* (0.006)       -0.0126*** (0.004)
HRpost         -          47.52 (38.008)         45.45 (38.047)         0.00553 (0.007)
DivWin5y       +          99.11 (180.067)                               0.0000510 (0.042)
LgWin5y        +          -113.0 (363.232)                              -0.0252 (0.088)
WSWin5y        +          476.5 (374.796)        411.3* (226.270)       0.109 (0.094)          0.0819 (0.052)
AL             +          624.5*** (237.211)     647.1*** (236.579)     0.134** (0.066)        0.137** (0.064)
TeamAttby1000  +          1.193*** (0.201)       1.253*** (0.172)       0.000330*** (0.000)    0.000324*** (0.000)
NewTeam        -          286.3 (592.384)                               0.114 (0.163)
BAper          +          125.8* (70.506)        83.97* (44.413)        0.0420* (0.022)        0.0252** (0.012)
SLGper         +          -43.41 (46.879)                               -0.0172 (0.013)
R              +          17.87* (10.762)        32.32*** (6.468)       0.00659** (0.003)      0.0128*** (0.002)
RBI            +          22.39* (12.207)                               0.0101*** (0.003)
SB             +          9.459 (14.449)                                0.00397 (0.004)
PreS_Era       -          347.6 (361.608)        475.2 (342.135)        -0.0603 (0.126)
PostS_Era      +          666.9 (512.409)        693.5 (502.386)        0.191 (0.141)          0.263*** (0.076)

N. of Cases               854                    854                    854                    854
Adj R-Squared             0.403                  0.402                  0.388                  0.381
SEE                       3391.7                 3393.4                 0.933                  0.938
F-ratio                   26.61                  41.60                  49.61                  88.71
SSR                       9.62864e+09            9.70719e+09            728.2                  743.1

Robust standard errors in parentheses. * sig. at 10%, ** sig. at 5%, *** sig. at 1%.
Table includes base lin-lin model, best lin-lin model, base log-lin model, and best log-lin model.

home run hit before the steroid era is still positive, as this negative coefficient only reflects the

difference in impact on salary between a home run hit before the steroid era and a home run hit

during the steroid era (.0353 - .0126 = .0227). Another way to phrase this is that the impact on

salary of hitting a home run before the steroid era is 1.26 percentage points less than the impact on salary of

hitting a home run in the steroid era, with all other variables held constant. There is no

significant difference between hitting a home run during the steroid era and after the steroid era, as

the home run post steroid era interactive variable was not statistically significant

in the log-lin model.
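
In a log-lin model, reading a coefficient as a percentage change (100 x coefficient) is an approximation that works well for small changes but drifts for large ones, since the exact change is 100 x (e^(coefficient x units) - 1). The sketch below compares the two readings for the home run coefficients reported above. The function names are hypothetical and the comparison is offered as an illustration, not as part of the original analysis.

```python
import math

B_HR = 0.0353       # HR coefficient, best-fit log-lin model
B_HR_PRE = -0.0126  # HRpre interaction coefficient


def approx_pct_change(beta: float, units: float = 1.0) -> float:
    """Linear approximation: 100 * beta * units."""
    return 100.0 * beta * units


def exact_pct_change(beta: float, units: float = 1.0) -> float:
    """Exact change implied by a log-lin model: 100 * (exp(beta * units) - 1)."""
    return 100.0 * (math.exp(beta * units) - 1.0)


# Effect of one home run before the steroid era: main effect plus interaction.
pre_era_slope = B_HR + B_HR_PRE

for label, beta, units in [
    ("1 HR, steroid era", B_HR, 1),
    ("1 HR, pre-steroid era", pre_era_slope, 1),
    ("50 HR, steroid era", B_HR, 50),
]:
    print(f"{label}: approx {approx_pct_change(beta, units):.2f}%, "
          f"exact {exact_pct_change(beta, units):.2f}%")
```

For a single home run the two readings are close. Across a 50 home run gap the exact figure is considerably larger than the linear approximation, so the 176.5% comparison is best read as a lower bound.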

The impacts of the prior-year statistics of the team a player is on when salary is measured

(1991, 2001, and 2011) are as follows: Each World Series a team has

won in the previous 5 seasons increases the salary of every hitter on the team by 8.19%, with

all other variables held constant. Baseball players who play in the American League have a

salary 13.7% higher than the salary of players who play in the National League, with all other

variables held constant. For every thousand-person increase in a team’s home attendance,

the salary of the hitters on that team increases by .0324% for the following season, with all other

variables held constant. This number may seem small, but if thought about practically, it still

has economic significance. Baseball stadiums can fit tens of thousands of spectators for each

game, and a team plays 81 home games each season. If a team is averaging a thousand more

spectators a game than another team, the team with more spectators per game pays its hitters

2.62% (81*.0324%) more than the team with fewer spectators pays its hitters, with all other

variables held constant. Since teams like the New York Yankees can sometimes average ten

thousand or more additional spectators per game than a smaller market team, it is clear that

team attendance has great economic significance for player salary.
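
The attendance arithmetic above can be written out explicitly. The sketch below assumes, as the 81-game calculation implies, that TeamAttby1000 is a team’s total season home attendance in thousands. The function name attendance_effect is hypothetical, and the exact-percentage column is an added illustration rather than a figure from the paper.

```python
import math

B_ATT = 0.000324  # TeamAttby1000 coefficient, best-fit log-lin model
HOME_GAMES = 81   # home games per season, as stated in the text


def attendance_effect(extra_fans_per_game: float) -> tuple:
    """Return (approximate %, exact %) salary gap between two teams.

    Assumption: TeamAttby1000 is total season home attendance in thousands,
    so a per-game gap is scaled by 81 home games and divided by 1000.
    """
    extra_thousands = extra_fans_per_game * HOME_GAMES / 1000.0
    log_diff = B_ATT * extra_thousands
    return 100.0 * log_diff, 100.0 * (math.exp(log_diff) - 1.0)


approx_1k, exact_1k = attendance_effect(1_000)
approx_10k, exact_10k = attendance_effect(10_000)
print(f"1,000 more per game:  approx {approx_1k:.2f}%, exact {exact_1k:.2f}%")
print(f"10,000 more per game: approx {approx_10k:.2f}%, exact {exact_10k:.2f}%")
```

The first case reproduces the 81*.0324% calculation, and the second shows how the gap scales for a team drawing ten thousand more spectators per game.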

The only other player statistics besides home runs that appear to influence salary are

batting average and runs scored. Since batting average, in this case, is measured as a

percentage (BA*100) of times a player gets a hit in an at-bat, it can be inferred that for every

one percentage point increase of the percentage of at-bats where a player gets a hit, their

salary increases by 2.52% for the following season, with all other variables held constant.

Every run a player scores increases their salary for the following season by 1.28%, with all

other variables held constant. The last variable of the best model is the dummy variable for

post steroid era. The interpretation of this variable is that, on average, hitters from the post-

steroid era were getting paid 26.3% more than hitters in the steroid era, with all other

variables held constant.

To see the practical application of this model, the predicted salaries in 1991, 2001, and

2011 of the MVPs from both the American and National Leagues in 1990, 2000, and 2010,

respectively, will be observed. The 1990 American League MVP was Rickey Henderson, who

happens to be baseball’s all-time stolen base leader. In the 1991 season, Henderson was on the

Oakland Athletics; in 1990 he had 28 home runs with a BA*100 of 32.52 and led both leagues with 119

runs scored, as seen in the Descriptive Statistics table. Henderson’s salary in 1991

adjusted for inflation to 2011 is predicted to be $7,829,877 when in actuality it was

$5,369,000. This is not a large difference, and could be explained by the fact there is no

control for speed in this model, and Henderson was one of the fastest players in baseball at

this time. The 1990 National League MVP was Barry Bonds, who happens to be the all-time

home run leader in baseball’s history. He played for the Pittsburgh Pirates in 1991, and in 1990 hit

33 home runs with a BA*100 of 30.06 and scored 104 runs. This model predicts Bonds’s

adjusted 1991 salary to be $4,148,335 when in actuality it was $3,799,600, a close estimation.
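
A prediction of this kind can be sketched by plugging a player’s prior-season statistics into the best-fit log-lin equation and exponentiating. The coefficients below are the rounded values from the results table. The function name predict_salary is hypothetical, HRpre is assumed to be home runs multiplied by a pre-steroid-era dummy, and no retransformation correction is applied. The attendance and World Series inputs for Henderson’s team are assumed values chosen for illustration, since the paper does not list them.

```python
import math

# Rounded coefficients of the best-fit log-lin model (model 9).
COEF = {
    "const": 4.828,
    "HR": 0.0353,
    "HRpre": -0.0126,
    "WSWin5y": 0.0819,
    "AL": 0.137,
    "TeamAttby1000": 0.000324,
    "BAper": 0.0252,
    "R": 0.0128,
    "PostStEra": 0.263,
}


def predict_salary(hr, ba_pct, runs, team_att_thousands, ws_wins_5y,
                   american_league, pre_era=False, post_era=False):
    """Predicted next-season salary in inflation-adjusted dollars."""
    ln_sal_thousands = (
        COEF["const"]
        + COEF["HR"] * hr
        + COEF["HRpre"] * (hr if pre_era else 0)  # interaction: HR x pre-era
        + COEF["WSWin5y"] * ws_wins_5y
        + COEF["AL"] * (1 if american_league else 0)
        + COEF["TeamAttby1000"] * team_att_thousands
        + COEF["BAper"] * ba_pct
        + COEF["R"] * runs
        + COEF["PostStEra"] * (1 if post_era else 0)
    )
    # The dependent variable is ln(salary in $1000s): exponentiate, then rescale.
    return 1000.0 * math.exp(ln_sal_thousands)


# Henderson's 1990 batting line is from the text. The attendance figure
# (2,900 thousand) and one World Series win in the prior 5 years are
# ASSUMED inputs for illustration, not values taken from the paper.
henderson = predict_salary(hr=28, ba_pct=32.52, runs=119,
                           team_att_thousands=2900, ws_wins_5y=1,
                           american_league=True, pre_era=True)
print(f"${henderson:,.0f}")
```

With these assumed team inputs the sketch lands close to the predicted figure quoted above for Henderson. Small gaps are expected because the table coefficients are rounded.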

In 2000, the American League MVP was Jason Giambi and the National League MVP was Jeff

Kent. Giambi played for Oakland in 2001, and in 2000 hit 43 home runs with a BA*100 of

33.33 and 108 runs scored. The model predicted Giambi to have an adjusted salary of

$10,147,180 in 2001 when it was actually $5,215,336. Giambi had yet to become an established

veteran hitter in MLB at this time, which could explain why he was not being paid as much

as the model predicts. In 2001, Jeff Kent played for the San Francisco Giants. In 2000, he hit

33 home runs with a BA*100 of 33.39 and 114 runs scored. His salary adjusted for inflation

was predicted to be $11,722,990 in 2001 when it was actually $7,626,000. This can be

explained by the fact Kent was bound to a contract he signed with the Giants, and was paid

this fixed amount of $7.6 million from 2000-2002, therefore his salary in 2001 did not have a

chance to increase because of his MVP season in 2000.

In 2010, the MVP from the American League was Josh Hamilton and the MVP from

the National League was Joey Votto. Hamilton played for the Texas Rangers in 2011, and in 2010 hit 32

home runs, led both leagues with a BA*100 of 35.9 and scored 95 runs. His salary is

predicted to be $10,831,960 in 2011 based on his performance in 2010, but his salary was

$8,750,000. Hamilton signed a new 2-year, $24 million contract with the Rangers after his

2010 MVP season, the bulk of which he was paid in the 2012 season. Since his contract

averaged $12 million a year, his salary is somewhat reflected in this model. Votto was signed

with the Cincinnati Reds for the 2011 season, and compiled 37 home runs and 106 runs

scored with a BA*100 of 32.36 in his 2010 MVP campaign. The model predicts Votto’s salary

to be $10,268,240 in 2011 when it was actually $7,410,655. Considering the fact Votto was

paid less than a million dollars for the 2010 season and signed a 3-year, $38 million

contract before the 2011 season, the model does a fairly good job predicting his salary. A

pitcher will sometimes be awarded the MVP, but it is a rare occurrence, and it did not happen

in any of these three years. It is interesting to point out that none of the MVPs in these three

years was the top home run hitter for that year.

The 3 players who led the league in home runs in 1990, 2000, and 2010 were Cecil

Fielder, Sammy Sosa, and Jose Bautista with 51, 50, and 54 home runs respectively. Fielder’s

predicted salary adjusted for inflation was $5,642,801 when it was actually $2,891,000 in

1991. This difference is explained by the fact Fielder had little success in MLB up to 1988,

and decided to play baseball in Japan in 1989. Therefore 1990 was his first season back in

MLB, when he was not yet an established player. It would be hard for a team to commit too

much money to a player who just had his first good season in MLB at the age of 26.

Sosa’s predicted adjusted 2001 salary is $15,659,430 from this model, and he was actually

paid $15,887,500. The model was extremely effective in predicting Sosa’s salary in 2001. In

2011, Bautista was paid $8 million, yet the model predicted his salary to be

$16,498,840, more than double the actual value. Bautista was a veteran baseball

player in 2010, as he had already completed 6 MLB seasons with little success. 2010 was his

first productive MLB season, and his salary jumped from $2 million in 2010 to $8 million in

2011 to $14 million in 2012.4 While it is clear that there are some challenges in predicting

MLB salary for a given year based solely on the statistics from the previous season, the best-

fit model from the data demonstrates some merit when predicting MLB salary.

V. CONCLUSION

The hypothesis of this research is proven to have some credibility based on the results

from the data. As expected, home runs hit had a positive and strong statistical significance

with salary in every era. At the 1% significance level, it can be concluded that the impact on salary of

hitting a home run before the steroid era was less than the impact of hitting a home run during

the steroid era. The part of the hypothesis that is not supported is that there is a

significant difference in the impact on salary of home runs hit during and after the steroid era, as the

interactive term for home runs after the steroid era showed no statistical significance in the

model.

4 Statistics, contracts, and MVPs from the predictions above are from http://www.baseball-reference.com/

The variables used in the model all controlled for important aspects of MLB salaries.

The values for a team’s success over the past 5 seasons were relevant because the teams that

win more have the ability and are more willing to pay for better players to continue the

success of the franchise. Even though World Series Championships over the past 5 seasons

was the only significant variable for team success in the final model, it still served an

important purpose to control for a team’s success. The American League variable proved to

be significant in describing the data, and it is surprising to see that when all other variables are

controlled for, players in the American League earn about 13.7% more than players in the

National League. Different from a team’s success, team attendance controls for the revenue of

a team, as ticket sales are an important source of income for every baseball team. It is not

surprising to see a statistically significant impact of team revenue, as represented by

attendance, on the salary teams give their players for the next season. The only offensive

player variables that proved to be statistically significant with salary besides home runs hit

were batting average and runs scored. While multicollinearity was expected with slugging

percentage for reasons mentioned earlier in this paper, it was surprising to see that RBI did

not have a significant impact on salary, and that runs scored proved to be more significant.

RBI is constantly used in baseball to measure success. It is one of the three statistics, along

with batting average and home runs, that are always seen next to a player’s name when a

player is up to bat, therefore one would imagine it to have a significant impact on salary.

Unfortunately, the control for speed, stolen bases, did not have a significant impact on salary.

Salary before the steroid era was controlled for with the home run interactive variable,

even though the dummy variable for before the steroid era did not have an impact on salary.

While there was no difference in the impact on salary of home runs hit during the steroid era and after the steroid era,

players after the steroid era were shown to have higher salaries as reflected in the

post steroid era dummy variable, a result that was expected.

There were a few variables that could have been considered as a control for different

aspects of salary, but were not available to include in this model. Such variables include a

player’s age or time of service in the MLB, a control for a player’s recognition, defensive

ability, position, and country of birth. Variables that could have been used to control for a

player’s recognition could include the number of MVPs, Silver Sluggers, and Gold Gloves a

player has won along with the number of All-Star games a player has been elected to. A

control for defense could be fielding percentage ((defensive attempts - errors)/defensive

attempts), but this is tricky, as it does not necessarily reflect the range a player can cover

when playing defense or a player’s ability to make outstanding defensive plays that an average

player would not be able to make. The Gold Glove award often reflects this ability to make

spectacular defensive plays, and a variable for Gold Glove Awards won could have a

significant impact on salary. The position a player generally plays could also serve as an

important control for salary as some positions on the field might have more value than others

when determining salary. A simplified way to look at this could be to have a dummy variable

if a player is an outfielder or a dummy variable if a player is an infielder. This idea could also

have some problems, as there are a number of baseball players who play multiple positions in

a given season. Country of birth could also be shown to have a significant impact on salary, as

oftentimes, especially recently, players from other countries such as the Dominican

Republic, Venezuela, and Cuba have become some of the most highly desired players in

baseball. An improvement of this research could include some of these variables when

analyzing player salary. Regardless, even without all of these controls, the best model still is

successful at roughly predicting player salary.

Because there are many different reasons why a player’s success in a season may not be

reflected in their salary the next season, it can be difficult to have one model that does a good

job at predicting salary. It would be ideal if a player signed a new contract or experienced

free-agency every year, but this oftentimes does not happen, even though there are many

free-agents after each season who are free to sign the best contract they are offered. Many

players are bound by long-term contracts, and even though there are bonuses a player can

obtain based on performing well, this fact can hamper the idea that salary is a fluid value that

changes every year. A way baseball has tried to fix this problem is through salary arbitration.

This is a process where a player can argue for a higher salary from his team based on his

performance from the previous season, but this is only something a player is eligible for from

their 3rd season (sometimes 2nd season) until their 6th season.5 Players who do not have the

opportunity for arbitration, therefore, are bound to their contracts, for better or worse. Even

with all of the limitations that inhibit changing salaries in baseball on a yearly basis, this

model does a fairly good job of explaining salary, with an adjusted R2 of .381 indicating that it explains roughly 38% of the variation in

player salary.

The steroid era changed the structure of baseball in ways that are still seen today. In

the late 90’s, hitting home runs became a glorified statistic that became the new measure of a

player’s success. The home run chase between Mark McGwire and Sammy Sosa in 1998 and

Barry Bonds’ run at Hank Aaron’s all-time home run record will go down as two of the most

memorable moments in baseball’s history. The excitement of the home run during the peak of

the steroid era of baseball may never be matched again. While baseball players still get caught

using PEDs today, it is generally accepted that the number of players using steroids in

baseball now is far lower than 15 years ago, yet the scars of the steroid era are still felt by

MLB as it becomes known that the heroes of a generation had an advantage over the other all-

time greats of the game.

5 Arbitration rules obtained from http://www.fangraphs.com/library/business/mlb-salary-arbitration-rules/

Sources

Data of player statistics and salary information: http://www.seanlahman.com/baseball-archive/statistics/

Supplemental statistics: http://www.baseball-reference.com/

Arbitration rules: http://www.fangraphs.com/library/business/mlb-salary-arbitration-rules/

Opinion of start and end of Steroid Era: http://espn.go.com/mlb/topics/_/page/the-steroids-era