What Innings Determine Total Wins

By: Payton Soicher and Kevin Mosier

“Winning the inning” is one of the most popular phrases among baseball players and coaches. Teams try to “win” every inning because the more innings you win in a game, the better your chance of winning the game. However, over a 162-game Major League Baseball season, it is impossible to win every inning. MLB teams try to figure out which players give them the best chance to win by studying batting averages, slugging and on-base percentages, runs batted in, and other advanced analytics. Our study asks whether managers and organizations are overlooking an important predictor of a team’s end-of-season record: which specific innings best predict that record?

The first thing we did was collect the 2015 MLB won-loss record of every team. We then found the number of runs each team scored in every inning and made each inning its own variable (Inning1, Inning2, Inning3, etc.), which shows how many runs each team’s offense put up in each inning during the season. We excluded extra innings because teams do not all play the same number of them, whereas every regulation game is scheduled for 9 innings. After collecting the runs scored for each frame, we found the number of runs each team gave up in innings 1-9 to show how many runs the defense allowed in each inning during the year. We then took each inning of offense and subtracted the runs the defense gave up in that same inning, giving each team a “net run” total for each inning. A positive net total means the team won that inning for the season, and a negative net total means it lost that inning on the year.
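As a minimal sketch of the net-run calculation described above, the snippet below subtracts runs allowed from runs scored inning by inning for a single team. The run totals and the `net_runs` helper are made up for illustration; they are not taken from the 2015 dataset.

```python
# Hypothetical season totals for one team in innings 1-9.
# These numbers are invented for illustration, not taken from the 2015 data.
runs_scored = [85, 116, 70, 75, 80, 72, 78, 66, 60]
runs_allowed = [80, 70, 74, 71, 65, 75, 70, 68, 55]


def net_runs(scored, allowed):
    """Return net runs (scored minus allowed) for each inning."""
    if len(scored) != len(allowed):
        raise ValueError("scored and allowed must cover the same innings")
    return [s - a for s, a in zip(scored, allowed)]


net = net_runs(runs_scored, runs_allowed)
for inning, value in enumerate(net, start=1):
    outcome = "won" if value > 0 else "lost" if value < 0 else "tied"
    print(f"Inning{inning}: {value:+d} ({outcome})")
```

Each team then contributes one row of nine net values (Inning1 through Inning9) plus its win total to the regression.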

The data that we used did not need any kind of transformation. We also chose not to eliminate any outlier data points from the model. For example, the Blue Jays scored a total of 116 runs in the second inning over the season, when the MLB average for the second inning was 70.63. Although that looks like an extremely high value, the Blue Jays were known to have one of the best offensive lineups in MLB, with their third, fourth, and fifth batters all hitting around 40 home runs on the season. We also looked at the residual plot for each inning, and they all looked random and scattered, which is what you would like to see for a regression analysis. Finally, we treat each inning variable as independent of the others, since runs, hits, outs, and other determining factors in baseball don’t carry over from one inning to the next.

Before running our model with each inning as its own variable, we first combined all of the innings into one “net” variable. We summed the net runs for innings 1 through 9 to see whether each team had a positive or negative run total on the year, and then ran a model of that total against the dependent variable, wins.

From the graph, the relationship between the X values (net runs) and the dependent variable (wins) looks linear, the variance looks constant, the points are independent, and the scatter plot shows mean zero with no trend.

The null and alternative hypotheses for this regression test are:

H0: β1 = 0
Ha: β1 ≠ 0
with α = .05

Net ANOVA Table

Looking at the ANOVA (Analysis of Variance) table, the F value of 102.36 is very high and gives a p-value of less than .0001, so we reject the null hypothesis and conclude that β1 ≠ 0. Because Net Runs is the only variable in the model, the p-value for the overall F test and the p-value for the Net Runs parameter estimate must be identical, and they are. From this simple one-variable linear regression, Net Runs for an entire season is a good predictor of wins at the end of the regular season.
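The reason the two p-values agree is that, with a single predictor, the overall F statistic equals the square of the slope's t statistic. The sketch below checks that identity on a small set of hypothetical (net runs, wins) pairs. The data are invented for illustration and will not reproduce the F value of 102.36 reported above.

```python
# Hypothetical (net runs, wins) pairs; not the actual 2015 values.
net_runs = [-120, -80, -45, -10, 0, 15, 40, 75, 110, 130]
wins = [64, 71, 74, 79, 82, 81, 86, 90, 95, 100]

n = len(wins)
mean_x = sum(net_runs) / n
mean_y = sum(wins) / n
sxx = sum((x - mean_x) ** 2 for x in net_runs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(net_runs, wins))

# Least-squares slope and intercept for wins = b0 + b1 * net_runs.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

fitted = [intercept + slope * x for x in net_runs]
sse = sum((y - f) ** 2 for y, f in zip(wins, fitted))  # error sum of squares
ssr = sum((f - mean_y) ** 2 for f in fitted)           # regression sum of squares

mse = sse / (n - 2)                   # error mean square, n - 2 degrees of freedom
f_stat = ssr / mse                    # overall F with 1 numerator degree of freedom
t_stat = slope / (mse / sxx) ** 0.5   # t statistic for the slope

print(f"F = {f_stat:.2f}, t^2 = {t_stat ** 2:.2f}")
```

The two printed values match up to floating-point rounding, which is why a one-variable model always shows the same p-value in the ANOVA table and in the parameter estimates.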

Although that is an interesting regression test, it is much coarser than the one we are most interested in. It does, however, establish that the relationship is worth analyzing further. After running the first model, we ran the more specific model with all nine innings as variables. The overall F test in the ANOVA output has the null and alternative hypotheses:

H0: β1 = β2 = β3 = β4 = β5 = β6 = β7 = β8 = β9 = 0
Ha: at least one β ≠ 0
with α = .05

In more practical terms, we wanted to see whether at least one inning variable has a slope that is significantly different from 0 at the 5% level.

Residual Plot of All 9 Innings

Looking at the residual plots for each of the innings, the points are randomly scattered around a mean of 0 with no increase in variance, all good signs for a linear regression.

Model With All 9 Innings

With all 9 variables in the model, we get an F p-value of less than .0001. We therefore reject the null hypothesis and conclude that at least one of the β's ≠ 0. The R-Square and Adjusted R-Square are both above .75, which says that the independent variables explain about 75% of the variation in wins, and an Adjusted R-Square of .75 is generally considered pretty good. We can also see from the full model above that, at the α = .05 level, the intercept, Inning1, and Inning5 are the only parameter estimates that are significant. According to this multivariable regression, the final model would be:

PredictedWins = 80.833 + .11651 (Inning1) + .19052 (Inning5)

You might also notice that every single parameter estimate is positive, which might give the impression that you can only get a win total higher than 80.83, but that is not true. Since inning values can be positive or negative, a team that loses both Inning1 and Inning5 for the season would have a predicted win total below 80.83 (which is approximately the mean number of wins across all MLB teams).

Keeping variables that are not significant makes for an unnecessarily large and weaker model, so we ran a few tests to find the subset of variables that are all significant predictors of wins. The first thing we looked at was Mallows’ C(p) criterion, which is a good indicator of which model is the best predictor of wins. We took the best model of each size, from one variable up to the model with all nine variables, and this was what the output looked like:

With the C(p) criterion, you pick the model with the smallest value, and the smallest value is 4.4689 (in model #4) with the variables Inning1, Inning2, Inning5, and Inning7. That same model has an R-Square of .7584, so with five fewer variables it explains the dependent variable, wins, about as well as the model with all nine variables in it, whose R-Square is also above .75.
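For readers who want to see how a C(p) value is computed, here is a minimal sketch using the common definition C(p) = SSE_p / MSE_full - n + 2p, where p counts the intercept. That convention is an assumption about how the software defines it, and the sums of squares below are hypothetical, so the results will not match the 4.4689 in the output.

```python
def mallows_cp(sse_subset, mse_full, n_obs, n_params):
    """Mallows' C(p) = SSE_p / MSE_full - n + 2p, where p counts the intercept."""
    return sse_subset / mse_full - n_obs + 2 * n_params


# Hypothetical sums of squares for 30 teams; not the values from the study.
n_obs = 30
sse_full = 600.0                      # SSE of the model with all nine innings
mse_full = sse_full / (n_obs - 10)    # nine slopes plus the intercept

cp_full = mallows_cp(sse_full, mse_full, n_obs, 10)  # full model
cp_four = mallows_cp(640.0, mse_full, n_obs, 5)      # four innings plus intercept

print(f"full model C(p) = {cp_full:.3f}, four-inning model C(p) = {cp_four:.3f}")
```

Under this definition the full model always has C(p) equal to its own parameter count, so a smaller model wins only when dropping variables costs little in SSE.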

To support our initial finding that the best model contains Inning1, Inning2, Inning5, and Inning7, our next step was a forward selection process to see whether that method would give us the same model. Forward selection tests each variable that is not yet in the model, adds the one with the highest |t| value provided it meets the significance level (α = .05), and then repeats the entire process with the remaining variables until none of them meets the significance level. When we ran the selection procedure, this is what we got:

As you can see, the variables chosen by the forward selection process match the model chosen with Mallows’ C(p). We therefore concluded that the model with Inning1, Inning2, Inning5, and Inning7 is the best model to predict wins in a regular season. These were the ANOVA and parameter estimates from that model:

Model With Inning1 Inning2 Inning5 Inning7
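The forward selection procedure described earlier can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the statistical package's implementation. It ranks candidates by the partial F statistic, which equals t squared for a single added variable, so the ranking matches ranking by |t|. It also uses a fixed `f_to_enter` cutoff of 4.0 as a rough stand-in for the α = .05 test, since the exact critical value depends on the degrees of freedom. The helper names and the eight-team dataset are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def sse(columns, y):
    """Residual sum of squares for OLS of y on an intercept plus the given columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[j] * y[i] for i, row in enumerate(design)) for j in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    return sum((y[i] - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for i, row in enumerate(design))


def forward_select(predictors, y, f_to_enter=4.0):
    """Greedy forward selection: add the best candidate while its partial F >= cutoff."""
    chosen = []
    current_sse = sse([], y)
    while len(chosen) < len(predictors):
        best = None
        for name in predictors:
            if name in chosen:
                continue
            trial = sse([predictors[v] for v in chosen + [name]], y)
            df = len(y) - (len(chosen) + 2)  # observations minus fitted parameters
            f_stat = (current_sse - trial) / (trial / df)
            if best is None or f_stat > best[1]:
                best = (name, f_stat, trial)
        if best[1] < f_to_enter:
            break
        chosen.append(best[0])
        current_sse = best[2]
    return chosen


# Hypothetical net runs for eight teams; not the 2015 data.
predictors = {
    "Inning1": [-20, -15, -5, 0, 5, 10, 15, 20],
    "Inning2": [2, -1, 3, 0, -2, 1, -3, 0],
    "Inning5": [1, -1, -1, 1, 1, -1, -1, 1],
}
wins = [72, 73, 79, 81, 83, 87, 88, 91]

selected = forward_select(predictors, wins)
print(selected)
```

In this toy dataset the wins were constructed to track the hypothetical Inning1 column closely, so that variable enters first. Whether anything else enters depends on the cutoff chosen.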

The final model to predict the number of wins at the end of the regular season is

PredictedWins = 80.833 + .12322 (Inning1) + .15119 (Inning2) + .22596 (Inning5) + .18543 (Inning7)

For example, if you had +10 for the first inning, -3 for the second inning, +8 for the fifth inning, and +10 for the seventh inning, your predicted win total would be

PredictedWins = 80.833 + .12322 (10) + .15119 (−3) + .22596 (8) + .18543 (10) = 85.27361 wins
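The same calculation can be wrapped in a small function. This is only a convenience sketch: the coefficients are copied from the final model above, while the function name and the dictionary layout are illustrative choices.

```python
# Intercept and coefficients copied from the final four-inning model in the text.
INTERCEPT = 80.833
COEFFICIENTS = {
    "Inning1": 0.12322,
    "Inning2": 0.15119,
    "Inning5": 0.22596,
    "Inning7": 0.18543,
}


def predicted_wins(net_runs):
    """Predicted wins from season net runs per inning; missing innings count as 0."""
    return INTERCEPT + sum(
        coef * net_runs.get(name, 0) for name, coef in COEFFICIENTS.items()
    )


example = {"Inning1": 10, "Inning2": -3, "Inning5": 8, "Inning7": 10}
print(round(predicted_wins(example), 5))  # 85.27361
```

A team that breaks even in all four innings gets the intercept, 80.833, and each net run in the fifth inning moves the prediction by about 0.23 wins.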

Looking at the first 20 (out of 30) residuals, they are random and show no real trend. About half of the residuals are above the predicted value and half are below. As in typical regression plots, they flare out towards the ends and are closer to the predicted values near the centroid.

Having determined that this model is the best choice for predicting wins at the end of the regular season, a common question arises: why those innings? As any baseball player will tell you, the first two innings are the hardest two innings for any pitcher because you are guaranteed to face at least the top six hitters in the opposing team’s lineup. As most people would correctly assume, managers place their best hitters at the top of the lineup because they want those players to have the greatest chance of batting more times than the rest of the players in the batting order. Looking at the dataset, the vast majority of teams both scored and allowed most of their runs in the first two innings over their seasons. Teams are also easily encouraged or discouraged by those first two innings. If you jump out to an early lead in a game, it is difficult for the opposing team to stay positive and locked in while playing from behind, while the leading team gains confidence knowing that it is playing with a lead and has less pressure placed upon it.

As for the fifth and seventh innings, offense is not so much of an indicator at this point, because relief pitching now comes into play. In MLB, a “Quality Start” is defined as a starting pitcher lasting six or more innings and giving up three or fewer earned runs. Most MLB starting pitchers only toss six innings, and then the team’s bullpen of relief pitchers takes over. Considering recent research on the dangers of overusing a pitcher’s throwing arm, most teams like to pull the plug on a starting pitcher after the sixth inning, as his pitch count nears a hundred. In most cases, the seventh inning is the first frame that involves a reliever. That is often when hitters get to look at a new pitcher after possibly struggling with the starter, giving them an opportunity to score more runs and regain momentum for their team. On the opposite side of that spectrum, a team with a great bullpen can definitely win more games if it can hold on to a lead beginning in the seventh inning. As for the fifth inning, teams that don’t get the quality start they were looking for from their starting pitcher have to go to the bullpen earlier than expected. The fifth inning is another common inning for a new pitcher to come into a game if the starting pitcher isn’t performing well.

This dataset shows clearly that both offense and pitching have a big influence on wins in a season, but that certain innings are more important than others. Teams might be focusing too heavily on individual statistics when trying to determine the key to success. Organizations can look at this study to help identify what needs more work: offense at the beginning of games, or relief pitching in the middle or later part of games. Our study suggests that one of the biggest predictors of a ball club’s record at the end of the regular season is whether it “wins” the first, second, fifth, and seventh innings.
