Stats Project

Brendon Good
Math 241
4/17/15
Baseball Data Project

This statistical project is based on baseball and a specific set of player statistics. Its purpose is to determine whether there is any correlation between the salary a professional baseball player earns and several different variables: RBIs, walks, hits, runs, and at bats. The population is the entirety of MLB, and the sample consists of 244 MLB players on 11 different teams.

The first part of the project was to determine the descriptive statistics of the baseball data set: the mean, median, mode, range, standard deviation, and variance of the salaries earned by the players in the sample. The following chart shows these statistics.

Variable  Mean     StDev    Variance     Median   Range     Mode    N for Mode
Salary    3577408  4305248  1.85352E+13  1500000  24500000  500000  21

The mean salary of this data set is $3,577,408. The mean of this data is the most useful measure of central tendency. While the other information is interesting, such as the median salary ($1,500,000) and the most common salary ($500,000, earned by 21 players), the mean salary is the most useful to us. It shows the average salary earned by the sample and allows us to make correlations from there. The most useful measure of variability in this data set is the standard deviation. It allows us to judge whether a salary is typical, small, or large by how many standard deviations it lies from the mean.

The histogram that follows gives a rough visual representation of how salary is distributed over the sample, showing how many players earn salaries in each range.
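As a rough illustration of reading a salary in terms of standard deviations from the mean, the sketch below computes that distance from the summary figures in the chart above. The function name is hypothetical, and the two inputs (the reported median and mode) are chosen only for illustration; this is not a calculation from the original project.

```python
# Summary statistics taken from the descriptive statistics chart.
MEAN_SALARY = 3_577_408
STDEV_SALARY = 4_305_248


def deviations_from_mean(salary: float) -> float:
    """Return how many standard deviations a salary lies from the mean."""
    return (salary - MEAN_SALARY) / STDEV_SALARY


# Illustrative inputs: the median and mode reported in the chart.
for label, salary in [("median", 1_500_000), ("mode", 500_000)]:
    z = deviations_from_mean(salary)
    print(f"{label}: {salary:,} is {z:+.2f} standard deviations from the mean")
```

Both values come out less than one standard deviation below the mean, so by this measure neither the median nor the most common salary is unusual.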

This histogram shows that most players earn a salary in the $0 to $4,000,000 range. The second most common range is $4,000,000 to $8,000,000. The counts drop off dramatically after these first two ranges, down to only one player making between $24,000,000 and $28,000,000. The data in this histogram is substantially skewed to the right, with a long tail of high salaries, showing that the distribution is not symmetric. The following boxplot shows much of the same information as the histogram above, but it makes some features, such as outliers, easier to see.

The box plot shows many outliers above the third quartile. The outliers start at approximately $11,000,000 and cluster between there and roughly $16,000,000. This cluster is followed by two more outliers even further above the original outliers. The data is also clustered around the first and second quartiles, showing us that much of the salary data lies below the 5 million dollar mark at the third quartile. Following this cluster, the data is spread out between the second quartile and the third quartile.

A confidence interval gives a range of values that, at a stated level of confidence, contains the true population mean. The following interval does this with a 95% level of confidence.

N    Mean     SE Mean  95% CI
244  3577408  275615   (3037212, 4117604)
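The interval above can be rebuilt from the summary statistics alone. The sketch below assumes a normal (z) critical value; the report does not say which procedure produced the interval, so that choice and the function name are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Summary figures reported in the text.
N = 244
MEAN_SALARY = 3_577_408
STDEV_SALARY = 4_305_248


def mean_confidence_interval(mean, stdev, n, confidence=0.95):
    """Normal-based confidence interval for a mean from summary statistics."""
    standard_error = stdev / sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z_critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_critical * standard_error
    return standard_error, (mean - margin, mean + margin)


standard_error, (lower, upper) = mean_confidence_interval(
    MEAN_SALARY, STDEV_SALARY, N
)
print(f"SE mean: {standard_error:,.0f}")
print(f"95% CI: ({lower:,.0f}, {upper:,.0f})")
```

Under that assumption the standard error and both endpoints should land on or very near the reported values, which suggests the reported interval is the sample mean plus or minus about 1.96 standard errors.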

This information shows that we can be 95% confident that the interval from $3,037,212 to $4,117,604, centered on the sample mean of $3,577,408, contains the true mean salary.

Following the confidence interval I conducted a hypothesis test to determine whether the sample data provides sufficient evidence to conclude that the population mean salary is less than 3 million. The test produced a p-value of .981, and comparing that with a significance level of .05 I determined that there is not sufficient evidence to reject the null hypothesis that the true mean salary equals 3 million. By not rejecting the null hypothesis I can conclude that we do not have sufficient evidence that the true mean salary of the population of players is less than 3 million.

After the hypothesis test I conducted a regression analysis of Salary versus Runs and determined the regression equation relating salary to runs to be Salary = 2158258 + 60844(Runs). Using this regression equation I calculated a player's predicted salary for 18 runs and then for 33 runs. The calculations follow.

Salary = 2158258 + 60844(18) = 3253450
Salary = 2158258 + 60844(33) = 4166110

Using this regression equation it does appear that by scoring more runs, a player would earn a higher salary. After these calculations, however, I constructed a scatterplot of earned salary versus the number of runs (coefficient of determination = 14.6%).
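The hypothesis test and the two regression predictions described above can be sketched from the reported figures. The test below uses a normal approximation to the test statistic, which is an assumption (the report does not name the procedure), so its p-value may differ slightly from the reported .981. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Summary figures reported in the text.
N = 244
MEAN_SALARY = 3_577_408
STDEV_SALARY = 4_305_248
NULL_MEAN = 3_000_000

# Regression coefficients as used in the two worked calculations.
INTERCEPT = 2_158_258
SLOPE_PER_RUN = 60_844


def lower_tail_p_value(sample_mean, null_mean, stdev, n):
    """P-value for H0: mean = null_mean against Ha: mean < null_mean.

    Uses a normal approximation to the test statistic (an assumption).
    """
    test_statistic = (sample_mean - null_mean) / (stdev / sqrt(n))
    return test_statistic, NormalDist().cdf(test_statistic)


def predicted_salary(runs):
    """Evaluate the fitted line Salary = INTERCEPT + SLOPE_PER_RUN * runs."""
    return INTERCEPT + SLOPE_PER_RUN * runs


statistic, p_value = lower_tail_p_value(
    MEAN_SALARY, NULL_MEAN, STDEV_SALARY, N
)
print(f"test statistic: {statistic:.2f}, p-value: {p_value:.3f}")

for runs in (18, 33):
    print(f"{runs} runs -> predicted salary {predicted_salary(runs):,}")
```

Because the sample mean lies above 3 million, the test statistic is positive and the lower-tail p-value is close to 1, which is why the null hypothesis is not rejected.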

Even though the regression equation does appear to show a higher salary per run scored, the scatterplot suggests that a player's earned salary has no direct correlation with the number of runs. The scatterplot shows no distinct line relating earned salary to number of runs from 0 runs to 120 runs. This shows that regression is not an appropriate way of predicting salary in this specific scenario.

I continued my analysis of the data set by conducting multiple hypothesis tests to determine whether there was a correlation between earned salary and the different variables. The first hypothesis test was a continuation of the above calculations, correlating salary with runs. Performing this hypothesis test I calculated a p-value of 0.0 with a correlation coefficient of .383. With this p-value and a significance level of .05 I determined that I have sufficient evidence to reject the null hypothesis of no correlation. This means there is sufficient evidence to conclude that a correlation between the number of runs and earned salary is present. A caveat, though, is that with the correlation coefficient being so low (.383), the linear relation is slight but still positive.

I concluded my analysis by performing hypothesis tests on the remaining variables, trying to determine whether there was a correlation between each of them and earned salary. The following table shows the p-value for each hypothesis test as well as the correlation coefficient.

Hypothesis Tests
Comparison         Correlation Coefficient  P-value
Salary vs RBIs     .406                     0.0
Salary vs HITS     .356                     0.0
Salary vs RUNS     .383                     0.0
Salary vs WALKS    .416                     0.0
Salary vs AT BATS  .359                     0.0

The table above shows the correlation coefficients for each of the independent variables when compared to the salary variable.
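The p-values in the table can be checked from the correlation coefficients and the sample size alone. The sketch below uses the standard t statistic for a correlation coefficient and, as a simplifying assumption, a normal approximation for the two-sided p-value. The function name is hypothetical, and the report does not state how its p-values were computed.

```python
from math import sqrt
from statistics import NormalDist

N = 244  # players in the sample

# Correlation coefficients with salary, as reported in the table.
CORRELATIONS = {
    "RBIs": 0.406,
    "HITS": 0.356,
    "RUNS": 0.383,
    "WALKS": 0.416,
    "AT BATS": 0.359,
}


def correlation_test(r, n):
    """Test H0: no linear correlation, given r and the sample size.

    Returns the t statistic, an approximate two-sided p-value (normal
    approximation, an assumption), and r squared.
    """
    t_statistic = r * sqrt(n - 2) / sqrt(1 - r * r)
    p_value = 2 * (1 - NormalDist().cdf(abs(t_statistic)))
    return t_statistic, p_value, r * r


for name, r in CORRELATIONS.items():
    t_statistic, p_value, r_squared = correlation_test(r, N)
    print(f"Salary vs {name}: t = {t_statistic:.2f}, "
          f"p = {p_value:.1e}, r^2 = {r_squared:.1%}")
```

All five t statistics are far from zero, which is consistent with p-values that round to 0.0, while each r squared stays below 20%, so each variable accounts for only a small share of the variation in salary.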
Each hypothesis test had a p-value of 0.0, and at a significance level of .05 the null hypothesis for each test can be rejected. These rejections indicate that there is a correlation between each independent variable and the earned salary of baseball players. But as with runs, there is still a caveat. The correlation coefficients for each test, while positive, were fairly low, meaning that the linear relation is only slight.

After performing multiple types of tests on this data set, I can conclude that while there is a correlation between the salary earned by players and each independent variable, the correlation is only slight. Conducting a calculation with a regression equation and then reviewing a scatterplot gave mixed results, but by performing hypothesis tests on each independent variable versus salary I was able to determine a level of correlation. This implies that players who achieve high levels in each of the independent variables should have a higher salary than others. To test this correlation further, a new data set could be built from players at both ends of the spectrum. Players with low stats and players with high stats could both be sampled, and their salaries compared against their stats for each independent variable.
