Predicting College Football Outcomes · 2016. 7. 7. · Background College Football Post Season...

+

Kernel Methods in Data Analytics: Predicting College Football Outcomes by Logistic Regression

Theodore Trafalis, School of Industrial and Systems Engineering, University of Oklahoma

Panel in Sports Analytics, International Conference in Sports Management

Athens, Greece May 9, 2016

+ Agenda

Background

Existing Approaches

Model Development

Parameters Used

Offensive

Defensive

Interactions

Other

Results and Interpretation

Validation

Conclusions and Recommendations

Works Cited

+ Background

College Football Post Season

30-40 Bowls

College Football Playoffs

Huge Viewership

6 Million per game in 2014

33 Million watched Championship Game

Millions bet on outcomes of Bowl Games

Uncommon matchups

Lots of Uncertainty

+ Existing Approaches

Las Vegas Oddsmakers

Massey Ratings

Conglomerate of Computer Polls

History of Computer Polls in BCS era

ELO Ratings

Developed by Arpad Elo to rank Chess players

Adapted to sports, videogames, programming, etc.

S&P+ Ratings

Derived from Play by Play data

Efficiency

Explosiveness

Field Position

Finishing Drives

Turnovers

Football Power Index

Developed by ESPN

Predictive and Simulation based
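As a brief illustration of one of the approaches listed above, the standard Elo update can be sketched in a few lines. The K-factor of 20 and the ratings of 1500 are hypothetical values chosen for the example, not values from this presentation, and football adaptations of Elo typically add further adjustments that are not shown here.

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score (win probability) of A against B on the standard Elo curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 20.0):
    """Return both updated ratings; score_a is 1 (A wins), 0.5 (draw), or 0 (A loses)."""
    expected_a = elo_expected(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the rating pool stays constant.
    return rating_a + delta, rating_b - delta


# Hypothetical ratings for two evenly matched teams; team A wins.
new_a, new_b = elo_update(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)
```

The winner of an even matchup gains half the K-factor; beating a much stronger opponent would move the ratings by close to the full K.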

+ Model Development

Binomial Logistic Regression

Fits dependent variable into one of two non-overlapping sets

Can utilize many different types of input variable

Nominal

Ordinal

Interval

Without Interactions

With Interactions
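The two model forms named above can be written out as follows. This is the textbook binomial logistic model; the symbols (p for the win probability, x_i for the input variables, β for the fitted coefficients) are notation assumed here, and the product form of the interaction terms is an assumption, since the interaction slides later in the deck describe several interactions as ratios.

```latex
% Without interactions
p(\mathbf{x}) = \frac{1}{1 + \exp\!\left[-\left(\beta_0 + \sum_{i=1}^{n} \beta_i x_i\right)\right]}

% With (pairwise) interactions
p(\mathbf{x}) = \frac{1}{1 + \exp\!\left[-\left(\beta_0 + \sum_{i=1}^{n} \beta_i x_i
      + \sum_{i<j} \beta_{ij}\, x_i x_j\right)\right]}
```

A game is then assigned to the "win" class when p exceeds a chosen threshold, commonly 0.5.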

+ Model Development

Parameter Fitting

36 Games from 2014 used as training data

Method of Least Squares

Program Used

Excel’s Solver GRG package
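The fitting step above can be sketched outside Excel. The example below minimizes the sum of squared errors between the 0/1 outcomes and the logistic predictions using plain gradient descent, which stands in for Solver's GRG routine and is not the same algorithm. The toy data, learning rate, and step count are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(beta, x):
    """Win probability for feature vector x; beta[0] is the intercept."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))


def sse(beta, X, y):
    """Sum of squared errors between 0/1 outcomes and predicted probabilities."""
    return sum((yi - predict(beta, xi)) ** 2 for xi, yi in zip(X, y))


def fit_least_squares(X, y, lr=0.5, steps=2000):
    """Minimize the SSE by gradient descent (a stand-in for Excel's GRG solver)."""
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            p = predict(beta, xi)
            # Derivative of (y - p)^2 with respect to the linear score.
            common = -2.0 * (yi - p) * p * (1.0 - p)
            grad[0] += common
            for j, v in enumerate(xi):
                grad[j + 1] += common * v
        beta = [b - lr * g / len(X) for b, g in zip(beta, grad)]
    return beta


# Hypothetical toy data: one scaled feature per game, outcome 1 = win.
X = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
y = [0, 0, 1, 0, 1, 1]
beta = fit_least_squares(X, y)
print(beta, sse(beta, X, y))
```

Logistic regression is more commonly fitted by maximum likelihood; least squares is used here only because it is the criterion the slide names.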

+ Parameters Used

Offensive Metrics

8 Team Statistics

Consider both teams

Defensive Metrics

8 Team Statistics

Consider both teams

Quarterback Rating

Conversion Metrics

Disruptive Metrics

Penalty Metrics

Outside Rankings

Conference

Interaction Effects

+ Offensive Metrics

Yardage Metrics

Total Yards

Total Yards per Game

Passing Yards

Passing Yards per Game

Rushing Yards

Rushing Yards per Game

Scoring Metrics

Total Points Scored

Points Scored per Game

+ Defensive Metrics

Yardage Metrics

Total Yards Allowed

Total Yards Allowed per Game

Passing Yards Allowed

Passing Yards Allowed per Game

Rushing Yards Allowed

Rushing Yards Allowed per Game

Scoring Metrics

Total Points Allowed

Points Allowed per Game

+ Quarterback Rating

Abbreviated QBR

Measure of Quarterback Quality

Completion %

Yards per Attempt

Touchdown %

Interception %
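The four components listed above are the inputs to the NCAA passer efficiency formula, so a minimal sketch is given below on the assumption that this is the rating meant by the slide (ESPN's Total QBR is a different, proprietary metric). The sample stat line is hypothetical.

```python
def passer_efficiency(attempts: int, completions: int, yards: int,
                      touchdowns: int, interceptions: int) -> float:
    """NCAA passer efficiency built from the four per-attempt components."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    completion_pct = 100.0 * completions / attempts
    yards_per_attempt = yards / attempts
    touchdown_pct = 100.0 * touchdowns / attempts
    interception_pct = 100.0 * interceptions / attempts
    return (completion_pct
            + 8.4 * yards_per_attempt
            + 3.3 * touchdown_pct
            - 2.0 * interception_pct)


# Hypothetical stat line: 60 of 100 for 800 yards, 5 TD, 2 INT.
rating = passer_efficiency(100, 60, 800, 5, 2)
print(round(rating, 1))
```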

+ Conversion and Conference Metrics

Conversion Metrics

Measure of a team’s ability to maintain possession of the ball.

Number of First Downs

3rd Down Conversion Rate

4th Down Conversion Rate

Conference Metric

Power 5 vs. Group of 5

+ Disruptive Metrics

Defensive Disruption

Measure of Defense’s ability to disrupt Offense

Sacks

Interceptions

Fumbles

Offensive Disruption

Measure of team discipline and ability to keep offensive drives on track

Total Penalty Yards Assessed

+ Outside Rankings

Massey Ranking

Conglomeration of Computer Polls

Las Vegas Spread

Set by Las Vegas bookmakers

Goal is to ensure equal betting on both teams

Negative spread means team is favored

+ Interaction Effects

Offense vs. Defense

Total Yards vs. Total Yards Allowed

Points Scored per Game vs. Points Allowed per Game

Rushing Yards per Game vs. Rushing Yards Allowed per Game

Passing Yards per Game vs. Passing Yards Allowed per Game

QBR vs. Defense

Disruptive

Sack Ratio

Interception Ratio

Fumble Ratio

Conversion

1st Down Ratio

3rd Down Conversion Ratio

4th Down Conversion Ratio

Penalties

Penalty Yard Ratio

Ranking

Massey Ranking Ratio

Vegas Spread
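One way the ratio-style interaction terms above could be computed is sketched here. The slides do not say which quantity goes in the numerator or denominator, so the pairing used (a team's offensive figure over the opponent's corresponding defensive figure, or the team's value over the opponent's value) is an assumption, as are all dictionary keys and numbers.

```python
def safe_ratio(numerator: float, denominator: float):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def interaction_features(team: dict, opponent: dict) -> dict:
    """Build a few offense-vs-defense and team-vs-team ratio features."""
    return {
        "total_yards_ratio": safe_ratio(team["yards_per_game"],
                                        opponent["yards_allowed_per_game"]),
        "points_ratio": safe_ratio(team["points_per_game"],
                                   opponent["points_allowed_per_game"]),
        "sack_ratio": safe_ratio(team["sacks"], opponent["sacks"]),
        "massey_ratio": safe_ratio(team["massey_rank"], opponent["massey_rank"]),
    }


# Hypothetical season statistics for two bowl teams.
team_a = {"yards_per_game": 450.0, "points_per_game": 35.0,
          "sacks": 30, "massey_rank": 5}
team_b = {"yards_allowed_per_game": 360.0, "points_allowed_per_game": 20.0,
          "sacks": 24, "massey_rank": 20}
features = interaction_features(team_a, team_b)
print(features)
```

Ratios keep the two teams' statistics in a single term, which matters when only a few dozen games are available for fitting.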

+ Model Results

+ Model Results—Unexpected

+ Model Fit

Able to correctly categorize 73/79 (92.4%) CFB bowl outcomes from 2014.

Vegas: 58.3%

Massey: 61.1%

ELO: 66.7%

S&P+: 63.9%

FPI: 58.3%

+ Cross Validation—2013

Ran model for 2013 bowl games

Model Accuracy dropped to 57.1%

Existing methods also dropped in accuracy

Vegas: 62.9%

Massey: 60%

ELO: 60%

S&P+: 54.3%

FPI: 51.4%

+ Cross Validation—2012

Ran model for 2012 bowl games

Model Accuracy dropped to 55.7%

Existing methods also dropped in accuracy

Vegas: 60%

Massey: 60%

ELO: 51%

S&P+: 48.2%

FPI: 45.2%

+ Conclusions and Recommendations

Conclusions

Model Works for Categorizing Games a posteriori

Not great for a priori predictions

Model is competitive with other predictive measures

Recommendations

Use ANOVA to filter out unwanted parameters

Incorporate other predictive measures

Path Forward

Compare Binomial Multiple Logistic Regression to SVM

+ Works Cited

[1] Rishe, P. (2015, January 15). Reviewing The 2014-15 Bowl Season: Highest Bowl Game Prices, Attendances, And TV Ratings. Retrieved November 30, 2015, from http://www.forbes.com/sites/prishe/2015/01/15/reviewing-the-2014-15-bowl-season-highest-bowl-game-prices-attendances-and-tv-ratings/

[2] Purdum, D. (2015, January 30). Wagers, bettor losses set record. Retrieved November 30, 2015, from http://espn.go.com/chalk/story/_/id/12253876/nevada-sports-bettors-wagered-lost-more-ever-2014

[3] World Football Elo Ratings: Rating System. (n.d.). Retrieved November 30, 2015, from http://www.eloratings.net/system.html

[4] Football Outsiders. (n.d.). Retrieved November 30, 2015, from http://www.footballoutsiders.com/stats/ncaa

[5] Hosmer, D., & Lemeshow, S. (1989). Introduction to the Logistic Regression Model. In Applied logistic regression (pp. 1-30). New York: Wiley.

[6] Diaz, A., Tomba, E., Lennarson, R., Richard, R., Bagajewicz, M., & Harrison, R. (2010). Prediction of protein solubility in Escherichia coli using logistic regression. Biotechnology and Bioengineering, 374-383.


+ Predicting Major League Baseball Championship Winners through SVMs

Attribute                                             Category
Total Runs Scored (R)                                 Offensive
Stolen Bases (SB)                                     Offensive
Batting Average (AVG)                                 Offensive
On Base Percentage (OBP)                              Offensive
Slugging Percentage (SLG)                             Offensive
Team Wins                                             Record
Team Losses                                           Record
Earned Run Average (ERA)                              Pitching
Save Percentage                                       Pitching
Strikeouts per nine innings (K/9)                     Pitching
Opponent Batting Average (AVG)                        Pitching
Walks plus hits per inning pitched (WHIP)             Pitching
Fielding Independent Pitching (FIP)                   Pitching
Double Plays turned (DP)                              Defensive
Fielding Percentage (FP)                              Defensive
Wins Above Replacement (WAR)                          Baserunning, Hitting, and Fielding

Table 1 Summary of attributes evaluated in this study

+ SVM

The SVM algorithm, developed by Vapnik (1998), is frequently applied in machine learning.

The SVM algorithm for binary classification problems constructs a hyperplane that separates a set of training vectors into two classes (e.g., tornadoes vs. non-tornadoes).

The objectives of SVMs for the primal problem are to maximize the margin of separation and to minimize the misclassification error.

We utilize the probabilistic outputs for SVMs proposed by Platt (1999).
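For reference, the primal problem and the probabilistic output described above can be written as follows. This is the standard soft-margin formulation and Platt's sigmoid; the notation (w, b, ξ_i, C, A, B) is assumed here and does not come from the slides.

```latex
% Soft-margin SVM primal: maximize the margin, penalize misclassification
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad
  \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{m} \xi_i
\qquad \text{s.t.} \quad
  y_i\left(\mathbf{w}^{\top}\phi(\mathbf{x}_i) + b\right) \ge 1 - \xi_i,
  \quad \xi_i \ge 0, \quad i = 1, \dots, m

% Platt's probabilistic output for decision value f(x) = w^T phi(x) + b
P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp\!\left(A\, f(\mathbf{x}) + B\right)}
```

The first term of the objective controls the margin, the second the training error, and C sets the trade-off. The scalars A and B are fitted on held-out decision values.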

+ Illustration of SVM

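+ American League Pennant Model Selection

Standard SVM

Imbalanced data, bias towards L (majority class)

Confusion matrix (count, rate): L classified as L 63 (100%); W classified as L 20 (100%); W classified as W 0 (0%); L classified as W 0 (0%)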

+ Different RBF Classifiers

Panels: Gaussian, γ=50; Gaussian, γ=300; Gaussian, γ=30,000; Gaussian, γ=300,000
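The effect of γ across the four panels can be illustrated with the usual Gaussian RBF kernel, K(x, z) = exp(-γ‖x − z‖²). Whether the software behind these slides parameterizes the kernel by γ in exactly this way is an assumption, and the two points below are hypothetical.

```python
import math


def rbf_kernel(x, z, gamma: float) -> float:
    """Gaussian RBF kernel exp(-gamma * ||x - z||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# Two hypothetical teams described by two scaled attributes each.
x = [0.30, 0.45]
z = [0.32, 0.44]
similarities = {gamma: rbf_kernel(x, z, gamma)
                for gamma in (50, 300, 30_000, 300_000)}
for gamma, value in similarities.items():
    print(gamma, value)
```

As γ grows, even nearby points stop looking similar, so the decision boundary wraps tightly around individual training points. Very large γ can therefore produce high training accuracy without good generalization.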

+ Comparison of the accuracy of different Gaussian RBF classifiers

Model     Cost for Majority (L)   Cost for Minority (W)   gamma (γ)   Accuracy (%)
Case 1    1                       3                       50          80.7
Case 2    1                       3                       300         71.8
Case 3    1                       3                       30,000      86.7
Case 4    1                       3                       300,000     98.8

+

Sweet Spot

+

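Confusion matrix (count, rate): L classified as L 47 (74.6%); W classified as L 8 (40.0%); W classified as W 12 (60.0%); L classified as W 16 (25.4%)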

Classifier selected to perform prediction on the American League pennant race

+ National League Pennant Model Selection

Sweet Spot

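+

SVM Model       Cost for Majority (L)   Cost for Minority (W)   gamma (γ)   Accuracy (%)
Gaussian RBF    1                       3                       30,000      77.1

Confusion matrix (count, rate): L classified as L 49 (77.8%); W classified as L 5 (25.0%); W classified as W 15 (75.0%); L classified as W 14 (22.2%)

The classifier selected to perform prediction on the National League pennant race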
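The National League figures above can be checked with a short calculation. The reading of the four cells (49 L as L, 5 W as L, 15 W as W, 14 L as W) is an assumption inferred from the arithmetic: it is the assignment that reproduces the reported 77.1% accuracy and the per-class rates.

```python
def summarize(l_as_l: int, w_as_l: int, w_as_w: int, l_as_w: int) -> dict:
    """Overall accuracy and per-class recall from four confusion-matrix counts."""
    total = l_as_l + w_as_l + w_as_w + l_as_w
    return {
        "accuracy": (l_as_l + w_as_w) / total,
        "l_recall": l_as_l / (l_as_l + l_as_w),  # share of actual L found
        "w_recall": w_as_w / (w_as_w + w_as_l),  # share of actual W found
    }


# Counts from the National League classifier slide.
nl = summarize(l_as_l=49, w_as_l=5, w_as_w=15, l_as_w=14)
for name, value in nl.items():
    print(name, round(100 * value, 1))
```

Reporting the two recalls alongside accuracy matters for imbalanced data like this (63 L against 20 W), where a classifier that always answers L already scores about 76%.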

+

Confusion matrix cells (count, rate): 27 (64.3%), 21 (50.0%), 21 (50.0%), 15 (35.7%)

Figure 7 shows the best classifier acquired from the Linear Kernel SVM algorithm

Confusion matrix cells (count, rate): 33 (78.6%), 24 (57.1%), 18 (42.9%), 9 (21.4%)

Figure 8 shows the best classifier acquired from the Quadratic Kernel SVM algorithm

Confusion matrix cells (count, rate): 36 (85.7%), 21 (50.0%), 21 (50.0%), 6 (14.3%)

Figure 9 shows the best classifier acquired from the Cubic Kernel SVM algorithm

Confusion matrix cells (count, rate): 28 (66.7%), 12 (28.6%), 30 (71.4%), 14 (33.3%)

Figure 10 shows the best classifier acquired from the Gaussian Kernel RBF SVM algorithm

Machine Learning Algorithm   Model Accuracy   Attribute (1)          Attribute (2)
Linear Kernel SVM            57.1%            Fielding Percentage    Batting Average
Quadratic Kernel SVM         60.7%            WHIP                   ERA
Cubic Kernel SVM             67.9%            Double Plays turned    Wins
Gaussian Kernel RBF          69%              SLG                    Double Plays turned

Table 4 compares the accuracy values of the best classifier for each SVM algorithm.
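The linear, quadratic, and cubic kernels compared in Table 4 are usually members of the polynomial kernel family, K(x, z) = (x·z + c)^d with d = 1, 2, 3. The sketch below assumes that form. The offset c (named `coef0` here), any scaling, and the sample vectors are hypothetical, since the slides do not specify them.

```python
def poly_kernel(x, z, degree: int, coef0: float = 1.0) -> float:
    """Polynomial kernel (x . z + coef0) ** degree for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, z))
    return (dot + coef0) ** degree


# Hypothetical two-attribute team vectors.
x = [1.0, 2.0]
z = [3.0, 4.0]
linear = poly_kernel(x, z, degree=1, coef0=0.0)   # plain dot product
quadratic = poly_kernel(x, z, degree=2)
cubic = poly_kernel(x, z, degree=3)
print(linear, quadratic, cubic)
```

Higher degrees allow more curved decision boundaries between the two plotted attributes, which may explain part of the accuracy ordering in the table.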

Figure 11 shows the playoff results of the 2015 Major League Baseball season

American League National League

World Series


Figure 12 plots the teams competing in the 2015 American League pennant race on the best classifier developed previously


Machine Learning Algorithm   Model Accuracy   World Series Winner (W)   World Series Loser (L)
Actual Result                -
Linear Kernel SVM            57.1%
Quadratic Kernel SVM         60.7%
Cubic Kernel SVM             67.9%
Gaussian RBF Kernel SVM      69%

Table 5 Prediction results of the 2015 World Series for several classifiers



+

End of Presentation

Contact: [email protected]
