MODELLING FOOTBALL DATA

Shavajai Quentin Franz

BACKGROUND AND MOTIVATION

• The motivation for this project was to apply mathematical, statistical and actuarial modelling techniques to professional sports.

• Our interest is to provide information on how such modelling techniques can be used to gain insights into sports dynamics.


OBJECTIVES

The main aim of this work is to build models for football data, in particular the scoring rate. The specific objectives that we set out to achieve include:

• Examining the suitability of the simple Poisson model to our data.

• Incorporating the time heterogeneity of the simple Poisson parameter.

• Exploring the effect of time on the scoring rate.


DATA

• Data was retrieved from rsssf.com.

• The data used in this project spans 7 football seasons, from the 1996-1997 to the 2002-2003 English Premier League seasons.

• There were (380*7) = 2660 matches over the 7 seasons with 7001 goals in total, of which 4055 were home goals and 2946 were away goals.

• Of the 7001 goals, 3119 were scored in the first half with the rest (3882) being scored in the second half.
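
The totals quoted above imply per-match scoring rates, which are the natural starting estimates for a simple Poisson model. The sketch below only does that arithmetic; the per-match averages it prints are derived from the quoted totals and are not figures taken from the slides.

```python
# Counts quoted on the DATA slide.
matches = 380 * 7
home_goals, away_goals = 4055, 2946
total_goals = home_goals + away_goals

# Sample mean goals per match. Under a simple Poisson model with a single
# constant rate, the sample mean is the maximum likelihood estimate of that rate.
rate_home = home_goals / matches
rate_away = away_goals / matches
rate_total = total_goals / matches

print(f"home  {rate_home:.4f}")
print(f"away  {rate_away:.4f}")
print(f"total {rate_total:.4f}")
```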


Goals frequency

METHODOLOGY

• Consider the Poisson model, as it is conventionally used for count data and its parameter is defined for positive values. The same model has been used extensively in previous studies (e.g. Karlis and Ntzoufras, 2003).

• Test for correlation between home and away scores.

• Fit a Poisson model for scoring rates for the home case (λh), away case (λa) and total case (λt).

• Conduct a goodness of fit test on the simple Poisson model.

• Introduce the Poisson Gamma mixture as an improvement on the simple Poisson.

• Conduct a goodness of fit test on the mixed Poisson model.

• Introduce Poisson regression to model the relationship between scoring rates and time in a football match.

FINDINGS

Test for Correlation (R Test)

• H0: ρ = 0 versus H1: ρ ≠ 0

• r = -0.02067

• Test statistic: t = r√((n − 2)/(1 − r²)) = -1.06593 ~ t2658

• Degrees of freedom = n − 2 = 2658

• P value = 0.28646

• The coefficient of correlation was found to be -0.02067. The test statistic is t = -1.06593 and the corresponding two-sided p value is 0.28646, so we fail to reject H0. This implies that the correlation coefficient is not significantly different from zero.

• Because the correlation coefficient was found not to be significantly different from zero, we can proceed by treating the home goals and away goals as independent Poisson events and test whether a Poisson model is reasonable.
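
The correlation test can be checked with a few lines of arithmetic. This is a minimal sketch using only the quoted r and n; it assumes that, with 2658 degrees of freedom, the t distribution can be approximated by the standard normal when computing the two-sided p value.

```python
import math

r = -0.02067   # sample correlation quoted on the slide
n = 2660       # number of matches

# t statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_stat = r * math.sqrt((n - 2) / (1 - r * r))

# Assumption: t with 2658 degrees of freedom is treated as standard normal,
# so the two-sided p value is erfc(|t| / sqrt(2)).
p_two_sided = math.erfc(abs(t_stat) / math.sqrt(2))

print(round(t_stat, 4), round(p_two_sided, 4))
```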


Table 4

Using the formulas below we obtained the following values

The Poisson Model

• For a Poisson distributed count Y, with Y = 0, 1, 2, 3, ...:
Y ~ Po(λ), E(Y | λ) = λ, Var(Y | λ) = λ

• When λ itself varies, the expectation of Y can be derived from the tower law:
E(Y) = E(E(Y | λ)) = E(λ)    (1)

• The variance can be calculated from the law of total variance, also known as the variance decomposition formula:
Var(Y) = E(Var(Y | λ)) + Var(E(Y | λ))    (2)

• This can be simplified further using E(Y | λ) = λ and Var(Y | λ) = λ for a Poisson model:
Var(Y) = E(λ) + Var(λ)

• Thus the excess variance Var(Y) − E(Y) is synonymous with the variance of the parameter λ. A positive excess variance in the observed goals therefore indicates heterogeneity in the scoring rate λ.
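
The variance decomposition can be verified numerically on a toy example. The sketch below uses a hypothetical two-point mixing distribution for λ (the rates 1.0 and 2.0 are illustrative and not estimated from the data) and computes the marginal mean and variance of Y exactly from the mixed probabilities.

```python
import math

def poisson_pmf(y, lam):
    """P(Y = y) for a Poisson variable with rate lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

# Hypothetical mixing distribution: half of the matches have rate 1.0 and
# half have rate 2.0 (illustrative values only).
rates = [1.0, 2.0]
weights = [0.5, 0.5]

mean_lam = sum(w * l for w, l in zip(weights, rates))
var_lam = sum(w * (l - mean_lam) ** 2 for w, l in zip(weights, rates))

# Marginal pmf of Y, truncated at 60 goals where the tail is negligible.
pmf = [sum(w * poisson_pmf(y, l) for w, l in zip(weights, rates))
       for y in range(60)]
mean_y = sum(y * p for y, p in enumerate(pmf))
var_y = sum((y - mean_y) ** 2 * p for y, p in enumerate(pmf))

print(mean_y, mean_lam)            # E(Y) against E(lambda)
print(var_y, mean_lam + var_lam)   # Var(Y) against E(lambda) + Var(lambda)
```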

What next?

• We observed heterogeneity in the scoring rate λ. This suggests that λ is a random variable. We need a model that incorporates λ as a random variable.

• We now consider a Poisson mixture.

• Since λ > 0, it would be appropriate to take a Gamma distribution as the mixing distribution, as it is also defined for positive values and is very flexible.

• It also has the advantage of having a moment generating function that is easy to compute, which will be used later in the study.

• Some other distributions which cover the same state space (λ > 0) do not have easily computable moment generating functions.

• The Gamma distribution is a two parameter distribution, with parameters usually termed α, the shape parameter, and β, which acts as a rate (inverse scale) parameter in the parameterisation used here, where E(λ) = α/β. The pdf of a gamma distribution is given by
f(λ) = β^α λ^(α−1) e^(−βλ) / Γ(α), λ > 0

Poisson Gamma Model

• This may at first seem difficult, as the model assumes that the scoring rates (home, away and total) are drawn randomly from a gamma distribution, so λ is a random variable and not fixed. However, we can integrate out λ by integrating over the whole range available to λ in the gamma distribution and hence derive a probability function for the number of goals in terms of α and β:
P(X = x) = [Γ(α + x) / (Γ(α) x!)] (β/(1 + β))^α (1/(1 + β))^x, x = 0, 1, 2, ...
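
The integration step can be written out as follows. This is a sketch of the standard derivation, assuming the gamma density is parameterised with shape α and rate β, so that E(λ) = α/β as in the method of moments step; the result is a negative binomial probability function.

```latex
\begin{aligned}
P(X = x)
  &= \int_0^\infty \frac{e^{-\lambda}\lambda^{x}}{x!}\,
     \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\lambda^{\alpha-1}e^{-\beta\lambda}\,d\lambda \\
  &= \frac{\beta^{\alpha}}{x!\,\Gamma(\alpha)}
     \int_0^\infty \lambda^{x+\alpha-1}e^{-(1+\beta)\lambda}\,d\lambda \\
  &= \frac{\beta^{\alpha}}{x!\,\Gamma(\alpha)}\,
     \frac{\Gamma(x+\alpha)}{(1+\beta)^{x+\alpha}} \\
  &= \frac{\Gamma(x+\alpha)}{x!\,\Gamma(\alpha)}
     \left(\frac{\beta}{1+\beta}\right)^{\alpha}
     \left(\frac{1}{1+\beta}\right)^{x},
  \qquad x = 0, 1, 2, \dots
\end{aligned}
```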

Poisson gamma probabilities

• Hence we have a smart way of determining probabilities for the number of home goals, away goals and total goals. The minor problem here would be evaluating gamma functions at non-integer arguments. This can easily be overcome by using the recurrence formula:
P(X = 0) = (β/(1 + β))^α
P(X = x) = [(α + x − 1) / (x(1 + β))] P(X = x − 1), x = 1, 2, 3, ...

• Using the excess variance for all three goal distributions, it is possible to fit appropriate gamma distributions by the method of moments, since E(λ) = α/β and Var(λ) = α/β², and using E(X) = E(λ) and excess variance = Var(X) − E(X) = Var(λ).

• Hence
β = (α/β) / (α/β²) = E(X) / (Var(X) − E(X))
α = E(X) β
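
The method of moments fit and the recurrence can be sketched together. The function names are hypothetical, and the mean and variance passed in at the end are illustrative values, not the moments from the study. The recurrence used is P(X = 0) = (β/(1 + β))^α and P(X = x) = P(X = x − 1)(α + x − 1)/(x(1 + β)), which follows from the negative binomial form of the Poisson gamma mixture under the rate parameterisation E(λ) = α/β.

```python
def fit_gamma_by_moments(mean, variance):
    """Return (alpha, beta) from the sample mean and variance of goal counts."""
    excess = variance - mean
    if excess <= 0:
        raise ValueError("no excess variance: the gamma mixture cannot be fitted")
    beta = mean / excess      # beta = E(X) / (Var(X) - E(X))
    alpha = mean * beta       # alpha = E(X) * beta
    return alpha, beta

def poisson_gamma_pmf(alpha, beta, max_goals):
    """P(X = 0..max_goals) by recurrence, avoiding gamma function evaluations."""
    probs = [(beta / (1 + beta)) ** alpha]
    for x in range(1, max_goals + 1):
        probs.append(probs[-1] * (alpha + x - 1) / (x * (1 + beta)))
    return probs

# Hypothetical moments for illustration only (not the values from the study).
alpha, beta = fit_gamma_by_moments(mean=1.5, variance=1.8)
probs = poisson_gamma_pmf(alpha, beta, max_goals=40)
print(alpha, beta, sum(probs))
```

Multiplying each probability by the number of matches gives the expected class frequencies needed for a Chi-square goodness of fit test.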

Hypothesis testing

• H0: Goals (i) follow a Poisson distribution, i = home, away, total

• H1: Goals (i) do not follow a Poisson distribution, i = home, away, total

And

• H2: Goals (i) follow a Poisson distribution with rates drawn from a Gamma distribution, i = home, away, total

• H3: Goals (i) do not follow a Poisson distribution with rates drawn from a Gamma distribution, i = home, away, total

• To test the hypotheses above, a Chi-square test needs to be carried out. The test results are given below:


Simple Poisson goodness of fit test

Clearly the simple Poisson models with assumed constant rate for all games are not good models with p values all substantially below 1%. Hence Ho can be rejected in each case. However this is hardly surprising as the rates are extremely unlikely to be the same for each home and away side for each match.

Table 4: Goodness of fit test for the simple Poisson model

         Σ(O−E)²/E   Classes   Df lost for total   Parameters estimated   Total df   p value
Home     21.9963     7         1                   1                      5          0.00052
Away     16.7955     6         1                   1                      4          0.00212
Total    21.5212     9         1                   1                      7          0.00589
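
A p value of the kind reported in the table is the upper tail probability of a Chi-square distribution at the observed statistic. The sketch below is one way to compute it with the Python standard library only; the function name chi2_sf is hypothetical, and the method (a recurrence on the regularised upper incomplete gamma function) is an illustrative choice rather than the procedure used in the study.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability P(chi-square with df degrees of freedom > x)."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)                # Q(1, y) = exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))     # Q(1/2, y) = erfc(sqrt(y))
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1.0
    return q

# Statistic and degrees of freedom taken from the Home and Away rows.
print(chi2_sf(21.9963, 5))
print(chi2_sf(16.7955, 4))
```

The two printed values should come out close to the 0.00052 and 0.00212 quoted for the home and away cases.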


Poisson gamma goodness of fit test

In the case of the home and away goals scored, the p values are large (0.4025 and 0.25054), so we fail to reject H2, i.e. the Poisson Gamma model is a good fit. The same conclusion cannot be drawn for the total goals, where the p value is small (0.00829) and the Poisson Gamma model is little better than the simple Poisson model.

Table 5: Goodness of fit test for the Poisson Gamma model

         Σ(O−E)²/E   Classes   Df lost for total   Parameters estimated   Total df   p value
Home     4.02597     7         1                   2                      4          0.4025
Away     4.10313     6         1                   2                      3          0.25054
Total    18.9681     9         1                   2                      6          0.00829


Poisson Regression

[Chart 1: Total Goals vs. Time (x-axis: time in minutes; y-axis: goals)]

[Chart 2: Number of Home Goals vs. Time (x-axis: time in minutes; y-axis: goals)]

[Chart 3: Number of Away Goals vs. Time (x-axis: time in minutes; y-axis: goals)]

• As is clear from the scatter plots above, noticeably fewer goals are scored in the 1st and 46th minutes, and there is a spike in the scoring rates in the 45th and 90th minutes.

• The latter could be due to the stoppage time that is added to these two minutes. We define the total scoring rate as the total goals scored in a given minute. The scoring rates between the 2nd and 44th minutes and between the 47th and 89th minutes increase steadily with time.

• To construct a model, it is reasonable to divide the 90 minute interval to reflect this systematic pattern in the total scoring rates with respect to time.

• We need to obtain the 1st, 45th, 46th and 90th minute rates explicitly, and then derive an overall rate for the intervals [2,44] and [47,89] inclusive.

• An argument can be made that a linear model can be constructed to capture the trend in the data. However, a linear regression would imply equal weights assigned to each minute (the weight is the exposure in each minute, which is 2660 match-minutes), but since the rate increases with time over the two main intervals in the data, this is not a reasonable assumption.

• The rate in these intervals can reasonably be approximated by a model with an associated Poisson distribution. Since they span only a minute each, the other explicit scoring rates can be derived by the trivial formula of the total goals scored in that minute divided by the minutes of exposure, which is 2660 in each case. In addition, to account for the approximately linear trend of the rates in the 'Poisson intervals', Poisson regression seems ideal.

Poisson regression is a generalized linear model under the Poisson family of distributions and is expressed as follows:
ln μ(t) = α + βt,
where μ(t) denotes the scoring rate for a given minute t.

For the total goals case, the explicitly derived rates, denoted μ0(t), are obtained as follows:
μ0(t) = (total goals in the i-th minute) / 2660

For instance, the rates for the total case are as follows:
μ0(t) = 0.0147, 0 < t < 1
μ0(t) = 0.0650, 44 < t < 45
μ0(t) = 0.018797, 45 < t < 46
μ0(t) = 0.1140, 89 < t < 90

The Poisson regression model is μ(t) = exp(α + βt). The parameters α and β were estimated using R and the following results were obtained:

Table 6: Estimated parameters α and β

                 Home case   Away case   Total case
First half α     3.653472    3.069511    4.09877
First half β     0.00188     0.009956    0.005133
Second half α    3.312263    3.451026    4.291641
Second half β    0.007898    0.001473    0.005163
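
A log-linear Poisson fit of the form ln μ(t) = α + βt can be sketched without any statistics library. The code below is an assumption-laden illustration, not the R procedure used in the study: the function name is hypothetical, the fit uses Newton-Raphson on the Poisson log-likelihood, and the counts are synthetic values generated from the total case, first half row of the table, standing in for the per-minute goal counts that are not reproduced in this transcript.

```python
import math

def fit_poisson_loglinear(times, counts, iterations=25):
    """Fit ln(mu) = alpha + beta * t by Newton-Raphson on the Poisson log-likelihood."""
    alpha = math.log(sum(counts) / len(counts))   # start from a constant rate
    beta = 0.0
    for _ in range(iterations):
        mu = [math.exp(alpha + beta * t) for t in times]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(y - m for y, m in zip(counts, mu))
        g1 = sum(t * (y - m) for t, y, m in zip(times, counts, mu))
        # Information matrix (negative Hessian), a symmetric 2 x 2 matrix.
        h00 = sum(mu)
        h01 = sum(t * m for t, m in zip(times, mu))
        h11 = sum(t * t * m for t, m in zip(times, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: parameters += inverse(information) * score.
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta

# Synthetic counts for minutes 2 to 44, generated from alpha = 4.09877 and
# beta = 0.005133. These are not the original data.
times = list(range(2, 45))
counts = [math.exp(4.09877 + 0.005133 * t) for t in times]
alpha_hat, beta_hat = fit_poisson_loglinear(times, counts)
print(alpha_hat, beta_hat)
```

Because the synthetic counts lie exactly on the model curve, the fit should recover the generating parameters; with real counts the estimates would carry sampling error.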

To establish whether the Poisson regression model adheres to the data for the first and second half scores, a Chi-square goodness of fit test was carried out as shown below:

H0 (i): The data of goals (i) adheres to the Poisson regression model
H1 (i): The data of goals (i) does not adhere to the Poisson regression model
where i = home, away, total

           HOME                       AWAY                       TOTAL
           First half   Second half   First half   Second half   First half   Second half
p value    0.012314     1.13E-07      0.00035      5.90E-06      0.137095     0.384263

Discussion (Poisson Regression)

• According to the results of the test above, the p-values for both the first and second half for the total number of goals (0.137095 and 0.384263) are well above conventional significance levels. However, the p-values for the home and away goals are small, ranging from 0.012314 down to essentially zero.

• We hence fail to reject the hypothesis that, in the total case, the data adheres to the Poisson regression model. Nonetheless we reject the hypothesis H0 (i) where i = home goals or away goals.

• Given this fact, we may conclude that the Poisson regression is a good fit for the total number of goals. However, the Poisson regression is not a good fit for the home and away scores given the very low p values.

• Despite the low p-values for the home and away scores, the Poisson regression addresses the flaws in the Poisson Gamma model, and in the simple Poisson model by extension.

• The Poisson regression model does not share the flaws that we observed in the Poisson model. For this reason, we believe that it is a better fit for our data.

• Therefore, given that it avoids the limitations of both the Poisson and the Poisson Gamma models, we find that the Poisson regression model is the most suitable for capturing the scoring rate in football matches.

Limitations

• A major limitation of our project is that the Poisson regression model has been staggered to cover the different time intervals where we observed patterns.

• A single model that captures the entire 90 minutes of a football match would be preferable.

• Similarly, other factors that affect goal scoring were not incorporated in the study due to time and other constraints. These include team strengths, injured players, home advantage, refereeing decisions, coaching techniques and retaliatory factors, amongst others. Further research is needed here.

REFERENCES

• Baker, R. D., and McHale, I. G. (2015). Time varying ratings in association football: the all-time greatest team is… Journal of the Royal Statistical Society. Series A (Statistics in Society). Vol. 178 (2), 481-492. Retrieved 1st March 2016 from http://onlinelibrary.wiley.com/doi/10.1111/rssa.12060/full

• Karlis, D., and Ntzoufras, J. (2003). Bayesian and Non-Bayesian Analysis of Soccer Data Using Bivariate Poisson Regression Models. Retrieved 16th May 2016 from http://www.stat-athens.aueb.gr/~karlis/Bivariate%20Poisson%20Regression.pdf

• Maher, M. J. (1982). Modeling Association Football Scores. Statistica Neerlandica. Vol. 36 (3), 109-118. Retrieved 18th May 2016 from http://www.90minut.pl/misc/maher.pdf

• Pena, L. J. (2014). A Markovian Model for Association Football Possessions and Its Outcomes. Retrieved 1st March 2016 from http://arxiv.org/pdf/1403.7993v1.pdf

• Percy, D. F. (2015). Strategy Selection and Outcome Prediction in Sport Using Dynamic Learning for Stochastic Processes. Journal of the Operational Research Society. Vol. 66, 1840-1849. Retrieved 1st March 2016 from http://www.palgrave-journals.com/jors/journal/v66/n11/pdf/jors2014137a.pdf

• Pollard, R. (2008). Home Advantage in Football: A Current Review of the Unsolved Puzzle. The Open Sports Sciences Journal. Vol. 1, 12-24. Retrieved 18th May 2016 from http://benthamopen.com/contents/pdf/TOSSJ/TOSSJ-1-12.pdf

• The Economist. (2014). The World’s Game, Not England’s. Retrieved 16th May 2016 from http://www.economist.com/news/britain/21601540-premier-league-football-clubs-are-destroying-their-roots-they-grow-worlds-game-not