Modelling the distribution of first innings runs in T20 ...

Page 1: Modelling the distribution of first innings runs in T20 ...

Modelling the distribution of first innings runs in T20 Cricket

James Kirkby

The joy of smoothing

James Kirkby Modelling the distribution of first innings runs in T20 Cricket The joy of smoothing 1 / 22

Page 2: Modelling the distribution of first innings runs in T20 ...

Introduction

Cricket for the uninitiated

Figure: Muralitharan to Gilchrist


Page 3: Modelling the distribution of first innings runs in T20 ...

Introduction

Motivation

Why might we be interested in cricket data?

Because we love cricket?

Well, some of us do.

Because it’s not the Iris or the Old Faithful data

There is lots of cricket data. The discrete nature of the game means that large quantities of data are available. Statistics are already an important aspect of the game.

Gambling

Standing on the shoulders of giants. Working out the odds of dice and card games is what inspired the first interest in statistics and probability.

Page 10: Modelling the distribution of first innings runs in T20 ...

Data

Scope of the Data

There are a vast number of matches played worldwide each year for which data is publicly available. We are going to restrict attention to the following types of matches:

T20 cricket, i.e. 20 overs per team.

Only ‘Top Tier’ competitions: T20 internationals, English County T20s, IPL, Big Bash, South African T20.

We are going to be modelling the number of runs teams score in an innings, and so we further restrict to:

First innings (only data for the team that bats first).

Innings where the full allocation of overs was available, i.e. not weather affected.

These restrictions lead to a sample of 1138 matches.

Page 13: Modelling the distribution of first innings runs in T20 ...

Data

Data Description

We observe the progression of runs that a team scores through the innings. At the beginning of each over we have the following information:

The number of runs scored in the remainder of the innings.

The number of wickets down / number of batsmen remaining.

The number of overs / balls remaining.

We will focus on the run rate (runs per over) to ensure that results are comparable with different numbers of overs remaining.

Definition

We define the random variable $Y_{W,R}$ as the subsequent run rate a team achieves given that they are currently $W$ wickets down with $R$ overs remaining in the innings.
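As a minimal sketch of how one innings turns into observations of $Y_{W,R}$, the function below produces one (W, R, y) triple per over. The function name, argument names, and the four-over innings are made up for illustration; they are not taken from the talk's data set.

```python
def subsequent_run_rates(score_at_start, wickets_at_start, final_total):
    """Return one (W, R, y) triple per over of a completed innings.

    score_at_start[k] and wickets_at_start[k] describe the state at the
    beginning of over k + 1; y is the run rate over the rest of the innings.
    """
    total_overs = len(score_at_start)
    triples = []
    for k, (score, wickets) in enumerate(zip(score_at_start, wickets_at_start)):
        overs_remaining = total_overs - k
        runs_to_come = final_total - score
        triples.append((wickets, overs_remaining, runs_to_come / overs_remaining))
    return triples


# Hypothetical 4-over innings (a real T20 innings would have 20 entries).
example = subsequent_run_rates([0, 6, 14, 20], [0, 0, 1, 1], final_total=32)
```

Each innings therefore contributes several correlated observations, one for each over at which the state (W, R) is recorded.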

Page 16: Modelling the distribution of first innings runs in T20 ...

Data

Our Aim

We would like to estimate the distributions of the various $Y_{W,R}$ with the following requirements.

Avoid a full rank method: we don’t want to be storing the entire data set in order to evaluate probabilities.

We want to be able to easily evaluate the probabilities from the distribution.

We would like a set of consistent distributions, i.e. the probability of achieving any given run rate should be lower if a team has fewer wickets remaining.
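The consistency requirement can be read as an ordering of survival functions: for a fixed number of overs remaining, P(Y > y) should not increase as more wickets fall. The check below is a small sketch under that reading; the function name and the example probabilities are hypothetical.

```python
import numpy as np


def survival_is_ordered(survival, tol=1e-12):
    """survival[w, k] = P(Y > grid[k]) with w wickets down (same overs remaining).

    Returns True when no row lies above the row before it, i.e. the
    probability of beating any run rate never rises as wickets fall.
    """
    survival = np.asarray(survival, dtype=float)
    return bool(np.all(np.diff(survival, axis=0) <= tol))


# Made-up survival probabilities on a common grid of three run rates.
ordered = survival_is_ordered([[0.90, 0.50, 0.10],
                               [0.80, 0.40, 0.05]])
crossing = survival_is_ordered([[0.90, 0.50, 0.10],
                                [0.95, 0.40, 0.05]])
```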


Page 17: Modelling the distribution of first innings runs in T20 ...

Data

Observed Data Frequency


Page 18: Modelling the distribution of first innings runs in T20 ...

Data

Empirical Distribution


Page 19: Modelling the distribution of first innings runs in T20 ...

Data

Empirical Distribution


Page 20: Modelling the distribution of first innings runs in T20 ...

Model

Notation

We observe many realisations of each of the $Y_{W,R}$. We will refer to the $i$th realisation of $Y_{W,R}$, when $W = w$ and $R = r$, as $y_{w,r,i}$.

When it is clear from the context which $W$ and $R$ we are talking about, or if it doesn’t matter, we will drop the subscripts and use $Y$ and $y_i$.


Page 21: Modelling the distribution of first innings runs in T20 ...

Model

Distribution Assumption

We assume that $Y$ follows a ‘spline’ distribution, with pdf given by:

$$f(y) = \sum_{j=1}^{m} B_j(y)\,\alpha_j. \tag{1}$$

Sufficient conditions for a valid pdf are:

$$\alpha_j > 0 \quad\text{and}\quad \sum_{j=1}^{m} \alpha_j = 1. \tag{2}$$

We can remove the need for the first condition by re-parameterizing to:

$$f(y) = \sum_{j=1}^{m} B_j(y)\exp(a_j). \tag{3}$$
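A rough sketch of such a density is given below. It assumes cubic B-splines on equally spaced knots, each scaled by the knot spacing so that it integrates to one (the conditions in (2) rely on a normalisation of this kind). The knot range, the number of segments, and the helper names `bspline_basis` and `density_basis` are illustrative choices, not details from the talk.

```python
import numpy as np


def bspline_basis(y, knots, degree):
    """Evaluate all B-splines of the given degree at y (Cox-de Boor recursion)."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    t = np.asarray(knots, dtype=float)
    # Degree 0: indicators of the half-open knot intervals [t_j, t_{j+1}).
    basis = np.stack(
        [(t[j] <= y) & (y < t[j + 1]) for j in range(len(t) - 1)], axis=1
    ).astype(float)
    for k in range(1, degree + 1):
        raised = np.zeros((len(y), len(t) - k - 1))
        for j in range(raised.shape[1]):
            left = (y - t[j]) / (t[j + k] - t[j]) * basis[:, j]
            right = (t[j + k + 1] - y) / (t[j + k + 1] - t[j + 1]) * basis[:, j + 1]
            raised[:, j] = left + right
        basis = raised
    return basis


def density_basis(y, lo, hi, n_segments, degree=3):
    """B-splines on equally spaced knots, scaled so each integrates to one."""
    h = (hi - lo) / n_segments
    knots = lo + h * np.arange(-degree, n_segments + degree + 1)
    return bspline_basis(y, knots, degree) / h


# Assumed run-rate range of 0 to 18 runs per over, split into 9 segments.
lo, hi, n_segments, degree = 0.0, 18.0, 9, 3
m = n_segments + degree                  # number of basis functions
a = np.log(np.full(m, 1.0 / m))          # exp(a) sums to one, as (2) requires
h = (hi - lo) / n_segments
grid = np.linspace(lo - degree * h, hi + degree * h, 4801)
basis_on_grid = density_basis(grid, lo, hi, n_segments, degree)
f = basis_on_grid @ np.exp(a)            # equation (3) evaluated on the grid
area = f.sum() * (grid[1] - grid[0])     # Riemann sum over the full support
```

With the coefficients summing to one, `area` should come out close to one, which is a quick sanity check on the scaling of the basis.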


Page 22: Modelling the distribution of first innings runs in T20 ...

Model

Likelihood

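The log-likelihood for our data given the spline distribution is

$$\ell(a; y) = 1_n^T \log\left(B \exp(a)\right), \tag{4}$$

where

$$B = \begin{bmatrix} b_1(y_1) & \cdots & b_m(y_1) \\ \vdots & & \vdots \\ b_1(y_n) & \cdots & b_m(y_n) \end{bmatrix} \quad\text{and}\quad a = \begin{bmatrix} a_1 \\ \vdots \\ a_m \end{bmatrix}. \tag{5}$$

Here $\log$ and $\exp$ act elementwise, and $b_j(y_i)$ is the basis function $B_j$ of (1) evaluated at $y_i$.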
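In code, (4) is a single matrix-vector product followed by an elementwise log and a sum. The snippet below is a small illustration with a hand-made basis matrix; the numbers are invented so that the result can be checked by hand.

```python
import numpy as np


def log_likelihood(a, B):
    """Equation (4): sum over observations of log f(y_i), with f = B exp(a)."""
    return float(np.sum(np.log(B @ np.exp(a))))


# Tiny made-up basis matrix: n = 2 observations, m = 2 basis functions.
B = np.array([[1.0, 0.0],
              [0.5, 0.5]])
a = np.log(np.array([0.5, 0.5]))
value = log_likelihood(a, B)   # f = (0.5, 0.5), so value = 2 * log(0.5)
```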


Page 23: Modelling the distribution of first innings runs in T20 ...

Model

Estimation

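Estimation of the parameters can now proceed by finding the roots of the gradient of the Lagrangian:

$$L(a, \gamma) = 1^T \log\left(B \exp a\right) + \gamma\left(1_m^T \exp a - 1\right). \tag{6}$$

The gradient vectors are:

$$\frac{\partial L}{\partial a} = \left(\frac{1}{B \exp a}\right)^T \left(B\,\mathrm{diag}(\exp a)\right) + \gamma \exp a = \left(\frac{1}{B\alpha}\right)^T \left(B\,\mathrm{diag}(\alpha)\right) + \gamma\alpha \tag{7}$$

and

$$\frac{\partial L}{\partial \gamma} = \left(1_m^T \exp a - 1\right) = \left(1_m^T \alpha - 1\right), \tag{8}$$

where $\alpha = \exp a$ and the reciprocals are taken elementwise.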
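A quick way to gain confidence in (7) and (8) is to compare them with central finite differences of (6). The sketch below does this for a made-up positive basis matrix; the function names, the random seed, and the values of `a` and `gamma` are arbitrary choices for illustration.

```python
import numpy as np


def lagrangian(a, gamma, B):
    """Equation (6) for a single density."""
    alpha = np.exp(a)
    return float(np.sum(np.log(B @ alpha)) + gamma * (alpha.sum() - 1.0))


def gradient(a, gamma, B):
    """Equations (7) and (8), returned as (dL/da, dL/dgamma)."""
    alpha = np.exp(a)
    f = B @ alpha
    grad_a = (B * alpha).T @ (1.0 / f) + gamma * alpha
    grad_gamma = alpha.sum() - 1.0
    return grad_a, grad_gamma


def numerical_gradient(a, gamma, B, eps=1e-6):
    """Central finite differences of the Lagrangian with respect to a."""
    grad = np.zeros_like(a)
    for j in range(a.size):
        step = np.zeros_like(a)
        step[j] = eps
        upper = lagrangian(a + step, gamma, B)
        lower = lagrangian(a - step, gamma, B)
        grad[j] = (upper - lower) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
B = rng.uniform(0.1, 1.0, size=(6, 3))   # made-up positive basis values
a = np.array([-1.2, -0.9, -1.5])
gamma = -6.0
grad_a, grad_gamma = gradient(a, gamma, B)
max_error = float(np.max(np.abs(grad_a - numerical_gradient(a, gamma, B))))
```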

Page 25: Modelling the distribution of first innings runs in T20 ...

Model

Estimation

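The Hessian of our objective function is

$$H_{a,\gamma} L = \begin{bmatrix} \mathrm{diag}\left(\dfrac{\partial L}{\partial a}\right) - V^T U^{-1} V & \exp a \\ (\exp a)^T & 0 \end{bmatrix}, \tag{9}$$

where $U = \mathrm{diag}(B \exp a)^2$ and $V = B\,\mathrm{diag}(\exp a)$ (the top-left block follows from differentiating (7) with respect to $a$).

This can be combined with expressions (7) and (8) to find the maximum likelihood estimate of the coefficients, $a$, using Newton-Raphson.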
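The iteration can be sketched as follows. This is an undamped Newton-Raphson on the pair (a, γ), with the top-left Hessian block taken as diag(∂L/∂a) − VᵀU⁻¹V, obtained by differentiating (7) once more. The starting values (equal weights, γ = −n) and the stopping rule are assumptions of the sketch; with no line search it may fail from a poor start. To make the answer checkable by hand, the example uses a made-up piecewise-constant basis, for which the maximum likelihood weights are simply the bin proportions.

```python
import numpy as np


def newton_raphson(B, max_iter=50, tol=1e-10):
    """Solve dL/da = 0 and dL/dgamma = 0 for one density by Newton-Raphson."""
    n, m = B.shape
    a = np.full(m, -np.log(m))   # start from equal weights
    gamma = -float(n)            # summing (7) over j gives gamma = -n at a solution
    for _ in range(max_iter):
        alpha = np.exp(a)
        f = B @ alpha
        V = B * alpha                              # B diag(exp a)
        grad_a = V.T @ (1.0 / f) + gamma * alpha   # equation (7)
        grad_gamma = alpha.sum() - 1.0             # equation (8)
        W = V / f[:, None]                         # W.T @ W equals V^T U^{-1} V
        H = np.zeros((m + 1, m + 1))
        H[:m, :m] = np.diag(grad_a) - W.T @ W
        H[:m, m] = alpha
        H[m, :m] = alpha
        step = np.linalg.solve(H, -np.append(grad_a, grad_gamma))
        a = a + step[:m]
        gamma = gamma + step[m]
        if np.max(np.abs(step)) < tol:
            break
    return a, gamma


# Made-up run rates and a piecewise-constant basis on three bins.
edges = np.array([0.0, 6.0, 9.0, 18.0])
y = np.array([4.5, 5.0, 7.0, 7.5, 8.0, 8.5, 8.8, 10.0, 11.0, 12.5])
bins = np.searchsorted(edges, y, side="right") - 1
B = np.zeros((y.size, 3))
B[np.arange(y.size), bins] = 1.0 / np.diff(edges)[bins]

a_hat, gamma_hat = newton_raphson(B)
alpha_hat = np.exp(a_hat)   # should match the bin proportions 2/10, 5/10, 3/10
```

A smoother B-spline basis can be passed in place of the indicator basis, but convergence from this starting point is then not guaranteed and some step-size control may be needed.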


Page 26: Modelling the distribution of first innings runs in T20 ...

Model

Result


Page 27: Modelling the distribution of first innings runs in T20 ...

Model

Further Smoothing

We would like to impose some smoothness on the distributions, so that when the number of wickets remaining and the number of overs remaining are similar we have a similar distribution.

We can achieve this by imposing a difference penalty on the parameters of the neighbouring distributions.

In order to be able to add the penalty we first need to be able to estimate the parameters jointly, which requires that we make a couple of tweaks to our basis and likelihood.

Page 29: Modelling the distribution of first innings runs in T20 ...

Model

Multi-density Basis

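In order to model the distributions jointly, you would naively define the basis as:

$$B = \begin{bmatrix}
B_{W=0,R=20} & 0 & 0 & \cdots & 0 & 0 \\
0 & B_{W=1,R=20} & 0 & \cdots & 0 & 0 \\
0 & 0 & B_{W=2,R=20} & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & B_{W=8,R=1} & 0 \\
0 & 0 & 0 & \cdots & 0 & B_{W=9,R=1}
\end{bmatrix}.$$

Part of this basis does not support any data! For example, no wickets can have fallen with all 20 overs remaining, so the blocks $B_{W=1,R=20}$ and $B_{W=2,R=20}$ have no observations.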


Page 30: Modelling the distribution of first innings runs in T20 ...

Model

Multi-density Basis

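So after removing columns from the basis which support no observations, we have something like:

$$\tilde{B} = \begin{bmatrix}
B_{W=0,R=20} & 0 & 0 & \cdots & 0 & 0 \\
0 & B_{W=0,R=19} & 0 & \cdots & 0 & 0 \\
0 & 0 & B_{W=1,R=19} & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & B_{W=8,R=1} & 0 \\
0 & 0 & 0 & \cdots & 0 & B_{W=9,R=1}
\end{bmatrix}.$$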
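The two steps above (stack the per-state bases block-diagonally, then drop the columns with no support) can be sketched as follows. The per-state matrices are made up, with the impossible state (W = 1, R = 20) represented by a basis with zero rows; the helper `block_diag` is written out here so that only NumPy is needed.

```python
import numpy as np


def block_diag(blocks):
    """Place the given 2-D arrays along the diagonal of a zero matrix."""
    n_rows = sum(b.shape[0] for b in blocks)
    n_cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((n_rows, n_cols))
    row = col = 0
    for b in blocks:
        out[row:row + b.shape[0], col:col + b.shape[1]] = b
        row += b.shape[0]
        col += b.shape[1]
    return out


m = 2   # basis functions per density (tiny, for illustration only)
per_state_basis = {
    (0, 20): np.array([[0.5, 0.5], [1.0, 0.0]]),   # two observations
    (1, 20): np.zeros((0, m)),                     # state never observed
    (0, 19): np.array([[0.2, 0.8]]),               # one observation
}

naive = block_diag(list(per_state_basis.values()))
supported = naive.sum(axis=0) > 0     # columns touched by at least one observation
chopped = naive[:, supported]         # basis with unsupported columns removed
```

The boolean mask `supported` is worth keeping, since the same columns have to be removed from any other matrix that multiplies the coefficient vector.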


Page 31: Modelling the distribution of first innings runs in T20 ...

Model

Multi-density Basis

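We also need to define a summing matrix to enforce the constraints in the Lagrangian:

$$N = \begin{bmatrix}
1_m^T & 0 & 0 & \cdots & 0 & 0 \\
0 & 1_m^T & 0 & \cdots & 0 & 0 \\
0 & 0 & 1_m^T & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1_m^T & 0 \\
0 & 0 & 0 & \cdots & 0 & 1_m^T
\end{bmatrix},$$

where each block is the row vector $1_m^T$, so that $N \exp a$ returns one sum per density.

Clearly we will need to chop $N$ down in the same way as the basis (removing the same unsupported columns), and we will refer to the result as $\tilde{N}$.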
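One way to build such a summing matrix is as a Kronecker product of an identity matrix with a row of ones. The sizes, the coefficient values, and the `supported` mask below are invented for illustration; dropping the all-zero rows after removing columns is an assumption about how the chopped-down version is formed.

```python
import numpy as np

n_states, m = 3, 2   # made-up sizes: three densities, two coefficients each

# Block diagonal of 1_m^T rows: row k sums the coefficients of density k.
N = np.kron(np.eye(n_states), np.ones((1, m)))

alpha = np.array([0.3, 0.7, 0.5, 0.5, 0.9, 0.1])
sums = N @ alpha     # one total per density; all ones when each is a valid pdf

# Remove the columns of an unsupported density (here the middle one), then
# drop any constraint row that no longer involves any coefficient.
supported = np.array([True, True, False, False, True, True])
N_chopped = N[:, supported]
N_chopped = N_chopped[N_chopped.sum(axis=1) > 0]
```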


Page 32: Modelling the distribution of first innings runs in T20 ...

Model

Bring on the smoothing

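Our unpenalised target function becomes

$$L(a, \gamma) = 1^T \log\left(\tilde{B} \exp a\right) + \gamma^T\left(\tilde{N} \exp a - 1\right), \tag{10}$$

where $\tilde{B}$ and $\tilde{N}$ are the joint basis and summing matrix after the unsupported columns have been removed.

We can then simply add a difference penalty to impose smoothness across our distributions:

$$L_P(a, \gamma) = L(a, \gamma) - \lambda \exp(a)^T \tilde{D}^T \tilde{D} \exp(a), \tag{11}$$

where $\tilde{D}$ is a matrix that has been chopped down from some difference matrix $D$. For our example, we will use

$$D = D_W \otimes D_R \otimes I_m.$$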
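The Kronecker structure of the difference matrix can be sketched as below. The sketch assumes first-order differences, a coefficient vector ordered with wickets outermost, overs next, and basis index innermost (to match the order of the Kronecker factors), and small made-up sizes. The chopping-down step is left out, so `D` here is the full matrix.

```python
import numpy as np


def first_difference(size):
    """(size - 1) x size matrix whose rows take differences of neighbours."""
    return np.diff(np.eye(size), axis=0)


def penalty(a, D, lam):
    """The penalty term of (11): lam * exp(a)^T D^T D exp(a)."""
    d = D @ np.exp(a)
    return lam * float(d @ d)


n_w, n_r, m = 3, 4, 2   # made-up numbers of wicket states, over states, coefficients
D = np.kron(np.kron(first_difference(n_w), first_difference(n_r)), np.eye(m))

# Identical densities in every state incur no penalty...
same = np.tile(np.log([0.4, 0.6]), n_w * n_r)
# ...whereas unrelated random densities do.
rng = np.random.default_rng(1)
rough = np.log(rng.dirichlet(np.ones(m), size=n_w * n_r)).ravel()

penalty_same = penalty(same, D, lam=1.0)
penalty_rough = penalty(rough, D, lam=1.0)
```

Because the factors are multiplied rather than added, this matrix penalises mixed differences across wickets and overs; coefficients that vary along only one of the two directions are left unpenalised.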


Page 33: Modelling the distribution of first innings runs in T20 ...

Model

Result


Page 34: Modelling the distribution of first innings runs in T20 ...

Model

Further Work

Would be good to take account of the repeated measurements in the data.

Find a way to introduce a parametric component into the model.

Performance improvements - Woodbury Matrix Identity / Schur Complement

Alternative penalty structure - add a penalty to ensure the CDFs do not cross.
