THE POISSON & NEGATIVE BINOMIAL MODELS By: ALVARD AYRAPETYAN.


OUTLINE OF PRESENTATION

Poisson Regression
• Model Assumptions, Assessment, and Interpretations
• Applications in SAS and R
• Quick Programming in SPSS and MINITAB

Negative Binomial
• Model Assumptions, Assessment, and Interpretations
• Applications in SAS and R
• Quick Programming in SPSS


ASSUMPTIONS FOR POISSON MODEL

• Events are counted over a fixed period of time
• Events occur at a constant rate
• Events must be independent
• Dependent variable’s conditional mean and variance must be equal
• Dependent variable must be a non-negative integer


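THE POISSON MODEL

Random Component: Poisson distribution for Y, the # of lead changes

Mass Function:

$$P(Y = y \mid X_1, X_2, X_3) = \frac{e^{-\mu(X)}\,\mu(X)^{y}}{y!}, \qquad y = 0, 1, 2, \ldots$$

with E(Y) = µ & V(Y) = µ

Systematic Component:

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$$

Link Function: g(µ) = log(µ), so that

$$g(\mu(X)) = \log(\mu(X)) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3, \qquad \mu(X) = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3}$$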
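The equal mean and variance property of the Poisson distribution can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the mean value 4.0 and the truncation point 200 are arbitrary choices made only for illustration.

```python
import math

def poisson_pmf(y: int, mu: float) -> float:
    """P(Y = y) for a Poisson variable with mean mu, computed on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

mu = 4.0                 # hypothetical mean, chosen only for illustration
support = range(0, 200)  # truncation point; the tail beyond it is negligible here

probs = [poisson_pmf(y, mu) for y in support]
total = sum(probs)
mean = sum(y * p for y, p in zip(support, probs))
var = sum((y - mean) ** 2 * p for y, p in zip(support, probs))

print(f"sum of probabilities = {total:.6f}")
print(f"mean = {mean:.6f}, variance = {var:.6f}")
```

Both the mean and the variance come out equal to mu, which is the property that overdispersed data violate.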


EXAMPLES OF POISSON DISTRIBUTION

• Number of earthquakes in a region

• Number of accidents on a highway in a certain area in a specified time

• Number of telephone calls received in one hour

• Number of customers that enter a bank in one hour

• Number of times an elderly person will fall in a month


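INTERPRETING COEFFICIENTS

CONTINUOUS PREDICTOR
• Keeping all other predictors constant, when $x_1$ is increased by one unit, the expected value of Y increases/decreases (+/-) by $(\exp(\hat{\beta}_1) - 1) \times 100\%$
• Keeping all other predictors constant, when $x_1$ is increased by one unit, the expected log count of Y goes up/down (+/-) by $\hat{\beta}_1$

CATEGORICAL PREDICTOR
• Keeping all other predictors constant, when $x_1 = 1$, the expected value of Y is $\exp(\hat{\beta}_1) \times 100\%$ of its value when $x_1 = 0$
• Keeping all other predictors constant, when $x_1 = 1$, the expected log count of Y is higher/lower (+/-) by $\hat{\beta}_1$ than when $x_1 = 0$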
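The percent-change reading of a coefficient follows directly from the log link. A short derivation, written here with $\mu(x_1)$ as shorthand for the expected count at a given value of $x_1$ with the other predictors held fixed (this notation is an assumption of the sketch, not taken from the slides):

```latex
\begin{align*}
\log \mu(x_1 + 1) - \log \mu(x_1)
  &= \bigl(\beta_0 + \beta_1 (x_1 + 1) + \cdots\bigr) - \bigl(\beta_0 + \beta_1 x_1 + \cdots\bigr) = \beta_1 \\
\frac{\mu(x_1 + 1)}{\mu(x_1)} &= e^{\beta_1} \\
\frac{\mu(x_1 + 1) - \mu(x_1)}{\mu(x_1)} \times 100\% &= \bigl(e^{\beta_1} - 1\bigr) \times 100\%
\end{align*}
```

So $\beta_1$ is an additive change on the log scale and a multiplicative change of $e^{\beta_1}$ on the count scale.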


POTENTIAL PROBLEM WITH POISSON

• OVERDISPERSION: the variance is much larger than the mean

• Negative Binomial is the solution!


THE DATA

• Trying to predict the number of field goal attempts (FGA) in the NBA
• Extracted the top 100 highest scoring players in the NBA for the 2013-2014 season
• The following were used as predictors:
  • Number of games played (GP)
  • Number of defensive rebounds (DREB)
  • Number of assists (AST)
  • Number of steals (STL)
  • Number of blocks (BLK)
  • Number of turnovers (TOV)
  • Number of free throws made (FTM)


SAMPLE OF THE DATA

Rank  Player                    GP  FGA  DREB  AST  STL  FTM  TOV
1     Kevin Love (MIN)          15  268   146   68   13   95   41
2     Kevin Durant (OKC)        12  209    72   62   17  131   45
3     Monta Ellis (DAL)         14  235    42   76   22   85   55
4     Blake Griffin (LAC)       15  242   129   47   19   59   40
5     LeBron James (MIA)        13  201    67   88   12   71   49
6     Evan Turner (PHI)         15  272    85   53   15   71   56
7     Kevin Martin (MIN)        14  248    48   33   18   71   20
8     Paul George (IND)         13  231    72   41   23   70   33
9     LaMarcus Aldridge (POR)   14  285   105   35   19   54   34
10    Carmelo Anthony (NYK)     12  264    79   33   15   72   36
11    Kyrie Irving (CLE)        14  268    40   89   14   55   47
12    Klay Thompson (GSW)       14  212    30   22   12   30   20
13    Dirk Nowitzki (DAL)       14  206    82   33   16   74   25
14    James Harden (HOU)        12  195    45   65   21   91   52
15    Chris Paul (LAC)          15  208    65  188   36   81   44
16    Arron Afflalo (ORL)       13  197    62   61   10   56   33
17    Damian Lillard (POR)      14  225    54   85   11   64   31
18    DeMarcus Cousins (SAC)    13  230   103   31   22   65   36


POISSON-EXAMPLE WITH SAS

proc genmod data = nba;

model FGA= GP DREB AST STL TOV FTM /dist=poisson;

run;

/*check goodness of fit for model*/

data pvalue;

df = 93; chisq = 511.6210;

pvalue = 1 - probchi(chisq, df);

run;

proc print data = pvalue noobs;

run;

/* pvalue is < .0001 (significant), so the goodness-of-fit test rejects the Poisson model; Deviance/DF = 5.5013 >> 1, major overdispersion */


EXAMPLE RESULTS-GOODNESS OF FIT

The GENMOD Procedure

Model Information
Data Set                        WORK.NBA
Distribution                    Poisson
Link Function                   Log
Dependent Variable              FGA

Number of Observations Read     100
Number of Observations Used     100

Criteria For Assessing Goodness Of Fit
Criterion                   DF    Value         Value/DF
Deviance                    93    511.6210      5.5013
Scaled Deviance             93    511.6210      5.5013
Pearson Chi-Square          93    518.3345      5.5735
Scaled Pearson X2           93    518.3345      5.5735
Log Likelihood                    72301.7048
Full Log Likelihood               -604.2412
AIC (smaller is better)           1222.4824
AICC (smaller is better)          1223.6998
BIC (smaller is better)           1240.7186


RESULTS: Analysis of Maximum Likelihood Parameter Estimates

Parameter  DF  Estimate  Standard Error  Wald 95% Confidence Limits  Wald Chi-Square  Pr > ChiSq
Intercept  1    4.1864   0.0749          (4.0396, 4.3332)            3125.02          <.0001
GP         1    0.0422   0.0057          (0.0310, 0.0534)              54.93          <.0001
DREB       1    0.0004   0.0003          (-0.0002, 0.0010)              1.55          0.2131
AST        1   -0.0002   0.0003          (-0.0008, 0.0005)              0.28          0.5995
STL        1    0.0028   0.0012          (0.0004, 0.0052)               5.17          0.0230
TOV        1    0.0057   0.0010          (0.0038, 0.0077)              33.53          <.0001
FTM        1    0.0040   0.0004          (0.0032, 0.0048)              98.23          <.0001
Scale      0    1.000    0               (1.0, 1.0)
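Because the link is the log, a fitted count is obtained by exponentiating the linear predictor. As an illustration only, the Python sketch below plugs the rounded Poisson estimates from the table into that formula for the first row of the sample data (Kevin Love). The function name `expected_count` is hypothetical, and the rounded coefficients mean the result will differ slightly from what SAS or R would report.

```python
import math

# Rounded Poisson estimates copied from the parameter estimates table
coef = {
    "Intercept": 4.1864, "GP": 0.0422, "DREB": 0.0004, "AST": -0.0002,
    "STL": 0.0028, "TOV": 0.0057, "FTM": 0.0040,
}

# Predictor values for the first row of the sample data (Kevin Love, observed FGA = 268)
player = {"GP": 15, "DREB": 146, "AST": 68, "STL": 13, "TOV": 41, "FTM": 95}

def expected_count(coef: dict, x: dict) -> float:
    """Inverse log link: exp(intercept + sum of coefficient * predictor)."""
    eta = coef["Intercept"] + sum(coef[name] * value for name, value in x.items())
    return math.exp(eta)

mu_hat = expected_count(coef, player)
print(f"linear predictor = {math.log(mu_hat):.4f}, fitted FGA = {mu_hat:.1f}")
```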


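ASSESSMENT OF RESULTS

• Ratio of Deviance/DF = 5.5013 >> 1, which indicates major overdispersion
• Deviance = 511.6210: the model does not fit well, because pvalue = 1 - probchi(chisq, df) is < .0001 and therefore IS significant (the goodness-of-fit test rejects the Poisson model)
• Every term is significant except for AST and DREB
• False results are possible if the model is inaccurate
• Must perform a NEGATIVE BINOMIAL regression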
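The deviance goodness-of-fit check can be reproduced outside SAS. The sketch below is a rough Python illustration that uses the Wilson-Hilferty normal approximation to the chi-square upper tail, which is an approximation chosen here to stay within the standard library and is not what SAS's probchi computes internally. The function name `chi2_upper_tail` is hypothetical.

```python
import math

def chi2_upper_tail(chisq: float, df: int) -> float:
    """Approximate P(X >= chisq) for X ~ chi-square(df) (Wilson-Hilferty approximation)."""
    c = 2.0 / (9.0 * df)
    z = ((chisq / df) ** (1.0 / 3.0) - (1.0 - c)) / math.sqrt(c)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the standard normal

deviance, df = 511.6210, 93  # Poisson deviance and degrees of freedom from the output
ratio = deviance / df
p_value = chi2_upper_tail(deviance, df)

print(f"Deviance/DF = {ratio:.4f}")
print(f"approximate p-value = {p_value:.3g}")
```

A p-value this close to zero means the observed deviance is far larger than a well-fitting Poisson model would produce. The same function can be reused for any other deviance and degrees of freedom.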


POISSON-EXAMPLE WITH R

nba <- read.csv("F:/STATS544/nba.csv",header=TRUE)

poiss<-glm(FGA ~GP+DREB+AST+STL+TOV+FTM, family = "poisson", data = nba)

summary(poiss)


R-GOODNESS OF FIT

Deviance Residuals:

Min 1Q Median 3Q Max

-5.5397 -1.2614 -0.1643 1.2650 6.2786

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 926.60 on 99 degrees of freedom

Residual deviance: 511.62 on 93 degrees of freedom

AIC: 1222.5

R-ANALYSIS OF PARAMETER ESTIMATES

Call:

glm(formula = FGA ~ GP + DREB + AST + STL + TOV + FTM, family = "poisson",

data = nba)

Coefficients:

             Estimate     Std. Error  z value  Pr(>|z|)
(Intercept)   4.1864100   0.0748885   55.902   < 2e-16   ***
GP            0.0422013   0.0056940    7.411   1.25e-13  ***
DREB          0.0003719   0.0002987    1.245   0.213
AST          -0.0001778   0.0003387   -0.525   0.600
STL           0.0027777   0.0012221    2.273   0.023     *
TOV           0.0057220   0.0009882    5.790   7.02e-09  ***
FTM           0.0040405   0.0004077    9.911   < 2e-16   ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


POISSON WITH SPSS & MINITAB

SPSS

genlin FGA with GP DREB AST STL TOV FTM

/model GP DREB AST STL TOV FTM INTERCEPT=YES distribution = poisson link = log

/print FIT SUMMARY SOLUTION.

MINITAB

Stat > Regression > Poisson Regression > Fit Poisson Model.

Detecting over-dispersion with SAS

Poisson regression gives a ratio between DEVIANCE and DF >1.

proc genmod data = nba;

model FGA= GP DREB AST STL TOV FTM /dist=poisson;

run;

PROC MEANS: the variance of FGA (Y) is much higher than its mean

proc means data = nba n mean var min max;

var FGA;

run;

Detecting over-dispersion with R

Poisson regression gives a ratio between RESIDUAL DEVIANCE and DF >1

poiss<-glm(FGA ~GP+DREB+AST+STL+TOV+FTM, family = "poisson", data = nba)

summary(poiss)

The variance of FGA is much higher than its mean:

mean(nba$FGA)
[1] 173.47

var(nba$FGA)
[1] 1684.858
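The same mean-versus-variance comparison can be sketched in Python. The example below uses only the 18 FGA values shown in the sample table, so its numbers will not match the full 100-player results above. It also computes the ratio from the reported full-data mean and variance. Note that this is a marginal comparison and only a rough screen; the model assumption concerns the variance conditional on the predictors.

```python
import statistics

# FGA values for the 18 players listed in the sample of the data (not the full data set)
fga = [268, 209, 235, 242, 201, 272, 248, 231, 285,
       264, 268, 212, 206, 195, 208, 197, 225, 230]

sample_mean = statistics.mean(fga)
sample_var = statistics.variance(fga)  # sample variance (n - 1 denominator), like R's var()
sample_ratio = sample_var / sample_mean

# Ratio from the full-data mean and variance reported in the R output
reported_ratio = 1684.858 / 173.47

print(f"18-row sample: mean = {sample_mean:.2f}, variance = {sample_var:.2f}, ratio = {sample_ratio:.2f}")
print(f"full data (reported): variance / mean = {reported_ratio:.2f}")
```

A ratio well above 1 in either case is a warning sign of overdispersion.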


NEGATIVE BINOMIAL REGRESSION

Generalization of Poisson regression

Used for over-dispersed count data

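PMF:

$$P(Y = y \mid X_1, X_2, X_3, k) = \frac{\Gamma(y + 1/k)}{\Gamma(1/k)\, y!} \left(\frac{1}{1 + k m}\right)^{1/k} \left(\frac{k m}{1 + k m}\right)^{y}, \qquad y = 0, 1, 2, \ldots$$

• E(Y) = m, V(Y) = m + k·m², where k = dispersion parameter
• As k → 0, V(Y) → m: NB approaches Poisson and V(Y) = E(Y) = m
• Link Function same as Poisson: g(m) = log(m)
• Equation: log(λ(X)) = β0 + β1X1 + β2X2 + … + βp-1Xp-1, where λ(X) is the mean m at predictor values X
• Goodness of fit test: same as Poisson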
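To make the role of the dispersion parameter concrete, here is a minimal Python sketch of a negative binomial mass function written in the mean and dispersion form, where V(Y) = m + k·m². The parameterization, the values m = 5 and k = 0.5, and the truncation points are assumptions chosen for illustration. The sketch checks the mean and variance numerically and shows the mass function approaching the Poisson one as k gets small.

```python
import math

def nb_pmf(y: int, m: float, k: float) -> float:
    """Negative binomial P(Y = y) with mean m and dispersion k (variance m + k*m^2)."""
    r = 1.0 / k
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             - r * math.log1p(k * m)                          # (1 / (1 + k m))^(1/k)
             + y * (math.log(k * m) - math.log1p(k * m)))     # (k m / (1 + k m))^y
    return math.exp(log_p)

def poisson_pmf(y: int, m: float) -> float:
    """Poisson P(Y = y) with mean m."""
    return math.exp(-m + y * math.log(m) - math.lgamma(y + 1))

m, k = 5.0, 0.5           # hypothetical mean and dispersion
support = range(0, 400)   # truncation point; the tail beyond it is negligible here

probs = [nb_pmf(y, m, k) for y in support]
total = sum(probs)
mean = sum(y * p for y, p in zip(support, probs))
var = sum((y - mean) ** 2 * p for y, p in zip(support, probs))

# Largest pointwise gap between NB with a tiny dispersion and the Poisson pmf
gap = max(abs(nb_pmf(y, m, 1e-4) - poisson_pmf(y, m)) for y in range(0, 30))

print(f"sum = {total:.6f}, mean = {mean:.4f}, variance = {var:.4f} (m + k*m^2 = {m + k * m * m})")
print(f"max |NB - Poisson| at k = 1e-4: {gap:.2e}")
```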


NEGATIVE BINOMIAL-EXAMPLE WITH SAS

proc genmod data = nba;

model FGA= GP DREB AST STL TOV FTM /dist=negbin; /* ONLY DIFFERENCE FROM POISSON */

run;

/*check goodness of fit for model*/

data pvalue;

df = 93; chisq = 99.3405;

pvalue = 1 - probchi(chisq, df);

run;

proc print data = pvalue noobs;

run;


EXAMPLE RESULTS-GOODNESS OF FIT

Data Set                        WORK.NBA
Distribution                    Negative Binomial
Link Function                   Log
Dependent Variable              FGA

Number of Observations Read     100
Number of Observations Used     100

Criteria For Assessing Goodness Of Fit
Criterion                   DF    Value         Value/DF
Deviance                    93    99.3405       1.0682
Scaled Deviance             93    99.3405       1.0682
Pearson Chi-Square          93    100.7383      1.0832
Scaled Pearson X2           93    100.7383      1.0832
Log Likelihood                    72428.1189
Full Log Likelihood               -477.8271
AIC (smaller is better)           971.6543
AICC (smaller is better)          973.2367
BIC (smaller is better)           992.4957
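The AIC values in the two SAS outputs can be recovered from the Full Log Likelihood lines using AIC = -2·logL + 2·p. The Python sketch below assumes p = 7 for the Poisson model (intercept plus six slopes) and p = 8 for the negative binomial model (the same seven plus the dispersion parameter); the parameter counts are inferred from the model statements rather than stated in the output.

```python
def aic(full_log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * full_log_likelihood + 2.0 * n_params

# Full Log Likelihood values copied from the Poisson and negative binomial outputs
poisson_aic = aic(-604.2412, n_params=7)  # intercept + 6 predictors
negbin_aic = aic(-477.8271, n_params=8)   # intercept + 6 predictors + dispersion

print(f"Poisson AIC           = {poisson_aic:.4f}")
print(f"Negative binomial AIC = {negbin_aic:.4f}")
print(f"difference            = {poisson_aic - negbin_aic:.4f}")
```

The negative binomial model has the smaller AIC even after paying for its extra parameter.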



Parameter   DF  Estimate  Standard Error  Wald 95% Confidence Limits  Wald Chi-Square  Pr > ChiSq
Intercept   1    4.1742   0.1641          (3.8525, 4.4958)            647.01           <.0001
GP          1    0.0426   0.0125          (0.0181, 0.0671)             11.62           0.0007
DREB        1    0.0003   0.0007          (-0.0011, 0.0016)             0.15           0.7028
AST         1   -0.0001   0.0008          (-0.0017, 0.0014)             0.03           0.8619
STL         1    0.0024   0.0027          (-0.0029, 0.0077)             0.78           0.3756
TOV         1    0.0060   0.0023          (0.0015, 0.0105)              6.95           0.0084
FTM         1    0.0042   0.0010          (0.0023, 0.0061)             19.32           <.0001
Dispersion  1    0.0230   0.0040          (0.0163, 0.0325)


ASSESSMENT OF RESULTS

• Ratio of Deviance/DF = 1.0682 ≈ 1 (over-dispersion fixed!)
• Deviance = 99.3405: the model now fits well, because pvalue = 1 - probchi(chisq, df) is NOT significant (the goodness-of-fit test no longer rejects the model)
• Extra parameter in the “Analysis of Maximum Likelihood Parameter Estimates” called “Dispersion” (aka ALPHA)
• Accounts for the over-dispersion we came across in the Poisson regression
• This estimate has a value of .0230 with a Wald 95% confidence interval of (.0163, .0325). Because the interval excludes 0, we can say that dispersion is significantly different from 0, which justifies the negative binomial model as more appropriate
• GP, TOV, & FTM are the only significant predictors


NEGATIVE BINOMIAL-EXAMPLE WITH R

nba <- read.csv("F:/STATS544/nba.csv",header=TRUE)

install.packages('MASS')

library(MASS)

nb<-glm.nb(FGA ~GP+DREB+AST+STL+TOV+FTM, data = nba)

summary(nb)


EXAMPLE RESULTS-GOODNESS OF FIT

(Dispersion parameter for Negative Binomial(43.4291) family taken to be 1)

Null deviance: 182.54 on 99 degrees of freedom

Residual deviance: 99.34 on 93 degrees of freedom

AIC: 971.65

Number of Fisher Scoring iterations: 1

Deviance Residuals:

Min 1Q Median 3Q Max

-2.36322 -0.60467 -0.06083 0.55227 2.72053

Theta: 43.43

Std. Err.: 7.62

2 x log-likelihood: -955.654
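R and SAS report the extra negative binomial parameter on different scales: glm.nb reports Theta, with variance mu + mu²/theta, while SAS reports the dispersion, which corresponds to 1/theta. A small Python sketch of the conversion, using the Theta value from the R output and, purely for illustration, the overall FGA mean of 173.47 as the mean at which to evaluate the variance:

```python
theta = 43.4291            # Theta reported by glm.nb
dispersion = 1.0 / theta   # the scale on which SAS reports "Dispersion"

def nb_variance(mean: float, theta: float) -> float:
    """Negative binomial variance in the glm.nb parameterization: mu + mu^2 / theta."""
    return mean + mean ** 2 / theta

mean_fga = 173.47  # overall mean of FGA, used here only as an example value of mu
print(f"dispersion = 1/theta = {dispersion:.4f}")
print(f"NB variance at mu = {mean_fga}: {nb_variance(mean_fga, theta):.1f} (Poisson would give {mean_fga})")
```

The converted value agrees with the SAS dispersion estimate of 0.0230, so the two programs are fitting the same model.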



Call:

glm.nb(formula = FGA ~ GP + DREB + AST + STL + TOV + FTM, data = nba,

init.theta = 43.42912732, link = log)

Coefficients:

             Estimate     Std. Error  z value  Pr(>|z|)
(Intercept)   4.1741833   0.1626544   25.663   < 2e-16   ***
GP            0.0425988   0.0123895    3.438   0.000585  ***
DREB          0.0002619   0.0006835    0.383   0.701571
AST          -0.000139    0.0007904   -0.176   0.860433
STL           0.0023962   0.0027055    0.886   0.375794
TOV           0.0060360   0.0022760    2.652   0.008001  **
FTM           0.0042121   0.0009430    4.467   7.95e-06  ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
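The z value and p-value columns come from a Wald test: the estimate divided by its standard error, compared with a standard normal distribution. The Python sketch below reproduces the GP row from the negative binomial output; the function name `wald_test` is hypothetical.

```python
import math

def wald_test(estimate: float, std_error: float) -> tuple:
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / std_error
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * P(Z > |z|) for standard normal Z
    return z, p

# GP row from the negative binomial coefficient table
z_gp, p_gp = wald_test(0.0425988, 0.0123895)
print(f"GP: z = {z_gp:.3f}, p = {p_gp:.6f}")
```

Squaring the z statistic gives the Wald chi-square that SAS reports for the same parameter, up to rounding of the inputs.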


INTERPRETATION OF SIGNIFICANT COEFFICIENTS

GP: Holding all other variables constant, for every additional game played, the expected log number of field goal attempts goes up by 0.0426. Equivalently, for every additional game played, the expected number of field goal attempts increases by 4.35%.

TOV: Holding all other variables constant, for every additional turnover, the expected log number of field goal attempts goes up by 0.0060. Equivalently, for every additional turnover, the expected number of field goal attempts increases by 0.60%.

FTM: Holding all other variables constant, for every additional free throw made, the expected log number of field goal attempts goes up by 0.0042. Equivalently, for every additional free throw made, the expected number of field goal attempts increases by 0.42%.
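The percentages above come from the formula (exp(estimate) - 1) × 100%. A short Python sketch of that arithmetic, using the rounded negative binomial estimates for the three significant predictors (the function name `percent_change` is hypothetical):

```python
import math

def percent_change(estimate: float) -> float:
    """Percent change in the expected count for a one-unit increase in the predictor."""
    return (math.exp(estimate) - 1.0) * 100.0

# Rounded negative binomial estimates for the significant predictors
estimates = {"GP": 0.0426, "TOV": 0.0060, "FTM": 0.0042}
changes = {name: percent_change(b) for name, b in estimates.items()}

for name, pct in changes.items():
    print(f"{name}: {pct:.2f}% change in expected FGA per one-unit increase")
```

For coefficients this close to zero the percent change is nearly 100 times the coefficient, which is why 0.0060 maps to about 0.60%.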


NEGATIVE BINOMIAL WITH SPSS & MINITAB

SPSS

genlin FGA with GP DREB AST STL TOV FTM

/model GP DREB AST STL TOV FTM INTERCEPT=YES distribution = negbin(mle) link = log

/print FIT SUMMARY SOLUTION.

MINITAB

Not available


SUMMARY

Use Poisson regression when dealing with COUNT data

If there’s overdispersion, switch to negative binomial

Assumptions for Poisson and NB are the same, except that NB allows the variance to exceed the mean

Coefficients of both models are interpreted in the same manner

Can perform both regressions in SAS, R, & SPSS
