THE POISSON & NEGATIVE BINOMIAL MODELS By: ALVARD AYRAPETYAN.
OUTLINE OF PRESENTATION
Poisson Regression
• Model Assumptions, Assessment, and Interpretations
• Applications in SAS and R
• Quick Programming in SPSS and MINITAB
Negative Binomial
• Model Assumptions, Assessment, and Interpretations
• Applications in SAS and R
• Quick Programming in SPSS
ASSUMPTIONS FOR POISSON MODEL
• Events are counted over a fixed period of time
• Events occur at a constant rate
• Events must be independent
• Dependent variable's conditional mean and variance must be equal
• Dependent variable must be a non-negative integer (a count)
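THE POISSON MODEL
Random Component: Poisson Distribution for the # of lead changes
Mass Function:
$$P(Y = y \mid X_1, X_2, X_3) = \frac{e^{-\mu(X)}\,\mu(X)^{y}}{y!}, \qquad y = 0, 1, 2, \ldots$$
Mean and Variance: E(Y) = µ & V(Y) = µ
Systematic Component:
$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$$
Link Function: g(µ) = log(µ)
$$g(\mu) = \log(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 \quad\Longleftrightarrow\quad \mu(X) = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3}$$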
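The mass function and log link above can be checked numerically. The following is a minimal R sketch; the coefficient and predictor values are hypothetical and chosen only for illustration, not taken from any data in this presentation.

```r
# Hypothetical coefficients and one hypothetical observation (illustration only)
beta <- c(b0 = 0.5, b1 = 0.2, b2 = -0.1, b3 = 0.05)
x    <- c(1, 2.0, 1.0, 4.0)   # the leading 1 multiplies the intercept

# Log link: log(mu) equals the linear predictor, so mu = exp(linear predictor)
eta <- sum(beta * x)
mu  <- exp(eta)

# Poisson mass function written out by hand, compared with the built-in dpois()
y <- 0:10
pmf_manual <- exp(-mu) * mu^y / factorial(y)
pmf_dpois  <- dpois(y, lambda = mu)
all.equal(pmf_manual, pmf_dpois)

# Mean and variance computed from the mass function over a long grid; both equal mu
grid   <- 0:200
p      <- dpois(grid, lambda = mu)
mean_y <- sum(grid * p)
var_y  <- sum((grid - mean_y)^2 * p)
c(mu = mu, mean = mean_y, variance = var_y)
```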
EXAMPLES OF POISSON DISTRIBUTION
• Number of earthquakes in a region
• Number of accidents on a highway in a certain area in a specified time
• Number of telephone calls received in one hour
• Number of customers that enter a bank in one hour
• Number of times an elderly person will fall in a month
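INTERPRETING COEFFICIENTS
CONTINUOUS PREDICTOR
• Keeping all other predictors constant, when $x_1$ is increased by one unit, the expected value of Y increases/decreases (+/-) by $(\exp(\hat\beta_1) - 1) \times 100\%$
• Keeping all other predictors constant, when $x_1$ is increased by one unit, the log of the expected number of Y will go up/down (+/-) by $\hat\beta_1$
CATEGORICAL PREDICTOR
• Keeping all other predictors constant, when $x_1 = 1$, the expected value of Y is $\exp(\hat\beta_1) \times 100\%$ of its value in the reference category ($x_1 = 0$), i.e. a change of $(\exp(\hat\beta_1) - 1) \times 100\%$
• Keeping all other predictors constant, when $x_1 = 1$, the log of the expected number of Y is higher/lower (+/-) by $\hat\beta_1$ than in the reference category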
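The conversion from a coefficient on the log scale to a percent change can be sketched in R as follows. The function names `rate_ratio` and `pct_change` and the coefficient value are hypothetical, introduced only for this illustration.

```r
# Multiplicative factor on the expected count for a one-unit increase in a predictor
rate_ratio <- function(beta_hat) exp(beta_hat)

# The same effect expressed as a percent change in the expected count
pct_change <- function(beta_hat) (exp(beta_hat) - 1) * 100

b <- 0.05          # hypothetical coefficient
rate_ratio(b)      # factor applied to the expected count
pct_change(b)      # percent increase for a positive coefficient
pct_change(-b)     # percent decrease for a negative coefficient of the same size
```

Note that the increase and decrease are not symmetric in percent terms, because the effect is multiplicative rather than additive.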
POTENTIAL PROBLEM WITH POISSON
• OVERDISPERSION-the variance is much larger than the mean
• Negative Binomial is the solution!
THE DATA
• Trying to predict the number of field goal attempts (FGA) in the NBA
• Extracted the top 100 highest-scoring players in the NBA for the 2013-2014 season
• The following were used as predictors:
  • Number of games played (GP)
  • Number of defensive rebounds (DREB)
  • Number of assists (AST)
  • Number of steals (STL)
  • Number of blocks (BLK)
  • Number of turnovers (TOV)
  • Number of free throws made (FTM)
SAMPLE OF THE DATA
Rank Player GP FGA DREB AST STL FTM TOV
1 Kevin Love (MIN) 15 268 146 68 13 95 41
2 Kevin Durant (OKC) 12 209 72 62 17 131 45
3 Monta Ellis (DAL) 14 235 42 76 22 85 55
4 Blake Griffin (LAC) 15 242 129 47 19 59 40
5 LeBron James (MIA) 13 201 67 88 12 71 49
6 Evan Turner (PHI) 15 272 85 53 15 71 56
7 Kevin Martin (MIN) 14 248 48 33 18 71 20
8 Paul George (IND) 13 231 72 41 23 70 33
9 LaMarcus Aldridge (POR) 14 285 105 35 19 54 34
10 Carmelo Anthony (NYK) 12 264 79 33 15 72 36
11 Kyrie Irving (CLE) 14 268 40 89 14 55 47
12 Klay Thompson (GSW) 14 212 30 22 12 30 20
13 Dirk Nowitzki (DAL) 14 206 82 33 16 74 25
14 James Harden (HOU) 12 195 45 65 21 91 52
15 Chris Paul (LAC) 15 208 65 188 36 81 44
16 Arron Afflalo (ORL) 13 197 62 61 10 56 33
17 Damian Lillard (POR) 14 225 54 85 11 64 31
18 DeMarcus Cousins (SAC) 13 230 103 31 22 65 36
POISSON-EXAMPLE WITH SAS
proc genmod data = nba;
model FGA= GP DREB AST STL TOV FTM /dist=poisson;
run;
/*check goodness of fit for model*/
data pvalue;
df = 93; chisq = 511.6210;
pvalue = 1 - probchi(chisq, df);
run;
proc print data = pvalue noobs;
run;
/* p-value is effectively 0 (highly significant), so the model does not fit well; deviance/df = 5.5013 >> 1 indicates major overdispersion */
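The same goodness-of-fit check can be sketched in R. This assumes only the deviance and degrees of freedom used in the SAS data step above; the variable names are illustrative.

```r
# Deviance and residual degrees of freedom copied from the SAS data step
deviance_value <- 511.6210
df_resid       <- 93

# Upper-tail chi-square probability: the same quantity as 1 - probchi(chisq, df) in SAS.
# A small p-value is evidence that the model does NOT fit.
p_value <- pchisq(deviance_value, df = df_resid, lower.tail = FALSE)
p_value

# Deviance / df: values far above 1 point to overdispersion
dispersion_ratio <- deviance_value / df_resid
dispersion_ratio
```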
EXAMPLE RESULTS-GOODNESS OF FIT
The GENMOD Procedure
Model Information
Data Set WORK.NBA
Distribution Poisson
Link Function Log
Dependent Variable FGA
Number of Observations Read 100
Number of Observations Used 100
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 93 511.6210 5.5013
Scaled Deviance 93 511.6210 5.5013
Pearson Chi-Square 93 518.3345 5.5735
Scaled Pearson X2 93 518.3345 5.5735
Log Likelihood 72301.7048
Full Log Likelihood -604.2412
AIC (smaller is better) 1222.4824
AICC (smaller is better) 1223.6998
BIC (smaller is better) 1240.7186
RESULTS: Analysis of Maximum Likelihood Parameter Estimates
PARAMETER DF ESTIMATE STANDARD ERROR WALD 95% CONFIDENCE LIMITS WALD CHI-SQUARE PR>CHISQ
Intercept 1 4.1864 0.0749 (4.0396, 4.3332) 3125.02 <.0001
GP 1 0.0422 0.0057 (0.0310, 0.0534) 54.93 <.0001
DREB 1 0.0004 0.0003 (-0.0002, 0.0010) 1.55 0.2131
AST 1 -0.0002 0.0003 (-0.0008, 0.0005) 0.28 0.5995
STL 1 0.0028 0.0012 (0.0004, 0.0052) 5.17 0.0230
TOV 1 0.0057 0.0010 (0.0038, 0.0077) 33.53 <.0001
FTM 1 0.0040 0.0004 (0.0032, 0.0048) 98.23 <.0001
Scale 0 1.000 0 (1.0, 1.0)
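ASSESSMENT OF RESULTS
• Ratio of Deviance/DF = 5.5013 >> 1: major overdispersion
• Deviance = 511.6210; the model is not well fit because pvalue = 1 - probchi(chisq, df) is effectively 0 (highly significant), which rejects the hypothesis that the model fits
• Every term significant except for AST and DREB
• False results possible if the model is inaccurate: under overdispersion the Poisson standard errors are too small, so significance is overstated
• Must fit a NEGATIVE BINOMIAL model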
POISSON-EXAMPLE WITH R
nba <- read.csv("F:/STATS544/nba.csv",header=TRUE)
poiss<-glm(FGA ~GP+DREB+AST+STL+TOV+FTM, family = "poisson", data = nba)
summary(poiss)
R-GOODNESS OF FIT
Deviance Residuals:
Min 1Q Median 3Q Max
-5.5397 -1.2614 -0.1643 1.2650 6.2786
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 926.60 on 99 degrees of freedom
Residual deviance: 511.62 on 93 degrees of freedom
AIC: 1222.5
R-ANALYSIS OF PARAMETER ESTIMATES
Call:
glm(formula = FGA ~ GP + DREB + AST + STL + TOV + FTM, family = "poisson",
data = nba)
Coefficients:
ESTIMATE STD.ERROR Z VALUE PR(>|z|)
(Intercept) 4.1864100 0.0748885 55.902 < 2e-16 ***
GP 0.0422013 0.0056940 7.411 1.25e-13 ***
DREB 0.0003719 0.0002987 1.245 0.213
AST -0.0001778 0.0003387 -0.525 0.600
STL 0.0027777 0.0012221 2.273 0.023 *
TOV 0.0057220 0.0009882 5.790 7.02e-09 ***
FTM 0.0040405 0.0004077 9.911 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
POISSON WITH SPSS & MINITAB
SPSS
genlin FGA with GP DREB AST STL TOV FTM
/model GP DREB AST STL TOV FTM INTERCEPT=YES distribution = poisson link = log
/print FIT SUMMARY SOLUTION.
MINITAB
Stat > Regression > Poisson Regression > Fit Poisson Model.
Detecting over-dispersion with SAS
Poisson regression gives a ratio between DEVIANCE and DF that is greater than 1.
proc genmod data = nba;
model FGA= GP DREB AST STL TOV FTM /dist=poisson;
run;
PROC MEANS: the variance of FGA (Y) is much higher than its mean
proc means data = nba n mean var min max;
var FGA;
run;
Detecting over-dispersion with R
Poisson regression gives a ratio between RESIDUAL DEVIANCE and DF that is greater than 1.
poiss<-glm(FGA ~GP+DREB+AST+STL+TOV+FTM, family = "poisson",
data = nba)
summary(poiss)
mean(nba$FGA)
[1] 173.47
var(nba$FGA)
[1] 1684.858
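The two checks can be turned into numbers with a few lines of R. This is a sketch: the mean and variance are the values reported for FGA, while the helper `pearson_dispersion` and the simulated data are hypothetical, included only so the example runs without the NBA file.

```r
# Summary statistics reported for FGA
mean_fga <- 173.47
var_fga  <- 1684.858
vm_ratio <- var_fga / mean_fga   # near 1 would be consistent with a Poisson response
vm_ratio

# Hypothetical helper: Pearson chi-square divided by residual df for a fitted glm
pearson_dispersion <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

# Demonstration on simulated overdispersed counts (not the NBA data)
set.seed(1)
x   <- rnorm(200)
y   <- rnbinom(200, size = 2, mu = exp(1 + 0.3 * x))
fit <- glm(y ~ x, family = poisson)
pearson_dispersion(fit)          # values well above 1 suggest overdispersion
```

Strictly, the Poisson assumption concerns the variance given the predictors, so the model-based ratio is the more relevant check; the raw variance-to-mean ratio is only a quick first look.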
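NEGATIVE BINOMIAL REGRESSION
• Generalization of Poisson regression
• Used for over-dispersed count data
• PMF (written in terms of the dispersion parameter k, consistent with the variance below):
$$P(Y = y \mid X_1, X_2, X_3) = \frac{\Gamma(y + 1/k)}{\Gamma(1/k)\, y!} \left(\frac{k\mu}{1 + k\mu}\right)^{y} \left(\frac{1}{1 + k\mu}\right)^{1/k}, \qquad y = 0, 1, 2, \ldots$$
• E(Y) = µ, V(Y) = µ + kµ², k = dispersion parameter
• As k → 0, V(Y) → µ: NB approaches Poisson and V(Y) = E(Y) = µ
• Link Function same as Poisson: g(µ) = log(µ)
• Equation: log(µ(X)) = β0 + β1X1 + β2X2 + … + βp-1Xp-1
• Goodness-of-fit test: same as Poisson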
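The mean-variance relationship and the Poisson limit can be checked numerically. The sketch below assumes the dispersion parameterization V(Y) = µ + kµ²; the function `nb_pmf` and the values of `mu` and `k` are hypothetical. R's built-in `dnbinom()` uses `size = 1/k` together with `mu`.

```r
# Negative binomial mass function in the dispersion (k) parameterization,
# computed on the log scale for numerical stability
nb_pmf <- function(y, mu, k) {
  exp(lgamma(y + 1 / k) - lgamma(1 / k) - lgamma(y + 1) +
        y * log(k * mu / (1 + k * mu)) - (1 / k) * log(1 + k * mu))
}

mu <- 5      # hypothetical mean
k  <- 0.5    # hypothetical dispersion
y  <- 0:20

# Agreement with the built-in density
all.equal(nb_pmf(y, mu, k), dnbinom(y, size = 1 / k, mu = mu))

# Variance computed from the mass function versus mu + k * mu^2
grid <- 0:2000
p    <- dnbinom(grid, size = 1 / k, mu = mu)
m    <- sum(grid * p)
v    <- sum((grid - m)^2 * p)
c(variance = v, formula = mu + k * mu^2)

# As k shrinks toward 0, the probabilities approach the Poisson ones
poisson_gap <- max(abs(nb_pmf(y, mu, 1e-6) - dpois(y, mu)))
poisson_gap
```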
NEGATIVE BINOMIAL-EXAMPLE WITH SAS
proc genmod data = nba;
model FGA= GP DREB AST STL TOV FTM /dist=negbin; /* ONLY DIFFERENCE FROM POISSON */
run;
/*check goodness of fit for model*/
data pvalue;
df = 93; chisq = 99.3405;
pvalue = 1 - probchi(chisq, df);
run;
proc print data = pvalue noobs;
run;
EXAMPLE RESULTS-GOODNESS OF FIT
Data Set WORK.NBA
Distribution Negative Binomial
Link Function Log
Dependent Variable FGA
Number of Observations Read 100
Number of Observations Used 100
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 93 99.3405 1.0682
Scaled Deviance 93 99.3405 1.0682
Pearson Chi-Square 93 100.7383 1.0832
Scaled Pearson X2 93 100.7383 1.0832
Log Likelihood 72428.1189
Full Log Likelihood -477.8271
AIC (smaller is better) 971.6543
AICC (smaller is better) 973.2367
BIC (smaller is better) 992.4957
RESULTS: Analysis of Maximum Likelihood Parameter Estimates
PARAMETER DF ESTIMATE STANDARD ERROR WALD 95% CONFIDENCE LIMITS WALD CHI-SQUARE PR>CHI-SQ
INTERCEPT 1 4.1742 0.1641 (3.8525, 4.4958) 647.01 <.0001
GP 1 0.0426 0.0125 (0.0181, 0.0671) 11.62 0.0007
DREB 1 0.0003 0.0007 (-0.0011, 0.0016) 0.15 0.7028
AST 1 -0.0001 0.0008 (-0.0017, 0.0014) 0.03 0.8619
STL 1 0.0024 0.0027 (-0.0029, 0.0077) 0.78 0.3756
TOV 1 0.0060 0.0023 (0.0015, 0.0105) 6.95 0.0084
FTM 1 0.0042 0.0010 (0.0023, 0.0061) 19.32 <.0001
DISPERSION 1 0.0230 0.0040 (0.0163, 0.0325)
ASSESSMENT OF RESULTS
• Ratio of Deviance/DF = 1.0682 ≈ 1 (over-dispersion fixed!)
• Deviance = 99.3405; the model now fits well because pvalue = 1 - probchi(chisq, df) is NOT significant (about 0.3), so there is no evidence of lack of fit
• Extra parameter in the “Analysis of Maximum Likelihood Parameter Estimates” called “Dispersion” (aka ALPHA): accounts for the over-dispersion we came across in the Poisson regression
• This estimate has a value of .0230 with a Wald 95% confidence interval of (.0163, .0325). Because the interval excludes 0, we can say that dispersion is significantly different from 0, which supports the negative binomial model as more appropriate
• GP, TOV, & FTM are the only significant predictors
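The two fits can also be compared through their log likelihoods and AIC values. The following is a sketch, assuming the "Full Log Likelihood" and AIC values reported by GENMOD for the Poisson and negative binomial fits are directly comparable (same data, same response):

$$LR = 2\,(\ell_{NB} - \ell_{Poisson}) = 2\,\bigl(-477.8271 - (-604.2412)\bigr) = 2\,(126.4141) \approx 252.83$$

$$\Delta AIC = 1222.4824 - 971.6543 \approx 250.83$$

Because the Poisson model is the boundary case k = 0, comparing LR against a chi-square with 1 degree of freedom is conservative. Its 5% critical value is 3.84, far below 252.83, so this comparison points the same way as the confidence interval for the dispersion parameter.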
NEGATIVE BINOMIAL-EXAMPLE WITH R
nba <- read.csv("F:/STATS544/nba.csv",header=TRUE)
install.packages('MASS')
library(MASS)
nb<-glm.nb(FGA ~GP+DREB+AST+STL+TOV+FTM, data = nba)
summary(nb)
EXAMPLE RESULTS-GOODNESS OF FIT
(Dispersion parameter for Negative Binomial(43.4291) family taken to be 1)
Null deviance: 182.54 on 99 degrees of freedom
Residual deviance: 99.34 on 93 degrees of freedom
AIC: 971.65
Number of Fisher Scoring iterations: 1
Deviance Residuals:
Min 1Q Median 3Q Max
-2.36322 -0.60467 -0.06083 0.55227 2.72053
Theta: 43.43
Std. Err.: 7.62
2 x log-likelihood: -955.654
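R's glm.nb reports Theta rather than the Dispersion parameter shown by SAS. Under the parameterization used by glm.nb, where V(Y) = µ + µ²/θ, the two are reciprocals, so the R and SAS fits can be reconciled as follows (the standard error uses a first-order delta-method approximation):

$$\hat{k} = \frac{1}{\hat{\theta}} = \frac{1}{43.4291} \approx 0.0230, \qquad SE(\hat{k}) \approx \frac{SE(\hat{\theta})}{\hat{\theta}^{2}} = \frac{7.62}{43.4291^{2}} \approx 0.0040$$

Both values match the Dispersion estimate and standard error in the SAS output.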
RESULTS: Analysis of Maximum Likelihood Parameter Estimates
Call:
glm.nb(formula = FGA ~ GP + DREB + AST + STL + TOV + FTM, data = nba,
init.theta = 43.42912732, link = log)
Coefficients:
ESTIMATE STD.ERROR Z-VALUE PR(>|Z|)
(Intercept) 4.1741833 0.1626544 25.663 < 2e-16 ***
GP 0.0425988 0.0123895 3.438 0.000585 ***
DREB 0.0002619 0.0006835 0.383 0.701571
AST -0.000139 0.0007904 -0.176 0.860433
STL 0.0023962 0.0027055 0.886 0.375794
TOV 0.0060360 0.0022760 2.652 0.008001 **
FTM 0.0042121 0.0009430 4.467 7.95e-06 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
INTERPRETATION OF SIGNIFICANT COEFFICIENTS
GP: Holding all other variables constant, for every additional game played, the log of the expected number of field goal attempts goes up by 0.0426. Equivalently, for every additional game played, the expected number of field goal attempts increases by 4.35%.
TOV: Holding all other variables constant, for every additional turnover, the log of the expected number of field goal attempts goes up by 0.0060. Equivalently, for every additional turnover, the expected number of field goal attempts increases by 0.60%.
FTM: Holding all other variables constant, for every additional free throw made, the log of the expected number of field goal attempts goes up by 0.0042. Equivalently, for every additional free throw made, the expected number of field goal attempts increases by 0.42%.
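The Wald confidence limits can be put on the same percent scale by exponentiating them. A minimal R sketch, using the estimates and 95% limits from the SAS negative binomial output; the helper name `to_pct` is hypothetical.

```r
# Estimates and Wald 95% limits copied from the negative binomial output
est   <- c(GP = 0.0426, TOV = 0.0060, FTM = 0.0042)
lower <- c(GP = 0.0181, TOV = 0.0015, FTM = 0.0023)
upper <- c(GP = 0.0671, TOV = 0.0105, FTM = 0.0061)

# Percent change in the expected count per one-unit increase
to_pct <- function(b) (exp(b) - 1) * 100

# One row per predictor: point estimate and interval, all in percent
pct_table <- round(cbind(percent = to_pct(est),
                         lower   = to_pct(lower),
                         upper   = to_pct(upper)), 2)
pct_table
```

Since all three intervals lie above 0 on the log scale, the corresponding percent intervals lie entirely above 0% as well.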
NEGATIVE BINOMIAL WITH SPSS & MINITAB
SPSS
genlin FGA with GP DREB AST STL TOV FTM
/model GP DREB AST STL TOV FTM INTERCEPT=YES distribution = negbin(mle) link = log
/print FIT SUMMARY SOLUTION.
MINITAB
NA
SUMMARY
• Use Poisson regression when dealing with COUNT data
• If there’s overdispersion, switch to negative binomial
• Assumptions for Poisson and NB are the same, except that NB does not require the variance to equal the mean
• Coefficients of both models are interpreted in the same manner
• Can perform both regressions in SAS, R, & SPSS