AMMBR II Gerrit Rooks. Checking assumptions in logistic regression Hosmer & Lemeshow Residuals...


Page 1: AMMBR II Gerrit Rooks. Checking assumptions in logistic regression Hosmer & Lemeshow Residuals Multi-collinearity Cooks distance.

AMMBR II

Gerrit Rooks

Page 2

Checking assumptions in logistic regression

• Hosmer & Lemeshow
• Residuals
• Multicollinearity
• Cook's distance

Page 3

Hosmer & Lemeshow

The test divides the sample into subgroups and checks whether the observed and predicted frequencies are about equal within each group.

The test should not be significant (a non-significant result indicates no detectable difference between observed and predicted).

Page 4

Hosmer & Lemeshow

Average probability in the j-th group
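For reference, the Hosmer & Lemeshow statistic is usually written as below. The notation is the common textbook one and is an assumption here, not taken from the slide: g is the number of groups, O_j the observed number of cases with outcome 1 in group j, n_j the number of observations in group j, and \bar{\pi}_j the average predicted probability in the j-th group.

```latex
\[
\hat{C} \;=\; \sum_{j=1}^{g}
  \frac{\left(O_j - n_j \bar{\pi}_j\right)^2}
       {n_j \, \bar{\pi}_j \left(1 - \bar{\pi}_j\right)}
\]
```

Under a well-fitting model the statistic is approximately chi-square distributed with g - 2 degrees of freedom, which is why a test with 10 groups is reported as chi2(8).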

Page 5

First logistic regression

. logit hiqual yr_rnd meals cred_ml

Iteration 0: log likelihood = -349.01971
Iteration 1: log likelihood = -199.10312
Iteration 2: log likelihood = -160.11854
Iteration 3: log likelihood = -156.27132
Iteration 4: log likelihood = -156.25612
Iteration 5: log likelihood = -156.25611

Logistic regression               Number of obs = 707
                                  LR chi2(3)    = 385.53
                                  Prob > chi2   = 0.0000
Log likelihood = -156.25611       Pseudo R2     = 0.5523

hiqual     Coef.       Std. Err.   z        P>|z|   [95% Conf. Interval]
yr_rnd     -1.189537   .5022235    -2.37    0.018   -2.173877   -.2051967
meals      -.0936      .0084587    -11.07   0.000   -.1101786   -.0770213
cred_ml    .7406536    .3152647    2.35     0.019   .1227463    1.358561
_cons      2.425635    .3995025    6.07     0.000   1.642624    3.208645

Page 6

Then postestimation command

. estat gof, table group(10)

Logistic model for hiqual, goodness-of-fit test

(Table collapsed on quantiles of estimated probabilities)

Group   Prob     Obs_1   Exp_1   Obs_0   Exp_0   Total
1       0.0008   1       0.0     71      72.0    72
2       0.0019   1       0.1     71      71.9    72
3       0.0037   0       0.2     71      70.8    71
4       0.0078   0       0.4     68      67.6    68
5       0.0208   1       0.9     71      71.1    72
6       0.0560   2       2.4     68      67.6    70
7       0.1554   4       7.4     68      64.6    72
8       0.4960   23      22.0    47      48.0    70
9       0.7531   44      43.5    26      26.5    70
10      0.9595   62      61.1    8       8.9     70

number of observations  = 707
number of groups        = 10
Hosmer-Lemeshow chi2(8) = 40.45
Prob > chi2             = 0.0000

Page 7

Including interaction term helps

. gen ym=yr_rnd*meals

. logit hiqual yr_rnd meals cred_ml ym , nolog

Logistic regression               Number of obs = 707
                                  LR chi2(4)    = 390.46
                                  Prob > chi2   = 0.0000
Log likelihood = -153.78831       Pseudo R2     = 0.5594

hiqual     Coef.       Std. Err.   z        P>|z|   [95% Conf. Interval]
yr_rnd     -2.834458   .8630901    -3.28    0.001   -4.526083   -1.142832
meals      -.1019211   .0098691    -10.33   0.000   -.1212641   -.0825781
cred_ml    .7789823    .3206881    2.43     0.015   .1504452    1.407519
ym         .0463257    .0188326    2.46     0.014   .0094145    .0832368
_cons      2.686005    .4307661    6.24     0.000   1.841719    3.530291
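One way to check whether the interaction term helps is to compare the two models with a likelihood ratio test and then rerun the goodness-of-fit test on the larger model. The sketch below assumes the data are already loaded and that ym has not been created yet; the stored-estimate names m_main and m_inter are made up for illustration. From the log likelihoods on the slides, the LR statistic should come out at about 2 × (156.25611 - 153.78831) ≈ 4.94 on 1 degree of freedom.

```stata
* Sketch only: assumes the data set is in memory and ym does not exist yet.
* m_main and m_inter are hypothetical names for the stored estimates.

* Model without the interaction term
logit hiqual yr_rnd meals cred_ml, nolog
estimates store m_main

* Model with the yr_rnd x meals interaction
generate ym = yr_rnd*meals
logit hiqual yr_rnd meals cred_ml ym, nolog
estimates store m_inter

* Hosmer & Lemeshow test for the model with the interaction
estat gof, group(10) table

* Likelihood ratio test of the interaction term
lrtest m_main m_inter
```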

Page 8

Multicollinearity

. reg hiqual avg_ed yr_rnd meals

Source     SS           df     MS
Model      145.983509   3      48.6611696
Residual   108.279876   1154   .093830049
Total      254.263385   1157   .219760921

Number of obs = 1158
F(3, 1154)    = 518.61
Prob > F      = 0.0000
R-squared     = 0.5741
Adj R-squared = 0.5730
Root MSE      = .30632

hiqual     Coef.       Std. Err.   t        P>|t|   [95% Conf. Interval]
avg_ed     .1729601    .021089     8.20     0.000   .1315831    .2143371
yr_rnd     -.0008586   .0248112    -0.03    0.972   -.0495386   .0478215
meals      -.0076084   .000527     -14.44   0.000   -.0086423   -.0065744
_cons      .2445202    .0824989    2.96     0.003   .0826554    .4063849

. vif

Variable   VIF    1/VIF
meals      3.31   0.301982
avg_ed     3.25   0.307731
yr_rnd     1.11   0.903460
Mean VIF   2.56
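To see where a VIF comes from, it can be recomputed by hand: regress one predictor on the other predictors and take 1/(1 - R-squared). The sketch below does this for meals and assumes the same data set is in memory. Restricting the auxiliary regression to e(sample) is a choice made here so that it uses the same observations as the original regression.

```stata
* Sketch only: assumes the data set used on the slides is in memory.

* Refit the original linear regression so that e(sample) is defined
quietly regress hiqual avg_ed yr_rnd meals

* Auxiliary regression: meals on the other two predictors, same sample
quietly regress meals avg_ed yr_rnd if e(sample)

* Tolerance is 1 - R-squared; VIF is its reciprocal
display "R-squared    = " e(r2)
display "1/VIF        = " 1 - e(r2)
display "VIF of meals = " 1/(1 - e(r2))
```

If the sample matches, the last two lines should agree with the 1/VIF and VIF that vif reports for meals.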

Page 9

Residuals

• Residual = (observed value - predicted value) / square root of the variance

. predict p
(option pr assumed; Pr(hiqual))
(42 missing values generated)

. predict stdres, rstand
(42 missing values generated)
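In symbols, for a single observation with outcome y_i and predicted probability \hat{\pi}_i, the Pearson residual and its standardized version can be written as below. The notation is an assumption, not taken from the slide; h_i denotes the leverage of observation i. Stata groups observations that share a covariate pattern, so this is the simplest case of one observation per pattern.

```latex
\[
r_i \;=\; \frac{y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i \left(1 - \hat{\pi}_i\right)}},
\qquad
r_i^{\mathrm{std}} \;=\; \frac{r_i}{\sqrt{1 - h_i}}
\]
```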

Page 10

Residuals

. scatter stdres p, mlabel(snum)

[Figure: scatter plot of the standardized Pearson residual (y-axis) against Pr(hiqual) (x-axis, 0 to 1); each point is labelled with its school number (snum).]

Page 11

Inspect observations with large residuals (absolute value above roughly 2.5 to 3)

. list if snum==1403

458.
snum   dnum   schqual   hiqual   yr_rnd   meals   enroll   cred   cred_ml
1403   315    high      high     nd       100     497      low    low

cred_hl   pared    pared_ml   pared_hl   api00   api99   full   some_col
low       medium   medium     .          808     824     59     28

awards   ell   avg_ed   hicred   ym
No       27    2.19     0        100

Page 12

. logit hiqual yr_rnd meals avg_ed, nolog

Logistic regression               Number of obs = 1158
                                  LR chi2(3)    = 914.05
                                  Prob > chi2   = 0.0000
Log likelihood = -273.66402       Pseudo R2     = 0.6255

hiqual     Coef.       Std. Err.   z        P>|z|   [95% Conf. Interval]
yr_rnd     -.9913148   .3743452    -2.65    0.008   -1.725018   -.2576117
meals      -.0758864   .0074453    -10.19   0.000   -.090479    -.0612938
avg_ed     1.98805     .2884154    6.89     0.000   1.422766    2.553334
_cons      -3.566451   1.01715     -3.51    0.000   -5.560028   -1.572874

. logit hiqual yr_rnd meals avg_ed if snum != 1403

Iteration 0: log likelihood = -729.56398
Iteration 1: log likelihood = -332.43297
Iteration 2: log likelihood = -270.06297
Iteration 3: log likelihood = -265.70542
Iteration 4: log likelihood = -265.68934
Iteration 5: log likelihood = -265.68934

Logistic regression               Number of obs = 1157
                                  LR chi2(3)    = 927.75
                                  Prob > chi2   = 0.0000
Log likelihood = -265.68934       Pseudo R2     = 0.6358

hiqual     Coef.       Std. Err.   z        P>|z|   [95% Conf. Interval]
yr_rnd     -1.1328     .3842377    -2.95    0.003   -1.885892   -.3797077
meals      -.0790397   .0076984    -10.27   0.000   -.0941283   -.0639511
avg_ed     2.010791    .2947269    6.82     0.000   1.433137    2.588445
_cons      -3.528875   1.037345    -3.40    0.001   -5.562035   -1.495716

Page 13

Cook's distance (should be < 1)

Terms in the formula:

• Mean square error
• Number of parameters
• Prediction for j from all observations
• Prediction for j from the observations excluding observation i
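The four terms listed on this slide correspond to the usual linear-regression definition of Cook's distance for observation i, written out below. The symbols are assumptions chosen to match the labels: \hat{y}_j is the prediction for j from all observations, \hat{y}_{j(i)} is the prediction for j when observation i is excluded, p is the number of parameters, and MSE is the mean square error.

```latex
\[
D_i \;=\; \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}
               {p \cdot \mathrm{MSE}}
\]
```

After logit, Stata's predict with the dbeta option returns Pregibon's delta-beta influence statistic, which plays the same role as Cook's distance in logistic regression.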

Page 14

. logit hiqual meals yr_rnd cred_ml, nolog

Logistic regression               Number of obs = 707
                                  LR chi2(3)    = 385.53
                                  Prob > chi2   = 0.0000
Log likelihood = -156.25611       Pseudo R2     = 0.5523

hiqual     Coef.       Std. Err.   z        P>|z|   [95% Conf. Interval]
meals      -.0936      .0084587    -11.07   0.000   -.1101786   -.0770213
yr_rnd     -1.189537   .5022235    -2.37    0.018   -2.173877   -.2051967
cred_ml    .7406536    .3152647    2.35     0.019   .1227463    1.358561
_cons      2.425635    .3995025    6.07     0.000   1.642624    3.208645

. predict cook, dbeta
(493 missing values generated)

. summ cook

Variable   Obs   Mean       Std. Dev.   Min        Max
cook       707   .0257177   .0899176    2.11e-07   .6101257
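Once the influence statistic has been stored, the schools with the largest values can be listed directly instead of being read off a plot. This is a small sketch that assumes the variable cook has already been created with predict cook, dbeta as on this slide. Note that gsort changes the sort order of the data in memory.

```stata
* Sketch only: assumes "predict cook, dbeta" has already been run.

* Sort from largest to smallest influence (missing values go last)
gsort -cook

* Show the five most influential schools
list snum cook in 1/5

* Show any school above the rule-of-thumb cutoff of 1 (may list nothing)
list snum cook if cook > 1 & !missing(cook)
```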

Page 15

. graph twoway scatter cook p, mlabel(snum)

[Figure: scatter plot of Pregibon's dbeta (y-axis) against Pr(hiqual) (x-axis, 0 to 1); each point is labelled with its school number (snum).]

Page 16

To Stata

• Use apilog.dta
• Awards = dependent variable
• For Awards inspect frequency counts
• Recode Awards into binary variable
• Estimate a LR model using yr_rnd meals enroll as predictors
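The steps above could be started along the following lines. This is only a sketch under several assumptions: that apilog.dta sits in the current working directory, that the variable is named awards in lower case, and that it has exactly two categories. The name award_bin is made up for illustration. Because the coding of awards is not shown on the slides, the sketch uses egen group(), which works for both string and labelled numeric variables, and then cross-tabulates to check which category ended up as 1.

```stata
* Sketch only: file location, variable coding and the name award_bin
* are assumptions, not taken from the slides.
use apilog.dta, clear

* Frequency counts of the dependent variable
tabulate awards, missing

* Recode into 0/1: group() numbers the categories 1, 2 in sorted order
egen award_bin = group(awards)
replace award_bin = award_bin - 1

* Check which original category is coded 1
tabulate awards award_bin

* Logistic regression with the three predictors
logit award_bin yr_rnd meals enroll
```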

Page 17

To Stata

• Inspect classification table
• Perform Hosmer & Lemeshow test
• Inspect standardized residuals
• Inspect Cook's distance
• See if interaction effects improve fit
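A possible sequence of postestimation commands for this checklist is sketched below. It assumes a logit model with a 0/1 outcome has just been fitted, for example logit award_bin yr_rnd meals enroll, where award_bin is a hypothetical name for your recoded Awards variable. The new variable names (p_hat, r_std, d_beta, yr_meals) are also made up, and the yr_rnd by meals interaction is just one candidate to try.

```stata
* Sketch only: assumes a logit model with a 0/1 outcome was just fitted.
* award_bin, p_hat, r_std, d_beta and yr_meals are hypothetical names.

* Classification table and Hosmer & Lemeshow test
estat classification
estat gof, group(10) table

* Predicted probabilities, standardized residuals, influence statistic
predict p_hat
predict r_std, rstandard
predict d_beta, dbeta

* Plots against the predicted probability, labelled by school number
scatter r_std p_hat, mlabel(snum)
scatter d_beta p_hat, mlabel(snum)

* Try an interaction term and check the fit again
generate yr_meals = yr_rnd*meals
logit award_bin yr_rnd meals enroll yr_meals, nolog
estat gof, group(10)
```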

Page 18

• Is the Wald test an accurate test of the significance of coefficients in logistic regression analysis?
a) Yes, just like in regression analysis.
b) Yes, it is accurate, although a likelihood ratio test is more efficient.
c) No, unlike in regression analysis, the Wald test is biased, especially for relatively small coefficients.
d) No, unlike in regression analysis, the Wald test is biased, especially for relatively large coefficients.

Page 19

• Use lrtest to check the significance of the effect of the variable yr_rnd

• Use auto.dta (if it is not on your PC, then):
– use http://www.stata-press.com/data/r11/auto

• Predict which car will be foreign, using weight and mpg as predictors
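For the auto.dta exercise, a minimal starting point might look like this. It assumes the web copy of the data set is reachable from your machine; the classification table is added here only as one way to see which cars the model predicts to be foreign.

```stata
* Sketch only: loads the data from the URL given on the slide.
use http://www.stata-press.com/data/r11/auto, clear

* Logistic regression of foreign on weight and mpg
logit foreign weight mpg

* Cross-tabulation of predicted against actual foreign status
estat classification
```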

Page 20

• Is the interaction between weight and mpg significant?

• Tip: always center the variables before making an interaction variable.
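Centering and testing the interaction could be done as sketched below, assuming auto.dta is already in memory. The variable names c_weight, c_mpg and c_wm and the stored-estimate names m_main and m_inter are made up for illustration. The likelihood ratio test is used here as one way to judge the interaction; the z test on c_wm in the second model is the Wald alternative.

```stata
* Sketch only: assumes auto.dta is in memory; all new names are hypothetical.

* Center both predictors at their sample means
summarize weight, meanonly
generate c_weight = weight - r(mean)
summarize mpg, meanonly
generate c_mpg = mpg - r(mean)

* Interaction of the centered variables
generate c_wm = c_weight*c_mpg

* Models without and with the interaction
logit foreign c_weight c_mpg, nolog
estimates store m_main
logit foreign c_weight c_mpg c_wm, nolog
estimates store m_inter

* Likelihood ratio test of the interaction term
lrtest m_main m_inter
```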

Page 21

• use http://www.stata-press.com/data/r11/choice

• Do income, gender or type of car (European, Japanese or American) predict whether a car will be bought (choice)?