Ed231C- Generalized Linear Models


Transcript of Ed231C- Generalized Linear Models

  • 26/02/10 12:26  Ed231C: Generalized Linear Models

    Page 1 of 9  http://www.gseis.ucla.edu/courses/ed231c/notes1/glm.html

    Applied Categorical & Nonnormal Data Analysis

    Generalized Linear Models

    Most students are introduced to linear models through either multiple regression or analysis of variance. With these methods the expected value of the response variable is statistically modeled, that is, it is expressed as a linear combination of the explanatory variables. With categorical and count response variables, the regression cannot be linear. The problem of nonlinearity is handled through nonlinear functions that transform the expected value of the categorical or count variable into a linear function of the explanatory variables. Such transformations are referred to as link functions.

    For example, in the analysis of count data, the expected frequencies must be nonnegative. To ensure that the predicted values from the linear models fit these constraints, the log link is used to transform the expected value of the response variable. This loglinear transformation serves two purposes: it ensures that the fitted values are appropriate for count data, and it permits the unknown regression parameters to lie within the real number space.
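    The two purposes of the log link can be made concrete with a short Python sketch (an illustration added to these notes, not part of the original course material): whatever real value the linear predictor takes, the inverse log link exp(xb) returns a strictly positive fitted mean, which is exactly what count data require.

```python
import math

# Illustrative sketch of the log link (not from the original notes).
# The linear predictor xb = b0 + b1*X1 + ... may be any real number,
# but the inverse link exp(xb) always yields a positive fitted mean.
def inverse_log_link(xb):
    """Map a linear predictor back to the mean scale: E(y) = exp(xb)."""
    return math.exp(xb)

# Even a strongly negative linear predictor maps to a positive mean.
for xb in (-5.0, 0.0, 2.5):
    print(xb, inverse_log_link(xb))
```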

    Different types of response variables utilize different link functions: both the logit and probit link functions work with binomial response variables, while the log link function works with both poisson and negative binomial response variables. Growing out of the work of Nelder & Wedderburn (1972) and McCullagh & Nelder (1989), generalized linear models provide a unified framework which can be applied to various 'linear' models.

    Generalized linear models take the form:

    g(E(y)) = xβ,  y -> {F}

    where F is the distribution family and g( ) is the link function.

    You might recognize this example more easily if it were rewritten as follows:

    Y' = b0 + b1X1 + b2X2 + ... y -> {gaussian}

    Now we can replace Y' with E(y),

    E(y) = b0 + b1X1 + b2X2 + ... y -> {gaussian}

    In OLS the distribution family is gaussian (normal), i.e., y -> {gaussian} and the link function is identity, i.e., g(y) = y. Thus, we can write g(E(y)) as just E(y).


    Another example is poisson regression in which the distribution family is poisson, i.e., y -> {poisson} and the link function is the natural log, i.e., g(y) = ln(y). The glm model would then be written as,

    g(E(y)) = b0 + b1X1 + b2X2 + ... y -> {poisson}

    Here are examples of distributions and link functions for some common estimation procedures:

    type of                  distribution     link
    estimation               family           function
    -----------------------  --------------   --------
    OLS regression           gaussian         identity
    logistic regression      binomial         logit
    probit                   binomial         probit
    cloglog                  binomial         cloglog
    poisson regression       poisson          log
    neg binomial regression  neg binomial     log

    Stata's GLM Procedure

    Stata's glm procedure estimates generalized linear models in which the user can specify both the distribution family and the link function. Here is the basic syntax of the glm procedure:

    glm depvar indvars [if exp] [in range] [, family(fname) link(lname) eform ]

    where fname can take on the values gaussian | igaussian | binomial | poisson | nbinomial | gamma and lname can take on the values identity | log | logit | probit | cloglog | nbinomial | power | opower.

    An OLS regression would look like this using regress and glm:

    regress write read math gender
    glm write read math gender, family(gaus) link(iden)

    A logistic regression would look like this:

    logistic honors read math gender
    glm honors read math gender, family(binom) link(logit)

    A poisson regression would look like this:

    poisson days read math gender
    glm days read math gender, family(poisson) link(log)

    A negative binomial regression would look like this:

    nbreg days read math gender
    glm days read math gender, family(nbinom) link(log)

    Here is a list of the allowable distribution families:

    gaussian (normal)
    inverse gaussian
    bernoulli (binomial)
    poisson


    negative binomial
    gamma

    And here is a list of the link functions that are available:

    identity
    log
    logit
    probit
    complementary log-log
    odds power
    power
    negative binomial
    log-log
    log-complement

    Of course, if all that glm could do was duplicate OLS, logistic, poisson and negative binomial regression, then it would not appear to be very useful. However, it is possible to combine distribution families and link functions in ways that do not duplicate existing estimation procedures. The table below gives the possible combinations that make sense from a data analysis perspective:

                       iden  log  logit  probit  cloglog  nbinom  power  opower  loglog  logc
    gaussian             X    X                                      X
    inverse gaussian     X    X                                      X
    binomial             X    X     X      X        X                X      X       X      X
    poisson              X    X                                      X
    negative binomial    X    X                        X             X
    gamma                X    X                                      X

    Examples

    use http://www.gseis.ucla.edu/courses/data/hsb2

    generate hon = write>=60

    regress write read math female

          Source |       SS       df       MS              Number of obs =     200
    -------------+------------------------------           F(  3,   196) =   72.52
           Model |  9405.34864     3  3135.11621           Prob > F      =  0.0000
        Residual |  8473.52636   196  43.2322773           R-squared     =  0.5261
    -------------+------------------------------           Adj R-squared =  0.5188
           Total |   17878.875   199   89.843593           Root MSE      =  6.5751

    ------------------------------------------------------------------------------
           write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            read |   .3252389   .0607348     5.36   0.000     .2054613    .4450166
            math |   .3974826   .0664037     5.99   0.000      .266525    .5284401
          female |    5.44337   .9349987     5.82   0.000      3.59942    7.287319
           _cons |   11.89566   2.862845     4.16   0.000     6.249728     17.5416
    ------------------------------------------------------------------------------

    glm write read math female, link(iden) fam(gauss) nolog

    Generalized linear models                No. of obs      =        200


    Optimization     : ML: Newton-Raphson    Residual df     =        196
                                             Scale parameter =   43.23228
    Deviance         =  8473.526357          (1/df) Deviance =   43.23228
    Pearson          =  8473.526357          (1/df) Pearson  =   43.23228

    Variance function: V(u) = 1              [Gaussian]
    Link function    : g(u) = u              [Identity]
    Standard errors  : OIM

    Log likelihood   = -658.4261736          AIC             =   6.624262
                                             BIC             = 7435.056153

    ------------------------------------------------------------------------------
           write |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            read |   .3252389   .0607348     5.36   0.000     .2062009     .444277
            math |   .3974826   .0664037     5.99   0.000     .2673336    .5276315
          female |    5.44337   .9349987     5.82   0.000     3.610806    7.275934
           _cons |   11.89566   2.862845     4.16   0.000      6.28459    17.50674
    ------------------------------------------------------------------------------

    logit hon read math female, nolog

    Logit estimates                          Number of obs   =        200
                                             LR chi2(3)      =      80.87
                                             Prob > chi2     =     0.0000
    Log likelihood = -75.209827              Pseudo R2       =     0.3496

    ------------------------------------------------------------------------------
             hon |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            read |   .0752424    .027577     2.73   0.006     .0211924    .1292924
            math |   .1317117   .0324607     4.06   0.000       .06809    .1953335
          female |   1.154801   .4340856     2.66   0.008      .304009    2.005593
           _cons |  -13.12749   1.850769    -7.09   0.000    -16.75493    -9.50005
    ------------------------------------------------------------------------------
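    The logit coefficients live on the log-odds scale; a predicted probability comes from applying the inverse logit to the linear predictor. The Python sketch below is an illustration added to these notes: the coefficients are taken from the output above, while the covariate values passed in are hypothetical.

```python
import math

# Coefficients copied from the logit output above; the covariate
# values used below are hypothetical, chosen only to illustrate.
b_cons, b_read, b_math, b_female = -13.12749, 0.0752424, 0.1317117, 1.154801

def prob_honors(read, math_score, female):
    """Inverse logit: p = 1 / (1 + exp(-xb))."""
    xb = b_cons + b_read * read + b_math * math_score + b_female * female
    return 1.0 / (1.0 + math.exp(-xb))

# A higher reading score raises the predicted probability of honors.
p_low, p_high = prob_honors(50, 50, 1), prob_honors(70, 50, 1)
```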

    logit, or

    Logit estimates                          Number of obs   =        200
                                             LR chi2(3)      =      80.87
                                             Prob > chi2     =     0.0000
    Log likelihood = -75.209827              Pseudo R2       =     0.3496

    ------------------------------------------------------------------------------
             hon | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            read |   1.078145   .0297321     2.73   0.006     1.021419    1.138023
            math |   1.140779   .0370305     4.06   0.000     1.070462    1.215716
          female |   3.173393   1.377524     2.66   0.008     1.355281    7.430502
    ------------------------------------------------------------------------------
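    The odds ratios in this table are simply the exponentiated coefficients from the preceding logit output; a quick Python check (an illustration added to these notes):

```python
import math

# Exponentiating the logit coefficients reproduces the odds ratios:
# e.g., the female coefficient 1.154801 gives exp(1.154801) ~ 3.1734,
# matching the Odds Ratio column above.
coefs = {"read": 0.0752424, "math": 0.1317117, "female": 1.154801}
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}
```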

    glm hon read math female, link(logit) fam(bin) nolog

    Generalized linear models                No. of obs      =        200
    Optimization     : ML: Newton-Raphson    Residual df     =        196
                                             Scale parameter =          1
    Deviance         =  150.4196543          (1/df) Deviance =   .7674472
    Pearson          =  164.2509104          (1/df) Pearson  =   .8380148

    Variance function: V(u) = u*(1-u)        [Bernoulli]


    Link function    : g(u) = ln(u/(1-u))    [Logit]
    Standard errors  : OIM

    Log likelihood   = -75.20982717          AIC             =   .7920983
                                             BIC             = -888.0505495

    ------------------------------------------------------------------------------
             hon |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            read |   .0752424   .0275779     2.73   0.006     .0211906    .1292941
            math |   .1317117   .0324623     4.06   0.000     .0680869    .1953366
          female |   1.154801   .4341012     2.66   0.008     .3039785    2.005624
           _cons |  -13.12749   1.850893    -7.09   0.000    -16.75517   -9.499808
    ------------------------------------------------------------------------------

    glm, eform

    Generalized linear models                No. of obs      =        200
    Optimization     : ML: Newton-Raphson    Residual df     =        196
                                             Scale parameter =          1
    Deviance         =  150.4196543          (1/df) Deviance =   .7674472
    Pearson          =  164.2509104          (1/df) Pearson  =   .8380148

    Variance function: V(u) = u*(1-u)        [Bernoulli]
    Link function    : g(u) = ln(u/(1-u))    [Logit]
    Standard errors  : OIM

    Log likelihood   = -75.20982717          AIC             =   .7920983
                                             BIC             = -888.0505495

    ------------------------------------------------------------------------------
             hon | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            read |   1.078145    .029733     2.73   0.006     1.021417    1.138025
            math |   1.140779   .0370323     4.06   0.000     1.070458     1.21572
          female |   3.173393   1.377573     2.66   0.008      1.35524    7.430728
    ------------------------------------------------------------------------------

    probit hon read math female, nolog

    Probit estimates                         Number of obs   =        200
                                             LR chi2(3)      =      81.80
                                             Prob > chi2     =     0.0000
    Log likelihood = -74.745943              Pseudo R2       =     0.3537

    ------------------------------------------------------------------------------
             hon |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            read |   .0473262   .0157561     3.00   0.003     .0164449    .0782076
            math |   .0735256   .0173216     4.24   0.000     .0395759    .1074754
          female |   .6824682   .2447275     2.79   0.005     .2028112    1.162125
           _cons |  -7.663304   .9921289    -7.72   0.000    -9.607841   -5.718767
    ------------------------------------------------------------------------------

    glm hon read math female, link(probit) fam(bin) nolog

    Generalized linear models                No. of obs      =        200
    Optimization     : ML: Newton-Raphson    Residual df     =        196
                                             Scale parameter =          1
    Deviance         =  149.4918859          (1/df) Deviance =   .7627137
    Pearson          =  160.9679286          (1/df) Pearson  =   .8212649


    Variance function: V(u) = u*(1-u)        [Bernoulli]
    Link function    : g(u) = invnorm(u)     [Probit]
    Standard errors  : OIM

    Log likelihood   = -74.74594294          AIC             =   .7874594
                                             BIC             = -888.978318

    ------------------------------------------------------------------------------
             hon |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            read |   .0473262   .0157561     3.00   0.003     .0164448    .0782077
            math |   .0735256   .0173217     4.24   0.000     .0395758    .1074755
          female |   .6824681   .2447281     2.79   0.005     .2028098    1.162126
           _cons |  -7.663303   .9921345    -7.72   0.000    -9.607851   -5.718755
    ------------------------------------------------------------------------------

    use http://www.gseis.ucla.edu/courses/data/lahigh, clear

    poisson daysabs langnce gender, nolog

    Poisson regression                       Number of obs   =        316
                                             LR chi2(2)      =     171.50
                                             Prob > chi2     =     0.0000
    Log likelihood = -1549.8567              Pseudo R2       =     0.0524

    ------------------------------------------------------------------------------
         daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         langnce |    -.01467   .0012934   -11.34   0.000    -.0172051   -.0121349
          gender |  -.4093528   .0482192    -8.49   0.000    -.5038606   -.3148449
           _cons |   2.646977   .0697764    37.94   0.000     2.510217    2.783736
    ------------------------------------------------------------------------------

    poisson, irr

    Poisson regression                       Number of obs   =        316
                                             LR chi2(2)      =     171.50
                                             Prob > chi2     =     0.0000
    Log likelihood = -1549.8567              Pseudo R2       =     0.0524

    ------------------------------------------------------------------------------
         daysabs |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         langnce |   .9854371   .0012746   -11.34   0.000      .982942    .9879384
          gender |   .6640799   .0320214    -8.49   0.000     .6041936    .7299021
    ------------------------------------------------------------------------------
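    As with odds ratios in the logit model, the incidence-rate ratios here are just the exponentiated poisson coefficients; a short Python check (an illustration added to these notes):

```python
import math

# IRR = exp(coefficient); using the coefficients from the poisson output:
irr_langnce = math.exp(-0.01467)     # ~ .9854, as in the IRR column
irr_gender  = math.exp(-0.4093528)   # ~ .6641, as in the IRR column
```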

    glm daysabs langnce gender, link(log) fam(poisson) nolog

    Generalized linear models                No. of obs      =        316
    Optimization     : ML: Newton-Raphson    Residual df     =        313
                                             Scale parameter =          1
    Deviance         =  2238.317597          (1/df) Deviance =   7.151174
    Pearson          =  2752.913231          (1/df) Pearson  =    8.79525

    Variance function: V(u) = u              [Poisson]
    Link function    : g(u) = ln(u)          [Log]
    Standard errors  : OIM


    Log likelihood   = -1549.85665           AIC             =   9.828207
                                             BIC             = 436.7702841

    ------------------------------------------------------------------------------
         daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         langnce |    -.01467   .0012934   -11.34   0.000    -.0172051   -.0121349
          gender |  -.4093528   .0482192    -8.49   0.000    -.5038606   -.3148449
           _cons |   2.646977   .0697764    37.94   0.000     2.510217    2.783736
    ------------------------------------------------------------------------------

    glm, eform

    Generalized linear models                No. of obs      =        316
    Optimization     : ML: Newton-Raphson    Residual df     =        313
                                             Scale parameter =          1
    Deviance         =  2238.317597          (1/df) Deviance =   7.151174
    Pearson          =  2752.913231          (1/df) Pearson  =    8.79525

    Variance function: V(u) = u              [Poisson]
    Link function    : g(u) = ln(u)          [Log]
    Standard errors  : OIM

    Log likelihood   = -1549.85665           AIC             =   9.828207
                                             BIC             = 436.7702841

    ------------------------------------------------------------------------------
         daysabs |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         langnce |   .9854371   .0012746   -11.34   0.000      .982942    .9879384
          gender |   .6640799   .0320214    -8.49   0.000     .6041936    .7299021
    ------------------------------------------------------------------------------

    nbreg daysabs langnce gender, nolog

    Negative binomial regression             Number of obs   =        316
                                             LR chi2(2)      =      20.63
                                             Prob > chi2     =     0.0000
    Log likelihood = -880.9274               Pseudo R2       =     0.0116

    ------------------------------------------------------------------------------
         daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         langnce |  -.0156493   .0039485    -3.96   0.000    -.0233882   -.0079104
          gender |  -.4312069   .1396913    -3.09   0.002    -.7049968   -.1574169
           _cons |    2.70344   .2292762    11.79   0.000     2.254067    3.152813
    -------------+----------------------------------------------------------------
        /lnalpha |     .25394    .095509                      .0667457    .4411342
    -------------+----------------------------------------------------------------
           alpha |   1.289094   .1231201                      1.069024    1.554469
    ------------------------------------------------------------------------------
    Likelihood ratio test of alpha=0: chibar2(01) = 1337.86  Prob>=chibar2 = 0.000
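    Note that nbreg reports the overdispersion parameter both on the log scale (/lnalpha) and on the natural scale (alpha); the latter is just the exponentiated former, as this small check illustrates (an addition to these notes, using the values from the output above):

```python
import math

# alpha is estimated on the log scale and then exponentiated:
lnalpha = 0.25394              # /lnalpha from the nbreg output above
alpha = math.exp(lnalpha)      # ~ 1.2891, the reported alpha
```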

    glm daysabs langnce gender, link(log) fam(nbin) nolog

    Generalized linear models                No. of obs      =        316
    Optimization     : ML: Newton-Raphson    Residual df     =        313
                                             Scale parameter =          1
    Deviance         =  425.603464           (1/df) Deviance =   1.359755
    Pearson          =  415.6288036          (1/df) Pearson  =   1.327888


    Variance function: V(u) = u+(1)u^2       [Neg. Binomial]
    Link function    : g(u) = ln(u)          [Log]
    Standard errors  : OIM

    Log likelihood   = -884.4953535          AIC             =   5.617059
                                             BIC             = -1375.943849

    ------------------------------------------------------------------------------
         daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         langnce |  -.0156357   .0035438    -4.41   0.000    -.0225814   -.0086899
          gender |  -.4307736   .1253082    -3.44   0.001    -.6763732    -.185174
           _cons |   2.702606   .2052709    13.17   0.000     2.300282    3.104929
    ------------------------------------------------------------------------------

    glm, eform

    Generalized linear models                No. of obs      =        316
    Optimization     : ML: Newton-Raphson    Residual df     =        313
                                             Scale parameter =          1
    Deviance         =  425.603464           (1/df) Deviance =   1.359755
    Pearson          =  415.6288036          (1/df) Pearson  =   1.327888

    Variance function: V(u) = u+(1)u^2       [Neg. Binomial]
    Link function    : g(u) = ln(u)          [Log]
    Standard errors  : OIM

    Log likelihood   = -884.4953535          AIC             =   5.617059
                                             BIC             = -1375.943849

    ------------------------------------------------------------------------------
         daysabs |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         langnce |   .9844859   .0034888    -4.41   0.000     .9776716    .9913477
          gender |    .650006   .0814511    -3.44   0.001     .5084577    .8309596
    ------------------------------------------------------------------------------

    glm daysabs langnce gender, fam(gamma) link(log) nolog

    Generalized linear models                No. of obs      =        316
    Optimization     : ML: Newton-Raphson    Residual df     =        313
                                             Scale parameter =   1.583724
    Deviance         =  251.8270233          (1/df) Deviance =   .8045592
    Pearson          =  495.7055497          (1/df) Pearson  =   1.583724

    Variance function: V(u) = u^2            [Gamma]
    Link function    : g(u) = ln(u)          [Log]
    Standard errors  : OIM

    Log likelihood   = -856.2487643          AIC             =   5.438283
                                             BIC             = -1549.72029

    ------------------------------------------------------------------------------
         daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         langnce |  -.0156852   .0040626    -3.86   0.000    -.0236478   -.0077226
          gender |  -.4326492   .1443719    -3.00   0.003    -.7156129   -.1496854
           _cons |   2.705757   .2383799    11.35   0.000     2.238541    3.172973
    ------------------------------------------------------------------------------

    glm, eform


    Generalized linear models                No. of obs      =        316
    Optimization     : ML: Newton-Raphson    Residual df     =        313
                                             Scale parameter =   1.583724
    Deviance         =  251.8270233          (1/df) Deviance =   .8045592
    Pearson          =  495.7055497          (1/df) Pearson  =   1.583724

    Variance function: V(u) = u^2            [Gamma]
    Link function    : g(u) = ln(u)          [Log]
    Standard errors  : OIM

    Log likelihood   = -856.2487643          AIC             =   5.438283
                                             BIC             = -1549.72029

    ------------------------------------------------------------------------------
         daysabs |       ExpB   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         langnce |   .9844372   .0039994    -3.86   0.000     .9766296    .9923071
          gender |   .6487881   .0936668    -3.00   0.003     .4888924    .8609788
    ------------------------------------------------------------------------------

    Categorical Data Analysis Course

    Phil Ender

    http://www.gseis.ucla.edu/courses/ed231c/231c.html
    http://www.gseis.ucla.edu/ender/