
  • Introduction to General and Generalized Linear Models

    Mixed effects models - Part IV

    Henrik Madsen, Poul Thyregod

    Informatics and Mathematical Modelling, Technical University of Denmark

    DK-2800 Kgs. Lyngby

    January 2011

  • This lecture

    General mixed effects models

    Laplace approximation

  • General mixed effects models

    Let us now look at methods to deal with nonlinear and non-normal mixed effects models.

    In general it will be impossible to obtain closed form solutions, and hence numerical methods must be used.

    Estimation and inference will be based on likelihood principles.

  • General mixed effects models

    The general mixed effects model can be represented by its likelihood function:

    $$L_M(\theta; y) = \int_{\mathbb{R}^q} L(\theta; u, y) \, du$$

    where $y$ is the observed random variables, $\theta$ is the model parameters to be estimated, and $u$ is the vector of $q$ unobserved random variables or effects.

    The likelihood function $L$ is the joint likelihood of both the observed and the unobserved random variables.

    The likelihood function for estimating $\theta$ is the marginal likelihood $L_M$, obtained by integrating out the unobserved random variables.
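
    To make the definition concrete, the sketch below evaluates this integral by numerical quadrature for a hypothetical model with a single random effect ($q = 1$): a Poisson response with a Gaussian random intercept. The model and all names (e.g. marginal_likelihood) are illustrative choices, not from the slides.

      import numpy as np
      from scipy.integrate import quad
      from scipy.stats import norm, poisson

      def marginal_likelihood(theta, y):
          """Approximate L_M(theta; y) = int L(theta; u, y) du for q = 1.

          Hypothetical model: y_j | u ~ Poisson(exp(mu + u)), u ~ N(0, sigma_u^2),
          with theta = (mu, sigma_u)."""
          mu, sigma_u = theta

          def joint(u):
              # Joint likelihood L(theta; u, y) = f(y | u; theta) * f(u; theta)
              return poisson.pmf(y, np.exp(mu + u)).prod() * norm.pdf(u, 0.0, sigma_u)

          value, _ = quad(joint, -np.inf, np.inf)
          return value

      y = np.array([3, 5, 4, 6])
      print(marginal_likelihood((np.log(4.0), 0.5), y))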

  • General mixed effects models

    The integral shown on the previous slide is generally difficult to solve if the number of unobserved random variables is more than a few, i.e. for large values of $q$.

    A large value of $q$ significantly increases the computational demands due to the product rule: if an integral is sampled in $m$ points per dimension, the total number of samples needed is $m^q$ (e.g. $m = 10$ points in $q = 10$ dimensions already requires $10^{10}$ evaluations), which rapidly becomes infeasible even for a limited number of random effects.

    The likelihood function gives a very broad definition of mixed models: the only requirement for using mixed modeling is to define a joint likelihood function for the model of interest.

    In this way mixed modeling can be applied to any likelihood-based statistical modeling.

    Examples of applications are linear mixed models (LMM), nonlinear mixed models (NLMM) and generalized linear mixed models, but also models based on Markov chains, ODEs or SDEs.

  • General mixed effects models

    Hierarchical models

    As for the Gaussian linear mixed models it is useful to formulate the model as a hierarchical model containing a first stage model

    $$f_{Y|u}(y; u, \beta)$$

    which is a model for the data given the random effects, and a second stage model

    $$f_U(u; \Psi)$$

    which is a model for the random effects. The total set of parameters is $\theta = (\beta, \Psi)$. Hence the joint likelihood is given as

    $$L(\beta, \Psi; u, y) = f_{Y|u}(y; u, \beta) \, f_U(u; \Psi)$$
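
    In code the two stages translate directly into a sum of two log-densities. A minimal sketch, again with a hypothetical Poisson first stage and a scalar Gaussian second stage (all names illustrative):

      import numpy as np
      from scipy.stats import norm, poisson

      def joint_loglik(beta, psi, u, y):
          """log L(beta, psi; u, y) = log f_{Y|u}(y; u, beta) + log f_U(u; psi).

          First stage (hypothetical choice): y_j | u ~ Poisson(exp(beta + u)).
          Second stage: u ~ N(0, psi), a scalar random effect."""
          first_stage = poisson.logpmf(y, np.exp(beta + u)).sum()
          second_stage = norm.logpdf(u, 0.0, np.sqrt(psi))
          return first_stage + second_stage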

  • General mixed effects models

    Hierarchical models

    To obtain the likelihood for the model parameters $(\beta, \Psi)$ the unobserved random effects are again integrated out.

    The likelihood function for estimating $(\beta, \Psi)$ is as before the marginal likelihood

    $$L_M(\beta, \Psi; y) = \int_{\mathbb{R}^q} L(\beta, \Psi; u, y) \, du$$

    where $q$ is the number of random effects, and $\beta$ and $\Psi$ are the parameters to be estimated.

  • General mixed effects models

    Grouping structures and nested effects

    For nonlinear mixed models where no closed form solution to the likelihood function is available it is necessary to invoke some form of numerical approximation to be able to estimate the model parameters.

    The complexity of this problem is mainly dependent on the dimensionality of the integration problem, which in turn is dependent on the dimension of $U$ and in particular the grouping structure in the data for the random effects.

    These structures include a single grouping, nested grouping, partially crossedand crossed random effects.

    For problems with only one level of grouping the marginal likelihood can be simplified as

    $$L_M(\beta, \Psi; y) = \prod_{i=1}^{M} \int_{\mathbb{R}^{q_i}} f_{Y|u_i}(y; u_i, \beta) \, f_{U_i}(u_i; \Psi) \, du_i$$

    where $q_i$ is the number of random effects for group $i$ and $M$ is the number of groups.
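
    Because the marginal likelihood factorizes over groups, each term can be computed as a separate low-dimensional integral and accumulated on the log scale. A sketch under the same hypothetical Poisson/Gaussian model with one random intercept per group ($q_i = 1$):

      import numpy as np
      from scipy.integrate import quad
      from scipy.stats import norm, poisson

      def marginal_loglik_grouped(mu, sigma_u, groups):
          """log L_M as a sum of M one-dimensional integrals, one per group."""
          total = 0.0
          for y_i in groups:
              # Integral for group i: int f_{Y|u_i}(y_i; u_i) f_{U_i}(u_i) du_i
              def integrand(u, y_i=y_i):
                  return (poisson.pmf(y_i, np.exp(mu + u)).prod()
                          * norm.pdf(u, 0.0, sigma_u))
              value, _ = quad(integrand, -np.inf, np.inf)
              total += np.log(value)
          return total

      groups = [np.array([3, 5, 4]), np.array([7, 8]), np.array([2, 3, 2, 4])]
      print(marginal_loglik_grouped(np.log(4.0), 0.5, groups))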

  • General mixed effects models

    Grouping structures and nested effects

    Instead of having to solve an integral of dimension $q$ it is only necessary to solve $M$ smaller integrals of dimension $q_i$.

    In typical applications there is often just one or only a few random effects for each group, which greatly reduces the complexity of the integration problem.

    If the data has a nested grouping structure, a reduction of the dimensionality of the integral similar to that shown on the previous slide can be performed.

    An example of a nested grouping structure is data collected from a number of schools, a number of classes within each school and a number of students from each class.

  • General mixed effects models

    Grouping structures and nested effects

    If the nonlinear mixed model is extended to include any structure of random effects, such as crossed or partially crossed random effects, it is required to evaluate the full multi-dimensional integral.

    Estimation in these models can efficiently be handled using the multivariate Laplace approximation, which only samples the integrand in one point common to all dimensions.

  • Laplace approximation

    The Laplace approximation

    For a given set of model parameters $\theta$ the joint log-likelihood $\ell(\theta, u, y) = \log(L(\theta, u, y))$ is approximated by a second order Taylor approximation around the optimum $u = \hat{u}$ of the log-likelihood function w.r.t. the unobserved random variables $u$, i.e.

    $$\ell(\theta, u, y) \approx \ell(\theta, \hat{u}, y) - \tfrac{1}{2} (u - \hat{u})^T H(\hat{u}) (u - \hat{u})$$

    where the first-order term of the Taylor expansion disappears since the expansion is done around the optimum $\hat{u}$, and $H(\hat{u}) = -\nabla^2_{uu} \ell(\theta, u, y)\big|_{u=\hat{u}}$ is the negative Hessian of the joint log-likelihood evaluated at $\hat{u}$, which will simply be referred to as the Hessian.

  • Laplace approximation

    The Laplace approximation

    Using this approximation, the Laplace approximation of the marginal log-likelihood becomes

    $$\ell_{M,LA}(\theta, y) = \log \int_{\mathbb{R}^q} \exp\left( \ell(\theta, \hat{u}, y) - \tfrac{1}{2} (u - \hat{u})^T H(\hat{u}) (u - \hat{u}) \right) du = \ell(\theta, \hat{u}, y) - \tfrac{1}{2} \log \left| \frac{H(\hat{u})}{2\pi} \right|$$

    The integral is eliminated by transforming it to an integration of a multivariate Gaussian density with mean $\hat{u}$ and covariance $H^{-1}(\hat{u})$.
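
    A generic implementation needs only the joint log-likelihood: an inner optimization locates $\hat{u}$ and the Hessian is then evaluated there. The sketch below is a simple version using scipy and a central finite-difference Hessian; joint_loglik stands for any function $u \mapsto \ell(\theta, u, y)$, such as the one sketched earlier.

      import numpy as np
      from scipy.optimize import minimize

      def laplace_marginal_loglik(joint_loglik, u0):
          """Laplace approximation of the marginal log-likelihood:
          l(theta, u_hat, y) - 0.5 * log det(H(u_hat) / (2 pi))."""
          # Inner problem: u_hat maximizes the joint log-likelihood over u
          res = minimize(lambda u: -joint_loglik(u), u0)
          u_hat = res.x

          # Negative Hessian H(u_hat) by central finite differences
          q, h = len(u_hat), 1e-4
          H = np.empty((q, q))
          for i in range(q):
              for j in range(q):
                  ei, ej = h * np.eye(q)[i], h * np.eye(q)[j]
                  H[i, j] = -(joint_loglik(u_hat + ei + ej)
                              - joint_loglik(u_hat + ei - ej)
                              - joint_loglik(u_hat - ei + ej)
                              + joint_loglik(u_hat - ei - ej)) / (4 * h * h)

          _, logdet = np.linalg.slogdet(H / (2 * np.pi))
          return -res.fun - 0.5 * logdet, u_hat, H

    For estimation, this routine would sit inside an outer optimization over $\theta$, with $\hat{u}$ re-solved at every trial value of $\theta$.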

  • Laplace approximation

    The Laplace approximation

    The Laplace likelihood only approximates the marginal likelihood for mixed models with nonlinear random effects, and thus maximizing the Laplace likelihood will result in some amount of error in the resulting estimates.

    It can be shown that the joint log-likelihood converges to a quadratic function of the random effects as the number of observations per random effect increases, and thus that the Laplace approximation is asymptotically exact.

    In practical applications the accuracy of the Laplace approximation may still be of concern, but improved numerical approximations of the marginal likelihood (such as Gaussian quadrature) may easily be computationally infeasible to perform.

    Another option for improving the accuracy is importance sampling.

  • Laplace approximation

    Two-level hierarchical model

    For the two-level or hierarchical model it is readily seen that the joint log-likelihood is

    $$\ell(\theta, u, y) = \ell(\beta, \Psi, u, y) = \log f_{Y|u}(y; u, \beta) + \log f_U(u; \Psi)$$

    which implies that the Laplace approximation becomes

    $$\ell_{M,LA}(\theta, y) = \log f_{Y|u}(y; \hat{u}, \beta) + \log f_U(\hat{u}; \Psi) - \tfrac{1}{2} \log \left| \frac{H(\hat{u})}{2\pi} \right|$$

    It is clear that as long as a likelihood function of the random effects and model parameters can be defined, it is possible to use the Laplace likelihood for estimation in a mixed model framework.
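
    Combining the earlier sketches, the two-level Laplace likelihood follows by handing the hierarchical joint log-likelihood to the generic Laplace routine (illustrative usage, reusing the hypothetical joint_loglik and laplace_marginal_loglik defined above):

      import numpy as np

      beta, psi = np.log(4.0), 0.25
      y = np.array([3, 5, 4, 6])

      # The scalar random effect enters as a length-1 vector for the generic routine
      ll, u_hat, H = laplace_marginal_loglik(
          lambda u: joint_loglik(beta, psi, u[0], y), u0=np.zeros(1))
      print(ll, u_hat)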

  • Laplace approximation

    Gaussian second stage model

    Let us assume that the second stage model is zero mean Gaussian, i.e.

    $$u \sim N(0, \Psi)$$

    which means that the random effect distribution is completely described by its covariance matrix $\Psi$.

    In this case the Laplace likelihood becomes

    $$\ell_{M,LA}(\theta, y) = \log f_{Y|u}(y; \hat{u}, \beta) - \tfrac{1}{2} \log |\Psi| - \tfrac{1}{2} \hat{u}^T \Psi^{-1} \hat{u} - \tfrac{1}{2} \log |H(\hat{u})|$$

    where it is seen that we still have no assumptions on the first stage model $f_{Y|u}(y; u, \beta)$.

  • Laplace approximation

    Gaussian second stage model

    If we furthermore assume that the first stage model is Gaussian,

    $$Y \,|\, U = u \sim N(\mu(\beta, u), \Sigma)$$

    then the Laplace likelihood can be further specified.

    For the hierarchical Gaussian model it is rather easy to obtain a numerical approximation of the Hessian $H$ at the optimum $\hat{u}$,

    $$H(\hat{u}) \approx \nabla_u \mu \, \Sigma^{-1} \, \nabla_u \mu^T + \Psi^{-1}$$

    where $\nabla_u \mu$ is the partial derivative of $\mu$ with respect to $u$.

    This approximation is called the Gauss-Newton approximation.

    In some contexts estimation using this approximation is also called the First Order Conditional Estimation (FOCE) method.
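
    In code the Gauss-Newton approximation only requires the Jacobian of $\mu$, not second derivatives. A small numpy sketch (here the Jacobian J is stored as an n-by-q matrix, so the product appears transposed relative to the slide's notation; all names are illustrative):

      import numpy as np

      def gauss_newton_hessian(J, Sigma, Psi):
          """H(u_hat) ~= J^T Sigma^{-1} J + Psi^{-1} for the Gaussian-Gaussian model.

          J     : (n, q) Jacobian of mu(beta, u) w.r.t. u, evaluated at u_hat
          Sigma : (n, n) first stage covariance
          Psi   : (q, q) random effect covariance
          """
          return J.T @ np.linalg.solve(Sigma, J) + np.linalg.inv(Psi)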

  • Laplace approximation

    Automatic differentiation

    A simple and efficient way to use the Laplace approximation technique outlined above is via the open source software package AD Model Builder, which takes advantage of automatic differentiation.

    Any calculation done via a computer program can be broken down to a long chain of simple operations like +, −, ×, /, exp, log, sin, cos, tan, √, and so on.

    It is simple to write down the analytical derivative of each of these operations by themselves.

    If our log-likelihood function consisted of only a few of these simple operations, then it would be tractable to use the chain rule $(f \circ g)'(x) = f'(g(x)) \, g'(x)$ to find the analytical gradient of the log-likelihood function.

  • Laplace approximation

    Automatic differentiation

    Automatic differentiation is a technique where the chain rule is used by the computer program itself.

    When the program evaluates the log-likelihood it keeps track of all the operations used along the way, and then runs the program backwards (reverse mode automatic differentiation) and uses the chain rule to update the derivatives one simple operation at a time.

    Automatic differentiation is accurate, and the computational cost of evaluating the gradient is surprisingly low.
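
    A toy sketch of this idea: each elementary operation records its inputs and local derivatives, and a backward sweep in reverse topological order applies the chain rule one operation at a time. Only +, * and log are implemented here; this illustrates the principle, not how AD Model Builder is implemented.

      import math

      class Var:
          """A node in the computation graph: a value plus the local
          derivatives w.r.t. the nodes it was computed from."""
          def __init__(self, value, parents=()):
              self.value, self.parents, self.grad = value, parents, 0.0

          def __add__(self, other):
              return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

          def __mul__(self, other):
              return Var(self.value * other.value,
                         [(self, other.value), (other, self.value)])

      def log(x):
          return Var(math.log(x.value), [(x, 1.0 / x.value)])

      def backward(out):
          """Reverse sweep: visit nodes in reverse topological order and
          accumulate gradients via the chain rule."""
          order, seen = [], set()
          def topo(node):
              if id(node) not in seen:
                  seen.add(id(node))
                  for parent, _ in node.parents:
                      topo(parent)
                  order.append(node)
          topo(out)
          out.grad = 1.0
          for node in reversed(order):
              for parent, local in node.parents:
                  parent.grad += node.grad * local

      # Gradient of f(x, y) = log(x) * (x + y) at (2, 3):
      # df/dx = log(x) + (x + y)/x, df/dy = log(x)
      x, y = Var(2.0), Var(3.0)
      f = log(x) * (x + y)
      backward(f)
      print(x.grad, y.grad)  # 3.1931..., 0.6931...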

  • Laplace approximation

    Automatic differentiation

    Theorem

    The computational cost of evaluating the gradient of the log-likelihood with reverse mode automatic differentiation is less than four times the computational cost of evaluating the log-likelihood function itself. This holds no matter how many parameters the model contains.

    It is surprising that the computational cost does not depend on how many parameters the model contains.

    There is however a practical concern. The computational cost mentioned above is measured in the number of operations, but reverse mode automatic differentiation requires all the intermediate variables in the calculation of the negative log-likelihood to be stored in the computer's memory, so if the calculation is lengthy, for instance consisting of a long iterative procedure, then the memory requirements can be enormous.

  • Laplace approximation

    Automatic differentiation combined with the Laplace approximation

    Finding the gradient of the Laplace approximation of the marginal log-likelihood is challenging, because the approximation itself includes the result of a function minimization, and not just a straightforward sequence of simple operations.

    It is however possible, but it requires up to third order derivatives to be computed internally by clever successive application of automatic differentiation.

  • Laplace approximation

    Importance sampling

    Importance sampling is a re-weighting technique for approximating integrals w.r.t. a density $f$ by simulation, in cases where it is not feasible to simulate from the distribution with density $f$.

    Instead it uses samples from a different distribution with density $g$, where the support of $g$ includes the support of $f$.

    For general mixed effects models it is possible to simulate from the distribution with density proportional to the second order Taylor approximation

    $$\tilde{L}(\theta, u, Y) = \exp\left\{ \ell(\theta, \hat{u}, Y) - \tfrac{1}{2} (u - \hat{u})^T \left( -\nabla^2_{uu} \ell(\theta, u, Y)\big|_{u=\hat{u}} \right) (u - \hat{u}) \right\}$$

    as, apart from a normalization constant, it is the density $\phi_{\hat{u},V}(u)$ of a multivariate normal with mean $\hat{u}$ and covariance $V = H^{-1}(\hat{u}) = \left( -\nabla^2_{uu} \ell(\theta, u, Y)\big|_{u=\hat{u}} \right)^{-1}$.

  • Laplace approximation

    Importance sampling

    The integral to be approximated can be rewritten as

    $$L_M(\theta, Y) = \int L(\theta, u, Y) \, du = \int \frac{L(\theta, u, Y)}{\phi_{\hat{u},V}(u)} \, \phi_{\hat{u},V}(u) \, du.$$

    So if $u^{(i)}$, $i = 1, \ldots, N$, are simulated from the multivariate normal distribution with mean $\hat{u}$ and covariance $V$, then the integral can be approximated by the mean of the importance weights:

    $$L_M(\theta, Y) \approx \frac{1}{N} \sum_{i=1}^{N} \frac{L(\theta, u^{(i)}, Y)}{\phi_{\hat{u},V}(u^{(i)})}$$
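
    A sketch of the estimator, assuming $\hat{u}$ and $V$ are already available from the Laplace step and joint_loglik(u) returns $\ell(\theta, u, Y)$; the weights are handled on the log scale to avoid underflow (names illustrative):

      import numpy as np
      from scipy.stats import multivariate_normal

      def importance_log_marginal(joint_loglik, u_hat, V, N=10_000, seed=0):
          """log of L_M(theta; Y) ~= (1/N) sum_i L(theta, u_i, Y) / phi(u_i),
          with u_i ~ N(u_hat, V), the Laplace-based proposal."""
          proposal = multivariate_normal(mean=u_hat, cov=V)
          u = proposal.rvs(size=N, random_state=seed).reshape(N, -1)
          # Log importance weights: log L(theta, u_i, Y) - log phi(u_i)
          log_w = np.array([joint_loglik(u_i) for u_i in u]) - proposal.logpdf(u)
          # Stable log-mean-exp of the weights
          m = log_w.max()
          return m + np.log(np.mean(np.exp(log_w - m)))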

  • Laplace approximation

    AD Model Builder

    AD Model Builder is a programming language that builds on C++.

    It includes helper functions for reading in data, defining model parameters, and implementing and optimizing the negative log-likelihood function.

    The central feature is automatic differentiation (AD), which is implemented in such a way that the user rarely has to think about it at all.

    AD Model Builder can be used for fixed effects models, but in addition it includes the Laplace approximation and importance sampling for dealing with general mixed effects models.

    AD Model Builder is developed by Dr. Dave Fournier and was a commercial product for many years. Recently AD Model Builder has been placed in the public domain (see http://www.admb-project.org).
