
  • Introduction to General and Generalized Linear Models

    Mixed effects models - Part IV

    Henrik Madsen, Poul Thyregod

    Informatics and Mathematical Modelling, Technical University of Denmark

    DK-2800 Kgs. Lyngby

    January 2011

  • This lecture

    General mixed effects models

    Laplace approximation

  • General mixed effects models

    Let us now look at methods to deal with nonlinear and non-normal mixed effects models.

    In general it will be impossible to obtain closed form solutions, and hence numerical methods must be used.

    Estimation and inference will be based on likelihood principles.

  • General mixed effects models

    The general mixed effects model can be represented by its likelihood function:

    $$L_M(\theta; y) = \int_{\mathbb{R}^q} L(\theta; u, y) \, du$$

    where $y$ is the observed random variables, $\theta$ is the model parameters to be estimated, and $u$ is the vector of $q$ unobserved random variables or effects.

    The likelihood function $L$ is the joint likelihood of both the observed and the unobserved random variables.

    The likelihood function for estimating $\theta$ is the marginal likelihood $L_M$, obtained by integrating out the unobserved random variables.
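
    To make the definition concrete, the sketch below evaluates this integral by numerical quadrature for a hypothetical model with a single random effect ($q = 1$): a Poisson response with a Gaussian random intercept. The model and all names (e.g. marginal_likelihood) are illustrative choices, not from the slides.

      import numpy as np
      from scipy.integrate import quad
      from scipy.stats import norm, poisson

      def marginal_likelihood(theta, y):
          """Approximate L_M(theta; y) = int L(theta; u, y) du for q = 1.

          Hypothetical model: y_j | u ~ Poisson(exp(mu + u)), u ~ N(0, sigma_u^2),
          with theta = (mu, sigma_u)."""
          mu, sigma_u = theta

          def joint(u):
              # Joint likelihood L(theta; u, y) = f(y | u; theta) * f(u; theta)
              return poisson.pmf(y, np.exp(mu + u)).prod() * norm.pdf(u, 0.0, sigma_u)

          value, _ = quad(joint, -np.inf, np.inf)
          return value

      y = np.array([3, 5, 4, 6])
      print(marginal_likelihood((np.log(4.0), 0.5), y))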

  • General mixed effects models

    The integral shown on the previous slide is generally difficult to solve if the number of unobserved random variables is more than a few, i.e. for large values of $q$.

    A large value of $q$ significantly increases the computational demands due to the product rule: if an integral is sampled in $m$ points per dimension, the total number of samples needed is $m^q$ (e.g. $m = 10$ points in $q = 10$ dimensions already requires $10^{10}$ evaluations), which rapidly becomes infeasible even for a limited number of random effects.

    The likelihood function gives a very broad definition of mixed models: the only requirement for using mixed modeling is to define a joint likelihood function for the model of interest.

    In this way mixed modeling can be applied to any likelihood-based statistical modeling.

    Examples of applications are linear mixed models (LMM), nonlinear mixed models (NLMM) and generalized linear mixed models, but also models based on Markov chains, ODEs or SDEs.

  • General mixed effects models

    Hierarchical models

    As for the Gaussian linear mixed models it is useful to formulate the model as a hierarchical model containing a first stage model

    $$f_{Y|u}(y; u, \beta)$$

    which is a model for the data given the random effects, and a second stage model

    $$f_U(u; \Psi)$$

    which is a model for the random effects. The total set of parameters is $\theta = (\beta, \Psi)$. Hence the joint likelihood is given as

    $$L(\beta, \Psi; u, y) = f_{Y|u}(y; u, \beta) \, f_U(u; \Psi)$$
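
    In code the two stages translate directly into a sum of two log-densities. A minimal sketch, again with a hypothetical Poisson first stage and a scalar Gaussian second stage (all names illustrative):

      import numpy as np
      from scipy.stats import norm, poisson

      def joint_loglik(beta, psi, u, y):
          """log L(beta, psi; u, y) = log f_{Y|u}(y; u, beta) + log f_U(u; psi).

          First stage (hypothetical choice): y_j | u ~ Poisson(exp(beta + u)).
          Second stage: u ~ N(0, psi), a scalar random effect."""
          first_stage = poisson.logpmf(y, np.exp(beta + u)).sum()
          second_stage = norm.logpdf(u, 0.0, np.sqrt(psi))
          return first_stage + second_stage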

  • General mixed effects models

    Hierarchical models

    To obtain the likelihood for the model parameters $(\beta, \Psi)$ the unobserved random effects are again integrated out.

    The likelihood function for estimating $(\beta, \Psi)$ is as before the marginal likelihood

    $$L_M(\beta, \Psi; y) = \int_{\mathbb{R}^q} L(\beta, \Psi; u, y) \, du$$

    where $q$ is the number of random effects, and $\beta$ and $\Psi$ are the parameters to be estimated.

  • General mixed effects models

    Grouping structures and nested effects

    For nonlinear mixed models where no closed form solution to the likelihood function is available it is necessary to invoke some form of numerical approximation to be able to estimate the model parameters.

    The complexity of this problem is mainly dependent on the dimensionality of the integration problem, which in turn is dependent on the dimension of $U$ and in particular the grouping structure in the data for the random effects.

    These structures include a single grouping, nested grouping, partially crossedand crossed random effects.

    For problems with only one level of grouping the marginal likelihood can be simplified as

    $$L_M(\beta, \Psi; y) = \prod_{i=1}^{M} \int_{\mathbb{R}^{q_i}} f_{Y|u_i}(y; u_i, \beta) \, f_{U_i}(u_i; \Psi) \, du_i$$

    where $q_i$ is the number of random effects for group $i$ and $M$ is the number of groups.
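
    Because the marginal likelihood factorizes over groups, each term can be computed as a separate low-dimensional integral and accumulated on the log scale. A sketch under the same hypothetical Poisson/Gaussian model with one random intercept per group ($q_i = 1$):

      import numpy as np
      from scipy.integrate import quad
      from scipy.stats import norm, poisson

      def marginal_loglik_grouped(mu, sigma_u, groups):
          """log L_M as a sum of M one-dimensional integrals, one per group."""
          total = 0.0
          for y_i in groups:
              # Integral for group i: int f_{Y|u_i}(y_i; u_i) f_{U_i}(u_i) du_i
              def integrand(u, y_i=y_i):
                  return (poisson.pmf(y_i, np.exp(mu + u)).prod()
                          * norm.pdf(u, 0.0, sigma_u))
              value, _ = quad(integrand, -np.inf, np.inf)
              total += np.log(value)
          return total

      groups = [np.array([3, 5, 4]), np.array([7, 8]), np.array([2, 3, 2, 4])]
      print(marginal_loglik_grouped(np.log(4.0), 0.5, groups))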

  • General mixed effects models

    Grouping structures and nested effects

    Instead of having to solve an integral of dimension $q$ it is only necessary to solve $M$ smaller integrals of dimension $q_i$.

    In typical applications there is often just one or only a few random effects for each group, which greatly reduces the complexity of the integration problem.

    If the data has a nested grouping structure, a reduction of the dimensionality of the integral similar to that shown on the previous slide can be performed.

    An example of a nested grouping structure is data collected from a number of schools, a number of classes within each school and a number of students from each class.

  • General mixed effects models

    Grouping structures and nested effects

    If the nonlinear mixed model is extended to include any structure of random effects, such as crossed or partially crossed random effects, it is required to evaluate the full multi-dimensional integral.

    Estimation in these models can efficiently be handled using the multivariate Laplace approximation, which only samples the integrand in one point common to all dimensions.

  • Laplace approximation

    The Laplace approximation

    For a given set of model parameters $\theta$ the joint log-likelihood $\ell(\theta, u, y) = \log(L(\theta, u, y))$ is approximated by a second order Taylor approximation around the optimum $u = \hat{u}$ of the log-likelihood function w.r.t. the unobserved random variables $u$, i.e.

    $$\ell(\theta, u, y) \approx \ell(\theta, \hat{u}, y) - \tfrac{1}{2} (u - \hat{u})^T H(\hat{u}) (u - \hat{u})$$

    where the first-order term of the Taylor expansion disappears since the expansion is done around the optimum $\hat{u}$, and $H(\hat{u}) = -\nabla^2_{uu} \ell(\theta, u, y)\big|_{u=\hat{u}}$ is the negative Hessian of the joint log-likelihood evaluated at $\hat{u}$, which will simply be referred to as the Hessian.

  • Laplace approximation

    The Laplace approximation

    Using this approximation, the Laplace approximation of the marginal log-likelihood becomes

    $$\ell_{M,LA}(\theta, y) = \log \int_{\mathbb{R}^q} \exp\left( \ell(\theta, \hat{u}, y) - \tfrac{1}{2} (u - \hat{u})^T H(\hat{u}) (u - \hat{u}) \right) du = \ell(\theta, \hat{u}, y) - \tfrac{1}{2} \log \left| \frac{H(\hat{u})}{2\pi} \right|$$

    The integral is eliminated by transforming it to an integration of a multivariate Gaussian density with mean $\hat{u}$ and covariance $H^{-1}(\hat{u})$.
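
    A generic implementation needs only the joint log-likelihood: an inner optimization locates $\hat{u}$ and the Hessian is then evaluated there. The sketch below is a simple version using scipy and a central finite-difference Hessian; joint_loglik stands for any function $u \mapsto \ell(\theta, u, y)$, such as the one sketched earlier.

      import numpy as np
      from scipy.optimize import minimize

      def laplace_marginal_loglik(joint_loglik, u0):
          """Laplace approximation of the marginal log-likelihood:
          l(theta, u_hat, y) - 0.5 * log det(H(u_hat) / (2 pi))."""
          # Inner problem: u_hat maximizes the joint log-likelihood over u
          res = minimize(lambda u: -joint_loglik(u), u0)
          u_hat = res.x

          # Negative Hessian H(u_hat) by central finite differences
          q, h = len(u_hat), 1e-4
          H = np.empty((q, q))
          for i in range(q):
              for j in range(q):
                  ei, ej = h * np.eye(q)[i], h * np.eye(q)[j]
                  H[i, j] = -(joint_loglik(u_hat + ei + ej)
                              - joint_loglik(u_hat + ei - ej)
                              - joint_loglik(u_hat - ei + ej)
                              + joint_loglik(u_hat - ei - ej)) / (4 * h * h)

          _, logdet = np.linalg.slogdet(H / (2 * np.pi))
          return -res.fun - 0.5 * logdet, u_hat, H

    For estimation, this routine would sit inside an outer optimization over $\theta$, with $\hat{u}$ re-solved at every trial value of $\theta$.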

  • Laplace approximation

    The Laplace approximation

    The Laplace likelihood only approximates the marginal likelihood for mixed models with nonlinear random effects, and thus maximizing the Laplace likelihood will result in some amount of error in the resulting estimates.

    It can be shown that the joint log-likelihood converges to a quadratic function of the random effects as the number of observations per random effect increases, and thus that the Laplace approximation is asymptotically exact.

    In practical applications the accuracy of the Laplace approximation may still be of concern, but improved numerical approximations of the marginal likelihood (such as Gaussian quadrature) may easily be computationally infeasible to perform.

    Another option for improving the accuracy is importance sampling.

  • Laplace approximation

    Two-level hierarchical model

    For the two-level or hierarchical model it is readily seen that the joint log-likelihood is

    $$\ell(\theta, u, y) = \ell(\beta, \Psi, u, y) = \log f_{Y|u}(y; u, \beta) + \log f_U(u; \Psi)$$

    which implies that the Laplace approximation becomes

    $$\ell_{M,LA}(\theta, y) = \log f_{Y|u}(y; \hat{u}, \beta) + \log f_U(\hat{u}; \Psi) - \tfrac{1}{2} \log \left| \frac{H(\hat{u})}{2\pi} \right|$$

    It is clear that as long as a likelihood function of the random effects and model parameters can be defined, it is possible to use the Laplace likelihood for estimation in a mixed model framework.
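
    Combining the earlier sketches, the two-level Laplace likelihood follows by handing the hierarchical joint log-likelihood to the generic Laplace routine (illustrative usage, reusing the hypothetical joint_loglik and laplace_marginal_loglik defined above):

      import numpy as np

      beta, psi = np.log(4.0), 0.25
      y = np.array([3, 5, 4, 6])

      # The scalar random effect enters as a length-1 vector for the generic routine
      ll, u_hat, H = laplace_marginal_loglik(
          lambda u: joint_loglik(beta, psi, u[0], y), u0=np.zeros(1))
      print(ll, u_hat)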

  • Laplace approximation

    Gaussian second stage model

    Let us assume that the second stage model is zero mean Gaussian, i.e.

    $$u \sim N(0, \Psi)$$

    which means that the random effect distribution is completely described by its covariance matrix $\Psi$.

    In this case the Laplace likelihood becomes

    $$\ell_{M,LA}(\theta, y) = \log f_{Y|u}(y; \hat{u}, \beta) - \tfrac{1}{2} \log |\Psi| - \tfrac{1}{2} \hat{u}^T \Psi^{-1} \hat{u} - \tfrac{1}{2} \log |H(\hat{u})|$$

    where it is seen that we still have no assumptions on the first stage model $f_{Y|u}(y; u, \beta)$.

  • Laplace approximation

    Gaussian second stage model

    If we furthermore assume that the first stage model is Gaussian,

    $$Y \,|\, U = u \sim N(\mu(\beta, u), \Sigma)$$

    then the Laplace likelihood can be further specified.

    For the hierarchical Gaussian model it is rather easy to obtain a numerical approximation of the Hessian $H$ at the optimum $\hat{u}$,

    $$H(\hat{u}) \approx \nabla_u \mu \, \Sigma^{-1} \, \nabla_u \mu^T + \Psi^{-1}$$

    where $\nabla_u \mu$ is the partial derivative of $\mu$ with respect to $u$.

    This approximation is called the Gauss-Newton approximation.

    In some contexts estimation using this approximation is also called the First Order Conditional Estimation (FOCE) method.
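
    In code the Gauss-Newton approximation only requires the Jacobian of $\mu$, not second derivatives. A small numpy sketch (here the Jacobian J is stored as an n-by-q matrix, so the product appears transposed relative to the slide's notation; all names are illustrative):

      import numpy as np

      def gauss_newton_hessian(J, Sigma, Psi):
          """H(u_hat) ~= J^T Sigma^{-1} J + Psi^{-1} for the Gaussian-Gaussian model.

          J     : (n, q) Jacobian of mu(beta, u) w.r.t. u, evaluated at u_hat
          Sigma : (n, n) first stage covariance
          Psi   : (q, q) random effect covariance
          """
          return J.T @ np.linalg.solve(Sigma, J) + np.linalg.inv(Psi)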

  • Laplace approximation

    Automatic differentiation

    A simple and efficient way to use the Laplace approximation technique outlined above is via the open source software package AD Model Builder, which takes advantage of automatic differentiation.

    Any calculation done via a computer program can be broken down to a long chain of simple operations like +, −, ×, /, exp, log, sin, cos, tan, √, and so on.

    It is simple to write down the analytical derivative of each of these operations by themselves.

    If our log-likelihood function consisted of only a few of these simple operations, then it would be tractable to use the chain rule $(f \circ g)'(x) = f'(g(x)) \, g'(x)$ to find the analytical gradient of the log-likelihood function.

  • Laplace approximation

    Automatic differentiation

    Automatic differentiation is a technique where the chain rule is used by the computer program itself.

    When the program evaluates the log-likelihood it keeps track of all the operations used along the way, and then runs the program backwards (reverse mode automatic differentiation) and uses the chain rule to update the derivatives one simple operation at a time.

    Automatic differentiation is accurate, and the computational cost of evaluating the gradient is surprisingly low.
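
    A toy sketch of this idea: each elementary operation records its inputs and local derivatives, and a backward sweep in reverse topological order applies the chain rule one operation at a time. Only +, * and log are implemented here; this illustrates the principle, not how AD Model Builder is implemented.

      import math

      class Var:
          """A node in the computation graph: a value plus the local
          derivatives w.r.t. the nodes it was computed from."""
          def __init__(self, value, parents=()):
              self.value, self.parents, self.grad = value, parents, 0.0

          def __add__(self, other):
              return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

          def __mul__(self, other):
              return Var(self.value * other.value,
                         [(self, other.value), (other, self.value)])

      def log(x):
          return Var(math.log(x.value), [(x, 1.0 / x.value)])

      def backward(out):
          """Reverse sweep: visit nodes in reverse topological order and
          accumulate gradients via the chain rule."""
          order, seen = [], set()
          def topo(node):
              if id(node) not in seen:
                  seen.add(id(node))
                  for parent, _ in node.parents:
                      topo(parent)
                  order.append(node)
          topo(out)
          out.grad = 1.0
          for node in reversed(order):
              for parent, local in node.parents:
                  parent.grad += node.grad * local

      # Gradient of f(x, y) = log(x) * (x + y) at (2, 3):
      # df/dx = log(x) + (x + y)/x, df/dy = log(x)
      x, y = Var(2.0), Var(3.0)
      f = log(x) * (x + y)
      backward(f)
      print(x.grad, y.grad)  # 3.1931..., 0.6931...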

  • Laplace approximation

    Automatic differentiation

    Theorem

    The computational cost of evaluating the gradient of the log-likelihood with reverse mode automatic differentiation is less than four times the computational cost of evaluating the log-likelihood function itself. This holds no matter how many parameters the model contains.

    It is surprising that the computational cost does not depend on how many parameters the model contains.

    There is however a practical concern. The computational cost mentioned above is measured in the number of operations, but reverse mode automatic differentiation requires all the intermediate variables in the calculation of the negative log-likelihood to be stored in the computer's memory, so if the calculation is lengthy, for instance consisting of a long iterative procedure, then the memory requirements can be enormous.

  • Laplace approximation

    Automatic differentiation combined with the Laplace approximation

    Finding the gradient of the Laplace approximation of the marginal log-likelihood is challenging, because the approximation itself includes the result of a function minimization, and not just a straightforward sequence of simple operations.

    It is however possible, but it requires up to third order derivatives to be computed internally by clever successive application of automatic differentiation.

  • Laplace approximation

    Importance sampling

    Importance sampling is a re-weighting technique for approximating integrals w.r.t. a density $f$ by simulation, in cases where it is not feasible to simulate from the distribution with density $f$.

    Instead it uses samples from a different distribution with density $g$, where the support of $g$ includes the support of $f$.

    For general mixed effects models it is possible to simulate from the distribution with density proportional to the second order Taylor approximation

    $$\tilde{L}(\theta, u, Y) = \exp\left\{ \ell(\theta, \hat{u}, Y) - \tfrac{1}{2} (u - \hat{u})^T \left( -\nabla^2_{uu} \ell(\theta, u, Y)\big|_{u=\hat{u}} \right) (u - \hat{u}) \right\}$$

    as, apart from a normalization constant, it is the density $\phi_{\hat{u},V}(u)$ of a multivariate normal with mean $\hat{u}$ and covariance $V = H^{-1}(\hat{u}) = \left( -\nabla^2_{uu} \ell(\theta, u, Y)\big|_{u=\hat{u}} \right)^{-1}$.

  • Laplace approximation

    Importance sampling

    The integral to be approximated can be rewritten as

    $$L_M(\theta, Y) = \int L(\theta, u, Y) \, du = \int \frac{L(\theta, u, Y)}{\phi_{\hat{u},V}(u)} \, \phi_{\hat{u},V}(u) \, du.$$

    So if $u^{(i)}$, $i = 1, \ldots, N$, are simulated from the multivariate normal distribution with mean $\hat{u}$ and covariance $V$, then the integral can be approximated by the mean of the importance weights:

    $$L_M(\theta, Y) \approx \frac{1}{N} \sum_{i=1}^{N} \frac{L(\theta, u^{(i)}, Y)}{\phi_{\hat{u},V}(u^{(i)})}$$
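
    A sketch of the estimator, assuming $\hat{u}$ and $V$ are already available from the Laplace step and joint_loglik(u) returns $\ell(\theta, u, Y)$; the weights are handled on the log scale to avoid underflow (names illustrative):

      import numpy as np
      from scipy.stats import multivariate_normal

      def importance_log_marginal(joint_loglik, u_hat, V, N=10_000, seed=0):
          """log of L_M(theta; Y) ~= (1/N) sum_i L(theta, u_i, Y) / phi(u_i),
          with u_i ~ N(u_hat, V), the Laplace-based proposal."""
          proposal = multivariate_normal(mean=u_hat, cov=V)
          u = proposal.rvs(size=N, random_state=seed).reshape(N, -1)
          # Log importance weights: log L(theta, u_i, Y) - log phi(u_i)
          log_w = np.array([joint_loglik(u_i) for u_i in u]) - proposal.logpdf(u)
          # Stable log-mean-exp of the weights
          m = log_w.max()
          return m + np.log(np.mean(np.exp(log_w - m)))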

  • Laplace approximation

    AD Model Builder

    AD Model Builder is a programming language that builds on C++.

    It includes helper functions for reading in data, defining model parameters, and implementing and optimizing the negative log-likelihood function.

    The central feature is automatic differentiation (AD), which is implemented in such a way that the user rarely has to think about it at all.

    AD Model Builder can be used for fixed effects models, but in addition it includes the Laplace approximation and importance sampling for dealing with general mixed effects models.

    AD Model Builder is developed by Dr. Dave Fournier and was a commercial product for many years. Recently AD Model Builder has been placed in the public domain (see http://www.admb-project.org).
