Fisher Information for GLM


    MSH3 Generalized Linear model (Part 1)

    Jennifer S.K. CHAN

    Course outline

    Part I: Generalized Linear Model

    1. Maximum Likelihood Inference

Newton-Raphson and Fisher Scoring methods; EM and Monte Carlo EM algorithms

    2. Exponential Family

Two-parameter Exponential family; ML estimation for GLM; Deviance; Quasi-likelihood; Random effects models

    3. Model Selection

Deviance for Likelihood Ratio Tests; Wald Tests; AIC and BIC; Examples

    4. Survival Analysis

Kaplan-Meier estimator; Proportional hazards models; Cox's proportional hazards model


    Contents

1 Maximum likelihood Inference
    1.1 Motivating examples
    1.2 Likelihood function
    1.3 Score vector
    1.4 Information matrix
    1.5 Newton-Raphson and Fisher Scoring methods
    1.6 Expectation Maximization (EM) algorithm
        1.6.1 Basic EM algorithm
        1.6.2 Monte Carlo EM Algorithm
    1.7 Appendix for EM algorithm


    1 Maximum likelihood Inference

1.1 Motivating examples

AIDS deaths (counts)

The numbers of deaths $Y_i$ from AIDS in Australia for three-month periods from 1983 to 1986 are shown below.

    The Poisson regression model

$$Y_i \sim \text{Poisson}(\mu_i), \quad\text{with}\quad \mu_i = \exp(a + bt_i) > 0,$$

is fitted and the maximum likelihood (ML) estimates are $\hat{a} = 0.376$ and $\hat{b} = 0.254$. For each 3-month period, there is a 29.3% ($\exp(0.254) = 1.293$) increase in expected AIDS deaths. Note that the variance increases with the mean and the log link function $g(\mu_i) = \ln(\mu_i) = \mathbf{x}_i^\top\boldsymbol\beta$ is used.

    > no=c(0,1,2,3,1,5,10,17,23,31,20,25,37,45)

    > time=c(1:14)

    > poi=glm(no~time, family=poisson(link=log))

    > summary(poi)

    Call:


    glm(formula = y ~ x, family = poisson(link = log))

    Deviance Residuals:

    Min 1Q Median 3Q Max

    -2.2502 -0.9815 -0.6770 0.2545 2.6731

    Coefficients:

    Estimate Std. Error z value Pr(>|z|)

    (Intercept) 0.37571 0.24884 1.51 0.131

x            0.25365    0.02188   11.60   <2e-16 ***

> par=poi$coeff

    > names(par)=NULL

    > par

    [1] 0.3757110 0.2536485

    > beta0=par[1]

    > beta1=par[2]

    > par(mfrow=c(2,2))

    > c1=function(time) exp(beta0+beta1*time)

> plot(time,no,pch=20,col="blue")
> curve(c1,1,14,add=TRUE)

    > title("Poisson regression")

    Mice data

Twenty-six mice were given different levels $x_i$ of a drug. Outcomes $Y_i$ are whether they responded to the drug ($Y_i = 1$) or not ($Y_i = 0$).


The logistic regression model for binary data is

$$Y_i \sim \text{Bernoulli}(\pi_i), \quad\text{with}\quad \text{logit}(\pi_i) = \ln\left(\frac{\pi_i}{1-\pi_i}\right) = a + bx_i.$$

Note that

$$\ln\left(\frac{\pi_i}{1-\pi_i}\right) = a + bx_i \iff \pi_i = \frac{e^{a+bx_i}}{1+e^{a+bx_i}}.$$

    > y=c(0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,1,1,1,1,1,1,1,1,1,1,1)

    > dose=c(0:25)/10

    > log=glm(y~dose, family=binomial(link=logit))

    > summary(log)

    Call:

    glm(formula = y ~ dose, family = binomial(link = logit))

    Deviance Residuals:

    Min 1Q Median 3Q Max

    -1.5766 -0.4757 0.1376 0.4129 2.1975

Coefficients:

Estimate Std. Error z value Pr(>|z|)


    (Intercept) -4.111 1.638 -2.510 0.0121 *

    dose 3.581 1.316 2.722 0.0065 **

    ---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance: 35.890 on 25 degrees of freedom

    Residual deviance: 17.639 on 24 degrees of freedom

    AIC: 21.639

    Number of Fisher Scoring iterations: 6

    > par=log$coeff

    > names(par)=NULL

    > par

    [1] -4.111361 3.581176

    > beta0=par[1]

    > beta1=par[2]

    > c1=function(dose) exp(beta0+beta1*dose)/(1+exp(beta0+beta1*dose))

> plot(dose,y,pch=20,col="red")

> curve(c1,0,2.5,add=TRUE)
> title("Logistic regression")

[Figure: data and fitted curves. Left panel, "Poisson regression": AIDS deaths (no) against time. Right panel, "Logistic regression": response (y) against dose.]

For parameter estimation, the nonparametric least squares (LSE), parametric maximum likelihood (ML, PML, QML, GEE, EM, MCEM, etc.) and Bayesian methods will be discussed. Kernel smoothing and other semi-parametric methods are not included. Model selection is based on the Akaike Information criterion (AIC), the Bayesian Information


    criterion (BIC) and the Deviance information criterion (DIC).

For application, the two examples analyse count data with the Poisson distribution and binary data with the Bernoulli distribution respectively. Others include categorical data with the multinomial distribution and positive continuous data with the Weibull distribution. These illustrate different data distributions under the Exponential Family. The mean of the data distribution is linked to a linear function of covariates with possibly random effects, but the variance is NOT modelled. Popular time series models with heteroskedastic variance and long memory, such as the Generalized autoregressive conditional heteroskedastic (GARCH) model and the stochastic volatility (SV) model, will not be considered.


    1.2 Likelihood function

Let $Y_1, \ldots, Y_n$ be $n$ independent random variables (rv) with probability density functions (pdf) $f_i(y_i; \boldsymbol\theta)$ depending on a vector-valued parameter $\boldsymbol\theta$. The joint density of $\mathbf{y} = (y_1, \ldots, y_n)$,

$$f(\mathbf{y}; \boldsymbol\theta) = \prod_{i=1}^n f_i(y_i; \boldsymbol\theta) = L(\boldsymbol\theta; \mathbf{y}),$$

as a function of the unknown parameter $\boldsymbol\theta$ given $\mathbf{y}$, is called the likelihood function. We often work with the logarithm of $f(\mathbf{y}; \boldsymbol\theta)$, the log-likelihood function:

$$\ell(\boldsymbol\theta; \mathbf{y}) = \ln L(\boldsymbol\theta; \mathbf{y}) = \sum_{i=1}^n \ln f_i(y_i; \boldsymbol\theta).$$

The maximum-likelihood (ML) estimator $\hat{\boldsymbol\theta}$ maximizes the log-likelihood function given the data $\mathbf{y}$, that is,

$$\ell(\hat{\boldsymbol\theta}; \mathbf{y}) \geq \ell(\boldsymbol\theta; \mathbf{y}) \quad\text{for all } \boldsymbol\theta.$$

In other words, the ML estimates make the observed data as likely as possible under the model.

Example: The log-likelihood function for the Geometric Distribution.
Consider a series of independent Bernoulli trials with a common probability of success $\pi$. The distribution of the number of failures $Y_i$ before the first success has pdf

$$\Pr(Y_i = y_i) = \pi(1-\pi)^{y_i} \quad\text{for } y_i = 0, 1, \ldots.$$

Direct calculation shows that $E(Y_i) = (1-\pi)/\pi$. The log-likelihood function given $\mathbf{y}$ is

$$\ell(\pi; \mathbf{y}) = \ln L(\pi; \mathbf{y}) = \sum_{i=1}^n [y_i \ln(1-\pi) + \ln\pi] = n[\bar{y}\ln(1-\pi) + \ln\pi],$$


where $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ is the sample mean. The fact that the log-likelihood function depends on the observations only through $\bar{y}$ shows that $\bar{y}$ is a sufficient statistic for the unknown probability $\pi$.

[Figure: log-likelihood function logl(pi), score function score(pi) and expected information function Ie(pi) for the geometric distribution when $n = 20$ and $\bar{y} = 3$.]

    > n=20

    > ym=3

> pi=c(1:100)/100
> logl=function(pi) n*(ym*log(1-pi)+log(pi))


    1.3 Score vector

The first-order derivative of the log-likelihood function, called Fisher's score function, is a vector of dimension $p$, where $p$ is the number of parameters, and is denoted by

$$u(\boldsymbol\theta) = \frac{\partial \ell(\boldsymbol\theta; \mathbf{y})}{\partial \boldsymbol\theta}.$$

For example, when $Y_i \sim N(\mu, \sigma^2)$, $u(\boldsymbol\theta) = \left(\dfrac{\partial\ell}{\partial\mu}, \dfrac{\partial\ell}{\partial\sigma^2}\right)^\top$.
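For concreteness, the two components in this normal example (a standard calculation, not spelled out on the slide) are

$$\frac{\partial\ell}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \mu), \qquad \frac{\partial\ell}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (y_i - \mu)^2.$$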

If the log-likelihood function is concave, the ML estimates can be obtained by solving the system of equations

$$u(\boldsymbol\theta) = \mathbf{0}.$$

Example: The score function for the geometric distribution.
The score function for $n$ observations from a geometric distribution is

$$u(\pi) = \frac{d\ell}{d\pi} = \frac{d}{d\pi}\, n[\bar{y}\ln(1-\pi) + \ln\pi] = n\left(\frac{1}{\pi} - \frac{\bar{y}}{1-\pi}\right).$$

Setting this equation to zero and solving for $\pi$ gives the ML estimate:

$$\frac{1}{\hat\pi} = \frac{\bar{y}}{1-\hat\pi} \;\Rightarrow\; 1 - \hat\pi = \hat\pi\bar{y} \;\Rightarrow\; \hat\pi = \frac{1}{1+\bar{y}} \quad\text{and}\quad \bar{y} = \frac{1-\hat\pi}{\hat\pi}.$$

Note that the ML estimate of the probability of success is the reciprocal of the average number of trials. The more trials it takes to get a success, the lower the estimated probability of success.

For a sample of $n = 20$ observations with a sample mean of $\bar{y} = 3$, the ML estimate is $\hat\pi = 1/(1+3) = 0.25$.
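Since the score has a unique root here, a quick numerical check (not on the slides; uniroot is base R) confirms the closed-form estimate:

# Sketch: locate the root of the geometric score numerically,
# with n = 20 and ybar = 3 as in the example.
n <- 20; ym <- 3
score <- function(pi) n*(1/pi - ym/(1 - pi))
uniroot(score, c(0.01, 0.99))$root   # approximately 0.25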


    1.4 Information matrix

It can be shown that

$$E_y\left[\frac{\partial \ell(\boldsymbol\theta)}{\partial \boldsymbol\theta}\right] = \mathbf{0} \quad\text{and}\quad E_y\left[\frac{\partial^2 \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top}\right] + E_y\left[\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\,\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta^\top}\right] = \mathbf{0}.$$

Proof: Since $\int f(\mathbf{y},\boldsymbol\theta)\, d\mathbf{y} = 1$, differentiating both sides w.r.t. $\boldsymbol\theta$ gives

$$\int \frac{\partial f(\mathbf{y},\boldsymbol\theta)}{\partial\boldsymbol\theta}\, d\mathbf{y} = \mathbf{0}
\;\Rightarrow\;
\int \frac{\partial f(\mathbf{y},\boldsymbol\theta)/\partial\boldsymbol\theta}{f(\mathbf{y},\boldsymbol\theta)}\, f(\mathbf{y},\boldsymbol\theta)\, d\mathbf{y} = \mathbf{0}
\;\Rightarrow\;
\int \frac{\partial \ln f(\mathbf{y},\boldsymbol\theta)}{\partial\boldsymbol\theta}\, f(\mathbf{y},\boldsymbol\theta)\, d\mathbf{y} = \mathbf{0}
\;\Rightarrow\;
E_y\left[\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\right] = \mathbf{0},$$

and differentiating once more,

$$\int \frac{\partial}{\partial\boldsymbol\theta^\top}\left[\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\, f(\mathbf{y},\boldsymbol\theta)\right] d\mathbf{y} = \mathbf{0}
\;\Rightarrow\;
\int \left[\frac{\partial^2 \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top}\, f(\mathbf{y},\boldsymbol\theta) + \frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\,\frac{\partial f(\mathbf{y},\boldsymbol\theta)}{\partial\boldsymbol\theta^\top}\right] d\mathbf{y} = \mathbf{0}$$
$$\;\Rightarrow\;
\int \left[\frac{\partial^2 \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top} + \frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\,\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta^\top}\right] f(\mathbf{y},\boldsymbol\theta)\, d\mathbf{y} = \mathbf{0}
\;\Rightarrow\;
E_y\left[\frac{\partial^2 \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top}\right] + E_y\left[\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\,\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta^\top}\right] = \mathbf{0}. \tag{1}$$

Hence the score function is a random vector with zero mean,

$$E_y[u(\boldsymbol\theta)] = E_y\left[\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\right] = \mathbf{0},$$

and a variance-covariance matrix given by the information matrix:

$$\text{var}[u(\boldsymbol\theta)] = E_y[u(\boldsymbol\theta)\,u(\boldsymbol\theta)^\top] = E_y\left[\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\,\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta^\top}\right] = I(\boldsymbol\theta).$$
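As a sanity check (not on the slides), both facts can be verified by simulation for the geometric model; the settings n = 20 and pi = 0.25 are assumptions for illustration:

# Sketch: Monte Carlo check that the geometric score has mean ~ 0 and
# variance ~ I_e(pi) = n/(pi^2*(1 - pi)).
set.seed(1)
n <- 20; pi0 <- 0.25
u <- replicate(5000, {
  y <- rgeom(n, pi0)                  # failures before the first success
  n*(1/pi0 - mean(y)/(1 - pi0))       # score evaluated at the true pi
})
c(mean(u), var(u), n/(pi0^2*(1 - pi0)))  # var(u) should be near 426.67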


Under mild regularity conditions, the information matrix can also be obtained as minus the expected value of the second derivatives of the log-likelihood, from (1):

$$\text{var}[u(\boldsymbol\theta)] = I(\boldsymbol\theta) = -E_y\left[\frac{\partial^2 \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top}\right].$$

Note that the Hessian matrix is

$$H(\boldsymbol\theta) = -I_o(\boldsymbol\theta) = \frac{\partial^2 \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top} = \frac{\partial u(\boldsymbol\theta)}{\partial\boldsymbol\theta^\top} \tag{2}$$

and $I_o(\boldsymbol\theta) = -\dfrac{\partial^2 \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top} = -H(\boldsymbol\theta)$ is sometimes called the observed information matrix. $I_o(\boldsymbol\theta)$ indicates the extent to which $\ell(\boldsymbol\theta)$ is peaked rather than flat: the more peaked $\ell(\boldsymbol\theta)$, the more positive $I_o(\boldsymbol\theta)$. For example, when $Y_i \sim N(\mu, \sigma^2)$,

$$\frac{\partial^2 \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top} = \begin{pmatrix} \dfrac{\partial^2\ell}{\partial\mu^2} & \dfrac{\partial^2\ell}{\partial\mu\,\partial\sigma^2} \\[2mm] \dfrac{\partial^2\ell}{\partial\sigma^2\,\partial\mu} & \dfrac{\partial^2\ell}{\partial(\sigma^2)^2} \end{pmatrix}.$$

Example: Information matrix for the geometric distribution.
Differentiating the score, we find the observed information to be

$$I_o(\pi) = -\frac{d^2\ell(\pi)}{d\pi^2} = -\frac{du(\pi)}{d\pi} = -\frac{d}{d\pi}\, n\left(\frac{1}{\pi} - \frac{\bar{y}}{1-\pi}\right) = n\left(\frac{1}{\pi^2} + \frac{\bar{y}}{(1-\pi)^2}\right).$$

To find the expected information, we substitute $\bar{y}$ by $E(\bar{Y}) = E(Y_i) = (1-\pi)/\pi$ in $I_o(\pi)$ to obtain

$$I_e(\pi) = n\left(\frac{1}{\pi^2} + \frac{(1-\pi)/\pi}{(1-\pi)^2}\right) = n\left(\frac{1}{\pi^2} + \frac{1}{\pi(1-\pi)}\right) = n\,\frac{1-\pi+\pi}{\pi^2(1-\pi)} = \frac{n}{\pi^2(1-\pi)}.$$

Note that $I_e(\pi)$, which reflects the sharpness of the peak, increases with the sample size $n$, since a larger sample provides more information and hence the log-likelihood function is sharper at the peak. When $n = 20$ and $\pi = 0.15$, the expected information is

$$I_e(0.15) = \frac{n}{\pi^2(1-\pi)} = \frac{20}{0.15^2(1-0.15)} = 1045.8.$$


If the sample mean $\bar{y} = 3$, the observed information is

$$I_o(0.15) = n\left(\frac{1}{\pi^2} + \frac{\bar{y}}{(1-\pi)^2}\right) = 20\left(\frac{1}{0.15^2} + \frac{3}{(1-0.15)^2}\right) = 971.9.$$

Substituting the ML estimate $\hat\pi = 0.25$, the expected and observed information coincide, $I_o(0.25) = I_e(0.25) = 426.7$, since $\bar{y} = (1-\hat\pi)/\hat\pi$.

> score=function(pi) n*(1/pi-ym/(1-pi))

    > Ie=function(pi) n/(pi^2*(1-pi))

    > Io=n*(1/pi^2+ym/(1-pi)^2)

>
> logl1=n*(ym*log(1-pi)+log(pi))

    > score1=n*(1/pi-ym/(1-pi))

    > Ie1=n/(pi^2*(1-pi))

    > c(pi[logl1==max(logl1)],pi[score1==0],max(logl1))

    [1] 0.25000 0.25000 -44.98681

    > c(Io[pi==0.15],Ie1[pi==0.15],Io[pi==0.25],Ie1[pi==0.25])

    [1] 971.9339 1045.7516 426.6667 426.6667

    >

    > par(mfrow=c(2,2))

> plot(logl, col="red",xlab="pi",ylab="logl(pi)")

    > points(pi[score1==0],logl1[pi==pi[score1==0]],pch=2,col="red",cex=0.6)

    > title("log-likelihood function")

> plot(score, col="red",xlab="pi",ylab="score(pi)")

    > abline(h = 0)

    > points(pi[score1==0],0,pch=2,col="red",cex=0.6)

    > title("score function")

> plot(Ie, col="red",xlab="pi",ylab="Ie(pi)")

    > title("Expected information function")


1.5 Newton-Raphson and Fisher Scoring methods

Calculation of the ML estimate often requires iterative procedures. Expanding the score function $u(\boldsymbol\theta)$ evaluated at the ML estimate $\hat{\boldsymbol\theta}$ around a trial value $\boldsymbol\theta^{(0)}$ using a first-order Taylor series gives

$$u(\hat{\boldsymbol\theta}) = u(\boldsymbol\theta^{(0)}) + \frac{\partial u(\boldsymbol\theta^{(0)})}{\partial\boldsymbol\theta}\,(\hat{\boldsymbol\theta} - \boldsymbol\theta^{(0)}) + \text{higher order terms in } (\hat{\boldsymbol\theta} - \boldsymbol\theta^{(0)}). \tag{3}$$

Ignoring higher order terms, equating (3) to zero and solving for $\hat{\boldsymbol\theta}$, we have

$$\hat{\boldsymbol\theta} \approx \boldsymbol\theta^{(0)} - \left[\frac{\partial u(\boldsymbol\theta^{(0)})}{\partial\boldsymbol\theta}\right]^{-1} u(\boldsymbol\theta^{(0)}) \tag{4}$$

since $u(\hat{\boldsymbol\theta}) = \mathbf{0}$. Then the Newton-Raphson (NR) procedure to obtain an improved estimate $\boldsymbol\theta^{(k+1)}$ using the estimate $\boldsymbol\theta^{(k)}$ at the $k$-th iteration is

$$\boldsymbol\theta^{(k+1)} = \boldsymbol\theta^{(k)} - \left[\frac{\partial^2\ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top}\right]^{-1} \left.\frac{\partial\ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\right|_{\boldsymbol\theta = \boldsymbol\theta^{(k)}}. \tag{5}$$

The iterative procedure is repeated until the difference between $\boldsymbol\theta^{(k+1)}$ and $\boldsymbol\theta^{(k)}$ is sufficiently close to zero. Then (proof as exercise)

$$\text{var}(\hat{\boldsymbol\theta}) = I_o(\hat{\boldsymbol\theta})^{-1} = \left[-\frac{\partial^2\ell(\hat{\boldsymbol\theta})}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top}\right]^{-1}.$$

For ML estimates, the log-likelihood is concave downwards and the second-order derivative $H(\hat{\boldsymbol\theta})$ is negative. The sharper the curvature (more information) of $\ell(\boldsymbol\theta)$, the more negative $H(\hat{\boldsymbol\theta})$ is, and hence the smaller the variance of the estimates, $\text{var}(\hat{\boldsymbol\theta}) = I_o(\hat{\boldsymbol\theta})^{-1} = -H(\hat{\boldsymbol\theta})^{-1}$. The NR procedure tends to converge quickly if the log-likelihood is well-behaved (close to quadratic) in a neighborhood of the ML estimate and if the starting value $\boldsymbol\theta^{(0)}$ is reasonably close to $\hat{\boldsymbol\theta}$.


An alternative procedure, first suggested by Fisher, is to replace the observed information matrix $I_o(\boldsymbol\theta)$ by its expected value $I_e(\boldsymbol\theta)$. The procedure, known as Fisher Scoring (FS), is

$$\boldsymbol\theta^{(k+1)} = \boldsymbol\theta^{(k)} - \left\{E\left[\frac{\partial^2\ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top}\right]\right\}^{-1} \left.\frac{\partial\ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\right|_{\boldsymbol\theta = \boldsymbol\theta^{(k)}}. \tag{6}$$

For multimodal distributions, both methods may converge to a local (not global) maximum.

Example: NR and FS methods for the geometric distribution.
Setting the score to zero leads to an explicit solution for the ML estimate, $\hat\pi = \dfrac{1}{1+\bar{y}}$, so no iteration is needed. For illustrative purposes, the iterative procedure is performed. Using the previous results,

$$\frac{d\ell}{d\pi} = n\left(\frac{1}{\pi} - \frac{\bar{y}}{1-\pi}\right), \qquad \frac{d^2\ell}{d\pi^2} = -n\left(\frac{1}{\pi^2} + \frac{\bar{y}}{(1-\pi)^2}\right), \qquad E\left(\frac{d^2\ell}{d\pi^2}\right) = -\frac{n}{\pi^2(1-\pi)}.$$

The Fisher scoring procedure leads to the updating formula

$$\begin{aligned}
\pi^{(k+1)} &= \pi^{(k)} - \left[E\left(\frac{d^2\ell}{d\pi^2}\right)\right]^{-1} \left.\frac{d\ell}{d\pi}\right|_{\pi = \pi^{(k)}} \\
&= \pi^{(k)} + \frac{(\pi^{(k)})^2(1-\pi^{(k)})}{n}\; n\left(\frac{1}{\pi^{(k)}} - \frac{\bar{y}}{1-\pi^{(k)}}\right) \\
&= \pi^{(k)} + (\pi^{(k)})^2(1-\pi^{(k)})\,\frac{1 - \pi^{(k)} - \pi^{(k)}\bar{y}}{\pi^{(k)}(1-\pi^{(k)})} \\
&= \pi^{(k)} + (1 - \pi^{(k)} - \pi^{(k)}\bar{y})\,\pi^{(k)}.
\end{aligned}$$

If the sample mean is $\bar{y} = 3$ and we start from $\pi^{(0)} = 0.1$, say, the procedure converges to the ML estimate $\hat\pi = 0.25$ in four iterations.

    > n=20

    > ym=3

    > pi=0.1

> result=matrix(0,10,7)
>


    > for (i in 1:10) {

    + dl=n*(1/pi-ym/(1-pi))

    + dl2=-n/(pi^2*(1-pi))

    + pi=pi-dl/dl2

    + #pi=pi+(1-pi-pi*ym)*pi

    + se=sqrt(-1/dl2)

    + l=n*(ym*log(1-pi)+log(pi))

    + step=(1-pi-pi*ym)*pi

    + result[i,]=c(i,pi,se,l,dl,dl2,step)

    + }

    > colnames(result)=c("Iter","pi","se","l","dl","dl2","step")

    > result

    Iter pi se l dl dl2 step

    [1,] 1 0.1600000 0.02121320 -47.11283 1.333333e+02 -2222.2222 5.760000e-02

    [2,] 2 0.2176000 0.03279024 -45.22528 5.357143e+01 -930.0595 2.820096e-02

    [3,] 3 0.2458010 0.04303862 -44.99060 1.522465e+01 -539.8628 4.128512e-03

    [4,] 4 0.2499295 0.04773221 -44.98681 1.812051e+00 -438.9114 7.050785e-05

    [5,] 5 0.2500000 0.04840091 -44.98681 3.009750e-02 -426.8674 1.989665e-08

    [6,] 6 0.2500000 0.04841229 -44.98681 8.489239e-06 -426.6667 1.582068e-15

    [7,] 7 0.2500000 0.04841229 -44.98681 6.661338e-13 -426.6667 0.000000e+00

    [8,] 8 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00

    [9,] 9 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00

    [10,] 10 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00

Alternatively, the Newton-Raphson procedure is

$$\begin{aligned}
\pi^{(k+1)} &= \pi^{(k)} - \left(\frac{d^2\ell}{d\pi^2}\right)^{-1} \left.\frac{d\ell}{d\pi}\right|_{\pi = \pi^{(k)}} \\
&= \pi^{(k)} + \frac{1}{n}\left(\frac{1}{(\pi^{(k)})^2} + \frac{\bar{y}}{(1-\pi^{(k)})^2}\right)^{-1} n\left(\frac{1}{\pi^{(k)}} - \frac{\bar{y}}{1-\pi^{(k)}}\right) \\
&= \pi^{(k)} + \frac{(\pi^{(k)})^2(1-\pi^{(k)})^2}{1 - 2\pi^{(k)} + (1+\bar{y})(\pi^{(k)})^2}\cdot\frac{1 - \pi^{(k)} - \pi^{(k)}\bar{y}}{\pi^{(k)}(1-\pi^{(k)})} \\
&= \pi^{(k)} + \frac{\pi^{(k)}(1-\pi^{(k)})(1 - \pi^{(k)} - \pi^{(k)}\bar{y})}{1 - 2\pi^{(k)} + (1+\bar{y})(\pi^{(k)})^2}.
\end{aligned}$$

    > n=20

    > ym=3

> pi=0.1 #starting value
> result=matrix(0,10,7)


    >

    > for (i in 1:10) {

    + dl=20*(1/pi-ym/(1-pi))

    + dl2=-20*(1/pi^2+3/(1-pi)^2)

    + pi=pi-dl/dl2

    + #pi=pi+(1-pi)*(1-pi-pi*ym)*pi/(1-2*pi+4*pi^2)

    + se=sqrt(-1/dl2)

    + l=n*(ym*log(1-pi)+log(pi))

    + step=(1-pi)*(1-pi-pi*ym)*pi/(1-2*pi+4*pi^2)

    + result[i,]=c(i,pi,se,l,dl,dl2,step)

    + }

    > colnames(result)=c("Iter","pi","se","l","dl","dl2","step")

    > result

    Iter pi se l dl dl2 step

    [1,] 1 0.1642857 0.02195775 -46.89107 1.333333e+02 -2074.0741 6.039726e-02

    [2,] 2 0.2246830 0.03477490 -45.13029 4.994426e+01 -826.9292 2.344114e-02

[3,] 3 0.2481241 0.04490170 -44.98756 1.162661e+01 -495.9916 1.866426e-03
[4,] 4 0.2499905 0.04816876 -44.98681 8.044145e-01 -430.9919 9.453797e-06

    [5,] 5 0.2500000 0.04841107 -44.98681 4.033823e-03 -426.6882 2.383524e-10

    [6,] 6 0.2500000 0.04841229 -44.98681 1.016970e-07 -426.6667 0.000000e+00

    [7,] 7 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00

    [8,] 8 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00

    [9,] 9 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00

    [10,] 10 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00

For both algorithms, $\pi^{(k)}$, $u(\pi^{(k)})$ and $u'(\pi^{(k)})$ converge to 0.25, 0 (slope) and $-426.6667$ (curvature) respectively. Note that the NR method, using the exact $I_o(\pi)$, may converge faster than the FS method.

    The maximization can also be done using a maximizer:

    > logl = function(pi) -20*(3*log(1-pi)+log(pi))

> pi.hat = optimize(logl, c(0, 1), tol = 0.0001)
> pi.hat

    $minimum

    [1] 0.2500143

    $objective

    [1] 44.98681
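As a further check (not on the slides), optim() can minimize the same negative log-likelihood and return the Hessian, whose inverse gives the variance estimate; the method and bounds below are assumptions:

# Sketch: minimize the negative log-likelihood with optim() and recover
# the standard error from the returned Hessian (observed information).
negll <- function(pi) -20*(3*log(1-pi) + log(pi))
fit <- optim(0.1, negll, method = "Brent", lower = 0.001, upper = 0.999,
             hessian = TRUE)
c(fit$par, sqrt(1/fit$hessian))   # approximately 0.25 and 0.0484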


    1.6 Expectation Maximization (EM) algorithm

    1.6.1 Basic EM algorithm

The Expectation-Maximization (EM) algorithm was proposed by Dempster et al. (1977). It is an iterative approach for computing the maximum likelihood estimates (MLEs) for incomplete-data problems.

Let $\mathbf{y}$ be the observed data, $\mathbf{z}$ be the latent or missing data and $\boldsymbol\theta$ be the unknown parameters to be estimated. The functions $f(\mathbf{y}|\boldsymbol\theta)$ and $f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)$ are called the observed data and complete data likelihood functions respectively. The observed data likelihood $L_o(\boldsymbol\theta) = f(\mathbf{y}|\boldsymbol\theta)$ is the expectation of $f(\mathbf{y}|\mathbf{z},\boldsymbol\theta)$ w.r.t. $f(\mathbf{z}|\boldsymbol\theta)$, that is,

$$f(\mathbf{y}|\boldsymbol\theta) = \int f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)\, d\mathbf{z} = \int f(\mathbf{y}|\mathbf{z},\boldsymbol\theta)\, f(\mathbf{z}|\boldsymbol\theta)\, d\mathbf{z} = E_{\mathbf{z}|\boldsymbol\theta}[f(\mathbf{y}|\mathbf{z},\boldsymbol\theta)].$$

To find the ML estimate, one should maximize

$$\ell_o(\boldsymbol\theta) = \ln f(\mathbf{y}|\boldsymbol\theta) = \ln \int f(\mathbf{y}|\mathbf{z},\boldsymbol\theta)\, f(\mathbf{z}|\boldsymbol\theta)\, d\mathbf{z} = \ln E_{\mathbf{z}|\boldsymbol\theta}[f(\mathbf{y}|\mathbf{z},\boldsymbol\theta)].$$

The EM algorithm maximizes $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$ (the proof is given in the appendix), which is equivalent to maximizing

$$E_{\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)}}\{\ln f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)\} = \int \ln f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)\, f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, d\mathbf{z} \tag{7}$$

given $\boldsymbol\theta^{(k)}$ in an iterative procedure. Note that it takes into account the posterior distribution of $\mathbf{z}$, i.e. $f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})$, and so it provides a framework for estimating $\mathbf{z}$ in the E-step. With the estimated $\mathbf{z}$, the M-step is simplified, whereas the classical ML method requires direct maximization of $\ell_o(\boldsymbol\theta)$, which may involve integration over $f(\mathbf{z}|\boldsymbol\theta)$, a prior distribution for $\mathbf{z}$.

The EM algorithm consists of two steps: the E-step and the M-step.

1. E-step: Evaluate the conditional expectation of the complete data log-likelihood function, $\ell_c(\boldsymbol\theta) = \ln f(\mathbf{y},\mathbf{z}^{(k)}|\boldsymbol\theta)$, by replacing $\mathbf{z}$ by $\mathbf{z}^{(k)} = E(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})$.

2. M-step: Maximize $\ell_c(\boldsymbol\theta) = \ln f(\mathbf{y},\mathbf{z}^{(k)}|\boldsymbol\theta)$ w.r.t. $\boldsymbol\theta$ to obtain $\boldsymbol\theta^{(k+1)}$. Return to the E-step with $\boldsymbol\theta^{(k+1)}$.

3. Stopping rule: Iterations (expectation of $\mathbf{z}$ given $\boldsymbol\theta^{(k)}$) within iterations (maximization of $\boldsymbol\theta$ given $\mathbf{z}^{(k)}$ for each $k$) arise, and they should stop when $\|\boldsymbol\theta^{(k+1)} - \boldsymbol\theta^{(k)}\|$ is sufficiently small. A minimal skeleton of this loop is sketched below.
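A minimal sketch of the two-step loop in R (the function names estep and mstep are placeholders, not from the slides):

# Sketch: generic EM skeleton; estep() imputes the latent data given theta,
# mstep() maximizes the complete-data log-likelihood given the imputation.
em <- function(theta, estep, mstep, tol = 1e-8, maxit = 200) {
  for (k in 1:maxit) {
    z <- estep(theta)
    theta.new <- mstep(z)
    if (max(abs(theta.new - theta)) < tol) break
    theta <- theta.new
  }
  theta
}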

Remarks

1. The EM algorithm makes use of the Principle of Data Augmentation, which states that:

   EM inference: Augment the observed data $\mathbf{y}$ with latent data $\mathbf{z}$ so that the likelihood of the complete data $f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)$ is simple, and then obtain the MLE of $\boldsymbol\theta$ based on this complete likelihood function.

   Bayesian inference: Augment the observed data $\mathbf{y}$ with latent data $\mathbf{z}$ so that the augmented posterior density $f(\boldsymbol\theta|\mathbf{y},\mathbf{z})$ is simple, and then use this simple posterior distribution in sampling the parameters $\boldsymbol\theta$.

2. The Bayesian approach simply treats $\mathbf{z}$ as another set of unknowns alongside $\boldsymbol\theta$, and so the distinction between the E and M steps disappears. Both $\boldsymbol\theta$ and $\mathbf{z}$ are optimized through a (Markov) chain one at a time.

3. The EM algorithm can be applied to different missing or incomplete-data situations, such as censored observations, random effects models, mixture models, and models with latent classes or latent variables.

4. The EM algorithm has a linear rate of convergence which depends on the proportion of information about $\boldsymbol\theta$ in $f(\mathbf{y}|\boldsymbol\theta)$ which is observed. The convergence is usually slower than the NR method.


Example: (Darwin data) The data contain two very low outliers. We consider the mixture model:

$$y_i \sim \begin{cases} N(\mu_1, \sigma^2), & p = 0.9 \\ N(\mu_2, \sigma^2), & p = 0.1 \end{cases}
\quad\text{or}\quad
y_i \sim 0.9\, N(\mu_1, \sigma^2) + 0.1\, N(\mu_2, \sigma^2).$$

Let $w_{ij}$ be the indicator that observation $i$ comes from group $j$, $j = 1, 2$, with $w_{i1} + w_{i2} = 1$. We don't know which normal distribution each observation $y_i$ comes from; in other words, $w_{ij}$ is unobserved.

In the M-step, writing $r_{ij} = y_i - \mu_j$, the complete data likelihood, log-likelihood and their first and second order derivative functions are

$$L(\boldsymbol\theta) = \prod_{i=1}^n [0.9\,\phi(y_i|\mu_1,\sigma^2)]^{w_{i1}} [0.1\,\phi(y_i|\mu_2,\sigma^2)]^{w_{i2}}$$

$$\ell(\boldsymbol\theta) = \sum_{i=1}^n \left[w_{i1}\ln 0.9 + w_{i1}\ln\phi(y_i|\mu_1,\sigma^2) + w_{i2}\ln 0.1 + w_{i2}\ln\phi(y_i|\mu_2,\sigma^2)\right]$$

$$\ln\phi(y_i|\mu_j,\sigma^2) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i - \mu_j)^2$$

$$\frac{\partial}{\partial\mu_j}\ln\phi(y_i|\mu_j,\sigma^2) = \frac{1}{\sigma^2}(y_i - \mu_j) = \frac{r_{ij}}{\sigma^2}$$

$$\frac{\partial}{\partial\sigma^2}\ln\phi(y_i|\mu_j,\sigma^2) = -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(y_i - \mu_j)^2 = \frac{1}{2\sigma^4}(r_{ij}^2 - \sigma^2)$$

$$\frac{\partial\ell(\boldsymbol\theta)}{\partial\mu_j} = \frac{\partial}{\partial\mu_j}\sum_{i=1}^n w_{ij}\ln\phi(y_i|\mu_j,\sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^n w_{ij}\, r_{ij}, \quad j = 1, 2$$

$$\frac{\partial\ell(\boldsymbol\theta)}{\partial\sigma^2} = \frac{\partial}{\partial\sigma^2}\sum_{i=1}^n \sum_{j=1}^2 w_{ij}\ln\phi(y_i|\mu_j,\sigma^2) = \frac{1}{2\sigma^4}\sum_{i=1}^n \sum_{j=1}^2 w_{ij}(r_{ij}^2 - \sigma^2)$$

$$\frac{\partial^2\ell(\boldsymbol\theta)}{\partial\mu_j^2} = -\frac{1}{\sigma^2}\sum_{i=1}^n w_{ij}, \qquad
\frac{\partial^2\ell(\boldsymbol\theta)}{\partial(\sigma^2)^2} = -\frac{1}{\sigma^6}\sum_{i=1}^n \sum_{j=1}^2 w_{ij}(r_{ij}^2 - \sigma^2) - \frac{n}{2\sigma^4}$$

$$\frac{\partial^2\ell(\boldsymbol\theta)}{\partial\mu_j\,\partial\sigma^2} = -\frac{1}{\sigma^4}\sum_{i=1}^n w_{ij}\, r_{ij}, \qquad
\frac{\partial^2\ell(\boldsymbol\theta)}{\partial\mu_1\,\partial\mu_2} = 0.$$

In the E-step, the conditional expectation of $w_{i1}$ is

$$\begin{aligned}
\hat w_{i1} &= 1\cdot\Pr(W_{i1}=1|y_i) + 0\cdot\Pr(W_{i1}=0|y_i) = \Pr(W_{i1}=1|y_i) = \frac{\Pr(W_{i1}=1, y_i)}{\Pr(y_i)} = \frac{\Pr(W_{i1}=1)\Pr(y_i|W_{i1}=1)}{\Pr(y_i)} \\
&= \frac{\Pr(W_{i1}=1)\Pr(y_i|W_{i1}=1)}{\Pr(W_{i1}=1)\Pr(y_i|W_{i1}=1) + \Pr(W_{i1}=0)\Pr(y_i|W_{i1}=0)} = \frac{0.9\,\phi(y_i|\mu_1,\sigma^2)}{0.9\,\phi(y_i|\mu_1,\sigma^2) + 0.1\,\phi(y_i|\mu_2,\sigma^2)}.
\end{aligned}$$

    > y=c(-67,-48,6,8,14,16,23,24,28,29,41,49,56,60,75)

    > n=length(y)

    > p=3 #no. of par.

    > iterE=5

    > iterM=10

> dim1=iterE*iterM
> dim2=2*p+3

    > dl=c(rep(0,p))

    > result=matrix(0,dim1,dim2)

    > theta=c(30,-37,729) #starting values

    >

    > for (k in 1:iterE) { # E-step

    + ew1=0.9*exp(-0.5*(y-theta[1])^2/theta[3])

    + ew2=0.1*exp(-0.5*(y-theta[2])^2/theta[3])

+ w1=ew1/(ew1+ew2)
+ w1m=mean(w1)
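The remaining lines of this loop (the slides iterate an NR step within the M-step, as the trace columns below suggest) are not shown in this transcript. For fixed mixing weights 0.9/0.1 and common variance, a closed-form M-step sketch given the posterior weights would be:

# Sketch (assumed, not the slides' code): closed-form M-step updates for the
# two-component normal mixture with known weights and common variance,
# using w1 and y from the code above.
w2 <- 1 - w1
mu1 <- sum(w1*y)/sum(w1)
mu2 <- sum(w2*y)/sum(w2)
sigma2 <- sum(w1*(y - mu1)^2 + w2*(y - mu2)^2)/length(y)
theta <- c(mu1, mu2, sigma2)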


    [18,] 2 8 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096

    [19,] 2 9 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096

    [20,] 2 10 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096

    [21,] 3 1 32.99249 5.507838 -57.40463 14.03871 384.9492 145.69396 -71.91673

[22,] 3 2 32.99113 5.441860 -57.39541 13.86992 385.1901 140.51959 -71.91425
[23,] 3 3 32.99113 5.443563 -57.39540 13.87426 385.1903 140.65153 -71.91425

    [24,] 3 4 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425

    [25,] 3 5 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425

    [26,] 3 6 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425

    [27,] 3 7 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425

    [28,] 3 8 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425

    [29,] 3 9 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425

    [30,] 3 10 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425

    [31,] 4 1 32.99290 5.443541 -57.41186 13.87464 384.8527 140.71323 -71.90744

    [32,] 4 2 32.99290 5.441156 -57.41185 13.86856 384.8531 140.52829 -71.90744

    [33,] 4 3 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744

    [34,] 4 4 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744

    [35,] 4 5 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744

    [36,] 4 6 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744

    [37,] 4 7 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744

    [38,] 4 8 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744

    [39,] 4 9 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744

    [40,] 4 10 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744

    [41,] 5 1 32.99296 5.441157 -57.41245 13.86859 384.8414 140.53061 -71.90720

    [42,] 5 2 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720

    [43,] 5 3 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720

    [44,] 5 4 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720

    [45,] 5 5 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720

    [46,] 5 6 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720

    [47,] 5 7 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720

    [48,] 5 8 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720

    [49,] 5 9 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720

    [50,] 5 10 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720

    > w=cbind(w1,w2)

    > w

w1 w2
[1,] 2.315067e-05 9.999768e-01
[2,] 2.004754e-03 9.979952e-01
[3,] 9.984605e-01 1.539495e-03

    [4,] 9.990371e-01 9.629220e-04

    [5,] 9.997646e-01 2.353932e-04

    [6,] 9.998528e-01 1.471616e-04

    [7,] 9.999716e-01 2.842587e-05

    [8,] 9.999775e-01 2.247488e-05

    [9,] 9.999912e-01 8.782696e-06

    [10,] 9.999931e-01 6.944001e-06

    [11,] 9.999996e-01 4.143677e-07

    [12,] 9.999999e-01 6.327540e-08

    [13,] 1.000000e+00 1.222089e-08

    [14,] 1.000000e+00 4.775591e-09

[15,] 1.000000e+00 1.408457e-10
> mean(w1)

    [1] 0.866605

From $(\hat w_{i1}, \hat w_{i2})$, the first two observations belong to group 2 while all the others belong to group 1. Hence the EM method enables classification, like cluster analysis, an advantage over the classical likelihood method where the missing data $w_{ij}$ are integrated out, as the observed data likelihood

$$L_o(\boldsymbol\theta) = \prod_{i=1}^n \left[0.9\,\phi(y_i|\mu_1,\sigma^2) + 0.1\,\phi(y_i|\mu_2,\sigma^2)\right] \tag{8}$$


is a marginal mixture of two distributions and contains no missing observations.
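As a check (not on the slides), the observed-data log-likelihood (8) can be evaluated directly; the theta values below are taken from the final EM iterate in the trace above:

# Sketch: observed-data log-likelihood of the two-component mixture,
# evaluated at the EM estimates (mu1, mu2, sigma2) from the trace above.
y <- c(-67,-48,6,8,14,16,23,24,28,29,41,49,56,60,75)
theta <- c(32.99296, -57.41245, 384.8414)
lo <- sum(log(0.9*dnorm(y, theta[1], sqrt(theta[3])) +
              0.1*dnorm(y, theta[2], sqrt(theta[3]))))
lo   # about -71.9, matching the "l" column of the EM trace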

    > x=rep(-0.001,n)

    > x1=seq(-120,100,0.1)

    > fx1=dnorm(x1,theta[1],sqrt(theta[3]))

    > fx2=dnorm(x1,theta[2],sqrt(theta[3]))

    > fx=0.9*fx1+0.1*fx2

    > plot(x1, fx1, xlab="x", ylab="f(x)", ylim=c(-0.001,0.025),

    xlim=c(-120,100), pch=20, col="red",cex=0.5)

> points(x1,fx2,pch=20,col="blue",cex=0.5)

    > points(x1,fx,pch=20,cex=0.5)

    > points(y,x,pch=20,cex=0.8)

    > title("Mixture of normal distributions for Darwin data")

[Figure: "Mixture of normal distributions for Darwin data"; the two component densities and the mixture density f(x) plotted against x, with the data points along the axis.]

Note that this is a mixture model where the two component densities are represented by the blue and red lines. The mixture density in (8) is represented by the black line.


Example: (Right-Censored Data) with Darwin data.
Suppose that the first four observations ($c_i$, $i = 1, \ldots, 4$) are right censored ($y_i > c_i$) and we assume that

$$z_i,\ i = 1, \ldots, 4; \quad y_i,\ i = 5, \ldots, n \;\sim\; N(\mu, \sigma^2).$$

Let $\boldsymbol\theta = (\mu, \sigma^2)$, $\mathbf{z} = (z_1, \ldots, z_4)$ and $\mathbf{y} = (y_5, \ldots, y_n)$. Then

$$\ell_c(\boldsymbol\theta) = \ln f(\mathbf{z},\mathbf{y}|\boldsymbol\theta) = -\frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^4 (z_i - \mu)^2 - \frac{1}{2\sigma^2}\sum_{i=5}^n (y_i - \mu)^2.$$

For the censored observations $z_i > c_i$, $i = 1, \ldots, 4$, the conditional distribution is a truncated normal on $(c_i, \infty)$, with density function

$$f(z|\mu,\sigma,c_i) = \frac{\phi(z|\mu,\sigma^2)}{1 - \Phi\left(\frac{c_i-\mu}{\sigma}\right)}, \quad z > c_i, \tag{9}$$

where $\phi$ and $\Phi$ are the pdf and cdf of the normal distribution. Let $\boldsymbol\theta^{(k)} = (\mu^{(k)}, \sigma^{2(k)})$ be the current estimates of $\boldsymbol\theta$.

In the E-step, the conditional expectation of $z_i$, $i = 1, \ldots, 4$ given $\mathbf{y}$, $\boldsymbol\theta^{(k)}$ and $c_i$ is

$$z_i^{(k)} = E(z_i|\mathbf{y},\boldsymbol\theta^{(k)},c_i) = \int_{c_i}^\infty z\, f(z|\mu^{(k)},\sigma^{(k)},c_i)\, dz = \mu^{(k)} + \sigma^{(k)}\,\frac{\phi\left(\frac{c_i-\mu^{(k)}}{\sigma^{(k)}}\right)}{1 - \Phi\left(\frac{c_i-\mu^{(k)}}{\sigma^{(k)}}\right)}
\quad\text{or}\quad \frac{1}{S}\sum_{j=1}^S z_{ij}^{(k)},$$

since

$$\frac{1}{\sqrt{2\pi}}\int_c^\infty z\exp\left(-\frac{1}{2}z^2\right) dz = -\frac{1}{\sqrt{2\pi}}\left[\exp\left(-\frac{1}{2}z^2\right)\right]_c^\infty = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}c^2\right) = \phi(c),$$

where $z_{ij}^{(k)}$, $j = 1, \ldots, S$, is simulated from $f(z|\mu^{(k)},\sigma^{2(k)},c_i)$ in (9) in the Monte Carlo approximation of the conditional expectation.

In the M-step, $z_i^{(k)}$ is substituted for the censored observation $z_i$. With the complete data $(\mathbf{z}^{(k)},\mathbf{y})$, $\mu$ and $\sigma^2$ of the normal data distribution are given by their sample mean and sample variance. Hence no iteration is required for the M-step.


    > library("msm")

    > cy=c(6,8,14,16,23,24,28,29,41,49,56,60,75) #first 4 obs are censored

    > w=c(0,0,0,0,1,1,1,1,1,1,1,1,1)

    > n=length(cy)

    > S=10000 #sim 10000 z for z hat

    > m=4 #first 4 censored

> p=2 #2 parameters

    > cen=cy[1:m] #censored obs

    > y=cy[(m+1):n] #uncensored obs

    > iterE=10

    > dim=p+m+1

    > result=matrix(0,iterE,dim)

    > simz=matrix(0,m,S)

    > z=rep(0,m)

    > theta=c(mean(cy),var(cy)) #starting value for mu & sigma2

    >

    > for (k in 1:iterE) { #E-step

    +

    + for (j in 1:m) {

+ # simz[j,]=rtnorm(S,mean=theta[1],sd=sqrt(theta[2]),lower=cen[j],upper=Inf) #monte carlo approx. of E(Z|Z>c)
+ # z[j]=mean(simz[j,])

    +

    + cz=(cen[j]-theta[1])/sqrt(theta[2])

    + z[j]=theta[1]+dnorm(cz)*sqrt(theta[2])/(1-pnorm(cz)) #exact

    + }

    + yr=c(z,y)

    + theta[1]=mean(yr) #M-step

    + theta[2]=(sum(yr^2)-sum(yr)^2/n)/n

+ result[k,]=c(k,theta[1],theta[2],z[1],z[2],z[3],z[4])
+

    + }

    > colnames(result)=c("iE","mu","sigma2","ez1","ez2","ez3","ez4")

    > print(result,digit=5) #monte carlo approx. of E(Z|Z>c)

    iE mu sigma2 ez1 ez2 ez3 ez4

    [1,] 1 41.630 211.67 37.262 37.842 40.054 41.036

    [2,] 2 42.647 208.03 41.943 42.259 42.455 42.755

    [3,] 3 42.921 208.09 42.540 43.007 43.614 43.810

[4,] 4 43.017 208.11 43.281 43.385 43.655 43.900
[5,] 5 43.040 208.17 43.154 43.441 43.733 44.188


    [6,] 6 43.048 208.21 42.985 43.371 44.071 44.201

    [7,] 7 43.046 208.18 43.158 43.473 43.689 44.277

    [8,] 8 43.036 208.14 43.393 43.317 43.721 44.039

    [9,] 9 43.061 208.19 43.375 43.298 43.968 44.156

    [10,] 10 43.071 208.22 43.336 43.364 43.810 44.411

    > print(result,digit=5) #exact E(Z|Z>c)

    iE mu sigma2 ez1 ez2 ez3 ez4

    [1,] 1 41.659 211.47 37.377 37.996 40.180 41.018

    [2,] 2 42.660 208.05 41.948 42.062 42.638 42.932

    [3,] 3 42.930 208.06 42.889 42.983 43.478 43.737

    [4,] 4 43.006 208.12 43.148 43.239 43.718 43.969

    [5,] 5 43.027 208.14 43.221 43.311 43.785 44.035

    [6,] 6 43.033 208.15 43.242 43.332 43.805 44.054

    [7,] 7 43.035 208.15 43.248 43.337 43.810 44.059

    [8,] 8 43.035 208.15 43.249 43.339 43.812 44.061

    [9,] 9 43.036 208.15 43.250 43.339 43.812 44.061

    [10,] 10 43.036 208.15 43.250 43.340 43.812 44.061

The convergence using the Monte Carlo approximation is subject to random error in the simulation. Parameter estimates are given by the averages over iterations.


    1.6.2 Monte Carlo EM Algorithm

Given the current guess $\boldsymbol\theta^{(k)}$ to the posterior mode, the conditional expectation in the E-step may involve integration and can be calculated using Monte Carlo (MC) approximation. Similarly, the complete data log-likelihood function $\ell_c(\boldsymbol\theta) = \ln f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)$ can also be approximated using MC approximation:

$$\ell_c(\boldsymbol\theta) = \ln f(\mathbf{y},\mathbf{z}|\boldsymbol\theta) \approx \frac{1}{S}\sum_{j=1}^S \ln f(\mathbf{y},\mathbf{z}_j^{(k)}|\boldsymbol\theta) \tag{10}$$

where $\mathbf{z}_1^{(k)}, \ldots, \mathbf{z}_S^{(k)} \sim f(\mathbf{z}|\boldsymbol\theta^{(k)},\mathbf{y})$ as required in the E-step. This maximizes an average of log-likelihoods $\ln f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)$ based on simulated values, which is different from maximizing $\ln f(\mathbf{y},\bar{\mathbf{z}}|\boldsymbol\theta)$ where $\bar{\mathbf{z}}$ is the average of the simulated values. Then, in the M-step, we maximize $\ell_c(\boldsymbol\theta)$ in (10) to obtain a new guess, $\boldsymbol\theta^{(k+1)}$.

Monitoring of convergence: Plot each component of $\boldsymbol\theta^{(k)}$ against the iteration number $k$.
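As a small illustration of the MC E-step (assumed values; rtnorm is from the msm package used below), the simulated mean of a truncated normal can be compared with the exact formula from the previous example:

# Sketch: Monte Carlo vs exact mean of N(mu, sigma^2) truncated below at c.
library(msm)
set.seed(1)
mu <- 40; s <- 20; cc <- 6
zmc <- mean(rtnorm(10000, mean = mu, sd = s, lower = cc))  # MC approximation
cz  <- (cc - mu)/s
zex <- mu + s*dnorm(cz)/(1 - pnorm(cz))                    # exact mean
c(zmc, zex)   # the two should agree closely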

Example: (Right-Censored Data) Consider the Darwin data again. In the E-step, the conditional expectation of $z_i$ given $\mathbf{y}$, $\boldsymbol\theta^{(k)}$ and $c_i$ is given by (10) and estimated by drawing a sample $z_{i1}^{(k)}, z_{i2}^{(k)}, \ldots, z_{iS}^{(k)}$ from the truncated normal $f(z_i|\mu^{(k)},\sigma^{(k)},c_i)$ in (9) at the current estimates $\boldsymbol\theta^{(k)} = (\mu^{(k)},\sigma^{(k)})$.

In the M-step, one obtains an MC approximation to $\ln f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)$ by

$$\begin{aligned}
\ell_c(\boldsymbol\theta) &= \frac{1}{S}\sum_{j=1}^S \left[-\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left(\sum_{i=1}^4 (z_{ij}^{(k)} - \mu)^2 + \sum_{i=5}^n (y_i - \mu)^2\right)\right] \\
&= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left[\sum_{i=1}^4 \frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu)^2 + \sum_{i=5}^n (y_i - \mu)^2\right]
\end{aligned}$$

and maximizes it w.r.t. $\boldsymbol\theta$ to obtain $\boldsymbol\theta^{(k+1)}$ through iterations instead of a closed-form solution. Write $r_i = y_i - \mu$, $i = 5, \ldots, n$, $r_{ij} = z_{ij}^{(k)} - \mu$,


$\bar{z}_i = \frac{1}{S}\sum_{j=1}^S z_{ij}^{(k)}$ and $\bar{r}_i = \bar{z}_i - \mu$, $i = 1, \ldots, 4$. Then

$$\frac{\partial\ell(\boldsymbol\theta)}{\partial\mu} = \frac{1}{\sigma^2}\left[\sum_{i=1}^4 \frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu) + \sum_{i=5}^n (y_i - \mu)\right] = \frac{1}{\sigma^2}\sum_{i=1}^n \bar{r}_i,$$

$$\frac{\partial\ell(\boldsymbol\theta)}{\partial\sigma^2} = \frac{1}{2\sigma^4}\left[\sum_{i=1}^4 \frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu)^2 + \sum_{i=5}^n (y_i - \mu)^2\right] - \frac{n}{2\sigma^2} = \frac{1}{2\sigma^4}\left[\sum_{i=1}^4 \left(\frac{1}{S}\sum_{j=1}^S r_{ij}^2 - \sigma^2\right) + \sum_{i=5}^n (r_i^2 - \sigma^2)\right].$$

Since

$$\frac{1}{S}\sum_{j=1}^S r_{ij}^2 = \frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu)^2 \;\neq\; \left(\frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu)\right)^2 = \left(\frac{1}{S}\sum_{j=1}^S r_{ij}\right)^2 = \bar{r}_i^2,$$

the closed-form solution using the sample mean and sample variance cannot be used. Furthermore,

$$\frac{\partial^2\ell(\boldsymbol\theta)}{\partial\mu^2} = -\frac{n}{\sigma^2},$$

$$\frac{\partial^2\ell(\boldsymbol\theta)}{\partial(\sigma^2)^2} = -\frac{1}{\sigma^6}\left[\sum_{i=1}^4 \frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu)^2 + \sum_{i=5}^n (y_i - \mu)^2\right] + \frac{n}{2\sigma^4} = -\frac{1}{\sigma^6}\left[\sum_{i=1}^4 \left(\frac{1}{S}\sum_{j=1}^S r_{ij}^2 - \sigma^2\right) + \sum_{i=5}^n (r_i^2 - \sigma^2)\right] - \frac{n}{2\sigma^4},$$

$$\frac{\partial^2\ell(\boldsymbol\theta)}{\partial\mu\,\partial\sigma^2} = -\frac{1}{\sigma^4}\sum_{i=1}^n \bar{r}_i.$$

> library("msm")

> cy=c(6,8,14,16,23,24,28,29,41,49,56,60,75) #first 4 obs are censoring times

    > w=c(0,0,0,0,1,1,1,1,1,1,1,1,1)

    > mean(cy)

    [1] 33

    > n=length(cy)

    > T=10000 #sim 10000 z for z hat

> m=4 #first 4 censored obs
> p=2 #2 pars


    > cen=cy[1:m] #censored obs

    > y=cy[(m+1):n] #uncensored obs

    > iterE=5

    > iterM=10

    > dim1=iterE*iterM

    > dim2=2*p+7

    > dl=c(rep(0,p))

    > dl2=matrix(0,p,p)

    > result=matrix(0,dim1,dim2)

    > simz=matrix(0,m,T)

    > z=matrix(0,m,1)

    > rz=rep(0,m)

    > r2z=rep(0,m)

    > theta=c(40,400) #starting values for mu & var

    >

    > for (k in 1:iterE) { #E-step

    + for (j in 1:m) {

+ simz[j,]=rtnorm(T,mean=theta[1],sd=sqrt(theta[2]),lower=cen[j],upper=Inf)

    + z[j]=mean(simz[j,])

+ }
+

    + for (i in 1:iterM) { #M-step

    + rz=z-theta[1]

    + ry=y-theta[1]

    + r=c(rz,ry)

    + r2z=apply((simz-theta[1])^2,1,mean)

    + r2=c(r2z,ry^2)

    + s2=r2-theta[2]

+ dl[1]=sum(r)/theta[2]
+ dl[2]=0.5*sum(s2)/theta[2]^2

    + dl2[1,1]=-n/theta[2]

    + dl2[2,2]=-sum(s2)/theta[2]^3-0.5*n/theta[2]^2

    + dl2[2,1]=dl2[1,2]=-sum(r)/theta[2]^2

    + dl2i=solve(dl2)

    + theta=theta-dl2i%*%dl

    + se=sqrt(diag(-dl2i))

    + l=-n*log(2*pi*theta[2])/2-sum(r^2)/(2*theta[2]) #pi=3.141593

+ row=(k-1)*10+i
+ result[row,]=c(k,i,theta[1],se[1],theta[2],se[2],l,z[1],z[2],


    z[3],z[4])

    + }

    + }

> colnames(result)=c("iE","iM","mu","se","sigma2","se","logL","ez1","ez2","ez3","ez4")

    > print(result,digit=5)

    iE iM mu se sigma2 se logL ez1 ez2 ez3 ez4

    [1,] 1 1 43.745 5.6699 288.50 160.37 -53.653 42.163 42.427 44.008 44.473

    [2,] 1 2 42.965 4.7218 301.25 113.42 -53.557 42.163 42.427 44.008 44.473

    [3,] 1 3 42.929 4.8139 301.86 118.16 -53.547 42.163 42.427 44.008 44.473

    [4,] 1 4 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473

    [5,] 1 5 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473

    [6,] 1 6 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473

    [7,] 1 7 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473

[8,] 1 8 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
[9,] 1 9 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473

    [10,] 1 10 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473

    [11,] 2 1 43.280 4.8205 287.03 118.44 -53.460 43.734 43.914 44.876 44.900

    [12,] 2 2 43.263 4.6989 287.16 112.58 -53.458 43.734 43.914 44.876 44.900

    [13,] 2 3 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900

    [14,] 2 4 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900

[15,] 2 5 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[16,] 2 6 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900

    [17,] 2 7 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900

    [18,] 2 8 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900

    [19,] 2 9 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900

    [20,] 2 10 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900

    [21,] 3 1 43.320 4.6999 284.71 112.63 -53.446 44.105 43.867 44.783 45.399

    [22,] 3 2 43.320 4.6799 284.72 111.67 -53.446 44.105 43.867 44.783 45.399

[23,] 3 3 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[24,] 3 4 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399

    [25,] 3 5 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399

    [26,] 3 6 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399

    [27,] 3 7 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399

    [28,] 3 8 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399

    [29,] 3 9 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399

    [30,] 3 10 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399

    [31,] 4 1 43.325 4.6799 284.61 111.68 -53.446 43.945 43.878 45.109 45.295

    [32,] 4 2 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295

    [33,] 4 3 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295

    [34,] 4 4 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295

    [35,] 4 5 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295

    [36,] 4 6 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295

    [37,] 4 7 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295

    [38,] 4 8 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295

    [39,] 4 9 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295

    [40,] 4 10 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295

    [41,] 5 1 43.343 4.6790 284.06 111.63 -53.443 44.072 44.284 44.957 45.149

    [42,] 5 2 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149

    [43,] 5 3 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149

    [44,] 5 4 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149

    [45,] 5 5 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149

    [46,] 5 6 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149

    [47,] 5 7 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149

    [48,] 5 8 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149

[49,] 5 9 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[50,] 5 10 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149

    References


Dempster, A.P., Laird, N. & Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38 (with discussion).

McLachlan, G.J. & Krishnan, T. (1997) The EM Algorithm and Extensions. Wiley.


    1.7 Appendix for EM algorithm

To maximize $\ell_o(\boldsymbol\theta) = \ln f(\mathbf{y}|\boldsymbol\theta)$, we wish to compute an updated estimate $\boldsymbol\theta^{(k+1)}$ such that

$$\ell_o(\boldsymbol\theta^{(k+1)}) > \ell_o(\boldsymbol\theta^{(k)}).$$

The idea is to maximize iteratively a function $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$ which is (i) bounded above by $\ell_o(\boldsymbol\theta)$ and (ii) equal to $\ell_o(\boldsymbol\theta^{(k)})$ at $\boldsymbol\theta = \boldsymbol\theta^{(k)}$. Then any $\boldsymbol\theta^{(k+1)}$ which increases $\ell_o(\boldsymbol\theta^{(k+1)}|\boldsymbol\theta^{(k)})$ also increases $\ell_o(\boldsymbol\theta^{(k+1)})$. Lastly, the EM algorithm chooses $\boldsymbol\theta^{(k+1)}$ as the value of $\boldsymbol\theta$ for which $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$ is a maximum.

To show (i), we first consider the difference

$$\begin{aligned}
\ell_o(\boldsymbol\theta^{(k+1)}) - \ell_o(\boldsymbol\theta^{(k)})
&= \ln f(\mathbf{y}|\boldsymbol\theta^{(k+1)}) - \ln f(\mathbf{y}|\boldsymbol\theta^{(k)}) \\
&= \ln \int f(\mathbf{y}|\mathbf{z},\boldsymbol\theta^{(k+1)})\, f(\mathbf{z}|\boldsymbol\theta^{(k+1)})\, d\mathbf{z} - \ln f(\mathbf{y}|\boldsymbol\theta^{(k)}) \\
&= \ln \int f(\mathbf{y}|\mathbf{z},\boldsymbol\theta^{(k+1)})\, f(\mathbf{z}|\boldsymbol\theta^{(k+1)})\, \frac{f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})}{f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})}\, d\mathbf{z} - \ln f(\mathbf{y}|\boldsymbol\theta^{(k)}) \\
&= \ln \int f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, \frac{f(\mathbf{y}|\mathbf{z},\boldsymbol\theta^{(k+1)})\, f(\mathbf{z}|\boldsymbol\theta^{(k+1)})}{f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})}\, d\mathbf{z} - \ln f(\mathbf{y}|\boldsymbol\theta^{(k)}) \\
&\geq \int f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, \ln \frac{f(\mathbf{y}|\mathbf{z},\boldsymbol\theta^{(k+1)})\, f(\mathbf{z}|\boldsymbol\theta^{(k+1)})}{f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})}\, d\mathbf{z} - \int f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, \ln f(\mathbf{y}|\boldsymbol\theta^{(k)})\, d\mathbf{z} \\
&= \int f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, \ln \frac{f(\mathbf{y}|\mathbf{z},\boldsymbol\theta^{(k+1)})\, f(\mathbf{z}|\boldsymbol\theta^{(k+1)})}{f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, f(\mathbf{y}|\boldsymbol\theta^{(k)})}\, d\mathbf{z} \;\equiv\; \Delta(\boldsymbol\theta^{(k+1)}|\boldsymbol\theta^{(k)}),
\end{aligned}$$

since $\int f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, d\mathbf{z} = 1$ and, by Jensen's inequality with $\ln(\cdot)$ concave,

$$\ln \sum_{i=1}^n \lambda_i y_i \;\geq\; \sum_{i=1}^n \lambda_i \ln(y_i).$$

Then define $\ell_o(\boldsymbol\theta^{(k+1)}|\boldsymbol\theta^{(k)})$ such that

$$\ell_o(\boldsymbol\theta^{(k+1)}) \geq \ell_o(\boldsymbol\theta^{(k)}) + \Delta(\boldsymbol\theta^{(k+1)}|\boldsymbol\theta^{(k)}) \equiv \ell_o(\boldsymbol\theta^{(k+1)}|\boldsymbol\theta^{(k)}),$$

or, writing $\boldsymbol\theta^{(k+1)} = \boldsymbol\theta$,

$$\ell_o(\boldsymbol\theta) \geq \ell_o(\boldsymbol\theta^{(k)}) + \Delta(\boldsymbol\theta|\boldsymbol\theta^{(k)}) \equiv \ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)}).$$

Hence $\ell_o(\boldsymbol\theta^{(k+1)}|\boldsymbol\theta^{(k)})$ is bounded above by $\ell_o(\boldsymbol\theta^{(k+1)})$, or $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$ is bounded above by $\ell_o(\boldsymbol\theta)$ in general.


[Diagram: the lower bound $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$ touches the log-likelihood $\ell_o(\boldsymbol\theta)$ at $\boldsymbol\theta^{(k)}$ and lies below it elsewhere; in the diagram's notation, $\boldsymbol\theta^{(k+1)}$ corresponds to $\theta_{n+1}$, $\ell_o(\boldsymbol\theta^{(k+1)}|\boldsymbol\theta^{(k)})$ to $\ell(\theta|\theta_n)$ and $\ell_o(\boldsymbol\theta^{(k+1)})$ to $L(\theta_{n+1})$.] The function $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$ is bounded above by the log-likelihood function $\ell_o(\boldsymbol\theta)$.

Next we show (ii) that $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$ and $\ell_o(\boldsymbol\theta)$ are equal at $\boldsymbol\theta = \boldsymbol\theta^{(k)}$:

$$\begin{aligned}
\ell_o(\boldsymbol\theta^{(k)}|\boldsymbol\theta^{(k)}) &= \ell_o(\boldsymbol\theta^{(k)}) + \Delta(\boldsymbol\theta^{(k)}|\boldsymbol\theta^{(k)}) \\
&= \ell_o(\boldsymbol\theta^{(k)}) + \int f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, \ln \frac{f(\mathbf{y}|\mathbf{z},\boldsymbol\theta^{(k)})\, f(\mathbf{z}|\boldsymbol\theta^{(k)})}{f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, f(\mathbf{y}|\boldsymbol\theta^{(k)})}\, d\mathbf{z} \\
&= \ell_o(\boldsymbol\theta^{(k)}) + \int f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, \ln \frac{f(\mathbf{y},\mathbf{z}|\boldsymbol\theta^{(k)})}{f(\mathbf{y},\mathbf{z}|\boldsymbol\theta^{(k)})}\, d\mathbf{z} = \ell_o(\boldsymbol\theta^{(k)}).
\end{aligned}$$

Hence any $\boldsymbol\theta^{(k+1)}$ which increases $\ell_o(\boldsymbol\theta^{(k+1)}|\boldsymbol\theta^{(k)})$ also increases $\ell_o(\boldsymbol\theta^{(k+1)})$.

Lastly, we show (iii) that the EM algorithm chooses $\boldsymbol\theta^{(k+1)}$ for which $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$ is a maximum. Since $\ell_o(\boldsymbol\theta) \geq \ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$, increasing $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$ ensures that $\ell_o(\boldsymbol\theta)$ is increased at each step. To achieve the greatest increase in $\ell_o(\boldsymbol\theta^{(k+1)})$, the EM algorithm selects $\boldsymbol\theta^{(k+1)}$ which maximizes $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$, i.e.


$$\begin{aligned}
\boldsymbol\theta^{(k+1)} &= \arg\max_{\boldsymbol\theta}\, [\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})] = \arg\max_{\boldsymbol\theta}\, [\ell_o(\boldsymbol\theta^{(k)}) + \Delta(\boldsymbol\theta|\boldsymbol\theta^{(k)})] \\
&= \arg\max_{\boldsymbol\theta} \left[\ell_o(\boldsymbol\theta^{(k)}) + \int f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, \ln \frac{f(\mathbf{y}|\mathbf{z},\boldsymbol\theta)\, f(\mathbf{z}|\boldsymbol\theta)}{f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, f(\mathbf{y}|\boldsymbol\theta^{(k)})}\, d\mathbf{z}\right] \\
&= \arg\max_{\boldsymbol\theta} \int f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, \ln [f(\mathbf{y}|\mathbf{z},\boldsymbol\theta)\, f(\mathbf{z}|\boldsymbol\theta)]\, d\mathbf{z} \quad\text{(dropping terms constant w.r.t. } \boldsymbol\theta\text{)} \\
&= \arg\max_{\boldsymbol\theta} \int \ln f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)\, f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, d\mathbf{z} = \arg\max_{\boldsymbol\theta}\, E_{\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)}}\{\ln f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)\},
\end{aligned}$$

which proves (7).