Fisher Information for GLM


    MSH3 Generalized Linear model (Part 1)

    Jennifer S.K. CHAN

    Course outline

    Part I: Generalized Linear Model

    1. Maximum Likelihood Inference

Newton-Raphson and Fisher Scoring methods; EM and Monte Carlo EM algorithms

    2. Exponential Family

Two-parameter Exponential family; ML estimation for GLM; Deviance; Quasi-likelihood; Random effects models

    3. Model Selection

Deviance for Likelihood Ratio Tests; Wald Tests; AIC and BIC; Examples

    4. Survival Analysis

Kaplan-Meier estimator; Proportional hazards models; Cox's proportional hazards model


    Contents

1 Maximum likelihood Inference
    1.1 Motivating examples
    1.2 Likelihood function
    1.3 Score vector
    1.4 Information matrix
    1.5 Newton-Raphson and Fisher Scoring methods
    1.6 Expectation Maximization (EM) algorithm
        1.6.1 Basic EM algorithm
        1.6.2 Monte Carlo EM Algorithm
    1.7 Appendix for EM algorithm


    1 Maximum likelihood Inference

1.1 Motivating examples

AIDS deaths (counts)

The numbers of deaths $Y_i$ from AIDS in Australia for three-month periods from 1983 to 1986 are shown below.

    The Poisson regression model

$$Y_i \sim \text{Poisson}(\mu_i), \quad\text{with}\quad \mu_i = \exp(a + bt_i) > 0,$$

is fitted and the maximum likelihood (ML) estimates are $\hat{a} = 0.376$ and $\hat{b} = 0.254$. For each 3-month period, there is a 29.3% ($\exp(0.254) = 1.293$) increase in expected AIDS deaths. Note that the variance increases with the mean and the log link function $g(\mu_i) = \ln(\mu_i) = \mathbf{x}_i^\top\boldsymbol\beta$ is used.

    > no=c(0,1,2,3,1,5,10,17,23,31,20,25,37,45)

    > time=c(1:14)

    > poi=glm(no~time, family=poisson(link=log))

    > summary(poi)

    Call:


    glm(formula = y ~ x, family = poisson(link = log))

    Deviance Residuals:

    Min 1Q Median 3Q Max

    -2.2502 -0.9815 -0.6770 0.2545 2.6731

    Coefficients:

    Estimate Std. Error z value Pr(>|z|)

    (Intercept) 0.37571 0.24884 1.51 0.131

x            0.25365    0.02188   11.60   <2e-16 ***

> par=poi$coeff

    > names(par)=NULL

    > par

    [1] 0.3757110 0.2536485

    > beta0=par[1]

    > beta1=par[2]

    > par(mfrow=c(2,2))

    > c1=function(time) exp(beta0+beta1*time)

> plot(time,no,pch=20,col="blue")
> curve(c1,1,14,add=TRUE)

    > title("Poisson regression")

    Mice data

Twenty-six mice were given different levels $x_i$ of a drug. Outcomes $Y_i$ are whether they responded to the drug ($Y_i = 1$) or not ($Y_i = 0$).


The logistic regression model for binary data is

$$Y_i \sim \text{Bernoulli}(\pi_i), \quad\text{with}\quad \text{logit}(\pi_i) = \ln\left(\frac{\pi_i}{1-\pi_i}\right) = a + bx_i.$$

Note that

$$\ln\left(\frac{\pi_i}{1-\pi_i}\right) = a + bx_i \iff \pi_i = \frac{e^{a+bx_i}}{1+e^{a+bx_i}}.$$

    > y=c(0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,1,1,1,1,1,1,1,1,1,1,1)

    > dose=c(0:25)/10

    > log=glm(y~dose, family=binomial(link=logit))

    > summary(log)

    Call:

    glm(formula = y ~ dose, family = binomial(link = logit))

    Deviance Residuals:

    Min 1Q Median 3Q Max

    -1.5766 -0.4757 0.1376 0.4129 2.1975

Coefficients:

Estimate Std. Error z value Pr(>|z|)


    (Intercept) -4.111 1.638 -2.510 0.0121 *

    dose 3.581 1.316 2.722 0.0065 **

    ---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance: 35.890 on 25 degrees of freedom

    Residual deviance: 17.639 on 24 degrees of freedom

    AIC: 21.639

    Number of Fisher Scoring iterations: 6

    > par=log$coeff

    > names(par)=NULL

    > par

    [1] -4.111361 3.581176

    > beta0=par[1]

    > beta1=par[2]

    > c1=function(dose) exp(beta0+beta1*dose)/(1+exp(beta0+beta1*dose))

> plot(dose,y,pch=20,col="red")

> curve(c1,0,2.5,add=TRUE)
> title("Logistic regression")

[Figure: data and fitted curves. Left panel, "Poisson regression": AIDS deaths (no) against time. Right panel, "Logistic regression": response (y) against dose.]

For parameter estimation, the nonparametric least squares (LSE), parametric maximum likelihood (ML, PML, QML, GEE, EM, MCEM, etc.) and Bayesian methods will be discussed. Kernel smoothing and other semi-parametric methods are not included. Model selection is based on the Akaike Information criterion (AIC), the Bayesian Information


    criterion (BIC) and the Deviance information criterion (DIC).

For application, the two examples analyse count data with the Poisson distribution and binary data with the Bernoulli distribution respectively. Others include categorical data with the multinomial distribution and positive continuous data with the Weibull distribution. These illustrate different data distributions under the Exponential Family. The mean of the data distribution is linked to a linear function of covariates with possibly random effects, but the variance is NOT modelled. Popular time series models with heteroskedastic variance and long memory, such as the Generalized autoregressive conditional heteroskedastic (GARCH) model and the stochastic volatility (SV) model, will not be considered.


    1.2 Likelihood function

Let $Y_1, \ldots, Y_n$ be $n$ independent random variables (rv) with probability density functions (pdf) $f_i(y_i; \boldsymbol\theta)$ depending on a vector-valued parameter $\boldsymbol\theta$. The joint density of $\mathbf{y} = (y_1, \ldots, y_n)$,

$$f(\mathbf{y}; \boldsymbol\theta) = \prod_{i=1}^n f_i(y_i; \boldsymbol\theta) = L(\boldsymbol\theta; \mathbf{y}),$$

as a function of the unknown parameter $\boldsymbol\theta$ given $\mathbf{y}$, is called the likelihood function. We often work with the logarithm of $f(\mathbf{y}; \boldsymbol\theta)$, the log-likelihood function:

$$\ell(\boldsymbol\theta; \mathbf{y}) = \ln L(\boldsymbol\theta; \mathbf{y}) = \sum_{i=1}^n \ln f_i(y_i; \boldsymbol\theta).$$

The maximum-likelihood (ML) estimator $\hat{\boldsymbol\theta}$ maximizes the log-likelihood function given the data $\mathbf{y}$, that is,

$$\ell(\hat{\boldsymbol\theta}; \mathbf{y}) \geq \ell(\boldsymbol\theta; \mathbf{y}) \quad\text{for all } \boldsymbol\theta.$$

In other words, the ML estimates make the observed data as likely as possible under the model.

Example: The log-likelihood function for the Geometric Distribution.
Consider a series of independent Bernoulli trials with a common probability of success $\pi$. The distribution of the number of failures $Y_i$ before the first success has pdf

$$\Pr(Y_i = y_i) = \pi(1-\pi)^{y_i} \quad\text{for } y_i = 0, 1, \ldots.$$

Direct calculation shows that $E(Y_i) = (1-\pi)/\pi$. The log-likelihood function given $\mathbf{y}$ is

$$\ell(\pi; \mathbf{y}) = \ln L(\pi; \mathbf{y}) = \sum_{i=1}^n [y_i \ln(1-\pi) + \ln\pi] = n[\bar{y}\ln(1-\pi) + \ln\pi],$$


where $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ is the sample mean. The fact that the log-likelihood function depends on the observations only through $\bar{y}$ shows that $\bar{y}$ is a sufficient statistic for the unknown probability $\pi$.

[Figure: log-likelihood function logl(pi), score function score(pi) and expected information function Ie(pi) for the geometric distribution when $n = 20$ and $\bar{y} = 3$.]

    > n=20

    > ym=3

> pi=c(1:100)/100
> logl=function(pi) n*(ym*log(1-pi)+log(pi))


    1.3 Score vector

The first-order derivative of the log-likelihood function, called Fisher's score function, is a vector of dimension $p$, where $p$ is the number of parameters, and is denoted by

$$u(\boldsymbol\theta) = \frac{\partial \ell(\boldsymbol\theta; \mathbf{y})}{\partial \boldsymbol\theta}.$$

For example, when $Y_i \sim N(\mu, \sigma^2)$, $u(\boldsymbol\theta) = \left(\dfrac{\partial\ell}{\partial\mu}, \dfrac{\partial\ell}{\partial\sigma^2}\right)^\top$.
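For concreteness, the two components in this normal example (a standard calculation, not spelled out on the slide) are

$$\frac{\partial\ell}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \mu), \qquad \frac{\partial\ell}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (y_i - \mu)^2.$$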

If the log-likelihood function is concave, the ML estimates can be obtained by solving the system of equations

$$u(\boldsymbol\theta) = \mathbf{0}.$$

Example: The score function for the geometric distribution.
The score function for $n$ observations from a geometric distribution is

$$u(\pi) = \frac{d\ell}{d\pi} = \frac{d}{d\pi}\, n[\bar{y}\ln(1-\pi) + \ln\pi] = n\left(\frac{1}{\pi} - \frac{\bar{y}}{1-\pi}\right).$$

Setting this equation to zero and solving for $\pi$ gives the ML estimate:

$$\frac{1}{\hat\pi} = \frac{\bar{y}}{1-\hat\pi} \;\Rightarrow\; 1 - \hat\pi = \hat\pi\bar{y} \;\Rightarrow\; \hat\pi = \frac{1}{1+\bar{y}} \quad\text{and}\quad \bar{y} = \frac{1-\hat\pi}{\hat\pi}.$$

Note that the ML estimate of the probability of success is the reciprocal of the average number of trials. The more trials it takes to get a success, the lower the estimated probability of success.

For a sample of $n = 20$ observations with a sample mean of $\bar{y} = 3$, the ML estimate is $\hat\pi = 1/(1+3) = 0.25$.
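Since the score has a unique root here, a quick numerical check (not on the slides; uniroot is base R) confirms the closed-form estimate:

# Sketch: locate the root of the geometric score numerically,
# with n = 20 and ybar = 3 as in the example.
n <- 20; ym <- 3
score <- function(pi) n*(1/pi - ym/(1 - pi))
uniroot(score, c(0.01, 0.99))$root   # approximately 0.25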


    1.4 Information matrix

It can be shown that

$$E_y\left[\frac{\partial \ell(\boldsymbol\theta)}{\partial \boldsymbol\theta}\right] = \mathbf{0} \quad\text{and}\quad E_y\left[\frac{\partial^2 \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top}\right] + E_y\left[\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\,\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta^\top}\right] = \mathbf{0}.$$

Proof: Since $\int f(\mathbf{y},\boldsymbol\theta)\, d\mathbf{y} = 1$, differentiating both sides w.r.t. $\boldsymbol\theta$ gives

$$\int \frac{\partial f(\mathbf{y},\boldsymbol\theta)}{\partial\boldsymbol\theta}\, d\mathbf{y} = \mathbf{0}
\;\Rightarrow\;
\int \frac{\partial f(\mathbf{y},\boldsymbol\theta)/\partial\boldsymbol\theta}{f(\mathbf{y},\boldsymbol\theta)}\, f(\mathbf{y},\boldsymbol\theta)\, d\mathbf{y} = \mathbf{0}
\;\Rightarrow\;
\int \frac{\partial \ln f(\mathbf{y},\boldsymbol\theta)}{\partial\boldsymbol\theta}\, f(\mathbf{y},\boldsymbol\theta)\, d\mathbf{y} = \mathbf{0}
\;\Rightarrow\;
E_y\left[\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\right] = \mathbf{0},$$

and differentiating once more,

$$\int \frac{\partial}{\partial\boldsymbol\theta^\top}\left[\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\, f(\mathbf{y},\boldsymbol\theta)\right] d\mathbf{y} = \mathbf{0}
\;\Rightarrow\;
\int \left[\frac{\partial^2 \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top}\, f(\mathbf{y},\boldsymbol\theta) + \frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\,\frac{\partial f(\mathbf{y},\boldsymbol\theta)}{\partial\boldsymbol\theta^\top}\right] d\mathbf{y} = \mathbf{0}$$
$$\;\Rightarrow\;
\int \left[\frac{\partial^2 \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top} + \frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\,\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta^\top}\right] f(\mathbf{y},\boldsymbol\theta)\, d\mathbf{y} = \mathbf{0}
\;\Rightarrow\;
E_y\left[\frac{\partial^2 \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top}\right] + E_y\left[\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\,\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta^\top}\right] = \mathbf{0}. \tag{1}$$

Hence the score function is a random vector with zero mean,

$$E_y[u(\boldsymbol\theta)] = E_y\left[\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\right] = \mathbf{0},$$

and a variance-covariance matrix given by the information matrix:

$$\text{var}[u(\boldsymbol\theta)] = E_y[u(\boldsymbol\theta)\,u(\boldsymbol\theta)^\top] = E_y\left[\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\,\frac{\partial \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta^\top}\right] = I(\boldsymbol\theta).$$
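As a sanity check (not on the slides), both facts can be verified by simulation for the geometric model; the settings n = 20 and pi = 0.25 are assumptions for illustration:

# Sketch: Monte Carlo check that the geometric score has mean ~ 0 and
# variance ~ I_e(pi) = n/(pi^2*(1 - pi)).
set.seed(1)
n <- 20; pi0 <- 0.25
u <- replicate(5000, {
  y <- rgeom(n, pi0)                  # failures before the first success
  n*(1/pi0 - mean(y)/(1 - pi0))       # score evaluated at the true pi
})
c(mean(u), var(u), n/(pi0^2*(1 - pi0)))  # var(u) should be near 426.67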


Under mild regularity conditions, the information matrix can also be obtained as minus the expected value of the second derivatives of the log-likelihood, from (1):

$$\text{var}[u(\boldsymbol\theta)] = I(\boldsymbol\theta) = -E_y\left[\frac{\partial^2 \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top}\right].$$

Note that the Hessian matrix is

$$H(\boldsymbol\theta) = -I_o(\boldsymbol\theta) = \frac{\partial^2 \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top} = \frac{\partial u(\boldsymbol\theta)}{\partial\boldsymbol\theta^\top} \tag{2}$$

and $I_o(\boldsymbol\theta) = -\dfrac{\partial^2 \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top} = -H(\boldsymbol\theta)$ is sometimes called the observed information matrix. $I_o(\boldsymbol\theta)$ indicates the extent to which $\ell(\boldsymbol\theta)$ is peaked rather than flat: the more peaked $\ell(\boldsymbol\theta)$, the more positive $I_o(\boldsymbol\theta)$. For example, when $Y_i \sim N(\mu, \sigma^2)$,

$$\frac{\partial^2 \ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top} = \begin{pmatrix} \dfrac{\partial^2\ell}{\partial\mu^2} & \dfrac{\partial^2\ell}{\partial\mu\,\partial\sigma^2} \\[2mm] \dfrac{\partial^2\ell}{\partial\sigma^2\,\partial\mu} & \dfrac{\partial^2\ell}{\partial(\sigma^2)^2} \end{pmatrix}.$$

Example: Information matrix for the geometric distribution.
Differentiating the score, we find the observed information to be

$$I_o(\pi) = -\frac{d^2\ell(\pi)}{d\pi^2} = -\frac{du(\pi)}{d\pi} = -\frac{d}{d\pi}\, n\left(\frac{1}{\pi} - \frac{\bar{y}}{1-\pi}\right) = n\left(\frac{1}{\pi^2} + \frac{\bar{y}}{(1-\pi)^2}\right).$$

To find the expected information, we substitute $\bar{y}$ by $E(\bar{Y}) = E(Y_i) = (1-\pi)/\pi$ in $I_o(\pi)$ to obtain

$$I_e(\pi) = n\left(\frac{1}{\pi^2} + \frac{(1-\pi)/\pi}{(1-\pi)^2}\right) = n\left(\frac{1}{\pi^2} + \frac{1}{\pi(1-\pi)}\right) = n\,\frac{1-\pi+\pi}{\pi^2(1-\pi)} = \frac{n}{\pi^2(1-\pi)}.$$

Note that $I_e(\pi)$, which reflects the sharpness of the peak, increases with the sample size $n$, since a larger sample provides more information and hence the log-likelihood function is sharper at the peak. When $n = 20$ and $\pi = 0.15$, the expected information is

$$I_e(0.15) = \frac{n}{\pi^2(1-\pi)} = \frac{20}{0.15^2(1-0.15)} = 1045.8.$$


If the sample mean $\bar{y} = 3$, the observed information is

$$I_o(0.15) = n\left(\frac{1}{\pi^2} + \frac{\bar{y}}{(1-\pi)^2}\right) = 20\left(\frac{1}{0.15^2} + \frac{3}{(1-0.15)^2}\right) = 971.9.$$

Substituting the ML estimate $\hat\pi = 0.25$, the expected and observed information coincide, $I_o(0.25) = I_e(0.25) = 426.7$, since $\bar{y} = (1-\hat\pi)/\hat\pi$.

> score=function(pi) n*(1/pi-ym/(1-pi))

    > Ie=function(pi) n/(pi^2*(1-pi))

    > Io=n*(1/pi^2+ym/(1-pi)^2)

>
> logl1=n*(ym*log(1-pi)+log(pi))

    > score1=n*(1/pi-ym/(1-pi))

    > Ie1=n/(pi^2*(1-pi))

    > c(pi[logl1==max(logl1)],pi[score1==0],max(logl1))

    [1] 0.25000 0.25000 -44.98681

    > c(Io[pi==0.15],Ie1[pi==0.15],Io[pi==0.25],Ie1[pi==0.25])

    [1] 971.9339 1045.7516 426.6667 426.6667

    >

    > par(mfrow=c(2,2))

> plot(logl, col="red",xlab="pi",ylab="logl(pi)")

    > points(pi[score1==0],logl1[pi==pi[score1==0]],pch=2,col="red",cex=0.6)

    > title("log-likelihood function")

> plot(score, col="red",xlab="pi",ylab="score(pi)")

    > abline(h = 0)

    > points(pi[score1==0],0,pch=2,col="red",cex=0.6)

    > title("score function")

> plot(Ie, col="red",xlab="pi",ylab="Ie(pi)")

    > title("Expected information function")


1.5 Newton-Raphson and Fisher Scoring methods

Calculation of the ML estimate often requires iterative procedures. Expanding the score function $u(\boldsymbol\theta)$ evaluated at the ML estimate $\hat{\boldsymbol\theta}$ around a trial value $\boldsymbol\theta^{(0)}$ using a first-order Taylor series gives

$$u(\hat{\boldsymbol\theta}) = u(\boldsymbol\theta^{(0)}) + \frac{\partial u(\boldsymbol\theta^{(0)})}{\partial\boldsymbol\theta}\,(\hat{\boldsymbol\theta} - \boldsymbol\theta^{(0)}) + \text{higher order terms in } (\hat{\boldsymbol\theta} - \boldsymbol\theta^{(0)}). \tag{3}$$

Ignoring higher order terms, equating (3) to zero and solving for $\hat{\boldsymbol\theta}$, we have

$$\hat{\boldsymbol\theta} \approx \boldsymbol\theta^{(0)} - \left[\frac{\partial u(\boldsymbol\theta^{(0)})}{\partial\boldsymbol\theta}\right]^{-1} u(\boldsymbol\theta^{(0)}) \tag{4}$$

since $u(\hat{\boldsymbol\theta}) = \mathbf{0}$. Then the Newton-Raphson (NR) procedure to obtain an improved estimate $\boldsymbol\theta^{(k+1)}$ using the estimate $\boldsymbol\theta^{(k)}$ at the $k$-th iteration is

$$\boldsymbol\theta^{(k+1)} = \boldsymbol\theta^{(k)} - \left[\frac{\partial^2\ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top}\right]^{-1} \left.\frac{\partial\ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\right|_{\boldsymbol\theta = \boldsymbol\theta^{(k)}}. \tag{5}$$

The iterative procedure is repeated until the difference between $\boldsymbol\theta^{(k+1)}$ and $\boldsymbol\theta^{(k)}$ is sufficiently close to zero. Then (proof as exercise)

$$\text{var}(\hat{\boldsymbol\theta}) = I_o(\hat{\boldsymbol\theta})^{-1} = \left[-\frac{\partial^2\ell(\hat{\boldsymbol\theta})}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top}\right]^{-1}.$$

For ML estimates, the log-likelihood is concave downwards and the second-order derivative $H(\hat{\boldsymbol\theta})$ is negative. The sharper the curvature (more information) of $\ell(\boldsymbol\theta)$, the more negative $H(\hat{\boldsymbol\theta})$ is, and hence the smaller the variance of the estimates, $\text{var}(\hat{\boldsymbol\theta}) = I_o(\hat{\boldsymbol\theta})^{-1} = -H(\hat{\boldsymbol\theta})^{-1}$. The NR procedure tends to converge quickly if the log-likelihood is well-behaved (close to quadratic) in a neighborhood of the ML estimate and if the starting value $\boldsymbol\theta^{(0)}$ is reasonably close to $\hat{\boldsymbol\theta}$.


An alternative procedure, first suggested by Fisher, is to replace the observed information matrix $I_o(\boldsymbol\theta)$ by its expected value $I_e(\boldsymbol\theta)$. The procedure, known as Fisher Scoring (FS), is

$$\boldsymbol\theta^{(k+1)} = \boldsymbol\theta^{(k)} - \left\{E\left[\frac{\partial^2\ell(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top}\right]\right\}^{-1} \left.\frac{\partial\ell(\boldsymbol\theta)}{\partial\boldsymbol\theta}\right|_{\boldsymbol\theta = \boldsymbol\theta^{(k)}}. \tag{6}$$

For multimodal distributions, both methods may converge to a local (not global) maximum.

Example: NR and FS methods for the geometric distribution.
Setting the score to zero leads to an explicit solution for the ML estimate, $\hat\pi = \dfrac{1}{1+\bar{y}}$, so no iteration is needed. For illustrative purposes, the iterative procedure is performed. Using the previous results,

$$\frac{d\ell}{d\pi} = n\left(\frac{1}{\pi} - \frac{\bar{y}}{1-\pi}\right), \qquad \frac{d^2\ell}{d\pi^2} = -n\left(\frac{1}{\pi^2} + \frac{\bar{y}}{(1-\pi)^2}\right), \qquad E\left(\frac{d^2\ell}{d\pi^2}\right) = -\frac{n}{\pi^2(1-\pi)}.$$

The Fisher scoring procedure leads to the updating formula

$$\begin{aligned}
\pi^{(k+1)} &= \pi^{(k)} - \left[E\left(\frac{d^2\ell}{d\pi^2}\right)\right]^{-1} \left.\frac{d\ell}{d\pi}\right|_{\pi = \pi^{(k)}} \\
&= \pi^{(k)} + \frac{(\pi^{(k)})^2(1-\pi^{(k)})}{n}\; n\left(\frac{1}{\pi^{(k)}} - \frac{\bar{y}}{1-\pi^{(k)}}\right) \\
&= \pi^{(k)} + (\pi^{(k)})^2(1-\pi^{(k)})\,\frac{1 - \pi^{(k)} - \pi^{(k)}\bar{y}}{\pi^{(k)}(1-\pi^{(k)})} \\
&= \pi^{(k)} + (1 - \pi^{(k)} - \pi^{(k)}\bar{y})\,\pi^{(k)}.
\end{aligned}$$

If the sample mean is $\bar{y} = 3$ and we start from $\pi^{(0)} = 0.1$, say, the procedure converges to the ML estimate $\hat\pi = 0.25$ in four iterations.

    > n=20

    > ym=3

    > pi=0.1

> result=matrix(0,10,7)
>


    > for (i in 1:10) {

    + dl=n*(1/pi-ym/(1-pi))

    + dl2=-n/(pi^2*(1-pi))

    + pi=pi-dl/dl2

    + #pi=pi+(1-pi-pi*ym)*pi

    + se=sqrt(-1/dl2)

    + l=n*(ym*log(1-pi)+log(pi))

    + step=(1-pi-pi*ym)*pi

    + result[i,]=c(i,pi,se,l,dl,dl2,step)

    + }

    > colnames(result)=c("Iter","pi","se","l","dl","dl2","step")

    > result

    Iter pi se l dl dl2 step

    [1,] 1 0.1600000 0.02121320 -47.11283 1.333333e+02 -2222.2222 5.760000e-02

    [2,] 2 0.2176000 0.03279024 -45.22528 5.357143e+01 -930.0595 2.820096e-02

    [3,] 3 0.2458010 0.04303862 -44.99060 1.522465e+01 -539.8628 4.128512e-03

    [4,] 4 0.2499295 0.04773221 -44.98681 1.812051e+00 -438.9114 7.050785e-05

    [5,] 5 0.2500000 0.04840091 -44.98681 3.009750e-02 -426.8674 1.989665e-08

    [6,] 6 0.2500000 0.04841229 -44.98681 8.489239e-06 -426.6667 1.582068e-15

    [7,] 7 0.2500000 0.04841229 -44.98681 6.661338e-13 -426.6667 0.000000e+00

    [8,] 8 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00

    [9,] 9 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00

    [10,] 10 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00

Alternatively, the Newton-Raphson procedure is

$$\begin{aligned}
\pi^{(k+1)} &= \pi^{(k)} - \left(\frac{d^2\ell}{d\pi^2}\right)^{-1} \left.\frac{d\ell}{d\pi}\right|_{\pi = \pi^{(k)}} \\
&= \pi^{(k)} + \frac{1}{n}\left(\frac{1}{(\pi^{(k)})^2} + \frac{\bar{y}}{(1-\pi^{(k)})^2}\right)^{-1} n\left(\frac{1}{\pi^{(k)}} - \frac{\bar{y}}{1-\pi^{(k)}}\right) \\
&= \pi^{(k)} + \frac{(\pi^{(k)})^2(1-\pi^{(k)})^2}{1 - 2\pi^{(k)} + (1+\bar{y})(\pi^{(k)})^2}\cdot\frac{1 - \pi^{(k)} - \pi^{(k)}\bar{y}}{\pi^{(k)}(1-\pi^{(k)})} \\
&= \pi^{(k)} + \frac{\pi^{(k)}(1-\pi^{(k)})(1 - \pi^{(k)} - \pi^{(k)}\bar{y})}{1 - 2\pi^{(k)} + (1+\bar{y})(\pi^{(k)})^2}.
\end{aligned}$$

    > n=20

    > ym=3

> pi=0.1 #starting value
> result=matrix(0,10,7)


    >

    > for (i in 1:10) {

    + dl=20*(1/pi-ym/(1-pi))

    + dl2=-20*(1/pi^2+3/(1-pi)^2)

    + pi=pi-dl/dl2

    + #pi=pi+(1-pi)*(1-pi-pi*ym)*pi/(1-2*pi+4*pi^2)

    + se=sqrt(-1/dl2)

    + l=n*(ym*log(1-pi)+log(pi))

    + step=(1-pi)*(1-pi-pi*ym)*pi/(1-2*pi+4*pi^2)

    + result[i,]=c(i,pi,se,l,dl,dl2,step)

    + }

    > colnames(result)=c("Iter","pi","se","l","dl","dl2","step")

    > result

    Iter pi se l dl dl2 step

    [1,] 1 0.1642857 0.02195775 -46.89107 1.333333e+02 -2074.0741 6.039726e-02

    [2,] 2 0.2246830 0.03477490 -45.13029 4.994426e+01 -826.9292 2.344114e-02

[3,] 3 0.2481241 0.04490170 -44.98756 1.162661e+01 -495.9916 1.866426e-03
[4,] 4 0.2499905 0.04816876 -44.98681 8.044145e-01 -430.9919 9.453797e-06

    [5,] 5 0.2500000 0.04841107 -44.98681 4.033823e-03 -426.6882 2.383524e-10

    [6,] 6 0.2500000 0.04841229 -44.98681 1.016970e-07 -426.6667 0.000000e+00

    [7,] 7 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00

    [8,] 8 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00

    [9,] 9 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00

    [10,] 10 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00

For both algorithms, $\pi^{(k)}$, $u(\pi^{(k)})$ and $u'(\pi^{(k)})$ converge to 0.25, 0 (slope) and $-426.6667$ (curvature) respectively. Note that the NR method, using the exact $I_o(\pi)$, may converge faster than the FS method.

    The maximization can also be done using a maximizer:

    > logl = function(pi) -20*(3*log(1-pi)+log(pi))

> pi.hat = optimize(logl, c(0, 1), tol = 0.0001)
> pi.hat

    $minimum

    [1] 0.2500143

    $objective

    [1] 44.98681
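As a further check (not on the slides), optim() can minimize the same negative log-likelihood and return the Hessian, whose inverse gives the variance estimate; the method and bounds below are assumptions:

# Sketch: minimize the negative log-likelihood with optim() and recover
# the standard error from the returned Hessian (observed information).
negll <- function(pi) -20*(3*log(1-pi) + log(pi))
fit <- optim(0.1, negll, method = "Brent", lower = 0.001, upper = 0.999,
             hessian = TRUE)
c(fit$par, sqrt(1/fit$hessian))   # approximately 0.25 and 0.0484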


    1.6 Expectation Maximization (EM) algorithm

    1.6.1 Basic EM algorithm

The Expectation-Maximization (EM) algorithm was proposed by Dempster et al. (1977). It is an iterative approach for computing the maximum likelihood estimates (MLEs) for incomplete-data problems.

Let $\mathbf{y}$ be the observed data, $\mathbf{z}$ be the latent or missing data and $\boldsymbol\theta$ be the unknown parameters to be estimated. The functions $f(\mathbf{y}|\boldsymbol\theta)$ and $f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)$ are called the observed data and complete data likelihood functions respectively. The observed data likelihood $L_o(\boldsymbol\theta) = f(\mathbf{y}|\boldsymbol\theta)$ is the expectation of $f(\mathbf{y}|\mathbf{z},\boldsymbol\theta)$ w.r.t. $f(\mathbf{z}|\boldsymbol\theta)$, that is,

$$f(\mathbf{y}|\boldsymbol\theta) = \int f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)\, d\mathbf{z} = \int f(\mathbf{y}|\mathbf{z},\boldsymbol\theta)\, f(\mathbf{z}|\boldsymbol\theta)\, d\mathbf{z} = E_{\mathbf{z}|\boldsymbol\theta}[f(\mathbf{y}|\mathbf{z},\boldsymbol\theta)].$$

To find the ML estimate, one should maximize

$$\ell_o(\boldsymbol\theta) = \ln f(\mathbf{y}|\boldsymbol\theta) = \ln \int f(\mathbf{y}|\mathbf{z},\boldsymbol\theta)\, f(\mathbf{z}|\boldsymbol\theta)\, d\mathbf{z} = \ln E_{\mathbf{z}|\boldsymbol\theta}[f(\mathbf{y}|\mathbf{z},\boldsymbol\theta)].$$

The EM algorithm maximizes $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$ (the proof is given in the appendix), which is equivalent to maximizing

$$E_{\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)}}\{\ln f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)\} = \int \ln f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)\, f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, d\mathbf{z} \tag{7}$$

given $\boldsymbol\theta^{(k)}$ in an iterative procedure. Note that it takes into account the posterior distribution of $\mathbf{z}$, i.e. $f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})$, and so it provides a framework for estimating $\mathbf{z}$ in the E-step. With the estimated $\mathbf{z}$, the M-step is simplified, whereas the classical ML method requires direct maximization of $\ell_o(\boldsymbol\theta)$, which may involve integration over $f(\mathbf{z}|\boldsymbol\theta)$, a prior distribution for $\mathbf{z}$.

The EM algorithm consists of two steps: the E-step and the M-step.

1. E-step: Evaluate the conditional expectation of the complete data log-likelihood function, $\ell_c(\boldsymbol\theta) = \ln f(\mathbf{y},\mathbf{z}^{(k)}|\boldsymbol\theta)$, by replacing $\mathbf{z}$ by $\mathbf{z}^{(k)} = E(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})$.

2. M-step: Maximize $\ell_c(\boldsymbol\theta) = \ln f(\mathbf{y},\mathbf{z}^{(k)}|\boldsymbol\theta)$ w.r.t. $\boldsymbol\theta$ to obtain $\boldsymbol\theta^{(k+1)}$. Return to the E-step with $\boldsymbol\theta^{(k+1)}$.

3. Stopping rule: Iterations (expectation of $\mathbf{z}$ given $\boldsymbol\theta^{(k)}$) within iterations (maximization of $\boldsymbol\theta$ given $\mathbf{z}^{(k)}$ for each $k$) arise, and they should stop when $\|\boldsymbol\theta^{(k+1)} - \boldsymbol\theta^{(k)}\|$ is sufficiently small. A minimal skeleton of this loop is sketched below.
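A minimal sketch of the two-step loop in R (the function names estep and mstep are placeholders, not from the slides):

# Sketch: generic EM skeleton; estep() imputes the latent data given theta,
# mstep() maximizes the complete-data log-likelihood given the imputation.
em <- function(theta, estep, mstep, tol = 1e-8, maxit = 200) {
  for (k in 1:maxit) {
    z <- estep(theta)
    theta.new <- mstep(z)
    if (max(abs(theta.new - theta)) < tol) break
    theta <- theta.new
  }
  theta
}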

Remarks

1. The EM algorithm makes use of the Principle of Data Augmentation, which states that:

   EM inference: Augment the observed data $\mathbf{y}$ with latent data $\mathbf{z}$ so that the likelihood of the complete data $f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)$ is simple, and then obtain the MLE of $\boldsymbol\theta$ based on this complete likelihood function.

   Bayesian inference: Augment the observed data $\mathbf{y}$ with latent data $\mathbf{z}$ so that the augmented posterior density $f(\boldsymbol\theta|\mathbf{y},\mathbf{z})$ is simple, and then use this simple posterior distribution in sampling the parameters $\boldsymbol\theta$.

2. The Bayesian approach simply treats $\mathbf{z}$ as another set of unknowns alongside $\boldsymbol\theta$, and so the distinction between the E and M steps disappears. Both $\boldsymbol\theta$ and $\mathbf{z}$ are optimized through a (Markov) chain one at a time.

3. The EM algorithm can be applied to different missing or incomplete-data situations, such as censored observations, random effects models, mixture models, and models with latent classes or latent variables.

4. The EM algorithm has a linear rate of convergence which depends on the proportion of information about $\boldsymbol\theta$ in $f(\mathbf{y}|\boldsymbol\theta)$ which is observed. The convergence is usually slower than the NR method.


Example: (Darwin data) The data contain two very low outliers. We consider the mixture model:

$$y_i \sim \begin{cases} N(\mu_1, \sigma^2), & p = 0.9 \\ N(\mu_2, \sigma^2), & p = 0.1 \end{cases}
\quad\text{or}\quad
y_i \sim 0.9\, N(\mu_1, \sigma^2) + 0.1\, N(\mu_2, \sigma^2).$$

Let $w_{ij}$ be the indicator that observation $i$ comes from group $j$, $j = 1, 2$, with $w_{i1} + w_{i2} = 1$. We don't know which normal distribution each observation $y_i$ comes from; in other words, $w_{ij}$ is unobserved.

In the M-step, writing $r_{ij} = y_i - \mu_j$, the complete data likelihood, log-likelihood and their first and second order derivative functions are

$$L(\boldsymbol\theta) = \prod_{i=1}^n [0.9\,\phi(y_i|\mu_1,\sigma^2)]^{w_{i1}} [0.1\,\phi(y_i|\mu_2,\sigma^2)]^{w_{i2}}$$

$$\ell(\boldsymbol\theta) = \sum_{i=1}^n \left[w_{i1}\ln 0.9 + w_{i1}\ln\phi(y_i|\mu_1,\sigma^2) + w_{i2}\ln 0.1 + w_{i2}\ln\phi(y_i|\mu_2,\sigma^2)\right]$$

$$\ln\phi(y_i|\mu_j,\sigma^2) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i - \mu_j)^2$$

$$\frac{\partial}{\partial\mu_j}\ln\phi(y_i|\mu_j,\sigma^2) = \frac{1}{\sigma^2}(y_i - \mu_j) = \frac{r_{ij}}{\sigma^2}$$

$$\frac{\partial}{\partial\sigma^2}\ln\phi(y_i|\mu_j,\sigma^2) = -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(y_i - \mu_j)^2 = \frac{1}{2\sigma^4}(r_{ij}^2 - \sigma^2)$$

$$\frac{\partial\ell(\boldsymbol\theta)}{\partial\mu_j} = \frac{\partial}{\partial\mu_j}\sum_{i=1}^n w_{ij}\ln\phi(y_i|\mu_j,\sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^n w_{ij}\, r_{ij}, \quad j = 1, 2$$

$$\frac{\partial\ell(\boldsymbol\theta)}{\partial\sigma^2} = \frac{\partial}{\partial\sigma^2}\sum_{i=1}^n \sum_{j=1}^2 w_{ij}\ln\phi(y_i|\mu_j,\sigma^2) = \frac{1}{2\sigma^4}\sum_{i=1}^n \sum_{j=1}^2 w_{ij}(r_{ij}^2 - \sigma^2)$$

$$\frac{\partial^2\ell(\boldsymbol\theta)}{\partial\mu_j^2} = -\frac{1}{\sigma^2}\sum_{i=1}^n w_{ij}, \qquad
\frac{\partial^2\ell(\boldsymbol\theta)}{\partial(\sigma^2)^2} = -\frac{1}{\sigma^6}\sum_{i=1}^n \sum_{j=1}^2 w_{ij}(r_{ij}^2 - \sigma^2) - \frac{n}{2\sigma^4}$$

$$\frac{\partial^2\ell(\boldsymbol\theta)}{\partial\mu_j\,\partial\sigma^2} = -\frac{1}{\sigma^4}\sum_{i=1}^n w_{ij}\, r_{ij}, \qquad
\frac{\partial^2\ell(\boldsymbol\theta)}{\partial\mu_1\,\partial\mu_2} = 0.$$

In the E-step, the conditional expectation of $w_{i1}$ is

$$\begin{aligned}
\hat w_{i1} &= 1\cdot\Pr(W_{i1}=1|y_i) + 0\cdot\Pr(W_{i1}=0|y_i) = \Pr(W_{i1}=1|y_i) = \frac{\Pr(W_{i1}=1, y_i)}{\Pr(y_i)} = \frac{\Pr(W_{i1}=1)\Pr(y_i|W_{i1}=1)}{\Pr(y_i)} \\
&= \frac{\Pr(W_{i1}=1)\Pr(y_i|W_{i1}=1)}{\Pr(W_{i1}=1)\Pr(y_i|W_{i1}=1) + \Pr(W_{i1}=0)\Pr(y_i|W_{i1}=0)} = \frac{0.9\,\phi(y_i|\mu_1,\sigma^2)}{0.9\,\phi(y_i|\mu_1,\sigma^2) + 0.1\,\phi(y_i|\mu_2,\sigma^2)}.
\end{aligned}$$

    > y=c(-67,-48,6,8,14,16,23,24,28,29,41,49,56,60,75)

    > n=length(y)

    > p=3 #no. of par.

    > iterE=5

    > iterM=10

> dim1=iterE*iterM
> dim2=2*p+3

    > dl=c(rep(0,p))

    > result=matrix(0,dim1,dim2)

    > theta=c(30,-37,729) #starting values

    >

    > for (k in 1:iterE) { # E-step

    + ew1=0.9*exp(-0.5*(y-theta[1])^2/theta[3])

    + ew2=0.1*exp(-0.5*(y-theta[2])^2/theta[3])

+ w1=ew1/(ew1+ew2)
+ w1m=mean(w1)
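The remaining lines of this loop (the slides iterate an NR step within the M-step, as the trace columns below suggest) are not shown in this transcript. For fixed mixing weights 0.9/0.1 and common variance, a closed-form M-step sketch given the posterior weights would be:

# Sketch (assumed, not the slides' code): closed-form M-step updates for the
# two-component normal mixture with known weights and common variance,
# using w1 and y from the code above.
w2 <- 1 - w1
mu1 <- sum(w1*y)/sum(w1)
mu2 <- sum(w2*y)/sum(w2)
sigma2 <- sum(w1*(y - mu1)^2 + w2*(y - mu2)^2)/length(y)
theta <- c(mu1, mu2, sigma2)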


    [18,] 2 8 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096

    [19,] 2 9 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096

    [20,] 2 10 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096

    [21,] 3 1 32.99249 5.507838 -57.40463 14.03871 384.9492 145.69396 -71.91673

[22,] 3 2 32.99113 5.441860 -57.39541 13.86992 385.1901 140.51959 -71.91425
[23,] 3 3 32.99113 5.443563 -57.39540 13.87426 385.1903 140.65153 -71.91425

    [24,] 3 4 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425

    [25,] 3 5 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425

    [26,] 3 6 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425

    [27,] 3 7 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425

    [28,] 3 8 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425

    [29,] 3 9 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425

    [30,] 3 10 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425

    [31,] 4 1 32.99290 5.443541 -57.41186 13.87464 384.8527 140.71323 -71.90744

    [32,] 4 2 32.99290 5.441156 -57.41185 13.86856 384.8531 140.52829 -71.90744

    [33,] 4 3 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744

    [34,] 4 4 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744

    [35,] 4 5 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744

    [36,] 4 6 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744

    [37,] 4 7 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744

    [38,] 4 8 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744

    [39,] 4 9 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744

    [40,] 4 10 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744

    [41,] 5 1 32.99296 5.441157 -57.41245 13.86859 384.8414 140.53061 -71.90720

    [42,] 5 2 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720

    [43,] 5 3 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720

    [44,] 5 4 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720

    [45,] 5 5 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720

    [46,] 5 6 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720

    [47,] 5 7 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720

    [48,] 5 8 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720

    [49,] 5 9 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720

    [50,] 5 10 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720

    > w=cbind(w1,w2)

    > w

w1 w2
[1,] 2.315067e-05 9.999768e-01
[2,] 2.004754e-03 9.979952e-01
[3,] 9.984605e-01 1.539495e-03

    [4,] 9.990371e-01 9.629220e-04

    [5,] 9.997646e-01 2.353932e-04

    [6,] 9.998528e-01 1.471616e-04

    [7,] 9.999716e-01 2.842587e-05

    [8,] 9.999775e-01 2.247488e-05

    [9,] 9.999912e-01 8.782696e-06

    [10,] 9.999931e-01 6.944001e-06

    [11,] 9.999996e-01 4.143677e-07

    [12,] 9.999999e-01 6.327540e-08

    [13,] 1.000000e+00 1.222089e-08

    [14,] 1.000000e+00 4.775591e-09

[15,] 1.000000e+00 1.408457e-10
> mean(w1)

    [1] 0.866605

From $(\hat w_{i1}, \hat w_{i2})$, the first two observations belong to group 2 while all the others belong to group 1. Hence the EM method enables classification, like cluster analysis, an advantage over the classical likelihood method where the missing data $w_{ij}$ are integrated out, as the observed data likelihood

$$L_o(\boldsymbol\theta) = \prod_{i=1}^n \left[0.9\,\phi(y_i|\mu_1,\sigma^2) + 0.1\,\phi(y_i|\mu_2,\sigma^2)\right] \tag{8}$$


is a marginal mixture of two distributions and contains no missing observations.
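As a check (not on the slides), the observed-data log-likelihood (8) can be evaluated directly; the theta values below are taken from the final EM iterate in the trace above:

# Sketch: observed-data log-likelihood of the two-component mixture,
# evaluated at the EM estimates (mu1, mu2, sigma2) from the trace above.
y <- c(-67,-48,6,8,14,16,23,24,28,29,41,49,56,60,75)
theta <- c(32.99296, -57.41245, 384.8414)
lo <- sum(log(0.9*dnorm(y, theta[1], sqrt(theta[3])) +
              0.1*dnorm(y, theta[2], sqrt(theta[3]))))
lo   # about -71.9, matching the "l" column of the EM trace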

    > x=rep(-0.001,n)

    > x1=seq(-120,100,0.1)

    > fx1=dnorm(x1,theta[1],sqrt(theta[3]))

    > fx2=dnorm(x1,theta[2],sqrt(theta[3]))

    > fx=0.9*fx1+0.1*fx2

    > plot(x1, fx1, xlab="x", ylab="f(x)", ylim=c(-0.001,0.025),

    xlim=c(-120,100), pch=20, col="red",cex=0.5)

> points(x1,fx2,pch=20,col="blue",cex=0.5)

    > points(x1,fx,pch=20,cex=0.5)

    > points(y,x,pch=20,cex=0.8)

    > title("Mixture of normal distributions for Darwin data")

[Figure: "Mixture of normal distributions for Darwin data"; the two component densities and the mixture density f(x) plotted against x, with the data points along the axis.]

Note that this is a mixture model where the two component densities are represented by the blue and red lines. The mixture density in (8) is represented by the black line.


Example: (Right-Censored Data) with Darwin data.
Suppose that the first four observations ($c_i$, $i = 1, \ldots, 4$) are right censored ($y_i > c_i$) and we assume that

$$z_i,\ i = 1, \ldots, 4; \quad y_i,\ i = 5, \ldots, n \;\sim\; N(\mu, \sigma^2).$$

Let $\boldsymbol\theta = (\mu, \sigma^2)$, $\mathbf{z} = (z_1, \ldots, z_4)$ and $\mathbf{y} = (y_5, \ldots, y_n)$. Then

$$\ell_c(\boldsymbol\theta) = \ln f(\mathbf{z},\mathbf{y}|\boldsymbol\theta) = -\frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^4 (z_i - \mu)^2 - \frac{1}{2\sigma^2}\sum_{i=5}^n (y_i - \mu)^2.$$

For the censored observations $z_i > c_i$, $i = 1, \ldots, 4$, the conditional distribution is a truncated normal on $(c_i, \infty)$, with density function

$$f(z|\mu,\sigma,c_i) = \frac{\phi(z|\mu,\sigma^2)}{1 - \Phi\left(\frac{c_i-\mu}{\sigma}\right)}, \quad z > c_i, \tag{9}$$

where $\phi$ and $\Phi$ are the pdf and cdf of the normal distribution. Let $\boldsymbol\theta^{(k)} = (\mu^{(k)}, \sigma^{2(k)})$ be the current estimates of $\boldsymbol\theta$.

In the E-step, the conditional expectation of $z_i$, $i = 1, \ldots, 4$ given $\mathbf{y}$, $\boldsymbol\theta^{(k)}$ and $c_i$ is

$$z_i^{(k)} = E(z_i|\mathbf{y},\boldsymbol\theta^{(k)},c_i) = \int_{c_i}^\infty z\, f(z|\mu^{(k)},\sigma^{(k)},c_i)\, dz = \mu^{(k)} + \sigma^{(k)}\,\frac{\phi\left(\frac{c_i-\mu^{(k)}}{\sigma^{(k)}}\right)}{1 - \Phi\left(\frac{c_i-\mu^{(k)}}{\sigma^{(k)}}\right)}
\quad\text{or}\quad \frac{1}{S}\sum_{j=1}^S z_{ij}^{(k)},$$

since

$$\frac{1}{\sqrt{2\pi}}\int_c^\infty z\exp\left(-\frac{1}{2}z^2\right) dz = -\frac{1}{\sqrt{2\pi}}\left[\exp\left(-\frac{1}{2}z^2\right)\right]_c^\infty = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}c^2\right) = \phi(c),$$

where $z_{ij}^{(k)}$, $j = 1, \ldots, S$, is simulated from $f(z|\mu^{(k)},\sigma^{2(k)},c_i)$ in (9) in the Monte Carlo approximation of the conditional expectation.

In the M-step, $z_i^{(k)}$ is substituted for the censored observation $z_i$. With the complete data $(\mathbf{z}^{(k)},\mathbf{y})$, $\mu$ and $\sigma^2$ of the normal data distribution are given by their sample mean and sample variance. Hence no iteration is required for the M-step.


    > library("msm")

    > cy=c(6,8,14,16,23,24,28,29,41,49,56,60,75) #first 4 obs are censored

    > w=c(0,0,0,0,1,1,1,1,1,1,1,1,1)

    > n=length(cy)

    > S=10000 #sim 10000 z for z hat

    > m=4 #first 4 censored

> p=2 #2 parameters

    > cen=cy[1:m] #censored obs

    > y=cy[(m+1):n] #uncensored obs

    > iterE=10

    > dim=p+m+1

    > result=matrix(0,iterE,dim)

    > simz=matrix(0,m,S)

    > z=rep(0,m)

    > theta=c(mean(cy),var(cy)) #starting value for mu & sigma2

    >

    > for (k in 1:iterE) { #E-step

    +

    + for (j in 1:m) {

+ # simz[j,]=rtnorm(S,mean=theta[1],sd=sqrt(theta[2]),lower=cen[j],upper=Inf) #monte carlo approx. of E(Z|Z>c)
+ # z[j]=mean(simz[j,])

    +

    + cz=(cen[j]-theta[1])/sqrt(theta[2])

    + z[j]=theta[1]+dnorm(cz)*sqrt(theta[2])/(1-pnorm(cz)) #exact

    + }

    + yr=c(z,y)

    + theta[1]=mean(yr) #M-step

    + theta[2]=(sum(yr^2)-sum(yr)^2/n)/n

+ result[k,]=c(k,theta[1],theta[2],z[1],z[2],z[3],z[4])
+

    + }

    > colnames(result)=c("iE","mu","sigma2","ez1","ez2","ez3","ez4")

    > print(result,digit=5) #monte carlo approx. of E(Z|Z>c)

    iE mu sigma2 ez1 ez2 ez3 ez4

    [1,] 1 41.630 211.67 37.262 37.842 40.054 41.036

    [2,] 2 42.647 208.03 41.943 42.259 42.455 42.755

    [3,] 3 42.921 208.09 42.540 43.007 43.614 43.810

[4,] 4 43.017 208.11 43.281 43.385 43.655 43.900
[5,] 5 43.040 208.17 43.154 43.441 43.733 44.188


    [6,] 6 43.048 208.21 42.985 43.371 44.071 44.201

    [7,] 7 43.046 208.18 43.158 43.473 43.689 44.277

    [8,] 8 43.036 208.14 43.393 43.317 43.721 44.039

    [9,] 9 43.061 208.19 43.375 43.298 43.968 44.156

    [10,] 10 43.071 208.22 43.336 43.364 43.810 44.411

    > print(result,digit=5) #exact E(Z|Z>c)

    iE mu sigma2 ez1 ez2 ez3 ez4

    [1,] 1 41.659 211.47 37.377 37.996 40.180 41.018

    [2,] 2 42.660 208.05 41.948 42.062 42.638 42.932

    [3,] 3 42.930 208.06 42.889 42.983 43.478 43.737

    [4,] 4 43.006 208.12 43.148 43.239 43.718 43.969

    [5,] 5 43.027 208.14 43.221 43.311 43.785 44.035

    [6,] 6 43.033 208.15 43.242 43.332 43.805 44.054

    [7,] 7 43.035 208.15 43.248 43.337 43.810 44.059

    [8,] 8 43.035 208.15 43.249 43.339 43.812 44.061

    [9,] 9 43.036 208.15 43.250 43.339 43.812 44.061

    [10,] 10 43.036 208.15 43.250 43.340 43.812 44.061

The convergence using the Monte Carlo approximation is subject to random error in the simulation. Parameter estimates are given by the averages over iterations.


    1.6.2 Monte Carlo EM Algorithm

Given the current guess $\boldsymbol\theta^{(k)}$ to the posterior mode, the conditional expectation in the E-step may involve integration and can be calculated using Monte Carlo (MC) approximation. Similarly, the complete data log-likelihood function $\ell_c(\boldsymbol\theta) = \ln f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)$ can also be approximated using MC approximation:

$$\ell_c(\boldsymbol\theta) = \ln f(\mathbf{y},\mathbf{z}|\boldsymbol\theta) \approx \frac{1}{S}\sum_{j=1}^S \ln f(\mathbf{y},\mathbf{z}_j^{(k)}|\boldsymbol\theta) \tag{10}$$

where $\mathbf{z}_1^{(k)}, \ldots, \mathbf{z}_S^{(k)} \sim f(\mathbf{z}|\boldsymbol\theta^{(k)},\mathbf{y})$ as required in the E-step. This maximizes an average of log-likelihoods $\ln f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)$ based on simulated values, which is different from maximizing $\ln f(\mathbf{y},\bar{\mathbf{z}}|\boldsymbol\theta)$ where $\bar{\mathbf{z}}$ is the average of the simulated values. Then, in the M-step, we maximize $\ell_c(\boldsymbol\theta)$ in (10) to obtain a new guess, $\boldsymbol\theta^{(k+1)}$.

Monitoring of convergence: Plot each component of $\boldsymbol\theta^{(k)}$ against the iteration number $k$.
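As a small illustration of the MC E-step (assumed values; rtnorm is from the msm package used below), the simulated mean of a truncated normal can be compared with the exact formula from the previous example:

# Sketch: Monte Carlo vs exact mean of N(mu, sigma^2) truncated below at c.
library(msm)
set.seed(1)
mu <- 40; s <- 20; cc <- 6
zmc <- mean(rtnorm(10000, mean = mu, sd = s, lower = cc))  # MC approximation
cz  <- (cc - mu)/s
zex <- mu + s*dnorm(cz)/(1 - pnorm(cz))                    # exact mean
c(zmc, zex)   # the two should agree closely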

Example: (Right-Censored Data) Consider the Darwin data again. In the E-step, the conditional expectation of $z_i$ given $\mathbf{y}$, $\boldsymbol\theta^{(k)}$ and $c_i$ is given by (10) and estimated by drawing a sample $z_{i1}^{(k)}, z_{i2}^{(k)}, \ldots, z_{iS}^{(k)}$ from the truncated normal $f(z_i|\mu^{(k)},\sigma^{(k)},c_i)$ in (9) at the current estimates $\boldsymbol\theta^{(k)} = (\mu^{(k)},\sigma^{(k)})$.

In the M-step, one obtains an MC approximation to $\ln f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)$ by

$$\begin{aligned}
\ell_c(\boldsymbol\theta) &= \frac{1}{S}\sum_{j=1}^S \left[-\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left(\sum_{i=1}^4 (z_{ij}^{(k)} - \mu)^2 + \sum_{i=5}^n (y_i - \mu)^2\right)\right] \\
&= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left[\sum_{i=1}^4 \frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu)^2 + \sum_{i=5}^n (y_i - \mu)^2\right]
\end{aligned}$$

and maximizes it w.r.t. $\boldsymbol\theta$ to obtain $\boldsymbol\theta^{(k+1)}$ through iterations instead of a closed-form solution. Write $r_i = y_i - \mu$, $i = 5, \ldots, n$, $r_{ij} = z_{ij}^{(k)} - \mu$,


$\bar{z}_i = \frac{1}{S}\sum_{j=1}^S z_{ij}^{(k)}$ and $\bar{r}_i = \bar{z}_i - \mu$, $i = 1, \ldots, 4$. Then

$$\frac{\partial\ell(\boldsymbol\theta)}{\partial\mu} = \frac{1}{\sigma^2}\left[\sum_{i=1}^4 \frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu) + \sum_{i=5}^n (y_i - \mu)\right] = \frac{1}{\sigma^2}\sum_{i=1}^n \bar{r}_i,$$

$$\frac{\partial\ell(\boldsymbol\theta)}{\partial\sigma^2} = \frac{1}{2\sigma^4}\left[\sum_{i=1}^4 \frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu)^2 + \sum_{i=5}^n (y_i - \mu)^2\right] - \frac{n}{2\sigma^2} = \frac{1}{2\sigma^4}\left[\sum_{i=1}^4 \left(\frac{1}{S}\sum_{j=1}^S r_{ij}^2 - \sigma^2\right) + \sum_{i=5}^n (r_i^2 - \sigma^2)\right].$$

Since

$$\frac{1}{S}\sum_{j=1}^S r_{ij}^2 = \frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu)^2 \;\neq\; \left(\frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu)\right)^2 = \left(\frac{1}{S}\sum_{j=1}^S r_{ij}\right)^2 = \bar{r}_i^2,$$

the closed-form solution using the sample mean and sample variance cannot be used. Furthermore,

$$\frac{\partial^2\ell(\boldsymbol\theta)}{\partial\mu^2} = -\frac{n}{\sigma^2},$$

$$\frac{\partial^2\ell(\boldsymbol\theta)}{\partial(\sigma^2)^2} = -\frac{1}{\sigma^6}\left[\sum_{i=1}^4 \frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu)^2 + \sum_{i=5}^n (y_i - \mu)^2\right] + \frac{n}{2\sigma^4} = -\frac{1}{\sigma^6}\left[\sum_{i=1}^4 \left(\frac{1}{S}\sum_{j=1}^S r_{ij}^2 - \sigma^2\right) + \sum_{i=5}^n (r_i^2 - \sigma^2)\right] - \frac{n}{2\sigma^4},$$

$$\frac{\partial^2\ell(\boldsymbol\theta)}{\partial\mu\,\partial\sigma^2} = -\frac{1}{\sigma^4}\sum_{i=1}^n \bar{r}_i.$$

> library("msm")

> cy=c(6,8,14,16,23,24,28,29,41,49,56,60,75) #first 4 obs are censoring times

    > w=c(0,0,0,0,1,1,1,1,1,1,1,1,1)

    > mean(cy)

    [1] 33

    > n=length(cy)

    > T=10000 #sim 10000 z for z hat

> m=4 #first 4 censored obs
> p=2 #2 pars


    > cen=cy[1:m] #censored obs

    > y=cy[(m+1):n] #uncensored obs

    > iterE=5

    > iterM=10

    > dim1=iterE*iterM

    > dim2=2*p+7

    > dl=c(rep(0,p))

    > dl2=matrix(0,p,p)

    > result=matrix(0,dim1,dim2)

    > simz=matrix(0,m,T)

    > z=matrix(0,m,1)

    > rz=rep(0,m)

    > r2z=rep(0,m)

    > theta=c(40,400) #starting values for mu & var

    >

    > for (k in 1:iterE) { #E-step

    + for (j in 1:m) {

+ simz[j,]=rtnorm(T,mean=theta[1],sd=sqrt(theta[2]),lower=cen[j],upper=Inf)

    + z[j]=mean(simz[j,])

+ }
+

    + for (i in 1:iterM) { #M-step

    + rz=z-theta[1]

    + ry=y-theta[1]

    + r=c(rz,ry)

    + r2z=apply((simz-theta[1])^2,1,mean)

    + r2=c(r2z,ry^2)

    + s2=r2-theta[2]

+ dl[1]=sum(r)/theta[2]
+ dl[2]=0.5*sum(s2)/theta[2]^2

    + dl2[1,1]=-n/theta[2]

    + dl2[2,2]=-sum(s2)/theta[2]^3-0.5*n/theta[2]^2

    + dl2[2,1]=dl2[1,2]=-sum(r)/theta[2]^2

    + dl2i=solve(dl2)

    + theta=theta-dl2i%*%dl

    + se=sqrt(diag(-dl2i))

    + l=-n*log(2*pi*theta[2])/2-sum(r^2)/(2*theta[2]) #pi=3.141593

+ row=(k-1)*10+i
+ result[row,]=c(k,i,theta[1],se[1],theta[2],se[2],l,z[1],z[2],


    z[3],z[4])

    + }

    + }

> colnames(result)=c("iE","iM","mu","se","sigma2","se","logL","ez1","ez2","ez3","ez4")

    > print(result,digit=5)

    iE iM mu se sigma2 se logL ez1 ez2 ez3 ez4

    [1,] 1 1 43.745 5.6699 288.50 160.37 -53.653 42.163 42.427 44.008 44.473

    [2,] 1 2 42.965 4.7218 301.25 113.42 -53.557 42.163 42.427 44.008 44.473

    [3,] 1 3 42.929 4.8139 301.86 118.16 -53.547 42.163 42.427 44.008 44.473

    [4,] 1 4 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473

    [5,] 1 5 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473

    [6,] 1 6 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473

    [7,] 1 7 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473

[8,] 1 8 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
[9,] 1 9 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473

    [10,] 1 10 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473

    [11,] 2 1 43.280 4.8205 287.03 118.44 -53.460 43.734 43.914 44.876 44.900

    [12,] 2 2 43.263 4.6989 287.16 112.58 -53.458 43.734 43.914 44.876 44.900

    [13,] 2 3 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900

    [14,] 2 4 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900

[15,] 2 5 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[16,] 2 6 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900

    [17,] 2 7 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900

    [18,] 2 8 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900

    [19,] 2 9 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900

    [20,] 2 10 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900

    [21,] 3 1 43.320 4.6999 284.71 112.63 -53.446 44.105 43.867 44.783 45.399

    [22,] 3 2 43.320 4.6799 284.72 111.67 -53.446 44.105 43.867 44.783 45.399

[23,] 3 3 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[24,] 3 4 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399

    [25,] 3 5 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399

    [26,] 3 6 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399

    [27,] 3 7 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399

    [28,] 3 8 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399

    [29,] 3 9 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399

    [30,] 3 10 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399

    [31,] 4 1 43.325 4.6799 284.61 111.68 -53.446 43.945 43.878 45.109 45.295

    [32,] 4 2 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295

    [33,] 4 3 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295

    [34,] 4 4 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295

    [35,] 4 5 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295

    [36,] 4 6 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295

    [37,] 4 7 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295

    [38,] 4 8 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295

    [39,] 4 9 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295

    [40,] 4 10 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295

    [41,] 5 1 43.343 4.6790 284.06 111.63 -53.443 44.072 44.284 44.957 45.149

    [42,] 5 2 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149

    [43,] 5 3 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149

    [44,] 5 4 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149

    [45,] 5 5 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149

    [46,] 5 6 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149

    [47,] 5 7 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149

    [48,] 5 8 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149

[49,] 5 9 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[50,] 5 10 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149

    References


Dempster, A.P., Laird, N. & Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38 (with discussion).

McLachlan, G.J. & Krishnan, T. (1997) The EM Algorithm and Extensions. Wiley.


    1.7 Appendix for EM algorithm

To maximize $\ell_o(\boldsymbol\theta) = \ln f(\mathbf{y}|\boldsymbol\theta)$, we wish to compute an updated estimate $\boldsymbol\theta^{(k+1)}$ such that

$$\ell_o(\boldsymbol\theta^{(k+1)}) > \ell_o(\boldsymbol\theta^{(k)}).$$

The idea is to maximize iteratively a function $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$ which is (i) bounded above by $\ell_o(\boldsymbol\theta)$ and (ii) equal to $\ell_o(\boldsymbol\theta^{(k)})$ at $\boldsymbol\theta = \boldsymbol\theta^{(k)}$. Then any $\boldsymbol\theta^{(k+1)}$ which increases $\ell_o(\boldsymbol\theta^{(k+1)}|\boldsymbol\theta^{(k)})$ also increases $\ell_o(\boldsymbol\theta^{(k+1)})$. Lastly, the EM algorithm chooses $\boldsymbol\theta^{(k+1)}$ as the value of $\boldsymbol\theta$ for which $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$ is a maximum.

To show (i), we first consider the difference

$$\begin{aligned}
\ell_o(\boldsymbol\theta^{(k+1)}) - \ell_o(\boldsymbol\theta^{(k)})
&= \ln f(\mathbf{y}|\boldsymbol\theta^{(k+1)}) - \ln f(\mathbf{y}|\boldsymbol\theta^{(k)}) \\
&= \ln \int f(\mathbf{y}|\mathbf{z},\boldsymbol\theta^{(k+1)})\, f(\mathbf{z}|\boldsymbol\theta^{(k+1)})\, d\mathbf{z} - \ln f(\mathbf{y}|\boldsymbol\theta^{(k)}) \\
&= \ln \int f(\mathbf{y}|\mathbf{z},\boldsymbol\theta^{(k+1)})\, f(\mathbf{z}|\boldsymbol\theta^{(k+1)})\, \frac{f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})}{f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})}\, d\mathbf{z} - \ln f(\mathbf{y}|\boldsymbol\theta^{(k)}) \\
&= \ln \int f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, \frac{f(\mathbf{y}|\mathbf{z},\boldsymbol\theta^{(k+1)})\, f(\mathbf{z}|\boldsymbol\theta^{(k+1)})}{f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})}\, d\mathbf{z} - \ln f(\mathbf{y}|\boldsymbol\theta^{(k)}) \\
&\geq \int f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, \ln \frac{f(\mathbf{y}|\mathbf{z},\boldsymbol\theta^{(k+1)})\, f(\mathbf{z}|\boldsymbol\theta^{(k+1)})}{f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})}\, d\mathbf{z} - \int f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, \ln f(\mathbf{y}|\boldsymbol\theta^{(k)})\, d\mathbf{z} \\
&= \int f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, \ln \frac{f(\mathbf{y}|\mathbf{z},\boldsymbol\theta^{(k+1)})\, f(\mathbf{z}|\boldsymbol\theta^{(k+1)})}{f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, f(\mathbf{y}|\boldsymbol\theta^{(k)})}\, d\mathbf{z} \;\equiv\; \Delta(\boldsymbol\theta^{(k+1)}|\boldsymbol\theta^{(k)}),
\end{aligned}$$

since $\int f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, d\mathbf{z} = 1$ and, by Jensen's inequality with $\ln(\cdot)$ concave,

$$\ln \sum_{i=1}^n \lambda_i y_i \;\geq\; \sum_{i=1}^n \lambda_i \ln(y_i).$$

Then define $\ell_o(\boldsymbol\theta^{(k+1)}|\boldsymbol\theta^{(k)})$ such that

$$\ell_o(\boldsymbol\theta^{(k+1)}) \geq \ell_o(\boldsymbol\theta^{(k)}) + \Delta(\boldsymbol\theta^{(k+1)}|\boldsymbol\theta^{(k)}) \equiv \ell_o(\boldsymbol\theta^{(k+1)}|\boldsymbol\theta^{(k)}),$$

or, writing $\boldsymbol\theta^{(k+1)} = \boldsymbol\theta$,

$$\ell_o(\boldsymbol\theta) \geq \ell_o(\boldsymbol\theta^{(k)}) + \Delta(\boldsymbol\theta|\boldsymbol\theta^{(k)}) \equiv \ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)}).$$

Hence $\ell_o(\boldsymbol\theta^{(k+1)}|\boldsymbol\theta^{(k)})$ is bounded above by $\ell_o(\boldsymbol\theta^{(k+1)})$, or $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$ is bounded above by $\ell_o(\boldsymbol\theta)$ in general.


[Diagram: the lower bound $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$ touches the log-likelihood $\ell_o(\boldsymbol\theta)$ at $\boldsymbol\theta^{(k)}$ and lies below it elsewhere; in the diagram's notation, $\boldsymbol\theta^{(k+1)}$ corresponds to $\theta_{n+1}$, $\ell_o(\boldsymbol\theta^{(k+1)}|\boldsymbol\theta^{(k)})$ to $\ell(\theta|\theta_n)$ and $\ell_o(\boldsymbol\theta^{(k+1)})$ to $L(\theta_{n+1})$.] The function $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$ is bounded above by the log-likelihood function $\ell_o(\boldsymbol\theta)$.

Next we show (ii) that $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$ and $\ell_o(\boldsymbol\theta)$ are equal at $\boldsymbol\theta = \boldsymbol\theta^{(k)}$:

$$\begin{aligned}
\ell_o(\boldsymbol\theta^{(k)}|\boldsymbol\theta^{(k)}) &= \ell_o(\boldsymbol\theta^{(k)}) + \Delta(\boldsymbol\theta^{(k)}|\boldsymbol\theta^{(k)}) \\
&= \ell_o(\boldsymbol\theta^{(k)}) + \int f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, \ln \frac{f(\mathbf{y}|\mathbf{z},\boldsymbol\theta^{(k)})\, f(\mathbf{z}|\boldsymbol\theta^{(k)})}{f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, f(\mathbf{y}|\boldsymbol\theta^{(k)})}\, d\mathbf{z} \\
&= \ell_o(\boldsymbol\theta^{(k)}) + \int f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, \ln \frac{f(\mathbf{y},\mathbf{z}|\boldsymbol\theta^{(k)})}{f(\mathbf{y},\mathbf{z}|\boldsymbol\theta^{(k)})}\, d\mathbf{z} = \ell_o(\boldsymbol\theta^{(k)}).
\end{aligned}$$

Hence any $\boldsymbol\theta^{(k+1)}$ which increases $\ell_o(\boldsymbol\theta^{(k+1)}|\boldsymbol\theta^{(k)})$ also increases $\ell_o(\boldsymbol\theta^{(k+1)})$.

Lastly, we show (iii) that the EM algorithm chooses $\boldsymbol\theta^{(k+1)}$ for which $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$ is a maximum. Since $\ell_o(\boldsymbol\theta) \geq \ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$, increasing $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$ ensures that $\ell_o(\boldsymbol\theta)$ is increased at each step. To achieve the greatest increase in $\ell_o(\boldsymbol\theta^{(k+1)})$, the EM algorithm selects $\boldsymbol\theta^{(k+1)}$ which maximizes $\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})$, i.e.


$$\begin{aligned}
\boldsymbol\theta^{(k+1)} &= \arg\max_{\boldsymbol\theta}\, [\ell_o(\boldsymbol\theta|\boldsymbol\theta^{(k)})] = \arg\max_{\boldsymbol\theta}\, [\ell_o(\boldsymbol\theta^{(k)}) + \Delta(\boldsymbol\theta|\boldsymbol\theta^{(k)})] \\
&= \arg\max_{\boldsymbol\theta} \left[\ell_o(\boldsymbol\theta^{(k)}) + \int f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, \ln \frac{f(\mathbf{y}|\mathbf{z},\boldsymbol\theta)\, f(\mathbf{z}|\boldsymbol\theta)}{f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, f(\mathbf{y}|\boldsymbol\theta^{(k)})}\, d\mathbf{z}\right] \\
&= \arg\max_{\boldsymbol\theta} \int f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, \ln [f(\mathbf{y}|\mathbf{z},\boldsymbol\theta)\, f(\mathbf{z}|\boldsymbol\theta)]\, d\mathbf{z} \quad\text{(dropping terms constant w.r.t. } \boldsymbol\theta\text{)} \\
&= \arg\max_{\boldsymbol\theta} \int \ln f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)\, f(\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)})\, d\mathbf{z} = \arg\max_{\boldsymbol\theta}\, E_{\mathbf{z}|\mathbf{y},\boldsymbol\theta^{(k)}}\{\ln f(\mathbf{y},\mathbf{z}|\boldsymbol\theta)\},
\end{aligned}$$

which proves (7).