Fisher Information for GLM
8/11/2019 Fisher Information for GLM
MSH3 Generalized Linear model (Part 1)
Jennifer S.K. CHAN
Course outline
Part I: Generalized Linear Model
1. Maximum Likelihood Inference
Newton-Raphson and Fisher Scoring methods; EM and Monte
Carlo EM algorithms
2. Exponential Family
Two parameter Exponential family; ML estimation for GLM, Deviance; Quasi-likelihood, Random effects models.
3. Model Selection
Deviance for Likelihood Ratio Tests, Wald Tests, AIC and BIC, Examples
4. Survival Analysis
Kaplan-Meier estimator; Proportional hazards models; Cox's proportional hazards model.
MSH3 Generalized linear model
Contents
1 Maximum likelihood Inference
1.1 Motivating examples
1.2 Likelihood function
1.3 Score vector
1.4 Information matrix
1.5 Newton-Raphson and Fisher Scoring methods
1.6 Expectation Maximization (EM) algorithm
1.6.1 Basic EM algorithm
1.6.2 Monte Carlo EM Algorithm
1.7 Appendix for EM algorithm
SydU MSH3 GLM (2012) First semester Dr. J. Chan 1
MSH3 Generalized linear model Ch. 1 Max. likelihood inference
1 Maximum likelihood Inference
1.1 Motivating examples
AIDS deaths (counts)
The numbers of deaths $Y_i$ from AIDS in Australia for three-month periods from 1983 to 1986 are shown below.
The Poisson regression model
$$Y_i \sim \text{Poisson}(\mu_i), \quad \text{with } \mu_i = \exp(a + b t_i) > 0$$
is fitted and the maximum likelihood (ML) estimates are $\hat{a} = 0.376$ and $\hat{b} = 0.254$. For each 3-month period, there is a 28.9% ($\exp(0.254) = 1.289$) increase in expected AIDS deaths. Note that the variance increases with the mean and that the log link function $g(\mu_i) = \ln(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta}$ is used.
> no=c(0,1,2,3,1,5,10,17,23,31,20,25,37,45)
> time=c(1:14)
> poi=glm(no~time, family=poisson(link=log))
> summary(poi)
Call:
glm(formula = y ~ x, family = poisson(link = log))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2502 -0.9815 -0.6770 0.2545 2.6731
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.37571 0.24884 1.51 0.131
x 0.25365 0.02188 11.60
> par=poi$coeff
> names(par)=NULL
> par
[1] 0.3757110 0.2536485
> beta0=par[1]
> beta1=par[2]
> par(mfrow=c(2,2))
> c1=function(time) exp(beta0+beta1*time)
> plot(time,no,pch=20,col="blue")
> curve(c1,1,14,add=TRUE)
> title("Poisson regression")
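The interpretation of the slope as a multiplicative change per quarter can be checked directly. The following is a minimal sketch that refits the model from the data above and exponentiates the slope; the names `fit`, `b_hat`, `rate_ratio` and `pct_increase` are illustrative and not part of the original notes.

```r
# AIDS death counts for 14 consecutive three-month periods (data from the notes)
no   <- c(0, 1, 2, 3, 1, 5, 10, 17, 23, 31, 20, 25, 37, 45)
time <- 1:14

# Poisson regression with log link, as in the notes
fit <- glm(no ~ time, family = poisson(link = "log"))

# Slope on the log scale and its multiplicative interpretation
b_hat        <- unname(coef(fit)["time"])
rate_ratio   <- exp(b_hat)              # expected deaths are multiplied by this each period
pct_increase <- 100 * (rate_ratio - 1)  # same quantity as a percentage

print(c(b_hat = b_hat, rate_ratio = rate_ratio, pct_increase = pct_increase))
```

Because the link is the logarithm, a unit increase in `time` adds `b_hat` to $\ln \mu_i$ and therefore multiplies $\mu_i$ by `exp(b_hat)`.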
Mice data
Twenty-six mice were given different levels $x_i$ of a drug. Outcomes $Y_i$ are whether they responded to the drug ($Y_i = 1$) or not ($Y_i = 0$).
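The logistic regression model for binary data is
$$Y_i \sim \text{Bernoulli}(\pi_i), \quad \text{with } \text{logit}(\pi_i) = \ln\left(\frac{\pi_i}{1-\pi_i}\right) = a + b x_i.$$
Note that
$$\ln\left(\frac{\pi_i}{1-\pi_i}\right) = a + b x_i \;\Rightarrow\; \pi_i = \frac{e^{a+bx_i}}{1+e^{a+bx_i}}.$$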
> y=c(0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,1,1,1,1,1,1,1,1,1,1,1)
> dose=c(0:25)/10
> log=glm(y~dose, family=binomial(link=logit))
> summary(log)
Call:
glm(formula = y ~ dose, family = binomial(link = logit))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5766 -0.4757 0.1376 0.4129 2.1975
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.111 1.638 -2.510 0.0121 *
dose 3.581 1.316 2.722 0.0065 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 35.890 on 25 degrees of freedom
Residual deviance: 17.639 on 24 degrees of freedom
AIC: 21.639
Number of Fisher Scoring iterations: 6
> par=log$coeff
> names(par)=NULL
> par
[1] -4.111361 3.581176
> beta0=par[1]
> beta1=par[2]
> c1=function(dose) exp(beta0+beta1*dose)/(1+exp(beta0+beta1*dose))
> plot(dose,y, pch=20,col="red")
> curve(c1,0,2.5,add=TRUE)
> title("Logistic regression")
[Figure: left panel "Poisson regression", no against time with the fitted curve; right panel "Logistic regression", y against dose with the fitted curve.]
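The inverse-logit relation above can be used to read the fitted logistic curve. The sketch below plugs in the coefficients reported in the output and computes the dose at which the fitted response probability equals 0.5 (where the linear predictor is zero); the helper names `prob`, `dose50` and `odds_ratio` are illustrative assumptions, not quantities computed in the notes.

```r
# Fitted coefficients as reported in the logistic regression output above
beta0 <- -4.111361
beta1 <-  3.581176

# Fitted response probability at a given dose (plogis is the inverse logit)
prob <- function(dose) plogis(beta0 + beta1 * dose)

# Dose at which logit(pi) = 0, i.e. pi = 0.5
dose50 <- -beta0 / beta1

# Multiplicative change in the odds of response per 0.1 unit of dose
odds_ratio <- exp(0.1 * beta1)

print(c(dose50 = dose50, prob_at_dose50 = prob(dose50), odds_ratio = odds_ratio))
```

On the logit scale the model is linear, so each coefficient acts multiplicatively on the odds $\pi_i/(1-\pi_i)$ rather than on $\pi_i$ itself.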
For parameter estimation, the nonparametric LSE, the parametric maximum likelihood (ML, PML, QML, GEE, EM, MCEM, etc.) and Bayesian methods will be discussed. Kernel smoothing and other semi-parametric methods are not included. Model selection is based on the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the deviance information criterion (DIC).
For application, the two examples analyse count data with the Poisson distribution and binary data with the Bernoulli distribution respectively. Others include categorical data with the multinomial distribution and positive continuous data with the Weibull distribution. These illustrate different data distributions under the Exponential Family. The mean of the data distribution is linked to a linear function of covariates with possibly random effects, but the variance is NOT modelled. Popular time series models with heteroskedastic variance and long memory, such as the generalized autoregressive conditional heteroskedastic (GARCH) model and the stochastic volatility (SV) model, will not be considered.
1.2 Likelihood function
Let $Y_1, \dots, Y_n$ be $n$ independent random variables (rv) with probability density functions (pdf) $f_i(y_i, \theta)$ depending on a vector-valued parameter $\theta$. The joint density of $y = (y_1, \dots, y_n)'$,
$$f(y, \theta) = \prod_{i=1}^{n} f(y_i, \theta) = L(\theta, y),$$
as a function of the unknown parameter $\theta$ given $y$, is called the likelihood function. We often work with the logarithm of $f(y, \theta)$, the log-likelihood function:
$$\ell(\theta; y) = \ln L(\theta; y) = \sum_{i=1}^{n} \ln f(y_i; \theta).$$
The maximum-likelihood (ML) estimator $\hat\theta$ maximizes the log-likelihood function given the data $y$, that is,
$$\ell(\hat\theta; y) \ge \ell(\theta; y) \quad \text{for all } \theta.$$
In other words, it makes the observed data as likely as possible under the model.
Example: The log-likelihood function for the geometric distribution.
Consider a series of independent Bernoulli trials with a common probability of success $\pi$. The distribution of the number of failures $Y_i$ before the first success has pdf
$$\Pr(Y_i = y_i) = (1-\pi)^{y_i}\,\pi$$
for $y_i = 0, 1, \dots$. Direct calculation shows that $E(Y_i) = (1-\pi)/\pi$. The log-likelihood function given $y$ is
$$\ell(\pi; y) = \ln L(\pi; y) = \sum_{i=1}^{n} \left[y_i \ln(1-\pi) + \ln \pi\right] = n\left[\bar{y} \ln(1-\pi) + \ln \pi\right],$$
where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ is the sample mean. The fact that the log-likelihood function depends on the observations only through $\bar{y}$ shows that $\bar{y}$ is a sufficient statistic for the unknown probability $\pi$.
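The sufficiency claim can be verified numerically. The following is a minimal sketch, using a small hypothetical sample `y` (not data from the notes), that compares the log-likelihood computed from the individual observations with the expression that uses only $n$ and $\bar{y}$. It relies on the fact that `dgeom` in R counts failures before the first success, which matches the parameterisation above.

```r
# Hypothetical sample of failure counts (illustrative only)
y <- c(2, 0, 5, 3, 1, 7, 3)

# Log-likelihood from the raw observations
loglik_raw <- function(p) sum(dgeom(y, prob = p, log = TRUE))

# Log-likelihood through the sufficient statistic (sample mean) only
loglik_suff <- function(p) length(y) * (mean(y) * log(1 - p) + log(p))

p_grid <- c(0.1, 0.25, 0.5, 0.9)
raw    <- sapply(p_grid, loglik_raw)
suff   <- sapply(p_grid, loglik_suff)
print(cbind(p_grid, raw, suff))
```

The two columns agree for every value of `p`, so any two samples with the same size and the same mean give the same likelihood function.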
Figure: Log-likelihood function for geometric dist. when $n = 20$ and $\bar{y} = 3$. (Three panels against pi: "loglikelihood function", logl(pi); "score function", score(pi); "Expected information function", Ie(pi).)
> n=20
> ym=3
> pi=c(1:100)/100
> logl=function(pi) n*(ym*log(1-pi)+log(pi))
1.3 Score vector
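The first order derivative of the log-likelihood function, called Fisher's score function, is a vector of dimension $p$, where $p$ is the number of parameters, and is denoted by
$$u(\theta) = \frac{\partial \ell(\theta; y)}{\partial \theta}.$$
For example, when $Y_i \sim N(\mu, \sigma^2)$, $u(\theta) = \left(\dfrac{\partial \ell}{\partial \mu}, \dfrac{\partial \ell}{\partial \sigma^2}\right)'$.
If the log-likelihood function is concave, the ML estimates can be obtained by solving the system of equations:
$$u(\hat\theta) = 0.$$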
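Example: The score function for the geometric distribution.
The score function for $n$ observations from a geometric distribution is
$$u(\pi) = \frac{d\ell}{d\pi} = \frac{d}{d\pi}\, n\left(\bar{y} \ln(1-\pi) + \ln \pi\right) = n\left(\frac{1}{\pi} - \frac{\bar{y}}{1-\pi}\right).$$
Setting this equation to zero and solving for $\pi$ gives the ML estimate:
$$\frac{1}{\hat\pi} = \frac{\bar{y}}{1-\hat\pi} \;\Rightarrow\; \hat\pi\,\bar{y} = 1 - \hat\pi \;\Rightarrow\; \hat\pi = \frac{1}{1+\bar{y}} \quad \text{and} \quad \bar{y} = \frac{1-\hat\pi}{\hat\pi}.$$
Note that the ML estimate of the probability of success is the reciprocal of the average number of trials, $1 + \bar{y}$. The more trials it takes to get a success, the lower is the estimated probability of success.
For a sample of $n = 20$ observations with a sample mean of $\bar{y} = 3$, the ML estimate is $\hat\pi = 1/(1 + 3) = 0.25$.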
1.4 Information matrix
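It can be shown that
$$E_y\left[\frac{\partial \ell(\theta)}{\partial \theta}\right] = 0 \quad \text{and} \quad E_y\left[\frac{\partial^2 \ell(\theta)}{\partial \theta\, \partial \theta'}\right] + E_y\left[\frac{\partial \ell(\theta)}{\partial \theta} \frac{\partial \ell(\theta)}{\partial \theta'}\right] = 0.$$
Proof: Since $\int f(y, \theta)\, dy = 1$,
$$\begin{aligned}
\int \frac{\partial f(y, \theta)}{\partial \theta}\, dy &= 0 \\
\int \frac{\partial f(y, \theta)/\partial \theta}{f(y, \theta)}\, f(y, \theta)\, dy &= 0 \\
\int \frac{\partial \ln f(y, \theta)}{\partial \theta}\, f(y, \theta)\, dy &= 0 \\
\int \frac{\partial \ell(\theta)}{\partial \theta}\, f(y, \theta)\, dy &= 0 \\
\Rightarrow \; E_y\left[\frac{\partial \ell(\theta)}{\partial \theta}\right] &= 0
\end{aligned}$$
and
$$\begin{aligned}
\int \frac{\partial}{\partial \theta'}\left[\frac{\partial \ell(\theta)}{\partial \theta}\, f(y, \theta)\right] dy &= 0 \\
\int \left[\frac{\partial^2 \ell(\theta)}{\partial \theta\, \partial \theta'}\, f(y, \theta) + \frac{\partial \ell(\theta)}{\partial \theta} \frac{\partial f(y, \theta)}{\partial \theta'}\right] dy &= 0 \\
\int \left[\frac{\partial^2 \ell(\theta)}{\partial \theta\, \partial \theta'} + \frac{\partial \ell(\theta)}{\partial \theta} \frac{\partial \ell(\theta)}{\partial \theta'}\right] f(y, \theta)\, dy &= 0 \\
\Rightarrow \; E_y\left[\frac{\partial^2 \ell(\theta)}{\partial \theta\, \partial \theta'}\right] + E_y\left[\frac{\partial \ell(\theta)}{\partial \theta} \frac{\partial \ell(\theta)}{\partial \theta'}\right] &= 0. \qquad (1)
\end{aligned}$$
Hence the score function is a random vector with zero mean,
$$E_y[u(\theta)] = E_y\left[\frac{\partial \ell(\theta)}{\partial \theta}\right] = 0,$$
and a variance-covariance matrix given by the information matrix:
$$\text{var}[u(\theta)] = E_y[u(\theta)\, u'(\theta)] = E_y\left[\frac{\partial \ell(\theta)}{\partial \theta} \frac{\partial \ell(\theta)}{\partial \theta'}\right] = I(\theta).$$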
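Under mild regularity conditions, the information matrix can also be obtained as minus the expected value of the second derivatives of the log-likelihood from (1):
$$\text{var}[u(\theta)] = I(\theta) = -E_y\left[\frac{\partial^2 \ell(\theta)}{\partial \theta\, \partial \theta'}\right].$$
Note that the Hessian matrix is
$$H(\theta) = -I_o(\theta) = \frac{\partial^2 \ell(\theta)}{\partial \theta\, \partial \theta'} = \frac{\partial u(\theta)}{\partial \theta} \qquad (2)$$
and $I_o(\theta) = -\dfrac{\partial^2 \ell(\theta)}{\partial \theta\, \partial \theta'} = -H(\theta)$ is sometimes called the observed information matrix. $I_o(\theta)$ indicates the extent to which $\ell(\theta)$ is peaked rather than flat. If it is more peaked, $I_o(\theta)$ is more positive. For example, when $Y_i \sim N(\mu, \sigma^2)$,
$$\frac{\partial^2 \ell(\theta)}{\partial \theta\, \partial \theta'} = \begin{pmatrix} \dfrac{\partial^2 \ell}{\partial \mu^2} & \dfrac{\partial^2 \ell}{\partial \mu\, \partial \sigma^2} \\[2ex] \dfrac{\partial^2 \ell}{\partial \sigma^2\, \partial \mu} & \dfrac{\partial^2 \ell}{\partial (\sigma^2)^2} \end{pmatrix}.$$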
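Example: Information matrix for geometric distribution.
Differentiating the score, we find the observed information to be
$$I_o(\pi) = -\frac{d^2 \ell(\pi)}{d\pi^2} = -\frac{du(\pi)}{d\pi} = -\frac{d}{d\pi}\, n\left(\frac{1}{\pi} - \frac{\bar{y}}{1-\pi}\right) = n\left(\frac{1}{\pi^2} + \frac{\bar{y}}{(1-\pi)^2}\right).$$
To find the expected information, we substitute $\bar{y}$ by $E(\bar{Y}) = E(Y_i) = (1-\pi)/\pi$ in $I_o(\pi)$ to obtain
$$I_e(\pi) = n\left(\frac{1}{\pi^2} + \frac{(1-\pi)/\pi}{(1-\pi)^2}\right) = n\left(\frac{1}{\pi^2} + \frac{1}{\pi(1-\pi)}\right) = n\,\frac{1 - \pi + \pi}{\pi^2(1-\pi)} = \frac{n}{\pi^2(1-\pi)}.$$
Note that $I_e(\pi)$, which reflects the sharpness of the peak, increases with the sample size $n$, since a larger sample provides more information and hence the log-likelihood function is sharper at the peak. When $n = 20$ and $\pi = 0.15$, the expected information is
$$I_e(0.15) = \frac{n}{\pi^2(1-\pi)} = \frac{20}{0.15^2(1 - 0.15)} = 1045.8.$$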
If the sample mean $\bar{y} = 3$, the observed information is
$$I_o(0.15) = n\left(\frac{1}{\pi^2} + \frac{\bar{y}}{(1-\pi)^2}\right) = 20\left(\frac{1}{0.15^2} + \frac{3}{(1 - 0.15)^2}\right) = 971.9.$$
Substituting the ML estimate $\hat\pi = 0.25$, the expected and observed information are $I_o(0.25) = I_e(0.25) = 426.7$ since $\bar{y} = (1-\hat\pi)/\hat\pi$.
> score=function(pi) n*(1/pi-ym/(1-pi))
> Ie=function(pi) n/(pi^2*(1-pi))
> Io=n*(1/pi^2+ym/(1-pi)^2)
>
> logl1=n*(ym*log(1-pi)+log(pi))
> score1=n*(1/pi-ym/(1-pi))
> Ie1=n/(pi^2*(1-pi))
> c(pi[logl1==max(logl1)],pi[score1==0],max(logl1))
[1] 0.25000 0.25000 -44.98681
> c(Io[pi==0.15],Ie1[pi==0.15],Io[pi==0.25],Ie1[pi==0.25])
[1] 971.9339 1045.7516 426.6667 426.6667
>
> par(mfrow=c(2,2))
> plot(logl, col="red",xlab="pi",ylab="logl(pi)")
> points(pi[score1==0],logl1[pi==pi[score1==0]],pch=2,col="red",cex=0.6)
> title("log-likelihood function")
> plot(score, col="red",xlab="pi",ylab="score(pi)")
> abline(h = 0)
> points(pi[score1==0],0,pch=2,col="red",cex=0.6)
> title("score function")
> plot(Ie, col="red",xlab="pi",ylab="Ie(pi)")
> title("Expected information function")
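The two identities proved in this section (zero-mean score, and variance of the score equal to minus the expected second derivative) can be checked for the geometric distribution. The following is a minimal sketch for a single observation ($n = 1$) that evaluates the expectations by summing over the probability function; truncating the support at 5000 is an assumption that is harmless because the remaining probability is negligible. All variable names are illustrative.

```r
p <- 0.25
y <- 0:5000                        # support, truncated where the pmf is negligible
f <- dgeom(y, prob = p)            # Pr(Y = y) = (1 - p)^y * p

u <- 1 / p - y / (1 - p)           # score of one observation
h <- -(1 / p^2 + y / (1 - p)^2)    # second derivative of the log-likelihood

mean_score  <- sum(f * u)                    # E[u], should be 0
var_score   <- sum(f * u^2) - mean_score^2   # var[u]
exp_info    <- -sum(f * h)                   # -E[second derivative]
closed_form <- 1 / (p^2 * (1 - p))           # I_e(p) with n = 1

print(c(mean_score = mean_score, var_score = var_score,
        exp_info = exp_info, closed_form = closed_form))
```

For $n$ independent observations each of the last three quantities is multiplied by $n$, which recovers $I_e(\pi) = n / (\pi^2 (1 - \pi))$.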
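1.5 Newton-Raphson and Fisher Scoring methods
Calculation of the ML estimate often requires iterative procedures. Expanding the score function $u(\hat\theta)$ evaluated at the ML estimate $\hat\theta$ around a trial value $\theta_0$ using a first order Taylor series gives
$$u(\hat\theta) = u(\theta_0) + \frac{\partial u(\theta_0)}{\partial \theta}(\hat\theta - \theta_0) + \text{higher order terms in } (\hat\theta - \theta_0). \qquad (3)$$
Ignoring higher order terms, equating (3) to zero and solving for $\hat\theta$, we have
$$\hat\theta \approx \theta_0 - \left[\frac{\partial u(\theta_0)}{\partial \theta}\right]^{-1} u(\theta_0) \qquad (4)$$
since $u(\hat\theta) = 0$. Then the Newton-Raphson (NR) procedure to obtain an improved estimate $\theta^{(k+1)}$ using the estimate $\theta^{(k)}$ at the $k$-th iteration is
$$\theta^{(k+1)} = \theta^{(k)} - \left[\frac{\partial^2 \ell(\theta)}{\partial \theta\, \partial \theta'}\right]^{-1} \left.\frac{\partial \ell(\theta)}{\partial \theta}\right|_{\theta = \theta^{(k)}}. \qquad (5)$$
The iterative procedure is repeated until the difference between $\theta^{(k+1)}$ and $\theta^{(k)}$ is sufficiently close to zero. Then (proof as exercise)
$$\text{var}(\hat\theta) = I_o(\hat\theta)^{-1} = -\left[\frac{\partial^2 \ell(\hat\theta)}{\partial \theta\, \partial \theta'}\right]^{-1}.$$
At the ML estimate, the log-likelihood is concave downwards, so the second order derivative $H(\hat\theta)$ is negative. The sharper the curvature (more information) of $\ell(\theta)$, the more negative $H(\hat\theta)$ is and hence the estimates have smaller variance $\text{var}(\hat\theta) = I_o(\hat\theta)^{-1} = -H(\hat\theta)^{-1}$. The NR procedure tends to converge quickly if the log-likelihood is well-behaved (close to quadratic) in a neighborhood of the ML estimate $\hat\theta$ and if the starting value $\theta_0$ is reasonably close to $\hat\theta$.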
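As an illustration of $\text{var}(\hat\theta) = I_o(\hat\theta)^{-1}$, the sketch below evaluates the standard error for the geometric example with $n = 20$ and $\bar{y} = 3$. The Wald-type 95% interval at the end is an illustrative addition based on the usual normal approximation and is not derived in this section; the variable names are assumptions.

```r
n  <- 20
ym <- 3

pi_hat <- 1 / (1 + ym)                               # closed-form ML estimate
Io_hat <- n * (1 / pi_hat^2 + ym / (1 - pi_hat)^2)   # observed information at pi_hat
se     <- sqrt(1 / Io_hat)                           # sqrt of var(pi_hat) = 1 / Io(pi_hat)

# Wald-type 95% interval (normal approximation; illustrative addition)
wald_ci <- pi_hat + c(-1, 1) * qnorm(0.975) * se

print(c(pi_hat = pi_hat, Io_hat = Io_hat, se = se))
print(wald_ci)
```

The value of `se` is the quantity reported in the `se` column of the iteration tables once the iterations have converged.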
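An alternative procedure first suggested by Fisher is to replace the observed information matrix $I_o(\theta)$ by its expected value $I_e(\theta)$. The procedure, known as Fisher Scoring (FS), is
$$\theta^{(k+1)} = \theta^{(k)} - \left\{E\left[\frac{\partial^2 \ell(\theta)}{\partial \theta\, \partial \theta'}\right]\right\}^{-1} \left.\frac{\partial \ell(\theta)}{\partial \theta}\right|_{\theta = \theta^{(k)}}. \qquad (6)$$
For multimodal likelihoods, both methods may converge to a local (not necessarily global) maximum.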
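Update (6) can be written out for a model with more than one parameter. The following is a minimal sketch of Fisher scoring for the Poisson regression of Section 1.1, assuming the standard results for the log link that the score is $X'(y - \mu)$ and the expected information is $X'\,\text{diag}(\mu)\,X$; these expressions, the starting value and the variable names are assumptions of this sketch rather than formulas stated in the notes.

```r
# AIDS data from Section 1.1
no   <- c(0, 1, 2, 3, 1, 5, 10, 17, 23, 31, 20, 25, 37, 45)
time <- 1:14
X    <- cbind(1, time)                 # design matrix: intercept and time

beta <- c(log(mean(no)), 0)            # start from the constant-mean model
for (k in 1:50) {
  mu    <- as.vector(exp(X %*% beta))  # fitted means under the log link
  score <- t(X) %*% (no - mu)          # score vector u(beta)
  info  <- t(X) %*% (mu * X)           # expected information X' diag(mu) X
  step  <- solve(info, score)          # I_e^{-1} u
  beta  <- beta + as.vector(step)      # Fisher scoring update, as in (6)
  if (max(abs(step)) < 1e-10) break    # stop when the update is negligible
}
print(beta)
```

For this model the log link is canonical, so the observed and expected information coincide and the same loop is also a Newton-Raphson iteration. The result can be compared with the `glm` estimates reported in Section 1.1.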
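Example: NR and FS methods for geometric distribution.
Setting the score to zero leads to an explicit solution for the ML estimate $\hat\pi = \dfrac{1}{1 + \bar{y}}$ and no iteration is needed. For illustrative purposes, the iterative procedure is performed. Using the previous results,
$$\frac{d\ell}{d\pi} = n\left(\frac{1}{\pi} - \frac{\bar{y}}{1-\pi}\right), \quad \frac{d^2\ell}{d\pi^2} = -n\left(\frac{1}{\pi^2} + \frac{\bar{y}}{(1-\pi)^2}\right), \quad E\left(\frac{d^2\ell}{d\pi^2}\right) = -\frac{n}{\pi^2(1-\pi)}.$$
The Fisher scoring procedure leads to the updating formula
$$\begin{aligned}
\pi^{(k+1)} &= \pi^{(k)} - \left[E\left(\frac{d^2\ell}{d\pi^2}\right)\right]^{-1} \left.\frac{d\ell}{d\pi}\right|_{\pi = \pi^{(k)}} \\
&= \pi^{(k)} + \frac{(\pi^{(k)})^2 (1 - \pi^{(k)})}{n}\; n\left(\frac{1}{\pi^{(k)}} - \frac{\bar{y}}{1 - \pi^{(k)}}\right) \\
&= \pi^{(k)} + (\pi^{(k)})^2 (1 - \pi^{(k)})\; \frac{1 - \pi^{(k)} - \pi^{(k)} \bar{y}}{\pi^{(k)} (1 - \pi^{(k)})} \\
&= \pi^{(k)} + (1 - \pi^{(k)} - \pi^{(k)} \bar{y})\, \pi^{(k)}.
\end{aligned}$$
If the sample mean is $\bar{y} = 3$ and we start from $\pi_0 = 0.1$, say, the procedure converges to the ML estimate $\hat\pi = 0.25$ (to three decimal places) in four iterations.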
> n=20
> ym=3
> pi=0.1
> result=matrix(0,10,7)
>
> for (i in 1:10) {
+ dl=n*(1/pi-ym/(1-pi))
+ dl2=-n/(pi^2*(1-pi))
+ pi=pi-dl/dl2
+ #pi=pi+(1-pi-pi*ym)*pi
+ se=sqrt(-1/dl2)
+ l=n*(ym*log(1-pi)+log(pi))
+ step=(1-pi-pi*ym)*pi
+ result[i,]=c(i,pi,se,l,dl,dl2,step)
+ }
> colnames(result)=c("Iter","pi","se","l","dl","dl2","step")
> result
Iter pi se l dl dl2 step
[1,] 1 0.1600000 0.02121320 -47.11283 1.333333e+02 -2222.2222 5.760000e-02
[2,] 2 0.2176000 0.03279024 -45.22528 5.357143e+01 -930.0595 2.820096e-02
[3,] 3 0.2458010 0.04303862 -44.99060 1.522465e+01 -539.8628 4.128512e-03
[4,] 4 0.2499295 0.04773221 -44.98681 1.812051e+00 -438.9114 7.050785e-05
[5,] 5 0.2500000 0.04840091 -44.98681 3.009750e-02 -426.8674 1.989665e-08
[6,] 6 0.2500000 0.04841229 -44.98681 8.489239e-06 -426.6667 1.582068e-15
[7,] 7 0.2500000 0.04841229 -44.98681 6.661338e-13 -426.6667 0.000000e+00
[8,] 8 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00
[9,] 9 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00
[10,] 10 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00
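Alternatively the Newton-Raphson procedure is
$$\begin{aligned}
\pi^{(k+1)} &= \pi^{(k)} - \left[\frac{d^2\ell}{d\pi^2}\right]^{-1} \left.\frac{d\ell}{d\pi}\right|_{\pi = \pi^{(k)}} \\
&= \pi^{(k)} + \frac{1}{n}\left[\frac{1}{(\pi^{(k)})^2} + \frac{\bar{y}}{(1 - \pi^{(k)})^2}\right]^{-1} n\left(\frac{1}{\pi^{(k)}} - \frac{\bar{y}}{1 - \pi^{(k)}}\right) \\
&= \pi^{(k)} + \frac{(\pi^{(k)})^2 (1 - \pi^{(k)})^2}{1 - 2\pi^{(k)} + (\pi^{(k)})^2 + \bar{y}(\pi^{(k)})^2} \cdot \frac{1 - \pi^{(k)} - \pi^{(k)} \bar{y}}{\pi^{(k)} (1 - \pi^{(k)})} \\
&= \pi^{(k)} + \frac{\pi^{(k)} (1 - \pi^{(k)}) (1 - \pi^{(k)} - \pi^{(k)} \bar{y})}{1 - 2\pi^{(k)} + (1 + \bar{y})(\pi^{(k)})^2}.
\end{aligned}$$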
> n=20
> ym=3
> pi=0.1 #starting value
> result=matrix(0,10,7)
>
> for (i in 1:10) {
+ dl=20*(1/pi-ym/(1-pi))
+ dl2=-20*(1/pi^2+3/(1-pi)^2)
+ pi=pi-dl/dl2
+ #pi=pi+(1-pi)*(1-pi-pi*ym)*pi/(1-2*pi+4*pi^2)
+ se=sqrt(-1/dl2)
+ l=n*(ym*log(1-pi)+log(pi))
+ step=(1-pi)*(1-pi-pi*ym)*pi/(1-2*pi+4*pi^2)
+ result[i,]=c(i,pi,se,l,dl,dl2,step)
+ }
> colnames(result)=c("Iter","pi","se","l","dl","dl2","step")
> result
Iter pi se l dl dl2 step
[1,] 1 0.1642857 0.02195775 -46.89107 1.333333e+02 -2074.0741 6.039726e-02
[2,] 2 0.2246830 0.03477490 -45.13029 4.994426e+01 -826.9292 2.344114e-02
[3,] 3 0.2481241 0.04490170 -44.98756 1.162661e+01 -495.9916 1.866426e-03
[4,] 4 0.2499905 0.04816876 -44.98681 8.044145e-01 -430.9919 9.453797e-06
[5,] 5 0.2500000 0.04841107 -44.98681 4.033823e-03 -426.6882 2.383524e-10
[6,] 6 0.2500000 0.04841229 -44.98681 1.016970e-07 -426.6667 0.000000e+00
[7,] 7 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00
[8,] 8 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00
[9,] 9 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00
[10,] 10 0.2500000 0.04841229 -44.98681 0.000000e+00 -426.6667 0.000000e+00
For both algorithms, $\pi^{(k)}$, $u(\pi^{(k)})$ and $u'(\pi^{(k)})$ converge to 0.25, 0 (slope) and -426.6667 (curvature) respectively. Note that the NR method, using the exact $I_o(\pi)$, may converge faster than the FS method.
The maximization can also be done using a maximizer:
> logl = function(pi) -20*(3*log(1-pi)+log(pi))
> pi.hat = optimize(logl, c(0, 1), tol = 0.0001)
> pi.hat
$minimum
[1] 0.2500143
$objective
[1] 44.98681
1.6 Expectation Maximization (EM) algorithm
1.6.1 Basic EM algorithm
The Expectation-Maximization (EM) algorithm was proposed by Dempster et al. (1977). It is an iterative approach for computing the maximum likelihood estimates (MLEs) for incomplete-data problems.
Let $y$ be the observed data, $z$ be the latent or missing data and $\theta$ be the unknown parameters to be estimated. The functions $f(y|\theta)$ and $f(y, z|\theta)$ are called the observed data and complete data likelihood functions respectively. The observed data likelihood $L_o(\theta) = f(y|\theta)$ is the expectation of $f(y|z, \theta)$ w.r.t. $f(z|\theta)$, that is,
$$f(y|\theta) = \int f(y, z|\theta)\, dz = \int f(y|z, \theta) f(z|\theta)\, dz = E_{z|\theta}[f(y|z, \theta)].$$
To find the ML estimate, one should maximize
$$\ell_o(\theta) = \ln f(y|\theta) = \ln \int f(y|z, \theta) f(z|\theta)\, dz = \ln E_{z|\theta}[f(y|z, \theta)].$$
The EM algorithm maximizes $\ell_o(\theta|\theta^{(k)})$ (the proof is given in the appendix), which is equivalent to maximizing, as a function of $\theta$,
$$E_{z|y, \theta^{(k)}}\{\ln f(y, z|\theta)\} = \int \ln f(y, z|\theta)\, f(z|y, \theta^{(k)})\, dz \qquad (7)$$
given $\theta^{(k)}$ in an iterative procedure. Note that it takes into account the posterior distribution of $z$, i.e. $f(z|y, \theta^{(k)})$, and so it provides a framework for estimating $z$ in the E-step. With the estimated $z$, the M-step is simplified, whereas the classical ML method requires direct maximization of $\ell_o(\theta)$, which may involve integration over $f(z|\theta)$, a prior distribution for $z$.
The EM algorithm consists of two steps: the E-step and the M-step.
1. E-step: Evaluate the conditional expectation of the complete data log-likelihood function, $\ell_c(\theta) = \ln f(y, z^{(k)}|\theta)$, by replacing $z$ by $z^{(k)} = E(z|y, \theta^{(k)})$.
2. M-step: Maximize $\ell_c(\theta) = \ln f(y, z^{(k)}|\theta)$ w.r.t. $\theta$ to obtain $\theta^{(k+1)}$. Return to the E-step with $\theta^{(k+1)}$.
3. Stopping rule: Iterations (expectation of $z$ given $\theta^{(k)}$) within iterations (maximization of $\theta$ given $z^{(k)}$ for each $k$) arise, and they should stop when $||\theta^{(k+1)} - \theta^{(k)}||$ is sufficiently small.
Remarks
1. The EM algorithm makes use of the Principle of Data Augmentation which states that:
EM inference: Augment the observed data $y$ with latent data $z$ so that the likelihood of the complete data $f(y, z|\theta)$ is simple and then obtain the MLE of $\theta$ based on this complete likelihood function.
Bayesian inference: Augment the observed data $y$ with latent data $z$ so that the augmented posterior density $f(\theta|y, z)$ is simple and then use this simple posterior distribution in sampling the parameters $\theta$.
2. The Bayesian approach simply treats $z$ as another latent variable and so the distinction between the E and M steps disappears. Both $\theta$ and $z$ are updated through a (Markov) chain one at a time.
3. The EM algorithm can be applied to different missing or incomplete-data situations, such as censored observations, random effects models, mixture models, and models with latent classes or latent variables.
4. The EM algorithm has a linear rate of convergence which depends on the proportion of information about $\theta$ in $f(y|\theta)$ which is observed. The convergence is usually slower than the NR method.
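Example: (Darwin data) The data contains two very low outliers. We consider the mixture model:
$$y_i \sim \begin{cases} N(\mu_1, \sigma^2), & p = 0.9 \\ N(\mu_2, \sigma^2), & p = 0.1 \end{cases}$$
or
$$y_i \sim 0.9\, N(\mu_1, \sigma^2) + 0.1\, N(\mu_2, \sigma^2).$$
Let $w_{ij}$ be the indicator that observation $i$ comes from group $j$, $j = 1, 2$, with $w_{i1} + w_{i2} = 1$. We do not know which normal distribution each observation $y_i$ comes from. In other words, $w_{ij}$ is unobserved.
In the M-step, writing $r_{ij} = y_i - \mu_j$, the complete data likelihood, log-likelihood and their 1st and 2nd order derivative functions are
$$\begin{aligned}
L(\theta) &= \prod_{i=1}^{n} [0.9\, \phi(y_i|\mu_1, \sigma^2)]^{w_{i1}} [0.1\, \phi(y_i|\mu_2, \sigma^2)]^{w_{i2}} \\
\ell(\theta) &= \sum_{i=1}^{n} \left[w_{i1} \ln 0.9 + w_{i1} \ln \phi(y_i|\mu_1, \sigma^2) + w_{i2} \ln 0.1 + w_{i2} \ln \phi(y_i|\mu_2, \sigma^2)\right] \\
\ln \phi(y_i|\mu_j, \sigma^2) &= -\frac{1}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i - \mu_j)^2 \\
\frac{\partial}{\partial \mu_j} \ln \phi(y_i|\mu_j, \sigma^2) &= \frac{1}{\sigma^2}(y_i - \mu_j) = \frac{r_{ij}}{\sigma^2} \\
\frac{\partial}{\partial \sigma^2} \ln \phi(y_i|\mu_j, \sigma^2) &= -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(y_i - \mu_j)^2 = \frac{1}{2\sigma^4}(r_{ij}^2 - \sigma^2) \\
\frac{\partial \ell(\theta)}{\partial \mu_j} &= \frac{\partial}{\partial \mu_j} \sum_{i=1}^{n} w_{ij} \ln \phi(y_i|\mu_j, \sigma^2) = \frac{1}{\sigma^2} \sum_{i=1}^{n} w_{ij} r_{ij}, \quad j = 1, 2 \\
\frac{\partial \ell(\theta)}{\partial \sigma^2} &= \frac{\partial}{\partial \sigma^2} \sum_{i=1}^{n} \sum_{j=1}^{2} w_{ij} \ln \phi(y_i|\mu_j, \sigma^2) = \frac{1}{2\sigma^4} \sum_{i=1}^{n} \sum_{j=1}^{2} w_{ij}(r_{ij}^2 - \sigma^2) \\
\frac{\partial^2 \ell(\theta)}{\partial \mu_j^2} &= -\frac{1}{\sigma^2} \sum_{i=1}^{n} w_{ij} \\
\frac{\partial^2 \ell(\theta)}{\partial (\sigma^2)^2} &= -\frac{1}{\sigma^6} \sum_{i=1}^{n} \sum_{j=1}^{2} w_{ij}(r_{ij}^2 - \sigma^2) - \frac{n}{2\sigma^4} \\
\frac{\partial^2 \ell(\theta)}{\partial \mu_j\, \partial \sigma^2} &= -\frac{1}{\sigma^4} \sum_{i=1}^{n} w_{ij} r_{ij} \\
\frac{\partial^2 \ell(\theta)}{\partial \mu_1\, \partial \mu_2} &= 0
\end{aligned}$$
In the E-step, the conditional expectation of $w_{i1}$ is
$$\begin{aligned}
\hat{w}_{i1} &= 1 \times \Pr(W_{i1} = 1|y_i) + 0 \times \Pr(W_{i1} = 0|y_i) = \Pr(W_{i1} = 1|y_i) \\
&= \frac{\Pr(W_{i1} = 1, y_i)}{\Pr(y_i)} = \frac{\Pr(W_{i1} = 1) \Pr(y_i|W_{i1} = 1)}{\Pr(y_i)} \\
&= \frac{\Pr(W_{i1} = 1) \Pr(y_i|W_{i1} = 1)}{\Pr(W_{i1} = 1) \Pr(y_i|W_{i1} = 1) + \Pr(W_{i1} = 0) \Pr(y_i|W_{i1} = 0)} \\
&= \frac{0.9\, \phi(y_i|\mu_1, \sigma^2)}{0.9\, \phi(y_i|\mu_1, \sigma^2) + 0.1\, \phi(y_i|\mu_2, \sigma^2)}
\end{aligned}$$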
> y=c(-67,-48,6,8,14,16,23,24,28,29,41,49,56,60,75)
> n=length(y)
> p=3 #no. of par.
> iterE=5
> iterM=10
> dim1=iterE*iterM
> dim2=2*p+3
> dl=c(rep(0,p))
> result=matrix(0,dim1,dim2)
> theta=c(30,-37,729) #starting values
>
> for (k in 1:iterE) { # E-step
+ ew1=0.9*exp(-0.5*(y-theta[1])^2/theta[3])
+ ew2=0.1*exp(-0.5*(y-theta[2])^2/theta[3])
+ w1=ew1/(ew1+ew2)
+ w1m=mean(w1)
[18,] 2 8 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096
[19,] 2 9 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096
[20,] 2 10 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096
[21,] 3 1 32.99249 5.507838 -57.40463 14.03871 384.9492 145.69396 -71.91673
[22,] 3 2 32.99113 5.441860 -57.39541 13.86992 385.1901 140.51959 -71.91425
[23,] 3 3 32.99113 5.443563 -57.39540 13.87426 385.1903 140.65153 -71.91425
[24,] 3 4 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425
[25,] 3 5 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425
[26,] 3 6 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425
[27,] 3 7 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425
[28,] 3 8 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425
[29,] 3 9 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425
[30,] 3 10 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425
[31,] 4 1 32.99290 5.443541 -57.41186 13.87464 384.8527 140.71323 -71.90744
[32,] 4 2 32.99290 5.441156 -57.41185 13.86856 384.8531 140.52829 -71.90744
[33,] 4 3 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744
[34,] 4 4 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744
[35,] 4 5 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744
[36,] 4 6 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744
[37,] 4 7 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744
[38,] 4 8 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744
[39,] 4 9 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744
[40,] 4 10 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744
[41,] 5 1 32.99296 5.441157 -57.41245 13.86859 384.8414 140.53061 -71.90720
[42,] 5 2 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720
[43,] 5 3 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720
[44,] 5 4 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720
[45,] 5 5 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720
[46,] 5 6 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720
[47,] 5 7 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720
[48,] 5 8 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720
[49,] 5 9 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720
[50,] 5 10 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720
> w=cbind(w1,w2)
> w
w1 w2
[1,] 2.315067e-05 9.999768e-01
[2,] 2.004754e-03 9.979952e-01
[3,] 9.984605e-01 1.539495e-03
[4,] 9.990371e-01 9.629220e-04
[5,] 9.997646e-01 2.353932e-04
[6,] 9.998528e-01 1.471616e-04
[7,] 9.999716e-01 2.842587e-05
[8,] 9.999775e-01 2.247488e-05
[9,] 9.999912e-01 8.782696e-06
[10,] 9.999931e-01 6.944001e-06
[11,] 9.999996e-01 4.143677e-07
[12,] 9.999999e-01 6.327540e-08
[13,] 1.000000e+00 1.222089e-08
[14,] 1.000000e+00 4.775591e-09
[15,] 1.000000e+00 1.408457e-10
> mean(w1)
[1] 0.866605
From $(\hat{w}_{i1}, \hat{w}_{i2})$, the first two observations belong to group 2 while all the others belong to group 1. Hence the EM method enables classification like cluster analysis, an advantage over the classical likelihood method where the missing data $w_{ij}$ are integrated out, as the observed data likelihood
$$L_o(\theta) = \prod_{i=1}^{n} \left[0.9\, \phi(y_i|\mu_1, \sigma^2) + 0.1\, \phi(y_i|\mu_2, \sigma^2)\right] \qquad (8)$$
is a marginal mixture of two distributions and contains no missing observations.
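The E- and M-steps for the mixture example can be sketched compactly. The following is a minimal sketch, not the code used in the notes: it assumes closed-form M-step updates (weighted means for $\mu_1, \mu_2$ and a pooled weighted variance for $\sigma^2$, obtained by setting the first derivatives above to zero) in place of the inner Newton-type iterations, and the stopping tolerance and variable names are illustrative.

```r
# Darwin data and starting values as in the notes
y <- c(-67, -48, 6, 8, 14, 16, 23, 24, 28, 29, 41, 49, 56, 60, 75)
n <- length(y)
theta <- c(mu1 = 30, mu2 = -37, sigma2 = 729)

for (k in 1:200) {
  # E-step: posterior probability that observation i belongs to group 1
  d1 <- 0.9 * dnorm(y, theta[["mu1"]], sqrt(theta[["sigma2"]]))
  d2 <- 0.1 * dnorm(y, theta[["mu2"]], sqrt(theta[["sigma2"]]))
  w1 <- d1 / (d1 + d2)
  w2 <- 1 - w1

  # M-step: closed-form maximizers of the complete data log-likelihood
  mu1    <- sum(w1 * y) / sum(w1)
  mu2    <- sum(w2 * y) / sum(w2)
  sigma2 <- sum(w1 * (y - mu1)^2 + w2 * (y - mu2)^2) / n

  new_theta <- c(mu1 = mu1, mu2 = mu2, sigma2 = sigma2)
  converged <- max(abs(new_theta - theta)) < 1e-8
  theta <- new_theta
  if (converged) break
}

print(round(theta, 4))
print(round(cbind(w1, w2), 4))   # classification weights from the last E-step
```

The final weights play the role of $(\hat{w}_{i1}, \hat{w}_{i2})$: observations with `w1` close to 0 are assigned to group 2.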
> x=rep(-0.001,n)
> x1=seq(-120,100,0.1)
> fx1=dnorm(x1,theta[1],sqrt(theta[3]))
> fx2=dnorm(x1,theta[2],sqrt(theta[3]))
> fx=0.9*fx1+0.1*fx2
> plot(x1, fx1, xlab="x", ylab="f(x)", ylim=c(-0.001,0.025),
xlim=c(-120,100), pch=20, col="red",cex=0.5)
> points(x1,fx2,pch=20,col="blue",cex=0.5)
> points(x1,fx,pch=20,cex=0.5)
> points(y,x,pch=20,cex=0.8)
> title("Mixture of normal distributions for Darwin data")
[Figure: "Mixture of normal distributions for Darwin data"; f(x) against x.]
Note that this is a mixture model where the two component densities are represented by the blue and red lines. The mixture density in (8) is represented by the black line.
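Example: (Right-Censored Data) with Darwin data
Suppose that the first four observations ($c_i$, $i = 1, \dots, 4$) are right censored ($y_i > c_i$) and we assume that $z_i$, $i = 1, \dots, 4$; $y_i$, $i = 5, \dots, n \sim N(\mu, \sigma^2)$.
Let $\theta = (\mu, \sigma)$, $z = (z_1, \dots, z_4)$ and $y = (y_5, \dots, y_n)$. Then
$$\ell_c(\theta) = \ln f(z, y|\theta) = -\frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{4} (z_i - \mu)^2 - \frac{1}{2\sigma^2} \sum_{i=5}^{n} (y_i - \mu)^2.$$
For the censored observations $z_i > c_i$, $i = 1, \dots, 4$, the conditional distribution is a truncated normal on $(c_i, \infty)$, with the density function
$$f(z|\mu, \sigma, c_i) = \frac{\phi(z|\mu, \sigma^2)}{1 - \Phi\left(\frac{c_i - \mu}{\sigma}\right)}, \quad z > c_i \qquad (9)$$
where $\phi$ and $\Phi$ are the pdf and cdf functions for the normal distribution. Let $\theta^{(k)} = (\mu^{(k)}, \sigma^{2\,(k)})$ be the current estimates of $\theta$.
In the E-step, the conditional expectation of $z_i$, $i = 1, \dots, 4$, given $y$, $\theta^{(k)}$ and $c_i$ is
$$\hat{z}_i^{(k)} = E(z_i|y, \theta^{(k)}, c_i) = \int_{c_i}^{\infty} z\, f(z|\mu^{(k)}, \sigma^{(k)}, c_i)\, dz = \mu^{(k)} + \sigma^{(k)}\, \frac{\phi\left(\frac{c_i - \mu^{(k)}}{\sigma^{(k)}}\right)}{1 - \Phi\left(\frac{c_i - \mu^{(k)}}{\sigma^{(k)}}\right)} \quad \text{or} \quad \frac{1}{S} \sum_{j=1}^{S} z_{ij}^{(k)}$$
since
$$\frac{1}{\sqrt{2\pi}} \int_{c}^{\infty} z \exp\left(-\frac{1}{2} z^2\right) dz = \frac{1}{\sqrt{2\pi}} \int_{c}^{\infty} \exp\left(-\frac{1}{2} z^2\right) d\left(\frac{1}{2} z^2\right) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} c^2\right),$$
where $z_{ij}^{(k)}$, $j = 1, \dots, S$, is simulated from $f(z|\mu^{(k)}, \sigma^{2\,(k)}, c_i)$ in (9) in the Monte Carlo approximation of the conditional expectation.
In the M-step, $\hat{z}_i^{(k)}$ is substituted for the censored observation $z_i$. With the complete data $(\hat{z}^{(k)}, y)$, $\mu$ and $\sigma^2$ of the normal data distribution are given by their sample mean and sample variance. Hence no iteration is required for the M-step.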
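The closed-form conditional mean used in the E-step can be checked against direct numerical integration of the truncated normal density (9). The sketch below uses illustrative values for $\mu$, $\sigma$ and the censoring point that are assumptions of this example, not estimates from the notes.

```r
# Illustrative values only (not estimates from the notes)
mu    <- 33
sigma <- 22
cens  <- 6

# Closed form: mu + sigma * phi(a) / (1 - Phi(a)), with a the standardised censoring point
a     <- (cens - mu) / sigma
exact <- mu + sigma * dnorm(a) / (1 - pnorm(a))

# Numerical integration of z * f(z | mu, sigma, c) over (c, Inf)
tail_prob     <- 1 - pnorm(cens, mu, sigma)
by_quadrature <- integrate(function(z) z * dnorm(z, mu, sigma),
                           lower = cens, upper = Inf)$value / tail_prob

print(c(exact = exact, by_quadrature = by_quadrature))
```

Both values exceed the censoring point, as they must, because the expectation is taken over $z > c_i$ only.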
> library("msm")
> cy=c(6,8,14,16,23,24,28,29,41,49,56,60,75) #first 4 obs are censored
> w=c(0,0,0,0,1,1,1,1,1,1,1,1,1)
> n=length(cy)
> S=10000 #sim 10000 z for z hat
> m=4 #first 4 censored
> p=2 #2 parameters
> cen=cy[1:m] #censored obs
> y=cy[(m+1):n] #uncensored obs
> iterE=10
> dim=p+m+1
> result=matrix(0,iterE,dim)
> simz=matrix(0,m,S)
> z=rep(0,m)
> theta=c(mean(cy),var(cy)) #starting value for mu & sigma2
>
> for (k in 1:iterE) { #E-step
+
+ for (j in 1:m) {
+ # simz[j,]=rtnorm(S,mean=theta[1],sd=sqrt(theta[2]),lower=cen[j],
+ #   upper=Inf) #monte carlo approx. of E(Z|Z>c)
+ # z[j]=mean(simz[j,])
+
+ cz=(cen[j]-theta[1])/sqrt(theta[2])
+ z[j]=theta[1]+dnorm(cz)*sqrt(theta[2])/(1-pnorm(cz)) #exact
+ }
+ yr=c(z,y)
+ theta[1]=mean(yr) #M-step
+ theta[2]=(sum(yr^2)-sum(yr)^2/n)/n
+ result[k,]=c(k,theta[1],theta[2],z[1],z[2],z[3],z[4])
+
+ }
> colnames(result)=c("iE","mu","sigma2","ez1","ez2","ez3","ez4")
> print(result,digit=5) #monte carlo approx. of E(Z|Z>c)
iE mu sigma2 ez1 ez2 ez3 ez4
[1,] 1 41.630 211.67 37.262 37.842 40.054 41.036
[2,] 2 42.647 208.03 41.943 42.259 42.455 42.755
[3,] 3 42.921 208.09 42.540 43.007 43.614 43.810
[4,] 4 43.017 208.11 43.281 43.385 43.655 43.900
[5,] 5 43.040 208.17 43.154 43.441 43.733 44.188
[6,] 6 43.048 208.21 42.985 43.371 44.071 44.201
[7,] 7 43.046 208.18 43.158 43.473 43.689 44.277
[8,] 8 43.036 208.14 43.393 43.317 43.721 44.039
[9,] 9 43.061 208.19 43.375 43.298 43.968 44.156
[10,] 10 43.071 208.22 43.336 43.364 43.810 44.411
> print(result,digit=5) #exact E(Z|Z>c)
iE mu sigma2 ez1 ez2 ez3 ez4
[1,] 1 41.659 211.47 37.377 37.996 40.180 41.018
[2,] 2 42.660 208.05 41.948 42.062 42.638 42.932
[3,] 3 42.930 208.06 42.889 42.983 43.478 43.737
[4,] 4 43.006 208.12 43.148 43.239 43.718 43.969
[5,] 5 43.027 208.14 43.221 43.311 43.785 44.035
[6,] 6 43.033 208.15 43.242 43.332 43.805 44.054
[7,] 7 43.035 208.15 43.248 43.337 43.810 44.059
[8,] 8 43.035 208.15 43.249 43.339 43.812 44.061
[9,] 9 43.036 208.15 43.250 43.339 43.812 44.061
[10,] 10 43.036 208.15 43.250 43.340 43.812 44.061
The convergence using the Monte Carlo approximation is subject to random error in the simulation. Parameter estimates are given by the averages over iterations.
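The size of the simulation error can be illustrated without the `msm` package. The sketch below draws from the truncated normal by the inverse-CDF method in base R (a substitute for `rtnorm`, chosen here only to keep the example self-contained) and compares the Monte Carlo mean with the exact conditional mean. The values of `mu`, `sigma` and `cens` are illustrative assumptions.

```r
set.seed(1)

# Illustrative values only (not estimates from the notes)
mu    <- 33
sigma <- 22
cens  <- 6
S     <- 10000

# Inverse-CDF sampling from N(mu, sigma^2) truncated to (cens, Inf)
p_low <- pnorm(cens, mu, sigma)
z     <- qnorm(runif(S, min = p_low, max = 1), mu, sigma)

mc_mean <- mean(z)           # Monte Carlo approximation of E(Z | Z > c)
mc_se   <- sd(z) / sqrt(S)   # its simulation standard error

# Exact conditional mean for comparison
a     <- (cens - mu) / sigma
exact <- mu + sigma * dnorm(a) / (1 - pnorm(a))

print(c(mc_mean = mc_mean, mc_se = mc_se, exact = exact))

# Average of squared deviations versus squared deviation of the average
print(c(mean_sq = mean((z - mu)^2), sq_mean = (mean(z) - mu)^2))
```

The simulation standard error decreases at the rate $1/\sqrt{S}$, which is why the Monte Carlo estimates fluctuate from one iteration to the next while the exact version settles. The last line shows that averaging squared deviations over the draws is not the same as squaring the deviation of the averaged draw.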
1.6.2 Monte Carlo EM Algorithm
Given the current guess to the posterior mode,(k), the conditional ex-
pectation in the E-step may involve integration and can be calculatedusing Monte Carlo (MC) approximation. Similarly the complete datalog-likelihood function c() = ln f(y,z|) can also be approximatedusing MC approximation:
c() = lnf(y, z|) = 1S
Sj=1
ln f(y, z(k)j |) (10)
where z(k)1 , . . . , z(k)S f(z|(k),y) as required in the E-step. Thismaximizes an average of log-likelihood based ln f(y, z|) on simu-lated values which is different from maximizing ln f(y,z|) wherezisaverage of simulated values. Then, in the M-step, we maximize c()in (10) to obtain a new guess, (k+1).Monitoring of convergence: Plot each component of(k) against theiteration number k.
Example: (Right-Censored Data) Consider the Darwin data again. In the E-step, the conditional expectation of $z_i$ given $y$, $\theta^{(k)}$ and $c_i$ is given by (10) and estimated by drawing a sample $z_{i1}^{(k)}, z_{i2}^{(k)}, \dots, z_{iS}^{(k)}$ from the truncated normal $f(z_i|\mu^{(k)}, \sigma^{(k)}, c_i)$ in (9) at the current estimates $\theta^{(k)} = (\mu^{(k)}, \sigma^{(k)})$.
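In the M-step, one obtains a MC approximation to $\ln f(y, z|\theta)$ by

$$\begin{aligned}
\hat{\ell}_c(\theta) &= \frac{1}{S}\sum_{j=1}^{S}\left\{ -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left[\sum_{i=1}^{4}(z_{ij}^{(k)}-\mu)^2 + \sum_{i=5}^{n}(y_i-\mu)^2\right]\right\} \\
&= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left[\sum_{i=1}^{4}\frac{1}{S}\sum_{j=1}^{S}(z_{ij}^{(k)}-\mu)^2 + \sum_{i=5}^{n}(y_i-\mu)^2\right]
\end{aligned}$$

and maximizes it w.r.t. $\theta$ to obtain $\theta^{(k+1)}$ through iterations instead of a closed-form solution. Write $r_i = y_i - \mu$, $i = 5, \dots, n$, $r_{ij} = z_{ij}^{(k)} - \mu$,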
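$\bar{z}_i = \frac{1}{S}\sum_{j=1}^{S} z_{ij}^{(k)}$ and $r_i = \bar{z}_i - \mu$, $i = 1, \dots, 4$. The first derivatives are

$$\frac{\partial \hat{\ell}_c(\theta)}{\partial \mu} = \frac{1}{\sigma^2}\left[\sum_{i=1}^{4}\frac{1}{S}\sum_{j=1}^{S}(z_{ij}^{(k)}-\mu) + \sum_{i=5}^{n}(y_i-\mu)\right] = \frac{1}{\sigma^2}\sum_{i=1}^{n} r_i,$$

$$\begin{aligned}
\frac{\partial \hat{\ell}_c(\theta)}{\partial \sigma^2} &= \frac{1}{2\sigma^4}\left[\sum_{i=1}^{4}\frac{1}{S}\sum_{j=1}^{S}(z_{ij}^{(k)}-\mu)^2 + \sum_{i=5}^{n}(y_i-\mu)^2\right] - \frac{n}{2\sigma^2} \\
&= \frac{1}{2\sigma^4}\left[\sum_{i=1}^{4}\left(\frac{1}{S}\sum_{j=1}^{S} r_{ij}^2 - \sigma^2\right) + \sum_{i=5}^{n}(r_i^2 - \sigma^2)\right].
\end{aligned}$$

Since

$$\frac{1}{S}\sum_{j=1}^{S} r_{ij}^2 = \frac{1}{S}\sum_{j=1}^{S}(z_{ij}^{(k)}-\mu)^2 \neq \left[\frac{1}{S}\sum_{j=1}^{S}(z_{ij}^{(k)}-\mu)\right]^2 = \left[\frac{1}{S}\sum_{j=1}^{S} r_{ij}\right]^2 = r_i^2,$$

the closed-form solution using the sample mean and sample variance of the completed data $(\bar{z}_1, \dots, \bar{z}_4, y_5, \dots, y_n)$ cannot be used. The second derivatives are

$$\frac{\partial^2 \hat{\ell}_c(\theta)}{\partial \mu^2} = -\frac{n}{\sigma^2},$$

$$\begin{aligned}
\frac{\partial^2 \hat{\ell}_c(\theta)}{\partial (\sigma^2)^2} &= -\frac{1}{\sigma^6}\left[\sum_{i=1}^{4}\frac{1}{S}\sum_{j=1}^{S}(z_{ij}^{(k)}-\mu)^2 + \sum_{i=5}^{n}(y_i-\mu)^2\right] + \frac{n}{2\sigma^4} \\
&= -\frac{1}{\sigma^6}\left[\sum_{i=1}^{4}\left(\frac{1}{S}\sum_{j=1}^{S} r_{ij}^2 - \sigma^2\right) + \sum_{i=5}^{n}(r_i^2 - \sigma^2)\right] - \frac{n}{2\sigma^4},
\end{aligned}$$

$$\frac{\partial^2 \hat{\ell}_c(\theta)}{\partial \mu\,\partial \sigma^2} = -\frac{1}{\sigma^4}\sum_{i=1}^{n} r_i.$$

> library("msm")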
> cy=c(6,8,14,16,23,24,28,29,41,49,56,60,75) #first 4 obs are censor time
> w=c(0,0,0,0,1,1,1,1,1,1,1,1,1)
> mean(cy)
[1] 33
> n=length(cy)
> T=10000 #sim 10000 z for z hat
> m=4 #first 4 censored obs
> p=2 #2 pars
> cen=cy[1:m] #censored obs
> y=cy[(m+1):n] #uncensored obs
> iterE=5
> iterM=10
> dim1=iterE*iterM
> dim2=2*p+7
> dl=c(rep(0,p))
> dl2=matrix(0,p,p)
> result=matrix(0,dim1,dim2)
> simz=matrix(0,m,T)
> z=matrix(0,m,1)
> rz=rep(0,m)
> r2z=rep(0,m)
> theta=c(40,400) #starting values for mu & var
>
> for (k in 1:iterE) { #E-step
+ for (j in 1:m) {
+ simz[j,]=rtnorm(T,mean=theta[1],sd=sqrt(theta[2]),lower=cen[j],
+ upper=Inf)
+ z[j]=mean(simz[j,])
+ }
+
+ for (i in 1:iterM) { #M-step
+ rz=z-theta[1]
+ ry=y-theta[1]
+ r=c(rz,ry)
+ r2z=apply((simz-theta[1])^2,1,mean)
+ r2=c(r2z,ry^2)
+ s2=r2-theta[2]
+ dl[1]=sum(r)/theta[2]
+ dl[2]=0.5*sum(s2)/theta[2]^2
+ dl2[1,1]=-n/theta[2]
+ dl2[2,2]=-sum(s2)/theta[2]^3-0.5*n/theta[2]^2
+ dl2[2,1]=dl2[1,2]=-sum(r)/theta[2]^2
+ dl2i=solve(dl2)
+ theta=theta-dl2i%*%dl
+ se=sqrt(diag(-dl2i))
+ l=-n*log(2*pi*theta[2])/2-sum(r^2)/(2*theta[2]) #pi=3.141593
+ row=(k-1)*10+i
+ result[row,]=c(k,i,theta[1],se[1],theta[2],se[2],l,z[1],z[2],
z[3],z[4])
+ }
+ }
> colnames(result)=c("iE","iM","mu","se","sigma2","se","logL","ez1",
"ez2","ez3","ez4")
> print(result,digit=5)
iE iM mu se sigma2 se logL ez1 ez2 ez3 ez4
[1,] 1 1 43.745 5.6699 288.50 160.37 -53.653 42.163 42.427 44.008 44.473
[2,] 1 2 42.965 4.7218 301.25 113.42 -53.557 42.163 42.427 44.008 44.473
[3,] 1 3 42.929 4.8139 301.86 118.16 -53.547 42.163 42.427 44.008 44.473
[4,] 1 4 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
[5,] 1 5 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
[6,] 1 6 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
[7,] 1 7 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
[8,] 1 8 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
[9,] 1 9 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
[10,] 1 10 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
[11,] 2 1 43.280 4.8205 287.03 118.44 -53.460 43.734 43.914 44.876 44.900
[12,] 2 2 43.263 4.6989 287.16 112.58 -53.458 43.734 43.914 44.876 44.900
[13,] 2 3 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[14,] 2 4 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[15,] 2 5 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[16,] 2 6 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[17,] 2 7 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[18,] 2 8 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[19,] 2 9 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[20,] 2 10 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[21,] 3 1 43.320 4.6999 284.71 112.63 -53.446 44.105 43.867 44.783 45.399
[22,] 3 2 43.320 4.6799 284.72 111.67 -53.446 44.105 43.867 44.783 45.399
[23,] 3 3 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[24,] 3 4 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[25,] 3 5 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[26,] 3 6 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[27,] 3 7 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[28,] 3 8 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[29,] 3 9 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[30,] 3 10 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[31,] 4 1 43.325 4.6799 284.61 111.68 -53.446 43.945 43.878 45.109 45.295
[32,] 4 2 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[33,] 4 3 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[34,] 4 4 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[35,] 4 5 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[36,] 4 6 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[37,] 4 7 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[38,] 4 8 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[39,] 4 9 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[40,] 4 10 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[41,] 5 1 43.343 4.6790 284.06 111.63 -53.443 44.072 44.284 44.957 45.149
[42,] 5 2 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[43,] 5 3 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[44,] 5 4 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[45,] 5 5 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[46,] 5 6 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[47,] 5 7 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[48,] 5 8 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[49,] 5 9 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[50,] 5 10 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
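The Newton-Raphson M-step in the listing above can also be sketched as a self-contained function. This is a minimal illustration under stated assumptions, not the code used to produce the table: it performs a single E-step at the starting values, draws the truncated normal values by inverse-CDF sampling in base R rather than with `rtnorm`, and uses the hypothetical helper names `rtrunc_lower` and `score_hess`. The last two lines compute the stationary point obtained by setting both score equations to zero, which uses the averaged squared deviations $\frac{1}{S}\sum_j r_{ij}^2$ rather than $r_i^2$.

```r
# Sketch of one E-step followed by Newton-Raphson iterations for the M-step.
set.seed(1)
cy  <- c(6, 8, 14, 16, 23, 24, 28, 29, 41, 49, 56, 60, 75)  # data from the notes
m   <- 4                      # first 4 observations are censoring times
n   <- length(cy)
cen <- cy[1:m]
y   <- cy[(m + 1):n]

# Hypothetical helper: draw S values from N(mu, sd^2) truncated to (lower, Inf)
rtrunc_lower <- function(S, mu, sd, lower) {
  p_low <- pnorm(lower, mu, sd)
  qnorm(p_low + runif(S) * (1 - p_low), mu, sd)
}

# E-step at the current estimate theta_k = (mu, sigma^2)
theta_k <- c(40, 400)
S       <- 10000
simz <- t(sapply(cen, function(cj)
  rtrunc_lower(S, theta_k[1], sqrt(theta_k[2]), cj)))   # m x S matrix
zbar <- rowMeans(simz)

# Score vector and Hessian of the MC complete-data log-likelihood
score_hess <- function(theta) {
  mu <- theta[1]; s2 <- theta[2]
  r  <- c(zbar - mu, y - mu)                        # r_i
  r2 <- c(rowMeans((simz - mu)^2), (y - mu)^2)      # mean of r_ij^2, and r_i^2
  g  <- c(sum(r) / s2, 0.5 * sum(r2 - s2) / s2^2)
  H  <- matrix(c(-n / s2,          -sum(r) / s2^2,
                 -sum(r) / s2^2,   -sum(r2 - s2) / s2^3 - 0.5 * n / s2^2), 2, 2)
  list(g = g, H = H)
}

# M-step: Newton-Raphson iterations theta <- theta - H^{-1} g
theta <- theta_k
for (i in 1:20) {
  sh    <- score_hess(theta)
  theta <- theta - solve(sh$H, sh$g)
}

# Stationary point from setting both score equations to zero
mu_hat <- mean(c(zbar, y))
s2_hat <- mean(c(rowMeans((simz - mu_hat)^2), (y - mu_hat)^2))
print(rbind(newton = theta, stationary = c(mu_hat, s2_hat)))
```

Wrapping the E-step and this loop in an outer loop over $k$, and plotting each component of $\theta^{(k)}$ against $k$, gives the convergence monitoring described earlier.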
References
Dempster, A.P., Laird, N. & Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38. (with discussion).
McLachlan, G.J. & Krishnan, T. (1997) The EM Algorithm and Extensions. Wiley.
1.7 Appendix for EM algorithm
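To maximize $\ell_o(\theta) = \ln f(y|\theta)$, we wish to compute an updated estimate $\theta^{(k+1)}$ such that

$$\ell_o(\theta^{(k+1)}) > \ell_o(\theta^{(k)}).$$

The idea is to maximize instead the function $\ell_o(\theta|\theta^{(k)})$, which is (i) bounded above by $\ell_o(\theta^{(k+1)})$ at $\theta^{(k+1)}$ and (ii) equal to $\ell_o(\theta^{(k)})$ at $\theta^{(k)}$. Then any $\theta^{(k+1)}$ which increases $\ell_o(\theta^{(k+1)}|\theta^{(k)})$ also increases $\ell_o(\theta^{(k+1)})$. Lastly, the EM algorithm chooses $\theta^{(k+1)}$ as the value of $\theta$ for which $\ell_o(\theta|\theta^{(k)})$ is a maximum.

To show (i), we first consider maximizing the difference

$$\begin{aligned}
\ell_o(\theta^{(k+1)}) - \ell_o(\theta^{(k)})
&= \ln f(y|\theta^{(k+1)}) - \ln f(y|\theta^{(k)}) \\
&= \ln \int f(y|z,\theta^{(k+1)})\, f(z|\theta^{(k+1)})\, dz - \ln f(y|\theta^{(k)}) \\
&= \ln \int f(y|z,\theta^{(k+1)})\, f(z|\theta^{(k+1)})\, \frac{f(z|y,\theta^{(k)})}{f(z|y,\theta^{(k)})}\, dz - \ln f(y|\theta^{(k)}) \\
&= \ln \int f(z|y,\theta^{(k)})\, \frac{f(y|z,\theta^{(k+1)})\, f(z|\theta^{(k+1)})}{f(z|y,\theta^{(k)})}\, dz - \ln f(y|\theta^{(k)}) \\
&\geq \int f(z|y,\theta^{(k)}) \ln \frac{f(y|z,\theta^{(k+1)})\, f(z|\theta^{(k+1)})}{f(z|y,\theta^{(k)})}\, dz - \int f(z|y,\theta^{(k)}) \ln f(y|\theta^{(k)})\, dz \\
&= \int f(z|y,\theta^{(k)}) \ln \frac{f(y|z,\theta^{(k+1)})\, f(z|\theta^{(k+1)})}{f(z|y,\theta^{(k)})\, f(y|\theta^{(k)})}\, dz \;\triangleq\; \Delta(\theta^{(k+1)}|\theta^{(k)})
\end{aligned}$$

since $\int f(z|y,\theta^{(k)})\, dz = 1$ and, by Jensen's inequality, $\ln \sum_{i=1}^{n} \lambda_i y_i \geq \sum_{i=1}^{n} \lambda_i \ln(y_i)$ for weights $\lambda_i \geq 0$ with $\sum_{i=1}^{n}\lambda_i = 1$, with $\ln(\cdot)$ being concave. Then define $\ell_o(\theta^{(k+1)}|\theta^{(k)})$ such that

$$\ell_o(\theta^{(k+1)}) \geq \ell_o(\theta^{(k)}) + \Delta(\theta^{(k+1)}|\theta^{(k)}) \triangleq \ell_o(\theta^{(k+1)}|\theta^{(k)})$$

or

$$\ell_o(\theta) \geq \ell_o(\theta^{(k)}) + \Delta(\theta|\theta^{(k)}) \triangleq \ell_o(\theta|\theta^{(k)}) \quad (\text{writing } \theta^{(k+1)} = \theta)$$

where $\ell_o(\theta|\theta^{(k)}) \triangleq \ell_o(\theta^{(k)}) + \Delta(\theta|\theta^{(k)})$. Hence $\ell_o(\theta^{(k+1)}|\theta^{(k)})$ is bounded above by $\ell_o(\theta^{(k+1)})$, or $\ell_o(\theta|\theta^{(k)})$ is bounded above by $\ell_o(\theta)$ in general.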
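In the diagram, $\theta^{(k+1)}$ corresponds to $\theta_{n+1}$, $\ell_o(\theta^{(k+1)}|\theta^{(k)})$ to $\ell(\theta|\theta_n)$ and $\ell_o(\theta^{(k+1)})$ to $L(\theta_{n+1})$. The function $\ell_o(\theta|\theta^{(k)})$ is bounded above by the log-likelihood function $\ell_o(\theta)$.

Next we show (ii) that $\ell_o(\theta|\theta^{(k)})$ and $\ell_o(\theta)$ are equal at $\theta = \theta^{(k)}$.

$$\begin{aligned}
\ell_o(\theta^{(k)}|\theta^{(k)}) &= \ell_o(\theta^{(k)}) + \Delta(\theta^{(k)}|\theta^{(k)}) \\
&= \ell_o(\theta^{(k)}) + \int f(z|y,\theta^{(k)}) \ln \frac{f(y|z,\theta^{(k)})\, f(z|\theta^{(k)})}{f(z|y,\theta^{(k)})\, f(y|\theta^{(k)})}\, dz \\
&= \ell_o(\theta^{(k)}) + \int f(z|y,\theta^{(k)}) \ln \frac{f(y,z|\theta^{(k)})}{f(y,z|\theta^{(k)})}\, dz \\
&= \ell_o(\theta^{(k)}).
\end{aligned}$$

Hence any $\theta^{(k+1)}$ which increases $\ell_o(\theta^{(k+1)}|\theta^{(k)})$ also increases $\ell_o(\theta^{(k+1)})$.

Lastly, we show (iii) that the EM algorithm chooses $\theta^{(k+1)}$ for which $\ell_o(\theta|\theta^{(k)})$ is a maximum. Since $\ell_o(\theta) \geq \ell_o(\theta|\theta^{(k)})$, increasing $\ell_o(\theta|\theta^{(k)})$ ensures that $\ell_o(\theta)$ is increased at each step.

To achieve the greatest increase in $\ell_o(\theta^{(k+1)})$, the EM algorithm selects the $\theta^{(k+1)}$ which maximizes $\ell_o(\theta|\theta^{(k)})$, i.e.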
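$$\begin{aligned}
\theta^{(k+1)} &= \arg\max_{\theta}\, [\ell_o(\theta|\theta^{(k)})] = \arg\max_{\theta}\, [\ell_o(\theta^{(k)}) + \Delta(\theta|\theta^{(k)})] \\
&= \arg\max_{\theta} \left\{ \ell_o(\theta^{(k)}) + \int f(z|y,\theta^{(k)}) \ln \frac{f(y|z,\theta)\, f(z|\theta)}{f(z|y,\theta^{(k)})\, f(y|\theta^{(k)})}\, dz \right\} \\
&= \arg\max_{\theta} \int f(z|y,\theta^{(k)}) \ln[f(y|z,\theta)\, f(z|\theta)]\, dz \quad (\text{drop the constant term w.r.t. } \theta) \\
&= \arg\max_{\theta} \int \ln f(y,z|\theta)\, f(z|y,\theta^{(k)})\, dz \\
&= \arg\max_{\theta}\, E_{z|y,\theta^{(k)}}\{\ln f(y,z|\theta)\}
\end{aligned}$$

and hence (7) is proved.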
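Properties (i) and (ii) can be checked numerically for the right-censored normal example. The sketch below assumes, purely for illustration, that $\sigma$ is known (`sd0`) so that only $\mu$ varies; `mu_k` is an arbitrary current guess. With $\sigma$ fixed, the conditional variances of the $z_i$ cancel in $\Delta(\mu|\mu^{(k)})$, so only the exact conditional means $E(z_i|z_i>c_i,\mu^{(k)})$ are needed.

```r
# Sketch: the surrogate l_o(mu | mu_k) lies below l_o(mu) and touches it at mu_k.
# sd0 (treated as known) and mu_k are illustrative assumptions.
cy   <- c(6, 8, 14, 16, 23, 24, 28, 29, 41, 49, 56, 60, 75)  # data from the notes
cen  <- cy[1:4]          # censoring times
y    <- cy[5:13]         # uncensored observations
sd0  <- 17
mu_k <- 40

# Observed-data log-likelihood: densities for y, survival probabilities for cen
l_obs <- function(mu) {
  sum(dnorm(y, mu, sd0, log = TRUE)) +
    sum(pnorm(cen, mu, sd0, lower.tail = FALSE, log.p = TRUE))
}

# Exact E(z_i | z_i > c_i) at the current guess mu_k
a  <- (cen - mu_k) / sd0
ez <- mu_k + sd0 * dnorm(a) / pnorm(a, lower.tail = FALSE)

# Delta(mu | mu_k) = E{ln f(y,z|mu)} - E{ln f(y,z|mu_k)} under f(z | y, mu_k)
delta <- function(mu) {
  x <- c(ez, y)
  -(sum((x - mu)^2) - sum((x - mu_k)^2)) / (2 * sd0^2)
}
surrogate <- function(mu) l_obs(mu_k) + delta(mu)

grid    <- seq(30, 55, by = 0.5)
lo      <- sapply(grid, l_obs)
su      <- sapply(grid, surrogate)
mu_next <- mean(c(ez, y))     # maximizer of the surrogate over mu

print(c(max_gap_violation = max(su - lo),
        l_obs_at_mu_k     = l_obs(mu_k),
        l_obs_at_mu_next  = l_obs(mu_next)))
```

Plotting `lo` and `su` against `grid` shows the surrogate curve lying under the log-likelihood and touching it at `mu_k`, and moving to `mu_next` does not decrease the observed-data log-likelihood.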