Applying MCMC Methods to
Multi-level Models
submitted by
William J. Browne
for the degree of PhD
of the
University of Bath
1998
COPYRIGHT
Attention is drawn to the fact that copyright of this thesis rests with its author.
This copy of the thesis has been supplied on the condition that anyone who
consults it is understood to recognise that its copyright rests with its author and
that no quotation from the thesis and no information derived from it may be
published without the prior written consent of the author.
This thesis may be made available for consultation within the University Library
and may be photocopied or lent to other libraries for the purposes of consultation.
Signature of Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
William J. Browne
To
Health, Happiness and Honesty.
Summary
Multi-level modelling and Markov chain Monte Carlo methods are two areas of
statistics that have become increasingly popular recently due to improvements
in computer capabilities, both in storage and speed of operation. The aim
of this thesis is to combine the two areas by fitting multi-level models using
Markov chain Monte Carlo (MCMC) methods. This task has been split into
three parts in this thesis. Firstly, the types of problems that are fitted in
multi-level modelling are identified and the existing maximum likelihood methods are
investigated. Secondly, MCMC algorithms for these models are derived, and finally
these methods are compared to the maximum likelihood based methods both in
terms of estimate bias and interval coverage properties.
Three main groups of multi-level models are considered: firstly, N level
Gaussian models; secondly, binary response multi-level logistic regression models;
and finally, Gaussian models with complex variation at level 1.
Two simple 2 level Gaussian models are considered first and it is shown
how to fit these models using the Gibbs sampler. Then extensive simulation
studies are carried out to compare the Gibbs sampler method with maximum
likelihood methods on these two models. For the general N level Gaussian models,
algorithms for the Gibbs sampler and two alternative hybrid Metropolis Gibbs
methods are given and these three methods are then compared with each other.
One of the hybrid Metropolis Gibbs methods is adapted to fit binary response
multi-level models. This method is then compared with two quasi-likelihood
methods via a simulation study on one binary response model where the
quasi-likelihood methods perform particularly badly.
All of the above models can also be fitted using the Gibbs sampling method
using the adaptive rejection algorithm in the BUGS package (Spiegelhalter et al.
1994). Finally, Gaussian models with complex variation at level 1, which cannot
be fitted in BUGS, are considered. Two methods based on Hastings update steps
are given and are tested on some simple examples.
The MCMC methods in this thesis have been added to the multi-level
modelling package MLwiN (Goldstein et al. 1998) as a by-product of this
research.
Acknowledgements
I would firstly like to thank my supervisor, Dr David Draper, whose research in
the fields of hierarchical modelling and Bayesian statistics motivated this PhD.
I would also like to thank him for his advice and assistance throughout both
my MSc and PhD. I would like to thank my parents for supporting me both
financially and emotionally through my first degree and beyond.
I would like to thank the multilevel models project team at the Institute of
Education, in particular Jon Rasbash and Professor Harvey Goldstein for allowing
me to work with them on the MLwiN package. I would also like to thank them
for their advice and assistance while I have been working on the package.
I would like to thank my brother Edward and his fiancée Meriel for arranging
their wedding a month before I am scheduled to finish this thesis. This way I
can spread my worries between my PhD and my best man's speech. I would like
to thank my girlfriends over the last three years for helping me through various
parts of my PhD. Thanks for giving me love and support when I needed it and
making my life both happy and interesting.
I would like to thank the other members of the statistics group at Bath for
teaching me all I know about statistics today. I would like to thank my fellow
office mates, past and present, for their humour, conversation and friendship and
for joining me in my many pointless conversations. Thanks to family and friends
both in Bath and elsewhere.
Special thanks are due to the EPSRC for their financial support.
"The only thing I know is that I don't know anything."
Socrates
Contents
1 Introduction 1
1.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Summary of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Multi Level Models and MLn 4
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 JSP dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Analysing Redhill school data . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Linear regression . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 Linear models . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Analysing data on the four schools in the borough of Blackbridge 8
2.3.1 ANOVA model . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 ANCOVA model . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.3 Combined regression . . . . . . . . . . . . . . . . . . . . . 11
2.4 Two level modelling . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.1 Iterative generalised least squares . . . . . . . . . . . . . . 13
2.4.2 Restricted iterative generalised least squares . . . . . . . . 15
2.4.3 Fitting variance components models to the Blackbridge
dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.4 Fitting variance components models to the JSP dataset . . 16
2.4.5 Random slopes model . . . . . . . . . . . . . . . . . . . . 17
2.5 Fitting models to pass/fail data . . . . . . . . . . . . . . . . . . . 18
2.5.1 Extending to multi-level modelling . . . . . . . . . . . . . 20
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Markov Chain Monte Carlo Methods 23
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Bayesian inference . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Metropolis sampling . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.1 Proposal distributions . . . . . . . . . . . . . . . . . . . . 26
3.4 Metropolis-Hastings sampling . . . . . . . . . . . . . . . . . . . . 27
3.5 Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5.1 Rejection sampling . . . . . . . . . . . . . . . . . . . . . . 28
3.5.2 Adaptive rejection sampling . . . . . . . . . . . . . . . . . 29
3.5.3 Gibbs sampler as a special case of the Metropolis-Hastings
algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.6 Data summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6.1 Measures of location . . . . . . . . . . . . . . . . . . . . . 31
3.6.2 Measures of spread . . . . . . . . . . . . . . . . . . . . . . 31
3.6.3 Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.7 Convergence issues . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.7.1 Length of burn-in . . . . . . . . . . . . . . . . . . . . . . . 35
3.7.2 Mixing properties of Markov chains . . . . . . . . . . . . . 36
3.7.3 Multi-modal models . . . . . . . . . . . . . . . . . . . . . 41
3.7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.8 Use of MCMC methods in multi-level modelling . . . . . . . . . . 43
3.9 Example - Bivariate normal distribution . . . . . . . . . . . . . . 44
3.9.1 Metropolis sampling . . . . . . . . . . . . . . . . . . . . . 44
3.9.2 Metropolis-Hasting sampling . . . . . . . . . . . . . . . . . 45
3.9.3 Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . 46
3.9.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.9.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4 Gaussian Models 1 - Introduction 53
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Prior distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.1 Informative priors . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.2 Non-informative priors . . . . . . . . . . . . . . . . . . . . 54
4.2.3 Priors for fixed effects . . . . . . . . . . . . . . . . . . . . . . 55
4.2.4 Priors for single variances . . . . . . . . . . . . . . . . . . 55
4.2.5 Priors for variance matrices . . . . . . . . . . . . . . . . . 58
4.3 2 Level variance components model . . . . . . . . . . . . . . . . . 59
4.3.1 Gibbs sampling algorithm . . . . . . . . . . . . . . . . . . 59
4.3.2 Simulation method . . . . . . . . . . . . . . . . . . . . . . 62
4.3.3 Results : Bias . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.4 Results : Coverage probabilities and interval widths . . . . 70
4.3.5 Improving maximum likelihood method interval estimates
for σ²_u . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3.6 Summary of results . . . . . . . . . . . . . . . . . . . . . . 78
4.4 Random slopes regression model . . . . . . . . . . . . . . . . . . . 79
4.4.1 Gibbs sampling algorithm . . . . . . . . . . . . . . . . . . 80
4.4.2 Simulation method . . . . . . . . . . . . . . . . . . . . . . 83
4.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.5.1 Simulation results . . . . . . . . . . . . . . . . . . . . . . . 96
4.5.2 Priors in MLwiN . . . . . . . . . . . . . . . . . . . . . . . 103
5 Gaussian Models 2 - General Models 104
5.1 General N level Gaussian hierarchical linear models . . . . . . . . 104
5.2 Gibbs sampling approach . . . . . . . . . . . . . . . . . . . . . . . 105
5.3 Generalising to N levels . . . . . . . . . . . . . . . . . . . . . . . 110
5.3.1 Algorithm 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.3.2 Computational considerations . . . . . . . . . . . . . . . . 113
5.4 Method 2 : Metropolis Gibbs hybrid method with univariate updates . . 114
5.4.1 Algorithm 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.4.2 Choosing proposal distribution variances . . . . . . . . . . 115
5.4.3 Adaptive Metropolis univariate normal proposals . . . . . 120
5.5 Method 3 : Metropolis Gibbs hybrid method with block updates . 127
5.5.1 Algorithm 3 . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.5.2 Choosing proposal distribution variances . . . . . . . . . . 128
5.5.3 Adaptive multivariate normal proposal distributions . . . 132
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.6.1 Timing considerations . . . . . . . . . . . . . . . . . . . . 138
6 Logistic Regression Models 139
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.2 Multi-level binary response logistic regression models . . . . . . . 140
6.2.1 Metropolis Gibbs hybrid method with univariate updates . 141
6.2.2 Other existing methods . . . . . . . . . . . . . . . . . . . . 143
6.3 Example 1 : Voting intentions dataset . . . . . . . . . . . . . . . 143
6.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.3.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.3.4 Substantive Conclusions . . . . . . . . . . . . . . . . . . . 145
6.3.5 Optimum proposal distributions . . . . . . . . . . . . . . . 146
6.4 Example 2 : Guatemalan child health dataset . . . . . . . . . . . 148
6.4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.4.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.4.3 Original 25 datasets . . . . . . . . . . . . . . . . . . . . . 150
6.4.4 Simulating more datasets . . . . . . . . . . . . . . . . . . . 151
6.4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7 Gaussian Models 3 - Complex Variation at level 1 158
7.1 Model definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.2 Updating methods for a scalar variance . . . . . . . . . . . . . . . 159
7.2.1 Metropolis algorithm for log σ² . . . . . . . . . . . . . . . 160
7.2.2 Hastings algorithm for σ² . . . . . . . . . . . . . . . . . . 160
7.2.3 Example : Normal observations with an unknown variance 161
7.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.3 Updating methods for a variance matrix . . . . . . . . . . . . . . 163
7.3.1 Hastings algorithm with an inverse Wishart proposal . . . 164
7.3.2 Example : Bivariate normal observations with an unknown
variance matrix . . . . . . . . . . . . . . . . . . . . . . . . 164
7.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.4 Applying inverse Wishart updates to complex variation at level 1 166
7.4.1 MCMC algorithm . . . . . . . . . . . . . . . . . . . . . . . 167
7.4.2 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
7.4.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 170
7.5 Method 2 : Using truncated normal Hastings update steps . . . . 171
7.5.1 Update steps at level 1 for JSP example . . . . . . . . . . 171
7.5.2 Proposal distributions . . . . . . . . . . . . . . . . . . . . 176
7.5.3 Example 2 : Non-positive definite and incomplete variance
matrices at level 1 . . . . . . . . . . . . . . . . . . . . . . 176
7.5.4 General algorithm for truncated normal proposal method . 178
7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
8 Conclusions and Further Work 182
8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
8.1.1 MCMC options in the MLwiN package . . . . . . . . . . . 183
8.2 Further work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
8.2.1 Binomial responses . . . . . . . . . . . . . . . . . . . . . . 184
8.2.2 Multinomial models . . . . . . . . . . . . . . . . . . . . . . 186
8.2.3 Poisson responses for count data . . . . . . . . . . . . . . . 186
8.2.4 Extensions to complex variation at level 1 . . . . . . . . . 187
8.2.5 Multivariate response models . . . . . . . . . . . . . . . . 188
List of Figures
2-1 Plot of the regression lines for the four schools in the Borough of
Blackbridge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2-2 Tree diagram for the Borough of Blackbridge. . . . . . . . . . . . 12
3-1 Histogram of θ_1 using the Gibbs sampling method. . . . . . . . . 33
3-2 Kernel density plot of θ_1 using the Gibbs sampling method and a
Gaussian kernel with a large value of the window width h. . . . . 35
3-3 Traces of parameter θ_1 and the running mean of θ_1 for a Metropolis
run that converges after about 50 iterations. Upper solid line in
lower panel is running mean with first 50 iterations discarded. . . 37
3-4 ACF and PACF for parameter θ_1 for a Gibbs sampling run of
length 5000 that is mixing well and a Metropolis run that is not
mixing very well. . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3-5 Kernel density plot of θ_2 using the Gibbs sampling method and a
Gaussian kernel. . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3-6 Plots of the Raftery Lewis N values for various values of σ_p, the
proposal distribution standard deviation. . . . . . . . . . . . . . 50
3-7 Plot of the MCMC diagnostic window in the package MLwiN for
the parameter β_1 from a random slopes regression model. . . . . 52
4-1 Plot of normal prior distributions over the range (−5, 5) with mean
0 and variances 1, 2, 5, 10 and 50 respectively. . . . . . . . . . . . 56
4-2 Plots of biases obtained for the various methods against study
design and parameter settings. . . . . . . . . . . . . . . . . . . . 69
4-3 Trajectories plot of IGLS estimates for run of random slopes
regression model where convergence is not achieved. . . . . . . . 85
4-4 Plots of biases obtained for the various methods fitting the
random slopes regression model against value of Ω_u01 (Fixed effects
parameters and level 1 variance parameter). . . . . . . . . . . . . 94
4-5 Plots of biases obtained for the various methods fitting the random
slopes regression model against value of Ω_u01 (Level 2 variance
parameters). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4-6 Plots of biases obtained for the various methods fitting the
random slopes regression model against study design (Fixed effects
parameters and level 1 variance parameter). . . . . . . . . . . . . 100
4-7 Plots of biases obtained for the various methods fitting the random
slopes regression model against study design (Level 2 variance
parameters). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5-1 Plots of the effect of varying the scale factor for the proposal
variance and hence the Metropolis acceptance rate on the Raftery
Lewis diagnostic for the β_0 parameter in the variance components
model on the JSP dataset. . . . . . . . . . . . . . . . . . . . . . 117
5-2 Plots of the effect of varying the scale factor for the proposal
variance and hence the Metropolis acceptance rate on the Raftery
Lewis diagnostic for the β_0 parameter in the random slopes
regression model on the JSP dataset. . . . . . . . . . . . . . . . . 118
5-3 Plots of the effect of varying the scale factor for the proposal
variance and hence the Metropolis acceptance rate on the Raftery
Lewis diagnostic for the β_1 parameter in the random slopes
regression model on the JSP dataset. . . . . . . . . . . . . . . . . 119
5-4 Plots of the effect of varying the scale factor for the multivariate
normal proposal distribution and hence the Metropolis acceptance
rate on the Raftery Lewis diagnostic for the β_0 parameter in the
random slopes regression model on the JSP dataset. . . . . . . . 130
5-5 Plots of the effect of varying the scale factor for the multivariate
normal proposal distribution and hence the Metropolis acceptance
rate on the Raftery Lewis diagnostic for the β_1 parameter in the
random slopes regression model on the JSP dataset. . . . . . . . 131
6-1 Plot of the effect of varying the scale factor for the univariate
Normal proposal distribution rate on the Raftery Lewis diagnostic
for the σ²_u parameter in the voting intentions dataset. . . . . . 147
6-2 Plots comparing the actual coverage of the four estimation meth-
ods with their nominal coverage for the parameters β_0, β_1 and β_2. 154
6-3 Plots comparing the actual coverage of the four estimation meth-
ods with their nominal coverage for the parameters β_3, σ²_v and σ²_u. 155
7-1 Plots of truncated univariate normal proposal distributions for a
parameter, θ. A is the current value, θ_c, and B is the proposed new
value, θ*. M is max_θ and m is min_θ, the truncation points. The
distributions in (i) and (iii) have mean θ_c, while the distributions
in (ii) and (iv) have mean θ*. . . . . . . . . . . . . . . . . . . . . 175
List of Tables
2.1 Summary of Redhill primary school results from JSP dataset. . . 6
2.2 Parameter estimates for model including Sex and Non-Manual
covariates for Redhill primary school. . . . . . . . . . . . . . . . . 8
2.3 Summary of schools in the borough of Blackbridge. . . . . . . . . 8
2.4 Parameter estimates for ANOVA and ANCOVA models for the
borough of Blackbridge dataset. . . . . . . . . . . . . . . . . . . . 10
2.5 Parameter estimates for two variance components models using
both IGLS and RIGLS for Borough of Blackbridge dataset. . . . . 15
2.6 Parameter estimates for two variance components models using
both IGLS and RIGLS for all schools in the JSP dataset. . . . . . 16
2.7 Comparison between fitted values using the ANOVA model and
the variance components model 1 using RIGLS. . . . . . . . . . . 16
2.8 Parameter estimates for random slopes model using both IGLS
and RIGLS for all schools in the JSP dataset. . . . . . . . . . . . 17
2.9 Comparison between fitted regression lines produced by separate
regressions and the random slopes model. . . . . . . . . . . . . . . 18
2.10 Parameter estimates for the two logistic regression models fitted
to the Blackbridge dataset. . . . . . . . . . . . . . . . . . . . . . . 20
2.11 Parameter estimates for the two-level logistic regression models
fitted to the JSP dataset. . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 Comparison between MCMC methods for fitting a bivariate
normal model with unknown mean vector. . . . . . . . . . . . . . 48
3.2 Comparison between 95% confidence intervals and Bayesian cred-
ible intervals in bivariate normal model. . . . . . . . . . . . . . . 51
4.1 Summary of study designs for variance components model simula-
tion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Summary of times for Gibbs sampling in the variance components
model with different study designs for 50,000 iterations. . . . . . . 65
4.3 Summary of Raftery Lewis convergence times (thousands of
iterations) for various studies. . . . . . . . . . . . . . . . . . . . . 66
4.4 Summary of simulation lengths for Gibbs sampling the variance
components model with different study designs. . . . . . . . . . . 67
4.5 Estimates of relative bias for the variance parameters using
different methods and different studies. True level 2/1 variance
values are 10 and 40. . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.6 Estimates of relative bias for the variance parameters using
different methods and different true values. All runs use study
design 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.7 Comparison of actual coverage percentage values for nominal 90%
and 95% intervals for the fixed effect parameter using different
methods and different studies. True values for the variance
parameters are 10 and 40. Approximate MCSEs are 0.28%/0.15%
for 90%/95% coverage estimates. . . . . . . . . . . . . . . . . . . 72
4.8 Average 90%/95% interval widths for the fixed effect parameter
using different studies. True values for the variance parameters
are 10 and 40. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.9 Comparison of actual coverage percentage values for nominal 90%
and 95% intervals for the fixed effect parameter using different
methods and different true values. All runs use study design
7. Approximate MCSEs are 0.28%/0.15% for 90%/95% coverage
estimates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.10 Average 90%/95% interval widths for the fixed effect parameter
using different true parameter values. All runs use study design 7. 73
4.11 Comparison of actual coverage percentage values for nominal
90% and 95% intervals for the level 2 variance parameter using
different methods and different studies. True values of the variance
parameters are 10 and 40. Approximate MCSEs are 0.28%/0.15%
for 90%/95% coverage estimates. . . . . . . . . . . . . . . . . . . 73
4.12 Average 90%/95% interval widths for the level 2 variance param-
eter using different studies. True values of the variance parameters
are 10 and 40. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.13 Comparison of actual coverage percentage values for nominal 90%
and 95% intervals for the level 2 variance parameter using different
methods and different true values. All runs use study design
7. Approximate MCSEs are 0.28%/0.15% for 90%/95% coverage
estimates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.14 Average 90%/95% interval widths for the level 2 variance par-
ameter using different true parameter values. All runs use study
design 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.15 Comparison of actual coverage percentage values for nominal
90% and 95% intervals for the level 1 variance parameter using
different methods and different studies. True values of the variance
parameters are 10 and 40. Approximate MCSEs are 0.28%/0.15%
for 90%/95% coverage estimates. . . . . . . . . . . . . . . . . . . 74
4.16 Average 90%/95% interval widths for the level 1 variance param-
eter using different studies. True values of the variance parameters
are 10 and 40. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.17 Comparison of actual coverage percentage values for nominal 90%
and 95% intervals for the level 1 variance parameter using different
methods and different true values. All runs use study design
7. Approximate MCSEs are 0.28%/0.15% for 90%/95% coverage
estimates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.18 Average 90%/95% interval widths for the level 1 variance par-
ameter using different true parameter values. All runs use study
design 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.19 Summary of results for the level 2 variance parameter, σ²_u, using
the RIGLS method and inverse gamma intervals. . . . . . . . . . 78
4.20 Summary of the convergence for the random slopes regression
with the maximum likelihood based methods (IGLS/RIGLS). The
study design is given in terms of the number of level 2 units and
whether the study is balanced (B) or unbalanced (U). . . . . . . 86
4.21 Summary of results for the random slopes regression with the 48
schools unbalanced design with parameter values Ω_u00 = 5, Ω_u01 =
0 and Ω_u11 = 0.5. All 1000 runs. . . . . . . . . . . . . . . . . . . 88
4.22 Summary of results for the random slopes regression with the 48
schools unbalanced design with parameter values Ω_u00 = 5, Ω_u01 =
1.4 and Ω_u11 = 0.5. Only 982 runs. . . . . . . . . . . . . . . . . . 90
4.23 Summary of results for the random slopes regression with the 48
schools unbalanced design with parameter values Ω_u00 = 5, Ω_u01 =
−1.4 and Ω_u11 = 0.5. Only 984 runs. . . . . . . . . . . . . . . . . 91
4.24 Summary of results for the random slopes regression with the 48
schools unbalanced design with parameter values Ω_u00 = 5, Ω_u01 =
0.5 and Ω_u11 = 0.5. All 1000 runs. . . . . . . . . . . . . . . . . . 92
4.25 Summary of results for the random slopes regression with the 48
schools unbalanced design with parameter values Ω_u00 = 5, Ω_u01 =
−0.5 and Ω_u11 = 0.5. Only 998 runs. . . . . . . . . . . . . . . . . 93
4.26 Summary of results for the random slopes regression with the 48
schools balanced design with parameter values Ω_u00 = 5, Ω_u01 =
0.0 and Ω_u11 = 0.5. All 1000 runs. . . . . . . . . . . . . . . . . . 97
4.27 Summary of results for the random slopes regression with the 12
schools unbalanced design with parameter values Ω_u00 = 5, Ω_u01 =
0.0 and Ω_u11 = 0.5. Only 877 runs. . . . . . . . . . . . . . . . . . 98
4.28 Summary of results for the random slopes regression with the 12
schools balanced design with parameter values Ω_u00 = 5, Ω_u01 =
0.0 and Ω_u11 = 0.5. Only 990 runs. . . . . . . . . . . . . . . . . . 99
5.1 Optimal scale factors for proposal variances and best acceptance
rates for several models. . . . . . . . . . . . . . . . . . . . . . . . 120
5.2 Demonstration of Adaptive Method 1 for parameters β_0 and β_1
using arbitrary (1.000) starting values. . . . . . . . . . . . . . . . 123
5.3 Comparison of results for the random slopes regression model
on the JSP dataset using uniform priors for the variances, and
different MCMC methods. Each method was run for 50,000
iterations after a burn-in of 500. . . . . . . . . . . . . . . . . . . 125
5.4 Demonstration of Adaptive Method 2 for parameters β_0 and β_1
using arbitrary (1.000) starting values. . . . . . . . . . . . . . . . 127
5.5 Demonstration of Adaptive Method 3 for the β parameter vector
using RIGLS starting values. . . . . . . . . . . . . . . . . . . . . 135
5.6 Comparison of results for the random slopes regression model
on the JSP dataset using uniform priors for the variances, and
different block updating MCMC methods. Each method was run
for 50,000 iterations after a burn-in of 500. . . . . . . . . . . . . 136
6.1 Comparison of results from the quasi-likelihood methods and the
MCMC methods for the voting intention dataset. The MCMC
method is based on a run of 50,000 iterations after a burn-in of
500 and adapting period. . . . . . . . . . . . . . . . . . . . . . . 145
6.2 Optimal scale factors for proposal variances and best acceptance
rates for the voting intentions model. . . . . . . . . . . . . . . . 146
6.3 Summary of results (with Monte Carlo standard errors) for the
first 25 datasets of the Rodriguez Goldman example. . . . . . . . 151
6.4 Summary of results (with Monte Carlo standard errors) for the
Rodriguez Goldman example with 500 generated datasets. . . . . 153
7.1 Comparison between three MCMC methods for a univariate
normal model with unknown variance. . . . . . . . . . . . . . . . 163
7.2 Comparison between two MCMC methods for a bivariate normal
model with unknown variance matrix. . . . . . . . . . . . . . . . . 166
7.3 Comparison between IGLS/RIGLS and MCMC method on a
simulated dataset with the layout of the JSP dataset. . . . . . . . 170
7.4 Comparison between RIGLS and MCMC method 2 on three
models with complex variation fitted to the JSP dataset. . . . . . 177
Chapter 1
Introduction
1.1 Objectives
Multi-level modelling has recently become an increasingly interesting and
applicable statistical tool. Many areas of application fit readily into a multi-level
structure. Goldstein and Spiegelhalter (1996) illustrate the use of multi-level
modelling in two leading application areas, health and education; other
application areas include household surveys and animal growth studies.
Several packages have been written to fit multi-level models. MLn (Rasbash
and Woodhouse (1995)), HLM (Bryk et al. (1988), Bryk and Raudenbush
(1992)), and VARCL (Longford (1987), Longford (1988)) are all packages
which use as their fitting mechanisms maximum likelihood or empirical Bayes
methodology. These methods are used to find estimates of parameters of
interest in complicated models where exact methods would involve intractable
integrations.
Another technique that has come to the forefront of statistical research over
the last decade or so is the use of Markov chain Monte Carlo (MCMC) simulation
methods (Gelfand and Smith 1990). With the increase in computer power, both in
speed of computation and in memory capacity, techniques that were theoretical
ideas thirty years ago are now practical reality. The structure of the multi-level
model, with its interdependence between variables, makes it an ideal area of
application for MCMC techniques. Draper (1995) describes the use of multi-level
modelling in the social sciences and recommends greater use of MCMC methods
in this field.
When MCMC methods were first introduced, if statisticians wanted to fit a
complicated model they would program up their own MCMC sampler for the
problem they were considering and use it to solve that problem. More recently
a general purpose MCMC sampler, BUGS (Spiegelhalter et al. 1995), has been
produced that will fit a wide range of models in many application areas. BUGS
uses a technique called Gibbs sampling to fit its models using an adaptive rejection
algorithm described in Gilks and Wild (1992).
In this thesis I am interested in studying multi-level models, and comparing
the maximum likelihood based methods in the package MLn with MCMC
methods. I will parallel the work of BUGS and consider fitting various families of
multi-level models using both Gibbs sampling and Metropolis-Hastings sampling
methods. I will also consider how the maximum likelihood methods can be used to
give the MCMC methods good starting values and suitable proposal distributions
for Metropolis-Hastings sampling.
The package MLwiN (Goldstein et al. 1998) is the new version of MLn and
some of its new features are a result of the work contained in this thesis. MLwiN
contains for the first time MCMC methodology as well as the existing maximum
likelihood based methods.
1.2 Summary of Thesis
In the next chapter I will discuss some of the background to multi-level modelling
using as an example an educational dataset. I will introduce multi-level modelling
as an extension to linear modelling and explain briefly how the existing maximum
likelihood methods in MLn fit multi-level models.
In Chapter 3 I will consider MCMC simulation techniques and summarise the
main techniques, Metropolis sampling, Gibbs sampling and Hastings sampling.
I will explain how such techniques are used and how to get estimates from the
chains they produce. I will also consider convergence issues when using Markov
chains and motivate all the methods with a simple example.
In Chapter 4 I will consider two very simple multi-level models, the two-
level variance components model, and the random slopes regression model, both
introduced in Chapter 2. I will use these models to illustrate the important issue
of choosing general `diffuse' prior distributions when using MCMC methods. The
chapter will consist of two large simulation experiments to compare and contrast
the IGLS and RIGLS maximum likelihood methods with MCMC methods using
various prior distributions under different scenarios.
In Chapter 5 I will discuss some more general algorithms that will fit N level
Gaussian multi-level models. I will give three algorithms, firstly Gibbs sampling
and then two hybrid Gibbs Metropolis samplers: the first containing univariate
updating steps, and the second block updating steps. For each hybrid sampler I
will also describe an adaptive Metropolis technique to improve its mixing. I will
then compare all the samplers through some simple examples.
In Chapter 6 I will discuss multi-level logistic regression models. I will consider
one of the hybrid samplers introduced in the previous chapter and show how it can
be modified to fit these new models. These models are a family that maximum
likelihood techniques perform particularly badly on. I will therefore compare the
maximum likelihood based methods with the new hybrid sampler via another
simulation experiment.
In Chapter 7 I will introduce a complex variation structure at level 1 as
a generalisation of the Gaussian models introduced in Chapter 5. I will then
implement two Hastings updating techniques for the level 1 variance parameters
that aim to sample from such models. Firstly a technique based on an inverse
Wishart proposal distribution and secondly a technique based on a truncated
normal proposal distribution. I will then compare the results of both methods to
the maximum likelihood methods. In Chapter 8 I will discuss other multi-level
models that have not been fitted in the previous chapters and add some general
conclusions about the thesis as a whole.
Chapter 2
Multi Level Models and MLn
2.1 Introduction
In the introduction I mentioned several applications that contain datasets where a
multi-level structure is appropriate. The package MLn (Rasbash and Woodhouse
1995) was written at the Institute of Education primarily to fit models in the
area of education although it can be used in many of the other applications of
multi-level modelling. In this chapter I intend to consider, through examples,
some statistical problems that arise in the field of education. These problems
will increase in complexity to incorporate multi-level modelling. I will explain
how the maximum likelihood methods in MLn can be used to fit the models as
each new model is introduced. The dataset used in this chapter is the Junior
School Project (JSP) dataset analysed in Woodhouse et al. (1995).
2.1.1 JSP dataset
The JSP is a longitudinal study of approximately 2000 pupils who entered junior
school in 1980. Woodhouse et al. (1995) analyse a subset of the data containing
887 pupils from 48 primary schools taken from the Inner London Education
Authority (ILEA). For each child they consider his/her Maths scores in two tests,
marked out of 40, taken in years 3 and 5, along with other variables that measure
the child's background. I will consider smaller subsets of this subset in the models
considered in this chapter. Any names used in the examples are fictitious, and
are simply used to aid my descriptions.
I will now consider as my first dataset the sample of pupils from one school
participating in the Junior School project, and consider how to statistically
describe information on an individual pupil.
2.2 Analysing Redhill school data
Redhill Primary school is the 5th school in the JSP dataset and the sample of
pupils participating in the JSP has 25 pupils who sat Maths tests in years 3
and 5. I will denote the Maths scores, out of a possible 40, in years 3 and 5
as M3 and M5 respectively. When considering the data from one school, it is
the individual pupils' marks that are of interest. The data for Redhill school are
given in Table 2.1.
The individual pupils and their parents will be interested in how they, or their
children are doing in terms of what marks were achieved, and how these marks
compare with the other pupils in the school. Consider John Smith, pupil 10,
who achieved 30 in both his M3 and M5 test scores, or equivalently 75% in each
test. If this is the only information available then John Smith appears to have
made steady progress in mathematics. If instead the marks for the whole class
are available then each child could be given a ranking to indicate where he/she
finished in the class.
It can now be seen that John Smith ranked equal eighth in the third year test
but only equal eighteenth in the second test, so although he got the same mark
in each test, compared to the rest of the class he has done worse in the second
test. This is because although his marks have stayed constant the mean mark for
the class has risen from 27.8 to 32. This may be because the second test is in fact
comparatively easier than the first test, or the teaching between the two tests
has improved the children's average performance. With only the data given it
is impossible to distinguish between these two reasons for the improved average
mark.
2.2.1 Linear regression
A better way to compare John Smith's two marks is to perform a regression of
the M5 marks on the M3 marks. This will be the first model to be fitted to the
Table 2.1: Summary of Redhill primary school results from JSP dataset.
Pupil   M3   Rank   M5   Rank   M/NM   Sex
1       25   18     25   21     M      M
2       17   25     33   13     M      M
3       33    2     36    5     NM     F
4       25   18     31   17     NM     M
5       23   22     25   21     M      F
6       33    2     37    2     NM     F
7       29   11     23   24     NM     M
8       30    8     34   10     NM     F
9       32    4     36    5     NM     F
10      30    8     30   18     M      M
11      34    1     33   13     NM     F
12      27   16     32   16     NM     M
13      24   20     21   25     M      F
14      32    4     34   10     M      F
15      28   14     38    1     NM     M
16      28   14     36    5     NM     M
17      24   20     33   13     M      F
18      30    8     36    5     M      F
19      32    4     37    2     NM     F
20      29   11     34   10     NM     F
21      29   11     28   20     NM     F
22      22   23     30   18     NM     M
23      22   23     25   21     M      M
24      32    4     37    2     NM     M
25      27   16     36    5     NM     M
dataset and can be written as :
M5_i = β_0 + β_1 M3_i + e_i,

where the e_i are the residuals and are assumed to be distributed normally with
mean zero and variance σ². The linear regression problem is studied in great
detail in basic statistics courses and the least squares estimates are as follows:

β_0 = ȳ − β_1 x̄,

β_1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²,
where in this example, x = M3 and y = M5.
Fitting the above model gives estimates β_0 = 15.96 and β_1 = 0.575, and from
these estimates, expected values and residuals can be calculated. Consequently
John Smith, with his mark of 30 for year 3, would be expected to get 33.22 for
year 5, and the residual for John Smith is then 30 − 33.22 = −3.22. This means
that, under this model, John Smith got 3.22 marks less than would be expected
of the average pupil (at Redhill school) with a mark of 30 in year 3.
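As a quick check, these estimates can be reproduced from the Table 2.1 scores with a few lines of Python (the thesis itself uses MLn rather than code; this snippet is purely illustrative):

```python
import numpy as np

# Maths scores for the 25 Redhill pupils, taken from Table 2.1.
m3 = np.array([25, 17, 33, 25, 23, 33, 29, 30, 32, 30, 34, 27, 24,
               32, 28, 28, 24, 30, 32, 29, 29, 22, 22, 32, 27])
m5 = np.array([25, 33, 36, 31, 25, 37, 23, 34, 36, 30, 33, 32, 21,
               34, 38, 36, 33, 36, 37, 34, 28, 30, 25, 37, 36])

# Least squares estimates for M5_i = beta0 + beta1 * M3_i + e_i.
beta1 = np.sum((m3 - m3.mean()) * (m5 - m5.mean())) / np.sum((m3 - m3.mean()) ** 2)
beta0 = m5.mean() - beta1 * m3.mean()
print(beta0, beta1)                 # approximately 15.96 and 0.575

# Residual for John Smith (pupil 10), who scored 30 in both tests.
print(30 - (beta0 + beta1 * 30))    # approximately -3.22
```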
2.2.2 Linear models
The simple linear regression model is a member of a larger family of models known
as normal linear models (McCullagh and Nelder 1983). Two other covariates,
the sex of each pupil and whether their parents' occupation was manual or non-
manual, were collected. The simple linear regression model can be expanded to
include these two covariates as follows
M5_i = β_0 + β_1 M3_i + β_2 SEX_i + β_3 NONMAN_i + e_i
     = X_i β + e_i,

where SEX_i = 0 for girls, 1 for boys and NONMAN_i = 0 for manual work and
1 for non-manual work.
The formula for the least squares estimates for a normal linear model is similar
to the formula for simple linear regression except with matrices replacing vectors.
The estimate for the parameter vector β is

β = (XᵀX)⁻¹ Xᵀ y.
For our model the least squares estimates are given in Table 2.2.
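A sketch of the same calculation in matrix form, using the M3, sex and manual/non-manual columns read off Table 2.1 (girls coded 0 and manual backgrounds coded 0, as stated above); it should recover the estimates reported in Table 2.2:

```python
import numpy as np

# Redhill data from Table 2.1: M3 score, sex (1 = boy) and
# non-manual background (1 = non-manual) for each of the 25 pupils.
m3 = np.array([25, 17, 33, 25, 23, 33, 29, 30, 32, 30, 34, 27, 24,
               32, 28, 28, 24, 30, 32, 29, 29, 22, 22, 32, 27])
m5 = np.array([25, 33, 36, 31, 25, 37, 23, 34, 36, 30, 33, 32, 21,
               34, 38, 36, 33, 36, 37, 34, 28, 30, 25, 37, 36])
sex    = np.array([1,1,0,1,0,0,1,0,0,1,0,1,0,0,1,1,0,0,0,0,0,1,1,1,1])
nonman = np.array([0,0,1,1,0,1,1,1,1,0,1,1,0,0,1,1,0,0,1,1,1,1,0,1,1])

# Design matrix X with an intercept column, then beta = (X'X)^{-1} X'y.
X = np.column_stack([np.ones(25), m3, sex, nonman])
beta_hat = np.linalg.solve(X.T @ X, X.T @ m5)
print(beta_hat)   # should recover the estimates reported in Table 2.2
```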
From Table 2.2 it can be seen that on average, pupils from a non-manual
background do better than pupils from a manual background and that boys do
slightly better than girls. Considering John Smith again, his expected M5 mark
Table 2.2: Parameter estimates for model including Sex and Non-Manual
covariates for Redhill primary school.

Parameter   Variable     Estimate (SE)
β_0         Intercept    18.09 (7.73)
β_1         M3           0.43 (0.28)
β_2         Sex          0.15 (2.20)
β_3         Non-Manual   2.70 (2.20)
under this model is now 31.27 and so is closer to his actual mark. However, in
this model neither of the additional covariates has an effect that is significantly
different from zero, and so the standard procedure would be to remove them from
the model and revert to the simple linear regression model.
The purpose of reviewing normal linear models is to show how they are related
to hierarchical models. I will now expand the dataset to include all the pupils
sampled in four of the schools in the JSP.
2.3 Analysing data on the four schools in the
borough of Blackbridge
I am now going to consider a new educational situation involving the JSP dataset.
A family has moved into the borough of Blackbridge which has four primary
schools in its area and they want to choose a school for their children who have
sat M3 tests. The four schools I have selected for the fictional borough are
Bluebell (School 2), Redhill (School 5), Greenacres (School 9) and Greyfriars
(School 13). The four schools are summarised in Table 2.3.
Table 2.3: Summary of schools in the borough of Blackbridge.

School        Pupils    M3     M5     Male   NonMan
Name          Sampled   Mean   Mean   (%)    (%)
Bluebell      10        25.1   30.4   50%    30%
Redhill       25        27.9   32.0   48%    64%
Greenacres    21        31.3   31.6   57%    67%
Greyfriars    13        23.8   28.4   62%    0%
All schools   69        27.8   31.0   54%    48%
From the table we can see that Redhill school had the best M5 average
results. Bluebell school had the best average improvement in Maths from years
3 to 5. Greenacres school had the second highest M5 average but had far less
improvement than the other schools. Greyfriars although having the worst M5
average of the four schools had 100% of pupils from a manual background. If the
simple linear regression model regressing M5 on M3 is fitted to the four schools
separately, the resulting regression lines can be seen in Figure 2-1.
Figure 2-1: Plot of the regression lines for the four schools in the Borough ofBlackbridge.
From these regression lines it can be seen that if we were to choose the
best school for a particular child by using their M3 mark then no one school
dominates and our choice would depend on the M3 mark for the particular child.
In comparing the 4 regression lines to consider which school is best, the same
model has effectively been fitted four times, and the results for each school are
only influenced by the other schools through an informal comparison of the four graphs. It would be
better if the data from all four schools could be incorporated in one model. I will
now introduce some more models that attempt to do this.
2.3.1 ANOVA model
The Analysis of Variance (ANOVA) model consists of fitting factor variables to
a single response variable, for example, for the current dataset,

M5_ij = β_0 + SCHOOL_j + e_ij.

One way of fitting the above model is to constrain SCHOOL_4 to be zero to
avoid co-linearity amongst the predictor variables. The parameter estimates are
given in the second column of Table 2.4.
Table 2.4: Parameter estimates for ANOVA and ANCOVA models for the
borough of Blackbridge dataset.

Parameter   ANOVA Est. (SE)   ANCOVA Est. (SE)
β_0         28.38 (1.57)      12.92 (3.28)
β_1         —                 0.65 (0.13)
SCHOOL_1    2.02 (2.38)       1.20 (2.02)
SCHOOL_2    3.62 (1.94)       1.00 (1.72)
SCHOOL_3    3.19 (2.00)       −1.67 (1.94)
From the parameter estimates, expected values for pupils in all four schools
can be found and these are the actual school means. This shows that the ANOVA
model does not actually combine the data for the four schools but instead can be
used to compare the four schools to check if the schools are significantly different.
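To make the constraint concrete, here is a hypothetical sketch of the dummy-coded design matrix for this ANOVA model (pupil ordering and school labels follow Table 2.3, with SCHOOL_4 taken to be Greyfriars, which is consistent with the intercept of 28.38 in Table 2.4 being the Greyfriars mean):

```python
import numpy as np

# Hypothetical school labels for the 69 pupils, ordered as in Table 2.3:
# 1 = Bluebell, 2 = Redhill, 3 = Greenacres, 4 = Greyfriars.
school = np.repeat([1, 2, 3, 4], [10, 25, 21, 13])

# Dummy-coded ANOVA design matrix: an intercept plus indicators for
# schools 1-3 only; SCHOOL_4 is constrained to zero, so the intercept
# beta_0 is the Greyfriars mean and SCHOOL_j is the difference between
# school j and Greyfriars.
X = np.column_stack([np.ones(69)] + [(school == j).astype(float) for j in (1, 2, 3)])
print(X.shape)   # (69, 4)
```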
2.3.2 ANCOVA model
The Analysis of Covariance (ANCOVA) model is a combination of an ANOVA
model and a linear regression model. In its simplest form there are two predictors,
a regression variable and a factor variable, for example, for the current dataset
M5_ij = β_0 + β_1 M3_ij + SCHOOL_j + e_ij.

Again, to fit the model, co-linearity amongst the predictor variables needs to
be avoided and so SCHOOL_4 can be constrained to be zero. The parameter
estimates are given in the third column of Table 2.4. The ANCOVA model fits
parallel regression lines, one for each school, when regressing M5 on M3. Here
the data for one school is dependent on the other schools due to the common
slope parameter β_1. The intercepts for each school are independent given the
common slope, and are simply the least squares estimates for each school's data
assuming the given slope.
2.3.3 Combined regression
An easy way to combine the information from the four schools into one model is
to simply ignore the fact that the children attend different schools. This results
in the following regression model:

M5_i = β_0 + β_1 M3_i + e_i,

for all 69 pupils. Fitting this model gives parameter estimates β_0 = 11.82 and
β_1 = 0.52. While the ANOVA model considers the school means as a one level
dataset, here we consider all 69 pupils as a one level dataset, so neither of these
approaches exploits the structure of the problem, illustrated in Figure 2-2. An
alternative model that aims to �t this structure is illustrated in the next section.
2.4 Two level modelling
To adapt the problem to the structure illustrated in Figure 2-2, in a way that
permits generalising outward from the four schools, I must not only consider the
pupils from a school to be a random sample from that school, but also assume
the schools are a sample from a population of schools. In this way the ANOVA
model can be modified as follows:
Figure 2-2: Tree diagram for the Borough of Blackbridge.
M5_ij = β_0 + School_j + e_ij,
School_j ~ N(0, σ²_s),   e_ij ~ N(0, σ²_e).

Similarly the ANCOVA model becomes

M5_ij = β_0 + β_1 M3_ij + School_j + e_ij,
School_j ~ N(0, σ²_s),   e_ij ~ N(0, σ²_e).
In these models the predictor variables can be split into two parts, the fixed
part and the random part. In the above models the fixed part consists of the β's,
and the product of the β vector and the X matrix is known as the fixed predictor.
These models are known as variance components models because the variance of
the response variable, M5, about the fixed predictor is

var(M5_ij | β) = var(School_j + e_ij) = σ²_s + σ²_e,

the sum of the level 1 (pupil) and level 2 (school) variances. Both models
described above are variance components models but with different fixed
predictors.
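One way to see the two-part structure is to simulate from the variance components model; the sketch below uses illustrative parameter values (not estimates from the JSP data) and the Blackbridge group sizes from Table 2.3:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative values only.
beta0, sigma2_s, sigma2_e = 30.0, 5.0, 40.0
pupils_per_school = [10, 25, 21, 13]          # layout as in Table 2.3

# Level 2: one random effect per school.
school_effects = rng.normal(0.0, np.sqrt(sigma2_s), size=len(pupils_per_school))

# Level 1: pupil-level residuals around each school's level.
m5 = np.concatenate([
    beta0 + u + rng.normal(0.0, np.sqrt(sigma2_e), size=n)
    for u, n in zip(school_effects, pupils_per_school)
])
print(m5.shape)   # (69,) simulated responses
```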
Variance components models cannot be fitted by ordinary least squares as
no closed form least squares solutions exist; instead an alternative approach is
required. There are many such approaches and in this chapter I will consider
the techniques already used in the package MLn for fitting these models. These
techniques are based on iterative procedures to give estimates and are a type
of maximum likelihood estimation. The other multi-level modelling computer
packages use slightly different approaches. HLM (Bryk et al. 1988) uses empirical
Bayes techniques based on the EM algorithm (Dempster, Laird, and Rubin 1977)
while VARCL uses a fast Fisher scoring algorithm (Longford 1987). In the
next chapter I will introduce MCMC techniques that use simulation methods
to calculate estimates, and these methods will be used throughout this thesis.
2.4.1 Iterative generalised least squares
Iterative generalised least squares (IGLS, Goldstein (1986)) is an iterative
procedure based on generalised least squares estimation. Consider the variance
components model that corresponds to the ANCOVA model; here estimates need
to be found for both the fixed effects, β_0 and β_1, and the random parameters σ²_s
and σ²_e. If the values of the two variances are known then the variance matrix,
V, for the response variable can be calculated, and this leads to estimates for the
fixed effects,

β = (XᵀV⁻¹X)⁻¹ XᵀV⁻¹ Y,

where in this example Y_1 = M5_11 and X_1 = (1, M3_11), etc. Considering the earlier
dataset with 69 pupils in 4 schools, the variance matrix V will be 69 by 69
with a block diagonal structure as follows:
V_ij = σ²_s + σ²_e   if i = j,
       σ²_s          if i ≠ j but School[i] = School[j],
       0             otherwise.
From the estimates β, the raw residuals can then be calculated as follows:

r̃_ij = M5_ij − β_0 − β_1 M3_ij.

If the vector of these raw residuals, R̃, is formed into the cross product matrix
R⁺ = R̃ R̃ᵀ, then this matrix has expected value V, the variance matrix defined
above. This can be used to create a linear model with predictors σ²_s and σ²_e as
follows:

R⁺_ij = A_ij σ²_s + B_ij σ²_e + ε_ij,

where A_ij and B_ij take values of 0 or 1, A_ij = 1 when School_i = School_j and
B_ij = 1 if i = j. Using the block diagonal structure of V, vec(R⁺) can be
constructed so that it only contains the elements of R⁺ that do not have expected
value zero. This means in the example with 4 schools, the vector will be of length
10² + 25² + 21² + 13² = 1335, and just these terms need be included in the linear
model above as opposed to all 69² = 4761 terms.
After applying regression to this model, new estimates are obtained for the two
variance parameters σ²_s and σ²_e. These new estimates can now be used instead
of our initial estimates and the whole procedure repeated until convergence is
obtained.
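The whole cycle can be sketched in a few lines of Python; this is a rough illustration of the procedure described above, not the MLn implementation, and the weighting used for the variance regression in the real algorithm is simplified here to ordinary least squares. Here `y` is the response vector, `X` the fixed-part design matrix and `school` an integer array giving each pupil's school.

```python
import numpy as np

def igls(y, X, school, tol=1e-6, max_iter=100):
    """Sketch of the IGLS cycle for a 2-level variance components model."""
    n = len(y)
    A = (school[:, None] == school[None, :]).astype(float)   # A_ij: same school
    B = np.eye(n)                                            # B_ij: same pupil

    # Starting values: ordinary least squares for the fixed effects.
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    sigma2_s, sigma2_e = 1.0, float(np.var(y - X @ beta))

    for _ in range(max_iter):
        old = np.concatenate([beta, [sigma2_s, sigma2_e]])

        # Step 1: given the variances, build V and update beta by GLS.
        V = sigma2_s * A + sigma2_e * B
        Vinv = np.linalg.inv(V)
        beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

        # Step 2: regress the raw residual cross-products on A and B to
        # update the variances, keeping only terms with non-zero expectation.
        r = y - X @ beta
        R = np.outer(r, r)
        keep = A > 0
        Z = np.column_stack([A[keep], B[keep]])
        sigma2_s, sigma2_e = np.linalg.lstsq(Z, R[keep], rcond=None)[0]

        # Convergence: every parameter change within the tolerance.
        new = np.concatenate([beta, [sigma2_s, sigma2_e]])
        if np.max(np.abs(new - old)) < tol:
            break
    return beta, sigma2_s, sigma2_e
```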
Convergence
Determining when the method has converged relies on setting a tolerance level.
Estimates from consecutive iterations are compared and if the difference between
the two estimates is within the tolerance boundaries then convergence of that
parameter is said to have been achieved. The method finishes at the first iteration
at which all parameters have converged. To start, the IGLS procedure requires
initial values, which are normally obtained by finding the ordinary least squares
estimates of the fixed effects. The IGLS method is explained more fully in
Goldstein (1995).
2.4.2 Restricted iterative generalised least squares
The IGLS procedure can produce biased estimates of the random parameters.
This is due to the sampling variation of the fixed parameters which is not
accounted for in this method. A modification to IGLS known as restricted
iterative generalised least squares (RIGLS, see Goldstein (1989)) can be used
to produce unbiased estimates. This modification works by incorporating a bias
correction term into the equation when updating the variance parameters. The
RIGLS method then gives estimates that are equivalent to restricted maximum
likelihood (REML) estimates.
2.4.3 Fitting variance components models to the Blackbridge dataset
The IGLS and RIGLS methods were used to fit the two variance components
models described earlier and the results can be found in Table 2.5. Two important
points can be drawn from this table. Firstly, a deficiency in the IGLS method can
be seen for model 1: the school level variance parameter, σ²_s, has been estimated as
negative and has consequently been set to zero. This odd behaviour occurs when
a variance parameter is very small compared to its standard error and sometimes
in the iterative procedure the maximum likelihood estimate becomes negative.
In this example the behaviour is mainly due to trying to fit a two level model to
a dataset with only four schools.
Table 2.5: Parameter estimates for two variance components models using both
IGLS and RIGLS for Borough of Blackbridge dataset.

Parameter   IGLS           RIGLS          IGLS           RIGLS
Name        Model 1        Model 1        Model 2        Model 2
β_0         30.96 (0.68)   30.87 (0.78)   15.05 (3.02)   14.56 (3.15)
β_1         —              —              0.57 (0.11)    0.59 (0.11)
σ²_s        0.00 (0.00)    0.54 (1.70)    0.03 (0.92)    0.54 (1.33)
σ²_e        32.04 (5.46)   32.12 (5.62)   22.58 (3.95)   22.90 (4.01)
Secondly for both variance components models the variance at school level is
very small, and not significantly different from zero. This means that it may be
better, in this example, with only four schools to stick to single level modelling
and use the combined regression model.
2.4.4 Fitting variance components models to the JSP
dataset
The same two variance components models as considered in the previous section
were fitted to the whole JSP dataset and the results can be seen in Table 2.6.
When all 48 schools are considered, the school level variance, σ²_s, is significant
and so it makes sense to use a two level model. Table 2.7 shows the different
estimates obtained for the four Blackbridge schools using ANOVA and RIGLS.
As described earlier the ANOVA estimates are simply the school means, and
the RIGLS estimates are shrinkage estimates. This means that the RIGLS
estimates are a weighted combination of the mean of all 887 results, 30.57, and
the individual school means.
Table 2.6: Parameter estimates for two variance components models using both
IGLS and RIGLS for all schools in the JSP dataset.

Parameter   IGLS           RIGLS          IGLS           RIGLS
Name        Model 1        Model 1        Model 2        Model 2
β_0         30.60 (0.40)   30.60 (0.40)   15.15 (0.90)   15.14 (0.90)
β_1         —              —              0.61 (0.03)    0.61 (0.03)
σ²_s        5.16 (1.55)    5.32 (1.59)    4.03 (1.18)    4.16 (1.21)
σ²_e        39.28 (1.92)   39.28 (1.92)   28.13 (1.37)   28.16 (1.37)
Table 2.7: Comparison between fitted values using the ANOVA model and the
variance components model 1 using RIGLS.

School       ANOVA   RIGLS
Bluebell     30.40   30.49
Redhill      32.00   31.68
Greenacres   31.57   31.32
Greyfriars   28.38   29.19
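The shrinkage in Table 2.7 can be reproduced with the usual weighted-combination formula for a variance components model (the formula itself is not written out in the text; it is shown here, with the RIGLS Model 1 estimates from Table 2.6, purely as a check):

```python
# Shrunken school mean = beta0 + w_j * (school mean - beta0), where
# w_j = n_j * sigma2_s / (n_j * sigma2_s + sigma2_e).
beta0, sigma2_s, sigma2_e = 30.60, 5.32, 39.28   # RIGLS Model 1, Table 2.6

schools = {"Bluebell": (10, 30.40), "Redhill": (25, 32.00),
           "Greenacres": (21, 31.57), "Greyfriars": (13, 28.38)}
for name, (n, mean) in schools.items():
    w = n * sigma2_s / (n * sigma2_s + sigma2_e)
    print(name, round(beta0 + w * (mean - beta0), 2))
# Prints 30.48, 31.68, 31.32 and 29.18 -- matching Table 2.7 to rounding.
```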
There are analogous results for the ANCOVA model and the second variance
components model. Here the ANCOVA model fits the least squares regression
intercepts given the fixed slope, and the RIGLS method will shrink these parallel
lines towards a global average line, given the fixed slope.
2.4.5 Random slopes model
The two level models considered so far have all been variance components models,
that is the random variance structure at each level is simply a constant variance.
Earlier we considered fitting separate regression lines to each school and this
model can also be expanded into a two level model known as a random slopes
regression,
M5_ij = β_0 + β_1 M3_ij + u_0j + u_1j M3_ij + e_ij,

u_j = (u_0j, u_1j)ᵀ ~ MVN(0, Ω_s),   e_ij ~ N(0, σ²_e).
In this notation, the βs are fixed effects and represent an average regression
for all schools. The u_js are the school level residuals and Ω_s is the school level
variance, which is now a matrix as the slopes also vary at the school level.
As the earlier variance components models suggested that the dataset with four
schools should not be treated as a two level model, we will only consider fitting
the random slopes model to the full JSP dataset with 48 schools, as shown in
Table 2.8.
Table 2.8: Parameter estimates for random slopes model using both IGLS and
RIGLS for all schools in the JSP dataset.

Parameter   IGLS            RIGLS
β_0         15.04 (1.32)    15.03 (1.34)
β_1         0.612 (0.04)    0.613 (0.04)
Ω_s00       44.99 (16.37)   47.04 (16.75)
Ω_s01       −1.23 (0.52)    −1.30 (0.53)
Ω_s11       0.034 (0.017)   0.036 (0.017)
σ²_e        26.97 (1.34)    26.96 (1.34)
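As with the variance components model, simulating from the random slopes model makes the role of Ω_s concrete; a minimal sketch, using the RIGLS point estimates from Table 2.8 only as illustrative values and a made-up set of M3 scores for a single hypothetical school:

```python
import numpy as np

rng = np.random.default_rng(2)

# RIGLS estimates from Table 2.8, used here only as illustrative values.
beta = np.array([15.03, 0.613])
omega_s = np.array([[47.04, -1.30],
                    [-1.30,  0.036]])
sigma2_e = 26.96

m3 = rng.integers(10, 41, size=30)                        # hypothetical M3 scores
u0, u1 = rng.multivariate_normal([0.0, 0.0], omega_s)     # school's random intercept and slope
m5 = (beta[0] + u0) + (beta[1] + u1) * m3 + rng.normal(0, np.sqrt(sigma2_e), size=30)
```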
Table 2.9 gives as a comparison the regression lines fitted using separate
regressions and the random slopes two level model, for the four schools in
Blackbridge. Here shrinkage can be seen towards the average line
M5 = 15.03 + 0.613 M3.
Table 2.9: Comparison between fitted regression lines produced by separate
regressions and the random slopes model.

School       Separate Regression   Random Slopes Regression
Bluebell     22.44 + 0.317 M3      17.16 + 0.548 M3
Redhill      15.96 + 0.575 M3      15.06 + 0.610 M3
Greenacres   −7.40 + 1.244 M3      5.28 + 0.868 M3
Greyfriars   12.52 + 0.665 M3      12.56 + 0.678 M3
Both the variance components model and the random slopes regression will
be discussed in greater detail in Chapter 4.
2.5 Fitting models to pass/fail data
Often when considering exam datasets the main objective is to classify students
into grade bands, or at the simplest level, to pass or fail pupils. In this thesis I
will only be studying the simpler case where the response variable can be thought
of as a 0 or a 1, depending on whether the event of interest occurs or not. The
interest now lies in estimating the probability of the response variable being a 1
rather than the value of the response variable, as this is constrained to be either
0 or 1.
The normal linear models used in the earlier sections can still be used when
the response is binary, but they assume that the response variable can take any
real value, which can lead to probability estimates that lie outside [0, 1]. The
more common approach to fitting binary data is to assume that the response
variable has a binomial distribution. The resulting model can then be fitted
by the technique of generalized linear models, also described in McCullagh and
Nelder (1983). A link function g(·) is required to transform the response variable
from the [0, 1] scale to the whole real line. I will only consider the logistic link,
log(p_ij/(1 − p_ij)), as it is the most frequently used link function for binary data,
although there are other alternatives. The model then becomes g(p) = Xβ,
where Xβ is the linear predictor. When a binary response model is fitted using
the logistic link, the technique is known as logistic regression.
Although the response variable is a binary outcome the predictor variables
can still be continuous variables. For example if a student is due to take an exam
that will be externally examined the result he/she obtains will generally be a
grade and the exact mark will not be given. However he/she will typically be
given a mock exam at some time before the exam and the exact mark of this
mock will be known. I will try to mimic this scenario by using the 4 schools
considered in the earlier examples and assuming that the predictor M3 score is
known. However, the response of interest, M5, will now be converted to a pass/fail
response, Mp5, depending on whether the student gets at least 30 out of 40 in
the test. Of the 69 students in the 4 schools, 46 got at least 30 on the year 5
test, so two-thirds of the students actually pass.
Interest will now lie in whether the school of each pupil has an effect and
whether the `mock' exam mark, M3, is a good indicator of whether a student will
pass or fail. I will consider the following two models that are analogous to the
ANOVA and ANCOVA models for continuous responses.
Mp5_ij ~ Bernoulli(p_ij)
log( p_ij / (1 − p_ij) ) = β₀ + SCHOOL_j                     (2.1)
log( p_ij / (1 − p_ij) ) = β₀ + β₁ M3_ij + SCHOOL_j          (2.2)
The estimates obtained by fitting these models are given in Table 2.10. The
parameter estimates obtained can be transformed back to give estimates for the
p_ij:

p_ij = exp(X_ij β) / (1 + exp(X_ij β)).
Looking more closely at model 2.1 and using the transform, the estimated
probability of passing for a pupil in school j is equal to the proportion of pupils
passing in school j. If we consider once again John Smith, who got 30 in his year
2 maths exam, model 2.2 implies that a pupil from Redhill school with 30
in the M3 exam has an estimated 85.8% chance of passing. In fact, John Smith
just scraped a pass, scoring 30 in year 5.
Table 2.10: Parameter estimates for the two logistic regression models fitted to the Blackbridge dataset.

Parameter   Model 2.1   Model 2.2
β₀          -0.1541     -5.3627
β₁          -           0.2176
SCHOOL1     0.1541      -0.1727
SCHOOL2     1.3068      0.6306
SCHOOL3     1.3173      -0.1647
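As a quick check of this back-transformation, the short Python sketch below (hypothetical variable names; it assumes that SCHOOL2 in Table 2.10 is the Redhill contrast) reproduces the 85.8% figure quoted above from the model 2.2 estimates.

import math

def inv_logit(x):
    # back-transform from the logit scale to a probability
    return math.exp(x) / (1.0 + math.exp(x))

# Model 2.2 estimates from Table 2.10 (assumption: SCHOOL2 is the Redhill contrast)
beta0, beta1, redhill = -5.3627, 0.2176, 0.6306

eta = beta0 + beta1 * 30 + redhill      # linear predictor for a Redhill pupil with M3 = 30
print(round(inv_logit(eta), 3))         # ~0.858, i.e. an estimated 85.8% chance of passing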
2.5.1 Extending to multi-level modelling
The above models that have been fitted using logistic regression techniques are
single level models, but in an analogous way to the earlier Gaussian models these
models can be extended to two or more levels. As with the Gaussian models
there are iterative procedures to fit the multi-level logistic models, and MLn uses
two such methods.
Both techniques are quasi-likelihood based methods which can be used on all
non-linear multi-level models. They use a Taylor series expansion to linearize
the model. The first technique is Marginal Quasi-likelihood (MQL) proposed by
Goldstein (1991). MQL tends to underestimate fixed and random parameters,
particularly with small datasets. The second technique is Predictive or Penalised
Quasi-likelihood (PQL) proposed by Laird (1978) and also Stiratelli, Laird, and
Ware (1984). PQL is more accurate than MQL but is not guaranteed to converge.
The distinction between the methods is that when forming the Taylor expansion,
in order to linearize the model, the higher level residuals are added to the linear
component of the nonlinear function in PQL, and not in MQL. The order of the
estimation procedure is the order of the Taylor series expansion.
I will now consider fitting several two level logistic regression models to the
JSP dataset with 48 schools, with the response variable being a pass/fail indicator
for year 5 scores as before. The models are described below and the parameter
estimates obtained are given in Table 2.11.
Mp5_ij ~ Bernoulli(p_ij)
log( p_ij / (1 − p_ij) ) = β₀ + SCHOOL_j                     (2.3)
SCHOOL_j ~ N(0, σ²_s)

Mp5_ij ~ Bernoulli(p_ij)
log( p_ij / (1 − p_ij) ) = β₀ + β₁ M3_ij + SCHOOL_j          (2.4)
SCHOOL_j ~ N(0, σ²_s)
Table 2.11: Parameter estimates for the two-level logistic regression models fitted to the JSP dataset.

Parameter   Model 2.3 (MQL 1)   Model 2.3 (PQL 2)   Model 2.4 (MQL 1)   Model 2.4 (PQL 2)
β₀          0.680 (0.120)       0.769 (0.136)       -3.703 (0.420)      -4.262 (0.464)
β₁          -                   -                   0.177 (0.016)       0.205 (0.018)
σ²_s        0.404 (0.138)       0.570 (0.179)       0.633 (0.198)       0.874 (0.258)
From Table 2.11 we can see that the estimates using PQL are larger for both
the random and fixed parameters. There is a significant positive effect of M3
score on the probability of passing, and there is significant variability between the
different schools using both MQL and PQL.
2.6 Summary
The multi-level logistic regression model ends our discussion of models that can
be fitted to the JSP dataset. I will return to problems involving binary data in
Chapter 6, where the MQL and PQL methods described here will be compared
to MCMC alternatives.
In this chapter I have highlighted some examples of simple models that arise
when using data from education. I have then shown how the existing methods
in the MLn package can fit such models. In the next chapter I will introduce
MCMC methods, which will then be used to fit the types of models introduced
here in later chapters.
Chapter 3
Markov Chain Monte Carlo
Methods
In the previous chapter I concentrated on maximum likelihood based techniques
for fitting multi-level models. The field of Bayesian statistics has grown
in importance as computer power has increased, and techniques that would
previously have been impossible to implement can now be performed efficiently.
MCMC methods are one group of Bayesian methods that can be used to fit
multi-level models. In this chapter I will describe the various MCMC methods in
common usage today and how they work. I will explain how to use the chains that
the methods produce to get answers to problems and how to tell when a chain
has reached its equilibrium distribution. I will then illustrate all these points in
a simple example. I will begin by explaining why such methods are used.
3.1 Background
Consider a sequence of n observations, y_i, that have been generated from a normal
distribution with unknown mean and variance, so that y_i ~ N(μ, σ²). Then μ
and σ² have standard (unbiased) estimates given the observations y_i:

μ̂ = ȳ = (1/n) Σ_{i=1}^{n} y_i   and   σ̂² = (1/(n−1)) Σ_{i=1}^{n} (y_i − ȳ)².
Consider instead the situation in reverse: if μ and σ² were known, I could
generate a sample of observations from the normal distribution by simulation,
see for example Box and Muller (1958). Then if I drew a large enough sample,
the mean and variance of the sample should be approximately equal to the mean
and variance of the underlying distribution.
Similarly consider a gamma distribution with parameters α and β. If, after
finding a suitable simulation algorithm (see Ripley (1987)), a large sample from
the gamma distribution is drawn, then it can be verified that the mean of the
distribution is α/β and the variance is α/β². Both of these examples are trivial, but
given any parameter of interest, if I can simulate from its distribution for long
enough I can calculate estimates for the parameter and any functionals of the
parameter.
Multi-level models are much more complicated than these two examples and
it is rare that samples from a parameter's distribution can be obtained directly
from simulation. The difference between multi-level models and our simple
examples is that the parameter of interest will depend on several other parameters
with unknown values. Bayesian estimation methods involve integrating over
these other parameters, but this becomes infeasible as the model becomes more
complicated. The methods detailed in the last chapter involve the use of an
iterative procedure that leads to an approximation to the actual parameter value.
The methods in this chapter involve the generation, via simulation, of
Markov chains that will, given time, converge to the posterior distribution of the
parameter of interest. Before going on to describe the various MCMC techniques
I first need to cover some of the basic ideas of Bayesian inference.
3.2 Bayesian inference
In frequentist inference the data, whose distribution across hypothetical repetitions
of data-gathering is assumed to depend on a parameter vector θ, are
regarded as random with θ fixed. In Bayesian inference the data are regarded
as fixed (at their observed values) and θ is treated as random as a means
of quantifying uncertainty about it. In this formulation θ possesses a prior
distribution and a posterior distribution linked by Bayes' theorem. The posterior
distribution of θ is defined from Bayes' theorem as

p(θ | data) ∝ p(data | θ) p(θ).

Here p(θ) is the prior distribution for the parameter vector θ and should
represent all knowledge we have about θ prior to obtaining the data. Prior
distributions can be split into two types: informative prior distributions, which
contain information that has been obtained from previous experiments, and "non-informative"
or diffuse priors, which aim to express that we have little or no prior
knowledge about the parameter. In frequentist inference prior distributions are
not used in fitting models, and so "non-informative" priors are widely used in
Bayesian inference to allow comparison with the frequentist procedures.
The posterior distribution for θ is therefore proportional to the likelihood
p(data | θ) multiplied by the prior distribution. The proportionality constant
is such that the posterior distribution is a valid probability distribution. One
principal difficulty in Bayesian inference problems is calculating the proportionality
constant, as this involves integration and does not always produce a
posterior distribution that can be written in closed form. Also, finding marginal
posterior and predictive distributions involves (high-dimensional) integrations,
which MCMC methods can instead perform by simulation.
How do MCMC methods fit in?
MCMC methods generate samples from Markov chains which converge to
the posterior distribution of θ, without having to calculate the constant of
proportionality. From these samples, summary statistics of the posterior
distribution can be calculated.
3.3 Metropolis sampling
The Metropolis algorithm was first described in Metropolis et al. (1953) in the
field of statistical mechanics. The idea is to generate values of θ, the parameter
of interest, from a proposal distribution and correct these values so that the draws
are actually simulating from the posterior distribution p(θ | data). The proposal
distribution is generally dependent on the last value of θ drawn but independent
of all other previous values of θ, to obey the Markov property. The method
works by generating new values at each time step from the current proposal
distribution but only accepting the values if they pass a criterion. In this way
the estimates of θ are improved at each time step and the Markov chain reaches
its equilibrium or stationary distribution, which is the posterior distribution of
interest by construction.
The Metropolis algorithm for an unknown parameter θ is as follows:

• Select a starting value for θ which is feasible.

• For each time step t, sample a point θ* from the current proposal distribution
p_t(θ* | θ_{t−1}). The proposal distribution must be symmetric in θ* and θ_{t−1},
that is, p_t(θ* | θ_{t−1}) = p_t(θ_{t−1} | θ*) for all t.

• Let r_t = p(θ* | y) / p(θ_{t−1} | y) be the posterior ratio and a_t = min(1, r_t) be the
acceptance probability.

• Accept the new value θ_t = θ* with probability a_t; otherwise let θ_t = θ_{t−1}.

In multi-level models there are many parameters of interest and the above
algorithm can be used in several ways. Firstly, θ could be considered as a vector
containing all the parameters of interest and a multivariate proposal distribution
could be used. Secondly, the above algorithm could be used separately for each
unknown parameter θ_i. If this is done, it is generally done sequentially: that
is, at step t generate a new θ_{1,t}, then a new θ_{2,t}, and so on until all parameters
have been updated, then continue with step t+1. Thirdly, a combination method
could be used where parameters are updated in suitable blocks, some multivariately and some
univariately.
3.3.1 Proposal distributions
To perform Metropolis sampling, symmetric proposal distributions must be chosen
for all parameters. There are at least two distinct types of proposal distribution
that can be used.
Firstly, the simplest type of proposal is the independence proposal. This
proposal generates a new value from the same proposal distribution regardless of
the current value. If a parameter is restricted to a range of values, for example a
correlation parameter must lie in the range [−1, 1], then an independence proposal
could consist of generating a new value from a uniform [−1, 1] distribution.
Independence proposals are somewhat limited in that if parameters are defined
on the whole real line, then it is difficult, if not impossible, to find a proposal distribution that will
sample from the whole range of the parameter in an effective manner.
A second type of proposal that is popular is the random-walk proposal.
Here p_t(θ | θ_{t−1}) = p_t(θ − θ_{t−1}), so the proposal is centred at the current
value of the parameter, θ_{t−1}. Common examples are uniform and normal
distributions centred at the current parameter value.
Both these proposal distributions will then have a free parameter: in the
case of the uniform, the width of the interval, and in the normal, the variance of
the distribution. The values given to these parameters will affect how well the
simulation performs. If the variance parameter is too small then the sampler will
end up making lots of little jumps and will take a long time to reach all parts of the
sample space. If the variance is too big there will be a lower acceptance rate and
the sampler will end up staying at particular parameter values for long periods,
and again the chain will take a long time to give good estimates. Convergence
rates will be dealt with later in this chapter. The common method used for
selecting parameter values for proposal distributions is to try several values for
the variance until the chain "mixes well". Adaptive methods which modify the
proposal distribution to improve convergence are also discussed in later chapters.
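To make the tuning of the proposal variance concrete, here is a minimal random-walk Metropolis sketch in Python for a single parameter; log_post is a hypothetical user-supplied log-posterior function and sigma_p is the free proposal standard deviation discussed above.

import numpy as np

def metropolis(log_post, theta0, sigma_p, n_iter, seed=0):
    # Random-walk Metropolis with a symmetric N(theta_{t-1}, sigma_p^2) proposal
    rng = np.random.default_rng(seed)
    theta, chain, accepted = theta0, np.empty(n_iter), 0
    for t in range(n_iter):
        prop = rng.normal(theta, sigma_p)           # propose a move
        log_r = log_post(prop) - log_post(theta)    # log posterior ratio
        if np.log(rng.uniform()) < log_r:           # accept with probability min(1, r)
            theta, accepted = prop, accepted + 1
        chain[t] = theta
    return chain, accepted / n_iter

# Example: a standard normal target; the acceptance rate depends strongly on sigma_p
chain, acc_rate = metropolis(lambda x: -0.5 * x**2, 0.0, 2.4, 5000)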
3.4 Metropolis-Hastings sampling
Hastings (1970) generalized the Metropolis algorithm to allow proposal distributions
that are not symmetric. To correct for this, the ratio of the posterior
probabilities r_t is now replaced by a ratio of importance ratios:

r_t = [ p(θ* | y) / p_t(θ* | θ_{t−1}) ] / [ p(θ_{t−1} | y) / p_t(θ_{t−1} | θ*) ].

The Metropolis algorithm is just a special case of this algorithm where
p_t(θ* | θ_{t−1}) = p_t(θ_{t−1} | θ*) and these terms cancel out.
In later chapters it will be seen that it is not always easy to find a symmetric
proposal distribution for parameters with restricted ranges, for example variances.
Also, asymmetric proposal distributions may sometimes assist in increasing the
rate of convergence. As with the Metropolis algorithm, the Metropolis-Hastings
algorithm has proposal distributions with parameters that can be modified to
speed up convergence.
3.5 Gibbs sampling
The Gibbs sampler is a special case of the Metropolis-Hastings algorithm. Geman
and Geman (1984) used an approach in their work on image analysis based on
the Gibbs distribution; they consequently named this method Gibbs sampling.
Gelfand and Smith (1990) applied the Gibbs sampler to several statistical
problems, bringing it to the attention of the statistical community. The Gibbs
sampler is best applied on problems where the marginal distributions of the
parameters of interest are difficult to calculate, but the conditional distributions
of each parameter given all the other parameters and the data have nice forms.
For example, suppose the marginal posterior p(θ | y) cannot be obtained from
the joint posterior p(θ, z | y) analytically, but that the conditional posteriors
p(θ | y, z) and p(z | y, θ) have forms that are known and are easy to sample
from, for example normal or gamma distributions. Gibbs sampling can then be
used to sample indirectly from the marginal posterior.
The Gibbs sampler works on the above problem as follows: firstly choose a
starting value for z, say z⁽⁰⁾, and then generate via random sampling a single
value θ⁽¹⁾ from the conditional distribution p(θ | y, z = z⁽⁰⁾). Next generate z⁽¹⁾
from the conditional distribution p(z | y, θ = θ⁽¹⁾). Then cycle through
the algorithm, generating θ⁽²⁾ and z⁽²⁾ and so on.
If the conditional distributions of the parameters have standard forms, then
they can be simulated from easily. If this is not the case and a conditional
distribution does not have a standard form, then a different method must be
used. Two such methods will now be described.
3.5.1 Rejection sampling
Rejection sampling is described in Ripley (1987). It is used when the distribution
of interest f(x) cannot be easily sampled from but there exists a distribution
g(x) such that f(x) < M g(x) for all x, where M is a positive number, and g(x) can be
sampled from without difficulty. M g(x) can be thought of as an envelope function
that completely bounds the required distribution from above. To generate a value
from f(x) use the following algorithm.

Repeat
    Generate Y from g,
    Generate U from U(0,1),
until U < f(Y) / (M g(Y)).
Return X = Y.

The efficiency of this method depends on the enveloping function g(x). There
are two major aims: firstly to find a function that satisfies f(x) < M g(x) for all x, and
secondly to find a g(x) similar enough to f(x) to give a high acceptance rate.
The second method, which will now be discussed briefly, tries to automatically
satisfy both these aims.
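A minimal Python sketch of this algorithm, assuming densities f and g that can be evaluated pointwise, a way of sampling from g, and a known bound M with f(x) < M g(x) for all x; the half-normal/exponential pair used as an example is my own illustration rather than one from the text.

import numpy as np

rng = np.random.default_rng(1)

def rejection_sample(f, g_pdf, g_draw, M):
    # Repeat until a candidate from the envelope M * g is accepted
    while True:
        y = g_draw()
        if rng.uniform() < f(y) / (M * g_pdf(y)):
            return y

# Example: half-normal target under a unit-rate exponential envelope;
# sup_x f(x)/g(x) = sqrt(2/pi) * exp(1/2), so this M bounds the ratio everywhere.
f = lambda x: np.sqrt(2 / np.pi) * np.exp(-0.5 * x**2)
g_pdf = lambda x: np.exp(-x)
M = np.sqrt(2 / np.pi) * np.exp(0.5)
draws = [rejection_sample(f, g_pdf, lambda: rng.exponential(), M) for _ in range(1000)]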
3.5.2 Adaptive rejection sampling
Adaptive rejection sampling (Gilks and Wild 1992) works when the conditional
distribution of interest is log concave. It starts by considering a small number of
points from the distribution of interest f and evaluating the tangents to log f at
these points. Joining up these tangents constructs an envelope function
g for f. Then proceed as in rejection sampling, except that when a point x_g is
chosen from g, as well as evaluating f(x_g), also evaluate the tangent to log f at x_g
and modify the envelope accordingly. Then as more points are sampled, g(x)
becomes more and more like f(x); hence the rejection sampling is adaptive.
3.5.3 Gibbs sampler as a special case of the Metropolis-Hastings algorithm

Looking at the Gibbs sampling algorithm as written above, it is not immediately
obvious that the Gibbs sampler is a special case of the Metropolis-Hastings
algorithm. Consider the proposal distribution for a particular element of θ, θ_(i),
defined as

p_t(θ* | θ_{t−1}) = p(θ*_(i) | θ_{t−1,(−i)}, y)   if θ*_(−i) = θ_{t−1,(−i)},
p_t(θ* | θ_{t−1}) = 0                              otherwise,

where θ_(−i) is the vector θ with element i removed. In other words, the only
possible proposals involve holding all components of θ constant except the ith.
Then the ratio r_t of importance ratios is

r_t = [ p(θ* | y) / p_t(θ* | θ_{t−1}) ] / [ p(θ_{t−1} | y) / p_t(θ_{t−1} | θ*) ]
    = [ p(θ* | y) / p(θ*_(i) | θ_{t−1,(−i)}, y) ] / [ p(θ_{t−1} | y) / p(θ_{t−1,(i)} | θ_{t−1,(−i)}, y) ]
    = p(θ_{t−1,(−i)} | y) / p(θ_{t−1,(−i)} | y) = 1,

and all proposals are accepted. This is the same Gibbs sampler algorithm as
described earlier, but this time written in the Metropolis-Hastings format.
3.6 Data summaries
Once a Markov chain has been run the outputs produced are sequences of values,
one sequence for each parameter, assumed to be from the desired joint posterior
distribution. Each individual sequence can be thought of as a sample from the
marginal distribution of the individual parameter. From each sequence of values
we hope to describe the parameter it represents via summary statistics.
In the last chapter I reviewed the IGLS and RIGLS methods for multi-level
modelling. For each parameter of interest θ these methods calculate
a maximum likelihood based estimate θ̂ for θ and a standard error for this
estimate. If confidence intervals are required then, if you are prepared to assume
the parameter is normally distributed, you can generate central 95% confidence
intervals for θ as (θ̂ − 1.96 SE(θ̂), θ̂ + 1.96 SE(θ̂)).
The same summary statistics can also be calculated from Markov chains, and
the chains can produce other summaries for the parameter as well. We will now describe how to
calculate the various summary statistics from Markov chains.
3.6.1 Measures of location
There are three main estimates that can be used for the parameter of interest, θ.

1. Sample mean

If I consider the chain values as a sample from the posterior distribution of θ,
then I can calculate their mean in the usual way:

θ̄ = (1/N) Σ_{i=1}^{N} θ_i.

2. Sample median

The median can be found by finding the (N+1)/2 th sorted chain value. Computationally
it is quicker to calculate the median via a `binary chop' algorithm rather than
actually sorting the chain. The `binary chop' algorithm consists of taking the first
(unsorted) chain value and dividing the other values into two groups depending
on whether they are larger or smaller than this first value. Then, depending on
the number of values bigger than the first value, the median will be in one of
the two groups. Discard the group that does not contain the median and repeat
the procedure recursively on the other group until the median is found. On average
this requires of order N comparisons, as opposed to the order N² comparisons
that a simple sort would need.
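The `binary chop' idea is essentially the quickselect algorithm; a small sketch in Python, assuming the chain is held in a list (in practice a library median routine would normally be used).

def kth_smallest(values, k):
    # Partition around the first value and recurse into the group containing the answer
    pivot, rest = values[0], values[1:]
    smaller = [v for v in rest if v < pivot]
    larger = [v for v in rest if v >= pivot]
    if k < len(smaller):
        return kth_smallest(smaller, k)
    if k == len(smaller):
        return pivot
    return kth_smallest(larger, k - len(smaller) - 1)

def chain_median(chain):
    n = len(chain)
    if n % 2:
        return kth_smallest(chain, n // 2)
    return 0.5 * (kth_smallest(chain, n // 2 - 1) + kth_smallest(chain, n // 2))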
3. Sample mode
This statistic is equivalent to the estimate given by the IGLS and RIGLS methods
when the prior distribution is flat, and is also known as the maximum likelihood
estimate (MLE) in that case. It is not calculated directly from the Markov chain
but is instead calculated from the kernel density plot described later.
3.6.2 Measures of spread
There are two main groups of summary statistics for the spread of a set of data.
1. Variance and standard deviation.

The variance and the standard deviation of the data are both summary statistics
associated with the mean. In a similar way to the mean formula, considering the
chain values as a sample from the posterior distribution of θ, the variance
has the usual formula

var(θ) = (1/(N−1)) [ Σ_{i=1}^{N} θ_i² − (Σ_{i=1}^{N} θ_i)² / N ].

The standard deviation is the square root of the variance.

2. Quantile based estimates.

There are several measures of spread that are calculated from the quantiles of a
distribution. In Bayesian statistics confidence intervals are replaced by credible
intervals, which have a different interpretation to the frequentist confidence
interval. A frequentist 100(1 − α)% confidence interval for θ is defined as an
interval calculated from the data such that 100(1 − α)% of such intervals contain θ.
In Bayesian statistics the data are thought of as fixed and the parameter θ as variable,
and so a 100(1 − α)% credible interval C is such that ∫_C p(θ | data) dθ = 1 − α
(Bernardo and Smith 1994). The quantiles are used to produce credible intervals;
for example a 95% central Bayesian credible interval is (Q_0.025, Q_0.975), where Q_i
is the ith quantile. The interquartile range, Q_0.75 − Q_0.25, can also be calculated
from the quantiles and is an alternative measure of spread. The `binary chop'
algorithm used for the median can also be used to calculate the quantiles rather
than sorting.
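From a stored chain these quantile-based summaries are immediate to compute; a minimal sketch, with a simulated chain standing in for real MCMC output.

import numpy as np

chain = np.random.default_rng(2).normal(size=5000)      # stand-in for a stored MCMC chain

q025, q25, q75, q975 = np.percentile(chain, [2.5, 25, 75, 97.5])
print("95% central credible interval:", (q025, q975))
print("interquartile range:", q75 - q25)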
3.6.3 Plots
Given that the sequence of values obtained for a parameter θ can be thought of
as a sample of n points from the marginal posterior distribution of θ, I can
use plots to show the shape of this distribution.
The simplest density plot is the histogram. Given the range of the parameter
values in the sequence, the range can be split into M contiguous intervals, not
necessarily of the same length, which are commonly known as bins. The numbers
of values that fall in each bin are then counted and the histogram estimate at a
point θ is defined by

p̂(θ) = (number of θ_i in the same bin as θ) / (n × width of the bin containing θ).

An example of a histogram for the parameter μ₁ from the example later in this
chapter can be seen in Figure 3-1. Histograms give a rather `blocky' approximation
to the posterior distribution of interest. The approximation is improved, up to
a point, by increasing the number of bins M, but this also depends on the number
of points n being large.

Figure 3-1: Histogram of μ₁ using the Gibbs sampling method.

The kernel density estimator improves on the histogram by giving a smoother
estimate of the posterior distribution. The histogram can be thought of as
taking each point θ_i and spreading its contribution to the posterior distribution
uniformly over the bin containing θ_i. The kernel estimator on the other hand
spreads the contribution of each point according to a kernel function K around
the point, where K satisfies

∫ K(θ) dθ = 1   (integrating over the whole real line).

The kernel estimator with kernel K can then be defined by

p̂(θ) = (1/(nh)) Σ_{i=1}^{n} K( (θ − θ_i) / h ),

where h is a parameter known as the window width, which governs the smoothness
of the estimate. For a more detailed description of choosing the kernel function
K and the window width see Silverman (1986). An example of a kernel density
plot for the same data as the earlier histogram can be seen in Figure 3-2.
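The kernel estimator above translates directly into code; a sketch with a Gaussian kernel, where the rule-of-thumb default for h is only an illustrative choice and not the one used in the thesis.

import numpy as np

def kde(chain, grid, h=None):
    # Gaussian kernel density estimate of the marginal posterior at each grid point
    chain, grid = np.asarray(chain), np.asarray(grid)
    n = len(chain)
    if h is None:
        h = 1.06 * chain.std() * n ** (-0.2)          # Silverman's rule-of-thumb window width
    u = (grid[:, None] - chain[None, :]) / h          # (theta - theta_i) / h for every pair
    return np.exp(-0.5 * u**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

chain = np.random.default_rng(3).normal(4.0, 0.14, size=5000)
density = kde(chain, np.linspace(3.5, 4.5, 200))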
3.7 Convergence issues
The maximum likelihood based estimation procedures described in the last
chapter are iterative routines and consequently converge to an answer. The
convergence depends on a tolerance factor, that is, how different the current
estimate is from the last estimate. Here convergence is easy to establish.
The convergence of a Markov chain is different from the convergence of these
techniques. In Markov chain methods we are not interested in convergence to an
estimate but instead are interested in convergence to a distribution, namely the
joint posterior distribution of interest. There are many points to be addressed
when considering convergence to a distribution. Firstly, when has the chain moved
away from its starting value and started sampling from its stationary distribution?
Secondly, how large a sample is required to give estimates to a given accuracy?
And finally, is the stationary distribution the required posterior distribution?

Figure 3-2: Kernel density plot of μ₁ using the Gibbs sampling method and a
Gaussian kernel with a large value of the window width h.
3.7.1 Length of burn-in

It is usual in Markov chains to ignore the first B values while the chain converges
to the posterior distribution. These B values are known as the `burn-in' period
and there are many methods to estimate B. The easiest method is to look at
a trace for each parameter of interest. If a parameter θ is considered, then once
convergence has been attained at θ_B, the observations θ_i, i > B, should all come
from the same distribution. An equivalent approach is to consider the trace of
the mean of the parameter of interest against time. This trace should become
approximately constant when convergence has been reached. Examples of both
of these traces can be seen in Figure 3-3, where convergence is reached after about
50 iterations. The upper solid line in the bottom graph is the running mean after
discarding the first 50 iterations.
There are many convergence diagnostics that can be used to estimate whether
a chain has converged. The Raftery and Lewis diagnostic (Raftery and Lewis
1992) can also be used to estimate the chain length required for a given estimator
accuracy and is mentioned later. The Gelman and Rubin diagnostic (Gelman and
Rubin 1992) uses multiple chains and will be described when I consider multi-
modal models.
Geweke diagnostic
Geweke (1992) assumes that a burn-in of length B has been chosen and these B
iterations have been discarded. The method has its origin in the field of spectral
analysis and compares the trace of θ over two distinct parts, the first n_A and
the last n_B iterations, typically the first tenth and the last half of the data. The
following statistic

( θ̄_A − θ̄_B ) / √( S^A_θ(0)/n_A + S^B_θ(0)/n_B )

tends to a standard normal distribution as n → ∞ if the chain has converged.
Here θ̄ is the sample mean of θ and S_θ(0) is the consistent spectral density
estimate.
If the above statistic gives large absolute values for a chain, then convergence
has not occurred. In the models I am considering in this thesis, convergence
to the stationary distribution is quick when using the IGLS or RIGLS starting
values, and an arbitrary `burn-in' period will be used, for example 500 iterations.
If the chain has not converged by this point I can observe this from its trace and
amend the `burn-in' accordingly.
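A rough sketch of the Geweke comparison, assuming the burn-in has already been removed; for simplicity the spectral density at zero is approximated here by a truncated autocovariance sum rather than a full spectral estimate.

import numpy as np

def s_zero(x, max_lag=50):
    # Crude S(0): variance plus twice the truncated sum of autocovariances
    x = x - x.mean()
    n = len(x)
    acov = [x[: n - k] @ x[k:] / n for k in range(min(max_lag, n - 1))]
    return acov[0] + 2.0 * sum(acov[1:])

def geweke_z(chain, first=0.1, last=0.5):
    a = chain[: int(first * len(chain))]
    b = chain[-int(last * len(chain)):]
    var = s_zero(a) / len(a) + s_zero(b) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(var)

# |z| much larger than about 2 suggests the chain has not converged
print(geweke_z(np.random.default_rng(4).normal(size=5000)))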
3.7.2 Mixing properties of Markov chains
After a Markov chain has converged, the next consideration is how long to run
the chain to get accurate enough estimates. For some samplers, such as the
independence sampler, it is possible to calculate the number of iterations required
to calculate particular summary statistics to a given accuracy. This is because
the independence sampler by definition should give uncorrelated values.

Figure 3-3: Traces of parameter μ₁ and the running mean of μ₁ for a Metropolis
run that converges after about 50 iterations. Upper solid line in lower panel is
running mean with first 50 iterations discarded.
Auto-correlation is an important issue when considering the chain length, as
a chain that is mixing badly, that is, has high auto-correlation, will need to be
run longer to give estimates to the required accuracy. Two useful plots that come
from the time series literature (Chatfield 1989) are the autocorrelation function
(ACF) and the partial autocorrelation function (PACF). The ACF is defined by

ρ(τ) = Cov[θ(t), θ(t+τ)] / Var[θ(t)],

and describes correlations between the chain itself and a chain produced by
moving the start of the chain forward τ iterations. The chain is mixing well
if these values are all small. The pth partial autocorrelation (PAC) is the excess
auto-correlation at lag p when fitting an AR(p) process not accounted for by an
AR(p − 1) model. The first PAC will be equal to the first autocorrelation as
this describes the correlation in the chain. For the chain to obey the (first-order)
Markov property all other PACs should be near zero. A large pth PAC would
indicate that the next value is dependent on past values and not just the current
value. The ACF and PACF for one Gibbs sampling run and one Metropolis
sampling run with σ_p = 0.05 for the example in the later section are shown in
Figure 3-4. The ACFs in Figure 3-4 include the auto-correlation at lag 0, which
is always 1. Here it can be seen that the Gibbs run is mixing well and the
auto-correlations are all small, whereas the Metropolis run is highly correlated.
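The ACF itself is simple to estimate from a stored chain; a minimal sketch (the lag-0 value is 1 by construction, as in Figure 3-4).

import numpy as np

def acf(chain, max_lag=30):
    # Sample autocorrelations of the chain at lags 0, 1, ..., max_lag
    x = np.asarray(chain) - np.mean(chain)
    denom = x @ x
    return np.array([x[: len(x) - k] @ x[k:] / denom for k in range(max_lag + 1)])

chain = np.random.default_rng(5).normal(size=5000)       # stand-in for a stored chain
print(acf(chain, max_lag=5))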
There are many ways to improve the mixing of a Markov chain. The simplest
way would be to thin the chain by using only every kth observation from the
chain. Thinning a chain will give a new chain that has less autocorrelation but it
can be shown (MacEachern and Berliner 1994) that the thinned chain gives less
accurate estimates than the complete chain. Thinning is still a useful technique
as longer runs need greater storage capacity and although the thinned chain is
not as useful as the full chain, it will generally be better than a section of the
complete chain of the same length.
When considering Gibbs sampling methods, there are several ways of
improving the mixing of the chain by actually altering the form of the model
that is being fitted. Hills and Smith (1992) explore re-parameterising the vector
of variables θ, so that the new variables correspond to the principal axes of the
posterior distribution. This is done by transforming the data to the new axes.
When considering multi-level models, techniques such as hierarchical centring
(Gelfand, Sahu, and Carlin 1995), where variables that appear at lower levels are
also included at higher levels, will improve the mixing of the sampler.

Figure 3-4: ACF and PACF for parameter μ₁ for a Gibbs sampling run of length
5000 that is mixing well and a Metropolis run that is not mixing very well.
The mixing of a Metropolis-Hastings chain will depend greatly on the proposal
distribution used. I will discuss the effect of the proposal distribution in greater
detail in the example at the end of this chapter. Most of the techniques used
to improve mixing in the Gibbs sampling algorithms, which involve changing the
structure of the model, can also be used with Metropolis-Hastings algorithms.
Raftery & Lewis diagnostic
The Raftery and Lewis diagnostic (Raftery and Lewis 1992) considers the
convergence of a run based on estimating a quantile, q, of a function of the
parameters, g(θ), to within a given accuracy. The method works by firstly finding
the estimated qth quantile of g(θ), ĝ_q, from the chain and then creating a chain of
binary values Z_t defined by

Z_t = 1 if g(θ_t) > ĝ_q,
Z_t = 0 if g(θ_t) ≤ ĝ_q.

This binary sequence, or a thinned version of the binary sequence, can then
be thought of as a one step Markov chain with transition matrix

P = ( 1−α    α )
    (  β    1−β ).

Using results from Markov chain theory and estimates for α and β from the
chain, estimates for the length of `burn-in' required, B, and the minimum number
of iterations to run the chain for, N, can be calculated. N is defined as the
minimum chain length to obtain estimates for the qth quantile to within ±r (on the
probability scale) with probability s, such that the n step transition probabilities
of the Markov chain are within ε of its equilibrium distribution.
The estimates are

B = log( ε(α + β) / max(α, β) ) / log(1 − α − β)

and

N = B + [ αβ(2 − α − β) / (α + β)³ ] { r / Φ⁻¹(½(1 + s)) }⁻².

The Raftery Lewis diagnostic can also be used to assess the mixing of the
Markov chain by comparing the value N with the value N_min obtained if the
chain values were an independent sample. The statistic I_RL = N / N_min can then be
used to describe the efficiency of the sampler. The default settings for the Raftery
Lewis diagnostic, as used in Raftery and Lewis (1992), are q = 0.025, r = 0.005
and s = 0.95, and these will be used in MLwiN.
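Given the binary sequence Z_t, the two formulas above translate directly into code. The sketch below is a simplification of the full Raftery-Lewis procedure (it estimates α and β from the raw sequence and ignores the preliminary thinning step), and the tolerance eps = 0.001 is an assumed value, not one quoted in the text.

import numpy as np
from scipy.stats import norm

def raftery_lewis_BN(z, r=0.005, s=0.95, eps=0.001):
    # z is the 0/1 indicator chain; alpha = P(0 -> 1), beta = P(1 -> 0)
    z = np.asarray(z)
    alpha = np.mean(z[1:][z[:-1] == 0])
    beta = 1.0 - np.mean(z[1:][z[:-1] == 1])
    B = np.log(eps * (alpha + beta) / max(alpha, beta)) / np.log(1.0 - alpha - beta)
    factor = alpha * beta * (2.0 - alpha - beta) / (alpha + beta) ** 3
    N = B + factor * (norm.ppf(0.5 * (1.0 + s)) / r) ** 2
    return B, N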
3.7.3 Multi-modal models
If the joint posterior distribution of interest is multi-modal, then when an MCMC
sampler is used to simulate from the distribution it is possible, particularly
if the modes are distinct, that the sampler will simulate from one of the modes and
not the whole distribution. To get around this problem it is always useful to
run several chains in parallel, with different starting values spread around the
expected posterior distribution, and compare the estimates that are obtained from
each chain. If the chains give widely differing estimates then the posterior is likely
to be multi-modal and the different chains are sampling from distinct modes of
the distribution. There are many convergence diagnostics that rely on running
several chains from different starting points. One of the more popular will now
be described.
Gelman & Rubin diagnostic
Gelman and Rubin (1992) assume that m runs of the same model, each of length
2n and starting from dispersed starting points, are run, and the first n iterations of each
run have been discarded to allow each sequence to move away from its starting
point. Then the between-run variance B and within-run variance W are calculated as

B/n = Σ_{i=1}^{m} (θ̄_{i·} − θ̄_{··})² / (m − 1)   and   W = Σ_{i=1}^{m} s_i² / m,

where θ̄_{i·} is the mean of the n values for run i, s_i² is the corresponding variance and
θ̄_{··} is the overall mean.
The variance of the parameter of interest, σ², can be estimated by a weighted
average of W and B,

σ̂² = ((n − 1)/n) W + (1/n) B.

This, along with μ̂ = θ̄_{··}, gives a normal estimate for the target distribution.
If the dispersed starting points are still influencing the runs then the estimate σ̂²
will be an overestimate of σ². The potential scale reduction as n → ∞, that is,
the overestimation factor for σ², can be estimated by

R̂ = [ (n − 1)/n + ((m + 1)/(mn)) (B/W) ] × df / (df − 2),

where df = 2 (σ̂²)² / var̂(σ̂²), and

var̂(σ̂²) = ((n − 1)/n)² (1/m) var̂(s_i²) + ((m + 1)/(mn))² (2/(m − 1)) B²
          + 2 ((m + 1)(n − 1)/(mn²)) (n/m) [ cov̂(s_i², θ̄_{i·}²) − 2 θ̄_{··} cov̂(s_i², θ̄_{i·}) ],

in which the estimated variances and covariances are obtained from the sample
means and variances of the m runs.
If R̂ is near 1 for all parameters of interest then there is little evidence of
multi-modality. If R̂ is significantly bigger than 1 then at least one of the m
runs has not converged, or the runs have converged to different modes in the
distribution.
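A sketch of the core calculation for m parallel runs, assuming the first halves have already been discarded; for simplicity it omits the df/(df − 2) correction factor in the formula above.

import numpy as np

def gelman_rubin(chains):
    # chains is an (m, n) array of post-burn-in values, one row per run
    chains = np.asarray(chains)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)     # n times the variance of the run means
    W = chains.var(axis=1, ddof=1).mean()       # average within-run variance
    return (n - 1) / n + (m + 1) / (m * n) * B / W

# Four well-mixed runs from the same target should give a value close to 1
print(gelman_rubin(np.random.default_rng(7).normal(size=(4, 2000))))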
The majority of models I will study in this thesis will be unimodal. I am
aiming to use MLn's maximum likelihood techniques to give good starting values
for the parameters of interest and consequently will not use widely different
starting values, and so I generally choose not to use the Gelman and Rubin
diagnostic. I will however run several chains with different starting seeds for
the random number generator. This will work in a similar way to the different
starting values. A chain may get stuck in a local mode using one set of random
number seeds, whereas another chain starting from the same starting values but
with different random numbers may get stuck in a different mode. However, if
the model does have multiple modes this procedure will not find them as well as
the Gelman-Rubin sampling strategy.
3.7.4 Summary
Convergence diagnosis for MCMC methods has become a large field of
statistical research, and the three diagnostics described here are simply the tip
of the iceberg. Both Cowles and Carlin (1996) and Brooks and Roberts (1997)
review larger groups of convergence diagnostics and are recommended for further
reading on this subject.
3.8 Use of MCMC methods in multi-level modelling

Following the introduction of Gibbs sampling in Gelfand and Smith (1990),
Gelfand et al. (1990) applied the Gibbs sampling algorithm to many problems
including variance components models and a simple hierarchical model. Seltzer
(1993) considers using Gibbs sampling on a two level hierarchical model with a
scalar random regression parameter. The algorithm used is fully generalized in
Seltzer, Wong, and Bryk (1996) to allow vectors of random regression parameters.
Zeger and Karim (1991) consider using Gibbs sampling for generalized linear
models with random effects, which are two level multi-level models. They
concentrate mainly on the logistic normal model, which I will investigate in
Chapter 6. The package BUGS (Spiegelhalter et al. 1994) is a general purpose
Gibbs sampling package using the adaptive rejection method (Gilks and Wild
1992) that can be used to fit many models including multi-level models. Its authors
have concentrated mainly on models with univariate parameter distributions,
although BUGS versions 0.5 and later include multivariate distributions.
It can be seen that most research on the use of MCMC methods in the field of
multi-level modelling has concentrated on Gibbs sampling, primarily
because of its ease of programming. In MLwiN I will start by using Gibbs
sampling for the simplest models. Then, when the conditional distributions do
not have standard forms, for example in logistic regression models, where Zeger and
Karim (1991) use rejection sampling and BUGS uses adaptive rejection sampling,
I will instead consider using Metropolis and Metropolis-Hastings sampling. I will
also consider using these methods as an alternative to Gibbs sampling in the less
complex Gaussian models. Before looking at multi-level models, I will end this
chapter with an example that illustrates the three MCMC methods and the
other issues described in this chapter.
3.9 Example - Bivariate normal distribution
Gelman et al. (1995) considered taking one observation from a bivariate normal
distribution to illustrate the use of the Gibbs sampler. I will consider the more
general case of a sequence of n pairs of observations (y_1i, y_2i) from a bivariate
normal distribution with unknown mean (μ₁, μ₂) and known variance matrix Σ.
Assume that μ has a non-informative uniform prior distribution; then the posterior
distribution has a known form:

(μ₁, μ₂)ᵀ | y ~ N( (ȳ₁, ȳ₂)ᵀ, Σ/n ).

I can verify the use of the MCMC techniques in this chapter by comparing
the answers they produce with the correct posterior distribution. I will consider
a set of 100 draws generated from a bivariate normal distribution with mean
vector μ = (4, 2) and variance matrix

Σ = (  2.0  −0.2 )
    ( −0.2   1.0 ).

I will assume that Σ is known and that I want to estimate μ. In the test data set,
ȳ₁ = 4.0154 and ȳ₂ = 2.0013, so the posterior distribution is as follows:

(μ₁, μ₂)ᵀ | y ~ N( (4.0154, 2.0013)ᵀ, (  0.02   −0.002 ; −0.002   0.01 ) ).
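This setup is easy to reproduce in a few lines; the sketch below simulates its own 100 draws (so the sample means will differ slightly from the 4.0154 and 2.0013 quoted above) and computes the exact posterior N(ȳ, Σ/n) against which the samplers can be checked.

import numpy as np

rng = np.random.default_rng(10)
mu_true = np.array([4.0, 2.0])
Sigma = np.array([[2.0, -0.2],
                  [-0.2, 1.0]])
n = 100

y = rng.multivariate_normal(mu_true, Sigma, size=n)   # the simulated data set
ybar = y.mean(axis=0)                                  # posterior mean under a flat prior
post_cov = Sigma / n                                   # exact posterior covariance
print(ybar, post_cov)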
I will now explain briefly how to use the various techniques on this problem.
3.9.1 Metropolis sampling
There are two parameters, μ₁ and μ₂, for which posterior distributions are
required. As uniform priors are being used for μ₁ and μ₂, the conditional posterior
distributions are simply determined by the likelihood:

p(μ₁ | μ₂, Σ, y) ∝ p(y | μ, Σ) p(μ₁)
p(μ₁ | μ₂, Σ, y) ∝ exp( −(1/2) ∑_{i=1}^{N} (y_i − μ)ᵀ Σ⁻¹ (y_i − μ) ).

Similarly for μ₂,

p(μ₂ | μ₁, Σ, y) ∝ exp( −(1/2) ∑_{i=1}^{N} (y_i − μ)ᵀ Σ⁻¹ (y_i − μ) ).

I will use the normal proposal distribution

μ_i(t+1) ~ N( μ_i(t), σ_p² )

for both μ₁ and μ₂. I will consider several values for σ_p² to show the effect of the
proposal variance on the acceptance rate and convergence of the chain.
3.9.2 Metropolis-Hastings sampling

As an example of a Metropolis-Hastings sampler I will consider the following
normal proposal distribution:

μ_i(t+1) ~ N( μ_i(t) + 1/2, σ_p² ).

This proposal distribution has two differences from the earlier Metropolis
proposal distribution. Firstly it is biased, which in this example induces slow
mixing; generally it is preferable to have an unbiased proposal distribution.
Secondly it is not symmetric, so it does not have the Metropolis property
p(θ_{t+1} = a | θ_t = b) = p(θ_{t+1} = b | θ_t = a). Consequently the ratio of the
proposal distributions has to be worked out, that is

r = p(θ_{t+1} = a | θ_t = b) / p(θ_{t+1} = b | θ_t = a).

For this proposal distribution,

r = p(θ_{t+1} = a | θ_{t+1} ~ N(b + 1/2, σ²)) / p(θ_{t+1} = b | θ_{t+1} ~ N(a + 1/2, σ²))
  = exp( −(1/(2σ²)) (a − b − 1/2)² ) / exp( −(1/(2σ²)) (b − a − 1/2)² )
  = exp( −(1/(2σ²)) [ (a − b − 1/2)² − (b − a − 1/2)² ] )
  = exp( (a − b)/σ² ).

So when choosing to accept or reject each new value, the Hastings ratio is used
as a multiplying factor. Again I will consider using several different proposal
variances σ_p² to improve the acceptance rate and convergence time.
3.9.3 Gibbs sampling
To use Gibbs sampling on this model I will consider updating the two parameters,
μ₁ and μ₂, separately. As an illustration of the Gibbs sampler it would be pointless
to update the parameters together using a multivariate updating step, as this would
mean generating from the conditional distribution p(μ | Σ, y), which is the joint
posterior distribution of interest, and I could find its mean and variance directly.
To use Gibbs sampling I need to find the two conditional distributions
p(μ₁ | μ₂, Σ, y) and p(μ₂ | μ₁, Σ, y). I am using uniform priors for μ₁ and μ₂ and
so the posterior distribution is simply the normalised likelihood. The conditional
distributions are found as follows:

p(μ₁ | μ₂, Σ, y) ∝ p(y | μ, Σ) p(μ₁)
p(μ₁ | μ₂, Σ, y) ∝ exp( −(1/2) ∑_{i=1}^{n} (y_i − μ)ᵀ Σ⁻¹ (y_i − μ) ).

Let D = ( d₁₁ d₁₂ ; d₁₂ d₂₂ ) = Σ⁻¹, then expand in terms of μ₁:

p(μ₁ | μ₂, Σ, y) ∝ exp( −(1/2) ∑_{i=1}^{n} [ (y_{i1} − μ₁)² d₁₁ + 2 (y_{i1} − μ₁)(y_{i2} − μ₂) d₁₂ + (y_{i2} − μ₂)² d₂₂ ] ).
Then, assuming that μ₁ has a normal conditional distribution, μ₁ ~ N(μ_c, σ_c²), and equating
powers of μ₁ gives

μ₁² / σ_c² = n d₁₁ μ₁²   →   σ_c² = 1 / (n d₁₁),

and

−2 μ_c μ₁ / σ_c² = −2 ∑_{i=1}^{n} ( y_{i1} μ₁ d₁₁ + y_{i2} μ₁ d₁₂ − μ₂ μ₁ d₁₂ )
   →   μ_c = ȳ₁ + (d₁₂ / d₁₁)(ȳ₂ − μ₂).

So I have

μ₁ | μ₂, Σ, y ~ N( ȳ₁ + (d₁₂/d₁₁)(ȳ₂ − μ₂), 1/(n d₁₁) ),

and similarly for μ₂,

μ₂ | μ₁, Σ, y ~ N( ȳ₂ + (d₁₂/d₂₂)(ȳ₁ − μ₁), 1/(n d₂₂) ).

These expressions could also have been derived from standard bivariate
normal regression results. I can now use the Gibbs sampling algorithm by
alternately sampling from these two conditional distributions. Unlike the other
two methods, I do not have a free parameter to set to change the acceptance rate
and improve the convergence rate; the Gibbs sampler always accepts the new
state. This is one of the reasons it is more widely used than the other methods.
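The full conditionals just derived make the Gibbs sampler only a few lines of code; a sketch, reusing the simulated data y and known matrix Σ from the earlier snippet.

import numpy as np

def gibbs_bivariate_mean(y, Sigma, n_iter=6000, burn_in=1000, seed=11):
    # Gibbs sampler for the mean of a bivariate normal with known Sigma and flat priors
    rng = np.random.default_rng(seed)
    n, ybar = len(y), y.mean(axis=0)
    D = np.linalg.inv(Sigma)                       # precision matrix D = Sigma^{-1}
    mu = ybar.copy()                               # starting values
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # mu1 | mu2 ~ N(ybar1 + (d12/d11)(ybar2 - mu2), 1/(n d11))
        mu[0] = rng.normal(ybar[0] + D[0, 1] / D[0, 0] * (ybar[1] - mu[1]),
                           np.sqrt(1.0 / (n * D[0, 0])))
        # mu2 | mu1 ~ N(ybar2 + (d12/d22)(ybar1 - mu1), 1/(n d22))
        mu[1] = rng.normal(ybar[1] + D[0, 1] / D[1, 1] * (ybar[0] - mu[0]),
                           np.sqrt(1.0 / (n * D[1, 1])))
        draws[t] = mu
    return draws[burn_in:]

# draws = gibbs_bivariate_mean(y, Sigma)   # posterior means and sds should be close to Table 3.1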
3.9.4 Results
The model was fitted using all three methods described above. For the Gibbs
sampler 3 runs were performed using a burn-in of 1,000 and a main run of 5,000
updates. For both the Metropolis and Metropolis-Hastings sampling methods 3
runs were performed for several different values of σ_p. A burn-in of 1,000 was used
and a main run of 100,000 updates for both methods. The results are summarised
in Table 3.1.

Table 3.1: Comparison between MCMC methods for fitting a bivariate normal model with unknown mean vector.

Method       σ_p    μ₁ (sd)         μ₂ (sd)         Acc % μ₁/μ₂    R-L N
Theory       N/A    4.015 (0.141)   2.001 (0.100)   N/A            N/A
Metropolis   0.05   4.012 (0.143)   2.004 (0.100)   88.6/84.3      81,500
Metropolis   0.1    4.014 (0.141)   2.002 (0.100)   78.1/70.3      35,000
Metropolis   0.2    4.017 (0.141)   2.001 (0.100)   60.6/49.7      16,600
Metropolis   0.3    4.017 (0.141)   2.001 (0.100)   47.8/37.2      14,800
Metropolis   0.5    4.016 (0.141)   2.002 (0.100)   32.5/24.0      20,200
Metropolis   1.0    4.015 (0.142)   2.002 (0.100)   17.4/12.4      36,800
Hastings     0.25   4.019 (0.143)   2.002 (0.101)   4.4/4.3        92,600
Hastings     0.3    4.017 (0.141)   2.002 (0.099)   8.9/7.9        34,900
Hastings     0.5    4.014 (0.140)   2.001 (0.100)   18.8/14.1      25,700
Hastings     0.75   4.014 (0.141)   2.001 (0.100)   17.9/13.1      31,900
Hastings     1.0    4.016 (0.141)   2.001 (0.100)   15.2/11.0      41,800
Hastings     1.5    4.013 (0.143)   2.001 (0.100)   11.1/7.8       63,200
Gibbs        N/A    4.014 (0.140)   2.002 (0.103)   100.0/100.0    3,900

From Table 3.1 it can clearly be seen that all the methods eventually converge
to approximately the correct answers. According to the Raftery-Lewis diagnostic,
the Gibbs sampling method achieves the default accuracy goals in the least
number of iterations. Both the Metropolis and the Hastings methods have
accuracies which vary depending on the proposal distribution. The Hastings
method has much smaller acceptance rates due to the bias in the proposal
distribution. It is clear that the Hastings sampler takes longer to converge than
the Metropolis sampler on average, and that the best proposal standard deviation
σ_p is higher for the Hastings sampler than for the Metropolis sampler. Both these
points are also due to the bias in the sampler, and in general a biased sampler
would not be used in preference to an unbiased sampler.
Gelman, Roberts, and Gilks (1995) studied optimal Metropolis sampler
proposal distributions for Gaussian target distributions. They found that the best
univariate proposal distributions have standard deviations that are 2.38 times the
sample standard deviation. I used the known correct standard deviations for the
parameters μ₁ and μ₂ to find that the optimal proposal standard deviations are
0.336 for μ₁ and 0.238 for μ₂. In Table 3.1 the same proposal standard deviation is
used for both parameters, but it is just as easy to use different proposal distributions
for each parameter. Using the proposal standard deviations proposed in Gelman,
Roberts, and Gilks (1995) gives acceptance rates of approximately 44.2% and
Raftery Lewis N values of around 14,000 for both parameters, which compare
favourably with the best results in the table.
Figure 3-5: Kernel density plot of μ₂ using the Gibbs sampling method and a
Gaussian kernel.
Looking at the kernel density plots for the two variables, Figures 3-2 and
3-5, constructed from one run of the Gibbs sampler method, it can be seen that
both variables have Gaussian posterior distributions, as expected. In the case
where the posterior distributions of the parameters of interest are Gaussian,
the 95% central Bayesian credible intervals (BCI) from the simulation methods
will be approximately the same as the standard 95% confidence intervals (CI).
This point is illustrated in Table 3.2 for the three MCMC methods. Both the
Bayesian credible intervals and the confidence intervals are based on one run of
each sampler respectively.
Figure 3-6: Plots of the Raftery Lewis N values for various values of σ_p, the
proposal distribution standard deviation. Panels (a)-(f) are described in the text below.
Table 3.2: Comparison between 95% confidence intervals and Bayesian credible intervals in the bivariate normal model.

Method            μ₁ CI            μ₂ CI            μ₁ BCI           μ₂ BCI
Theory            (3.739, 4.291)   (1.805, 2.197)   (3.739, 4.291)   (1.805, 2.197)
Met. σ_p = 0.3    (3.737, 4.294)   (1.807, 2.196)   (3.737, 4.294)   (1.806, 2.196)
Met. optimal      (3.737, 4.292)   (1.804, 2.200)   (3.738, 4.291)   (1.804, 2.200)
Hast. σ_p = 0.5   (3.739, 4.288)   (1.802, 2.190)   (3.739, 4.283)   (1.804, 2.191)
Gibbs             (3.733, 4.291)   (1.799, 2.205)   (3.737, 4.290)   (1.802, 2.200)
Considering the Metropolis sampler in more detail and running the sampler
with many different values of σ_p, the optimal value given by Gelman, Roberts,
and Gilks (1995) can be verified. The graphs in Figure 3-6 were created by fitting
smooth curves to the results from these new runs of the sampler. Graphs (a)
and (b) show the Raftery Lewis N values plotted against σ_p, the
proposal standard deviation, for μ₁ and μ₂ respectively. Graphs (c) and (d)
show the effect of the acceptance rate of the proposals on N for the two variables.
Graphs (e) and (f) are graphs (a) and (b) rescaled by dividing σ_p by the true
standard deviations for the two parameters.
In their paper Gelman, Roberts, and Gilks (1995) compared the effects of
different values of the parameter σ_p, and of the acceptance rate, on the efficiency
of the sampler. I have used the Raftery and Lewis diagnostic N in place of
efficiency and found that the same values of standardised σ_p that maximise
efficiency also minimise N. From this it appears that 1/N is an increasing function
of efficiency.
It is worth noting from Figure 3-6 that while the optimal value for the ratio
of proposal SD to actual SD is close to the 2.4 value obtained in Gelman, Roberts,
and Gilks (1995), the region of near optimality is quite broad: the Raftery Lewis
N value is below 20,000 for ratios between 0.8 and 5.
3.9.5 Summary
This example has shown how to use the three MCMC methods described earlier
in this chapter on a simple problem. It has shown that the Gibbs sampler works
best if the conditional distributions of the unknown parameters are known. It
also shows that the Metropolis and Hastings algorithms are easier to implement
but need more tuning to give good answers. The Hastings algorithm used here
performed worst, but it will be shown in later chapters how more sensible Hastings
algorithms can be used. The example also illustrates how the output from a
simulation method can be summarised and highlights the importance of checking
convergence.
The summary results and run diagnostics covered in this chapter have now
been added to MLwiN and can be seen in Figure 3-7. This parameter has been
given an informative prior distribution whose graph can also be seen in the kernel
density picture. In the next chapter I will discuss prior distributions in greater
detail.
Figure 3-7: Plot of the MCMC diagnostic window in the package MLwiN for the
parameter β₁ from a random slopes regression model.
Chapter 4
Gaussian Models 1 - Introduction
4.1 Introduction
In this chapter the aim is to combine the knowledge gained in the previous
two chapters: to use the MCMC methods described in Chapter 3 to fit some
of the simple multi-level models described in Chapter 2. This work will then
lead on to the next chapter, where I will consider how to fit general models
using MCMC methods. I will only consider one of the three MCMC methods
described in Chapter 3, the Gibbs sampler, and use it to fit the simple variance
components model and the random slopes regression model. I will give Gibbs
sampling algorithms to fit both these models and compare the results obtained
with the results obtained from the IGLS and RIGLS maximum likelihood methods.
Before considering any modelling, I will firstly concentrate on one important
aspect of Bayesian methods, prior distributions. To create a general purpose
multi-level modelling package that uses Bayesian methods, some default prior
distributions must be found for all parameters. These default priors should be
"non-informative", and so some possible default priors will be described in the next
section of this chapter. These different candidate priors will then be compared
via simulation with each other and with the maximum likelihood methods. I will end
the chapter with some conclusions on which methods perform best.
4.2 Prior distributions
A prior distribution p(θ) for a parameter θ is a probability distribution that
describes all that is known about θ before the data have been collected. There
are two distinct types of prior distribution: informative priors and non-informative
priors.

4.2.1 Informative priors

An informative prior for θ is a prior distribution that is used when information
about the parameter of interest is available before the data are collected, and this
information is to be included in the analysis. For example, say I was interested in
estimating the average height of male university students. Before collecting
my data by sampling from the student population, I go to the library and find that
the average height of men in Britain is 1.79m. I can then create a normal prior
distribution with mean 1.79 and variance σ², where the value of σ² will determine
the information content of the prior knowledge. I could also incorporate my belief
that, as students are generally in the 18-30 age group and this age group is on
average taller, the mean of my prior distribution should be increased.

4.2.2 Non-informative priors

A "non-informative" prior distribution for θ is a prior distribution that is used
to express complete ignorance of the value of θ before the data are collected. Such
priors are non-informative in the sense that no value is favoured over any other, and are
also described as diffuse or flat priors for this reason and because of their shape. The
most common non-informative prior is the uniform distribution over the range
of the sample space for θ. If the parameter is defined over an infinite range,
for example the whole real line, then the uniform distribution is an improper
prior distribution, as its density does not integrate to 1. Improper
prior distributions should be used with caution, and only be used if they produce
proper posterior distributions.
4.2.3 Priors for �xed e�ects
Fixed e�ect parameters have no constraints and can take any value. A prior
distribution for such parameters will need to be de�ned over the whole real line.
The conjugate prior distribution for such parameters is the normal distribution
as will be illustrated in the algorithms described in later sections.
Uniform prior
If a non-informative prior is required then a good choice would be a uniform prior
over the whole real line, p(�) / 1. This prior is improper as it does not integrate
to 1, but will give proper posterior distributions and can be approximated by the
following prior.
Normal prior with huge variance
As the variance of the normal distribution is increased, the distribution becomes
locally flat around its mean. Although fixed effects can take any value, close
examination of the data can narrow the range of plausible values and a suitable normal
prior can be found. Generally the normal prior $\beta \sim N(0, 10^4)$ will be an
acceptable approximation to a uniform distribution, but if the fixed effects are
very large a suitable increase in the prior variance may be necessary. Figure 4-1
shows several normal priors over the range (-5, 5). It can clearly be seen that as
the variance increases the prior distribution becomes flatter over the range, and
when the variance is increased to 50 the graph looks like a flat line.
4.2.4 Priors for single variances
Variance parameters are constrained to be strictly positive, and so prior
distributions such as the normal cannot be used. The conjugate prior for a
variance parameter is an inverse chi-squared or inverse gamma distribution. As
these distributions are not commonly simulated from, the precision parameter,
the reciprocal of the variance, is generally considered instead. The conjugate priors
for the precision parameter are then the chi-squared or gamma distributions.
There are several main contenders for a non-informative prior for
the variance and these will now be considered.
Figure 4-1: Plot of normal prior distributions over the range (-5, 5) with mean 0 and variances 1, 2, 5, 10 and 50 respectively.
Uniform prior for $\sigma^2$

The parameter of interest is the variance $\sigma^2$, so this prior tries to allow any
variance to be equally likely. This prior is used by Gelman and Rubin (1992)
and Seltzer (1993) amongst others, but appears to have the disadvantage that
it favours large values of $\sigma^2$. This is because even unfeasibly large values of $\sigma^2$
have equal prior probability. This prior is improper and the following is a proper
alternative.
Pareto(1,$c$) prior for $\tau = 1/\sigma^2$

The Pareto distribution is a left-truncated gamma distribution, and when used
as a prior for the precision parameter it is equivalent to a locally uniform prior for
the variance parameter:

$$\tau \sim \mathrm{Pareto}(1, c), \qquad p(\tau) = c\,\tau^{-2}, \quad \tau > c.$$

This means that a uniform prior for $\sigma^2$ on $(0, c^{-1})$ is equivalent to a Pareto$(1, c)$
prior for $\tau$. As $c$ is decreased the distribution approaches the improper uniform
distribution on $(0, \infty)$.
Uniform prior for $\log \sigma^2$

Box and Tiao (1992) try to find `data-translated' uniform priors to represent
suitably non-informative priors. They try to find a scale upon which the
distribution of the parameter will have the same shape for any possible value
of that parameter. For the fixed effects parameter this scale is simply the
parameter's own scale, as altering a normal distribution's mean does not alter
its shape, it simply translates the distribution. When considering a variance
parameter the likelihood has an inverse chi-squared distribution and this implies
that the correct scale is the log scale. Consequently Box and Tiao suggest
using a uniform prior on the $\log \sigma^2$ scale. DuMouchel and Waternaux (1992)
discourage the use of this improper prior distribution with hierarchical models as
they claim it can give improper posterior distributions, so instead the following
proper alternative is often used.
Gamma($\epsilon,\epsilon$) prior for $\tau = 1/\sigma^2$

The gamma($\epsilon,\epsilon$) prior for $\tau$ approaches a uniform prior for $\log \sigma^2$ as $\epsilon \rightarrow 0$. In
fact the improper gamma(0, 0) distribution for $\tau$ is equivalent to a
uniform prior for $\log \sigma^2$. This prior is the standard prior recommended by BUGS
(Spiegelhalter et al. 1994) for variance parameters, as BUGS does not permit the
use of improper priors.
Scaled inverse chi-squared($\nu, \hat\sigma^2$) prior for $\sigma^2$

An alternative approach would be to use an estimate of the parameter of interest
to choose a particular prior distribution. This prior is then a data-driven prior
as it requires an estimate $\hat\sigma^2$ of $\sigma^2$. The parameter $\nu$ is small, and if the estimate
$\hat\sigma^2$ is 1, this prior is equivalent to a gamma($\nu/2, \nu/2$) prior for $\tau$.
4.2.5 Priors for variance matrices
When considering a variance matrix, most priors for single variances can be
translated to a multivariate alternative.
Uniform prior for $\Omega$

This is similar to the univariate case, i.e. $p(\Omega) \propto 1$.
Uniform prior for $\log \Omega$

This is similar to the univariate case, i.e. $p(\log \Omega) \propto 1$. This prior will not be
considered directly as it gives rise to improper posterior distributions.
Wishart prior equivalent to the gamma($\epsilon,\epsilon$) prior

It is difficult to evaluate what would be the multivariate equivalent of the
gamma($\epsilon,\epsilon$) prior. One candidate prior is a Wishart prior for the precision matrix
$\Omega^{-1}$ with parameters $\nu = n$ and $S = I$. In fact, it will be seen later that this
prior is slightly informative and shrinks the estimate for $\Omega$ towards $I$, the identity
matrix.
Wishart prior equivalent to the SI $\chi^2(\nu, \hat\sigma^2)$ prior

It can be shown (Spiegelhalter et al. 1994) that if $\Omega^{-1} \sim \mathrm{Wishart}_n(\nu, S)$ then
$E(\Omega) = S/(\nu - n - 1)$. It clearly follows that if there is a prior estimate $\hat\Omega$ for $\Omega$
and I want to incorporate this estimate into a `vaguely' informative prior, then
the following is an obvious candidate:

$$\Omega^{-1} \sim \mathrm{Wishart}_n(n+2, \hat\Omega).$$

I will now compare the results obtained using all the above priors on 2 simple
two level models.
4.3 2 Level variance components model
One of the simplest possible multi-level models is the two level variance
components model. This model has a single response variable and interest lies in
quantifying the variability of this response at different levels. The two level variance
components model can be written mathematically as:

$$y_{ij} = \beta_0 + u_j + e_{ij},$$
$$u_j \sim N(0, \sigma^2_u), \qquad e_{ij} \sim N(0, \sigma^2_e),$$

where $i = 1, \ldots, n_j$, $j = 1, \ldots, J$, $\sum_j n_j = N$ and in which all the $u_j$ and $e_{ij}$
are independent. This model can be fitted using the Gibbs sampling method as
shown in the next section.
4.3.1 Gibbs sampling algorithm
The unknown parameters in the variance components model can be split into
four groups: the fixed effect $\beta_0$, the level 2 residuals $u_j$, the level 2 variance $\sigma^2_u$
and the level 1 variance $\sigma^2_e$. Conditional posterior distributions for each of these
parameters need to be found so that the Gibbs sampling method described in
the previous chapter can be used. Then sampling from the distributions in turn
gives estimates for the parameters, and their posterior distributions can be found
by simulation.
Prior distributions
I will assume a uniform prior for the fixed effect parameter $\beta_0$. The two variances
$\sigma^2_u$ and $\sigma^2_e$ will take various priors in the simulation experiment, so in the algorithm
I will use general scaled inverse $\chi^2$ priors with parameters $\nu_u, s^2_u$ and $\nu_e, s^2_e$
respectively. Then all the priors in the earlier section can be obtained from
particular values of these parameters. The algorithm is then as follows:
Step 1. $p(\beta_0 \mid y, \sigma^2_u, \sigma^2_e, u)$

Let $\beta_0 \sim N(\hat\beta_0, \hat D_0)$; then to find $\hat\beta_0$ and $\hat D_0$:

$$p(\beta_0 \mid y, \sigma^2_u, \sigma^2_e, u) \propto \prod_{i,j} \Big(\frac{1}{\sigma^2_e}\Big)^{\frac{1}{2}} \exp\Big[-\frac{1}{2\sigma^2_e}(y_{ij} - u_j - \beta_0)^2\Big]$$
$$\propto \exp\Big[-\frac{N}{2\sigma^2_e}\beta_0^2 + \frac{1}{\sigma^2_e}\sum_{i,j}(y_{ij} - u_j)\,\beta_0 + \text{const}\Big].$$

Comparing this with the form of a normal distribution and matching powers of
$\beta_0$ gives

$$\hat D_0 = \frac{\sigma^2_e}{N}, \qquad \hat\beta_0 = \frac{1}{N}\sum_{i,j}(y_{ij} - u_j).$$
Step 2. $p(u_j \mid y, \sigma^2_u, \sigma^2_e, \beta_0)$

Let $u_j \sim N(\hat u_j, \hat D_j)$; then to find $\hat u_j$ and $\hat D_j$:

$$p(u_j \mid y, \sigma^2_u, \sigma^2_e, \beta_0) \propto \prod_{i=1}^{n_j}\Big(\frac{1}{\sigma^2_e}\Big)^{\frac{1}{2}}\exp\Big[-\frac{1}{2\sigma^2_e}(y_{ij} - u_j - \beta_0)^2\Big] \times \Big(\frac{1}{\sigma^2_u}\Big)^{\frac{1}{2}}\exp\Big[-\frac{u_j^2}{2\sigma^2_u}\Big]$$
$$\propto \exp\Big[-\frac{1}{2}\Big(\frac{n_j}{\sigma^2_e} + \frac{1}{\sigma^2_u}\Big)u_j^2 + \frac{1}{\sigma^2_e}\sum_{i=1}^{n_j}(y_{ij} - \beta_0)\,u_j + \text{const}\Big].$$

Comparing this with the form of a normal distribution and matching powers of
$u_j$ gives

$$\hat D_j = \Big(\frac{n_j}{\sigma^2_e} + \frac{1}{\sigma^2_u}\Big)^{-1}, \qquad \hat u_j = \frac{\hat D_j}{\sigma^2_e}\sum_{i=1}^{n_j}(y_{ij} - \beta_0).$$
Step 3. $p(\sigma^2_u \mid y, \beta_0, u, \sigma^2_e)$

Consider instead $p(1/\sigma^2_u \mid y, \beta_0, u, \sigma^2_e)$ and let $1/\sigma^2_u \sim \mathrm{gamma}(\hat a_u, \hat b_u)$. Then
$p(1/\sigma^2_u) = (1/\sigma^2_u)^{-2}p(\sigma^2_u)$ and so

$$p(1/\sigma^2_u \mid y, \beta_0, u, \sigma^2_e) \propto \prod_{j=1}^{J}\Big(\frac{1}{\sigma^2_u}\Big)^{\frac{1}{2}}\exp\Big[-\frac{u_j^2}{2\sigma^2_u}\Big]\Big(\frac{1}{\sigma^2_u}\Big)^{-2}p(\sigma^2_u)$$
$$\propto \Big(\frac{1}{\sigma^2_u}\Big)^{\frac{J}{2} + \frac{\nu_u}{2} - 1}\exp\Big[-\frac{1}{2\sigma^2_u}\Big(\sum_{j=1}^{J}u_j^2 + \nu_u s_u^2\Big)\Big].$$

Comparing this with the form of a gamma distribution produces

$$\hat a_u = \frac{J + \nu_u}{2} \quad\text{and}\quad \hat b_u = \frac{1}{2}\Big(\nu_u s_u^2 + \sum_{j=1}^{J}u_j^2\Big).$$

A uniform prior on $\sigma^2_u$, or the equivalent Pareto prior, is equivalent to $\nu_u = -2$, $s^2_u = 0$. A uniform prior on $\log \sigma^2_u$ is equivalent to $\nu_u = 0$, $s^2_u = 0$, and a
gamma($\epsilon, \epsilon$) prior for $1/\sigma^2_u$ is equivalent to $\nu_u = 2\epsilon$, $s^2_u = 1$.
Step 4. $p(\sigma^2_e \mid y, \beta_0, u, \sigma^2_u)$

Consider instead $p(1/\sigma^2_e \mid y, \beta_0, u, \sigma^2_u)$ and let $1/\sigma^2_e \sim \mathrm{gamma}(\hat a_e, \hat b_e)$. Then
$p(1/\sigma^2_e) = (1/\sigma^2_e)^{-2}p(\sigma^2_e)$ and so

$$p(1/\sigma^2_e \mid y, \beta_0, u, \sigma^2_u) \propto \prod_{i,j}\Big(\frac{1}{\sigma^2_e}\Big)^{\frac{1}{2}}\exp\Big[-\frac{1}{2\sigma^2_e}(y_{ij} - \beta_0 - u_j)^2\Big]\Big(\frac{1}{\sigma^2_e}\Big)^{-2}p(\sigma^2_e)$$
$$\propto \Big(\frac{1}{\sigma^2_e}\Big)^{\frac{N}{2} + \frac{\nu_e}{2} - 1}\exp\Big[-\frac{1}{2\sigma^2_e}\Big(\sum_{i,j}(y_{ij} - \beta_0 - u_j)^2 + \nu_e s_e^2\Big)\Big].$$

Comparing this with the form of a gamma distribution produces

$$\hat a_e = \frac{N + \nu_e}{2} \quad\text{and}\quad \hat b_e = \frac{1}{2}\Big(\nu_e s_e^2 + \sum_{i,j}e_{ij}^2\Big).$$

A uniform prior on $\sigma^2_e$, or the equivalent Pareto prior, is equivalent to $\nu_e = -2$, $s^2_e = 0$. A uniform prior on $\log \sigma^2_e$ is equivalent to $\nu_e = 0$, $s^2_e = 0$, and a
gamma($\epsilon, \epsilon$) prior for $1/\sigma^2_e$ is equivalent to $\nu_e = 2\epsilon$, $s^2_e = 1$.
Having found the four sets of conditional distributions, it is now simple
enough to program up the algorithm and compare via simulation the various
prior distributions.
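To make the four conditional steps concrete, here is a minimal sketch of such a Gibbs sampler in Python with numpy. It is not the implementation used for the results in this chapter (those runs used BUGS and MLwiN); the function name, starting values and argument defaults are illustrative assumptions, with the scaled inverse $\chi^2$ prior parameters passed in so that the different default priors correspond to different argument settings.

```python
import numpy as np

def gibbs_vc(y, school, J, n_iter=5000, burn=500,
             nu_u=-2.0, s2_u=0.0, nu_e=-2.0, s2_e=0.0, seed=1):
    """Sketch of the 4-step Gibbs sampler for the variance components model.
    y: responses (length N); school: school index 0..J-1 for each pupil;
    nu_*, s2_*: scaled inverse chi-squared prior parameters (the defaults
    correspond to uniform priors on the two variances)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    school = np.asarray(school)
    N = len(y)
    nj = np.bincount(school, minlength=J)

    beta0, s2u, s2e = y.mean(), 1.0, 1.0      # arbitrary starting values
    u = np.zeros(J)
    draws = np.empty((n_iter, 3))

    for it in range(burn + n_iter):
        # Step 1: beta0 ~ N(mean of (y_ij - u_j), sigma2_e / N)
        beta0 = rng.normal((y - u[school]).mean(), np.sqrt(s2e / N))
        # Step 2: u_j ~ N(uhat_j, Dhat_j)
        Dj = 1.0 / (nj / s2e + 1.0 / s2u)
        rsum = np.bincount(school, weights=y - beta0, minlength=J)
        u = rng.normal(Dj * rsum / s2e, np.sqrt(Dj))
        # Step 3: 1/sigma2_u ~ gamma(a_u, b_u), b_u being a rate
        a_u = 0.5 * (J + nu_u)
        b_u = 0.5 * (nu_u * s2_u + np.sum(u ** 2))
        s2u = 1.0 / rng.gamma(a_u, 1.0 / b_u)
        # Step 4: 1/sigma2_e ~ gamma(a_e, b_e)
        e = y - beta0 - u[school]
        a_e = 0.5 * (N + nu_e)
        b_e = 0.5 * (nu_e * s2_e + np.sum(e ** 2))
        s2e = 1.0 / rng.gamma(a_e, 1.0 / b_e)
        if it >= burn:
            draws[it - burn] = beta0, s2u, s2e
    return draws   # columns: beta0, sigma2_u, sigma2_e
```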
4.3.2 Simulation method
In this simulation experiment I want to compare the maximum likelihood methods
IGLS and RIGLS with the Gibbs sampler method using several different prior
distributions for the variance parameters. For ease of terminology I will consider
this two level dataset in an educational setting and use pupils within schools as
the units under consideration.
I will now consider which parameters in the above model are important and run
several different sets of simulations using different values for these parameters.
For each set of parameter values 1000 simulated datasets will be generated and
each method will be used to fit the variance components model to each dataset.
I will then compare how the methods perform for each group of parameter
values. These comparisons will be of two sorts: firstly how biased the methods
are, and secondly how well the confidence intervals they produce cover the true values.
The first two parameters I consider as important will influence the structure
of the study. Firstly I will consider the size of the study; there are $J$ schools,
each with $n_j$ pupils, to give a dataset of size $N$. I will consider changing $J$, the
number of schools included in the study, and consequently modifying $N$ to reflect
this change. Secondly the number of pupils in each school will be considered.
Here two schemes will be adopted: firstly having equal numbers of pupils in each
school and secondly having a more widely spread distribution of pupils in each
school.
To use a realistic scenario I will consider the JSP dataset introduced in earlier
chapters. If slight modifications are made by removing 1 pupil at random from
each of the 23 largest schools then there will be 864 students, an average of 18
students per school.
The number of schools included in the study can then be varied and schools
can be chosen so that the average number of pupils per school is maintained at 18 and the
sizes of the individual schools are well spread. I will consider four sizes of study:
6, 12, 24 and 48 schools with a total of 108, 216, 432 and 864 pupils respectively.
The 8 study designs are included in Table 4.1 below. The individual schools in the
cases with unequal $n_j$ were chosen to resemble the actual (skewed) distribution
of class size in the JSP data.
Table 4.1: Summary of study designs for variance components model simulation.
Study   Pupils per school                                                         N
1       5 10 13 18 24 38                                                          108
2       18 in each of the 6 schools                                               108
3       5 8 10 11 11 12 13 15 20 24 26 61                                         216
4       18 in each of the 12 schools                                              216
5       5 7 8 10 10 11 11 12 12 13 13 14 15 16 18 19 20 21 23 24 26 29 34 61     432
6       18 in each of the 24 schools                                              432
7       5 6 7 8 8 10 10 10 11 11 11 11 12 12 12 12 13 13 13 13 14 14 15 15
        16 16 17 18 18 19 19 20 20 21 21 21 23 24 24 24 25 26 27 29 34 37 38 61  864
8       18 in each of the 48 schools                                              864
The other variables that need varying are the true values given to the
parameters of interest, $\beta_0$, $\sigma^2_u$ and $\sigma^2_e$. The fixed effect $\beta_0$ is not of great interest
and so will be fixed at 30.0 and not modified. The two variances are more
interesting and I will choose three possible values for each of these parameters.
The between schools variance $\sigma^2_u$ will take values 1.0, 10.0 and 40.0, and for the
between pupils variation $\sigma^2_e$ the values 10.0, 40.0 and 80.0 will be used. I will
assume that $\sigma^2_e$ is always greater than or equal to $\sigma^2_u$ as this is more likely in the
educational scenario.
I will consider the eight different study designs with true values that are most
like the original JSP model, that is, $\beta_0 = 30$, $\sigma^2_u = 10$ and $\sigma^2_e = 40$. I will then
only consider study design 7, which is similar to the actual JSP dataset, and
modify the true values of the variance parameters. This will make in total 15
different simulation settings.
Creating the simulation datasets
For the variance components model, creating the simulation datasets is easy as
the only data that need generating are the values of the response variable for the
$N$ pupils. Considering the case of 864 pupils within 48 schools the procedure is
as follows:
1. Generate 48 $u_j$s, one for each school, by drawing from a normal distribution
with mean 0 and variance $\sigma^2_u$.
2. Generate 864 $e_{ij}$s, one for each pupil, by drawing from a normal distribution
with mean 0 and variance $\sigma^2_e$.
3. Evaluate $Y_{ij} = \beta_0 + u_j + e_{ij}$ for all 864 pupils.
This will generate one simulation dataset for the current parameter values.
This dataset is then fitted using each method, and the whole procedure is repeated
1000 times. The datasets will be generated using a short C program.
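Purely as an illustration of this three-step recipe, a minimal Python sketch (with a hypothetical function name, and default true values taken from the JSP-like setting above) might look as follows; the actual datasets were generated with a short C program.

```python
import numpy as np

def simulate_vc(nj, beta0=30.0, s2u=10.0, s2e=40.0, rng=None):
    """Generate one variance components dataset for schools of sizes nj."""
    rng = rng or np.random.default_rng()
    school = np.repeat(np.arange(len(nj)), nj)            # school of each pupil
    u = rng.normal(0.0, np.sqrt(s2u), size=len(nj))       # one u_j per school
    e = rng.normal(0.0, np.sqrt(s2e), size=len(school))   # one e_ij per pupil
    y = beta0 + u[school] + e
    return y, school

# for example, study design 1 from Table 4.1:
y, school = simulate_vc([5, 10, 13, 18, 24, 38])
```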
The Gibbs sampling routines using the various prior distributions will be run
using the BUGS package while the IGLS and RIGLS estimates will be calculated
using MLwiN. The main reason to use BUGS and not MLwiN to perform the
Gibbs sampling runs is due to the computing resources available to me. I have
access to several UNIX Sun workstations on which I can run BUGS in parallel but
I only have one PC to use for MLwiN and this machine is also used for MLwiN
development work.
Comparing methods
To compare the bias of each method I can find the mean values of each parameter
estimate over the 1000 runs and the standard errors of these means. These means
can then be compared with the true answer. To compare the coverage, I want
to find how many of the 1000 runs contain the true value in an x% confidence
interval. Ideally x% of the runs will contain the true value in an x% confidence
interval. In particular I will concentrate on the 90% and 95% confidence intervals
as these are the most used confidence levels.
For the Gibbs sampling methods, it is easy to calculate how many of the $I$
iterations in each run are larger than the true value, and so I can then find whether
or not the true value lies in an x% credible interval without actually calculating
all the credible intervals. I will consider three proper priors: the Pareto(1,$c$) prior
for $1/\sigma^2$, the gamma($\epsilon, \epsilon$) prior for $1/\sigma^2$, and the scaled inverse chi-squared prior
for $\sigma^2$ mentioned in the earlier section of this chapter. The estimate obtained under the
gamma($\epsilon, \epsilon$) prior has been used as the parameter $\hat\sigma^2$ for the scaled inverse chi-squared prior. For a
particular method, the same prior will be used for both $\sigma^2_u$ and $\sigma^2_e$.
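As a small illustration of the interval-free coverage check just described, the following sketch (with a hypothetical function name) records whether a true value lies inside the central credible interval of a stored chain using only the proportion of draws above the true value.

```python
import numpy as np

def covered(chain, true_value, level=0.90):
    """Return True if true_value lies inside the central `level` credible
    interval of the MCMC draws in `chain`, judged only from the proportion
    of draws exceeding the true value (no interval is computed)."""
    p_above = np.mean(np.asarray(chain) > true_value)
    tail = (1.0 - level) / 2.0
    return tail < p_above < 1.0 - tail
```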
For IGLS, we can assume normality for the fixed effects parameter and
calculate a normal x% confidence interval. For the variances it is not so clear
how to calculate confidence intervals, so for now I will assume normality and later
I will consider another method of improving on this assumption.
Preliminary analysis for Gibbs sampling
To gauge how long the simulations will take, and consequently how many iterations
I can afford to run for each model, I have performed two preliminary tests. Firstly
I ran each of the 8 study designs with a generated dataset for 50,000 iterations
on a fast machine. This test will give an estimate of how long the simulation
studies will take. The results are in Table 4.2.
Table 4.2: Summary of times for Gibbs sampling in the variance components model with different study designs for 50,000 iterations.

Study   CPU time   Actual time   N
1       59s        64s           108
2       58s        68s           108
3       109s       128s          216
4       106s       107s          216
5       201s       203s          432
6       199s       203s          432
7       392s       398s          864
8       393s       411s          864
Secondly, to calculate how long to run each model I need to consider when
the model has produced estimates to a given accuracy. For this I will consider
the Raftery Lewis diagnostic at two percentiles, the standard 2.5 percentile and
the median. I will use a value for r in the Raftery Lewis notation of 0.01 instead
of 0.005 which is the default. This is because it takes a lot longer to obtain the
same accuracy for the median as the 2.5 percentile.
The results in Table 4.3 are the average values for N from 10 runs of the Gibbs
sampler, each of length 10,000 and with a burn-in of 1,000. The true values used
are $\beta_0 = 30$, $\sigma^2_u = 10$, and $\sigma^2_e = 40$.
Table 4.3: Summary of Raftery Lewis convergence times (thousands of iterations) for various studies.

Study   Prior    beta_0 (2.5/50)   sigma^2_u (2.5/50)   sigma^2_e (2.5/50)
1       Gamma    10/67             8/146                1/12
1       Pareto   20/138            3/87                 1/10
2       Gamma    13/79             8/69                 1/11
2       Pareto   19/168            2/91                 1/10
3       Gamma    6/76              10/34                1/10
3       Pareto   8/93              2/28                 1/10
5       Gamma    5/77              2/28                 1/10
5       Pareto   5/90              2/27                 1/10
7       Gamma    5/73              1/16                 1/10
7       Pareto   5/79              1/15                 1/10
The large N values for the median of $\beta_0$ are due to large serial correlation and
do not decrease much with the size of the study. The large N values for $\sigma^2_u$ are
due to the skewness of the posterior distribution, which decreases as the number of
schools increases. Although it is important to ensure the estimates have converged
to a given accuracy, this has to be balanced against the fact that many simulation
datasets are to be generated for each situation and time is therefore constrained.
It is important to realise that the models under study are in common usage and
are known to be unimodal, and so longer runs are only used to establish accuracy
in the estimates and not convergence of the chain. As I will be running many
simulations and then averaging the results, the accuracy of individual simulations
does not have to be so strict.
The number of iterations in each simulation run will depend on the size of
the study. The smaller studies need longer to converge but take less time per
iteration so can be run for longer. The lengths of simulation runs I chose are in
Table 4.4 below. Even with the modest values for N with the larger sample sizes,
the full simulation took 3 months to run using 3 Sun machines.
Table 4.4: Summary of simulation lengths for Gibbs sampling the variance components model with different study designs.

Study   Length   N
1       50,000   108
2       50,000   108
3       30,000   216
4       30,000   216
5       20,000   432
6       20,000   432
7       10,000   864
8       10,000   864
4.3.3 Results : Bias
In Table 4.5 the estimates of the relative bias for each method, obtained from
the simulations, are given for the eight different study designs. It can be seen
immediately from the table and Figures 4-2(i) and (ii) that the RIGLS method
gives the smallest bias on almost all study designs. The IGLS method generally
underestimates the variances, and in particular the level 2 variance. All the `non-
informative' MCMC methods overestimate the variances, and the Pareto prior
does particularly badly.
It is no great surprise that the level 1 variance is better estimated, and has
smaller percentage biases than the level 2 variance, as there are far more pupils
than schools. This reduction in relative bias is due to the bias of the estimates
being inversely related to the number of observations they are based on. The
study design also has a significant effect on how the methods perform. As the
number of schools is increased, and consequently the number of students is also
increased, all the methods give better estimates. The effect of using a balanced
design as opposed to an unbalanced design is unclear: the IGLS, RIGLS and
Pareto prior methods give better estimates with a balanced design, but the other
MCMC priors generally give worse estimates. The size of the dataset has far
more effect on the estimates than whether the design is balanced.
In Table 4.6 and Figures 4-2(iii) and (iv), study design 7, which is the
design most like the original JSP dataset and one where most methods
perform well, has been considered in more detail. The true values for the variance
Table 4.5: Estimates of relative bias for the variance parameters using different methods and different studies. True level 2/1 variance values are 10 and 40.

Level 2 variance relative bias % (Monte Carlo SE)

Study   IGLS            RIGLS          Gamma(ε,ε)     S.I.χ²         Pareto(1,c)
1       -22.64 (2.05)   -0.97 (2.48)   49.05 (4.05)   50.90 (4.03)   481.3 (10.23)
2       -20.07 (2.01)    0.03 (2.42)   51.41 (3.98)   52.75 (3.96)   449.8 (9.66)
3       -11.86 (1.57)   -0.99 (1.72)   18.36 (2.17)   18.66 (2.16)   74.89 (2.71)
4        -9.75 (1.38)    0.40 (1.51)   20.27 (2.08)   20.47 (2.07)   70.88 (2.60)
5        -2.37 (1.13)    3.10 (1.18)   11.98 (1.30)   12.00 (1.30)   30.77 (1.42)
6        -4.11 (1.12)    1.02 (1.16)    9.70 (1.28)    9.71 (1.28)   26.72 (1.41)
7        -2.14 (0.85)    0.52 (0.86)    4.69 (0.90)    4.70 (0.90)   12.48 (0.95)
8        -2.02 (0.81)    0.53 (0.82)    4.75 (0.86)    4.75 (0.86)   12.04 (0.90)

Level 1 variance relative bias % (Monte Carlo SE)

Study   IGLS            RIGLS           Gamma(ε,ε)     S.I.χ²         Pareto(1,c)
1       -0.42 (0.453)   -0.42 (0.453)   2.79 (0.473)   2.61 (0.471)   3.47 (0.470)
2       -0.45 (0.458)   -0.41 (0.458)   2.78 (0.478)   2.62 (0.476)   3.58 (0.476)
3       -0.02 (0.320)   -0.03 (0.320)   1.63 (0.328)   1.59 (0.328)   2.02 (0.325)
4       -0.16 (0.323)   -0.16 (0.323)   1.43 (0.322)   1.40 (0.321)   1.94 (0.319)
5       -0.31 (0.223)   -0.31 (0.223)   0.28 (0.224)   0.28 (0.224)   0.66 (0.224)
6       -0.15 (0.223)   -0.15 (0.223)   0.42 (0.224)   0.42 (0.224)   0.83 (0.224)
7       -0.04 (0.158)   -0.04 (0.158)   0.25 (0.158)   0.25 (0.158)   0.42 (0.158)
8       -0.09 (0.158)   -0.09 (0.158)   0.19 (0.158)   0.19 (0.158)   0.38 (0.158)
parameters have then been modified. It can be seen that the IGLS method again
underestimates the true values and that the RIGLS method corrects for this.
What is surprising is the effect obtained when the level 2 variance is set
to values much less than the level 1 variance. Here the MCMC methods with
the gamma and S.I.$\chi^2$ priors now underestimate the level 2 variance. The
corresponding level 1 variance is still over-estimated, and so perhaps some of the
level 2 variance is being estimated as level 1 variance. The Pareto prior biases
are not similarly affected by modifying the true values of the variances; in fact
the percentage bias in the estimate of the level 2 variance increases when the true
value of the level 1 variance is increased.
So to summarise, the RIGLS method gives the least biased estimates in all the
scenarios studied. The IGLS method underestimates the variance parameters,
while all the MCMC methods overestimate the variances except when the true
value of $\sigma^2_e$ is much greater than the true value of $\sigma^2_u$. Of the MCMC methods,
[Figure appears here: four panels showing relative % bias for the level 2 variance (panels (i) and (iii)) and the level 1 variance (panels (ii) and (iv)), plotted against simulation design in (i) and (ii) and against parameter settings in (iii) and (iv), with lines for IGLS, RIGLS, Gamma, SI chi-squared and Pareto.]

Figure 4-2: Plots of biases obtained for the various methods against study design and parameter settings.
Table 4.6: Estimates of relative bias for the variance parameters using different methods and different true values. All runs use study design 7.

Level 2 variance relative bias % (Monte Carlo SE)

Level 2/1 variances   IGLS           RIGLS          Gamma(ε,ε)      S.I.χ²          Pareto(1,c)
1/10                  -3.10 (1.08)   0.36 (1.11)      3.15 (1.19)     3.28 (1.20)   15.92 (1.22)
1/40                  -6.54 (2.07)   0.31 (2.14)    -18.48 (2.14)   -20.01 (2.18)   39.57 (2.18)
1/80                  -3.43 (3.02)   7.18 (3.16)    -22.81 (2.54)   -24.60 (2.61)   84.53 (3.11)
10/10                 -1.71 (0.72)   0.53 (0.73)      4.94 (0.76)     4.95 (0.76)   10.56 (0.79)
10/40                 -2.14 (0.85)   0.52 (0.86)      4.69 (0.90)     4.70 (0.90)   12.48 (0.95)
10/80                 -2.77 (1.01)   0.43 (1.03)      3.68 (1.09)     3.69 (1.09)   14.84 (1.13)
40/40                 -1.71 (0.72)   0.53 (0.73)      4.94 (0.76)     4.95 (0.76)   10.56 (0.79)
40/80                 -1.88 (0.76)   0.50 (0.77)      4.90 (0.81)     4.91 (0.81)   11.23 (0.85)

Level 1 variance relative bias % (Monte Carlo SE)

Level 2/1 variances   IGLS           RIGLS          Gamma(ε,ε)    S.I.χ²        Pareto(1,c)
1/10                  -0.04 (0.16)   -0.04 (0.16)   0.43 (0.16)   0.42 (0.16)   0.48 (0.16)
1/40                  -0.08 (0.16)   -0.07 (0.16)   0.73 (0.16)   0.76 (0.16)   0.24 (0.16)
1/80                  -0.18 (0.16)   -0.16 (0.16)   0.64 (0.16)   0.66 (0.16)   0.17 (0.16)
10/10                 -0.15 (0.16)   -0.15 (0.16)   0.19 (0.16)   0.19 (0.16)   0.42 (0.16)
10/40                 -0.04 (0.16)   -0.04 (0.16)   0.25 (0.16)   0.25 (0.16)   0.42 (0.16)
10/80                 -0.04 (0.16)   -0.04 (0.16)   0.33 (0.16)   0.32 (0.16)   0.42 (0.16)
40/40                 -0.15 (0.16)   -0.15 (0.16)   0.19 (0.16)   0.19 (0.16)   0.42 (0.16)
40/80                 -0.05 (0.16)   -0.05 (0.16)   0.21 (0.16)   0.21 (0.16)   0.42 (0.16)
the Pareto prior gives far greater bias than the other priors.
4.3.4 Results : Coverage probabilities and interval widths
The Bayesian MCMC methods are not designed specifically to give unbiased
estimates. In the Bayesian framework, interval estimates and coverage
probabilities are considered more important. The maximum likelihood IGLS and
RIGLS methods are not ideally suited for finding interval estimates and coverage
probabilities, as additional assumptions now have to be made to create intervals.
In the following tables I have made the assumption that all the parameters have
Gaussian distributions. In the case of the level 2 variance parameter this is a
very implausible assumption and I will consider an alternative assumption later
in this chapter.
I will firstly consider the fixed effect parameter $\beta_0$, where it is plausible to
assume there is a Gaussian posterior distribution.
Table 4.7 contains the coverage probabilities for the fixed effect parameter
using the eight different study designs, and Table 4.8 contains the corresponding
interval widths. It can be seen in Table 4.7 that the gamma and S.I.$\chi^2$ priors
perform significantly better than the RIGLS method. All three methods have
actual coverage that is too small, and the IGLS method gives the smallest
coverage. The Pareto method gives actual coverage that is much too high for the
smaller studies but gives better coverage as study size increases. In fact all
methods perform better the larger the study, and generally the coverage is slightly
better when the design is balanced.
Table 4.8 echoes the results in Table 4.7 in that the gamma and S.I.$\chi^2$ intervals
are on average slightly wider than the IGLS and RIGLS intervals, which leads to
better coverage. The Pareto prior gives intervals in the smaller studies that are
on average almost twice as wide as the other methods and consequently gives too
much coverage. As the studies get larger the Pareto intervals get closer in size to
those of the other methods and the method performs better.
In Table 4.9, when only study 7 is considered there are far smaller differences
between the various methods. The Pareto prior does slightly better than all
the other methods, while the other MCMC methods generally improve on
the maximum likelihood based methods. It can be seen here that when the level 2
variance is much smaller than the level 1 variance, the gamma and S.I.$\chi^2$ priors do
worse than the IGLS and RIGLS methods. Table 4.10 shows the corresponding
interval widths, which are very similar for all methods.
Table 4.11 considers the coverage for the level 2 variance parameter $\sigma^2_u$
using the eight different study designs. It can be seen that there is a far greater
discrepancy between the maximum likelihood methods and the MCMC methods
for this parameter, but this is to some extent due to the normality assumption.
RIGLS and IGLS give coverage probabilities that are much smaller than the
nominal levels. The Pareto prior does better than the other priors
when the study size is small, but there is very little to choose between the priors
as the size gets larger. All methods give coverage probabilities that are smaller
than they should be.
Table 4.12 shows that the Pareto prior has average interval widths that are
four times larger than the other priors for studies 1 and 2. The size of the average
Table 4.7: Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the fixed effect parameter using different methods and different studies. True values for the variance parameters are 10 and 40. Approximate MCSEs are 0.28%/0.15% for 90%/95% coverage estimates.

Study   IGLS        RIGLS       Gamma(ε,ε)   S.I.χ²      Pareto(1,c)
1       81.7/86.8   85.0/89.4   85.8/91.1    86.0/91.3   97.8/99.8
2       81.5/88.5   85.3/90.6   86.9/92.7    86.9/93.0   97.7/99.5
3       85.4/90.5   87.1/91.6   88.4/93.6    88.4/93.7   94.0/97.6
4       86.6/91.5   88.2/92.9   89.4/94.4    89.5/94.4   94.2/97.2
5       88.3/93.3   89.0/94.0   89.8/94.2    89.8/94.3   91.4/95.6
6       88.0/93.9   89.0/94.2   89.8/94.9    89.8/94.9   91.3/95.5
7       88.7/93.2   88.8/93.4   89.4/93.5    89.4/93.5   89.8/94.2
8       88.8/94.5   89.4/94.6   90.1/95.3    90.1/95.3   91.2/95.7

Table 4.8: Average 90%/95% interval widths for the fixed effect parameter using different studies. True values for the variance parameters are 10 and 40.

Study   IGLS        RIGLS       Gamma(ε,ε)   S.I.χ²      Pareto(1,c)
1       4.15/4.94   4.57/5.44   5.00/6.39    5.04/6.42   8.74/12.00
2       4.10/4.88   4.48/5.33   4.96/6.34    4.98/6.37   8.15/11.15
3       3.18/3.79   3.33/3.97   3.57/4.37    3.57/4.38   4.28/5.29
4       3.11/3.71   3.25/3.88   3.47/4.26    3.48/4.26   4.09/5.08
5       2.35/2.80   2.40/2.87   2.47/2.99    2.47/2.99   2.62/3.15
6       2.28/2.72   2.31/2.78   2.41/2.92    2.41/2.92   2.53/3.05
7       1.67/1.99   1.69/2.01   1.71/2.05    1.71/2.05   1.76/2.10
8       1.64/1.95   1.65/1.97   1.69/2.02    1.69/2.02   1.73/2.08

Table 4.9: Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the fixed effect parameter using different methods and different true values. All runs use study design 7. Approximate MCSEs are 0.28%/0.15% for 90%/95% coverage estimates.

Level 2/1 variances   IGLS        RIGLS       Gamma(ε,ε)   S.I.χ²      Pareto(1,c)
1/10                  88.4/92.5   88.7/92.9   88.5/93.1    88.5/93.1   89.2/93.9
1/40                  87.9/93.2   88.6/93.5   87.2/92.9    87.1/92.7   90.5/95.3
1/80                  88.3/93.8   88.6/94.0   88.2/93.9    88.1/93.8   91.7/96.1
10/10                 89.0/93.4   89.3/93.5   89.7/94.7    89.7/94.7   90.1/95.1
10/40                 88.7/93.2   88.8/93.4   89.4/93.5    89.4/93.5   89.8/94.2
10/80                 88.5/92.4   88.7/92.8   88.7/93.4    88.7/93.4   89.6/94.0
40/40                 89.0/93.4   89.3/93.5   89.8/94.8    89.7/94.7   90.1/95.1
40/80                 88.8/93.3   89.2/93.5   89.1/94.1    89.1/94.1   89.7/94.7
Table 4.10: Average 90%/95% interval widths for the fixed effect parameter using different true parameter values. All runs use study design 7.

Level 2/1 variances   IGLS        RIGLS       Gamma(ε,ε)   S.I.χ²      Pareto(1,c)
1/10                  0.60/0.71   0.61/0.72   0.61/0.73    0.61/0.73   0.63/0.76
1/40                  0.86/1.02   0.87/1.04   0.84/1.01    0.84/1.01   0.91/1.11
1/80                  1.12/1.33   1.13/1.35   1.10/1.31    1.10/1.31   1.21/1.46
10/10                 1.53/1.83   1.55/1.85   1.60/1.92    1.60/1.92   1.63/1.96
10/40                 1.67/1.99   1.69/2.01   1.71/2.05    1.71/2.05   1.76/2.10
10/80                 1.82/2.17   1.84/2.20   1.85/2.23    1.85/2.23   1.92/2.30
40/40                 3.06/3.65   3.10/3.69   3.19/3.84    3.19/3.84   3.26/3.91
40/80                 3.16/3.76   3.19/3.80   3.26/3.93    3.26/3.93   3.35/4.03

Table 4.11: Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the level 2 variance parameter using different methods and different studies. True values of the variance parameters are 10 and 40. Approximate MCSEs are 0.28%/0.15% for 90%/95% coverage estimates.

Study   IGLS        RIGLS       Gamma(ε,ε)   S.I.χ²      Pareto(1,c)
1       68.3/71.9   75.3/78.5   81.2/88.9    81.9/89.0   84.3/91.5
2       69.5/73.3   77.4/80.4   81.6/88.5    82.0/88.6   84.7/90.7
3       77.0/80.9   81.7/86.2   86.8/92.2    87.3/92.3   88.1/94.2
4       78.0/83.0   83.3/87.1   88.2/93.7    88.2/93.9   88.2/93.0
5       86.8/89.5   88.6/91.2   88.5/94.0    88.5/94.0   88.5/93.8
6       84.4/88.5   87.3/90.2   88.3/93.9    88.3/93.9   87.5/93.5
7       87.4/91.4   88.4/92.4   88.5/93.8    88.5/93.8   88.6/93.0
8       87.1/90.7   87.3/91.1   87.8/93.4    87.8/93.4   87.7/93.2

Table 4.12: Average 90%/95% interval widths for the level 2 variance parameter using different studies. True values of the variance parameters are 10 and 40.

Study   IGLS          RIGLS         Gamma(ε,ε)    S.I.χ²        Pareto(1,c)
1       19.43/23.15   23.73/28.28   41.56/59.45   41.82/59.82   182.21/298.85
2       19.18/22.86   22.99/27.39   40.77/58.11   40.93/58.34   164.52/273.34
3       15.64/18.63   17.12/20.40   22.78/29.50   22.79/29.51   35.21/46.65
4       15.11/18.01   16.39/19.52   21.94/28.44   21.93/28.43   33.10/44.00
5       11.84/14.10   12.35/14.72   14.14/17.50   14.15/17.50   16.89/20.97
6       11.23/13.38   11.69/13.93   13.40/16.57   13.40/16.57   15.86/19.71
7       8.33/9.93     8.52/10.15    9.05/10.95    9.05/10.95    9.77/11.84
8       8.08/9.63     8.25/9.83     8.79/10.63    8.79/10.63    9.46/11.46
Table 4.13: Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the level 2 variance parameter using different methods and different true values. All runs use study design 7. Approximate MCSEs are 0.28%/0.15% for 90%/95% coverage estimates.

Level 2/1 variances   IGLS        RIGLS       Gamma(ε,ε)   S.I.χ²      Pareto(1,c)
1/10                  86.1/90.7   87.0/91.8   85.9/92.6    85.8/92.5   88.4/93.0
1/40                  84.0/88.0   85.3/89.4   76.2/84.6    74.2/83.2   91.4/95.5
1/80                  77.7/78.5   79.1/80.4   76.6/89.5    73.8/85.7   92.2/95.9
10/10                 86.2/91.4   88.1/92.7   88.3/94.3    88.3/94.3   87.4/92.8
10/40                 87.4/91.4   88.4/92.4   88.5/93.8    88.5/93.8   88.6/93.0
10/80                 86.3/90.1   86.9/91.7   86.6/93.5    86.6/93.5   88.7/93.1
40/40                 86.2/91.4   88.1/92.7   88.3/94.3    88.3/94.3   87.4/92.8
40/80                 86.9/92.1   88.6/92.9   88.1/93.8    88.1/93.8   88.2/92.9

Table 4.14: Average 90%/95% interval widths for the level 2 variance parameter using different true parameter values. All runs use study design 7.

Level 2/1 variances   IGLS          RIGLS         Gamma(ε,ε)    S.I.χ²        Pareto(1,c)
1/10                  1.07/1.27     1.09/1.30     1.15/1.40     1.16/1.40     1.26/1.53
1/40                  2.00/2.39     2.08/2.47     1.88/2.28     1.85/2.24     2.48/2.99
1/80                  2.93/3.49     3.07/3.66     2.35/2.95     2.29/2.89     3.89/4.71
10/10                 7.06/8.41     7.21/8.59     7.71/9.34     7.71/9.34     8.23/9.97
10/40                 8.33/9.93     8.52/10.15    9.05/10.95    9.05/10.95    9.77/11.84
10/80                 9.91/11.81    10.13/12.07   10.71/12.98   10.71/12.98   11.69/14.17
40/40                 28.23/33.63   28.85/34.37   30.84/37.36   30.83/37.35   32.91/39.89
40/80                 29.97/35.71   30.63/36.50   32.69/39.55   32.69/39.55   35.04/42.45

Table 4.15: Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the level 1 variance parameter using different methods and different studies. True values of the variance parameters are 10 and 40. Approximate MCSEs are 0.28%/0.15% for 90%/95% coverage estimates.

Study   IGLS        RIGLS       Gamma(ε,ε)   S.I.χ²      Pareto(1,c)
1       87.8/92.5   87.6/92.7   88.0/93.8    88.1/94.0   88.5/94.2
2       87.8/92.4   87.9/92.4   88.7/93.6    88.8/93.6   89.1/93.4
3       88.8/94.4   88.8/94.4   89.1/94.3    89.2/94.5   89.3/94.7
4       89.5/94.8   89.5/94.8   89.3/94.8    89.3/94.8   89.5/95.0
5       89.0/94.2   89.1/94.2   88.7/94.2    88.7/94.2   89.2/93.9
6       89.1/93.7   89.1/93.7   89.3/94.1    89.4/94.1   89.3/94.3
7       89.0/94.9   88.9/94.9   89.4/95.7    89.4/95.7   89.6/95.7
8       88.9/94.4   88.9/94.4   89.6/94.1    89.6/94.1   89.8/94.3
Table 4.16: Average 90%/95% interval widths for the level 1 variance parameter using different studies. True values of the variance parameters are 10 and 40.

Study   IGLS          RIGLS         Gamma(ε,ε)    S.I.χ²        Pareto(1,c)
1       18.29/21.79   18.29/21.80   19.26/23.17   19.18/23.07   19.34/23.14
2       18.32/21.82   18.33/21.84   19.29/23.20   19.21/23.11   19.38/23.17
3       13.01/15.51   13.01/15.51   13.42/16.03   13.40/16.00   13.42/16.05
4       13.01/15.50   13.01/15.50   13.40/16.01   13.39/15.99   13.42/16.05
5       9.18/10.94    9.18/10.94    9.24/11.03    9.24/11.03    9.31/11.11
6       9.20/10.96    9.20/10.96    9.26/11.05    9.26/11.05    9.33/11.14
7       6.51/7.76     6.51/7.76     6.50/7.76     6.50/7.76     6.50/7.77
8       6.51/7.76     6.51/7.76     6.50/7.75     6.50/7.75     6.49/7.76

Table 4.17: Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the level 1 variance parameter using different methods and different true values. All runs use study design 7. Approximate MCSEs are 0.28%/0.15% for 90%/95% coverage estimates.

Level 2/1 variances   IGLS        RIGLS       Gamma(ε,ε)   S.I.χ²      Pareto(1,c)
1/10                  89.1/95.0   89.1/95.0   89.3/95.4    89.3/95.4   89.1/95.1
1/40                  89.4/95.1   89.5/95.0   90.3/95.5    89.9/95.4   89.7/95.6
1/80                  89.6/94.7   89.5/94.8   90.2/95.6    90.3/95.6   89.7/95.5
10/10                 88.6/95.1   88.6/95.1   89.4/95.6    89.4/95.6   89.9/95.7
10/40                 89.0/94.9   88.9/94.9   89.4/95.7    89.4/95.7   89.6/95.7
10/80                 89.1/95.0   89.1/94.9   89.8/96.0    89.8/96.0   89.7/95.8
40/40                 88.6/95.1   88.6/95.1   89.4/95.6    89.4/95.6   89.9/95.7
40/80                 88.9/95.1   88.9/95.1   89.5/95.5    89.5/95.5   89.8/95.7

Table 4.18: Average 90%/95% interval widths for the level 1 variance parameter using different true parameter values. All runs use study design 7.

Level 2/1 variances   IGLS          RIGLS         Gamma(ε,ε)    S.I.χ²        Pareto(1,c)
1/10                  1.63/1.94     1.63/1.94     1.63/1.95     1.63/1.95     1.62/1.94
1/40                  6.48/7.72     6.48/7.72     6.51/7.77     6.51/7.77     6.44/7.71
1/80                  12.89/15.35   12.89/15.36   12.85/15.35   12.84/15.34   12.80/15.32
10/10                 1.63/1.94     1.63/1.94     1.62/1.93     1.62/1.93     1.62/1.94
10/40                 6.51/7.76     6.51/7.76     6.50/7.76     6.50/7.76     6.50/7.77
10/80                 13.02/15.51   13.01/15.51   13.02/15.57   13.02/15.57   12.98/15.54
40/40                 6.51/7.75     6.51/7.75     6.50/7.74     6.50/7.74     6.50/7.76
40/80                 13.02/15.51   13.02/15.52   13.00/15.49   13.00/15.49   13.00/15.52
intervals for the maximum likelihood methods may be artificially small due to
the assumption of normality.
When the values of the variance parameters are modified and the study design
is design 7, it can be seen (Table 4.13) that the Pareto prior still has the best
coverage intervals. The other MCMC methods do better than the maximum
likelihood based methods except when the true value of $\sigma^2_u$ is much smaller than
the true value of $\sigma^2_e$, when the situation is reversed.
This point is emphasised in Table 4.14, where it can be seen that the intervals
for the gamma and S.I.$\chi^2$ priors are narrower than for the other methods when $\sigma^2_u = 1$
and $\sigma^2_e = 40$ or 80. The Pareto prior intervals are slightly wider, which explains
why they cover better.
When comparing the coverage for the level 1 variance $\sigma^2_e$ in Table 4.15, it is
easy to see that there is little to choose between the methods. When the study
is small the MCMC methods do slightly better than the maximum likelihood
methods, and overall the Pareto prior generally gives the best coverage. This
parameter has much more accurate coverage intervals than the other parameters.
In Table 4.16 it can be seen that the MCMC intervals are wider when there
are only 108 pupils, but as the size of the study gets bigger the intervals become
virtually identical. The same behaviour would probably be seen with the level 2
variance if the number of schools was made a lot larger.
Table 4.17 shows the coverage probabilities when we only consider study
design 7. Here there is nothing to choose between any of the methods and all the
coverage probabilities are very good. Table 4.18 confirms this by showing that
the intervals from all the models are virtually identical.
4.3.5 Improving maximum likelihood method interval estimates for $\sigma^2_u$

From the above results it is clear that the Gaussian distribution is a bad
approximation to use when calculating interval estimates for the level 2 variance
parameter $\sigma^2_u$. I have shown in the Gibbs sampling algorithm that the true
conditional distribution for $\sigma^2_u$ is an inverse gamma distribution, and I will now
try to use this fact to construct confidence intervals for $\sigma^2_u$ based on the inverse
gamma distribution.
For each of the 1,000 simulations generated, RIGLS gives an estimate of $\sigma^2_u$,
$\hat\sigma^2_u = \theta$, and a variance for $\hat\sigma^2_u$, $\mathrm{var}(\hat\sigma^2_u) = \phi$. Now if I use the assumption that $\sigma^2_u$
has an inverse gamma($\alpha, \beta$) distribution, then the mean and variance of $\sigma^2_u$ can
be used to calculate the appropriate distribution as follows:

$$\theta = \frac{\beta}{\alpha - 1} \;\rightarrow\; \beta = \theta(\alpha - 1),$$
$$\phi = \frac{\beta^2}{(\alpha-1)^2(\alpha-2)} = \frac{\theta^2(\alpha-1)^2}{(\alpha-1)^2(\alpha-2)} \;\rightarrow\; \alpha = \frac{\theta^2}{\phi} + 2 \quad\text{and}\quad \beta = \frac{\theta^3}{\phi} + \theta.$$

Now having found the required inverse gamma distribution, to construct an
x% confidence interval the quantiles of the distribution need to be found.
Instead of using the inverse gamma distribution directly, the equivalent gamma
distribution for $1/\sigma^2_u$ is used and the quantile points are inverted:

$$x\%\ \mathrm{CI} = \left(\frac{1}{\mathrm{Gam}_{\mathrm{upper}}\big(\tfrac{\theta^2}{\phi}+2,\ \tfrac{\theta^3}{\phi}+\theta\big)},\ \frac{1}{\mathrm{Gam}_{\mathrm{lower}}\big(\tfrac{\theta^2}{\phi}+2,\ \tfrac{\theta^3}{\phi}+\theta\big)}\right),$$

where $\mathrm{Gam}_{\mathrm{lower}}$ and $\mathrm{Gam}_{\mathrm{upper}}$ denote the $\tfrac{100-x}{2}$% and $\tfrac{100+x}{2}$% quantile points of the
gamma distribution with these parameters.
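As a small illustration of this calculation, a sketch in Python using scipy's gamma quantile function might look as follows; the helper name is hypothetical and $\theta$ and $\phi$ are assumed to be the RIGLS estimate and its estimated variance as above.

```python
from scipy.stats import gamma

def inv_gamma_interval(theta, phi, level=0.95):
    """Interval for sigma^2_u from its RIGLS estimate theta and the
    estimated variance phi of that estimate, assuming an inverse gamma shape."""
    a = theta ** 2 / phi + 2.0        # alpha
    b = theta ** 3 / phi + theta      # beta (the rate of the gamma for 1/sigma^2_u)
    lo = (1.0 - level) / 2.0
    hi = 1.0 - lo
    # quantiles of the equivalent gamma distribution for 1/sigma^2_u, then
    # inverted: the upper gamma quantile gives the lower variance endpoint
    lower = 1.0 / gamma.ppf(hi, a, scale=1.0 / b)
    upper = 1.0 / gamma.ppf(lo, a, scale=1.0 / b)
    return lower, upper
```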
The results obtained using the inverse gamma approach for the RIGLS
method can be seen in Table 4.19. In comparing these results with the results
obtained using a Gaussian interval for the level 2 variance (Tables 4.11 - 4.14)
two points emerge.
Firstly, the inverse gamma method generally gives worse coverage at the 90%
level than the Gaussian method but better coverage at the 95% level. This is
probably due to the skewed form of the confidence interval. The exception to
this rule is when the level 2 variance is small, in which case the inverse gamma method
does much worse than the Gaussian method. This may now explain why the
MCMC methods do badly in these cases, as the Gaussian interval for RIGLS was
an unfair comparison, and we now see that the MCMC methods are performing
Table 4.19: Summary of results for the level 2 variance parameter, $\sigma^2_u$, using the RIGLS method and inverse gamma intervals.

Coverage probabilities

Study   90%    95%      Level 2/1 variances   90%    95%
1       68.3   77.3     1/10                  84.8   91.9
2       70.4   79.0     1/40                  71.7   78.6
3       78.1   87.1     1/80                  56.8   65.1
4       77.0   86.3     10/10                 86.3   93.9
5       85.2   91.7     10/40                 86.2   92.7
6       84.3   91.1     10/80                 84.4   92.5
7       86.2   92.7     40/40                 86.3   93.9
8       86.1   92.2     40/80                 86.8   93.7

Average interval widths (90%/95%)

Study   90%      95%       Level 2/1 variances   90%      95%
1       18.004   24.064    1/10                  1.024    1.269
2       17.778   23.644    1/40                  1.589    2.088
3       14.678   18.919    1/80                  1.954    2.645
4       12.913   16.591    10/10                 7.024    8.523
5       11.466   14.293    10/40                 8.206    10.023
6       10.911   13.567    10/80                 9.602    11.839
7       8.206    10.023    40/40                 28.097   34.092
8       7.966    9.716     40/80                 29.731   36.152
better in terms of coverage than the maximum likelihood methods. Secondly,
the inverse gamma intervals are on average much narrower than the Gaussian
intervals.
4.3.6 Summary of results
Although these results do not look good for the Gibbs sampling methods in terms
of bias, particularly for the smaller datasets, the reader must realise that
even the largest JSP dataset studied, with 48 schools, is a small dataset in multi-
level modelling terms. If I were to consider larger datasets there would be less bias
and better agreement between the methods over coverage of intervals. I have also
only considered using the chain mean as a parameter estimate. As an alternative
I could have considered the median or mode of the parameter which, for the
variances, will give smaller estimates and hence less bias.
Overall the MCMC methods appear to have an edge in terms of coverage
probabilities, particularly when the inverse gamma intervals are used for the
IGLS/RIGLS methods.
The main danger is that people who are currently using maximum likelihood
methods may decide to only use the MCMC methods on the smaller datasets as
in computational terms the MCMC methods take much longer than the other
methods. The other danger is that people who are unfamiliar with multi-level
modelling will not realise that due to the structure of the problems studied, it is
not only the number of level 1 units that is important but also the number of level
2 units, particularly when estimating the level 2 variance. Models with 6 or even
12 level 2 units would probably not be considered large enough for multi-level
modelling. To compare the multivariate priors described in the earlier section I
will now consider a second model.
4.4 Random slopes regression model
In order to compare the various priors for a variance matrix, I will now need to
consider a more complicated model. One of the simplest multi-level models that
includes a variance matrix is the random slopes regression model introduced in the
last chapter. The random slopes regression model can be written mathematically
as:

$$y_{ij} = \beta_0 + \beta_1 X_{ij} + u_{0j} + u_{1j}X_{ij} + e_{ij},$$
$$u_j = \begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim \mathrm{MVN}(0, V_u), \qquad e_{ij} \sim N(0, \sigma^2_e),$$

where $i = 1, \ldots, n_j$, $j = 1, \ldots, J$ and $\sum_j n_j = N$.

I will re-express the first line of this model as follows:

$$y_{ij} = \beta_0 X_{0ij} + \beta_1 X_{1ij} + u_{0j}X_{0ij} + u_{1j}X_{1ij} + e_{ij},$$

where $X_{0ij}$ is constant and $X_{1ij} = X_{ij}$ in the previous notation. I will now use $X_{ij}$
to mean the vector $(X_{0ij}, X_{1ij})$ as this change will help simplify the expressions
in the algorithms that follow.
4.4.1 Gibbs sampling algorithm
As with the variance components model the parameters can be split into four
groups: the fixed effects $\beta$, the level 2 residuals $u_j$, the level 2 variance matrix
$V_u$ and the level 1 variance $\sigma^2_e$. The conditional posterior distributions can be
found using similar methodology to that used for the variance components model,
so I will just outline the posteriors without explaining how to obtain them.

Prior distributions

I will assume a uniform prior for the fixed effect parameters $\beta_0$ and $\beta_1$. The
level 1 variance $\sigma^2_e$ will take various univariate priors in the simulations; in the
algorithm I will use a general scaled inverse $\chi^2$ prior with parameters $\nu_e$ and $s^2_e$.
The level 2 variance matrix will similarly take various multivariate priors and so
I will assume a general Wishart prior with parameters $\nu_p$ and $S_p$ for the precision
matrix at level 2. All the priors in the earlier section can then be obtained from
particular values of these parameters. The algorithm is then as follows:
Step 1. $p(\beta \mid y, V_u, \sigma^2_e, u)$

Let $\beta \sim N(\hat\beta, \hat D)$; then to find $\hat\beta$ and $\hat D$:

$$p(\beta \mid y, V_u, \sigma^2_e, u) \propto \prod_{i,j}\Big(\frac{1}{\sigma^2_e}\Big)^{\frac{1}{2}}\exp\Big[-\frac{1}{2\sigma^2_e}(y_{ij} - X_{ij}u_j - X_{ij}\beta)^2\Big]$$
$$\propto \exp\Big[-\frac{1}{2\sigma^2_e}\,\beta^T\Big(\sum_{i,j}X_{ij}^T X_{ij}\Big)\beta + \frac{1}{\sigma^2_e}\,\beta^T\sum_{i,j}X_{ij}^T(y_{ij} - X_{ij}u_j) + \text{const}\Big].$$

Comparing this with the form of a multivariate normal distribution and matching
powers of $\beta$ gives

$$\hat D = \sigma^2_e\Big[\sum_{i,j}X_{ij}^T X_{ij}\Big]^{-1},$$
and
$$\hat\beta = \Big[\sum_{i,j}X_{ij}^T X_{ij}\Big]^{-1}\sum_{i,j}X_{ij}^T(y_{ij} - X_{ij}u_j) = \frac{\hat D}{\sigma^2_e}\sum_{i,j}X_{ij}^T(y_{ij} - X_{ij}u_j).$$
Step 2. $p(u_j \mid y, V_u, \sigma^2_e, \beta)$

Let $u_j \sim N(\hat u_j, \hat D_j)$; then to find $\hat u_j$ and $\hat D_j$:

$$p(u_j \mid y, V_u, \sigma^2_e, \beta) \propto \prod_{i=1}^{n_j}\Big(\frac{1}{\sigma^2_e}\Big)^{\frac{1}{2}}\exp\Big[-\frac{1}{2\sigma^2_e}(y_{ij} - X_{ij}\beta - X_{ij}u_j)^2\Big] \times |V_u|^{-\frac{1}{2}}\exp\Big[-\frac{1}{2}u_j^T V_u^{-1}u_j\Big]$$
$$\propto \exp\Big(-\frac{1}{2}\,u_j^T\Big[\sum_{i=1}^{n_j}\frac{X_{ij}^T X_{ij}}{\sigma^2_e} + V_u^{-1}\Big]u_j + \frac{1}{\sigma^2_e}\,u_j^T\sum_{i=1}^{n_j}X_{ij}^T(y_{ij} - X_{ij}\beta) + \text{const}\Big).$$

Comparing this with the form of a multivariate normal distribution and matching
powers of $u_j$ gives

$$\hat D_j = \Big[\frac{\sum_{i=1}^{n_j}X_{ij}^T X_{ij}}{\sigma^2_e} + V_u^{-1}\Big]^{-1}, \qquad \hat u_j = \frac{\hat D_j}{\sigma^2_e}\sum_{i=1}^{n_j}X_{ij}^T(y_{ij} - X_{ij}\beta).$$
Step 3. $p(V_u \mid y, \beta, u, \sigma^2_e)$

Consider instead $p(V_u^{-1} \mid y, \beta, u, \sigma^2_e)$ and let $V_u^{-1} \sim \mathrm{Wishart}(\hat\nu_u, \hat S_u)$. Then letting
$p(V_u^{-1}) \sim \mathrm{Wishart}(\nu_p, S_p)$ gives

$$p(V_u^{-1} \mid y, \beta, u, \sigma^2_e) \propto \prod_{j=1}^{J}|V_u|^{-\frac{1}{2}}\exp\Big(-\frac{1}{2}u_j^T V_u^{-1}u_j\Big)\,p(V_u^{-1})$$
$$\propto |V_u|^{-\frac{J}{2}}\exp\Big(-\frac{1}{2}\sum_{j=1}^{J}u_j^T V_u^{-1}u_j\Big) \times |V_u^{-1}|^{(\nu_p-3)/2}\exp\Big(-\frac{1}{2}\mathrm{tr}(S_p^{-1}V_u^{-1})\Big)$$
$$\propto |V_u^{-1}|^{(J+\nu_p-3)/2}\exp\Big(-\frac{1}{2}\mathrm{tr}\Big(\Big(\sum_{j=1}^{J}u_j u_j^T + S_p^{-1}\Big)V_u^{-1}\Big)\Big).$$

Then comparing this with the form of a Wishart distribution produces

$$\hat\nu_u = J + \nu_p \quad\text{and}\quad \hat S_u = \Big(\sum_{j=1}^{J}u_j u_j^T + S_p^{-1}\Big)^{-1}.$$

The uniform prior on $V_u$ is equivalent to $\nu_p = -3$, $S_p^{-1} = 0$.
Step 4. $p(\sigma^2_e \mid y, \beta, u, V_u)$

Consider instead $p(1/\sigma^2_e \mid y, \beta, u, V_u)$ and let $1/\sigma^2_e \sim \mathrm{gamma}(\hat a_e, \hat b_e)$. Then
$p(1/\sigma^2_e) = (1/\sigma^2_e)^{-2}p(\sigma^2_e)$ and so

$$p(1/\sigma^2_e \mid y, \beta, u, V_u) \propto \prod_{i,j}\Big(\frac{1}{\sigma^2_e}\Big)^{\frac{1}{2}}\exp\Big[-\frac{1}{2\sigma^2_e}(y_{ij} - X_{ij}\beta - X_{ij}u_j)^2\Big]\Big(\frac{1}{\sigma^2_e}\Big)^{-2}p(\sigma^2_e)$$
$$\propto \Big(\frac{1}{\sigma^2_e}\Big)^{\frac{N}{2}+\frac{\nu_e}{2}-1}\exp\Big[-\frac{1}{2\sigma^2_e}\Big(\sum_{i,j}(y_{ij} - X_{ij}\beta - X_{ij}u_j)^2 + \nu_e s_e^2\Big)\Big].$$

Then comparing this with the form of a gamma distribution produces

$$\hat a_e = \frac{N+\nu_e}{2} \quad\text{and}\quad \hat b_e = \frac{1}{2}\Big(\nu_e s_e^2 + \sum_{i,j}e_{ij}^2\Big).$$

A uniform prior on $\sigma^2_e$, or the equivalent Pareto prior, is equivalent to $\nu_e = -2$, $s^2_e = 0$. A uniform prior on $\log \sigma^2_e$ is equivalent to $\nu_e = 0$, $s^2_e = 0$, and a
gamma($\epsilon, \epsilon$) prior for $1/\sigma^2_e$ is equivalent to $\nu_e = 2\epsilon$, $s^2_e = 1$.
Having found the four sets of conditional distributions, it is now simple
enough to program up the algorithm and compare via simulation the various
prior distributions.
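The new element relative to the variance components sampler is the Wishart draw in Step 3. A minimal sketch of that single update in Python with scipy is given below; it assumes a proper Wishart prior (so that $S_p$ is invertible) and the function name is illustrative rather than anything used in the thesis.

```python
import numpy as np
from scipy.stats import wishart

def draw_Vu(u, nu_p, Sp):
    """One draw from the Step 3 full conditional of the level 2 variance
    matrix V_u, given the J x 2 matrix of current level 2 residuals u and
    a proper Wishart(nu_p, Sp) prior for the precision matrix."""
    J = u.shape[0]
    nu_hat = J + nu_p
    S_hat = np.linalg.inv(u.T @ u + np.linalg.inv(Sp))  # sum of u_j u_j^T plus S_p^{-1}
    precision = wishart.rvs(df=nu_hat, scale=S_hat)     # draw V_u^{-1}
    return np.linalg.inv(precision)                     # invert to return V_u
```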
4.4.2 Simulation method
Following on from the variance components model simulation I now want to
extend the comparisons by considering the random slopes regression model.
The random slopes regression model has more parameters than the variance
components model, so I could quite easily study even more designs than the
15 studied using the variance components model.
Due to time constraints and to avoid too much repetition, only two main areas
of interest will be considered. I will firstly again consider the study design of the
model, looking at two sizes of study design and whether the design is balanced
or unbalanced. These will correspond to designs 3, 4, 7 and 8 in Table 4.1.
The second area of interest is due to the new model design. When looking at
the variance components model I considered the effect of varying the true values
of the two variance parameters. Now that there is a variance matrix at level 2, I
will consider the effect of varying the correlation between the two parameters at
level 2. I will consider five different scenarios: firstly when the two variables are
uncorrelated, which I will consider using all four study designs. The other four
scenarios will have both large and small correlations that are positive and then
negative. These correlations will only be considered using study design 7, which
is similar to the actual JSP dataset.
This will give a total of 9 designs. The true values for all the parameters apart
from the level 2 covariance term will be the same for each design, as follows:
$\beta_0 = 30.0$, $\beta_1 = 0.5$, $\Omega_{u00} = 5.0$, $\Omega_{u11} = 0.5$, and $\sigma^2_e = 30.0$. The level 2
covariance $\Omega_{u01}$ will be set to a value $c$ to give the required correlation. For each
set of parameter values I will generate 1000 simulated datasets and fit the random
slopes regression model to each dataset using each method.
Creating the simulation datasets
Creating the simulation datasets is also easy for the random slopes regression
model. The only data that need to be generated are the values of the response
variable for the $N$ pupils. The second variable $X_{ij}$ will be fixed throughout the
dataset. Considering the case of 864 pupils within 48 schools, the procedure is as
follows:
1. Generate 48 $u_{0j}$s and $u_{1j}$s, one pair for each school, by drawing from a
multivariate normal distribution with mean 0 and variance matrix $\Omega_u$.
2. Generate 864 $e_{ij}$s, one for each pupil, by drawing from a normal distribution
with mean 0 and variance $\sigma^2_e$.
3. Evaluate $Y_{ij} = \beta_0 + \beta_1 X_{ij} + u_{0j} + u_{1j}X_{ij} + e_{ij}$ for all 864 pupils.
This will generate one simulation dataset for the current parameter values.
This dataset is then fitted using each method, and the whole procedure is repeated
1000 times. The datasets will be generated using a short C program.
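As with the variance components model, a minimal Python sketch of this recipe is given below purely for illustration (the actual datasets used a short C program). The function name is hypothetical, the defaults are the true values listed above, and the predictor values x are assumed to be supplied since they are held fixed across datasets.

```python
import numpy as np

def simulate_rsr(nj, x, beta=(30.0, 0.5),
                 Omega_u=((5.0, 0.0), (0.0, 0.5)), s2e=30.0, rng=None):
    """Generate one random slopes regression dataset.
    nj: school sizes; x: fixed predictor value for each pupil (length sum(nj));
    Omega_u: level 2 variance matrix, with the off-diagonal set to c as required."""
    rng = rng or np.random.default_rng()
    school = np.repeat(np.arange(len(nj)), nj)
    # one (u_0j, u_1j) pair per school from MVN(0, Omega_u)
    u = rng.multivariate_normal(np.zeros(2), np.asarray(Omega_u), size=len(nj))
    e = rng.normal(0.0, np.sqrt(s2e), size=len(school))
    y = beta[0] + beta[1] * x + u[school, 0] + u[school, 1] * x + e
    return y, school
```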
Comparison of methods
The priors to be considered for the level 2 variance are the uniform prior on the $\Omega_u$
scale, and the two Wishart priors for the level 2 precision described in the earlier
section. The uniform prior method for $\Omega_u$ will be run using MLwiN and the other
two priors using the BUGS package. The maximum likelihood IGLS and RIGLS
methods will also be run using MLwiN. Again the main reason for running the
bulk of the Gibbs sampling runs using BUGS is computing resources. BUGS
however cannot fit the uniform prior as it is improper, and so this will be carried
out using MLwiN. The estimates from RIGLS will be used as prior estimates for
the `data driven' Wishart prior.
The lengths of the burn-in and main runs of these simulations will be the
same as for the equivalent study designs fitting the variance components model.
The bias and coverage probabilities will be worked out in the same way as for the
variance components model. The level 1 variance is not of great interest here,
and so I have used the gamma($\epsilon, \epsilon$) prior when using BUGS and the uniform prior
when using MLwiN, as these are the defaults.
Preliminary analysis for IGLS and RIGLS
With the added complexity of the random slopes regression model, the IGLS
and RIGLS methods occasionally, for certain datasets, have problems fitting the
model. These problems can be of two types. Firstly, the method may at some
iteration generate an estimate for a variance matrix that is not positive definite.
Secondly, the method may not converge before the maximum number of iterations
has been reached. Generally what is happening in the second case is that the
method is cycling between several estimates, and so increasing the maximum
number of iterations will not help (see Figure 4-3).
Figure 4-3: Trajectories plot of IGLS estimates for a run of the random slopes regression model where convergence is not achieved.
The MLn command CONV (Rasbash and Woodhouse 1995) will return
whether the estimation procedure has converged, and if not, which of the above
two reasons is the problem. As the maximum likelihood methods are fast to run,
I have run the random slopes regression model using several simulation studies to
identify how well the IGLS and RIGLS methods perform in different scenarios.
The results are given in Table 4.20. The studies marked with a star will be used
in the main analysis. Table 4.20 shows that several factors influence how well the
maximum likelihood methods perform.
Firstly it should be noted that as study size gets bigger and consequently
the number of level 2 units increases the number of datasets that the maximum
likelihood methods fail on is minimal. The number of datasets where the methods
have problems increases when the size of study is decreased, and dramatically
increases when the design is unbalanced. Also the correlation between the 2
variables at level 2 is important, if the two variables are highly correlated, either
85
Table 4.20: Summary of the convergence for the random slopes regression with the maximum likelihood based methods (IGLS/RIGLS). The study design is given in terms of the number of level 2 units and whether the study is balanced (B) or unbalanced (U).

    Study     Σ_u01   Con        NCon      Not Posdef
    3 (12U)   -1.4    623/574    356/349   21/77
    3 (12U)   -0.5    902/857    93/124    5/19
  * 3 (12U)    0.0    927/877    71/116    2/7
    3 (12U)    0.5    906/871    91/118    3/11
    3 (12U)    1.4    621/558    367/366   12/76
    4 (12B)   -1.4    914/903    83/74     3/23
    4 (12B)   -0.5    986/985    13/14     1/1
  * 4 (12B)    0.0    991/990    9/9       0/1
    4 (12B)    0.5    994/991    6/7       0/2
    4 (12B)    1.4    912/903    85/72     3/25
  * 7 (48U)   -1.4    986/984    13/14     1/2
  * 7 (48U)   -0.5    998/998    2/2       0/0
  * 7 (48U)    0.0    1000/1000  0/0       0/0
  * 7 (48U)    0.5    1000/1000  0/0       0/0
  * 7 (48U)    1.4    984/983    16/15     0/2
    8 (48B)   -1.4    994/992    6/6       0/2
    8 (48B)   -0.5    999/999    1/1       0/0
  * 8 (48B)    0.0    1000/1000  0/0       0/0
    8 (48B)    0.5    1000/1000  0/0       0/0
    8 (48B)    1.4    992/992    8/8       0/0
positively or negatively, the number of problem datasets increases.
Most of the studies chosen for further investigation with the Gibbs sampling
methods do not have many problem datasets. The study 3 scenario with Σ_u01 = 0
is the worst, with only 877 good datasets. For the further analysis I will simply
discard any problem datasets and analyse the remaining datasets using all the
methods.
One problem that is not captured by the MLn CONV command is when
the final converged estimate is not positive definite. These situations will be
included in the converged category in the above table. This has a knock-on
effect when I consider using the RIGLS estimate as a parameter in the Wishart
prior distribution. Consequently, if the level 2 variance matrix estimate has a
correlation outside [-1, 1], I will reduce the estimate of the covariance, Σ_u01, so
that the correlation becomes ±0.95 before using it as a prior parameter.
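A minimal sketch of this adjustment (function name mine) is:

```python
import numpy as np

def clamp_covariance(sigma_u, max_corr=0.95):
    """If a 2x2 level 2 variance matrix estimate implies a correlation outside
    [-1, 1], shrink the covariance term so the correlation becomes +/-0.95
    before the matrix is used as a Wishart prior parameter."""
    s00, s01, s11 = sigma_u[0, 0], sigma_u[0, 1], sigma_u[1, 1]
    corr = s01 / np.sqrt(s00 * s11)
    if abs(corr) > 1.0:
        s01 = np.sign(corr) * max_corr * np.sqrt(s00 * s11)
    return np.array([[s00, s01], [s01, s11]])
```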
4.4.3 Results
The results for the 8 simulation designs comparing the 2 maximum likelihood
methods and the three MCMC methods can be seen in Tables 4.21 to 4.28. The
columns labelled Wish 1 prior are the results for the Wishart(I, 2) prior for the
precision matrix, Σ_u^{-1}. The columns labelled Wish 2 prior are the results for the
Wishart(Σ̂_u, 4) prior for the precision matrix, Σ_u^{-1}, where Σ̂_u is the RIGLS estimate.
The results for the unbalanced design with 48 schools and uncorrelated
parameters at level 2 are in Table 4.21. From this table it can be seen that the
results for the two maximum likelihood methods and the uniform prior method
are similar to the results already seen for the variance components model. The
IGLS method tends to underestimate the variance parameters at level 2 and the
RIGLS method corrects for this giving the least biased estimates. The uniform
prior on the other hand tends to overestimate the level 2 variance parameters.
In terms of coverage probabilities there is little to choose between the RIGLS
method and the uniform prior. This is probably partly due to the uniform prior
method giving larger intervals.
The two other MCMC methods give interesting results. The first Wishart
prior method, which has as its parameter the identity matrix, uses this
parameter as a prior guess for the level 2 variance matrix. This is clearly shown
by the estimate of Σ_u00, which has a true value of 5 (greater than 1), being an
underestimate, and the estimate of Σ_u11, which has a true value of 0.5 (less than 1),
being an overestimate. This in turn affects the coverage intervals, as the estimates
for Σ_u00 give worse coverage than RIGLS while the estimates for Σ_u11 give better
coverage.
The second Wishart prior method, based on using the RIGLS estimate of Σ_u
as a parameter in the prior, appears to underestimate all the parameters in the
variance matrix. This in turn leads to smaller average interval widths and in this
case worse coverage than the RIGLS method for virtually all parameters.
Tables 4.22 to 4.25 contain the results when the level 2 covariance parameter,
Σ_u01, is given different true values that give matrices with high and low, positive
Table 4.21: Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values Σ_u00 = 5, Σ_u01 = 0 and Σ_u11 = 0.5. All 1000 runs.

Param. (True)  IGLS           RIGLS          Wish 1 prior   Wish 2 prior   Uniform prior

Relative % bias in estimates (Monte Carlo SE); values in [ ] are actual biases as the true value is 0
β_0 (30.0)     0.03 (0.04)    0.03 (0.04)    -0.01 (0.04)   0.00 (0.04)    0.03 (0.04)
β_1 (0.5)      0.64 (0.70)    0.64 (0.70)    2.51 (0.70)    1.10 (0.70)    0.74 (0.70)
Σ_u00 (5.0)    -2.88 (0.92)   0.24 (0.94)    -3.00 (0.98)   -7.64 (0.97)   22.42 (1.08)
Σ_u01 (0.0)    [-0.01 (0.01)] [-0.01 (0.01)] [-0.01 (0.01)] [-0.01 (0.01)] [-0.02 (0.01)]
Σ_u11 (0.5)    -3.46 (0.72)   -1.08 (0.74)   5.12 (0.75)    -3.39 (0.74)   15.76 (0.85)
σ²_e (30.0)    0.03 (0.16)    0.03 (0.16)    0.68 (0.16)    0.90 (0.17)    0.53 (0.16)

Coverage probabilities (90%/95%); approximate MCSE (0.28%/0.15%)
β_0            89.7/95.1      90.1/95.3      89.2/95.0      89.1/94.4      92.5/96.8
β_1            88.0/93.5      88.3/93.8      88.8/93.8      86.5/92.5      90.2/95.3
Σ_u00          87.2/91.5      88.2/92.3      86.8/92.7      83.8/89.5      86.8/93.2
Σ_u01          90.1/96.3      90.1/96.3      88.3/94.9      85.5/91.5      90.2/95.7
Σ_u11          86.2/90.7      87.5/92.3      91.1/95.3      86.9/92.9      87.7/93.2
σ²_e           89.4/95.2      89.3/95.1      88.7/94.2      89.4/94.8      88.7/94.6

Average interval widths (90%/95%)
β_0            1.275/1.519    1.289/1.536    1.269/1.536    1.250/1.516    1.384/1.660
β_1            0.353/0.420    0.356/0.425    0.359/0.427    0.337/0.404    0.381/0.456
Σ_u00          4.812/5.734    4.923/5.865    4.973/6.042    4.599/5.584    6.158/7.484
Σ_u01          0.946/1.126    0.967/1.152    0.994/1.216    0.922/1.129    1.212/1.491
Σ_u11          0.374/0.445    0.382/0.455    0.406/0.492    0.373/0.452    0.475/0.577
σ²_e           5.027/5.989    5.027/5.989    5.072/6.008    5.126/6.099    5.061/6.034
and negative correlation. The parameter percentage biases are plotted in
Figures 4-4 and 4-5 against the value of Σ_u01. The immediate thing to notice is
that changes to the covariance parameter value have, for most methods and most
parameters, little overall effect in terms of bias, and the results in Tables 4.22 to
4.25 are similar to those obtained in Table 4.21.
The IGLS and RIGLS methods give similar results as before, with the RIGLS
method giving approximately unbiased estimates. This shows that removing the
datasets that did not converge does not appear to have had any noticeable effect
on the bias of the estimates. The uniform prior method gives approximately the
same percentage bias for the variance parameters, and the covariance estimates
appear (Figure 4-5 (ii)) to have percentage biases that are positively correlated
with Σ_u01.
The first Wishart prior method does not exhibit the shrinkage towards 1
property for parameter Σ_u00 when the correlation is large (in magnitude). It
is also noticeable (Figure 4-4 (i)) that the bias of the parameter β_0 using this
method is proportional to Σ_u01.
The second Wishart prior method still underestimates the variance parameters
at level 2, although the bias is reduced as the correlation is increased (in
magnitude), but it is approximately unbiased for the covariance term. It also gives
the largest bias for the level 1 variance parameter for all values of Σ_u01.
Considering the coverage properties of the five methods we find differences
from the variance components model. The MCMC methods now no longer give
the best results for all parameters. In particular the second Wishart prior method
has the smallest intervals for most parameters and consequently performs poorly
in terms of coverage. The first Wishart prior performs well and gives reasonable
coverage for all parameters.
Although the uniform prior gives estimates that are highly biased it should
not be disregarded, as it has the best coverage properties for the level 2 variance
parameters. It does not perform as well when the correlation is increased
(in magnitude). The RIGLS method also gives reasonable coverage for most
parameters and performs better than for the variance components model.
When the study design is changed (Tables 4.26 to 4.28) the effects are similar
to those observed for the variance components model. From Figures 4-6 and 4-7
it can be seen that reducing the number of schools in the study increases the
Table 4.22: Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values Σ_u00 = 5, Σ_u01 = 1.4 and Σ_u11 = 0.5. Only 982 runs.

Param. (True)  IGLS           RIGLS          Wish 1 prior   Wish 2 prior   Uniform prior

Relative % bias in estimates (Monte Carlo SE)
β_0 (30.0)     0.02 (0.04)    0.03 (0.04)    0.07 (0.04)    0.03 (0.04)    0.03 (0.04)
β_1 (0.5)      0.68 (0.69)    0.67 (0.69)    2.22 (0.69)    0.99 (0.69)    0.76 (0.70)
Σ_u00 (5.0)    -2.72 (0.92)   0.40 (0.92)    1.64 (0.92)    -6.43 (0.90)   22.86 (1.04)
Σ_u01 (1.4)    -2.07 (0.79)   0.07 (0.79)    -5.31 (0.81)   -0.43 (0.80)   14.64 (0.93)
Σ_u11 (0.5)    -3.26 (0.71)   -0.94 (0.71)   7.96 (0.72)    -2.63 (0.71)   15.76 (0.82)
σ²_e (30.0)    0.03 (0.16)    0.02 (0.16)    -0.08 (0.16)   1.27 (0.16)    0.32 (0.16)

Coverage probabilities (90%/95%); approximate MCSE (0.29%/0.15%)
β_0            89.5/95.2      89.7/95.5      89.4/95.6      87.9/93.3      92.1/96.8
β_1            88.2/94.3      88.7/94.5      89.2/94.9      86.0/92.5      90.7/96.2
Σ_u00          86.6/91.0      88.3/92.5      89.3/94.1      82.8/89.2      87.5/93.5
Σ_u01          87.6/92.8      88.9/93.7      88.3/93.8      88.9/93.8      89.8/94.3
Σ_u11          87.0/90.6      88.1/92.5      92.9/97.0      87.3/93.4      89.4/94.8
σ²_e           88.6/95.2      88.8/95.3      89.1/95.2      89.5/94.4      89.2/95.7

Average interval widths (90%/95%)
β_0            1.247/1.486    1.262/1.503    1.259/1.517    1.171/1.398    1.363/1.635
β_1            0.351/0.418    0.354/0.422    0.358/0.431    0.328/0.396    0.379/0.455
Σ_u00          4.631/5.517    4.742/5.650    4.865/5.896    4.146/5.038    5.988/7.276
Σ_u01          1.135/1.353    1.162/1.384    1.195/1.443    1.116/1.352    1.445/1.758
Σ_u11          0.368/0.439    0.377/0.449    0.417/0.506    0.370/0.448    0.469/0.570
σ²_e           4.998/5.955    5.003/5.962    4.969/5.887    5.048/6.014    5.013/5.977
Table 4.23: Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values Σ_u00 = 5, Σ_u01 = -1.4 and Σ_u11 = 0.5. Only 984 runs.

Param. (True)  IGLS           RIGLS          Wish 1 prior   Wish 2 prior   Uniform prior

Relative % bias in estimates (Monte Carlo SE)
β_0 (30.0)     0.02 (0.04)    0.02 (0.04)    -0.08 (0.04)   -0.01 (0.04)   0.03 (0.04)
β_1 (0.5)      -0.01 (0.69)   -0.01 (0.69)   2.27 (0.69)    0.11 (0.69)    -0.01 (0.69)
Σ_u00 (5.0)    -2.28 (0.92)   0.70 (0.94)    2.44 (0.94)    -5.65 (0.90)   23.90 (1.06)
Σ_u01 (-1.4)   1.43 (0.79)    -0.79 (0.79)   4.68 (0.82)    -0.23 (0.82)   -15.86 (0.93)
Σ_u11 (0.5)    -1.74 (0.72)   0.68 (0.73)    8.97 (0.74)    -1.23 (0.75)   17.65 (0.84)
σ²_e (30.0)    0.00 (0.16)    -0.01 (0.16)   -0.10 (0.16)   1.17 (0.16)    0.29 (0.16)

Coverage probabilities (90%/95%); approximate MCSE (0.29%/0.15%)
β_0            89.5/95.1      90.4/95.1      90.4/95.5      89.7/94.6      92.9/96.3
β_1            89.4/94.4      89.9/94.8      90.8/95.0      88.4/94.1      92.2/96.1
Σ_u00          87.6/90.7      88.7/92.4      89.6/94.9      84.2/90.0      87.2/92.9
Σ_u01          88.4/92.9      89.6/94.1      88.3/94.3      88.8/93.7      90.0/95.3
Σ_u11          88.3/92.5      89.5/93.6      92.0/96.7      87.9/93.8      87.9/93.6
σ²_e           89.2/95.0      89.1/95.0      89.1/95.0      89.0/94.6      89.2/94.9

Average interval widths (90%/95%)
β_0            1.260/1.501    1.275/1.519    1.292/1.544    1.239/1.491    1.376/1.651
β_1            0.354/0.422    0.358/0.426    0.364/0.431    0.346/0.415    0.382/0.458
Σ_u00          4.701/5.601    4.811/5.733    5.036/6.126    4.311/5.241    6.094/7.407
Σ_u01          1.157/1.379    1.183/1.409    1.209/1.476    1.142/1.396    1.478/1.797
Σ_u11          0.376/0.448    0.384/0.458    0.421/0.513    0.374/0.456    0.479/0.582
σ²_e           5.005/5.964    5.005/5.964    4.964/5.866    5.039/5.999    5.013/5.972
Table 4.24: Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values Σ_u00 = 5, Σ_u01 = 0.5 and Σ_u11 = 0.5. All 1000 runs.

Param. (True)  IGLS           RIGLS          Wish 1 prior   Wish 2 prior   Uniform prior

Relative % bias in estimates (Monte Carlo SE)
β_0 (30.0)     0.02 (0.04)    0.02 (0.04)    0.02 (0.04)    0.01 (0.04)    0.03 (0.04)
β_1 (0.5)      0.72 (0.70)    0.72 (0.70)    2.44 (0.70)    1.08 (0.70)    0.83 (0.70)
Σ_u00 (5.0)    -3.00 (0.92)   0.08 (0.94)    -3.12 (0.97)   -8.08 (0.96)   22.10 (1.08)
Σ_u01 (0.5)    -3.45 (1.89)   -1.45 (1.93)   -2.42 (1.92)   -0.80 (1.93)   12.43 (2.22)
Σ_u11 (0.5)    -3.79 (0.72)   -1.42 (0.73)   5.07 (0.74)    -3.61 (0.73)   15.35 (0.98)
σ²_e (30.0)    0.04 (0.16)    0.04 (0.16)    0.67 (0.16)    0.97 (0.17)    0.54 (0.16)

Coverage probabilities (90%/95%); approximate MCSE (0.28%/0.15%)
β_0            89.7/94.9      90.2/95.5      89.1/95.1      89.1/94.0      92.5/96.8
β_1            88.2/93.0      88.3/93.4      88.3/93.8      86.9/92.3      90.5/95.3
Σ_u00          87.3/91.3      88.9/92.3      86.4/92.6      82.9/89.5      87.3/93.3
Σ_u01          90.0/95.5      90.9/95.7      90.9/95.4      85.9/93.0      91.0/95.3
Σ_u11          86.3/91.3      88.6/91.7      91.8/95.7      87.7/92.9      89.4/94.2
σ²_e           89.0/95.1      89.0/95.1      88.6/94.2      89.6/94.8      89.1/94.6

Average interval widths (90%/95%)
β_0            1.270/1.514    1.285/1.531    1.255/1.524    1.230/1.487    1.380/1.654
β_1            0.352/0.419    0.356/0.424    0.357/0.428    0.334/0.401    0.380/0.456
Σ_u00          4.787/5.704    4.897/5.834    4.927/5.982    4.543/5.517    6.122/7.441
Σ_u01          0.967/1.152    0.989/1.178    1.020/1.241    0.947/1.154    1.236/1.517
Σ_u11          0.372/0.443    0.380/0.453    0.406/0.492    0.371/0.450    0.472/0.574
σ²_e           5.026/5.988    5.026/5.988    5.066/6.007    5.125/6.103    5.061/6.033
Table 4.25: Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values Σ_u00 = 5, Σ_u01 = -0.5 and Σ_u11 = 0.5. Only 998 runs.

Param. (True)  IGLS           RIGLS          Wish 1 prior   Wish 2 prior   Uniform prior

Relative % bias in estimates (Monte Carlo SE)
β_0 (30.0)     0.03 (0.04)    0.03 (0.04)    -0.04 (0.04)   -0.01 (0.04)   0.03 (0.04)
β_1 (0.5)      0.49 (0.69)    0.49 (0.69)    2.52 (0.69)    1.06 (0.69)    0.57 (0.70)
Σ_u00 (5.0)    -2.62 (0.92)   0.50 (0.94)    -2.60 (0.97)   -7.53 (0.97)   22.78 (1.08)
Σ_u01 (-0.5)   0.27 (1.89)    -2.00 (1.93)   -0.72 (1.94)   -1.96 (1.96)   -18.12 (2.22)
Σ_u11 (0.5)    -3.00 (0.72)   -0.61 (0.73)   5.68 (0.74)    -2.93 (0.73)   16.30 (0.98)
σ²_e (30.0)    0.01 (0.16)    0.01 (0.16)    0.65 (0.16)    0.93 (0.17)    0.50 (0.16)

Coverage probabilities (90%/95%); approximate MCSE (0.28%/0.15%)
β_0            89.7/95.1      89.8/95.3      89.3/95.1      88.6/94.5      92.2/96.6
β_1            88.4/93.7      89.3/94.0      88.9/94.1      87.7/93.3      90.8/95.5
Σ_u00          87.8/91.6      88.4/92.5      86.8/93.0      84.0/90.1      87.1/93.4
Σ_u01          89.8/94.9      90.1/95.1      88.9/94.3      85.7/91.9      89.3/94.9
Σ_u11          87.0/91.3      88.1/92.6      90.6/95.2      87.3/92.7      87.5/93.3
σ²_e           89.6/95.1      89.6/95.1      88.9/94.3      89.6/94.6      89.0/94.8

Average interval widths (90%/95%)
β_0            1.276/1.521    1.291/1.538    1.282/1.542    1.256/1.524    1.385/1.660
β_1            0.353/0.421    0.357/0.426    0.361/0.427    0.341/0.408    0.381/0.457
Σ_u00          4.817/5.740    4.929/5.872    4.997/6.072    4.606/5.594    6.176/7.506
Σ_u01          0.979/1.167    1.001/1.193    1.025/1.254    0.955/1.170    1.253/1.538
Σ_u11          0.375/0.447    0.383/0.453    0.409/0.497    0.375/0.454    0.477/0.579
σ²_e           5.025/5.988    5.025/5.988    5.069/5.999    5.130/6.096    5.061/6.030
Figure 4-4: Plots of biases obtained for the various methods fitting the random slopes regression model against the value of Σ_u01 (fixed effects parameters and level 1 variance parameter). Panel (i): β_0; panel (ii): β_1; panel (iii): σ²_e. Each panel plots estimated % bias against the true Σ_u01 for IGLS, RIGLS, Wishart Prior 1, Wishart Prior 2 and the Uniform Prior.
Figure 4-5: Plots of biases obtained for the various methods fitting the random slopes regression model against the value of Σ_u01 (level 2 variance parameters). Panel (i): Σ_u00; panel (ii): Σ_u01; panel (iii): Σ_u11. Each panel plots estimated % bias against the true Σ_u01 for IGLS, RIGLS, Wishart Prior 1, Wishart Prior 2 and the Uniform Prior.
percentage bias of all the methods and in particular the uniform prior method.
It is also noticeable that σ²_e is biased high for the two Wishart prior methods,
which may explain in part why the level 2 variance parameters are biased low.
The coverage properties are changed slightly when the number of schools is
reduced to 12. The uniform prior method now has very poor coverage properties
as its intervals are too wide and it gives far higher actual percentage coverage for
nominal 90% and 95% intervals for the fixed effects. For the variance parameters
the uniform prior gives highly biased estimates that lead to intervals that have
lower actual percentage coverage than required. The second Wishart prior
method is once again performing poorly and the best coverage is either from
the RIGLS or the first Wishart prior method depending on the parameter.
4.5 Conclusions
Simulation studies comparing various maximum likelihood and empirical Bayes
methods have been performed in the past (see for example Kreft, de Leeuw, and
van der Leeden (1994)). There has however been very little comparison work
between fully Bayesian MCMC methods and maximum likelihood methods. This
may be due to the fact that the MCMC methods take a lot longer to perform
than the maximum likelihood methods. For example the two sets of simulations
in this chapter together took over 6 months to perform and this was using several
machines simultaneously.
Although these simulations do highlight some interesting points it would be
useful to run more simulations particularly to compare the priors for variance
matrices. The results obtained in these simulations will now be summarised and
then I will talk about how this has influenced the default prior settings in the
MLwiN package.
4.5.1 Simulation results
All the simulations performed in this chapter have been compared in terms of
both bias and coverage properties. Two maximum likelihood methods have been
included in the simulations for completeness but the IGLS method almost always
performs worse than the RIGLS method and so can probably be disregarded.
Table 4.26: Summary of results for the random slopes regression with the 48 schools balanced design with parameter values Σ_u00 = 5, Σ_u01 = 0.0 and Σ_u11 = 0.5. All 1000 runs.

Param. (True)  IGLS           RIGLS          Wish 1 prior   Wish 2 prior   Uniform prior

Relative % bias in estimates (Monte Carlo SE); values in [ ] are actual biases as the true value is 0
β_0 (30.0)     -0.05 (0.04)   -0.05 (0.04)   -0.08 (0.04)   -0.08 (0.04)   -0.05 (0.04)
β_1 (0.5)      -0.03 (0.68)   -0.03 (0.68)   2.13 (0.68)    0.78 (0.68)    0.07 (0.68)
Σ_u00 (5.0)    -4.66 (0.90)   -1.74 (0.92)   -4.91 (0.95)   -9.32 (0.94)   18.84 (1.04)
Σ_u01 (0.0)    [0.00 (0.01)]  [0.00 (0.01)]  [0.00 (0.01)]  [0.00 (0.01)]  [0.00 (0.01)]
Σ_u11 (0.5)    -1.59 (0.75)   0.80 (0.77)    7.02 (0.77)    -1.30 (0.77)   17.62 (0.88)
σ²_e (30.0)    -0.05 (0.16)   -0.05 (0.16)   0.59 (0.16)    0.81 (0.16)    0.46 (0.16)

Coverage probabilities (90%/95%); approximate MCSE (0.28%/0.15%)
β_0            87.9/93.8      88.1/94.0      87.7/93.5      87.2/92.7      90.8/95.5
β_1            89.2/94.4      89.7/94.5      89.8/94.8      87.6/93.3      91.6/95.6
Σ_u00          86.3/91.0      87.8/92.2      86.7/92.0      82.7/89.6      88.5/94.4
Σ_u01          92.5/96.6      92.5/96.7      91.1/95.3      88.6/94.1      92.8/96.1
Σ_u11          86.4/91.2      88.1/93.2      89.9/94.9      87.2/94.2      85.3/92.0
σ²_e           89.0/94.1      89.0/94.1      89.8/94.7      89.9/94.7      90.0/94.9

Average interval widths (90%/95%)
β_0            1.240/1.477    1.253/1.493    1.223/1.475    1.209/1.458    1.343/1.609
β_1            0.353/0.421    0.357/0.425    0.362/0.432    0.339/0.407    0.381/0.457
Σ_u00          4.610/5.493    4.710/5.611    4.775/5.795    4.440/5.383    5.839/7.091
Σ_u01          0.924/1.101    0.944/1.125    0.972/1.188    0.903/1.106    1.178/1.448
Σ_u11          0.375/0.447    0.383/0.456    0.407/0.494    0.375/0.455    0.476/0.578
σ²_e           5.027/5.991    5.028/5.991    5.077/6.016    5.129/6.107    5.063/6.035
Table 4.27: Summary of results for the random slopes regression with the 12 schools unbalanced design with parameter values Σ_u00 = 5, Σ_u01 = 0.0 and Σ_u11 = 0.5. Only 877 runs.

Param. (True)  IGLS           RIGLS          Wish 1 prior   Wish 2 prior   Uniform prior

Relative % bias in estimates (Monte Carlo SE); values in [ ] are actual biases as the true value is 0
β_0 (30.0)     -0.05 (0.15)   -0.06 (0.15)   0.14 (0.09)    0.10 (0.09)    0.06 (0.09)
β_1 (0.5)      -0.45 (1.50)   -0.64 (1.49)   1.48 (1.48)    -0.16 (1.48)   0.06 (1.50)
Σ_u00 (5.0)    -4.78 (2.22)   9.70 (2.22)    -4.15 (2.29)   -12.34 (2.11)  219.22 (5.10)
Σ_u01 (0.0)    [0.01 (0.02)]  [0.01 (0.02)]  [-0.01 (0.02)] [0.00 (0.02)]  [-0.04 (0.05)]
Σ_u11 (0.5)    -9.19 (1.57)   0.65 (1.71)    34.76 (1.77)   -5.26 (1.71)   142.98 (3.74)
σ²_e (30.0)    -0.56 (0.40)   -0.82 (0.41)   2.12 (0.36)    2.67 (0.36)    1.62 (0.35)

Coverage probabilities (90%/95%); approximate MCSE (0.29%/0.15%)
β_0            83.5/90.8      86.2/92.1      85.3/91.4      82.2/90.1      96.9/99.7
β_1            86.1/91.3      88.1/92.2      92.4/96.7      85.4/91.6      96.1/97.9
Σ_u00          80.0/83.5      84.8/87.8      83.2/91.1      73.5/81.9      73.2/81.8
Σ_u01          87.7/94.1      88.8/95.3      92.9/97.9      71.8/79.2      96.2/98.7
Σ_u11          76.9/81.4      82.7/86.5      93.6/96.8      78.8/86.9      75.4/85.5
σ²_e           89.1/93.7      89.2/93.5      88.8/95.0      87.9/94.6      90.1/95.1

Average interval widths (90%/95%)
β_0            2.525/3.009    2.660/3.170    2.559/3.124    2.458/2.981    3.957/4.950
β_1            0.673/0.802    0.705/0.840    0.788/0.962    0.671/0.802    1.030/1.290
Σ_u00          9.685/11.54    10.76/12.82    10.40/13.45    8.13/10.44     37.15/50.49
Σ_u01          1.814/2.162    2.002/2.385    2.230/2.974    1.678/2.168    6.876/9.696
Σ_u11          0.706/0.841    0.772/0.920    1.067/1.373    0.703/0.887    2.554/3.488
σ²_e           9.955/11.86    9.931/11.83    10.47/12.56    10.45/12.51    10.29/12.32
Table 4.28: Summary of results for the random slopes regression with the 12 schools balanced design with parameter values Σ_u00 = 5, Σ_u01 = 0.0 and Σ_u11 = 0.5. Only 990 runs.

Param. (True)  IGLS           RIGLS          Wish 1 prior   Wish 2 prior   Uniform prior

Relative % bias in estimates (Monte Carlo SE); values in [ ] are actual biases as the true value is 0
β_0 (30.0)     0.04 (0.08)    0.04 (0.08)    0.13 (0.08)    0.06 (0.08)    0.04 (0.08)
β_1 (0.5)      0.48 (1.38)    0.49 (1.38)    1.73 (1.38)    1.14 (1.39)    0.96 (1.39)
Σ_u00 (5.0)    -12.80 (1.80)  -0.32 (1.96)   -11.42 (1.96)  -19.73 (1.82)  187.26 (4.42)
Σ_u01 (0.0)    [0.01 (0.02)]  [0.01 (0.02)]  [-0.01 (0.02)] [-0.00 (0.02)] [-0.03 (0.04)]
Σ_u11 (0.5)    -9.05 (1.41)   0.46 (1.54)    33.95 (1.60)   -4.53 (1.54)   138.75 (3.40)
σ²_e (30.0)    0.06 (0.32)    0.05 (0.32)    2.49 (0.33)    2.90 (0.33)    2.01 (0.32)

Coverage probabilities (90%/95%); approximate MCSE (0.30%/0.16%)
β_0            85.8/91.6      87.1/93.1      86.4/92.1      84.5/90.8      97.7/99.4
β_1            86.5/91.5      87.9/92.7      92.0/96.1      86.2/91.2      96.3/98.3
Σ_u00          77.0/81.2      82.5/85.2      78.0/88.1      68.5/77.7      76.6/87.3
Σ_u01          91.1/97.0      91.6/97.4      93.1/97.1      75.9/81.6      95.2/98.3
Σ_u11          79.1/82.7      83.0/85.8      93.6/97.5      80.5/86.0      77.2/85.7
σ²_e           90.2/94.8      90.2/94.8      89.6/95.3      89.3/94.9      89.8/95.1

Average interval widths (90%/95%)
β_0            2.409/2.871    2.525/3.008    2.415/2.947    2.335/2.834    3.741/4.691
β_1            0.667/0.795    0.698/0.832    0.781/0.947    0.668/0.792    1.020/1.279
Σ_u00          8.891/10.59    9.764/11.63    9.522/12.29    7.386/9.456    33.35/45.38
Σ_u01          1.721/2.051    1.892/2.254    2.094/2.795    1.582/2.049    6.359/8.994
Σ_u11          0.693/0.826    0.758/0.904    1.048/1.347    0.696/0.877    2.499/3.419
σ²_e           10.04/11.96    10.04/11.97    10.54/12.64    10.49/12.56    10.33/12.37
Figure 4-6: Plots of biases obtained for the various methods fitting the random slopes regression model against study design (fixed effects parameters and level 1 variance parameter). Panel (i): β_0; panel (ii): β_1; panel (iii): σ²_e. Each panel plots estimated % bias for the 12U, 12B, 48U and 48B designs for IGLS, RIGLS, Wishart Prior 1, Wishart Prior 2 and the Uniform Prior.
Figure 4-7: Plots of biases obtained for the various methods fitting the random slopes regression model against study design (level 2 variance parameters). Panel (i): estimated % bias of Σ_u00; panel (ii): estimated bias of Σ_u01; panel (iii): estimated % bias of Σ_u11, each for the 12U, 12B, 48U and 48B designs for IGLS, RIGLS, Wishart Prior 1, Wishart Prior 2 and the Uniform Prior.
(It is included in the package MLwiN as there exist situations where the IGLS
method converges when the RIGLS method doesn't).
Although the RIGLS method performs well in terms of bias, it is not designed
for interval estimation and the additional (sometimes false) assumption that the
parameter of interest has a Gaussian distribution has been used to generate
interval estimates. The MCMC methods have been compared to see if they
will improve on the RIGLS method in terms of coverage.
The main difficulty with the MCMC methods is choosing default priors for
the variance parameters. In the univariate case I have compared three possible
prior distributions using the variance components model. All three priors give
variance estimates that are positively biased, but this bias increases as N, the
number of units associated with the variance, decreases. Of the three priors, the
Pareto prior for the precision parameter, which is a proper prior equivalent to
a uniform prior for the variance, has far larger bias but in turn often has the
best coverage properties. The Gamma(ε, ε) prior for the precision parameter has
far less bias and also improves over the maximum likelihood methods in terms
of coverage, so it would be preferable except that it does not easily generalise to a
multivariate distribution. The final prior, which uses a prior estimate taken from
the gamma prior estimate, gives approximately the same answers as the gamma
prior.
When variance matrices are considered, as in the random slopes regression
model, multivariate priors are required. The uniform prior easily translates to
a multivariate uniform prior but unfortunately the gamma prior does not. A
candidate multivariate Wishart prior (Wish1) for the precision matrix was used
to replace the gamma prior. A third alternative prior (Wish2) based on a prior
estimate for the variance matrix, this time from RIGLS, was also considered. This
third prior performed poorly and tended to underestimate the variance matrix
and generally gave worse coverage than the maximum likelihood methods.
The Wish1 prior tended to shrink the variance estimates towards the identity
matrix but generally was less biased than the other two priors. The uniform
prior once again was highly positively biased. In terms of coverage properties the
uniform and Wish1 prior both performed as well overall as the RIGLS method
but no better.
So in conclusion, in some situations the RIGLS maximum likelihood method,
which is far faster to run, improves on MCMC and in other situations it is
MCMC that has better performance. Both the uniform and gamma priors and
their multivariate equivalents have good points and bad points but overall the
gamma prior appears to be slightly better. In Chapter 6, I will consider a
similar simulation study using a multi-level logistic regression model. Here the
approximation based methods do not perform as well as noted by Rodriguez
and Goldman (1995), and it will be shown that the MCMC methods make an
improvement with these models.
4.5.2 Priors in MLwiN
The first release of the MLwiN package occurred while these simulations were
still being performed. In this version I included the uniform prior on the σ² scale
for all variance parameters as a default, mainly because it was simple and easiest
to extend to the multivariate case. The user is also given the option to include
informative priors for variance parameters. For an informative prior the user
must input a prior estimate for the variance or variance matrix and a sample size
on which this prior estimate is based. If the user gives a prior sample size of 1
then the Wishart prior produced is identical to the second Wishart prior used for
the random slopes regression simulations in this chapter.
In future releases the Gamma(ε, ε) prior for the precision and the Wish1 prior
may be added as alternatives following their performance in these simulations.
In this chapter I have introduced two simple multi-level models and shown
how they can be fitted using the Gibbs sampler. I will generalise this to include
the whole family of Gaussian models in the next chapter, along with showing how
to apply the other MCMC methods to multi-level models.
Chapter 5

Gaussian Models 2 - General Models
In the previous chapter I introduced two simple two level models and showed
how to use one MCMC method, Gibbs sampling, to fit them. In this chapter I
will extend this work in two directions. Firstly I will give a general description
of an N level Gaussian multi-level model and show how to fit this model using
Gibbs sampling. Secondly I will show how to use other MCMC methods with
Gibbs sampling via a hybrid approach to fit N level Gaussian models. I will give
two alternative Metropolis-Gibbs hybrid sampling algorithms and explain how
these methods can be extended into adaptive samplers. I will compare through a
simple example how well the methods perform in terms of their times to produce
estimates with a desired accuracy.
5.1 General N level Gaussian hierarchical linear models
In the field of education I have already looked at a two level scenario with pupils
within schools. This structure could easily be extended in many directions by the
addition of extra levels to the model. The schools could be divided into different
education authorities giving another higher level. Pupils in each school could
be defined by their class, allowing a level between pupils and schools. Below the
pupil level, each student could sit tests over a period of several years and so each
test could be a lower level unit.
It is quite easy to see how a 2 level model can be extended to a 5 level model
in an educational setting, and it is conceivable that in other application areas
there could be even more levels. In the general framework, predictor variables
can be defined as fixed effects or random effects at any level in this model. For
example, predictors such as sex, parental background and ethnic origin are pupil
level variables, whilst class size and teacher variables are class level variables and
school size and type are school level variables.
The multi-level structure of these models produces similarities between the
conditional distributions for predictor variables at different levels, and it will be
shown later that only four parameter updating steps are needed for a general N
level Gaussian model. One of the main difficulties with extending the algorithm
to N levels is notational. In the two level model we have pupil i in school j, and
this cannot be extended indefinitely.
I will firstly look at the work on hierarchical models in the paper by Seltzer,
Wong, and Bryk (1996) and show how their algorithms can be modified to fit a
general 3 level multi-level model before extending this work to N levels.
5.2 Gibbs sampling approach
Seltzer, Wong, and Bryk (1996) considered hierarchical models of 2 levels with
fixed effects. They found the conditional posterior distributions for all the
parameters so that a Gibbs sampling algorithm could be easily implemented.
They also included specifications of prior distributions for all variance parameters
and incorporated these prior distributions into their posterior distributions. They
stated that it was easy to extend the algorithm to hierarchical models with 3 or
more levels but did not state how.
I wish to follow on from their work but to consider a wider family of
distributions, namely the N level Gaussian multi-level models. Their formulation
for a 2 level hierarchical model is as follows:
$$y_{ij} = X_{ij}\beta_j + X^*_{ij}\theta + e_{ij},$$
$$\beta_j = W_j\gamma + U_j,$$

where e_ij ~ N(0, σ²) and U_j ~ MVN(0, T).

This is a general 2 level hierarchical model with fixed effects. To translate
this into a 2 level multi-level model I will re-parameterise as follows:

$$y_{ij} = X_{ij}(W_j\gamma + U_j) + X^*_{ij}\theta + e_{ij} = X_{ij}U_j + X_{ij}W_j\gamma + X^*_{ij}\theta + e_{ij} = X_{ij}U_j + Z_{ij}\beta^* + e_{ij},$$

where

$$Z_{ij} = \left(\, X_{ij}W_j \;\; X^*_{ij} \,\right) \quad\text{and}\quad \beta^* = \begin{pmatrix} \gamma \\ \theta \end{pmatrix},$$

with e_ij ~ N(0, σ²) and U_j ~ MVN(0, T).
In this formulation estimates for the variance parameters, σ² and T, as well as
the fixed effects β* can still be found. We can also find estimates for the lowest
level random variables (in the above notation θ), which are really also fixed effects.
Due to the re-parameterisation the level 2 residuals, the U_j, are estimated as opposed
to the β_j, which were random parameters. It is easy to calculate the β_j from the
U_j and vice versa.
Re-parameterising the model in this way does not have any particular
advantages in terms of convergence; in fact Gelfand, Sahu, and Carlin (1995) show that,
for some models that fit this framework, re-parameterising in the way described
will give worse mixing properties for the Markov chain. The main reason for re-
parameterising into this format is that we are now working with a far larger family
of distributions. This is because there exist models that fit this framework but
cannot be described in the previous format.
The next step is to construct conditional posterior distributions for this new
model structure. I will consider a general 3 level model as opposed to the 2 level
model that Seltzer, Wong, and Bryk (1996) consider, as this easily generalises to
N levels. The three level model will be defined as follows:

$$y_{ijk} = X_{1ijk}\beta_1 + X_{2ijk}\beta_{2jk} + X_{3ijk}\beta_{3k} + e_{ijk},$$
$$e_{ijk} \sim N(0, \sigma^2), \quad \beta_{2jk} \sim \mathrm{MVN}(0, V_2), \quad \beta_{3k} \sim \mathrm{MVN}(0, V_3),$$

with β_1 as fixed effects, β_2 the level 2 residuals and β_3 the level 3 residuals.
There are now 6 sets of unknowns to consider. I will consider these in turn
and will assume that the variance parameters have general scaled inverse χ² and
inverse Wishart priors, whilst the fixed effects have uniform priors. The steps
required are then:
Step 1. p(β_1 | y, β_2, β_3, σ², V_2, V_3)

$$\beta_1 \sim N(\hat{\beta}_1, \hat{D}_1)$$

$$p(\beta_1 \mid y, \beta_2, \beta_3, \sigma^2, V_2, V_3) \propto p(y \mid \beta_1, \beta_2, \beta_3, \sigma^2, V_2, V_3)\,p(\beta_1)$$
$$\propto \prod_{ijk} \left(\tfrac{1}{\sigma^2}\right)^{\frac{1}{2}} \exp\left[-\tfrac{1}{2\sigma^2}(y_{ijk} - X_{1ijk}\beta_1 - X_{2ijk}\beta_{2jk} - X_{3ijk}\beta_{3k})^2\right]$$
$$\propto \prod_{ijk} \left(\tfrac{1}{\sigma^2}\right)^{\frac{1}{2}} \exp\left[-\tfrac{1}{2\sigma^2}(d_{1ijk} - X_{1ijk}\beta_1)^2\right]$$

where d_{1ijk} = y_{ijk} - X_{2ijk}β_{2jk} - X_{3ijk}β_{3k}, giving

$$\hat{D}_1 = \sigma^2\left[\sum_{ijk} X_{1ijk}^T X_{1ijk}\right]^{-1},$$

and

$$\hat{\beta}_1 = \left[\sum_{ijk} X_{1ijk}^T X_{1ijk}\right]^{-1} \sum_{ijk} X_{1ijk}^T d_{1ijk} = \frac{\hat{D}_1}{\sigma^2} \sum_{ijk} X_{1ijk}^T d_{1ijk}.$$

This is the formula for a simple linear regression of d_1 against X_1.
Step 2. p(β_2 | y, β_1, β_3, σ², V_2, V_3)

$$\beta_{2jk} \sim N(\hat{\beta}_{2jk}, \hat{D}_{2jk})$$

$$p(\beta_{2jk} \mid y, \beta_1, \beta_3, \sigma^2, V_2, V_3) \propto p(y \mid \beta_1, \beta_2, \beta_3, \sigma^2, V_2, V_3)\,p(\beta_{2jk} \mid V_2)$$
$$\propto \prod_{i=1}^{n_{jk}} \left[\left(\tfrac{1}{\sigma^2}\right)^{\frac{1}{2}} \exp\left[-\tfrac{1}{2\sigma^2}(d_{2ijk} - X_{2ijk}\beta_{2jk})^2\right]\right] \cdot |V_2|^{-\frac{1}{2}} \exp\left[-\tfrac{1}{2}\beta_{2jk}^T V_2^{-1}\beta_{2jk}\right]$$

where d_{2ijk} = y_{ijk} - X_{1ijk}β_1 - X_{3ijk}β_{3k}, giving

$$\hat{D}_{2jk} = \left[\sum_{i=1}^{n_{jk}} \frac{X_{2ijk}^T X_{2ijk}}{\sigma^2} + V_2^{-1}\right]^{-1},$$

and

$$\hat{\beta}_{2jk} = \frac{\hat{D}_{2jk}}{\sigma^2} \sum_{i=1}^{n_{jk}} X_{2ijk}^T d_{2ijk}.$$
Step 3. p(β_3 | y, β_1, β_2, σ², V_2, V_3)

$$\beta_{3k} \sim N(\hat{\beta}_{3k}, \hat{D}_{3k})$$

$$p(\beta_{3k} \mid y, \beta_1, \beta_2, \sigma^2, V_2, V_3) \propto p(y \mid \beta_1, \beta_2, \beta_3, \sigma^2, V_2, V_3)\,p(\beta_{3k} \mid V_3)$$
$$\propto \prod_{ij} \left[\left(\tfrac{1}{\sigma^2}\right)^{\frac{1}{2}} \exp\left[-\tfrac{1}{2\sigma^2}(d_{3ijk} - X_{3ijk}\beta_{3k})^2\right]\right] \cdot |V_3|^{-\frac{1}{2}} \exp\left[-\tfrac{1}{2}\beta_{3k}^T V_3^{-1}\beta_{3k}\right]$$

where d_{3ijk} = y_{ijk} - X_{1ijk}β_1 - X_{2ijk}β_{2jk}, giving

$$\hat{D}_{3k} = \left[\sum_{ij} \frac{X_{3ijk}^T X_{3ijk}}{\sigma^2} + V_3^{-1}\right]^{-1},$$

and

$$\hat{\beta}_{3k} = \frac{\hat{D}_{3k}}{\sigma^2} \sum_{ij} X_{3ijk}^T d_{3ijk}.$$
Step 4. p(σ² | y, β_1, β_2, β_3, V_2, V_3)

Assume σ² has a scaled inverse χ² prior, p(σ²) ~ SIχ²(ν_e, s²_e). Considering 1/σ²,
and using the change of variables formula with h(σ²) = 1/σ², gives

$$p\!\left(\tfrac{1}{\sigma^2}\right) = p(\sigma^2)\left|h'(\sigma^2)\right|^{-1} = p(\sigma^2)\left(\tfrac{1}{\sigma^4}\right)^{-1} = p(\sigma^2)\left(\tfrac{1}{\sigma^2}\right)^{-2}.$$

Substituting will give

$$p\!\left(\tfrac{1}{\sigma^2} \,\Big|\, y, \beta_1, \beta_2, \beta_3, V_2, V_3\right) \propto \left(\tfrac{1}{\sigma^2}\right)^{N/2} \exp\left[-\sum_{ijk} \frac{e_{ijk}^2}{2\sigma^2}\right]\cdot\left(\tfrac{1}{\sigma^2}\right)^{-2} p(\sigma^2),$$

so 1/σ² ~ Gamma(a, b) where

$$a = \frac{N + \nu_e}{2}, \qquad b = \frac{1}{2}\left(\sum_{ijk} e_{ijk}^2 + \nu_e s_e^2\right).$$

A uniform prior on σ² is equivalent to setting ν_e = -2 and s²_e = 0.
Step 5. p(V_2 | y, β_1, β_2, β_3, σ², V_3)

Assume V_2 has an inverse Wishart prior, V_2 ~ IW(ν_p2, S_p2); then

$$p(V_2^{-1} \mid y, \beta_1, \beta_2, \beta_3, \sigma^2, V_3) \propto p(\beta_2 \mid V_2)\,p(V_2^{-1}),$$

which gives

$$V_2^{-1} \sim \mathrm{Wishart}_{n_2}\left[S_2 = \left(\sum_{jk} \beta_{2jk}\beta_{2jk}^T + S_{p2}\right)^{-1},\; \nu_2 = n_{jk} + \nu_{p2}\right].$$

Here S_2 is an n_2 × n_2 scale matrix where n_2 is the number of random variables at
level 2, ν_2 is the degrees of freedom of the Wishart distribution, and n_{jk} is the
number of level 2 units. A uniform prior is equivalent to setting ν_p2 = -n_2 - 1
and S_p2 = 0.
Step 6. p(V_3 | y, β_1, β_2, β_3, σ², V_2)

Assume V_3 has an inverse Wishart prior, V_3 ~ IW(ν_p3, S_p3); then

$$p(V_3^{-1} \mid y, \beta_1, \beta_2, \beta_3, \sigma^2, V_2) \propto p(\beta_3 \mid V_3)\,p(V_3^{-1}),$$

which gives

$$V_3^{-1} \sim \mathrm{Wishart}_{n_3}\left[S_3 = \left(\sum_{k} \beta_{3k}\beta_{3k}^T + S_{p3}\right)^{-1},\; \nu_3 = n_{k} + \nu_{p3}\right].$$

Here S_3 is an n_3 × n_3 scale matrix where n_3 is the number of random variables at
level 3, ν_3 is the degrees of freedom of the Wishart distribution, and n_k is the
number of level 3 units. A uniform prior is equivalent to setting ν_p3 = -n_3 - 1
and S_p3 = 0.
The above algorithm already shows some similarities between steps. It can be
seen that steps 2 and 3 are effectively of the same form but with summations over
different levels. The same is also true for steps 5 and 6, and so although I have
written the algorithm in six steps it could actually be written out in four. I will
now consider the N level model and show that this also only needs four steps.
5.3 Generalising to N levels
For an N level model there is 1 set of fixed effects, N sets of residuals (although
residuals at level 1 can be calculated via subtraction and so do not need to be
sampled) and N sets of variance parameters. These parameters can be split into
4 groups in such a way that all parameters in each group have posteriors of the
same form, as illustrated previously in the 3 level model.

1. The fixed effects.

2. The N - 1 sets of residuals (excluding level 1).

3. The level 1 scalar variance σ².

4. The N - 1 higher level variances.
I will need some additional notation, as using summations over N levels, i.e.
N indices, becomes impractical and messy. Firstly I will describe level 1 as the
observation level, and units at level 1 as observations. Then let M_T be the set
of all observations in the model and let M_{l,j} be the set of observations that at
level l are in category j. For example in the simple 2 level educational datasets
in Chapter 4, M_T will contain all pupils in all schools while M_{2,j} will contain all
the pupils in school j.
Now also let X_{li} be the vector of variables at level l for observation i,
where l = 1 refers to the variables associated with the fixed effects. Finally
let the random parameters at level l, l > 1, be denoted by β_{lj}, where j is one
of the combinations of higher level terms (the fixed effects will be β_1). Also
d_{li} = e_i + X_{li}β_{lj}, in the same way as d_{2ijk} = e_{ijk} + X_{2ijk}β_{2jk} in the 3 level model.
I will use the following prior distributions:
for the level 1 variance, p(σ²) ~ SIχ²(ν_e, s²_e); for the level l variance, where
l > 1, V_l ~ IW(ν_{Pl}, S_{Pl}); and for the fixed effects, β_1 ~ N(μ_p, S_p). I will now
describe the four steps.
5.3.1 Algorithm 1
Step 1 - The fixed effects, β_1.

$$p(\beta_1 \mid y, \ldots) \propto p(y \mid \beta_1, \ldots)\,p(\beta_1)$$
$$\beta_1 \sim \mathrm{MVN}(\hat{\beta}_1, \hat{D}_1)$$

where

$$\hat{D}_1 = \left[\sum_{i \in M_T} \frac{X_{1i}^T X_{1i}}{\sigma^2} + S_p^{-1}\right]^{-1},$$

and

$$\hat{\beta}_1 = \hat{D}_1 \cdot \left[\sum_{i \in M_T} \frac{X_{1i}^T d_{1i}}{\sigma^2} + S_p^{-1}\mu_p\right].$$

Step 2 - The level l residuals, β_l.

$$p(\beta_l \mid y, \ldots) \propto p(y \mid \beta_l, \ldots)\,p(\beta_l \mid V_l)$$
$$\beta_{lj} \sim \mathrm{MVN}(\hat{\beta}_{lj}, \hat{D}_{lj})$$

where

$$\hat{D}_{lj} = \left[\sum_{i \in M_{l,j}} \frac{X_{li}^T X_{li}}{\sigma^2} + V_l^{-1}\right]^{-1},$$

and

$$\hat{\beta}_{lj} = \frac{\hat{D}_{lj}}{\sigma^2} \cdot \sum_{i \in M_{l,j}} X_{li}^T d_{li}.$$

Step 3 - The level 1 scalar variance σ².

$$p(1/\sigma^2 \mid y, \ldots) \propto p(y \mid \sigma^2, \ldots)\,p(1/\sigma^2)$$
$$1/\sigma^2 \sim \mathrm{Gamma}(a_{pos}, b_{pos}),$$

where a_pos = ½(N + ν_e) and b_pos = ½(Σ_n e²_n + ν_e s²_e). For a uniform prior ν_e = -2 and s²_e = 0.

Step 4 - The level l variance, V_l.

$$p(V_l^{-1} \mid y, \ldots) \propto p(\beta_l \mid V_l)\,p(V_l^{-1})$$
$$V_l^{-1} \sim \mathrm{Wishart}_{nr_l}\left[S_{pos} = \left(\sum_{i=1}^{n_l} \beta_{li}\beta_{li}^T + S_{Pl}\right)^{-1},\; \nu_{pos} = n_l + \nu_{Pl}\right],$$

where n_l is the number of level l units. For a uniform prior, S_{Pl} = 0 and
ν_{Pl} = -nr_l - 1, where nr_l is the number of random variables at level l.
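To make the four steps concrete, the following Python sketch runs one iteration of Algorithm 1 for the simplest case, a 2 level random slopes model with uniform priors throughout (so the prior terms drop out of Steps 1 and 2, ν_e = -2 and ν_{Pl} = -nr_l - 1). The array names are mine; this is an illustration, not the MLwiN implementation.

```python
import numpy as np

rng = np.random.default_rng()

def wishart_draw(S, nu):
    """Draw from Wishart(S, nu) by summing nu outer products (needs integer nu >= dim)."""
    L = np.linalg.cholesky(S)
    Z = rng.standard_normal((int(round(nu)), S.shape[0]))
    A = Z @ L.T
    return A.T @ A

def gibbs_iteration(y, X1, X2, school, beta1, u, sigma2_e, V2):
    """One sweep of the four updating steps for a 2 level Gaussian model.

    X1 : (n, p) design matrix for the fixed effects beta1
    X2 : (n, q) design matrix for the level 2 residuals u (one q-vector per school)
    """
    n = len(y)
    J, q = u.shape

    # Step 1: fixed effects beta1 | ... (uniform prior version)
    d1 = y - np.einsum('ij,ij->i', X2, u[school])      # y minus the level 2 contribution
    D1 = sigma2_e * np.linalg.inv(X1.T @ X1)
    beta1 = rng.multivariate_normal((D1 / sigma2_e) @ (X1.T @ d1), D1)

    # Step 2: level 2 residuals u_j | ...
    V2inv = np.linalg.inv(V2)
    d2 = y - X1 @ beta1
    for j in range(J):
        rows = school == j
        X2j = X2[rows]
        D2j = np.linalg.inv(X2j.T @ X2j / sigma2_e + V2inv)
        u[j] = rng.multivariate_normal(D2j @ (X2j.T @ d2[rows]) / sigma2_e, D2j)

    # Step 3: level 1 variance via 1/sigma^2 ~ Gamma(N/2 - 1, rate = sum(e^2)/2)
    e = d2 - np.einsum('ij,ij->i', X2, u[school])
    sigma2_e = 1.0 / rng.gamma(n / 2.0 - 1.0, 2.0 / np.sum(e ** 2))

    # Step 4: level 2 variance matrix via V2^{-1} ~ Wishart((sum_j u_j u_j^T)^{-1}, J - q - 1)
    V2 = np.linalg.inv(wishart_draw(np.linalg.inv(u.T @ u), J - q - 1))
    return beta1, u, sigma2_e, V2
```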
5.3.2 Computational considerations
When writing Gibbs Sampling code one of the main concerns is the speed of
processing. The code for 1 iteration will be repeated thousands of times and so
any small speed gain for an individual iteration will be magnified greatly. The
actual memory requirements for storing intermediate quantities will be small in
comparison to the size of the results. There is therefore scope to store a few more
intermediate results if they will in turn speed up the code. I will now explain two
computational steps that will speed up the processing time.
Speed up 1
From the 4 steps shown in the general N level algorithm, note that the quantities
Σ_{i∈M_{l,j}} X_{li}^T X_{li} and Σ_{i∈M_T} X_{1i}^T X_{1i} are fixed constant matrices. It would save a large
amount of time if these quantities are calculated at the beginning and then stored
so that they can be used in each iteration.
Speed up 2
Much use is made of quantities such as d_{li}, which are equal to e_i + c_{li} where c_{li} is the
product of a parameter vector and a data vector; for example d_{2i} = e_i + X_{2i}β_{2j}.
If I store e_i, the level 1 residual for observation i, then whenever (in steps 1 and 2 of
the algorithm) one of the d quantities needs to be calculated, I can add on the
current value of the parameter multiplied by the data vector, for example X_{2i}β_{2j},
to produce d_{2i}, and then use this to calculate a new value for the appropriate β.
Once a new value has been calculated this procedure can be applied backwards
to give the new value of the level 1 residual e_i, i.e. subtract the new parameter
value × the data vector. This idea will also be repeated in later methods.
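Both speed-ups are easy to see in code. The fragment below reuses the hypothetical arrays X1, X2, school, y, beta1 and u from the earlier sketch: the cross-product matrices are formed once, and the level 1 residuals e_i are carried along and updated in place rather than recomputed.

```python
import numpy as np

rng = np.random.default_rng()

# Speed up 1: these cross-product matrices never change, so form them once
# before the chain starts rather than inside every iteration.
XtX_fixed = X1.T @ X1
XtX_school = [X2[school == j].T @ X2[school == j] for j in range(u.shape[0])]

# Speed up 2: keep the level 1 residuals e_i up to date as parameters change.
e = y - X1 @ beta1 - np.einsum('ij,ij->i', X2, u[school])

def update_school(j, sigma2_e, V2inv):
    """Redraw the level 2 residuals of school j, maintaining e as we go."""
    rows = school == j
    d2 = e[rows] + X2[rows] @ u[j]            # add the current contribution back on
    D2j = np.linalg.inv(XtX_school[j] / sigma2_e + V2inv)
    u[j] = rng.multivariate_normal(D2j @ (X2[rows].T @ d2) / sigma2_e, D2j)
    e[rows] = d2 - X2[rows] @ u[j]            # subtract the new contribution off
```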
5.4 Method 2: Metropolis Gibbs hybrid method with univariate updates
In the previous section I have given an algorithm to fit the multi-level Gaussian
model using the Gibbs sampler. In the next chapter the models considered do not
give conditional distributions that have nice forms to be simulated from easily
using the Gibbs sampler. I will now fit the current models using some alternative
MCMC methods which can then be used on the models in the next chapter.
The steps that cause the simple Gibbs sampler problems in the multi-level
logistic regression models in the next chapter are updating the residuals and
fixed effects. The first plan is to replace the Gibbs sampler on these steps
with univariate normal proposal Metropolis steps, as described in the following
algorithm.
5.4.1 Algorithm 2
Step 1 - The fixed effects, β_1.

For i in 1, ..., N_Fixed,

$$\beta_{1i}^{(t)} = \begin{cases} \beta_{1i}^* & \text{with probability } \min\left(1,\ p(\beta_{1i}^* \mid y, \ldots)\big/p(\beta_{1i}^{(t-1)} \mid y, \ldots)\right) \\ \beta_{1i}^{(t-1)} & \text{otherwise,} \end{cases}$$

where β*_{1i} = β_{1i}^{(t-1)} + ε_{1i}, ε_{1i} ~ N(0, σ²_{1i}).

Step 2 - The level l residuals, β_l.

For l in 2, ..., N, j in 1, ..., n_l, and i in 1, ..., nr_l,

$$\beta_{lji}^{(t)} = \begin{cases} \beta_{lji}^* & \text{with probability } \min\left(1,\ p(\beta_{lji}^* \mid y, \ldots)\big/p(\beta_{lji}^{(t-1)} \mid y, \ldots)\right) \\ \beta_{lji}^{(t-1)} & \text{otherwise,} \end{cases}$$

where β*_{lji} = β_{lji}^{(t-1)} + ε_{lji}, ε_{lji} ~ N(0, σ²_{lji}), n_l is the number of level l units, and
nr_l is the number of random parameters at level l.

Step 3 - The level 1 scalar variance σ².

This step is the same as in Algorithm 1.

Step 4 - The level l variance, V_l.

This step is the same as in Algorithm 1.
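A minimal sketch of the univariate random walk Metropolis update used in Steps 1 and 2 is given below; log_cond is a hypothetical function returning the log of the appropriate conditional posterior for whichever fixed effect or residual is being updated, and each parameter has its own proposal SD.

```python
import numpy as np

rng = np.random.default_rng()

def rw_metropolis_update(theta, prop_sd, log_cond):
    """One univariate random walk Metropolis step with a normal proposal.

    theta    : current value of the parameter
    prop_sd  : proposal standard deviation for this parameter
    log_cond : function giving the log conditional posterior of this parameter
    Returns the new value and whether the proposal was accepted.
    """
    theta_star = theta + rng.normal(0.0, prop_sd)        # theta* = theta + epsilon
    log_ratio = log_cond(theta_star) - log_cond(theta)
    if np.log(rng.uniform()) < min(0.0, log_ratio):      # accept w.p. min(1, ratio)
        return theta_star, True
    return theta, False
```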
5.4.2 Choosing proposal distribution variances
When using the Gibbs sampler on multi-level models, having defined the
steps of the algorithm, the only remaining task is to fix starting values for all
the parameters. Generally starting values could be set fairly arbitrarily and the
results should be similar. To improve the mixing of the Markov chains, and
to utilise MLwiN's other facilities, I use the current estimates obtained by the
maximum likelihood IGLS or RIGLS methods as starting values. Having set the
starting values it is now simply a question of running through the steps of the
algorithm repeatedly.
When Metropolis steps are introduced to the algorithm, there is now one
more set of parameters that needs to be assigned values. In Steps 1 and 2 of
the above algorithm, there are normal proposal distributions with undefined
variances, and these variances need to be given sensible values. The Metropolis
steps will actually work with any positive values for the proposal variances, and
will eventually, given time, give estimates with a reasonable accuracy, but ideally
we would like accurate estimates in the minimum number of iterations. To achieve
this aim, proposal variances that give a chain that mixes well are desirable.
Gelman, Roberts, and Gilks (1995) explore efficient Metropolis proposal
distributions for normally distributed data in some detail. They show that the
ideal proposal standard deviation for a parameter of interest, θ, is approximately
2.4 times the standard deviation of θ. This implies the ideal proposal distribution
variance is 5.8 times the variance of θ. This means that if an estimate of the
variance of the parameter of interest were available, this result could be used and
the Metropolis algorithm run efficiently. Fortunately MLwiN also gives
standard errors for its estimates produced by IGLS or RIGLS and so these values
can be used.
The models studied in Gelman, Roberts, and Gilks (1995) are fairly simple
and there is no guarantee that the optimal value of 5.8 for the scaling factor
will follow for multi-level models. To test this out I will consider a few simple
multi-level models and find through simulation whether this optimal value of 5.8
holds.
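In code the rule amounts to nothing more than the following (function name mine):

```python
def proposal_sd_from_rigls(se, scale_factor=5.8):
    """Proposal SD from an IGLS/RIGLS standard error using the Gelman, Roberts
    and Gilks (1995) rule: proposal variance = scale_factor x estimated variance."""
    return (scale_factor * se ** 2) ** 0.5
```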
Finding optimal scaling factors
To find optimal scaling factors for the variance of the proposal distribution a
practical approach was taken. Several values for the scaling factor spread over
the range 0.05 to 20 were considered, and for each value 3 MCMC runs with a
burn-in of 500 and a main run of 50,000 were performed. The same value was
used as a multiplier for the variance estimate from the RIGLS method for each
fixed effect and higher level residual parameter. To find the optimal value the
Raftery Lewis statistic was calculated. In Chapter 3 I showed that the Raftery
Lewis N statistic is equivalent to the reciprocal of the efficiency of the estimate,
so the optimal scaling factor will be the scaling factor value that minimises N.
The method was firstly used on the two simple models considered in Chapter
4, the variance components and random slopes regression models. The results
can be seen in Figures 5-1, 5-2 and 5-3. From these figures the shape of the
graphs can be seen to be similar to those in Chapter 3 (Figure 3-6), although
the graphs in Chapter 3 are based on the scale factor for the standard deviation
and not the variance. What is immediately clear is that the value 5.8 is not
the minimum for the scale factor as was found in the simple Gaussian model in
Chapter 3. In the second example (Figures 5-2 and 5-3) it can be seen that the
optimal scale factor is not even the same for both parameters.
There is some noise when using N as an estimate of efficiency, but a rough
estimate of the minimum can be obtained and on all three graphs this is far
smaller than 5.8. This creates a problem as the same scale factor is being
used for each parameter, and so this constraint prevents the use of the different
optimal values. The calculations in Gelman, Roberts, and Gilks (1995) that give
the optimal value of the scale factor are fairly mathematically complex. Due
to this complexity I do not intend to attempt to find similar optimal formulae
mathematically for multi-level models in this thesis.
The method was also considered on some other models and the results can be
seen in Table 5.1. In Table 5.1, β_0 is the intercept, β_1 is the Math3 effect, and
β_2 is the sex effect. The models are all either variance components or random
slopes regression models, with the number indexing the number of fixed effects.
The final model (SCH1) uses a different educational dataset from Goldstein et al.
Figure 5-1: Plots of the effect of varying the scale factor for the proposal variance, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the β_0 parameter in the variance components model on the JSP dataset. (Top panel: Raftery Lewis N̂ against scale factor; bottom panel: Raftery Lewis N̂ against acceptance rate.)
Figure 5-2: Plots of the effect of varying the scale factor for the proposal variance, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the β_0 parameter in the random slopes regression model on the JSP dataset. (Top panel: Raftery Lewis N̂ against scale factor; bottom panel: Raftery Lewis N̂ against acceptance rate.)
Figure 5-3: Plots of the effect of varying the scale factor for the proposal variance, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the β_1 parameter in the random slopes regression model on the JSP dataset. (Top panel: Raftery Lewis N̂ against scale factor; bottom panel: Raftery Lewis N̂ against acceptance rate.)
(1998), which has 4059 students in 65 schools. To this dataset a similar random
slopes regression model has been fitted to assess whether the results obtained for
the JSP dataset are unique.
Table 5.1: Optimal scale factors for proposal variances and best acceptance rates for several models.

Model   Optimal scale factor      Acceptance rate (%)
        β_0    β_1    β_2         β_0       β_1       β_2
VC1     0.75   -      -           45-70     -         -
VC2     1.0    4.0    -           45-70     40-60     -
VC3     1.0    4.0    2.0         45-70     40-60     40-65
RSR2    0.75   2.0    -           40-75     35-65     -
RSR3    0.75   2.5    2.0         45-70     40-70     40-65
SCH1    0.5    1.5    -           45-80     40-70     -
If the acceptance rate is considered instead, it can be seen that the graphs are
far flatter, and there is a wide range of acceptance rates that give similar values
of N. Gelman, Roberts, and Gilks (1995) calculate the optimal acceptance rate
to be 44% for Gaussian data, and although this value appears to give a reasonably
low N it does not appear to be the minimum for all parameters. It will however
give far better results than using the scale factor of 5.8.
In Table 5.1 ranges of values have been given for the acceptance rates, as
the graphs of N values over these ranges are fairly flat. It can be seen that
all parameters considered give good results with acceptance rates between 45%
and 60%. This shows that if a proposal distribution that gives the same desired
acceptance rate for every parameter could be found then this would be a better
method than using the scale factor method considered thus far. This is the
motivation behind considering adaptive samplers.
5.4.3 Adaptive Metropolis univariate normal proposals
An additional problem with using the variance estimates produced by IGLS
and RIGLS to calculate the proposal distribution variances is the assumption
that these methods give good estimates. This does not however explain the
discrepancies from the value 5.8 for the scale factor, as the IGLS and RIGLS
variance estimates will generally be too small which would have the opposite
effect on the scaling factor. An alternative approach would be to have starting
proposal distributions and then adapt these distributions as the algorithm is
running to improve the mixing of the Markov chain.
Care has to be taken when performing adaptive Metropolis sampling (Gelfand
and Sahu 1994) as the simulations produced may not be a Markov chain. Gilks,
Roberts, and Sahu (1996) give a mathematical method based on Markov chain
regeneration that will give time points when it is acceptable to modify the
proposal distribution during the monitoring run of the chain. This method
although shown to be effective in the paper, is rather complicated and so I decided
instead to use the simpler approach of adapting the proposal distributions in a
preliminary period before the `burn-in' and main monitoring run of the chain.
Muller (1993) gives a simple adaptive Metropolis sampler based on the belief
that the ideal sampler will accept approximately 50% of the iterations. Gelman,
Roberts, and Gilks (1995) show that for univariate normal proposals, used on
a multivariate normal posterior density, the ideal acceptance rate is 44% but
from Table 5.1 it can be seen that for multi-level models 50% is an equally
good acceptance rate. Muller (1993) considers the last 10 observed acceptance
probabilities and uses the simple approach of modifying the proposal distribution
if the average of these acceptance rates lies outside the range 0.2 to 0.8.
There are several factors to consider when designing an adaptive algorithm.
Firstly how often to adapt the proposal distributions, secondly how to adapt
the proposal distributions and thirdly when to stop the adapting period and
to continue with the `burn-in' period. I will outline two adaptive Metropolis
algorithms that aim to give acceptance rates of 44% for all parameters, although
44% can be substituted by any other percentage.
Adaptive sampler 1
This method has been implemented in MLwiN and has an adapting period of
unknown length (up to an upper limit) followed by the usual `burn-in' period
and finally the main run from which the estimates are obtained. The objective
of this method is to achieve an acceptance rate of x% for all the parameters of
interest. Although in the MLwiN package, the proposal distributions used in the
non-adaptive method will be used as initial proposal distributions, the algorithm
will work on arbitrary starting proposals as illustrated in the examples below.
The algorithm needs the user to input 2 parameters. Firstly x% the desired
acceptance rate, which in the example will be 44% and a tolerance parameter,
which in the example will be 10%. This tolerance parameter governs when the
algorithm stops and is meant to signify bounds on the desired acceptance, that
is the desired acceptance rate is 44% but if the acceptance rate is somewhere
between 34% and 54% we are fairly happy. The algorithm then runs the sampler
with the current proposal distributions for batches of 100 and at the end of each
batch of 100, the proposal distributions are modified. This procedure is repeated
until the tolerance conditions are achieved. The modification procedure that
happens after each batch of 100 is detailed in the algorithm below.
Method
The following algorithm is repeated for each parameter. Let NAcc be the number
of iterations accepted in the current batch for the chosen parameter (out of 100),
OPTAcc be the desired acceptance rate, and PSD be the current proposal standard
deviation for the parameter.
$$\text{If } N_{Acc} > OPT_{Acc}: \quad PSD = PSD \times \left(2 - \frac{100 - N_{Acc}}{100 - OPT_{Acc}}\right);$$

$$\text{If } N_{Acc} < OPT_{Acc}: \quad PSD = PSD \Big/ \left(2 - \frac{N_{Acc}}{OPT_{Acc}}\right).$$
The above will modify the proposal standard deviation by a greater amount
the further the acceptance rate is from the desired acceptance rate. If the
acceptance rate is too small then the proposed new values are too far from the
current value and so the proposal SD is decreased. If the acceptance rate is too
high, then the proposed new values are not exploring enough of the posterior
distribution and so the proposal SD is increased.
To check if the tolerance condition is achieved NAcc is compared with the
tolerance interval, (OPT_Acc - TOL_Acc, OPT_Acc + TOL_Acc). If three successive
values of NAcc are in this interval then the parameter is marked as satisfying the
tolerance conditions. Once all parameters have been marked then the tolerance
condition is satisfied. After a parameter has been marked it is still modified
as before until all parameters are marked, but each parameter only needs to be
marked once for the algorithm to end. To limit the time spent in the adapting
procedure an upper limit is set (in MLwiN this is 5,000 iterations) and after this
time the adapting period ends regardless of whether the tolerance conditions are
met.
Note that it may be better to use the sum of the actual Metropolis acceptance
probabilities as in Muller (1993) instead of NAcc in the above algorithm,
although preliminary investigations show no significant differences in the proposal
distributions chosen.
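The modification rule and the tolerance check are simple enough to state directly in code. The sketch below (function names mine) works on one parameter's proposal SD at a time; in MLwiN the adaptation is applied to all parameters after each batch of 100 iterations, and it continues until every parameter has been marked or the 5,000 iteration cap is reached.

```python
def adapt_proposal_sd(psd, n_acc, opt_acc=44.0):
    """One adaptation of a proposal SD after a batch of 100 Metropolis updates,
    using the modification rule of adaptive sampler 1."""
    if n_acc > opt_acc:
        psd *= 2.0 - (100.0 - n_acc) / (100.0 - opt_acc)
    elif n_acc < opt_acc:
        psd /= 2.0 - n_acc / opt_acc
    return psd

def in_tolerance(n_acc, opt_acc=44.0, tol=10.0):
    """True if this batch's acceptance count lies inside the tolerance interval;
    a parameter is marked once three successive batches satisfy this."""
    return opt_acc - tol <= n_acc <= opt_acc + tol
```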
Results
Table 5.2 shows the adapting period for the two fixed effects parameters for
one run of the random slopes regression model with the JSP dataset. Here the
starting values have been chosen arbitrarily to be 1.0, whereas when this method
is used in MLwiN the RIGLS estimates will be used instead. From the table
it can be seen that both parameters have fulfilled the tolerance criteria by 700
iterations. However, as the adapting period also includes the level 2 residuals, it
is not complete until 3,300 iterations, when the final set of residuals satisfy the
criteria.
Table 5.2: Demonstration of Adaptive Method 1 for parameters β_0 and β_1 using arbitrary (1.000) starting values.

N       β_0 SD   NAcc   N in Tol    β_1 SD   NAcc   N in Tol
0       1.000    -      -           1.000    -      -
100     0.587    13     0           0.512    2      0
200     0.574    43     1           0.271    5      0
300     0.451    32     0           0.155    11     0
400     0.441    43     1           0.105    23     0
500     0.422    42     2           0.087    35     1
600     0.379    39     3*          0.075    37     2
700     0.412    49     3*          0.082    49     3*
800     0.370    39     3*          0.075    40     3*
900     0.354    42     3*          0.086    52     3*
1,000   0.436    57     3*          0.065    30     3*
...
3,300   0.381    47     3*          0.064    42     3*
In Table 5.3, runs of length 50,000 for various different methods using the same
random slopes regression model are compared. The four methods considered
are the Gibbs sampling method used in Chapter 4, and three versions of the
Metropolis Gibbs hybrid method: firstly using proposal SDs set at 1.0 for all
parameters, secondly using the RIGLS starting values to create the proposal
distributions, and finally using the first adaptive method.
After 50,000 iterations, the parameter estimates of all four methods are
reasonably similar. The Raftery Lewis N values show more clearly how well
the methods are performing. The Gibbs sampler generally has the lowest values
of N , with the adaptive method the best of the hybrid methods. The need to
choose good proposal distributions is highlighted by the huge N value for �1 using
the arbitrary 1.0 proposal distribution SD. This value is over 30 times longer than
the suggested run length for Gibbs for �1.
The acceptance rates and proposal standard deviations in Table 5.3 show how
far from the expected 44% acceptance rate, the RIGLS starting values method
actually is. This in turn explains why the N value for �0 using this method is
larger than for the adaptive method. The table also shows that the adaptive
method is a better approach than using the RIGLS starting values and this is
backed up by the Figures 5-1 to 5-3 seen earlier. The results in Table 5.3 are
based on only one run of each method, but other runs were performed and similar
results were obtained.
Adaptive sampler 2
Two criticisms that may be levelled against the first adaptive sampler are, firstly,
that there is no definite length for the adapting period and, secondly, that the method
includes a tolerance parameter which has to be set. This second method is
intended to improve on the first algorithm by doing away with the
tolerance parameter and giving acceptance rates closer to the desired acceptance
rate.
Method
In the first sampler, although the change to the proposal SD is smaller the closer
the current acceptance rate is to the desired acceptance rate, the change does
not vary with time. I will try to incorporate the MCMC technique of simulated
annealing (Geman and Geman 1984) by allowing the change to the proposal SD
to decrease with time.
Table 5.3: Comparison of results for the random slopes regression model on the JSP dataset using uniform priors for the variances, and different MCMC methods. Each method was run for 50,000 iterations after a burn-in of 500.

Parameter estimates (SD):
  Par.     Gibbs            MH (SD = 1)      MH (RIGLS)       MH Adapt 1
  β0       30.60 (0.396)    30.59 (0.374)    30.60 (0.406)    30.57 (0.417)
  β1       0.614 (0.048)    0.614 (0.047)    0.614 (0.051)    0.616 (0.049)
  Ωu00     5.674 (1.732)    5.699 (1.716)    5.702 (1.656)    5.780 (1.754)
  Ωu01     −0.426 (0.163)   −0.428 (0.162)   −0.420 (0.160)   −0.436 (0.168)
  Ωu11     0.055 (0.024)    0.055 (0.023)    0.054 (0.025)    0.055 (0.024)
  σ²e      26.98 (1.339)    26.93 (1.342)    26.99 (1.337)    26.92 (1.336)

Raftery and Lewis diagnostic (N):
  β0       10,520           60,728           58,778           32,528
  β1       6,453            216,954          24,421           24,999
  Ωu00     5,792            6,175            5,645            5,684
  Ωu01     4,866            4,714            4,882            5,212
  Ωu11     12,345           9,480            14,389           11,877
  σ²e      3,898            3,810            3,867            3,835

Acceptance rates for fixed effects (%):
  β0       100%             21.7%            23.2%            46.3%
  β1       100%             3.6%             34.0%            40.6%

Proposal standard deviations:
  β0       --               1.000            0.880            0.395
  β1       --               1.000            0.103            0.081
The algorithm will then be run for a fixed length of time,
Tmax (Tmax is chosen to be 5,000 in the example) and the proposal distributions
will be modified every 100 iterations. The following procedure will be carried out
for each parameter at time t:

\[
\text{If } N_{Acc} > OPT_{Acc}: \quad PSD = PSD \times \left(1 + \left(1 - \frac{100 - N_{Acc}}{100 - OPT_{Acc}}\right)\left(\frac{T_{max} - t + 100}{T_{max}}\right)\right);
\]
\[
\text{If } N_{Acc} < OPT_{Acc}: \quad PSD = PSD \Big/ \left(1 + \left(1 - \frac{N_{Acc}}{OPT_{Acc}}\right)\left(\frac{T_{max} - t + 100}{T_{max}}\right)\right).
\]
So after the first 100 iterations the range of possible changes to the
proposal SD is $(\tfrac{1}{2}PSD,\, 2PSD)$ as in the first algorithm, but this shrinks to
$(\tfrac{T_{max}}{T_{max}+100}PSD,\, \tfrac{T_{max}+100}{T_{max}}PSD)$ at time $T_{max}$.
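For comparison with the earlier sketch, a minimal Python version of the Method 2 adjustment is given below, under the same illustrative naming assumptions; the only change from Method 1 is the linearly decaying factor.

```python
def tune_proposal_sd_annealed(psd, n_acc, t, t_max=5000, opt_acc=44):
    """Adaptive Method 2: the adjustment decays with time t, so factors of up
    to 2 after the first batch shrink towards (t_max + 100) / t_max at t = t_max."""
    decay = (t_max - t + 100) / t_max
    if n_acc > opt_acc:
        psd *= 1.0 + (1.0 - (100.0 - n_acc) / (100.0 - opt_acc)) * decay
    elif n_acc < opt_acc:
        psd /= 1.0 + (1.0 - n_acc / opt_acc) * decay
    return psd
```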
Results
Table 5.4 shows the adapting period for the two fixed effects parameters for
one run of the random slopes regression model with the JSP dataset using the
second method. Here again, the starting values have been chosen arbitrarily
to be 1.0 whereas this method could use the RIGLS estimates from MLwiN.
From Table 5.4 it can be seen that as time increases the changes to the proposal
standard deviation become smaller as with a simulated annealing algorithm.
The two methods were each run ten times for 5,000 iterations using the
random slopes regression model, with the ideal acceptance rate set to 44%. The
actual acceptance rates achieved for the two fixed effects were recorded for both
methods. For parameter β0, the first method obtained acceptance rates between
40.0% and 49.0% whilst the second method obtained rates of between 43.3% and
47.3%. For parameter β1, the first method obtained rates between 41.0% and
48.4% whilst the second method obtained rates between 43.3% and 46.4%. It is
not entirely fair to compare these figures directly as the second method was run
for 5,000 iterations every time whereas the first method ran on average for only
2,100 iterations; however the second method does appear to give a more accurate
acceptance rate.
Table 5.4: Demonstration of Adaptive Method 2 for parameters β0 and β1 using arbitrary (1.000) starting values.

  N       β0 SD   NAcc    β1 SD   NAcc
  0       1.000     --    1.000     --
  100     0.587     13    0.512      2
  200     0.574     43    0.274      5
  300     0.440     30    0.161     12
  400     0.484     50    0.109     22
  500     0.499     46    0.091     34
  600     0.437     37    0.081     38
  700     0.364     34    0.083     46
  800     0.403     51    0.072     36
  900     0.396     43    0.067     40
  1,000   0.333     34    0.075     52
  ...
  4,800   0.397     46    0.073     41
  4,900   0.397     44    0.072     38
  5,000   0.398     51    0.072     47
A balance has to be struck between the additional burden of on average 2,900
extra iterations (in this example) in the adapting period and any gain in speed
of obtaining accurate estimates.
Although this second method is an interesting alternative to the first adaptive
method, it will not be considered further in this thesis as it does not offer any
significant improvements. Instead I will now go on to consider multivariate
normal Metropolis updating methods.
5.5 Method 3 : Metropolis Gibbs hybrid method with block updates
One disadvantage of using univariate normal proposal distributions is that the
correlation between parameters is completely ignored, and so if two parameters
are highly correlated it would be desirable to adjust for this in the proposal
distribution. Highly correlated parameters are generally avoided by centering
predictor variables but sometimes large correlations still exist.

The Gibbs sampler algorithm updates parameters in blocks; for example the
fixed effects are updated together and all residuals for one level 2 unit are updated
together. The second hybrid method will mimic the Gibbs sampler steps for
residuals and fixed effects by using multivariate normal proposal Metropolis steps
for these blocks, as described in the following algorithm:
5.5.1 Algorithm 3
Step 1 - The fixed effects, β1.

\[
\beta_1^{(t)} = \begin{cases} \beta_1^{*} & \text{with probability } \min\!\left(1,\; p(\beta_1^{*} \mid y, \ldots)\big/p(\beta_1^{(t-1)} \mid y, \ldots)\right) \\[4pt] \beta_1^{(t-1)} & \text{otherwise,} \end{cases}
\]

where \(\beta_1^{*} = \beta_1^{(t-1)} + \epsilon_1\), \(\epsilon_1 \sim \text{MVN}(0, \Sigma_1)\).

Step 2 - The level l residuals, βl.

For l in 2, …, N and j in 1, …, n_l,

\[
\beta_{lj}^{(t)} = \begin{cases} \beta_{lj}^{*} & \text{with probability } \min\!\left(1,\; p(\beta_{lj}^{*} \mid y, \ldots)\big/p(\beta_{lj}^{(t-1)} \mid y, \ldots)\right) \\[4pt] \beta_{lj}^{(t-1)} & \text{otherwise,} \end{cases}
\]

where \(\beta_{lj}^{*} = \beta_{lj}^{(t-1)} + \epsilon_{lj}\), \(\epsilon_{lj} \sim \text{MVN}(0, \Sigma_{lj})\), and n_l is the number of level l units.

Step 3 - The level 1 scalar variance σ².

This step is the same as in Algorithm 1.

Step 4 - The level l variance, Vl.

This step is the same as in Algorithm 1.
5.5.2 Choosing proposal distribution variances
When using multivariate normal proposal distributions, a similar problem exists
as for the univariate case: the variance matrices for the proposal distributions in
steps 1 and 2 need to be assigned values. This time the Metropolis steps 1 and
2 will work with any positive definite matrix for the proposal variances; however,
ideally a matrix that gives estimates with a reasonable accuracy in the minimum
number of iterations is desired.
Gelman, Roberts, and Gilks (1995) also consider the case of a multivariate
normal Metropolis proposal distribution on multivariate normal posterior distributions.
They calculate the optimal scale factor to be used as a multiplier for the
estimated standard deviation matrix for dimensions 1 to 10 and also found an
asymptotically optimal estimator for this scale factor. This asymptotic estimator
is $2.38/\sqrt{d}$, where d is the dimension of the proposal distribution. When
considering the estimated covariance matrix this multiplier becomes $5.66/d$ (that is, $2.38^2/d$).
This now implies that if an estimate of the covariance matrix of the parameters
of interest can be found then this can be multiplied by this optimal scale factor
and the resulting matrix can be used as the variance matrix for the proposal
distribution. Fortunately MLwiN will give the covariance matrices associated
with both the fixed effects and the residuals and so these values can be used.
Finding optimal scaling factors
When I considered the univariate normal distributions I found that the optimal
value of 5.8 for the scale factor from Gelman, Roberts, and Gilks (1995) did
not actually follow for multi-level models. For the multivariate normal proposal
distributions I will now consider again the random slopes regression model on the
JSP dataset.

The asymptotic estimator gives the value 2.83 when d = 2 as in the random
slopes regression model. The optimal estimate from Gelman, Roberts, and Gilks
(1995) is 2.89, so the asymptotic estimator is fairly accurate when d = 2. I used
a similar approach to that used for the univariate proposal distributions. Several
values of the scaling factor spread over the range 0.02 to 10 were chosen, and for
each value three MCMC runs with a burn-in of 1,000 and a main run of 50,000
were performed. As with the univariate case the Raftery Lewis N statistic was
calculated to measure efficiency. The results for the random slopes regression
model can be seen in Figures 5-4 and 5-5.
From Figures 5-4 and 5-5 the optimal scale factors for both β0 and β1 appear
to be around 0.75, which is far smaller than the values from Gelman, Roberts,
and Gilks (1995). As with the univariate case the acceptance rate gives a far
flatter graph. Here acceptance rates in the range 30% to 75% for β0 and in the
range 30% to 70% for β1 give similar low values of N. Gelman, Roberts, and
Gilks (1995) give the acceptance rate of 35.2% as optimal for a bivariate normal
proposal, which does appear in these ranges.
[Figure 5-4. Two panels plotting the Raftery Lewis N-hat for β0 against the proposal scale factor and against the Metropolis acceptance rate. Caption: Plots of the effect of varying the scale factor for the multivariate normal proposal distribution, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the β0 parameter in the random slopes regression model on the JSP dataset.]
[Figure 5-5. Two panels plotting the Raftery Lewis N-hat for β1 against the proposal scale factor and against the Metropolis acceptance rate. Caption: Plots of the effect of varying the scale factor for the multivariate normal proposal distribution, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the β1 parameter in the random slopes regression model on the JSP dataset.]
The flatness of the acceptance rate
graph around the minimum again implies that finding proposal distributions that
give a desired acceptance rate for every parameter is a better approach than the
scale factor method. This leads us to consider adaptive multivariate samplers.
5.5.3 Adaptive multivariate normal proposal distributions
The adaptive samplers considered for univariate normal proposal distributions
can both be extended to multivariate proposals. I will only consider modifying
the first sampler, which is used in MLwiN, but the alternative sampler based on
simulated annealing could also be easily modified. When considering multivariate
proposals there is more flexibility in the possible variance matrices that can
be considered. I could simply modify the univariate algorithm by allowing the
scale factor to vary and keeping the original estimate of the covariance matrix
(generally from RIGLS) fixed. This is rather restrictive as the proposal variance
would then have to be a scalar multiple of the initial variance estimate and so an
alternative approach will be considered.
In Gelman, Roberts, and Gilks (1995) different optimal acceptance rates are
given for different dimensions and these range from 44% when d = 1 down to 25%.
The optimal acceptance rate when d = 2 is 35.2%, although Figures 5-4 and 5-5
show that for the random slopes regression model acceptance rates between 30%
and 70% are all reasonably good.
Adaptive sampler 3
This sampler is a slightly modified generalisation of adaptive sampler 1. The
objective of the method is again to achieve an acceptance rate of x% for all
blocks of parameters. Effectively any positive definite matrix can be used as an
initial variance matrix for the proposal distribution, but in practice it is better
to use good estimates as problems may occur if there are no changes accepted in
the first batch of 100 iterations.

The algorithm needs the user to input 2 parameters: firstly x%, the desired
acceptance rate, and secondly a tolerance parameter. This tolerance parameter
will work in exactly the same way as in sampler 1. The algorithm runs the
sampler with the current proposal distributions for batches of 100 and at the
end of each batch of 100 iterations the proposal distributions are modified. Each
proposal distribution consists of two distinct parts: firstly the current estimate
of the covariance matrix for the block of parameters considered, and secondly the
scale factor by which this matrix is multiplied to give the proposal distribution
variance. The main difference from the univariate case is that the current estimate
of the covariance matrix is updated after every 100 iterations whereas for the
univariate case the variance estimate remains fixed at the RIGLS estimate.

For the first 100 iterations, the RIGLS estimate for the covariance matrix is
used. Then after each 100 iterations the covariance matrix is calculated from all
the iterations run thus far. The procedure to follow after every 100 iterations is
as given below:
Method
The following algorithm is repeated for each block of parameters. Let NAcc be
the number of moves accepted in the current batch for the chosen block (out of
100), OPTAcc be the desired acceptance rate, and SF be the current proposal
scale factor for the block.
\[
\text{If } N_{Acc} > OPT_{Acc}: \quad SF = SF \times \left(2 - \frac{100 - N_{Acc}}{100 - OPT_{Acc}}\right);
\]
\[
\text{If } N_{Acc} < OPT_{Acc}: \quad SF = SF \Big/ \left(2 - \frac{N_{Acc}}{OPT_{Acc}}\right).
\]
The above will modify the proposal scale factor by a greater amount the
further the current acceptance rate is from the desired acceptance rate. If the
acceptance rate is too small then the proposed new values are too far from the
current value and so the scale factor is decreased. If the acceptance rate is too
high, then the proposed new values are not exploring enough of the posterior
distribution and so the scale factor is increased.
To calculate the actual variance matrix for the proposal distribution, this
scale factor has to be multiplied by the current estimate of the covariance matrix
for the block of parameters. This estimate is based on the values obtained from
the iterations run thus far and after each batch of 100 iterations this estimate is
modified accordingly.

The procedure for checking that the tolerance criteria are satisfied and the
maximum length of the adapting period are both the same as in adaptive sampler
1.
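A minimal sketch of the block version is given below; np.cov as the running covariance estimate, and the names rigls_cov and history (the list of sampled parameter vectors so far), are illustrative assumptions rather than the MLwiN code.

```python
import numpy as np

def tune_scale_factor(sf, n_acc, opt_acc=35):
    """Same adjustment rule as Method 1, applied to the block scale factor."""
    if n_acc > opt_acc:
        sf *= 2.0 - (100.0 - n_acc) / (100.0 - opt_acc)
    elif n_acc < opt_acc:
        sf /= 2.0 - n_acc / opt_acc
    return sf

def block_proposal_cov(sf, history, rigls_cov):
    """Scale factor times the current covariance estimate: the RIGLS estimate
    for the first batch, thereafter the covariance of all iterations so far."""
    if len(history) < 100:
        return sf * rigls_cov
    return sf * np.cov(np.asarray(history), rowvar=False)

# A proposed block move would then be, for a current parameter vector beta:
#   beta_star = beta + np.random.multivariate_normal(
#       np.zeros(len(beta)), block_proposal_cov(sf, history, rigls_cov))
```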
Results
Table 5.5 shows the adapting period for the block of two fixed effects parameters
in the random slopes regression model with the JSP dataset. The starting values
are the RIGLS estimates of the covariance matrix from MLwiN and the desired
acceptance rate is 35%. The columns labelled Vp00, Vp01 and Vp11 give the elements
of the proposal variance matrix. It is interesting to note that there is a huge jump in the proposal
variance matrix after 100 iterations. This is because the actual iterations are
then used to estimate the covariance matrix instead of the RIGLS estimates, and
the estimate after 100 iterations will be less accurate than the RIGLS estimate.
However as the number of iterations increases the accuracy will improve.

From Table 5.5 it can be seen that this block of parameters fulfils the tolerance
criteria by 400 iterations. However as the adapting period also includes the
level 2 residuals, it is not complete until 1,300 iterations when the final set of
residuals satisfy the criteria. It is interesting to note that for the fixed effects
only 17 iterations were accepted in the last block of 100 and so the final proposal
distribution is quite different from the penultimate one. This does not however
seem to have affected the results in Table 5.6.
In Table 5.6 runs of length 50,000 iterations for various different multivariate
proposal distribution methods are compared. These results can also be compared
with Table 5.3 which gives results for the same model but using Gibbs sampling
and univariate proposal distribution methods. The three proposal distributions
considered in Table 5.6 are firstly an arbitrary identity matrix for the proposal
variance, secondly the estimate from RIGLS multiplied by the scale factor (2.9)
as the variance matrix and thirdly the adaptive method with 35% as the desired
acceptance rate.

The method using the identity matrix as the proposal variance shows the
importance of choosing a sensible proposal distribution. It has an acceptance
rate of less than 1%, which leads to huge values of N for the fixed effects. Although
the estimates it produces are similar to the other methods, this is due to using
the starting values from RIGLS.
Table 5.5: Demonstration of Adaptive Method 3 for the β parameter vector using RIGLS starting values.

  N       SF(β)   NAcc   N in Tol    Vp00     Vp01      Vp11
  0       2.90      --      --       0.388    −0.024    0.0053
  100     1.75      12       0       0.062    −0.002    0.0007
  200     1.99      44       1       0.078    −0.005    0.0016
  300     1.99      35       2       0.161    −0.014    0.0030
  400     2.18      41       3*      0.192    −0.015    0.0036
  500     1.55      21       3*      0.119    −0.009    0.0023
  600     1.60      37       3*      0.139    −0.013    0.0036
  700     1.63      36       3*      0.132    −0.011    0.0034
  800     1.29      26       3*      0.152    −0.011    0.0028
  900     1.41      41       3*      0.212    −0.014    0.0033
  1,000   1.21      29       3*      0.220    −0.015    0.0030
  1,300   0.73      17       3*      0.128    −0.009    0.0018
The estimates of parameter standard deviations
it produces are too small due to the low acceptance rate and this gives a better
indication of the method's poor performance.
The method based on the RIGLS starting values and a scale factor of 2.9
gives results that are far better and values of N that are similar to the equivalent
univariate method. The adaptive method, as in the univariate case, improves on
the scale factor method, although the univariate adaptive method appears to do
better than the multivariate method in terms of N values.

This section has shown that Metropolis block updating methods can be
produced by modifying the univariate updating methods. For the one bivariate
example considered the block updating methods do not show any improvement
over their univariate equivalents in terms of minimising expected run lengths
N. There is scope to consider these methods in more detail and to see whether
there are any improvements on other datasets where there is greater correlation
between parameters in a block, but this will not be pursued in this thesis.
Table 5.6: Comparison of results for the random slopes regression model on the JSP dataset using uniform priors for the variances, and different block updating MCMC methods. Each method was run for 50,000 iterations after a burn-in of 500.

Parameter estimates (SD):
  Par.     MH (Vp = I)       MH (RIGLS)       MH Adapt 3
  β0       30.57 (0.359)     30.60 (0.405)    30.58 (0.397)
  β1       0.616 (0.042)     0.615 (0.048)    0.615 (0.048)
  Ωu00     5.656 (1.702)     5.680 (1.735)    5.624 (1.713)
  Ωu01     −0.429 (0.161)    −0.424 (0.163)   −0.427 (0.165)
  Ωu11     0.055 (0.024)     0.054 (0.024)    0.056 (0.024)
  σ²e      26.92 (1.336)     26.98 (1.341)    27.00 (1.342)

Raftery and Lewis diagnostic (N):
  β0       1,045,468         62,443           43,618
  β1       456,906           48,805           36,881
  Ωu00     5,917             6,058            6,017
  Ωu01     4,842             5,065            4,762
  Ωu11     10,424            18,712           14,535
  σ²e      3,860             3,767            3,791

Acceptance rates for fixed effects (%):
  β        0.96%             18.7%            36.2%

Proposal variance matrix:
  Vp00     1.000             0.388            0.128
  Vp01     0.000             −0.024           −0.009
  Vp11     1.000             0.0053           0.0018
5.6 Summary
In this chapter several MCMC methods for fitting N level Gaussian models have
been discussed and algorithms produced. The Gibbs sampling method introduced
in the last chapter for two simple multi-level models was extended to fit general
N level models. For N level Gaussian models, where the conditional distributions
required in the Gibbs algorithm have forms that are easy to simulate from, this
method performs best.

Two other hybrid methods based on a combination of Metropolis and Gibbs
sampling steps were introduced in this chapter. These methods do not perform
as well as the Gibbs method for the Gaussian models but can be easily applied
to models with complicated conditional distributions, as will be seen in the next
chapter. The first method uses univariate normal proposal distributions whilst
the second method uses multivariate normal proposal distributions. The two
methods were compared using a simple example model and no benefit was seen
in using the multivariate proposals, and so the univariate proposal method will
be used in the next chapter.
Two approaches were considered for generating optimal proposal distributions
for the Metropolis steps in the hybrid algorithms. The first approach was
based on using scaled estimates of the variance of the parameter of interest as
variances for the proposal distribution. This approach has some problems as
the results in Gelman, Roberts, and Gilks (1995) for optimal scale factors for
multivariate normal posterior distributions do not follow for multi-level models.
The second approach considered uses an adapting period before the main run
of the Markov chain in which the proposal distributions are modified to give
particular acceptance rates for the parameters of interest. This approach works
better: for the univariate proposal distributions, acceptance rates in the range
45% to 60% lead to close to optimal proposals in all the examples considered.

One type of comparison that is missing from this chapter and this thesis in
general is timing comparisons and they deserve some mention before I close this
chapter.
5.6.1 Timing considerations
In this thesis the only places where timings are included are generally to justify
the dimensions of simulation runs and not to compare individual methods. The
main reason for not including timing comparisons is that I personally think that
they should only be done on released software, on a stand alone machine and by
a third party.
Before the MCMC options were incorporated into the MLwiN package they
existed in a more primitive version as a stand alone C program. At this time
I compared my stand alone code with the BUGS package (Spiegelhalter et al.
1994), mainly to confirm that my code gave reasonable estimates but also to
compare the speed differences. I found that my Gibbs sampling code was slightly
quicker for the few small models tested. This was to be expected as the algorithms
in this chapter generate from the posterior distributions directly whilst the BUGS
package uses the adaptive rejection method.
I would expect now that the BUGS package will outperform the MLwiN Gibbs
sampler for the Gaussian models. This is because the Gibbs sampling code in
MLwiN is embedded beneath the graphical interface which slows the code down
considerably. Some of the methods described in this chapter have not yet been
added to the released version of MLwiN and have not yet been optimised and so
comparisons would not be fair.
All comparisons obviously depend on the efficiency of the coding and the
model considered. However it is generally the case that a Metropolis sampling
step will be quicker than a Gibbs sampling step. Also Metropolis steps should
in general be quicker than rejection sampling algorithms and adaptive rejection
sampling as only one value is generated per parameter per iteration using
Metropolis. The Metropolis steps, however, in general need longer to get estimates
of a given accuracy, as demonstrated earlier in this chapter.
Chapter 6
Logistic Regression Models
6.1 Introduction
In the previous chapters the models considered have been restricted to the family
of Gaussian multi-level models. This family of models is very useful and can
fit most datasets well. The models considered thus far have all had a response
variable that is assumed to be defined on the whole real line. There are many
variables that are not defined on the whole real line, for example age, which must
be positive, and sex, which is either male or female.

In this chapter I am interested in the second type of variable, one which has
two possible states that can be defined as zero and one, i.e. a binary response.
These types of variables, as seen at the end of Chapter 2, also appear in linear
modelling as responses. In this case they are fitted as a Bernoulli response using
generalized linear modelling. The most common way of fitting such a model
is by using the logit link function, which is the canonical link for the binomial
and Bernoulli distributions. The technique is then known as logistic regression.
McCullagh and Nelder (1983) is a useful text for all generalized linear models
including logistic regression models.
In a similar way that Gaussian linear models can be extended to multi-level
Gaussian models, logistic regression models can also be extended to multi-level
logistic regression models.
In this chapter I will define the general model structure for a multi-level
binary response logistic regression model. In the last chapter it was pointed out
that the simple Gibbs sampling method cannot be used to fit these models, as
the full conditional distributions do not all have forms that are easily simulated
from. Gilks (1995) shows that this is true for a very simple logistic regression
model and then gives some alternative ways to fit such models. I will show how
the Metropolis-Gibbs hybrid methods of the last chapter can easily be adapted
to fit these logistic regression models. In fact the motivation behind putting
these hybrid methods in the last chapter was to be able to fit multi-level logistic
regression models.

I will then consider two examples to show some other fields of application
of hierarchical modelling that have models of this structure. The first example
is taken from survey sampling and involves a political voting dataset from the
British Election Study. This example will be used to illustrate how to fit multi-
level binary response models, and how to calculate optimal Metropolis proposal
distributions.

The second example considers a collection of simulated datasets, designed
to represent closely the structure of a dataset used in an analysis of health care
utilisation in Guatemala. These simulated datasets were considered in Rodriguez
and Goldman (1995), where deficiencies in the quasi-likelihood methods used by
MLn to fit binary response models were pointed out. I hope to show that the
Metropolis-Gibbs hybrid methods will improve on the quasi-likelihood methods.
6.2 Multi-level binary response logistic regression models
The multi-level binary response logistic regression model has a similar structure to
the Gaussian models discussed thus far. The only difference is how the response
variable is linked to the predictor variables. When considering the Gaussian
models, a three level model was described and an algorithm for this model was
included. A generalisation was then given to N levels.

A 3 level binary response logistic regression model can be defined as follows:

\[
y_{ijk} \sim \text{Bernoulli}(p_{ijk}), \quad \text{where}
\]
\[
\text{logit}(p_{ijk}) = X_{1ijk}\beta_1 + X_{2ijk}\beta_{2jk} + X_{3ijk}\beta_{3k},
\]
\[
\beta_{2jk} \sim \text{MVN}(0, V_2), \quad \beta_{3k} \sim \text{MVN}(0, V_3).
\]
This 3 level model can be easily extended to N levels in a similar way to the
Gaussian models. Then any of the Metropolis-Gibbs hybrid methods can be
adapted to fit the logistic model. I will only consider the method with univariate
updates here and now show how this can be adapted from the Gaussian version.
6.2.1 Metropolis Gibbs hybrid method with univariate updates
As can be seen in the above model definition, multi-level logistic regression
models do not have variance terms at level 1. This is because for the Bernoulli
distribution, both the mean and the variance are functions of the parameter p_ijk
only (E(y_ijk) = p_ijk, var(y_ijk) = p_ijk(1 − p_ijk)). Therefore if the mean is estimated,
then the variance will be fixed. This means that the algorithm for a general N
level logistic regression model has only three steps.
Notation
In the following algorithm I will use similar notation to that used for the N level
Gaussian models in Chapter 5. Let M_T be the set of all observations in the
model, and let M_{l,j} be the set of observations that at level l are in category j.
Let X_{li} be the vector of variables at level l for observation i, where l = 1 refers
to the variables associated with the fixed effects. Let the random parameters
at level l, l > 1, be denoted by β_{lj}, where j indexes the combinations of higher
level terms, and let the fixed effects be β_1. Finally let V_l be the level l variance
matrix. I will use the abbreviation (Xβ)_i to mean the sum of all the predictor
terms for observation i. For example, in the three level model definition earlier,
(Xβ)_i = X_{1ijk}β_1 + X_{2ijk}β_{2jk} + X_{3ijk}β_{3k}. Using these notational short cuts, the
model can be written:

\[
y_i \sim \text{Bernoulli}(p_i), \quad \text{where} \quad \text{logit}(p_i) = (X\beta)_i, \quad \beta_{lj} \sim \text{MVN}(0, V_l).
\]
Algorithm
The main differences between this algorithm and the Gaussian algorithm arise
from the different likelihood functions. For prior distributions, I will allow the
fixed effects to have any prior distribution and the level l variance to have a
general inverse Wishart prior, V_l ∼ IW(S_{Pl}, ν_{Pl}). The three steps of the algorithm
are then as follows:

Step 1 - The fixed effects, β1.

For i in 1, …, N_{Fixed},

\[
\beta_{1i}^{(t)} = \begin{cases} \beta_{1i}^{*} & \text{with probability } \min\!\left(1,\; p(\beta_{1i}^{*} \mid y, \ldots)\big/p(\beta_{1i}^{(t-1)} \mid y, \ldots)\right) \\[4pt] \beta_{1i}^{(t-1)} & \text{otherwise,} \end{cases}
\]

where \(\beta_{1i}^{*} = \beta_{1i}^{(t-1)} + \epsilon_{1i}\), \(\epsilon_{1i} \sim N(0, \sigma^2_{1i})\), and

\[
p(\beta_{1i} \mid y, \ldots) \;\propto\; p(\beta_1) \times \prod_{i \in M_T} \left(1 + e^{-(X\beta)_i}\right)^{-y_i}\left(1 + e^{(X\beta)_i}\right)^{y_i - 1}.
\]
Step 2 - The level l residuals, βl.

For l in 2, …, N, j in 1, …, n_l, and i in 1, …, n_{rl},

\[
\beta_{lji}^{(t)} = \begin{cases} \beta_{lji}^{*} & \text{with probability } \min\!\left(1,\; p(\beta_{lji}^{*} \mid y, \ldots)\big/p(\beta_{lji}^{(t-1)} \mid y, \ldots)\right) \\[4pt] \beta_{lji}^{(t-1)} & \text{otherwise,} \end{cases}
\]

where \(\beta_{lji}^{*} = \beta_{lji}^{(t-1)} + \epsilon_{lji}\), \(\epsilon_{lji} \sim N(0, \sigma^2_{lji})\), and

\[
p(\beta_{lji} \mid y, \ldots) \;\propto\; \prod_{i \in M_{l,j}} \left(1 + e^{-(X\beta)_i}\right)^{-y_i}\left(1 + e^{(X\beta)_i}\right)^{y_i - 1} \times |V_l|^{-\frac{1}{2}} \exp\!\left[-\tfrac{1}{2}\beta_{lj}^{T} V_l^{-1} \beta_{lj}\right].
\]
Step 3 - The level l variance, Vl.

\[
p(V_l^{-1} \mid y, \ldots) \;\propto\; p(\beta_l \mid V_l)\, p(V_l^{-1}),
\]
\[
V_l^{-1} \sim \text{Wishart}_{n_{rl}}\!\left[S_{pos} = \left(\sum_{i=1}^{n_l} \beta_{li}\beta_{li}^{T} + S_{Pl}\right)^{-1},\; \nu_{pos} = n_l + \nu_{Pl}\right],
\]

where n_l is the number of level l units. If we want a uniform prior then we need
S_{Pl} = 0 and ν_{Pl} = −n_{rl} − 1, where n_{rl} is the number of random variables at level
l.
Note that it is possible, by reparameterising this model, to use the Gibbs
sampler instead of the Metropolis sampler for residuals at levels 3 and upwards,
but this is not considered in this thesis.
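As an illustration of step 1, the following Python sketch performs a single-site random walk Metropolis update for one fixed effect, working with the log of the Bernoulli likelihood above for numerical stability. The array names (X, y, and eta for the full linear predictor including the residual terms) and the log_prior function are assumptions for the sketch, not the MLwiN code.

```python
import numpy as np

def bernoulli_log_lik(y, eta):
    """Log of the likelihood term (1 + exp(-eta))^(-y) * (1 + exp(eta))^(y - 1)."""
    return np.sum(-y * np.log1p(np.exp(-eta)) + (y - 1) * np.log1p(np.exp(eta)))

def update_fixed_effect(beta, k, X, y, eta, psd, log_prior):
    """Random walk Metropolis update for the k-th fixed effect beta[k]."""
    prop = beta[k] + psd * np.random.randn()
    eta_prop = eta + (prop - beta[k]) * X[:, k]   # update the linear predictor
    log_ratio = (bernoulli_log_lik(y, eta_prop) + log_prior(prop)
                 - bernoulli_log_lik(y, eta) - log_prior(beta[k]))
    if np.log(np.random.rand()) < log_ratio:
        beta[k] = prop
        return eta_prop
    return eta
```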
6.2.2 Other existing methods
The existing methods for fitting multi-level logistic regression models in MLwiN
are described briefly at the end of Chapter 2. They are quasi-likelihood methods
based around Taylor series expansions. Marginal quasi-likelihood (MQL) is
described in Goldstein (1991), and penalised quasi-likelihood (PQL) is introduced
in Laird (1978).

MCMC methods have also been used to fit these models. Zeger and Karim
(1991) give a Gibbs sampling approach to fitting multi-level logistic regression
models amongst other multi-level models. They use rejection sampling with
a Gaussian kernel that is a good estimate of the current likelihood function
to generate new estimates for the high level residuals. The BUGS package
(Spiegelhalter et al. 1994) will also fit these models. It uses adaptive rejection
sampling (Gilks and Wild 1992), as discussed in Chapter 3, in place of rejection
sampling. Breslow and Clayton (1993) consider multi-level logistic regression
models within the family of generalized linear mixed models. They perform some
brief comparisons between the quasi-likelihood methods and the Gibbs sampling
results in Zeger and Karim (1991).
6.3 Example 1 : Voting intentions dataset
6.3.1 Background
The dataset used in this example is a component of the British Election Study
analysed in Heath, Yang, and Goldstein (1996). This dataset also appears in
Goldstein et al. (1998) where it is used as the main example in the binary
response models chapter. The subsample analysed in Goldstein et al. (1998)
contains data on 800 voters from 110 constituencies, who were asked how they
voted in the 1983 election. Their response was categorised as to whether they
voted Conservative or not, and the interest was in how the voters' opinions on
certain issues influenced their voting intentions.

The explanatory variables are the voters' opinions, scored on a 21 point
scale and then centred around its mean, for the following four issues: firstly
whether Britain should possess nuclear weapons (Def); secondly whether low
unemployment or low inflation is more important (Unemp); thirdly whether
they would prefer tax cuts or higher taxes to pay for more government spending
(Tax); and finally whether they are in favour of privatisation of public services
(Priv).
I will use this example for two purposes: firstly, to show the differences in
the estimates produced by the Metropolis-Gibbs hybrid method and the quasi-
likelihood methods, and secondly, to see if the findings on the ideal scaling for
the Metropolis proposal distributions from the last chapter extend to logistic
regression models.
6.3.2 Model
The model fitted to the dataset is the same model as in Goldstein et al. (1998).
There will be five fixed effects: an intercept term and fixed effects for the four
opinion variables described above, along with a random term to measure the
constituency effect. Let p_ij be the probability that the ith voter in the jth
constituency voted Conservative; then

\[
\text{logit}(p_{ij}) = \beta_1 + \beta_2 \text{Def}_{ij} + \beta_3 \text{Unemp}_{ij} + \beta_4 \text{Tax}_{ij} + \beta_5 \text{Priv}_{ij} + u_j,
\]

where u_j ∼ N(0, σ²u).

To translate this to the response variable y_ij requires y_ij ∼ Bernoulli(p_ij).
This model now fits into the framework described in the earlier section and can
be fitted using the Metropolis Gibbs hybrid method.
6.3.3 Results
The two quasi-likelihood methods, MQL and PQL, and the Metropolis Gibbs
hybrid method were all used to fit the above model and the results are given
in Table 6.1. The Metropolis Gibbs hybrid method was run using the adaptive
method described in the last chapter with a desired acceptance rate of 44%.
A uniform prior was used for the variance parameter σ²u.
Table 6.1: Comparison of results from the quasi-likelihood methods and the MCMC method for the voting intentions dataset. The MCMC method is based on a run of 50,000 iterations after a burn-in of 500 and an adapting period.

  Par.    MQL 1             PQL 2             MH Adapt
  β1      −0.355 (0.092)    −0.367 (0.094)    −0.375 (0.102)
  β2      0.089 (0.018)     0.092 (0.018)     0.095 (0.019)
  β3      0.045 (0.019)     0.046 (0.019)     0.046 (0.020)
  β4      0.067 (0.013)     0.069 (0.014)     0.070 (0.014)
  β5      0.138 (0.018)     0.143 (0.018)     0.146 (0.019)
  σ²u     0.132 (0.112)     0.154 (0.117)     0.253 (0.154)
From the table it can be seen that the PQL method gives parameter estimates
that are larger (in magnitude) than the MQL method estimates for both the
fixed effects and the level 2 variance. The MCMC method gives estimates that
are larger (in magnitude) than both quasi-likelihood methods, particularly for the
level 2 variance. In the simulations in Chapter 4 it was shown that for Gaussian
models the uniform prior for the variance parameter gave variance estimates that
were biased high, particularly for small datasets. This dataset is not particularly
small (110 level 2 units) and so more investigation is needed on which of the
methods is giving the better variance estimate. Analysis of which method is
performing best in terms of bias and coverage properties will be performed on
the second example in this chapter.
6.3.4 Substantive Conclusions
Considering just the PQL results and back-transforming the variables onto an
interpretable scale, the following conclusions can be made. Firstly, a voter with
average views on all four issues (Def_ij = Unemp_ij = Tax_ij = Priv_ij = 0) had
a 40.9% probability of voting Conservative. A voter was more likely to vote
Conservative if they were in favour of Britain possessing nuclear weapons (5
points above the average score implies a 52.3% probability of voting Conservative).
They were more likely to vote Conservative if they preferred low inflation to
low unemployment (5 points above the average score implies a 46.6% probability of
voting Conservative). They were more likely to vote Conservative if they preferred
low taxes rather than higher government spending (5 points above the average score
implies a 49.5% probability of voting Conservative). They were more likely to
vote Conservative if they were in favour of privatising public services (5 points
above the average score implies a 58.6% probability of voting Conservative). The
unexplained variation at the constituency level is fairly large in practical terms
but not statistically significant.
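The back-transformation used above is simply the inverse logit. As a check on the quoted figures, the following minimal sketch recomputes them from the PQL point estimates in Table 6.1; the dictionary of coefficients is just a convenient re-packaging of that table for illustration.

```python
import math

def inv_logit(eta):
    """Back-transform from the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

pql = {"intercept": -0.367, "Def": 0.092, "Unemp": 0.046, "Tax": 0.069, "Priv": 0.143}

print(round(inv_logit(pql["intercept"]), 3))                    # average voter: 0.409
print(round(inv_logit(pql["intercept"] + 5 * pql["Def"]), 3))   # +5 on Def:     0.523
print(round(inv_logit(pql["intercept"] + 5 * pql["Priv"]), 3))  # +5 on Priv:    0.586
```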
6.3.5 Optimum proposal distributions
In the last chapter I showed that the optimum value for the scaling factor for the
variance of a univariate normal proposal distribution, as suggested by Gelman,
Roberts, and Gilks (1995), does not generally follow for multi-level models. I will
now do a similar analysis for the multi-level logistic regression model by considering
the voting intentions example. To find optimal proposal distributions several
values of the scaling factor spread over the range 0.05 to 20 were chosen. For
each value of the scaling factor 3 runs were performed with a burn-in of 500
iterations and a main run of 50,000 iterations. As before the Raftery Lewis N
statistic was calculated for each parameter in each run. Then the optimal scaling
factor was chosen to be the value that minimises N. The results for the voting
intentions dataset are summarised in Table 6.2.
Table 6.2: Optimal scale factors for proposal variances and best acceptance rates for the voting intentions model.

  Par.    Optimal SF    Acceptance %    Min N
  β1      4.0           40%-60%         15K
  β2      5.5           40%-60%         13K
  β3      5.5           40%-60%         14K
  β4      5.0           40%-60%         14K
  β5      5.5           40%-60%         14K
Table 6.2 shows that for this model the results in Gelman, Roberts, and Gilks
(1995) appear to be close to the optimal values obtained from the simulations.
One problem that is not highlighted by this table is that for this model the worst
mixing properties are exhibited by the level 2 variance parameter, σ²u. Figure 6-1
shows the effect of varying the scale factor on the N value for the parameter σ²u
along with a best fit loess curve. From this figure there appears to be no clear
relationship between the value of the scale factor and the N values for σ²u. This
parameter is actually updated using Gibbs sampling so is only affected indirectly
by modifying the scale factor. The loess curve does show a slight upturn as the
scale factor gets smaller but there is far greater variability in the N values and
so a clear relationship is more difficult to establish.
[Figure 6-1. Scatter plot with loess curve of the Raftery Lewis N-hat for σ²u against the proposal scale factor. Caption: Plot of the effect of varying the scale factor for the univariate normal proposal distribution on the Raftery Lewis diagnostic for the σ²u parameter in the voting intentions dataset.]
To check whether the behaviour seen above for the scale factor holds for other
multi-level logistic regression models I considered again an example from Chapter
2. At the end of Chapter 2 I introduced multi-level logistic regression models
by converting the response variable M5 from the JSP dataset into a pass/fail
indicator, Mp5, which depended on whether the M5 mark was at least 30 or not.
Model 2.4 (below), amongst others, was fitted to the JSP dataset.

\[
\text{Mp5}_{ij} \sim \text{Bernoulli}(p_{ij}),
\]
\[
\log(p_{ij}/(1 - p_{ij})) = \beta_0 + \beta_1 \text{M3}_{ij} + \text{SCHOOL}_j,
\]
\[
\text{SCHOOL}_j \sim N(0, \sigma^2_s).
\]

I then repeated the procedure for finding the optimal proposal distributions
detailed above using this model instead of the voting intentions model. For this
model the optimum value of the scale factor was found to be between 0.1 and
0.2 for both β0 and β1. The minimum values of N for β0 and β1 (both roughly
60,000) were for this model found to be far greater than the minimum N
value for σ²u (roughly 5,000). The only common factor between the two models
is the optimal acceptance rate, which for Model 2.4 is in the range 40% to 70%
for both β0 and β1. This adds further weight to using the adapting procedure
detailed in Chapter 5.
As in Chapter 5, no simple formula for an optimal scaling factor appears to
exist for the multi-level models in this chapter. However if the adapting method
is used, a desired acceptance rate in the range 40% to 60% for the univariate
normal Metropolis proposals will give close to optimal proposal distributions.
6.4 Example 2 : Guatemalan child health dataset
6.4.1 Background
The original Guatemalan Child Health dataset consisted of a subsample of
respondents from the 1987 National Survey of Maternal and Child Health. The
subsample has 2449 responses and a three level structure of births within mothers
within communities. The subsample consists of all women from the chosen
communities who had some form of prenatal care during pregnancy. The response
variable is whether this prenatal care was modern (physician or trained nurse) or
not.
Rodriguez and Goldman (1995) use the structure of this dataset to consider
how well quasi-likelihood methods compare to considering the dataset without
the multi-level structure and fitting a standard logistic regression. They perform
this by constructing simulated datasets based on the original structure but with
known true values for the fixed effects and variances. Rodriguez and Goldman
(1995) consider the MQL method and show that the estimates of the fixed effects
produced by MQL are worse than the estimates produced by a standard logistic
regression disregarding the multi-level structure.

Goldstein and Rasbash (1996) consider the same problem but consider the
PQL method. They show that the results produced by PQL second order
estimation are far better than for MQL but still biased. They also state that the
example considered, with large underlying random parameter values, is unusual.
If the variances in a variance component model do not exceed 0.5, which is more
common, the first order PQL estimation method and even the first order MQL
method will be adequate.

Although Rodriguez and Goldman (1995) considered several different models
and several different values for the parameters, Goldstein and Rasbash (1996)
only consider the model with the Guatemala structure and parameter values
that MQL performs badly on. I will now try using the MCMC Metropolis Gibbs
hybrid method described earlier in the chapter to fit the same model and compare
the results.
6.4.2 Model
The model considered is as follows :
\[
y_{ijk} \sim \text{Bernoulli}(p_{ijk}), \quad \text{where}
\]
\[
\text{logit}(p_{ijk}) = \beta_0 + \beta_1 x_{1ijk} + \beta_2 x_{2jk} + \beta_3 x_{3k} + u_{jk} + v_k,
\]
\[
u_{jk} \sim N(0, \sigma^2_u) \quad \text{and} \quad v_k \sim N(0, \sigma^2_v).
\]

In this formulation i, j and k index the level 1, 2 and 3 units respectively. The
variables x1, x2 and x3 are composite variables at each level, as the original model
contained many covariates at each level.
6.4.3 Original 25 datasets
Goldstein and Rasbash (1996) considered the first 25 datasets simulated by
Rodriguez and Goldman (1995) to construct a table of comparisons between
MQL and PQL (Table 1 in Goldstein and Rasbash (1996)). This table will
now be reconstructed here but will also include the MCMC method with the two
alternative priors for the variance parameters. The two priors to be considered
are firstly a Gamma(ε, ε) prior for the precision parameters at both levels 2 and 3,
and secondly a uniform prior for the variance parameters at levels 2 and 3. The
MCMC procedures were run using a burn-in of 500 iterations after the adaptive
method with desired acceptance rate 44%. The main run was of length 100,000
for each dataset, which was based on preliminary analysis using the Raftery-Lewis
diagnostic. The MCMC methods took 2 hours each per dataset using MLwiN on
a Pentium 200 MHz PC.

The quasi-likelihood results here vary from those seen in Goldstein and
Rasbash (1996) as the tolerance was set at the default level in MLwiN (relative
change from one iteration to the next of at most 0.01). In Goldstein and Rasbash
(1996) a more stringent convergence criterion is used (relative change from one
iteration to the next of at most 0.001) and the results are in that case slightly less
biased. The other difference with this table is that I am considering the variance
parameters rather than the standard deviations at levels 2 and 3, as this is the
variable reported by MLwiN. The results of this analysis can be seen in Table
6.3.
Results
From Table 6.3 the improvements achieved by the MCMC methods in terms of
bias can be clearly seen. Both prior distributions for the variance parameters
give results that are less biased than the PQL 2 method. The biases in the
random parameters are more extreme for the quasi-likelihood methods when the
variances are considered instead of the standard deviations. There is little to
choose between the two MCMC variance priors in this example.
Table 6.3: Summary of results (with Monte Carlo standard errors) for the first 25 datasets of the Rodriguez Goldman example.

  Parameter (True)    MQL 1            PQL 2            Gamma prior      Uniform prior
  β0 (0.65)           0.483 (0.028)    0.615 (0.035)    0.643 (0.037)    0.659 (0.038)
  β1 (1.00)           0.758 (0.033)    0.948 (0.043)    0.996 (0.045)    1.017 (0.047)
  β2 (1.00)           0.760 (0.013)    0.951 (0.018)    1.004 (0.020)    1.025 (0.021)
  β3 (1.00)           0.744 (0.033)    0.952 (0.043)    0.997 (0.044)    1.022 (0.046)
  σ²v (1.00)          0.530 (0.016)    0.837 (0.036)    0.970 (0.054)    1.066 (0.057)
  σ²u (1.00)          0.025 (0.008)    0.513 (0.036)    0.928 (0.041)    1.044 (0.044)
The gamma prior gives results that are biased on the low side for the variance parameters
while the uniform prior gives estimates that are biased high for both variances
and fixed effects. With only 25 datasets the coverage properties of the 4 methods
cannot be evaluated and so more datasets are needed.
6.4.4 Simulating more datasets
Although the results from these 25 datasets appear to show improvement in the
unbiasedness of estimates produced by the MCMC methods over the quasi-
likelihood methods, to emphasise this improvement more datasets are needed.
Also with more datasets the coverage properties of the methods as well as the
bias can be evaluated.
Simulation procedure
The MLwiN SIMU, PRED and BRAN commands (Rasbash and Woodhouse
1995) were used to generate 500 datasets with the same underlying structure as
the 25 datasets from Rodriguez and Goldman (1995). The simulation procedure
is as follows (a sketch of the procedure is given after the list):

1. Generate 161 v_k's, one for each community, by drawing from a normal
distribution with mean 0 and variance σ²v (1.0).

2. Generate 1558 u_jk's, one for each mother, by drawing from a normal
distribution with mean 0 and variance σ²u (1.0).

3. Evaluate logit(p_ijk) = β0 + β1 x_1ijk + β2 x_2jk + β3 x_3k + u_jk + v_k for all 2449
births using the PRED command.

4. Use the BRAN command to generate a 0 or 1 (y_ijk) for each birth based
on the relative likelihoods of a 0 and 1 response and a random uniform draw.
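The same procedure can be expressed outside MLwiN. The sketch below is a generic NumPy re-expression (it does not use the SIMU, PRED or BRAN commands); the covariate arrays and the index arrays mapping births to mothers and mothers to communities are assumed inputs for illustration.

```python
import numpy as np

def simulate_dataset(x1, x2, x3, mother_of_birth, community_of_mother,
                     beta=(0.65, 1.0, 1.0, 1.0), var_u=1.0, var_v=1.0, rng=None):
    """Generate one simulated response vector for 2449 births nested in
    1558 mothers nested in 161 communities (the Guatemala structure)."""
    rng = rng or np.random.default_rng()
    v = rng.normal(0.0, np.sqrt(var_v), size=161)     # community effects
    u = rng.normal(0.0, np.sqrt(var_u), size=1558)    # mother effects
    community_of_birth = community_of_mother[mother_of_birth]
    eta = (beta[0] + beta[1] * x1 + beta[2] * x2[mother_of_birth]
           + beta[3] * x3[community_of_birth]
           + u[mother_of_birth] + v[community_of_birth])
    p = 1.0 / (1.0 + np.exp(-eta))                    # inverse logit
    return (rng.uniform(size=p.size) < p).astype(int) # Bernoulli(p) draws
```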
The same model considered so far was fitted to these 500 datasets using the
two quasi-likelihood methods and the MCMC method using the two different
priors. As the MCMC methods are time consuming the main run length was
reduced for finding estimates for the 500 datasets from 100,000 iterations to
25,000 iterations. The adaptive procedure and a `burn-in' of 500 iterations were
used as before. The results for the 500 simulations can be seen in Table 6.4.
Results
Table 6.4 confirms the bias results already seen when only 25 datasets were
considered. The MQL method performs badly for the fixed effects and hopelessly
for the variance parameters. The PQL method performs a lot better with smaller
fixed effect biases but still shows large bias for the variances.

Both the MCMC priors perform much better than the quasi-likelihood
methods and there is little bias. The one exception is the variance parameters
using the Uniform prior, where some positive bias similar to that seen in the
Gaussian models in Chapter 4 can be seen.

The coverage properties are illustrated in Table 6.4 and Figures 6-2 and 6-3.
Here we see again how poor the MQL method is on this example. In fact the
method is so poor at estimating σ²u that none of the 500 datasets has a 95%
interval estimate that contains the true value. This is partly due to the large
number of runs that have level 2 variance estimates of 0. The PQL method
does reasonably well in terms of coverage for the fixed effects but not as well as
the MCMC methods. It also gives variance estimates with very poor coverage
properties, particularly at level 2.

The MCMC methods both give very similar coverage properties and are both
a vast improvement over the quasi-likelihood methods. The Uniform prior in
general gives slightly better coverage estimates but this is not always true and
the Uniform prior also has larger interval widths.
Table 6.4: Summary of results (with Monte Carlo standard errors) for the Rodriguez Goldman example with 500 generated datasets.

Estimates (Monte Carlo SE):
  Parameter (True)    MQL 1            PQL 2            Gamma prior      Uniform prior
  β0 (0.65)           0.474 (0.007)    0.612 (0.009)    0.638 (0.010)    0.655 (0.010)
  β1 (1.00)           0.741 (0.007)    0.945 (0.009)    0.991 (0.010)    1.015 (0.010)
  β2 (1.00)           0.753 (0.004)    0.958 (0.005)    1.006 (0.006)    1.031 (0.005)
  β3 (1.00)           0.727 (0.009)    0.942 (0.011)    0.982 (0.012)    1.007 (0.013)
  σ²v (1.00)          0.550 (0.004)    0.888 (0.009)    1.023 (0.011)    1.108 (0.011)
  σ²u (1.00)          0.026 (0.002)    0.568 (0.010)    0.964 (0.018)    1.130 (0.016)

Coverage probabilities (90%/95%):
  β0                  67.6/76.8        86.2/92.0        86.8/93.2        88.6/93.6
  β1                  56.2/68.6        90.4/96.2        92.8/96.4        92.2/96.4
  β2                  13.2/17.6        84.6/90.8        88.4/92.6        88.6/92.8
  β3                  59.0/69.6        85.2/89.8        86.2/92.2        88.6/93.6
  σ²v                 0.6/2.4          70.2/77.6        89.4/94.4        87.8/92.2
  σ²u                 0.0/0.0          21.2/26.8        84.2/88.6        88.0/93.0

Average interval widths (90%/95%):
  β0                  0.494/0.589      0.617/0.735      0.668/0.798      0.693/0.828
  β1                  0.572/0.681      0.668/0.796      0.734/0.875      0.750/0.895
  β2                  0.274/0.327      0.336/0.400      0.388/0.463      0.399/0.476
  β3                  0.626/0.746      0.781/0.930      0.850/1.014      0.881/1.053
  σ²v                 0.339/0.404      0.535/0.638      0.731/0.878      0.789/0.948
  σ²u                 0.149/0.177      0.496/0.591      1.058/1.251      1.105/1.315
[Figure 6-2. Three panels (one each for β0, β1 and β2) plotting the percentage of observed points at or below each posterior predictive percentile for MQL, PQL, MCMC Gamma and MCMC Uniform. Caption: Plots comparing the actual coverage of the four estimation methods with their nominal coverage for the parameters β0, β1 and β2.]
[Figure 6-3. Three panels (one each for β3, σ²v and σ²u) plotting the percentage of observed points at or below each posterior predictive percentile for MQL, PQL, MCMC Gamma and MCMC Uniform. Caption: Plots comparing the actual coverage of the four estimation methods with their nominal coverage for the parameters β3, σ²v and σ²u.]
6.4.5 Conclusions
Rodriguez and Goldman (1995) originally pointed out the deficiencies of the MQL
method on multi-level logistic regression models with structures similar to our
datasets. Goldstein and Rasbash (1996) then showed how the PQL method
improved greatly on the MQL method but still showed some bias. In this section
I have shown that the Metropolis-Gibbs hybrid method described earlier in this
thesis gives even better estimates both in terms of bias and coverage. It is also
clear that the choice of prior distribution for the variance parameters is not as
important in this problem due to the large numbers of level 2 and 3 units, and
both priors considered give better estimates than the quasi-likelihood methods.

One point to note, as shown in the simulations in Breslow and Clayton (1993),
is that the under-estimation from the quasi-likelihood methods is worst when
there is a Bernoulli response. When the model is changed to a binomial response
and the denominator increased this under-estimation is reduced. I will discuss
briefly models with a general binomial response in Chapter 8.
6.5 Summary
In this chapter the family of hierarchical binary response logistic regression
models has been introduced and it has been shown how to adapt the Metropolis-
Gibbs hybrid method to fit these models. Two examples that show yet more
applications of multi-level modelling were used to illustrate fitting such models.
The first example had data on the intentions of voters in the 1983 election, and
was used to demonstrate the differences between the estimates from the quasi-
likelihood methods and the MCMC methods. It was also used to find optimal
proposal distributions for the Metropolis steps of the MCMC method.

The second example used simulated datasets based on a dataset on child
health in Guatemala. This example was used to compare the performance of the
quasi-likelihood methods, MQL and PQL, against the Metropolis-Gibbs hybrid
method using two prior distributions for the variance parameters, where the true
values of all parameters were known. In this example it was shown that the
MCMC methods outperform the quasi-likelihood methods both in terms of bias
and coverage properties. It was also shown that with such a large dataset the
choice of prior distribution for the variance is not of great importance.
Chapter 7
Gaussian Models 3 - Complex Variation at level 1

7.1 Model definition
In Chapters 4 and 5 I considered Gaussian models with a simple level 1 variance
σ². In all the algorithms given the Gibbs sampler was used to update all variance
parameters. Although in Chapter 5 I considered some hybrid methods containing
a mix of Gibbs and Metropolis sampling steps, the Metropolis steps updated the
fixed effects and residuals, and all variance parameters were always updated using
the Gibbs sampler.

In this chapter I will remove the restriction that the model must have a simple
constant variance at level 1 and instead allow the level 1 variance to depend on
other predictor variables. I will consider by way of an example the following
simple two level model with complex variation at level 1.

\[
y_{ij} = X^{F}_{ij}\beta^{F} + X^{R}_{ij}\beta^{R}_{j} + X^{C}_{ij}e_{ij},
\]
\[
e_{ij} \sim \text{MVN}(0, V_C), \quad \beta^{R}_{j} \sim \text{MVN}(0, V_R).
\]
If I consider again the JSP dataset, introduced in Chapter 2, with M5, the
maths score in year 5, as the response variable, then I could extend the random
slopes regression model by allowing the M3 predictor variable to be random at
both levels 1 and 2. What this then means is that the variability of an
individual pupil's M5 score is not only dependent on school level variables but
also on his/her individual maths score in year 3. The model can be written as
follows:

\[
\text{M5}_{ij} = \beta^{F}_{0} + \beta^{R}_{0j} + e_{0ij} + \text{M3}_{ij}\left(\beta^{F}_{1} + \beta^{R}_{1j} + e_{1ij}\right),
\]
\[
e_{ij} = \begin{pmatrix} e_{0ij} \\ e_{1ij} \end{pmatrix} \sim \text{MVN}(0, V_C), \quad
\beta^{R}_{j} = \begin{pmatrix} \beta^{R}_{0j} \\ \beta^{R}_{1j} \end{pmatrix} \sim \text{MVN}(0, V_R).
\]
Then
\[
\text{Var}(\text{M5}_{ij} \mid \beta^{F}, \beta_j) = V_{C00} + 2V_{C01}\text{M3}_{ij} + V_{C11}(\text{M3}_{ij})^2,
\]
and in this case a simple Gibbs sampling approach cannot be used to update the level 1 variance
parameters. In the random slopes regression example in Chapter 2 there was a
single e_ij for each observation (pupil) which was evaluated as follows:

\[
e_{ij} = y_{ij} - X^{F}_{ij}\beta^{F} - X^{R}_{ij}\beta^{R}_{j}.
\]

Then the level 1 variance, σ², could be updated via a scaled inverse χ² distribution
based solely on the e_ij's. When the variation at level 1 is complex there is not a
single e_ij for each observation, and instead

\[
X^{C}_{ij}e_{ij} = y_{ij} - X^{F}_{ij}\beta^{F} - X^{R}_{ij}\beta^{R}_{j},
\]

where X^C_ij is the data vector of variables that are random at level 1 for observation
ij, and e_ij is the vector of residuals for observation ij.

From this I would have to estimate each vector e_ij via an MCMC updating
rule, which is difficult. Instead I propose to use Metropolis Hastings sampling on
the level 1 variance matrix. Before discussing the problem of creating a Hastings
update for a variance matrix, I will first consider the simpler case of using MCMC
to update a scalar variance.
7.2 Updating methods for a scalar variance

In earlier chapters I dealt with multi-level models where the variance at level 1, σ², is a scalar. In these chapters the conditional distribution of σ² is easily determined, and a Gibbs sampling step can be used to update σ². Similarly BUGS (Spiegelhalter et al. 1994) fits such models using an adaptive rejection Gibbs sampling algorithm. I am interested in finding alternative approaches for sampling parameters that are restricted to being strictly positive. Two alternative approaches are described below.
7.2.1 Metropolis algorithm for log σ²

I have already used the Metropolis algorithm via a normal proposal distribution to update parameters defined on the whole real line. A similar approach can be used on the current problem, but firstly the variable of interest must be transformed to a variable that is defined on the whole real line.
Draper and Cheal (1997), when analysing a problem from Copas and Li (1997), use the approach of transforming σ² to log σ². They then use a multivariate normal proposal for all unknowns in the model, including log σ². I will use a univariate proposal for σ² as I am updating it separately from the other unknowns. I will therefore consider updating log σ² by using a univariate normal(0, s²) proposal distribution.
As the parameter of interest has been transformed from σ² to log σ², I need to consider the Jacobian of the transformation and build this into the prior. The one disadvantage of this method is that it does not extend to the harder problem where I have a variance matrix Σ to update. The technique of transforming variance parameters to the log scale is not unique to Draper and Cheal (1997) and is also used in Section 11.6 of Gelman et al. (1995) on a hierarchical model.
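As a hedged illustration of this first approach, the following Python sketch performs a single Metropolis update of log σ² for the known-mean normal example of Section 7.2.3; the Jacobian of the log transformation enters as the extra log σ² term in the target, and the function and argument names are my own rather than anything from MLwiN or BUGS.

import numpy as np

def log_post_sigma2(sig2, y, nu0=3.0, s20=6.0):
    # Log posterior of sigma^2 (up to a constant): N(0, sigma^2) likelihood
    # for the data y with a scaled inverse chi-squared (nu0, s20) prior.
    n = len(y)
    return (-(n + nu0) / 2.0 - 1.0) * np.log(sig2) - (np.sum(y**2) + nu0 * s20) / (2.0 * sig2)

def metropolis_log_sigma2(sig2, y, s, rng):
    # One random-walk Metropolis update of log(sigma^2) with a Normal(0, s^2)
    # proposal; the Jacobian of the log transform adds log(sigma^2) to the target.
    prop = np.exp(np.log(sig2) + rng.normal(0.0, s))
    log_acc = (log_post_sigma2(prop, y) + np.log(prop)) - (log_post_sigma2(sig2, y) + np.log(sig2))
    return prop if np.log(rng.uniform()) < log_acc else sig2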
7.2.2 Hastings algorithm for σ²

In Chapter 3 I used normal proposal distributions for unknown parameters θ, with mean fixed at the current value θ_t, to generate θ_{t+1}. This type of proposal has the advantage that p(θ_{t+1} | θ_t) = p(θ_t | θ_{t+1}), and so the Metropolis algorithm can be used.
As I am now considering a parameter σ² that is restricted to be positive, I want to use a proposal that generates strictly positive values. I am therefore going to use a scaled inverse chi-squared distribution with expectation the current estimate σ²_t to generate σ²_{t+1}. The scaled inverse chi-squared distribution with parameters ν and s² has expectation νs²/(ν - 2), so letting ν = w + 2 and s² = wσ²_t/(w + 2), where w is a positive integer degrees of freedom parameter, produces a distribution with expectation σ²_t. The parameter w can be set to any value, and plays a similar role to the variance parameter in the Metropolis proposal distribution, as it affects the acceptance probability.
This proposal is not symmetric, p(σ²_{t+1} | σ²_t) ≠ p(σ²_t | σ²_{t+1}), so the Hastings ratio has to be calculated.
Assuming that currently σ²_t = a and that the value σ²_{t+1} = b is generated, the Hastings ratio is the density of the current value under a proposal centred at the proposed value, divided by the density of the proposed value under a proposal centred at the current value:

hr = p(a | σ² ~ SIχ²(w + 2, wb/(w + 2))) / p(b | σ² ~ SIχ²(w + 2, wa/(w + 2)))

   = [ (wb/(w + 2))^((w+2)/2) a^(-((w+2)/2 + 1)) exp(-wb/(2a)) ] / [ (wa/(w + 2))^((w+2)/2) b^(-((w+2)/2 + 1)) exp(-wa/(2b)) ]

   = [ b^(w/2 + 1) a^(-(w/2 + 2)) exp(-wb/(2a)) ] / [ a^(w/2 + 1) b^(-(w/2 + 2)) exp(-wa/(2b)) ]

   = (b/a)^(w+3) exp( (w/2)(a/b - b/a) ).

This Hastings ratio can then be used in the Hastings algorithm.
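As a hedged illustration, the following sketch carries out one such Hastings update for the example of Section 7.2.3 (known mean 0, SIχ²(ν_0, σ²_0) prior); it draws the proposal via a χ² deviate and uses the ratio derived above, and all function and argument names are mine rather than the MLwiN implementation.

import numpy as np

def hastings_sigma2(sig2, y, w, rng, nu0=3.0, s20=6.0):
    # Propose sig2* ~ SI-chi^2(w + 2, w*sig2/(w + 2)): draw X ~ chi^2_{w+2},
    # then w*sig2/X has expectation sig2, as required.
    prop = w * sig2 / rng.chisquare(w + 2)
    a, b = sig2, prop
    # Hastings ratio derived above: (b/a)^(w+3) exp{(w/2)(a/b - b/a)}.
    log_hr = (w + 3) * np.log(b / a) + 0.5 * w * (a / b - b / a)
    # Log posterior (up to a constant) under the N(0, sigma^2) likelihood
    # and SI-chi^2(nu0, s20) prior of Section 7.2.3.
    n = len(y)
    log_post = lambda s2: (-(n + nu0) / 2.0 - 1.0) * np.log(s2) - (np.sum(y**2) + nu0 * s20) / (2.0 * s2)
    log_acc = log_hr + log_post(b) - log_post(a)
    return b if np.log(rng.uniform()) < log_acc else a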
7.2.3 Example: Normal observations with an unknown variance

The three methods discussed in this section will now be illustrated by the following example. I generate 100 observations from a normal distribution with known mean 0 and known variance 4. I then assume that the global mean is known to be zero and that interest lies in estimating the variance parameter σ². Assigning a scaled inverse chi-squared prior for σ², the model is then as follows:

Y_i ~ N(0, σ²),   i = 1, ..., 100,
σ² ~ SIχ²(ν_0, σ²_0).

This problem has a conjugate prior and consequently the posterior has a known distribution as follows:

σ² ~ SIχ²(ν_0 + n, (ν_0 σ²_0 + nV)/(ν_0 + n)),

where n is the number of Y_i's (in this case 100) and V = (1/n) Σ_{i=1}^n (y_i - μ)². For the prior distribution the values ν_0 = 3 and σ²_0 = 6 will be used, and the sample variance of the 100 simulated observations is V = 4.38292. This leads to the posterior distribution

σ² ~ SIχ²(103, 4.43).

This distribution has posterior mean for σ² of 4.5177 with standard deviation 0.6421.
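These posterior quantities follow directly from the conjugate update; as a quick check, a few lines of Python (my own illustration, not part of the thesis software) reproduce them:

import numpy as np

# Conjugate SI-chi^2 update for the example above (known mean 0).
nu0, s20, n, V = 3.0, 6.0, 100, 4.38292      # prior parameters and sample variance about 0
nu_n = nu0 + n                               # 103
s2_n = (nu0 * s20 + n * V) / nu_n            # 4.43
post_mean = nu_n * s2_n / (nu_n - 2.0)       # 4.5177
post_sd = post_mean * np.sqrt(2.0 / (nu_n - 4.0))   # 0.6421
print(nu_n, s2_n, post_mean, post_sd)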
The Gibbs sampling method used by BUGS will sample IID from the posterior distribution for σ², and so should perform best; the Gibbs method can sample σ² IID from its posterior distribution because it is the only unknown in this model. The Metropolis log σ² method and the Hastings method should get reasonable answers but may take longer to achieve the same accuracy.
7.2.4 Results

In the following analysis I will compare the three methods described earlier. For the Gibbs sampler method a burn-in of 1,000 updates and a main run of 5,000 updates was used for each of three random number seeds and the results were averaged. Several different values of the proposal standard deviation s for the Metropolis method and of the degrees of freedom w for the Hastings method were used for comparative purposes. For each s or w, three runs were performed with a burn-in of 1,000, a main run of 100,000 and different random number seeds, and the results obtained were averaged over the three runs. The Raftery-Lewis convergence diagnostic column of Table 7.1 contains the largest value of N, the estimated run length required, among the three runs. As can be seen from the table, the largest N is 83,800, which is smaller than the 100,000 run lengths, and so this suggests all the runs have achieved their default accuracy goals.
Table 7.1 shows that the Gibbs method has the predicted fast convergence rate. All three methods give the correct answer to two decimal places. The convergence rates of the Metropolis and Hastings methods vary with the parameter value of the proposal distribution. This is also linked with the acceptance rates of the two algorithms. The best proposal parameter values are those giving an acceptance rate of approximately 44%.
Table 7.1: Comparison between three MCMC methods for a univariate normal model with unknown variance.

Method       s / w   Mean     Sd      Acc %    R/L N
Theory        N/A    4.5177   0.642    N/A      N/A
Gibbs         N/A    4.516    0.648   100.0    3,600
Metropolis    1.0    4.518    0.640    17.3   30,000
              0.5    4.519    0.642    32.6   16,000
              0.3    4.519    0.642    47.9   13,900
              0.2    4.519    0.641    60.4   18,000
              0.1    4.519    0.643    78.1   30,500
              0.05   4.514    0.639    88.8   83,800
Hastings      2      4.517    0.640    15.9   39,900
              5      4.522    0.643    25.7   26,400
              10     4.523    0.644    34.9   20,500
              20     4.519    0.645    45.5   18,100
              50     4.517    0.641    60.0   18,500
              100    4.517    0.643    69.8   26,000
              200    4.517    0.641    77.8   28,700
              500    4.523    0.640    85.7   50,100
              1000   4.517    0.637    89.7   72,200
In this simple example, however, acceptance rates of between 30% and 60% give similar convergence rates. There is therefore scope to incorporate the adaptive procedure illustrated in Chapter 5 into these methods.
I will now return to the original problem of updating a variance matrix.
7.3 Updating methods for a variance matrix

In the models in Chapter 5 the level 2 variance can be a matrix. In the algorithms given, a simple Gibbs sampling step is used to update this variance as it has an inverse Wishart posterior distribution with parameters that are easily evaluated. If there is complex variation at level 1, then the level 1 variance V_C is a matrix and this will also have an inverse Wishart posterior distribution. Unfortunately the parameters will depend on the vectors of level 1 residuals, e_ij, which are not easily evaluated. Consequently an alternative method is needed that does not need to evaluate the level 1 residuals. When I considered the case of a scalar variance, I used a Hastings step that had a proposal distribution of a similar form to the posterior distribution, and I will consider a similar approach here.
7.3.1 Hastings algorithm with an inverse Wishart proposal

When there was a scalar variance σ² to update, a proposal distribution that generated strictly positive values was required. Now that there is a variance matrix Σ to update, I require a proposal distribution that generates positive definite matrices. I am therefore going to use an inverse Wishart proposal distribution with expectation the current estimate Σ_t to generate Σ_{t+1}. The inverse Wishart distribution for a k × k matrix with parameters ν and S has expectation (ν - k - 1)^{-1} S, so letting ν = w + k + 1 and S = wΣ_t, where w is a positive integer degrees of freedom parameter, produces a distribution with expectation Σ_t. As in the univariate case, the parameter w is set to a value that gives the desired acceptance rate. Again the proposal is not symmetric, so the Hastings ratio must be calculated. Assuming that currently Σ_t = A and that Σ_{t+1} = B is generated, then the Hastings ratio is as follows:

hr = p(Σ_{t+1} = B | Σ_{t+1} ~ IW(w + k + 1, wA)) / p(Σ_{t+1} = A | Σ_{t+1} ~ IW(w + k + 1, wB))

   = [ |wA|^((w+k+1)/2) |B|^(-(w+k+1+k+1)/2) exp(-½ tr(wA B^{-1})) ] / [ |wB|^((w+k+1)/2) |A|^(-(w+k+1+k+1)/2) exp(-½ tr(wB A^{-1})) ]

   = ( |A| / |B| )^((2w+3k+3)/2) exp( (w/2)( tr(BA^{-1}) - tr(AB^{-1}) ) ).
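For concreteness, here is a minimal sketch of such an update written with scipy's inverse Wishart routines; it computes the Hastings correction directly from the two proposal densities, and log_post and the argument names are assumptions of mine, not part of any package described in this thesis.

import numpy as np
from scipy.stats import invwishart

def iw_hastings_step(Sigma, log_post, w, rng):
    # Propose Sigma* from an inverse Wishart with df = w + k + 1 and scale
    # w * Sigma, so that its expectation is the current value Sigma.
    k = Sigma.shape[0]
    df = w + k + 1
    Sigma_star = invwishart.rvs(df=df, scale=w * Sigma, random_state=rng)
    # Hastings correction: log q(current | proposed) - log q(proposed | current).
    log_hr = (invwishart.logpdf(Sigma, df=df, scale=w * Sigma_star)
              - invwishart.logpdf(Sigma_star, df=df, scale=w * Sigma))
    log_acc = log_hr + log_post(Sigma_star) - log_post(Sigma)
    return Sigma_star if np.log(rng.uniform()) < log_acc else Sigma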
7.3.2 Example: Bivariate normal observations with an unknown variance matrix

A second simple example will now be used to show that this method works. I will compare the results obtained with the theoretical answers and the Gibbs sampling results. One hundred observations were generated from a bivariate normal distribution with known mean vector μ = (4, 2)^T and known variance matrix

Σ = ( 2.0  -0.2
     -0.2   1.0 ).

I then assume that the mean vector μ is known and that interest lies in estimating the variance matrix Σ. An inverse Wishart prior distribution for Σ was assigned, and the model is then as follows:

Y_i ~ MVN( (4, 2)^T, Σ ),
p(Σ) ~ IW(ν_0, S_0).

The inverse Wishart prior is conjugate for Σ and consequently the posterior for Σ has the following distribution:

p(Σ | Y) ~ IW(ν_0 + n, S_0 + nV),

where n is the number of Y_i's, in this case 100, and V = (1/n) Σ_{i=1}^n (Y_i - μ)(Y_i - μ)^T. In this example the values ν_0 = 3 and

S_0 = ( 5.0  -0.5
       -0.5   2.0 )

will be used for the prior distribution. Then the posterior for Σ is

p(Σ | Y) ~ IW( 103, ( 218.77  -36.27
                      -36.27  118.07 ) ),

which has posterior mean matrix

( 2.188  -0.363
 -0.363   1.181 ).

I will again compare the Gibbs sampling method used by BUGS, which samples directly from the posterior distribution for Σ and should perform well, with the Hastings method, which should take longer to converge.
7.3.3 Results

For the Gibbs sampler method, a burn-in of 1,000 updates and a main run of 5,000 updates was used for each of three random number seeds and the results were averaged. For the Hastings method several different degrees of freedom values w were used. For each w, three runs were performed with different random number seeds, with a burn-in of 1,000 and a main run of 100,000. The Raftery and Lewis convergence diagnostic is the maximum N of the three variables monitored over the three runs. The results can be seen in Table 7.2.
Table 7.2: Comparison between two MCMC methods for a bivariate normal model with unknown variance matrix.

Method     w     Σ_00            Σ_01             Σ_11            Acc %    R/L
Theory     N/A   2.188           -0.363           1.181            N/A     N/A
Gibbs      N/A   2.193 (0.315)   -0.365 (0.167)   1.178 (0.168)   100.0    3.9k
Hastings   20    2.190 (0.315)   -0.362 (0.165)   1.181 (0.167)    13.6   47.3k
           50    2.188 (0.313)   -0.364 (0.166)   1.180 (0.169)    29.3   30.6k
           100   2.187 (0.309)   -0.361 (0.165)   1.180 (0.168)    43.6   34.0k
           200   2.190 (0.312)   -0.364 (0.167)   1.181 (0.169)    57.2   44.8k
           500   2.190 (0.312)   -0.365 (0.167)   1.181 (0.170)    71.8   63.1k
           1000  2.186 (0.312)   -0.365 (0.168)   1.186 (0.169)    79.6   99.8k
From Table 7.2 it can be seen that the choice of 100,000 as the main run length for the Hastings method satisfies the Raftery-Lewis convergence diagnostic for all selected values of w. It can also be seen that the default accuracy goals are achieved most quickly when the acceptance rate is between 30% and 35%. This is different from the univariate case, but this is because the new procedure involves updating the whole variance matrix and not just a single parameter. In fact Gelman, Roberts, and Gilks (1995) suggest a rate of 31.6% for a 3 dimensional normal update, which compares favourably with this analysis. It can be clearly seen that the method is estimating the variance matrix correctly. I now need to incorporate this method into the algorithm for the models that are the theme of this chapter, multi-level models with complex variation at level 1.
7.4 Applying inverse Wishart updates to complex variation at level 1

In Chapter 5 I showed how the Gibbs algorithm for a Gaussian 3 level model is easily generalised to N levels. In this section I will simply describe how to use the inverse Wishart updating step for the level 1 variance with a 2 level model. Extending this algorithm to N levels should be analogous to the work in Chapter 5. The model to be considered is as follows:

y_ij = X^F_ij β^F + X^R_ij β^R_j + X^C_ij e_ij,
e_ij ~ MVN(0, V_C),    β^R_j ~ MVN(0, V_R).
The important part of the algorithm is to store the level 1 variance for each individual,

σ²_ij = (X^C_ij)^T V_C X^C_ij.

All the other parameters then depend on the level 1 variance V_C only through these individual variances. This means that the algorithm below is, apart from the updating step for V_C, almost identical to the algorithm for the same model without complex variation. The main difference is that everywhere that σ² appeared in the old algorithm it is replaced by σ²_ij, and this often involves moving the σ²_ij inside summations.
7.4.1 MCMC algorithm

I will assume that the following general priors are used: for the level 1 variance, p(V_C) ~ IW(ν_1, S_1); for the level 2 variance, p(V_R) ~ IW(ν_2, S_2); and for the fixed effects, β^F ~ N(μ_p, S_p). The algorithm then has four steps as follows.

Step 1 - The fixed effects, β^F.

p(β^F | y, β^R, V_C, V_R) ∝ p(y | β^F, β^R, V_C, V_R) p(β^F),

β^F ~ MVN(β̂^F, D̂^F),

where

D̂^F = [ Σ_ij (X^F_ij)^T X^F_ij / σ²_ij + S_p^{-1} ]^{-1}

and

β̂^F = D̂^F [ Σ_ij (X^F_ij)^T (y_ij - X^R_ij β^R_j) / σ²_ij + S_p^{-1} μ_p ].
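As a hedged illustration of Step 1 and of the σ²_ij bookkeeping described above, the following Python sketch draws β^F for a two level model; the data layout (arrays y, XF, XR, XC, the group index, and the prior arguments mu_p and Sp_inv) is an assumption of mine and not the MLwiN implementation.

import numpy as np

def update_fixed_effects(y, XF, XR, XC, betaR, group, VC, mu_p, Sp_inv, rng):
    # Store the individual level 1 variances sigma2_ij = XC_ij' VC XC_ij.
    sigma2 = np.einsum('ij,jk,ik->i', XC, VC, XC)
    # Residual after removing the level 2 random part: y_ij - XR_ij betaR_j.
    resid = y - np.einsum('ij,ij->i', XR, betaR[group])
    # Posterior precision and mean for the fixed effects.
    D_inv = Sp_inv + (XF.T / sigma2) @ XF
    D = np.linalg.inv(D_inv)
    mean = D @ (XF.T @ (resid / sigma2) + Sp_inv @ mu_p)
    return rng.multivariate_normal(mean, D)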
Step 2 - The level 2 residuals, β^R.

p(β^R | y, β^F, V_C, V_R) ∝ p(y | β^F, β^R, V_C, V_R) p(β^R | V_R),

β^R_j ~ MVN(β̂^R_j, D̂^R_j),

where

D̂^R_j = [ Σ_{i=1}^{n_j} (X^R_ij)^T X^R_ij / σ²_ij + V_R^{-1} ]^{-1}

and

β̂^R_j = D̂^R_j Σ_{i=1}^{n_j} (X^R_ij)^T (y_ij - X^F_ij β^F) / σ²_ij.
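A matching sketch for Step 2, drawing each β^R_j in turn under the same assumed data layout as the Step 1 sketch above:

import numpy as np

def update_level2_residuals(y, XF, XR, XC, betaF, VC, VR_inv, group, J, rng):
    sigma2 = np.einsum('ij,jk,ik->i', XC, VC, XC)   # sigma2_ij for every observation
    resid = y - XF @ betaF                          # y_ij - XF_ij betaF
    betaR = np.empty((J, XR.shape[1]))
    for j in range(J):
        rows = (group == j)
        XRj, rj, s2j = XR[rows], resid[rows], sigma2[rows]
        D_inv = VR_inv + (XRj.T / s2j) @ XRj
        D = np.linalg.inv(D_inv)
        betaR[j] = rng.multivariate_normal(D @ (XRj.T @ (rj / s2j)), D)
    return betaR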
Step 3 - The level 1 variance, V_C.

This step now involves a Hastings update using an inverse Wishart proposal distribution:

V_C^(t) = V_C^* with probability min(1, hr × p(V_C^* | y, ...) / p(V_C^(t-1) | y, ...)),
        = V_C^(t-1) otherwise,

where V_C^* ~ IW_{n_1}(S = w V_C^(t-1), ν = w + n_1 + 1), w being a tuning parameter which affects the acceptance rate of the Hastings proposals and n_1 the number of random parameters at level 1. The Hastings ratio hr is as follows:

hr = ( |V_C^(t-1)| / |V_C^*| )^η exp( (w/2)( tr(V_C^* (V_C^(t-1))^{-1}) - tr(V_C^(t-1) (V_C^*)^{-1}) ) ),

where η = (2w + 3n_1 + 3)/2.
Step 4 - The level 2 variance, V_R.

p(V_R^{-1} | y, β^F, β^R, V_C) ∝ p(β^R | V_R) p(V_R^{-1}),

V_R^{-1} ~ Wishart_{n_2}( S_pos = ( Σ_{j=1}^J β^R_j (β^R_j)^T + S_P )^{-1}, ν_pos = J + ν_P ),

where n_2 is the number of random variables at level 2. If a uniform prior is required then set S_P = 0 and ν_P = -n_2 - 1.
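A sketch of this Wishart draw for V_R^{-1}, again under assumed argument names of my own (betaR holds the J level 2 residual vectors as rows):

import numpy as np
from scipy.stats import wishart

def update_VR(betaR, S_P, nu_P, rng):
    # Posterior scale: (sum_j betaR_j betaR_j' + S_P)^{-1}; posterior df: J + nu_P.
    J = betaR.shape[0]
    S_pos = np.linalg.inv(betaR.T @ betaR + S_P)
    VR_inv = wishart.rvs(df=J + nu_P, scale=S_pos, random_state=rng)
    return np.linalg.inv(VR_inv)   # return the level 2 variance matrix itself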
7.4.2 Example 1

The example to be used to illustrate the algorithm was described at the start of the chapter and consists of the JSP dataset with an extension to the random slopes regression model so that the predictor variable M3 is random at level 1. The model is as follows:

M5_ij = β^F_0 + β^R_0j + e_0ij + M3_ij (β^F_1 + β^R_1j + e_1ij),
e_ij = (e_0ij, e_1ij)^T ~ MVN(0, V_C),    β^R_j = (β^R_0j, β^R_1j)^T ~ MVN(0, V_R).

The predictor variable M3 has been centred around its mean. I will not use the actual response variable M5, as this produces estimates via IGLS and RIGLS that do not lead to a positive definite matrix V_C. This problem will be returned to in the next section, where an alternative method is given. Instead I will use the MLn SIMU command (Rasbash and Woodhouse 1995) to create a simulated dataset with a positive definite matrix V_C.
The results from one simulated dataset are given in Table 7.3. The results for the MCMC methods are based on the average of three runs, each of length 50,000 after a burn-in of 500 iterations. The value w = 150 was selected as this gives an acceptance rate of approximately 32%, and the rate suggested in Gelman, Roberts, and Gilks (1995) for a 3 dimensional normal update is 31.6%. The last column contains results for method 2, which is discussed later in the chapter.
From Table 7.3 it should be noted that, as the results are for only one dataset generated using the values in the `True' column, the estimates should not be identical to these true values.
Table 7.3: Comparison between IGLS/RIGLS and MCMC methods on a simulated dataset with the layout of the JSP dataset.

Parameter (True)   IGLS             RIGLS            MCMC IW (w = 150)   MCMC Method 2
β_0   (30.0)       31.118 (0.487)   31.116 (0.492)   31.117 (0.527)      31.108 (0.526)
β_1   (0.5)        0.537 (0.080)    0.537 (0.081)    0.536 (0.088)       0.538 (0.088)
V_R00 (6.0)        8.417 (2.294)    8.642 (2.344)    10.303 (2.867)      10.289 (2.866)
V_R01 (-0.25)      -0.546 (0.282)   -0.560 (0.288)   -0.662 (0.361)      -0.658 (0.359)
V_R11 (0.1)        0.176 (0.062)    0.183 (0.064)    0.231 (0.084)       0.230 (0.084)
V_C00 (28.0)       27.657 (2.187)   27.657 (2.188)   27.910 (2.279)      27.904 (2.291)
V_C01 (-0.5)       -0.654 (0.261)   -0.656 (0.261)   -0.675 (0.270)      -0.677 (0.271)
V_C11 (0.5)        0.571 (0.091)    0.571 (0.091)    0.589 (0.099)       0.589 (0.100)
What is clear is that the MCMC methods are exhibiting behaviour that could be predicted from the results in Chapter 4. The MCMC methods are using uniform priors for both variance matrices and this is giving larger variance estimates. The discrepancy is also larger at level 2, where the estimates are based on 48 schools, than at level 1, where the estimates are based on 887 pupils, as would be expected. The MCMC methods have wider uncertainty bands than those from IGLS and RIGLS, which is also to be expected.
7.4.3 Conclusions

From this one example it can be concluded that, provided the dataset used gives level 1 variance estimates in IGLS/RIGLS that produce a positive definite matrix, the above method can be used to fit the model and give MCMC estimates.
One important point to note is that in the earlier examples, with a single residual e_ij at level 1 for each individual, these residuals could be estimated by subtraction at each iteration. This would then lead to MCMC estimates for each e_ij. Following this through to the complex variation models, the subtraction approach will give estimates for the composite residual X^C_ij e_ij but not for the individual e_ij. The best that is available is similar to the IGLS/RIGLS methods: these methods calculate the residuals based on the final estimates of the other parameters, and the same calculation could simply be applied using the MCMC estimates of the other parameters.
The one point I touched on earlier is that the IGLS/RIGLS method for the original dataset produces estimates that do not form a positive definite variance matrix V_C. Although the method I have just given is a useful way of using the Metropolis-Hastings sampler, I will in the next section show an alternative method that can handle non-positive definite variance matrices.
7.5 Method 2: Using truncated normal Hastings update steps

The inverse Wishart updating method assumes that the variance `matrix' at level 1 must be positive definite. Although the way the model has been written thus far suggests that the variance at level 1 should be a matrix, an alternative form would be to consider the variance at level 1 as a quadratic form. For example, using the JSP example considered already,

Var(M5_ij | β^F, β^R_j) = A + 2B M3_ij + C (M3_ij)².

Using the constraint that the matrix

( A  B
  B  C )

is positive definite is a stronger constraint than is actually needed. Positive definite matrices will guarantee that any vector X^C_ij will produce a positive variance, but in the JSP example the first random variable is constant and the second variable, M3, takes integer values, before centering, between 0 and 40. So a looser constraint is to allow all values (A*, B*, C*) such that A* + 2B* M3_ij + C* (M3_ij)² > 0 for all i, j. This constraint looks quite complicated to work with, but if I consider each of the variables A, B and C separately and assume the other variables are fixed, the constraints become easier.
I will now consider the steps required for this simple example before generalising the algorithm to all problems with complex variation at level 1.
7.5.1 Update steps at level 1 for the JSP example

At iteration t, assume that the current values for the parameters are A^(t), B^(t) and C^(t), and let σ²_ij = Var(M5_ij | β^F, β^R_j). Then I will update the three parameters in turn.

Updating parameter A.

At time t,

σ²_ij = A^(t) + 2B^(t) M3_ij + C^(t) (M3_ij)² > 0 for all i, j.

So let

2B^(t) M3_ij + C^(t) (M3_ij)² = -r^A_ij; then A^(t) > r^A_ij for all i, j.

This implies

A^(t) > max_A, where max_A = max(r^A_ij).

I will use a normal proposal distribution with variance v_A, but only consider values generated that satisfy the constraint. This will lead to a truncated normal proposal as shown in Figure 7-1 (i). The Hastings ratio can then be calculated from the ratio of the two truncated normal distributions shown in Figure 7-1 (i) and (ii). Letting the value for A at time t be A_c and the proposed value for time t + 1 be A*,

hr = p(A^(t+1) = A* | A^(t) = A_c) / p(A^(t+1) = A_c | A^(t) = A*)

   = [1 - Φ((max_A - A*)/√v_A)] / [1 - Φ((max_A - A_c)/√v_A)].

The update step is now as follows:

A^(t+1) = A* with probability min(1, hr × p(A* | y, ...)/p(A^(t) | y, ...)),
        = A^(t) otherwise.
Updating parameter B.

At time t,

σ²_ij = A^(t) + 2B^(t) M3_ij + C^(t) (M3_ij)² > 0 for all i, j.

So let A^(t) + C^(t) (M3_ij)² = -r^B_ij; then

B^(t) > r^B_ij / (2 M3_ij) for all M3_ij > 0, and
B^(t) < r^B_ij / (2 M3_ij) for all M3_ij < 0.

This leads to two constraints:

B^(t) > max_B+, where max_B+ = max(r^B_ij / (2 M3_ij) : M3_ij > 0),

and

B^(t) < min_B-, where min_B- = min(r^B_ij / (2 M3_ij) : M3_ij < 0).

I will use a normal proposal distribution with variance v_B, but only consider values generated that satisfy these constraints. This will lead to a truncated normal proposal as shown in Figure 7-1 (iii). The Hastings ratio can then be calculated from the ratio of the two truncated normal distributions shown in Figure 7-1 (iii) and (iv). Letting the value for B at time t be B_c and the proposed value for time t + 1 be B*,

hr = p(B^(t+1) = B* | B^(t) = B_c) / p(B^(t+1) = B_c | B^(t) = B*)

   = [Φ((min_B- - B*)/√v_B) - Φ((max_B+ - B*)/√v_B)] / [Φ((min_B- - B_c)/√v_B) - Φ((max_B+ - B_c)/√v_B)].

The update step is now as follows:

B^(t+1) = B* with probability min(1, hr × p(B* | y, ...)/p(B^(t) | y, ...)),
        = B^(t) otherwise.
Updating parameter C.

At time t,

σ²_ij = A^(t) + 2B^(t) M3_ij + C^(t) (M3_ij)² > 0 for all i, j.

So let

A^(t) + 2B^(t) M3_ij = -r^C_ij; then C^(t) > r^C_ij / (M3_ij)² for all i, j.

This implies

C^(t) > max_C, where max_C = max(r^C_ij / (M3_ij)²).

I will use a normal proposal distribution with variance v_C, but only consider values generated that satisfy the constraint. This will lead to a truncated normal proposal as shown in Figure 7-1 (i). The Hastings ratio can then be calculated from the ratio of the two truncated normal distributions shown in Figure 7-1 (i) and (ii). Letting the value for C at time t be C_c and the proposed value for time t + 1 be C*,

hr = p(C^(t+1) = C* | C^(t) = C_c) / p(C^(t+1) = C_c | C^(t) = C*)

   = [1 - Φ((max_C - C*)/√v_C)] / [1 - Φ((max_C - C_c)/√v_C)].

The update step is now as follows:

C^(t+1) = C* with probability min(1, hr × p(C* | y, ...)/p(C^(t) | y, ...)),
        = C^(t) otherwise.
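The three updates above share the same structure, so the following is a single hedged Python sketch of a generic truncated-normal random-walk step that could serve for A, B or C; the rejection-based proposal draw, the log_post argument and the use of two-sided bounds (with infinite limits for one-sided constraints) are my own illustrative choices rather than the thesis implementation.

import numpy as np
from scipy.stats import norm

def truncated_normal_step(cur, lower, upper, v, log_post, rng):
    # One Metropolis-Hastings update for a scalar variance term constrained to
    # (lower, upper), using a Normal(cur, v) proposal truncated to that interval.
    sd = np.sqrt(v)
    # Draw from the truncated proposal by rejection (adequate for modest truncation).
    while True:
        prop = rng.normal(cur, sd)
        if lower < prop < upper:
            break
    # Hastings correction: ratio of the two truncation normalising constants.
    z_cur = norm.cdf(upper, cur, sd) - norm.cdf(lower, cur, sd)
    z_prop = norm.cdf(upper, prop, sd) - norm.cdf(lower, prop, sd)
    log_acc = log_post(prop) - log_post(cur) + np.log(z_cur) - np.log(z_prop)
    return prop if np.log(rng.uniform()) < log_acc else cur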
The results of using this second method on the simulated dataset example of the last section can be seen in the last column of Table 7.3. Here it can be seen that, although the two methods are based on different constraints, in this example, where the true level 1 variance matrix is positive definite, both methods give similar estimates.
Before generalising the algorithm to all covariance structures at level 1, I will now return to the original data from the JSP dataset. This will show that this second method has the advantage of fitting models with a non-positive definite variance structure at level 1.
Figure 7-1: Plots of truncated univariate normal proposal distributions for a parameter θ. A is the current value θ_c and B is the proposed new value θ*. M is max_θ and m is min_θ, the truncation points. The distributions in (i) and (iii) have mean θ_c, while the distributions in (ii) and (iv) have mean θ*.
7.5.2 Proposal distributions

I have not as yet discussed the proposal distributions used in this method in much detail; I simply stated that I was using a truncated normal distribution with a particular variance for the untruncated version. The problem of choosing a value for the variance parameter is the same problem that arose when considering updating the fixed effects and higher level residuals by Metropolis-Hastings updates. As there, two possible solutions are to use the variance of the parameter estimate from the RIGLS procedure, multiplied by a suitable scaling, to give a variance for the normal proposal distribution, or to use an adaptive approach before the burn-in and main run of the simulation. In Example 1 I used a scaling of 5.8 on the variance scale, which gave acceptance rates of between 35% and 50%.
7.5.3 Example 2: Non-positive definite and incomplete variance matrices at level 1

The original JSP dataset gave a non-positive definite matrix as an estimate for the level 1 variance. The term V_C11 is very small and is not statistically significant, so it could be removed from the model. This will mean that in practical terms the variance of an individual student's M5 score depends on their M3 score in a linear way, rather than a quadratic way. This will then lead to an incomplete variance matrix at level 1.
This sort of variance structure at level 1 can be useful in other situations, for example if the data consist of boys and girls and it is believed that there is a difference in variability between boys and girls. Then the variance equation could be as follows:

Var(Y_ij | β^F, β^R) = V_C00 + V_C01 Boy_ij,

where Boy_ij has value 1 if the child is a boy and 0 if the child is a girl, and V_C01 can be negative. Here including the quadratic term would not make sense as (Boy_ij)² = Boy_ij for all i, j, so an incomplete variance structure should be used.
Table 7.4 contains the estimates of the RIGLS and second MCMC methods for the complex variation model fitted to the original JSP data (Model 2). The table also contains estimates when firstly V_C11 is removed (Model 3), and secondly V_C01 is removed (Model 4). The MCMC algorithm described earlier can easily accommodate the removal of terms from the variance matrix: the term is simply set to zero before running the simulation method, and during the method the term is never updated.
Table 7.4: Comparison between RIGLS and MCMC method 2 on three models with complex variation fitted to the JSP dataset.

         Parameter    Model 2          Model 3          Model 4
RIGLS    β_0          30.582 (0.356)   30.582 (0.356)   30.589 (0.367)
         β_1          0.617 (0.033)    0.617 (0.033)    0.612 (0.042)
         V_R00        4.273 (1.230)    4.271 (1.230)    4.638 (1.306)
         V_R01        -0.246 (0.098)   -0.247 (0.098)   -0.340 (0.117)
         V_R11        0.017 (0.010)    0.017 (0.010)    0.028 (0.017)
         V_C00        27.551 (1.465)   27.573 (1.419)   25.018 (1.651)
         V_C01        -1.206 (0.137)   -1.203 (0.079)   0.000 (0.000)
         V_C11        0.001 (0.027)    0.000 (0.000)    0.063 (0.039)
MCMC     β_0          30.582 (0.387)   30.577 (0.385)   30.590 (0.398)
         β_1          0.617 (0.039)    0.617 (0.039)    0.614 (0.048)
         V_R00        5.302 (1.647)    5.330 (1.651)    5.603 (1.709)
         V_R01        -0.319 (0.143)   -0.327 (0.142)   -0.394 (0.160)
         V_R11        0.030 (0.016)    0.030 (0.016)    0.048 (0.023)
         V_C00        27.237 (1.528)   27.418 (1.409)   25.008 (1.545)
         V_C01        -1.221 (0.136)   -1.178 (0.079)   0.000 (0.000)
         V_C11        0.012 (0.028)    0.000 (0.000)    0.067 (0.033)
         Scale        1.0              1.3              3.0
Accept   Acc(V_C00)   45.7             42.2             46.4
         Acc(V_C01)   36.0             43.9             n/a
         Acc(V_C11)   37.8             n/a              43.8
Diag.    RL(V_C00)    51K              51.6K            22.7K
         RL(V_C01)    119K             100.1K           n/a
         RL(V_C11)    134K             n/a              23.3K
The MCMC estimates are based on the average of 3 runs, each of length 50,000 after a burn-in of 500. Each run takes approximately an hour on a Pentium 200MHz PC. The Raftery-Lewis estimates are larger than the 50,000 iterations performed on each run; however, the estimates in the table are based on the average of 3 runs, which gives a potential 150,000 iterations per estimate. The Raftery-Lewis estimates may also be inflated as they are based on a thinned chain of length 10,000 (every 5th iteration of the main chain) due to storage restrictions. The scaling of the proposal distribution variances was chosen based on some shorter runs, so that acceptance rates were close to 44%.
From Table 7.4 it can be seen that the MCMC method can handle firstly non-positive definite level 1 variance matrices, and secondly incomplete level 1 variance matrices. The MCMC estimates exhibit behaviour similar to that seen in Example 1. The level 2 MCMC variance estimates are larger than those from RIGLS, but at level 1 there is less difference between the estimates from the two methods. In fact the level 1 variance term V_C00 has smaller estimates using the MCMC method.
Prior distributions

In the above example, uniform priors were used for all the variance parameters. As an alternative, informative inverse Wishart priors could be used for the level 1 variance matrix, if the matrix is complete. If the matrix is incomplete, an alternative prior to the uniform is not immediately obvious. The problem of finding alternative priors in this case is outside the scope of this thesis. A possible solution would be to use univariate normal prior distributions for each parameter, as the likelihood would ensure that the posterior gave acceptable values for the positivity constraints.
7.5.4 General algorithm for the truncated normal proposal method

The algorithm for a general Gaussian multi-level model using the updating method of truncated normal proposals at level 1 follows directly from the example in the last section. As both this method and the earlier method calculate the σ²_ij's, the update steps are all the same except for the update step for the level 1 variance. I will assume the model is as for the first method,

y_ij = X^F_ij β^F + X^R_ij β^R_j + X^C_ij e_ij,
e_ij ~ MVN(0, V_C),    β^R_j ~ MVN(0, V_R).
Step 3 - The level 1 variance, V_C.

The variance matrix V_C is considered term by term, and one of the following two steps is followed depending on whether the term lies on the diagonal.

Updating diagonal terms, V_Cnn.

At time t,

σ²_ij = (X^C_ij)^T V_C^(t) X^C_ij > 0
      = (X^C_ij(n))² V_Cnn^(t) - r^C_ij(nn) > 0 for all i, j,

where

r^C_ij(nn) = (X^C_ij(n))² V_Cnn^(t) - (X^C_ij)^T V_C^(t) X^C_ij.

So

V_Cnn^(t) > max_Cnn, where max_Cnn = max(r^C_ij(nn) / (X^C_ij(n))²).

I will use a normal proposal distribution with variance v_nn, but only consider values generated that satisfy the constraint. This will lead to a truncated normal proposal as shown in Figure 7-1 (i). The Hastings ratio can then be calculated from the ratio of the two truncated normal distributions shown in Figure 7-1 (i) and (ii). Letting the value for V_Cnn at time t be A and the proposed value for time t + 1 be B,

hr = p(V_Cnn^(t+1) = B | V_Cnn^(t) = A) / p(V_Cnn^(t+1) = A | V_Cnn^(t) = B)

   = [1 - Φ((max_Cnn - B)/√v_nn)] / [1 - Φ((max_Cnn - A)/√v_nn)].

The update step is now as follows:

V_Cnn^(t+1) = V_Cnn^* with probability min(1, hr × p(V_Cnn^* | y, ...)/p(V_Cnn^(t) | y, ...)),
            = V_Cnn^(t) otherwise.
Updating non-diagonal terms, V_Cmn.

At time t,

σ²_ij = (X^C_ij)^T V_C^(t) X^C_ij > 0
      = 2 X^C_ij(m) X^C_ij(n) V_Cmn^(t) - r^C_ij(mn) > 0 for all i, j,

where

r^C_ij(mn) = 2 X^C_ij(m) X^C_ij(n) V_Cmn^(t) - (X^C_ij)^T V_C^(t) X^C_ij.

So

V_Cmn^(t) > max_Cmn+, where max_Cmn+ = max(r^C_ij(mn) / (2 X^C_ij(m) X^C_ij(n)) : X^C_ij(m) X^C_ij(n) > 0),

and

V_Cmn^(t) < min_Cmn-, where min_Cmn- = min(r^C_ij(mn) / (2 X^C_ij(m) X^C_ij(n)) : X^C_ij(m) X^C_ij(n) < 0).

I will use a normal proposal distribution with variance v_mn, but only consider values generated that satisfy the constraints. This will lead to a truncated normal proposal as shown in Figure 7-1 (iii). The Hastings ratio can then be calculated from the ratio of the two truncated normal distributions shown in Figure 7-1 (iii) and (iv). Letting the value for V_Cmn at time t be A and the proposed value for time t + 1 be B,

hr = p(V_Cmn^(t+1) = B | V_Cmn^(t) = A) / p(V_Cmn^(t+1) = A | V_Cmn^(t) = B)

   = [Φ((min_Cmn- - B)/√v_mn) - Φ((max_Cmn+ - B)/√v_mn)] / [Φ((min_Cmn- - A)/√v_mn) - Φ((max_Cmn+ - A)/√v_mn)].

The update step is now as follows:

V_Cmn^(t+1) = V_Cmn^* with probability min(1, hr × p(V_Cmn^* | y, ...)/p(V_Cmn^(t) | y, ...)),
            = V_Cmn^(t) otherwise.
7.6 Summary

In this chapter two MCMC algorithms have been given for fitting Gaussian multi-level models with complex variation at level 1. The first method uses an inverse Wishart proposal distribution for the level 1 variance matrix, and can only be used when the variance matrix at level 1 is strictly positive definite. The second method uses truncated normal proposals for the individual variance terms. It is not constrained to positive definite matrices and can even cope with incomplete variance matrices.
Both these methods were illustrated through some simple examples and gave sensible estimates for the problems given. The one problem that both methods fail to solve is that they do not give estimates for the individual level 1 residuals. This, along with other models that are not covered in this thesis, will be discussed in the next chapter, along with potential solutions.
Chapter 8
Conclusions and Further Work

8.1 Conclusions

The general aim of this thesis is to combine the two areas of multi-level modelling and Markov chain Monte Carlo (MCMC) methods by fitting multi-level models using MCMC methods. This task was split into three parts. Firstly, the types of problems that are fitted in multi-level modelling were identified and the existing maximum likelihood methods were investigated. Secondly, MCMC algorithms for these models were derived, and finally these methods were compared to the maximum likelihood based methods both in terms of estimate bias and coverage properties.
Two simple 2 level Gaussian models were considered first, and it was shown how to fit these models using the Gibbs sampler method. Then extensive simulation studies were carried out to compare different prior distributions for the variance parameters in these models, using the Gibbs sampler method, with the two maximum likelihood methods IGLS and RIGLS on these two models. The results showed that the RIGLS method was less biased than the MCMC methods. In terms of coverage properties the MCMC methods do better than the maximum likelihood methods in many situations, although not always. The conclusions are summarised in more detail in Section 4.5.
The Gibbs sampler algorithms given for the two simple multi-level models were then generalised to fit the family of N level Gaussian models. Two alternative hybrid Metropolis-Gibbs methods were also given, along with adaptive samplers, and all these methods were compared with each other.
It was found that the Gibbs sampler method was better than the hybrid methods for the Gaussian models in terms of the number of iterations required to achieve a desired accuracy. Of the hybrid methods there was little to choose between the univariate proposal distribution method and the multivariate proposal distribution method. It was also found that the adaptive samplers were a better way of achieving desired accuracies in a minimum number of iterations than the methods using proposal distributions based on scaled variance estimates.
The univariate normal proposal distribution Metropolis method was adapted to fit binary response multi-level models. This method was then compared with the quasi-likelihood methods via a simulation study on one binary response model from Rodriguez and Goldman (1995) where the quasi-likelihood methods perform particularly badly. For this model it was shown that the Metropolis-Gibbs hybrid method performs much better both in terms of bias and coverage properties. It was also shown that for this model the choice of prior distribution for the variance parameters is less important.
Finally, Gaussian models with complex variation at level 1 were considered and two MCMC methods with Hastings updates at level 1 were given. The first method was based on an inverse Wishart proposal distribution for the level 1 variance matrix but could only be used when the level 1 variance parameters formed a complete positive definite matrix. The second method was based on truncated normal proposals for the individual variance components and could be used on any variance structure at level 1. Both these methods were tested on some simple examples.
As can be seen, many different multi-level models have been considered in this thesis, but these are just a selection from a far larger field. The further work section following these conclusions will outline some other models that I did not have time to consider. Before mentioning this future work I will briefly describe the implementation of MCMC methods in the MLwiN package (Goldstein et al. 1998).
8.1.1 MCMC options in the MLwiN package

A by-product of this thesis has been the programming of MCMC methods into the MLwiN package (Goldstein et al. 1998). The first release of MLwiN was in February 1998, and over the first 6 months the package has been used by hundreds of users from the social science community and elsewhere, some familiar with its forerunner, MLn (Rasbash and Woodhouse 1995), and some new users. Although many users who are familiar with the IGLS and RIGLS methods may not use the MCMC methods, it is hoped that by implementing MCMC in MLwiN we will be exposing more people to MCMC methods.
Early feedback is encouraging, and two workshops have been set up to teach these new users more about the MCMC methods in MLwiN and Bayesian statistics in general. I have also received several questions about the MCMC options in MLwiN that show that a new MCMC user community has emerged.
Apart from introducing MCMC methods to a new community of users, the MLwiN package has been invaluable to me for performing the simulation studies found in this thesis. On this note I must also acknowledge the BUGS package (Spiegelhalter et al. 1994), which was used to compare results in the original programming of the MCMC options in MLwiN and to perform some of the simulations in Chapter 4.
8.2 Further work

In this section I include a brief introduction to other models that have not been considered in this thesis but which can be fitted in the MLwiN package using maximum likelihood methods. Some of these models have not been considered due to time constraints, although fitting them using MCMC is not difficult. Other models are more difficult or have simply not been considered yet but are included for completeness. Goldstein (1995) contains more information on fitting all the following models using maximum likelihood methods.

8.2.1 Binomial responses

Binomial response models are used to fit datasets where the response variable is a proportion. The binomial distribution has two parameters: p, the probability of success, and n, the number of trials, known as the denominator in Goldstein (1995). The binary response models considered in Chapter 6 are a special case of the binomial response models where the denominator is 1 for all observations.
The logit link function is used to fit binomial response models, as shown for the binary response models in Chapter 6. In fact a multi-level binomial response model can be converted to a multi-level binary response model by including an extra level in the model for the individual trials.
A 3 level binomial response logistic regression model can be defined as follows:

y_ijk ~ Binomial(n_ijk, p_ijk), where
logit(p_ijk) = X1_ijk β_1 + X2_ijk β_2jk + X3_ijk β_3k,
β_2jk ~ MVN(0, V_2),    β_3k ~ MVN(0, V_3).
The Metropolis-Gibbs hybrid method can be used to fit binomial response models, and this can be done via minor alterations to the algorithm for the binary response model in Chapter 6. This has not yet been implemented in MLwiN, but the only main requirements are to store the denominator n_i for each observation i and then to modify two conditional distributions.
The conditional distribution for the fixed effects should now be

p(β_1 | y, ...) ∝ p(β_1) × Π_{i ∈ M_T} (1 + e^{-(Xβ)_i})^{-y_i} (1 + e^{(Xβ)_i})^{y_i - n_i},

and the conditional distribution for the level l residuals should now be

p(β_lj | y, ...) ∝ Π_{i ∈ M_{l,j}} (1 + e^{-(Xβ)_i})^{-y_i} (1 + e^{(Xβ)_i})^{y_i - n_i} × |V_l|^{-1/2} exp[-½ β_lj^T V_l^{-1} β_lj].
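To make the modification concrete, here is a hedged Python sketch of the binomial log conditional for the fixed effects (up to an additive constant); eta_rest, which holds the higher level contributions to the linear predictor, and the diffuse normal prior variance are illustrative assumptions of mine, not the MLwiN implementation.

import numpy as np

def log_cond_fixed_binomial(beta1, y, n, X, eta_rest, prior_var=1e6):
    # Linear predictor on the logit scale: (X*beta)_i plus the random-effect parts.
    eta = X @ beta1 + eta_rest
    # Binomial log likelihood term: -y*log(1 + e^{-eta}) + (y - n)*log(1 + e^{eta}).
    loglik = -y * np.logaddexp(0.0, -eta) + (y - n) * np.logaddexp(0.0, eta)
    # Assumed diffuse N(0, prior_var) prior on each fixed effect.
    logprior = -0.5 * np.sum(beta1**2) / prior_var
    return np.sum(loglik) + logprior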
Extra binomial variation

As with the binary response models, the binomial response models do not have a parameter for the level 1 variance. This is because, if the mean and the denominator of a binomial distribution are known, then the variance can be calculated, var(y_i) = n_i p_i (1 - p_i).
The multi-level binomial response model could also be written as follows:

y_i = p_i + e_i z_i,    z_i = sqrt( p_i (1 - p_i) / n_i ),

where

logit(p_i) = (Xβ)_i,    β_lj ~ MVN(0, V_l).

If the model is assumed to have binomial variation then σ²_e is constrained to 1. The assumption of binomial variation need not hold, and the quasi-likelihood methods in MLwiN allow the constraint to be dropped. When the assumption of binomial variation is dropped, the variation is known as extra binomial variation. Work is needed to fit models with extra binomial variation using the MCMC methods.
8.2.2 Multinomial models

The binomial model is used for proportion data where the response variable is a collection of observations that have two possible states (0 and 1). This is a special case of the multinomial model, which is used when the response variable is a collection of observations with S possible states. The quasi-likelihood methods in MLwiN can fit multinomial models, but as yet I have not considered how to fit these models using MCMC. This work would probably follow easily after fitting the multivariate Normal response models mentioned later in this chapter.
8.2.3 Poisson responses for count data

The other sort of discrete data that needs to be considered is count data. As count data are restricted to being non-negative integer values, the Poisson distribution is usually used for the response variable, along with the log link function.
A multi-level Poisson model with a log link can therefore be written

y_i ~ Poisson(μ_i),

where

log(μ_i) = (Xβ)_i,    β_lj ~ MVN(0, V_l).

The Poisson model is related to the binomial in that it has a variance that is fixed if the mean is known. The algorithm for the binary response model can also be modified to fit Poisson models by modifying the conditional distributions. For the Poisson model the conditional distribution for the fixed effects should now be

p(β_1 | y, ...) ∝ p(β_1) × Π_{i ∈ M_T} e^{-e^{(Xβ)_i}} (e^{(Xβ)_i})^{y_i},

and the conditional distribution for the level l residuals should now be

p(β_lj | y, ...) ∝ Π_{i ∈ M_{l,j}} e^{-e^{(Xβ)_i}} (e^{(Xβ)_i})^{y_i} × |V_l|^{-1/2} exp[-½ β_lj^T V_l^{-1} β_lj].
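A corresponding hedged sketch of the Poisson log conditional for the fixed effects, with the same caveats and assumed argument names as the binomial sketch above:

import numpy as np

def log_cond_fixed_poisson(beta1, y, X, eta_rest, prior_var=1e6):
    # Log of the Poisson mean: (X*beta)_i plus the random-effect parts.
    eta = X @ beta1 + eta_rest
    # Poisson log likelihood term: sum_i (y_i * eta_i - exp(eta_i)).
    loglik = np.sum(y * eta - np.exp(eta))
    # Assumed diffuse N(0, prior_var) prior on each fixed effect.
    logprior = -0.5 * np.sum(beta1**2) / prior_var
    return loglik + logprior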
Extra Poisson variation

An alternative form for the multi-level Poisson response model is as follows:

y_i = μ_i + e_i z_i,    z_i = sqrt(μ_i),

where

log(μ_i) = (Xβ)_i,    β_lj ~ MVN(0, V_l).

If the model is assumed to have Poisson variation then σ²_e is constrained to 1. The assumption of Poisson variation need not hold, and the quasi-likelihood methods in MLwiN allow the constraint to be dropped. When the assumption of Poisson variation is dropped, the excess heterogeneity is known as extra Poisson variation. More work is needed to fit models with extra Poisson variation using the MCMC methods.
8.2.4 Extensions to complex variation at level 1

In Chapter 7 I introduced two methods based on Hastings updating steps that will fit Gaussian models with complex variation. The Hastings update step for the level 1 variance parameters was only added to the general Gibbs sampling algorithm from Chapter 5, and uniform priors were used for these parameters. More work needs to be done on incorporating the Hastings step into the hybrid Metropolis-Gibbs methods of Chapter 5 and on allowing users to specify informative prior distributions for these models. In particular, informative prior distributions for incomplete variance matrices at level 1 will need some more thought.

Simulating level 1 residuals

At present the Hastings algorithm works with the composite level 1 residual X^C_ij e_ij to generate simulated values for the level 1 variance parameters. This has the disadvantage that the r individual level 1 residuals for observation ij, e_rij, cannot then be estimated by the MCMC method. However, the maximum likelihood methods have a procedure for generating residuals which can be used with the MCMC estimates for the fixed effects and the variance parameters. This procedure can be used for now, and more work on finding MCMC methods that simulate the individual residuals can be performed.
Incomplete variance matrices at higher levels

The MLwiN package allows the variance structure at any level to be an incomplete matrix. This can be useful if the random parameters at a higher level are thought to be uncorrelated, as the covariance term can then be set to zero in the model. The quasi-likelihood methods have no problem fitting these models, but these models are outside the general Gaussian framework fitted by the MCMC algorithms in Chapter 5. The second Hastings update method given in Chapter 7 for the problem of complex variation at level 1 dealt with incomplete variance matrices at level 1, and this method could probably also be used at higher levels. The one disadvantage is that the higher level residuals will then not be estimated; instead the composite higher level residuals will be used, as for complex variation at level 1.
8.2.5 Multivariate response models

All the models studied in this thesis have a single response variable, but many problems exist where there are several response variables. A simple approach would be to fit each response variable separately with its own multi-level model. There is often, however, correlation between the response variables, and this correlation needs to be modelled. To model several responses together, multivariate response models need to be fitted.
The MLwiN package uses a clever approach to fitting such models. An extra level is added to the model and the various responses for each observation are considered different observations at this bottom level. This in effect reduces the multivariate response model to a special case of the univariate response model with no variability at level 1. Then, through the use of indicator variables, the multivariate response model can be fitted. For examples of multivariate response models, and more details on how they are fitted using maximum likelihood methods in MLwiN, see Chapter 4 of Goldstein (1995). Work is required to fit these models using MCMC methods in MLwiN.
Bibliography
Bernardo, J. M. and A. F. M. Smith (1994). Bayesian Theory. Chichester:
Wiley.
Box, G. E. P. and M. E. Muller (1958). A Note on the Generation of Random
Normal Deviates. Annals of Mathematical Statistics 29, 610{611.
Box, G. E. P. and G. C. Tiao (1992). Bayesian Inference in Statistical Analysis.
New York: John Wiley.
Breslow, N. E. and D. G. Clayton (1993). Approximate Inference in Generalized
Linear Mixed Models. Journal of the American Statistical Association 88,
9{25.
Brooks, S. P. and G. O. Roberts (1997). Assessing Convergence of Markov
Chain Monte Carlo Algorithms. Unpublished.
Bryk, A. S. and S. W. Raudenbush (1992). Hierarchical Linear Models.
Newbury Park: Sage.
Bryk, A. S., S. W. Raudenbush, M. Seltzer, and R. Congdon (1988). An
Introduction to HLM: Computer Program and User's guide (2.0 ed.).
Chicago: University of Chicago Dept of Education.
Chat�eld, C. (1989). The Analysis of Time Series : An Introduction (4th ed.).
London: Chapman and Hall.
Copas, J. B. and H. G. Li (1997). Inference for Non-Random Samples. Journal
of the Royal Statistical Society, Series B 59, 55{96.
Cowles, M. K. and B. P. Carlin (1996). Markov Chain Monte Carlo Con-
vergence Diagnostics: A Comparative Review. Journal of the American
Statistical Association 91, 883{904.
190
Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum Likelihood
from Incomplete Data via the EM Algorithm (with discussion). Journal of
the Royal Statistical Society, Series B 39, 1{38.
Draper, D. (1995). Inference and Hierarchical Modeling in the Social Sciences.
Journal of Educational and Behavioral Statistics 20, 115{147.
Draper, D. and R. Cheal (1997). Practical MCMC for Assessment and
Propagation of Model Uncertainty. Unpublished.
DuMouchel, W. and C. Waternaux (1992). Hierarchical Model for Combining
Information and for Meta-analyses (Discussion). In J. M. Bernardo, J. O.
Berger, A. P. Dawid, and A. F. M. Smith (Eds.), Bayesian Statistics 4, pp.
338{341. Oxford: Clarendon Press.
Gelfand, A. E., S. E. Hills, A. Racine-Poon, and A. F. M. Smith (1990).
Illustration of Bayesian Inference in Normal Data Models Using Gibbs
Sampling. Journal of the American Statistical Association 85, 972{985.
Gelfand, A. E. and S. K. Sahu (1994). On Markov Chain Monte Carlo
acceleration. Journal of Computational and Graphical Statistics 3, 261{276.
Gelfand, A. E., S. K. Sahu, and B. P. Carlin (1995). E�cient Parameterizations
for normal linear mixed models. Biometrika 82, 479{488.
Gelfand, A. E. and A. F. M. Smith (1990). Sampling Based Approaches
to Calculating Marginal Densities. Journal of the American Statistical
Association 85, 398{409.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (1995). Bayesian Data
Analysis. London: Chapman and Hall.
Gelman, A., G. O. Roberts, and W. R. Gilks (1995). E�cient Metropolis
Jumping Rules. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and
A. F. M. Smith (Eds.), Bayesian Statistics 5, pp. 599{607. Oxford: Oxford
University Press.
Gelman, A. and D. B. Rubin (1992). Inference from Iterative Simulation Using
Multiple Sequences. Statistical Science 7, 457{472.
Geman, S. and D. Geman (1984). Stochastic Relaxation, Gibbs Distributions
and the Bayesian Restoration of Images. IEEE Transactions on Pattern
Analysis and Machine Intelligence 45, 721{741.
191
Geweke, J. (1992). Evaluating the Accuracy of Sampling Based Approaches
to the Calculation of Posterior Moments. In J. M. Bernardo, J. O. Berger,
A. P. Dawid, and A. F. M. Smith (Eds.), Bayesian Statistics 4, pp. 169{193.
Oxford: Oxford University Press.
Gilks, W. R. (1995). Full Conditional Distributions. In W R Gilks and S
Richardson and D J Spiegelhalter (Ed.), Markov Chain Monte Carlo in
Practice. London: Chapman and Hall.
Gilks, W. R., G. O. Roberts, and S. K. Sahu (1996). Adaptive Markov
Chain Monte Carlo. Research report 20, Statistics Laboratory, University
of Cambridge.
Gilks, W. R. and P. Wild (1992). Adaptive Rejection Sampling for Gibbs Sampling. Journal of the Royal Statistical Society, Series C 41, 337-348.
Goldstein, H. (1986). Multilevel mixed linear model analysis using iterative generalised least squares. Biometrika 73, 43-56.
Goldstein, H. (1989). Restricted unbiased iterative generalised least squares estimation. Biometrika 76, 622-623.
Goldstein, H. (1991). Nonlinear Multilevel Models With an Application to Discrete Response Data. Biometrika 78, 45-51.
Goldstein, H. (1995). Multilevel Statistical Models (2nd ed.). London: Edward Arnold.
Goldstein, H. and J. Rasbash (1996). Improved Approximations for Multilevel Models with Binary Responses. Journal of the Royal Statistical Society, Series A 159, 505-513.
Goldstein, H., J. Rasbash, I. Plewis, D. Draper, W. Browne, M. Yang,
G. Woodhouse, and M. Healy (1998). A user's guide to MLwiN (1.0 ed.).
London: Institute of Education.
Goldstein, H. and D. J. Spiegelhalter (1996). League Tables and Their Limitations: Statistical Issues in Comparisons of Institutional Performance. Journal of the Royal Statistical Society, Series A 159, 385-409.
Hastings, W. K. (1970). Monte Carlo Sampling Methods using Markov Chains and their Applications. Biometrika 57, 97-109.
Heath, A., M. Yang, and H. Goldstein (1996). Multilevel Analysis of the Changing Relationship between Class and Party in Britain, 1964-1992. Quality and Quantity 30, 389-404.
Hills, S. E. and A. F. M. Smith (1992). Parameterization Issues in Bayesian Inference. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith (Eds.), Bayesian Statistics 4, pp. 227-246. Oxford: Oxford University Press.
Kreft, I. G. G., J. de Leeuw, and R. van der Leeden (1994). Review of Five Multilevel Analysis Programs: BMDP-5V, GENMOD, HLM, ML2, and VARCL. The American Statistician 48, 324-335.
Laird, N. M. (1978). Empirical Bayes Methods for Two-Way Contingency Tables. Biometrika 65, 581-590.
Longford, N. T. (1987). A Fast Scoring Algorithm for Maximum Likelihood Estimation in Unbalanced Mixed Models with Nested Random Effects. Biometrika 74, 817-827.
Longford, N. T. (1988). VARCL - software for variance components analysis of data with hierarchically nested random effects (maximum likelihood) (1.0 ed.). Princeton, NJ: Educational Testing Service.
MacEachern, S. N. and L. M. Berliner (1994). Subsampling the Gibbs Sampler. The American Statistician 48, 188-190.
McCullagh, P. and J. A. Nelder (1983). Generalized Linear Models. London:
Chapman and Hall.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller (1953). Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics 21, 1087-1092.
Müller, P. (1993). A generic approach to posterior integration and Gibbs sampling. Technical report, ISDS, Duke University.
Raftery, A. E. and S. M. Lewis (1992). How Many Iterations in the Gibbs Sampler? In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith (Eds.), Bayesian Statistics 4, pp. 763-773. Oxford: Oxford University Press.
Rasbash, J. and G. Woodhouse (1995). MLn: Command Reference Guide (1.0
ed.). London: Institute of Education.
Ripley, B. D. (1987). Stochastic Simulation. New York: Wiley.
Rodriguez, G. and N. Goldman (1995). An Assessment of Estimation Procedures for Multilevel Models with Binary Responses. Journal of the Royal Statistical Society, Series A 158, 73-89.
Seltzer, M. H. (1993). Sensitivity Analysis for Fixed Effects in the Hierarchical Model: A Gibbs Sampling Approach. Journal of Educational Statistics 18, 207-235.
Seltzer, M. H., W. H. Wong, and A. S. Bryk (1996). Bayesian Analysis in Applications of Hierarchical Models: Issues and Methods. Journal of Educational and Behavioral Statistics 21, 131-167.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis.
London: Chapman and Hall.
Spiegelhalter, D. J., A. Thomas, N. G. Best, and W. R. Gilks (1994). BUGS:
Bayesian inference using Gibbs sampling. Version 0.30a. Technical report,
MRC Biostatistics Unit, Cambridge.
Spiegelhalter, D. J., A. Thomas, N. G. Best, and W. R. Gilks (1995). BUGS:
Bayesian inference using Gibbs sampling. Version 0.50. Technical report,
MRC Biostatistics Unit, Cambridge.
Stiratelli, R., N. M. Laird, and J. Ware (1984). Random Effects Models for Serial Observations with Binary Responses. Biometrics 40, 961-971.
Woodhouse, G., J. Rasbash, H. Goldstein, and M. Yang (1995). Introduction to Multilevel Modelling. In G. Woodhouse (Ed.), A Guide to MLn for New Users. London: Institute of Education.
Zeger, S. L. and M. R. Karim (1991). Generalized Linear Models with Random Effects: a Gibbs Sampling Approach. Journal of the American Statistical Association 86, 79-86.