Applying MCMC Methods to
Multi-level Models
submitted by
William J. Browne
for the degree of PhD
of the
University of Bath
1998
COPYRIGHT
Attention is drawn to the fact that copyright of this thesis rests with its author.
This copy of the thesis has been supplied on the condition that anyone who
consults it is understood to recognise that its copyright rests with its author and
that no quotation from the thesis and no information derived from it may be
published without the prior written consent of the author.
This thesis may be made available for consultation within the University Library
and may be photocopied or lent to other libraries for the purposes of consultation.
Signature of Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
William J. Browne
To
Health, Happiness and Honesty.
Summary
Multi-level modelling and Markov chain Monte Carlo methods are two areas of
statistics that have become increasingly popular recently due to improvements
in computer capabilities, both in storage and speed of operation. The aim
of this thesis is to combine the two areas by fitting multi-level models using
Markov chain Monte Carlo (MCMC) methods. This task has been split into
three parts in this thesis. Firstly, the types of problems that are fitted in
multi-level modelling are identified and the existing maximum likelihood methods are
investigated. Secondly, MCMC algorithms for these models are derived, and finally
these methods are compared to the maximum likelihood based methods both in
terms of estimate bias and interval coverage properties.
Three main groups of multi-level models are considered: firstly, N level
Gaussian models; secondly, binary response multi-level logistic regression models;
and finally, Gaussian models with complex variation at level 1.
Two simple 2 level Gaussian models are considered first and it is shown
how to fit these models using the Gibbs sampler. Then extensive simulation
studies are carried out to compare the Gibbs sampler method with maximum
likelihood methods on these two models. For the general N level Gaussian models,
algorithms for the Gibbs sampler and two alternative hybrid Metropolis Gibbs
methods are given and these three methods are then compared with each other.
One of the hybrid Metropolis Gibbs methods is adapted to fit binary response
multi-level models. This method is then compared with two quasi-likelihood
methods via a simulation study on one binary response model where the
quasi-likelihood methods perform particularly badly.
All of the above models can also be fitted using the Gibbs sampling method
using the adaptive rejection algorithm in the BUGS package (Spiegelhalter et al.
1994). Finally, Gaussian models with complex variation at level 1, which cannot
be fitted in BUGS, are considered. Two methods based on Hastings update steps
are given and are tested on some simple examples.
The MCMC methods in this thesis have been added to the multi-level
modelling package MLwiN (Goldstein et al. 1998) as a by-product of this
research.
Acknowledgements
I would firstly like to thank my supervisor, Dr David Draper, whose research in
the fields of hierarchical modelling and Bayesian statistics motivated this PhD.
I would also like to thank him for his advice and assistance throughout both
my MSc and PhD. I would like to thank my parents for supporting me both
financially and emotionally through my first degree and beyond.
I would like to thank the multilevel models project team at the Institute of
Education, in particular Jon Rasbash and Professor Harvey Goldstein for allowing
me to work with them on the MLwiN package. I would also like to thank them
for their advice and assistance while I have been working on the package.
I would like to thank my brother Edward and his fiancée Meriel for arranging
their wedding a month before I am scheduled to finish this thesis. This way I
can spread my worries between my PhD and my best man's speech. I would like
to thank my girlfriends over the last three years for helping me through various
parts of my PhD. Thanks for giving me love and support when I needed it and
making my life both happy and interesting.
I would like to thank the other members of the statistics group at Bath for
teaching me all I know about statistics today. I would like to thank my fellow
office mates, past and present, for their humour, conversation and friendship and
for joining me in my many pointless conversations. Thanks to family and friends
both in Bath and elsewhere.
Special thanks are due to the EPSRC for their financial support.
"The only thing I know is that I don't know anything."
Socrates
Contents
1 Introduction 1
1.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Summary of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Multi Level Models and MLn 4
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 JSP dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Analysing Redhill school data . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Linear regression . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 Linear models . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Analysing data on the four schools in the borough of Blackbridge 8
2.3.1 ANOVA model . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 ANCOVA model . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.3 Combined regression . . . . . . . . . . . . . . . . . . . . . 11
2.4 Two level modelling . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.1 Iterative generalised least squares . . . . . . . . . . . . . . 13
2.4.2 Restricted iterative generalised least squares . . . . . . . . 15
2.4.3 Fitting variance components models to the Blackbridge
dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.4 Fitting variance components models to the JSP dataset . . 16
2.4.5 Random slopes model . . . . . . . . . . . . . . . . . . . . 17
2.5 Fitting models to pass/fail data . . . . . . . . . . . . . . . . . . . 18
2.5.1 Extending to multi-level modelling . . . . . . . . . . . . . 20
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Markov Chain Monte Carlo Methods 23
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Bayesian inference . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Metropolis sampling . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.1 Proposal distributions . . . . . . . . . . . . . . . . . . . . 26
3.4 Metropolis-Hastings sampling . . . . . . . . . . . . . . . . . . . . 27
3.5 Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5.1 Rejection sampling . . . . . . . . . . . . . . . . . . . . . . 28
3.5.2 Adaptive rejection sampling . . . . . . . . . . . . . . . . . 29
3.5.3 Gibbs sampler as a special case of the Metropolis-Hastings
algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.6 Data summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6.1 Measures of location . . . . . . . . . . . . . . . . . . . . . 31
3.6.2 Measures of spread . . . . . . . . . . . . . . . . . . . . . . 31
3.6.3 Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.7 Convergence issues . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.7.1 Length of burn-in . . . . . . . . . . . . . . . . . . . . . . . 35
3.7.2 Mixing properties of Markov chains . . . . . . . . . . . . . 36
3.7.3 Multi-modal models . . . . . . . . . . . . . . . . . . . . . 41
3.7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.8 Use of MCMC methods in multi-level modelling . . . . . . . . . . 43
3.9 Example - Bivariate normal distribution . . . . . . . . . . . . . . 44
3.9.1 Metropolis sampling . . . . . . . . . . . . . . . . . . . . . 44
3.9.2 Metropolis-Hasting sampling . . . . . . . . . . . . . . . . . 45
3.9.3 Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . 46
3.9.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.9.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4 Gaussian Models 1 - Introduction 53
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Prior distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.1 Informative priors . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.2 Non-informative priors . . . . . . . . . . . . . . . . . . . . 54
4.2.3 Priors for fixed effects . . . . . . . . . . . . . . . . . . . . . . 55
4.2.4 Priors for single variances . . . . . . . . . . . . . . . . . . 55
4.2.5 Priors for variance matrices . . . . . . . . . . . . . . . . . 58
4.3 2 Level variance components model . . . . . . . . . . . . . . . . . 59
4.3.1 Gibbs sampling algorithm . . . . . . . . . . . . . . . . . . 59
4.3.2 Simulation method . . . . . . . . . . . . . . . . . . . . . . 62
4.3.3 Results : Bias . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.4 Results : Coverage probabilities and interval widths . . . . 70
4.3.5 Improving maximum likelihood method interval estimates
for σ²_u . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3.6 Summary of results . . . . . . . . . . . . . . . . . . . . . . 78
4.4 Random slopes regression model . . . . . . . . . . . . . . . . . . . 79
4.4.1 Gibbs sampling algorithm . . . . . . . . . . . . . . . . . . 80
4.4.2 Simulation method . . . . . . . . . . . . . . . . . . . . . . 83
4.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.5.1 Simulation results . . . . . . . . . . . . . . . . . . . . . . . 96
4.5.2 Priors in MLwiN . . . . . . . . . . . . . . . . . . . . . . . 103
5 Gaussian Models 2 - General Models 104
5.1 General N level Gaussian hierarchical linear models . . . . . . . . 104
5.2 Gibbs sampling approach . . . . . . . . . . . . . . . . . . . . . . . 105
5.3 Generalising to N levels . . . . . . . . . . . . . . . . . . . . . . . 110
5.3.1 Algorithm 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.3.2 Computational considerations . . . . . . . . . . . . . . . . 113
5.4 Method 2 : Metropolis Gibbs hybrid method with univariate updates . . 114
5.4.1 Algorithm 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.4.2 Choosing proposal distribution variances . . . . . . . . . . 115
5.4.3 Adaptive Metropolis univariate normal proposals . . . . . 120
5.5 Method 3 : Metropolis Gibbs hybrid method with block updates . 127
5.5.1 Algorithm 3 . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.5.2 Choosing proposal distribution variances . . . . . . . . . . 128
5.5.3 Adaptive multivariate normal proposal distributions . . . 132
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.6.1 Timing considerations . . . . . . . . . . . . . . . . . . . . 138
6 Logistic Regression Models 139
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.2 Multi-level binary response logistic regression models . . . . . . . 140
6.2.1 Metropolis Gibbs hybrid method with univariate updates . 141
6.2.2 Other existing methods . . . . . . . . . . . . . . . . . . . . 143
6.3 Example 1 : Voting intentions dataset . . . . . . . . . . . . . . . 143
6.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.3.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.3.4 Substantive Conclusions . . . . . . . . . . . . . . . . . . . 145
6.3.5 Optimum proposal distributions . . . . . . . . . . . . . . . 146
6.4 Example 2 : Guatemalan child health dataset . . . . . . . . . . . 148
6.4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.4.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.4.3 Original 25 datasets . . . . . . . . . . . . . . . . . . . . . 150
6.4.4 Simulating more datasets . . . . . . . . . . . . . . . . . . . 151
6.4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7 Gaussian Models 3 - Complex Variation at level 1 158
7.1 Model definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.2 Updating methods for a scalar variance . . . . . . . . . . . . . . . 159
7.2.1 Metropolis algorithm for log σ² . . . . . . . . . . . . . . . 160
7.2.2 Hastings algorithm for σ² . . . . . . . . . . . . . . . . . . 160
7.2.3 Example : Normal observations with an unknown variance 161
7.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.3 Updating methods for a variance matrix . . . . . . . . . . . . . . 163
7.3.1 Hastings algorithm with an inverse Wishart proposal . . . 164
7.3.2 Example : Bivariate normal observations with an unknown
variance matrix . . . . . . . . . . . . . . . . . . . . . . . . 164
7.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.4 Applying inverse Wishart updates to complex variation at level 1 166
7.4.1 MCMC algorithm . . . . . . . . . . . . . . . . . . . . . . . 167
7.4.2 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
7.4.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 170
7.5 Method 2 : Using truncated normal Hastings update steps . . . . 171
7.5.1 Update steps at level 1 for JSP example . . . . . . . . . . 171
7.5.2 Proposal distributions . . . . . . . . . . . . . . . . . . . . 176
7.5.3 Example 2 : Non-positive definite and incomplete variance
matrices at level 1 . . . . . . . . . . . . . . . . . . . . . . 176
7.5.4 General algorithm for truncated normal proposal method . 178
7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
8 Conclusions and Further Work 182
8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
8.1.1 MCMC options in the MLwiN package . . . . . . . . . . . 183
8.2 Further work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
8.2.1 Binomial responses . . . . . . . . . . . . . . . . . . . . . . 184
8.2.2 Multinomial models . . . . . . . . . . . . . . . . . . . . . . 186
8.2.3 Poisson responses for count data . . . . . . . . . . . . . . . 186
8.2.4 Extensions to complex variation at level 1 . . . . . . . . . 187
8.2.5 Multivariate response models . . . . . . . . . . . . . . . . 188
List of Figures
2-1 Plot of the regression lines for the four schools in the Borough of
Blackbridge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2-2 Tree diagram for the Borough of Blackbridge. . . . . . . . . . . . 12
3-1 Histogram of θ_1 using the Gibbs sampling method. . . . . . . . . 33
3-2 Kernel density plot of θ_1 using the Gibbs sampling method and a
Gaussian kernel with a large value of the window width h. . . . . 35
3-3 Traces of parameter θ_1 and the running mean of θ_1 for a Metropolis
run that converges after about 50 iterations. Upper solid line in
lower panel is running mean with first 50 iterations discarded. . . 37
3-4 ACF and PACF for parameter θ_1 for a Gibbs sampling run of
length 5000 that is mixing well and a Metropolis run that is not
mixing very well. . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3-5 Kernel density plot of θ_2 using the Gibbs sampling method and a
Gaussian kernel. . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3-6 Plots of the Raftery Lewis N values for various values of σ_p, the
proposal distribution standard deviation. . . . . . . . . . . . . . 50
3-7 Plot of the MCMC diagnostic window in the package MLwiN for
the parameter β_1 from a random slopes regression model. . . . . 52
4-1 Plot of normal prior distributions over the range (−5, 5) with mean
0 and variances 1, 2, 5, 10 and 50 respectively. . . . . . . . . . . . 56
4-2 Plots of biases obtained for the various methods against study
design and parameter settings. . . . . . . . . . . . . . . . . . . . 69
4-3 Trajectories plot of IGLS estimates for run of random slopes
regression model where convergence is not achieved. . . . . . . . 85
4-4 Plots of biases obtained for the various methods fitting the
random slopes regression model against value of Ω_u01 (Fixed effects
parameters and level 1 variance parameter). . . . . . . . . . . . . 94
4-5 Plots of biases obtained for the various methods fitting the random
slopes regression model against value of Ω_u01 (Level 2 variance
parameters). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4-6 Plots of biases obtained for the various methods fitting the
random slopes regression model against study design (Fixed effects
parameters and level 1 variance parameter). . . . . . . . . . . . . 100
4-7 Plots of biases obtained for the various methods fitting the random
slopes regression model against study design (Level 2 variance
parameters). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5-1 Plots of the effect of varying the scale factor for the proposal
variance and hence the Metropolis acceptance rate on the Raftery
Lewis diagnostic for the β_0 parameter in the variance components
model on the JSP dataset. . . . . . . . . . . . . . . . . . . . . . 117
5-2 Plots of the effect of varying the scale factor for the proposal
variance and hence the Metropolis acceptance rate on the Raftery
Lewis diagnostic for the β_0 parameter in the random slopes
regression model on the JSP dataset. . . . . . . . . . . . . . . . . 118
5-3 Plots of the effect of varying the scale factor for the proposal
variance and hence the Metropolis acceptance rate on the Raftery
Lewis diagnostic for the β_1 parameter in the random slopes
regression model on the JSP dataset. . . . . . . . . . . . . . . . . 119
5-4 Plots of the effect of varying the scale factor for the multivariate
normal proposal distribution and hence the Metropolis acceptance
rate on the Raftery Lewis diagnostic for the β_0 parameter in the
random slopes regression model on the JSP dataset. . . . . . . . 130
5-5 Plots of the effect of varying the scale factor for the multivariate
normal proposal distribution and hence the Metropolis acceptance
rate on the Raftery Lewis diagnostic for the β_1 parameter in the
random slopes regression model on the JSP dataset. . . . . . . . 131
6-1 Plot of the effect of varying the scale factor for the univariate
Normal proposal distribution rate on the Raftery Lewis diagnostic
for the σ²_u parameter in the voting intentions dataset. . . . . . 147
6-2 Plots comparing the actual coverage of the four estimation meth-
ods with their nominal coverage for the parameters β_0, β_1 and β_2. 154
6-3 Plots comparing the actual coverage of the four estimation meth-
ods with their nominal coverage for the parameters β_3, σ²_v and σ²_u. 155
7-1 Plots of truncated univariate normal proposal distributions for a
parameter, θ. A is the current value, θ_c, and B is the proposed new
value, θ*. M is max_θ and m is min_θ, the truncation points. The
distributions in (i) and (iii) have mean θ_c, while the distributions
in (ii) and (iv) have mean θ*. . . . . . . . . . . . . . . . . . . . . 175
List of Tables
2.1 Summary of Redhill primary school results from JSP dataset. . . 6
2.2 Parameter estimates for model including Sex and Non-Manual
covariates for Redhill primary school. . . . . . . . . . . . . . . . . 8
2.3 Summary of schools in the borough of Blackbridge. . . . . . . . . 8
2.4 Parameter estimates for ANOVA and ANCOVA models for the
borough of Blackbridge dataset. . . . . . . . . . . . . . . . . . . . 10
2.5 Parameter estimates for two variance components models using
both IGLS and RIGLS for Borough of Blackbridge dataset. . . . . 15
2.6 Parameter estimates for two variance components models using
both IGLS and RIGLS for all schools in the JSP dataset. . . . . . 16
2.7 Comparison between fitted values using the ANOVA model and
the variance components model 1 using RIGLS. . . . . . . . . . . 16
2.8 Parameter estimates for random slopes model using both IGLS
and RIGLS for all schools in the JSP dataset. . . . . . . . . . . . 17
2.9 Comparison between fitted regression lines produced by separate
regressions and the random slopes model. . . . . . . . . . . . . . . 18
2.10 Parameter estimates for the two logistic regression models fitted
to the Blackbridge dataset. . . . . . . . . . . . . . . . . . . . . . . 20
2.11 Parameter estimates for the two-level logistic regression models
fitted to the JSP dataset. . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 Comparison between MCMC methods for fitting a bivariate
normal model with unknown mean vector. . . . . . . . . . . . . . 48
3.2 Comparison between 95% confidence intervals and Bayesian cred-
ible intervals in bivariate normal model. . . . . . . . . . . . . . . 51
4.1 Summary of study designs for variance components model simula-
tion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Summary of times for Gibbs sampling in the variance components
model with different study designs for 50,000 iterations. . . . . . . 65
4.3 Summary of Raftery Lewis convergence times (thousands of
iterations) for various studies. . . . . . . . . . . . . . . . . . . . . 66
4.4 Summary of simulation lengths for Gibbs sampling the variance
components model with different study designs. . . . . . . . . . . 67
4.5 Estimates of relative bias for the variance parameters using
different methods and different studies. True level 2/1 variance
values are 10 and 40. . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.6 Estimates of relative bias for the variance parameters using
different methods and different true values. All runs use study
design 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.7 Comparison of actual coverage percentage values for nominal 90%
and 95% intervals for the fixed effect parameter using different
methods and different studies. True values for the variance
parameters are 10 and 40. Approximate MCSEs are 0.28%/0.15%
for 90%/95% coverage estimates. . . . . . . . . . . . . . . . . . . 72
4.8 Average 90%/95% interval widths for the fixed effect parameter
using different studies. True values for the variance parameters
are 10 and 40. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.9 Comparison of actual coverage percentage values for nominal 90%
and 95% intervals for the fixed effect parameter using different
methods and different true values. All runs use study design
7. Approximate MCSEs are 0.28%/0.15% for 90%/95% coverage
estimates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.10 Average 90%/95% interval widths for the fixed effect parameter
using different true parameter values. All runs use study design 7. 73
4.11 Comparison of actual coverage percentage values for nominal
90% and 95% intervals for the level 2 variance parameter using
different methods and different studies. True values of the variance
parameters are 10 and 40. Approximate MCSEs are 0.28%/0.15%
for 90%/95% coverage estimates. . . . . . . . . . . . . . . . . . . 73
4.12 Average 90%/95% interval widths for the level 2 variance param-
eter using different studies. True values of the variance parameters
are 10 and 40. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.13 Comparison of actual coverage percentage values for nominal 90%
and 95% intervals for the level 2 variance parameter using different
methods and different true values. All runs use study design
7. Approximate MCSEs are 0.28%/0.15% for 90%/95% coverage
estimates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.14 Average 90%/95% interval widths for the level 2 variance par-
ameter using different true parameter values. All runs use study
design 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.15 Comparison of actual coverage percentage values for nominal
90% and 95% intervals for the level 1 variance parameter using
different methods and different studies. True values of the variance
parameters are 10 and 40. Approximate MCSEs are 0.28%/0.15%
for 90%/95% coverage estimates. . . . . . . . . . . . . . . . . . . 74
4.16 Average 90%/95% interval widths for the level 1 variance param-
eter using different studies. True values of the variance parameters
are 10 and 40. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.17 Comparison of actual coverage percentage values for nominal 90%
and 95% intervals for the level 1 variance parameter using different
methods and different true values. All runs use study design
7. Approximate MCSEs are 0.28%/0.15% for 90%/95% coverage
estimates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.18 Average 90%/95% interval widths for the level 1 variance par-
ameter using different true parameter values. All runs use study
design 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.19 Summary of results for the level 2 variance parameter, σ²_u, using
the RIGLS method and inverse gamma intervals. . . . . . . . . . 78
4.20 Summary of the convergence for the random slopes regression
with the maximum likelihood based methods (IGLS/RIGLS). The
study design is given in terms of the number of level 2 units and
whether the study is balanced (B) or unbalanced (U). . . . . . . 86
4.21 Summary of results for the random slopes regression with the 48
schools unbalanced design with parameter values Ω_u00 = 5, Ω_u01 =
0 and Ω_u11 = 0.5. All 1000 runs. . . . . . . . . . . . . . . . . . . 88
4.22 Summary of results for the random slopes regression with the 48
schools unbalanced design with parameter values Ω_u00 = 5, Ω_u01 =
1.4 and Ω_u11 = 0.5. Only 982 runs. . . . . . . . . . . . . . . . . . 90
4.23 Summary of results for the random slopes regression with the 48
schools unbalanced design with parameter values Ω_u00 = 5, Ω_u01 =
−1.4 and Ω_u11 = 0.5. Only 984 runs. . . . . . . . . . . . . . . . . 91
4.24 Summary of results for the random slopes regression with the 48
schools unbalanced design with parameter values Ω_u00 = 5, Ω_u01 =
0.5 and Ω_u11 = 0.5. All 1000 runs. . . . . . . . . . . . . . . . . . 92
4.25 Summary of results for the random slopes regression with the 48
schools unbalanced design with parameter values Ω_u00 = 5, Ω_u01 =
−0.5 and Ω_u11 = 0.5. Only 998 runs. . . . . . . . . . . . . . . . . 93
4.26 Summary of results for the random slopes regression with the 48
schools balanced design with parameter values Ω_u00 = 5, Ω_u01 =
0.0 and Ω_u11 = 0.5. All 1000 runs. . . . . . . . . . . . . . . . . . 97
4.27 Summary of results for the random slopes regression with the 12
schools unbalanced design with parameter values Ω_u00 = 5, Ω_u01 =
0.0 and Ω_u11 = 0.5. Only 877 runs. . . . . . . . . . . . . . . . . . 98
4.28 Summary of results for the random slopes regression with the 12
schools balanced design with parameter values Ω_u00 = 5, Ω_u01 =
0.0 and Ω_u11 = 0.5. Only 990 runs. . . . . . . . . . . . . . . . . . 99
5.1 Optimal scale factors for proposal variances and best acceptance
rates for several models. . . . . . . . . . . . . . . . . . . . . . . . 120
5.2 Demonstration of Adaptive Method 1 for parameters β_0 and β_1
using arbitrary (1.000) starting values. . . . . . . . . . . . . . . . 123
5.3 Comparison of results for the random slopes regression model
on the JSP dataset using uniform priors for the variances, and
different MCMC methods. Each method was run for 50,000
iterations after a burn-in of 500. . . . . . . . . . . . . . . . . . . 125
5.4 Demonstration of Adaptive Method 2 for parameters β_0 and β_1
using arbitrary (1.000) starting values. . . . . . . . . . . . . . . . 127
5.5 Demonstration of Adaptive Method 3 for the β parameter vector
using RIGLS starting values. . . . . . . . . . . . . . . . . . . . . 135
5.6 Comparison of results for the random slopes regression model
on the JSP dataset using uniform priors for the variances, and
different block updating MCMC methods. Each method was run
for 50,000 iterations after a burn-in of 500. . . . . . . . . . . . . 136
6.1 Comparison of results from the quasi-likelihood methods and the
MCMC methods for the voting intention dataset. The MCMC
method is based on a run of 50,000 iterations after a burn-in of
500 and adapting period. . . . . . . . . . . . . . . . . . . . . . . 145
6.2 Optimal scale factors for proposal variances and best acceptance
rates for the voting intentions model. . . . . . . . . . . . . . . . 146
6.3 Summary of results (with Monte Carlo standard errors) for the
first 25 datasets of the Rodriguez Goldman example. . . . . . . . 151
6.4 Summary of results (with Monte Carlo standard errors) for the
Rodriguez Goldman example with 500 generated datasets. . . . . 153
7.1 Comparison between three MCMC methods for a univariate
normal model with unknown variance. . . . . . . . . . . . . . . . 163
7.2 Comparison between two MCMC methods for a bivariate normal
model with unknown variance matrix. . . . . . . . . . . . . . . . . 166
7.3 Comparison between IGLS/RIGLS and MCMC method on a
simulated dataset with the layout of the JSP dataset. . . . . . . . 170
7.4 Comparison between RIGLS and MCMC method 2 on three
models with complex variation fitted to the JSP dataset. . . . . . 177
Chapter 1
Introduction
1.1 Objectives
Multi-level modelling has recently become an increasingly interesting and
applicable statistical tool. Many areas of application fit readily into a multi-level
structure. Goldstein and Spiegelhalter (1996) illustrate the use of multi-level
modelling in two leading application areas, health and education; other
application areas include household surveys and animal growth studies.
Several packages have been written to fit multi-level models. MLn (Rasbash
and Woodhouse (1995)), HLM (Bryk et al. (1988), Bryk and Raudenbush
(1992)), and VARCL (Longford (1987), Longford (1988)) are all packages
which use as their fitting mechanisms maximum likelihood or empirical Bayes
methodology. These methods are used to find estimates of parameters of
interest in complicated models where exact methods would involve intractable
integrations.
Another technique that has come to the forefront of statistical research over
the last decade or so is the use of Markov chain Monte Carlo (MCMC) simulation
methods (Gelfand and Smith 1990). With the increase in computer power, both in
speed of computation and in memory capacity, techniques that were theoretical
ideas thirty years ago are now practical reality. The structure of the multi-level
model, with its interdependence between variables, makes it an ideal area of
application for MCMC techniques. Draper (1995) describes the use of multi-level
modelling in the social sciences and recommends greater use of MCMC methods
in this field.
When MCMC methods were first introduced, if statisticians wanted to fit a
complicated model they would program up their own MCMC sampler for the
problem they were considering and use it to solve that problem. More recently
a general purpose MCMC sampler, BUGS (Spiegelhalter et al. 1995), has been
produced that will fit a wide range of models in many application areas. BUGS
uses a technique called Gibbs sampling to fit its models using an adaptive rejection
algorithm described in Gilks and Wild (1992).
In this thesis I am interested in studying multi-level models, and comparing
the maximum likelihood based methods in the package MLn with MCMC
methods. I will parallel the work of BUGS and consider fitting various families of
multi-level models using both Gibbs sampling and Metropolis-Hastings sampling
methods. I will also consider how the maximum likelihood methods can be used to
give the MCMC methods good starting values and suitable proposal distributions
for Metropolis-Hastings sampling.
The package MLwiN (Goldstein et al. 1998) is the new version of MLn and
some of its new features are a result of the work contained in this thesis. MLwiN
contains for the first time MCMC methodology as well as the existing maximum
likelihood based methods.
1.2 Summary of Thesis
In the next chapter I will discuss some of the background to multi-level modelling
using as an example an educational dataset. I will introduce multi-level modelling
as an extension to linear modelling and explain briefly how the existing maximum
likelihood methods in MLn fit multi-level models.
In Chapter 3 I will consider MCMC simulation techniques and summarise the
main techniques, Metropolis sampling, Gibbs sampling and Hastings sampling.
I will explain how such techniques are used and how to get estimates from the
chains they produce. I will also consider convergence issues when using Markov
chains and motivate all the methods with a simple example.
In Chapter 4 I will consider two very simple multi-level models, the two-
level variance components model, and the random slopes regression model, both
introduced in Chapter 2. I will use these models to illustrate the important issue
of choosing general `diffuse' prior distributions when using MCMC methods. The
chapter will consist of two large simulation experiments to compare and contrast
the IGLS and RIGLS maximum likelihood methods with MCMC methods using
various prior distributions under different scenarios.
In Chapter 5 I will discuss some more general algorithms that will fit N level
Gaussian multi-level models. I will give three algorithms, firstly Gibbs sampling
and then two hybrid Gibbs Metropolis samplers: the first containing univariate
updating steps, and the second block updating steps. For each hybrid sampler I
will also describe an adaptive Metropolis technique to improve its mixing. I will
then compare all the samplers through some simple examples.
In Chapter 6 I will discuss multi-level logistic regression models. I will consider
one of the hybrid samplers introduced in the previous chapter and show how it can
be modified to fit these new models. These models are a family that maximum
likelihood techniques perform particularly badly on. I will therefore compare the
maximum likelihood based methods with the new hybrid sampler via another
simulation experiment.
In Chapter 7 I will introduce a complex variation structure at level 1 as
a generalisation of the Gaussian models introduced in Chapter 5. I will then
implement two Hastings updating techniques for the level 1 variance parameters
that aim to sample from such models. Firstly a technique based on an inverse
Wishart proposal distribution and secondly a technique based on a truncated
normal proposal distribution. I will then compare the results of both methods to
the maximum likelihood methods. In Chapter 8 I will discuss other multi-level
models that have not been fitted in the previous chapters and add some general
conclusions about the thesis as a whole.
Chapter 2
Multi Level Models and MLn
2.1 Introduction
In the introduction I mentioned several applications that contain datasets where a
multi-level structure is appropriate. The package MLn (Rasbash and Woodhouse
1995) was written at the Institute of Education primarily to fit models in the
area of education although it can be used in many of the other applications of
multi-level modelling. In this chapter I intend to consider, through examples,
some statistical problems that arise in the field of education. These problems
will increase in complexity to incorporate multi-level modelling. I will explain
how the maximum likelihood methods in MLn can be used to fit the models as
each new model is introduced. The dataset used in this chapter is the Junior
School Project (JSP) dataset analysed in Woodhouse et al. (1995).
2.1.1 JSP dataset
The JSP is a longitudinal study of approximately 2000 pupils who entered junior
school in 1980. Woodhouse et al. (1995) analyse a subset of the data containing
887 pupils from 48 primary schools taken from the Inner London Education
Authority (ILEA). For each child they consider his/her Maths scores in two tests,
marked out of 40, taken in years 3 and 5, along with other variables that measure
the child's background. I will consider smaller subsets of this subset in the models
considered in this chapter. Any names used in the examples are fictitious, and
are simply used to aid my descriptions.
I will now consider as my first dataset the sample of pupils from one school
participating in the Junior School project, and consider how to statistically
describe information on an individual pupil.
2.2 Analysing Redhill school data
Redhill Primary school is the 5th school in the JSP dataset and the sample of
pupils participating in the JSP has 25 pupils who sat Maths tests in years 3
and 5. I will denote the Maths scores, out of a possible 40, in years 3 and 5
as M3 and M5 respectively. When considering the data from one school, it is
the individual pupils' marks that are of interest. The data for Redhill school are
given in Table 2.1.
The individual pupils and their parents will be interested in how they, or their
children are doing in terms of what marks were achieved, and how these marks
compare with the other pupils in the school. Consider John Smith, pupil 10,
who achieved 30 in both his M3 and M5 test scores, or equivalently 75% in each
test. If this is the only information available then John Smith appears to have
made steady progress in mathematics. If instead the marks for the whole class
are available then each child could be given a ranking to indicate where he/she
finished in the class.
It can now be seen that John Smith ranked equal eighth in the third year test
but only equal eighteenth in the second test, so although he got the same mark
in each test, compared to the rest of the class he has done worse in the second
test. This is because although his marks have stayed constant the mean mark for
the class has risen from 27.8 to 32. This may be because the second test is in fact
comparatively easier than the first test, or the teaching between the two tests
has improved the children's average performance. With only the data given it
is impossible to distinguish between these two reasons for the improved average
mark.
2.2.1 Linear regression
A better way to compare John Smith's two marks is to perform a regression of
the M5 marks on the M3 marks. This will be the first model to be fitted to the
Table 2.1: Summary of Redhill primary school results from JSP dataset.
Pupil   M3   Rank   M5   Rank   M/NM   Sex
1       25   18     25   21     M      M
2       17   25     33   13     M      M
3       33    2     36    5     NM     F
4       25   18     31   17     NM     M
5       23   22     25   21     M      F
6       33    2     37    2     NM     F
7       29   11     23   24     NM     M
8       30    8     34   10     NM     F
9       32    4     36    5     NM     F
10      30    8     30   18     M      M
11      34    1     33   13     NM     F
12      27   16     32   16     NM     M
13      24   20     21   25     M      F
14      32    4     34   10     M      F
15      28   14     38    1     NM     M
16      28   14     36    5     NM     M
17      24   20     33   13     M      F
18      30    8     36    5     M      F
19      32    4     37    2     NM     F
20      29   11     34   10     NM     F
21      29   11     28   20     NM     F
22      22   23     30   18     NM     M
23      22   23     25   21     M      M
24      32    4     37    2     NM     M
25      27   16     36    5     NM     M
dataset and can be written as :
M5_i = β_0 + β_1 M3_i + e_i,

where the e_i are the residuals and are assumed to be distributed normally with
mean zero and variance σ². The linear regression problem is studied in great
detail in basic statistics courses and the least squares estimates are as follows:

β_0 = ȳ − β_1 x̄,

β_1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²,
where in this example, x = M3 and y = M5.
Fitting the above model gives estimates β_0 = 15.96 and β_1 = 0.575, and from
these estimates, expected values and residuals can be calculated. Consequently
John Smith, with his mark of 30 for year 3, would be expected to get 33.22 for
year 5, and the residual for John Smith is then 30 − 33.22 = −3.22. This means
that, under this model, John Smith got 3.22 marks less than would be expected
of the average pupil (at Redhill school) with a mark of 30 in year 3.
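As a quick check, these estimates can be reproduced from the Table 2.1 scores with a few lines of Python (the thesis itself uses MLn rather than code; this snippet is purely illustrative):

```python
import numpy as np

# Maths scores for the 25 Redhill pupils, taken from Table 2.1.
m3 = np.array([25, 17, 33, 25, 23, 33, 29, 30, 32, 30, 34, 27, 24,
               32, 28, 28, 24, 30, 32, 29, 29, 22, 22, 32, 27])
m5 = np.array([25, 33, 36, 31, 25, 37, 23, 34, 36, 30, 33, 32, 21,
               34, 38, 36, 33, 36, 37, 34, 28, 30, 25, 37, 36])

# Least squares estimates for M5_i = beta0 + beta1 * M3_i + e_i.
beta1 = np.sum((m3 - m3.mean()) * (m5 - m5.mean())) / np.sum((m3 - m3.mean()) ** 2)
beta0 = m5.mean() - beta1 * m3.mean()
print(beta0, beta1)                 # approximately 15.96 and 0.575

# Residual for John Smith (pupil 10), who scored 30 in both tests.
print(30 - (beta0 + beta1 * 30))    # approximately -3.22
```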
2.2.2 Linear models
The simple linear regression model is a member of a larger family of models known
as normal linear models (McCullagh and Nelder 1983). Two other covariates,
the sex of each pupil and whether their parents' occupation was manual or non-
manual, were collected. The simple linear regression model can be expanded to
include these two covariates as follows
M5_i = β_0 + β_1 M3_i + β_2 SEX_i + β_3 NONMAN_i + e_i
     = X_i β + e_i,

where SEX_i = 0 for girls, 1 for boys and NONMAN_i = 0 for manual work and
1 for non-manual work.
The formula for the least squares estimates for a normal linear model is similar
to the formula for simple linear regression except with matrices replacing vectors.
The estimate for the parameter vector β is

β = (XᵀX)⁻¹ Xᵀ y.
For our model the least squares estimates are given in Table 2.2.
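A sketch of the same calculation in matrix form, using the M3, sex and manual/non-manual columns read off Table 2.1 (girls coded 0 and manual backgrounds coded 0, as stated above); it should recover the estimates reported in Table 2.2:

```python
import numpy as np

# Redhill data from Table 2.1: M3 score, sex (1 = boy) and
# non-manual background (1 = non-manual) for each of the 25 pupils.
m3 = np.array([25, 17, 33, 25, 23, 33, 29, 30, 32, 30, 34, 27, 24,
               32, 28, 28, 24, 30, 32, 29, 29, 22, 22, 32, 27])
m5 = np.array([25, 33, 36, 31, 25, 37, 23, 34, 36, 30, 33, 32, 21,
               34, 38, 36, 33, 36, 37, 34, 28, 30, 25, 37, 36])
sex    = np.array([1,1,0,1,0,0,1,0,0,1,0,1,0,0,1,1,0,0,0,0,0,1,1,1,1])
nonman = np.array([0,0,1,1,0,1,1,1,1,0,1,1,0,0,1,1,0,0,1,1,1,1,0,1,1])

# Design matrix X with an intercept column, then beta = (X'X)^{-1} X'y.
X = np.column_stack([np.ones(25), m3, sex, nonman])
beta_hat = np.linalg.solve(X.T @ X, X.T @ m5)
print(beta_hat)   # should recover the estimates reported in Table 2.2
```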
From Table 2.2 it can be seen that on average, pupils from a non-manual
background do better than pupils from a manual background and that boys do
slightly better than girls. Considering John Smith again, his expected M5 mark
Table 2.2: Parameter estimates for model including Sex and Non-Manual
covariates for Redhill primary school.

Parameter   Variable     Estimate (SE)
β_0         Intercept    18.09 (7.73)
β_1         M3           0.43 (0.28)
β_2         Sex          0.15 (2.20)
β_3         Non-Manual   2.70 (2.20)
under this model is now 31.27 and so is closer to his actual mark. However, in
this model neither of the additional covariates has an effect that is significantly
different from zero, and so the standard procedure would be to remove them from
the model and revert to the simple linear regression model.
The purpose of reviewing normal linear models is to show how they are related
to hierarchical models. I will now expand the dataset to include all the pupils
sampled in four of the schools in the JSP.
2.3 Analysing data on the four schools in the
borough of Blackbridge
I am now going to consider a new educational situation involving the JSP dataset.
A family has moved into the borough of Blackbridge which has four primary
schools in its area and they want to choose a school for their children who have
sat M3 tests. The four schools I have selected for the fictional borough are
Bluebell (School 2), Redhill (School 5), Greenacres (School 9) and Greyfriars
(School 13). The four schools are summarised in Table 2.3.
Table 2.3: Summary of schools in the borough of Blackbridge.

School        Pupils    M3     M5     Male   NonMan
Name          Sampled   Mean   Mean   (%)    (%)
Bluebell      10        25.1   30.4   50%    30%
Redhill       25        27.9   32.0   48%    64%
Greenacres    21        31.3   31.6   57%    67%
Greyfriars    13        23.8   28.4   62%    0%
All schools   69        27.8   31.0   54%    48%
From the table we can see that Redhill school had the best M5 average
results. Bluebell school had the best average improvement in Maths from years
3 to 5. Greenacres school had the second highest M5 average but had far less
improvement than the other schools. Greyfriars although having the worst M5
average of the four schools had 100% of pupils from a manual background. If the
simple linear regression model regressing M5 on M3 is fitted to the four schools
separately, the resulting regression lines can be seen in Figure 2-1.
Figure 2-1: Plot of the regression lines for the four schools in the Borough ofBlackbridge.
From these regression lines it can be seen that if we were to choose the
best school for a particular child by using their M3 mark then no one school
dominates and our choice would depend on the M3 mark for the particular child.
In comparing the 4 regression lines to consider which school is best, the same
model has effectively been fitted four times, and the results for each school are
only influenced by the other schools through an informal comparison of the four graphs. It would be
better if the data from all four schools could be incorporated in one model. I will
now introduce some more models that attempt to do this.
2.3.1 ANOVA model
The Analysis of Variance (ANOVA) model consists of fitting factor variables to
a single response variable, for example, for the current dataset,

M5_ij = β_0 + SCHOOL_j + e_ij.

One way of fitting the above model is to constrain SCHOOL_4 to be zero to
avoid co-linearity amongst the predictor variables. The parameter estimates are
given in the second column of Table 2.4.
Table 2.4: Parameter estimates for ANOVA and ANCOVA models for the
borough of Blackbridge dataset.

Parameter   ANOVA Est. (SE)   ANCOVA Est. (SE)
β_0         28.38 (1.57)      12.92 (3.28)
β_1         —                 0.65 (0.13)
SCHOOL_1    2.02 (2.38)       1.20 (2.02)
SCHOOL_2    3.62 (1.94)       1.00 (1.72)
SCHOOL_3    3.19 (2.00)       −1.67 (1.94)
From the parameter estimates, expected values for pupils in all four schools
can be found and these are the actual school means. This shows that the ANOVA
model does not actually combine the data for the four schools but instead can be
used to compare the four schools to check if the schools are significantly different.
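To make the constraint concrete, here is a hypothetical sketch of the dummy-coded design matrix for this ANOVA model (pupil ordering and school labels follow Table 2.3, with SCHOOL_4 taken to be Greyfriars, which is consistent with the intercept of 28.38 in Table 2.4 being the Greyfriars mean):

```python
import numpy as np

# Hypothetical school labels for the 69 pupils, ordered as in Table 2.3:
# 1 = Bluebell, 2 = Redhill, 3 = Greenacres, 4 = Greyfriars.
school = np.repeat([1, 2, 3, 4], [10, 25, 21, 13])

# Dummy-coded ANOVA design matrix: an intercept plus indicators for
# schools 1-3 only; SCHOOL_4 is constrained to zero, so the intercept
# beta_0 is the Greyfriars mean and SCHOOL_j is the difference between
# school j and Greyfriars.
X = np.column_stack([np.ones(69)] + [(school == j).astype(float) for j in (1, 2, 3)])
print(X.shape)   # (69, 4)
```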
2.3.2 ANCOVA model
The Analysis of Covariance (ANCOVA) model is a combination of an ANOVA
model and a linear regression model. In its simplest form there are two predictors,
a regression variable and a factor variable, for example, for the current dataset
M5_ij = β_0 + β_1 M3_ij + SCHOOL_j + e_ij.

Again, to fit the model, co-linearity amongst the predictor variables needs to
be avoided and so SCHOOL_4 can be constrained to be zero. The parameter
estimates are given in the third column of Table 2.4. The ANCOVA model fits
parallel regression lines, one for each school, when regressing M5 on M3. Here
the data for one school is dependent on the other schools due to the common
slope parameter β_1. The intercepts for each school are independent given the
common slope, and are simply the least squares estimates for each school's data
assuming the given slope.
2.3.3 Combined regression
An easy way to combine the information from the four schools into one model is
to simply ignore the fact that the children attend different schools. This results
in the following regression model:

M5_i = β_0 + β_1 M3_i + e_i,

for all 69 pupils. Fitting this model gives parameter estimates β_0 = 11.82 and
β_1 = 0.52. While the ANOVA model considers the school means as a one level
dataset, here we consider all 69 pupils as a one level dataset, so neither of these
approaches exploits the structure of the problem, illustrated in Figure 2-2. An
alternative model that aims to �t this structure is illustrated in the next section.
2.4 Two level modelling
To adapt the problem to the structure illustrated in Figure 2-2, in a way that
permits generalising outward from the four schools, I must not only consider the
pupils from a school to be a random sample from that school, but also assume
the schools are a sample from a population of schools. In this way the ANOVA
model can be modified as follows:
Figure 2-2: Tree diagram for the Borough of Blackbridge.
M5_ij = β_0 + School_j + e_ij,
School_j ~ N(0, σ²_s),   e_ij ~ N(0, σ²_e).

Similarly the ANCOVA model becomes

M5_ij = β_0 + β_1 M3_ij + School_j + e_ij,
School_j ~ N(0, σ²_s),   e_ij ~ N(0, σ²_e).
In these models the predictor variables can be split into two parts, the fixed
part and the random part. In the above models the fixed part consists of the β's,
and the product of the β vector and the X matrix is known as the fixed predictor.
These models are known as variance components models because the variance of
the response variable, M5, about the fixed predictor is

var(M5_ij | β) = var(School_j + e_ij) = σ²_s + σ²_e,

the sum of the level 1 (pupil) and level 2 (school) variances. Both models
described above are variance components models but with different fixed
predictors.
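One way to see the two-part structure is to simulate from the variance components model; the sketch below uses illustrative parameter values (not estimates from the JSP data) and the Blackbridge group sizes from Table 2.3:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative values only.
beta0, sigma2_s, sigma2_e = 30.0, 5.0, 40.0
pupils_per_school = [10, 25, 21, 13]          # layout as in Table 2.3

# Level 2: one random effect per school.
school_effects = rng.normal(0.0, np.sqrt(sigma2_s), size=len(pupils_per_school))

# Level 1: pupil-level residuals around each school's level.
m5 = np.concatenate([
    beta0 + u + rng.normal(0.0, np.sqrt(sigma2_e), size=n)
    for u, n in zip(school_effects, pupils_per_school)
])
print(m5.shape)   # (69,) simulated responses
```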
Variance components models cannot be fitted by ordinary least squares as
no closed form least squares solutions exist; instead an alternative approach is
required. There are many such approaches and in this chapter I will consider
the techniques already used in the package MLn for fitting these models. These
techniques are based on iterative procedures to give estimates and are a type
of maximum likelihood estimation. The other multi-level modelling computer
packages use slightly different approaches. HLM (Bryk et al. 1988) uses empirical
Bayes techniques based on the EM algorithm (Dempster, Laird, and Rubin 1977)
while VARCL uses a fast Fisher scoring algorithm (Longford 1987). In the
next chapter I will introduce MCMC techniques that use simulation methods
to calculate estimates, and these methods will be used throughout this thesis.
2.4.1 Iterative generalised least squares
Iterative generalised least squares (IGLS, Goldstein (1986)) is an iterative
procedure based on generalised least squares estimation. Consider the variance
components model that corresponds to the ANCOVA model; here estimates need
to be found for both the fixed effects, β_0 and β_1, and the random parameters σ²_s
and σ²_e. If the values of the two variances are known then the variance matrix,
V, for the response variable can be calculated, and this leads to estimates for the
fixed effects,

β = (XᵀV⁻¹X)⁻¹ XᵀV⁻¹ Y,

where in this example Y_1 = M5_11 and X_1 = (1, M3_11), etc. Considering the earlier
dataset with 69 pupils in 4 schools, the variance matrix V will be 69 by 69
with a block diagonal structure as follows:
V_ij = σ²_s + σ²_e   if i = j,
       σ²_s          if i ≠ j but School[i] = School[j],
       0             otherwise.
From the estimates β, the raw residuals can then be calculated as follows:

r̃_ij = M5_ij − β_0 − β_1 M3_ij.

If the vector of these raw residuals, R̃, is formed into the cross product matrix
R⁺ = R̃ R̃ᵀ, then this matrix has expected value V, the variance matrix defined
above. This can be used to create a linear model with predictors σ²_s and σ²_e as
follows:

R⁺_ij = A_ij σ²_s + B_ij σ²_e + ε_ij,

where A_ij and B_ij take values of 0 or 1, A_ij = 1 when School_i = School_j and
B_ij = 1 if i = j. Using the block diagonal structure of V, vec(R⁺) can be
constructed so that it only contains the elements of R⁺ that do not have expected
value zero. This means in the example with 4 schools, the vector will be of length
10² + 25² + 21² + 13² = 1335, and just these terms need be included in the linear
model above as opposed to all 69² = 4761 terms.
After applying regression to this model, new estimates are obtained for the two
variance parameters σ²_s and σ²_e. These new estimates can now be used instead
of our initial estimates and the whole procedure repeated until convergence is
obtained.
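The whole cycle can be sketched in a few lines of Python; this is a rough illustration of the procedure described above, not the MLn implementation, and the weighting used for the variance regression in the real algorithm is simplified here to ordinary least squares. Here `y` is the response vector, `X` the fixed-part design matrix and `school` an integer array giving each pupil's school.

```python
import numpy as np

def igls(y, X, school, tol=1e-6, max_iter=100):
    """Sketch of the IGLS cycle for a 2-level variance components model."""
    n = len(y)
    A = (school[:, None] == school[None, :]).astype(float)   # A_ij: same school
    B = np.eye(n)                                            # B_ij: same pupil

    # Starting values: ordinary least squares for the fixed effects.
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    sigma2_s, sigma2_e = 1.0, float(np.var(y - X @ beta))

    for _ in range(max_iter):
        old = np.concatenate([beta, [sigma2_s, sigma2_e]])

        # Step 1: given the variances, build V and update beta by GLS.
        V = sigma2_s * A + sigma2_e * B
        Vinv = np.linalg.inv(V)
        beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

        # Step 2: regress the raw residual cross-products on A and B to
        # update the variances, keeping only terms with non-zero expectation.
        r = y - X @ beta
        R = np.outer(r, r)
        keep = A > 0
        Z = np.column_stack([A[keep], B[keep]])
        sigma2_s, sigma2_e = np.linalg.lstsq(Z, R[keep], rcond=None)[0]

        # Convergence: every parameter change within the tolerance.
        new = np.concatenate([beta, [sigma2_s, sigma2_e]])
        if np.max(np.abs(new - old)) < tol:
            break
    return beta, sigma2_s, sigma2_e
```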
Convergence
Determining when the method has converged relies on setting a tolerance level.
Estimates from consecutive iterations are compared and if the difference between
the two estimates is within the tolerance boundaries then convergence of that
parameter is said to have been achieved. The method finishes at the first iteration
at which all parameters have converged. To start, the IGLS procedure requires
initial values, which are normally obtained by finding the ordinary least squares
estimates of the fixed effects. The IGLS method is explained more fully in
Goldstein (1995).
2.4.2 Restricted iterative generalised least squares
The IGLS procedure can produce biased estimates of the random parameters.
This is due to the sampling variation of the fixed parameters which is not
accounted for in this method. A modification to IGLS known as restricted
iterative generalised least squares (RIGLS, see Goldstein (1989)) can be used
to produce unbiased estimates. This modification works by incorporating a bias
correction term into the equation when updating the variance parameters. The
RIGLS method then gives estimates that are equivalent to restricted maximum
likelihood (REML) estimates.
2.4.3 Fitting variance components models to the Blackbridge dataset
The IGLS and RIGLS methods were used to fit the two variance components
models described earlier and the results can be found in Table 2.5. Two important
points can be drawn from this table. Firstly, a deficiency in the IGLS method can
be seen for model 1: the school level variance parameter, σ²_s, has been estimated as
negative and has consequently been set to zero. This odd behaviour occurs when
a variance parameter is very small compared to its standard error and sometimes
in the iterative procedure the maximum likelihood estimate becomes negative.
In this example the behaviour is mainly due to trying to fit a two level model to
a dataset with only four schools.
Table 2.5: Parameter estimates for two variance components models using both
IGLS and RIGLS for Borough of Blackbridge dataset.

Parameter   IGLS           RIGLS          IGLS           RIGLS
Name        Model 1        Model 1        Model 2        Model 2
β_0         30.96 (0.68)   30.87 (0.78)   15.05 (3.02)   14.56 (3.15)
β_1         —              —              0.57 (0.11)    0.59 (0.11)
σ²_s        0.00 (0.00)    0.54 (1.70)    0.03 (0.92)    0.54 (1.33)
σ²_e        32.04 (5.46)   32.12 (5.62)   22.58 (3.95)   22.90 (4.01)
Secondly for both variance components models the variance at school level is
very small, and not significantly different from zero. This means that it may be
better, in this example, with only four schools to stick to single level modelling
and use the combined regression model.
2.4.4 Fitting variance components models to the JSP
dataset
The same two variance components models as considered in the previous section
were fitted to the whole JSP dataset and the results can be seen in Table 2.6.
When all 48 schools are considered, the school level variance, σ²_s, is significant
and so it makes sense to use a two level model. Table 2.7 shows the different
estimates obtained for the four Blackbridge schools using ANOVA and RIGLS.
As described earlier the ANOVA estimates are simply the school means, and
the RIGLS estimates are shrinkage estimates. This means that the RIGLS
estimates are a weighted combination of the mean of all 887 results, 30.57, and
the individual school means.
Table 2.6: Parameter estimates for two variance components models using both
IGLS and RIGLS for all schools in the JSP dataset.

Parameter   IGLS           RIGLS          IGLS           RIGLS
Name        Model 1        Model 1        Model 2        Model 2
β_0         30.60 (0.40)   30.60 (0.40)   15.15 (0.90)   15.14 (0.90)
β_1         —              —              0.61 (0.03)    0.61 (0.03)
σ²_s        5.16 (1.55)    5.32 (1.59)    4.03 (1.18)    4.16 (1.21)
σ²_e        39.28 (1.92)   39.28 (1.92)   28.13 (1.37)   28.16 (1.37)
Table 2.7: Comparison between fitted values using the ANOVA model and the
variance components model 1 using RIGLS.

School       ANOVA   RIGLS
Bluebell     30.40   30.49
Redhill      32.00   31.68
Greenacres   31.57   31.32
Greyfriars   28.38   29.19
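The shrinkage in Table 2.7 can be reproduced with the usual weighted-combination formula for a variance components model (the formula itself is not written out in the text; it is shown here, with the RIGLS Model 1 estimates from Table 2.6, purely as a check):

```python
# Shrunken school mean = beta0 + w_j * (school mean - beta0), where
# w_j = n_j * sigma2_s / (n_j * sigma2_s + sigma2_e).
beta0, sigma2_s, sigma2_e = 30.60, 5.32, 39.28   # RIGLS Model 1, Table 2.6

schools = {"Bluebell": (10, 30.40), "Redhill": (25, 32.00),
           "Greenacres": (21, 31.57), "Greyfriars": (13, 28.38)}
for name, (n, mean) in schools.items():
    w = n * sigma2_s / (n * sigma2_s + sigma2_e)
    print(name, round(beta0 + w * (mean - beta0), 2))
# Prints 30.48, 31.68, 31.32 and 29.18 -- matching Table 2.7 to rounding.
```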
There are analogous results for the ANCOVA model and the second variance
components model. Here the ANCOVA model fits the least squares regression
intercepts given the fixed slope, and the RIGLS method will shrink these parallel
lines towards a global average line, given the fixed slope.
2.4.5 Random slopes model
The two level models considered so far have all been variance components models,
that is the random variance structure at each level is simply a constant variance.
Earlier we considered fitting separate regression lines to each school and this
model can also be expanded into a two level model known as a random slopes
regression,
M5_ij = β_0 + β_1 M3_ij + u_0j + u_1j M3_ij + e_ij,

u_j = (u_0j, u_1j)ᵀ ~ MVN(0, Ω_s),   e_ij ~ N(0, σ²_e).
In this notation, the βs are fixed effects and represent an average regression
for all schools. The u_js are the school level residuals and Ω_s is the school level
variance, which is now a matrix as the slopes also vary at the school level.
As the earlier variance components models suggested that the dataset with four
schools should not be treated as a two level model, we will only consider fitting
the random slopes model to the full JSP dataset with 48 schools, as shown in
Table 2.8.
Table 2.8: Parameter estimates for random slopes model using both IGLS and
RIGLS for all schools in the JSP dataset.

Parameter   IGLS            RIGLS
β_0         15.04 (1.32)    15.03 (1.34)
β_1         0.612 (0.04)    0.613 (0.04)
Ω_s00       44.99 (16.37)   47.04 (16.75)
Ω_s01       −1.23 (0.52)    −1.30 (0.53)
Ω_s11       0.034 (0.017)   0.036 (0.017)
σ²_e        26.97 (1.34)    26.96 (1.34)
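As with the variance components model, simulating from the random slopes model makes the role of Ω_s concrete; a minimal sketch, using the RIGLS point estimates from Table 2.8 only as illustrative values and a made-up set of M3 scores for a single hypothetical school:

```python
import numpy as np

rng = np.random.default_rng(2)

# RIGLS estimates from Table 2.8, used here only as illustrative values.
beta = np.array([15.03, 0.613])
omega_s = np.array([[47.04, -1.30],
                    [-1.30,  0.036]])
sigma2_e = 26.96

m3 = rng.integers(10, 41, size=30)                        # hypothetical M3 scores
u0, u1 = rng.multivariate_normal([0.0, 0.0], omega_s)     # school's random intercept and slope
m5 = (beta[0] + u0) + (beta[1] + u1) * m3 + rng.normal(0, np.sqrt(sigma2_e), size=30)
```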
Table 2.9 gives as a comparison the regression lines fitted using separate
regressions and the random slopes two level model, for the four schools in
Blackbridge. Here shrinkage can be seen towards the average line
M5 = 15.03 + 0.613 M3.
Table 2.9: Comparison between fitted regression lines produced by separate
regressions and the random slopes model.

School       Separate Regression   Random Slopes Regression
Bluebell     22.44 + 0.317 M3      17.16 + 0.548 M3
Redhill      15.96 + 0.575 M3      15.06 + 0.610 M3
Greenacres   −7.40 + 1.244 M3      5.28 + 0.868 M3
Greyfriars   12.52 + 0.665 M3      12.56 + 0.678 M3
Both the variance components model and the random slopes regression will
be discussed in greater detail in Chapter 4.
2.5 Fitting models to pass/fail data
Often when considering exam datasets the main objective is to classify students
into grade bands, or at the simplest level, to pass or fail pupils. In this thesis I
will only be studying the simpler case where the response variable can be thought
of as a 0 or a 1, depending on whether the event of interest occurs or not. The
interest now lies in estimating the probability of the response variable being a 1
rather than the value of the response variable, as this is constrained to be either
0 or 1.
The normal linear models used in the earlier sections can still be used when
the response is binary, but they assume that the response variable can take any
real value, which can lead to probability estimates that lie outside [0, 1]. The
more common approach to fitting binary data is to assume that the response
variable has a binomial distribution. The resulting model can then be fitted
by the technique of generalized linear models, also described in McCullagh and
Nelder (1983). A link function g(·) is required to transform the response variable
from the [0, 1] scale to the whole real line. I will only consider the logistic link,
log(p_ij/(1 − p_ij)), as it is the most frequently used link function for binary data,
although there are other alternatives. The model then becomes g(p) = Xβ,
where Xβ is the linear predictor. When a binary response model is fitted using
the logistic link, the technique is known as logistic regression.
Although the response variable is a binary outcome the predictor variables
can still be continuous variables. For example if a student is due to take an exam
that will be externally examined the result he/she obtains will generally be a
grade and the exact mark will not be given. However he/she will typically be
given a mock exam at some time before the exam and the exact mark of this
mock will be known. I will try to mimic this scenario by using the 4 schools
considered in the earlier examples and assuming that the predictor M3 score is
known. However, the response of interest, M5, will now be converted to a pass/fail
response, Mp5, depending on whether the student gets at least 30 out of 40 in
the test. Of the 69 students in the 4 schools, 46 got at least 30 on the year 5
test, so two-thirds of the students actually pass.
Interest will now lie in whether the school of each pupil has an effect and
whether the `mock' exam mark, M3, is a good indicator of whether a student will
pass or fail. I will consider the following two models that are analogous to the
ANOVA and ANCOVA models for continuous responses.
Mp5_ij ~ Bernoulli(p_ij)
log( p_ij / (1 − p_ij) ) = β₀ + SCHOOL_j                     (2.1)
log( p_ij / (1 − p_ij) ) = β₀ + β₁ M3_ij + SCHOOL_j          (2.2)
The estimates obtained by fitting these models are given in Table 2.10. The
parameter estimates obtained can be transformed back to give estimates for the
p_ij:

p_ij = exp(X_ij β) / (1 + exp(X_ij β)).
Looking more closely at model 2.1 and using the transform, the estimated
probability of passing for a pupil in school j is equal to the proportion of pupils
passing in school j. If we consider once again John Smith, who got 30 in his year
2 maths exam, model 2.2 implies that a pupil from Redhill school with 30
in the M3 exam has an estimated 85.8% chance of passing. In fact, John Smith
just scraped a pass, scoring 30 in year 5.
Table 2.10: Parameter estimates for the two logistic regression models fitted to the Blackbridge dataset.

Parameter   Model 2.1   Model 2.2
β₀          -0.1541     -5.3627
β₁          -           0.2176
SCHOOL1     0.1541      -0.1727
SCHOOL2     1.3068      0.6306
SCHOOL3     1.3173      -0.1647
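As a quick check of this back-transformation, the short Python sketch below (hypothetical variable names; it assumes that SCHOOL2 in Table 2.10 is the Redhill contrast) reproduces the 85.8% figure quoted above from the model 2.2 estimates.

import math

def inv_logit(x):
    # back-transform from the logit scale to a probability
    return math.exp(x) / (1.0 + math.exp(x))

# Model 2.2 estimates from Table 2.10 (assumption: SCHOOL2 is the Redhill contrast)
beta0, beta1, redhill = -5.3627, 0.2176, 0.6306

eta = beta0 + beta1 * 30 + redhill      # linear predictor for a Redhill pupil with M3 = 30
print(round(inv_logit(eta), 3))         # ~0.858, i.e. an estimated 85.8% chance of passing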
2.5.1 Extending to multi-level modelling
The above models that have been fitted using logistic regression techniques are
single level models, but in an analogous way to the earlier Gaussian models these
models can be extended to two or more levels. As with the Gaussian models
there are iterative procedures to fit the multi-level logistic models, and MLn uses
two such methods.
Both techniques are quasi-likelihood based methods which can be used on all
non-linear multi-level models. They use a Taylor series expansion to linearize
the model. The first technique is Marginal Quasi-likelihood (MQL) proposed by
Goldstein (1991). MQL tends to underestimate fixed and random parameters,
particularly with small datasets. The second technique is Predictive or Penalised
Quasi-likelihood (PQL) proposed by Laird (1978) and also Stiratelli, Laird, and
Ware (1984). PQL is more accurate than MQL but is not guaranteed to converge.
The distinction between the methods is that when forming the Taylor expansion,
in order to linearize the model, the higher level residuals are added to the linear
component of the nonlinear function in PQL, and not in MQL. The order of the
estimation procedure is the order of the Taylor series expansion.
I will now consider fitting several two level logistic regression models to the
JSP dataset with 48 schools, with the response variable being a pass/fail indicator
for year 5 scores as before. The models are described below and the parameter
estimates obtained are given in Table 2.11.
Mp5_ij ~ Bernoulli(p_ij)
log( p_ij / (1 − p_ij) ) = β₀ + SCHOOL_j                     (2.3)
SCHOOL_j ~ N(0, σ²_s)

Mp5_ij ~ Bernoulli(p_ij)
log( p_ij / (1 − p_ij) ) = β₀ + β₁ M3_ij + SCHOOL_j          (2.4)
SCHOOL_j ~ N(0, σ²_s)
Table 2.11: Parameter estimates for the two-level logistic regression models fitted to the JSP dataset.

Parameter   Model 2.3 (MQL 1)   Model 2.3 (PQL 2)   Model 2.4 (MQL 1)   Model 2.4 (PQL 2)
β₀          0.680 (0.120)       0.769 (0.136)       -3.703 (0.420)      -4.262 (0.464)
β₁          -                   -                   0.177 (0.016)       0.205 (0.018)
σ²_s        0.404 (0.138)       0.570 (0.179)       0.633 (0.198)       0.874 (0.258)
From Table 2.11 we can see that the estimates using PQL are larger for both
the random and fixed parameters. There is a significant positive effect of M3
score on the probability of passing, and there is significant variability between the
different schools using both MQL and PQL.
2.6 Summary
The multi-level logistic regression model ends our discussion of models that can
be fitted to the JSP dataset. I will return to problems involving binary data in
Chapter 6, where the MQL and PQL methods described here will be compared
to MCMC alternatives.
In this chapter I have highlighted some examples of simple models that arise
when using data from education. I have then shown how the existing methods
in the MLn package can fit such models. In the next chapter I will introduce
MCMC methods, which will then be used to fit the types of models introduced
here in later chapters.
Chapter 3
Markov Chain Monte Carlo
Methods
In the previous chapter I concentrated on maximum likelihood based techniques
for fitting multi-level models. The field of Bayesian statistics has grown
in importance as computer power has increased, and techniques that would
previously have been impossible to implement can now be performed efficiently.
MCMC methods are one group of Bayesian methods that can be used to fit
multi-level models. In this chapter I will describe the various MCMC methods in
common usage today and how they work. I will explain how to use the chains that
the methods produce to get answers to problems and how to tell when a chain
has reached its equilibrium distribution. I will then illustrate all these points in
a simple example. I will begin by explaining why such methods are used.
3.1 Background
Consider a sequence of n observations, y_i, that have been generated from a normal
distribution with unknown mean and variance, so that y_i ~ N(μ, σ²). Then μ
and σ² have standard (unbiased) estimates given the observations y_i:

μ̂ = ȳ = (1/n) Σ_{i=1}^{n} y_i   and   σ̂² = (1/(n−1)) Σ_{i=1}^{n} (y_i − ȳ)².
Consider instead the situation in reverse: if μ and σ² were known, I could
generate a sample of observations from the normal distribution by simulation,
see for example Box and Muller (1958). Then if I drew a large enough sample,
the mean and variance of the sample should be approximately equal to the mean
and variance of the underlying distribution.
Similarly consider a gamma distribution with parameters α and β. If, after
finding a suitable simulation algorithm (see Ripley (1987)), a large sample from
the gamma distribution is drawn, then it can be verified that the mean of the
distribution is α/β and the variance is α/β². Both of these examples are trivial, but
given any parameter of interest, if I can simulate from its distribution for long
enough I can calculate estimates for the parameter and any functionals of the
parameter.
Multi-level models are much more complicated than these two examples and
it is rare that samples from a parameter's distribution can be obtained directly
from simulation. The difference between multi-level models and our simple
examples is that the parameter of interest will depend on several other parameters
with unknown values. Bayesian estimation methods involve integrating over
these other parameters, but this becomes infeasible as the model becomes more
complicated. The methods detailed in the last chapter involve the use of an
iterative procedure that leads to an approximation to the actual parameter value.
The methods in this chapter involve the generation, via simulation, of
Markov chains that will, given time, converge to the posterior distribution of the
parameter of interest. Before going on to describe the various MCMC techniques
I first need to cover some of the basic ideas of Bayesian inference.
3.2 Bayesian inference
In frequentist inference the data, whose distribution across hypothetical repetitions
of data-gathering is assumed to depend on a parameter vector θ, are
regarded as random with θ fixed. In Bayesian inference the data are regarded
as fixed (at their observed values) and θ is treated as random as a means
of quantifying uncertainty about it. In this formulation θ possesses a prior
distribution and a posterior distribution linked by Bayes' theorem. The posterior
distribution of θ is defined from Bayes' theorem as

p(θ | data) ∝ p(data | θ) p(θ).

Here p(θ) is the prior distribution for the parameter vector θ and should
represent all knowledge we have about θ prior to obtaining the data. Prior
distributions can be split into two types: informative prior distributions, which
contain information that has been obtained from previous experiments, and "non-informative"
or diffuse priors, which aim to express that we have little or no prior
knowledge about the parameter. In frequentist inference prior distributions are
not used in fitting models, and so "non-informative" priors are widely used in
Bayesian inference to allow comparison with the frequentist procedures.
The posterior distribution for θ is therefore proportional to the likelihood
p(data | θ) multiplied by the prior distribution. The proportionality constant
is such that the posterior distribution is a valid probability distribution. One
principal difficulty in Bayesian inference problems is calculating the proportionality
constant, as this involves integration and does not always produce a
posterior distribution that can be written in closed form. Also, finding marginal
posterior and predictive distributions involves (high-dimensional) integrations,
which MCMC methods can instead perform by simulation.
How do MCMC methods fit in?
MCMC methods generate samples from Markov chains which converge to
the posterior distribution of θ, without having to calculate the constant of
proportionality. From these samples, summary statistics of the posterior
distribution can be calculated.
3.3 Metropolis sampling
The Metropolis algorithm was first described in Metropolis et al. (1953) in the
field of statistical mechanics. The idea is to generate values of θ, the parameter
of interest, from a proposal distribution and correct these values so that the draws
are actually simulating from the posterior distribution p(θ | data). The proposal
distribution is generally dependent on the last value of θ drawn but independent
of all other previous values of θ, to obey the Markov property. The method
works by generating new values at each time step from the current proposal
distribution but only accepting the values if they pass a criterion. In this way
the estimates of θ are improved at each time step and the Markov chain reaches
its equilibrium or stationary distribution, which is the posterior distribution of
interest by construction.
The Metropolis algorithm for an unknown parameter θ is as follows:

• Select a starting value for θ which is feasible.

• For each time step t, sample a point θ* from the current proposal distribution
p_t(θ* | θ_{t−1}). The proposal distribution must be symmetric in θ* and θ_{t−1},
that is, p_t(θ* | θ_{t−1}) = p_t(θ_{t−1} | θ*) for all t.

• Let r_t = p(θ* | y) / p(θ_{t−1} | y) be the posterior ratio and a_t = min(1, r_t) be the
acceptance probability.

• Accept the new value θ_t = θ* with probability a_t; otherwise let θ_t = θ_{t−1}.

In multi-level models there are many parameters of interest and the above
algorithm can be used in several ways. Firstly, θ could be considered as a vector
containing all the parameters of interest and a multivariate proposal distribution
could be used. Secondly, the above algorithm could be used separately for each
unknown parameter θ_i. If this is done, it is generally done sequentially: that
is, at step t generate a new θ_{1,t}, then a new θ_{2,t}, and so on until all parameters
have been updated, then continue with step t+1. Thirdly, a combination method
could be used where parameters are updated in suitable blocks, some multivariately and some
univariately.
3.3.1 Proposal distributions
To perform Metropolis sampling, symmetric proposal distributions must be chosen
for all parameters. There are at least two distinct types of proposal distribution
that can be used.
Firstly, the simplest type of proposal is the independence proposal. This
proposal generates a new value from the same proposal distribution regardless of
the current value. If a parameter is restricted to a range of values, for example a
correlation parameter must lie in the range [−1, 1], then an independence proposal
could consist of generating a new value from a uniform [−1, 1] distribution.
Independence proposals are somewhat limited in that if parameters are defined
on the whole real line, then it is difficult, if not impossible, to find a proposal distribution that will
sample from the whole range of the parameter in an effective manner.
A second type of proposal that is popular is the random-walk proposal.
Here p_t(θ | θ_{t−1}) = p_t(θ − θ_{t−1}), so the proposal is centred at the current
value of the parameter, θ_{t−1}. Common examples are uniform and normal
distributions centred at the current parameter value.
Both these proposal distributions will then have a free parameter: in the
case of the uniform, the width of the interval, and in the normal, the variance of
the distribution. The values given to these parameters will affect how well the
simulation performs. If the variance parameter is too small then the sampler will
end up making lots of little jumps and will take a long time to reach all parts of the
sample space. If the variance is too big there will be a lower acceptance rate and
the sampler will end up staying at particular parameter values for long periods,
and again the chain will take a long time to give good estimates. Convergence
rates will be dealt with later in this chapter. The common method used for
selecting parameter values for proposal distributions is to try several values for
the variance until the chain "mixes well". Adaptive methods which modify the
proposal distribution to improve convergence are also discussed in later chapters.
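To make the tuning of the proposal variance concrete, here is a minimal random-walk Metropolis sketch in Python for a single parameter; log_post is a hypothetical user-supplied log-posterior function and sigma_p is the free proposal standard deviation discussed above.

import numpy as np

def metropolis(log_post, theta0, sigma_p, n_iter, seed=0):
    # Random-walk Metropolis with a symmetric N(theta_{t-1}, sigma_p^2) proposal
    rng = np.random.default_rng(seed)
    theta, chain, accepted = theta0, np.empty(n_iter), 0
    for t in range(n_iter):
        prop = rng.normal(theta, sigma_p)           # propose a move
        log_r = log_post(prop) - log_post(theta)    # log posterior ratio
        if np.log(rng.uniform()) < log_r:           # accept with probability min(1, r)
            theta, accepted = prop, accepted + 1
        chain[t] = theta
    return chain, accepted / n_iter

# Example: a standard normal target; the acceptance rate depends strongly on sigma_p
chain, acc_rate = metropolis(lambda x: -0.5 * x**2, 0.0, 2.4, 5000)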
3.4 Metropolis-Hastings sampling
Hastings (1970) generalized the Metropolis algorithm to allow proposal distributions
that are not symmetric. To correct for this, the ratio of the posterior
probabilities r_t is now replaced by a ratio of importance ratios:

r_t = [ p(θ* | y) / p_t(θ* | θ_{t−1}) ] / [ p(θ_{t−1} | y) / p_t(θ_{t−1} | θ*) ].

The Metropolis algorithm is just a special case of this algorithm where
p_t(θ* | θ_{t−1}) = p_t(θ_{t−1} | θ*) and these terms cancel out.
In later chapters it will be seen that it is not always easy to find a symmetric
proposal distribution for parameters with restricted ranges, for example variances.
Also, asymmetric proposal distributions may sometimes assist in increasing the
rate of convergence. As with the Metropolis algorithm, the Metropolis-Hastings
algorithm has proposal distributions with parameters that can be modified to
speed up convergence.
3.5 Gibbs sampling
The Gibbs sampler is a special case of the Metropolis-Hastings algorithm. Geman
and Geman (1984) used an approach in their work on image analysis based on
the Gibbs distribution; they consequently named this method Gibbs sampling.
Gelfand and Smith (1990) applied the Gibbs sampler to several statistical
problems, bringing it to the attention of the statistical community. The Gibbs
sampler is best applied on problems where the marginal distributions of the
parameters of interest are difficult to calculate, but the conditional distributions
of each parameter given all the other parameters and the data have nice forms.
For example, suppose the marginal posterior p(θ | y) cannot be obtained from
the joint posterior p(θ, z | y) analytically, but that the conditional posteriors
p(θ | y, z) and p(z | y, θ) have forms that are known and are easy to sample
from, for example normal or gamma distributions. Gibbs sampling can then be
used to sample indirectly from the marginal posterior.
The Gibbs sampler works on the above problem as follows: firstly choose a
starting value for z, say z⁽⁰⁾, and then generate via random sampling a single
value θ⁽¹⁾ from the conditional distribution p(θ | y, z = z⁽⁰⁾). Next generate z⁽¹⁾
from the conditional distribution p(z | y, θ = θ⁽¹⁾). Then cycle through
the algorithm, generating θ⁽²⁾ and z⁽²⁾ and so on.
If the conditional distributions of the parameters have standard forms, then
they can be simulated from easily. If this is not the case and a conditional
distribution does not have a standard form, then a different method must be
used. Two such methods will now be described.
3.5.1 Rejection sampling
Rejection sampling is described in Ripley (1987). It is used when the distribution
of interest f(x) cannot be easily sampled from but there exists a distribution
g(x) such that f(x) < M g(x) for all x, where M is a positive number, and g(x) can be
sampled from without difficulty. M g(x) can be thought of as an envelope function
that completely bounds the required distribution from above. To generate a value
from f(x) use the following algorithm.

Repeat
    Generate Y from g,
    Generate U from U(0,1),
until U < f(Y) / (M g(Y)).
Return X = Y.

The efficiency of this method depends on the enveloping function g(x). There
are two major aims: firstly to find a function that satisfies f(x) < M g(x) for all x, and
secondly to find a g(x) similar enough to f(x) to give a high acceptance rate.
The second method, which will now be discussed briefly, tries to automatically
satisfy both these aims.
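A minimal Python sketch of this algorithm, assuming densities f and g that can be evaluated pointwise, a way of sampling from g, and a known bound M with f(x) < M g(x) for all x; the half-normal/exponential pair used as an example is my own illustration rather than one from the text.

import numpy as np

rng = np.random.default_rng(1)

def rejection_sample(f, g_pdf, g_draw, M):
    # Repeat until a candidate from the envelope M * g is accepted
    while True:
        y = g_draw()
        if rng.uniform() < f(y) / (M * g_pdf(y)):
            return y

# Example: half-normal target under a unit-rate exponential envelope;
# sup_x f(x)/g(x) = sqrt(2/pi) * exp(1/2), so this M bounds the ratio everywhere.
f = lambda x: np.sqrt(2 / np.pi) * np.exp(-0.5 * x**2)
g_pdf = lambda x: np.exp(-x)
M = np.sqrt(2 / np.pi) * np.exp(0.5)
draws = [rejection_sample(f, g_pdf, lambda: rng.exponential(), M) for _ in range(1000)]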
3.5.2 Adaptive rejection sampling
Adaptive rejection sampling (Gilks and Wild 1992) works when the conditional
distribution of interest is log concave. It starts by considering a small number of
points from the distribution of interest f and evaluating the tangents to log f at
these points. Joining up these tangents constructs an envelope function
g for f. Then proceed as in rejection sampling, except that when a point x_g is
chosen from g, as well as evaluating f(x_g), also evaluate the tangent to log f at x_g
and modify the envelope accordingly. Then as more points are sampled, g(x)
becomes more and more like f(x); hence the rejection sampling is adaptive.
3.5.3 Gibbs sampler as a special case of the Metropolis-Hastings algorithm

Looking at the Gibbs sampling algorithm as written above, it is not immediately
obvious that the Gibbs sampler is a special case of the Metropolis-Hastings
algorithm. Consider the proposal distribution for a particular element of θ, θ_(i),
defined as

p_t(θ* | θ_{t−1}) = p(θ*_(i) | θ_{t−1,(−i)}, y)   if θ*_(−i) = θ_{t−1,(−i)},
p_t(θ* | θ_{t−1}) = 0                              otherwise,

where θ_(−i) is the vector θ with element i removed. In other words, the only
possible proposals involve holding all components of θ constant except the ith.
Then the ratio r_t of importance ratios is

r_t = [ p(θ* | y) / p_t(θ* | θ_{t−1}) ] / [ p(θ_{t−1} | y) / p_t(θ_{t−1} | θ*) ]
    = [ p(θ* | y) / p(θ*_(i) | θ_{t−1,(−i)}, y) ] / [ p(θ_{t−1} | y) / p(θ_{t−1,(i)} | θ_{t−1,(−i)}, y) ]
    = p(θ_{t−1,(−i)} | y) / p(θ_{t−1,(−i)} | y) = 1,

and all proposals are accepted. This is the same Gibbs sampler algorithm as
described earlier, but this time written in the Metropolis-Hastings format.
3.6 Data summaries
Once a Markov chain has been run the outputs produced are sequences of values,
one sequence for each parameter, assumed to be from the desired joint posterior
distribution. Each individual sequence can be thought of as a sample from the
marginal distribution of the individual parameter. From each sequence of values
we hope to describe the parameter it represents via summary statistics.
In the last chapter I reviewed the IGLS and RIGLS methods for multi-level
modelling. For each parameter of interest θ these methods calculate
a maximum likelihood based estimate θ̂ for θ and a standard error for this
estimate. If confidence intervals are required then, if you are prepared to assume
the parameter is normally distributed, you can generate central 95% confidence
intervals for θ as (θ̂ − 1.96 SE(θ̂), θ̂ + 1.96 SE(θ̂)).
The same summary statistics can also be calculated from Markov chains, and
the chains can produce other summaries for the parameter as well. We will now describe how to
calculate the various summary statistics from Markov chains.
3.6.1 Measures of location
There are three main estimates that can be used for the parameter of interest, θ.

1. Sample mean

If I consider the chain values as a sample from the posterior distribution of θ,
then I can calculate their mean in the usual way:

θ̄ = (1/N) Σ_{i=1}^{N} θ_i.

2. Sample median

The median can be found by finding the (N+1)/2 th sorted chain value. Computationally
it is quicker to calculate the median via a `binary chop' algorithm rather than
actually sorting the chain. The `binary chop' algorithm consists of taking the first
(unsorted) chain value and dividing the other values into two groups depending
on whether they are larger or smaller than this first value. Then, depending on
the number of values bigger than the first value, the median will be in one of
the two groups. Discard the group that does not contain the median and repeat
the procedure recursively on the other group until the median is found. On average
this requires of order N comparisons, as opposed to the order N² comparisons
that a simple sort would need.
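The `binary chop' idea is essentially the quickselect algorithm; a small sketch in Python, assuming the chain is held in a list (in practice a library median routine would normally be used).

def kth_smallest(values, k):
    # Partition around the first value and recurse into the group containing the answer
    pivot, rest = values[0], values[1:]
    smaller = [v for v in rest if v < pivot]
    larger = [v for v in rest if v >= pivot]
    if k < len(smaller):
        return kth_smallest(smaller, k)
    if k == len(smaller):
        return pivot
    return kth_smallest(larger, k - len(smaller) - 1)

def chain_median(chain):
    n = len(chain)
    if n % 2:
        return kth_smallest(chain, n // 2)
    return 0.5 * (kth_smallest(chain, n // 2 - 1) + kth_smallest(chain, n // 2))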
3. Sample mode
This statistic is equivalent to the estimate given by the IGLS and RIGLS methods
when the prior distribution is flat, and is also known as the maximum likelihood
estimate (MLE) in that case. It is not calculated directly from the Markov chain
but is instead calculated from the kernel density plot described later.
3.6.2 Measures of spread
There are two main groups of summary statistics for the spread of a set of data.
1. Variance and standard deviation.

The variance and the standard deviation of the data are both summary statistics
associated with the mean. In a similar way to the mean formula, considering the
chain values as a sample from the posterior distribution of θ, the variance
has the usual formula

var(θ) = (1/(N−1)) [ Σ_{i=1}^{N} θ_i² − (Σ_{i=1}^{N} θ_i)² / N ].

The standard deviation is the square root of the variance.

2. Quantile based estimates.

There are several measures of spread that are calculated from the quantiles of a
distribution. In Bayesian statistics confidence intervals are replaced by credible
intervals, which have a different interpretation to the frequentist confidence
interval. A frequentist 100(1 − α)% confidence interval for θ is defined as an
interval calculated from the data such that 100(1 − α)% of such intervals contain θ.
In Bayesian statistics the data are thought of as fixed and the parameter θ as variable,
and so a 100(1 − α)% credible interval C is such that ∫_C p(θ | data) dθ = 1 − α
(Bernardo and Smith 1994). The quantiles are used to produce credible intervals;
for example a 95% central Bayesian credible interval is (Q_0.025, Q_0.975), where Q_i
is the ith quantile. The interquartile range, Q_0.75 − Q_0.25, can also be calculated
from the quantiles and is an alternative measure of spread. The `binary chop'
algorithm used for the median can also be used to calculate the quantiles rather
than sorting.
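From a stored chain these quantile-based summaries are immediate to compute; a minimal sketch, with a simulated chain standing in for real MCMC output.

import numpy as np

chain = np.random.default_rng(2).normal(size=5000)      # stand-in for a stored MCMC chain

q025, q25, q75, q975 = np.percentile(chain, [2.5, 25, 75, 97.5])
print("95% central credible interval:", (q025, q975))
print("interquartile range:", q75 - q25)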
3.6.3 Plots
Given that the sequence of values obtained for a parameter θ can be thought of
as a sample of n points from the marginal posterior distribution of θ, I can
use plots to show the shape of this distribution.
The simplest density plot is the histogram. Given the range of the parameter
values in the sequence, the range can be split into M contiguous intervals, not
necessarily of the same length, which are commonly known as bins. The numbers
of values that fall in each bin are then counted and the histogram estimate at a
point θ is defined by

p̂(θ) = (number of θ_i in the same bin as θ) / (n × width of the bin containing θ).

An example of a histogram for the parameter μ₁ from the example later in this
chapter can be seen in Figure 3-1. Histograms give a rather `blocky' approximation
to the posterior distribution of interest. The approximation is improved, up to
a point, by increasing the number of bins M, but this also depends on the number
of points n being large.

Figure 3-1: Histogram of μ₁ using the Gibbs sampling method.

The kernel density estimator improves on the histogram by giving a smoother
estimate of the posterior distribution. The histogram can be thought of as
taking each point θ_i and spreading its contribution to the posterior distribution
uniformly over the bin containing θ_i. The kernel estimator on the other hand
spreads the contribution of each point according to a kernel function K around
the point, where K satisfies

∫ K(θ) dθ = 1   (integrating over the whole real line).

The kernel estimator with kernel K can then be defined by

p̂(θ) = (1/(nh)) Σ_{i=1}^{n} K( (θ − θ_i) / h ),

where h is a parameter known as the window width, which governs the smoothness
of the estimate. For a more detailed description of choosing the kernel function
K and the window width see Silverman (1986). An example of a kernel density
plot for the same data as the earlier histogram can be seen in Figure 3-2.
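The kernel estimator above translates directly into code; a sketch with a Gaussian kernel, where the rule-of-thumb default for h is only an illustrative choice and not the one used in the thesis.

import numpy as np

def kde(chain, grid, h=None):
    # Gaussian kernel density estimate of the marginal posterior at each grid point
    chain, grid = np.asarray(chain), np.asarray(grid)
    n = len(chain)
    if h is None:
        h = 1.06 * chain.std() * n ** (-0.2)          # Silverman's rule-of-thumb window width
    u = (grid[:, None] - chain[None, :]) / h          # (theta - theta_i) / h for every pair
    return np.exp(-0.5 * u**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

chain = np.random.default_rng(3).normal(4.0, 0.14, size=5000)
density = kde(chain, np.linspace(3.5, 4.5, 200))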
3.7 Convergence issues
The maximum likelihood based estimation procedures described in the last
chapter are iterative routines and consequently converge to an answer. The
convergence depends on a tolerance factor, that is, how different the current
estimate is from the last estimate. Here convergence is easy to establish.
The convergence of a Markov chain is different from the convergence of these
techniques. In Markov chain methods we are not interested in convergence to an
estimate but instead are interested in convergence to a distribution, namely the
joint posterior distribution of interest. There are many points to be addressed
when considering convergence to a distribution. Firstly, when has the chain moved
away from its starting value and started sampling from its stationary distribution?
Secondly, how large a sample is required to give estimates to a given accuracy?
And finally, is the stationary distribution the required posterior distribution?

Figure 3-2: Kernel density plot of μ₁ using the Gibbs sampling method and a
Gaussian kernel with a large value of the window width h.
3.7.1 Length of burn-in

It is usual in Markov chains to ignore the first B values while the chain converges
to the posterior distribution. These B values are known as the `burn-in' period
and there are many methods to estimate B. The easiest method is to look at
a trace for each parameter of interest. If a parameter θ is considered, then once
convergence has been attained at θ_B, the observations θ_i, i > B, should all come
from the same distribution. An equivalent approach is to consider the trace of
the mean of the parameter of interest against time. This trace should become
approximately constant when convergence has been reached. Examples of both
of these traces can be seen in Figure 3-3, where convergence is reached after about
50 iterations. The upper solid line in the bottom graph is the running mean after
discarding the first 50 iterations.
There are many convergence diagnostics that can be used to estimate whether
a chain has converged. The Raftery and Lewis diagnostic (Raftery and Lewis
1992) can also be used to estimate the chain length required for a given estimator
accuracy and is mentioned later. The Gelman and Rubin diagnostic (Gelman and
Rubin 1992) uses multiple chains and will be described when I consider multi-
modal models.
Geweke diagnostic
Geweke (1992) assumes that a burn-in of length B has been chosen and these B
iterations have been discarded. The method has its origin in the field of spectral
analysis and compares the trace of θ over two distinct parts, the first n_A and
the last n_B iterations, typically the first tenth and the last half of the data. The
following statistic

( θ̄_A − θ̄_B ) / √( S^A_θ(0)/n_A + S^B_θ(0)/n_B )

tends to a standard normal distribution as n → ∞ if the chain has converged.
Here θ̄ is the sample mean of θ and S_θ(0) is the consistent spectral density
estimate.
If the above statistic gives large absolute values for a chain, then convergence
has not occurred. In the models I am considering in this thesis, convergence
to the stationary distribution is quick when using the IGLS or RIGLS starting
values, and an arbitrary `burn-in' period will be used, for example 500 iterations.
If the chain has not converged by this point I can observe this from its trace and
amend the `burn-in' accordingly.
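A rough sketch of the Geweke comparison, assuming the burn-in has already been removed; for simplicity the spectral density at zero is approximated here by a truncated autocovariance sum rather than a full spectral estimate.

import numpy as np

def s_zero(x, max_lag=50):
    # Crude S(0): variance plus twice the truncated sum of autocovariances
    x = x - x.mean()
    n = len(x)
    acov = [x[: n - k] @ x[k:] / n for k in range(min(max_lag, n - 1))]
    return acov[0] + 2.0 * sum(acov[1:])

def geweke_z(chain, first=0.1, last=0.5):
    a = chain[: int(first * len(chain))]
    b = chain[-int(last * len(chain)):]
    var = s_zero(a) / len(a) + s_zero(b) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(var)

# |z| much larger than about 2 suggests the chain has not converged
print(geweke_z(np.random.default_rng(4).normal(size=5000)))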
3.7.2 Mixing properties of Markov chains
After a Markov chain has converged, the next consideration is how long to run
the chain to get accurate enough estimates. For some samplers, such as the
independence sampler, it is possible to calculate the number of iterations required
to calculate particular summary statistics to a given accuracy. This is because
the independence sampler by definition should give uncorrelated values.

Figure 3-3: Traces of parameter μ₁ and the running mean of μ₁ for a Metropolis
run that converges after about 50 iterations. Upper solid line in lower panel is
running mean with first 50 iterations discarded.
Auto-correlation is an important issue when considering the chain length, as
a chain that is mixing badly, that is, has high auto-correlation, will need to be
run longer to give estimates to the required accuracy. Two useful plots that come
from the time series literature (Chatfield 1989) are the autocorrelation function
(ACF) and the partial autocorrelation function (PACF). The ACF is defined by

ρ(τ) = Cov[θ(t), θ(t+τ)] / Var[θ(t)],

and describes correlations between the chain itself and a chain produced by
moving the start of the chain forward τ iterations. The chain is mixing well
if these values are all small. The pth partial autocorrelation (PAC) is the excess
auto-correlation at lag p when fitting an AR(p) process not accounted for by an
AR(p − 1) model. The first PAC will be equal to the first autocorrelation as
this describes the correlation in the chain. For the chain to obey the (first-order)
Markov property all other PACs should be near zero. A large pth PAC would
indicate that the next value is dependent on past values and not just the current
value. The ACF and PACF for one Gibbs sampling run and one Metropolis
sampling run with σ_p = 0.05 for the example in the later section are shown in
Figure 3-4. The ACFs in Figure 3-4 include the auto-correlation at lag 0, which
is always 1. Here it can be seen that the Gibbs run is mixing well and the
auto-correlations are all small, whereas the Metropolis run is highly correlated.
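The ACF itself is simple to estimate from a stored chain; a minimal sketch (the lag-0 value is 1 by construction, as in Figure 3-4).

import numpy as np

def acf(chain, max_lag=30):
    # Sample autocorrelations of the chain at lags 0, 1, ..., max_lag
    x = np.asarray(chain) - np.mean(chain)
    denom = x @ x
    return np.array([x[: len(x) - k] @ x[k:] / denom for k in range(max_lag + 1)])

chain = np.random.default_rng(5).normal(size=5000)       # stand-in for a stored chain
print(acf(chain, max_lag=5))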
There are many ways to improve the mixing of a Markov chain. The simplest
way would be to thin the chain by using only every kth observation from the
chain. Thinning a chain will give a new chain that has less autocorrelation but it
can be shown (MacEachern and Berliner 1994) that the thinned chain gives less
accurate estimates than the complete chain. Thinning is still a useful technique
as longer runs need greater storage capacity and although the thinned chain is
not as useful as the full chain, it will generally be better than a section of the
complete chain of the same length.
When considering Gibbs sampling methods, there are several ways of
improving the mixing of the chain by actually altering the form of the model
that is being fitted. Hills and Smith (1992) explore re-parameterising the vector
of variables θ, so that the new variables correspond to the principal axes of the
posterior distribution. This is done by transforming the data to the new axes.
When considering multi-level models, techniques such as hierarchical centring
(Gelfand, Sahu, and Carlin 1995), where variables that appear at lower levels are
also included at higher levels, will improve the mixing of the sampler.

Figure 3-4: ACF and PACF for parameter μ₁ for a Gibbs sampling run of length
5000 that is mixing well and a Metropolis run that is not mixing very well.
The mixing of a Metropolis-Hastings chain will depend greatly on the proposal
distribution used. I will discuss the effect of the proposal distribution in greater
detail in the example at the end of this chapter. Most of the techniques used
to improve mixing in the Gibbs sampling algorithms, which involve changing the
structure of the model, can also be used with Metropolis-Hastings algorithms.
Raftery & Lewis diagnostic
The Raftery and Lewis diagnostic (Raftery and Lewis 1992) considers the
convergence of a run based on estimating a quantile, q, of a function of the
parameters, g(θ), to within a given accuracy. The method works by firstly finding
the estimated qth quantile of g(θ), ĝ_q, from the chain and then creating a chain of
binary values Z_t defined by

Z_t = 1 if g(θ_t) > ĝ_q,
Z_t = 0 if g(θ_t) ≤ ĝ_q.

This binary sequence, or a thinned version of the binary sequence, can then
be thought of as a one step Markov chain with transition matrix

P = ( 1−α    α )
    (  β    1−β ).

Using results from Markov chain theory and estimates for α and β from the
chain, estimates for the length of `burn-in' required, B, and the minimum number
of iterations to run the chain for, N, can be calculated. N is defined as the
minimum chain length to obtain estimates for the qth quantile to within ±r (on the
probability scale) with probability s, such that the n step transition probabilities
of the Markov chain are within ε of its equilibrium distribution.
The estimates are

B = log( ε(α + β) / max(α, β) ) / log(1 − α − β)

and

N = B + [ αβ(2 − α − β) / (α + β)³ ] { r / Φ⁻¹(½(1 + s)) }⁻².

The Raftery Lewis diagnostic can also be used to assess the mixing of the
Markov chain by comparing the value N with the value N_min obtained if the
chain values were an independent sample. The statistic I_RL = N / N_min can then be
used to describe the efficiency of the sampler. The default settings for the Raftery
Lewis diagnostic, as used in Raftery and Lewis (1992), are q = 0.025, r = 0.005
and s = 0.95, and these will be used in MLwiN.
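Given the binary sequence Z_t, the two formulas above translate directly into code. The sketch below is a simplification of the full Raftery-Lewis procedure (it estimates α and β from the raw sequence and ignores the preliminary thinning step), and the tolerance eps = 0.001 is an assumed value, not one quoted in the text.

import numpy as np
from scipy.stats import norm

def raftery_lewis_BN(z, r=0.005, s=0.95, eps=0.001):
    # z is the 0/1 indicator chain; alpha = P(0 -> 1), beta = P(1 -> 0)
    z = np.asarray(z)
    alpha = np.mean(z[1:][z[:-1] == 0])
    beta = 1.0 - np.mean(z[1:][z[:-1] == 1])
    B = np.log(eps * (alpha + beta) / max(alpha, beta)) / np.log(1.0 - alpha - beta)
    factor = alpha * beta * (2.0 - alpha - beta) / (alpha + beta) ** 3
    N = B + factor * (norm.ppf(0.5 * (1.0 + s)) / r) ** 2
    return B, N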
3.7.3 Multi-modal models
If the joint posterior distribution of interest is multi-modal, then when an MCMC
sampler is used to simulate from the distribution it is possible, particularly
if the modes are distinct, that the sampler will simulate from one of the modes and
not the whole distribution. To get around this problem it is always useful to
run several chains in parallel, with different starting values spread around the
expected posterior distribution, and compare the estimates that are obtained from
each chain. If the chains give widely differing estimates then the posterior is likely
to be multi-modal and the different chains are sampling from distinct modes of
the distribution. There are many convergence diagnostics that rely on running
several chains from different starting points. One of the more popular will now
be described.
Gelman & Rubin diagnostic
Gelman and Rubin (1992) assume that m runs of the same model, each of length
2n and starting from dispersed starting points, are run, and the first n iterations of each
run have been discarded to allow each sequence to move away from its starting
point. Then the between-run variance B and within-run variance W are calculated as

B/n = Σ_{i=1}^{m} (θ̄_{i·} − θ̄_{··})² / (m − 1)   and   W = Σ_{i=1}^{m} s_i² / m,

where θ̄_{i·} is the mean of the n values for run i, s_i² is the corresponding variance and
θ̄_{··} is the overall mean.
The variance of the parameter of interest, σ², can be estimated by a weighted
average of W and B,

σ̂² = ((n − 1)/n) W + (1/n) B.

This, along with μ̂ = θ̄_{··}, gives a normal estimate for the target distribution.
If the dispersed starting points are still influencing the runs then the estimate σ̂²
will be an overestimate of σ². The potential scale reduction as n → ∞, that is,
the overestimation factor for σ², can be estimated by

R̂ = [ (n − 1)/n + ((m + 1)/(mn)) (B/W) ] × df / (df − 2),

where df = 2 (σ̂²)² / var̂(σ̂²), and

var̂(σ̂²) = ((n − 1)/n)² (1/m) var̂(s_i²) + ((m + 1)/(mn))² (2/(m − 1)) B²
          + 2 ((m + 1)(n − 1)/(mn²)) (n/m) [ cov̂(s_i², θ̄_{i·}²) − 2 θ̄_{··} cov̂(s_i², θ̄_{i·}) ],

in which the estimated variances and covariances are obtained from the sample
means and variances of the m runs.
If R̂ is near 1 for all parameters of interest then there is little evidence of
multi-modality. If R̂ is significantly bigger than 1 then at least one of the m
runs has not converged, or the runs have converged to different modes in the
distribution.
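A sketch of the core calculation for m parallel runs, assuming the first halves have already been discarded; for simplicity it omits the df/(df − 2) correction factor in the formula above.

import numpy as np

def gelman_rubin(chains):
    # chains is an (m, n) array of post-burn-in values, one row per run
    chains = np.asarray(chains)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)     # n times the variance of the run means
    W = chains.var(axis=1, ddof=1).mean()       # average within-run variance
    return (n - 1) / n + (m + 1) / (m * n) * B / W

# Four well-mixed runs from the same target should give a value close to 1
print(gelman_rubin(np.random.default_rng(7).normal(size=(4, 2000))))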
The majority of models I will study in this thesis will be unimodal. I am
aiming to use MLn's maximum likelihood techniques to give good starting values
for the parameters of interest and consequently will not use widely different
starting values, and so I generally choose not to use the Gelman and Rubin
diagnostic. I will however run several chains with different starting seeds for
the random number generator. This will work in a similar way to the different
starting values. A chain may get stuck in a local mode using one set of random
number seeds, whereas another chain starting from the same starting values but
with different random numbers may get stuck in a different mode. However, if
the model does have multiple modes this procedure will not find them as well as
the Gelman-Rubin sampling strategy.
3.7.4 Summary
Convergence diagnosis for MCMC methods has become a large field of
statistical research, and the three diagnostics described here are simply the tip
of the iceberg. Both Cowles and Carlin (1996) and Brooks and Roberts (1997)
review larger groups of convergence diagnostics and are recommended for further
reading on this subject.
3.8 Use of MCMC methods in multi-level modelling

Following the introduction of Gibbs sampling in Gelfand and Smith (1990),
Gelfand et al. (1990) applied the Gibbs sampling algorithm to many problems
including variance components models and a simple hierarchical model. Seltzer
(1993) considers using Gibbs sampling on a two level hierarchical model with a
scalar random regression parameter. The algorithm used is fully generalized in
Seltzer, Wong, and Bryk (1996) to allow vectors of random regression parameters.
Zeger and Karim (1991) consider using Gibbs sampling for generalized linear
models with random effects, which are two level multi-level models. They
concentrate mainly on the logistic normal model, which I will investigate in
Chapter 6. The package BUGS (Spiegelhalter et al. 1994) is a general purpose
Gibbs sampling package using the adaptive rejection method (Gilks and Wild
1992) that can be used to fit many models including multi-level models. Its authors
have concentrated mainly on models with univariate parameter distributions,
although BUGS versions 0.5 and later include multivariate distributions.
It can be seen that most research on the use of MCMC methods in the field of
multi-level modelling has concentrated on Gibbs sampling, primarily
because of its ease of programming. In MLwiN I will start by using Gibbs
sampling for the simplest models. Then, when the conditional distributions do
not have standard forms, for example in logistic regression models, where Zeger and
Karim (1991) use rejection sampling and BUGS uses adaptive rejection sampling,
I will instead consider using Metropolis and Metropolis-Hastings sampling. I will
also consider using these methods as an alternative to Gibbs sampling in the less
complex Gaussian models. Before looking at multi-level models, I will end this
chapter with an example that illustrates the three MCMC methods and the
other issues described in this chapter.
3.9 Example - Bivariate normal distribution
Gelman et al. (1995) considered taking one observation from a bivariate normal
distribution to illustrate the use of the Gibbs sampler. I will consider the more
general case of a sequence of n pairs of observations (y_1i, y_2i) from a bivariate
normal distribution with unknown mean (μ₁, μ₂) and known variance matrix Σ.
Assume that μ has a non-informative uniform prior distribution; then the posterior
distribution has a known form:

(μ₁, μ₂)ᵀ | y ~ N( (ȳ₁, ȳ₂)ᵀ, Σ/n ).

I can verify the use of the MCMC techniques in this chapter by comparing
the answers they produce with the correct posterior distribution. I will consider
a set of 100 draws generated from a bivariate normal distribution with mean
vector μ = (4, 2) and variance matrix

Σ = (  2.0  −0.2 )
    ( −0.2   1.0 ).

I will assume that Σ is known and that I want to estimate μ. In the test data set,
ȳ₁ = 4.0154 and ȳ₂ = 2.0013, so the posterior distribution is as follows:

(μ₁, μ₂)ᵀ | y ~ N( (4.0154, 2.0013)ᵀ, (  0.02   −0.002 ; −0.002   0.01 ) ).
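This setup is easy to reproduce in a few lines; the sketch below simulates its own 100 draws (so the sample means will differ slightly from the 4.0154 and 2.0013 quoted above) and computes the exact posterior N(ȳ, Σ/n) against which the samplers can be checked.

import numpy as np

rng = np.random.default_rng(10)
mu_true = np.array([4.0, 2.0])
Sigma = np.array([[2.0, -0.2],
                  [-0.2, 1.0]])
n = 100

y = rng.multivariate_normal(mu_true, Sigma, size=n)   # the simulated data set
ybar = y.mean(axis=0)                                  # posterior mean under a flat prior
post_cov = Sigma / n                                   # exact posterior covariance
print(ybar, post_cov)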
I will now explain briefly how to use the various techniques on this problem.
3.9.1 Metropolis sampling
There are two parameters, μ₁ and μ₂, for which posterior distributions are
required. As uniform priors are being used for μ₁ and μ₂, the conditional posterior
distributions are simply determined by the likelihood:

p(μ₁ | μ₂, Σ, y) ∝ p(y | μ, Σ) p(μ₁)
p(μ₁ | μ₂, Σ, y) ∝ exp( −(1/2) ∑_{i=1}^{N} (y_i − μ)ᵀ Σ⁻¹ (y_i − μ) ).

Similarly for μ₂,

p(μ₂ | μ₁, Σ, y) ∝ exp( −(1/2) ∑_{i=1}^{N} (y_i − μ)ᵀ Σ⁻¹ (y_i − μ) ).

I will use the normal proposal distribution

μ_i(t+1) ~ N( μ_i(t), σ_p² )

for both μ₁ and μ₂. I will consider several values for σ_p² to show the effect of the
proposal variance on the acceptance rate and convergence of the chain.
3.9.2 Metropolis-Hastings sampling

As an example of a Metropolis-Hastings sampler I will consider the following
normal proposal distribution:

μ_i(t+1) ~ N( μ_i(t) + 1/2, σ_p² ).

This proposal distribution has two differences from the earlier Metropolis
proposal distribution. Firstly it is biased, which in this example induces slow
mixing; generally it is preferable to have an unbiased proposal distribution.
Secondly it is not symmetric, so it does not have the Metropolis property
p(θ_{t+1} = a | θ_t = b) = p(θ_{t+1} = b | θ_t = a). Consequently the ratio of the
proposal distributions has to be worked out, that is

r = p(θ_{t+1} = a | θ_t = b) / p(θ_{t+1} = b | θ_t = a).

For this proposal distribution,

r = p(θ_{t+1} = a | θ_{t+1} ~ N(b + 1/2, σ²)) / p(θ_{t+1} = b | θ_{t+1} ~ N(a + 1/2, σ²))
  = exp( −(1/(2σ²)) (a − b − 1/2)² ) / exp( −(1/(2σ²)) (b − a − 1/2)² )
  = exp( −(1/(2σ²)) [ (a − b − 1/2)² − (b − a − 1/2)² ] )
  = exp( (a − b)/σ² ).

So when choosing to accept or reject each new value, the Hastings ratio is used
as a multiplying factor. Again I will consider using several different proposal
variances σ_p² to improve the acceptance rate and convergence time.
3.9.3 Gibbs sampling
To use Gibbs sampling on this model I will consider updating the two parameters,
μ₁ and μ₂, separately. As an illustration of the Gibbs sampler it would be pointless
to update the parameters together using a multivariate updating step, as this would
mean generating from the conditional distribution p(μ | Σ, y), which is the joint
posterior distribution of interest, and I could find its mean and variance directly.
To use Gibbs sampling I need to find the two conditional distributions
p(μ₁ | μ₂, Σ, y) and p(μ₂ | μ₁, Σ, y). I am using uniform priors for μ₁ and μ₂ and
so the posterior distribution is simply the normalised likelihood. The conditional
distributions are found as follows:

p(μ₁ | μ₂, Σ, y) ∝ p(y | μ, Σ) p(μ₁)
p(μ₁ | μ₂, Σ, y) ∝ exp( −(1/2) ∑_{i=1}^{n} (y_i − μ)ᵀ Σ⁻¹ (y_i − μ) ).

Let D = ( d₁₁ d₁₂ ; d₁₂ d₂₂ ) = Σ⁻¹, then expand in terms of μ₁:

p(μ₁ | μ₂, Σ, y) ∝ exp( −(1/2) ∑_{i=1}^{n} [ (y_{i1} − μ₁)² d₁₁ + 2 (y_{i1} − μ₁)(y_{i2} − μ₂) d₁₂ + (y_{i2} − μ₂)² d₂₂ ] ).
Then, assuming that μ₁ has a normal conditional distribution, μ₁ ~ N(μ_c, σ_c²), and equating
powers of μ₁ gives

μ₁² / σ_c² = n d₁₁ μ₁²   →   σ_c² = 1 / (n d₁₁),

and

−2 μ_c μ₁ / σ_c² = −2 ∑_{i=1}^{n} ( y_{i1} μ₁ d₁₁ + y_{i2} μ₁ d₁₂ − μ₂ μ₁ d₁₂ )
   →   μ_c = ȳ₁ + (d₁₂ / d₁₁)(ȳ₂ − μ₂).

So I have

μ₁ | μ₂, Σ, y ~ N( ȳ₁ + (d₁₂/d₁₁)(ȳ₂ − μ₂), 1/(n d₁₁) ),

and similarly for μ₂,

μ₂ | μ₁, Σ, y ~ N( ȳ₂ + (d₁₂/d₂₂)(ȳ₁ − μ₁), 1/(n d₂₂) ).

These expressions could also have been derived from standard bivariate
normal regression results. I can now use the Gibbs sampling algorithm by
alternately sampling from these two conditional distributions. Unlike the other
two methods, I do not have a free parameter to set to change the acceptance rate
and improve the convergence rate; the Gibbs sampler always accepts the new
state. This is one of the reasons it is more widely used than the other methods.
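The full conditionals just derived make the Gibbs sampler only a few lines of code; a sketch, reusing the simulated data y and known matrix Σ from the earlier snippet.

import numpy as np

def gibbs_bivariate_mean(y, Sigma, n_iter=6000, burn_in=1000, seed=11):
    # Gibbs sampler for the mean of a bivariate normal with known Sigma and flat priors
    rng = np.random.default_rng(seed)
    n, ybar = len(y), y.mean(axis=0)
    D = np.linalg.inv(Sigma)                       # precision matrix D = Sigma^{-1}
    mu = ybar.copy()                               # starting values
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # mu1 | mu2 ~ N(ybar1 + (d12/d11)(ybar2 - mu2), 1/(n d11))
        mu[0] = rng.normal(ybar[0] + D[0, 1] / D[0, 0] * (ybar[1] - mu[1]),
                           np.sqrt(1.0 / (n * D[0, 0])))
        # mu2 | mu1 ~ N(ybar2 + (d12/d22)(ybar1 - mu1), 1/(n d22))
        mu[1] = rng.normal(ybar[1] + D[0, 1] / D[1, 1] * (ybar[0] - mu[0]),
                           np.sqrt(1.0 / (n * D[1, 1])))
        draws[t] = mu
    return draws[burn_in:]

# draws = gibbs_bivariate_mean(y, Sigma)   # posterior means and sds should be close to Table 3.1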
3.9.4 Results
The model was fitted using all three methods described above. For the Gibbs
sampler 3 runs were performed using a burn-in of 1,000 and a main run of 5,000
updates. For both the Metropolis and Metropolis-Hastings sampling methods 3
runs were performed for several different values of σ_p. A burn-in of 1,000 was used
and a main run of 100,000 updates for both methods. The results are summarised
in Table 3.1.

Table 3.1: Comparison between MCMC methods for fitting a bivariate normal model with unknown mean vector.

Method       σ_p    μ₁ (sd)         μ₂ (sd)         Acc % μ₁/μ₂    R-L N
Theory       N/A    4.015 (0.141)   2.001 (0.100)   N/A            N/A
Metropolis   0.05   4.012 (0.143)   2.004 (0.100)   88.6/84.3      81,500
Metropolis   0.1    4.014 (0.141)   2.002 (0.100)   78.1/70.3      35,000
Metropolis   0.2    4.017 (0.141)   2.001 (0.100)   60.6/49.7      16,600
Metropolis   0.3    4.017 (0.141)   2.001 (0.100)   47.8/37.2      14,800
Metropolis   0.5    4.016 (0.141)   2.002 (0.100)   32.5/24.0      20,200
Metropolis   1.0    4.015 (0.142)   2.002 (0.100)   17.4/12.4      36,800
Hastings     0.25   4.019 (0.143)   2.002 (0.101)   4.4/4.3        92,600
Hastings     0.3    4.017 (0.141)   2.002 (0.099)   8.9/7.9        34,900
Hastings     0.5    4.014 (0.140)   2.001 (0.100)   18.8/14.1      25,700
Hastings     0.75   4.014 (0.141)   2.001 (0.100)   17.9/13.1      31,900
Hastings     1.0    4.016 (0.141)   2.001 (0.100)   15.2/11.0      41,800
Hastings     1.5    4.013 (0.143)   2.001 (0.100)   11.1/7.8       63,200
Gibbs        N/A    4.014 (0.140)   2.002 (0.103)   100.0/100.0    3,900

From Table 3.1 it can clearly be seen that all the methods eventually converge
to approximately the correct answers. According to the Raftery-Lewis diagnostic,
the Gibbs sampling method achieves the default accuracy goals in the least
number of iterations. Both the Metropolis and the Hastings methods have
accuracies which vary depending on the proposal distribution. The Hastings
method has much smaller acceptance rates due to the bias in the proposal
distribution. It is clear that the Hastings sampler takes longer to converge than
the Metropolis sampler on average, and that the best proposal standard deviation
σ_p is higher for the Hastings sampler than for the Metropolis sampler. Both these
points are also due to the bias in the sampler, and in general a biased sampler
would not be used in preference to an unbiased sampler.
Gelman, Roberts, and Gilks (1995) studied optimal Metropolis sampler
proposal distributions for Gaussian target distributions. They found that the best
univariate proposal distributions have standard deviations that are 2.38 times the
sample standard deviation. I used the known correct standard deviations for the
parameters μ₁ and μ₂ to find that the optimal proposal standard deviations are
0.336 for μ₁ and 0.238 for μ₂. In Table 3.1 the same proposal standard deviation is
used for both parameters, but it is just as easy to use different proposal distributions
for each parameter. Using the proposal standard deviations proposed in Gelman,
Roberts, and Gilks (1995) gives acceptance rates of approximately 44.2% and
Raftery Lewis N values of around 14,000 for both parameters, which compare
favourably with the best results in the table.
Figure 3-5: Kernel density plot of μ₂ using the Gibbs sampling method and a
Gaussian kernel.
Looking at the kernel density plots for the two variables, Figures 3-2 and
3-5, constructed from one run of the Gibbs sampler method, it can be seen that
both variables have Gaussian posterior distributions, as expected. In the case
where the posterior distributions of the parameters of interest are Gaussian,
the 95% central Bayesian credible intervals (BCI) from the simulation methods
will be approximately the same as the standard 95% confidence intervals (CI).
This point is illustrated in Table 3.2 for the three MCMC methods. Both the
Bayesian credible intervals and the confidence intervals are based on one run of
each sampler respectively.
Figure 3-6: Plots of the Raftery Lewis N values for various values of σ_p, the
proposal distribution standard deviation. Panels (a)-(f) are described in the text below.
Table 3.2: Comparison between 95% confidence intervals and Bayesian credible intervals in the bivariate normal model.

Method            μ₁ CI            μ₂ CI            μ₁ BCI           μ₂ BCI
Theory            (3.739, 4.291)   (1.805, 2.197)   (3.739, 4.291)   (1.805, 2.197)
Met. σ_p = 0.3    (3.737, 4.294)   (1.807, 2.196)   (3.737, 4.294)   (1.806, 2.196)
Met. optimal      (3.737, 4.292)   (1.804, 2.200)   (3.738, 4.291)   (1.804, 2.200)
Hast. σ_p = 0.5   (3.739, 4.288)   (1.802, 2.190)   (3.739, 4.283)   (1.804, 2.191)
Gibbs             (3.733, 4.291)   (1.799, 2.205)   (3.737, 4.290)   (1.802, 2.200)
Considering the Metropolis sampler in more detail and running the sampler
with many different values of σ_p, the optimal value given by Gelman, Roberts,
and Gilks (1995) can be verified. The graphs in Figure 3-6 were created by fitting
smooth curves to the results from these new runs of the sampler. Graphs (a)
and (b) show the Raftery Lewis N values plotted against σ_p, the
proposal standard deviation, for μ₁ and μ₂ respectively. Graphs (c) and (d)
show the effect of the acceptance rate of the proposals on N for the two variables.
Graphs (e) and (f) are graphs (a) and (b) rescaled by dividing σ_p by the true
standard deviations for the two parameters.
In their paper Gelman, Roberts, and Gilks (1995) compared the effects of
different values of the parameter σ_p, and of the acceptance rate, on the efficiency
of the sampler. I have used the Raftery and Lewis diagnostic N in place of
efficiency and found that the same values of standardised σ_p that maximise
efficiency also minimise N. From this it appears that 1/N is an increasing function
of efficiency.
It is worth noting from Figure 3-6 that while the optimal value for the ratio
of proposal SD to actual SD is close to the 2.4 value obtained in Gelman, Roberts,
and Gilks (1995), the region of near optimality is quite broad: the Raftery Lewis
N value is below 20,000 for ratios between 0.8 and 5.
3.9.5 Summary
This example has shown how to use the three MCMC methods described earlier
in this chapter on a simple problem. It has shown that the Gibbs sampler works
best if the conditional distributions of the unknown parameters are known. It
also shows that the Metropolis and Hastings algorithms are easier to implement
but need more tuning to give good answers. The Hastings algorithm used here
performed worst, but it will be shown in later chapters how more sensible Hastings
algorithms can be used. The example also illustrates how the output from a
simulation method can be summarised and highlights the importance of checking
convergence.
The summary results and run diagnostics covered in this chapter have now
been added to MLwiN and can be seen in Figure 3-7. This parameter has been
given an informative prior distribution whose graph can also be seen in the kernel
density picture. In the next chapter I will discuss prior distributions in greater
detail.
Figure 3-7: Plot of the MCMC diagnostic window in the package MLwiN for the
parameter β₁ from a random slopes regression model.
Chapter 4
Gaussian Models 1 - Introduction
4.1 Introduction
In this chapter the aim is to combine the knowledge gained in the previous
two chapters: to use the MCMC methods described in Chapter 3 to fit some
of the simple multi-level models described in Chapter 2. This work will then
lead on to the next chapter, where I will consider how to fit general models
using MCMC methods. I will only consider one of the three MCMC methods
described in Chapter 3, the Gibbs sampler, and use it to fit the simple variance
components model and the random slopes regression model. I will give Gibbs
sampling algorithms to fit both these models and compare the results obtained
with the results obtained from the IGLS and RIGLS maximum likelihood methods.
Before considering any modelling, I will firstly concentrate on one important
aspect of Bayesian methods, prior distributions. To create a general purpose
multi-level modelling package that uses Bayesian methods, some default prior
distributions must be found for all parameters. These default priors should be
"non-informative", and so some possible default priors will be described in the next
section of this chapter. These different candidate priors will then be compared
via simulation with each other and with the maximum likelihood methods. I will end
the chapter with some conclusions on which methods perform best.
4.2 Prior distributions
A prior distribution p(θ) for a parameter θ is a probability distribution that
describes all that is known about θ before the data have been collected. There
are two distinct types of prior distribution: informative priors and non-informative
priors.

4.2.1 Informative priors

An informative prior for θ is a prior distribution that is used when information
about the parameter of interest is available before the data are collected, and this
information is to be included in the analysis. For example, say I was interested in
estimating the average height of male university students. Before collecting
my data by sampling from the student population, I go to the library and find that
the average height of men in Britain is 1.79m. I can then create a normal prior
distribution with mean 1.79 and variance σ², where the value of σ² will determine
the information content of the prior knowledge. I could also incorporate my belief
that, as students are generally in the 18-30 age group and this age group is on
average taller, the mean of my prior distribution should be increased.

4.2.2 Non-informative priors

A "non-informative" prior distribution for θ is a prior distribution that is used
to express complete ignorance of the value of θ before the data are collected. Such
priors are non-informative in the sense that no value is favoured over any other, and are
also described as diffuse or flat priors for this reason and because of their shape. The
most common non-informative prior is the uniform distribution over the range
of the sample space for θ. If the parameter is defined over an infinite range,
for example the whole real line, then the uniform distribution is an improper
prior distribution, as its density does not integrate to 1. Improper
prior distributions should be used with caution, and only be used if they produce
proper posterior distributions.
4.2.3 Priors for �xed e�ects
Fixed e�ect parameters have no constraints and can take any value. A prior
distribution for such parameters will need to be de�ned over the whole real line.
The conjugate prior distribution for such parameters is the normal distribution
as will be illustrated in the algorithms described in later sections.
Uniform prior
If a non-informative prior is required then a good choice would be a uniform prior
over the whole real line, p(�) / 1. This prior is improper as it does not integrate
to 1, but will give proper posterior distributions and can be approximated by the
following prior.
Normal prior with huge variance
As the variance of the normal distribution is increased, the distribution becomes
locally flat around its mean. Although fixed effects can take any value, close
examination of the data can narrow the range of plausible values and a suitable normal
prior can be found. Generally the normal prior $\beta \sim N(0, 10^4)$ will be an
acceptable approximation to a uniform distribution, but if the fixed effects are
very large a suitable increase in the prior variance may be necessary. Figure 4-1
shows several normal priors over the range (-5, 5). It can clearly be seen that as
the variance increases the prior distribution becomes flatter over the range, and
when the variance is increased to 50 the graph looks like a flat line.
4.2.4 Priors for single variances
Variance parameters are constrained to be strictly positive, and so prior
distributions such as the normal cannot be used. The conjugate prior for a
variance parameter is an inverse chi-squared or inverse gamma distribution. As
these distributions are not commonly simulated from, the precision parameter,
the reciprocal of the variance, is generally considered instead. The conjugate priors
for the precision parameter are then the chi-squared or gamma distributions.
There are several main contenders for a non-informative prior for
the variance and these will now be considered.
Figure 4-1: Plot of normal prior distributions over the range (-5, 5) with mean 0 and variances 1, 2, 5, 10 and 50 respectively.
Uniform prior for $\sigma^2$

The parameter of interest is the variance $\sigma^2$, so this prior tries to allow any
variance to be equally likely. This prior is used by Gelman and Rubin (1992)
and Seltzer (1993) amongst others, but appears to have the disadvantage that
it favours large values of $\sigma^2$. This is because even unfeasibly large values of $\sigma^2$
have equal prior probability. This prior is improper and the following is a proper
alternative.
Pareto(1,$c$) prior for $\tau = 1/\sigma^2$

The Pareto distribution is a left-truncated gamma distribution, and when used
as a prior for the precision parameter it is equivalent to a locally uniform prior for
the variance parameter:

$$\tau \sim \mathrm{Pareto}(1, c), \qquad p(\tau) = c\,\tau^{-2}, \quad \tau > c.$$

This means that a uniform prior for $\sigma^2$ on $(0, c^{-1})$ is equivalent to a Pareto$(1, c)$
prior for $\tau$. As $c$ is decreased the distribution approaches the improper uniform
distribution on $(0, \infty)$.
Uniform prior for $\log \sigma^2$

Box and Tiao (1992) try to find `data-translated' uniform priors to represent
suitably non-informative priors. They try to find a scale upon which the
distribution of the parameter will have the same shape for any possible value
of that parameter. For the fixed effects parameter this scale is simply the
parameter's own scale, as altering a normal distribution's mean does not alter
its shape, it simply translates the distribution. When considering a variance
parameter the likelihood has an inverse chi-squared distribution and this implies
that the correct scale is the log scale. Consequently Box and Tiao suggest
using a uniform prior on the $\log \sigma^2$ scale. DuMouchel and Waternaux (1992)
discourage the use of this improper prior distribution with hierarchical models as
they claim it can give improper posterior distributions, so instead the following
proper alternative is often used.
Gamma($\epsilon,\epsilon$) prior for $\tau = 1/\sigma^2$

The gamma($\epsilon,\epsilon$) prior for $\tau$ approaches a uniform prior for $\log \sigma^2$ as $\epsilon \rightarrow 0$. In
fact the improper gamma(0, 0) distribution for $\tau$ is equivalent to a
uniform prior for $\log \sigma^2$. This prior is the standard prior recommended by BUGS
(Spiegelhalter et al. 1994) for variance parameters, as BUGS does not permit the
use of improper priors.
Scaled inverse chi-squared($\nu, \hat\sigma^2$) prior for $\sigma^2$

An alternative approach would be to use an estimate of the parameter of interest
to choose a particular prior distribution. This prior is then a data-driven prior
as it requires an estimate $\hat\sigma^2$ of $\sigma^2$. The parameter $\nu$ is small, and if the estimate
$\hat\sigma^2$ is 1, this prior is equivalent to a gamma($\nu/2, \nu/2$) prior for $\tau$.
4.2.5 Priors for variance matrices
When considering a variance matrix, most priors for single variances can be
translated to a multivariate alternative.
Uniform prior for $\Omega$

This is similar to the univariate case, i.e. $p(\Omega) \propto 1$.
Uniform prior for $\log \Omega$

This is similar to the univariate case, i.e. $p(\log \Omega) \propto 1$. This prior will not be
considered directly as it gives rise to improper posterior distributions.
Wishart prior equivalent to the gamma($\epsilon,\epsilon$) prior

It is difficult to evaluate what would be the multivariate equivalent of the
gamma($\epsilon,\epsilon$) prior. One candidate prior is a Wishart prior for the precision matrix
$\Omega^{-1}$ with parameters $\nu = n$ and $S = I$. In fact, it will be seen later that this
prior is slightly informative and shrinks the estimate for $\Omega$ towards $I$, the identity
matrix.
Wishart prior equivalent to the SI $\chi^2(\nu, \hat\sigma^2)$ prior

It can be shown (Spiegelhalter et al. 1994) that if $\Omega^{-1} \sim \mathrm{Wishart}_n(\nu, S)$ then
$E(\Omega) = S/(\nu - n - 1)$. It clearly follows that if there is a prior estimate $\hat\Omega$ for $\Omega$
and I want to incorporate this estimate into a `vaguely' informative prior, then
the following is an obvious candidate:

$$\Omega^{-1} \sim \mathrm{Wishart}_n(n+2, \hat\Omega).$$

I will now compare the results obtained using all the above priors on 2 simple
two level models.
4.3 2 Level variance components model
One of the simplest possible multi-level models is the two level variance
components model. This model has a single response variable and interest lies in
quantifying the variability of this response at different levels. The two level variance
components model can be written mathematically as:

$$y_{ij} = \beta_0 + u_j + e_{ij},$$
$$u_j \sim N(0, \sigma^2_u), \qquad e_{ij} \sim N(0, \sigma^2_e),$$

where $i = 1, \ldots, n_j$, $j = 1, \ldots, J$, $\sum_j n_j = N$ and in which all the $u_j$ and $e_{ij}$
are independent. This model can be fitted using the Gibbs sampling method as
shown in the next section.
4.3.1 Gibbs sampling algorithm
The unknown parameters in the variance components model can be split into
four groups: the fixed effect $\beta_0$, the level 2 residuals $u_j$, the level 2 variance $\sigma^2_u$
and the level 1 variance $\sigma^2_e$. Conditional posterior distributions for each of these
parameters need to be found so that the Gibbs sampling method described in
the previous chapter can be used. Then sampling from the distributions in turn
gives estimates for the parameters, and their posterior distributions can be found
by simulation.
Prior distributions
I will assume a uniform prior for the fixed effect parameter $\beta_0$. The two variances
$\sigma^2_u$ and $\sigma^2_e$ will take various priors in the simulation experiment, so in the algorithm
I will use general scaled inverse $\chi^2$ priors with parameters $\nu_u, s^2_u$ and $\nu_e, s^2_e$
respectively. Then all the priors in the earlier section can be obtained from
particular values of these parameters. The algorithm is then as follows:
Step 1. $p(\beta_0 \mid y, \sigma^2_u, \sigma^2_e, u)$

Let $\beta_0 \sim N(\hat\beta_0, \hat D_0)$; then to find $\hat\beta_0$ and $\hat D_0$:

$$p(\beta_0 \mid y, \sigma^2_u, \sigma^2_e, u) \propto \prod_{i,j} \Big(\frac{1}{\sigma^2_e}\Big)^{\frac{1}{2}} \exp\Big[-\frac{1}{2\sigma^2_e}(y_{ij} - u_j - \beta_0)^2\Big]$$
$$\propto \exp\Big[-\frac{N}{2\sigma^2_e}\beta_0^2 + \frac{1}{\sigma^2_e}\sum_{i,j}(y_{ij} - u_j)\,\beta_0 + \text{const}\Big].$$

Comparing this with the form of a normal distribution and matching powers of
$\beta_0$ gives

$$\hat D_0 = \frac{\sigma^2_e}{N}, \qquad \hat\beta_0 = \frac{1}{N}\sum_{i,j}(y_{ij} - u_j).$$
Step 2. $p(u_j \mid y, \sigma^2_u, \sigma^2_e, \beta_0)$

Let $u_j \sim N(\hat u_j, \hat D_j)$; then to find $\hat u_j$ and $\hat D_j$:

$$p(u_j \mid y, \sigma^2_u, \sigma^2_e, \beta_0) \propto \prod_{i=1}^{n_j}\Big(\frac{1}{\sigma^2_e}\Big)^{\frac{1}{2}}\exp\Big[-\frac{1}{2\sigma^2_e}(y_{ij} - u_j - \beta_0)^2\Big] \times \Big(\frac{1}{\sigma^2_u}\Big)^{\frac{1}{2}}\exp\Big[-\frac{u_j^2}{2\sigma^2_u}\Big]$$
$$\propto \exp\Big[-\frac{1}{2}\Big(\frac{n_j}{\sigma^2_e} + \frac{1}{\sigma^2_u}\Big)u_j^2 + \frac{1}{\sigma^2_e}\sum_{i=1}^{n_j}(y_{ij} - \beta_0)\,u_j + \text{const}\Big].$$

Comparing this with the form of a normal distribution and matching powers of
$u_j$ gives

$$\hat D_j = \Big(\frac{n_j}{\sigma^2_e} + \frac{1}{\sigma^2_u}\Big)^{-1}, \qquad \hat u_j = \frac{\hat D_j}{\sigma^2_e}\sum_{i=1}^{n_j}(y_{ij} - \beta_0).$$
Step 3. $p(\sigma^2_u \mid y, \beta_0, u, \sigma^2_e)$

Consider instead $p(1/\sigma^2_u \mid y, \beta_0, u, \sigma^2_e)$ and let $1/\sigma^2_u \sim \mathrm{gamma}(\hat a_u, \hat b_u)$. Then
$p(1/\sigma^2_u) = (1/\sigma^2_u)^{-2}p(\sigma^2_u)$ and so

$$p(1/\sigma^2_u \mid y, \beta_0, u, \sigma^2_e) \propto \prod_{j=1}^{J}\Big(\frac{1}{\sigma^2_u}\Big)^{\frac{1}{2}}\exp\Big[-\frac{u_j^2}{2\sigma^2_u}\Big]\Big(\frac{1}{\sigma^2_u}\Big)^{-2}p(\sigma^2_u)$$
$$\propto \Big(\frac{1}{\sigma^2_u}\Big)^{\frac{J}{2} + \frac{\nu_u}{2} - 1}\exp\Big[-\frac{1}{2\sigma^2_u}\Big(\sum_{j=1}^{J}u_j^2 + \nu_u s_u^2\Big)\Big].$$

Comparing this with the form of a gamma distribution produces

$$\hat a_u = \frac{J + \nu_u}{2} \quad\text{and}\quad \hat b_u = \frac{1}{2}\Big(\nu_u s_u^2 + \sum_{j=1}^{J}u_j^2\Big).$$

A uniform prior on $\sigma^2_u$, or the equivalent Pareto prior, is equivalent to $\nu_u = -2$, $s^2_u = 0$. A uniform prior on $\log \sigma^2_u$ is equivalent to $\nu_u = 0$, $s^2_u = 0$, and a
gamma($\epsilon, \epsilon$) prior for $1/\sigma^2_u$ is equivalent to $\nu_u = 2\epsilon$, $s^2_u = 1$.
Step 4. $p(\sigma^2_e \mid y, \beta_0, u, \sigma^2_u)$

Consider instead $p(1/\sigma^2_e \mid y, \beta_0, u, \sigma^2_u)$ and let $1/\sigma^2_e \sim \mathrm{gamma}(\hat a_e, \hat b_e)$. Then
$p(1/\sigma^2_e) = (1/\sigma^2_e)^{-2}p(\sigma^2_e)$ and so

$$p(1/\sigma^2_e \mid y, \beta_0, u, \sigma^2_u) \propto \prod_{i,j}\Big(\frac{1}{\sigma^2_e}\Big)^{\frac{1}{2}}\exp\Big[-\frac{1}{2\sigma^2_e}(y_{ij} - \beta_0 - u_j)^2\Big]\Big(\frac{1}{\sigma^2_e}\Big)^{-2}p(\sigma^2_e)$$
$$\propto \Big(\frac{1}{\sigma^2_e}\Big)^{\frac{N}{2} + \frac{\nu_e}{2} - 1}\exp\Big[-\frac{1}{2\sigma^2_e}\Big(\sum_{i,j}(y_{ij} - \beta_0 - u_j)^2 + \nu_e s_e^2\Big)\Big].$$

Comparing this with the form of a gamma distribution produces

$$\hat a_e = \frac{N + \nu_e}{2} \quad\text{and}\quad \hat b_e = \frac{1}{2}\Big(\nu_e s_e^2 + \sum_{i,j}e_{ij}^2\Big).$$

A uniform prior on $\sigma^2_e$, or the equivalent Pareto prior, is equivalent to $\nu_e = -2$, $s^2_e = 0$. A uniform prior on $\log \sigma^2_e$ is equivalent to $\nu_e = 0$, $s^2_e = 0$, and a
gamma($\epsilon, \epsilon$) prior for $1/\sigma^2_e$ is equivalent to $\nu_e = 2\epsilon$, $s^2_e = 1$.
Having found the four sets of conditional distributions, it is now simple
enough to program up the algorithm and compare via simulation the various
prior distributions.
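To make the four conditional steps concrete, here is a minimal sketch of such a Gibbs sampler in Python with numpy. It is not the implementation used for the results in this chapter (those runs used BUGS and MLwiN); the function name, starting values and argument defaults are illustrative assumptions, with the scaled inverse $\chi^2$ prior parameters passed in so that the different default priors correspond to different argument settings.

```python
import numpy as np

def gibbs_vc(y, school, J, n_iter=5000, burn=500,
             nu_u=-2.0, s2_u=0.0, nu_e=-2.0, s2_e=0.0, seed=1):
    """Sketch of the 4-step Gibbs sampler for the variance components model.
    y: responses (length N); school: school index 0..J-1 for each pupil;
    nu_*, s2_*: scaled inverse chi-squared prior parameters (the defaults
    correspond to uniform priors on the two variances)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    school = np.asarray(school)
    N = len(y)
    nj = np.bincount(school, minlength=J)

    beta0, s2u, s2e = y.mean(), 1.0, 1.0      # arbitrary starting values
    u = np.zeros(J)
    draws = np.empty((n_iter, 3))

    for it in range(burn + n_iter):
        # Step 1: beta0 ~ N(mean of (y_ij - u_j), sigma2_e / N)
        beta0 = rng.normal((y - u[school]).mean(), np.sqrt(s2e / N))
        # Step 2: u_j ~ N(uhat_j, Dhat_j)
        Dj = 1.0 / (nj / s2e + 1.0 / s2u)
        rsum = np.bincount(school, weights=y - beta0, minlength=J)
        u = rng.normal(Dj * rsum / s2e, np.sqrt(Dj))
        # Step 3: 1/sigma2_u ~ gamma(a_u, b_u), b_u being a rate
        a_u = 0.5 * (J + nu_u)
        b_u = 0.5 * (nu_u * s2_u + np.sum(u ** 2))
        s2u = 1.0 / rng.gamma(a_u, 1.0 / b_u)
        # Step 4: 1/sigma2_e ~ gamma(a_e, b_e)
        e = y - beta0 - u[school]
        a_e = 0.5 * (N + nu_e)
        b_e = 0.5 * (nu_e * s2_e + np.sum(e ** 2))
        s2e = 1.0 / rng.gamma(a_e, 1.0 / b_e)
        if it >= burn:
            draws[it - burn] = beta0, s2u, s2e
    return draws   # columns: beta0, sigma2_u, sigma2_e
```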
4.3.2 Simulation method
In this simulation experiment I want to compare the maximum likelihood methods
IGLS and RIGLS with the Gibbs sampler method using several different prior
distributions for the variance parameters. For ease of terminology I will consider
this two level dataset in an educational setting and use pupils within schools as
the units under consideration.
I will now consider which parameters in the above model are important and run
several different sets of simulations using different values for these parameters.
For each set of parameter values 1000 simulated datasets will be generated and
each method will be used to fit the variance components model to each dataset.
I will then compare how the methods perform for each group of parameter
values. These comparisons will be of two sorts: firstly how biased the methods
are, and secondly how well the confidence intervals they produce cover the true values.
The first two parameters I consider as important will influence the structure
of the study. Firstly I will consider the size of the study; there are $J$ schools,
each with $n_j$ pupils, to give a dataset of size $N$. I will consider changing $J$, the
number of schools included in the study, and consequently modifying $N$ to reflect
this change. Secondly the number of pupils in each school will be considered.
Here two schemes will be adopted: firstly having equal numbers of pupils in each
school and secondly having a more widely spread distribution of pupils in each
school.
To use a realistic scenario I will consider the JSP dataset introduced in earlier
chapters. If slight modifications are made by removing 1 pupil at random from
each of the 23 largest schools then there will be 864 students, an average of 18
students per school.
The number of schools included in the study can then be varied and schools
can be chosen so that the average number of pupils per school is maintained at 18 and the
sizes of the individual schools are well spread. I will consider four sizes of study:
6, 12, 24 and 48 schools with a total of 108, 216, 432 and 864 pupils respectively.
The 8 study designs are included in Table 4.1 below. The individual schools in the
cases with unequal $n_j$ were chosen to resemble the actual (skewed) distribution
of class size in the JSP data.
Table 4.1: Summary of study designs for variance components model simulation.
Study   Pupils per school                                                         N
1       5 10 13 18 24 38                                                          108
2       18 in each of the 6 schools                                               108
3       5 8 10 11 11 12 13 15 20 24 26 61                                         216
4       18 in each of the 12 schools                                              216
5       5 7 8 10 10 11 11 12 12 13 13 14 15 16 18 19 20 21 23 24 26 29 34 61     432
6       18 in each of the 24 schools                                              432
7       5 6 7 8 8 10 10 10 11 11 11 11 12 12 12 12 13 13 13 13 14 14 15 15
        16 16 17 18 18 19 19 20 20 21 21 21 23 24 24 24 25 26 27 29 34 37 38 61  864
8       18 in each of the 48 schools                                              864
The other variables that need varying are the true values given to the
parameters of interest, $\beta_0$, $\sigma^2_u$ and $\sigma^2_e$. The fixed effect $\beta_0$ is not of great interest
and so will be fixed at 30.0 and not modified. The two variances are more
interesting and I will choose three possible values for each of these parameters.
The between schools variance $\sigma^2_u$ will take values 1.0, 10.0 and 40.0, and for the
between pupils variation $\sigma^2_e$ the values 10.0, 40.0 and 80.0 will be used. I will
assume that $\sigma^2_e$ is always greater than or equal to $\sigma^2_u$ as this is more likely in the
educational scenario.
I will consider the eight different study designs with true values that are most
like the original JSP model, that is, $\beta_0 = 30$, $\sigma^2_u = 10$ and $\sigma^2_e = 40$. I will then
only consider study design 7, which is similar to the actual JSP dataset, and
modify the true values of the variance parameters. This will make in total 15
different simulation settings.
Creating the simulation datasets
For the variance components model, creating the simulation datasets is easy as
the only data that need generating are the values of the response variable for the
$N$ pupils. Considering the case of 864 pupils within 48 schools the procedure is
as follows:
1. Generate 48 $u_j$s, one for each school, by drawing from a normal distribution
with mean 0 and variance $\sigma^2_u$.
2. Generate 864 $e_{ij}$s, one for each pupil, by drawing from a normal distribution
with mean 0 and variance $\sigma^2_e$.
3. Evaluate $Y_{ij} = \beta_0 + u_j + e_{ij}$ for all 864 pupils.
This will generate one simulation dataset for the current parameter values.
This dataset is then fitted using each method, and the whole procedure is repeated
1000 times. The datasets will be generated using a short C program.
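Purely as an illustration of this three-step recipe, a minimal Python sketch (with a hypothetical function name, and default true values taken from the JSP-like setting above) might look as follows; the actual datasets were generated with a short C program.

```python
import numpy as np

def simulate_vc(nj, beta0=30.0, s2u=10.0, s2e=40.0, rng=None):
    """Generate one variance components dataset for schools of sizes nj."""
    rng = rng or np.random.default_rng()
    school = np.repeat(np.arange(len(nj)), nj)            # school of each pupil
    u = rng.normal(0.0, np.sqrt(s2u), size=len(nj))       # one u_j per school
    e = rng.normal(0.0, np.sqrt(s2e), size=len(school))   # one e_ij per pupil
    y = beta0 + u[school] + e
    return y, school

# for example, study design 1 from Table 4.1:
y, school = simulate_vc([5, 10, 13, 18, 24, 38])
```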
The Gibbs sampling routines using the various prior distributions will be run
using the BUGS package while the IGLS and RIGLS estimates will be calculated
using MLwiN. The main reason to use BUGS and not MLwiN to perform the
Gibbs sampling runs is due to the computing resources available to me. I have
access to several UNIX Sun workstations on which I can run BUGS in parallel but
I only have one PC to use for MLwiN and this machine is also used for MLwiN
development work.
Comparing methods
To compare the bias of each method I can find the mean values of each parameter
estimate over the 1000 runs and the standard errors of these means. These means
can then be compared with the true answer. To compare the coverage, I want
to find how many of the 1000 runs contain the true value in an x% confidence
interval. Ideally x% of the runs will contain the true value in an x% confidence
interval. In particular I will concentrate on the 90% and 95% confidence intervals
as these are the most used confidence levels.
For the Gibbs sampling methods, it is easy to calculate how many of the $I$
iterations in each run are larger than the true value, and so I can then find whether
or not the true value lies in an x% credible interval without actually calculating
all the credible intervals. I will consider three proper priors: the Pareto(1,$c$) prior
for $1/\sigma^2$, the gamma($\epsilon, \epsilon$) prior for $1/\sigma^2$, and the scaled inverse chi-squared prior
for $\sigma^2$ mentioned in the earlier section of this chapter. The estimate obtained under the
gamma($\epsilon, \epsilon$) prior has been used as the parameter $\hat\sigma^2$ for the scaled inverse chi-squared prior. For a
particular method, the same prior will be used for both $\sigma^2_u$ and $\sigma^2_e$.
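As a small illustration of the interval-free coverage check just described, the following sketch (with a hypothetical function name) records whether a true value lies inside the central credible interval of a stored chain using only the proportion of draws above the true value.

```python
import numpy as np

def covered(chain, true_value, level=0.90):
    """Return True if true_value lies inside the central `level` credible
    interval of the MCMC draws in `chain`, judged only from the proportion
    of draws exceeding the true value (no interval is computed)."""
    p_above = np.mean(np.asarray(chain) > true_value)
    tail = (1.0 - level) / 2.0
    return tail < p_above < 1.0 - tail
```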
For IGLS, we can assume normality for the fixed effects parameter and
calculate a normal x% confidence interval. For the variances it is not so clear
how to calculate confidence intervals, so for now I will assume normality and later
I will consider another method of improving on this assumption.
Preliminary analysis for Gibbs sampling
To gauge how long the simulations will take, and consequently how many iterations
I can afford to run for each model, I have performed two preliminary tests. Firstly
I ran each of the 8 study designs with a generated dataset for 50,000 iterations
on a fast machine. This test will give an estimate of how long the simulation
studies will take. The results are in Table 4.2.
Table 4.2: Summary of times for Gibbs sampling in the variance components model with different study designs for 50,000 iterations.

Study   CPU time   Actual time   N
1       59s        64s           108
2       58s        68s           108
3       109s       128s          216
4       106s       107s          216
5       201s       203s          432
6       199s       203s          432
7       392s       398s          864
8       393s       411s          864
Secondly, to calculate how long to run each model I need to consider when
the model has produced estimates to a given accuracy. For this I will consider
the Raftery Lewis diagnostic at two percentiles, the standard 2.5 percentile and
the median. I will use a value for r in the Raftery Lewis notation of 0.01 instead
of 0.005 which is the default. This is because it takes a lot longer to obtain the
same accuracy for the median as the 2.5 percentile.
The results in Table 4.3 are the average values for N from 10 runs of the Gibbs
sampler, each of length 10,000 and with a burn-in of 1,000. The true values used
are $\beta_0 = 30$, $\sigma^2_u = 10$, and $\sigma^2_e = 40$.
Table 4.3: Summary of Raftery Lewis convergence times (thousands of iterations) for various studies.

Study   Prior    beta_0 (2.5/50)   sigma^2_u (2.5/50)   sigma^2_e (2.5/50)
1       Gamma    10/67             8/146                1/12
1       Pareto   20/138            3/87                 1/10
2       Gamma    13/79             8/69                 1/11
2       Pareto   19/168            2/91                 1/10
3       Gamma    6/76              10/34                1/10
3       Pareto   8/93              2/28                 1/10
5       Gamma    5/77              2/28                 1/10
5       Pareto   5/90              2/27                 1/10
7       Gamma    5/73              1/16                 1/10
7       Pareto   5/79              1/15                 1/10
The large N values for the median of $\beta_0$ are due to large serial correlation and
do not decrease much with the size of the study. The large N values for $\sigma^2_u$ are
due to the skewness of the posterior distribution, which decreases as the number of
schools increases. Although it is important to ensure the estimates have converged
to a given accuracy, this has to be balanced against the fact that many simulation
datasets are to be generated for each situation and time is therefore constrained.
It is important to realise that the models under study are in common usage and
are known to be unimodal, and so longer runs are only used to establish accuracy
in the estimates and not convergence of the chain. As I will be running many
simulations and then averaging the results, the accuracy of individual simulations
does not have to be so strict.
The number of iterations in each simulation run will depend on the size of
the study. The smaller studies need longer to converge but take less time per
iteration so can be run for longer. The lengths of simulation runs I chose are in
Table 4.4 below. Even with the modest values for N with the larger sample sizes,
the full simulation took 3 months to run using 3 Sun machines.
Table 4.4: Summary of simulation lengths for Gibbs sampling the variance components model with different study designs.

Study   Length   N
1       50,000   108
2       50,000   108
3       30,000   216
4       30,000   216
5       20,000   432
6       20,000   432
7       10,000   864
8       10,000   864
4.3.3 Results : Bias
In Table 4.5 the estimates of the relative bias for each method, obtained from
the simulations, are given for the eight different study designs. It can be seen
immediately from the table and Figures 4-2(i) and (ii) that the RIGLS method
gives the smallest bias on almost all study designs. The IGLS method generally
underestimates the variances, and in particular the level 2 variance. All the `non-
informative' MCMC methods overestimate the variances, and the Pareto prior
does particularly badly.
It is no great surprise that the level 1 variance is better estimated, and has
smaller percentage biases than the level 2 variance, as there are far more pupils
than schools. This reduction in relative bias is due to the bias of the estimates
being inversely related to the number of observations they are based on. The
study design also has a significant effect on how the methods perform. As the
number of schools is increased, and consequently the number of students is also
increased, all the methods give better estimates. The effect of using a balanced
design as opposed to an unbalanced design is unclear: the IGLS, RIGLS and
Pareto prior methods give better estimates with a balanced design, but the other
MCMC priors generally give worse estimates. The size of the dataset has far
more effect on the estimates than whether the design is balanced.
In Table 4.6 and Figures 4-2(iii) and (iv), study design 7, which is the
design most like the original JSP dataset and one where most methods
perform well, has been considered in more detail. The true values for the variance
Table 4.5: Estimates of relative bias for the variance parameters using different methods and different studies. True level 2/1 variance values are 10 and 40.

Level 2 variance relative bias % (Monte Carlo SE)

Study   IGLS            RIGLS          Gamma(ε,ε)     S.I.χ²         Pareto(1,c)
1       -22.64 (2.05)   -0.97 (2.48)   49.05 (4.05)   50.90 (4.03)   481.3 (10.23)
2       -20.07 (2.01)    0.03 (2.42)   51.41 (3.98)   52.75 (3.96)   449.8 (9.66)
3       -11.86 (1.57)   -0.99 (1.72)   18.36 (2.17)   18.66 (2.16)   74.89 (2.71)
4        -9.75 (1.38)    0.40 (1.51)   20.27 (2.08)   20.47 (2.07)   70.88 (2.60)
5        -2.37 (1.13)    3.10 (1.18)   11.98 (1.30)   12.00 (1.30)   30.77 (1.42)
6        -4.11 (1.12)    1.02 (1.16)    9.70 (1.28)    9.71 (1.28)   26.72 (1.41)
7        -2.14 (0.85)    0.52 (0.86)    4.69 (0.90)    4.70 (0.90)   12.48 (0.95)
8        -2.02 (0.81)    0.53 (0.82)    4.75 (0.86)    4.75 (0.86)   12.04 (0.90)

Level 1 variance relative bias % (Monte Carlo SE)

Study   IGLS            RIGLS           Gamma(ε,ε)     S.I.χ²         Pareto(1,c)
1       -0.42 (0.453)   -0.42 (0.453)   2.79 (0.473)   2.61 (0.471)   3.47 (0.470)
2       -0.45 (0.458)   -0.41 (0.458)   2.78 (0.478)   2.62 (0.476)   3.58 (0.476)
3       -0.02 (0.320)   -0.03 (0.320)   1.63 (0.328)   1.59 (0.328)   2.02 (0.325)
4       -0.16 (0.323)   -0.16 (0.323)   1.43 (0.322)   1.40 (0.321)   1.94 (0.319)
5       -0.31 (0.223)   -0.31 (0.223)   0.28 (0.224)   0.28 (0.224)   0.66 (0.224)
6       -0.15 (0.223)   -0.15 (0.223)   0.42 (0.224)   0.42 (0.224)   0.83 (0.224)
7       -0.04 (0.158)   -0.04 (0.158)   0.25 (0.158)   0.25 (0.158)   0.42 (0.158)
8       -0.09 (0.158)   -0.09 (0.158)   0.19 (0.158)   0.19 (0.158)   0.38 (0.158)
parameters have then been modified. It can be seen that the IGLS method again
underestimates the true values and that the RIGLS method corrects for this.
What is surprising is the effect obtained when the level 2 variance is set
to values much less than the level 1 variance. Here the MCMC methods with
the gamma and S.I.$\chi^2$ priors now underestimate the level 2 variance. The
corresponding level 1 variance is still over-estimated, and so perhaps some of the
level 2 variance is being estimated as level 1 variance. The Pareto prior biases
are not similarly affected by modifying the true values of the variances; in fact
the percentage bias in the estimate of the level 2 variance increases when the true
value of the level 1 variance is increased.
So to summarise, the RIGLS method gives the least biased estimates in all the
scenarios studied. The IGLS method underestimates the variance parameters,
while all the MCMC methods overestimate the variances except when the true
value of $\sigma^2_e$ is much greater than the true value of $\sigma^2_u$. Of the MCMC methods,
[Figure appears here: four panels showing relative % bias for the level 2 variance (panels (i) and (iii)) and the level 1 variance (panels (ii) and (iv)), plotted against simulation design in (i) and (ii) and against parameter settings in (iii) and (iv), with lines for IGLS, RIGLS, Gamma, SI chi-squared and Pareto.]

Figure 4-2: Plots of biases obtained for the various methods against study design and parameter settings.
Table 4.6: Estimates of relative bias for the variance parameters using different methods and different true values. All runs use study design 7.

Level 2 variance relative bias % (Monte Carlo SE)

Level 2/1 variances   IGLS           RIGLS          Gamma(ε,ε)      S.I.χ²          Pareto(1,c)
1/10                  -3.10 (1.08)   0.36 (1.11)      3.15 (1.19)     3.28 (1.20)   15.92 (1.22)
1/40                  -6.54 (2.07)   0.31 (2.14)    -18.48 (2.14)   -20.01 (2.18)   39.57 (2.18)
1/80                  -3.43 (3.02)   7.18 (3.16)    -22.81 (2.54)   -24.60 (2.61)   84.53 (3.11)
10/10                 -1.71 (0.72)   0.53 (0.73)      4.94 (0.76)     4.95 (0.76)   10.56 (0.79)
10/40                 -2.14 (0.85)   0.52 (0.86)      4.69 (0.90)     4.70 (0.90)   12.48 (0.95)
10/80                 -2.77 (1.01)   0.43 (1.03)      3.68 (1.09)     3.69 (1.09)   14.84 (1.13)
40/40                 -1.71 (0.72)   0.53 (0.73)      4.94 (0.76)     4.95 (0.76)   10.56 (0.79)
40/80                 -1.88 (0.76)   0.50 (0.77)      4.90 (0.81)     4.91 (0.81)   11.23 (0.85)

Level 1 variance relative bias % (Monte Carlo SE)

Level 2/1 variances   IGLS           RIGLS          Gamma(ε,ε)    S.I.χ²        Pareto(1,c)
1/10                  -0.04 (0.16)   -0.04 (0.16)   0.43 (0.16)   0.42 (0.16)   0.48 (0.16)
1/40                  -0.08 (0.16)   -0.07 (0.16)   0.73 (0.16)   0.76 (0.16)   0.24 (0.16)
1/80                  -0.18 (0.16)   -0.16 (0.16)   0.64 (0.16)   0.66 (0.16)   0.17 (0.16)
10/10                 -0.15 (0.16)   -0.15 (0.16)   0.19 (0.16)   0.19 (0.16)   0.42 (0.16)
10/40                 -0.04 (0.16)   -0.04 (0.16)   0.25 (0.16)   0.25 (0.16)   0.42 (0.16)
10/80                 -0.04 (0.16)   -0.04 (0.16)   0.33 (0.16)   0.32 (0.16)   0.42 (0.16)
40/40                 -0.15 (0.16)   -0.15 (0.16)   0.19 (0.16)   0.19 (0.16)   0.42 (0.16)
40/80                 -0.05 (0.16)   -0.05 (0.16)   0.21 (0.16)   0.21 (0.16)   0.42 (0.16)
the Pareto prior gives far greater bias than the other priors.
4.3.4 Results : Coverage probabilities and interval widths
The Bayesian MCMC methods are not designed specifically to give unbiased
estimates. In the Bayesian framework, interval estimates and coverage
probabilities are considered more important. The maximum likelihood IGLS and
RIGLS methods are not ideally suited for finding interval estimates and coverage
probabilities, as additional assumptions now have to be made to create intervals.
In the following tables I have made the assumption that all the parameters have
Gaussian distributions. In the case of the level 2 variance parameter this is a
very implausible assumption and I will consider an alternative assumption later
in this chapter.
I will firstly consider the fixed effect parameter $\beta_0$, where it is plausible to
assume there is a Gaussian posterior distribution.
Table 4.7 contains the coverage probabilities for the fixed effect parameter
using the eight different study designs, and Table 4.8 contains the corresponding
interval widths. It can be seen in Table 4.7 that the gamma and S.I.$\chi^2$ priors
perform significantly better than the RIGLS method. All three methods have
actual coverage that is too small, and the IGLS method gives the smallest
coverage. The Pareto method gives actual coverage that is much too high for the
smaller studies but gives better coverage as study size increases. In fact all
methods perform better the larger the study, and generally the coverage is slightly
better when the design is balanced.
Table 4.8 echoes the results in Table 4.7 in that the gamma and S.I.$\chi^2$ intervals
are on average slightly wider than the IGLS and RIGLS intervals, which leads to
better coverage. The Pareto prior gives intervals in the smaller studies that are
on average almost twice as wide as the other methods and consequently gives too
much coverage. As the studies get larger the Pareto intervals get closer in size to
those of the other methods and the method performs better.
In Table 4.9, when only study 7 is considered there are far smaller differences
between the various methods. The Pareto prior does slightly better than all
the other methods, while the other MCMC methods generally improve on
the maximum likelihood based methods. It can be seen here that when the level 2
variance is much smaller than the level 1 variance, the gamma and S.I.$\chi^2$ priors do
worse than the IGLS and RIGLS methods. Table 4.10 shows the corresponding
interval widths, which are very similar for all methods.
Table 4.11 considers the coverage for the level 2 variance parameter $\sigma^2_u$
using the eight different study designs. It can be seen that there is a far greater
discrepancy between the maximum likelihood methods and the MCMC methods
for this parameter, but this is to some extent due to the normality assumption.
RIGLS and IGLS give coverage probabilities that are much smaller than the
nominal levels. The Pareto prior does better than the other priors
when the study size is small, but there is very little to choose between the priors
as the size gets larger. All methods give coverage probabilities that are smaller
than they should be.
Table 4.12 shows that the Pareto prior has average interval widths that are
four times larger than the other priors for studies 1 and 2. The size of the average
Table 4.7: Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the fixed effect parameter using different methods and different studies. True values for the variance parameters are 10 and 40. Approximate MCSEs are 0.28%/0.15% for 90%/95% coverage estimates.

Study   IGLS        RIGLS       Gamma(ε,ε)   S.I.χ²      Pareto(1,c)
1       81.7/86.8   85.0/89.4   85.8/91.1    86.0/91.3   97.8/99.8
2       81.5/88.5   85.3/90.6   86.9/92.7    86.9/93.0   97.7/99.5
3       85.4/90.5   87.1/91.6   88.4/93.6    88.4/93.7   94.0/97.6
4       86.6/91.5   88.2/92.9   89.4/94.4    89.5/94.4   94.2/97.2
5       88.3/93.3   89.0/94.0   89.8/94.2    89.8/94.3   91.4/95.6
6       88.0/93.9   89.0/94.2   89.8/94.9    89.8/94.9   91.3/95.5
7       88.7/93.2   88.8/93.4   89.4/93.5    89.4/93.5   89.8/94.2
8       88.8/94.5   89.4/94.6   90.1/95.3    90.1/95.3   91.2/95.7

Table 4.8: Average 90%/95% interval widths for the fixed effect parameter using different studies. True values for the variance parameters are 10 and 40.

Study   IGLS        RIGLS       Gamma(ε,ε)   S.I.χ²      Pareto(1,c)
1       4.15/4.94   4.57/5.44   5.00/6.39    5.04/6.42   8.74/12.00
2       4.10/4.88   4.48/5.33   4.96/6.34    4.98/6.37   8.15/11.15
3       3.18/3.79   3.33/3.97   3.57/4.37    3.57/4.38   4.28/5.29
4       3.11/3.71   3.25/3.88   3.47/4.26    3.48/4.26   4.09/5.08
5       2.35/2.80   2.40/2.87   2.47/2.99    2.47/2.99   2.62/3.15
6       2.28/2.72   2.31/2.78   2.41/2.92    2.41/2.92   2.53/3.05
7       1.67/1.99   1.69/2.01   1.71/2.05    1.71/2.05   1.76/2.10
8       1.64/1.95   1.65/1.97   1.69/2.02    1.69/2.02   1.73/2.08

Table 4.9: Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the fixed effect parameter using different methods and different true values. All runs use study design 7. Approximate MCSEs are 0.28%/0.15% for 90%/95% coverage estimates.

Level 2/1 variances   IGLS        RIGLS       Gamma(ε,ε)   S.I.χ²      Pareto(1,c)
1/10                  88.4/92.5   88.7/92.9   88.5/93.1    88.5/93.1   89.2/93.9
1/40                  87.9/93.2   88.6/93.5   87.2/92.9    87.1/92.7   90.5/95.3
1/80                  88.3/93.8   88.6/94.0   88.2/93.9    88.1/93.8   91.7/96.1
10/10                 89.0/93.4   89.3/93.5   89.7/94.7    89.7/94.7   90.1/95.1
10/40                 88.7/93.2   88.8/93.4   89.4/93.5    89.4/93.5   89.8/94.2
10/80                 88.5/92.4   88.7/92.8   88.7/93.4    88.7/93.4   89.6/94.0
40/40                 89.0/93.4   89.3/93.5   89.8/94.8    89.7/94.7   90.1/95.1
40/80                 88.8/93.3   89.2/93.5   89.1/94.1    89.1/94.1   89.7/94.7
Table 4.10: Average 90%/95% interval widths for the fixed effect parameter using different true parameter values. All runs use study design 7.

Level 2/1 variances   IGLS        RIGLS       Gamma(ε,ε)   S.I.χ²      Pareto(1,c)
1/10                  0.60/0.71   0.61/0.72   0.61/0.73    0.61/0.73   0.63/0.76
1/40                  0.86/1.02   0.87/1.04   0.84/1.01    0.84/1.01   0.91/1.11
1/80                  1.12/1.33   1.13/1.35   1.10/1.31    1.10/1.31   1.21/1.46
10/10                 1.53/1.83   1.55/1.85   1.60/1.92    1.60/1.92   1.63/1.96
10/40                 1.67/1.99   1.69/2.01   1.71/2.05    1.71/2.05   1.76/2.10
10/80                 1.82/2.17   1.84/2.20   1.85/2.23    1.85/2.23   1.92/2.30
40/40                 3.06/3.65   3.10/3.69   3.19/3.84    3.19/3.84   3.26/3.91
40/80                 3.16/3.76   3.19/3.80   3.26/3.93    3.26/3.93   3.35/4.03

Table 4.11: Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the level 2 variance parameter using different methods and different studies. True values of the variance parameters are 10 and 40. Approximate MCSEs are 0.28%/0.15% for 90%/95% coverage estimates.

Study   IGLS        RIGLS       Gamma(ε,ε)   S.I.χ²      Pareto(1,c)
1       68.3/71.9   75.3/78.5   81.2/88.9    81.9/89.0   84.3/91.5
2       69.5/73.3   77.4/80.4   81.6/88.5    82.0/88.6   84.7/90.7
3       77.0/80.9   81.7/86.2   86.8/92.2    87.3/92.3   88.1/94.2
4       78.0/83.0   83.3/87.1   88.2/93.7    88.2/93.9   88.2/93.0
5       86.8/89.5   88.6/91.2   88.5/94.0    88.5/94.0   88.5/93.8
6       84.4/88.5   87.3/90.2   88.3/93.9    88.3/93.9   87.5/93.5
7       87.4/91.4   88.4/92.4   88.5/93.8    88.5/93.8   88.6/93.0
8       87.1/90.7   87.3/91.1   87.8/93.4    87.8/93.4   87.7/93.2

Table 4.12: Average 90%/95% interval widths for the level 2 variance parameter using different studies. True values of the variance parameters are 10 and 40.

Study   IGLS          RIGLS         Gamma(ε,ε)    S.I.χ²        Pareto(1,c)
1       19.43/23.15   23.73/28.28   41.56/59.45   41.82/59.82   182.21/298.85
2       19.18/22.86   22.99/27.39   40.77/58.11   40.93/58.34   164.52/273.34
3       15.64/18.63   17.12/20.40   22.78/29.50   22.79/29.51   35.21/46.65
4       15.11/18.01   16.39/19.52   21.94/28.44   21.93/28.43   33.10/44.00
5       11.84/14.10   12.35/14.72   14.14/17.50   14.15/17.50   16.89/20.97
6       11.23/13.38   11.69/13.93   13.40/16.57   13.40/16.57   15.86/19.71
7       8.33/9.93     8.52/10.15    9.05/10.95    9.05/10.95    9.77/11.84
8       8.08/9.63     8.25/9.83     8.79/10.63    8.79/10.63    9.46/11.46
Table 4.13: Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the level 2 variance parameter using different methods and different true values. All runs use study design 7. Approximate MCSEs are 0.28%/0.15% for 90%/95% coverage estimates.

Level 2/1 variances   IGLS        RIGLS       Gamma(ε,ε)   S.I.χ²      Pareto(1,c)
1/10                  86.1/90.7   87.0/91.8   85.9/92.6    85.8/92.5   88.4/93.0
1/40                  84.0/88.0   85.3/89.4   76.2/84.6    74.2/83.2   91.4/95.5
1/80                  77.7/78.5   79.1/80.4   76.6/89.5    73.8/85.7   92.2/95.9
10/10                 86.2/91.4   88.1/92.7   88.3/94.3    88.3/94.3   87.4/92.8
10/40                 87.4/91.4   88.4/92.4   88.5/93.8    88.5/93.8   88.6/93.0
10/80                 86.3/90.1   86.9/91.7   86.6/93.5    86.6/93.5   88.7/93.1
40/40                 86.2/91.4   88.1/92.7   88.3/94.3    88.3/94.3   87.4/92.8
40/80                 86.9/92.1   88.6/92.9   88.1/93.8    88.1/93.8   88.2/92.9

Table 4.14: Average 90%/95% interval widths for the level 2 variance parameter using different true parameter values. All runs use study design 7.

Level 2/1 variances   IGLS          RIGLS         Gamma(ε,ε)    S.I.χ²        Pareto(1,c)
1/10                  1.07/1.27     1.09/1.30     1.15/1.40     1.16/1.40     1.26/1.53
1/40                  2.00/2.39     2.08/2.47     1.88/2.28     1.85/2.24     2.48/2.99
1/80                  2.93/3.49     3.07/3.66     2.35/2.95     2.29/2.89     3.89/4.71
10/10                 7.06/8.41     7.21/8.59     7.71/9.34     7.71/9.34     8.23/9.97
10/40                 8.33/9.93     8.52/10.15    9.05/10.95    9.05/10.95    9.77/11.84
10/80                 9.91/11.81    10.13/12.07   10.71/12.98   10.71/12.98   11.69/14.17
40/40                 28.23/33.63   28.85/34.37   30.84/37.36   30.83/37.35   32.91/39.89
40/80                 29.97/35.71   30.63/36.50   32.69/39.55   32.69/39.55   35.04/42.45

Table 4.15: Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the level 1 variance parameter using different methods and different studies. True values of the variance parameters are 10 and 40. Approximate MCSEs are 0.28%/0.15% for 90%/95% coverage estimates.

Study   IGLS        RIGLS       Gamma(ε,ε)   S.I.χ²      Pareto(1,c)
1       87.8/92.5   87.6/92.7   88.0/93.8    88.1/94.0   88.5/94.2
2       87.8/92.4   87.9/92.4   88.7/93.6    88.8/93.6   89.1/93.4
3       88.8/94.4   88.8/94.4   89.1/94.3    89.2/94.5   89.3/94.7
4       89.5/94.8   89.5/94.8   89.3/94.8    89.3/94.8   89.5/95.0
5       89.0/94.2   89.1/94.2   88.7/94.2    88.7/94.2   89.2/93.9
6       89.1/93.7   89.1/93.7   89.3/94.1    89.4/94.1   89.3/94.3
7       89.0/94.9   88.9/94.9   89.4/95.7    89.4/95.7   89.6/95.7
8       88.9/94.4   88.9/94.4   89.6/94.1    89.6/94.1   89.8/94.3
Table 4.16: Average 90%/95% interval widths for the level 1 variance parameter using different studies. True values of the variance parameters are 10 and 40.

Study   IGLS          RIGLS         Gamma(ε,ε)    S.I.χ²        Pareto(1,c)
1       18.29/21.79   18.29/21.80   19.26/23.17   19.18/23.07   19.34/23.14
2       18.32/21.82   18.33/21.84   19.29/23.20   19.21/23.11   19.38/23.17
3       13.01/15.51   13.01/15.51   13.42/16.03   13.40/16.00   13.42/16.05
4       13.01/15.50   13.01/15.50   13.40/16.01   13.39/15.99   13.42/16.05
5       9.18/10.94    9.18/10.94    9.24/11.03    9.24/11.03    9.31/11.11
6       9.20/10.96    9.20/10.96    9.26/11.05    9.26/11.05    9.33/11.14
7       6.51/7.76     6.51/7.76     6.50/7.76     6.50/7.76     6.50/7.77
8       6.51/7.76     6.51/7.76     6.50/7.75     6.50/7.75     6.49/7.76

Table 4.17: Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the level 1 variance parameter using different methods and different true values. All runs use study design 7. Approximate MCSEs are 0.28%/0.15% for 90%/95% coverage estimates.

Level 2/1 variances   IGLS        RIGLS       Gamma(ε,ε)   S.I.χ²      Pareto(1,c)
1/10                  89.1/95.0   89.1/95.0   89.3/95.4    89.3/95.4   89.1/95.1
1/40                  89.4/95.1   89.5/95.0   90.3/95.5    89.9/95.4   89.7/95.6
1/80                  89.6/94.7   89.5/94.8   90.2/95.6    90.3/95.6   89.7/95.5
10/10                 88.6/95.1   88.6/95.1   89.4/95.6    89.4/95.6   89.9/95.7
10/40                 89.0/94.9   88.9/94.9   89.4/95.7    89.4/95.7   89.6/95.7
10/80                 89.1/95.0   89.1/94.9   89.8/96.0    89.8/96.0   89.7/95.8
40/40                 88.6/95.1   88.6/95.1   89.4/95.6    89.4/95.6   89.9/95.7
40/80                 88.9/95.1   88.9/95.1   89.5/95.5    89.5/95.5   89.8/95.7

Table 4.18: Average 90%/95% interval widths for the level 1 variance parameter using different true parameter values. All runs use study design 7.

Level 2/1 variances   IGLS          RIGLS         Gamma(ε,ε)    S.I.χ²        Pareto(1,c)
1/10                  1.63/1.94     1.63/1.94     1.63/1.95     1.63/1.95     1.62/1.94
1/40                  6.48/7.72     6.48/7.72     6.51/7.77     6.51/7.77     6.44/7.71
1/80                  12.89/15.35   12.89/15.36   12.85/15.35   12.84/15.34   12.80/15.32
10/10                 1.63/1.94     1.63/1.94     1.62/1.93     1.62/1.93     1.62/1.94
10/40                 6.51/7.76     6.51/7.76     6.50/7.76     6.50/7.76     6.50/7.77
10/80                 13.02/15.51   13.01/15.51   13.02/15.57   13.02/15.57   12.98/15.54
40/40                 6.51/7.75     6.51/7.75     6.50/7.74     6.50/7.74     6.50/7.76
40/80                 13.02/15.51   13.02/15.52   13.00/15.49   13.00/15.49   13.00/15.52
intervals for the maximum likelihood methods may be artificially small due to
the assumption of normality.
When the values of the variance parameters are modified and the study design
is design 7, it can be seen (Table 4.13) that the Pareto prior still has the best
coverage intervals. The other MCMC methods do better than the maximum
likelihood based methods except when the true value of $\sigma^2_u$ is much smaller than
the true value of $\sigma^2_e$, when the situation is reversed.
This point is emphasised in Table 4.14, where it can be seen that the intervals
for the gamma and S.I.$\chi^2$ priors are narrower than for the other methods when $\sigma^2_u = 1$
and $\sigma^2_e = 40$ or 80. The Pareto prior intervals are slightly wider, which explains
why they cover better.
When comparing the coverage for the level 1 variance $\sigma^2_e$ in Table 4.15, it is
easy to see that there is little to choose between the methods. When the study
is small the MCMC methods do slightly better than the maximum likelihood
methods, and overall the Pareto prior generally gives the best coverage. This
parameter has much more accurate coverage intervals than the other parameters.
In Table 4.16 it can be seen that the MCMC intervals are wider when there
are only 108 pupils, but as the size of the study gets bigger the intervals become
virtually identical. The same behaviour would probably be seen with the level 2
variance if the number of schools was made a lot larger.
Table 4.17 shows the coverage probabilities when we only consider study
design 7. Here there is nothing to choose between any of the methods and all the
coverage probabilities are very good. Table 4.18 confirms this by showing that
the intervals from all the models are virtually identical.
4.3.5 Improving maximum likelihood method interval estimates for $\sigma^2_u$

From the above results it is clear that the Gaussian distribution is a bad
approximation to use when calculating interval estimates for the level 2 variance
parameter $\sigma^2_u$. I have shown in the Gibbs sampling algorithm that the true
conditional distribution for $\sigma^2_u$ is an inverse gamma distribution, and I will now
try to use this fact to construct confidence intervals for $\sigma^2_u$ based on the inverse
gamma distribution.
For each of the 1,000 simulations generated, RIGLS gives an estimate of $\sigma^2_u$,
$\hat\sigma^2_u = \theta$, and a variance for $\hat\sigma^2_u$, $\mathrm{var}(\hat\sigma^2_u) = \phi$. Now if I use the assumption that $\sigma^2_u$
has an inverse gamma($\alpha, \beta$) distribution, then the mean and variance of $\sigma^2_u$ can
be used to calculate the appropriate distribution as follows:

$$\theta = \frac{\beta}{\alpha - 1} \;\rightarrow\; \beta = \theta(\alpha - 1),$$
$$\phi = \frac{\beta^2}{(\alpha-1)^2(\alpha-2)} = \frac{\theta^2(\alpha-1)^2}{(\alpha-1)^2(\alpha-2)} \;\rightarrow\; \alpha = \frac{\theta^2}{\phi} + 2 \quad\text{and}\quad \beta = \frac{\theta^3}{\phi} + \theta.$$

Now having found the required inverse gamma distribution, to construct an
x% confidence interval the quantiles of the distribution need to be found.
Instead of using the inverse gamma distribution directly, the equivalent gamma
distribution for $1/\sigma^2_u$ is used and the quantile points are inverted:

$$x\%\ \mathrm{CI} = \left(\frac{1}{\mathrm{Gam}_{\mathrm{upper}}\big(\tfrac{\theta^2}{\phi}+2,\ \tfrac{\theta^3}{\phi}+\theta\big)},\ \frac{1}{\mathrm{Gam}_{\mathrm{lower}}\big(\tfrac{\theta^2}{\phi}+2,\ \tfrac{\theta^3}{\phi}+\theta\big)}\right),$$

where $\mathrm{Gam}_{\mathrm{lower}}$ and $\mathrm{Gam}_{\mathrm{upper}}$ denote the $\tfrac{100-x}{2}$% and $\tfrac{100+x}{2}$% quantile points of the
gamma distribution with these parameters.
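As a small illustration of this calculation, a sketch in Python using scipy's gamma quantile function might look as follows; the helper name is hypothetical and $\theta$ and $\phi$ are assumed to be the RIGLS estimate and its estimated variance as above.

```python
from scipy.stats import gamma

def inv_gamma_interval(theta, phi, level=0.95):
    """Interval for sigma^2_u from its RIGLS estimate theta and the
    estimated variance phi of that estimate, assuming an inverse gamma shape."""
    a = theta ** 2 / phi + 2.0        # alpha
    b = theta ** 3 / phi + theta      # beta (the rate of the gamma for 1/sigma^2_u)
    lo = (1.0 - level) / 2.0
    hi = 1.0 - lo
    # quantiles of the equivalent gamma distribution for 1/sigma^2_u, then
    # inverted: the upper gamma quantile gives the lower variance endpoint
    lower = 1.0 / gamma.ppf(hi, a, scale=1.0 / b)
    upper = 1.0 / gamma.ppf(lo, a, scale=1.0 / b)
    return lower, upper
```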
The results obtained using the inverse gamma approach for the RIGLS
method can be seen in Table 4.19. In comparing these results with the results
obtained using a Gaussian interval for the level 2 variance (Tables 4.11 - 4.14)
two points emerge.
Firstly, the inverse gamma method generally gives worse coverage at the 90%
level than the Gaussian method but better coverage at the 95% level. This is
probably due to the skewed form of the confidence interval. The exception to
this rule is when the level 2 variance is small, in which case the inverse gamma method
does much worse than the Gaussian method. This may now explain why the
MCMC methods do badly in these cases, as the Gaussian interval for RIGLS was
an unfair comparison, and we now see that the MCMC methods are performing
Table 4.19: Summary of results for the level 2 variance parameter, $\sigma^2_u$, using the RIGLS method and inverse gamma intervals.

Coverage probabilities

Study   90%    95%      Level 2/1 variances   90%    95%
1       68.3   77.3     1/10                  84.8   91.9
2       70.4   79.0     1/40                  71.7   78.6
3       78.1   87.1     1/80                  56.8   65.1
4       77.0   86.3     10/10                 86.3   93.9
5       85.2   91.7     10/40                 86.2   92.7
6       84.3   91.1     10/80                 84.4   92.5
7       86.2   92.7     40/40                 86.3   93.9
8       86.1   92.2     40/80                 86.8   93.7

Average interval widths (90%/95%)

Study   90%      95%       Level 2/1 variances   90%      95%
1       18.004   24.064    1/10                  1.024    1.269
2       17.778   23.644    1/40                  1.589    2.088
3       14.678   18.919    1/80                  1.954    2.645
4       12.913   16.591    10/10                 7.024    8.523
5       11.466   14.293    10/40                 8.206    10.023
6       10.911   13.567    10/80                 9.602    11.839
7       8.206    10.023    40/40                 28.097   34.092
8       7.966    9.716     40/80                 29.731   36.152
better in terms of coverage than the maximum likelihood methods. Secondly,
the inverse gamma intervals are on average much narrower than the Gaussian
intervals.
4.3.6 Summary of results
Although these results do not look good for the Gibbs sampling methods in terms
of bias, particularly for the smaller datasets, the reader must realise that
even the largest JSP dataset studied, with 48 schools, is a small dataset in multi-
level modelling terms. If I were to consider larger datasets there would be less bias
and better agreement between the methods over coverage of intervals. I have also
only considered using the chain mean as a parameter estimate. As an alternative
I could have considered the median or mode of the parameter which, for the
variances, will give smaller estimates and hence less bias.
Overall the MCMC methods appear to have an edge in terms of coverage
probabilities, particularly when the inverse gamma intervals are used for the
IGLS/RIGLS methods.
The main danger is that people who are currently using maximum likelihood
methods may decide to only use the MCMC methods on the smaller datasets as
in computational terms the MCMC methods take much longer than the other
methods. The other danger is that people who are unfamiliar with multi-level
modelling will not realise that due to the structure of the problems studied, it is
not only the number of level 1 units that is important but also the number of level
2 units, particularly when estimating the level 2 variance. Models with 6 or even
12 level 2 units would probably not be considered large enough for multi-level
modelling. To compare the multivariate priors described in the earlier section I
will now consider a second model.
4.4 Random slopes regression model
In order to compare the various priors for a variance matrix, I will now need to
consider a more complicated model. One of the simplest multi-level models that
includes a variance matrix is the random slopes regression model introduced in the
last chapter. The random slopes regression model can be written mathematically
as:

$$y_{ij} = \beta_0 + \beta_1 X_{ij} + u_{0j} + u_{1j}X_{ij} + e_{ij},$$
$$u_j = \begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim \mathrm{MVN}(0, V_u), \qquad e_{ij} \sim N(0, \sigma^2_e),$$

where $i = 1, \ldots, n_j$, $j = 1, \ldots, J$ and $\sum_j n_j = N$.

I will re-express the first line of this model as follows:

$$y_{ij} = \beta_0 X_{0ij} + \beta_1 X_{1ij} + u_{0j}X_{0ij} + u_{1j}X_{1ij} + e_{ij},$$

where $X_{0ij}$ is constant and $X_{1ij} = X_{ij}$ in the previous notation. I will now use $X_{ij}$
to mean the vector $(X_{0ij}, X_{1ij})$ as this change will help simplify the expressions
in the algorithms that follow.
4.4.1 Gibbs sampling algorithm
As with the variance components model the parameters can be split into four
groups: the fixed effects $\beta$, the level 2 residuals $u_j$, the level 2 variance matrix
$V_u$ and the level 1 variance $\sigma^2_e$. The conditional posterior distributions can be
found using similar methodology to that used for the variance components model,
so I will just outline the posteriors without explaining how to obtain them.

Prior distributions

I will assume a uniform prior for the fixed effect parameters $\beta_0$ and $\beta_1$. The
level 1 variance $\sigma^2_e$ will take various univariate priors in the simulations; in the
algorithm I will use a general scaled inverse $\chi^2$ prior with parameters $\nu_e$ and $s^2_e$.
The level 2 variance matrix will similarly take various multivariate priors and so
I will assume a general Wishart prior with parameters $\nu_p$ and $S_p$ for the precision
matrix at level 2. All the priors in the earlier section can then be obtained from
particular values of these parameters. The algorithm is then as follows:
Step 1. $p(\beta \mid y, V_u, \sigma^2_e, u)$

Let $\beta \sim N(\hat\beta, \hat D)$; then to find $\hat\beta$ and $\hat D$:

$$p(\beta \mid y, V_u, \sigma^2_e, u) \propto \prod_{i,j}\Big(\frac{1}{\sigma^2_e}\Big)^{\frac{1}{2}}\exp\Big[-\frac{1}{2\sigma^2_e}(y_{ij} - X_{ij}u_j - X_{ij}\beta)^2\Big]$$
$$\propto \exp\Big[-\frac{1}{2\sigma^2_e}\,\beta^T\Big(\sum_{i,j}X_{ij}^T X_{ij}\Big)\beta + \frac{1}{\sigma^2_e}\,\beta^T\sum_{i,j}X_{ij}^T(y_{ij} - X_{ij}u_j) + \text{const}\Big].$$

Comparing this with the form of a multivariate normal distribution and matching
powers of $\beta$ gives

$$\hat D = \sigma^2_e\Big[\sum_{i,j}X_{ij}^T X_{ij}\Big]^{-1},$$
and
$$\hat\beta = \Big[\sum_{i,j}X_{ij}^T X_{ij}\Big]^{-1}\sum_{i,j}X_{ij}^T(y_{ij} - X_{ij}u_j) = \frac{\hat D}{\sigma^2_e}\sum_{i,j}X_{ij}^T(y_{ij} - X_{ij}u_j).$$
Step 2. $p(u_j \mid y, V_u, \sigma^2_e, \beta)$

Let $u_j \sim N(\hat u_j, \hat D_j)$; then to find $\hat u_j$ and $\hat D_j$:

$$p(u_j \mid y, V_u, \sigma^2_e, \beta) \propto \prod_{i=1}^{n_j}\Big(\frac{1}{\sigma^2_e}\Big)^{\frac{1}{2}}\exp\Big[-\frac{1}{2\sigma^2_e}(y_{ij} - X_{ij}\beta - X_{ij}u_j)^2\Big] \times |V_u|^{-\frac{1}{2}}\exp\Big[-\frac{1}{2}u_j^T V_u^{-1}u_j\Big]$$
$$\propto \exp\Big(-\frac{1}{2}\,u_j^T\Big[\sum_{i=1}^{n_j}\frac{X_{ij}^T X_{ij}}{\sigma^2_e} + V_u^{-1}\Big]u_j + \frac{1}{\sigma^2_e}\,u_j^T\sum_{i=1}^{n_j}X_{ij}^T(y_{ij} - X_{ij}\beta) + \text{const}\Big).$$

Comparing this with the form of a multivariate normal distribution and matching
powers of $u_j$ gives

$$\hat D_j = \Big[\frac{\sum_{i=1}^{n_j}X_{ij}^T X_{ij}}{\sigma^2_e} + V_u^{-1}\Big]^{-1}, \qquad \hat u_j = \frac{\hat D_j}{\sigma^2_e}\sum_{i=1}^{n_j}X_{ij}^T(y_{ij} - X_{ij}\beta).$$
Step 3. $p(V_u \mid y, \beta, u, \sigma^2_e)$

Consider instead $p(V_u^{-1} \mid y, \beta, u, \sigma^2_e)$ and let $V_u^{-1} \sim \mathrm{Wishart}(\hat\nu_u, \hat S_u)$. Then letting
$p(V_u^{-1}) \sim \mathrm{Wishart}(\nu_p, S_p)$ gives

$$p(V_u^{-1} \mid y, \beta, u, \sigma^2_e) \propto \prod_{j=1}^{J}|V_u|^{-\frac{1}{2}}\exp\Big(-\frac{1}{2}u_j^T V_u^{-1}u_j\Big)\,p(V_u^{-1})$$
$$\propto |V_u|^{-\frac{J}{2}}\exp\Big(-\frac{1}{2}\sum_{j=1}^{J}u_j^T V_u^{-1}u_j\Big) \times |V_u^{-1}|^{(\nu_p-3)/2}\exp\Big(-\frac{1}{2}\mathrm{tr}(S_p^{-1}V_u^{-1})\Big)$$
$$\propto |V_u^{-1}|^{(J+\nu_p-3)/2}\exp\Big(-\frac{1}{2}\mathrm{tr}\Big(\Big(\sum_{j=1}^{J}u_j u_j^T + S_p^{-1}\Big)V_u^{-1}\Big)\Big).$$

Then comparing this with the form of a Wishart distribution produces

$$\hat\nu_u = J + \nu_p \quad\text{and}\quad \hat S_u = \Big(\sum_{j=1}^{J}u_j u_j^T + S_p^{-1}\Big)^{-1}.$$

The uniform prior on $V_u$ is equivalent to $\nu_p = -3$, $S_p^{-1} = 0$.
Step 4. $p(\sigma^2_e \mid y, \beta, u, V_u)$

Consider instead $p(1/\sigma^2_e \mid y, \beta, u, V_u)$ and let $1/\sigma^2_e \sim \mathrm{gamma}(\hat a_e, \hat b_e)$. Then
$p(1/\sigma^2_e) = (1/\sigma^2_e)^{-2}p(\sigma^2_e)$ and so

$$p(1/\sigma^2_e \mid y, \beta, u, V_u) \propto \prod_{i,j}\Big(\frac{1}{\sigma^2_e}\Big)^{\frac{1}{2}}\exp\Big[-\frac{1}{2\sigma^2_e}(y_{ij} - X_{ij}\beta - X_{ij}u_j)^2\Big]\Big(\frac{1}{\sigma^2_e}\Big)^{-2}p(\sigma^2_e)$$
$$\propto \Big(\frac{1}{\sigma^2_e}\Big)^{\frac{N}{2}+\frac{\nu_e}{2}-1}\exp\Big[-\frac{1}{2\sigma^2_e}\Big(\sum_{i,j}(y_{ij} - X_{ij}\beta - X_{ij}u_j)^2 + \nu_e s_e^2\Big)\Big].$$

Then comparing this with the form of a gamma distribution produces

$$\hat a_e = \frac{N+\nu_e}{2} \quad\text{and}\quad \hat b_e = \frac{1}{2}\Big(\nu_e s_e^2 + \sum_{i,j}e_{ij}^2\Big).$$

A uniform prior on $\sigma^2_e$, or the equivalent Pareto prior, is equivalent to $\nu_e = -2$, $s^2_e = 0$. A uniform prior on $\log \sigma^2_e$ is equivalent to $\nu_e = 0$, $s^2_e = 0$, and a
gamma($\epsilon, \epsilon$) prior for $1/\sigma^2_e$ is equivalent to $\nu_e = 2\epsilon$, $s^2_e = 1$.
Having found the four sets of conditional distributions, it is now simple
enough to program up the algorithm and compare via simulation the various
prior distributions.
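The new element relative to the variance components sampler is the Wishart draw in Step 3. A minimal sketch of that single update in Python with scipy is given below; it assumes a proper Wishart prior (so that $S_p$ is invertible) and the function name is illustrative rather than anything used in the thesis.

```python
import numpy as np
from scipy.stats import wishart

def draw_Vu(u, nu_p, Sp):
    """One draw from the Step 3 full conditional of the level 2 variance
    matrix V_u, given the J x 2 matrix of current level 2 residuals u and
    a proper Wishart(nu_p, Sp) prior for the precision matrix."""
    J = u.shape[0]
    nu_hat = J + nu_p
    S_hat = np.linalg.inv(u.T @ u + np.linalg.inv(Sp))  # sum of u_j u_j^T plus S_p^{-1}
    precision = wishart.rvs(df=nu_hat, scale=S_hat)     # draw V_u^{-1}
    return np.linalg.inv(precision)                     # invert to return V_u
```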
4.4.2 Simulation method
Following on from the variance components model simulation I now want to
extend the comparisons by considering the random slopes regression model.
The random slopes regression model has more parameters than the variance
components model, so I could quite easily study even more designs than the
15 studied using the variance components model.
Due to time constraints and to avoid too much repetition, only two main areas
of interest will be considered. I will firstly again consider the study design of the
model, looking at two sizes of study design and whether the design is balanced
or unbalanced. These will correspond to designs 3, 4, 7 and 8 in Table 4.1.
The second area of interest is due to the new model design. When looking at
the variance components model I considered the effect of varying the true values
of the two variance parameters. Now that there is a variance matrix at level 2, I
will consider the effect of varying the correlation between the two parameters at
level 2. I will consider five different scenarios: firstly when the two variables are
uncorrelated, which I will consider using all four study designs. The other four
scenarios will have both large and small correlations that are positive and then
negative. These correlations will only be considered using study design 7, which
is similar to the actual JSP dataset.
This will give a total of 9 designs. The true values for all the parameters apart
from the level 2 covariance term will be the same for each design, as follows:
$\beta_0 = 30.0$, $\beta_1 = 0.5$, $\Omega_{u00} = 5.0$, $\Omega_{u11} = 0.5$, and $\sigma^2_e = 30.0$. The level 2
covariance $\Omega_{u01}$ will be set to a value $c$ to give the required correlation. For each
set of parameter values I will generate 1000 simulated datasets and fit the random
slopes regression model to each dataset using each method.
Creating the simulation datasets
Creating the simulation datasets is also easy for the random slopes regression
model. The only data that need to be generated are the values of the response
variable for the $N$ pupils. The second variable $X_{ij}$ will be fixed throughout the
dataset. Considering the case of 864 pupils within 48 schools, the procedure is as
follows:
1. Generate 48 $u_{0j}$s and $u_{1j}$s, one pair for each school, by drawing from a
multivariate normal distribution with mean 0 and variance matrix $\Omega_u$.
2. Generate 864 $e_{ij}$s, one for each pupil, by drawing from a normal distribution
with mean 0 and variance $\sigma^2_e$.
3. Evaluate $Y_{ij} = \beta_0 + \beta_1 X_{ij} + u_{0j} + u_{1j}X_{ij} + e_{ij}$ for all 864 pupils.
This will generate one simulation dataset for the current parameter values.
This dataset is then fitted using each method, and the whole procedure is repeated
1000 times. The datasets will be generated using a short C program.
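As with the variance components model, a minimal Python sketch of this recipe is given below purely for illustration (the actual datasets used a short C program). The function name is hypothetical, the defaults are the true values listed above, and the predictor values x are assumed to be supplied since they are held fixed across datasets.

```python
import numpy as np

def simulate_rsr(nj, x, beta=(30.0, 0.5),
                 Omega_u=((5.0, 0.0), (0.0, 0.5)), s2e=30.0, rng=None):
    """Generate one random slopes regression dataset.
    nj: school sizes; x: fixed predictor value for each pupil (length sum(nj));
    Omega_u: level 2 variance matrix, with the off-diagonal set to c as required."""
    rng = rng or np.random.default_rng()
    school = np.repeat(np.arange(len(nj)), nj)
    # one (u_0j, u_1j) pair per school from MVN(0, Omega_u)
    u = rng.multivariate_normal(np.zeros(2), np.asarray(Omega_u), size=len(nj))
    e = rng.normal(0.0, np.sqrt(s2e), size=len(school))
    y = beta[0] + beta[1] * x + u[school, 0] + u[school, 1] * x + e
    return y, school
```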
Comparison of methods
The priors to be considered for the level 2 variance are the uniform prior on the $\Omega_u$
scale, and the two Wishart priors for the level 2 precision described in the earlier
section. The uniform prior method for $\Omega_u$ will be run using MLwiN and the other
two priors using the BUGS package. The maximum likelihood IGLS and RIGLS
methods will also be run using MLwiN. Again the main reason for running the
bulk of the Gibbs sampling runs using BUGS is computing resources. BUGS
however cannot fit the uniform prior as it is improper, and so this will be carried
out using MLwiN. The estimates from RIGLS will be used as prior estimates for
the `data driven' Wishart prior.
The lengths of the burn-in and main runs of these simulations will be the
same as for the equivalent study designs fitting the variance components model.
The bias and coverage probabilities will be worked out in the same way as for the
variance components model. The level 1 variance is not of great interest here,
and so I have used the gamma($\epsilon, \epsilon$) prior when using BUGS and the uniform prior
when using MLwiN, as these are the defaults.
Preliminary analysis for IGLS and RIGLS
With the added complexity of the random slopes regression model, the IGLS
and RIGLS methods occasionally, for certain datasets, have problems fitting the
model. These problems can be of two types. Firstly, the method may at some
iteration generate an estimate for a variance matrix that is not positive definite.
Secondly, the method may not converge before the maximum number of iterations
has been reached. Generally what is happening in the second case is that the
method is cycling between several estimates, and so increasing the maximum
number of iterations will not help (see Figure 4-3).
Figure 4-3: Trajectories plot of IGLS estimates for a run of the random slopes regression model where convergence is not achieved.
The MLn command CONV (Rasbash and Woodhouse 1995) will return
whether the estimation procedure has converged, and if not, which of the above
two reasons is the problem. As the maximum likelihood methods are fast to run,
I have run the random slopes regression model using several simulation studies to
identify how well the IGLS and RIGLS methods perform in different scenarios.
The results are given in Table 4.20. The studies marked with a star will be used
in the main analysis. Table 4.20 shows that several factors influence how well the
maximum likelihood methods perform.
Firstly it should be noted that as study size gets bigger and consequently
the number of level 2 units increases the number of datasets that the maximum
likelihood methods fail on is minimal. The number of datasets where the methods
have problems increases when the size of study is decreased, and dramatically
increases when the design is unbalanced. Also the correlation between the 2
variables at level 2 is important, if the two variables are highly correlated, either
85
Table 4.20: Summary of the convergence for the random slopes regression with the maximum likelihood based methods (IGLS/RIGLS). The study design is given in terms of the number of level 2 units and whether the study is balanced (B) or unbalanced (U).

    Study     Σ_u01   Con        NCon      Not Posdef
    3 (12U)   -1.4    623/574    356/349   21/77
    3 (12U)   -0.5    902/857    93/124    5/19
  * 3 (12U)    0.0    927/877    71/116    2/7
    3 (12U)    0.5    906/871    91/118    3/11
    3 (12U)    1.4    621/558    367/366   12/76
    4 (12B)   -1.4    914/903    83/74     3/23
    4 (12B)   -0.5    986/985    13/14     1/1
  * 4 (12B)    0.0    991/990    9/9       0/1
    4 (12B)    0.5    994/991    6/7       0/2
    4 (12B)    1.4    912/903    85/72     3/25
  * 7 (48U)   -1.4    986/984    13/14     1/2
  * 7 (48U)   -0.5    998/998    2/2       0/0
  * 7 (48U)    0.0    1000/1000  0/0       0/0
  * 7 (48U)    0.5    1000/1000  0/0       0/0
  * 7 (48U)    1.4    984/983    16/15     0/2
    8 (48B)   -1.4    994/992    6/6       0/2
    8 (48B)   -0.5    999/999    1/1       0/0
  * 8 (48B)    0.0    1000/1000  0/0       0/0
    8 (48B)    0.5    1000/1000  0/0       0/0
    8 (48B)    1.4    992/992    8/8       0/0
positively or negatively, the number of problem datasets increases.
Most of the studies chosen for further investigation with the Gibbs sampling
methods do not have many problem datasets. The study 3 scenario with Σ_u01 = 0
is the worst, with only 877 good datasets. For the further analysis I will simply
discard any problem datasets and analyse the remaining datasets using all the
methods.
One problem that is not captured by the MLn CONV command is when
the final converged estimate is not positive definite. These situations will be
included in the converged category in the above table. This has a knock-on
effect when I consider using the RIGLS estimate as a parameter in the Wishart
prior distribution. Consequently, if the level 2 variance matrix estimate has a
correlation outside [-1, 1], I will reduce the estimate of the covariance, Σ_u01, so
that the correlation becomes ±0.95 before using it as a prior parameter.
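A minimal sketch of this adjustment (function name mine) is:

```python
import numpy as np

def clamp_covariance(sigma_u, max_corr=0.95):
    """If a 2x2 level 2 variance matrix estimate implies a correlation outside
    [-1, 1], shrink the covariance term so the correlation becomes +/-0.95
    before the matrix is used as a Wishart prior parameter."""
    s00, s01, s11 = sigma_u[0, 0], sigma_u[0, 1], sigma_u[1, 1]
    corr = s01 / np.sqrt(s00 * s11)
    if abs(corr) > 1.0:
        s01 = np.sign(corr) * max_corr * np.sqrt(s00 * s11)
    return np.array([[s00, s01], [s01, s11]])
```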
4.4.3 Results
The results for the 8 simulation designs comparing the 2 maximum likelihood
methods and the three MCMC methods can be seen in Tables 4.21 to 4.28. The
columns labelled Wish 1 prior are the results for the Wishart(I, 2) prior for the
precision matrix, Σ_u^{-1}. The columns labelled Wish 2 prior are the results for the
Wishart(Σ̂_u, 4) prior for the precision matrix, Σ_u^{-1}, where Σ̂_u is the RIGLS estimate.
The results for the unbalanced design with 48 schools and uncorrelated
parameters at level 2 are in Table 4.21. From this table it can be seen that the
results for the two maximum likelihood methods and the uniform prior method
are similar to the results already seen for the variance components model. The
IGLS method tends to underestimate the variance parameters at level 2 and the
RIGLS method corrects for this giving the least biased estimates. The uniform
prior on the other hand tends to overestimate the level 2 variance parameters.
In terms of coverage probabilities there is little to choose between the RIGLS
method and the uniform prior. This is probably partly due to the uniform prior
method giving larger intervals.
The two other MCMC methods give interesting results. The first Wishart
prior method, which has as its parameter the identity matrix, uses this
parameter as a prior guess for the level 2 variance matrix. This is clearly shown
by the estimate of Σ_u00, which has a true value of 5 (greater than 1), being an
underestimate, and the estimate of Σ_u11, which has a true value of 0.5 (less than 1),
being an overestimate. This in turn affects the coverage intervals, as the estimates
for Σ_u00 give worse coverage than RIGLS while the estimates for Σ_u11 give better
coverage.
The second Wishart prior method, based on using the RIGLS estimate of Σ_u
as a parameter in the prior, appears to underestimate all the parameters in the
variance matrix. This in turn leads to smaller average interval widths and in this
case worse coverage than the RIGLS method for virtually all parameters.
Tables 4.22 to 4.25 contain the results when the level 2 covariance parameter,
Σ_u01, is given different true values that give matrices with high and low, positive
Table 4.21: Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values Σ_u00 = 5, Σ_u01 = 0 and Σ_u11 = 0.5. All 1000 runs.

Param. (True)  IGLS           RIGLS          Wish 1 prior   Wish 2 prior   Uniform prior

Relative % bias in estimates (Monte Carlo SE); values in [ ] are actual biases as the true value is 0
β_0 (30.0)     0.03 (0.04)    0.03 (0.04)    -0.01 (0.04)   0.00 (0.04)    0.03 (0.04)
β_1 (0.5)      0.64 (0.70)    0.64 (0.70)    2.51 (0.70)    1.10 (0.70)    0.74 (0.70)
Σ_u00 (5.0)    -2.88 (0.92)   0.24 (0.94)    -3.00 (0.98)   -7.64 (0.97)   22.42 (1.08)
Σ_u01 (0.0)    [-0.01 (0.01)] [-0.01 (0.01)] [-0.01 (0.01)] [-0.01 (0.01)] [-0.02 (0.01)]
Σ_u11 (0.5)    -3.46 (0.72)   -1.08 (0.74)   5.12 (0.75)    -3.39 (0.74)   15.76 (0.85)
σ²_e (30.0)    0.03 (0.16)    0.03 (0.16)    0.68 (0.16)    0.90 (0.17)    0.53 (0.16)

Coverage probabilities (90%/95%); approximate MCSE (0.28%/0.15%)
β_0            89.7/95.1      90.1/95.3      89.2/95.0      89.1/94.4      92.5/96.8
β_1            88.0/93.5      88.3/93.8      88.8/93.8      86.5/92.5      90.2/95.3
Σ_u00          87.2/91.5      88.2/92.3      86.8/92.7      83.8/89.5      86.8/93.2
Σ_u01          90.1/96.3      90.1/96.3      88.3/94.9      85.5/91.5      90.2/95.7
Σ_u11          86.2/90.7      87.5/92.3      91.1/95.3      86.9/92.9      87.7/93.2
σ²_e           89.4/95.2      89.3/95.1      88.7/94.2      89.4/94.8      88.7/94.6

Average interval widths (90%/95%)
β_0            1.275/1.519    1.289/1.536    1.269/1.536    1.250/1.516    1.384/1.660
β_1            0.353/0.420    0.356/0.425    0.359/0.427    0.337/0.404    0.381/0.456
Σ_u00          4.812/5.734    4.923/5.865    4.973/6.042    4.599/5.584    6.158/7.484
Σ_u01          0.946/1.126    0.967/1.152    0.994/1.216    0.922/1.129    1.212/1.491
Σ_u11          0.374/0.445    0.382/0.455    0.406/0.492    0.373/0.452    0.475/0.577
σ²_e           5.027/5.989    5.027/5.989    5.072/6.008    5.126/6.099    5.061/6.034
and negative correlation. The parameter percentage biases are plotted in
Figures 4-4 and 4-5 against the value of Σ_u01. The immediate thing to notice is
that changes to the covariance parameter value have, for most methods and most
parameters, little overall effect in terms of bias, and the results in Tables 4.22 to
4.25 are similar to those obtained in Table 4.21.
The IGLS and RIGLS methods give similar results as before, with the RIGLS
method giving approximately unbiased estimates. This shows that removing the
datasets that did not converge does not appear to have had any noticeable effect
on the bias of the estimates. The uniform prior method gives approximately the
same percentage bias for the variance parameters, and the covariance estimates
appear (Figure 4-5 (ii)) to have percentage biases that are positively correlated
with Σ_u01.
The first Wishart prior method does not exhibit the shrinkage towards 1
property for parameter Σ_u00 when the correlation is large (in magnitude). It
is also noticeable (Figure 4-4 (i)) that the bias of the parameter β_0 using this
method is proportional to Σ_u01.
The second Wishart prior method still underestimates the variance parameters
at level 2, although the bias is reduced as the correlation is increased (in
magnitude), but it is approximately unbiased for the covariance term. It also gives
the largest bias for the level 1 variance parameter for all values of Σ_u01.
Considering the coverage properties of the five methods we find differences
from the variance components model. The MCMC methods now no longer give
the best results for all parameters. In particular the second Wishart prior method
has the smallest intervals for most parameters and consequently performs poorly
in terms of coverage. The first Wishart prior performs well and gives reasonable
coverage for all parameters.
Although the uniform prior gives estimates that are highly biased it should
not be disregarded, as it has the best coverage properties for the level 2 variance
parameters. It does not perform as well when the correlation is increased
(in magnitude). The RIGLS method also gives reasonable coverage for most
parameters and performs better than for the variance components model.
When the study design is changed (Tables 4.26 to 4.28) the effects are similar
to those observed for the variance components model. From Figures 4-6 and 4-7
it can be seen that reducing the number of schools in the study increases the
Table 4.22: Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values Σ_u00 = 5, Σ_u01 = 1.4 and Σ_u11 = 0.5. Only 982 runs.

Param. (True)  IGLS           RIGLS          Wish 1 prior   Wish 2 prior   Uniform prior

Relative % bias in estimates (Monte Carlo SE)
β_0 (30.0)     0.02 (0.04)    0.03 (0.04)    0.07 (0.04)    0.03 (0.04)    0.03 (0.04)
β_1 (0.5)      0.68 (0.69)    0.67 (0.69)    2.22 (0.69)    0.99 (0.69)    0.76 (0.70)
Σ_u00 (5.0)    -2.72 (0.92)   0.40 (0.92)    1.64 (0.92)    -6.43 (0.90)   22.86 (1.04)
Σ_u01 (1.4)    -2.07 (0.79)   0.07 (0.79)    -5.31 (0.81)   -0.43 (0.80)   14.64 (0.93)
Σ_u11 (0.5)    -3.26 (0.71)   -0.94 (0.71)   7.96 (0.72)    -2.63 (0.71)   15.76 (0.82)
σ²_e (30.0)    0.03 (0.16)    0.02 (0.16)    -0.08 (0.16)   1.27 (0.16)    0.32 (0.16)

Coverage probabilities (90%/95%); approximate MCSE (0.29%/0.15%)
β_0            89.5/95.2      89.7/95.5      89.4/95.6      87.9/93.3      92.1/96.8
β_1            88.2/94.3      88.7/94.5      89.2/94.9      86.0/92.5      90.7/96.2
Σ_u00          86.6/91.0      88.3/92.5      89.3/94.1      82.8/89.2      87.5/93.5
Σ_u01          87.6/92.8      88.9/93.7      88.3/93.8      88.9/93.8      89.8/94.3
Σ_u11          87.0/90.6      88.1/92.5      92.9/97.0      87.3/93.4      89.4/94.8
σ²_e           88.6/95.2      88.8/95.3      89.1/95.2      89.5/94.4      89.2/95.7

Average interval widths (90%/95%)
β_0            1.247/1.486    1.262/1.503    1.259/1.517    1.171/1.398    1.363/1.635
β_1            0.351/0.418    0.354/0.422    0.358/0.431    0.328/0.396    0.379/0.455
Σ_u00          4.631/5.517    4.742/5.650    4.865/5.896    4.146/5.038    5.988/7.276
Σ_u01          1.135/1.353    1.162/1.384    1.195/1.443    1.116/1.352    1.445/1.758
Σ_u11          0.368/0.439    0.377/0.449    0.417/0.506    0.370/0.448    0.469/0.570
σ²_e           4.998/5.955    5.003/5.962    4.969/5.887    5.048/6.014    5.013/5.977
Table 4.23: Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values Σ_u00 = 5, Σ_u01 = -1.4 and Σ_u11 = 0.5. Only 984 runs.

Param. (True)  IGLS           RIGLS          Wish 1 prior   Wish 2 prior   Uniform prior

Relative % bias in estimates (Monte Carlo SE)
β_0 (30.0)     0.02 (0.04)    0.02 (0.04)    -0.08 (0.04)   -0.01 (0.04)   0.03 (0.04)
β_1 (0.5)      -0.01 (0.69)   -0.01 (0.69)   2.27 (0.69)    0.11 (0.69)    -0.01 (0.69)
Σ_u00 (5.0)    -2.28 (0.92)   0.70 (0.94)    2.44 (0.94)    -5.65 (0.90)   23.90 (1.06)
Σ_u01 (-1.4)   1.43 (0.79)    -0.79 (0.79)   4.68 (0.82)    -0.23 (0.82)   -15.86 (0.93)
Σ_u11 (0.5)    -1.74 (0.72)   0.68 (0.73)    8.97 (0.74)    -1.23 (0.75)   17.65 (0.84)
σ²_e (30.0)    0.00 (0.16)    -0.01 (0.16)   -0.10 (0.16)   1.17 (0.16)    0.29 (0.16)

Coverage probabilities (90%/95%); approximate MCSE (0.29%/0.15%)
β_0            89.5/95.1      90.4/95.1      90.4/95.5      89.7/94.6      92.9/96.3
β_1            89.4/94.4      89.9/94.8      90.8/95.0      88.4/94.1      92.2/96.1
Σ_u00          87.6/90.7      88.7/92.4      89.6/94.9      84.2/90.0      87.2/92.9
Σ_u01          88.4/92.9      89.6/94.1      88.3/94.3      88.8/93.7      90.0/95.3
Σ_u11          88.3/92.5      89.5/93.6      92.0/96.7      87.9/93.8      87.9/93.6
σ²_e           89.2/95.0      89.1/95.0      89.1/95.0      89.0/94.6      89.2/94.9

Average interval widths (90%/95%)
β_0            1.260/1.501    1.275/1.519    1.292/1.544    1.239/1.491    1.376/1.651
β_1            0.354/0.422    0.358/0.426    0.364/0.431    0.346/0.415    0.382/0.458
Σ_u00          4.701/5.601    4.811/5.733    5.036/6.126    4.311/5.241    6.094/7.407
Σ_u01          1.157/1.379    1.183/1.409    1.209/1.476    1.142/1.396    1.478/1.797
Σ_u11          0.376/0.448    0.384/0.458    0.421/0.513    0.374/0.456    0.479/0.582
σ²_e           5.005/5.964    5.005/5.964    4.964/5.866    5.039/5.999    5.013/5.972
Table 4.24: Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values Σ_u00 = 5, Σ_u01 = 0.5 and Σ_u11 = 0.5. All 1000 runs.

Param. (True)  IGLS           RIGLS          Wish 1 prior   Wish 2 prior   Uniform prior

Relative % bias in estimates (Monte Carlo SE)
β_0 (30.0)     0.02 (0.04)    0.02 (0.04)    0.02 (0.04)    0.01 (0.04)    0.03 (0.04)
β_1 (0.5)      0.72 (0.70)    0.72 (0.70)    2.44 (0.70)    1.08 (0.70)    0.83 (0.70)
Σ_u00 (5.0)    -3.00 (0.92)   0.08 (0.94)    -3.12 (0.97)   -8.08 (0.96)   22.10 (1.08)
Σ_u01 (0.5)    -3.45 (1.89)   -1.45 (1.93)   -2.42 (1.92)   -0.80 (1.93)   12.43 (2.22)
Σ_u11 (0.5)    -3.79 (0.72)   -1.42 (0.73)   5.07 (0.74)    -3.61 (0.73)   15.35 (0.98)
σ²_e (30.0)    0.04 (0.16)    0.04 (0.16)    0.67 (0.16)    0.97 (0.17)    0.54 (0.16)

Coverage probabilities (90%/95%); approximate MCSE (0.28%/0.15%)
β_0            89.7/94.9      90.2/95.5      89.1/95.1      89.1/94.0      92.5/96.8
β_1            88.2/93.0      88.3/93.4      88.3/93.8      86.9/92.3      90.5/95.3
Σ_u00          87.3/91.3      88.9/92.3      86.4/92.6      82.9/89.5      87.3/93.3
Σ_u01          90.0/95.5      90.9/95.7      90.9/95.4      85.9/93.0      91.0/95.3
Σ_u11          86.3/91.3      88.6/91.7      91.8/95.7      87.7/92.9      89.4/94.2
σ²_e           89.0/95.1      89.0/95.1      88.6/94.2      89.6/94.8      89.1/94.6

Average interval widths (90%/95%)
β_0            1.270/1.514    1.285/1.531    1.255/1.524    1.230/1.487    1.380/1.654
β_1            0.352/0.419    0.356/0.424    0.357/0.428    0.334/0.401    0.380/0.456
Σ_u00          4.787/5.704    4.897/5.834    4.927/5.982    4.543/5.517    6.122/7.441
Σ_u01          0.967/1.152    0.989/1.178    1.020/1.241    0.947/1.154    1.236/1.517
Σ_u11          0.372/0.443    0.380/0.453    0.406/0.492    0.371/0.450    0.472/0.574
σ²_e           5.026/5.988    5.026/5.988    5.066/6.007    5.125/6.103    5.061/6.033
Table 4.25: Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values Σ_u00 = 5, Σ_u01 = -0.5 and Σ_u11 = 0.5. Only 998 runs.

Param. (True)  IGLS           RIGLS          Wish 1 prior   Wish 2 prior   Uniform prior

Relative % bias in estimates (Monte Carlo SE)
β_0 (30.0)     0.03 (0.04)    0.03 (0.04)    -0.04 (0.04)   -0.01 (0.04)   0.03 (0.04)
β_1 (0.5)      0.49 (0.69)    0.49 (0.69)    2.52 (0.69)    1.06 (0.69)    0.57 (0.70)
Σ_u00 (5.0)    -2.62 (0.92)   0.50 (0.94)    -2.60 (0.97)   -7.53 (0.97)   22.78 (1.08)
Σ_u01 (-0.5)   0.27 (1.89)    -2.00 (1.93)   -0.72 (1.94)   -1.96 (1.96)   -18.12 (2.22)
Σ_u11 (0.5)    -3.00 (0.72)   -0.61 (0.73)   5.68 (0.74)    -2.93 (0.73)   16.30 (0.98)
σ²_e (30.0)    0.01 (0.16)    0.01 (0.16)    0.65 (0.16)    0.93 (0.17)    0.50 (0.16)

Coverage probabilities (90%/95%); approximate MCSE (0.28%/0.15%)
β_0            89.7/95.1      89.8/95.3      89.3/95.1      88.6/94.5      92.2/96.6
β_1            88.4/93.7      89.3/94.0      88.9/94.1      87.7/93.3      90.8/95.5
Σ_u00          87.8/91.6      88.4/92.5      86.8/93.0      84.0/90.1      87.1/93.4
Σ_u01          89.8/94.9      90.1/95.1      88.9/94.3      85.7/91.9      89.3/94.9
Σ_u11          87.0/91.3      88.1/92.6      90.6/95.2      87.3/92.7      87.5/93.3
σ²_e           89.6/95.1      89.6/95.1      88.9/94.3      89.6/94.6      89.0/94.8

Average interval widths (90%/95%)
β_0            1.276/1.521    1.291/1.538    1.282/1.542    1.256/1.524    1.385/1.660
β_1            0.353/0.421    0.357/0.426    0.361/0.427    0.341/0.408    0.381/0.457
Σ_u00          4.817/5.740    4.929/5.872    4.997/6.072    4.606/5.594    6.176/7.506
Σ_u01          0.979/1.167    1.001/1.193    1.025/1.254    0.955/1.170    1.253/1.538
Σ_u11          0.375/0.447    0.383/0.453    0.409/0.497    0.375/0.454    0.477/0.579
σ²_e           5.025/5.988    5.025/5.988    5.069/5.999    5.130/6.096    5.061/6.030
Figure 4-4: Plots of biases obtained for the various methods fitting the random slopes regression model against the value of Σ_u01 (fixed effects parameters and level 1 variance parameter). Panel (i): β_0; panel (ii): β_1; panel (iii): σ²_e. Each panel plots estimated % bias against the true Σ_u01 for IGLS, RIGLS, Wishart Prior 1, Wishart Prior 2 and the Uniform Prior.
Figure 4-5: Plots of biases obtained for the various methods fitting the random slopes regression model against the value of Σ_u01 (level 2 variance parameters). Panel (i): Σ_u00; panel (ii): Σ_u01; panel (iii): Σ_u11. Each panel plots estimated % bias against the true Σ_u01 for IGLS, RIGLS, Wishart Prior 1, Wishart Prior 2 and the Uniform Prior.
percentage bias of all the methods and in particular the uniform prior method.
It is also noticeable that σ²_e is biased high for the two Wishart prior methods,
which may explain in part why the level 2 variance parameters are biased low.
The coverage properties are changed slightly when the number of schools is
reduced to 12. The uniform prior method now has very poor coverage properties
as its intervals are too wide and it gives far higher actual percentage coverage for
nominal 90% and 95% intervals for the fixed effects. For the variance parameters
the uniform prior gives highly biased estimates that lead to intervals that have
lower actual percentage coverage than required. The second Wishart prior
method is once again performing poorly and the best coverage is either from
the RIGLS or the first Wishart prior method depending on the parameter.
4.5 Conclusions
Simulation studies comparing various maximum likelihood and empirical Bayes
methods have been performed in the past (see for example Kreft, de Leeuw, and
van der Leeden (1994)). There has however been very little comparison work
between fully Bayesian MCMC methods and maximum likelihood methods. This
may be due to the fact that the MCMC methods take a lot longer to perform
than the maximum likelihood methods. For example the two sets of simulations
in this chapter together took over 6 months to perform and this was using several
machines simultaneously.
Although these simulations do highlight some interesting points it would be
useful to run more simulations particularly to compare the priors for variance
matrices. The results obtained in these simulations will now be summarised and
then I will talk about how this has influenced the default prior settings in the
MLwiN package.
4.5.1 Simulation results
All the simulations performed in this chapter have been compared in terms of
both bias and coverage properties. Two maximum likelihood methods have been
included in the simulations for completeness but the IGLS method almost always
performs worse than the RIGLS method and so can probably be disregarded.
Table 4.26: Summary of results for the random slopes regression with the 48 schools balanced design with parameter values Σ_u00 = 5, Σ_u01 = 0.0 and Σ_u11 = 0.5. All 1000 runs.

Param. (True)  IGLS           RIGLS          Wish 1 prior   Wish 2 prior   Uniform prior

Relative % bias in estimates (Monte Carlo SE); values in [ ] are actual biases as the true value is 0
β_0 (30.0)     -0.05 (0.04)   -0.05 (0.04)   -0.08 (0.04)   -0.08 (0.04)   -0.05 (0.04)
β_1 (0.5)      -0.03 (0.68)   -0.03 (0.68)   2.13 (0.68)    0.78 (0.68)    0.07 (0.68)
Σ_u00 (5.0)    -4.66 (0.90)   -1.74 (0.92)   -4.91 (0.95)   -9.32 (0.94)   18.84 (1.04)
Σ_u01 (0.0)    [0.00 (0.01)]  [0.00 (0.01)]  [0.00 (0.01)]  [0.00 (0.01)]  [0.00 (0.01)]
Σ_u11 (0.5)    -1.59 (0.75)   0.80 (0.77)    7.02 (0.77)    -1.30 (0.77)   17.62 (0.88)
σ²_e (30.0)    -0.05 (0.16)   -0.05 (0.16)   0.59 (0.16)    0.81 (0.16)    0.46 (0.16)

Coverage probabilities (90%/95%); approximate MCSE (0.28%/0.15%)
β_0            87.9/93.8      88.1/94.0      87.7/93.5      87.2/92.7      90.8/95.5
β_1            89.2/94.4      89.7/94.5      89.8/94.8      87.6/93.3      91.6/95.6
Σ_u00          86.3/91.0      87.8/92.2      86.7/92.0      82.7/89.6      88.5/94.4
Σ_u01          92.5/96.6      92.5/96.7      91.1/95.3      88.6/94.1      92.8/96.1
Σ_u11          86.4/91.2      88.1/93.2      89.9/94.9      87.2/94.2      85.3/92.0
σ²_e           89.0/94.1      89.0/94.1      89.8/94.7      89.9/94.7      90.0/94.9

Average interval widths (90%/95%)
β_0            1.240/1.477    1.253/1.493    1.223/1.475    1.209/1.458    1.343/1.609
β_1            0.353/0.421    0.357/0.425    0.362/0.432    0.339/0.407    0.381/0.457
Σ_u00          4.610/5.493    4.710/5.611    4.775/5.795    4.440/5.383    5.839/7.091
Σ_u01          0.924/1.101    0.944/1.125    0.972/1.188    0.903/1.106    1.178/1.448
Σ_u11          0.375/0.447    0.383/0.456    0.407/0.494    0.375/0.455    0.476/0.578
σ²_e           5.027/5.991    5.028/5.991    5.077/6.016    5.129/6.107    5.063/6.035
Table 4.27: Summary of results for the random slopes regression with the 12 schools unbalanced design with parameter values Σ_u00 = 5, Σ_u01 = 0.0 and Σ_u11 = 0.5. Only 877 runs.

Param. (True)  IGLS           RIGLS          Wish 1 prior   Wish 2 prior   Uniform prior

Relative % bias in estimates (Monte Carlo SE); values in [ ] are actual biases as the true value is 0
β_0 (30.0)     -0.05 (0.15)   -0.06 (0.15)   0.14 (0.09)    0.10 (0.09)    0.06 (0.09)
β_1 (0.5)      -0.45 (1.50)   -0.64 (1.49)   1.48 (1.48)    -0.16 (1.48)   0.06 (1.50)
Σ_u00 (5.0)    -4.78 (2.22)   9.70 (2.22)    -4.15 (2.29)   -12.34 (2.11)  219.22 (5.10)
Σ_u01 (0.0)    [0.01 (0.02)]  [0.01 (0.02)]  [-0.01 (0.02)] [0.00 (0.02)]  [-0.04 (0.05)]
Σ_u11 (0.5)    -9.19 (1.57)   0.65 (1.71)    34.76 (1.77)   -5.26 (1.71)   142.98 (3.74)
σ²_e (30.0)    -0.56 (0.40)   -0.82 (0.41)   2.12 (0.36)    2.67 (0.36)    1.62 (0.35)

Coverage probabilities (90%/95%); approximate MCSE (0.29%/0.15%)
β_0            83.5/90.8      86.2/92.1      85.3/91.4      82.2/90.1      96.9/99.7
β_1            86.1/91.3      88.1/92.2      92.4/96.7      85.4/91.6      96.1/97.9
Σ_u00          80.0/83.5      84.8/87.8      83.2/91.1      73.5/81.9      73.2/81.8
Σ_u01          87.7/94.1      88.8/95.3      92.9/97.9      71.8/79.2      96.2/98.7
Σ_u11          76.9/81.4      82.7/86.5      93.6/96.8      78.8/86.9      75.4/85.5
σ²_e           89.1/93.7      89.2/93.5      88.8/95.0      87.9/94.6      90.1/95.1

Average interval widths (90%/95%)
β_0            2.525/3.009    2.660/3.170    2.559/3.124    2.458/2.981    3.957/4.950
β_1            0.673/0.802    0.705/0.840    0.788/0.962    0.671/0.802    1.030/1.290
Σ_u00          9.685/11.54    10.76/12.82    10.40/13.45    8.13/10.44     37.15/50.49
Σ_u01          1.814/2.162    2.002/2.385    2.230/2.974    1.678/2.168    6.876/9.696
Σ_u11          0.706/0.841    0.772/0.920    1.067/1.373    0.703/0.887    2.554/3.488
σ²_e           9.955/11.86    9.931/11.83    10.47/12.56    10.45/12.51    10.29/12.32
Table 4.28: Summary of results for the random slopes regression with the 12 schools balanced design with parameter values Σ_u00 = 5, Σ_u01 = 0.0 and Σ_u11 = 0.5. Only 990 runs.

Param. (True)  IGLS           RIGLS          Wish 1 prior   Wish 2 prior   Uniform prior

Relative % bias in estimates (Monte Carlo SE); values in [ ] are actual biases as the true value is 0
β_0 (30.0)     0.04 (0.08)    0.04 (0.08)    0.13 (0.08)    0.06 (0.08)    0.04 (0.08)
β_1 (0.5)      0.48 (1.38)    0.49 (1.38)    1.73 (1.38)    1.14 (1.39)    0.96 (1.39)
Σ_u00 (5.0)    -12.80 (1.80)  -0.32 (1.96)   -11.42 (1.96)  -19.73 (1.82)  187.26 (4.42)
Σ_u01 (0.0)    [0.01 (0.02)]  [0.01 (0.02)]  [-0.01 (0.02)] [-0.00 (0.02)] [-0.03 (0.04)]
Σ_u11 (0.5)    -9.05 (1.41)   0.46 (1.54)    33.95 (1.60)   -4.53 (1.54)   138.75 (3.40)
σ²_e (30.0)    0.06 (0.32)    0.05 (0.32)    2.49 (0.33)    2.90 (0.33)    2.01 (0.32)

Coverage probabilities (90%/95%); approximate MCSE (0.30%/0.16%)
β_0            85.8/91.6      87.1/93.1      86.4/92.1      84.5/90.8      97.7/99.4
β_1            86.5/91.5      87.9/92.7      92.0/96.1      86.2/91.2      96.3/98.3
Σ_u00          77.0/81.2      82.5/85.2      78.0/88.1      68.5/77.7      76.6/87.3
Σ_u01          91.1/97.0      91.6/97.4      93.1/97.1      75.9/81.6      95.2/98.3
Σ_u11          79.1/82.7      83.0/85.8      93.6/97.5      80.5/86.0      77.2/85.7
σ²_e           90.2/94.8      90.2/94.8      89.6/95.3      89.3/94.9      89.8/95.1

Average interval widths (90%/95%)
β_0            2.409/2.871    2.525/3.008    2.415/2.947    2.335/2.834    3.741/4.691
β_1            0.667/0.795    0.698/0.832    0.781/0.947    0.668/0.792    1.020/1.279
Σ_u00          8.891/10.59    9.764/11.63    9.522/12.29    7.386/9.456    33.35/45.38
Σ_u01          1.721/2.051    1.892/2.254    2.094/2.795    1.582/2.049    6.359/8.994
Σ_u11          0.693/0.826    0.758/0.904    1.048/1.347    0.696/0.877    2.499/3.419
σ²_e           10.04/11.96    10.04/11.97    10.54/12.64    10.49/12.56    10.33/12.37
Figure 4-6: Plots of biases obtained for the various methods fitting the random slopes regression model against study design (fixed effects parameters and level 1 variance parameter). Panel (i): β_0; panel (ii): β_1; panel (iii): σ²_e. Each panel plots estimated % bias for the 12U, 12B, 48U and 48B designs for IGLS, RIGLS, Wishart Prior 1, Wishart Prior 2 and the Uniform Prior.
Figure 4-7: Plots of biases obtained for the various methods fitting the random slopes regression model against study design (level 2 variance parameters). Panel (i): estimated % bias of Σ_u00; panel (ii): estimated bias of Σ_u01; panel (iii): estimated % bias of Σ_u11, each for the 12U, 12B, 48U and 48B designs for IGLS, RIGLS, Wishart Prior 1, Wishart Prior 2 and the Uniform Prior.
(It is included in the package MLwiN as there exist situations where the IGLS
method converges when the RIGLS method doesn't).
Although the RIGLS method performs well in terms of bias, it is not designed
for interval estimation and the additional (sometimes false) assumption that the
parameter of interest has a Gaussian distribution has been used to generate
interval estimates. The MCMC methods have been compared to see if they
will improve on the RIGLS method in terms of coverage.
The main difficulty with the MCMC methods is choosing default priors for
the variance parameters. In the univariate case I have compared three possible
prior distributions using the variance components model. All three priors give
variance estimates that are positively biased, but this bias increases as N, the
number of units associated with the variance, decreases. Of the three priors, the
Pareto prior for the precision parameter, which is a proper prior equivalent to
a uniform prior for the variance, has far larger bias but in turn often has the
best coverage properties. The Gamma(ε, ε) prior for the precision parameter has
far less bias and also improves over the maximum likelihood methods in terms
of coverage, so it would be preferable except that it does not easily generalise to a
multivariate distribution. The final prior, which uses a prior estimate taken from
the gamma prior estimate, gives approximately the same answers as the gamma
prior.
When variance matrices are considered, as in the random slopes regression
model, multivariate priors are required. The uniform prior easily translates to
a multivariate uniform prior but unfortunately the gamma prior does not. A
candidate multivariate Wishart prior (Wish1) for the precision matrix was used
to replace the gamma prior. A third alternative prior (Wish2) based on a prior
estimate for the variance matrix, this time from RIGLS, was also considered. This
third prior performed poorly and tended to underestimate the variance matrix
and generally gave worse coverage than the maximum likelihood methods.
The Wish1 prior tended to shrink the variance estimates towards the identity
matrix but generally was less biased than the other two priors. The uniform
prior once again was highly positively biased. In terms of coverage properties the
uniform and Wish1 prior both performed as well overall as the RIGLS method
but no better.
So in conclusion, in some situations the RIGLS maximum likelihood method,
which is far faster to run, improves on MCMC and in other situations it is
MCMC that has better performance. Both the uniform and gamma priors and
their multivariate equivalents have good points and bad points but overall the
gamma prior appears to be slightly better. In Chapter 6, I will consider a
similar simulation study using a multi-level logistic regression model. Here the
approximation based methods do not perform as well as noted by Rodriguez
and Goldman (1995), and it will be shown that the MCMC methods make an
improvement with these models.
4.5.2 Priors in MLwiN
The first release of the MLwiN package occurred while these simulations were
still being performed. In this version I included the uniform prior on the σ² scale
for all variance parameters as a default, mainly because it was simple and easiest
to extend to the multivariate case. The user is also given the option to include
informative priors for variance parameters. For an informative prior the user
must input a prior estimate for the variance or variance matrix and a sample size
on which this prior estimate is based. If the user gives a prior sample size of 1
then the Wishart prior produced is identical to the second Wishart prior used for
the random slopes regression simulations in this chapter.
In future releases the Gamma(ε, ε) prior for the precision and the Wish1 prior
may be added as alternatives following their performance in these simulations.
In this chapter I have introduced two simple multi-level models and shown
how they can be fitted using the Gibbs sampler. I will generalise this to include
the whole family of Gaussian models in the next chapter, along with showing how
to apply the other MCMC methods to multi-level models.
Chapter 5

Gaussian Models 2 - General Models
In the previous chapter I introduced two simple two level models and showed
how to use one MCMC method, Gibbs sampling, to fit them. In this chapter I
will extend this work in two directions. Firstly I will give a general description
of an N level Gaussian multi-level model and show how to fit this model using
Gibbs sampling. Secondly I will show how to use other MCMC methods with
Gibbs sampling via a hybrid approach to fit N level Gaussian models. I will give
two alternative Metropolis-Gibbs hybrid sampling algorithms and explain how
these methods can be extended into adaptive samplers. I will compare through a
simple example how well the methods perform in terms of their times to produce
estimates with a desired accuracy.
5.1 General N level Gaussian hierarchical linear models
In the field of education I have already looked at a two level scenario with pupils
within schools. This structure could easily be extended in many directions by the
addition of extra levels to the model. The schools could be divided into different
education authorities giving another higher level. Pupils in each school could
be defined by their class, allowing a level between pupils and schools. Below the
pupil level, each student could sit tests over a period of several years and so each
test could be a lower level unit.
It is quite easy to see how a 2 level model can be extended to a 5 level model
in an educational setting, and it is conceivable that in other application areas
there could be even more levels. In the general framework, predictor variables
can be defined as fixed effects or random effects at any level in this model. For
example, predictors such as sex, parental background and ethnic origin are pupil
level variables, whilst class size and teacher variables are class level variables and
school size and type are school level variables.
The multi-level structure of these models produces similarities between the
conditional distributions for predictor variables at different levels, and it will be
shown later that only four parameter updating steps are needed for a general N
level Gaussian model. One of the main difficulties with extending the algorithm
to N levels is notational. In the two level model we have pupil i in school j, and
this cannot be extended indefinitely.
I will firstly look at the work on hierarchical models in the paper by Seltzer,
Wong, and Bryk (1996) and show how their algorithms can be modified to fit a
general 3 level multi-level model before extending this work to N levels.
5.2 Gibbs sampling approach
Seltzer, Wong, and Bryk (1996) considered hierarchical models of 2 levels with
fixed effects. They found the conditional posterior distributions for all the
parameters so that a Gibbs sampling algorithm could be easily implemented.
They also included specifications of prior distributions for all variance parameters
and incorporated these prior distributions into their posterior distributions. They
stated that it was easy to extend the algorithm to hierarchical models with 3 or
more levels but did not state how.
I wish to follow on from their work but to consider a wider family of
distributions, namely the N level Gaussian multi-level models. Their formulation
for a 2 level hierarchical model is as follows:
$$y_{ij} = X_{ij}\beta_j + X^*_{ij}\theta + e_{ij},$$
$$\beta_j = W_j\gamma + U_j,$$

where e_ij ~ N(0, σ²) and U_j ~ MVN(0, T).

This is a general 2 level hierarchical model with fixed effects. To translate
this into a 2 level multi-level model I will re-parameterise as follows:

$$y_{ij} = X_{ij}(W_j\gamma + U_j) + X^*_{ij}\theta + e_{ij} = X_{ij}U_j + X_{ij}W_j\gamma + X^*_{ij}\theta + e_{ij} = X_{ij}U_j + Z_{ij}\beta^* + e_{ij},$$

where

$$Z_{ij} = \left(\, X_{ij}W_j \;\; X^*_{ij} \,\right) \quad\text{and}\quad \beta^* = \begin{pmatrix} \gamma \\ \theta \end{pmatrix},$$

with e_ij ~ N(0, σ²) and U_j ~ MVN(0, T).
In this formulation estimates for the variance parameters, σ² and T, as well as
the fixed effects β* can still be found. We can also find estimates for the lowest
level random variables (in the above notation θ), which are really also fixed effects.
Due to the re-parameterisation the level 2 residuals, the U_j, are estimated as opposed
to the β_j, which were random parameters. It is easy to calculate the β_j from the
U_j and vice versa.
Re-parameterising the model in this way does not have any particular
advantages in terms of convergence; in fact Gelfand, Sahu, and Carlin (1995) show that,
for some models that fit this framework, re-parameterising in the way described
will give worse mixing properties for the Markov chain. The main reason for re-
parameterising into this format is that we are now working with a far larger family
of distributions. This is because there exist models that fit this framework but
cannot be described in the previous format.
The next step is to construct conditional posterior distributions for this new
model structure. I will consider a general 3 level model as opposed to the 2 level
model that Seltzer, Wong, and Bryk (1996) consider, as this easily generalises to
N levels. The three level model will be defined as follows:

$$y_{ijk} = X_{1ijk}\beta_1 + X_{2ijk}\beta_{2jk} + X_{3ijk}\beta_{3k} + e_{ijk},$$
$$e_{ijk} \sim N(0, \sigma^2), \quad \beta_{2jk} \sim \mathrm{MVN}(0, V_2), \quad \beta_{3k} \sim \mathrm{MVN}(0, V_3),$$

with β_1 as fixed effects, β_2 the level 2 residuals and β_3 the level 3 residuals.
There are now 6 sets of unknowns to consider. I will consider these in turn
and will assume that the variance parameters have general scaled inverse χ² and
inverse Wishart priors, whilst the fixed effects have uniform priors. The steps
required are then:
Step 1. p(β_1 | y, β_2, β_3, σ², V_2, V_3)

$$\beta_1 \sim N(\hat{\beta}_1, \hat{D}_1)$$

$$p(\beta_1 \mid y, \beta_2, \beta_3, \sigma^2, V_2, V_3) \propto p(y \mid \beta_1, \beta_2, \beta_3, \sigma^2, V_2, V_3)\,p(\beta_1)$$
$$\propto \prod_{ijk} \left(\tfrac{1}{\sigma^2}\right)^{\frac{1}{2}} \exp\left[-\tfrac{1}{2\sigma^2}(y_{ijk} - X_{1ijk}\beta_1 - X_{2ijk}\beta_{2jk} - X_{3ijk}\beta_{3k})^2\right]$$
$$\propto \prod_{ijk} \left(\tfrac{1}{\sigma^2}\right)^{\frac{1}{2}} \exp\left[-\tfrac{1}{2\sigma^2}(d_{1ijk} - X_{1ijk}\beta_1)^2\right]$$

where d_{1ijk} = y_{ijk} - X_{2ijk}β_{2jk} - X_{3ijk}β_{3k}, giving

$$\hat{D}_1 = \sigma^2\left[\sum_{ijk} X_{1ijk}^T X_{1ijk}\right]^{-1},$$

and

$$\hat{\beta}_1 = \left[\sum_{ijk} X_{1ijk}^T X_{1ijk}\right]^{-1} \sum_{ijk} X_{1ijk}^T d_{1ijk} = \frac{\hat{D}_1}{\sigma^2} \sum_{ijk} X_{1ijk}^T d_{1ijk}.$$

This is the formula for a simple linear regression of d_1 against X_1.
Step 2. p(β_2 | y, β_1, β_3, σ², V_2, V_3)

$$\beta_{2jk} \sim N(\hat{\beta}_{2jk}, \hat{D}_{2jk})$$

$$p(\beta_{2jk} \mid y, \beta_1, \beta_3, \sigma^2, V_2, V_3) \propto p(y \mid \beta_1, \beta_2, \beta_3, \sigma^2, V_2, V_3)\,p(\beta_{2jk} \mid V_2)$$
$$\propto \prod_{i=1}^{n_{jk}} \left[\left(\tfrac{1}{\sigma^2}\right)^{\frac{1}{2}} \exp\left[-\tfrac{1}{2\sigma^2}(d_{2ijk} - X_{2ijk}\beta_{2jk})^2\right]\right] \cdot |V_2|^{-\frac{1}{2}} \exp\left[-\tfrac{1}{2}\beta_{2jk}^T V_2^{-1}\beta_{2jk}\right]$$

where d_{2ijk} = y_{ijk} - X_{1ijk}β_1 - X_{3ijk}β_{3k}, giving

$$\hat{D}_{2jk} = \left[\sum_{i=1}^{n_{jk}} \frac{X_{2ijk}^T X_{2ijk}}{\sigma^2} + V_2^{-1}\right]^{-1},$$

and

$$\hat{\beta}_{2jk} = \frac{\hat{D}_{2jk}}{\sigma^2} \sum_{i=1}^{n_{jk}} X_{2ijk}^T d_{2ijk}.$$
Step 3. p(β_3 | y, β_1, β_2, σ², V_2, V_3)

$$\beta_{3k} \sim N(\hat{\beta}_{3k}, \hat{D}_{3k})$$

$$p(\beta_{3k} \mid y, \beta_1, \beta_2, \sigma^2, V_2, V_3) \propto p(y \mid \beta_1, \beta_2, \beta_3, \sigma^2, V_2, V_3)\,p(\beta_{3k} \mid V_3)$$
$$\propto \prod_{ij} \left[\left(\tfrac{1}{\sigma^2}\right)^{\frac{1}{2}} \exp\left[-\tfrac{1}{2\sigma^2}(d_{3ijk} - X_{3ijk}\beta_{3k})^2\right]\right] \cdot |V_3|^{-\frac{1}{2}} \exp\left[-\tfrac{1}{2}\beta_{3k}^T V_3^{-1}\beta_{3k}\right]$$

where d_{3ijk} = y_{ijk} - X_{1ijk}β_1 - X_{2ijk}β_{2jk}, giving

$$\hat{D}_{3k} = \left[\sum_{ij} \frac{X_{3ijk}^T X_{3ijk}}{\sigma^2} + V_3^{-1}\right]^{-1},$$

and

$$\hat{\beta}_{3k} = \frac{\hat{D}_{3k}}{\sigma^2} \sum_{ij} X_{3ijk}^T d_{3ijk}.$$
Step 4. p(σ² | y, β_1, β_2, β_3, V_2, V_3)

Assume σ² has a scaled inverse χ² prior, p(σ²) ~ SIχ²(ν_e, s²_e). Considering 1/σ²,
and using the change of variables formula with h(σ²) = 1/σ², gives

$$p\!\left(\tfrac{1}{\sigma^2}\right) = p(\sigma^2)\left|h'(\sigma^2)\right|^{-1} = p(\sigma^2)\left(\tfrac{1}{\sigma^4}\right)^{-1} = p(\sigma^2)\left(\tfrac{1}{\sigma^2}\right)^{-2}.$$

Substituting will give

$$p\!\left(\tfrac{1}{\sigma^2} \,\Big|\, y, \beta_1, \beta_2, \beta_3, V_2, V_3\right) \propto \left(\tfrac{1}{\sigma^2}\right)^{N/2} \exp\left[-\sum_{ijk} \frac{e_{ijk}^2}{2\sigma^2}\right]\cdot\left(\tfrac{1}{\sigma^2}\right)^{-2} p(\sigma^2),$$

so 1/σ² ~ Gamma(a, b) where

$$a = \frac{N + \nu_e}{2}, \qquad b = \frac{1}{2}\left(\sum_{ijk} e_{ijk}^2 + \nu_e s_e^2\right).$$

A uniform prior on σ² is equivalent to setting ν_e = -2 and s²_e = 0.
Step 5. p(V_2 | y, β_1, β_2, β_3, σ², V_3)

Assume V_2 has an inverse Wishart prior, V_2 ~ IW(ν_p2, S_p2); then

$$p(V_2^{-1} \mid y, \beta_1, \beta_2, \beta_3, \sigma^2, V_3) \propto p(\beta_2 \mid V_2)\,p(V_2^{-1}),$$

which gives

$$V_2^{-1} \sim \mathrm{Wishart}_{n_2}\left[S_2 = \left(\sum_{jk} \beta_{2jk}\beta_{2jk}^T + S_{p2}\right)^{-1},\; \nu_2 = n_{jk} + \nu_{p2}\right].$$

Here S_2 is an n_2 × n_2 scale matrix where n_2 is the number of random variables at
level 2, ν_2 is the degrees of freedom of the Wishart distribution, and n_{jk} is the
number of level 2 units. A uniform prior is equivalent to setting ν_p2 = -n_2 - 1
and S_p2 = 0.
Step 6. p(V_3 | y, β_1, β_2, β_3, σ², V_2)

Assume V_3 has an inverse Wishart prior, V_3 ~ IW(ν_p3, S_p3); then

$$p(V_3^{-1} \mid y, \beta_1, \beta_2, \beta_3, \sigma^2, V_2) \propto p(\beta_3 \mid V_3)\,p(V_3^{-1}),$$

which gives

$$V_3^{-1} \sim \mathrm{Wishart}_{n_3}\left[S_3 = \left(\sum_{k} \beta_{3k}\beta_{3k}^T + S_{p3}\right)^{-1},\; \nu_3 = n_{k} + \nu_{p3}\right].$$

Here S_3 is an n_3 × n_3 scale matrix where n_3 is the number of random variables at
level 3, ν_3 is the degrees of freedom of the Wishart distribution, and n_k is the
number of level 3 units. A uniform prior is equivalent to setting ν_p3 = -n_3 - 1
and S_p3 = 0.
The above algorithm already shows some similarities between steps. It can be
seen that steps 2 and 3 are effectively of the same form but with summations over
different levels. The same is also true for steps 5 and 6, and so although I have
written the algorithm in six steps it could actually be written out in four. I will
now consider the N level model and show that this also only needs four steps.
5.3 Generalising to N levels
For an N level model there is 1 set of fixed effects, N sets of residuals (although
residuals at level 1 can be calculated via subtraction and so do not need to be
sampled) and N sets of variance parameters. These parameters can be split into
4 groups in such a way that all parameters in each group have posteriors of the
same form, as illustrated previously in the 3 level model.

1. The fixed effects.

2. The N - 1 sets of residuals (excluding level 1).

3. The level 1 scalar variance σ².

4. The N - 1 higher level variances.
I will need some additional notation, as using summations over N levels, i.e.
N indices, becomes impractical and messy. Firstly I will describe level 1 as the
observation level, and units at level 1 as observations. Then let M_T be the set
of all observations in the model and let M_{l,j} be the set of observations that at
level l are in category j. For example in the simple 2 level educational datasets
in Chapter 4, M_T will contain all pupils in all schools while M_{2,j} will contain all
the pupils in school j.
Now also let X_{li} be the vector of variables at level l for observation i,
where l = 1 refers to the variables associated with the fixed effects. Finally
let the random parameters at level l, l > 1, be denoted by β_{lj}, where j is one
of the combinations of higher level terms (the fixed effects will be β_1). Also
d_{li} = e_i + X_{li}β_{lj}, in the same way as d_{2ijk} = e_{ijk} + X_{2ijk}β_{2jk} in the 3 level model.
I will use the following prior distributions:
for the level 1 variance, p(σ²) ~ SIχ²(ν_e, s²_e); for the level l variance, where
l > 1, V_l ~ IW(ν_{Pl}, S_{Pl}); and for the fixed effects, β_1 ~ N(μ_p, S_p). I will now
describe the four steps.
5.3.1 Algorithm 1
Step 1 - The fixed effects, β_1.

$$p(\beta_1 \mid y, \ldots) \propto p(y \mid \beta_1, \ldots)\,p(\beta_1)$$
$$\beta_1 \sim \mathrm{MVN}(\hat{\beta}_1, \hat{D}_1)$$

where

$$\hat{D}_1 = \left[\sum_{i \in M_T} \frac{X_{1i}^T X_{1i}}{\sigma^2} + S_p^{-1}\right]^{-1},$$

and

$$\hat{\beta}_1 = \hat{D}_1 \cdot \left[\sum_{i \in M_T} \frac{X_{1i}^T d_{1i}}{\sigma^2} + S_p^{-1}\mu_p\right].$$

Step 2 - The level l residuals, β_l.

$$p(\beta_l \mid y, \ldots) \propto p(y \mid \beta_l, \ldots)\,p(\beta_l \mid V_l)$$
$$\beta_{lj} \sim \mathrm{MVN}(\hat{\beta}_{lj}, \hat{D}_{lj})$$

where

$$\hat{D}_{lj} = \left[\sum_{i \in M_{l,j}} \frac{X_{li}^T X_{li}}{\sigma^2} + V_l^{-1}\right]^{-1},$$

and

$$\hat{\beta}_{lj} = \frac{\hat{D}_{lj}}{\sigma^2} \cdot \sum_{i \in M_{l,j}} X_{li}^T d_{li}.$$

Step 3 - The level 1 scalar variance σ².

$$p(1/\sigma^2 \mid y, \ldots) \propto p(y \mid \sigma^2, \ldots)\,p(1/\sigma^2)$$
$$1/\sigma^2 \sim \mathrm{Gamma}(a_{pos}, b_{pos}),$$

where a_pos = ½(N + ν_e) and b_pos = ½(Σ_n e²_n + ν_e s²_e). For a uniform prior ν_e = -2 and s²_e = 0.

Step 4 - The level l variance, V_l.

$$p(V_l^{-1} \mid y, \ldots) \propto p(\beta_l \mid V_l)\,p(V_l^{-1})$$
$$V_l^{-1} \sim \mathrm{Wishart}_{nr_l}\left[S_{pos} = \left(\sum_{i=1}^{n_l} \beta_{li}\beta_{li}^T + S_{Pl}\right)^{-1},\; \nu_{pos} = n_l + \nu_{Pl}\right],$$

where n_l is the number of level l units. For a uniform prior, S_{Pl} = 0 and
ν_{Pl} = -nr_l - 1, where nr_l is the number of random variables at level l.
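To make the four steps concrete, the following Python sketch runs one iteration of Algorithm 1 for the simplest case, a 2 level random slopes model with uniform priors throughout (so the prior terms drop out of Steps 1 and 2, ν_e = -2 and ν_{Pl} = -nr_l - 1). The array names are mine; this is an illustration, not the MLwiN implementation.

```python
import numpy as np

rng = np.random.default_rng()

def wishart_draw(S, nu):
    """Draw from Wishart(S, nu) by summing nu outer products (needs integer nu >= dim)."""
    L = np.linalg.cholesky(S)
    Z = rng.standard_normal((int(round(nu)), S.shape[0]))
    A = Z @ L.T
    return A.T @ A

def gibbs_iteration(y, X1, X2, school, beta1, u, sigma2_e, V2):
    """One sweep of the four updating steps for a 2 level Gaussian model.

    X1 : (n, p) design matrix for the fixed effects beta1
    X2 : (n, q) design matrix for the level 2 residuals u (one q-vector per school)
    """
    n = len(y)
    J, q = u.shape

    # Step 1: fixed effects beta1 | ... (uniform prior version)
    d1 = y - np.einsum('ij,ij->i', X2, u[school])      # y minus the level 2 contribution
    D1 = sigma2_e * np.linalg.inv(X1.T @ X1)
    beta1 = rng.multivariate_normal((D1 / sigma2_e) @ (X1.T @ d1), D1)

    # Step 2: level 2 residuals u_j | ...
    V2inv = np.linalg.inv(V2)
    d2 = y - X1 @ beta1
    for j in range(J):
        rows = school == j
        X2j = X2[rows]
        D2j = np.linalg.inv(X2j.T @ X2j / sigma2_e + V2inv)
        u[j] = rng.multivariate_normal(D2j @ (X2j.T @ d2[rows]) / sigma2_e, D2j)

    # Step 3: level 1 variance via 1/sigma^2 ~ Gamma(N/2 - 1, rate = sum(e^2)/2)
    e = d2 - np.einsum('ij,ij->i', X2, u[school])
    sigma2_e = 1.0 / rng.gamma(n / 2.0 - 1.0, 2.0 / np.sum(e ** 2))

    # Step 4: level 2 variance matrix via V2^{-1} ~ Wishart((sum_j u_j u_j^T)^{-1}, J - q - 1)
    V2 = np.linalg.inv(wishart_draw(np.linalg.inv(u.T @ u), J - q - 1))
    return beta1, u, sigma2_e, V2
```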
5.3.2 Computational considerations
When writing Gibbs Sampling code one of the main concerns is the speed of
processing. The code for 1 iteration will be repeated thousands of times and so
any small speed gain for an individual iteration will be magnified greatly. The
actual memory requirements for storing intermediate quantities will be small in
comparison to the size of the results. There is therefore scope to store a few more
intermediate results if they will in turn speed up the code. I will now explain two
computational steps that will speed up the processing time.
Speed up 1
From the 4 steps shown in the general N level algorithm, note that the quantities
Σ_{i∈M_{l,j}} X_{li}^T X_{li} and Σ_{i∈M_T} X_{1i}^T X_{1i} are fixed constant matrices. It would save a large
amount of time if these quantities are calculated at the beginning and then stored
so that they can be used in each iteration.
Speed up 2
Much use is made of quantities such as d_{li}, which are equal to e_i + c_{li} where c_{li} is the
product of a parameter vector and a data vector; for example d_{2i} = e_i + X_{2i}β_{2j}.
If I store e_i, the level 1 residual for observation i, then whenever (in steps 1 and 2 of
the algorithm) one of the d quantities needs to be calculated, I can add on the
current value of the parameter multiplied by the data vector, for example X_{2i}β_{2j},
to produce d_{2i}, and then use this to calculate a new value for the appropriate β.
Once a new value has been calculated this procedure can be applied backwards
to give the new value of the level 1 residual e_i, i.e. subtract the new parameter
value × the data vector. This idea will also be repeated in later methods.
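Both speed-ups are easy to see in code. The fragment below reuses the hypothetical arrays X1, X2, school, y, beta1 and u from the earlier sketch: the cross-product matrices are formed once, and the level 1 residuals e_i are carried along and updated in place rather than recomputed.

```python
import numpy as np

rng = np.random.default_rng()

# Speed up 1: these cross-product matrices never change, so form them once
# before the chain starts rather than inside every iteration.
XtX_fixed = X1.T @ X1
XtX_school = [X2[school == j].T @ X2[school == j] for j in range(u.shape[0])]

# Speed up 2: keep the level 1 residuals e_i up to date as parameters change.
e = y - X1 @ beta1 - np.einsum('ij,ij->i', X2, u[school])

def update_school(j, sigma2_e, V2inv):
    """Redraw the level 2 residuals of school j, maintaining e as we go."""
    rows = school == j
    d2 = e[rows] + X2[rows] @ u[j]            # add the current contribution back on
    D2j = np.linalg.inv(XtX_school[j] / sigma2_e + V2inv)
    u[j] = rng.multivariate_normal(D2j @ (X2[rows].T @ d2) / sigma2_e, D2j)
    e[rows] = d2 - X2[rows] @ u[j]            # subtract the new contribution off
```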
5.4 Method 2: Metropolis Gibbs hybrid method with univariate updates
In the previous section I have given an algorithm to fit the multi-level Gaussian
model using the Gibbs sampler. In the next chapter the models considered do not
give conditional distributions that have nice forms to be simulated from easily
using the Gibbs sampler. I will now fit the current models using some alternative
MCMC methods which can then be used on the models in the next chapter.
The steps that cause the simple Gibbs sampler problems in the multi-level
logistic regression models in the next chapter are updating the residuals and
fixed effects. The first plan is to replace the Gibbs sampler on these steps
with univariate normal proposal Metropolis steps, as described in the following
algorithm.
5.4.1 Algorithm 2
Step 1 - The fixed effects, β_1.

For i in 1, ..., N_Fixed,

$$\beta_{1i}^{(t)} = \begin{cases} \beta_{1i}^* & \text{with probability } \min\left(1,\ p(\beta_{1i}^* \mid y, \ldots)\big/p(\beta_{1i}^{(t-1)} \mid y, \ldots)\right) \\ \beta_{1i}^{(t-1)} & \text{otherwise,} \end{cases}$$

where β*_{1i} = β_{1i}^{(t-1)} + ε_{1i}, ε_{1i} ~ N(0, σ²_{1i}).

Step 2 - The level l residuals, β_l.

For l in 2, ..., N, j in 1, ..., n_l, and i in 1, ..., nr_l,

$$\beta_{lji}^{(t)} = \begin{cases} \beta_{lji}^* & \text{with probability } \min\left(1,\ p(\beta_{lji}^* \mid y, \ldots)\big/p(\beta_{lji}^{(t-1)} \mid y, \ldots)\right) \\ \beta_{lji}^{(t-1)} & \text{otherwise,} \end{cases}$$

where β*_{lji} = β_{lji}^{(t-1)} + ε_{lji}, ε_{lji} ~ N(0, σ²_{lji}), n_l is the number of level l units, and
nr_l is the number of random parameters at level l.

Step 3 - The level 1 scalar variance σ².

This step is the same as in Algorithm 1.

Step 4 - The level l variance, V_l.

This step is the same as in Algorithm 1.
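A minimal sketch of the univariate random walk Metropolis update used in Steps 1 and 2 is given below; log_cond is a hypothetical function returning the log of the appropriate conditional posterior for whichever fixed effect or residual is being updated, and each parameter has its own proposal SD.

```python
import numpy as np

rng = np.random.default_rng()

def rw_metropolis_update(theta, prop_sd, log_cond):
    """One univariate random walk Metropolis step with a normal proposal.

    theta    : current value of the parameter
    prop_sd  : proposal standard deviation for this parameter
    log_cond : function giving the log conditional posterior of this parameter
    Returns the new value and whether the proposal was accepted.
    """
    theta_star = theta + rng.normal(0.0, prop_sd)        # theta* = theta + epsilon
    log_ratio = log_cond(theta_star) - log_cond(theta)
    if np.log(rng.uniform()) < min(0.0, log_ratio):      # accept w.p. min(1, ratio)
        return theta_star, True
    return theta, False
```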
5.4.2 Choosing proposal distribution variances
When using the Gibbs sampler on multi-level models, having defined the
steps of the algorithm, the only remaining task is to fix starting values for all
the parameters. Generally starting values could be set fairly arbitrarily and the
results should be similar. To improve the mixing of the Markov chains, and
to utilise MLwiN's other facilities, I use the current estimates obtained by the
maximum likelihood IGLS or RIGLS methods as starting values. Having set the
starting values it is now simply a question of running through the steps of the
algorithm repeatedly.
When Metropolis steps are introduced to the algorithm, there is now one
more set of parameters that needs to be assigned values. In Steps 1 and 2 of
the above algorithm, there are normal proposal distributions with undefined
variances, and these variances need to be given sensible values. The Metropolis
steps will actually work with any positive values for the proposal variances, and
will eventually, given time, give estimates with a reasonable accuracy, but ideally
we would like accurate estimates in the minimum number of iterations. To achieve
this aim, proposal variances that give a chain that mixes well are desirable.
Gelman, Roberts, and Gilks (1995) explore efficient Metropolis proposal
distributions for normally distributed data in some detail. They show that the
ideal proposal standard deviation for a parameter of interest, θ, is approximately
2.4 times the standard deviation of θ. This implies the ideal proposal distribution
variance is 5.8 times the variance of θ. This means that if an estimate of the
variance of the parameter of interest were available, this result could be used and
the Metropolis algorithm run efficiently. Fortunately MLwiN also gives
standard errors for its estimates produced by IGLS or RIGLS and so these values
can be used.
The models studied in Gelman, Roberts, and Gilks (1995) are fairly simple
and there is no guarantee that the optimal value of 5.8 for the scaling factor
will follow for multi-level models. To test this out I will consider a few simple
multi-level models and find through simulation whether this optimal value of 5.8
holds.
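In code the rule amounts to nothing more than the following (function name mine):

```python
def proposal_sd_from_rigls(se, scale_factor=5.8):
    """Proposal SD from an IGLS/RIGLS standard error using the Gelman, Roberts
    and Gilks (1995) rule: proposal variance = scale_factor x estimated variance."""
    return (scale_factor * se ** 2) ** 0.5
```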
Finding optimal scaling factors
To find optimal scaling factors for the variance of the proposal distribution a
practical approach was taken. Several values for the scaling factor spread over
the range 0.05 to 20 were considered, and for each value 3 MCMC runs with a
burn-in of 500 and a main run of 50,000 were performed. The same value was
used as a multiplier for the variance estimate from the RIGLS method for each
fixed effect and higher level residual parameter. To find the optimal value the
Raftery Lewis statistic was calculated. In Chapter 3 I showed that the Raftery
Lewis N statistic is equivalent to the reciprocal of the efficiency of the estimate,
so the optimal scaling factor will be the scaling factor value that minimises N.
The method was firstly used on the two simple models considered in Chapter
4, the variance components and random slopes regression models. The results
can be seen in Figures 5-1, 5-2 and 5-3. From these figures the shape of the
graphs can be seen to be similar to those in Chapter 3 (Figure 3-6), although
the graphs in Chapter 3 are based on the scale factor for the standard deviation
and not the variance. What is immediately clear is that the value 5.8 is not
the minimum for the scale factor as was found in the simple Gaussian model in
Chapter 3. In the second example (Figures 5-2 and 5-3) it can be seen that the
optimal scale factor is not even the same for both parameters.
There is some noise when using N as an estimate of efficiency, but a rough
estimate of the minimum can be obtained and on all three graphs this is far
smaller than 5.8. This creates a problem as the same scale factor is being
used for each parameter, and so this constraint prevents the use of the different
optimal values. The calculations in Gelman, Roberts, and Gilks (1995) that give
the optimal value of the scale factor are fairly mathematically complex. Due
to this complexity I do not intend to attempt to find similar optimal formulae
mathematically for multi-level models in this thesis.
The method was also considered on some other models and the results can be
seen in Table 5.1. In Table 5.1, β_0 is the intercept, β_1 is the Math3 effect, and
β_2 is the sex effect. The models are all either variance components or random
slopes regression models, with the number indexing the number of fixed effects.
The final model (SCH1) uses a different educational dataset from Goldstein et al.
Figure 5-1: Plots of the effect of varying the scale factor for the proposal variance, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the β_0 parameter in the variance components model on the JSP dataset. (Top panel: Raftery Lewis N̂ against scale factor; bottom panel: Raftery Lewis N̂ against acceptance rate.)
Figure 5-2: Plots of the effect of varying the scale factor for the proposal variance, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the β_0 parameter in the random slopes regression model on the JSP dataset. (Top panel: Raftery Lewis N̂ against scale factor; bottom panel: Raftery Lewis N̂ against acceptance rate.)
Figure 5-3: Plots of the effect of varying the scale factor for the proposal variance, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the β_1 parameter in the random slopes regression model on the JSP dataset. (Top panel: Raftery Lewis N̂ against scale factor; bottom panel: Raftery Lewis N̂ against acceptance rate.)
(1998), which has 4059 students in 65 schools. To this dataset a similar random
slopes regression model has been fitted to assess whether the results obtained for
the JSP dataset are unique.
Table 5.1: Optimal scale factors for proposal variances and best acceptance rates for several models.

Model   Optimal scale factor      Acceptance rate (%)
        β_0    β_1    β_2         β_0       β_1       β_2
VC1     0.75   -      -           45-70     -         -
VC2     1.0    4.0    -           45-70     40-60     -
VC3     1.0    4.0    2.0         45-70     40-60     40-65
RSR2    0.75   2.0    -           40-75     35-65     -
RSR3    0.75   2.5    2.0         45-70     40-70     40-65
SCH1    0.5    1.5    -           45-80     40-70     -
If the acceptance rate is considered instead, it can be seen that the graphs are
far flatter, and there is a wide range of acceptance rates that give similar values
of N. Gelman, Roberts, and Gilks (1995) calculate the optimal acceptance rate
to be 44% for Gaussian data, and although this value appears to give a reasonably
low N it does not appear to be the minimum for all parameters. It will however
give far better results than using the scale factor of 5.8.
In Table 5.1 ranges of values have been given for the acceptance rates, as
the graphs of N values over these ranges are fairly flat. It can be seen that
all parameters considered give good results with acceptance rates between 45%
and 60%. This shows that if a proposal distribution that gives the same desired
acceptance rate for every parameter could be found then this would be a better
method than using the scale factor method considered thus far. This is the
motivation behind considering adaptive samplers.
5.4.3 Adaptive Metropolis univariate normal proposals
An additional problem with using the variance estimates produced by IGLS
and RIGLS to calculate the proposal distribution variances is the assumption
that these methods give good estimates. This does not however explain the
discrepancies from the value 5.8 for the scale factor, as the IGLS and RIGLS
variance estimates will generally be too small which would have the opposite
effect on the scaling factor. An alternative approach would be to have starting
proposal distributions and then adapt these distributions as the algorithm is
running to improve the mixing of the Markov chain.
Care has to be taken when performing adaptive Metropolis sampling (Gelfand
and Sahu 1994) as the simulations produced may not be a Markov chain. Gilks,
Roberts, and Sahu (1996) give a mathematical method based on Markov chain
regeneration that will give time points when it is acceptable to modify the
proposal distribution during the monitoring run of the chain. This method
although shown to be effective in the paper, is rather complicated and so I decided
instead to use the simpler approach of adapting the proposal distributions in a
preliminary period before the `burn-in' and main monitoring run of the chain.
Muller (1993) gives a simple adaptive Metropolis sampler based on the belief
that the ideal sampler will accept approximately 50% of the iterations. Gelman,
Roberts, and Gilks (1995) show that for univariate normal proposals, used on
a multivariate normal posterior density, the ideal acceptance rate is 44% but
from Table 5.1 it can be seen that for multi-level models 50% is an equally
good acceptance rate. Muller (1993) considers the last 10 observed acceptance
probabilities and uses the simple approach of modifying the proposal distribution
if the average of these acceptance rates lies outside the range 0.2 to 0.8.
There are several factors to consider when designing an adaptive algorithm.
Firstly how often to adapt the proposal distributions, secondly how to adapt
the proposal distributions and thirdly when to stop the adapting period and
to continue with the `burn-in' period. I will outline two adaptive Metropolis
algorithms that aim to give acceptance rates of 44% for all parameters, although
44% can be substituted by any other percentage.
Adaptive sampler 1
This method has been implemented in MLwiN and has an adapting period of
unknown length (up to an upper limit) followed by the usual `burn-in' period
and finally the main run from which the estimates are obtained. The objective
of this method is to achieve an acceptance rate of x% for all the parameters of
interest. Although in the MLwiN package, the proposal distributions used in the
non-adaptive method will be used as initial proposal distributions, the algorithm
will work on arbitrary starting proposals as illustrated in the examples below.
The algorithm needs the user to input 2 parameters. Firstly x% the desired
acceptance rate, which in the example will be 44% and a tolerance parameter,
which in the example will be 10%. This tolerance parameter governs when the
algorithm stops and is meant to signify bounds on the desired acceptance, that
is the desired acceptance rate is 44% but if the acceptance rate is somewhere
between 34% and 54% we are fairly happy. The algorithm then runs the sampler
with the current proposal distributions for batches of 100 and at the end of each
batch of 100, the proposal distributions are modified. This procedure is repeated
until the tolerance conditions are achieved. The modification procedure that
happens after each batch of 100 is detailed in the algorithm below.
Method
The following algorithm is repeated for each parameter. Let NAcc be the number
of iterations accepted in the current batch for the chosen parameter (out of 100),
OPTAcc be the desired acceptance rate, and PSD be the current proposal standard
deviation for the parameter.
$$\text{If } N_{Acc} > OPT_{Acc}: \quad PSD = PSD \times \left(2 - \frac{100 - N_{Acc}}{100 - OPT_{Acc}}\right);$$

$$\text{If } N_{Acc} < OPT_{Acc}: \quad PSD = PSD \Big/ \left(2 - \frac{N_{Acc}}{OPT_{Acc}}\right).$$
The above will modify the proposal standard deviation by a greater amount
the further the acceptance rate is from the desired acceptance rate. If the
acceptance rate is too small then the proposed new values are too far from the
current value and so the proposal SD is decreased. If the acceptance rate is too
high, then the proposed new values are not exploring enough of the posterior
distribution and so the proposal SD is increased.
To check if the tolerance condition is achieved NAcc is compared with the
tolerance interval, (OPT_Acc - TOL_Acc, OPT_Acc + TOL_Acc). If three successive
values of NAcc are in this interval then the parameter is marked as satisfying the
tolerance conditions. Once all parameters have been marked then the tolerance
condition is satisfied. After a parameter has been marked it is still modified
as before until all parameters are marked, but each parameter only needs to be
marked once for the algorithm to end. To limit the time spent in the adapting
procedure an upper limit is set (in MLwiN this is 5,000 iterations) and after this
time the adapting period ends regardless of whether the tolerance conditions are
met.
Note that it may be better to use the sum of the actual Metropolis acceptance
probabilities as in Muller (1993) instead of NAcc in the above algorithm,
although preliminary investigations show no significant differences in the proposal
distributions chosen.
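The modification rule and the tolerance check are simple enough to state directly in code. The sketch below (function names mine) works on one parameter's proposal SD at a time; in MLwiN the adaptation is applied to all parameters after each batch of 100 iterations, and it continues until every parameter has been marked or the 5,000 iteration cap is reached.

```python
def adapt_proposal_sd(psd, n_acc, opt_acc=44.0):
    """One adaptation of a proposal SD after a batch of 100 Metropolis updates,
    using the modification rule of adaptive sampler 1."""
    if n_acc > opt_acc:
        psd *= 2.0 - (100.0 - n_acc) / (100.0 - opt_acc)
    elif n_acc < opt_acc:
        psd /= 2.0 - n_acc / opt_acc
    return psd

def in_tolerance(n_acc, opt_acc=44.0, tol=10.0):
    """True if this batch's acceptance count lies inside the tolerance interval;
    a parameter is marked once three successive batches satisfy this."""
    return opt_acc - tol <= n_acc <= opt_acc + tol
```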
Results
Table 5.2 shows the adapting period for the two fixed effects parameters for
one run of the random slopes regression model with the JSP dataset. Here the
starting values have been chosen arbitrarily to be 1.0, whereas when this method
is used in MLwiN the RIGLS estimates will be used instead. From the table
it can be seen that both parameters have fulfilled the tolerance criteria by 700
iterations. However, as the adapting period also includes the level 2 residuals, it
is not complete until 3,300 iterations, when the final set of residuals satisfy the
criteria.
Table 5.2: Demonstration of Adaptive Method 1 for parameters β_0 and β_1 using arbitrary (1.000) starting values.

N       β_0 SD   NAcc   N in Tol    β_1 SD   NAcc   N in Tol
0       1.000    -      -           1.000    -      -
100     0.587    13     0           0.512    2      0
200     0.574    43     1           0.271    5      0
300     0.451    32     0           0.155    11     0
400     0.441    43     1           0.105    23     0
500     0.422    42     2           0.087    35     1
600     0.379    39     3*          0.075    37     2
700     0.412    49     3*          0.082    49     3*
800     0.370    39     3*          0.075    40     3*
900     0.354    42     3*          0.086    52     3*
1,000   0.436    57     3*          0.065    30     3*
...
3,300   0.381    47     3*          0.064    42     3*
In Table 5.3, runs of length 50,000 for various different methods using the same
random slopes regression model are compared. The four methods considered
are the Gibbs sampling method used in Chapter 4, and three versions of the
Metropolis Gibbs hybrid method: firstly using proposal SDs set at 1.0 for all
parameters, secondly using the RIGLS starting values to create the proposal
distributions, and finally using the first adaptive method.
After 50,000 iterations, the parameter estimates of all four methods are
reasonably similar. The Raftery Lewis N values show more clearly how well
the methods are performing. The Gibbs sampler generally has the lowest values
of N , with the adaptive method the best of the hybrid methods. The need to
choose good proposal distributions is highlighted by the huge N value for �1 using
the arbitrary 1.0 proposal distribution SD. This value is over 30 times longer than
the suggested run length for Gibbs for �1.
The acceptance rates and proposal standard deviations in Table 5.3 show how
far from the expected 44% acceptance rate, the RIGLS starting values method
actually is. This in turn explains why the N value for �0 using this method is
larger than for the adaptive method. The table also shows that the adaptive
method is a better approach than using the RIGLS starting values and this is
backed up by the Figures 5-1 to 5-3 seen earlier. The results in Table 5.3 are
based on only one run of each method, but other runs were performed and similar
results were obtained.
Adaptive sampler 2
Two criticisms that may be levelled against the first adaptive sampler are, firstly,
that there is no definite length for the adapting period and, secondly, that the method
includes a tolerance parameter which has to be set. This second method is
intended to improve on the first algorithm by doing away with the
tolerance parameter and giving acceptance rates closer to the desired acceptance
rate.
Method
In the first sampler, although the change to the proposal SD is smaller the closer
the current acceptance rate is to the desired acceptance rate, the change does
not vary with time. I will try to incorporate the MCMC technique of simulated
annealing (Geman and Geman 1984) by allowing the change to the proposal SD
to decrease with time.
Table 5.3: Comparison of results for the random slopes regression model on the JSP dataset using uniform priors for the variances, and different MCMC methods. Each method was run for 50,000 iterations after a burn-in of 500.

Parameter estimates (SD):
  Par.     Gibbs            MH (SD = 1)      MH (RIGLS)       MH Adapt 1
  β0       30.60 (0.396)    30.59 (0.374)    30.60 (0.406)    30.57 (0.417)
  β1       0.614 (0.048)    0.614 (0.047)    0.614 (0.051)    0.616 (0.049)
  Ωu00     5.674 (1.732)    5.699 (1.716)    5.702 (1.656)    5.780 (1.754)
  Ωu01     −0.426 (0.163)   −0.428 (0.162)   −0.420 (0.160)   −0.436 (0.168)
  Ωu11     0.055 (0.024)    0.055 (0.023)    0.054 (0.025)    0.055 (0.024)
  σ²e      26.98 (1.339)    26.93 (1.342)    26.99 (1.337)    26.92 (1.336)

Raftery and Lewis diagnostic (N):
  β0       10,520           60,728           58,778           32,528
  β1       6,453            216,954          24,421           24,999
  Ωu00     5,792            6,175            5,645            5,684
  Ωu01     4,866            4,714            4,882            5,212
  Ωu11     12,345           9,480            14,389           11,877
  σ²e      3,898            3,810            3,867            3,835

Acceptance rates for fixed effects (%):
  β0       100%             21.7%            23.2%            46.3%
  β1       100%             3.6%             34.0%            40.6%

Proposal standard deviations:
  β0       --               1.000            0.880            0.395
  β1       --               1.000            0.103            0.081
The algorithm will then be run for a fixed length of time,
Tmax (Tmax is chosen to be 5,000 in the example) and the proposal distributions
will be modified every 100 iterations. The following procedure will be carried out
for each parameter at time t:

\[
\text{If } N_{Acc} > OPT_{Acc}: \quad PSD = PSD \times \left(1 + \left(1 - \frac{100 - N_{Acc}}{100 - OPT_{Acc}}\right)\left(\frac{T_{max} - t + 100}{T_{max}}\right)\right);
\]
\[
\text{If } N_{Acc} < OPT_{Acc}: \quad PSD = PSD \Big/ \left(1 + \left(1 - \frac{N_{Acc}}{OPT_{Acc}}\right)\left(\frac{T_{max} - t + 100}{T_{max}}\right)\right).
\]
So after the first 100 iterations the range of possible changes to the
proposal SD is $(\tfrac{1}{2}PSD,\, 2PSD)$ as in the first algorithm, but this shrinks to
$(\tfrac{T_{max}}{T_{max}+100}PSD,\, \tfrac{T_{max}+100}{T_{max}}PSD)$ at time $T_{max}$.
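For comparison with the earlier sketch, a minimal Python version of the Method 2 adjustment is given below, under the same illustrative naming assumptions; the only change from Method 1 is the linearly decaying factor.

```python
def tune_proposal_sd_annealed(psd, n_acc, t, t_max=5000, opt_acc=44):
    """Adaptive Method 2: the adjustment decays with time t, so factors of up
    to 2 after the first batch shrink towards (t_max + 100) / t_max at t = t_max."""
    decay = (t_max - t + 100) / t_max
    if n_acc > opt_acc:
        psd *= 1.0 + (1.0 - (100.0 - n_acc) / (100.0 - opt_acc)) * decay
    elif n_acc < opt_acc:
        psd /= 1.0 + (1.0 - n_acc / opt_acc) * decay
    return psd
```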
Results
Table 5.4 shows the adapting period for the two fixed effects parameters for
one run of the random slopes regression model with the JSP dataset using the
second method. Here again, the starting values have been chosen arbitrarily
to be 1.0 whereas this method could use the RIGLS estimates from MLwiN.
From Table 5.4 it can be seen that as time increases the changes to the proposal
standard deviation become smaller as with a simulated annealing algorithm.
The two methods were each run ten times for 5,000 iterations using the
random slopes regression model, with the ideal acceptance rate set to 44%. The
actual acceptance rates achieved for the two fixed effects were recorded for both
methods. For parameter β0, the first method obtained acceptance rates between
40.0% and 49.0% whilst the second method obtained rates of between 43.3% and
47.3%. For parameter β1, the first method obtained rates between 41.0% and
48.4% whilst the second method obtained rates between 43.3% and 46.4%. It is
not entirely fair to compare these figures directly as the second method was run
for 5,000 iterations every time whereas the first method ran on average for only
2,100 iterations; however the second method does appear to give a more accurate
acceptance rate.
Table 5.4: Demonstration of Adaptive Method 2 for parameters β0 and β1 using arbitrary (1.000) starting values.

  N       β0 SD   NAcc    β1 SD   NAcc
  0       1.000     --    1.000     --
  100     0.587     13    0.512      2
  200     0.574     43    0.274      5
  300     0.440     30    0.161     12
  400     0.484     50    0.109     22
  500     0.499     46    0.091     34
  600     0.437     37    0.081     38
  700     0.364     34    0.083     46
  800     0.403     51    0.072     36
  900     0.396     43    0.067     40
  1,000   0.333     34    0.075     52
  ...
  4,800   0.397     46    0.073     41
  4,900   0.397     44    0.072     38
  5,000   0.398     51    0.072     47
A balance has to be struck between the additional burden of on average 2,900
extra iterations (in this example) in the adapting period and any gain in speed
of obtaining accurate estimates.
Although this second method is an interesting alternative to the first adaptive
method, it will not be considered further in this thesis as it does not offer any
significant improvements. Instead I will now go on to consider multivariate
normal Metropolis updating methods.
5.5 Method 3 : Metropolis Gibbs hybrid method with block updates
One disadvantage of using univariate normal proposal distributions is that the
correlation between parameters is completely ignored, and so if two parameters
are highly correlated it would be desirable to adjust for this in the proposal
distribution. Highly correlated parameters are generally avoided by centering
predictor variables but sometimes large correlations still exist.

The Gibbs sampler algorithm updates parameters in blocks; for example the
fixed effects are updated together and all residuals for one level 2 unit are updated
together. The second hybrid method will mimic the Gibbs sampler steps for
residuals and fixed effects by using multivariate normal proposal Metropolis steps
for these blocks, as described in the following algorithm:
5.5.1 Algorithm 3
Step 1 - The fixed effects, β1.

\[
\beta_1^{(t)} = \begin{cases} \beta_1^{*} & \text{with probability } \min\!\left(1,\; p(\beta_1^{*} \mid y, \ldots)\big/p(\beta_1^{(t-1)} \mid y, \ldots)\right) \\[4pt] \beta_1^{(t-1)} & \text{otherwise,} \end{cases}
\]

where \(\beta_1^{*} = \beta_1^{(t-1)} + \epsilon_1\), \(\epsilon_1 \sim \text{MVN}(0, \Sigma_1)\).

Step 2 - The level l residuals, βl.

For l in 2, …, N and j in 1, …, n_l,

\[
\beta_{lj}^{(t)} = \begin{cases} \beta_{lj}^{*} & \text{with probability } \min\!\left(1,\; p(\beta_{lj}^{*} \mid y, \ldots)\big/p(\beta_{lj}^{(t-1)} \mid y, \ldots)\right) \\[4pt] \beta_{lj}^{(t-1)} & \text{otherwise,} \end{cases}
\]

where \(\beta_{lj}^{*} = \beta_{lj}^{(t-1)} + \epsilon_{lj}\), \(\epsilon_{lj} \sim \text{MVN}(0, \Sigma_{lj})\), and n_l is the number of level l units.

Step 3 - The level 1 scalar variance σ².

This step is the same as in Algorithm 1.

Step 4 - The level l variance, Vl.

This step is the same as in Algorithm 1.
5.5.2 Choosing proposal distribution variances
When using multivariate normal proposal distributions, a similar problem exists
as for the univariate case: the variance matrices for the proposal distributions in
steps 1 and 2 need to be assigned values. This time the Metropolis steps 1 and
2 will work with any positive definite matrix for the proposal variances; however,
ideally a matrix that gives estimates with a reasonable accuracy in the minimum
number of iterations is desired.
Gelman, Roberts, and Gilks (1995) also consider the case of a multivariate
normal Metropolis proposal distribution on multivariate normal posterior distributions.
They calculate the optimal scale factor to be used as a multiplier for the
estimated standard deviation matrix for dimensions 1 to 10 and also found an
asymptotically optimal estimator for this scale factor. This asymptotic estimator
is $2.38/\sqrt{d}$, where d is the dimension of the proposal distribution. When
considering the estimated covariance matrix this multiplier becomes $5.66/d$ (that is, $2.38^2/d$).
This now implies that if an estimate of the covariance matrix of the parameters
of interest can be found then this can be multiplied by this optimal scale factor
and the resulting matrix can be used as the variance matrix for the proposal
distribution. Fortunately MLwiN will give the covariance matrices associated
with both the fixed effects and the residuals and so these values can be used.
Finding optimal scaling factors
When I considered the univariate normal distributions I found that the optimal
value of 5.8 for the scale factor from Gelman, Roberts, and Gilks (1995) did
not actually follow for multi-level models. For the multivariate normal proposal
distributions I will now consider again the random slopes regression model on the
JSP dataset.

The asymptotic estimator gives the value 2.83 when d = 2 as in the random
slopes regression model. The optimal estimate from Gelman, Roberts, and Gilks
(1995) is 2.89, so the asymptotic estimator is fairly accurate when d = 2. I used
a similar approach to that used for the univariate proposal distributions. Several
values of the scaling factor spread over the range 0.02 to 10 were chosen, and for
each value three MCMC runs with a burn-in of 1,000 and a main run of 50,000
were performed. As with the univariate case the Raftery Lewis N statistic was
calculated to measure efficiency. The results for the random slopes regression
model can be seen in Figures 5-4 and 5-5.
From Figures 5-4 and 5-5 the optimal scale factors for both β0 and β1 appear
to be around 0.75, which is far smaller than the values from Gelman, Roberts,
and Gilks (1995). As with the univariate case the acceptance rate gives a far
flatter graph. Here acceptance rates in the range 30% to 75% for β0 and in the
range 30% to 70% for β1 give similar low values of N. Gelman, Roberts, and
Gilks (1995) give the acceptance rate of 35.2% as optimal for a bivariate normal
proposal, which does appear in these ranges.
[Figure 5-4. Two panels plotting the Raftery Lewis N-hat for β0 against the proposal scale factor and against the Metropolis acceptance rate. Caption: Plots of the effect of varying the scale factor for the multivariate normal proposal distribution, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the β0 parameter in the random slopes regression model on the JSP dataset.]
[Figure 5-5. Two panels plotting the Raftery Lewis N-hat for β1 against the proposal scale factor and against the Metropolis acceptance rate. Caption: Plots of the effect of varying the scale factor for the multivariate normal proposal distribution, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the β1 parameter in the random slopes regression model on the JSP dataset.]
The flatness of the acceptance rate
graph around the minimum again implies that finding proposal distributions that
give a desired acceptance rate for every parameter is a better approach than the
scale factor method. This leads us to consider adaptive multivariate samplers.
5.5.3 Adaptive multivariate normal proposal distributions
The adaptive samplers considered for univariate normal proposal distributions
can both be extended to multivariate proposals. I will only consider modifying
the first sampler, which is used in MLwiN, but the alternative sampler based on
simulated annealing could also be easily modified. When considering multivariate
proposals there is more flexibility in the possible variance matrices that can
be considered. I could simply modify the univariate algorithm by allowing the
scale factor to vary and keeping the original estimate of the covariance matrix
(generally from RIGLS) fixed. This is rather restrictive as the proposal variance
would then have to be a scalar multiple of the initial variance estimate and so an
alternative approach will be considered.
In Gelman, Roberts, and Gilks (1995) different optimal acceptance rates are
given for different dimensions and these range from 44% when d = 1 down to 25%.
The optimal acceptance rate when d = 2 is 35.2%, although Figures 5-4 and 5-5
show that for the random slopes regression model acceptance rates between 30%
and 70% are all reasonably good.
Adaptive sampler 3
This sampler is a slightly modified generalisation of adaptive sampler 1. The
objective of the method is again to achieve an acceptance rate of x% for all
blocks of parameters. Effectively any positive definite matrix can be used as an
initial variance matrix for the proposal distribution, but in practice it is better
to use good estimates as problems may occur if there are no changes accepted in
the first batch of 100 iterations.

The algorithm needs the user to input 2 parameters: firstly x%, the desired
acceptance rate, and secondly a tolerance parameter. This tolerance parameter
will work in exactly the same way as in sampler 1. The algorithm runs the
sampler with the current proposal distributions for batches of 100 and at the
end of each batch of 100 iterations the proposal distributions are modified. Each
proposal distribution consists of two distinct parts: firstly the current estimate
of the covariance matrix for the block of parameters considered, and secondly the
scale factor by which this matrix is multiplied to give the proposal distribution
variance. The main difference from the univariate case is that the current estimate
of the covariance matrix is updated after every 100 iterations whereas for the
univariate case the variance estimate remains fixed at the RIGLS estimate.

For the first 100 iterations, the RIGLS estimate for the covariance matrix is
used. Then after each 100 iterations the covariance matrix is calculated from all
the iterations run thus far. The procedure to follow after every 100 iterations is
as given below:
Method
The following algorithm is repeated for each block of parameters. Let NAcc be
the number of moves accepted in the current batch for the chosen block (out of
100), OPTAcc be the desired acceptance rate, and SF be the current proposal
scale factor for the block.
\[
\text{If } N_{Acc} > OPT_{Acc}: \quad SF = SF \times \left(2 - \frac{100 - N_{Acc}}{100 - OPT_{Acc}}\right);
\]
\[
\text{If } N_{Acc} < OPT_{Acc}: \quad SF = SF \Big/ \left(2 - \frac{N_{Acc}}{OPT_{Acc}}\right).
\]
The above will modify the proposal scale factor by a greater amount the
further the current acceptance rate is from the desired acceptance rate. If the
acceptance rate is too small then the proposed new values are too far from the
current value and so the scale factor is decreased. If the acceptance rate is too
high, then the proposed new values are not exploring enough of the posterior
distribution and so the scale factor is increased.
To calculate the actual variance matrix for the proposal distribution, this
scale factor has to be multiplied by the current estimate of the covariance matrix
for the block of parameters. This estimate is based on the values obtained from
the iterations run thus far and after each batch of 100 iterations this estimate is
modified accordingly.

The procedure for checking that the tolerance criteria are satisfied and the
maximum length of the adapting period are both the same as in adaptive sampler
1.
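A minimal sketch of the block version is given below; np.cov as the running covariance estimate, and the names rigls_cov and history (the list of sampled parameter vectors so far), are illustrative assumptions rather than the MLwiN code.

```python
import numpy as np

def tune_scale_factor(sf, n_acc, opt_acc=35):
    """Same adjustment rule as Method 1, applied to the block scale factor."""
    if n_acc > opt_acc:
        sf *= 2.0 - (100.0 - n_acc) / (100.0 - opt_acc)
    elif n_acc < opt_acc:
        sf /= 2.0 - n_acc / opt_acc
    return sf

def block_proposal_cov(sf, history, rigls_cov):
    """Scale factor times the current covariance estimate: the RIGLS estimate
    for the first batch, thereafter the covariance of all iterations so far."""
    if len(history) < 100:
        return sf * rigls_cov
    return sf * np.cov(np.asarray(history), rowvar=False)

# A proposed block move would then be, for a current parameter vector beta:
#   beta_star = beta + np.random.multivariate_normal(
#       np.zeros(len(beta)), block_proposal_cov(sf, history, rigls_cov))
```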
Results
Table 5.5 shows the adapting period for the block of two fixed effects parameters
in the random slopes regression model with the JSP dataset. The starting values
are the RIGLS estimates of the covariance matrix from MLwiN and the desired
acceptance rate is 35%. The columns labelled Vp00, Vp01 and Vp11 give the elements
of the proposal variance matrix. It is interesting to note that there is a huge jump in the proposal
variance matrix after 100 iterations. This is because the actual iterations are
then used to estimate the covariance matrix instead of the RIGLS estimates, and
the estimate after 100 iterations will be less accurate than the RIGLS estimate.
However as the number of iterations increases the accuracy will improve.

From Table 5.5 it can be seen that this block of parameters fulfils the tolerance
criteria by 400 iterations. However as the adapting period also includes the
level 2 residuals, it is not complete until 1,300 iterations when the final set of
residuals satisfy the criteria. It is interesting to note that for the fixed effects
only 17 iterations were accepted in the last block of 100 and so the final proposal
distribution is quite different from the penultimate one. This does not however
seem to have affected the results in Table 5.6.
In Table 5.6 runs of length 50,000 iterations for various different multivariate
proposal distribution methods are compared. These results can also be compared
with Table 5.3 which gives results for the same model but using Gibbs sampling
and univariate proposal distribution methods. The three proposal distributions
considered in Table 5.6 are firstly an arbitrary identity matrix for the proposal
variance, secondly the estimate from RIGLS multiplied by the scale factor (2.9)
as the variance matrix and thirdly the adaptive method with 35% as the desired
acceptance rate.

The method using the identity matrix as the proposal variance shows the
importance of choosing a sensible proposal distribution. It has an acceptance
rate of less than 1%, which leads to huge values of N for the fixed effects. Although
the estimates it produces are similar to the other methods, this is due to using
the starting values from RIGLS.
Table 5.5: Demonstration of Adaptive Method 3 for the β parameter vector using RIGLS starting values.

  N       SF(β)   NAcc   N in Tol    Vp00     Vp01      Vp11
  0       2.90      --      --       0.388    −0.024    0.0053
  100     1.75      12       0       0.062    −0.002    0.0007
  200     1.99      44       1       0.078    −0.005    0.0016
  300     1.99      35       2       0.161    −0.014    0.0030
  400     2.18      41       3*      0.192    −0.015    0.0036
  500     1.55      21       3*      0.119    −0.009    0.0023
  600     1.60      37       3*      0.139    −0.013    0.0036
  700     1.63      36       3*      0.132    −0.011    0.0034
  800     1.29      26       3*      0.152    −0.011    0.0028
  900     1.41      41       3*      0.212    −0.014    0.0033
  1,000   1.21      29       3*      0.220    −0.015    0.0030
  1,300   0.73      17       3*      0.128    −0.009    0.0018
The estimates of parameter standard deviations
it produces are too small due to the low acceptance rate and this gives a better
indication of the method's poor performance.
The method based on the RIGLS starting values and a scale factor of 2.9
gives results that are far better and values of N that are similar to the equivalent
univariate method. The adaptive method, as in the univariate case, improves on
the scale factor method, although the univariate adaptive method appears to do
better than the multivariate method in terms of N values.

This section has shown that Metropolis block updating methods can be
produced by modifying the univariate updating methods. For the one bivariate
example considered the block updating methods do not show any improvement
over their univariate equivalents in terms of minimising expected run lengths
N. There is scope to consider these methods in more detail and to see whether
there are any improvements on other datasets where there is greater correlation
between parameters in a block, but this will not be pursued in this thesis.
Table 5.6: Comparison of results for the random slopes regression model on the JSP dataset using uniform priors for the variances, and different block updating MCMC methods. Each method was run for 50,000 iterations after a burn-in of 500.

Parameter estimates (SD):
  Par.     MH (Vp = I)       MH (RIGLS)       MH Adapt 3
  β0       30.57 (0.359)     30.60 (0.405)    30.58 (0.397)
  β1       0.616 (0.042)     0.615 (0.048)    0.615 (0.048)
  Ωu00     5.656 (1.702)     5.680 (1.735)    5.624 (1.713)
  Ωu01     −0.429 (0.161)    −0.424 (0.163)   −0.427 (0.165)
  Ωu11     0.055 (0.024)     0.054 (0.024)    0.056 (0.024)
  σ²e      26.92 (1.336)     26.98 (1.341)    27.00 (1.342)

Raftery and Lewis diagnostic (N):
  β0       1,045,468         62,443           43,618
  β1       456,906           48,805           36,881
  Ωu00     5,917             6,058            6,017
  Ωu01     4,842             5,065            4,762
  Ωu11     10,424            18,712           14,535
  σ²e      3,860             3,767            3,791

Acceptance rates for fixed effects (%):
  β        0.96%             18.7%            36.2%

Proposal variance matrix:
  Vp00     1.000             0.388            0.128
  Vp01     0.000             −0.024           −0.009
  Vp11     1.000             0.0053           0.0018
5.6 Summary
In this chapter several MCMC methods for fitting N level Gaussian models have
been discussed and algorithms produced. The Gibbs sampling method introduced
in the last chapter for two simple multi-level models was extended to fit general
N level models. For N level Gaussian models, where the conditional distributions
required in the Gibbs algorithm have forms that are easy to simulate from, this
method performs best.

Two other hybrid methods based on a combination of Metropolis and Gibbs
sampling steps were introduced in this chapter. These methods do not perform
as well as the Gibbs method for the Gaussian models but can be easily applied
to models with complicated conditional distributions, as will be seen in the next
chapter. The first method uses univariate normal proposal distributions whilst
the second method uses multivariate normal proposal distributions. The two
methods were compared using a simple example model and no benefit was seen
in using the multivariate proposals, and so the univariate proposal method will
be used in the next chapter.
Two approaches were considered for generating optimal proposal distributions
for the Metropolis steps in the hybrid algorithms. The first approach was
based on using scaled estimates of the variance of the parameter of interest as
variances for the proposal distribution. This approach has some problems as
the results in Gelman, Roberts, and Gilks (1995) for optimal scale factors for
multivariate normal posterior distributions do not follow for multi-level models.
The second approach considered uses an adapting period before the main run
of the Markov chain in which the proposal distributions are modified to give
particular acceptance rates for the parameters of interest. This approach works
better: for the univariate proposal distributions, acceptance rates in the range
45% to 60% lead to close to optimal proposals in all the examples considered.

One type of comparison that is missing from this chapter and this thesis in
general is timing comparisons and they deserve some mention before I close this
chapter.
5.6.1 Timing considerations
In this thesis the only places where timings are included are generally to justify
the dimensions of simulation runs and not to compare individual methods. The
main reason for not including timing comparisons is that I personally think that
they should only be done on released software, on a stand alone machine and by
a third party.
Before the MCMC options were incorporated into the MLwiN package they
existed in a more primitive version as a stand alone C program. At this time
I compared my stand alone code with the BUGS package (Spiegelhalter et al.
1994), mainly to confirm that my code gave reasonable estimates but also to
compare the speed differences. I found that my Gibbs sampling code was slightly
quicker for the few small models tested. This was to be expected as the algorithms
in this chapter generate from the posterior distributions directly whilst the BUGS
package uses the adaptive rejection method.
I would expect now that the BUGS package will outperform the MLwiN Gibbs
sampler for the Gaussian models. This is because the Gibbs sampling code in
MLwiN is embedded beneath the graphical interface which slows the code down
considerably. Some of the methods described in this chapter have not yet been
added to the released version of MLwiN and have not yet been optimised and so
comparisons would not be fair.
All comparisons obviously depend on the efficiency of the coding and the
model considered. However it is generally the case that a Metropolis sampling
step will be quicker than a Gibbs sampling step. Also Metropolis steps should
in general be quicker than rejection sampling algorithms and adaptive rejection
sampling as only one value is generated per parameter per iteration using
Metropolis. The Metropolis steps, however, in general need longer to get estimates
of a given accuracy, as demonstrated earlier in this chapter.
Chapter 6
Logistic Regression Models
6.1 Introduction
In the previous chapters the models considered have been restricted to the family
of Gaussian multi-level models. This family of models is very useful and can
fit most datasets well. The models considered thus far have all had a response
variable that is assumed to be defined on the whole real line. There are many
variables that are not defined on the whole real line, for example age, which must
be positive, and sex, which is either male or female.

In this chapter I am interested in the second type of variable, one which has
two possible states that can be defined as zero and one, i.e. a binary response.
These types of variables, as seen at the end of Chapter 2, also appear in linear
modelling as responses. In this case they are fitted as a Bernoulli response using
generalized linear modelling. The most common way of fitting such a model
is by using the logit link function, which is the canonical link for the binomial
and Bernoulli distributions. The technique is then known as logistic regression.
McCullagh and Nelder (1983) is a useful text for all generalized linear models
including logistic regression models.
In a similar way that Gaussian linear models can be extended to multi-level
Gaussian models, logistic regression models can also be extended to multi-level
logistic regression models.
In this chapter I will define the general model structure for a multi-level
binary response logistic regression model. In the last chapter it was pointed out
that the simple Gibbs sampling method cannot be used to fit these models, as
the full conditional distributions do not all have forms that are easily simulated
from. Gilks (1995) shows that this is true for a very simple logistic regression
model and then gives some alternative ways to fit such models. I will show how
the Metropolis-Gibbs hybrid methods of the last chapter can easily be adapted
to fit these logistic regression models. In fact the motivation behind putting
these hybrid methods in the last chapter was to be able to fit multi-level logistic
regression models.

I will then consider two examples to show some other fields of application
of hierarchical modelling that have models of this structure. The first example
is taken from survey sampling and involves a political voting dataset from the
British Election Study. This example will be used to illustrate how to fit multi-
level binary response models, and how to calculate optimal Metropolis proposal
distributions.

The second example considers a collection of simulated datasets, designed
to represent closely the structure of a dataset used in an analysis of health care
utilisation in Guatemala. These simulated datasets were considered in Rodriguez
and Goldman (1995), where deficiencies in the quasi-likelihood methods used by
MLn to fit binary response models were pointed out. I hope to show that the
Metropolis-Gibbs hybrid methods will improve on the quasi-likelihood methods.
6.2 Multi-level binary response logistic regression models
The multi-level binary response logistic regression model has a similar structure to
the Gaussian models discussed thus far. The only difference is how the response
variable is linked to the predictor variables. When considering the Gaussian
models, a three level model was described and an algorithm for this model was
included. A generalisation was then given to N levels.

A 3 level binary response logistic regression model can be defined as follows:

\[
y_{ijk} \sim \text{Bernoulli}(p_{ijk}), \quad \text{where}
\]
\[
\text{logit}(p_{ijk}) = X_{1ijk}\beta_1 + X_{2ijk}\beta_{2jk} + X_{3ijk}\beta_{3k},
\]
\[
\beta_{2jk} \sim \text{MVN}(0, V_2), \quad \beta_{3k} \sim \text{MVN}(0, V_3).
\]
This 3 level model can be easily extended to N levels in a similar way to the
Gaussian models. Then any of the Metropolis-Gibbs hybrid methods can be
adapted to fit the logistic model. I will only consider the method with univariate
updates here and now show how this can be adapted from the Gaussian version.
6.2.1 Metropolis Gibbs hybrid method with univariate updates
As can be seen in the above model definition, multi-level logistic regression
models do not have variance terms at level 1. This is because for the Bernoulli
distribution, both the mean and the variance are functions of the parameter p_ijk
only (E(y_ijk) = p_ijk, var(y_ijk) = p_ijk(1 − p_ijk)). Therefore if the mean is estimated,
then the variance will be fixed. This means that the algorithm for a general N
level logistic regression model has only three steps.
Notation
In the following algorithm I will use similar notation to that used for the N level
Gaussian models in Chapter 5. Let M_T be the set of all observations in the
model, and let M_{l,j} be the set of observations that at level l are in category j.
Let X_{li} be the vector of variables at level l for observation i, where l = 1 refers
to the variables associated with the fixed effects. Let the random parameters
at level l, l > 1, be denoted by β_{lj}, where j indexes the combinations of higher
level terms, and let the fixed effects be β_1. Finally let V_l be the level l variance
matrix. I will use the abbreviation (Xβ)_i to mean the sum of all the predictor
terms for observation i. For example, in the three level model definition earlier,
(Xβ)_i = X_{1ijk}β_1 + X_{2ijk}β_{2jk} + X_{3ijk}β_{3k}. Using these notational short cuts, the
model can be written:

\[
y_i \sim \text{Bernoulli}(p_i), \quad \text{where} \quad \text{logit}(p_i) = (X\beta)_i, \quad \beta_{lj} \sim \text{MVN}(0, V_l).
\]
Algorithm
The main differences between this algorithm and the Gaussian algorithm arise
from the different likelihood functions. For prior distributions, I will allow the
fixed effects to have any prior distribution and the level l variance to have a
general inverse Wishart prior, V_l ∼ IW(S_{Pl}, ν_{Pl}). The three steps of the algorithm
are then as follows:

Step 1 - The fixed effects, β1.

For i in 1, …, N_{Fixed},

\[
\beta_{1i}^{(t)} = \begin{cases} \beta_{1i}^{*} & \text{with probability } \min\!\left(1,\; p(\beta_{1i}^{*} \mid y, \ldots)\big/p(\beta_{1i}^{(t-1)} \mid y, \ldots)\right) \\[4pt] \beta_{1i}^{(t-1)} & \text{otherwise,} \end{cases}
\]

where \(\beta_{1i}^{*} = \beta_{1i}^{(t-1)} + \epsilon_{1i}\), \(\epsilon_{1i} \sim N(0, \sigma^2_{1i})\), and

\[
p(\beta_{1i} \mid y, \ldots) \;\propto\; p(\beta_1) \times \prod_{i \in M_T} \left(1 + e^{-(X\beta)_i}\right)^{-y_i}\left(1 + e^{(X\beta)_i}\right)^{y_i - 1}.
\]
Step 2 - The level l residuals, βl.

For l in 2, …, N, j in 1, …, n_l, and i in 1, …, n_{rl},

\[
\beta_{lji}^{(t)} = \begin{cases} \beta_{lji}^{*} & \text{with probability } \min\!\left(1,\; p(\beta_{lji}^{*} \mid y, \ldots)\big/p(\beta_{lji}^{(t-1)} \mid y, \ldots)\right) \\[4pt] \beta_{lji}^{(t-1)} & \text{otherwise,} \end{cases}
\]

where \(\beta_{lji}^{*} = \beta_{lji}^{(t-1)} + \epsilon_{lji}\), \(\epsilon_{lji} \sim N(0, \sigma^2_{lji})\), and

\[
p(\beta_{lji} \mid y, \ldots) \;\propto\; \prod_{i \in M_{l,j}} \left(1 + e^{-(X\beta)_i}\right)^{-y_i}\left(1 + e^{(X\beta)_i}\right)^{y_i - 1} \times |V_l|^{-\frac{1}{2}} \exp\!\left[-\tfrac{1}{2}\beta_{lj}^{T} V_l^{-1} \beta_{lj}\right].
\]
Step 3 - The level l variance, Vl.

\[
p(V_l^{-1} \mid y, \ldots) \;\propto\; p(\beta_l \mid V_l)\, p(V_l^{-1}),
\]
\[
V_l^{-1} \sim \text{Wishart}_{n_{rl}}\!\left[S_{pos} = \left(\sum_{i=1}^{n_l} \beta_{li}\beta_{li}^{T} + S_{Pl}\right)^{-1},\; \nu_{pos} = n_l + \nu_{Pl}\right],
\]

where n_l is the number of level l units. If we want a uniform prior then we need
S_{Pl} = 0 and ν_{Pl} = −n_{rl} − 1, where n_{rl} is the number of random variables at level
l.
Note that it is possible, by reparameterising this model, to use the Gibbs
sampler instead of the Metropolis sampler for residuals at levels 3 and upwards,
but this is not considered in this thesis.
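As an illustration of step 1, the following Python sketch performs a single-site random walk Metropolis update for one fixed effect, working with the log of the Bernoulli likelihood above for numerical stability. The array names (X, y, and eta for the full linear predictor including the residual terms) and the log_prior function are assumptions for the sketch, not the MLwiN code.

```python
import numpy as np

def bernoulli_log_lik(y, eta):
    """Log of the likelihood term (1 + exp(-eta))^(-y) * (1 + exp(eta))^(y - 1)."""
    return np.sum(-y * np.log1p(np.exp(-eta)) + (y - 1) * np.log1p(np.exp(eta)))

def update_fixed_effect(beta, k, X, y, eta, psd, log_prior):
    """Random walk Metropolis update for the k-th fixed effect beta[k]."""
    prop = beta[k] + psd * np.random.randn()
    eta_prop = eta + (prop - beta[k]) * X[:, k]   # update the linear predictor
    log_ratio = (bernoulli_log_lik(y, eta_prop) + log_prior(prop)
                 - bernoulli_log_lik(y, eta) - log_prior(beta[k]))
    if np.log(np.random.rand()) < log_ratio:
        beta[k] = prop
        return eta_prop
    return eta
```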
6.2.2 Other existing methods
The existing methods for fitting multi-level logistic regression models in MLwiN
are described briefly at the end of Chapter 2. They are quasi-likelihood methods
based around Taylor series expansions. Marginal quasi-likelihood (MQL) is
described in Goldstein (1991), and penalised quasi-likelihood (PQL) is introduced
in Laird (1978).

MCMC methods have also been used to fit these models. Zeger and Karim
(1991) give a Gibbs sampling approach to fitting multi-level logistic regression
models amongst other multi-level models. They use rejection sampling with
a Gaussian kernel that is a good estimate of the current likelihood function
to generate new estimates for the high level residuals. The BUGS package
(Spiegelhalter et al. 1994) will also fit these models. It uses adaptive rejection
sampling (Gilks and Wild 1992), as discussed in Chapter 3, in place of rejection
sampling. Breslow and Clayton (1993) consider multi-level logistic regression
models within the family of generalized linear mixed models. They perform some
brief comparisons between the quasi-likelihood methods and the Gibbs sampling
results in Zeger and Karim (1991).
6.3 Example 1 : Voting intentions dataset
6.3.1 Background
The dataset used in this example is a component of the British Election Study
analysed in Heath, Yang, and Goldstein (1996). This dataset also appears in
Goldstein et al. (1998) where it is used as the main example in the binary
response models chapter. The subsample analysed in Goldstein et al. (1998)
contains data on 800 voters from 110 constituencies, who were asked how they
voted in the 1983 election. Their response was categorised as to whether they
voted Conservative or not, and the interest was in how the voters' opinions on
certain issues influenced their voting intentions.

The explanatory variables are the voters' opinions, scored on a 21 point
scale and then centred around its mean, for the following four issues: firstly
whether Britain should possess nuclear weapons (Def); secondly whether low
unemployment or low inflation is more important (Unemp); thirdly whether
they would prefer tax cuts or higher taxes to pay for more government spending
(Tax); and finally whether they are in favour of privatisation of public services
(Priv).
I will use this example for two purposes: firstly, to show the differences in
the estimates produced by the Metropolis-Gibbs hybrid method and the quasi-
likelihood methods, and secondly, to see if the findings on the ideal scaling for
the Metropolis proposal distributions from the last chapter extend to logistic
regression models.
6.3.2 Model
The model fitted to the dataset is the same model as in Goldstein et al. (1998).
There will be five fixed effects: an intercept term and fixed effects for the four
opinion variables described above, along with a random term to measure the
constituency effect. Let p_ij be the probability that the ith voter in the jth
constituency voted Conservative; then

\[
\text{logit}(p_{ij}) = \beta_1 + \beta_2 \text{Def}_{ij} + \beta_3 \text{Unemp}_{ij} + \beta_4 \text{Tax}_{ij} + \beta_5 \text{Priv}_{ij} + u_j,
\]

where u_j ∼ N(0, σ²u).

To translate this to the response variable y_ij requires y_ij ∼ Bernoulli(p_ij).
This model now fits into the framework described in the earlier section and can
be fitted using the Metropolis Gibbs hybrid method.
6.3.3 Results
The two quasi-likelihood methods, MQL and PQL, and the Metropolis Gibbs
hybrid method were all used to fit the above model and the results are given
in Table 6.1. The Metropolis Gibbs hybrid method was run using the adaptive
method described in the last chapter with a desired acceptance rate of 44%.
A uniform prior was used for the variance parameter σ²u.
Table 6.1: Comparison of results from the quasi-likelihood methods and the MCMC method for the voting intentions dataset. The MCMC method is based on a run of 50,000 iterations after a burn-in of 500 and an adapting period.

  Par.    MQL 1             PQL 2             MH Adapt
  β1      −0.355 (0.092)    −0.367 (0.094)    −0.375 (0.102)
  β2      0.089 (0.018)     0.092 (0.018)     0.095 (0.019)
  β3      0.045 (0.019)     0.046 (0.019)     0.046 (0.020)
  β4      0.067 (0.013)     0.069 (0.014)     0.070 (0.014)
  β5      0.138 (0.018)     0.143 (0.018)     0.146 (0.019)
  σ²u     0.132 (0.112)     0.154 (0.117)     0.253 (0.154)
From the table it can be seen that the PQL method gives parameter estimates
that are larger (in magnitude) than the MQL method estimates for both the
fixed effects and the level 2 variance. The MCMC method gives estimates that
are larger (in magnitude) than both quasi-likelihood methods, particularly for the
level 2 variance. In the simulations in Chapter 4 it was shown that for Gaussian
models the uniform prior for the variance parameter gave variance estimates that
were biased high, particularly for small datasets. This dataset is not particularly
small (110 level 2 units) and so more investigation is needed on which of the
methods is giving the better variance estimate. Analysis of which method is
performing best in terms of bias and coverage properties will be performed on
the second example in this chapter.
6.3.4 Substantive Conclusions
Considering just the PQL results and back-transforming the variables onto an
interpretable scale, the following conclusions can be made. Firstly, a voter with
average views on all four issues (Def_ij = Unemp_ij = Tax_ij = Priv_ij = 0) had
a 40.9% probability of voting Conservative. A voter was more likely to vote
Conservative if they were in favour of Britain possessing nuclear weapons (5
points above the average score implies a 52.3% probability of voting Conservative).
They were more likely to vote Conservative if they preferred low inflation to
low unemployment (5 points above the average score implies a 46.6% probability of
voting Conservative). They were more likely to vote Conservative if they preferred
low taxes rather than higher government spending (5 points above the average score
implies a 49.5% probability of voting Conservative). They were more likely to
vote Conservative if they were in favour of privatising public services (5 points
above the average score implies a 58.6% probability of voting Conservative). The
unexplained variation at the constituency level is fairly large in practical terms
but not statistically significant.
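The back-transformation used above is simply the inverse logit. As a check on the quoted figures, the following minimal sketch recomputes them from the PQL point estimates in Table 6.1; the dictionary of coefficients is just a convenient re-packaging of that table for illustration.

```python
import math

def inv_logit(eta):
    """Back-transform from the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

pql = {"intercept": -0.367, "Def": 0.092, "Unemp": 0.046, "Tax": 0.069, "Priv": 0.143}

print(round(inv_logit(pql["intercept"]), 3))                    # average voter: 0.409
print(round(inv_logit(pql["intercept"] + 5 * pql["Def"]), 3))   # +5 on Def:     0.523
print(round(inv_logit(pql["intercept"] + 5 * pql["Priv"]), 3))  # +5 on Priv:    0.586
```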
6.3.5 Optimum proposal distributions
In the last chapter I showed that the optimum value for the scaling factor for the
variance of a univariate normal proposal distribution, as suggested by Gelman,
Roberts, and Gilks (1995), does not generally follow for multi-level models. I will
now do a similar analysis for the multi-level logistic regression model by considering
the voting intentions example. To find optimal proposal distributions several
values of the scaling factor spread over the range 0.05 to 20 were chosen. For
each value of the scaling factor 3 runs were performed with a burn-in of 500
iterations and a main run of 50,000 iterations. As before the Raftery Lewis N
statistic was calculated for each parameter in each run. Then the optimal scaling
factor was chosen to be the value that minimises N. The results for the voting
intentions dataset are summarised in Table 6.2.
Table 6.2: Optimal scale factors for proposal variances and best acceptance rates for the voting intentions model.

  Par.    Optimal SF    Acceptance %    Min N
  β1      4.0           40%-60%         15K
  β2      5.5           40%-60%         13K
  β3      5.5           40%-60%         14K
  β4      5.0           40%-60%         14K
  β5      5.5           40%-60%         14K
Table 6.2 shows that for this model the results in Gelman, Roberts, and Gilks
(1995) appear to be close to the optimal values obtained from the simulations.
One problem that is not highlighted by this table is that for this model the worst
mixing properties are exhibited by the level 2 variance parameter, σ²u. Figure 6-1
shows the effect of varying the scale factor on the N value for the parameter σ²u
along with a best fit loess curve. From this figure there appears to be no clear
relationship between the value of the scale factor and the N values for σ²u. This
parameter is actually updated using Gibbs sampling so is only affected indirectly
by modifying the scale factor. The loess curve does show a slight upturn as the
scale factor gets smaller but there is far greater variability in the N values and
so a clear relationship is more difficult to establish.
[Figure 6-1. Scatter plot with loess curve of the Raftery Lewis N-hat for σ²u against the proposal scale factor. Caption: Plot of the effect of varying the scale factor for the univariate normal proposal distribution on the Raftery Lewis diagnostic for the σ²u parameter in the voting intentions dataset.]
To check whether the behaviour seen above for the scale factor holds for other
multi-level logistic regression models I considered again an example from Chapter
2. At the end of Chapter 2 I introduced multi-level logistic regression models
by converting the response variable M5 from the JSP dataset into a pass/fail
indicator, Mp5, which depended on whether the M5 mark was at least 30 or not.
Model 2.4 (below), amongst others, was fitted to the JSP dataset.

\[
\text{Mp5}_{ij} \sim \text{Bernoulli}(p_{ij}),
\]
\[
\log(p_{ij}/(1 - p_{ij})) = \beta_0 + \beta_1 \text{M3}_{ij} + \text{SCHOOL}_j,
\]
\[
\text{SCHOOL}_j \sim N(0, \sigma^2_s).
\]

I then repeated the procedure for finding the optimal proposal distributions
detailed above using this model instead of the voting intentions model. For this
model the optimum value of the scale factor was found to be between 0.1 and
0.2 for both β0 and β1. The minimum values of N for β0 and β1 (both roughly
60,000) were for this model found to be far greater than the minimum N
value for σ²u (roughly 5,000). The only common factor between the two models
is the optimal acceptance rate, which for Model 2.4 is in the range 40% to 70%
for both β0 and β1. This adds further weight to using the adapting procedure
detailed in Chapter 5.
As in Chapter 5, no simple formula for an optimal scaling factor appears to
exist for the multi-level models in this chapter. However if the adapting method
is used, a desired acceptance rate in the range 40% to 60% for the univariate
normal Metropolis proposals will give close to optimal proposal distributions.
6.4 Example 2 : Guatemalan child health dataset
6.4.1 Background
The original Guatemalan Child Health dataset consisted of a subsample of
respondents from the 1987 National Survey of Maternal and Child Health. The
subsample has 2449 responses and a three level structure of births within mothers
within communities. The subsample consists of all women from the chosen
communities who had some form of prenatal care during pregnancy. The response
variable is whether this prenatal care was modern (physician or trained nurse) or
not.
Rodriguez and Goldman (1995) use the structure of this dataset to consider
how well quasi-likelihood methods compare to considering the dataset without
the multi-level structure and fitting a standard logistic regression. They perform
this by constructing simulated datasets based on the original structure but with
known true values for the fixed effects and variances. Rodriguez and Goldman
(1995) consider the MQL method and show that the estimates of the fixed effects
produced by MQL are worse than the estimates produced by a standard logistic
regression disregarding the multi-level structure.

Goldstein and Rasbash (1996) consider the same problem but consider the
PQL method. They show that the results produced by PQL second order
estimation are far better than for MQL but still biased. They also state that the
example considered, with large underlying random parameter values, is unusual.
If the variances in a variance component model do not exceed 0.5, which is more
common, the first order PQL estimation method and even the first order MQL
method will be adequate.

Although Rodriguez and Goldman (1995) considered several different models
and several different values for the parameters, Goldstein and Rasbash (1996)
only consider the model with the Guatemala structure and parameter values
that MQL performs badly on. I will now try using the MCMC Metropolis Gibbs
hybrid method described earlier in the chapter to fit the same model and compare
the results.
6.4.2 Model
The model considered is as follows :
\[
y_{ijk} \sim \text{Bernoulli}(p_{ijk}), \quad \text{where}
\]
\[
\text{logit}(p_{ijk}) = \beta_0 + \beta_1 x_{1ijk} + \beta_2 x_{2jk} + \beta_3 x_{3k} + u_{jk} + v_k,
\]
\[
u_{jk} \sim N(0, \sigma^2_u) \quad \text{and} \quad v_k \sim N(0, \sigma^2_v).
\]

In this formulation i, j and k index the level 1, 2 and 3 units respectively. The
variables x1, x2 and x3 are composite variables at each level, as the original model
contained many covariates at each level.
6.4.3 Original 25 datasets
Goldstein and Rasbash (1996) considered the first 25 datasets simulated by
Rodriguez and Goldman (1995) to construct a table of comparisons between
MQL and PQL (Table 1 in Goldstein and Rasbash (1996)). This table will
now be reconstructed here but will also include the MCMC method with the two
alternative priors for the variance parameters. The two priors to be considered
are firstly a Gamma(ε, ε) prior for the precision parameters at both levels 2 and 3,
and secondly a uniform prior for the variance parameters at levels 2 and 3. The
MCMC procedures were run using a burn-in of 500 iterations after the adaptive
method with desired acceptance rate 44%. The main run was of length 100,000
for each dataset, which was based on preliminary analysis using the Raftery-Lewis
diagnostic. The MCMC methods took 2 hours each per dataset using MLwiN on
a Pentium 200 MHz PC.

The quasi-likelihood results here vary from those seen in Goldstein and
Rasbash (1996) as the tolerance was set at the default level in MLwiN (relative
change from one iteration to the next of at most 0.01). In Goldstein and Rasbash
(1996) a more stringent convergence criterion is used (relative change from one
iteration to the next of at most 0.001) and the results are in that case slightly less
biased. The other difference with this table is that I am considering the variance
parameters rather than the standard deviations at levels 2 and 3, as this is the
variable reported by MLwiN. The results of this analysis can be seen in Table
6.3.
Results
From Table 6.3 the improvements achieved by the MCMC methods in terms of
bias can be clearly seen. Both prior distributions for the variance parameters
give results that are less biased than the PQL 2 method. The biases in the
random parameters are more extreme for the quasi-likelihood methods when the
variances are considered instead of the standard deviations. There is little to
choose between the two MCMC variance priors in this example.
Table 6.3: Summary of results (with Monte Carlo standard errors) for the first 25 datasets of the Rodriguez Goldman example.

  Parameter (True)    MQL 1            PQL 2            Gamma prior      Uniform prior
  β0 (0.65)           0.483 (0.028)    0.615 (0.035)    0.643 (0.037)    0.659 (0.038)
  β1 (1.00)           0.758 (0.033)    0.948 (0.043)    0.996 (0.045)    1.017 (0.047)
  β2 (1.00)           0.760 (0.013)    0.951 (0.018)    1.004 (0.020)    1.025 (0.021)
  β3 (1.00)           0.744 (0.033)    0.952 (0.043)    0.997 (0.044)    1.022 (0.046)
  σ²v (1.00)          0.530 (0.016)    0.837 (0.036)    0.970 (0.054)    1.066 (0.057)
  σ²u (1.00)          0.025 (0.008)    0.513 (0.036)    0.928 (0.041)    1.044 (0.044)
The gamma prior gives results that are biased on the low side for the variance parameters
while the uniform prior gives estimates that are biased high for both variances
and fixed effects. With only 25 datasets the coverage properties of the 4 methods
cannot be evaluated and so more datasets are needed.
6.4.4 Simulating more datasets
Although the results from these 25 datasets appear to show improvement in the
unbiasedness of estimates produced by the MCMC methods over the quasi-
likelihood methods, to emphasise this improvement more datasets are needed.
Also with more datasets the coverage properties of the methods as well as the
bias can be evaluated.
Simulation procedure
The MLwiN SIMU, PRED and BRAN commands (Rasbash and Woodhouse
1995) were used to generate 500 datasets with the same underlying structure as
the 25 datasets from Rodriguez and Goldman (1995). The simulation procedure
is as follows (a sketch of the procedure is given after the list):

1. Generate 161 v_k's, one for each community, by drawing from a normal
distribution with mean 0 and variance σ²v (1.0).

2. Generate 1558 u_jk's, one for each mother, by drawing from a normal
distribution with mean 0 and variance σ²u (1.0).

3. Evaluate logit(p_ijk) = β0 + β1 x_1ijk + β2 x_2jk + β3 x_3k + u_jk + v_k for all 2449
births using the PRED command.

4. Use the BRAN command to generate a 0 or 1 (y_ijk) for each birth based
on the relative likelihoods of a 0 and 1 response and a random uniform draw.
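The same procedure can be expressed outside MLwiN. The sketch below is a generic NumPy re-expression (it does not use the SIMU, PRED or BRAN commands); the covariate arrays and the index arrays mapping births to mothers and mothers to communities are assumed inputs for illustration.

```python
import numpy as np

def simulate_dataset(x1, x2, x3, mother_of_birth, community_of_mother,
                     beta=(0.65, 1.0, 1.0, 1.0), var_u=1.0, var_v=1.0, rng=None):
    """Generate one simulated response vector for 2449 births nested in
    1558 mothers nested in 161 communities (the Guatemala structure)."""
    rng = rng or np.random.default_rng()
    v = rng.normal(0.0, np.sqrt(var_v), size=161)     # community effects
    u = rng.normal(0.0, np.sqrt(var_u), size=1558)    # mother effects
    community_of_birth = community_of_mother[mother_of_birth]
    eta = (beta[0] + beta[1] * x1 + beta[2] * x2[mother_of_birth]
           + beta[3] * x3[community_of_birth]
           + u[mother_of_birth] + v[community_of_birth])
    p = 1.0 / (1.0 + np.exp(-eta))                    # inverse logit
    return (rng.uniform(size=p.size) < p).astype(int) # Bernoulli(p) draws
```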
The same model considered so far was fitted to these 500 datasets using the
two quasi-likelihood methods and the MCMC method using the two different
priors. As the MCMC methods are time consuming the main run length was
reduced for finding estimates for the 500 datasets from 100,000 iterations to
25,000 iterations. The adaptive procedure and a `burn-in' of 500 iterations were
used as before. The results for the 500 simulations can be seen in Table 6.4.
Results
Table 6.4 confirms the bias results already seen when only 25 datasets were
considered. The MQL method performs badly for the fixed effects and hopelessly
for the variance parameters. The PQL method performs a lot better with smaller
fixed effect biases but still shows large bias for the variances.

Both the MCMC priors perform much better than the quasi-likelihood
methods and there is little bias. The one exception is the variance parameters
using the Uniform prior, where some positive bias similar to that seen in the
Gaussian models in Chapter 4 can be seen.

The coverage properties are illustrated in Table 6.4 and Figures 6-2 and 6-3.
Here we see again how poor the MQL method is on this example. In fact the
method is so poor at estimating σ²u that none of the 500 datasets has a 95%
interval estimate that contains the true value. This is partly due to the large
number of runs that have level 2 variance estimates of 0. The PQL method
does reasonably well in terms of coverage for the fixed effects but not as well as
the MCMC methods. It also gives variance estimates with very poor coverage
properties, particularly at level 2.

The MCMC methods both give very similar coverage properties and are both
a vast improvement over the quasi-likelihood methods. The Uniform prior in
general gives slightly better coverage estimates but this is not always true and
the Uniform prior also has larger interval widths.
Table 6.4: Summary of results (with Monte Carlo standard errors) for the Rodriguez Goldman example with 500 generated datasets.

Estimates (Monte Carlo SE):
  Parameter (True)    MQL 1            PQL 2            Gamma prior      Uniform prior
  β0 (0.65)           0.474 (0.007)    0.612 (0.009)    0.638 (0.010)    0.655 (0.010)
  β1 (1.00)           0.741 (0.007)    0.945 (0.009)    0.991 (0.010)    1.015 (0.010)
  β2 (1.00)           0.753 (0.004)    0.958 (0.005)    1.006 (0.006)    1.031 (0.005)
  β3 (1.00)           0.727 (0.009)    0.942 (0.011)    0.982 (0.012)    1.007 (0.013)
  σ²v (1.00)          0.550 (0.004)    0.888 (0.009)    1.023 (0.011)    1.108 (0.011)
  σ²u (1.00)          0.026 (0.002)    0.568 (0.010)    0.964 (0.018)    1.130 (0.016)

Coverage probabilities (90%/95%):
  β0                  67.6/76.8        86.2/92.0        86.8/93.2        88.6/93.6
  β1                  56.2/68.6        90.4/96.2        92.8/96.4        92.2/96.4
  β2                  13.2/17.6        84.6/90.8        88.4/92.6        88.6/92.8
  β3                  59.0/69.6        85.2/89.8        86.2/92.2        88.6/93.6
  σ²v                 0.6/2.4          70.2/77.6        89.4/94.4        87.8/92.2
  σ²u                 0.0/0.0          21.2/26.8        84.2/88.6        88.0/93.0

Average interval widths (90%/95%):
  β0                  0.494/0.589      0.617/0.735      0.668/0.798      0.693/0.828
  β1                  0.572/0.681      0.668/0.796      0.734/0.875      0.750/0.895
  β2                  0.274/0.327      0.336/0.400      0.388/0.463      0.399/0.476
  β3                  0.626/0.746      0.781/0.930      0.850/1.014      0.881/1.053
  σ²v                 0.339/0.404      0.535/0.638      0.731/0.878      0.789/0.948
  σ²u                 0.149/0.177      0.496/0.591      1.058/1.251      1.105/1.315
[Figure 6-2. Three panels (one each for β0, β1 and β2) plotting the percentage of observed points at or below each posterior predictive percentile for MQL, PQL, MCMC Gamma and MCMC Uniform. Caption: Plots comparing the actual coverage of the four estimation methods with their nominal coverage for the parameters β0, β1 and β2.]
[Figure 6-3. Three panels (one each for β3, σ²v and σ²u) plotting the percentage of observed points at or below each posterior predictive percentile for MQL, PQL, MCMC Gamma and MCMC Uniform. Caption: Plots comparing the actual coverage of the four estimation methods with their nominal coverage for the parameters β3, σ²v and σ²u.]
6.4.5 Conclusions
Rodriguez and Goldman (1995) originally pointed out the deficiencies of the MQL
method on multi-level logistic regression models with structures similar to our
datasets. Goldstein and Rasbash (1996) then showed how the PQL method
improved greatly on the MQL method but still showed some bias. In this section
I have shown that the Metropolis-Gibbs hybrid method described earlier in this
thesis gives even better estimates both in terms of bias and coverage. It is also
clear that the choice of prior distribution for the variance parameters is not as
important in this problem due to the large numbers of level 2 and 3 units, and
both priors considered give better estimates than the quasi-likelihood methods.

One point to note, as shown in the simulations in Breslow and Clayton (1993),
is that the under-estimation from the quasi-likelihood methods is worst when
there is a Bernoulli response. When the model is changed to a binomial response
and the denominator increased this under-estimation is reduced. I will discuss
briefly models with a general binomial response in Chapter 8.
6.5 Summary
In this chapter the family of hierarchical binary response logistic regression
models has been introduced and it has been shown how to adapt the Metropolis-
Gibbs hybrid method to fit these models. Two examples that show yet more
applications of multi-level modelling were used to illustrate fitting such models.
The first example had data on the intentions of voters in the 1983 election, and
was used to demonstrate the differences between the estimates from the quasi-
likelihood methods and the MCMC methods. It was also used to find optimal
proposal distributions for the Metropolis steps of the MCMC method.

The second example used simulated datasets based on a dataset on child
health in Guatemala. This example was used to compare the performance of the
quasi-likelihood methods, MQL and PQL, against the Metropolis-Gibbs hybrid
method using two prior distributions for the variance parameters, where the true
values of all parameters were known. In this example it was shown that the
MCMC methods outperform the quasi-likelihood methods both in terms of bias
and coverage properties. It was also shown that with such a large dataset the
choice of prior distribution for the variance is not of great importance.
Chapter 7
Gaussian Models 3 - Complex Variation at level 1

7.1 Model definition
In Chapters 4 and 5 I considered Gaussian models with a simple level 1 variance
σ². In all the algorithms given the Gibbs sampler was used to update all variance
parameters. Although in Chapter 5 I considered some hybrid methods containing
a mix of Gibbs and Metropolis sampling steps, the Metropolis steps updated the
fixed effects and residuals, and all variance parameters were always updated using
the Gibbs sampler.

In this chapter I will remove the restriction that the model must have a simple
constant variance at level 1 and instead allow the level 1 variance to depend on
other predictor variables. I will consider by way of an example the following
simple two level model with complex variation at level 1.

\[
y_{ij} = X^{F}_{ij}\beta^{F} + X^{R}_{ij}\beta^{R}_{j} + X^{C}_{ij}e_{ij},
\]
\[
e_{ij} \sim \text{MVN}(0, V_C), \quad \beta^{R}_{j} \sim \text{MVN}(0, V_R).
\]
If I consider again the JSP dataset, introduced in Chapter 2, with M5, the
maths score in year 5, as the response variable, then I could extend the random
slopes regression model by allowing the M3 predictor variable to be random at
both levels 1 and 2. What this then means is that the variability of an
individual pupil's M5 score is not only dependent on school level variables but
also on his/her individual maths score in year 3. The model can be written as
follows:

\[
\text{M5}_{ij} = \beta^{F}_{0} + \beta^{R}_{0j} + e_{0ij} + \text{M3}_{ij}\left(\beta^{F}_{1} + \beta^{R}_{1j} + e_{1ij}\right),
\]
\[
e_{ij} = \begin{pmatrix} e_{0ij} \\ e_{1ij} \end{pmatrix} \sim \text{MVN}(0, V_C), \quad
\beta^{R}_{j} = \begin{pmatrix} \beta^{R}_{0j} \\ \beta^{R}_{1j} \end{pmatrix} \sim \text{MVN}(0, V_R).
\]
Then
\[
\text{Var}(\text{M5}_{ij} \mid \beta^{F}, \beta_j) = V_{C00} + 2V_{C01}\text{M3}_{ij} + V_{C11}(\text{M3}_{ij})^2,
\]
and in this case a simple Gibbs sampling approach cannot be used to update the level 1 variance
parameters. In the random slopes regression example in Chapter 2 there was a
single e_ij for each observation (pupil) which was evaluated as follows:

\[
e_{ij} = y_{ij} - X^{F}_{ij}\beta^{F} - X^{R}_{ij}\beta^{R}_{j}.
\]

Then the level 1 variance, σ², could be updated via a scaled inverse χ² distribution
based solely on the e_ij's. When the variation at level 1 is complex there is not a
single e_ij for each observation, and instead

\[
X^{C}_{ij}e_{ij} = y_{ij} - X^{F}_{ij}\beta^{F} - X^{R}_{ij}\beta^{R}_{j},
\]

where X^C_ij is the data vector of variables that are random at level 1 for observation
ij, and e_ij is the vector of residuals for observation ij.

From this I would have to estimate each vector e_ij via an MCMC updating
rule, which is difficult. Instead I propose to use Metropolis Hastings sampling on
the level 1 variance matrix. Before discussing the problem of creating a Hastings
update for a variance matrix, I will first consider the simpler case of using MCMC
to update a scalar variance.
7.2 Updating methods for a scalar variance

In earlier chapters I dealt with multi-level models where the variance at level 1, σ², is a scalar. In these chapters the conditional distribution of σ² is easily determined, and a Gibbs sampling step can be used to update σ². Similarly BUGS (Spiegelhalter et al. 1994) fits such models using an adaptive rejection Gibbs sampling algorithm. I am interested in finding alternative approaches for sampling parameters that are restricted to being strictly positive. Two alternative approaches are described below.
7.2.1 Metropolis algorithm for log σ²

I have already used the Metropolis algorithm via a normal proposal distribution to update parameters defined on the whole real line. A similar approach can be used on the current problem, but firstly the variable of interest must be transformed to a variable that is defined on the whole real line.
Draper and Cheal (1997), when analysing a problem from Copas and Li (1997), use the approach of transforming σ² to log σ². They then use a multivariate normal proposal for all unknowns in the model, including log σ². I will use a univariate proposal for σ² as I am updating it separately from the other unknowns. I will therefore consider updating log σ² by using a univariate normal(0, s²) proposal distribution.
As the parameter of interest has been transformed from σ² to log σ², I need to consider the Jacobian of the transformation and build this into the prior. The one disadvantage of this method is that it does not extend to the harder problem where I have a variance matrix Σ to update. The technique of transforming variance parameters to the log scale is not unique to Draper and Cheal (1997) and is also used in Section 11.6 of Gelman et al. (1995) on a hierarchical model.
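As a hedged illustration of this first approach, the following Python sketch performs a single Metropolis update of log σ² for the known-mean normal example of Section 7.2.3; the Jacobian of the log transformation enters as the extra log σ² term in the target, and the function and argument names are my own rather than anything from MLwiN or BUGS.

import numpy as np

def log_post_sigma2(sig2, y, nu0=3.0, s20=6.0):
    # Log posterior of sigma^2 (up to a constant): N(0, sigma^2) likelihood
    # for the data y with a scaled inverse chi-squared (nu0, s20) prior.
    n = len(y)
    return (-(n + nu0) / 2.0 - 1.0) * np.log(sig2) - (np.sum(y**2) + nu0 * s20) / (2.0 * sig2)

def metropolis_log_sigma2(sig2, y, s, rng):
    # One random-walk Metropolis update of log(sigma^2) with a Normal(0, s^2)
    # proposal; the Jacobian of the log transform adds log(sigma^2) to the target.
    prop = np.exp(np.log(sig2) + rng.normal(0.0, s))
    log_acc = (log_post_sigma2(prop, y) + np.log(prop)) - (log_post_sigma2(sig2, y) + np.log(sig2))
    return prop if np.log(rng.uniform()) < log_acc else sig2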
7.2.2 Hastings algorithm for σ²

In Chapter 3 I used normal proposal distributions for unknown parameters θ, with mean fixed at the current value θ_t, to generate θ_{t+1}. This type of proposal has the advantage that p(θ_{t+1} | θ_t) = p(θ_t | θ_{t+1}), and so the Metropolis algorithm can be used.
As I am now considering a parameter σ² that is restricted to be positive, I want to use a proposal that generates strictly positive values. I am therefore going to use a scaled inverse chi-squared distribution with expectation the current estimate σ²_t to generate σ²_{t+1}. The scaled inverse chi-squared distribution with parameters ν and s² has expectation νs²/(ν - 2), so letting ν = w + 2 and s² = wσ²_t/(w + 2), where w is a positive integer degrees of freedom parameter, produces a distribution with expectation σ²_t. The parameter w can be set to any value, and plays a similar role to the variance parameter in the Metropolis proposal distribution, as it affects the acceptance probability.
This proposal is not symmetric, p(σ²_{t+1} | σ²_t) ≠ p(σ²_t | σ²_{t+1}), so the Hastings ratio has to be calculated.
Assuming that currently σ²_t = a and that the value σ²_{t+1} = b is generated, the Hastings ratio is the density of the current value under a proposal centred at the proposed value, divided by the density of the proposed value under a proposal centred at the current value:

hr = p(a | σ² ~ SIχ²(w + 2, wb/(w + 2))) / p(b | σ² ~ SIχ²(w + 2, wa/(w + 2)))

   = [ (wb/(w + 2))^((w+2)/2) a^(-((w+2)/2 + 1)) exp(-wb/(2a)) ] / [ (wa/(w + 2))^((w+2)/2) b^(-((w+2)/2 + 1)) exp(-wa/(2b)) ]

   = [ b^(w/2 + 1) a^(-(w/2 + 2)) exp(-wb/(2a)) ] / [ a^(w/2 + 1) b^(-(w/2 + 2)) exp(-wa/(2b)) ]

   = (b/a)^(w+3) exp( (w/2)(a/b - b/a) ).

This Hastings ratio can then be used in the Hastings algorithm.
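As a hedged illustration, the following sketch carries out one such Hastings update for the example of Section 7.2.3 (known mean 0, SIχ²(ν_0, σ²_0) prior); it draws the proposal via a χ² deviate and uses the ratio derived above, and all function and argument names are mine rather than the MLwiN implementation.

import numpy as np

def hastings_sigma2(sig2, y, w, rng, nu0=3.0, s20=6.0):
    # Propose sig2* ~ SI-chi^2(w + 2, w*sig2/(w + 2)): draw X ~ chi^2_{w+2},
    # then w*sig2/X has expectation sig2, as required.
    prop = w * sig2 / rng.chisquare(w + 2)
    a, b = sig2, prop
    # Hastings ratio derived above: (b/a)^(w+3) exp{(w/2)(a/b - b/a)}.
    log_hr = (w + 3) * np.log(b / a) + 0.5 * w * (a / b - b / a)
    # Log posterior (up to a constant) under the N(0, sigma^2) likelihood
    # and SI-chi^2(nu0, s20) prior of Section 7.2.3.
    n = len(y)
    log_post = lambda s2: (-(n + nu0) / 2.0 - 1.0) * np.log(s2) - (np.sum(y**2) + nu0 * s20) / (2.0 * s2)
    log_acc = log_hr + log_post(b) - log_post(a)
    return b if np.log(rng.uniform()) < log_acc else a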
7.2.3 Example: Normal observations with an unknown variance

The three methods discussed in this section will now be illustrated by the following example. I generate 100 observations from a normal distribution with known mean 0 and known variance 4. I then assume that the global mean is known to be zero and that interest lies in estimating the variance parameter σ². Assigning a scaled inverse chi-squared prior for σ², the model is then as follows:

Y_i ~ N(0, σ²),   i = 1, ..., 100,
σ² ~ SIχ²(ν_0, σ²_0).

This problem has a conjugate prior and consequently the posterior has a known distribution as follows:

σ² ~ SIχ²(ν_0 + n, (ν_0 σ²_0 + nV)/(ν_0 + n)),

where n is the number of Y_i's (in this case 100) and V = (1/n) Σ_{i=1}^n (y_i - μ)². For the prior distribution the values ν_0 = 3 and σ²_0 = 6 will be used, and the sample variance of the 100 simulated observations is V = 4.38292. This leads to the posterior distribution

σ² ~ SIχ²(103, 4.43).

This distribution has posterior mean for σ² of 4.5177 with standard deviation 0.6421.
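These posterior quantities follow directly from the conjugate update; as a quick check, a few lines of Python (my own illustration, not part of the thesis software) reproduce them:

import numpy as np

# Conjugate SI-chi^2 update for the example above (known mean 0).
nu0, s20, n, V = 3.0, 6.0, 100, 4.38292      # prior parameters and sample variance about 0
nu_n = nu0 + n                               # 103
s2_n = (nu0 * s20 + n * V) / nu_n            # 4.43
post_mean = nu_n * s2_n / (nu_n - 2.0)       # 4.5177
post_sd = post_mean * np.sqrt(2.0 / (nu_n - 4.0))   # 0.6421
print(nu_n, s2_n, post_mean, post_sd)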
The Gibbs sampling method used by BUGS will sample IID from the posterior distribution for σ², and so should perform best; the Gibbs method can sample σ² IID from its posterior distribution because it is the only unknown in this model. The Metropolis log σ² method and the Hastings method should get reasonable answers but may take longer to achieve the same accuracy.
7.2.4 Results

In the following analysis I will compare the three methods described earlier. For the Gibbs sampler method a burn-in of 1,000 updates and a main run of 5,000 updates was used for each of three random number seeds and the results were averaged. Several different values of the proposal standard deviation s for the Metropolis method and of the degrees of freedom w for the Hastings method were used for comparative purposes. For each s or w, three runs were performed with a burn-in of 1,000, a main run of 100,000 and different random number seeds, and the results obtained were averaged over the three runs. The Raftery-Lewis convergence diagnostic column of Table 7.1 contains the largest value of N, the estimated run length required, among the three runs. As can be seen from the table, the largest N is 83,800, which is smaller than the 100,000 run lengths, and so this suggests all the runs have achieved their default accuracy goals.
Table 7.1 shows that the Gibbs method has the predicted fast convergence rate. All three methods give the correct answer to two decimal places. The convergence rates of the Metropolis and Hastings methods vary with the parameter value of the proposal distribution. This is also linked with the acceptance rates of the two algorithms. The best proposal parameter values are those giving an acceptance rate of approximately 44%.
Table 7.1: Comparison between three MCMC methods for a univariate normal model with unknown variance.

Method       s / w   Mean     Sd      Acc %    R/L N
Theory        N/A    4.5177   0.642    N/A      N/A
Gibbs         N/A    4.516    0.648   100.0    3,600
Metropolis    1.0    4.518    0.640    17.3   30,000
              0.5    4.519    0.642    32.6   16,000
              0.3    4.519    0.642    47.9   13,900
              0.2    4.519    0.641    60.4   18,000
              0.1    4.519    0.643    78.1   30,500
              0.05   4.514    0.639    88.8   83,800
Hastings      2      4.517    0.640    15.9   39,900
              5      4.522    0.643    25.7   26,400
              10     4.523    0.644    34.9   20,500
              20     4.519    0.645    45.5   18,100
              50     4.517    0.641    60.0   18,500
              100    4.517    0.643    69.8   26,000
              200    4.517    0.641    77.8   28,700
              500    4.523    0.640    85.7   50,100
              1000   4.517    0.637    89.7   72,200
In this simple example, however, acceptance rates of between 30% and 60% give similar convergence rates. There is therefore scope to incorporate the adaptive procedure illustrated in Chapter 5 into these methods.
I will now return to the original problem of updating a variance matrix.
7.3 Updating methods for a variance matrix

In the models in Chapter 5 the level 2 variance can be a matrix. In the algorithms given, a simple Gibbs sampling step is used to update this variance as it has an inverse Wishart posterior distribution with parameters that are easily evaluated. If there is complex variation at level 1, then the level 1 variance V_C is a matrix and this will also have an inverse Wishart posterior distribution. Unfortunately the parameters will depend on the vectors of level 1 residuals, e_ij, which are not easily evaluated. Consequently an alternative method is needed that does not need to evaluate the level 1 residuals. When I considered the case of a scalar variance, I used a Hastings step that had a proposal distribution of a similar form to the posterior distribution, and I will consider a similar approach here.
7.3.1 Hastings algorithm with an inverse Wishart proposal

When there was a scalar variance σ² to update, a proposal distribution that generated strictly positive values was required. Now that there is a variance matrix Σ to update, I require a proposal distribution that generates positive definite matrices. I am therefore going to use an inverse Wishart proposal distribution with expectation the current estimate Σ_t to generate Σ_{t+1}. The inverse Wishart distribution for a k × k matrix with parameters ν and S has expectation (ν - k - 1)^{-1} S, so letting ν = w + k + 1 and S = wΣ_t, where w is a positive integer degrees of freedom parameter, produces a distribution with expectation Σ_t. As in the univariate case, the parameter w is set to a value that gives the desired acceptance rate. Again the proposal is not symmetric, so the Hastings ratio must be calculated. Assuming that currently Σ_t = A and that Σ_{t+1} = B is generated, then the Hastings ratio is as follows:

hr = p(Σ_{t+1} = B | Σ_{t+1} ~ IW(w + k + 1, wA)) / p(Σ_{t+1} = A | Σ_{t+1} ~ IW(w + k + 1, wB))

   = [ |wA|^((w+k+1)/2) |B|^(-(w+k+1+k+1)/2) exp(-½ tr(wA B^{-1})) ] / [ |wB|^((w+k+1)/2) |A|^(-(w+k+1+k+1)/2) exp(-½ tr(wB A^{-1})) ]

   = ( |A| / |B| )^((2w+3k+3)/2) exp( (w/2)( tr(BA^{-1}) - tr(AB^{-1}) ) ).
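For concreteness, here is a minimal sketch of such an update written with scipy's inverse Wishart routines; it computes the Hastings correction directly from the two proposal densities, and log_post and the argument names are assumptions of mine, not part of any package described in this thesis.

import numpy as np
from scipy.stats import invwishart

def iw_hastings_step(Sigma, log_post, w, rng):
    # Propose Sigma* from an inverse Wishart with df = w + k + 1 and scale
    # w * Sigma, so that its expectation is the current value Sigma.
    k = Sigma.shape[0]
    df = w + k + 1
    Sigma_star = invwishart.rvs(df=df, scale=w * Sigma, random_state=rng)
    # Hastings correction: log q(current | proposed) - log q(proposed | current).
    log_hr = (invwishart.logpdf(Sigma, df=df, scale=w * Sigma_star)
              - invwishart.logpdf(Sigma_star, df=df, scale=w * Sigma))
    log_acc = log_hr + log_post(Sigma_star) - log_post(Sigma)
    return Sigma_star if np.log(rng.uniform()) < log_acc else Sigma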
7.3.2 Example: Bivariate normal observations with an unknown variance matrix

A second simple example will now be used to show that this method works. I will compare the results obtained with the theoretical answers and the Gibbs sampling results. One hundred observations were generated from a bivariate normal distribution with known mean vector μ = (4, 2)^T and known variance matrix

Σ = ( 2.0  -0.2
     -0.2   1.0 ).

I then assume that the mean vector μ is known and that interest lies in estimating the variance matrix Σ. An inverse Wishart prior distribution for Σ was assigned, and the model is then as follows:

Y_i ~ MVN( (4, 2)^T, Σ ),
p(Σ) ~ IW(ν_0, S_0).

The inverse Wishart prior is conjugate for Σ and consequently the posterior for Σ has the following distribution:

p(Σ | Y) ~ IW(ν_0 + n, S_0 + nV),

where n is the number of Y_i's, in this case 100, and V = (1/n) Σ_{i=1}^n (Y_i - μ)(Y_i - μ)^T. In this example the values ν_0 = 3 and

S_0 = ( 5.0  -0.5
       -0.5   2.0 )

will be used for the prior distribution. Then the posterior for Σ is

p(Σ | Y) ~ IW( 103, ( 218.77  -36.27
                      -36.27  118.07 ) ),

which has posterior mean matrix

( 2.188  -0.363
 -0.363   1.181 ).

I will again compare the Gibbs sampling method used by BUGS, which samples directly from the posterior distribution for Σ and should perform well, with the Hastings method, which should take longer to converge.
7.3.3 Results

For the Gibbs sampler method, a burn-in of 1,000 updates and a main run of 5,000 updates was used for each of three random number seeds and the results were averaged. For the Hastings method several different degrees of freedom values w were used. For each w, three runs were performed with different random number seeds, with a burn-in of 1,000 and a main run of 100,000. The Raftery and Lewis convergence diagnostic is the maximum N of the three variables monitored over the three runs. The results can be seen in Table 7.2.
Table 7.2: Comparison between two MCMC methods for a bivariate normal model with unknown variance matrix.

Method     w     Σ_00            Σ_01             Σ_11            Acc %    R/L
Theory     N/A   2.188           -0.363           1.181            N/A     N/A
Gibbs      N/A   2.193 (0.315)   -0.365 (0.167)   1.178 (0.168)   100.0    3.9k
Hastings   20    2.190 (0.315)   -0.362 (0.165)   1.181 (0.167)    13.6   47.3k
           50    2.188 (0.313)   -0.364 (0.166)   1.180 (0.169)    29.3   30.6k
           100   2.187 (0.309)   -0.361 (0.165)   1.180 (0.168)    43.6   34.0k
           200   2.190 (0.312)   -0.364 (0.167)   1.181 (0.169)    57.2   44.8k
           500   2.190 (0.312)   -0.365 (0.167)   1.181 (0.170)    71.8   63.1k
           1000  2.186 (0.312)   -0.365 (0.168)   1.186 (0.169)    79.6   99.8k
From Table 7.2 it can be seen that the choice of 100,000 as the main run length for the Hastings method satisfies the Raftery-Lewis convergence diagnostic for all selected values of w. It can also be seen that the default accuracy goals are achieved most quickly when the acceptance rate is between 30% and 35%. This is different from the univariate case, but this is because the new procedure involves updating the whole variance matrix and not just a single parameter. In fact Gelman, Roberts, and Gilks (1995) suggest a rate of 31.6% for a 3 dimensional normal update, which compares favourably with this analysis. It can be clearly seen that the method is estimating the variance matrix correctly. I now need to incorporate this method into the algorithm for the models that are the theme of this chapter, multi-level models with complex variation at level 1.
7.4 Applying inverse Wishart updates to complex variation at level 1

In Chapter 5 I showed how the Gibbs algorithm for a Gaussian 3 level model is easily generalised to N levels. In this section I will simply describe how to use the inverse Wishart updating step for the level 1 variance with a 2 level model. Extending this algorithm to N levels should be analogous to the work in Chapter 5. The model to be considered is as follows:

y_ij = X^F_ij β^F + X^R_ij β^R_j + X^C_ij e_ij,
e_ij ~ MVN(0, V_C),    β^R_j ~ MVN(0, V_R).
The important part of the algorithm is to store the level 1 variance for each individual,

σ²_ij = (X^C_ij)^T V_C X^C_ij.

All the other parameters then depend on the level 1 variance V_C only through these individual variances. This means that the algorithm below is, apart from the updating step for V_C, almost identical to the algorithm for the same model without complex variation. The main difference is that everywhere that σ² appeared in the old algorithm it is replaced by σ²_ij, and this often involves moving the σ²_ij inside summations.
7.4.1 MCMC algorithm

I will assume that the following general priors are used: for the level 1 variance, p(V_C) ~ IW(ν_1, S_1); for the level 2 variance, p(V_R) ~ IW(ν_2, S_2); and for the fixed effects, β^F ~ N(μ_p, S_p). The algorithm then has four steps as follows.

Step 1 - The fixed effects, β^F.

p(β^F | y, β^R, V_C, V_R) ∝ p(y | β^F, β^R, V_C, V_R) p(β^F),

β^F ~ MVN(β̂^F, D̂^F),

where

D̂^F = [ Σ_ij (X^F_ij)^T X^F_ij / σ²_ij + S_p^{-1} ]^{-1}

and

β̂^F = D̂^F [ Σ_ij (X^F_ij)^T (y_ij - X^R_ij β^R_j) / σ²_ij + S_p^{-1} μ_p ].
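As a hedged illustration of Step 1 and of the σ²_ij bookkeeping described above, the following Python sketch draws β^F for a two level model; the data layout (arrays y, XF, XR, XC, the group index, and the prior arguments mu_p and Sp_inv) is an assumption of mine and not the MLwiN implementation.

import numpy as np

def update_fixed_effects(y, XF, XR, XC, betaR, group, VC, mu_p, Sp_inv, rng):
    # Store the individual level 1 variances sigma2_ij = XC_ij' VC XC_ij.
    sigma2 = np.einsum('ij,jk,ik->i', XC, VC, XC)
    # Residual after removing the level 2 random part: y_ij - XR_ij betaR_j.
    resid = y - np.einsum('ij,ij->i', XR, betaR[group])
    # Posterior precision and mean for the fixed effects.
    D_inv = Sp_inv + (XF.T / sigma2) @ XF
    D = np.linalg.inv(D_inv)
    mean = D @ (XF.T @ (resid / sigma2) + Sp_inv @ mu_p)
    return rng.multivariate_normal(mean, D)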
Step 2 - The level 2 residuals, β^R.

p(β^R | y, β^F, V_C, V_R) ∝ p(y | β^F, β^R, V_C, V_R) p(β^R | V_R),

β^R_j ~ MVN(β̂^R_j, D̂^R_j),

where

D̂^R_j = [ Σ_{i=1}^{n_j} (X^R_ij)^T X^R_ij / σ²_ij + V_R^{-1} ]^{-1}

and

β̂^R_j = D̂^R_j Σ_{i=1}^{n_j} (X^R_ij)^T (y_ij - X^F_ij β^F) / σ²_ij.
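A matching sketch for Step 2, drawing each β^R_j in turn under the same assumed data layout as the Step 1 sketch above:

import numpy as np

def update_level2_residuals(y, XF, XR, XC, betaF, VC, VR_inv, group, J, rng):
    sigma2 = np.einsum('ij,jk,ik->i', XC, VC, XC)   # sigma2_ij for every observation
    resid = y - XF @ betaF                          # y_ij - XF_ij betaF
    betaR = np.empty((J, XR.shape[1]))
    for j in range(J):
        rows = (group == j)
        XRj, rj, s2j = XR[rows], resid[rows], sigma2[rows]
        D_inv = VR_inv + (XRj.T / s2j) @ XRj
        D = np.linalg.inv(D_inv)
        betaR[j] = rng.multivariate_normal(D @ (XRj.T @ (rj / s2j)), D)
    return betaR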
Step 3 - The level 1 variance, V_C.

This step now involves a Hastings update using an inverse Wishart proposal distribution:

V_C^(t) = V_C^* with probability min(1, hr × p(V_C^* | y, ...) / p(V_C^(t-1) | y, ...)),
        = V_C^(t-1) otherwise,

where V_C^* ~ IW_{n_1}(S = w V_C^(t-1), ν = w + n_1 + 1), w being a tuning parameter which affects the acceptance rate of the Hastings proposals and n_1 the number of random parameters at level 1. The Hastings ratio hr is as follows:

hr = ( |V_C^(t-1)| / |V_C^*| )^η exp( (w/2)( tr(V_C^* (V_C^(t-1))^{-1}) - tr(V_C^(t-1) (V_C^*)^{-1}) ) ),

where η = (2w + 3n_1 + 3)/2.
Step 4 - The level 2 variance, V_R.

p(V_R^{-1} | y, β^F, β^R, V_C) ∝ p(β^R | V_R) p(V_R^{-1}),

V_R^{-1} ~ Wishart_{n_2}( S_pos = ( Σ_{j=1}^J β^R_j (β^R_j)^T + S_P )^{-1}, ν_pos = J + ν_P ),

where n_2 is the number of random variables at level 2. If a uniform prior is required then set S_P = 0 and ν_P = -n_2 - 1.
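A sketch of this Wishart draw for V_R^{-1}, again under assumed argument names of my own (betaR holds the J level 2 residual vectors as rows):

import numpy as np
from scipy.stats import wishart

def update_VR(betaR, S_P, nu_P, rng):
    # Posterior scale: (sum_j betaR_j betaR_j' + S_P)^{-1}; posterior df: J + nu_P.
    J = betaR.shape[0]
    S_pos = np.linalg.inv(betaR.T @ betaR + S_P)
    VR_inv = wishart.rvs(df=J + nu_P, scale=S_pos, random_state=rng)
    return np.linalg.inv(VR_inv)   # return the level 2 variance matrix itself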
7.4.2 Example 1

The example to be used to illustrate the algorithm was described at the start of the chapter and consists of the JSP dataset with an extension to the random slopes regression model so that the predictor variable M3 is random at level 1. The model is as follows:

M5_ij = β^F_0 + β^R_0j + e_0ij + M3_ij (β^F_1 + β^R_1j + e_1ij),
e_ij = (e_0ij, e_1ij)^T ~ MVN(0, V_C),    β^R_j = (β^R_0j, β^R_1j)^T ~ MVN(0, V_R).

The predictor variable M3 has been centred around its mean. I will not use the actual response variable M5, as this produces estimates via IGLS and RIGLS that do not lead to a positive definite matrix V_C. This problem will be returned to in the next section, where an alternative method is given. Instead I will use the MLn SIMU command (Rasbash and Woodhouse 1995) to create a simulated dataset with a positive definite matrix V_C.
The results from one simulated dataset are given in Table 7.3. The results for the MCMC methods are based on the average of three runs, each of length 50,000 after a burn-in of 500 iterations. The value w = 150 was selected as this gives an acceptance rate of approximately 32%, and the rate suggested in Gelman, Roberts, and Gilks (1995) for a 3 dimensional normal update is 31.6%. The last column contains results for method 2, which is discussed later in the chapter.
From Table 7.3 it should be noted that, as the results are for only one dataset generated using the values in the `True' column, the estimates should not be identical to these true values.
Table 7.3: Comparison between IGLS/RIGLS and MCMC methods on a simulated dataset with the layout of the JSP dataset.

Parameter (True)   IGLS             RIGLS            MCMC IW (w = 150)   MCMC Method 2
β_0   (30.0)       31.118 (0.487)   31.116 (0.492)   31.117 (0.527)      31.108 (0.526)
β_1   (0.5)        0.537 (0.080)    0.537 (0.081)    0.536 (0.088)       0.538 (0.088)
V_R00 (6.0)        8.417 (2.294)    8.642 (2.344)    10.303 (2.867)      10.289 (2.866)
V_R01 (-0.25)      -0.546 (0.282)   -0.560 (0.288)   -0.662 (0.361)      -0.658 (0.359)
V_R11 (0.1)        0.176 (0.062)    0.183 (0.064)    0.231 (0.084)       0.230 (0.084)
V_C00 (28.0)       27.657 (2.187)   27.657 (2.188)   27.910 (2.279)      27.904 (2.291)
V_C01 (-0.5)       -0.654 (0.261)   -0.656 (0.261)   -0.675 (0.270)      -0.677 (0.271)
V_C11 (0.5)        0.571 (0.091)    0.571 (0.091)    0.589 (0.099)       0.589 (0.100)
What is clear is that the MCMC methods are exhibiting behaviour that could be predicted from the results in Chapter 4. The MCMC methods are using uniform priors for both variance matrices and this is giving larger variance estimates. The discrepancy is also larger at level 2, where the estimates are based on 48 schools, than at level 1, where the estimates are based on 887 pupils, as would be expected. The MCMC methods have wider uncertainty bands than those from IGLS and RIGLS, which is also to be expected.
7.4.3 Conclusions

From this one example it can be concluded that, provided the dataset used gives level 1 variance estimates in IGLS/RIGLS that produce a positive definite matrix, the above method can be used to fit the model and give MCMC estimates.
One important point to note is that in the earlier examples, with a single residual e_ij at level 1 for each individual, these residuals could be estimated by subtraction at each iteration. This would then lead to MCMC estimates for each e_ij. Following this through to the complex variation models, the subtraction approach will give estimates for the composite residual X^C_ij e_ij but not for the individual e_ij. The best that is available is similar to the IGLS/RIGLS methods: these methods calculate the residuals based on the final estimates of the other parameters, and the same calculation could simply be applied using the MCMC estimates of the other parameters.
The one point I touched on earlier is that the IGLS/RIGLS method for the original dataset produces estimates that do not form a positive definite variance matrix V_C. Although the method I have just given is a useful way of using the Metropolis-Hastings sampler, I will in the next section show an alternative method that can handle non-positive definite variance matrices.
7.5 Method 2: Using truncated normal Hastings update steps

The inverse Wishart updating method assumes that the variance `matrix' at level 1 must be positive definite. Although the way the model has been written thus far suggests that the variance at level 1 should be a matrix, an alternative form would be to consider the variance at level 1 as a quadratic form. For example, using the JSP example considered already,

Var(M5_ij | β^F, β^R_j) = A + 2B M3_ij + C (M3_ij)².

Using the constraint that the matrix

( A  B
  B  C )

is positive definite is a stronger constraint than is actually needed. Positive definite matrices will guarantee that any vector X^C_ij will produce a positive variance, but in the JSP example the first random variable is constant and the second variable, M3, takes integer values, before centering, between 0 and 40. So a looser constraint is to allow all values (A*, B*, C*) such that A* + 2B* M3_ij + C* (M3_ij)² > 0 for all i, j. This constraint looks quite complicated to work with, but if I consider each of the variables A, B and C separately and assume the other variables are fixed, the constraints become easier.
I will now consider the steps required for this simple example before generalising the algorithm to all problems with complex variation at level 1.
7.5.1 Update steps at level 1 for the JSP example

At iteration t, assume that the current values for the parameters are A^(t), B^(t) and C^(t), and let σ²_ij = Var(M5_ij | β^F, β^R_j). Then I will update the three parameters in turn.

Updating parameter A.

At time t,

σ²_ij = A^(t) + 2B^(t) M3_ij + C^(t) (M3_ij)² > 0 for all i, j.

So let

2B^(t) M3_ij + C^(t) (M3_ij)² = -r^A_ij; then A^(t) > r^A_ij for all i, j.

This implies

A^(t) > max_A, where max_A = max(r^A_ij).

I will use a normal proposal distribution with variance v_A, but only consider values generated that satisfy the constraint. This will lead to a truncated normal proposal as shown in Figure 7-1 (i). The Hastings ratio can then be calculated from the ratio of the two truncated normal distributions shown in Figure 7-1 (i) and (ii). Letting the value for A at time t be A_c and the proposed value for time t + 1 be A*,

hr = p(A^(t+1) = A* | A^(t) = A_c) / p(A^(t+1) = A_c | A^(t) = A*)

   = [1 - Φ((max_A - A*)/√v_A)] / [1 - Φ((max_A - A_c)/√v_A)].

The update step is now as follows:

A^(t+1) = A* with probability min(1, hr × p(A* | y, ...)/p(A^(t) | y, ...)),
        = A^(t) otherwise.
Updating parameter B.

At time t,

σ²_ij = A^(t) + 2B^(t) M3_ij + C^(t) (M3_ij)² > 0 for all i, j.

So let A^(t) + C^(t) (M3_ij)² = -r^B_ij; then

B^(t) > r^B_ij / (2 M3_ij) for all M3_ij > 0, and
B^(t) < r^B_ij / (2 M3_ij) for all M3_ij < 0.

This leads to two constraints:

B^(t) > max_B+, where max_B+ = max(r^B_ij / (2 M3_ij) : M3_ij > 0),

and

B^(t) < min_B-, where min_B- = min(r^B_ij / (2 M3_ij) : M3_ij < 0).

I will use a normal proposal distribution with variance v_B, but only consider values generated that satisfy these constraints. This will lead to a truncated normal proposal as shown in Figure 7-1 (iii). The Hastings ratio can then be calculated from the ratio of the two truncated normal distributions shown in Figure 7-1 (iii) and (iv). Letting the value for B at time t be B_c and the proposed value for time t + 1 be B*,

hr = p(B^(t+1) = B* | B^(t) = B_c) / p(B^(t+1) = B_c | B^(t) = B*)

   = [Φ((min_B- - B*)/√v_B) - Φ((max_B+ - B*)/√v_B)] / [Φ((min_B- - B_c)/√v_B) - Φ((max_B+ - B_c)/√v_B)].

The update step is now as follows:

B^(t+1) = B* with probability min(1, hr × p(B* | y, ...)/p(B^(t) | y, ...)),
        = B^(t) otherwise.
Updating parameter C.

At time t,

σ²_ij = A^(t) + 2B^(t) M3_ij + C^(t) (M3_ij)² > 0 for all i, j.

So let

A^(t) + 2B^(t) M3_ij = -r^C_ij; then C^(t) > r^C_ij / (M3_ij)² for all i, j.

This implies

C^(t) > max_C, where max_C = max(r^C_ij / (M3_ij)²).

I will use a normal proposal distribution with variance v_C, but only consider values generated that satisfy the constraint. This will lead to a truncated normal proposal as shown in Figure 7-1 (i). The Hastings ratio can then be calculated from the ratio of the two truncated normal distributions shown in Figure 7-1 (i) and (ii). Letting the value for C at time t be C_c and the proposed value for time t + 1 be C*,

hr = p(C^(t+1) = C* | C^(t) = C_c) / p(C^(t+1) = C_c | C^(t) = C*)

   = [1 - Φ((max_C - C*)/√v_C)] / [1 - Φ((max_C - C_c)/√v_C)].

The update step is now as follows:

C^(t+1) = C* with probability min(1, hr × p(C* | y, ...)/p(C^(t) | y, ...)),
        = C^(t) otherwise.
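The three updates above share the same structure, so the following is a single hedged Python sketch of a generic truncated-normal random-walk step that could serve for A, B or C; the rejection-based proposal draw, the log_post argument and the use of two-sided bounds (with infinite limits for one-sided constraints) are my own illustrative choices rather than the thesis implementation.

import numpy as np
from scipy.stats import norm

def truncated_normal_step(cur, lower, upper, v, log_post, rng):
    # One Metropolis-Hastings update for a scalar variance term constrained to
    # (lower, upper), using a Normal(cur, v) proposal truncated to that interval.
    sd = np.sqrt(v)
    # Draw from the truncated proposal by rejection (adequate for modest truncation).
    while True:
        prop = rng.normal(cur, sd)
        if lower < prop < upper:
            break
    # Hastings correction: ratio of the two truncation normalising constants.
    z_cur = norm.cdf(upper, cur, sd) - norm.cdf(lower, cur, sd)
    z_prop = norm.cdf(upper, prop, sd) - norm.cdf(lower, prop, sd)
    log_acc = log_post(prop) - log_post(cur) + np.log(z_cur) - np.log(z_prop)
    return prop if np.log(rng.uniform()) < log_acc else cur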
The results of using this second method on the simulated dataset example of the last section can be seen in the last column of Table 7.3. Here it can be seen that, although the two methods are based on different constraints, in this example, where the true level 1 variance matrix is positive definite, both methods give similar estimates.
Before generalising the algorithm to all covariance structures at level 1, I will now return to the original data from the JSP dataset. This will show that this second method has the advantage of fitting models with a non-positive definite variance structure at level 1.
Figure 7-1: Plots of truncated univariate normal proposal distributions for a parameter θ. A is the current value θ_c and B is the proposed new value θ*. M is max_θ and m is min_θ, the truncation points. The distributions in (i) and (iii) have mean θ_c, while the distributions in (ii) and (iv) have mean θ*.
7.5.2 Proposal distributions

I have not as yet discussed the proposal distributions used in this method in much detail; I simply stated that I was using a truncated normal distribution with a particular variance for the untruncated version. The problem of choosing a value for the variance parameter is the same problem that arose when considering updating the fixed effects and higher level residuals by Metropolis-Hastings updates. As there, two possible solutions are to use the variance of the parameter estimate from the RIGLS procedure, multiplied by a suitable scaling, to give a variance for the normal proposal distribution, or to use an adaptive approach before the burn-in and main run of the simulation. In Example 1 I used a scaling of 5.8 on the variance scale, which gave acceptance rates of between 35% and 50%.
7.5.3 Example 2: Non-positive definite and incomplete variance matrices at level 1

The original JSP dataset gave a non-positive definite matrix as an estimate for the level 1 variance. The term V_C11 is very small and is not statistically significant, so it could be removed from the model. This will mean that in practical terms the variance of an individual student's M5 score depends on their M3 score in a linear way, rather than a quadratic way. This will then lead to an incomplete variance matrix at level 1.
This sort of variance structure at level 1 can be useful in other situations, for example if the data consist of boys and girls and it is believed that there is a difference in variability between boys and girls. Then the variance equation could be as follows:

Var(Y_ij | β^F, β^R) = V_C00 + V_C01 Boy_ij,

where Boy_ij has value 1 if the child is a boy and 0 if the child is a girl, and V_C01 can be negative. Here including the quadratic term would not make sense as (Boy_ij)² = Boy_ij for all i, j, so an incomplete variance structure should be used.
Table 7.4 contains the estimates of the RIGLS and second MCMC methods for the complex variation model fitted to the original JSP data (Model 2). The table also contains estimates when firstly V_C11 is removed (Model 3), and secondly V_C01 is removed (Model 4). The MCMC algorithm described earlier can easily accommodate the removal of terms from the variance matrix: the term is simply set to zero before running the simulation method, and during the method the term is never updated.
Table 7.4: Comparison between RIGLS and MCMC method 2 on three models with complex variation fitted to the JSP dataset.

         Parameter    Model 2          Model 3          Model 4
RIGLS    β_0          30.582 (0.356)   30.582 (0.356)   30.589 (0.367)
         β_1          0.617 (0.033)    0.617 (0.033)    0.612 (0.042)
         V_R00        4.273 (1.230)    4.271 (1.230)    4.638 (1.306)
         V_R01        -0.246 (0.098)   -0.247 (0.098)   -0.340 (0.117)
         V_R11        0.017 (0.010)    0.017 (0.010)    0.028 (0.017)
         V_C00        27.551 (1.465)   27.573 (1.419)   25.018 (1.651)
         V_C01        -1.206 (0.137)   -1.203 (0.079)   0.000 (0.000)
         V_C11        0.001 (0.027)    0.000 (0.000)    0.063 (0.039)
MCMC     β_0          30.582 (0.387)   30.577 (0.385)   30.590 (0.398)
         β_1          0.617 (0.039)    0.617 (0.039)    0.614 (0.048)
         V_R00        5.302 (1.647)    5.330 (1.651)    5.603 (1.709)
         V_R01        -0.319 (0.143)   -0.327 (0.142)   -0.394 (0.160)
         V_R11        0.030 (0.016)    0.030 (0.016)    0.048 (0.023)
         V_C00        27.237 (1.528)   27.418 (1.409)   25.008 (1.545)
         V_C01        -1.221 (0.136)   -1.178 (0.079)   0.000 (0.000)
         V_C11        0.012 (0.028)    0.000 (0.000)    0.067 (0.033)
         Scale        1.0              1.3              3.0
Accept   Acc(V_C00)   45.7             42.2             46.4
         Acc(V_C01)   36.0             43.9             n/a
         Acc(V_C11)   37.8             n/a              43.8
Diag.    RL(V_C00)    51K              51.6K            22.7K
         RL(V_C01)    119K             100.1K           n/a
         RL(V_C11)    134K             n/a              23.3K
The MCMC estimates are based on the average of 3 runs, each of length 50,000 after a burn-in of 500. Each run takes approximately an hour on a Pentium 200MHz PC. The Raftery-Lewis estimates are larger than the 50,000 iterations performed on each run; however, the estimates in the table are based on the average of 3 runs, which gives a potential 150,000 iterations per estimate. The Raftery-Lewis estimates may also be inflated as they are based on a thinned chain of length 10,000 (every 5th iteration of the main chain) due to storage restrictions. The scaling of the proposal distribution variances was chosen based on some shorter runs, so that acceptance rates were close to 44%.
From Table 7.4 it can be seen that the MCMC method can handle firstly non-positive definite level 1 variance matrices, and secondly incomplete level 1 variance matrices. The MCMC estimates exhibit behaviour similar to that seen in Example 1. The level 2 MCMC variance estimates are larger than those from RIGLS, but at level 1 there is less difference between the estimates from the two methods. In fact the level 1 variance term V_C00 has smaller estimates using the MCMC method.
Prior distributions

In the above example, uniform priors were used for all the variance parameters. As an alternative, informative inverse Wishart priors could be used for the level 1 variance matrix, if the matrix is complete. If the matrix is incomplete, an alternative prior to the uniform is not immediately obvious. The problem of finding alternative priors in this case is outside the scope of this thesis. A possible solution would be to use univariate normal prior distributions for each parameter, as the likelihood would ensure that the posterior gave acceptable values for the positivity constraints.
7.5.4 General algorithm for the truncated normal proposal method

The algorithm for a general Gaussian multi-level model using the updating method of truncated normal proposals at level 1 follows directly from the example in the last section. As both this method and the earlier method calculate the σ²_ij's, the update steps are all the same except for the update step for the level 1 variance. I will assume the model is as for the first method,

y_ij = X^F_ij β^F + X^R_ij β^R_j + X^C_ij e_ij,
e_ij ~ MVN(0, V_C),    β^R_j ~ MVN(0, V_R).
Step 3 - The level 1 variance, V_C.

The variance matrix V_C is considered term by term, and one of the following two steps is followed depending on whether the term lies on the diagonal.

Updating diagonal terms, V_Cnn.

At time t,

σ²_ij = (X^C_ij)^T V_C^(t) X^C_ij > 0
      = (X^C_ij(n))² V_Cnn^(t) - r^C_ij(nn) > 0 for all i, j,

where

r^C_ij(nn) = (X^C_ij(n))² V_Cnn^(t) - (X^C_ij)^T V_C^(t) X^C_ij.

So

V_Cnn^(t) > max_Cnn, where max_Cnn = max(r^C_ij(nn) / (X^C_ij(n))²).

I will use a normal proposal distribution with variance v_nn, but only consider values generated that satisfy the constraint. This will lead to a truncated normal proposal as shown in Figure 7-1 (i). The Hastings ratio can then be calculated from the ratio of the two truncated normal distributions shown in Figure 7-1 (i) and (ii). Letting the value for V_Cnn at time t be A and the proposed value for time t + 1 be B,

hr = p(V_Cnn^(t+1) = B | V_Cnn^(t) = A) / p(V_Cnn^(t+1) = A | V_Cnn^(t) = B)

   = [1 - Φ((max_Cnn - B)/√v_nn)] / [1 - Φ((max_Cnn - A)/√v_nn)].

The update step is now as follows:

V_Cnn^(t+1) = V_Cnn^* with probability min(1, hr × p(V_Cnn^* | y, ...)/p(V_Cnn^(t) | y, ...)),
            = V_Cnn^(t) otherwise.
Updating non-diagonal terms, V_Cmn.

At time t,

σ²_ij = (X^C_ij)^T V_C^(t) X^C_ij > 0
      = 2 X^C_ij(m) X^C_ij(n) V_Cmn^(t) - r^C_ij(mn) > 0 for all i, j,

where

r^C_ij(mn) = 2 X^C_ij(m) X^C_ij(n) V_Cmn^(t) - (X^C_ij)^T V_C^(t) X^C_ij.

So

V_Cmn^(t) > max_Cmn+, where max_Cmn+ = max(r^C_ij(mn) / (2 X^C_ij(m) X^C_ij(n)) : X^C_ij(m) X^C_ij(n) > 0),

and

V_Cmn^(t) < min_Cmn-, where min_Cmn- = min(r^C_ij(mn) / (2 X^C_ij(m) X^C_ij(n)) : X^C_ij(m) X^C_ij(n) < 0).

I will use a normal proposal distribution with variance v_mn, but only consider values generated that satisfy the constraints. This will lead to a truncated normal proposal as shown in Figure 7-1 (iii). The Hastings ratio can then be calculated from the ratio of the two truncated normal distributions shown in Figure 7-1 (iii) and (iv). Letting the value for V_Cmn at time t be A and the proposed value for time t + 1 be B,

hr = p(V_Cmn^(t+1) = B | V_Cmn^(t) = A) / p(V_Cmn^(t+1) = A | V_Cmn^(t) = B)

   = [Φ((min_Cmn- - B)/√v_mn) - Φ((max_Cmn+ - B)/√v_mn)] / [Φ((min_Cmn- - A)/√v_mn) - Φ((max_Cmn+ - A)/√v_mn)].

The update step is now as follows:

V_Cmn^(t+1) = V_Cmn^* with probability min(1, hr × p(V_Cmn^* | y, ...)/p(V_Cmn^(t) | y, ...)),
            = V_Cmn^(t) otherwise.
7.6 Summary

In this chapter two MCMC algorithms have been given for fitting Gaussian multi-level models with complex variation at level 1. The first method uses an inverse Wishart proposal distribution for the level 1 variance matrix, and can only be used when the variance matrix at level 1 is strictly positive definite. The second method uses truncated normal proposals for the individual variance terms. It is not constrained to positive definite matrices and can even cope with incomplete variance matrices.
Both these methods were illustrated through some simple examples and gave sensible estimates for the problems given. The one problem that both methods fail to solve is that they do not give estimates for the individual level 1 residuals. This, along with other models that are not covered in this thesis, will be discussed in the next chapter, along with potential solutions.
Chapter 8
Conclusions and Further Work

8.1 Conclusions

The general aim of this thesis is to combine the two areas of multi-level modelling and Markov chain Monte Carlo (MCMC) methods by fitting multi-level models using MCMC methods. This task was split into three parts. Firstly, the types of problems that are fitted in multi-level modelling were identified and the existing maximum likelihood methods were investigated. Secondly, MCMC algorithms for these models were derived, and finally these methods were compared to the maximum likelihood based methods both in terms of estimate bias and coverage properties.
Two simple 2 level Gaussian models were considered first, and it was shown how to fit these models using the Gibbs sampler method. Then extensive simulation studies were carried out to compare different prior distributions for the variance parameters in these models, using the Gibbs sampler method, with the two maximum likelihood methods IGLS and RIGLS on these two models. The results showed that the RIGLS method was less biased than the MCMC methods. In terms of coverage properties the MCMC methods do better than the maximum likelihood methods in many situations, although not always. The conclusions are summarised in more detail in Section 4.5.
The Gibbs sampler algorithms given for the two simple multi-level models were then generalised to fit the family of N level Gaussian models. Two alternative hybrid Metropolis-Gibbs methods were also given, along with adaptive samplers, and all these methods were compared with each other.
It was found that the Gibbs sampler method was better than the hybrid methods for the Gaussian models in terms of the number of iterations required to achieve a desired accuracy. Of the hybrid methods there was little to choose between the univariate proposal distribution method and the multivariate proposal distribution method. It was also found that the adaptive samplers were a better way of achieving desired accuracies in a minimum number of iterations than the methods using proposal distributions based on scaled variance estimates.
The univariate normal proposal distribution Metropolis method was adapted to fit binary response multi-level models. This method was then compared with the quasi-likelihood methods via a simulation study on one binary response model from Rodriguez and Goldman (1995) where the quasi-likelihood methods perform particularly badly. For this model it was shown that the Metropolis-Gibbs hybrid method performs much better both in terms of bias and coverage properties. It was also shown that for this model the choice of prior distribution for the variance parameters is less important.
Finally, Gaussian models with complex variation at level 1 were considered and two MCMC methods with Hastings updates at level 1 were given. The first method was based on an inverse Wishart proposal distribution for the level 1 variance matrix but could only be used when the level 1 variance parameters formed a complete positive definite matrix. The second method was based on truncated normal proposals for the individual variance components and could be used on any variance structure at level 1. Both these methods were tested on some simple examples.
As can be seen, many different multi-level models have been considered in this thesis, but these are just a selection from a far larger field. The further work section following these conclusions will outline some other models that I did not have time to consider. Before mentioning this future work I will briefly describe the implementation of MCMC methods in the MLwiN package (Goldstein et al. 1998).
8.1.1 MCMC options in the MLwiN package

A by-product of this thesis has been the programming of MCMC methods into the MLwiN package (Goldstein et al. 1998). The first release of MLwiN was in February 1998, and over the first 6 months the package has been used by hundreds of users from the social science community and elsewhere, some familiar with its forerunner, MLn (Rasbash and Woodhouse 1995), and some new users. Although many users who are familiar with the IGLS and RIGLS methods may not use the MCMC methods, it is hoped that by implementing MCMC in MLwiN we will be exposing more people to MCMC methods.
Early feedback is encouraging, and two workshops have been set up to teach these new users more about the MCMC methods in MLwiN and Bayesian statistics in general. I have also received several questions about the MCMC options in MLwiN that show that a new MCMC user community has emerged.
Apart from introducing MCMC methods to a new community of users, the MLwiN package has been invaluable to me for performing the simulation studies found in this thesis. On this note I must also acknowledge the BUGS package (Spiegelhalter et al. 1994), which was used to compare results in the original programming of the MCMC options in MLwiN and to perform some of the simulations in Chapter 4.
8.2 Further work

In this section I include a brief introduction to other models that have not been considered in this thesis but which can be fitted in the MLwiN package using maximum likelihood methods. Some of these models have not been considered due to time constraints, although fitting them using MCMC is not difficult. Other models are more difficult or have simply not been considered yet but are included for completeness. Goldstein (1995) contains more information on fitting all the following models using maximum likelihood methods.

8.2.1 Binomial responses

Binomial response models are used to fit datasets where the response variable is a proportion. The binomial distribution has two parameters: p, the probability of success, and n, the number of trials, known as the denominator in Goldstein (1995). The binary response models considered in Chapter 6 are a special case of the binomial response models where the denominator is 1 for all observations.
The logit link function is used to fit binomial response models, as shown for the binary response models in Chapter 6. In fact a multi-level binomial response model can be converted to a multi-level binary response model by including an extra level in the model for the individual trials.
A 3 level binomial response logistic regression model can be defined as follows:

y_ijk ~ Binomial(n_ijk, p_ijk), where
logit(p_ijk) = X1_ijk β_1 + X2_ijk β_2jk + X3_ijk β_3k,
β_2jk ~ MVN(0, V_2),    β_3k ~ MVN(0, V_3).
The Metropolis-Gibbs hybrid method can be used to fit binomial response models, and this can be done via minor alterations to the algorithm for the binary response model in Chapter 6. This has not yet been implemented in MLwiN, but the only main requirements are to store the denominator n_i for each observation i and then to modify two conditional distributions.
The conditional distribution for the fixed effects should now be

p(β_1 | y, ...) ∝ p(β_1) × Π_{i ∈ M_T} (1 + e^{-(Xβ)_i})^{-y_i} (1 + e^{(Xβ)_i})^{y_i - n_i},

and the conditional distribution for the level l residuals should now be

p(β_lj | y, ...) ∝ Π_{i ∈ M_{l,j}} (1 + e^{-(Xβ)_i})^{-y_i} (1 + e^{(Xβ)_i})^{y_i - n_i} × |V_l|^{-1/2} exp[-½ β_lj^T V_l^{-1} β_lj].
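To make the modification concrete, here is a hedged Python sketch of the binomial log conditional for the fixed effects (up to an additive constant); eta_rest, which holds the higher level contributions to the linear predictor, and the diffuse normal prior variance are illustrative assumptions of mine, not the MLwiN implementation.

import numpy as np

def log_cond_fixed_binomial(beta1, y, n, X, eta_rest, prior_var=1e6):
    # Linear predictor on the logit scale: (X*beta)_i plus the random-effect parts.
    eta = X @ beta1 + eta_rest
    # Binomial log likelihood term: -y*log(1 + e^{-eta}) + (y - n)*log(1 + e^{eta}).
    loglik = -y * np.logaddexp(0.0, -eta) + (y - n) * np.logaddexp(0.0, eta)
    # Assumed diffuse N(0, prior_var) prior on each fixed effect.
    logprior = -0.5 * np.sum(beta1**2) / prior_var
    return np.sum(loglik) + logprior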
Extra binomial variation

As with the binary response models, the binomial response models do not have a parameter for the level 1 variance. This is because, if the mean and the denominator of a binomial distribution are known, then the variance can be calculated, var(y_i) = n_i p_i (1 - p_i).
The multi-level binomial response model could also be written as follows:

y_i = p_i + e_i z_i,    z_i = sqrt( p_i (1 - p_i) / n_i ),

where

logit(p_i) = (Xβ)_i,    β_lj ~ MVN(0, V_l).

If the model is assumed to have binomial variation then σ²_e is constrained to 1. The assumption of binomial variation need not hold, and the quasi-likelihood methods in MLwiN allow the constraint to be dropped. When the assumption of binomial variation is dropped, the variation is known as extra binomial variation. Work is needed to fit models with extra binomial variation using the MCMC methods.
8.2.2 Multinomial models

The binomial model is used for proportion data where the response variable is a collection of observations that have two possible states (0 and 1). This is a special case of the multinomial model, which is used when the response variable is a collection of observations with S possible states. The quasi-likelihood methods in MLwiN can fit multinomial models, but as yet I have not considered how to fit these models using MCMC. This work would probably follow easily after fitting the multivariate Normal response models mentioned later in this chapter.
8.2.3 Poisson responses for count data

The other sort of discrete data that needs to be considered is count data. As count data are restricted to being non-negative integer values, the Poisson distribution is usually used for the response variable, along with the log link function.
A multi-level Poisson model with a log link can therefore be written

y_i ~ Poisson(μ_i),

where

log(μ_i) = (Xβ)_i,    β_lj ~ MVN(0, V_l).

The Poisson model is related to the binomial in that it has a variance that is fixed if the mean is known. The algorithm for the binary response model can also be modified to fit Poisson models by modifying the conditional distributions. For the Poisson model the conditional distribution for the fixed effects should now be

p(β_1 | y, ...) ∝ p(β_1) × Π_{i ∈ M_T} e^{-e^{(Xβ)_i}} (e^{(Xβ)_i})^{y_i},

and the conditional distribution for the level l residuals should now be

p(β_lj | y, ...) ∝ Π_{i ∈ M_{l,j}} e^{-e^{(Xβ)_i}} (e^{(Xβ)_i})^{y_i} × |V_l|^{-1/2} exp[-½ β_lj^T V_l^{-1} β_lj].
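A corresponding hedged sketch of the Poisson log conditional for the fixed effects, with the same caveats and assumed argument names as the binomial sketch above:

import numpy as np

def log_cond_fixed_poisson(beta1, y, X, eta_rest, prior_var=1e6):
    # Log of the Poisson mean: (X*beta)_i plus the random-effect parts.
    eta = X @ beta1 + eta_rest
    # Poisson log likelihood term: sum_i (y_i * eta_i - exp(eta_i)).
    loglik = np.sum(y * eta - np.exp(eta))
    # Assumed diffuse N(0, prior_var) prior on each fixed effect.
    logprior = -0.5 * np.sum(beta1**2) / prior_var
    return loglik + logprior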
Extra Poisson variation

An alternative form for the multi-level Poisson response model is as follows:

y_i = μ_i + e_i z_i,    z_i = sqrt(μ_i),

where

log(μ_i) = (Xβ)_i,    β_lj ~ MVN(0, V_l).

If the model is assumed to have Poisson variation then σ²_e is constrained to 1. The assumption of Poisson variation need not hold, and the quasi-likelihood methods in MLwiN allow the constraint to be dropped. When the assumption of Poisson variation is dropped, the excess heterogeneity is known as extra Poisson variation. More work is needed to fit models with extra Poisson variation using the MCMC methods.
8.2.4 Extensions to complex variation at level 1

In Chapter 7 I introduced two methods based on Hastings updating steps that will fit Gaussian models with complex variation. The Hastings update step for the level 1 variance parameters was only added to the general Gibbs sampling algorithm from Chapter 5, and uniform priors were used for these parameters. More work needs to be done on incorporating the Hastings step into the hybrid Metropolis-Gibbs methods of Chapter 5 and on allowing users to specify informative prior distributions for these models. In particular, informative prior distributions for incomplete variance matrices at level 1 will need some more thought.

Simulating level 1 residuals

At present the Hastings algorithm works with the composite level 1 residual X^C_ij e_ij to generate simulated values for the level 1 variance parameters. This has the disadvantage that the r individual level 1 residuals for observation ij, e_rij, cannot then be estimated by the MCMC method. However, the maximum likelihood methods have a procedure for generating residuals which can be used with the MCMC estimates for the fixed effects and the variance parameters. This procedure can be used for now, and more work on finding MCMC methods that simulate the individual residuals can be performed.
Incomplete variance matrices at higher levels

The MLwiN package allows the variance structure at any level to be an incomplete matrix. This can be useful if the random parameters at a higher level are thought to be uncorrelated, as the covariance term can then be set to zero in the model. The quasi-likelihood methods have no problem fitting these models, but these models are outside the general Gaussian framework fitted by the MCMC algorithms in Chapter 5. The second Hastings update method given in Chapter 7 for the problem of complex variation at level 1 dealt with incomplete variance matrices at level 1, and this method could probably also be used at higher levels. The one disadvantage is that the higher level residuals will then not be estimated; instead the composite higher level residuals will be used, as for complex variation at level 1.
8.2.5 Multivariate response models

All the models studied in this thesis have a single response variable, but many problems exist where there are several response variables. A simple approach would be to fit each response variable separately with its own multi-level model. There is often, however, correlation between the response variables, and this correlation needs to be modelled. To model several responses together, multivariate response models need to be fitted.
The MLwiN package uses a clever approach to fitting such models. An extra level is added to the model and the various responses for each observation are considered different observations at this bottom level. This in effect reduces the multivariate response model to a special case of the univariate response model with no variability at level 1. Then, through the use of indicator variables, the multivariate response model can be fitted. For examples of multivariate response models, and more details on how they are fitted using maximum likelihood methods in MLwiN, see Chapter 4 of Goldstein (1995). Work is required to fit these models using MCMC methods in MLwiN.
Bibliography
Bernardo, J. M. and A. F. M. Smith (1994). Bayesian Theory. Chichester:
Wiley.
Box, G. E. P. and M. E. Muller (1958). A Note on the Generation of Random
Normal Deviates. Annals of Mathematical Statistics 29, 610{611.
Box, G. E. P. and G. C. Tiao (1992). Bayesian Inference in Statistical Analysis.
New York: John Wiley.
Breslow, N. E. and D. G. Clayton (1993). Approximate Inference in Generalized
Linear Mixed Models. Journal of the American Statistical Association 88,
9{25.
Brooks, S. P. and G. O. Roberts (1997). Assessing Convergence of Markov
Chain Monte Carlo Algorithms. Unpublished.
Bryk, A. S. and S. W. Raudenbush (1992). Hierarchical Linear Models.
Newbury Park: Sage.
Bryk, A. S., S. W. Raudenbush, M. Seltzer, and R. Congdon (1988). An
Introduction to HLM: Computer Program and User's guide (2.0 ed.).
Chicago: University of Chicago Dept of Education.
Chat�eld, C. (1989). The Analysis of Time Series : An Introduction (4th ed.).
London: Chapman and Hall.
Copas, J. B. and H. G. Li (1997). Inference for Non-Random Samples. Journal
of the Royal Statistical Society, Series B 59, 55{96.
Cowles, M. K. and B. P. Carlin (1996). Markov Chain Monte Carlo Con-
vergence Diagnostics: A Comparative Review. Journal of the American
Statistical Association 91, 883{904.
190
Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum Likelihood
from Incomplete Data via the EM Algorithm (with discussion). Journal of
the Royal Statistical Society, Series B 39, 1{38.
Draper, D. (1995). Inference and Hierarchical Modeling in the Social Sciences.
Journal of Educational and Behavioral Statistics 20, 115{147.
Draper, D. and R. Cheal (1997). Practical MCMC for Assessment and
Propagation of Model Uncertainty. Unpublished.
DuMouchel, W. and C. Waternaux (1992). Hierarchical Model for Combining
Information and for Meta-analyses (Discussion). In J. M. Bernardo, J. O.
Berger, A. P. Dawid, and A. F. M. Smith (Eds.), Bayesian Statistics 4, pp.
338{341. Oxford: Clarendon Press.
Gelfand, A. E., S. E. Hills, A. Racine-Poon, and A. F. M. Smith (1990).
Illustration of Bayesian Inference in Normal Data Models Using Gibbs
Sampling. Journal of the American Statistical Association 85, 972{985.
Gelfand, A. E. and S. K. Sahu (1994). On Markov Chain Monte Carlo
acceleration. Journal of Computational and Graphical Statistics 3, 261{276.
Gelfand, A. E., S. K. Sahu, and B. P. Carlin (1995). E�cient Parameterizations
for normal linear mixed models. Biometrika 82, 479{488.
Gelfand, A. E. and A. F. M. Smith (1990). Sampling Based Approaches
to Calculating Marginal Densities. Journal of the American Statistical
Association 85, 398{409.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (1995). Bayesian Data
Analysis. London: Chapman and Hall.
Gelman, A., G. O. Roberts, and W. R. Gilks (1995). E�cient Metropolis
Jumping Rules. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and
A. F. M. Smith (Eds.), Bayesian Statistics 5, pp. 599{607. Oxford: Oxford
University Press.
Gelman, A. and D. B. Rubin (1992). Inference from Iterative Simulation Using
Multiple Sequences. Statistical Science 7, 457{472.
Geman, S. and D. Geman (1984). Stochastic Relaxation, Gibbs Distributions
and the Bayesian Restoration of Images. IEEE Transactions on Pattern
Analysis and Machine Intelligence 45, 721{741.
191
Geweke, J. (1992). Evaluating the Accuracy of Sampling Based Approaches
to the Calculation of Posterior Moments. In J. M. Bernardo, J. O. Berger,
A. P. Dawid, and A. F. M. Smith (Eds.), Bayesian Statistics 4, pp. 169{193.
Oxford: Oxford University Press.
Gilks, W. R. (1995). Full Conditional Distributions. In W R Gilks and S
Richardson and D J Spiegelhalter (Ed.), Markov Chain Monte Carlo in
Practice. London: Chapman and Hall.
Gilks, W. R., G. O. Roberts, and S. K. Sahu (1996). Adaptive Markov
Chain Monte Carlo. Research report 20, Statistics Laboratory, University
of Cambridge.
Gilks, W. R. and P. Wild (1992). Adaptive Rejection Sampling for Gibbs Sampling. Journal of the Royal Statistical Society, Series C 41, 337-348.
Goldstein, H. (1986). Multilevel mixed linear model analysis using iterative generalised least squares. Biometrika 73, 43-56.
Goldstein, H. (1989). Restricted unbiased iterative generalised least squares estimation. Biometrika 76, 622-623.
Goldstein, H. (1991). Nonlinear Multilevel Models With an Application to Discrete Response Data. Biometrika 78, 45-51.
Goldstein, H. (1995). Multilevel Statistical Models (2nd ed.). London: Edward Arnold.
Goldstein, H. and J. Rasbash (1996). Improved Approximations for Multilevel Models with Binary Responses. Journal of the Royal Statistical Society, Series A 159, 505-513.
Goldstein, H., J. Rasbash, I. Plewis, D. Draper, W. Browne, M. Yang,
G. Woodhouse, and M. Healy (1998). A user's guide to MLwiN (1.0 ed.).
London: Institute of Education.
Goldstein, H. and D. J. Spiegelhalter (1996). League Tables and Their Limitations: Statistical Issues in Comparisons of Institutional Performance. Journal of the Royal Statistical Society, Series A 159, 385-409.
Hastings, W. K. (1970). Monte Carlo Sampling Methods using Markov Chains and their Applications. Biometrika 57, 97-109.
Heath, A., M. Yang, and H. Goldstein (1996). Multilevel Analysis of the Changing Relationship between Class and Party in Britain, 1964-1992. Quality and Quantity 30, 389-404.
Hills, S. E. and A. F. M. Smith (1992). Parameterization Issues in Bayesian Inference. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith (Eds.), Bayesian Statistics 4, pp. 227-246. Oxford: Oxford University Press.
Kreft, I. G. G., J. de Leeuw, and R. van der Leeden (1994). Review of Five Multilevel Analysis Programs: BMDP-5V, GENMOD, HLM, ML2, and VARCL. The American Statistician 48, 324-335.
Laird, N. M. (1978). Empirical Bayes Methods for Two-Way Contingency Tables. Biometrika 65, 581-590.
Longford, N. T. (1987). A Fast Scoring Algorithm for Maximum Likelihood Estimation in Unbalanced Mixed Models with Nested Random Effects. Biometrika 74, 817-827.
Longford, N. T. (1988). VARCL - software for variance components analysis of data with hierarchically nested random effects (maximum likelihood) (1.0 ed.). Princeton, NJ: Educational Testing Service.
MacEachern, S. N. and L. M. Berliner (1994). Subsampling the Gibbs Sampler. The American Statistician 48, 188-190.
McCullagh, P. and J. A. Nelder (1983). Generalized Linear Models. London:
Chapman and Hall.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller (1953). Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics 21, 1087-1092.
Müller, P. (1993). A generic approach to posterior integration and Gibbs sampling. Technical report, ISDS, Duke University.
Raftery, A. E. and S. M. Lewis (1992). How Many Iterations in the Gibbs Sampler? In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith (Eds.), Bayesian Statistics 4, pp. 763-773. Oxford: Oxford University Press.
Rasbash, J. and G. Woodhouse (1995). MLn: Command Reference Guide (1.0
ed.). London: Institute of Education.
Ripley, B. D. (1987). Stochastic Simulation. New York: Wiley.
Rodriguez, G. and N. Goldman (1995). An Assessment of Estimation Procedures for Multilevel Models with Binary Responses. Journal of the Royal Statistical Society, Series A 158, 73-89.
Seltzer, M. H. (1993). Sensitivity Analysis for Fixed Effects in the Hierarchical Model: A Gibbs Sampling Approach. Journal of Educational Statistics 18, 207-235.
Seltzer, M. H., W. H. Wong, and A. S. Bryk (1996). Bayesian Analysis in Applications of Hierarchical Models: Issues and Methods. Journal of Educational and Behavioral Statistics 21, 131-167.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis.
London: Chapman and Hall.
Spiegelhalter, D. J., A. Thomas, N. G. Best, and W. R. Gilks (1994). BUGS:
Bayesian inference using Gibbs sampling. Version 0.30a. Technical report,
MRC Biostatistics Unit, Cambridge.
Spiegelhalter, D. J., A. Thomas, N. G. Best, and W. R. Gilks (1995). BUGS:
Bayesian inference using Gibbs sampling. Version 0.50. Technical report,
MRC Biostatistics Unit, Cambridge.
Stiratelli, R., N. M. Laird, and J. Ware (1984). Random Effects Models for Serial Observations with Binary Responses. Biometrics 40, 961-971.
Woodhouse, G., J. Rasbash, H. Goldstein, and M. Yang (1995). Introduction to Multilevel Modelling. In G. Woodhouse (Ed.), A Guide to MLn for New Users. London: Institute of Education.
Zeger, S. L. and M. R. Karim (1991). Generalized Linear Models with Random Effects: a Gibbs Sampling Approach. Journal of the American Statistical Association 86, 79-86.